A Dissertation for the Degree of Doctor of Philosophy
Two Stage Dantzig Selector for
High Dimensional Data
고차원 자료를 위한 이단계 단치그 셀렉터
February 2014
Graduate School, Seoul National University
Department of Statistics
Sangmi Han
Two Stage Dantzig Selector for
High Dimensional Data
Advisor: Professor 김용대

Submitting this dissertation for the degree of Doctor of Philosophy
October 2013
Graduate School, Seoul National University
Department of Statistics
Sangmi Han

Confirming the Doctor of Philosophy dissertation written by Sangmi Han
December 2013

Chair: 박병욱 (Seal)
Vice Chair: 김용대 (Seal)
Member: 임요한 (Seal)
Member: 장원철 (Seal)
Member: 권성훈 (Seal)
Two Stage Dantzig Selector for
High Dimensional Data
by
Sangmi Han
A Thesis
Submitted in fulfillment of the requirements
for the degree of
Doctor of Philosophy
in Statistics
Department of Statistics
College of Natural Sciences
Seoul National University
February 2014
Abstract
Variable selection is important in high dimensional regression. Traditional
variable selection methods such as stepwise selection are unstable, in the sense
that the set of selected variables varies from one data set to another. As an
alternative, a series of penalized methods perform estimation and variable
selection simultaneously. The LASSO yields a sparse solution, but it is biased
and not selection consistent. Nonconvex penalized methods such as the SCAD
and the MCP are known to be selection consistent and to yield unbiased
estimators. However, they suffer from multiple local minima and their
computation is unstable with respect to the tuning parameter. Two stage
methods based on the LASSO, such as the one step LLA and the calibrated
CCCP, have been developed to obtain the oracle estimator as the unique local
minimum.
We propose a two stage method based on the Dantzig selector. The motivation
is that lessening the effect of the noise variables is important in the two stage
method. The ℓ1 norm of the Dantzig selector is always less than or equal to
that of the LASSO, and the non-asymptotic error bounds of the Dantzig
selector tend to be smaller than those of the LASSO for the same tuning
parameter. Therefore we expect improved estimation when the Dantzig
selector is used instead of the LASSO in the two stage method, while the
proposed method still satisfies selection consistency. The results of the
numerical experiments support this contention.
We also apply these two stage methods, based on either the LASSO or the
Dantzig selector, to the estimation of the inverse covariance matrix (a.k.a. the
precision matrix). Precision matrix estimation is essential not only because it
is used in various applications but also because, under the normality
assumption, it describes the direct relationship between variables via their
conditional dependence. Under some regularity conditions our methods achieve
selection consistency and obtain columnwise √n-consistent estimators of the
true nonzero precision matrix elements. The numerical analyses show that the
proposed methods perform well in terms of variable selection and estimation.
Keywords: High dimensional regression, variable selection, Dantzig selector,
selection consistency, oracle estimator, inverse covariance matrix estimation
Student Number: 2007-20263
Contents
Abstract i
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Two Stage Dantzig Selector for High Dimensional Linear Regression 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Sparse regularization methods . . . . . . . . . . . . . . . . . . . 8
2.2.1 The ℓ1 regularization . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Nonconvex penalized methods . . . . . . . . . . . . . . . 13
2.2.3 Two stage methods . . . . . . . . . . . . . . . . . . . . . 21
2.3 Two Stage Dantzig Selector . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.3 Theoretical properties . . . . . . . . . . . . . . . . . . . 29
2.3.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.5 Tuning regularization parameter . . . . . . . . . . . . . . 37
2.4 Numerical analyses . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . 44
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Two Stage Methods for Precision Matrix Estimation 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Estimation of precision matrix via columnwise two-stage methods 49
3.2.1 Two stage method based on LASSO . . . . . . . . . . . 50
3.2.2 Two stage Dantzig selector . . . . . . . . . . . . . . . . . 53
3.2.3 Theoretical results . . . . . . . . . . . . . . . . . . . . . 55
3.3 Numerical analyses . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . 83
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4 Concluding remarks 88
5 Appendix 90
5.1 Algorithms for Dantzig selector . . . . . . . . . . . . . . . . . . 90
5.1.1 Primal-dual interior point algorithm (Candes and Romberg,
2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.2 DASSO (James et al., 2009) . . . . . . . . . . . . . . . . 96
5.1.3 Alternating direction method (ADM) (Lu et al., 2012) . 100
Abstract (in Korean) 111
Acknowledgements (in Korean) 114
List of Tables
2.1 Example 1 (R=0.3) . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2 Example 1 (R=0.5) . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 Example 2 (R=0.3) . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Example 2 (R=0.5) . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5 Real Data (TRIM) . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Example 1 (p=100, q=99) . . . . . . . . . . . . . . . . . . . . . 68
3.2 ‖Ω −Ω∗‖ of Example 1 (p=100, q=99) . . . . . . . . . . . . . 70
3.3 Example 1 (p=200, q=199) . . . . . . . . . . . . . . . . . . . . 72
3.4 ‖Ω −Ω∗‖ of Example 1 (p=200, q=199) . . . . . . . . . . . . . 74
3.5 Example 2 (p=100, q=59) . . . . . . . . . . . . . . . . . . . . . 76
3.6 ‖Ω −Ω∗‖ of Example 2 (p=100, q=59) . . . . . . . . . . . . . 78
3.7 Example 2 (p=200, q=92) . . . . . . . . . . . . . . . . . . . . . 80
3.8 ‖Ω −Ω∗‖ of Example 2 (p=200, q=92) . . . . . . . . . . . . . 82
3.9 Real Data (Breast Cancer) . . . . . . . . . . . . . . . . . . . . . 86
List of Figures
2.1 LASSO and Dantzig selector . . . . . . . . . . . . . . . . . . . 10
2.2 Penalized method . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 LLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 CCCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Nonconvex penalties . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Adaptive DS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 ROC curve of Example 1 (p=100, q=99) . . . . . . . . . . . . . 67
3.2 ‖Ω −Ω∗‖ of Example 1 (p=100, q=99) . . . . . . . . . . . . . 69
3.3 ROC curve of Example 1 (p=200, q=199) . . . . . . . . . . . . 71
3.4 ‖Ω −Ω∗‖ of Example 1 (p=200, q=199) . . . . . . . . . . . . . 73
3.5 ROC curve of Example 2 (p=100, q=59) . . . . . . . . . . . . . 75
3.6 ‖Ω −Ω∗‖ of Example 2 (p=100, q=59) . . . . . . . . . . . . . 77
3.7 ROC curve of Example 2 (p=200, q=92) . . . . . . . . . . . . . 79
3.8 ‖Ω −Ω∗‖ of Example 2 (p=200, q=92) . . . . . . . . . . . . . 81
Chapter 1
Introduction
1.1 Overview
High dimensional data analysis has received much attention due to advances
in data collection technologies. High dimensional data, where the number of
covariates exceeds the number of observations, arise in various fields such as
genomics, neuroscience, economics, finance, and machine learning. Variable
selection is fundamental to data analysis because it identifies significant
variables among many covariates, and in high dimensions its importance only
grows. There are two approaches to variable selection: subset selection and
sparse regularization. In high dimensions, subset selection methods such as
best subset selection are computationally demanding and unstable, and their
sampling properties are hard to derive (Breiman, 1996).
To handle these drawbacks, many sparse regularization methods have been
proposed. These methods can select significant variables and estimate coeffi-
cients simultaneously. Two major approaches in sparse regularization methods
are ℓ1 regularization, including the LASSO (Tibshirani, 1996) and the Dantzig
selector (Candes and Tao, 2007), and nonconvex penalization, including the
SCAD (Fan and Li, 2001) and the MCP (Zhang, 2010). For selection consistency,
ℓ1 regularization methods need stringent conditions such as the strong
irrepresentable condition (Zhao and Yu, 2006).
On the other hand, nonconvex penalized methods not only avoid such
conditions for selection consistency but also reduce the innate bias of ℓ1
regularization methods. Despite these good properties, nonconvex penalized
methods suffer from multiple local minima, and the converged solution is not
guaranteed to be the oracle estimator. As an alternative, two stage methods
based on the LASSO, such as the one step LLA and the calibrated CCCP
algorithms, have been proposed to obtain the oracle estimator.
In this thesis, we deal with regularization methods for the high dimensional
linear regression model. We focus on developing a new regularization method
that obtains the oracle estimator. We propose a two stage method based on the
Dantzig selector. Our method can improve variable selection and estimation by
deleting noise variables more efficiently, using the Dantzig selector instead of
the LASSO.
We also deal with precision matrix estimation as an application of high
dimensional linear regression. For sparse precision matrix estimation, many
regularization methods have been considered. Most of them belong to one of
two frameworks: the maximum likelihood approach and the regression based
approach. We apply the two stage methods based on the LASSO or the Dantzig
selector to the regression based approach and show that they can obtain the
columnwise oracle estimator of the precision matrix. Numerical results show
that our proposed methods are superior to other regularized methods in terms
of support recovery and estimation of the precision matrix.
1.2 Outline of the thesis
The thesis is organized as follows. In chapter 2, we deal with high dimensional
linear regression. We review diverse sparse regularization methods for high
dimensional linear regression and propose two stage Dantzig selector. Theo-
retical properties and algorithm for two stage Dantzig selector are provided,
and we compare our method to other methods in numerical analyses. In chap-
ter 3, precision matrix estimation using regularization methods is considered.
We review existing regularization methods and propose new methods which
utilize the two stage methods based on LASSO or Dantzig selector. We prove
theoretical properties of two stage methods, and numerical analyses are con-
ducted. Concluding remarks follow in chapter 4. In the Appendix, algorithms
for the Dantzig selector are summarized.
Chapter 2
Two Stage Dantzig Selector for
High Dimensional Linear
Regression
2.1 Introduction
Variable selection is essential for linear regression analysis. There are two ap-
proaches for variable selection, which are subset selection and regularization.
Subset selection selects a subset of covariates and uses only the selected
covariates to fit the model. Popular examples of subset selection are best
subset selection, forward selection, backward elimination, and stepwise
selection. In high dimensions, these subset selection methods are
computationally demanding and unstable. Furthermore, their sampling
properties are hard to
derive (see Breiman (1996) for more discussions).
To deal with these drawbacks, many sparse regularization methods have
been proposed which can select the important variables and estimate the ef-
fect of covariates on the response simultaneously. The `1 regularization meth-
ods and the nonconvex penalized methods are two mainstreams of regularized
estimators for high dimensional regression model. The least absolute shrinkage
and selection operator (LASSO) (Tibshirani, 1996) and the Dantzig selector
(Candes and Tao, 2007) are two representative examples of the ℓ1 regulariza-
tion. They are easy to compute and have good estimation properties (Bickel
et al., 2009; Raskutti et al., 2011). However, they are intrinsically biased and
achieve selection consistency only under stringent conditions such as the
irrepresentable condition (Zhao and Yu, 2006).
On the other hand, nonconvex penalized methods such as the smoothly
clipped absolute deviation (SCAD) (Fan and Li, 2001), and the minimax con-
cave penalty (MCP) (Zhang, 2010) can have unbiasedness and selection consis-
tency, simultaneously. The most fascinating property of nonconvex penalized
methods is the oracle property. The oracle property means that covariates
are selected consistently and the efficiency of the estimator is equivalent to
the least square estimator obtained with knowing true nonzero coefficients in
advance (Fan and Li, 2001; Kim et al., 2008; Kim and Kwon, 2012).
However, due to their nonconvexity, there can be many local minima in the
corresponding objective function. Therefore, it is not guaranteed for an ob-
tained estimator to be the oracle estimator. Even though some previous works
(Kim and Kwon, 2012; Zhang, 2010) showed that the objective function with
a nonconvex penalty can have a unique local minimizer under some regularity
conditions, its computation may be demanding for high dimensional models.
Typically, optimization problems corresponding to nonconvex penalized meth-
ods are solved by iterative algorithms, including the concave convex procedure
(CCCP) (Kim et al., 2008) and the local linear approximation (LLA) (Zou and
Li, 2008), where the nonconvex objective function is approximated by a locally
linear function. However, it takes a significant amount of time for algorithms
to converge, and typically the nonconvex penalized methods suffer from instability
in tuning the regularization parameter. Furthermore, these algorithms only
assure the convergence to a local minimum which is not necessarily the oracle
estimator (Wang et al., 2013).
Two stage methods based on LASSO are proposed to obtain the oracle
estimator such as the one step LLA (Zou and Li, 2008; Fan et al., 2012) and
the calibrated CCCP (Wang et al., 2013). The main idea of these methods is
to obtain the oracle estimator by solving the LASSO problem twice.
In this chapter, we propose a two stage method based on the Dantzig selector
to obtain the oracle estimator, which we call the two stage Dantzig selector.
The Dantzig selector used in our method can improve variable selection and
estimation by lessening the effects of noise variables more efficiently than the
LASSO. We prove that the two stage Dantzig selector obtains the oracle estimator under
regularity conditions. The proposed method can be easily implemented by
general algorithms for the standard Dantzig selector. Numerical results show
that our proposed method outperforms other sparse regularization methods
with respect to variable selection and estimation.
2.2 Sparse regularization methods
In this section, we review various sparse regularization methods for high di-
mensional linear regression. Consider the linear regression model
y = Xβ + ε,
where y is an n × 1 response vector, X = (x1, . . . , xn)^T = (X1, . . . , Xp) is an
n × p covariate matrix with Xj ∈ R^n and xi ∈ R^p, β is a p × 1 vector of
unknown regression coefficients, and ε is an n × 1 vector of random errors. In
high dimensions, the ordinary least squares estimator is not uniquely defined and the
traditional variable selection methods such as best subset selection and stepwise
selection based on AIC or BIC criteria are computationally intensive. Further-
more, they are unstable and their sampling properties are hard to derive (Breiman,
1996). As an alternative, sparse regularization methods are used to estimate
coefficients and select variables. There are two mainstreams of regularization
methods: ℓ1 regularization and nonconvex penalization.
2.2.1 The ℓ1 regularization

The ℓ1 regularization achieves sparsity, which means that the estimator produces
exactly zero coefficients and hence reduces model complexity. The LASSO
(Tibshirani, 1996) and the Dantzig selector (Candes and Tao, 2007) are two
representative examples of ℓ1 constrained methods. The LASSO estimator
β̂LASSO is defined as the solution of
$$\min_{\beta}\ \Big\{ \|y - X\beta\|^2/2n + \lambda\|\beta\|_1 \Big\},$$
or equivalently,
$$\min_{\beta}\ \|y - X\beta\|^2 \quad \text{subject to} \quad \|\beta\|_1 \le t,$$
where ‖a‖ = (∑_{i=1}^n a_i^2)^{1/2} and ‖a‖1 = ∑_{i=1}^n |a_i|.
[Figure 2.1: LASSO and Dantzig selector. Two panels ("LASSO", "Dantzig Selector") plot β2 against β1, marking the true coefficient and the OLS estimate.]
The Dantzig selector β̂Dantzig is defined similarly to the LASSO estimator by
$$\min_{\beta}\ \|\beta\|_1 \quad \text{subject to} \quad \Big\|\tfrac{1}{n} X^T (y - X\beta)\Big\|_\infty \le \lambda,$$
or equivalently,
$$\min_{\beta}\ \|X^T (y - X\beta)\|_\infty \quad \text{subject to} \quad \|\beta\|_1 \le t,$$
where ‖a‖∞ = maxi |ai| for a ∈ Rn.
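Both forms above are linear programs in β. The following sketch is our own illustration (not part of the thesis; the function name dantzig_selector and the synthetic data are assumptions) of how the first form can be solved with an off-the-shelf LP solver by splitting β into positive and negative parts.

```python
# A minimal sketch of the Dantzig selector as a linear program:
# minimize ||beta||_1 subject to ||X^T (y - X beta)||_inf / n <= lam.
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    n, p = X.shape
    G = X.T @ X / n                          # Gram matrix (1/n) X^T X
    c = X.T @ y / n                          # correlations (1/n) X^T y
    obj = np.ones(2 * p)                     # beta = u - v with u, v >= 0
    A_ub = np.vstack([np.hstack([ G, -G]),   #  G(u - v) <= c + lam
                      np.hstack([-G,  G])])  # -G(u - v) <= -c + lam
    b_ub = np.concatenate([c + lam, -c + lam])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

# toy usage: three signal variables out of fifty (illustrative data)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
y = X[:, :3] @ np.array([3.0, 1.5, 2.0]) + rng.standard_normal(100)
beta_hat = dantzig_selector(X, y, lam=0.2)
```

Specialized algorithms such as DASSO or the primal-dual interior point method (see the Appendix) are preferable for large p; the LP form above is only meant to make the definition concrete.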
As shown in Figure 2.1, the solution of the LASSO occurs at the point of
contact between the dotted ellipsoid and the solid diamond, whereas the solution
of the Dantzig selector occurs at the point of contact between the dotted
parallelogram and the diamond. Hence exactly zero elements of the solution can
be obtained. The dotted ellipsoid is the set of points at the same distance
(β − βols)^T X^T X(β − βols) from the ordinary least squares estimator βols, and
the dotted parallelogram is the set of points with the same value of
‖X^T X(β − βols)‖∞.
The penalized form of the LASSO and the definition of the Dantzig selector
are related: the LASSO estimate is always in the constraint set (feasible set)
of the Dantzig selector. The Karush-Kuhn-Tucker conditions for the Lagrangian
form of the LASSO are given by
$$\tfrac{1}{n}X_j^T (y - X\hat\beta) = \lambda\,\mathrm{sign}(\hat\beta_j) \ \ \text{for } |\hat\beta_j| > 0, \qquad \big|\tfrac{1}{n}X_j^T (y - X\hat\beta)\big| \le \lambda \ \ \text{for } \hat\beta_j = 0.$$
Therefore ‖(1/n)X^T(y − Xβ̂LASSO(λ))‖∞ ≤ λ, and hence ‖β̂Dantzig(λ)‖1 ≤ ‖β̂LASSO(λ)‖1.
The Dantzig selector and the LASSO share some similarities. They yield the
same solution path under suitable conditions on the design matrix. Meinshausen
et al. (2007) proved that the LASSO and the Dantzig selector share the identical
solution path under the diagonal dominance condition, which means that
M_{jj} > ∑_{i≠j} |M_{ij}| for all j = 1, . . . , p, where M = (X^T X)^{-1}. James et al.
(2009) showed the equivalence of the LASSO and the Dantzig selector under a
condition which is similar to the irrepresentable condition (IC) (Zhao and Yu,
2006). This condition is that ‖X^T X_{A(λ)} u‖∞ ≤ 1 for u = (X_{A(λ)}^T X_{A(λ)})^{-1} 1,
with a tuning parameter λ and the active set A(λ) = {j : β̂j(λ) ≠ 0}.
The Dantzig selector and the LASSO estimator can achieve the minimax
optimal error bound (Raskutti et al., 2011; Bickel et al., 2009). Raskutti et al.
(2011) showed that the minimax optimal convergence rate of the ℓ2-error is
O(√(s log p/n)) under some regularity conditions. Bickel et al. (2009) proved
similar prediction error rates for the LASSO and the Dantzig selector and their
asymptotic equivalence under the restricted eigenvalue condition.
Not only the theoretical properties, but also the algorithms for the LASSO
and the Dantzig selector are comparable. Similar to the LARS (Efron et al.,
2004), which is an efficient algorithm for the LASSO estimator giving piece-
wise linear path, the DASSO (James et al., 2009) algorithm gives a piecewise
linear solution path. These algorithms will be summarized and compared in
the Appendix.
Despite their good asymptotic properties and efficient algorithms, the LASSO
and the Dantzig selector have some limitations. First, they are biased: since the
same amount of shrinkage is enforced on all nonzero coefficients, they cannot
achieve unbiasedness. Second, the LASSO and the Dantzig selector rarely achieve
model selection consistency. Selection consistency hinges on conditions on the
correlation structure of the covariates, such as the ICs (Zhao and Yu, 2006) and
the coherence property (Candes and
Plan, 2009). Zhao and Yu (2006) proved the weak oracle property of the LASSO
estimator under the ICs. Gai et al. (2013) proved the weak oracle property of
the Dantzig selector under the modified ICs related to KKT conditions of
Dantzig selector, which are more complex than the ICs of the LASSO. Those
ICs mean that the regression coefficients of the inactive variables on s active
variables should be uniformly bounded by a constant less than or equal to one.
As we can see in the simulation results of Zhao and Yu (2006), these ICs are
too strict especially in high dimensions. Hence the LASSO and the Dantzig
selector cannot have the selection consistency in most cases.
2.2.2 Nonconvex penalized methods
Nonconvex penalized methods can be good alternatives to the ℓ1 regularized es-
timators since they have selection consistency and unbiasedness. A nonconvex
penalized least square estimator is defined as the minimizer of Qλ(β) where
Qλ(β) = ||y −Xβ||2/2n+ Pλ(|β|)
and Pλ(|β|) is a nonconvex penalty including bridge estimator (Frank and
Friedman, 1993), the SCAD (Fan and Li, 2001), and the MCP (Zhang, 2010).
The bridge penalty is defined as Pλ(β) = λ∑_{j=1}^{p} |βj|^q, 0 < q < 1.
[Figure 2.2: Penalized method. Penalty functions Pλ(β) plotted against β for the LASSO, MCP, SCAD, and bridge penalties.]
The penalty function of the SCAD is defined as
$$P_\lambda(\beta) = \sum_{j=1}^{p}\Big[\, \lambda|\beta_j|\, I(|\beta_j| < \lambda) + \Big\{ \frac{a\lambda(|\beta_j|-\lambda) - (\beta_j^2-\lambda^2)/2}{a-1} + \lambda^2 \Big\} I(\lambda \le |\beta_j| < a\lambda) + \Big\{ \frac{(a-1)\lambda^2}{2} + \lambda^2 \Big\} I(|\beta_j| \ge a\lambda) \Big].$$
Zhang (2010) proposed the MCP with
$$P_\lambda(\beta) = \sum_{j=1}^{p}\Big[ \big(\lambda|\beta_j| - \beta_j^2/2a\big)\, I(|\beta_j| \le a\lambda) + \frac{1}{2}a\lambda^2\, I(|\beta_j| > a\lambda) \Big].$$
Figure 2.2 shows those nonconvex penalty functions and the LASSO penalty.
The SCAD and the MCP estimators satisfy the good properties of a penalized
estimator, namely unbiasedness, sparsity, and continuity, as introduced by
Fan and Li (2001), while the bridge estimator lacks continuity and the LASSO
lacks unbiasedness.
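For concreteness, here is a small numerical sketch (ours, not from the thesis) of the SCAD and MCP penalty values defined above; the default values of a are only conventional illustrations (the simulations in Section 2.4 use a = 2.1 for the SCAD and a = 1.5 for the MCP).

```python
# Elementwise SCAD and MCP penalties as defined above (illustrative sketch).
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    t = np.abs(beta)
    small = lam * t                                                  # |b| < lam
    mid = (a * lam * (t - lam) - (t**2 - lam**2) / 2) / (a - 1) + lam**2
    big = (a - 1) * lam**2 / 2 + lam**2                              # |b| >= a*lam
    return np.where(t < lam, small, np.where(t < a * lam, mid, big))

def mcp_penalty(beta, lam, a=3.0):
    t = np.abs(beta)
    return np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2)
```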
For identifying unknown signal variables, nonconvex penalized methods
have received great attention recently because they can achieve the model se-
lection consistency without stringent conditions such as ICs. Instead, they need
weaker conditions on the design matrix, such as the sparse Riesz condition (Zhang
and Huang, 2008) and a positive minimum eigenvalue of the submatrix of X^T X
that uses only the signal covariates. Let the true coefficient vector β∗ = (β∗1^T, 0^T)^T
be such that the first s regression coefficients β∗1 are nonzero and the others are
zero. The oracle estimator β^{(o)} is defined as (β^{(o)T}_1, 0^T)^T, where
β^{(o)}_1 = (X_1^T X_1)^{-1} X_1^T y, X = (X_1, X_2), and X_1 and X_2 are the n × s and
n × (p − s) submatrices of X. Assume √n(β^{(o)}_1 − β∗_1) →d N_s(0, Σ). An estimator
β̂ = (β̂_1^T, β̂_2^T)^T is said to have the oracle property if
$$\lim_{n\to\infty}\Pr\big[\{j : \hat\beta_j \ne 0\} = \{1, \ldots, s\}\big] = 1 \quad \text{and} \quad \sqrt{n}(\hat\beta_1 - \beta_1^*) \overset{d}{\to} N_s(0, \Sigma).$$
Many previous works showed that various non-convex penalized methods have
the oracle property (Fan and Li, 2001; Kim et al., 2008; Huang et al., 2008a;
Zhang, 2010; Kim and Kwon, 2012).
There are three types of oracle properties - weak, global, and strong oracle
properties. The weak oracle property is that there exists a sequence of λn such
that one of the local minimizers of Qλn(β) is the oracle estimator. Fan and Li
(2001) and Kim et al. (2008) proved the weak oracle property of the SCAD
for p ≤ n and p > n, respectively. The global oracle property is that there
exists a sequence of λn such that the global minimizer β(λn) of Qλn(β) has
the oracle property. Huang et al. (2008a) proved the global oracle property of
bridge estimator and Kim et al. (2008) proved that of the SCAD for p ≤ n.
The strong oracle property means that there exists a sequence of λn such that
the unique local minimizer of Qλn(β) is the oracle estimator. The SCAD and
the MCP can obtain the oracle estimator as a unique local optimizer with
probability tending to one (Kim and Kwon, 2012; Zhang, 2010).
Computing the global solution of the nonconvex penalized methods is infea-
sible in the high dimensional setting, and optimizing nondifferentiable, nonconvex
functions is challenging. Instead, iterative algorithms are used which locally
approximate the nonconvex penalized objective by a convex function and solve
the resulting convex optimization. Local quadratic approximation (LQA) (Fan and Li, 2001), local
linear approximation (LLA) (Zou and Li, 2008), concave convex procedure
(CCCP) (Kim et al., 2008) are invented to get a nonconvex penalized likeli-
hood estimate. The LQA uses the second order approximation of the penalty
as follows:
$$[P_\lambda(|\beta_j|)]' = P'_\lambda(|\beta_j|)\,\mathrm{sign}(\beta_j) \approx \frac{P'_\lambda(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\beta_j,$$
$$P_\lambda(|\beta_j|) \approx P_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{P'_\lambda(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\big(\beta_j^2 - |\beta_j^{(0)}|^2\big) \quad \text{for } \beta_j \approx \beta_j^{(0)}.$$
The LQA algorithm updates the solution as follows until convergence:
$$\beta^{(k+1)} = \arg\min_{\beta}\Big\{ \frac{1}{2n}\|y - X\beta\|^2 + \frac{1}{2}\sum_{j=1}^{p} \frac{P'_\lambda(|\beta_j^{(k)}|)}{|\beta_j^{(k)}|}\,\beta_j^2 \Big\}.$$
To avoid numerical instability, when |β_j^{(k)}| < ε0 (a prespecified value), we set
β̂_j = 0 and delete the jth column of X from the iteration. In every iteration, the
solution is
$$\beta^{(k+1)} = \big\{X^T X + n\,\Sigma_\lambda(\beta^{(k)})\big\}^{-1} X^T y,$$
where Σλ(β^{(k)}) = diag{P'_λ(|β_1^{(k)}|)/|β_1^{(k)}|, . . . , P'_λ(|β_p^{(k)}|)/|β_p^{(k)}|} for
k = 0, 1, 2, . . ..
Since the LQA removes variables with small coefficient magnitudes, once β̂_j is
removed from the model it is permanently excluded; hence the choice of ε0
significantly affects the sparsity of the solution and the speed of convergence.
To relieve this problem, instead of removing variables, a perturbation τ0 in the
denominator can be considered:
$$\beta^{(k+1)} = \arg\min_{\beta}\Big\{ \frac{1}{2n}\|y - X\beta\|^2 + \frac{1}{2}\sum_{j=1}^{p} \frac{P'_\lambda(|\beta_j^{(k)}|)}{|\beta_j^{(k)}| + \tau_0}\,\beta_j^2 \Big\}.$$
However, τ0 plays a similar role to ε0.
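A sketch (ours) of the perturbed LQA iteration just described; p_deriv stands for any penalty derivative P'_λ, such as the SCAD or MCP derivatives given later in this section, and the initial value is only a placeholder (in high dimensions a regularized initial estimate would typically be used instead).

```python
# Perturbed LQA: iterate the ridge-type closed form with weights
# d_j = P'_lambda(|beta_j|) / (|beta_j| + tau0).
import numpy as np

def lqa(X, y, p_deriv, lam, tau0=1e-6, n_iter=50):
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # simple initial value (sketch only)
    for _ in range(n_iter):
        d = p_deriv(np.abs(beta), lam) / (np.abs(beta) + tau0)
        # closed form update: (X'X + n * diag(d))^{-1} X'y
        beta = np.linalg.solve(X.T @ X + n * np.diag(d), X.T @ y)
    return beta
```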
[Figure 2.3: LLA. Two panels, "SCAD" and "MCP", plot the penalty Pλ(β) against β together with its local linear approximation.]
The CCCP and LLA can make up for LQA’s shortcomings and they can
be easily implemented by the algorithms for LASSO. The LLA algorithm is
defined as follows. For k = 1, 2, . . ., until it converges, repeat the following
optimization problem:
$$\beta^{(k+1)} = \arg\min_{\beta}\Big\{ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p} P'_\lambda(|\beta_j^{(k)}|)\,|\beta_j| \Big\}.$$
The LLA is the first order approximation of the nonconvex penalty function.
Figure 2.3 shows the LLA of the SCAD and the MCP. The CCCP decomposes the
nonconvex penalty into a LASSO penalty and a concave penalty; the concave part
is approximated by a tight local linear function.
[Figure 2.4: CCCP. Two panels, "SCAD" and "MCP", plot the decomposition of the penalty Pλ(β) against β.]
The decompositions of the nonconvex penalty functions of the SCAD and the MCP
are represented in Figure 2.4. The CCCP algorithm iteratively minimizes
Q(β|β^{(k)}, λ) until convergence, where
$$Q(\beta\,|\,\beta^{(k)}, \lambda) = \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p}\nabla J_\lambda(|\beta_j^{(k)}|)\,\beta_j + \lambda\sum_{j=1}^{p}|\beta_j|,$$
with Pλ(|βj|) = Jλ(|βj|) + λ|βj| and ∇Jλ(t) = dJλ(t)/dt.
Since the CCCP and the LLA algorithms use the first order derivative of
the nonconvex penalty, the class of nonconvex penalties to which these
algorithms apply consists of Pλ(|t|) = Pa,λ(|t|), defined on t ∈ (−∞,∞), satisfying
(P1) Pλ(t) is increasing and concave for t ∈ [0,∞) with continuous derivative
on t ∈ (0,∞) with P ′λ(0) := P ′λ(0+) ≥ a1λ.
(P2) P ′λ(t) ≥ a1λ for t ∈ (0, a2λ)
(P3) P ′λ(t) = 0 for t > aλ > a2λ
for some positive constants a1, a2, and a.
As shown in Figure 2.5, the SCAD and the MCP satisfy the above conditions,
because the derivative of the SCAD penalty is
$$P'_\lambda(t) = \lambda\, I(t \le \lambda) + \frac{(a\lambda - t)_+}{a-1}\, I(t > \lambda), \quad \text{for some } a > 2, \text{ with } a_1 = a_2 = 1,$$
and the derivative of the MCP is
$$P'_\lambda(t) = \Big(\lambda - \frac{t}{a}\Big)_+, \quad \text{for some } a > 1, \text{ with } a_1 = 1 - a^{-1},\ a_2 = 1.$$
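These derivatives translate directly into code. The short sketch below (ours, not the thesis' implementation) is the weight function used by the LLA/CCCP and two stage updates; the defaults for a are again only conventional illustrations.

```python
# SCAD and MCP derivatives P'_lambda(t), used as adaptive weights by LLA / CCCP.
import numpy as np

def scad_deriv(t, lam, a=3.7):
    t = np.abs(t)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1) * (t > lam)

def mcp_deriv(t, lam, a=3.0):
    return np.maximum(lam - np.abs(t) / a, 0.0)
```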
However, due to the nonconvexity of the penalty, multiple local minima can
occur, and the existing algorithms for nonconvex penalized methods only
guarantee convergence to a local minimum, not to the oracle estimator.
Although under some conditions the nonconvex penalized methods yield the
oracle estimator as the unique minimizer (Kim and Kwon, 2012; Zhang, 2010),
direct computation of the global solution is infeasible in high dimensions, and
the computation is unstable, especially with respect to the tuning parameter.
[Figure 2.5: Nonconvex penalties. Left panel: the MCP and SCAD penalty functions Pλ(β) against β. Right panel: their derivatives P'λ(β).]
To deal with these difficulties, one-step algorithms with a good initial estimator
have been proposed (Zou and Li, 2008; Fan et al., 2012; Wang et al., 2013),
and we will call them two stage methods.
2.2.3 Two stage methods
Two stage methods based on the LASSO, such as the one step LLA (Zou and
Li, 2008; Fan et al., 2012) and the calibrated CCCP (Wang et al., 2013), have
been proposed to obtain the oracle estimator. The main idea of these methods is
to obtain the oracle estimator by solving the LASSO problem. Zou and Li (2008)
proved that the one step LLA algorithm can obtain the oracle estimator with a good
initial estimator. They suggested using the maximum likelihood estimator as
an initial estimator for n > p. The one step LLA estimator is defined as
$$\hat\beta(\lambda) = \arg\min_{\beta}\Big\{ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \Big\}.$$
This can be reformulated as
$$\hat\beta(\lambda)_{A^c} = \arg\min_{\beta_{A^c}}\Big\{ \|r_A - X_{A^c}\beta_{A^c}\|^2/2n + \sum_{j\in A^c} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \Big\},$$
$$\hat\beta(\lambda)_{A} = (X_A^T X_A)^{-1} X_A^T\big(y - X_{A^c}\hat\beta(\lambda)_{A^c}\big),$$
where A = A(λ) = {j : |β_j^{init}| > aλ} with a the parameter of the nonconvex
penalty, and r_A = y − X_A β̂(λ)_A.
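A sketch (ours, with illustrative function names) of this computation: variables in A carry zero penalty weight and are profiled out by projection (the equivalent residualized form also used in Section 3.2.1), while the remaining variables are fit by a weighted LASSO after folding the weights into the columns.

```python
# One step LLA via a weighted LASSO: zero-weight (unpenalized) variables in A
# are projected out, the remaining variables are rescaled so a unit-penalty
# LASSO solves the weighted problem, then A is refit by least squares.
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, weights, tol=1e-10):
    """weights[j] = P'_lambda(|beta_init_j|); weight 0 means j is unpenalized."""
    n, p = X.shape
    A = np.where(weights <= tol)[0]
    Ac = np.where(weights > tol)[0]
    beta = np.zeros(p)
    if A.size > 0:                                  # project onto span(X_A)^perp
        H = X[:, A] @ np.linalg.pinv(X[:, A])
        y_res, X_res = y - H @ y, X[:, Ac] - H @ X[:, Ac]
    else:
        y_res, X_res = y, X[:, Ac]
    if Ac.size > 0:
        Xw = X_res / weights[Ac]                    # fold weights into the columns
        fit = Lasso(alpha=1.0, fit_intercept=False).fit(Xw, y_res)
        beta[Ac] = fit.coef_ / weights[Ac]
    if A.size > 0:                                  # least squares on A given A^c
        beta[A] = np.linalg.lstsq(X[:, A], y - X[:, Ac] @ beta[Ac], rcond=None)[0]
    return beta
```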
In the first stage, a good initial estimator which is close to the true coeffi-
cient should be attained. For a regularization parameter λ such that
min{|β∗j| : β∗j ≠ 0, j = 1, . . . , p} > (a + 1)λ and ‖β∗ − βinit‖∞ < λ,
the true signal set and the signal set of the initial estimator are equivalent, i.e.,
A = A0 = {j : β∗j ≠ 0} and P'_λ(|β_j^{init}|) ≈ 0 for j ∈ A0,
and hence β̂(λ)_{A0} = (X_{A0}^T X_{A0})^{-1} X_{A0}^T (y − X_{A0^c} β̂(λ)_{A0^c}). Since estimating β̂(λ)
can be recast as estimating β̂(λ)_{A0^c} and plugging β̂(λ)_{A0^c} into the equation for
β̂(λ)_{A0}, the oracle estimator is obtained when β̂(λ)_{A0^c} = 0. Hence removing
the effect of noise variables is important in the second stage of the two stage
methods.
Fan et al. (2012) suggested the LASSO estimator with a smaller regularization
parameter (λinit ≤ a γ_{LS} s^{-1/2} λ/4) as an initial estimator, with which the
oracle estimator is obtained with high probability, where s is the number of
nonzero coefficients, a is the parameter of the nonconvex penalty, and γ_{LS} is
the restricted eigenvalue defined by
$$\gamma_{LS} = \min_{\delta \ne 0,\ \|\delta_{A_0^c}\|_1 \le 3\|\delta_{A_0}\|_1} \frac{\|X\delta\|}{\sqrt{n}\,\|\delta_{A_0}\|} > 0.$$
Wang et al. (2013) proved that the calibrated CCCP estimator using the
LASSO initial estimator with smaller regularization parameter (λinit = τλ, τ =
o(1), e.g., τ = 1/ log n or λ) can obtain the oracle estimator with high proba-
bility. They remarked that the choice of τ can be related to the number of
signal parameters and the restricted eigenvalue. For a sparse and well
behaved design matrix, τ = 1/ log n can be used. If the true model is not very
sparse (s → ∞) or the design matrix does not behave well (γLS → 0), then
τ = λ can be considered.
Tuning the regularization parameter is a crucial issue in obtaining the oracle
estimator. Wang et al. (2013) proposed the high dimensional BIC (HBIC) for
choosing the regularization parameter of the calibrated CCCP, defined by
$$\mathrm{HBIC}(\lambda) = \log\big(\|y - X\hat\beta(\lambda)\|^2/n\big) + C_n \cdot \frac{\log p}{n}\,|M_\lambda|, \qquad (2.1)$$
where Mλ = {j : β̂j(λ) ≠ 0} and Cn → ∞ (e.g., Cn = log n or log(log n)).
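As a concrete illustration (ours), (2.1) amounts to the following computation for a fitted coefficient vector.

```python
# HBIC of (2.1): log(RSS / n) + C_n * (log p / n) * (model size).
import numpy as np

def hbic(y, X, beta_hat, C_n=None):
    n, p = X.shape
    if C_n is None:
        C_n = np.log(np.log(n))                 # one of the suggested choices for C_n
    rss = np.sum((y - X @ beta_hat) ** 2)
    df = np.count_nonzero(beta_hat)             # |M_lambda|
    return np.log(rss / n) + C_n * np.log(p) / n * df
```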
Furthermore, they proved that the oracle estimator is attained with high
probability by using the tuning parameter chosen by HBIC, under some regu-
larity conditions:
$$\Pr\big(M_{\hat\lambda} = \{j : \beta^*_j \ne 0\}\big) \to 1,$$
where λ̂ = argmin_{λ ∈ {λ : |Mλ| ≤ Kn}} HBIC(λ). This criterion is an extension of the results
of Chen and Chen (2008) and Kim et al. (2012). This result of the selection
consistency can be applied to the methods which satisfy the strong oracle
property.
Corollary 1. Let HBIC(λ) be defined as in (2.1). Assume that the regularity
conditions, which are needed for an estimator β̂(λ) to be the oracle estimator
with probability tending to one, hold, and that there exists a positive constant γ
such that
$$\lim_{n\to\infty} \min_{A \ne A_0,\, |A| \le K_n} n^{-1}\big\|(I_n - H_A)X_{A_0}\beta^*_{A_0}\big\|^2 \ge \gamma,$$
where In denotes the n × n identity matrix and HA denotes the projection
matrix onto the linear space spanned by XA. If Cn → ∞, s Cn log p/n = o(1),
and K_n^2 log p log n/n = o(1), then
$$\Pr\big(M_{\hat\lambda} = \{j : \beta^*_j \ne 0\}\big) \to 1,$$
where Mλ = {j : β̂j(λ) ≠ 0} and λ̂ = argmin_{λ ∈ {λ : |Mλ| ≤ Kn}} HBIC(λ).
Corollary 1 is a generalization of Theorem 3.5 of Wang et al. (2013), and it
follows directly from the proof of that theorem.
2.3 Two Stage Dantzig Selector
2.3.1 Method
We modify the one step LLA for the nonconvex penalized estimator by replacing
the LLA with an adaptive Dantzig selector type estimator:
$$\min_{\beta}\ \sum_{j=1}^{p} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \qquad (2.2)$$
$$\text{subject to}\quad \Big|\tfrac{1}{n} X_j^T (y - X\beta)\Big| \le P'_\lambda(|\beta_j^{init}|), \quad j = 1, \ldots, p,$$
where β_j^{init} is an initial estimate and P'_λ(t) satisfies (P1)-(P3) defined in
Section 2.2. We call this estimator the two stage Dantzig selector
β̂TSDS(λ). The initial estimate β_j^{init} can be the LASSO or the Dantzig
estimate with tuning parameter λinit = λ/ log n or λ².
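Problem (2.2) is again a linear program. The following is a minimal sketch (ours, for illustration only; the thesis itself solves (2.2) by adapting the standard Dantzig selector algorithms of Section 2.3.4 and the Appendix), using the SCAD derivative with a = 2.1 as in the later simulations.

```python
# Two stage Dantzig selector (2.2) as a linear program with weights
# w_j = P'_lambda(|beta_init_j|); w_j = 0 leaves beta_j unpenalized and forces
# the j-th correlation constraint to hold with equality.
import numpy as np
from scipy.optimize import linprog

def scad_deriv(t, lam, a=2.1):
    t = np.abs(t)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1) * (t > lam)

def two_stage_dantzig(X, y, beta_init, lam, a=2.1):
    n, p = X.shape
    w = scad_deriv(beta_init, lam, a)
    G, c = X.T @ X / n, X.T @ y / n
    obj = np.concatenate([w, w])                 # sum_j w_j (u_j + v_j)
    A_ub = np.vstack([np.hstack([ G, -G]),       #  G(u - v) <= c + w
                      np.hstack([-G,  G])])      # -G(u - v) <= -c + w
    b_ub = np.concatenate([c + w, -c + w])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]
```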
2.3.2 Motivation
In order to achieve the oracle estimator, the key to the two stage method is
removing the noise variables in the second stage. For example, consider the
one step LLA with SCAD penalty Pλ and initial estimate βinit. Let
y_{n×1} = X_{n×p}β∗_{p×1} + ε_{n×1}. Consider a λ such that
min{|β∗j| : j ∈ A0} > (a + 1)λ, where A0 = {j : β∗j ≠ 0} and a is the parameter
of the SCAD penalty. Suppose that ‖βinit − β∗‖∞ < λ; then the one step LLA
estimator is defined by
$$\hat\beta(\lambda) = \arg\min_{\beta}\Big\{ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \Big\},$$
or equivalently,
$$\hat\beta(\lambda)_{A_0^c} = \arg\min_{\beta_{A_0^c}}\Big\{ \|r_{A_0} - X_{A_0^c}\beta_{A_0^c}\|^2/2n + \sum_{j\in A_0^c} \lambda\,|\beta_j| \Big\},$$
$$\hat\beta(\lambda)_{A_0} = (X_{A_0}^T X_{A_0})^{-1} X_{A_0}^T\big(y - X_{A_0^c}\hat\beta(\lambda)_{A_0^c}\big),$$
where r_{A_0} = y − X_{A_0}β̂(λ)_{A_0}. Hence the aim of the second stage is deleting the
noise variables and then the one step LLA can acquire the oracle estimator.
So we focus on deleting or lessening the effect of noise variables, and some
properties of the Dantzig selector suggest its superiority over the LASSO in this
respect.
The ℓ1 norm of the Dantzig selector is always less than or equal to that of
the LASSO estimator, because the Dantzig selector is defined as the minimizer
of ‖β‖1 subject to ‖(1/n)X^T(y − Xβ)‖∞ ≤ λ and the LASSO estimator satisfies
this constraint. Recall that the LASSO estimate β̂LS(λ) is always in the feasible
set of the Dantzig selector β̂DS(λ) because of the KKT conditions for the
Lagrangian form of the LASSO in Section 2.1. If there are no signal variables,
then the mean absolute deviation of the Dantzig selector is less than or equal to
that of the LASSO for the same regularization parameter λ:
‖β̂DS(λ) − β∗‖1 ≤ ‖β̂LS(λ) − β∗‖1. Furthermore, according to Bickel et al.
(2009), the non-asymptotic ℓq error bounds of the Dantzig selector are smaller
than those of the LASSO for 1 ≤ q ≤ 2. If the mean squared error (MSE) of the
Dantzig selector is smaller than that of the LASSO in the no-signal setting, then
the two stage methods can be improved with the Dantzig selector in terms of
MSE. Hence we conduct a simulation to check whether the MSE of the Dantzig
selector is less than that of the LASSO estimator in the no-signal setting.
We simulate whether the ℓ2 error of the Dantzig selector tends to be smaller
than that of the LASSO in a no-signal regression setting. Let
y_{100×1} = X_{100×1000}β_{1000×1} + ε_{100×1}, where β = (0, . . . , 0)^T, X ∼ N(0, Σ) with
Σij = R^{|i−j|}, and ε ∼ N(0, I). The simulation is conducted as follows:
1. For 20 tuning parameters, fit the LASSO and the Dantzig estimator and
calculate the MSE.
2. Repeat step 1 100 times and test H1 : MSE_LASSO > MSE_Dantzig.
Let Z = MSE_LASSO − MSE_Dantzig. In the case of R = 0.3, Z = 0.00093,
sd(Z) = 0.00014, and the p-value is 0. For R = 0.5, Z = 0.00122,
sd(Z) = 0.00012, and the p-value is 0. Even in the case of R = 0, Z = 0.0006 and
sd(Z) = 0.0001. Therefore we can conclude that the overall MSE of the Dantzig
selector tends to be smaller than the MSE of the LASSO for the same tuning
parameter. The
two stage method with the Dantzig selector can thus improve estimation
efficiency while retaining the global oracle property of the two stage method
based on the LASSO.
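A compact sketch (ours; the dimensions are reduced from the thesis' p = 1000 so the linear program stays small, and the λ grid is illustrative) of the no-signal comparison described above: since β∗ = 0, the per-fit MSE is simply ‖β̂‖².

```python
# No-signal comparison of the LASSO and the Dantzig selector (sketch).
import numpy as np
from scipy.optimize import linprog
from sklearn.linear_model import Lasso

def dantzig_lp(X, y, lam):
    n, p = X.shape
    G, c = X.T @ X / n, X.T @ y / n
    A = np.vstack([np.hstack([G, -G]), np.hstack([-G, G])])
    b = np.concatenate([c + lam, -c + lam])
    res = linprog(np.ones(2 * p), A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(1)
n, p, R = 100, 200, 0.3
cov = R ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
y = rng.standard_normal(n)                        # beta* = 0, eps ~ N(0, I)
for lam in np.linspace(0.05, 0.5, 10):
    mse_ls = np.sum(Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_ ** 2)
    mse_ds = np.sum(dantzig_lp(X, y, lam) ** 2)
    print(f"lam={lam:.2f}  MSE(LASSO)={mse_ls:.4f}  MSE(Dantzig)={mse_ds:.4f}")
```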
The relationship between the Dantzig selector and the LASSO can be ex-
tended to the relationship between adaptive Dantzig selector (Dicker, 2010)
and the adaptive LASSO (Zou, 2006). The adaptive LASSO is defined as
$$\min_{\beta}\ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\sum_{j=1}^{p} w_j|\beta_j|,$$
e.g., with wj = |β̂_j^{ls}|^{-1}. Similar to the adaptive LASSO, the adaptive Dantzig
selector is defined as
$$\min_{\beta}\ \sum_{j=1}^{p} w_j|\beta_j| \quad \text{subject to}\quad \Big|\tfrac{1}{n} X_j^T (y - X\beta)\Big| \le w_j\lambda, \quad j = 1, \ldots, p.$$
Its constraint can be derived from the derivative of the objective function of the
adaptive LASSO; for details, see the thesis of Dicker (2010), which also proved
that the adaptive Dantzig selector and the adaptive LASSO have the same
asymptotic properties.
The relation between the Dantzig selector and the adaptive Dantzig selector is
illustrated in Figure 2.6. The adaptive Dantzig selector can relieve the bias
problem of the Dantzig selector and gives a unique solution (Dicker, 2010). We
apply this adaptive Dantzig selector with wj = P'_λ(|β_j^{init}|)/λ to the second
stage of the two stage method. The difference between the adaptive Dantzig
selector and our proposed method is that our weight depends on the tuning
parameter λ.
[Figure 2.6: Adaptive DS.]
2.3.3 Theoretical properties
In this section we prove the global oracle property of the two stage Dantzig
selector under the following regularity conditions:
(A1) The random errors ε = (ε1, . . . , εn) are i.i.d. mean zero sub-Gaussian(σ)
with scale factor σ > 0, i.e., E[exp(tεi)] ≤ exp(σ²t²/2).
(A2) ηmin(X_{A0}^T X_{A0}) > 0, where ηmin(B) is the minimum eigenvalue of B.
(A3) The design matrix X satisfies
$$\gamma = \min_{\theta \ne 0,\ \|\theta_{A_0^c}\|_1 \le \alpha\|\theta_{A_0}\|_1} \frac{\|X\theta\|_2}{\sqrt{n}\,\|\theta_{A_0}\|_2} > 0,$$
where α = 3 for the LASSO initial and α = 1 for the Dantzig initial.
The main theorem shows that the two stage Dantzig selector with a good
initial estimator is equivalent to the oracle estimator if the oracle estimator
satisfies the constraint of the two stage Dantzig selector.
Theorem 1. Assume that min_{j∈A0}|β∗j| > (a + 1)λ, where A0 = {j : β∗j ≠ 0}. Let
Fn0 = {‖βinit − β∗‖∞ ≤ a0λ}, where a0 = min{1, a2}, and
Fn1 = {|(1/n)X_j^T(y − Xβ^{(o)})| ≤ P'_λ(|β_j^{init}|) for all j}, where β^{(o)} is the oracle
estimator. On the event Fn0 ∩ Fn1, the two stage Dantzig selector is equal to the
oracle estimator.

Proof of Theorem 1. On the event Fn0, min_{j∈A0}|β_j^{init}| > aλ because
min_{j∈A0}|β∗j| > (a + 1)λ, and hence P'_λ(|β_j^{init}|) = 0 for all j ∈ A0. Next,
P'_λ(|β_j^{init}|) > 0 for all j ∈ A_0^c, because max_{j∈A_0^c}|β_j^{init}| ≤ a0λ ≤ a2λ.
On the event Fn1, the oracle estimator is in the feasible set of the two stage
Dantzig selector. Therefore, on Fn0 ∩ Fn1, the minimizer β̂ of (2.2) must be the
oracle estimator, since P'_λ(|β_j^{init}|) = 0 for j ∈ A0 and β̂j for j ∈ A_0^c can be
zero.
The following corollaries show that the two stage Dantzig selector satisfies
the global oracle property with a LASSO or Dantzig selector initial under the
regularity conditions (A1)-(A3). Condition (A1) implies that
$$\Pr(|a^T\varepsilon| > t) \le 2\exp\Big(-\frac{t^2}{2\sigma^2\|a\|^2}\Big),$$
for t ≥ 0 and a = (a1, . . . , an)T . Condition (A2) means that the signal covari-
ates are not seriously correlated. Condition (A3) is used to guarantee
‖βinit − β∗‖∞ ≤ a0λ for the LASSO or the Dantzig selector initial estimator.
Corollary 2. Assume that regularity conditions (A1)-(A3) hold. Let the initial
estimator be the LASSO estimator β̂LS(τλ) with regularization parameter τλ.

(i) If min_{j∈A0}|β∗j|/(a + 1) > λ > (2√2σ/τ)√(M log p/n) and 16τγ^{-2}√s < a0, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{LS}(\tau\lambda), \lambda) = \beta^{(o)}\big) \ge 1 - p_0 - p_1,$$
where a0 = min{1, a2},
p0 = Pr(‖β̂LS(τλ) − β∗‖∞ > 16τλγ^{-2}√s) ≤ 2p exp(−nτ²λ²/(8Mσ²)), and
p1 = Pr(F_{n1}^c) ≤ 2(p − s)·exp(−n a_1^2 λ²/(2σ²M)), with M = max_{j∈A_0^c}‖Xj‖_2^2/n.

(ii) If nτ²λ² → ∞, log p = o(nτ²λ²), and 16τγ^{-2}√s < a0, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{LS}(\tau\lambda), \lambda) = \beta^{(o)}\big) \to 1 \quad \text{as } n \to \infty.$$

Corollary 3. Assume that regularity conditions (A1)-(A3) hold. Let the initial
estimator be the Dantzig selector β̂DS(τλ) with regularization parameter τλ.

(i) If min_{j∈A0}|β∗j|/(a + 1) > λ > (σ/τ)√(2M log p/n) and 8τγ^{-2}√s < a0, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{DS}(\tau\lambda), \lambda) = \beta^{(o)}\big) \ge 1 - p_0 - p_1,$$
where a0 = min{1, a2},
p0 = Pr(‖β̂DS(τλ) − β∗‖∞ > 8τλγ^{-2}√s) ≤ 2p exp(−nτ²λ²/(2Mσ²)), and
p1 = Pr(F_{n1}^c) ≤ 2(p − s)·exp(−n a_1^2 λ²/(2σ²M)), with M = max_{j∈A_0^c}‖Xj‖_2^2/n.

(ii) If nτ²λ² → ∞, log p = o(nτ²λ²), and 8τγ^{-2}√s < a0, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{DS}(\tau\lambda), \lambda) = \beta^{(o)}\big) \to 1 \quad \text{as } n \to \infty.$$
Proof of Corollary 2 and Corollary 3. We first prove that the oracle
estimator β^{(o)} satisfies the constraint of the two stage Dantzig selector with
probability at least 1 − 2(p − s)·exp(−n a_1^2 λ²/(2σ²M)). Denote by β_{A0} the
|A0|-length sub-vector of β containing only the A0 members of β, and let
H_{A0} = X_{A0}(X_{A0}^T X_{A0})^{-1} X_{A0}^T. Then
$$\Pr(F_{n1}^c) \le \sum_{j\in A_0^c} \Pr\Big( \Big|\tfrac{1}{n} X_j^T (I - H_{A_0})\varepsilon\Big| > P'_\lambda(|\beta_j^{init}|) \Big)
\le \sum_{j\in A_0^c} 2\exp\Big( -\frac{n P'_\lambda(|\beta_j^{init}|)^2}{2\sigma^2 M} \Big)
\le 2(p-s)\cdot\exp\Big( -\frac{n a_1^2\lambda^2}{2\sigma^2 M} \Big),$$
because of the assumption that P'_λ(t) ≥ a1λ for t ≤ a2λ, the regularity
condition (A1), and ‖(1/n)X_j^T(I − H_{A0})‖_2^2 ≤ M/n for all j ∈ A_0^c.

Given this bound on p1, we only have to prove the upper bound of the
probability p0 related to the initial estimator. We can use the results of Bickel
et al. (2009) or Negahban et al. (2012) to obtain an estimation bound for
βinit − β∗ for the LASSO and the Dantzig selector. Bickel et al. (2009) showed
the asymptotic equivalence of the LASSO and the Dantzig selector, giving
non-asymptotic ℓq error bounds under the restricted eigenvalue condition and a
normality assumption on the errors. For the Dantzig selector, with probability
1 − exp(−nτ²λ²/(2σ²M)),
$$\|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_2 \le \frac{8\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with τλ > σ√(2M log p/n). For the LASSO estimator, with probability
1 − exp(−nτ²λ²/(8σ²M)),
$$\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_2 \le \frac{16\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with τλ > σ√(8M log p/n). Corollary 2 of Negahban et al. (2012) showed that,
for the LASSO estimator, with probability 1 − exp(−nτ²λ²/(2σ²M)),
$$\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_2 \le \frac{2\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with τλ ≥ 4σ√(M log p/n). From these results, the upper bounds for
‖β̂LS(τλ) − β∗‖∞ and ‖β̂DS(τλ) − β∗‖∞ follow.
Remark. Comments on the regularity conditions on design matrix:
Define the restricted eigenvalue (RE) condition as follows. A p × p
sample covariance matrix X^T X/n satisfies the RE condition of order k with
parameters (α, γ) if
$$\frac{1}{n}\theta^T X^T X \theta \ge \gamma\|\theta\|_2^2 \quad \forall\, \theta \in C(B, \alpha),$$
where C(B, α) = {θ ∈ Rp : ‖θ_{B^c}‖1 ≤ α‖θ_B‖1}, for all subsets B ⊂ {1, . . . , p}
such that |B| = k. The RE condition is a weak and general regularity condition
for achieving the optimal ℓ2 error bound (of order √(s log p/n)) for ℓ1
regularization methods such as the LASSO (α = 3) and the Dantzig selector
(α = 1). A series of studies have been concerned with which conditions are
necessary for guaranteeing these optimal error bounds.
The restricted isometry property (RIP) (Candes and Tao, 2005) is defined
as follows. X is said to satisfy the s-restricted isometry property with restricted
isometry constant δs if there exists a constant δs such that, for every
T ⊂ {1, . . . , p} with |T| ≤ s, the n × |T| submatrix XT of X and every u ∈ R^{|T|}
satisfy
$$(1 - \delta_s)\|u\|_2^2 \le \|X_T u\|_2^2 \le (1 + \delta_s)\|u\|_2^2.$$
The (s, s′)-restricted orthogonality constant θ_{s,s′}, for s + s′ ≤ p, is defined as
the smallest quantity such that
$$|\langle X_T u, X_{T'} u'\rangle| \le \theta_{s,s'} \cdot \|u\|_2\|u'\|_2$$
for all T, T′ ⊂ {1, . . . , p} such that T ∩ T′ = ∅, |T| ≤ s, and |T′| ≤ s′. X
satisfies the uniform uncertainty principle (UUP) (Candes and Tao, 2007) if
δ_{2s} + θ_{s,2s} < 1, which means that for all s-sparse sets T the columns of the
matrix corresponding to T are almost orthogonal.
The RIP and the UUP are earlier conditions which are very restrictive. They
cover designs with independent variables drawn from Gaussian or Bernoulli
distributions (Candes and Tao, 2007), but they cannot deal with substantial
dependency. Raskutti et al. (2010) showed that a design matrix whose rows are
independently distributed as N(0, Σ) satisfies the RE condition with sample
size n = O(s log p), for a broad class of Σ. Sample covariance matrices with Σ
including Toeplitz matrices, the spiked identity model, or highly degenerate
covariance matrices satisfy the RE condition (Raskutti et al., 2010). Rudelson
and Zhou (2013) extended this to sub-Gaussian designs with substantial
dependency.
Bickel et al. (2009) showed that in the more general setting X ∼iid (0, Σ), if
φmin(s log n) > c/log n, then X^T X/n satisfies RE(α, γ) of order s, where
$$\gamma^2 = \sqrt{\phi_{\min}(s\log n)}\left(1 - c_0\sqrt{\frac{s\,\phi_{\max}(s\log n - s)}{(s\log n - s)\,\phi_{\min}(s\log n)}}\right)$$
and
$$\phi_{\min}(m) = \min_{1\le\|\theta\|_0\le m}\frac{\theta^T X^T X\theta}{\|\theta\|_2^2}, \qquad \phi_{\max}(m) = \max_{1\le\|\theta\|_0\le m}\frac{\theta^T X^T X\theta}{\|\theta\|_2^2}.$$
The condition φmin(s log n) > c/log n holds for s < √n log^{-3/2} n (Kim et al.,
2012; Greenshtein and Ritov, 2004). Hence, the RE condition can hold for a
large class of Σ even when the RIP condition is not satisfied with probability
converging to one. For more discussion of the regularity conditions, see Bickel
et al. (2009), Negahban et al. (2012), or Zhang and Zhang (2012).
2.3.4 Algorithm
The two stage Dantzig selector β̂TSDS(βinit, λ) can be computed as follows. Let
A = {j : |β_j^{init}| > aλ}; then β̂TSDS_{A^c}(βinit, λ) can be calculated by
$$\min_{\beta_{A^c}}\ \sum_{j\in A^c} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \quad \text{subject to}\quad \Big|\tfrac{1}{n} X_j^T (I - H_A)(y - X_{A^c}\beta_{A^c})\Big| \le P'_\lambda(|\beta_j^{init}|) \ \text{for } j \in A^c,$$
and
$$\hat\beta^{TSDS}_{A}(\beta^{init}, \lambda) = (X_A^T X_A)^{-1} X_A^T\big(y - X_{A^c}\hat\beta^{TSDS}_{A^c}(\beta^{init}, \lambda)\big).$$
Similar to the LLA algorithm, set β̃ = Wβ_{A^c} and X̃ = (I − H_A)X_{A^c}W^{-1},
where W is the diagonal matrix whose entries are P'_λ(|β_j^{init}|) for j ∈ A^c. Then
the above optimization for β̂TSDS_{A^c}(βinit, λ) becomes a standard Dantzig selector,
$$\min\ \|\tilde\beta\|_1 \quad \text{subject to}\quad \Big\|\tfrac{1}{n}\tilde{X}^T (y - \tilde{X}\tilde\beta)\Big\|_\infty \le 1.$$
Hence, we can use the same algorithms for Dantzig selector such as gen-
eralized primal-dual interior point algorithm (Candes and Romberg, 2005),
Dantzig selector with sequential optimization (DASSO) (James et al., 2009),
and alternating direction method (ADM) (Lu et al., 2012). We briefly review
several popular algorithms for Dantzig selector in the Appendix.
2.3.5 Tuning regularization parameter
Recall the HBIC of Wang et al. (2013),
$$\mathrm{HBIC}(\lambda) = \log\big(\|y - X\hat\beta(\lambda)\|^2/n\big) + C_n \cdot \frac{\log p}{n}\,|M_\lambda|,$$
where Mλ = {j : β̂j(λ) ≠ 0} and Cn → ∞ (e.g., Cn = log n or log(log n)).
According to Corollary 1 in Section 2.2.3, we can use the HBIC to tune the
regularization parameter, because our proposed method attains the oracle
estimator with probability tending to one.
2.4 Numerical analyses
In this section, we investigate the performance of the proposed two stage
Dantzig selector (TSDS). Suppose there are p covariate variables
x1, . . . , xp. The goal of these numerical studies is to evaluate how well the
methods perform in variable selection and estimation accuracy. We consider the
linear regression model
regression model
y = xTβ + ε,
where x = (x1, . . . , xp)T and β ∈ Rp is a coefficient vector.
We compare the proposed TSDS with other methods including the LASSO,
the adaptive LASSO, the Dantzig selector, the adaptive Dantzig selector, the
SCAD, the MCP, and the two stage methods based on the LASSO with respect
to selection and estimation. For the SCAD and the MCP, a = 2.1 and a =
1.5 are considered, respectively. Two stage methods use the SCAD penalty
with a = 2.1. Regarding the tuning parameter, five-fold cross validation is
considered for the LASSO, the adaptive LASSO, the Dantzig selector, and
the adaptive Dantzig selector. For the SCAD, the MCP, and the two stage
methods, the high-dimensional BIC (HBIC) is used for tuning parameter. The
HBIC is defined as
HBIC = log(‖y − Xβ̂‖²/n) + log(log n) · (log p/n) · |{j : β̂j ≠ 0}|.
The LASSO and the Dantzig selector are used for the initial estimator with
tuning parameter λ/ log n in the two stage methods. The LARS algorithm is
used for the LASSO and the adaptive LASSO and the primal-dual interior
point algorithm is used for the Dantzig selector and the adaptive Dantzig
selector. The CCCP algorithm is used for the SCAD and the MCP estimator
and the calibrated CCCP (Wang et al., 2013) is used for the two stage method
based on LASSO.
2.4.1 Simulations
In this section, we consider two simulation settings. For each experimental
setting, we replicate the simulation 100 times. We simulate data from the true
model
$$y = \sum_{j=1}^{p} X_j\beta_j + \varepsilon, \qquad \varepsilon \sim N(0, 2^2),$$
where p = 1000 and the number of observations is n = 100. Covariates are
generated from the normal distribution with zero mean and the covariance of xi
and xj equal to R^{|i−j|}, i, j = 1, . . . , p. For each simulation setting, we generate
data with R = 0.3 and 0.5.
Based on 100 replications, the following statistics are measured for comparison:
the average number of falsely estimated non-zero coefficients (FP); the average
number of falsely estimated zero coefficients (FN); the proportion of replications
in which the true model is exactly identified (TM); and
MSE = ∑_{m=1}^{100} ‖β̂^{(m)} − β∗‖²/100. In the results of the two stage methods,
"LS+LS", for example, the first "LS" refers to the initial estimator and the
second "LS" refers to the two stage method based on the LASSO, while "DS+DS"
refers to the two stage method based on the Dantzig selector with a Dantzig
selector initial. The results of all four combinations of two stage methods using
the LASSO and the Dantzig selector are reported in the following tables.
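A sketch (ours) of one replication under this setting and the per-replication summaries; beta_hat stands for any fitted estimator, and the function names are illustrative.

```python
# One simulation replicate: AR(1)-correlated covariates, N(0, 2^2) noise,
# and the FP / FN / squared-error summaries reported in the tables.
import numpy as np

def generate_data(beta, n=100, R=0.3, sigma=2.0, seed=0):
    p = beta.size
    rng = np.random.default_rng(seed)
    cov = R ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y

def summarize(beta_hat, beta_true):
    fp = int(np.sum((beta_hat != 0) & (beta_true == 0)))   # false positives
    fn = int(np.sum((beta_hat == 0) & (beta_true != 0)))   # false negatives
    sq_err = float(np.sum((beta_hat - beta_true) ** 2))    # ||beta_hat - beta*||^2
    return fp, fn, sq_err
```

Averaging fp, fn, and sq_err over the 100 replications gives the FP, FN, and MSE columns of the tables below.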
Example 1. We simulate 100 data sets under the above setting with the true
coefficient vector
$$\beta = (\underbrace{3,\ 1.5,\ 0,\ 0,\ 2}_{5},\ \underbrace{0, \ldots, 0}_{p-5})^T.$$
Table 2.1: Example 1 (R=0.3)
Methods FP FN TM MSE
LASSO 25.38(9.019) 0(0) 0 1.040(0.501)
ALASSO 23.11(7.938) 0(0) 0 2.314(0.587)
Dantzig 18.48(10.323) 0(0) 0 1.044(0.530)
ADantzig 15.69(8.284) 0(0) 0 1.859(0.689)
MCP 2.12(1.719) 0.01(0.1) 0.16 0.463(0.479)
SCAD 1.36(1.382) 0.01(0.1) 0.26 0.389(0.606)
LS+LS 1.39(1.550) 0.02(0.141) 0.3 0.405(0.593)
LS+DS 1.39(1.550) 0.02(0.141) 0.3 0.404(0.594)
DS+LS 1.32(1.455) 0.02(0.141) 0.3 0.397(0.574)
DS+DS 1.31(1.461) 0.02(0.141) 0.31 0.394(0.574)
Table 2.2: Example 1 (R=0.5)
Methods FP FN TM MSE
LASSO 24.55(9.632) 0(0) 0 0.926(0.453)
ALASSO 21.83(8.263) 0(0) 0 2.290(0.648)
Dantzig 17.3(9.161) 0(0) 0.01 0.859(0.424)
ADantzig 17.17(9.135) 0(0) 0.01 0.934(0.517)
MCP 1.97(1.598) 0.03(0.171) 0.23 0.643(0.880)
SCAD 1.23(1.270) 0.04(0.197) 0.3 0.780(0.981)
LS+LS 1.27(1.370) 0.03(0.171) 0.33 0.578(0.793)
LS+DS 1.24(1.319) 0.03(0.171) 0.33 0.564(0.793)
DS+LS 1.29(1.387) 0.03(0.171) 0.32 0.555(0.794)
DS+DS 1.24(1.296) 0.03(0.171) 0.32 0.549(0.791)
Example 2. We simulate 100 data sets under the above setting with the true
coefficient vector
$$\beta = ((\underbrace{3,\ 1.5,\ 0,\ 0,\ 2}_{5},\ \underbrace{0, \ldots, 0}_{15}) \times 5,\ \underbrace{0, \ldots, 0}_{p-100})^T,$$
i.e., the block of 20 coefficients is repeated 5 times, giving 100 leading coefficients.
Table 2.3: Example 2 (R=0.3)
Methods FP FN TM MSE
LASSO 25.96(1.370) 1.03(1.283) 0 18.742(9.205)
ALASSO 20.89(4.479) 1.08(1.398) 0 11.147(8.665)
Dantzig 24.18(3.583) 1.93(1.771) 0 24.685(11.391)
ADantzig 23.9(4.135) 1.94(1.802) 0 15.555(11.128)
MCP 18(8.957) 1.95(3.468) 0.01 21.891(34.190)
SCAD 4.58(6.240) 1.28(2.958) 0.09 11.336(25.522)
LS+LS 6.89(9.331) 0.71(1.866) 0.05 7.736(9.773)
LS+DS 5.24(5.826) 0.71(1.903) 0.03 7.558(9.925)
DS+LS 6.74(8.73) 0.69(1.846) 0.06 7.648(9.972)
DS+DS 4.67(5.924) 0.52(1.573) 0.15 7.055(8.234)
Table 2.4: Example 2 (R=0.5)
Methods FP FN TM MSE
LASSO 25.32(0.827) 0.32(0.827) 0 10.676(6.274)
ALASSO 19.89(4.325) 0.34(0.831) 0 6.656(5.120)
Dantzig 24.29(1.546) 0.6(0.974) 0 14.284(7.890)
ADantzig 22.76(4.656) 0.61(0.984) 0 6.912(6.382)
MCP 4.33(4.551) 2.38(2.534) 0.09 14.508(19.203)
SCAD 3.35(3.056) 1.89(2.287) 0.05 12.936(14.131)
LS+LS 4.56(6.609) 0.57(1.358) 0.07 5.467(6.115)
LS+DS 3.52(2.921) 0.58(1.365) 0.08 5.339(6.158)
DS+LS 3.97(4.239) 0.58(1.387) 0.07 5.078(6.161)
DS+DS 3.44(4.613) 0.42(1.249) 0.14 4.902(5.746)
2.4.2 Real data analysis
We analyze the data set of Scheetz et al. (2006) containing 18,976 gene
expression levels from 120 rats. The objective of this analysis is to find the
genes correlated with the gene TRIM32, which is known to cause Bardet-Biedl
syndrome. Many previous works (Huang et al., 2008b; Kim et al., 2008; Wang
et al., 2013) analyzed this data set. Following these papers, we first select 3,000
genes with the largest variance in expression level and then pick the top 1,000
genes most correlated with TRIM32 among the selected 3,000 genes. With this
data set, we focus on the comparison between two stage methods, because the
comparison between the two stage method with LASSO and other methods was
already done by Wang et al. (2013), and assessing the improvement of TSDS
over the previous two stage methods is our main interest. The results are in
Table 2.5.
Table 2.5: Real Data (TRIM)
Methods #{j : β̂j ≠ 0} PE
LS+LS 11.37 0.813
LS+DS 10.47 0.806
DS+LS 8.74 0.857
DS+DS 8.41 0.83
2.5 Conclusion
In this chapter, we have proposed a two stage method based on the Dantzig
selector, called the two stage Dantzig selector. We proved that the two stage
Dantzig selector obtains the oracle estimator under regularity conditions. The
proposed method can be easily implemented by general algorithms for the
standard Dantzig selector.
The numerical results support our contention that the Dantzig selector used
in our method can improve variable selection and estimation through lessening
the effects of noise variables more efficiently than LASSO. Furthermore, the
numerical results show that our proposed method outperforms other sparse
regularization methods with respect to variable selection and estimation.
Chapter 3
Two Stage Methods for
Precision Matrix Estimation
3.1 Introduction
Precision matrix (inverse covariance matrix) estimation is an important prob-
lem in high dimensional statistical analysis and is useful for various applications
such as Gaussian graphical models, gene classification, optimal portfolio al-
location, and speech recognition. Under the normality assumption, suppose
X = (X1, . . . , Xp) ∼ N(µ, Σ); then the zero elements in the precision matrix
Ω = (ωij)p×p imply conditional independence of variables, that is, ωij = 0 if
and only if Xi and Xj are independent given X\{Xi, Xj}. Therefore the sup-
port of the precision matrix is related to the structure of the undirected Gaussian
graph G = (V, E) with vertex set V = {X1, . . . , Xp} and edge set E satisfying
E^c = {(i, j) : Xi ⊥⊥ Xj | X\{Xi, Xj}}, where ⊥⊥ denotes independence. In the
high dimensional setting, classical methods such as the Gaussian graphical model
and the inverse of the sample covariance matrix cannot provide a stable estimate
of the precision matrix, and additional restrictions should be imposed to obtain
stable and accurate precision matrix estimation. Hence many regularized methods for pre-
cision matrix estimation are developed based on the relationship between pre-
cision matrix and the Gaussian graphical model. There are two frameworks among
these regularized methods: the regression based approach and the maximum
likelihood approach. Meinshausen and Buhlmann (2006) introduced a penalized
neighborhood regression model with the LASSO penalty: they fitted each variable
on its neighborhood with the LASSO penalty and aggregated the results. Peng
et al. (2009) proposed a joint neighborhood LASSO selection method which si-
multaneously performs neighborhood selection for all variables. Yuan (2010)
adopted the Dantzig selector in the regression based approach and established its
convergence rate. Yuan and Lin (2007) proposed a penalized maximum likelihood
method with the LASSO penalty, and Friedman et al. (2008) introduced an efficient
algorithm called the graphical LASSO (glasso) for the penalized maximum likelihood
method with the LASSO, using a blockwise coordinate descent algorithm (Banerjee
et al., 2008). Fan et al. (2009) dealt with the bias problem of the LASSO penal-
ization and proposed new penalized likelihood methods with adaptive LASSO
and SCAD penalties; the convergence rates of non-convex penalized methods
are shown in Lam and Fan (2009). Cai et al. (2011) proposed a constrained ℓ1
minimization method called CLIME and showed its convergence rates under
various matrix norms.
Most of the existing sparse precision matrix estimators which use ℓ1
regularization, including the LASSO or the Dantzig selector, suffer from selection
inconsistency and biased estimation. Although penalized likelihood estimation
with the SCAD penalty can achieve selection consistency and an unbiased
estimator, it takes considerable time to converge to a local minimizer and it
cannot guarantee that the local minimizer is the oracle estimator. In this
chapter, we especially focus on selection and the correct recovery of the support
of the precision matrix. We propose two stage methods based on the LASSO or
the Dantzig selector which can correctly select the support of the precision
matrix with high probability under some regularity conditions.
3.2 Estimation of precision matrix via columnwise two-stage methods
Suppose (X1, . . . , Xp) are jointly generated by mean µ = (µ1, . . . , µp)′ and
covariance matrix Σ∗. It is well known (e.g., Lemma 1 of Peng et al. (2009))
that for i = 1, . . . , p, let
Xi = µi +∑j 6=i
β∗ijXj + εi
then (X1, . . . , Xi−1, Xi+1, . . . , Xp) and εi are uncorrelated if and only if β∗ij =
−ω∗ij/ω∗ii where Σ∗−1 = Ω∗ = (ω∗ij) is the precision matrix. Furthermore, with
those β∗ij, cov(εi, εj) = ω∗ij/(ω∗iiω∗jj) and var(εi) = 1/ω∗ii. Under normality as-
sumption, the forementioned uncorrelation can be replaced by independence.
This regression based approach has been applied to various methods including
Meinshausen and Buhlmann (2006), Peng et al. (2009), and Yuan (2010) by
using LASSO or Dantzig selector. We use this relationship to estimate sparse
precision matrix via two stage regression methods based on the LASSO esti-
mator such as calibrated CCCP (Wang et al., 2013) and one step LLA (Fan
et al., 2012), or based on the Dantzig selector called two stage Dantzig selector.
49
3.2.1 Two stage method based on LASSO
We first briefly introduce the one step LLA (Zou and Li, 2008; Fan et al., 2012) as a two stage method based on the LASSO, and then we apply the one step LLA to estimate the precision matrix. Consider a penalized regression problem,
$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^p P_\lambda(|\beta_j|) \right\},$$
where $y$ is the response vector, $X = (X_1, \ldots, X_p)$ is an $n \times p$ covariate matrix with $X_i = (X_{1i}, \ldots, X_{ni})^T$, $\beta = (\beta_1, \ldots, \beta_p)^T$ is the vector of regression coefficients, $\|\cdot\|$ is the $L_2$ norm, and $P_\lambda(\cdot)$ is a penalty function with tuning parameter $\lambda$. We consider a class of nonconvex penalty functions $P_\lambda = P_{\lambda,a}$ satisfying
(P1) $P_\lambda(t)$ is increasing and concave for $t \in [0, \infty)$ with a continuous derivative on $t \in (0, \infty)$, and $P'_\lambda(0) := P'_\lambda(0+) \geq a_1\lambda$;
(P2) $P'_\lambda(t) \geq a_1\lambda$ for $t \in (0, a_2\lambda)$;
(P3) $P'_\lambda(t) = 0$ for $t > a\lambda > a_2\lambda$,
for some positive constants $a_1$, $a_2$, and $a$.
The SCAD and the MCP penalties satisfy the above conditions, with $a_1 = 1$ for the SCAD and $a_1 = 1 - a^{-1}$ for the MCP. The derivative of the SCAD penalty is defined by
$$P'_\lambda(t) = \lambda\left\{ I(t \leq \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \right\}, \quad \text{for some } a > 2,$$
and the derivative of the MCP penalty is defined by
$$P'_\lambda(t) = \left(\lambda - \frac{t}{a}\right)_+, \quad \text{for some } a > 1.$$
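For concreteness, the following is a minimal numpy sketch of the two penalty derivatives above; the default values a = 3.7 (SCAD) and a = 3 (MCP) are common illustrative choices, not values prescribed in this section.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """P'_lambda(t) of the SCAD penalty for t >= 0."""
    t = np.abs(t)
    return lam * np.where(t <= lam, 1.0, np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))

def mcp_deriv(t, lam, a=3.0):
    """P'_lambda(t) of the MCP penalty for t >= 0."""
    return np.maximum(lam - np.abs(t) / a, 0.0)
```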
The one step LLA is defined as follows,
$$\hat\beta(\beta^{init}, \lambda) = \operatorname*{argmin}_{\beta}\; \sum_{i=1}^n (y_i - x_i^T\beta)^2/2n + \sum_{j=1}^p P'_\lambda(|\beta^{init}_j|)\,|\beta_j|. \qquad (3.1)$$
Then equation (3.1) can be recast as
$$\hat\beta(\beta^{init}, \lambda)_{A^c} = \operatorname*{argmin}_{\beta_{A^c}}\; \|(I - H_A)(y - X_{A^c}\beta_{A^c})\|^2/2n + \sum_{j \in A^c} P'_\lambda(|\beta^{init}_j|)\,|\beta_j|,$$
$$\hat\beta(\beta^{init}, \lambda)_{A} = (X_A^T X_A)^{-1} X_A^T \big(y - X_{A^c}\hat\beta(\beta^{init}, \lambda)_{A^c}\big),$$
where $A = A(\beta^{init}, \lambda) = \{j : |\beta^{init}_j| > a\lambda\}$ with $a$ the parameter of the nonconvex penalty, and $H_A = X_A(X_A^T X_A)^{-1}X_A^T$. Let $\tilde y = (I - H_A)y$, $\tilde X = (I - H_A)X_{A^c}W^{-1}$ and $\tilde\beta = W\beta_{A^c}$, where $W = \mathrm{diag}\big(P'_\lambda(|\beta^{init}_j|)\big)_{j \in A^c}$; then equation (3.1) can be recast as the LASSO problem with respect to $\tilde y$, $\tilde X$, $\tilde\beta$ and tuning parameter 1. Hence the algorithms for the LASSO can be used for the osLLA.
For a good initial estimate which satisfies $\|\beta^{init} - \beta^*\|_\infty < \min(a_2, 1)\cdot\lambda$, the oracle estimator $\hat\beta^{(o)}$ can be obtained via the two stage method based on the LASSO with high probability, where $\hat\beta^{(o)}_{A_0} = (X_{A_0}^T X_{A_0})^{-1}X_{A_0}^T y$ and $\hat\beta^{(o)}_{A_0^c} = 0$ with $A_0 = \{j : \beta^*_j \neq 0\}$.
51
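Before moving to precision matrices, the following is a minimal Python sketch of the one step LLA for regression through the weighted-LASSO reformulation above. It reuses scad_deriv from the earlier sketch and scikit-learn's Lasso solver; it assumes all weights on $A^c$ are strictly positive (a full implementation would move zero-weight coordinates into $A$), and the function and parameter names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, beta_init, lam, a=3.7):
    """One step LLA via the weighted-LASSO reformulation of (3.1)."""
    n, p = X.shape
    A = np.where(np.abs(beta_init) > a * lam)[0]        # unpenalized block (large initial coefficients)
    Ac = np.setdiff1d(np.arange(p), A)
    w = scad_deriv(beta_init[Ac], lam, a)                # weights P'_lambda(|beta_init_j|), j in A^c
    H = X[:, A] @ np.linalg.pinv(X[:, A]) if A.size else np.zeros((n, n))
    y_t = y - H @ y                                      # (I - H_A) y
    X_t = (X[:, Ac] - H @ X[:, Ac]) / w                  # (I - H_A) X_{A^c} W^{-1}
    # sklearn's Lasso minimizes ||y - Xb||^2 / (2n) + alpha * ||b||_1, so alpha = 1 matches the text
    fit = Lasso(alpha=1.0, fit_intercept=False).fit(X_t, y_t)
    beta = np.zeros(p)
    beta[Ac] = fit.coef_ / w                             # undo the W scaling
    if A.size:
        beta[A] = np.linalg.lstsq(X[:, A], y - X[:, Ac] @ beta[Ac], rcond=None)[0]
    return beta
```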
Now we apply this one step LLA estimator to precision matrix estimation. Let the true precision matrix be $\Omega^* = \Sigma^{*-1}$, and let $X^{(1)}, \ldots, X^{(n)}$ be independent and identically distributed samples from $N_p(\mu, \Sigma^*)$. For $i = 1, \ldots, p$, denote the $i$th column of $\Omega$ without $\Omega_{ii}$ by $\Omega_{-ii} = (\omega_{1i}, \ldots, \omega_{i-1,i}, \omega_{i+1,i}, \ldots, \omega_{pi})^T$ and
$$\beta_{(i)} = (\beta_{1(i)}, \ldots, \beta_{i-1(i)}, \beta_{i+1(i)}, \ldots, \beta_{p(i)})^T = \left(-\frac{\omega_{1i}}{\omega_{ii}}, \ldots, -\frac{\omega_{i-1,i}}{\omega_{ii}}, -\frac{\omega_{i+1,i}}{\omega_{ii}}, \ldots, -\frac{\omega_{pi}}{\omega_{ii}}\right)^T.$$
Let $Z_i = X_i - \bar X_i$, where $X_i = (X_{1i}, \ldots, X_{ni})^T$ and $\bar X_i = (\bar X_i, \ldots, \bar X_i)^T$ with $\bar X_i = \sum_{j=1}^n X^{(j)}_i / n$. Denote the sample covariance $S = Z^T Z/n$, where $Z_{n\times p} = (Z_1, \ldots, Z_p)$, and let $Z_{-i} = (Z_1, Z_2, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_p)$. For an initial estimate $\Omega^{init}$ and a vector of tuning parameters $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_p)^T$, our proposed one step LLA (osLLA) estimator $\hat\Omega^{osLLA}(\Omega^{init}, \boldsymbol\lambda) = (\hat\omega^{osLLA}_{ij}(\omega^{init}_{ij}, \lambda_j))_{1\leq i,j\leq p}$ is defined as follows. First, conduct the regression columnwise to estimate $\Omega_{-ii}$, $i = 1, \ldots, p$:
$$\tilde\Omega_{ii} = \tilde\Omega_{ii}(\Omega^{init}, \lambda_i) = 1\big/\big(\|Z_i - Z_{-i}\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i)\|^2/n\big),$$
$$\tilde\Omega_{-ii} = \tilde\Omega_{-ii}(\Omega^{init}, \lambda_i) = -\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i)\,\tilde\Omega_{ii}(\Omega^{init}, \lambda_i),$$
where
$$\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \operatorname*{argmin}_{\beta_{(i)}}\; \frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j \neq i} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|. \qquad (3.2)$$
$\tilde\Omega$ is calculated by solving the $p$ independent optimization problems defined in (3.2). Second, for symmetry, our final osLLA estimator $\hat\Omega^{osLLA}(\Omega^{init}, \boldsymbol\lambda)$ is defined by
$$\hat\omega^{osLLA}_{ij}(\omega^{init}_{ij}, \lambda) = \tilde\omega_{ij}\,I(|\tilde\omega_{ij}| \leq |\tilde\omega_{ji}|) + \tilde\omega_{ji}\,I(|\tilde\omega_{ij}| > |\tilde\omega_{ji}|).$$
The initial estimate $\Omega^{init}$ can be the CLIME estimator (Cai et al., 2011) or the graphical lasso estimator (Yuan and Lin, 2007). The columnwise LASSO or Dantzig selector with tuning parameter $\lambda^{init}_i = \lambda_i/\log n$ (Wang et al., 2013) can also be considered as an initial estimate.
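As a rough illustration of the columnwise procedure and the symmetrization step, the sketch below reuses the hypothetical one_step_lla function from the earlier sketch; the choice of initial estimator and per-column tuning parameters is left to the caller, and all names are illustrative.

```python
import numpy as np

def symmetrize_min(Omega):
    """For each (i, j), keep the entry with the smaller absolute value, as in the text."""
    keep_ij = np.abs(Omega) <= np.abs(Omega.T)
    return np.where(keep_ij, Omega, Omega.T)

def oslla_precision(X, Omega_init, lams, a=3.7):
    """Columnwise osLLA precision matrix estimate followed by symmetrization."""
    n, p = X.shape
    Z = X - X.mean(axis=0)                        # centered data, so S = Z'Z / n
    Omega = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        beta_init = -Omega_init[others, i] / Omega_init[i, i]
        beta_i = one_step_lla(Z[:, others], Z[:, i], beta_init, lams[i], a)
        omega_ii = n / np.sum((Z[:, i] - Z[:, others] @ beta_i) ** 2)
        Omega[i, i] = omega_ii
        Omega[others, i] = -omega_ii * beta_i      # off-diagonal entries of column i
    return symmetrize_min(Omega)
```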
3.2.2 Two stage Dantzig selector
The two stage Dantzig selector (TSDS) is a Dantzig selector type modification of the LLA for nonconvex penalized methods. The TSDS for regression is defined as a solution of the following problem:
$$\min_\beta\; \sum_{j=1}^p P'_\lambda(|\beta^{init}_j|)\,|\beta_j| \quad \text{subject to} \quad \left|\frac{1}{n}X_j^T(y - X\beta)\right| \leq P'_\lambda(|\beta^{init}_j|), \quad j = 1, \ldots, p,$$
where $X_j = (X_{1j}, \ldots, X_{nj})^T$ and $\beta^{init}$ is an initial estimate.
Similar to the osLLA algorithm, it can be reformulated as follows.
1. Set $\tilde\beta = W\beta_{A^c}$ and $\tilde X = X_{A^c}W^{-1}$, where $A$ and $W$ are defined as in subsection 3.2.1, and compute
$$\hat{\tilde\beta} = \operatorname*{argmin}_{\tilde\beta}\left\{\|\tilde\beta\|_1 : \|\tilde X^T(y - \tilde X\tilde\beta)\|_\infty \leq 1\right\}.$$
2. Let $\hat\beta_{A^c} = W^{-1}\hat{\tilde\beta}$ and $\hat\beta_A = (X_A^TX_A)^{-1}X_A^T(y - X_{A^c}\hat\beta_{A^c})$.
The same algorithms for the Dantzig selector, including the primal-dual interior point algorithm (Candes and Romberg, 2005), DASSO (James et al., 2009), and the alternating direction method (Lu et al., 2012), can be used as well.
We now define our TSDS estimator for precision matrix estimation. It is similar to the osLLA estimator for precision matrix estimation. Let the columnwise TSDS estimator $\hat\beta^{TSDS}_{(i)} = \hat\beta^{TSDS}_{(i)}(\Omega^{init}, \lambda_i)$ be the solution of
$$\min_{\beta_{(i)}}\; \sum_{j \neq i} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}| \quad \text{subject to} \quad \left|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\beta_{(i)})\right| \leq P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right), \quad \forall j \neq i, \qquad (3.3)$$
and set
$$\tilde\Omega_{ii} = \left(S_{ii} - 2(\hat\beta^{TSDS}_{(i)})^T S_{-i,i} + (\hat\beta^{TSDS}_{(i)})^T S_{-i,-i}\hat\beta^{TSDS}_{(i)}\right)^{-1}, \qquad \tilde\Omega_{-i,i} = -\tilde\Omega_{ii}\hat\beta^{TSDS}_{(i)}.$$
To impose symmetry on the estimated precision matrix, we set $\hat\Omega^{TSDS} = (\hat\omega^{TSDS}_{ij})_{1\leq i,j\leq p}$ such that
$$\hat\omega^{TSDS}_{ij} = \hat\omega^{TSDS}_{ji} = \tilde\omega_{ij}\,I(|\tilde\omega_{ij}| \leq |\tilde\omega_{ji}|) + \tilde\omega_{ji}\,I(|\tilde\omega_{ij}| > |\tilde\omega_{ji}|).$$
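Since each columnwise problem (3.3) is a weighted Dantzig selector, it can be written as a linear program. The following is a minimal sketch using scipy's linprog; the variable split $\beta = u - v$ with $u, v \geq 0$ and all names are illustrative, and zero weights simply turn the corresponding constraints into equalities.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_dantzig(Zi, Zmi, w):
    """min sum_j w_j |b_j|  subject to  |Z_j'(Z_i - Z_{-i} b)| / n <= w_j  for all j."""
    n, q = Zmi.shape
    A = Zmi.T @ Zmi / n                     # (1/n) Z_{-i}' Z_{-i}
    b = Zmi.T @ Zi / n                      # (1/n) Z_{-i}' Z_i
    c = np.concatenate([w, w])              # objective weights for u and v
    A_ub = np.block([[A, -A], [-A, A]])     # encodes |A(u - v) - b| <= w
    b_ub = np.concatenate([w + b, w - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")  # default bounds give u, v >= 0
    u, v = res.x[:q], res.x[q:]
    return u - v
```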
3.2.3 Theoretical results
We prove the selection consistency of our proposed estimators. First, we define the columnwise oracle estimator of the precision matrix $\hat\Omega^{(o)}$ as follows:
$$\tilde\Omega_{ii} = \left(S_{ii} - 2(\hat\beta^{(o)}_{(i)})^T S_{-i,i} + (\hat\beta^{(o)}_{(i)})^T S_{-i,-i}\hat\beta^{(o)}_{(i)}\right)^{-1}, \qquad \tilde\Omega_{-i,i} = -\tilde\Omega_{ii}\hat\beta^{(o)}_{(i)},$$
where $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$ defined by $\hat\beta^{(o)}_{A_{0i}(i)} = (Z_{A_{0i}}^TZ_{A_{0i}})^{-1}Z_{A_{0i}}^TZ_i$ and $\hat\beta^{(o)}_{j(i)} = 0$ for $j \in A_{0i}^c$, with $A_{0i} = \{j : \omega^*_{ij} \neq 0, j \neq i\}$. For symmetry of the columnwise oracle precision matrix $\hat\Omega^{(o)} = (\hat\omega^{(o)}_{ij})$, we take $\hat\omega^{(o)}_{ij} = \hat\omega^{(o)}_{ji} = \tilde\omega_{ij}\,I(|\tilde\omega_{ij}| \leq |\tilde\omega_{ji}|) + \tilde\omega_{ji}\,I(|\tilde\omega_{ij}| > |\tilde\omega_{ji}|)$.

Proposition 1. The columnwise oracle estimator $\hat\Omega^{(o)}$ is selection consistent and is an elementwise $\sqrt{n}$-consistent estimator of $\Omega^*$ for the nonzero elements.

Proof of Proposition 1. By definition, the columnwise oracle estimator is selection consistent. Since we assume that $X^{(1)}, \ldots, X^{(n)} \sim N(\mu, \Sigma^*)$ with $\Omega^* = \Sigma^{*-1}$, we have $\Omega^*_{-i,i} = -\Omega^*_{i,i}\beta^*_{(i)}$ and $\Omega^*_{i,i} = 1/\mathrm{var}(\varepsilon_i)$. For a sparse $\Omega^*$, we can assume that there exists a positive constant $d$ bounding the degree of $\Omega^*$, $\max_{i=1,\ldots,p}|A_{0i}| < d$. For each $i$,
$$\sqrt{n}\left(\hat\beta^{(o)}_{A_{0i}(i)} - \beta^*_{A_{0i}(i)}\right) \to N\!\left(0,\; \mathrm{var}(\varepsilon_i)\big(E(Z_{A_{0i}}^TZ_{A_{0i}})\big)^{-1}\right).$$
Since $\widehat{\mathrm{var}}(\varepsilon_i) = \frac{1}{n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 \to_p \mathrm{var}(\varepsilon_i)$, the continuous mapping theorem gives $\tilde\Omega_{ii} = \big(\frac{1}{n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2\big)^{-1} \to_p \Omega^*_{ii}$. Then
$$\sqrt{n}\left(\tilde\Omega_{A_{0i},i} - \Omega^*_{A_{0i},i}\right) = \sqrt{n}\left(-\tilde\Omega_{i,i}\hat\beta^{(o)}_{A_{0i}(i)} + \Omega^*_{i,i}\beta^*_{A_{0i}(i)}\right) = \sqrt{n}\left(\Omega^*_{i,i}\big(\beta^*_{A_{0i}(i)} - \hat\beta^{(o)}_{A_{0i}(i)}\big) + o_p(1)\cdot O_p(1/\sqrt{n})\right) \to N\!\left(0,\; \Omega^*_{i,i}\big(E(Z_{A_{0i}}^TZ_{A_{0i}})\big)^{-1}\right).$$
Since the columnwise oracle estimator $\hat\omega^{(o)}_{ij}$ is $\tilde\omega_{ij}$ or $\tilde\omega_{ji}$, whichever has the smaller absolute value, $\hat\omega^{(o)}_{ij}$ is also a $\sqrt{n}$-consistent estimator of $\omega^*_{ij} = \omega^*_{ji}$ for $j \in A_{0i}$.
We specify the regularity conditions.
(A1) Sparse model: $\Omega^* \in \mathcal{M}_1(L, \nu_0, d)$, where
$$\mathcal{M}_1(L, \nu_0, d) = \left\{A \succ 0 : \|A\|_1 < L,\; \nu_0^{-1} < \phi_{\min}(A) < \phi_{\max}(A) < \nu_0,\; \deg(A) < d\right\},$$
with $L, \nu_0 > 1$, $\|A\|_1 = \max_j\sum_{i=1}^p|a_{ij}|$, and $\deg(A) = \max_i\sum_j I(A_{ij} \neq 0)$.
(A2) $\eta_{\min}(Z_{A_{0i}}^TZ_{A_{0i}}) > 0$ for all $i$, where $\eta_{\min}(B)$ is the minimum eigenvalue of $B$.
56
The following theorems show that the osLLA estimator and the TSDS estimator are equivalent to the columnwise oracle estimator with high probability.

Theorem 1. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Denote the vector of tuning parameters for the columns by $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_p)^T$. For an initial estimate $\Omega^{init}$, let
$$F_{n0} = \left\{\max_{j\neq i}\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| < a_0\lambda_i,\; i = 1, \ldots, p\right\},$$
where $a_0 = \min(a_2, 1)$, and
$$F_{n1} = \bigcap_{i=1}^p\left\{\left|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\right| \leq P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right),\; \forall j \neq i\right\},$$
where $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$. Under the event $F_{n0} \cap F_{n1}$, the osLLA estimator $\hat\Omega^{osLLA}(\Omega^{init}, \boldsymbol\lambda)$ is equivalent to the columnwise oracle estimator $\hat\Omega^{(o)}$ of $\Omega^*$.
Proof of Theorem 1. We can directly apply the theoretical result of the osLLA for regression. Under the event $F_{n0}$, $P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big) = 0$ for $j \in A_{0i}$ and $P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big) \geq a_1\lambda_i$ for $j \in A_{0i}^c$. Hence for each $i$,
$$\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \operatorname*{argmin}_{\beta_{(i)}}\; \frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j \in A_{0i}^c} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|.$$
By convexity of $\|Z_i - Z_{-i}\beta_{(i)}\|^2$,
$$\|Z_i - Z_{-i}\beta_{(i)}\|^2 \geq \|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 - 2\sum_j Z_j^T\big(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\big)\big(\beta_{j(i)} - \hat\beta^{(o)}_{j(i)}\big) = \|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 - 2\sum_{j \in A_{0i}^c} Z_j^T\big(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\big)\beta_{j(i)},$$
where the second equality uses $\hat\beta^{(o)}_{j(i)} = 0$ for $j \in A_{0i}^c$ and the normal equations $Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}) = 0$ for $j \in A_{0i}$. Under the event $F_{n1}$, for each $i$,
$$\left\{\frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j \in A_{0i}^c} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|\right\} - \left\{\frac{1}{2n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 + \sum_{j \in A_{0i}^c} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\hat\beta^{(o)}_{j(i)}|\right\}$$
$$\geq \sum_{j \in A_{0i}^c}\left\{P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right) - \frac{1}{n}Z_j^T\big(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\big)\cdot\mathrm{sign}(\beta_{j(i)})\right\}|\beta_{j(i)}| \geq 0,$$
where the last inequality follows from the definition of $F_{n1}$, since $\big|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\big| \leq P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big)$ for all $j \neq i$. Equality holds only if $\beta_{j(i)} = 0$ for all $j \in A_{0i}^c$, and the oracle estimator $\hat\beta^{(o)}_{(i)}$ is the minimizer of $\|Z_i - Z_{-i}\beta_{(i)}\|^2$ among such vectors. Hence $\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \hat\beta^{(o)}_{(i)}$ for each $i$, and then $\hat\Omega^{osLLA}(\Omega^{init}, \boldsymbol\lambda) = \hat\Omega^{(o)}$.
Theorem 2. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Denote the vector of tuning parameters for the columns by $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_p)^T$. For an initial estimate $\Omega^{init}$, let $F_{n0}$ and $F_{n1}$ be the events defined in Theorem 1, with $a_0 = \min(a_2, 1)$ and $\hat\beta^{(o)}_{(i)}$ the oracle estimator of $\beta^*_{(i)}$. Under the event $F_{n0} \cap F_{n1}$, the TSDS estimator $\hat\Omega^{TSDS}(\Omega^{init}, \boldsymbol\lambda)$ is equivalent to the columnwise oracle estimator $\hat\Omega^{(o)}$ of $\Omega^*$.

Proof of Theorem 2. Under the event $F_{n0}$, $P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big) = 0$ for $j \in A_{0i}$ and $P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big) \geq a_1\lambda_i$ for $j \in A_{0i}^c$. Under the event $F_{n1}$, $\hat\beta^{(o)}_{(i)}$ is in the feasible set of the two stage Dantzig selector. Under the event $F_{n0} \cap F_{n1}$, the minimizer $\hat\beta^{TSDS}_{(i)}$ of (3.3) must be the oracle estimator $\hat\beta^{(o)}_{(i)}$: the objective penalizes only the coordinates in $A_{0i}^c$ with positive weights, so any minimizer has $\beta_{j(i)} = 0$ for $j \in A_{0i}^c$, and for $j \in A_{0i}$ the constraints in (3.3) have zero right-hand side and force the normal equations $\frac{1}{n}Z_j^T(Z_i - Z_{A_{0i}}\beta_{A_{0i}(i)}) = 0$, whose unique solution under (A2) is the oracle least squares estimator.
Recall $A_{0i} = \{j : \omega^*_{ji} \neq 0, j \neq i\}$ and let $s_i = |A_{0i}|$. The following corollaries assert that the CLIME estimator (Cai et al., 2011) can be a good initial estimator with which the two stage methods achieve the columnwise oracle estimator. Cai et al. (2011) showed that $\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty \leq 4L\lambda^{clime}$ with probability at least $1 - \Pr\big(\max_{ij}|S_{ij} - \Sigma^*_{ij}| > \lambda^{clime}/L\big)$, where $L = \|\Omega^*\|_1 = \max_j\sum_{i=1}^p|\omega^*_{ij}|$ and $\lambda^{clime} = C_0L\sqrt{\log p/n}$. We can use a large deviation result such as Lemma 3 of Bickel and Levina (2008) and Fan et al. (2012): under the regularity condition (A1),
$$\Pr(|S_{ij} - \Sigma^*_{ij}| \geq \delta) \leq C_1\exp(-C_2n\delta^2),$$
where $C_1$ and $C_2$ depend on $\nu_0$ in the regularity condition (A1). Hence
$$\Pr\left(\max_{ij}|S_{ij} - \Sigma^*_{ij}| \geq \frac{\lambda^{clime}}{L}\right) \leq C_1\exp\left(-C_2\frac{n(\lambda^{clime})^2}{L^2}\right).$$
Lemma 1. Let $\Omega^{init}$ be an initial estimator and $\Omega^* = (\omega^*_{ij})_{1\leq i,j\leq p}$ be the true precision matrix. Define $A_{0i} = \{j : \omega^*_{ij} \neq 0, j \neq i\}$ for $i = 1, \ldots, p$, and define $a_0 = \min\{1, a_2\}$, where $a_2$ is defined in the penalty conditions (P1)-(P3). For each $i = 1, \ldots, p$, if
$$\|\Omega^{init} - \Omega^*\|_\infty < a_0\lambda_i\,|\omega^{init}_{ii}|\left(\max_{j \in A_{0i}}\left(\left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right| + 1\right)\right)^{-1},$$
then
$$\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| < a_0\lambda_i.$$
Proof of Lemma 1.
$$\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| = \left|\frac{\omega^*_{ii}\omega^{init}_{ji} - \omega^{init}_{ii}\omega^*_{ji}}{\omega^{init}_{ii}\omega^*_{ii}}\right| \leq \frac{(|\omega^*_{ii}| + |\omega^*_{ji}|)\,\|\Omega^{init} - \Omega^*\|_\infty}{|\omega^{init}_{ii}\omega^*_{ii}|} = \left(1 + \left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right|\right)\frac{\|\Omega^{init} - \Omega^*\|_\infty}{|\omega^{init}_{ii}|} < a_0\lambda_i.$$
Let $A_0 = \{(i, j) : \Omega^*_{ij} \neq 0,\, i \neq j\} = \cup_i\{(i, j) : j \in A_{0i}\}$.
Corollary 1. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold.
(1) Suppose that for $i = 1, \ldots, p$, $\min_{j \in A_{0i}}\big|\omega^*_{ji}/\omega^*_{ii}\big| > (a + 1)\lambda_i$ and
$$\lambda_i > \max\left(\frac{1}{a_0}\,\frac{1}{\omega^{init}_{ii}}\max_{j \in A_{0i}}\left(\left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right| + 1\right)\cdot 4L\lambda^{clime},\;\; \frac{2}{a_1}\sqrt{\frac{\log p}{n}\max_i\omega^{*-1}_{ii}M}\right),$$
where $a_0 = \min(1, a_2)$, $\omega^{init}_{ii} = \hat\omega^{clime}_{ii}(\lambda^{clime})$, $\lambda^{clime} = C_0L\sqrt{\log p/n}$ for some $C_0 > 0$, and $M = \max_j\|Z_j\|_2^2/n$. Then
$$\Pr\left(\hat\Omega^{osLLA}(\hat\Omega^{clime}(\lambda^{clime}), \boldsymbol\lambda) = \hat\Omega^{(o)}\right) \geq 1 - p_0 - p_1,$$
where
$$p_1 = \Pr(F_{n1}^c) \leq 2\left(p(p-1) - \sum_{j=1}^p s_j\right)\exp\left(-\frac{na_1^2\min_i\lambda_i^2}{2M\sigma^2}\right), \quad \text{with } \sigma^2 = \max_i\omega^{*-1}_{ii},$$
and
$$p_0 = \Pr\left(\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty > 4L\lambda^{clime}\right) \leq C_1\exp\left(-C_2\frac{n(\lambda^{clime})^2}{L^2}\right).$$
(2) If $n\min_i\lambda_i^2 \to \infty$, $n(\lambda^{clime})^2 \to \infty$ and $\log p = o(n\min_i\lambda_i^2)$, then
$$\Pr\left(\hat\Omega^{osLLA}(\hat\Omega^{clime}(\lambda^{clime}), \boldsymbol\lambda) = \hat\Omega^{(o)}\right) \to 1 \quad \text{as } n \to \infty.$$
Corollary 2. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Under the same conditions as in Corollary 1, the same conclusions hold for the TSDS estimator:
(1) $\Pr\big(\hat\Omega^{TSDS}(\hat\Omega^{clime}(\lambda^{clime}), \boldsymbol\lambda) = \hat\Omega^{(o)}\big) \geq 1 - p_0 - p_1$, with $p_0$ and $p_1$ as in Corollary 1.
(2) If $n\min_i\lambda_i^2 \to \infty$, $n(\lambda^{clime})^2 \to \infty$ and $\log p = o(n\min_i\lambda_i^2)$, then
$$\Pr\left(\hat\Omega^{TSDS}(\hat\Omega^{clime}(\lambda^{clime}), \boldsymbol\lambda) = \hat\Omega^{(o)}\right) \to 1 \quad \text{as } n \to \infty.$$
Proof of Corollary 1 and Corollary 2. These results follow from Theorem 1 and Theorem 2. To bound $\Pr(F_{n0}^c)$, we use
$$\Pr\left(\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty \geq 4L\lambda^{clime}\right) \leq C_1\exp\left(-C_2\frac{n(\lambda^{clime})^2}{L^2}\right)$$
together with Lemma 1. $\Pr(F_{n1}^c)$ can be bounded in a similar way as in the linear regression case. The difference between regression and precision matrix estimation is the variance of $\varepsilon_i$: in precision matrix estimation, $\mathrm{var}(\varepsilon_i) = \Sigma^*_{ii} - \Sigma^*_{i,-i}(\Sigma^*_{-i,-i})^{-1}\Sigma^*_{-i,i} = \omega^{*-1}_{ii}$, hence it is bounded. Let $\boldsymbol\varepsilon_i = (\varepsilon_{1i}, \ldots, \varepsilon_{ni})^T$ be the vector of iid samples from $N(0, \mathrm{var}(\varepsilon_i))$. Then
$$\Pr(F_{n1}^c) \leq \sum_{i=1}^p\sum_{j \in A_{0i}^c}\Pr\left(\left|\frac{1}{n}Z_j^T(I_n - H_{A_{0i}})\boldsymbol\varepsilon_i\right| > P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)\right) \leq 2\sum_{i=1}^p(p - 1 - s_i)\exp\left(-\frac{n\min_{j\in A_{0i}^c}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)^2}{2\omega^{*-1}_{ii}M}\right)$$
$$\leq 2\sum_{i=1}^p(p - 1 - s_i)\exp\left(-\frac{na_1^2\lambda_i^2}{2\omega^{*-1}_{ii}M}\right) \leq 2\left(p(p-1) - \sum_{i=1}^ps_i\right)\exp\left(-\frac{na_1^2\min_i\lambda_i^2}{2\max_i\omega^{*-1}_{ii}M}\right),$$
because
$$\Pr(|a^T\boldsymbol\varepsilon_i| > t) \leq 2\exp\left(-\frac{t^2}{2\omega^{*-1}_{ii}\|a\|_2^2}\right) \quad \forall t \geq 0$$
and
$$\left\|\frac{1}{n}Z_j^T(I_n - H_{A_{0i}})\right\|_2^2 \leq \frac{\|Z_j\|_2^2}{n^2}\,\lambda_{\max}(I_n - H_{A_{0i}}) \leq M/n \quad \forall j \in A_{0i}^c.$$
The LASSO or Dantzig selector estimate of $\Omega_{-ii}$ can also be a good initial estimate, based on the $\ell_2$ bounds of Bickel et al. (2009). Denote by $\otimes$ the Kronecker product. If the Hessian of the log-likelihood $\Gamma^*_{p^2\times p^2} = \Omega^{*-1}\otimes\Omega^{*-1}$ satisfies the incoherence (or irrepresentable) condition and some regularity conditions hold, an elementwise $\ell_\infty$ bound of the graphical lasso estimator is $\|\hat\Omega^{glasso} - \Omega^*\|_\infty = O\big(\sqrt{\log p/n}\big)$ (Ravikumar et al., 2011), where the incoherence (or irrepresentable) condition is that there exists $\alpha \in (0, 1]$ such that $\max_{e \in A^c}\|\Gamma^*_{eA}(\Gamma^*_{AA})^{-1}\|_1 \leq 1 - \alpha$ with $A = \cup_iA_{0i}$. Hence the glasso estimate $\hat\Omega^{glasso}$ can also be a good initial estimate of $\Omega^*$.
63
3.3 Numerical analyses
We conduct two simulation studies and one real data analysis. The two simulation settings are the same as in Fan et al. (2012). The real data analysis is a classification problem using linear discriminant analysis (LDA), which requires estimation of the precision matrix.
3.3.1 Simulations
We simulate $n$ independent random vectors from $N_p(0, \Sigma^*)$ with a sparse precision matrix $\Omega^* = (\Sigma^*)^{-1}$. We consider two different sparsity patterns of $\Omega^*$.
Example 1. $\Omega^*$ is a tridiagonal matrix, obtained by constructing $\Sigma^* = (\sigma^*_{ij})$ with $\sigma^*_{ij} = \exp(-|c_i - c_j|)$ for $c_1 < \cdots < c_p$, where the increments $c_p - c_{p-1}, \ldots, c_2 - c_1$ are generated independently from Unif(0.5, 1).
Example 2. $\Omega^* = UU + I$, where $U = (u_{ij})_{p\times p}$ has zero diagonals and exactly $p$ nonzero off-diagonal entries. The nonzero entries are generated by $u_{ij} = t_{ij}c_{ij}$, where the $t_{ij}$'s are generated independently from Unif(1, 2) and the $c_{ij}$'s are independent random signs with $\Pr(c_{ij} = 1) = \Pr(c_{ij} = -1) = 0.5$.
We also generate an independent validation set of sample size n to tune
each estimator. In our simulation we let n = 100 and p = 100 or p = 200.
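For reference, the following is a minimal numpy sketch of the Example 1 data generation; the function name and seed handling are illustrative, and Example 2 is omitted because its construction is not fully specified here.

```python
import numpy as np

def example1_data(n, p, seed=None):
    """Draw n samples from N_p(0, Sigma*) with Sigma*_{ij} = exp(-|c_i - c_j|),
    where the increments of c are iid Unif(0.5, 1); Omega* = inv(Sigma*) is tridiagonal."""
    rng = np.random.default_rng(seed)
    c = np.concatenate([[0.0], np.cumsum(rng.uniform(0.5, 1.0, size=p - 1))])
    Sigma = np.exp(-np.abs(c[:, None] - c[None, :]))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X, np.linalg.inv(Sigma)
```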
We compute the $\ell_1$ penalized Gaussian likelihood estimator, denoted by glasso, using the popular R package glasso (Friedman et al., 2013). CLIME (Cai et al., 2011) is computed with the R package clime (Cai et al., 2012). We use GSCAD to denote the one step SCAD penalized Gaussian likelihood estimator with the CLIME initial estimate proposed by Fan et al. (2012). These likelihood based approaches are tuned by minimizing the validation error $-\log\det\hat\Omega + \mathrm{trace}(\hat\Omega S^{val})$, where $\hat\Omega$ is the generic estimator and $S^{val}$ is the sample covariance of the validation set. Denote by MB the columnwise $\ell_1$ penalized regression proposed by Meinshausen and Buhlmann (2006). We conducted our two stage methods with the LASSO and the Dantzig selector using two different initial estimators, glasso and CLIME. Denote by clime+LS the osLLA with the CLIME initial estimator and by clime+DS the TSDS with the CLIME initial estimator. With the glasso initial, the osLLA and TSDS are denoted by glasso+LS and glasso+DS, respectively. MB and the two stage methods are tuned columnwise by minimizing the validation error $\|Z^{val}_i - Z^{val}_{-i}\hat\beta_{(i)}(\lambda_i)\|^2/n$.
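As a small illustration of the two tuning criteria just described, here is a sketch of the validation losses (names are illustrative):

```python
import numpy as np

def likelihood_val_loss(Omega, S_val):
    """Validation error  -log det(Omega) + trace(Omega S_val)  for the likelihood based methods."""
    _, logdet = np.linalg.slogdet(Omega)
    return -logdet + np.trace(Omega @ S_val)

def columnwise_val_loss(Z_val, i, beta_i):
    """Columnwise validation error  ||Z_i - Z_{-i} beta_(i)||^2 / n  for MB and the two stage methods."""
    others = [j for j in range(Z_val.shape[1]) if j != i]
    resid = Z_val[:, i] - Z_val[:, others] @ beta_i
    return np.sum(resid ** 2) / Z_val.shape[0]
```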
For each model, we generated 100 independent datasets, each consisting of $n$ training samples and $n$ validation samples. Estimation accuracy is measured by the average Frobenius norm loss $\|\hat\Omega - \Omega^*\|_F$, the average matrix $\ell_1$ norm $\|\hat\Omega - \Omega^*\|_1$, and the average spectral norm $\|\hat\Omega - \Omega^*\|_2$ over the 100 replications, where $\|A\|_F = \sqrt{\sum_{i,j}a_{ij}^2}$, $\|A\|_1 = \max_{1\leq j\leq q}\sum_{i=1}^p|a_{ij}|$, and $\|A\|_2 = \sup_{|x|\leq 1}|Ax|_2$ for a matrix $A = (a_{ij}) \in \mathbb{R}^{p\times q}$. The selection accuracy is evaluated by the average edge proportions of false positives (FP) and false negatives (FN), the sensitivity, and the specificity over the 100 replications. The average number of estimated edges is also reported. We plot the ROC curve with the average sensitivity and specificity for each method, and we plot the average Frobenius norm and spectral norm against the number of edges for each method. For the two stage methods, we consider two settings: the same tuning parameter over all columns and columnwise different tuning parameters. The simulation results are summarized in Figures 3.1-3.8 and Tables 3.1-3.8. We run the two stage methods with several glasso and CLIME initial estimates over a sequence of tuning parameters, and we use over-edged initial estimates, which have more edges than the optimal glasso or CLIME estimates selected by the validation error. We summarize the best results of our proposed methods in Figures 3.1-3.8 and Tables 3.1-3.8. The selection results of our proposed methods outperform the others. The two stage methods with the glasso initial tend to perform better than those with the CLIME initial. In Example 2, our proposed methods achieve the best finite sample performance in both estimation and selection.
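The edge-selection summaries in the tables can be computed as follows; this is a sketch under one plausible reading of the FP and FN columns (false positives as a fraction of selected edges, false negatives as a fraction of non-selected pairs), which is an assumption rather than a definition given in the text.

```python
import numpy as np

def edge_accuracy(Omega_hat, Omega_true, tol=1e-8):
    """Edge counts and selection accuracy from estimated and true precision matrices."""
    p = Omega_true.shape[0]
    iu = np.triu_indices(p, k=1)                       # each undirected edge counted once
    est = np.abs(Omega_hat[iu]) > tol
    true = np.abs(Omega_true[iu]) > tol
    tp = np.sum(est & true); fp = np.sum(est & ~true)
    fn = np.sum(~est & true); tn = np.sum(~est & ~true)
    return {"edges": int(est.sum()),
            "FP": fp / max(est.sum(), 1), "FN": fn / max((~est).sum(), 1),
            "sensitivity": tp / max(true.sum(), 1), "specificity": tn / max((~true).sum(), 1)}
```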
66
Figure 3.1: ROC curve of Example 1 (p=100, q=99). Sensitivity (×100) versus (1 − specificity) (×100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
67
Table 3.1: Example 1 (p=100, q=99)
Methods Edge FP FN Sensitivity Specificity
glasso 1025.655 0.9034 1.00E-04 0.9976 0.8089
gSCAD-osLLA 902.96 0.8839 2.00E-04 0.9917 0.8341
CLIME 586.67 0.8293 2.00E-04 0.9904 0.8993
MB 288.49 0.6604 3.00E-04 0.9852 0.9606
MB(same) 291.27 0.6618 3.00E-04 0.9867 0.9601
clime+LS 281 0.6522 4.00E-04 0.9821 0.9621
clime+DS 283.57 0.6554 4.00E-04 0.9821 0.9616
glasso+LS 134.02 0.3 0.0011 0.9441 0.9916
glasso+DS 133.35 0.2963 0.0011 0.9441 0.9918
clime+LS(same) 290.92 0.6612 3.00E-04 0.9862 0.9602
clime+DS(same) 277.24 0.6452 3.00E-04 0.9837 0.9629
glasso+LS(same) 140.96 0.3217 8.00E-04 0.9598 0.9905
glasso+DS(same) 140.36 0.3185 8.00E-04 0.9593 0.9906
68
Figure 3.2: ‖Ω − Ω∗‖ of Example 1 (p=100, q=99). Panels: (a) Frobenius norm and (b) spectral norm versus the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
69
Table 3.2: ‖Ω −Ω∗‖ of Example 1 (p=100, q=99)
Methods Edge Frob l1 l2
glasso 1025.655 6.6562 2.667 1.5839
gSCAD-osLLA 902.96 4.3703 1.8354 1.1696
CLIME 586.67 5.7808 2.0251 1.3744
MB 288.49 6.136 2.6453 1.9252
MB(same) 291.27 4.7942 1.7009 1.1407
clime+LS 281 6.5012 3.0927 2.335
clime+DS 283.57 6.5829 3.0539 2.2887
glasso+LS 134.02 4.6925 1.9878 1.4721
glasso+DS 133.35 4.6922 2.0082 1.4868
clime+LS(same) 290.92 4.9179 1.786 1.2587
clime+DS(same) 277.24 4.9048 1.748 1.2448
glasso+LS(same) 140.96 4.5504 1.8529 1.3569
glasso+DS(same) 140.36 4.5491 1.8487 1.3571
70
Figure 3.3: ROC curve of Example 1 (p=200, q=199). Sensitivity (×100) versus (1 − specificity) (×100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
71
Table 3.3: Example 1 (p=200, q=199)
Methods Edge FP FN Sensitivity Specificity
glasso 2657.33 0.925 0 0.9982 0.8752
gSCAD-osLLA 1729.86 0.8809 1.00E-04 0.9919 0.9222
CLIME 960.83 0.7932 2.00E-04 0.9818 0.9611
MB 604.36 0.6765 2.00E-04 0.9798 0.9792
MB(same) 620.22 0.6818 2.00E-04 0.9825 0.9784
clime+LS 407.61 0.5389 6.00E-04 0.9403 0.9888
clime+DS 407.12 0.5377 6.00E-04 0.9414 0.9888
glasso+LS 586.49 0.6673 2.00E-04 0.9777 0.9801
glasso+DS 608.99 0.6792 2.00E-04 0.9783 0.979
clime+LS(same) 435 0.5578 4.00E-04 0.9597 0.9876
clime+DS(same) 429.43 0.552 4.00E-04 0.9587 0.9879
glasso+LS(same) 599.01 0.6725 2.00E-04 0.9816 0.9795
glasso+DS(same) 585.45 0.6651 2.00E-04 0.9808 0.9802
72
Figure 3.4: ‖Ω − Ω∗‖ of Example 1 (p=200, q=199). Panels: (a) Frobenius norm and (b) spectral norm versus the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
73
Table 3.4: ‖Ω −Ω∗‖ of Example 1 (p=200, q=199)
Methods Edge Frob l1 l2
glasso 2657.33 10.9083 3.0856 1.7999
gSCAD-osLLA 1729.86 7.015 2.2085 1.3832
CLIME 960.83 9.4415 2.1993 1.5915
MB 604.36 10.4144 3.905 3.0842
MB(same) 620.22 7.32 1.8644 1.2588
clime+LS 407.61 10.599 4.0213 3.1616
clime+DS 407.12 10.878 4.2165 3.3455
glasso+LS 586.49 11.1297 4.5991 3.7215
glasso+DS 608.99 11.8905 4.6239 3.7287
clime+LS(same) 435 8.1239 2.2008 1.7033
clime+DS(same) 429.43 8.1794 2.2122 1.7129
glasso+LS(same) 599.01 7.2486 1.854 1.267
glasso+DS(same) 585.45 7.2326 1.8454 1.2634
74
Figure 3.5: ROC curve of Example 2 (p=100, q=59). Sensitivity (×100) versus (1 − specificity) (×100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
75
Table 3.5: Example 2 (p=100, q=59)
Methods Edge FP FN Sensitivity Specificity
glasso 1025.655 0.9034 1.00E-04 0.9976 0.8089
gSCAD-osLLA 902.96 0.8839 2.00E-04 0.9917 0.8341
CLIME 586.67 0.8293 2.00E-04 0.9904 0.8993
MB 288.49 0.6604 3.00E-04 0.9852 0.9606
MB(same) 291.27 0.6618 3.00E-04 0.9867 0.9601
clime+LS 281 0.6522 4.00E-04 0.9821 0.9621
clime+DS 283.57 0.6554 4.00E-04 0.9821 0.9616
glasso+LS 134.02 0.3 0.0011 0.9441 0.9916
glasso+DS 133.35 0.2963 0.0011 0.9441 0.9918
clime+LS(same) 290.92 0.6612 3.00E-04 0.9862 0.9602
clime+DS(same) 277.24 0.6452 3.00E-04 0.9837 0.9629
glasso+LS(same) 140.96 0.3217 8.00E-04 0.9598 0.9905
glasso+DS(same) 140.36 0.3185 8.00E-04 0.9593 0.9906
76
Figure 3.6: ‖Ω − Ω∗‖ of Example 2 (p=100, q=59). Panels: (a) Frobenius norm and (b) spectral norm versus the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
77
Table 3.6: ‖Ω −Ω∗‖ of Example 2 (p=100, q=59)
Methods Edge Frob l1 l2
glasso 1025.655 6.6562 2.667 1.5839
gSCAD-osLLA 902.96 4.3703 1.8354 1.1696
CLIME 586.67 5.7808 2.0251 1.3744
MB 288.49 6.136 2.6453 1.9252
MB(same) 291.27 4.7942 1.7009 1.1407
clime+LS 281 6.5012 3.0927 2.335
clime+DS 283.57 6.5829 3.0539 2.2887
glasso+LS 134.02 4.6925 1.9878 1.4721
glasso+DS 133.35 4.6922 2.0082 1.4868
clime+LS(same) 290.92 4.9179 1.786 1.2587
clime+DS(same) 277.24 4.9048 1.748 1.2448
glasso+LS(same) 140.96 4.5504 1.8529 1.3569
glasso+DS(same) 140.36 4.5491 1.8487 1.3571
78
Figure 3.7: ROC curve of Example 2 (p=200, q=92). Sensitivity (×100) versus (1 − specificity) (×100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
79
Table 3.7: Example 2 (p=200, q=92)
Methods Edge FP FN Sensitivity Specificity
glasso 2263.27 0.968 0.0011 0.7851 0.8894
gSCAD-osLLA 536.81 0.8786 0.0016 0.6573 0.976
CLIME 242.06 0.7012 0.0011 0.7698 0.9914
MB 272.85 0.732 0.001 0.7914 0.9899
MB(same) 378.18 0.878 0.0024 0.4965 0.9832
clime+LS 102 0.2251 7.00E-04 0.8561 0.9988
clime+DS 102 0.2252 7.00E-04 0.8561 0.9988
glasso+LS 104.46 0.2115 5.00E-04 0.8932 0.9989
glasso+DS 104.38 0.2109 5.00E-04 0.8932 0.9989
clime+LS(same) 117.73 0.389 0.0011 0.7559 0.9976
clime+DS(same) 117.74 0.389 0.0011 0.7559 0.9976
glasso+LS(same) 87.02 0.1569 0.001 0.7924 0.9993
glasso+DS(same) 86.93 0.1563 0.001 0.7921 0.9993
80
Figure 3.8: ‖Ω − Ω∗‖ of Example 2 (p=200, q=92). Panels: (a) Frobenius norm and (b) l2 (spectral) norm versus the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
81
Table 3.8: ‖Ω −Ω∗‖ of Example 2 (p=200, q=92)
Methods Edge Frob l1 l2
glasso 2263.27 37.8666 13.3819 9.5821
gSCAD-osLLA 536.81 21.8945 9.7219 7.0486
CLIME 242.06 28.311 11.4089 8.0062
MB 272.85 22.7018 8.2783 5.3505
MB(same) 378.18 30.7944 12.2013 8.1922
clime+LS 102 17.8328 9.0719 6.2429
clime+DS 102 17.8365 9.078 6.2441
glasso+LS 104.46 16.2034 8.0195 5.7519
glasso+DS 104.38 16.2091 8.0138 5.7561
clime+LS(same) 117.73 20.4286 9.2521 6.092
clime+DS(same) 117.74 20.4287 9.2521 6.092
glasso+LS(same) 87.02 18.1129 8.0463 5.6268
glasso+DS(same) 86.93 18.117 8.0515 5.633
82
3.3.2 Real data analysis
We apply our two stage methods to the breast cancer data set which was analyzed by Hess et al. (2006) and is available at http://bioinformatics.mdanderson.org/. This dataset is also used in previous studies (Fan et al., 2009; Cai et al., 2011). The aim of this analysis is to compare the results of linear discriminant analysis (LDA) based on several regularization methods for sparse precision matrix estimation. The dataset contains 22,283 gene expression levels for 133 patients, 34 of whom achieved pathological complete response (pCR) while the others did not achieve pCR (a.k.a. residual disease (RD)). Since pCR after neoadjuvant (preoperative) chemotherapy indicates a high possibility of cancer free survival, it is of substantial interest to predict whether or not a patient will achieve pCR. In this study, LDA is utilized to classify a patient as pCR or RD. The precision matrix must be estimated before applying LDA. Fan et al. (2009) used penalized loglikelihood methods with the LASSO, adaptive LASSO, and SCAD penalties to estimate the precision matrix, and Cai et al. (2011) used the CLIME estimate of the precision matrix. We follow the same framework used by Fan et al. (2009) and Cai et al. (2011).
First, we randomly divide the dataset into training and testing sets. To maintain the proportions of pCR and RD each time, we use stratified sampling which randomly selects five subjects from pCR and 16 from RD to form the testing set; the remaining subjects are used as the training set. For each training set, we conduct a two-sample t-test between the two groups for each gene, and select the 113 most significant genes (i.e., those with the smallest p-values) as the covariates for prediction. Because the size of the training sample is n = 112, the covariates with size p = 113 allow us to examine the performance when p > n. Second, we standardize each gene level in these datasets by dividing by the corresponding estimated standard deviation from the training set. Finally, we conduct precision matrix estimation for each regularization method and apply it to LDA. According to the LDA framework, we assume that the normalized gene expression data are normally distributed as $N(\mu_k, \Sigma)$, where the two groups are assumed to have the same covariance matrix, $\Sigma$, but different means, $\mu_k$, with k = 1 for pCR and k = 2 for RD.
The LDA scores based on the estimated precision matrix $\hat\Omega$ are as follows,
$$\hat\delta_k(x) = x^T\hat\Omega\hat\mu_k - \frac{1}{2}\hat\mu_k^T\hat\Omega\hat\mu_k + \log\hat\pi_k,$$
where $\hat\pi_k = n_k/n$ is the proportion of subjects in group $k$ in the training set and $\hat\mu_k = \frac{1}{n_k}\sum_{i \in \text{group }k}x_i$ is the within-group mean vector in the training set. The classification rule is given by $\hat k(x) = \operatorname{argmax}_k\hat\delta_k(x)$ for $k = 1, 2$. To evaluate the classification performance based on the precision matrix estimates, we use the specificity, sensitivity, and Matthews correlation coefficient (MCC) criteria, defined as follows:
$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}, \qquad \text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}},$$
$$\text{MCC} = \frac{\text{TP}\times\text{TN} - \text{FP}\times\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}},$$
where TP, TN, FP, and FN are the numbers of true positives (pCR), true negatives (RD), false positives, and false negatives, respectively. We also compare the numbers of nonzero precision matrix elements among the same methods considered in the simulations, with the same tuning strategy. The results are reported in Table 3.9. The proposed two stage methods yield very sparse precision matrices while performing as well as or similarly to the other methods.
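Given any of the precision matrix estimates above, the LDA classification step can be sketched as follows; the function and argument names are illustrative.

```python
import numpy as np

def lda_predict(X_test, Omega, mus, pis):
    """Scores delta_k(x) = x' Omega mu_k - mu_k' Omega mu_k / 2 + log pi_k, then argmax over k."""
    scores = np.column_stack([
        X_test @ Omega @ mu - 0.5 * mu @ Omega @ mu + np.log(pi)
        for mu, pi in zip(mus, pis)
    ])
    return scores.argmax(axis=1)            # predicted group index for each test sample
```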
85
Table 3.9: Real Data (Breast Cancer)
Methods SP SN MCC #Edge
glasso 0.876(0.066) 0.404(0.186) 0.307(0.229) 1066.87(31.054)
gSCAD-osLLA 0.784(0.077) 0.682(0.201) 0.428(0.211) 731.37(50.934)
CLIME 0.737(0.068) 0.782(0.173) 0.457(0.18) 2282.18(371.52)
MB 0.677(0.074) 0.824(0.164) 0.433(0.169) 289.06(25.678)
glasso+DS 0.674(0.087) 0.824(0.161) 0.431(0.176) 221.50(22.474)
glasso+LS 0.674(0.088) 0.820(0.164) 0.428(0.179) 224.48(21.468)
clime+DS 0.666(0.072) 0.824(0.169) 0.422(0.17) 260.37(19.059)
clime+LS 0.669(0.077) 0.822(0.168) 0.424(0.165) 333.09(22.069)
86
3.4 Conclusion
In this paper, we focus on selection and the correct recovery of the support of the precision matrix. We propose a regression based approach which applies the two stage methods based on the LASSO or the Dantzig selector to columnwise estimation of the precision matrix. We prove that the proposed methods can correctly recover the support of the precision matrix and obtain a $\sqrt{n}$-consistent estimator of its nonzero elements with high probability under some regularity conditions. Numerical results show that our proposed methods outperform existing regularization methods, including glasso, gSCAD, and CLIME, in terms of estimation and especially support recovery of the precision matrix.
87
Chapter 4
Concluding remarks
In this thesis, we propose a two stage method based on the Dantzig selector, called the two stage Dantzig selector, for the high dimensional regression model. We prove that the two stage Dantzig selector satisfies the strong oracle property. Numerical results support our contention that the Dantzig selector used in our proposed method can improve variable selection and estimation compared with the LASSO. Furthermore, the two stage Dantzig selector outperforms other regularization methods including the LASSO, the Dantzig selector, the SCAD, and the MCP.
We also apply the two stage methods based on the LASSO or the Dantzig selector to sparse precision matrix estimation. We prove that the proposed methods can correctly recover the support of the precision matrix and obtain a $\sqrt{n}$-consistent estimator of its nonzero elements. For estimation of a sparse precision matrix, the two stage methods perform well in estimation and especially in support recovery.
89
Chapter 5
Appendix
5.1 Algorithms for Dantzig selector
There are several algorithms for the Dantzig selector. Recall the definition of the Dantzig selector,
$$\min_\beta\|\beta\|_1 \quad \text{subject to} \quad \left\|\frac{1}{n}X^T(y - X\beta)\right\|_\infty \leq \lambda. \qquad (5.1)$$
A standard way to solve (5.1) is to use linear programming (LP) techniques, because (5.1) can be recast as an LP problem. Candes and Romberg (2005) provided the $\ell_1$-magic package, which applies a primal-dual interior point method, one of the LP techniques, to the reformulated LP problem. This algorithm is known to be efficient when $X$ is sparse or can be efficiently transformed into a diagonal matrix, but it can be inefficient for large-scale problems because of the Newton step at each iteration (Wang and Yuan, 2012).
There are homotopy methods to compute the entire solution path of the Dantzig selector (Romberg, 2008; James et al., 2009), but they are also problematic for high dimensional data (Becker et al., 2011). As an effort to solve (5.1) efficiently in large-scale problems, first-order methods have been proposed (Lu, 2012; Becker et al., 2011). Lu et al. (2012) applied the alternating direction method (ADM), which has been widely used to solve large-scale problems, to (5.1), and its variations have been developed for large scale problems (Wang and Yuan, 2012).
We go into three representative algorithms for the Dantzig selector: the primal-dual interior point method (Candes and Romberg, 2005), the Dantzig selector with sequential optimization (DASSO) (James et al., 2009), and the alternating direction method (ADM) (Lu et al., 2012). We abstract the main algorithms for the Dantzig selector from these three papers.
5.1.1 Primal-dual interior point algorithm (Candes and Romberg, 2005)
The Dantzig selector can be recast as a linear program (LP). An LP is an optimization problem with a linear objective function and linear equality or inequality constraints. There are many solvers for LPs, such as the simplex method, the barrier method, and the primal-dual interior point method. Candes and Romberg (2005) introduced a primal-dual interior point method for LPs and second-order cone programs (SOCPs). Here we extract the algorithm for the Dantzig selector from Candes and Romberg (2005).
An equivalent linear program to (5.1) is given by
$$\min_{\beta, u}\;\sum_i u_i \quad \text{subject to} \quad \beta - u \leq 0, \quad -\beta - u \leq 0, \quad X^Tr - \lambda\mathbf{1} \leq 0, \quad -X^Tr - \lambda\mathbf{1} \leq 0,$$
where $r = X\beta - y$. Taking
$$f_{u_1} = \beta - u, \quad f_{u_2} = -\beta - u, \quad f_{\lambda_1} = X^Tr - \lambda\mathbf{1}, \quad f_{\lambda_2} = -X^Tr - \lambda\mathbf{1},$$
and $f = (f_{u_1}, f_{u_2}, f_{\lambda_1}, f_{\lambda_2})^T$, at the optimal point $(\beta^*, u^*)$ there exist dual vectors $\gamma^* = (\gamma^*_{u_1}, \gamma^*_{u_2}, \gamma^*_{\lambda_1}, \gamma^*_{\lambda_2})^T$, $\gamma^* \geq 0$, such that the following Karush-Kuhn-Tucker conditions are satisfied:
$$\gamma^*_{u_1} - \gamma^*_{u_2} + X^TX(\gamma^*_{\lambda_1} - \gamma^*_{\lambda_2}) = 0, \qquad \mathbf{1} - \gamma^*_{u_1} - \gamma^*_{u_2} = 0,$$
$$f_{u_1} \leq 0, \quad f_{u_2} \leq 0, \quad f_{\lambda_1} \leq 0, \quad f_{\lambda_2} \leq 0,$$
$$\gamma_{u_1,i}f_{u_1,i} = 0, \quad \gamma_{u_2,i}f_{u_2,i} = 0, \quad \gamma_{\lambda_1,i}f_{\lambda_1,i} = 0, \quad \gamma_{\lambda_2,i}f_{\lambda_2,i} = 0, \quad i = 1, \ldots, p.$$
92
The complementary slackness condition $\gamma_if_i = 0$ is relaxed in practice to
$$\gamma^{(k)}_if_i(\beta^{(k)}, u^{(k)}) = -1/\tau^{(k)}, \qquad (5.2)$$
with $\tau^{(k)}$ increasing through the iterations. The relaxed KKT conditions replace the complementary slackness condition with (5.2). The optimal solution $\beta^*$ of the primal-dual algorithm satisfies the relaxed KKT conditions along with the optimal dual vectors $\gamma^*$. The solution is obtained through the classical Newton method constrained to the interior region ($f_i(\beta^{(k)}, u^{(k)}) < 0$, $\gamma^{(k)}_i > 0$).
The dual and central residuals quantify how close a point $(\beta, u; \gamma_{u_1}, \gamma_{u_2}, \gamma_{\lambda_1}, \gamma_{\lambda_2})$ is to satisfying the KKT conditions with (5.2) in place of the slackness condition:
$$r_{dual} = \begin{pmatrix}\gamma_{u_1} - \gamma_{u_2} + X^TX(\gamma_{\lambda_1} - \gamma_{\lambda_2})\\ \mathbf{1} - \gamma_{u_1} - \gamma_{u_2}\end{pmatrix}, \qquad r_{cent} = -\Gamma f - (1/\tau)\mathbf{1},$$
where $\Gamma$ is a diagonal matrix with $(\Gamma)_{ii} = \gamma_i$. The Newton step is the solution to
$$\begin{pmatrix}X^TX\Sigma_aX^TX + \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12} & \Sigma_{11}\end{pmatrix}\begin{pmatrix}\Delta\beta\\ \Delta u\end{pmatrix} = \begin{pmatrix}-(1/\tau)\cdot\big(X^TX(-f_{\lambda_1}^{-1} + f_{\lambda_2}^{-1})\big) - f_{u_1}^{-1} + f_{u_2}^{-1}\\ -\mathbf{1} - (1/\tau)\cdot(f_{u_1}^{-1} + f_{u_2}^{-1})\end{pmatrix} =: \begin{pmatrix}w_1\\ w_2\end{pmatrix},$$
where
$$\Sigma_{11} = -\Gamma_{u_1}F_{u_1}^{-1} - \Gamma_{u_2}F_{u_2}^{-1}, \qquad \Sigma_{12} = \Gamma_{u_1}F_{u_1}^{-1} - \Gamma_{u_2}F_{u_2}^{-1}, \qquad \Sigma_a = -\Gamma_{\lambda_1}F_{\lambda_1}^{-1} - \Gamma_{\lambda_2}F_{\lambda_2}^{-1}.$$
Setting
$$\Sigma_\beta = \Sigma_{11} - \Sigma_{12}^2\Sigma_{11}^{-1},$$
we can eliminate
$$\Delta u = \Sigma_{11}^{-1}(w_2 - \Sigma_{12}\Delta\beta)$$
and solve
$$(X^TX\Sigma_aX^TX + \Sigma_\beta)\Delta\beta = w_1 - \Sigma_{12}\Sigma_{11}^{-1}w_2$$
for $\Delta\beta$. As before, the system is symmetric positive definite, and the conjugate gradient (CG) algorithm can be used to solve it.
Given $\Delta\beta$ and $\Delta u$, the step directions for the inequality dual variables are given by
$$\Delta\gamma_{u_1} = -\Gamma_{u_1}F_{u_1}^{-1}(\Delta\beta - \Delta u) - \gamma_{u_1} - (1/\tau)f_{u_1}^{-1},$$
$$\Delta\gamma_{u_2} = -\Gamma_{u_2}F_{u_2}^{-1}(-\Delta\beta - \Delta u) - \gamma_{u_2} - (1/\tau)f_{u_2}^{-1},$$
$$\Delta\gamma_{\lambda_1} = -\Gamma_{\lambda_1}F_{\lambda_1}^{-1}(X^TX\Delta\beta) - \gamma_{\lambda_1} - (1/\tau)f_{\lambda_1}^{-1},$$
$$\Delta\gamma_{\lambda_2} = -\Gamma_{\lambda_2}F_{\lambda_2}^{-1}(-X^TX\Delta\beta) - \gamma_{\lambda_2} - (1/\tau)f_{\lambda_2}^{-1},$$
where $F$ is a diagonal matrix with $(F)_{ii} = f_i$. With $(\Delta\beta, \Delta u, \Delta\gamma)$ we have a step direction. To choose the step length $0 < s \leq 1$, we ask that it satisfy two criteria:
1. $\beta + s\Delta\beta$, $u + s\Delta u$ and $\gamma + s\Delta\gamma$ are in the interior, i.e. $f_i(\beta + s\Delta\beta, u + s\Delta u) < 0$ and $\gamma_i > 0$ for all $i$.
2. The norm of the residuals has decreased sufficiently:
$$\|r_\tau(\beta + s\Delta\beta, u + s\Delta u, \gamma + s\Delta\gamma)\|_2 \leq (1 - \alpha s)\cdot\|r_\tau(\beta, u, \gamma)\|_2,$$
where $\alpha$ is a user-specified parameter (in all of our implementations, we have set $\alpha = 0.01$).
Since the $f_i$ are linear functionals, item 1 is easily addressed. We choose the maximum step size that just keeps us in the interior. Let
$$\mathcal{I}^+_f = \{i : \langle c_i, \Delta z\rangle > 0\}, \qquad \mathcal{I}^-_\gamma = \{i : \Delta\gamma_i < 0\},$$
where $z = (\beta, u)^T$ and $f_i = \langle c_i, z\rangle$, and set
$$s_{\max} = 0.99\cdot\min\big\{1,\; \{-f_i(z)/\langle c_i, \Delta z\rangle : i \in \mathcal{I}^+_f\},\; \{-\gamma_i/\Delta\gamma_i : i \in \mathcal{I}^-_\gamma\}\big\}.$$
Then, starting with $s = s_{\max}$, we check if item 2 above is satisfied; if not, we set $s' = \nu\cdot s$ and try again. We have taken $\nu = 1/2$ in all of our implementations.
When $r_{dual}$ is small, the surrogate duality gap $\eta = -f^T\gamma$ is an approximation to how close a certain $(\beta, u, \gamma)$ is to being optimal (i.e. $\langle c_0, z\rangle - \langle c_0, z^*\rangle \approx \eta$), where $\sum_iu_i = \langle c_0, z\rangle$. The primal-dual algorithm repeats the Newton iterations described above until $\eta$ has decreased below a given tolerance.
5.1.2 DASSO (James et al., 2009)
James et al. (2009) proposed a homotopy algorithm for the Dantzig selector named the Dantzig selector with sequential optimization (DASSO). DASSO constructs a piecewise linear solution path while identifying break points and solving the corresponding linear programs. DASSO is similar to the least angle regression and selection (LARS) algorithm (Efron et al., 2004), which is an efficient algorithm for the LASSO, hence its computational cost is comparable to that of LARS. We first describe the LARS algorithm and then go into the details of DASSO. The LARS algorithm is defined as follows.
LARS (Efron et al., 2004)
1. Initialize: $\beta = 0$, $A = \operatorname{argmax}_j|\nabla L(\beta)|_j$, $\gamma_A = -\mathrm{sgn}(\nabla L(\beta))_A$, $\gamma_{A^c} = 0$, where $L(\beta) = \sum_i(y_i - x_i^T\beta)^2$.
2. While ($\max|\nabla L(\beta)| > 0$):
(a) $d_1 = \min\{d > 0 : |\nabla L(\beta + d\gamma)_j| = |\nabla L(\beta + d\gamma)_A|,\; j \notin A\}$ and $d_2 = \min\{d > 0 : (\beta + d\gamma)_j = 0,\; j \in A\}$. Find the step length $d = \min(d_1, d_2)$.
(b) Take the step: $\beta \leftarrow \beta + d\gamma$.
(c) If $d = d_1$ then add the variable attaining equality at $d$ to $A$. If $d = d_2$ then remove the variable attaining 0 at $d$ from $A$.
(d) Calculate the new direction: $\gamma_A = (X_A^TX_A)^{-1}\mathrm{sgn}(\beta_A)$ and $\gamma_{A^c} = 0$.
The LARS procedure starts with all coefficients equal to zero and selects the variable most correlated with the response. LARS proceeds in the direction of this variable until some other variable has as much correlation with the current residual, and then adds this new variable to the set of selected variables. The direction is taken to be equiangular among the selected variables and changes whenever an addition or deletion happens. An addition occurs when the correlation of another variable with the current residual becomes equal to that of the selected variables. A deletion happens when one of the coefficients of the selected variables becomes zero while the LARS procedure continues along the current direction.
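As a quick illustration of the piecewise linear path that LARS produces, the following sketch uses scikit-learn's lars_path on simulated data; the data-generating choices are illustrative only.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.standard_normal(100)

# method="lar" gives the plain LARS path (no lasso modification)
alphas, active, coefs = lars_path(X, y, method="lar")
print(coefs.shape)   # (p, number of breakpoints): coefficients at each breakpoint of the path
```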
The DASSO algorithm also solves (5.1) sequentially by constructing a piecewise linear solution path. DASSO is defined as follows.
DASSO (James et al., 2009)
1. Initialize: $l = 1$, $\beta^l = 0$, $A = \operatorname{argmax}_j|X_j^T(y - X\beta^l)|$, $B = \{j : \beta^l_j \neq 0\} = \emptyset$, $c = X^T(y - X\beta^l)$, $\gamma_A = -\mathrm{sgn}(c_A)$, $\gamma_{A^c} = 0$, $s_A = \mathrm{sgn}(c_A)$.
2. While ($\max_j|X_j^T(y - X\beta^l)| > 0$):
(a) $d_1 = \min\{d > 0 : |X_j^T(y - X(\beta^l + d\gamma))| = |X_A^T(y - X(\beta^l + d\gamma))|,\; j \notin A\}$ and $d_2 = \min\{d > 0 : (\beta^l + d\gamma)_j = 0,\; j \in A\}$. Find the step length $d = \min(d_1, d_2)$.
(b) If $d = d_1$ then add the variable attaining equality at $d$ to $A$ and add the variable $j^*$ to $B$. If $d = d_2$ then remove the variable attaining 0 at $d$ from either $A$ or $B$.
(c) Calculate the new direction: $\gamma_A = (X_A^TX_B)^{-1}s_A$ and $\gamma_{A^c} = 0$.
(d) Take the step: $\beta^{l+1} \leftarrow \beta^l + d\gamma$.
(e) $l \leftarrow l + 1$.
98
The added variable $j^*$ and the distance are defined as follows.
• Define the added variable. Let $A$ be the $|A| \times (2p + |A|)$ matrix $A = (-s_AX_A^TX \;\; s_AX_A^TX \;\; I)$, and let $A_j = \begin{pmatrix}A_{j1}\\ A_{j2}\end{pmatrix}$ be the $j$th column of $A$, where $A_{j2}$ is a scalar. Let $B = \begin{pmatrix}B_1\\ B_2\end{pmatrix}$ consist of the columns of $A$ corresponding to the non-zero components of $\beta^+$ and $\beta^-$, where $B_1$ is a square matrix of dimension $|A| - 1$. Define
$$j^* = \operatorname*{argmax}_{j:\, q_j \neq 0,\; \alpha/q_j > 0}\;\big(\mathbf{1}^TB_1^{-1}A_{j1} - 1_{\{j \leq 2p\}}\big)\,|q_j|^{-1},$$
where $q_j = A_{j2} - B_2B_1^{-1}A_{j1}$ and $\alpha = B_2B_1^{-1}\mathbf{1} - 1$.
• Define the distance. Let the distance be $d = \min\{d_1, d_2\}$, where
$$d_1 = \min_{j \notin A}\left\{\frac{c_k - c_j}{(X_k - X_j)^TX\gamma},\; \frac{c_k + c_j}{(X_k + X_j)^TX\gamma}\right\}_+ \quad \text{for } k \in A, \qquad d_2 = \min_{j \in B}\left\{-\frac{\beta_j}{\gamma_j}\right\}.$$
This rule for adding a variable comes from the piecewise linearity of the solution path and the definition of the Dantzig selector; for more detail, see the appendix of James et al. (2009). The distance step is the same as in the LARS algorithm.
99
5.1.3 Alternating direction method (ADM) (Lu et al., 2012)
The ADM has recently been widely used to solve large-scale problems. The general problem to which the ADM applies has the following form,
$$\min_{x,y}\; f(x) + g(y) \quad \text{subject to} \quad Ax + By = b, \; x \in C_1, \; y \in C_2, \qquad (5.3)$$
where $f$ and $g$ are convex functions, $A$ and $B$ are matrices, $b$ is a vector, and $C_1$ and $C_2$ are closed convex sets. The ADM consists of two subproblems and a multiplier update. Under mild assumptions, it is known that the ADM converges to an optimal solution of (5.3) (Bertsekas and Tsitsiklis, 1989). The ADM formulation of the Dantzig selector is
$$\min_{\beta, z}\;\|\beta\|_1 \quad \text{subject to} \quad X^T(X\beta - y) - z = 0, \quad \|D^{-1}z\|_\infty \leq \lambda, \qquad (5.4)$$
where $D$ is the diagonal matrix whose diagonal elements are the norms of the columns of $X$. An augmented Lagrangian function for problem (5.4), for some $\mu > 0$, can be defined as
$$L_\mu(z, \beta, \eta) = \|\beta\|_1 + \eta^T(X^TX\beta - X^Ty - z) + \frac{\mu}{2}\|X^TX\beta - X^Ty - z\|_2^2.$$
ADM algorithm for the Dantzig selector
1. Initialize: let $\beta^0, \eta^0 \in \mathbb{R}^p$ and $\mu > 0$.
2. For $k = 0, 1, \ldots$:
$$z^{k+1} = \operatorname*{argmin}_{\|D^{-1}z\|_\infty \leq \lambda} L_\mu(z, \beta^k, \eta^k), \qquad \beta^{k+1} \in \operatorname*{argmin}_\beta L_\mu(z^{k+1}, \beta, \eta^k), \qquad \eta^{k+1} = \eta^k + \mu(X^TX\beta^{k+1} - X^Ty - z^{k+1}).$$
End (for)
We now go into the subproblems of the ADM. The dual problem of (5.4) is given by
$$\max_\eta\; d(\eta) := -y^TX\eta - \lambda\|D\eta\|_1 \quad \text{subject to} \quad \|X^TX\eta\|_\infty \leq 1.$$
The $z$-subproblem has a closed form solution:
$$z^{k+1} = \operatorname*{argmin}_{\|D^{-1}z\|_\infty \leq \lambda}\left\|z - \left(X^TX\beta^k - X^Ty + \frac{\eta^k}{\mu}\right)\right\|_2^2 = \min\left\{\max\left\{X^TX\beta^k - X^Ty + \frac{\eta^k}{\mu},\; -\lambda d\right\},\; \lambda d\right\},$$
where $d$ is the vector consisting of the diagonal entries of $D$ and the min and max are taken elementwise. For the second subproblem, we can choose $\beta^{k+1}$ as a solution of the following approximate subproblem,
$$\min_\beta\; \frac{\mu}{2}\left\|X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\right\|_2^2 + \|\beta\|_1.$$
This problem can be solved by the nonmonotone gradient method for nonsmooth minimization (Lu and Zhang, 2012). The general problem to which the nonmonotone gradient method applies can be written as
$$\min_{x \in \mathcal{X}}\; F(x) := f(x) + P(x),$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable, $P : \mathbb{R}^n \to \mathbb{R}$ is convex but not necessarily smooth, and $\mathcal{X} \subseteq \mathbb{R}^n$ is closed and convex. Here, $f(\beta) = \frac{\mu}{2}\|X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\|_2^2$ and $P(\beta) = \|\beta\|_1$, so that $\nabla f(\beta) = \mu X^TX\big(X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\big)$. The nonmonotone gradient method for solving the $\beta$ subproblem is defined as follows:
1. Initialize: $0 < \tau, \sigma < 1$, $0 < \underline{\alpha} < 1$ and an integer $M \geq 0$. Let $\beta^0$ be given and set $\alpha_0 = 1$.
2. For $l = 0, 1, \ldots$:
(a) Let $d^l = \mathrm{SoftThresh}(\beta^l - \alpha_l\nabla f(\beta^l), \alpha_l) - \beta^l$ and $\Delta^l = \nabla f(\beta^l)^Td^l + \|\beta^l + d^l\|_1 - \|\beta^l\|_1$.
(b) Find the largest $\alpha \in \{1, \tau, \tau^2, \ldots\}$ such that
$$f(\beta^l + \alpha d^l) + \|\beta^l + \alpha d^l\|_1 \leq \max_{[l-M]_+ \leq i \leq l}\big\{f(\beta^i) + \|\beta^i\|_1\big\} + \sigma\alpha\Delta^l.$$
Set $\alpha^l \leftarrow \alpha$, $\beta^{l+1} \leftarrow \beta^l + \alpha^ld^l$ and $l \leftarrow l + 1$.
(c) Update $\alpha_{l+1} = \min\big\{\max\big\{\tfrac{\|s^l\|^2}{(s^l)^Tg^l}, \underline{\alpha}\big\}, 1\big\}$, where $s^l = \beta^{l+1} - \beta^l$ and $g^l = \nabla f(\beta^{l+1}) - \nabla f(\beta^l)$.
End (for)
Here $\mathrm{SoftThresh}(v, \gamma) := \mathrm{sgn}(v)\max\{0, |v| - \gamma e\}$, applied componentwise, where $e$ is the vector of ones. For the specific termination rules used in the ADM, see Lu et al. (2012).
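The structure of the ADM iteration can be summarized in the following simplified sketch. It is not the algorithm of Lu et al. (2012): the beta-subproblem is solved approximately with plain proximal gradient (ISTA) steps instead of the nonmonotone gradient method above, and the iteration counts, step size rule, and names are illustrative assumptions.

```python
import numpy as np

def soft_thresh(v, t):
    """Componentwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dantzig_adm(X, y, lam, mu=1.0, n_outer=200, n_inner=50):
    """Simplified ADM sketch for problem (5.4)."""
    n, p = X.shape
    A = X.T @ X
    b = X.T @ y
    d = np.sqrt(np.diag(A))                          # column norms of X (diagonal of D)
    beta = np.zeros(p)
    eta = np.zeros(p)
    step = 1.0 / (mu * np.linalg.norm(A, 2) ** 2)    # 1 / Lipschitz constant of the smooth part
    for _ in range(n_outer):
        # z-update: closed-form projection onto the box |z_j| <= lam * d_j
        z = np.clip(A @ beta - b + eta / mu, -lam * d, lam * d)
        # beta-update: approximately minimize mu/2 ||A beta - c||^2 + ||beta||_1 by ISTA
        c = b + z - eta / mu
        for _ in range(n_inner):
            grad = mu * A @ (A @ beta - c)
            beta = soft_thresh(beta - step * grad, step)
        # multiplier update
        eta = eta + mu * (A @ beta - b - z)
    return beta
```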
102
Bibliography
Banerjee, O., El Ghaoui, L., and d’Aspremont, A. (2008). Model selection
through sparse maximum likelihood estimation for multivariate gaussian or
binary data. The Journal of Machine Learning Research, 9:485–516.
Becker, S. R., Candes, E. J., and Grant, M. C. (2011). Templates for convex
cone problems with applications to sparse signal recovery. Mathematical
Programming Computation, 3(3):165–218.
Bertsekas, D. and Tsitsiklis, J. (1989). Parallel and distributed computation.
Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance
matrices. The Annals of Statistics, pages 199–227.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of
lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732.
103
Breiman, L. (1996). Heuristics of instability and stabilization in model selec-
tion. The Annals of Statistics, 24(6):2350–2383.
Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach
to sparse precision matrix estimation. Journal of the American Statistical
Association, 106(494):594–607.
Cai, T. T., Liu, W., Luo, X. R., and Luo, M. X. R. (2012). Package ’clime’.
Candes, E. and Plan, Y. (2009). Near-ideal model selection by l1 minimization.
The Annals of Statistics, 37(5A):2145–2177.
Candes, E. and Romberg, J. (2005). l1-magic: Recovery of sparse signals via convex programming. URL: www.acm.caltech.edu/l1magic/downloads/l1magic.pdf, 4.
Candes, E. and Tao, T. (2005). Decoding by linear programming. Information
Theory, IEEE Transactions on, 51(12):4203–4215.
Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation
when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
Chen, J. and Chen, Z. (2008). Extended bayesian information criteria for
model selection with large model spaces. Biometrika, 95(3):759–771.
104
Dicker, L. (2010). Regularized Regression Methods for Variable Selection and
Estimation. Collections of the Harvard University Archives: Dissertations.
Harvard University.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle
regression. The Annals of Statistics, 32(2):407–499.
Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive
lasso and scad penalties. The Annals of Applied Statistics, 3(2):521.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Associ-
ation, 96(456):1348–1360.
Fan, J., Xue, L., and Zou, H. (2012). Strong oracle optimality of folded concave
penalized estimation. arXiv preprint arXiv:1210.5992.
Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemo-
metrics regression tools. Technometrics, 35(2):109–135.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance
estimation with the graphical lasso. Biostatistics, 9(3):432–441.
Friedman, J., Hastie, T., Tibshirani, R., and Tibshirani, M. R. (2013). Package
’glasso’.
105
Gai, Y., Zhu, L., and Lin, L. (2013). Model selection consistency of dantzig
selector. Statistica Sinica, 23:615–634.
Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional lin-
ear predictor selection and the virtue of overparametrization. Bernoulli,
10(6):971–988.
Hess, K. R., Anderson, K., Symmans, W. F., Valero, V., Ibrahim, N., Mejia,
J. A., Booser, D., Theriault, R. L., Buzdar, A. U., Dempsey, P. J.,
et al. (2006). Pharmacogenomic predictor of sensitivity to preoperative
chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophos-
phamide in breast cancer. Journal of Clinical Oncology, 24(26):4236–4244.
Huang, J., Horowitz, J. L., and Ma, S. (2008a). Asymptotic properties of
bridge estimators in sparse high-dimensional regression models. The Annals
of Statistics, 36(2):587–613.
Huang, J., Ma, S., and Zhang, C.-H. (2008b). Adaptive lasso for sparse high-
dimensional regression models. Statistica Sinica, 18(4):1603.
James, G. M., Radchenko, P., and Lv, J. (2009). Dasso: connections between
the dantzig selector and lasso. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 71(1):127–142.
106
Kim, Y., Choi, H., and Oh, H. (2008). Smoothly clipped absolute devia-
tion on high dimensions. Journal of the American Statistical Association,
103(484):1665–1673.
Kim, Y. and Kwon, S. (2012). Global optimality of nonconvex penalized esti-
mators. Biometrika, 99(2):315–325.
Kim, Y., Kwon, S., and Choi, H. (2012). Consistent model selection criteria on
high dimensions. The Journal of Machine Learning Research, 98888:1037–
1057.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large
covariance matrix estimation. Annals of statistics, 37(6B):4254.
Lu, Z. (2012). Primal–dual first-order methods for a class of cone programming.
Optimization Methods and Software, 28(6):1262–1281.
Lu, Z., Pong, T. K., and Zhang, Y. (2012). An alternating direction method
for finding dantzig selectors. Computational Statistics & Data Analysis.
Lu, Z. and Zhang, Y. (2012). An augmented lagrangian approach for sparse
principal component analysis. Mathematical programming, 135(1-2):149–
193.
107
Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and vari-
able selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
Meinshausen, N., Rocha, G., and Yu, B. (2007). Discussion: A tale of three
cousins: Lasso, l2boosting and dantzig. The Annals of Statistics, 35(6):2373–
2384.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A
unified framework for high-dimensional analysis of m-estimators with de-
composable regularizers. Statistical Science, 27(4):538–557.
Peng, J., Wang, P., Zhou, N., and Zhu, J. (2009). Partial correlation estima-
tion by joint sparse regression models. Journal of the American Statistical
Association, 104(486):735–746.
Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue
properties for correlated gaussian designs. The Journal of Machine Learning
Research, 99:2241–2259.
Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of es-
timation for high-dimensional linear regression over `q-balls. Information
Theory, IEEE Transactions on, 57(10):6976–6994.
Ravikumar, P., Wainwright, M., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.
Romberg, J. (2008). The dantzig selector and generalized thresholding. In In-
formation Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference
on, pages 22–25. IEEE.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random
measurements. Information Theory, IEEE Transactions on, 59(6):3434–
3447.
Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A.,
Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant,
T. L., et al. (2006). Regulation of gene expression in the mammalian eye
and its relevance to eye disease. Proceedings of the National Academy of
Sciences, 103(39):14429–14434.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society. Series B (Methodological), pages 267–288.
Wang, L., Kim, Y., and Li, R. (2013). Calibrating nonconvex penalized regres-
sion in ultra-high dimension. The Annals of Statistics, 41(5):2505–2536.
Wang, X. and Yuan, X. (2012). The linearized alternating direction method of multipliers for dantzig selector. SIAM Journal on Scientific Computing, 34(5):A2792–A2811.
Yuan, M. (2010). High dimensional inverse covariance matrix estimation via
linear programming. The Journal of Machine Learning Research, 99:2261–
2286.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the gaussian
graphical model. Biometrika, 94(1):19–35.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax con-
cave penalty. The Annals of Statistics, 38(2):894–942.
Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection
in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–
1594.
Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regulariza-
tion for high-dimensional sparse estimation problems. Statistical Science,
27(4):576–593.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The
Journal of Machine Learning Research, 7:2541–2563.
110
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101(476):1418–1429.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized
likelihood models. Annals of Statistics, 36(4):1509.
111
국문초록 (Abstract in Korean)

Variable selection is important in high dimensional regression. Traditional variable selection methods such as stepwise selection are unstable in the sense that the set of selected variables changes with the data. As an alternative, penalized methods that perform variable selection and estimation simultaneously are used. The LASSO estimator is sparse, but it is not selection consistent and is biased. Nonconvex penalized methods such as the SCAD and the MCP are well known to be selection consistent and to yield unbiased estimators, but they can have multiple local minima and their computation is unstable with respect to the tuning parameter. Two stage methods based on the LASSO, which can attain the oracle estimator as the unique local minimum, have been developed.

In this thesis we propose a new two stage method in which the LASSO is replaced by the Dantzig selector. The proposal is motivated by the observation that, in the second stage of a two stage method, reducing the influence of the noise variables is very important. For the same tuning parameter, the $\ell_1$-norm of the Dantzig selector is less than or equal to that of the LASSO, and the non-asymptotic error bounds of the Dantzig selector also tend to be smaller than those of the LASSO. Therefore we expect that using the Dantzig selector instead of the LASSO in the two stage method improves estimation while still satisfying selection consistency. We prove that, under conditions on the data, the oracle estimator can be obtained by the proposed method. Numerical studies show that, in variable selection and estimation, the two stage Dantzig selector can improve on the two stage methods based on the LASSO and also performs well compared with other existing methods.

We additionally apply the two stage methods to estimation of the inverse covariance matrix, which is used in various statistical problems and is important in its own right because it encodes conditional dependence. We show that, under regularity conditions, the proposed methods achieve selection consistency for the zero pattern of the inverse covariance matrix and $\sqrt{n}$-consistency for the true nonzero elements. Numerical studies confirm that the proposed methods perform better than existing methods in both selection and estimation.

주요어 (Keywords): high dimensional regression, variable selection, Dantzig selector, selection consistency, oracle estimator, inverse covariance matrix
학 번 (Student Number): 2007−20263
113
감사의 글 (Acknowledgments)

I am thankful that during eleven years at Gwanak, as an undergraduate and graduate student, I met good professors, friends, seniors, juniors, and many people who helped me, and that I was able to finish the whole course by God's grace. Seven years have passed since I entered graduate school hoping to build expertise in statistics and to contribute to the public good, and taking my first step as a doctor feels new. Before this new start, I would like to express my gratitude to the many people who helped me along the way.

First of all, I thank my advisor, Professor 김용대, who gave me many opportunities and guided me throughout my doctoral studies, and Professor 전종우, who always welcomed and encouraged me. I sincerely thank Professors 박병욱, 장원철, and 임요한 for their advice and effort in reviewing this thesis, and Professor 권성훈, an admirable senior of our lab, for the thesis review and much advice. My six years of lab life were enjoyable thanks to good seniors and juniors: Professor 최호식, 동화, 도현, 범수, 광수, 재석, 상인, 병엽, 종준, 미애, 수희, 신선, 영희, 건수, 혜림, 미경, 효원, 지영, 원준, 지선, 지영, 주유, 민우, 승환, 우성, 재성, 슬기, 동하, 세민, 구환, 윤영, 승남, 오란, and, from the Naver team, 김유원, 정효주, 인재, and 연하; thank you all. I also thank 영선, my close friend since undergraduate days; 신영, my roommate of more than three years who supported me; 정은, cute and thoughtful; my doctoral classmates 예지 and 성일; 선미, who shared enjoyable times in my final year; and 정환 in the department office.

I am grateful to the Christian fellowship of the Department of Statistics, which accompanied my whole graduate life: Professor 오희석, the great pillar of the group; Professor 박태성, who joined despite a busy schedule; Professor 조명희, who encouraged us with delicious meals; and Professors 이영조, 박성현, and 송문섭, who joined us in worship. From 민정 and 성준 to 지영, 민주, 은용, 성경, 재혁, and 보창, it was a joy to share and have fellowship together. The Wednesday chapel praise team was a source of energy in my graduate life: 준, 정민, 은혜, 바우, 송희, 민우, 정민, 지웅, 재희, 나래, 바뚜, 문수, 신잉, Professor 서교, 경주, 현주, 지나, 찬미, 경만, 서림, 건의, 민화, 소정, and 소영; thank you, I was truly grateful and happy. I was also thankful for the small joys shared with the lovely 보신 sisters 윤진, 효현, and 지인. I thank Professors 남승호 and 이원종, who showed me the model of an upright Christian I wish to emulate, and Pastor 김동식, 마리아 사모님, Professor 홍종인, and 김난주 사모님 of the university church, who helped and encouraged my growth in faith.

Finally, I thank my parents, who supported me with constant love, trust, and prayer, and my younger sibling, who has been a steady support. I am grateful to my long-time friends 새미 and 정현. I am sorry that I have not kept in touch with my teacher 구명수, who helped me become interested in mathematics, and I am truly thankful to him. I also thank my relatives and church family who supported me with prayer. Although I am still lacking in many ways, I will strive, with diligence, sincerity, and love, to be of benefit to my community and country. Thank you.

February 2014, 한상미
저 시-비 리-동 조건 경허락 2.0 한민
는 아래 조건 르는 경 에 한하여 게
l 저 물 복제, 포, 전송, 전시, 공연 송할 수 습니다.
l 차적 저 물 성할 수 습니다.
다 과 같 조건 라야 합니다:
l 하는, 저 물 나 포 경 , 저 물에 적 허락조건 확하게 나타내어야 합니다.
l 저 터 허가를 러한 조건들 적 지 않습니다.
저 에 른 리는 내 에 하여 향 지 않습니다.
것 허락규약(Legal Code) 해하 쉽게 약한 것 니다.
Disclaimer
저 시. 하는 원저 를 시하여야 합니다.
비 리. 하는 저 물 리 적 할 수 없습니다.
동 조건 경허락. 하가 저 물 개 , 형 또는 가공했 경에는, 저 물과 동 한 허락조건하에서만 포할 수 습니다.
이학박사 학위논문
Two Stage Dantzig Selector for
High Dimensional Data
고차원 자료를 위한 이단계 단치그 셀렉터
2014년 2월
서울대학교 대학원
통계학과
한 상 미
Two Stage Dantzig Selector for
High Dimensional Data
지도교수 김 용 대
이 논문을 이학박사 학위논문으로 제출함.
2013년 10월
서울대학교 대학원
통계학과
한 상 미
한상미의 이학박사 학위논문을 인준함.
2013년 12월
위 원 장 : 박 병 욱 (인)
부 위원장 : 김 용 대 (인)
위 원 : 임 요 한 (인)
위 원 : 장 원 철 (인)
위 원 : 권 성 훈 (인)
Two Stage Dantzig Selector for
High Dimensional Data
by
Sangmi Han
A Thesis
Submitted in fulfillment of the requirements
for the degree of
Doctor of Philosophy
in Statistics
Department of Statistics
College of Natural Sciences
Seoul National University
Feburary, 2014
Abstract
Variable selection is important in high dimensional regression. Traditional variable selection methods such as stepwise selection are unstable, in the sense that the set of selected variables varies from data set to data set. As an alternative, a series of penalized methods have been used to perform estimation and variable selection simultaneously. The LASSO yields a sparse solution, but it is biased and not selection consistent. Non-convex penalized methods such as the SCAD and the MCP are known to be selection consistent and to yield unbiased estimators. However, they suffer from multiple local minima, and their computation is unstable with respect to the tuning parameter. Two stage methods based on the LASSO, such as the one step LLA and the calibrated CCCP, have been developed, which can obtain the oracle estimator as the unique local minimum.
We propose a two stage method based on the Dantzig selector. The motivation for our proposed method is that lessening the effect of the noise variables is important in the two stage method. The ℓ1 norm of the Dantzig selector is always less than or equal to that of the LASSO, and the non-asymptotic error bounds of the Dantzig selector tend to be smaller than those of the LASSO for the same tuning parameter. Therefore we expect an improvement in estimation when the Dantzig selector is used instead of the LASSO in the two stage method, while the proposed method still satisfies the selection consistency. The results of the numerical experiments support this contention.
We also apply these two stage methods, based on either the LASSO or the Dantzig selector, to the estimation of the inverse covariance matrix (a.k.a. precision matrix). Precision matrix estimation is essential not only because it can be used in various applications but also because it describes the direct relationship between variables via their conditional dependence under the normality assumption. Under some regularity conditions our methods achieve selection consistency and obtain a columnwise √n-consistent estimator of the true nonzero precision matrix elements. The numerical analyses show that the proposed methods perform well in terms of variable selection and estimation.
Keywords: High dimensional regression, variable selection, Dantzig selector,
selection consistency, oracle estimator, inverse covariance matrix estimation
Student Number: 2007− 20263
ii
Contents
Abstract i
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Two Stage Dantzig Selector for High Dimensional Linear Regression 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Sparse regularization methods . . . . . . . . . . . . . . . . . . . 8
2.2.1 The `1 regularization . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Nonconvex penalized methods . . . . . . . . . . . . . . . 13
2.2.3 Two stage methods . . . . . . . . . . . . . . . . . . . . . 21
2.3 Two Stage Dantzig Selector . . . . . . . . . . . . . . . . . . . . 25
iii
2.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.3 Theoretical properties . . . . . . . . . . . . . . . . . . . 29
2.3.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.5 Tuning regularization parameter . . . . . . . . . . . . . . 37
2.4 Numerical analyses . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . 44
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Two Stage Methods for Precision Matrix Estimation 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Estimation of precision matrix via columnwise two-stage methods 49
3.2.1 Two stage method based on LASSO . . . . . . . . . . . 50
3.2.2 Two stage Dantzig selector . . . . . . . . . . . . . . . . . 53
3.2.3 Theoretical results . . . . . . . . . . . . . . . . . . . . . 55
3.3 Numerical analyses . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . 83
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
iv
4 Concluding remarks 88
5 Appendix 90
5.1 Algorithms for Dantzig selector . . . . . . . . . . . . . . . . . . 90
5.1.1 Primal-dual interior point algorithm (Candes and Romberg,
2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.2 DASSO (James et al., 2009) . . . . . . . . . . . . . . . . 96
5.1.3 Alternating direction method (ADM) (Lu et al., 2012) . 100
Abstract (in Korean) 111
감사의 글 114
v
List of Tables
2.1 Example 1 (R=0.3) . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2 Example 1 (R=0.5) . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 Example 2 (R=0.3) . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Example 2 (R=0.5) . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5 Real Data (TRIM) . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Example 1 (p=100, q=99) . . . . . . . . . . . . . . . . . . . . . 68
3.2 ‖Ω −Ω∗‖ of Example 1 (p=100, q=99) . . . . . . . . . . . . . 70
3.3 Example 1 (p=200, q=199) . . . . . . . . . . . . . . . . . . . . 72
3.4 ‖Ω −Ω∗‖ of Example 1 (p=200, q=199) . . . . . . . . . . . . . 74
3.5 Example 2 (p=100, q=59) . . . . . . . . . . . . . . . . . . . . . 76
3.6 ‖Ω −Ω∗‖ of Example 2 (p=100, q=59) . . . . . . . . . . . . . 78
3.7 Example 2 (p=200, q=92) . . . . . . . . . . . . . . . . . . . . . 80
3.8 ‖Ω −Ω∗‖ of Example 2 (p=200, q=92) . . . . . . . . . . . . . 82
vi
3.9 Real Data (Breast Cancer) . . . . . . . . . . . . . . . . . . . . . 86
vii
List of Figures
2.1 LASSO and Dantzig selector . . . . . . . . . . . . . . . . . . . 10
2.2 Penalized method . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 LLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 CCCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Nonconvex penalties . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Adaptive DS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 ROC curve of Example 1 (p=100, q=99) . . . . . . . . . . . . . 67
3.2 ‖Ω −Ω∗‖ of Example 1 (p=100, q=99) . . . . . . . . . . . . . 69
3.3 ROC curve of Example 1 (p=200, q=199) . . . . . . . . . . . . 71
3.4 ‖Ω −Ω∗‖ of Example 1 (p=200, q=199) . . . . . . . . . . . . . 73
3.5 ROC curve of Example 2 (p=100, q=59) . . . . . . . . . . . . . 75
3.6 ‖Ω −Ω∗‖ of Example 2 (p=100, q=59) . . . . . . . . . . . . . 77
3.7 ROC curve of Example 2 (p=200, q=92) . . . . . . . . . . . . . 79
viii
3.8 ‖Ω −Ω∗‖ of Example 2 (p=200, q=92) . . . . . . . . . . . . . 81
ix
Chapter 1
Introduction
1.1 Overview
High dimensional data analysis has received much attention due to advances in data collection technologies. High dimensional data, where the number of covariates exceeds the number of observations, arise in various fields such as genomics, neuroscience, economics, finance, and machine learning. Variable selection is fundamental to data analysis because it can identify significant variables among many covariates, and in high dimensions its importance only grows. There are two approaches to variable selection: subset selection and sparse regularization. In high dimensions, subset selection methods such as best subset selection are computationally demanding, unstable, and their sampling properties are hard to derive (Breiman, 1996).
To handle these drawbacks, many sparse regularization methods have been proposed. These methods can select significant variables and estimate coefficients simultaneously. The two major approaches in sparse regularization are ℓ1 regularization, including the LASSO (Tibshirani, 1996) and the Dantzig selector (Candes and Tao, 2007), and nonconvex penalization, including the SCAD (Fan and Li, 2001) and the MCP (Zhang, 2010). For selection consistency, ℓ1 regularization methods need stringent conditions such as the strong irrepresentable condition (Zhao and Yu, 2006).
On the other hand, nonconvex penalized methods not only do not need such conditions for selection consistency but can also reduce the innate bias problem of ℓ1 regularization methods. Despite these good properties, nonconvex penalized methods suffer from multiple local minima and cannot guarantee that the converged solution is the oracle estimator. As an alternative, two stage methods based on the LASSO, such as the one step LLA and the calibrated CCCP algorithms, have been proposed to obtain the oracle estimator.
In this thesis, we deal with regularization methods in the high dimensional linear regression model. We focus on developing a new regularization method that obtains the oracle estimator. We propose a two stage method based on the Dantzig selector. Our method can improve variable selection and estimation by deleting noise variables more efficiently, using the Dantzig selector instead of the LASSO.
We also deal with precision matrix estimation as an application of high dimensional linear regression. For sparse precision matrix estimation, many regularization methods have been considered. Most of them belong to one of two regularization frameworks: the maximum likelihood approach and the regression based approach. We apply the two stage methods based on the LASSO or the Dantzig selector to the regression based approach and show that they can obtain the columnwise oracle estimator of the precision matrix. Numerical results show that our proposed methods are superior to other regularized methods in terms of support recovery and estimation of the precision matrix.
1.2 Outline of the thesis
The thesis is organized as follows. In Chapter 2, we deal with high dimensional linear regression. We review diverse sparse regularization methods for high dimensional linear regression and propose the two stage Dantzig selector. Theoretical properties of and an algorithm for the two stage Dantzig selector are provided, and we compare our method to other methods in numerical analyses. In Chapter 3, precision matrix estimation using regularization methods is considered. We review existing regularization methods and propose new methods which utilize the two stage methods based on the LASSO or the Dantzig selector. We prove theoretical properties of the two stage methods, and numerical analyses are conducted. Concluding remarks follow in Chapter 4. In the Appendix, algorithms for the Dantzig selector are summarized.
Chapter 2
Two Stage Dantzig Selector for
High Dimensional Linear
Regression
2.1 Introduction
Variable selection is essential for linear regression analysis. There are two ap-
proaches for variable selection, which are subset selection and regularization.
Subset selection is selecting a subset of covariates and using only the selected
covariates for fitting model. Popular examples of subset selection are best sub-
set selection, forward selection, backward elimination, stepwise selection, and so on. In high dimensions, these subset selection methods are computationally
demanding and unstable. Furthermore, their sampling properties are hard to
derive (see Breiman (1996) for more discussions).
To deal with these drawbacks, many sparse regularization methods have
been proposed which can select the important variables and estimate the ef-
fect of covariates on the response simultaneously. The `1 regularization meth-
ods and the nonconvex penalized methods are two mainstreams of regularized
estimators for high dimensional regression model. The least absolute shrinkage
and selection operator (LASSO) (Tibshirani, 1996) and the Dantzig selector
(Candes and Tao, 2007) are two representative examples of the `1 regulariza-
tion. They are easy to calculate and have good estimation properties (Bickel
et al., 2009; Raskutti et al., 2011). However, they have an intrinsic bias and achieve selection consistency only under stringent conditions such as the irrepresentable condition (Zhao and Yu, 2006).
On the other hand, nonconvex penalized methods such as the smoothly
clipped absolute deviation (SCAD) (Fan and Li, 2001), and the minimax con-
cave penalty (MCP) (Zhang, 2010) can have unbiasedness and selection consis-
tency, simultaneously. The most fascinating property of nonconvex penalized
methods is the oracle property. The oracle property means that covariates
are selected consistently and the efficiency of the estimator is equivalent to
the least square estimator obtained with knowing true nonzero coefficients in
advance (Fan and Li, 2001; Kim et al., 2008; Kim and Kwon, 2012).
However, due to their nonconvexity, there can be many local minima in the
corresponding objective function. Therefore, it is not guaranteed for an ob-
tained estimator to be the oracle estimator. Even though some previous works
(Kim and Kwon, 2012; Zhang, 2010) showed that the objective function with
a nonconvex penalty can have a unique local minimizer under some regularity
conditions, its computation may be demanding for high dimensional models.
Typically, optimization problems corresponding to nonconvex penalized meth-
ods are solved by iterative algorithms, including the concave convex procedure
(CCCP) (Kim et al., 2008) and the local linear approximation (LLA) (Zou and
Li, 2008), where the nonconvex objective function is approximated by a locally
linear function. However, it takes a significant amount of time for algorithms
to converge, and typically the nonconvex penalized methods suffer instability
in tuning the regularization parameter. Furthermore, these algorithms only
assure the convergence to a local minimum which is not necessarily the oracle
estimator (Wang et al., 2013).
Two stage methods based on the LASSO, such as the one step LLA (Zou and Li, 2008; Fan et al., 2012) and the calibrated CCCP (Wang et al., 2013), have been proposed to obtain the oracle estimator. The main idea of these methods is to obtain the oracle estimator by solving the LASSO problem twice.
In this chapter, we propose a two stage method based on the Dantzig selector to obtain the oracle estimator, which we call the two stage Dantzig selector. The Dantzig selector used in our method can improve variable selection and estimation by lessening the effects of the noise variables more efficiently than the LASSO. We prove that the two stage Dantzig selector can obtain the oracle estimator under regularity conditions. The proposed method can be easily implemented with general algorithms for the standard Dantzig selector. Numerical results show that our proposed method outperforms other sparse regularization methods with respect to variable selection and estimation.
2.2 Sparse regularization methods
In this section, we review various sparse regularization methods for high di-
mensional linear regression. Consider the linear regression model
$$y = X\beta + \varepsilon,$$
where $y$ is an $n \times 1$ response vector, $X = (x_1, \ldots, x_n)^T = (X_1, \ldots, X_p)$ is an $n \times p$ covariate matrix with $X_j \in \mathbb{R}^n$ and $x_i \in \mathbb{R}^p$, $\beta$ is a $p \times 1$ vector of unknown regression coefficients, and $\varepsilon$ is an $n \times 1$ vector of random errors. In high dimensions, the ordinary least squares estimator is not uniquely defined, and the traditional variable selection methods, such as best subset selection or stepwise selection based on the AIC or BIC criteria, are computationally intensive. Furthermore, their sampling properties are hard to derive and they are unstable (Breiman, 1996). As an alternative, sparse regularization methods are used to estimate coefficients and select variables simultaneously. There are two mainstreams of regularization methods: ℓ1 regularization and nonconvex penalization.
2.2.1 The `1 regularization
The ℓ1 regularization can achieve sparsity. Sparsity means that the estimator produces exactly zero coefficients and hence reduces model complexity. The LASSO (Tibshirani, 1996) and the Dantzig selector (Candes and Tao, 2007) are two representative examples of the ℓ1 constrained methods. The LASSO estimator $\hat\beta^{LASSO}$ is defined as the solution of
$$\min_\beta \left( \|y - X\beta\|^2/2n + \lambda\|\beta\|_1 \right),$$
or equivalently,
$$\min_\beta \|y - X\beta\|^2 \quad \text{subject to} \quad \|\beta\|_1 \le t,$$
where $\|a\| = \sqrt{\sum_{i=1}^{n} a_i^2}$ and $\|a\|_1 = \sum_{i=1}^{n} |a_i|$.
[Figure 2.1: LASSO and Dantzig selector. Left panel: LASSO; right panel: Dantzig Selector; axes $\beta_1$ and $\beta_2$, with the true coefficient and the OLS estimator marked.]
The Dantzig selector $\hat\beta^{Dantzig}$ is defined similarly to the LASSO estimator by
$$\min_\beta \|\beta\|_1 \quad \text{subject to} \quad \Big\|\frac{1}{n}X^T(y - X\beta)\Big\|_\infty \le \lambda,$$
or equivalently,
$$\min_\beta \|X^T(y - X\beta)\|_\infty \quad \text{subject to} \quad \|\beta\|_1 \le t,$$
where $\|a\|_\infty = \max_i |a_i|$ for $a \in \mathbb{R}^n$.
As shown in Figure 2.1, the solution of the LASSO occurs at the point of contact between the dotted ellipsoid and the solid diamond, whereas the solution of the Dantzig selector occurs at the point of contact between the dotted parallelogram and the diamond. Hence solutions with exactly zero elements can be obtained. The dotted ellipsoid denotes the set of points at the same distance $(\beta - \hat\beta^{ols})^T X^T X (\beta - \hat\beta^{ols})$ from the ordinary least squares estimator $\hat\beta^{ols}$, and the dotted parallelogram denotes the set of points with the same value of $\|X^T X(\beta - \hat\beta^{ols})\|_\infty$.
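To make the definition concrete, the Dantzig selector can be computed with an off-the-shelf linear programming solver, since both the ℓ1 objective and the ℓ∞ constraint become linear after introducing auxiliary variables. The following is a minimal sketch under that reduction, assuming scipy's generic LP solver; it is not the thesis's own algorithm (the primal-dual interior point method, DASSO, and ADM reviewed in the Appendix are used there).

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve the Dantzig selector as a linear program:
    min ||beta||_1  subject to  ||X^T (y - X beta) / n||_inf <= lam."""
    n, p = X.shape
    G = X.T @ X / n          # p x p Gram matrix / n
    c0 = X.T @ y / n         # correlation with the response
    # decision vector z = [beta, u]; minimize sum(u) with -u <= beta <= u
    cost = np.concatenate([np.zeros(p), np.ones(p)])
    I = np.eye(p)
    A_ub = np.vstack([
        np.hstack([ I, -I]),                 #  beta - u <= 0
        np.hstack([-I, -I]),                 # -beta - u <= 0
        np.hstack([ G, np.zeros((p, p))]),   #  G beta <= lam + c0
        np.hstack([-G, np.zeros((p, p))]),   # -G beta <= lam - c0
    ])
    b_ub = np.concatenate([np.zeros(2 * p), lam + c0, lam - c0])
    bounds = [(None, None)] * p + [(0, None)] * p
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]
```

This plain LP has $2p$ variables and roughly $4p$ constraints, so it is practical only for moderate $p$; the specialized algorithms summarized in the Appendix scale better.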
The penalized form of the LASSO and the definition of the Dantzig selector
are related. The LASSO estimate is always in the constrained set (feasible set)
of the Dantzig selector. The Karush-Kuhn-Tucker conditions for the Lagrangian form of the LASSO are given by
$$\frac{1}{n}X_j^T(y - X\beta) = \lambda\,\mathrm{sign}(\beta_j) \quad \text{for } |\beta_j| > 0,$$
$$\Big|\frac{1}{n}X_j^T(y - X\beta)\Big| \le \lambda \quad \text{for } \beta_j = 0.$$
Therefore $\|\frac{1}{n}X^T(y - X\hat\beta^{LASSO}(\lambda))\|_\infty \le \lambda$, and hence $\|\hat\beta^{Dantzig}(\lambda)\|_1 \le \|\hat\beta^{LASSO}(\lambda)\|_1$.
The Dantzig selector and the LASSO share some similarities. They yield the
same solution path under some suitable conditions on the design matrix. Mein-
shausen et al. (2007) proved that the LASSO and the Dantzig selector share the
identical solution path under the diagonal dominance condition, which means
that $M_{jj} > \sum_{i\neq j}|M_{ij}|$ for all $j = 1, \ldots, p$, where $M = (X^TX)^{-1}$. James et al. (2009) showed the equivalence of the LASSO and the Dantzig selector under a condition which is similar to the irrepresentable condition (IC) (Zhao and Yu, 2006). This condition is that $\|X^T X_{A(\lambda)} u\|_\infty \le 1$ for $u = (X_{A(\lambda)}^T X_{A(\lambda)})^{-1}\mathbf{1}$, with a tuning parameter $\lambda$ and the active set $A(\lambda) = \{j : \hat\beta_j(\lambda) \neq 0\}$.
The Dantzig selector and the LASSO estimator can achieve the minimax
optimal error bound (Raskutti et al., 2011; Bickel et al., 2009). Raskutti et al.
(2011) showed that the minimax optimal convergence rate of the ℓ2-error is $O(\sqrt{s\log p/n})$ under some regularity conditions. Bickel et al. (2009) proved a similar prediction error rate for the LASSO and the Dantzig selector and their asymptotic equivalence under the restricted eigenvalue condition.
Not only the theoretical properties, but also the algorithms for the LASSO
and the Dantzig selector are comparable. Similar to the LARS (Efron et al.,
2004), which is an efficient algorithm for the LASSO estimator giving piece-
wise linear path, the DASSO (James et al., 2009) algorithm gives a piecewise
linear solution path. These algorithms will be summarized and compared in
the Appendix.
Despite their good asymptotic properties and efficient algorithms, the LASSO and the Dantzig selector have some limitations. First, the LASSO and the Dantzig selector are biased: since the same amount of shrinkage is enforced on all nonzero coefficients, they cannot achieve unbiasedness. Second, the LASSO and the Dantzig selector rarely have the model selection consistency. The correlation structure of the covariates is crucial for selection consistency, as reflected in conditions such as the ICs (Zhao and Yu, 2006) and the coherence property (Candes and
Plan, 2009). Zhao and Yu (2006) proved the weak oracle property of the LASSO
estimator under the ICs. Gai et al. (2013) proved the weak oracle property of
the Dantzig selector under the modified ICs related to KKT conditions of
Dantzig selector, which are more complex than the ICs of the LASSO. Those
ICs mean that the regression coefficients of the inactive variables on the $s$ active variables should be uniformly bounded by a constant less than or equal to one. As we can see in the simulation results of Zhao and Yu (2006), these ICs are too strict, especially in high dimensions. Hence the LASSO and the Dantzig selector cannot achieve the selection consistency in most cases.
2.2.2 Nonconvex penalized methods
Nonconvex penalized methods can be good alternatives to the `1 regularized es-
timators since they have selection consistency and unbiasedness. A nonconvex
penalized least square estimator is defined as the minimizer of Qλ(β) where
$$Q_\lambda(\beta) = \|y - X\beta\|^2/2n + P_\lambda(|\beta|)$$
and Pλ(|β|) is a nonconvex penalty including bridge estimator (Frank and
Friedman, 1993), the SCAD (Fan and Li, 2001), and the MCP (Zhang, 2010).
The bridge penalty is defined as $P_\lambda(\beta) = \lambda\sum_{j=1}^{p}|\beta_j|^q$, $0 < q < 1$. The
[Figure 2.2: Penalized method — penalty functions $P_\lambda(\beta)$ of the lasso, MCP, SCAD, and bridge penalties.]
penalty function of the SCAD is defined as
$$P_\lambda(\beta) = \sum_{j=1}^{p}\left[\lambda|\beta_j|\,I(|\beta_j| < \lambda) + \left\{\frac{a\lambda(|\beta_j|-\lambda) - (\beta_j^2-\lambda^2)/2}{a-1} + \lambda^2\right\} I(\lambda \le |\beta_j| < a\lambda) + \left\{\frac{(a-1)\lambda^2}{2} + \lambda^2\right\} I(|\beta_j| \ge a\lambda)\right].$$
Zhang (2010) proposed the MCP with
$$P_\lambda(\beta) = \sum_{j=1}^{p}\left[\Big(\lambda|\beta_j| - \frac{\beta_j^2}{2a}\Big) I(|\beta_j| \le a\lambda) + \frac{1}{2}a\lambda^2\,I(|\beta_j| > a\lambda)\right].$$
Figure 2.2 shows those nonconvex penalty functions and the LASSO penalty.
The SCAD and the MCP estimators satisfy the good properties of a penalized estimator, namely unbiasedness, sparsity, and continuity, as introduced by Fan and Li (2001), while the bridge estimator lacks continuity and the LASSO lacks unbiasedness.
For identifying unknown signal variables, nonconvex penalized methods have received great attention recently because they can achieve the model selection consistency without stringent conditions such as the ICs. Instead, they need weaker conditions on the design matrix, such as the sparse Riesz condition (Zhang and Huang, 2008) or a positive minimum eigenvalue of the submatrix of $X^TX$ corresponding to the signal covariates. Let the true coefficient vector $\beta^* = (\beta_1^{*T}, \mathbf{0}^T)^T$ be such that the first $s$ regression coefficients $\beta_1^*$ are nonzero and the others are zero. The oracle estimator $\hat\beta^{(o)}$ is defined as $(\hat\beta_1^{(o)T}, \mathbf{0}^T)^T$, where $\hat\beta_1^{(o)} = (X_1^TX_1)^{-1}X_1^Ty$, $X = (X_1, X_2)$, and $X_1$ is the $n \times s$ and $X_2$ the $n \times (p-s)$ submatrix of $X$. Assume $\sqrt{n}(\hat\beta_1^{(o)} - \beta_1^*) \overset{d}{\to} N_s(0, \Sigma)$. An estimator $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ is said to have the oracle property if
$$\Pr\big[\{j : \hat\beta_j \neq 0\} = \{1, \ldots, s\}\big] = 1, \qquad \sqrt{n}(\hat\beta_1 - \beta_1^*) \overset{d}{\to} N_s(0, \Sigma).$$
Many previous works showed that various non-convex penalized methods have
the oracle property (Fan and Li, 2001; Kim et al., 2008; Huang et al., 2008a;
Zhang, 2010; Kim and Kwon, 2012).
There are three types of oracle properties - weak, global, and strong oracle
properties. The weak oracle property is that there exists a sequence of λn such
that one of the local minimizers of Qλn(β) is the oracle estimator. Fan and Li
(2001) and Kim et al. (2008) proved the weak oracle property of the SCAD
for p ≤ n and p > n, respectively. The global oracle property is that there
exists a sequence of λn such that the global minimizer β(λn) of Qλn(β) has
the oracle property. Huang et al. (2008a) proved the global oracle property of
bridge estimator and Kim et al. (2008) proved that of the SCAD for p ≤ n.
The strong oracle property means that there exists a sequence of λn such that
the unique local minimizer of Qλn(β) is the oracle estimator. The SCAD and
the MCP can obtain the oracle estimator as a unique local optimizer with
probability tending to one (Kim and Kwon, 2012; Zhang, 2010).
Computing the global solution of the nonconvex penalized methods is infeasible in the high dimensional setting, since optimizing a nondifferentiable and nonconvex function is challenging. Instead, iterative algorithms are used which locally approximate the nonconvex penalized objective by a convex function and solve the resulting convex optimization. The local quadratic approximation (LQA) (Fan and Li, 2001), the local linear approximation (LLA) (Zou and Li, 2008), and the concave convex procedure (CCCP) (Kim et al., 2008) have been devised to obtain a nonconvex penalized likelihood estimate. The LQA uses a second order approximation of the penalty
as follows:
$$[P_\lambda(|\beta_j|)]' = P'_\lambda(|\beta_j|)\,\mathrm{sign}(\beta_j) \approx \frac{P'_\lambda(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\beta_j,$$
$$P_\lambda(|\beta_j|) \approx P_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{P'_\lambda(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\big(\beta_j^2 - |\beta_j^{(0)}|^2\big) \quad \text{for } \beta_j \approx \beta_j^{(0)}.$$
The LQA algorithm updates the solution as follows until convergence:
$$\beta^{(k+1)} = \operatorname*{argmin}_{\beta}\left\{\frac{1}{2n}\|y - X\beta\|^2 + \frac{1}{2}\sum_{j=1}^{p}\frac{P'_\lambda(|\beta_j^{(k)}|)}{|\beta_j^{(k)}|}\,\beta_j^2\right\}.$$
To avoid numerical instability, when $|\beta_j^{(k)}| < \varepsilon_0$ (a prespecified value), we set $\hat\beta_j = 0$ and delete the $j$th column of $X$ from the iteration. In every iteration, the solution has the closed form
$$\beta^{(k+1)} = \big(X^TX + n\,\Sigma_\lambda(\beta^{(k)})\big)^{-1}X^Ty,$$
where $\Sigma_\lambda(\beta^{(k)}) = \mathrm{diag}\big(P'_\lambda(|\beta_1^{(k)}|)/|\beta_1^{(k)}|, \ldots, P'_\lambda(|\beta_p^{(k)}|)/|\beta_p^{(k)}|\big)$ for $k = 0, 1, 2, \ldots$.
Since the LQA removes the variables with small coefficient magnitudes, once $\beta_j$ is removed from the model it is permanently excluded, and hence the choice of $\varepsilon_0$ significantly affects the degree of sparsity of the solution and the speed of convergence. To relieve this problem, instead of removing variables, a perturbation $\tau_0$ in the denominator can be considered:
$$\beta^{(k+1)} = \operatorname*{argmin}_{\beta}\left\{\frac{1}{2n}\|y - X\beta\|^2 + \frac{1}{2}\sum_{j=1}^{p}\frac{P'_\lambda(|\beta_j^{(k)}|)}{|\beta_j^{(k)}| + \tau_0}\,\beta_j^2\right\}.$$
However, $\tau_0$ plays a similar role to $\varepsilon_0$.
[Figure 2.3: LLA — local linear approximation of the SCAD (left) and MCP (right) penalty functions $P_\lambda(\beta)$.]
The CCCP and the LLA can make up for the LQA's shortcomings, and they can be easily implemented with algorithms for the LASSO. The LLA algorithm is defined as follows. For $k = 1, 2, \ldots$, until convergence, repeat the following optimization:
$$\beta^{(k+1)} = \operatorname*{argmin}_{\beta}\left\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p}P'_\lambda(|\beta_j^{(k)}|)\,|\beta_j|\right\}.$$
The LLA is a first order approximation of the nonconvex penalty function; Figure 2.3 shows the LLA of the SCAD and the MCP. The CCCP decomposes the nonconvex penalty into the LASSO penalty and a concave part, and the concave part is approximated by its tight local linear function. The decompositions
[Figure 2.4: CCCP — convex-concave decompositions of the SCAD (left) and MCP (right) penalty functions $P_\lambda(\beta)$.]
of the nonconvex penalty functions of the SCAD and the MCP are represented in Figure 2.4. The CCCP algorithm iteratively minimizes $Q(\beta\,|\,\beta^{(k)}, \lambda)$ until convergence, where
$$Q(\beta\,|\,\beta^{(k)}, \lambda) = \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p}\nabla J_\lambda(|\beta_j^{(k)}|)\,\beta_j + \lambda\sum_{j=1}^{p}|\beta_j|,$$
with $P_\lambda(|\beta_j|) = J_\lambda(|\beta_j|) + \lambda|\beta_j|$ and $\nabla J_\lambda(t) = \frac{dJ_\lambda(t)}{dt}$.
Since the CCCP and the LLA algorithms use the first order derivative of the nonconvex penalty, a class of nonconvex penalties to which these algorithms apply is defined as $P_\lambda(|t|) = P_{a,\lambda}(|t|)$ on $t \in (-\infty, \infty)$ satisfying
(P1) $P_\lambda(t)$ is increasing and concave for $t \in [0,\infty)$ with continuous derivative on $t \in (0,\infty)$ and $P'_\lambda(0) := P'_\lambda(0+) \ge a_1\lambda$;
(P2) $P'_\lambda(t) \ge a_1\lambda$ for $t \in (0, a_2\lambda)$;
(P3) $P'_\lambda(t) = 0$ for $t > a\lambda > a_2\lambda$;
for some positive constants $a_1$, $a_2$, and $a$.
As shown in Figure 2.5, the SCAD and the MCP satisfy the above conditions, because the derivative of the SCAD penalty is
$$P'_\lambda(t) = \lambda I(t \le \lambda) + \frac{(a\lambda - t)_+}{a-1}\,I(t > \lambda), \quad \text{for some } a > 2, \text{ with } a_1 = a_2 = 1,$$
and the derivative of the MCP is
$$P'_\lambda(t) = \Big(\lambda - \frac{t}{a}\Big)_+, \quad \text{for some } a > 1, \text{ with } a_1 = 1 - a^{-1},\ a_2 = 1.$$
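For reference, the two derivative formulas above translate directly into code; they are the only ingredient the LLA/CCCP-type algorithms need from the penalty. The sketch below uses hypothetical helper names and conventional default values of $a$ (the simulations in Section 2.4 use $a = 2.1$ for the SCAD and $a = 1.5$ for the MCP instead).

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty: P'(t) = lam for t <= lam,
    (a*lam - t)_+ / (a - 1) for t > lam (evaluated at |t|)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def mcp_deriv(t, lam, a=3.0):
    """Derivative of the MCP penalty: P'(t) = (lam - t/a)_+ (evaluated at |t|)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.maximum(lam - t / a, 0.0)
```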
However, due to the nonconvexity of the penalty, multiple minima can occur, and the existing algorithms for nonconvex penalized methods only guarantee convergence to a local minimum, not to the oracle estimator. Although under some conditions the nonconvex penalized methods yield the oracle estimator as the unique minimizer (Kim and Kwon, 2012; Zhang, 2010), the direct computation of the global solution is infeasible in high dimensions, and the computation is especially unstable with respect to the tuning parameter. To deal with these difficulties, one-step algorithms with a good initial
[Figure 2.5: Nonconvex penalties — penalty functions $P_\lambda(\beta)$ (left) and their derivatives $P'_\lambda(\beta)$ (right) for the MCP and the SCAD.]
estimator have been proposed (Zou and Li, 2008; Fan et al., 2012; Wang et al., 2013), and we will call them two stage methods.
2.2.3 Two stage methods
The two stage methods based on the LASSO, such as the one step LLA (Zou and Li, 2008; Fan et al., 2012) and the calibrated CCCP (Wang et al., 2013), have been proposed to obtain the oracle estimator. The main idea of these methods is to obtain the oracle estimator by solving the LASSO problem. Zou and Li (2008) proved that the one step LLA algorithm can obtain the oracle estimator with a good
initial estimator. They suggested using the maximum likelihood estimator as
an initial estimator for n > p. The one step LLA estimator is defined as
$$\hat\beta(\lambda) = \operatorname*{argmin}_{\beta}\left\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p}P'_\lambda(|\beta_j^{init}|)\,|\beta_j|\right\}.$$
This can be reformulated as
$$\hat\beta(\lambda)_{A^c} = \operatorname*{argmin}_{\beta_{A^c}}\left\{\|r_A - X_{A^c}\beta_{A^c}\|^2/2n + \sum_{j\in A^c}P'_\lambda(|\beta_j^{init}|)\,|\beta_j|\right\},$$
$$\hat\beta(\lambda)_A = (X_A^TX_A)^{-1}X_A^T\big(y - X_{A^c}\hat\beta(\lambda)_{A^c}\big),$$
where $A = A(\lambda) = \{j : |\beta_j^{init}| > a\lambda\}$ with $a$ the parameter of the nonconvex penalty, and $r_A = y - X_A\hat\beta(\lambda)_A$.
In the first stage, a good initial estimator which is close to the true coefficient vector should be attained. For a regularization parameter $\lambda$ such that
$$\min\{|\beta_j^*| : \beta_j^* \neq 0,\ j = 1, \ldots, p\} > (a+1)\lambda \quad \text{and} \quad \|\beta^* - \beta^{init}\|_\infty < \lambda,$$
the true signal set and the signal set of the initial estimator are equivalent, i.e.,
$$A = A_0 = \{j : \beta_j^* \neq 0\} \quad \text{and} \quad P'_\lambda(|\beta_j^{init}|) = 0 \ \text{ for } j \in A_0,$$
and hence $\hat\beta(\lambda)_{A_0} = (X_{A_0}^TX_{A_0})^{-1}X_{A_0}^T(y - X_{A_0^c}\hat\beta(\lambda)_{A_0^c})$. Since estimating $\hat\beta(\lambda)$ can be recast as estimating $\hat\beta(\lambda)_{A_0^c}$ and plugging $\hat\beta(\lambda)_{A_0^c}$ into the equation for $\hat\beta(\lambda)_{A_0}$, the oracle estimator is obtained exactly when $\hat\beta(\lambda)_{A_0^c} = 0$. Hence removing the effect of the noise variables is important in the second stage of the two stage methods.
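Putting the two stages together, the procedure is: fit an initial LASSO, compute the adaptive weights $P'_\lambda(|\beta_j^{init}|)$, and solve a weighted LASSO on the penalized set in the second stage. A minimal sketch follows, assuming the scad_deriv helper from the previous code block and scikit-learn's Lasso for both stages; it mirrors the residualized reformulation above and is not the calibrated CCCP implementation of Wang et al. (2013) used in the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, lam, lam_init=None, a=3.7):
    """Two stage (one step LLA) sketch with a LASSO initial estimator."""
    n, p = X.shape
    if lam_init is None:
        lam_init = lam / np.log(n)          # smaller initial tuning parameter
    beta_init = Lasso(alpha=lam_init, fit_intercept=False).fit(X, y).coef_

    w = scad_deriv(np.abs(beta_init), lam, a)   # adaptive weights P'_lambda
    A = w == 0                                  # unpenalized (strong signal) set
    Ac = ~A

    beta = np.zeros(p)
    if Ac.any():
        # residualize on the unpenalized block, then weighted LASSO via rescaling
        XA = X[:, A]
        H = XA @ np.linalg.pinv(XA) if A.any() else np.zeros((n, n))
        r = y - H @ y
        Xt = (X[:, Ac] - H @ X[:, Ac]) / w[Ac]
        bt = Lasso(alpha=1.0, fit_intercept=False).fit(Xt, r).coef_
        beta[Ac] = bt / w[Ac]
    if A.any():
        # OLS refit of the unpenalized coefficients
        beta[A] = np.linalg.lstsq(X[:, A], y - X[:, Ac] @ beta[Ac], rcond=None)[0]
    return beta
```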
Fan et al. (2012) suggested the LASSO estimator with a smaller regularization parameter ($\lambda_{init} \le a\gamma_{LS}s^{-1/2}\lambda/4$) as an initial estimator, with which the oracle estimator can be obtained with high probability, where $s$ is the number of nonzero coefficients, $a$ is the parameter of the nonconvex penalty, and $\gamma_{LS}$ is the restricted eigenvalue defined by
$$\gamma_{LS} = \min_{\substack{\delta \neq 0 \\ \|\delta_{A_0^c}\|_1 \le 3\|\delta_{A_0}\|_1}} \frac{\|X\delta\|}{\sqrt{n}\,\|\delta_{A_0}\|} > 0.$$
Wang et al. (2013) proved that the calibrated CCCP estimator using the
LASSO initial estimator with smaller regularization parameter (λinit = τλ, τ =
o(1), e.g., τ = 1/ log n or λ) can obtain the oracle estimator with high proba-
bility. They remarked on the choice of τ which can be related to the number
of the signal parameter and the restricted eigenvalue. For the sparse and well
behaved design matrix, τ = 1/ log n can be used. If the true model is not very
sparse (s → ∞) or the design matrix does not behave well (γLS → 0), then
τ = λ can be considered.
Tuning the regularization parameter is a crucial issue for obtaining the oracle estimator. Wang et al. (2013) proposed the high dimensional BIC (HBIC) for choosing the regularization parameter of the calibrated CCCP, defined by
$$\mathrm{HBIC}(\lambda) = \log\big(\|y - X\hat\beta(\lambda)\|^2/n\big) + C_n \cdot \frac{\log p}{n} \cdot |M_\lambda|, \qquad (2.1)$$
where $M_\lambda = \{j : \hat\beta_j(\lambda) \neq 0\}$ and $C_n \to \infty$ (e.g., $C_n = \log n$ or $\log(\log n)$).
Furthermore, they proved that the oracle estimator can be attained with high probability, under some regularity conditions, by using the tuning parameter chosen by the HBIC:
$$\Pr\big(M_{\hat\lambda} = \{j : \beta_j^* \neq 0\}\big) \to 1,$$
where $\hat\lambda = \operatorname*{argmin}_{\lambda\in\{\lambda : |M_\lambda| \le K_n\}} \mathrm{HBIC}(\lambda)$. This criterion is an extension of the results of Chen and Chen (2008) and Kim et al. (2012). This selection consistency result can be applied to any method that satisfies the strong oracle property.
Corollary 1. Let $\mathrm{HBIC}(\lambda)$ be defined as in (2.1). Assume that the regularity conditions required for an estimator $\hat\beta(\lambda)$ to be the oracle estimator with probability tending to one hold, and that there exists a positive constant $\gamma$ such that
$$\lim_{n\to\infty}\ \min_{A \neq A_0,\ |A| \le K_n} n^{-1}\big\|(I_n - H_A)X_{A_0}\beta^*_{A_0}\big\|^2 \ge \gamma,$$
where $I_n$ denotes the $n \times n$ identity matrix and $H_A$ denotes the projection matrix onto the linear space spanned by $X_A$. If $C_n \to \infty$, $sC_n\log p/n = o(1)$, and $K_n^2\log p\log n/n = o(1)$, then
$$\Pr\big(M_{\hat\lambda} = \{j : \beta_j^* \neq 0\}\big) \to 1,$$
where $M_\lambda = \{j : \hat\beta_j(\lambda) \neq 0\}$ and $\hat\lambda = \operatorname*{argmin}_{\lambda\in\{\lambda : |M_\lambda| \le K_n\}} \mathrm{HBIC}(\lambda)$.
Corollary 1 is a generalization of Theorem 3.5 of Wang et al. (2013), and it follows directly from the proof of that theorem.
2.3 Two Stage Dantzig Selector
2.3.1 Method
We modify the one step LLA for the nonconvex penalized estimator by replacing the LLA step with an adaptive Dantzig selector type estimator:
$$\min_\beta \sum_{j=1}^{p}P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \qquad (2.2)$$
$$\text{subject to} \quad \Big|\frac{1}{n}X_j^T(y - X\beta)\Big| \le P'_\lambda(|\beta_j^{init}|), \quad j = 1, \ldots, p,$$
where $\beta_j^{init}$ is an initial estimate and $P'_\lambda(t)$ satisfies (P1)-(P3) defined in Section 2.2. We call this estimator the two stage Dantzig selector $\hat\beta^{TSDS}(\lambda)$. The initial estimate $\beta^{init}$ can be the LASSO or the Dantzig selector estimate with tuning parameter $\lambda_{init} = \lambda/\log n$ or $\lambda^2$.
2.3.2 Motivation
In order to achieve the oracle estimator, the key to the two stage method is removing the noise variables in the second stage. For example, consider the one step LLA with the SCAD penalty $P_\lambda$ and initial estimate $\beta^{init}$. Let $y_{n\times 1} = X_{n\times p}\beta^*_{p\times 1} + \varepsilon_{n\times 1}$, and consider a $\lambda$ such that $\min\{|\beta_j^*| : j \in A_0\} > (a+1)\lambda$, where $A_0 = \{j : \beta_j^* \neq 0\}$ and $a$ is the parameter of the SCAD penalty. Suppose that $\|\beta^{init} - \beta^*\|_\infty < \lambda$. Then the one step LLA estimator is defined by
$$\hat\beta(\lambda) = \operatorname*{argmin}_{\beta}\left\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p}P'_\lambda(|\beta_j^{init}|)\,|\beta_j|\right\},$$
or equivalently,
$$\hat\beta(\lambda)_{A_0^c} = \operatorname*{argmin}_{\beta_{A_0^c}}\left\{\|r_{A_0} - X_{A_0^c}\beta_{A_0^c}\|^2/2n + \sum_{j\in A_0^c}\lambda|\beta_j|\right\},$$
$$\hat\beta(\lambda)_{A_0} = (X_{A_0}^TX_{A_0})^{-1}X_{A_0}^T\big(y - X_{A_0^c}\hat\beta(\lambda)_{A_0^c}\big),$$
where $r_{A_0} = y - X_{A_0}\hat\beta(\lambda)_{A_0}$. Hence the aim of the second stage is to delete the noise variables, after which the one step LLA acquires the oracle estimator. We therefore focus on deleting, or lessening the effect of, the noise variables, and some properties of the Dantzig selector suggest its superiority over the LASSO for this purpose.
The ℓ1 norm of the Dantzig selector is always less than or equal to that of the LASSO estimator, because the Dantzig selector is defined as the minimizer of $\|\beta\|_1$ subject to $\|X^T(y - X\beta)\|_\infty \le \lambda$ and the LASSO estimator satisfies this constraint. Recall that the LASSO estimate $\hat\beta^{LS}(\lambda)$ is always in the feasible set of the Dantzig selector $\hat\beta^{DS}(\lambda)$ because of the KKT conditions for the Lagrangian form of the LASSO in Section 2.2.1. If there are no signal variables, then the mean absolute deviation of the Dantzig selector is less than or equal to that of the LASSO for the same regularization parameter $\lambda$: $\|\hat\beta^{DS}(\lambda) - \beta^*\|_1 \le \|\hat\beta^{LS}(\lambda) - \beta^*\|_1$. Furthermore, according to Bickel et al. (2009), the non-asymptotic $\ell_q$ error bounds of the Dantzig selector are smaller than those of the LASSO for $1 \le q \le 2$. If the mean squared error (MSE) of the Dantzig selector is smaller than the MSE of the LASSO in the no-signal setting, then the two stage methods can be improved in terms of MSE by using the Dantzig selector. Hence we conduct a simulation to check whether the MSE of the Dantzig selector is less than the MSE of the LASSO estimator in the no-signal setting.
We simulate whether the ℓ2 error of the Dantzig selector tends to be smaller than that of the LASSO in a no-signal regression setting. Let $y_{100\times 1} = X_{100\times 1000}\beta_{1000\times 1} + \varepsilon_{100\times 1}$, where $\beta = (0, \ldots, 0)^T$, the rows of $X$ are drawn from $N(0, \Sigma)$ with $\Sigma_{ij} = R^{|i-j|}$, and $\varepsilon \sim N(0, I)$. The simulation is conducted as follows:
1. For 20 tuning parameters, fit the LASSO and the Dantzig selector and calculate the MSE;
2. Repeat step 1 100 times and test $H_1: \mathrm{MSE}_{LASSO} > \mathrm{MSE}_{Dantzig}$.
Let $Z = \mathrm{MSE}_{LASSO} - \mathrm{MSE}_{Dantzig}$. In the case of $R = 0.3$, $\bar{Z} = 0.00093$, $\mathrm{sd}(\bar{Z}) = 0.00014$, and the p-value is numerically zero. For $R = 0.5$, $\bar{Z} = 0.00122$, $\mathrm{sd}(\bar{Z}) = 0.00012$, and the p-value is again numerically zero. Even in the case of $R = 0$, $\bar{Z} = 0.0006$ and $\mathrm{sd}(\bar{Z}) = 0.0001$. Therefore we can conclude that the overall MSE of the Dantzig selector tends to be smaller than the MSE of the LASSO for the same tuning parameter. The two stage method with the Dantzig selector can thus improve the estimation efficiency while retaining the global oracle property of the two stage method with the LASSO.
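A compact version of this experiment can be sketched as follows, reusing the dantzig_selector LP helper from Section 2.2.1 and scikit-learn's Lasso; the tuning grid, seed, and paired one-sided t-test are illustrative choices, and the plain LP solver is slow for $p = 1000$.

```python
import numpy as np
from sklearn.linear_model import Lasso
from scipy.stats import ttest_1samp

def mse_gap_experiment(R=0.3, n=100, p=1000, n_rep=100, n_lam=20, seed=0):
    """Paired comparison of MSE(LASSO) - MSE(Dantzig) in the no-signal model."""
    rng = np.random.default_rng(seed)
    cov = R ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(p))
    lams = np.linspace(0.05, 1.0, n_lam)
    gaps = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, p)) @ L.T
        y = rng.standard_normal(n)              # beta* = 0, pure noise
        for lam in lams:
            b_ls = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
            b_ds = dantzig_selector(X, y, lam)
            gaps.append(np.sum(b_ls ** 2) - np.sum(b_ds ** 2))
    gaps = np.asarray(gaps)
    return gaps.mean(), ttest_1samp(gaps, 0.0, alternative="greater").pvalue
```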
The relationship between the Dantzig selector and the LASSO can be ex-
tended to the relationship between adaptive Dantzig selector (Dicker, 2010)
and the adaptive LASSO (Zou, 2006). The adaptive LASSO is defined as
$$\min_\beta\ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\sum_{j=1}^{p}w_j|\beta_j|,$$
with, e.g., $w_j = |\hat\beta_j^{ls}|^{-1}$. Similar to the adaptive LASSO, the adaptive Dantzig selector is defined as
$$\min_\beta \sum_{j=1}^{p}w_j|\beta_j| \quad \text{subject to} \quad \Big|\frac{1}{n}X_j^T(y - X\beta)\Big| \le w_j\lambda, \quad j = 1, \ldots, p.$$
Its form can be derived from the derivative of the objective function of the adaptive LASSO; for details, see the thesis of Dicker (2010), where it is also shown that the adaptive Dantzig selector and the adaptive LASSO have the same asymptotic properties.
The relation between the Dantzig selector and the adaptive Dantzig selector is illustrated in Figure 2.6. The adaptive Dantzig selector can relieve the bias problem of the Dantzig selector and gives a unique solution (Dicker, 2010). We apply this adaptive Dantzig selector with $w_j = P'_\lambda(|\beta_j^{init}|)/\lambda$ in the second stage of the two stage method. The difference between the adaptive Dantzig selector and our proposed method is that in our method the weight depends on the tuning parameter $\lambda$.
[Figure 2.6: Adaptive DS]
2.3.3 Theoretical properties
In this section we prove the global oracle property of the two stage Dantzig
selector under the following regularity conditions:
(A1) The random errors $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ are i.i.d. mean-zero sub-Gaussian($\sigma$) with scale factor $\sigma > 0$, i.e., $E[\exp(t\varepsilon_i)] \le \exp(\sigma^2 t^2/2)$.
(A2) $\eta_{\min}(X_{A_0}^TX_{A_0}) > 0$, where $\eta_{\min}(B)$ is the minimum eigenvalue of $B$.
(A3) The design matrix $X$ satisfies
$$\gamma = \min_{\substack{\theta \neq 0 \\ \|\theta_{A_0^c}\|_1 \le \alpha\|\theta_{A_0}\|_1}} \frac{\|X\theta\|_2}{\sqrt{n}\,\|\theta_{A_0}\|_2} > 0,$$
where $\alpha = 3$ for the LASSO initial estimator and $\alpha = 1$ for the Dantzig selector initial estimator.
The main theorem shows that the two stage Dantzig selector with a good initial estimator is equivalent to the oracle estimator whenever the oracle estimator satisfies the constraint of the two stage Dantzig selector.
Theorem 1. Assume that $\min_{j\in A_0}|\beta_j^*| > (a+1)\lambda$, where $A_0 = \{j : \beta_j^* \neq 0\}$. Let $F_{n0} = \{\|\beta^{init} - \beta^*\|_\infty \le a_0\lambda\}$, where $a_0 = \min\{1, a_2\}$, and $F_{n1} = \{|\frac{1}{n}X_j^T(y - X\hat\beta^{(o)})| \le P'_\lambda(|\beta_j^{init}|)\ \forall j\}$, where $\hat\beta^{(o)}$ is the oracle estimator. Under the event $F_{n0}\cap F_{n1}$, the two stage Dantzig selector is equivalent to the oracle estimator.
Proof of Theorem 1. Under the event $F_{n0}$, $\min_{j\in A_0}|\beta_j^{init}| > a\lambda$ because $\min_{j\in A_0}|\beta_j^*| > (a+1)\lambda$, and hence $P'_\lambda(|\beta_j^{init}|) = 0$ for all $j \in A_0$. Next, $P'_\lambda(|\beta_j^{init}|) > 0$ for all $j \in A_0^c$, because $\max_{j\in A_0^c}|\beta_j^{init}| \le a_0\lambda \le a_2\lambda$. Under the event $F_{n1}$, the oracle estimator is in the feasible set of the two stage Dantzig selector. Therefore, under $F_{n0}\cap F_{n1}$, the minimizer $\hat\beta$ of (2.2) must be the oracle estimator, since $P'_\lambda(|\beta_j^{init}|) = 0$ for $j \in A_0$ and $\hat\beta_j$ for $j \in A_0^c$ can be set to zero.
The following corollaries show that the two stage Dantzig selector satisfies the global oracle property with a LASSO or Dantzig selector initial estimator under the regularity conditions (A1)-(A3). Condition (A1) implies that
$$\Pr(|a^T\varepsilon| > t) \le 2\exp\Big(-\frac{t^2}{2\sigma^2\|a\|^2}\Big)$$
for $t \ge 0$ and $a = (a_1, \ldots, a_n)^T$. Condition (A2) means that the signal covariates are not severely correlated. Condition (A3) is needed for the bound $\|\beta^{init} - \beta^*\|_\infty \le a_0\lambda$ to hold for the LASSO or the Dantzig selector initial estimator.
Corollary 2. Assume that regularity conditions (A1)-(A3) hold. Let the initial estimator be the LASSO estimator $\hat\beta^{LS}(\tau\lambda)$ with regularization parameter $\tau\lambda$.
(i) If $\frac{\min_{j\in A_0}|\beta_j^*|}{a+1} > \lambda > 2\sqrt{2}\,\sigma\sqrt{\frac{M\log p}{n}}\,\frac{1}{\tau}$ and $16\tau\gamma^{-2}\sqrt{s} < a_0$, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{LS}(\tau\lambda), \lambda) = \hat\beta^{(o)}\big) \ge 1 - p_0 - p_1,$$
where $a_0 = \min\{1, a_2\}$, $p_0 = \Pr\big(\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty > 16\tau\lambda\gamma^{-2}\sqrt{s}\big) \le 2p\exp\big(-\frac{n\tau^2\lambda^2}{8M\sigma^2}\big)$, and $p_1 = \Pr(F_{n1}^c) \le 2(p-s)\exp\big(-\frac{na_1^2\lambda^2}{2\sigma^2 M}\big)$ with $M = \max_{j\in A_0^c}\|X_j\|_2^2/n$.
(ii) If $n\tau^2\lambda^2 \to \infty$, $\log p = o(n\tau^2\lambda^2)$, and $16\tau\gamma^{-2}\sqrt{s} < a_0$, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{LS}(\tau\lambda), \lambda) = \hat\beta^{(o)}\big) \to 1 \quad \text{as } n \to \infty.$$
Corollary 3. Assume that regularity conditions (A1)-(A3) hold. Let the initial estimator be the Dantzig selector $\hat\beta^{DS}(\tau\lambda)$ with regularization parameter $\tau\lambda$.
(i) If $\frac{\min_{j\in A_0}|\beta_j^*|}{a+1} > \lambda > \sigma\sqrt{\frac{2M\log p}{n}}\,\frac{1}{\tau}$ and $8\tau\gamma^{-2}\sqrt{s} < a_0$, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{DS}(\tau\lambda), \lambda) = \hat\beta^{(o)}\big) \ge 1 - p_0 - p_1,$$
where $a_0 = \min\{1, a_2\}$, $p_0 = \Pr\big(\|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_\infty > 8\tau\lambda\gamma^{-2}\sqrt{s}\big) \le 2p\exp\big(-\frac{n\tau^2\lambda^2}{2M\sigma^2}\big)$, and $p_1 = \Pr(F_{n1}^c) \le 2(p-s)\exp\big(-\frac{na_1^2\lambda^2}{2\sigma^2 M}\big)$ with $M = \max_{j\in A_0^c}\|X_j\|_2^2/n$.
(ii) If $n\tau^2\lambda^2 \to \infty$, $\log p = o(n\tau^2\lambda^2)$, and $8\tau\gamma^{-2}\sqrt{s} < a_0$, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{DS}(\tau\lambda), \lambda) = \hat\beta^{(o)}\big) \to 1 \quad \text{as } n \to \infty.$$
Proof of Corollary 2 and Corollary 3. We first prove that the oracle estimator $\hat\beta^{(o)}$ satisfies the constraint of the two stage Dantzig selector with probability at least $1 - 2(p-s)\exp\big(-\frac{na_1^2\lambda^2}{2\sigma^2 M}\big)$. Denote by $\beta_{A_0}$ the $|A_0|$-length sub-vector of $\beta$ containing only the $A_0$ members of $\beta$, and let $H_{A_0} = X_{A_0}(X_{A_0}^TX_{A_0})^{-1}X_{A_0}^T$. Then
$$\Pr(F_{n1}^c) \le \sum_{j\in A_0^c}\Pr\Big(\Big|\frac{1}{n}X_j^T(I - H_{A_0})\varepsilon\Big| > P'_\lambda(|\beta_j^{init}|)\Big) \le \sum_{j\in A_0^c}2\exp\Big(-\frac{nP'_\lambda(|\beta_j^{init}|)^2}{2\sigma^2 M}\Big) \le 2(p-s)\exp\Big(-\frac{na_1^2\lambda^2}{2\sigma^2 M}\Big),$$
because of the assumption that $P'_\lambda(t) \ge a_1\lambda$ for $t \le a_2\lambda$, the regularity condition (A1), and $\|\frac{1}{n}X_j^T(I - H_{A_0})\|_2^2 \le M/n$ for all $j \in A_0^c$.
Given Theorem 1 and the above bound on $p_1$, it only remains to establish the upper bound on the probability $p_0$ related to the initial estimator. We can use the results of Bickel et al. (2009) or Negahban et al. (2012) to obtain an estimation bound for $\beta^{init} - \beta^*$ for the LASSO and the Dantzig selector. Bickel et al. (2009) showed the asymptotic equivalence of the LASSO and the Dantzig selector, giving non-asymptotic $\ell_q$ error bounds under the restricted eigenvalue condition and a normal error assumption. For the Dantzig selector, with probability at least $1 - \exp\big(-\frac{n\tau^2\lambda^2}{2\sigma^2 M}\big)$,
$$\|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_{\ell_2} \le \frac{8\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with $\tau\lambda > \sigma\sqrt{2M\log p/n}$. For the LASSO estimator, with probability at least $1 - \exp\big(-\frac{n\tau^2\lambda^2}{8\sigma^2 M}\big)$,
$$\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_{\ell_2} \le \frac{16\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with $\tau\lambda > \sigma\sqrt{8M\log p/n}$. Corollary 2 of Negahban et al. (2012) showed that for the LASSO estimator, with probability at least $1 - \exp\big(-\frac{n\tau^2\lambda^2}{2\sigma^2 M}\big)$,
$$\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_{\ell_2} \le \frac{2\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with $\tau\lambda \ge 4\sigma\sqrt{M\log p/n}$. From these results, the upper bounds for $\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty$ and $\|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_\infty$ are verified.
Remark (Comments on the regularity conditions on the design matrix). Define the restricted eigenvalue (RE) condition as follows. A $p \times p$ sample covariance matrix $X^TX/n$ satisfies the RE condition of order $k$ with parameters $(\alpha, \gamma)$ if
$$\frac{1}{n}\theta^TX^TX\theta \ge \gamma\|\theta\|_2^2 \quad \forall\,\theta \in C(B, \alpha),$$
where $C(B,\alpha) = \{\theta \in \mathbb{R}^p : \|\theta_{B^c}\|_1 \le \alpha\|\theta_B\|_1\}$, for all subsets $B \subset \{1, \ldots, p\}$ such that $|B| = k$. The RE condition is a weak and general regularity condition for achieving the optimal $\ell_2$ error bound of order $\sqrt{s\log p/n}$ for $\ell_1$ regularization methods such as the LASSO ($\alpha = 3$) and the Dantzig selector ($\alpha = 1$). A series of works has been concerned with which conditions are necessary to guarantee these optimal error bounds.
The restricted isometry property (RIP) (Candes and Tao, 2005) is defined as follows. $X$ is said to satisfy the $s$-restricted isometry property with restricted isometry constant $\delta_s$ if there exists a constant $\delta_s$ such that, for every $T \subset \{1, \ldots, p\}$ with $|T| \le s$, the $n \times |T|$ submatrix $X_T$ of $X$, and every $u \in \mathbb{R}^{|T|}$,
$$(1 - \delta_s)\|u\|_2^2 \le \|X_Tu\|_2^2 \le (1 + \delta_s)\|u\|_2^2.$$
The $s, s'$-restricted orthogonality constant $\theta_{s,s'}$ for $s + s' \le p$ is defined as the smallest quantity such that
$$|\langle X_Tu, X_{T'}u'\rangle| \le \theta_{s,s'}\,\|u\|_2\|u'\|_2$$
for all $T, T' \subset \{1, \ldots, p\}$ such that $T \cap T' = \emptyset$, $|T| \le s$, and $|T'| \le s'$. $X$ satisfies the uniform uncertainty principle (UUP) (Candes and Tao, 2007) if $\delta_{2s} + \theta_{s,2s} < 1$, which means that for every $s$-sparse set $T$, the columns of the matrix corresponding to $T$ are almost orthogonal.
The RIP and UUP conditions are earlier conditions which are very restrictive. They cover independent variables drawn from Gaussian or Bernoulli distributions (Candes and Tao, 2007), but they cannot deal with substantial dependency. Raskutti et al. (2010) showed that a design matrix whose rows are independently distributed from $N(0, \Sigma)$ satisfies the RE condition with sample size $n = O(s\log p)$ with respect to $\Sigma$. Sample covariance matrices with $\Sigma$ including Toeplitz matrices, the spiked identity model, or highly degenerate covariance matrices satisfy the RE condition (Raskutti et al., 2010). Rudelson and Zhou (2013) extended this to sub-Gaussian designs with substantial dependency.
Bickel et al. (2009) showed that, in the more general setting $X \overset{iid}{\sim} (0, \Sigma)$, if $\phi_{\min}(s\log n) > c/\log n$ then $X^TX/n$ satisfies RE$(\alpha, \gamma)$ of order $s$, where
$$\gamma^2 = \sqrt{\phi_{\min}(s\log n)}\left(1 - c_0\sqrt{\frac{s\,\phi_{\max}(s\log n - s)}{(s\log n - s)\,\phi_{\min}(s\log n)}}\right)$$
and
$$\phi_{\min}(m) = \min_{1 \le \|\theta\|_0 \le m}\frac{\theta^TX^TX\theta}{\|\theta\|_2^2} \quad \text{and} \quad \phi_{\max}(m) = \max_{1 \le \|\theta\|_0 \le m}\frac{\theta^TX^TX\theta}{\|\theta\|_2^2}.$$
The condition $\phi_{\min}(s\log n) > c/\log n$ holds for $s < \sqrt{n}\,\log^{-3/2}n$ (Kim et al., 2012; Greenshtein and Ritov, 2004). Hence, the RE condition can hold for a large class of $\Sigma$ even when the RIP condition fails with probability converging to one. For more discussion of the regularity conditions, see Bickel et al. (2009), Negahban et al. (2012), or Zhang and Zhang (2012).
2.3.4 Algorithm
The two stage Dantzig selector $\hat\beta^{TSDS}(\beta^{init}, \lambda)$ can be computed as follows. Let $A = \{j : |\beta_j^{init}| > a\lambda\}$; then $\hat\beta^{TSDS}_{A^c}(\beta^{init}, \lambda)$ can be calculated by
$$\min_{\beta_{A^c}}\sum_{j\in A^c}P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \quad \text{subject to} \quad \Big|\frac{1}{n}X_j^T(I - H_A)(y - X_{A^c}\beta_{A^c})\Big| \le P'_\lambda(|\beta_j^{init}|) \ \text{ for } j \in A^c,$$
and
$$\hat\beta^{TSDS}_A(\beta^{init}, \lambda) = (X_A^TX_A)^{-1}X_A^T\big(y - X_{A^c}\hat\beta^{TSDS}_{A^c}(\beta^{init}, \lambda)\big).$$
Similarly to the LLA algorithm, set $\tilde\beta = W\beta_{A^c}$ and $\tilde{X} = (I - H_A)X_{A^c}W^{-1}$, where $W$ is the diagonal matrix whose entries are $P'_\lambda(|\beta_j^{init}|)$ for $j \in A^c$. Then the above optimization for $\hat\beta^{TSDS}_{A^c}(\beta^{init}, \lambda)$ becomes the standard Dantzig selector problem
$$\min \|\tilde\beta\|_1 \quad \text{subject to} \quad \Big\|\frac{1}{n}\tilde{X}^T(y - \tilde{X}\tilde\beta)\Big\|_\infty \le 1.$$
Hence, we can use the same algorithms as for the Dantzig selector, such as the generalized primal-dual interior point algorithm (Candes and Romberg, 2005), the Dantzig selector with sequential optimization (DASSO) (James et al., 2009), and the alternating direction method (ADM) (Lu et al., 2012). We briefly review several popular algorithms for the Dantzig selector in the Appendix.
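The reduction above is straightforward to put into code. The following is a minimal sketch, reusing the dantzig_selector LP helper and the scad_deriv weight function from the earlier sketches; the names and the plain LP backend are illustrative, not the thesis's implementation.

```python
import numpy as np

def two_stage_dantzig(X, y, beta_init, lam, a=3.7):
    """Two stage Dantzig selector: weighted Dantzig problem on the penalized
    set A^c after projecting out the unpenalized set A, then OLS refit on A."""
    n, p = X.shape
    w = scad_deriv(np.abs(beta_init), lam, a)   # P'_lambda(|beta_init_j|)
    A = w == 0                                  # unpenalized (strong signal) set
    Ac = ~A
    beta = np.zeros(p)
    if Ac.any():
        XA = X[:, A]
        H = XA @ np.linalg.pinv(XA) if A.any() else np.zeros((n, n))
        Xt = (X[:, Ac] - H @ X[:, Ac]) / w[Ac]   # tilde X = (I - H_A) X_{A^c} W^{-1}
        yt = y - H @ y                           # projected response
        bt = dantzig_selector(Xt, yt, 1.0)       # standard Dantzig problem, lambda = 1
        beta[Ac] = bt / w[Ac]                    # undo the reparametrization
    if A.any():
        beta[A] = np.linalg.lstsq(X[:, A], y - X[:, Ac] @ beta[Ac], rcond=None)[0]
    return beta
```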
2.3.5 Tuning regularization parameter
Recall the HBIC of Wang et al. (2013),
$$\mathrm{HBIC}(\lambda) = \log\big(\|y - X\hat\beta(\lambda)\|^2/n\big) + C_n \cdot \frac{\log p}{n} \cdot |M_\lambda|,$$
where $M_\lambda = \{j : \hat\beta_j(\lambda) \neq 0\}$ and $C_n \to \infty$ (e.g., $C_n = \log n$ or $\log(\log n)$).
According to Corollary 1 in Section 2.2.3, we can use the HBIC to tune the regularization parameter, because our proposed method satisfies the global oracle property.
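For concreteness, a small helper for the HBIC criterion with the $C_n = \log(\log n)$ choice used in the simulations is sketched below; the function and variable names are ours.

```python
import numpy as np

def hbic(X, y, beta, Cn=None):
    """High dimensional BIC of Wang et al. (2013):
    log(||y - X beta||^2 / n) + Cn * (log p / n) * #{j: beta_j != 0}."""
    n, p = X.shape
    if Cn is None:
        Cn = np.log(np.log(n))
    rss = np.sum((y - X @ beta) ** 2)
    df = np.count_nonzero(beta)
    return np.log(rss / n) + Cn * np.log(p) / n * df
```

The tuning parameter is then chosen as the minimizer of this criterion over a grid of $\lambda$ values, restricted to models with at most $K_n$ selected variables.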
2.4 Numerical analyses
In this chapter, we investigate the performance of the proposed two stage Dantzig selector (TSDS). We suppose there are $p$ covariate variables $x_1, \ldots, x_p$. The goal of these numerical studies is to evaluate how well the methods perform in terms of variable selection and estimation accuracy. We consider the linear regression model
$$y = x^T\beta + \varepsilon,$$
where $x = (x_1, \ldots, x_p)^T$ and $\beta \in \mathbb{R}^p$ is a coefficient vector.
We compare the proposed TSDS with other methods, including the LASSO, the adaptive LASSO, the Dantzig selector, the adaptive Dantzig selector, the SCAD, the MCP, and the two stage methods based on the LASSO, with respect to selection and estimation. For the SCAD and the MCP, $a = 2.1$ and $a = 1.5$ are considered, respectively. The two stage methods use the SCAD penalty with $a = 2.1$. Regarding the tuning parameter, five-fold cross validation is used for the LASSO, the adaptive LASSO, the Dantzig selector, and the adaptive Dantzig selector. For the SCAD, the MCP, and the two stage methods, the high-dimensional BIC (HBIC) is used, defined as
$$\mathrm{HBIC} = \log\big(\|y - X\hat\beta\|^2/n\big) + \log(\log n) \cdot \frac{\log p}{n} \cdot |\{j : \hat\beta_j \neq 0\}|.$$
The LASSO and the Dantzig selector are used as initial estimators with tuning parameter $\lambda/\log n$ in the two stage methods. The LARS algorithm is used for the LASSO and the adaptive LASSO, and the primal-dual interior point algorithm is used for the Dantzig selector and the adaptive Dantzig selector. The CCCP algorithm is used for the SCAD and the MCP estimators, and the calibrated CCCP (Wang et al., 2013) is used for the two stage method based on the LASSO.
2.4.1 Simulations
In this section, we consider two simulation settings. For each experimental setting, we replicate the simulation 100 times. We simulate data from the true model
$$y = \sum_{j=1}^{p}X_j\beta_j + \varepsilon, \qquad \varepsilon \sim N(0, 2^2),$$
where $p = 1000$ and the number of observations is $n = 100$. Covariates are generated from the normal distribution with zero mean and the covariance of $x_i$ and $x_j$ equal to $R^{|i-j|}$, $i, j = 1, \ldots, p$. For each simulation setting, we generate data with $R = 0.3$ and $0.5$.
with R = 0.3 and 0.5.
Base on 100 replications, the following statistics are measured for compar-
ision: the average number of falsely estimated non-zero coefficient (FP); the
average number of falsely estimated zero coefficient (FN); the proportion of the
true model exactly identified (TM); MSE=∑100
m=1 ‖β(m)−β∗‖2/100. In the re-
sults of the two stage methods, ”LS+LS”, for example, the first ”LS” refers to
the initial estimator and the last ”LS” refers to the two stage method based on
the LASSO while ”DS+DS” refers to the two stage method based on Dantzig
selector with Dantzig selector initial. The results of total four combinations of
two stage methods using the LASSO and the Dantzig selector are represented
in the following tables.
Example 1. We simulate 100 data sets under the above setting with the true coefficient vector
$$\beta = (\underbrace{3,\ 1.5,\ 0,\ 0,\ 2}_{5},\ \underbrace{0, \ldots, 0}_{p-5})^T.$$
Table 2.1: Example 1 (R=0.3)
Methods FP FN TM MSE
LASSO 25.38(9.019) 0(0) 0 1.040(0.501)
ALASSO 23.11(7.938) 0(0) 0 2.314(0.587)
Dantzig 18.48(10.323) 0(0) 0 1.044(0.530)
ADantzig 15.69(8.284) 0(0) 0 1.859(0.689)
MCP 2.12(1.719) 0.01(0.1) 0.16 0.463(0.479)
SCAD 1.36(1.382) 0.01(0.1) 0.26 0.389(0.606)
LS+LS 1.39(1.550) 0.02(0.141) 0.3 0.405(0.593)
LS+DS 1.39(1.550) 0.02(0.141) 0.3 0.404(0.594)
DS+LS 1.32(1.455) 0.02(0.141) 0.3 0.397(0.574)
DS+DS 1.31(1.461) 0.02(0.141) 0.31 0.394(0.574)
Table 2.2: Example 1 (R=0.5)
Methods FP FN TM MSE
LASSO 24.55(9.632) 0(0) 0 0.926(0.453)
ALASSO 21.83(8.263) 0(0) 0 2.290(0.648)
Dantzig 17.3(9.161) 0(0) 0.01 0.859(0.424)
ADantzig 17.17(9.135) 0(0) 0.01 0.934(0.517)
MCP 1.97(1.598) 0.03(0.171) 0.23 0.643(0.880)
SCAD 1.23(1.270) 0.04(0.197) 0.3 0.780(0.981)
LS+LS 1.27(1.370) 0.03(0.171) 0.33 0.578(0.793)
LS+DS 1.24(1.319) 0.03(0.171) 0.33 0.564(0.793)
DS+LS 1.29(1.387) 0.03(0.171) 0.32 0.555(0.794)
DS+DS 1.24(1.296) 0.03(0.171) 0.32 0.549(0.791)
Example 2. We simulate 100 data sets under the above setting with the true coefficient vector
$$\beta = \big((\underbrace{3,\ 1.5,\ 0,\ 0,\ 2}_{5},\ \underbrace{0, \ldots, 0}_{15}) \times 5,\ \underbrace{0, \ldots, 0}_{p-100}\big)^T,$$
i.e., the block of 20 coefficients is repeated 5 times and the remaining $p - 100$ coefficients are zero.
Table 2.3: Example 2 (R=0.3)
Methods FP FN TM MSE
LASSO 25.96(1.370) 1.03(1.283) 0 18.742(9.205)
ALASSO 20.89(4.479) 1.08(1.398) 0 11.147(8.665)
Dantzig 24.18(3.583) 1.93(1.771) 0 24.685(11.391)
ADantzig 23.9(4.135) 1.94(1.802) 0 15.555(11.128)
MCP 18(8.957) 1.95(3.468) 0.01 21.891(34.190)
SCAD 4.58(6.240) 1.28(2.958) 0.09 11.336(25.522)
LS+LS 6.89(9.331) 0.71(1.866) 0.05 7.736(9.773)
LS+DS 5.24(5.826) 0.71(1.903) 0.03 7.558(9.925)
DS+LS 6.74(8.73) 0.69(1.846) 0.06 7.648(9.972)
DS+DS 4.67(5.924) 0.52(1.573) 0.15 7.055(8.234)
Table 2.4: Example 2 (R=0.5)
Methods FP FN TM MSE
LASSO 25.32(0.827) 0.32(0.827) 0 10.676(6.274)
ALASSO 19.89(4.325) 0.34(0.831) 0 6.656(5.120)
Dantzig 24.29(1.546) 0.6(0.974) 0 14.284(7.890)
ADantzig 22.76(4.656) 0.61(0.984) 0 6.912(6.382)
MCP 4.33(4.551) 2.38(2.534) 0.09 14.508(19.203)
SCAD 3.35(3.056) 1.89(2.287) 0.05 12.936(14.131)
LS+LS 4.56(6.609) 0.57(1.358) 0.07 5.467(6.115)
LS+DS 3.52(2.921) 0.58(1.365) 0.08 5.339(6.158)
DS+LS 3.97(4.239) 0.58(1.387) 0.07 5.078(6.161)
DS+DS 3.44(4.613) 0.42(1.249) 0.14 4.902(5.746)
2.4.2 Real data analysis
We analyze the data set of Scheetz et al. (2006) containing 18,976 gene expression levels from 120 rats. The objective of this analysis is to find the genes correlated with the gene TRIM32, which is known to cause Bardet-Biedl syndrome. Many previous works (Huang et al., 2008b; Kim et al., 2008; Wang et al., 2013) analyzed this data set. Following these papers, we first select the 3,000 genes with the largest variance in expression level and then select, among them, the 1,000 genes most correlated with TRIM32. With this data set, we focus on the comparison between the two stage methods, because the comparison between the two stage method with the LASSO and the other methods has already been done by Wang et al. (2013), and assessing the improvement of the TSDS over the previous two stage methods is our main interest. The results are given in Table 2.5.
Table 2.5: Real Data (TRIM)
Methods #{j : β̂j ≠ 0} PE
LS+LS 11.37 0.813
LS+DS 10.47 0.806
DS+LS 8.74 0.857
DS+DS 8.41 0.83
2.5 Conclusion
In this chapter, we propose a two stage method based on the Dantzig selector, which we call the two stage Dantzig selector. We prove that the two stage Dantzig selector can obtain the oracle estimator under regularity conditions. The proposed method can be easily implemented with general algorithms for the standard Dantzig selector. The numerical results support our contention that the Dantzig selector used in our method can improve variable selection and estimation by lessening the effects of the noise variables more efficiently than the LASSO. Furthermore, the numerical results show that our proposed method outperforms other sparse regularization methods with respect to variable selection and estimation.
Chapter 3
Two Stage Methods for
Precision Matrix Estimation
3.1 Introduction
Precision matrix (inverse covariance matrix) estimation is an important problem in high dimensional statistical analysis and is useful for various applications such as the Gaussian graphical model, gene classification, optimal portfolio allocation, and speech recognition. Under the normality assumption, suppose $X = (X_1, \ldots, X_p) \sim N(\mu, \Sigma)$; then the zero elements of the precision matrix $\Omega = (\omega_{ij})_{p\times p}$ imply conditional independences between variables, that is, $\omega_{ij} = 0$ if and only if $X_i$ and $X_j$ are independent given $X\setminus\{X_i, X_j\}$. Therefore the support of the precision matrix is related to the structure of the undirected Gaussian graph $G = (V, E)$ with vertex set $V = \{X_1, \ldots, X_p\}$ and edge set $E$ with $E^c = \{(i,j) : X_i \perp\!\!\!\perp X_j \mid X\setminus\{X_i, X_j\}\}$, where $\perp\!\!\!\perp$ denotes independence. In the high
dimensional setting, classical methods such as Gaussian graphical model and
inverse of sample covariance matrix cannot provide stable estimate of preci-
sion matrix and additional restrictions should be imposed to get stable and
accurate precision matrix estimation. Hence many regularized methods for pre-
cision matrix estimation are developed based on the relationship between pre-
cision matrix and the Gaussian graphical model. There are two frameworks in
those regularized methods which are regression based approach and maximum
likelihood approach. Meinshausen and Buhlmann (2006) introduced penalized
neighborhood regression model with LASSO penalty. They fitted each variable
on its neighborhood with LASSO penalty and aggregated the results. Peng
et al. (2009) proposed joint neighborhood LASSO selection method which si-
multaneously performed neighborhood selection for all variables. Yuan (2010)
adopted the Dantzig selector for the regression based approach and established its convergence rate. Yuan and Lin (2007) proposed a penalized maximum likelihood method with the LASSO penalty, and Friedman et al. (2008) introduced an efficient algorithm called the graphical LASSO (glasso) for the penalized maximum likelihood method with the LASSO, using a blockwise coordinate descent algorithm (Banerjee
et al., 2008). Fan et al. (2009) dealt with the bias problem of the LASSO penal-
ization and proposed new penalized likelihood methods with adaptive LASSO
and SCAD penalty and the convergence rates of non-convex penalized methods
are shown in Lam and Fan (2009). Cai et al. (2011) proposed a constrained `1
minimization method called CLIME and showed its convergence rates under
various matrix norms.
Most of the existing sparse precision matrix estimators which use ℓ1 regularization, including the LASSO or the Dantzig selector, suffer from selection inconsistency and biased estimation. Although penalized likelihood estimation with the SCAD penalty can achieve selection consistency and an unbiased estimator, it takes considerable time to converge to a local minimizer, and it cannot guarantee that the local minimizer is the oracle estimator. In this chapter, we especially focus on selection, i.e., the correct recovery of the support of the precision matrix. We propose two stage methods based on the LASSO or the Dantzig selector which can correctly recover the support of the precision matrix with high probability under some regularity conditions.
3.2 Estimation of precision matrix via columnwise two-stage methods
Suppose $(X_1, \ldots, X_p)$ are jointly generated with mean $\mu = (\mu_1, \ldots, \mu_p)'$ and covariance matrix $\Sigma^*$. It is well known (e.g., Lemma 1 of Peng et al. (2009)) that if, for $i = 1, \ldots, p$, we write
$$X_i = \mu_i + \sum_{j\neq i}\beta_{ij}^*X_j + \varepsilon_i,$$
then $(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)$ and $\varepsilon_i$ are uncorrelated if and only if $\beta_{ij}^* = -\omega_{ij}^*/\omega_{ii}^*$, where $\Sigma^{*-1} = \Omega^* = (\omega_{ij}^*)$ is the precision matrix. Furthermore, with those $\beta_{ij}^*$, $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = \omega_{ij}^*/(\omega_{ii}^*\omega_{jj}^*)$ and $\mathrm{var}(\varepsilon_i) = 1/\omega_{ii}^*$. Under the normality assumption, the aforementioned uncorrelatedness can be replaced by independence. This regression based approach has been used in various methods, including Meinshausen and Buhlmann (2006), Peng et al. (2009), and Yuan (2010), with the LASSO or the Dantzig selector. We use this relationship to estimate a sparse precision matrix via two stage regression methods based on the LASSO estimator, such as the calibrated CCCP (Wang et al., 2013) and the one step LLA (Fan et al., 2012), or based on the Dantzig selector, called the two stage Dantzig selector.
3.2.1 Two stage method based on LASSO
We first briefly introduce the one step LLA (Zou and Li, 2008; Fan et al., 2012) as a two stage method based on the LASSO, and then we apply the one step LLA to estimate the precision matrix. Consider a penalized regression problem,
$$\min_{\beta\in\mathbb{R}^p}\left\{\frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p}P_\lambda(|\beta_j|)\right\},$$
where $y$ is the response vector, $X = (X_1, \ldots, X_p)$ is an $n \times p$ covariate matrix with $X_i = (X_{1i}, \ldots, X_{ni})^T$, $\beta = (\beta_1, \ldots, \beta_p)^T$ is the vector of regression coefficients, $\|\cdot\|$ is the $L_2$ norm, and $P_\lambda(\cdot)$ is a penalty function with tuning parameter $\lambda$. We consider a class of nonconvex penalty functions $P_\lambda = P_{\lambda,a}$ satisfying
(P1) $P_\lambda(t)$ is increasing and concave for $t \in [0,\infty)$ with continuous derivative on $t \in (0,\infty)$ and $P'_\lambda(0) := P'_\lambda(0+) \ge a_1\lambda$;
(P2) $P'_\lambda(t) \ge a_1\lambda$ for $t \in (0, a_2\lambda)$;
(P3) $P'_\lambda(t) = 0$ for $t > a\lambda > a_2\lambda$;
for some positive constants $a_1$, $a_2$, and $a$.
The SCAD and the MCP penalties satisfy the above conditions, with $a_1 = 1$ for the SCAD and $a_1 = 1 - a^{-1}$ for the MCP. The derivative of the SCAD penalty is defined by
$$P'_\lambda(t) = \lambda\left\{I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\,I(t > \lambda)\right\}, \quad \text{for some } a > 2,$$
and the derivative of the MCP penalty is defined by
$$P'_\lambda(t) = \Big(\lambda - \frac{t}{a}\Big)_+, \quad \text{for some } a > 1.$$
The one step LLA is defined as follows:
$$\hat\beta(\beta^{init}, \lambda) = \operatorname*{argmin}_{\beta}\ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p}P'_\lambda(|\beta^{init}_j|)|\beta_j|. \qquad (3.1)$$
Then the equation (3.1) can be recast as
$$\hat\beta(\beta^{init}, \lambda)_{A^c} = \operatorname*{argmin}_{\beta_{A^c}}\ \|(I - H_A)(y - X_{A^c}\beta_{A^c})\|^2/2n + \sum_{j \in A^c}P'_\lambda(|\beta^{init}_j|)|\beta_j|,$$
$$\hat\beta(\beta^{init}, \lambda)_A = (X_A^TX_A)^{-1}X_A^T(y - X_{A^c}\hat\beta(\beta^{init}, \lambda)_{A^c}),$$
where $A = A(\beta^{init}, \lambda) = \{j : |\beta^{init}_j| > a\lambda\}$ with the parameter $a$ of the nonconvex penalty, and $H_A = X_A(X_A^TX_A)^{-1}X_A^T$. Let $\tilde y = (I - H_A)y$, $\tilde X = (I - H_A)X_{A^c}W^{-1}$ and $\tilde\beta = W\beta_{A^c}$, where $W = \mathrm{diag}(P'_\lambda(|\beta^{init}_j|))_{j \in A^c}$; then the equation (3.1) can be recast as the LASSO problem with respect to $\tilde y$, $\tilde X$, $\tilde\beta$ and tuning parameter 1. Hence the algorithms for the LASSO can be used for the osLLA.
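A minimal sketch of this recast is given below, assuming a generic penalty-derivative function (e.g. the scad_deriv sketch above) and using scikit-learn's LASSO solver for the reweighted problem with tuning parameter 1. It illustrates the two steps (penalized fit on $A^c$, least squares refit on $A$) under the simplifying assumption that the weights $P'_\lambda(|\beta^{init}_j|)$ are strictly positive on $A^c$; it is not the authors' implementation.

```python
# Illustrative one step LLA via the weighted-LASSO recast of (3.1).
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, beta_init, lam, pen_deriv, a=3.7):
    n, p = X.shape
    w = np.asarray(pen_deriv(np.abs(beta_init), lam, a), dtype=float)
    A = np.where(np.abs(beta_init) > a * lam)[0]            # unpenalized set
    Ac = np.setdiff1d(np.arange(p), A)
    XA, XAc = X[:, A], X[:, Ac]
    H = XA @ np.linalg.pinv(XA) if A.size else np.zeros((n, n))   # H_A
    y_t = y - H @ y                                          # (I - H_A) y
    X_t = (XAc - H @ XAc) / w[Ac]                            # (I - H_A) X_{A^c} W^{-1}
    fit = Lasso(alpha=1.0, fit_intercept=False).fit(X_t, y_t)
    beta = np.zeros(p)
    beta[Ac] = fit.coef_ / w[Ac]                             # back-transform W^{-1} beta~
    if A.size:
        beta[A] = np.linalg.lstsq(XA, y - XAc @ beta[Ac], rcond=None)[0]
    return beta
```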
For a good initial estimate which satisfies $\|\beta^{init} - \beta^*\|_\infty < \min(a_2, 1)\cdot\lambda$, the oracle estimator $\hat\beta^{(o)}$ can be obtained via the two stage method based on the LASSO with high probability, where $\hat\beta^{(o)} = (X_{A_0}^TX_{A_0})^{-1}X_{A_0}^Ty$ with $A_0 = \{j : \beta^*_j \neq 0\}$.
Now we apply the one step LLA estimator to precision matrix estimation. Let the true precision matrix be $\Omega^* = \Sigma^{*-1}$, and let $X^{(1)}, \ldots, X^{(n)}$ be independent and identically distributed samples from $N_p(\mu, \Sigma^*)$. For $i = 1, \ldots, p$, denote the $i$th column of $\Omega$ without $\Omega_{ii}$ by $\Omega_{-ii} = (\omega_{1i}, \ldots, \omega_{i-1,i}, \omega_{i+1,i}, \ldots, \omega_{pi})^T$ and
$$\beta_{(i)} = (\beta_{1(i)}, \ldots, \beta_{i-1(i)}, \beta_{i+1(i)}, \ldots, \beta_{p(i)})^T = \left(-\frac{\omega_{1i}}{\omega_{ii}}, \ldots, -\frac{\omega_{i-1,i}}{\omega_{ii}}, -\frac{\omega_{i+1,i}}{\omega_{ii}}, \ldots, -\frac{\omega_{pi}}{\omega_{ii}}\right)^T.$$
Let $Z_i = \mathbf{X}_i - \bar{\mathbf{X}}_i$, where $\mathbf{X}_i = (X_{1i}, \ldots, X_{ni})^T$ and $\bar{\mathbf{X}}_i = (\bar X_i, \ldots, \bar X_i)^T$ with $\bar X_i = \sum_{j=1}^{n}X^{(j)}_i/n$. Denote the sample covariance $S = Z^TZ/n$, where $Z_{n\times p} = (Z_1, \ldots, Z_p)$, and let $Z_{-i} = (Z_1, Z_2, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_p)$. For an initial estimate $\Omega^{init}$ and a vector of tuning parameters $\lambda = (\lambda_1, \ldots, \lambda_p)^T$, our proposed one step LLA (osLLA) estimator $\hat\Omega^{osLLA}(\Omega^{init}, \lambda) = (\hat\omega^{osLLA}_{ij}(\omega^{init}_{ij}, \lambda_j))_{1\leq i,j\leq p}$ is defined as follows. First, conduct the regression columnwise to estimate $\Omega_{-ii}$, $i = 1, \ldots, p$: for $i = 1, \ldots, p$,
$$\hat\Omega_{ii} = \hat\Omega_{ii}(\Omega^{init}, \lambda_i) = 1/\left(\|Z_i - Z_{-i}\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i)\|^2/n\right),$$
$$\hat\Omega_{-ii} = \hat\Omega_{-ii}(\Omega^{init}, \lambda_i) = -\hat\beta_{(i)}\hat\Omega_{ii}(\Omega^{init}, \lambda_i),$$
where
$$\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \operatorname*{argmin}\ \frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j\neq i}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|. \qquad (3.2)$$
$\hat\Omega$ is calculated by solving the $p$ independent optimization problems defined in (3.2). Second, for symmetry, our final osLLA estimator $\hat\Omega^{osLLA}(\Omega^{init}, \lambda)$ is defined as
$$\hat\omega^{osLLA}_{ij}(\omega^{init}_{ij}, \lambda) = \hat\omega_{ij}I(|\hat\omega_{ij}| \leq |\hat\omega_{ji}|) + \hat\omega_{ji}I(|\hat\omega_{ij}| > |\hat\omega_{ji}|).$$
The initial estimate $\Omega^{init}$ can be the CLIME estimator (Cai et al., 2011) or the graphical lasso estimator (Yuan and Lin, 2007). The columnwise LASSO or Dantzig selector with tuning parameter $\lambda^{init}_i = \lambda_i/\log n$ (Wang et al., 2013) can also be considered as an initial estimate.
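The columnwise construction and the symmetrization step can be summarized in a short sketch. Here two_stage_fit stands for any columnwise solver (osLLA or TSDS); it is an assumed interface for illustration, not code from the thesis.

```python
# Columnwise precision matrix estimation with symmetrization by smaller magnitude.
import numpy as np

def precision_from_columns(Z, two_stage_fit, lambdas):
    n, p = Z.shape
    Omega = np.zeros((p, p))
    for i in range(p):
        idx = [j for j in range(p) if j != i]
        beta_i = two_stage_fit(Z[:, idx], Z[:, i], lambdas[i])   # length p-1
        resid = Z[:, i] - Z[:, idx] @ beta_i
        omega_ii = 1.0 / (np.sum(resid ** 2) / n)                # 1 / (RSS/n)
        Omega[i, i] = omega_ii
        Omega[idx, i] = -omega_ii * beta_i                       # Omega_{-i,i}
    # symmetrize: keep the entry with the smaller absolute value
    return np.where(np.abs(Omega) <= np.abs(Omega.T), Omega, Omega.T)
```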
3.2.2 Two stage Dantzig selector
The two stage Dantzig selector (TSDS) is a Dantzig selector type modification of the LLA for nonconvex penalized methods. The TSDS for regression is defined as a solution of the following problem:
$$\min\ \sum_{j=1}^{p}P'_\lambda(|\beta^{init}_j|)|\beta_j| \quad\text{subject to}\quad \left|\frac{1}{n}\mathbf{X}_j^T(y - X\beta)\right| \leq P'_\lambda(|\beta^{init}_j|),\ j = 1, \ldots, p,$$
where $\mathbf{X}_j = (X_{1j}, \ldots, X_{nj})^T$ and $\beta^{init}$ is an initial estimate.

Similar to the osLLA algorithm, it can be recast as follows.

1. Set $\tilde\beta = W\beta_{A^c}$ and $\tilde X = X_{A^c}W^{-1}$, where $A$ and $W$ are defined as in Subsection 3.2.1, and compute
$$\hat{\tilde\beta} = \operatorname*{argmin}_{\tilde\beta}\left\{\|\tilde\beta\|_1 : \left\|\frac{1}{n}\tilde X^T(y - \tilde X\tilde\beta)\right\|_\infty \leq 1\right\}.$$

2. Let $\hat\beta_{A^c} = W^{-1}\hat{\tilde\beta}$ and $\hat\beta_A = (X_A^TX_A)^{-1}X_A^T(y - X_{A^c}\hat\beta_{A^c})$.

The same algorithms for the Dantzig selector, including the generalized primal-dual interior point algorithm (Candes and Romberg, 2005), DASSO (James et al., 2009), and the alternating direction method (Lu et al., 2012), can be used as well.
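As a simple alternative to these specialized algorithms, the weighted Dantzig selector problem displayed above can also be solved directly as a linear program. The sketch below uses scipy's generic LP solver and splits $\beta$ into positive and negative parts; it is purely illustrative and not the authors' code.

```python
# Weighted Dantzig selector as a linear program: weights w_j = P'_lambda(|beta_init_j|).
import numpy as np
from scipy.optimize import linprog

def weighted_dantzig(X, y, w):
    n, p = X.shape
    G = X.T @ X / n
    c = X.T @ y / n
    obj = np.concatenate([w, w])                  # sum_j w_j (b_pos_j + b_neg_j)
    # |c - G(b_pos - b_neg)| <= w, written as two sets of linear inequalities
    A_ub = np.vstack([np.hstack([ G, -G]),
                      np.hstack([-G,  G])])
    b_ub = np.concatenate([w + c, w - c])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    b = res.x
    return b[:p] - b[p:]
```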
We now define our TSDS estimator for precision matrix estimation; it is similar to the osLLA estimator for precision matrix estimation. Let the columnwise TSDS estimator $\hat\beta^{TSDS}_{(i)} = \hat\beta^{TSDS}_{(i)}(\Omega^{init}, \lambda_i)$ be the solution of
$$\min_{\beta_{(i)}}\ \sum_{j\neq i}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}| \quad\text{subject to}\quad \left|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\beta_{(i)})\right| \leq P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right),\ \forall j\neq i, \qquad (3.3)$$
and let
$$\hat\Omega_{ii} = \left(S_{ii} - 2(\hat\beta^{TSDS}_{(i)})^TS_{-i,i} + (\hat\beta^{TSDS}_{(i)})^TS_{-i,-i}\hat\beta^{TSDS}_{(i)}\right)^{-1}, \qquad \hat\Omega_{-i,i} = -\hat\Omega_{ii}\hat\beta^{TSDS}_{(i)}.$$
To impose symmetry on the estimated precision matrix, we set $\hat\Omega^{TSDS} = (\hat\omega^{TSDS}_{ij})_{1\leq i,j\leq p}$ such that
$$\hat\omega^{TSDS}_{ij} = \hat\omega^{TSDS}_{ji} = \hat\omega_{ij}I(|\hat\omega_{ij}| \leq |\hat\omega_{ji}|) + \hat\omega_{ji}I(|\hat\omega_{ij}| > |\hat\omega_{ji}|).$$
3.2.3 Theoretical results

We prove the selection consistency of our proposed estimators. First, we define the columnwise oracle estimator $\hat\Omega^{(o)}$ of the precision matrix as follows:
$$\hat\Omega_{ii} = \left(S_{ii} - 2(\hat\beta^{(o)}_{(i)})^TS_{-i,i} + (\hat\beta^{(o)}_{(i)})^TS_{-i,-i}\hat\beta^{(o)}_{(i)}\right)^{-1}, \qquad \hat\Omega_{-i,i} = -\hat\Omega_{ii}\hat\beta^{(o)}_{(i)},$$
where $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$, defined by $\hat\beta^{(o)}_{A_{0i}(i)} = (Z_{A_{0i}}^TZ_{A_{0i}})^{-1}Z_{A_{0i}}^TZ_i$ and $\hat\beta^{(o)}_{j(i)} = 0$ for $j\in A^c_{0i}$, with $A_{0i} = \{j : \omega^*_{ij}\neq 0,\ j\neq i\}$. For symmetry of the columnwise oracle precision matrix $\hat\Omega^{(o)} = (\hat\omega^{(o)}_{ij})$, we take $\hat\omega^{(o)}_{ij} = \hat\omega^{(o)}_{ji} = \hat\omega_{ij}I(|\hat\omega_{ij}| \leq |\hat\omega_{ji}|) + \hat\omega_{ji}I(|\hat\omega_{ij}| > |\hat\omega_{ji}|)$.

Proposition 1. The columnwise oracle estimator $\hat\Omega^{(o)}$ is selection consistent and is an elementwise $\sqrt{n}$-consistent estimator of $\Omega^*$ for the nonzero elements.

Proof of Proposition 1. By definition, the columnwise oracle estimator is selection consistent. Since we assume that $X^{(1)}, \ldots, X^{(n)} \sim N(\mu, \Sigma^*)$ with $\Omega^* = \Sigma^{*-1}$, we have $\Omega^*_{-i,i} = -\Omega^*_{i,i}\beta^*_{(i)}$ and $\Omega^*_{i,i} = 1/\mathrm{var}(\epsilon_i)$. For a sparse $\Omega^*$, we can assume that there exists a positive constant $d$ such that the degree of $\Omega^*$ satisfies $\max_{i=1,\ldots,p}|A_{0i}| < d$. For each $i$,
$$\sqrt{n}\left(\hat\beta^{(o)}_{A_{0i}(i)} - \beta^*_{A_{0i}(i)}\right) \to N\!\left(0,\ \mathrm{var}(\epsilon_i)\left(E(Z_{A_{0i}}^TZ_{A_{0i}})\right)^{-1}\right).$$
Since $\widehat{\mathrm{var}}(\epsilon_i) = \frac{1}{n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 \to_p \mathrm{var}(\epsilon_i)$, the continuous mapping theorem gives $\hat\Omega_{ii} = \frac{1}{\frac{1}{n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2} \to_p \Omega^*_{ii}$. Then
$$\sqrt{n}\left(\hat\Omega_{A_{0i},i} - \Omega^*_{A_{0i},i}\right) = \sqrt{n}\left(-\hat\Omega_{i,i}\hat\beta^{(o)}_{A_{0i}(i)} + \Omega^*_{i,i}\beta^*_{A_{0i}(i)}\right) = \sqrt{n}\left(\Omega^*_{i,i}\left(\beta^*_{A_{0i}(i)} - \hat\beta^{(o)}_{A_{0i}(i)}\right) + o_p(1)\cdot O_p(1/\sqrt{n})\right) \to N\!\left(0,\ \Omega^*_{i,i}\left(E(Z_{A_{0i}}^TZ_{A_{0i}})\right)^{-1}\right).$$
Since the columnwise oracle estimator $\hat\omega^{(o)}_{ij}$ is $\hat\omega_{ij}$ or $\hat\omega_{ji}$, whichever has the smaller absolute value, $\hat\omega^{(o)}_{ij}$ is also a $\sqrt{n}$-consistent estimator of $\omega^*_{ij} = \omega^*_{ji}$ for $j\in A_{0i}$.
We now specify the regularity conditions.

(A1) Sparse model: $\Omega^* \in \mathcal{M}_1(L, \nu_0, d)$, where
$$\mathcal{M}_1(L, \nu_0, d) = \left\{A \succ 0 : \|A\|_1 < L,\ \nu_0^{-1} < \phi_{\min}(A) < \phi_{\max}(A) < \nu_0,\ \deg(A) < d\right\},$$
with $L, \nu_0 > 1$, $\|A\|_1 = \max_j\sum_{i=1}^{p}|a_{ij}|$, and $\deg(A) = \max_i\sum_jI(A_{ij}\neq 0)$.

(A2) $\eta_{\min}(Z_{A_{0i}}^TZ_{A_{0i}}) > 0$ for all $i$, where $\eta_{\min}(B)$ is the minimum eigenvalue of $B$.
The following theorems show that the osLLA estimator and the TSDS estimator are equivalent to the columnwise oracle estimator with high probability.

Theorem 1. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Denote the vector of tuning parameters for the columns by $\lambda = (\lambda_1, \ldots, \lambda_p)^T$. For an initial estimate $\Omega^{init}$, let
$$F_{n0} = \left\{\max_{j\neq i}\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| < a_0\lambda_i,\ i = 1, \ldots, p\right\}, \quad\text{where } a_0 = \min(a_2, 1),$$
and
$$F_{n1} = \bigcap_{i=1}^{p}\left\{\left|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\right| \leq P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right),\ \forall j\neq i\right\},$$
where $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$. Under the event $F_{n0}\cap F_{n1}$, the osLLA estimator $\hat\Omega^{osLLA}(\Omega^{init}, \lambda)$ is equivalent to the columnwise oracle estimator $\hat\Omega^{(o)}$ of $\Omega^*$.
Proof of Theorem 1. We can directly apply the theoretical result of the osLLA for regression. Under the event $F_{n0}$, $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) = 0$ for $j\in A_{0i}$ and $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) \geq a_1\lambda_i$ for $j\in A^c_{0i}$. Hence for each $i$,
$$\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \operatorname*{argmin}\ \frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j\in A^c_{0i}}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|.$$
By the convexity of $\|Z_i - Z_{-i}\beta_{(i)}\|^2$,
$$\|Z_i - Z_{-i}\beta_{(i)}\|^2 \geq \|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 - 2\sum_jZ_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})(\beta_{j(i)} - \hat\beta^{(o)}_{j(i)}) = \|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 - 2\sum_{j\in A^c_{0i}}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\beta_{j(i)}.$$
Under the event $F_{n1}$, for each $i$,
$$\left\{\frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j\in A^c_{0i}}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|\right\} - \left\{\frac{1}{2n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 + \sum_{j\in A^c_{0i}}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\hat\beta^{(o)}_{j(i)}|\right\}$$
$$\geq \sum_{j\in A^c_{0i}}\left\{P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right) - \frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\cdot\mathrm{sign}(\beta_{j(i)})\right\}|\beta_{j(i)}| \geq \sum_{j\in A^c_{0i}}\left\{a_1\lambda_i - \frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\cdot\mathrm{sign}(\beta_{j(i)})\right\}|\beta_{j(i)}| \geq 0.$$
The equality holds only if $\beta_{j(i)} = 0$ for all $j\in A^c_{0i}$, and the oracle estimator $\hat\beta^{(o)}_{(i)}$ is the minimizer of $\|Z_i - Z_{-i}\beta_{(i)}\|^2$ among such $\beta_{(i)}$. Hence $\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \hat\beta^{(o)}_{(i)}$ for each $i$, and therefore $\hat\Omega^{osLLA}(\Omega^{init}, \lambda) = \hat\Omega^{(o)}$.
Theorem 2. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Denote the vector of tuning parameters for the columns by $\lambda = (\lambda_1, \ldots, \lambda_p)^T$. For an initial estimate $\Omega^{init}$, define the events $F_{n0}$ and $F_{n1}$ as in Theorem 1, where $a_0 = \min(a_2, 1)$ and $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$. Under the event $F_{n0}\cap F_{n1}$, the TSDS estimator $\hat\Omega^{TSDS}(\Omega^{init}, \lambda)$ is equivalent to the columnwise oracle estimator $\hat\Omega^{(o)}$ of $\Omega^*$.
Proof of Theorem 2. Under the event $F_{n0}$, $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) = 0$ for $j\in A_{0i}$ and $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) \geq a_1\lambda_i$ for $j\in A^c_{0i}$. Under the event $F_{n1}$, $\hat\beta^{(o)}_{(i)}$ is in the feasible set of the two stage Dantzig selector. Under the event $F_{n0}\cap F_{n1}$, the minimizer $\hat\beta^{TSDS}_{(i)}$ of (3.3) must be the oracle estimator $\hat\beta^{(o)}_{(i)}$, because $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) = 0$ for $j\in A_{0i}$ and $\beta_{j(i)}$ for $j\in A^c_{0i}$ can be set to zero.
Recall $A_{0i} = \{j : \omega^*_{ji}\neq 0,\ j\neq i\}$ and let $s_i = |A_{0i}|$. The following corollaries assert that the CLIME estimator (Cai et al., 2011) can be a good initial estimator with which the two stage methods achieve the columnwise oracle estimator. Cai et al. (2011) showed that $\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty \leq 4L\lambda^{clime}$ with probability at least $1 - \Pr\left(\max_{ij}|S_{ij} - \Sigma^*_{ij}| > \lambda^{clime}/L\right)$, where $L = \|\Omega^*\|_1 = \max_j\sum_{i=1}^{p}|\omega^*_{ij}|$ and $\lambda^{clime} = C_0L\sqrt{\log p/n}$. We can use a large deviation result such as Lemma 3 of Bickel and Levina (2008) and Fan et al. (2012): under the regularity condition (A1),
$$\Pr(|S_{ij} - \Sigma^*_{ij}| \geq \delta) \leq C_1\exp(-C_2n\delta^2),$$
where $C_1$ and $C_2$ depend on $\nu_0$ in the regularity condition (A1). Hence
$$\Pr\left(\max_{ij}|S_{ij} - \Sigma^*_{ij}| \geq \frac{\lambda^{clime}}{L}\right) \leq C_1\exp\left(-C_2\frac{n\lambda^{clime\,2}}{L^2}\right).$$
Lemma 1. Let $\Omega^{init}$ be an initial estimator and $\Omega^* = (\omega^*_{ij})_{1\leq i,j\leq p}$ be the true precision matrix. Define $A_{0i} = \{j : \omega^*_{ij}\neq 0,\ j\neq i\}$ for $i = 1, \ldots, p$, and define $a_0 = \min\{1, a_2\}$, where $a_2$ is defined in the penalty conditions (P1)-(P3). For each $i = 1, \ldots, p$, if
$$\|\Omega^{init} - \Omega^*\|_\infty < a_0\lambda_i\cdot\left\{\frac{1}{\omega^{init}_{ii}}\max_{j\in A_{0i}}\left(\left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right| + 1\right)\right\}^{-1},$$
then
$$\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| < a_0\lambda_i.$$

Proof of Lemma 1.
$$\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| = \left|\frac{\omega^*_{ii}\omega^{init}_{ji} - \omega^{init}_{ii}\omega^*_{ji}}{\omega^{init}_{ii}\omega^*_{ii}}\right| \leq \frac{(|\omega^*_{ii}| + |\omega^*_{ji}|)\|\Omega^{init} - \Omega^*\|_\infty}{|\omega^{init}_{ii}\omega^*_{ii}|} = \left(1 + \left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right|\right)\frac{\|\Omega^{init} - \Omega^*\|_\infty}{|\omega^{init}_{ii}|} < a_0\lambda_i.$$
Let $A_0 = \{(i, j) : \Omega^*_{ij}\neq 0\} = \cup_iA_{0i}$.
Corollary 1. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold.

(1) Suppose that, for $i = 1, \ldots, p$, $\min_{j\in A_{0i}}\left|\omega^*_{ji}/\omega^*_{ii}\right| > (a+1)\lambda_i$ and
$$\lambda_i > \max\left(\frac{1}{a_0}\,\frac{1}{\omega^{init}_{ii}}\max_{j\in A_{0i}}\left(\left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right| + 1\right)\cdot 4L\lambda^{clime},\ \ \frac{2}{a_1}\sqrt{\frac{\log p}{n}\,\max_i\omega^{*-1}_{ii}\,M}\right),$$
where $a_0 = \min(1, a_2)$, $\omega^{init}_{ii} = \hat\omega^{clime}_{ii}(\lambda^{clime})$, $\lambda^{clime} = LC_0\sqrt{\log p/n}$ for some $C_0 > 0$, and $M = \max_j\|Z_j\|_2^2/n$. Then
$$\Pr\left(\hat\Omega^{osLLA}(\hat\Omega^{clime}(\lambda^{clime}), \lambda) = \hat\Omega^{(o)}\right) \geq 1 - p_0 - p_1,$$
where $p_1 = \Pr(F_{n1}^c) \leq 2\left(p(p-1) - \sum_{j=1}^{p}s_j\right)\exp\left(-\frac{na_1^2\min_i\lambda_i^2}{2M\max_i\omega^{*-1}_{ii}}\right)$ and $p_0 = \Pr(\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty > 4L\lambda^{clime}) \leq C_1\exp\left(-C_2\frac{n\lambda^{clime\,2}}{L^2}\right)$.

(2) If $n\min_i\lambda_i^2\to\infty$, $n\lambda^{clime\,2}\to\infty$ and $\log p = o(n\min_i\lambda_i^2)$, then
$$\Pr\left(\hat\Omega^{osLLA}(\hat\Omega^{clime}(\lambda^{clime}), \lambda) = \hat\Omega^{(o)}\right)\to 1$$
as $n\to\infty$.
Corollary 2. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold.

(1) Suppose that the conditions of Corollary 1 (1) hold, that is, for $i = 1, \ldots, p$, $\min_{j\in A_{0i}}\left|\omega^*_{ji}/\omega^*_{ii}\right| > (a+1)\lambda_i$ and each $\lambda_i$ exceeds the same lower bound, with the CLIME initial estimator $\omega^{init}_{ii} = \hat\omega^{clime}_{ii}(\lambda^{clime})$, $\lambda^{clime} = LC_0\sqrt{\log p/n}$ for some $C_0 > 0$, and $M = \max_j\|Z_j\|_2^2/n$. Then
$$\Pr\left(\hat\Omega^{TSDS}(\hat\Omega^{clime}(\lambda^{clime}), \lambda) = \hat\Omega^{(o)}\right) \geq 1 - p_0 - p_1,$$
with $p_0$ and $p_1$ as in Corollary 1.

(2) If $n\min_i\lambda_i^2\to\infty$, $n\lambda^{clime\,2}\to\infty$ and $\log p = o(n\min_i\lambda_i^2)$, then
$$\Pr\left(\hat\Omega^{TSDS}(\hat\Omega^{clime}(\lambda^{clime}), \lambda) = \hat\Omega^{(o)}\right)\to 1$$
as $n\to\infty$.
Proof of Corollary 1 and Corollary 2. These results follow from Theorem 1 and Theorem 2. To bound $\Pr(F_{n0}^c)$, the inequality
$$\Pr(\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty \geq 4L\lambda^{clime}) \leq C_1\exp\left(-C_2\frac{n\lambda^{clime\,2}}{L^2}\right)$$
and Lemma 1 are used. $\Pr(F_{n1}^c)$ can be bounded in a similar way to the linear regression case. The difference between regression and precision matrix estimation is the variance of $\epsilon_i$: in precision matrix estimation, $\mathrm{var}(\epsilon_i) = \Sigma^*_{ii} - \Sigma^*_{i,-i}\Sigma^{*-1}_{-i,-i}\Sigma^*_{-i,i} = \omega^{*-1}_{ii}$, hence it is bounded. Let $\epsilon_i = (\epsilon_{1i}, \ldots, \epsilon_{ni})^T$ be the vector of iid samples from $N(0, \mathrm{var}(\epsilon_i))$. Then
$$\begin{aligned}
\Pr(F_{n1}^c) &\leq \sum_{i=1}^{p}\sum_{j\in A^c_{0i}}\Pr\left(\left|\frac{1}{n}Z_j^T(I_n - H_{A_{0i}})\epsilon_i\right| > P'_{\lambda_i}(|\omega^{init}_{ij}|)\right)\\
&\leq 2\sum_{i=1}^{p}(p - 1 - s_i)\exp\left(-\frac{n\min_{j\in A^c_{0i}}P'_{\lambda_i}(|\omega^{init}_{ij}|)^2}{2\omega^{*-1}_{ii}M}\right)\\
&\leq 2\sum_{i=1}^{p}(p - 1 - s_i)\exp\left(-\frac{na_1^2\lambda_i^2}{2\omega^{*-1}_{ii}M}\right)\\
&\leq 2\left(p(p-1) - \sum_{i=1}^{p}s_i\right)\exp\left(-\frac{na_1^2\min_i\lambda_i^2}{2\max_i\omega^{*-1}_{ii}M}\right),
\end{aligned}$$
because
$$\Pr(|a^T\epsilon_i| > t) \leq 2\exp\left(-\frac{t^2}{2\omega^{*-1}_{ii}\|a\|_2^2}\right)\quad\forall t\geq 0$$
and $\left\|\frac{1}{n}Z_j^T(I_n - H_{A_{0i}})\right\|_2^2 \leq \frac{\|Z_j\|_2^2}{n^2}\,\lambda_{\max}(I_n - H_{A_{0i}}) \leq M/n$ for all $j\in A^c_{0i}$.
The LASSO or Dantzig selector estimates of $\Omega_{-ii}$ can also be good initial estimates, based on the $\ell_2$ bounds of Bickel et al. (2009). Denote by $\otimes$ the Kronecker product. If the Hessian of the log-likelihood $\Gamma^*_{p^2\times p^2} = \Omega^{*-1}\otimes\Omega^{*-1}$ satisfies the incoherence (or irrepresentable) condition and some additional regularity conditions hold, an elementwise $\ell_\infty$ bound for the graphical lasso estimator is $\|\hat\Omega^{glasso} - \Omega^*\|_\infty = O\left(\sqrt{\log p/n}\right)$ (Ravikumar et al., 2011), where the incoherence (or irrepresentable) condition requires that there exists $\alpha\in(0, 1]$ such that $\max_{e\in A^c}\|\Gamma^*_{eA}(\Gamma^*_{AA})^{-1}\|_1 \leq 1 - \alpha$ with $A = \cup_iA_{0i}$. Hence the glasso estimate $\hat\Omega^{glasso}$ can also be a good initial estimate of $\Omega^*$.
3.3 Numerical analyses
We conduct two simulation studies and one real data analysis. The two simulation settings are the same as in Fan et al. (2012). The real data analysis is a classification problem using linear discriminant analysis (LDA), which requires an estimate of the precision matrix.
3.3.1 Simulations
We simulate $n$ independent random vectors from $N_p(0, \Sigma^*)$ with a sparse precision matrix $\Omega^* = (\Sigma^*)^{-1}$. We consider two different sparsity patterns of $\Omega^*$.

Example 1. $\Omega^*$ is a tridiagonal matrix, obtained by constructing $\Sigma^* = (\sigma^*_{ij})$ with $\sigma^*_{ij} = \exp(-|c_i - c_j|)$ for $c_1 < \cdots < c_p$, where the increments $c_p - c_{p-1}, \ldots, c_2 - c_1$ are generated independently from Unif(0.5, 1).

Example 2. $\Omega^* = U^TU + I$, where $U = (u_{ij})_{p\times p}$ has zero diagonal and exactly $p$ nonzero off-diagonal entries. The nonzero entries are generated as $u_{ij} = t_{ij}c_{ij}$, where the $t_{ij}$ are independently generated from Unif(1, 2) and the $c_{ij}$ are independent random signs with $\Pr(c_{ij} = 1) = \Pr(c_{ij} = -1) = 0.5$.

We also generate an independent validation set of sample size $n$ to tune each estimator. In our simulations we let $n = 100$ and $p = 100$ or $p = 200$.
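A sketch of the two data generating designs is given below. It assumes the reading $\Omega^* = U^TU + I$ for Example 2 (reconstructed from the garbled display above) and uses illustrative seeds and helper names, not the thesis code.

```python
# Illustrative data generation for the two simulation designs.
import numpy as np

rng = np.random.default_rng(1)

def example1_cov(p):
    # Sigma*_{ij} = exp(-|c_i - c_j|) with increments c_{k+1} - c_k ~ Unif(0.5, 1)
    c = np.concatenate([[0.0], np.cumsum(rng.uniform(0.5, 1.0, p - 1))])
    return np.exp(-np.abs(c[:, None] - c[None, :]))

def example2_precision(p):
    # Omega* = U^T U + I, U with zero diagonal and exactly p nonzero entries
    U = np.zeros((p, p))
    off = [(i, j) for i in range(p) for j in range(p) if i != j]
    for k in rng.choice(len(off), size=p, replace=False):
        i, j = off[k]
        U[i, j] = rng.uniform(1.0, 2.0) * rng.choice([-1.0, 1.0])
    return U.T @ U + np.eye(p)

n, p = 100, 100
X1 = rng.multivariate_normal(np.zeros(p), example1_cov(p), size=n)
X2 = rng.multivariate_normal(np.zeros(p), np.linalg.inv(example2_precision(p)), size=n)
```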
We compute the $\ell_1$ penalized Gaussian likelihood estimator, denoted by glasso, using the popular R package glasso (Friedman et al., 2013). CLIME (Cai et al., 2011) is computed with the R package clime (Cai et al., 2012). We use gSCAD to denote the one step SCAD penalized Gaussian likelihood estimator with the CLIME initial estimate, proposed by Fan et al. (2012). These likelihood based approaches are tuned by minimizing the validation error, defined as $-\log\det\hat\Omega + \mathrm{trace}(\hat\Omega S^{val})$, where $\hat\Omega$ is the generic estimator and $S^{val}$ is the sample covariance of the validation set. Denote by MB the columnwise $\ell_1$ penalized regression proposed by Meinshausen and Buhlmann (2006). We conduct our two stage methods with the LASSO and the Dantzig selector using two different initial estimators, glasso and CLIME. Denote by clime+LS the osLLA with the CLIME initial estimator and by clime+DS the TSDS with the CLIME initial estimator; with the glasso initial, glasso+LS and glasso+DS denote the osLLA and the TSDS, respectively. MB and these two stage methods are tuned columnwise by minimizing the validation error $\|Z^{val}_i - Z^{val}_{-i}\hat\beta_{(i)}(\lambda_i)\|^2/n$.
For each model, we generate 100 independent datasets, each consisting of $n$ training samples and $n$ validation samples. Estimation accuracy is measured by the average Frobenius norm loss $\|\hat\Omega - \Omega^*\|_F$, the average matrix $\ell_1$ norm loss $\|\hat\Omega - \Omega^*\|_1$, and the average spectral norm loss $\|\hat\Omega - \Omega^*\|_2$ over the 100 replications, where $\|A\|_F = \sqrt{\sum_{i,j}a_{ij}^2}$, $\|A\|_1 = \max_{1\leq j\leq q}\sum_{i=1}^{p}|a_{ij}|$, and $\|A\|_2 = \sup_{|x|\leq 1}|Ax|_2$ for a matrix $A = (a_{ij})\in\mathbb{R}^{p\times q}$. The selection accuracy is evaluated by the average edge proportions of false positives (FP) and false negatives (FN), sensitivity, and specificity over the 100 replications; the average number of estimated edges is also reported. We plot the ROC curve with the average sensitivity and specificity for each method, and we also plot the average Frobenius norm and spectral norm against the number of edges for each method. For the two stage methods, we consider two settings: one with the same tuning parameter for all columns and one with columnwise different tuning parameters. The simulation results are summarized in Figures 3.1-3.8 and Tables 3.1-3.8. We conduct the two stage methods with several glasso and CLIME initials over a sequence of tuning parameters, and we choose over-edged initial estimates, which have more edges than the glasso or CLIME estimates selected as optimal by the validation error. We summarize the best results of our proposed methods in Figures 3.1-3.8 and Tables 3.1-3.8. The selection results of our proposed methods outperform the others, and the two stage methods with the glasso initial tend to perform better than those with the CLIME initial. In Example 2, our proposed methods achieve the best finite sample performance in both estimation and selection.
[Figure 3.1: ROC curve of Example 1 (p=100, q=99). Sensitivity (x100) is plotted against (1-specificity) (x100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.]
Table 3.1: Example 1 (p=100, q=99)
Methods Edge FP FN Sensitivity Specificity
glasso 1025.655 0.9034 1.00E-04 0.9976 0.8089
gSCAD-osLLA 902.96 0.8839 2.00E-04 0.9917 0.8341
CLIME 586.67 0.8293 2.00E-04 0.9904 0.8993
MB 288.49 0.6604 3.00E-04 0.9852 0.9606
MB(same) 291.27 0.6618 3.00E-04 0.9867 0.9601
clime+LS 281 0.6522 4.00E-04 0.9821 0.9621
clime+DS 283.57 0.6554 4.00E-04 0.9821 0.9616
glasso+LS 134.02 0.3 0.0011 0.9441 0.9916
glasso+DS 133.35 0.2963 0.0011 0.9441 0.9918
clime+LS(same) 290.92 0.6612 3.00E-04 0.9862 0.9602
clime+DS(same) 277.24 0.6452 3.00E-04 0.9837 0.9629
glasso+LS(same) 140.96 0.3217 8.00E-04 0.9598 0.9905
glasso+DS(same) 140.36 0.3185 8.00E-04 0.9593 0.9906
[Figure 3.2: $\|\hat\Omega - \Omega^*\|$ of Example 1 (p=100, q=99). Panel (a) shows the Frobenius norm and panel (b) the spectral norm against the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.]
Table 3.2: ‖Ω −Ω∗‖ of Example 1 (p=100, q=99)
Methods Edge Frob l1 l2
glasso 1025.655 6.6562 2.667 1.5839
gSCAD-osLLA 902.96 4.3703 1.8354 1.1696
CLIME 586.67 5.7808 2.0251 1.3744
MB 288.49 6.136 2.6453 1.9252
MB(same) 291.27 4.7942 1.7009 1.1407
clime+LS 281 6.5012 3.0927 2.335
clime+DS 283.57 6.5829 3.0539 2.2887
glasso+LS 134.02 4.6925 1.9878 1.4721
glasso+DS 133.35 4.6922 2.0082 1.4868
clime+LS(same) 290.92 4.9179 1.786 1.2587
clime+DS(same) 277.24 4.9048 1.748 1.2448
glasso+LS(same) 140.96 4.5504 1.8529 1.3569
glasso+DS(same) 140.36 4.5491 1.8487 1.3571
[Figure 3.3: ROC curve of Example 1 (p=200, q=199). Sensitivity (x100) against (1-specificity) (x100) for the same methods as Figure 3.1.]
Table 3.3: Example 1 (p=200, q=199)
Methods Edge FP FN Sensitivity Specificity
glasso 2657.33 0.925 0 0.9982 0.8752
gSCAD-osLLA 1729.86 0.8809 1.00E-04 0.9919 0.9222
CLIME 960.83 0.7932 2.00E-04 0.9818 0.9611
MB 604.36 0.6765 2.00E-04 0.9798 0.9792
MB(same) 620.22 0.6818 2.00E-04 0.9825 0.9784
clime+LS 407.61 0.5389 6.00E-04 0.9403 0.9888
clime+DS 407.12 0.5377 6.00E-04 0.9414 0.9888
glasso+LS 586.49 0.6673 2.00E-04 0.9777 0.9801
glasso+DS 608.99 0.6792 2.00E-04 0.9783 0.979
clime+LS(same) 435 0.5578 4.00E-04 0.9597 0.9876
clime+DS(same) 429.43 0.552 4.00E-04 0.9587 0.9879
glasso+LS(same) 599.01 0.6725 2.00E-04 0.9816 0.9795
glasso+DS(same) 585.45 0.6651 2.00E-04 0.9808 0.9802
[Figure 3.4: $\|\hat\Omega - \Omega^*\|$ of Example 1 (p=200, q=199). Panels: (a) Frobenius norm and (b) spectral norm against the number of edges, for the same methods as Figure 3.2.]
Table 3.4: ‖Ω −Ω∗‖ of Example 1 (p=200, q=199)
Methods Edge Frob l1 l2
glasso 2657.33 10.9083 3.0856 1.7999
gSCAD-osLLA 1729.86 7.015 2.2085 1.3832
CLIME 960.83 9.4415 2.1993 1.5915
MB 604.36 10.4144 3.905 3.0842
MB(same) 620.22 7.32 1.8644 1.2588
clime+LS 407.61 10.599 4.0213 3.1616
clime+DS 407.12 10.878 4.2165 3.3455
glasso+LS 586.49 11.1297 4.5991 3.7215
glasso+DS 608.99 11.8905 4.6239 3.7287
clime+LS(same) 435 8.1239 2.2008 1.7033
clime+DS(same) 429.43 8.1794 2.2122 1.7129
glasso+LS(same) 599.01 7.2486 1.854 1.267
glasso+DS(same) 585.45 7.2326 1.8454 1.2634
[Figure 3.5: ROC curve of Example 2 (p=100, q=59). Sensitivity (x100) against (1-specificity) (x100) for the same methods as Figure 3.1.]
Table 3.5: Example 2 (p=100, q=59)
Methods Edge FP FN Sensitivity Specificity
glasso 1025.655 0.9034 1.00E-04 0.9976 0.8089
gSCAD-osLLA 902.96 0.8839 2.00E-04 0.9917 0.8341
CLIME 586.67 0.8293 2.00E-04 0.9904 0.8993
MB 288.49 0.6604 3.00E-04 0.9852 0.9606
MB(same) 291.27 0.6618 3.00E-04 0.9867 0.9601
clime+LS 281 0.6522 4.00E-04 0.9821 0.9621
clime+DS 283.57 0.6554 4.00E-04 0.9821 0.9616
glasso+LS 134.02 0.3 0.0011 0.9441 0.9916
glasso+DS 133.35 0.2963 0.0011 0.9441 0.9918
clime+LS(same) 290.92 0.6612 3.00E-04 0.9862 0.9602
clime+DS(same) 277.24 0.6452 3.00E-04 0.9837 0.9629
glasso+LS(same) 140.96 0.3217 8.00E-04 0.9598 0.9905
glasso+DS(same) 140.36 0.3185 8.00E-04 0.9593 0.9906
[Figure 3.6: $\|\hat\Omega - \Omega^*\|$ of Example 2 (p=100, q=59). Panels: (a) Frobenius norm and (b) spectral norm against the number of edges, for the same methods as Figure 3.2.]
Table 3.6: ‖Ω −Ω∗‖ of Example 2 (p=100, q=59)
Methods Edge Frob l1 l2
glasso 1025.655 6.6562 2.667 1.5839
gSCAD-osLLA 902.96 4.3703 1.8354 1.1696
CLIME 586.67 5.7808 2.0251 1.3744
MB 288.49 6.136 2.6453 1.9252
MB(same) 291.27 4.7942 1.7009 1.1407
clime+LS 281 6.5012 3.0927 2.335
clime+DS 283.57 6.5829 3.0539 2.2887
glasso+LS 134.02 4.6925 1.9878 1.4721
glasso+DS 133.35 4.6922 2.0082 1.4868
clime+LS(same) 290.92 4.9179 1.786 1.2587
clime+DS(same) 277.24 4.9048 1.748 1.2448
glasso+LS(same) 140.96 4.5504 1.8529 1.3569
glasso+DS(same) 140.36 4.5491 1.8487 1.3571
[Figure 3.7: ROC curve of Example 2 (p=200, q=92). Sensitivity (x100) against (1-specificity) (x100) for the same methods as Figure 3.1.]
Table 3.7: Example 2 (p=200, q=92)
Methods Edge FP FN Sensitivity Specificity
glasso 2263.27 0.968 0.0011 0.7851 0.8894
gSCAD-osLLA 536.81 0.8786 0.0016 0.6573 0.976
CLIME 242.06 0.7012 0.0011 0.7698 0.9914
MB 272.85 0.732 0.001 0.7914 0.9899
MB(same) 378.18 0.878 0.0024 0.4965 0.9832
clime+LS 102 0.2251 7.00E-04 0.8561 0.9988
clime+DS 102 0.2252 7.00E-04 0.8561 0.9988
glasso+LS 104.46 0.2115 5.00E-04 0.8932 0.9989
glasso+DS 104.38 0.2109 5.00E-04 0.8932 0.9989
clime+LS(same) 117.73 0.389 0.0011 0.7559 0.9976
clime+DS(same) 117.74 0.389 0.0011 0.7559 0.9976
glasso+LS(same) 87.02 0.1569 0.001 0.7924 0.9993
glasso+DS(same) 86.93 0.1563 0.001 0.7921 0.9993
[Figure 3.8: $\|\hat\Omega - \Omega^*\|$ of Example 2 (p=200, q=92). Panels: (a) Frobenius norm and (b) $\ell_2$ (spectral) norm against the number of edges, for the same methods as Figure 3.2.]
Table 3.8: ‖Ω −Ω∗‖ of Example 2 (p=200, q=92)
Methods Edge Frob l1 l2
glasso 2263.27 37.8666 13.3819 9.5821
gSCAD-osLLA 536.81 21.8945 9.7219 7.0486
CLIME 242.06 28.311 11.4089 8.0062
MB 272.85 22.7018 8.2783 5.3505
MB(same) 378.18 30.7944 12.2013 8.1922
clime+LS 102 17.8328 9.0719 6.2429
clime+DS 102 17.8365 9.078 6.2441
glasso+LS 104.46 16.2034 8.0195 5.7519
glasso+DS 104.38 16.2091 8.0138 5.7561
clime+LS(same) 117.73 20.4286 9.2521 6.092
clime+DS(same) 117.74 20.4287 9.2521 6.092
glasso+LS(same) 87.02 18.1129 8.0463 5.6268
glasso+DS(same) 86.93 18.117 8.0515 5.633
3.3.2 Real data analysis
We apply our two stage methods to the breast cancer dataset which was analyzed by Hess et al. (2006) and is available at http://bioinformatics.mdanderson.org/. This dataset was also used in previous studies (Fan et al., 2009; Cai et al., 2011). The aim of this analysis is to compare the results of linear discriminant analysis (LDA) based on several regularization methods for sparse precision matrix estimation. The dataset contains 22,283 gene expression levels of 133 patients, 34 of whom achieved pathological complete response (pCR) while the others did not (a.k.a. residual disease (RD)). Since pCR after neoadjuvant (preoperative) chemotherapy indicates a high possibility of cancer free survival, it is of substantial interest to predict whether or not a patient will achieve pCR. In this study, LDA is utilized to classify a patient as pCR or RD, and the precision matrix must be estimated before applying LDA. Fan et al. (2009) used the penalized loglikelihood method with the LASSO, adaptive LASSO, and SCAD penalties to estimate the precision matrix, and Cai et al. (2011) used the CLIME estimate as the precision matrix. We follow the same framework as Fan et al. (2009) and Cai et al. (2011).
First, we randomly divide the dataset into training and testing sets. To maintain the proportion of pCR and RD each time, we use stratified sampling which randomly selects five subjects from pCR and 16 from RD to construct the testing dataset; the remaining subjects are used as the training dataset. For each training set, we conduct a two-sample t-test between the two groups for each gene and select the 113 most significant genes (i.e., those with the smallest p-values) as the covariates for prediction. Because the size of the training sample is $n = 112$, the covariates with size $p = 113$ allow us to examine the performance when $p > n$. Second, we standardize each gene level in these datasets by dividing by the corresponding standard deviation estimated from the training set. Finally, we conduct precision matrix estimation for each regularization method and apply it to LDA. According to the LDA framework, we assume that the normalized gene expression data are normally distributed as $N(\mu_k, \Sigma)$, where the two groups are assumed to have the same covariance matrix $\Sigma$ but different means $\mu_k$, $k = 1$ for pCR and $k = 2$ for RD. The LDA scores based on the estimated precision matrix $\hat\Omega$ are as follows:
$$\hat\delta_k(x) = x^T\hat\Omega\hat\mu_k - \frac{1}{2}\hat\mu_k^T\hat\Omega\hat\mu_k + \log\hat\pi_k,$$
where $\hat\pi_k = n_k/n$ is the proportion of subjects in group $k$ in the training set and $\hat\mu_k = \frac{1}{n_k}\sum_{i\in\text{group }k}x_i$ is the within-group mean vector in the training set. The classification rule is given by $\hat k(x) = \operatorname*{argmax}_k\hat\delta_k(x)$ for $k = 1, 2$. To evaluate the classification performance based on the precision matrix estimates, we use the specificity, sensitivity, and Matthews correlation coefficient (MCC) criteria, defined as follows:
$$\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \qquad \mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
$$\mathrm{MCC} = \frac{\mathrm{TP}\times\mathrm{TN} - \mathrm{FP}\times\mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}},$$
where TP, TN, FP, and FN are the numbers of true positives (pCR), true negatives (RD), false positives, and false negatives, respectively. We also compare the numbers of nonzero precision matrix elements among the methods considered in the simulations, with the same tuning strategy. The results are reported in Table 3.9. The proposed two stage methods yield very sparse precision matrices while performing as well as or similarly to the other methods.
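A minimal sketch of the LDA rule used here, taking an estimated precision matrix, the group means, and the group proportions as inputs, is as follows; the function names are illustrative and not part of the analysis code.

```python
# LDA scores delta_k(x) = x' Omega mu_k - 0.5 mu_k' Omega mu_k + log pi_k,
# classified by the largest score.
import numpy as np

def lda_scores(x, Omega, means, priors):
    return np.array([x @ Omega @ m - 0.5 * m @ Omega @ m + np.log(pk)
                     for m, pk in zip(means, priors)])

def classify(x, Omega, means, priors):
    # returns 1 for pCR, 2 for RD when means/priors are ordered (pCR, RD)
    return int(np.argmax(lda_scores(x, Omega, means, priors))) + 1
```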
Table 3.9: Real Data (Breast Cancer)
Methods SP SN MCC #Edge
glasso 0.876(0.066) 0.404(0.186) 0.307(0.229) 1066.87(31.054)
gSCAD-osLLA 0.784(0.077) 0.682(0.201) 0.428(0.211) 731.37(50.934)
CLIME 0.737(0.068) 0.782(0.173) 0.457(0.18) 2282.18(371.52)
MB 0.677(0.074) 0.824(0.164) 0.433(0.169) 289.06(25.678)
glasso+DS 0.674(0.087) 0.824(0.161) 0.431(0.176) 221.50(22.474)
glasso+LS 0.674(0.088) 0.820(0.164) 0.428(0.179) 224.48(21.468)
clime+DS 0.666(0.072) 0.824(0.169) 0.422(0.17) 260.37(19.059)
clime+LS 0.669(0.077) 0.822(0.168) 0.424(0.165) 333.09(22.069)
3.4 Conclusion

In this paper, we focus on variable selection and the correct recovery of the support of the precision matrix. We propose a regression based approach which applies two stage methods based on the LASSO or the Dantzig selector to the columnwise estimation of the precision matrix. We prove that the proposed methods can correctly recover the support of the precision matrix and obtain a $\sqrt{n}$-consistent estimator for the nonzero elements of the precision matrix with high probability under some regularity conditions. Numerical results show that our proposed methods outperform existing regularization methods, including glasso, gSCAD, and CLIME, in terms of estimation and especially support recovery of the precision matrix.
Chapter 4
Concluding remarks
In this thesis, we propose a two stage method based on the Dantzig selector, called the two stage Dantzig selector, for the high dimensional regression model. We prove that the two stage Dantzig selector satisfies the strong oracle property. Numerical results support our contention that the Dantzig selector used in our proposed method improves variable selection and estimation compared with the LASSO. Furthermore, the two stage Dantzig selector outperforms other regularization methods, including the LASSO, the Dantzig selector, the SCAD and the MCP.

We also apply the two stage methods based on the LASSO or the Dantzig selector to sparse precision matrix estimation. We prove that these proposed methods can correctly recover the support of the precision matrix and obtain a $\sqrt{n}$-consistent estimator for the nonzero elements of the precision matrix. For the estimation of a sparse precision matrix, the two stage methods perform well in estimation and especially in support recovery.
Chapter 5
Appendix
5.1 Algorithms for Dantzig selector
There are several algorithms for the Dantzig selector. Recall the definition of the Dantzig selector,
$$\min_\beta\ \|\beta\|_1 \quad\text{subject to}\quad \left\|\frac{1}{n}X^T(y - X\beta)\right\|_\infty \leq \lambda. \qquad (5.1)$$
A standard way to solve (5.1) is to use linear programming (LP) techniques, because (5.1) can be recast as an LP problem. Candes and Romberg (2005) proposed the l1-magic package, which applies a primal-dual interior point method, one of the LP techniques, to the reformulated LP problem. This algorithm is known to be efficient when $X$ is sparse or can be efficiently transformed into a diagonal matrix, but it can be inefficient for large-scale problems because of the Newton step in each iteration (Wang and Yuan, 2012).
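For reference, the LP reformulation can also be handed to a generic LP solver. The following sketch solves (5.1) with scipy's HiGHS-based linprog using auxiliary variables $u \geq |\beta|$; it is an illustration rather than the l1-magic implementation.

```python
# Dantzig selector (5.1) as a linear program in the variables z = (beta, u).
import numpy as np
from scipy.optimize import linprog

def dantzig_lp(X, y, lam):
    n, p = X.shape
    G = X.T @ X / n
    c = X.T @ y / n
    I, Z = np.eye(p), np.zeros((p, p))
    obj = np.concatenate([np.zeros(p), np.ones(p)])   # minimize sum(u)
    A_ub = np.vstack([np.hstack([ I, -I]),            #  beta - u <= 0
                      np.hstack([-I, -I]),            # -beta - u <= 0
                      np.hstack([ G,  Z]),            #  (1/n) X'(X beta - y) <= lam
                      np.hstack([-G,  Z])])           # -(1/n) X'(X beta - y) <= lam
    b_ub = np.concatenate([np.zeros(2 * p), lam + c, lam - c])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(None, None), method="highs")
    return res.x[:p]
```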
There are homotopy methods to compute the entire solution path of the Dantzig selector (Romberg, 2008; James et al., 2009), but they are also problematic for high dimensional data (Becker et al., 2011). As an effort to solve (5.1) efficiently in large-scale problems, first-order methods have been proposed (Lu, 2012; Becker et al., 2011). Lu et al. (2012) applied the alternating direction method (ADM), which has been widely used to solve large-scale problems, to solving (5.1), and its variations have been developed for large-scale problems (Wang and Yuan, 2012).

We go into three representative algorithms for the Dantzig selector: the primal-dual interior point method (Candes and Romberg, 2005), the Dantzig selector with sequential optimization (DASSO) (James et al., 2009), and the alternating direction method (ADM) (Lu et al., 2012). We abstract the main algorithms for the Dantzig selector from these three papers.
5.1.1 Primal-dual interior point algorithm (Candes and Romberg, 2005)

The Dantzig selector can be recast as a linear program (LP). An LP is an optimization problem with a linear objective function and linear equality or inequality constraints. There are many solvers for LPs, such as the simplex method, the barrier method, and the primal-dual interior point method. Candes and Romberg (2005) introduced a primal-dual interior point method for LPs and second-order cone programs (SOCPs). Here we extract the algorithm for the Dantzig selector from Candes and Romberg (2005).
An equivalent linear program to (5.1) is given by
$$\min_{\beta, u}\ \sum_iu_i \quad\text{subject to}\quad \beta - u\leq 0,\quad -\beta - u\leq 0,\quad X^Tr - \lambda\mathbf{1}\leq 0,\quad -X^Tr - \lambda\mathbf{1}\leq 0,$$
where $r = X\beta - y$. Taking
$$f_{u_1} = \beta - u,\quad f_{u_2} = -\beta - u,\quad f_{\lambda_1} = X^Tr - \lambda\mathbf{1},\quad f_{\lambda_2} = -X^Tr - \lambda\mathbf{1},$$
and $f = (f_{u_1}, f_{u_2}, f_{\lambda_1}, f_{\lambda_2})^T$, at the optimal point $(\beta^*, u^*)$ there exist dual vectors $\gamma^* = (\gamma^*_{u_1}, \gamma^*_{u_2}, \gamma^*_{\lambda_1}, \gamma^*_{\lambda_2})^T$, $\gamma^*\geq 0$, such that the following Karush-Kuhn-Tucker (KKT) conditions are satisfied:
$$\gamma^*_{u_1} - \gamma^*_{u_2} + X^TX(\gamma^*_{\lambda_1} - \gamma^*_{\lambda_2}) = 0,\qquad \mathbf{1} - \gamma^*_{u_1} - \gamma^*_{u_2} = 0,$$
$$f_{u_1}\leq 0,\quad f_{u_2}\leq 0,\quad f_{\lambda_1}\leq 0,\quad f_{\lambda_2}\leq 0,$$
$$\gamma_{u_1,i}f_{u_1,i} = 0,\quad \gamma_{u_2,i}f_{u_2,i} = 0,\quad \gamma_{\lambda_1,i}f_{\lambda_1,i} = 0,\quad \gamma_{\lambda_2,i}f_{\lambda_2,i} = 0,\quad i = 1, \ldots, p.$$
The complementary slackness condition $\gamma_if_i = 0$ is relaxed in practice to
$$\gamma_i^{(k)}f_i(\beta^{(k)}, u^{(k)}) = -1/\tau^{(k)}, \qquad (5.2)$$
with $\tau^{(k)}$ increasing through the iterations. The relaxed KKT conditions replace the complementary slackness condition with (5.2). The optimal solution $\beta^*$ of the primal-dual algorithm satisfies the relaxed KKT conditions along with the optimal dual vectors $\gamma^*$. The solution is obtained through the classical Newton method constrained to the interior region ($f_i(\beta^{(k)}, u^{(k)}) < 0$, $\gamma_i^{(k)} > 0$). The dual and central residuals quantify how close a point $(\beta, u; \gamma_{u_1}, \gamma_{u_2}, \gamma_{\lambda_1}, \gamma_{\lambda_2})$ is to satisfying the KKT conditions with (5.2) in place of the slackness condition:
$$r_{dual} = \begin{pmatrix}\gamma_{u_1} - \gamma_{u_2} + X^TX(\gamma_{\lambda_1} - \gamma_{\lambda_2})\\ \mathbf{1} - \gamma_{u_1} - \gamma_{u_2}\end{pmatrix},\qquad r_{cent} = -\Gamma f - (1/\tau)\mathbf{1},$$
where $\Gamma$ is a diagonal matrix with $(\Gamma)_{ii} = \gamma_i$. The Newton step is the solution of
$$\begin{pmatrix}X^TX\Sigma_aX^TX + \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12} & \Sigma_{11}\end{pmatrix}\begin{pmatrix}\Delta\beta\\ \Delta u\end{pmatrix} = \begin{pmatrix}-(1/\tau)\cdot\left(X^TX(-f_{\lambda_1}^{-1} + f_{\lambda_2}^{-1}) - f_{u_1}^{-1} + f_{u_2}^{-1}\right)\\ -\mathbf{1} - (1/\tau)\cdot(f_{u_1}^{-1} + f_{u_2}^{-1})\end{pmatrix} := \begin{pmatrix}w_1\\ w_2\end{pmatrix},$$
where
$$\Sigma_{11} = -\Gamma_{u_1}F_{u_1}^{-1} - \Gamma_{u_2}F_{u_2}^{-1},\qquad \Sigma_{12} = \Gamma_{u_1}F_{u_1}^{-1} - \Gamma_{u_2}F_{u_2}^{-1},\qquad \Sigma_a = -\Gamma_{\lambda_1}F_{\lambda_1}^{-1} - \Gamma_{\lambda_2}F_{\lambda_2}^{-1}.$$
Setting
$$\Sigma_\beta = \Sigma_{11} - \Sigma_{12}^2\Sigma_{11}^{-1},$$
we can eliminate
$$\Delta u = \Sigma_{11}^{-1}(w_2 - \Sigma_{12}\Delta\beta)$$
and solve
$$(X^TX\Sigma_aX^TX + \Sigma_\beta)\Delta\beta = w_1 - \Sigma_{12}\Sigma_{11}^{-1}w_2$$
for $\Delta\beta$. As before, this system is symmetric positive definite, and the conjugate gradient (CG) algorithm can be used to solve it.
Given $\Delta\beta$ and $\Delta u$, the step directions for the inequality dual variables are given by
$$\begin{aligned}
\Delta\gamma_{u_1} &= -\Gamma_{u_1}F_{u_1}^{-1}(\Delta\beta - \Delta u) - \gamma_{u_1} - (1/\tau)f_{u_1}^{-1},\\
\Delta\gamma_{u_2} &= -\Gamma_{u_2}F_{u_2}^{-1}(-\Delta\beta - \Delta u) - \gamma_{u_2} - (1/\tau)f_{u_2}^{-1},\\
\Delta\gamma_{\lambda_1} &= -\Gamma_{\lambda_1}F_{\lambda_1}^{-1}(X^TX\Delta\beta) - \gamma_{\lambda_1} - (1/\tau)f_{\lambda_1}^{-1},\\
\Delta\gamma_{\lambda_2} &= -\Gamma_{\lambda_2}F_{\lambda_2}^{-1}(-X^TX\Delta\beta) - \gamma_{\lambda_2} - (1/\tau)f_{\lambda_2}^{-1},
\end{aligned}$$
where $F$ is a diagonal matrix with $(F)_{ii} = f_i$. With $(\Delta\beta, \Delta u, \Delta\gamma)$ we have a step direction. To choose the step length $0 < s\leq 1$, we ask that it satisfy two criteria:

1. $\beta + s\Delta\beta$, $u + s\Delta u$ and $\gamma + s\Delta\gamma$ are in the interior, i.e. $f_i(\beta + s\Delta\beta, u + s\Delta u) < 0$ and $\gamma_i > 0$ for all $i$.

2. The norm of the residuals has decreased sufficiently:
$$\|r_\tau(\beta + s\Delta\beta, u + s\Delta u, \gamma + s\Delta\gamma)\|_2 \leq (1 - \alpha s)\cdot\|r_\tau(\beta, u, \gamma)\|_2,$$
where $\alpha$ is a user-specified parameter (in all of our implementations, we have set $\alpha = 0.01$).

Since the $f_i$ are linear functionals, item 1 is easily addressed: we choose the maximum step size that just keeps us in the interior. Let
$$\mathcal{I}_f^+ = \{i : \langle c_i, \Delta z\rangle > 0\},\qquad \mathcal{I}_\gamma^- = \{i : \Delta\gamma_i < 0\},$$
where $z = (\beta, u)^T$ and $f_i = \langle c_i, z\rangle$, and set
$$s_{\max} = 0.99\cdot\min\left\{1,\ \{-f_i(z)/\langle c_i, \Delta z\rangle,\ i\in\mathcal{I}_f^+\},\ \{-\gamma_i/\Delta\gamma_i,\ i\in\mathcal{I}_\gamma^-\}\right\}.$$
Then, starting with $s = s_{\max}$, we check whether item 2 above is satisfied; if not, we set $s' = \nu\cdot s$ and try again. We have taken $\nu = 1/2$ in all of our implementations. When $r_{dual}$ is small, the surrogate duality gap $\eta = -f^T\gamma$ is an approximation of how close a certain $(\beta, u, \gamma)$ is to being optimal (i.e. $\langle c_0, z\rangle - \langle c_0, z^*\rangle \approx \eta$), where $\sum_iu_i = \langle c_0, z\rangle$. The primal-dual algorithm repeats the Newton iterations described above until $\eta$ has decreased below a given tolerance.
5.1.2 DASSO (James et al., 2009)
James et al. (2009) proposed a homotopy algorithm for the Dantzig selector, named the Dantzig selector with sequential optimization (DASSO). DASSO constructs a piecewise linear path while it identifies break points and solves the corresponding linear programs. DASSO is similar to the least angle regression and selection (LARS) algorithm (Efron et al., 2004), an efficient algorithm for the LASSO, and hence its computational cost is comparable to that of LARS. We first describe the LARS algorithm and then go into the details of DASSO. The LARS algorithm is defined as follows.
LARS (Efron et al., 2004)

1. Initialize: $\beta = 0$, $\mathcal{A} = \operatorname{argmax}_j|\nabla L(\beta)|_j$, $\gamma_{\mathcal{A}} = -\mathrm{sgn}(\nabla L(\beta))_{\mathcal{A}}$, $\gamma_{\mathcal{A}^C} = 0$, where $L(\beta) = \sum_i(y_i - x_i^T\beta)^2$.

2. While ($\max|\nabla L(\beta)| > 0$):

(a) $d_1 = \min\{d > 0 : |\nabla L(\beta + d\gamma)_j| = |\nabla L(\beta + d\gamma)_{\mathcal{A}}|,\ j\notin\mathcal{A}\}$ and $d_2 = \min\{d > 0 : (\beta + d\gamma)_j = 0,\ j\in\mathcal{A}\}$. Find the step length $d = \min(d_1, d_2)$.

(b) Take the step: $\beta\leftarrow\beta + d\gamma$.

(c) If $d = d_1$ then add the variable attaining equality at $d$ to $\mathcal{A}$; if $d = d_2$ then remove the variable attaining 0 at $d$ from $\mathcal{A}$.

(d) Calculate the new direction: $\gamma_{\mathcal{A}} = (X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1}\mathrm{sgn}(\beta_{\mathcal{A}})$ and $\gamma_{\mathcal{A}^C} = 0$.

The LARS procedure starts with all coefficients equal to zero and selects the variable most correlated with the response. LARS proceeds in the direction of this variable until some other variable has as much correlation with the current residual, and then adds this new variable to the set of selected variables. The direction is taken to be equiangular among the selected variables and changes whenever an addition or deletion happens. An addition occurs when the correlation of another variable with the current residual becomes the same as that of the selected variables, and a deletion happens when one of the coefficients of the selected variables becomes zero while the LARS procedure continues along the current direction.
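For reference, the LARS/LASSO path described above is available in scikit-learn; a minimal usage sketch is shown below (this calls the library's LARS path, not DASSO itself).

```python
# Illustrative call to scikit-learn's LARS path on synthetic data.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.standard_normal(50)

alphas, active, coefs = lars_path(X, y, method="lar")   # method="lasso" for the LASSO path
print(active[:5])   # order in which variables enter the active set
```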
The DASSO algorithm also solves (5.1) sequentially by constructing a piecewise linear solution path. DASSO is defined as follows.

DASSO (James et al., 2009)

1. Initialize: $l = 1$, $\beta^l = 0$, $\mathcal{A} = \operatorname{argmax}_j|X_j^T(y - X\beta^l)|$, $\mathcal{B} = \{j : \beta^l_j\neq 0\} = \emptyset$, $c = X^T(y - X\beta^l)$, $\gamma_{\mathcal{A}} = -\mathrm{sgn}(c_{\mathcal{A}})$, $\gamma_{\mathcal{A}^C} = 0$, $s_{\mathcal{A}} = \mathrm{sgn}(c_{\mathcal{A}})$.

2. While ($\max_j|X_j^T(y - X\beta^l)| > 0$):

(a) $d_1 = \min\{d > 0 : |X_j^T(y - X(\beta^l + d\gamma))| = |X_{\mathcal{A}}^T(y - X(\beta^l + d\gamma))|,\ j\notin\mathcal{A}\}$ and $d_2 = \min\{d > 0 : (\beta^l + d\gamma)_j = 0,\ j\in\mathcal{A}\}$. Find the step length $d = \min(d_1, d_2)$.

(b) If $d = d_1$ then add the variable attaining equality at $d$ to $\mathcal{A}$ and add the variable $j^*$ to $\mathcal{B}$; if $d = d_2$ then remove the variable attaining 0 at $d$ from $\mathcal{A}$ and $\mathcal{B}$.

(c) Calculate the new direction: $\gamma_{\mathcal{A}} = (X_{\mathcal{A}}^TX_{\mathcal{B}})^{-1}s_{\mathcal{A}}$ and $\gamma_{\mathcal{A}^C} = 0$.

(d) Take the step: $\beta^{l+1}\leftarrow\beta^l + d\gamma$.

(e) $l\leftarrow l + 1$.
The added variable $j^*$ and the distance are defined as follows.

- Define the added variable. Let the $|\mathcal{A}|\times(2p + |\mathcal{A}|)$ matrix $A = (-s_{\mathcal{A}}X_{\mathcal{A}}^TX \;\; s_{\mathcal{A}}X_{\mathcal{A}}^TX \;\; I)$, and let $A_j = \begin{pmatrix}A_{j1}\\ A_{j2}\end{pmatrix}$ be the $j$th column of $A$, where $A_{j2}$ is a scalar. Let $B = \begin{pmatrix}B_1\\ B_2\end{pmatrix}$ consist of the columns of $A$ that correspond to the nonzero components of $\beta^+$ and $\beta^-$, where $B_1$ is a square matrix of dimension $|\mathcal{A}| - 1$. Define
$$j^* = \operatorname*{argmax}_{j : q_j\neq 0,\ \alpha/q_j > 0}\ \left(\mathbf{1}^TB_1^{-1}A_{j1} - 1_{\{j\leq 2p\}}\right)|q_j|^{-1},$$
where $q_j = A_{j2} - B_2B_1^{-1}A_{j1}$ and $\alpha = B_2B_1^{-1}\mathbf{1} - 1$.

- Define the distance. Let the distance be $d = \min\{d_1, d_2\}$, where
$$d_1 = \min_{j\notin\mathcal{A}}\left\{\frac{c_k - c_j}{(X_k - X_j)^TX\gamma},\ \frac{c_k + c_j}{(X_k + X_j)^TX\gamma}\right\}_+$$
for $k\in\mathcal{A}$, and $d_2 = \min_{j\in\mathcal{B}}\{-\beta_j/\gamma_j\}$.

This rule for adding a variable comes from the piecewise linearity of the solution path and the definition of the Dantzig selector; for more detail, see the appendix of James et al. (2009). The distance step is the same as in the LARS algorithm.
5.1.3 Alternating direction method (ADM) (Lu et al., 2012)

The ADM has recently been widely used to solve large-scale problems. The general problems to which the ADM applies have the following form:
$$\min_{x,y}\ f(x) + g(y) \quad\text{subject to}\quad Ax + By = b,\ x\in C_1,\ y\in C_2, \qquad (5.3)$$
where $f$ and $g$ are convex functions, $A$ and $B$ are matrices, $b$ is a vector, and $C_1$ and $C_2$ are closed convex sets. The ADM consists of two subproblems and a multiplier update. Under mild assumptions, it is known that the ADM converges to an optimal solution of (5.3) (Bertsekas and Tsitsiklis, 1989). The ADM formulation for the Dantzig selector is
$$\min_{\beta,z}\ \|\beta\|_1 \quad\text{subject to}\quad X^T(X\beta - y) - z = 0,\ \|D^{-1}z\|_\infty\leq\lambda, \qquad (5.4)$$
where $D$ is the diagonal matrix whose diagonal elements are the norms of the columns of $X$. An augmented Lagrangian function for problem (5.4), for some $\mu > 0$, can be defined as
$$L_\mu(z, \beta, \eta) = \|\beta\|_1 + \eta^T(X^TX\beta - X^Ty - z) + \frac{\mu}{2}\|X^TX\beta - X^Ty - z\|_2^2.$$

ADM algorithm for the Dantzig selector

1. Initialize: let $\beta^0, \eta^0\in\mathbb{R}^p$ and $\mu > 0$.

2. For $k = 0, 1, \ldots$
$$z^{k+1} = \operatorname*{argmin}_{\|D^{-1}z\|_\infty\leq\lambda}L_\mu(z, \beta^k, \eta^k),\qquad \beta^{k+1}\in\operatorname*{argmin}_\beta L_\mu(z^{k+1}, \beta, \eta^k),\qquad \eta^{k+1} = \eta^k + \mu(X^TX\beta^{k+1} - X^Ty - z^{k+1}).$$
End (for)
We now go into the subproblems of the ADM. The dual problem of (5.4) is given by
$$\max_\eta\ d(\eta) := -y^TX\eta - \lambda\|D\eta\|_1 \quad\text{subject to}\quad \|X^TX\eta\|_\infty\leq 1.$$
The first subproblem has the closed form solution
$$z^{k+1} = \operatorname*{argmin}_{\|D^{-1}z\|_\infty\leq\lambda}\left\|z - \left(X^TX\beta^k - X^Ty + \frac{\eta^k}{\mu}\right)\right\|_2^2 = \min\left\{\max\left\{X^TX\beta^k - X^Ty + \frac{\eta^k}{\mu},\ -\lambda d\right\},\ \lambda d\right\},$$
where $d$ is the vector consisting of the diagonal entries of $D$ and the min and max are taken elementwise. For the second subproblem, we can choose $\beta^{k+1}$ which solves the following approximated subproblem:
$$\min_\beta\ \frac{\mu}{2}\left\|X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\right\|_2^2 + \|\beta\|_1.$$
This problem can be solved by the nonmonotone gradient methods for nonsmooth minimization (Lu and Zhang, 2012). The general problems to which the nonmonotone gradient method applies can be written as
$$\min_{x\in\mathcal{X}}\ F(x) := f(x) + P(x),$$
where $f : \mathbb{R}^n\to\mathbb{R}$ is continuously differentiable, $P : \mathbb{R}^n\to\mathbb{R}$ is convex but not necessarily smooth, and $\mathcal{X}\subseteq\mathbb{R}^n$ is closed and convex. Here, $f(\beta) = \frac{\mu}{2}\|X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\|_2^2$ and $P(\beta) = \|\beta\|_1$, so that $\nabla f(\beta) = \mu X^TX\left\{X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\right\}$. The nonmonotone gradient method for solving the $\beta$-subproblem is defined as follows:

1. Initialize: $0 < \tau, \sigma < 1$, $0 < \alpha < 1$ and an integer $M\geq 0$. Let $\beta^0$ be given and set $\alpha_0 = 1$.

2. For $l = 0, 1, \ldots$

(a) Let $d^l = \mathrm{SoftThresh}(\beta^l - \alpha_l\nabla f(\beta^l), \alpha_l) - \beta^l$ and $\Delta^l = \nabla f(\beta^l)^Td^l + \|\beta^l + d^l\|_1 - \|\beta^l\|_1$.

(b) Find the largest $\alpha'\in\{1, \tau, \tau^2, \ldots\}$ such that
$$f(\beta^l + \alpha'd^l) + \|\beta^l + \alpha'd^l\|_1 \leq \max_{[l-M]_+\leq i\leq l}\left\{f(\beta^i) + \|\beta^i\|_1\right\} + \sigma\alpha'\Delta^l.$$
Set $\alpha_l\leftarrow\alpha'$, $\beta^{l+1}\leftarrow\beta^l + \alpha_ld^l$ and $l\leftarrow l + 1$.

(c) Update $\alpha_{l+1} = \min\left\{\max\left\{\frac{\|s^l\|^2}{(s^l)^Tg^l}, \alpha\right\}, 1\right\}$, where $s^l = \beta^{l+1} - \beta^l$ and $g^l = \nabla f(\beta^{l+1}) - \nabla f(\beta^l)$.

End (for)

Here $\mathrm{SoftThresh}(v, \gamma) := \mathrm{sgn}(v)\max\{0, |v| - \gamma e\}$, applied elementwise with $e$ the vector of ones. For the specific termination rules used in the ADM, see Lu et al. (2012).
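A rough sketch of the ADM iteration is given below. The z-update uses the closed form projection above, while the β-subproblem is solved here with a few plain proximal-gradient (ISTA) steps rather than the nonmonotone gradient scheme of Lu and Zhang (2012), so this is only an illustration of the update structure, not a faithful reimplementation.

```python
# Illustrative ADM loop for the Dantzig selector formulation (5.4).
import numpy as np

def soft_thresh(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def adm_dantzig(X, y, lam, mu=1.0, n_iter=200, inner=5):
    n, p = X.shape
    G = X.T @ X
    c = X.T @ y
    d = np.linalg.norm(X, axis=0)                  # diagonal of D
    beta = np.zeros(p)
    eta = np.zeros(p)
    L = mu * np.linalg.norm(G, 2) ** 2 + 1e-12     # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        # z-update: projection onto the box ||D^{-1} z||_inf <= lam (closed form)
        z = np.clip(G @ beta - c + eta / mu, -lam * d, lam * d)
        # beta-update: a few ISTA steps on (mu/2)||G beta - c - z + eta/mu||^2 + ||beta||_1
        for _ in range(inner):
            grad = mu * G @ (G @ beta - c - z + eta / mu)
            beta = soft_thresh(beta - grad / L, 1.0 / L)
        # multiplier update
        eta = eta + mu * (G @ beta - c - z)
    return beta
```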
Bibliography
Banerjee, O., El Ghaoui, L., and d’Aspremont, A. (2008). Model selection
through sparse maximum likelihood estimation for multivariate gaussian or
binary data. The Journal of Machine Learning Research, 9:485–516.
Becker, S. R., Candes, E. J., and Grant, M. C. (2011). Templates for convex
cone problems with applications to sparse signal recovery. Mathematical
Programming Computation, 3(3):165–218.
Bertsekas, D. and Tsitsiklis, J. (1989). Parallel and distributed computation.
Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance
matrices. The Annals of Statistics, pages 199–227.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of
lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732.
Breiman, L. (1996). Heuristics of instability and stabilization in model selec-
tion. The Annals of Statistics, 24(6):2350–2383.
Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach
to sparse precision matrix estimation. Journal of the American Statistical
Association, 106(494):594–607.
Cai, T. T., Liu, W., Luo, X. R., and Luo, M. X. R. (2012). Package ’clime’.
Candes, E. and Plan, Y. (2009). Near-ideal model selection by l1 minimization.
The Annals of Statistics, 37(5A):2145–2177.
Candes, E. and Romberg, J. (2005). l1-magic: Recovery of sparse signals via convex programming. URL: www.acm.caltech.edu/l1magic/downloads/l1magic.pdf.
Candes, E. and Tao, T. (2005). Decoding by linear programming. Information
Theory, IEEE Transactions on, 51(12):4203–4215.
Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation
when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
Chen, J. and Chen, Z. (2008). Extended bayesian information criteria for
model selection with large model spaces. Biometrika, 95(3):759–771.
Dicker, L. (2010). Regularized Regression Methods for Variable Selection and
Estimation. Collections of the Harvard University Archives: Dissertations.
Harvard University.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle
regression. The Annals of Statistics, 32(2):407–499.
Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive
lasso and scad penalties. The Annals of Applied Statistics, 3(2):521.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Associ-
ation, 96(456):1348–1360.
Fan, J., Xue, L., and Zou, H. (2012). Strong oracle optimality of folded concave
penalized estimation. arXiv preprint arXiv:1210.5992.
Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemo-
metrics regression tools. Technometrics, 35(2):109–135.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance
estimation with the graphical lasso. Biostatistics, 9(3):432–441.
Friedman, J., Hastie, T., Tibshirani, R., and Tibshirani, M. R. (2013). Package
’glasso’.
Gai, Y., Zhu, L., and Lin, L. (2013). Model selection consistency of dantzig
selector. Statistica Sinica, 23:615–634.
Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional lin-
ear predictor selection and the virtue of overparametrization. Bernoulli,
10(6):971–988.
Hess, K. R., Anderson, K., Symmans, W. F., Valero, V., Ibrahim, N., Mejia,
J. A., Booser, D., Theriault, R. L., Buzdar, A. U., Dempsey, P. J.,
et al. (2006). Pharmacogenomic predictor of sensitivity to preoperative
chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophos-
phamide in breast cancer. Journal of Clinical Oncology, 24(26):4236–4244.
Huang, J., Horowitz, J. L., and Ma, S. (2008a). Asymptotic properties of
bridge estimators in sparse high-dimensional regression models. The Annals
of Statistics, 36(2):587–613.
Huang, J., Ma, S., and Zhang, C.-H. (2008b). Adaptive lasso for sparse high-
dimensional regression models. Statistica Sinica, 18(4):1603.
James, G. M., Radchenko, P., and Lv, J. (2009). Dasso: connections between
the dantzig selector and lasso. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 71(1):127–142.
Kim, Y., Choi, H., and Oh, H. (2008). Smoothly clipped absolute devia-
tion on high dimensions. Journal of the American Statistical Association,
103(484):1665–1673.
Kim, Y. and Kwon, S. (2012). Global optimality of nonconvex penalized esti-
mators. Biometrika, 99(2):315–325.
Kim, Y., Kwon, S., and Choi, H. (2012). Consistent model selection criteria on
high dimensions. The Journal of Machine Learning Research, 98888:1037–
1057.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large
covariance matrix estimation. Annals of statistics, 37(6B):4254.
Lu, Z. (2012). Primal–dual first-order methods for a class of cone programming.
Optimization Methods and Software, 28(6):1262–1281.
Lu, Z., Pong, T. K., and Zhang, Y. (2012). An alternating direction method
for finding dantzig selectors. Computational Statistics & Data Analysis.
Lu, Z. and Zhang, Y. (2012). An augmented lagrangian approach for sparse
principal component analysis. Mathematical programming, 135(1-2):149–
193.
Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and vari-
able selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
Meinshausen, N., Rocha, G., and Yu, B. (2007). Discussion: A tale of three
cousins: Lasso, l2boosting and dantzig. The Annals of Statistics, 35(6):2373–
2384.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A
unified framework for high-dimensional analysis of m-estimators with de-
composable regularizers. Statistical Science, 27(4):538–557.
Peng, J., Wang, P., Zhou, N., and Zhu, J. (2009). Partial correlation estima-
tion by joint sparse regression models. Journal of the American Statistical
Association, 104(486):735–746.
Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue
properties for correlated gaussian designs. The Journal of Machine Learning
Research, 99:2241–2259.
Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of es-
timation for high-dimensional linear regression over `q-balls. Information
Theory, IEEE Transactions on, 57(10):6976–6994.
Ravikumar, P., Wainwright, M., Raskutti, G., and Yu, B. (2011).
High-dimensional covariance estimation by minimizing l1-penalized log-
determinant divergence. Electronic Journal of Statistics, 5:935–980.
Romberg, J. (2008). The dantzig selector and generalized thresholding. In In-
formation Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference
on, pages 22–25. IEEE.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random
measurements. Information Theory, IEEE Transactions on, 59(6):3434–
3447.
Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A.,
Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant,
T. L., et al. (2006). Regulation of gene expression in the mammalian eye
and its relevance to eye disease. Proceedings of the National Academy of
Sciences, 103(39):14429–14434.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society. Series B (Methodological), pages 267–288.
Wang, L., Kim, Y., and Li, R. (2013). Calibrating nonconvex penalized regres-
sion in ultra-high dimension. The Annals of Statistics, 41(5):2505–2536.
Wang, X. and Yuan, X. (2012). The linearized alternating direction method
of multipliers for dantzig selector. SIAM Journal on Scientific Computing,
34(5):A2792–A2811.
Yuan, M. (2010). High dimensional inverse covariance matrix estimation via
linear programming. The Journal of Machine Learning Research, 99:2261–
2286.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the gaussian
graphical model. Biometrika, 94(1):19–35.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax con-
cave penalty. The Annals of Statistics, 38(2):894–942.
Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection
in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–
1594.
Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regulariza-
tion for high-dimensional sparse estimation problems. Statistical Science,
27(4):576–593.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The
Journal of Machine Learning Research, 7:2541–2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101(476):1418–1429.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized
likelihood models. Annals of Statistics, 36(4):1509.
국문초록
변수선택은 고차원 회귀분석에서 중요하다. 단계별선택법과 같은 전통적인
변수선택방법들은데이터에따라서선택된변수들이달라지므로불안정하
다. 이에 대한 대안으로 변수선택과 추정을 동시에 하는 벌점화 방법론들이
사용된다. 라소 추정량은 희소성을 가지지만 변수선택 일치성이 없으며 편
향되어있다. SCAD, MCP와 같은 비볼록 벌점화 방법론들은 선택일치성을
가지며비편향추정량임이잘알려져있다.하지만이방법론들은다중국소
해들을 가질 수 있으며 조율계수에 따라 계산이 불안정하다. 신의 추정량을
유일한 국소해로 가질 수 있는 라소에 기반을 둔 이단계 방법론들이 개발되
었다.
본 연구에서는 라소를 단치그 셀렉터로 변형시킨 새로운 이단계 방법론
을제안한다.제안한방법은이단계방법론의두번째단계에서잡음변수들
의 영향력을 줄이는 것이 매우 중요하다는 점에 착안하였다. 단치그 셀렉터
의 `1-놈은같은조율계수에대해라소의 `1-놈보다작거나같고비점근적오
차범위 또한 단치그 셀렉터가 라소보다 작은 경향이 있다. 그러므로 우리는
이단계방법론에라소대신단치그셀렉터를이용하므로변수선택일치성을
만족시키면서 추정을 좀더 개선시킬 것이라 기대한다. 본 연구에서는 자료
에 대한 조건 가정 아래, 제안한 방법을 통해 신의 추정량을 얻을 수 있음을
증명하였다. 그리고 수치적 연구를 통해 변수선택과 추정에 있어서 이단계
단치그 셀렉터가 라소에 기반을 둔 이단계 방법론을 개선시킬 수 있으며,
기존의 다른 방법론들과 비교해서도 좋은 성능을 보임을 확인하였다.
본 연구에서는 추가적으로 공분산 역행렬 추정에 이단계 방법론들을 적
용한다. 공분산 역행렬은 다양한 통계적 문제에 활용되며 그 자체로 조건부
상관성을 내포하므로 매우 중요하다. 제안된 방법을 통해 제약 조건 하에서
공분산역행렬계수가 0인지에대한선택일치성을가질수있으며, 0이아닌
참공분산역행렬계수들에대해√n-일치성을가짐을보였다.수치적연구를
통해 제안된 방법이 계수 선택과 추정에 있어서 기존의 방법론들보다 좋은
성능을 가짐을 확인하였다.
주요어 : 고차원 회귀분석, 변수 선택, 단치그 셀렉터, 선택 일치성, 신의
추정량, 공분산 역행렬
학 번 : 2007− 20263
감사의 글
학부와 대학원의 관악에서의 11년간 좋은 교수님, 친구, 선배, 후배, 도움주신 많은 분
들과 만나게 하시고 모든 과정을 하나님의 은혜로 마치게 해주심에 감사드립니다. 통계의
전문성을 갖춰 공공의 유익이 되고 싶은 소망을 가지고 대학원에 진학한지도 7년이 지나
이제박사로첫발을내딛으려하니감회가새롭습니다.새출발에앞서지난기간동안도움
주신 많은 분들께 감사의 말씀을 전하고 싶습니다.
가장 먼저 박사과정 전반에 다양한 기회를 주시고 지도해주신 김용대 교수님께 감사드
립니다.그리고항상푸근하게맞아주시고독려해주신전종우교수님께감사드립니다.논문
심사에서조언해주시고수고해주신박병욱교수님,장원철교수님,임요한교수님께도정말
감사드립니다.연구실선배님이신멋있는권성훈교수님께도논문심사와여러가지조언들
에 감사드립니다. 6여 년간의 연구실 생활에 좋으신 선후배님들과 함께여서 즐거웠습니다.
최호식 교수님, 동화오빠, 도현오빠, 범수오빠, 광수아저씨, 재석오빠, 상인오빠, 병엽오빠,
종준오빠, 미애언니, 수희언니, 신선언니, 영희, 건수오빠, 혜림이, 미경이, 효원이, 지영이,
원준이, 지선이, 지영언니, 주유오빠, 민우, 승환이, 우성이, 재성이, 슬기, 동하, 세민이, 구
환이,윤영언니,승남이,오란이그리고네이버팀김유원이사님,정효주박사님,인재오빠,
연하언니 정말 감사했습니다. 학부 때부터 단짝친구 영선이, 3년 넘게 룸메로 힘이 되어준
신영이, 귀엽고 속 깊은 동화 같은 정은이, 박사동기 예지, 성일오빠, 말년에 즐거운 시간들
함께해준 선미언니, 과사 정환언니에게 고마운 마음 전하고 싶습니다.
그리고저의대학원생활전반을함께한통계학과기독인모임에감사드립니다.모임의
큰기둥이되어주신오희석교수님,바쁘신중에도참여해주신박태성교수님,맛있는식사
와격려로힘이되어주신조명희교수님,예배로함께해주신이영조교수님,박성현교수님,
송문섭교수님께도감사드립니다.민정언니,성준오빠때부터지금의지영이,민주,은용이,
성경오빠,재혁이, 보창이까지함께 나누고 교제할 수 있어서 즐거웠습니다. 대학원 생활의
활력소였던 수요채플 찬양 팀 준오빠, 정민이, 은혜, 바우오빠, 송희언니, 민우, 정민오빠,
지웅오빠, 재희, 나래, 바뚜, 문수오빠, 신잉, 서교교수님, 경주오빠, 현주, 지나, 찬미, 경만
이, 서림이, 건의, 민화언니, 소정언니, 소영이 모두 덕분에 정말 감사하고 즐거웠습니다.
사랑스러운보신자매들윤진이,효현이,지인이와의소소한즐거움들에정말감사했습니다.
그리고 닮고 싶은 바른 그리스도인의 전형을 보여주신 남승호 교수님과 이원종 교수님, 신
앙성장에도움주시고격려해주신대학교회김동식목사님,마리아사모님,홍종인교수님,
김난주 사모님께도 감사드립니다.
마지막으로 한결같은 사랑과 신뢰, 기도로 뒷받침해주신 부모님과 든든한 버팀목이 되
어준 동생에게 감사합니다. 오랜 지기 새미, 정현이에게도 고마운 마음 전합니다. 수학에
흥미를 가질 수 있도록 도움 주신 은사님이신 구명수 선생님께도 그간 연락드리지 못해
죄송하고 정말 감사했습니다. 그리고 기도로 응원해주신 친척 분들과 교회식구들께 감사
드립니다. 아직도 많이 부족하지만 앞으로 더욱 성실함과 진실함, 사랑하는 마음으로 제가
속한 공동체와 나라에 유익이 되기 위해 노력하겠습니다. 감사합니다.
2014년 2월 한 상 미