A Dissertation for the Degree of Doctor of Philosophy
Two Stage Dantzig Selector for
High Dimensional Data
고차원 자료를 위한 이단계 단치그 셀렉터
February 2014
Graduate School, Seoul National University
Department of Statistics
Sangmi Han
Two Stage Dantzig Selector for
High Dimensional Data
Advisor: Professor 김용대

Submitting this dissertation for the degree of Doctor of Philosophy
October 2013
Graduate School, Seoul National University
Department of Statistics
Sangmi Han

Confirming the Doctor of Philosophy dissertation written by Sangmi Han
December 2013

Chair: 박병욱 (Seal)
Vice Chair: 김용대 (Seal)
Member: 임요한 (Seal)
Member: 장원철 (Seal)
Member: 권성훈 (Seal)
Two Stage Dantzig Selector for
High Dimensional Data
by
Sangmi Han
A Thesis
Submitted in fulfillment of the requirements
for the degree of
Doctor of Philosophy
in Statistics
Department of Statistics
College of Natural Sciences
Seoul National University
February 2014
Abstract
Variable selection is important in high dimensional regression. Traditional
variable selection methods such as stepwise selection are unstable, in the sense
that the set of selected variables varies from one data set to another. As an
alternative, a series of penalized methods perform estimation and variable
selection simultaneously. The LASSO yields a sparse solution, but it is biased
and not selection consistent. Nonconvex penalized methods such as the SCAD
and the MCP are known to be selection consistent and to yield unbiased
estimators. However, they suffer from multiple local minima and their
computation is unstable with respect to the tuning parameter. Two stage
methods based on the LASSO, such as the one step LLA and the calibrated
CCCP, have been developed to obtain the oracle estimator as the unique local
minimum.
We propose a two stage method based on the Dantzig selector. The motivation
is that lessening the effect of the noise variables is important in the two stage
method. The ℓ1 norm of the Dantzig selector is always less than or equal to
that of the LASSO, and the non-asymptotic error bounds of the Dantzig
selector tend to be smaller than those of the LASSO for the same tuning
parameter. Therefore we expect improved estimation when the Dantzig
selector is used instead of the LASSO in the two stage method, while the
proposed method still satisfies selection consistency. The results of the
numerical experiments support this contention.
We also apply these two stage methods, based on either the LASSO or the
Dantzig selector, to the estimation of the inverse covariance matrix (a.k.a. the
precision matrix). Precision matrix estimation is essential not only because it
is used in various applications but also because, under the normality
assumption, it describes the direct relationship between variables via their
conditional dependence. Under some regularity conditions our methods achieve
selection consistency and obtain columnwise √n-consistent estimators of the
true nonzero precision matrix elements. The numerical analyses show that the
proposed methods perform well in terms of variable selection and estimation.
Keywords: High dimensional regression, variable selection, Dantzig selector,
selection consistency, oracle estimator, inverse covariance matrix estimation
Student Number: 2007-20263
Contents
Abstract i
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Two Stage Dantzig Selector for High Dimensional Linear Regression 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Sparse regularization methods . . . . . . . . . . . . . . . . . . . 8
2.2.1 The ℓ1 regularization . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Nonconvex penalized methods . . . . . . . . . . . . . . . 13
2.2.3 Two stage methods . . . . . . . . . . . . . . . . . . . . . 21
2.3 Two Stage Dantzig Selector . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.3 Theoretical properties . . . . . . . . . . . . . . . . . . . 29
2.3.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.5 Tuning regularization parameter . . . . . . . . . . . . . . 37
2.4 Numerical analyses . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . 44
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Two Stage Methods for Precision Matrix Estimation 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Estimation of precision matrix via columnwise two-stage methods 49
3.2.1 Two stage method based on LASSO . . . . . . . . . . . 50
3.2.2 Two stage Dantzig selector . . . . . . . . . . . . . . . . . 53
3.2.3 Theoretical results . . . . . . . . . . . . . . . . . . . . . 55
3.3 Numerical analyses . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . 83
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4 Concluding remarks 88
5 Appendix 90
5.1 Algorithms for Dantzig selector . . . . . . . . . . . . . . . . . . 90
5.1.1 Primal-dual interior point algorithm (Candes and Romberg,
2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.2 DASSO (James et al., 2009) . . . . . . . . . . . . . . . . 96
5.1.3 Alternating direction method (ADM) (Lu et al., 2012) . 100
Abstract (in Korean) 111
Acknowledgements (in Korean) 114
List of Tables
2.1 Example 1 (R=0.3) . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2 Example 1 (R=0.5) . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 Example 2 (R=0.3) . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Example 2 (R=0.5) . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5 Real Data (TRIM) . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Example 1 (p=100, q=99) . . . . . . . . . . . . . . . . . . . . . 68
3.2 ‖Ω −Ω∗‖ of Example 1 (p=100, q=99) . . . . . . . . . . . . . 70
3.3 Example 1 (p=200, q=199) . . . . . . . . . . . . . . . . . . . . 72
3.4 ‖Ω −Ω∗‖ of Example 1 (p=200, q=199) . . . . . . . . . . . . . 74
3.5 Example 2 (p=100, q=59) . . . . . . . . . . . . . . . . . . . . . 76
3.6 ‖Ω −Ω∗‖ of Example 2 (p=100, q=59) . . . . . . . . . . . . . 78
3.7 Example 2 (p=200, q=92) . . . . . . . . . . . . . . . . . . . . . 80
3.8 ‖Ω −Ω∗‖ of Example 2 (p=200, q=92) . . . . . . . . . . . . . 82
3.9 Real Data (Breast Cancer) . . . . . . . . . . . . . . . . . . . . . 86
List of Figures
2.1 LASSO and Dantzig selector . . . . . . . . . . . . . . . . . . . 10
2.2 Penalized method . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 LLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 CCCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Nonconvex penalties . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Adaptive DS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 ROC curve of Example 1 (p=100, q=99) . . . . . . . . . . . . . 67
3.2 ‖Ω −Ω∗‖ of Example 1 (p=100, q=99) . . . . . . . . . . . . . 69
3.3 ROC curve of Example 1 (p=200, q=199) . . . . . . . . . . . . 71
3.4 ‖Ω −Ω∗‖ of Example 1 (p=200, q=199) . . . . . . . . . . . . . 73
3.5 ROC curve of Example 2 (p=100, q=59) . . . . . . . . . . . . . 75
3.6 ‖Ω −Ω∗‖ of Example 2 (p=100, q=59) . . . . . . . . . . . . . 77
3.7 ROC curve of Example 2 (p=200, q=92) . . . . . . . . . . . . . 79
3.8 ‖Ω −Ω∗‖ of Example 2 (p=200, q=92) . . . . . . . . . . . . . 81
Chapter 1
Introduction
1.1 Overview
High dimensional data analysis has received much attention due to advances
in data collection technologies. High dimensional data, where the number of
covariates exceeds the number of observations, arise in various fields such as
genomics, neuroscience, economics, finance, and machine learning. Variable
selection is fundamental to data analysis because it identifies significant
variables among many covariates, and in high dimensions its importance only
grows. There are two approaches to variable selection: subset selection and
sparse regularization. In high dimensions, subset selection methods such as
best subset selection are computationally demanding and unstable, and their
sampling properties are hard to derive (Breiman, 1996).
To handle these drawbacks, many sparse regularization methods have been
proposed. These methods can select significant variables and estimate coeffi-
cients simultaneously. Two major approaches in sparse regularization methods
are ℓ1 regularization, including the LASSO (Tibshirani, 1996) and the Dantzig
selector (Candes and Tao, 2007), and nonconvex penalization, including the
SCAD (Fan and Li, 2001) and the MCP (Zhang, 2010). For selection consistency,
ℓ1 regularization methods need stringent conditions such as the strong
irrepresentable condition (Zhao and Yu, 2006).
On the other hand, nonconvex penalized methods not only avoid such
conditions for selection consistency but also reduce the innate bias of ℓ1
regularization methods. Despite these good properties, nonconvex penalized
methods suffer from multiple local minima, and the converged solution is not
guaranteed to be the oracle estimator. As an alternative, two stage methods
based on the LASSO, such as the one step LLA and the calibrated CCCP
algorithms, have been proposed to obtain the oracle estimator.
In this thesis, we deal with regularization methods for the high dimensional
linear regression model. We focus on developing a new regularization method
that obtains the oracle estimator. We propose a two stage method based on the
Dantzig selector. Our method can improve variable selection and estimation by
deleting noise variables more efficiently, using the Dantzig selector instead of
the LASSO.
We also deal with precision matrix estimation as an application of high
dimensional linear regression. For sparse precision matrix estimation, many
regularization methods have been considered. Most of them belong to one of
two frameworks: the maximum likelihood approach and the regression based
approach. We apply the two stage methods based on the LASSO or the Dantzig
selector to the regression based approach and show that they can obtain the
columnwise oracle estimator of the precision matrix. Numerical results show
that our proposed methods are superior to other regularized methods in terms
of support recovery and estimation of the precision matrix.
1.2 Outline of the thesis
The thesis is organized as follows. In chapter 2, we deal with high dimensional
linear regression. We review diverse sparse regularization methods for high
dimensional linear regression and propose two stage Dantzig selector. Theo-
retical properties and algorithm for two stage Dantzig selector are provided,
and we compare our method to other methods in numerical analyses. In chap-
ter 3, precision matrix estimation using regularization methods is considered.
We review existing regularization methods and propose new methods which
utilize the two stage methods based on LASSO or Dantzig selector. We prove
theoretical properties of two stage methods, and numerical analyses are con-
ducted. Concluding remarks follow in chapter 4. In the Appendix, algorithms
for the Dantzig selector are summarized.
Chapter 2
Two Stage Dantzig Selector for
High Dimensional Linear
Regression
2.1 Introduction
Variable selection is essential for linear regression analysis. There are two ap-
proaches for variable selection, which are subset selection and regularization.
Subset selection selects a subset of covariates and uses only the selected
covariates to fit the model. Popular examples of subset selection are best
subset selection, forward selection, backward elimination, and stepwise
selection. In high dimensions, these subset selection methods are
computationally demanding and unstable. Furthermore, their sampling
properties are hard to
derive (see Breiman (1996) for more discussions).
To deal with these drawbacks, many sparse regularization methods have
been proposed which can select the important variables and estimate the ef-
fect of covariates on the response simultaneously. The `1 regularization meth-
ods and the nonconvex penalized methods are two mainstreams of regularized
estimators for high dimensional regression model. The least absolute shrinkage
and selection operator (LASSO) (Tibshirani, 1996) and the Dantzig selector
(Candes and Tao, 2007) are two representative examples of the ℓ1 regulariza-
tion. They are easy to compute and have good estimation properties (Bickel
et al., 2009; Raskutti et al., 2011). However, they are intrinsically biased and
achieve selection consistency only under stringent conditions such as the
irrepresentable condition (Zhao and Yu, 2006).
On the other hand, nonconvex penalized methods such as the smoothly
clipped absolute deviation (SCAD) (Fan and Li, 2001), and the minimax con-
cave penalty (MCP) (Zhang, 2010) can have unbiasedness and selection consis-
tency, simultaneously. The most fascinating property of nonconvex penalized
methods is the oracle property. The oracle property means that covariates
are selected consistently and the efficiency of the estimator is equivalent to
the least square estimator obtained with knowing true nonzero coefficients in
advance (Fan and Li, 2001; Kim et al., 2008; Kim and Kwon, 2012).
However, due to their nonconvexity, there can be many local minima in the
corresponding objective function. Therefore, it is not guaranteed for an ob-
tained estimator to be the oracle estimator. Even though some previous works
(Kim and Kwon, 2012; Zhang, 2010) showed that the objective function with
a nonconvex penalty can have a unique local minimizer under some regularity
conditions, its computation may be demanding for high dimensional models.
Typically, optimization problems corresponding to nonconvex penalized meth-
ods are solved by iterative algorithms, including the concave convex procedure
(CCCP) (Kim et al., 2008) and the local linear approximation (LLA) (Zou and
Li, 2008), where the nonconvex objective function is approximated by a locally
linear function. However, it takes a significant amount of time for algorithms
to converge, and typically the nonconvex penalized methods suffer from instability
in tuning the regularization parameter. Furthermore, these algorithms only
assure the convergence to a local minimum which is not necessarily the oracle
estimator (Wang et al., 2013).
Two stage methods based on LASSO are proposed to obtain the oracle
estimator such as the one step LLA (Zou and Li, 2008; Fan et al., 2012) and
the calibrated CCCP (Wang et al., 2013). The main idea of these methods is
to obtain the oracle estimator by solving the LASSO problem twice.
In this chapter, we propose a two stage method based on the Dantzig selector
to obtain the oracle estimator, which we call the two stage Dantzig selector.
The Dantzig selector used in our method can improve variable selection and
estimation by lessening the effects of noise variables more efficiently than the
LASSO. We prove that the two stage Dantzig selector obtains the oracle estimator under
regularity conditions. The proposed method can be easily implemented by
general algorithms for the standard Dantzig selector. Numerical results show
that our proposed method outperforms other sparse regularization methods
with respect to variable selection and estimation.
2.2 Sparse regularization methods
In this section, we review various sparse regularization methods for high di-
mensional linear regression. Consider the linear regression model
y = Xβ + ε,
where y is an n × 1 response vector, X = (x1, . . . , xn)^T = (X1, . . . , Xp) is an
n × p covariate matrix with Xj ∈ R^n and xi ∈ R^p, β is a p × 1 vector of
unknown regression coefficients, and ε is an n × 1 vector of random errors. In
high dimensions, the ordinary least squares estimator is not uniquely defined and the
traditional variable selection methods such as best subset selection and stepwise
selection based on AIC or BIC criteria are computationally intensive. Further-
more, they are unstable and their sampling properties are hard to derive (Breiman,
1996). As an alternative, sparse regularization methods are used to estimate
coefficients and select variables. There are two mainstreams of regularization
methods: ℓ1 regularization and nonconvex penalization.
2.2.1 The ℓ1 regularization

The ℓ1 regularization achieves sparsity, which means that the estimator produces
exactly zero coefficients and hence reduces model complexity. The LASSO
(Tibshirani, 1996) and the Dantzig selector (Candes and Tao, 2007) are two
representative examples of ℓ1 constrained methods. The LASSO estimator
β̂LASSO is defined as the solution of
$$\min_{\beta}\ \Big\{ \|y - X\beta\|^2/2n + \lambda\|\beta\|_1 \Big\},$$
or equivalently,
$$\min_{\beta}\ \|y - X\beta\|^2 \quad \text{subject to} \quad \|\beta\|_1 \le t,$$
where ‖a‖ = (∑_{i=1}^n a_i^2)^{1/2} and ‖a‖1 = ∑_{i=1}^n |a_i|.
[Figure 2.1: LASSO and Dantzig selector. Two panels ("LASSO", "Dantzig Selector") plot β2 against β1, marking the true coefficient and the OLS estimate.]
The Dantzig selector β̂Dantzig is defined similarly to the LASSO estimator by
$$\min_{\beta}\ \|\beta\|_1 \quad \text{subject to} \quad \Big\|\tfrac{1}{n} X^T (y - X\beta)\Big\|_\infty \le \lambda,$$
or equivalently,
$$\min_{\beta}\ \|X^T (y - X\beta)\|_\infty \quad \text{subject to} \quad \|\beta\|_1 \le t,$$
where ‖a‖∞ = maxi |ai| for a ∈ Rn.
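Both forms above are linear programs in β. The following sketch is our own illustration (not part of the thesis; the function name dantzig_selector and the synthetic data are assumptions) of how the first form can be solved with an off-the-shelf LP solver by splitting β into positive and negative parts.

```python
# A minimal sketch of the Dantzig selector as a linear program:
# minimize ||beta||_1 subject to ||X^T (y - X beta)||_inf / n <= lam.
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    n, p = X.shape
    G = X.T @ X / n                          # Gram matrix (1/n) X^T X
    c = X.T @ y / n                          # correlations (1/n) X^T y
    obj = np.ones(2 * p)                     # beta = u - v with u, v >= 0
    A_ub = np.vstack([np.hstack([ G, -G]),   #  G(u - v) <= c + lam
                      np.hstack([-G,  G])])  # -G(u - v) <= -c + lam
    b_ub = np.concatenate([c + lam, -c + lam])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

# toy usage: three signal variables out of fifty (illustrative data)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
y = X[:, :3] @ np.array([3.0, 1.5, 2.0]) + rng.standard_normal(100)
beta_hat = dantzig_selector(X, y, lam=0.2)
```

Specialized algorithms such as DASSO or the primal-dual interior point method (see the Appendix) are preferable for large p; the LP form above is only meant to make the definition concrete.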
As shown in Figure 2.1, the solution of the LASSO occurs at the point of
contact between the dotted ellipsoid and the solid diamond, whereas the solution
of the Dantzig selector occurs at the point of contact between the dotted
parallelogram and the diamond. Hence exactly zero elements of the solution can
be obtained. The dotted ellipsoid is the set of points at the same distance
(β − βols)^T X^T X(β − βols) from the ordinary least squares estimator βols, and
the dotted parallelogram is the set of points with the same value of
‖X^T X(β − βols)‖∞.
The penalized form of the LASSO and the definition of the Dantzig selector
are related: the LASSO estimate is always in the constraint set (feasible set)
of the Dantzig selector. The Karush-Kuhn-Tucker conditions for the Lagrangian
form of the LASSO are given by
$$\tfrac{1}{n}X_j^T (y - X\hat\beta) = \lambda\,\mathrm{sign}(\hat\beta_j) \ \ \text{for } |\hat\beta_j| > 0, \qquad \big|\tfrac{1}{n}X_j^T (y - X\hat\beta)\big| \le \lambda \ \ \text{for } \hat\beta_j = 0.$$
Therefore ‖(1/n)X^T(y − Xβ̂LASSO(λ))‖∞ ≤ λ, and hence ‖β̂Dantzig(λ)‖1 ≤ ‖β̂LASSO(λ)‖1.
The Dantzig selector and the LASSO share some similarities. They yield the
same solution path under suitable conditions on the design matrix. Meinshausen
et al. (2007) proved that the LASSO and the Dantzig selector share the identical
solution path under the diagonal dominance condition, which means that
M_{jj} > ∑_{i≠j} |M_{ij}| for all j = 1, . . . , p, where M = (X^T X)^{-1}. James et al.
(2009) showed the equivalence of the LASSO and the Dantzig selector under a
condition which is similar to the irrepresentable condition (IC) (Zhao and Yu,
2006). This condition is that ‖X^T X_{A(λ)} u‖∞ ≤ 1 for u = (X_{A(λ)}^T X_{A(λ)})^{-1} 1,
with a tuning parameter λ and the active set A(λ) = {j : β̂j(λ) ≠ 0}.
The Dantzig selector and the LASSO estimator can achieve the minimax
optimal error bound (Raskutti et al., 2011; Bickel et al., 2009). Raskutti et al.
(2011) showed that the minimax optimal convergence rate of the ℓ2-error is
O(√(s log p/n)) under some regularity conditions. Bickel et al. (2009) proved
similar prediction error rates for the LASSO and the Dantzig selector and their
asymptotic equivalence under the restricted eigenvalue condition.
Not only the theoretical properties, but also the algorithms for the LASSO
and the Dantzig selector are comparable. Similar to the LARS (Efron et al.,
2004), which is an efficient algorithm for the LASSO estimator giving piece-
wise linear path, the DASSO (James et al., 2009) algorithm gives a piecewise
linear solution path. These algorithms will be summarized and compared in
the Appendix.
Despite their good asymptotic properties and efficient algorithms, the LASSO
and the Dantzig selector have some limitations. First, they are biased: since the
same amount of shrinkage is enforced on all nonzero coefficients, they cannot
achieve unbiasedness. Second, the LASSO and the Dantzig selector rarely achieve
model selection consistency. Selection consistency hinges on conditions on the
correlation structure of the covariates, such as the ICs (Zhao and Yu, 2006) and
the coherence property (Candes and
Plan, 2009). Zhao and Yu (2006) proved the weak oracle property of the LASSO
estimator under the ICs. Gai et al. (2013) proved the weak oracle property of
the Dantzig selector under the modified ICs related to KKT conditions of
Dantzig selector, which are more complex than the ICs of the LASSO. Those
ICs mean that the regression coefficients of the inactive variables on s active
variables should be uniformly bounded by a constant less than or equal to one.
As we can see in the simulation results of Zhao and Yu (2006), these ICs are
too strict especially in high dimensions. Hence the LASSO and the Dantzig
selector cannot have the selection consistency in most cases.
2.2.2 Nonconvex penalized methods
Nonconvex penalized methods can be good alternatives to the ℓ1 regularized es-
timators since they have selection consistency and unbiasedness. A nonconvex
penalized least square estimator is defined as the minimizer of Qλ(β) where
Qλ(β) = ||y −Xβ||2/2n+ Pλ(|β|)
and Pλ(|β|) is a nonconvex penalty including bridge estimator (Frank and
Friedman, 1993), the SCAD (Fan and Li, 2001), and the MCP (Zhang, 2010).
The bridge penalty is defined as Pλ(β) = λ∑_{j=1}^{p} |βj|^q, 0 < q < 1.
[Figure 2.2: Penalized method. Penalty functions Pλ(β) plotted against β for the LASSO, MCP, SCAD, and bridge penalties.]
The penalty function of the SCAD is defined as
$$P_\lambda(\beta) = \sum_{j=1}^{p}\Big[\, \lambda|\beta_j|\, I(|\beta_j| < \lambda) + \Big\{ \frac{a\lambda(|\beta_j|-\lambda) - (\beta_j^2-\lambda^2)/2}{a-1} + \lambda^2 \Big\} I(\lambda \le |\beta_j| < a\lambda) + \Big\{ \frac{(a-1)\lambda^2}{2} + \lambda^2 \Big\} I(|\beta_j| \ge a\lambda) \Big].$$
Zhang (2010) proposed the MCP with
$$P_\lambda(\beta) = \sum_{j=1}^{p}\Big[ \big(\lambda|\beta_j| - \beta_j^2/2a\big)\, I(|\beta_j| \le a\lambda) + \frac{1}{2}a\lambda^2\, I(|\beta_j| > a\lambda) \Big].$$
Figure 2.2 shows those nonconvex penalty functions and the LASSO penalty.
The SCAD and the MCP estimators satisfy the good properties of a penalized
estimator, namely unbiasedness, sparsity, and continuity, as introduced by
Fan and Li (2001), while the bridge estimator lacks continuity and the LASSO
lacks unbiasedness.
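For concreteness, here is a small numerical sketch (ours, not from the thesis) of the SCAD and MCP penalty values defined above; the default values of a are only conventional illustrations (the simulations in Section 2.4 use a = 2.1 for the SCAD and a = 1.5 for the MCP).

```python
# Elementwise SCAD and MCP penalties as defined above (illustrative sketch).
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    t = np.abs(beta)
    small = lam * t                                                  # |b| < lam
    mid = (a * lam * (t - lam) - (t**2 - lam**2) / 2) / (a - 1) + lam**2
    big = (a - 1) * lam**2 / 2 + lam**2                              # |b| >= a*lam
    return np.where(t < lam, small, np.where(t < a * lam, mid, big))

def mcp_penalty(beta, lam, a=3.0):
    t = np.abs(beta)
    return np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2)
```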
For identifying unknown signal variables, nonconvex penalized methods
have received great attention recently because they can achieve the model se-
lection consistency without stringent conditions such as ICs. Instead, they need
weaker conditions on the design matrix, such as the sparse Riesz condition (Zhang
and Huang, 2008) and a positive minimum eigenvalue of the submatrix of X^T X
that uses only the signal covariates. Let the true coefficient vector β∗ = (β∗1^T, 0^T)^T
be such that the first s regression coefficients β∗1 are nonzero and the others are
zero. The oracle estimator β^{(o)} is defined as (β^{(o)T}_1, 0^T)^T, where
β^{(o)}_1 = (X_1^T X_1)^{-1} X_1^T y, X = (X_1, X_2), and X_1 and X_2 are the n × s and
n × (p − s) submatrices of X. Assume √n(β^{(o)}_1 − β∗_1) →d N_s(0, Σ). An estimator
β̂ = (β̂_1^T, β̂_2^T)^T is said to have the oracle property if
$$\lim_{n\to\infty}\Pr\big[\{j : \hat\beta_j \ne 0\} = \{1, \ldots, s\}\big] = 1 \quad \text{and} \quad \sqrt{n}(\hat\beta_1 - \beta_1^*) \overset{d}{\to} N_s(0, \Sigma).$$
Many previous works showed that various non-convex penalized methods have
the oracle property (Fan and Li, 2001; Kim et al., 2008; Huang et al., 2008a;
Zhang, 2010; Kim and Kwon, 2012).
There are three types of oracle properties - weak, global, and strong oracle
properties. The weak oracle property is that there exists a sequence of λn such
that one of the local minimizers of Qλn(β) is the oracle estimator. Fan and Li
(2001) and Kim et al. (2008) proved the weak oracle property of the SCAD
for p ≤ n and p > n, respectively. The global oracle property is that there
exists a sequence of λn such that the global minimizer β(λn) of Qλn(β) has
the oracle property. Huang et al. (2008a) proved the global oracle property of
bridge estimator and Kim et al. (2008) proved that of the SCAD for p ≤ n.
The strong oracle property means that there exists a sequence of λn such that
the unique local minimizer of Qλn(β) is the oracle estimator. The SCAD and
the MCP can obtain the oracle estimator as a unique local optimizer with
probability tending to one (Kim and Kwon, 2012; Zhang, 2010).
Computing the global solution of the nonconvex penalized methods is infea-
sible in the high dimensional setting, and optimizing nondifferentiable, nonconvex
functions is challenging. Instead, iterative algorithms are used which locally
approximate the nonconvex penalized objective by a convex function and solve
the resulting convex optimization. Local quadratic approximation (LQA) (Fan and Li, 2001), local
linear approximation (LLA) (Zou and Li, 2008), concave convex procedure
(CCCP) (Kim et al., 2008) are invented to get a nonconvex penalized likeli-
hood estimate. The LQA uses the second order approximation of the penalty
as follows:
$$[P_\lambda(|\beta_j|)]' = P'_\lambda(|\beta_j|)\,\mathrm{sign}(\beta_j) \approx \frac{P'_\lambda(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\beta_j,$$
$$P_\lambda(|\beta_j|) \approx P_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{P'_\lambda(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\big(\beta_j^2 - |\beta_j^{(0)}|^2\big) \quad \text{for } \beta_j \approx \beta_j^{(0)}.$$
The LQA algorithm updates the solution as follows until convergence:
$$\beta^{(k+1)} = \arg\min_{\beta}\Big\{ \frac{1}{2n}\|y - X\beta\|^2 + \frac{1}{2}\sum_{j=1}^{p} \frac{P'_\lambda(|\beta_j^{(k)}|)}{|\beta_j^{(k)}|}\,\beta_j^2 \Big\}.$$
To avoid numerical instability, when |β_j^{(k)}| < ε0 (a prespecified value), we set
β̂_j = 0 and delete the jth column of X from the iteration. In every iteration, the
solution is
$$\beta^{(k+1)} = \big\{X^T X + n\,\Sigma_\lambda(\beta^{(k)})\big\}^{-1} X^T y,$$
where Σλ(β^{(k)}) = diag{P'_λ(|β_1^{(k)}|)/|β_1^{(k)}|, . . . , P'_λ(|β_p^{(k)}|)/|β_p^{(k)}|} for
k = 0, 1, 2, . . ..
Since the LQA removes variables with small coefficient magnitudes, once β̂_j is
removed from the model it is permanently excluded; hence the choice of ε0
significantly affects the sparsity of the solution and the speed of convergence.
To relieve this problem, instead of removing variables, a perturbation τ0 in the
denominator can be considered:
$$\beta^{(k+1)} = \arg\min_{\beta}\Big\{ \frac{1}{2n}\|y - X\beta\|^2 + \frac{1}{2}\sum_{j=1}^{p} \frac{P'_\lambda(|\beta_j^{(k)}|)}{|\beta_j^{(k)}| + \tau_0}\,\beta_j^2 \Big\}.$$
However, τ0 plays a similar role to ε0.
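A sketch (ours) of the perturbed LQA iteration just described; p_deriv stands for any penalty derivative P'_λ, such as the SCAD or MCP derivatives given later in this section, and the initial value is only a placeholder (in high dimensions a regularized initial estimate would typically be used instead).

```python
# Perturbed LQA: iterate the ridge-type closed form with weights
# d_j = P'_lambda(|beta_j|) / (|beta_j| + tau0).
import numpy as np

def lqa(X, y, p_deriv, lam, tau0=1e-6, n_iter=50):
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # simple initial value (sketch only)
    for _ in range(n_iter):
        d = p_deriv(np.abs(beta), lam) / (np.abs(beta) + tau0)
        # closed form update: (X'X + n * diag(d))^{-1} X'y
        beta = np.linalg.solve(X.T @ X + n * np.diag(d), X.T @ y)
    return beta
```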
[Figure 2.3: LLA. Two panels, "SCAD" and "MCP", plot the penalty Pλ(β) against β together with its local linear approximation.]
The CCCP and LLA can make up for LQA’s shortcomings and they can
be easily implemented by the algorithms for LASSO. The LLA algorithm is
defined as follows. For k = 1, 2, . . ., until it converges, repeat the following
optimization problem:
$$\beta^{(k+1)} = \arg\min_{\beta}\Big\{ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p} P'_\lambda(|\beta_j^{(k)}|)\,|\beta_j| \Big\}.$$
The LLA is the first order approximation of the nonconvex penalty function.
Figure 2.3 shows the LLA of the SCAD and the MCP. The CCCP decomposes the
nonconvex penalty into a LASSO penalty and a concave penalty; the concave part
is approximated by a tight local linear function.
[Figure 2.4: CCCP. Two panels, "SCAD" and "MCP", plot the decomposition of the penalty Pλ(β) against β.]
The decompositions of the nonconvex penalty functions of the SCAD and the MCP
are represented in Figure 2.4. The CCCP algorithm iteratively minimizes
Q(β|β^{(k)}, λ) until convergence, where
$$Q(\beta\,|\,\beta^{(k)}, \lambda) = \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p}\nabla J_\lambda(|\beta_j^{(k)}|)\,\beta_j + \lambda\sum_{j=1}^{p}|\beta_j|,$$
with Pλ(|βj|) = Jλ(|βj|) + λ|βj| and ∇Jλ(t) = dJλ(t)/dt.
Since the CCCP and the LLA algorithms use the first order derivative of
the nonconvex penalty, the class of nonconvex penalties to which these
algorithms apply consists of Pλ(|t|) = Pa,λ(|t|), defined on t ∈ (−∞,∞), satisfying
(P1) Pλ(t) is increasing and concave for t ∈ [0,∞) with continuous derivative
on t ∈ (0,∞) with P ′λ(0) := P ′λ(0+) ≥ a1λ.
(P2) P ′λ(t) ≥ a1λ for t ∈ (0, a2λ)
(P3) P ′λ(t) = 0 for t > aλ > a2λ
for some positive constants a1, a2, and a.
As shown in Figure 2.5, the SCAD and the MCP satisfy the above conditions,
because the derivative of the SCAD penalty is
$$P'_\lambda(t) = \lambda\, I(t \le \lambda) + \frac{(a\lambda - t)_+}{a-1}\, I(t > \lambda), \quad \text{for some } a > 2, \text{ with } a_1 = a_2 = 1,$$
and the derivative of the MCP is
$$P'_\lambda(t) = \Big(\lambda - \frac{t}{a}\Big)_+, \quad \text{for some } a > 1, \text{ with } a_1 = 1 - a^{-1},\ a_2 = 1.$$
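These derivatives translate directly into code. The short sketch below (ours, not the thesis' implementation) is the weight function used by the LLA/CCCP and two stage updates; the defaults for a are again only conventional illustrations.

```python
# SCAD and MCP derivatives P'_lambda(t), used as adaptive weights by LLA / CCCP.
import numpy as np

def scad_deriv(t, lam, a=3.7):
    t = np.abs(t)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1) * (t > lam)

def mcp_deriv(t, lam, a=3.0):
    return np.maximum(lam - np.abs(t) / a, 0.0)
```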
However, due to the nonconvexity of the penalty, multiple local minima can
occur, and the existing algorithms for nonconvex penalized methods only
guarantee convergence to a local minimum, not to the oracle estimator.
Although under some conditions the nonconvex penalized methods yield the
oracle estimator as the unique minimizer (Kim and Kwon, 2012; Zhang, 2010),
direct computation of the global solution is infeasible in high dimensions, and
the computation is unstable, especially with respect to the tuning parameter.
[Figure 2.5: Nonconvex penalties. Left panel: the MCP and SCAD penalty functions Pλ(β) against β. Right panel: their derivatives P'λ(β).]
To deal with these difficulties, one-step algorithms with a good initial estimator
have been proposed (Zou and Li, 2008; Fan et al., 2012; Wang et al., 2013),
and we will call them two stage methods.
2.2.3 Two stage methods
Two stage methods based on the LASSO, such as the one step LLA (Zou and
Li, 2008; Fan et al., 2012) and the calibrated CCCP (Wang et al., 2013), have
been proposed to obtain the oracle estimator. The main idea of these methods is
to obtain the oracle estimator by solving the LASSO problem. Zou and Li (2008)
proved that the one step LLA algorithm can obtain the oracle estimator with a good
initial estimator. They suggested using the maximum likelihood estimator as
an initial estimator for n > p. The one step LLA estimator is defined as
$$\hat\beta(\lambda) = \arg\min_{\beta}\Big\{ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \Big\}.$$
This can be reformulated as
$$\hat\beta(\lambda)_{A^c} = \arg\min_{\beta_{A^c}}\Big\{ \|r_A - X_{A^c}\beta_{A^c}\|^2/2n + \sum_{j\in A^c} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \Big\},$$
$$\hat\beta(\lambda)_{A} = (X_A^T X_A)^{-1} X_A^T\big(y - X_{A^c}\hat\beta(\lambda)_{A^c}\big),$$
where A = A(λ) = {j : |β_j^{init}| > aλ} with a the parameter of the nonconvex
penalty, and r_A = y − X_A β̂(λ)_A.
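A sketch (ours, with illustrative function names) of this computation: variables in A carry zero penalty weight and are profiled out by projection (the equivalent residualized form also used in Section 3.2.1), while the remaining variables are fit by a weighted LASSO after folding the weights into the columns.

```python
# One step LLA via a weighted LASSO: zero-weight (unpenalized) variables in A
# are projected out, the remaining variables are rescaled so a unit-penalty
# LASSO solves the weighted problem, then A is refit by least squares.
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, weights, tol=1e-10):
    """weights[j] = P'_lambda(|beta_init_j|); weight 0 means j is unpenalized."""
    n, p = X.shape
    A = np.where(weights <= tol)[0]
    Ac = np.where(weights > tol)[0]
    beta = np.zeros(p)
    if A.size > 0:                                  # project onto span(X_A)^perp
        H = X[:, A] @ np.linalg.pinv(X[:, A])
        y_res, X_res = y - H @ y, X[:, Ac] - H @ X[:, Ac]
    else:
        y_res, X_res = y, X[:, Ac]
    if Ac.size > 0:
        Xw = X_res / weights[Ac]                    # fold weights into the columns
        fit = Lasso(alpha=1.0, fit_intercept=False).fit(Xw, y_res)
        beta[Ac] = fit.coef_ / weights[Ac]
    if A.size > 0:                                  # least squares on A given A^c
        beta[A] = np.linalg.lstsq(X[:, A], y - X[:, Ac] @ beta[Ac], rcond=None)[0]
    return beta
```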
In the first stage, a good initial estimator which is close to the true coeffi-
cient should be attained. For a regularization parameter λ such that
min{|β∗j| : β∗j ≠ 0, j = 1, . . . , p} > (a + 1)λ and ‖β∗ − βinit‖∞ < λ,
the true signal set and the signal set of the initial estimator are equivalent, i.e.,
A = A0 = {j : β∗j ≠ 0} and P'_λ(|β_j^{init}|) ≈ 0 for j ∈ A0,
and hence β̂(λ)_{A0} = (X_{A0}^T X_{A0})^{-1} X_{A0}^T (y − X_{A0^c} β̂(λ)_{A0^c}). Since estimating β̂(λ)
can be recast as estimating β̂(λ)_{A0^c} and plugging β̂(λ)_{A0^c} into the equation for
β̂(λ)_{A0}, the oracle estimator is obtained when β̂(λ)_{A0^c} = 0. Hence removing
the effect of noise variables is important in the second stage of the two stage
methods.
Fan et al. (2012) suggested the LASSO estimator with a smaller regularization
parameter (λinit ≤ a γ_{LS} s^{-1/2} λ/4) as an initial estimator, with which the
oracle estimator is obtained with high probability, where s is the number of
nonzero coefficients, a is the parameter of the nonconvex penalty, and γ_{LS} is
the restricted eigenvalue defined by
$$\gamma_{LS} = \min_{\delta \ne 0,\ \|\delta_{A_0^c}\|_1 \le 3\|\delta_{A_0}\|_1} \frac{\|X\delta\|}{\sqrt{n}\,\|\delta_{A_0}\|} > 0.$$
Wang et al. (2013) proved that the calibrated CCCP estimator using the
LASSO initial estimator with smaller regularization parameter (λinit = τλ, τ =
o(1), e.g., τ = 1/ log n or λ) can obtain the oracle estimator with high proba-
bility. They remarked that the choice of τ can be related to the number of
signal parameters and the restricted eigenvalue. For a sparse and well
behaved design matrix, τ = 1/ log n can be used. If the true model is not very
sparse (s → ∞) or the design matrix does not behave well (γLS → 0), then
τ = λ can be considered.
Tuning the regularization parameter is a crucial issue in obtaining the oracle
estimator. Wang et al. (2013) proposed the high dimensional BIC (HBIC) for
choosing the regularization parameter of the calibrated CCCP, defined by
$$\mathrm{HBIC}(\lambda) = \log\big(\|y - X\hat\beta(\lambda)\|^2/n\big) + C_n \cdot \frac{\log p}{n}\,|M_\lambda|, \qquad (2.1)$$
where Mλ = {j : β̂j(λ) ≠ 0} and Cn → ∞ (e.g., Cn = log n or log(log n)).
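As a concrete illustration (ours), (2.1) amounts to the following computation for a fitted coefficient vector.

```python
# HBIC of (2.1): log(RSS / n) + C_n * (log p / n) * (model size).
import numpy as np

def hbic(y, X, beta_hat, C_n=None):
    n, p = X.shape
    if C_n is None:
        C_n = np.log(np.log(n))                 # one of the suggested choices for C_n
    rss = np.sum((y - X @ beta_hat) ** 2)
    df = np.count_nonzero(beta_hat)             # |M_lambda|
    return np.log(rss / n) + C_n * np.log(p) / n * df
```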
Furthermore, they proved that the oracle estimator is attained with high
probability by using the tuning parameter chosen by HBIC, under some regu-
larity conditions:
$$\Pr\big(M_{\hat\lambda} = \{j : \beta^*_j \ne 0\}\big) \to 1,$$
where λ̂ = argmin_{λ ∈ {λ : |Mλ| ≤ Kn}} HBIC(λ). This criterion is an extension of the results
of Chen and Chen (2008) and Kim et al. (2012). This result of the selection
consistency can be applied to the methods which satisfy the strong oracle
property.
Corollary 1. Let HBIC(λ) be defined as in (2.1). Assume that the regularity
conditions, which are needed for an estimator β̂(λ) to be the oracle estimator
with probability tending to one, hold, and that there exists a positive constant γ
such that
$$\lim_{n\to\infty} \min_{A \ne A_0,\, |A| \le K_n} n^{-1}\big\|(I_n - H_A)X_{A_0}\beta^*_{A_0}\big\|^2 \ge \gamma,$$
where In denotes the n × n identity matrix and HA denotes the projection
matrix onto the linear space spanned by XA. If Cn → ∞, s Cn log p/n = o(1),
and K_n^2 log p log n/n = o(1), then
$$\Pr\big(M_{\hat\lambda} = \{j : \beta^*_j \ne 0\}\big) \to 1,$$
where Mλ = {j : β̂j(λ) ≠ 0} and λ̂ = argmin_{λ ∈ {λ : |Mλ| ≤ Kn}} HBIC(λ).
Corollary 1 is a generalization of Theorem 3.5 of Wang et al. (2013), and it
follows directly from the proof of that theorem.
2.3 Two Stage Dantzig Selector
2.3.1 Method
We modify the one step LLA for the nonconvex penalized estimator by replacing
the LLA with an adaptive Dantzig selector type estimator:
$$\min_{\beta}\ \sum_{j=1}^{p} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \qquad (2.2)$$
$$\text{subject to}\quad \Big|\tfrac{1}{n} X_j^T (y - X\beta)\Big| \le P'_\lambda(|\beta_j^{init}|), \quad j = 1, \ldots, p,$$
where β_j^{init} is an initial estimate and P'_λ(t) satisfies (P1)-(P3) defined in
Section 2.2. We call this estimator the two stage Dantzig selector
β̂TSDS(λ). The initial estimate β_j^{init} can be the LASSO or the Dantzig
estimate with tuning parameter λinit = λ/ log n or λ².
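Problem (2.2) is again a linear program. The following is a minimal sketch (ours, for illustration only; the thesis itself solves (2.2) by adapting the standard Dantzig selector algorithms of Section 2.3.4 and the Appendix), using the SCAD derivative with a = 2.1 as in the later simulations.

```python
# Two stage Dantzig selector (2.2) as a linear program with weights
# w_j = P'_lambda(|beta_init_j|); w_j = 0 leaves beta_j unpenalized and forces
# the j-th correlation constraint to hold with equality.
import numpy as np
from scipy.optimize import linprog

def scad_deriv(t, lam, a=2.1):
    t = np.abs(t)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1) * (t > lam)

def two_stage_dantzig(X, y, beta_init, lam, a=2.1):
    n, p = X.shape
    w = scad_deriv(beta_init, lam, a)
    G, c = X.T @ X / n, X.T @ y / n
    obj = np.concatenate([w, w])                 # sum_j w_j (u_j + v_j)
    A_ub = np.vstack([np.hstack([ G, -G]),       #  G(u - v) <= c + w
                      np.hstack([-G,  G])])      # -G(u - v) <= -c + w
    b_ub = np.concatenate([c + w, -c + w])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]
```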
2.3.2 Motivation
In order to achieve the oracle estimator, the key to the two stage method is
removing the noise variables in the second stage. For example, consider the
one step LLA with SCAD penalty Pλ and initial estimate βinit. Let
y_{n×1} = X_{n×p}β∗_{p×1} + ε_{n×1}. Consider a λ such that
min{|β∗j| : j ∈ A0} > (a + 1)λ, where A0 = {j : β∗j ≠ 0} and a is the parameter
of the SCAD penalty. Suppose that ‖βinit − β∗‖∞ < λ; then the one step LLA
estimator is defined by
$$\hat\beta(\lambda) = \arg\min_{\beta}\Big\{ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \Big\},$$
or equivalently,
$$\hat\beta(\lambda)_{A_0^c} = \arg\min_{\beta_{A_0^c}}\Big\{ \|r_{A_0} - X_{A_0^c}\beta_{A_0^c}\|^2/2n + \sum_{j\in A_0^c} \lambda\,|\beta_j| \Big\},$$
$$\hat\beta(\lambda)_{A_0} = (X_{A_0}^T X_{A_0})^{-1} X_{A_0}^T\big(y - X_{A_0^c}\hat\beta(\lambda)_{A_0^c}\big),$$
where r_{A_0} = y − X_{A_0}β̂(λ)_{A_0}. Hence the aim of the second stage is deleting the
noise variables and then the one step LLA can acquire the oracle estimator.
So we focus on deleting or lessening the effect of noise variables, and some
properties of the Dantzig selector suggest its superiority over the LASSO in this
respect.
The ℓ1 norm of the Dantzig selector is always less than or equal to that of
the LASSO estimator, because the Dantzig selector is defined as the minimizer
of ‖β‖1 subject to ‖(1/n)X^T(y − Xβ)‖∞ ≤ λ and the LASSO estimator satisfies
this constraint. Recall that the LASSO estimate β̂LS(λ) is always in the feasible
set of the Dantzig selector β̂DS(λ) because of the KKT conditions for the
Lagrangian form of the LASSO in Section 2.1. If there are no signal variables,
then the mean absolute deviation of the Dantzig selector is less than or equal to
that of the LASSO for the same regularization parameter λ:
‖β̂DS(λ) − β∗‖1 ≤ ‖β̂LS(λ) − β∗‖1. Furthermore, according to Bickel et al.
(2009), the non-asymptotic ℓq error bounds of the Dantzig selector are smaller
than those of the LASSO for 1 ≤ q ≤ 2. If the mean squared error (MSE) of the
Dantzig selector is smaller than that of the LASSO in the no-signal setting, then
the two stage methods can be improved with the Dantzig selector in terms of
MSE. Hence we conduct a simulation to check whether the MSE of the Dantzig
selector is less than that of the LASSO estimator in the no-signal setting.
We simulate whether the ℓ2 error of the Dantzig selector tends to be smaller
than that of the LASSO in a no-signal regression setting. Let
y_{100×1} = X_{100×1000}β_{1000×1} + ε_{100×1}, where β = (0, . . . , 0)^T, X ∼ N(0, Σ) with
Σij = R^{|i−j|}, and ε ∼ N(0, I). The simulation is conducted as follows:
1. For 20 tuning parameters, fit the LASSO and the Dantzig estimator and
calculate the MSE.
2. Repeat step 1 100 times and test H1 : MSE_LASSO > MSE_Dantzig.
Let Z = MSE_LASSO − MSE_Dantzig. In the case of R = 0.3, Z = 0.00093,
sd(Z) = 0.00014, and the p-value is 0. For R = 0.5, Z = 0.00122,
sd(Z) = 0.00012, and the p-value is 0. Even in the case of R = 0, Z = 0.0006 and
sd(Z) = 0.0001. Therefore we can conclude that the overall MSE of the Dantzig
selector tends to be smaller than the MSE of the LASSO for the same tuning
parameter. The
two stage method with the Dantzig selector can thus improve estimation
efficiency while retaining the global oracle property of the two stage method
based on the LASSO.
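A compact sketch (ours; the dimensions are reduced from the thesis' p = 1000 so the linear program stays small, and the λ grid is illustrative) of the no-signal comparison described above: since β∗ = 0, the per-fit MSE is simply ‖β̂‖².

```python
# No-signal comparison of the LASSO and the Dantzig selector (sketch).
import numpy as np
from scipy.optimize import linprog
from sklearn.linear_model import Lasso

def dantzig_lp(X, y, lam):
    n, p = X.shape
    G, c = X.T @ X / n, X.T @ y / n
    A = np.vstack([np.hstack([G, -G]), np.hstack([-G, G])])
    b = np.concatenate([c + lam, -c + lam])
    res = linprog(np.ones(2 * p), A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(1)
n, p, R = 100, 200, 0.3
cov = R ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
y = rng.standard_normal(n)                        # beta* = 0, eps ~ N(0, I)
for lam in np.linspace(0.05, 0.5, 10):
    mse_ls = np.sum(Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_ ** 2)
    mse_ds = np.sum(dantzig_lp(X, y, lam) ** 2)
    print(f"lam={lam:.2f}  MSE(LASSO)={mse_ls:.4f}  MSE(Dantzig)={mse_ds:.4f}")
```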
The relationship between the Dantzig selector and the LASSO can be ex-
tended to the relationship between adaptive Dantzig selector (Dicker, 2010)
and the adaptive LASSO (Zou, 2006). The adaptive LASSO is defined as
$$\min_{\beta}\ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\sum_{j=1}^{p} w_j|\beta_j|,$$
e.g., with wj = |β̂_j^{ls}|^{-1}. Similar to the adaptive LASSO, the adaptive Dantzig
selector is defined as
$$\min_{\beta}\ \sum_{j=1}^{p} w_j|\beta_j| \quad \text{subject to}\quad \Big|\tfrac{1}{n} X_j^T (y - X\beta)\Big| \le w_j\lambda, \quad j = 1, \ldots, p.$$
Its constraint can be derived from the derivative of the objective function of the
adaptive LASSO; for details, see the thesis of Dicker (2010), which also proved
that the adaptive Dantzig selector and the adaptive LASSO have the same
asymptotic properties.
The relation between the Dantzig selector and the adaptive Dantzig selector is
illustrated in Figure 2.6. The adaptive Dantzig selector can relieve the bias
problem of the Dantzig selector and gives a unique solution (Dicker, 2010). We
apply this adaptive Dantzig selector with wj = P'_λ(|β_j^{init}|)/λ to the second
stage of the two stage method. The difference between the adaptive Dantzig
selector and our proposed method is that our weight depends on the tuning
parameter λ.
[Figure 2.6: Adaptive DS.]
2.3.3 Theoretical properties
In this section we prove the global oracle property of the two stage Dantzig
selector under the following regularity conditions:
(A1) The random errors ε = (ε1, . . . , εn) are i.i.d. mean zero sub-Gaussian(σ)
with scale factor σ > 0, i.e., E[exp(tεi)] ≤ exp(σ²t²/2).
(A2) ηmin(X_{A0}^T X_{A0}) > 0, where ηmin(B) is the minimum eigenvalue of B.
(A3) The design matrix X satisfies
$$\gamma = \min_{\theta \ne 0,\ \|\theta_{A_0^c}\|_1 \le \alpha\|\theta_{A_0}\|_1} \frac{\|X\theta\|_2}{\sqrt{n}\,\|\theta_{A_0}\|_2} > 0,$$
where α = 3 for the LASSO initial and α = 1 for the Dantzig initial.
The main theorem shows that the two stage Dantzig selector with a good
initial estimator is equivalent to the oracle estimator if the oracle estimator
satisfies the constraint of the two stage Dantzig selector.
Theorem 1. Assume that min_{j∈A0}|β∗j| > (a + 1)λ, where A0 = {j : β∗j ≠ 0}. Let
Fn0 = {‖βinit − β∗‖∞ ≤ a0λ}, where a0 = min{1, a2}, and
Fn1 = {|(1/n)X_j^T(y − Xβ^{(o)})| ≤ P'_λ(|β_j^{init}|) for all j}, where β^{(o)} is the oracle
estimator. On the event Fn0 ∩ Fn1, the two stage Dantzig selector is equal to the
oracle estimator.

Proof of Theorem 1. On the event Fn0, min_{j∈A0}|β_j^{init}| > aλ because
min_{j∈A0}|β∗j| > (a + 1)λ, and hence P'_λ(|β_j^{init}|) = 0 for all j ∈ A0. Next,
P'_λ(|β_j^{init}|) > 0 for all j ∈ A_0^c, because max_{j∈A_0^c}|β_j^{init}| ≤ a0λ ≤ a2λ.
On the event Fn1, the oracle estimator is in the feasible set of the two stage
Dantzig selector. Therefore, on Fn0 ∩ Fn1, the minimizer β̂ of (2.2) must be the
oracle estimator, since P'_λ(|β_j^{init}|) = 0 for j ∈ A0 and β̂j for j ∈ A_0^c can be
zero.
The following corollaries show that the two stage Dantzig selector satisfies
the global oracle property with a LASSO or Dantzig selector initial under the
regularity conditions (A1)-(A3). Condition (A1) implies that
$$\Pr(|a^T\varepsilon| > t) \le 2\exp\Big(-\frac{t^2}{2\sigma^2\|a\|^2}\Big),$$
for t ≥ 0 and a = (a1, . . . , an)T . Condition (A2) means that the signal covari-
ates are not seriously correlated. Condition (A3) is used to guarantee
‖βinit − β∗‖∞ ≤ a0λ for the LASSO or the Dantzig selector initial estimator.
Corollary 2. Assume that regularity conditions (A1)-(A3) hold. Let the initial
estimator be the LASSO estimator β̂LS(τλ) with regularization parameter τλ.

(i) If min_{j∈A0}|β∗j|/(a + 1) > λ > (2√2σ/τ)√(M log p/n) and 16τγ^{-2}√s < a0, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{LS}(\tau\lambda), \lambda) = \beta^{(o)}\big) \ge 1 - p_0 - p_1,$$
where a0 = min{1, a2},
p0 = Pr(‖β̂LS(τλ) − β∗‖∞ > 16τλγ^{-2}√s) ≤ 2p exp(−nτ²λ²/(8Mσ²)), and
p1 = Pr(F_{n1}^c) ≤ 2(p − s)·exp(−n a_1^2 λ²/(2σ²M)), with M = max_{j∈A_0^c}‖Xj‖_2^2/n.

(ii) If nτ²λ² → ∞, log p = o(nτ²λ²), and 16τγ^{-2}√s < a0, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{LS}(\tau\lambda), \lambda) = \beta^{(o)}\big) \to 1 \quad \text{as } n \to \infty.$$

Corollary 3. Assume that regularity conditions (A1)-(A3) hold. Let the initial
estimator be the Dantzig selector β̂DS(τλ) with regularization parameter τλ.

(i) If min_{j∈A0}|β∗j|/(a + 1) > λ > (σ/τ)√(2M log p/n) and 8τγ^{-2}√s < a0, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{DS}(\tau\lambda), \lambda) = \beta^{(o)}\big) \ge 1 - p_0 - p_1,$$
where a0 = min{1, a2},
p0 = Pr(‖β̂DS(τλ) − β∗‖∞ > 8τλγ^{-2}√s) ≤ 2p exp(−nτ²λ²/(2Mσ²)), and
p1 = Pr(F_{n1}^c) ≤ 2(p − s)·exp(−n a_1^2 λ²/(2σ²M)), with M = max_{j∈A_0^c}‖Xj‖_2^2/n.

(ii) If nτ²λ² → ∞, log p = o(nτ²λ²), and 8τγ^{-2}√s < a0, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{DS}(\tau\lambda), \lambda) = \beta^{(o)}\big) \to 1 \quad \text{as } n \to \infty.$$
Proof of Corollary 2 and Corollary 3. We first prove that the oracle
estimator β^{(o)} satisfies the constraint of the two stage Dantzig selector with
probability at least 1 − 2(p − s)·exp(−n a_1^2 λ²/(2σ²M)). Denote by β_{A0} the
|A0|-length sub-vector of β containing only the A0 members of β, and let
H_{A0} = X_{A0}(X_{A0}^T X_{A0})^{-1} X_{A0}^T. Then
$$\Pr(F_{n1}^c) \le \sum_{j\in A_0^c} \Pr\Big( \Big|\tfrac{1}{n} X_j^T (I - H_{A_0})\varepsilon\Big| > P'_\lambda(|\beta_j^{init}|) \Big)
\le \sum_{j\in A_0^c} 2\exp\Big( -\frac{n P'_\lambda(|\beta_j^{init}|)^2}{2\sigma^2 M} \Big)
\le 2(p-s)\cdot\exp\Big( -\frac{n a_1^2\lambda^2}{2\sigma^2 M} \Big),$$
because of the assumption that P'_λ(t) ≥ a1λ for t ≤ a2λ, the regularity
condition (A1), and ‖(1/n)X_j^T(I − H_{A0})‖_2^2 ≤ M/n for all j ∈ A_0^c.

Given this bound on p1, we only have to prove the upper bound of the
probability p0 related to the initial estimator. We can use the results of Bickel
et al. (2009) or Negahban et al. (2012) to obtain an estimation bound for
βinit − β∗ for the LASSO and the Dantzig selector. Bickel et al. (2009) showed
the asymptotic equivalence of the LASSO and the Dantzig selector, giving
non-asymptotic ℓq error bounds under the restricted eigenvalue condition and a
normality assumption on the errors. For the Dantzig selector, with probability
1 − exp(−nτ²λ²/(2σ²M)),
$$\|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_2 \le \frac{8\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with τλ > σ√(2M log p/n). For the LASSO estimator, with probability
1 − exp(−nτ²λ²/(8σ²M)),
$$\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_2 \le \frac{16\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with τλ > σ√(8M log p/n). Corollary 2 of Negahban et al. (2012) showed that,
for the LASSO estimator, with probability 1 − exp(−nτ²λ²/(2σ²M)),
$$\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_2 \le \frac{2\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with τλ ≥ 4σ√(M log p/n). From these results, the upper bounds for
‖β̂LS(τλ) − β∗‖∞ and ‖β̂DS(τλ) − β∗‖∞ follow.
Remark. Comments on the regularity conditions on design matrix:
Define the restricted eigenvalue (RE) condition as follows. A p × p
sample covariance matrix X^T X/n satisfies the RE condition of order k with
parameters (α, γ) if
$$\frac{1}{n}\theta^T X^T X \theta \ge \gamma\|\theta\|_2^2 \quad \forall\, \theta \in C(B, \alpha),$$
where C(B, α) = {θ ∈ Rp : ‖θ_{B^c}‖1 ≤ α‖θ_B‖1}, for all subsets B ⊂ {1, . . . , p}
such that |B| = k. The RE condition is a weak and general regularity condition
for achieving the optimal ℓ2 error bound (of order √(s log p/n)) for ℓ1
regularization methods such as the LASSO (α = 3) and the Dantzig selector
(α = 1). A series of studies have been concerned with which conditions are
necessary for guaranteeing these optimal error bounds.
The restricted isometry property (RIP) (Candes and Tao, 2005) is defined
as follows. X is said to satisfy the s-restricted isometry property with restricted
isometry constant δs if there exists a constant δs such that, for every
T ⊂ {1, . . . , p} with |T| ≤ s, the n × |T| submatrix XT of X and every u ∈ R^{|T|}
satisfy
$$(1 - \delta_s)\|u\|_2^2 \le \|X_T u\|_2^2 \le (1 + \delta_s)\|u\|_2^2.$$
The (s, s′)-restricted orthogonality constant θ_{s,s′}, for s + s′ ≤ p, is defined as
the smallest quantity such that
$$|\langle X_T u, X_{T'} u'\rangle| \le \theta_{s,s'} \cdot \|u\|_2\|u'\|_2$$
for all T, T′ ⊂ {1, . . . , p} such that T ∩ T′ = ∅, |T| ≤ s, and |T′| ≤ s′. X
satisfies the uniform uncertainty principle (UUP) (Candes and Tao, 2007) if
δ_{2s} + θ_{s,2s} < 1, which means that for all s-sparse sets T the columns of the
matrix corresponding to T are almost orthogonal.
The RIP and the UUP are earlier conditions which are very restrictive. They
cover designs with independent variables drawn from Gaussian or Bernoulli
distributions (Candes and Tao, 2007), but they cannot deal with substantial
dependency. Raskutti et al. (2010) showed that a design matrix whose rows are
independently distributed as N(0, Σ) satisfies the RE condition with sample
size n = O(s log p), for a broad class of Σ. Sample covariance matrices with Σ
including Toeplitz matrices, the spiked identity model, or highly degenerate
covariance matrices satisfy the RE condition (Raskutti et al., 2010). Rudelson
and Zhou (2013) extended this to sub-Gaussian designs with substantial
dependency.
Bickel et al. (2009) showed that in the more general setting X ∼iid (0, Σ), if
φmin(s log n) > c/log n, then X^T X/n satisfies RE(α, γ) of order s, where
$$\gamma^2 = \sqrt{\phi_{\min}(s\log n)}\left(1 - c_0\sqrt{\frac{s\,\phi_{\max}(s\log n - s)}{(s\log n - s)\,\phi_{\min}(s\log n)}}\right)$$
and
$$\phi_{\min}(m) = \min_{1\le\|\theta\|_0\le m}\frac{\theta^T X^T X\theta}{\|\theta\|_2^2}, \qquad \phi_{\max}(m) = \max_{1\le\|\theta\|_0\le m}\frac{\theta^T X^T X\theta}{\|\theta\|_2^2}.$$
The condition φmin(s log n) > c/log n holds for s < √n log^{-3/2} n (Kim et al.,
2012; Greenshtein and Ritov, 2004). Hence, the RE condition can hold for a
large class of Σ even when the RIP condition is not satisfied with probability
converging to one. For more discussion of the regularity conditions, see Bickel
et al. (2009), Negahban et al. (2012), or Zhang and Zhang (2012).
2.3.4 Algorithm
The two stage Dantzig selector β̂TSDS(βinit, λ) can be computed as follows. Let
A = {j : |β_j^{init}| > aλ}; then β̂TSDS_{A^c}(βinit, λ) can be calculated by
$$\min_{\beta_{A^c}}\ \sum_{j\in A^c} P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \quad \text{subject to}\quad \Big|\tfrac{1}{n} X_j^T (I - H_A)(y - X_{A^c}\beta_{A^c})\Big| \le P'_\lambda(|\beta_j^{init}|) \ \text{for } j \in A^c,$$
and
$$\hat\beta^{TSDS}_{A}(\beta^{init}, \lambda) = (X_A^T X_A)^{-1} X_A^T\big(y - X_{A^c}\hat\beta^{TSDS}_{A^c}(\beta^{init}, \lambda)\big).$$
Similar to the LLA algorithm, set β̃ = Wβ_{A^c} and X̃ = (I − H_A)X_{A^c}W^{-1},
where W is the diagonal matrix whose entries are P'_λ(|β_j^{init}|) for j ∈ A^c. Then
the above optimization for β̂TSDS_{A^c}(βinit, λ) becomes a standard Dantzig selector,
$$\min\ \|\tilde\beta\|_1 \quad \text{subject to}\quad \Big\|\tfrac{1}{n}\tilde{X}^T (y - \tilde{X}\tilde\beta)\Big\|_\infty \le 1.$$
Hence, we can use the same algorithms for Dantzig selector such as gen-
eralized primal-dual interior point algorithm (Candes and Romberg, 2005),
Dantzig selector with sequential optimization (DASSO) (James et al., 2009),
and alternating direction method (ADM) (Lu et al., 2012). We briefly review
several popular algorithms for Dantzig selector in the Appendix.
2.3.5 Tuning regularization parameter
Recall the HBIC of Wang et al. (2013),
$$\mathrm{HBIC}(\lambda) = \log\big(\|y - X\hat\beta(\lambda)\|^2/n\big) + C_n \cdot \frac{\log p}{n}\,|M_\lambda|,$$
where Mλ = {j : β̂j(λ) ≠ 0} and Cn → ∞ (e.g., Cn = log n or log(log n)).
According to Corollary 1 in Section 2.2.3, we can use the HBIC to tune the
regularization parameter, because our proposed method attains the oracle
estimator with probability tending to one.
2.4 Numerical analyses
In this section, we investigate the performance of the proposed two stage
Dantzig selector (TSDS). Suppose there are p covariate variables
x1, . . . , xp. The goal of these numerical studies is to evaluate how well the
methods perform in variable selection and estimation accuracy. We consider the
linear regression model
regression model
y = xTβ + ε,
where x = (x1, . . . , xp)T and β ∈ Rp is a coefficient vector.
We compare the proposed TSDS with other methods including the LASSO,
the adaptive LASSO, the Dantzig selector, the adaptive Dantzig selector, the
SCAD, the MCP, and the two stage methods based on the LASSO with respect
to selection and estimation. For the SCAD and the MCP, a = 2.1 and a =
1.5 are considered, respectively. Two stage methods use the SCAD penalty
with a = 2.1. Regarding the tuning parameter, five-fold cross validation is
considered for the LASSO, the adaptive LASSO, the Dantzig selector, and
the adaptive Dantzig selector. For the SCAD, the MCP, and the two stage
methods, the high-dimensional BIC (HBIC) is used for tuning parameter. The
HBIC is defined as
HBIC = log(‖y − Xβ̂‖²/n) + log(log n) · (log p/n) · |{j : β̂j ≠ 0}|.
The LASSO and the Dantzig selector are used for the initial estimator with
tuning parameter λ/ log n in the two stage methods. The LARS algorithm is
used for the LASSO and the adaptive LASSO and the primal-dual interior
point algorithm is used for the Dantzig selector and the adaptive Dantzig
selector. The CCCP algorithm is used for the SCAD and the MCP estimator
and the calibrated CCCP (Wang et al., 2013) is used for the two stage method
based on LASSO.
2.4.1 Simulations
In this section, we consider two simulation settings. For each experimental
setting, we replicate the simulation 100 times. We simulate data from the true
model
$$y = \sum_{j=1}^{p} X_j\beta_j + \varepsilon, \qquad \varepsilon \sim N(0, 2^2),$$
where p = 1000 and the number of observations is n = 100. Covariates are
generated from the normal distribution with zero mean and the covariance of xi
and xj equal to R^{|i−j|}, i, j = 1, . . . , p. For each simulation setting, we generate
data with R = 0.3 and 0.5.
Based on 100 replications, the following statistics are measured for comparison:
the average number of falsely estimated non-zero coefficients (FP); the average
number of falsely estimated zero coefficients (FN); the proportion of replications
in which the true model is exactly identified (TM); and
MSE = ∑_{m=1}^{100} ‖β̂^{(m)} − β∗‖²/100. In the results of the two stage methods,
"LS+LS", for example, the first "LS" refers to the initial estimator and the
second "LS" refers to the two stage method based on the LASSO, while "DS+DS"
refers to the two stage method based on the Dantzig selector with a Dantzig
selector initial. The results of all four combinations of two stage methods using
the LASSO and the Dantzig selector are reported in the following tables.
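A sketch (ours) of one replication under this setting and the per-replication summaries; beta_hat stands for any fitted estimator, and the function names are illustrative.

```python
# One simulation replicate: AR(1)-correlated covariates, N(0, 2^2) noise,
# and the FP / FN / squared-error summaries reported in the tables.
import numpy as np

def generate_data(beta, n=100, R=0.3, sigma=2.0, seed=0):
    p = beta.size
    rng = np.random.default_rng(seed)
    cov = R ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y

def summarize(beta_hat, beta_true):
    fp = int(np.sum((beta_hat != 0) & (beta_true == 0)))   # false positives
    fn = int(np.sum((beta_hat == 0) & (beta_true != 0)))   # false negatives
    sq_err = float(np.sum((beta_hat - beta_true) ** 2))    # ||beta_hat - beta*||^2
    return fp, fn, sq_err
```

Averaging fp, fn, and sq_err over the 100 replications gives the FP, FN, and MSE columns of the tables below.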
Example 1. We simulate 100 data sets under the above setting with the true
coefficient vector
$$\beta = (\underbrace{3,\ 1.5,\ 0,\ 0,\ 2}_{5},\ \underbrace{0, \ldots, 0}_{p-5})^T.$$
Table 2.1: Example 1 (R=0.3)
Methods FP FN TM MSE
LASSO 25.38(9.019) 0(0) 0 1.040(0.501)
ALASSO 23.11(7.938) 0(0) 0 2.314(0.587)
Dantzig 18.48(10.323) 0(0) 0 1.044(0.530)
ADantzig 15.69(8.284) 0(0) 0 1.859(0.689)
MCP 2.12(1.719) 0.01(0.1) 0.16 0.463(0.479)
SCAD 1.36(1.382) 0.01(0.1) 0.26 0.389(0.606)
LS+LS 1.39(1.550) 0.02(0.141) 0.3 0.405(0.593)
LS+DS 1.39(1.550) 0.02(0.141) 0.3 0.404(0.594)
DS+LS 1.32(1.455) 0.02(0.141) 0.3 0.397(0.574)
DS+DS 1.31(1.461) 0.02(0.141) 0.31 0.394(0.574)
Table 2.2: Example 1 (R=0.5)
Methods FP FN TM MSE
LASSO 24.55(9.632) 0(0) 0 0.926(0.453)
ALASSO 21.83(8.263) 0(0) 0 2.290(0.648)
Dantzig 17.3(9.161) 0(0) 0.01 0.859(0.424)
ADantzig 17.17(9.135) 0(0) 0.01 0.934(0.517)
MCP 1.97(1.598) 0.03(0.171) 0.23 0.643(0.880)
SCAD 1.23(1.270) 0.04(0.197) 0.3 0.780(0.981)
LS+LS 1.27(1.370) 0.03(0.171) 0.33 0.578(0.793)
LS+DS 1.24(1.319) 0.03(0.171) 0.33 0.564(0.793)
DS+LS 1.29(1.387) 0.03(0.171) 0.32 0.555(0.794)
DS+DS 1.24(1.296) 0.03(0.171) 0.32 0.549(0.791)
Example 2. We simulate 100 data sets under the above setting with the true
coefficient vector
$$\beta = ((\underbrace{3,\ 1.5,\ 0,\ 0,\ 2}_{5},\ \underbrace{0, \ldots, 0}_{15}) \times 5,\ \underbrace{0, \ldots, 0}_{p-100})^T,$$
i.e., the block of 20 coefficients is repeated 5 times, giving 100 leading coefficients.
Table 2.3: Example 2 (R=0.3)
Methods FP FN TM MSE
LASSO 25.96(1.370) 1.03(1.283) 0 18.742(9.205)
ALASSO 20.89(4.479) 1.08(1.398) 0 11.147(8.665)
Dantzig 24.18(3.583) 1.93(1.771) 0 24.685(11.391)
ADantzig 23.9(4.135) 1.94(1.802) 0 15.555(11.128)
MCP 18(8.957) 1.95(3.468) 0.01 21.891(34.190)
SCAD 4.58(6.240) 1.28(2.958) 0.09 11.336(25.522)
LS+LS 6.89(9.331) 0.71(1.866) 0.05 7.736(9.773)
LS+DS 5.24(5.826) 0.71(1.903) 0.03 7.558(9.925)
DS+LS 6.74(8.73) 0.69(1.846) 0.06 7.648(9.972)
DS+DS 4.67(5.924) 0.52(1.573) 0.15 7.055(8.234)
Table 2.4: Example 2 (R=0.5)
Methods FP FN TM MSE
LASSO 25.32(0.827) 0.32(0.827) 0 10.676(6.274)
ALASSO 19.89(4.325) 0.34(0.831) 0 6.656(5.120)
Dantzig 24.29(1.546) 0.6(0.974) 0 14.284(7.890)
ADantzig 22.76(4.656) 0.61(0.984) 0 6.912(6.382)
MCP 4.33(4.551) 2.38(2.534) 0.09 14.508(19.203)
SCAD 3.35(3.056) 1.89(2.287) 0.05 12.936(14.131)
LS+LS 4.56(6.609) 0.57(1.358) 0.07 5.467(6.115)
LS+DS 3.52(2.921) 0.58(1.365) 0.08 5.339(6.158)
DS+LS 3.97(4.239) 0.58(1.387) 0.07 5.078(6.161)
DS+DS 3.44(4.613) 0.42(1.249) 0.14 4.902(5.746)
2.4.2 Real data analysis
We analyze the data set of Scheetz et al. (2006) containing 18,976 gene
expression levels from 120 rats. The objective of this analysis is to find the
genes correlated with the gene TRIM32, which is known to cause Bardet-Biedl
syndrome. Many previous works (Huang et al., 2008b; Kim et al., 2008; Wang
et al., 2013) analyzed this data set. Following these papers, we first select 3,000
genes with the largest variance in expression level and then pick the top 1,000
genes most correlated with TRIM32 among the selected 3,000 genes. With this
data set, we focus on the comparison between two stage methods, because the
comparison between the two stage method with LASSO and other methods was
already done by Wang et al. (2013), and assessing the improvement of TSDS
over the previous two stage methods is our main interest. The results are in
Table 2.5.
Table 2.5: Real Data (TRIM)
Methods #{j : β̂j ≠ 0} PE
LS+LS 11.37 0.813
LS+DS 10.47 0.806
DS+LS 8.74 0.857
DS+DS 8.41 0.83
2.5 Conclusion
In this chapter, we have proposed a two stage method based on the Dantzig
selector, called the two stage Dantzig selector. We proved that the two stage
Dantzig selector obtains the oracle estimator under regularity conditions. The
proposed method can be easily implemented by general algorithms for the
standard Dantzig selector.
The numerical results support our contention that the Dantzig selector used
in our method can improve variable selection and estimation through lessening
the effects of noise variables more efficiently than LASSO. Furthermore, the
numerical results show that our proposed method outperforms other sparse
regularization methods with respect to variable selection and estimation.
Chapter 3
Two Stage Methods for
Precision Matrix Estimation
3.1 Introduction
Precision matrix (inverse covariance matrix) estimation is an important prob-
lem in high dimensional statistical analysis and is useful for various applications
such as Gaussian graphical models, gene classification, optimal portfolio al-
location, and speech recognition. Under the normality assumption, suppose
X = (X1, . . . , Xp) ∼ N(µ, Σ); then the zero elements in the precision matrix
Ω = (ωij)p×p imply conditional independence of variables, that is, ωij = 0 if
and only if Xi and Xj are independent given X\{Xi, Xj}. Therefore the sup-
port of the precision matrix is related to the structure of the undirected Gaussian
graph G = (V, E) with vertex set V = {X1, . . . , Xp} and edge set E satisfying
E^c = {(i, j) : Xi ⊥⊥ Xj | X\{Xi, Xj}}, where ⊥⊥ denotes independence. In the
high dimensional setting, classical methods such as the Gaussian graphical model
and the inverse of the sample covariance matrix cannot provide a stable estimate
of the precision matrix, and additional restrictions should be imposed to obtain
stable and accurate precision matrix estimation. Hence many regularized methods for pre-
cision matrix estimation are developed based on the relationship between pre-
cision matrix and the Gaussian graphical model. There are two frameworks among
these regularized methods: the regression based approach and the maximum
likelihood approach. Meinshausen and Buhlmann (2006) introduced a penalized
neighborhood regression model with the LASSO penalty: they fitted each variable
on its neighborhood with the LASSO penalty and aggregated the results. Peng
et al. (2009) proposed a joint neighborhood LASSO selection method which si-
multaneously performs neighborhood selection for all variables. Yuan (2010)
adopted the Dantzig selector in the regression based approach and established its
convergence rate. Yuan and Lin (2007) proposed a penalized maximum likelihood
method with the LASSO penalty, and Friedman et al. (2008) introduced an efficient
algorithm called the graphical LASSO (glasso) for the penalized maximum likelihood
method with the LASSO, using a blockwise coordinate descent algorithm (Banerjee
et al., 2008). Fan et al. (2009) dealt with the bias problem of the LASSO penal-
ization and proposed new penalized likelihood methods with adaptive LASSO
and SCAD penalties; the convergence rates of non-convex penalized methods
are shown in Lam and Fan (2009). Cai et al. (2011) proposed a constrained ℓ1
minimization method called CLIME and showed its convergence rates under
various matrix norms.
Most of the existing sparse precision matrix estimators which use ℓ1
regularization, including the LASSO or the Dantzig selector, suffer from selection
inconsistency and biased estimation. Although penalized likelihood estimation
with the SCAD penalty can achieve selection consistency and an unbiased
estimator, it takes considerable time to converge to a local minimizer and it
cannot guarantee that the local minimizer is the oracle estimator. In this
chapter, we especially focus on selection and the correct recovery of the support
of the precision matrix. We propose two stage methods based on the LASSO or
the Dantzig selector which can correctly select the support of the precision
matrix with high probability under some regularity conditions.
3.2 Estimation of precision matrix via columnwise two-stage methods
Suppose (X1, . . . , Xp) are jointly generated by mean µ = (µ1, . . . , µp)′ and
covariance matrix Σ∗. It is well known (e.g., Lemma 1 of Peng et al. (2009))
that for i = 1, . . . , p, let
Xi = µi +∑j 6=i
β∗ijXj + εi
then (X1, . . . , Xi−1, Xi+1, . . . , Xp) and εi are uncorrelated if and only if β∗ij =
−ω∗ij/ω∗ii where Σ∗−1 = Ω∗ = (ω∗ij) is the precision matrix. Furthermore, with
those β∗ij, cov(εi, εj) = ω∗ij/(ω∗iiω∗jj) and var(εi) = 1/ω∗ii. Under normality as-
sumption, the forementioned uncorrelation can be replaced by independence.
This regression based approach has been applied to various methods including
Meinshausen and Buhlmann (2006), Peng et al. (2009), and Yuan (2010) by
using LASSO or Dantzig selector. We use this relationship to estimate sparse
precision matrix via two stage regression methods based on the LASSO esti-
mator such as calibrated CCCP (Wang et al., 2013) and one step LLA (Fan
et al., 2012), or based on the Dantzig selector called two stage Dantzig selector.
49
3.2.1 Two stage method based on LASSO
We first briefly introduce the one step LLA (Zou and Li, 2008; Fan et al., 2012) as a two stage method based on the LASSO, and then we apply the one step LLA to estimate the precision matrix. Consider a penalized regression problem,
$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^p P_\lambda(|\beta_j|) \right\},$$
where $y$ is the response vector, $X = (X_1, \ldots, X_p)$ is an $n \times p$ covariate matrix with $X_i = (X_{1i}, \ldots, X_{ni})^T$, $\beta = (\beta_1, \ldots, \beta_p)^T$ is the vector of regression coefficients, $\|\cdot\|$ is the $L_2$ norm, and $P_\lambda(\cdot)$ is a penalty function with tuning parameter $\lambda$. We consider a class of nonconvex penalty functions $P_\lambda = P_{\lambda,a}$ satisfying
(P1) $P_\lambda(t)$ is increasing and concave for $t \in [0, \infty)$ with a continuous derivative on $t \in (0, \infty)$, and $P'_\lambda(0) := P'_\lambda(0+) \geq a_1\lambda$;
(P2) $P'_\lambda(t) \geq a_1\lambda$ for $t \in (0, a_2\lambda)$;
(P3) $P'_\lambda(t) = 0$ for $t > a\lambda > a_2\lambda$,
for some positive constants $a_1$, $a_2$, and $a$.
The SCAD and the MCP penalties satisfy the above conditions, with $a_1 = 1$ for the SCAD and $a_1 = 1 - a^{-1}$ for the MCP. The derivative of the SCAD penalty is defined by
$$P'_\lambda(t) = \lambda\left\{ I(t \leq \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \right\}, \quad \text{for some } a > 2,$$
and the derivative of the MCP penalty is defined by
$$P'_\lambda(t) = \left(\lambda - \frac{t}{a}\right)_+, \quad \text{for some } a > 1.$$
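For concreteness, the following is a minimal numpy sketch of the two penalty derivatives above; the default values a = 3.7 (SCAD) and a = 3 (MCP) are common illustrative choices, not values prescribed in this section.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """P'_lambda(t) of the SCAD penalty for t >= 0."""
    t = np.abs(t)
    return lam * np.where(t <= lam, 1.0, np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))

def mcp_deriv(t, lam, a=3.0):
    """P'_lambda(t) of the MCP penalty for t >= 0."""
    return np.maximum(lam - np.abs(t) / a, 0.0)
```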
The one step LLA is defined as follows,
$$\hat\beta(\beta^{init}, \lambda) = \operatorname*{argmin}_{\beta}\; \sum_{i=1}^n (y_i - x_i^T\beta)^2/2n + \sum_{j=1}^p P'_\lambda(|\beta^{init}_j|)\,|\beta_j|. \qquad (3.1)$$
Then equation (3.1) can be recast as
$$\hat\beta(\beta^{init}, \lambda)_{A^c} = \operatorname*{argmin}_{\beta_{A^c}}\; \|(I - H_A)(y - X_{A^c}\beta_{A^c})\|^2/2n + \sum_{j \in A^c} P'_\lambda(|\beta^{init}_j|)\,|\beta_j|,$$
$$\hat\beta(\beta^{init}, \lambda)_{A} = (X_A^T X_A)^{-1} X_A^T \big(y - X_{A^c}\hat\beta(\beta^{init}, \lambda)_{A^c}\big),$$
where $A = A(\beta^{init}, \lambda) = \{j : |\beta^{init}_j| > a\lambda\}$ with $a$ the parameter of the nonconvex penalty, and $H_A = X_A(X_A^T X_A)^{-1}X_A^T$. Let $\tilde y = (I - H_A)y$, $\tilde X = (I - H_A)X_{A^c}W^{-1}$ and $\tilde\beta = W\beta_{A^c}$, where $W = \mathrm{diag}\big(P'_\lambda(|\beta^{init}_j|)\big)_{j \in A^c}$; then equation (3.1) can be recast as the LASSO problem with respect to $\tilde y$, $\tilde X$, $\tilde\beta$ and tuning parameter 1. Hence the algorithms for the LASSO can be used for the osLLA.
For a good initial estimate which satisfies $\|\beta^{init} - \beta^*\|_\infty < \min(a_2, 1)\cdot\lambda$, the oracle estimator $\hat\beta^{(o)}$ can be obtained via the two stage method based on the LASSO with high probability, where $\hat\beta^{(o)}_{A_0} = (X_{A_0}^T X_{A_0})^{-1}X_{A_0}^T y$ and $\hat\beta^{(o)}_{A_0^c} = 0$ with $A_0 = \{j : \beta^*_j \neq 0\}$.
51
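Before moving to precision matrices, the following is a minimal Python sketch of the one step LLA for regression through the weighted-LASSO reformulation above. It reuses scad_deriv from the earlier sketch and scikit-learn's Lasso solver; it assumes all weights on $A^c$ are strictly positive (a full implementation would move zero-weight coordinates into $A$), and the function and parameter names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, beta_init, lam, a=3.7):
    """One step LLA via the weighted-LASSO reformulation of (3.1)."""
    n, p = X.shape
    A = np.where(np.abs(beta_init) > a * lam)[0]        # unpenalized block (large initial coefficients)
    Ac = np.setdiff1d(np.arange(p), A)
    w = scad_deriv(beta_init[Ac], lam, a)                # weights P'_lambda(|beta_init_j|), j in A^c
    H = X[:, A] @ np.linalg.pinv(X[:, A]) if A.size else np.zeros((n, n))
    y_t = y - H @ y                                      # (I - H_A) y
    X_t = (X[:, Ac] - H @ X[:, Ac]) / w                  # (I - H_A) X_{A^c} W^{-1}
    # sklearn's Lasso minimizes ||y - Xb||^2 / (2n) + alpha * ||b||_1, so alpha = 1 matches the text
    fit = Lasso(alpha=1.0, fit_intercept=False).fit(X_t, y_t)
    beta = np.zeros(p)
    beta[Ac] = fit.coef_ / w                             # undo the W scaling
    if A.size:
        beta[A] = np.linalg.lstsq(X[:, A], y - X[:, Ac] @ beta[Ac], rcond=None)[0]
    return beta
```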
Now we apply this one step LLA estimator to precision matrix estimation. Let the true precision matrix be $\Omega^* = \Sigma^{*-1}$, and let $X^{(1)}, \ldots, X^{(n)}$ be independent and identically distributed samples from $N_p(\mu, \Sigma^*)$. For $i = 1, \ldots, p$, denote the $i$th column of $\Omega$ without $\Omega_{ii}$ by $\Omega_{-ii} = (\omega_{1i}, \ldots, \omega_{i-1,i}, \omega_{i+1,i}, \ldots, \omega_{pi})^T$ and
$$\beta_{(i)} = (\beta_{1(i)}, \ldots, \beta_{i-1(i)}, \beta_{i+1(i)}, \ldots, \beta_{p(i)})^T = \left(-\frac{\omega_{1i}}{\omega_{ii}}, \ldots, -\frac{\omega_{i-1,i}}{\omega_{ii}}, -\frac{\omega_{i+1,i}}{\omega_{ii}}, \ldots, -\frac{\omega_{pi}}{\omega_{ii}}\right)^T.$$
Let $Z_i = X_i - \bar X_i$, where $X_i = (X_{1i}, \ldots, X_{ni})^T$ and $\bar X_i = (\bar X_i, \ldots, \bar X_i)^T$ with $\bar X_i = \sum_{j=1}^n X^{(j)}_i / n$. Denote the sample covariance $S = Z^T Z/n$, where $Z_{n\times p} = (Z_1, \ldots, Z_p)$, and let $Z_{-i} = (Z_1, Z_2, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_p)$. For an initial estimate $\Omega^{init}$ and a vector of tuning parameters $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_p)^T$, our proposed one step LLA (osLLA) estimator $\hat\Omega^{osLLA}(\Omega^{init}, \boldsymbol\lambda) = (\hat\omega^{osLLA}_{ij}(\omega^{init}_{ij}, \lambda_j))_{1\leq i,j\leq p}$ is defined as follows. First, conduct the regression columnwise to estimate $\Omega_{-ii}$, $i = 1, \ldots, p$:
$$\tilde\Omega_{ii} = \tilde\Omega_{ii}(\Omega^{init}, \lambda_i) = 1\big/\big(\|Z_i - Z_{-i}\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i)\|^2/n\big),$$
$$\tilde\Omega_{-ii} = \tilde\Omega_{-ii}(\Omega^{init}, \lambda_i) = -\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i)\,\tilde\Omega_{ii}(\Omega^{init}, \lambda_i),$$
where
$$\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \operatorname*{argmin}_{\beta_{(i)}}\; \frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j \neq i} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|. \qquad (3.2)$$
$\tilde\Omega$ is calculated by solving the $p$ independent optimization problems defined in (3.2). Second, for symmetry, our final osLLA estimator $\hat\Omega^{osLLA}(\Omega^{init}, \boldsymbol\lambda)$ is defined by
$$\hat\omega^{osLLA}_{ij}(\omega^{init}_{ij}, \lambda) = \tilde\omega_{ij}\,I(|\tilde\omega_{ij}| \leq |\tilde\omega_{ji}|) + \tilde\omega_{ji}\,I(|\tilde\omega_{ij}| > |\tilde\omega_{ji}|).$$
The initial estimate $\Omega^{init}$ can be the CLIME estimator (Cai et al., 2011) or the graphical lasso estimator (Yuan and Lin, 2007). The columnwise LASSO or Dantzig selector with tuning parameter $\lambda^{init}_i = \lambda_i/\log n$ (Wang et al., 2013) can also be considered as an initial estimate.
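As a rough illustration of the columnwise procedure and the symmetrization step, the sketch below reuses the hypothetical one_step_lla function from the earlier sketch; the choice of initial estimator and per-column tuning parameters is left to the caller, and all names are illustrative.

```python
import numpy as np

def symmetrize_min(Omega):
    """For each (i, j), keep the entry with the smaller absolute value, as in the text."""
    keep_ij = np.abs(Omega) <= np.abs(Omega.T)
    return np.where(keep_ij, Omega, Omega.T)

def oslla_precision(X, Omega_init, lams, a=3.7):
    """Columnwise osLLA precision matrix estimate followed by symmetrization."""
    n, p = X.shape
    Z = X - X.mean(axis=0)                        # centered data, so S = Z'Z / n
    Omega = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        beta_init = -Omega_init[others, i] / Omega_init[i, i]
        beta_i = one_step_lla(Z[:, others], Z[:, i], beta_init, lams[i], a)
        omega_ii = n / np.sum((Z[:, i] - Z[:, others] @ beta_i) ** 2)
        Omega[i, i] = omega_ii
        Omega[others, i] = -omega_ii * beta_i      # off-diagonal entries of column i
    return symmetrize_min(Omega)
```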
3.2.2 Two stage Dantzig selector
The two stage Dantzig selector (TSDS) is a Dantzig selector type modification of the LLA for nonconvex penalized methods. The TSDS for regression is defined as a solution of the following problem:
$$\min_\beta\; \sum_{j=1}^p P'_\lambda(|\beta^{init}_j|)\,|\beta_j| \quad \text{subject to} \quad \left|\frac{1}{n}X_j^T(y - X\beta)\right| \leq P'_\lambda(|\beta^{init}_j|), \quad j = 1, \ldots, p,$$
where $X_j = (X_{1j}, \ldots, X_{nj})^T$ and $\beta^{init}$ is an initial estimate.
Similar to the osLLA algorithm, it can be reformulated as follows.
1. Set $\tilde\beta = W\beta_{A^c}$ and $\tilde X = X_{A^c}W^{-1}$, where $A$ and $W$ are defined as in subsection 3.2.1, and compute
$$\hat{\tilde\beta} = \operatorname*{argmin}_{\tilde\beta}\left\{\|\tilde\beta\|_1 : \|\tilde X^T(y - \tilde X\tilde\beta)\|_\infty \leq 1\right\}.$$
2. Let $\hat\beta_{A^c} = W^{-1}\hat{\tilde\beta}$ and $\hat\beta_A = (X_A^TX_A)^{-1}X_A^T(y - X_{A^c}\hat\beta_{A^c})$.
The same algorithms for the Dantzig selector, including the primal-dual interior point algorithm (Candes and Romberg, 2005), DASSO (James et al., 2009), and the alternating direction method (Lu et al., 2012), can be used as well.
We now define our TSDS estimator for precision matrix estimation. It is similar to the osLLA estimator for precision matrix estimation. Let the columnwise TSDS estimator $\hat\beta^{TSDS}_{(i)} = \hat\beta^{TSDS}_{(i)}(\Omega^{init}, \lambda_i)$ be the solution of
$$\min_{\beta_{(i)}}\; \sum_{j \neq i} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}| \quad \text{subject to} \quad \left|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\beta_{(i)})\right| \leq P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right), \quad \forall j \neq i, \qquad (3.3)$$
and set
$$\tilde\Omega_{ii} = \left(S_{ii} - 2(\hat\beta^{TSDS}_{(i)})^T S_{-i,i} + (\hat\beta^{TSDS}_{(i)})^T S_{-i,-i}\hat\beta^{TSDS}_{(i)}\right)^{-1}, \qquad \tilde\Omega_{-i,i} = -\tilde\Omega_{ii}\hat\beta^{TSDS}_{(i)}.$$
To impose symmetry on the estimated precision matrix, we set $\hat\Omega^{TSDS} = (\hat\omega^{TSDS}_{ij})_{1\leq i,j\leq p}$ such that
$$\hat\omega^{TSDS}_{ij} = \hat\omega^{TSDS}_{ji} = \tilde\omega_{ij}\,I(|\tilde\omega_{ij}| \leq |\tilde\omega_{ji}|) + \tilde\omega_{ji}\,I(|\tilde\omega_{ij}| > |\tilde\omega_{ji}|).$$
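Since each columnwise problem (3.3) is a weighted Dantzig selector, it can be written as a linear program. The following is a minimal sketch using scipy's linprog; the variable split $\beta = u - v$ with $u, v \geq 0$ and all names are illustrative, and zero weights simply turn the corresponding constraints into equalities.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_dantzig(Zi, Zmi, w):
    """min sum_j w_j |b_j|  subject to  |Z_j'(Z_i - Z_{-i} b)| / n <= w_j  for all j."""
    n, q = Zmi.shape
    A = Zmi.T @ Zmi / n                     # (1/n) Z_{-i}' Z_{-i}
    b = Zmi.T @ Zi / n                      # (1/n) Z_{-i}' Z_i
    c = np.concatenate([w, w])              # objective weights for u and v
    A_ub = np.block([[A, -A], [-A, A]])     # encodes |A(u - v) - b| <= w
    b_ub = np.concatenate([w + b, w - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")  # default bounds give u, v >= 0
    u, v = res.x[:q], res.x[q:]
    return u - v
```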
3.2.3 Theoretical results
We prove the selection consistency of our proposed estimators. First, we define the columnwise oracle estimator of the precision matrix $\hat\Omega^{(o)}$ as follows:
$$\tilde\Omega_{ii} = \left(S_{ii} - 2(\hat\beta^{(o)}_{(i)})^T S_{-i,i} + (\hat\beta^{(o)}_{(i)})^T S_{-i,-i}\hat\beta^{(o)}_{(i)}\right)^{-1}, \qquad \tilde\Omega_{-i,i} = -\tilde\Omega_{ii}\hat\beta^{(o)}_{(i)},$$
where $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$ defined by $\hat\beta^{(o)}_{A_{0i}(i)} = (Z_{A_{0i}}^TZ_{A_{0i}})^{-1}Z_{A_{0i}}^TZ_i$ and $\hat\beta^{(o)}_{j(i)} = 0$ for $j \in A_{0i}^c$, with $A_{0i} = \{j : \omega^*_{ij} \neq 0, j \neq i\}$. For symmetry of the columnwise oracle precision matrix $\hat\Omega^{(o)} = (\hat\omega^{(o)}_{ij})$, we take $\hat\omega^{(o)}_{ij} = \hat\omega^{(o)}_{ji} = \tilde\omega_{ij}\,I(|\tilde\omega_{ij}| \leq |\tilde\omega_{ji}|) + \tilde\omega_{ji}\,I(|\tilde\omega_{ij}| > |\tilde\omega_{ji}|)$.

Proposition 1. The columnwise oracle estimator $\hat\Omega^{(o)}$ is selection consistent and is an elementwise $\sqrt{n}$-consistent estimator of $\Omega^*$ for the nonzero elements.

Proof of Proposition 1. By definition, the columnwise oracle estimator is selection consistent. Since we assume that $X^{(1)}, \ldots, X^{(n)} \sim N(\mu, \Sigma^*)$ with $\Omega^* = \Sigma^{*-1}$, we have $\Omega^*_{-i,i} = -\Omega^*_{i,i}\beta^*_{(i)}$ and $\Omega^*_{i,i} = 1/\mathrm{var}(\varepsilon_i)$. For a sparse $\Omega^*$, we can assume that there exists a positive constant $d$ bounding the degree of $\Omega^*$, $\max_{i=1,\ldots,p}|A_{0i}| < d$. For each $i$,
$$\sqrt{n}\left(\hat\beta^{(o)}_{A_{0i}(i)} - \beta^*_{A_{0i}(i)}\right) \to N\!\left(0,\; \mathrm{var}(\varepsilon_i)\big(E(Z_{A_{0i}}^TZ_{A_{0i}})\big)^{-1}\right).$$
Since $\widehat{\mathrm{var}}(\varepsilon_i) = \frac{1}{n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 \to_p \mathrm{var}(\varepsilon_i)$, the continuous mapping theorem gives $\tilde\Omega_{ii} = \big(\frac{1}{n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2\big)^{-1} \to_p \Omega^*_{ii}$. Then
$$\sqrt{n}\left(\tilde\Omega_{A_{0i},i} - \Omega^*_{A_{0i},i}\right) = \sqrt{n}\left(-\tilde\Omega_{i,i}\hat\beta^{(o)}_{A_{0i}(i)} + \Omega^*_{i,i}\beta^*_{A_{0i}(i)}\right) = \sqrt{n}\left(\Omega^*_{i,i}\big(\beta^*_{A_{0i}(i)} - \hat\beta^{(o)}_{A_{0i}(i)}\big) + o_p(1)\cdot O_p(1/\sqrt{n})\right) \to N\!\left(0,\; \Omega^*_{i,i}\big(E(Z_{A_{0i}}^TZ_{A_{0i}})\big)^{-1}\right).$$
Since the columnwise oracle estimator $\hat\omega^{(o)}_{ij}$ is $\tilde\omega_{ij}$ or $\tilde\omega_{ji}$, whichever has the smaller absolute value, $\hat\omega^{(o)}_{ij}$ is also a $\sqrt{n}$-consistent estimator of $\omega^*_{ij} = \omega^*_{ji}$ for $j \in A_{0i}$.
We specify the regularity conditions.
(A1) Sparse model: $\Omega^* \in \mathcal{M}_1(L, \nu_0, d)$, where
$$\mathcal{M}_1(L, \nu_0, d) = \left\{A \succ 0 : \|A\|_1 < L,\; \nu_0^{-1} < \phi_{\min}(A) < \phi_{\max}(A) < \nu_0,\; \deg(A) < d\right\},$$
with $L, \nu_0 > 1$, $\|A\|_1 = \max_j\sum_{i=1}^p|a_{ij}|$, and $\deg(A) = \max_i\sum_j I(A_{ij} \neq 0)$.
(A2) $\eta_{\min}(Z_{A_{0i}}^TZ_{A_{0i}}) > 0$ for all $i$, where $\eta_{\min}(B)$ is the minimum eigenvalue of $B$.
56
The following theorems show that the osLLA estimator and the TSDS estimator are equivalent to the columnwise oracle estimator with high probability.

Theorem 1. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Denote the vector of tuning parameters for the columns by $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_p)^T$. For an initial estimate $\Omega^{init}$, let
$$F_{n0} = \left\{\max_{j\neq i}\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| < a_0\lambda_i,\; i = 1, \ldots, p\right\},$$
where $a_0 = \min(a_2, 1)$, and
$$F_{n1} = \bigcap_{i=1}^p\left\{\left|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\right| \leq P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right),\; \forall j \neq i\right\},$$
where $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$. Under the event $F_{n0} \cap F_{n1}$, the osLLA estimator $\hat\Omega^{osLLA}(\Omega^{init}, \boldsymbol\lambda)$ is equivalent to the columnwise oracle estimator $\hat\Omega^{(o)}$ of $\Omega^*$.
Proof of Theorem 1. We can directly apply the theoretical result of the osLLA for regression. Under the event $F_{n0}$, $P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big) = 0$ for $j \in A_{0i}$ and $P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big) \geq a_1\lambda_i$ for $j \in A_{0i}^c$. Hence for each $i$,
$$\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \operatorname*{argmin}_{\beta_{(i)}}\; \frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j \in A_{0i}^c} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|.$$
By convexity of $\|Z_i - Z_{-i}\beta_{(i)}\|^2$,
$$\|Z_i - Z_{-i}\beta_{(i)}\|^2 \geq \|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 - 2\sum_j Z_j^T\big(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\big)\big(\beta_{j(i)} - \hat\beta^{(o)}_{j(i)}\big) = \|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 - 2\sum_{j \in A_{0i}^c} Z_j^T\big(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\big)\beta_{j(i)},$$
where the second equality uses $\hat\beta^{(o)}_{j(i)} = 0$ for $j \in A_{0i}^c$ and the normal equations $Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}) = 0$ for $j \in A_{0i}$. Under the event $F_{n1}$, for each $i$,
$$\left\{\frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j \in A_{0i}^c} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|\right\} - \left\{\frac{1}{2n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 + \sum_{j \in A_{0i}^c} P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\hat\beta^{(o)}_{j(i)}|\right\}$$
$$\geq \sum_{j \in A_{0i}^c}\left\{P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right) - \frac{1}{n}Z_j^T\big(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\big)\cdot\mathrm{sign}(\beta_{j(i)})\right\}|\beta_{j(i)}| \geq 0,$$
where the last inequality follows from the definition of $F_{n1}$, since $\big|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\big| \leq P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big)$ for all $j \neq i$. Equality holds only if $\beta_{j(i)} = 0$ for all $j \in A_{0i}^c$, and the oracle estimator $\hat\beta^{(o)}_{(i)}$ is the minimizer of $\|Z_i - Z_{-i}\beta_{(i)}\|^2$ among such vectors. Hence $\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \hat\beta^{(o)}_{(i)}$ for each $i$, and then $\hat\Omega^{osLLA}(\Omega^{init}, \boldsymbol\lambda) = \hat\Omega^{(o)}$.
Theorem 2. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Denote the vector of tuning parameters for the columns by $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_p)^T$. For an initial estimate $\Omega^{init}$, let $F_{n0}$ and $F_{n1}$ be the events defined in Theorem 1, with $a_0 = \min(a_2, 1)$ and $\hat\beta^{(o)}_{(i)}$ the oracle estimator of $\beta^*_{(i)}$. Under the event $F_{n0} \cap F_{n1}$, the TSDS estimator $\hat\Omega^{TSDS}(\Omega^{init}, \boldsymbol\lambda)$ is equivalent to the columnwise oracle estimator $\hat\Omega^{(o)}$ of $\Omega^*$.

Proof of Theorem 2. Under the event $F_{n0}$, $P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big) = 0$ for $j \in A_{0i}$ and $P'_{\lambda_i}\big(\big|\omega^{init}_{ji}/\omega^{init}_{ii}\big|\big) \geq a_1\lambda_i$ for $j \in A_{0i}^c$. Under the event $F_{n1}$, $\hat\beta^{(o)}_{(i)}$ is in the feasible set of the two stage Dantzig selector. Under the event $F_{n0} \cap F_{n1}$, the minimizer $\hat\beta^{TSDS}_{(i)}$ of (3.3) must be the oracle estimator $\hat\beta^{(o)}_{(i)}$: the objective penalizes only the coordinates in $A_{0i}^c$ with positive weights, so any minimizer has $\beta_{j(i)} = 0$ for $j \in A_{0i}^c$, and for $j \in A_{0i}$ the constraints in (3.3) have zero right-hand side and force the normal equations $\frac{1}{n}Z_j^T(Z_i - Z_{A_{0i}}\beta_{A_{0i}(i)}) = 0$, whose unique solution under (A2) is the oracle least squares estimator.
Recall $A_{0i} = \{j : \omega^*_{ji} \neq 0, j \neq i\}$ and let $s_i = |A_{0i}|$. The following corollaries assert that the CLIME estimator (Cai et al., 2011) can be a good initial estimator with which the two stage methods achieve the columnwise oracle estimator. Cai et al. (2011) showed that $\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty \leq 4L\lambda^{clime}$ with probability at least $1 - \Pr\big(\max_{ij}|S_{ij} - \Sigma^*_{ij}| > \lambda^{clime}/L\big)$, where $L = \|\Omega^*\|_1 = \max_j\sum_{i=1}^p|\omega^*_{ij}|$ and $\lambda^{clime} = C_0L\sqrt{\log p/n}$. We can use a large deviation result such as Lemma 3 of Bickel and Levina (2008) and Fan et al. (2012): under the regularity condition (A1),
$$\Pr(|S_{ij} - \Sigma^*_{ij}| \geq \delta) \leq C_1\exp(-C_2n\delta^2),$$
where $C_1$ and $C_2$ depend on $\nu_0$ in the regularity condition (A1). Hence
$$\Pr\left(\max_{ij}|S_{ij} - \Sigma^*_{ij}| \geq \frac{\lambda^{clime}}{L}\right) \leq C_1\exp\left(-C_2\frac{n(\lambda^{clime})^2}{L^2}\right).$$
Lemma 1. Let $\Omega^{init}$ be an initial estimator and $\Omega^* = (\omega^*_{ij})_{1\leq i,j\leq p}$ be the true precision matrix. Define $A_{0i} = \{j : \omega^*_{ij} \neq 0, j \neq i\}$ for $i = 1, \ldots, p$, and define $a_0 = \min\{1, a_2\}$, where $a_2$ is defined in the penalty conditions (P1)-(P3). For each $i = 1, \ldots, p$, if
$$\|\Omega^{init} - \Omega^*\|_\infty < a_0\lambda_i\,|\omega^{init}_{ii}|\left(\max_{j \in A_{0i}}\left(\left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right| + 1\right)\right)^{-1},$$
then
$$\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| < a_0\lambda_i.$$
Proof of Lemma 1.
$$\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| = \left|\frac{\omega^*_{ii}\omega^{init}_{ji} - \omega^{init}_{ii}\omega^*_{ji}}{\omega^{init}_{ii}\omega^*_{ii}}\right| \leq \frac{(|\omega^*_{ii}| + |\omega^*_{ji}|)\,\|\Omega^{init} - \Omega^*\|_\infty}{|\omega^{init}_{ii}\omega^*_{ii}|} = \left(1 + \left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right|\right)\frac{\|\Omega^{init} - \Omega^*\|_\infty}{|\omega^{init}_{ii}|} < a_0\lambda_i.$$
Let $A_0 = \{(i, j) : \Omega^*_{ij} \neq 0,\, i \neq j\} = \cup_i\{(i, j) : j \in A_{0i}\}$.
Corollary 1. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold.
(1) Suppose that for $i = 1, \ldots, p$, $\min_{j \in A_{0i}}\big|\omega^*_{ji}/\omega^*_{ii}\big| > (a + 1)\lambda_i$ and
$$\lambda_i > \max\left(\frac{1}{a_0}\,\frac{1}{\omega^{init}_{ii}}\max_{j \in A_{0i}}\left(\left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right| + 1\right)\cdot 4L\lambda^{clime},\;\; \frac{2}{a_1}\sqrt{\frac{\log p}{n}\max_i\omega^{*-1}_{ii}M}\right),$$
where $a_0 = \min(1, a_2)$, $\omega^{init}_{ii} = \hat\omega^{clime}_{ii}(\lambda^{clime})$, $\lambda^{clime} = C_0L\sqrt{\log p/n}$ for some $C_0 > 0$, and $M = \max_j\|Z_j\|_2^2/n$. Then
$$\Pr\left(\hat\Omega^{osLLA}(\hat\Omega^{clime}(\lambda^{clime}), \boldsymbol\lambda) = \hat\Omega^{(o)}\right) \geq 1 - p_0 - p_1,$$
where
$$p_1 = \Pr(F_{n1}^c) \leq 2\left(p(p-1) - \sum_{j=1}^p s_j\right)\exp\left(-\frac{na_1^2\min_i\lambda_i^2}{2M\sigma^2}\right), \quad \text{with } \sigma^2 = \max_i\omega^{*-1}_{ii},$$
and
$$p_0 = \Pr\left(\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty > 4L\lambda^{clime}\right) \leq C_1\exp\left(-C_2\frac{n(\lambda^{clime})^2}{L^2}\right).$$
(2) If $n\min_i\lambda_i^2 \to \infty$, $n(\lambda^{clime})^2 \to \infty$ and $\log p = o(n\min_i\lambda_i^2)$, then
$$\Pr\left(\hat\Omega^{osLLA}(\hat\Omega^{clime}(\lambda^{clime}), \boldsymbol\lambda) = \hat\Omega^{(o)}\right) \to 1 \quad \text{as } n \to \infty.$$
Corollary 2. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Under the same conditions as in Corollary 1, the same conclusions hold for the TSDS estimator:
(1) $\Pr\big(\hat\Omega^{TSDS}(\hat\Omega^{clime}(\lambda^{clime}), \boldsymbol\lambda) = \hat\Omega^{(o)}\big) \geq 1 - p_0 - p_1$, with $p_0$ and $p_1$ as in Corollary 1.
(2) If $n\min_i\lambda_i^2 \to \infty$, $n(\lambda^{clime})^2 \to \infty$ and $\log p = o(n\min_i\lambda_i^2)$, then
$$\Pr\left(\hat\Omega^{TSDS}(\hat\Omega^{clime}(\lambda^{clime}), \boldsymbol\lambda) = \hat\Omega^{(o)}\right) \to 1 \quad \text{as } n \to \infty.$$
Proof of Corollary 1 and Corollary 2. These results follow from Theorem 1 and Theorem 2. To bound $\Pr(F_{n0}^c)$, we use
$$\Pr\left(\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty \geq 4L\lambda^{clime}\right) \leq C_1\exp\left(-C_2\frac{n(\lambda^{clime})^2}{L^2}\right)$$
together with Lemma 1. $\Pr(F_{n1}^c)$ can be bounded in a similar way as in the linear regression case. The difference between regression and precision matrix estimation is the variance of $\varepsilon_i$: in precision matrix estimation, $\mathrm{var}(\varepsilon_i) = \Sigma^*_{ii} - \Sigma^*_{i,-i}(\Sigma^*_{-i,-i})^{-1}\Sigma^*_{-i,i} = \omega^{*-1}_{ii}$, hence it is bounded. Let $\boldsymbol\varepsilon_i = (\varepsilon_{1i}, \ldots, \varepsilon_{ni})^T$ be the vector of iid samples from $N(0, \mathrm{var}(\varepsilon_i))$. Then
$$\Pr(F_{n1}^c) \leq \sum_{i=1}^p\sum_{j \in A_{0i}^c}\Pr\left(\left|\frac{1}{n}Z_j^T(I_n - H_{A_{0i}})\boldsymbol\varepsilon_i\right| > P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)\right) \leq 2\sum_{i=1}^p(p - 1 - s_i)\exp\left(-\frac{n\min_{j\in A_{0i}^c}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)^2}{2\omega^{*-1}_{ii}M}\right)$$
$$\leq 2\sum_{i=1}^p(p - 1 - s_i)\exp\left(-\frac{na_1^2\lambda_i^2}{2\omega^{*-1}_{ii}M}\right) \leq 2\left(p(p-1) - \sum_{i=1}^ps_i\right)\exp\left(-\frac{na_1^2\min_i\lambda_i^2}{2\max_i\omega^{*-1}_{ii}M}\right),$$
because
$$\Pr(|a^T\boldsymbol\varepsilon_i| > t) \leq 2\exp\left(-\frac{t^2}{2\omega^{*-1}_{ii}\|a\|_2^2}\right) \quad \forall t \geq 0$$
and
$$\left\|\frac{1}{n}Z_j^T(I_n - H_{A_{0i}})\right\|_2^2 \leq \frac{\|Z_j\|_2^2}{n^2}\,\lambda_{\max}(I_n - H_{A_{0i}}) \leq M/n \quad \forall j \in A_{0i}^c.$$
The LASSO or Dantzig selector estimate of $\Omega_{-ii}$ can also be a good initial estimate, based on the $\ell_2$ bounds of Bickel et al. (2009). Denote by $\otimes$ the Kronecker product. If the Hessian of the log-likelihood $\Gamma^*_{p^2\times p^2} = \Omega^{*-1}\otimes\Omega^{*-1}$ satisfies the incoherence (or irrepresentable) condition and some regularity conditions hold, an elementwise $\ell_\infty$ bound of the graphical lasso estimator is $\|\hat\Omega^{glasso} - \Omega^*\|_\infty = O\big(\sqrt{\log p/n}\big)$ (Ravikumar et al., 2011), where the incoherence (or irrepresentable) condition is that there exists $\alpha \in (0, 1]$ such that $\max_{e \in A^c}\|\Gamma^*_{eA}(\Gamma^*_{AA})^{-1}\|_1 \leq 1 - \alpha$ with $A = \cup_iA_{0i}$. Hence the glasso estimate $\hat\Omega^{glasso}$ can also be a good initial estimate of $\Omega^*$.
63
3.3 Numerical analyses
We conduct two simulation studies and one real data analysis. The two simulation settings are the same as in Fan et al. (2012). The real data analysis is a classification problem using linear discriminant analysis (LDA), which requires estimation of the precision matrix.
3.3.1 Simulations
We simulate $n$ independent random vectors from $N_p(0, \Sigma^*)$ with a sparse precision matrix $\Omega^* = (\Sigma^*)^{-1}$. We consider two different sparsity patterns of $\Omega^*$.
Example 1. $\Omega^*$ is a tridiagonal matrix, obtained by constructing $\Sigma^* = (\sigma^*_{ij})$ with $\sigma^*_{ij} = \exp(-|c_i - c_j|)$ for $c_1 < \cdots < c_p$, where the increments $c_p - c_{p-1}, \ldots, c_2 - c_1$ are generated independently from Unif(0.5, 1).
Example 2. $\Omega^* = UU + I$, where $U = (u_{ij})_{p\times p}$ has zero diagonals and exactly $p$ nonzero off-diagonal entries. The nonzero entries are generated by $u_{ij} = t_{ij}c_{ij}$, where the $t_{ij}$'s are generated independently from Unif(1, 2) and the $c_{ij}$'s are independent random signs with $\Pr(c_{ij} = 1) = \Pr(c_{ij} = -1) = 0.5$.
We also generate an independent validation set of sample size n to tune
each estimator. In our simulation we let n = 100 and p = 100 or p = 200.
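For reference, the following is a minimal numpy sketch of the Example 1 data generation; the function name and seed handling are illustrative, and Example 2 is omitted because its construction is not fully specified here.

```python
import numpy as np

def example1_data(n, p, seed=None):
    """Draw n samples from N_p(0, Sigma*) with Sigma*_{ij} = exp(-|c_i - c_j|),
    where the increments of c are iid Unif(0.5, 1); Omega* = inv(Sigma*) is tridiagonal."""
    rng = np.random.default_rng(seed)
    c = np.concatenate([[0.0], np.cumsum(rng.uniform(0.5, 1.0, size=p - 1))])
    Sigma = np.exp(-np.abs(c[:, None] - c[None, :]))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X, np.linalg.inv(Sigma)
```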
We compute the $\ell_1$ penalized Gaussian likelihood estimator, denoted by glasso, using the popular R package glasso (Friedman et al., 2013). CLIME (Cai et al., 2011) is computed with the R package clime (Cai et al., 2012). We use GSCAD to denote the one step SCAD penalized Gaussian likelihood estimator with the CLIME initial estimate proposed by Fan et al. (2012). These likelihood based approaches are tuned by minimizing the validation error $-\log\det\hat\Omega + \mathrm{trace}(\hat\Omega S^{val})$, where $\hat\Omega$ is the generic estimator and $S^{val}$ is the sample covariance of the validation set. Denote by MB the columnwise $\ell_1$ penalized regression proposed by Meinshausen and Buhlmann (2006). We conducted our two stage methods with the LASSO and the Dantzig selector using two different initial estimators, glasso and CLIME. Denote by clime+LS the osLLA with the CLIME initial estimator and by clime+DS the TSDS with the CLIME initial estimator. With the glasso initial, the osLLA and TSDS are denoted by glasso+LS and glasso+DS, respectively. MB and the two stage methods are tuned columnwise by minimizing the validation error $\|Z^{val}_i - Z^{val}_{-i}\hat\beta_{(i)}(\lambda_i)\|^2/n$.
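As a small illustration of the two tuning criteria just described, here is a sketch of the validation losses (names are illustrative):

```python
import numpy as np

def likelihood_val_loss(Omega, S_val):
    """Validation error  -log det(Omega) + trace(Omega S_val)  for the likelihood based methods."""
    _, logdet = np.linalg.slogdet(Omega)
    return -logdet + np.trace(Omega @ S_val)

def columnwise_val_loss(Z_val, i, beta_i):
    """Columnwise validation error  ||Z_i - Z_{-i} beta_(i)||^2 / n  for MB and the two stage methods."""
    others = [j for j in range(Z_val.shape[1]) if j != i]
    resid = Z_val[:, i] - Z_val[:, others] @ beta_i
    return np.sum(resid ** 2) / Z_val.shape[0]
```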
For each model, we generated 100 independent datasets, each consisting of $n$ training samples and $n$ validation samples. Estimation accuracy is measured by the average Frobenius norm loss $\|\hat\Omega - \Omega^*\|_F$, the average matrix $\ell_1$ norm $\|\hat\Omega - \Omega^*\|_1$, and the average spectral norm $\|\hat\Omega - \Omega^*\|_2$ over the 100 replications, where $\|A\|_F = \sqrt{\sum_{i,j}a_{ij}^2}$, $\|A\|_1 = \max_{1\leq j\leq q}\sum_{i=1}^p|a_{ij}|$, and $\|A\|_2 = \sup_{|x|\leq 1}|Ax|_2$ for a matrix $A = (a_{ij}) \in \mathbb{R}^{p\times q}$. The selection accuracy is evaluated by the average edge proportions of false positives (FP) and false negatives (FN), the sensitivity, and the specificity over the 100 replications. The average number of estimated edges is also reported. We plot the ROC curve with the average sensitivity and specificity for each method, and we plot the average Frobenius norm and spectral norm against the number of edges for each method. For the two stage methods, we consider two settings: the same tuning parameter over all columns and columnwise different tuning parameters. The simulation results are summarized in Figures 3.1-3.8 and Tables 3.1-3.8. We run the two stage methods with several glasso and CLIME initial estimates over a sequence of tuning parameters, and we use over-edged initial estimates, which have more edges than the optimal glasso or CLIME estimates selected by the validation error. We summarize the best results of our proposed methods in Figures 3.1-3.8 and Tables 3.1-3.8. The selection results of our proposed methods outperform the others. The two stage methods with the glasso initial tend to perform better than those with the CLIME initial. In Example 2, our proposed methods achieve the best finite sample performance in both estimation and selection.
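The edge-selection summaries in the tables can be computed as follows; this is a sketch under one plausible reading of the FP and FN columns (false positives as a fraction of selected edges, false negatives as a fraction of non-selected pairs), which is an assumption rather than a definition given in the text.

```python
import numpy as np

def edge_accuracy(Omega_hat, Omega_true, tol=1e-8):
    """Edge counts and selection accuracy from estimated and true precision matrices."""
    p = Omega_true.shape[0]
    iu = np.triu_indices(p, k=1)                       # each undirected edge counted once
    est = np.abs(Omega_hat[iu]) > tol
    true = np.abs(Omega_true[iu]) > tol
    tp = np.sum(est & true); fp = np.sum(est & ~true)
    fn = np.sum(~est & true); tn = np.sum(~est & ~true)
    return {"edges": int(est.sum()),
            "FP": fp / max(est.sum(), 1), "FN": fn / max((~est).sum(), 1),
            "sensitivity": tp / max(true.sum(), 1), "specificity": tn / max((~true).sum(), 1)}
```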
66
Figure 3.1: ROC curve of Example 1 (p=100, q=99). Sensitivity (×100) versus (1 − specificity) (×100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
67
Table 3.1: Example 1 (p=100, q=99)
Methods Edge FP FN Sensitivity Specificity
glasso 1025.655 0.9034 1.00E-04 0.9976 0.8089
gSCAD-osLLA 902.96 0.8839 2.00E-04 0.9917 0.8341
CLIME 586.67 0.8293 2.00E-04 0.9904 0.8993
MB 288.49 0.6604 3.00E-04 0.9852 0.9606
MB(same) 291.27 0.6618 3.00E-04 0.9867 0.9601
clime+LS 281 0.6522 4.00E-04 0.9821 0.9621
clime+DS 283.57 0.6554 4.00E-04 0.9821 0.9616
glasso+LS 134.02 0.3 0.0011 0.9441 0.9916
glasso+DS 133.35 0.2963 0.0011 0.9441 0.9918
clime+LS(same) 290.92 0.6612 3.00E-04 0.9862 0.9602
clime+DS(same) 277.24 0.6452 3.00E-04 0.9837 0.9629
glasso+LS(same) 140.96 0.3217 8.00E-04 0.9598 0.9905
glasso+DS(same) 140.36 0.3185 8.00E-04 0.9593 0.9906
68
Figure 3.2: ‖Ω − Ω∗‖ of Example 1 (p=100, q=99). Panels: (a) Frobenius norm and (b) spectral norm versus the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
69
Table 3.2: ‖Ω −Ω∗‖ of Example 1 (p=100, q=99)
Methods Edge Frob l1 l2
glasso 1025.655 6.6562 2.667 1.5839
gSCAD-osLLA 902.96 4.3703 1.8354 1.1696
CLIME 586.67 5.7808 2.0251 1.3744
MB 288.49 6.136 2.6453 1.9252
MB(same) 291.27 4.7942 1.7009 1.1407
clime+LS 281 6.5012 3.0927 2.335
clime+DS 283.57 6.5829 3.0539 2.2887
glasso+LS 134.02 4.6925 1.9878 1.4721
glasso+DS 133.35 4.6922 2.0082 1.4868
clime+LS(same) 290.92 4.9179 1.786 1.2587
clime+DS(same) 277.24 4.9048 1.748 1.2448
glasso+LS(same) 140.96 4.5504 1.8529 1.3569
glasso+DS(same) 140.36 4.5491 1.8487 1.3571
70
Figure 3.3: ROC curve of Example 1 (p=200, q=199). Sensitivity (×100) versus (1 − specificity) (×100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
71
Table 3.3: Example 1 (p=200, q=199)
Methods Edge FP FN Sensitivity Specificity
glasso 2657.33 0.925 0 0.9982 0.8752
gSCAD-osLLA 1729.86 0.8809 1.00E-04 0.9919 0.9222
CLIME 960.83 0.7932 2.00E-04 0.9818 0.9611
MB 604.36 0.6765 2.00E-04 0.9798 0.9792
MB(same) 620.22 0.6818 2.00E-04 0.9825 0.9784
clime+LS 407.61 0.5389 6.00E-04 0.9403 0.9888
clime+DS 407.12 0.5377 6.00E-04 0.9414 0.9888
glasso+LS 586.49 0.6673 2.00E-04 0.9777 0.9801
glasso+DS 608.99 0.6792 2.00E-04 0.9783 0.979
clime+LS(same) 435 0.5578 4.00E-04 0.9597 0.9876
clime+DS(same) 429.43 0.552 4.00E-04 0.9587 0.9879
glasso+LS(same) 599.01 0.6725 2.00E-04 0.9816 0.9795
glasso+DS(same) 585.45 0.6651 2.00E-04 0.9808 0.9802
72
Figure 3.4: ‖Ω − Ω∗‖ of Example 1 (p=200, q=199). Panels: (a) Frobenius norm and (b) spectral norm versus the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
73
Table 3.4: ‖Ω −Ω∗‖ of Example 1 (p=200, q=199)
Methods Edge Frob l1 l2
glasso 2657.33 10.9083 3.0856 1.7999
gSCAD-osLLA 1729.86 7.015 2.2085 1.3832
CLIME 960.83 9.4415 2.1993 1.5915
MB 604.36 10.4144 3.905 3.0842
MB(same) 620.22 7.32 1.8644 1.2588
clime+LS 407.61 10.599 4.0213 3.1616
clime+DS 407.12 10.878 4.2165 3.3455
glasso+LS 586.49 11.1297 4.5991 3.7215
glasso+DS 608.99 11.8905 4.6239 3.7287
clime+LS(same) 435 8.1239 2.2008 1.7033
clime+DS(same) 429.43 8.1794 2.2122 1.7129
glasso+LS(same) 599.01 7.2486 1.854 1.267
glasso+DS(same) 585.45 7.2326 1.8454 1.2634
74
Figure 3.5: ROC curve of Example 2 (p=100, q=59). Sensitivity (×100) versus (1 − specificity) (×100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
75
Table 3.5: Example 2 (p=100, q=59)
Methods Edge FP FN Sensitivity Specificity
glasso 1025.655 0.9034 1.00E-04 0.9976 0.8089
gSCAD-osLLA 902.96 0.8839 2.00E-04 0.9917 0.8341
CLIME 586.67 0.8293 2.00E-04 0.9904 0.8993
MB 288.49 0.6604 3.00E-04 0.9852 0.9606
MB(same) 291.27 0.6618 3.00E-04 0.9867 0.9601
clime+LS 281 0.6522 4.00E-04 0.9821 0.9621
clime+DS 283.57 0.6554 4.00E-04 0.9821 0.9616
glasso+LS 134.02 0.3 0.0011 0.9441 0.9916
glasso+DS 133.35 0.2963 0.0011 0.9441 0.9918
clime+LS(same) 290.92 0.6612 3.00E-04 0.9862 0.9602
clime+DS(same) 277.24 0.6452 3.00E-04 0.9837 0.9629
glasso+LS(same) 140.96 0.3217 8.00E-04 0.9598 0.9905
glasso+DS(same) 140.36 0.3185 8.00E-04 0.9593 0.9906
76
Figure 3.6: ‖Ω − Ω∗‖ of Example 2 (p=100, q=59). Panels: (a) Frobenius norm and (b) spectral norm versus the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
77
Table 3.6: ‖Ω −Ω∗‖ of Example 2 (p=100, q=59)
Methods Edge Frob l1 l2
glasso 1025.655 6.6562 2.667 1.5839
gSCAD-osLLA 902.96 4.3703 1.8354 1.1696
CLIME 586.67 5.7808 2.0251 1.3744
MB 288.49 6.136 2.6453 1.9252
MB(same) 291.27 4.7942 1.7009 1.1407
clime+LS 281 6.5012 3.0927 2.335
clime+DS 283.57 6.5829 3.0539 2.2887
glasso+LS 134.02 4.6925 1.9878 1.4721
glasso+DS 133.35 4.6922 2.0082 1.4868
clime+LS(same) 290.92 4.9179 1.786 1.2587
clime+DS(same) 277.24 4.9048 1.748 1.2448
glasso+LS(same) 140.96 4.5504 1.8529 1.3569
glasso+DS(same) 140.36 4.5491 1.8487 1.3571
78
Figure 3.7: ROC curve of Example 2 (p=200, q=92). Sensitivity (×100) versus (1 − specificity) (×100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
79
Table 3.7: Example 2 (p=200, q=92)
Methods Edge FP FN Sensitivity Specificity
glasso 2263.27 0.968 0.0011 0.7851 0.8894
gSCAD-osLLA 536.81 0.8786 0.0016 0.6573 0.976
CLIME 242.06 0.7012 0.0011 0.7698 0.9914
MB 272.85 0.732 0.001 0.7914 0.9899
MB(same) 378.18 0.878 0.0024 0.4965 0.9832
clime+LS 102 0.2251 7.00E-04 0.8561 0.9988
clime+DS 102 0.2252 7.00E-04 0.8561 0.9988
glasso+LS 104.46 0.2115 5.00E-04 0.8932 0.9989
glasso+DS 104.38 0.2109 5.00E-04 0.8932 0.9989
clime+LS(same) 117.73 0.389 0.0011 0.7559 0.9976
clime+DS(same) 117.74 0.389 0.0011 0.7559 0.9976
glasso+LS(same) 87.02 0.1569 0.001 0.7924 0.9993
glasso+DS(same) 86.93 0.1563 0.001 0.7921 0.9993
80
Figure 3.8: ‖Ω − Ω∗‖ of Example 2 (p=200, q=92). Panels: (a) Frobenius norm and (b) l2 (spectral) norm versus the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.
81
Table 3.8: ‖Ω −Ω∗‖ of Example 2 (p=200, q=92)
Methods Edge Frob l1 l2
glasso 2263.27 37.8666 13.3819 9.5821
gSCAD-osLLA 536.81 21.8945 9.7219 7.0486
CLIME 242.06 28.311 11.4089 8.0062
MB 272.85 22.7018 8.2783 5.3505
MB(same) 378.18 30.7944 12.2013 8.1922
clime+LS 102 17.8328 9.0719 6.2429
clime+DS 102 17.8365 9.078 6.2441
glasso+LS 104.46 16.2034 8.0195 5.7519
glasso+DS 104.38 16.2091 8.0138 5.7561
clime+LS(same) 117.73 20.4286 9.2521 6.092
clime+DS(same) 117.74 20.4287 9.2521 6.092
glasso+LS(same) 87.02 18.1129 8.0463 5.6268
glasso+DS(same) 86.93 18.117 8.0515 5.633
82
3.3.2 Real data analysis
We apply our two stage methods to the breast cancer data set which was analyzed by Hess et al. (2006) and is available at http://bioinformatics.mdanderson.org/. This dataset is also used in previous studies (Fan et al., 2009; Cai et al., 2011). The aim of this analysis is to compare the results of linear discriminant analysis (LDA) based on several regularization methods for sparse precision matrix estimation. The dataset contains 22,283 gene expression levels for 133 patients, 34 of whom achieved pathological complete response (pCR) while the others did not achieve pCR (a.k.a. residual disease (RD)). Since pCR after neoadjuvant (preoperative) chemotherapy indicates a high possibility of cancer free survival, it is of substantial interest to predict whether or not a patient will achieve pCR. In this study, LDA is utilized to classify a patient as pCR or RD. The precision matrix must be estimated before applying LDA. Fan et al. (2009) used penalized loglikelihood methods with the LASSO, adaptive LASSO, and SCAD penalties to estimate the precision matrix, and Cai et al. (2011) used the CLIME estimate of the precision matrix. We follow the same framework used by Fan et al. (2009) and Cai et al. (2011).
First, we randomly divide the dataset into training and testing sets. To maintain the proportions of pCR and RD each time, we use stratified sampling which randomly selects five subjects from pCR and 16 from RD to form the testing set; the remaining subjects are used as the training set. For each training set, we conduct a two-sample t-test between the two groups for each gene, and select the 113 most significant genes (i.e., those with the smallest p-values) as the covariates for prediction. Because the size of the training sample is n = 112, the covariates with size p = 113 allow us to examine the performance when p > n. Second, we standardize each gene level in these datasets by dividing by the corresponding estimated standard deviation from the training set. Finally, we conduct precision matrix estimation for each regularization method and apply it to LDA. According to the LDA framework, we assume that the normalized gene expression data are normally distributed as $N(\mu_k, \Sigma)$, where the two groups are assumed to have the same covariance matrix, $\Sigma$, but different means, $\mu_k$, with k = 1 for pCR and k = 2 for RD.
The LDA scores based on the estimated precision matrix $\hat\Omega$ are as follows,
$$\hat\delta_k(x) = x^T\hat\Omega\hat\mu_k - \frac{1}{2}\hat\mu_k^T\hat\Omega\hat\mu_k + \log\hat\pi_k,$$
where $\hat\pi_k = n_k/n$ is the proportion of subjects in group $k$ in the training set and $\hat\mu_k = \frac{1}{n_k}\sum_{i \in \text{group }k}x_i$ is the within-group mean vector in the training set. The classification rule is given by $\hat k(x) = \operatorname{argmax}_k\hat\delta_k(x)$ for $k = 1, 2$. To evaluate the classification performance based on the precision matrix estimates, we use the specificity, sensitivity, and Matthews correlation coefficient (MCC) criteria, defined as follows:
$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}, \qquad \text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}},$$
$$\text{MCC} = \frac{\text{TP}\times\text{TN} - \text{FP}\times\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}},$$
where TP, TN, FP, and FN are the numbers of true positives (pCR), true negatives (RD), false positives, and false negatives, respectively. We also compare the numbers of nonzero precision matrix elements among the same methods considered in the simulations, with the same tuning strategy. The results are reported in Table 3.9. The proposed two stage methods yield very sparse precision matrices while performing as well as or similarly to the other methods.
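Given any of the precision matrix estimates above, the LDA classification step can be sketched as follows; the function and argument names are illustrative.

```python
import numpy as np

def lda_predict(X_test, Omega, mus, pis):
    """Scores delta_k(x) = x' Omega mu_k - mu_k' Omega mu_k / 2 + log pi_k, then argmax over k."""
    scores = np.column_stack([
        X_test @ Omega @ mu - 0.5 * mu @ Omega @ mu + np.log(pi)
        for mu, pi in zip(mus, pis)
    ])
    return scores.argmax(axis=1)            # predicted group index for each test sample
```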
85
Table 3.9: Real Data (Breast Cancer)
Methods SP SN MCC #Edge
glasso 0.876(0.066) 0.404(0.186) 0.307(0.229) 1066.87(31.054)
gSCAD-osLLA 0.784(0.077) 0.682(0.201) 0.428(0.211) 731.37(50.934)
CLIME 0.737(0.068) 0.782(0.173) 0.457(0.18) 2282.18(371.52)
MB 0.677(0.074) 0.824(0.164) 0.433(0.169) 289.06(25.678)
glasso+DS 0.674(0.087) 0.824(0.161) 0.431(0.176) 221.50(22.474)
glasso+LS 0.674(0.088) 0.820(0.164) 0.428(0.179) 224.48(21.468)
clime+DS 0.666(0.072) 0.824(0.169) 0.422(0.17) 260.37(19.059)
clime+LS 0.669(0.077) 0.822(0.168) 0.424(0.165) 333.09(22.069)
86
3.4 Conclusion
In this paper, we focus on selection and the correct recovery of the support of the precision matrix. We propose a regression based approach which applies the two stage methods based on the LASSO or the Dantzig selector to columnwise estimation of the precision matrix. We prove that the proposed methods can correctly recover the support of the precision matrix and obtain a $\sqrt{n}$-consistent estimator of its nonzero elements with high probability under some regularity conditions. Numerical results show that our proposed methods outperform existing regularization methods, including glasso, gSCAD, and CLIME, in terms of estimation and especially support recovery of the precision matrix.
87
Chapter 4
Concluding remarks
In this thesis, we propose a two stage method based on the Dantzig selector, called the two stage Dantzig selector, for the high dimensional regression model. We prove that the two stage Dantzig selector satisfies the strong oracle property. Numerical results support our contention that the Dantzig selector used in our proposed method can improve variable selection and estimation compared with the LASSO. Furthermore, the two stage Dantzig selector outperforms other regularization methods including the LASSO, the Dantzig selector, the SCAD, and the MCP.
We also apply the two stage methods based on the LASSO or the Dantzig selector to sparse precision matrix estimation. We prove that the proposed methods can correctly recover the support of the precision matrix and obtain a $\sqrt{n}$-consistent estimator of its nonzero elements. For estimation of a sparse precision matrix, the two stage methods perform well in estimation and especially in support recovery.
89
Chapter 5
Appendix
5.1 Algorithms for Dantzig selector
There are several algorithms for the Dantzig selector. Recall the definition of the Dantzig selector,
$$\min_\beta\|\beta\|_1 \quad \text{subject to} \quad \left\|\frac{1}{n}X^T(y - X\beta)\right\|_\infty \leq \lambda. \qquad (5.1)$$
A standard way to solve (5.1) is to use linear programming (LP) techniques, because (5.1) can be recast as an LP problem. Candes and Romberg (2005) provided the $\ell_1$-magic package, which applies a primal-dual interior point method, one of the LP techniques, to the reformulated LP problem. This algorithm is known to be efficient when $X$ is sparse or can be efficiently transformed into a diagonal matrix, but it can be inefficient for large-scale problems because of the Newton step at each iteration (Wang and Yuan, 2012).
There are homotopy methods to compute the entire solution path of the Dantzig selector (Romberg, 2008; James et al., 2009), but they are also problematic for high dimensional data (Becker et al., 2011). As an effort to solve (5.1) efficiently in large-scale problems, first-order methods have been proposed (Lu, 2012; Becker et al., 2011). Lu et al. (2012) applied the alternating direction method (ADM), which has been widely used to solve large-scale problems, to (5.1), and its variations have been developed for large scale problems (Wang and Yuan, 2012).
We go into three representative algorithms for the Dantzig selector: the primal-dual interior point method (Candes and Romberg, 2005), the Dantzig selector with sequential optimization (DASSO) (James et al., 2009), and the alternating direction method (ADM) (Lu et al., 2012). We abstract the main algorithms for the Dantzig selector from these three papers.
5.1.1 Primal-dual interior point algorithm (Candes and Romberg, 2005)
The Dantzig selector can be recast as a linear program (LP). An LP is an optimization problem with a linear objective function and linear equality or inequality constraints. There are many solvers for LPs, such as the simplex method, the barrier method, and the primal-dual interior point method. Candes and Romberg (2005) introduced a primal-dual interior point method for LPs and second-order cone programs (SOCPs). Here we extract the algorithm for the Dantzig selector from Candes and Romberg (2005).
An equivalent linear program to (5.1) is given by
$$\min_{\beta, u}\;\sum_i u_i \quad \text{subject to} \quad \beta - u \leq 0, \quad -\beta - u \leq 0, \quad X^Tr - \lambda\mathbf{1} \leq 0, \quad -X^Tr - \lambda\mathbf{1} \leq 0,$$
where $r = X\beta - y$. Taking
$$f_{u_1} = \beta - u, \quad f_{u_2} = -\beta - u, \quad f_{\lambda_1} = X^Tr - \lambda\mathbf{1}, \quad f_{\lambda_2} = -X^Tr - \lambda\mathbf{1},$$
and $f = (f_{u_1}, f_{u_2}, f_{\lambda_1}, f_{\lambda_2})^T$, at the optimal point $(\beta^*, u^*)$ there exist dual vectors $\gamma^* = (\gamma^*_{u_1}, \gamma^*_{u_2}, \gamma^*_{\lambda_1}, \gamma^*_{\lambda_2})^T$, $\gamma^* \geq 0$, such that the following Karush-Kuhn-Tucker conditions are satisfied:
$$\gamma^*_{u_1} - \gamma^*_{u_2} + X^TX(\gamma^*_{\lambda_1} - \gamma^*_{\lambda_2}) = 0, \qquad \mathbf{1} - \gamma^*_{u_1} - \gamma^*_{u_2} = 0,$$
$$f_{u_1} \leq 0, \quad f_{u_2} \leq 0, \quad f_{\lambda_1} \leq 0, \quad f_{\lambda_2} \leq 0,$$
$$\gamma_{u_1,i}f_{u_1,i} = 0, \quad \gamma_{u_2,i}f_{u_2,i} = 0, \quad \gamma_{\lambda_1,i}f_{\lambda_1,i} = 0, \quad \gamma_{\lambda_2,i}f_{\lambda_2,i} = 0, \quad i = 1, \ldots, p.$$
92
The complementary slackness condition $\gamma_if_i = 0$ is relaxed in practice to
$$\gamma^{(k)}_if_i(\beta^{(k)}, u^{(k)}) = -1/\tau^{(k)}, \qquad (5.2)$$
with $\tau^{(k)}$ increasing through the iterations. The relaxed KKT conditions replace the complementary slackness condition with (5.2). The optimal solution $\beta^*$ of the primal-dual algorithm satisfies the relaxed KKT conditions along with the optimal dual vectors $\gamma^*$. The solution is obtained through the classical Newton method constrained to the interior region ($f_i(\beta^{(k)}, u^{(k)}) < 0$, $\gamma^{(k)}_i > 0$).
The dual and central residuals quantify how close a point $(\beta, u; \gamma_{u_1}, \gamma_{u_2}, \gamma_{\lambda_1}, \gamma_{\lambda_2})$ is to satisfying the KKT conditions with (5.2) in place of the slackness condition:
$$r_{dual} = \begin{pmatrix}\gamma_{u_1} - \gamma_{u_2} + X^TX(\gamma_{\lambda_1} - \gamma_{\lambda_2})\\ \mathbf{1} - \gamma_{u_1} - \gamma_{u_2}\end{pmatrix}, \qquad r_{cent} = -\Gamma f - (1/\tau)\mathbf{1},$$
where $\Gamma$ is a diagonal matrix with $(\Gamma)_{ii} = \gamma_i$. The Newton step is the solution to
$$\begin{pmatrix}X^TX\Sigma_aX^TX + \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12} & \Sigma_{11}\end{pmatrix}\begin{pmatrix}\Delta\beta\\ \Delta u\end{pmatrix} = \begin{pmatrix}-(1/\tau)\cdot\big(X^TX(-f_{\lambda_1}^{-1} + f_{\lambda_2}^{-1})\big) - f_{u_1}^{-1} + f_{u_2}^{-1}\\ -\mathbf{1} - (1/\tau)\cdot(f_{u_1}^{-1} + f_{u_2}^{-1})\end{pmatrix} =: \begin{pmatrix}w_1\\ w_2\end{pmatrix},$$
where
$$\Sigma_{11} = -\Gamma_{u_1}F_{u_1}^{-1} - \Gamma_{u_2}F_{u_2}^{-1}, \qquad \Sigma_{12} = \Gamma_{u_1}F_{u_1}^{-1} - \Gamma_{u_2}F_{u_2}^{-1}, \qquad \Sigma_a = -\Gamma_{\lambda_1}F_{\lambda_1}^{-1} - \Gamma_{\lambda_2}F_{\lambda_2}^{-1}.$$
Setting
$$\Sigma_\beta = \Sigma_{11} - \Sigma_{12}^2\Sigma_{11}^{-1},$$
we can eliminate
$$\Delta u = \Sigma_{11}^{-1}(w_2 - \Sigma_{12}\Delta\beta)$$
and solve
$$(X^TX\Sigma_aX^TX + \Sigma_\beta)\Delta\beta = w_1 - \Sigma_{12}\Sigma_{11}^{-1}w_2$$
for $\Delta\beta$. As before, the system is symmetric positive definite, and the conjugate gradient (CG) algorithm can be used to solve it.
Given $\Delta\beta$ and $\Delta u$, the step directions for the inequality dual variables are given by
$$\Delta\gamma_{u_1} = -\Gamma_{u_1}F_{u_1}^{-1}(\Delta\beta - \Delta u) - \gamma_{u_1} - (1/\tau)f_{u_1}^{-1},$$
$$\Delta\gamma_{u_2} = -\Gamma_{u_2}F_{u_2}^{-1}(-\Delta\beta - \Delta u) - \gamma_{u_2} - (1/\tau)f_{u_2}^{-1},$$
$$\Delta\gamma_{\lambda_1} = -\Gamma_{\lambda_1}F_{\lambda_1}^{-1}(X^TX\Delta\beta) - \gamma_{\lambda_1} - (1/\tau)f_{\lambda_1}^{-1},$$
$$\Delta\gamma_{\lambda_2} = -\Gamma_{\lambda_2}F_{\lambda_2}^{-1}(-X^TX\Delta\beta) - \gamma_{\lambda_2} - (1/\tau)f_{\lambda_2}^{-1},$$
where $F$ is a diagonal matrix with $(F)_{ii} = f_i$. With $(\Delta\beta, \Delta u, \Delta\gamma)$ we have a step direction. To choose the step length $0 < s \leq 1$, we ask that it satisfy two criteria:
1. $\beta + s\Delta\beta$, $u + s\Delta u$ and $\gamma + s\Delta\gamma$ are in the interior, i.e. $f_i(\beta + s\Delta\beta, u + s\Delta u) < 0$ and $\gamma_i > 0$ for all $i$.
2. The norm of the residuals has decreased sufficiently:
$$\|r_\tau(\beta + s\Delta\beta, u + s\Delta u, \gamma + s\Delta\gamma)\|_2 \leq (1 - \alpha s)\cdot\|r_\tau(\beta, u, \gamma)\|_2,$$
where $\alpha$ is a user-specified parameter (in all of our implementations, we have set $\alpha = 0.01$).
Since the $f_i$ are linear functionals, item 1 is easily addressed. We choose the maximum step size that just keeps us in the interior. Let
$$\mathcal{I}^+_f = \{i : \langle c_i, \Delta z\rangle > 0\}, \qquad \mathcal{I}^-_\gamma = \{i : \Delta\gamma_i < 0\},$$
where $z = (\beta, u)^T$ and $f_i = \langle c_i, z\rangle$, and set
$$s_{\max} = 0.99\cdot\min\big\{1,\; \{-f_i(z)/\langle c_i, \Delta z\rangle : i \in \mathcal{I}^+_f\},\; \{-\gamma_i/\Delta\gamma_i : i \in \mathcal{I}^-_\gamma\}\big\}.$$
Then, starting with $s = s_{\max}$, we check if item 2 above is satisfied; if not, we set $s' = \nu\cdot s$ and try again. We have taken $\nu = 1/2$ in all of our implementations.
When $r_{dual}$ is small, the surrogate duality gap $\eta = -f^T\gamma$ is an approximation to how close a certain $(\beta, u, \gamma)$ is to being optimal (i.e. $\langle c_0, z\rangle - \langle c_0, z^*\rangle \approx \eta$), where $\sum_iu_i = \langle c_0, z\rangle$. The primal-dual algorithm repeats the Newton iterations described above until $\eta$ has decreased below a given tolerance.
5.1.2 DASSO (James et al., 2009)
James et al. (2009) proposed a homotopy algorithm for the Dantzig selector named the Dantzig selector with sequential optimization (DASSO). DASSO constructs a piecewise linear solution path while identifying break points and solving the corresponding linear programs. DASSO is similar to the least angle regression and selection (LARS) algorithm (Efron et al., 2004), which is an efficient algorithm for the LASSO, hence its computational cost is comparable to that of LARS. We first describe the LARS algorithm and then go into the details of DASSO. The LARS algorithm is defined as follows.
LARS (Efron et al., 2004)
1. Initialize: $\beta = 0$, $A = \operatorname{argmax}_j|\nabla L(\beta)|_j$, $\gamma_A = -\mathrm{sgn}(\nabla L(\beta))_A$, $\gamma_{A^c} = 0$, where $L(\beta) = \sum_i(y_i - x_i^T\beta)^2$.
2. While ($\max|\nabla L(\beta)| > 0$):
(a) $d_1 = \min\{d > 0 : |\nabla L(\beta + d\gamma)_j| = |\nabla L(\beta + d\gamma)_A|,\; j \notin A\}$ and $d_2 = \min\{d > 0 : (\beta + d\gamma)_j = 0,\; j \in A\}$. Find the step length $d = \min(d_1, d_2)$.
(b) Take the step: $\beta \leftarrow \beta + d\gamma$.
(c) If $d = d_1$ then add the variable attaining equality at $d$ to $A$. If $d = d_2$ then remove the variable attaining 0 at $d$ from $A$.
(d) Calculate the new direction: $\gamma_A = (X_A^TX_A)^{-1}\mathrm{sgn}(\beta_A)$ and $\gamma_{A^c} = 0$.
The LARS procedure starts with all coefficients equal to zero and selects the variable most correlated with the response. LARS proceeds in the direction of this variable until some other variable has as much correlation with the current residual, and then adds this new variable to the set of selected variables. The direction is taken to be equiangular among the selected variables and changes whenever an addition or deletion happens. An addition occurs when the correlation of another variable with the current residual becomes equal to that of the selected variables. A deletion happens when one of the coefficients of the selected variables becomes zero while the LARS procedure continues along the current direction.
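As a quick illustration of the piecewise linear path that LARS produces, the following sketch uses scikit-learn's lars_path on simulated data; the data-generating choices are illustrative only.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.standard_normal(100)

# method="lar" gives the plain LARS path (no lasso modification)
alphas, active, coefs = lars_path(X, y, method="lar")
print(coefs.shape)   # (p, number of breakpoints): coefficients at each breakpoint of the path
```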
The DASSO algorithm also solves (5.1) sequentially by constructing a piecewise linear solution path. DASSO is defined as follows.
DASSO (James et al., 2009)
1. Initialize: $l = 1$, $\beta^l = 0$, $A = \operatorname{argmax}_j|X_j^T(y - X\beta^l)|$, $B = \{j : \beta^l_j \neq 0\} = \emptyset$, $c = X^T(y - X\beta^l)$, $\gamma_A = -\mathrm{sgn}(c_A)$, $\gamma_{A^c} = 0$, $s_A = \mathrm{sgn}(c_A)$.
2. While ($\max_j|X_j^T(y - X\beta^l)| > 0$):
(a) $d_1 = \min\{d > 0 : |X_j^T(y - X(\beta^l + d\gamma))| = |X_A^T(y - X(\beta^l + d\gamma))|,\; j \notin A\}$ and $d_2 = \min\{d > 0 : (\beta^l + d\gamma)_j = 0,\; j \in A\}$. Find the step length $d = \min(d_1, d_2)$.
(b) If $d = d_1$ then add the variable attaining equality at $d$ to $A$ and add the variable $j^*$ to $B$. If $d = d_2$ then remove the variable attaining 0 at $d$ from either $A$ or $B$.
(c) Calculate the new direction: $\gamma_A = (X_A^TX_B)^{-1}s_A$ and $\gamma_{A^c} = 0$.
(d) Take the step: $\beta^{l+1} \leftarrow \beta^l + d\gamma$.
(e) $l \leftarrow l + 1$.
98
The added variable $j^*$ and the distance are defined as follows.
• Define the added variable. Let $A$ be the $|A| \times (2p + |A|)$ matrix $A = (-s_AX_A^TX \;\; s_AX_A^TX \;\; I)$, and let $A_j = \begin{pmatrix}A_{j1}\\ A_{j2}\end{pmatrix}$ be the $j$th column of $A$, where $A_{j2}$ is a scalar. Let $B = \begin{pmatrix}B_1\\ B_2\end{pmatrix}$ consist of the columns of $A$ corresponding to the non-zero components of $\beta^+$ and $\beta^-$, where $B_1$ is a square matrix of dimension $|A| - 1$. Define
$$j^* = \operatorname*{argmax}_{j:\, q_j \neq 0,\; \alpha/q_j > 0}\;\big(\mathbf{1}^TB_1^{-1}A_{j1} - 1_{\{j \leq 2p\}}\big)\,|q_j|^{-1},$$
where $q_j = A_{j2} - B_2B_1^{-1}A_{j1}$ and $\alpha = B_2B_1^{-1}\mathbf{1} - 1$.
• Define the distance. Let the distance be $d = \min\{d_1, d_2\}$, where
$$d_1 = \min_{j \notin A}\left\{\frac{c_k - c_j}{(X_k - X_j)^TX\gamma},\; \frac{c_k + c_j}{(X_k + X_j)^TX\gamma}\right\}_+ \quad \text{for } k \in A, \qquad d_2 = \min_{j \in B}\left\{-\frac{\beta_j}{\gamma_j}\right\}.$$
This rule for adding a variable comes from the piecewise linearity of the solution path and the definition of the Dantzig selector; for more detail, see the appendix of James et al. (2009). The distance step is the same as in the LARS algorithm.
99
5.1.3 Alternating direction method (ADM) (Lu et al., 2012)
The ADM has recently been widely used to solve large-scale problems. The general problem to which the ADM applies has the following form,
$$\min_{x,y}\; f(x) + g(y) \quad \text{subject to} \quad Ax + By = b, \; x \in C_1, \; y \in C_2, \qquad (5.3)$$
where $f$ and $g$ are convex functions, $A$ and $B$ are matrices, $b$ is a vector, and $C_1$ and $C_2$ are closed convex sets. The ADM consists of two subproblems and a multiplier update. Under mild assumptions, it is known that the ADM converges to an optimal solution of (5.3) (Bertsekas and Tsitsiklis, 1989). The ADM formulation of the Dantzig selector is
$$\min_{\beta, z}\;\|\beta\|_1 \quad \text{subject to} \quad X^T(X\beta - y) - z = 0, \quad \|D^{-1}z\|_\infty \leq \lambda, \qquad (5.4)$$
where $D$ is the diagonal matrix whose diagonal elements are the norms of the columns of $X$. An augmented Lagrangian function for problem (5.4), for some $\mu > 0$, can be defined as
$$L_\mu(z, \beta, \eta) = \|\beta\|_1 + \eta^T(X^TX\beta - X^Ty - z) + \frac{\mu}{2}\|X^TX\beta - X^Ty - z\|_2^2.$$
ADM algorithm for the Dantzig selector
1. Initialize: let $\beta^0, \eta^0 \in \mathbb{R}^p$ and $\mu > 0$.
2. For $k = 0, 1, \ldots$:
$$z^{k+1} = \operatorname*{argmin}_{\|D^{-1}z\|_\infty \leq \lambda} L_\mu(z, \beta^k, \eta^k), \qquad \beta^{k+1} \in \operatorname*{argmin}_\beta L_\mu(z^{k+1}, \beta, \eta^k), \qquad \eta^{k+1} = \eta^k + \mu(X^TX\beta^{k+1} - X^Ty - z^{k+1}).$$
End (for)
We now go into the subproblems of the ADM. The dual problem of (5.4) is given by
$$\max_\eta\; d(\eta) := -y^TX\eta - \lambda\|D\eta\|_1 \quad \text{subject to} \quad \|X^TX\eta\|_\infty \leq 1.$$
The $z$-subproblem has a closed form solution:
$$z^{k+1} = \operatorname*{argmin}_{\|D^{-1}z\|_\infty \leq \lambda}\left\|z - \left(X^TX\beta^k - X^Ty + \frac{\eta^k}{\mu}\right)\right\|_2^2 = \min\left\{\max\left\{X^TX\beta^k - X^Ty + \frac{\eta^k}{\mu},\; -\lambda d\right\},\; \lambda d\right\},$$
where $d$ is the vector consisting of the diagonal entries of $D$ and the min and max are taken elementwise. For the second subproblem, we can choose $\beta^{k+1}$ as a solution of the following approximate subproblem,
$$\min_\beta\; \frac{\mu}{2}\left\|X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\right\|_2^2 + \|\beta\|_1.$$
This problem can be solved by the nonmonotone gradient method for nonsmooth minimization (Lu and Zhang, 2012). The general problem to which the nonmonotone gradient method applies can be written as
$$\min_{x \in \mathcal{X}}\; F(x) := f(x) + P(x),$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable, $P : \mathbb{R}^n \to \mathbb{R}$ is convex but not necessarily smooth, and $\mathcal{X} \subseteq \mathbb{R}^n$ is closed and convex. Here, $f(\beta) = \frac{\mu}{2}\|X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\|_2^2$ and $P(\beta) = \|\beta\|_1$, so that $\nabla f(\beta) = \mu X^TX\big(X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\big)$. The nonmonotone gradient method for solving the $\beta$ subproblem is defined as follows:
1. Initialize: $0 < \tau, \sigma < 1$, $0 < \underline{\alpha} < 1$ and an integer $M \geq 0$. Let $\beta^0$ be given and set $\alpha_0 = 1$.
2. For $l = 0, 1, \ldots$:
(a) Let $d^l = \mathrm{SoftThresh}(\beta^l - \alpha_l\nabla f(\beta^l), \alpha_l) - \beta^l$ and $\Delta^l = \nabla f(\beta^l)^Td^l + \|\beta^l + d^l\|_1 - \|\beta^l\|_1$.
(b) Find the largest $\alpha \in \{1, \tau, \tau^2, \ldots\}$ such that
$$f(\beta^l + \alpha d^l) + \|\beta^l + \alpha d^l\|_1 \leq \max_{[l-M]_+ \leq i \leq l}\big\{f(\beta^i) + \|\beta^i\|_1\big\} + \sigma\alpha\Delta^l.$$
Set $\alpha^l \leftarrow \alpha$, $\beta^{l+1} \leftarrow \beta^l + \alpha^ld^l$ and $l \leftarrow l + 1$.
(c) Update $\alpha_{l+1} = \min\big\{\max\big\{\tfrac{\|s^l\|^2}{(s^l)^Tg^l}, \underline{\alpha}\big\}, 1\big\}$, where $s^l = \beta^{l+1} - \beta^l$ and $g^l = \nabla f(\beta^{l+1}) - \nabla f(\beta^l)$.
End (for)
Here $\mathrm{SoftThresh}(v, \gamma) := \mathrm{sgn}(v)\max\{0, |v| - \gamma e\}$, applied componentwise, where $e$ is the vector of ones. For the specific termination rules used in the ADM, see Lu et al. (2012).
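The structure of the ADM iteration can be summarized in the following simplified sketch. It is not the algorithm of Lu et al. (2012): the beta-subproblem is solved approximately with plain proximal gradient (ISTA) steps instead of the nonmonotone gradient method above, and the iteration counts, step size rule, and names are illustrative assumptions.

```python
import numpy as np

def soft_thresh(v, t):
    """Componentwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dantzig_adm(X, y, lam, mu=1.0, n_outer=200, n_inner=50):
    """Simplified ADM sketch for problem (5.4)."""
    n, p = X.shape
    A = X.T @ X
    b = X.T @ y
    d = np.sqrt(np.diag(A))                          # column norms of X (diagonal of D)
    beta = np.zeros(p)
    eta = np.zeros(p)
    step = 1.0 / (mu * np.linalg.norm(A, 2) ** 2)    # 1 / Lipschitz constant of the smooth part
    for _ in range(n_outer):
        # z-update: closed-form projection onto the box |z_j| <= lam * d_j
        z = np.clip(A @ beta - b + eta / mu, -lam * d, lam * d)
        # beta-update: approximately minimize mu/2 ||A beta - c||^2 + ||beta||_1 by ISTA
        c = b + z - eta / mu
        for _ in range(n_inner):
            grad = mu * A @ (A @ beta - c)
            beta = soft_thresh(beta - step * grad, step)
        # multiplier update
        eta = eta + mu * (A @ beta - b - z)
    return beta
```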
102
Bibliography
Banerjee, O., El Ghaoui, L., and d’Aspremont, A. (2008). Model selection
through sparse maximum likelihood estimation for multivariate gaussian or
binary data. The Journal of Machine Learning Research, 9:485–516.
Becker, S. R., Candes, E. J., and Grant, M. C. (2011). Templates for convex
cone problems with applications to sparse signal recovery. Mathematical
Programming Computation, 3(3):165–218.
Bertsekas, D. and Tsitsiklis, J. (1989). Parallel and distributed computation.
Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance
matrices. The Annals of Statistics, pages 199–227.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of
lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732.
103
Breiman, L. (1996). Heuristics of instability and stabilization in model selec-
tion. The Annals of Statistics, 24(6):2350–2383.
Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach
to sparse precision matrix estimation. Journal of the American Statistical
Association, 106(494):594–607.
Cai, T. T., Liu, W., Luo, X. R., and Luo, M. X. R. (2012). Package ’clime’.
Candes, E. and Plan, Y. (2009). Near-ideal model selection by l1 minimization.
The Annals of Statistics, 37(5A):2145–2177.
Candes, E. and Romberg, J. (2005). l1-magic: Recovery of sparse signals via convex programming. URL: www.acm.caltech.edu/l1magic/downloads/l1magic.pdf, 4.
Candes, E. and Tao, T. (2005). Decoding by linear programming. Information
Theory, IEEE Transactions on, 51(12):4203–4215.
Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation
when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
Chen, J. and Chen, Z. (2008). Extended bayesian information criteria for
model selection with large model spaces. Biometrika, 95(3):759–771.
104
Dicker, L. (2010). Regularized Regression Methods for Variable Selection and
Estimation. Collections of the Harvard University Archives: Dissertations.
Harvard University.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle
regression. The Annals of Statistics, 32(2):407–499.
Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive
lasso and scad penalties. The Annals of Applied Statistics, 3(2):521.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Associ-
ation, 96(456):1348–1360.
Fan, J., Xue, L., and Zou, H. (2012). Strong oracle optimality of folded concave
penalized estimation. arXiv preprint arXiv:1210.5992.
Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemo-
metrics regression tools. Technometrics, 35(2):109–135.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance
estimation with the graphical lasso. Biostatistics, 9(3):432–441.
Friedman, J., Hastie, T., Tibshirani, R., and Tibshirani, M. R. (2013). Package
’glasso’.
105
Gai, Y., Zhu, L., and Lin, L. (2013). Model selection consistency of dantzig
selector. Statistica Sinica, 23:615–634.
Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional lin-
ear predictor selection and the virtue of overparametrization. Bernoulli,
10(6):971–988.
Hess, K. R., Anderson, K., Symmans, W. F., Valero, V., Ibrahim, N., Mejia,
J. A., Booser, D., Theriault, R. L., Buzdar, A. U., Dempsey, P. J.,
et al. (2006). Pharmacogenomic predictor of sensitivity to preoperative
chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophos-
phamide in breast cancer. Journal of Clinical Oncology, 24(26):4236–4244.
Huang, J., Horowitz, J. L., and Ma, S. (2008a). Asymptotic properties of
bridge estimators in sparse high-dimensional regression models. The Annals
of Statistics, 36(2):587–613.
Huang, J., Ma, S., and Zhang, C.-H. (2008b). Adaptive lasso for sparse high-
dimensional regression models. Statistica Sinica, 18(4):1603.
James, G. M., Radchenko, P., and Lv, J. (2009). Dasso: connections between
the dantzig selector and lasso. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 71(1):127–142.
106
Kim, Y., Choi, H., and Oh, H. (2008). Smoothly clipped absolute devia-
tion on high dimensions. Journal of the American Statistical Association,
103(484):1665–1673.
Kim, Y. and Kwon, S. (2012). Global optimality of nonconvex penalized esti-
mators. Biometrika, 99(2):315–325.
Kim, Y., Kwon, S., and Choi, H. (2012). Consistent model selection criteria on
high dimensions. The Journal of Machine Learning Research, 98888:1037–
1057.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large
covariance matrix estimation. Annals of statistics, 37(6B):4254.
Lu, Z. (2012). Primal–dual first-order methods for a class of cone programming.
Optimization Methods and Software, 28(6):1262–1281.
Lu, Z., Pong, T. K., and Zhang, Y. (2012). An alternating direction method
for finding dantzig selectors. Computational Statistics & Data Analysis.
Lu, Z. and Zhang, Y. (2012). An augmented lagrangian approach for sparse
principal component analysis. Mathematical programming, 135(1-2):149–
193.
107
Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and vari-
able selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
Meinshausen, N., Rocha, G., and Yu, B. (2007). Discussion: A tale of three
cousins: Lasso, l2boosting and dantzig. The Annals of Statistics, 35(6):2373–
2384.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A
unified framework for high-dimensional analysis of m-estimators with de-
composable regularizers. Statistical Science, 27(4):538–557.
Peng, J., Wang, P., Zhou, N., and Zhu, J. (2009). Partial correlation estima-
tion by joint sparse regression models. Journal of the American Statistical
Association, 104(486):735–746.
Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue
properties for correlated gaussian designs. The Journal of Machine Learning
Research, 99:2241–2259.
Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of es-
timation for high-dimensional linear regression over `q-balls. Information
Theory, IEEE Transactions on, 57(10):6976–6994.
Ravikumar, P., Wainwright, M., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.
Romberg, J. (2008). The dantzig selector and generalized thresholding. In In-
formation Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference
on, pages 22–25. IEEE.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random
measurements. Information Theory, IEEE Transactions on, 59(6):3434–
3447.
Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A.,
Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant,
T. L., et al. (2006). Regulation of gene expression in the mammalian eye
and its relevance to eye disease. Proceedings of the National Academy of
Sciences, 103(39):14429–14434.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society. Series B (Methodological), pages 267–288.
Wang, L., Kim, Y., and Li, R. (2013). Calibrating nonconvex penalized regres-
sion in ultra-high dimension. The Annals of Statistics, 41(5):2505–2536.
Wang, X. and Yuan, X. (2012). The linearized alternating direction method of multipliers for dantzig selector. SIAM Journal on Scientific Computing, 34(5):A2792–A2811.
Yuan, M. (2010). High dimensional inverse covariance matrix estimation via
linear programming. The Journal of Machine Learning Research, 99:2261–
2286.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the gaussian
graphical model. Biometrika, 94(1):19–35.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax con-
cave penalty. The Annals of Statistics, 38(2):894–942.
Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection
in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–
1594.
Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regulariza-
tion for high-dimensional sparse estimation problems. Statistical Science,
27(4):576–593.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The
Journal of Machine Learning Research, 7:2541–2563.
110
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101(476):1418–1429.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized
likelihood models. Annals of Statistics, 36(4):1509.
111
국문초록 (Abstract in Korean)

Variable selection is important in high dimensional regression. Traditional variable selection methods such as stepwise selection are unstable in the sense that the set of selected variables changes with the data. As an alternative, penalized methods that perform variable selection and estimation simultaneously are used. The LASSO estimator is sparse, but it is not selection consistent and is biased. Nonconvex penalized methods such as the SCAD and the MCP are well known to be selection consistent and to yield unbiased estimators, but they can have multiple local minima and their computation is unstable with respect to the tuning parameter. Two stage methods based on the LASSO, which can attain the oracle estimator as the unique local minimum, have been developed.

In this thesis we propose a new two stage method in which the LASSO is replaced by the Dantzig selector. The proposal is motivated by the observation that, in the second stage of a two stage method, reducing the influence of the noise variables is very important. For the same tuning parameter, the $\ell_1$-norm of the Dantzig selector is less than or equal to that of the LASSO, and the non-asymptotic error bounds of the Dantzig selector also tend to be smaller than those of the LASSO. Therefore we expect that using the Dantzig selector instead of the LASSO in the two stage method improves estimation while still satisfying selection consistency. We prove that, under conditions on the data, the oracle estimator can be obtained by the proposed method. Numerical studies show that, in variable selection and estimation, the two stage Dantzig selector can improve on the two stage methods based on the LASSO and also performs well compared with other existing methods.

We additionally apply the two stage methods to estimation of the inverse covariance matrix, which is used in various statistical problems and is important in its own right because it encodes conditional dependence. We show that, under regularity conditions, the proposed methods achieve selection consistency for the zero pattern of the inverse covariance matrix and $\sqrt{n}$-consistency for the true nonzero elements. Numerical studies confirm that the proposed methods perform better than existing methods in both selection and estimation.

주요어 (Keywords): high dimensional regression, variable selection, Dantzig selector, selection consistency, oracle estimator, inverse covariance matrix
학 번 (Student Number): 2007−20263
113
감사의 글 (Acknowledgments)

I am thankful that during eleven years at Gwanak, as an undergraduate and graduate student, I met good professors, friends, seniors, juniors, and many people who helped me, and that I was able to finish the whole course by God's grace. Seven years have passed since I entered graduate school hoping to build expertise in statistics and to contribute to the public good, and taking my first step as a doctor feels new. Before this new start, I would like to express my gratitude to the many people who helped me along the way.

First of all, I thank my advisor, Professor 김용대, who gave me many opportunities and guided me throughout my doctoral studies, and Professor 전종우, who always welcomed and encouraged me. I sincerely thank Professors 박병욱, 장원철, and 임요한 for their advice and effort in reviewing this thesis, and Professor 권성훈, an admirable senior of our lab, for the thesis review and much advice. My six years of lab life were enjoyable thanks to good seniors and juniors: Professor 최호식, 동화, 도현, 범수, 광수, 재석, 상인, 병엽, 종준, 미애, 수희, 신선, 영희, 건수, 혜림, 미경, 효원, 지영, 원준, 지선, 지영, 주유, 민우, 승환, 우성, 재성, 슬기, 동하, 세민, 구환, 윤영, 승남, 오란, and, from the Naver team, 김유원, 정효주, 인재, and 연하; thank you all. I also thank 영선, my close friend since undergraduate days; 신영, my roommate of more than three years who supported me; 정은, cute and thoughtful; my doctoral classmates 예지 and 성일; 선미, who shared enjoyable times in my final year; and 정환 in the department office.

I am grateful to the Christian fellowship of the Department of Statistics, which accompanied my whole graduate life: Professor 오희석, the great pillar of the group; Professor 박태성, who joined despite a busy schedule; Professor 조명희, who encouraged us with delicious meals; and Professors 이영조, 박성현, and 송문섭, who joined us in worship. From 민정 and 성준 to 지영, 민주, 은용, 성경, 재혁, and 보창, it was a joy to share and have fellowship together. The Wednesday chapel praise team was a source of energy in my graduate life: 준, 정민, 은혜, 바우, 송희, 민우, 정민, 지웅, 재희, 나래, 바뚜, 문수, 신잉, Professor 서교, 경주, 현주, 지나, 찬미, 경만, 서림, 건의, 민화, 소정, and 소영; thank you, I was truly grateful and happy. I was also thankful for the small joys shared with the lovely 보신 sisters 윤진, 효현, and 지인. I thank Professors 남승호 and 이원종, who showed me the model of an upright Christian I wish to emulate, and Pastor 김동식, 마리아 사모님, Professor 홍종인, and 김난주 사모님 of the university church, who helped and encouraged my growth in faith.

Finally, I thank my parents, who supported me with constant love, trust, and prayer, and my younger sibling, who has been a steady support. I am grateful to my long-time friends 새미 and 정현. I am sorry that I have not kept in touch with my teacher 구명수, who helped me become interested in mathematics, and I am truly thankful to him. I also thank my relatives and church family who supported me with prayer. Although I am still lacking in many ways, I will strive, with diligence, sincerity, and love, to be of benefit to my community and country. Thank you.

February 2014, 한상미
저 시-비 리-동 조건 경허락 2.0 한민
는 아래 조건 르는 경 에 한하여 게
l 저 물 복제, 포, 전송, 전시, 공연 송할 수 습니다.
l 차적 저 물 성할 수 습니다.
다 과 같 조건 라야 합니다:
l 하는, 저 물 나 포 경 , 저 물에 적 허락조건 확하게 나타내어야 합니다.
l 저 터 허가를 러한 조건들 적 지 않습니다.
저 에 른 리는 내 에 하여 향 지 않습니다.
것 허락규약(Legal Code) 해하 쉽게 약한 것 니다.
Disclaimer
저 시. 하는 원저 를 시하여야 합니다.
비 리. 하는 저 물 리 적 할 수 없습니다.
동 조건 경허락. 하가 저 물 개 , 형 또는 가공했 경에는, 저 물과 동 한 허락조건하에서만 포할 수 습니다.
이학박사 학위논문
Two Stage Dantzig Selector for
High Dimensional Data
고차원 자료를 위한 이단계 단치그 셀렉터
2014년 2월
서울대학교 대학원
통계학과
한 상 미
Two Stage Dantzig Selector for
High Dimensional Data
지도교수 김 용 대
이 논문을 이학박사 학위논문으로 제출함.
2013년 10월
서울대학교 대학원
통계학과
한 상 미
한상미의 이학박사 학위논문을 인준함.
2013년 12월
위 원 장 : 박 병 욱 (인)
부 위원장 : 김 용 대 (인)
위 원 : 임 요 한 (인)
위 원 : 장 원 철 (인)
위 원 : 권 성 훈 (인)
Two Stage Dantzig Selector for
High Dimensional Data
by
Sangmi Han
A Thesis
Submitted in fulfillment of the requirements
for the degree of
Doctor of Philosophy
in Statistics
Department of Statistics
College of Natural Sciences
Seoul National University
Feburary, 2014
Abstract
Variable selection is important in high dimensional regression. Traditional variable selection methods such as stepwise selection are unstable, in the sense that the set of selected variables varies from data set to data set. As an alternative, a series of penalized methods have been used to perform estimation and variable selection simultaneously. The LASSO yields a sparse solution, but it is biased and not selection consistent. Non-convex penalized methods such as the SCAD and the MCP are known to be selection consistent and to yield unbiased estimators. However, they suffer from multiple local minima, and their computation is unstable with respect to the tuning parameter. Two stage methods based on the LASSO, such as the one step LLA and the calibrated CCCP, have been developed, which can obtain the oracle estimator as the unique local minimum.
We propose a two stage method based on the Dantzig selector. The motivation for our proposed method is that lessening the effect of the noise variables is important in the two stage method. The ℓ1 norm of the Dantzig selector is always less than or equal to that of the LASSO, and the non-asymptotic error bounds of the Dantzig selector tend to be smaller than those of the LASSO for the same tuning parameter. Therefore we expect an improvement in estimation when the Dantzig selector is used instead of the LASSO in the two stage method, while the proposed method still satisfies the selection consistency. The results of the numerical experiments support this contention.
We also apply these two stage methods, based on either the LASSO or the Dantzig selector, to the estimation of the inverse covariance matrix (a.k.a. precision matrix). Precision matrix estimation is essential not only because it can be used in various applications but also because it describes the direct relationship between variables via their conditional dependence under the normality assumption. Under some regularity conditions our methods achieve selection consistency and obtain a columnwise √n-consistent estimator of the true nonzero precision matrix elements. The numerical analyses show that the proposed methods perform well in terms of variable selection and estimation.
Keywords: High dimensional regression, variable selection, Dantzig selector,
selection consistency, oracle estimator, inverse covariance matrix estimation
Student Number: 2007− 20263
ii
Contents
Abstract i
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Two Stage Dantzig Selector for High Dimensional Linear Regression 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Sparse regularization methods . . . . . . . . . . . . . . . . . . . 8
2.2.1 The `1 regularization . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Nonconvex penalized methods . . . . . . . . . . . . . . . 13
2.2.3 Two stage methods . . . . . . . . . . . . . . . . . . . . . 21
2.3 Two Stage Dantzig Selector . . . . . . . . . . . . . . . . . . . . 25
iii
2.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.3 Theoretical properties . . . . . . . . . . . . . . . . . . . 29
2.3.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.5 Tuning regularization parameter . . . . . . . . . . . . . . 37
2.4 Numerical analyses . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . 44
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Two Stage Methods for Precision Matrix Estimation 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Estimation of precision matrix via columnwise two-stage methods 49
3.2.1 Two stage method based on LASSO . . . . . . . . . . . 50
3.2.2 Two stage Dantzig selector . . . . . . . . . . . . . . . . . 53
3.2.3 Theoretical results . . . . . . . . . . . . . . . . . . . . . 55
3.3 Numerical analyses . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.2 Real data analysis . . . . . . . . . . . . . . . . . . . . . 83
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
iv
4 Concluding remarks 88
5 Appendix 90
5.1 Algorithms for Dantzig selector . . . . . . . . . . . . . . . . . . 90
5.1.1 Primal-dual interior point algorithm (Candes and Romberg,
2005) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.2 DASSO (James et al., 2009) . . . . . . . . . . . . . . . . 96
5.1.3 Alternating direction method (ADM) (Lu et al., 2012) . 100
Abstract (in Korean) 111
감사의 글 114
v
List of Tables
2.1 Example 1 (R=0.3) . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2 Example 1 (R=0.5) . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 Example 2 (R=0.3) . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Example 2 (R=0.5) . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5 Real Data (TRIM) . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Example 1 (p=100, q=99) . . . . . . . . . . . . . . . . . . . . . 68
3.2 ‖Ω −Ω∗‖ of Example 1 (p=100, q=99) . . . . . . . . . . . . . 70
3.3 Example 1 (p=200, q=199) . . . . . . . . . . . . . . . . . . . . 72
3.4 ‖Ω −Ω∗‖ of Example 1 (p=200, q=199) . . . . . . . . . . . . . 74
3.5 Example 2 (p=100, q=59) . . . . . . . . . . . . . . . . . . . . . 76
3.6 ‖Ω −Ω∗‖ of Example 2 (p=100, q=59) . . . . . . . . . . . . . 78
3.7 Example 2 (p=200, q=92) . . . . . . . . . . . . . . . . . . . . . 80
3.8 ‖Ω −Ω∗‖ of Example 2 (p=200, q=92) . . . . . . . . . . . . . 82
vi
3.9 Real Data (Breast Cancer) . . . . . . . . . . . . . . . . . . . . . 86
vii
List of Figures
2.1 LASSO and Dantzig selector . . . . . . . . . . . . . . . . . . . 10
2.2 Penalized method . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 LLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 CCCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Nonconvex penalties . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Adaptive DS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 ROC curve of Example 1 (p=100, q=99) . . . . . . . . . . . . . 67
3.2 ‖Ω −Ω∗‖ of Example 1 (p=100, q=99) . . . . . . . . . . . . . 69
3.3 ROC curve of Example 1 (p=200, q=199) . . . . . . . . . . . . 71
3.4 ‖Ω −Ω∗‖ of Example 1 (p=200, q=199) . . . . . . . . . . . . . 73
3.5 ROC curve of Example 2 (p=100, q=59) . . . . . . . . . . . . . 75
3.6 ‖Ω −Ω∗‖ of Example 2 (p=100, q=59) . . . . . . . . . . . . . 77
3.7 ROC curve of Example 2 (p=200, q=92) . . . . . . . . . . . . . 79
viii
3.8 ‖Ω −Ω∗‖ of Example 2 (p=200, q=92) . . . . . . . . . . . . . 81
ix
Chapter 1
Introduction
1.1 Overview
High dimensional data analysis has received much attention due to advances in data collection technologies. High dimensional data, where the number of covariates exceeds the number of observations, arise in various fields such as genomics, neuroscience, economics, finance, and machine learning. Variable selection is fundamental to data analysis because it can identify significant variables among many covariates, and in high dimensions its importance only grows. There are two approaches to variable selection: subset selection and sparse regularization. In high dimensions, subset selection methods such as best subset selection are computationally demanding, unstable, and their sampling properties are hard to derive (Breiman, 1996).
To handle these drawbacks, many sparse regularization methods have been proposed. These methods can select significant variables and estimate coefficients simultaneously. The two major approaches in sparse regularization are ℓ1 regularization, including the LASSO (Tibshirani, 1996) and the Dantzig selector (Candes and Tao, 2007), and nonconvex penalization, including the SCAD (Fan and Li, 2001) and the MCP (Zhang, 2010). For selection consistency, ℓ1 regularization methods need stringent conditions such as the strong irrepresentable condition (Zhao and Yu, 2006).
On the other hand, nonconvex penalized methods not only do not need such conditions for selection consistency but can also reduce the innate bias problem of ℓ1 regularization methods. Despite these good properties, nonconvex penalized methods suffer from multiple local minima and cannot guarantee that the converged solution is the oracle estimator. As an alternative, two stage methods based on the LASSO, such as the one step LLA and the calibrated CCCP algorithms, have been proposed to obtain the oracle estimator.
In this thesis, we deal with regularization methods in the high dimensional linear regression model. We focus on developing a new regularization method that obtains the oracle estimator. We propose a two stage method based on the Dantzig selector. Our method can improve variable selection and estimation by deleting noise variables more efficiently, using the Dantzig selector instead of the LASSO.
We also deal with precision matrix estimation as an application of high dimensional linear regression. For sparse precision matrix estimation, many regularization methods have been considered. Most of them belong to one of two regularization frameworks: the maximum likelihood approach and the regression based approach. We apply the two stage methods based on the LASSO or the Dantzig selector to the regression based approach and show that they can obtain the columnwise oracle estimator of the precision matrix. Numerical results show that our proposed methods are superior to other regularized methods in terms of support recovery and estimation of the precision matrix.
1.2 Outline of the thesis
The thesis is organized as follows. In Chapter 2, we deal with high dimensional linear regression. We review diverse sparse regularization methods for high dimensional linear regression and propose the two stage Dantzig selector. Theoretical properties of and an algorithm for the two stage Dantzig selector are provided, and we compare our method to other methods in numerical analyses. In Chapter 3, precision matrix estimation using regularization methods is considered. We review existing regularization methods and propose new methods which utilize the two stage methods based on the LASSO or the Dantzig selector. We prove theoretical properties of the two stage methods, and numerical analyses are conducted. Concluding remarks follow in Chapter 4. In the Appendix, algorithms for the Dantzig selector are summarized.
Chapter 2
Two Stage Dantzig Selector for
High Dimensional Linear
Regression
2.1 Introduction
Variable selection is essential for linear regression analysis. There are two ap-
proaches for variable selection, which are subset selection and regularization.
Subset selection is selecting a subset of covariates and using only the selected
covariates for fitting model. Popular examples of subset selection are best sub-
set selection, forward selection, backward elimination, stepwise selection, and so on. In high dimensions, these subset selection methods are computationally
demanding and unstable. Furthermore, their sampling properties are hard to
derive (see Breiman (1996) for more discussions).
To deal with these drawbacks, many sparse regularization methods have
been proposed which can select the important variables and estimate the ef-
fect of covariates on the response simultaneously. The `1 regularization meth-
ods and the nonconvex penalized methods are two mainstreams of regularized
estimators for high dimensional regression model. The least absolute shrinkage
and selection operator (LASSO) (Tibshirani, 1996) and the Dantzig selector
(Candes and Tao, 2007) are two representative examples of the `1 regulariza-
tion. They are easy to calculate and have good estimation properties (Bickel
et al., 2009; Raskutti et al., 2011). However, they have an intrinsic bias and achieve selection consistency only under stringent conditions such as the irrepresentable condition (Zhao and Yu, 2006).
On the other hand, nonconvex penalized methods such as the smoothly
clipped absolute deviation (SCAD) (Fan and Li, 2001), and the minimax con-
cave penalty (MCP) (Zhang, 2010) can have unbiasedness and selection consis-
tency, simultaneously. The most fascinating property of nonconvex penalized
methods is the oracle property. The oracle property means that covariates
are selected consistently and the efficiency of the estimator is equivalent to
the least square estimator obtained with knowing true nonzero coefficients in
advance (Fan and Li, 2001; Kim et al., 2008; Kim and Kwon, 2012).
However, due to their nonconvexity, there can be many local minima in the
corresponding objective function. Therefore, it is not guaranteed for an ob-
tained estimator to be the oracle estimator. Even though some previous works
(Kim and Kwon, 2012; Zhang, 2010) showed that the objective function with
a nonconvex penalty can have a unique local minimizer under some regularity
conditions, its computation may be demanding for high dimensional models.
Typically, optimization problems corresponding to nonconvex penalized meth-
ods are solved by iterative algorithms, including the concave convex procedure
(CCCP) (Kim et al., 2008) and the local linear approximation (LLA) (Zou and
Li, 2008), where the nonconvex objective function is approximated by a locally
linear function. However, it takes a significant amount of time for algorithms
to converge, and typically the nonconvex penalized methods suffer instability
in tuning the regularization parameter. Furthermore, these algorithms only
assure the convergence to a local minimum which is not necessarily the oracle
estimator (Wang et al., 2013).
Two stage methods based on the LASSO, such as the one step LLA (Zou and Li, 2008; Fan et al., 2012) and the calibrated CCCP (Wang et al., 2013), have been proposed to obtain the oracle estimator. The main idea of these methods is to obtain the oracle estimator by solving the LASSO problem twice.
In this chapter, we propose a two stage method based on the Dantzig selector to obtain the oracle estimator, which we call the two stage Dantzig selector. The Dantzig selector used in our method can improve variable selection and estimation by lessening the effects of the noise variables more efficiently than the LASSO. We prove that the two stage Dantzig selector can obtain the oracle estimator under regularity conditions. The proposed method can be easily implemented with general algorithms for the standard Dantzig selector. Numerical results show that our proposed method outperforms other sparse regularization methods with respect to variable selection and estimation.
2.2 Sparse regularization methods
In this section, we review various sparse regularization methods for high di-
mensional linear regression. Consider the linear regression model
$$y = X\beta + \varepsilon,$$
where $y$ is an $n \times 1$ response vector, $X = (x_1, \ldots, x_n)^T = (X_1, \ldots, X_p)$ is an $n \times p$ covariate matrix with $X_j \in \mathbb{R}^n$ and $x_i \in \mathbb{R}^p$, $\beta$ is a $p \times 1$ vector of unknown regression coefficients, and $\varepsilon$ is an $n \times 1$ vector of random errors. In high dimensions, the ordinary least squares estimator is not uniquely defined, and the traditional variable selection methods, such as best subset selection or stepwise selection based on the AIC or BIC criteria, are computationally intensive. Furthermore, their sampling properties are hard to derive and they are unstable (Breiman, 1996). As an alternative, sparse regularization methods are used to estimate coefficients and select variables simultaneously. There are two mainstreams of regularization methods: ℓ1 regularization and nonconvex penalization.
2.2.1 The `1 regularization
The ℓ1 regularization can achieve sparsity. Sparsity means that the estimator produces exactly zero coefficients and hence reduces model complexity. The LASSO (Tibshirani, 1996) and the Dantzig selector (Candes and Tao, 2007) are two representative examples of the ℓ1 constrained methods. The LASSO estimator $\hat\beta^{LASSO}$ is defined as the solution of
$$\min_\beta \left( \|y - X\beta\|^2/2n + \lambda\|\beta\|_1 \right),$$
or equivalently,
$$\min_\beta \|y - X\beta\|^2 \quad \text{subject to} \quad \|\beta\|_1 \le t,$$
where $\|a\| = \sqrt{\sum_{i=1}^{n} a_i^2}$ and $\|a\|_1 = \sum_{i=1}^{n} |a_i|$.
[Figure 2.1: LASSO and Dantzig selector. Left panel: LASSO; right panel: Dantzig Selector; axes $\beta_1$ and $\beta_2$, with the true coefficient and the OLS estimator marked.]
The Dantzig selector $\hat\beta^{Dantzig}$ is defined similarly to the LASSO estimator by
$$\min_\beta \|\beta\|_1 \quad \text{subject to} \quad \Big\|\frac{1}{n}X^T(y - X\beta)\Big\|_\infty \le \lambda,$$
or equivalently,
$$\min_\beta \|X^T(y - X\beta)\|_\infty \quad \text{subject to} \quad \|\beta\|_1 \le t,$$
where $\|a\|_\infty = \max_i |a_i|$ for $a \in \mathbb{R}^n$.
As shown in Figure 2.1, the solution of the LASSO occurs at the point of contact between the dotted ellipsoid and the solid diamond, whereas the solution of the Dantzig selector occurs at the point of contact between the dotted parallelogram and the diamond. Hence solutions with exactly zero elements can be obtained. The dotted ellipsoid denotes the set of points at the same distance $(\beta - \hat\beta^{ols})^T X^T X (\beta - \hat\beta^{ols})$ from the ordinary least squares estimator $\hat\beta^{ols}$, and the dotted parallelogram denotes the set of points with the same value of $\|X^T X(\beta - \hat\beta^{ols})\|_\infty$.
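To make the definition concrete, the Dantzig selector can be computed with an off-the-shelf linear programming solver, since both the ℓ1 objective and the ℓ∞ constraint become linear after introducing auxiliary variables. The following is a minimal sketch under that reduction, assuming scipy's generic LP solver; it is not the thesis's own algorithm (the primal-dual interior point method, DASSO, and ADM reviewed in the Appendix are used there).

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve the Dantzig selector as a linear program:
    min ||beta||_1  subject to  ||X^T (y - X beta) / n||_inf <= lam."""
    n, p = X.shape
    G = X.T @ X / n          # p x p Gram matrix / n
    c0 = X.T @ y / n         # correlation with the response
    # decision vector z = [beta, u]; minimize sum(u) with -u <= beta <= u
    cost = np.concatenate([np.zeros(p), np.ones(p)])
    I = np.eye(p)
    A_ub = np.vstack([
        np.hstack([ I, -I]),                 #  beta - u <= 0
        np.hstack([-I, -I]),                 # -beta - u <= 0
        np.hstack([ G, np.zeros((p, p))]),   #  G beta <= lam + c0
        np.hstack([-G, np.zeros((p, p))]),   # -G beta <= lam - c0
    ])
    b_ub = np.concatenate([np.zeros(2 * p), lam + c0, lam - c0])
    bounds = [(None, None)] * p + [(0, None)] * p
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]
```

This plain LP has $2p$ variables and roughly $4p$ constraints, so it is practical only for moderate $p$; the specialized algorithms summarized in the Appendix scale better.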
The penalized form of the LASSO and the definition of the Dantzig selector
are related. The LASSO estimate is always in the constrained set (feasible set)
of the Dantzig selector. The Karush-Kuhn-Tucker conditions for the Lagrangian form of the LASSO are given by
$$\frac{1}{n}X_j^T(y - X\beta) = \lambda\,\mathrm{sign}(\beta_j) \quad \text{for } |\beta_j| > 0,$$
$$\Big|\frac{1}{n}X_j^T(y - X\beta)\Big| \le \lambda \quad \text{for } \beta_j = 0.$$
Therefore $\|\frac{1}{n}X^T(y - X\hat\beta^{LASSO}(\lambda))\|_\infty \le \lambda$, and hence $\|\hat\beta^{Dantzig}(\lambda)\|_1 \le \|\hat\beta^{LASSO}(\lambda)\|_1$.
The Dantzig selector and the LASSO share some similarities. They yield the
same solution path under some suitable conditions on the design matrix. Mein-
shausen et al. (2007) proved that the LASSO and the Dantzig selector share the
identical solution path under the diagonal dominance condition, which means
that $M_{jj} > \sum_{i\neq j}|M_{ij}|$ for all $j = 1, \ldots, p$, where $M = (X^TX)^{-1}$. James et al. (2009) showed the equivalence of the LASSO and the Dantzig selector under a condition which is similar to the irrepresentable condition (IC) (Zhao and Yu, 2006). This condition is that $\|X^T X_{A(\lambda)} u\|_\infty \le 1$ for $u = (X_{A(\lambda)}^T X_{A(\lambda)})^{-1}\mathbf{1}$, with a tuning parameter $\lambda$ and the active set $A(\lambda) = \{j : \hat\beta_j(\lambda) \neq 0\}$.
The Dantzig selector and the LASSO estimator can achieve the minimax
optimal error bound (Raskutti et al., 2011; Bickel et al., 2009). Raskutti et al.
(2011) showed that the minimax optimal convergence rate of the ℓ2-error is $O(\sqrt{s\log p/n})$ under some regularity conditions. Bickel et al. (2009) proved a similar prediction error rate for the LASSO and the Dantzig selector and their asymptotic equivalence under the restricted eigenvalue condition.
Not only the theoretical properties, but also the algorithms for the LASSO
and the Dantzig selector are comparable. Similar to the LARS (Efron et al.,
2004), which is an efficient algorithm for the LASSO estimator giving piece-
wise linear path, the DASSO (James et al., 2009) algorithm gives a piecewise
linear solution path. These algorithms will be summarized and compared in
the Appendix.
Despite their good asymptotic properties and efficient algorithms, the LASSO and the Dantzig selector have some limitations. First, the LASSO and the Dantzig selector are biased: since the same amount of shrinkage is enforced on all nonzero coefficients, they cannot achieve unbiasedness. Second, the LASSO and the Dantzig selector rarely have the model selection consistency. The correlation structure of the covariates is crucial for selection consistency, as reflected in conditions such as the ICs (Zhao and Yu, 2006) and the coherence property (Candes and
Plan, 2009). Zhao and Yu (2006) proved the weak oracle property of the LASSO
estimator under the ICs. Gai et al. (2013) proved the weak oracle property of
the Dantzig selector under the modified ICs related to KKT conditions of
Dantzig selector, which are more complex than the ICs of the LASSO. Those
ICs mean that the regression coefficients of the inactive variables on the $s$ active variables should be uniformly bounded by a constant less than or equal to one. As we can see in the simulation results of Zhao and Yu (2006), these ICs are too strict, especially in high dimensions. Hence the LASSO and the Dantzig selector cannot achieve the selection consistency in most cases.
2.2.2 Nonconvex penalized methods
Nonconvex penalized methods can be good alternatives to the `1 regularized es-
timators since they have selection consistency and unbiasedness. A nonconvex
penalized least square estimator is defined as the minimizer of Qλ(β) where
$$Q_\lambda(\beta) = \|y - X\beta\|^2/2n + P_\lambda(|\beta|)$$
and Pλ(|β|) is a nonconvex penalty including bridge estimator (Frank and
Friedman, 1993), the SCAD (Fan and Li, 2001), and the MCP (Zhang, 2010).
The bridge penalty is defined as $P_\lambda(\beta) = \lambda\sum_{j=1}^{p}|\beta_j|^q$, $0 < q < 1$. The
[Figure 2.2: Penalized method — penalty functions $P_\lambda(\beta)$ of the lasso, MCP, SCAD, and bridge penalties.]
penalty function of the SCAD is defined as
$$P_\lambda(\beta) = \sum_{j=1}^{p}\left[\lambda|\beta_j|\,I(|\beta_j| < \lambda) + \left\{\frac{a\lambda(|\beta_j|-\lambda) - (\beta_j^2-\lambda^2)/2}{a-1} + \lambda^2\right\} I(\lambda \le |\beta_j| < a\lambda) + \left\{\frac{(a-1)\lambda^2}{2} + \lambda^2\right\} I(|\beta_j| \ge a\lambda)\right].$$
Zhang (2010) proposed the MCP with
$$P_\lambda(\beta) = \sum_{j=1}^{p}\left[\Big(\lambda|\beta_j| - \frac{\beta_j^2}{2a}\Big) I(|\beta_j| \le a\lambda) + \frac{1}{2}a\lambda^2\,I(|\beta_j| > a\lambda)\right].$$
Figure 2.2 shows those nonconvex penalty functions and the LASSO penalty.
The SCAD and the MCP estimators satisfy the good properties of a penalized estimator, namely unbiasedness, sparsity, and continuity, as introduced by Fan and Li (2001), while the bridge estimator lacks continuity and the LASSO lacks unbiasedness.
For identifying unknown signal variables, nonconvex penalized methods have received great attention recently because they can achieve the model selection consistency without stringent conditions such as the ICs. Instead, they need weaker conditions on the design matrix, such as the sparse Riesz condition (Zhang and Huang, 2008) or a positive minimum eigenvalue of the submatrix of $X^TX$ corresponding to the signal covariates. Let the true coefficient vector $\beta^* = (\beta_1^{*T}, \mathbf{0}^T)^T$ be such that the first $s$ regression coefficients $\beta_1^*$ are nonzero and the others are zero. The oracle estimator $\hat\beta^{(o)}$ is defined as $(\hat\beta_1^{(o)T}, \mathbf{0}^T)^T$, where $\hat\beta_1^{(o)} = (X_1^TX_1)^{-1}X_1^Ty$, $X = (X_1, X_2)$, and $X_1$ is the $n \times s$ and $X_2$ the $n \times (p-s)$ submatrix of $X$. Assume $\sqrt{n}(\hat\beta_1^{(o)} - \beta_1^*) \overset{d}{\to} N_s(0, \Sigma)$. An estimator $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ is said to have the oracle property if
$$\Pr\big[\{j : \hat\beta_j \neq 0\} = \{1, \ldots, s\}\big] = 1, \qquad \sqrt{n}(\hat\beta_1 - \beta_1^*) \overset{d}{\to} N_s(0, \Sigma).$$
Many previous works showed that various non-convex penalized methods have
the oracle property (Fan and Li, 2001; Kim et al., 2008; Huang et al., 2008a;
Zhang, 2010; Kim and Kwon, 2012).
There are three types of oracle properties - weak, global, and strong oracle
properties. The weak oracle property is that there exists a sequence of λn such
that one of the local minimizers of Qλn(β) is the oracle estimator. Fan and Li
(2001) and Kim et al. (2008) proved the weak oracle property of the SCAD
for p ≤ n and p > n, respectively. The global oracle property is that there
exists a sequence of λn such that the global minimizer β(λn) of Qλn(β) has
the oracle property. Huang et al. (2008a) proved the global oracle property of
bridge estimator and Kim et al. (2008) proved that of the SCAD for p ≤ n.
The strong oracle property means that there exists a sequence of λn such that
the unique local minimizer of Qλn(β) is the oracle estimator. The SCAD and
the MCP can obtain the oracle estimator as a unique local optimizer with
probability tending to one (Kim and Kwon, 2012; Zhang, 2010).
Computing the global solution of the nonconvex penalized methods is infeasible in the high dimensional setting, since optimizing a nondifferentiable and nonconvex function is challenging. Instead, iterative algorithms are used which locally approximate the nonconvex penalized objective by a convex function and solve the resulting convex optimization. The local quadratic approximation (LQA) (Fan and Li, 2001), the local linear approximation (LLA) (Zou and Li, 2008), and the concave convex procedure (CCCP) (Kim et al., 2008) have been devised to obtain a nonconvex penalized likelihood estimate. The LQA uses a second order approximation of the penalty
as follows:
$$[P_\lambda(|\beta_j|)]' = P'_\lambda(|\beta_j|)\,\mathrm{sign}(\beta_j) \approx \frac{P'_\lambda(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\beta_j,$$
$$P_\lambda(|\beta_j|) \approx P_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{P'_\lambda(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\big(\beta_j^2 - |\beta_j^{(0)}|^2\big) \quad \text{for } \beta_j \approx \beta_j^{(0)}.$$
The LQA algorithm updates the solution as follows until convergence:
$$\beta^{(k+1)} = \operatorname*{argmin}_{\beta}\left\{\frac{1}{2n}\|y - X\beta\|^2 + \frac{1}{2}\sum_{j=1}^{p}\frac{P'_\lambda(|\beta_j^{(k)}|)}{|\beta_j^{(k)}|}\,\beta_j^2\right\}.$$
To avoid numerical instability, when $|\beta_j^{(k)}| < \varepsilon_0$ (a prespecified value), we set $\hat\beta_j = 0$ and delete the $j$th column of $X$ from the iteration. In every iteration, the solution has the closed form
$$\beta^{(k+1)} = \big(X^TX + n\,\Sigma_\lambda(\beta^{(k)})\big)^{-1}X^Ty,$$
where $\Sigma_\lambda(\beta^{(k)}) = \mathrm{diag}\big(P'_\lambda(|\beta_1^{(k)}|)/|\beta_1^{(k)}|, \ldots, P'_\lambda(|\beta_p^{(k)}|)/|\beta_p^{(k)}|\big)$ for $k = 0, 1, 2, \ldots$.
Since the LQA removes the variables with small coefficient magnitudes, once $\beta_j$ is removed from the model it is permanently excluded, and hence the choice of $\varepsilon_0$ significantly affects the degree of sparsity of the solution and the speed of convergence. To relieve this problem, instead of removing variables, a perturbation $\tau_0$ in the denominator can be considered:
$$\beta^{(k+1)} = \operatorname*{argmin}_{\beta}\left\{\frac{1}{2n}\|y - X\beta\|^2 + \frac{1}{2}\sum_{j=1}^{p}\frac{P'_\lambda(|\beta_j^{(k)}|)}{|\beta_j^{(k)}| + \tau_0}\,\beta_j^2\right\}.$$
However, $\tau_0$ plays a similar role to $\varepsilon_0$.
[Figure 2.3: LLA — local linear approximation of the SCAD (left) and MCP (right) penalty functions $P_\lambda(\beta)$.]
The CCCP and the LLA can make up for the LQA's shortcomings, and they can be easily implemented with algorithms for the LASSO. The LLA algorithm is defined as follows. For $k = 1, 2, \ldots$, until convergence, repeat the following optimization:
$$\beta^{(k+1)} = \operatorname*{argmin}_{\beta}\left\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p}P'_\lambda(|\beta_j^{(k)}|)\,|\beta_j|\right\}.$$
The LLA is a first order approximation of the nonconvex penalty function; Figure 2.3 shows the LLA of the SCAD and the MCP. The CCCP decomposes the nonconvex penalty into the LASSO penalty and a concave part, and the concave part is approximated by its tight local linear function. The decompositions
[Figure 2.4: CCCP — convex-concave decompositions of the SCAD (left) and MCP (right) penalty functions $P_\lambda(\beta)$.]
of the nonconvex penalty functions of the SCAD and the MCP are represented in Figure 2.4. The CCCP algorithm iteratively minimizes $Q(\beta\,|\,\beta^{(k)}, \lambda)$ until convergence, where
$$Q(\beta\,|\,\beta^{(k)}, \lambda) = \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p}\nabla J_\lambda(|\beta_j^{(k)}|)\,\beta_j + \lambda\sum_{j=1}^{p}|\beta_j|,$$
with $P_\lambda(|\beta_j|) = J_\lambda(|\beta_j|) + \lambda|\beta_j|$ and $\nabla J_\lambda(t) = \frac{dJ_\lambda(t)}{dt}$.
Since the CCCP and the LLA algorithms use the first order derivative of the nonconvex penalty, a class of nonconvex penalties to which these algorithms apply is defined as $P_\lambda(|t|) = P_{a,\lambda}(|t|)$ on $t \in (-\infty, \infty)$ satisfying
(P1) $P_\lambda(t)$ is increasing and concave for $t \in [0,\infty)$ with continuous derivative on $t \in (0,\infty)$ and $P'_\lambda(0) := P'_\lambda(0+) \ge a_1\lambda$;
(P2) $P'_\lambda(t) \ge a_1\lambda$ for $t \in (0, a_2\lambda)$;
(P3) $P'_\lambda(t) = 0$ for $t > a\lambda > a_2\lambda$;
for some positive constants $a_1$, $a_2$, and $a$.
As shown in Figure 2.5, the SCAD and the MCP satisfy the above conditions, because the derivative of the SCAD penalty is
$$P'_\lambda(t) = \lambda I(t \le \lambda) + \frac{(a\lambda - t)_+}{a-1}\,I(t > \lambda), \quad \text{for some } a > 2, \text{ with } a_1 = a_2 = 1,$$
and the derivative of the MCP is
$$P'_\lambda(t) = \Big(\lambda - \frac{t}{a}\Big)_+, \quad \text{for some } a > 1, \text{ with } a_1 = 1 - a^{-1},\ a_2 = 1.$$
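For reference, the two derivative formulas above translate directly into code; they are the only ingredient the LLA/CCCP-type algorithms need from the penalty. The sketch below uses hypothetical helper names and conventional default values of $a$ (the simulations in Section 2.4 use $a = 2.1$ for the SCAD and $a = 1.5$ for the MCP instead).

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty: P'(t) = lam for t <= lam,
    (a*lam - t)_+ / (a - 1) for t > lam (evaluated at |t|)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def mcp_deriv(t, lam, a=3.0):
    """Derivative of the MCP penalty: P'(t) = (lam - t/a)_+ (evaluated at |t|)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.maximum(lam - t / a, 0.0)
```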
However, due to the nonconvexity of the penalty, multiple minima can occur, and the existing algorithms for nonconvex penalized methods only guarantee convergence to a local minimum, not to the oracle estimator. Although under some conditions the nonconvex penalized methods yield the oracle estimator as the unique minimizer (Kim and Kwon, 2012; Zhang, 2010), the direct computation of the global solution is infeasible in high dimensions, and the computation is especially unstable with respect to the tuning parameter. To deal with these difficulties, one-step algorithms with a good initial
[Figure 2.5: Nonconvex penalties — penalty functions $P_\lambda(\beta)$ (left) and their derivatives $P'_\lambda(\beta)$ (right) for the MCP and the SCAD.]
estimator have been proposed (Zou and Li, 2008; Fan et al., 2012; Wang et al., 2013), and we will call them two stage methods.
2.2.3 Two stage methods
The two stage methods based on the LASSO, such as the one step LLA (Zou and Li, 2008; Fan et al., 2012) and the calibrated CCCP (Wang et al., 2013), have been proposed to obtain the oracle estimator. The main idea of these methods is to obtain the oracle estimator by solving the LASSO problem. Zou and Li (2008) proved that the one step LLA algorithm can obtain the oracle estimator with a good
initial estimator. They suggested using the maximum likelihood estimator as
an initial estimator for n > p. The one step LLA estimator is defined as
$$\hat\beta(\lambda) = \operatorname*{argmin}_{\beta}\left\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p}P'_\lambda(|\beta_j^{init}|)\,|\beta_j|\right\}.$$
This can be reformulated as
$$\hat\beta(\lambda)_{A^c} = \operatorname*{argmin}_{\beta_{A^c}}\left\{\|r_A - X_{A^c}\beta_{A^c}\|^2/2n + \sum_{j\in A^c}P'_\lambda(|\beta_j^{init}|)\,|\beta_j|\right\},$$
$$\hat\beta(\lambda)_A = (X_A^TX_A)^{-1}X_A^T\big(y - X_{A^c}\hat\beta(\lambda)_{A^c}\big),$$
where $A = A(\lambda) = \{j : |\beta_j^{init}| > a\lambda\}$ with $a$ the parameter of the nonconvex penalty, and $r_A = y - X_A\hat\beta(\lambda)_A$.
In the first stage, a good initial estimator which is close to the true coefficient vector should be attained. For a regularization parameter $\lambda$ such that
$$\min\{|\beta_j^*| : \beta_j^* \neq 0,\ j = 1, \ldots, p\} > (a+1)\lambda \quad \text{and} \quad \|\beta^* - \beta^{init}\|_\infty < \lambda,$$
the true signal set and the signal set of the initial estimator are equivalent, i.e.,
$$A = A_0 = \{j : \beta_j^* \neq 0\} \quad \text{and} \quad P'_\lambda(|\beta_j^{init}|) = 0 \ \text{ for } j \in A_0,$$
and hence $\hat\beta(\lambda)_{A_0} = (X_{A_0}^TX_{A_0})^{-1}X_{A_0}^T(y - X_{A_0^c}\hat\beta(\lambda)_{A_0^c})$. Since estimating $\hat\beta(\lambda)$ can be recast as estimating $\hat\beta(\lambda)_{A_0^c}$ and plugging $\hat\beta(\lambda)_{A_0^c}$ into the equation for $\hat\beta(\lambda)_{A_0}$, the oracle estimator is obtained exactly when $\hat\beta(\lambda)_{A_0^c} = 0$. Hence removing the effect of the noise variables is important in the second stage of the two stage methods.
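Putting the two stages together, the procedure is: fit an initial LASSO, compute the adaptive weights $P'_\lambda(|\beta_j^{init}|)$, and solve a weighted LASSO on the penalized set in the second stage. A minimal sketch follows, assuming the scad_deriv helper from the previous code block and scikit-learn's Lasso for both stages; it mirrors the residualized reformulation above and is not the calibrated CCCP implementation of Wang et al. (2013) used in the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, lam, lam_init=None, a=3.7):
    """Two stage (one step LLA) sketch with a LASSO initial estimator."""
    n, p = X.shape
    if lam_init is None:
        lam_init = lam / np.log(n)          # smaller initial tuning parameter
    beta_init = Lasso(alpha=lam_init, fit_intercept=False).fit(X, y).coef_

    w = scad_deriv(np.abs(beta_init), lam, a)   # adaptive weights P'_lambda
    A = w == 0                                  # unpenalized (strong signal) set
    Ac = ~A

    beta = np.zeros(p)
    if Ac.any():
        # residualize on the unpenalized block, then weighted LASSO via rescaling
        XA = X[:, A]
        H = XA @ np.linalg.pinv(XA) if A.any() else np.zeros((n, n))
        r = y - H @ y
        Xt = (X[:, Ac] - H @ X[:, Ac]) / w[Ac]
        bt = Lasso(alpha=1.0, fit_intercept=False).fit(Xt, r).coef_
        beta[Ac] = bt / w[Ac]
    if A.any():
        # OLS refit of the unpenalized coefficients
        beta[A] = np.linalg.lstsq(X[:, A], y - X[:, Ac] @ beta[Ac], rcond=None)[0]
    return beta
```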
Fan et al. (2012) suggested the LASSO estimator with a smaller regularization parameter ($\lambda_{init} \le a\gamma_{LS}s^{-1/2}\lambda/4$) as an initial estimator, with which the oracle estimator can be obtained with high probability, where $s$ is the number of nonzero coefficients, $a$ is the parameter of the nonconvex penalty, and $\gamma_{LS}$ is the restricted eigenvalue defined by
$$\gamma_{LS} = \min_{\substack{\delta \neq 0 \\ \|\delta_{A_0^c}\|_1 \le 3\|\delta_{A_0}\|_1}} \frac{\|X\delta\|}{\sqrt{n}\,\|\delta_{A_0}\|} > 0.$$
Wang et al. (2013) proved that the calibrated CCCP estimator using the
LASSO initial estimator with smaller regularization parameter (λinit = τλ, τ =
o(1), e.g., τ = 1/ log n or λ) can obtain the oracle estimator with high proba-
bility. They remarked on the choice of τ which can be related to the number
of the signal parameter and the restricted eigenvalue. For the sparse and well
behaved design matrix, τ = 1/ log n can be used. If the true model is not very
sparse (s → ∞) or the design matrix does not behave well (γLS → 0), then
τ = λ can be considered.
Tuning the regularization parameter is a crucial issue for obtaining the oracle estimator. Wang et al. (2013) proposed the high dimensional BIC (HBIC) for choosing the regularization parameter of the calibrated CCCP, defined by
$$\mathrm{HBIC}(\lambda) = \log\big(\|y - X\hat\beta(\lambda)\|^2/n\big) + C_n \cdot \frac{\log p}{n} \cdot |M_\lambda|, \qquad (2.1)$$
where $M_\lambda = \{j : \hat\beta_j(\lambda) \neq 0\}$ and $C_n \to \infty$ (e.g., $C_n = \log n$ or $\log(\log n)$).
Furthermore, they proved that the oracle estimator can be attained with high probability, under some regularity conditions, by using the tuning parameter chosen by the HBIC:
$$\Pr\big(M_{\hat\lambda} = \{j : \beta_j^* \neq 0\}\big) \to 1,$$
where $\hat\lambda = \operatorname*{argmin}_{\lambda\in\{\lambda : |M_\lambda| \le K_n\}} \mathrm{HBIC}(\lambda)$. This criterion is an extension of the results of Chen and Chen (2008) and Kim et al. (2012). This selection consistency result can be applied to any method that satisfies the strong oracle property.
Corollary 1. Let $\mathrm{HBIC}(\lambda)$ be defined as in (2.1). Assume that the regularity conditions required for an estimator $\hat\beta(\lambda)$ to be the oracle estimator with probability tending to one hold, and that there exists a positive constant $\gamma$ such that
$$\lim_{n\to\infty}\ \min_{A \neq A_0,\ |A| \le K_n} n^{-1}\big\|(I_n - H_A)X_{A_0}\beta^*_{A_0}\big\|^2 \ge \gamma,$$
where $I_n$ denotes the $n \times n$ identity matrix and $H_A$ denotes the projection matrix onto the linear space spanned by $X_A$. If $C_n \to \infty$, $sC_n\log p/n = o(1)$, and $K_n^2\log p\log n/n = o(1)$, then
$$\Pr\big(M_{\hat\lambda} = \{j : \beta_j^* \neq 0\}\big) \to 1,$$
where $M_\lambda = \{j : \hat\beta_j(\lambda) \neq 0\}$ and $\hat\lambda = \operatorname*{argmin}_{\lambda\in\{\lambda : |M_\lambda| \le K_n\}} \mathrm{HBIC}(\lambda)$.
Corollary 1 is a generalization of Theorem 3.5 of Wang et al. (2013), and it follows directly from the proof of that theorem.
2.3 Two Stage Dantzig Selector
2.3.1 Method
We modify the one step LLA for the nonconvex penalized estimator by replacing the LLA step with an adaptive Dantzig selector type estimator:
$$\min_\beta \sum_{j=1}^{p}P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \qquad (2.2)$$
$$\text{subject to} \quad \Big|\frac{1}{n}X_j^T(y - X\beta)\Big| \le P'_\lambda(|\beta_j^{init}|), \quad j = 1, \ldots, p,$$
where $\beta_j^{init}$ is an initial estimate and $P'_\lambda(t)$ satisfies (P1)-(P3) defined in Section 2.2. We call this estimator the two stage Dantzig selector $\hat\beta^{TSDS}(\lambda)$. The initial estimate $\beta^{init}$ can be the LASSO or the Dantzig selector estimate with tuning parameter $\lambda_{init} = \lambda/\log n$ or $\lambda^2$.
2.3.2 Motivation
In order to achieve the oracle estimator, the key to the two stage method is removing the noise variables in the second stage. For example, consider the one step LLA with the SCAD penalty $P_\lambda$ and initial estimate $\beta^{init}$. Let $y_{n\times 1} = X_{n\times p}\beta^*_{p\times 1} + \varepsilon_{n\times 1}$, and consider a $\lambda$ such that $\min\{|\beta_j^*| : j \in A_0\} > (a+1)\lambda$, where $A_0 = \{j : \beta_j^* \neq 0\}$ and $a$ is the parameter of the SCAD penalty. Suppose that $\|\beta^{init} - \beta^*\|_\infty < \lambda$. Then the one step LLA estimator is defined by
$$\hat\beta(\lambda) = \operatorname*{argmin}_{\beta}\left\{\sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p}P'_\lambda(|\beta_j^{init}|)\,|\beta_j|\right\},$$
or equivalently,
$$\hat\beta(\lambda)_{A_0^c} = \operatorname*{argmin}_{\beta_{A_0^c}}\left\{\|r_{A_0} - X_{A_0^c}\beta_{A_0^c}\|^2/2n + \sum_{j\in A_0^c}\lambda|\beta_j|\right\},$$
$$\hat\beta(\lambda)_{A_0} = (X_{A_0}^TX_{A_0})^{-1}X_{A_0}^T\big(y - X_{A_0^c}\hat\beta(\lambda)_{A_0^c}\big),$$
where $r_{A_0} = y - X_{A_0}\hat\beta(\lambda)_{A_0}$. Hence the aim of the second stage is to delete the noise variables, after which the one step LLA acquires the oracle estimator. We therefore focus on deleting, or lessening the effect of, the noise variables, and some properties of the Dantzig selector suggest its superiority over the LASSO for this purpose.
The ℓ1 norm of the Dantzig selector is always less than or equal to that of the LASSO estimator, because the Dantzig selector is defined as the minimizer of $\|\beta\|_1$ subject to $\|X^T(y - X\beta)\|_\infty \le \lambda$ and the LASSO estimator satisfies this constraint. Recall that the LASSO estimate $\hat\beta^{LS}(\lambda)$ is always in the feasible set of the Dantzig selector $\hat\beta^{DS}(\lambda)$ because of the KKT conditions for the Lagrangian form of the LASSO in Section 2.2.1. If there are no signal variables, then the mean absolute deviation of the Dantzig selector is less than or equal to that of the LASSO for the same regularization parameter $\lambda$: $\|\hat\beta^{DS}(\lambda) - \beta^*\|_1 \le \|\hat\beta^{LS}(\lambda) - \beta^*\|_1$. Furthermore, according to Bickel et al. (2009), the non-asymptotic $\ell_q$ error bounds of the Dantzig selector are smaller than those of the LASSO for $1 \le q \le 2$. If the mean squared error (MSE) of the Dantzig selector is smaller than the MSE of the LASSO in the no-signal setting, then the two stage methods can be improved in terms of MSE by using the Dantzig selector. Hence we conduct a simulation to check whether the MSE of the Dantzig selector is less than the MSE of the LASSO estimator in the no-signal setting.
We simulate whether the ℓ2 error of the Dantzig selector tends to be smaller than that of the LASSO in a no-signal regression setting. Let $y_{100\times 1} = X_{100\times 1000}\beta_{1000\times 1} + \varepsilon_{100\times 1}$, where $\beta = (0, \ldots, 0)^T$, the rows of $X$ are drawn from $N(0, \Sigma)$ with $\Sigma_{ij} = R^{|i-j|}$, and $\varepsilon \sim N(0, I)$. The simulation is conducted as follows:
1. For 20 tuning parameters, fit the LASSO and the Dantzig selector and calculate the MSE;
2. Repeat step 1 100 times and test $H_1: \mathrm{MSE}_{LASSO} > \mathrm{MSE}_{Dantzig}$.
Let $Z = \mathrm{MSE}_{LASSO} - \mathrm{MSE}_{Dantzig}$. In the case of $R = 0.3$, $\bar{Z} = 0.00093$, $\mathrm{sd}(\bar{Z}) = 0.00014$, and the p-value is numerically zero. For $R = 0.5$, $\bar{Z} = 0.00122$, $\mathrm{sd}(\bar{Z}) = 0.00012$, and the p-value is again numerically zero. Even in the case of $R = 0$, $\bar{Z} = 0.0006$ and $\mathrm{sd}(\bar{Z}) = 0.0001$. Therefore we can conclude that the overall MSE of the Dantzig selector tends to be smaller than the MSE of the LASSO for the same tuning parameter. The two stage method with the Dantzig selector can thus improve the estimation efficiency while retaining the global oracle property of the two stage method with the LASSO.
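A compact version of this experiment can be sketched as follows, reusing the dantzig_selector LP helper from Section 2.2.1 and scikit-learn's Lasso; the tuning grid, seed, and paired one-sided t-test are illustrative choices, and the plain LP solver is slow for $p = 1000$.

```python
import numpy as np
from sklearn.linear_model import Lasso
from scipy.stats import ttest_1samp

def mse_gap_experiment(R=0.3, n=100, p=1000, n_rep=100, n_lam=20, seed=0):
    """Paired comparison of MSE(LASSO) - MSE(Dantzig) in the no-signal model."""
    rng = np.random.default_rng(seed)
    cov = R ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(p))
    lams = np.linspace(0.05, 1.0, n_lam)
    gaps = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, p)) @ L.T
        y = rng.standard_normal(n)              # beta* = 0, pure noise
        for lam in lams:
            b_ls = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
            b_ds = dantzig_selector(X, y, lam)
            gaps.append(np.sum(b_ls ** 2) - np.sum(b_ds ** 2))
    gaps = np.asarray(gaps)
    return gaps.mean(), ttest_1samp(gaps, 0.0, alternative="greater").pvalue
```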
The relationship between the Dantzig selector and the LASSO can be ex-
tended to the relationship between adaptive Dantzig selector (Dicker, 2010)
and the adaptive LASSO (Zou, 2006). The adaptive LASSO is defined as
$$\min_\beta\ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\sum_{j=1}^{p}w_j|\beta_j|,$$
with, e.g., $w_j = |\hat\beta_j^{ls}|^{-1}$. Similar to the adaptive LASSO, the adaptive Dantzig selector is defined as
$$\min_\beta \sum_{j=1}^{p}w_j|\beta_j| \quad \text{subject to} \quad \Big|\frac{1}{n}X_j^T(y - X\beta)\Big| \le w_j\lambda, \quad j = 1, \ldots, p.$$
Its form can be derived from the derivative of the objective function of the adaptive LASSO; for details, see the thesis of Dicker (2010), where it is also shown that the adaptive Dantzig selector and the adaptive LASSO have the same asymptotic properties.
The relation between the Dantzig selector and the adaptive Dantzig selector is illustrated in Figure 2.6. The adaptive Dantzig selector can relieve the bias problem of the Dantzig selector and gives a unique solution (Dicker, 2010). We apply this adaptive Dantzig selector with $w_j = P'_\lambda(|\beta_j^{init}|)/\lambda$ in the second stage of the two stage method. The difference between the adaptive Dantzig selector and our proposed method is that in our method the weight depends on the tuning parameter $\lambda$.
[Figure 2.6: Adaptive DS]
2.3.3 Theoretical properties
In this section we prove the global oracle property of the two stage Dantzig
selector under the following regularity conditions:
(A1) The random errors $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ are i.i.d. mean-zero sub-Gaussian($\sigma$) with scale factor $\sigma > 0$, i.e., $E[\exp(t\varepsilon_i)] \le \exp(\sigma^2 t^2/2)$.
(A2) $\eta_{\min}(X_{A_0}^TX_{A_0}) > 0$, where $\eta_{\min}(B)$ is the minimum eigenvalue of $B$.
(A3) The design matrix $X$ satisfies
$$\gamma = \min_{\substack{\theta \neq 0 \\ \|\theta_{A_0^c}\|_1 \le \alpha\|\theta_{A_0}\|_1}} \frac{\|X\theta\|_2}{\sqrt{n}\,\|\theta_{A_0}\|_2} > 0,$$
where $\alpha = 3$ for the LASSO initial estimator and $\alpha = 1$ for the Dantzig selector initial estimator.
The main theorem shows that the two stage Dantzig selector with a good initial estimator is equivalent to the oracle estimator whenever the oracle estimator satisfies the constraint of the two stage Dantzig selector.
Theorem 1. Assume that $\min_{j\in A_0}|\beta_j^*| > (a+1)\lambda$, where $A_0 = \{j : \beta_j^* \neq 0\}$. Let $F_{n0} = \{\|\beta^{init} - \beta^*\|_\infty \le a_0\lambda\}$, where $a_0 = \min\{1, a_2\}$, and $F_{n1} = \{|\frac{1}{n}X_j^T(y - X\hat\beta^{(o)})| \le P'_\lambda(|\beta_j^{init}|)\ \forall j\}$, where $\hat\beta^{(o)}$ is the oracle estimator. Under the event $F_{n0}\cap F_{n1}$, the two stage Dantzig selector is equivalent to the oracle estimator.
Proof of Theorem 1. Under the event $F_{n0}$, $\min_{j\in A_0}|\beta_j^{init}| > a\lambda$ because $\min_{j\in A_0}|\beta_j^*| > (a+1)\lambda$, and hence $P'_\lambda(|\beta_j^{init}|) = 0$ for all $j \in A_0$. Next, $P'_\lambda(|\beta_j^{init}|) > 0$ for all $j \in A_0^c$, because $\max_{j\in A_0^c}|\beta_j^{init}| \le a_0\lambda \le a_2\lambda$. Under the event $F_{n1}$, the oracle estimator is in the feasible set of the two stage Dantzig selector. Therefore, under $F_{n0}\cap F_{n1}$, the minimizer $\hat\beta$ of (2.2) must be the oracle estimator, since $P'_\lambda(|\beta_j^{init}|) = 0$ for $j \in A_0$ and $\hat\beta_j$ for $j \in A_0^c$ can be set to zero.
The following corollaries show that the two stage Dantzig selector satisfies the global oracle property with a LASSO or Dantzig selector initial estimator under the regularity conditions (A1)-(A3). Condition (A1) implies that
$$\Pr(|a^T\varepsilon| > t) \le 2\exp\Big(-\frac{t^2}{2\sigma^2\|a\|^2}\Big)$$
for $t \ge 0$ and $a = (a_1, \ldots, a_n)^T$. Condition (A2) means that the signal covariates are not severely correlated. Condition (A3) is needed for the bound $\|\beta^{init} - \beta^*\|_\infty \le a_0\lambda$ to hold for the LASSO or the Dantzig selector initial estimator.
Corollary 2. Assume that regularity conditions (A1)-(A3) hold. Let the initial estimator be the LASSO estimator $\hat\beta^{LS}(\tau\lambda)$ with regularization parameter $\tau\lambda$.
(i) If $\frac{\min_{j\in A_0}|\beta_j^*|}{a+1} > \lambda > 2\sqrt{2}\,\sigma\sqrt{\frac{M\log p}{n}}\,\frac{1}{\tau}$ and $16\tau\gamma^{-2}\sqrt{s} < a_0$, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{LS}(\tau\lambda), \lambda) = \hat\beta^{(o)}\big) \ge 1 - p_0 - p_1,$$
where $a_0 = \min\{1, a_2\}$, $p_0 = \Pr\big(\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty > 16\tau\lambda\gamma^{-2}\sqrt{s}\big) \le 2p\exp\big(-\frac{n\tau^2\lambda^2}{8M\sigma^2}\big)$, and $p_1 = \Pr(F_{n1}^c) \le 2(p-s)\exp\big(-\frac{na_1^2\lambda^2}{2\sigma^2 M}\big)$ with $M = \max_{j\in A_0^c}\|X_j\|_2^2/n$.
(ii) If $n\tau^2\lambda^2 \to \infty$, $\log p = o(n\tau^2\lambda^2)$, and $16\tau\gamma^{-2}\sqrt{s} < a_0$, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{LS}(\tau\lambda), \lambda) = \hat\beta^{(o)}\big) \to 1 \quad \text{as } n \to \infty.$$
Corollary 3. Assume that regularity conditions (A1)-(A3) hold. Let the initial estimator be the Dantzig selector $\hat\beta^{DS}(\tau\lambda)$ with regularization parameter $\tau\lambda$.
(i) If $\frac{\min_{j\in A_0}|\beta_j^*|}{a+1} > \lambda > \sigma\sqrt{\frac{2M\log p}{n}}\,\frac{1}{\tau}$ and $8\tau\gamma^{-2}\sqrt{s} < a_0$, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{DS}(\tau\lambda), \lambda) = \hat\beta^{(o)}\big) \ge 1 - p_0 - p_1,$$
where $a_0 = \min\{1, a_2\}$, $p_0 = \Pr\big(\|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_\infty > 8\tau\lambda\gamma^{-2}\sqrt{s}\big) \le 2p\exp\big(-\frac{n\tau^2\lambda^2}{2M\sigma^2}\big)$, and $p_1 = \Pr(F_{n1}^c) \le 2(p-s)\exp\big(-\frac{na_1^2\lambda^2}{2\sigma^2 M}\big)$ with $M = \max_{j\in A_0^c}\|X_j\|_2^2/n$.
(ii) If $n\tau^2\lambda^2 \to \infty$, $\log p = o(n\tau^2\lambda^2)$, and $8\tau\gamma^{-2}\sqrt{s} < a_0$, then
$$\Pr\big(\hat\beta^{TSDS}(\hat\beta^{DS}(\tau\lambda), \lambda) = \hat\beta^{(o)}\big) \to 1 \quad \text{as } n \to \infty.$$
Proof of Corollary 2 and Corollary 3. We first prove that the oracle estimator $\hat\beta^{(o)}$ satisfies the constraint of the two stage Dantzig selector with probability at least $1 - 2(p-s)\exp\big(-\frac{na_1^2\lambda^2}{2\sigma^2 M}\big)$. Denote by $\beta_{A_0}$ the $|A_0|$-length sub-vector of $\beta$ containing only the $A_0$ members of $\beta$, and let $H_{A_0} = X_{A_0}(X_{A_0}^TX_{A_0})^{-1}X_{A_0}^T$. Then
$$\Pr(F_{n1}^c) \le \sum_{j\in A_0^c}\Pr\Big(\Big|\frac{1}{n}X_j^T(I - H_{A_0})\varepsilon\Big| > P'_\lambda(|\beta_j^{init}|)\Big) \le \sum_{j\in A_0^c}2\exp\Big(-\frac{nP'_\lambda(|\beta_j^{init}|)^2}{2\sigma^2 M}\Big) \le 2(p-s)\exp\Big(-\frac{na_1^2\lambda^2}{2\sigma^2 M}\Big),$$
because of the assumption that $P'_\lambda(t) \ge a_1\lambda$ for $t \le a_2\lambda$, the regularity condition (A1), and $\|\frac{1}{n}X_j^T(I - H_{A_0})\|_2^2 \le M/n$ for all $j \in A_0^c$.
Given Theorem 1 and the above bound on $p_1$, it only remains to establish the upper bound on the probability $p_0$ related to the initial estimator. We can use the results of Bickel et al. (2009) or Negahban et al. (2012) to obtain an estimation bound for $\beta^{init} - \beta^*$ for the LASSO and the Dantzig selector. Bickel et al. (2009) showed the asymptotic equivalence of the LASSO and the Dantzig selector, giving non-asymptotic $\ell_q$ error bounds under the restricted eigenvalue condition and a normal error assumption. For the Dantzig selector, with probability at least $1 - \exp\big(-\frac{n\tau^2\lambda^2}{2\sigma^2 M}\big)$,
$$\|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_{\ell_2} \le \frac{8\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with $\tau\lambda > \sigma\sqrt{2M\log p/n}$. For the LASSO estimator, with probability at least $1 - \exp\big(-\frac{n\tau^2\lambda^2}{8\sigma^2 M}\big)$,
$$\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_{\ell_2} \le \frac{16\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with $\tau\lambda > \sigma\sqrt{8M\log p/n}$. Corollary 2 of Negahban et al. (2012) showed that for the LASSO estimator, with probability at least $1 - \exp\big(-\frac{n\tau^2\lambda^2}{2\sigma^2 M}\big)$,
$$\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty \le \|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_{\ell_2} \le \frac{2\sqrt{s}\,\tau\lambda}{\gamma^2}$$
with $\tau\lambda \ge 4\sigma\sqrt{M\log p/n}$. From these results, the upper bounds for $\|\hat\beta^{LS}(\tau\lambda) - \beta^*\|_\infty$ and $\|\hat\beta^{DS}(\tau\lambda) - \beta^*\|_\infty$ are verified.
Remark (Comments on the regularity conditions on the design matrix). Define the restricted eigenvalue (RE) condition as follows. A $p \times p$ sample covariance matrix $X^TX/n$ satisfies the RE condition of order $k$ with parameters $(\alpha, \gamma)$ if
$$\frac{1}{n}\theta^TX^TX\theta \ge \gamma\|\theta\|_2^2 \quad \forall\,\theta \in C(B, \alpha),$$
where $C(B,\alpha) = \{\theta \in \mathbb{R}^p : \|\theta_{B^c}\|_1 \le \alpha\|\theta_B\|_1\}$, for all subsets $B \subset \{1, \ldots, p\}$ such that $|B| = k$. The RE condition is a weak and general regularity condition for achieving the optimal $\ell_2$ error bound of order $\sqrt{s\log p/n}$ for $\ell_1$ regularization methods such as the LASSO ($\alpha = 3$) and the Dantzig selector ($\alpha = 1$). A series of works has been concerned with which conditions are necessary to guarantee these optimal error bounds.
The restricted isometry property (RIP) (Candes and Tao, 2005) is defined as follows. $X$ is said to satisfy the $s$-restricted isometry property with restricted isometry constant $\delta_s$ if there exists a constant $\delta_s$ such that, for every $T \subset \{1, \ldots, p\}$ with $|T| \le s$, the $n \times |T|$ submatrix $X_T$ of $X$, and every $u \in \mathbb{R}^{|T|}$,
$$(1 - \delta_s)\|u\|_2^2 \le \|X_Tu\|_2^2 \le (1 + \delta_s)\|u\|_2^2.$$
The $s, s'$-restricted orthogonality constant $\theta_{s,s'}$ for $s + s' \le p$ is defined as the smallest quantity such that
$$|\langle X_Tu, X_{T'}u'\rangle| \le \theta_{s,s'}\,\|u\|_2\|u'\|_2$$
for all $T, T' \subset \{1, \ldots, p\}$ such that $T \cap T' = \emptyset$, $|T| \le s$, and $|T'| \le s'$. $X$ satisfies the uniform uncertainty principle (UUP) (Candes and Tao, 2007) if $\delta_{2s} + \theta_{s,2s} < 1$, which means that for every $s$-sparse set $T$, the columns of the matrix corresponding to $T$ are almost orthogonal.
The RIP and UUP conditions are earlier conditions which are very restrictive. They cover independent variables drawn from Gaussian or Bernoulli distributions (Candes and Tao, 2007), but they cannot deal with substantial dependency. Raskutti et al. (2010) showed that a design matrix whose rows are independently distributed from $N(0, \Sigma)$ satisfies the RE condition with sample size $n = O(s\log p)$ with respect to $\Sigma$. Sample covariance matrices with $\Sigma$ including Toeplitz matrices, the spiked identity model, or highly degenerate covariance matrices satisfy the RE condition (Raskutti et al., 2010). Rudelson and Zhou (2013) extended this to sub-Gaussian designs with substantial dependency.
Bickel et al. (2009) showed that, in the more general setting $X \overset{iid}{\sim} (0, \Sigma)$, if $\phi_{\min}(s\log n) > c/\log n$ then $X^TX/n$ satisfies RE$(\alpha, \gamma)$ of order $s$, where
$$\gamma^2 = \sqrt{\phi_{\min}(s\log n)}\left(1 - c_0\sqrt{\frac{s\,\phi_{\max}(s\log n - s)}{(s\log n - s)\,\phi_{\min}(s\log n)}}\right)$$
and
$$\phi_{\min}(m) = \min_{1 \le \|\theta\|_0 \le m}\frac{\theta^TX^TX\theta}{\|\theta\|_2^2} \quad \text{and} \quad \phi_{\max}(m) = \max_{1 \le \|\theta\|_0 \le m}\frac{\theta^TX^TX\theta}{\|\theta\|_2^2}.$$
The condition $\phi_{\min}(s\log n) > c/\log n$ holds for $s < \sqrt{n}\,\log^{-3/2}n$ (Kim et al., 2012; Greenshtein and Ritov, 2004). Hence, the RE condition can hold for a large class of $\Sigma$ even when the RIP condition fails with probability converging to one. For more discussion of the regularity conditions, see Bickel et al. (2009), Negahban et al. (2012), or Zhang and Zhang (2012).
2.3.4 Algorithm
The two stage Dantzig selector $\hat\beta^{TSDS}(\beta^{init}, \lambda)$ can be computed as follows. Let $A = \{j : |\beta_j^{init}| > a\lambda\}$; then $\hat\beta^{TSDS}_{A^c}(\beta^{init}, \lambda)$ can be calculated by
$$\min_{\beta_{A^c}}\sum_{j\in A^c}P'_\lambda(|\beta_j^{init}|)\,|\beta_j| \quad \text{subject to} \quad \Big|\frac{1}{n}X_j^T(I - H_A)(y - X_{A^c}\beta_{A^c})\Big| \le P'_\lambda(|\beta_j^{init}|) \ \text{ for } j \in A^c,$$
and
$$\hat\beta^{TSDS}_A(\beta^{init}, \lambda) = (X_A^TX_A)^{-1}X_A^T\big(y - X_{A^c}\hat\beta^{TSDS}_{A^c}(\beta^{init}, \lambda)\big).$$
Similarly to the LLA algorithm, set $\tilde\beta = W\beta_{A^c}$ and $\tilde{X} = (I - H_A)X_{A^c}W^{-1}$, where $W$ is the diagonal matrix whose entries are $P'_\lambda(|\beta_j^{init}|)$ for $j \in A^c$. Then the above optimization for $\hat\beta^{TSDS}_{A^c}(\beta^{init}, \lambda)$ becomes the standard Dantzig selector problem
$$\min \|\tilde\beta\|_1 \quad \text{subject to} \quad \Big\|\frac{1}{n}\tilde{X}^T(y - \tilde{X}\tilde\beta)\Big\|_\infty \le 1.$$
Hence, we can use the same algorithms as for the Dantzig selector, such as the generalized primal-dual interior point algorithm (Candes and Romberg, 2005), the Dantzig selector with sequential optimization (DASSO) (James et al., 2009), and the alternating direction method (ADM) (Lu et al., 2012). We briefly review several popular algorithms for the Dantzig selector in the Appendix.
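The reduction above is straightforward to put into code. The following is a minimal sketch, reusing the dantzig_selector LP helper and the scad_deriv weight function from the earlier sketches; the names and the plain LP backend are illustrative, not the thesis's implementation.

```python
import numpy as np

def two_stage_dantzig(X, y, beta_init, lam, a=3.7):
    """Two stage Dantzig selector: weighted Dantzig problem on the penalized
    set A^c after projecting out the unpenalized set A, then OLS refit on A."""
    n, p = X.shape
    w = scad_deriv(np.abs(beta_init), lam, a)   # P'_lambda(|beta_init_j|)
    A = w == 0                                  # unpenalized (strong signal) set
    Ac = ~A
    beta = np.zeros(p)
    if Ac.any():
        XA = X[:, A]
        H = XA @ np.linalg.pinv(XA) if A.any() else np.zeros((n, n))
        Xt = (X[:, Ac] - H @ X[:, Ac]) / w[Ac]   # tilde X = (I - H_A) X_{A^c} W^{-1}
        yt = y - H @ y                           # projected response
        bt = dantzig_selector(Xt, yt, 1.0)       # standard Dantzig problem, lambda = 1
        beta[Ac] = bt / w[Ac]                    # undo the reparametrization
    if A.any():
        beta[A] = np.linalg.lstsq(X[:, A], y - X[:, Ac] @ beta[Ac], rcond=None)[0]
    return beta
```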
2.3.5 Tuning regularization parameter
Recall the HBIC of Wang et al. (2013),
$$\mathrm{HBIC}(\lambda) = \log\big(\|y - X\hat\beta(\lambda)\|^2/n\big) + C_n \cdot \frac{\log p}{n} \cdot |M_\lambda|,$$
where $M_\lambda = \{j : \hat\beta_j(\lambda) \neq 0\}$ and $C_n \to \infty$ (e.g., $C_n = \log n$ or $\log(\log n)$).
According to Corollary 1 in Section 2.2.3, we can use the HBIC to tune the regularization parameter, because our proposed method satisfies the global oracle property.
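For concreteness, a small helper for the HBIC criterion with the $C_n = \log(\log n)$ choice used in the simulations is sketched below; the function and variable names are ours.

```python
import numpy as np

def hbic(X, y, beta, Cn=None):
    """High dimensional BIC of Wang et al. (2013):
    log(||y - X beta||^2 / n) + Cn * (log p / n) * #{j: beta_j != 0}."""
    n, p = X.shape
    if Cn is None:
        Cn = np.log(np.log(n))
    rss = np.sum((y - X @ beta) ** 2)
    df = np.count_nonzero(beta)
    return np.log(rss / n) + Cn * np.log(p) / n * df
```

The tuning parameter is then chosen as the minimizer of this criterion over a grid of $\lambda$ values, restricted to models with at most $K_n$ selected variables.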
2.4 Numerical analyses
In this chapter, we investigate the performance of the proposed two stage Dantzig selector (TSDS). We suppose there are $p$ covariate variables $x_1, \ldots, x_p$. The goal of these numerical studies is to evaluate how well the methods perform in terms of variable selection and estimation accuracy. We consider the linear regression model
$$y = x^T\beta + \varepsilon,$$
where $x = (x_1, \ldots, x_p)^T$ and $\beta \in \mathbb{R}^p$ is a coefficient vector.
We compare the proposed TSDS with other methods, including the LASSO, the adaptive LASSO, the Dantzig selector, the adaptive Dantzig selector, the SCAD, the MCP, and the two stage methods based on the LASSO, with respect to selection and estimation. For the SCAD and the MCP, $a = 2.1$ and $a = 1.5$ are considered, respectively. The two stage methods use the SCAD penalty with $a = 2.1$. Regarding the tuning parameter, five-fold cross validation is used for the LASSO, the adaptive LASSO, the Dantzig selector, and the adaptive Dantzig selector. For the SCAD, the MCP, and the two stage methods, the high-dimensional BIC (HBIC) is used, defined as
$$\mathrm{HBIC} = \log\big(\|y - X\hat\beta\|^2/n\big) + \log(\log n) \cdot \frac{\log p}{n} \cdot |\{j : \hat\beta_j \neq 0\}|.$$
The LASSO and the Dantzig selector are used as initial estimators with tuning parameter $\lambda/\log n$ in the two stage methods. The LARS algorithm is used for the LASSO and the adaptive LASSO, and the primal-dual interior point algorithm is used for the Dantzig selector and the adaptive Dantzig selector. The CCCP algorithm is used for the SCAD and the MCP estimators, and the calibrated CCCP (Wang et al., 2013) is used for the two stage method based on the LASSO.
2.4.1 Simulations
In this section, we consider two simulation settings. For each experimental setting, we replicate the simulation 100 times. We simulate data from the true model
$$y = \sum_{j=1}^{p}X_j\beta_j + \varepsilon, \qquad \varepsilon \sim N(0, 2^2),$$
where $p = 1000$ and the number of observations is $n = 100$. Covariates are generated from the normal distribution with zero mean and the covariance of $x_i$ and $x_j$ equal to $R^{|i-j|}$, $i, j = 1, \ldots, p$. For each simulation setting, we generate data with $R = 0.3$ and $0.5$.
with R = 0.3 and 0.5.
Base on 100 replications, the following statistics are measured for compar-
ision: the average number of falsely estimated non-zero coefficient (FP); the
average number of falsely estimated zero coefficient (FN); the proportion of the
true model exactly identified (TM); MSE=∑100
m=1 ‖β(m)−β∗‖2/100. In the re-
sults of the two stage methods, ”LS+LS”, for example, the first ”LS” refers to
the initial estimator and the last ”LS” refers to the two stage method based on
the LASSO while ”DS+DS” refers to the two stage method based on Dantzig
selector with Dantzig selector initial. The results of total four combinations of
two stage methods using the LASSO and the Dantzig selector are represented
in the following tables.
Example 1. We simulate 100 data sets under the above setting with the true coefficient vector
$$\beta = (\underbrace{3,\ 1.5,\ 0,\ 0,\ 2}_{5},\ \underbrace{0, \ldots, 0}_{p-5})^T.$$
Table 2.1: Example 1 (R=0.3)
Methods FP FN TM MSE
LASSO 25.38(9.019) 0(0) 0 1.040(0.501)
ALASSO 23.11(7.938) 0(0) 0 2.314(0.587)
Dantzig 18.48(10.323) 0(0) 0 1.044(0.530)
ADantzig 15.69(8.284) 0(0) 0 1.859(0.689)
MCP 2.12(1.719) 0.01(0.1) 0.16 0.463(0.479)
SCAD 1.36(1.382) 0.01(0.1) 0.26 0.389(0.606)
LS+LS 1.39(1.550) 0.02(0.141) 0.3 0.405(0.593)
LS+DS 1.39(1.550) 0.02(0.141) 0.3 0.404(0.594)
DS+LS 1.32(1.455) 0.02(0.141) 0.3 0.397(0.574)
DS+DS 1.31(1.461) 0.02(0.141) 0.31 0.394(0.574)
Table 2.2: Example 1 (R=0.5)
Methods FP FN TM MSE
LASSO 24.55(9.632) 0(0) 0 0.926(0.453)
ALASSO 21.83(8.263) 0(0) 0 2.290(0.648)
Dantzig 17.3(9.161) 0(0) 0.01 0.859(0.424)
ADantzig 17.17(9.135) 0(0) 0.01 0.934(0.517)
MCP 1.97(1.598) 0.03(0.171) 0.23 0.643(0.880)
SCAD 1.23(1.270) 0.04(0.197) 0.3 0.780(0.981)
LS+LS 1.27(1.370) 0.03(0.171) 0.33 0.578(0.793)
LS+DS 1.24(1.319) 0.03(0.171) 0.33 0.564(0.793)
DS+LS 1.29(1.387) 0.03(0.171) 0.32 0.555(0.794)
DS+DS 1.24(1.296) 0.03(0.171) 0.32 0.549(0.791)
Example 2. We simulate 100 data sets under the above setting with the true coefficient vector
$$\beta = \big((\underbrace{3,\ 1.5,\ 0,\ 0,\ 2}_{5},\ \underbrace{0, \ldots, 0}_{15}) \times 5,\ \underbrace{0, \ldots, 0}_{p-100}\big)^T,$$
i.e., the block of 20 coefficients is repeated 5 times and the remaining $p - 100$ coefficients are zero.
Table 2.3: Example 2 (R=0.3)
Methods FP FN TM MSE
LASSO 25.96(1.370) 1.03(1.283) 0 18.742(9.205)
ALASSO 20.89(4.479) 1.08(1.398) 0 11.147(8.665)
Dantzig 24.18(3.583) 1.93(1.771) 0 24.685(11.391)
ADantzig 23.9(4.135) 1.94(1.802) 0 15.555(11.128)
MCP 18(8.957) 1.95(3.468) 0.01 21.891(34.190)
SCAD 4.58(6.240) 1.28(2.958) 0.09 11.336(25.522)
LS+LS 6.89(9.331) 0.71(1.866) 0.05 7.736(9.773)
LS+DS 5.24(5.826) 0.71(1.903) 0.03 7.558(9.925)
DS+LS 6.74(8.73) 0.69(1.846) 0.06 7.648(9.972)
DS+DS 4.67(5.924) 0.52(1.573) 0.15 7.055(8.234)
Table 2.4: Example 2 (R=0.5)
Methods FP FN TM MSE
LASSO 25.32(0.827) 0.32(0.827) 0 10.676(6.274)
ALASSO 19.89(4.325) 0.34(0.831) 0 6.656(5.120)
Dantzig 24.29(1.546) 0.6(0.974) 0 14.284(7.890)
ADantzig 22.76(4.656) 0.61(0.984) 0 6.912(6.382)
MCP 4.33(4.551) 2.38(2.534) 0.09 14.508(19.203)
SCAD 3.35(3.056) 1.89(2.287) 0.05 12.936(14.131)
LS+LS 4.56(6.609) 0.57(1.358) 0.07 5.467(6.115)
LS+DS 3.52(2.921) 0.58(1.365) 0.08 5.339(6.158)
DS+LS 3.97(4.239) 0.58(1.387) 0.07 5.078(6.161)
DS+DS 3.44(4.613) 0.42(1.249) 0.14 4.902(5.746)
2.4.2 Real data analysis
We analyze the data set of Scheetz et al. (2006) containing 18,976 gene expression levels from 120 rats. The objective of this analysis is to find the genes correlated with the gene TRIM32, which is known to cause Bardet-Biedl syndrome. Many previous works (Huang et al., 2008b; Kim et al., 2008; Wang et al., 2013) analyzed this data set. Following these papers, we first select the 3,000 genes with the largest variance in expression level and then select, among them, the 1,000 genes most correlated with TRIM32. With this data set, we focus on the comparison between the two stage methods, because the comparison between the two stage method with the LASSO and the other methods has already been done by Wang et al. (2013), and assessing the improvement of the TSDS over the previous two stage methods is our main interest. The results are given in Table 2.5.
Table 2.5: Real Data (TRIM)
Methods #{j : β̂j ≠ 0} PE
LS+LS 11.37 0.813
LS+DS 10.47 0.806
DS+LS 8.74 0.857
DS+DS 8.41 0.83
2.5 Conclusion
In this chapter, we propose a two stage method based on the Dantzig selector, which we call the two stage Dantzig selector. We prove that the two stage Dantzig selector can obtain the oracle estimator under regularity conditions. The proposed method can be easily implemented with general algorithms for the standard Dantzig selector. The numerical results support our contention that the Dantzig selector used in our method can improve variable selection and estimation by lessening the effects of the noise variables more efficiently than the LASSO. Furthermore, the numerical results show that our proposed method outperforms other sparse regularization methods with respect to variable selection and estimation.
Chapter 3
Two Stage Methods for
Precision Matrix Estimation
3.1 Introduction
Precision matrix (inverse covariance matrix) estimation is an important problem in high dimensional statistical analysis and is useful for various applications such as the Gaussian graphical model, gene classification, optimal portfolio allocation, and speech recognition. Under the normality assumption, suppose $X = (X_1, \ldots, X_p) \sim N(\mu, \Sigma)$; then the zero elements of the precision matrix $\Omega = (\omega_{ij})_{p\times p}$ imply conditional independences between variables, that is, $\omega_{ij} = 0$ if and only if $X_i$ and $X_j$ are independent given $X\setminus\{X_i, X_j\}$. Therefore the support of the precision matrix is related to the structure of the undirected Gaussian graph $G = (V, E)$ with vertex set $V = \{X_1, \ldots, X_p\}$ and edge set $E$ with $E^c = \{(i,j) : X_i \perp\!\!\!\perp X_j \mid X\setminus\{X_i, X_j\}\}$, where $\perp\!\!\!\perp$ denotes independence. In the high
dimensional setting, classical methods such as Gaussian graphical model and
inverse of sample covariance matrix cannot provide stable estimate of preci-
sion matrix and additional restrictions should be imposed to get stable and
accurate precision matrix estimation. Hence many regularized methods for pre-
cision matrix estimation are developed based on the relationship between pre-
cision matrix and the Gaussian graphical model. There are two frameworks in
those regularized methods which are regression based approach and maximum
likelihood approach. Meinshausen and Buhlmann (2006) introduced penalized
neighborhood regression model with LASSO penalty. They fitted each variable
on its neighborhood with LASSO penalty and aggregated the results. Peng
et al. (2009) proposed joint neighborhood LASSO selection method which si-
multaneously performed neighborhood selection for all variables. Yuan (2010)
adopted the Dantzig selector for the regression based approach and established its convergence rate. Yuan and Lin (2007) proposed a penalized maximum likelihood method with the LASSO penalty, and Friedman et al. (2008) introduced an efficient algorithm called the graphical LASSO (glasso) for the penalized maximum likelihood method with the LASSO, using a blockwise coordinate descent algorithm (Banerjee
et al., 2008). Fan et al. (2009) dealt with the bias problem of the LASSO penal-
ization and proposed new penalized likelihood methods with adaptive LASSO
and SCAD penalty and the convergence rates of non-convex penalized methods
are shown in Lam and Fan (2009). Cai et al. (2011) proposed a constrained `1
minimization method called CLIME and showed its convergence rates under
various matrix norms.
Most of the existing sparse precision matrix estimators which use ℓ1 regularization, including the LASSO or the Dantzig selector, suffer from selection inconsistency and biased estimation. Although penalized likelihood estimation with the SCAD penalty can achieve selection consistency and an unbiased estimator, it takes considerable time to converge to a local minimizer, and it cannot guarantee that the local minimizer is the oracle estimator. In this chapter, we especially focus on selection, i.e., the correct recovery of the support of the precision matrix. We propose two stage methods based on the LASSO or the Dantzig selector which can correctly recover the support of the precision matrix with high probability under some regularity conditions.
3.2 Estimation of precision matrix via columnwise two-stage methods
Suppose $(X_1, \ldots, X_p)$ are jointly generated with mean $\mu = (\mu_1, \ldots, \mu_p)'$ and covariance matrix $\Sigma^*$. It is well known (e.g., Lemma 1 of Peng et al. (2009)) that if, for $i = 1, \ldots, p$, we write
$$X_i = \mu_i + \sum_{j\neq i}\beta_{ij}^*X_j + \varepsilon_i,$$
then $(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)$ and $\varepsilon_i$ are uncorrelated if and only if $\beta_{ij}^* = -\omega_{ij}^*/\omega_{ii}^*$, where $\Sigma^{*-1} = \Omega^* = (\omega_{ij}^*)$ is the precision matrix. Furthermore, with those $\beta_{ij}^*$, $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = \omega_{ij}^*/(\omega_{ii}^*\omega_{jj}^*)$ and $\mathrm{var}(\varepsilon_i) = 1/\omega_{ii}^*$. Under the normality assumption, the aforementioned uncorrelatedness can be replaced by independence. This regression based approach has been used in various methods, including Meinshausen and Buhlmann (2006), Peng et al. (2009), and Yuan (2010), with the LASSO or the Dantzig selector. We use this relationship to estimate a sparse precision matrix via two stage regression methods based on the LASSO estimator, such as the calibrated CCCP (Wang et al., 2013) and the one step LLA (Fan et al., 2012), or based on the Dantzig selector, called the two stage Dantzig selector.
3.2.1 Two stage method based on LASSO
We first briefly introduce the one step LLA (Zou and Li, 2008; Fan et al., 2012) as a two stage method based on the LASSO, and then we apply the one step LLA to estimate the precision matrix. Consider a penalized regression problem,
$$\min_{\beta\in\mathbb{R}^p}\left\{\frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p}P_\lambda(|\beta_j|)\right\},$$
where $y$ is the response vector, $X = (X_1, \ldots, X_p)$ is an $n \times p$ covariate matrix with $X_i = (X_{1i}, \ldots, X_{ni})^T$, $\beta = (\beta_1, \ldots, \beta_p)^T$ is the vector of regression coefficients, $\|\cdot\|$ is the $L_2$ norm, and $P_\lambda(\cdot)$ is a penalty function with tuning parameter $\lambda$. We consider a class of nonconvex penalty functions $P_\lambda = P_{\lambda,a}$ satisfying
(P1) $P_\lambda(t)$ is increasing and concave for $t \in [0,\infty)$ with continuous derivative on $t \in (0,\infty)$ and $P'_\lambda(0) := P'_\lambda(0+) \ge a_1\lambda$;
(P2) $P'_\lambda(t) \ge a_1\lambda$ for $t \in (0, a_2\lambda)$;
(P3) $P'_\lambda(t) = 0$ for $t > a\lambda > a_2\lambda$;
for some positive constants $a_1$, $a_2$, and $a$.
The SCAD and the MCP penalties satisfy the above conditions, with $a_1 = 1$ for the SCAD and $a_1 = 1 - a^{-1}$ for the MCP. The derivative of the SCAD penalty is defined by
$$P'_\lambda(t) = \lambda\left\{I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\,I(t > \lambda)\right\}, \quad \text{for some } a > 2,$$
and the derivative of the MCP penalty is defined by
$$P'_\lambda(t) = \Big(\lambda - \frac{t}{a}\Big)_+, \quad \text{for some } a > 1.$$
The one step LLA is defined as follows:
$$\hat\beta(\beta^{init}, \lambda) = \operatorname*{argmin}_{\beta}\ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2/2n + \sum_{j=1}^{p}P'_\lambda(|\beta^{init}_j|)|\beta_j|. \qquad (3.1)$$
Then the equation (3.1) can be recast as
$$\hat\beta(\beta^{init}, \lambda)_{A^c} = \operatorname*{argmin}_{\beta_{A^c}}\ \|(I - H_A)(y - X_{A^c}\beta_{A^c})\|^2/2n + \sum_{j \in A^c}P'_\lambda(|\beta^{init}_j|)|\beta_j|,$$
$$\hat\beta(\beta^{init}, \lambda)_A = (X_A^TX_A)^{-1}X_A^T(y - X_{A^c}\hat\beta(\beta^{init}, \lambda)_{A^c}),$$
where $A = A(\beta^{init}, \lambda) = \{j : |\beta^{init}_j| > a\lambda\}$ with the parameter $a$ of the nonconvex penalty, and $H_A = X_A(X_A^TX_A)^{-1}X_A^T$. Let $\tilde y = (I - H_A)y$, $\tilde X = (I - H_A)X_{A^c}W^{-1}$ and $\tilde\beta = W\beta_{A^c}$, where $W = \mathrm{diag}(P'_\lambda(|\beta^{init}_j|))_{j \in A^c}$; then the equation (3.1) can be recast as the LASSO problem with respect to $\tilde y$, $\tilde X$, $\tilde\beta$ and tuning parameter 1. Hence the algorithms for the LASSO can be used for the osLLA.
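A minimal sketch of this recast is given below, assuming a generic penalty-derivative function (e.g. the scad_deriv sketch above) and using scikit-learn's LASSO solver for the reweighted problem with tuning parameter 1. It illustrates the two steps (penalized fit on $A^c$, least squares refit on $A$) under the simplifying assumption that the weights $P'_\lambda(|\beta^{init}_j|)$ are strictly positive on $A^c$; it is not the authors' implementation.

```python
# Illustrative one step LLA via the weighted-LASSO recast of (3.1).
import numpy as np
from sklearn.linear_model import Lasso

def one_step_lla(X, y, beta_init, lam, pen_deriv, a=3.7):
    n, p = X.shape
    w = np.asarray(pen_deriv(np.abs(beta_init), lam, a), dtype=float)
    A = np.where(np.abs(beta_init) > a * lam)[0]            # unpenalized set
    Ac = np.setdiff1d(np.arange(p), A)
    XA, XAc = X[:, A], X[:, Ac]
    H = XA @ np.linalg.pinv(XA) if A.size else np.zeros((n, n))   # H_A
    y_t = y - H @ y                                          # (I - H_A) y
    X_t = (XAc - H @ XAc) / w[Ac]                            # (I - H_A) X_{A^c} W^{-1}
    fit = Lasso(alpha=1.0, fit_intercept=False).fit(X_t, y_t)
    beta = np.zeros(p)
    beta[Ac] = fit.coef_ / w[Ac]                             # back-transform W^{-1} beta~
    if A.size:
        beta[A] = np.linalg.lstsq(XA, y - XAc @ beta[Ac], rcond=None)[0]
    return beta
```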
For a good initial estimate which satisfies $\|\beta^{init} - \beta^*\|_\infty < \min(a_2, 1)\cdot\lambda$, the oracle estimator $\hat\beta^{(o)}$ can be obtained via the two stage method based on the LASSO with high probability, where $\hat\beta^{(o)} = (X_{A_0}^TX_{A_0})^{-1}X_{A_0}^Ty$ with $A_0 = \{j : \beta^*_j \neq 0\}$.
Now we apply the one step LLA estimator to precision matrix estimation. Let the true precision matrix be $\Omega^* = \Sigma^{*-1}$, and let $X^{(1)}, \ldots, X^{(n)}$ be independent and identically distributed samples from $N_p(\mu, \Sigma^*)$. For $i = 1, \ldots, p$, denote the $i$th column of $\Omega$ without $\Omega_{ii}$ by $\Omega_{-ii} = (\omega_{1i}, \ldots, \omega_{i-1,i}, \omega_{i+1,i}, \ldots, \omega_{pi})^T$ and
$$\beta_{(i)} = (\beta_{1(i)}, \ldots, \beta_{i-1(i)}, \beta_{i+1(i)}, \ldots, \beta_{p(i)})^T = \left(-\frac{\omega_{1i}}{\omega_{ii}}, \ldots, -\frac{\omega_{i-1,i}}{\omega_{ii}}, -\frac{\omega_{i+1,i}}{\omega_{ii}}, \ldots, -\frac{\omega_{pi}}{\omega_{ii}}\right)^T.$$
Let $Z_i = \mathbf{X}_i - \bar{\mathbf{X}}_i$, where $\mathbf{X}_i = (X_{1i}, \ldots, X_{ni})^T$ and $\bar{\mathbf{X}}_i = (\bar X_i, \ldots, \bar X_i)^T$ with $\bar X_i = \sum_{j=1}^{n}X^{(j)}_i/n$. Denote the sample covariance $S = Z^TZ/n$, where $Z_{n\times p} = (Z_1, \ldots, Z_p)$, and let $Z_{-i} = (Z_1, Z_2, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_p)$. For an initial estimate $\Omega^{init}$ and a vector of tuning parameters $\lambda = (\lambda_1, \ldots, \lambda_p)^T$, our proposed one step LLA (osLLA) estimator $\hat\Omega^{osLLA}(\Omega^{init}, \lambda) = (\hat\omega^{osLLA}_{ij}(\omega^{init}_{ij}, \lambda_j))_{1\leq i,j\leq p}$ is defined as follows. First, conduct the regression columnwise to estimate $\Omega_{-ii}$, $i = 1, \ldots, p$: for $i = 1, \ldots, p$,
$$\hat\Omega_{ii} = \hat\Omega_{ii}(\Omega^{init}, \lambda_i) = 1/\left(\|Z_i - Z_{-i}\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i)\|^2/n\right),$$
$$\hat\Omega_{-ii} = \hat\Omega_{-ii}(\Omega^{init}, \lambda_i) = -\hat\beta_{(i)}\hat\Omega_{ii}(\Omega^{init}, \lambda_i),$$
where
$$\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \operatorname*{argmin}\ \frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j\neq i}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|. \qquad (3.2)$$
$\hat\Omega$ is calculated by solving the $p$ independent optimization problems defined in (3.2). Second, for symmetry, our final osLLA estimator $\hat\Omega^{osLLA}(\Omega^{init}, \lambda)$ is defined as
$$\hat\omega^{osLLA}_{ij}(\omega^{init}_{ij}, \lambda) = \hat\omega_{ij}I(|\hat\omega_{ij}| \leq |\hat\omega_{ji}|) + \hat\omega_{ji}I(|\hat\omega_{ij}| > |\hat\omega_{ji}|).$$
The initial estimate $\Omega^{init}$ can be the CLIME estimator (Cai et al., 2011) or the graphical lasso estimator (Yuan and Lin, 2007). The columnwise LASSO or Dantzig selector with tuning parameter $\lambda^{init}_i = \lambda_i/\log n$ (Wang et al., 2013) can also be considered as an initial estimate.
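The columnwise construction and the symmetrization step can be summarized in a short sketch. Here two_stage_fit stands for any columnwise solver (osLLA or TSDS); it is an assumed interface for illustration, not code from the thesis.

```python
# Columnwise precision matrix estimation with symmetrization by smaller magnitude.
import numpy as np

def precision_from_columns(Z, two_stage_fit, lambdas):
    n, p = Z.shape
    Omega = np.zeros((p, p))
    for i in range(p):
        idx = [j for j in range(p) if j != i]
        beta_i = two_stage_fit(Z[:, idx], Z[:, i], lambdas[i])   # length p-1
        resid = Z[:, i] - Z[:, idx] @ beta_i
        omega_ii = 1.0 / (np.sum(resid ** 2) / n)                # 1 / (RSS/n)
        Omega[i, i] = omega_ii
        Omega[idx, i] = -omega_ii * beta_i                       # Omega_{-i,i}
    # symmetrize: keep the entry with the smaller absolute value
    return np.where(np.abs(Omega) <= np.abs(Omega.T), Omega, Omega.T)
```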
3.2.2 Two stage Dantzig selector
The two stage Dantzig selector (TSDS) is a Dantzig selector type modification of the LLA for nonconvex penalized methods. The TSDS for regression is defined as a solution of the following problem:
$$\min\ \sum_{j=1}^{p}P'_\lambda(|\beta^{init}_j|)|\beta_j| \quad\text{subject to}\quad \left|\frac{1}{n}\mathbf{X}_j^T(y - X\beta)\right| \leq P'_\lambda(|\beta^{init}_j|),\ j = 1, \ldots, p,$$
where $\mathbf{X}_j = (X_{1j}, \ldots, X_{nj})^T$ and $\beta^{init}$ is an initial estimate.

Similar to the osLLA algorithm, it can be recast as follows.

1. Set $\tilde\beta = W\beta_{A^c}$ and $\tilde X = X_{A^c}W^{-1}$, where $A$ and $W$ are defined as in Subsection 3.2.1, and compute
$$\hat{\tilde\beta} = \operatorname*{argmin}_{\tilde\beta}\left\{\|\tilde\beta\|_1 : \left\|\frac{1}{n}\tilde X^T(y - \tilde X\tilde\beta)\right\|_\infty \leq 1\right\}.$$

2. Let $\hat\beta_{A^c} = W^{-1}\hat{\tilde\beta}$ and $\hat\beta_A = (X_A^TX_A)^{-1}X_A^T(y - X_{A^c}\hat\beta_{A^c})$.

The same algorithms for the Dantzig selector, including the generalized primal-dual interior point algorithm (Candes and Romberg, 2005), DASSO (James et al., 2009), and the alternating direction method (Lu et al., 2012), can be used as well.
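As a simple alternative to these specialized algorithms, the weighted Dantzig selector problem displayed above can also be solved directly as a linear program. The sketch below uses scipy's generic LP solver and splits $\beta$ into positive and negative parts; it is purely illustrative and not the authors' code.

```python
# Weighted Dantzig selector as a linear program: weights w_j = P'_lambda(|beta_init_j|).
import numpy as np
from scipy.optimize import linprog

def weighted_dantzig(X, y, w):
    n, p = X.shape
    G = X.T @ X / n
    c = X.T @ y / n
    obj = np.concatenate([w, w])                  # sum_j w_j (b_pos_j + b_neg_j)
    # |c - G(b_pos - b_neg)| <= w, written as two sets of linear inequalities
    A_ub = np.vstack([np.hstack([ G, -G]),
                      np.hstack([-G,  G])])
    b_ub = np.concatenate([w + c, w - c])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    b = res.x
    return b[:p] - b[p:]
```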
We now define our TSDS estimator for precision matrix estimation; it is similar to the osLLA estimator for precision matrix estimation. Let the columnwise TSDS estimator $\hat\beta^{TSDS}_{(i)} = \hat\beta^{TSDS}_{(i)}(\Omega^{init}, \lambda_i)$ be the solution of
$$\min_{\beta_{(i)}}\ \sum_{j\neq i}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}| \quad\text{subject to}\quad \left|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\beta_{(i)})\right| \leq P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right),\ \forall j\neq i, \qquad (3.3)$$
and let
$$\hat\Omega_{ii} = \left(S_{ii} - 2(\hat\beta^{TSDS}_{(i)})^TS_{-i,i} + (\hat\beta^{TSDS}_{(i)})^TS_{-i,-i}\hat\beta^{TSDS}_{(i)}\right)^{-1}, \qquad \hat\Omega_{-i,i} = -\hat\Omega_{ii}\hat\beta^{TSDS}_{(i)}.$$
To impose symmetry on the estimated precision matrix, we set $\hat\Omega^{TSDS} = (\hat\omega^{TSDS}_{ij})_{1\leq i,j\leq p}$ such that
$$\hat\omega^{TSDS}_{ij} = \hat\omega^{TSDS}_{ji} = \hat\omega_{ij}I(|\hat\omega_{ij}| \leq |\hat\omega_{ji}|) + \hat\omega_{ji}I(|\hat\omega_{ij}| > |\hat\omega_{ji}|).$$
3.2.3 Theoretical results

We prove the selection consistency of our proposed estimators. First, we define the columnwise oracle estimator $\hat\Omega^{(o)}$ of the precision matrix as follows:
$$\hat\Omega_{ii} = \left(S_{ii} - 2(\hat\beta^{(o)}_{(i)})^TS_{-i,i} + (\hat\beta^{(o)}_{(i)})^TS_{-i,-i}\hat\beta^{(o)}_{(i)}\right)^{-1}, \qquad \hat\Omega_{-i,i} = -\hat\Omega_{ii}\hat\beta^{(o)}_{(i)},$$
where $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$, defined by $\hat\beta^{(o)}_{A_{0i}(i)} = (Z_{A_{0i}}^TZ_{A_{0i}})^{-1}Z_{A_{0i}}^TZ_i$ and $\hat\beta^{(o)}_{j(i)} = 0$ for $j\in A^c_{0i}$, with $A_{0i} = \{j : \omega^*_{ij}\neq 0,\ j\neq i\}$. For symmetry of the columnwise oracle precision matrix $\hat\Omega^{(o)} = (\hat\omega^{(o)}_{ij})$, we take $\hat\omega^{(o)}_{ij} = \hat\omega^{(o)}_{ji} = \hat\omega_{ij}I(|\hat\omega_{ij}| \leq |\hat\omega_{ji}|) + \hat\omega_{ji}I(|\hat\omega_{ij}| > |\hat\omega_{ji}|)$.

Proposition 1. The columnwise oracle estimator $\hat\Omega^{(o)}$ is selection consistent and is an elementwise $\sqrt{n}$-consistent estimator of $\Omega^*$ for the nonzero elements.

Proof of Proposition 1. By definition, the columnwise oracle estimator is selection consistent. Since we assume that $X^{(1)}, \ldots, X^{(n)} \sim N(\mu, \Sigma^*)$ with $\Omega^* = \Sigma^{*-1}$, we have $\Omega^*_{-i,i} = -\Omega^*_{i,i}\beta^*_{(i)}$ and $\Omega^*_{i,i} = 1/\mathrm{var}(\epsilon_i)$. For a sparse $\Omega^*$, we can assume that there exists a positive constant $d$ such that the degree of $\Omega^*$ satisfies $\max_{i=1,\ldots,p}|A_{0i}| < d$. For each $i$,
$$\sqrt{n}\left(\hat\beta^{(o)}_{A_{0i}(i)} - \beta^*_{A_{0i}(i)}\right) \to N\!\left(0,\ \mathrm{var}(\epsilon_i)\left(E(Z_{A_{0i}}^TZ_{A_{0i}})\right)^{-1}\right).$$
Since $\widehat{\mathrm{var}}(\epsilon_i) = \frac{1}{n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 \to_p \mathrm{var}(\epsilon_i)$, the continuous mapping theorem gives $\hat\Omega_{ii} = \frac{1}{\frac{1}{n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2} \to_p \Omega^*_{ii}$. Then
$$\sqrt{n}\left(\hat\Omega_{A_{0i},i} - \Omega^*_{A_{0i},i}\right) = \sqrt{n}\left(-\hat\Omega_{i,i}\hat\beta^{(o)}_{A_{0i}(i)} + \Omega^*_{i,i}\beta^*_{A_{0i}(i)}\right) = \sqrt{n}\left(\Omega^*_{i,i}\left(\beta^*_{A_{0i}(i)} - \hat\beta^{(o)}_{A_{0i}(i)}\right) + o_p(1)\cdot O_p(1/\sqrt{n})\right) \to N\!\left(0,\ \Omega^*_{i,i}\left(E(Z_{A_{0i}}^TZ_{A_{0i}})\right)^{-1}\right).$$
Since the columnwise oracle estimator $\hat\omega^{(o)}_{ij}$ is $\hat\omega_{ij}$ or $\hat\omega_{ji}$, whichever has the smaller absolute value, $\hat\omega^{(o)}_{ij}$ is also a $\sqrt{n}$-consistent estimator of $\omega^*_{ij} = \omega^*_{ji}$ for $j\in A_{0i}$.
We now specify the regularity conditions.

(A1) Sparse model: $\Omega^* \in \mathcal{M}_1(L, \nu_0, d)$, where
$$\mathcal{M}_1(L, \nu_0, d) = \left\{A \succ 0 : \|A\|_1 < L,\ \nu_0^{-1} < \phi_{\min}(A) < \phi_{\max}(A) < \nu_0,\ \deg(A) < d\right\},$$
with $L, \nu_0 > 1$, $\|A\|_1 = \max_j\sum_{i=1}^{p}|a_{ij}|$, and $\deg(A) = \max_i\sum_jI(A_{ij}\neq 0)$.

(A2) $\eta_{\min}(Z_{A_{0i}}^TZ_{A_{0i}}) > 0$ for all $i$, where $\eta_{\min}(B)$ is the minimum eigenvalue of $B$.
The following theorems show that the osLLA estimator and the TSDS estimator are equivalent to the columnwise oracle estimator with high probability.

Theorem 1. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Denote the vector of tuning parameters for the columns by $\lambda = (\lambda_1, \ldots, \lambda_p)^T$. For an initial estimate $\Omega^{init}$, let
$$F_{n0} = \left\{\max_{j\neq i}\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| < a_0\lambda_i,\ i = 1, \ldots, p\right\}, \quad\text{where } a_0 = \min(a_2, 1),$$
and
$$F_{n1} = \bigcap_{i=1}^{p}\left\{\left|\frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\right| \leq P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right),\ \forall j\neq i\right\},$$
where $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$. Under the event $F_{n0}\cap F_{n1}$, the osLLA estimator $\hat\Omega^{osLLA}(\Omega^{init}, \lambda)$ is equivalent to the columnwise oracle estimator $\hat\Omega^{(o)}$ of $\Omega^*$.
Proof of Theorem 1. We can directly apply the theoretical result of the osLLA for regression. Under the event $F_{n0}$, $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) = 0$ for $j\in A_{0i}$ and $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) \geq a_1\lambda_i$ for $j\in A^c_{0i}$. Hence for each $i$,
$$\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \operatorname*{argmin}\ \frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j\in A^c_{0i}}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|.$$
By the convexity of $\|Z_i - Z_{-i}\beta_{(i)}\|^2$,
$$\|Z_i - Z_{-i}\beta_{(i)}\|^2 \geq \|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 - 2\sum_jZ_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})(\beta_{j(i)} - \hat\beta^{(o)}_{j(i)}) = \|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 - 2\sum_{j\in A^c_{0i}}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\beta_{j(i)}.$$
Under the event $F_{n1}$, for each $i$,
$$\left\{\frac{1}{2n}\|Z_i - Z_{-i}\beta_{(i)}\|^2 + \sum_{j\in A^c_{0i}}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\beta_{j(i)}|\right\} - \left\{\frac{1}{2n}\|Z_i - Z_{-i}\hat\beta^{(o)}_{(i)}\|^2 + \sum_{j\in A^c_{0i}}P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right)|\hat\beta^{(o)}_{j(i)}|\right\}$$
$$\geq \sum_{j\in A^c_{0i}}\left\{P'_{\lambda_i}\!\left(\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}}\right|\right) - \frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\cdot\mathrm{sign}(\beta_{j(i)})\right\}|\beta_{j(i)}| \geq \sum_{j\in A^c_{0i}}\left\{a_1\lambda_i - \frac{1}{n}Z_j^T(Z_i - Z_{-i}\hat\beta^{(o)}_{(i)})\cdot\mathrm{sign}(\beta_{j(i)})\right\}|\beta_{j(i)}| \geq 0.$$
The equality holds only if $\beta_{j(i)} = 0$ for all $j\in A^c_{0i}$, and the oracle estimator $\hat\beta^{(o)}_{(i)}$ is the minimizer of $\|Z_i - Z_{-i}\beta_{(i)}\|^2$ among such $\beta_{(i)}$. Hence $\hat\beta^{osLLA}_{(i)}(\Omega^{init}, \lambda_i) = \hat\beta^{(o)}_{(i)}$ for each $i$, and therefore $\hat\Omega^{osLLA}(\Omega^{init}, \lambda) = \hat\Omega^{(o)}$.
Theorem 2. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold. Denote the vector of tuning parameters for the columns by $\lambda = (\lambda_1, \ldots, \lambda_p)^T$. For an initial estimate $\Omega^{init}$, define the events $F_{n0}$ and $F_{n1}$ as in Theorem 1, where $a_0 = \min(a_2, 1)$ and $\hat\beta^{(o)}_{(i)}$ is the oracle estimator of $\beta^*_{(i)}$. Under the event $F_{n0}\cap F_{n1}$, the TSDS estimator $\hat\Omega^{TSDS}(\Omega^{init}, \lambda)$ is equivalent to the columnwise oracle estimator $\hat\Omega^{(o)}$ of $\Omega^*$.
Proof of Theorem 2. Under the event $F_{n0}$, $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) = 0$ for $j\in A_{0i}$ and $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) \geq a_1\lambda_i$ for $j\in A^c_{0i}$. Under the event $F_{n1}$, $\hat\beta^{(o)}_{(i)}$ is in the feasible set of the two stage Dantzig selector. Under the event $F_{n0}\cap F_{n1}$, the minimizer $\hat\beta^{TSDS}_{(i)}$ of (3.3) must be the oracle estimator $\hat\beta^{(o)}_{(i)}$, because $P'_{\lambda_i}\!\left(\left|\omega^{init}_{ji}/\omega^{init}_{ii}\right|\right) = 0$ for $j\in A_{0i}$ and $\beta_{j(i)}$ for $j\in A^c_{0i}$ can be set to zero.
Recall $A_{0i} = \{j : \omega^*_{ji}\neq 0,\ j\neq i\}$ and let $s_i = |A_{0i}|$. The following corollaries assert that the CLIME estimator (Cai et al., 2011) can be a good initial estimator with which the two stage methods achieve the columnwise oracle estimator. Cai et al. (2011) showed that $\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty \leq 4L\lambda^{clime}$ with probability at least $1 - \Pr\left(\max_{ij}|S_{ij} - \Sigma^*_{ij}| > \lambda^{clime}/L\right)$, where $L = \|\Omega^*\|_1 = \max_j\sum_{i=1}^{p}|\omega^*_{ij}|$ and $\lambda^{clime} = C_0L\sqrt{\log p/n}$. We can use a large deviation result such as Lemma 3 of Bickel and Levina (2008) and Fan et al. (2012): under the regularity condition (A1),
$$\Pr(|S_{ij} - \Sigma^*_{ij}| \geq \delta) \leq C_1\exp(-C_2n\delta^2),$$
where $C_1$ and $C_2$ depend on $\nu_0$ in the regularity condition (A1). Hence
$$\Pr\left(\max_{ij}|S_{ij} - \Sigma^*_{ij}| \geq \frac{\lambda^{clime}}{L}\right) \leq C_1\exp\left(-C_2\frac{n\lambda^{clime\,2}}{L^2}\right).$$
Lemma 1. Let $\Omega^{init}$ be an initial estimator and $\Omega^* = (\omega^*_{ij})_{1\leq i,j\leq p}$ be the true precision matrix. Define $A_{0i} = \{j : \omega^*_{ij}\neq 0,\ j\neq i\}$ for $i = 1, \ldots, p$, and define $a_0 = \min\{1, a_2\}$, where $a_2$ is defined in the penalty conditions (P1)-(P3). For each $i = 1, \ldots, p$, if
$$\|\Omega^{init} - \Omega^*\|_\infty < a_0\lambda_i\cdot\left\{\frac{1}{\omega^{init}_{ii}}\max_{j\in A_{0i}}\left(\left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right| + 1\right)\right\}^{-1},$$
then
$$\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| < a_0\lambda_i.$$

Proof of Lemma 1.
$$\left|\frac{\omega^{init}_{ji}}{\omega^{init}_{ii}} - \frac{\omega^*_{ji}}{\omega^*_{ii}}\right| = \left|\frac{\omega^*_{ii}\omega^{init}_{ji} - \omega^{init}_{ii}\omega^*_{ji}}{\omega^{init}_{ii}\omega^*_{ii}}\right| \leq \frac{(|\omega^*_{ii}| + |\omega^*_{ji}|)\|\Omega^{init} - \Omega^*\|_\infty}{|\omega^{init}_{ii}\omega^*_{ii}|} = \left(1 + \left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right|\right)\frac{\|\Omega^{init} - \Omega^*\|_\infty}{|\omega^{init}_{ii}|} < a_0\lambda_i.$$
Let $A_0 = \{(i, j) : \Omega^*_{ij}\neq 0\} = \cup_iA_{0i}$.
Corollary 1. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold.

(1) Suppose that, for $i = 1, \ldots, p$, $\min_{j\in A_{0i}}\left|\omega^*_{ji}/\omega^*_{ii}\right| > (a+1)\lambda_i$ and
$$\lambda_i > \max\left(\frac{1}{a_0}\,\frac{1}{\omega^{init}_{ii}}\max_{j\in A_{0i}}\left(\left|\frac{\omega^*_{ji}}{\omega^*_{ii}}\right| + 1\right)\cdot 4L\lambda^{clime},\ \ \frac{2}{a_1}\sqrt{\frac{\log p}{n}\,\max_i\omega^{*-1}_{ii}\,M}\right),$$
where $a_0 = \min(1, a_2)$, $\omega^{init}_{ii} = \hat\omega^{clime}_{ii}(\lambda^{clime})$, $\lambda^{clime} = LC_0\sqrt{\log p/n}$ for some $C_0 > 0$, and $M = \max_j\|Z_j\|_2^2/n$. Then
$$\Pr\left(\hat\Omega^{osLLA}(\hat\Omega^{clime}(\lambda^{clime}), \lambda) = \hat\Omega^{(o)}\right) \geq 1 - p_0 - p_1,$$
where $p_1 = \Pr(F_{n1}^c) \leq 2\left(p(p-1) - \sum_{j=1}^{p}s_j\right)\exp\left(-\frac{na_1^2\min_i\lambda_i^2}{2M\max_i\omega^{*-1}_{ii}}\right)$ and $p_0 = \Pr(\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty > 4L\lambda^{clime}) \leq C_1\exp\left(-C_2\frac{n\lambda^{clime\,2}}{L^2}\right)$.

(2) If $n\min_i\lambda_i^2\to\infty$, $n\lambda^{clime\,2}\to\infty$ and $\log p = o(n\min_i\lambda_i^2)$, then
$$\Pr\left(\hat\Omega^{osLLA}(\hat\Omega^{clime}(\lambda^{clime}), \lambda) = \hat\Omega^{(o)}\right)\to 1$$
as $n\to\infty$.
Corollary 2. Assume that the regularity conditions (A1)-(A2) and (P1)-(P3) hold.

(1) Suppose that the conditions of Corollary 1 (1) hold, that is, for $i = 1, \ldots, p$, $\min_{j\in A_{0i}}\left|\omega^*_{ji}/\omega^*_{ii}\right| > (a+1)\lambda_i$ and each $\lambda_i$ exceeds the same lower bound, with the CLIME initial estimator $\omega^{init}_{ii} = \hat\omega^{clime}_{ii}(\lambda^{clime})$, $\lambda^{clime} = LC_0\sqrt{\log p/n}$ for some $C_0 > 0$, and $M = \max_j\|Z_j\|_2^2/n$. Then
$$\Pr\left(\hat\Omega^{TSDS}(\hat\Omega^{clime}(\lambda^{clime}), \lambda) = \hat\Omega^{(o)}\right) \geq 1 - p_0 - p_1,$$
with $p_0$ and $p_1$ as in Corollary 1.

(2) If $n\min_i\lambda_i^2\to\infty$, $n\lambda^{clime\,2}\to\infty$ and $\log p = o(n\min_i\lambda_i^2)$, then
$$\Pr\left(\hat\Omega^{TSDS}(\hat\Omega^{clime}(\lambda^{clime}), \lambda) = \hat\Omega^{(o)}\right)\to 1$$
as $n\to\infty$.
Proof of Corollary 1 and Corollary 2. These results follow from Theorem 1 and Theorem 2. To bound $\Pr(F_{n0}^c)$, the inequality
$$\Pr(\|\hat\Omega^{clime}(\lambda^{clime}) - \Omega^*\|_\infty \geq 4L\lambda^{clime}) \leq C_1\exp\left(-C_2\frac{n\lambda^{clime\,2}}{L^2}\right)$$
and Lemma 1 are used. $\Pr(F_{n1}^c)$ can be bounded in a similar way to the linear regression case. The difference between regression and precision matrix estimation is the variance of $\epsilon_i$: in precision matrix estimation, $\mathrm{var}(\epsilon_i) = \Sigma^*_{ii} - \Sigma^*_{i,-i}\Sigma^{*-1}_{-i,-i}\Sigma^*_{-i,i} = \omega^{*-1}_{ii}$, hence it is bounded. Let $\epsilon_i = (\epsilon_{1i}, \ldots, \epsilon_{ni})^T$ be the vector of iid samples from $N(0, \mathrm{var}(\epsilon_i))$. Then
$$\begin{aligned}
\Pr(F_{n1}^c) &\leq \sum_{i=1}^{p}\sum_{j\in A^c_{0i}}\Pr\left(\left|\frac{1}{n}Z_j^T(I_n - H_{A_{0i}})\epsilon_i\right| > P'_{\lambda_i}(|\omega^{init}_{ij}|)\right)\\
&\leq 2\sum_{i=1}^{p}(p - 1 - s_i)\exp\left(-\frac{n\min_{j\in A^c_{0i}}P'_{\lambda_i}(|\omega^{init}_{ij}|)^2}{2\omega^{*-1}_{ii}M}\right)\\
&\leq 2\sum_{i=1}^{p}(p - 1 - s_i)\exp\left(-\frac{na_1^2\lambda_i^2}{2\omega^{*-1}_{ii}M}\right)\\
&\leq 2\left(p(p-1) - \sum_{i=1}^{p}s_i\right)\exp\left(-\frac{na_1^2\min_i\lambda_i^2}{2\max_i\omega^{*-1}_{ii}M}\right),
\end{aligned}$$
because
$$\Pr(|a^T\epsilon_i| > t) \leq 2\exp\left(-\frac{t^2}{2\omega^{*-1}_{ii}\|a\|_2^2}\right)\quad\forall t\geq 0$$
and $\left\|\frac{1}{n}Z_j^T(I_n - H_{A_{0i}})\right\|_2^2 \leq \frac{\|Z_j\|_2^2}{n^2}\,\lambda_{\max}(I_n - H_{A_{0i}}) \leq M/n$ for all $j\in A^c_{0i}$.
The LASSO or Dantzig selector estimates of $\Omega_{-ii}$ can also be good initial estimates, based on the $\ell_2$ bounds of Bickel et al. (2009). Denote by $\otimes$ the Kronecker product. If the Hessian of the log-likelihood $\Gamma^*_{p^2\times p^2} = \Omega^{*-1}\otimes\Omega^{*-1}$ satisfies the incoherence (or irrepresentable) condition and some additional regularity conditions hold, an elementwise $\ell_\infty$ bound for the graphical lasso estimator is $\|\hat\Omega^{glasso} - \Omega^*\|_\infty = O\left(\sqrt{\log p/n}\right)$ (Ravikumar et al., 2011), where the incoherence (or irrepresentable) condition requires that there exists $\alpha\in(0, 1]$ such that $\max_{e\in A^c}\|\Gamma^*_{eA}(\Gamma^*_{AA})^{-1}\|_1 \leq 1 - \alpha$ with $A = \cup_iA_{0i}$. Hence the glasso estimate $\hat\Omega^{glasso}$ can also be a good initial estimate of $\Omega^*$.
3.3 Numerical analyses
We conduct two simulation studies and one real data analysis. The two simulation settings are the same as in Fan et al. (2012). The real data analysis is a classification problem using linear discriminant analysis (LDA), which requires an estimate of the precision matrix.
3.3.1 Simulations
We simulate $n$ independent random vectors from $N_p(0, \Sigma^*)$ with a sparse precision matrix $\Omega^* = (\Sigma^*)^{-1}$. We consider two different sparsity patterns of $\Omega^*$.

Example 1. $\Omega^*$ is a tridiagonal matrix, obtained by constructing $\Sigma^* = (\sigma^*_{ij})$ with $\sigma^*_{ij} = \exp(-|c_i - c_j|)$ for $c_1 < \cdots < c_p$, where the increments $c_p - c_{p-1}, \ldots, c_2 - c_1$ are generated independently from Unif(0.5, 1).

Example 2. $\Omega^* = U^TU + I$, where $U = (u_{ij})_{p\times p}$ has zero diagonal and exactly $p$ nonzero off-diagonal entries. The nonzero entries are generated as $u_{ij} = t_{ij}c_{ij}$, where the $t_{ij}$ are independently generated from Unif(1, 2) and the $c_{ij}$ are independent random signs with $\Pr(c_{ij} = 1) = \Pr(c_{ij} = -1) = 0.5$.

We also generate an independent validation set of sample size $n$ to tune each estimator. In our simulations we let $n = 100$ and $p = 100$ or $p = 200$.
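A sketch of the two data generating designs is given below. It assumes the reading $\Omega^* = U^TU + I$ for Example 2 (reconstructed from the garbled display above) and uses illustrative seeds and helper names, not the thesis code.

```python
# Illustrative data generation for the two simulation designs.
import numpy as np

rng = np.random.default_rng(1)

def example1_cov(p):
    # Sigma*_{ij} = exp(-|c_i - c_j|) with increments c_{k+1} - c_k ~ Unif(0.5, 1)
    c = np.concatenate([[0.0], np.cumsum(rng.uniform(0.5, 1.0, p - 1))])
    return np.exp(-np.abs(c[:, None] - c[None, :]))

def example2_precision(p):
    # Omega* = U^T U + I, U with zero diagonal and exactly p nonzero entries
    U = np.zeros((p, p))
    off = [(i, j) for i in range(p) for j in range(p) if i != j]
    for k in rng.choice(len(off), size=p, replace=False):
        i, j = off[k]
        U[i, j] = rng.uniform(1.0, 2.0) * rng.choice([-1.0, 1.0])
    return U.T @ U + np.eye(p)

n, p = 100, 100
X1 = rng.multivariate_normal(np.zeros(p), example1_cov(p), size=n)
X2 = rng.multivariate_normal(np.zeros(p), np.linalg.inv(example2_precision(p)), size=n)
```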
We compute the $\ell_1$ penalized Gaussian likelihood estimator, denoted by glasso, using the popular R package glasso (Friedman et al., 2013). CLIME (Cai et al., 2011) is computed with the R package clime (Cai et al., 2012). We use gSCAD to denote the one step SCAD penalized Gaussian likelihood estimator with the CLIME initial estimate, proposed by Fan et al. (2012). These likelihood based approaches are tuned by minimizing the validation error, defined as $-\log\det\hat\Omega + \mathrm{trace}(\hat\Omega S^{val})$, where $\hat\Omega$ is the generic estimator and $S^{val}$ is the sample covariance of the validation set. Denote by MB the columnwise $\ell_1$ penalized regression proposed by Meinshausen and Buhlmann (2006). We conduct our two stage methods with the LASSO and the Dantzig selector using two different initial estimators, glasso and CLIME. Denote by clime+LS the osLLA with the CLIME initial estimator and by clime+DS the TSDS with the CLIME initial estimator; with the glasso initial, glasso+LS and glasso+DS denote the osLLA and the TSDS, respectively. MB and these two stage methods are tuned columnwise by minimizing the validation error $\|Z^{val}_i - Z^{val}_{-i}\hat\beta_{(i)}(\lambda_i)\|^2/n$.
For each model, we generate 100 independent datasets, each consisting of $n$ training samples and $n$ validation samples. Estimation accuracy is measured by the average Frobenius norm loss $\|\hat\Omega - \Omega^*\|_F$, the average matrix $\ell_1$ norm loss $\|\hat\Omega - \Omega^*\|_1$, and the average spectral norm loss $\|\hat\Omega - \Omega^*\|_2$ over the 100 replications, where $\|A\|_F = \sqrt{\sum_{i,j}a_{ij}^2}$, $\|A\|_1 = \max_{1\leq j\leq q}\sum_{i=1}^{p}|a_{ij}|$, and $\|A\|_2 = \sup_{|x|\leq 1}|Ax|_2$ for a matrix $A = (a_{ij})\in\mathbb{R}^{p\times q}$. The selection accuracy is evaluated by the average edge proportions of false positives (FP) and false negatives (FN), sensitivity, and specificity over the 100 replications; the average number of estimated edges is also reported. We plot the ROC curve with the average sensitivity and specificity for each method, and we also plot the average Frobenius norm and spectral norm against the number of edges for each method. For the two stage methods, we consider two settings: one with the same tuning parameter for all columns and one with columnwise different tuning parameters. The simulation results are summarized in Figures 3.1-3.8 and Tables 3.1-3.8. We conduct the two stage methods with several glasso and CLIME initials over a sequence of tuning parameters, and we choose over-edged initial estimates, which have more edges than the glasso or CLIME estimates selected as optimal by the validation error. We summarize the best results of our proposed methods in Figures 3.1-3.8 and Tables 3.1-3.8. The selection results of our proposed methods outperform the others, and the two stage methods with the glasso initial tend to perform better than those with the CLIME initial. In Example 2, our proposed methods achieve the best finite sample performance in both estimation and selection.
[Figure 3.1: ROC curve of Example 1 (p=100, q=99). Sensitivity (x100) is plotted against (1-specificity) (x100) for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.]
Table 3.1: Example 1 (p=100, q=99)
Methods Edge FP FN Sensitivity Specificity
glasso 1025.655 0.9034 1.00E-04 0.9976 0.8089
gSCAD-osLLA 902.96 0.8839 2.00E-04 0.9917 0.8341
CLIME 586.67 0.8293 2.00E-04 0.9904 0.8993
MB 288.49 0.6604 3.00E-04 0.9852 0.9606
MB(same) 291.27 0.6618 3.00E-04 0.9867 0.9601
clime+LS 281 0.6522 4.00E-04 0.9821 0.9621
clime+DS 283.57 0.6554 4.00E-04 0.9821 0.9616
glasso+LS 134.02 0.3 0.0011 0.9441 0.9916
glasso+DS 133.35 0.2963 0.0011 0.9441 0.9918
clime+LS(same) 290.92 0.6612 3.00E-04 0.9862 0.9602
clime+DS(same) 277.24 0.6452 3.00E-04 0.9837 0.9629
glasso+LS(same) 140.96 0.3217 8.00E-04 0.9598 0.9905
glasso+DS(same) 140.36 0.3185 8.00E-04 0.9593 0.9906
[Figure 3.2: $\|\hat\Omega - \Omega^*\|$ of Example 1 (p=100, q=99). Panel (a) shows the Frobenius norm and panel (b) the spectral norm against the number of edges for glasso, gSCAD, CLIME, MB(same), MB, TS(same), and TS.]
Table 3.2: ‖Ω −Ω∗‖ of Example 1 (p=100, q=99)
Methods Edge Frob l1 l2
glasso 1025.655 6.6562 2.667 1.5839
gSCAD-osLLA 902.96 4.3703 1.8354 1.1696
CLIME 586.67 5.7808 2.0251 1.3744
MB 288.49 6.136 2.6453 1.9252
MB(same) 291.27 4.7942 1.7009 1.1407
clime+LS 281 6.5012 3.0927 2.335
clime+DS 283.57 6.5829 3.0539 2.2887
glasso+LS 134.02 4.6925 1.9878 1.4721
glasso+DS 133.35 4.6922 2.0082 1.4868
clime+LS(same) 290.92 4.9179 1.786 1.2587
clime+DS(same) 277.24 4.9048 1.748 1.2448
glasso+LS(same) 140.96 4.5504 1.8529 1.3569
glasso+DS(same) 140.36 4.5491 1.8487 1.3571
[Figure 3.3: ROC curve of Example 1 (p=200, q=199). Sensitivity (x100) against (1-specificity) (x100) for the same methods as Figure 3.1.]
Table 3.3: Example 1 (p=200, q=199)
Methods Edge FP FN Sensitivity Specificity
glasso 2657.33 0.925 0 0.9982 0.8752
gSCAD-osLLA 1729.86 0.8809 1.00E-04 0.9919 0.9222
CLIME 960.83 0.7932 2.00E-04 0.9818 0.9611
MB 604.36 0.6765 2.00E-04 0.9798 0.9792
MB(same) 620.22 0.6818 2.00E-04 0.9825 0.9784
clime+LS 407.61 0.5389 6.00E-04 0.9403 0.9888
clime+DS 407.12 0.5377 6.00E-04 0.9414 0.9888
glasso+LS 586.49 0.6673 2.00E-04 0.9777 0.9801
glasso+DS 608.99 0.6792 2.00E-04 0.9783 0.979
clime+LS(same) 435 0.5578 4.00E-04 0.9597 0.9876
clime+DS(same) 429.43 0.552 4.00E-04 0.9587 0.9879
glasso+LS(same) 599.01 0.6725 2.00E-04 0.9816 0.9795
glasso+DS(same) 585.45 0.6651 2.00E-04 0.9808 0.9802
[Figure 3.4: $\|\hat\Omega - \Omega^*\|$ of Example 1 (p=200, q=199). Panels: (a) Frobenius norm and (b) spectral norm against the number of edges, for the same methods as Figure 3.2.]
Table 3.4: ‖Ω −Ω∗‖ of Example 1 (p=200, q=199)
Methods Edge Frob l1 l2
glasso 2657.33 10.9083 3.0856 1.7999
gSCAD-osLLA 1729.86 7.015 2.2085 1.3832
CLIME 960.83 9.4415 2.1993 1.5915
MB 604.36 10.4144 3.905 3.0842
MB(same) 620.22 7.32 1.8644 1.2588
clime+LS 407.61 10.599 4.0213 3.1616
clime+DS 407.12 10.878 4.2165 3.3455
glasso+LS 586.49 11.1297 4.5991 3.7215
glasso+DS 608.99 11.8905 4.6239 3.7287
clime+LS(same) 435 8.1239 2.2008 1.7033
clime+DS(same) 429.43 8.1794 2.2122 1.7129
glasso+LS(same) 599.01 7.2486 1.854 1.267
glasso+DS(same) 585.45 7.2326 1.8454 1.2634
[Figure 3.5: ROC curve of Example 2 (p=100, q=59). Sensitivity (x100) against (1-specificity) (x100) for the same methods as Figure 3.1.]
Table 3.5: Example 2 (p=100, q=59)
Methods Edge FP FN Sensitivity Specificity
glasso 1025.655 0.9034 1.00E-04 0.9976 0.8089
gSCAD-osLLA 902.96 0.8839 2.00E-04 0.9917 0.8341
CLIME 586.67 0.8293 2.00E-04 0.9904 0.8993
MB 288.49 0.6604 3.00E-04 0.9852 0.9606
MB(same) 291.27 0.6618 3.00E-04 0.9867 0.9601
clime+LS 281 0.6522 4.00E-04 0.9821 0.9621
clime+DS 283.57 0.6554 4.00E-04 0.9821 0.9616
glasso+LS 134.02 0.3 0.0011 0.9441 0.9916
glasso+DS 133.35 0.2963 0.0011 0.9441 0.9918
clime+LS(same) 290.92 0.6612 3.00E-04 0.9862 0.9602
clime+DS(same) 277.24 0.6452 3.00E-04 0.9837 0.9629
glasso+LS(same) 140.96 0.3217 8.00E-04 0.9598 0.9905
glasso+DS(same) 140.36 0.3185 8.00E-04 0.9593 0.9906
[Figure 3.6: $\|\hat\Omega - \Omega^*\|$ of Example 2 (p=100, q=59). Panels: (a) Frobenius norm and (b) spectral norm against the number of edges, for the same methods as Figure 3.2.]
Table 3.6: ‖Ω −Ω∗‖ of Example 2 (p=100, q=59)
Methods Edge Frob l1 l2
glasso 1025.655 6.6562 2.667 1.5839
gSCAD-osLLA 902.96 4.3703 1.8354 1.1696
CLIME 586.67 5.7808 2.0251 1.3744
MB 288.49 6.136 2.6453 1.9252
MB(same) 291.27 4.7942 1.7009 1.1407
clime+LS 281 6.5012 3.0927 2.335
clime+DS 283.57 6.5829 3.0539 2.2887
glasso+LS 134.02 4.6925 1.9878 1.4721
glasso+DS 133.35 4.6922 2.0082 1.4868
clime+LS(same) 290.92 4.9179 1.786 1.2587
clime+DS(same) 277.24 4.9048 1.748 1.2448
glasso+LS(same) 140.96 4.5504 1.8529 1.3569
glasso+DS(same) 140.36 4.5491 1.8487 1.3571
[Figure 3.7: ROC curve of Example 2 (p=200, q=92). Sensitivity (x100) against (1-specificity) (x100) for the same methods as Figure 3.1.]
Table 3.7: Example 2 (p=200, q=92)
Methods Edge FP FN Sensitivity Specificity
glasso 2263.27 0.968 0.0011 0.7851 0.8894
gSCAD-osLLA 536.81 0.8786 0.0016 0.6573 0.976
CLIME 242.06 0.7012 0.0011 0.7698 0.9914
MB 272.85 0.732 0.001 0.7914 0.9899
MB(same) 378.18 0.878 0.0024 0.4965 0.9832
clime+LS 102 0.2251 7.00E-04 0.8561 0.9988
clime+DS 102 0.2252 7.00E-04 0.8561 0.9988
glasso+LS 104.46 0.2115 5.00E-04 0.8932 0.9989
glasso+DS 104.38 0.2109 5.00E-04 0.8932 0.9989
clime+LS(same) 117.73 0.389 0.0011 0.7559 0.9976
clime+DS(same) 117.74 0.389 0.0011 0.7559 0.9976
glasso+LS(same) 87.02 0.1569 0.001 0.7924 0.9993
glasso+DS(same) 86.93 0.1563 0.001 0.7921 0.9993
[Figure 3.8: $\|\hat\Omega - \Omega^*\|$ of Example 2 (p=200, q=92). Panels: (a) Frobenius norm and (b) $\ell_2$ (spectral) norm against the number of edges, for the same methods as Figure 3.2.]
Table 3.8: ‖Ω −Ω∗‖ of Example 2 (p=200, q=92)
Methods Edge Frob l1 l2
glasso 2263.27 37.8666 13.3819 9.5821
gSCAD-osLLA 536.81 21.8945 9.7219 7.0486
CLIME 242.06 28.311 11.4089 8.0062
MB 272.85 22.7018 8.2783 5.3505
MB(same) 378.18 30.7944 12.2013 8.1922
clime+LS 102 17.8328 9.0719 6.2429
clime+DS 102 17.8365 9.078 6.2441
glasso+LS 104.46 16.2034 8.0195 5.7519
glasso+DS 104.38 16.2091 8.0138 5.7561
clime+LS(same) 117.73 20.4286 9.2521 6.092
clime+DS(same) 117.74 20.4287 9.2521 6.092
glasso+LS(same) 87.02 18.1129 8.0463 5.6268
glasso+DS(same) 86.93 18.117 8.0515 5.633
3.3.2 Real data analysis
We apply our two stage methods to the breast cancer dataset which was analyzed by Hess et al. (2006) and is available at http://bioinformatics.mdanderson.org/. This dataset was also used in previous studies (Fan et al., 2009; Cai et al., 2011). The aim of this analysis is to compare the results of linear discriminant analysis (LDA) based on several regularization methods for sparse precision matrix estimation. The dataset contains 22,283 gene expression levels of 133 patients, 34 of whom achieved pathological complete response (pCR) while the others did not (a.k.a. residual disease (RD)). Since pCR after neoadjuvant (preoperative) chemotherapy indicates a high possibility of cancer free survival, it is of substantial interest to predict whether or not a patient will achieve pCR. In this study, LDA is utilized to classify a patient as pCR or RD, and the precision matrix must be estimated before applying LDA. Fan et al. (2009) used the penalized loglikelihood method with the LASSO, adaptive LASSO, and SCAD penalties to estimate the precision matrix, and Cai et al. (2011) used the CLIME estimate as the precision matrix. We follow the same framework as Fan et al. (2009) and Cai et al. (2011).
First, we randomly divide the dataset into training and testing sets. To maintain the proportion of pCR and RD each time, we use stratified sampling which randomly selects five subjects from pCR and 16 from RD to construct the testing dataset; the remaining subjects are used as the training dataset. For each training set, we conduct a two-sample t-test between the two groups for each gene and select the 113 most significant genes (i.e., those with the smallest p-values) as the covariates for prediction. Because the size of the training sample is $n = 112$, the covariates with size $p = 113$ allow us to examine the performance when $p > n$. Second, we standardize each gene level in these datasets by dividing by the corresponding standard deviation estimated from the training set. Finally, we conduct precision matrix estimation for each regularization method and apply it to LDA. According to the LDA framework, we assume that the normalized gene expression data are normally distributed as $N(\mu_k, \Sigma)$, where the two groups are assumed to have the same covariance matrix $\Sigma$ but different means $\mu_k$, $k = 1$ for pCR and $k = 2$ for RD. The LDA scores based on the estimated precision matrix $\hat\Omega$ are as follows:
$$\hat\delta_k(x) = x^T\hat\Omega\hat\mu_k - \frac{1}{2}\hat\mu_k^T\hat\Omega\hat\mu_k + \log\hat\pi_k,$$
where $\hat\pi_k = n_k/n$ is the proportion of subjects in group $k$ in the training set and $\hat\mu_k = \frac{1}{n_k}\sum_{i\in\text{group }k}x_i$ is the within-group mean vector in the training set. The classification rule is given by $\hat k(x) = \operatorname*{argmax}_k\hat\delta_k(x)$ for $k = 1, 2$. To evaluate the classification performance based on the precision matrix estimates, we use the specificity, sensitivity, and Matthews correlation coefficient (MCC) criteria, defined as follows:
$$\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \qquad \mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
$$\mathrm{MCC} = \frac{\mathrm{TP}\times\mathrm{TN} - \mathrm{FP}\times\mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}},$$
where TP, TN, FP, and FN are the numbers of true positives (pCR), true negatives (RD), false positives, and false negatives, respectively. We also compare the numbers of nonzero precision matrix elements among the methods considered in the simulations, with the same tuning strategy. The results are reported in Table 3.9. The proposed two stage methods yield very sparse precision matrices while performing as well as or similarly to the other methods.
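A minimal sketch of the LDA rule used here, taking an estimated precision matrix, the group means, and the group proportions as inputs, is as follows; the function names are illustrative and not part of the analysis code.

```python
# LDA scores delta_k(x) = x' Omega mu_k - 0.5 mu_k' Omega mu_k + log pi_k,
# classified by the largest score.
import numpy as np

def lda_scores(x, Omega, means, priors):
    return np.array([x @ Omega @ m - 0.5 * m @ Omega @ m + np.log(pk)
                     for m, pk in zip(means, priors)])

def classify(x, Omega, means, priors):
    # returns 1 for pCR, 2 for RD when means/priors are ordered (pCR, RD)
    return int(np.argmax(lda_scores(x, Omega, means, priors))) + 1
```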
Table 3.9: Real Data (Breast Cancer)
Methods SP SN MCC #Edge
glasso 0.876(0.066) 0.404(0.186) 0.307(0.229) 1066.87(31.054)
gSCAD-osLLA 0.784(0.077) 0.682(0.201) 0.428(0.211) 731.37(50.934)
CLIME 0.737(0.068) 0.782(0.173) 0.457(0.18) 2282.18(371.52)
MB 0.677(0.074) 0.824(0.164) 0.433(0.169) 289.06(25.678)
glasso+DS 0.674(0.087) 0.824(0.161) 0.431(0.176) 221.50(22.474)
glasso+LS 0.674(0.088) 0.820(0.164) 0.428(0.179) 224.48(21.468)
clime+DS 0.666(0.072) 0.824(0.169) 0.422(0.17) 260.37(19.059)
clime+LS 0.669(0.077) 0.822(0.168) 0.424(0.165) 333.09(22.069)
3.4 Conclusion

In this paper, we focus on variable selection and the correct recovery of the support of the precision matrix. We propose a regression based approach which applies two stage methods based on the LASSO or the Dantzig selector to the columnwise estimation of the precision matrix. We prove that the proposed methods can correctly recover the support of the precision matrix and obtain a $\sqrt{n}$-consistent estimator for the nonzero elements of the precision matrix with high probability under some regularity conditions. Numerical results show that our proposed methods outperform existing regularization methods, including glasso, gSCAD, and CLIME, in terms of estimation and especially support recovery of the precision matrix.
Chapter 4
Concluding remarks
In this thesis, we propose a two stage method based on the Dantzig selector, called the two stage Dantzig selector, for the high dimensional regression model. We prove that the two stage Dantzig selector satisfies the strong oracle property. Numerical results support our contention that the Dantzig selector used in our proposed method improves variable selection and estimation compared with the LASSO. Furthermore, the two stage Dantzig selector outperforms other regularization methods, including the LASSO, the Dantzig selector, the SCAD and the MCP.

We also apply the two stage methods based on the LASSO or the Dantzig selector to sparse precision matrix estimation. We prove that these proposed methods can correctly recover the support of the precision matrix and obtain a $\sqrt{n}$-consistent estimator for the nonzero elements of the precision matrix. For the estimation of a sparse precision matrix, the two stage methods perform well in estimation and especially in support recovery.
Chapter 5
Appendix
5.1 Algorithms for Dantzig selector
There are several algorithms for the Dantzig selector. Recall the definition of the Dantzig selector,
$$\min_\beta\ \|\beta\|_1 \quad\text{subject to}\quad \left\|\frac{1}{n}X^T(y - X\beta)\right\|_\infty \leq \lambda. \qquad (5.1)$$
A standard way to solve (5.1) is to use linear programming (LP) techniques, because (5.1) can be recast as an LP problem. Candes and Romberg (2005) proposed the l1-magic package, which applies a primal-dual interior point method, one of the LP techniques, to the reformulated LP problem. This algorithm is known to be efficient when $X$ is sparse or can be efficiently transformed into a diagonal matrix, but it can be inefficient for large-scale problems because of the Newton step in each iteration (Wang and Yuan, 2012).
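For reference, the LP reformulation can also be handed to a generic LP solver. The following sketch solves (5.1) with scipy's HiGHS-based linprog using auxiliary variables $u \geq |\beta|$; it is an illustration rather than the l1-magic implementation.

```python
# Dantzig selector (5.1) as a linear program in the variables z = (beta, u).
import numpy as np
from scipy.optimize import linprog

def dantzig_lp(X, y, lam):
    n, p = X.shape
    G = X.T @ X / n
    c = X.T @ y / n
    I, Z = np.eye(p), np.zeros((p, p))
    obj = np.concatenate([np.zeros(p), np.ones(p)])   # minimize sum(u)
    A_ub = np.vstack([np.hstack([ I, -I]),            #  beta - u <= 0
                      np.hstack([-I, -I]),            # -beta - u <= 0
                      np.hstack([ G,  Z]),            #  (1/n) X'(X beta - y) <= lam
                      np.hstack([-G,  Z])])           # -(1/n) X'(X beta - y) <= lam
    b_ub = np.concatenate([np.zeros(2 * p), lam + c, lam - c])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(None, None), method="highs")
    return res.x[:p]
```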
There are homotopy methods to compute the entire solution path of the Dantzig selector (Romberg, 2008; James et al., 2009), but they are also problematic for high dimensional data (Becker et al., 2011). As an effort to solve (5.1) efficiently in large-scale problems, first-order methods have been proposed (Lu, 2012; Becker et al., 2011). Lu et al. (2012) applied the alternating direction method (ADM), which has been widely used to solve large-scale problems, to solving (5.1), and its variations have been developed for large-scale problems (Wang and Yuan, 2012).

We go into three representative algorithms for the Dantzig selector: the primal-dual interior point method (Candes and Romberg, 2005), the Dantzig selector with sequential optimization (DASSO) (James et al., 2009), and the alternating direction method (ADM) (Lu et al., 2012). We abstract the main algorithms for the Dantzig selector from these three papers.
5.1.1 Primal-dual interior point algorithm (Candes and Romberg, 2005)

The Dantzig selector can be recast as a linear program (LP). An LP is an optimization problem with a linear objective function and linear equality or inequality constraints. There are many solvers for LPs, such as the simplex method, the barrier method, and the primal-dual interior point method. Candes and Romberg (2005) introduced a primal-dual interior point method for LPs and second-order cone programs (SOCPs). Here we extract the algorithm for the Dantzig selector from Candes and Romberg (2005).
An equivalent linear program to (5.1) is given by
$$\min_{\beta, u}\ \sum_iu_i \quad\text{subject to}\quad \beta - u\leq 0,\quad -\beta - u\leq 0,\quad X^Tr - \lambda\mathbf{1}\leq 0,\quad -X^Tr - \lambda\mathbf{1}\leq 0,$$
where $r = X\beta - y$. Taking
$$f_{u_1} = \beta - u,\quad f_{u_2} = -\beta - u,\quad f_{\lambda_1} = X^Tr - \lambda\mathbf{1},\quad f_{\lambda_2} = -X^Tr - \lambda\mathbf{1},$$
and $f = (f_{u_1}, f_{u_2}, f_{\lambda_1}, f_{\lambda_2})^T$, at the optimal point $(\beta^*, u^*)$ there exist dual vectors $\gamma^* = (\gamma^*_{u_1}, \gamma^*_{u_2}, \gamma^*_{\lambda_1}, \gamma^*_{\lambda_2})^T$, $\gamma^*\geq 0$, such that the following Karush-Kuhn-Tucker (KKT) conditions are satisfied:
$$\gamma^*_{u_1} - \gamma^*_{u_2} + X^TX(\gamma^*_{\lambda_1} - \gamma^*_{\lambda_2}) = 0,\qquad \mathbf{1} - \gamma^*_{u_1} - \gamma^*_{u_2} = 0,$$
$$f_{u_1}\leq 0,\quad f_{u_2}\leq 0,\quad f_{\lambda_1}\leq 0,\quad f_{\lambda_2}\leq 0,$$
$$\gamma_{u_1,i}f_{u_1,i} = 0,\quad \gamma_{u_2,i}f_{u_2,i} = 0,\quad \gamma_{\lambda_1,i}f_{\lambda_1,i} = 0,\quad \gamma_{\lambda_2,i}f_{\lambda_2,i} = 0,\quad i = 1, \ldots, p.$$
The complementary slackness condition $\gamma_if_i = 0$ is relaxed in practice to
$$\gamma_i^{(k)}f_i(\beta^{(k)}, u^{(k)}) = -1/\tau^{(k)}, \qquad (5.2)$$
with $\tau^{(k)}$ increasing through the iterations. The relaxed KKT conditions replace the complementary slackness condition with (5.2). The optimal solution $\beta^*$ of the primal-dual algorithm satisfies the relaxed KKT conditions along with the optimal dual vectors $\gamma^*$. The solution is obtained through the classical Newton method constrained to the interior region ($f_i(\beta^{(k)}, u^{(k)}) < 0$, $\gamma_i^{(k)} > 0$). The dual and central residuals quantify how close a point $(\beta, u; \gamma_{u_1}, \gamma_{u_2}, \gamma_{\lambda_1}, \gamma_{\lambda_2})$ is to satisfying the KKT conditions with (5.2) in place of the slackness condition:
$$r_{dual} = \begin{pmatrix}\gamma_{u_1} - \gamma_{u_2} + X^TX(\gamma_{\lambda_1} - \gamma_{\lambda_2})\\ \mathbf{1} - \gamma_{u_1} - \gamma_{u_2}\end{pmatrix},\qquad r_{cent} = -\Gamma f - (1/\tau)\mathbf{1},$$
where $\Gamma$ is a diagonal matrix with $(\Gamma)_{ii} = \gamma_i$. The Newton step is the solution of
$$\begin{pmatrix}X^TX\Sigma_aX^TX + \Sigma_{11} & \Sigma_{12}\\ \Sigma_{12} & \Sigma_{11}\end{pmatrix}\begin{pmatrix}\Delta\beta\\ \Delta u\end{pmatrix} = \begin{pmatrix}-(1/\tau)\cdot\left(X^TX(-f_{\lambda_1}^{-1} + f_{\lambda_2}^{-1}) - f_{u_1}^{-1} + f_{u_2}^{-1}\right)\\ -\mathbf{1} - (1/\tau)\cdot(f_{u_1}^{-1} + f_{u_2}^{-1})\end{pmatrix} := \begin{pmatrix}w_1\\ w_2\end{pmatrix},$$
where
$$\Sigma_{11} = -\Gamma_{u_1}F_{u_1}^{-1} - \Gamma_{u_2}F_{u_2}^{-1},\qquad \Sigma_{12} = \Gamma_{u_1}F_{u_1}^{-1} - \Gamma_{u_2}F_{u_2}^{-1},\qquad \Sigma_a = -\Gamma_{\lambda_1}F_{\lambda_1}^{-1} - \Gamma_{\lambda_2}F_{\lambda_2}^{-1}.$$
Setting
$$\Sigma_\beta = \Sigma_{11} - \Sigma_{12}^2\Sigma_{11}^{-1},$$
we can eliminate
$$\Delta u = \Sigma_{11}^{-1}(w_2 - \Sigma_{12}\Delta\beta)$$
and solve
$$(X^TX\Sigma_aX^TX + \Sigma_\beta)\Delta\beta = w_1 - \Sigma_{12}\Sigma_{11}^{-1}w_2$$
for $\Delta\beta$. As before, this system is symmetric positive definite, and the conjugate gradient (CG) algorithm can be used to solve it.
Given $\Delta\beta$ and $\Delta u$, the step directions for the inequality dual variables are given by
$$\begin{aligned}
\Delta\gamma_{u_1} &= -\Gamma_{u_1}F_{u_1}^{-1}(\Delta\beta - \Delta u) - \gamma_{u_1} - (1/\tau)f_{u_1}^{-1},\\
\Delta\gamma_{u_2} &= -\Gamma_{u_2}F_{u_2}^{-1}(-\Delta\beta - \Delta u) - \gamma_{u_2} - (1/\tau)f_{u_2}^{-1},\\
\Delta\gamma_{\lambda_1} &= -\Gamma_{\lambda_1}F_{\lambda_1}^{-1}(X^TX\Delta\beta) - \gamma_{\lambda_1} - (1/\tau)f_{\lambda_1}^{-1},\\
\Delta\gamma_{\lambda_2} &= -\Gamma_{\lambda_2}F_{\lambda_2}^{-1}(-X^TX\Delta\beta) - \gamma_{\lambda_2} - (1/\tau)f_{\lambda_2}^{-1},
\end{aligned}$$
where $F$ is a diagonal matrix with $(F)_{ii} = f_i$. With $(\Delta\beta, \Delta u, \Delta\gamma)$ we have a step direction. To choose the step length $0 < s\leq 1$, we ask that it satisfy two criteria:

1. $\beta + s\Delta\beta$, $u + s\Delta u$ and $\gamma + s\Delta\gamma$ are in the interior, i.e. $f_i(\beta + s\Delta\beta, u + s\Delta u) < 0$ and $\gamma_i > 0$ for all $i$.

2. The norm of the residuals has decreased sufficiently:
$$\|r_\tau(\beta + s\Delta\beta, u + s\Delta u, \gamma + s\Delta\gamma)\|_2 \leq (1 - \alpha s)\cdot\|r_\tau(\beta, u, \gamma)\|_2,$$
where $\alpha$ is a user-specified parameter (in all of our implementations, we have set $\alpha = 0.01$).

Since the $f_i$ are linear functionals, item 1 is easily addressed: we choose the maximum step size that just keeps us in the interior. Let
$$\mathcal{I}_f^+ = \{i : \langle c_i, \Delta z\rangle > 0\},\qquad \mathcal{I}_\gamma^- = \{i : \Delta\gamma_i < 0\},$$
where $z = (\beta, u)^T$ and $f_i = \langle c_i, z\rangle$, and set
$$s_{\max} = 0.99\cdot\min\left\{1,\ \{-f_i(z)/\langle c_i, \Delta z\rangle,\ i\in\mathcal{I}_f^+\},\ \{-\gamma_i/\Delta\gamma_i,\ i\in\mathcal{I}_\gamma^-\}\right\}.$$
Then, starting with $s = s_{\max}$, we check whether item 2 above is satisfied; if not, we set $s' = \nu\cdot s$ and try again. We have taken $\nu = 1/2$ in all of our implementations. When $r_{dual}$ is small, the surrogate duality gap $\eta = -f^T\gamma$ is an approximation of how close a certain $(\beta, u, \gamma)$ is to being optimal (i.e. $\langle c_0, z\rangle - \langle c_0, z^*\rangle \approx \eta$), where $\sum_iu_i = \langle c_0, z\rangle$. The primal-dual algorithm repeats the Newton iterations described above until $\eta$ has decreased below a given tolerance.
5.1.2 DASSO (James et al., 2009)
James et al. (2009) proposed a homotopy algorithm for the Dantzig selector, named the Dantzig selector with sequential optimization (DASSO). DASSO constructs a piecewise linear path while it identifies break points and solves the corresponding linear programs. DASSO is similar to the least angle regression and selection (LARS) algorithm (Efron et al., 2004), an efficient algorithm for the LASSO, and hence its computational cost is comparable to that of LARS. We first describe the LARS algorithm and then go into the details of DASSO. The LARS algorithm is defined as follows.
LARS (Efron et al., 2004)

1. Initialize: $\beta = 0$, $\mathcal{A} = \operatorname{argmax}_j|\nabla L(\beta)|_j$, $\gamma_{\mathcal{A}} = -\mathrm{sgn}(\nabla L(\beta))_{\mathcal{A}}$, $\gamma_{\mathcal{A}^C} = 0$, where $L(\beta) = \sum_i(y_i - x_i^T\beta)^2$.

2. While ($\max|\nabla L(\beta)| > 0$):

(a) $d_1 = \min\{d > 0 : |\nabla L(\beta + d\gamma)_j| = |\nabla L(\beta + d\gamma)_{\mathcal{A}}|,\ j\notin\mathcal{A}\}$ and $d_2 = \min\{d > 0 : (\beta + d\gamma)_j = 0,\ j\in\mathcal{A}\}$. Find the step length $d = \min(d_1, d_2)$.

(b) Take the step: $\beta\leftarrow\beta + d\gamma$.

(c) If $d = d_1$ then add the variable attaining equality at $d$ to $\mathcal{A}$; if $d = d_2$ then remove the variable attaining 0 at $d$ from $\mathcal{A}$.

(d) Calculate the new direction: $\gamma_{\mathcal{A}} = (X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1}\mathrm{sgn}(\beta_{\mathcal{A}})$ and $\gamma_{\mathcal{A}^C} = 0$.

The LARS procedure starts with all coefficients equal to zero and selects the variable most correlated with the response. LARS proceeds in the direction of this variable until some other variable has as much correlation with the current residual, and then adds this new variable to the set of selected variables. The direction is taken to be equiangular among the selected variables and changes whenever an addition or deletion happens. An addition occurs when the correlation of another variable with the current residual becomes the same as that of the selected variables, and a deletion happens when one of the coefficients of the selected variables becomes zero while the LARS procedure continues along the current direction.
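For reference, the LARS/LASSO path described above is available in scikit-learn; a minimal usage sketch is shown below (this calls the library's LARS path, not DASSO itself).

```python
# Illustrative call to scikit-learn's LARS path on synthetic data.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.standard_normal(50)

alphas, active, coefs = lars_path(X, y, method="lar")   # method="lasso" for the LASSO path
print(active[:5])   # order in which variables enter the active set
```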
The DASSO algorithm also solves (5.1) sequentially by constructing a piecewise linear solution path. DASSO is defined as follows.

DASSO (James et al., 2009)

1. Initialize: $l = 1$, $\beta^l = 0$, $\mathcal{A} = \operatorname{argmax}_j|X_j^T(y - X\beta^l)|$, $\mathcal{B} = \{j : \beta^l_j\neq 0\} = \emptyset$, $c = X^T(y - X\beta^l)$, $\gamma_{\mathcal{A}} = -\mathrm{sgn}(c_{\mathcal{A}})$, $\gamma_{\mathcal{A}^C} = 0$, $s_{\mathcal{A}} = \mathrm{sgn}(c_{\mathcal{A}})$.

2. While ($\max_j|X_j^T(y - X\beta^l)| > 0$):

(a) $d_1 = \min\{d > 0 : |X_j^T(y - X(\beta^l + d\gamma))| = |X_{\mathcal{A}}^T(y - X(\beta^l + d\gamma))|,\ j\notin\mathcal{A}\}$ and $d_2 = \min\{d > 0 : (\beta^l + d\gamma)_j = 0,\ j\in\mathcal{A}\}$. Find the step length $d = \min(d_1, d_2)$.

(b) If $d = d_1$ then add the variable attaining equality at $d$ to $\mathcal{A}$ and add the variable $j^*$ to $\mathcal{B}$; if $d = d_2$ then remove the variable attaining 0 at $d$ from $\mathcal{A}$ and $\mathcal{B}$.

(c) Calculate the new direction: $\gamma_{\mathcal{A}} = (X_{\mathcal{A}}^TX_{\mathcal{B}})^{-1}s_{\mathcal{A}}$ and $\gamma_{\mathcal{A}^C} = 0$.

(d) Take the step: $\beta^{l+1}\leftarrow\beta^l + d\gamma$.

(e) $l\leftarrow l + 1$.
The added variable $j^*$ and the distance are defined as follows.

- Define the added variable. Let the $|\mathcal{A}|\times(2p + |\mathcal{A}|)$ matrix $A = (-s_{\mathcal{A}}X_{\mathcal{A}}^TX \;\; s_{\mathcal{A}}X_{\mathcal{A}}^TX \;\; I)$, and let $A_j = \begin{pmatrix}A_{j1}\\ A_{j2}\end{pmatrix}$ be the $j$th column of $A$, where $A_{j2}$ is a scalar. Let $B = \begin{pmatrix}B_1\\ B_2\end{pmatrix}$ consist of the columns of $A$ that correspond to the nonzero components of $\beta^+$ and $\beta^-$, where $B_1$ is a square matrix of dimension $|\mathcal{A}| - 1$. Define
$$j^* = \operatorname*{argmax}_{j : q_j\neq 0,\ \alpha/q_j > 0}\ \left(\mathbf{1}^TB_1^{-1}A_{j1} - 1_{\{j\leq 2p\}}\right)|q_j|^{-1},$$
where $q_j = A_{j2} - B_2B_1^{-1}A_{j1}$ and $\alpha = B_2B_1^{-1}\mathbf{1} - 1$.

- Define the distance. Let the distance be $d = \min\{d_1, d_2\}$, where
$$d_1 = \min_{j\notin\mathcal{A}}\left\{\frac{c_k - c_j}{(X_k - X_j)^TX\gamma},\ \frac{c_k + c_j}{(X_k + X_j)^TX\gamma}\right\}_+$$
for $k\in\mathcal{A}$, and $d_2 = \min_{j\in\mathcal{B}}\{-\beta_j/\gamma_j\}$.

This rule for adding a variable comes from the piecewise linearity of the solution path and the definition of the Dantzig selector; for more detail, see the appendix of James et al. (2009). The distance step is the same as in the LARS algorithm.
5.1.3 Alternating direction method (ADM) (Lu et al., 2012)

The ADM has recently been widely used to solve large-scale problems. The general problems to which the ADM applies have the following form:
$$\min_{x,y}\ f(x) + g(y) \quad\text{subject to}\quad Ax + By = b,\ x\in C_1,\ y\in C_2, \qquad (5.3)$$
where $f$ and $g$ are convex functions, $A$ and $B$ are matrices, $b$ is a vector, and $C_1$ and $C_2$ are closed convex sets. The ADM consists of two subproblems and a multiplier update. Under mild assumptions, it is known that the ADM converges to an optimal solution of (5.3) (Bertsekas and Tsitsiklis, 1989). The ADM formulation for the Dantzig selector is
$$\min_{\beta,z}\ \|\beta\|_1 \quad\text{subject to}\quad X^T(X\beta - y) - z = 0,\ \|D^{-1}z\|_\infty\leq\lambda, \qquad (5.4)$$
where $D$ is the diagonal matrix whose diagonal elements are the norms of the columns of $X$. An augmented Lagrangian function for problem (5.4), for some $\mu > 0$, can be defined as
$$L_\mu(z, \beta, \eta) = \|\beta\|_1 + \eta^T(X^TX\beta - X^Ty - z) + \frac{\mu}{2}\|X^TX\beta - X^Ty - z\|_2^2.$$

ADM algorithm for the Dantzig selector

1. Initialize: let $\beta^0, \eta^0\in\mathbb{R}^p$ and $\mu > 0$.

2. For $k = 0, 1, \ldots$
$$z^{k+1} = \operatorname*{argmin}_{\|D^{-1}z\|_\infty\leq\lambda}L_\mu(z, \beta^k, \eta^k),\qquad \beta^{k+1}\in\operatorname*{argmin}_\beta L_\mu(z^{k+1}, \beta, \eta^k),\qquad \eta^{k+1} = \eta^k + \mu(X^TX\beta^{k+1} - X^Ty - z^{k+1}).$$
End (for)
We now go into the subproblems of the ADM. The dual problem of (5.4) is given by
$$\max_\eta\ d(\eta) := -y^TX\eta - \lambda\|D\eta\|_1 \quad\text{subject to}\quad \|X^TX\eta\|_\infty\leq 1.$$
The first subproblem has the closed form solution
$$z^{k+1} = \operatorname*{argmin}_{\|D^{-1}z\|_\infty\leq\lambda}\left\|z - \left(X^TX\beta^k - X^Ty + \frac{\eta^k}{\mu}\right)\right\|_2^2 = \min\left\{\max\left\{X^TX\beta^k - X^Ty + \frac{\eta^k}{\mu},\ -\lambda d\right\},\ \lambda d\right\},$$
where $d$ is the vector consisting of the diagonal entries of $D$ and the min and max are taken elementwise. For the second subproblem, we can choose $\beta^{k+1}$ which solves the following approximated subproblem:
$$\min_\beta\ \frac{\mu}{2}\left\|X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\right\|_2^2 + \|\beta\|_1.$$
This problem can be solved by the nonmonotone gradient methods for nonsmooth minimization (Lu and Zhang, 2012). The general problems to which the nonmonotone gradient method applies can be written as
$$\min_{x\in\mathcal{X}}\ F(x) := f(x) + P(x),$$
where $f : \mathbb{R}^n\to\mathbb{R}$ is continuously differentiable, $P : \mathbb{R}^n\to\mathbb{R}$ is convex but not necessarily smooth, and $\mathcal{X}\subseteq\mathbb{R}^n$ is closed and convex. Here, $f(\beta) = \frac{\mu}{2}\|X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\|_2^2$ and $P(\beta) = \|\beta\|_1$, so that $\nabla f(\beta) = \mu X^TX\left\{X^TX\beta - X^Ty - z^{k+1} + \frac{\eta^k}{\mu}\right\}$. The nonmonotone gradient method for solving the $\beta$-subproblem is defined as follows:

1. Initialize: $0 < \tau, \sigma < 1$, $0 < \alpha < 1$ and an integer $M\geq 0$. Let $\beta^0$ be given and set $\alpha_0 = 1$.

2. For $l = 0, 1, \ldots$

(a) Let $d^l = \mathrm{SoftThresh}(\beta^l - \alpha_l\nabla f(\beta^l), \alpha_l) - \beta^l$ and $\Delta^l = \nabla f(\beta^l)^Td^l + \|\beta^l + d^l\|_1 - \|\beta^l\|_1$.

(b) Find the largest $\alpha'\in\{1, \tau, \tau^2, \ldots\}$ such that
$$f(\beta^l + \alpha'd^l) + \|\beta^l + \alpha'd^l\|_1 \leq \max_{[l-M]_+\leq i\leq l}\left\{f(\beta^i) + \|\beta^i\|_1\right\} + \sigma\alpha'\Delta^l.$$
Set $\alpha_l\leftarrow\alpha'$, $\beta^{l+1}\leftarrow\beta^l + \alpha_ld^l$ and $l\leftarrow l + 1$.

(c) Update $\alpha_{l+1} = \min\left\{\max\left\{\frac{\|s^l\|^2}{(s^l)^Tg^l}, \alpha\right\}, 1\right\}$, where $s^l = \beta^{l+1} - \beta^l$ and $g^l = \nabla f(\beta^{l+1}) - \nabla f(\beta^l)$.

End (for)

Here $\mathrm{SoftThresh}(v, \gamma) := \mathrm{sgn}(v)\max\{0, |v| - \gamma e\}$, applied elementwise with $e$ the vector of ones. For the specific termination rules used in the ADM, see Lu et al. (2012).
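A rough sketch of the ADM iteration is given below. The z-update uses the closed form projection above, while the β-subproblem is solved here with a few plain proximal-gradient (ISTA) steps rather than the nonmonotone gradient scheme of Lu and Zhang (2012), so this is only an illustration of the update structure, not a faithful reimplementation.

```python
# Illustrative ADM loop for the Dantzig selector formulation (5.4).
import numpy as np

def soft_thresh(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def adm_dantzig(X, y, lam, mu=1.0, n_iter=200, inner=5):
    n, p = X.shape
    G = X.T @ X
    c = X.T @ y
    d = np.linalg.norm(X, axis=0)                  # diagonal of D
    beta = np.zeros(p)
    eta = np.zeros(p)
    L = mu * np.linalg.norm(G, 2) ** 2 + 1e-12     # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        # z-update: projection onto the box ||D^{-1} z||_inf <= lam (closed form)
        z = np.clip(G @ beta - c + eta / mu, -lam * d, lam * d)
        # beta-update: a few ISTA steps on (mu/2)||G beta - c - z + eta/mu||^2 + ||beta||_1
        for _ in range(inner):
            grad = mu * G @ (G @ beta - c - z + eta / mu)
            beta = soft_thresh(beta - grad / L, 1.0 / L)
        # multiplier update
        eta = eta + mu * (G @ beta - c - z)
    return beta
```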
Bibliography
Banerjee, O., El Ghaoui, L., and d’Aspremont, A. (2008). Model selection
through sparse maximum likelihood estimation for multivariate gaussian or
binary data. The Journal of Machine Learning Research, 9:485–516.
Becker, S. R., Candes, E. J., and Grant, M. C. (2011). Templates for convex
cone problems with applications to sparse signal recovery. Mathematical
Programming Computation, 3(3):165–218.
Bertsekas, D. and Tsitsiklis, J. (1989). Parallel and distributed computation.
Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance
matrices. The Annals of Statistics, pages 199–227.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of
lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732.
Breiman, L. (1996). Heuristics of instability and stabilization in model selec-
tion. The Annals of Statistics, 24(6):2350–2383.
Cai, T., Liu, W., and Luo, X. (2011). A constrained l1 minimization approach
to sparse precision matrix estimation. Journal of the American Statistical
Association, 106(494):594–607.
Cai, T. T., Liu, W., Luo, X. R., and Luo, M. X. R. (2012). Package ’clime’.
Candes, E. and Plan, Y. (2009). Near-ideal model selection by l1 minimization.
The Annals of Statistics, 37(5A):2145–2177.
Candes, E. and Romberg, J. (2005). l1-magic: Recovery of sparse signals via convex programming. URL: www.acm.caltech.edu/l1magic/downloads/l1magic.pdf.
Candes, E. and Tao, T. (2005). Decoding by linear programming. Information
Theory, IEEE Transactions on, 51(12):4203–4215.
Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation
when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
Chen, J. and Chen, Z. (2008). Extended bayesian information criteria for
model selection with large model spaces. Biometrika, 95(3):759–771.
Dicker, L. (2010). Regularized Regression Methods for Variable Selection and
Estimation. Collections of the Harvard University Archives: Dissertations.
Harvard University.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle
regression. The Annals of Statistics, 32(2):407–499.
Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive
lasso and scad penalties. The Annals of Applied Statistics, 3(2):521.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Associ-
ation, 96(456):1348–1360.
Fan, J., Xue, L., and Zou, H. (2012). Strong oracle optimality of folded concave
penalized estimation. arXiv preprint arXiv:1210.5992.
Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemo-
metrics regression tools. Technometrics, 35(2):109–135.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance
estimation with the graphical lasso. Biostatistics, 9(3):432–441.
Friedman, J., Hastie, T., Tibshirani, R., and Tibshirani, M. R. (2013). Package
’glasso’.
Gai, Y., Zhu, L., and Lin, L. (2013). Model selection consistency of dantzig
selector. Statistica Sinica, 23:615–634.
Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional lin-
ear predictor selection and the virtue of overparametrization. Bernoulli,
10(6):971–988.
Hess, K. R., Anderson, K., Symmans, W. F., Valero, V., Ibrahim, N., Mejia,
J. A., Booser, D., Theriault, R. L., Buzdar, A. U., Dempsey, P. J.,
et al. (2006). Pharmacogenomic predictor of sensitivity to preoperative
chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophos-
phamide in breast cancer. Journal of Clinical Oncology, 24(26):4236–4244.
Huang, J., Horowitz, J. L., and Ma, S. (2008a). Asymptotic properties of
bridge estimators in sparse high-dimensional regression models. The Annals
of Statistics, 36(2):587–613.
Huang, J., Ma, S., and Zhang, C.-H. (2008b). Adaptive lasso for sparse high-
dimensional regression models. Statistica Sinica, 18(4):1603.
James, G. M., Radchenko, P., and Lv, J. (2009). Dasso: connections between
the dantzig selector and lasso. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 71(1):127–142.
Kim, Y., Choi, H., and Oh, H. (2008). Smoothly clipped absolute devia-
tion on high dimensions. Journal of the American Statistical Association,
103(484):1665–1673.
Kim, Y. and Kwon, S. (2012). Global optimality of nonconvex penalized esti-
mators. Biometrika, 99(2):315–325.
Kim, Y., Kwon, S., and Choi, H. (2012). Consistent model selection criteria on
high dimensions. The Journal of Machine Learning Research, 98888:1037–
1057.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large
covariance matrix estimation. Annals of statistics, 37(6B):4254.
Lu, Z. (2012). Primal–dual first-order methods for a class of cone programming.
Optimization Methods and Software, 28(6):1262–1281.
Lu, Z., Pong, T. K., and Zhang, Y. (2012). An alternating direction method
for finding dantzig selectors. Computational Statistics & Data Analysis.
Lu, Z. and Zhang, Y. (2012). An augmented lagrangian approach for sparse
principal component analysis. Mathematical programming, 135(1-2):149–
193.
Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and vari-
able selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
Meinshausen, N., Rocha, G., and Yu, B. (2007). Discussion: A tale of three
cousins: Lasso, l2boosting and dantzig. The Annals of Statistics, 35(6):2373–
2384.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A
unified framework for high-dimensional analysis of m-estimators with de-
composable regularizers. Statistical Science, 27(4):538–557.
Peng, J., Wang, P., Zhou, N., and Zhu, J. (2009). Partial correlation estima-
tion by joint sparse regression models. Journal of the American Statistical
Association, 104(486):735–746.
Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue
properties for correlated gaussian designs. The Journal of Machine Learning
Research, 99:2241–2259.
Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of es-
timation for high-dimensional linear regression over `q-balls. Information
Theory, IEEE Transactions on, 57(10):6976–6994.
Ravikumar, P., Wainwright, M., Raskutti, G., and Yu, B. (2011).
High-dimensional covariance estimation by minimizing l1-penalized log-
determinant divergence. Electronic Journal of Statistics, 5:935–980.
Romberg, J. (2008). The dantzig selector and generalized thresholding. In In-
formation Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference
on, pages 22–25. IEEE.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random
measurements. Information Theory, IEEE Transactions on, 59(6):3434–
3447.
Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A.,
Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant,
T. L., et al. (2006). Regulation of gene expression in the mammalian eye
and its relevance to eye disease. Proceedings of the National Academy of
Sciences, 103(39):14429–14434.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society. Series B (Methodological), pages 267–288.
Wang, L., Kim, Y., and Li, R. (2013). Calibrating nonconvex penalized regres-
sion in ultra-high dimension. The Annals of Statistics, 41(5):2505–2536.
Wang, X. and Yuan, X. (2012). The linearized alternating direction method
of multipliers for dantzig selector. SIAM Journal on Scientific Computing,
34(5):A2792–A2811.
Yuan, M. (2010). High dimensional inverse covariance matrix estimation via
linear programming. The Journal of Machine Learning Research, 99:2261–
2286.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the gaussian
graphical model. Biometrika, 94(1):19–35.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax con-
cave penalty. The Annals of Statistics, 38(2):894–942.
Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection
in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–
1594.
Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regulariza-
tion for high-dimensional sparse estimation problems. Statistical Science,
27(4):576–593.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The
Journal of Machine Learning Research, 7:2541–2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101(476):1418–1429.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized
likelihood models. Annals of Statistics, 36(4):1509.
국문초록
변수선택은 고차원 회귀분석에서 중요하다. 단계별선택법과 같은 전통적인
변수선택방법들은데이터에따라서선택된변수들이달라지므로불안정하
다. 이에 대한 대안으로 변수선택과 추정을 동시에 하는 벌점화 방법론들이
사용된다. 라소 추정량은 희소성을 가지지만 변수선택 일치성이 없으며 편
향되어있다. SCAD, MCP와 같은 비볼록 벌점화 방법론들은 선택일치성을
가지며비편향추정량임이잘알려져있다.하지만이방법론들은다중국소
해들을 가질 수 있으며 조율계수에 따라 계산이 불안정하다. 신의 추정량을
유일한 국소해로 가질 수 있는 라소에 기반을 둔 이단계 방법론들이 개발되
었다.
본 연구에서는 라소를 단치그 셀렉터로 변형시킨 새로운 이단계 방법론
을제안한다.제안한방법은이단계방법론의두번째단계에서잡음변수들
의 영향력을 줄이는 것이 매우 중요하다는 점에 착안하였다. 단치그 셀렉터
의 `1-놈은같은조율계수에대해라소의 `1-놈보다작거나같고비점근적오
차범위 또한 단치그 셀렉터가 라소보다 작은 경향이 있다. 그러므로 우리는
이단계방법론에라소대신단치그셀렉터를이용하므로변수선택일치성을
만족시키면서 추정을 좀더 개선시킬 것이라 기대한다. 본 연구에서는 자료
에 대한 조건 가정 아래, 제안한 방법을 통해 신의 추정량을 얻을 수 있음을
증명하였다. 그리고 수치적 연구를 통해 변수선택과 추정에 있어서 이단계
단치그 셀렉터가 라소에 기반을 둔 이단계 방법론을 개선시킬 수 있으며,
기존의 다른 방법론들과 비교해서도 좋은 성능을 보임을 확인하였다.
본 연구에서는 추가적으로 공분산 역행렬 추정에 이단계 방법론들을 적
용한다. 공분산 역행렬은 다양한 통계적 문제에 활용되며 그 자체로 조건부
상관성을 내포하므로 매우 중요하다. 제안된 방법을 통해 제약 조건 하에서
공분산역행렬계수가 0인지에대한선택일치성을가질수있으며, 0이아닌
참공분산역행렬계수들에대해√n-일치성을가짐을보였다.수치적연구를
통해 제안된 방법이 계수 선택과 추정에 있어서 기존의 방법론들보다 좋은
성능을 가짐을 확인하였다.
주요어 : 고차원 회귀분석, 변수 선택, 단치그 셀렉터, 선택 일치성, 신의
추정량, 공분산 역행렬
학 번 : 2007− 20263
감사의 글
학부와 대학원의 관악에서의 11년간 좋은 교수님, 친구, 선배, 후배, 도움주신 많은 분
들과 만나게 하시고 모든 과정을 하나님의 은혜로 마치게 해주심에 감사드립니다. 통계의
전문성을 갖춰 공공의 유익이 되고 싶은 소망을 가지고 대학원에 진학한지도 7년이 지나
이제박사로첫발을내딛으려하니감회가새롭습니다.새출발에앞서지난기간동안도움
주신 많은 분들께 감사의 말씀을 전하고 싶습니다.
가장 먼저 박사과정 전반에 다양한 기회를 주시고 지도해주신 김용대 교수님께 감사드
립니다.그리고항상푸근하게맞아주시고독려해주신전종우교수님께감사드립니다.논문
심사에서조언해주시고수고해주신박병욱교수님,장원철교수님,임요한교수님께도정말
감사드립니다.연구실선배님이신멋있는권성훈교수님께도논문심사와여러가지조언들
에 감사드립니다. 6여 년간의 연구실 생활에 좋으신 선후배님들과 함께여서 즐거웠습니다.
최호식 교수님, 동화오빠, 도현오빠, 범수오빠, 광수아저씨, 재석오빠, 상인오빠, 병엽오빠,
종준오빠, 미애언니, 수희언니, 신선언니, 영희, 건수오빠, 혜림이, 미경이, 효원이, 지영이,
원준이, 지선이, 지영언니, 주유오빠, 민우, 승환이, 우성이, 재성이, 슬기, 동하, 세민이, 구
환이,윤영언니,승남이,오란이그리고네이버팀김유원이사님,정효주박사님,인재오빠,
연하언니 정말 감사했습니다. 학부 때부터 단짝친구 영선이, 3년 넘게 룸메로 힘이 되어준
신영이, 귀엽고 속 깊은 동화 같은 정은이, 박사동기 예지, 성일오빠, 말년에 즐거운 시간들
함께해준 선미언니, 과사 정환언니에게 고마운 마음 전하고 싶습니다.
그리고저의대학원생활전반을함께한통계학과기독인모임에감사드립니다.모임의
큰기둥이되어주신오희석교수님,바쁘신중에도참여해주신박태성교수님,맛있는식사
와격려로힘이되어주신조명희교수님,예배로함께해주신이영조교수님,박성현교수님,
송문섭교수님께도감사드립니다.민정언니,성준오빠때부터지금의지영이,민주,은용이,
성경오빠,재혁이, 보창이까지함께 나누고 교제할 수 있어서 즐거웠습니다. 대학원 생활의
활력소였던 수요채플 찬양 팀 준오빠, 정민이, 은혜, 바우오빠, 송희언니, 민우, 정민오빠,
지웅오빠, 재희, 나래, 바뚜, 문수오빠, 신잉, 서교교수님, 경주오빠, 현주, 지나, 찬미, 경만
이, 서림이, 건의, 민화언니, 소정언니, 소영이 모두 덕분에 정말 감사하고 즐거웠습니다.
사랑스러운보신자매들윤진이,효현이,지인이와의소소한즐거움들에정말감사했습니다.
그리고 닮고 싶은 바른 그리스도인의 전형을 보여주신 남승호 교수님과 이원종 교수님, 신
앙성장에도움주시고격려해주신대학교회김동식목사님,마리아사모님,홍종인교수님,
김난주 사모님께도 감사드립니다.
마지막으로 한결같은 사랑과 신뢰, 기도로 뒷받침해주신 부모님과 든든한 버팀목이 되
어준 동생에게 감사합니다. 오랜 지기 새미, 정현이에게도 고마운 마음 전합니다. 수학에
흥미를 가질 수 있도록 도움 주신 은사님이신 구명수 선생님께도 그간 연락드리지 못해
죄송하고 정말 감사했습니다. 그리고 기도로 응원해주신 친척 분들과 교회식구들께 감사
드립니다. 아직도 많이 부족하지만 앞으로 더욱 성실함과 진실함, 사랑하는 마음으로 제가
속한 공동체와 나라에 유익이 되기 위해 노력하겠습니다. 감사합니다.
2014년 2월 한 상 미