Attribution-NonCommercial-NoDerivs 2.0 Korea. Users may freely copy, distribute, transmit, display, perform, and broadcast this work, subject to the following conditions: when reusing or distributing the work, you must clearly indicate the license terms applied to it; these conditions do not apply if separate permission is obtained from the copyright holder. Users' rights under copyright law are not affected by the above. This is an easy-to-read summary of the Legal Code. Disclaimer. Attribution: you must credit the original author. NonCommercial: you may not use this work for commercial purposes. NoDerivs: you may not alter, transform, or build upon this work.




Doctor of Philosophy Dissertation

Groupwisely Sparse Penalty for Highly Correlated Covariates

(Korean title: 상관성이 높은 공변량 자료에서 그룹변수 선택 방법을 위한 벌점화 방법 연구 — A Study on Penalization Methods for Group Variable Selection with Highly Correlated Covariates)

February 2013

Graduate School, Seoul National University
Department of Statistics

Miae Oh (오미애)


Groupwisely Sparse Penalty for Highly Correlated Covariates

Advisor: Professor 김용대

Submitted in October 2012 as a Doctor of Philosophy dissertation

Graduate School, Seoul National University
Department of Statistics

Miae Oh (오미애)

The Doctor of Philosophy dissertation of Miae Oh (오미애) is hereby approved.

December 2012

Committee Chair: 전종우 (seal)
Vice Chair: 김용대 (seal)
Member: 임요한 (seal)
Member: 장원철 (seal)
Member: 최호식 (seal)


Groupwisely Sparse Penalty for Highly

Correlated Covariates

By

Miae Oh

A Thesis

Submitted in fulfillment of the requirements

for the degree of

Doctor of Philosophy

in Statistics

Department of Statistics

College of Natural Sciences

Seoul National University

February, 2013


Abstract

This paper considers the problem of model selection and estimation in sparse, high-dimensional regression models where the covariates are grouped. We propose a new regularization method that can reflect the correlation structure between groups: a combination of the group MCP penalty and a groupwise quadratic penalty called the groupwise weighted ridge penalty. The former ensures groupwise sparsity and the latter promotes simultaneous selection of highly correlated groups. We show that the resulting estimator satisfies the group selection consistency and derive an optimization algorithm for the proposed method. Numerical studies show that the estimator performs well in comparison with other existing methods.

Keywords: Group variable selection, high-dimensional data, penalized regression, weighted ridge, oracle property.

Student Number: 2006-20276


Contents

Abstract

1 Introduction
  1.1 Overview
  1.2 Outline of the thesis

2 Literature review: Variable selection
  2.1 Individual variable selection
  2.2 Group variable selection

3 New penalized method
  3.1 Sparse groupwise weighted ridge penalty
  3.2 Theoretical properties

4 Computation
  4.1 Group Lasso
    4.1.1 Orthogonal case
    4.1.2 General case
  4.2 Group Laplacian method
    4.2.1 Orthogonal case
    4.2.2 General case
  4.3 Convex concave procedure
  4.4 Sparse groupwise weighted ridge
    4.4.1 Orthogonal case
    4.4.2 General case

5 Numerical studies
  5.1 Simulation studies
  5.2 Real data
    5.2.1 Wine quality
    5.2.2 Microarray gene expression data

6 Concluding remarks

Abstract (in Korean)


List of Tables

5.1 Simulation results of Example 1: R = 0.3
5.2 Simulation results of Example 1: R = 0.5
5.3 Simulation results of Example 1: R = 0.8
5.4 Simulation results of Example 2: R = 0.8
5.5 Simulation results of Example 3
5.6 The results of the White Wine data
5.7 The results of the Trim data


List of Figures

4.1 Blockwise coordinate descent algorithm
4.2 Single Line Search algorithm
4.3 Group Laplacian algorithm
4.4 CCCP algorithm
5.1 Example 1, R = 0.3: boxplots of the 50 prediction errors
5.2 Example 1, R = 0.5: boxplots of the 50 prediction errors
5.3 Example 1, R = 0.8: boxplots of the 50 prediction errors
5.4 Example 2, R = 0.8: boxplots of the 50 prediction errors
5.5 Example 3: boxplots of the 50 prediction errors
5.6 White Wine: the partial fit on selected covariates
5.7 TRIM: boxplots of the 50 prediction errors


Chapter 1

Introduction

1.1 Overview

Variable selection is a classical topic in high-dimensional statistical modeling. In recent years, sparse penalized methods have received much attention. Compared with traditional variable selection methods such as stepwise and best-subset selection, sparse penalized methods can perform estimation and automatic variable selection simultaneously.

There are many popular sparse penalized methods. The least absolute shrinkage and selection operator (Lasso) proposed by Tibshirani [1996] uses the l1 penalty. The smoothly clipped absolute deviation (SCAD) proposed by Fan and Li [2001] and the minimax concave penalty (MCP) proposed by Zhang [2010] are nonconvex penalization methods. Fan and Li [2001] showed that the SCAD estimator is asymptotically equivalent to the oracle estimator, and Kim et al. [2008] extended this result to high-dimensional cases. The MCP estimator also has the oracle property [Zhang, 2010].

These sparse penalized estimators, however, may not be optimal, especially when there is collinearity among the covariates. To use the collinearity of covariates more effectively, Zou and Hastie [2005] proposed the Elastic net penalty, a linear combination of the l1 and l2 penalties: it yields a sparse solution due to the l1 penalty while selecting groups of highly correlated covariates through the l2 penalty. Daye and Jeng [2009] proposed the weighted fusion penalty, a combination of the l1 and weighted fusion penalties. Bondell and Reich [2008] proposed OSCAR (octagonal shrinkage and clustering algorithm for regression), a combination of the l1 and pairwise l∞ penalties. The Mnet proposed by Huang et al. [2010a] is a combination of the MCP and l2 penalties. Huang et al. [2011] proposed the sparse Laplacian shrinkage estimator based on a combination of the MCP and Laplacian quadratic penalties. Kim [2012] proposed the sparse weighted ridge penalty, which combines the MCP and the weighted ridge penalty, and showed that it is equivalent to the sparse Laplacian penalty when the weights are chosen accordingly. Compared with


the Laplacian penalty, the sparse weighted ridge penalty is simpler to use.

The sparse penalized methods mentioned above were developed for individual variable selection. Regression problems often arise in which potential covariates are grouped naturally: categorical factors can be represented by a group of indicator variables, and continuous factors may be represented by a group of basis functions. The prediction performance of standard variable selection methods that ignore the grouping structure may not be optimal. When grouping structures exist, we are interested in selecting important groups as well as important variables. Recently, many authors have studied sparse penalized methods for group variable selection. Yuan and Lin [2006] proposed the group Lasso, which uses an l2 norm of the coefficients within each group. The group SCAD penalty proposed by Wang et al. [2007] and the group MCP proposed by Huang et al. [2010b] are nonconvex penalties for group selection.

These methods enforce sparsity only at the group level, because they use the l2 norm within each group. When there are strong correlations among groups, they may not perform well, since they do not fully utilize the correlations between groups. Liu et al. [2011] proposed the smoothed group Lasso, which is a combination of the group Lasso and a quadratic penalty on the differences of the regression coefficients of adjacent groups. The smoothed group Lasso penalty accommodates only the correlation between adjacent groups,


and so the performance of the estimator depends on the locations of the groups. To overcome this drawback, we can consider the group Laplacian penalty, an extension of the smoothed group Lasso penalty in which the differences of the l2 norms of the coefficients of all pairs of groups are considered. A problem with the group Laplacian penalty is the difficulty of computation, since the penalty is nonconvex.

In this paper, we propose a new penalized method called sparse groupwise

weighted ridge penalty, which is a combination of the group MCP and the

groupwise weighted ridge penalty. The groupwise weighted ridge penalty is a

group version of the weighted ridge penalty proposed by Kim [2012] which en-

courages selection of highly correlated groups. The group MCP yields the

groupwise sparsity. An advantage of the sparse groupwise weighted ridge

penalty is that it considers correlations between groups since the weights can

control the influences between correlated groups. Also, computation can be

easier since the groupwise weighted ridge penalty is always strictly convex.

Zou and Hastie [2005] proposed the post-scaling for the elastic net estimator to avoid over-regularization. For the sparse groupwise weighted ridge penalty, we propose a similar post-scaling, and we show that the post-scaled sparse groupwise weighted ridge estimator converges to the optimal estimator when two groups of covariates are exactly the same. See Chapter 3 for details.


We prove the group selection consistency of the sparse groupwise weighted

ridge estimator. We develop an optimization algorithm which is efficient and

stable even in a high-dimensional setting. Numerical studies and real data

show that the sparse groupwise weighted ridge estimator outperforms the other

competitors in terms of the prediction accuracy, and selects the signal groups

as well.

1.2 Outline of the thesis

This thesis is organized as follows. In Chapter 2, we review various penalized regression methods. In Chapter 3, we propose a new penalized method, the sparse groupwise weighted ridge penalty, and study the theoretical properties of the resulting estimator. In Chapter 4, we describe an algorithm for computing the sparse groupwise weighted ridge estimator and compare it with other computational algorithms. In Chapter 5, we present results of simulation studies as well as real data analysis. In Chapter 6, we give concluding remarks. Proofs of the oracle properties of the proposed method and other technical details are provided in the Appendix.


Chapter 2

Literature review : Variable

selection

Variable selection is an important issue, especially in high-dimensional regression models. Many authors have considered variable selection problems in various statistical models. Sparse penalized methods are one of the most important ways to perform parameter estimation and variable selection simultaneously. In this chapter, we review several sparse penalized methods. Variable selection can be categorized into two types: individual variable selection and group variable selection. Section 2.1 deals with individual variable selection, and group variable selection is discussed in Section


2.2.

2.1 Individual variable selection

Consider a linear regression model with p predictors,
\[ y = \sum_{j=1}^{p} x_j\beta_j + \varepsilon, \]
with n observations, where y = (y_1, \ldots, y_n)^T is the vector of n responses, x_j = (x_{1j}, \ldots, x_{nj})^T is the jth predictor vector, \beta_j is the jth regression coefficient, and \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T is the vector of random errors. A sparse penalized approach estimates the regression coefficients \beta by minimizing
\[ \frac{1}{2n}\Big\|y - \sum_{j=1}^{p} x_j\beta_j\Big\|_2^2 + J_\lambda(\beta), \]
where J_\lambda(\cdot) is a penalty function and \lambda is a regularization parameter controlling the effect of the penalty. Here \|\cdot\|_2 is the l2 norm. Many works have

been done to develop penalty functions. Tibshirani [1996] proposed the Lasso.

The Lasso penalty is defined as
\[ J_\lambda(\beta) = \lambda\sum_{j=1}^{p} |\beta_j|. \]

Because of the singularity of the penalty function at the origin, the Lasso can shrink the estimated coefficients to exactly zero when the regularization


parameter λ is large enough. Zou [2006] proposed the adaptive Lasso, where

adaptive weights are used for penalizing different coefficients in the l1 penalty.

The adaptive Lasso satisfies the selection consistency.

Fan and Li [2001] argued that a desirable penalty function should have three properties: unbiasedness, sparsity, and continuity. The SCAD penalty proposed by Fan and Li [2001] satisfies all three. The SCAD penalty function is given by

\[ J_\lambda(\beta) = \begin{cases} \lambda|\beta|, & |\beta| < \lambda, \\[4pt] \dfrac{a\lambda(|\beta|-\lambda) - (\beta^2-\lambda^2)/2}{a-1} + \lambda^2, & \lambda \le |\beta| < a\lambda, \\[4pt] \dfrac{(a-1)\lambda^2}{2} + \lambda^2, & |\beta| \ge a\lambda. \end{cases} \]

The singularity of the SCAD penalty function at the origin leads to sparse solutions, while the flat part of the penalty yields no shrinkage on large nonzero coefficients. Therefore, the SCAD estimator does not introduce bias for large coefficients.

Zhang [2010] proposed the MCP methodology for unbiased penalized se-

lection.

\[ J_\lambda(\beta) = \begin{cases} \lambda|\beta| - \dfrac{\beta^2}{2a}, & |\beta| \le a\lambda, \\[4pt] \tfrac{1}{2}a\lambda^2, & |\beta| > a\lambda. \end{cases} \]
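As a concrete illustration (not part of the original derivation), the following sketch evaluates the Lasso, SCAD, and MCP penalty functions defined above; the names `lam` and `a` stand for λ and a, and the default values of a are only illustrative.

```python
import numpy as np

def lasso_penalty(beta, lam):
    # J_lambda(beta) = lam * |beta|
    return lam * np.abs(beta)

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li [2001], evaluated elementwise."""
    b = np.abs(beta)
    small = lam * b                                                        # |beta| < lam
    mid = (a * lam * (b - lam) - (b**2 - lam**2) / 2) / (a - 1) + lam**2   # lam <= |beta| < a*lam
    large = (a - 1) * lam**2 / 2 + lam**2                                  # |beta| >= a*lam
    return np.where(b < lam, small, np.where(b < a * lam, mid, large))

def mcp_penalty(beta, lam, a=3.0):
    """MCP of Zhang [2010], evaluated elementwise."""
    b = np.abs(beta)
    inner = lam * b - b**2 / (2 * a)   # |beta| <= a*lam
    flat = a * lam**2 / 2              # |beta| > a*lam
    return np.where(b <= a * lam, inner, flat)

beta = np.linspace(-4, 4, 9)
print(scad_penalty(beta, lam=1.0))
print(mcp_penalty(beta, lam=1.0))
```

Both penalties are continuous at the breakpoints and become constant for large |β|, which is the source of their (near) unbiasedness for large coefficients.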

The SCAD and MCP penalty functions satisfy the oracle property described in Fan and Li [2001]. The oracle property means asymptotic equivalence between the estimator and the oracle estimator, an ideal non-penalized estimator derived as if we knew the irrelevant variables in advance. These penalties are desirable; however, they do not take the collinearity among covariates into account.

Zou and Hastie [2005] proposed the Elastic net for variable selection with

highly correlated variables.

\[ J_\lambda(\beta) = \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2\sum_{j=1}^{p}\beta_j^2. \]

The Elastic net promotes a grouping effect unlike the Lasso. The Elastic net

yields a sparse solution due to the Lasso term while it selects a group of highly

correlated covariates by the ridge term.

The weighted fusion method proposed by Daye and Jeng [2009] combines

the weighted fusion penalty with the Lasso penalty.

\[ J_\lambda(\beta) = \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2 J(\beta), \qquad J(\beta) = \frac{1}{p}\sum_{i<j} w_{ij}(\beta_i - s_{ij}\beta_j)^2. \]
The weighted fusion utilizes correlation

information from the data by penalizing for pairwise differences of coefficients

via correlation-driven weights. And they use the Lasso penalty to produce

sparse solutions.

Huang et al. [2010a] proposed Mnet using a combination of MCP and ridge

penalties.

\[ J_\lambda(\beta) = J_{\lambda_1}(\beta) + \lambda_2\sum_{j=1}^{p}\beta_j^2, \]


where J_{\lambda_1}(\cdot) is an MCP. The Mnet estimator is equal to the oracle ridge estimator with high probability, which means that the Mnet satisfies the selection consistency.

The sparse Laplacian shrinkage estimator proposed by Huang et al. [2011]

considers the correlation patterns among covariates.

\[ J_\lambda(\beta) = J_{\lambda_1}(\beta) + \frac{\lambda_2}{2}\sum_{1\le j<k\le p}|a_{jk}|(\beta_j - s_{jk}\beta_k)^2, \]
where a_{jk} is the strength of the connection between x_j and x_k, s_{jk} = sgn(a_{jk}), and J_{\lambda_1}(\cdot) is the MCP. The sparse Laplacian shrinkage estimator uses the data themselves to construct the graphical structure.

Kim [2012] proposed the sparse weighted ridge estimator, which is a com-

bination of a sparse penalty and the weighted ridge penalty.

\[ J_\lambda(\beta) = J_{\lambda_1}(\beta) + \lambda_2\sum_{j=1}^{p} w_j\beta_j^2, \]
where J_{\lambda_1}(\cdot) is an MCP. The weight w_j can be chosen through adjacency measures: for given adjacency measures a_{jk}, one can set w_j = 1 + \sum_{k \neq j}|a_{jk}|. By letting a_{jk} = corr(x_j, x_k), the weighted ridge penalty can be compared with the Laplacian penalty. Kim [2012] showed that the weighted ridge penalty is simpler to use in practice, since only p weights need to be specified, whereas there are p(p-1)/2 adjacency measures in the Laplacian penalty.
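For illustration, a minimal sketch of this weight construction, assuming the adjacency measure a_{jk} is taken to be the sample correlation corr(x_j, x_k):

```python
import numpy as np

def weighted_ridge_weights(X):
    """w_j = 1 + sum_{k != j} |a_jk| with a_jk = corr(x_j, x_k)."""
    A = np.corrcoef(X, rowvar=False)   # p x p correlation matrix
    np.fill_diagonal(A, 0.0)           # exclude a_jj from the sum
    return 1.0 + np.abs(A).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(100)   # two highly correlated columns
print(weighted_ridge_weights(X))                     # columns 0 and 1 receive larger weights
```

Columns that are strongly connected to the rest of the design receive larger ridge weights and are therefore shrunk more strongly toward each other.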


2.2 Group variable selection

When grouping structures exist among the covariates, we can use group penalties, which induce sparsity at the group level. In group variable selection, we consider the case where the covariates can be decomposed into K groups and the number of covariates in the kth group is p_k, with p = \sum_{k=1}^{K} p_k. Let X_k = (x_{1k}, \ldots, x_{nk})^T, k = 1, \ldots, K, where x_{ik}, i = 1, \ldots, n, are the p_k \times 1 vectors of covariates in the kth group. The model is
\[ y = \sum_{k=1}^{K} X_k\beta_k + \varepsilon, \]
where \beta_k = (\beta_{k(1)}, \ldots, \beta_{k(p_k)})^T.

A sparse penalized method for group variable selection estimates the regression coefficients \beta by minimizing
\[ \frac{1}{2n}\Big\|y - \sum_{k=1}^{K} X_k\beta_k\Big\|^2 + J_\lambda(\beta), \]
where J_\lambda(\cdot) is a penalty function and \lambda \ge 0 is a tuning parameter.

For a vector \upsilon \in R^d, d \ge 1, and a positive definite matrix M, we denote \|\upsilon\|_M = (\upsilon^T M \upsilon)^{1/2}. The group Lasso penalty is defined as
\[ J_\lambda(\beta) = \lambda\sum_{k=1}^{K}\sqrt{p_k}\,\|\beta_k\|_{M_k}, \]
where the \sqrt{p_k} term accounts for the varying group sizes.

Yuan and Lin [2006] mentioned that there are many reasonable choices for the kernel matrices M_k and chose M_k = I_{p_k}. In this paper, we write \|\beta_k\|_{I_{p_k}} = \|\beta_k\| for notational simplicity. Some methods take M_k = X_k^T X_k/n; in that case we write M_k = \Sigma_k. There is a large literature on the group Lasso: Kim et al. [2006] introduced group selection for general loss functions, and Meier et al. [2008] extended the group Lasso to logistic regression models. Wang and Leng [2008] proposed the adaptive group Lasso method, which satisfies the selection consistency.

Wang et al. [2007] proposed the group SCAD.

\[ J_\lambda(\beta) = \sum_{k=1}^{K} p_\lambda(\|\beta_k\|), \]
where p_\lambda(\cdot) is the SCAD penalty function. They showed that the group SCAD estimator satisfies the group selection consistency.

Huang et al. [2010b] proposed the group MCP.

\[ J_\lambda(\beta) = \sum_{k=1}^{K} P_\lambda(\|\beta_k\|_{\Sigma_k}), \]
where \|\beta_k\|_{\Sigma_k} = (\beta_k^T\Sigma_k\beta_k)^{1/2}, \Sigma_k = X_k^T X_k/n, and
\[ P_\lambda(t) = \begin{cases} \sqrt{p_k}\,\lambda t - \dfrac{t^2}{2a}, & t \le a\sqrt{p_k}\,\lambda, \\[4pt] \tfrac{1}{2}a\lambda^2 p_k, & t > a\sqrt{p_k}\,\lambda. \end{cases} \]


The methods mentioned above use an l2 norm of the sub-coefficients. They do

not consider the strong correlation among groups.

Liu et al. [2011] proposed the smoothed group Lasso which can accommo-

date the correlation between adjacent groups.

\[ J_\lambda(\beta) = \sum_{k=1}^{K}\lambda_1\sqrt{p_k}\,\|\beta_k\|_{\Sigma_k} + \frac{\lambda_2}{2}\sum_{k=1}^{K-1}\zeta_k\,p_{\max}\Big(\frac{\|\beta_k\|_{\Sigma_k}}{\sqrt{p_k}} - \frac{\|\beta_{k+1}\|_{\Sigma_{k+1}}}{\sqrt{p_{k+1}}}\Big)^2, \]
where \|\beta_k\|_{\Sigma_k} = (\beta_k^T\Sigma_k\beta_k)^{1/2}, p_{\max} = \max\{p_k : k = 1, \ldots, K\}, and \zeta_k is the canonical correlation between the two adjacent groups.

The smoothed group Lasso is a combination of the group Lasso penalty and a quadratic penalty on the differences of the regression coefficients of adjacent groups. This method has a flaw: its behavior depends on the locations of the correlated groups, since the quadratic penalty only considers differences between adjacent groups. An alternative is the group Laplacian penalty, an extension of the smoothed group Lasso penalty of Liu et al. [2011]:

\[ J_\lambda(\beta) = \sum_{k=1}^{K}\lambda_1\sqrt{p_k}\,\|\beta_k\| + \frac{\lambda_2}{2}\sum_{1\le k<j\le K}\zeta_{kj}\,p_{\max}\Big(\frac{\|\beta_k\|}{\sqrt{p_k}} - \frac{\|\beta_j\|}{\sqrt{p_j}}\Big)^2. \]
The group Laplacian penalty considers the pairwise correlations among all groups, so it does not depend on where the correlated groups are located. Its limitation is computational: the second term is nonconvex, which makes the optimization difficult. In the next chapter, we propose a new penalized method for group variable selection to


take into account the correlation among groups.


Chapter 3

New penalized method

We take into account high correlation between groups of coefficients and propose a new regularization method called the sparse groupwise weighted ridge penalty. It extends the sparse weighted ridge method of Kim [2012] and is a combination of the group MCP and a groupwise quadratic penalty, which we call the groupwise weighted ridge penalty.

3.1 Sparse groupwise weighted ridge penalty

The objective function is the following:
\[ Q_\lambda(\beta) = \frac{1}{2n}\Big\|Y - \sum_{k=1}^{K}X_k\beta_k\Big\|^2 + \sum_{k=1}^{K}J_{\lambda_1}(\|\beta_k\|) + \frac{\lambda_2}{2}\sum_{k=1}^{K}w_k\|\beta_k\|^2, \]
where w_k is the weight for the kth group and
\[ J_{\lambda_1}(t) = \begin{cases} \sqrt{p_k}\,\lambda_1 t - \dfrac{t^2}{2a}, & t \le a\sqrt{p_k}\,\lambda_1, \\[4pt] \tfrac{1}{2}a\lambda_1^2 p_k, & t > a\sqrt{p_k}\,\lambda_1. \end{cases} \]

The first term is the risk and the second and third terms are penalties: the second term is the group MCP, which ensures groupwise sparsity, and the third term, the groupwise weighted ridge penalty, promotes selection of highly correlated groups. Within the same group, the weights of the covariates are the same. The weight measures the strength of the connection between groups and can therefore be viewed as a type of groupwise smoothing: if one group has a strong connection with the other groups, its weight becomes larger and the coefficients in that group become smaller. Canonical correlation can be used as a measure of correlation between groups. The regularization parameter λ1 mainly controls the sparsity between groups, while λ2 and the weights capture the correlation structure between groups. Hence, the sparse groupwise weighted ridge penalty can simultaneously select the groups and estimate the parameters under multicollinearity. In the extreme case λ2 = 0, the sparse groupwise weighted ridge penalty becomes the group MCP penalty.
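The following sketch simply evaluates Q_λ(β) as displayed above for given data and coefficients; the function and variable names (`group_mcp`, `sgwr_objective`, `lam1`, `lam2`) are ours, and the default value of a is illustrative.

```python
import numpy as np

def group_mcp(t, lam1, pk, a=3.0):
    # J_{lam1}(t) for a group of size pk, with t = ||beta_k||
    c = np.sqrt(pk) * lam1
    return c * t - t**2 / (2 * a) if t <= a * c else a * lam1**2 * pk / 2

def sgwr_objective(y, X_groups, beta_groups, w, lam1, lam2, a=3.0):
    """Q_lambda(beta) = (1/2n)||y - sum_k X_k beta_k||^2
                        + sum_k J_{lam1}(||beta_k||) + (lam2/2) sum_k w_k ||beta_k||^2."""
    n = len(y)
    resid = y - sum(Xk @ bk for Xk, bk in zip(X_groups, beta_groups))
    fit = resid @ resid / (2 * n)
    pen1 = sum(group_mcp(np.linalg.norm(bk), lam1, len(bk), a) for bk in beta_groups)
    pen2 = 0.5 * lam2 * sum(wk * (bk @ bk) for wk, bk in zip(w, beta_groups))
    return fit + pen1 + pen2
```

Here `X_groups` and `beta_groups` are lists of the groupwise design matrices X_k and coefficient vectors β_k, and `w` holds the groupwise weights w_k.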


In comparison with the group Laplacian penalty, computation of the sparse groupwise weighted ridge penalty is easy, since the groupwise weighted ridge penalty is always strictly convex as long as w_j > 0 for all j. Several methods define a post-scaled estimator to avoid over-regularization; here we consider the Elastic net and the sparse weighted ridge estimator. In the case λ1 = 0, the three post-scaled estimators, including the sparse groupwise weighted ridge estimator, are given below.

• Elastic net
\[ \hat\beta^{EN} = (1+\lambda)\,\operatorname*{arg\,min}_\beta \Big\{ \sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2 \Big\} \]

• Sparse weighted ridge
\[ \hat\beta^{wR} = (1+\lambda)\,\operatorname*{arg\,min}_\beta \Big\{ \sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^p w_j\beta_j^2 \Big\} \]

• Sparse groupwise weighted ridge
\[ \hat\beta^{GE} = (1+\lambda)\,\operatorname*{arg\,min}_\beta \Big\{ \sum_{i=1}^n \Big(y_i - \sum_{k=1}^K x_{ik}^T\beta_k\Big)^2 + \lambda\sum_{k=1}^K w_k\|\beta_k\|^2 \Big\} \]

To avoid double shrinkage, Zou and Hastie [2005] proposed the post-scaling procedure. After the 1 + λ rescaling, the Elastic net satisfies minimax optimality. The Elastic net estimator converges to the univariate regression estimator as λ2 → ∞ with λ1 = 0. However, the univariate regression estimator may not be optimal if there are high correlations between covariates. On the other hand, the weighted ridge penalty can attain the optimal estimator because of the weights, and the sparse groupwise weighted ridge penalty likewise converges to the groupwise weighted univariate regression estimator. Consider a simple example. Let p = 4 and x_1 = x_2, where the groups are x_1 = (x_{11}, x_{12}) and x_2 = (x_{21}, x_{22}) and each covariate is an n-vector. Assume the covariates are orthogonal within each group, with x_{11} = x_{21} and x_{12} = x_{22}, and suppose y = x_{11} + 2x_{12} + x_{21} + 2x_{22} + ε, so that the true regression coefficient vector is β* = (1, 2, 1, 2)^T. One can show that E(β̂_j^{gUR} | x_1, x_2) = (2, 4)^T for j = 1, 2, and the fitted predictor is ŷ^{gUR} = 2(x_{11} + x_{21}) + 4(x_{12} + x_{22}). In fact, the optimal estimator of β is β̂_j^{gUR}/2 for j = 1, 2, which is the groupwise ordinary least squares estimator that assumes the two group coefficients are equal. In this example, with p = 4 and x_1 = x_2, if we set the weights w_j = 2, then β̂_k^{GW}(∞) becomes the optimal estimator.

• Sparse groupwise weighted ridge: groupwise weighted univariate regression estimator
\[ \hat\beta^{GW} = (1+\lambda)\,\operatorname*{arg\,min}_\beta \Big\{ \sum_{i=1}^n \Big(y_i - \sum_{k=1}^K x_{ik}^T\beta_k\Big)^2 + \lambda\sum_{k=1}^K w_k\|\beta_k\|^2 \Big\}, \]
\[ \hat\beta^{GW}(\lambda) = \operatorname*{arg\,min}_\beta\; \beta^T\Big(\frac{X^TX + \lambda W^*}{1+\lambda}\Big)\beta - 2y^TX\beta, \]
where W^* = diag(w^*_1, \ldots, w^*_K) and w^*_k = (w_k, \ldots, w_k) is repeated p_k times, and
\[ \hat\beta_k^{GW}(\infty) = X_k^T y / w_k. \]
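A small numerical check of this example, under the simplifying assumptions that the within-group columns are exactly orthonormal and the noise is negligible, illustrates how the limit X_k^T y / w_k with w_k = 2 recovers the optimal coefficients (1, 2):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
z = rng.standard_normal((n, 2))
q, _ = np.linalg.qr(z)                 # q has orthonormal columns
X1 = q.copy()                          # group 1: (x11, x12)
X2 = q.copy()                          # group 2 identical to group 1
y = X1[:, 0] + 2 * X1[:, 1] + X2[:, 0] + 2 * X2[:, 1] + 0.01 * rng.standard_normal(n)

# groupwise univariate regression: regress y on one group alone
b_gUR = X1.T @ y                       # approximately (2, 4): the duplicated effects pile up
print("groupwise univariate regression:", b_gUR)

# post-scaled limit of the groupwise weighted ridge with w_k = 2:
# beta_k^GW(infinity) = X_k^T y / w_k, which recovers the optimal (1, 2)
w = 2.0
print("beta_k^GW(infinity):", X1.T @ y / w)
```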

3.2 Theoretical properties

In this section, we look at the theoretical properties of the sparse groupwise

weighted ridge estimator. The sparse groupwise weighted ridge estimator has

the oracle selection property under reasonable conditions. These conditions are

concerned with penalty parameters that are not determined based on data.

We study the selection properties of the sparse groupwise weighted ridge estimator β̂. Let the true value of the regression group coefficient be β* = (β*_1, \ldots, β*_K)^T. The main result is that the sparse groupwise weighted ridge estimator satisfies the group selection consistency:
\[ P\big(\{k : \|\beta^*_k\| \neq 0\} = \{k : \|\hat\beta_k\| \neq 0\}\big) \to 1, \quad \text{as } n \to \infty. \]

Denote A = \{k : \beta^*_k \neq 0\}, the set of indices of nonzero group coefficients. Define
\[ \hat\beta^o_{\lambda_2} = (1+\lambda_2)\Big(\operatorname*{arg\,min}_{\beta_k = 0,\ k\in A^c}\ \frac{1}{2n}\Big\|Y - \sum_{k=1}^K X_k\beta_k\Big\|^2 + \frac{\lambda_2}{2}\sum_{k=1}^K w_k\|\beta_k\|^2\Big). \]


This β̂^o_{λ2} is the sparse groupwise weighted ridge estimator obtained using only the covariates in the signal groups. In this sense, we refer to β̂^o_{λ2} as the oracle groupwise weighted ridge estimator; it is the groupwise weighted ridge estimator on the set A. The oracle groupwise weighted ridge estimator can be expressed in the oracle weighted ridge form of Kim [2012] when every group has size one:

\[ \hat\beta^o_{\lambda_2} = (1+\lambda_2)\Big(\operatorname*{arg\,min}_{\beta}\ \frac{1}{2n}\Big\|Y - \sum_{k=1}^K X_k\beta_k\Big\|^2 + \frac{\lambda_2}{2}\sum_{k\in A}\sum_{j\le p_k} w_k\beta_{kj}^2\Big). \]

Note that this is a weighted ridge penalty in which the weights of the covariates within the same group are all equal to w_k.

As mentioned in an earlier paper [Huang et al., 2010b], the oracle estimator is not computable since the oracle set is unknown; we use it as a benchmark for our proposed estimator. In other words, we will prove that the sparse groupwise weighted ridge estimator is asymptotically equivalent to the oracle groupwise weighted ridge estimator, provided we choose λ1 appropriately. This result explains the roles of the regularization parameters λ1 and λ2: the selection of the groups is done by λ1, and the shrinkage within the chosen groups is controlled by λ2. In Proposition 1 below, we give sufficient conditions for a local solution of the objective function; using this result, we can show the local optimality of the oracle groupwise weighted ridge estimator.


Proposition 1 For a given \beta \in R^p, let A = \{k : \|\beta_k\|_2 \neq 0\}. Assume that \beta satisfies
\[ X_k^T(y - X\beta)/n = \lambda_2 w_k\beta_k, \quad k \in A, \qquad \|X_k^T(y - X\beta)/n\|_2 < \lambda_1\sqrt{p_k}, \quad k \in A^c. \tag{3.1} \]
Then \beta is a local minimizer of Q_\lambda(\beta), provided \min_{k\in A} \|\beta_k\|_2/\sqrt{p_k} > a\lambda_1.
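A direct transcription of conditions (3.1) and the proviso of Proposition 1 into a numerical check might look as follows (a sketch; names are ours, and the set A is read off from the nonzero groups of the candidate β):

```python
import numpy as np

def is_local_minimizer(beta_groups, X_groups, y, w, lam1, lam2, a=3.0, tol=1e-8):
    """Check the sufficient conditions (3.1) for beta to be a local minimizer of Q_lambda."""
    n = len(y)
    resid = y - sum(Xk @ bk for Xk, bk in zip(X_groups, beta_groups))
    ok = True
    for Xk, bk, wk in zip(X_groups, beta_groups, w):
        pk = Xk.shape[1]
        gk = Xk.T @ resid / n                      # X_k^T (y - X beta) / n
        if np.linalg.norm(bk) > 0:                 # k in A
            ok &= np.allclose(gk, lam2 * wk * bk, atol=tol)
            ok &= np.linalg.norm(bk) / np.sqrt(pk) > a * lam1
        else:                                      # k in A^c
            ok &= np.linalg.norm(gk) < lam1 * np.sqrt(pk)
    return bool(ok)
```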

Let m^* = \min_{k\in A} \|\beta^*_k\|_2/\sqrt{p_k} and \Sigma = X^TX/n. For a given subset S \subset \{1, \ldots, K\}, define X_S as the n \times \sum_{k\in S}p_k submatrix obtained by combining the columns of X_k, k \in S, and let \phi(X_S^TX_S/n) denote the minimum eigenvalue of the matrix X_S^TX_S/n. We assume that the covariates are standardized so that \|X_{kj}\|_2^2 = n, k = 1, \ldots, K, j = 1, \ldots, p_k, where X_{kj} is the jth column vector of X_k. Let \Omega(\lambda) be the set of all local minimizers of Q_\lambda(\beta). Let K^* = |A| be the cardinality of A, that is, the number of true nonzero groups. Let w_{\max} and w_{\min} be the maximum and minimum values of the weights, respectively.

Theorem 1 Assume that \rho_A = \phi(X_A^TX_A/n) > 0, and let C denote the maximum eigenvalue of the matrix X_A^TX_A/n. Then
\[ \Pr\{\hat\beta^o_{\lambda_2} \in \Omega(\lambda)\} \ge 1 - P_1 - P_2, \]
where
\[ P_1 = \frac{\sigma^2 K^*}{n(\rho_A + \lambda_2 w_{\min})\{(m^* - a\lambda_1) - \nabla(\beta^*, \lambda_2)/\sqrt{C}\}^2}, \qquad P_2 = \frac{\sigma^2 (K - K^*)}{n\{\lambda_1 - \nabla(\beta^*, \lambda_2)\}^2}, \]
with \nabla(\beta^*, \lambda_2) = \lambda_2 w_{\max}\sqrt{C}\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min}), provided m^* - a\lambda_1 - \nabla(\beta^*, \lambda_2)/\sqrt{C} > 0 and \lambda_1 - \nabla(\beta^*, \lambda_2) > 0.

The conditions of Theorem 1 require that the nonzero group coefficients not be too small and that λ1 be at least of the same order as λ2. Theorem 1 gives a nonasymptotic lower bound on the probability that the oracle groupwise weighted ridge estimator belongs to Ω(λ). By the following corollary, a consequence of Theorem 1, the oracle groupwise weighted ridge estimator is one of the local minimizers of Q_λ(β) with probability tending to 1 for an appropriate choice of λ = (λ1, λ2).

Corollary 1 Assume that there exists a constant \rho_0 > 0 such that \rho_A \ge \rho_0 for all n. If n\lambda_1^2 \to \infty and \lambda_1 = o(m^*), then
\[ \Pr\{\hat\beta^o_{\lambda_2} \in \Omega(\lambda)\} \to 1, \]
provided K = o(n\lambda_1^2) and \nabla(\beta^*, \lambda_2)/\lambda_1 \to 0 as n \to \infty.

Theorem 1 and Corollary 1 ensure that the oracle groupwise weighted ridge estimator becomes a local minimizer of Q_λ(β) with probability tending to 1. But since Q_λ(β) is nonconvex, we do not know which local minimizer has the oracle property. The next theorem and corollary provide sufficient conditions under which the local minimizer of Q_λ(β) is unique and coincides with the oracle groupwise weighted ridge estimator with probability tending to 1. We first give a uniqueness condition for a given local minimizer.

Uniqueness condition We say that \beta \in \Omega(\lambda) satisfies the uniqueness condition with \rho > 0 if
\[ \min_{k\in A}\|\beta_k\|_2/\sqrt{p_k} > \max\{a\lambda_1,\ \lambda_1/(\rho+\lambda_2 w_k)\}, \]
\[ \max_{k\in A^c}\|X_k^T(y - X\beta)/n\|_2/\sqrt{p_k} < \min\{\lambda_1,\ a(\rho+\lambda_2 w_k)\lambda_1\}, \]
where A = \{k : \|\beta_k\|_2 \neq 0\}.

Theorem 2 Assume that \rho = \lambda_{\min}(X^TX/n) > 0. If there exists a local minimizer \beta satisfying the uniqueness condition with \rho, then \Omega(\lambda) = \{\beta\}; that is, \beta is the unique local minimizer and hence the global minimizer.

Theorem 3 Assume that \rho = \phi(X^TX/n) > 0, and let C denote the maximum eigenvalue of the matrix X_A^TX_A/n. Then
\[ \Pr\{\Omega(\lambda) = \{\hat\beta^o_{\lambda_2}\}\} \ge 1 - P_1 - P_2, \tag{3.2} \]
where
\[ P_1 = \frac{\sigma^2 K^*}{n(\rho_A + \lambda_2 w_{\min})\{(m^* - \max\{a\lambda_1,\ \lambda_1/(\rho+\lambda_2 w_k)\}) - \nabla(\beta^*, \lambda_2)/\sqrt{C}\}^2}, \tag{3.3} \]
\[ P_2 = \frac{\sigma^2 (K - K^*)}{n\{\min\{\lambda_1,\ a(\rho+\lambda_2 w_k)\lambda_1\} - \nabla(\beta^*, \lambda_2)\}^2}, \tag{3.4} \]
with \nabla(\beta^*, \lambda_2) = \lambda_2 w_{\max}\sqrt{C}\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min}), provided m^* - \max\{a\lambda_1,\ \lambda_1/(\rho+\lambda_2 w_k)\} - \nabla(\beta^*, \lambda_2)/\sqrt{C} > 0 and \min\{\lambda_1,\ a(\rho+\lambda_2 w_k)\lambda_1\} - \nabla(\beta^*, \lambda_2) > 0.

Corollary 2 Assume that there exists a constant \rho_0 > 0 such that \rho_A \ge \rho_0 for all n. If n\lambda_1^2 \to \infty and \lambda_1 = o(m^*), then
\[ \Pr\{\Omega(\lambda) = \{\hat\beta^o_{\lambda_2}\}\} \to 1, \]
provided K = o(n\lambda_1^2) and \nabla(\beta^*, \lambda_2)/\lambda_1 \to 0 as n \to \infty.

Remark 1 In our case, \Sigma(\lambda_2) = \Sigma + \lambda_2 W is always positive definite. Then the minimizer of the sparse groupwise weighted ridge objective restricted to the support A can be written explicitly as
\[ \hat\beta^o_A = (\Sigma_A + \lambda_2 W_A)^{-1}X_A^T y/n, \qquad \hat\beta^o_{A^c} = 0. \]
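A sketch of the closed form in Remark 1 (the (1 + λ2) post-scaling from the definition of β̂^o_{λ2} is omitted here and can be applied afterwards if desired; the function and argument names are ours):

```python
import numpy as np

def oracle_gwr(y, X_groups, w, lam2, signal):
    """Oracle groupwise weighted ridge estimate on the signal set A (Remark 1):
       beta_A = (Sigma_A + lam2 * W_A)^{-1} X_A^T y / n,  beta_{A^c} = 0."""
    n = len(y)
    XA = np.hstack([X_groups[k] for k in signal])
    wA = np.concatenate([np.full(X_groups[k].shape[1], w[k]) for k in signal])
    SigmaA = XA.T @ XA / n
    betaA = np.linalg.solve(SigmaA + lam2 * np.diag(wA), XA.T @ y / n)
    # reassemble the full coefficient vector with zeros on the noise groups
    beta = [np.zeros(Xk.shape[1]) for Xk in X_groups]
    start = 0
    for k in signal:
        pk = X_groups[k].shape[1]
        beta[k] = betaA[start:start + pk]
        start += pk
    return beta
```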

Remark 1 shows the benefit of introducing the groupwise weighted ridge penalty. Since the sparse groupwise weighted ridge estimator is equivalent to the oracle estimator with high probability by the theorems above, we can obtain the oracle estimator on the support A. Without the groupwise weighted ridge penalty, that is, when λ2 = 0, it becomes the least squares estimator

\[ \hat\beta^o_{A,\lambda_2=0} = \Sigma_A^{-1}X_A^T y/n, \qquad \hat\beta^o_{A^c} = 0. \]

If some of the predictors in \{X_k, k \in A\} are highly correlated, or if the total number of variables in the signal groups is larger than n, the oracle least squares estimator is not stable or not unique. In contrast, \Sigma(\lambda_2) = \Sigma + \lambda_2 W is positive definite under a proper condition, even if the predictors in \{X_k, k \in A\} are highly correlated or the number of variables in the signal groups exceeds n. This result is similar to that of Huang et al. [2011].


Chapter 4

Computation

In this chapter, we propose an efficient algorithm for the sparse groupwise weighted ridge. We first review the group Lasso algorithm, which is needed to explain the sparse groupwise weighted ridge algorithm, and we introduce an algorithm for the group Laplacian penalty mentioned in Chapter 2. Since the sparse groupwise weighted ridge estimator combines the group MCP and the groupwise weighted ridge, its penalty is nonconvex; to solve the nonconvex problem, we apply the convex-concave procedure (CCCP).

In terms of computation, an important issue for group penalties is orthonormality. Much of the previous literature on group penalties does not mention orthonormalization within groups, although orthonormalizing within groups can make the computation easier. When the covariate groups are assumed to be orthonormalized, we call this the orthogonal case; the general case does not orthonormalize within groups. We consider both cases: design matrices within each group that are orthonormal, and non-orthonormal design matrices (the general case).

4.1 Group Lasso

4.1.1 Orthogonal case

Yuan and Lin [2006] defined the group Lasso penalty by orthonormalizing each

group of covariates. We briefly review the algorithm for the group Lasso of

Yuan and Lin [2006].

Let y_k = y - \sum_{j\neq k} X_j\beta_j. Then
\[ Q_\lambda(\beta) = \frac{1}{2n}\|y_k - X_k\beta_k\|^2 + \lambda\sum_{k=1}^K \sqrt{p_k}\,\|\beta_k\|. \]
By letting s_k = X_k^T y_k, the closed-form solutions are given by
\[ \hat\beta_k = \begin{cases} 0, & \|s_k\| \le \sqrt{p_k}\,\lambda, \\[4pt] \Big(1 - \dfrac{\sqrt{p_k}\,\lambda}{\|s_k\|}\Big)s_k, & \text{otherwise.} \end{cases} \]

When the groups are orthonormalized, the computation is easy because of this closed-form solution. Yuan and Lin [2006] provide a blockwise coordinate descent algorithm for computing the solutions, which we summarize in Figure 4.1.


1. Set s = 0. Initialize β^{(0)} = (β_1^{(0)}, \ldots, β_K^{(0)})^T and r = Y - \sum_{k=1}^K X_kβ_k^{(0)}.
2. For k = 1, 2, \ldots, K:
   (a) Calculate s_k = n^{-1}X_k^T r + β_k^{(s)}.
   (b) Update β_k^{(s+1)} = \hatβ_k.
   (c) Update r ← r - X_k(β_k^{(s+1)} - β_k^{(s)}) and k ← k + 1.
3. Update s ← s + 1.
4. Repeat Steps 2 and 3 until convergence.

Figure 4.1: Blockwise coordinate descent algorithm
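A minimal numpy sketch of the Figure 4.1 iteration, assuming each X_k has been orthonormalized so that X_k^T X_k / n = I_{p_k} (the function and variable names are ours):

```python
import numpy as np

def group_lasso_bcd(y, X_groups, lam, n_iter=200, tol=1e-8):
    """Blockwise coordinate descent of Figure 4.1; each X_k is assumed
       orthonormalized so that X_k^T X_k / n = I."""
    n = len(y)
    beta = [np.zeros(Xk.shape[1]) for Xk in X_groups]
    r = y.astype(float).copy()
    for _ in range(n_iter):
        max_change = 0.0
        for k, Xk in enumerate(X_groups):
            pk = Xk.shape[1]
            sk = Xk.T @ r / n + beta[k]        # step 2(a)
            norm_sk = np.linalg.norm(sk)
            thr = np.sqrt(pk) * lam
            if norm_sk <= thr:
                new = np.zeros(pk)
            else:
                new = (1 - thr / norm_sk) * sk  # groupwise soft-thresholding
            r -= Xk @ (new - beta[k])           # step 2(c)
            max_change = max(max_change, np.max(np.abs(new - beta[k])))
            beta[k] = new
        if max_change < tol:
            break
    return beta
```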


The blockwise coordinate descent algorithm optimizes the target function with respect to a single group at a time and cycles through the groups until convergence. Although the objective function is not necessarily convex, it is convex with respect to a single group when the coefficients of all the other groups are fixed. Tseng [2001] showed that the blockwise coordinate descent algorithm always converges.

4.1.2 General case

Without the orthonormalization, we cannot obtain a closed-form solution. Moreover, Foygel and Drton [2010] pointed out that orthonormalizing each group of covariates may be unnatural or undesirable, so methods that do not require orthonormal X_k's are necessary, and many papers do not orthonormalize each group. Friedman et al. [2010] proposed an algorithm for the general form of the group Lasso with non-orthonormal model matrices. Jacob et al. [2009] proposed a generalization of the group Lasso penalty which leads to sparse models whose sparsity patterns are given by a graph of covariates, i.e., groups of connected covariates in the graph. Foygel and Drton [2010] proposed the Single Line Search (SLS) algorithm, which computes the exact optimal value for each group with one univariate line search. Since the objective function Q_λ(β) is convex in β, Foygel and Drton [2010] proved that the objective


1. Initialize β ⇐ β^{(0)}.
2. Repeat the following for k = 1, 2, \ldots, K until convergence:
   (a) Let R_k = y - \sum_{j\neq k} X_jβ_j. If \|X_k^T R_k\|_2 \le \sqrt{p_k}\,λ, then β_k ⇐ 0.
   (b) Otherwise, compute the spectral decomposition X_k^T X_k = U_k^T D_k U_k, where D_k = diag\{d_{k1}, \ldots, d_{kp_k}\}, and set v_k ⇐ U_k X_k^T R_k.
   (c) Find the unique r > 0 satisfying f(r) = \sum_{j=1}^{p_k} v_j^2/(d_j r + \sqrt{p_k}\,λ)^2 = 1, and set β_k ⇐ U_k^T(D_k + r^{-1}\sqrt{p_k}\,λ I_{p_k})^{-1}v_k.

Figure 4.2: Single Line Search algorithm

function has a unique minimizer β̂. See Figure 4.2.
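The following sketch implements one SLS update for a single group. Bisection is our choice of univariate solver for f(r) = 1 (any root finder works), and we use √p_k λ inside the matrix inverse, mirroring the role of B_k in Figure 4.3; this is a sketch rather than a reference implementation.

```python
import numpy as np

def sls_group_update(Xk, Rk, lam):
    """One Single Line Search update (Figure 4.2) for group k."""
    pk = Xk.shape[1]
    c = np.sqrt(pk) * lam
    g = Xk.T @ Rk
    if np.linalg.norm(g) <= c:
        return np.zeros(pk)
    d, U = np.linalg.eigh(Xk.T @ Xk)              # X_k^T X_k = U diag(d) U^T
    v = U.T @ g                                   # v_k in the figure's notation
    f = lambda r: np.sum(v**2 / (d * r + c)**2)   # decreasing in r, f(0+) > 1
    lo, hi = 0.0, 1.0
    while f(hi) > 1.0:                            # bracket the root of f(r) = 1
        hi *= 2.0
    for _ in range(100):                          # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 1.0 else (lo, mid)
    r = 0.5 * (lo + hi)
    return U @ (v / (d + c / r))                  # beta_k with ||beta_k|| = r
```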

4.2 Group Laplacian method

The computation of the group Laplacian method is needed for the comparison between the group Laplacian and sparse groupwise weighted ridge penalties. The group Laplacian penalty has a flaw in some situations, which we need to look into.

4.2.1 Orthogonal case

The orthogonal case of the group Laplacian penalty is similar to the solution

of the Smoothed Group Lasso proposed by Liu et al. [2011].

The objective function of the group Laplacian is as follows:
\[ Q_\lambda(\beta) = \frac{1}{2n}\Big\|Y - \sum_{k=1}^K X_k\beta_k\Big\|^2 + \sum_{k=1}^K \lambda_1\sqrt{p_k}\,\|\beta_k\| + \frac{\lambda_2}{2}\sum_{1\le k<j\le K}\zeta_{kj}\,p_{\max}\Big(\frac{\|\beta_k\|}{\sqrt{p_k}} - \frac{\|\beta_j\|}{\sqrt{p_j}}\Big)^2. \]
Given the group parameter vectors \beta_j (j \neq k) fixed at their current estimates \beta_j^{(s)}, this problem is equivalent to minimizing R(\beta_k) defined as
\[ R(\beta_k) = C + \tfrac{1}{2}\beta_k^T A_k\beta_k - s_k^T\beta_k + B_k\|\beta_k\|, \]
where
\[ A_k = \Big(1 + \frac{\lambda_2 p_{\max}}{p_k}\sum_{j\neq k}\zeta_{kj}\Big) I_{p_k}, \qquad B_k = \lambda_1\sqrt{p_k} - \frac{\lambda_2 p_{\max}}{\sqrt{p_k}}\sum_{j\neq k}\zeta_{kj}\frac{\|\beta_j\|}{\sqrt{p_j}}, \]
\[ s_k = X_k^T\Big(Y - \sum_{j\neq k}X_j\beta_j\Big), \]
and C is a constant free of \beta_k. It can be shown that the minimizer of R(\beta_k) is
\[ \hat\beta_k = \frac{1}{A_k}\Big(1 - \frac{B_k}{\|s_k\|}\Big)_+ s_k. \]
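A sketch of one blockwise update for the group Laplacian penalty in the orthogonal case, directly transcribing A_k, B_k, and s_k above; `zeta` is assumed to hold the between-group canonical correlations ζ_{kj} as a K × K array.

```python
import numpy as np

def group_laplacian_update(k, y, X_groups, beta, zeta, lam1, lam2):
    """One blockwise update for the group Laplacian penalty (orthogonal case)."""
    K = len(X_groups)
    p = np.array([Xk.shape[1] for Xk in X_groups])
    pmax, pk = p.max(), p[k]
    others = [j for j in range(K) if j != k]
    Ak = 1.0 + lam2 * pmax / pk * sum(zeta[k, j] for j in others)
    Bk = np.sqrt(pk) * lam1 - lam2 * pmax / np.sqrt(pk) * sum(
        zeta[k, j] * np.linalg.norm(beta[j]) / np.sqrt(p[j]) for j in others)
    sk = X_groups[k].T @ (y - sum(X_groups[j] @ beta[j] for j in others))
    norm_sk = np.linalg.norm(sk)
    shrink = max(0.0, 1.0 - Bk / norm_sk) if norm_sk > 0 else 0.0
    return shrink * sk / Ak
```

Note that when B_k is negative, the factor (1 − B_k/‖s_k‖)_+ exceeds one, which is exactly the "no shrinkage" problem discussed next.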

The solution thus has a closed form, and we can use the blockwise coordinate descent algorithm for fitting the group Laplacian penalty. With the group


Laplacian penalty, however, B_k can be negative or positive depending on λ2 and ζ. If B_k is negative, there is no shrinkage for β_k. Moreover, convergence of the blockwise coordinate descent method cannot be guaranteed, because the objective function may not be convex with respect to a single group when the coefficients of all the other groups are fixed.

4.2.2 General case

Although the objective function is not necessarily convex, it is convex with respect to a single group when the coefficients of all the other groups are fixed. As noted above, R(β_k) is convex in β_k, so we can apply the SLS algorithm of Foygel and Drton [2010]. Here, A_k = X_k^T X_k + (λ2 p_max/p_k)(\sum_{j\neq k}ζ_{kj}) I_{p_k} because of the non-orthonormality. We present the algorithm in Figure 4.3. However, the general case of the group Laplacian penalty has the same problem as the orthogonal case.


1. Initialize β ⇐ β^{(0)}.
2. Repeat the following for k = 1, 2, \ldots, K until convergence:
   (a) Let R_k = y - \sum_{j\neq k} X_jβ_j. If \|X_k^T R_k\|_2 \le \sqrt{p_k}\,λ, then β_k ⇐ 0.
   (b) Otherwise, compute the spectral decomposition A_k = U_k^T D_k U_k, where D_k = diag\{d_{k1}, \ldots, d_{kp_k}\}, and set v_k ⇐ U_k X_k^T R_k.
   (c) Find the unique r > 0 satisfying f(r) = \sum_{j=1}^{p_k} v_j^2/(d_j r + B_k)^2 = 1, and set β_k ⇐ U_k^T(D_k + r^{-1}B_k I_{p_k})^{-1}v_k.

Figure 4.3: Group Laplacian algorithm


4.3 Convex concave procedure

In this section, we introduce the concave-convex procedure (CCCP) proposed by Yuille and Rangarajan [2002]. It can be applied to a nonconvex function that decomposes into the sum of a convex and a concave function. The main idea is to update the solution by the minimizer of a tight convex upper bound of the objective function at the current solution. Since the objective function decomposes into convex and concave parts, we can easily construct a tight convex upper bound of the objective function by using the tangent hyperplane of the concave part at the given point. In more detail, suppose that we are to minimize a nonconvex function Q(β) that is a sum of convex and concave functions, Q(β) = Q_{vex}(β) + Q_{cav}(β). For a given current solution β^c, the tight convex upper bound is defined by
\[ U(\beta) = Q_{vex}(\beta) + \nabla Q_{cav}(\beta^c)^T\beta. \]
We then update the solution by the minimizer of U(β). Since U(β) is convex, we can find the minimizer using various convex optimization algorithms. This procedure is repeated until the solution converges; it always converges to a local minimizer by the descent property of the CCCP algorithm.


Let ∂J(β) be the subgradient of J(β).

1. Find an initial estimator β^c.
2. Repeat the following until convergence:
   (a) Let U_λ(β) = L(β) + ∂J(β^c)^Tβ + λ1\|β\|.
   (b) Find β̂ = arg min_β U_λ(β).
   (c) Update β^c by β̂.

Figure 4.4: CCCP algorithm
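A minimal, generic sketch of the CCCP loop in Figure 4.4. The two callables are placeholders of our own: `minimize_convex` stands for any convex solver (e.g., a group Lasso routine) applied to the convex part plus a linear term, and `grad_concave` returns the (sub)gradient of the concave part at the current solution.

```python
import numpy as np

def cccp(minimize_convex, grad_concave, beta0, max_iter=50, tol=1e-6):
    """Generic CCCP loop: repeatedly minimize the tight convex upper bound
       Q_vex(beta) + grad(Q_cav)(beta_c)^T beta at the current point beta_c.
       `minimize_convex(linear)` must return argmin_beta Q_vex(beta) + linear^T beta."""
    beta_c = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        beta_new = minimize_convex(grad_concave(beta_c))
        if np.linalg.norm(beta_new - beta_c) < tol:
            return beta_new
        beta_c = beta_new
    return beta_c
```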


4.4 Sparse groupwise weighted ridge

The sparse groupwise weighted ridge penalty is a combination of the group MCP and the groupwise weighted ridge penalty. In the orthogonal case, we can use the blockwise coordinate descent method, since we obtain a closed-form solution with respect to a single group. In the general case, because of the nonconvex group MCP term, we find a local minimizer of the objective function using the CCCP algorithm.

4.4.1 Orthogonal case

By letting y_k = y - \sum_{j\neq k}X_j\beta_j and s_k = X_k^T y_k, the objective function of the sparse groupwise weighted ridge penalty can be rewritten as
\[ Q_\lambda(\beta) = \frac{1}{2n}\|y_k - X_k\beta_k\|^2 + \sum_{k=1}^K J_{\lambda_1}(\|\beta_k\|) + \frac{\lambda_2}{2}\sum_{k=1}^K w_k\|\beta_k\|^2, \]
where J_{\lambda_1}(\cdot) is the group MCP.

We obtain the solution by minimizing Q_\lambda with respect to \beta_k:
\[ \hat\beta_k = \begin{cases} 0, & \|s_k\| \le \sqrt{p_k}\,\lambda_1, \\[6pt] \dfrac{1}{1 - \frac{1}{a} + 2\lambda_2 w_k}\Big(1 - \dfrac{\sqrt{p_k}\,\lambda_1}{\|s_k\|}\Big)s_k, & \sqrt{p_k}\,\lambda_1 < \|s_k\| \le a\sqrt{p_k}\,\lambda_1, \\[6pt] \dfrac{1}{1 + 2\lambda_2 w_k}\,s_k, & \|s_k\| > a\sqrt{p_k}\,\lambda_1. \end{cases} \]
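A direct transcription of the three-branch update above for one group (a sketch; the constant a and the construction of s_k follow the preceding text, and the function name is ours):

```python
import numpy as np

def sgwr_group_update(sk, pk, wk, lam1, lam2, a=3.0):
    """Orthogonal-case update for one group of the sparse groupwise weighted ridge,
       transcribing the closed form displayed above (s_k = X_k^T y_k)."""
    c = np.sqrt(pk) * lam1
    norm_sk = np.linalg.norm(sk)
    if norm_sk <= c:
        return np.zeros(pk)
    if norm_sk <= a * c:
        return (1.0 - c / norm_sk) * sk / (1.0 - 1.0 / a + 2.0 * lam2 * wk)
    return sk / (1.0 + 2.0 * lam2 * wk)
```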

The blockwise coordinate descent method for fitting the proposed penalty always converges, while the algorithm for the group Laplacian penalty may not. So we compute the solutions using the blockwise coordinate descent algorithm.

4.4.2 General case

The objective function can be decomposed into the sum of a convex and a concave function: \tilde J_{\lambda_1}(\beta) = \sum_{k=1}^K \{J_{\lambda_1}(\|\beta_k\|) - \sqrt{p_k}\,\lambda_1\|\beta_k\|\} is concave, and
\[ Q_\lambda(\beta) = \frac{1}{2n}\Big\|Y - \sum_{k=1}^K X_k\beta_k\Big\|^2 + \frac{\lambda_2}{2}\sum_{k=1}^K w_k\|\beta_k\|^2 + \tilde J_{\lambda_1}(\beta) + \sum_{k=1}^K \sqrt{p_k}\,\lambda_1\|\beta_k\|. \]

For a given current solution \beta^c, we update the solution by minimizing
\[ U_\lambda(\beta) = \frac{1}{2n}\Big\|Y - \sum_{k=1}^K X_k\beta_k\Big\|^2 + \frac{\lambda_2}{2}\sum_{k=1}^K w_k\|\beta_k\|^2 + \partial\tilde J_{\lambda_1}(\beta^c)^T\beta + \sum_{k=1}^K \sqrt{p_k}\,\lambda_1\|\beta_k\|, \]
where \partial\tilde J_{\lambda_1}(\beta) denotes the subgradient of \tilde J_{\lambda_1}(\beta), given groupwise by
\[ \partial\tilde J_{\lambda_1}(\beta_k) = \begin{cases} -\beta_k/a, & \|\beta_k\| \le a\sqrt{p_k}\,\lambda_1, \\[4pt] -\lambda_1\sqrt{p_k}\,\dfrac{\beta_k}{\|\beta_k\|}, & \|\beta_k\| > a\sqrt{p_k}\,\lambda_1. \end{cases} \]

Then U_\lambda(\beta) can be solved by many efficient algorithms for the group Lasso problem; this paper uses the SLS algorithm. At group k, we fix the coefficients of the other groups and \lambda, so U_\lambda(\beta_k) is a quadratic function of \beta_k plus a group penalty: U_\lambda(\beta_k) = \beta_k^T Q\beta_k + L^T\beta_k + \lambda_1\sqrt{p_k}\|\beta_k\|, where Q = X_k^T X_k + \lambda_2 w_k I_{p_k} and L = X_k^T Y + \partial\tilde J_{\lambda_1}(\beta^c). Finally, we demonstrate the algorithm for computing


the proposed estimator using the SLS and CCCP algorithms in Figures 4.2 and 4.4.


Chapter 5

Numerical studies

In this chapter, we compare the performance of several penalties using simulation studies and real data analysis. Suppose there are K covariates x_1, \ldots, x_K. The goal is to assess how well the covariates explain a response variable y. The standard approach is a linear regression model,
\[ y = x^T\gamma + \varepsilon, \]
where x = (x_1, \ldots, x_K)^T and \gamma \in R^K is the vector of regression coefficients. A drawback of the linear regression model is that the relation of x_k to y may not be linear. An alternative approach is an additive model.


\[ y = f_1(x_1) + \cdots + f_K(x_K) + \varepsilon, \]
where f_k(x_k) = \sum_{j=1}^{p_k} x_{kj}\beta_{kj} and x_{kj} is the jth transformation of x_k. We compare the prediction errors and the variable selection performance of the sparse groupwise weighted ridge with the group Lasso of Foygel and Drton [2010], the group MCP, and the group Laplacian on several regression examples. In addition, we use the group Bridge algorithm of Breheny and Huang [2009] via the grpreg package in R.

5.1 Simulation studies

For simulation studies, we consider the linear regression model. We simulate

data from the true model

y = Xβ + σε, ε ∼ N(0, 1).

Three examples are presented here, covering both the p < n and p > n situations. The simulation scheme is similar to that of Zou and Hastie [2005].

Example 1 We simulated 50 data sets, each consisting of 200 observations. The total number of variables is p = 30. The correlation between the base covariates x_i and x_j was set to corr(x_i, x_j) = R^{|i-j|}, i, j = 1, \ldots, 10. After generating x, we construct the first three Hermite polynomials, p_1(x) = x, p_2(x) = x^2 - 1, and p_3(x) = x^3 - 3x, to create the group structure: the three columns p_1(x_j), p_2(x_j), p_3(x_j) formed from an original covariate x_j constitute one natural group. The total numbers of signal groups and signal variables are 4 and 12, respectively. We set \sigma = 10 and
\[ \beta = (\underbrace{3, 3, 1.5}_{3},\ \underbrace{2, 2, 1}_{3},\ \underbrace{4, 2, 2}_{3},\ \underbrace{2, 2, 1}_{3},\ \underbrace{0, \ldots, 0}_{18}). \]
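A sketch of this data-generating process, under the assumption (not stated explicitly above) that the ten base covariates are jointly Gaussian with the given correlation structure:

```python
import numpy as np

def make_example1(n=200, R=0.3, sigma=10.0, seed=0):
    """Generate one Example 1 data set: 10 Gaussian base covariates with
       corr(x_i, x_j) = R^|i-j|, expanded into Hermite-polynomial groups of size 3."""
    rng = np.random.default_rng(seed)
    idx = np.arange(10)
    cov = R ** np.abs(idx[:, None] - idx[None, :])
    x = rng.multivariate_normal(np.zeros(10), cov, size=n)
    groups = [np.column_stack([xj, xj**2 - 1, xj**3 - 3 * xj]) for xj in x.T]
    X = np.hstack(groups)                          # n x 30 design, 10 groups of 3
    beta = np.concatenate([[3, 3, 1.5], [2, 2, 1], [4, 2, 2], [2, 2, 1], np.zeros(18)])
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta

X, y, beta = make_example1()
print(X.shape, y.shape)
```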

Example 2 We simulated 50 data sets, each consisting of 100 observations. The total number of variables is p = 300. The correlation between the base covariates x_i and x_j was set to corr(x_i, x_j) = 0.8^{|i-j|}. After generating x, we construct the first three Hermite polynomials p_1(x) = x, p_2(x) = x^2 - 1, and p_3(x) = x^3 - 3x, and the three columns formed from each original covariate x_j constitute one natural group. The total numbers of signal groups and signal variables are 20 and 60, respectively. We set \sigma = 3 and
\[ \beta = (\underbrace{0.5, \ldots, 0.5}_{60},\ \underbrace{0, \ldots, 0}_{240}). \]

Example 3 We simulated 50 data sets, each consisting of 200 observations. The total number of variables is p = 75. We set \sigma = 5 and \beta = (0.3, 0.3, 0.3) for each signal group. The total numbers of signal groups and signal variables are 9 and 27, respectively, and the groups are constructed as in Example 1. The base covariates x were generated as follows:
\[ x_k = z_1 + \varepsilon^x_k, \quad z_1 \sim N(0, 1), \quad k = 1, 2, 3, \]
\[ x_k = z_2 + \varepsilon^x_k, \quad z_2 \sim N(0, 1), \quad k = 4, 5, 6, \]
\[ x_k = z_3 + \varepsilon^x_k, \quad z_3 \sim N(0, 1), \quad k = 7, 8, 9, \]
and x_k \sim N(0, 1) independent and identically distributed for k = 10, \ldots, 16, where the \varepsilon^x_k are independent and identically distributed N(0, 0.5), k = 1, \ldots, 9.

The tables report the prediction error (PE) based on an independent test sample of size 1,000, the average model error (ME), the number of signal groups selected (gSig), the number of noise groups selected (gNoi), the number of signal variables selected (vSig), and the number of noise variables selected (vNoi). In the tables, we compare the proposed method with other methods: 'gBridge' is the group Bridge of Breheny and Huang [2009], 'gLasso' is the group Lasso of Foygel and Drton [2010], 'gMCP' is the group MCP, which is the proposed method with λ2 = 0, and 'gLapl' is the group Laplacian penalty.

We consider four versions of the proposed method using four different weights. A weight is constructed as w_j = 1 + \sum_{k\neq j}|a_{jk}|, where a_{jk} is an adjacency measure. The adjacency measures of the four proposed methods are: 'gWR 1', a_{jk} = CanonicalCorr(x_j, x_k); 'gWR 2', a_{jk} = I(CanonicalCorr(x_j, x_k) > 0.6); 'gWR 3', a_{jk} = min corr(x_j, x_k); and 'gWR 4', a_{jk} = max corr(x_j, x_k), respectively.
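As an illustration of the weight construction, the sketch below computes w_j = 1 + Σ_{k≠j}|a_{jk}| with a_{jk} taken as the largest absolute correlation between columns of groups j and k (our reading of the 'gWR 4' variant; the canonical-correlation variants can be substituted for the inner maximum):

```python
import numpy as np

def group_weights_max_corr(X, group_index):
    """Groupwise weights w_j = 1 + sum_{k != j} |a_jk|, with a_jk the largest
       absolute correlation between a column of group j and a column of group k."""
    group_index = np.asarray(group_index)
    C = np.abs(np.corrcoef(X, rowvar=False))
    labels = np.unique(group_index)
    K = len(labels)
    a = np.zeros((K, K))
    for j, gj in enumerate(labels):
        for k, gk in enumerate(labels):
            if j != k:
                a[j, k] = C[np.ix_(group_index == gj, group_index == gk)].max()
    return 1.0 + a.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
print(group_weights_max_corr(X, [0, 0, 0, 1, 1, 1]))
```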

Under all simulated scenarios, the proposed methods outperform the other methods in prediction accuracy, and their performances are not significantly different from one another. The group Lasso and the group Laplacian methods select more noise variables than the other methods. When the correlations between groups are low, the performances of the group MCP and the proposed methods are similar; when the correlations between groups are high, the proposed methods perform better than the group MCP. This indicates that the second parameter λ2 and the weights control the between-group correlation structure well, as can be seen in Example 1. The group Bridge method shows poor performance, especially in the high-dimensional case. The group Bridge and group Laplacian methods show a wide variation in prediction accuracy, whereas the proposed methods are relatively stable.


Table 5.1: Simulation results of Example 1: R=0.3

Methods PE ME gSig gNoi vSig vNoi

gBridge 116.88 (0.921) 16.48 (0.913) 4.00 (.000) 4.90 (.200) 12.00 (.000) 12.56 (.590)

gLasso 113.06 (.840) 11.75 (.822) 4.00 (.000) 4.96 (.150) 12.00 (.000) 14.88 (.450)

gMCP 111.31 (.870) 9.46 (.818) 4.00 (.000) 3.22 (.210) 12.00 (.000) 9.66 (.640)

gLapl 113.85 (.990) 12.98 (.936) 4.00 (.000) 4.54 (.130) 12.00 (.000) 13.62 (.390)

gWR 1 110.84 (.838) 9.17 (.781) 4.00 (.000) 3.48 (.200) 12.00 (.000) 10.44 (.610)

gWR 2 110.93 (.827) 9.51 (.788) 4.00 (.000) 3.22 (.220) 12.00 (.000) 9.66 (.650)

gWR 3 110.95 (.829) 9.49 (.793) 4.00 (.000) 3.20 (.210) 12.00 (.000) 9.60 (.640)

gWR 4 110.87 (.838) 9.27 (.798) 4.00 (.000) 3.30 (.200) 12.00 (.000) 9.90 (.600)

Figure 5.1: Example 1: R=0.3: Boxplots of the 50 prediction errors.


Table 5.2: Simulation results of Example 1: R=0.5

Methods PE ME gSig gNoi vSig vNoi

gBridge 117.71 (1.268) 16.98 (.994) 4.00 (.000) 4.92 (.180) 11.96 (.030) 12.90 (.550)

gLasso 113.93 (1.132) 13.05 (.895) 4.00 (.000) 4.78 (.180) 12.00 (.000) 14.34 (.530)

gMCP 112.25 (1.059) 10.91 (.881) 3.98 (.020) 3.20 (.220) 11.94 (.060) 9.60 (.660)

gLapl 114.69 (1.406) 13.04 (1.237) 4.00 (.000) 4.36 (.180) 12.00 (.000) 13.08 (.540)

gWR 1 111.21 (1.032) 10.59 (.803) 4.00 (.000) 3.54 (.200) 12.00 (.000) 10.62 (.610)

gWR 2 111.51 (1.000) 10.69 (.810) 4.00 (.000) 3.12 (.220) 12.00 (.000) 9.36 (.670)

gWR 3 111.44 (1.010) 10.76 (.815) 4.00 (.000) 3.12 (.230) 12.00 (.000) 9.36 (.690)

gWR 4 111.28 (1.028) 11.03 (.804) 4.00 (.000) 3.42 (.210) 12.00 (.000) 10.26 (.640)

Figure 5.2: Example 1: R=0.5: Boxplots of the 50 prediction errors.


Table 5.3: Simulation results of Example 1: R=0.8

Methods PE ME gSig gNoi vSig vNoi

gBridge 114.39 (.982) 13.32 (.880) 4.00 (.000) 4.28 (.200) 12.00 (.000) 10.80 (.620)

gLasso 112.71 (1.035) 10.39 (.928) 4.00 (.000) 4.42 (.200) 12.00 (.000) 13.26 (.590)

gMCP 112.48 (1.041) 11.65 (.912) 3.90 (.040) 2.96 (.190) 11.70 (.130) 8.88 (.560)

gLapl 113.04 (1.116) 9.83 (1.027) 4.00 (.000) 3.98 (.180) 12.00 (.000) 11.94 (.530)

gWR 1 110.03 (.908) 8.39 (.814) 3.98 (.020) 2.98 (.170) 11.94 (.060) 8.94 (.510)

gWR 2 110.01 (.911) 8.59 (.816) 3.98 (.020) 3.00 (.170) 11.94 (.060) 9.00 (.510)

gWR 3 111.08 (.913) 10.64 (.803) 3.96 (.030) 2.90 (.190) 11.88 (.080) 8.70 (.570)

gWR 4 109.71 (.966) 9.14 (.706) 3.90 (.040) 2.60 (.190) 11.70 (.130) 7.80 (.560)

Figure 5.3: Example 1: R=0.8: Boxplots of the 50 prediction errors.


Table 5.4: Simulation results of Example 2: R=0.8

Methods PE ME gSig gNoi vSig vNoi

gBridge 43.98 (3.609) 31.96 (3.595) 14.00 (.435) 8.12 (.735) 38.58 (1.270) 16.72 (1.570)

gLasso 40.52 (2.718) 29.02 (2.745) 17.78 (.340) 36.28 (2.210) 53.34 (1.025) 108.84 (6.640)

gMCP 43.47 (2.362) 30.38 (2.361) 14.70 (.435) 22.54 (1.190) 44.10 (1.305) 67.62 (3.560)

gLapl 42.52 (3.402) 29.60 (3.416) 19.34 (.160) 42.44 (2.670) 58.02 (.475) 127.32 (8.010)

gWR 1 39.19 (2.489) 27.81 (2.502) 15.94 (.355) 27.66 (.830) 47.82 (1.075) 82.98 (2.490)

gWR 2 37.90 (2.603) 27.07 (2.618) 16.76 (.325) 29.68 (1.000) 50.28 (.975) 89.04 (3.005)

gWR 3 40.45 (2.375) 28.26 (2.382) 15.62 (.340) 26.36 (.665) 46.86 (1.020) 79.08 (1.995)

gWR 4 38.75 (2.402) 27.34 (2.422) 16.62 (.320) 28.28 (.865) 48.46 (1.220) 84.48 (2.405)

Figure 5.4: Example 2: R=0.8: Boxplots of the 50 prediction errors.


Table 5.5: Simulation results of Example 3

Methods PE ME gSig gNoi vSig vNoi

gBridge 29.67 (.542) 4.71 (.384) 8.92 (.060) 3.16 (.470) 24.60 (.290) 5.80 (1.060)

gLasso 29.33 (.466) 4.92 (.271) 9.00 (.000) 9.76 (.580) 27.00 (.000) 29.28 (1.740)

gMCP 29.35 (.472) 4.81 (.288) 8.96 (.040) 8.36 (.540) 26.88 (.120) 25.08 (1.620)

gLapl 29.08 (.466) 4.41 (.248) 9.00 (.000) 9.56 (.470) 27.00 (.000) 28.68 (1.410)

gWR 1 28.93 (.455) 4.36 (.251) 9.00 (.000) 7.96 (.490) 27.00 (.000) 23.88 (1.470)

gWR 2 28.99 (.465) 4.44 (.255) 8.96 (.040) 7.92 (.490) 26.88 (.120) 23.76 (1.460)

gWR 3 29.06 (.462) 4.63 (.265) 9.00 (.000) 8.40 (.570) 27.00 (.000) 25.20 (1.720)

gWR 4 29.05 (.463) 4.55 (.264) 9.00 (.000) 8.44 (.560) 27.00 (.000) 25.32 (1.690)

Figure 5.5: Example 3: Boxplots of the 50 prediction errors.


5.2 Real data

We analyze two real data sets, a wine quality data set and a gene expression data set. The wine quality data set is low-dimensional, while the gene expression data set is high-dimensional. We compare the proposed method with other existing methods in both settings.

5.2.1 Wine quality

Two data sets were created from red and white wine samples; we deal with the white wine data. This data set is publicly available for research from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). The details are described in [Cortez et al., 2009]. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

The 11 covariates are given below.

v1: fixed acidity

v2: volatile acidity

v3: citric acid

v4: residual sugar


v5: chlorides

v6: free sulfur dioxide

v7: total sulfur dioxide

v8: density

v9: pH

v10: sulphates

v11: alcohol

Output variable (based on sensory data):

v12: quality (score between 0 and 10)

The objective of the analysis is to identify the significant covariates, which are correlated with other covariates, while attaining high predictive performance. We expand each numerical covariate into a group using transformations up to the 3rd-order Hermite polynomials.
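A minimal sketch of this expansion is given below, assuming numpy and probabilists' Hermite polynomials applied to standardized covariates; the exact polynomial normalization is not stated in the thesis, so this choice is an assumption.

import numpy as np

def hermite_group_expansion(X, degree=3):
    # Expand each standardized covariate into a group of Hermite terms
    # He_1(x) = x, He_2(x) = x^2 - 1, He_3(x) = x^3 - 3x.
    X = (X - X.mean(0)) / X.std(0)
    blocks, groups, start = [], [], 0
    for j in range(X.shape[1]):
        x = X[:, j]
        block = np.column_stack([x, x ** 2 - 1.0, x ** 3 - 3.0 * x])[:, :degree]
        blocks.append(block)
        groups.append(list(range(start, start + block.shape[1])))
        start += block.shape[1]
    return np.hstack(blocks), groups

Each original covariate thus contributes one group of three columns, and the group variable selection methods are applied to the expanded design matrix.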

We compare the prediction errors of the four proposed methods with the other penalized approaches used in the previous subsection. We obtained 50 data sets, each of which was divided into three parts, training, validation and test sets, by randomly selecting 50%, 20% and 30% of the observations, respectively. Models were fitted on the training data, the validation data were used to choose the regularization parameters, and we evaluated the test error (the mean-squared error) on the test data set.
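The sketch below illustrates this protocol; fit_gwr is a hypothetical placeholder for fitting the proposed estimator at a given pair (λ1, λ2) and is not a function defined in this thesis.

import numpy as np

def split_and_tune(X, y, fit_gwr, lam1_grid, lam2_grid, seed=None):
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    # 50% training, 20% validation, 30% test.
    tr, va, te = np.split(idx, [int(0.5 * n), int(0.7 * n)])
    best_beta, best_mse = None, np.inf
    for lam1 in lam1_grid:
        for lam2 in lam2_grid:
            beta = fit_gwr(X[tr], y[tr], lam1, lam2)      # fit on training data
            mse = np.mean((y[va] - X[va] @ beta) ** 2)    # tune on validation data
            if mse < best_mse:
                best_beta, best_mse = beta, mse
    test_error = np.mean((y[te] - X[te] @ best_beta) ** 2)
    return best_beta, test_error

The intercept is omitted for brevity in this sketch.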


Table 5.6: The results of White Wine data

Methods PE Variable

gBridge 1.495 (.390) 14.5 (1.175)

gLasso 1.562 (.319) 33.0 (.548)

gMCP 1.597 (.427) 33.0 (.774)

gLapl 1.630 (.333) 33.0 (.548)

gWR 1 0.612 (.011) 27.0 (.395)

gWR 2 0.615 (.012) 27.0 (.449)

gWR 3 0.618 (.014) 27.0 (.415)

gWR 4 0.612 (.011) 27.0 (.447)

Lasso 2.031 (.450) 31.5 (1.289)

Elastic net 2.070 (.454) 32.0 (1.304)

The results summarized in Table 5.6 show that the four proposed methods have the smallest prediction errors. For the wine quality data, we also report the results of the Lasso [Tibshirani, 1996] and the Elastic net [Zou and Hastie, 2005], which perform individual variable selection. In the additive model for the wine quality data, the group variable selection methods perform better than the Lasso and the Elastic net.

The partial fits for 8 selected covariates are shown in Figure 5.6. We can see the nonlinear relations between the covariates and the response variable, and the polynomial models combined with the proposed methods capture this nonlinearity well. All wines contain sulfur dioxide, and it plays an important role in wine quality. Among the covariates, the free sulfur dioxide is closely associated with the total sulfur dioxide. Since the correlation between these two covariates is high, their partial fits are similar. Figure 5.6 also shows that an increase in alcohol tends to result in a higher quality wine.


Figure 5.6: White Wine: the partial fit on selected covariates


5.2.2 Microarray gene expression data

We analyze a popular microarray gene expression data set, the Trim data used in Scheetz et al. [2006]. This data set consists of the expression levels of 18,975 genes obtained from 120 rats. The main objective of the analysis is to find genes that are correlated with the TRIM32 gene, which is known to cause Bardet-Biedl syndrome. We first select the 3,000 genes with the largest variance in expression level and then choose the top 100 genes that have the largest absolute correlation with TRIM32 among the selected 3,000 genes. For the group structure, we consider the continuous piecewise linear model with pk = 5 for all genes (a sketch of this preprocessing is given below). Table 5.7 and Figure 5.7 show that the proposed estimators outperform the other estimators. In this data set, ‘gWR 1’ has the best prediction accuracy, but among the four proposed methods it selects more variables than the others. The performance results depend on the choice of the weight, so a weight suitable for the data should be considered. Although finding a proper weight is also important, it is a separate topic.
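The following is a minimal sketch of the screening step and of the piecewise-linear group basis, assuming numpy; placing the knots at empirical quantiles is an illustrative assumption, since the thesis does not specify the knot locations.

import numpy as np

def screen_genes(expr, trim32, n_var=3000, n_corr=100):
    # Keep the n_var genes with the largest variance, then the n_corr of those
    # with the largest absolute correlation with TRIM32.
    top_var = np.argsort(expr.var(axis=0))[-n_var:]
    corr = np.array([abs(np.corrcoef(expr[:, j], trim32)[0, 1]) for j in top_var])
    return top_var[np.argsort(corr)[-n_corr:]]

def piecewise_linear_basis(x, n_basis=5):
    # Continuous piecewise-linear group: x plus hinge terms (x - t)_+ at
    # interior quantile knots, giving n_basis columns per gene.
    knots = np.quantile(x, np.linspace(0, 1, n_basis + 1)[1:-1])
    return np.column_stack([x] + [np.maximum(x - t, 0.0) for t in knots])

Applying piecewise_linear_basis to each of the 100 screened genes yields 100 groups of size pk = 5.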

We also add the result for the GDS (Groupwise Doubly Sparse) penalty of Kwon et al. [2011]. The GDS penalty can simultaneously select both groups and variables, as the group Bridge penalty does; however, the GDS penalty can control the group-level sparsity and the variable-level sparsity separately, which is not possible for the group Bridge penalty. See Kwon et al. [2011] for details.

Based on these empirical studies, we find that the proposed method performs well in terms of prediction accuracy and group and variable selectivity compared with the other methods.


Table 5.7: The results of Trim data

Methods PE Variable

gBridge 9.356 (3.255) 10.5 (.678)

gLasso 0.446 (.036) 75.0 (2.094)

gMCP 0.513 (.044) 40.0 (6.094)

gLapl 0.419 (.035) 492.5 (25.845)

gWR 1 0.332 (.014) 350.0 (16.644)

gWR 2 0.358 (.016) 292.5 (24.856)

gWR 3 0.362 (.014) 195.0 (21.188)

gWR 4 0.343 (.015) 260.0 (19.614)

GDS 0.388 (.015) 12.0 (.789)


Figure 5.7: TRIM : Boxplots of the 50 prediction errors.


Chapter 6

Concluding remarks

In this thesis, we proposed a new penalty function for group variable selection that reflects the correlation structure between groups. We showed that the sparse groupwise weighted ridge estimator satisfies the group selection consistency. From the numerical studies, we confirmed that our method performs satisfactorily in comparison with other methods. We considered the linear regression problem, but the method can be extended to other likelihood functions via second-order approximations to the loss function (a sketch is given below). In future work, we need to study the theoretical properties of the proposed method for other likelihood functions.
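As an illustration of this extension, for a generalized linear model with log-likelihood ℓ(β) one may replace the squared-error loss by a second-order Taylor expansion of −ℓ(β) around a current estimate, which turns each update into a penalized weighted least squares problem to which the proposed penalty applies directly. The display below is a standard sketch of this step; the notation V and z̃ is introduced only for this illustration and is not used elsewhere in the thesis.
\[
-\ell(\beta) \;\approx\; -\ell(\tilde\beta) - \nabla\ell(\tilde\beta)^T(\beta - \tilde\beta) + \frac{1}{2}(\beta - \tilde\beta)^T X^T V X(\beta - \tilde\beta)
\;=\; \frac{1}{2}(\tilde z - X\beta)^T V(\tilde z - X\beta) + \mathrm{const},
\]
where, for a canonical-link generalized linear model, $V = \mathrm{diag}\{b''(x_i^T\tilde\beta)\}$ and $\tilde z = X\tilde\beta + V^{-1}\{y - b'(X\tilde\beta)\}$.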


Appendix

Proof of Proposition 1. It suffices to show that there exists a $\delta > 0$ such that $Q_\lambda(\beta) \ge Q_\lambda(\hat\beta)$ for all $\beta \in B(\hat\beta, \delta)$, where $B(\hat\beta, \delta) = \{\beta : \|\beta - \hat\beta\|_2 \le \delta\}$. From the convexity of $\|y - X\beta\|_2^2/2n$, we have
\[
Q_\lambda(\beta) - Q_\lambda(\hat\beta) \ge \sum_{k=1}^K v_k,
\]
where
\[
v_k = -\bigl\{X_k'(y - X\hat\beta)/n\bigr\}'(\beta_k - \hat\beta_k)
+ J_{\lambda_1}\bigl(\|\beta_k\|_2\bigr) - J_{\lambda_1}\bigl(\|\hat\beta_k\|_2\bigr)
+ \frac{\lambda_2}{2}\beta_k' w_k \beta_k - \frac{\lambda_2}{2}\hat\beta_k' w_k \hat\beta_k.
\]

First, consider the case where $k \in A$. Let $\delta_k = \|\hat\beta_k\|_2 - a\sqrt{p_k}\lambda_1$. Then $\delta_k > 0$ and
\[
\|\beta_k\|_2 \ge \|\hat\beta_k\|_2 - \|\beta_k - \hat\beta_k\|_2 > a\sqrt{p_k}\lambda_1
\]
for all $\beta_k \in B(\hat\beta_k, \delta_k)$. Hence from (1),
\[
v_k = \frac{\lambda_2 w_k}{2}\bigl(\beta_k'\beta_k - 2\hat\beta_k'\beta_k + \hat\beta_k'\hat\beta_k\bigr)
= \frac{\lambda_2 w_k}{2}\|\beta_k - \hat\beta_k\|_2^2 \ge 0.
\]

Second, consider the case where $k \in A^c$. Let $\delta_k = \min\{\gamma_k, a\sqrt{p_k}\lambda_1\}$, where $\gamma_k = 2a\bigl(\lambda_1\sqrt{p_k} - \|X_k'(y - X\hat\beta)/n\|_2\bigr) > 0$. Then $\|\beta_k\|_2 < a\sqrt{p_k}\lambda_1$ and $\|\beta_k\|_2 < \gamma_k$ for all $\beta_k \in B(0, \delta_k)$. Hence, we have
\[
v_k \ge \Bigl(-\bigl\|X_k'(y - X\hat\beta)/n\bigr\|_2 - \|\beta_k\|_2/2a + \sqrt{p_k}\lambda_1\Bigr)\|\beta_k\|_2
+ \frac{\lambda_2}{2} w_k \beta_k'\beta_k \ge 0.
\]
Therefore, we have $Q_\lambda(\beta) \ge Q_\lambda(\hat\beta)$ for all $\beta \in B(\hat\beta, \delta)$, where $\delta = \min\{\delta_1, \ldots, \delta_K\}$. This completes the proof.

Proof of Theorem 1. By the KKT conditions and the definition of $\beta^{o\lambda_2}$, the oracle groupwise weighted ridge estimator $\beta^{o\lambda_2}$ satisfies
\[
X_k'(y - X_A\beta^{o\lambda_2}_A)/n = \lambda_2 w_k \beta^{o\lambda_2}_k, \qquad \forall k \in A. \tag{6.1}
\]
From Proposition 1, it suffices to show that $\beta^{o\lambda_2}$ satisfies
\[
P\Bigl\{\min_{k\in A}\|\beta^{o\lambda_2}_k\|_2/\sqrt{p_k} > a\lambda_1\Bigr\} \ge 1 - P_1, \tag{6.2}
\]
and
\[
P\Bigl\{\max_{k\in A^c}\bigl\|X_k'(y - X\beta^{o\lambda_2})/n\bigr\|_2/\sqrt{p_k} < \lambda_1\Bigr\} \ge 1 - P_2. \tag{6.3}
\]

First, we show (6.2). Let $\Sigma_A = X_A'X_A/n$ and let $w_A$ be the $p_A \times p_A$ diagonal weight matrix. Then direct calculation yields
\[
\begin{aligned}
\beta^{o\lambda_2}_A - \beta^*_A
&= \Bigl(\frac{X_A'X_A}{n} + \lambda_2 w_A\Bigr)^{-1} X_A'(X_A\beta^*_A + \varepsilon)/n - \beta^*_A \\
&= \Bigl\{\Bigl(\frac{X_A'X_A}{n} + \lambda_2 w_A\Bigr)^{-1}\frac{X_A'X_A}{n} - I\Bigr\}\beta^*_A
+ \Bigl(\frac{X_A'X_A}{n} + \lambda_2 w_A\Bigr)^{-1} X_A'\varepsilon/n \\
&= -\lambda_2 w_A(\Sigma_A + \lambda_2 w_A)^{-1}\beta^*_A + (\Sigma_A + \lambda_2 w_A)^{-1}X_A'\varepsilon/n.
\end{aligned}
\]
For $k \in A$, let $U_k$ be the $p_A \times p_A$ diagonal matrix whose $p_k$ diagonal elements corresponding to $\beta_k$ are one and whose remaining elements are zero, and let $w_{\max}$ and $w_{\min}$ denote the maximum and the minimum of the weights. Then we can write
\[
\|\beta^{o\lambda_2}_k - \beta^*_k\|_2 = \|U_k(\beta^{o\lambda_2}_A - \beta^*_A)\|_2 \le \|\eta_k\|_2 + \|V_k'\varepsilon\|_2,
\]
where $\eta_k = -\lambda_2 w_A U_k(\Sigma_A + \lambda_2 w_A)^{-1}\beta^*_A$ is a vector of length $p_A$ and $V_k = X_A(\Sigma_A + \lambda_2 w_A)^{-1}U_k/n$ is an $n \times p_A$ matrix. Note that
\[
\|\eta_k\|_2 \le \|\lambda_2 w_A U_k\|_F\,\|(\Sigma_A + \lambda_2 w_A)^{-1}\beta^*_A\|_2
\le \lambda_2 w_{\max}\sqrt{p_k}\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min}), \tag{6.4}
\]
since
\[
\|\lambda_2 w_A U_k\|_F = \Bigl\{\lambda_2^2\sum_{j=1}^{p_k} u_{kj}' w_A' w_A u_{kj}\Bigr\}^{1/2}
= \lambda_2\sqrt{w_k^2 p_k} \le \lambda_2 w_{\max}\sqrt{p_k},
\]
and
\[
\begin{aligned}
\|V_k\|_F^2 &= \sum_{j=1}^{p_k}\bigl\|X_A(\Sigma_A + \lambda_2 w_A)^{-1}u_{kj}/n\bigr\|_2^2
= \sum_{j=1}^{p_k} u_{kj}'(\Sigma_A + \lambda_2 w_A)^{-1}\frac{X_A'X_A}{n}(\Sigma_A + \lambda_2 w_A)^{-1}u_{kj}/n \\
&\le \sum_{j=1}^{p_k} u_{kj}'(\Sigma_A + \lambda_2 w_A)^{-1}u_{kj}/n
\le p_k/\{n(\rho_A + \lambda_2 w_{\min})\},
\end{aligned}
\]
where $u_{kj}$ is the $j$th column vector of the matrix $U_k$ and $\|\cdot\|_F$ stands for the Frobenius norm. Hence, from Markov's inequality, for all $k \in A$,
\[
\begin{aligned}
P\bigl\{\|\beta^{o\lambda_2}_k - \beta^*_k\|_2/\sqrt{p_k} \ge m^* - a\lambda_1\bigr\}
&\le P\bigl\{\|\eta_k\|_2 + \|V_k'\varepsilon\|_2 \ge \sqrt{p_k}(m^* - a\lambda_1)\bigr\} \\
&\le P\bigl\{\|V_k'\varepsilon\|_2 \ge \sqrt{p_k}(m^* - a\lambda_1) - \lambda_2\sqrt{p_k}\,w_{\max}\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\bigr\} \\
&\le E\|V_k'\varepsilon\|_2^2\big/\bigl[p_k\{(m^* - a\lambda_1) - \lambda_2 w_{\max}\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\}^2\bigr] \\
&\le \sigma_0^2\|V_k\|_F^2\big/\bigl[p_k\{(m^* - a\lambda_1) - \lambda_2 w_{\max}\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\}^2\bigr] \\
&\le \sigma_0^2\big/\bigl[n(\rho_A + \lambda_2 w_{\min})\{(m^* - a\lambda_1) - \lambda_2 w_{\max}\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\}^2\bigr].
\end{aligned}
\]
By the triangle inequality $\|\beta^{o\lambda_2}_k\|_2 \ge \|\beta^*_k\|_2 - \|\beta^*_k - \beta^{o\lambda_2}_k\|_2$, we have
\[
\begin{aligned}
P\Bigl\{\min_{k\in A}\|\beta^{o\lambda_2}_k\|_2/\sqrt{p_k} \le a\lambda_1\Bigr\}
&\le P\Bigl\{\min_{k\in A}\|\beta^*_k\|_2/\sqrt{p_k} - \max_{k\in A}\|\beta^{o\lambda_2}_k - \beta^*_k\|_2/\sqrt{p_k} \le a\lambda_1\Bigr\} \\
&\le P\Bigl\{\max_{k\in A}\|\beta^{o\lambda_2}_k - \beta^*_k\|_2/\sqrt{p_k} \ge m^* - a\lambda_1\Bigr\} \\
&\le \sum_{k\in A} P\bigl\{\|\beta^{o\lambda_2}_k - \beta^*_k\|_2/\sqrt{p_k} \ge m^* - a\lambda_1\bigr\} \\
&\le \sigma_0^2 K^*\big/\bigl[n(\rho_A + \lambda_2 w_{\min})\{(m^* - a\lambda_1) - \lambda_2 w_{\max}\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\}^2\bigr].
\end{aligned}
\]

Now, we show (6.3). For $k \in A^c$,
\[
\begin{aligned}
X_k'(y - X_A\beta^{o\lambda_2}_A)/n
&= X_k'\bigl\{-X_A(\beta^{o\lambda_2}_A - \beta^*_A) + \varepsilon\bigr\}/n \\
&= X_k'\bigl[-X_A\bigl\{-\lambda_2 w_A(\Sigma_A + \lambda_2 w_A)^{-1}\beta^*_A + (\Sigma_A + \lambda_2 w_A)^{-1}X_A'\varepsilon/n\bigr\} + \varepsilon\bigr]/n \\
&= \lambda_2 X_k'X_A w_A(\Sigma_A + \lambda_2 w_A)^{-1}\beta^*_A/n + X_k'\bigl\{I - X_A(\Sigma_A + \lambda_2 w_A)^{-1}X_A'/n\bigr\}\varepsilon/n \\
&= \xi_k + W_k'\varepsilon,
\end{aligned}
\]
where $\xi_k = \lambda_2 X_k'X_A w_A(\Sigma_A + \lambda_2 w_A)^{-1}\beta^*_A/n$ is a vector of length $p_k$ and $W_k = \{I - X_A(\Sigma_A + \lambda_2 w_A)^{-1}X_A'/n\}X_k/n$ is an $n \times p_k$ matrix. We have
\[
\|\xi_k\|_2 \le \|\lambda_2 X_k'X_A w_A/n\|_F\,\|(\Sigma_A + \lambda_2 w_A)^{-1}\beta^*_A\|_2
\le \|\lambda_2 X_k'X_A w_A/n\|_F\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})
\le \lambda_2 w_{\max}\sqrt{p_k C}\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min}),
\]
since
\[
\|X_k'X_A w_A/n\|_F = \|w_A X_A'X_k/n\|_F
= \Bigl\{\sum_{j=1}^{p_k} X_{kj}'X_A w_A' w_A X_A'X_{kj}/n^2\Bigr\}^{1/2}
\le \Bigl\{w_{\max}^2\sum_{j=1}^{p_k} \frac{X_{kj}'X_{kj}}{n}\,C\Bigr\}^{1/2}
= w_{\max}\sqrt{p_k C}.
\]
For $\|W_k\|_F$, write $M = I - X_A(\Sigma_A + \lambda_2 w_A)^{-1}X_A'/n$, so that $\|W_k\|_F = \|M X_k/n\|_F$. Write $I = UU'$ and let $Z = X_A/\sqrt{n}$ with singular value decomposition $Z = UDV'$. Then
\[
X_A(\Sigma_A + \lambda_2 w_A)^{-1}X_A'/n = Z(Z'Z + \lambda_2 w_A)^{-1}Z'
= UDV'(VDDV' + \lambda_2 w_A)^{-1}VDU'
= UDV'\bigl\{V(DD + \lambda_2 w_A)V'\bigr\}^{-1}VDU'
= UD(DD + \lambda_2 w_A)^{-1}DU'.
\]
Since $D$ and $DD + \lambda_2 w_A$ are both diagonal,
\[
D(DD + \lambda_2 w_A)^{-1}D = \mathrm{diag}\Bigl\{\frac{d_1^2}{d_1^2 + \lambda_2 w_1}, \ldots, \frac{d_{p_A}^2}{d_{p_A}^2 + \lambda_2 w_{p_A}}\Bigr\},
\qquad
I - D(DD + \lambda_2 w_A)^{-1}D = \mathrm{diag}\Bigl\{\frac{\lambda_2 w_1}{d_1^2 + \lambda_2 w_1}, \ldots, \frac{\lambda_2 w_{p_A}}{d_{p_A}^2 + \lambda_2 w_{p_A}}\Bigr\},
\]
and hence
\[
M = UU' - UD(DD + \lambda_2 w_A)^{-1}DU' = U\bigl\{I - D(DD + \lambda_2 w_A)^{-1}D\bigr\}U' \le UU' = I.
\]
Therefore
\[
\|W_k\|_F^2 = \|MX_k/n\|_F^2 = \sum_{j=1}^{p_k}\|MX_{kj}/n\|_2^2 \le \sum_{j=1}^{p_k}\|X_{kj}/n\|_2^2 = \frac{1}{n}\sum_{j=1}^{p_k} 1 = \frac{p_k}{n}.
\]
From Markov's inequality, we have
\[
\begin{aligned}
P\Bigl\{\max_{k\in A^c}\bigl\|X_k'(y - X\beta^{o\lambda_2})/n\bigr\|_2/\sqrt{p_k} \ge \lambda_1\Bigr\}
&\le \sum_{k\in A^c} P\bigl\{\bigl\|X_k'(y - X\beta^{o\lambda_2})/n\bigr\|_2/\sqrt{p_k} \ge \lambda_1\bigr\} \\
&\le \sum_{k\in A^c} P\bigl\{\|\xi_k\|_2 + \|W_k'\varepsilon\|_2 \ge \sqrt{p_k}\lambda_1\bigr\} \\
&\le \sum_{k\in A^c} P\bigl\{\|W_k'\varepsilon\|_2 \ge \sqrt{p_k}\lambda_1 - \lambda_2 w_{\max}\sqrt{p_k C}\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\bigr\} \\
&\le \sum_{k\in A^c} E\|W_k'\varepsilon\|_2^2\big/\bigl\{\sqrt{p_k}\lambda_1 - \lambda_2 w_{\max}\sqrt{p_k C}\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\bigr\}^2 \\
&\le \sum_{k\in A^c} \sigma^2\|W_k\|_F^2\big/\bigl[p_k\bigl\{\lambda_1 - \lambda_2 w_{\max}\sqrt{C}\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\bigr\}^2\bigr] \\
&\le \sigma^2(K - K^*)\big/\bigl[n\bigl\{\lambda_1 - \lambda_2 w_{\max}\sqrt{C}\,\|\beta^*\|/(\rho_A + \lambda_2 w_{\min})\bigr\}^2\bigr].
\end{aligned}
\]
This completes the proof.

Proof of Theorem 2. Assume that there exists another local minimizer $\tilde\beta \in \Omega(\lambda)$. Let $S = (S_1^T, \ldots, S_K^T)^T$, where $S_k = -X_k^T(y - X\hat\beta)/n$. Simple algebra yields
\[
\|y - X\beta\|_2^2 - \|y - X\hat\beta\|_2^2 = 2nS^T(\beta - \hat\beta) + (\beta - \hat\beta)^T(X^TX)(\beta - \hat\beta) \tag{6.5}
\]
for any $\beta \in \mathbb{R}^p$. Denote $\beta_h = \tilde\beta + h(\hat\beta - \tilde\beta)$ for some $h \in (0, 1)$; then from (6.5),
\[
\|y - X\beta_h\|_2^2 - \|y - X\tilde\beta\|_2^2
= -2nhS^T(\tilde\beta - \hat\beta) + (h^2 - 2h)(\tilde\beta - \hat\beta)^T(X^TX)(\tilde\beta - \hat\beta),
\]
and
\[
\frac{\lambda_2}{2}\sum_{k=1}^K w_k\|\beta_{hk}\|_2^2 - \frac{\lambda_2}{2}\sum_{k=1}^K w_k\|\tilde\beta_k\|_2^2
= \lambda_2 h\sum_{k=1}^K w_k\tilde\beta_k^T(\hat\beta_k - \tilde\beta_k) + \frac{\lambda_2}{2}h^2(\hat\beta - \tilde\beta)^T W(\hat\beta - \tilde\beta).
\]
Hence, it follows that
\[
Q_\lambda(\beta_h) - Q_\lambda(\tilde\beta) \le h\sum_{k=1}^K d_k(h) + \frac{h^2}{2}(\hat\beta - \tilde\beta)^T\Bigl(\frac{X^TX}{n} + \lambda_2 W\Bigr)(\hat\beta - \tilde\beta),
\]
where
\[
d_k(h) = -S_k^T(\tilde\beta_k - \hat\beta_k) - \rho\|\tilde\beta_k - \hat\beta_k\|_2^2
+ \bigl\{J^{(k)}_{\lambda_1}(\|\beta_{hk}\|_2) - J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigr\}/h
+ \lambda_2 w_k\tilde\beta_k^T(\hat\beta_k - \tilde\beta_k).
\]

First, consider the case $k \in A = \{k : \|\hat\beta_k\|_2 \ne 0\}$. From the uniqueness condition, $\hat\beta$ satisfies
\[
\|\hat\beta_k\|_2/\sqrt{p_k} > a\lambda_1 \quad\text{and}\quad \|\hat\beta_k\|_2/\sqrt{p_k} > \lambda_1/(\rho + \lambda_2 w_k),
\]
so that $S_k = -\lambda_2 w_k\hat\beta_k$ for $k \in A$.

If $\|\tilde\beta_k\|_2 = 0$, then
\[
\begin{aligned}
&-S_k^T(\tilde\beta_k - \hat\beta_k) + \lambda_2 w_k\tilde\beta_k^T(\hat\beta_k - \tilde\beta_k) = -\lambda_2 w_k\|\tilde\beta_k - \hat\beta_k\|_2^2 = -\lambda_2 w_k\|\hat\beta_k\|_2^2, \\
&-\rho\|\tilde\beta_k - \hat\beta_k\|_2^2 = -\rho\|\hat\beta_k\|_2^2, \\
&\bigl\{J^{(k)}_{\lambda_1}(\|\beta_{hk}\|_2) - J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigr\}/h \le \nabla J^{(k)}_{\lambda_1}(0)\bigl(\|\beta_{hk}\|_2 - \|\tilde\beta_k\|_2\bigr)/h \le \lambda_1\sqrt{p_k}\,\|\hat\beta_k\|_2.
\end{aligned}
\]
Hence
\[
d_k(h) \le \|\hat\beta_k\|_2\bigl(-\lambda_2 w_k\|\hat\beta_k\|_2 - \rho\|\hat\beta_k\|_2 + \lambda_1\sqrt{p_k}\bigr)
= \|\hat\beta_k\|_2\bigl(\lambda_1\sqrt{p_k} - (\rho + \lambda_2 w_k)\|\hat\beta_k\|_2\bigr) < 0,
\]
since $\|\hat\beta_k\|_2/\sqrt{p_k} > \lambda_1/(\rho + \lambda_2 w_k)$.

If $0 < \|\tilde\beta_k\|_2/\sqrt{p_k} < a\lambda_1$, then, recalling that $\|\hat\beta_k\|_2/\sqrt{p_k} > a\lambda_1$,
\[
\begin{aligned}
&-S_k^T(\tilde\beta_k - \hat\beta_k) + \lambda_2 w_k\tilde\beta_k^T(\hat\beta_k - \tilde\beta_k) \le -\lambda_2 w_k\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr)^2, \\
&-\rho\|\tilde\beta_k - \hat\beta_k\|_2^2 \le -\rho\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr)^2, \\
&\bigl\{J^{(k)}_{\lambda_1}(\|\beta_{hk}\|_2) - J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigr\}/h
\le \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigl(\|\beta_{hk}\|_2 - \|\tilde\beta_k\|_2\bigr)/h
\le \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr).
\end{aligned}
\]
Hence
\[
d_k(h) \le \bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr)
\bigl\{-(\rho + \lambda_2 w_k)\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr) + \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigr\} < 0
\]
unless $\|\tilde\beta_k\|_2 = \|\hat\beta_k\|_2$, since, whether $\rho + \lambda_2 w_k \le 1/a$ or $\rho + \lambda_2 w_k > 1/a$,
\[
(\rho + \lambda_2 w_k)\|\tilde\beta_k\|_2/\sqrt{p_k} + \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)/\sqrt{p_k}
\le \max\{\lambda_1,\, a(\rho + \lambda_2 w_k)\lambda_1\}
< (\rho + \lambda_2 w_k)\|\hat\beta_k\|_2/\sqrt{p_k}.
\]

If $\|\tilde\beta_k\|_2/\sqrt{p_k} > a\lambda_1$, then
\[
\begin{aligned}
&-S_k^T(\tilde\beta_k - \hat\beta_k) + \lambda_2 w_k\tilde\beta_k^T(\hat\beta_k - \tilde\beta_k) \le -\lambda_2 w_k\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr)^2, \\
&-\rho\|\tilde\beta_k - \hat\beta_k\|_2^2 \le -\rho\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr)^2, \\
&\bigl\{J^{(k)}_{\lambda_1}(\|\beta_{hk}\|_2) - J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigr\}/h
\le \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr),
\end{aligned}
\]
and hence
\[
d_k(h) \le -\lambda_2 w_k\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr)^2 - \rho\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr)^2
+ \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)\bigl(\|\hat\beta_k\|_2 - \|\tilde\beta_k\|_2\bigr) < 0
\]
unless $\|\tilde\beta_k\|_2 = \|\hat\beta_k\|_2$.

Second, consider the case $k \in A^c = \{k : \|\hat\beta_k\|_2 = 0\}$. Since $\|\beta_{hk}\|_2 = (1-h)\|\tilde\beta_k\|_2 < \|\tilde\beta_k\|_2$, it is easy to see that
\[
d_k(h) \le \|\tilde\beta_k\|_2\bigl\{\|S_k\|_2 - \rho\|\tilde\beta_k\|_2 - \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2) - \lambda_2 w_k\|\tilde\beta_k\|_2\bigr\} < 0
\]
unless $\|\tilde\beta_k\|_2 = 0$, since
\[
\inf_{\|\tilde\beta_k\|_2 > 0}\bigl\{(\rho + \lambda_2 w_k)\|\tilde\beta_k\|_2/\sqrt{p_k} + \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)/\sqrt{p_k}\bigr\}
\ge \min\{\lambda_1,\, a(\rho + \lambda_2 w_k)\lambda_1\},
\]
while $\|S_k\|_2/\sqrt{p_k} < \min\{\lambda_1,\, a(\rho + \lambda_2 w_k)\lambda_1\}$ under the uniqueness condition. Indeed, if $\|\tilde\beta_k\|_2 = 0$, then $d_k(h) \le 0$; if $0 < \|\tilde\beta_k\|_2/\sqrt{p_k} < a\lambda_1$, then $\nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2) = \sqrt{p_k}\lambda_1 - \|\tilde\beta_k\|_2/a$ and
\[
(\rho + \lambda_2 w_k)\|\tilde\beta_k\|_2/\sqrt{p_k} + \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)/\sqrt{p_k}
\;\ge\; \lambda_1 \ \text{ if } \rho + \lambda_2 w_k \ge 1/a,
\quad\text{and}\quad
\;\ge\; a(\rho + \lambda_2 w_k)\lambda_1 \ \text{ otherwise};
\]
if $\|\tilde\beta_k\|_2/\sqrt{p_k} > a\lambda_1$, then
\[
(\rho + \lambda_2 w_k)\|\tilde\beta_k\|_2/\sqrt{p_k} + \nabla J^{(k)}_{\lambda_1}(\|\tilde\beta_k\|_2)/\sqrt{p_k} \ge a(\rho + \lambda_2 w_k)\lambda_1.
\]

Thus, there exists a constant $C > 0$ that does not depend on $h$ such that, unless $\tilde\beta = \hat\beta$,
\[
Q_\lambda(\beta_h) - Q_\lambda(\tilde\beta) \le -hC + \frac{h^2}{2}(\hat\beta - \tilde\beta)^T\Bigl(\frac{X^TX}{n} + \lambda_2 W\Bigr)(\hat\beta - \tilde\beta).
\]
Therefore, we can choose a sufficiently small $\delta > 0$ so that $Q_\lambda(\beta_h) - Q_\lambda(\tilde\beta) < 0$ for all $h \in (0, \delta)$ unless $\tilde\beta = \hat\beta$, which contradicts the assumption that $\tilde\beta$ is a local minimizer. Hence $\hat\beta$ is the unique local minimizer of $Q_\lambda(\beta)$. This completes the proof.

Proof of Theorem 3. By Proposition 1, the uniqueness condition and (6.1), it suffices to show that
\[
P\Bigl\{\min_{k\in A}\|\beta^{o\lambda_2}_k\|_2/\sqrt{p_k} > \max\{a\lambda_1,\, \lambda_1/(\rho + \lambda_2 w_k)\}\Bigr\} \ge 1 - P_1, \tag{6.6}
\]
and
\[
P\Bigl\{\max_{k\in A^c}\bigl\|X_k^T(y - X\beta^{o\lambda_2})/n\bigr\|_2/\sqrt{p_k} < \min\{\lambda_1,\, a(\rho + \lambda_2 w_k)\lambda_1\}\Bigr\} \ge 1 - P_2. \tag{6.7}
\]
Since the rest of the proof is quite similar to those of (6.2) and (6.3), we omit the details.


Bibliography

H. Bondell and B. Reich. Simultaneous regression shrinkage, variable selection,

and supervised clustering of predictors with oscar. Biometrics, 64(1):115–

123, 2008.

P. Breheny and J. Huang. Penalized methods for bi-level variable selection.

Statistics and its Interface, 2(3):369, 2009.

F. Chung. Spectral graph theory. Number 92. American Mathematical Society, 1997.

Z. Daye and X. Jeng. Shrinkage and model selection with correlated variables

via weighted fusion. Computational Statistics & Data Analysis, 53(4):1284–

1298, 2009.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and


its oracle properties. Journal of the American Statistical Association, 96

(456):1348–1360, 2001.

R. Foygel and M. Drton. Exact block-wise optimization in group lasso and

sparse group lasso for linear regression. Arxiv preprint arXiv:1010.3320,

2010.

I. Frank and J. Friedman. A statistical view of some chemometrics regression

tools. Technometrics, pages 109–135, 1993.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized

linear models via coordinate descent. Journal of statistical software, 33(1):

1, 2010.

J. Huang, P. Breheny, S. Ma, and C. Zhang. The mnet method for variable

selection. Department of Statistics and Actuarial Science, The University of

Iowa, 2010a.

J. Huang, F. Wei, and S. Ma. Semiparametric regression pursuit. Technical

report, Citeseer, 2010b.

J. Huang, S. Ma, H. Li, and C. Zhang. The sparse laplacian shrinkage estimator

for high-dimensional regression. The Annals of Statistics, 39(4):2021–2046,

2011.


L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso.

In Proceedings of the 26th Annual International Conference on Machine

Learning, pages 433–440. ACM, 2009.

Y. Kim. Sparse weighted ridge estimator. 2012.

Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica,

16(2):375, 2006.

Y. Kim, H. Choi, and H. Oh. Smoothly clipped absolute deviation on high

dimensions. Journal of the American Statistical Association, 103(484):1665–

1673, 2008.

S. Kwon, J. Ahn, W. Jang, S. Lee, and Y. Kim. A doubly sparse penalty approach for group variable selection. manuscript, 2011.

C. Li and H. Li. Network-constrained regularization and variable selection for

analysis of genomic data. Bioinformatics, 24(9):1175–1182, 2008.

C. Li and H. Li. Variable selection and regression analysis for graph-structured

covariates with an application to genomics. The Annals of Applied Statistics,

4(3):1498–1516, 2010.

J. Liu, J. Huang, S. Ma, and K. Wang. Incorporating group correlations in

genome-wide association studies using smoothed group lasso. 2011.


L. Meier, S. Van De Geer, and P. Buhlmann. The group lasso for logistic

regression. Journal of the Royal Statistical Society: Series B (Statistical

Methodology), 70(1):53–71, 2008.

W. Pan, B. Xie, and X. Shen. Incorporating predictor network in penalized

regression with application to microarray data. Biometrics, 66(2):474–484,

2010.

T. Scheetz, K. Kim, R. Swiderski, A. Philp, T. Braun, K. Knudtson, A. Dor-

rance, G. DiBona, J. Huang, T. Casavant, et al. Regulation of gene expres-

sion in the mammalian eye and its relevance to eye disease. Proceedings of

the National Academy of Sciences, 103(39):14429–14434, 2006.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the

Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

P. Tseng. Convergence of a block coordinate descent method for nondifferen-

tiable minimization. Journal of optimization theory and applications, 109

(3):475–494, 2001.

H. Wang and C. Leng. A note on adaptive group lasso. Computational Statis-

tics & Data Analysis, 52(12):5277–5286, 2008.


L. Wang, G. Chen, and H. Li. Group scad regression analysis for microarray

time course gene expression data. Bioinformatics, 23(12):1486, 2007.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped

variables. Journal of the Royal Statistical Society: Series B (Statistical

Methodology), 68(1):49–67, 2006.

A. Yuille and A. Rangarajan. The concave-convex procedure (cccp). Advances

in Neural Information Processing Systems, 2:1033–1040, 2002.

C. Zhang. Nearly unbiased variable selection under minimax concave penalty.

The Annals of Statistics, 38(2):894–942, 2010.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American

Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net.

Journal of the Royal Statistical Society: Series B (Statistical Methodology),

67(2):301–320, 2005.


Abstract (in Korean)

In many regression problems, the covariates are often naturally grouped. In analyses that select significant groups, incorporating this group-structure information about the covariates may help improve the results. In this thesis, we consider the problems of estimation and variable selection in high-dimensional regression models where group information on the covariates is available.

We propose a new penalization method that reflects the correlation structure between groups. The proposed method is a combination of the group MCP and a groupwise quadratic penalty, the groupwise weighted ridge penalty. The group MCP guarantees groupwise sparsity, and the groupwise weighted ridge penalty allows highly correlated groups to be selected together. The proposed estimator satisfies the group selection consistency. We proved that the oracle estimator, obtained by discarding the variables in the insignificant groups and using only the variables in the significant groups, is one of the estimators of the proposed method, and that, under conditions on the data, the oracle estimator is the global optimum of the estimation problem with the proposed penalty. We also implemented an optimization algorithm for the proposed method. Through numerical studies, we showed that the proposed estimator performs well in comparison with other existing methods.

Keywords: high-dimensional data, penalty function, group selection consistency, weighted ridge, oracle property

Student Number: 2006−20276