ICML2012読み会: Scaling Up Coordinate Descent Algorithms for
Large L1 regularization Problems
2012-07-28
Yoshihiko Suhara
@sleepy_yoshi
Paper
• Scaling Up Coordinate Descent Algorithms for Large L1 regularization Problems
– by C. Scherrer, M. Halappanavar, A. Tewari, D. Haglin
• Parallel computation of coordinate descent
– e.g., [Bradley+ 11] Parallel Coordinate Descent for L1-Regularized Loss Minimization (ICML 2011)
2
Overview
• Introduces a generic framework for parallel coordinate descent in shared-memory multicore environments
• Proposes the following two methods
– Thread-Greedy Coordinate Descent
– Coloring-Based Coordinate Descent
• Compares four parallel CD methods experimentally
– Thread-Greedy performed unexpectedly well
3
Optimizing an L1-Regularized Loss Function
• L1-regularized loss function
min_𝒘 (1/𝑛) ∑_{𝑖=1}^{𝑛} ℓ(𝒚_𝑖, (𝑿𝒘)_𝑖) + 𝜆‖𝒘‖₁
• where
– 𝑿 ∈ ℝ^{𝑛×𝑘}: design matrix
– 𝒘 ∈ ℝ^𝑘: weight vector
– ℓ(𝑦, ⋅): differentiable convex function
• Examples
– Lasso (L1 + squared error)
– L1-regularized logistic regression
4
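To make the objective concrete, here is a minimal sketch (not from the paper) that evaluates this loss for the Lasso case, assuming ℓ(𝑦, 𝑧) = ½(𝑦 − 𝑧)²:

```python
import numpy as np

def l1_objective(X, y, w, lam):
    """L1-regularized empirical loss: (1/n) * sum_i l(y_i, (Xw)_i) + lam * ||w||_1.
    The squared loss l(y, z) = 0.5 * (y - z)**2 is assumed (the Lasso case)."""
    z = X @ w                                    # (Xw)_i for all i
    n = X.shape[0]
    return 0.5 * np.sum((y - z) ** 2) / n + lam * np.sum(np.abs(w))
```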
Notation
𝑿 = (𝑿₁, 𝑿₂, …, 𝑿_𝑗, …, 𝑿_𝑘)   (𝑿_𝑗: 𝑗-th column)
𝒆_𝑗 = (0, 0, …, 1, …, 0)ᵀ   (𝑗-th standard basis vector)
𝑿 = (𝒙₁ᵀ; 𝒙₂ᵀ; …; 𝒙_𝑖ᵀ; …; 𝒙_𝑛ᵀ)   (𝒙_𝑖ᵀ: 𝑖-th row)
5
Supplement: Coordinate Descent
• Also known as 座標降下法 in Japanese (?)
• Performs a line search along the selected coordinate
• Various ways to choose the coordinate
– e.g., cyclic coordinate descent
• For parallel computation, a subset of all coordinates is selected and updated
6
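As an illustration of the serial method (not the authors' implementation), a minimal cyclic coordinate descent for the Lasso case, assuming ℓ(𝑦, 𝑧) = ½(𝑦 − 𝑧)² so the per-coordinate line search has a closed form:

```python
import numpy as np

def soft_threshold(x, t):
    # shrink x toward zero by t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def cyclic_cd_lasso(X, y, lam, n_passes=100):
    """Cyclic CD for min_w (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1.
    Each pass performs an exact 1-D minimization along every coordinate in turn."""
    n, k = X.shape
    w = np.zeros(k)
    r = y.copy()                          # residual y - Xw, kept up to date
    col_sq = (X ** 2).sum(axis=0)         # ||X_j||^2 for each column
    for _ in range(n_passes):
        for j in range(k):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r + col_sq[j] * w[j]     # gradient info at w_j = 0
            w_new = soft_threshold(rho / n, lam) / (col_sq[j] / n)
            r += X[:, j] * (w[j] - w_new)            # incremental residual update
            w[j] = w_new
    return w
```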
7
GenCD: A Generic Framework for Parallel Coordinate Descent
(For some reason, the slides are in English from this point.)
Generic Coordinate Descent (GenCD)
8
Step 1: Select
• Selects a set 𝐽 of coordinates
• The selection criterion differs across CD variants
– cyclic CD (CCD)
– stochastic CD (SCD)
• selects a singleton
– fully greedy CD
• 𝐽 = {1, …, 𝑘}
– Shotgun [Bradley+ 11]
• selects a random subset of a given size
9
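The four selection rules above can be sketched in a few lines. This is a hypothetical helper (the function and strategy names are mine, not the authors'):

```python
import random

def select(strategy, k, step=0, subset_size=1, rng=random):
    """Select step of GenCD: return the index set J for one iteration."""
    if strategy == "cyclic":          # CCD: one coordinate, in order
        return [step % k]
    if strategy == "stochastic":      # SCD: a single random coordinate
        return [rng.randrange(k)]
    if strategy == "greedy":          # fully greedy: all coordinates
        return list(range(k))
    if strategy == "shotgun":         # random subset of a given size
        return rng.sample(range(k), subset_size)
    raise ValueError(strategy)
```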
Step 2: Propose
• The Propose step computes a proposed increment 𝛿_𝑗 for each 𝑗 ∈ 𝐽
– this step does not actually change the weights
• In Step 2, we maintain a vector 𝝋 ∈ ℝ^𝑘, where 𝝋_𝑗 is a proxy for the objective function evaluated at 𝒘 + 𝛿_𝑗𝒆_𝑗
– 𝝋_𝑗 is updated whenever a new proposal is calculated for 𝑗
– 𝝋 is not necessary if the algorithm accepts all proposals
10
Step 3: Accept
• In the Accept step, the algorithm accepts a subset 𝐽′ ⊆ 𝐽
– [Bradley+ 11] show that correlations among features can lead to divergence if too many coordinates are updated at once (see the figure below)
• In CCD, SCD, and Shotgun, all proposals are accepted
– No need to calculate 𝝋
11
Step 4: Update
• In the Update step, the algorithm updates the weights according to the accepted set 𝐽′
12
(The vector 𝑿𝒘 is maintained.)
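A minimal sketch of the Update step, assuming the vector 𝒛 = 𝑿𝒘 is kept in sync incrementally as the slide notes; the `accepted` mapping (coordinate → increment) is a hypothetical layout:

```python
import numpy as np

def update(w, z, X, accepted):
    """Update step: apply the accepted increments and keep z = Xw in sync.
    `accepted` maps coordinate index j to its accepted increment delta_j."""
    for j, delta in accepted.items():
        w[j] += delta
        z += delta * X[:, j]    # O(nnz(X_j)) instead of recomputing Xw
    return w, z
```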
Approximate Minimization (1/2)
• The Propose step calculates a proposed increment 𝛿_𝑗 for each 𝑗 ∈ 𝐽
𝛿_𝑗 = argmin_𝛿 𝐹(𝒘 + 𝛿𝒆_𝑗) + 𝜆|𝒘_𝑗 + 𝛿|
where 𝐹(𝒘) = (1/𝑛) ∑_{𝑖=1}^{𝑛} ℓ(𝒚_𝑖, (𝑿𝒘)_𝑖)
• For a general loss function, there is no closed-form solution along a given coordinate
– Thus, consider approximate minimization
13
Approximate Minimization (2/2)
• Well-known minimizer (e.g., [Yuan and Lin 10])
𝛿 = −𝜓(𝒘_𝑗; (𝛻_𝑗𝐹(𝒘) − 𝜆)/𝛽, (𝛻_𝑗𝐹(𝒘) + 𝜆)/𝛽)
where 𝜓(𝑥; 𝑎, 𝑏) = 𝑎 if 𝑥 < 𝑎; 𝑏 if 𝑥 > 𝑏; 𝑥 otherwise
• For squared loss 𝛽 = 1; for logistic loss 𝛽 = 1/4
14
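This minimizer translates directly into code. A sketch (`propose_delta` is my name, not the paper's):

```python
def psi(x, a, b):
    # clip x to the interval [a, b]
    return a if x < a else (b if x > b else x)

def propose_delta(w_j, grad_j, lam, beta):
    """Approximate 1-D minimizer from the slide (cf. [Yuan and Lin 10]):
    delta = -psi(w_j; (grad_j - lam)/beta, (grad_j + lam)/beta)."""
    return -psi(w_j, (grad_j - lam) / beta, (grad_j + lam) / beta)
```

Note that when |𝛻_𝑗𝐹(𝒘)| ≤ 𝜆 and 𝒘_𝑗 = 0, the interval contains 0 and the proposal is 𝛿 = 0, matching the L1 subgradient optimality condition.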
Step 2: Propose (Approximated)
15
(Figure: pseudocode for the approximated Propose step; the annotations show ℓ′(𝒚_𝑖, 𝒛_𝑖), 𝑿_𝑗, and 1/𝑛, i.e., the coordinate gradient computed from the maintained 𝒛 = 𝑿𝒘, and each proposal records the decrease in the approximated objective.)
Experiments
16
Algorithms (conventional)
• SHOTGUN [Bradley+ 11]
– Select step: a random subset of the columns
– Accept step: accepts every proposal
• No need to compute a proxy for the objective
– Convergence is guaranteed only if the number of coordinates selected is at most 𝑃∗ = 𝑘/(2𝜌) (*1)
• GREEDY
– Select step: all coordinates
– Propose step: each thread generates proposals for some subset of the coordinates using the approximation
– Accept step: accepts only the single best proposal among all threads
17 (*1) 𝜌 is the largest eigenvalue of 𝑿ᵀ𝑿
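The bound 𝑃∗ = 𝑘/(2𝜌) can be computed directly. An illustrative sketch (for a large sparse 𝑿 one would estimate 𝜌 iteratively rather than form 𝑿ᵀ𝑿 densely):

```python
import numpy as np

def shotgun_max_parallelism(X):
    """P* = k / (2 * rho), where rho is the largest eigenvalue of X^T X,
    per the SHOTGUN convergence bound cited on the slide."""
    k = X.shape[1]
    rho = np.linalg.eigvalsh(X.T @ X).max()   # eigvalsh: X^T X is symmetric
    return k / (2.0 * rho)
```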
Comparisons of the Algorithms
18
Algorithms (proposed)
• THREAD-GREEDY
– Select step: a random set of coordinates (?)
– Propose step: each thread generates proposals for some subset of the coordinates using the approximation
– Accept step: each thread accepts the best of its proposals
– No proof of convergence (however, empirical results are encouraging)
• COLORING
– Preprocessing: structurally independent features are identified via partial distance-2 coloring
– Select step: a random color is selected
– Accept step: accepts every proposal
• since the features are disjoint
19
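The coloring idea above can be sketched serially: two columns conflict if they share a nonzero row (a partial distance-2 coloring of the bipartite row/column graph), so all columns of one color can be updated in parallel without read/write conflicts. This greedy version is illustrative only, not the authors' parallel algorithm:

```python
import numpy as np

def color_features(X):
    """Greedy partial distance-2 coloring of the columns of X: columns that
    share a nonzero row receive different colors."""
    n, k = X.shape
    colors = [-1] * k
    for j in range(k):
        rows = np.nonzero(X[:, j])[0]
        # colors already used by earlier columns sharing a nonzero row with j
        forbidden = {colors[i] for i in range(j) if np.any(X[rows, i])}
        c = 0
        while c in forbidden:
            c += 1
        colors[j] = c
    return colors
```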
Implementation and Platform
• Implementation
– gcc with OpenMP
• -O3 -fopenmp flags
• parallel for pragma
• static scheduling: given 𝑛 iterations and 𝑝 threads, each thread gets 𝑛/𝑝 iterations
• Platform
– AMD Opteron (Magny-Cours)
• 48 cores (12 cores × 4 sockets)
– 256 GB memory
20
Datasets
21
(Table: dataset statistics, including the number of nonzeros)
Convergence rates
22
(Presenter's note: I don't know why.)
Scalability
23
Summary
• Presented GenCD, a generic framework for expressing parallel coordinate descent
– Select, Propose, Accept, Update
• Performed convergence and scalability tests for the four algorithms
– but the authors do not favor any of these algorithms over the others
• The condition for convergence of the THREAD-GREEDY algorithm is an open question
24
References
• [Yuan and Lin 10] G. Yuan, C. Lin, "A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification", Journal of Machine Learning Research, vol. 11, pp. 3183-3234, 2010.
• [Bradley+ 11] J. K. Bradley, A. Kyrola, D. Bickson, C. Guestrin, "Parallel Coordinate Descent for L1-Regularized Loss Minimization", In Proc. ICML '11, 2011.
25
The End
26