ICML2012読み会: Scaling Up Coordinate Descent Algorithms for
Large L1 regularization Problems
2012-07-28
Yoshihiko Suhara
@sleepy_yoshi
Paper
• Scaling Up Coordinate Descent Algorithms for Large L1 regularization Problems
– by C. Scherrer, M. Halappanavar, A. Tewari, D. Haglin
• Parallel computation of coordinate descent
– e.g., [Bradley+ 11] Parallel Coordinate Descent for L1-Regularized Loss Minimization (ICML 2011)
2
Overview
• Introduces a generic framework for parallel coordinate descent in shared-memory multicore environments
• Proposes the following two methods
– Thread-Greedy Coordinate Descent
– Coloring-Based Coordinate Descent
• Compares four parallel CD methods experimentally
– Thread-Greedy performed unexpectedly well
3
Optimizing an L1-Regularized Loss Function
• L1-regularized loss function
min_𝒘 (1/𝑛) ∑_{𝑖=1}^{𝑛} ℓ(𝒚_𝑖, (𝑿𝒘)_𝑖) + 𝜆‖𝒘‖₁
• where
– 𝑿 ∈ ℝ^{𝑛×𝑘}: design matrix
– 𝒘 ∈ ℝ^𝑘: weight vector
– ℓ(𝑦, ⋅): differentiable convex function
• Examples
– Lasso (L1 + squared error)
– L1-regularized logistic regression
4
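To make the objective concrete, here is a minimal sketch (not from the paper) that evaluates this loss for the Lasso case, assuming ℓ(𝑦, 𝑧) = ½(𝑦 − 𝑧)²:

```python
import numpy as np

def l1_objective(X, y, w, lam):
    """L1-regularized empirical loss: (1/n) * sum_i l(y_i, (Xw)_i) + lam * ||w||_1.
    The squared loss l(y, z) = 0.5 * (y - z)**2 is assumed (the Lasso case)."""
    z = X @ w                                    # (Xw)_i for all i
    n = X.shape[0]
    return 0.5 * np.sum((y - z) ** 2) / n + lam * np.sum(np.abs(w))
```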
Notation
𝑿 = (𝑿₁, 𝑿₂, …, 𝑿_𝑗, …, 𝑿_𝑘)   (𝑿_𝑗: 𝑗-th column)
𝒆_𝑗 = (0, 0, …, 1, …, 0)ᵀ   (𝑗-th standard basis vector)
𝑿 = (𝒙₁ᵀ; 𝒙₂ᵀ; …; 𝒙_𝑖ᵀ; …; 𝒙_𝑛ᵀ)   (𝒙_𝑖ᵀ: 𝑖-th row)
5
Supplement: Coordinate Descent
• Also known as 座標降下法 in Japanese (?)
• Performs a line search along the selected coordinate
• Various ways to choose the coordinate
– e.g., cyclic coordinate descent
• For parallel computation, a subset of all coordinates is selected and updated
6
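As an illustration of the serial method (not the authors' implementation), a minimal cyclic coordinate descent for the Lasso case, assuming ℓ(𝑦, 𝑧) = ½(𝑦 − 𝑧)² so the per-coordinate line search has a closed form:

```python
import numpy as np

def soft_threshold(x, t):
    # shrink x toward zero by t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def cyclic_cd_lasso(X, y, lam, n_passes=100):
    """Cyclic CD for min_w (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1.
    Each pass performs an exact 1-D minimization along every coordinate in turn."""
    n, k = X.shape
    w = np.zeros(k)
    r = y.copy()                          # residual y - Xw, kept up to date
    col_sq = (X ** 2).sum(axis=0)         # ||X_j||^2 for each column
    for _ in range(n_passes):
        for j in range(k):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r + col_sq[j] * w[j]     # gradient info at w_j = 0
            w_new = soft_threshold(rho / n, lam) / (col_sq[j] / n)
            r += X[:, j] * (w[j] - w_new)            # incremental residual update
            w[j] = w_new
    return w
```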
7
GenCD: A Generic Framework for Parallel Coordinate Descent
(For some reason, the slides are in English from this point.)
Generic Coordinate Descent (GenCD)
8
Step 1: Select
• Selects a set 𝐽 of coordinates
• The selection criterion differs across CD variants
– cyclic CD (CCD)
– stochastic CD (SCD)
• selects a singleton
– fully greedy CD
• 𝐽 = {1, …, 𝑘}
– Shotgun [Bradley+ 11]
• selects a random subset of a given size
9
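The four selection rules above can be sketched in a few lines. This is a hypothetical helper (the function and strategy names are mine, not the authors'):

```python
import random

def select(strategy, k, step=0, subset_size=1, rng=random):
    """Select step of GenCD: return the index set J for one iteration."""
    if strategy == "cyclic":          # CCD: one coordinate, in order
        return [step % k]
    if strategy == "stochastic":      # SCD: a single random coordinate
        return [rng.randrange(k)]
    if strategy == "greedy":          # fully greedy: all coordinates
        return list(range(k))
    if strategy == "shotgun":         # random subset of a given size
        return rng.sample(range(k), subset_size)
    raise ValueError(strategy)
```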
Step 2: Propose
• The Propose step computes a proposed increment 𝛿_𝑗 for each 𝑗 ∈ 𝐽
– this step does not actually change the weights
• In Step 2, we maintain a vector 𝝋 ∈ ℝ^𝑘, where 𝝋_𝑗 is a proxy for the objective function evaluated at 𝒘 + 𝛿_𝑗𝒆_𝑗
– 𝝋_𝑗 is updated whenever a new proposal is calculated for 𝑗
– 𝝋 is not necessary if the algorithm accepts all proposals
10
Step 3: Accept
• In the Accept step, the algorithm accepts a subset 𝐽′ ⊆ 𝐽
– [Bradley+ 11] show that correlations among features can lead to divergence if too many coordinates are updated at once (see the figure below)
• In CCD, SCD, and Shotgun, all proposals are accepted
– No need to calculate 𝝋
11
Step 4: Update
• In the Update step, the algorithm updates the weights according to the accepted set 𝐽′
12
(The vector 𝑿𝒘 is maintained.)
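A minimal sketch of the Update step, assuming the vector 𝒛 = 𝑿𝒘 is kept in sync incrementally as the slide notes; the `accepted` mapping (coordinate → increment) is a hypothetical layout:

```python
import numpy as np

def update(w, z, X, accepted):
    """Update step: apply the accepted increments and keep z = Xw in sync.
    `accepted` maps coordinate index j to its accepted increment delta_j."""
    for j, delta in accepted.items():
        w[j] += delta
        z += delta * X[:, j]    # O(nnz(X_j)) instead of recomputing Xw
    return w, z
```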
Approximate Minimization (1/2)
• The Propose step calculates a proposed increment 𝛿_𝑗 for each 𝑗 ∈ 𝐽
𝛿_𝑗 = argmin_𝛿 𝐹(𝒘 + 𝛿𝒆_𝑗) + 𝜆|𝒘_𝑗 + 𝛿|
where 𝐹(𝒘) = (1/𝑛) ∑_{𝑖=1}^{𝑛} ℓ(𝒚_𝑖, (𝑿𝒘)_𝑖)
• For a general loss function, there is no closed-form solution along a given coordinate
– Thus, consider approximate minimization
13
Approximate Minimization (2/2)
• Well-known minimizer (e.g., [Yuan and Lin 10])
𝛿 = −𝜓(𝒘_𝑗; (𝛻_𝑗𝐹(𝒘) − 𝜆)/𝛽, (𝛻_𝑗𝐹(𝒘) + 𝜆)/𝛽)
where 𝜓(𝑥; 𝑎, 𝑏) = 𝑎 if 𝑥 < 𝑎; 𝑏 if 𝑥 > 𝑏; 𝑥 otherwise
• For squared loss 𝛽 = 1; for logistic loss 𝛽 = 1/4
14
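This minimizer translates directly into code. A sketch (`propose_delta` is my name, not the paper's):

```python
def psi(x, a, b):
    # clip x to the interval [a, b]
    return a if x < a else (b if x > b else x)

def propose_delta(w_j, grad_j, lam, beta):
    """Approximate 1-D minimizer from the slide (cf. [Yuan and Lin 10]):
    delta = -psi(w_j; (grad_j - lam)/beta, (grad_j + lam)/beta)."""
    return -psi(w_j, (grad_j - lam) / beta, (grad_j + lam) / beta)
```

Note that when |𝛻_𝑗𝐹(𝒘)| ≤ 𝜆 and 𝒘_𝑗 = 0, the interval contains 0 and the proposal is 𝛿 = 0, matching the L1 subgradient optimality condition.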
Step 2: Propose (Approximated)
15
(Figure: pseudocode for the approximated Propose step; the annotations show ℓ′(𝒚_𝑖, 𝒛_𝑖), 𝑿_𝑗, and 1/𝑛, i.e., the coordinate gradient computed from the maintained 𝒛 = 𝑿𝒘, and each proposal records the decrease in the approximated objective.)
Experiments
16
Algorithms (conventional)
• SHOTGUN [Bradley+ 11]
– Select step: a random subset of the columns
– Accept step: accepts every proposal
• No need to compute a proxy for the objective
– Convergence is guaranteed only if the number of coordinates selected is at most 𝑃∗ = 𝑘/(2𝜌) (*1)
• GREEDY
– Select step: all coordinates
– Propose step: each thread generates proposals for some subset of the coordinates using the approximation
– Accept step: accepts only the single best proposal among all threads
17 (*1) 𝜌 is the largest eigenvalue of 𝑿ᵀ𝑿
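The bound 𝑃∗ = 𝑘/(2𝜌) can be computed directly. An illustrative sketch (for a large sparse 𝑿 one would estimate 𝜌 iteratively rather than form 𝑿ᵀ𝑿 densely):

```python
import numpy as np

def shotgun_max_parallelism(X):
    """P* = k / (2 * rho), where rho is the largest eigenvalue of X^T X,
    per the SHOTGUN convergence bound cited on the slide."""
    k = X.shape[1]
    rho = np.linalg.eigvalsh(X.T @ X).max()   # eigvalsh: X^T X is symmetric
    return k / (2.0 * rho)
```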
Comparisons of the Algorithms
18
Algorithms (proposed)
• THREAD-GREEDY
– Select step: a random set of coordinates (?)
– Propose step: each thread generates proposals for some subset of the coordinates using the approximation
– Accept step: each thread accepts the best of its proposals
– No proof of convergence (however, empirical results are encouraging)
• COLORING
– Preprocessing: structurally independent features are identified via partial distance-2 coloring
– Select step: a random color is selected
– Accept step: accepts every proposal
• since the features are disjoint
19
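The coloring idea above can be sketched serially: two columns conflict if they share a nonzero row (a partial distance-2 coloring of the bipartite row/column graph), so all columns of one color can be updated in parallel without read/write conflicts. This greedy version is illustrative only, not the authors' parallel algorithm:

```python
import numpy as np

def color_features(X):
    """Greedy partial distance-2 coloring of the columns of X: columns that
    share a nonzero row receive different colors."""
    n, k = X.shape
    colors = [-1] * k
    for j in range(k):
        rows = np.nonzero(X[:, j])[0]
        # colors already used by earlier columns sharing a nonzero row with j
        forbidden = {colors[i] for i in range(j) if np.any(X[rows, i])}
        c = 0
        while c in forbidden:
            c += 1
        colors[j] = c
    return colors
```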
Implementation and Platform
• Implementation
– gcc with OpenMP
• -O3 -fopenmp flags
• parallel for pragma
• static scheduling: given 𝑛 iterations and 𝑝 threads, each thread gets 𝑛/𝑝 iterations
• Platform
– AMD Opteron (Magny-Cours)
• 48 cores (12 cores × 4 sockets)
– 256 GB memory
20
Datasets
21
(Table: dataset statistics, including the number of nonzeros)
Convergence rates
22
(Presenter's note: I don't know why.)
Scalability
23
Summary
• Presented GenCD, a generic framework for expressing parallel coordinate descent
– Select, Propose, Accept, Update
• Performed convergence and scalability tests for the four algorithms
– but the authors do not favor any of these algorithms over the others
• The condition for convergence of the THREAD-GREEDY algorithm is an open question
24
References
• [Yuan and Lin 10] G. Yuan, C. Lin, "A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification", Journal of Machine Learning Research, vol. 11, pp. 3183-3234, 2010.
• [Bradley+ 11] J. K. Bradley, A. Kyrola, D. Bickson, C. Guestrin, "Parallel Coordinate Descent for L1-Regularized Loss Minimization", In Proc. ICML '11, 2011.
25
The End
26