TRANSCRIPT
-
A robust accelerated optimization algorithm for
strongly convex functions
Saman Cyrus Bin Hu Bryan Van Scoy Laurent Lessard
University of Wisconsin–Madison
June 27, 2018
-
Iterative algorithms
Unconstrained optimization with f strongly convex:

  min_{x ∈ R^n} f(x)

where mI ⪯ ∇²f(x) ⪯ LI, and define κ = L/m (condition number).

• Gradient method:
  x_{k+1} = x_k − α∇f(x_k)
• Nesterov's accelerated gradient method (AGM):
  y_k = x_k + β(x_k − x_{k−1})
  x_{k+1} = y_k − α∇f(y_k)
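As a concrete sketch (not the speakers' code), both updates can be run on a toy quadratic with m = 1, L = 10; the AGM momentum β = (√κ − 1)/(√κ + 1) is the standard strongly-convex tuning, an assumption not stated on this slide.

```python
import numpy as np

# Toy strongly convex quadratic f(x) = 0.5 x^T A x with m = 1, L = 10.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
m, L = 1.0, 10.0
kappa = L / m

def gradient_method(x0, alpha, iters=200):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)        # x_{k+1} = x_k - alpha * grad f(x_k)
    return x

def agm(x0, alpha, beta, iters=200):
    x_prev = x = np.array(x0, dtype=float)
    for _ in range(iters):
        y = x + beta * (x - x_prev)    # momentum/extrapolation step
        x_prev, x = x, y - alpha * grad(y)
    return x

x_gm = gradient_method([1.0, 1.0], alpha=1.0 / L)
x_agm = agm([1.0, 1.0], alpha=1.0 / L,
            beta=(np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))
```

Both runs drive the iterate to the minimizer x⋆ = 0; AGM does so at the faster accelerated rate.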
-
Iteration complexity
[Figure: iterations to convergence vs. condition number κ, log–log axes. Curves: Gradient Method (α = 1/L), Gradient Method (α = 2/(m+L)), Nesterov's AGM, Triple Momentum, Theoretical Lower Bound; reference slopes κ log(1/ε) and √κ log(1/ε).]
-
Noise robustness
• ∇f is inexact/approximate.
• Many different noise models have been studied.
• We use relative deterministic noise:
  ‖∇f_noisy − ∇f_exact‖ / ‖∇f_exact‖ ≤ δ
• Ex: round-off error.
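A minimal sketch of this noise model (illustrative; the model only bounds the perturbation's magnitude, so drawing its direction at random below is one arbitrary choice, not part of the definition):

```python
import numpy as np

def noisy_grad(grad, x, delta, rng):
    """Return a gradient satisfying ||g_noisy - g|| <= delta * ||g||.

    The relative deterministic noise model constrains only the size of
    the perturbation; here its direction is sampled for illustration.
    """
    g = grad(x)
    d = rng.standard_normal(np.shape(g))
    d = d / np.linalg.norm(d)                  # unit direction
    return g + delta * np.linalg.norm(g) * d

rng = np.random.default_rng(0)
g_exact = np.array([3.0, 4.0])                 # example gradient, ||g|| = 5
g_noisy = noisy_grad(lambda x: g_exact, None, 0.1, rng)
```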
-
Performance in the presence of noise
Gradient method: slow and robust. Nesterov's AGM: fast and fragile.

[Figure: iterations to convergence vs. condition number κ. Left: gradient method for δ ∈ {.01, .02, .05, .1, .2, .5}. Right: Nesterov's AGM (quadratic) for δ ∈ {.05, .1, .2, .3, .4, .5}.]

Simulated by solving a small SDP (Lessard et al., SIAM Journal on Optimization, 2016).
-
Proposed algorithm: Robust momentum method
Iteration update:

  x_{k+1} = x_k + β(x_k − x_{k−1}) − α∇f(y_k)
  y_k = x_k + γ(x_k − x_{k−1})

with parameters:

  α = κ(1−ρ)²(1+ρ)/L,  β = κρ³/(κ−1),  γ = ρ³/((κ−1)(1−ρ)²(1+ρ))

and a single tuning parameter ρ:

  1 − 1/√κ  (fast + fragile)  ≤  ρ  ≤  1 − 1/κ  (slow + robust)
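A direct transcription of the update and parameter formulas above (a sketch, exercised on a toy quadratic with κ = 10 and ρ = 0.8, which lies inside the allowed range 1 − 1/√10 ≈ 0.684 to 1 − 1/10 = 0.9):

```python
import numpy as np

def rmm(grad, x0, L, m, rho, iters=300):
    """Robust momentum method with the parameter formulas from this slide."""
    kappa = L / m
    alpha = kappa * (1 - rho) ** 2 * (1 + rho) / L
    beta = kappa * rho ** 3 / (kappa - 1)
    gamma = rho ** 3 / ((kappa - 1) * (1 - rho) ** 2 * (1 + rho))
    x_prev = x = np.array(x0, dtype=float)
    for _ in range(iters):
        y = x + gamma * (x - x_prev)
        x_prev, x = x, x + beta * (x - x_prev) - alpha * grad(y)
    return x

# Toy quadratic f(x) = 0.5 x^T A x with m = 1, L = 10.
A = np.diag([1.0, 10.0])
x_rmm = rmm(lambda x: A @ x, [1.0, 1.0], L=10.0, m=1.0, rho=0.8)
```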
-
Control interpretation
Nesterov's AGM:

  y_k = x_k + β(x_k − x_{k−1})
  x_{k+1} = y_k − α∇f(y_k)

Feedback interconnection of a linear system G with the gradient ∇f (input u_k = ∇f(y_k)):

  G:  [x_{k+1}; x_k] = [1+β  −β; 1  0] [x_k; x_{k−1}] + [−α; 0] u_k
      y_k = [1+β  −β] [x_k; x_{k−1}]
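The equivalence can be checked numerically: iterating the state-space form with the feedback u_k = ∇f(y_k) reproduces the direct AGM recursion. A sketch with a scalar f(x) = x²/2 and illustrative α, β:

```python
import numpy as np

grad = lambda x: x                     # f(x) = x^2 / 2, so grad f(x) = x
alpha, beta = 0.1, 0.5                 # illustrative values

# Direct AGM recursion.
x_prev = x = 1.0
xs = []
for _ in range(20):
    y = x + beta * (x - x_prev)
    x_prev, x = x, y - alpha * grad(y)
    xs.append(x)

# Feedback interconnection: state xi_k = [x_k, x_{k-1}].
A = np.array([[1 + beta, -beta], [1.0, 0.0]])
B = np.array([-alpha, 0.0])
C = np.array([1 + beta, -beta])
xi = np.array([1.0, 1.0])              # [x_0, x_{-1}]
xs_ss = []
for _ in range(20):
    u = grad(C @ xi)                   # u_k = grad f(y_k), with y_k = C xi_k
    xi = A @ xi + B * u
    xs_ss.append(xi[0])
```

The two trajectories coincide, confirming the realization.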
-
Frequency domain condition
Transfer function for the linear part:

  G(z) = −α((1+γ)z − γ) / ((z−1)(z−β))

Sufficient condition for linear convergence, x_k − x⋆ = O(ρᵏ):

  Re[(ρz⁻¹ − 1)(1 − LG(ρz))/(1 − mG(ρz))] < 0  for all |z| = 1
• combination of Circle Criterion and Zames-Falb multipliers
-
Gradient method
[Figure: frequency-domain locus in the complex plane (real vs. imaginary part) for the gradient method with α = 1/L and α = 2/(m+L), at ρ ∈ {1, 0.98, 0.95, Opt.}.]
-
Idea: add a margin to improve robustness
  Re[(ρz⁻¹ − 1)(1 − LG(ρz))/(1 − mG(ρz))] + ν ≤ 0  for all |z| = 1

• ν > 0 is a robustness margin
• tune α, β, γ to obtain a vertical line at −ν
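The vertical-line claim can be sanity-checked numerically (my own sketch, not from the talk, with the illustrative choices κ = 10 and ρ = 0.8 and the robust momentum parameter formulas): sampling |z| = 1 away from the pole at z = 1, the real part of the frequency-domain quantity comes out constant and negative.

```python
import numpy as np

L, m = 10.0, 1.0
kappa, rho = L / m, 0.8           # rho strictly inside the allowed range
alpha = kappa * (1 - rho) ** 2 * (1 + rho) / L
beta = kappa * rho ** 3 / (kappa - 1)
gamma = rho ** 3 / ((kappa - 1) * (1 - rho) ** 2 * (1 + rho))

# Transfer function of the linear part.
G = lambda z: -alpha * ((1 + gamma) * z - gamma) / ((z - 1) * (z - beta))

theta = np.linspace(0.01, 2 * np.pi - 0.01, 2000)   # avoid the pole at z = 1
z = np.exp(1j * theta)
T = (rho / z - 1) * (1 - L * G(rho * z)) / (1 - m * G(rho * z))
re = T.real                        # ~ constant: the vertical line at -nu
```

The spread of `re` across the circle is at floating-point level, i.e. the locus really is a vertical line, and its position gives the margin ν.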
-
Robust Momentum Method
[Figure: frequency-domain locus in the complex plane for the Robust Momentum Method at ρ ∈ {0.90, 0.85, 0.80, Opt.}; each locus is a vertical line.]
-
Nesterov’s AGM — not a line!
[Figure: frequency-domain locus in the complex plane for Nesterov's AGM at ρ ∈ {0.80, 0.77, 0.76, Opt.}; the loci are not vertical lines.]
-
Trade-off: Performance vs noise
[Figure: convergence rate ρ vs. noise strength δ ∈ [0, 1]. Curves: GM (α = 1/L), GM (α = 2/(L+m)), GM (min over α ∈ [0, 2/L]), Nesterov's AGM, RMM (ν = 0), RMM (ν = 1/3), RMM (ν = 2/3), RMM (min over ν ∈ [0, 1)). Marked rates: 1 − 1/√κ, (κ−1)/(κ+1), 1 − 1/κ, and 1.]
-
Simulation with relative gradient noise
f(x) = x₁² + 10x₂²

[Figure: error vs. iteration (0–100) for δ = 0, δ = 0.25, and δ = 0.5, comparing RMM (ν = 0), RMM (ν = 0.55), and AGM.]
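The experiment can be reproduced qualitatively (a sketch, not the authors' script; the noise direction is drawn at random here, since the relative noise model only bounds its magnitude, and ρ = 1 − 1/κ is the slow-but-robust tuning). For this quadratic, m = 2 and L = 20, so κ = 10; even with δ = 0.25 relative noise, the robust tuning still drives the error toward zero in this run.

```python
import numpy as np

# f(x) = x1^2 + 10 x2^2, so m = 2, L = 20, kappa = 10.
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
L, m = 20.0, 2.0
kappa = L / m

def rmm_noisy(x0, rho, delta, iters=800, seed=0):
    rng = np.random.default_rng(seed)
    alpha = kappa * (1 - rho) ** 2 * (1 + rho) / L
    beta = kappa * rho ** 3 / (kappa - 1)
    gamma = rho ** 3 / ((kappa - 1) * (1 - rho) ** 2 * (1 + rho))
    x_prev = x = np.array(x0, dtype=float)
    for _ in range(iters):
        y = x + gamma * (x - x_prev)
        g = grad(y)
        d = rng.standard_normal(2)
        g = g + delta * np.linalg.norm(g) * d / np.linalg.norm(d)  # relative noise
        x_prev, x = x, x + beta * (x - x_prev) - alpha * g
    return x

x_robust = rmm_noisy([1.0, 1.0], rho=1 - 1 / kappa, delta=0.25)
```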
-
Thank you