
Stochastic subgradient methods
Based on material by Mark Schmidt

Julieta Martinez

University of British Columbia

October 06, 2015


Introduction

- We are interested in a typical machine learning problem

    min_{x ∈ ℝ^D}  (1/N) ∑_{i=1}^N L(x, a_i, b_i) + λ · r(x)

  where the sum is the data-fitting term and λ · r(x) is the regularizer

- Last time, we talked about gradient methods, which work when D is large

- Today we will talk about stochastic subgradient methods, which work when N is large
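To make the setup concrete, here is a minimal Python sketch of such an objective, assuming a logistic loss for L and r(x) = ½‖x‖²; neither choice is fixed by the slides, they are illustrative assumptions:

```python
import numpy as np

def objective(x, A, b, lam):
    """f(x) = (1/N) * sum_i L(x, a_i, b_i) + lam * r(x).

    Assumed instance: logistic loss L(x, a, b) = log(1 + exp(-b * a @ x))
    with labels b_i in {-1, +1}, and r(x) = 0.5 * ||x||^2."""
    margins = b * (A @ x)                                # one margin per example
    data_fit = np.mean(np.log1p(np.exp(-margins)))       # (1/N) sum of the losses
    reg = 0.5 * lam * np.dot(x, x)                       # regularizer
    return data_fit + reg
```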


- We want to minimize a function f(x) = (1/N) ∑_{i=1}^N f_i(x)

- A deterministic gradient method computes the gradient exactly:

    x^{t+1} = x^t − α_t ∇f(x^t) = x^t − α_t (1/N) ∑_{i=1}^N ∇f_i(x^t)

  - Computing the exact gradient is O(N)
  - We can get convergence with a constant α_t or using line-search

- A stochastic gradient method [Robbins and Monro, 1951] estimates the gradient from a sample i_t ∼ {1, 2, . . . , N}:

    x^{t+1} = x^t − α_t ∇f_{i_t}(x^t)

  - Note that this gives an unbiased estimate of the gradient: E[∇f_{i_t}(x)] = (1/N) ∑_{i=1}^N ∇f_i(x) = ∇f(x)
  - The iteration cost no longer depends on N
  - Convergence requires α_t → 0
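A minimal sketch of the two updates side by side, assuming the simple least-squares case f_i(x) = ½(a_iᵀx − b_i)²; the slides do not fix a particular f_i:

```python
import numpy as np

def full_gradient_step(x, A, b, alpha):
    """Deterministic step: uses all N examples, O(N) cost per iteration."""
    grad = A.T @ (A @ x - b) / len(b)            # (1/N) sum_i grad f_i(x)
    return x - alpha * grad

def stochastic_gradient_step(x, A, b, alpha, rng):
    """Stochastic step: one sampled example, cost independent of N."""
    i = rng.integers(len(b))                     # i_t ~ {1, ..., N}
    grad_i = (A[i] @ x - b[i]) * A[i]            # grad f_i(x), unbiased estimate of grad f(x)
    return x - alpha * grad_i
```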


Convergence

Stochastic methods are N times faster per iteration, but what about convergence?

    Assumption        Deterministic          Stochastic
    Convex            O(1/t²)                O(1/√t)
    Strongly convex   O((1 − √(µ/L))^t)      O(1/t)

- Stochastic methods have a lower iteration cost, but a lower convergence rate
- Sublinear rate even under strong convexity
- The bounds are unimprovable if only unbiased gradient estimates are available
- Momentum/acceleration does not improve the convergence rate
- For convergence, the momentum term must go to zero [Tseng, 1998]


Figure: Convergence rates in the strongly convex case

- Stochastic methods are better in low-accuracy or limited-time situations
- It can be hard to know when the crossing will happen


- The convergence rates look quite different when the function is non-smooth

- E.g., consider the binary support vector machine

    f(x) = ∑_{i=1}^N max{0, 1 − b_i(xᵀa_i)} + λ‖x‖²

- Rates for subgradient methods on non-smooth objectives:

    Assumption        Deterministic   Stochastic
    Convex            O(1/√t)         O(1/√t)
    Strongly convex   O(1/t)          O(1/t)

- Other black-box methods, such as cutting plane, are not faster
- Take-away point: for non-smooth problems
  - Deterministic methods are not faster than stochastic methods
  - Stochastic methods are a free, N times faster, lunch
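The non-smoothness comes from the hinge term max{0, 1 − b_i(xᵀa_i)}. Below is a minimal sketch of one valid subgradient of that term with respect to x; the value returned exactly at the kink (margin = 1) is an arbitrary but valid choice:

```python
import numpy as np

def hinge_subgradient(x, a_i, b_i):
    """Return one subgradient of max{0, 1 - b_i * (x^T a_i)} with respect to x."""
    margin = b_i * (a_i @ x)
    if margin < 1:
        return -b_i * a_i         # hinge is active: gradient of 1 - b_i * x^T a_i
    return np.zeros_like(x)       # flat region; at the kink (margin == 1), 0 is one valid choice
```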


- For differentiable convex functions, we have

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x),  ∀x, y.

- A vector d is a subgradient of a convex function f at x if

    f(y) ≥ f(x) + dᵀ(y − x),  ∀y.

- At a differentiable x, the only subgradient is ∇f(x)

- At a non-differentiable x, we have a set of subgradients, called the subdifferential ∂f(x)

- Notice that if the zero vector satisfies 0 ∈ ∂f(x), then x is a global minimizer
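As a quick numerical illustration of the defining inequality (not from the slides), the sketch below checks candidate slopes against f(x) = |x| at x = 0 on a grid of test points; 0.5 lies in the subdifferential [−1, 1] while 2 does not:

```python
import numpy as np

def is_subgradient(f, x, d, ys):
    """Check f(y) >= f(x) + d * (y - x) for every test point y (1-d case)."""
    return all(f(y) >= f(x) + d * (y - x) for y in ys)

ys = np.linspace(-3.0, 3.0, 101)
print(is_subgradient(abs, 0.0, 0.5, ys))   # True:  0.5 is in [-1, 1]
print(is_subgradient(abs, 0.0, 2.0, ys))   # False: 2.0 is not a valid subgradient at 0
```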


Example

A vector d is a subgradient of a convex function f at x if

    f(y) ≥ f(x) + dᵀ(y − x),  ∀y.

[Figure: a convex f(x) with the linear lower bound f(x) + ∇f(x)ᵀ(y − x) at a differentiable point, and the set of supporting lines ∂f(x) at a kink]


Another example

Consider the absolute value function |x|. Its subdifferential is

    ∂|x| =  1          if x > 0
            −1         if x < 0
            [−1, 1]    if x = 0

[Figure: f(x) = |x| and its subdifferential ∂f(x)]
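A one-line way to pick an element of this subdifferential in code (a sketch, not from the slides; returning 0 at the kink is one valid choice from [−1, 1]):

```python
def abs_subgradient(x):
    """Return one element of the subdifferential of |x| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0   # any value in [-1, 1] would also be valid at x = 0
```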


Subdifferential of max function

- |x| is a special case of the max function
- Given two convex functions f1(x) and f2(x), the subdifferential of max(f1(x), f2(x)) is given by

    ∂ max(f1(x), f2(x)) =  ∇f1(x)                                     if f1(x) > f2(x)
                           ∇f2(x)                                     if f1(x) < f2(x)
                           θ∇f1(x) + (1 − θ)∇f2(x),  θ ∈ [0, 1]       if f1(x) = f2(x)

- I.e., any convex combination of the gradients of the argmax
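A small sketch of this rule, given callables for the two functions and their gradients; all names and the tie-breaking choice θ = ½ are assumptions for illustration:

```python
def max_subgradient(x, f1, f2, grad_f1, grad_f2, theta=0.5):
    """Return one subgradient of max(f1, f2) at x."""
    v1, v2 = f1(x), f2(x)
    if v1 > v2:
        return grad_f1(x)
    if v2 > v1:
        return grad_f2(x)
    # tie: any convex combination of the two gradients is a valid subgradient
    return theta * grad_f1(x) + (1.0 - theta) * grad_f2(x)
```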


The subgradient method

- The basic subgradient method:

    x^{t+1} = x^t − α_t d_t,   for some d_t ∈ ∂f(x^t)

- The steepest-descent choice is d_t = argmin_{d ∈ ∂f(x^t)} ‖d‖
  - Easy to see in the 1-d case
  - Easy to find for ℓ1-regularization, but hard in general

[Figure: a 1-d example with ∂f(x) = [0.4, 1.2]]

- If d_t ≠ argmin_{d ∈ ∂f(x^t)} ‖d‖, the objective may increase
- But ‖x^{t+1} − x*‖ ≤ ‖x^t − x*‖ for a small enough α_t
- Again, for convergence, we require α_t → 0

- The basic stochastic subgradient method (a sketch follows below):

    x^{t+1} = x^t − α_t d_{i_t},   for some d_{i_t} ∈ ∂f_{i_t}(x^t),  i_t ∼ {1, 2, . . . , N}
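Putting the pieces together, here is a minimal sketch of the stochastic subgradient method for the SVM objective shown earlier; the 1/N scaling, the λ/2 factor on the regularizer, and the α_t = α₀/√t schedule are assumptions made for this illustration:

```python
import numpy as np

def stochastic_subgradient_svm(A, b, lam, alpha0=1.0, iters=1000, seed=0):
    """Minimize (1/N) * sum_i max{0, 1 - b_i * (x^T a_i)} + (lam / 2) * ||x||^2
    with the stochastic subgradient method (a sketch; constants are assumptions)."""
    rng = np.random.default_rng(seed)
    N, D = A.shape
    x = np.zeros(D)
    for t in range(1, iters + 1):
        i = rng.integers(N)                                 # i_t ~ {1, ..., N}
        margin = b[i] * (A[i] @ x)
        d = -b[i] * A[i] if margin < 1 else np.zeros(D)     # subgradient of the hinge term
        d = d + lam * x                                     # gradient of the regularizer
        x = x - (alpha0 / np.sqrt(t)) * d                   # step size alpha_t -> 0
    return x
```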


The stochastic subgradient method in practice

- Theory says we should do

    i_t ∼ {1, 2, . . . , N},   α_t = 1/(µt),   x^{t+1} = x^t − α_t ∇f_{i_t}(x^t).

  - O(1/t) for smooth objectives
  - O(log(t)/t) for non-smooth objectives

- Do not do this! Why?
  - Initial steps will be huge (µ is typically 1/N or 1/√N)
  - Later steps are tiny (1/t gets small very quickly)
  - The convergence rate is not robust to mis-specification of µ
  - Non-adaptive (very worst-case behaviour)

- What people do in practice (a sketch of this recipe follows below):
  - Use smaller initial steps, then go to zero more slowly
  - Take a weighted average of the iterates or gradients:

    x̄_t = ∑_{i=1}^t w_i x^i,   d̄_t = ∑_{i=1}^t δ_i d^i.
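A minimal sketch of that recipe: a slower step-size decay α_t = α₀/t^β with β ∈ (0.5, 1), plus a running average of the iterates. The uniform weights w_i = 1/t, the constants, and the subgrad(x, rng) callback signature are all assumptions:

```python
import numpy as np

def averaged_stochastic_subgradient(subgrad, x0, iters=1000, alpha0=0.1, beta=0.75, seed=0):
    """Run x^{t+1} = x^t - alpha_t * d_t with alpha_t = alpha0 / t**beta and
    return the uniform average of the iterates (one common choice of weights)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    x_bar = np.zeros_like(x)
    for t in range(1, iters + 1):
        d = subgrad(x, rng)                  # caller returns some d_t in the subdifferential of f_{i_t}
        x = x - (alpha0 / t**beta) * d
        x_bar += (x - x_bar) / t             # running average: x_bar_t = (1/t) * sum_{i<=t} x^i
    return x_bar
```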


There is work that supports using larger step sizes and averaging:

I [Moulines and Bach, 2011], [Lacoste-Julien et al., 2012]
I Averaging the later iterates achieves O(1/t) in the non-smooth case
I Averaging weighted by the iteration number achieves the same rate

I [Nesterov, 2009], [Xiao, 2009]
I Gradient averaging (‘dual averaging’) improves the constants
I It finds the non-zero variables with sparse regularizers

I [Moulines and Bach, 2011]
I α_t = O(1/t^β) for β ∈ (0.5, 1) is more robust than α_t = O(1/t)

I [Nedić and Bertsekas, 2001]
I A constant step size (α_t = α) achieves the rate

E[f(x^t)] − f(x^*) ≤ (1 − 2µα)^t (f(x^0) − f(x^*)) + O(α)

I [Polyak and Juditsky, 1992]
I In the smooth case, iterate averaging is asymptotically optimal
I It achieves the same rate as an optimal stochastic Newton method
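The constant step-size bound can be illustrated numerically; the sketch below (an illustrative toy experiment, not from the cited papers) runs constant-step SGD on a small random least-squares problem and reports the distance to the minimizer at which each step size stalls, which shrinks roughly in proportion to α:

import numpy as np

def sgd_constant_step(alpha, iters=20000, N=100, D=5, seed=1):
    """Constant-step SGD on f(x) = (1/N) * sum_i (1/2) * (A[i] @ x - b[i])**2."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((N, D))
    b = rng.standard_normal(N)
    x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer, used only to measure error
    x = np.zeros(D)
    for _ in range(iters):
        i = rng.integers(N)
        g = (A[i].dot(x) - b[i]) * A[i]             # gradient of the i-th term
        x -= alpha * g
    return np.linalg.norm(x - x_star)

for alpha in (1e-1, 1e-2, 1e-3):
    print(alpha, sgd_constant_step(alpha))          # smaller alpha -> lower stall level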

I What about accelerated / Newton-like stochastic methods?
I Stochasticity in these methods does not improve the convergence rate
I But, it has been shown that:

I [Ghadimi and Lan, 2010]
I Acceleration can improve the dependence on L and µ
I It improves performance at the start if the noise is small

I Newton-like AdaGrad method [Duchi et al., 2011]:

x^{t+1} = x^t − α D^{-1} ∇f_{i_t}(x^t),  with  D_jj = √( ∑_{k=1}^{t} ‖∇_j f_{i_k}(x^k)‖² )

I It improves regret bounds, but not the optimization error
I A Newton-like method [Bach and Moulines, 2013] achieves O(1/t) without strong convexity (but with an extra self-concordance assumption)
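A minimal sketch of the diagonal AdaGrad update written above (illustrative code; the eps smoothing constant and the grad_i interface are assumptions, not part of Duchi et al.'s presentation):

import numpy as np

def adagrad(grad_i, x0, N, alpha=0.1, iters=1000, eps=1e-8, seed=0):
    """Diagonal AdaGrad: coordinate-wise steps alpha / sqrt(sum of past squared gradients).

    grad_i(x, i) should return a (sub)gradient of the i-th term f_i at x.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    G = np.zeros_like(x)                        # G[j] accumulates sum_k (grad_j f_{i_k})^2
    for _ in range(iters):
        i = rng.integers(N)
        g = grad_i(x, i)
        G += g ** 2
        x = x - alpha * g / (np.sqrt(G) + eps)  # divide by D_jj = sqrt(G[j])
    return x

For example, for a least-squares fit one could pass grad_i = lambda x, i: (A[i].dot(x) - b[i]) * A[i].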

Recap

I We want to solve problems with BIG data X ∈ R^{D×N}

I When D is large, we use gradient methods
I When N is large, we use stochastic gradient methods

I If the function is non-smooth, the stochastic subgradient method still has great convergence rates

I Stochastic methods:
I Have iterations that are N times cheaper than those of deterministic methods
I Make a lot of progress quickly, then stall

I In practice:
I Choose smaller step sizes at the beginning
I Averaging the iterates / gradients helps
I Going through a random permutation of the data (no longer an unbiased gradient) works well too; a sketch follows below

I Next week Mohammed will talk about finite-sum methods
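A minimal sketch of the random-permutation variant mentioned in the recap (illustrative code under an assumed least-squares loss; each epoch visits every example exactly once in a freshly shuffled order, so the per-step direction is no longer an unbiased estimate of the full gradient):

import numpy as np

def shuffled_sgd(A, b, alpha0=0.1, epochs=20, seed=0):
    """SGD over random permutations of the data (one pass per epoch)."""
    rng = np.random.default_rng(seed)
    N, D = A.shape
    x = np.zeros(D)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(N):            # visit each example exactly once per epoch
            t += 1
            g = (A[i].dot(x) - b[i]) * A[i]     # gradient of the i-th least-squares term
            x -= (alpha0 / np.sqrt(t)) * g
    return x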

References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Ghadimi, S. and Lan, G. (2010). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optimization Online, July.

Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002.

Moulines, E. and Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459.

Nedić, A. and Bertsekas, D. P. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138.

Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Tseng, P. (1998). An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531.

Xiao, L. (2009). Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, pages 2116–2124.
