
Lecture 10: Subgradient Methods

DAMC, April 29 - May 6, 2020
(math.xmu.edu.cn/group/nona/damc/Lecture10.pdf)

1 Convex optimization problem and $\varepsilon$-optimal solution

Consider the problem

$$\min_x \; f(x) \quad \text{subject to } x \in C,$$

where $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is convex, $f(x) = +\infty$ for $x \notin \operatorname{dom} f$, and $C$ is a closed convex set.

An $\varepsilon$-optimal solution $\hat{x}$ satisfies

$$f(\hat{x}) - f(x_\star) \le \varepsilon \quad \text{and} \quad \operatorname{dist}(\hat{x}, C) \le \varepsilon,$$

where $x_\star \in C$ is an optimal solution.

Newton's method and interior-point methods can be prohibitively expensive here because of their per-iteration complexity. The subgradient method is often preferable: its per-iteration cost is low, and it reaches low-accuracy solutions very quickly.

1.1 Gradient descent and Newton's method

Linear approximation: $f(x) \approx f(x^k) + \langle \nabla f(x^k), x - x^k \rangle$.

Quadratic approximations:

$$f(x) \approx f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \frac{L}{2}\|x - x^k\|_2^2,$$

$$f(x) \approx f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \frac{1}{2}(x - x^k)^T \nabla^2 f(x^k)(x - x^k).$$

2 Subgradient methods for $C = \mathbb{R}^n$

For $k = 1, 2, \ldots$, choose any $g^k \in \partial f(x^k)$ and set

$$x^{k+1} = x^k - \alpha_k g^k = \operatorname*{argmin}_{x \in \mathbb{R}^n} \left\{ f(x^k) + \langle g^k, x - x^k \rangle + \frac{1}{2\alpha_k}\|x - x^k\|_2^2 \right\}.$$

The subgradient method is not in general a descent method; that is, $f(x^k)$ may not decrease. For example, take $f(x) = \|x\|_1$ and let $x^1 = e_1$. We have

$$\partial f(x^1) = \left\{ e_1 + \sum_{i=2}^n t_i e_i : t_i \in [-1, 1] \right\}.$$

Any $g^1 \in \partial f(x^1)$ with $\sum_{i=2}^n |t_i| > 1$ is an ascent direction, i.e.,

$$f(x^1 - \alpha g^1) > f(x^1) \quad \forall \alpha > 0.$$
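To make the iteration concrete, here is a minimal Python sketch (not from the lecture); the stepsize rule $\alpha_k = 1/\sqrt{k}$ and the choice $\operatorname{sign}(x) \in \partial\|x\|_1$ are illustrative assumptions.

```python
import numpy as np

def subgradient_method(subgrad, x0, stepsize, num_iters=100):
    """Unconstrained subgradient iteration x^{k+1} = x^k - alpha_k * g^k.

    subgrad(x) returns some element of the subdifferential of f at x;
    stepsize(k) returns alpha_k for iteration k (1-indexed).
    """
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for k in range(1, num_iters + 1):
        g = subgrad(x)
        x = x - stepsize(k) * g
        iterates.append(x.copy())
    return iterates

# f(x) = ||x||_1; sign(x) picks the subgradient with t_i = 0 at zero coordinates.
xs = subgradient_method(subgrad=np.sign,
                        x0=np.ones(5),
                        stepsize=lambda k: 1.0 / np.sqrt(k))
```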

[Figure: contour plot of a convex function.]

- Subdifferential = blue zone + red zone.
- Gray zone = negative of the blue zone: ascent directions.
- Green zone = negative of the red zone: descent directions.

Theorem 1 (Distance to a minimizer is decreasing)

If $0 \notin \partial f(x^k)$, then for any $g^k \in \partial f(x^k)$ and any $x_\star \in \operatorname*{argmin}_x f(x)$, there is a stepsize $\alpha > 0$ such that

$$\|x^k - \alpha g^k - x_\star\|_2 < \|x^k - x_\star\|_2.$$

Proof. Note that

$$\frac{1}{2}\|x^k - \alpha g^k - x_\star\|_2^2 = \frac{1}{2}\|x^k - x_\star\|_2^2 + \alpha \langle g^k, x_\star - x^k \rangle + \frac{\alpha^2}{2}\|g^k\|_2^2,$$

and, by the subgradient inequality,

$$f(x_\star) \ge f(x^k) + \langle g^k, x_\star - x^k \rangle.$$

Combining the two bounds, any $\alpha$ satisfying

$$0 < \alpha < \frac{2\big(f(x^k) - f(x_\star)\big)}{\|g^k\|_2^2}$$

gives the desired strict decrease.

2.1 Assumptions and convergence analysis

- The function $f$ is convex.
- There is at least one (possibly non-unique) minimizing point $x_\star \in \operatorname*{argmin}_x f(x)$ with $f(x_\star) = \inf_x f(x) > -\infty$.
- The subgradients are bounded: for all $x$ and all $g \in \partial f(x)$ we have $\|g\|_2 \le M < \infty$ (independently of $x$).

Theorem 2

Suppose the above assumptions hold and let $\alpha_k \ge 0$ be any sequence of non-negative stepsizes. The subgradient iteration generates a sequence $\{x^k\}$ satisfying, for all $K \ge 1$,

$$\sum_{k=1}^K \alpha_k \big[f(x^k) - f(x_\star)\big] \le \frac{1}{2}\|x^1 - x_\star\|_2^2 + \frac{1}{2}\sum_{k=1}^K \alpha_k^2 M^2.$$

Corollary 3

Let $A_K = \sum_{k=1}^K \alpha_k$ and define

$$\bar{x}^K = \frac{1}{A_K}\sum_{k=1}^K \alpha_k x^k, \qquad x^K_{\mathrm{best}} = \operatorname*{argmin}_{x^k,\; k \le K} f(x^k).$$

Then for all $K \ge 1$,

$$f(\bar{x}^K) - f(x_\star) \le \frac{\|x^1 - x_\star\|_2^2 + \sum_{k=1}^K \alpha_k^2 M^2}{2A_K}$$

and

$$f(x^K_{\mathrm{best}}) - f(x_\star) \le \frac{\|x^1 - x_\star\|_2^2 + \sum_{k=1}^K \alpha_k^2 M^2}{2A_K}.$$

Whenever $\alpha_k \to 0$ and $\sum_{k=1}^\infty \alpha_k = \infty$, we have

$$\frac{\sum_{k=1}^K \alpha_k^2}{\sum_{k=1}^K \alpha_k} \to 0,$$

and so $f(\bar{x}^K) - f(x_\star) \to 0$ as $K \to \infty$.

Taking

$$\alpha_k = \frac{\|x^1 - x_\star\|_2}{M\sqrt{k}}$$

yields

$$f(\bar{x}^K) - f(x_\star) \le \frac{M\|x^1 - x_\star\|_2}{\sqrt{K}}.$$

Example: robust regression in robust statistics.

Suppose we have data vectors $a_i \in \mathbb{R}^n$ and target responses $b_i \in \mathbb{R}$, $i = 1, \ldots, m$, and we would like to predict $b_i$ via the inner product $\langle a_i, x \rangle$ for some vector $x$. If there are outliers or other data corruptions in the targets $b_i$, a natural objective for this task is

$$f(x) = \frac{1}{m}\|Ax - b\|_1 = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x \rangle - b_i|,$$

where $A = \begin{bmatrix} a_1 & \cdots & a_m \end{bmatrix}^T \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. A subgradient is

$$g = \frac{1}{m}A^T \operatorname{sign}(Ax - b) = \frac{1}{m}\sum_{i=1}^m a_i \operatorname{sign}(\langle a_i, x \rangle - b_i) \in \partial f(x).$$

The accuracy of the subgradient method with different stepsizes varies greatly: the smaller the stepsize, the better the (final) accuracy of the iterates $x^k$, but initial progress is much slower.
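A small Python sketch of this subgradient computation together with a fixed-stepsize run on synthetic data (the data dimensions, seed, and stepsize $\alpha = 0.1$ are arbitrary illustrative choices, not the lecture's experiment):

```python
import numpy as np

def robust_regression_subgrad(x, A, b):
    """A subgradient of f(x) = (1/m) ||A x - b||_1, namely (1/m) A^T sign(A x - b)."""
    m = A.shape[0]
    return A.T @ np.sign(A @ x - b) / m

# Illustrative run with synthetic data.
rng = np.random.default_rng(0)
m, n = 100, 20
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + rng.standard_normal(m)

x = np.zeros(n)
alpha = 0.1                      # fixed stepsize, as in the plots that follow
for k in range(500):
    x = x - alpha * robust_regression_subgrad(x, A, b)
```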

[Figure: $f(x^k) - f(x_\star)$ for the subgradient method with fixed stepsizes $\alpha$.]

[Figure: $f(x^k_{\mathrm{best}}) - f(x_\star)$ for the subgradient method with fixed stepsizes $\alpha$.]

3 Projected subgradient methods for the constrained case

For $k = 1, 2, \ldots$, choose any $g^k \in \partial f(x^k)$ and set

$$x^{k+1} = \pi_C(x^k - \alpha_k g^k),$$

where

$$\pi_C(x) = \operatorname*{argmin}_{y \in C} \|x - y\|_2.$$

The update is equivalent to (why? Exercise)

$$x^{k+1} = \operatorname*{argmin}_{x \in C} \left\{ f(x^k) + \langle g^k, x - x^k \rangle + \frac{1}{2\alpha_k}\|x - x^k\|_2^2 \right\}.$$

It is very important in the projected subgradient method that the projection mapping $\pi_C$ be efficiently computable; the method is effective essentially only in problems where this is true.
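A minimal sketch of the projected iteration, assuming a user-supplied Euclidean projection routine (for instance one of the projections in the examples that follow); this is an illustration, not code from the lecture:

```python
import numpy as np

def projected_subgradient_method(subgrad, project, x0, stepsize, num_iters=100):
    """Projected subgradient iteration x^{k+1} = pi_C(x^k - alpha_k * g^k).

    project(x) must return the Euclidean projection of x onto C."""
    x = np.array(x0, dtype=float)
    for k in range(1, num_iters + 1):
        x = project(x - stepsize(k) * subgrad(x))
    return x
```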

Example (LASSO / compressed sensing applications):

$$\min_x \; \|Ax - b\|_2^2 \quad \text{subject to } \|x\|_1 \le 1.$$

Example. Suppose that $C = \{x : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$.

(1) $p = \infty$:

$$[\pi_C(x)]_j = \min\{1, \max\{x_j, -1\}\},$$

that is, we simply truncate the coordinates of $x$ to lie in the range $[-1, 1]$.

(2) $p = 2$:

$$\pi_C(x) = \begin{cases} x & \text{if } \|x\|_2 \le 1, \\ x / \|x\|_2 & \text{otherwise.} \end{cases}$$

(3) $p = 1$: If $\|x\|_1 \le 1$, then $\pi_C(x) = x$. If $\|x\|_1 > 1$, then

$$[\pi_C(x)]_j = \operatorname{sign}(x_j)\,[\,|x_j| - t\,]_+,$$

where $t$ is the unique $t \ge 0$ satisfying

$$\sum_{j=1}^n [\,|x_j| - t\,]_+ = 1.$$
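These three projections are straightforward to implement. The sketch below is an illustration rather than code from the lecture; the $\ell_1$ case uses the standard sort-and-threshold routine for finding $t$:

```python
import numpy as np

def project_linf_ball(x):
    """Projection onto {x : ||x||_inf <= 1}: clip each coordinate to [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

def project_l2_ball(x):
    """Projection onto {x : ||x||_2 <= 1}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def project_l1_ball(x):
    """Projection onto {x : ||x||_1 <= 1}: soft-threshold with the t >= 0
    satisfying sum_j [|x_j| - t]_+ = 1 (only needed when ||x||_1 > 1)."""
    if np.abs(x).sum() <= 1.0:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]                 # sorted |x_j|, descending
    css = np.cumsum(u)
    j = np.arange(1, len(u) + 1)
    rho = np.max(j[u - (css - 1.0) / j > 0])     # last index kept active
    t = (css[rho - 1] - 1.0) / rho
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```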

Example. Suppose that $C$ is an affine set,

$$C = \{x \in \mathbb{R}^n : Ax = b\},$$

where $A \in \mathbb{R}^{m \times n}$ with $m \le n$ has full rank (so $A$ is a short and fat matrix and $AA^T \succ 0$). Then the projection of $x$ onto $C$ is

$$\pi_C(x) = (I - A^T(AA^T)^{-1}A)x + A^T(AA^T)^{-1}b.$$

If we begin the iterates from a point $x^k \in C$, i.e. with $Ax^k = b$, then

$$x^{k+1} = \pi_C(x^k - \alpha_k g^k) = x^k - \alpha_k (I - A^T(AA^T)^{-1}A) g^k;$$

that is, we simply project $g^k$ onto the nullspace of $A$ and iterate.

For more examples and proofs, see FOMO §6.4.
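A sketch of this projection, written with a linear solve instead of an explicit inverse (an implementation choice of mine, not from the lecture):

```python
import numpy as np

def project_affine(x, A, b):
    """Projection onto {x : A x = b} for full row rank A (A A^T invertible):
    pi_C(x) = x - A^T (A A^T)^{-1} (A x - b)."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def project_to_nullspace(g, A):
    """Project g onto the nullspace of A (the update direction when x^k is in C)."""
    return g - A.T @ np.linalg.solve(A @ A.T, A @ g)
```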

3.1 Assumptions and convergence analysis

- The function $f$ is convex.
- The set $C \subseteq \operatorname{int} \operatorname{dom} f$ is compact and convex, and $\|x - x_\star\|_2 \le R < \infty$ for all $x \in C$, where $x_\star = \operatorname*{argmin}_{x \in C} f(x)$ and $f(x_\star) > -\infty$.
- There exists $M < \infty$ such that $\|g\|_2 \le M$ for all $g \in \partial f(x)$ and all $x \in C$.

Theorem 4

Suppose the above assumptions hold and let $\alpha_k > 0$ be any non-increasing sequence of stepsizes. The projected subgradient iteration generates a sequence $\{x^k\}$ satisfying, for all $K \ge 1$,

$$\sum_{k=1}^K \big[f(x^k) - f(x_\star)\big] \le \frac{R^2}{2\alpha_K} + \frac{1}{2}\sum_{k=1}^K \alpha_k M^2.$$

Corollary 5

Let $\alpha_k = \frac{\alpha}{\sqrt{k}}$ and define $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$. Then for all $K \ge 1$,

$$f(\bar{x}^K) - f(x_\star) \le \frac{R^2}{2\alpha\sqrt{K}} + \frac{M^2\alpha}{\sqrt{K}}.$$

We see that convergence is guaranteed at the "best" rate $1/\sqrt{K}$ for all iterations. Here we say "best" because this rate is unimprovable: there are worst-case functions for which no method can achieve a rate of convergence faster than $RM/\sqrt{K}$.

4.1 Stochastic subgradient

Definition. A stochastic subgradient oracle for the function $f$ consists of a triple $(g, \mathcal{S}, P)$, where $\mathcal{S}$ is a sample space, $P$ is a probability distribution, and $g : \mathbb{R}^n \times \mathcal{S} \to \mathbb{R}^n$ is a mapping that, for each fixed $x \in \operatorname{dom} f$, satisfies

$$\mathbb{E}[g(x, S)] = \int g(x, s)\, dP(s) \in \partial f(x),$$

where $S \in \mathcal{S}$ is a random variable with distribution $P$.

With some abuse of notation, we will write $g$ or $g(x)$ as shorthand for the random vector $g(x, S)$ when this causes no confusion.

Definition. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a convex function and fix $x \in \operatorname{dom} f$. A random vector $g$ is a stochastic subgradient for $f$ at the point $x$ if $\mathbb{E}[g] \in \partial f(x)$, i.e.,

$$f(y) \ge f(x) + \langle \mathbb{E}[g], y - x \rangle \quad \text{for all } y.$$

Example. Let $F : \mathbb{R}^n \times \mathcal{S} \to \mathbb{R}$ be a collection of functions, where $\mathcal{S}$ is a sample space and, for each $s \in \mathcal{S}$, the function $F(\cdot\,; s)$ is convex. Then

$$f(x) = \mathbb{E}[F(x; S)]$$

is convex when the expectation is taken over the random variable $S$, and choosing

$$g(x, s) \in \partial F(x; s)$$

(a subgradient in $x$) gives a stochastic subgradient with the property that $\mathbb{E}[g(x, S)] \in \partial f(x)$.

4.2 Stochastic programming

Consider the convex optimization problem

$$\min_{x \in C} \; f(x) = \mathbb{E}[F(x; S)],$$

where $C$ is a convex set, $S$ is a random variable on the space $\mathcal{S}$ with distribution $P$ (so the expectation $\mathbb{E}[F(x; S)]$ is taken according to $P$), and for each $s \in \mathcal{S}$ the function $x \mapsto F(x; s)$ is convex (therefore $f$ is convex).

If $g(x, s) \in \partial_x F(x; s)$ and $S \sim P$, then $g = g(x, S)$ is a stochastic subgradient, because for all $y$,

$$f(y) = \mathbb{E}[F(y; S)] \ge \mathbb{E}\big[F(x; S) + \langle g(x, S), y - x \rangle\big] = f(x) + \langle \mathbb{E}[g(x, S)], y - x \rangle.$$

Example (Robust regression).

$$f(x) = \frac{1}{m}\|Ax - b\|_1 = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x \rangle - b_i|.$$

A natural stochastic subgradient is

$$g(x, i) = a_i \operatorname{sign}(\langle a_i, x \rangle - b_i),$$

where $i$ is drawn uniformly at random from $[m]$.

Advantage: computing $g(x, i)$ requires only $O(n)$ time (as opposed to $O(mn)$ time to compute $Ax - b$).

Generalization: given any problem with a large dataset $\{s_i\}_{i=1}^m$,

$$\min_x \; f(x) = \frac{1}{m}\sum_{i=1}^m F(x; s_i),$$

draw $i \in [m]$ uniformly at random and select $g \in \partial F(x; s_i)$.
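For the finite-sum case, a stochastic subgradient oracle is one line of sampling plus one subgradient evaluation. A sketch for the robust-regression sum above (illustrative, assuming a NumPy random generator is passed in):

```python
import numpy as np

def finite_sum_stochastic_subgrad(x, A, b, rng):
    """Stochastic subgradient of f(x) = (1/m) sum_i |<a_i, x> - b_i|:
    draw i uniformly and return a_i * sign(<a_i, x> - b_i).
    Its expectation over i lies in the subdifferential of f at x."""
    m = A.shape[0]
    i = rng.integers(m)                      # uniform index in {0, ..., m-1}
    return A[i] * np.sign(A[i] @ x - b[i])   # O(n) work per call
```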

4.3 Projected stochastic subgradient methods

Sometimes computing a stochastic subgradient is much easier than computing a true subgradient: the expectation $\mathbb{E}[F(x; S)]$ is generally intractable to compute in many statistical and machine learning applications, so it may be impossible to find a subgradient $g \in \partial f(x)$.

For $k = 1, 2, \ldots$, compute a stochastic subgradient $g^k$ at the point $x^k$, where

$$\mathbb{E}[g^k \mid x^k] \in \partial f(x^k),$$

and set

$$x^{k+1} = \pi_C(x^k - \alpha_k g^k).$$

This is essentially identical to the projected subgradient method, except that the true subgradient is replaced by a stochastic subgradient.

Example (Robust regression). We consider

$$\min_x \; f(x) = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x \rangle - b_i| \quad \text{s.t. } \|x\|_2 \le R,$$

using the random sample

$$g = a_i \operatorname{sign}(\langle a_i, x \rangle - b_i)$$

as our stochastic subgradient. Set

$$A = \begin{bmatrix} a_1 & \cdots & a_m \end{bmatrix}^T, \qquad a_i \sim \mathcal{N}(0, I_{n \times n}) \ \text{(i.i.d.)},$$

and

$$b_i = \langle a_i, u \rangle + \varepsilon_i |\varepsilon_i|^3, \qquad \varepsilon_i \sim \mathcal{N}(0, 1) \ \text{(i.i.d.)},$$

where $u \sim \mathcal{N}(0, I_{n \times n})$. Set $n = 50$, $m = 100$, $R = 4$, and

$$\alpha_k = \frac{R}{M\sqrt{k}}, \qquad M^2 = \frac{1}{m}\|A\|_F^2.$$
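A sketch reproducing this setup in Python; the reading of the noise term as $\varepsilon_i|\varepsilon_i|^3$, the random seed, and the iteration count are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, R = 50, 100, 4.0

A = rng.standard_normal((m, n))                  # rows a_i ~ N(0, I_n)
u = rng.standard_normal(n)
eps = rng.standard_normal(m)
b = A @ u + eps * np.abs(eps) ** 3               # heavy-tailed noise eps_i |eps_i|^3

M = np.sqrt((A ** 2).sum() / m)                  # M^2 = (1/m) ||A||_F^2

def f(x):                                        # objective (1/m) ||A x - b||_1
    return np.abs(A @ x - b).mean()

def project_l2(x, radius=R):                     # projection onto {||x||_2 <= R}
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else radius * x / nrm

x = np.zeros(n)
for k in range(1, 2001):
    i = rng.integers(m)
    g = A[i] * np.sign(A[i] @ x - b[i])          # stochastic subgradient
    x = project_l2(x - (R / (M * np.sqrt(k))) * g)
```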

[Figure: $f(x^k) - f(x_\star)$ versus $k$.]

Typical performance: the initial decrease is quite fast, but the method eventually stops making progress once it reaches some low accuracy (in this case $10^{-1}$). Each stochastic iteration costs $O(n)$, while each projected subgradient method iteration costs $O(mn)$.

Example (Multiclass support vector machine).

In a general $m$-class classification problem, we represent the multiclass classifier using the matrix

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \in \mathbb{R}^{n \times m}.$$

The predicted class for a data vector $a \in \mathbb{R}^n$ is then

$$\operatorname*{argmax}_{l \in [m]} \langle a, x_l \rangle = \operatorname*{argmax}_{l \in [m]} [X^T a]_l,$$

where $\langle a, x_l \rangle$ is the "score" associated with class $l$.

Training examples are given as pairs

$$(a_i, b_i) \in \mathbb{R}^n \times \{1, \ldots, m\}, \qquad i = 1, \ldots, N.$$

The multiclass classifier $X$ can be determined by

$$\min_X \; f(X) = \frac{1}{N}\sum_{i=1}^N F(X; (a_i, b_i)) \quad \text{s.t. } \|X\|_F \le R,$$

where the multiclass hinge loss is

$$F(X; (a, b)) = \max_{l \ne b}\, [1 + \langle a, x_l - x_b \rangle]_+,$$

with $[t]_+ = \max\{t, 0\}$ denoting the positive part.
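One convenient subgradient of this hinge loss picks the maximizing class $l^\star \ne b$ and, when the loss is positive, places $+a$ in column $l^\star$ and $-a$ in column $b$. A sketch of this choice (illustrative, not the lecture's code):

```python
import numpy as np

def multiclass_hinge_subgrad(X, a, b):
    """A subgradient (w.r.t. X) of F(X; (a, b)) = max_{l != b} [1 + <a, x_l - x_b>]_+,
    where X has one column x_l per class and b is the true class index."""
    scores = X.T @ a                              # scores <a, x_l> for each class l
    margins = 1.0 + scores - scores[b]
    margins[b] = -np.inf                          # exclude l = b from the max
    l_star = int(np.argmax(margins))
    G = np.zeros_like(X)
    if margins[l_star] > 0:                       # otherwise the loss is 0 and G = 0 works
        G[:, l_star] += a
        G[:, b] -= a
    return G
```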

Set

$$\alpha_k = \frac{\alpha_1}{\sqrt{k}}, \qquad M^2 = \frac{1}{N}\sum_{i=1}^N \|a_i\|_2^2.$$

Stochastic subgradient method: draw $i \in [N]$ uniformly at random, then take

$$g^k \in \partial F(X^k; (a_i, b_i)).$$

Subgradient method:

$$g^k = \frac{1}{N}\sum_{i=1}^N g^k_i \in \partial f(X^k), \qquad g^k_i \in \partial F(X^k; (a_i, b_i)).$$

[Figure: $f(X^k) - f(X^\star)$ versus "effective passes through $A$".]

4.3.1 Assumptions and convergence analysis

- The function $f$ is convex.
- The set $C \subseteq \operatorname{int} \operatorname{dom} f$ is compact and convex, and $\|x - x_\star\|_2 \le R < \infty$ for all $x \in C$, where $x_\star = \operatorname*{argmin}_{x \in C} f(x)$ and $f(x_\star) > -\infty$.
- There exists $M < \infty$ such that $\mathbb{E}[\|g(x, S)\|_2^2] \le M^2$ for all $x \in C$ and all $g$ satisfying $\mathbb{E}[g(x, S)] \in \partial f(x)$.

Theorem 6

Suppose the above assumptions hold and let $\alpha_k > 0$ be any non-increasing sequence of stepsizes. The stochastic projected subgradient iteration generates a sequence $\{x^k\}$ such that, for all $K \ge 1$, with $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$,

$$\mathbb{E}\big[f(\bar{x}^K) - f(x_\star)\big] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2.$$

Corollary 7

Let the conditions of Theorem 6 hold and let $\alpha_k = \frac{R}{M\sqrt{k}}$ for each $k$. Then for all $K \ge 1$,

$$\mathbb{E}[f(\bar{x}^K)] - f(x_\star) \le \frac{3RM}{2\sqrt{K}}.$$

Corollary 8

Let $\{\alpha_k\}$ be non-summable but convergent to zero, that is,

$$\alpha_k \to 0, \qquad \sum_{k=1}^\infty \alpha_k = \infty.$$

Then $f(\bar{x}^K) - f(x_\star) \to 0$ in probability as $K \to \infty$; that is, for all $\varepsilon > 0$ we have

$$\limsup_{K \to \infty}\, \mathbb{P}\big[f(\bar{x}^K) - f(x_\star) \ge \varepsilon\big] = 0.$$

Theorem 9

Let the conditions of Theorem 6 hold and assume in addition that $\|g\|_2 \le M$ for all stochastic subgradients $g$. Then for any $\varepsilon > 0$,

$$f(\bar{x}^K) - f(x_\star) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}}\,\varepsilon$$

with probability at least $1 - e^{-\varepsilon^2/2}$.

Taking $\alpha_k = \frac{R}{M\sqrt{k}}$ and setting $\delta = e^{-\varepsilon^2/2}$, we have

$$f(\bar{x}^K) - f(x_\star) \le \frac{3RM}{2\sqrt{K}} + \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}$$

with probability at least $1 - \delta$. That is, we have convergence of order $O(MR/\sqrt{K})$ with high probability.

Page 2: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

1 Convex optimization problem and ε-optimal solution

Consider the problem

minx

f(x) subject to x isin C

where f Rn 983041rarr R cup +infin is convex

f(x) = +infin for x isin domf

and C is a closed convex set

ε-optimal solution 983141x

f(983141x)minus f(x983183) le ε and dist(983141x C) le ε

where x983183 isin C is an optimal solution

Newton method interior point method prohibitive due tocomplexity issue Subgradient method is preferable low costper-iteration achieve low accuracy solutions very quickly

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 2 32

11 Gradient descent and Newtonrsquos method

Linear approximation f(x) asymp f(xk) + 〈nablaf(xk)xminus xk〉Quadratic approximations

f(x) asymp f(xk) + 〈nablaf(xk)xminus xk〉+ L

2983042xminus xk98304222

f(x) asymp f(xk) + 〈nablaf(xk)xminus xk〉+ 1

2(xminus xk)Tnabla2f(xk)(xminus xk)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 3 32

2 Subgradient methods for C = Rn

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = xk minus αkgk

= argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

The subgradient method is not in general a descent methodThat is f(xk) may not decreasing For example f(x) = 983042x9830421and let x1 = e1 We have

partf(x1) = e1 +

n983131

i=2

tiei ti isin [minus1 1]

Any g1 isin partf(x1) with983123n

i=2 |ti| gt 1 is an ascent direction ie

f(x1 minus αg1) gt f(x1) forallα gt 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 4 32

Contour plot of a convex function

Subdifferential = blue zone + red zone

Gray zone = negative blue zone ascent directions

Green zone = negative red zone descent directions

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 5 32

Theorem 1 (Distance to a minimizer is decreasing)

If 0 isin partf(xk) then for any gk isin partf(xk) and any x983183 isin argminx f(x)there is a stepsize α gt 0 such that

983042xk minus αgk minus x9831839830422 lt 983042xk minus x9831839830422

Proof Note that

1

2983042xk minus αgk minus x98318398304222 =

1

2983042xk minus x98318398304222 + α〈gkx983183 minus xk〉+ α2

2983042gk98304222

andf(x983183) ge f(xk) + 〈gkx983183 minus xk〉

Any α satisfying

0 lt α lt2(f(xk)minus f(x983183))

983042gk98304222is desired

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 6 32

21 Assumptions and convergence analysis

The function f is convex

There is at least one (possibly non-unique) minimizing pointx983183 isin argmin

xf(x) with f(x983183) = inf

xf(x) gt minusinfin

The subgradients are bounded for all x and all g isin partf(x) wehave the subgradient bound 983042g9830422 le M lt infin (independently of x)

Theorem 2

Let αk ge 0 be any non-negative sequence of stepsizes and the aboveassumptions hold The subgradient iteration generates the sequencexk that satisfies for all K ge 1

K983131

k=1

αk[f(xk)minus f(x983183)] le

1

2983042x1 minus x98318398304222 +

1

2

K983131

k=1

α2kM

2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 7 32

Corollary 3

Let AK =

K983131

k=1

αk and define xK =1

AK

K983131

k=1

αkxk xK

best = argminxkkleK

f(xk)

Then for all K ge 1

f(xK)minus f(x983183) le983042x1 minus x98318398304222 +

K983131

k=1

α2kM

2

2AK

and

f(xKbest)minus f(x983183) le

983042x1 minus xlowast98304222 +K983131

k=1

α2kM

2

2AK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32

Whenever αk rarr 0 and

infin983131

k=1

αk = infin we have

K983131

k=1

α2k

983089 K983131

k=1

αk rarr 0

and sof(xK)minus f(x983183) rarr 0 as K rarr infin

Taking

αk =983042x1 minus x9831839830422

Mradick

yields

f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic

K

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32

Example robust regression in robust statistics

Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

where A =983045a1 middot middot middot am

983046T isin Rmtimesn and b isin Rm The subgradient

g =1

mATsign(Axminus b) =

1

m

m983131

i=1

aisign(〈aix〉 minus bi) isin partf(x)

The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32

f(xk)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32

f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32

3 Projected subgradient methods for constrained case

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = πC(xk minus αkg

k)

whereπC(x) = argmin

yisinC983042xminus y9830422

The update is equivalent to (why Exercise)

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 3: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

11 Gradient descent and Newtonrsquos method

Linear approximation f(x) asymp f(xk) + 〈nablaf(xk)xminus xk〉Quadratic approximations

f(x) asymp f(xk) + 〈nablaf(xk)xminus xk〉+ L

2983042xminus xk98304222

f(x) asymp f(xk) + 〈nablaf(xk)xminus xk〉+ 1

2(xminus xk)Tnabla2f(xk)(xminus xk)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 3 32

2 Subgradient methods for C = Rn

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = xk minus αkgk

= argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

The subgradient method is not in general a descent methodThat is f(xk) may not decreasing For example f(x) = 983042x9830421and let x1 = e1 We have

partf(x1) = e1 +

n983131

i=2

tiei ti isin [minus1 1]

Any g1 isin partf(x1) with983123n

i=2 |ti| gt 1 is an ascent direction ie

f(x1 minus αg1) gt f(x1) forallα gt 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 4 32

Contour plot of a convex function

Subdifferential = blue zone + red zone

Gray zone = negative blue zone ascent directions

Green zone = negative red zone descent directions

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 5 32

Theorem 1 (Distance to a minimizer is decreasing)

If 0 isin partf(xk) then for any gk isin partf(xk) and any x983183 isin argminx f(x)there is a stepsize α gt 0 such that

983042xk minus αgk minus x9831839830422 lt 983042xk minus x9831839830422

Proof Note that

1

2983042xk minus αgk minus x98318398304222 =

1

2983042xk minus x98318398304222 + α〈gkx983183 minus xk〉+ α2

2983042gk98304222

andf(x983183) ge f(xk) + 〈gkx983183 minus xk〉

Any α satisfying

0 lt α lt2(f(xk)minus f(x983183))

983042gk98304222is desired

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 6 32

21 Assumptions and convergence analysis

The function f is convex

There is at least one (possibly non-unique) minimizing pointx983183 isin argmin

xf(x) with f(x983183) = inf

xf(x) gt minusinfin

The subgradients are bounded for all x and all g isin partf(x) wehave the subgradient bound 983042g9830422 le M lt infin (independently of x)

Theorem 2

Let αk ge 0 be any non-negative sequence of stepsizes and the aboveassumptions hold The subgradient iteration generates the sequencexk that satisfies for all K ge 1

K983131

k=1

αk[f(xk)minus f(x983183)] le

1

2983042x1 minus x98318398304222 +

1

2

K983131

k=1

α2kM

2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 7 32

Corollary 3

Let AK =

K983131

k=1

αk and define xK =1

AK

K983131

k=1

αkxk xK

best = argminxkkleK

f(xk)

Then for all K ge 1

f(xK)minus f(x983183) le983042x1 minus x98318398304222 +

K983131

k=1

α2kM

2

2AK

and

f(xKbest)minus f(x983183) le

983042x1 minus xlowast98304222 +K983131

k=1

α2kM

2

2AK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32

Whenever αk rarr 0 and

infin983131

k=1

αk = infin we have

K983131

k=1

α2k

983089 K983131

k=1

αk rarr 0

and sof(xK)minus f(x983183) rarr 0 as K rarr infin

Taking

αk =983042x1 minus x9831839830422

Mradick

yields

f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic

K

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32

Example robust regression in robust statistics

Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

where A =983045a1 middot middot middot am

983046T isin Rmtimesn and b isin Rm The subgradient

g =1

mATsign(Axminus b) =

1

m

m983131

i=1

aisign(〈aix〉 minus bi) isin partf(x)

The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32

f(xk)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32

f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32

3 Projected subgradient methods for constrained case

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = πC(xk minus αkg

k)

whereπC(x) = argmin

yisinC983042xminus y9830422

The update is equivalent to (why Exercise)

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 4: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

2 Subgradient methods for C = Rn

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = xk minus αkgk

= argminxisinRn

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

The subgradient method is not in general a descent methodThat is f(xk) may not decreasing For example f(x) = 983042x9830421and let x1 = e1 We have

partf(x1) = e1 +

n983131

i=2

tiei ti isin [minus1 1]

Any g1 isin partf(x1) with983123n

i=2 |ti| gt 1 is an ascent direction ie

f(x1 minus αg1) gt f(x1) forallα gt 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 4 32

Contour plot of a convex function

Subdifferential = blue zone + red zone

Gray zone = negative blue zone ascent directions

Green zone = negative red zone descent directions

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 5 32

Theorem 1 (Distance to a minimizer is decreasing)

If 0 isin partf(xk) then for any gk isin partf(xk) and any x983183 isin argminx f(x)there is a stepsize α gt 0 such that

983042xk minus αgk minus x9831839830422 lt 983042xk minus x9831839830422

Proof Note that

1

2983042xk minus αgk minus x98318398304222 =

1

2983042xk minus x98318398304222 + α〈gkx983183 minus xk〉+ α2

2983042gk98304222

andf(x983183) ge f(xk) + 〈gkx983183 minus xk〉

Any α satisfying

0 lt α lt2(f(xk)minus f(x983183))

983042gk98304222is desired

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 6 32

21 Assumptions and convergence analysis

The function f is convex

There is at least one (possibly non-unique) minimizing pointx983183 isin argmin

xf(x) with f(x983183) = inf

xf(x) gt minusinfin

The subgradients are bounded for all x and all g isin partf(x) wehave the subgradient bound 983042g9830422 le M lt infin (independently of x)

Theorem 2

Let αk ge 0 be any non-negative sequence of stepsizes and the aboveassumptions hold The subgradient iteration generates the sequencexk that satisfies for all K ge 1

K983131

k=1

αk[f(xk)minus f(x983183)] le

1

2983042x1 minus x98318398304222 +

1

2

K983131

k=1

α2kM

2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 7 32

Corollary 3

Let AK =

K983131

k=1

αk and define xK =1

AK

K983131

k=1

αkxk xK

best = argminxkkleK

f(xk)

Then for all K ge 1

f(xK)minus f(x983183) le983042x1 minus x98318398304222 +

K983131

k=1

α2kM

2

2AK

and

f(xKbest)minus f(x983183) le

983042x1 minus xlowast98304222 +K983131

k=1

α2kM

2

2AK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32

Whenever αk rarr 0 and

infin983131

k=1

αk = infin we have

K983131

k=1

α2k

983089 K983131

k=1

αk rarr 0

and sof(xK)minus f(x983183) rarr 0 as K rarr infin

Taking

αk =983042x1 minus x9831839830422

Mradick

yields

f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic

K

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32

Example robust regression in robust statistics

Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

where A =983045a1 middot middot middot am

983046T isin Rmtimesn and b isin Rm The subgradient

g =1

mATsign(Axminus b) =

1

m

m983131

i=1

aisign(〈aix〉 minus bi) isin partf(x)

The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32

f(xk)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32

f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32

3 Projected subgradient methods for constrained case

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = πC(xk minus αkg

k)

whereπC(x) = argmin

yisinC983042xminus y9830422

The update is equivalent to (why Exercise)

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let $\alpha_k > 0$ be any non-increasing sequence of stepsizes and let the above assumptions hold. The stochastic projected subgradient iteration generates a sequence $\{x^k\}$ that, with $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x^k$, satisfies for all $K \ge 1$

\mathbb{E}[f(\bar{x}_K) - f(x_\star)] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K} \sum_{k=1}^K \alpha_k M^2.


Corollary 7

Let the conditions of Theorem 6 hold and let $\alpha_k = \frac{R}{M\sqrt{k}}$ for each k. Then for all $K \ge 1$,

\mathbb{E}[f(\bar{x}_K)] - f(x_\star) \le \frac{3RM}{2\sqrt{K}}.
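The constant 3/2 follows directly from Theorem 6: with this stepsize, $\frac{R^2}{2K\alpha_K} = \frac{RM}{2\sqrt{K}}$, and since $\sum_{k=1}^K k^{-1/2} \le 2\sqrt{K}$,

\frac{1}{2K} \sum_{k=1}^K \alpha_k M^2 = \frac{RM}{2K} \sum_{k=1}^K \frac{1}{\sqrt{k}} \le \frac{RM}{\sqrt{K}},

so the two terms sum to $\frac{3RM}{2\sqrt{K}}$.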

Corollary 8

Let $\alpha_k$ be non-summable but convergent to zero, that is,

\alpha_k \to 0, \qquad \sum_{k=1}^{\infty} \alpha_k = \infty.

Then $f(\bar{x}_K) - f(x_\star) \to 0$ in probability as $K \to \infty$; that is, for all $\epsilon > 0$ we have

\limsup_{K\to\infty} \mathbb{P}[f(\bar{x}_K) - f(x_\star) \ge \epsilon] = 0.


Theorem 9

Let the conditions of Theorem 6 hold and assume that $\|g\|_2 \le M$ for all stochastic subgradients g. Then for any $\epsilon > 0$,

f(\bar{x}_K) - f(x_\star) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K} \sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}}\,\epsilon

with probability at least $1 - e^{-\epsilon^2/2}$.

Let $\alpha_k = \frac{R}{M\sqrt{k}}$ and set $\delta = e^{-\epsilon^2/2}$; then

f(\bar{x}_K) - f(x_\star) \le \frac{3RM}{2\sqrt{K}} + \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}

with probability at least $1 - \delta$. That is, we have convergence of order $O(MR/\sqrt{K})$ with high probability.


Page 5: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Contour plot of a convex function

Subdifferential = blue zone + red zone

Gray zone = negative blue zone ascent directions

Green zone = negative red zone descent directions

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 5 32

Theorem 1 (Distance to a minimizer is decreasing)

If 0 isin partf(xk) then for any gk isin partf(xk) and any x983183 isin argminx f(x)there is a stepsize α gt 0 such that

983042xk minus αgk minus x9831839830422 lt 983042xk minus x9831839830422

Proof Note that

1

2983042xk minus αgk minus x98318398304222 =

1

2983042xk minus x98318398304222 + α〈gkx983183 minus xk〉+ α2

2983042gk98304222

andf(x983183) ge f(xk) + 〈gkx983183 minus xk〉

Any α satisfying

0 lt α lt2(f(xk)minus f(x983183))

983042gk98304222is desired

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 6 32

21 Assumptions and convergence analysis

The function f is convex

There is at least one (possibly non-unique) minimizing pointx983183 isin argmin

xf(x) with f(x983183) = inf

xf(x) gt minusinfin

The subgradients are bounded for all x and all g isin partf(x) wehave the subgradient bound 983042g9830422 le M lt infin (independently of x)

Theorem 2

Let αk ge 0 be any non-negative sequence of stepsizes and the aboveassumptions hold The subgradient iteration generates the sequencexk that satisfies for all K ge 1

K983131

k=1

αk[f(xk)minus f(x983183)] le

1

2983042x1 minus x98318398304222 +

1

2

K983131

k=1

α2kM

2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 7 32

Corollary 3

Let AK =

K983131

k=1

αk and define xK =1

AK

K983131

k=1

αkxk xK

best = argminxkkleK

f(xk)

Then for all K ge 1

f(xK)minus f(x983183) le983042x1 minus x98318398304222 +

K983131

k=1

α2kM

2

2AK

and

f(xKbest)minus f(x983183) le

983042x1 minus xlowast98304222 +K983131

k=1

α2kM

2

2AK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32

Whenever αk rarr 0 and

infin983131

k=1

αk = infin we have

K983131

k=1

α2k

983089 K983131

k=1

αk rarr 0

and sof(xK)minus f(x983183) rarr 0 as K rarr infin

Taking

αk =983042x1 minus x9831839830422

Mradick

yields

f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic

K

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32

Example robust regression in robust statistics

Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

where A =983045a1 middot middot middot am

983046T isin Rmtimesn and b isin Rm The subgradient

g =1

mATsign(Axminus b) =

1

m

m983131

i=1

aisign(〈aix〉 minus bi) isin partf(x)

The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32

f(xk)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32

f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32

3 Projected subgradient methods for constrained case

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = πC(xk minus αkg

k)

whereπC(x) = argmin

yisinC983042xminus y9830422

The update is equivalent to (why Exercise)

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 6: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Theorem 1 (Distance to a minimizer is decreasing)

If 0 isin partf(xk) then for any gk isin partf(xk) and any x983183 isin argminx f(x)there is a stepsize α gt 0 such that

983042xk minus αgk minus x9831839830422 lt 983042xk minus x9831839830422

Proof Note that

1

2983042xk minus αgk minus x98318398304222 =

1

2983042xk minus x98318398304222 + α〈gkx983183 minus xk〉+ α2

2983042gk98304222

andf(x983183) ge f(xk) + 〈gkx983183 minus xk〉

Any α satisfying

0 lt α lt2(f(xk)minus f(x983183))

983042gk98304222is desired

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 6 32

21 Assumptions and convergence analysis

The function f is convex

There is at least one (possibly non-unique) minimizing pointx983183 isin argmin

xf(x) with f(x983183) = inf

xf(x) gt minusinfin

The subgradients are bounded for all x and all g isin partf(x) wehave the subgradient bound 983042g9830422 le M lt infin (independently of x)

Theorem 2

Let αk ge 0 be any non-negative sequence of stepsizes and the aboveassumptions hold The subgradient iteration generates the sequencexk that satisfies for all K ge 1

K983131

k=1

αk[f(xk)minus f(x983183)] le

1

2983042x1 minus x98318398304222 +

1

2

K983131

k=1

α2kM

2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 7 32

Corollary 3

Let AK =

K983131

k=1

αk and define xK =1

AK

K983131

k=1

αkxk xK

best = argminxkkleK

f(xk)

Then for all K ge 1

f(xK)minus f(x983183) le983042x1 minus x98318398304222 +

K983131

k=1

α2kM

2

2AK

and

f(xKbest)minus f(x983183) le

983042x1 minus xlowast98304222 +K983131

k=1

α2kM

2

2AK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32

Whenever αk rarr 0 and

infin983131

k=1

αk = infin we have

K983131

k=1

α2k

983089 K983131

k=1

αk rarr 0

and sof(xK)minus f(x983183) rarr 0 as K rarr infin

Taking

αk =983042x1 minus x9831839830422

Mradick

yields

f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic

K

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32

Example robust regression in robust statistics

Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

where A =983045a1 middot middot middot am

983046T isin Rmtimesn and b isin Rm The subgradient

g =1

mATsign(Axminus b) =

1

m

m983131

i=1

aisign(〈aix〉 minus bi) isin partf(x)

The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32

f(xk)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32

f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32

3 Projected subgradient methods for constrained case

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = πC(xk minus αkg

k)

whereπC(x) = argmin

yisinC983042xminus y9830422

The update is equivalent to (why Exercise)

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 7: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

21 Assumptions and convergence analysis

The function f is convex

There is at least one (possibly non-unique) minimizing pointx983183 isin argmin

xf(x) with f(x983183) = inf

xf(x) gt minusinfin

The subgradients are bounded for all x and all g isin partf(x) wehave the subgradient bound 983042g9830422 le M lt infin (independently of x)

Theorem 2

Let αk ge 0 be any non-negative sequence of stepsizes and the aboveassumptions hold The subgradient iteration generates the sequencexk that satisfies for all K ge 1

K983131

k=1

αk[f(xk)minus f(x983183)] le

1

2983042x1 minus x98318398304222 +

1

2

K983131

k=1

α2kM

2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 7 32

Corollary 3

Let AK =

K983131

k=1

αk and define xK =1

AK

K983131

k=1

αkxk xK

best = argminxkkleK

f(xk)

Then for all K ge 1

f(xK)minus f(x983183) le983042x1 minus x98318398304222 +

K983131

k=1

α2kM

2

2AK

and

f(xKbest)minus f(x983183) le

983042x1 minus xlowast98304222 +K983131

k=1

α2kM

2

2AK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32

Whenever αk rarr 0 and

infin983131

k=1

αk = infin we have

K983131

k=1

α2k

983089 K983131

k=1

αk rarr 0

and sof(xK)minus f(x983183) rarr 0 as K rarr infin

Taking

αk =983042x1 minus x9831839830422

Mradick

yields

f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic

K

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32

Example robust regression in robust statistics

Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

where A =983045a1 middot middot middot am

983046T isin Rmtimesn and b isin Rm The subgradient

g =1

mATsign(Axminus b) =

1

m

m983131

i=1

aisign(〈aix〉 minus bi) isin partf(x)

The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32

f(xk)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32

f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32

3 Projected subgradient methods for constrained case

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = πC(xk minus αkg

k)

whereπC(x) = argmin

yisinC983042xminus y9830422

The update is equivalent to (why Exercise)

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 8: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Corollary 3

Let AK =

K983131

k=1

αk and define xK =1

AK

K983131

k=1

αkxk xK

best = argminxkkleK

f(xk)

Then for all K ge 1

f(xK)minus f(x983183) le983042x1 minus x98318398304222 +

K983131

k=1

α2kM

2

2AK

and

f(xKbest)minus f(x983183) le

983042x1 minus xlowast98304222 +K983131

k=1

α2kM

2

2AK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32

Whenever αk rarr 0 and

infin983131

k=1

αk = infin we have

K983131

k=1

α2k

983089 K983131

k=1

αk rarr 0

and sof(xK)minus f(x983183) rarr 0 as K rarr infin

Taking

αk =983042x1 minus x9831839830422

Mradick

yields

f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic

K

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32

Example robust regression in robust statistics

Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

where A =983045a1 middot middot middot am

983046T isin Rmtimesn and b isin Rm The subgradient

g =1

mATsign(Axminus b) =

1

m

m983131

i=1

aisign(〈aix〉 minus bi) isin partf(x)

The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32

f(xk)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32

f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32

3 Projected subgradient methods for constrained case

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = πC(xk minus αkg

k)

whereπC(x) = argmin

yisinC983042xminus y9830422

The update is equivalent to (why Exercise)

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 9: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Whenever αk rarr 0 and

infin983131

k=1

αk = infin we have

K983131

k=1

α2k

983089 K983131

k=1

αk rarr 0

and sof(xK)minus f(x983183) rarr 0 as K rarr infin

Taking

αk =983042x1 minus x9831839830422

Mradick

yields

f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic

K

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32

Example robust regression in robust statistics

Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

where A =983045a1 middot middot middot am

983046T isin Rmtimesn and b isin Rm The subgradient

g =1

mATsign(Axminus b) =

1

m

m983131

i=1

aisign(〈aix〉 minus bi) isin partf(x)

The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32

f(xk)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32

f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32

3 Projected subgradient methods for constrained case

For k = 1 2 choose any gk isin partf(xk) set

xk+1 = πC(xk minus αkg

k)

whereπC(x) = argmin

yisinC983042xminus y9830422

The update is equivalent to (why Exercise)

xk+1 = argminxisinC

983069f(xk) + 〈gkxminus xk〉+ 1

2αk983042xminus xk98304222

983070

It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

4.1 Stochastic subgradient

Definition: A stochastic subgradient oracle for the function $f$ consists of a triple $(g, \mathcal{S}, P)$, where $\mathcal{S}$ is a sample space, $P$ is a probability distribution, and $g: \mathbb{R}^n \times \mathcal{S} \mapsto \mathbb{R}^n$ is a mapping that, for each fixed $x \in \operatorname{dom} f$, satisfies
$$\mathbb{E}[g(x, S)] = \int g(x, s)\, dP(s) \in \partial f(x),$$
where $S \in \mathcal{S}$ is a random variable with distribution $P$.

With some abuse of notation, we will use $g$ or $g(x)$ as shorthand for the random vector $g(x, S)$ when this does not cause confusion.

Definition: Let $f: \mathbb{R}^n \mapsto \mathbb{R} \cup \{+\infty\}$ be a convex function and fix $x \in \operatorname{dom} f$. Then a random vector $g$ is a stochastic subgradient for $f$ at the point $x$ if $\mathbb{E}[g] \in \partial f(x)$, or
$$f(y) \ge f(x) + \langle \mathbb{E}[g], y - x \rangle \quad \text{for all } y.$$


Example: Given a collection of functions $F: \mathbb{R}^n \times \mathcal{S} \mapsto \mathbb{R}$, where $\mathcal{S}$ is a sample space and, for each $s \in \mathcal{S}$, the function $F(\cdot, s)$ is convex, the function
$$f(x) = \mathbb{E}[F(x, S)]$$
is convex when the expectation is taken over the random variable $S$, and taking
$$g(x, s) \in \partial F(x, s)$$
gives a stochastic subgradient with the property that
$$\mathbb{E}[g(x, S)] \in \partial f(x).$$


4.2 Stochastic programming

Consider the convex optimization problem
$$\min_{x \in C} \; f(x) = \mathbb{E}[F(x, S)],$$
where $C$ is a convex set, $S$ is a random variable on the space $\mathcal{S}$ with distribution $P$ (so the expectation $\mathbb{E}[F(x, S)]$ is taken according to $P$), and for each $s \in \mathcal{S}$ the function $x \mapsto F(x, s)$ is convex (therefore $f(x)$ is convex).

If $g(x, s) \in \partial_x F(x, s)$ and $S \sim P$, then $g = g(x, S)$ is a stochastic subgradient, because for all $y$
$$f(y) = \mathbb{E}[F(y, S)] \ge \mathbb{E}\left[F(x, S) + \langle g(x, S), y - x \rangle\right] = f(x) + \langle \mathbb{E}[g(x, S)], y - x \rangle.$$


Example (robust regression):
$$f(x) = \frac{1}{m}\|Ax - b\|_1 = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x \rangle - b_i|.$$
A natural stochastic subgradient is
$$g(x, i) = a_i \operatorname{sign}(\langle a_i, x \rangle - b_i),$$
where $i$ is drawn uniformly at random from $[m]$.

Advantage: we require only $O(n)$ time to compute $g(x, i)$ (as opposed to $O(mn)$ to compute $Ax - b$).

Generalization: given any problem with a large dataset $\{s_i\}_{i=1}^m$,
$$\min_x \; f(x) = \frac{1}{m}\sum_{i=1}^m F(x, s_i),$$
draw $i \in [m]$ uniformly at random and select $g \in \partial F(x, s_i)$.
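For the robust regression objective this single-sample subgradient is essentially one line of code; a sketch (the function name and RNG are ours):

```python
import numpy as np

def stoch_subgrad_robust_reg(x, A, b, rng):
    """One stochastic subgradient of f(x) = (1/m) * ||Ax - b||_1.

    Draws i uniformly from {0, ..., m-1} and returns a_i * sign(<a_i, x> - b_i);
    its expectation over i lies in partial f(x). Cost: O(n), since only row i is touched."""
    i = rng.integers(A.shape[0])
    return A[i] * np.sign(A[i] @ x - b[i])
```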


4.3 Projected stochastic subgradient methods

Sometimes computing a stochastic subgradient is much easier than computing a subgradient.

The expectation $\mathbb{E}[F(x, S)]$ is generally intractable to compute in many statistical and machine learning applications. Then it may be impossible to find a subgradient $g \in \partial f(x)$.

For $k = 1, 2, \ldots$, compute a stochastic subgradient $g^k$ at the point $x^k$, where
$$\mathbb{E}[g^k \mid x^k] \in \partial f(x^k),$$
and set
$$x^{k+1} = \pi_C(x^k - \alpha_k g^k).$$

This is essentially identical to the projected subgradient method, except that we replace the true subgradient with a stochastic subgradient.


Example (robust regression): We consider
$$\min_x \; f(x) = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x \rangle - b_i| \quad \text{s.t.} \quad \|x\|_2 \le R,$$
using the random sample
$$g = a_i \operatorname{sign}(\langle a_i, x \rangle - b_i)$$
as our stochastic subgradient. Set
$$A = \begin{bmatrix} a_1 & \cdots & a_m \end{bmatrix}^T, \quad a_i \sim \mathcal{N}(0, I_{n \times n}) \text{ (i.i.d.)},$$
and
$$b_i = \langle a_i, u \rangle + \varepsilon_i |\varepsilon_i|^3, \quad \varepsilon_i \sim \mathcal{N}(0, 1) \text{ (i.i.d.)},$$
where $u \sim \mathcal{N}(0, I_{n \times n})$. Set $n = 50$, $m = 100$, $R = 4$, and
$$\alpha_k = \frac{R}{M\sqrt{k}}, \qquad M^2 = \frac{1}{m}\|A\|_F^2.$$


Figure: $f(x^k) - f(x_\star)$ versus $k$.

Typical performance: the initial decrease is quite fast, but the method eventually stops making progress once it achieves some low accuracy (in this case $10^{-1}$). Each iteration costs $O(n)$, while each projected subgradient method iteration costs $O(mn)$.
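A sketch of this experiment under the setup described above; the random seed, iteration count, and monitoring are our own choices, not the lecture's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, R = 50, 100, 4.0

# Data as described: a_i ~ N(0, I), b_i = <a_i, u> + eps_i * |eps_i|^3 (heavy-tailed noise)
A = rng.standard_normal((m, n))
u = rng.standard_normal(n)
eps = rng.standard_normal(m)
b = A @ u + eps * np.abs(eps) ** 3

M = np.sqrt((A ** 2).sum() / m)              # M^2 = ||A||_F^2 / m
f = lambda x: np.abs(A @ x - b).mean()       # full objective, used only for monitoring

x = np.zeros(n)
for k in range(1, 5001):
    i = rng.integers(m)
    g = A[i] * np.sign(A[i] @ x - b[i])      # stochastic subgradient, O(n) per iteration
    x = x - (R / (M * np.sqrt(k))) * g       # alpha_k = R / (M sqrt(k))
    nrm = np.linalg.norm(x)
    if nrm > R:                              # projection onto {x : ||x||_2 <= R}
        x *= R / nrm
    if k % 500 == 0:
        print(k, f(x))                       # track f(x^k) to reproduce a plot like the one above
```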


Example (multiclass support vector machine):

In a general $m$-class classification problem, we represent the multiclass classifier using the matrix
$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \in \mathbb{R}^{n \times m}.$$
The predicted class for a data vector $a \in \mathbb{R}^n$ is then
$$\operatorname{argmax}_{l \in [m]} \langle a, x_l \rangle = \operatorname{argmax}_{l \in [m]} [X^T a]_l,$$
where $\langle a, x_l \rangle$ is the "score" associated with class $l$.


Given training examples as pairs
$$(a_i, b_i) \in \mathbb{R}^n \times \{1, \ldots, m\}, \quad i = 1, \ldots, N,$$
the multiclass classifier $X$ can be determined by
$$\min_X \; f(X) = \frac{1}{N}\sum_{i=1}^N F(X, (a_i, b_i)) \quad \text{s.t.} \quad \|X\|_F \le R,$$
where the multiclass hinge loss function is
$$F(X, (a, b)) = \max_{l \neq b} \left[1 + \langle a, x_l - x_b \rangle\right]_+,$$
with $[t]_+ = \max\{t, 0\}$ denoting the positive part.
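The loss and one of its subgradients with respect to $X$ are cheap to evaluate: if the hinge is active at the maximizing class $l^\star$, a subgradient places $+a$ in column $l^\star$ and $-a$ in column $b$; otherwise $0$ is a valid subgradient. A sketch (0-indexed classes; names are ours):

```python
import numpy as np

def multiclass_hinge_and_subgrad(X, a, b):
    """F(X,(a,b)) = max_{l != b} [1 + <a, x_l - x_b>]_+ and one subgradient w.r.t. X.

    X: (n, m) classifier matrix, a: (n,) feature vector, b: true class in {0, ..., m-1}."""
    s = X.T @ a                        # scores [X^T a]_l
    margins = 1.0 + s - s[b]
    margins[b] = -np.inf               # exclude l = b from the max
    l_star = int(np.argmax(margins))
    loss = max(margins[l_star], 0.0)
    G = np.zeros_like(X)
    if loss > 0.0:                     # hinge active: +a on column l_star, -a on column b
        G[:, l_star] += a
        G[:, b] -= a
    return loss, G                     # loss = F(X,(a,b)), G in partial_X F(X,(a,b))
```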


Set
$$\alpha_k = \frac{\alpha_1}{\sqrt{k}}, \qquad M^2 = \frac{1}{N}\sum_{i=1}^N \|a_i\|_2^2.$$

Stochastic subgradient method: draw $i \in [N]$ uniformly at random, then take
$$g^k \in \partial F(X^k, (a_i, b_i)).$$

Subgradient method:
$$g^k = \frac{1}{N}\sum_{i=1}^N g_i^k \in \partial f(X^k), \qquad g_i^k \in \partial F(X^k, (a_i, b_i)).$$


Figure: $f(X^k) - f(X_\star)$ versus "effective passes through $A$".


4.3.1 Assumptions and convergence analysis

The function $f$ is convex.

The set $C \subseteq \operatorname{int} \operatorname{dom} f$ is compact and convex, and
$$\|x - x_\star\|_2 \le R < \infty$$
for all $x \in C$, where $x_\star = \operatorname{argmin}_{x \in C} f(x)$ and $f(x_\star) > -\infty$.

There exists $M < \infty$ such that $\mathbb{E}[\|g(x, S)\|_2^2] \le M^2$ for all $x \in C$ and all $g$ satisfying $\mathbb{E}[g(x, S)] \in \partial f(x)$.

Theorem 6
Let $\alpha_k > 0$ be any non-increasing sequence of stepsizes and let the above assumptions hold. The stochastic projected subgradient iteration generates a sequence $x^k$ that satisfies, for all $K \ge 1$ and with $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$,
$$\mathbb{E}\left[f(\bar{x}^K) - f(x_\star)\right] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2.$$


Corollary 7
Let the conditions of Theorem 6 hold and let $\alpha_k = \frac{R}{M\sqrt{k}}$ for each $k$. Then for all $K \ge 1$,
$$\mathbb{E}[f(\bar{x}^K)] - f(x_\star) \le \frac{3RM}{2\sqrt{K}}.$$
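The constant $\tfrac{3}{2}$ comes from bounding the two terms of Theorem 6 separately (a sketch, again using $\sum_{k=1}^K k^{-1/2} \le 2\sqrt{K}$):
$$\frac{R^2}{2K\alpha_K} = \frac{R^2 M\sqrt{K}}{2KR} = \frac{RM}{2\sqrt{K}}, \qquad \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 = \frac{RM}{2K}\sum_{k=1}^K \frac{1}{\sqrt{k}} \le \frac{RM}{\sqrt{K}},$$
and adding the two bounds gives $\frac{3RM}{2\sqrt{K}}$.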

Corollary 8
Let $\alpha_k$ be non-summable but convergent to zero, that is,
$$\alpha_k \to 0, \qquad \sum_{k=1}^\infty \alpha_k = \infty.$$
Then $f(\bar{x}^K) - f(x_\star) \to 0$ (in probability) as $K \to \infty$; that is, for all $\epsilon > 0$ we have
$$\limsup_{K \to \infty} \mathbb{P}\left[f(\bar{x}^K) - f(x_\star) \ge \epsilon\right] = 0.$$


Theorem 9
Let the conditions of Theorem 6 hold and assume that $\|g\|_2 \le M$ for all stochastic subgradients $g$. Then for any $\epsilon > 0$,
$$f(\bar{x}^K) - f(x_\star) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}}\,\epsilon$$
with probability at least $1 - e^{-\frac{1}{2}\epsilon^2}$.

Let $\alpha_k = \frac{R}{M\sqrt{k}}$ and set $\delta = e^{-\frac{1}{2}\epsilon^2}$; we have
$$f(\bar{x}^K) - f(x_\star) \le \frac{3RM}{2\sqrt{K}} + \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}$$
with probability at least $1 - \delta$. That is, we have convergence of $O(MR/\sqrt{K})$ with high probability.
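The second display follows from the first under this stepsize choice: the first two terms were bounded by $\frac{3RM}{2\sqrt{K}}$ in Corollary 7, and solving $\delta = e^{-\epsilon^2/2}$ for $\epsilon$ gives
$$\epsilon = \sqrt{2\log(1/\delta)} = \sqrt{-2\log\delta}, \qquad \text{so} \qquad \frac{RM}{\sqrt{K}}\,\epsilon = \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}.$$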


Page 14: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Example LASSO or Compressed sensing applications

minx

983042Axminus b98304222 subject to 983042x9830421 le 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that $\|g\|_2 \le M$ for all stochastic subgradients $g$. Then for any $\epsilon > 0$,
\[ f(\bar{x}_K) - f(x_\star) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^{K}\alpha_k M^2 + \frac{RM}{\sqrt{K}}\,\epsilon \]
with probability at least $1 - e^{-\epsilon^2/2}$.

Let $\alpha_k = \frac{R}{M\sqrt{k}}$ and set $\delta = e^{-\epsilon^2/2}$; then
\[ f(\bar{x}_K) - f(x_\star) \le \frac{3RM}{2\sqrt{K}} + \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}} \]
with probability at least $1 - \delta$. That is, we have convergence of $O(MR/\sqrt{K})$ with high probability.

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 15: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin

(1) p = infin[πC(x)]j = min1maxxj minus1

that is we simply truncate the coordinates of x to be in the range[minus1 1]

(2) p = 2

πC(x) =

983069x if 983042x9830422 le 1x983042x9830422 otherwise

(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then

[πC(x)]j = sign(xj)[|xj |minus t]+

where t is the unique t ge 0 satisfying

n983131

j=1

[|xj |minus t]+ = 1

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 16: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Example Suppose that C is an affine set represented by

C = x isin Rn Ax = b

where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is

πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b

If we begin the iterates from a point xk isin C ie with Axk = bthen

xk+1 = πC(xk minus αkg

k) = xk minus αk(IminusAT(AAT)minus1A)gk

that is we simply project gk onto the nullspace of A and iterate

For more examples and proofs see FOMO sect64

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 17: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

31 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C

Theorem 4

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1

K983131

k=1

[f(xk)minus f(x983183)] leR2

2αK+

1

2

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 18: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Corollary 5

Let αk =αradick

and define xK =1

K

K983131

k=1

xk Then for all K ge 1

f(xK)minus f(x983183) leR2

2αradicK

+M2αradic

K

We see that convergence is guaranteed at the ldquobestrdquo rate1radicK

for

all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no

method can achieve a rate of convergence faster thanRMradicK

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 19: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

41 Stochastic subgradient

Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies

E[g(xS)] =983133

g(x s)dP(s) isin partf(x)

where S isin S is a random variable with distribution P

With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion

Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or

f(y) ge f(x) + 〈E[g]y minus x〉 for all y

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 20: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen

f(x) = E[F (xS)]

is convex when we take expectations over random variable S andtaking

g(x s) isin partF (x s)

gives a stochastic subgradient with the property that

E[g(xS)] isin partf(x)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 21: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

42 Stochastic programming

Consider the convex optimization problem

minxisinC

f(x) = E[F (xS)]

where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)

If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y

f(y) = E[F (yS)]

ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

431 Assumptions and convergence analysis

The function f is convex

The set C sube intdomf is compact and convex and

983042xminus x9831839830422 le R lt infin

for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin

There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)

Theorem 6

Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1

xK =1

K

K983131

k=1

xk E[f(xK)minus f(x983183)] leR2

2KαK+

1

2K

K983131

k=1

αkM2

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let αk =R

Mradick

for each k

Then for all K ge 1

E[f(xK)]minus f(x983183) le3RM

2radicK

Corollary 8

Let αk be non-summable but convergent to zero that is

αk rarr 0983131infin

k=1αk = infin

Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have

lim supKrarrinfin

P[f(xK)minus f(x983183) ge 983171] = 0

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0

f(xK)minus f(xlowast) le R2

2KαK+

1

2K

K983131

k=1

αkM2 +

RMradicK

983171

with probability at least 1minus eminus129831712

Let αk =R

Mradickand set δ = eminus

129831712 we have

f(xK)minus f(xlowast) le 3RM

2radicK

+MR

radicminus2 log δradicK

with probability at least 1minus δ That is we have convergence ofO(MR

radicK) with high probability

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32

Page 22: Lecture 10: Subgradient Methodsmath.xmu.edu.cn/group/nona/damc/Lecture10.pdf · 2.1 Assumptions and convergence analysis The function f is convex. ... Subgradient Methods DAMC Lecture

Example Robust regression

f(x) =1

m983042Axminus b9830421 =

1

m

m983131

i=1

|〈aix〉 minus bi|

A natural stochastic subgradient is

g(x i) = aisign(〈aix〉 minus bi)

where i is uniformly at random draw from [m]

Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)

Generalization Given any problem with large dataset simi=1

minx

f(x) =1

m

m983131

i=1

F (x si)

Drawing i isin [m] uniformly at random and selecting g isin partF (x si)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32

43 Projected stochastic subgradient methods

Sometimes computing stochastic subgradient is much easier thancomputing subgradient

The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)

For k = 1 2 compute a stochastic subgradient gk at the pointxk where

E[gk|xk] isin partf(xk)

Setxk+1 = πC(x

k minus αkgk)

This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32

Example (Robust regression) We consider

minx

f(x) =1

m

m983131

i=1

|〈aix〉 minus bi| st 983042x9830422 le R

using the random sample

g = aisign(〈aix〉 minus bi)

as our stochastic gradient Set

A =983045a1 middot middot middot am

983046T ai sim N (0 Intimesn) (iid)

andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)

where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and

α =R

Mradick M2 =

1

m983042A9830422F

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32

f(xk)minus f(x983183) versus k

Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32

Example (Multiclass support vector machine)

In a general m-class classification problem we represent themulticlass classifier using the matrix

X =983045x1 x2 middot middot middot xm

983046isin Rntimesm

The predicted class for a data vector a isin Rn is then

argmaxlisin[m]

〈axl〉 = argmaxlisin[m]

[XTa]l

where 〈axl〉 is the ldquoscorerdquo associated with class l

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32

Given training examples as pairs

(ai bi) isin Rn times 1 m i = 1 N

The multiclass classifier X can be determined by

minX

f(X) =1

N

N983131

i=1

F (X (ai bi)) st 983042X983042F le R

where the multiclass hinge loss function

F (X (a b)) = maxl ∕=b

[1 + 〈axl minus xb〉]+

with[t]+ = maxt 0

denotes the positive part

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32

Set

αk =α1radick M2 =

1

N

N983131

i=1

983042ai98304222

Stochastic subgradient method

Set i isin [N ] uniformly at random then take

gk isin partF (Xk (ai bi))

Subgradient method

gk =1

N

N983131

i=1

gki isin partf(Xk) gk

i isin partF(Xk (ai bi))

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32

f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32

4.3.1 Assumptions and convergence analysis

The function f is convex.

The set C ⊆ int dom f is compact and convex, and

    \|x - x_\star\|_2 \le R < \infty

for all x ∈ C, where x_\star = argmin_{x∈C} f(x) and f(x_\star) > -∞.

There exists M < ∞ such that E[‖g(x, S)‖_2^2] ≤ M^2 for all x ∈ C and all g satisfying E[g(x, S)] ∈ ∂f(x).

Theorem 6

Let α_k > 0 be any non-increasing sequence of stepsizes and let the above assumptions hold. The stochastic projected subgradient iteration generates a sequence {x^k} that, with the averaged iterate

    \bar{x}_K = \frac{1}{K}\sum_{k=1}^K x^k,

satisfies for all K ≥ 1

    E[f(\bar{x}_K) - f(x_\star)] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2.

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32

Corollary 7

Let the conditions of Theorem 6 hold and let \alpha_k = \frac{R}{M\sqrt{k}} for each k. Then for all K ≥ 1,

    E[f(\bar{x}_K)] - f(x_\star) \le \frac{3RM}{2\sqrt{K}}.
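The constant 3/2 in Corollary 7 follows by plugging \alpha_k = R/(M\sqrt{k}) into Theorem 6 and using \sum_{k=1}^K 1/\sqrt{k} \le 2\sqrt{K}:

    \frac{R^2}{2K\alpha_K} + \frac{M^2}{2K}\sum_{k=1}^K \alpha_k
      = \frac{RM}{2\sqrt{K}} + \frac{RM}{2K}\sum_{k=1}^K \frac{1}{\sqrt{k}}
      \le \frac{RM}{2\sqrt{K}} + \frac{RM}{2K}\cdot 2\sqrt{K}
      = \frac{3RM}{2\sqrt{K}}.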

Corollary 8

Let \alpha_k be non-summable but convergent to zero, that is,

    \alpha_k \to 0, \qquad \sum_{k=1}^{\infty} \alpha_k = \infty.

Then f(\bar{x}_K) - f(x_\star) \to 0 in probability as K → ∞; that is, for all ε > 0 we have

    \limsup_{K\to\infty} \, P\bigl[f(\bar{x}_K) - f(x_\star) \ge \varepsilon\bigr] = 0.

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32

Theorem 9

Let the conditions of Theorem 6 hold and assume that ‖g‖_2 ≤ M for all stochastic subgradients g. Then for any ε > 0,

    f(\bar{x}_K) - f(x_\star) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}}\,\varepsilon

with probability at least 1 - e^{-\varepsilon^2/2}.

Taking \alpha_k = \frac{R}{M\sqrt{k}} and setting \delta = e^{-\varepsilon^2/2}, we have

    f(\bar{x}_K) - f(x_\star) \le \frac{3RM}{2\sqrt{K}} + \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}

with probability at least 1 - δ. That is, we obtain convergence of order O(MR/\sqrt{K}) with high probability.
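The second bound follows from the first by the Corollary 7 computation for the first two terms, together with inverting the confidence level:

    \delta = e^{-\varepsilon^2/2} \iff \varepsilon = \sqrt{-2\log\delta},
    \qquad\text{so}\qquad
    \frac{RM}{\sqrt{K}}\,\varepsilon = \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}.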

Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
