18-660: Numerical Methods for Engineering Design and Optimization

TRANSCRIPT

Slide 1

18-660: Numerical Methods for Engineering Design and Optimization

Xin Li
Department of ECE
Carnegie Mellon University
Pittsburgh, PA 15213

Slide 2

Overview

Unconstrained optimization
- Gradient method
- Newton method

Slide 3

Unconstrained Optimization

Unconstrained optimization: minimizing a cost function without any constraint

Example: linear regression with regularization, where we solve A\alpha = B by

\min_{\alpha} \; \| A\alpha - B \|_2^2 + \lambda \cdot \| \alpha \|_2^2

Methods:
- Golden section search (non-derivative method)
- Downhill simplex method (non-derivative method)
- Gradient method (relies on derivatives; this lecture)
- Newton method (relies on derivatives; this lecture)

Slide 4

Gradient Method

If a cost function is smooth, its derivative information can be used to search for the optimal solution

\min_x f(x)

[Figure: a smooth one-dimensional cost function f(x), with iterates moving downhill toward the minimum]

Slide 5

Gradient Method

For illustration purposes, we start from the one-dimensional case

\min_x f(x)

\Delta x^{(k)} = x^{(k+1)} - x^{(k)} = -\lambda^{(k)} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}} \qquad \left( \lambda^{(k)} > 0 \right)

where \lambda^{(k)} is the step size and \left.\frac{df}{dx}\right|_{x^{(k)}} is the derivative. Equivalently,

x^{(k+1)} = x^{(k)} - \lambda^{(k)} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}} \qquad \left( \lambda^{(k)} > 0 \right)

Slide 6

Gradient Method

One-dimensional case (continued)

\Delta x^{(k)} = x^{(k+1)} - x^{(k)} = -\lambda^{(k)} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}}

f\left[ x^{(k+1)} \right] \approx f\left[ x^{(k)} \right] + \left.\frac{df}{dx}\right|_{x^{(k)}} \cdot \Delta x^{(k)}

f\left[ x^{(k+1)} \right] \approx f\left[ x^{(k)} \right] + \left.\frac{df}{dx}\right|_{x^{(k)}} \cdot \left( -\lambda^{(k)} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}} \right)

f\left[ x^{(k+1)} \right] \approx f\left[ x^{(k)} \right] - \lambda^{(k)} \cdot \left( \left.\frac{df}{dx}\right|_{x^{(k)}} \right)^2

Since \lambda^{(k)} > 0, the cost function f(x) keeps decreasing as long as the derivative is non-zero
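The one-dimensional update above can be run as a quick sanity check on a concrete convex cost. A minimal sketch, assuming the illustrative cost f(x) = (x - 3)^2 with derivative 2(x - 3); the fixed step size 0.1 is an arbitrary choice, not from the slides.

```python
# Gradient descent in one dimension: x_{k+1} = x_k - step * f'(x_k).
# Example cost (an assumption): f(x) = (x - 3)^2, so df/dx = 2(x - 3).
def gradient_descent_1d(df, x0, step=0.1, iters=100):
    """Iterate the fixed-step gradient update from x0."""
    x = x0
    for _ in range(iters):
        x = x - step * df(x)
    return x

f = lambda x: (x - 3.0) ** 2
df = lambda x: 2.0 * (x - 3.0)

x_star = gradient_descent_1d(df, x0=0.0)   # approaches the minimizer x = 3
```

Each iteration contracts the error by a constant factor here, which matches the slide's observation that f keeps decreasing while the derivative is non-zero.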

Slide 7

Gradient Method

One-dimensional case (continued)

\Delta x = -\lambda \cdot \frac{df}{dx} = 0

The derivative is zero at a local optimum, so the update step vanishes and the gradient method converges

[Figure: f(x) with zero slope at the local minimum]

Slide 8

Gradient Method

Two-dimensional case

\min_{x_1, x_2} f(x_1, x_2)

\nabla f(x_1, x_2) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix}

\begin{bmatrix} \Delta x_1^{(k)} \\ \Delta x_2^{(k)} \end{bmatrix} = \begin{bmatrix} x_1^{(k+1)} - x_1^{(k)} \\ x_2^{(k+1)} - x_2^{(k)} \end{bmatrix} = -\lambda^{(k)} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

\begin{bmatrix} x_1^{(k+1)} \\ x_2^{(k+1)} \end{bmatrix} = \begin{bmatrix} x_1^{(k)} \\ x_2^{(k)} \end{bmatrix} - \lambda^{(k)} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

Slide 9

Gradient Method

Two-dimensional case (continued)

\begin{bmatrix} \Delta x_1^{(k)} \\ \Delta x_2^{(k)} \end{bmatrix} = -\lambda^{(k)} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

f\left[ x_1^{(k+1)}, x_2^{(k+1)} \right] \approx f\left[ x_1^{(k)}, x_2^{(k)} \right] + \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]^T \cdot \begin{bmatrix} \Delta x_1^{(k)} \\ \Delta x_2^{(k)} \end{bmatrix}

f\left[ x_1^{(k+1)}, x_2^{(k+1)} \right] \approx f\left[ x_1^{(k)}, x_2^{(k)} \right] - \lambda^{(k)} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]^T \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

f\left[ x_1^{(k+1)}, x_2^{(k+1)} \right] \approx f\left[ x_1^{(k)}, x_2^{(k)} \right] - \lambda^{(k)} \cdot \left\| \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right] \right\|_2^2

Since \lambda^{(k)} > 0, the cost function f(x_1, x_2) keeps decreasing if the gradient is non-zero

Slide 10

Gradient Method

N-dimensional case

\min_X f(X)

\nabla f(X) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \\ \vdots \end{bmatrix}

X^{(k+1)} = X^{(k)} - \lambda^{(k)} \cdot \nabla f\left[ X^{(k)} \right]
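The N-dimensional update can be sketched directly with NumPy. The quadratic cost f(X) = ||X - c||^2 and its gradient 2(X - c) are illustrative assumptions, and `gradient_descent` is a hypothetical helper, not course code.

```python
import numpy as np

# N-dimensional gradient descent: X_{k+1} = X_k - step * grad_f(X_k).
# Example (an assumption): f(X) = ||X - c||^2, so grad_f(X) = 2*(X - c).
def gradient_descent(grad_f, x0, step=0.1, iters=200):
    """Fixed-step gradient descent from the starting vector x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad_f(x)
    return x

c = np.array([1.0, -2.0, 0.5])          # minimizer of the example cost
grad_f = lambda x: 2.0 * (x - c)

x_star = gradient_descent(grad_f, np.zeros(3))   # approaches c
```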

Slide 11

Newton Method

The gradient method relies on first-order derivative information only
- Each iteration is simple, but it converges to the optimum slowly, i.e., it requires a large number of iteration steps
- For faster convergence, the step size \lambda^{(k)} can be optimized by a one-dimensional search:

\min_{\lambda^{(k)}} \; f\left[ X^{(k)} - \lambda^{(k)} \cdot \nabla f\left[ X^{(k)} \right] \right]

Alternative algorithm: Newton method
- Relies on both first-order and second-order derivatives
- Each iteration is more expensive
- But it converges to the optimum more quickly, i.e., it requires a smaller number of iterations to reach convergence
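The one-dimensional search over \lambda^{(k)} can be carried out with the golden section search mentioned on Slide 3. A sketch under stated assumptions: the quadratic cost f(X) = X^T A X, the particular A and starting point, and the bracket [0, 1] for \lambda are all arbitrary example choices.

```python
import numpy as np

# Pick the step size by minimizing phi(t) = f(X - t * grad) over t
# with a golden section search (a non-derivative 1-D method).
def golden_section(phi, a, b, tol=1e-8):
    """Minimize a unimodal function phi on the bracket [a, b]."""
    g = (5 ** 0.5 - 1) / 2                # inverse golden ratio, ~0.618
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

A = np.diag([1.0, 10.0])                  # example quadratic cost f(X) = X^T A X
f = lambda x: float(x @ A @ x)
grad = lambda x: 2.0 * A @ x

x = np.array([5.0, 1.0])
g_vec = grad(x)
lam = golden_section(lambda t: f(x - t * g_vec), 0.0, 1.0)
x_next = x - lam * g_vec                  # one line-searched gradient step
```

This version re-evaluates phi at both interior points each pass for clarity; a production implementation would reuse one of the two evaluations per iteration.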

Slide 12

Newton Method

One-dimensional case

\min_x f(x)

Linearize the first-order derivative around x^{(k)}:

\left.\frac{df}{dx}\right|_{x^{(k+1)}} \approx \left.\frac{df}{dx}\right|_{x^{(k)}} + \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \cdot \left[ x^{(k+1)} - x^{(k)} \right]

The first-order derivative is zero at a local optimum, so we set the right-hand side to zero:

0 = \left.\frac{df}{dx}\right|_{x^{(k)}} + \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \cdot \left[ x^{(k+1)} - x^{(k)} \right]

Slide 13

Newton Method

One-dimensional case (continued)

\left.\frac{df}{dx}\right|_{x^{(k)}} + \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \cdot \left[ x^{(k+1)} - x^{(k)} \right] = 0

\Delta x^{(k)} = x^{(k+1)} - x^{(k)} = -\left[ \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \right]^{-1} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}}

x^{(k+1)} = x^{(k)} - \left[ \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \right]^{-1} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}}
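A minimal sketch of this iteration, assuming the illustrative convex cost f(x) = x^2 + e^x (so f'(x) = 2x + e^x and f''(x) = 2 + e^x > 0); `newton_1d` is a hypothetical helper, not from the slides.

```python
import math

# Newton's method in one dimension:
#   x_{k+1} = x_k - [f''(x_k)]^{-1} * f'(x_k)
def newton_1d(df, d2f, x0, iters=20):
    """Run the 1-D Newton iteration from x0."""
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# Example cost (an assumption): f(x) = x^2 + exp(x).
df = lambda x: 2.0 * x + math.exp(x)     # first derivative
d2f = lambda x: 2.0 + math.exp(x)        # second derivative, always positive

x_star = newton_1d(df, d2f, x0=0.0)      # first-order derivative ~ 0 here
```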

Slide 14

Newton Method

One-dimensional case (continued)

\Delta x^{(k)} = -\left[ \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \right]^{-1} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}}

f\left[ x^{(k+1)} \right] \approx f\left[ x^{(k)} \right] + \left.\frac{df}{dx}\right|_{x^{(k)}} \cdot \Delta x^{(k)}

f\left[ x^{(k+1)} \right] \approx f\left[ x^{(k)} \right] - \left[ \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \right]^{-1} \cdot \left( \left.\frac{df}{dx}\right|_{x^{(k)}} \right)^2

The second-order derivative is positive for a convex function, so the cost keeps decreasing while the first-order derivative is non-zero

Slide 15

Newton Method

One-dimensional case (continued)

The Newton method gives an estimate of the optimal step size \lambda^{(k)} using the second-order derivative

Gradient method:

x^{(k+1)} = x^{(k)} - \lambda^{(k)} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}}

Newton method:

x^{(k+1)} = x^{(k)} - \left[ \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \right]^{-1} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}}

i.e., the Newton method implicitly uses the step size

\lambda^{(k)} = \left[ \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \right]^{-1} > 0 \qquad \text{(convex function)}

Slide 16

Newton Method

One-dimensional case (continued)

The step size estimate is based on the following linear approximation of the first-order derivative:

\left.\frac{df}{dx}\right|_{x^{(k+1)}} \approx \left.\frac{df}{dx}\right|_{x^{(k)}} + \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \cdot \left[ x^{(k+1)} - x^{(k)} \right]

The approximation is exact if and only if f(x) is quadratic and, therefore, df/dx is linear:

f(x) = a x^2 + b x + c \;\Rightarrow\; \frac{df}{dx} = 2ax + b

Slide 17

Newton Method

One-dimensional case (continued)

If the actual f(x) is not quadratic, the following step size estimate may be non-optimal:

\lambda^{(k)} = \left[ \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \right]^{-1}

Using this step size may result in poor convergence. In such cases, a damping factor \beta is typically introduced:

x^{(k+1)} = x^{(k)} - \beta \cdot \left[ \left.\frac{d^2 f}{dx^2}\right|_{x^{(k)}} \right]^{-1} \cdot \left.\frac{df}{dx}\right|_{x^{(k)}} \qquad \left( 0 < \beta < 1 \right)
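A sketch of the damped update on the illustrative non-quadratic cost f(x) = sqrt(1 + x^2) (an assumption, not from the slides). For this f, the full Newton step maps x^{(k)} to -(x^{(k)})^3, which diverges whenever |x^{(k)}| > 1, while a damped step still converges to the minimizer x = 0.

```python
import math

# Damped Newton: x_{k+1} = x_k - beta * f'(x_k) / f''(x_k), 0 < beta < 1.
def damped_newton_1d(df, d2f, x0, beta=0.1, iters=300):
    """Run the damped Newton iteration from x0."""
    x = x0
    for _ in range(iters):
        x = x - beta * df(x) / d2f(x)
    return x

# Example cost (an assumption): f(x) = sqrt(1 + x^2).
df = lambda x: x / math.sqrt(1 + x * x)      # f'(x)
d2f = lambda x: (1 + x * x) ** -1.5          # f''(x), so f'/f'' = x*(1 + x^2)

x_star = damped_newton_1d(df, d2f, x0=1.5)   # full Newton diverges from 1.5
```

The damping value beta = 0.1 is an arbitrary choice; in practice it would be tuned or selected by a line search.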

Slide 18

Newton Method

Two-dimensional case

\min_{x_1, x_2} f(x_1, x_2)

\nabla f(x_1, x_2) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix} \qquad \nabla^2 f(x_1, x_2) = \begin{bmatrix} \partial^2 f / \partial x_1^2 & \partial^2 f / \partial x_1 \partial x_2 \\ \partial^2 f / \partial x_2 \partial x_1 & \partial^2 f / \partial x_2^2 \end{bmatrix}

where \nabla^2 f is the Hessian matrix. Linearizing the gradient around the current iterate:

\nabla f\left[ x_1^{(k+1)}, x_2^{(k+1)} \right] \approx \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right] + \nabla^2 f\left[ x_1^{(k)}, x_2^{(k)} \right] \cdot \begin{bmatrix} x_1^{(k+1)} - x_1^{(k)} \\ x_2^{(k+1)} - x_2^{(k)} \end{bmatrix}

Slide 19

Newton Method

Two-dimensional case (continued)

Set the linearized gradient to zero:

\nabla f\left[ x_1^{(k)}, x_2^{(k)} \right] + \nabla^2 f\left[ x_1^{(k)}, x_2^{(k)} \right] \cdot \begin{bmatrix} x_1^{(k+1)} - x_1^{(k)} \\ x_2^{(k+1)} - x_2^{(k)} \end{bmatrix} = 0

\begin{bmatrix} \Delta x_1^{(k)} \\ \Delta x_2^{(k)} \end{bmatrix} = \begin{bmatrix} x_1^{(k+1)} - x_1^{(k)} \\ x_2^{(k+1)} - x_2^{(k)} \end{bmatrix} = -\left[ \nabla^2 f\left[ x_1^{(k)}, x_2^{(k)} \right] \right]^{-1} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

\begin{bmatrix} x_1^{(k+1)} \\ x_2^{(k+1)} \end{bmatrix} = \begin{bmatrix} x_1^{(k)} \\ x_2^{(k)} \end{bmatrix} - \left[ \nabla^2 f\left[ x_1^{(k)}, x_2^{(k)} \right] \right]^{-1} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

Slide 20

Newton Method

Two-dimensional case (continued)

\begin{bmatrix} \Delta x_1^{(k)} \\ \Delta x_2^{(k)} \end{bmatrix} = -\left[ \nabla^2 f\left[ x_1^{(k)}, x_2^{(k)} \right] \right]^{-1} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

f\left[ x_1^{(k+1)}, x_2^{(k+1)} \right] \approx f\left[ x_1^{(k)}, x_2^{(k)} \right] + \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]^T \cdot \begin{bmatrix} \Delta x_1^{(k)} \\ \Delta x_2^{(k)} \end{bmatrix}

f\left[ x_1^{(k+1)}, x_2^{(k+1)} \right] \approx f\left[ x_1^{(k)}, x_2^{(k)} \right] - \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]^T \cdot \left[ \nabla^2 f\left[ x_1^{(k)}, x_2^{(k)} \right] \right]^{-1} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

The Hessian is positive definite for a convex function, so the subtracted quadratic form is positive and the cost keeps decreasing while the gradient is non-zero

Slide 21

Newton Method

Two-dimensional case (continued)

The gradient method and the Newton method do not, in general, move along the same direction

Gradient method:

\Delta X^{(k)} = -\lambda^{(k)} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

Newton method:

\Delta X^{(k)} = -\left[ \nabla^2 f\left[ x_1^{(k)}, x_2^{(k)} \right] \right]^{-1} \cdot \nabla f\left[ x_1^{(k)}, x_2^{(k)} \right]

[Figure: contour plot of the quadratic f(X) = X^T A X + B^T X + C in the (x_1, x_2) plane, showing the gradient and Newton search directions from the same point]

Slide 22

Newton Method

The Newton method can be extended to high-dimensional cases. The following Hessian matrix is N×N if we have N variables:

\nabla^2 f(X) = \begin{bmatrix} \partial^2 f / \partial x_1^2 & \partial^2 f / \partial x_1 \partial x_2 & \cdots & \partial^2 f / \partial x_1 \partial x_N \\ \partial^2 f / \partial x_2 \partial x_1 & \partial^2 f / \partial x_2^2 & \cdots & \partial^2 f / \partial x_2 \partial x_N \\ \vdots & \vdots & \ddots & \vdots \\ \partial^2 f / \partial x_N \partial x_1 & \partial^2 f / \partial x_N \partial x_2 & \cdots & \partial^2 f / \partial x_N^2 \end{bmatrix}

X^{(k+1)} = X^{(k)} - \left[ \nabla^2 f\left[ X^{(k)} \right] \right]^{-1} \cdot \nabla f\left[ X^{(k)} \right]

Numerically computing the Hessian matrix and applying its inverse (e.g., via Cholesky decomposition) can be quite expensive for large N
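A sketch of one N-dimensional Newton step that, following the slide's remark, applies the Hessian inverse via a Cholesky factorization rather than forming the inverse explicitly. The quadratic cost f(X) = 0.5 X^T A X - B^T X and the particular A and B are illustrative assumptions.

```python
import numpy as np

# One Newton step X_{k+1} = X_k - [H f(X_k)]^{-1} grad f(X_k), solving the
# Newton system H dx = g with a Cholesky factorization H = L L^T.
def newton_step(grad_f, hess_f, x):
    H = hess_f(x)
    g = grad_f(x)
    L = np.linalg.cholesky(H)        # requires H symmetric positive definite
    y = np.linalg.solve(L, g)        # forward substitution: L y = g
    dx = np.linalg.solve(L.T, y)     # back substitution: L^T dx = y
    return x - dx

# Example cost (an assumption): f(X) = 0.5 X^T A X - B^T X,
# so grad f(X) = A X - B and the Hessian is the constant matrix A.
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
B = np.array([1.0, 2.0])
grad_f = lambda x: A @ x - B
hess_f = lambda x: A

x0 = np.zeros(2)
x1 = newton_step(grad_f, hess_f, x0)     # quadratic cost: one step is exact
```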

Slide 23

Newton Method

A number of modified algorithms have been developed to address this complexity issue, e.g., the quasi-Newton method

The key idea is to approximate the Hessian matrix and its inverse so that:
- The computational cost is significantly reduced
- Fast convergence can still be achieved

More details can be found in: Numerical Recipes: The Art of Scientific Computing, 2007
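In one dimension, the quasi-Newton idea reduces to replacing the exact second derivative with a secant estimate built from successive first derivatives, so the second derivative is never computed; BFGS and related methods generalize this to N dimensions. The toy iteration below is a sketch of that idea only, not the algorithms from Numerical Recipes.

```python
# Secant (quasi-Newton) iteration in one dimension: approximate
#   f''(x_k) ~ [f'(x_k) - f'(x_{k-1})] / (x_k - x_{k-1})
# and use it in place of the exact second derivative in the Newton step.
def secant_newton_1d(df, x0, x1, iters=30):
    """Drive f' to zero using only first-derivative evaluations."""
    for _ in range(iters):
        d0, d1 = df(x0), df(x1)
        if d1 == d0:                 # secant slope undefined; stop
            break
        x0, x1 = x1, x1 - d1 * (x1 - x0) / (d1 - d0)
    return x1

# Example cost (an assumption): f(x) = x^2 + x, so f'(x) = 2x + 1
# and the minimizer is x = -0.5.
df = lambda x: 2.0 * x + 1.0
x_star = secant_newton_1d(df, 0.0, 1.0)
```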

Slide 24

Summary

Unconstrained optimization
- Gradient method
- Newton method