
Least Squares

Nuno Vasconcelos (Ken Kreutz-Delgado)

UCSD

(Unweighted) Least Squares
• Assume linearity in the unknown, deterministic model parameters θ
• Scalar, additive noise model:
$$ y = f(x;\theta) + \varepsilon = \gamma^T(x)\,\theta + \varepsilon $$
– E.g., for a line (f affine in x),
$$ f(x;\theta) = \theta_1 x + \theta_0, \qquad \gamma(x) = \begin{bmatrix} 1 \\ x \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} $$
• This can be generalized to arbitrary functions of x:
$$ \gamma(x) = \begin{bmatrix} \gamma_0(x) \\ \vdots \\ \gamma_k(x) \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix} $$

Examples
• Components of γ(x) can be arbitrary nonlinear functions of x that are linear in θ:
– Line fitting (affine in x):
$$ f(x;\theta) = \theta_1 x + \theta_0, \qquad \gamma(x) = [\,1 \;\; x\,]^T $$
– Polynomial fitting (nonlinear in x):
$$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \gamma(x) = [\,1 \;\; x \;\; \cdots \;\; x^k\,]^T $$
– Truncated Fourier series (nonlinear in x):
$$ f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(ix), \qquad \gamma(x) = [\,1 \;\; \cos(x) \;\; \cdots \;\; \cos(kx)\,]^T $$

(Unweighted) Least Squares
• Loss = Euclidean norm of model error:
$$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2, \qquad \text{where} \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad \Gamma(x)\,\theta = \begin{bmatrix} \gamma^T(x_1)\,\theta \\ \vdots \\ \gamma^T(x_n)\,\theta \end{bmatrix} $$
• The loss function can also be rewritten as
$$ L(\theta) = \sum_{i=1}^{n} \big( y_i - \gamma^T(x_i)\,\theta \big)^2 $$
E.g., for the line,
$$ L(\theta) = \sum_i \big( y_i - \theta_0 - \theta_1 x_i \big)^2 $$
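To make the equivalence of the two forms concrete, here is a small sketch (the synthetic line data and the helper name design_matrix are assumptions made for this example):

```python
import numpy as np

def design_matrix(x, k):
    # Rows are gamma(x_i)^T = [1, x_i, ..., x_i^k] for a k-th order polynomial model.
    return np.vander(x, k + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
theta_true = np.array([1.0, -2.0])                 # a line: theta_0 + theta_1 x
y = theta_true[0] + theta_true[1] * x + 0.1 * rng.standard_normal(x.size)

G = design_matrix(x, 1)
theta = np.array([0.8, -1.5])                      # some candidate parameter value

loss_norm = np.linalg.norm(y - G @ theta) ** 2     # ||y - Gamma(x) theta||^2
loss_sum = np.sum((y - G @ theta) ** 2)            # sum_i (y_i - gamma(x_i)^T theta)^2
print(loss_norm, loss_sum)                         # identical up to round-off
```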

Examples
• The most important component is the Design Matrix Γ(x)
– Line fitting:
$$ f(x;\theta) = \theta_1 x + \theta_0, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} $$
– Polynomial fitting:
$$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} $$
– Truncated Fourier series:
$$ f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(ix), \qquad \Gamma(x) = \begin{bmatrix} 1 & \cos(x_1) & \cdots & \cos(k x_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \cos(x_n) & \cdots & \cos(k x_n) \end{bmatrix} $$
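A minimal sketch of how these design matrices could be assembled in NumPy (assuming x is a 1-D array of sample locations; the function names are mine):

```python
import numpy as np

def design_line(x):
    """Gamma(x) for line fitting: rows [1, x_i]."""
    return np.column_stack([np.ones_like(x), x])

def design_poly(x, k):
    """Gamma(x) for k-th order polynomial fitting: rows [1, x_i, ..., x_i^k]."""
    return np.vander(x, k + 1, increasing=True)

def design_fourier(x, k):
    """Gamma(x) for a truncated cosine series: rows [1, cos(x_i), ..., cos(k x_i)]."""
    return np.cos(np.outer(x, np.arange(k + 1)))

x = np.linspace(0.0, 1.0, 5)
print(design_line(x).shape, design_poly(x, 3).shape, design_fourier(x, 3).shape)
# (5, 2) (5, 4) (5, 4): one row gamma(x_i)^T per sample, one column per parameter.
```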

(Unweighted) Least Squares
• One way to minimize
$$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2 $$
is to find a value θ̂ such that
$$ \frac{\partial L}{\partial \theta}(\hat\theta) = -2\,\Gamma^T(x)\big[ y - \Gamma(x)\hat\theta \big] = 0 \qquad \text{and} \qquad \frac{\partial^2 L}{\partial \theta\,\partial \theta^T}(\hat\theta) = 2\,\Gamma^T(x)\Gamma(x) > 0 $$
– These conditions will hold when Γ(x) is one-to-one, yielding
$$ \hat\theta = \big[ \Gamma^T(x)\Gamma(x) \big]^{-1} \Gamma^T(x)\, y = \Gamma^+(x)\, y $$
• Thus, the least squares solution is determined by the Pseudoinverse of the Design Matrix Γ(x):
$$ \Gamma^+(x) = \big[ \Gamma^T(x)\Gamma(x) \big]^{-1} \Gamma^T(x) $$
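A sketch of the pseudoinverse solution on synthetic line data (the true parameter values and noise level below are arbitrary choices for this example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + 0.3 * rng.standard_normal(x.size)    # y = theta_0 + theta_1 x + noise

G = np.column_stack([np.ones_like(x), x])                 # design matrix Gamma(x)

# Pseudoinverse formula: theta_hat = (G^T G)^{-1} G^T y, valid when G has full column rank.
theta_hat = np.linalg.inv(G.T @ G) @ (G.T @ y)

# np.linalg.pinv computes the same Moore-Penrose pseudoinverse (via an SVD).
theta_pinv = np.linalg.pinv(G) @ y

print(theta_hat, theta_pinv)                              # both close to [2.0, 0.5]
```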

(Unweighted) Least Squares
• If the design matrix Γ(x) is one-to-one (has full column rank), the least squares solution is conceptually easy to compute. This will nearly always be the case in practice.
• Recall the example of fitting a line in the plane:
$$ \Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \Gamma^T(x)\Gamma(x) = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = n \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix} $$
$$ \Gamma^T(x)\, y = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = n \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
where the overbars denote sample averages, e.g. $\overline{xy} = \frac{1}{n}\sum_i x_i y_i$.

(Unweighted) Least Squares
• Pseudoinverse solution = LS solution:
$$ \hat\theta = \Gamma^+(x)\, y = \big[ \Gamma^T(x)\Gamma(x) \big]^{-1} \Gamma^T(x)\, y $$
• The LS line fit solution is
$$ \hat\theta = \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
which (naively) requires
– a 2x2 matrix inversion
– followed by a 2x1 vector post-multiplication
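Here is a minimal sketch of this closed-form 2x2 line fit from sample averages (the factor of n cancels between ΓᵀΓ and Γᵀy; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)
y = -1.0 + 3.0 * x + 0.5 * rng.standard_normal(x.size)

# Closed-form line fit from sample moments: theta = [[1, xbar], [xbar, x2bar]]^{-1} [ybar, xybar].
M = np.array([[1.0, x.mean()],
              [x.mean(), (x**2).mean()]])
b = np.array([y.mean(), (x * y).mean()])
theta_hat = np.linalg.solve(M, b)          # [theta_0, theta_1]

# Cross-check against the generic least squares routine.
G = np.column_stack([np.ones_like(x), x])
theta_lstsq, *_ = np.linalg.lstsq(G, y, rcond=None)
print(theta_hat, theta_lstsq)              # both close to [-1.0, 3.0]
```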

(Unweighted) Least Squares
• What about fitting a kth order polynomial?
$$ f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i, \qquad \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} $$
$$ \Gamma^T(x)\Gamma(x) = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \\ \vdots & \ddots & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} = n \begin{bmatrix} 1 & \overline{x} & \cdots & \overline{x^k} \\ \overline{x} & \overline{x^2} & \cdots & \overline{x^{k+1}} \\ \vdots & \vdots & \ddots & \vdots \\ \overline{x^k} & \overline{x^{k+1}} & \cdots & \overline{x^{2k}} \end{bmatrix} $$

(Unweighted) Least Squares
• and
$$ \Gamma^T(x)\, y = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \\ \vdots & \ddots & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = n \begin{bmatrix} \overline{y} \\ \overline{xy} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Thus, when Γ(x) is one-to-one (has full column rank):
$$ \hat\theta = \Gamma^+(x)\, y = \big[ \Gamma^T(x)\Gamma(x) \big]^{-1} \Gamma^T(x)\, y = \begin{bmatrix} 1 & \overline{x} & \cdots & \overline{x^k} \\ \vdots & \vdots & \ddots & \vdots \\ \overline{x^k} & \overline{x^{k+1}} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \overline{xy} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Mathematically, this is a very straightforward procedure.
• Numerically, this is generally NOT how the solution is computed (viz. the sophisticated algorithms in Matlab).
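To illustrate the numerical point, here is a sketch comparing the explicit normal-equation solve with NumPy's lstsq, which works on Γ directly with an SVD-based solver rather than forming ΓᵀΓ (the degree and data below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=300)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.standard_normal(x.size)

k = 3
G = np.vander(x, k + 1, increasing=True)      # columns 1, x, ..., x^k

# Textbook route: solve the normal equations (this squares the condition number of G).
theta_normal = np.linalg.solve(G.T @ G, G.T @ y)

# Preferred route in practice: a factorization-based solver applied to G directly.
theta_lstsq, *_ = np.linalg.lstsq(G, y, rcond=None)

print(theta_normal)
print(theta_lstsq)                            # nearly identical here, but lstsq is more stable
```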

(Unweighted) Least Squares
• Note the computational costs
– when (naively) fitting a line (1st order polynomial):
$$ \hat\theta = \begin{bmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \overline{xy} \end{bmatrix} $$
a 2x2 matrix inversion, followed by a 2x1 vector post-multiplication
– when (naively) fitting a general kth order polynomial:
$$ \hat\theta = \begin{bmatrix} 1 & \overline{x} & \cdots & \overline{x^k} \\ \vdots & \vdots & \ddots & \vdots \\ \overline{x^k} & \overline{x^{k+1}} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1} \begin{bmatrix} \overline{y} \\ \overline{xy} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
a (k+1)x(k+1) matrix inversion, followed by a (k+1)x1 multiplication
• It is evident that the computational complexity is an increasing function of the number of parameters. More efficient (and numerically stable) algorithms are used in practice, but the complexity scaling with the number of parameters still remains true.

(Unweighted) Least Squares
• Suppose the dataset {xᵢ} is constant in value over repeated, non-constant measurements of the dataset {yᵢ}.
– e.g., consider some process (e.g., temperature) where the measurements {yᵢ} are taken at the same locations {xᵢ} every day
• Then the design matrix Γ(x) is constant in value!
– In advance (off-line) compute:
$$ A = \begin{bmatrix} 1 & \overline{x} & \cdots & \overline{x^k} \\ \vdots & \vdots & \ddots & \vdots \\ \overline{x^k} & \overline{x^{k+1}} & \cdots & \overline{x^{2k}} \end{bmatrix}^{-1} $$
– Every day (on-line) re-compute:
$$ \hat\theta = A \begin{bmatrix} \overline{y} \\ \overline{xy} \\ \vdots \\ \overline{x^k y} \end{bmatrix} $$
• Hence, least squares sometimes can be implemented very efficiently. There are also efficient recursive updates, e.g. the Kalman filter.
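A sketch of this off-line/on-line split, where the pseudoinverse of the fixed design matrix is computed once and then reused on each new day's measurement vector (the variable names and data are mine):

```python
import numpy as np

# Fixed measurement locations (the same every day), k-th order polynomial model.
x = np.linspace(0.0, 24.0, 50)
k = 2
G = np.vander(x, k + 1, increasing=True)

# Off-line: compute the pseudoinverse of the constant design matrix once.
G_pinv = np.linalg.pinv(G)                  # shape (k+1, n)

rng = np.random.default_rng(4)
for day in range(3):
    # On-line: only a cheap matrix-vector product per new measurement vector y.
    y = 10.0 + 0.5 * x - 0.01 * x**2 + rng.standard_normal(x.size)
    theta_hat = G_pinv @ y
    print(f"day {day}: theta_hat = {theta_hat}")
```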

Geometric Solution & Interpretation
• Alternatively, we can derive the least squares solution using geometric (Hilbert Space) considerations.
• Goal: minimize the size (norm) of the model prediction error (aka residual error), e(θ) = y − Γ(x)θ:
$$ L(\theta) = \| e(\theta) \|^2 = \| y - \Gamma(x)\,\theta \|^2 $$
• Note that given a known design matrix Γ(x), the vector
$$ \Gamma(x)\,\theta = \begin{bmatrix} | & & | \\ \Gamma_1 & \cdots & \Gamma_k \\ | & & | \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_k \end{bmatrix} = \sum_{i=1}^{k} \theta_i\, \Gamma_i $$
is a linear combination of the column vectors Γᵢ.
• I.e. Γ(x)θ is in the range space (column space) of Γ(x).

(Hilbert Space) Geometric Interpretation
• The vector Γθ lives in the range (column space) of Γ = Γ(x) = [Γ₁, …, Γₖ]:
[Figure: the vector y, the columns Γ₁, Γ₂, …, Γₖ spanning the range of Γ, the projection Γθ̂, and the residual y − Γθ̂.]
• Assume that y is as shown. Equivalent statements are:
– Γθ̂ is the value of Γθ closest to y in the range of Γ.
– Γθ̂ is the orthogonal projection of y onto the range of Γ.
– The residual e = y − Γθ̂ is ⊥ to the range of Γ.

Geometric Interpretation of LS
• e = y − Γθ̂ is ⊥ to the range of Γ iff
$$ \big( y - \Gamma\hat\theta \big) \perp \Gamma_1, \quad \ldots, \quad \big( y - \Gamma\hat\theta \big) \perp \Gamma_k $$
[Figure: same geometry as on the previous slide.]
• Thus e = y − Γθ̂ is in the nullspace of Γᵀ:
$$ \begin{bmatrix} \Gamma_1^T \big( y - \Gamma\hat\theta \big) \\ \vdots \\ \Gamma_k^T \big( y - \Gamma\hat\theta \big) \end{bmatrix} = 0 \quad\Longleftrightarrow\quad \Gamma^T \big( y - \Gamma\hat\theta \big) = 0 $$

Geometric Interpretation of LS
• Note: Nullspace Condition ≡ Normal Equation:
$$ \Gamma^T \big( y - \Gamma\hat\theta \big) = 0 \quad\Longleftrightarrow\quad \Gamma^T\Gamma\,\hat\theta = \Gamma^T y $$
• Thus, if Γ is one-to-one (has full column rank) we again get the pseudoinverse solution:
$$ \hat\theta = \Gamma^+(x)\, y = \big[ \Gamma^T(x)\Gamma(x) \big]^{-1} \Gamma^T(x)\, y $$
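The orthogonality of the residual to the columns of Γ is easy to verify numerically; a small sketch with an arbitrary random design matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
G = rng.standard_normal((30, 4))        # a full-column-rank design matrix Gamma
y = rng.standard_normal(30)

theta_hat, *_ = np.linalg.lstsq(G, y, rcond=None)
residual = y - G @ theta_hat

# Normal equation / nullspace condition: Gamma^T (y - Gamma theta_hat) = 0.
print(G.T @ residual)                   # all entries are numerically zero (~1e-14)

# Equivalently, the residual is orthogonal to every column Gamma_i.
print([float(G[:, i] @ residual) for i in range(G.shape[1])])
```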

Probabilistic Interpretation
• We have seen that estimating a parameter by minimizing the least squares loss function is a special case of MLE
• This interpretation holds for many loss functions
– First, note that
$$ \hat\theta = \arg\min_{\theta \in \Theta} L\big[ y, f(x;\theta) \big] = \arg\max_{\theta \in \Theta} e^{-L[\, y, f(x;\theta) \,]} $$
– Now note that, because
$$ e^{-L[\, y, f(x;\theta) \,]} \ge 0, \qquad \forall y, \; \forall x, $$
we can make this exponential function into a Y|X pdf by an appropriate normalization.

Prob. Interpretation of Loss Minimization
• I.e. by defining
$$ P_{Y|X}(y \mid x;\theta) = \frac{1}{\int e^{-L[\, y, f(x;\theta) \,]}\, dy}\; e^{-L[\, y, f(x;\theta) \,]} = \alpha(x;\theta)\, e^{-L[\, y, f(x;\theta) \,]} $$
• If the normalization constant α(x;θ) does not depend on θ, i.e. α(x;θ) = α(x), then
$$ \hat\theta = \arg\max_{\theta \in \Theta} e^{-L[\, y, f(x;\theta) \,]} = \arg\max_{\theta \in \Theta} P_{Y|X}(y \mid x;\theta) $$
which makes the problem a special case of MLE.

Prob. Interpretation of Loss Minimization
• Note that for loss functions of the form
$$ L(\theta) = g\big[\, y - f(x;\theta) \,\big] $$
the model f(x;θ) only changes the mean of Y
– A shift in mean does not change the shape of a pdf, and therefore cannot change the value of the normalization constant
– Hence, for loss functions of the type above, it is always true that
$$ \hat\theta = \arg\max_{\theta \in \Theta} e^{-L[\, y, f(x;\theta) \,]} = \arg\max_{\theta \in \Theta} P_{Y|X}(y \mid x;\theta) $$

Regression
• This leads to the interpretation that we saw last time
• Which is the usual definition of a regression problem
– two random variables X and Y
– a dataset of examples D = {(x₁,y₁), …, (xₙ,yₙ)}
– a parametric model of the form
$$ y = f(x;\theta) + \varepsilon $$
– where θ is a parameter vector, and ε a random variable that accounts for noise
• the pdf of the noise determines the loss in the other formulation

Regression
• Error pdf and the equivalent loss function:
– Gaussian ↔ L2 distance:
$$ P_\varepsilon(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\varepsilon^2}{2\sigma^2}}, \qquad L(x,y) = (y - x)^2 $$
– Laplacian ↔ L1 distance:
$$ P_\varepsilon(\varepsilon) = \frac{1}{2\sigma}\, e^{-\frac{|\varepsilon|}{\sigma}}, \qquad L(x,y) = |\, y - x \,| $$
– Rayleigh ↔ "Rayleigh distance":
$$ P_\varepsilon(\varepsilon) = \frac{\varepsilon}{\sigma^2}\, e^{-\frac{\varepsilon^2}{2\sigma^2}}, \qquad L(x,y) = \frac{(y - x)^2}{2\sigma^2} - \log(y - x) $$
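To make the first correspondence explicit, here is the Gaussian case worked out as a short derivation (a standard calculation, stated here in the slides' notation):
$$ P_{Y|X}(y \mid x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{\big( y - f(x;\theta) \big)^2}{2\sigma^2} \right) $$
$$ -\log P_{Y|X}(y \mid x;\theta) = \frac{\big( y - f(x;\theta) \big)^2}{2\sigma^2} + \tfrac{1}{2}\log\big( 2\pi\sigma^2 \big) $$
Since the additive constant does not depend on θ,
$$ \arg\max_{\theta} P_{Y|X}(y \mid x;\theta) = \arg\min_{\theta} \big( y - f(x;\theta) \big)^2, $$
i.e. Gaussian errors correspond exactly to the L2 (least squares) loss; the Laplacian and Rayleigh rows follow from the same negative log-likelihood computation.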

Regression
• We know how to solve the problem with losses
– Why would we want the added complexity incurred by introducing error probability models?
• The main reason is that this allows a data-driven definition of the loss
• One good way to see this is the problem of weighted least squares
– Suppose that you know that not all measurements (xᵢ, yᵢ) have the same importance
– This can be encoded in the loss function
– Remember that the unweighted loss is
$$ L(\theta) = \| y - \Gamma(x)\,\theta \|^2 = \sum_i \big( y_i - [\Gamma(x)\,\theta]_i \big)^2 $$

Regression
• To weigh different points differently we could use
$$ L(\theta) = \sum_i w_i \big( y_i - [\Gamma(x)\,\theta]_i \big)^2, \qquad w_i > 0, $$
or even the more generic form
$$ L(\theta) = \big( y - \Gamma(x)\,\theta \big)^T W \big( y - \Gamma(x)\,\theta \big), \qquad W > 0 $$
• In the latter case the solution is (homework; a numerical sketch follows below)
$$ \theta^* = \big[ \Gamma^T(x)\, W\, \Gamma(x) \big]^{-1} \Gamma^T(x)\, W\, y $$
• The question is "how do I know these weights?"
• Without a probabilistic model one has little guidance on this.
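A minimal numerical sketch of the weighted solution, here with a diagonal W (the per-point noise levels are invented for this example):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 1, size=n)
sigma = rng.uniform(0.05, 0.5, size=n)        # per-point noise standard deviations
y = 1.0 + 2.0 * x + sigma * rng.standard_normal(n)

G = np.column_stack([np.ones_like(x), x])     # design matrix Gamma(x)
W = np.diag(1.0 / sigma**2)                   # weights: inverse variances

# Weighted least squares: theta* = (Gamma^T W Gamma)^{-1} Gamma^T W y.
theta_w = np.linalg.solve(G.T @ W @ G, G.T @ W @ y)

# Ordinary (unweighted) least squares for comparison.
theta_ols, *_ = np.linalg.lstsq(G, y, rcond=None)
print(theta_w, theta_ols)                     # both near [1.0, 2.0]; WLS downweights noisy points
```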

Regression
• The probabilistic equivalent
$$ \theta^* = \arg\max_\theta e^{-L[\, y, f(x;\theta) \,]} = \arg\max_\theta \exp\left\{ -\tfrac{1}{2} \big( y - \Gamma(x)\,\theta \big)^T W \big( y - \Gamma(x)\,\theta \big) \right\} $$
is the MLE for a Gaussian pdf of known covariance Σ = W⁻¹.
• In the case where the covariance (and hence W) is diagonal we have
$$ L(\theta) = \sum_i \frac{1}{\sigma_i^2} \big( y_i - [\Gamma(x)\,\theta]_i \big)^2 $$
• In this case, each point is weighted by the inverse variance.

Regression
• This makes sense
– Under the probabilistic formulation, σᵢ² is the variance of the error εᵢ associated with the ith observation,
$$ y_i = f(x_i;\theta) + \varepsilon_i $$
– This means that it is a measure of the uncertainty of the observation
• When W = Σ⁻¹ we are weighting each point by the inverse of its uncertainty (variance)
• We can also check the goodness of this weighting matrix by plotting the histogram of the errors
• If W is chosen correctly, the errors should be Gaussian

Model Validation
• In fact, by analyzing the errors of the fit we can say a lot
• This is called model validation
– Leave some data on the side, and run it through the predictor
– Analyze the errors to see if there is deviation from the assumed model (Gaussianity for least squares)
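A sketch of this validation loop: hold out part of the data, fit on the rest, and inspect the held-out residuals for departures from Gaussianity (the summary statistics below are a crude stand-in for a histogram or a formal normality test):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=500)
y = 2.0 + 3.0 * x + 0.2 * rng.standard_normal(x.size)

# Leave some data on the side for validation.
idx = rng.permutation(x.size)
train, test = idx[:400], idx[400:]

G = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(G[train], y[train], rcond=None)

# Residuals (prediction errors) on the held-out data.
e = y[test] - G[test] @ theta_hat

# Crude Gaussianity check: skewness near 0 and kurtosis near 3 for standardized residuals.
z = (e - e.mean()) / e.std()
print("mean", e.mean(), "std", e.std())
print("skewness", np.mean(z**3), "kurtosis", np.mean(z**4))
```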

Model Validation & Improvement
• Many times this will give you hints to alternative models that may fit the data better
• Typical problems for least squares (and solutions)

Model Validation & Improvement
• Example 1 error histogram
• this does not look Gaussian
• look at the scatter plot of the error (y − f(x;θ*))
– increasing trend, maybe we should try
$$ \log(y_i) = f(x_i;\theta) + \varepsilon_i $$
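A sketch of this fix: refit the same linear-in-θ model to log(y) instead of y (the synthetic data below are constructed, as an assumption for this example, so that the noise is multiplicative and the log transform is the right one):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 2, size=300)
# Multiplicative noise: y = exp(a + b x) * exp(noise), so log(y) is linear in x plus additive noise.
y = np.exp(0.5 + 1.2 * x + 0.1 * rng.standard_normal(x.size))

G = np.column_stack([np.ones_like(x), x])

# Original model: fit y directly; its residuals grow with x here.
theta_raw, *_ = np.linalg.lstsq(G, y, rcond=None)

# Transformed model: log(y_i) = f(x_i; theta) + eps_i, i.e. regress log(y) on the same Gamma(x).
theta_log, *_ = np.linalg.lstsq(G, np.log(y), rcond=None)
print(theta_log)        # close to [0.5, 1.2]; the residuals of this fit look Gaussian
```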

Model Validation & Improvement
• Example 1 error histogram for the new model
• this looks Gaussian
– this model is probably better
– there are statistical tests that you can use to check this objectively
– these are covered in statistics classes

Model Validation & Improvement
• Example 2 error histogram
• This also does not look Gaussian
• Checking the scatter plot now seems to suggest trying another transformation of y, again with a model of the form
$$ y_i = f(x_i;\theta) + \varepsilon_i $$

Model Validation & Improvement
• Example 2 error histogram for the new model
• Once again, this seems to work
• The residual behavior looks Gaussian
– However, it is NOT always the case that such changes will work. If not, maybe the problem is the assumption of Gaussianity itself: move away from least squares, and try MLE with other error pdfs.

END
