Lecture 8: Dimensionality Reduction (RLSC)
Source: wcms.inf.ed.ac.uk/ipab/rlsc/lecture-notes/RLSC_Lec8.pdf

Page 1:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 1

Lecture 8 – Dimensionality Reduction

Contents:

• Subset Selection & Shrinkage

• Ridge regression, Lasso

• PCA, PCR, PLS

• Comparison of Methods

Page 2:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 2

Data From Human Movement

Measure arm movement and full-body movement of humans and anthropomorphic robots

Perform local dimensionality analysis with a growing variational mixture of factor analyzers

Page 3:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 3

Dimensionality of Full Body Motion

About 8 dimensions of the space formed by joint positions, velocities, and accelerations are needed to represent an inverse dynamics model

[Figure: dimensionality analysis of full-body motion data, comparing Global PCA and Local FA. Plots show cumulative variance vs. number of retained dimensions (with a value of 32 marked) and probability vs. dimensionality.]

Page 4:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 4

Dimensionality Reduction

Goals:

• fewer dimensions for subsequent processing

• better numerical stability due to the removal of correlations

• simpler post-processing due to the "advanced statistical properties" of the pre-processed data

• don't lose important information; discard only redundant or irrelevant information

• for nonlinear problems, perform dimensionality reduction spatially localized (e.g., each RBF has its own local dimensionality reduction)

Page 5:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 5

Subset Selection & Shrinkage Methods

Subset Selection

Refers to methods that select a set of variables to be included and discard the other dimensions based on some optimality criterion, and then perform the regression on this reduced-dimensional input set. A discrete method – variables are either selected or discarded.

• Leaps and bounds procedure (Furnival & Wilson, 1974)

• Forward/backward stepwise selection

Shrinkage Methods

Refers to methods that reduce/shrink the redundant or irrelevant variables in a more continuous fashion.

• Ridge Regression

• Lasso

• Derived Input Direction Methods – PCR, PLS

Page 6:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 6

Ridge Regression

Ridge regression shrinks the coefficients by imposing a penalty on their size. It minimizes a penalized residual sum of squares:

\hat{w}^{ridge} = \arg\min_w \Big\{ \sum_{i=1}^{N} \big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \big)^2 + \lambda \sum_{j=1}^{M} w_j^2 \Big\}

where \lambda \ge 0 is the complexity parameter controlling the amount of shrinkage.

Equivalent representation:

\hat{w}^{ridge} = \arg\min_w \sum_{i=1}^{N} \big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} w_j^2 \le s

Page 7:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 7

Ridge Regression (cont’d)

Some notes:

• When there are many correlated variables in a regression problem, their coefficients can be poorly determined and exhibit high variance: e.g., a wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.

• The bias term is not included in the penalty.

• When the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimate

Matrix representation of the criterion and its solution:

J(w, \lambda) = (t - Xw)^T (t - Xw) + \lambda\, w^T w

\hat{w}^{ridge} = (X^T X + \lambda I)^{-1} X^T t
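To make the closed-form solution concrete, here is a minimal NumPy sketch (not from the lecture notes; the function name ridge_fit and the toy data are illustrative assumptions). It computes w = (X^T X + lambda*I)^{-1} X^T t and, consistent with the note that the bias term is not penalized, keeps w_0 out of the penalty by centering the data first.

```python
import numpy as np

def ridge_fit(X, t, lam=1.0):
    """Ridge regression: w = (X^T X + lam*I)^{-1} X^T t.

    The bias is left unpenalized by centering X and t first,
    then recovering w0 from the means.
    """
    X_mean, t_mean = X.mean(axis=0), t.mean()
    Xc, tc = X - X_mean, t - t_mean
    M = Xc.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(M), Xc.T @ tc)
    w0 = t_mean - X_mean @ w            # unpenalized bias term
    return w0, w

# toy usage: two nearly collinear inputs, where plain least squares is ill-conditioned
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=200)])
t = 2.0 * x1 + 0.1 * rng.normal(size=200)
print(ridge_fit(X, t, lam=1.0))
```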

Page 8:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 8

Ridge Regression (cont’d)

Ridge regression shrinks the directions with the least variance the most.

Effective degrees of freedom:

\mathrm{df}(\lambda) = \mathrm{tr}\big[ X (X^T X + \lambda I)^{-1} X^T \big] = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}

Note: df(0) = M if there is no regularization. [Page 63: Elem. Stat. Learning]

Shrinkage factor:

Each direction is shrunk by \frac{d_j^2}{d_j^2 + \lambda}, where d_j refers to the corresponding singular value of X (so d_j^2 is the eigenvalue of X^T X). [Page 62: Elem. Stat. Learning]
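Both quantities above follow directly from the singular values of the centered input matrix. A short illustrative sketch (not from the notes; names are assumptions):

```python
import numpy as np

def ridge_df_and_shrinkage(X, lam):
    """Effective degrees of freedom and per-direction shrinkage for ridge."""
    Xc = X - X.mean(axis=0)                  # centered inputs
    d = np.linalg.svd(Xc, compute_uv=False)  # singular values d_j
    shrink = d**2 / (d**2 + lam)             # shrinkage factor per principal direction
    return shrink.sum(), shrink              # df(lam) = sum of shrinkage factors

X = np.random.default_rng(1).normal(size=(100, 5))
for lam in [0.0, 1.0, 10.0, 100.0]:
    df, s = ridge_df_and_shrinkage(X, lam)
    print(f"lambda={lam:6.1f}  df={df:4.2f}  shrinkage={np.round(s, 2)}")
# lambda = 0 gives df = M (here 5); the low-variance directions are shrunk the most.
```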

Page 9:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 9

Lasso

Lasso is also a shrinkage method like ridge regression. It minimizes a penalized residual sum of squares :

\hat{w}^{lasso} = \arg\min_w \Big\{ \sum_{i=1}^{N} \big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \big)^2 + \lambda \sum_{j=1}^{M} |w_j| \Big\}

Equivalent representation:

\hat{w}^{lasso} = \arg\min_w \sum_{i=1}^{N} \big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} |w_j| \le s

The L2 ridge penalty is replaced by the L1 lasso penalty. This makes the solution non-linear in t.
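Because the L1 penalty removes the closed-form solution, the lasso is usually fit iteratively. Below is a minimal coordinate-descent sketch with soft-thresholding for the objective 0.5*sum(t_i - x_i^T w)^2 + lam*sum|w_j| on centered data; this is one standard solver choice, not the method given in the notes, and all names and data are illustrative.

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, t, lam, n_iter=200):
    """Coordinate descent for 0.5*||t - Xw||^2 + lam*||w||_1 (data assumed centered)."""
    N, M = X.shape
    w = np.zeros(M)
    col_sq = (X**2).sum(axis=0)                  # X_j^T X_j for each column
    for _ in range(n_iter):
        for j in range(M):
            r_j = t - X @ w + X[:, j] * w[j]     # residual excluding feature j
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
t = X[:, 0] * 3.0 - X[:, 3] * 2.0 + 0.1 * rng.normal(size=100)
print(np.round(lasso_cd(X, t, lam=5.0), 3))      # irrelevant coefficients are driven to exactly 0
```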

Page 10:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 10

Derived Input Direction Methods

These methods essentially involve transforming the inputs into some low-dimensional set of derived directions and using these directions to perform the regression.

Example Methods

• Principal Components Regression (PCR)

• Based on input variance only

• Partial Least Squares (PLS)

• Based on input-output correlation and output variance

Page 11:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 11

Principal Component Analysis

From earlier discussions:

PCA finds the eigenvectors of the correlation (covariance) matrix of the data.

Here, we show how PCA can be learned by bottleneck neural networks (auto-associators) and Hebbian learning frameworks.

Page 12:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 12

PCA Implementation

Hebbian learning obtains a measure of the familiarity of a new data point: the more familiar the data point, the larger the output.

[Figure: a single linear unit with inputs x_1, x_2, x_3, ..., x_d, weight vector w, and output y = w^T x]

Hebbian update: w_{n+1} = w_n + \alpha\, y\, x

Page 13:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 13

Fixing the Instability (Oja’s rule)

Renormalization:

w_{n+1} = \frac{w_n + \alpha\, x\, y}{\left\| w_n + \alpha\, x\, y \right\|}

Oja's Rule:

\Delta w = w_{n+1} - w_n = \alpha\, y\,(x - y\, w) = \alpha\,(y\, x - y^2 w)

Verify Oja's Rule (does it do the right thing??). With y = w^T x and C = E[x\, x^T]:

E[\Delta w] = \alpha\, E\big[ x\, x^T w - (w^T x\, x^T w)\, w \big] = \alpha\, \big( C w - (w^T C w)\, w \big)

After convergence, E[\Delta w] = 0:

C w = (w^T C w)\, w, thus w is an eigenvector of C

and multiplying from the left by w^T: w^T C w = (w^T C w)(w^T w), thus w^T w = 1.
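A minimal sketch of Oja's rule in NumPy (an illustration, not code from the notes; the data and the learning rate alpha = 1e-3 are assumptions). Repeatedly applying the update to mean-zero samples should drive w toward the unit-norm leading eigenvector of C, which the snippet checks against numpy.linalg.eigh.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 1.0]])
X = rng.normal(size=(5000, 2)) @ A.T        # mean-zero data with a dominant direction

w = rng.normal(size=2)
w /= np.linalg.norm(w)
alpha = 1e-3
for x in X:
    y = w @ x
    w += alpha * y * (x - y * w)            # Oja's rule

C = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(C)
v1 = eigvecs[:, -1]                         # leading eigenvector of C
print("w  =", np.round(w, 3), " |w| =", round(np.linalg.norm(w), 3))
print("v1 =", np.round(v1, 3), " (sign may differ)")
```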

Page 14:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 14

Application Examples of Oja’s Rule

Singular Value Decomposition (SVD): X = U D V^T

Oja's rule finds the direction (line) which minimizes the sum of the total squared distances from each point to its orthogonal projection onto the line.

Page 15:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 15

PCA : Batch vs Stochastic

PCA in batch update:
• just subtract the mean from the data
• calculate the eigenvectors to obtain the principal components (Matlab function "eig")

PCA in stochastic update:
• stochastically estimate the mean and subtract it from the input data
• use Oja's rule on the mean-subtracted data

Running mean estimate:

\bar{x}_{n+1} = \frac{n\, \bar{x}_n + x}{n+1} = \bar{x}_n + \frac{1}{n+1}\,(x - \bar{x}_n)
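A small sketch contrasting the two variants (illustrative only, using NumPy's eigh where the notes mention Matlab's eig; the toy data are assumptions): batch PCA via mean subtraction and eigendecomposition, and the stochastic running-mean estimate that would precede Oja's rule.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.diag([3.0, 1.0, 0.2])

# --- batch: subtract the mean, eigendecompose the covariance (cf. Matlab's "eig")
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
principal_components = eigvecs[:, ::-1]          # columns sorted by decreasing eigenvalue

# --- stochastic: running mean, x_bar_{n+1} = x_bar_n + (x - x_bar_n)/(n+1)
x_bar = np.zeros(3)
for n, x in enumerate(X):
    x_bar += (x - x_bar) / (n + 1)
print(np.allclose(x_bar, X.mean(axis=0)))        # True: matches the batch mean
# Oja's rule would then be run on (x - x_bar).
```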

Page 16:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 16

Autoencoder as motivation for Oja’s Rule

Note that Oja's rule looks like a supervised learning rule: the update looks like a reverse delta rule, since it depends on the difference between the actual input and the back-propagated output.

\Delta w = w_{n+1} - w_n = \alpha\, y\,(x - y\, w)

[Figure: a linear autoencoder: the input x is mapped through w to the feature y = w^T x and reconstructed as \hat{x} = y\, w]

Convert unsupervised learning into a supervised learning problem by trying to reconstruct the inputs from a few features!

Page 17:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 17

Learning in the Autoencoder

Minimize the cost:

J = \frac{1}{2}\,(x - \hat{x})^T (x - \hat{x}) = \frac{1}{2}\,(x - y\, w)^T (x - y\, w), \qquad \text{with } y = w^T x,\; \hat{x} = y\, w

\frac{\partial J}{\partial w} = -\Big( \frac{\partial (y\, w)}{\partial w} \Big)^T (x - y\, w) = -(x\, w^T + y\, I)\,(x - y\, w) = -\big( y\,(1 - w^T w)\, x + y\,(x - y\, w) \big)

For w close to unit norm (w^T w \approx 1), the first term vanishes, so

\frac{\partial J}{\partial w} \approx -\, y\,(x - y\, w), \qquad \Delta w = -\alpha\, \frac{\partial J}{\partial w} = \alpha\, y\,(x - y\, w)

This is Oja's rule!
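As a sanity check on the derivation (a sketch of ours, not part of the notes): for a unit-norm w the extra y(1 - w^T w)x term is exactly zero, so a numerical gradient of J should match -y(x - yw).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w = rng.normal(size=4)
w /= np.linalg.norm(w)                      # unit-norm weight vector

def J(w):
    y = w @ x
    r = x - y * w                           # reconstruction error x - x_hat
    return 0.5 * r @ r

eps = 1e-6
num_grad = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps)
                     for e in np.eye(4)])   # central-difference gradient of J at w

y = w @ x
oja_grad = -y * (x - y * w)                 # the -y(x - yw) term from the derivation
print(np.round(num_grad, 6))
print(np.round(oja_grad, 6))                # matches when w^T w = 1
```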

Page 18:

Lecture VIII: MLSC - Dr. Sethu Vijayakumar 18

Discussion about PCA

• If the data is noisy, we may represent the noise instead of the data. Way out: Factor Analysis (which also handles noise in the input dimensions)

• PCA has no data-generating model

• PCA has no probabilistic interpretation (not quite true!)

• PCA ignores the possible influence of subsequent (e.g., supervised) learning steps

• PCA is a linear method. Way out: nonlinear PCA

• PCA can converge very slowly. Way out: EM versions of PCA

But PCA is a very reliable method for dimensionality reduction if it is appropriate!

Page 19:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 19

PCA preprocessing for Supervised Learning

In Batch Learning Notation for a Linear Network:

PCA preprocessing:

\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad x_n \leftarrow x_n - \bar{x} \quad (\text{so that } X \text{ contains mean-zero data})

U = \big[ \text{eigenvectors of } \tfrac{1}{N} X^T X \text{ with the } \max(1\!:\!k) \text{ eigenvalues} \big]

Subsequent Linear Regression for Network Weights:

W = (U^T X^T X\, U)^{-1} U^T X^T Y

NOTE: Inversion of the above matrix is very cheap since it is diagonal! No numerical problems!

Problems of this pre-processing: important regression data might have been clipped!

Page 20:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 20

Principal Component Regression (PCR)

Build the matrix X and the target vector t:

X = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n)^T, \qquad t = (t_1, t_2, \ldots, t_n)^T

Eigen decomposition:

X^T X = V D^2 V^T

Data projection:

U = [\, v_1, v_2, \ldots, v_k \,] \quad (\text{eigenvectors with the } k \text{ largest eigenvalues})

Compute the linear model:

\hat{w}^{pcr} = U\, (U^T X^T X\, U)^{-1} U^T X^T t
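A compact sketch of the PCR recipe above (an illustration; pcr_fit and the toy data are assumptions, not code from the notes): eigendecompose X^T X, keep the k leading eigenvectors U, regress t on the projected data XU, and map the coefficients back with U. The matrix U^T X^T X U being inverted is diagonal, so the step is cheap and numerically benign.

```python
import numpy as np

def pcr_fit(X, t, k):
    """Principal component regression with k retained components (data assumed centered)."""
    eigvals, V = np.linalg.eigh(X.T @ X)         # X^T X = V D^2 V^T
    U = V[:, np.argsort(eigvals)[::-1][:k]]      # eigenvectors with the k largest eigenvalues
    G = U.T @ X.T @ X @ U                        # diagonal matrix of retained eigenvalues
    w = U @ np.linalg.solve(G, U.T @ X.T @ t)    # w_pcr = U (U^T X^T X U)^{-1} U^T X^T t
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X = X - X.mean(axis=0)
t = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
print(np.round(pcr_fit(X, t, k=3), 3))
```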

Page 21:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 21

PCA in Joint Data Space

A straightforward extension to take the supervised learning step into account:

• perform PCA in the joint data space

• extract the linear weight parameters from the PCA results

[Figure: data in the joint (x, y) space]

Page 22:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 22

PCA in Joint Data Space: Formalization

Form the joint data vectors and perform PCA on them:

z_n = \begin{bmatrix} x_n \\ y_n \end{bmatrix}, \qquad \bar{z} = \frac{1}{N} \sum_{n=1}^{N} z_n, \qquad z_n \leftarrow z_n - \bar{z} \quad (\text{so that } Z \text{ contains mean-zero joint data})

U = \big[ \text{eigenvectors of } \tfrac{1}{N} Z^T Z \text{ with the } \max(1\!:\!k) \text{ eigenvalues} \big]

Note: this new kind of linear network can actually tolerate noise in the input data! But only the same noise level in all (joint) dimensions!

The Network Weights become:

W = U_x\,(U_x^T U_x)^{-1} U_y^T = U_x\,(I - U_y^T U_y)^{-1} U_y^T, \qquad \text{where } U = \begin{bmatrix} U_x \\ U_y \end{bmatrix},\; U_x \in \mathbb{R}^{d \times k},\; U_y \in \mathbb{R}^{c \times k}

so that the prediction is \hat{y} = W^T x.
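To make the formalization concrete, here is a sketch (ours, and it relies on the weight formula as reconstructed above, so treat that as an assumption): PCA on the joint vectors [x; y], split U into its x- and y-blocks, form W, and check that y_hat = W^T x matches projecting x onto the principal subspace and reading off the y-part.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, k, N = 4, 2, 3, 500
W_true = rng.normal(size=(d, c))
X = rng.normal(size=(N, d))
Y = X @ W_true + 0.01 * rng.normal(size=(N, c))

Z = np.hstack([X, Y])
Z = Z - Z.mean(axis=0)                              # mean-zero joint data
eigvals, V = np.linalg.eigh(Z.T @ Z / N)
U = V[:, np.argsort(eigvals)[::-1][:k]]             # k leading eigenvectors of the joint covariance
Ux, Uy = U[:d, :], U[d:, :]                         # d x k and c x k blocks

W = Ux @ np.linalg.solve(np.eye(k) - Uy.T @ Uy, Uy.T)   # assumed weight formula (see above)
x_new = rng.normal(size=d)
y_hat = W.T @ x_new

# cross-check: least-squares projection of x_new onto the principal subspace, read off y-part
v = np.linalg.lstsq(Ux, x_new, rcond=None)[0]
print(np.allclose(y_hat, Uy @ v))                   # True: same prediction
```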

Page 23:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 23

Partial Least Squares (PLS)

Build the matrix X and the target vector t:

X = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n)^T, \qquad t = (t_1, t_2, \ldots, t_n)^T

Recursively compute the linear model:

Initialize: X_{res} = X, \; t_{res} = t

For i = 1 : r (\# of projections):

    u_i = X_{res}^T\, t_{res}   (projection direction, based on the input-output correlation)

    s = X_{res}\, u_i   (projected data)

    \beta_i = \dfrac{s^T t_{res}}{s^T s}, \qquad p_i = \dfrac{X_{res}^T s}{s^T s}   (univariate regressions)

    X_{res} \leftarrow X_{res} - s\, p_i^T, \qquad t_{res} \leftarrow t_{res} - \beta_i\, s   (residuals)

Partial Least Squares is a linear regression method that includes dimensionality reduction.
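A minimal NumPy rendering of the recursion above (a sketch of ours; the prediction-by-deflation of the query point is the standard PLS prediction procedure rather than something spelled out on the slide, and all names and data are illustrative).

```python
import numpy as np

def pls_fit_predict(X, t, X_query, r=2):
    """Univariate-output PLS: r projections, each followed by deflation of X and t."""
    X_res, t_res = X.copy(), t.copy()
    q_res = X_query.copy()
    t_hat = 0.0
    for _ in range(r):
        u = X_res.T @ t_res                 # projection direction (input-output correlation)
        s = X_res @ u                       # projected data
        beta = (s @ t_res) / (s @ s)        # univariate regression coefficient
        p = (X_res.T @ s) / (s @ s)         # regression of X_res on s
        t_hat += beta * (q_res @ u)         # accumulate prediction for the query point
        q_res = q_res - (q_res @ u) * p     # deflate the query the same way as X
        X_res = X_res - np.outer(s, p)      # input residuals
        t_res = t_res - beta * s            # target residuals
    return t_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)
t = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=200)
print(pls_fit_predict(X, t, X_query=np.array([1.0, 0.0, 0.0, 0.0, 0.0]), r=2))  # roughly 2.0
```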

Page 24:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 24

Comparing Dimensionality Red. Methods

[Figure: comparison of FA, PCR, PCA in joint data space (PCA_J), and PLS; panels show results using one, four, five ("correct"), and six projections.]

Page 25:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 25

Comparison of Methods (I)

Least complex model within one S.D. of the best.

Prostate Cancer Data Example (pg. 57, Elem. Stat. Learning)

Page 26:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 26

Comparison of Methods (II)

Least complex model within one S.D. of the best.

Prostate Cancer Data Example (pg. 57, Elem. Stat. Learning)

Page 27:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 27

Generalization of Shrinkage Methods

\hat{w}^{gen} = \arg\min_w \Big\{ \sum_{i=1}^{N} \big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \big)^2 + \lambda \sum_{j=1}^{M} |w_j|^q \Big\}, \qquad \text{for } q \ge 0

• q = 0 : variable subset selection

• q = 1 : lasso

• q = 2 : ridge regression

Contours of constant value of \sum_{j=1}^{M} |w_j|^q for given values of q.
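A tiny sketch (ours, with illustrative names) of the penalty family above: the q = 0 limit simply counts the nonzero coefficients (subset selection), q = 1 gives the lasso penalty, and q = 2 the ridge penalty.

```python
import numpy as np

def lq_penalty(w, q):
    """Generalized shrinkage penalty sum_j |w_j|^q; q = 0 counts nonzero coefficients."""
    w = np.asarray(w, dtype=float)
    if q == 0:
        return np.count_nonzero(w)          # subset-selection limit
    return np.sum(np.abs(w) ** q)

w = np.array([0.0, 0.5, -2.0, 1.0])
for q in [0, 0.5, 1, 2]:
    print(f"q={q}: penalty = {lq_penalty(w, q):.3f}")
# q=1 and q=2 reproduce the lasso and ridge penalty terms used earlier in the lecture.
```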

Page 28:

Lecture 8: RLSC - Prof. Sethu Vijayakumar 28

Error – Regularization Tradeoff

[Figure: error contours and constraint regions for the lasso (|w_1| + |w_2| \le t) and ridge regression (w_1^2 + w_2^2 \le t^2).]