EE269: Signal Processing for Machine Learning
Lecture 8
Instructor : Mert Pilanci
Stanford University
February 4, 2019
Recap: Linear and Quadratic Discriminant Analysis

▶ Suppose x = [x_1, ..., x_N] ∼ N(µ_k, Σ) when y = k

    g_k(x) = P(x | y = k) = (1 / ((2π)^{N/2} |Σ|^{1/2})) e^{−(1/2)(x − µ_k)^T Σ^{−1}(x − µ_k)}

▶ K classes:

    f(x) = arg max_{k=1,...,K} π_k g_k(x)
Scaled identity covariances Σ_k = σ²I

▶ Suppose x = [x_1, ..., x_N] ∼ N(µ_k, Σ_k) when y = k

    g_k(x) = P(x | y = k) = (1 / ((2π)^{N/2} |Σ_k|^{1/2})) e^{−(1/2)(x − µ_k)^T Σ_k^{−1}(x − µ_k)}

▶ K classes
▶ Decision boundary: hyperplane

    w^T(x − x_0) = 0
    w = µ_i − µ_j
    x_0 = (1/2)(µ_i + µ_j) − (σ² / ||µ_i − µ_j||²) log(π_i/π_j) · (µ_i − µ_j)

▶ The hyperplane passes through the point x_0 and is orthogonal to w
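As a sanity check, the boundary above can be computed directly; a minimal numpy sketch with hypothetical means, variance σ², and priors:

```python
import numpy as np

# Hypothetical two-class problem with shared covariance sigma^2 * I.
mu_i = np.array([1.0, 2.0])
mu_j = np.array([-1.0, 0.0])
sigma2 = 0.5
pi_i, pi_j = 0.6, 0.4

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.sum(w**2) * np.log(pi_i / pi_j) * w

def side(x):
    # Positive -> class i, negative -> class j.
    return np.sign(w @ (x - x0))
```

Note how the log-prior term shifts x_0 away from the midpoint of the means, toward the less likely class.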
Identical covariances Σ_k = Σ

▶ Suppose x = [x_1, ..., x_N] ∼ N(µ_k, Σ) when y = k

    g_k(x) = P(x | y = k) = (1 / ((2π)^{N/2} |Σ|^{1/2})) e^{−(1/2)(x − µ_k)^T Σ^{−1}(x − µ_k)}

▶ Decision boundary: hyperplane

    w^T(x − x_0) = 0
    w = Σ^{−1}(µ_i − µ_j)
    x_0 = (1/2)(µ_i + µ_j) − [log(π_i/π_j) / ((µ_i − µ_j)^T Σ^{−1}(µ_i − µ_j))] · (µ_i − µ_j)

▶ The hyperplane passes through x_0 but is not necessarily orthogonal to the line between the means
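With a shared but non-identity Σ, w = Σ^{−1}(µ_i − µ_j) is generally not parallel to µ_i − µ_j; a small numpy sketch (hypothetical parameters) illustrating this:

```python
import numpy as np

mu_i = np.array([1.0, 2.0])
mu_j = np.array([-1.0, 0.0])
Sigma = np.array([[2.0, 0.9], [0.9, 1.0]])   # shared, non-identity covariance
pi_i = pi_j = 0.5

d = mu_i - mu_j
w = np.linalg.solve(Sigma, d)                # Sigma^{-1}(mu_i - mu_j)
x0 = 0.5 * (mu_i + mu_j) - (np.log(pi_i / pi_j) / (d @ w)) * d

# Cosine between w and (mu_i - mu_j): strictly below 1, so the hyperplane
# is tilted relative to the line between the means.
cos = (w @ d) / (np.linalg.norm(w) * np.linalg.norm(d))
```

With equal priors the log term vanishes and x_0 is exactly the midpoint of the means, even though w is tilted.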
Quadratic Discriminant Analysis: Σ_k arbitrary

▶ Suppose x = [x_1, ..., x_N] ∼ N(µ_k, Σ_k) when y = k

    g_k(x) = P(x | y = k) = (1 / ((2π)^{N/2} |Σ_k|^{1/2})) e^{−(1/2)(x − µ_k)^T Σ_k^{−1}(x − µ_k)}

▶ h_k(x) = x^T W_k x + w_k^T x + w_{k0}
▶ Classify as class k if h_k(x) > h_{k′}(x) for all k′ ≠ k

    W_k = −(1/2) Σ_k^{−1}
    w_k = Σ_k^{−1} µ_k
    w_{k0} = −(1/2) µ_k^T Σ_k^{−1} µ_k − (1/2) log |Σ_k| + log π_k

Quadratic decision regions: hyperquadrics
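The discriminants h_k can be evaluated directly from (µ_k, Σ_k, π_k); a minimal numpy sketch with hypothetical class parameters:

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    # h_k(x) = x^T W_k x + w_k^T x + w_k0, built from the class parameters.
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(pi)
    return x @ W @ x + w @ x + w0

# Hypothetical two-class parameters.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
pis = [0.5, 0.5]

x = np.array([0.2, -0.1])
scores = [qda_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
label = int(np.argmax(scores))   # classify x as the argmax class
```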
Estimating parameters: univariate Gaussian

▶ Suppose x_1, x_2, ..., x_n i.i.d. ∼ N(µ, σ²)
▶ Estimating the mean:

    µ_ML = (1/n) ∑_{i=1}^n x_i

▶ Estimating the variance:

    σ²_ML = (1/n) ∑_{i=1}^n (x_i − µ_ML)²
Estimating parameters: multivariate Gaussian

▶ Suppose x_1, x_2, ..., x_n i.i.d. ∼ N(µ, Σ)
▶ Estimating the mean:

    µ_ML = (1/n) ∑_{i=1}^n x_i

▶ Estimating the covariance:

    Σ_ML = (1/n) ∑_{i=1}^n (x_i − µ_ML)(x_i − µ_ML)^T
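The ML formulas above are one-liners in numpy; a sketch on synthetic data (hypothetical µ and Σ), noting the 1/n factor that distinguishes Σ_ML from the unbiased 1/(n−1) estimate np.cov returns by default:

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw i.i.d. samples from a hypothetical 2-D Gaussian.
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)   # shape (n, 2)

n = X.shape[0]
mu_ml = X.mean(axis=0)                     # (1/n) sum x_i
centered = X - mu_ml
Sigma_ml = centered.T @ centered / n       # (1/n) sum (x_i - mu)(x_i - mu)^T
# Note: Sigma_ml is the (biased) ML estimate; np.cov uses 1/(n-1).
```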
Linear vs Quadratic Discriminant Analysis

▶ LDA: estimate µ_k for k = 1, ..., K and a shared Σ

    Kn + (n choose 2) + n parameters

▶ QDA: estimate µ_k and Σ_k for k = 1, ..., K

    Kn + K·((n choose 2) + n) parameters
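The parameter counts can be tabulated with a short helper (illustrative function names), counting (n choose 2) off-diagonal plus n diagonal entries per symmetric covariance:

```python
from math import comb

def lda_params(K, n):
    # K mean vectors of dimension n, plus one shared covariance.
    return K * n + comb(n, 2) + n

def qda_params(K, n):
    # K mean vectors, plus K full covariances.
    return K * n + K * (comb(n, 2) + n)
```

For K = 3 classes in n = 10 dimensions, LDA fits 85 parameters while QDA fits 195, which is why QDA needs substantially more data.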
Regularized Linear Discriminant Analysis

▶ Maximum likelihood covariance estimate

    Σ_ML = (1/n) ∑_{i=1}^n (x_i − µ_ML)(x_i − µ_ML)^T

▶ Regularized estimate

    Σ̂ = (1 − α) diag(Σ_ML) + α Σ_ML

▶ Diagonal Linear Discriminant Analysis (α = 0):

    Σ̂ = diag(Σ_ML)

(slide credit: T. Hastie et al.)
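The shrinkage estimate interpolates between diag(Σ_ML) and Σ_ML; a minimal sketch (hypothetical Σ_ML):

```python
import numpy as np

def regularized_cov(Sigma_ml, alpha):
    # Shrink toward the diagonal: alpha = 1 keeps the full ML estimate,
    # alpha = 0 gives diagonal LDA.
    return (1 - alpha) * np.diag(np.diag(Sigma_ml)) + alpha * Sigma_ml

Sigma_ml = np.array([[2.0, 0.8], [0.8, 1.0]])
Sigma_diag = regularized_cov(Sigma_ml, 0.0)   # diagonal LDA
Sigma_half = regularized_cov(Sigma_ml, 0.5)   # off-diagonals halved
```

Shrinking the off-diagonals stabilizes Σ̂ when n is small relative to the dimension and Σ_ML is poorly conditioned.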
Optimal basis change and dimension reduction

▶ Decision boundary w^T(x − x_0) = 0; e.g., in LDA with equal covariances, w = Σ^{−1}(µ_i − µ_j)
▶ Classifies based on the scalar projection w^T x ∈ R

Mean of the projected data

▶ y = a^T x

    µ_1 = (1/N_1) ∑_{i ∈ class 1} x_i
    µ_2 = (1/N_2) ∑_{i ∈ class 2} x_i
Fisher’s LDA

▶ µ_k = E[x | x comes from class k]
▶ Σ_k = E[(x − µ_k)(x − µ_k)^T | x comes from class k]
▶ Classify using a scalar feature y = a^T x

    β_k = E[y | x comes from class k]
    σ_k² = E[(y − β_k)² | x comes from class k]

    max_a (β_1 − β_2)² / (σ_1² + σ_2²)
Fisher’s LDA

    β_k = E[y | x comes from class k] = a^T µ_k
    σ_k² = E[(y − β_k)² | x comes from class k] = E[(a^T(x − µ_k))²] = E[a^T(x − µ_k)(x − µ_k)^T a] = a^T Σ_k a

    max_a (β_1 − β_2)² / (σ_1² + σ_2²) = max_a (a^T(µ_1 − µ_2))² / (a^T(Σ_1 + Σ_2)a) = max_a (a^T Q a) / (a^T P a)

where Q = (µ_1 − µ_2)(µ_1 − µ_2)^T and P = Σ_1 + Σ_2.
Maximizing quadratic forms

    max_a (a^T Q a) / (a^T a)

▶ Eigenvalue decomposition Q = UΛU^T
▶ Change of basis b = U^T a, i.e., Ub = a

    max_a (a^T UΛU^T a) / (a^T a) = max_b (b^T Λ b) / (b^T U^T U b) = max_b (b^T Λ b) / (b^T b)

▶ The optimum is b = δ[n − k*], where k* = arg max_k Λ_kk (= 1 when the eigenvalues are sorted in decreasing order)

Solution: a = u_1, the maximal eigenvector, i.e., Qu_1 = λ_1 u_1
Optimal value: λ_1
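The claim that the Rayleigh quotient is maximized by the top eigenvector is easy to verify numerically; a sketch using a random symmetric Q (note np.linalg.eigh returns eigenvalues in ascending order, so the maximal eigenvector is the last column):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q = A @ A.T                       # a random symmetric PSD matrix

lam, U = np.linalg.eigh(Q)        # eigenvalues in *ascending* order
a_star = U[:, -1]                 # maximal eigenvector u_1

def rayleigh(a):
    return (a @ Q @ a) / (a @ a)
# rayleigh(a_star) attains lam[-1]; no direction exceeds it.
```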
Maximizing quadratic forms: two quadratics

    max_a (a^T Q a) / (a^T P a)

▶ Theorem (Simultaneous Diagonalization)
  Let P, Q ∈ R^{n×n} be real symmetric matrices with P positive definite. Then there exists a matrix V such that

    V^T P V = I
    V^T Q V = Λ = diag(λ_1, ..., λ_n)

  where V and Λ satisfy the generalized eigenvalue equation

    Qv_i = λ_i P v_i
Maximizing quadratic forms: two quadratics

▶ Theorem (Simultaneous Diagonalization)
  Let P, Q ∈ R^{n×n} be real symmetric matrices with P positive definite. Then there exists a matrix V such that V^T P V = I and V^T Q V = Λ = diag(λ_1, ..., λ_n), where V and Λ satisfy the generalized eigenvalue equation Qv_i = λ_i P v_i.

Proof: Let P = U_P Λ_P U_P^T be its eigenvalue decomposition.
V′ = U_P Λ_P^{−1/2} alone only diagonalizes P (V′^T P V′ = I).
Let V′^T Q V′ = U′Λ′U′^T be its EVD.
Set V = V′U′.
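The constructive proof translates directly into numpy; a sketch with random symmetric P (made positive definite) and Q:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
P = B @ B.T + 3 * np.eye(3)        # symmetric positive definite
C = rng.standard_normal((3, 3))
Q = C + C.T                        # symmetric

# Follow the proof: whiten P, then diagonalize the transformed Q.
lamP, UP = np.linalg.eigh(P)
Vp = UP / np.sqrt(lamP)            # V' = U_P Lambda_P^{-1/2}; V'^T P V' = I
lamQ, Uq = np.linalg.eigh(Vp.T @ Q @ Vp)
V = Vp @ Uq                        # V^T P V = I and V^T Q V = diag(lamQ)
```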
Maximizing quadratic forms: two quadratics

    max_a (a^T Q a) / (a^T P a)

▶ Let V and Λ satisfy the generalized eigenvalue equation Qv_i = λ_i P v_i
▶ Basis change a = Vb, i.e., b = V^{−1}a

    max_b (b^T V^T Q V b) / (b^T V^T P V b) = max_b (b^T Λ b) / (b^T b)

▶ Solution: a = v_1, the maximal generalized eigenvector
  Optimal value: λ_1, the maximum generalized eigenvalue
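In practice the generalized eigenproblem Qv = λPv can be handed to scipy.linalg.eigh, which accepts a second matrix; a sketch on random symmetric matrices:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
P = B @ B.T + 4 * np.eye(4)        # symmetric positive definite
C = rng.standard_normal((4, 4))
Q = C + C.T                        # symmetric

# scipy solves Q v = lambda P v directly; eigenvalues come back ascending.
lam, V = eigh(Q, P)
a_star = V[:, -1]                  # maximal generalized eigenvector v_1

# The optimal value of the ratio a^T Q a / a^T P a is the top eigenvalue.
ratio = (a_star @ Q @ a_star) / (a_star @ P @ a_star)
```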
Fisher’s LDA

    max_a (a^T Q a) / (a^T P a)

where Q = (µ_1 − µ_2)(µ_1 − µ_2)^T and P = Σ_1 + Σ_2.

▶ Solution: Qa = λPa, therefore P^{−1}Qa = λa

    P^{−1}(µ_1 − µ_2)(µ_1 − µ_2)^T a = λa
    a = constant × P^{−1}(µ_1 − µ_2)

which can be normalized as a := P^{−1}(µ_1 − µ_2) / ||P^{−1}(µ_1 − µ_2)||_2
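Putting it together, the Fisher direction on sample data is a single linear solve; a sketch on synthetic two-class data (hypothetical means and covariances):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical two-class data.
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.4], [0.4, 1.0]], size=500)
X2 = rng.multivariate_normal([2, 1], [[1.0, -0.2], [-0.2, 0.5]], size=500)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)
S2 = np.cov(X2, rowvar=False)

P = S1 + S2
a = np.linalg.solve(P, mu1 - mu2)      # P^{-1}(mu1 - mu2), no inverse needed
a /= np.linalg.norm(a)                  # unit-norm Fisher direction

# Project onto the scalar feature y = a^T x and measure class separation.
y1, y2 = X1 @ a, X2 @ a
fisher_ratio = (y1.mean() - y2.mean())**2 / (y1.var() + y2.var())
```

Any other projection direction yields a smaller separation ratio, which is exactly the optimality statement above.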