EE269: Signal Processing for Machine Learning
Lecture 8
Instructor : Mert Pilanci
Stanford University
February 4, 2019
Recap: Linear and Quadratic Discriminant Analysis

▶ Suppose x = [x_1, ..., x_N] ∼ N(µ_k, Σ) when y = k

    g_k(x) = P(x | y = k) = (1 / ((2π)^{N/2} |Σ|^{1/2})) e^{−(1/2)(x − µ_k)^T Σ^{−1}(x − µ_k)}

▶ K classes:

    f(x) = arg max_{k=1,...,K} π_k g_k(x)
Scaled identity covariances Σ_k = σ²I

▶ Suppose x = [x_1, ..., x_N] ∼ N(µ_k, Σ_k) when y = k

    g_k(x) = P(x | y = k) = (1 / ((2π)^{N/2} |Σ_k|^{1/2})) e^{−(1/2)(x − µ_k)^T Σ_k^{−1}(x − µ_k)}

▶ K classes
▶ Decision boundary: hyperplane

    w^T(x − x_0) = 0
    w = µ_i − µ_j
    x_0 = (1/2)(µ_i + µ_j) − (σ² / ||µ_i − µ_j||²) log(π_i/π_j) · (µ_i − µ_j)

▶ The hyperplane passes through the point x_0 and is orthogonal to w
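As a sanity check, the boundary above can be computed directly; a minimal numpy sketch with hypothetical means, variance σ², and priors:

```python
import numpy as np

# Hypothetical two-class problem with shared covariance sigma^2 * I.
mu_i = np.array([1.0, 2.0])
mu_j = np.array([-1.0, 0.0])
sigma2 = 0.5
pi_i, pi_j = 0.6, 0.4

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.sum(w**2) * np.log(pi_i / pi_j) * w

def side(x):
    # Positive -> class i, negative -> class j.
    return np.sign(w @ (x - x0))
```

Note how the log-prior term shifts x_0 away from the midpoint of the means, toward the less likely class.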
Identical covariances Σ_k = Σ

▶ Suppose x = [x_1, ..., x_N] ∼ N(µ_k, Σ) when y = k

    g_k(x) = P(x | y = k) = (1 / ((2π)^{N/2} |Σ|^{1/2})) e^{−(1/2)(x − µ_k)^T Σ^{−1}(x − µ_k)}

▶ Decision boundary: hyperplane

    w^T(x − x_0) = 0
    w = Σ^{−1}(µ_i − µ_j)
    x_0 = (1/2)(µ_i + µ_j) − [log(π_i/π_j) / ((µ_i − µ_j)^T Σ^{−1}(µ_i − µ_j))] · (µ_i − µ_j)

▶ The hyperplane passes through x_0 but is not necessarily orthogonal to the line between the means
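With a shared but non-identity Σ, w = Σ^{−1}(µ_i − µ_j) is generally not parallel to µ_i − µ_j; a small numpy sketch (hypothetical parameters) illustrating this:

```python
import numpy as np

mu_i = np.array([1.0, 2.0])
mu_j = np.array([-1.0, 0.0])
Sigma = np.array([[2.0, 0.9], [0.9, 1.0]])   # shared, non-identity covariance
pi_i = pi_j = 0.5

d = mu_i - mu_j
w = np.linalg.solve(Sigma, d)                # Sigma^{-1}(mu_i - mu_j)
x0 = 0.5 * (mu_i + mu_j) - (np.log(pi_i / pi_j) / (d @ w)) * d

# Cosine between w and (mu_i - mu_j): strictly below 1, so the hyperplane
# is tilted relative to the line between the means.
cos = (w @ d) / (np.linalg.norm(w) * np.linalg.norm(d))
```

With equal priors the log term vanishes and x_0 is exactly the midpoint of the means, even though w is tilted.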
Quadratic Discriminant Analysis: Σ_k arbitrary

▶ Suppose x = [x_1, ..., x_N] ∼ N(µ_k, Σ_k) when y = k

    g_k(x) = P(x | y = k) = (1 / ((2π)^{N/2} |Σ_k|^{1/2})) e^{−(1/2)(x − µ_k)^T Σ_k^{−1}(x − µ_k)}

▶ h_k(x) = x^T W_k x + w_k^T x + w_{k0}
▶ Classify as class k if h_k(x) > h_{k′}(x) for all k′ ≠ k

    W_k = −(1/2) Σ_k^{−1}
    w_k = Σ_k^{−1} µ_k
    w_{k0} = −(1/2) µ_k^T Σ_k^{−1} µ_k − (1/2) log |Σ_k| + log π_k

Quadratic decision regions: hyperquadrics
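The discriminants h_k can be evaluated directly from (µ_k, Σ_k, π_k); a minimal numpy sketch with hypothetical class parameters:

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    # h_k(x) = x^T W_k x + w_k^T x + w_k0, built from the class parameters.
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(pi)
    return x @ W @ x + w @ x + w0

# Hypothetical two-class parameters.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
pis = [0.5, 0.5]

x = np.array([0.2, -0.1])
scores = [qda_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
label = int(np.argmax(scores))   # classify x as the argmax class
```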
Estimating parameters: univariate Gaussian

▶ Suppose x_1, x_2, ..., x_n i.i.d. ∼ N(µ, σ²)
▶ Estimating the mean:

    µ_ML = (1/n) ∑_{i=1}^n x_i

▶ Estimating the variance:

    σ²_ML = (1/n) ∑_{i=1}^n (x_i − µ_ML)²
Estimating parameters: multivariate Gaussian

▶ Suppose x_1, x_2, ..., x_n i.i.d. ∼ N(µ, Σ)
▶ Estimating the mean:

    µ_ML = (1/n) ∑_{i=1}^n x_i

▶ Estimating the covariance:

    Σ_ML = (1/n) ∑_{i=1}^n (x_i − µ_ML)(x_i − µ_ML)^T
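The ML formulas above are one-liners in numpy; a sketch on synthetic data (hypothetical µ and Σ), noting the 1/n factor that distinguishes Σ_ML from the unbiased 1/(n−1) estimate np.cov returns by default:

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw i.i.d. samples from a hypothetical 2-D Gaussian.
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)   # shape (n, 2)

n = X.shape[0]
mu_ml = X.mean(axis=0)                     # (1/n) sum x_i
centered = X - mu_ml
Sigma_ml = centered.T @ centered / n       # (1/n) sum (x_i - mu)(x_i - mu)^T
# Note: Sigma_ml is the (biased) ML estimate; np.cov uses 1/(n-1).
```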
Linear vs Quadratic Discriminant Analysis

▶ LDA: estimate µ_k for k = 1, ..., K and a shared Σ

    Kn + (n choose 2) + n parameters

▶ QDA: estimate µ_k and Σ_k for k = 1, ..., K

    Kn + K·((n choose 2) + n) parameters
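The parameter counts can be tabulated with a short helper (illustrative function names), counting (n choose 2) off-diagonal plus n diagonal entries per symmetric covariance:

```python
from math import comb

def lda_params(K, n):
    # K mean vectors of dimension n, plus one shared covariance.
    return K * n + comb(n, 2) + n

def qda_params(K, n):
    # K mean vectors, plus K full covariances.
    return K * n + K * (comb(n, 2) + n)
```

For K = 3 classes in n = 10 dimensions, LDA fits 85 parameters while QDA fits 195, which is why QDA needs substantially more data.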
Regularized Linear Discriminant Analysis

▶ Maximum likelihood covariance estimate

    Σ_ML = (1/n) ∑_{i=1}^n (x_i − µ_ML)(x_i − µ_ML)^T

▶ Regularized estimate

    Σ̂ = (1 − α) diag(Σ_ML) + α Σ_ML

▶ Diagonal Linear Discriminant Analysis (α = 0):

    Σ̂ = diag(Σ_ML)

(slide credit: T. Hastie et al.)
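The shrinkage estimate interpolates between diag(Σ_ML) and Σ_ML; a minimal sketch (hypothetical Σ_ML):

```python
import numpy as np

def regularized_cov(Sigma_ml, alpha):
    # Shrink toward the diagonal: alpha = 1 keeps the full ML estimate,
    # alpha = 0 gives diagonal LDA.
    return (1 - alpha) * np.diag(np.diag(Sigma_ml)) + alpha * Sigma_ml

Sigma_ml = np.array([[2.0, 0.8], [0.8, 1.0]])
Sigma_diag = regularized_cov(Sigma_ml, 0.0)   # diagonal LDA
Sigma_half = regularized_cov(Sigma_ml, 0.5)   # off-diagonals halved
```

Shrinking the off-diagonals stabilizes Σ̂ when n is small relative to the dimension and Σ_ML is poorly conditioned.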
Optimal basis change and dimension reduction

▶ Decision boundary w^T(x − x_0) = 0; e.g., in LDA with equal covariances, w = Σ^{−1}(µ_i − µ_j)
▶ Classifies based on the scalar projection w^T x ∈ R

Mean of the projected data

▶ y = a^T x

    µ_1 = (1/N_1) ∑_{i ∈ class 1} x_i
    µ_2 = (1/N_2) ∑_{i ∈ class 2} x_i
Fisher’s LDA

▶ µ_k = E[x | x comes from class k]
▶ Σ_k = E[(x − µ_k)(x − µ_k)^T | x comes from class k]
▶ Classify using a scalar feature y = a^T x

    β_k = E[y | x comes from class k]
    σ_k² = E[(y − β_k)² | x comes from class k]

    max_a (β_1 − β_2)² / (σ_1² + σ_2²)
Fisher’s LDA

    β_k = E[y | x comes from class k] = a^T µ_k
    σ_k² = E[(y − β_k)² | x comes from class k] = E[(a^T(x − µ_k))²] = E[a^T(x − µ_k)(x − µ_k)^T a] = a^T Σ_k a

    max_a (β_1 − β_2)² / (σ_1² + σ_2²) = max_a (a^T(µ_1 − µ_2))² / (a^T(Σ_1 + Σ_2)a) = max_a (a^T Q a) / (a^T P a)

where Q = (µ_1 − µ_2)(µ_1 − µ_2)^T and P = Σ_1 + Σ_2.
Maximizing quadratic forms

    max_a (a^T Q a) / (a^T a)

▶ Eigenvalue decomposition Q = UΛU^T
▶ Change of basis b = U^T a, i.e., Ub = a

    max_a (a^T UΛU^T a) / (a^T a) = max_b (b^T Λ b) / (b^T U^T U b) = max_b (b^T Λ b) / (b^T b)

▶ The optimum is b = δ[n − k*], where k* = arg max_k Λ_kk (= 1 when the eigenvalues are sorted in decreasing order)

Solution: a = u_1, the maximal eigenvector, i.e., Qu_1 = λ_1 u_1
Optimal value: λ_1
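The claim that the Rayleigh quotient is maximized by the top eigenvector is easy to verify numerically; a sketch using a random symmetric Q (note np.linalg.eigh returns eigenvalues in ascending order, so the maximal eigenvector is the last column):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q = A @ A.T                       # a random symmetric PSD matrix

lam, U = np.linalg.eigh(Q)        # eigenvalues in *ascending* order
a_star = U[:, -1]                 # maximal eigenvector u_1

def rayleigh(a):
    return (a @ Q @ a) / (a @ a)
# rayleigh(a_star) attains lam[-1]; no direction exceeds it.
```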
Maximizing quadratic forms: two quadratics

    max_a (a^T Q a) / (a^T P a)

▶ Theorem (Simultaneous Diagonalization)
  Let P, Q ∈ R^{n×n} be real symmetric matrices with P positive definite. Then there exists a matrix V such that

    V^T P V = I
    V^T Q V = Λ = diag(λ_1, ..., λ_n)

  where V and Λ satisfy the generalized eigenvalue equation

    Qv_i = λ_i P v_i
Maximizing quadratic forms: two quadratics

▶ Theorem (Simultaneous Diagonalization)
  Let P, Q ∈ R^{n×n} be real symmetric matrices with P positive definite. Then there exists a matrix V such that V^T P V = I and V^T Q V = Λ = diag(λ_1, ..., λ_n), where V and Λ satisfy the generalized eigenvalue equation Qv_i = λ_i P v_i.

Proof: Let P = U_P Λ_P U_P^T be its eigenvalue decomposition.
V′ = U_P Λ_P^{−1/2} alone only diagonalizes P (V′^T P V′ = I).
Let V′^T Q V′ = U′Λ′U′^T be its EVD.
Set V = V′U′.
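The constructive proof translates directly into numpy; a sketch with random symmetric P (made positive definite) and Q:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
P = B @ B.T + 3 * np.eye(3)        # symmetric positive definite
C = rng.standard_normal((3, 3))
Q = C + C.T                        # symmetric

# Follow the proof: whiten P, then diagonalize the transformed Q.
lamP, UP = np.linalg.eigh(P)
Vp = UP / np.sqrt(lamP)            # V' = U_P Lambda_P^{-1/2}; V'^T P V' = I
lamQ, Uq = np.linalg.eigh(Vp.T @ Q @ Vp)
V = Vp @ Uq                        # V^T P V = I and V^T Q V = diag(lamQ)
```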
Maximizing quadratic forms: two quadratics

    max_a (a^T Q a) / (a^T P a)

▶ Let V and Λ satisfy the generalized eigenvalue equation Qv_i = λ_i P v_i
▶ Basis change a = Vb, i.e., b = V^{−1}a

    max_b (b^T V^T Q V b) / (b^T V^T P V b) = max_b (b^T Λ b) / (b^T b)

▶ Solution: a = v_1, the maximal generalized eigenvector
  Optimal value: λ_1, the maximum generalized eigenvalue
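In practice the generalized eigenproblem Qv = λPv can be handed to scipy.linalg.eigh, which accepts a second matrix; a sketch on random symmetric matrices:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
P = B @ B.T + 4 * np.eye(4)        # symmetric positive definite
C = rng.standard_normal((4, 4))
Q = C + C.T                        # symmetric

# scipy solves Q v = lambda P v directly; eigenvalues come back ascending.
lam, V = eigh(Q, P)
a_star = V[:, -1]                  # maximal generalized eigenvector v_1

# The optimal value of the ratio a^T Q a / a^T P a is the top eigenvalue.
ratio = (a_star @ Q @ a_star) / (a_star @ P @ a_star)
```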
Fisher’s LDA

    max_a (a^T Q a) / (a^T P a)

where Q = (µ_1 − µ_2)(µ_1 − µ_2)^T and P = Σ_1 + Σ_2.

▶ Solution: Qa = λPa, therefore P^{−1}Qa = λa

    P^{−1}(µ_1 − µ_2)(µ_1 − µ_2)^T a = λa
    a = constant × P^{−1}(µ_1 − µ_2)

which can be normalized as a := P^{−1}(µ_1 − µ_2) / ||P^{−1}(µ_1 − µ_2)||_2
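Putting it together, the Fisher direction on sample data is a single linear solve; a sketch on synthetic two-class data (hypothetical means and covariances):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical two-class data.
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.4], [0.4, 1.0]], size=500)
X2 = rng.multivariate_normal([2, 1], [[1.0, -0.2], [-0.2, 0.5]], size=500)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)
S2 = np.cov(X2, rowvar=False)

P = S1 + S2
a = np.linalg.solve(P, mu1 - mu2)      # P^{-1}(mu1 - mu2), no inverse needed
a /= np.linalg.norm(a)                  # unit-norm Fisher direction

# Project onto the scalar feature y = a^T x and measure class separation.
y1, y2 = X1 @ a, X2 @ a
fisher_ratio = (y1.mean() - y2.mean())**2 / (y1.var() + y2.var())
```

Any other projection direction yields a smaller separation ratio, which is exactly the optimality statement above.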