10/22/2018
CS 3750 Advanced Machine Learning
Gaussian Processes: classification
Jinpeng Zhou
Gaussian Processes (GP)
A GP is a collection of random variables $f(\mathbf{x})$ such that any finite set $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)$ has a joint Gaussian distribution. Observations are modeled as $y = f(\mathbf{x}) + \epsilon$.
[Figure: sample functions $f(\mathbf{x})$ drawn from a GP, plotted against inputs $\mathbf{x}_i$.]
Weight-Space View
$y = f(\mathbf{x}) + \epsilon = \mathbf{x}^T \mathbf{w} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$
$\Rightarrow \mathbf{y} \sim N(X^T \mathbf{w}, \sigma^2 I)$
Assume $\mathbf{w} \sim N(\mathbf{0}, \Sigma_p)$, then:
$p(\mathbf{w} \mid X, \mathbf{y}) = \dfrac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)} \sim N\!\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right)$
where $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_p^{-1}$.
The posterior on the weights is Gaussian.
Weight-Space View
$p(\mathbf{w} \mid X, \mathbf{y}) \sim N\!\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right), \quad A = \tfrac{1}{\sigma^2} X X^T + \Sigma_p^{-1}$
Thus (a similar result holds when using basis functions, $f(\mathbf{x}) = \phi(\mathbf{x})^T \mathbf{w}$):
$p(f_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int p(f_* \mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w} = N\!\left(\tfrac{1}{\sigma^2} \mathbf{x}_*^T A^{-1} X \mathbf{y},\; \mathbf{x}_*^T A^{-1} \mathbf{x}_*\right) = N(\mu, \Sigma)$
• $f_*$ has a Gaussian distribution, and its covariance $\Sigma$ does not depend on $\mathbf{y}$.
• $\mathbf{x}$ always appears in scalar products $\mathbf{x}^T \mathbf{x}'$ (kernel trick).
[Figure: kernel view of the predictive distribution of $f_*$.]
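To make the weight-space view concrete, here is a minimal sketch in Python/NumPy; the function name blr_predict and the toy data are illustrative, not from the lecture:

import numpy as np

def blr_predict(X, y, x_star, sigma2, Sigma_p):
    """Bayesian linear regression predictive for f* = x*^T w.

    X: (D, n) design matrix with one column per training input,
    y: (n,) targets, x_star: (D,) test input, sigma2: noise variance,
    Sigma_p: (D, D) prior covariance of the weights w ~ N(0, Sigma_p).
    """
    A = (X @ X.T) / sigma2 + np.linalg.inv(Sigma_p)   # A = X X^T / sigma^2 + Sigma_p^{-1}
    A_inv = np.linalg.inv(A)
    w_mean = A_inv @ X @ y / sigma2                   # posterior mean of w
    mu = x_star @ w_mean                              # predictive mean x*^T E[w]
    var = x_star @ A_inv @ x_star                     # predictive variance (independent of y)
    return mu, var

# Toy usage: 1-D inputs with a bias feature.
rng = np.random.default_rng(0)
X = np.vstack([np.ones(20), rng.uniform(-1, 1, 20)])  # D=2, n=20
w_true = np.array([0.5, -1.0])
y = X.T @ w_true + 0.1 * rng.normal(size=20)
mu, var = blr_predict(X, y, np.array([1.0, 0.3]), 0.01, np.eye(2))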
Function-space View
Given:
• mean function $m(\mathbf{x})$
• covariance function $k(\mathbf{x}, \mathbf{x}')$
Define a GP as:
$f(\mathbf{x}) \sim GP(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$
such that for each $\mathbf{x}_i$:
$f(\mathbf{x}_i) \sim N(m(\mathbf{x}_i), k(\mathbf{x}_i, \mathbf{x}_i))$
$\mathrm{cov}(f(\mathbf{x}_i), f(\mathbf{x}_j)) = k(\mathbf{x}_i, \mathbf{x}_j)$
Function-space View
Given a joint Gaussian distribution:
$\begin{bmatrix} \mathbf{x}_A \\ \mathbf{x}_B \end{bmatrix} \sim N\!\left( \begin{bmatrix} \boldsymbol{\mu}_A \\ \boldsymbol{\mu}_B \end{bmatrix}, \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} \right)$
the conditional densities are also Gaussian:
$\mathbf{x}_A \mid \mathbf{x}_B \sim N\!\left( \boldsymbol{\mu}_A + \Sigma_{AB} \Sigma_{BB}^{-1} (\mathbf{x}_B - \boldsymbol{\mu}_B),\; \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right)$
$\mathbf{x}_B \mid \mathbf{x}_A \sim N(\ldots, \ldots)$
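A short NumPy sketch of this conditioning rule (the helper condition_gaussian and its index arguments are illustrative):

import numpy as np

def condition_gaussian(mu, Sigma, idx_A, idx_B, x_B):
    """Mean and covariance of x_A | x_B for a joint Gaussian N(mu, Sigma)."""
    mu_A, mu_B = mu[idx_A], mu[idx_B]
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    # mu_A + S_AB S_BB^{-1} (x_B - mu_B); use a linear solve, not an explicit inverse.
    mean = mu_A + S_AB @ np.linalg.solve(S_BB, x_B - mu_B)
    cov = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)
    return mean, cov

# Toy usage: condition the first coordinate on the other two.
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.5]])
m, C = condition_gaussian(mu, Sigma, np.array([0]), np.array([1, 2]), np.array([0.4, -0.1]))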
Function-space View
Consider training and test points:
$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim N\!\left( \begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)$
$K$ is the covariance matrix.
$y = f(\mathbf{x}) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$. Thus:
$\begin{bmatrix} \mathbf{y} \\ \mathbf{y}_* \end{bmatrix} \mid X, X_* \sim N\!\left( \begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) + \sigma^2 I \end{bmatrix} \right)$
$\mathbf{y}_* \mid \mathbf{y}, X, X_* \sim N(\ldots, \ldots)$
GP for Regression (GPR)
Squares are observed, circles are latent
The thick bar is like a "bus" connecting all nodes.
Each y is conditionally independent of all other nodes given f
GPR
[Figure: graphical model for GPR. Latent values $f_1, \ldots, f_n, f_*$ are fully connected through the GP; each observation $y_i$ depends only on $f_i$, and inputs $x_1, \ldots, x_n, x_*$ index the latent values.]
GPR
$\mathbf{y} = \mathbf{f} + \epsilon$, where $\mathbf{f} \sim GP(m = 0, k)$ and $\epsilon \sim N(0, \sigma^2)$:
$\begin{bmatrix} \mathbf{y} \\ \mathbf{y}_* \end{bmatrix} \sim N\!\left( \mathbf{0}, \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) + \sigma^2 I \end{bmatrix} \right)$
Prediction: $\mathbf{y}_* \mid \mathbf{y}, X, X_* \sim N(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$, where:
$\boldsymbol{\mu}_* = K(X_*, X) \left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y}$
$\boldsymbol{\Sigma}_* = K(X_*, X_*) + \sigma^2 I - K(X_*, X) \left( K(X, X) + \sigma^2 I \right)^{-1} K(X, X_*)$
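A minimal GP regression sketch in NumPy following these formulas; the squared exponential kernel choice and its hyperparameters are illustrative assumptions:

import numpy as np

def sqexp_kernel(A, B, length=1.0, amp=1.0):
    """Squared exponential kernel k(x, x') = amp^2 exp(-|x - x'|^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amp**2 * np.exp(-0.5 * d2 / length**2)

def gpr_predict(X, y, X_star, sigma2=0.1, **kern):
    """Posterior mean and covariance of y* given noisy observations y at X."""
    K = sqexp_kernel(X, X, **kern) + sigma2 * np.eye(len(X))       # K(X,X) + sigma^2 I
    K_star = sqexp_kernel(X_star, X, **kern)                       # K(X*, X)
    K_ss = sqexp_kernel(X_star, X_star, **kern) + sigma2 * np.eye(len(X_star))
    alpha = np.linalg.solve(K, y)                                  # (K + sigma^2 I)^{-1} y
    mu = K_star @ alpha
    Sigma = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mu, Sigma

# Toy usage on noisy sine data.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (25, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=25)
mu, Sigma = gpr_predict(X, y, np.linspace(-3, 3, 50)[:, None])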
GP Classification (GPC)
Map the output into $[0, 1]$ using a response function:
• Sigmoid (logistic): $\sigma(x) = (1 + e^{-x})^{-1}$
• Cumulative normal (probit): $\Phi(x) = \int_{-\infty}^{x} N(t \mid 0, 1)\, dt$
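Both response functions take a couple of lines in Python (scipy's norm.cdf provides the cumulative normal):

import numpy as np
from scipy.stats import norm

def logistic(x):
    """Sigmoid response: maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def probit(x):
    """Cumulative normal response Phi(x)."""
    return norm.cdf(x)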
GPC examples
[Figure: GPC example using a squared exponential kernel with length-scale 0.1 and a logistic response function.]
GPC
[Figure: regression vs. classification. Regression: $y = f(x)$, with a GP over $f$ the output $y$ is Gaussian. Classification: $y = \sigma(f(x))$, with a GP over $f$ the output $y$ is non-Gaussian.]
GPC
The target $y = \sigma(f(\mathbf{x}))$ is non-Gaussian.
E.g., it is a Bernoulli for classes $\{0, 1\}$:
$p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} \sigma(f(\mathbf{x}_i))^{y_i} \left(1 - \sigma(f(\mathbf{x}_i))\right)^{1 - y_i}$
How do we predict $p(\mathbf{y}_* \mid \mathbf{y}, X, X_*)$?
GPC
Assume a single test case $\mathbf{x}_*$ and predict its class (0 or 1).
Assume a sigmoid response function and a GP prior over $f$.
Thus, introduce $f_*$ into the prediction:
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(y_* = 1, f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
$= \int p(y_* = 1 \mid f_*, \mathbf{y}, X, \mathbf{x}_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_* = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
using $p(y_* = 1 \mid f_*, \ldots) = \sigma(f_*)$.
Approximation
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
Note that $\sigma(f_*)$ is a sigmoid function.
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian, the convolution of a sigmoid and a Gaussian can be approximated as:
$\int \sigma(t)\, N(t \mid \mu, \sigma^2)\, dt \approx \sigma\!\left( \mu \left(1 + \pi \sigma^2 / 8\right)^{-1/2} \right)$
Otherwise, it is analytically intractable!
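A quick numerical check of this approximation (a sketch; the grid-based quadrature and test values are arbitrary choices):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def probit_approx(mu, var):
    """Approximation of the integral of sigmoid(t) N(t | mu, var) dt."""
    return sigmoid(mu / np.sqrt(1.0 + np.pi * var / 8.0))

def quadrature(mu, var, n=20001):
    """Brute-force check of the same integral on a wide grid."""
    t = np.linspace(mu - 10 * np.sqrt(var), mu + 10 * np.sqrt(var), n)
    dt = t[1] - t[0]
    pdf = np.exp(-0.5 * (t - mu)**2 / var) / np.sqrt(2 * np.pi * var)
    return np.sum(sigmoid(t) * pdf) * dt

print(probit_approx(1.0, 4.0), quadrature(1.0, 4.0))  # the two values agree closely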
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian…
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(f_*, \mathbf{f} \mid \mathbf{y}, X, \mathbf{x}_*)\, d\mathbf{f}$
$= \int \dfrac{p(\mathbf{y} \mid f_*, \mathbf{f}, X, \mathbf{x}_*)\, p(f_*, \mathbf{f} \mid X, \mathbf{x}_*)}{p(\mathbf{y} \mid X, \mathbf{x}_*)}\, d\mathbf{f}$
$= \int \dfrac{p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X, \mathbf{x}_*)\, p(f_* \mid \mathbf{f}, X, \mathbf{x}_*)}{p(\mathbf{y} \mid X, \mathbf{x}_*)}\, d\mathbf{f}$
$= \int \dfrac{p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)}\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
$= \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian…
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
Recall from GP regression: $\mathbf{y}_* \mid \mathbf{y}, X, X_* \sim N(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$
$\boldsymbol{\mu}_* = K(X_*, X)\left(K(X, X) + \sigma^2 I\right)^{-1} \mathbf{y}$
$\boldsymbol{\Sigma}_* = K(X_*, X_*) + \sigma^2 I - K(X_*, X)\left(K(X, X) + \sigma^2 I\right)^{-1} K(X, X_*)$
Replace $X_*$ with $\mathbf{x}_*$:
$p(f_* \mid X, \mathbf{x}_*, \mathbf{f}) \sim N\!\left( \mathbf{k}_*^T C_N^{-1} \mathbf{f},\; c - \mathbf{k}_*^T C_N^{-1} \mathbf{k}_* \right)$
where $\mathbf{k}_* = K(X, \mathbf{x}_*)$, $C_N = K(X, X) + \sigma^2 I$, and $c = k(\mathbf{x}_*, \mathbf{x}_*) + \sigma^2$.
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian…
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
Now $p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian!
IF: $p(\mathbf{f} \mid \mathbf{y}, X)$ is a Gaussian
THEN: $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian!
Is $p(\mathbf{f} \mid \mathbf{y}, X)$ a Gaussian?
$p(\mathbf{f} \mid \mathbf{y}, X) = \dfrac{p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)}$
$p(\mathbf{f} \mid X) \sim N(\mathbf{0}, K), \quad p(\mathbf{y} \mid \mathbf{f}, X) = \prod_{i=1}^{n} p(y_i \mid f_i)$
$\Rightarrow p(\mathbf{f} \mid \mathbf{y}, X) = \dfrac{N(\mathbf{0}, K)}{p(\mathbf{y} \mid X)} \prod_{i=1}^{n} p(y_i \mid f_i)$
Note: $p(y_i \mid f_i)$ is non-Gaussian (e.g., Bernoulli)
$\Rightarrow p(\mathbf{f} \mid \mathbf{y}, X)$ is non-Gaussian.
Approximation
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian, this integral can be computed. So, try to approximate $p(\mathbf{f} \mid \mathbf{y}, X)$, and hence $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$, as a Gaussian via:
• Laplace Approximation
• Expectation Propagation (EP)
• Variational approximation, Monte Carlo sampling, etc.
If $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian…
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
Now $p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian!
IF: $p(\mathbf{f} \mid \mathbf{y}, X)$ is approximated as a Gaussian
THEN: $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is approximated as a Gaussian!
Laplace Approximation
Goal: use a Gaussian to approximate $p(\mathbf{f})$.
Gaussian: $\dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\dfrac{(x - \mu)^2}{2\sigma^2} \right)$
Idea:
• $\mu$ should match the mode: $\mathrm{argmax}_x\, p(x)$
• Relate $\exp\!\left( c\, (x - \mu)^2 \right)$ to $p(x)$
Laplace Approximation
• $\mu$ should match the mode: $\mathrm{argmax}_x\, p(x)$
$p'(\mu) = 0$, so $\mu$ is placed at the mode.
• Relate $\exp\!\left( c\, (x - \mu)^2 \right)$ to $p(x)$ via a Taylor expansion of $\ln p(x)$ at $\mu$.
Let $t(x) = \ln p(x)$. The Taylor expansion of $t(x)$ at $\mu$ is:
$t(x) = t(\mu) + \dfrac{t'(\mu)}{1!}(x - \mu) + \dfrac{t''(\mu)}{2!}(x - \mu)^2 + \cdots$
Laplace Approximation
Taylor quadratic approximation:
$t(x) = t(\mu) + \dfrac{t'(\mu)}{1!}(x - \mu) + \dfrac{t''(\mu)}{2!}(x - \mu)^2 + \cdots \approx t(\mu) + \dfrac{t'(\mu)}{1!}(x - \mu) + \dfrac{t''(\mu)}{2!}(x - \mu)^2$
Note that $t'(\mu) = \dfrac{p'(\mu)}{p(\mu)} = 0$, thus:
$t(x) \approx t(\mu) + \dfrac{t''(\mu)}{2!}(x - \mu)^2$
Laplace Approximation
$t(x) \approx t(\mu) + \dfrac{t''(\mu)}{2!}(x - \mu)^2 \qquad //\; t(x) = \ln p(x)$
$\Rightarrow \exp(t(x)) \approx \exp(t(\mu)) \exp\!\left( \dfrac{t''(\mu)(x - \mu)^2}{2!} \right)$
$\Rightarrow p(x) \approx p(\mu) \exp\!\left( -\dfrac{A\, (x - \mu)^2}{2} \right)$
where $A = -t''(\mu) = -\dfrac{d^2}{dx^2} \ln p(x) \Big|_{x = \mu}$
Laplace Approximation
$p(x) \approx p(\mu) \exp\!\left( -\dfrac{A\, (x - \mu)^2}{2} \right)$
In the multivariate case:
$p(\mathbf{x}) \approx p(\boldsymbol{\mu}) \exp\!\left( -\dfrac{(\mathbf{x} - \boldsymbol{\mu})^T A\, (\mathbf{x} - \boldsymbol{\mu})}{2} \right)$
Corresponding Gaussian approximation:
$p(\mathbf{x}) \approx N(\boldsymbol{\mu}, A^{-1})$
where:
$\dfrac{d}{d\mathbf{x}} \ln p(\mathbf{x}) \Big|_{\mathbf{x} = \boldsymbol{\mu}} = 0, \qquad -\dfrac{d^2}{d\mathbf{x}^2} \ln p(\mathbf{x}) \Big|_{\mathbf{x} = \boldsymbol{\mu}} = A$
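A toy sketch of this recipe, approximating a Gamma(5, 1) density with a Gaussian placed at its mode; the target density is chosen purely for illustration:

import numpy as np
from scipy.stats import norm, gamma

# Target: Gamma(x | a, 1) with a = 5; log-density up to an additive constant.
a = 5.0
log_p = lambda x: (a - 1) * np.log(x) - x

# Mode from d/dx log p = (a - 1)/x - 1 = 0  =>  mu = a - 1.
mu = a - 1
# A = -d^2/dx^2 log p at the mode = (a - 1)/mu^2, so the variance is 1/A.
A = (a - 1) / mu**2

# Compare the true and approximate densities near the mode (they agree closely).
print(gamma.pdf(mu, a), norm.pdf(mu, mu, np.sqrt(1 / A)))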
Laplace Approximation
Back to $p(\mathbf{f} \mid \mathbf{y}, X)$:
$\ln p(\mathbf{f} \mid \mathbf{y}, X) = \ln \dfrac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)} = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X) - \ln p(\mathbf{y} \mid X)$
$\ln p(\mathbf{y} \mid X)$ is independent of $\mathbf{f}$, and we only need first and second derivatives with respect to $\mathbf{f}$.
Thus, define $\Psi(\mathbf{f})$ for the derivative calculation:
$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X)$
Laplace Approximation
$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X)$
Note that:
o $p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} \sigma(f(\mathbf{x}_i))^{y_i} \left(1 - \sigma(f(\mathbf{x}_i))\right)^{1-y_i} = \prod_{i=1}^{n} \left(1 - \sigma(f(\mathbf{x}_i))\right) \left(\dfrac{\sigma(f(\mathbf{x}_i))}{1 - \sigma(f(\mathbf{x}_i))}\right)^{y_i} = \prod_{i=1}^{n} \dfrac{1}{1 + e^{f(\mathbf{x}_i)}}\, e^{f(\mathbf{x}_i)\, y_i}$
o $\mathbf{f} \mid X \sim N(\mathbf{0}, C_N)$
Laplace Approximation
$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X)$
o $p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} \dfrac{1}{1 + e^{f(\mathbf{x}_i)}}\, e^{f(\mathbf{x}_i)\, y_i}$
o $\mathbf{f} \mid X \sim N(\mathbf{0}, C_N)$
Thus:
$\Psi(\mathbf{f}) = \mathbf{y}^T \mathbf{f} - \sum_{i=1}^{n} \ln\!\left(1 + e^{f(\mathbf{x}_i)}\right) - \dfrac{1}{2}\, \mathbf{f}^T C_N^{-1} \mathbf{f} - \dfrac{n}{2} \ln 2\pi - \dfrac{1}{2} \ln |C_N|$
Laplace Approximation
$\Psi(\mathbf{f}) = \ln p(\mathbf{y} \mid \mathbf{f}) + \ln p(\mathbf{f} \mid X) = \mathbf{y}^T \mathbf{f} - \sum_{i=1}^{n} \ln\!\left(1 + e^{f(\mathbf{x}_i)}\right) - \dfrac{1}{2}\, \mathbf{f}^T C_N^{-1} \mathbf{f} - \dfrac{n}{2} \ln 2\pi - \dfrac{1}{2} \ln |C_N|$
$\nabla \Psi(\mathbf{f}) = \nabla \ln p(\mathbf{y} \mid \mathbf{f}) - C_N^{-1} \mathbf{f} = \mathbf{y} - \sigma(\mathbf{f}) - C_N^{-1} \mathbf{f}$
where $\sigma(\mathbf{f}) = \left( \sigma(f(\mathbf{x}_1)), \ldots, \sigma(f(\mathbf{x}_n)) \right)^T$
$\nabla\nabla \Psi(\mathbf{f}) = \nabla\nabla \ln p(\mathbf{y} \mid \mathbf{f}) - C_N^{-1} = -W_N - C_N^{-1}$
$W_N = \mathrm{diag}\!\left( \sigma(f(\mathbf{x}_1))(1 - \sigma(f(\mathbf{x}_1))), \ldots, \sigma(f(\mathbf{x}_n))(1 - \sigma(f(\mathbf{x}_n))) \right)$
Laplace Approximation
Corresponding Gaussian approximation:
$p(\mathbf{f} \mid \mathbf{y}, X) \approx N(\boldsymbol{\mu}, A^{-1})$
where:
$\nabla \Psi(\boldsymbol{\mu}) = \mathbf{y} - \sigma(\boldsymbol{\mu}) - C_N^{-1} \boldsymbol{\mu} = 0 \;\Rightarrow\; \mathbf{y} - \sigma(\boldsymbol{\mu}) = C_N^{-1} \boldsymbol{\mu} \;\overset{?}{\Rightarrow}\; \boldsymbol{\mu}$
$-\nabla\nabla \Psi(\boldsymbol{\mu}) = W_N + C_N^{-1} \;\Rightarrow\; A = W_N + C_N^{-1}$
Laplace Approximation
We cannot directly solve for $\mathbf{f}$ from $\mathbf{y} - \sigma(\mathbf{f}) = C_N^{-1} \mathbf{f}$, since it is nonlinear in $\mathbf{f}$.
Recall the motivation: place $\mathbf{f}$ at the mode, i.e., maximize $p(\mathbf{f} \mid \mathbf{y}, X)$.
Thus, instead of maximizing directly, find $\hat{\mathbf{f}}$ minimizing $-\ln p(\mathbf{f} \mid \mathbf{y}, X)$ with Newton iterations:
$\mathbf{f}^{\text{new}} = \mathbf{f}^{\text{old}} - (\nabla\nabla\Psi)^{-1} \nabla\Psi$
Finally, we get the Gaussian approximation:
$p(\mathbf{f} \mid \mathbf{y}, X) \approx N\!\left( \hat{\mathbf{f}},\; (W_N + C_N^{-1})^{-1} \right)$
$\sigma(\mathbf{f}) = \left( \sigma(f(\mathbf{x}_1)), \ldots, \sigma(f(\mathbf{x}_n)) \right)^T$
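A minimal sketch of this Newton iteration for GPC with a logistic likelihood; the starting point, tolerance, and function name laplace_mode are my assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def laplace_mode(C_N, y, tol=1e-8, max_iter=100):
    """Newton's method for f_hat = argmax Psi(f) in GPC with a logistic link.

    C_N: (n, n) prior covariance of f; y: (n,) labels in {0, 1}.
    Returns the mode f_hat and the precision A = W_N + C_N^{-1} at the mode.
    """
    n = len(y)
    f = np.zeros(n)                          # start at the prior mean
    C_inv = np.linalg.inv(C_N)               # fine for a small-n sketch
    for _ in range(max_iter):
        pi = sigmoid(f)
        grad = y - pi - C_inv @ f            # grad Psi(f)
        W = np.diag(pi * (1 - pi))
        hess = -(W + C_inv)                  # grad grad Psi(f)
        step = np.linalg.solve(hess, grad)
        f = f - step                         # f_new = f_old - H^{-1} grad
        if np.max(np.abs(step)) < tol:
            break
    pi = sigmoid(f)
    A = np.diag(pi * (1 - pi)) + C_inv
    return f, A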
Laplace Approximation
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
Now we have:
$p(\mathbf{f} \mid \mathbf{y}, X)$ is approximated as a Gaussian.
$p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian.
Thus, $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian $\sim N(\mu, \sigma^2)$.
Thus, $p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) \approx \sigma\!\left( \mu \left(1 + \pi \sigma^2 / 8\right)^{-1/2} \right)$
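Continuing the sketch above, the predictive class probability for one test point, reusing laplace_mode's outputs, with k_star and c as defined on the earlier slide (this pairing of the pieces is my assumption, not the slides' code):

import numpy as np

def gpc_laplace_predict(C_N, k_star, c, f_hat, A):
    """Predictive p(y* = 1 | y, X, x*) under the Laplace approximation.

    k_star: (n,) vector K(X, x*); c: scalar k(x*, x*) (+ noise term, if used);
    f_hat, A: mode and precision returned by laplace_mode.
    """
    C_inv_k = np.linalg.solve(C_N, k_star)
    mu = C_inv_k @ f_hat                                # k*^T C_N^{-1} f_hat
    # Variance combines the conditional term and the uncertainty in f (cov = A^{-1}):
    var = c - k_star @ C_inv_k + C_inv_k @ np.linalg.solve(A, C_inv_k)
    return 1.0 / (1.0 + np.exp(-mu / np.sqrt(1 + np.pi * var / 8)))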
Expectation Propagation
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
$p(\mathbf{f} \mid \mathbf{y}, X) = \dfrac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)} = \dfrac{1}{Z} N(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i)$
Normalization term:
$Z = p(\mathbf{y} \mid X) = \int p(\mathbf{f} \mid X) \prod_{i=1}^{n} p(y_i \mid f_i)\, d\mathbf{f}$
Expectation Propagation
$p(\mathbf{f} \mid \mathbf{y}, X) = \dfrac{1}{Z} N(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i)$
$p(y_i \mid f_i)$ is non-Gaussian due to the response function.
Try to find a Gaussian $t_i$ to approximate it:
$p(y_i \mid f_i) \approx t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i\, N(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$
$\Rightarrow \prod_{i=1}^{n} p(y_i \mid f_i) \approx \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = N(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma}) \prod_{i=1}^{n} \tilde{Z}_i$, where $\tilde{\Sigma} = \mathrm{diag}(\tilde{\sigma}_i^2)$
Expectation Propagation
$p(\mathbf{f} \mid \mathbf{y}, X) = \dfrac{1}{Z} N(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i) \approx \dfrac{1}{Z} N(\mathbf{0}, K) \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2)$
$p(\mathbf{f} \mid \mathbf{y}, X) \approx \dfrac{1}{Z} N(\mathbf{0}, K)\, N(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma}) \prod_{i=1}^{n} \tilde{Z}_i$
Thus:
$p(\mathbf{f} \mid \mathbf{y}, X) \approx q(\mathbf{f} \mid \mathbf{y}, X) = N(\boldsymbol{\mu}, \Sigma)$
where $\boldsymbol{\mu} = \Sigma \tilde{\Sigma}^{-1} \tilde{\boldsymbol{\mu}}$ and $\Sigma = \left( K^{-1} + \tilde{\Sigma}^{-1} \right)^{-1}$.
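A small sketch of this Gaussian-product step (the site parameters mu_tilde and sigma2_tilde stand for whatever the EP loop currently holds):

import numpy as np

def ep_posterior(K, mu_tilde, sigma2_tilde):
    """Combine the GP prior N(0, K) with Gaussian site approximations.

    Returns mu, Sigma of q(f | y, X) = N(mu, Sigma), with
    Sigma = (K^{-1} + Sigma_tilde^{-1})^{-1} and mu = Sigma Sigma_tilde^{-1} mu_tilde.
    """
    S_tilde_inv = np.diag(1.0 / sigma2_tilde)
    Sigma = np.linalg.inv(np.linalg.inv(K) + S_tilde_inv)
    mu = Sigma @ S_tilde_inv @ mu_tilde
    return mu, Sigma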
Expectation Propagation
How do we choose the $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ that define $t_i$?
Choose them to minimize the difference between $p(\mathbf{f} \mid \mathbf{y}, X)$ and $q(\mathbf{f} \mid \mathbf{y}, X)$:
$A \prod_{i=1}^{n} p(y_i \mid f_i) \approx A \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2)$, where $A = \dfrac{1}{Z} N(\mathbf{0}, K)$
Using the Kullback-Leibler (KL) divergence $KL(p(x) \,\|\, q(x))$:
• $\min_{\{t_i\}} KL\!\left( A \prod_{i=1}^{n} p(y_i \mid f_i) \,\Big\|\, A \prod_{i=1}^{n} t_i \right)$ (intractable ✗)
• $\min_{t_i} KL\!\left( A\, p(y_i \mid f_i) \prod_{j \neq i} t_j \,\Big\|\, A\, t_i \prod_{j \neq i} t_j \right)$ (iterative on $t_i$ ✓)
Expectation Propagation
[Figure: factor diagram over sites $1, 2, \ldots, n$. Replacing all exact factors $p(y_i \mid f_i)$ by approximate factors $t_i$ at once is intractable ✗.]
Expectation Propagation
[Figure: 1st iteration: update $t_1$ based on the marginal $\int q(\mathbf{f} \mid \mathbf{y}, X)\, d\mathbf{f}_{j \neq 1}$, holding $\prod_{j \neq 1} t_j$ fixed.]
Expectation Propagation
[Figure: 2nd iteration: update $t_2$ based on the marginal for $t_2$.]
Expectation Propagation
[Figure: nth iteration: update $t_n$, holding $\prod_{j \neq n} t_j$ fixed.]
Expectation Propagation
[Figure: repeat the sweeps over all sites until convergence.]
Expectation Propagation
Iteratively update $t_i$:
1. Start from the current approximate posterior and leave out the current $t_i$ to get the cavity distribution $q_{-i}(f_i)$:
$q_{-i}(f_i) = \int A \prod_{j \neq i} t_j \, d\mathbf{f}_{j \neq i} = \dfrac{1}{t_i} \int A \prod_{j=1}^{n} t_j \, d\mathbf{f}_{j \neq i} = \dfrac{1}{t_i} \int q(\mathbf{f} \mid \mathbf{y}, X) \, d\mathbf{f}_{j \neq i} = \dfrac{q(f_i \mid \mathbf{y}, X)}{t_i} = \dfrac{N(f_i \mid \mu_i, \sigma_i^2)}{\tilde{Z}_i \, N(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)} \propto N(f_i \mid \mu_{-i}, \sigma_{-i}^2)$
where: $\sigma_i^2 = \Sigma_{ii}$, $\sigma_{-i}^2 = \left( \sigma_i^{-2} - \tilde{\sigma}_i^{-2} \right)^{-1}$, $\mu_{-i} = \sigma_{-i}^2 \left( \sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i \right)$
Recall: $q(\mathbf{f} \mid \mathbf{y}, X) = A \prod_{i=1}^{n} t_i = N(\boldsymbol{\mu}, \Sigma)$ and $t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i \, N(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$.
Expectation Propagation
Iteratively update $t_i$:
2. Find the new Gaussian marginal $\hat{q}(f_i) \approx p(f_i \mid \mathbf{y}, X)$ which best approximates the product of $q_{-i}(f_i)$ and the exact likelihood term $p(y_i \mid f_i)$:
$q_{-i}(f_i)\, p(y_i \mid f_i) \approx \hat{q}(f_i) = \hat{Z}_i\, N(\hat{\mu}_i, \hat{\sigma}_i^2)$
by:
$\min_{t_i} KL\!\left( q_{-i}(f_i)\, p(y_i \mid f_i) \,\|\, \hat{q}(f_i) \right) \quad //\ \text{tractable}$
Here $p(y_i \mid f_i)$, which comes from the response function, replaces the site $t_i$, so $q_{-i}(f_i)\, p(y_i \mid f_i)$ stands in for the non-Gaussian marginal $p(f_i \mid \mathbf{y}, X)$.
Expectation Propagation
Iteratively update $t_i$:
3. Update the parameters $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ of $t_i$ based on $\hat{Z}_i, \hat{\mu}_i, \hat{\sigma}_i^2$ from the found $\hat{q}(f_i)$. E.g.:
Recall: $q_{-i}(f_i) = \dfrac{q(f_i)}{t_i} = N(f_i \mid \mu_{-i}, \sigma_{-i}^2)$, where $\sigma_{-i}^2 = \left( \sigma_i^{-2} - \tilde{\sigma}_i^{-2} \right)^{-1}$ (old $\tilde{\sigma}_i^2$).
Now, with the new $\hat{q}(f_i)$:
$\sigma_{-i}^2 = \left( \hat{\sigma}_i^{-2} - \tilde{\sigma}_i^{-2} \right)^{-1} \;\Rightarrow\; \tilde{\sigma}_i^2 = \left( \hat{\sigma}_i^{-2} - \sigma_{-i}^{-2} \right)^{-1}$ (new $\tilde{\sigma}_i^2$)
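A sketch of one EP site update in Python. Note the swap: the slides use a logistic likelihood, but the moment integrals are available in closed form for the probit likelihood $\Phi(y_i f_i)$ with $y_i \in \{-1, +1\}$, so this sketch uses probit, following the standard formulas in Rasmussen & Williams (reference 2), Sec. 3.6:

import numpy as np
from scipy.stats import norm

def ep_site_update(mu_i, sigma2_i, mu_t, sigma2_t, y_i):
    """One EP update for site i with a probit likelihood Phi(y_i * f_i).

    mu_i, sigma2_i: current marginal q(f_i) = N(mu_i, sigma2_i);
    mu_t, sigma2_t: current site parameters (mu_tilde_i, sigma2_tilde_i);
    y_i: label in {-1, +1}. Returns the updated site (mu_tilde, sigma2_tilde).
    """
    # 1. Cavity: remove the current site from the marginal (natural parameters).
    tau_cav = 1.0 / sigma2_i - 1.0 / sigma2_t
    nu_cav = mu_i / sigma2_i - mu_t / sigma2_t
    mu_cav, s2_cav = nu_cav / tau_cav, 1.0 / tau_cav

    # 2. Moments of cavity * exact likelihood (analytic for probit).
    z = y_i * mu_cav / np.sqrt(1.0 + s2_cav)
    ratio = norm.pdf(z) / norm.cdf(z)
    mu_hat = mu_cav + y_i * s2_cav * ratio / np.sqrt(1.0 + s2_cav)
    s2_hat = s2_cav - s2_cav**2 * ratio * (z + ratio) / (1.0 + s2_cav)

    # 3. New site parameters from the matched moments.
    tau_new = 1.0 / s2_hat - tau_cav
    nu_new = mu_hat / s2_hat - nu_cav
    return nu_new / tau_new, 1.0 / tau_new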
Expectation Propagation
$p(y_* = 1 \mid \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)\, df_*$
$p(f_* \mid \mathbf{y}, X, \mathbf{x}_*) = \int p(\mathbf{f} \mid \mathbf{y}, X)\, p(f_* \mid X, \mathbf{x}_*, \mathbf{f})\, d\mathbf{f}$
$p(\mathbf{f} \mid \mathbf{y}, X) = \dfrac{1}{Z} N(\mathbf{0}, K) \prod_{i=1}^{n} p(y_i \mid f_i)$
Now we have:
$p(y_i \mid f_i)$ is approximated as a Gaussian.
Consequently, $p(\mathbf{f} \mid \mathbf{y}, X)$ is a Gaussian.
$p(f_* \mid X, \mathbf{x}_*, \mathbf{f})$ is a Gaussian.
Thus, $p(f_* \mid \mathbf{y}, X, \mathbf{x}_*)$ is a Gaussian $\sim N(\mu, \sigma^2)$.
Approximation
• Laplace Approximation
• approximates $p(\mathbf{f} \mid \mathbf{y}, X)$
• 2nd-order Taylor approximation
• mean $\boldsymbol{\mu}$ is placed at the mode
• covariance $A^{-1}$ is also evaluated at the mode
• EP
• approximates $p(y_i \mid f_i)$
• iteratively updates $t_i$ by minimizing KL divergence
Approximation
Laplace peaks at the posterior mode.
EP has a more accurate placement of probability mass.
GPC
• Nonparametric classification based on Bayesian methodology
• The classification decision is made directly from observed data
• The computational cost is $O(N^3)$ in general, due to the covariance matrix inversion. (DeepMind's recently proposed "Neural Processes" achieve $O(N)$ by combining GPs with neural networks.)
References
1. Ebden, Mark. "Gaussian Processes for Regression: A Quick Introduction." Robotics Research Group, Department of Engineering Science, University of Oxford, 2008.
2. Rasmussen, Carl Edward, and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
3. Kuss, Malte, and Carl E. Rasmussen. "Assessing Approximations for Gaussian Process Classification." In Advances in Neural Information Processing Systems, pp. 699-706, 2006.
4. Hauskrecht, Milos. "CS2750 Machine Learning: Linear Regression." http://people.cs.pitt.edu/~milos/courses/cs2750/Lectures/Class8.pdf
5. Zabaras, Nicholas. "Introduction to Gaussian Processes and Kernel Methods." https://www.dropbox.com/s/2yonhmzn57lvv1d/Lec27-KernelMethods.pdf?dl=0
6. Teh, Yee Whye. "Expectation and Belief Propagation." https://www.stats.ox.ac.uk/~teh/teaching/probmodels/lecture4bp.pdf