Machine Learning
Probabilistic Machine Learning
learning as inference, Bayesian Kernel Ridge regression = Gaussian Processes, Bayesian Kernel Logistic Regression = GP classification, Bayesian Neural Networks
Marc Toussaint, University of Stuttgart
Summer 2019
Learning as Inference
• The parametric view

    P(β | Data) = P(Data | β) P(β) / P(Data)

• The function space view

    P(f | Data) = P(Data | f) P(f) / P(Data)
• Today:
  – Bayesian (Kernel) Ridge Regression ↔ Gaussian Process (GP)
  – Bayesian (Kernel) Logistic Regression ↔ GP classification
  – Bayesian Neural Networks (briefly)
• Beyond learning about specific Bayesian learning methods:
  Understand the relations between
  – loss/error ↔ neg-log likelihood
  – regularization ↔ neg-log prior
  – cost (reg. + loss) ↔ neg-log posterior
Gaussian Process = Bayesian (Kernel) Ridge Regression
Ridge regression as Bayesian inference
• We have random variables X1:n, Y1:n, β
• We observe data D = {(xi, yi)}ni=1 and want to compute P (β |D)
• Let's assume (graphical model: a plate over i = 1:n, with β and xi the parents of yi):
  – P(X) is arbitrary
  – P(β) is Gaussian: β ∼ N(0, σ²/λ) ∝ e^{−λ/(2σ²) ||β||²}
  – P(Y | X, β) is Gaussian: y = x⊤β + ε ,  ε ∼ N(0, σ²)
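A minimal NumPy sketch of these modelling assumptions, for illustration only; the dimensions and the values of σ and λ are arbitrary choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3            # number of observations and feature dimension (arbitrary)
sigma, lam = 0.5, 1.0   # noise std deviation sigma and ridge parameter lambda (assumed)

X = rng.uniform(-2, 2, size=(n, d))                  # P(X) is arbitrary
beta = rng.normal(0, sigma / np.sqrt(lam), size=d)   # prior: beta ~ N(0, sigma^2/lambda * I)
y = X @ beta + rng.normal(0, sigma, size=n)          # likelihood: y = x^T beta + eps
```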
Ridge regression as Bayesian inference

• Bayes' Theorem:

    P(β | D) = P(D | β) P(β) / P(D)

    P(β | x1:n, y1:n) = [ ∏i=1..n P(yi | β, xi) ] P(β) / Z

  P(D | β) is a product of independent likelihoods for each observation (xi, yi)

  Using the Gaussian expressions:

    P(β | D) = 1/Z' ∏i=1..n e^{−1/(2σ²) (yi − xi⊤β)²} · e^{−λ/(2σ²) ||β||²}

    −log P(β | D) = 1/(2σ²) [ ∑i=1..n (yi − xi⊤β)² + λ||β||² ] + log Z'

    −log P(β | D) ∝ L^ridge(β)

1st insight: The neg-log posterior −log P(β | D) is proportional to the cost function L^ridge(β)!
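A small numerical check of this first insight, as a sketch on assumed toy data (not from the lecture): −log P(β | D) and L^ridge(β) differ only by the constant log Z' and the scale 1/(2σ²), so their differences between any two parameter vectors agree.

```python
import numpy as np

def L_ridge(beta, X, y, lam):
    """Classical ridge cost: squared errors plus lambda * ||beta||^2."""
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

def neg_log_posterior(beta, X, y, lam, sigma):
    """-log P(beta|D), up to the constant log Z'."""
    return L_ridge(beta, X, y, lam) / (2 * sigma ** 2)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(10, 3)), rng.normal(size=10)   # toy data (assumed)
b1, b2 = rng.normal(size=3), rng.normal(size=3)        # two arbitrary parameter vectors
sigma, lam = 0.5, 1.0

# Both differences should print the same number:
print(neg_log_posterior(b1, X, y, lam, sigma) - neg_log_posterior(b2, X, y, lam, sigma))
print((L_ridge(b1, X, y, lam) - L_ridge(b2, X, y, lam)) / (2 * sigma ** 2))
```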
Ridge regression as Bayesian inference
• Let us compute P(β | D) explicitly:

    P(β | D) = 1/Z' ∏i=1..n e^{−1/(2σ²) (yi − xi⊤β)²} · e^{−λ/(2σ²) ||β||²}
             = 1/Z' e^{−1/(2σ²) ∑i (yi − xi⊤β)²} · e^{−λ/(2σ²) ||β||²}
             = 1/Z' e^{−1/(2σ²) [(y − Xβ)⊤(y − Xβ) + λβ⊤β]}
             = 1/Z' e^{−1/2 [ 1/σ² y⊤y + 1/σ² β⊤(X⊤X + λI)β − 2/σ² β⊤X⊤y ]}
             = N(β | β̂, Σ)

  This is a Gaussian with covariance and mean

    Σ = σ² (X⊤X + λI)⁻¹ ,    β̂ = 1/σ² Σ X⊤y = (X⊤X + λI)⁻¹ X⊤y

• 2nd insight: The mean β̂ is exactly the classical argmin_β L^ridge(β).

• 3rd insight: The Bayesian approach not only gives a mean/optimal β̂, but also a variance Σ of that estimate. (Cp. slide 02:13!)
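A minimal NumPy sketch of these closed-form expressions; the toy data and the ground-truth weights are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, lam = 50, 4, 0.3, 1.0                     # assumed sizes and hyperparameters
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, sigma, size=n)

A = X.T @ X + lam * np.eye(d)
Sigma = sigma ** 2 * np.linalg.inv(A)                  # posterior covariance Sigma
beta_hat = np.linalg.solve(A, X.T @ y)                 # posterior mean = classical ridge solution
```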
Predicting with an uncertain β
• Suppose we want to make a prediction at x. We can compute the predictive distribution over a new observation y* at x*:

    P(y* | x*, D) = ∫ P(y* | x*, β) P(β | D) dβ
                  = ∫ N(y* | φ(x*)⊤β, σ²) N(β | β̂, Σ) dβ
                  = N(y* | φ(x*)⊤β̂, σ² + φ(x*)⊤Σφ(x*))

  Note: for f(x) = φ(x)⊤β, we have P(f(x) | D) = N(f(x) | φ(x)⊤β̂, φ(x)⊤Σφ(x)) without the σ²

• So, y* is Gaussian distributed around the mean prediction φ(x*)⊤β̂:

  (figure from Bishop, p. 176)
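A sketch of this predictive distribution, assuming the identity feature map φ(x) = x and reusing beta_hat, Sigma, and sigma as computed in the previous sketch (illustrative, not the lecture's code):

```python
import numpy as np

def predict(x_star, beta_hat, Sigma, sigma):
    """Mean and variance of y* at x*: N(y* | phi^T beta_hat, sigma^2 + phi^T Sigma phi)."""
    phi = x_star                          # phi(x) = x, i.e. plain linear features (assumption)
    mean = phi @ beta_hat
    var = sigma ** 2 + phi @ Sigma @ phi  # drop sigma^2 to get the variance of f(x) itself
    return mean, var
```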
Wrapup of Bayesian Ridge regression
• 1st insight: The neg-log posterior −log P(β | D) is proportional to the cost function L^ridge(β).

  This is a very common relation: optimization costs correspond to neg-log probabilities; probabilities correspond to exp-neg costs.

• 2nd insight: The mean β̂ is exactly the classical argmin_β L^ridge(β).

  More generally, the most likely parameter argmax_β P(β | D) is also the least-cost parameter argmin_β L(β). In the Gaussian case, the most likely β is also the mean.

• 3rd insight: The Bayesian inference approach not only gives a mean/optimal β̂, but also a variance Σ of that estimate.

  This is a core benefit of the Bayesian view: It naturally provides a probability distribution over predictions ("error bars"), not only a single prediction.
Kernel Bayesian Ridge Regression
• As in the classical case, we can consider arbitrary features φ(x)
• ... or directly use a kernel k(x, x′):

    P(f(x) | D) = N(f(x) | φ(x)⊤β̂, φ(x)⊤Σφ(x))

    φ(x)⊤β̂ = φ(x)⊤X⊤(XX⊤ + λI)⁻¹ y
            = κ(x) (K + λI)⁻¹ y

    φ(x)⊤Σφ(x) = φ(x)⊤ σ² (X⊤X + λI)⁻¹ φ(x)
               = σ²/λ φ(x)⊤φ(x) − σ²/λ φ(x)⊤X⊤(XX⊤ + λI_k)⁻¹ X φ(x)
               = σ²/λ k(x, x) − σ²/λ κ(x) (K + λI_n)⁻¹ κ(x)⊤

  3rd line: as on slide 05:2. 2nd-to-last line: Woodbury identity (A + UBV)⁻¹ = A⁻¹ − A⁻¹U(B⁻¹ + VA⁻¹U)⁻¹VA⁻¹ with A = λI.

• In standard conventions λ = σ², i.e. P(β) = N(β | 0, 1)
  – Regularization: scale the covariance function (or features)
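A kernelized sketch of these posterior formulas with the standard convention λ = σ²; the squared-exponential kernel and all numerical values are assumptions made for illustration:

```python
import numpy as np

def rbf(A, B, length=1.0):
    """Squared-exponential kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / length**2)

def kernel_bayes_ridge(X, y, Xs, sigma=0.1):
    """Posterior mean and variance of f at test points Xs (with lambda = sigma^2)."""
    lam = sigma ** 2
    K = rbf(X, X)                                              # Gram matrix K
    kappa = rbf(Xs, X)                                         # kappa(x) for each test point
    mean = kappa @ np.linalg.solve(K + lam * np.eye(len(X)), y)   # kappa (K + lam I)^-1 y
    V = np.linalg.solve(K + lam * np.eye(len(X)), kappa.T)        # (K + lam I)^-1 kappa^T
    var = rbf(Xs, Xs).diagonal() - np.sum(kappa * V.T, axis=1)    # k(x,x) - kappa (K+lam I)^-1 kappa^T
    return mean, var

# Example usage on toy 1-D data (assumed):
X = np.linspace(-3, 3, 8)[:, None]; y = np.sin(X).ravel()
Xs = np.linspace(-4, 4, 50)[:, None]
mean, var = kernel_bayes_ridge(X, y, Xs)
```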
Gaussian Processes
are equivalent to Kernelized Bayesian Ridge Regression
(see also Welling: "Kernel Ridge Regression" Lecture Notes; Rasmussen & Williams, sections 2.1 & 6.2; Bishop, sections 3.3.3 & 6)

• But it is insightful to introduce them again from the "function space view": GPs define a probability distribution over functions; they are the infinite-dimensional generalization of Gaussian vectors
Gaussian Processes – function prior
• The function space view

    P(f | D) = P(D | f) P(f) / P(D)

• A Gaussian Process prior P(f) defines a probability distribution over functions:
  – A function is an infinite-dimensional object – how could we define a Gaussian distribution over functions?
  – For every finite set {x1, .., xM}, the function values f(x1), .., f(xM) are Gaussian distributed with mean and covariance

      E{f(xi)} = µ(xi)   (often zero)
      cov{f(xi), f(xj)} = k(xi, xj)

    Here, k(·, ·) is called the covariance function

• Second, for Gaussian Processes we typically have a Gaussian data likelihood P(D | f), namely

    P(y | x, f) = N(y | f(x), σ²)
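A sketch of what such a function prior looks like in practice: sample functions from a zero-mean GP evaluated on a finite grid (a squared-exponential covariance function is an assumed choice here):

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.linspace(-5, 5, 100)[:, None]           # finite set of input points
K = np.exp(-0.5 * (xs - xs.T) ** 2)             # covariance function k(x, x') on the grid
K += 1e-8 * np.eye(len(xs))                     # small jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)   # three functions drawn from P(f)
```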
Gaussian Processes – function posterior
• The posterior P(f | D) is also a Gaussian Process, with the following mean of f(x) and covariance of f(x) and f(x′) (based on slide 10, with λ = σ²):

    E{f(x) | D} = κ(x) (K + λI)⁻¹ y + µ(x)

    cov{f(x), f(x′) | D} = k(x, x′) − κ(x) (K + λI_n)⁻¹ κ(x′)⊤
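A sketch of this posterior on a toy 1-D problem, again with λ = σ², a zero mean function, and an assumed squared-exponential kernel; it evaluates the posterior mean and covariance on a grid and draws sample functions from them:

```python
import numpy as np

def k(a, b):
    """Squared-exponential covariance function on 1-D inputs (assumed choice)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.array([-3.0, 0.0, 2.0]); y = np.array([0.5, -1.0, 1.5])   # toy observations (assumed)
sigma = 0.1                                                      # noise level, lambda = sigma^2
xs = np.linspace(-5, 5, 100)                                     # grid of test inputs

Kinv = np.linalg.inv(k(X, X) + sigma ** 2 * np.eye(len(X)))
mean = k(xs, X) @ Kinv @ y                                       # E{f(x)|D}  (mu(x) = 0 here)
cov = k(xs, xs) - k(xs, X) @ Kinv @ k(X, xs)                     # cov{f(x), f(x')|D}
samples = np.random.default_rng(4).multivariate_normal(mean, cov + 1e-8 * np.eye(100), 3)
```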
Gaussian Processes
(from Rasmussen & Williams)
GP: different covariance functions
(from Rasmussen & Williams)
• These are examples from the γ-exponential covariance function

    k(x, x′) = exp{−|(x − x′)/l|^γ}
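A one-line sketch of this covariance family; γ = 2 recovers the squared-exponential kernel and γ = 1 the absolute-exponential (Ornstein–Uhlenbeck) kernel:

```python
import numpy as np

def gamma_exp_kernel(x, x_prime, l=1.0, gamma=2.0):
    """gamma-exponential covariance: k(x, x') = exp(-|(x - x')/l|^gamma), for 0 < gamma <= 2."""
    return np.exp(-np.abs((x - x_prime) / l) ** gamma)
```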
GP: derivative observations
(from Rasmussen & Williams)
• Bayesian Kernel Ridge Regression = Gaussian Process
• GPs have become a standard regression method
• If exact GP inference is not efficient enough, many approximations exist, e.g. sparse and pseudo-input GPs
GP classification = Bayesian (Kernel) Logistic Regression
Bayesian Logistic Regression (binary case)
• f now defines a discriminative function:

    P(X) = arbitrary
    P(β) = N(β | 0, 2/λ) ∝ exp{−λ||β||²}
    P(Y=1 | X, β) = σ(β⊤φ(x))

• Recall

    L^logistic(β) = −∑i=1..n log p(yi | xi) + λ||β||²

• Again, the parameter posterior is

    P(β | D) ∝ P(D | β) P(β) ∝ exp{−L^logistic(β)}
Bayesian Logistic Regression

• Use the Laplace approximation (2nd-order Taylor expansion of L) at β* = argmin_β L(β):

    L(β) ≈ L(β*) + β̄⊤∇ + ½ β̄⊤H β̄ ,   β̄ = β − β*

  At β* the gradient ∇ = 0 and L(β*) = const. Therefore

    P̃(β | D) ∝ exp{−½ β̄⊤H β̄}
    ⇒ P(β | D) ≈ N(β | β*, H⁻¹)

• Then the predictive distribution of the discriminative function is also Gaussian!

    P(f(x) | D) = ∫ P(f(x) | β) P(β | D) dβ
                ≈ ∫ N(f(x) | φ(x)⊤β, 0) N(β | β*, H⁻¹) dβ
                = N(f(x) | φ(x)⊤β*, φ(x)⊤H⁻¹φ(x)) =: N(f(x) | f*, s²)

• The predictive distribution over the label y ∈ {0, 1}:

    P(y(x)=1 | D) = ∫ σ(f(x)) P(f(x) | D) df
                  ≈ σ((1 + s²π/8)^{−1/2} f*)

  which uses a probit approximation of the convolution.
  → The variance s² pushes the predictive class probabilities towards 0.5.
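A sketch of this whole pipeline, assuming plain features φ(x) = x, labels y ∈ {0, 1}, and Newton optimisation of L^logistic; the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_logistic(X, y, lam=1.0, iters=50):
    """Return the MAP weights beta* and the Hessian H of L^logistic at beta*."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):                                 # Newton steps on the cost
        p = sigmoid(X @ beta)
        grad = X.T @ (p - y) + 2 * lam * beta              # gradient of L^logistic
        H = X.T @ (X * (p * (1 - p))[:, None]) + 2 * lam * np.eye(d)
        beta -= np.linalg.solve(H, grad)
    return beta, H

def predict_prob(x, beta, H):
    """Probit approximation: P(y=1 | x, D) ~= sigmoid((1 + s^2 pi/8)^{-1/2} f*)."""
    f_star = x @ beta                                      # mean discriminative value f*
    s2 = x @ np.linalg.solve(H, x)                         # s^2 = phi(x)^T H^{-1} phi(x)
    return sigmoid(f_star / np.sqrt(1 + s2 * np.pi / 8))
```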
Kernelized Bayesian Logistic Regression
• As with Kernel Logistic Regression, the MAP discriminative function f* can be found by iterating the Newton method ↔ iterating GP estimation on a re-weighted data set.
• The rest is as above.
Kernel Bayesian Logistic Regression
is equivalent to Gaussian Process Classification
• GP classification has become a standard classification method when the prediction needs to be a meaningful probability that takes model uncertainty into account.
Bayesian Neural Networks
Bayesian Neural Networks
• Simple ways to get uncertainty estimates:
  – Train ensembles of networks (e.g. bootstrap ensembles)
  – Treat the output layer fully probabilistically (treat the trained NN body as a feature vector φ(x), and apply Bayesian Ridge/Logistic Regression on top of that)

• Ways to treat NNs inherently Bayesian:
  – Infinite single-layer NN → GP (classical work in the 80s/90s)
  – Putting priors over weights ("Bayesian NNs", Neal, MacKay, 90s)
  – Dropout (much more recent, see papers below)
• Read
  Gal & Ghahramani: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning (ICML 2016)
  Damianou & Lawrence: Deep Gaussian processes (AISTATS 2013)
Dropout in NNs as Deep GPs
• Deep GPs are essentially a chaining of Gaussian Processes
  – The mapping from each layer to the next is a GP
  – Each GP could have a different prior (kernel)

• Dropout in NNs
  – Dropout leads to randomized prediction
  – One can estimate the mean prediction from T dropout samples (MC estimate)
  – Or one can estimate the mean prediction by averaging the weights of the network ("standard dropout")
  – Equally, one can MC-estimate the variance from samples
  – Gal & Ghahramani show that a Dropout NN is a Deep GP (with a very special kernel), and the "correct" predictive variance is this MC estimate plus pl²/(2nλ) (kernel length scale l, regularization λ, dropout probability p, and n data points)
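A schematic sketch of the MC part of this recipe: keep dropout active at test time and use T stochastic forward passes to estimate the predictive mean and variance. The tiny network, its random "trained" weights, and the dropout rate are pure illustration assumptions, and the additional pl²/(2nλ) variance term is omitted here:

```python
import numpy as np

rng = np.random.default_rng(5)
W1, W2 = rng.normal(size=(10, 1)), rng.normal(size=(1, 10))   # stand-in "trained" weights (assumed)

def forward(x, p=0.5):
    """One stochastic forward pass with dropout kept on (ReLU hidden layer)."""
    h = np.maximum(0, W1 @ x)
    mask = rng.random(h.shape) < (1 - p)                      # random dropout mask
    return (W2 @ (h * mask / (1 - p))).item()

def mc_dropout_predict(x, T=100):
    """MC estimates of the predictive mean and variance from T dropout samples."""
    ys = np.array([forward(x) for _ in range(T)])
    return ys.mean(), ys.var()

mean, var = mc_dropout_predict(np.array([0.7]))               # example query point
```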
No Free Lunch

• Averaged over all problem instances, any algorithm performs equally. (E.g. equal to random.)
  – "there is no one model that works best for every problem"

  Igel & Toussaint: On Classes of Functions for which No Free Lunch Results Hold (Information Processing Letters 2003)

• Rigorous formulations formalize this "average over all problem instances", e.g. by assuming a uniform prior over problems
  – In black-box optimization, a uniform distribution over underlying objective functions f(x)
  – In machine learning, a uniform distribution over the hidden true function f(x)
  ... and NFL always considers non-repeating queries.

• But what does a uniform distribution over functions mean?

• NFL is trivial: when any previous query yields NO information at all about the results of future queries, anything is exactly as good as random guessing
Conclusions

• Probabilistic inference is a very powerful concept!
  – Inferring about the world given data
  – Learning, decision making, and reasoning can be viewed as forms of (probabilistic) inference

• We introduced Bayes' Theorem as the fundamental form of probabilistic inference

• Marrying Bayes with (Kernel) Ridge (Logistic) regression yields
  – Gaussian Processes
  – Gaussian Process classification

• We can estimate uncertainty also for NNs
  – Dropout
  – Probabilistic weights and variational approximations; Deep GPs

• No Free Lunch for ML!