TRANSCRIPT
Bayesian Reasoning and Deep Learning
Shakir Mohamed
@shakir_za · shakirm.com
!""#$%&'
9 October 2015
Abstract
Deep learning and Bayesian machine learning are currently two of the most active areas of machine learning research. Deep learning provides a powerful class of models and an easy framework for learning that now provides state-of-the-art methods for applications ranging from image classification to speech recognition. Bayesian reasoning provides a powerful approach for information integration, inference and decision making that has established it as the key tool for data-efficient learning, uncertainty quantification and robust model composition, widely used in applications ranging from information retrieval to large-scale ranking. Each of these research areas has shortcomings that can be effectively addressed by the other, pointing towards a needed convergence of these two areas of machine learning; the complementary aspects of these two research areas are the focus of this talk. Using the tools of auto-encoders and latent variable models, we shall discuss some of the ways in which our machine learning practice is enhanced by combining deep learning with Bayesian reasoning. This is an essential, and ongoing, convergence that will only continue to accelerate, and it provides some of the most exciting prospects, some of which we shall discuss, for contemporary machine learning research.
Deep Learning
+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximations; conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do model selection and complexity penalisation.
A framework for constructing flexible models.
Bayesian Reasoning
+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference, leading to expensive computation or long simulation times.
A framework for inference and decision making.
Two Streams of Machine Learning
Bayesian Reasoning:
+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference; computationally expensive or long simulation times.

Deep Learning:
+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximation; conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do selection and complexity penalisation.
Outline
Bayesian Reasoning + Deep Learning: complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Case study using auto-encoders and latent variable models
   • Approximate Bayesian inference
3. What else can we do?
   • Semi-supervised learning, classification, better inference and more.
A (Statistical) Review of Deep Learning
Generalised Linear Regression
\eta = w^\top x + b
p(y | x) = p(y | g(\eta); \theta)

Maximum likelihood estimation: optimise the negative log-likelihood
\mathcal{L} = -\log p(y | g(\eta); \theta)
"g(.) is an inverse link functionthat wellrefer to as an activation function.
Target      | Regression  | Link                               | Inverse link                     | Activation
Real        | Linear      | Identity                           | Identity                         |
Binary      | Logistic    | Logit: log(mu / (1 - mu))          | Sigmoid: 1 / (1 + exp(-eta))     | Sigmoid
Binary      | Probit      | Inverse Gaussian CDF: Phi^{-1}(mu) | Gaussian CDF: Phi(eta)           | Probit
Binary      | Gumbel      | Compl. log-log: log(-log(mu))      | Gumbel CDF: e^{-e^{-x}}          |
Binary      | Logistic    | Hyperbolic tangent                 | tanh(eta)                        | Tanh
Categorical | Multinomial | Multinomial logit                  | e^{eta_i} / sum_j e^{eta_j}      | Softmax
Counts      | Poisson     | log(mu)                            | exp(nu)                          |
Counts      | Poisson     | sqrt(mu)                           | nu^2                             |
Non-neg.    | Gamma       | Reciprocal: 1/mu                   | 1/nu                             |
Sparse      | Tobit       | max                                | max(0; nu)                       | ReLU
Ordered     | Ordinal     | Cumulative logit                   | sigma(phi_k - nu)                |
"The basic function can be any linearfunction, e.g., a!ne, convolution.
A (Statistical) Review of Deep Learning
A general framework for building non-linear, parametric models.
Recursive Generalised Linear Regression
• Recursively compose the basic linear functions.
• This gives a deep neural network:
\mathbb{E}[y] = h_L \circ \ldots \circ h_1 \circ h_0(x)
Problem: overfitting of MLE, leading to limited generalisation.
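A sketch of the recursion (weights are random and untrained here, and all sizes are arbitrary assumptions): each layer is a basic linear function followed by an activation, and composing them gives the expectation above.

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(W, b, act):
    return lambda x: act(W @ x + b)    # one recursive GLM step: act(Wx + b)

relu = lambda v: np.maximum(0, v)
sigmoid = lambda v: 1 / (1 + np.exp(-v))

h0 = layer(rng.normal(size=(16, 8)), np.zeros(16), relu)
h1 = layer(rng.normal(size=(16, 16)), np.zeros(16), relu)
hL = layer(rng.normal(size=(1, 16)), np.zeros(1), sigmoid)  # output inverse link

x = rng.normal(size=8)
Ey = hL(h1(h0(x)))                     # E[y] = h_L o h_1 o h_0 (x)
```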
A (Statistical) Review of Deep Learning
Regularisation Strategies for Deep Networks
• Regularisation is essential to overcome the limitations of maximum likelihood estimation.
• Also known as penalised regression or shrinkage.
• A wide range of regularisation techniques is available:
   • Large data sets; input noise/jittering and data augmentation/expansion.
   • L2/L1 regularisation (weight decay, Gaussian prior).
   • Binary or Gaussian dropout.
   • Batch normalisation.
A more robust loss function: use MAP estimation instead.
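For example, L2 regularisation can be read as MAP estimation with a zero-mean Gaussian prior on the weights. A minimal sketch (the prior precision `lam` is an assumed hyperparameter):

```python
import numpy as np

def map_gradient(grad_nll_w, w, lam=1e-2):
    # Gradient of [-log p(y|w) - log p(w)] with prior p(w) = N(0, I / lam):
    # the extra lam * w term is exactly L2 weight decay.
    return grad_nll_w + lam * w
```

Dropping this into the gradient step of the logistic-regression sketch above would turn its MLE into a MAP estimate.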
More Robust Learning
MAP estimators and limitations
"Power of MAP estimators is that they providesome robustness to overfitting.
"Creates sensitivities to parameterisation.
1. Sensitivities a!ect gradients and can make learning hard
2. Still no way to measure confidence of our model.
Can generate frequentist confidence intervalsand bootstrap estimates.
Invariant MAP estimators and exploiting naturalgradients, trust region methods and other
improved optimisation.
Towards Bayesian Reasoning
Issues arise as a consequence of:
• Reasoning only about the most likely solution, and
• Not maintaining knowledge of the underlying variability (and averaging over this).
Proposed solutions have not fully dealt with these underlying issues.
Given this powerful model class and invaluable tools for regularisation and optimisation, let us develop a pragmatic Bayesian approach for probabilistic reasoning in deep networks: Bayesian reasoning over some, but not all, parts of our models (yet).
Outline
Bayesian Reasoning + Deep Learning: complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Case study using auto-encoders and latent variable models
   • Approximate Bayesian inference
3. What else can we do?
   • Semi-supervised learning, classification, better inference and more.
Dimensionality Reduction and Auto-encoders
Unsupervised learning and auto-encoders:
• A generic tool for dimensionality reduction and feature extraction.
• Minimise reconstruction error using an encoder and a decoder:
(Diagram: Data y → Encoder f(.) → z = f(y) → Decoder g(.) → y* = g(z))
\mathcal{L} = \| y - g(f(y)) \|_2^2, or probabilistically, \mathcal{L} = -\log p(y | g(z))
+ Non-linear dimensionality reduction using deep networks for encoder and decoder.
+ Easy to implement as a single computational graph and train using SGD.
- No natural handling of missing data.
- No representation of the variability of the representation space.
Dimensionality Reduction and Auto-encoders
Some questions about auto-encoders:
• What is the model we are interested in?
• Why use an encoder?
• How do we regularise?
(Diagram: Data y → Encoder f(.) → z = f(y) → Decoder g(.) → y* = g(z))
Best to be explicit about the:
• Probabilistic model of interest, and
• Mechanism we use for inference.
Density Estimation and Latent Variable Models
Latent variable models:
• A generic and flexible model class for density estimation.
• Specify a generative process that gives rise to the data.
(Graphical model: latent z → observed y with parameters W, \mu, \Sigma; plate over n = 1, ..., N)
Latent Gaussian models:
• Probabilistic PCA, factor analysis (FA), Bayesian exponential family PCA (BXPCA).
BXPCA:
Latent variable: z \sim \mathcal{N}(z | \mu, \Sigma)
Observation model: y \sim \mathrm{Expon}(y | \eta), with exponential-family natural parameters \eta = Wz + b
Use our knowledge of deep learning to design even richer models.
Deep Generative Models
A rich extension of the previous model using deep neural networks. E.g., non-linear factor analysis, non-linear Gaussian belief networks, deep latent Gaussian models (DLGMs).
(Graphical model: stochastic layers z_2 → z_1 → y with deterministic layers h_1, ..., h_4 and weights W, W_1; plate over n = 1, ..., N)
DLGM:
Latent variables (stochastic layers): z_l \sim \mathcal{N}(z_l | f_l(z_{l+1}), \Sigma_l), with f_l(z) = \sigma(W h(z) + b)
Deterministic layers: h_i(x) = \sigma(A x + c)
Observation model: y \sim \mathrm{Expon}(y | \eta), \eta = W h_1 + b
Can also use a non-exponential-family observation model.
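A sketch of ancestral sampling from a two-layer DLGM, following the equations above; unit covariances, random untrained weights and all layer sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(W, b, z):                        # f_l(z) = sigma(W h(z) + b), sigma = tanh
    return np.tanh(W @ np.tanh(z) + b)

z2 = rng.normal(size=4)                                  # z_2 ~ N(0, I)
z1 = f(rng.normal(size=(4, 4)), np.zeros(4), z2) \
     + rng.normal(size=4)                                # z_1 ~ N(f_1(z_2), I)
eta = rng.normal(size=(6, 4)) @ np.tanh(z1)              # eta = W h_1 + b (b = 0 here)
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))              # Bernoulli observation model
```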
Deep Latent Gaussian Models
(Graphical model as before: latent z → deterministic layers h → observed y, weights W; plate over n = 1, ..., N)
Our inferential tasks are:
1. Explain the data: p(z | y, W) \propto p(y | z, W) p(z)
2. Make predictions: p(y^* | y) = \int p(y^* | z, W) p(z | y, W) dz
3. Choose the best model: p(y | W) = \int p(y | z, W) p(z) dz
Variational Inference
Use tools from approximate inference to handle the intractable integrals:

\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]
(the first term is the reconstruction, the second the penalty)

• Fit q(z) within an approximation class by minimising \mathrm{KL}[q(z|y) \| p(z|y)] to the true posterior.
• Reconstruction cost: the expected log-likelihood measures how well samples from q(z) are able to explain the data y.
• Penalty: the explanation of the data q(z) doesn't deviate too far from your beliefs p(z); Occam's razor.
The penalty is derived from your model and does not need to be designed.
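As an illustration (assumed for this transcript, not from the talk), both terms can be computed directly for a one-dimensional Gaussian q(z) = N(m, s^2) against a standard-normal prior, where the KL penalty is available in closed form; the toy log-likelihood passed in is an arbitrary choice.

```python
import numpy as np

def free_energy(m, s, log_lik, n_samples=1000, seed=3):
    z = m + s * np.random.default_rng(seed).normal(size=n_samples)
    recon = np.mean(log_lik(z))                       # E_q[log p(y|z)]
    kl = 0.5 * (m**2 + s**2 - 1.0 - 2.0 * np.log(s))  # KL[N(m, s^2) || N(0, 1)]
    return recon - kl

print(free_energy(0.5, 0.8, lambda z: -0.5 * (z - 1.0)**2))
```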
Amortised Variational Inference
Approximate posterior distribution q(z): the best match to the true posterior p(z|y), one of the unknown inferential quantities of interest to us.
(Diagram: Data y → Inference network / Encoder q_\phi(z|y) → z ~ q_\phi(z|y))
• Inference network: q is an encoder or inverse model.
• The parameters of q are now a set of global parameters used for inference of all data points, test and train. We amortise (spread) the cost of inference over all data.
• Encoders provide an efficient mechanism for amortised posterior inference:

\mathcal{F}(y, q_\phi) = \mathbb{E}_{q_\phi(z)}[\log p(y|z)] - \mathrm{KL}[q_\phi(z) \| p(z)]
(reconstruction and penalty as before; q_\phi is the approximate posterior)
Auto-encoders and Inference in DGMs
Stochastic encoder-decoder systems implement variational inference.
• Model (decoder): likelihood p(y|z).
• Inference (encoder): variational distribution q(z|y).

\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]

(Diagram: Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y* ~ p(y|z))
Variational auto-encoder: this specific combination of variational inference in latent variable models using inference networks.
But don't forget what your model is, and what inference you use.
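A minimal structural sketch of the bound for one data point, assuming a one-hidden-layer inference network, a diagonal Gaussian q(z|y), a Bernoulli decoder, and random untrained weights (all sizes are assumptions); in practice all parameters are trained jointly by SGD on F.

```python
import numpy as np

rng = np.random.default_rng(4)
D, H, K = 8, 16, 2
We, Wm, Ws = rng.normal(size=(H, D)), rng.normal(size=(K, H)), rng.normal(size=(K, H))
Wd = rng.normal(size=(D, K))

def vae_bound(y):
    h = np.tanh(We @ y)                              # inference network q(z|y)
    m, log_s = Wm @ h, Ws @ h
    z = m + np.exp(log_s) * rng.normal(size=K)       # reparameterised sample z ~ q(z|y)
    logits = Wd @ z                                  # decoder / model p(y|z)
    recon = np.sum(y * logits - np.logaddexp(0, logits))           # Bernoulli log-lik
    kl = 0.5 * np.sum(m**2 + np.exp(2 * log_s) - 1.0 - 2 * log_s)  # KL[q || N(0, I)]
    return recon - kl

y = rng.binomial(1, 0.5, size=D).astype(float)
print(vae_bound(y))                    # a single-sample estimate of F(y, q)
```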
What Have We Gained?
(Diagram: Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y* ~ p(y|z))
\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]
+ Transformed auto-encoders into more interesting deep generative models.
+ A rich new class of density estimators built with non-linear models.
+ Used a principled approach for deriving loss functions that automatically include appropriate penalty functions.
+ Explained how an encoder enters into our models and why this is a good idea.
+ Able to answer all our desired inferential questions.
+ Knowledge of the uncertainty associated with our latent variables.
What Have We Gained? (continued)
(Diagram and bound as above.)
+ Able to score our models and do model selection using the free energy.
+ Can impute missing data under any missingness assumption.
+ Can still combine with natural gradients and improved optimisation tools.
+ Easy implementation: a single computational graph and simple Monte Carlo gradient estimators.
+ Computational complexity the same as any large-scale deep learning system.
A true marriage of Bayesian Reasoning and Deep Learning
Data Visualisation
(Figure: a DLGM trained on 28x28 MNIST handwritten digits with 500-unit hidden layers. Left: class labels in the 2D latent space. Right: samples from the 2D latent model.)
Visualising MNIST in 3D
(Figure: MNIST embedded in a 3D latent space by a DLGM.)
Data Simulation
(Figure: data samples generated from the DLGM.)
Missing Data Imputation
(Figure: DLGM imputation. Original data with 10% or 50% of pixels observed; the unobserved pixels are shown alongside the inferred image.)
Outline
Bayesian Reasoning + Deep Learning: complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Auto-encoders and latent variable models
   • Approximate and variational inference
3. What else can we do?
   • Semi-supervised learning, recurrent networks, classification, better inference and more.
Semi-supervised Learning
We can extend the marriage of Bayesian reasoning and deep learning to the problem of semi-supervised classification.
(Graphical model: a semi-supervised DLGM with latent z, observation x and a partially observed label y, with parameters W; plate over n = 1, ..., N)
Analogical Reasoning
(Figure: analogies generated by the semi-supervised DLGM.)
Generative Models with Attention
We can also combine other tools from deep learning to design even more powerful generative models: recurrent networks and attention.
(Figure: the DRAW architecture. A feed-forward VAE (encoder FNN → sample z → decoder FNN, with Q(z|x) and P(x|z)) unrolled in time: at each step the encoder RNN reads from the image, a latent is sampled from Q(z_{t+1} | x, z_{1:t}), and the decoder RNN writes to the canvas c_t. Encoding is inference; decoding is the generative model.)
Uncertainty on Model Parameters
We can also place uncertainty on the model parameters themselves: Bayesian neural networks.
(Graphical model: a Bayesian neural network x → h_1 → h_2 → y, with the weights W_1, W_2, W_3 treated as random variables with their own hyperparameters; plate over n = 1, ..., N)
In Review
• Bayesian reasoning: a general framework for inference that allows us to account for uncertainty, and a principled approach to regularisation and model scoring.
• Deep learning: a framework for building highly flexible non-linear parametric models, but one that still needs regularisation and accounting for uncertainty and lack of knowledge.
• We combined Bayesian reasoning with auto-encoders and showed just how much can be gained by a marriage of these two streams of machine learning research.
(Diagram: Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y* ~ p(y|z))
Thanks to many people:
Danilo Rezende, Ivo Danihelka, Karol Gregor, Charles Blundell, Theophane Weber, Andriy Mnih, Daan Wierstra (Google DeepMind); Durk Kingma, Max Welling (U. Amsterdam)
Thank You.
Some References
Probabilistic Deep Learning:
• Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
• Kingma, D. P., and Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
• Mnih, A., and Gregor, K. (2014). Neural variational inference and learning in belief networks. ICML.
• Gregor, K., et al. (2014). Deep autoregressive networks. ICML.
• Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. NIPS, pp. 3581-3589.
• Gregor, K., Danihelka, I., Graves, A., and Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv:1502.04623.
• Rezende, D. J., and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv:1505.05770.
• Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv:1505.05424.
• Hernández-Lobato, J. M., and Adams, R. P. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv:1502.05336.
• Gal, Y., and Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142.
What is a Variational Method?
The variational principle: a general family of methods for approximating complicated densities by a simpler class of densities.
• Deterministic approximation procedures with bounds on probabilities of interest.
• Fit the variational parameters: choose q(z) from an approximation class to minimise \mathrm{KL}[q(z|y) \| p(z|y)] to the true posterior.
From IS to Variational Inference
Integral problem:     \log p(y) = \log \int p(y|z)\, p(z)\, dz
Proposal:             \log p(y) = \log \int \frac{p(y|z)\, p(z)}{q(z)}\, q(z)\, dz
Importance weight:    the integrand is an importance weight under the proposal q(z).
Jensen's inequality:  \log \int p(x)\, g(x)\, dx \geq \int p(x) \log g(x)\, dx, so
                      \log p(y) \geq \int q(z) \log \frac{p(y|z)\, p(z)}{q(z)}\, dz
                                   = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]
This is the variational lower bound.
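A numeric sanity check of the bound (an assumed toy example, not from the talk): with prior z ~ N(0,1) and likelihood y|z ~ N(z,1), the marginal p(y) = N(0,2) is known exactly, and the Monte Carlo bound is tight when q(z) is the true posterior N(y/2, 1/2).

```python
import numpy as np

y, rng = 1.3, np.random.default_rng(5)
log_norm = lambda x, m, v: -0.5 * (np.log(2 * np.pi * v) + (x - m)**2 / v)

def bound(m, v, n=200_000):            # Monte Carlo estimate of the lower bound
    z = m + np.sqrt(v) * rng.normal(size=n)
    return np.mean(log_norm(y, z, 1.0) + log_norm(z, 0.0, 1.0) - log_norm(z, m, v))

print(log_norm(y, 0.0, 2.0))           # exact log p(y)
print(bound(y / 2, 0.5))               # tight: q is the true posterior
print(bound(0.0, 1.0))                 # a looser q gives a smaller value
```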
Minimum Description Length (MDL)

\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]
(data code-length and hypothesis code, respectively)

• Regularity in our data that can be explained with latent variables implies that the data is compressible.
• MDL: inference seen as a problem of compression; we must find the ideal shortest message for our data y, the marginal likelihood.
• We must introduce an approximation to the ideal message.
• Encoder: variational distribution q(z|y). Decoder: likelihood p(y|z).
Stochastic encoder-decoder systems implement variational inference.
(Diagram: stochastic encoder. Data y → Encoder q(z|y) → z ~ q(z|y) → Decoder p(y|z) → y* ~ p(y|z))
Denoising Auto-encoders (DAE)
A DAE is a mechanism for finding representations or features of data (i.e., latent variable explanations).
• Encoder: variational distribution q(z|y). Decoder: likelihood p(y|z).

\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \lambda \Omega(z, y)
(a reconstruction term and a hand-designed penalty \Omega)

(Diagram: stochastic encoder. Data y → Encoder q(z|y) → z ~ q(z|y) → Decoder p(y|z) → y* ~ p(y|z))
Stochastic encoder-decoder systems implement variational inference.
The variational approach requires you to be explicit about your assumptions: the penalty is derived from your model and does not need to be designed.
Amortising the Cost of Inference
• Inference network: q is an encoder or inverse model.
• The parameters of q are now a set of global parameters used for inference of all data points, test and train.
• Share (amortise) the cost of inference over all data.
• Combines easily with mini-batches and Monte Carlo expectations.
• Can jointly optimise variational and model parameters: no need for alternating optimisation.

Contrast with alternating (EM-style) variational optimisation:
Repeat:
  E-step: for n = 1, ..., N:  \lambda_n^* = \arg\max_{\lambda_n} \mathbb{E}_{q_{\lambda_n}(z)}[\log p(y_n|z_n)] - \mathrm{KL}[q_{\lambda_n}(z_n) \| p(z_n)]
  M-step: update \theta using \frac{1}{N} \sum_n \nabla_\theta \log p(y_n|z_n)
Instead of solving this optimisation for every data point n, we can instead use a model.
(Diagram: Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y* ~ p(y|z))
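A one-layer sketch of the amortisation step (the network shape and weights are arbitrary assumptions): a shared inference network maps each y_n directly to its variational parameters, replacing the inner E-step optimisation with a single forward pass.

```python
import numpy as np

rng = np.random.default_rng(6)
W, c = rng.normal(size=(2, 4)), np.zeros(2)   # global variational parameters

def q_params(y_n):
    # One forward pass per data point replaces an inner optimisation loop.
    m, log_s = W @ np.tanh(y_n) + c
    return m, np.exp(log_s)                   # mean and scale of q(z_n | y_n)

print(q_params(rng.normal(size=4)))
```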
Implementing your Variational Algorithm
Variational inference turns integration into optimisation:

-\mathcal{F}(y, q) = \mathbb{E}_q[-\log p(y|z) + \log q(z) - \log p(z)]

• Avoid deriving pages of gradient updates for variational inference.
• Automated tools: differentiation (Theano, Torch7, Stan); message passing (infer.NET, http://infer.net/).
• Ideally we want probabilistic programming using variational inference.
(Figure: a single computational graph. The forward pass evaluates log p(z), log p(x|z) and the entropy H[q(z)] through the inference network q(z|x), the model p(x|z) and the prior p(z); the backward pass propagates gradients through the same graph.)
• Stochastic gradient descent and other preconditioned optimisation.
• The same code can run on GPUs or on distributed clusters.
• Probabilistic models are modular and can easily be combined.
7/26/2019 Bayes Deep
42/43
Bayesian Reasoning and Deep Learning
Stochastic Backpropagation
A Monte Carlo method that works with continuous latent variables.
• Possible for many distributions with location-scale or other known transformations, such as the CDF.
• Can use any likelihood function; avoids the need for additional lower bounds.
• A low-variance, unbiased estimator of the gradient.
• Can use just one sample from the base distribution.
Original problem:                  \nabla_\theta \mathbb{E}_{q(z)}[f(z)]
Reparameterisation:                z \sim \mathcal{N}(\mu, \sigma^2) \Leftrightarrow z = \mu + \sigma\epsilon, \ \epsilon \sim \mathcal{N}(0, 1)
                                   \mathbb{E}_{\mathcal{N}(0,1)}[f(\mu + \sigma\epsilon)]
Backpropagation with Monte Carlo:  \nabla_{\theta=\{\mu,\sigma\}} \mathbb{E}_{\mathcal{N}(0,1)}[f(\mu + \sigma\epsilon)] = \mathbb{E}_{\mathcal{N}(0,1)}[\nabla_\theta f(\mu + \sigma\epsilon)]
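A sketch of the estimator for a toy integrand f(z) = (z - 2)^2 (an assumed example), where E_q[f(z)] = (mu - 2)^2 + sigma^2, so the exact gradients 2(mu - 2) and 2 sigma can be checked against the Monte Carlo values.

```python
import numpy as np

df = lambda z: 2.0 * (z - 2.0)         # derivative of f(z) = (z - 2)^2

def reparam_grads(mu, sigma, n=100_000, seed=7):
    eps = np.random.default_rng(seed).normal(size=n)
    z = mu + sigma * eps               # z ~ N(mu, sigma^2) by reparameterisation
    return np.mean(df(z)), np.mean(df(z) * eps)   # d/dmu, d/dsigma

print(reparam_grads(0.0, 1.0))         # approx (-4.0, 2.0)
```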
Monte Carlo Control Variate Estimators
A more general Monte Carlo approach that can be used with both discrete and continuous latent variables.

Property of the score function:  \nabla \log q(z|x) = \frac{\nabla q(z|x)}{q(z|x)}
Original problem:                \nabla \mathbb{E}_{q(z)}[\log p(y|z)]
Score-ratio form:                \mathbb{E}_{q(z)}[\log p(y|z) \, \nabla \log q(z|y)]
MCCV estimate:                   \mathbb{E}_{q(z)}[(\log p(y|z) - c) \, \nabla \log q(z|y)]

c is known as a control variate and is used to control the variance of the estimator.
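A sketch of the score-function estimator with a simple constant baseline c (here the sample mean of f, an assumed choice), for the same toy integrand as above; since E[score] = 0, subtracting a constant c leaves the gradient estimate unbiased while reducing its variance.

```python
import numpy as np

def mccv_grad_mu(f, mu, sigma, n=100_000, seed=8):
    z = mu + sigma * np.random.default_rng(seed).normal(size=n)
    score = (z - mu) / sigma**2        # d/dmu log N(z | mu, sigma^2)
    c = np.mean(f(z))                  # control variate / baseline
    return np.mean((f(z) - c) * score)

print(mccv_grad_mu(lambda z: (z - 2.0)**2, 0.0, 1.0))   # approx 2 * (mu - 2) = -4
```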