TRANSCRIPT
Bayesian Reasoning and Deep Learning
Shakir Mohamed
@shakir_za · shakirm.com
!""#$%&'
9 October 2015
Abstract
Deep learning and Bayesian machine learning are currently two of the most active areas of machine learning research. Deep learning provides a powerful class of models and an easy framework for learning that now provides state-of-the-art methods for applications ranging from image classification to speech recognition. Bayesian reasoning provides a powerful approach for information integration, inference and decision making that has established it as the key tool for data-efficient learning, uncertainty quantification and robust model composition, widely used in applications ranging from information retrieval to large-scale ranking. Each of these research areas has shortcomings that can be effectively addressed by the other, pointing towards a needed convergence of these two areas of machine learning; the complementary aspects of these two research areas are the focus of this talk. Using the tools of auto-encoders and latent variable models, we shall discuss some of the ways in which our machine learning practice is enhanced by combining deep learning with Bayesian reasoning. This is an essential, and ongoing, convergence that will only continue to accelerate, and it provides some of the most exciting prospects, some of which we shall discuss, for contemporary machine learning research.
Deep Learning
+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximations; conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do model selection and complexity penalisation.
A framework for constructing flexible models.
Bayesian Reasoning
+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference, leading to expensive computation or long simulation times.
A framework for inference and decision making.
Two Streams of Machine Learning
Bayesian Reasoning:
+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference; computationally expensive or long simulation times.

Deep Learning:
+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximation; conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do selection and complexity penalisation.
Outline
Bayesian Reasoning + Deep Learning: complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Case study using auto-encoders and latent variable models
   • Approximate Bayesian inference
3. What else can we do?
   • Semi-supervised learning, classification, better inference and more.
A (Statistical) Review of Deep Learning
Generalised Linear Regression
\eta = w^\top x + b
p(y | x) = p(y | g(\eta); \theta)

Maximum likelihood estimation: optimise the negative log-likelihood
\mathcal{L} = -\log p(y | g(\eta); \theta)
"g(.) is an inverse link functionthat wellrefer to as an activation function.
Target      | Regression  | Link                               | Inverse link                     | Activation
Real        | Linear      | Identity                           | Identity                         |
Binary      | Logistic    | Logit: log(mu / (1 - mu))          | Sigmoid: 1 / (1 + exp(-eta))     | Sigmoid
Binary      | Probit      | Inverse Gaussian CDF: Phi^{-1}(mu) | Gaussian CDF: Phi(eta)           | Probit
Binary      | Gumbel      | Compl. log-log: log(-log(mu))      | Gumbel CDF: e^{-e^{-x}}          |
Binary      | Logistic    | Hyperbolic tangent                 | tanh(eta)                        | Tanh
Categorical | Multinomial | Multinomial logit                  | e^{eta_i} / sum_j e^{eta_j}      | Softmax
Counts      | Poisson     | log(mu)                            | exp(nu)                          |
Counts      | Poisson     | sqrt(mu)                           | nu^2                             |
Non-neg.    | Gamma       | Reciprocal: 1/mu                   | 1/nu                             |
Sparse      | Tobit       | max                                | max(0; nu)                       | ReLU
Ordered     | Ordinal     | Cumulative logit                   | sigma(phi_k - nu)                |
"The basic function can be any linearfunction, e.g., a!ne, convolution.
A (Statistical) Review of Deep Learning
A general framework for building non-linear, parametric models.
Recursive Generalised Linear Regression
• Recursively compose the basic linear functions.
• This gives a deep neural network:
\mathbb{E}[y] = h_L \circ \ldots \circ h_1 \circ h_0(x)
Problem: overfitting of MLE, leading to limited generalisation.
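A sketch of the recursion (weights are random and untrained here, and all sizes are arbitrary assumptions): each layer is a basic linear function followed by an activation, and composing them gives the expectation above.

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(W, b, act):
    return lambda x: act(W @ x + b)    # one recursive GLM step: act(Wx + b)

relu = lambda v: np.maximum(0, v)
sigmoid = lambda v: 1 / (1 + np.exp(-v))

h0 = layer(rng.normal(size=(16, 8)), np.zeros(16), relu)
h1 = layer(rng.normal(size=(16, 16)), np.zeros(16), relu)
hL = layer(rng.normal(size=(1, 16)), np.zeros(1), sigmoid)  # output inverse link

x = rng.normal(size=8)
Ey = hL(h1(h0(x)))                     # E[y] = h_L o h_1 o h_0 (x)
```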
A (Statistical) Review of Deep Learning
Regularisation Strategies for Deep Networks
• Regularisation is essential to overcome the limitations of maximum likelihood estimation.
• Also known as penalised regression or shrinkage.
• A wide range of regularisation techniques is available:
   • Large data sets; input noise/jittering and data augmentation/expansion.
   • L2/L1 regularisation (weight decay, Gaussian prior).
   • Binary or Gaussian dropout.
   • Batch normalisation.
A more robust loss function: use MAP estimation instead.
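For example, L2 regularisation can be read as MAP estimation with a zero-mean Gaussian prior on the weights. A minimal sketch (the prior precision `lam` is an assumed hyperparameter):

```python
import numpy as np

def map_gradient(grad_nll_w, w, lam=1e-2):
    # Gradient of [-log p(y|w) - log p(w)] with prior p(w) = N(0, I / lam):
    # the extra lam * w term is exactly L2 weight decay.
    return grad_nll_w + lam * w
```

Dropping this into the gradient step of the logistic-regression sketch above would turn its MLE into a MAP estimate.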
More Robust Learning
MAP estimators and limitations
"Power of MAP estimators is that they providesome robustness to overfitting.
"Creates sensitivities to parameterisation.
1. Sensitivities a!ect gradients and can make learning hard
2. Still no way to measure confidence of our model.
Can generate frequentist confidence intervalsand bootstrap estimates.
Invariant MAP estimators and exploiting naturalgradients, trust region methods and other
improved optimisation.
Towards Bayesian Reasoning
Issues arise as a consequence of:
• Reasoning only about the most likely solution, and
• Not maintaining knowledge of the underlying variability (and averaging over this).
Proposed solutions have not fully dealt with these underlying issues.
Given this powerful model class and invaluable tools for regularisation and optimisation, let us develop a pragmatic Bayesian approach for probabilistic reasoning in deep networks: Bayesian reasoning over some, but not all, parts of our models (yet).
Outline
Bayesian Reasoning + Deep Learning: complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Case study using auto-encoders and latent variable models
   • Approximate Bayesian inference
3. What else can we do?
   • Semi-supervised learning, classification, better inference and more.
Dimensionality Reduction and Auto-encoders
Unsupervised learning and auto-encoders:
• A generic tool for dimensionality reduction and feature extraction.
• Minimise reconstruction error using an encoder and a decoder:
(Diagram: Data y → Encoder f(.) → z = f(y) → Decoder g(.) → y* = g(z))
\mathcal{L} = \| y - g(f(y)) \|_2^2, or probabilistically, \mathcal{L} = -\log p(y | g(z))
+ Non-linear dimensionality reduction using deep networks for encoder and decoder.
+ Easy to implement as a single computational graph and train using SGD.
- No natural handling of missing data.
- No representation of the variability of the representation space.
Dimensionality Reduction and Auto-encoders
Some questions about auto-encoders:
• What is the model we are interested in?
• Why use an encoder?
• How do we regularise?
(Diagram: Data y → Encoder f(.) → z = f(y) → Decoder g(.) → y* = g(z))
Best to be explicit about the:
• Probabilistic model of interest, and
• Mechanism we use for inference.
Density Estimation and Latent Variable Models
Latent variable models:
• A generic and flexible model class for density estimation.
• Specify a generative process that gives rise to the data.
(Graphical model: latent z → observed y with parameters W, \mu, \Sigma; plate over n = 1, ..., N)
Latent Gaussian models:
• Probabilistic PCA, factor analysis (FA), Bayesian exponential family PCA (BXPCA).
BXPCA:
Latent variable: z \sim \mathcal{N}(z | \mu, \Sigma)
Observation model: y \sim \mathrm{Expon}(y | \eta), with exponential-family natural parameters \eta = Wz + b
Use our knowledge of deep learning to design even richer models.
Deep Generative Models
A rich extension of the previous model using deep neural networks. E.g., non-linear factor analysis, non-linear Gaussian belief networks, deep latent Gaussian models (DLGMs).
(Graphical model: stochastic layers z_2 → z_1 → y with deterministic layers h_1, ..., h_4 and weights W, W_1; plate over n = 1, ..., N)
DLGM:
Latent variables (stochastic layers): z_l \sim \mathcal{N}(z_l | f_l(z_{l+1}), \Sigma_l), with f_l(z) = \sigma(W h(z) + b)
Deterministic layers: h_i(x) = \sigma(A x + c)
Observation model: y \sim \mathrm{Expon}(y | \eta), \eta = W h_1 + b
Can also use a non-exponential-family observation model.
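A sketch of ancestral sampling from a two-layer DLGM, following the equations above; unit covariances, random untrained weights and all layer sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(W, b, z):                        # f_l(z) = sigma(W h(z) + b), sigma = tanh
    return np.tanh(W @ np.tanh(z) + b)

z2 = rng.normal(size=4)                                  # z_2 ~ N(0, I)
z1 = f(rng.normal(size=(4, 4)), np.zeros(4), z2) \
     + rng.normal(size=4)                                # z_1 ~ N(f_1(z_2), I)
eta = rng.normal(size=(6, 4)) @ np.tanh(z1)              # eta = W h_1 + b (b = 0 here)
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))              # Bernoulli observation model
```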
Deep Latent Gaussian Models
(Graphical model as before: latent z → deterministic layers h → observed y, weights W; plate over n = 1, ..., N)
Our inferential tasks are:
1. Explain the data: p(z | y, W) \propto p(y | z, W) p(z)
2. Make predictions: p(y^* | y) = \int p(y^* | z, W) p(z | y, W) dz
3. Choose the best model: p(y | W) = \int p(y | z, W) p(z) dz
Variational Inference
Use tools from approximate inference to handle the intractable integrals:

\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]
(the first term is the reconstruction, the second the penalty)

• Fit q(z) within an approximation class by minimising \mathrm{KL}[q(z|y) \| p(z|y)] to the true posterior.
• Reconstruction cost: the expected log-likelihood measures how well samples from q(z) are able to explain the data y.
• Penalty: the explanation of the data q(z) doesn't deviate too far from your beliefs p(z); Occam's razor.
The penalty is derived from your model and does not need to be designed.
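As an illustration (assumed for this transcript, not from the talk), both terms can be computed directly for a one-dimensional Gaussian q(z) = N(m, s^2) against a standard-normal prior, where the KL penalty is available in closed form; the toy log-likelihood passed in is an arbitrary choice.

```python
import numpy as np

def free_energy(m, s, log_lik, n_samples=1000, seed=3):
    z = m + s * np.random.default_rng(seed).normal(size=n_samples)
    recon = np.mean(log_lik(z))                       # E_q[log p(y|z)]
    kl = 0.5 * (m**2 + s**2 - 1.0 - 2.0 * np.log(s))  # KL[N(m, s^2) || N(0, 1)]
    return recon - kl

print(free_energy(0.5, 0.8, lambda z: -0.5 * (z - 1.0)**2))
```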
Amortised Variational Inference
Approximate posterior distribution q(z): the best match to the true posterior p(z|y), one of the unknown inferential quantities of interest to us.
(Diagram: Data y → Inference network / Encoder q_\phi(z|y) → z ~ q_\phi(z|y))
• Inference network: q is an encoder or inverse model.
• The parameters of q are now a set of global parameters used for inference of all data points, test and train. We amortise (spread) the cost of inference over all data.
• Encoders provide an efficient mechanism for amortised posterior inference:

\mathcal{F}(y, q_\phi) = \mathbb{E}_{q_\phi(z)}[\log p(y|z)] - \mathrm{KL}[q_\phi(z) \| p(z)]
(reconstruction and penalty as before; q_\phi is the approximate posterior)
Auto-encoders and Inference in DGMs
Stochastic encoder-decoder systems implement variational inference.
• Model (decoder): likelihood p(y|z).
• Inference (encoder): variational distribution q(z|y).

\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]

(Diagram: Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y* ~ p(y|z))
Variational auto-encoder: this specific combination of variational inference in latent variable models using inference networks.
But don't forget what your model is, and what inference you use.
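A minimal structural sketch of the bound for one data point, assuming a one-hidden-layer inference network, a diagonal Gaussian q(z|y), a Bernoulli decoder, and random untrained weights (all sizes are assumptions); in practice all parameters are trained jointly by SGD on F.

```python
import numpy as np

rng = np.random.default_rng(4)
D, H, K = 8, 16, 2
We, Wm, Ws = rng.normal(size=(H, D)), rng.normal(size=(K, H)), rng.normal(size=(K, H))
Wd = rng.normal(size=(D, K))

def vae_bound(y):
    h = np.tanh(We @ y)                              # inference network q(z|y)
    m, log_s = Wm @ h, Ws @ h
    z = m + np.exp(log_s) * rng.normal(size=K)       # reparameterised sample z ~ q(z|y)
    logits = Wd @ z                                  # decoder / model p(y|z)
    recon = np.sum(y * logits - np.logaddexp(0, logits))           # Bernoulli log-lik
    kl = 0.5 * np.sum(m**2 + np.exp(2 * log_s) - 1.0 - 2 * log_s)  # KL[q || N(0, I)]
    return recon - kl

y = rng.binomial(1, 0.5, size=D).astype(float)
print(vae_bound(y))                    # a single-sample estimate of F(y, q)
```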
What Have We Gained?
(Diagram: Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y* ~ p(y|z))
\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]
+ Transformed auto-encoders into more interesting deep generative models.
+ A rich new class of density estimators built with non-linear models.
+ Used a principled approach for deriving loss functions that automatically include appropriate penalty functions.
+ Explained how an encoder enters into our models and why this is a good idea.
+ Able to answer all our desired inferential questions.
+ Knowledge of the uncertainty associated with our latent variables.
What Have We Gained? (continued)
(Diagram and bound as above.)
+ Able to score our models and do model selection using the free energy.
+ Can impute missing data under any missingness assumption.
+ Can still combine with natural gradients and improved optimisation tools.
+ Easy implementation: a single computational graph and simple Monte Carlo gradient estimators.
+ Computational complexity the same as any large-scale deep learning system.
A true marriage of Bayesian Reasoning and Deep Learning
Data Visualisation
(Figure: a DLGM trained on 28x28 MNIST handwritten digits with 500-unit hidden layers. Left: class labels in the 2D latent space. Right: samples from the 2D latent model.)
Visualising MNIST in 3D
(Figure: MNIST embedded in a 3D latent space by a DLGM.)
Data Simulation
(Figure: data samples generated from the DLGM.)
Missing Data Imputation
(Figure: DLGM imputation. Original data with 10% or 50% of pixels observed; the unobserved pixels are shown alongside the inferred image.)
Outline
Bayesian Reasoning + Deep Learning: complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Auto-encoders and latent variable models
   • Approximate and variational inference
3. What else can we do?
   • Semi-supervised learning, recurrent networks, classification, better inference and more.
Semi-supervised Learning
We can extend the marriage of Bayesian reasoning and deep learning to the problem of semi-supervised classification.
(Graphical model: a semi-supervised DLGM with latent z, observation x and a partially observed label y, with parameters W; plate over n = 1, ..., N)
Analogical Reasoning
(Figure: analogies generated by the semi-supervised DLGM.)
Generative Models with Attention
We can also combine other tools from deep learning to design even more powerful generative models: recurrent networks and attention.
(Figure: the DRAW architecture. A feed-forward VAE (encoder FNN → sample z → decoder FNN, with Q(z|x) and P(x|z)) unrolled in time: at each step the encoder RNN reads from the image, a latent is sampled from Q(z_{t+1} | x, z_{1:t}), and the decoder RNN writes to the canvas c_t. Encoding is inference; decoding is the generative model.)
Uncertainty on Model Parameters
We can also place uncertainty on the model parameters themselves: Bayesian neural networks.
(Graphical model: a Bayesian neural network x → h_1 → h_2 → y, with the weights W_1, W_2, W_3 treated as random variables with their own hyperparameters; plate over n = 1, ..., N)
In Review
• Bayesian reasoning: a general framework for inference that allows us to account for uncertainty, and a principled approach to regularisation and model scoring.
• Deep learning: a framework for building highly flexible non-linear parametric models, but one that still needs regularisation and accounting for uncertainty and lack of knowledge.
• We combined Bayesian reasoning with auto-encoders and showed just how much can be gained by a marriage of these two streams of machine learning research.
(Diagram: Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y* ~ p(y|z))
Thanks to many people:
Danilo Rezende, Ivo Danihelka, Karol Gregor, Charles Blundell, Theophane Weber, Andriy Mnih, Daan Wierstra (Google DeepMind); Durk Kingma, Max Welling (U. Amsterdam)
Thank You.
Some References
Probabilistic Deep Learning:
• Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
• Kingma, D. P., and Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
• Mnih, A., and Gregor, K. (2014). Neural variational inference and learning in belief networks. ICML.
• Gregor, K., et al. (2014). Deep autoregressive networks. ICML.
• Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. NIPS, pp. 3581-3589.
• Gregor, K., Danihelka, I., Graves, A., and Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv:1502.04623.
• Rezende, D. J., and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv:1505.05770.
• Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv:1505.05424.
• Hernández-Lobato, J. M., and Adams, R. P. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv:1502.05336.
• Gal, Y., and Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142.
What is a Variational Method?
The variational principle: a general family of methods for approximating complicated densities by a simpler class of densities.
• Deterministic approximation procedures with bounds on probabilities of interest.
• Fit the variational parameters: choose q(z) from an approximation class to minimise \mathrm{KL}[q(z|y) \| p(z|y)] to the true posterior.
From IS to Variational Inference
Integral problem:     \log p(y) = \log \int p(y|z)\, p(z)\, dz
Proposal:             \log p(y) = \log \int \frac{p(y|z)\, p(z)}{q(z)}\, q(z)\, dz
Importance weight:    the integrand is an importance weight under the proposal q(z).
Jensen's inequality:  \log \int p(x)\, g(x)\, dx \geq \int p(x) \log g(x)\, dx, so
                      \log p(y) \geq \int q(z) \log \frac{p(y|z)\, p(z)}{q(z)}\, dz
                                   = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]
This is the variational lower bound.
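A numeric sanity check of the bound (an assumed toy example, not from the talk): with prior z ~ N(0,1) and likelihood y|z ~ N(z,1), the marginal p(y) = N(0,2) is known exactly, and the Monte Carlo bound is tight when q(z) is the true posterior N(y/2, 1/2).

```python
import numpy as np

y, rng = 1.3, np.random.default_rng(5)
log_norm = lambda x, m, v: -0.5 * (np.log(2 * np.pi * v) + (x - m)**2 / v)

def bound(m, v, n=200_000):            # Monte Carlo estimate of the lower bound
    z = m + np.sqrt(v) * rng.normal(size=n)
    return np.mean(log_norm(y, z, 1.0) + log_norm(z, 0.0, 1.0) - log_norm(z, m, v))

print(log_norm(y, 0.0, 2.0))           # exact log p(y)
print(bound(y / 2, 0.5))               # tight: q is the true posterior
print(bound(0.0, 1.0))                 # a looser q gives a smaller value
```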
Minimum Description Length (MDL)

\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \mathrm{KL}[q(z) \| p(z)]
(data code-length and hypothesis code, respectively)

• Regularity in our data that can be explained with latent variables implies that the data is compressible.
• MDL: inference seen as a problem of compression; we must find the ideal shortest message for our data y, the marginal likelihood.
• We must introduce an approximation to the ideal message.
• Encoder: variational distribution q(z|y). Decoder: likelihood p(y|z).
Stochastic encoder-decoder systems implement variational inference.
(Diagram: stochastic encoder. Data y → Encoder q(z|y) → z ~ q(z|y) → Decoder p(y|z) → y* ~ p(y|z))
Denoising Auto-encoders (DAE)
A DAE is a mechanism for finding representations or features of data (i.e., latent variable explanations).
• Encoder: variational distribution q(z|y). Decoder: likelihood p(y|z).

\mathcal{F}(y, q) = \mathbb{E}_{q(z)}[\log p(y|z)] - \lambda \Omega(z, y)
(a reconstruction term and a hand-designed penalty \Omega)

(Diagram: stochastic encoder. Data y → Encoder q(z|y) → z ~ q(z|y) → Decoder p(y|z) → y* ~ p(y|z))
Stochastic encoder-decoder systems implement variational inference.
The variational approach requires you to be explicit about your assumptions: the penalty is derived from your model and does not need to be designed.
Amortising the Cost of Inference
• Inference network: q is an encoder or inverse model.
• The parameters of q are now a set of global parameters used for inference of all data points, test and train.
• Share (amortise) the cost of inference over all data.
• Combines easily with mini-batches and Monte Carlo expectations.
• Can jointly optimise variational and model parameters: no need for alternating optimisation.

Contrast with alternating (EM-style) variational optimisation:
Repeat:
  E-step: for n = 1, ..., N:  \lambda_n^* = \arg\max_{\lambda_n} \mathbb{E}_{q_{\lambda_n}(z)}[\log p(y_n|z_n)] - \mathrm{KL}[q_{\lambda_n}(z_n) \| p(z_n)]
  M-step: update \theta using \frac{1}{N} \sum_n \nabla_\theta \log p(y_n|z_n)
Instead of solving this optimisation for every data point n, we can instead use a model.
(Diagram: Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y* ~ p(y|z))
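A one-layer sketch of the amortisation step (the network shape and weights are arbitrary assumptions): a shared inference network maps each y_n directly to its variational parameters, replacing the inner E-step optimisation with a single forward pass.

```python
import numpy as np

rng = np.random.default_rng(6)
W, c = rng.normal(size=(2, 4)), np.zeros(2)   # global variational parameters

def q_params(y_n):
    # One forward pass per data point replaces an inner optimisation loop.
    m, log_s = W @ np.tanh(y_n) + c
    return m, np.exp(log_s)                   # mean and scale of q(z_n | y_n)

print(q_params(rng.normal(size=4)))
```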
Implementing your Variational Algorithm
Variational inference turns integration into optimisation:

-\mathcal{F}(y, q) = \mathbb{E}_q[-\log p(y|z) + \log q(z) - \log p(z)]

• Avoid deriving pages of gradient updates for variational inference.
• Automated tools: differentiation (Theano, Torch7, Stan); message passing (infer.NET, http://infer.net/).
• Ideally we want probabilistic programming using variational inference.
(Figure: a single computational graph. The forward pass evaluates log p(z), log p(x|z) and the entropy H[q(z)] through the inference network q(z|x), the model p(x|z) and the prior p(z); the backward pass propagates gradients through the same graph.)
• Stochastic gradient descent and other preconditioned optimisation.
• The same code can run on GPUs or on distributed clusters.
• Probabilistic models are modular and can easily be combined.
7/26/2019 Bayes Deep
42/43
Bayesian Reasoning and Deep Learning
Stochastic Backpropagation
A Monte Carlo method that works with continuous latent variables.
• Possible for many distributions with location-scale or other known transformations, such as the CDF.
• Can use any likelihood function; avoids the need for additional lower bounds.
• A low-variance, unbiased estimator of the gradient.
• Can use just one sample from the base distribution.
Original problem:                  \nabla_\theta \mathbb{E}_{q(z)}[f(z)]
Reparameterisation:                z \sim \mathcal{N}(\mu, \sigma^2) \Leftrightarrow z = \mu + \sigma\epsilon, \ \epsilon \sim \mathcal{N}(0, 1)
                                   \mathbb{E}_{\mathcal{N}(0,1)}[f(\mu + \sigma\epsilon)]
Backpropagation with Monte Carlo:  \nabla_{\theta=\{\mu,\sigma\}} \mathbb{E}_{\mathcal{N}(0,1)}[f(\mu + \sigma\epsilon)] = \mathbb{E}_{\mathcal{N}(0,1)}[\nabla_\theta f(\mu + \sigma\epsilon)]
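A sketch of the estimator for a toy integrand f(z) = (z - 2)^2 (an assumed example), where E_q[f(z)] = (mu - 2)^2 + sigma^2, so the exact gradients 2(mu - 2) and 2 sigma can be checked against the Monte Carlo values.

```python
import numpy as np

df = lambda z: 2.0 * (z - 2.0)         # derivative of f(z) = (z - 2)^2

def reparam_grads(mu, sigma, n=100_000, seed=7):
    eps = np.random.default_rng(seed).normal(size=n)
    z = mu + sigma * eps               # z ~ N(mu, sigma^2) by reparameterisation
    return np.mean(df(z)), np.mean(df(z) * eps)   # d/dmu, d/dsigma

print(reparam_grads(0.0, 1.0))         # approx (-4.0, 2.0)
```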
Monte Carlo Control Variate Estimators
A more general Monte Carlo approach that can be used with both discrete and continuous latent variables.

Property of the score function:  \nabla \log q(z|x) = \frac{\nabla q(z|x)}{q(z|x)}
Original problem:                \nabla \mathbb{E}_{q(z)}[\log p(y|z)]
Score-ratio form:                \mathbb{E}_{q(z)}[\log p(y|z) \, \nabla \log q(z|y)]
MCCV estimate:                   \mathbb{E}_{q(z)}[(\log p(y|z) - c) \, \nabla \log q(z|y)]

c is known as a control variate and is used to control the variance of the estimator.
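A sketch of the score-function estimator with a simple constant baseline c (here the sample mean of f, an assumed choice), for the same toy integrand as above; since E[score] = 0, subtracting a constant c leaves the gradient estimate unbiased while reducing its variance.

```python
import numpy as np

def mccv_grad_mu(f, mu, sigma, n=100_000, seed=8):
    z = mu + sigma * np.random.default_rng(seed).normal(size=n)
    score = (z - mu) / sigma**2        # d/dmu log N(z | mu, sigma^2)
    c = np.mean(f(z))                  # control variate / baseline
    return np.mean((f(z) - c) * score)

print(mccv_grad_mu(lambda z: (z - 2.0)**2, 0.0, 1.0))   # approx 2 * (mu - 2) = -4
```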