
Bayesian Networks

Unit 7 Approximate Inference in Bayesian Networks


Wang, Yuan-Kai (王元凱), ykwang@mails.fju.edu.tw

http://www.ykwang.tw

Department of Electrical Engineering, Fu Jen University (輔仁大學電機工程系)

2006~2011

Reference this document as: Wang, Yuan-Kai, "Approximate Inference in Bayesian Networks," Lecture Notes of Wang, Yuan-Kai, Fu Jen University, Taiwan, 2011.


Goal of This Unit
• P(X|e) inference for Bayesian networks
• Why approximate inference
  – Exact inference is too slow because of exponential complexity
• Using approximate approaches
  – Sampling methods
    • Likelihood weighting sampling
    • Markov chain Monte Carlo sampling
  – Loopy belief propagation
  – Variational method


Related Units
• Background
  – Probabilistic graphical model
  – Exact inference in BN
• Next units
  – Probabilistic inference over time


Self-Study References
• Chapter 14, Artificial Intelligence: A Modern Approach, 2nd ed., S. Russell & P. Norvig, Prentice Hall, 2003.
• Inference in Bayesian networks, B. D'Ambrosio, AI Magazine, 1999.
• Probabilistic Inference in Graphical Models, M. I. Jordan & Y. Weiss.
• An Introduction to MCMC for Machine Learning, C. Andrieu, N. de Freitas, A. Doucet, & M. I. Jordan, Machine Learning, vol. 50, pp. 5-43, 2003.
• Computational Statistics Handbook with MATLAB, W. L. Martinez and A. R. Martinez, Chapman & Hall/CRC, 2002
  – Chapter 3: Sampling Concepts
  – Chapter 4: Generating Random Variables


Structure of Related Lecture Notes
[Diagram: problem → PGM representation → inference (query); data → learning]
• Representation: Unit 5: BN; Unit 9: Hybrid BN; Units 10~15: Naïve Bayes, MRF, HMM, DBN, Kalman filter
• Inference: Unit 6: Exact inference; Unit 7: Approximate inference; Unit 8: Temporal inference
• Learning (structure learning, parameter learning): Units 16~: MLE, EM
• Example network: B → A ← E, A → J, A → M, with CPTs P(B), P(E), P(A|B,E), P(J|A), P(M|A)


Contents
1. Sampling
2. Random Number Generator
3. Stochastic Simulation
4. Markov Chain Monte Carlo
5. Loopy Belief Propagation
6. Variational Methods
7. Implementation
8. Summary
9. References


4 Steps of Inference
• Step 1: Bayes' theorem
  $P(X \mid E{=}e) = \frac{P(X, E{=}e)}{P(E{=}e)} \propto P(X, E{=}e)$
• Step 2: Marginalization
  $P(X, E{=}e) = \sum_{h \in H} P(X, E{=}e, H{=}h)$
• Step 3: Conditional independence
  $P(X, E{=}e) = \sum_{h \in H} \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$
• Step 4: Product-sum computation (enumeration)
  – Exact inference
  – Approximate inference


Five Types of Queries in Inference
• For a probabilistic graphical model G
• Given a set of evidence E=e
• Query the PGM with
  – P(e): Likelihood query
  – arg max P(e): Maximum likelihood query
  – P(X|e): Posterior belief query
  – arg max_x P(X=x|e): Maximum a posteriori (MAP) query (single query variable)
  – arg max_{x1,...,xk} P(X1=x1, ..., Xk=xk|e): Most probable explanation (MPE) query


Approximate Inference vs. Exact Inference
• Exact inference: P(X|E) = 0.71828
  – Gets the exact probability value
  – Uses the inference steps derived from probabilistic formulas
  – Needs exponential time complexity
• Approximate inference: P(X|E) ≈ 0.71
  – Gets an approximate probability value
  – Uses sampling theory
  – Needs only polynomial time complexity: fast computation


Why Approximate Inference
• Large treewidth
  – Large, highly connected graphical models
  – Treewidth may be large (>40) even in sparse networks
• In many applications, approximations are sufficient
  – Example: P(X=x|e) = 0.3183098861
  – Maybe P(X=x|e) ≈ 0.3 is a good enough approximation
  – e.g., we take action only if P(X=x|e) > 0.5


1. Sampling
• 1.1 What Is Sampling
• 1.2 Sampling for Inference


Basic Idea of Sampling
• Why sampling
  – Estimate some values by random number generation
1. Sampling
  – Random number generation
  – Draw N samples S from a known distribution P (i.e., generate N random numbers)
2. Estimation
  – Compute an approximate probability P̂(X|E) that approximates the real posterior probability P(X|E)


1.1 What Is Sampling
• A very simple example with one random variable: a coin toss
  – Tossing the coin yields head or tail
  – It is a Boolean R.V.: Coin = head or tail
  – If the coin is unbiased, head and tail have equal probability
    • A prior probability distribution P(Coin) = <0.5, 0.5>
    • Uniform distribution
  – Assume we have a coin but we do not know whether it is unbiased


Sampling of Coin Toss
• Sampling in this example = flipping the coin many times, N
  – e.g., N = 1000 times
  – One flip yields one sample
  – Ideally, 500 heads, 500 tails
    • P(head) = 500/1000 = 0.5
      P(tail) = 500/1000 = 0.5
  – Practically, 501 heads, 499 tails
    • P(head) = 501/1000 = 0.501
      P(tail) = 499/1000 = 0.499
• After the sampling,
  – We can estimate the probability distribution
  – And check whether the coin is biased


Sampling & Estimation (Math)
• For a Boolean random variable X
  – P(X) is the prior distribution = <P(x), P(¬x)>
  – Use a sampling algorithm to generate N samples
  – Let N(x) be the number of samples in which x is true, and N(¬x) the number in which x is false
  – Then
    $\hat{P}(x) = \frac{N(x)}{N}, \quad \hat{P}(\neg x) = \frac{N(\neg x)}{N}$
    $\lim_{N \to \infty} \frac{N(x)}{N} = P(x), \quad \lim_{N \to \infty} \frac{N(\neg x)}{N} = P(\neg x)$


1.2 Sampling for Inference
• Given a Bayesian network G including (X1, ..., Xn)
  – We get a joint probability distribution
    $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa(X_i))$
• For a query P(X|E=e)
  – $P(X \mid e) \propto \sum_{h} \prod_i P(X_i \mid Pa(X_i))$
  – It is hard to compute
    • Needs time exponential in the number of Xi
  – We will try to use sampling to compute it


Compute P(X|e) by Sampling
• Sampling
  – Generate N samples of $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa(X_i))$
• Estimation
  – Use the N samples to estimate P(X,e) ≈ N(X,e)/N
  – Use the N samples to estimate P(e) ≈ N(e)/N
  – Estimate P(X|e) by P(X,e)/P(e)
(Explained in Sections 2, 3, 4)


What Is a Sampling Algorithm
• An algorithm to
  – Generate samples from a known probability distribution P
  – Estimate the approximate probability P̂


Various Sampling Algorithms
• Stochastic simulation (Section 3)
  – Direct sampling
  – Rejection sampling
    • Reject samples disagreeing with the evidence
  – Likelihood weighting
    • Use the evidence to weight samples
• Markov chain Monte Carlo (MCMC) (Section 4)
  – Sample from a stochastic process whose stationary distribution is the true posterior


2. Random Number Generator
• Very important for sampling algorithms
• Introduces basic concepts related to sampling of Bayesian networks
• Subsections
  – 2.1 Univariate
  – 2.2 Multivariate


RNG in Programming Languages
• Random number generator (RNG)
  – C/C++: rand()
  – Java: random()
  – Matlab: rand()
• Why should we discuss it?
  – These generate random numbers with a uniform distribution
  – How do we generate
    • Gaussian, ...
    • Multivariate, dependent random variables
    • Non-closed-form distributions?


Generate a Random Number (1/2)
• Examples in C
  – int i = rand();
  – Returns 0 ~ RAND_MAX (at least 32767)
  – It generates integers
• Generate a random number between 1 and n (n < RAND_MAX)
  – int i = 1 + ( rand() % n );
  – (rand() % n) returns a number between 0 and n-1
  – Add 1 to make the random number lie between 1 and n
  – It generates integers, not real numbers


Generate a Random Number (2/2)
• Ex: integer between 1 and 6
  – 1 + ( rand() % 6 )
• Ex: real number between 0 and 1
  – double i = (double)rand() / RAND_MAX;
    (the cast avoids integer division, which would almost always yield 0)
• Exercise
  – Real number between 10 and 20


Generate Many Random Numbers Repeatedly
• Use a loop for repeated generation
  – for (int i=0; i<1000; i++)
    { rand(); }
  – int i, j[1000];
    for (i=0; i<1000; i++)
    { j[i] = 1 + rand() % 6; }
• rand() generates each number uniformly: uniform distribution


Why Generate Random Numbers
• Simulate random behavior
• Make random decisions
• Estimate some values


Random Behavior/Decision (1/2)
• Flip a coin for a decision (Boolean)
  – Fair: each face has equal probability
  – int coin_face;
    if (rand() > RAND_MAX/2) coin_face = 1;
    else coin_face = 0;
  – int coin_face;
    coin_face = rand() % 2;


Random Behavior/Decision (2/2)
• Random decision among multiple choices
  – Discrete random variable
• Ex: roll a die
  – Fair: each face has equal probability (uniform distribution)
  – int die_face; // Random variable
    die_face = rand() % 6;


Estimation
• If we can simulate a random behavior
• We can estimate some values
  – First, we repeat the random behavior
  – Then we estimate the value


Example: The Coin Toss
• Flip the coin 1000 times to estimate the fairness of the coin
  – int coin_face; // Random variable
    int frequency[2] = {0, 0};
    for (i=0; i<1000; i++)
    { coin_face = rand() % 2;
      frequency[coin_face]++;
    }
[Figure: histogram of frequency over coin_face ∈ {0, 1}; uniform distribution]


Example: Area of Circle (Estimation)
• double x, y; // Two random variables
  int N=1000, NCircle=0;
  double Area;
  for (i=0; i<N; i++)
  { x = (double)rand() / RAND_MAX;
    y = (double)rand() / RAND_MAX;
    if ( (x*x + y*y) <= 1 )
      NCircle = NCircle + 1;
  }
  Area = 4.0 * NCircle / N;
  (x, y, and Area must be doubles; with integer division the estimate would always be 0)
• Is one call to rand() a random number?
• x and y are independent
• We call (x, y) a sample


Multiple Dependent Random Variables
• Markov chain: n random variables
  X1 → ... → Xk → ... → Xn
• Bayesian network: 5 random variables
  Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls
• The variables are dependent
• What is a sample?


Sampling
• It is to randomly generate a sample
  – For a random variable X, or a set of random variables X1, ..., Xn
    • Boolean, discrete, continuous
    • Univariate, multivariate
      – Independent, dependent
  – According to a probability distribution P(X)
    • Discrete X: histogram
    • Continuous X:
      – Uniform, Gaussian, or
      – Any distribution: Gaussian mixture models


Sub-Sections for Generating a Sample
• 2.1 Univariate
  – Uniform, Gaussian, Gaussian mixture
• 2.2 Multivariate
  – Uniform
  – Gaussian
    • Independent, dependent
  – Any distribution
    • Gaussian mixture
      – Independent, dependent
    • Bayesian network


2.1 Univariate
• For a random variable X
  – Boolean, discrete, continuous, hybrid
• We know P(X) is
  – Uniform, Gaussian, or Gaussian mixture
• Generate a sample x according to P(X)


Uniform Generator
• Every programming language provides a rand()/random() function to generate a uniformly distributed number
  – Integer number within [0, MAX)
• Sampling a Boolean uniform number
  – rand() % 2
• Sampling a discrete uniform number within [0, d)
  – rand() % d
• Sampling a continuous uniform number
  – Within [0, 1): (double)rand() / MAX
  – Within [a, b): a + ((double)rand() / MAX) * (b - a)


Example: Uniform Generator
• x=rand(1,10000);
• h=hist(x,20);
• bar(h);
[Figure: bar chart of the 20 histogram bins, each holding roughly 500 counts: a flat, uniform profile]


Gaussian Generator (1/2)
• Sampling a Gaussian can be built from the uniform distribution
• There are functions in C/Java/Matlab to randomly generate a univariate Gaussian real number with (μ, σ) = (0, 1)
  – C: Numerical Recipes in C
  – Java: Random.nextGaussian()
  – Matlab: randn()
• Suppose it is called Gaussian()
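A concrete way to build such a Gaussian() from the uniform generator alone is the classic Box-Muller transform. The C sketch below is an illustrative assumption, not the routine used by Numerical Recipes or randn():

  #include <math.h>
  #include <stdlib.h>

  /* Box-Muller transform: map two uniform numbers on (0,1]
     to one standard normal sample (mu=0, sigma=1). */
  double Gaussian(void)
  {
      /* +1.0 shifts rand() into (0,1], so log(u1) stays finite */
      double u1 = (rand() + 1.0) / ((double)RAND_MAX + 1.0);
      double u2 = (rand() + 1.0) / ((double)RAND_MAX + 1.0);
      return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979 * u2);
  }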


Gaussian Generator (2/2)
• Sampling a continuous Gaussian number with (μ, σ)
  – (Gaussian() * σ) + μ
• Sampling a discrete Gaussian number with (μ, σ)?


Example: Gaussian Generator (1/2)
• Pseudo code
  – Assume Gaussian() is a pseudo function that generates Gaussian numbers
  – double x[10000];
    for (i=0; i<10000; i++)
      x[i] = Gaussian();
  – for (i=0; i<10000; i++)
      x[i] = μ + Gaussian() * σ;


Example: Gaussian Generator (2/2)
• Matlab
  – x=randn(1,10000);
  – h=hist(x,20);
  – bar(h);
• Java
  – Random r = new Random();
    double[] x = new double[10000];
    for (i=0; i<10000; i++)
      x[i] = r.nextGaussian();
[Figure: bell-shaped histogram of the 10000 samples, peaking around 1600 counts in the central bins]


Gaussian Mixture Generator (1/2)
• Random variable X with a Gaussian
  – P(X) = N(X; μ, σ)
• Random variable Y with a Gaussian mixture
  – $P(Y) = \sum_m \pi_m N(Y; \mu_m, \sigma_m)$


Gaussian Mixture Generator (2/2)
• Generate N samples of X
  – for (i=0; i<N; i++)
      x[i] = (Gaussian() * σ) + μ;
• Generate N samples of Y with a mixture of M Gaussians (see the sketch below)
  – Each Gaussian m has π_m, μ_m, σ_m
  – for (m=0; m<M; m++)
      for (i=0; i<N*π_m; i++)
        y[m][i] = (Gaussian() * σ_m) + μ_m;
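The loop above allocates exactly N·π_m samples to component m. A more literal reading of the mixture density is to draw the component index at random for every sample; a minimal C sketch, reusing the Gaussian() helper assumed above:

  /* Sample one value from a mixture of M univariate Gaussians:
     first choose component m with probability pi[m], then
     sample from N(mu[m], sigma[m]). */
  double sample_mixture(int M, const double pi[],
                        const double mu[], const double sigma[])
  {
      double u = (double)rand() / RAND_MAX;   /* uniform in [0,1] */
      double cum = 0.0;
      int m;
      for (m = 0; m < M - 1; m++) {
          cum += pi[m];
          if (u <= cum) break;                /* component m chosen */
      }
      return mu[m] + Gaussian() * sigma[m];
  }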


Example: Gaussian Mixture Generator
• N=10000; pi1=0.8; pi2=0.2;
• mu1=0; mu2=15; sigma1=3; sigma2=5;
• x1 = mu1 + randn(1,N*pi1) * sigma1;
• x2 = mu2 + randn(1,N*pi2) * sigma2;
• x = [x1, x2];
• h=hist(x,50);
• bar(h);
[Figure: bimodal histogram with a tall mode near mu1=0 and a smaller, wider mode near mu2=15]


2.2 Multivariate
• For random variables X1, ..., Xn
  – Boolean, discrete, continuous, hybrid
• We know P(X1, ..., Xn) is
  – Uniform, Gaussian, Gaussian mixture, or any distribution
• Generate a sample (x1, ..., xn) according to P(X1, ..., Xn)
  – Independent
  – Dependent


Multivariate Boolean Uniform Generator
• Boolean random variables X1, ..., Xn
• int X[n]; // A sample
  for (i=0; i<n; i++)
    X[i] = rand() % 2;


Multivariate Discrete Uniform Generator
• Discrete random variables X1, ..., Xn
  – Each with d discrete values: [0, d-1]
  – Each Xi is uniformly distributed
  – X1, ..., Xn must be independent
• int X[n]; // A sample
  for (i=0; i<n; i++)
    X[i] = rand() % d;


Multivariate Gaussian Generator - Independent (1/2)
• Pseudo code
• For n random variables X = (X1, ..., Xn)
  – Gaussian: N(X; μ, Σ)
    • Mean vector: μ
    • Covariance matrix: Σ = [σij]
• X1, ..., Xn are independent
  – σij = 0 for i ≠ j
• Generating a sample of X = generating each Xi independently


Multivariate Gaussian Generator - Independent (2/2)
• Generate a sample of X = (X1, ..., Xn) with μi = 0, σii = 1, σij = 0 for i ≠ j
  – double X[n]; // a sample
    for (i=0; i<n; i++)
      X[i] = Gaussian();
• Generate a sample of X = (X1, ..., Xn) with μi ≠ 0, σii ≠ 1, σij = 0 for i ≠ j
  – double X[n]; // a sample
    for (i=0; i<n; i++)
      X[i] = μi + Gaussian() * σii;


Example – Matlab (1/2)
• Plot the density of a 2-D Gaussian with μ_X = (0,0)^T and Σ_X = [1 0; 0 1]:
  mx=[0 0]';
  Cx=[1 0; 0 1];
  x1=-3:0.1:3;
  x2=-3:0.1:3;
  for i=1:length(x1),
    for j=1:length(x2),
      f(i,j)=(1/(2*pi*det(Cx)^(1/2)))*exp((-1/2)*([x1(i) x2(j)]-mx')*inv(Cx)*([x1(i);x2(j)]-mx));
    end
  end
  mesh(x1,x2,f)
  pause;
  contour(x1,x2,f)
  pause


Example – Matlab (2/2)
• Randomly generate 1000 samples for X with μ_X = (0,0)^T, Σ_X = [1 0; 0 1]:
  y1=randn(1,1000);
  y2=randn(1,1000);
  plot(y1,y2,'.');


Multivariate Gaussian Generator - Dependent (1/4)
• For n random variables X = (X1, ..., Xn)
  – Gaussian: N(X; μ, Σ)
    • Mean vector: μ
    • Covariance matrix: Σ = [σij]
  – Σ is a positive definite matrix
    • Symmetric, and all eigenvalues (pivots) > 0
  – For a general matrix A: A = LDU
    • L: lower triangular, U: upper triangular, D: diagonal matrix of pivots
  – For a symmetric matrix S: S = LDL^T
  – For a positive definite matrix:
    $\Sigma = LDL^T = (L\sqrt{D})(\sqrt{D}L^T) = PP^T$
  – This is called the Cholesky decomposition
• X1, ..., Xn are dependent
  – σij ≠ 0


Multivariate Gaussian Generator - Dependent (2/4)
• Generate a sample of X with μ, Σ
  – Perform the Cholesky decomposition of Σ
    • The Cholesky decomposition is the pivot decomposition for a positive definite matrix
    • Σ = PP^T
  – Generate independent Gaussian Y = (Y1, ..., Yn) with μi = 0, σi = 1
  – X = PY + μ


Multivariate Gaussian Generator - Dependent (3/4)
• Pseudo code to generate a sample of X with μ, Σ
  – Matrix Sigma;
    Vector mu;
    Vector X(n), Y(n); // a sample
    Matrix P = chol(Sigma); // Cholesky decomposition
    for (i=0; i<n; i++) Y(i) = Gaussian();
    X = P * Y + mu;


Multivariate Gaussian Generator - Dependent (4/4)
• Proof
  – For n random variables X = (X1, ..., Xn) with μ, Σ
  – Generate n independent, zero-mean, unit-variance normal random variables Y = (Y1, ..., Yn):
    $\mu_Y = (0, \ldots, 0)^T, \quad \Sigma_Y = I$
  – Take X = PY + μ, where Σ = PP^T
  – Covariance matrix of X:
    $E[(X-\mu)(X-\mu)^T] = E[(PY)(PY)^T] = P\,E[YY^T]\,P^T = P I P^T = PP^T = \Sigma$


Example – Matlab (1/4)
• Assume
  $\mu_X = (0,0)^T, \quad \Sigma_X = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix} = PP^T, \quad P = \begin{pmatrix} 1 & 0 \\ 1/2 & \sqrt{3}/2 \end{pmatrix}$
• Matlab:
  mx=[0 0]';
  Cx=[1 1/2; 1/2 1];
  P=chol(Cx)';
  (Matlab's chol returns the upper-triangular factor R with R'R = Cx, so the lower-triangular P is its transpose)


Example – Matlab (2/4)
• Randomly generate 1000 samples for X with μ_X = (0,0)^T, Σ_X = [1 1/2; 1/2 1]
• mx=zeros(2,1000);
  y1=randn(1,1000);
  y2=randn(1,1000);
  y=[y1;y2];
  P=[1, 0; 1/2, sqrt(3)/2];
  x=P*y+mx;
  x1=x(1,:);
  x2=x(2,:);
  plot(x1,x2,'.');
  r=corrcoef(x1',x2');


Example – Matlab (3/4)
• Assume
  $\mu_X = (5,5)^T, \quad \Sigma_X = \begin{pmatrix} 1 & 9/10 \\ 9/10 & 1 \end{pmatrix} = PP^T, \quad P = \begin{pmatrix} 1 & 0 \\ 9/10 & \sqrt{19}/10 \end{pmatrix}$
• Matlab:
  mx=[5 5]';
  Cx=[1 9/10; 9/10 1];
  P=chol(Cx)';


Example – Matlab (4/4)
• Randomly generate 1000 samples for X with μ_X = (5,5)^T, Σ_X = [1 9/10; 9/10 1]
• mx=5*ones(2,1000);
  y1=randn(1,1000);
  y2=randn(1,1000);
  y=[y1;y2];
  P=[1, 0; 9/10, sqrt(19)/10];
  x=P*y+mx;
  x1=x(1,:);
  x2=x(2,:);
  plot(x1,x2,'.');
  r=corrcoef(x1',x2');


Multivariate Gaussian Mixture Generator
• Generate N samples of X with a mixture of M Gaussians (Matlab-like pseudo code)
  – for (m=0; m<M; m++)
    { Matrix P = chol(Σ_m); // Cholesky decomposition
      for (i=0; i<N*π_m; i++)
      { // Generate n independent normally distributed R.V.s (μ=0, σ=1)
        y = randn(n, 1);
        // Transform y into x
        x = P * y + μ_m;
      }
    }


Example – Matlab (1/4)
• Combine the previous two Gaussians with π1 = 0.5, π2 = 0.5:
  $\mu_1 = (0,0)^T, \quad \Sigma_1 = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}; \qquad \mu_2 = (5,5)^T, \quad \Sigma_2 = \begin{pmatrix} 1 & 9/10 \\ 9/10 & 1 \end{pmatrix}$
[Figure: scatter plot of the mixture samples, two equal-sized clusters centered at (0,0) and (5,5)]


Example – Matlab (2/4)
• pi1=0.5; pi2=0.5; N=2000;
  mx1=zeros(2,pi1*N); Cx1=[1 1/2; 1/2 1];
  P1=chol(Cx1)'; %P1=[1, 0; 1/2, sqrt(3)/2];
  y1_1=randn(1,pi1*N); y1_2=randn(1,pi1*N);
  y1=[y1_1;y1_2];
  x1=P1*y1+mx1; x1_1=x1(1,:); x1_2=x1(2,:);

  mx2=5*ones(2,pi2*N); Cx2=[1 9/10; 9/10 1];
  P2=chol(Cx2)'; %P2=[1, 0; 9/10, sqrt(19)/10];
  y2_1=randn(1,pi2*N); y2_2=randn(1,pi2*N);
  y2=[y2_1;y2_2];
  x2=P2*y2+mx2; x2_1=x2(1,:); x2_2=x2(2,:);

  z1=[x1_1,x2_1]; z2=[x1_2,x2_2];
  plot(z1,z2,'.');


Example – Matlab (3/4)
• Combine the previous two Gaussians with π1 = 0.2, π2 = 0.8:
  $\mu_1 = (0,0)^T, \quad \Sigma_1 = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}; \qquad \mu_2 = (5,5)^T, \quad \Sigma_2 = \begin{pmatrix} 1 & 9/10 \\ 9/10 & 1 \end{pmatrix}$
[Figure: scatter plot; the cluster at (5,5) now holds about four times as many samples as the one at (0,0)]


Example – Matlab (4/4)
• pi1=0.2; pi2=0.8; N=2000;
  mx1=zeros(2,pi1*N); Cx1=[1 1/2; 1/2 1];
  P1=chol(Cx1)'; %P1=[1, 0; 1/2, sqrt(3)/2];
  y1_1=randn(1,pi1*N); y1_2=randn(1,pi1*N);
  y1=[y1_1;y1_2];
  x1=P1*y1+mx1; x1_1=x1(1,:); x1_2=x1(2,:);

  mx2=5*ones(2,pi2*N); Cx2=[1 9/10; 9/10 1];
  P2=chol(Cx2)'; %P2=[1, 0; 9/10, sqrt(19)/10];
  y2_1=randn(1,pi2*N); y2_2=randn(1,pi2*N);
  y2=[y2_1;y2_2];
  x2=P2*y2+mx2; x2_1=x2(1,:); x2_2=x2(2,:);

  z1=[x1_1,x2_1]; z2=[x1_2,x2_2];
  plot(z1,z2,'.');


Exercise
• Write a program to randomly generate 1000 samples of a 3-dimensional Gaussian with μ = (5, 10, -3), Σ = (2,1,3; 4,2,2; 3,1,2)
  (note: as stated this Σ is not symmetric; a covariance matrix must be symmetric positive definite, so symmetrize it first)


Any Distribution
• For random variables X1, ..., Xn
  – Boolean, discrete, continuous, hybrid
• We know P(X1, ..., Xn) has no closed-form formula
  – Independent: P(X1, ..., Xn) = P(X1) ... P(Xn)
  – Dependent: $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Parent(X_i))$
• Generate a sample (x1, ..., xn) according to P(X1, ..., Xn)
  – Independent: generate each Xi by P(Xi)
  – Dependent: generate each Xi by P(Xi | Parent(Xi))


Two Boolean R.V.s - Independent
• X1, X2 have distributions:
  – P(X1) = <0.67, 0.33>, P(X2) = <0.75, 0.25>
• int X1, X2;
  for (i=0; i<1000; i++)
  { if (rand() > RAND_MAX/3)
      X1 = 1;
    else X1 = 0;
    if (rand() > RAND_MAX/4)
      X2 = 1;
    else X2 = 0;
  }
[Tables: P(X1=1) = 0.67; P(X2=1) = 0.75]


Two Boolean R.V.s - Dependent
• X1, X2 have distributions:
  – P(X1) = <0.67, 0.33>
  – P(X2|X1=T) = <0.75, 0.25>, P(X2|X1=F) = <0.8, 0.2>
• Generate a sample (x1, x2)
  if (rand() > RAND_MAX/3) x1 = 1;
  else x1 = 0;
  if (x1==1)
    if (rand() > RAND_MAX/4) x2 = 1;
    else x2 = 0;
  else // x1==0
    if (rand() > RAND_MAX/5) x2 = 1;
    else x2 = 0;


Markov Chain
• Markov chain: n random variables
  X1 → ... → Xk → ... → Xn


Bayesian Network
• Example: 5 random variables
  Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls


3. Stochastic Simulation
• Also called
  – Monte Carlo methods
  – Sampling methods
• Sub-sections
  – 3.1 Direct sampling
  – 3.2 Rejection sampling
  – 3.3 Likelihood weighting


3.1 Direct Sampling
• Generate N samples randomly
• For the inference P(X|E)
  – P(X|E) = P(X^E) / P(E)
  – Get N(E) & N(X^E) from the N samples
    • N(E): no. of samples with E
    • N(X^E): no. of samples with X and E
  – P(E) ≈ N(E)/N, P(X^E) ≈ N(X^E)/N
  – P(X|E) ≈ N(X^E)/N(E)


Example (1/4)
• For the sprinkler network
  – Estimate P(r|¬w) by direct sampling
  – 4 random variables
  – A sample = (c, s, r, w)


Example (2/4)
• Generate 1000 samples

  Cloudy  Sprinkler  Rain  WetGrass
  T       T          T     F
  F       T          T     F
  F       F          T     T
  T       T          T     F
  T       T          T     F
  ...     ...        ...   ...
  F       T          T     F


Example (3/4)
• P(r|¬w) = P(r, ¬w)/P(¬w) ≈ N(r^¬w) / N(¬w)
  – N(¬w): no. of samples with WetGrass=False
  – N(r^¬w): no. of samples with Rain=True & WetGrass=False
  (counted over the table of 1000 samples above)


Example (4/4)
• P(R|¬w)
  – = P(R, ¬w)/P(¬w)
  – = < P(r^¬w)/P(¬w), P(¬r^¬w)/P(¬w) >
  (again counted over the same table of samples)


How to Generate a Sample for the Bayesian Network? (1/3)
• The sprinkler Bayesian network
• Assume a sampling order: [Cloudy, Sprinkler, Rain, WetGrass]
• A sample is an atomic event:
  (cloudy, sprinkler, rain, wetgrass) = (T, F, T, T)


How to Generate a Sample for the Bayesian Network? (2/3)
• int C, S, R, W;
  for (i=0; i<1000; i++)
  { if (rand() > RAND_MAX/2) C = T;
    else C = F;
    if (rand() > RAND_MAX/2) S = T;
    else S = F;
    if (rand() > RAND_MAX/2) R = T;
    else R = F;
    if (rand() > RAND_MAX/2) W = T;
    else W = F;
  }
• Incorrect implementation: every variable is sampled with probability 0.5, ignoring the CPTs and each variable's parents


How to Generate a Sample for the Bayesian Network? (3/3)
• int C, S, R, W;
  for (i=0; i<1000; i++)
  { if (rand() > RAND_MAX/2) C = T;
    else C = F;
    if (C==T)
      if (rand() > RAND_MAX*0.9) S = T;
      else S = F;
    else // C==F
      if (rand() > RAND_MAX/2) S = T;
      else S = F;
    ...
  }


An Example Generating One Sample (1/8)
• The sampling algorithm
  1. Sample from P(Cloudy) = <0.5, 0.5>
     – Suppose it returns true
  2. Sample from P(Sprinkler|Cloudy=true) = <0.1, 0.9>
     – Suppose it returns false
  3. Sample from P(Rain|Cloudy=true) = <0.8, 0.2>
     – Suppose it returns true
  4. Sample from P(WetGrass|Sprinkler=false, Rain=true) = <0.9, 0.1>
     – Suppose it returns true


An Example Generating One Sample (2/8)
Samples: (C, S, R, W) = (-, -, -, -)


An Example Generating One Sample (3/8)
Random sampling: Cloudy
Return: Cloudy=true
Samples: (C, S, R, W) = (c, -, -, -)


An Example Generating One Sample (4/8)
Random sampling: 1. Sprinkler, 2. Rain, given Cloudy=true
Samples: (C, S, R, W) = (c, -, -, -)


An Example Generating One Sample (5/8)
Random sampling: Sprinkler, given Cloudy=true
Return: Sprinkler=false
Samples: (C, S, R, W) = (c, ¬s, -, -)


An Example Generating One Sample (6/8)
Random sampling: Rain, given Cloudy=true
Return: Rain=true
Samples: (C, S, R, W) = (c, ¬s, r, -)


An Example Generating One Sample (7/8)
Random sampling: WetGrass, given Rain=true, Sprinkler=false
Samples: (C, S, R, W) = (c, ¬s, r, -)


An Example Generating One Sample (8/8)
Random sampling: WetGrass, given Rain=true, Sprinkler=false
Return: WetGrass=true
Samples: (C, S, R, W) = (c, ¬s, r, w)


The Algorithm (1/2)
• To generate one sample (Prior-Sample; see the sketch below)
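The Prior-Sample pseudocode figure is not reproduced in this transcript, so here is a C sketch of the same idea for the sprinkler network. CPT entries not shown on these slides (e.g., P(Rain|¬cloudy) and the remaining WetGrass rows) are taken from the standard AIMA sprinkler network, and sample_bernoulli() is an assumed helper:

  #include <stdbool.h>
  #include <stdlib.h>

  /* Assumed helper: returns true with probability p. */
  bool sample_bernoulli(double p)
  {
      return rand() < p * ((double)RAND_MAX + 1.0);
  }

  /* Prior-Sample sketch: sample each variable in topological
     order, conditioning on the already-sampled parent values. */
  void prior_sample(bool *c, bool *s, bool *r, bool *w)
  {
      *c = sample_bernoulli(0.5);              /* P(Cloudy) */
      *s = sample_bernoulli(*c ? 0.1 : 0.5);   /* P(Sprinkler|Cloudy) */
      *r = sample_bernoulli(*c ? 0.8 : 0.2);   /* P(Rain|Cloudy) */
      double pw = (*s && *r) ? 0.99            /* P(WetGrass|S,R) */
                : (*s || *r) ? 0.90
                : 0.0;
      *w = sample_bernoulli(pw);
  }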


The Algorithm (2/2)
• In the previous example
  – We got one sample [true, false, true, true] of the Bayesian network using Prior-Sample
• The sampling of a Bayesian network
  – Repeat the sampling N times
  – We get N samples
• We can use the N samples to compute any query probability in the Bayesian network


How It Works (1/2)
• Why can any probability be answered from the sampling?
  – The N samples effectively form a full joint distribution table (FJD)

  Samples:        FJD:
  C S R W         C S R W  P
  T T T F         T T T F  0.02
  F T T F         F T T F  0.13
  F F T T         F F T T  0.04
  T T F F         T T F F  0.15
  ...             ...


Why It Works (2/2)
• A sample is an atomic event (x1, ..., xn)
• P(x1, ..., xn) ≈ N(x1, ..., xn) / N
• Therefore, an FJD is generated from the N samples
• Note: usually N << 2^n, so many atomic events receive few or no samples


Exercise: Direct Sampling
• Network: smart → prepared ← study; smart, prepared, fair → pass
  p(smart) = .8, p(study) = .6, p(fair) = .9

  p(prep|...)   smart   ¬smart
  study          .9      .7
  ¬study         .5      .1

  p(pass|...)       smart           ¬smart
                prep   ¬prep    prep   ¬prep
  fair           .9      .7      .7      .2
  ¬fair          .1      .1      .1      .1

• Query: What is the probability that a student studied, given that they pass the exam?


Problems of Direct Sampling
• It needs to generate very many samples in order to obtain an approximate FJD
• For a query of conditional probability P(X|e)
  – Can we just approximate the conditional probability?
  – Yes, the following two algorithms do this


3.2 Rejection Sampling
• $\hat{P}(X \mid e)$ is estimated from the samples agreeing with e
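A minimal C sketch of this idea for the query P(Rain|Sprinkler=true), reusing the prior_sample() helper sketched earlier; samples whose Sprinkler value disagrees with the evidence are simply discarded:

  /* Rejection-sampling estimate of P(Rain=true | Sprinkler=true). */
  double reject_sample_rain(int N)
  {
      int agree = 0, rain_true = 0;
      for (int i = 0; i < N; i++) {
          bool c, s, r, w;
          prior_sample(&c, &s, &r, &w);
          if (!s) continue;        /* reject: disagrees with evidence */
          agree++;
          if (r) rain_true++;
      }
      return agree > 0 ? (double)rain_true / agree : 0.0;
  }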


An Example
• Estimate P(Rain|Sprinkler=true) using 100 samples
  – 27 samples have Sprinkler=true
  – Of these, 8 have Rain=true and 19 have Rain=false
  – P(Rain|Sprinkler=true) ≈ Normalize(<8, 19>) = <0.296, 0.704>
• Similar to a basic real-world empirical estimation procedure


Analysis of Rejection Sampling
• $\hat{P}(X \mid e) = \alpha\, N(X, e) = \frac{N(X, e)}{N(e)} \approx \frac{P(X, e)}{P(e)} = P(X \mid e)$
• Hence rejection sampling returns consistent posterior estimates
• Problem: expensive if P(e) is small
  – P(e) drops off exponentially with the number of evidence variables!


3.3 Likelihood Weighting
• Avoids the inefficiency of rejection sampling
  – By generating only events consistent with the evidence variables e
• Idea
  – Fix the evidence variables
  – Randomly sample only the hidden variables (to generate a sample event)
  – Weight each sample event by the likelihood it accords the evidence
    • Events have different weights


An Example (1/9)
• Query P(Rain|sprinkler, wetgrass)


An Example (2/9)
1. Set the weight ω = 1.0
2. Sample from P(Cloudy) = <0.5, 0.5>
   • Suppose it returns true
3. The evidence Sprinkler=true, so we set
   ω = ω × P(sprinkler|cloudy) = 1.0 × 0.1 = 0.1
4. Sample from P(Rain|cloudy) = <0.8, 0.2>
   • Suppose it returns true
5. The evidence WetGrass=true, so we set
   ω = ω × P(wetgrass|sprinkler, rain) = 0.1 × 0.99 = 0.099
• Result: a sample event (true, true, true, true) with weight 0.099


An Example (3/9)
ω = 1.0


An Example (4/9)
ω = 1.0


An Example (5/9)
ω = 1.0


An Example (6/9)
ω = 1.0 × 0.1


An Example (7/9)
ω = 1.0 × 0.1


An Example (8/9)
ω = 1.0 × 0.1


An Example (9/9)
ω = 1.0 × 0.1 × 0.99 = 0.099


The Algorithm (1/2)
• The example generates one sample event (true, true, true, true) for the query P(Rain|sprinkler, wetgrass)
• Repeat the sampling N times
  – We get N sample events
  – Each event has a likelihood weight ω
  – ω1 = sum of ω over events with rain=true, ω2 = sum of ω over events with rain=false
• P(Rain|sprinkler, wetgrass) = < ω1/(ω1+ω2), ω2/(ω1+ω2) >


The Algorithm (2/2)
[Figure: Weighted-Sample pseudocode; see the sketch below]
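The Weighted-Sample pseudocode figure is not reproduced here, so the C sketch below applies the idea directly to the running query P(Rain|sprinkler, wetgrass). It assumes the standard AIMA sprinkler CPTs and the sample_bernoulli() helper from the earlier sketch:

  /* Likelihood-weighting estimate of P(Rain=true | sprinkler, wetgrass):
     evidence variables are fixed and multiply into the weight;
     hidden variables are sampled from their CPTs. */
  double lw_rain(int N)
  {
      double w_rain = 0.0, w_not = 0.0;
      for (int i = 0; i < N; i++) {
          double w = 1.0;
          bool c = sample_bernoulli(0.5);            /* sample Cloudy */
          w *= c ? 0.1 : 0.5;                        /* evidence: Sprinkler=true */
          bool r = sample_bernoulli(c ? 0.8 : 0.2);  /* sample Rain|Cloudy */
          w *= r ? 0.99 : 0.90;                      /* evidence: WetGrass=true,
                                                        with Sprinkler=true */
          if (r) w_rain += w; else w_not += w;
      }
      return w_rain / (w_rain + w_not);   /* = omega1/(omega1+omega2) */
  }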


Exercise: Likelihood Weighting
• Network: smart → prepared ← study; smart, prepared, fair → pass
  p(smart) = .8, p(study) = .6, p(fair) = .9

  p(prep|...)   smart   ¬smart
  study          .9      .7
  ¬study         .5      .1

  p(pass|...)       smart           ¬smart
                prep   ¬prep    prep   ¬prep
  fair           .9      .7      .7      .2
  ¬fair          .1      .1      .1      .1

• Query: What is the probability that a student studied, given that they pass the exam?


Analysis (1/3)
• Why does the algorithm work for P(X|E=e)?
• Let the sampling probability for Weighted-Sample be S_WS
  – The evidence variables E are fixed with e
  – All the other variables are Z = {X} ∪ Y
  – The algorithm samples each variable in Z given its parent values:
    $S_{WS}(z, e) = \prod_{i=1}^{l} P(z_i \mid parents(Z_i))$


Analysis (2/3)
• The likelihood weight w for a given sample (z, e) = (x, y, e) is
  $w(z, e) = \prod_{i=1}^{m} P(e_i \mid parents(E_i))$
• The weighted probability of a sample (z, e) = (x, y, e) is
  $S_{WS}(z, e)\, w(z, e) = \prod_{i=1}^{l} P(z_i \mid parents(Z_i)) \prod_{i=1}^{m} P(e_i \mid parents(E_i)) = P(x, y, e)$
  because, by the chain rule of Bayesian networks,
  $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i))$


Analysis (3/3)
$\hat{P}(x \mid e) = \alpha \sum_{y} N_{WS}(x, y, e)\, w(x, y, e)$
$\approx \alpha' \sum_{y} S_{WS}(x, y, e)\, w(x, y, e)$
$= \alpha' \sum_{y} P(x, y, e)$
$= \alpha' P(x, e) = P(x \mid e)$
• So the algorithm works


Discussions
• Likelihood weighting is efficient because it uses all the samples generated
• However, it suffers a degradation in performance as the number of evidence variables increases, because
  – Most samples will have very low weights
  – The weighted estimate will be dominated by the tiny fraction of samples that accord more than an infinitesimal likelihood to the evidence


4. Inference by MCMC
• Key idea
  – Treat the sampling process as a Markov chain
    • The next sample depends on the previous one
  – Can approximate any posterior distribution
• "State" of the network = current assignment to all variables
• Generate the next state by sampling one variable given its Markov blanket
• Sample each variable in turn, keeping the evidence fixed


The Markov Chain
• With Sprinkler=true, WetGrass=true, there are four states:
[Figure: the four states (Cloudy, Rain) ∈ {T, F}² with transition arrows between them]


Markov Blanket Sampling
• The Markov blanket of Cloudy is
  – Sprinkler and Rain
• The Markov blanket of Rain is
  – Cloudy, Sprinkler, and WetGrass
• The probability given the Markov blanket is calculated as follows:
  $P(x'_i \mid MB(X_i)) \propto P(x'_i \mid Parents(X_i)) \prod_{Z_j \in Children(X_i)} P(z_j \mid Parents(Z_j))$


An Example (1/2)
• Estimate P(Rain|sprinkler, wetgrass)
• Loop for N times
  – Sample Cloudy or Rain given its Markov blanket
• Count the number of times Rain=true and Rain=false in the samples


An Example (2/2)
• E.g., visit 100 states
  – 31 have Rain=true
  – 69 have Rain=false
• P(Rain|sprinkler, wetgrass) = Normalize(<31, 69>) = <0.31, 0.69>


The Algorithm
[Figure: MCMC sampling pseudocode; see the sketch below]
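The pseudocode figure is not reproduced here; the C sketch below performs Gibbs sampling for P(Rain|sprinkler, wetgrass), computing each Markov-blanket distribution with the formula from the slide above. The CPT entries are those of the standard AIMA sprinkler network, and sample_bernoulli() is the helper assumed earlier:

  /* Gibbs-sampling estimate of P(Rain=true | sprinkler, wetgrass).
     Nonevidence variables: Cloudy, Rain. */
  double gibbs_rain(int N)
  {
      bool c = true, r = true;    /* arbitrary initial state */
      int rain_count = 0;
      for (int i = 0; i < N; i++) {
          /* Resample Cloudy given blanket {Sprinkler=t, Rain}:
             P(c|mb) is proportional to P(c) P(s|c) P(r|c) */
          double pc_t = 0.5 * 0.1 * (r ? 0.8 : 0.2);
          double pc_f = 0.5 * 0.5 * (r ? 0.2 : 0.8);
          c = sample_bernoulli(pc_t / (pc_t + pc_f));
          /* Resample Rain given blanket {Cloudy, Sprinkler=t, WetGrass=t}:
             P(r|mb) is proportional to P(r|c) P(w|s,r) */
          double pr_t = (c ? 0.8 : 0.2) * 0.99;
          double pr_f = (c ? 0.2 : 0.8) * 0.90;
          r = sample_bernoulli(pr_t / (pr_t + pr_f));
          if (r) rain_count++;
      }
      return (double)rain_count / N;
  }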


Why It Works
• Skipped
  – Details on pp. 517-518 of the AIMA 2e textbook


Sub-Sections
• 4.1 Markov chain theory
• 4.2 Two MCMC sampling algorithms


4.1 Markov Chain Theory
• Suppose X1, X2, ... take some set of values
  – w.l.o.g., these values are 1, 2, ...
• A Markov chain is a process that corresponds to the network:
  X1 → X2 → X3 → ... → Xn → ...
• To quantify the chain, we need to specify
  – The initial probability: P(X1)
  – The transition probability: P(Xt+1|Xt)
• A Markov chain has stationary transition probability: P(Xt+1|Xt) is the same for all times t


Irreducible Chains
• A state j is accessible from state i if there is an n such that P(Xn = j | X1 = i) > 0
  – There is a positive probability of reaching j from i after some number of steps
• A chain is irreducible if every state is accessible from every state


Ergodic Chains
• A state i is positively recurrent if there is a finite expected time to get back to state i after being in state i
  – If X has a finite number of states, it suffices that i is accessible from itself
• A chain is ergodic if it is irreducible and every state is positively recurrent


(A)periodic Chains
• A state i is periodic if there is an integer d such that P(Xn = i | X1 = i) = 0 whenever n is not divisible by d
  – Intuition: state i may occur only every d steps
• A chain is aperiodic if it contains no periodic state


Stationary Probabilities
Thm:
• If a chain is ergodic and aperiodic, then the limit
  $\lim_{n \to \infty} P(X_n = j \mid X_1 = i)$
  exists and does not depend on i
• Moreover, let
  $P^*(X = j) = \lim_{n \to \infty} P(X_n = j \mid X_1 = i)$
  Then P*(X) is the unique probability satisfying
  $P^*(X = j) = \sum_{i} P(X_{t+1} = j \mid X_t = i)\, P^*(X_t = i)$


Stationary Probabilities
• The probability P*(X) is the stationary probability of the process
• Regardless of the starting point, the process will converge to this probability
• The rate of convergence depends on properties of the transition probability


Sampling from the Stationary Probability
• This theory suggests how to sample from the stationary probability:
  – Set X1 = i, for some random/arbitrary i
  – For t = 1, 2, ..., n
    • Sample a value xt+1 for Xt+1 from P(Xt+1|Xt = xt)
  – Return xn
• If n is large enough, then this is a sample from P*(X)
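A tiny C sketch of this procedure for a two-state chain; the transition matrix T is an illustrative assumption, with T[i][j] = P(Xt+1 = j | Xt = i):

  /* Simulate a two-state Markov chain for n steps and return the
     final state -- approximately a draw from P*(X) for large n. */
  int markov_chain_sample(const double T[2][2], int x1, int n)
  {
      int x = x1;
      for (int t = 1; t <= n; t++) {
          double u = (double)rand() / RAND_MAX;
          x = (u < T[x][0]) ? 0 : 1;   /* sample next state from row x */
      }
      return x;
  }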


Designing Markov Chains
• How do we construct the right chain to sample from?
  – Ensuring aperiodicity and irreducibility is usually easy
• The problem is ensuring the desired stationary probability


Designing Markov Chains
Key tool:
• If the transition probability satisfies the detailed-balance condition
  $Q(X = i)\, P(X_{t+1} = j \mid X_t = i) = Q(X = j)\, P(X_{t+1} = i \mid X_t = j)$
  whenever $P(X_{t+1} = j \mid X_t = i) > 0$,
  then P*(X) = Q(X)
• This gives a local criterion for checking that the chain will have the right stationary distribution


MCMC Methods
• We can use these results to sample from P(X1,...,Xn|e)
Idea:
• Construct an ergodic & aperiodic Markov chain such that
  P*(X1,...,Xn) = P(X1,...,Xn|e)
• Simulate the chain for n steps to get a sample


MCMC Methods
Notes:
• The Markov chain variable Y takes as values assignments to all variables that are consistent with the evidence
• For simplicity, we will denote such a state using the vector of variables:
  $V(Y) = \{\, V(X_1), \ldots, V(X_n) \mid x_1, \ldots, x_n \text{ satisfy } e \,\}$


4.2 Two MCMC Sampling Algorithms
• Gibbs sampler
• Metropolis-Hastings sampler


Gibbs Sampler
• One of the simplest MCMC methods
• Each transition changes the state of only one Xi
• The transition probability is defined by P itself, as a stochastic procedure:
  – Input: a state x1,...,xn
  – Choose i at random (uniform probability)
  – Sample x'i from P(Xi|x1, ..., xi-1, xi+1, ..., xn, e)
  – Let x'j = xj for all j ≠ i
  – Return x'1,...,x'n


Correctness of Gibbs Sampler
• How do we show correctness?


Correctness of Gibbs Sampler
• By the chain rule,
  $P(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n \mid e) = P(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \mid e)\, P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, e)$
• Thus, we get
  $\frac{P(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n \mid e)}{P(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_n \mid e)} = \frac{P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, e)}{P(x'_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, e)}$
  where the right-hand side is the ratio of the transition probabilities
• Since we choose i from the same distribution at each stage, this procedure satisfies the ratio (detailed-balance) criterion


Gibbs Sampling for Bayesian Networks
• Why is the Gibbs sampler "easy" in BNs?
• Recall that the Markov blanket of a variable separates it from the other variables in the network
  – P(Xi | X1,...,Xi-1, Xi+1,...,Xn) = P(Xi | Mbi)
• This property allows us to use local computations to perform the sampling in each transition


Gibbs Sampling in Bayesian Networks
• How do we evaluate P(Xi | x1,...,xi-1, xi+1,...,xn)?
• Let Y1, ..., Yk be the children of Xi
  – By the definition of Mbi, the parents of each Yj are in Mbi ∪ {Xi}
• It is easy to show that
  $P(x_i \mid Mb_i) = \frac{P(x_i \mid Pa_i) \prod_j P(y_j \mid pa_{Y_j})}{\sum_{x'_i} P(x'_i \mid Pa_i) \prod_j P(y_j \mid pa'_{Y_j})}$


Metropolis-Hastings
• More general than Gibbs (Gibbs is a special case of M-H)
• The proposal distribution is an arbitrary q(x'|x) that is ergodic and aperiodic (e.g., uniform)
• Transition to x' happens with probability
  $\alpha(x' \mid x) = \min\!\left(1, \frac{P(x')\, q(x \mid x')}{P(x)\, q(x' \mid x)}\right)$
• Useful when P(x) can be evaluated only up to a normalizing constant, since the constant cancels in the ratio
• q(x'|x) = 0 implies P(x') = 0 or q(x|x') = 0
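A generic C sketch of one M-H transition over a discrete state space. The unnormalized target p_unnorm() and the proposal pair q_density()/q_sample() are assumed caller-supplied helpers, purely for illustration:

  /* One Metropolis-Hastings step: propose x' ~ q(.|x) and accept
     it with probability min(1, P(x')q(x|x') / (P(x)q(x'|x))). */
  int mh_step(int x,
              double (*p_unnorm)(int),
              double (*q_density)(int to, int from),
              int (*q_sample)(int from))
  {
      int xp = q_sample(x);
      double a = (p_unnorm(xp) * q_density(x, xp)) /
                 (p_unnorm(x)  * q_density(xp, x));
      double u = (double)rand() / RAND_MAX;
      return (u < a) ? xp : x;   /* u < a also covers the a >= 1 case */
  }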


Sampling Strategy
• How do we collect the samples?
Strategy I:
• Run the chain M times, each for n steps
  – Each run starts from a different starting point
• Return the last state in each run
[Figure: M chains run in parallel]


Sampling Strategy
Strategy II:
• Run one chain for a long time
• After some "burn-in" period, sample points every fixed number of steps
[Figure: a "burn-in" period followed by M samples from one chain]


Comparing Strategies
Strategy I:
  – Better chance of "covering" the space of points, especially if the chain is slow to reach stationarity
  – Have to perform "burn-in" steps for each chain
Strategy II:
  – Perform "burn-in" only once
  – Samples might be correlated (although only weakly)
Hybrid strategy:
  – Run several chains, sampling a few times from each
  – Combines the benefits of both strategies


Short Summary - Approximate Inference
• Monte Carlo (sampling) methods:
  – Pro: simplicity of implementation and a theoretical guarantee of convergence
  – Con: can be slow to converge, and their convergence is hard to diagnose
• Variational methods – your presentation
• Loopy belief propagation and generalized belief propagation – your presentation


Exercise: MCMC Sampling
• Network: smart → prepared ← study; smart, prepared, fair → pass
  p(smart) = .8, p(study) = .6, p(fair) = .9

  p(prep|...)   smart   ¬smart
  study          .9      .7
  ¬study         .5      .1

  p(pass|...)       smart           ¬smart
                prep   ¬prep    prep   ¬prep
  fair           .9      .7      .7      .2
  ¬fair          .1      .1      .1      .1

• Query: What is the probability that a student studied, given that they pass the exam?


Main Computational Problems
1. It is difficult to tell whether convergence has been achieved
2. It can be wasteful if the Markov blanket is large
   – P(Xi|MB(Xi)) won't change much (law of large numbers)


5. Loopy Belief Propagation
• TBU


6. Variational Methods
• TBU


7. Implementation by PNL

  Algorithm             PNL                   GeNIe
  Enumeration           -                     v (Naïve)
  Variable Elimination  -                     -
  Belief Propagation    v (Pearl)             v (Polytree)
  Junction Tree         v                     v (Clustering)
  Direct Sampling       -                     v (Logic)
  Likelihood Sampling   v (LWSampling)        v (Likelihood sampling)
  MCMC Sampling         v (GibbsWithAnneal)   (Other 5 samplings)


8. Summary

• Exact inference by variable elimination
  – Polytime on polytrees
  – NP-hard on general graphs
  – Space cost comparable to time cost; very sensitive to topology


Summary

• Approximate inference by LW and MCMC
  – LW does poorly when there is a lot of (downstream) evidence
  – LW and MCMC are generally insensitive to topology
  – Convergence can be very slow when probabilities are close to 0 or 1
  – Both can handle arbitrary combinations of discrete and continuous variables


Summary

• What we know
  – What a Bayesian network is
  – How to do inference, given a Bayesian network
• However, we still need to know
  – How to learn CPTs
  – How to build, or automatically learn, the structure of a Bayesian network from a given set of data


9. References

• General introduction to probabilistic inference in BN
  – B. D'Ambrosio, "Inference in Bayesian Networks," AI Magazine, 1999.
  – M. I. Jordan and Y. Weiss, "Probabilistic Inference in Graphical Models."
  – C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, "An Introduction to MCMC for Machine Learning," Machine Learning, vol. 50, pp. 5-43, 2003.


Recent Books

• R. E. Neapolitan, Learning Bayesian Networks, Prentice Hall, 2004.
• C. Borgelt and R. Kruse, Graphical Models: Methods for Data Analysis and Mining, Wiley, 2002.
• D. Edwards, Introduction to Graphical Modelling, 2nd ed., Springer, 2000.
• S. L. Lauritzen, Graphical Models, Oxford, 1996.
• M. I. Jordan (ed.), Learning in Graphical Models, MIT Press, 2001.


Appendix

• Theoretical analysis of approximation error


Types of Approximations: Absolute Error

• An estimate q of P(X=x|e) has absolute error ε if
    P(X=x|e) - ε ≤ q ≤ P(X=x|e) + ε
  or, equivalently,
    q - ε ≤ P(X=x|e) ≤ q + ε
• Not always what we want: for error ε = 0.001
  – Unacceptable if P(X = x | e) = 0.0001
  – Overly precise if P(X = x | e) = 0.3

[Figure: the interval q ± ε, of width 2ε, on the probability line from 0 to 1]


Types of Approximations: Relative Error

• An estimate q of P(X=x|e) has relative error ε if
    P(X=x|e)(1 - ε) ≤ q ≤ P(X=x|e)(1 + ε)
  or, equivalently,
    q/(1 + ε) ≤ P(X=x|e) ≤ q/(1 - ε)
• The sensitivity of the approximation depends on the actual value of the desired result

[Figure: the interval [q/(1+ε), q/(1-ε)] around q on the probability line from 0 to 1]


Complexity

• Recall that exact inference is NP-hard
• Is approximate inference any easier?
• Construction used for exact inference:
  – Input: a 3-SAT formula φ
  – Output: a BN such that P(X = t) > 0 iff φ is satisfiable


Complexity: Relative Error

• Suppose q is an ε-relative-error estimate of P(X = t)
• If φ is not satisfiable, then P(X = t) = 0, so
    0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0
  Thus, if q > 0, then φ is satisfiable
• An immediate consequence:

  Thm: Given ε, finding an ε-relative-error approximation is NP-hard


Complexity: Absolute Error

• Thm: If ε < 0.5, then finding an estimate of P(X=x|e) with absolute error ε is NP-hard


Likelihood Weighting

• Can we ensure that all of our samples satisfy e?
• One simple solution:
  – When we need to sample a variable whose value is fixed by e, use the specified value
• For example, in the two-node network X → Y, suppose we know Y = 1
  – Sample X from P(X)
  – Then take Y = 1
• Is this a sample from P(X, Y | Y = 1)?


Likelihood Weighting

• Problem: these are samples of X drawn from the prior P(X)
• Solution:
  – Penalize samples for which P(Y=1|X) is small
• We now sample as follows (network X → Y):
  – Let x[i] be a sample from P(X)
  – Let w[i] = P(Y = 1 | X = x[i])

    P(X = x | Y = 1) ≈ Σ_i w[i] 1(x[i] = x) / Σ_i w[i]


Likelihood Weighting

• Why does this make sense?
• When N is large, we expect about N·P(X = x) samples with x[i] = x
• Thus,
    Σ_{i: x[i]=x} w[i] ≈ N·P(X = x)·P(Y = 1 | X = x) = N·P(X = x, Y = 1)
• When we normalize, we get an approximation of the conditional probability


Likelihood Weighting: Example

Network: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call.

CPTs:
  P(b) = 0.03           P(e) = 0.001
  P(a | b, e)  = 0.98   P(a | b, ¬e)  = 0.7
  P(a | ¬b, e) = 0.4    P(a | ¬b, ¬e) = 0.01
  P(c | a) = 0.8        P(c | ¬a) = 0.05
  P(r | e) = 0.3        P(r | ¬e) = 0.001

Walk-through of one weighted sample (B, E, A, C, R), with evidence A = ¬a and R = r:
1. Sample B from P(B), where P(b) = 0.03: get ¬b
2. Sample E from P(E), where P(e) = 0.001: get e
3. A is evidence: fix A = ¬a and set w ← 1 * P(¬a | ¬b, e) = 0.6
4. Sample C from P(C | ¬a), where P(c | ¬a) = 0.05: get ¬c
5. R is evidence: fix R = r and set w ← 0.6 * P(r | e) = 0.6 * 0.3

The result is the sample (¬b, e, ¬a, ¬c, r) with weight 0.18.

Likelihood Weighting: Algorithm

• Let X1, …, Xn be an ordering of the variables consistent with arc direction
• w ← 1
• for i = 1, …, n do
  – if Xi = xi has been observed
      • w ← w * P(Xi = xi | pa_i)
  – else
      • sample xi from P(Xi | pa_i)
• return x1, …, xn, and w
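A direct transcription of this procedure into Python, using the CPT values from the example above; encoding each CPT as a function of the partial sample is our own convention, not the lecture's:

```python
import random

# CPTs for the burglary network, as functions that map the partial
# assignment built so far to P(var = True | parents).
CPTS = {
    'B': lambda x: 0.03,
    'E': lambda x: 0.001,
    'A': lambda x: {(1, 1): 0.98, (1, 0): 0.7,
                    (0, 1): 0.4,  (0, 0): 0.01}[(x['B'], x['E'])],
    'C': lambda x: 0.8 if x['A'] else 0.05,
    'R': lambda x: 0.3 if x['E'] else 0.001,
}
ORDER = ['B', 'E', 'A', 'C', 'R']   # consistent with arc direction

def lw_sample(evidence):
    """One pass of the algorithm above: clamp each evidence variable and
    multiply its CPT probability into w; sample every other variable."""
    x, w = {}, 1.0
    for var in ORDER:
        p_true = CPTS[var](x)
        if var in evidence:
            x[var] = evidence[var]
            w *= p_true if evidence[var] else 1.0 - p_true
        else:
            x[var] = random.random() < p_true
    return x, w

# Estimate P(b | ¬a, r) by normalizing the weighted counts of B.
num = den = 0.0
for _ in range(100_000):
    x, w = lw_sample({'A': False, 'R': True})
    num += w * x['B']
    den += w
print(num / den)
```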


Importance Sampling

• A method for evaluating the expectation of f under P(x), <f>_P(X)
• Discrete:    <f>_P(X) = Σ_x f(x) P(x)
• Continuous:  <f>_P(X) = ∫ f(x) P(x) dx
• If we could sample from P:
    <f>_P(X) ≈ (1/R) Σ_r f(x[r])


Importance Sampling

A general method for evaluating <f>_P(X) when we cannot sample from P(X).
Idea: choose an approximating distribution Q(X) and sample from it.

  <f>_P(X) = ∫ f(x) P(x) dx = ∫ f(x) [P(x)/Q(x)] Q(x) dx = Σ_x f(x) [P(x)/Q(x)] Q(x)

with weight W(x) = P(x)/Q(x). Using this we can now sample from Q:

• If we could generate samples from P(X):
    <f>_P(X) ≈ (1/M) Σ_{m=1}^{M} f(x[m])
• Now that we generate the samples from Q(X):
    <f>_P(X) ≈ (1/M) Σ_{m=1}^{M} f(x[m]) w(m)


(Unnormalized) Importance Sampling

1. For m = 1:M
   – Sample x[m] from Q(X)
   – Calculate w(m) = P(x[m]) / Q(x[m])
2. Estimate the expectation of f(X) using
     <f>_P(X) ≈ (1/M) Σ_{m=1}^{M} f(x[m]) w(m)

Requirements:
• P(x) > 0 implies Q(x) > 0 (don't ignore possible scenarios)
• It is possible to calculate P(x) and Q(x) for a specific x
• It is possible to sample from Q(X)
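A minimal numeric sketch of this procedure under stated assumptions: the target P is a standard normal, the proposal Q is a zero-mean normal with standard deviation 2 (so Q > 0 wherever P > 0), and f(x) = x², whose expectation under P is 1:

```python
import math
import random

def p(x):   # target density P: standard normal N(0, 1)
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def q(x):   # proposal density Q: N(0, 2), heavier tails than P
    return math.exp(-0.5 * (x / 2) ** 2) / (2 * math.sqrt(2 * math.pi))

def importance_estimate(f, m=100_000):
    total = 0.0
    for _ in range(m):
        x = random.gauss(0, 2)        # sample x[m] from Q
        total += f(x) * p(x) / q(x)   # weight w(m) = P(x[m]) / Q(x[m])
    return total / m

print(importance_estimate(lambda x: x * x))   # approx 1.0 = E[X^2] under P
```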


Normalized Importance Sampling

Assume that we cannot evaluate P(X=x) but can evaluate P'(X=x) = αP(X=x) (e.g., in a Bayesian network we can evaluate P(X, e) but not P(X|e)).

We define w'(X) = P'(X)/Q(X). We can then evaluate:

  α = Σ_x P'(x) = Σ_x [P'(x)/Q(x)] Q(x) = <w'(X)>_Q(X)

and then:

  <f>_P(X) = ∫ f(x) P(x) dx
           = (1/α) ∫ f(x) [P'(x)/Q(x)] Q(x) dx
           = (1/α) <f(X) w'(X)>_Q(X)
           = <f(X) w'(X)>_Q(X) / <w'(X)>_Q(X)

In the last step we simply replace α with the expression derived above.


Normalized Importance Sampling

We can now estimate the expectation of f(X), as in unnormalized importance sampling, by sampling x[m] from Q(X) and then computing

  <f>_P(X) ≈ Σ_{m=1}^{M} f(x[m]) w'(m) / Σ_{m=1}^{M} w'(m)

(hence the name "normalized").
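The same numeric sketch, normalized: now the code sees only an unnormalized target P'(x) = exp(-x²/2) (its normalizer plays the role of α), and dividing by the sum of weights makes the unknown constant cancel:

```python
import math
import random

def p_unnorm(x):   # P'(x) = exp(-x^2/2); the normalizer is treated as unknown
    return math.exp(-0.5 * x * x)

def q(x):          # proposal density Q: N(0, 2)
    return math.exp(-0.5 * (x / 2) ** 2) / (2 * math.sqrt(2 * math.pi))

def normalized_is(f, m=100_000):
    num = den = 0.0
    for _ in range(m):
        x = random.gauss(0, 2)
        w = p_unnorm(x) / q(x)   # w'(m), weight against the unnormalized target
        num += f(x) * w
        den += w                 # the weight sum estimates M times alpha
    return num / den

print(normalized_is(lambda x: x * x))   # still approx 1.0: alpha cancels
```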


Importance Sampling: Weaknesses

• Important to choose a sampling distribution with heavy tails
  – So as not to "miss" large values of f
• Many-dimensional importance sampling:
  – The "typical set" of P may take a long time to find, unless Q is a good approximation to P
  – Weights vary by factors exponential in N
• Similar weaknesses apply to likelihood weighting
