
CS 540: Machine Learning
Lecture 3: Bayesian parameter estimation and hypothesis testing for discrete data
AD from KPM
January 2008

Naive Bayes Classifier

Let $C \in \{1, \dots, K\}$ represent the class of a document (e.g., $C = \text{spam}$ or $C = \text{not spam}$). Let $W_i = 1$ if word $i$ occurs in this document, otherwise $W_i = 0$.

A naive Bayes classifier assumes the words (features) are conditionally independent given the class (written as $W_i \perp W_j \mid C$).

This can be represented as a Bayes net (recall that a node is conditionally independent of its non-descendants given its parents).

[Figure: Bayes net with the class node C as the common parent of the evidence nodes E1, ..., EN.]

Naive Bayes Classifier: Inference

Since $W_i \perp W_j \mid C$, the joint is

$P(C, W_{1:N}) = P(C) \left[ \prod_{i=1}^{N} P(W_i \mid C) \right]$

Hence the posterior over class labels is given by

$P(C = c \mid w_{1:N}) = \dfrac{P(C = c) \prod_{i=1}^{N} P(w_i \mid c)}{\sum_{c'} P(C = c') \prod_{i=1}^{N} P(w_i \mid c')}$
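
To make the computation concrete, here is a minimal MATLAB sketch (not from the slides) of this posterior for binary word features; the class prior, the table condProb(c, i) = P(W_i = 1 | C = c), and the observed vector w are all hypothetical inputs, and the sums are done in the log domain to avoid underflow:

prior = [0.05; 0.95];                   % P(C = spam), P(C = not spam)
condProb = [0.8 0.1 0.6; 0.2 0.3 0.1];  % condProb(c,i) = P(W_i = 1 | C = c)
w = [1 0 1];                            % observed word indicators for one document
logpost = log(prior);
for c = 1:numel(prior)
  logpost(c) = logpost(c) + sum(w .* log(condProb(c,:)) + (1 - w) .* log(1 - condProb(c,:)));
end
post = exp(logpost - max(logpost));     % subtract the max for numerical stability
post = post / sum(post)                 % P(C = c | w_{1:N})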

Naive Bayes Classifier: Learning

The root CPD $P(C = c)$ can be estimated by counting how many times each class occurs (e.g., $P(C = \text{spam}) = 0.05$, $P(C = \text{not spam}) = 0.95$).

Each leaf CPD $P(w_i \mid c)$ can have a different kind of distribution, e.g., Bernoulli, Gaussian, etc.

For document classification, $P(W_i = 0/1 \mid C = c)$ can be estimated by counting how many times word $i$ occurs in documents of class $c$.

For real-valued data, $p(W_i \mid C = c)$ can be estimated by fitting a Gaussian to all data points that are labeled as class $c$.

If the class labels are not observed during training, this model can be used for clustering (see later).

Parameter learning

We said that the root CPD $P(C = c)$ can be estimated by counting how many times each class occurs. Why?

We said $P(W_i = 0/1 \mid C = c)$ can be estimated by counting how many times word $i$ occurs in documents of class $c$. Why? And what if the word never occurs?

We now discuss these issues, which are equivalent to estimating the parameters of coins and dice.

We will also discuss how to infer which words are useful for classification (feature selection) by computing the mutual information between two variables.

Bernoulli distribution

Let $X \in \{0, 1\}$ represent heads or tails. Suppose $P(X = 1) = \mu$. Then

$P(x \mid \mu) = \mathrm{Be}(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}$

It is easy to show that

$E[X] = \mu, \qquad \mathrm{Var}[X] = \mu(1 - \mu)$

MLE for a Bernoulli distribution

Given $D = (x_1, \dots, x_N)$, the likelihood is

$p(D \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1 - \mu)^{1 - x_n}$

The log-likelihood is

$L(\mu) = \log p(D \mid \mu) = \sum_n x_n \log \mu + (1 - x_n) \log(1 - \mu) = N_1 \log \mu + N_0 \log(1 - \mu)$

where $N_1 = n = \sum_n x_n$ is the number of heads and $N_0 = m = \sum_n (1 - x_n)$ is the number of tails (the sufficient statistics). Solving $\frac{dL}{d\mu} = 0$ yields

$\mu_{ML} = \frac{n}{n + m}$
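
As a quick illustrative check (not part of the slides), the sufficient statistics and the MLE can be read off a vector of 0/1 outcomes; the data vector here is hypothetical:

x = [1 0 1 1 0 1];        % observed coin flips (1 = heads)
n = sum(x);               % N1, the number of heads
m = sum(1 - x);           % N0, the number of tails
mu_ML = n / (n + m)       % maximum likelihood estimate of P(X = 1)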

Problems with the MLE

Suppose we have seen 3 heads out of 3 trials. Then we predict that all future coins will land heads:

$\mu_{ML} = \frac{n}{n + m} = \frac{3}{3 + 0} = 1$

This is an example of the sparse data problem: if we fail to see something in the training set (e.g., an unknown word), we predict that it can never happen in the future.

We will now see how to solve this pathology using Bayesian estimation.

Conjugate priors

A Bayesian estimate of $\mu$ requires a prior $p(\mu)$.

A prior is called conjugate if, when multiplied by the likelihood $p(D \mid \mu)$, the resulting posterior is in the same parametric family as the prior. (Closed under Bayesian updating.)

The Beta prior is conjugate to the Bernoulli likelihood:

$P(\mu \mid D) \propto P(D \mid \mu) P(\mu) \propto [\mu^{n} (1 - \mu)^{m}][\mu^{a-1} (1 - \mu)^{b-1}] = \mu^{n+a-1} (1 - \mu)^{m+b-1}$

where $n$ is the number of heads and $m$ is the number of tails.

$a, b$ are hyperparameters (parameters of the prior) and correspond to the number of "virtual" heads/tails (pseudo counts). $N_0 = a + b$ is called the effective sample size (strength) of the prior. $a = b = 1$ is a uniform prior (Laplace smoothing).

The beta distribution

To ensure the prior is normalized, we define

$P(\mu \mid a, b) = \mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \mu^{a-1} (1 - \mu)^{b-1}$

where the gamma function is defined as

$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\, du$

Note that $\Gamma(x + 1) = x \Gamma(x)$ and $\Gamma(1) = 1$. Also, for integers, $\Gamma(x + 1) = x!$.

The normalization constant $1/Z(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}$ ensures

$\int_0^1 \mathrm{Beta}(\mu \mid a, b)\, d\mu = 1$

The beta distribution

If $\mu \sim \mathrm{Be}(a, b)$, then

$E[\mu] = \frac{a}{a + b}, \qquad \mathrm{mode}[\mu] = \frac{a - 1}{a + b - 2}$

Bayesian updating of a beta distribution

If we start with a beta prior $\mathrm{Be}(\mu \mid a, b)$ and see $n$ heads and $m$ tails, we end up with a beta posterior $\mathrm{Be}(\mu \mid a + n, b + m)$:

$P(\mu \mid D) = \frac{1}{P(D)} P(D \mid \mu) P(\mu \mid a, b) = \frac{1}{P(D)} [\mu^{n} (1 - \mu)^{m}] \frac{1}{Z(a, b)} [\mu^{a-1} (1 - \mu)^{b-1}] = \mathrm{Be}(\mu \mid n + a, m + b)$

The marginal likelihood is the ratio of the normalizing constants:

$P(D) = \frac{Z(a + n, b + m)}{Z(a, b)} = \frac{\Gamma(a + n)\Gamma(b + m)}{\Gamma(a + n + b + m)} \cdot \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}$
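
A minimal MATLAB sketch of this update (illustrative, not from the slides), assuming pseudo counts a, b and observed counts n heads and m tails; gammaln keeps the marginal likelihood in the log domain:

a = 1; b = 1;             % Beta prior hyperparameters (uniform prior)
n = 3; m = 0;             % observed heads and tails
a_post = a + n;           % posterior is Be(a + n, b + m)
b_post = b + m;
logZ = @(a, b) gammaln(a) + gammaln(b) - gammaln(a + b);   % log of the Beta normalizer Z(a, b)
logML = logZ(a_post, b_post) - logZ(a, b)                  % log marginal likelihood, log P(D)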

Sequential Bayesian updating

Start with a beta prior $p(\theta \mid \alpha_h, \alpha_t) = B(\theta; \alpha_h, \alpha_t)$. Observe $N$ trials with $N_h$ heads and $N_t$ tails. The posterior becomes

$p(\theta \mid \alpha_h, \alpha_t, N_h, N_t) = B(\theta; \alpha_h + N_h, \alpha_t + N_t) = B(\theta; \alpha'_h, \alpha'_t)$

Observe another $N'$ trials with $N'_h$ heads and $N'_t$ tails. The posterior becomes

$p(\theta \mid \alpha'_h, \alpha'_t, N'_h, N'_t) = B(\theta; \alpha'_h + N'_h, \alpha'_t + N'_t) = B(\theta; \alpha_h + N_h + N'_h, \alpha_t + N_t + N'_t)$

So sequentially absorbing data, in any order, is equivalent to a batch update (assuming iid data and exact Bayesian updating).

This is useful for online learning and large datasets.

Bayesian updating in pictures

Start with $\mathrm{Be}(\mu \mid a = 2, b = 2)$ and observe $x = 1$, so the posterior is $\mathrm{Be}(\mu \mid a = 3, b = 2)$.

Posterior predictive distribution

The posterior predictive distribution is

$p(X = 1 \mid D) = \int_0^1 p(X = 1 \mid \mu)\, p(\mu \mid D)\, d\mu = \int_0^1 \mu\, p(\mu \mid D)\, d\mu = E[\mu \mid D] = \frac{n + a}{n + m + a + b}$

With a uniform prior $a = b = 1$, we get Laplace's rule of succession:

$p(X = 1 \mid N_h, N_t) = \frac{N_h + 1}{N_h + N_t + 2}$

Start with $\mathrm{Be}(\mu \mid a = 2, b = 2)$ and observe $x = 1$ to get $\mathrm{Be}(\mu \mid a = 3, b = 2)$, so the mean shifts from $E[\mu] = 2/4$ to $E[\mu \mid D] = 3/5$.

Effect of prior strength

Let $N = N_h + N_t$ be the number of samples (observations). Let $N'$ be the number of pseudo observations (the strength of the prior) and define the prior means via

$\alpha_h = N' \alpha'_h, \qquad \alpha_t = N' \alpha'_t, \qquad \alpha'_h + \alpha'_t = 1$

Then the posterior mean is a convex combination of the prior mean and the MLE (where $\lambda = N'/(N + N')$):

$P(X = 1 \mid \alpha_h, \alpha_t, N_h, N_t) = \frac{\alpha_h + N_h}{\alpha_h + N_h + \alpha_t + N_t} = \frac{N' \alpha'_h + N_h}{N + N'} = \frac{N'}{N + N'} \alpha'_h + \frac{N}{N + N'} \frac{N_h}{N} = \lambda \alpha'_h + (1 - \lambda) \frac{N_h}{N}$

Effect of prior strength

Suppose we have a uniform prior $\alpha'_h = \alpha'_t = 0.5$, and we observe $N_h = 3$, $N_t = 7$.

Weak prior, $N' = 2$. Posterior prediction:

$P(X = h \mid \alpha_h = 1, \alpha_t = 1, N_h = 3, N_t = 7) = \frac{3 + 1}{3 + 1 + 7 + 1} = \frac{4}{12} \approx 0.33$

Strong prior, $N' = 20$. Posterior prediction:

$P(X = h \mid \alpha_h = 10, \alpha_t = 10, N_h = 3, N_t = 7) = \frac{3 + 10}{3 + 10 + 7 + 10} = \frac{13}{30} \approx 0.43$

However, if we have enough data, it washes away the prior. E.g., with $N_h = 300$, $N_t = 700$, the estimates are $\frac{300 + 1}{1000 + 2}$ and $\frac{300 + 10}{1000 + 20}$, both of which are close to 0.3.

As $N \to \infty$, $P(\theta \mid D) \to \delta(\theta, \hat{\theta}_{ML})$, so $E[\theta \mid D] \to \hat{\theta}_{ML}$.

Parameter posterior - small sample, uniform prior

[Figure: rows of (prior, likelihood, posterior) plots for a Beta(1.0, 1.0) prior, with likelihoods for 1 heads / 0 tails, 1 heads / 1 tails, 10 heads / 1 tails, 10 heads / 5 tails, and 10 heads / 10 tails.]

Parameter posterior - small sample, strong prior

[Figure: rows of (prior, likelihood, posterior) plots for a Beta(10.0, 10.0) prior, with likelihoods for 1 heads / 0 tails, 1 heads / 1 tails, 10 heads / 1 tails, 10 heads / 5 tails, and 10 heads / 10 tails.]

Prior smooths parameter estimates

The MLE can change dramatically with small sample sizes.

The Bayesian estimate changes much more smoothly (depending on the strength of the prior).

[Figure: lower blue = MLE, red = Beta(1,1), pink = Beta(5,5), upper blue = Beta(10,10).]

Maximum a posteriori (MAP) estimation

MAP estimation picks the mode of the posterior:

$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$

If $\theta \sim \mathrm{Be}(a, b)$, this is just

$\hat{\theta}_{MAP} = (a - 1)/(a + b - 2)$

MAP is equivalent to maximizing the penalized log-likelihood

$\hat{\theta}_{MAP} = \arg\max_{\theta}\, \log p(D \mid \theta) - \lambda c(\theta)$

where $c(\theta) = -\log p(\theta)$ is called a regularization term. $\lambda$ is related to the strength of the prior.
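
A small illustrative sketch (not from the slides), assuming a Beta(a, b) prior and n heads, m tails as above; it shows how the MAP estimate and the posterior mean differ only through the -1 / -2 terms:

a = 2; b = 2;                          % prior hyperparameters
n = 3; m = 0;                          % observed heads and tails
ap = a + n; bp = b + m;                % posterior hyperparameters
theta_MAP  = (ap - 1) / (ap + bp - 2)  % posterior mode (MAP estimate)
theta_mean = ap / (ap + bp)            % posterior mean (used for prediction)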

Integrate out or Optimize?

$\hat{\theta}_{MAP}$ is not Bayesian (even though it uses a prior), since it is a point estimate.

Consider predicting the future. A Bayesian will integrate out all uncertainty:

$p(x_{new} \mid X) = \int p(x_{new}, \theta \mid X)\, d\theta = \int p(x_{new} \mid \theta, X)\, p(\theta \mid X)\, d\theta = \int p(x_{new} \mid \theta)\, p(\theta \mid X)\, d\theta$

A frequentist will use a "plug-in" estimator, e.g., ML/MAP:

$p(x_{new} \mid X) = p(x_{new} \mid \hat{\theta}), \qquad \hat{\theta} = \arg\max_{\theta}\, p(\theta \mid X)$

From coins to dice

Suppose we observe $N$ iid rolls of a $K$-sided die: $D = (3, 1, K, 2, \dots)$.

Let $[x] \in \{0, 1\}^K$ be a one-of-K encoding of $x$, e.g., if $x = 3$ and $K = 6$, then $[x] = (0, 0, 1, 0, 0, 0)^T$.

Multinomial distribution: $p(X = k) = \theta_k$, with $\sum_k \theta_k = 1$.

Log-likelihood:

$\ell(\theta; D) = \log p(D \mid \theta) = \sum_m \log \prod_k \theta_k^{\delta_k(x_m)} = \sum_m \sum_k \delta_k(x_m) \log \theta_k = \sum_k N_k \log \theta_k$

We need to maximize this subject to the constraint $\sum_k \theta_k = 1$, so we use a Lagrange multiplier.

MLE for multinomial

Constrained cost function:

$\tilde{\ell} = \sum_k N_k \log \theta_k + \lambda \left(1 - \sum_k \theta_k\right)$

Take derivatives wrt $\theta_k$:

$\frac{\partial \tilde{\ell}}{\partial \theta_k} = \frac{N_k}{\theta_k} - \lambda = 0$

$N_k = \lambda \theta_k$

$\sum_k N_k = N = \lambda \sum_k \theta_k = \lambda$

$\hat{\theta}_{k,ML} = \frac{N_k}{N}$

$\hat{\theta}_{k,ML}$ is the fraction of times $k$ occurs.
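
A minimal MATLAB illustration (not from the slides) of computing the counts N_k and the MLE from a vector of die rolls; the data are hypothetical:

X = [3 1 6 2 3 3 5 1];      % observed rolls of a K-sided die
K = 6;
Nk = zeros(1, K);
for k = 1:K
  Nk(k) = sum(X == k);      % sufficient statistics: count of each face
end
theta_ML = Nk / sum(Nk)     % MLE: the empirical fraction of each face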

Dirichlet priors

Let $X \in \{1, \dots, K\}$ have a multinomial distribution

$P(X \mid \theta) = \theta_1^{I(X=1)} \theta_2^{I(X=2)} \cdots \theta_K^{I(X=K)}$

For a set of data $X^1, \dots, X^N$, the sufficient statistics are the counts $N_i = \sum_n I(X^n = i)$.

Consider a Dirichlet prior with hyperparameters $\alpha$:

$p(\theta \mid \alpha) = D(\theta \mid \alpha) = \frac{1}{Z(\alpha)}\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_K^{\alpha_K - 1}$

where $Z(\alpha)$ is the normalizing constant

$Z(\alpha) = \int \cdots \int \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1}\, d\theta_1 \cdots d\theta_K = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^K \alpha_i)}$

Properties of the Dirichlet distribution

If $\theta \sim \mathrm{Dir}(\theta \mid \alpha_1, \dots, \alpha_K)$, then

$E[\theta_k] = \frac{\alpha_k}{\alpha_0}, \qquad \mathrm{mode}[\theta_k] = \frac{\alpha_k - 1}{\alpha_0 - K}$

where $\alpha_0 \stackrel{\mathrm{def}}{=} \sum_{k=1}^K \alpha_k$ is the total strength of the prior.

Likelihood, prior, posterior, evidence

Likelihood, prior, and posterior:

$P(\vec{N} \mid \vec{\theta}) = \prod_{i=1}^K \theta_i^{N_i}$

$p(\theta \mid \alpha) = D(\theta \mid \alpha) = \frac{1}{Z(\alpha)}\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_K^{\alpha_K - 1}$

$p(\theta \mid \vec{N}, \vec{\alpha}) = \frac{1}{Z(\alpha)\, p(\vec{N} \mid \alpha)}\, \theta_1^{\alpha_1 + N_1 - 1} \cdots \theta_K^{\alpha_K + N_K - 1} = D(\alpha_1 + N_1, \dots, \alpha_K + N_K)$

Marginal likelihood (evidence):

$P(\vec{N} \mid \vec{\alpha}) = \frac{Z(\vec{N} + \vec{\alpha})}{Z(\vec{\alpha})} = \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(N + \sum_k \alpha_k)} \prod_k \frac{\Gamma(N_k + \alpha_k)}{\Gamma(\alpha_k)}$
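
An illustrative MATLAB sketch (not from the slides) of the log evidence of a count vector under a Dirichlet prior; gammaln keeps everything in the log domain, and the counts and hyperparameters here are hypothetical:

Nk    = [15 29 14];          % observed counts for each category
alpha = [1 1 1];             % Dirichlet hyperparameters (uniform prior)
N = sum(Nk);
logML = gammaln(sum(alpha)) - gammaln(N + sum(alpha)) ...
        + sum(gammaln(Nk + alpha) - gammaln(alpha))      % log P(N | alpha)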

Marginal likelihood as approximation of negative entropy

Marginal likelihood (evidence):

$P(\vec{N} \mid \vec{\alpha}) = \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(N + \sum_k \alpha_k)} \prod_k \frac{\Gamma(N_k + \alpha_k)}{\Gamma(\alpha_k)}$

If $\alpha_k = 1$, this becomes

$P(\vec{N} \mid \vec{\alpha} = 1) = \frac{\Gamma(K)}{\Gamma(N + K)} \prod_k \frac{\Gamma(N_k + 1)}{\Gamma(1)}$

Using the fact that $\Gamma(1) = 1$ and Stirling's approximation $\log \Gamma(x + 1) \approx x \log x - x$, we get

$\log P(\vec{N} \mid \vec{\alpha} = 1) \approx -N \log N + N + \sum_k (N_k \log N_k - N_k) = \sum_k N_k \log(N_k / N) = -N H(\{N_k / N\})$

where $H(p) = -\sum_k p_k \log p_k$ is the entropy of a distribution.

Entropy

Surprising (unlikely) events convey more information, so we define the information content of an observation (in bits) to be

$h(x) = \log_2 \frac{1}{p(x)}$

The average information content of a random variable $X$ is

$H(X) = -\sum_x p(x) \log_2 p(x)$
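
A tiny MATLAB sketch of this definition (illustrative only), using the convention that 0 log2 0 = 0:

p = [0.5 0.25 0.25];         % a discrete distribution
terms = p .* log2(p);
terms(p == 0) = 0;           % convention: 0 * log2(0) = 0
H = -sum(terms)              % entropy in bits (1.5 here)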

Data compression

There is a close link between density estimation and data compression.

The noiseless coding theorem (Shannon, 1948) says that the entropy is a lower bound on the number of bits needed to transmit the state of $X$.

More likely states can be given shorter code words.

There is also a close link between data compression and model selection / hypothesis testing.

Hypothesis testing

When spun on edge $N = 250$ times, a Belgian one-euro coin came up heads $Y = 141$ times and tails 109 times.

"It looks very suspicious to me. We can reject the null hypothesis (that the coin is unbiased) with a significance level of 5%." - Barry Blight, LSE (modified from a quote in The Guardian, 2002)

Does this mean $P(H_0 \mid D) < 0.05$? Let us compare classical hypothesis testing with a Bayesian approach (using the marginal likelihood).

Classical hypothesis testing

We would like to distinguish two models, or hypotheses: $H_0$ means the coin is unbiased (so $p = 0.5$); $H_1$ means the coin is biased (has probability of heads $p \neq 0.5$).

We need a decision rule that maps data to accept/reject. We will do this by computing a scalar summary of our data called the deviance, $d(D)$, and comparing its observed value with what we would expect if $H_0$ were true.

We declare "$H_1$" if $d(D) > t$ for some threshold $t$ (to be determined).

In our case, we will use $d(D) = N_h$, the number of heads.

P-values

The p-value of a threshold $t$ is the probability of falsely rejecting the null hypothesis:

$\mathrm{pval}(t) = P(\{D' : d(D') > t\} \mid H_0, N)$

Intuitively, the p-value is the probability of getting data at least that extreme given $H_0$.

Since computing the p-value requires summing over all possible datasets of size $N$, a standard approximation is to consider the expected distribution of $d(D')$, assuming $D' \sim P(\cdot \mid H_0)$, as $N \to \infty$.

Significance levels

The p-value of a threshold $t$ is the probability of falsely rejecting the null hypothesis:

$\mathrm{pval}(t) = P(\{D' : d(D') > t\} \mid H_0, N)$

We usually choose a threshold $t$ so that the probability of a false rejection is below some significance level $\alpha = 0.05$ (i.e., choose $t$ s.t. $\mathrm{pval}(t) \leq \alpha$).

This means that on average we will "only" be wrong 1 in 20 times (!).

Classical analysis of the euro-coin data

Blight used a two-sided test and found a p-value of 0.0497, so he said "we can reject the null hypothesis at significance level 0.05".

$\mathrm{pval} = P(Y \geq 141 \mid H_0) + P(Y \leq 109 \mid H_0) = (1 - P(Y < 141 \mid H_0)) + P(Y \leq 109 \mid H_0) = (1 - P(Y \leq 140 \mid H_0)) + P(Y \leq 109 \mid H_0) = 0.0497$

In MATLAB:

n = 250; p = 0.5;
p1 = 1 - binocdf(140, n, p);   % P(Y >= 141 | H0)
p2 = binocdf(109, n, p);       % P(Y <= 109 | H0)
pval = p1 + p2

Classical analysis violates the likelihood principle

Why do we care about tail probabilities, such as

$P(Y \geq 141 \mid H_0) = P(Y = 141 \mid H_0) + P(Y = 142 \mid H_0) + \cdots$

when the number of heads we observed was 141, not 142 or larger?

"The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred." - Jeffreys, 1961.

P-values (and therefore all classical hypothesis tests) violate the likelihood principle, which says that in order to choose between hypotheses $H_0$ and $H_1$ given observed data $D$, one should ask how likely the observed data are under each hypothesis; do not ask questions about data that we might have observed but did not.

For more examples, see "What is Bayesian statistics and why everything else is wrong", Michael Lavine (2000).

Bayesian approach

We want to compute the posterior ratio of the two hypotheses:

$\frac{P(H_1 \mid D)}{P(H_0 \mid D)} = \frac{P(D \mid H_1) P(H_1)}{P(D \mid H_0) P(H_0)}$

Let us assume a uniform prior $P(H_0) = P(H_1) = 0.5$.

Then we just focus on the ratio of the marginal likelihoods:

$P(D \mid H_1) = \int_0^1 P(D \mid \theta, H_1)\, P(\theta \mid H_1)\, d\theta$

For $H_0$, there is no free parameter, so

$P(D \mid H_0) = 0.5^N$

where $N$ is the number of coin tosses in $D$.

Ratio of evidences (Bayes factor)

We compute the ratio of marginal likelihoods (evidences):

$BF(1, 0) = \frac{P(D \mid H_1)}{P(D \mid H_0)} = \frac{Z(\alpha_h + N_h, \alpha_t + N_t)}{Z(\alpha_h, \alpha_t)} \cdot \frac{1}{0.5^N} = \frac{\Gamma(140 + \alpha)\Gamma(110 + \alpha)}{\Gamma(250 + 2\alpha)} \cdot \frac{\Gamma(2\alpha)}{\Gamma(\alpha)\Gamma(\alpha)} \cdot 2^{250}$

We compute $BF(1, 0)$ for a range of prior strengths $\alpha_t = \alpha_h = \alpha$. We must work in the log domain to avoid underflow!
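
A minimal MATLAB sketch of this computation in the log domain (illustrative; it uses the counts exactly as they appear in the formula above, and the grid of alpha values matches the table on the next slide):

alpha = exp(-1:7);           % prior strengths 0.37, 1, 2.7, ..., 1096
logZ  = @(a, b) gammaln(a) + gammaln(b) - gammaln(a + b);  % log Beta normalizer
logBF = logZ(140 + alpha, 110 + alpha) - logZ(alpha, alpha) + 250*log(2);
BF = exp(logBF)              % Bayes factor P(D|H1) / P(D|H0) for each alpha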

So, is the coin biased or not?

We compute the likelihood ratio versus the hyperparameter $\alpha$:

alpha:              0.37   1      2.7    7.4    20     55     148    403    1096
P(H1|D)/P(H0|D):    0.21   0.48   0.82   1.32   1.77   1.92   1.72   1.34   1.21

For a uniform prior, $\frac{P(H_1 \mid D)}{P(H_0 \mid D)} = 0.48$, (weakly) favoring the fair-coin hypothesis $H_0$!

At best, for $\alpha$ around 50, we can make the biased hypothesis twice as likely.

Not as dramatic as saying "we reject the null hypothesis (fair coin) at significance level 5%".

Summary: Bayesian vs classical hypothesis testing

The Bayesian approach is simpler and more natural (no need for "p-values", "significance tests", etc.).

The Bayesian approach does not violate the likelihood principle.

The Bayesian approach allows the use of prior knowledge to prevent us from jumping to conclusions too hastily.

See the excellent tutorials on P-values and Bayes factors by Steven Goodman on the web page.

Another example: testing for independence

Suppose we are given $N$ $(x, y)$ pairs, where $X$ has $J$ possible values and $Y$ has $K$. We want to know if $X$ and $Y$ are independent.

E.g., consider this contingency table:

          y = 1   y = 2   y = 3
x = 1     15      29      14
x = 2     46      83      56

Traditional approach: compare $H_0 : X \perp Y$ vs $H_1 : X \not\perp Y$. Compute a p-value using the $\chi^2$ statistic

$d_{\chi^2}(D) = \sum_{x,y} \frac{(O_{x,y} - E_{x,y})^2}{E_{x,y}} = \sum_{x,y} \frac{(N(x, y) - N P(x) P(y))^2}{N P(x) P(y)}$

Let us consider a Bayesian approach.
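
An illustrative MATLAB sketch (not from the slides) of this statistic for the table above, using the empirical marginals for P(x) and P(y):

O = [15 29 14; 46 83 56];    % observed contingency table N(x, y)
N = sum(O(:));
Px = sum(O, 2) / N;          % empirical P(x) (column vector)
Py = sum(O, 1) / N;          % empirical P(y) (row vector)
E = N * (Px * Py);           % expected counts under independence
chi2 = sum(sum((O - E).^2 ./ E))   % chi-square statistic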

Bayesian test for independence

If independent,

$P(D \mid H_0) = p(X \mid \alpha_{j \cdot})\, p(Y \mid \alpha_{\cdot k})$

where $\alpha_{j \cdot}$ and $\alpha_{\cdot k}$ are different prior vectors.

If dependent,

$P(D \mid H_1) = p(X, Y \mid \alpha_{jk})$

We want to compute

$p(H_0 \mid D) = \frac{p(D \mid H_0)\, p(H_0)}{p(D \mid H_0)\, p(H_0) + p(D \mid H_1)\, p(H_1)} = \frac{1}{1 + \frac{p(D \mid H_1)}{p(D \mid H_0)} \frac{p(H_1)}{p(H_0)}}$

If we assume $p(H_0) = p(H_1)$, we can focus on the Bayes factor $B = \frac{p(D \mid H_0)}{p(D \mid H_1)}$.

Bayesian test for independence

It is simple to show that

$B = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{p(X \mid \alpha_{j \cdot})\, p(Y \mid \alpha_{\cdot k})}{p(X, Y \mid \alpha_{jk})} = \frac{\Gamma(\sum_{jk} \alpha_{jk})}{\Gamma(N + \sum_{jk} \alpha_{jk})} \prod_{j=1}^{J} \frac{\Gamma(N_{j \cdot} + \alpha_{j \cdot})}{\Gamma(\alpha_{j \cdot})} \prod_{k=1}^{K} \frac{\Gamma(N_{\cdot k} + \alpha_{\cdot k})}{\Gamma(\alpha_{\cdot k})} \prod_{j,k} \frac{\Gamma(\alpha_{jk})}{\Gamma(N_{jk} + \alpha_{jk})}$

Using the entropy approximation, we get

$\log \frac{p(D \mid H_0)}{p(D \mid H_1)} \approx -N H\left(\frac{N_{j \cdot}}{N}\right) - N H\left(\frac{N_{\cdot k}}{N}\right) + N H\left(\frac{N_{jk}}{N}\right) = -N D\left(\frac{N_{jk}}{N} \,\middle\|\, \frac{N_{j \cdot}}{N} \cdot \frac{N_{\cdot k}}{N}\right) = -N I(X, Y)$

where

$D(p \| q) \stackrel{\mathrm{def}}{=} \sum_k p_k \log \frac{p_k}{q_k}$

is the Kullback-Leibler divergence between distributions $p$ and $q$, and

$I(X, Y) \stackrel{\mathrm{def}}{=} D(P(X, Y) \| P(X) P(Y))$

is the mutual information between $X$ and $Y$.
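
As an illustration (not from the slides), this Bayes factor can also be computed exactly from the Dirichlet evidence formula given earlier, here assuming a uniform Dirichlet(1, ..., 1) prior on the joint and the induced row/column pseudo counts for the marginal models (this choice of hyperparameters is an assumption):

O = [15 29 14; 46 83 56];    % contingency table N(j, k)
a_jk = ones(size(O));        % Dirichlet hyperparameters for the joint model (H1)
a_j  = sum(a_jk, 2)';        % induced row hyperparameters (H0)
a_k  = sum(a_jk, 1);         % induced column hyperparameters (H0)
logev = @(n, a) gammaln(sum(a)) - gammaln(sum(n) + sum(a)) + sum(gammaln(n + a) - gammaln(a));
logB = logev(sum(O, 2)', a_j) + logev(sum(O, 1), a_k) - logev(O(:)', a_jk(:)')   % log p(D|H0)/p(D|H1)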

KL divergence (relative entropy)

$\mathrm{KL}(p \| q)$ is a "distance" measure of $q$ from $p$:

$D(p \| q) \stackrel{\mathrm{def}}{=} \sum_k p_k \log \frac{p_k}{q_k}$

It is not strictly a distance, since it is asymmetric.

The KL can be rewritten as

$D(p \| q) = \sum_k p_k \log p_k - \sum_k p_k \log q_k = -\sum_k p_k \log q_k - H(p)$

This makes it clear that the KL measures the extra number of bits we would need to use to encode $X$ if we thought the distribution was $q$ but it was actually $p$.

KL satisfies $D(p \| q) \geq 0$ with equality iff $p = q$.
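
A minimal MATLAB sketch of this definition (illustrative only); the two distributions are hypothetical, and terms with p_k = 0 are taken to contribute zero:

p = [0.5 0.25 0.25];
q = [1/3 1/3 1/3];
terms = p .* log(p ./ q);
terms(p == 0) = 0;           % convention: 0 * log(0/q) = 0
KL = sum(terms)              % D(p || q) in nats (use log2 for bits)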

Minimizing KL divergence is maximizing likelihood

We would like to find $q(x \mid \theta)$ such that $D(p \| q)$ is minimized, where $p(x)$ is the "true" distribution.

Of course $p(x)$ is unknown, but we can approximate it by the empirical distribution of the samples. Then

$\mathrm{KL}(p \| q) \approx \frac{1}{N} \sum_n \left[\log p(x_n) - \log q(x_n \mid \theta)\right]$

Since $p(x)$ is independent of $\theta$, we find that

$\arg\min_q \mathrm{KL}(p \| q) = \arg\max_q \frac{1}{N} \sum_n \log q(x_n \mid \theta)$

Mutual information

The mutual information measures how close the joint and independent distributions are:

$I(X, Y) \stackrel{\mathrm{def}}{=} D(P(X, Y) \| P(X) P(Y))$

It is easy to show that the mutual information between $X$ and $Y$ is how much our uncertainty about $Y$ decreases when we observe $X$ (or vice versa):

$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$

$I(X, Y) \geq 0$ with equality iff $X \perp Y$.
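
An illustrative MATLAB sketch (not from the slides): the empirical mutual information of the contingency table used earlier, computed directly from this definition as a KL divergence between the joint and the product of marginals:

O = [15 29 14; 46 83 56];            % joint counts N(x, y)
Pxy = O / sum(O(:));                 % empirical joint P(x, y)
PxPy = sum(Pxy, 2) * sum(Pxy, 1);    % product of the marginals P(x)P(y)
terms = Pxy .* log(Pxy ./ PxPy);
terms(Pxy == 0) = 0;
I = sum(terms(:))                    % mutual information I(X, Y) in nats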

Bayesian test for independence

It is simple to show (homework!) that

$B = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{p(X \mid \alpha_{j \cdot})\, p(Y \mid \alpha_{\cdot k})}{p(X, Y \mid \alpha_{jk})} = \frac{\Gamma(\sum_{jk} \alpha_{jk})}{\Gamma(N + \sum_{jk} \alpha_{jk})} \prod_{j=1}^{J} \frac{\Gamma(N_{j \cdot} + \alpha_{j \cdot})}{\Gamma(\alpha_{j \cdot})} \prod_{k=1}^{K} \frac{\Gamma(N_{\cdot k} + \alpha_{\cdot k})}{\Gamma(\alpha_{\cdot k})} \prod_{j,k} \frac{\Gamma(\alpha_{jk})}{\Gamma(N_{jk} + \alpha_{jk})}$

Using the entropy approximation, we get

$\log \frac{p(D \mid H_0)}{p(D \mid H_1)} \approx -N H\left(\frac{N_{j \cdot}}{N}\right) - N H\left(\frac{N_{\cdot k}}{N}\right) + N H\left(\frac{N_{jk}}{N}\right) = -N D\left(\frac{N_{jk}}{N} \,\middle\|\, \frac{N_{j \cdot}}{N} \cdot \frac{N_{\cdot k}}{N}\right) = -N I(X, Y)$

where

$D(p \| q) \stackrel{\mathrm{def}}{=} \sum_k p_k \log \frac{p_k}{q_k}$

is the Kullback-Leibler divergence between distributions $p$ and $q$, and

$I(X, Y) \stackrel{\mathrm{def}}{=} D(P(X, Y) \| P(X) P(Y))$

is the mutual information between $X$ and $Y$.

Chi square is an approximation to the mutual information

We showed

$\log \frac{p(D \mid H_0)}{p(D \mid H_1)} \approx -N D\left(\frac{N_{jk}}{N} \,\middle\|\, \frac{N_{j \cdot}}{N} \cdot \frac{N_{\cdot k}}{N}\right) = -N I(X, Y)$

If we make the additional approximation

$D(p \| q) \approx \sum_k \frac{(p_k - q_k)^2}{2 q_k}$

then we recover the $\chi^2$ statistic.

Are two histograms from the same distribution?

To see if two samples $X$ and $Y$ come from the same multinomial distribution, create an indicator variable $C \in \{1, 2\}$ which specifies which data set each sample comes from:

          z = 1   z = 2   ...   z = K
c = 1     N1      N2      ...   NK
c = 2     M1      M2      ...   MK

If the two histograms are from the same distribution, then $C$ is independent of $Z$. So just compute $P(C \perp Z \mid D)$. As before, we get $\log \frac{P(D \mid \text{same})}{P(D \mid \text{diff})} \approx -N I(C, Z)$, which can be further approximated using $\chi^2$.