Sargur Srihari, CSE574, Chapter 9
TRANSCRIPT
Machine Learning Srihari
9. Mixture Models and EM
0. Mixture Models Overview
1. K-Means Clustering
2. Mixtures of Gaussians
3. An Alternative View of EM
4. The EM Algorithm in General
Topics in Mixtures of Gaussians
• Goal of Gaussian Mixture Modeling
• Latent Variables
• Maximum Likelihood
• EM for Gaussian Mixtures
GMMs and Latent Variables
• A GMM is a linear superposition of Gaussian components
  – Provides a richer class of density models than the single Gaussian
• We formulate a GMM in terms of discrete latent variables
  – This provides deeper insight into this distribution
  – Serves to motivate the EM algorithm, which gives a maximum likelihood solution for the mixing coefficients and the components' means and covariances
Latent Variable Representation
• The Gaussian mixture distribution is written as a linear superposition of K Gaussian components:

  p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

• Represent the component identity by a K-dimensional binary variable z
  – Using the 1-of-K representation (one-hot vector)
  – Let z = (z_1, .., z_K), whose elements satisfy z_k ∈ {0,1} and Σ_k z_k = 1
  – There are K possible states of z, corresponding to the K components

• Example with K = 2 components:

  k     1      2
  z     (1,0)  (0,1)
  π_k   0.4    0.6
  µ_k   −2.8   1.86
  σ_k   0.48   0.88

• Example states with K = 3 components:

  k     1        2        3
  z     (1,0,0)  (0,1,0)  (0,0,1)
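The 1-of-K coding above can be illustrated with a short sketch (Python with numpy; the sampling helper is our own illustration, and the K = 2 values follow the table above):

```python
import numpy as np

# Mixing coefficients of the K=2 example above
pi = np.array([0.4, 0.6])

def sample_z(pi, rng):
    """Draw a 1-of-K binary vector z: z_k in {0,1} with sum_k z_k = 1."""
    k = rng.choice(len(pi), p=pi)      # state k is chosen with probability pi_k
    z = np.zeros(len(pi), dtype=int)
    z[k] = 1
    return z

rng = np.random.default_rng(0)
z = sample_z(pi, rng)
```

Exactly one element of z is 1, which selects the component that generates the observation.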
Goal of Gaussian Mixture Modeling
• A linear superposition of Gaussians in the form

  p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

• Goal of modeling: find the maximum likelihood parameters π_k, µ_k, Σ_k
• Examples of data sets and models:
  – 1-D data, K = 2 subclasses (e.g., π = 0.4, 0.6; µ = −2.8, 1.86; σ = 0.48, 0.88)
  – 2-D data, K = 3
• Each data point is associated with a subclass k with probability π_k
Joint Distribution
• Define the joint distribution of the latent and observed variables:
  – p(x, z) = p(x | z) p(z)
  – x is the observed variable
  – z is the hidden or missing variable
  – p(z) is the marginal distribution
  – p(x | z) is the conditional distribution
Graphical Representation of Mixture Model
• The joint distribution p(x, z) is represented in the form p(z) p(x | z)
  – We now specify the marginal p(z) and the conditional p(x | z)
  – Using them we specify p(x) in terms of observed and latent variables
• In the graph, the latent variable z = (z_1, .., z_K) represents the subclass and x is the observed variable
Specifying the marginal p(z)
• Associate a probability with each component z_k
  – Denote p(z_k = 1) = π_k, where the parameters {π_k} satisfy

    0 ≤ π_k ≤ 1 and Σ_k π_k = 1

• Because z uses the 1-of-K representation, with z_k ∈ {0,1} and exactly one element equal to 1, it follows that

  p(z) = Π_{k=1}^{K} π_k^{z_k}

  – With one component: p(z_1) = π_1^{z_1}
  – With two components: p(z_1, z_2) = π_1^{z_1} π_2^{z_2}
Specifying the Conditional p(x|z)
• For a particular component (value of z):

  p(x | z_k = 1) = N(x | µ_k, Σ_k)

• Thus p(x|z) can be written in the form

  p(x | z) = Π_{k=1}^{K} N(x | µ_k, Σ_k)^{z_k}

  – Due to the exponent z_k, all product terms except one are equal to one
Marginal distribution p(x)
• The joint distribution p(x, z) is given by p(z) p(x | z)
• Thus the marginal distribution of x is obtained by summing over all possible states of z:

  p(x) = Σ_z p(z) p(x | z) = Σ_z Π_{k=1}^{K} [π_k N(x | µ_k, Σ_k)]^{z_k} = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

  – using z_k ∈ {0,1}, with exactly one z_k equal to 1
• This is the standard form of a Gaussian mixture
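As a sanity check, the mixture density can be evaluated and numerically integrated; a minimal 1-D sketch (the parameter values echo the earlier two-component table, and `mixture_pdf` is our own helper, not from the lecture):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, pi, mu, sigma):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    return sum(p * gauss_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

pi, mu, sigma = [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88]
xs = np.linspace(-10.0, 10.0, 4001)
total = mixture_pdf(xs, pi, mu, sigma).sum() * (xs[1] - xs[0])  # ~1 by normalization
```

Because the π_k sum to one and each component is normalized, the mixture integrates to one.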
Value of Introducing Latent Variable
• Suppose we have observations x_1, .., x_N
• Because the marginal distribution is in the form

  p(x) = Σ_z p(x, z)

  – it follows that for every observed data point x_n there is a corresponding latent vector z_n, i.e., its sub-class
• Thus we have found a formulation of the Gaussian mixture involving an explicit latent variable
  – We are now able to work with the joint distribution p(x, z) instead of the marginal p(x)
• This leads to significant simplification through the introduction of expectation maximization
Another conditional probability (Responsibility)
• In EM, the conditional p(z | x) plays a central role
• The probability p(z_k = 1 | x) is denoted γ(z_k)
  – From Bayes' theorem:

    γ(z_k) ≡ p(z_k = 1 | x) = p(z_k = 1) p(x | z_k = 1) / Σ_{j=1}^{K} p(z_j = 1) p(x | z_j = 1)
                            = π_k N(x | µ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x | µ_j, Σ_j)

    where p(x, z) = p(x | z) p(z) and p(z_k = 1) = π_k
  – View π_k as the prior probability of component k and γ(z_k) as the corresponding posterior probability; it is also the responsibility that component k takes for explaining the observation x
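For a 1-D mixture the responsibility formula is a one-liner; a hedged sketch (our own helper name, scalar variances assumed, parameter values from the earlier table):

```python
import numpy as np

def responsibilities(x, pi, mu, sigma):
    """gamma(z_k) = pi_k N(x|mu_k,sigma_k^2) / sum_j pi_j N(x|mu_j,sigma_j^2)."""
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    weighted = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return weighted / weighted.sum()   # posterior over the K components

gamma = responsibilities(0.5, [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88])
```

The responsibilities are non-negative and sum to one, as a posterior over components must.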
Plan of Discussion
• Next we look at:
  1. How to synthetically generate data from a mixture model
  2. Given a data set {x_1, .., x_N}, how to model the data using a mixture of Gaussians
Synthesizing data from mixture
• Use ancestral sampling
  – Start with the lowest numbered node and draw a sample: generate a sample of z, call it ẑ
  – Move to a successor node and draw a sample given the parent value, etc.
• Then generate a value for x from the conditional p(x | ẑ)
• Samples from p(x, z) are plotted according to the value of x and colored with the value of z (the complete data set)
• Samples from the marginal p(x) are obtained by ignoring the values of z (the incomplete data set)
• [Figure: 500 points from three Gaussians, shown as the complete and the incomplete data set]
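Ancestral sampling as described above, sketched for the 1-D case (the helper name is ours; component values reuse the earlier two-component table):

```python
import numpy as np

def sample_mixture(n, pi, mu, sigma, rng):
    """Ancestral sampling: first draw z from p(z), then x from p(x|z)."""
    z = rng.choice(len(pi), size=n, p=pi)              # latent component labels
    x = rng.normal(np.take(mu, z), np.take(sigma, z))  # x ~ N(mu_z, sigma_z^2)
    return x, z  # (x, z) = complete data; x alone = incomplete data

rng = np.random.default_rng(0)
x, z = sample_mixture(500, [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88], rng)
```

Discarding z turns the complete data set into the incomplete one, exactly as in the figure.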
Illustration of responsibilities
• Evaluate γ(z_nk) for every data point
  – The posterior probability of each component
  – The responsibility associated with data point x_n and component k
• Color each point using proportions of red, blue and green ink
  – If γ(z_n1) = 1 for a data point, it is colored red
  – If for another point γ(z_n2) = γ(z_n3) = 0.5, it has equal blue and green and will appear as cyan
Maximum Likelihood for GMM
• We wish to model a data set {x_1, .., x_N} using a mixture of Gaussians (N items, each of dimension D)
• Represent the data by an N × D matrix X whose nth row is given by x_n^T:

  X = [x_1^T ; x_2^T ; .. ; x_N^T]

• Represent the N latent variables by an N × K matrix Z whose nth row is given by z_n^T:

  Z = [z_1^T ; z_2^T ; .. ; z_N^T]

• The goal is to state the likelihood function, so as to estimate the three sets of parameters by maximizing the likelihood
Graphical representation of GMM
• For a set of i.i.d. data points {xn} with corresponding latent points {zn} where n=1,..,N
• Bayesian network for p(X, Z) using plate notation
  – N × D matrix X
  – N × K matrix Z
Likelihood Function for GMM
• Since z takes values {z_k} with probabilities {π_k}, the mixture density function is

  p(x) = Σ_z p(z) p(x | z) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

• Therefore the likelihood function is

  p(X | π, µ, Σ) = Π_{n=1}^{N} { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

  where the product is over the N i.i.d. samples
• Therefore the log-likelihood function is

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

• We wish to maximize this quantity, a more difficult problem than for a single Gaussian
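The log-likelihood above can be evaluated stably with the log-sum-exp trick; a 1-D sketch (the helper name and the small data set are illustrative, not from the lecture):

```python
import numpy as np

def log_likelihood(X, pi, mu, sigma):
    """ln p(X|pi,mu,sigma) = sum_n ln sum_k pi_k N(x_n|mu_k,sigma_k^2) (1-D),
    computed with the log-sum-exp trick for numerical stability."""
    X = np.asarray(X, dtype=float)[:, None]                         # (N, 1)
    pi, mu, sigma = (np.asarray(a, dtype=float)[None, :] for a in (pi, mu, sigma))
    log_terms = (np.log(pi) - 0.5 * ((X - mu) / sigma) ** 2
                 - np.log(sigma) - 0.5 * np.log(2 * np.pi))         # ln[pi_k N(.)]
    m = log_terms.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))).sum())

ll = log_likelihood([-2.9, -2.7, 1.8, 2.0], [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88])
```

Subtracting the per-row maximum before exponentiating avoids underflow when a point is far from every component.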
Maximization of Log-Likelihood
• The goal is to estimate the three sets of parameters π_k, µ_k, Σ_k in

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

  – by taking derivatives in turn w.r.t. each while keeping the others constant
  – But there are no closed-form solutions
• The task is not straightforward: the summation over components appears inside the logarithm, so the logarithm no longer acts directly on the Gaussians
• While gradient-based optimization is possible, we consider the iterative EM algorithm
Some issues with GMM m.l.e.
• Before proceeding with the m.l.e., we briefly mention two technical issues:
  1. The problem of singularities with Gaussian mixtures
  2. The problem of identifiability of mixtures
Problem of Singularities with Gaussian mixtures
• Consider a Gaussian mixture whose components have covariance matrices Σ_k = σ_k² I
• A data point that falls exactly on a mean, µ_j = x_n, contributes to the likelihood function the term

  N(x_n | x_n, σ_j² I) = 1 / ((2π)^{1/2} σ_j)   (in one dimension)

  since exp(−(x_n − µ_j)² / 2σ_j²) = 1
• As σ_j → 0 this term goes to infinity, so maximization of the log-likelihood

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

  is not well-posed: one component assigns finite values to most points while taking an arbitrarily large value at x_n
• This does not happen with a single Gaussian
  – There, the multiplicative factors from the other data points go to zero, driving the overall likelihood to zero
• It also does not happen in the Bayesian approach
• In practice the problem is avoided using heuristics, e.g., resetting the offending mean or covariance
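The collapse can be seen numerically: place one mean on a data point and shrink its variance, and the log-likelihood grows without bound (a deliberately minimal 1-D sketch with made-up data):

```python
import numpy as np

def log_likelihood(X, pi, mu, sigma):
    """1-D mixture log-likelihood (direct evaluation, adequate at this scale)."""
    X = np.asarray(X, dtype=float)[:, None]
    pi, mu, sigma = (np.asarray(a, dtype=float)[None, :] for a in (pi, mu, sigma))
    dens = pi * np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.log(dens.sum(axis=1)).sum())

X = [-1.0, 0.0, 2.0]
# The second component's mean sits exactly on the data point x = 2.0
lls = [log_likelihood(X, [0.5, 0.5], [0.0, 2.0], [1.0, s]) for s in (1.0, 0.1, 0.001)]
```

As σ_2 shrinks the likelihood keeps increasing; the broad first component keeps every log term finite, which is exactly the pathology described above.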
Problem of Identifiability
• A K-component mixture will have a total of K! equivalent solutions
  – Corresponding to the K! ways of assigning K sets of parameters to K components
  – E.g., for K = 3, K! = 6: 123, 132, 213, 231, 312, 321
  – For any given point in the space of parameter values there will be a further K! − 1 additional points all giving exactly the same distribution
• However, any of the equivalent solutions is as good as another
• A density p(x | θ) is identifiable if θ ≠ θ' implies there is an x for which p(x | θ) ≠ p(x | θ')
• [Figure: two ways of labeling three Gaussian subclasses, A B C vs. B A C]
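Numerically, permuting the component labels leaves the density unchanged; a small 1-D check (values reuse the earlier table; the helper is our own):

```python
import numpy as np

def mixture_pdf(x, pi, mu, sigma):
    """1-D mixture density p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    return float((pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
                  / (sigma * np.sqrt(2 * np.pi))).sum())

# One labeling of the components, and the same mixture with the labels swapped
p1 = mixture_pdf(0.7, [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88])
p2 = mixture_pdf(0.7, [0.6, 0.4], [1.86, -2.8], [0.88, 0.48])
```

The two parameter vectors differ, yet every x gets the same density: the mixture is not identifiable up to permutation of labels.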
EM for Gaussian Mixtures
• EM is a method for finding maximum likelihood solutions for models with latent variables
• Begin with the log-likelihood function

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

• We wish to find the parameters π, µ, Σ that maximize this quantity
  – The task is not straightforward since the summation over components appears inside the logarithm, which therefore does not act directly on the Gaussians
• Take derivatives in turn w.r.t.
  – the means µ_k, and set to zero
  – the covariance matrices Σ_k, and set to zero
  – the mixing coefficients π_k, and set to zero
EM for GMM: Derivative wrt µ_k
• Begin with the log-likelihood function

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

• Take the derivative w.r.t. the means µ_k and set it to zero
  – Making use of the exponential form of the Gaussian
  – Use the formulas: d/dx ln u = u'/u and d/dx e^u = e^u u'
• We get

  0 = Σ_{n=1}^{N} [ π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j) ] Σ_k^{−1} (x_n − µ_k)

  – The bracketed ratio is the posterior probability (responsibility) γ(z_nk)
  – Σ_k^{−1} is the inverse of the covariance matrix
M.L.E. solution for Means
• Multiplying by Σ_k (assuming non-singularity) and rearranging:

  µ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n

• where we have defined

  N_k = Σ_{n=1}^{N} γ(z_nk)

  – which is the effective number of points assigned to cluster k
• The mean of the kth Gaussian component is the weighted mean of all the points in the data set, where data point x_n is weighted by the posterior probability that component k was responsible for generating x_n
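The weighted-mean update can be sketched directly from the formula (1-D case; the toy responsibilities below are made up purely for illustration):

```python
import numpy as np

def update_means(X, gamma):
    """mu_k = (1/N_k) sum_n gamma(z_nk) x_n, with N_k = sum_n gamma(z_nk).
    X: (N,) data points; gamma: (N, K) responsibilities."""
    Nk = gamma.sum(axis=0)                       # effective points per component
    return (gamma * X[:, None]).sum(axis=0) / Nk

X = np.array([-3.0, -2.5, 1.5, 2.0])
gamma = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]])
mu = update_means(X, gamma)
```

Each point contributes to every component's mean in proportion to the responsibility that component takes for it.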
M.L.E. solution for Covariance
• Set the derivative w.r.t. Σ_k to zero
  – Making use of the m.l.e. solution for the covariance matrix of a single Gaussian:

  Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T

  – Similar to the result for a single Gaussian fit to the data set, but with each data point weighted by the corresponding posterior probability
  – The denominator N_k is the effective number of points in component k
M.L.E. solution for Mixing Coefficients
• Maximize ln p(X | π, µ, Σ) w.r.t. π_k
  – Must take into account that the mixing coefficients sum to one
  – Achieved using a Lagrange multiplier and maximizing

    ln p(X | π, µ, Σ) + λ ( Σ_{k=1}^{K} π_k − 1 )

  – Setting the derivative w.r.t. π_k to zero and solving gives

    π_k = N_k / N
Summary of m.l.e. expressions
• GMM maximum likelihood parameter estimates:

  Means:                µ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n
  Covariance matrices:  Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T
  Mixing coefficients:  π_k = N_k / N
  where                 N_k = Σ_{n=1}^{N} γ(z_nk)

• All three are in terms of the responsibilities, and so we have not completely solved the problem
EM Formulation
• The results for µ_k, Σ_k, π_k are not closed-form solutions for the parameters
  – Since the responsibilities γ(z_nk) depend on those parameters in a complex way
• The results suggest an iterative solution
• This is an instance of the EM algorithm for the particular case of the GMM
Informal EM for GMM
• First choose initial values for the means, covariances and mixing coefficients
• Then alternate between the following two updates, called the E step and the M step
  – In the E step, use the current parameter values to evaluate the posterior probabilities, or responsibilities
  – In the M step, use these posterior probabilities to re-estimate the means, covariances and mixing coefficients
EM using Old Faithful
• [Figure panels: data points and initial mixture model; initial E step (determine responsibilities); after first M step (re-evaluate parameters); after 2, 5 and 20 cycles]
Comparison with K-Means
K-means result E-M result
Animation of EM for Old Faithful Data
• http://en.wikipedia.org/wiki/File:Em_old_faithful.gif
• Code in R:

  # initial parameter estimates (chosen to be deliberately bad)
  theta <- list(
    tau    = c(0.5, 0.5),
    mu1    = c(2.8, 75),
    mu2    = c(3.6, 58),
    sigma1 = matrix(c(0.8, 7, 7, 70), ncol = 2),
    sigma2 = matrix(c(0.8, 7, 7, 70), ncol = 2)
  )
Practical Issues with EM
• EM takes many more iterations than K-means, and each cycle requires significantly more computation
• It is common to run K-means first in order to find a suitable initialization
  – The covariance matrices can be initialized to the covariances of the clusters found by K-means
• EM is not guaranteed to find the global maximum of the log-likelihood function
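A minimal sketch of the K-means initialization idea in 1-D (`kmeans_init` is our own illustrative helper, not from the lecture; a production version would guard against empty clusters):

```python
import numpy as np

def kmeans_init(X, K, n_iter=20, rng=None):
    """Run a few K-means iterations (1-D) and return initial GMM parameters:
    mixing weights, means, and per-cluster standard deviations."""
    rng = rng or np.random.default_rng(0)
    mu = rng.choice(X, size=K, replace=False)            # random initial centers
    for _ in range(n_iter):
        labels = np.argmin(np.abs(X[:, None] - mu[None, :]), axis=1)
        mu = np.array([X[labels == k].mean() for k in range(K)])
    pi = np.array([(labels == k).mean() for k in range(K)])
    sigma = np.array([X[labels == k].std() + 1e-6 for k in range(K)])
    return pi, mu, sigma

X = np.concatenate([np.linspace(-3.1, -2.9, 10), np.linspace(1.9, 2.1, 10)])
pi, mu, sigma = kmeans_init(X, 2)
```

The cluster proportions, centers and spreads serve as starting π_k, µ_k and σ_k for the EM iterations.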
Summary of EM for GMM
• Given a Gaussian mixture model, the goal is to maximize the likelihood function w.r.t. the parameters (means, covariances and mixing coefficients)
• Step 1: Initialize the means µ_k, covariances Σ_k and mixing coefficients π_k, and evaluate the initial value of the log-likelihood
EM continued
• Step 2 (E step): Evaluate the responsibilities using the current parameter values:

  γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | µ_j, Σ_j)

• Step 3 (M step): Re-estimate the parameters using the current responsibilities:

  µ_k^new = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n
  Σ_k^new = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − µ_k^new)(x_n − µ_k^new)^T
  π_k^new = N_k / N

  where N_k = Σ_{n=1}^{N} γ(z_nk)
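Steps 2 and 3 can be combined into a complete 1-D EM loop; a hedged sketch (our own function and synthetic data, scalar variances assumed):

```python
import numpy as np

def em_gmm(X, pi, mu, sigma, n_iter=50):
    """EM for a 1-D Gaussian mixture: alternate the E step (responsibilities)
    and the M step (re-estimate pi, mu, sigma). Illustrative sketch only."""
    X = np.asarray(X, dtype=float)
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    for _ in range(n_iter):
        # E step: gamma(z_nk), shape (N, K)
        dens = (pi * np.exp(-0.5 * ((X[:, None] - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi)))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted re-estimates using the current responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma * X[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)
        pi = Nk / len(X)
    return pi, mu, sigma

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.8, 0.5, 300), rng.normal(1.9, 0.9, 400)])
pi, mu, sigma = em_gmm(X, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Note that the M step re-uses µ_k^new inside the variance update, matching the equations above.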
EM Continued
• Step 4: Evaluate the log-likelihood

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

  – and check for convergence of either the parameters or the log-likelihood
• If the convergence criterion is not satisfied, return to Step 2