Sargur Srihari, CSE574, Chapter 9
TRANSCRIPT
Machine Learning Srihari
9. Mixture Models and EM
0. Mixture Models Overview
1. K-Means Clustering
2. Mixtures of Gaussians
3. An Alternative View of EM
4. The EM Algorithm in General
Topics in Mixtures of Gaussians
• Goal of Gaussian Mixture Modeling
• Latent Variables
• Maximum Likelihood
• EM for Gaussian Mixtures
GMMs and Latent Variables
• A GMM is a linear superposition of Gaussian components
  – Provides a richer class of density models than the single Gaussian
• We formulate a GMM in terms of discrete latent variables
  – This provides deeper insight into this distribution
  – Serves to motivate the EM algorithm, which gives a maximum likelihood solution for the mixing coefficients and the components' means and covariances
Latent Variable Representation
• The Gaussian mixture distribution is written as a linear superposition of K Gaussian components:

  p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

• Represent the component identity by a K-dimensional binary variable z
  – Using the 1-of-K representation (one-hot vector)
  – Let z = (z_1, .., z_K), whose elements satisfy z_k ∈ {0,1} and Σ_k z_k = 1
  – There are K possible states of z, corresponding to the K components

• Example with K = 2 components:

  k     1      2
  z     (1,0)  (0,1)
  π_k   0.4    0.6
  µ_k   −2.8   1.86
  σ_k   0.48   0.88

• Example states with K = 3 components:

  k     1        2        3
  z     (1,0,0)  (0,1,0)  (0,0,1)
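The 1-of-K coding above can be illustrated with a short sketch (Python with numpy; the sampling helper is our own illustration, and the K = 2 values follow the table above):

```python
import numpy as np

# Mixing coefficients of the K=2 example above
pi = np.array([0.4, 0.6])

def sample_z(pi, rng):
    """Draw a 1-of-K binary vector z: z_k in {0,1} with sum_k z_k = 1."""
    k = rng.choice(len(pi), p=pi)      # state k is chosen with probability pi_k
    z = np.zeros(len(pi), dtype=int)
    z[k] = 1
    return z

rng = np.random.default_rng(0)
z = sample_z(pi, rng)
```

Exactly one element of z is 1, which selects the component that generates the observation.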
Goal of Gaussian Mixture Modeling
• A linear superposition of Gaussians in the form

  p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

• Goal of modeling: find the maximum likelihood parameters π_k, µ_k, Σ_k
• Examples of data sets and models:
  – 1-D data, K = 2 subclasses (e.g., π = 0.4, 0.6; µ = −2.8, 1.86; σ = 0.48, 0.88)
  – 2-D data, K = 3
• Each data point is associated with a subclass k with probability π_k
Joint Distribution
• Define the joint distribution of the latent and observed variables:
  – p(x, z) = p(x | z) p(z)
  – x is the observed variable
  – z is the hidden or missing variable
  – p(z) is the marginal distribution
  – p(x | z) is the conditional distribution
Graphical Representation of Mixture Model
• The joint distribution p(x, z) is represented in the form p(z) p(x | z)
  – We now specify the marginal p(z) and the conditional p(x | z)
  – Using them we specify p(x) in terms of observed and latent variables
• In the graph, the latent variable z = (z_1, .., z_K) represents the subclass and x is the observed variable
Specifying the marginal p(z)
• Associate a probability with each component z_k
  – Denote p(z_k = 1) = π_k, where the parameters {π_k} satisfy

    0 ≤ π_k ≤ 1 and Σ_k π_k = 1

• Because z uses the 1-of-K representation, with z_k ∈ {0,1} and exactly one element equal to 1, it follows that

  p(z) = Π_{k=1}^{K} π_k^{z_k}

  – With one component: p(z_1) = π_1^{z_1}
  – With two components: p(z_1, z_2) = π_1^{z_1} π_2^{z_2}
Specifying the Conditional p(x|z)
• For a particular component (value of z):

  p(x | z_k = 1) = N(x | µ_k, Σ_k)

• Thus p(x|z) can be written in the form

  p(x | z) = Π_{k=1}^{K} N(x | µ_k, Σ_k)^{z_k}

  – Due to the exponent z_k, all product terms except one are equal to one
Marginal distribution p(x)
• The joint distribution p(x, z) is given by p(z) p(x | z)
• Thus the marginal distribution of x is obtained by summing over all possible states of z:

  p(x) = Σ_z p(z) p(x | z) = Σ_z Π_{k=1}^{K} [π_k N(x | µ_k, Σ_k)]^{z_k} = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

  – using z_k ∈ {0,1}, with exactly one z_k equal to 1
• This is the standard form of a Gaussian mixture
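As a sanity check, the mixture density can be evaluated and numerically integrated; a minimal 1-D sketch (the parameter values echo the earlier two-component table, and `mixture_pdf` is our own helper, not from the lecture):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, pi, mu, sigma):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    return sum(p * gauss_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

pi, mu, sigma = [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88]
xs = np.linspace(-10.0, 10.0, 4001)
total = mixture_pdf(xs, pi, mu, sigma).sum() * (xs[1] - xs[0])  # ~1 by normalization
```

Because the π_k sum to one and each component is normalized, the mixture integrates to one.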
Value of Introducing Latent Variable
• Suppose we have observations x_1, .., x_N
• Because the marginal distribution is in the form

  p(x) = Σ_z p(x, z)

  – it follows that for every observed data point x_n there is a corresponding latent vector z_n, i.e., its sub-class
• Thus we have found a formulation of the Gaussian mixture involving an explicit latent variable
  – We are now able to work with the joint distribution p(x, z) instead of the marginal p(x)
• This leads to significant simplification through the introduction of expectation maximization
Another conditional probability (Responsibility)
• In EM, the conditional p(z | x) plays a central role
• The probability p(z_k = 1 | x) is denoted γ(z_k)
  – From Bayes' theorem:

    γ(z_k) ≡ p(z_k = 1 | x) = p(z_k = 1) p(x | z_k = 1) / Σ_{j=1}^{K} p(z_j = 1) p(x | z_j = 1)
                            = π_k N(x | µ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x | µ_j, Σ_j)

    where p(x, z) = p(x | z) p(z) and p(z_k = 1) = π_k
  – View π_k as the prior probability of component k and γ(z_k) as the corresponding posterior probability; it is also the responsibility that component k takes for explaining the observation x
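For a 1-D mixture the responsibility formula is a one-liner; a hedged sketch (our own helper name, scalar variances assumed, parameter values from the earlier table):

```python
import numpy as np

def responsibilities(x, pi, mu, sigma):
    """gamma(z_k) = pi_k N(x|mu_k,sigma_k^2) / sum_j pi_j N(x|mu_j,sigma_j^2)."""
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    weighted = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return weighted / weighted.sum()   # posterior over the K components

gamma = responsibilities(0.5, [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88])
```

The responsibilities are non-negative and sum to one, as a posterior over components must.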
Plan of Discussion
• Next we look at:
  1. How to synthetically generate data from a mixture model
  2. Given a data set {x_1, .., x_N}, how to model the data using a mixture of Gaussians
Synthesizing data from mixture
• Use ancestral sampling
  – Start with the lowest numbered node and draw a sample: generate a sample of z, call it ẑ
  – Move to a successor node and draw a sample given the parent value, etc.
• Then generate a value for x from the conditional p(x | ẑ)
• Samples from p(x, z) are plotted according to the value of x and colored with the value of z (the complete data set)
• Samples from the marginal p(x) are obtained by ignoring the values of z (the incomplete data set)
• [Figure: 500 points from three Gaussians, shown as the complete and the incomplete data set]
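Ancestral sampling as described above, sketched for the 1-D case (the helper name is ours; component values reuse the earlier two-component table):

```python
import numpy as np

def sample_mixture(n, pi, mu, sigma, rng):
    """Ancestral sampling: first draw z from p(z), then x from p(x|z)."""
    z = rng.choice(len(pi), size=n, p=pi)              # latent component labels
    x = rng.normal(np.take(mu, z), np.take(sigma, z))  # x ~ N(mu_z, sigma_z^2)
    return x, z  # (x, z) = complete data; x alone = incomplete data

rng = np.random.default_rng(0)
x, z = sample_mixture(500, [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88], rng)
```

Discarding z turns the complete data set into the incomplete one, exactly as in the figure.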
Illustration of responsibilities
• Evaluate γ(z_nk) for every data point
  – The posterior probability of each component
  – The responsibility associated with data point x_n and component k
• Color each point using proportions of red, blue and green ink
  – If γ(z_n1) = 1 for a data point, it is colored red
  – If for another point γ(z_n2) = γ(z_n3) = 0.5, it has equal blue and green and will appear as cyan
Maximum Likelihood for GMM
• We wish to model a data set {x_1, .., x_N} using a mixture of Gaussians (N items, each of dimension D)
• Represent the data by an N × D matrix X whose nth row is given by x_n^T:

  X = [x_1^T ; x_2^T ; .. ; x_N^T]

• Represent the N latent variables by an N × K matrix Z whose nth row is given by z_n^T:

  Z = [z_1^T ; z_2^T ; .. ; z_N^T]

• The goal is to state the likelihood function, so as to estimate the three sets of parameters by maximizing the likelihood
Graphical representation of GMM
• For a set of i.i.d. data points {xn} with corresponding latent points {zn} where n=1,..,N
• Bayesian network for p(X, Z) using plate notation
  – N × D matrix X
  – N × K matrix Z
Likelihood Function for GMM
• Since z takes values {z_k} with probabilities {π_k}, the mixture density function is

  p(x) = Σ_z p(z) p(x | z) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

• Therefore the likelihood function is

  p(X | π, µ, Σ) = Π_{n=1}^{N} { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

  where the product is over the N i.i.d. samples
• Therefore the log-likelihood function is

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

• We wish to maximize this quantity, a more difficult problem than for a single Gaussian
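The log-likelihood above can be evaluated stably with the log-sum-exp trick; a 1-D sketch (the helper name and the small data set are illustrative, not from the lecture):

```python
import numpy as np

def log_likelihood(X, pi, mu, sigma):
    """ln p(X|pi,mu,sigma) = sum_n ln sum_k pi_k N(x_n|mu_k,sigma_k^2) (1-D),
    computed with the log-sum-exp trick for numerical stability."""
    X = np.asarray(X, dtype=float)[:, None]                         # (N, 1)
    pi, mu, sigma = (np.asarray(a, dtype=float)[None, :] for a in (pi, mu, sigma))
    log_terms = (np.log(pi) - 0.5 * ((X - mu) / sigma) ** 2
                 - np.log(sigma) - 0.5 * np.log(2 * np.pi))         # ln[pi_k N(.)]
    m = log_terms.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))).sum())

ll = log_likelihood([-2.9, -2.7, 1.8, 2.0], [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88])
```

Subtracting the per-row maximum before exponentiating avoids underflow when a point is far from every component.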
Maximization of Log-Likelihood
• The goal is to estimate the three sets of parameters π_k, µ_k, Σ_k in

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

  – by taking derivatives in turn w.r.t. each while keeping the others constant
  – But there are no closed-form solutions
• The task is not straightforward: the summation over components appears inside the logarithm, so the logarithm no longer acts directly on the Gaussians
• While gradient-based optimization is possible, we consider the iterative EM algorithm
Some issues with GMM m.l.e.
• Before proceeding with the m.l.e., we briefly mention two technical issues:
  1. The problem of singularities with Gaussian mixtures
  2. The problem of identifiability of mixtures
Problem of Singularities with Gaussian mixtures
• Consider a Gaussian mixture whose components have covariance matrices Σ_k = σ_k² I
• A data point that falls exactly on a mean, µ_j = x_n, contributes to the likelihood function the term

  N(x_n | x_n, σ_j² I) = 1 / ((2π)^{1/2} σ_j)   (in one dimension)

  since exp(−(x_n − µ_j)² / 2σ_j²) = 1
• As σ_j → 0 this term goes to infinity, so maximization of the log-likelihood

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

  is not well-posed: one component assigns finite values to most points while taking an arbitrarily large value at x_n
• This does not happen with a single Gaussian
  – There, the multiplicative factors from the other data points go to zero, driving the overall likelihood to zero
• It also does not happen in the Bayesian approach
• In practice the problem is avoided using heuristics, e.g., resetting the offending mean or covariance
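The collapse can be seen numerically: place one mean on a data point and shrink its variance, and the log-likelihood grows without bound (a deliberately minimal 1-D sketch with made-up data):

```python
import numpy as np

def log_likelihood(X, pi, mu, sigma):
    """1-D mixture log-likelihood (direct evaluation, adequate at this scale)."""
    X = np.asarray(X, dtype=float)[:, None]
    pi, mu, sigma = (np.asarray(a, dtype=float)[None, :] for a in (pi, mu, sigma))
    dens = pi * np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.log(dens.sum(axis=1)).sum())

X = [-1.0, 0.0, 2.0]
# The second component's mean sits exactly on the data point x = 2.0
lls = [log_likelihood(X, [0.5, 0.5], [0.0, 2.0], [1.0, s]) for s in (1.0, 0.1, 0.001)]
```

As σ_2 shrinks the likelihood keeps increasing; the broad first component keeps every log term finite, which is exactly the pathology described above.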
Problem of Identifiability
• A K-component mixture will have a total of K! equivalent solutions
  – Corresponding to the K! ways of assigning K sets of parameters to K components
  – E.g., for K = 3, K! = 6: 123, 132, 213, 231, 312, 321
  – For any given point in the space of parameter values there will be a further K! − 1 additional points all giving exactly the same distribution
• However, any of the equivalent solutions is as good as another
• A density p(x | θ) is identifiable if θ ≠ θ' implies there is an x for which p(x | θ) ≠ p(x | θ')
• [Figure: two ways of labeling three Gaussian subclasses, A B C vs. B A C]
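Numerically, permuting the component labels leaves the density unchanged; a small 1-D check (values reuse the earlier table; the helper is our own):

```python
import numpy as np

def mixture_pdf(x, pi, mu, sigma):
    """1-D mixture density p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    return float((pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
                  / (sigma * np.sqrt(2 * np.pi))).sum())

# One labeling of the components, and the same mixture with the labels swapped
p1 = mixture_pdf(0.7, [0.4, 0.6], [-2.8, 1.86], [0.48, 0.88])
p2 = mixture_pdf(0.7, [0.6, 0.4], [1.86, -2.8], [0.88, 0.48])
```

The two parameter vectors differ, yet every x gets the same density: the mixture is not identifiable up to permutation of labels.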
EM for Gaussian Mixtures
• EM is a method for finding maximum likelihood solutions for models with latent variables
• Begin with the log-likelihood function

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

• We wish to find the parameters π, µ, Σ that maximize this quantity
  – The task is not straightforward since the summation over components appears inside the logarithm, which therefore does not act directly on the Gaussians
• Take derivatives in turn w.r.t.
  – the means µ_k, and set to zero
  – the covariance matrices Σ_k, and set to zero
  – the mixing coefficients π_k, and set to zero
EM for GMM: Derivative wrt µ_k
• Begin with the log-likelihood function

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

• Take the derivative w.r.t. the means µ_k and set it to zero
  – Making use of the exponential form of the Gaussian
  – Use the formulas: d/dx ln u = u'/u and d/dx e^u = e^u u'
• We get

  0 = Σ_{n=1}^{N} [ π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j) ] Σ_k^{−1} (x_n − µ_k)

  – The bracketed ratio is the posterior probability (responsibility) γ(z_nk)
  – Σ_k^{−1} is the inverse of the covariance matrix
M.L.E. solution for Means
• Multiplying by Σ_k (assuming non-singularity) and rearranging:

  µ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n

• where we have defined

  N_k = Σ_{n=1}^{N} γ(z_nk)

  – which is the effective number of points assigned to cluster k
• The mean of the kth Gaussian component is the weighted mean of all the points in the data set, where data point x_n is weighted by the posterior probability that component k was responsible for generating x_n
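The weighted-mean update can be sketched directly from the formula (1-D case; the toy responsibilities below are made up purely for illustration):

```python
import numpy as np

def update_means(X, gamma):
    """mu_k = (1/N_k) sum_n gamma(z_nk) x_n, with N_k = sum_n gamma(z_nk).
    X: (N,) data points; gamma: (N, K) responsibilities."""
    Nk = gamma.sum(axis=0)                       # effective points per component
    return (gamma * X[:, None]).sum(axis=0) / Nk

X = np.array([-3.0, -2.5, 1.5, 2.0])
gamma = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]])
mu = update_means(X, gamma)
```

Each point contributes to every component's mean in proportion to the responsibility that component takes for it.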
M.L.E. solution for Covariance
• Set the derivative w.r.t. Σ_k to zero
  – Making use of the m.l.e. solution for the covariance matrix of a single Gaussian:

  Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T

  – Similar to the result for a single Gaussian fit to the data set, but with each data point weighted by the corresponding posterior probability
  – The denominator N_k is the effective number of points in component k
M.L.E. solution for Mixing Coefficients
• Maximize ln p(X | π, µ, Σ) w.r.t. π_k
  – Must take into account that the mixing coefficients sum to one
  – Achieved using a Lagrange multiplier and maximizing

    ln p(X | π, µ, Σ) + λ ( Σ_{k=1}^{K} π_k − 1 )

  – Setting the derivative w.r.t. π_k to zero and solving gives

    π_k = N_k / N
Summary of m.l.e. expressions
• GMM maximum likelihood parameter estimates:

  Means:                µ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n
  Covariance matrices:  Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T
  Mixing coefficients:  π_k = N_k / N
  where                 N_k = Σ_{n=1}^{N} γ(z_nk)

• All three are in terms of the responsibilities, and so we have not completely solved the problem
EM Formulation
• The results for µ_k, Σ_k, π_k are not closed-form solutions for the parameters
  – Since the responsibilities γ(z_nk) depend on those parameters in a complex way
• The results suggest an iterative solution
• This is an instance of the EM algorithm for the particular case of the GMM
Informal EM for GMM
• First choose initial values for the means, covariances and mixing coefficients
• Then alternate between the following two updates, called the E step and the M step
  – In the E step, use the current parameter values to evaluate the posterior probabilities, or responsibilities
  – In the M step, use these posterior probabilities to re-estimate the means, covariances and mixing coefficients
EM using Old Faithful
• [Figure panels: data points and initial mixture model; initial E step (determine responsibilities); after first M step (re-evaluate parameters); after 2, 5 and 20 cycles]
Comparison with K-Means
K-means result E-M result
Animation of EM for Old Faithful Data
• http://en.wikipedia.org/wiki/File:Em_old_faithful.gif
• Code in R:

  # initial parameter estimates (chosen to be deliberately bad)
  theta <- list(
    tau    = c(0.5, 0.5),
    mu1    = c(2.8, 75),
    mu2    = c(3.6, 58),
    sigma1 = matrix(c(0.8, 7, 7, 70), ncol = 2),
    sigma2 = matrix(c(0.8, 7, 7, 70), ncol = 2)
  )
Practical Issues with EM
• EM takes many more iterations than K-means, and each cycle requires significantly more computation
• It is common to run K-means first in order to find a suitable initialization
  – The covariance matrices can be initialized to the covariances of the clusters found by K-means
• EM is not guaranteed to find the global maximum of the log-likelihood function
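A minimal sketch of the K-means initialization idea in 1-D (`kmeans_init` is our own illustrative helper, not from the lecture; a production version would guard against empty clusters):

```python
import numpy as np

def kmeans_init(X, K, n_iter=20, rng=None):
    """Run a few K-means iterations (1-D) and return initial GMM parameters:
    mixing weights, means, and per-cluster standard deviations."""
    rng = rng or np.random.default_rng(0)
    mu = rng.choice(X, size=K, replace=False)            # random initial centers
    for _ in range(n_iter):
        labels = np.argmin(np.abs(X[:, None] - mu[None, :]), axis=1)
        mu = np.array([X[labels == k].mean() for k in range(K)])
    pi = np.array([(labels == k).mean() for k in range(K)])
    sigma = np.array([X[labels == k].std() + 1e-6 for k in range(K)])
    return pi, mu, sigma

X = np.concatenate([np.linspace(-3.1, -2.9, 10), np.linspace(1.9, 2.1, 10)])
pi, mu, sigma = kmeans_init(X, 2)
```

The cluster proportions, centers and spreads serve as starting π_k, µ_k and σ_k for the EM iterations.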
Summary of EM for GMM
• Given a Gaussian mixture model, the goal is to maximize the likelihood function w.r.t. the parameters (means, covariances and mixing coefficients)
• Step 1: Initialize the means µ_k, covariances Σ_k and mixing coefficients π_k, and evaluate the initial value of the log-likelihood
EM continued
• Step 2 (E step): Evaluate the responsibilities using the current parameter values:

  γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | µ_j, Σ_j)

• Step 3 (M step): Re-estimate the parameters using the current responsibilities:

  µ_k^new = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n
  Σ_k^new = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − µ_k^new)(x_n − µ_k^new)^T
  π_k^new = N_k / N

  where N_k = Σ_{n=1}^{N} γ(z_nk)
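Steps 2 and 3 can be combined into a complete 1-D EM loop; a hedged sketch (our own function and synthetic data, scalar variances assumed):

```python
import numpy as np

def em_gmm(X, pi, mu, sigma, n_iter=50):
    """EM for a 1-D Gaussian mixture: alternate the E step (responsibilities)
    and the M step (re-estimate pi, mu, sigma). Illustrative sketch only."""
    X = np.asarray(X, dtype=float)
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    for _ in range(n_iter):
        # E step: gamma(z_nk), shape (N, K)
        dens = (pi * np.exp(-0.5 * ((X[:, None] - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi)))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted re-estimates using the current responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma * X[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)
        pi = Nk / len(X)
    return pi, mu, sigma

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.8, 0.5, 300), rng.normal(1.9, 0.9, 400)])
pi, mu, sigma = em_gmm(X, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Note that the M step re-uses µ_k^new inside the variance update, matching the equations above.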
EM Continued
• Step 4: Evaluate the log-likelihood

  ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

  – and check for convergence of either the parameters or the log-likelihood
• If the convergence criterion is not satisfied, return to Step 2