Machine Learning: CSE 574
Hidden Markov Models
Sargur N. Srihari [email protected]
Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html
HMM Topics
1. What is an HMM?
2. State-space Representation
3. HMM Parameters
4. Generative View of HMM
5. Determining HMM Parameters Using EM
6. Forward-Backward or α−β algorithm
7. HMM Implementation Issues:
   a) Length of Sequence
   b) Predictive Distribution
   c) Sum-Product Algorithm
   d) Scaling Factors
   e) Viterbi Algorithm
1. What is an HMM?
• Ubiquitous tool for modeling time series data
• Used in
  • Almost all speech recognition systems
  • Computational molecular biology
    • Group amino acid sequences into proteins
  • Handwritten word recognition
• It is a tool for representing probability distributions over sequences of observations
• HMM gets its name from two defining properties:
  • Observation xt at time t was generated by some process whose state zt is hidden from the observer
  • Assumes that state zt depends only on state zt-1 and is independent of all prior states (first order)
• Example: z are phoneme sequences, x are acoustic observations
Graphical Model of HMM
• Has the graphical model shown below and latent variables are discrete
• Joint distribution has the form:

  p(x1,..,xN, z1,..,zN) = p(z1) [ Π_{n=2}^N p(zn|zn-1) ] Π_{n=1}^N p(xn|zn)
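The factorization above can be sketched directly in code. This is a minimal illustration, assuming a discrete emission table B[state, symbol]; the 2-state numbers are made up for the example, not taken from the slides.

```python
import numpy as np

def joint_prob(z, x, pi, A, B):
    """Joint p(x1..xN, z1..zN) = p(z1) * prod_n p(zn|zn-1) * prod_n p(xn|zn).

    z, x are integer state/symbol sequences; pi is the initial distribution,
    A the transition matrix, B an assumed discrete emission table B[state, symbol].
    """
    p = pi[z[0]]                        # p(z1)
    for n in range(1, len(z)):
        p *= A[z[n - 1], z[n]]          # transition factor p(zn | zn-1)
    for n in range(len(z)):
        p *= B[z[n], x[n]]              # emission factor p(xn | zn)
    return p

# Tiny 2-state example (illustrative numbers)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])
p = joint_prob([0, 1], [0, 1], pi, A, B)  # p(z1=0) p(z2=1|z1=0) p(x1=0|z1=0) p(x2=1|z2=1)
```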
HMM Viewed as Mixture
• A single time slice corresponds to a mixture distribution with component densities p(x|z)
• Each state of discrete variable z represents a different component
• An extension of mixture model
• Choice of mixture component depends on choice of mixture component for previous distribution
• Latent variables are multinomial variables zn that describe component responsible for generating xn
• Can use one-of-K coding scheme
2. State-Space Representation
• Probability distribution of zn depends on state of previous latent variable zn-1 through probability distribution p(zn|zn-1)
• One-of-K coding
• Since latent variables are K-dimensional binary vectors:
  Ajk = p(znk=1 | zn-1,j=1), with Σk Ajk = 1
• These are known as Transition Probabilities
• K(K-1) independent parameters
• A is a K x K matrix whose rows j index the state of zn-1 and whose columns k index the state of zn
[Figure: one-of-K codings of zn-1 and zn, and the K x K matrix A]
Transition Probabilities
• Example with 3-state latent variable (K=3)
• One-of-K coding of states:
  zn=1 ↔ (zn1,zn2,zn3) = (1,0,0); zn=2 ↔ (0,1,0); zn=3 ↔ (0,0,1), and similarly for zn-1
• Rows of the transition matrix:
  from zn-1=1: A11 A12 A13
  from zn-1=2: A21 A22 A23
  from zn-1=3: A31 A32 A33
• Each row sums to one, e.g., A11 + A12 + A13 = 1
State Transition Diagram
• Not a graphical model, since nodes are not separate variables but states of a single variable
[Figure: state transition diagram for the 3-state example]
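The constraints above (row-stochastic A, K(K-1) free parameters, one-of-K coding selecting a row) can be checked in a few lines. The matrix entries here are illustrative assumptions, chosen only so each row sums to one as the slides require.

```python
import numpy as np

# Illustrative 3-state transition matrix (K=3)
A = np.array([[0.8, 0.1, 0.1],    # A11 A12 A13: transitions out of state 1
              [0.2, 0.6, 0.2],    # A21 A22 A23
              [0.3, 0.3, 0.4]])   # A31 A32 A33

K = A.shape[0]
row_sums = A.sum(axis=1)          # each must equal 1 (e.g., A11 + A12 + A13 = 1)
n_free = K * (K - 1)              # independent parameters, since each row has one constraint

# One-of-K coding: state j=2 of zn-1 as a binary indicator vector
z_prev = np.eye(K)[1]             # [0, 1, 0]
p_next = z_prev @ A               # indicator selects row 2 of A: p(zn | zn-1=2)
```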
Conditional Probabilities
• Transition probabilities Ajk represent state-to-state probabilities for each variable
• Conditional probabilities are variable-to-variable probabilities
• Can be written in terms of transition probabilities as

  p(zn|zn-1, A) = Π_{k=1}^K Π_{j=1}^K Ajk^{zn-1,j · znk}

• Note that exponent zn-1,j · znk is a product that evaluates to 0 or 1
• Hence the overall product evaluates to a single Ajk for each setting of values of zn and zn-1
• E.g., zn-1=2 and zn=3 give zn-1,2=1 and zn,3=1; thus p(zn=3|zn-1=2) = A23
• A is a global HMM parameter
Initial Variable Probabilities
• Initial latent node z1 is a special case without parent node
• Represented by vector of probabilities π with elements πk = p(z1k=1), so that

  p(z1|π) = Π_{k=1}^K πk^{z1k}, where Σk πk = 1

• Note that π is an HMM parameter representing probabilities of each state for the first variable
Lattice or Trellis Diagram
• State transition diagram unfolded over time
• Representation of latent variable states
• Each column corresponds to one of the latent variables zn
Emission Probabilities p(xn|zn)
• We have so far only specified p(zn|zn-1) by means of transition probabilities
• Need to specify probabilities p(xn|zn, φ) to complete the model, where φ are parameters
• These can be continuous or discrete
• Because xn is observed and zn is discrete, p(xn|zn, φ) consists of a table of K numbers corresponding to K states of zn
• Analogous to class-conditional probabilities
• Can be represented as

  p(xn|zn, φ) = Π_{k=1}^K p(xn|φk)^{znk}
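The product form above just selects the density of the active state, because the one-of-K exponent znk is 1 for exactly one k. A small numeric check, with illustrative values for the K per-state densities p(xn|φk):

```python
import numpy as np

K = 3
p_x_given_phi = np.array([0.2, 0.5, 0.9])  # assumed values of p(xn | phi_k) for the K states
zn = np.eye(K)[2]                          # zn in state k=3 -> one-of-K vector [0, 0, 1]

p_product = np.prod(p_x_given_phi ** zn)   # Π_k p(xn|phi_k)^{znk}
p_lookup = p_x_given_phi[2]                # equivalent direct table lookup for the active state
```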
3. HMM Parameters
• We have defined three types of HMM parameters: θ = (π, A, φ)
1. Initial Probabilities of first latent variable: π is a vector of K probabilities of the states for latent variable z1
2. Transition Probabilities (state-to-state for any latent variable): A is a K x K matrix of transition probabilities Ajk
3. Emission Probabilities (observations conditioned on latent): φ are parameters of conditional distribution p(xn|zn)
• A and π parameters are often initialized uniformly
• Initialization of φ depends on form of distribution
Joint Distribution over Latent and Observed Variables
• Joint can be expressed in terms of parameters:

  p(X, Z|θ) = p(z1|π) [ Π_{n=2}^N p(zn|zn-1, A) ] Π_{m=1}^N p(xm|zm, φ)
  where X = {x1,..,xN}, Z = {z1,..,zN}, θ = {π, A, φ}

• Most discussion of HMM is independent of emission probabilities
• Tractable for discrete tables, Gaussian, GMMs
4. Generative View of HMM
• Sampling from an HMM
• HMM with 3-state latent variable z
• Gaussian emission model p(x|z)
• Contours of constant density of emission distribution shown for each state
• Two-dimensional x
• 50 data points generated from HMM; lines show successive observations
• Transition probabilities fixed so that there is a 5% probability of making a transition to each of the other states and a 90% probability of remaining in the same state
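The sampling procedure above (ancestral sampling: draw z1 from π, each zn from the row of A selected by zn-1, and each xn from the state's Gaussian) can be sketched as follows. The means and the shared spherical covariance are illustrative assumptions; the slides only specify the 90%/5% transition structure.

```python
import numpy as np

def sample_hmm(pi, A, means, cov, N, rng):
    """Ancestral sampling from an HMM with Gaussian emissions p(x|z) = N(x|mu_k, cov).

    means[k] is the mean of state k's 2-D Gaussian; a shared covariance is an
    assumption made here for simplicity.
    """
    K = len(pi)
    z = np.empty(N, dtype=int)
    x = np.empty((N, means.shape[1]))
    for n in range(N):
        if n == 0:
            z[n] = rng.choice(K, p=pi)              # z1 ~ pi
        else:
            z[n] = rng.choice(K, p=A[z[n - 1]])     # zn ~ p(zn | zn-1)
        x[n] = rng.multivariate_normal(means[z[n]], cov)  # xn ~ p(x | zn)
    return z, x

# 3 states; 90% self-transition, 5% to each of the other two states (as in the slide)
A = np.full((3, 3), 0.05) + 0.85 * np.eye(3)
pi = np.full(3, 1.0 / 3)
means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
rng = np.random.default_rng(0)
z, x = sample_hmm(pi, A, means, 0.1 * np.eye(2), 50, rng)
```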
Left-to-Right HMM
• Setting elements Ajk = 0 if k < j
• Corresponding lattice diagram
Left-to-Right Applied Generatively to Digits
• Examples of on-line handwritten 2's
• x is a sequence of pen coordinates
• There are 16 states for z, i.e., K=16
• Each state can generate a line segment of fixed length in one of 16 angles
• Emission probabilities: 16 x 16 table
• Transition probabilities set to zero except for those that keep state index k the same or increment it by one
• Parameters optimized by 25 EM iterations
• Trained on 45 digits
• Generative model is quite poor, since generated samples don't look like the training examples
• If classification is goal, can do better by using a discriminative HMM
[Figures: Training Set, Generated Set]
5. Determining HMM Parameters
• Given data set X = {x1,..,xN}, we can determine HMM parameters θ = {π, A, φ} using maximum likelihood
• Likelihood function obtained from joint distribution by marginalizing over latent variables Z = {z1,..,zN}:

  p(X|θ) = ΣZ p(X,Z|θ)
Computational Issues for Parameters
p(X|θ) = ΣZ p(X,Z|θ)
• Computational difficulties
  • Joint distribution p(X,Z|θ) does not factorize over n, in contrast to mixture model
  • Z has exponential number of terms corresponding to trellis
• Solution
  • Use conditional independence properties to reorder summations
  • Use EM instead to maximize log-likelihood function of joint ln p(X,Z|θ)
• Efficient framework for maximizing the likelihood function in HMMs
EM for MLE in HMM
1. Start with initial selection for model parameters θold
2. In E step, take these parameter values and find posterior distribution of latent variables p(Z|X,θold). Use this posterior distribution to evaluate the expectation of the logarithm of the complete-data likelihood function ln p(X,Z|θ), which can be written as

   Q(θ, θold) = ΣZ p(Z|X,θold) ln p(X,Z|θ)

   where the posterior factor p(Z|X,θold), independent of θ, is evaluated
3. In M-step, maximize Q w.r.t. θ
Expansion of Q
• Introduce notation:
  γ(zn) = p(zn|X,θold): marginal posterior distribution of latent variable zn
  ξ(zn-1, zn) = p(zn-1, zn|X,θold): joint posterior of two successive latent variables
• We will be re-expressing
  Q(θ, θold) = ΣZ p(Z|X,θold) ln p(X,Z|θ)
  in terms of γ and ξ
Detail of γ and ξ
• For each value of n we can store
  γ(zn) using K non-negative numbers that sum to unity
  ξ(zn-1, zn) using a K x K matrix whose elements also sum to unity
• Using notation, γ(znk) denotes the conditional probability of znk=1, with similar notation for ξ(zn-1,j, znk)
• Because the expectation of a binary random variable is the probability that it takes value 1:
  γ(znk) = E[znk] = Σ_{zn} γ(zn) znk
  ξ(zn-1,j, znk) = E[zn-1,j znk] = Σ_{zn-1,zn} ξ(zn-1, zn) zn-1,j znk
Expansion of Q
• We begin with
  Q(θ, θold) = ΣZ p(Z|X,θold) ln p(X,Z|θ)
• Substitute
  p(X,Z|θ) = p(z1|π) [ Π_{n=2}^N p(zn|zn-1, A) ] Π_{m=1}^N p(xm|zm, φ)
• And use definitions of γ and ξ to get:

  Q(θ, θold) = Σ_{k=1}^K γ(z1k) ln πk + Σ_{n=2}^N Σ_{j=1}^K Σ_{k=1}^K ξ(zn-1,j, znk) ln Ajk + Σ_{n=1}^N Σ_{k=1}^K γ(znk) ln p(xn|φk)
E-Step
• Goal of E step is to evaluate γ(zn) and ξ(zn-1, zn) efficiently (Forward-Backward Algorithm)
• They are the posteriors appearing in
  Q(θ, θold) = Σ_{k=1}^K γ(z1k) ln πk + Σ_{n=2}^N Σ_{j=1}^K Σ_{k=1}^K ξ(zn-1,j, znk) ln Ajk + Σ_{n=1}^N Σ_{k=1}^K γ(znk) ln p(xn|φk)
M-Step
• Maximize Q(θ, θold) with respect to parameters θ = {π, A, φ}
• Treat γ(zn) and ξ(zn-1, zn) as constant
• Maximization w.r.t. π and A easily achieved (using Lagrange multipliers):

  πk = γ(z1k) / Σ_{j=1}^K γ(z1j)

  Ajk = Σ_{n=2}^N ξ(zn-1,j, znk) / Σ_{l=1}^K Σ_{n=2}^N ξ(zn-1,j, znl)

• Maximization w.r.t. φk
  • Only last term of Q depends on φk:
    Σ_{n=1}^N Σ_{k=1}^K γ(znk) ln p(xn|φk)
  • Same form as in mixture distribution for i.i.d. data
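The π and A updates above are just normalized sums of the E-step posteriors. A minimal sketch, with toy γ and ξ arrays whose numbers are illustrative:

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates for pi and A given E-step posteriors.

    gamma: (N, K) array, gamma[n, k] = gamma(znk)
    xi:    (N-1, K, K) array, xi[n, j, k] = xi over the transition n -> n+1
    """
    pi_new = gamma[0] / gamma[0].sum()                # pi_k = gamma(z1k) / sum_j gamma(z1j)
    A_num = xi.sum(axis=0)                            # sum over n of xi(zn-1,j, znk)
    A_new = A_num / A_num.sum(axis=1, keepdims=True)  # normalize over k for each j
    return pi_new, A_new

# Toy posteriors (illustrative values only)
gamma = np.array([[0.7, 0.3],
                  [0.4, 0.6],
                  [0.5, 0.5]])
xi = np.array([[[0.3, 0.4], [0.1, 0.2]],
               [[0.2, 0.2], [0.3, 0.3]]])
pi_new, A_new = m_step_pi_A(gamma, xi)
```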
M-Step for Gaussian Emission
• Maximization of Q(θ, θold) w.r.t. φk
• Gaussian emission densities p(x|φk) ~ N(x|μk, Σk)
• Solution:

  μk = Σ_{n=1}^N γ(znk) xn / Σ_{n=1}^N γ(znk)

  Σk = Σ_{n=1}^N γ(znk)(xn − μk)(xn − μk)^T / Σ_{n=1}^N γ(znk)
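These are weighted-mean and weighted-covariance updates. A sketch, using uniform responsibilities on random data (an assumption chosen so the result is easy to check against the plain sample statistics):

```python
import numpy as np

def m_step_gaussian(gamma, X):
    """Gaussian-emission M-step: weighted means and covariances.

    gamma: (N, K) responsibilities; X: (N, D) observations.
    Returns means (K, D) and covariances (K, D, D).
    """
    N, K = gamma.shape
    D = X.shape[1]
    Nk = gamma.sum(axis=0)                      # effective count per state
    mu = (gamma.T @ X) / Nk[:, None]            # mu_k = sum_n gamma(znk) x_n / sum_n gamma(znk)
    sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return mu, sigma

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
gamma = np.full((100, 2), 0.5)                  # uniform responsibilities for the check
mu, sigma = m_step_gaussian(gamma, X)
```

With uniform responsibilities both states recover the overall sample mean and (biased) sample covariance, which is the sanity check used below.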
M-Step for Multinomial Observed
• Conditional distribution of observations has the form

  p(x|z) = Π_{i=1}^D Π_{k=1}^K μik^{xi zk}

• M-step equations:

  μik = Σ_{n=1}^N γ(znk) xni / Σ_{n=1}^N γ(znk)

• Analogous result holds for Bernoulli observed variables
6. Forward-Backward Algorithm
• E step: efficient procedure to evaluate γ(zn) and ξ(zn-1, zn)
• Graph of HMM is a tree
• Implies that posterior distribution of latent variables can be obtained efficiently using a message passing algorithm
• In HMM it is called forward-backward algorithm or Baum-Welch algorithm
• Several variants lead to exact marginals
• Method called alpha-beta discussed here
Derivation of Forward-Backward
• Several conditional independences (A-H) hold
A. p(X|zn) = p(x1,..,xn|zn) p(xn+1,..,xN|zn)
• Proved using d-separation: path from x1 to xn-1 passes through zn, which is observed. Path is head-to-tail. Thus (x1,..,xn-1) ⊥ xn | zn. Similarly (x1,..,xn-1, xn) ⊥ (xn+1,..,xN) | zn
Conditional Independence B
• Since (x1,..,xn-1) ⊥ xn | zn, we have
B. p(x1,..,xn-1|xn, zn) = p(x1,..,xn-1|zn)
Conditional Independence C
• Since (x1,..,xn-1) ⊥ zn | zn-1, we have
C. p(x1,..,xn-1|zn-1, zn) = p(x1,..,xn-1|zn-1)
Conditional Independence D
• Since (xn+1,..,xN) ⊥ zn | zn+1, we have
D. p(xn+1,..,xN|zn, zn+1) = p(xn+1,..,xN|zn+1)
Conditional Independence E
• Since (xn+2,..,xN) ⊥ xn+1 | zn+1, we have
E. p(xn+2,..,xN|zn+1, xn+1) = p(xn+2,..,xN|zn+1)
Conditional Independence F
F. p(X|zn-1, zn) = p(x1,..,xn-1|zn-1) p(xn|zn) p(xn+1,..,xN|zn)
Conditional Independence G
• Since (x1,..,xN) ⊥ xN+1 | zN+1, we have
G. p(xN+1|X, zN+1) = p(xN+1|zN+1)
Conditional Independence H
H. p(zN+1|zN, X) = p(zN+1|zN)
Evaluation of γ(zn)
• Recall that this is to efficiently compute the E step of estimating parameters of HMM:
  γ(zn) = p(zn|X,θold): marginal posterior distribution of latent variable zn
• We are interested in finding posterior distribution p(zn|x1,..,xN)
• This is a vector of length K whose entries correspond to expected values of znk
Introducing α and β
• Using Bayes theorem:
  γ(zn) = p(zn|X) = p(X|zn) p(zn) / p(X)
• Using conditional independence A:

  γ(zn) = p(x1,..,xn|zn) p(xn+1,..,xN|zn) p(zn) / p(X)
        = p(x1,..,xn, zn) p(xn+1,..,xN|zn) / p(X)
        = α(zn) β(zn) / p(X)

• where
  α(zn) = p(x1,..,xn, zn), the probability of observing all given data up to time n and the value of zn
  β(zn) = p(xn+1,..,xN|zn), the conditional probability of all future data from time n+1 up to N given the value of zn
Recursion Relation for α

α(zn) = p(x1,..,xn, zn)
      = p(x1,..,xn|zn) p(zn)                                       by Bayes rule
      = p(xn|zn) p(x1,..,xn-1|zn) p(zn)                            by conditional independence B
      = p(xn|zn) p(x1,..,xn-1, zn)                                 by Bayes rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn-1, zn)                  by Sum Rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn|zn-1) p(zn-1)           by Bayes rule
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1|zn-1) p(zn|zn-1) p(zn-1)    by cond. ind. C
      = p(xn|zn) Σ_{zn-1} p(x1,..,xn-1, zn-1) p(zn|zn-1)           by Bayes rule
      = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)                       by definition of α
Forward Recursion for α Evaluation
• Recursion relation is

  α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)

• There are K terms in the summation
• Has to be evaluated for each of K values of zn
• Each step of recursion is O(K²)
• Initial condition is

  α(z1) = p(x1, z1) = p(z1) p(x1|z1) = Π_{k=1}^K { πk p(x1|φk) }^{z1k}

• Overall cost for the chain is O(K²N)
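The forward recursion above can be sketched in a few lines. This is an unscaled version for illustration (scaling is treated later in 7(d)); the discrete emission table B and all numbers are assumptions for the example.

```python
import numpy as np

def forward(pi, A, B, x):
    """Unscaled forward recursion: alpha[n, k] = p(x1..xn, zn=k)."""
    N, K = len(x), len(pi)
    alpha = np.empty((N, K))
    alpha[0] = pi * B[:, x[0]]                      # alpha(z1) = p(z1) p(x1|z1)
    for n in range(1, N):
        # alpha(zn) = p(xn|zn) * sum_{zn-1} alpha(zn-1) p(zn|zn-1)
        alpha[n] = B[:, x[n]] * (alpha[n - 1] @ A)  # O(K^2) per step
    return alpha

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
alpha = forward(pi, A, B, [0, 1, 0])
likelihood = alpha[-1].sum()                        # p(X) = sum_{zN} alpha(zN)
```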
Recursion Relation for β

β(zn) = p(xn+1,..,xN|zn)
      = Σ_{zn+1} p(xn+1,..,xN, zn+1|zn)                            by Sum Rule
      = Σ_{zn+1} p(xn+1,..,xN|zn, zn+1) p(zn+1|zn)                 by Bayes rule
      = Σ_{zn+1} p(xn+1,..,xN|zn+1) p(zn+1|zn)                     by cond. ind. D
      = Σ_{zn+1} p(xn+2,..,xN|zn+1) p(xn+1|zn+1) p(zn+1|zn)        by cond. ind. E
      = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)                   by definition of β
Backward Recursion for β• Backward message passing
• Evaluates β(zn
) in terms of β(zn+1
)
• Starting condition for recursion is
• Is correct provided we set β(zN
) =1 for all settings of zN• This is the initial condition for backward
computation
)X()z()z,X()X|z(
ppp NN
Nβ
=
1
1 1 1(z ) (z ) x |z ) z |z ) n
n n n n n nz
p( p(β β+
+ + += ∑
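A sketch of the backward pass, combined with a forward pass to form γ(zn) = α(zn)β(zn)/p(X). The model numbers are the same illustrative assumptions as before; a useful invariant is that Σ_{zn} α(zn)β(zn) equals p(X) at every n.

```python
import numpy as np

def backward(A, B, x):
    """Unscaled backward recursion: beta[n, k] = p(x_{n+1}..x_N | zn=k)."""
    N, K = len(x), A.shape[0]
    beta = np.empty((N, K))
    beta[-1] = 1.0                                     # beta(zN) = 1 for all states
    for n in range(N - 2, -1, -1):
        # beta(zn) = sum_{zn+1} beta(zn+1) p(xn+1|zn+1) p(zn+1|zn)
        beta[n] = A @ (B[:, x[n + 1]] * beta[n + 1])
    return beta

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
x = [0, 1, 0]
beta = backward(A, B, x)

# Forward pass, then gamma(zn) = alpha(zn) beta(zn) / p(X)
alpha = np.empty((3, 2))
alpha[0] = pi * B[:, x[0]]
for n in range(1, 3):
    alpha[n] = B[:, x[n]] * (alpha[n - 1] @ A)
pX = alpha[-1].sum()
gamma = alpha * beta / pX
```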
M-Step Equations
• In the M-step equations p(X) will cancel out, e.g.,

  μk = Σ_{n=1}^N γ(znk) xn / Σ_{n=1}^N γ(znk)

• The likelihood itself is obtained as

  p(X) = Σ_{zn} α(zn) β(zn)
Evaluation of Quantities ξ(zn-1, zn)
• They correspond to the values of the conditional probabilities p(zn-1, zn|X) for each of the K x K settings for (zn-1, zn)
• Thus we calculate ξ(zn-1, zn) directly by using results of the α and β recursions:

ξ(zn-1, zn) = p(zn-1, zn|X)                                                          by definition
            = p(x1,..,xN|zn-1, zn) p(zn-1, zn) / p(X)                                by Bayes rule
            = p(x1,..,xn-1|zn-1) p(xn|zn) p(xn+1,..,xN|zn) p(zn|zn-1) p(zn-1) / p(X) by cond. ind. F
            = α(zn-1) p(xn|zn) p(zn|zn-1) β(zn) / p(X)                               by definition of α and β
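The final expression for ξ is an outer product over (j, k) of forward and backward quantities. A sketch on the same illustrative 2-state model; two checks follow from the definitions: each ξ matrix sums to one, and marginalizing ξ over zn-1 recovers γ(zn).

```python
import numpy as np

# Small forward and backward passes (illustrative model, as in earlier sketches)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
x = [0, 1, 0]
N, K = len(x), len(pi)

alpha = np.empty((N, K))
beta = np.empty((N, K))
alpha[0] = pi * B[:, x[0]]
for n in range(1, N):
    alpha[n] = B[:, x[n]] * (alpha[n - 1] @ A)
beta[-1] = 1.0
for n in range(N - 2, -1, -1):
    beta[n] = A @ (B[:, x[n + 1]] * beta[n + 1])
pX = alpha[-1].sum()

# xi(zn-1, zn) = alpha(zn-1) p(xn|zn) p(zn|zn-1) beta(zn) / p(X)
xi = np.empty((N - 1, K, K))
for n in range(1, N):
    xi[n - 1] = alpha[n - 1][:, None] * A * (B[:, x[n]] * beta[n])[None, :] / pX
```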
Summary of EM to Train HMM
Step 1: Initialization
• Make an initial selection of parameters θold, where θ = (π, A, φ)
1. π is a vector of K probabilities of the states for latent variable z1
2. A is a K x K matrix of transition probabilities Ajk
3. φ are parameters of conditional distribution p(xn|zn)
• A and π parameters are often initialized uniformly
• Initialization of φ depends on form of distribution
• For Gaussian: parameters μk initialized by applying K-means to the data; Σk corresponds to covariance matrix of cluster
Summary of EM to Train HMM
Step 2: E Step
• Run both forward α recursion and backward β recursion
• Use results to evaluate γ(zn) and ξ(zn-1, zn) and the likelihood function
Step 3: M Step
• Use results of E step to find revised set of parameters θnew using M-step equations
• Alternate between E and M until convergence of likelihood function
Values for p(xn|zn)
• In recursion relations, observations enter through conditional distributions p(xn|zn)
• Recursions are independent of
  • Dimensionality of observed variables
  • Form of conditional distribution
• So long as it can be computed for each of K possible states of zn
• Since observed variables {xn} are fixed, they can be pre-computed at the start of the EM algorithm
7(a) Sequence Length: Using Multiple Short Sequences
• HMM can be trained effectively only if length of sequence is sufficiently long
  • True of all maximum likelihood approaches
• Alternatively we can use multiple short sequences
  • Requires straightforward modification of HMM-EM algorithm
• Particularly important in left-to-right models
  • In a given observation sequence, a given state transition corresponding to a non-diagonal element of A occurs only once
7(b). Predictive Distribution
• Observed data is X = {x1,..,xN}
• Wish to predict xN+1
• Application in financial forecasting
• Can be evaluated by first running forward α recursion and summing over zN and zN+1
• Can be extended to subsequent predictions of xN+2, after xN+1 is observed, using a fixed amount of storage

p(xN+1|X) = Σ_{zN+1} p(xN+1, zN+1|X)                                   by Sum Rule
          = Σ_{zN+1} p(xN+1|zN+1) p(zN+1|X)                            by Product Rule and cond. ind. G
          = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1, zN|X)                 by Sum Rule
          = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) p(zN|X)            by cond. ind. H
          = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) p(zN, X)/p(X)      by Bayes rule
          = (1/p(X)) Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) α(zN)     by definition of α
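For discrete emissions the derivation above reduces to two matrix-vector products after the forward pass: propagate α(zN) one step through A, normalize, and push through the emission table. A sketch with the same illustrative 2-state model used earlier:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])  # assumed discrete emission table
x = [0, 1, 0]

# Forward recursion keeping only the last alpha (fixed storage)
alpha = pi * B[:, x[0]]
for obs in x[1:]:
    alpha = B[:, obs] * (alpha @ A)
pX = alpha.sum()

# p(z_{N+1} | X) = (1/p(X)) sum_{zN} p(z_{N+1}|zN) alpha(zN)
pred_state = (alpha @ A) / pX
# p(x_{N+1} = s | X) = sum_{z_{N+1}} p(x_{N+1}=s | z_{N+1}) p(z_{N+1} | X)
pred_x = pred_state @ B
```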
7(c). Sum-Product and HMM
• HMM graph is a tree and hence sum-product algorithm can be used to find local marginals for hidden variables
• Equivalent to forward-backward algorithm
• Sum-product provides a simple way to derive alpha-beta recursion formulae
• Transform directed graph to factor graph
• Each variable has a node; small squares represent factors; undirected links connect factor nodes to the variables they use
• Joint distribution:

  p(x1,..,xN, z1,..,zN) = p(z1) [ Π_{n=2}^N p(zn|zn-1) ] Π_{n=1}^N p(xn|zn)

[Figures: HMM graph and fragment of factor graph]
Deriving Alpha-Beta from Sum-Product
• Begin with simplified form of factor graph, absorbing emission probabilities into transition probability factors
• Factors are given by

  h(z1) = p(z1) p(x1|z1)
  fn(zn-1, zn) = p(zn|zn-1) p(xn|zn)

• Messages propagated are

  μ_{zn-1 → fn}(zn-1) = μ_{fn-1 → zn-1}(zn-1)
  μ_{fn → zn}(zn) = Σ_{zn-1} fn(zn-1, zn) μ_{zn-1 → fn}(zn-1)

• Can show that the α recursion is computed:

  α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)

• Similarly, starting with the root node, the β recursion is computed:

  β(zn) = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)

• So also γ and ξ are derived. Final results:

  γ(zn) = α(zn) β(zn) / p(X)
  ξ(zn-1, zn) = α(zn-1) p(xn|zn) p(zn|zn-1) β(zn) / p(X)
7(d). Scaling Factors
• Implementation issue for small probabilities
• At each step of the recursion

  α(zn) = p(xn|zn) Σ_{zn-1} α(zn-1) p(zn|zn-1)

  to obtain the new value of α(zn) from the previous value α(zn-1) we multiply by p(zn|zn-1) and p(xn|zn)
• These probabilities are small and products will underflow
• Logs don't help since we have sums of products
• Solution is to work with rescaled versions of α(zn) and β(zn) whose values remain close to unity
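A sketch of the rescaling idea: normalize the forward message at each step so it stays close to unity, and keep the per-step normalizers c[n], whose logs accumulate into ln p(X). The model numbers are the same illustrative assumptions as in the earlier sketches.

```python
import numpy as np

def forward_scaled(pi, A, B, x):
    """Scaled forward recursion: alpha_hat[n] = p(zn | x1..xn) stays O(1).

    Returns the scaled messages and scaling factors c, with p(X) = prod_n c[n],
    so ln p(X) = sum_n ln c[n] can be computed without underflow.
    """
    N, K = len(x), len(pi)
    alpha_hat = np.empty((N, K))
    c = np.empty(N)
    a = pi * B[:, x[0]]
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = B[:, x[n]] * (alpha_hat[n - 1] @ A)  # same recursion, on normalized messages
        c[n] = a.sum()                           # scaling factor for step n
        alpha_hat[n] = a / c[n]
    return alpha_hat, c

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
alpha_hat, c = forward_scaled(pi, A, B, [0, 1, 0])
log_lik = np.log(c).sum()                        # ln p(X), no underflow
```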
7(e). The Viterbi Algorithm
• Finding most probable sequence of hidden states for a given sequence of observables
• In speech recognition: finding most probable phoneme sequence for a given series of acoustic observations
• Since graphical model of HMM is a tree, can be solved exactly using max-sum algorithm
• Known as Viterbi algorithm in the context of HMM
• Since max-sum works with log probabilities, no need to work with re-scaled variables as with forward-backward
Viterbi Algorithm for HMM
• Fragment of HMM lattice showing two paths
• Number of possible paths grows exponentially with length of chain
• Viterbi searches space of paths efficiently
• Finds most probable path with computational cost linear with length of chain
Deriving Viterbi from Max-Sum
• Start with simplified factor graph
• Treat variable zN as root node, passing messages to root from leaf nodes
• Messages passed are

  μ_{zn → fn+1}(zn) = μ_{fn → zn}(zn)
  μ_{fn+1 → zn+1}(zn+1) = max_{zn} { ln fn+1(zn, zn+1) + μ_{zn → fn+1}(zn) }
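The max-sum messages above, plus back-pointers for recovering the path, can be sketched as a standard log-space Viterbi. The discrete emission table and all numbers are illustrative assumptions.

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Max-sum (Viterbi) in log space: most probable state path for observations x."""
    N, K = len(x), len(pi)
    logA = np.log(A)
    omega = np.empty((N, K))            # omega[n, k]: best log-prob of paths ending in state k
    back = np.empty((N, K), dtype=int)  # back-pointers for path recovery
    omega[0] = np.log(pi) + np.log(B[:, x[0]])
    for n in range(1, N):
        scores = omega[n - 1][:, None] + logA  # scores[j, k]: arrive in k from j
        back[n] = scores.argmax(axis=0)        # best predecessor for each state k
        omega[n] = scores.max(axis=0) + np.log(B[:, x[n]])
    path = np.empty(N, dtype=int)
    path[-1] = omega[-1].argmax()
    for n in range(N - 2, -1, -1):      # backtrack from the best final state
        path[n] = back[n + 1][path[n + 1]]
    return path, omega[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
path, log_prob = viterbi(pi, A, B, [0, 0, 1, 1])
```

The cost is O(K²N), linear in the chain length, in contrast to the exponential number of paths.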
Other Topics on Sequential Data
• Sequential Data and Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-MarkovModels.pdf
• Extensions of HMMs: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-HMMExtensions.pdf
• Linear Dynamical Systems: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-LinearDynamicalSystems.pdf
• Conditional Random Fields: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.5-ConditionalRandomFields.pdf