
Chapter 11. Stochastic Methods Rooted in Statistical Mechanics

Neural Networks and Learning Machines (Haykin)

Lecture Notes on Self-learning Neural Algorithms

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University

Version: 20170926 → 20170928 → 20171011

Contents
11.1 Introduction
11.2 Statistical Mechanics
11.3 Markov Chains
11.4 Metropolis Algorithm
11.5 Simulated Annealing
11.6 Gibbs Sampling
11.7 Boltzmann Machine
11.8 Logistic Belief Nets
11.9 Deep Belief Nets
11.10 Deterministic Annealing (DA)
11.11 Analogy of DA with EM
Summary and Discussion

11.1 Introduction

• Statistical mechanics as a source of ideas for unsupervised (self-organized) learning systems

• Statistical mechanics
  - The formal study of macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
  - The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
  - The concept of entropy plays a vital role in statistical mechanics, as in Shannon's information theory.
  - The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.

• Statistical mechanics for the study of neural networks
  - Cragg and Temperley (1954) and Cowan (1968)
  - Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)

11.2 Statistical Mechanics (1/2)

$p_i$: probability of occurrence of state $i$ of a stochastic system, with $p_i \ge 0$ for all $i$ and $\sum_i p_i = 1$.
$E_i$: energy of the system when it is in state $i$.

In thermal equilibrium, the probability of state $i$ is given by the canonical distribution (Gibbs distribution):
$$p_i = \frac{1}{Z}\exp\left(-\frac{E_i}{k_B T}\right), \qquad Z = \sum_i \exp\left(-\frac{E_i}{k_B T}\right)$$
where $\exp(-E/k_B T)$ is the Boltzmann factor and $Z$, the sum over states, is the partition function. We set $k_B = 1$ and view $-\log p_i$ as "energy."

1. States of low energy have a higher probability of occurrence than states of high energy.
2. As the temperature $T$ is reduced, the probability is concentrated on a smaller subset of low-energy states.

11.2 Statistical Mechanics (2/2)

Helmholtz free energy: $F = -T\log Z$
Average energy: $\langle E \rangle = \sum_i p_i E_i$
$$\langle E \rangle - F = -T\sum_i p_i \log p_i$$
Entropy: $H = -\sum_i p_i \log p_i$

Thus, we have
$$\langle E \rangle - F = TH, \qquad F = \langle E \rangle - TH$$

Consider two systems $A$ and $A'$ in thermal contact, with entropy changes $\Delta H$ and $\Delta H'$. The total entropy tends to increase, with
$$\Delta H + \Delta H' \ge 0$$

The free energy of the system, $F$, tends to decrease and become a minimum in an equilibrium situation, and the resulting probability distribution is defined by the Gibbs distribution (the principle of minimal free energy). Nature likes to find a physical system with minimum free energy.
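A minimal numerical sketch of the last two slides: the four state energies below are illustrative assumptions, not values from the text. The code evaluates the Gibbs distribution at a high and a low temperature, showing the concentration of probability on low-energy states as T falls, and checks the identity F = <E> - TH = -T log Z.

import numpy as np

E = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical state energies (k_B = 1)

for T in (2.0, 0.5):
    boltzmann = np.exp(-E / T)       # Boltzmann factors exp(-E_i / T)
    Z = boltzmann.sum()              # partition function
    p = boltzmann / Z                # Gibbs distribution
    avg_E = (p * E).sum()            # <E>
    H = -(p * np.log(p)).sum()       # entropy
    F = -T * np.log(Z)               # Helmholtz free energy
    print(T, p.round(3), np.isclose(F, avg_E - T * H))   # True at both T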

11.3 Markov Chains (1/9)

Markov property:
$$P(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = P(X_{n+1} = x_{n+1} \mid X_n = x_n)$$

Transition probability from state $i$ at time $n$ to state $j$ at time $n+1$:
$$p_{ij} = P(X_{n+1} = j \mid X_n = i), \qquad p_{ij} \ge 0 \;\; \forall i,j \quad\text{and}\quad \sum_j p_{ij} = 1 \;\; \forall i$$

If the transition probabilities are fixed, the Markov chain is homogeneous. In the case of a system with a finite number of possible states $K$, the transition probabilities constitute a $K$-by-$K$ matrix (stochastic matrix):
$$\mathbf{P} = \begin{pmatrix} p_{11} & \cdots & p_{1K} \\ \vdots & \ddots & \vdots \\ p_{K1} & \cdots & p_{KK} \end{pmatrix}$$

11.3 Markov Chains (2/9)

Generalization to the $m$-step transition probability:
$$p_{ij}^{(m)} = P(X_{n+m} = x_j \mid X_n = x_i), \qquad m = 1, 2, \ldots$$
$$p_{ij}^{(m+1)} = \sum_k p_{ik}^{(m)} p_{kj}, \qquad m = 1, 2, \ldots, \qquad p_{ik}^{(1)} = p_{ik}$$

We can further generalize to the Chapman-Kolmogorov identity:
$$p_{ij}^{(m+n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n)}, \qquad m, n = 1, 2, \ldots$$

11.3 Markov Chains (3/9)

Properties of Markov chains:

Recurrent: the probability of ever returning to state $i$ equals 1.
Transient: the probability of ever returning to state $i$ is less than 1.
Periodic (with period $d$): the states can be partitioned into $d$ disjoint subsets $S_1, \ldots, S_d$ such that if $i \in S_k$ and $p_{ij} > 0$, then
$$j \in S_{k+1} \text{ for } k = 1, \ldots, d-1, \qquad j \in S_1 \text{ for } k = d$$
Aperiodic: not periodic ($d = 1$).
Accessible: state $j$ is accessible from $i$ if there is a finite sequence of transitions from $i$ to $j$.
Communicate: states $i$ and $j$ communicate if they are accessible from each other. If two states communicate with each other, they belong to the same class. If all the states form a single class, the Markov chain is indecomposable (irreducible).

11.3 Markov Chains (4/9)

Figure 11.1: A periodic recurrent Markov chain with d = 3.

11.3 Markov Chains (5/9)

Ergodic Markov chains. Ergodicity: time average = ensemble average; i.e., the long-term proportion of time spent by the chain in state $i$ corresponds to the steady-state probability $\pi_i$.

$v_i(k)$: proportion of time spent in state $i$ after $k$ returns, where $T_i(\ell)$ is the $\ell$th return time:
$$v_i(k) = \frac{k}{\sum_{\ell=1}^{k} T_i(\ell)}$$
$$\lim_{k\to\infty} v_i(k) = \pi_i, \qquad i = 1, 2, \ldots, K$$

11.3 Markov Chains (6/9)

Convergence to stationary distributions. Consider an ergodic Markov chain with a stochastic matrix $\mathbf{P}$, and let $\boldsymbol{\pi}^{(n-1)}$ be the state-transition vector of the chain at time $n-1$. The state-transition vector at time $n$ is
$$\boldsymbol{\pi}^{(n)} = \boldsymbol{\pi}^{(n-1)}\mathbf{P}$$
By iteration we obtain
$$\boldsymbol{\pi}^{(n)} = \boldsymbol{\pi}^{(n-1)}\mathbf{P} = \boldsymbol{\pi}^{(n-2)}\mathbf{P}^2 = \boldsymbol{\pi}^{(n-3)}\mathbf{P}^3 = \cdots = \boldsymbol{\pi}^{(0)}\mathbf{P}^n$$
where $\boldsymbol{\pi}^{(0)}$ is the initial value, and
$$\lim_{n\to\infty} \mathbf{P}^n = \begin{pmatrix} \pi_1 & \cdots & \pi_K \\ \vdots & \ddots & \vdots \\ \pi_1 & \cdots & \pi_K \end{pmatrix} = \begin{pmatrix} \boldsymbol{\pi} \\ \vdots \\ \boldsymbol{\pi} \end{pmatrix}$$

Ergodic theorem:
1. $\lim_{n\to\infty} p_{ij}^{(n)} = \pi_j$ for all $i$
2. $\pi_j > 0$ for all $j$
3. $\sum_{j=1}^{K} \pi_j = 1$
4. $\pi_j = \sum_{i=1}^{K} \pi_i p_{ij}$ for $j = 1, 2, \ldots, K$

11.3 Markov Chains (6/9, continued)

Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states x1 and x2 may be identified as up-to-date and behind, respectively.

$$\mathbf{P} = \begin{pmatrix} \tfrac{1}{4} & \tfrac{3}{4} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}, \qquad \boldsymbol{\pi}^{(0)} = \begin{pmatrix} \tfrac{1}{6} & \tfrac{5}{6} \end{pmatrix}$$

$$\boldsymbol{\pi}^{(1)} = \boldsymbol{\pi}^{(0)}\mathbf{P} = \begin{pmatrix} \tfrac{1}{6} & \tfrac{5}{6} \end{pmatrix}\begin{pmatrix} \tfrac{1}{4} & \tfrac{3}{4} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix} = \begin{pmatrix} \tfrac{11}{24} & \tfrac{13}{24} \end{pmatrix}$$

$$\mathbf{P}^{(2)} = \begin{pmatrix} 0.4375 & 0.5625 \\ 0.3750 & 0.6250 \end{pmatrix}, \quad \mathbf{P}^{(3)} = \begin{pmatrix} 0.4001 & 0.5999 \\ 0.3999 & 0.6001 \end{pmatrix}, \quad \mathbf{P}^{(4)} = \begin{pmatrix} 0.4000 & 0.6000 \\ 0.4000 & 0.6000 \end{pmatrix}$$
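Example 1 can be checked with a few lines of NumPy; the sketch below reproduces pi(1) and the convergence of the powers of P (the variable names are ours).

import numpy as np

P = np.array([[1/4, 3/4],
              [1/2, 1/2]])
pi0 = np.array([1/6, 5/6])

print(pi0 @ P)                        # pi(1) = [11/24, 13/24] ~ [0.458, 0.542]
print(np.linalg.matrix_power(P, 4))   # rows approach pi = [0.4, 0.6]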

11.3 Markov Chains (7/9)

Figure 11.3: State-transition diagram of the Markov chain for Example 2.

$$\mathbf{P} = \begin{pmatrix} 0 & 0 & 1 \\ \tfrac{1}{3} & \tfrac{1}{6} & \tfrac{1}{2} \\ \tfrac{3}{4} & \tfrac{1}{4} & 0 \end{pmatrix}$$

Applying $\pi_j = \sum_{i=1}^{K} \pi_i p_{ij}$ gives the balance equations
$$\pi_1 = \tfrac{1}{3}\pi_2 + \tfrac{3}{4}\pi_3, \qquad \pi_2 = \tfrac{1}{6}\pi_2 + \tfrac{1}{4}\pi_3, \qquad \pi_3 = \pi_1 + \tfrac{1}{2}\pi_2$$

Solving these together with $\sum_j \pi_j = 1$:
$$\pi_1 = 0.3953, \qquad \pi_2 = 0.1395, \qquad \pi_3 = 0.4652$$
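The stationary probabilities of Example 2 can likewise be obtained by solving the balance equations together with the normalization constraint; a sketch (the row-replacement trick is a standard device, not from the slides):

import numpy as np

P = np.array([[0,   0,   1  ],
              [1/3, 1/6, 1/2],
              [3/4, 1/4, 0  ]])

# pi = pi P  <=>  (P^T - I) pi^T = 0; replace one redundant balance equation
# with the normalization sum(pi) = 1 to obtain a nonsingular linear system.
A = np.vstack([(P.T - np.eye(3))[:-1], np.ones(3)])
b = np.array([0.0, 0.0, 1.0])
print(np.linalg.solve(A, b).round(4))   # [0.3953 0.1395 0.4651]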

11.3 Markov Chains (8/9)

Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.

11.3 Markov Chains (9/9)

Principle of detailed balance: at thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition,
$$\pi_i p_{ij} = \pi_j p_{ji}$$

Application: a distribution satisfying detailed balance is stationary.
$$\sum_{i=1}^{K} \pi_i p_{ij} = \sum_{i=1}^{K} \left(\frac{\pi_i p_{ij}}{\pi_j}\right)\pi_j = \sum_{i=1}^{K} \left(\frac{\pi_j p_{ji}}{\pi_j}\right)\pi_j = \pi_j \sum_{i=1}^{K} p_{ji} = \pi_j$$
using detailed balance in the second equality and $\sum_{i=1}^{K} p_{ji} = 1$ in the last.

11.4 Metropolis Algorithm (1/3)

Metropolis algorithm: a stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium. A modified Monte Carlo method; a Markov chain Monte Carlo (MCMC) method.

Algorithm (Metropolis):
1. Given $X_n = x_i$, randomly generate a new state $x_j$.
2. Compute $\Delta E = E(x_j) - E(x_i)$.
3. If $\Delta E < 0$, then $X_{n+1} = x_j$.
   Else ($\Delta E \ge 0$): select a random number $\xi \sim U[0,1]$; if $\xi < \exp(-\Delta E / T)$, then $X_{n+1} = x_j$ (accept), else $X_{n+1} = x_i$ (reject).
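A sketch of the algorithm above in Python; the energy function and spin-flip proposal in the example are illustrative assumptions (a small one-dimensional Ising-style chain), not part of the slides.

import math, random

def metropolis_step(E, propose, x, T):
    """One Metropolis update: returns the next state X_{n+1}."""
    x_new = propose(x)                              # step 1: candidate state x_j
    dE = E(x_new) - E(x)                            # step 2: energy change
    if dE < 0 or random.random() < math.exp(-dE / T):
        return x_new                                # step 3: accept
    return x                                        # reject: stay at x_i

# Illustrative example: spins x in {-1,+1}^5 with E(x) = -sum_i x_i x_{i+1}
E = lambda x: -sum(a * b for a, b in zip(x, x[1:]))

def propose(x):                                     # flip one randomly chosen spin
    i = random.randrange(len(x))
    return x[:i] + (-x[i],) + x[i+1:]

x = (1, -1, 1, -1, 1)
for _ in range(1000):
    x = metropolis_step(E, propose, x, T=1.0)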

11.4 Metropolis Algorithm (2/3)

Choice of transition probabilities. Proposed set of transition probabilities $\tau_{ij}$:
1. $\tau_{ij} > 0$ for all $i, j$ (nonnegativity)
2. $\sum_j \tau_{ij} = 1$ for all $i$ (normalization)
3. $\tau_{ij} = \tau_{ji}$ for all $i, j$ (symmetry)

Desired set of transition probabilities:
$$p_{ij} = \begin{cases} \tau_{ij}\left(\dfrac{\pi_j}{\pi_i}\right) & \text{for } \dfrac{\pi_j}{\pi_i} < 1 \\[2mm] \tau_{ij} & \text{for } \dfrac{\pi_j}{\pi_i} \ge 1 \end{cases}$$
$$p_{ii} = \tau_{ii} + \sum_{j \ne i} \tau_{ij}\left(1 - \alpha_{ij}\right) = 1 - \sum_{j \ne i} \alpha_{ij}\tau_{ij}$$
with moving probability
$$\alpha_{ij} = \min\left(1, \frac{\pi_j}{\pi_i}\right)$$

11.4 Metropolis Algorithm (3/3)

How do we choose the ratio $\pi_j / \pi_i$? We choose the probability distribution to which we want the Markov chain to converge to be a Gibbs distribution:
$$\pi_j = \frac{1}{Z}\exp\left(-\frac{E_j}{T}\right), \qquad \frac{\pi_j}{\pi_i} = \exp\left(-\frac{\Delta E}{T}\right), \qquad \Delta E = E_j - E_i$$
The partition function $Z$ cancels in the ratio, so it never needs to be computed.

Proof of detailed balance:

Case 1: $\Delta E < 0$, so $\pi_j/\pi_i > 1$, giving $p_{ij} = \tau_{ij}$ and $p_{ji} = (\pi_i/\pi_j)\tau_{ji}$:
$$\pi_i p_{ij} = \pi_i \tau_{ij} = \pi_i \tau_{ji}, \qquad \pi_j p_{ji} = \pi_j\left(\frac{\pi_i}{\pi_j}\right)\tau_{ji} = \pi_i \tau_{ji}$$

Case 2: $\Delta E > 0$, so $p_{ij} = (\pi_j/\pi_i)\tau_{ij}$ and $p_{ji} = \tau_{ji}$:
$$\pi_i p_{ij} = \pi_i\left(\frac{\pi_j}{\pi_i}\right)\tau_{ij} = \pi_j \tau_{ij} = \pi_j \tau_{ji}, \qquad \pi_j p_{ji} = \pi_j \tau_{ji}$$

In both cases $\pi_i p_{ij} = \pi_j p_{ji}$, as required.

11.5 Simulated Annealing (1/3)

Simulated annealing:
• A stochastic relaxation technique for solving optimization problems.
• Improves the computational efficiency of the Metropolis algorithm.
• Makes random moves on the energy surface.
• Operates a stochastic system at a high temperature (where convergence to equilibrium is fast) and then iteratively lowers the temperature (at T = 0, the Markov chain collapses onto the global minima).

Two ingredients:
1. A schedule that determines the rate at which the temperature is lowered.
2. An algorithm, such as the Metropolis algorithm, that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature.

$$F = \langle E \rangle - TH, \qquad \lim_{T \to 0} F = \langle E \rangle$$

11.5 Simulated Annealing (2/3)

1. Initial value of the temperature. The initial value $T_0$ of the temperature is chosen high enough to ensure that virtually all proposed transitions are accepted by the simulated-annealing algorithm.

2. Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the changes made in the value of the temperature are small. In particular, the decrement function is defined by
$$T_k = \alpha T_{k-1}, \qquad k = 1, 2, \ldots, K$$
where $\alpha$ is a constant smaller than, but close to, unity; typical values of $\alpha$ lie between 0.8 and 0.99. At each temperature, enough transitions are attempted so that there are 10 accepted transitions per experiment, on average.

3. Final value of the temperature. The system is frozen and annealing stops if the desired number of acceptances is not achieved at three successive temperatures.
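Combining the three rules with the metropolis_step sketch from Section 11.4 gives a minimal simulated-annealing loop; the particular values of T0, the sweep count, and the freezing test are illustrative assumptions.

def simulated_annealing(E, propose, x, T0=10.0, alpha=0.95,
                        sweeps=200, min_accepts=10, patience=3):
    """Anneal with metropolis_step; stop once the system is frozen (rule 3)."""
    T, frozen = T0, 0                               # rule 1: start hot
    while frozen < patience:
        accepts = 0
        for _ in range(sweeps):                     # equilibrate at this T
            x_new = metropolis_step(E, propose, x, T)
            if x_new is not x:                      # a rejection returns the same object
                accepts += 1
            x = x_new
        frozen = frozen + 1 if accepts < min_accepts else 0
        T *= alpha                                  # rule 2: T_k = alpha * T_{k-1}
    return x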

11.5 Simulated Annealing (3/3)

Simulated annealing for combinatorial optimization.

11.6 Gibbs Sampling (1/2)

Gibbs sampling: an iterative adaptive scheme that generates a value from the conditional distribution of each component of the random vector $\mathbf{X}$ in turn, rather than all values of the variables at the same time.

$\mathbf{X} = (X_1, X_2, \ldots, X_K)$: a random vector of $K$ components. Assume we know the conditionals $P(X_k \mid \mathbf{X}_{-k})$, where $\mathbf{X}_{-k} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_K)$.

Gibbs sampling algorithm (Gibbs sampler):
1. Initialize $x_1(0), x_2(0), \ldots, x_K(0)$.
2. For iteration $i$ (starting with $i = 1$), draw
$$x_1(i) \sim P(X_1 \mid x_2(i-1), x_3(i-1), \ldots, x_K(i-1))$$
$$x_2(i) \sim P(X_2 \mid x_1(i), x_3(i-1), \ldots, x_K(i-1))$$
$$x_3(i) \sim P(X_3 \mid x_1(i), x_2(i), x_4(i-1), \ldots, x_K(i-1))$$
$$\vdots$$
$$x_k(i) \sim P(X_k \mid x_1(i), \ldots, x_{k-1}(i), x_{k+1}(i-1), \ldots, x_K(i-1))$$
$$\vdots$$
$$x_K(i) \sim P(X_K \mid x_1(i), x_2(i), \ldots, x_{K-1}(i))$$
3. If the termination condition is not met, set $i \leftarrow i + 1$ and go to step 2.
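A sketch of the Gibbs sampler for a case where the conditionals are available in closed form: a bivariate Gaussian with unit variances and correlation rho, for which X1 | X2 = x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for X2. This example distribution is our assumption, not from the slides.

import random

rho = 0.8                              # correlation of the assumed 2-D Gaussian
sd = (1 - rho**2) ** 0.5
x1, x2 = 0.0, 0.0                      # step 1: initialize x(0)
samples = []
for i in range(10000):                 # step 2, repeated until termination
    x1 = random.gauss(rho * x2, sd)    # x1(i) ~ P(X1 | x2(i-1))
    x2 = random.gauss(rho * x1, sd)    # x2(i) ~ P(X2 | x1(i))
    samples.append((x1, x2))

# Ergodic theorem (next slide): the time average of g = X1*X2 approaches
# E[X1 X2] = rho; the first 1000 samples are discarded as burn-in.
print(sum(a * b for a, b in samples[1000:]) / 9000)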

11.6 Gibbs Sampling (2/2)

1. Convergence theorem. The random variable $X_k(n)$ converges in distribution to the true probability distribution of $X_k$ for $k = 1, 2, \ldots, K$ as $n$ approaches infinity; that is,
$$\lim_{n\to\infty} P(X_k(n) \le x \mid x_k(0)) = P_{X_k}(x) \qquad\text{for } k = 1, 2, \ldots, K$$
where $P_{X_k}(x)$ is the marginal cumulative distribution function of $X_k$.

2. Rate-of-convergence theorem. The joint cumulative distribution of the random variables $X_1(n), X_2(n), \ldots, X_K(n)$ converges to the true joint cumulative distribution of $X_1, X_2, \ldots, X_K$ at a geometric rate in $n$.

3. Ergodic theorem. For any measurable function $g$ of the random variables $X_1, X_2, \ldots, X_K$ whose expectation exists, we have
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} g(X_1(i), X_2(i), \ldots, X_K(i)) = E[g(X_1, X_2, \ldots, X_K)]$$
with probability 1 (i.e., almost surely).

11.7 Boltzmann Machine (1/5)

Figure 11.5: Architectural graph of the Boltzmann machine; K is the number of visible neurons, and L is the number of hidden neurons. The distinguishing features of the machine are: 1. the connections between the visible and hidden neurons are symmetric; 2. the symmetric connections are extended to the visible and hidden neurons.

Boltzmann machine (BM): a stochastic machine consisting of stochastic neurons with symmetric synaptic connections.

$\mathbf{x}$: state vector of the BM; $w_{ji}$: synaptic connection from neuron $i$ to neuron $j$.

Structure (weights): $w_{ji} = w_{ij}$ for all $i, j$; $w_{ii} = 0$ for all $i$.

Energy:
$$E(\mathbf{x}) = -\frac{1}{2}\sum_i \sum_{j \ne i} w_{ji} x_i x_j$$
Probability:
$$P(\mathbf{X} = \mathbf{x}) = \frac{1}{Z}\exp\left(-\frac{E(\mathbf{x})}{T}\right)$$

11.7 Boltzmann Machine (2/5)

Consider three events:
$A$: $X_j = x_j$
$B$: $\{X_i = x_i\}_{i=1}^{K}$ with $i \ne j$
$C$: $\{X_i = x_i\}_{i=1}^{K}$

The joint event $B$ excludes $A$, and the joint event $C$ includes both $A$ and $B$.
$$P(C) = P(A, B) = \frac{1}{Z}\exp\left(\frac{1}{2T}\sum_i \sum_{j,\, j \ne i} w_{ji} x_i x_j\right)$$
$$P(B) = \sum_A P(A, B) = \frac{1}{Z}\sum_{x_j}\exp\left(\frac{1}{2T}\sum_i \sum_{j,\, j \ne i} w_{ji} x_i x_j\right)$$

The component of the exponent involving $x_j$ is
$$\frac{x_j}{2T}\sum_{i \ne j} w_{ji} x_i$$
Hence
$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{1}{1 + \exp\left(-\dfrac{x_j}{T}\displaystyle\sum_{i \ne j} w_{ji} x_i\right)}$$
$$P\left(X_j = x \,\Big|\, \{X_i = x_i\}_{i=1,\, i \ne j}^{K}\right) = \varphi\left(\frac{x}{T}\sum_{i,\, i \ne j}^{K} w_{ji} x_i\right), \qquad \varphi(v) = \frac{1}{1 + \exp(-v)}$$
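The conditional probability above translates directly into a stochastic update rule for a single neuron; a sketch for states in {-1, +1} (the list-of-lists weight representation is an implementation choice of ours):

import math, random

def update_neuron(w, x, j, T):
    """Resample x[j] from P(X_j = x | rest) = phi((x/T) * sum_i w[j][i]*x[i])."""
    v = sum(w[j][i] * x[i] for i in range(len(x)) if i != j) / T
    p_plus = 1.0 / (1.0 + math.exp(-v))   # phi(v) = P(X_j = +1 | rest)
    x[j] = 1 if random.random() < p_plus else -1

# Repeatedly sweeping update_neuron over all j simulates the machine at T.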

11.7 Boltzmann Machine (3/5)

Figure 11.6: Sigmoid-shaped function P(v).

Log-likelihood of the training sample $\mathfrak{I}$:
$$L(\mathbf{w}) = \log \prod_{\mathbf{x}_\alpha \in \mathfrak{I}} P(\mathbf{X}_\alpha = \mathbf{x}_\alpha) = \sum_{\mathbf{x}_\alpha \in \mathfrak{I}} \log P(\mathbf{X}_\alpha = \mathbf{x}_\alpha)$$

1. Positive phase. In this phase, the network operates in its clamped condition (i.e., under the direct influence of the training sample $\mathfrak{I}$).
2. Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input.

11.7 Boltzmann Machine (4/5)

$\mathbf{x}_\alpha$: the state of the visible neurons (a subset of $\mathbf{x}$)
$\mathbf{x}_\beta$: the state of the hidden neurons (a subset of $\mathbf{x}$)

Probability of the visible state:
$$P(\mathbf{X}_\alpha = \mathbf{x}_\alpha) = \frac{1}{Z}\sum_{\mathbf{x}_\beta}\exp\left(-\frac{E(\mathbf{x})}{T}\right), \qquad Z = \sum_{\mathbf{x}}\exp\left(-\frac{E(\mathbf{x})}{T}\right)$$

Log-likelihood function given the training data $\mathfrak{I}$:
$$L(\mathbf{w}) = \log P(\mathbf{x} \mid \mathbf{w}) = \sum_{\mathbf{x}_\alpha \in \mathfrak{I}}\left(\log \sum_{\mathbf{x}_\beta}\exp\left(-\frac{E(\mathbf{x})}{T}\right) - \log \sum_{\mathbf{x}}\exp\left(-\frac{E(\mathbf{x})}{T}\right)\right)$$

Derivative of the log-likelihood function:
$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T}\sum_{\mathbf{x}_\alpha \in \mathfrak{I}}\left(\sum_{\mathbf{x}_\beta} P(\mathbf{X}_\beta = \mathbf{x}_\beta \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha)\, x_j x_i - \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\, x_j x_i\right)$$

11.7 Boltzmann Machine (5/5)

Mean firing rate in the positive phase (clamped):
$$\rho_{ji}^{+} = \langle x_j x_i \rangle^{+} = \sum_{\mathbf{x}_\alpha \in \mathfrak{I}}\sum_{\mathbf{x}_\beta} P(\mathbf{X}_\beta = \mathbf{x}_\beta \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha)\, x_j x_i$$

Mean firing rate in the negative phase (free-running):
$$\rho_{ji}^{-} = \langle x_j x_i \rangle^{-} = \sum_{\mathbf{x}_\alpha \in \mathfrak{I}}\sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\, x_j x_i$$

Thus, we may write
$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T}\left(\rho_{ji}^{+} - \rho_{ji}^{-}\right)$$

Gradient ascent to maximize $L(\mathbf{w})$, the Boltzmann machine learning rule:
$$\Delta w_{ji} = \eta \frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \eta'\left(\rho_{ji}^{+} - \rho_{ji}^{-}\right), \qquad \eta' = \frac{\eta}{T}$$
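For a machine small enough to enumerate all states, the correlations in the learning rule can be computed exactly. The sketch below assumes, for brevity, that all neurons are visible, so the clamped correlation is just the data average; with hidden units, rho+ would also average over P(x_beta | x_alpha) as in the formula above, and both slide versions carry an extra common factor of the training-set size.

import itertools, math
import numpy as np

def correlations(w, T, data):
    """Exact rho+ and rho- for a fully visible BM (w symmetric, zero diagonal)."""
    n = w.shape[0]
    states = [np.array(s) for s in itertools.product([-1, 1], repeat=n)]
    energies = [-0.5 * x @ w @ x for x in states]    # E(x), assuming w_ii = 0
    p = np.array([math.exp(-e / T) for e in energies])
    p /= p.sum()                                     # Gibbs distribution P(X = x)
    rho_minus = sum(pi * np.outer(x, x) for pi, x in zip(p, states))
    rho_plus = sum(np.outer(x, x) for x in data) / len(data)
    return rho_plus, rho_minus

# learning step: w += (eta / T) * (rho_plus - rho_minus), diagonal kept at zero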

11.8 Logistic Belief Nets

Figure 11.7: Directed (logistic) belief network.

A stochastic machine consisting of multiple layers of stochastic neurons with directed synaptic connections.

Parents of node $j$:
$$\mathrm{pa}(X_j) \subseteq \{X_1, X_2, \ldots, X_{j-1}\}$$
Conditional probability:
$$P(X_j = x_j \mid X_1 = x_1, \ldots, X_{j-1} = x_{j-1}) = P(X_j = x_j \mid \mathrm{pa}(X_j))$$

Calculation of conditional probabilities:
1. $w_{ji} = 0$ for all $X_i \notin \mathrm{pa}(X_j)$
2. $w_{ji} = 0$ for $i \ge j$ (since the network is acyclic)

Weight update rule:
$$\Delta w_{ji} = \eta \frac{\partial L(\mathbf{w})}{\partial w_{ji}}$$
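Because w_ji = 0 for i >= j, the nodes can be sampled in index order, each conditional being a logistic function of already-sampled parents. A sketch for binary {0, 1} states; the bias terms b are an assumption beyond the slide.

import math, random

def sample_lbn(w, b, K):
    """Ancestral sampling: w[j][i] nonzero only for parents i < j; b are biases."""
    x = [0] * K
    for j in range(K):                 # every parent is sampled before its children
        v = b[j] + sum(w[j][i] * x[i] for i in range(j))
        p = 1.0 / (1.0 + math.exp(-v))   # logistic conditional P(X_j = 1 | parents)
        x[j] = 1 if random.random() < p else 0
    return x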

11.9 Deep Belief Nets (1/4)

Figure 11.8: Neural structure of the restricted Boltzmann machine (RBM). Contrasting this with the Boltzmann machine of Fig. 11.5, we see that in the RBM there are no connections among the visible neurons, and none among the hidden neurons.

Maximum-likelihood learning in a restricted Boltzmann machine (RBM). Sequential pre-training:
1. Update the hidden states $\mathbf{h}$ in parallel, given the visible states $\mathbf{x}$.
2. Do the same, but in reverse: update the visible states $\mathbf{x}$ in parallel, given the hidden states $\mathbf{h}$.

Maximum-likelihood learning:
$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \rho_{ji}^{(0)} - \rho_{ji}^{(\infty)}$$
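A sketch of the two alternating update steps for an RBM with binary {0, 1} units. Training here uses the CD-1 shortcut, which truncates the chain after one reconstruction instead of running it to rho(infinity); that shortcut, and the bias-free weight matrix W, are assumptions beyond the slide.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(W, x, eta=0.01, rng=np.random.default_rng()):
    """One CD-1 update of the L-by-K weight matrix W for one visible vector x."""
    ph0 = sigmoid(W @ x)                       # step 1: hidden given visible
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample all hidden units in parallel
    px1 = sigmoid(W.T @ h0)                    # step 2: visible given hidden
    x1 = (rng.random(px1.shape) < px1) * 1.0   # sample all visible units in parallel
    ph1 = sigmoid(W @ x1)                      # hidden given the reconstruction
    W += eta * (np.outer(ph0, x) - np.outer(ph1, x1))   # ~ rho(0) - rho(1)
    return W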

11.9 Deep Belief Nets (2/4)

Figure 11.9: Top-down learning, using a logistic belief network of infinite depth.

Figure 11.10: A hybrid generative model in which the two top layers form a restricted Boltzmann machine and the lower two layers form a directed model. The weights shown with blue shaded arrows are not part of the generative model; they are used to infer the feature values given the data, but they are not used for generating data.

11.9 Deep Belief Nets (3/4)

Figure 11.11: Illustrating the progression of alternating Gibbs sampling in an RBM. After sufficiently many steps, the visible and hidden vectors are sampled from the stationary distribution defined by the current parameters of the model.

11.9 Deep Belief Nets (4/4)

Figure 11.12: The task of modeling the sensory (visible) data is divided into two subtasks.

11.10 Deterministic Annealing (1/5)

Deterministic annealing (DA) incorporates randomness into the energy function, which is then deterministically optimized at a sequence of decreasing temperatures (cf. simulated annealing: random moves on the energy surface).

Clustering via deterministic annealing:
$\mathbf{x}$: source (input) vector; $\mathbf{y}$: reconstruction (output) vector.

Distortion measure: $d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2$

Expected distortion:
$$D = \sum_{\mathbf{x}}\sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})\, d(\mathbf{x}, \mathbf{y}) = \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})\, d(\mathbf{x}, \mathbf{y})$$

Probability of the joint event:
$$P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = \underbrace{P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})}_{\text{association probability}}\, P(\mathbf{X} = \mathbf{x})$$

11.10 Deterministic Annealing (2/5)

Table 11.2

Entropy as a randomness measure:
$$H(\mathbf{X}, \mathbf{Y}) = -\sum_{\mathbf{x}}\sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) \log P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})$$

Constrained optimization of $D$ as minimization of the Lagrangian $F = D - TH$, where
$$H(\mathbf{X}, \mathbf{Y}) = \underbrace{H(\mathbf{X})}_{\text{source entropy}} + \underbrace{H(\mathbf{Y} \mid \mathbf{X})}_{\text{conditional entropy}}$$
$$H(\mathbf{Y} \mid \mathbf{X}) = -\sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \log P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$$

Minimizing $F$ with respect to the association probabilities gives the Gibbs distribution
$$P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right), \qquad Z_{\mathbf{x}} = \sum_{\mathbf{y}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)$$

11.10 Deterministic Annealing (3/5)

$$F^* = \min_{P(\mathbf{Y}=\mathbf{y} \mid \mathbf{X}=\mathbf{x})} F = -T\sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \log Z_{\mathbf{x}}$$

Setting
$$\frac{\partial F^*}{\partial \mathbf{y}} = \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})\frac{\partial}{\partial \mathbf{y}} d(\mathbf{x}, \mathbf{y}) = 0 \qquad \forall \mathbf{y} \in \Upsilon$$
the minimizing condition is
$$\frac{1}{N}\sum_{\mathbf{x}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})\frac{\partial}{\partial \mathbf{y}} d(\mathbf{x}, \mathbf{y}) = 0 \qquad \forall \mathbf{y} \in \Upsilon$$

The deterministic annealing algorithm consists of minimizing the Lagrangian $F^*$ with respect to the code vectors at a high value of temperature $T$ and then tracking the minimum while the temperature $T$ is lowered.
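For the squared-error distortion, the minimizing condition places each code vector at the association-weighted mean of the data. One temperature step of the resulting algorithm might look like the sketch below; the array shapes, and the omission of a perturbation step for splitting clusters at phase transitions, are simplifications of ours.

import numpy as np

def da_step(X, Y, T):
    """X: N-by-d data, Y: M-by-d code vectors; one DA update at temperature T."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # d(x,y) = ||x - y||^2
    P = np.exp(-d2 / T)
    P /= P.sum(axis=1, keepdims=True)    # association probabilities P(y | x), rows
    # minimizing condition: sum_x P(y|x) (y - x) = 0  =>  weighted mean of the data
    return (P.T @ X) / P.T.sum(axis=1, keepdims=True)

# anneal: iterate da_step to convergence at each T of a decreasing schedule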

11.10 Deterministic Annealing (4/5)

Figure 11.13: Clustering at various phases, where B = 1/T. The lines are equiprobability contours, p = 1/2 in (b) and p = 1/3 elsewhere: (a) 1 cluster (B = 0), (b) 2 clusters (B = 0.0049), (c) 3 clusters (B = 0.0056), (d) 4 clusters (B = 0.0100), (e) 5 clusters (B = 0.0156), (f) 6 clusters (B = 0.0347), and (g) 19 clusters (B = 0.0605).

11.10 Deterministic Annealing (5/5)

Figure 11.14: Phase diagram for the case study in deterministic annealing, where B = 1/T. The number of effective clusters is shown for each phase.

11.11 Analogy of DA with EM (1/2)

Suppose we view the association probability $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$ as the expected value of a random binary variable $V_{\mathbf{x}\mathbf{y}}$ defined as
$$V_{\mathbf{x}\mathbf{y}} = \begin{cases} 1 & \text{if the source vector } \mathbf{x} \text{ is assigned to code vector } \mathbf{y} \\ 0 & \text{otherwise} \end{cases}$$

Then the two steps of DA correspond to the two steps of EM:
1. Step 1 of DA (= E-step of EM): compute the association probabilities $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$.
2. Step 2 of DA (= M-step of EM): optimize the distortion measure $d(\mathbf{x}, \mathbf{y})$.

11.11 Analogy of DA with EM (2/2)

$\mathbf{r}$: complete data, including the missing data $\mathbf{z}$; $\mathbf{d} = \mathbf{d}(\mathbf{r})$: incomplete data.

Conditional pdf of $\mathbf{r}$ given the parameter vector $\boldsymbol{\theta}$:
$$p_D(\mathbf{d} \mid \boldsymbol{\theta}) = \int_{\Re(\mathbf{d})} p_c(\mathbf{r} \mid \boldsymbol{\theta})\, d\mathbf{r}$$
where $\Re(\mathbf{d})$ is the subspace of $\Re$ determined by $\mathbf{d} = \mathbf{d}(\mathbf{r})$.

Incomplete-data log-likelihood function: $L(\boldsymbol{\theta}) = \log p_D(\mathbf{d} \mid \boldsymbol{\theta})$
Complete-data log-likelihood function: $L_c(\boldsymbol{\theta}) = \log p_c(\mathbf{r} \mid \boldsymbol{\theta})$

Expectation-maximization (EM) algorithm, with $\hat{\boldsymbol{\theta}}(n)$ the value of $\boldsymbol{\theta}$ at iteration $n$:
1. E-step: $Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(n)) = E_{\hat{\boldsymbol{\theta}}(n)}\left[L_c(\boldsymbol{\theta})\right]$
2. M-step: $\hat{\boldsymbol{\theta}}(n+1) = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(n))$

After an iteration of the EM algorithm, the incomplete-data log-likelihood function is not decreased:
$$L(\hat{\boldsymbol{\theta}}(n+1)) \ge L(\hat{\boldsymbol{\theta}}(n)), \qquad n = 0, 1, 2, \ldots$$
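To make the E- and M-steps concrete, a sketch for a two-component Gaussian mixture in one dimension, with unit variances and equal priors assumed for brevity; the missing data z are the component labels, so r = (d, z).

import numpy as np

def em_step(d, mu):
    """One EM iteration for the means of a two-component 1-D Gaussian mixture."""
    # E-step: posterior responsibilities under the current estimate theta_hat(n)
    w0 = np.exp(-0.5 * (d - mu[0]) ** 2)
    w1 = np.exp(-0.5 * (d - mu[1]) ** 2)
    r0 = w0 / (w0 + w1)
    # M-step: maximizing Q gives responsibility-weighted means theta_hat(n+1)
    return np.array([(r0 * d).sum() / r0.sum(),
                     ((1 - r0) * d).sum() / (1 - r0).sum()])

# each step satisfies L(theta_hat(n+1)) >= L(theta_hat(n))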

Summary and Discussion

• Statistical mechanics as the mathematical basis for the formulation of stochastic simulation/optimization/learning:
  1. Metropolis algorithm
  2. Simulated annealing
  3. Gibbs sampling

• Stochastic learning machines:
  1. (Classical) Boltzmann machine
  2. Restricted Boltzmann machine (RBM)
  3. Deep belief nets (DBN)

• Deterministic annealing (DA):
  1. For optimization: connection to simulated annealing (SA)
  2. For clustering: connection to expectation-maximization (EM)
