Chapter 11. Stochastic Methods Rooted in Statistical Mechanics
Neural Networks and Learning Machines (Haykin)
Lecture Notes on Self-learning Neural Algorithms
Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University
Version: 20170926 → 20170928 → 20171011



Contents
11.1 Introduction
11.2 Statistical Mechanics
11.3 Markov Chains
11.4 Metropolis Algorithm
11.5 Simulated Annealing
11.6 Gibbs Sampling
11.7 Boltzmann Machine
11.8 Logistic Belief Nets
11.9 Deep Belief Nets
11.10 Deterministic Annealing (DA)
11.11 Analogy of DA with EM
Summary and Discussion


11.1 Introduction


• Statistical mechanics as a source of ideas for unsupervised (self-organized) learning systems
• Statistical mechanics
  – The formal study of macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
  – The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
  – The concept of entropy plays a vital role in statistical mechanics, as it does in Shannon's information theory.
  – The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.
• Statistical mechanics for the study of neural networks
  – Cragg and Temperley (1954) and Cowan (1968)
  – Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)


11.2 Statistical Mechanics (1/2)

Let p_i denote the probability of occurrence of state i of a stochastic system, with p_i ≥ 0 for all i and Σ_i p_i = 1, and let E_i denote the energy of the system when it is in state i. In thermal equilibrium, the probability of state i follows the canonical (Gibbs) distribution:

$$p_i = \frac{1}{Z}\exp\left(-\frac{E_i}{k_B T}\right), \qquad Z = \sum_i \exp\left(-\frac{E_i}{k_B T}\right)$$

where exp(−E_i/k_B T) is the Boltzmann factor and Z, the sum over states, is the partition function.

We set k_B = 1 and view −log p_i as "energy".

1. States of low energy have a higher probability of occurrence than states of high energy.
2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states, as the sketch below illustrates.
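The following minimal sketch (not from Haykin; the four state energies are hypothetical) computes the Gibbs distribution p_i = exp(−E_i/T)/Z and shows the probability mass concentrating on the lowest-energy state as T decreases:

```python
import numpy as np

def gibbs(energies, T):
    """Gibbs/canonical distribution at temperature T (with k_B = 1)."""
    w = np.exp(-np.asarray(energies, dtype=float) / T)  # Boltzmann factors
    return w / w.sum()                                  # normalize by Z

E = [0.0, 1.0, 2.0, 3.0]            # hypothetical state energies
for T in (10.0, 1.0, 0.1):
    print(T, np.round(gibbs(E, T), 4))
# Output trends from near-uniform at T = 10 to nearly all mass on E = 0 at T = 0.1.
```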


11.2 Statistical Mechanics (2/2)

The Helmholtz free energy is F = −T log Z, and the average energy is ⟨E⟩ = Σ_i p_i E_i. Then

$$\langle E \rangle - F = -T\sum_i p_i \log p_i = TH, \qquad H = -\sum_i p_i \log p_i \;\;\text{(entropy)}$$

so that

$$F = \langle E \rangle - TH$$

Consider two systems A and A′ in thermal contact, with entropy changes ΔH and ΔH′. The total entropy tends to increase:

$$\Delta H + \Delta H' \ge 0$$

The free energy of the system, F, tends to decrease and becomes a minimum in an equilibrium situation; the resulting probability distribution is the Gibbs distribution (the principle of minimal free energy). Nature likes to find a physical system with minimum free energy.


11.3 Markov Chains (1/9)

Markov property:

$$P(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = P(X_{n+1} = x_{n+1} \mid X_n = x_n)$$

Transition probability from state i at time n to state j at time n + 1:

$$p_{ij} = P(X_{n+1} = j \mid X_n = i), \qquad p_{ij} \ge 0 \;\;\forall i,j, \qquad \sum_j p_{ij} = 1 \;\;\forall i$$

If the transition probabilities are fixed, the Markov chain is homogeneous. For a system with a finite number K of possible states, the transition probabilities constitute the K-by-K stochastic matrix

$$\mathbf{P} = \begin{pmatrix} p_{11} & \cdots & p_{1K} \\ \vdots & \ddots & \vdots \\ p_{K1} & \cdots & p_{KK} \end{pmatrix}$$


11.3 Markov Chains (2/9)


Generalization to the m-step transition probability:

$$p_{ij}^{(m)} = P(X_{n+m} = x_j \mid X_n = x_i), \quad m = 1, 2, \ldots$$

$$p_{ij}^{(m+1)} = \sum_k p_{ik}^{(m)} p_{kj}, \quad m = 1, 2, \ldots, \qquad p_{ik}^{(1)} = p_{ik}$$

which generalizes further to the Chapman-Kolmogorov identity:

$$p_{ij}^{(m+n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n)}, \quad m, n = 1, 2, \ldots$$


11.3 Markov Chains (3/9)


Properties of Markov chains:

– Recurrent: the probability of ever returning to state i is p_i = 1.
– Transient: p_i < 1.
– Periodic (with period d): the states can be partitioned into d disjoint subsets S_1, …, S_d such that if i ∈ S_k and p_ij > 0, then j ∈ S_{k+1} for k = 1, …, d − 1, and j ∈ S_1 for k = d.
– Aperiodic: not periodic (d = 1).
– Accessible: state j is accessible from state i if there is a finite sequence of transitions from i to j.
– Communicate: states i and j communicate if each is accessible from the other. If two states communicate with each other, they belong to the same class.
– If all the states consist of a single class, the Markov chain is indecomposable or irreducible.


11.3 Markov Chains (4/9)

Figure 11.1: A periodic recurrent Markov chain with d = 3.


11.3 Markov Chains (5/9)


Ergodic Markov chains. Ergodicity: time average = ensemble average; i.e., the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.

Let v_i(k) be the proportion of time spent in state i after k returns:

$$v_i(k) = \frac{k}{\sum_{\ell=1}^{k} T_i(\ell)}$$

$$\lim_{k\to\infty} v_i(k) = \pi_i, \quad i = 1, 2, \ldots, K$$


11.3 Markov Chains (6/9)


Convergence to stationary distributions. Consider an ergodic Markov chain with stochastic matrix P, and let π^(n−1) be the state-probability vector of the chain at time n − 1. The state vector at time n is

$$\boldsymbol{\pi}^{(n)} = \boldsymbol{\pi}^{(n-1)}\mathbf{P}$$

By iteration we obtain

$$\boldsymbol{\pi}^{(n)} = \boldsymbol{\pi}^{(n-1)}\mathbf{P} = \boldsymbol{\pi}^{(n-2)}\mathbf{P}^2 = \cdots = \boldsymbol{\pi}^{(0)}\mathbf{P}^n$$

where π^(0) is the initial value. Moreover,

$$\lim_{n\to\infty}\mathbf{P}^n = \begin{pmatrix} \pi_1 & \cdots & \pi_K \\ \vdots & & \vdots \\ \pi_1 & \cdots & \pi_K \end{pmatrix} = \begin{pmatrix} \boldsymbol{\pi} \\ \vdots \\ \boldsymbol{\pi} \end{pmatrix}$$

Ergodic theorem:
1. lim_{n→∞} p_ij^(n) = π_j for all i
2. π_j > 0 for all j
3. Σ_{j=1}^K π_j = 1
4. π_j = Σ_{i=1}^K π_i p_ij for j = 1, 2, …, K


11.3 Markov Chains (6/9): Example 1

Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states x1 and x2 may be identified as up-to-date and behind, respectively.

$$\mathbf{P} = \begin{pmatrix} \frac{1}{4} & \frac{3}{4} \\[2pt] \frac{1}{2} & \frac{1}{2} \end{pmatrix}, \qquad \boldsymbol{\pi}^{(0)} = \begin{pmatrix} \frac{1}{6} & \frac{5}{6} \end{pmatrix}$$

$$\boldsymbol{\pi}^{(1)} = \boldsymbol{\pi}^{(0)}\mathbf{P} = \begin{pmatrix} \frac{1}{6} & \frac{5}{6} \end{pmatrix} \begin{pmatrix} \frac{1}{4} & \frac{3}{4} \\[2pt] \frac{1}{2} & \frac{1}{2} \end{pmatrix} = \begin{pmatrix} \frac{11}{24} & \frac{13}{24} \end{pmatrix}$$

The successive powers of P converge to a matrix whose identical rows give the stationary distribution:

$$\mathbf{P}^{(2)} = \begin{pmatrix} 0.4375 & 0.5625 \\ 0.3750 & 0.6250 \end{pmatrix}, \quad \mathbf{P}^{(3)} = \begin{pmatrix} 0.4001 & 0.5999 \\ 0.3999 & 0.6001 \end{pmatrix}, \quad \mathbf{P}^{(4)} = \begin{pmatrix} 0.4000 & 0.6000 \\ 0.4000 & 0.6000 \end{pmatrix}$$
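As a quick numerical check (a sketch using NumPy, not part of the original notes), iterating π^(n) = π^(n−1)P from π^(0) = [1/6, 5/6] converges to the stationary distribution π = [0.4, 0.6], and high powers of P approach a matrix with identical rows:

```python
import numpy as np

P = np.array([[1/4, 3/4],
              [1/2, 1/2]])
pi = np.array([1/6, 5/6])
for n in range(1, 9):
    pi = pi @ P                      # pi^(n) = pi^(n-1) P
    print(n, np.round(pi, 4))        # starts at [11/24, 13/24], tends to [0.4, 0.6]
print(np.round(np.linalg.matrix_power(P, 8), 4))   # rows approach [0.4, 0.6]
```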


11.3 Markov Chains (7/9): Example 2

Figure 11.3: State-transition diagram of the Markov chain for Example 2.

$$\mathbf{P} = \begin{pmatrix} 0 & 0 & 1 \\[2pt] \frac{1}{3} & \frac{1}{6} & \frac{1}{2} \\[2pt] \frac{3}{4} & \frac{1}{4} & 0 \end{pmatrix}$$

Applying π_j = Σ_{i=1}^K π_i p_ij:

$$\pi_1 = \frac{1}{3}\pi_2 + \frac{3}{4}\pi_3, \qquad \pi_2 = \frac{1}{6}\pi_2 + \frac{1}{4}\pi_3, \qquad \pi_3 = \pi_1 + \frac{1}{2}\pi_2$$

Solving these equations together with Σ_j π_j = 1 gives

$$\pi_1 = 0.3953, \qquad \pi_2 = 0.1395, \qquad \pi_3 = 0.4652$$
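The same answer can be obtained numerically (a sketch, solving π = πP together with the normalization constraint as a linear least-squares system):

```python
import numpy as np

P = np.array([[0,   0,   1  ],
              [1/3, 1/6, 1/2],
              [3/4, 1/4, 0  ]])
# Stack the stationarity equations (P^T - I) pi = 0 with sum(pi) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(pi, 4))    # -> approximately [0.3953, 0.1395, 0.4651]
```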


11.3 Markov Chains (8/9)

Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.


11.3 Markov Chains (9/9)


Principle of detailed balance: at thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition:

$$\pi_i p_{ij} = \pi_j p_{ji}$$

Application: detailed balance implies that π is a stationary distribution. Substituting detailed balance term by term and using the normalization Σ_{i=1}^K p_ji = 1,

$$\sum_{i=1}^{K} \pi_i p_{ij} = \sum_{i=1}^{K} \pi_j p_{ji} = \pi_j \sum_{i=1}^{K} p_{ji} = \pi_j$$


11.4 Metropolis Algorithm (1/3)


The Metropolis algorithm is a stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium. It is a modified Monte Carlo method, and a Markov chain Monte Carlo (MCMC) method.

Algorithm (Metropolis):
1. Given the current state X_n = x_i, randomly generate a new state x_j.
2. Compute the energy change ΔE = E(x_j) − E(x_i).
3. If ΔE < 0, accept the move: X_{n+1} = x_j.
   Otherwise (ΔE ≥ 0), select a random number ξ ~ U[0, 1]:
   if ξ < exp(−ΔE/T), accept: X_{n+1} = x_j; else reject: X_{n+1} = x_i.

A sketch of one possible implementation follows.
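A minimal sketch (the integer state space 0..9, the quadratic energy, and the random-neighbor proposal are illustrative assumptions, not part of the text):

```python
import math, random

def energy(x):
    return (x - 3) ** 2 / 4.0                  # hypothetical energy landscape

def metropolis_step(x, T):
    x_new = (x + random.choice([-1, 1])) % 10  # propose a random neighboring state
    dE = energy(x_new) - energy(x)
    if dE < 0 or random.random() < math.exp(-dE / T):
        return x_new                           # accept the move
    return x                                   # reject: stay in the current state

random.seed(0)
x, T = 0, 1.0
visits = [0] * 10
for n in range(20000):
    x = metropolis_step(x, T)
    visits[x] += 1
# The visit frequencies approximate the Gibbs distribution exp(-E/T)/Z.
```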


11.4 Metropolis Algorithm (2/3)


Choice of transition probabilities. Start from a proposed set of transition probabilities τ_ij satisfying:
1. τ_ij ≥ 0 for all i, j (nonnegativity)
2. Σ_j τ_ij = 1 for all i (normalization)
3. τ_ij = τ_ji for all i, j (symmetry)

The desired set of transition probabilities is then

$$p_{ij} = \begin{cases} \tau_{ij}\left(\dfrac{\pi_j}{\pi_i}\right) & \text{for } \dfrac{\pi_j}{\pi_i} < 1 \\[8pt] \tau_{ij} & \text{for } \dfrac{\pi_j}{\pi_i} \ge 1 \end{cases} \qquad (j \ne i)$$

$$p_{ii} = \tau_{ii} + \sum_{j \ne i} \tau_{ij}\left(1 - \alpha_{ij}\right) = 1 - \sum_{j \ne i} \alpha_{ij}\tau_{ij}$$

where the moving probability is

$$\alpha_{ij} = \min\left(1, \frac{\pi_j}{\pi_i}\right)$$


11.4 Metropolis Algorithm (3/3)


How do we choose the ratio π_j/π_i? We choose the probability distribution to which we want the Markov chain to converge to be a Gibbs distribution,

$$\pi_j = \frac{1}{Z}\exp\left(-\frac{E_j}{T}\right), \qquad \frac{\pi_j}{\pi_i} = \exp\left(-\frac{\Delta E}{T}\right), \qquad \Delta E = E_j - E_i$$

Note that the partition function Z cancels in the ratio, so it never has to be computed.

Proof of detailed balance:

Case 1: ΔE < 0, so π_j > π_i. Then p_ij = τ_ij and p_ji = (π_i/π_j)τ_ji, and by the symmetry of τ,

$$\pi_i p_{ij} = \pi_i \tau_{ij} = \pi_i \tau_{ji}, \qquad \pi_j p_{ji} = \pi_j\left(\frac{\pi_i}{\pi_j}\right)\tau_{ji} = \pi_i \tau_{ji}$$

Case 2: ΔE > 0, so π_j < π_i. Then p_ij = (π_j/π_i)τ_ij and p_ji = τ_ji, and

$$\pi_i p_{ij} = \pi_i\left(\frac{\pi_j}{\pi_i}\right)\tau_{ij} = \pi_j \tau_{ij} = \pi_j \tau_{ji} = \pi_j p_{ji}$$

In both cases, π_i p_ij = π_j p_ji, as required.


11.5 Simulated Annealing (1/3)


Simulated annealing:
• A stochastic relaxation technique for solving optimization problems.
• Improves the computational efficiency of the Metropolis algorithm.
• Makes random moves on the energy surface.
• Operates a stochastic system at a high temperature (where convergence to equilibrium is fast) and then iteratively lowers the temperature (at T = 0, the Markov chain collapses onto the global minima).

Two ingredients:
1. A schedule that determines the rate at which the temperature is lowered.
2. An algorithm, such as the Metropolis algorithm, that iteratively finds the equilibrium distribution at each new temperature in the schedule, using the final state of the system at the previous temperature as the starting point for the new temperature.

$$F = \langle E \rangle - TH, \qquad \lim_{T \to 0} F = \langle E \rangle$$


11.5 Simulated Annealing (2/3)


1. Initial value of the temperature. The initial value T_0 is chosen high enough to ensure that virtually all proposed transitions are accepted by the simulated-annealing algorithm.
2. Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the changes made in the value of the temperature are small. In particular, the decrement function is defined by
$$T_k = \alpha T_{k-1}, \quad k = 1, 2, \ldots, K$$
where α is a constant smaller than, but close to, unity; typical values of α lie between 0.8 and 0.99. At each temperature, enough transitions are attempted so that there are 10 accepted transitions per experiment, on average.
3. Final value of the temperature. The system is frozen and annealing stops if the desired number of acceptances is not achieved at three successive temperatures.

A cooling-loop sketch follows this schedule.
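The following sketch continues the Metropolis example from Section 11.4 (reusing its hypothetical `energy` and `metropolis_step`); the schedule parameters are illustrative assumptions, and the stopping rule is simplified to a fixed number of temperature steps rather than the acceptance-based rule above:

```python
import random

random.seed(1)
T0, alpha, K = 10.0, 0.9, 50        # initial temperature, decay factor, schedule length
x, T = 0, T0
for k in range(1, K + 1):
    for n in range(200):            # transitions attempted at this temperature
        x = metropolis_step(x, T)   # final state carries over to the next temperature
    T = alpha * T                   # exponential decrement T_k = alpha T_{k-1}
print("final state:", x, "final T:", round(T, 4))
```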


11.5 Simulated Annealing (3/3)

Simulated Annealing for Combinatorial Optimization


11.6 Gibbs Sampling (1/2)

Gibbs sampling is an iterative adaptive scheme that, on each step, generates a single value from the conditional distribution of one component of the random vector X, rather than generating all components at once.

Let X = (X_1, X_2, …, X_K) be a random vector of K components, and assume we know the conditional distributions P(X_k | X_−k), where X_−k = (X_1, …, X_{k−1}, X_{k+1}, …, X_K).

Gibbs sampling algorithm (Gibbs sampler):
1. Initialize x_1(0), x_2(0), …, x_K(0).
2. On iteration i, sample each component in turn, conditioning on the most recent values of all other components:
x_1(i) ~ P(X_1 | x_2(i−1), x_3(i−1), …, x_K(i−1))
x_2(i) ~ P(X_2 | x_1(i), x_3(i−1), …, x_K(i−1))
x_3(i) ~ P(X_3 | x_1(i), x_2(i), x_4(i−1), …, x_K(i−1))
⋮
x_k(i) ~ P(X_k | x_1(i), …, x_{k−1}(i), x_{k+1}(i−1), …, x_K(i−1))
⋮
x_K(i) ~ P(X_K | x_1(i), x_2(i), …, x_{K−1}(i))
3. If the termination condition is not met, set i ← i + 1 and go to step 2.
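A sketch for a case where the exact conditionals are available: a standard bivariate Gaussian with correlation ρ (an illustrative target distribution, not from the text), where X_1 | X_2 = x_2 ~ N(ρx_2, 1 − ρ²) and symmetrically for X_2 | X_1:

```python
import random

random.seed(0)
rho = 0.8
sd = (1 - rho**2) ** 0.5             # conditional standard deviation
x1, x2 = 0.0, 0.0
samples = []
for i in range(10000):
    x1 = random.gauss(rho * x2, sd)  # x1(i) ~ P(X1 | x2(i-1))
    x2 = random.gauss(rho * x1, sd)  # x2(i) ~ P(X2 | x1(i))
    samples.append((x1, x2))
# Per the ergodic theorem below, sample averages estimate expectations, e.g.:
print(sum(a * b for a, b in samples) / len(samples))   # approaches E[X1 X2] = 0.8
```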


11.6 Gibbs Sampling (2/2)


1. Convergence theorem. The random variable X_k(n) converges in distribution to the true probability distribution of X_k for k = 1, 2, …, K as n approaches infinity; that is,

$$\lim_{n\to\infty} P(X_k(n) \le x \mid x_k(0)) = P_{X_k}(x) \quad \text{for } k = 1, 2, \ldots, K$$

where P_{X_k}(x) is the marginal cumulative distribution function of X_k.

2. Rate-of-convergence theorem. The joint cumulative distribution of the random variables X_1(n), X_2(n), …, X_K(n) converges to the true joint cumulative distribution of X_1, X_2, …, X_K at a geometric rate in n.

3. Ergodic theorem. For any measurable function g of the random variables X_1, X_2, …, X_K whose expectation exists, we have

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} g(X_1(i), X_2(i), \ldots, X_K(i)) = E\left[g(X_1, X_2, \ldots, X_K)\right]$$

with probability 1 (i.e., almost surely).


11.7 Boltzmann Machine (1/5)

Figure 11.5: Architectural graph of the Boltzmann machine; K is the number of visible neurons, and L is the number of hidden neurons. The distinguishing features of the machine are: (1) the connections between the visible and hidden neurons are symmetric; (2) the symmetric connections are extended to the visible and hidden neurons.

The Boltzmann machine (BM) is a stochastic machine consisting of stochastic neurons with symmetric synaptic connections.

x: state vector of the BM; w_ji: synaptic connection from neuron i to neuron j.

Structure (weights):

$$w_{ji} = w_{ij} \;\;\forall i, j, \qquad w_{ii} = 0 \;\;\forall i$$

Energy:

$$E(\mathbf{x}) = -\frac{1}{2}\sum_i \sum_{j \ne i} w_{ji} x_i x_j$$

Probability:

$$P(\mathbf{X} = \mathbf{x}) = \frac{1}{Z}\exp\left(-\frac{E(\mathbf{x})}{T}\right)$$


11.7 Boltzmann Machine (2/5)


Consider three events:

A: X_j = x_j
B: {X_i = x_i}_{i=1}^K with i ≠ j
C: {X_i = x_i}_{i=1}^K

The joint event B excludes A, and the joint event C includes both A and B. Then

$$P(C) = P(A, B) = \frac{1}{Z}\exp\left(\frac{1}{2T}\sum_i \sum_{j,\, j\ne i} w_{ji} x_i x_j\right)$$

$$P(B) = \sum_{A} P(A, B) = \frac{1}{Z}\sum_{x_j} \exp\left(\frac{1}{2T}\sum_i \sum_{j,\, j\ne i} w_{ji} x_i x_j\right)$$

The component of the exponent involving x_j is

$$\frac{x_j}{2T}\sum_{i \ne j} w_{ji} x_i$$

Dividing, and cancelling the factors that do not involve x_j, gives

$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{1}{1 + \exp\left(-\dfrac{x_j}{T}\displaystyle\sum_{i \ne j} w_{ji} x_i\right)}$$

that is,

$$P\left(X_j = x \;\Big|\; \{X_i = x_i\}_{i=1,\, i\ne j}^{K}\right) = \varphi\left(\frac{x}{T}\sum_{i,\, i\ne j}^{K} w_{ji} x_i\right), \qquad \varphi(v) = \frac{1}{1 + \exp(-v)}$$


11.7 Boltzmann Machine (3/5)

Figure 11.6: Sigmoid-shaped function P(v).

The log-likelihood of the training sample ℑ, assuming the examples are statistically independent, is

$$L(\mathbf{w}) = \log \prod_{\mathbf{x}_\alpha \in \mathfrak{I}} P(\mathbf{X}_\alpha = \mathbf{x}_\alpha) = \sum_{\mathbf{x}_\alpha \in \mathfrak{I}} \log P(\mathbf{X}_\alpha = \mathbf{x}_\alpha)$$

Learning proceeds in two phases:
1. Positive phase. In this phase, the network operates in its clamped condition (i.e., under the direct influence of the training sample ℑ).
2. Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input.


11.7 Boltzmann Machine (4/5)


x_α: the state of the visible neurons (a subset of x)
x_β: the state of the hidden neurons (a subset of x)

Probability of the visible state:

$$P(\mathbf{X}_\alpha = \mathbf{x}_\alpha) = \frac{1}{Z}\sum_{\mathbf{x}_\beta}\exp\left(-\frac{E(\mathbf{x})}{T}\right), \qquad Z = \sum_{\mathbf{x}}\exp\left(-\frac{E(\mathbf{x})}{T}\right)$$

Log-likelihood function given the training data ℑ:

$$L(\mathbf{w}) = \log P(\mathfrak{I} \mid \mathbf{w}) = \sum_{\mathbf{x}_\alpha \in \mathfrak{I}}\left(\log \sum_{\mathbf{x}_\beta}\exp\left(-\frac{E(\mathbf{x})}{T}\right) - \log \sum_{\mathbf{x}}\exp\left(-\frac{E(\mathbf{x})}{T}\right)\right)$$

Derivative of the log-likelihood function:

$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T}\sum_{\mathbf{x}_\alpha \in \mathfrak{I}}\left(\sum_{\mathbf{x}_\beta} P(\mathbf{X}_\beta = \mathbf{x}_\beta \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha)\, x_j x_i - \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\, x_j x_i\right)$$


11.7 Boltzmann Machine (5/5)


Mean firing rate in the positive (clamped) phase:

$$\rho_{ji}^{+} = \langle x_j x_i \rangle^{+} = \sum_{\mathbf{x}_\alpha \in \mathfrak{I}} \sum_{\mathbf{x}_\beta} P(\mathbf{X}_\beta = \mathbf{x}_\beta \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha)\, x_j x_i$$

Mean firing rate in the negative (free-running) phase:

$$\rho_{ji}^{-} = \langle x_j x_i \rangle^{-} = \sum_{\mathbf{x}_\alpha \in \mathfrak{I}} \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\, x_j x_i$$

Thus, we may write

$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T}\left(\rho_{ji}^{+} - \rho_{ji}^{-}\right)$$

Gradient ascent to maximize L(w) yields the Boltzmann machine learning rule:

$$\Delta w_{ji} = \eta \frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \eta'\left(\rho_{ji}^{+} - \rho_{ji}^{-}\right), \qquad \eta' = \frac{\epsilon}{T}$$
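A toy sketch of this learning rule for a tiny machine with bipolar (±1) neurons; the sizes, random training patterns, and chain lengths are illustrative assumptions, and the exact correlations ρ± are replaced by Monte Carlo estimates obtained from the stochastic update P(x_j = +1 | rest) = φ((1/T) Σ_i w_ji x_i) derived earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 4, 2                                   # visible and hidden neurons
N = K + L
w = np.zeros((N, N))                          # symmetric weights, zero diagonal
data = rng.choice([-1, 1], size=(6, K))       # hypothetical training patterns
T, eta = 1.0, 0.05

def gibbs_sweep(x, clamped):
    """One sweep of stochastic updates over all unclamped neurons (in place)."""
    for j in range(N):
        if j in clamped:
            continue
        p = 1.0 / (1.0 + np.exp(-(w[j] @ x) / T))   # P(x_j = +1 | rest)
        x[j] = 1.0 if rng.random() < p else -1.0

for epoch in range(100):
    corr_pos = np.zeros((N, N))               # estimate of rho+
    for v in data:                            # positive phase: clamp the visibles
        x = np.concatenate([v, rng.choice([-1, 1], size=L)]).astype(float)
        for _ in range(10):                   # let the hidden neurons equilibrate
            gibbs_sweep(x, clamped=set(range(K)))
        corr_pos += np.outer(x, x)
    corr_pos /= len(data)

    corr_neg = np.zeros((N, N))               # estimate of rho-
    x = rng.choice([-1, 1], size=N).astype(float)
    for _ in range(100):                      # negative phase: free running
        gibbs_sweep(x, clamped=set())
        corr_neg += np.outer(x, x)
    corr_neg /= 100

    dw = eta * (corr_pos - corr_neg)          # Delta w = eta' (rho+ - rho-)
    np.fill_diagonal(dw, 0.0)                 # keep w_ii = 0
    w += (dw + dw.T) / 2                      # keep the weights symmetric
```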


11.8 Logistic Belief Nets

Figure 11.7: Directed (logistic) belief network.

A logistic belief net is a stochastic machine consisting of multiple layers of stochastic neurons with directed synaptic connections.

Parents of node j:

$$\text{pa}(X_j) \subseteq \{X_1, X_2, \ldots, X_{j-1}\}$$

Conditional probability:

$$P(X_j = x_j \mid X_1 = x_1, \ldots, X_{j-1} = x_{j-1}) = P(X_j = x_j \mid \text{pa}(X_j))$$

Calculation of the conditional probabilities uses weights satisfying:
1. w_ji = 0 for all X_i ∉ pa(X_j)
2. w_ji = 0 for i ≥ j (the network is acyclic)

Weight update rule:

$$\Delta w_{ji} = \eta \frac{\partial L(\mathbf{w})}{\partial w_{ji}}$$


11.9 Deep Belief Nets (1/4)

Figure 11.8: Neural structure of the restricted Boltzmann machine (RBM). Contrasting this with Fig. 11.5, we see that unlike the Boltzmann machine, there are no connections among the visible neurons or among the hidden neurons in the RBM.

Maximum-Likelihood Learning in a Restricted Boltzmann Machine (RBM)

Sequential pre-training by alternating Gibbs sampling:
1. Update the hidden states h in parallel, given the visible states x.
2. Do the same in reverse: update the visible states x in parallel, given the hidden states h.

Maximum-likelihood learning:

$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \rho_{ji}^{(0)} - \rho_{ji}^{(\infty)}$$

where ρ^(0) is the correlation measured on the data and ρ^(∞) is the correlation after the alternation has reached its stationary distribution.
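A sketch of RBM learning with the alternating updates above, for binary (0/1) units; it uses the common contrastive-divergence shortcut (CD-1), in which ρ^(∞) is approximated by the correlation after a single down-up alternation. The CD-1 shortcut, the omission of bias terms, and the random training patterns are assumptions for illustration, not statements from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 6, 3                                     # visible and hidden units
W = 0.01 * rng.standard_normal((L, K))          # hidden-by-visible weight matrix
data = rng.integers(0, 2, size=(8, K)).astype(float)   # hypothetical patterns
eta = 0.1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(100):
    for v0 in data:
        ph0 = sigmoid(W @ v0)                   # step 1: hidden given visible
        h0 = (rng.random(L) < ph0).astype(float)
        pv1 = sigmoid(W.T @ h0)                 # step 2: visible given hidden
        v1 = (rng.random(K) < pv1).astype(float)
        ph1 = sigmoid(W @ v1)                   # hidden probabilities after one step
        W += eta * (np.outer(ph0, v0) - np.outer(ph1, v1))  # ~ rho(0) - rho(1)
```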


11.9 Deep Belief Nets (2/4)

Figure 11.9: Top-down learning, using a logistic belief network of infinite depth.

Figure 11.10: A hybrid generative model in which the two top layers form a restricted Boltzmann machine and the lower two layers form a directed model. The weights shown with blue shaded arrows are not part of the generative model; they are used to infer the feature values given the data, but they are not used for generating data.


11.9 Deep Belief Nets (3/4)

Figure 11.11: Illustrating the progression of alternating Gibbs sampling in an RBM. After sufficiently many steps, the visible and hidden vectors are sampled from the stationary distribution defined by the current parameters of the model.


11.9 Deep Belief Nets (4/4)

Figure 11.12: The task of modeling the sensory (visible) data is divided into two subtasks.


11.10 Deterministic Annealing (1/5)


Deterministic annealing (DA) incorporates randomness into the energy function, which is then deterministically optimized at a sequence of decreasing temperatures (cf. simulated annealing, which makes random moves on the energy surface).

Clustering via deterministic annealing:
x: source (input) vector; y: reconstruction (output) vector

Distortion measure:

$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2$$

Expected distortion:

$$D = \sum_{\mathbf{x}}\sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})\, d(\mathbf{x}, \mathbf{y}) = \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})\, d(\mathbf{x}, \mathbf{y})$$

Probability of the joint event:

$$P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = \underbrace{P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})}_{\text{association probability}}\, P(\mathbf{X} = \mathbf{x})$$


11.10 Deterministic Annealing (2/5)

Table 11.2

Entropy as the randomness measure:

$$H(\mathbf{X}, \mathbf{Y}) = -\sum_{\mathbf{x}}\sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) \log P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})$$

The constrained optimization of D is the minimization of the Lagrangian F = D − TH, where

$$H(\mathbf{X}, \mathbf{Y}) = \underbrace{H(\mathbf{X})}_{\text{source entropy}} + \underbrace{H(\mathbf{Y} \mid \mathbf{X})}_{\text{conditional entropy}}$$

$$H(\mathbf{Y} \mid \mathbf{X}) = -\sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \log P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$$

Minimizing F with respect to the association probabilities yields the Gibbs distribution

$$P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right), \qquad Z_{\mathbf{x}} = \sum_{\mathbf{y}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)$$


11.10 Deterministic Annealing (3/5)


Substituting the Gibbs association probabilities into F gives

$$F^* = \min_{P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})} F = -T\sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \log Z_{\mathbf{x}}$$

Setting

$$\frac{\partial F^*}{\partial \mathbf{y}} = \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})\, \frac{\partial}{\partial \mathbf{y}} d(\mathbf{x}, \mathbf{y}) = 0 \quad \forall\, \mathbf{y} \in \Upsilon$$

the minimizing condition is

$$\frac{1}{N}\sum_{\mathbf{x}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})\, \frac{\partial}{\partial \mathbf{y}} d(\mathbf{x}, \mathbf{y}) = 0 \quad \forall\, \mathbf{y} \in \Upsilon$$

The deterministic-annealing algorithm consists of minimizing the Lagrangian F* with respect to the code vectors at a high value of temperature T and then tracking the minimum while the temperature T is lowered. A clustering sketch follows.
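For d(x, y) = ‖x − y‖², the minimizing condition says that each code vector is the association-weighted mean of the inputs. A sketch (the two-blob data, the use of two code vectors, and the cooling schedule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)),      # hypothetical two-blob data
               rng.normal(2, 0.5, (50, 2))])
Y = rng.normal(0, 0.1, (2, 2))                    # code (reconstruction) vectors
T, alpha = 10.0, 0.9

for k in range(60):
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # d(x, y) for all pairs
    d = d - d.min(axis=1, keepdims=True)          # stabilize exp; P is unchanged
    Pyx = np.exp(-d / T)
    Pyx /= Pyx.sum(axis=1, keepdims=True)         # P(Y = y | X = x) = exp(-d/T)/Z_x
    Y = (Pyx.T @ X) / Pyx.T.sum(axis=1, keepdims=True)   # weighted-mean condition
    T *= alpha                                    # lower the temperature
print(np.round(Y, 3))     # the code vectors separate onto the two blobs as T falls
```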


11.10 Deterministic Annealing (4/5)

Figure 11.13: Clustering at various phases, where B = 1/T. The lines are equiprobability contours, p = 1/2 in (b) and p = 1/3 elsewhere: (a) 1 cluster (B = 0), (b) 2 clusters (B = 0.0049), (c) 3 clusters (B = 0.0056), (d) 4 clusters (B = 0.0100), (e) 5 clusters (B = 0.0156), (f) 6 clusters (B = 0.0347), and (g) 19 clusters (B = 0.0605).


11.10 Deterministic Annealing (5/5)

Figure 11.14: Phase diagram for the case study in deterministic annealing, where B = 1/T. The number of effective clusters is shown for each phase.


11.11 Analogy of DA with EM (1/2)


Suppose we view the association probability P(Y = y | X = x) as the expected value of a random binary variable V_xy defined as

$$V_{\mathbf{x}\mathbf{y}} = \begin{cases} 1 & \text{if the source vector } \mathbf{x} \text{ is assigned to code vector } \mathbf{y} \\ 0 & \text{otherwise} \end{cases}$$

Then the two steps of DA correspond to the two steps of EM:
1. Step 1 of DA (= E-step of EM): compute the association probabilities P(Y = y | X = x).
2. Step 2 of DA (= M-step of EM): optimize the distortion measure d(x, y).


11.11 Analogy of DA with EM (2/2)


Let r denote the complete data, consisting of the observed (incomplete) data d = d(r) and the missing data z. The conditional pdf of r given the parameter vector θ is p_c(r | θ), and the incomplete-data pdf is

$$p_D(\mathbf{d} \mid \boldsymbol{\theta}) = \int_{\Re(\mathbf{d})} p_c(\mathbf{r} \mid \boldsymbol{\theta})\, d\mathbf{r}$$

where ℜ(d) is the subspace of ℜ determined by d = d(r).

Incomplete-data log-likelihood function: L(θ) = log p_D(d | θ)
Complete-data log-likelihood function: L_c(θ) = log p_c(r | θ)

Expectation-Maximization (EM) algorithm. Let θ̂(n) be the value of θ at iteration n of EM.

1. E-step:
$$Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(n)) = E_{\hat{\boldsymbol{\theta}}(n)}\left[L_c(\boldsymbol{\theta})\right]$$

2. M-step:
$$\hat{\boldsymbol{\theta}}(n+1) = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(n))$$

After an iteration of the EM algorithm, the incomplete-data log-likelihood function is not decreased:

$$L(\hat{\boldsymbol{\theta}}(n+1)) \ge L(\hat{\boldsymbol{\theta}}(n)), \quad n = 0, 1, 2, \ldots$$
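To make the E-step and M-step concrete, here is a compact sketch for a hypothetical 1-D mixture of two Gaussians with known unit variances and equal mixing weights, where the missing data z are the component memberships (all specifics here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = np.concatenate([rng.normal(-3, 1, 200),   # incomplete data from component 1
                    rng.normal(3, 1, 200)])   # and from component 2
mu = np.array([-1.0, 1.0])                    # theta_hat(0): initial means

for n in range(50):
    # E-step: responsibilities = expected values of the missing memberships z
    lik = np.exp(-0.5 * (d[:, None] - mu[None, :]) ** 2)
    r = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximize Q by re-estimating each mean as a responsibility-weighted average
    mu = (r * d[:, None]).sum(axis=0) / r.sum(axis=0)
print(np.round(mu, 3))    # approaches [-3, 3]; L(theta_hat(n)) never decreases
```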


Summary and Discussion

• Statistical mechanics as the mathematical basis for the formulation of stochastic simulation, optimization, and learning:
  1. Metropolis algorithm
  2. Simulated annealing
  3. Gibbs sampling
• Stochastic learning machines:
  1. (Classical) Boltzmann machine
  2. Restricted Boltzmann machine (RBM)
  3. Deep belief nets (DBN)
• Deterministic annealing (DA):
  1. For optimization: connection to simulated annealing (SA)
  2. For clustering: connection to expectation-maximization (EM)
