Deep Learning : Hopfield Networks

TRANSCRIPT

Page 1: Deep Learning : Hopfield Networks

Lecture 11 | 2015. 08. 12 | Do Hoerin


Page 2: Deep Learning : Hopfield Networks

Hopfield Networks (Lecture 11a)


Page 3: Deep Learning : Hopfield Networks

Introducing Hopfield Networks

• Energy-based model
• Composed of binary threshold units with recurrent connections
• Hard to analyze: such networks can settle to a stable state, oscillate, or behave chaotically
• If the connections are symmetric, there is a global energy function

$E = -\sum_i s_i b_i - \tfrac{1}{2}\sum_{i,j} s_i s_j w_{ij}, \qquad \Delta E_i = b_i + \sum_j s_j w_{ij}$
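As a concrete illustration of these two formulas, here is a minimal NumPy sketch; the weight matrix, biases, and state vector are made-up toy values, not anything from the lecture.

```python
import numpy as np

def energy(s, W, b):
    """Global energy E = -sum_i s_i b_i - 1/2 sum_{i,j} s_i s_j w_ij."""
    return -s @ b - 0.5 * s @ W @ s

def energy_gap(s, W, b, i):
    """Energy gap of unit i: Delta E_i = b_i + sum_j s_j w_ij."""
    return b[i] + W[i] @ s

# Toy example: 3 binary units, symmetric weights, zero diagonal (no self-connections).
W = np.array([[ 0.0, 2.0, -1.0],
              [ 2.0, 0.0,  3.0],
              [-1.0, 3.0,  0.0]])
b = np.zeros(3)
s = np.array([1.0, 0.0, 1.0])
print(energy(s, W, b), energy_gap(s, W, b, 0))
```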


Page 4: Deep Learning : Hopfield Networks

Settling to an Energy Minimum

• Start from a random state
• Update one unit at a time, in random order

−E = goodness = 3

[Diagram: five binary units with current states 1, 0, 1, 0, 0 and connection weights −4, 3, 2, 3, 3, −1, −1]

Page 5: Deep Learning : Hopfield Networks

Settling to an Energy Minimum

−E = goodness = 3

[Diagram: the same network; the states are still 1, 0, 1, 0, 0]

Page 6: Deep Learning : Hopfield Networks

Settling to an Energy Minimum

−E = goodness = 4; settled to a minimum

[Diagram: the same network after one unit has turned on; states 1, 1, 1, 0, 0]

Page 7: Deep Learning : Hopfield Networks

Settling to an Energy Minimum


Page 8: Deep Learning : Hopfield Networks

Settling to an Energy Minimum

• Two triangles in which three units mostly support each other

• Why do decisions need to be sequential?
• If units are updated simultaneously, the energy can go up
• Parallel updating can also produce oscillations

[Diagram: the same network with states 0, 1, 0, 1, 1]

−E = goodness = 5
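The sequential-update rule described above can be sketched in a few lines. This is a minimal, hypothetical settling loop (deterministic binary threshold units, updated one at a time in random order), not code from the lecture; the weights are toy values.

```python
import numpy as np

def settle(s, W, b, rng, sweeps=10):
    """Asynchronous updates of binary threshold units.

    Each probed unit turns on iff its energy gap b_i + sum_j s_j w_ij > 0.
    Because only one unit changes at a time, the energy never goes up."""
    n = len(s)
    for _ in range(sweeps):
        for i in rng.permutation(n):            # one unit at a time, random order
            s[i] = 1.0 if b[i] + W[i] @ s > 0 else 0.0
    return s

rng = np.random.default_rng(0)
W = np.array([[ 0.0, 3.0, -1.0],                # toy symmetric weights, zero diagonal
              [ 3.0, 0.0,  2.0],
              [-1.0, 2.0,  0.0]])
b = np.zeros(3)
s = rng.integers(0, 2, size=3).astype(float)    # random starting state
print(settle(s, W, b, rng))
```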

Page 9: Deep Learning : Hopfield Networks

A neat way to make use of this type of computation

• Hopfield proposed that memories could be energy minima of a neural net
• The net can then fill in a memory from an incomplete part
• Using energy minima to represent memories gives a content-addressable memory


Page 10: Deep Learning : Hopfield Networks

Storing Memories in a Hopfield Net

• We can store a binary state vector by incrementing the weight between any two units by the product of their activities
• Biases are treated as weights from a permanently on unit

$\Delta w_{ij} = s_i s_j$
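A minimal sketch of that storage rule, assuming (as is common) that the stored state vectors use +1/−1 activities; the two memories below are made up.

```python
import numpy as np

def store(memories):
    """One-shot Hebbian storage: for every memory s, increment w_ij by s_i * s_j."""
    _, n = memories.shape
    W = np.zeros((n, n))
    for s in memories:
        W += np.outer(s, s)                 # delta w_ij = s_i * s_j
    np.fill_diagonal(W, 0.0)                # no self-connections
    return W

# Two made-up memories over 6 units (+1/-1 states).
memories = np.array([[ 1, -1,  1,  1, -1, -1],
                     [-1, -1,  1, -1,  1,  1]], dtype=float)
W = store(memories)
```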


Page 11: Deep Learning : Hopfield Networks

Dealing with Spurious Minima in Hopfield Nets (Lecture 11b)


Page 12: Deep Learning : Hopfield Networks

Spurious Minima Limit Capacity

• Capacity: about 0.15N memories for a net of N units
• After storing M memories, each connection weight is an integer in the range [−M, M]
• The number of bits required to store the weights and biases is $N^2 \log_2(2M + 1)$ (worked example below)
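A quick, made-up numerical check of that bit count:

With $N = 100$ units and $M = 15$ stored memories (i.e. $0.15N$): $N^2 \log_2(2M + 1) = 100^2 \log_2 31 \approx 10{,}000 \times 4.95 \approx 49{,}500$ bits.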


Page 13: Deep Learning : Hopfield Networks

The Storage Capacity of a Hopfield Net

• Each time we memorize, we hope to create a new energy minimum

• But what if two nearby minima merge to create a minimum at an intermediate location?


Page 14: Deep Learning : Hopfield Networks

Increasing the Capacity

• Unlearning: gets rid of spurious minima and increases the memory capacity
• Unlearning vs. REM sleep?
• Pseudo-likelihood: instead of trying to store vectors in one shot, cycle through the training set many times
• Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector (sketched below)
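The sketch below is one rough way to realize that idea in NumPy; it is an interpretation, not the lecture's exact procedure. For each training vector and each unit, a perceptron-style update is applied to the unit's bias and incoming weights whenever the unit's threshold decision (made from the other units' states) disagrees with its target state. States are assumed to be +1/−1.

```python
import numpy as np

def perceptron_train(memories, epochs=50, lr=1.0):
    """Train each unit to take its correct state given all the other units.

    memories: (M, N) array of +1/-1 state vectors.  Note that trained this
    way the weight matrix is not explicitly forced to stay symmetric."""
    _, n = memories.shape
    W = np.zeros((n, n))
    b = np.zeros(n)
    for _ in range(epochs):
        for s in memories:
            for i in range(n):
                others = s.copy()
                others[i] = 0.0                        # use only the other units
                decision = 1.0 if b[i] + W[i] @ others > 0 else -1.0
                if decision != s[i]:                   # wrong -> perceptron update
                    b[i] += lr * s[i]
                    W[i] += lr * s[i] * others
    np.fill_diagonal(W, 0.0)
    return W, b
```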


Page 15: Deep Learning : Hopfield Networks

Hopfield Nets with Hidden Units (Lecture 11c)


Page 16: Deep Learning : Hopfield Networks

A Different Computational Role

• Use the network to construct interpretations of sensory input

[Diagram: a layer of hidden units (the interpretation) connected to a layer of visible units (the sensory input)]

Page 17: Deep Learning : Hopfield Networks

Example: What can we infer about 3D edges from 2D lines?

• The information that has been lost in the image is the 3D depth of each end of the 2D line


Page 18: Deep Learning : Hopfield Networks

An example : Interpreting a Line Drawing

• Use a 2D line unit for each possible 2D line
• Use a 3D line unit for each possible 3D line
• Make 3D lines support each other if they join in 3D
• Make them strongly support each other if they join at right angles


Page 19: Deep Learning : Hopfield Networks

Two Difficult Computational Issues

• Searching: How do we avoid getting trapped in local minima?
• Learning: How do we learn the weights on the connections between units?


Page 20: Deep Learning : Hopfield Networks

Using Stochastic Units to Improve Search (Lecture 11d)


Page 21: Deep Learning : Hopfield Networks

Noisy Networks Find Better Energy Minima

• Use random noise to escape from poor minima

$p(s_i = 1) = \dfrac{1}{1 + e^{-\Delta E_i / T}}$ (where $\Delta E_i = b_i + \sum_j s_j w_{ij}$, as before, and $T$ is the temperature)

[Diagram: two states A and B with energy gaps ΔE_A and ΔE_B]
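A minimal sketch of one such stochastic update, using the same toy weight/bias conventions as the earlier sketches (T is the temperature):

```python
import numpy as np

def stochastic_update(s, W, b, i, T, rng):
    """Turn unit i on with probability 1 / (1 + exp(-dE_i / T)).

    dE_i = b_i + sum_j s_j w_ij is the energy gap; as T -> 0 this
    recovers the deterministic binary threshold rule."""
    dE = b[i] + W[i] @ s
    p_on = 1.0 / (1.0 + np.exp(-dE / T))
    s[i] = 1.0 if rng.random() < p_on else 0.0
    return s
```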

Page 22: Deep Learning : Hopfield Networks

How Temperature Affects Transition Prob.

• $\dfrac{p(s_A)}{p(s_B)} = \dfrac{1 + e^{-\Delta E_B / T}}{1 + e^{-\Delta E_A / T}}$ → if $T$ decreases, the ratio increases
• So a low-temperature system is much better
• But it will take much more time!


Page 23: Deep Learning : Hopfield Networks

Approaching Thermal Equilibrium

• Thermal equilibrium does not mean that the system has settled down into the lowest-energy configuration!
• It is the probability distribution over configurations that settles down
• Imagine a huge ensemble of identical systems and start with any distribution we like over their configurations
• Keep applying the stochastic update rule to pick each system's next configuration
• We may reach a situation where the fraction of systems in each configuration remains constant


Page 24: Deep Learning : Hopfield Networks

How a Boltzmann Machine Models Data (Lecture 11e)


Page 25: Deep Learning : Hopfield Networks

Modeling Binary Data

• Assign a probability to every possible binary vector
• Useful for deciding whether other binary vectors come from the same distribution
• Can be used for monitoring a complex system to detect unusual behavior

$p(\text{model } i \mid \text{data}) = \dfrac{p(\text{data} \mid \text{model } i)}{\sum_j p(\text{data} \mid \text{model } j)}$
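A toy illustration of that posterior; the three candidate models and their likelihoods are made up, and a uniform prior over models is assumed (as the formula implies):

```python
import numpy as np

# Made-up likelihoods p(data | model j) for three candidate models.
likelihoods = np.array([0.020, 0.005, 0.001])

# Posterior over models with an implicit uniform prior.
posterior = likelihoods / likelihoods.sum()
print(posterior)          # approximately [0.77, 0.19, 0.04]
```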


Page 26: Deep Learning : Hopfield Networks

Causal Model

• First step: pick the hidden states from their prior distribution
• Second step: pick the visible states from their conditional distribution given the hidden states


$p(v) = \sum_h p(h)\, p(v \mid h)$

[Diagram: two binary hidden units generating three binary visible units]
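A tiny enumeration of that sum; the factored prior and conditional probabilities below are made-up placeholders for a model with two hidden and three visible binary units:

```python
import itertools
import numpy as np

p_h = np.array([0.3, 0.8])                  # made-up p(h_k = 1), independent hidden units
p_v_on = {                                  # made-up p(v_i = 1 | h) for each hidden config
    (0, 0): np.array([0.1, 0.2, 0.1]),
    (0, 1): np.array([0.7, 0.3, 0.6]),
    (1, 0): np.array([0.2, 0.8, 0.4]),
    (1, 1): np.array([0.9, 0.9, 0.7]),
}

def p_visible(v):
    """p(v) = sum_h p(h) * p(v | h), enumerating the four hidden configurations."""
    v = np.asarray(v)
    total = 0.0
    for h in itertools.product([0, 1], repeat=2):
        prior = np.prod([p_h[k] if h[k] else 1.0 - p_h[k] for k in range(2)])
        on = p_v_on[h]
        total += prior * np.prod(np.where(v == 1, on, 1.0 - on))
    return total

print(p_visible([1, 0, 1]))
```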

Page 27: Deep Learning : Hopfield Networks

How a Boltzmann Machine Generates Data

• Everything is defined in terms of the energies of joint configurations
• The energies of joint configurations are related to their probabilities in two ways:
• Simply define the probability as $p(v, h) \propto e^{-E(v,h)}$
• Or define it as the probability of finding the network in that joint configuration after we have updated all of the stochastic binary units


Page 28: Deep Learning : Hopfield Networks

Using Energy to Define Probabilities

$-E(v, h) = \sum_{i \in \text{vis}} v_i b_i + \sum_{k \in \text{hid}} h_k b_k + \sum_{i<j} v_i v_j w_{ij} + \sum_{i,k} v_i h_k w_{ik} + \sum_{k<l} h_k h_l w_{kl}$

• Probability of a joint configuration over both the visible and hidden units:

$p(v, h) = \dfrac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$

• Probability of a configuration of the visible units alone:

$p(v) = \dfrac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$
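For a network small enough to enumerate, these sums can be computed directly. A minimal sketch with two visible and two hidden units; the weights and biases are arbitrary made-up numbers:

```python
import itertools
import numpy as np

def neg_energy(v, h, b_v, b_h, W_vv, W_vh, W_hh):
    """-E(v,h): bias terms plus visible-visible, visible-hidden and hidden-hidden pair terms."""
    return (v @ b_v + h @ b_h
            + 0.5 * v @ W_vv @ v            # 0.5 * sum_{i,j} equals sum_{i<j} (symmetric, zero diagonal)
            + v @ W_vh @ h
            + 0.5 * h @ W_hh @ h)

# Made-up parameters: 2 visible and 2 hidden units.
b_v = np.zeros(2)
b_h = np.zeros(2)
W_vv = np.zeros((2, 2))
W_hh = np.array([[0.0, -1.0], [-1.0, 0.0]])
W_vh = np.array([[2.0, 0.0], [0.0, 1.0]])

configs = [(np.array(v, float), np.array(h, float))
           for v in itertools.product([0, 1], repeat=2)
           for h in itertools.product([0, 1], repeat=2)]
boltz = np.array([np.exp(neg_energy(v, h, b_v, b_h, W_vv, W_vh, W_hh)) for v, h in configs])
p_joint = boltz / boltz.sum()               # p(v, h); the denominator is the partition function

p_v = {}                                    # p(v) = sum_h p(v, h)
for (v, _), p in zip(configs, p_joint):
    p_v[tuple(v)] = p_v.get(tuple(v), 0.0) + p
print(p_v)
```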

Page 29: Deep Learning : Hopfield Networks

EX: How Weights Define a Distribution

[Diagram: a small Boltzmann machine with hidden units h1, h2 and visible units v1, v2; the weights shown are −1, +2, +1]

Page 30: Deep Learning : Hopfield Networks

Getting a Sample From the Model

• With even a few more hidden units there are exponentially many terms in these sums
• So use MCMC, starting from a random global configuration and applying the stochastic update rule until the chain reaches thermal equilibrium
• The probability of a global configuration is then related to its energy by the Boltzmann distribution (see the sketch below)
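A rough Gibbs-style sketch of that procedure, reusing the stochastic update rule from the earlier slide; the parameters are again made-up placeholders:

```python
import numpy as np

def sample_from_model(W, b, n_steps=5000, T=1.0, seed=0):
    """MCMC: start from a random global configuration and keep applying
    the stochastic update rule; after enough steps the state is roughly a
    sample from the Boltzmann distribution p(s) proportional to exp(-E(s)/T)."""
    rng = np.random.default_rng(seed)
    n = len(b)
    s = rng.integers(0, 2, size=n).astype(float)     # random global configuration
    for _ in range(n_steps):
        i = rng.integers(n)                          # pick one unit at random
        dE = b[i] + W[i] @ s                         # its energy gap
        s[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-dE / T)) else 0.0
    return s
```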


Page 31: Deep Learning : Hopfield Networks

Getting a Sample From the Posterior Distribution over Hidden Configurations for a Given Data Vector

• The number of possible hidden configurations is exponential, so again we need MCMC
• The procedure is the same as before, except that the visible units are kept clamped to the given data vector
• Only the hidden units are allowed to change states
• These samples are required for learning the weights
• Each hidden configuration is an explanation of the observed visible configuration
