Murphy's Machine Learning: undirected graphical models

ML study 5th

TODO p(y) = ?

The Hammersley-Clifford theorem

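Inference, learning, exponential family

Adjust the model so that the expectation of y under the data equals the expectation of y under the model p(y)

When y has too many nodes, the expectation of y under the model p(y) is hard to compute, so it is approximated by MCMC sampling

To find the state of a particular node, sample that node given all the other nodes (Gibbs sampling)

• Representation by cliques • Representation by edges

• !!! In the end, what we want from either a DGM or a UGM is the following

• We are given unsupervised data D={y1,y2,…yn}

• We do not treat the features of the data in D as independent; instead we connect them with a graph

• We build the p(y) that maximizes the probability (likelihood) of the observed D (by learning the parameters)

• We then use p(y) for whatever we need: inference over hidden variables (in LDA, which topic does this word belong to?) or the probability that some y occurs (what is the probability that a particular user gives a particular item a rating of 3?)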

19.1 Introduction

• UGMs > DGMs

• (1) they are symmetric and therefore more “natural” for certain domains, such as spatial or relational data

• UGMs < DGMs

• (1) the parameters are less interpretable and less modular, for reasons we explain in Section 19.3;

• (2) parameter estimation is computationally more expensive, for reasons we explain in Section 19.5.

Undirected Graphical Model = Markov Random Field = UGM = MRF

19.2 Conditional independence properties of UGMs

• global Markov property for UGMs.

• xA ⊥G xB | xC: once the nodes in set C are removed, no path remains between node sets A and B
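The global Markov property can be checked mechanically: remove the nodes in C and test whether any path from A to B survives. The sketch below does this with a breadth-first search; the function name `separated` and the 4-node chain are hypothetical, chosen only for illustration.

```python
from collections import deque

def separated(adj, A, B, C):
    """True if every path from A to B in the undirected graph passes through C."""
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in B:
            return False          # reached B without touching C
        for nb in adj[node]:
            if nb not in blocked and nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return True

# Hypothetical chain 1 - 2 - 3 - 4 (adjacency sets)
chain = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(separated(chain, {1}, {4}, {3}))    # node 3 blocks the only path
print(separated(chain, {1}, {4}, set()))  # nothing observed, path is open
```

Because the test is plain graph reachability, it needs no notion of edge direction, which is what makes CI statements in a UGM simpler to read off than d-separation in a DGM.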


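• Markov blanket: conditioned on the Markov blanket of node t, t is independent of all the remaining nodes

• In a UGM, the Markov blanket of a node is its set of neighbors = undirected local Markov property. = more natural

• pairwise Markov property: two nodes that are not neighbors are independent given the rest, because the Markov blanket (the neighbors) separates them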

• From the local Markov property, two nodes are conditionally independent given the rest if there is no direct edge between them.

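19.2.2 An undirected alternative to d-separation

• Converting a DGM into a UGM simply by dropping the edge directions is incorrect

• v-structure A→B←C has quite different CI properties than the corresponding undirected chain A−B−C.

• In the v-structure, observing B makes A and C dependent; in the chain, observing B makes A and C independent

• moralization = "marry" the unmarried parents (add an edge between parents that share a child)

• CI information is lost, for example 4⊥5|2

• The 4-5 moralization edge, which exists because of child 7, is unnecessary unless 7 or one of its descendants is observed

• If they are not observed, the 4-5 moralization link is not added, so 4⊥5|2 is preserved

• Therefore, to preserve 4⊥5|2, keep only {4,5,2} and their ancestors (the ancestral graph), remove all the descendants, then moralize what remains and drop the directions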
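The recipe above (restrict to the ancestral graph, moralize, then test plain separation) can be sketched as follows. The DAG here is a hypothetical 4-node stand-in (2→4, 2→5, 4→7, 5→7), not the figure from the slides, and all function names are made up for illustration.

```python
def ancestral_set(parents, nodes):
    """The given nodes plus all of their ancestors in the DAG."""
    keep, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents.get(n, ()))
    return keep

def moralize(parents, keep):
    """Undirected adjacency over `keep`: drop directions and marry co-parents."""
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents.get(child, ()) if p in keep]
        for p in ps:                       # parent - child edges
            adj[p].add(child)
            adj[child].add(p)
        for i, p in enumerate(ps):         # edges between co-parents
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    return adj

def separated(adj, a, b, given):
    """True if b cannot be reached from a once the `given` nodes are removed."""
    stack, seen = [a], {a}
    while stack:
        n = stack.pop()
        if n == b:
            return False
        for m in adj[n]:
            if m not in given and m not in seen:
                seen.add(m)
                stack.append(m)
    return True

def d_separated(parents, a, b, given):
    keep = ancestral_set(parents, {a, b} | set(given))
    return separated(moralize(parents, keep), a, b, set(given))

# Hypothetical DAG: 2 -> 4, 2 -> 5, 4 -> 7, 5 -> 7
parents = {4: [2], 5: [2], 7: [4, 5]}
print(d_separated(parents, 4, 5, {2}))     # child 7 is pruned, no 4-5 edge
print(d_separated(parents, 4, 5, {2, 7}))  # 7 observed, moral edge 4-5 appears
```

Moralizing the full graph would always add the 4-5 edge and hide the independence; pruning to the ancestral graph first keeps it whenever node 7 is unobserved.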

19.2.3 Comparing directed and undirected graphical models

• Which model has more “expressive power”, a DGM or a UGM?

• G is an I-map of a distribution p if I(G) ⊆I(p)

• define G to be a perfect map of p if I(G)=I(p),

• the graph can represent all (and only) the CI properties of the distribution.

• It turns out that DGMs and UGMs are perfect maps for different sets of distributions (see Figure 19.5).

• A→C←B. This asserts that A⊥B, and A ⊥̸ B|C.

• If we drop the arrows, we get A−C−B, which asserts A⊥B|C and A ⊥̸ B, which is incorrect.

Representable only by a DGM. Figure labels: B⊥D|A, B⊥D|A

clique

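• Clique: a subgraph in which all nodes are neighbors of one another (complete)

• 1-node cliques: the 23 nodes

• 2-node cliques: the 42 edges

• Light blue: 3-node cliques

• Dark blue: 4-node cliques

• A maximal clique is a clique that stops being a clique as soon as any one more node is added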
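The two definitions above translate directly into code. This is a minimal sketch on a hypothetical 4-node graph (a triangle 1-2-3 plus the edge 3-4), not the 23-node graph from the slide; the function names are illustrative.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """Every pair of nodes in the set is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def is_maximal_clique(adj, nodes):
    """A clique that cannot be extended by any other node of the graph."""
    nodes = set(nodes)
    if not is_clique(adj, nodes):
        return False
    return not any(is_clique(adj, nodes | {extra}) for extra in adj.keys() - nodes)

# Hypothetical graph: triangle 1-2-3 plus the pendant edge 3-4
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(is_clique(adj, {1, 2}), is_maximal_clique(adj, {1, 2}))   # clique, but node 3 extends it
print(is_maximal_clique(adj, {1, 2, 3}))                        # cannot be extended
print(is_maximal_clique(adj, {3, 4}))                           # an edge can be maximal too
```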

19.3 Parameterization of MRFs

19.3.1 The Hammersley-Clifford theorem

• joint distribution for a UGM is less natural than for a DGM

• Since there is no topological ordering associated with an undirected graph, we can’t use the chain rule to represent p(y).

• Instead of associating CPDs with each node, we associate potential functions or factors with each maximal clique in the graph.
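For reference, the factorization that the Hammersley-Clifford theorem guarantees for a positive distribution can be written as below. The notation (C for the set of maximal cliques, ψ_c for the potentials, θ for the parameters) follows the usual textbook convention and is assumed here, since the slide's equation did not survive extraction.

```latex
p(\mathbf{y} \mid \boldsymbol{\theta})
  = \frac{1}{Z(\boldsymbol{\theta})} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c \mid \boldsymbol{\theta}_c),
\qquad
Z(\boldsymbol{\theta})
  = \sum_{\mathbf{y}} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c \mid \boldsymbol{\theta}_c)
```

The theorem states that a positive distribution p(y) > 0 satisfies the CI properties of the graph if and only if it can be written in this form.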

The proof was never published, but can be found in e.g., (Koller and Friedman 2009)

A clique is a subgraph in which all nodes are neighbors. The theorem can probably be proved from the separation (Markov) properties of the UGM

http://web.kaist.ac.kr/~kyomin/Fall09MRF/Hammersley-Clifford_Theorem.pdf





pairwise MRF

• restrict the parameterization to the edges of the graph

This form is widely used due to its simplicity.

It is valid because the two nodes of an edge also form a clique

deep connection between UGMs and statistical physics.

• Gibbs distribution

• high probability states correspond to low energy configurations.

• Models of this form are known as energy based models, and are commonly used in physics and biochemistry, as well as some branches of machine learning
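The link between potentials and energies mentioned above is usually written as follows. This is the standard Gibbs-distribution form, stated here as an assumption about what the missing slide equation showed; E(y_c | θ_c) denotes the energy of the variables in clique c.

```latex
p(\mathbf{y} \mid \boldsymbol{\theta})
  = \frac{1}{Z(\boldsymbol{\theta})} \exp\Big( -\sum_{c} E(\mathbf{y}_c \mid \boldsymbol{\theta}_c) \Big),
\qquad
\psi_c(\mathbf{y}_c \mid \boldsymbol{\theta}_c) = \exp\big( -E(\mathbf{y}_c \mid \boldsymbol{\theta}_c) \big)
```

Taking the negative log of a potential gives an energy, which is why low-energy configurations are exactly the high-probability ones.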

Concept map

Global Markov Property

Local Markov Property

Pairwise Markov Property

The Hammersley-Clifford theorem

Potential function Energy function

Clique representation

Edge representation

Potential Function

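• Like a CPT, it returns a number once the states of the nodes are assigned

• However, it is not a probability (= it is not normalized)

Information about the edges

Information about the states

The potential for state (1,1) of nodes 1 and 2 is obtained by picking the weight w12 for the pair (1,2) from W and picking the state entry from ψ, which gives e^{wst}. The probability of a full state y of all nodes is the normalized product of these edge potentials: p(y) = (1/Z) ∏_{s∼t} ψst(ys, yt)

The node-pair information and the state information are combined.

Z is the sum of the potentials over all states; since y has 2^(number of nodes) possible states, it is intractable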
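To make the cost of Z concrete, here is a brute-force sketch for a tiny pairwise MRF with binary nodes. It assumes an Ising-style potential (e^{w} when the two endpoints agree, e^{-w} otherwise); the 3-node chain and the weights are hypothetical.

```python
from itertools import product
from math import exp

def unnormalized(y, weights):
    """Product of edge potentials: e^{w} if the endpoints agree, e^{-w} if not."""
    score = 1.0
    for (s, t), w in weights.items():
        score *= exp(w if y[s] == y[t] else -w)
    return score

def partition_function(n_nodes, weights):
    """Sum the unnormalized score over all 2^n_nodes joint states."""
    return sum(unnormalized(y, weights) for y in product((0, 1), repeat=n_nodes))

weights = {(0, 1): 1.0, (1, 2): 1.0}      # hypothetical 3-node chain 0 - 1 - 2
Z = partition_function(3, weights)
p_all_ones = unnormalized((1, 1, 1), weights) / Z
print(Z, p_all_ones)
```

The loop visits 2^3 = 8 states here; at 100 nodes it would need 2^100 terms, which is why exact computation of Z is ruled out for all but the smallest graphs.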

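19.3.2 Representing potential functions

• maximum entropy or a log-linear model

• When using a pairwise MRF, the φ function is written as follows

• In a UGM, learning means finding the weights θ

φc(yc) is a feature vector derived from the values of the variables yc

A K×K matrix, where K is the number of states of each node

It holds the weights for the states, and φ indicates (selects) the entry

The potential when the start node s is in state j and the end node t is in state k

This matches the exponential family form, which makes the computations easier later during learning

• Concretely, what does a potential function look like?

• It is a function of the states of the nodes in a subgraph (clique, edge) of the graph

• It maps different node states to different weights

• For example, + when the states are the same, − when they differ

Binary (indicator) representation of the state of the given node pair (s,t)

Weight information for each state of the given node pair (s,t)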
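The indicator-feature construction described above can be sketched in a few lines: φ_st is a one-hot vector over the K×K joint states of the pair, θ_st holds one weight per joint state, and the potential is exp(θ_st · φ_st). The value K = 2 and the weights in `theta_st` are hypothetical.

```python
from math import exp

K = 2  # number of states per node (assumption for this example)

def indicator_features(ys, yt, K):
    """phi_st(ys, yt): one-hot vector of length K*K with a 1 at position (ys, yt)."""
    phi = [0.0] * (K * K)
    phi[ys * K + yt] = 1.0
    return phi

def potential(theta_st, ys, yt, K):
    """psi_st(ys, yt) = exp(theta_st . phi_st(ys, yt))."""
    phi = indicator_features(ys, yt, K)
    return exp(sum(th * f for th, f in zip(theta_st, phi)))

# Hypothetical weights, flattened K x K matrix: + for equal states, - for different
theta_st = [1.0, -1.0,
            -1.0, 1.0]
print(potential(theta_st, 0, 0, K))  # e^{+1}: states agree
print(potential(theta_st, 0, 1, K))  # e^{-1}: states differ
```

Because the feature vector is one-hot, the dot product just looks up a single entry of θ_st, so this form is a tabular potential written in exponential-family notation.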

19.4 Examples of MRFs

19.4.1 Ising model

• arose from statistical physics

• define the following pairwise clique potential:

wst is the coupling strength between nodes s and t. If two nodes are not connected in the graph, we set wst=0.

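When wst > 0, neighboring nodes tend to take the same state. If wst is very large, the joint probability has its modes where all nodes are +1 and where all nodes are −1

When wst < 0, neighboring nodes tend to take different states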

add bias term.

Ising is MRF

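For the Ising model, computing Z requires summing over all 2^D bit vectors; this is equivalent to computing the matrix permanent, which is NP-hard

• If each node of the Ising model is continuous, the model takes a Gaussian form

• In that case Z can be computed quickly

• Learning is also easy: with log p̃(y) = −½ yᵀWy + bᵀy, W is the precision matrix Σ⁻¹ and the mean is μ = W⁻¹b, and the MLE has a closed form solution (sample mean, sample covariance)

• Things get easier in the continuous case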

19.4.2 Hopfield networks

• fully connected Ising model with a symmetric weight matrix, W = W^T

• W, b are learned using (approximate) MLE

• exact inference is intractable in this model, so use a coordinate descent algorithm known as iterative conditional modes (ICM). Memory recall = inference: finding the states of the nodes we want to know

• iterative conditional modes (ICM) = sets each node to its most likely (lowest energy) state, given all its neighbors.

• A Boltzmann machine generalizes the Hopfield / Ising model by including some hidden nodes, which makes the model representationally more powerful. Inference in such models often uses Gibbs sampling, which is a stochastic version of ICM

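Set ys to 1 if the weighted sum of the neighbors' states exceeds the bias of s

This is the condition under which ys = 1 has the lower energy

Since everything except ys is given, the energy of each state can be computed, and the state with the lower energy is chosen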
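The update described above can be sketched as a small ICM loop. To avoid committing to a sign convention for the bias, the sketch compares the two candidate energies directly; it assumes states in {−1, +1} and the energy E(y) = −Σ_{s<t} w_st y_s y_t − Σ_s b_s y_s. The stored pattern and the Hebbian-style weights are hypothetical stand-ins for parameters that would normally be learned.

```python
def energy(y, W, b):
    """E(y) = -sum_{s<t} W[s][t]*y[s]*y[t] - sum_s b[s]*y[s]  (assumed convention)."""
    n = len(y)
    pair = sum(W[s][t] * y[s] * y[t] for s in range(n) for t in range(s + 1, n))
    return -pair - sum(b[s] * y[s] for s in range(n))

def icm(y, W, b, max_sweeps=20):
    """Iterated conditional modes: set each node to its lowest-energy state in turn."""
    y = list(y)
    for _ in range(max_sweeps):
        changed = False
        for s in range(len(y)):
            best = min((-1, +1), key=lambda v: energy(y[:s] + [v] + y[s + 1:], W, b))
            if best != y[s]:
                y[s], changed = best, True
        if not changed:      # a full sweep without changes: local minimum reached
            break
    return y

# Hypothetical stored pattern and weights W = p p^T with a zero diagonal
p = [1, -1, 1, -1]
W = [[0 if s == t else p[s] * p[t] for t in range(4)] for s in range(4)]
b = [0.0] * 4

corrupted = [1, -1, 1, 1]          # last bit flipped
print(icm(corrupted, W, b))        # recovers the stored pattern
```

Recomputing the full energy for each candidate is wasteful but keeps the code close to the slide's description; a practical version would only evaluate the terms that involve node s.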

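• An MRF is in the exponential family! This matters later when deriving the learning equations

Differentiating log Z with respect to the weights gives the expected features under the model

A product of exponential-family terms is also in the exponential family; that is, the likelihood of an exponential-family model is also in the exponential family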

19.4.3 Potts model

• generalize the Ising model to multiple discrete states

• potential function

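critical value of the coupling strength J=1.44,

This rapid change in behavior as we vary a parameter of the system is called a phase transition

Definition of the energy function

If W > 0, a node of the UGM that is in the same state as its neighbors is in a low-energy state, while neighboring nodes in different states give a high-energy state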
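A minimal sketch of this energy for a Potts model, assuming a single shared coupling weight `w` and an energy of −w for every edge whose endpoints carry the same label. The 4-node chain and the labelings are hypothetical.

```python
def potts_energy(labels, edges, w):
    """E(y) = -w * (number of edges whose two endpoints share a label)."""
    return -w * sum(1 for s, t in edges if labels[s] == labels[t])

edges = [(0, 1), (1, 2), (2, 3)]           # hypothetical 4-node chain
uniform = [2, 2, 2, 2]                     # every neighbor pair agrees
mixed = [0, 1, 2, 0]                       # no neighbor pair agrees

print(potts_energy(uniform, edges, 1.0))   # lowest energy for w > 0
print(potts_energy(mixed, edges, 1.0))     # highest energy for w > 0
```

With w > 0 the uniform labeling has strictly lower energy than the mixed one, so it is the more probable configuration under p(y) ∝ exp(−E(y)).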

19.5 Learning

• ML and MAP parameter estimation for MRFs

• computationally expensive

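Expectation of the y state features under the data; expectation of y under the model p(y)

The log Z term is added N times and divided by N, so it stays as it is

w is updated for every edge; w = the model

Z requires a sum over all states of y (2^[number of nodes]), so it is intractable when the number of nodes is large, and an approximation such as MCMC is needed

Indicates the states of the first and second elements of y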
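The annotations above refer to the gradient of the average log-likelihood of a log-linear MRF. A reconstruction in standard notation is given below; the symbols (φ_c for clique features, θ_c for their weights, N for the number of training cases) are assumed, since the slide equation itself is missing.

```latex
\ell(\boldsymbol{\theta})
  = \frac{1}{N} \sum_{i=1}^{N} \log p(\mathbf{y}_i \mid \boldsymbol{\theta})
  = \frac{1}{N} \sum_{i=1}^{N} \Big[ \sum_{c} \boldsymbol{\theta}_c^{\top} \boldsymbol{\phi}_c(\mathbf{y}_i) \Big] - \log Z(\boldsymbol{\theta})

\frac{\partial \ell}{\partial \boldsymbol{\theta}_c}
  = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{\phi}_c(\mathbf{y}_i)
  - \mathbb{E}\big[ \boldsymbol{\phi}_c(\mathbf{y}) \mid \boldsymbol{\theta} \big]
```

The first term is the empirical expectation of the features (clamped to the data) and the second is the model expectation, which comes from differentiating log Z. At the optimum the two match, which is the moment-matching condition stated at the start of these notes.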

19.5.2 Training partially observed maxent models

• The case where some of the nodes are hidden nodes that cannot be observed

Expectation of the y state features under the data; expectation of y under the model p(y,h)
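With hidden nodes h, the gradient keeps the same "data term minus model term" shape, but both terms become expectations. The form below is the standard result for partially observed maxent models, written with assumed notation (φ_c for clique features, y_i for the visible part of training case i).

```latex
\frac{\partial \ell}{\partial \boldsymbol{\theta}_c}
  = \frac{1}{N} \sum_{i=1}^{N}
    \Big\{
      \mathbb{E}\big[ \boldsymbol{\phi}_c(\mathbf{h}, \mathbf{y}_i) \mid \boldsymbol{\theta} \big]
      - \mathbb{E}\big[ \boldsymbol{\phi}_c(\mathbf{h}, \mathbf{y}) \mid \boldsymbol{\theta} \big]
    \Big\}
```

In the first expectation the visible nodes are clamped to y_i and only h is averaged over; in the second, both h and y are averaged under the model.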

19.5.3 Approximate methods for computing the MLEs of MRFs

• no closed form solution for the ML or the MAP estimate, so use gradient-based optimizers

• the gradient requires inference (Z has to be computed) = learning is also intractable

19.5.5 Stochastic maximum likelihood

• log-likelihood for a fully observed MRF; expectation under the model p(y)

The expectation under the model is approximated by sampling
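A rough sketch of this idea for a tiny Ising model with states in {−1, +1}: the model expectation of y_s y_t is estimated from a handful of persistent Gibbs chains, and each edge weight moves along (data expectation − sampled model expectation). The function names, the constant step size, the number of chains, and the toy data are all assumptions made for illustration; the textbook algorithm uses a decreasing step size.

```python
import random
from math import exp

def gibbs_sweep(y, w, rng):
    """Resample every node once from p(y_s | rest), with log p~(y) = sum_st w_st*y_s*y_t."""
    for s in range(len(y)):
        field = 0.0
        for (a, b), w_ab in w.items():
            if a == s:
                field += w_ab * y[b]
            elif b == s:
                field += w_ab * y[a]
        p_plus = 1.0 / (1.0 + exp(-2.0 * field))   # p(y_s = +1 | neighbors)
        y[s] = 1 if rng.random() < p_plus else -1
    return y

def edge_stats(samples, edges):
    """Average of y_s * y_t over the samples, one value per edge."""
    return {e: sum(y[e[0]] * y[e[1]] for y in samples) / len(samples) for e in edges}

def sml(data, edges, n_nodes, steps=200, lr=0.1, n_chains=10, seed=0):
    """Stochastic maximum likelihood with persistent Gibbs chains (simplified)."""
    rng = random.Random(seed)
    w = {e: 0.0 for e in edges}
    chains = [[rng.choice((-1, 1)) for _ in range(n_nodes)] for _ in range(n_chains)]
    data_stats = edge_stats(data, edges)           # expectation under the data
    for _ in range(steps):
        chains = [gibbs_sweep(y, w, rng) for y in chains]   # chains persist across steps
        model_stats = edge_stats(chains, edges)    # sampled expectation under the model
        for e in edges:
            w[e] += lr * (data_stats[e] - model_stats[e])
    return w

# Hypothetical data on a 3-node chain where neighbors always agree
edges = [(0, 1), (1, 2)]
data = [(1, 1, 1), (-1, -1, -1)] * 5
w = sml(data, edges, n_nodes=3)
print(w)
```

Since neighbors always agree in this toy data, the data term is 1 for every edge and the sampled model term can never exceed 1, so the weights can only grow or stay put, pushing the model toward the agreeing configurations.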

Restricted Boltzmann Machines for CF

<·>model cannot be computed analytically in less than exponential time.
