
Page 1: Bayesian Learning

Bayesian Learning

Page 2: Bayesian Learning

Bayesian Reasoning

• Basic assumption
  – The quantities of interest are governed by probability distributions
  – These probabilities + observed data ==> reasoning ==> optimal decision
• Significance
  – Basis of algorithms that manipulate probabilities directly
    • e.g., naïve Bayes classifier
  – Framework for analyzing algorithms that do not manipulate probabilities
    • e.g., cross entropy, the inductive bias of decision trees, the MDL principle

Page 3: Bayesian Learning

Feature & Limitation

• Features of Bayesian learning
  – Each observed training example incrementally increases or decreases the estimated probability of a hypothesis
  – Prior knowledge: P(h), P(D|h)
  – Applicable to probabilistic prediction
  – Prediction by combining multiple hypotheses
• Problems
  – Requires initial knowledge of many probabilities
  – Significant computational cost

Page 4: Bayesian Learning

Bayes Theorem

• Terms
  – P(h) : prior probability of h
  – P(D) : prior probability that D will be observed
  – P(D|h) : probability of observing D given h (prior knowledge)
  – P(h|D) : posterior probability of h, given D
• Theorem

  P(h|D) = P(D|h) P(h) / P(D)

• Machine learning: the process of finding the most probable hypothesis from the given data

Page 5: Bayesian Learning

Example

• Medical diagnosis
  – P(cancer) = 0.008 , P(~cancer) = 0.992
  – P(+|cancer) = 0.98 , P(-|cancer) = 0.02
  – P(+|~cancer) = 0.03 , P(-|~cancer) = 0.97
  – P(cancer|+) ∝ P(+|cancer) P(cancer) = 0.0078
  – P(~cancer|+) ∝ P(+|~cancer) P(~cancer) = 0.0298
  – hMAP = ~cancer
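A minimal sketch of this computation in Python (the numbers are from the slide; the final normalization step recovers the actual posterior):

```python
# Bayes theorem applied to the medical-diagnosis example above.
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalized posteriors P(h|+) ∝ P(+|h) P(h)
score_cancer = p_pos_given_cancer * p_cancer      # 0.0078
score_not = p_pos_given_not * p_not               # 0.0298

h_map = "cancer" if score_cancer > score_not else "~cancer"
print(h_map)                                       # ~cancer

# Normalizing gives P(cancer|+) ≈ 0.21
p_cancer_given_pos = score_cancer / (score_cancer + score_not)
print(round(p_cancer_given_pos, 2))
```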

Page 6: Bayesian Learning

MAP hypothesis

• MAP (maximum a posteriori) hypothesis

  h_MAP = argmax_{h∈H} P(h|D)
        = argmax_{h∈H} P(D|h) P(h) / P(D)
        = argmax_{h∈H} P(D|h) P(h)

Page 7: Bayesian Learning

ML hypothesis

• Maximum likelihood (ML) hypothesis
  – Basic assumption: every hypothesis is equally probable a priori

  h_ML = argmax_{h∈H} P(D|h)

• Basic formulas
  – P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
  – P(B) = Σ_i P(B|A_i) P(A_i)

Page 8: Bayesian Learning

Bayes Theorem and Concept Learning

• Brute-force MAP learning
  – For each h in H, calculate P(h|D)
  – Output hMAP
• Assumptions
  – Noise-free data D
  – The target concept c is contained in the hypothesis space H
  – Every hypothesis is equally probable a priori
• Result

  P(h|D) = 1 / |VS_{H,D}|   (if h is consistent with D)
  P(h|D) = 0                (otherwise)

• Every consistent hypothesis is a MAP hypothesis
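A minimal brute-force MAP sketch under these assumptions (the hypothesis space and data below are hypothetical placeholders; P(D|h) is 1 when h is consistent with every training example and 0 otherwise):

```python
# Brute-force MAP learning over a small, hypothetical hypothesis space.
# Each hypothesis maps x -> label; the data is a list of (x, label) pairs.
hypotheses = {
    "always_true":  lambda x: True,
    "always_false": lambda x: False,
    "positive_x":   lambda x: x > 0,
}
data = [(1, True), (2, True), (-1, False)]

prior = {name: 1.0 / len(hypotheses) for name in hypotheses}      # uniform P(h)

def likelihood(h, data):
    # Noise-free setting: P(D|h) = 1 if h is consistent with D, else 0
    return 1.0 if all(h(x) == d for x, d in data) else 0.0

posterior = {name: likelihood(h, data) * prior[name] for name, h in hypotheses.items()}
total = sum(posterior.values())                                   # = |VS_{H,D}| / |H|
posterior = {name: p / total for name, p in posterior.items()}    # 1/|VS_{H,D}| for consistent h

h_map = max(posterior, key=posterior.get)
print(h_map, posterior)   # "positive_x" is the only consistent hypothesis here
```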

Page 9: Bayesian Learning

Bayes Theorem and Concept Learning (derivation)

P(h) = 1 / |H|   for every h in H

P(D|h) = 1   if d_i = h(x_i) for all d_i in D
P(D|h) = 0   otherwise

P(D) = Σ_{h_i∈H} P(D|h_i) P(h_i)
     = Σ_{h_i∈VS_{H,D}} 1 · (1/|H|)
     = |VS_{H,D}| / |H|

For h consistent with D:

  P(h|D) = P(D|h) P(h) / P(D)
         = (1 · 1/|H|) / (|VS_{H,D}| / |H|)
         = 1 / |VS_{H,D}|

Page 10: Bayesian Learning

Consistent learner

• Definition: a learning algorithm that outputs a hypothesis committing zero errors on the training examples
• Result
  – Every hypothesis output by a consistent learner is a MAP hypothesis
    • if the prior probability distribution over H is uniform
    • if the training data is deterministic and noise-free

Page 11: Bayesian Learning

ML and LSE hypothesis

• Least-squared-error hypothesis
  – Neural networks, curve fitting, linear regression
  – Continuous-valued target function
• Task: learn f, where d_i = f(x_i) + e_i
• Preliminaries
  – Probability densities, normal distribution
  – Independence of the target values
• Result

  h_ML = argmin_{h∈H} Σ_{i=1}^m (d_i - h(x_i))²

• Limitation: assumes noise only in the target values
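A minimal sketch, assuming the hypothesis space consists of linear functions h(x) = w·x + b, so that h_ML is simply the ordinary least-squares fit:

```python
import numpy as np

# Noisy targets d_i = f(x_i) + e_i from a hypothetical linear f
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
d = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# h_ML = argmin_h sum_i (d_i - h(x_i))^2, solved in closed form for a line
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, d, rcond=None)
print(w, b)   # close to the true slope 2.0 and intercept 1.0
```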

Page 12: Bayesian Learning

ML and LSE hypothesis (derivation)

h_ML = argmax_{h∈H} P(D|h)
     = argmax_h Π_{i=1}^m p(d_i|h)
     = argmax_h Π_{i=1}^m (1/√(2πσ²)) exp( -(d_i - h(x_i))² / 2σ² )
     = argmax_h Σ_{i=1}^m [ ln(1/√(2πσ²)) - (d_i - h(x_i))² / 2σ² ]
     = argmax_h Σ_{i=1}^m [ -(d_i - h(x_i))² ]
     = argmin_h Σ_{i=1}^m (d_i - h(x_i))²

Page 13: Bayesian Learning

ML hypothesis for predicting probability

• Task: learn g : g(x) = P(f(x) = 1)
• Question: what criterion should we optimize in order to find an ML hypothesis for g?
• Result: cross entropy

  h_ML = argmax_{h∈H} Σ_{i=1}^m [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]

  – Entropy function: -Σ_i p_i ln p_i
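A minimal sketch of the cross-entropy criterion, assuming h(x_i) are predicted probabilities and d_i ∈ {0, 1}:

```python
import numpy as np

def cross_entropy_objective(d, h_x):
    """Sum_i [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]  (to be maximized)."""
    h_x = np.clip(h_x, 1e-12, 1 - 1e-12)   # avoid log(0)
    return np.sum(d * np.log(h_x) + (1 - d) * np.log(1 - h_x))

d   = np.array([1, 0, 1, 1])
h_x = np.array([0.9, 0.2, 0.8, 0.6])
print(cross_entropy_objective(d, h_x))     # larger (closer to 0) is better
```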

Page 14: Bayesian Learning

ML hypothesis for predicting probability (derivation)

P(D|h) = Π_{i=1}^m P(x_i, d_i|h) = Π_{i=1}^m P(d_i|h, x_i) P(x_i)

P(d_i|h, x_i) = h(x_i)        if d_i = 1
P(d_i|h, x_i) = 1 - h(x_i)    if d_i = 0

i.e.  P(d_i|h, x_i) = h(x_i)^{d_i} (1 - h(x_i))^{1-d_i}

h_ML = argmax_h Π_{i=1}^m h(x_i)^{d_i} (1 - h(x_i))^{1-d_i} P(x_i)
     = argmax_h Π_{i=1}^m h(x_i)^{d_i} (1 - h(x_i))^{1-d_i}
     = argmax_h Σ_{i=1}^m [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]

Page 15: Bayesian Learning

Gradient search to ML in NN

• Let G(h,D) = cross entropy

  ∂G(h,D)/∂w_jk = Σ_{i=1}^m (d_i - h(x_i)) x_{ijk}

• By gradient ascent

  Δw_jk = η Σ_{i=1}^m (d_i - h(x_i)) x_{ijk}

• Compare with the backpropagation rule for squared error:

  Δw_jk = η Σ_{i=1}^m h(x_i)(1 - h(x_i))(d_i - h(x_i)) x_{ijk}   (BP)
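A minimal sketch of this weight update for a single sigmoid unit (the data and learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                              # inputs x_i (3 features)
d = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)    # binary targets d_i

w, eta = np.zeros(3), 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    h = sigmoid(X @ w)                                    # h(x_i)
    w += eta * X.T @ (d - h)                              # Δw_jk = η Σ_i (d_i - h(x_i)) x_ijk

print(np.mean((sigmoid(X @ w) > 0.5) == d))               # training accuracy of the fitted unit
```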

Page 16: Bayesian Learning

Gradient search to ML in NN (derivation)

Let G(h,D) = Σ_{i=1}^m [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]

∂G(h,D)/∂w_jk = Σ_i ∂G(h,D)/∂h(x_i) · ∂h(x_i)/∂w_jk
              = Σ_i ∂[ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]/∂h(x_i) · ∂h(x_i)/∂w_jk
              = Σ_i (d_i - h(x_i)) / ( h(x_i)(1 - h(x_i)) ) · ∂h(x_i)/∂w_jk

For a sigmoid unit,  ∂h(x_i)/∂w_jk = h(x_i)(1 - h(x_i)) x_{ijk}, so

∂G(h,D)/∂w_jk = Σ_{i=1}^m (d_i - h(x_i)) x_{ijk}

Page 17: Bayesian Learning

MDL principle

• Purpose: interpret the inductive bias and the MDL principle in terms of Bayesian methods
• Shannon and Weaver's optimal code length: -log2 p_i (bits)

  h_MAP = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]
        = argmin_{h∈H} [ -log2 P(D|h) - log2 P(h) ]
        = argmin_{h∈H} [ L_{C_H}(h) + L_{C_{D|H}}(D|h) ]

• MDL principle

  h_MDL = argmin_{h∈H} [ L_{C1}(h) + L_{C2}(D|h) ]
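A minimal sketch of the equivalence above, with hypothetical priors and likelihoods: minimizing total code length selects the same hypothesis as maximizing the posterior.

```python
import math

# Hypothetical priors P(h) and likelihoods P(D|h) for three hypotheses
p_h   = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
p_d_h = {"h1": 0.01, "h2": 0.10, "h3": 0.05}

def code_length(p):                       # Shannon optimal code length, in bits
    return -math.log2(p)

total_bits = {h: code_length(p_h[h]) + code_length(p_d_h[h]) for h in p_h}
h_mdl = min(total_bits, key=total_bits.get)

posterior_score = {h: p_d_h[h] * p_h[h] for h in p_h}
h_map = max(posterior_score, key=posterior_score.get)

print(h_mdl, h_map)   # the same hypothesis ("h2") under both criteria
```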

Page 18: Bayesian Learning

Bayes optimal classifier

• Motivation: the classification of a new instance is optimized by combining the predictions of all hypotheses
• Task: find the most probable classification of the new instance given the training data
• Answer: combine the predictions of all hypotheses, weighted by their posterior probabilities
• Bayes optimal classification

  argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)

• Limitation: significant computational cost ==> Gibbs algorithm

Page 19: Bayesian Learning

Bayes optimal classifier example

  P(h1|D) = .4 , P(-|h1) = 0 , P(+|h1) = 1
  P(h2|D) = .3 , P(-|h2) = 1 , P(+|h2) = 0
  P(h3|D) = .3 , P(-|h3) = 1 , P(+|h3) = 0

  Σ_{h_i∈H} P(+|h_i) P(h_i|D) = .4
  Σ_{h_i∈H} P(-|h_i) P(h_i|D) = .6

  argmax_{v_j∈{+,-}} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D) = -
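A minimal sketch of this computation (values from the slide):

```python
# Posterior over hypotheses and each hypothesis's prediction probabilities
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_label_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

# Bayes optimal classification: argmax_v sum_h P(v|h) P(h|D)
scores = {
    v: sum(p_label_given_h[h][v] * posterior[h] for h in posterior)
    for v in ("+", "-")
}
print(scores)                        # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))   # '-'
```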

Page 20: Bayesian Learning

Gibbs algorithm

• Algorithm
  – 1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H
  – 2. Use h to predict the classification of the next instance x
• Usefulness of the Gibbs algorithm
  – Haussler et al., 1994
  – E[error(Gibbs algorithm)] ≤ 2 · E[error(Bayes optimal classifier)]
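A minimal sketch, reusing the posterior and per-hypothesis predictions from the previous example (hypothetical values):

```python
import random

posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # each h's classification of x

def gibbs_classify(posterior, prediction):
    # Step 1: draw h from H according to the posterior P(h|D)
    h = random.choices(list(posterior), weights=list(posterior.values()), k=1)[0]
    # Step 2: use h to classify the new instance
    return prediction[h]

print(gibbs_classify(posterior, prediction))      # '+' with prob .4, '-' with prob .6
```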

Page 21: Bayesian Learning

Naïve Bayes classifier

• Naïve Bayes classifier

  v_MAP = argmax_{v_j∈V} P(a_1, a_2, ..., a_n|v_j) P(v_j)

  v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j)

• Difference
  – No explicit search through H
  – The hypothesis is formed simply by counting the frequency of existing examples
• m-estimate of probability = (n_c + m·p) / (n + m)
  – m : equivalent sample size , p : prior estimate of the probability
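A minimal sketch of the m-estimate (the counts n_c and n, the prior p, and the equivalent sample size m below are hypothetical):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(a|v): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# e.g. 3 of 9 "yes" examples have wind=strong; uniform prior p = 1/2, m = 4
print(m_estimate(n_c=3, n=9, p=0.5, m=4))   # 0.3846..., smoothed toward p
```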

Page 22: Bayesian Learning

example

• New instance: (outlook=sunny, temperature=cool, humidity=high, wind=strong)
• P(wind=strong|PlayTennis=yes) = 3/9 = .33
• P(wind=strong|PlayTennis=no) = 3/5 = .60
• P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .0053
• P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .0206
• vNB = no
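A minimal sketch of this decision. Only the two wind probabilities appear on the slide; the remaining factors below are the usual estimates from the PlayTennis training set (assumed here), which reproduce the slide's totals:

```python
# Naive Bayes decision for the instance above. The two wind probabilities are
# from the slide; the other estimates (P(yes)=9/14, P(sunny|yes)=2/9, ...) are
# assumed from the standard PlayTennis table.
p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # P(yes)·P(sunny|yes)·P(cool|yes)·P(high|yes)·P(strong|yes)
p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # P(no) ·P(sunny|no) ·P(cool|no) ·P(high|no) ·P(strong|no)

print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206
print("yes" if p_yes > p_no else "no")           # vNB = no
```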

Page 23: Bayesian Learning

Bayes Belief Networks

• Definition
  – Describe the joint probability distribution for a set of variables
  – Do not require all variables to be conditionally independent
  – Express partial dependence relationships among variables as probabilities
• Representation

Page 24: Bayesian Learning

Bayesian Belief Networks

Page 25: Bayesian Learning

Inference

• Task: infer the probability distribution for the target variables
• Methods
  – Exact inference: NP-hard
  – Approximate inference
    • Theoretically NP-hard
    • Practically useful
    • Monte Carlo methods

Page 26: Bayesian Learning

Learning

• Environments
  – Structure known + fully observable data
    • Easy: estimate the conditional probabilities as in the naïve Bayes classifier
  – Structure known + partially observable data
    • Gradient ascent procedure (Russell et al., 1995)
    • Similar to searching for the ML hypothesis maximizing P(D|h)

      w_ijk ← w_ijk + η Σ_{d∈D} P_h(y_ij, u_ik | d) / w_ijk

  – Structure unknown (next slide)
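A minimal sketch of one gradient-ascent step for a single conditional-probability-table entry, assuming a routine `posterior_prob(y_ij, u_ik, d)` (hypothetical; it would be supplied by a Bayes-net inference engine) that returns P_h(y_ij, u_ik | d):

```python
def gradient_ascent_step(w_ijk, data, posterior_prob, y_ij, u_ik, eta=0.01):
    """One update of the CPT entry w_ijk = P(Y_i = y_ij | U_i = u_ik).

    After all entries are updated, each row must be renormalized so that
    sum_j w_ijk = 1 and every entry stays in [0, 1].
    """
    grad = sum(posterior_prob(y_ij, u_ik, d) for d in data) / w_ijk
    return w_ijk + eta * grad
```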

Page 27: Bayesian Learning

Learning (2)

• Structure unknown
  – Bayesian scoring metric (Cooper and Herskovits, 1992)
  – K2 algorithm
    • Cooper and Herskovits, 1992
    • Heuristic greedy search
    • Requires fully observed data
  – Constraint-based approaches
    • Spirtes et al., 1993
    • Infer dependency and independency relationships from the data
    • Construct the network structure from these relationships

Page 28: Bayesian Learning

Learning (derivation of the gradient ascent rule)

Let w_ijk = P_h(Y_i = y_ij | U_i = u_ik), where U_i is the set of parents of node Y_i.

∂ ln P_h(D) / ∂w_ijk = Σ_{d∈D} ∂ ln P_h(d) / ∂w_ijk
                     = Σ_{d∈D} (1 / P_h(d)) · ∂P_h(d) / ∂w_ijk

Writing P_h(d) by summing over the values of Y_i and U_i:

  P_h(d) = Σ_{j',k'} P_h(d | y_ij', u_ik') P_h(y_ij' | u_ik') P_h(u_ik')
         = Σ_{j',k'} P_h(d | y_ij', u_ik') w_ij'k' P_h(u_ik')

so only the term with j' = j and k' = k survives the derivative:

  ∂P_h(d) / ∂w_ijk = P_h(d | y_ij, u_ik) P_h(u_ik)

Therefore

∂ ln P_h(D) / ∂w_ijk = Σ_{d∈D} P_h(d | y_ij, u_ik) P_h(u_ik) / P_h(d)
                     = Σ_{d∈D} P_h(y_ij, u_ik | d) P_h(d) P_h(u_ik) / ( P_h(y_ij, u_ik) P_h(d) )   (Bayes theorem)
                     = Σ_{d∈D} P_h(y_ij, u_ik | d) P_h(u_ik) / P_h(y_ij, u_ik)
                     = Σ_{d∈D} P_h(y_ij, u_ik | d) / P_h(y_ij | u_ik)
                     = Σ_{d∈D} P_h(y_ij, u_ik | d) / w_ijk

which gives the update rule  w_ijk ← w_ijk + η Σ_{d∈D} P_h(y_ij, u_ik | d) / w_ijk.

Page 29: Bayesian Learning

EM algorithm

• EM : Expectation-Maximization (estimation and maximization steps)
• Environment
  – Learning in the presence of unobserved variables
  – The form of the probability distribution is known
• Applications
  – Training Bayesian belief networks
  – Training radial basis function networks
  – Basis for many unsupervised clustering algorithms
  – Basis for the Baum-Welch forward-backward algorithm

Page 30: Bayesian Learning

K-means algorithm

• Environment: data generated at random from k normal distributions
• Task: find the mean value of each distribution
• Instance representation: < x_i, z_i1, z_i2 >
  – If z is known: use μ_ML = argmin_μ Σ_i (x_i - μ)²
  – Else: use the EM algorithm

Page 31: Bayesian Learning

K-means algorithm (EM steps)

• Initialize h = < μ1, μ2 >
• Estimation step: calculate E[z_ij]

  E[z_ij] = p(x_i | μ_j) / Σ_{n=1}^{2} p(x_i | μ_n)
          = exp( -(x_i - μ_j)² / 2σ² ) / Σ_{n=1}^{2} exp( -(x_i - μ_n)² / 2σ² )

• Maximization step: calculate a new ML hypothesis

  μ_j ← Σ_{i=1}^m E[z_ij] x_i / Σ_{i=1}^m E[z_ij]

==> converges to a local ML hypothesis
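A minimal sketch of these two steps for k = 2 one-dimensional Gaussians with known, equal σ (the data generation below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = np.concatenate([rng.normal(-2.0, sigma, 100), rng.normal(3.0, sigma, 100)])

mu = np.array([0.0, 1.0])                 # initialize h = <mu1, mu2>
for _ in range(50):
    # Estimation step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / 2 sigma^2)
    w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    e_z = w / w.sum(axis=1, keepdims=True)
    # Maximization step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

print(mu)   # close to the true means -2.0 and 3.0
```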

Page 32: Bayesian Learning

General statement of the EM algorithm

• Terms
  – θ : parameters of the underlying probability distribution
  – X : observed data
  – Z : unobserved data
  – Y = X ∪ Z : full data
  – h : current hypothesis of θ
  – h' : revised hypothesis
• Task: estimate θ from X

Page 33: Bayesian Learning

Guideline

• Search for the h' that maximizes the expected log-likelihood of the full data

  h' = argmax_{h'} E[ ln P(Y|h') ]

• Since Y depends on the unobserved Z, take the expectation given the current hypothesis h and the observed data X, and define the function Q

  Q(h'|h) = E[ ln P(Y|h') | h, X ]

Page 34: Bayesian Learning

EM algorithm

• Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X

  Q(h'|h) ← E[ ln P(Y|h') | h, X ]

• Maximization (M) step: replace h by the h' that maximizes Q

  h ← argmax_{h'} Q(h'|h)

• Converges to a local maximum

Page 35: Bayesian Learning

Derivation of the k-means EM steps

For the mixture of k Gaussians, the full-data description of a single instance is y_i = < x_i, z_i1, ..., z_ik >, and

  p(y_i|h') = p(x_i, z_i1, ..., z_ik | h')
            = (1/√(2πσ²)) exp( -(1/2σ²) Σ_{j=1}^k z_ij (x_i - μ'_j)² )

so

  ln P(Y|h') = Σ_{i=1}^m ln p(y_i|h')
             = Σ_{i=1}^m [ ln(1/√(2πσ²)) - (1/2σ²) Σ_{j=1}^k z_ij (x_i - μ'_j)² ]

Taking the expectation given h and X (only the z_ij are random):

  Q(h'|h) = E[ ln P(Y|h') | h, X ]
          = Σ_{i=1}^m [ ln(1/√(2πσ²)) - (1/2σ²) Σ_{j=1}^k E[z_ij] (x_i - μ'_j)² ]

where

  E[z_ij] = p(x_i|μ_j) / Σ_{n=1}^k p(x_i|μ_n)
          = exp( -(x_i - μ_j)² / 2σ² ) / Σ_{n=1}^k exp( -(x_i - μ_n)² / 2σ² )

Maximizing Q(h'|h) with respect to μ'_j gives the M-step

  μ_j ← Σ_{i=1}^m E[z_ij] x_i / Σ_{i=1}^m E[z_ij]