Bayesian Learning

Page 1: Bayesian Learning

Page 2: Bayesian Reasoning

• Basic assumption
– The quantities of interest are governed by probability distributions
– These probabilities + observed data ==> reasoning ==> optimal decisions

• Significance
– The basis of algorithms that manipulate probabilities directly
• e.g., naïve Bayes classifier
– A framework for analyzing algorithms that do not manipulate probabilities
• e.g., cross entropy, the inductive bias of decision trees, the MDL principle

Page 3: Feature & Limitation

• Features of Bayesian learning
– Each observed training example incrementally increases or decreases the estimated probability of a hypothesis
– Prior knowledge: P(h), P(D|h)
– Applicable to probabilistic prediction
– Prediction by combining multiple hypotheses

• Problems
– Requires initial knowledge of many probabilities
– Significant computational cost

Page 4: Bayes Theorem

• Terms
– P(h): prior probability of h
– P(D): prior probability that D will be observed
– P(D|h): probability of observing D given h (prior knowledge)
– P(h|D): posterior probability of h, given D

• Theorem

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

• Machine learning: the process of finding the most probable hypothesis given the data

Page 5: Example

• Medical diagnosis
– P(cancer) = 0.008, P(~cancer) = 0.992
– P(+|cancer) = 0.98, P(-|cancer) = 0.02
– P(+|~cancer) = 0.03, P(-|~cancer) = 0.97
– P(cancer|+) ∝ P(+|cancer)P(cancer) = 0.0078
– P(~cancer|+) ∝ P(+|~cancer)P(~cancer) = 0.0298
– hMAP = ~cancer
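The same arithmetic as a minimal Python sketch (the variable names are ours; the probabilities are the slide's):

```python
# MAP decision for the medical-diagnosis example above.
# P(h|+) is proportional to P(+|h) P(h); normalizing by P(+) does not
# change the argmax, so the unnormalized scores suffice for h_MAP.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not_cancer = 0.98, 0.03

score_cancer = p_pos_given_cancer * p_cancer              # 0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

print("h_MAP =", "cancer" if score_cancer > score_not_cancer else "~cancer")

# Normalized posterior, for reference:
print(score_cancer / (score_cancer + score_not_cancer))   # P(cancer|+) ~= 0.21
```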

Page 6: MAP hypothesis

• MAP (maximum a posteriori) hypothesis

$$h_{MAP} \equiv \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

Page 7: ML hypothesis

• Maximum likelihood (ML) hypothesis
– Basic assumption: every hypothesis is equally probable a priori

$$h_{ML} = \arg\max_{h \in H} P(D|h)$$

• Basic formulas
– $P(A \wedge B) = P(A|B)\,P(B) = P(B|A)\,P(A)$
– $P(B) = \sum_i P(B|A_i)\,P(A_i)$

Page 8: Bayes Theorem and Concept Learning

• Brute-force MAP learning (a sketch follows this slide)
– For each h in H, calculate P(h|D)
– Output hMAP
• Consistency assumptions
– Noise-free data D
– The target concept c is in the hypothesis space H
– Every hypothesis is equally probable a priori
• Result:

$$P(h|D) = \frac{1}{|VS_{H,D}|} \ \text{(if } h \text{ is consistent with } D\text{)}, \qquad P(h|D) = 0 \ \text{(otherwise)}$$

• Every consistent hypothesis is a MAP hypothesis
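A minimal brute-force MAP sketch, assuming a toy hypothesis space of threshold functions (the names `hypotheses` and `consistent` and the data are invented for illustration):

```python
from fractions import Fraction

# Toy concept learning: instances are integers, hypotheses are thresholds
# h_t(x) = (x >= t). Uniform prior P(h) = 1/|H|; P(D|h) = 1 if h is
# consistent with every training example, else 0.
hypotheses = [lambda x, t=t: x >= t for t in range(6)]
data = [(1, False), (3, True), (4, True)]  # (x, label) pairs, noise-free

def consistent(h):
    return all(h(x) == label for x, label in data)

version_space = [h for h in hypotheses if consistent(h)]

# P(h|D) = 1/|VS_{H,D}| for consistent h, 0 otherwise.
for i, h in enumerate(hypotheses):
    p = Fraction(1, len(version_space)) if consistent(h) else Fraction(0)
    print(f"h_{i}: P(h|D) = {p}")
# Every consistent hypothesis gets the same (maximal) posterior,
# so each one is a MAP hypothesis.
```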

Page 9: Bayes Theorem and Concept Learning (derivation)

Under the assumptions above, $P(h) = \frac{1}{|H|}$ for every $h \in H$, and

$$P(D|h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \text{ for each } d_i \text{ in } D \\ 0 & \text{otherwise} \end{cases}$$

$$P(D) = \sum_{h_i \in H} P(D|h_i)\,P(h_i) = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}$$

so for a hypothesis consistent with D,

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}$$

Page 10: Consistent learner

• Definition: an algorithm that outputs a hypothesis with zero error on the training examples
• Result:
– Every hypothesis output by a consistent learner == a MAP hypothesis
• if the prior probability distribution over H is uniform
• if the training data are deterministic and noise-free

Page 11: ML and LSE hypothesis

• Least-squared-error hypothesis
– NN, curve fitting, linear regression
– Continuous-valued target function
• Task: learn f, where d_i = f(x_i) + e_i
• Preliminaries:
– Probability densities, normal distribution
– Target-value independence
• Result:

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$

• Limitation: noise only in the target value

Page 12: ML and LSE hypothesis (derivation)

Assuming the noise $e_i$ follows a Normal distribution with variance $\sigma^2$ centered on the true target value:

$$
\begin{aligned}
h_{ML} &= \arg\max_{h \in H} P(D|h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i|h) \\
&= \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2} \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i - h(x_i))^2 \right) \\
&= \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2
\end{aligned}
$$
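A quick numerical check of this equivalence under the stated Gaussian-noise assumption (the line, noise level, and data below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
d = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)  # d_i = f(x_i) + e_i

# Least-squared-error fit: minimizes sum_i (d_i - h(x_i))^2.
slope, intercept = np.polyfit(x, d, 1)

# Gaussian log-likelihood of the data under a candidate line.
def log_likelihood(a, b, sigma=0.1):
    r = d - (a * x + b)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - r**2 / (2 * sigma**2))

# The LSE line has likelihood at least as high as any perturbed line:
print(log_likelihood(slope, intercept) >= log_likelihood(slope + 0.1, intercept))  # True
```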

Page 13: ML hypothesis for predicting probability

• Task: find g: g(x) = P(f(x) = 1)
• Question: what criterion should we optimize in order to find an ML hypothesis for g?
• Result: cross entropy

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ d_i \ln h(x_i) + (1 - d_i)\ln(1 - h(x_i)) \right]$$

– cf. the entropy function: $-\sum_i p_i \ln p_i$

Page 14: ML hypothesis for predicting probability (derivation)

Assuming the training examples are independent and $x_i$ is independent of $h$:

$$P(D|h) = \prod_{i=1}^{m} P(x_i, d_i|h) = \prod_{i=1}^{m} P(d_i|h, x_i)\,P(x_i)$$

$$P(d_i|h, x_i) = \begin{cases} h(x_i) & \text{if } d_i = 1 \\ 1 - h(x_i) & \text{if } d_i = 0 \end{cases} \;=\; h(x_i)^{d_i}\,(1 - h(x_i))^{1 - d_i}$$

$$
\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i}(1 - h(x_i))^{1 - d_i}\,P(x_i) \\
&= \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i}(1 - h(x_i))^{1 - d_i} \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i)\ln(1 - h(x_i))
\end{aligned}
$$

Page 15: Gradient search to ML in NN

Let G(h, D) = cross entropy. By gradient ascent,

$$\Delta w_{jk} = \eta\,\frac{\partial G(h,D)}{\partial w_{jk}}, \qquad w_{jk} \leftarrow w_{jk} + \eta \sum_{i=1}^{m} (d_i - h(x_i))\,x_{ijk}$$

cf. backpropagation with squared error:

$$w_{jk} \leftarrow w_{jk} + \eta \sum_{i=1}^{m} h(x_i)(1 - h(x_i))(d_i - h(x_i))\,x_{ijk} \quad \text{(BP)}$$
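A sketch of this rule for a single sigmoid unit on made-up separable data (`eta`, the weights, and the data are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                            # inputs x_i
d = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)   # targets d_i in {0,1}

w = np.zeros(3)
eta = 0.5
for _ in range(500):
    h = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid outputs h(x_i)
    # Cross-entropy gradient: dG/dw_jk = sum_i (d_i - h(x_i)) x_ijk;
    # we ascend on the mean gradient for a stable step size.
    w += eta * X.T @ (d - h) / len(X)

h = 1.0 / (1.0 + np.exp(-X @ w))
print(((h > 0.5) == d.astype(bool)).mean())  # training accuracy, near 1.0 here
```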

Page 16: Gradient search to ML in NN (derivation)

Let

$$G(h,D) = \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i)\ln(1 - h(x_i))$$

Then

$$
\begin{aligned}
\frac{\partial G(h,D)}{\partial w_{jk}} &= \sum_{i=1}^{m} \frac{\partial\big( d_i \ln h(x_i) + (1 - d_i)\ln(1 - h(x_i)) \big)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}} \\
&= \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)(1 - h(x_i))} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}
\end{aligned}
$$

For a sigmoid unit, $\dfrac{\partial h(x_i)}{\partial w_{jk}} = h(x_i)(1 - h(x_i))\,x_{ijk}$, so

$$\frac{\partial G(h,D)}{\partial w_{jk}} = \sum_{i=1}^{m} (d_i - h(x_i))\,x_{ijk}$$

Page 17: MDL principle

• Purpose: interpret inductive bias and the MDL principle via Bayesian methods
• Shannon and Weaver's optimal code length: $-\log_2 p_i$ (bits)

$$
\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} \big( \log_2 P(D|h) + \log_2 P(h) \big) \\
&= \arg\min_{h \in H} \big( -\log_2 P(D|h) - \log_2 P(h) \big)
\end{aligned}
$$

$$h_{MAP} = \arg\min_{h \in H} \big( L_{C_H}(h) + L_{C_{D|H}}(D|h) \big)$$

$$h_{MDL} = \arg\min_{h \in H} \big( L_{C_1}(h) + L_{C_2}(D|h) \big)$$
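A toy illustration of the trade-off h_MDL formalizes, with invented bit counts standing in for L_C1(h) and L_C2(D|h):

```python
import math

# h_MDL = argmin_h L_C1(h) + L_C2(D|h): bits to encode the hypothesis
# plus bits to encode the data given the hypothesis.
# All bit counts below are invented for illustration.
candidates = {
    # name: (bits to encode h, bits to encode D's exceptions given h)
    "small tree, 3 nodes":  (10.0, 25.0),   # simple, but misclassifies more
    "medium tree, 9 nodes": (30.0, 4.0),    # good balance
    "huge tree, 41 nodes":  (130.0, 0.0),   # fits D perfectly, costly to encode
}

h_mdl = min(candidates, key=lambda h: sum(candidates[h]))
print(h_mdl)  # medium tree, 9 nodes

# The same quantity in Bayesian terms: -log2 P(h) - log2 P(D|h);
# e.g. a likelihood of 1/16 costs -log2(1/16) = 4 bits.
print(-math.log2(1 / 16))  # 4.0
```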

Page 18: Bayes optimal classifier

• Motivation: the classification of a new instance is optimized by combining the predictions of all hypotheses
• Task: find the most probable classification of the new instance given the training data
• Answer: combine the predictions of all hypotheses
• Bayes optimal classification:

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D)$$

• Limitation: significant computational cost ==> Gibbs algorithm

Page 19: Bayes optimal classifier example

$$P(h_1|D) = .4, \quad P(-|h_1) = 0, \quad P(+|h_1) = 1$$
$$P(h_2|D) = .3, \quad P(-|h_2) = 1, \quad P(+|h_2) = 0$$
$$P(h_3|D) = .3, \quad P(-|h_3) = 1, \quad P(+|h_3) = 0$$

$$\sum_{h_i \in H} P(+|h_i)\,P(h_i|D) = .4, \qquad \sum_{h_i \in H} P(-|h_i)\,P(h_i|D) = .6$$

$$\arg\max_{v_j \in \{+,-\}} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D) = \; -$$
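The same weighted vote, transcribed directly into Python:

```python
# Posterior over hypotheses and each hypothesis's vote P(v|h).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
votes = {"h1": {"+": 1.0, "-": 0.0},
         "h2": {"+": 0.0, "-": 1.0},
         "h3": {"+": 0.0, "-": 1.0}}

# Bayes optimal classification: argmax_v sum_h P(v|h) P(h|D).
scores = {v: sum(votes[h][v] * posteriors[h] for h in posteriors)
          for v in ("+", "-")}
print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # -
```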

Page 20: Gibbs algorithm

• Algorithm
– 1. Choose h from H at random, according to the posterior probability distribution over H
– 2. Use h to predict the classification of x
• Usefulness of the Gibbs algorithm
– Haussler, 1994
– E[error(Gibbs algorithm)] ≤ 2 · E[error(Bayes optimal classifier)]
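A sketch of the two steps, reusing the illustrative posterior from the previous example:

```python
import random

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
votes = {"h1": "+", "h2": "-", "h3": "-"}  # each h's deterministic prediction

def gibbs_classify(rng=random):
    # 1. Choose h from H according to the posterior distribution over H.
    h, = rng.choices(list(posteriors), weights=list(posteriors.values()))
    # 2. Use h to predict the classification of x.
    return votes[h]

print(gibbs_classify())  # "+" with probability .4, "-" with probability .6
```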

Page 21: Naïve Bayes classifier

• Naïve Bayes classifier

$$v_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n|v_j)\,P(v_j)$$

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i|v_j)$$

• Differences
– No explicit search through H
– Probabilities estimated by counting the frequency of the training examples

• m-estimate of probability: $\dfrac{n_c + mp}{n + m}$
– m: equivalent sample size, p: prior estimate of the probability

Page 22: Example

• (outlook=sunny, temperature=cool, humidity=high, wind=strong)
• P(wind=strong|PlayTennis=yes) = 3/9 = .33
• P(wind=strong|PlayTennis=no) = 3/5 = .60
• P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = .0053
• P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = .0206
• vNB = no
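The full computation for this instance, using the conditional probabilities estimated from Mitchell's 14-example PlayTennis table:

```python
from math import prod

# P(attribute value | class), estimated by counting in the PlayTennis data.
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}
prior = {"yes": 9/14, "no": 5/14}
instance = ["sunny", "cool", "high", "strong"]

# v_NB = argmax_v P(v) * prod_i P(a_i|v)
scores = {v: prior[v] * prod(cond[v][a] for a in instance) for v in prior}
print(scores)                       # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))  # no
```

If any count were zero, each fraction would be smoothed with the m-estimate (n_c + mp)/(n + m) from the previous slide.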

Page 23: Bayesian Belief Networks

• Definition
– Describes the joint probability distribution for a set of variables
– Does not require all variables to be conditionally independent
– Expresses partial dependencies between variables as probabilities
• Representation

Page 24: Bayesian Belief Networks

[Figure-only slide; the example network diagram was not captured in the transcript]

Page 25: Inference

• Task: infer the probability distribution for the target variables
• Methods
– Exact inference: NP-hard
– Approximate inference
• theoretically NP-hard
• practically useful
• Monte Carlo methods (see the sketch after this list)
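A minimal Monte Carlo sketch on a hypothetical two-node network A → B (the CPT numbers are invented; the idea is rejection sampling for P(A|B=true)):

```python
import random

random.seed(0)
p_a = 0.3                            # P(A = true)
p_b_given = {True: 0.9, False: 0.2}  # P(B = true | A)

# Estimate P(A = true | B = true) by sampling the joint distribution and
# keeping only the samples consistent with the evidence B = true.
kept = []
for _ in range(100_000):
    a = random.random() < p_a
    b = random.random() < p_b_given[a]
    if b:
        kept.append(a)

print(sum(kept) / len(kept))  # ~0.66; exact value is .27/.41 = 0.6585...
```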

Page 26: Learning

• Environments
– Structure known + fully observable data
• easy: estimate the conditional probabilities by counting, as in the naïve Bayes classifier
– Structure known + partially observable data
• gradient ascent procedure (Russell et al., 1995)
• analogous to the ML hypothesis search: maximize P(D|h)

$$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik}|d)}{w_{ijk}}$$

– Structure unknown

Page 27: Learning (2)

• Structure unknown
– Bayesian scoring metric (Cooper & Herskovits, 1992)
– K2 algorithm
• Cooper & Herskovits, 1992
• heuristic greedy search
• fully observable data
– Constraint-based approach
• Spirtes et al., 1993
• infer dependency and independency relationships from the data
• construct the structure using these relationships

Page 28: Learning (derivation of the gradient ascent rule)

Writing $w_{ijk} = P_h(y_{ij}|u_{ik})$ for the conditional probability that variable $Y_i$ takes value $y_{ij}$ given parent values $u_{ik}$:

$$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{\partial \ln P_h(d)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)}\,\frac{\partial P_h(d)}{\partial w_{ijk}}$$

Expanding $P_h(d)$ over the values of $Y_i$ and its parents $U_i$:

$$P_h(d) = \sum_{j',k'} P_h(d|y_{ij'}, u_{ik'})\,P_h(y_{ij'}|u_{ik'})\,P_h(u_{ik'}) = \sum_{j',k'} P_h(d|y_{ij'}, u_{ik'})\,w_{ij'k'}\,P_h(u_{ik'})$$

Since $\partial w_{ij'k'}/\partial w_{ijk} = 0$ unless $j' = j$ and $k' = k$:

$$\frac{\partial P_h(d)}{\partial w_{ijk}} = P_h(d|y_{ij}, u_{ik})\,P_h(u_{ik}) = \frac{P_h(y_{ij}, u_{ik}|d)\,P_h(d)}{P_h(y_{ij}, u_{ik})}\,P_h(u_{ik}) = \frac{P_h(y_{ij}, u_{ik}|d)\,P_h(d)}{w_{ijk}}$$

using Bayes' theorem and $P_h(y_{ij}, u_{ik}) = w_{ijk}\,P_h(u_{ik})$. Therefore

$$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik}|d)}{w_{ijk}}$$

which gives the update rule $w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik}|d)}{w_{ijk}}$.

Page 29: EM algorithm

• EM: estimation, maximization
• Environment
– learning in the presence of unobserved variables
– the form of the probability distribution is known
• Applications
– training Bayesian belief networks
– training radial basis function networks
– basis for many unsupervised clustering algorithms
– basis for the Baum-Welch forward-backward algorithm

Page 30: K-means algorithm

• Environment: data generated at random from k normal distributions
• Task: find the mean values of each distribution
• Instance: < x_i, z_i1, z_i2 >
– if z is known, use

$$\mu_{ML} = \arg\min_{\mu} \sum_i (x_i - \mu)^2$$

– else, use the EM algorithm

Page 31: K-means algorithm

• Initialize $h = \langle \mu_1, \mu_2 \rangle$
• Calculate E[z_ij]:

$$E[z_{ij}] = \frac{p(x_i|\mu_j)}{\sum_{n=1}^{2} p(x_i|\mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

• Calculate a new ML hypothesis:

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\,x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$

==> converges to a local ML hypothesis
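These two steps in code, on synthetic data from two normal distributions (the true means 0 and 6 are our choice; σ is assumed known, as on the slide):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
# Data drawn at random from two normal distributions (true means 0 and 6).
x = np.concatenate([rng.normal(0, sigma, 200), rng.normal(6, sigma, 200)])

mu = np.array([1.0, 2.0])  # initialize h = <mu_1, mu_2>
for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
    w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma**2))
    e_z = w / w.sum(axis=1, keepdims=True)
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij].
    mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

print(mu)  # close to the true means [0, 6]
```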

Page 32: General statement of the EM algorithm

• Terms
– θ: parameters of the underlying probability distribution
– X: observed data from each distribution
– Z: unobserved data
– Y = X ∪ Z
– h: current hypothesis of θ
– h': revised hypothesis
• Task: estimate θ from X

Page 33: Guideline

• Search for the h' that maximizes the expected log-likelihood:

$$h' = \arg\max_{h'} E[\ln P(Y|h')]$$

• Since Y is only partially observed, take the expectation assuming θ = h (the current hypothesis) and given the observed X, i.e., calculate the function Q:

$$Q(h'|h) = E[\ln P(Y|h')\,|\,h, X]$$

Page 34: EM algorithm

• Estimation step: calculate Q(h'|h) using the current hypothesis h and the observed data X:

$$Q(h'|h) \leftarrow E[\ln P(Y|h')\,|\,h, X]$$

• Maximization step: replace h by the h' that maximizes Q:

$$h \leftarrow \arg\max_{h'} Q(h'|h)$$

• Converges to a local maximum

Page 35: EM algorithm (derivation for k-means)

For the mixture of k normal distributions, the full description of each instance is $y_i = \langle x_i, z_{i1}, \ldots, z_{ik} \rangle$, and

$$p(y_i|h') = p(x_i, z_{i1}, \ldots, z_{ik}|h') = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i - \mu_j')^2}$$

$$\ln P(Y|h') = \sum_{i=1}^{m} \ln p(y_i|h') = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i - \mu_j')^2 \right)$$

Because this expression is linear in the $z_{ij}$, taking the expectation simply replaces each $z_{ij}$ by $E[z_{ij}]$:

$$Q(h'|h) = E[\ln P(Y|h')] = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i - \mu_j')^2 \right)$$

where

$$E[z_{ij}] = \frac{p(x_i|\mu_j)}{\sum_{n=1}^{k} p(x_i|\mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

Maximizing Q over the $\mu_j'$ yields the update rule

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\,x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$