Bayesian Learning
Bayesian Reasoning
• Basic assumption
  – The quantities of interest are governed by probability distributions
  – These probabilities + observed data ==> reasoning ==> optimal decisions
• Significance
  – Foundation of algorithms that manipulate probabilities directly
    • e.g., naïve Bayes classifier
  – A framework for analyzing algorithms that do not manipulate probabilities
    • e.g., cross entropy, the inductive bias of decision trees, the MDL principle
Features & Limitations
• Features of Bayesian learning
  – Each observed training example incrementally increases or decreases the estimated probability of a hypothesis
  – Prior knowledge: P(h), P(D|h)
  – Supports probabilistic prediction – prediction by combining multiple hypotheses
• Limitations
  – Requires initial knowledge of many probabilities
  – Significant computational cost
Bayes Theorem
• Terms
  – P(h): prior probability of h
  – P(D): prior probability that D will be observed
  – P(D|h): probability of observing D given h (the likelihood; part of the prior knowledge)
  – P(h|D): posterior probability of h, given D
• Theorem

  P(h|D) = \frac{P(D|h) P(h)}{P(D)}

• Machine learning: the process of finding the most probable hypothesis given the observed data
Example
• Medical diagnosis
  – P(cancer) = 0.008, P(~cancer) = 0.992
  – P(+|cancer) = 0.98, P(-|cancer) = 0.02
  – P(+|~cancer) = 0.03, P(-|~cancer) = 0.97
  – P(cancer|+) ∝ P(+|cancer) P(cancer) = 0.0078
  – P(~cancer|+) ∝ P(+|~cancer) P(~cancer) = 0.0298
  – hMAP = ~cancer
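The MAP decision above can be checked in a few lines of Python; this is a small sketch using only the numbers given on the slide.

```python
# MAP decision for the medical-diagnosis example (numbers from the slide).
priors = {"cancer": 0.008, "~cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "~cancer": 0.03}   # P(+ | h)

# Unnormalized posteriors: P(h | +) proportional to P(+ | h) P(h)
unnormalized = {h: likelihood_pos[h] * priors[h] for h in priors}
print(unnormalized)                      # {'cancer': 0.00784, '~cancer': 0.02976}

h_map = max(unnormalized, key=unnormalized.get)
print(h_map)                             # '~cancer'

# Normalizing gives the actual posterior P(cancer | +)
total = sum(unnormalized.values())
print(unnormalized["cancer"] / total)    # about 0.21
```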
MAP hypothesis
• MAP (maximum a posteriori) hypothesis

  h_{MAP} = \arg\max_{h \in H} P(h|D)
          = \arg\max_{h \in H} \frac{P(D|h) P(h)}{P(D)}
          = \arg\max_{h \in H} P(D|h) P(h)
ML hypothesis
• Maximum likelihood (ML) hypothesis
  – Basic assumption: every hypothesis is equally probable a priori

  h_{ML} = \arg\max_{h \in H} P(D|h)

• Basic formulas
  – P(A \wedge B) = P(A|B) P(B) = P(B|A) P(A)
  – P(B) = \sum_i P(B|A_i) P(A_i)
Bayes Theorem and Concept Learning
• Brute-force MAP learning
  – For each h in H, calculate the posterior P(h|D)
  – Output the hypothesis hMAP with the highest posterior
• Assumptions
  – Noise-free training data D
  – The target concept c is contained in the hypothesis space H
  – Every hypothesis is equally probable a priori
• Result

  P(h|D) = \frac{1}{|VS_{H,D}|}   if h is consistent with D
  P(h|D) = 0                      otherwise

• Every consistent hypothesis is a MAP hypothesis
• Derivation
  – P(h) = \frac{1}{|H|} for every h \in H
  – P(D|h) = 1 if d_i = h(x_i) for every d_i \in D, and 0 otherwise
  – P(D) = \sum_{h_i \in H} P(D|h_i) P(h_i) = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}
  – Therefore, for h consistent with D:

    P(h|D) = \frac{P(D|h) P(h)}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}
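As a concrete illustration of brute-force MAP learning, here is a small Python sketch over a toy hypothesis space (threshold functions on numbers); the hypothesis space and data are assumptions for illustration, not from the slide. Under a uniform prior and noise-free data, every consistent hypothesis ends up with posterior 1/|VS_{H,D}|.

```python
# Brute-force MAP learning over a toy hypothesis space H = {x >= t : t = 0..5}.
hypotheses = [lambda x, t=t: x >= t for t in range(6)]
data = [(2, False), (4, True), (5, True)]                # (x_i, d_i), noise-free

prior = 1.0 / len(hypotheses)                            # P(h) = 1/|H|
# P(D|h) is 1 if h is consistent with every example, 0 otherwise
likelihood = [float(all(h(x) == d for x, d in data)) for h in hypotheses]

p_d = sum(l * prior for l in likelihood)                 # P(D) = |VS_{H,D}| / |H|
posterior = [l * prior / p_d for l in likelihood]        # 1/|VS_{H,D}| or 0
print(posterior)   # the two consistent hypotheses (t = 3, 4) each get 0.5
```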
Consistent learner
• Definition: a learning algorithm that outputs a hypothesis committing zero error over the training examples
• Result
  – Every consistent hypothesis is a MAP hypothesis
  – Every consistent learner outputs a MAP hypothesis
    • if the prior is a uniform probability distribution over H
    • if the training data are deterministic and noise-free
ML and LSE hypothesis
• Least-squared-error hypothesis
  – Neural networks, curve fitting, linear regression
  – Continuous-valued target function
• Task: learn f from training values d_i = f(x_i) + e_i, where e_i is random noise
• Preliminaries
  – Probability densities, the Normal distribution
  – Training examples are mutually independent given h
• Result

  h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2

• Limitation: only noise in the target value is modeled
• Derivation (assuming e_i \sim Normal(0, \sigma^2))

  h_{ML} = \arg\max_{h \in H} p(D|h)
         = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}
         = \arg\max_{h \in H} \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(d_i - h(x_i))^2}{2\sigma^2} \right)
         = \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2
         = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2
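A small numerical check of this equivalence: under Gaussian noise with known σ, the hypothesis that maximizes the log-likelihood is exactly the one that minimizes the sum of squared errors. The hypothesis space below (lines a·x + b evaluated on a grid) and the generated data are illustrative assumptions.

```python
# Least squares vs. maximum likelihood under Gaussian noise.
import math, random

random.seed(0)
true_a, true_b, sigma = 2.0, -1.0, 0.5
xs = [i / 10 for i in range(30)]
ds = [true_a * x + true_b + random.gauss(0, sigma) for x in xs]   # d_i = f(x_i) + e_i

def sse(a, b):                      # sum of squared errors of h(x) = a*x + b
    return sum((d - (a * x + b)) ** 2 for x, d in zip(xs, ds))

def log_likelihood(a, b):           # sum_i ln N(d_i ; a*x_i + b, sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - (a * x + b)) ** 2 / (2 * sigma ** 2) for x, d in zip(xs, ds))

# Evaluate both criteria over a small grid of candidate hypotheses (a, b).
grid = [(a / 10, b / 10) for a in range(0, 41) for b in range(-20, 11)]
print(min(grid, key=lambda h: sse(*h)))             # least-squared-error hypothesis
print(max(grid, key=lambda h: log_likelihood(*h)))  # ML hypothesis: the same (a, b)
```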
ML hypothesis for predicting probability
• Task: learn a nondeterministic target function, i.e., find g with g(x) = P(f(x) = 1)
• Question: what criterion should we optimize in order to find an ML hypothesis for g?
• Result: cross entropy

  h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln (1 - h(x_i))

  – Related entropy function: -\sum_i P_i \ln P_i
• Derivation

  P(D|h) = \prod_{i=1}^{m} P(x_i, d_i | h) = \prod_{i=1}^{m} P(d_i | h, x_i) P(x_i)

  P(d_i | h, x_i) = h(x_i) if d_i = 1, and 1 - h(x_i) if d_i = 0
                  = h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i}

  h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i)
         = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i}
         = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln (1 - h(x_i))
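This criterion is just the Bernoulli log-likelihood of the observed labels; an illustrative sketch (with made-up predictions) shows that a hypothesis whose g(x_i) tracks the labels scores higher.

```python
# Cross-entropy criterion = Bernoulli log-likelihood of the labels d_i.
import math

def cross_entropy_ll(h_vals, ds):
    """sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))"""
    return sum(d * math.log(p) + (1 - d) * math.log(1 - p)
               for p, d in zip(h_vals, ds))

ds = [1, 0, 1, 1, 0]                 # observed labels d_i
good = [0.9, 0.2, 0.8, 0.7, 0.1]     # a hypothesis whose g(x_i) matches the labels well
bad  = [0.5, 0.5, 0.5, 0.5, 0.5]     # an uninformative hypothesis
print(cross_entropy_ll(good, ds))    # about -1.01 (higher likelihood)
print(cross_entropy_ll(bad, ds))     # about -3.47
```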
Gradient search to ML in NN
Let G(h, D) be the cross entropy:

  G(h, D) = \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln (1 - h(x_i))

• Derivative with respect to a weight w_{jk} of the sigmoid output unit:

  \frac{\partial G(h,D)}{\partial w_{jk}}
    = \sum_{i=1}^{m} \frac{\partial G(h,D)}{\partial h(x_i)} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}
    = \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)(1 - h(x_i))} \cdot \frac{\partial h(x_i)}{\partial w_{jk}}

  For a sigmoid unit, \frac{\partial h(x_i)}{\partial w_{jk}} = h(x_i)(1 - h(x_i)) \, x_{ijk}, so

  \frac{\partial G(h,D)}{\partial w_{jk}} = \sum_{i=1}^{m} (d_i - h(x_i)) \, x_{ijk}

• Weight update by gradient ascent:

  w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} (d_i - h(x_i)) \, x_{ijk}

• Compare with the backpropagation (squared-error) update:

  \Delta w_{jk} = \eta \sum_{i=1}^{m} h(x_i)(1 - h(x_i)) (d_i - h(x_i)) \, x_{ijk}    (BP)
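Below is a sketch of this gradient-ascent rule for a single sigmoid unit (equivalently, logistic regression); the toy data, learning rate, and iteration count are assumptions for illustration.

```python
# Gradient ascent on the cross entropy G(h, D) for one sigmoid unit,
# using delta w_jk = eta * sum_i (d_i - h(x_i)) * x_ijk.
import math

# Toy training set: inputs x_i (with a constant 1 for the bias weight), labels d_i.
X = [[1.0, 0.2], [1.0, 0.9], [1.0, 0.4], [1.0, 1.5], [1.0, -0.3], [1.0, 1.1]]
D = [0, 1, 0, 1, 0, 1]

w = [0.0, 0.0]                                   # weights w_jk of the sigmoid unit
eta = 0.5                                        # learning rate

def h(x):                                        # sigmoid output h(x_i)
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

for _ in range(1000):                            # gradient ascent on G(h, D)
    grad = [sum((d - h(x)) * x[k] for x, d in zip(X, D)) for k in range(len(w))]
    w = [wi + eta * g for wi, g in zip(w, grad)]

print([round(h(x), 2) for x in X])               # outputs approach the labels D
```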
MDL principle
• Purpose: interpret the MDL principle, and the inductive bias it encodes, in terms of Bayesian methods
• Shannon and Weaver: the optimal (shortest expected) code assigns -\log_2 p_i bits to a message with probability p_i
• Restating hMAP in terms of code lengths:

  h_{MAP} = \arg\max_{h \in H} \log_2 P(D|h) + \log_2 P(h)
          = \arg\min_{h \in H} -\log_2 P(D|h) - \log_2 P(h)
          = \arg\min_{h \in H} L_{C_H}(h) + L_{C_{D|h}}(D|h)

• MDL principle: choose h_{MDL} where

  h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D|h)
Bayes optimal classifier
• Motivation: the classification of a new instance is optimized by combining the predictions of all hypotheses, weighted by their posterior probabilities
• Task: find the most probable classification of the new instance given the training data
• Answer: combine the predictions of all hypotheses
• Bayes optimal classification

  \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j | h_i) P(h_i | D)

• Limitation: significant computational cost ==> Gibbs algorithm
Bayes optimal classifier example
• Posteriors and per-hypothesis predictions:
  – P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
  – P(h2|D) = .3, P(-|h2) = 1, P(+|h2) = 0
  – P(h3|D) = .3, P(-|h3) = 1, P(+|h3) = 0
• Therefore

  \sum_{h_i \in H} P(+|h_i) P(h_i|D) = .4
  \sum_{h_i \in H} P(-|h_i) P(h_i|D) = .6

  \arg\max_{v_j \in \{+,-\}} \sum_{h_i \in H} P(v_j|h_i) P(h_i|D) = -
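The same computation in a few lines of Python, using only the values from the example:

```python
# Bayes optimal classification for the three-hypothesis example.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}               # P(h_i | D)
predictions = {"h1": {"+": 1, "-": 0},                        # P(v_j | h_i)
               "h2": {"+": 0, "-": 1},
               "h3": {"+": 0, "-": 1}}

scores = {v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
          for v in ("+", "-")}
print(scores)                          # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))     # '-' (even though hMAP = h1 predicts '+')
```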
Gibbs algorithm
• Algorithm
  – 1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H
  – 2. Use h to predict the classification of the next instance x
• Usefulness of the Gibbs algorithm (Haussler et al., 1994)
  – E[Error(Gibbs algorithm)] ≤ 2 · E[Error(Bayes optimal classifier)]
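A small sketch of the two steps on the same three-hypothesis example as above (the sampling loop at the end only illustrates the long-run behaviour):

```python
# Gibbs algorithm: sample one hypothesis from the posterior, then classify with it.
import random

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}    # classification made by each h_i

def gibbs_classify(rng=random):
    # Step 1: choose one hypothesis at random according to the posterior over H.
    hs, ps = zip(*posteriors.items())
    h = rng.choices(hs, weights=ps, k=1)[0]
    # Step 2: use that single hypothesis to classify the new instance.
    return predictions[h]

random.seed(0)
votes = [gibbs_classify() for _ in range(10000)]
print(votes.count("-") / len(votes))   # about 0.6: '-' with probability P(h2|D) + P(h3|D)
```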
Naïve Bayes classifier
• Naïve Bayes classifier

  v_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n | v_j) P(v_j)

  v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i | v_j)

• Differences from other learners
  – No explicit search through H
  – The hypothesis is formed simply by counting the frequencies of the observed examples
• m-estimate of probability = \frac{n_c + m p}{n + m}
  – m: equivalent sample size, p: prior estimate of the probability
  – n: number of examples with value v_j, n_c: number of those examples that also have attribute value a_i
Example
• New instance: (outlook=sunny, temperature=cool, humidity=high, wind=strong)
• P(wind=strong|PlayTennis=yes) = 3/9 = .33
• P(wind=strong|PlayTennis=no) = 3/5 = .60
• P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = .0053
• P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = .0206
• vNB = no
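A short sketch reproducing these numbers. Only two of the conditional probabilities appear on the slide; the remaining values below are the usual counts from the standard 14-example PlayTennis table, which is an assumption here (they do reproduce the .0053 and .0206 shown above).

```python
# Naive Bayes classification of the new instance.
p_yes, p_no = 9 / 14, 5 / 14                                     # P(yes), P(no)
cond_yes = {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9}   # P(a_i | yes)
cond_no  = {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5}   # P(a_i | no)

instance = ["sunny", "cool", "high", "strong"]
score_yes, score_no = p_yes, p_no
for a in instance:
    score_yes *= cond_yes[a]
    score_no *= cond_no[a]

print(round(score_yes, 4), round(score_no, 4))   # 0.0053 0.0206
print("yes" if score_yes > score_no else "no")   # vNB = no
```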
Bayes Belief Networks
• Definition
  – Describe the joint probability distribution for a set of variables
  – Do not require that all the variables be conditionally independent of one another
  – Express partial dependence relations among the variables as conditional probabilities
• Representation: a directed acyclic graph over the variables, plus a conditional probability table for each variable
Inference
• Task: infer the probability distribution of the target variable(s), given the observed values of the other variables
• Methods
  – Exact inference: NP-hard
  – Approximate inference
    • theoretically still NP-hard
    • practically useful
    • e.g., Monte Carlo methods
Learning
• Settings
  – Structure known + fully observable data
    • easy: estimate the conditional probability tables as in the naïve Bayes classifier
  – Structure known + partially observable data
    • gradient ascent procedure (Russell et al., 1995)
    • analogous to searching for the ML hypothesis maximizing P(D|h)
    • weight update, where w_ijk = P_h(Y_i = y_ij | U_i = u_ik) (derivation on the next slide):

      w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} | d)}{w_{ijk}}

  – Structure unknown (next slide)
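A sketch of one such gradient-ascent step on a single CPT row, with made-up posterior values standing in for P_h(y_ij, u_ik | d) (in practice these come from an inference procedure); the renormalization at the end keeps the row a valid conditional distribution.

```python
# One gradient-ascent step on the CPT entries w_ijk = P(Y_i = y_ij | U_i = u_ik)
# for a fixed parent value u_ik, using assumed posteriors for three training cases d.
eta = 0.1
w = [0.5, 0.3, 0.2]                     # current entries w_ijk for j = 0, 1, 2

# posterior[d][j] stands in for P_h(y_ij, u_ik | d), one row per training case d
posterior = [[0.6, 0.3, 0.1],
             [0.2, 0.7, 0.1],
             [0.5, 0.1, 0.4]]

# Update: w_ijk <- w_ijk + eta * sum_d P_h(y_ij, u_ik | d) / w_ijk
w = [wj + eta * sum(row[j] for row in posterior) / wj for j, wj in enumerate(w)]

# Renormalize so the entries again sum to 1 over j
total = sum(w)
w = [wj / total for wj in w]
print(w)
```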
Learning (2)
• Structure unknown
  – Bayesian scoring metric (Cooper & Herskovits, 1992)
  – K2 algorithm (Cooper & Herskovits, 1992)
    • heuristic greedy search
    • requires fully observed data
  – Constraint-based approach (Spirtes et al., 1993)
    • infer dependence and independence relationships from the data
    • construct the network structure using these relationships
• Derivation of the gradient used on the previous slide, with w_ijk = P_h(Y_i = y_ij | U_i = u_ik):

  \frac{\partial \ln P_h(D)}{\partial w_{ijk}}
    = \sum_{d \in D} \frac{\partial \ln P_h(d)}{\partial w_{ijk}}
    = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial P_h(d)}{\partial w_{ijk}}

  Expanding P_h(d) over the possible values of Y_i and U_i:

  P_h(d) = \sum_{j', k'} P_h(d | y_{ij'}, u_{ik'}) \, P_h(y_{ij'} | u_{ik'}) \, P_h(u_{ik'})
         = \sum_{j', k'} P_h(d | y_{ij'}, u_{ik'}) \, w_{ij'k'} \, P_h(u_{ik'})

  Only the term with j' = j and k' = k depends on w_{ijk}; the others contribute 0, so

  \frac{\partial P_h(d)}{\partial w_{ijk}} = P_h(d | y_{ij}, u_{ik}) \, P_h(u_{ik})

  Applying Bayes' theorem to P_h(d | y_{ij}, u_{ik}):

  \frac{\partial \ln P_h(D)}{\partial w_{ijk}}
    = \sum_{d \in D} \frac{1}{P_h(d)} \cdot \frac{P_h(y_{ij}, u_{ik} | d) \, P_h(d)}{P_h(y_{ij}, u_{ik})} \, P_h(u_{ik})
    = \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} | d) \, P_h(u_{ik})}{P_h(y_{ij} | u_{ik}) \, P_h(u_{ik})}
    = \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} | d)}{w_{ijk}}

  which gives the update w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} P_h(y_{ij}, u_{ik} | d) / w_{ijk},
  followed by renormalizing the w_{ijk} so that \sum_j w_{ijk} = 1.
EM algorithm
• EM = Expectation-Maximization
• Setting
  – Learning in the presence of unobserved variables
  – The form of the probability distribution is known
• Applications
  – Training Bayesian belief networks
  – Training radial basis function networks
  – Basis for many unsupervised clustering algorithms
  – Basis for the Baum-Welch forward-backward algorithm for learning partially observable Markov models
K-means algorithm
• Setting: data generated at random from k Normal distributions (with equal, known variance)
• Task: find the mean value of each distribution
• Full description of an instance: <x_i, z_i1, z_i2> (for k = 2), where z_ij indicates which distribution generated x_i
  – If the z_ij were known: use \mu_{ML} = \arg\min_{\mu} \sum_i (x_i - \mu)^2 within each group
  – Since they are unobserved: use the EM algorithm
K-means algorithm
• Initialize h = <\mu_1, \mu_2> arbitrarily
• Step 1 (estimation): calculate the expected value E[z_ij] of each hidden variable, assuming the current hypothesis holds:

  E[z_{ij}] = \frac{p(x_i | \mu_j)}{\sum_{n=1}^{2} p(x_i | \mu_n)}
            = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}

• Step 2 (maximization): calculate a new ML hypothesis, assuming each z_ij takes its expected value:

  \mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}] \, x_i}{\sum_{i=1}^{m} E[z_{ij}]}

• Iterating the two steps converges to a local ML hypothesis
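Here is a sketch of these two steps for k = 2 with known σ; the generated data and the initial means are assumptions for illustration.

```python
# EM for the means of two Gaussians with known, equal variance sigma^2.
import math, random

random.seed(0)
sigma = 1.0
# Data drawn from two Normal distributions with true means -2 and 3.
xs = [random.gauss(-2, sigma) for _ in range(100)] + \
     [random.gauss(3, sigma) for _ in range(100)]

def density(x, m):                                # unnormalized N(x ; m, sigma^2)
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2))

mu = [0.0, 1.0]                                   # initial hypothesis h = <mu_1, mu_2>
for _ in range(50):
    # Step 1: E[z_ij] = p(x_i | mu_j) / sum_n p(x_i | mu_n)
    ez = [[density(x, m) / sum(density(x, n) for n in mu) for m in mu] for x in xs]
    # Step 2: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = [sum(ez[i][j] * xs[i] for i in range(len(xs))) /
          sum(ez[i][j] for i in range(len(xs))) for j in range(len(mu))]

print([round(m, 2) for m in mu])                  # close to the true means -2 and 3
```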
General statement of EM algorithm
• Terms
  – θ: parameters of the underlying probability distribution
  – X: the observed data
  – Z: the unobserved (hidden) data
  – Y = X ∪ Z: the full data
  – h: current hypothesis for θ
  – h': revised hypothesis
• Task: estimate θ from X
Guideline
• Search for the h' that maximizes the expected log-likelihood of the full data Y:

  h' = \arg\max_{h'} E[\ln P(Y | h')]

• Since Y is partly unobserved, the expectation is taken given the current hypothesis h and the observed data X; define

  Q(h' | h) = E[\ln P(Y | h') \mid h, X]
EM algorithm
• Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X:

  Q(h' | h) = E[\ln P(Y | h') \mid h, X]

• Maximization (M) step: replace h by the hypothesis h' that maximizes Q:

  h \leftarrow \arg\max_{h'} Q(h' | h)

• The iteration converges to a local maximum of the likelihood
• Applying this to the k-Gaussian means problem:

  p(y_i | h') = p(x_i, z_{i1}, \ldots, z_{ik} | h')
              = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2} \sum_{j} z_{ij} (x_i - \mu_j')^2}

  \ln P(Y | h') = \sum_{i=1}^{m} \ln p(y_i | h')
                = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j} z_{ij} (x_i - \mu_j')^2 \right)

  Because this expression is linear in the z_{ij},

  Q(h' | h) = E[\ln P(Y | h')] = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j} E[z_{ij}] (x_i - \mu_j')^2 \right)

  with E[z_ij] computed as before:

  E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}

  Maximizing Q(h'|h) over the \mu_j' then gives

  \mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}] \, x_i}{\sum_{i=1}^{m} E[z_{ij}]}