확률모델
DESCRIPTION
정보검색시스템 강의노트 강승식교수님TRANSCRIPT
1
2.5.4 Probabilistic Model
Motivation
– Attempt to capture the IR problem within a probabilistic framework
Assumption (Probabilistic Principle)
– Probability of relevance depends on the query and the document representations only
– There is a subset of all documents which the user prefers as the answer set for the query q
– Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user
– Documents in the set R are predicted to be relevant to the query
– Documents not in this set are predicted to be non-relevant R
23
Probabilistic Model
Definition
qddRP
R
R
RdP
RdP
RPRdP
RPRdP
dRP
dRPqdsim
ww
jj
j
j
j
j
j
jj
iqij
query theorelevant t is document y that theProbabilit :)|(
relevant-non be known to documents ofSet :
relevant be known to documents ofSet :
)|(
)|(~
)()|(
)()|(
)|(
)|(),(
binary all are riables weight vaindex term : }1,0{},1,0{
Using Bayes’ rule documents theall
same theare )(),( RPRP
)(
)()|(
)(
)()|(:'
bP
aPabP
bP
baPbaPruleBayes
Rd jj RdP set thefrom document theselectingrandomly ofy Probabilit:)|(
relevant. is collection entire thefrom selectedrandomly document ay that Probabilit:)(RP
23
Probabilistic Model
Definition (Cont.)
t
i i
i
i
iijiqj
dg idg i
dg idg i
j
RkP
RkP
RkP
RkPwwqsim(d
RkPRkP
RkPRkPqdsim
jiji
jiji
1
0)(1)(
0)(1)(
)|(
)|(1log
)|(1
)|(log~),
)|()|(
)|()|(~),(
Assuming independence of
index terms
Taking logarithms, ignoring constants,
1)|()|( RkPRkP ii
Rii kRkP
set thefrom selectedrandomly document ain present is index term y that theprobabilit the:)|(
Rii kRkP
set thefrom selectedrandomly document ain present not is index term y that theprobabilit the:)|(
23
Probabilistic Model
Initial Probability
Improving Probability
iii
i
i
knN
nRkP
RkP
index term econtain th which documents ofnumber : )|(
5.0)|(
ii
iii
iiiii
ii
iii
kVV
VVNN
nVn
VN
Vn
VN
VnRkP
VN
nV
V
V
V
VRkP
index term econtain th which ofsubset :
retrievedinitially documents ofsubset :11
5.0)|(
1
1
5.0 )|(
Adjustment the
problems for small values of V and Vi
23
Probabilistic Model
Advantage
– Documents are ranked in decreasing order of their probability of being relevant
Disadvantage
– Guess the initial separation of documents into relevant and non-relevant sets
– Binary weights
• Does not take into account the frequency with which an index term occurs inside a document
– Independence assumption for index terms
23
2.5.5 Brief Comparison of Classic Models
Boolean Model– Weakest classic method– Inability to recognize partial matches
Vector Model– Popular retrieval model
Vector Model v.s Probabilistic Model– Croft
• Probabilistic model provides a better retrieval performance
– Salton, Buckley• Vector model is expected to outperform the
probabilistic model with general collections
23
참고 : Probabilistic ModelExample of Probabilistic Retrieval Model
Query : q(k1, k2) 5 documents
Assume d2 and d4 are relevant
d1 d2 d3 d4 d5
k1 1 0 1 1 0
k2 0 1 0 1 1
2
1)|( 1 RkP
3
2)|( 1 RkP 1)|( 2 RkP
3
1)|( 2 RkP
23
Probabilistic Model Example of Probabilistic Model
Q: “gold silver truck”
D1: “Shipment of gold damaged in a fire.”
D2: “Delivery of silver arrived in a silver truck.”
D3: “Shipment of gold arrived in a truck.”
– Assume that documents D2 and D3 are relevant to the query
N= number of documents in the collection
R= number of relevant documents for a given query Q
n= number of documents that contain term t
r= number of relevant documents that contain term t
variable Gold Silver Truck
N 3 3 3
R 2 2 2
n 2 1 2
r 1 1 2
23
Probabilistic Model Example of Probabilistic Model (Cont.)
Computing Term Relevance Weight
Similarity Coefficient for each document
– D1: -0.477, D2: 1.653, D3: 0.699
)5.0)()(
5.0
5.0
5.0log(
rRnN
rn
rR
rtrj
477.03
1log)
5.01223
5.012
5.012
5.01log(:
gold
477.0333.0
1log)
5.01213
5.011
5.012
5.01log(:
silver
176.1333.0
5log)
5.02223
5.022
5.022
5.02log(:
truck
23
참고 : Review of Probability Theory Try to predict whether or not a baseball team (e.g. LA
Dodgers) will win one of its games
–
– Let’s assume the independence of evidences
– By Bayes’ Theorem
–
,75.0)|( sunnywinP 6.0)|( chanhowinP,5.0)( winP
?),|( chanhosunnywinP
,),|( cswP ,)|( swP )|( cwP
),(
)()|,(
),(
),,(),|(
csP
wPwcsP
csP
cswPcswP
)()|,(
)()|,(
),(
)()|,(
),(
)()|,(
1 lPlcsP
wPwcsP
csP
lPlcsP
csP
wPwcsP
We can’t calculate it, so…
Probability
to lose
23
Probabilistic ModelReview of Probability Theory (Cont.)
Independent Assumption
Making Substitutions
)|()|()|,(
)|()|()|,(
lcPlsPlcsP
wcPwsPwcsP
)()|()|(
)()|()|(
)()|,(
)()|,(
1 lPlcPlsP
wPwcPwsP
lPlcsP
wPwcsP
)()|(
)()|(
1 ,
)(
)()|()|(
lPlsP
wPwsP
sP
wPwsPswP
)()|(
)()|(
1 ,
)(
)()|()|(
lPlcP
wPwcP
cP
wPwcPcwP
Probabilistic ModelReview of Probability Theory (Cont.)
Making Substitutions (Cont.)
5.45.0
5.0
4.0
6.0
25.0
75.0)(
)(
11
)(
)(
)(
)(
1)(
)(
1
)()|()|(
)()|()|(
1
wP
lPlP
wP
wP
lP
wP
lPlPlcPlsP
wPwcPwsP
818.011
9 The probability of LA Dodgers will win its game
on sunny days when Chanho plays