확률모델

1

2.5.4 Probabilistic Model

Motivation

– Attempt to capture the IR problem within a probabilistic framework

Assumption (Probabilistic Principle)

– Probability of relevance depends on the query and the document representations only

– There is a subset of all documents which the user prefers as the answer set for the query q

– Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user

– Documents in the set R are predicted to be relevant to the query

– Documents not in this set are predicted to be non-relevant R

23

Probabilistic Model

Definition

qddRP

R

R

RdP

RdP

RPRdP

RPRdP

dRP

dRPqdsim

ww

jj

j

j

j

j

j

jj

iqij

query theorelevant t is document y that theProbabilit :)|(

relevant-non be known to documents ofSet :

relevant be known to documents ofSet :

)|(

)|(~

)()|(

)()|(

)|(

)|(),(

binary all are riables weight vaindex term : }1,0{},1,0{

Using Bayes’ rule documents theall

same theare )(),( RPRP

)(

)()|(

)(

)()|(:'

bP

aPabP

bP

baPbaPruleBayes

Rd jj RdP set thefrom document theselectingrandomly ofy Probabilit:)|(

relevant. is collection entire thefrom selectedrandomly document ay that Probabilit:)(RP

23

Probabilistic Model

Definition (Cont.)

t

i i

i

i

iijiqj

dg idg i

dg idg i

j

RkP

RkP

RkP

RkPwwqsim(d

RkPRkP

RkPRkPqdsim

jiji

jiji

1

0)(1)(

0)(1)(

)|(

)|(1log

)|(1

)|(log~),

)|()|(

)|()|(~),(

Assuming independence of

index terms

Taking logarithms, ignoring constants,

1)|()|( RkPRkP ii

Rii kRkP

set thefrom selectedrandomly document ain present is index term y that theprobabilit the:)|(

Rii kRkP

set thefrom selectedrandomly document ain present not is index term y that theprobabilit the:)|(

23

Probabilistic Model

Initial Probability

Improving Probability

iii

i

i

knN

nRkP

RkP

index term econtain th which documents ofnumber : )|(

5.0)|(

ii

iii

iiiii

ii

iii

kVV

VVNN

nVn

VN

Vn

VN

VnRkP

VN

nV

V

V

V

VRkP

index term econtain th which ofsubset :

retrievedinitially documents ofsubset :11

5.0)|(

1

1

5.0 )|(

Adjustment the

problems for small values of V and Vi

23

Probabilistic Model

Advantage

– Documents are ranked in decreasing order of their probability of being relevant

Disadvantage

– Guess the initial separation of documents into relevant and non-relevant sets

– Binary weights

• Does not take into account the frequency with which an index term occurs inside a document

– Independence assumption for index terms

23

2.5.5 Brief Comparison of Classic Models

Boolean Model– Weakest classic method– Inability to recognize partial matches

Vector Model– Popular retrieval model

Vector Model v.s Probabilistic Model– Croft

• Probabilistic model provides a better retrieval performance

– Salton, Buckley• Vector model is expected to outperform the

probabilistic model with general collections

23

참고 : Probabilistic ModelExample of Probabilistic Retrieval Model

Query : q(k1, k2) 5 documents

Assume d2 and d4 are relevant

d1 d2 d3 d4 d5

k1 1 0 1 1 0

k2 0 1 0 1 1

2

1)|( 1 RkP

3

2)|( 1 RkP 1)|( 2 RkP

3

1)|( 2 RkP

23

Probabilistic Model Example of Probabilistic Model

Q: “gold silver truck”

D1: “Shipment of gold damaged in a fire.”

D2: “Delivery of silver arrived in a silver truck.”

D3: “Shipment of gold arrived in a truck.”

– Assume that documents D2 and D3 are relevant to the query

N= number of documents in the collection

R= number of relevant documents for a given query Q

n= number of documents that contain term t

r= number of relevant documents that contain term t

variable Gold Silver Truck

N 3 3 3

R 2 2 2

n 2 1 2

r 1 1 2

23

Probabilistic Model Example of Probabilistic Model (Cont.)

Computing Term Relevance Weight

Similarity Coefficient for each document

– D1: -0.477, D2: 1.653, D3: 0.699

)5.0)()(

5.0

5.0

5.0log(

rRnN

rn

rR

rtrj

477.03

1log)

5.01223

5.012

5.012

5.01log(:

gold

477.0333.0

1log)

5.01213

5.011

5.012

5.01log(:

silver

176.1333.0

5log)

5.02223

5.022

5.022

5.02log(:

truck

23

참고 : Review of Probability Theory Try to predict whether or not a baseball team (e.g. LA

Dodgers) will win one of its games

–

– Let’s assume the independence of evidences

– By Bayes’ Theorem

–

,75.0)|( sunnywinP 6.0)|( chanhowinP,5.0)( winP

?),|( chanhosunnywinP

,),|( cswP ,)|( swP )|( cwP

),(

)()|,(

),(

),,(),|(

csP

wPwcsP

csP

cswPcswP

)()|,(

)()|,(

),(

)()|,(

),(

)()|,(

1 lPlcsP

wPwcsP

csP

lPlcsP

csP

wPwcsP

We can’t calculate it, so…

Probability

to lose

23

Probabilistic ModelReview of Probability Theory (Cont.)

Independent Assumption

Making Substitutions

)|()|()|,(

)|()|()|,(

lcPlsPlcsP

wcPwsPwcsP

)()|()|(

)()|()|(

)()|,(

)()|,(

1 lPlcPlsP

wPwcPwsP

lPlcsP

wPwcsP

)()|(

)()|(

1 ,

)(

)()|()|(

lPlsP

wPwsP

sP

wPwsPswP

)()|(

)()|(

1 ,

)(

)()|()|(

lPlcP

wPwcP

cP

wPwcPcwP

Probabilistic ModelReview of Probability Theory (Cont.)

Making Substitutions (Cont.)

5.45.0

5.0

4.0

6.0

25.0

75.0)(

)(

11

)(

)(

)(

)(

1)(

)(

1

)()|()|(

)()|()|(

1

wP

lPlP

wP

wP

lP

wP

lPlPlcPlsP

wPwcPwsP

818.011

9 The probability of LA Dodgers will win its game

on sunny days when Chanho plays