확률모델

12
1 2.5.4 Probabilistic Model Motivation Attempt to capture the IR problem within a probabilistic framework Assumption (Probabilistic Principle) Probability of relevance depends on the query and the document representations only There is a subset of all documents which the user prefers as the answer set for the query q Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user Documents in the set R are predicted to be relevant to the query Documents not in this set are predicted to be non-relevant R

Upload: jungeun-kang

Post on 10-May-2015

1.417 views

Category:

Documents


0 download

DESCRIPTION

정보검색시스템 강의노트 강승식교수님

TRANSCRIPT

Page 1: 확률모델

1

2.5.4 Probabilistic Model

Motivation

– Attempt to capture the IR problem within a probabilistic framework

Assumption (Probabilistic Principle)

– Probability of relevance depends on the query and the document representations only

– There is a subset of all documents which the user prefers as the answer set for the query q

– Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user

– Documents in the set R are predicted to be relevant to the query

– Documents not in this set are predicted to be non-relevant R

Page 2: 확률모델

23

Probabilistic Model

Definition

qddRP

R

R

RdP

RdP

RPRdP

RPRdP

dRP

dRPqdsim

ww

jj

j

j

j

j

j

jj

iqij

query theorelevant t is document y that theProbabilit :)|(

relevant-non be known to documents ofSet :

relevant be known to documents ofSet :

)|(

)|(~

)()|(

)()|(

)|(

)|(),(

binary all are riables weight vaindex term : }1,0{},1,0{

Using Bayes’ rule documents theall

same theare )(),( RPRP

)(

)()|(

)(

)()|(:'

bP

aPabP

bP

baPbaPruleBayes

Rd jj RdP set thefrom document theselectingrandomly ofy Probabilit:)|(

relevant. is collection entire thefrom selectedrandomly document ay that Probabilit:)(RP

Page 3: 확률모델

23

Probabilistic Model

Definition (Cont.)

t

i i

i

i

iijiqj

dg idg i

dg idg i

j

RkP

RkP

RkP

RkPwwqsim(d

RkPRkP

RkPRkPqdsim

jiji

jiji

1

0)(1)(

0)(1)(

)|(

)|(1log

)|(1

)|(log~),

)|()|(

)|()|(~),(

Assuming independence of

index terms

Taking logarithms, ignoring constants,

1)|()|( RkPRkP ii

Rii kRkP

set thefrom selectedrandomly document ain present is index term y that theprobabilit the:)|(

Rii kRkP

set thefrom selectedrandomly document ain present not is index term y that theprobabilit the:)|(

Page 4: 확률모델

23

Probabilistic Model

Initial Probability

Improving Probability

iii

i

i

knN

nRkP

RkP

index term econtain th which documents ofnumber : )|(

5.0)|(

ii

iii

iiiii

ii

iii

kVV

VVNN

nVn

VN

Vn

VN

VnRkP

VN

nV

V

V

V

VRkP

index term econtain th which ofsubset :

retrievedinitially documents ofsubset :11

5.0)|(

1

1

5.0 )|(

Adjustment the

problems for small values of V and Vi

Page 5: 확률모델

23

Probabilistic Model

Advantage

– Documents are ranked in decreasing order of their probability of being relevant

Disadvantage

– Guess the initial separation of documents into relevant and non-relevant sets

– Binary weights

• Does not take into account the frequency with which an index term occurs inside a document

– Independence assumption for index terms

Page 6: 확률모델

23

2.5.5 Brief Comparison of Classic Models

Boolean Model– Weakest classic method– Inability to recognize partial matches

Vector Model– Popular retrieval model

Vector Model v.s Probabilistic Model– Croft

• Probabilistic model provides a better retrieval performance

– Salton, Buckley• Vector model is expected to outperform the

probabilistic model with general collections

Page 7: 확률모델

23

참고 : Probabilistic ModelExample of Probabilistic Retrieval Model

Query : q(k1, k2) 5 documents

Assume d2 and d4 are relevant

d1 d2 d3 d4 d5

k1 1 0 1 1 0

k2 0 1 0 1 1

2

1)|( 1 RkP

3

2)|( 1 RkP 1)|( 2 RkP

3

1)|( 2 RkP

Page 8: 확률모델

23

Probabilistic Model Example of Probabilistic Model

Q: “gold silver truck”

D1: “Shipment of gold damaged in a fire.”

D2: “Delivery of silver arrived in a silver truck.”

D3: “Shipment of gold arrived in a truck.”

– Assume that documents D2 and D3 are relevant to the query

N= number of documents in the collection

R= number of relevant documents for a given query Q

n= number of documents that contain term t

r= number of relevant documents that contain term t

variable Gold Silver Truck

N 3 3 3

R 2 2 2

n 2 1 2

r 1 1 2

Page 9: 확률모델

23

Probabilistic Model Example of Probabilistic Model (Cont.)

Computing Term Relevance Weight

Similarity Coefficient for each document

– D1: -0.477, D2: 1.653, D3: 0.699

)5.0)()(

5.0

5.0

5.0log(

rRnN

rn

rR

rtrj

477.03

1log)

5.01223

5.012

5.012

5.01log(:

gold

477.0333.0

1log)

5.01213

5.011

5.012

5.01log(:

silver

176.1333.0

5log)

5.02223

5.022

5.022

5.02log(:

truck

Page 10: 확률모델

23

참고 : Review of Probability Theory Try to predict whether or not a baseball team (e.g. LA

Dodgers) will win one of its games

– Let’s assume the independence of evidences

– By Bayes’ Theorem

,75.0)|( sunnywinP 6.0)|( chanhowinP,5.0)( winP

?),|( chanhosunnywinP

,),|( cswP ,)|( swP )|( cwP

),(

)()|,(

),(

),,(),|(

csP

wPwcsP

csP

cswPcswP

)()|,(

)()|,(

),(

)()|,(

),(

)()|,(

1 lPlcsP

wPwcsP

csP

lPlcsP

csP

wPwcsP

We can’t calculate it, so…

Probability

to lose

Page 11: 확률모델

23

Probabilistic ModelReview of Probability Theory (Cont.)

Independent Assumption

Making Substitutions

)|()|()|,(

)|()|()|,(

lcPlsPlcsP

wcPwsPwcsP

)()|()|(

)()|()|(

)()|,(

)()|,(

1 lPlcPlsP

wPwcPwsP

lPlcsP

wPwcsP

)()|(

)()|(

1 ,

)(

)()|()|(

lPlsP

wPwsP

sP

wPwsPswP

)()|(

)()|(

1 ,

)(

)()|()|(

lPlcP

wPwcP

cP

wPwcPcwP

Page 12: 확률모델

Probabilistic ModelReview of Probability Theory (Cont.)

Making Substitutions (Cont.)

5.45.0

5.0

4.0

6.0

25.0

75.0)(

)(

11

)(

)(

)(

)(

1)(

)(

1

)()|()|(

)()|()|(

1

wP

lPlP

wP

wP

lP

wP

lPlPlcPlsP

wPwcPwsP

818.011

9 The probability of LA Dodgers will win its game

on sunny days when Chanho plays