Page 1:

Evaluation in Information Retrieval

Speaker: Ruihua Song

Web Data Management Group, MSR Asia

Page 2:

Outline
- Basics of IR evaluation
- Introduction to TREC (Text Retrieval Conference)
- One selected paper: Select-the-Best-Ones, a new way to judge relative relevance

Page 3:

Motivating Examples

Which set is better?
- S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n}
- S3 = {r} vs. S4 = {r, r, n}

Which ranking list is better?
- L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r>
- L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>

(r: relevant, n: non-relevant, h: highly relevant)

Page 4:

Precision & Recall

Precision is the fraction of the retrieved documents that are relevant.
Recall is the fraction of the relevant documents that have been retrieved.

Let R be the relevant set, A the answer (retrieved) set, and Ra = R ∩ A the relevant documents that are retrieved:

Precision = |Ra| / |A|
Recall = |Ra| / |R|
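To make these definitions concrete, here is a minimal Python sketch (illustrative only; the function name and document ids are not from the talk) that reproduces the numbers of Example 1 on the next slide:

```python
# Set-based precision and recall (illustrative sketch, not code from the talk).

def precision_recall(retrieved, relevant):
    """Compute precision and recall for an unordered result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    ra = retrieved & relevant                                  # relevant docs that were retrieved
    precision = len(ra) / len(retrieved) if retrieved else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall

# Example 1 on the next slide: S1 = {r, r, r, n, n}, with 10 relevant documents in the judgments.
relevant_docs = [f"rel{i}" for i in range(1, 11)]              # hypothetical ids for the 10 relevant docs
s1 = ["rel1", "rel2", "rel3", "x1", "x2"]                      # 3 relevant + 2 non-relevant retrieved
print(precision_recall(s1, relevant_docs))                     # (0.6, 0.3), i.e. P1 and R1
```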

Page 5:

Precision & Recall (cont.)

Assume there are 10 relevant documents in the judgments.

Example 1: S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n}
- P1 = 3/5 = 0.6; R1 = 3/10 = 0.3
- P2 = 2/5 = 0.4; R2 = 2/10 = 0.2
- S1 > S2

Example 2: S3 = {r} vs. S4 = {r, r, n}
- P3 = 1/1 = 1; R3 = 1/10 = 0.1
- P4 = 2/3 = 0.667; R4 = 2/10 = 0.2
- ? (F1-measure)

Example 3: L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r>
- ? (order matters here; see Mean Average Precision on the next slide)

(r: relevant, n: non-relevant, h: highly relevant)
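As a hedged aside, the "?" in Example 2 can be resolved with the standard F1 measure, the harmonic mean of precision and recall; the slide only hints at it, so the arithmetic below is my own illustration:

```python
# Resolving Example 2 with the F1 measure (harmonic mean of precision and recall).
# Illustrative arithmetic only; the slide leaves the comparison as an open question.

def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(1.0, 0.1))    # S3: ~0.182
print(f1(0.667, 0.2))  # S4: ~0.308, so S4 > S3 under F1
```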

Page 6:

Mean Average Precision (MAP)

Defined as the mean of Average Precision (AP) over a set of queries. For a single query, AP sums the precision at the rank of each relevant document and divides by the total number of relevant documents (here 10).

Example 3: L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r>
- AP1 = (1/1 + 2/2 + 3/3) / 10 = 0.3
- AP2 = (1/3 + 2/4 + 3/5) / 10 = 0.143
- L1 > L2
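A small Python sketch of Average Precision (illustrative, not from the talk) reproduces AP1 and AP2 above:

```python
# Average Precision for binary relevance labels (illustrative sketch).

def average_precision(ranked_labels, num_relevant):
    """ranked_labels: 1 = relevant, 0 = non-relevant, in rank order.
    num_relevant: total number of relevant documents in the judgments."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank        # precision at this relevant document's rank
    return precision_sum / num_relevant

print(average_precision([1, 1, 1, 0, 0], 10))   # 0.3    (AP1)
print(average_precision([0, 0, 1, 1, 1], 10))   # ~0.143 (AP2)
# MAP is simply the mean of these AP values over a set of queries.
```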

Page 7:

Other Metrics Based on Binary Judgments

P@10 (Precision at 10) is the number of relevant documents among the top 10 documents of the ranked list returned for a topic, divided by 10.
- e.g. if there are 3 relevant documents in the top 10 retrieved documents, P@10 = 0.3

MRR (Mean Reciprocal Rank) is the mean of the Reciprocal Rank (RR) over a set of queries; RR is the reciprocal of the rank of the first relevant document in the ranked list returned for a topic.
- e.g. if the first relevant document is ranked No. 4, RR = 1/4 = 0.25
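The same kind of sketch for P@10 and Reciprocal Rank (illustrative helper functions, not code from the talk):

```python
# P@k and Reciprocal Rank for binary relevance labels (illustrative sketch).

def precision_at_k(ranked_labels, k=10):
    """Fraction of relevant documents among the top k results."""
    return sum(ranked_labels[:k]) / k

def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]   # 3 relevant docs in the top 10, first at rank 4
print(precision_at_k(labels, 10))          # 0.3
print(reciprocal_rank(labels))             # 0.25
```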

Page 8:

Metrics Based on Graded Relevance

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
(r: relevant, n: non-relevant, h: highly relevant)

Which ranking list is better?

Cumulated Gain based metrics: CG, DCG, and nDCG

Two assumptions about the ranked result list:
- Highly relevant documents are more valuable than marginally relevant ones
- The further down the ranked list a relevant document appears, the less valuable it is for the user

Page 9:

CG: Cumulated Gains

From graded-relevance judgments to gain vectors (n → 0, r → 1, h → 2), then cumulate the gains from the top of the list downward.

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
- G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1>
- CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4>

Page 10:

DCG: Discounted Cumulated Gains

Discounting function: the gain at rank i is divided by log2(i) for i ≥ 2 (the gain at rank 1 is not discounted), so relevant documents found further down the list contribute less.

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
- G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1>
- DG3 = <1, 0, 0.63, 0, 0.86>, DG4 = <2, 0, 0, 0.5, 0.43>
- CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4>
- DCG3 = <1, 1, 1.63, 1.63, 2.49>, DCG4 = <2, 2, 2, 2.5, 2.93>
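The DG and DCG vectors can be reproduced with a short Python sketch; the log2 discount below is inferred from the numbers on the slide, and the code is illustrative rather than the talk's own:

```python
import math

def discounted_gains(gains):
    """Divide the gain at rank i by log2(i) for i >= 2; the gain at rank 1 is kept as-is."""
    return [g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1)]

def cumulate(values):
    """Running sum, producing CG or DCG vectors."""
    out, total = [], 0.0
    for v in values:
        total += v
        out.append(total)
    return out

g3 = [1, 0, 1, 0, 2]                            # gains for L3 = <r, n, r, n, h>
dg3 = discounted_gains(g3)                      # [1, 0, 0.63..., 0, 0.86...]
print([round(x, 2) for x in cumulate(dg3)])     # [1.0, 1.0, 1.63, 1.63, 2.49], i.e. DCG3
```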

Page 11:

nDCG: Normalized Discounted Cumulated Gains

The ideal (D)CG vector is computed from an ideal ranking that sorts the judged documents by decreasing relevance.

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
- Lideal = <h, r, r, n, n>
- Gideal = <2, 1, 1, 0, 0>; DGideal = <2, 1, 0.63, 0, 0>
- CGideal = <2, 3, 4, 4, 4>; DCGideal = <2, 3, 3.63, 3.63, 3.63>

Page 12:

nDCG: Normalized Discounted Cumulated Gains

Normalized (D)CG: divide each position of the (D)CG vector by the corresponding position of the ideal vector.

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
- DCGideal = <2, 3, 3.63, 3.63, 3.63>
- nDCG3 = <1/2, 1/3, 1.63/3.63, 1.63/3.63, 2.49/3.63> = <0.5, 0.33, 0.45, 0.45, 0.69>
- nDCG4 = <2/2, 2/3, 2/3.63, 2.5/3.63, 2.93/3.63> = <1, 0.67, 0.55, 0.69, 0.81>
- L3 < L4
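Putting the pieces together, a minimal nDCG sketch (illustrative; it assumes the gain mapping n → 0, r → 1, h → 2 from the CG slide) reproduces nDCG4:

```python
import math

def dcg(gains):
    """DCG vector: cumulated gains with a log2(rank) discount applied from rank 2 onward."""
    out, total = [], 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log2(i)
        out.append(total)
    return out

def ndcg(gains):
    """Divide each DCG position by the DCG of the ideally re-ranked gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return [d / i for d, i in zip(dcg(gains), ideal)]

g4 = [2, 0, 0, 1, 1]                        # L4 = <h, n, n, r, r>
print([round(x, 2) for x in ndcg(g4)])      # [1.0, 0.67, 0.55, 0.69, 0.81], i.e. nDCG4
```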

Page 13:

Something Important

Dealing with small data sets:
- Cross validation

Significance testing:
- Paired, two-tailed t-test
- Is an observed difference (e.g., Green < Yellow in the figure) statistically significant, or just caused by chance?

[Figure: two score distributions, axes labeled score and p(.)]
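For illustration only (SciPy and the per-query scores below are my assumptions, not part of the slides), a paired two-tailed t-test over per-query metric values of two systems looks like this:

```python
from scipy import stats   # assumes SciPy is available

# Paired, two-tailed t-test over per-query scores (e.g., AP values) of two systems
# evaluated on the same queries. The numbers are made up for illustration.
system_a = [0.30, 0.45, 0.22, 0.60, 0.38, 0.51, 0.27, 0.44]
system_b = [0.35, 0.47, 0.30, 0.58, 0.42, 0.55, 0.33, 0.49]

t_stat, p_value = stats.ttest_rel(system_a, system_b)   # paired samples, two-sided by default
print(t_stat, p_value)
# If p_value < 0.05, the difference is usually reported as statistically significant
# rather than attributed to chance.
```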

Page 14:

Any questions?

Page 15:

By Ruihua Song, Web Data Management Group, MSR Asia

March 30, 2010

Introduction to TREC

Page 16:

Text Retrieval Conference (TREC)

Homepage: http://trec.nist.gov/

Goals:
- To encourage retrieval research based on large test collections
- To increase communication among industry, academia, and government
- To speed the transfer of technology from research labs into commercial products
- To increase the availability of appropriate evaluation techniques for use by industry and academia

Page 17:

Yearly Cycle of TREC

Page 18:

The TREC Tracks

Page 19:

TREC 2009

Tracks:
- Blog track
- Chemical IR track
- Entity track
- Legal track
- "Million Query" track
- Relevance Feedback track
- Web track

Participants: 67 groups representing 19 different countries

Page 20:

TREC 2010

Schedule:
- By Feb 18: submit your application to participate in TREC 2010
- Beginning March 2
- Nov 16-19: TREC 2010 at NIST in Gaithersburg, Md., USA

What's new: the Session track
- To test whether systems can improve their performance for a given query by using a previous query
- To evaluate system performance over an entire query session instead of a single query
- Track web page: http://ir.cis.udel.edu/sessions

Page 21:

Why TREC

To obtain public data sets (the most frequently used in IR papers)
- Pooling makes judgments unbiased for participants

To exchange ideas in emerging areas
- A strong Program Committee
- A healthy comparison of approaches

To influence evaluation methodologies
- By feedback or proposals

Page 22:

TREC 2009 Program Committee

Ellen Voorhees (chair), James Allan, Chris Buckley, Gord Cormack, Sue Dumais, Donna Harman, Bill Hersh, David Lewis, Doug Oard, John Prager, Stephen Robertson, Mark Sanderson, Ian Soboroff, Richard Tong

Page 23:

Any questions?

Page 24:

SELECT-THE-BEST-ONES: A NEW WAY TO JUDGE RELATIVE RELEVANCE
Ruihua Song, Qingwei Guo, Ruochi Zhang, Guomao Xin, Ji-Rong Wen, Yong Yu, Hsiao-Wuen Hon

Information Processing and Management, 2010

Page 25:

ABSOLUTE RELEVANCE JUDGMENTS

Page 26:

RELATIVE RELEVANCE JUDGMENTS

Problem formulation

Connections between absolute judgments (A) and relative judgments (R):
- A can be transformed to R by grouping documents that share a relevance grade into one set and ordering the sets by grade
- R can be transformed to A, if the assessors assign a relevance grade to each set

Page 27:

QUICK-SORT: A PAIRWISE STRATEGY

[Figure: each remaining document is compared against a pivot P and judged B (better), S (same), or W (worse) than the pivot]
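A minimal sketch of how a quick-sort style judging strategy can be simulated, assuming the assessor is modeled as a three-way comparison function; the names and recursion details below are my assumptions, not the paper's exact procedure:

```python
# Quick-sort style relative judging, simulated with a three-way comparator.
# Illustrative only; not the paper's exact procedure or interface.

def quicksort_judge(docs, compare):
    """Return a list of sets ordered from most to least relevant.
    compare(doc, pivot) must return 'better', 'same', or 'worse'."""
    if len(docs) <= 1:
        return [set(docs)] if docs else []
    pivot, rest = docs[0], docs[1:]
    better = [d for d in rest if compare(d, pivot) == 'better']
    same = {d for d in rest if compare(d, pivot) == 'same'} | {pivot}
    worse = [d for d in rest if compare(d, pivot) == 'worse']
    return quicksort_judge(better, compare) + [same] + quicksort_judge(worse, compare)

# Toy example: hidden grades drive the simulated assessor.
grades = {'d1': 2, 'd2': 0, 'd3': 1, 'd4': 1, 'd5': 0}
compare = lambda a, b: 'better' if grades[a] > grades[b] else ('worse' if grades[a] < grades[b] else 'same')
print(quicksort_judge(list(grades), compare))   # [{'d1'}, {'d3', 'd4'}, {'d2', 'd5'}] (set order may vary)
```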

Page 28:
Page 29:
Page 30:

SELECT-THE-BEST-ONES: A PROPOSED NEW STRATEGY

[Figure: in each round the assessor selects the best documents (B) out of the remaining pool (P); the selected documents form the next relevance set and judging continues on the rest]
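Under the same caveat, a sketch of the Select-the-Best-Ones idea, assuming the assessor is modeled as a function that picks the best documents out of a pool (illustrative, not the paper's interface):

```python
# Select-the-Best-Ones, simulated with a "pick the best subset" oracle.
# Illustrative only; not the paper's exact procedure or interface.

def sbo_judge(docs, select_best):
    """Return a list of sets ordered from most to least relevant.
    select_best(pool) must return the non-empty subset of pool judged best."""
    ordered_sets, remaining = [], set(docs)
    while remaining:
        best = set(select_best(remaining))
        ordered_sets.append(best)
        remaining -= best
    return ordered_sets

# Toy example: hidden grades drive the simulated assessor.
grades = {'d1': 2, 'd2': 0, 'd3': 1, 'd4': 1, 'd5': 0}
select_best = lambda pool: {d for d in pool if grades[d] == max(grades[p] for p in pool)}
print(sbo_judge(list(grades), select_best))   # [{'d1'}, {'d3', 'd4'}, {'d2', 'd5'}] (set order may vary)
```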

Page 31:
Page 32:
Page 33:
Page 34:
Page 35:
Page 36:
Page 37:

USER STUDY

Experiment Design: a Latin Square design to minimize possible practice effects and order effects
- Each tool has been used to judge all three query sets
- Each query has been judged by three subjects
- Each subject has used every tool and judged every query, but no query is judged by the same subject with two different tools

Page 38:

USER STUDY

Experiment Design: 30 Chinese queries are divided into three balanced sets, covering both popular queries and long-tail queries

Page 39:

SCENE OF USER STUDY

Page 40:

BASIC EVALUATION RESULTS

Compared on efficiency, majority agreement, and discriminative power.

Table 2. Basic metrics comparison for the three judgment methods

Method                        | Time            | Majority Agreement (%) | Avg. # Relevance Degrees | # Untied Pairs
Five-grade Absolute Judgments | 6'38            | 97.50                  | 3.12                     | 2000
Quick-Sort Strategy           | 10'57 (+65.1%)* | 94.82 (-2.7%)*         | 4.40 (+41.0%)*           | 2585 (+29.3%)*
Select-the-Best-Ones Strategy | 5'54 (-11.1%)   | 99.31 (+1.9%)*         | 3.80 (+21.8%)*           | 2309 (+15.5%)*

Note: The t-test is conducted against the baseline, i.e. the Five-grade Absolute Judgments. "*" denotes that the difference is statistically significant (p-value < 0.05).

Page 41:

FURTHER ANALYSIS ON DISCRIMINATIVE POWER

Three grades, 'Excellent', 'Good', and 'Fair', are split, while 'Perfect' and 'Bad' are not.
More queries are influenced in SBO than in QS, and the splitting is distributed more evenly in SBO.

Table 3. Detailed analysis on splitting one grade of absolute relevance judgments into more subsets of relative relevance judgments. Each cell shows (the average number of subsets corresponding to the grade, the percentage of queries influenced by splitting that grade).

Strategy             | Perfect | Excellent      | Good           | Fair           | Bad
Quick-Sort           | (1, 0)  | (1.31, 14.44%) | (1.93, 55.56%) | (1.31, 6.67%)  | (1, 0)
Select-the-Best-Ones | (1, 0)  | (1.21, 23.33%) | (1.78, 51.11%) | (1.12, 20.00%) | (1, 0)

Page 42:

EVALUATION EXPERIMENT ON JUDGMENT QUALITY

Collecting experts' judgments:
- 5 experts, 15 Chinese queries
- Partial orders
- Judge individually, then discuss as a group

Experimental results

Table 4. Consistency between expert judgments and the judgments generated by the three methods for document pairs (the number of concordant/tied/discordant pairs divided by the total number of pairs)

           | Five-grade Absolute Judgments | Quick-Sort Strategy | Select-the-Best-Ones Strategy
Concordant | 0.2946                        | 0.3493              | 0.3371
Tied       | 0.6436                        | 0.4750              | 0.5638
Discordant | 0.0342                        | 0.1142              | 0.0677

Page 43:

DISCUSSION

Absolute relevance judgment method
- Pro: fast and easy to implement
- Con: loses some useful order information

Quick-Sort method
- Pros: light cognitive load, scalable
- Cons: high complexity, unstable standard

Select-the-Best-Ones method
- Pros: efficient, with good discriminative power
- Cons: heavy cognitive load, not scalable

Page 44:

CONCLUSION

We propose a new strategy, Select-the-Best-Ones (SBO), to address the problem of relative relevance judgment.

A user study and an evaluation experiment show that the SBO method:
- Outperforms the absolute method in terms of agreement and discriminative power
- Dramatically improves efficiency over the pairwise relative method (the QS strategy)
- Roughly halves the number of discordant pairs compared to the QS method

Page 45:

Thank you!

[email protected]