Page 1:

Evaluation in Information Retrieval

Speaker: Ruihua Song

Web Data Management Group, MSR Asia

Page 2:

Outline
- Basics of IR evaluation
- Introduction to TREC (Text Retrieval Conference)
- One selected paper: Select-the-Best-Ones, a new way to judge relative relevance

Page 3:

Motivating Examples

Which set is better?
- S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n}
- S3 = {r} vs. S4 = {r, r, n}

Which ranking list is better?
- L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r>
- L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>

(r: relevant, n: non-relevant, h: highly relevant)

Page 4:

Precision & Recall

Precision is the fraction of the retrieved documents that are relevant.
Recall is the fraction of the relevant documents that have been retrieved.

Let R be the relevant set, A the answer (retrieved) set, and Ra = R ∩ A the relevant documents that are retrieved:

Precision = |Ra| / |A|
Recall = |Ra| / |R|
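To make these definitions concrete, here is a minimal Python sketch (illustrative only; the function name and document ids are not from the talk) that reproduces the numbers of Example 1 on the next slide:

```python
# Set-based precision and recall (illustrative sketch, not code from the talk).

def precision_recall(retrieved, relevant):
    """Compute precision and recall for an unordered result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    ra = retrieved & relevant                                  # relevant docs that were retrieved
    precision = len(ra) / len(retrieved) if retrieved else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall

# Example 1 on the next slide: S1 = {r, r, r, n, n}, with 10 relevant documents in the judgments.
relevant_docs = [f"rel{i}" for i in range(1, 11)]              # hypothetical ids for the 10 relevant docs
s1 = ["rel1", "rel2", "rel3", "x1", "x2"]                      # 3 relevant + 2 non-relevant retrieved
print(precision_recall(s1, relevant_docs))                     # (0.6, 0.3), i.e. P1 and R1
```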

Page 5:

Precision & Recall (cont.)

Assume there are 10 relevant documents in the judgments.

Example 1: S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n}
- P1 = 3/5 = 0.6; R1 = 3/10 = 0.3
- P2 = 2/5 = 0.4; R2 = 2/10 = 0.2
- S1 > S2

Example 2: S3 = {r} vs. S4 = {r, r, n}
- P3 = 1/1 = 1; R3 = 1/10 = 0.1
- P4 = 2/3 = 0.667; R4 = 2/10 = 0.2
- ? (F1-measure)

Example 3: L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r>
- ? (order matters here; see Mean Average Precision on the next slide)

(r: relevant, n: non-relevant, h: highly relevant)
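As a hedged aside, the "?" in Example 2 can be resolved with the standard F1 measure, the harmonic mean of precision and recall; the slide only hints at it, so the arithmetic below is my own illustration:

```python
# Resolving Example 2 with the F1 measure (harmonic mean of precision and recall).
# Illustrative arithmetic only; the slide leaves the comparison as an open question.

def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(1.0, 0.1))    # S3: ~0.182
print(f1(0.667, 0.2))  # S4: ~0.308, so S4 > S3 under F1
```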

Page 6:

Mean Average Precision (MAP)

Defined as the mean of Average Precision (AP) over a set of queries. For a single query, AP sums the precision at the rank of each relevant document and divides by the total number of relevant documents (here 10).

Example 3: L1 = <r, r, r, n, n> vs. L2 = <n, n, r, r, r>
- AP1 = (1/1 + 2/2 + 3/3) / 10 = 0.3
- AP2 = (1/3 + 2/4 + 3/5) / 10 = 0.143
- L1 > L2
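A small Python sketch of Average Precision (illustrative, not from the talk) reproduces AP1 and AP2 above:

```python
# Average Precision for binary relevance labels (illustrative sketch).

def average_precision(ranked_labels, num_relevant):
    """ranked_labels: 1 = relevant, 0 = non-relevant, in rank order.
    num_relevant: total number of relevant documents in the judgments."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank        # precision at this relevant document's rank
    return precision_sum / num_relevant

print(average_precision([1, 1, 1, 0, 0], 10))   # 0.3    (AP1)
print(average_precision([0, 0, 1, 1, 1], 10))   # ~0.143 (AP2)
# MAP is simply the mean of these AP values over a set of queries.
```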

Page 7:

Other Metrics Based on Binary Judgments

P@10 (Precision at 10) is the number of relevant documents among the top 10 documents of the ranked list returned for a topic, divided by 10.
- e.g. if there are 3 relevant documents in the top 10 retrieved documents, P@10 = 0.3

MRR (Mean Reciprocal Rank) is the mean of the Reciprocal Rank (RR) over a set of queries; RR is the reciprocal of the rank of the first relevant document in the ranked list returned for a topic.
- e.g. if the first relevant document is ranked No. 4, RR = 1/4 = 0.25
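The same kind of sketch for P@10 and Reciprocal Rank (illustrative helper functions, not code from the talk):

```python
# P@k and Reciprocal Rank for binary relevance labels (illustrative sketch).

def precision_at_k(ranked_labels, k=10):
    """Fraction of relevant documents among the top k results."""
    return sum(ranked_labels[:k]) / k

def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]   # 3 relevant docs in the top 10, first at rank 4
print(precision_at_k(labels, 10))          # 0.3
print(reciprocal_rank(labels))             # 0.25
```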

Page 8:

Metrics Based on Graded Relevance

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
(r: relevant, n: non-relevant, h: highly relevant)

Which ranking list is better?

Cumulated Gain based metrics: CG, DCG, and nDCG

Two assumptions about the ranked result list:
- Highly relevant documents are more valuable than marginally relevant ones
- The further down the ranked list a relevant document appears, the less valuable it is for the user

Page 9:

CG: Cumulated Gains

From graded-relevance judgments to gain vectors (n → 0, r → 1, h → 2), then cumulate the gains from the top of the list downward.

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
- G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1>
- CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4>

Page 10:

DCG: Discounted Cumulated Gains

Discounting function: the gain at rank i is divided by log2(i) for i ≥ 2 (the gain at rank 1 is not discounted), so relevant documents found further down the list contribute less.

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
- G3 = <1, 0, 1, 0, 2>, G4 = <2, 0, 0, 1, 1>
- DG3 = <1, 0, 0.63, 0, 0.86>, DG4 = <2, 0, 0, 0.5, 0.43>
- CG3 = <1, 1, 2, 2, 4>, CG4 = <2, 2, 2, 3, 4>
- DCG3 = <1, 1, 1.63, 1.63, 2.49>, DCG4 = <2, 2, 2, 2.5, 2.93>
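The DG and DCG vectors can be reproduced with a short Python sketch; the log2 discount below is inferred from the numbers on the slide, and the code is illustrative rather than the talk's own:

```python
import math

def discounted_gains(gains):
    """Divide the gain at rank i by log2(i) for i >= 2; the gain at rank 1 is kept as-is."""
    return [g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1)]

def cumulate(values):
    """Running sum, producing CG or DCG vectors."""
    out, total = [], 0.0
    for v in values:
        total += v
        out.append(total)
    return out

g3 = [1, 0, 1, 0, 2]                            # gains for L3 = <r, n, r, n, h>
dg3 = discounted_gains(g3)                      # [1, 0, 0.63..., 0, 0.86...]
print([round(x, 2) for x in cumulate(dg3)])     # [1.0, 1.0, 1.63, 1.63, 2.49], i.e. DCG3
```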

Page 11:

nDCG: Normalized Discounted Cumulated Gains

The ideal (D)CG vector is computed from an ideal ranking that sorts the judged documents by decreasing relevance.

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
- Lideal = <h, r, r, n, n>
- Gideal = <2, 1, 1, 0, 0>; DGideal = <2, 1, 0.63, 0, 0>
- CGideal = <2, 3, 4, 4, 4>; DCGideal = <2, 3, 3.63, 3.63, 3.63>

Page 12:

nDCG: Normalized Discounted Cumulated Gains

Normalized (D)CG: divide each position of the (D)CG vector by the corresponding position of the ideal vector.

Example 4: L3 = <r, n, r, n, h> vs. L4 = <h, n, n, r, r>
- DCGideal = <2, 3, 3.63, 3.63, 3.63>
- nDCG3 = <1/2, 1/3, 1.63/3.63, 1.63/3.63, 2.49/3.63> = <0.5, 0.33, 0.45, 0.45, 0.69>
- nDCG4 = <2/2, 2/3, 2/3.63, 2.5/3.63, 2.93/3.63> = <1, 0.67, 0.55, 0.69, 0.81>
- L3 < L4
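Putting the pieces together, a minimal nDCG sketch (illustrative; it assumes the gain mapping n → 0, r → 1, h → 2 from the CG slide) reproduces nDCG4:

```python
import math

def dcg(gains):
    """DCG vector: cumulated gains with a log2(rank) discount applied from rank 2 onward."""
    out, total = [], 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log2(i)
        out.append(total)
    return out

def ndcg(gains):
    """Divide each DCG position by the DCG of the ideally re-ranked gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return [d / i for d, i in zip(dcg(gains), ideal)]

g4 = [2, 0, 0, 1, 1]                        # L4 = <h, n, n, r, r>
print([round(x, 2) for x in ndcg(g4)])      # [1.0, 0.67, 0.55, 0.69, 0.81], i.e. nDCG4
```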

Page 13:

Something Important

Dealing with small data sets:
- Cross validation

Significance testing:
- Paired, two-tailed t-test
- Is an observed difference (e.g., Green < Yellow in the figure) statistically significant, or just caused by chance?

[Figure: two score distributions, axes labeled score and p(.)]
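For illustration only (SciPy and the per-query scores below are my assumptions, not part of the slides), a paired two-tailed t-test over per-query metric values of two systems looks like this:

```python
from scipy import stats   # assumes SciPy is available

# Paired, two-tailed t-test over per-query scores (e.g., AP values) of two systems
# evaluated on the same queries. The numbers are made up for illustration.
system_a = [0.30, 0.45, 0.22, 0.60, 0.38, 0.51, 0.27, 0.44]
system_b = [0.35, 0.47, 0.30, 0.58, 0.42, 0.55, 0.33, 0.49]

t_stat, p_value = stats.ttest_rel(system_a, system_b)   # paired samples, two-sided by default
print(t_stat, p_value)
# If p_value < 0.05, the difference is usually reported as statistically significant
# rather than attributed to chance.
```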

Page 14:

Any questions?

Page 15:

By Ruihua Song, Web Data Management Group, MSR Asia

March 30, 2010

Introduction to TREC

Page 16:

Text Retrieval Conference (TREC)

Homepage: http://trec.nist.gov/

Goals:
- To encourage retrieval research based on large test collections
- To increase communication among industry, academia, and government
- To speed the transfer of technology from research labs into commercial products
- To increase the availability of appropriate evaluation techniques for use by industry and academia

Page 17:

Yearly Cycle of TREC

Page 18:

The TREC Tracks

Page 19:

TREC 2009

Tracks:
- Blog track
- Chemical IR track
- Entity track
- Legal track
- "Million Query" track
- Relevance Feedback track
- Web track

Participants: 67 groups representing 19 different countries

Page 20:

TREC 2010

Schedule:
- By Feb 18: submit your application to participate in TREC 2010
- Beginning March 2
- Nov 16-19: TREC 2010 at NIST in Gaithersburg, Md., USA

What's new: the Session track
- To test whether systems can improve their performance for a given query by using a previous query
- To evaluate system performance over an entire query session instead of a single query
- Track web page: http://ir.cis.udel.edu/sessions

Page 21:

Why TREC

To obtain public data sets (the most frequently used in IR papers)
- Pooling makes judgments unbiased for participants

To exchange ideas in emerging areas
- A strong Program Committee
- A healthy comparison of approaches

To influence evaluation methodologies
- By feedback or proposals

Page 22:

TREC 2009 Program Committee

Ellen Voorhees (chair), James Allan, Chris Buckley, Gord Cormack, Sue Dumais, Donna Harman, Bill Hersh, David Lewis, Doug Oard, John Prager, Stephen Robertson, Mark Sanderson, Ian Soboroff, Richard Tong

Page 23:

Any questions?

Page 24:

SELECT-THE-BEST-ONES: A NEW WAY TO JUDGE RELATIVE RELEVANCE
Ruihua Song, Qingwei Guo, Ruochi Zhang, Guomao Xin, Ji-Rong Wen, Yong Yu, Hsiao-Wuen Hon

Information Processing and Management, 2010

Page 25:

ABSOLUTE RELEVANCE JUDGMENTS

Page 26:

RELATIVE RELEVANCE JUDGMENTS

Problem formulation

Connections between absolute judgments (A) and relative judgments (R):
- A can be transformed to R by grouping documents that share a relevance grade into one set and ordering the sets by grade
- R can be transformed to A, if the assessors assign a relevance grade to each set

Page 27:

QUICK-SORT: A PAIRWISE STRATEGY

[Figure: each remaining document is compared against a pivot P and judged B (better), S (same), or W (worse) than the pivot]
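A minimal sketch of how a quick-sort style judging strategy can be simulated, assuming the assessor is modeled as a three-way comparison function; the names and recursion details below are my assumptions, not the paper's exact procedure:

```python
# Quick-sort style relative judging, simulated with a three-way comparator.
# Illustrative only; not the paper's exact procedure or interface.

def quicksort_judge(docs, compare):
    """Return a list of sets ordered from most to least relevant.
    compare(doc, pivot) must return 'better', 'same', or 'worse'."""
    if len(docs) <= 1:
        return [set(docs)] if docs else []
    pivot, rest = docs[0], docs[1:]
    better = [d for d in rest if compare(d, pivot) == 'better']
    same = {d for d in rest if compare(d, pivot) == 'same'} | {pivot}
    worse = [d for d in rest if compare(d, pivot) == 'worse']
    return quicksort_judge(better, compare) + [same] + quicksort_judge(worse, compare)

# Toy example: hidden grades drive the simulated assessor.
grades = {'d1': 2, 'd2': 0, 'd3': 1, 'd4': 1, 'd5': 0}
compare = lambda a, b: 'better' if grades[a] > grades[b] else ('worse' if grades[a] < grades[b] else 'same')
print(quicksort_judge(list(grades), compare))   # [{'d1'}, {'d3', 'd4'}, {'d2', 'd5'}] (set order may vary)
```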

Page 28:
Page 29:
Page 30:

SELECT-THE-BEST-ONES: A PROPOSED NEW STRATEGY

[Figure: in each round the assessor selects the best documents (B) out of the remaining pool (P); the selected documents form the next relevance set and judging continues on the rest]
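Under the same caveat, a sketch of the Select-the-Best-Ones idea, assuming the assessor is modeled as a function that picks the best documents out of a pool (illustrative, not the paper's interface):

```python
# Select-the-Best-Ones, simulated with a "pick the best subset" oracle.
# Illustrative only; not the paper's exact procedure or interface.

def sbo_judge(docs, select_best):
    """Return a list of sets ordered from most to least relevant.
    select_best(pool) must return the non-empty subset of pool judged best."""
    ordered_sets, remaining = [], set(docs)
    while remaining:
        best = set(select_best(remaining))
        ordered_sets.append(best)
        remaining -= best
    return ordered_sets

# Toy example: hidden grades drive the simulated assessor.
grades = {'d1': 2, 'd2': 0, 'd3': 1, 'd4': 1, 'd5': 0}
select_best = lambda pool: {d for d in pool if grades[d] == max(grades[p] for p in pool)}
print(sbo_judge(list(grades), select_best))   # [{'d1'}, {'d3', 'd4'}, {'d2', 'd5'}] (set order may vary)
```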

Page 31:
Page 32:
Page 33:
Page 34:
Page 35:
Page 36:
Page 37:

USER STUDY

Experiment Design: a Latin Square design to minimize possible practice effects and order effects
- Each tool has been used to judge all three query sets
- Each query has been judged by three subjects
- Each subject has used every tool and judged every query, but no query is judged by the same subject with two different tools

Page 38:

USER STUDY

Experiment Design: 30 Chinese queries are divided into three balanced sets, covering both popular queries and long-tail queries

Page 39:

SCENE OF USER STUDY

Page 40:

BASIC EVALUATION RESULTS

Compared on efficiency, majority agreement, and discriminative power.

Table 2. Basic metrics comparison for the three judgment methods

Method                        | Time            | Majority Agreement (%) | Avg. # Relevance Degrees | # Untied Pairs
Five-grade Absolute Judgments | 6'38            | 97.50                  | 3.12                     | 2000
Quick-Sort Strategy           | 10'57 (+65.1%)* | 94.82 (-2.7%)*         | 4.40 (+41.0%)*           | 2585 (+29.3%)*
Select-the-Best-Ones Strategy | 5'54 (-11.1%)   | 99.31 (+1.9%)*         | 3.80 (+21.8%)*           | 2309 (+15.5%)*

Note: The t-test is conducted against the baseline, i.e. the Five-grade Absolute Judgments. "*" denotes that the difference is statistically significant (p-value < 0.05).

Page 41:

FURTHER ANALYSIS ON DISCRIMINATIVE POWER

Three grades, 'Excellent', 'Good', and 'Fair', are split, while 'Perfect' and 'Bad' are not.
More queries are influenced in SBO than in QS, and the splitting is distributed more evenly in SBO.

Table 3. Detailed analysis on splitting one grade of absolute relevance judgments into more subsets of relative relevance judgments. Each cell shows (the average number of subsets corresponding to the grade, the percentage of queries influenced by splitting that grade).

Strategy             | Perfect | Excellent      | Good           | Fair           | Bad
Quick-Sort           | (1, 0)  | (1.31, 14.44%) | (1.93, 55.56%) | (1.31, 6.67%)  | (1, 0)
Select-the-Best-Ones | (1, 0)  | (1.21, 23.33%) | (1.78, 51.11%) | (1.12, 20.00%) | (1, 0)

Page 42:

EVALUATION EXPERIMENT ON JUDGMENT QUALITY

Collecting experts' judgments:
- 5 experts, 15 Chinese queries
- Partial orders
- Judge individually, then discuss as a group

Experimental results

Table 4. Consistency between expert judgments and the judgments generated by the three methods for document pairs (the number of concordant/tied/discordant pairs divided by the total number of pairs)

           | Five-grade Absolute Judgments | Quick-Sort Strategy | Select-the-Best-Ones Strategy
Concordant | 0.2946                        | 0.3493              | 0.3371
Tied       | 0.6436                        | 0.4750              | 0.5638
Discordant | 0.0342                        | 0.1142              | 0.0677

Page 43:

DISCUSSION

Absolute relevance judgment method
- Pro: fast and easy to implement
- Con: loses some useful order information

Quick-Sort method
- Pros: light cognitive load, scalable
- Cons: high complexity, unstable standard

Select-the-Best-Ones method
- Pros: efficient, with good discriminative power
- Cons: heavy cognitive load, not scalable

Page 44:

CONCLUSION

We propose a new strategy, Select-the-Best-Ones (SBO), to address the problem of relative relevance judgment.

A user study and an evaluation experiment show that the SBO method:
- Outperforms the absolute method in terms of agreement and discriminative power
- Dramatically improves efficiency over the pairwise relative method (the QS strategy)
- Roughly halves the number of discordant pairs compared to the QS method

Page 45:

Thank you!

[email protected]