MediaEval 2016 - IR evaluation: Putting the user back in the loop

IR evaluation: Putting the user back in the loop

Evangelos Kanoulas, e.kanoulas@uva.nl

Change the search algorithm.

How can we know whether we made the users happier?

Different approaches to evaluation

• User studies
• In-situ evaluation
  – A/B Testing
  – Interleaving (a sketch follows this list)
• Collection-based evaluation
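Interleaving compares two rankers inside a single result list rather than across separate user buckets. Below is a minimal sketch of team-draft interleaving with click-based credit; the function names and the simple click-counting rule are illustrative assumptions, not something specified in the talk.

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Merge two rankings into one list, remembering which system ("team")
    contributed each document (team-draft interleaving)."""
    interleaved, teams, seen = [], {}, set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(seen) < len(all_docs):
        # Flip a coin to decide which system picks first in this round.
        order = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(order)
        for team, ranking in order:
            doc = next((d for d in ranking if d not in seen), None)
            if doc is not None:
                interleaved.append(doc)
                teams[doc] = team
                seen.add(doc)
    return interleaved, teams

def credit(clicked_docs, teams):
    """Count clicks per team; the system with more clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in teams:
            wins[teams[doc]] += 1
    return wins
```

Aggregated over many impressions, the fraction of wins for each system gives an online preference between the two rankers.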

In-situ evaluation

A/B Testing

Baseline (control) vs. Experimental (treatment)
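In an A/B test, a random slice of users is routed to the treatment ranker while the rest stay on the control, and an online metric is compared between the two buckets. A minimal sketch, assuming click-through rate as the metric and a two-proportion z-test for significance; the bucket-assignment scheme and the numbers are invented for illustration.

```python
import math
import random

def assign_bucket(user_id: str, treatment_fraction: float = 0.5) -> str:
    """Deterministic, sticky assignment: the same user always lands in the same bucket."""
    return "treatment" if random.Random(user_id).random() < treatment_fraction else "control"

def two_proportion_z(clicks_a, impressions_a, clicks_b, impressions_b):
    """z-statistic for the difference in click-through rate between two buckets."""
    p_a, p_b = clicks_a / impressions_a, clicks_b / impressions_b
    p = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(p * (1 - p) * (1 / impressions_a + 1 / impressions_b))
    return (p_b - p_a) / se

# Toy example: did the treatment ranker raise CTR?
z = two_proportion_z(clicks_a=480, impressions_a=10_000,
                     clicks_b=540, impressions_b=10_000)
print(f"z = {z:.2f}")   # |z| > 1.96 would be significant at the 5% level
```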

Collection-based evaluation

Machine Learning
• Feature vectors
• Labels

Cranfield Collections

Information Retrieval
• Documents
• Queries
• Labels: relevance judgments

Query 1, Query 2, ..., Query N
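With documents, queries, and relevance judgments in hand, collection-based evaluation reduces to scoring a ranked run offline against the judgments. A minimal sketch of nDCG@k, one of the metrics used later in the Session Track results; the toy qrels and run below are invented.

```python
import math

def dcg(gains, k=10):
    """Discounted cumulative gain over the top-k graded relevance gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_docs, qrels, k=10):
    """nDCG@k for one query: DCG of the run divided by DCG of the ideal ranking."""
    gains = [qrels.get(d, 0) for d in ranked_docs]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: judgments on a 0-2 scale for one query.
qrels = {"d1": 2, "d4": 1, "d7": 2}
run = ["d4", "d1", "d3", "d7", "d9"]
print(ndcg_at_k(run, qrels, k=10))   # about 0.83
```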

Cranfield Paradigm (system focus)
• Simple user model
• Controlled experiments
• Reusable but static test collections

Online Evaluation (user focus)
• Full user participation
• Many degrees of freedom
• Unrepeatable experiments

Evaluation Landscape

• TREC Tasks
• TREC Session
• TREC Total Recall
• TREC OpenSearch

TREC Total Recall

[Diagram: a query is run by the search algorithm over the document collection; results go to a human assessor, whose judgments feed back into the search algorithm]
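The loop in the diagram can be read as continuous active learning: rank the collection, send the highest-scoring unjudged documents to the human assessor, fold the judgments back into the model, and repeat. A rough sketch under that reading, assuming a scikit-learn classifier and a hypothetical assess() callable standing in for the human assessor; this is not the official track protocol.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def total_recall_loop(collection, seed_query, assess, batch_size=10, max_rounds=50):
    """Sketch of the query -> search -> results -> assessor loop as
    continuous active learning: retrain after every batch of judgments."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(collection + [seed_query])
    doc_X, query_X = X[:-1], X[-1]
    judged = {}                                    # doc index -> 0/1 relevance

    # First batch: plain ranking by similarity to the seed query.
    scores = (doc_X @ query_X.T).toarray().ravel()

    for _ in range(max_rounds):
        # Send the top unjudged documents to the human assessor.
        candidates = [i for i in scores.argsort()[::-1] if i not in judged][:batch_size]
        if not candidates:
            break
        for i in candidates:
            judged[i] = assess(collection[i])      # assess() is a hypothetical human-in-the-loop call
        labels = set(judged.values())
        if len(labels) > 1:
            # Retrain a classifier on everything judged so far and re-score the collection.
            clf = LogisticRegression(max_iter=1000)
            clf.fit(doc_X[list(judged)], [judged[i] for i in judged])
            scores = clf.predict_proba(doc_X)[:, 1]
    return judged
```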

TREC Session Track

TREC Session Track [2010-2014]

1. improve search by using session information

2. improve search over an entire user's session instead of a single query

Example queries: "Paris Luxurious Hotels" vs. "Paris Hilton"

Test Collection

Evaluating Retrieval over Sessions: The TREC Session Track 2011–2014
Ben Carterette (1), Paul Clough (2), Mark Hall (3), Evangelos Kanoulas (4), Mark Sanderson (5)

1 University of Delaware, 2 University of Sheffield, 3 Edge Hill University, 4 University of Amsterdam, 5 RMIT University

Objectives

• Test if the retrieval effectiveness of a query could be improved by using previous queries, ranked results, and user interactions.

Test Collection

Four test collections (2011–2014) comprising N sessions of varying length, each consisting of:
• m_i blocks of user interactions (the session's length);
• the current query q_{m_i} in the session;
• m_i − 1 blocks of interactions in the session prior to the current query, composed of:
  – the user queries in the session, q_1, q_2, ..., q_{m_i − 1};
  – the ranked list of URLs seen by the user for each of those queries;
  – the set of clicked URLs/snippets.
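As a concrete (and purely illustrative) picture of one such session record, the sketch below uses invented field names rather than the track's actual distribution format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Interaction:
    """One earlier query in the session plus what the user saw and clicked."""
    query: str
    results: List[str]               # ranked URLs shown to the user
    clicked: List[str] = field(default_factory=list)

@dataclass
class Session:
    session_id: int
    current_query: str               # q_{m_i}: the query systems must answer
    history: List[Interaction]       # the m_i - 1 earlier interaction blocks

example = Session(
    session_id=1,
    current_query="cheap luxurious hotels paris",
    history=[Interaction(query="paris hotels",
                         results=["url1", "url2", "url3"],
                         clicked=["url2"])],
)
```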

Test Collection Statistics

                               2011            2012              2013              2014
collection                     ClueWeb09       ClueWeb09         ClueWeb12         ClueWeb12

topic properties
  topic set size               62              48                61                60
  topic cat. dist.             known-item      10 exploratory,   10 exploratory,   15 exploratory,
                                               6 interpretive,   9 interpretive,   15 interpretive,
                                               20 known-item,    32 known-item,    15 known-item,
                                               12 known-subj     10 known-subj     15 known-subj

session properties
  user population              U. Sheffield    U. Sheffield      U. Sheffield +    MTurk
                                                                 IR researchers
  search engine                BOSS + CW09     BOSS + CW09       indri             indri
                               filter          filter
  total sessions               76              98                133               1,257
  sessions per topic           1.2             2.0               2.2               21.0
  mean length (in queries)     3.7             3.0               3.7               3.7
  median time between queries  68.5s           66.7s             72.2s             25.6s

relevance judgments
  topics judged                62              48                49                51
  total judgments              19,413          17,861            13,132            16,949

Algorithmic Improvements

• Session history can be used to improve effectiveness over basic ad hoc retrieval.

[Figure: maximum change in nDCG@10 from the RL1 baseline, plotted against run number, one series per year 2011-2014]
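One simple way a run might exploit session history, shown purely as an illustration and not as any participant's actual method: expand the current query with terms from earlier queries in the session, down-weighting older ones.

```python
from collections import Counter

def expand_with_history(current_query, previous_queries, decay=0.5, top_k=5):
    """Weight terms from earlier queries (more recent queries count more)
    and append the top-weighted new terms to the current query."""
    weights = Counter()
    for age, query in enumerate(reversed(previous_queries)):   # most recent query first
        for term in query.lower().split():
            weights[term] += decay ** age
    current_terms = set(current_query.lower().split())
    extra = [t for t, _ in weights.most_common() if t not in current_terms][:top_k]
    return current_query + " " + " ".join(extra)

print(expand_with_history("hilton rates",
                          ["paris luxurious hotels", "paris hilton hotel"]))
# -> "hilton rates paris hotel luxurious hotels"
```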

Topic - System Analysis

• Known-subject and exploratory topics benefit most from access to session history.
• There is substantial variability in topics due to the way the users perform their search and formulate their query.

[Figure: per-topic differences in ΔnDCG@10 over sessions, topics ordered by median, drawn from the 2011-2014 collections]

Conclusions

• Retrieval effectiveness can be improved for ad hoc retrieval using data based on session history.
• The more detailed the session data, the greater the improvement.

SIGIR 2016

TREC Session Track [2010-2014]

1. improve search by using session information

2. improve search over an entire user's session instead of a single query

TREC Tasks Track

TREC Tasks Track [2015–now]

1. understand the user's underlying task

2. assist the user in completing the task

Make Improvements At Home: TASK UNDERSTANDING

Make Improvements At Home: TASK COMPLETION
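Task understanding is usually framed as recovering the subtasks behind a query such as "make improvements at home" and returning a ranked list of key phrases that covers them. The sketch below shows one naive way to do that, a greedy max-coverage selection over a hypothetical phrase-to-subtask mapping; it is not the track's evaluation method or any official baseline.

```python
def rank_phrases_by_coverage(candidates, phrase_to_subtasks, k=10):
    """Greedy max-coverage: at each step pick the candidate phrase that covers
    the most subtasks not yet covered by the phrases already selected."""
    selected, covered = [], set()
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda p: len(phrase_to_subtasks.get(p, set()) - covered))
        selected.append(best)
        covered |= phrase_to_subtasks.get(best, set())
        pool.remove(best)
    return selected

# Toy example for the task "make improvements at home" (subtask labels are invented).
phrase_to_subtasks = {
    "paint interior walls":       {"painting"},
    "best interior paint brands": {"painting"},
    "install laminate flooring":  {"flooring"},
    "home improvement loans":     {"financing"},
}
print(rank_phrases_by_coverage(phrase_to_subtasks.keys(), phrase_to_subtasks, k=3))
```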

TREC Session Track [2010-2014]

1. improve search by using session information

2. improve search over an entire user's session instead of a single query

CLEF Dynamic Search for Complex Tasks

CLEF Complex Tasks [now]

1. Produce methodology and algorithms that will lead to a dynamic test collection by simulating users

2. Understand and quantify what constitutes a good ranking of documents at different stages of a session, and a good overall session
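A dynamic test collection built by simulating users implies some user simulator in the loop. The sketch below is one rough, invented version: the simulated user issues queries from a pool, scans the top of each ranking, "clicks" documents the qrels mark relevant, and stops with some probability once satisfied. All parameters and the search() callable are assumptions, not the CLEF lab's actual methodology.

```python
import random

def simulate_session(search, query_pool, qrels, scan_depth=10, stop_prob=0.3, seed=0):
    """Drive a search system with a simulated user and log the interaction."""
    rng = random.Random(seed)
    remaining = list(query_pool)
    rng.shuffle(remaining)
    session = []
    while remaining:
        query = remaining.pop()
        results = search(query)[:scan_depth]        # search() is the system under test
        clicks = [d for d in results if qrels.get(d, 0) > 0]
        session.append({"query": query, "results": results, "clicks": clicks})
        # The simulated user stops once satisfied: found something relevant and the coin says stop.
        if clicks and rng.random() < stop_prob:
            break
    return session
```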

TREC Open Search
