Information Retrieval, Fall 2016: Evaluating the Quality of Information Retrieval (Online)


Page 1: Information Retrieval, Fall 2016: Evaluating the Quality of Information Retrieval (Online)

Online evaluation Hypothesis testing

Information Retrieval: Online Evaluation

Ilya Markov, [email protected]

University of Amsterdam

Ilya Markov [email protected] Information Retrieval 1

Page 2

Course overview

Data Acquisition

Data Processing

Data Storage

Evaluation, Ranking, Query Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced

Page 3

This lecture

Data Acquisition

Data Processing

Data Storage

Evaluation, Ranking, Query Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced

Page 4

Online scenario

Page 5

Outline

1 Online evaluation
  Online metrics
  Between-subject experiments
  Within-subject experiments
  Summary

2 Hypothesis testing

Page 6

Outline

1 Online evaluation
  Online metrics
  Between-subject experiments
  Within-subject experiments
  Summary

Page 7

Which user-generated signals indicate search quality?

Page 8

Online metrics

Type of interaction | Metric                       | Good | Bad
Clicks              | Click-through rate           |  ↑   |  ↓
Clicks              | Click rank (reciprocal rank) |  ↓   |  ↑
Clicks              | Abandonment                  |  ↓   |  ↑
Time                | Dwell time                   |  ↑   |  ↓
Time                | Time to first click          |  ↓   |  ↑
Time                | Time to last click           |  ↑   |  ↓
Queries             | Number of reformulations     |  ↓   |  ↑
Queries             | Number of abandoned queries  |  ↓   |  ↑
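The click metrics in this table can be computed directly from a query log. A minimal sketch, assuming a hypothetical log format where each query is represented by the list of (1-based) ranks of its clicked results:

```python
# Click-based online metrics from a (hypothetical) query log.
def online_metrics(sessions):
    """sessions: list of lists of clicked ranks (1-based), one list per query."""
    n = len(sessions)
    clicked = [s for s in sessions if s]        # queries that received a click
    ctr = len(clicked) / n                      # click-through rate
    abandonment = 1 - ctr                       # fraction of queries with no click
    mrr = sum(1 / min(s) for s in clicked) / n  # reciprocal rank of first click
    return {"CTR": ctr, "abandonment": abandonment, "MRR": mrr}

log = [[1], [3, 5], [], [2]]    # toy log: 4 queries, one abandoned
print(online_metrics(log))      # CTR 0.75, abandonment 0.25, MRR ~0.458
```

A query with no clicks contributes 0 to the reciprocal-rank sum, so abandonment pushes MRR down, matching the arrows in the table.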

Page 9

Outline

1 Online evaluation
  Online metrics
  Between-subject experiments
  Within-subject experiments
  Summary

Page 10

A/B testing

[Diagram: incoming user traffic is split between the Control and Treatment systems]

Page 11

A/B testing

1 Set the current search system as control

2 Set an alternative search system as treatment

3 Assign 0.5%-1.0% of users to each of these systems

4 Record user interactions with these systems during time period T

5 Compare the systems using online metrics

6 Choose a winner based on one or several metrics
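Steps 1–3 above can be sketched in a few lines. Hash-based bucketing is one common way to make the assignment deterministic per user; the salt, fraction, and function names below are illustrative, not from the slides:

```python
# Deterministic assignment of a small traffic fraction to control/treatment.
import hashlib

def assign_bucket(user_id, fraction=0.01, salt="experiment-42"):
    """Return 'control', 'treatment', or None (user not in the experiment)."""
    h = int(hashlib.md5((salt + user_id).encode()).hexdigest(), 16)
    slot = (h % 10000) / 10000      # pseudo-uniform in [0, 1), stable per user
    if slot < fraction:
        return "control"
    if slot < 2 * fraction:
        return "treatment"
    return None                     # the remaining ~98% see the current system

buckets = [assign_bucket(f"user{i}") for i in range(100000)]
print(buckets.count("control"), buckets.count("treatment"))  # roughly 1000 each
```

Salting matters: changing the salt per experiment re-randomizes the split, so the same users are not repeatedly assigned to treatments across experiments.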

Page 12

A/B testing discussion

Pros

Can evaluate anything
Can use any online metric

Cons

High variance between users
Not very sensitive
Needs lots of observations

Page 13

Outline

1 Online evaluation
  Online metrics
  Between-subject experiments
  Within-subject experiments
  Summary

Page 14

Interleaving

1 Given a user's query, produce two rankings (current and alternative)

2 Merge the rankings into a single ranking using a mixing policy

3 Present the merged ranking to the user and collect interactions (see online metrics)

4 Choose a winning ranking using a scoring rule

5 Repeat steps 1–4 until a clear winner is identified

Page 15

Team draft interleaving

[Animation: for the query "search engines that learn from their users", the results of Google A and Google B are interleaved into a single list; the user's clicks are credited to the ranking that contributed the clicked results, and in this example Google A loses against Google B]

Slides by Anne Schuth, http://www.anneschuth.nl

Page 16

Team draft interleaving

Mixing policy: each ranker selects its highest-ranked document that is not yet in the combined list

Scoring rule: a ranker is preferred if its results get more clicks
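The mixing policy and scoring rule above can be sketched as follows (a simplified team-draft implementation; document names and click positions are illustrative):

```python
# Team-draft interleaving: the two rankers take turns (in random order per
# round) adding their highest-ranked document not yet in the combined list.
import random

def team_draft(ranking_a, ranking_b, rng=random):
    interleaved, teams = [], []
    while len(interleaved) < len(set(ranking_a) | set(ranking_b)):
        for ranker, team in rng.sample([(ranking_a, "A"), (ranking_b, "B")], 2):
            doc = next((d for d in ranker if d not in interleaved), None)
            if doc is not None:
                interleaved.append(doc)
                teams.append(team)
    return interleaved, teams

def score(teams, clicked_positions):
    """Scoring rule: the ranker whose results get more clicks is preferred."""
    clicks = [teams[i] for i in clicked_positions]
    return {"A": clicks.count("A"), "B": clicks.count("B")}

docs, teams = team_draft(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(docs, teams, score(teams, [0]))
```

Randomizing which team picks first in each round removes the position advantage a fixed order would give one ranker.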

Page 17

Other interleaving methods

Probabilistic interleaving

Optimized interleaving

Multileaving

Page 18

Interleaving discussion

Pros

No variance due to different users
Highly sensitive
Needs far fewer observations than A/B testing

Cons

Can only use clicks

Page 19

Outline

1 Online evaluation
  Online metrics
  Between-subject experiments
  Within-subject experiments
  Summary

Page 20

Online evaluation summary

Online metrics

Clicks, Time, Queries

Between-subject experiments – A/B testing

Within-subject experiments – interleaving

Page 21

What are the advantages of online evaluation?

Based on real users

Cheap as it uses a running search system

Page 22

What are the disadvantages of online evaluation?

Online metrics are difficult to interpret

May disturb users

Cannot run too many experiments in parallel

User search interactions are biased

Page 23

Click models

http://clickmodels.weebly.com/the-book.html

Page 24

Materials

K. Hofmann, L. Li, F. Radlinski
Online Evaluation for Information Retrieval
Foundations and Trends in Information Retrieval, 2016

Page 25

Which evaluation paradigm should we use?

Both

Page 26

Evaluating efficiency

Croft et al., “Search Engines, Information Retrieval in Practice”

Page 27

Outline

1 Online evaluation

2 Hypothesis testing
  Basics
  Hypothesis testing in IR

Page 28

Example

How can we be sure that B is better than A?

Test statistical significance (hypothesis testing)

Croft et al., “Search Engines, Information Retrieval in Practice”

Page 29

Outline

2 Hypothesis testing
  Basics
  Hypothesis testing in IR

Page 30

Test procedure

1 Set null hypothesis H0

2 Set alternative hypothesis H1

3 Collect sample data X = {X_1, ..., X_n}

X is unlikely under H0 ⟹ reject H0 in favor of H1

X is not unlikely under H0 ⟹ no evidence against H0

... but this is not evidence in favor of H0!

Page 31

Example

1 H0: p_head = 0.8

2 H1: p_head ≠ 0.8

3 Perform 10 tosses, observe 4 heads

P(h heads in n tosses) = n! / (h! (n−h)!) · p^h (1−p)^(n−h)

10! / (4! 6!) · 0.8^4 · 0.2^6 ≈ 0.005

4 Perform 10 tosses, observe 7 heads

10! / (7! 3!) · 0.8^7 · 0.2^3 ≈ 0.201
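The two probabilities can be checked directly from the binomial formula above:

```python
# Binomial probability P(h heads in n tosses) under H0: p = 0.8.
from math import comb

def binom_pmf(n, h, p):
    return comb(n, h) * p ** h * (1 - p) ** (n - h)

print(binom_pmf(10, 4, 0.8))   # ~0.0055: 4 heads is unlikely under H0
print(binom_pmf(10, 7, 0.8))   # ~0.2013: 7 heads is not unlikely under H0
```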

Page 32

Test procedure (cont’d)

1 Consider a statistical model

2 Set H0 and H1

3 Choose a test statistic T(X_1, ..., X_n)

4 Choose a critical region C (discussed next)

5 Decision rule:

T ∈ C ⟹ reject H0 in favor of H1

T ∉ C ⟹ fail to reject H0

Page 33

Example (cont’d)

H0: p_head = 0.8
H1: p_head ≠ 0.8

Sample mean: X̄ = (1/n) Σ_{i=1}^n X_i

For large enough n: X̄ | H0 ∼ N(0.8, σ²/n)

Test statistic: T = (X̄ − 0.8) / (σ/√n) ∼ N(0, 1)

Critical region: C = (−∞, −z_c] ∪ [z_c, ∞), chosen so that P(T ∈ C | H0) ≤ α

α – size of the test, usually ∈ {0.01, 0.05}; for α = 0.05, z_{α/2} = 1.96

Picture taken from http://2012books.lardbucket.org/books/beginning-statistics/s09-04-areas-of-tails-of-distribution.html

Page 34

Example (cont’d)

Reject H0 if

T ≤ −z_{α/2} or T ≥ z_{α/2}

equivalently, X̄ ≤ 0.8 − z_{α/2} · σ/√n or X̄ ≥ 0.8 + z_{α/2} · σ/√n

Alternatively, reject H0 if

p = P(|T| > T_obs | H0) ≤ α (the p-value)

The test statistic and p-value can be calculated using any statistical software, e.g., R
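For this normal test statistic no special software is strictly required: the standard normal CDF can be written in terms of the error function, so the two-sided p-value is a one-liner (a sketch in Python):

```python
# Two-sided p-value for a standard normal test statistic, via math.erf.
from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def two_sided_p(t_obs):
    """p = P(|T| > |t_obs|) for T ~ N(0, 1)."""
    return 2 * (1 - normal_cdf(abs(t_obs)))

print(round(two_sided_p(1.96), 3))  # 0.05 -- the boundary of the critical region
```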

Page 35

Errors

                 | H0 is false   | H0 is true
Reject H0        | power         | type I error (α)
Do not reject H0 | type II error | correct

Page 36

Outline

2 Hypothesis testing
  Basics
  Hypothesis testing in IR

Page 37

T-test

1 Get measurements for systems A and B:
M(A) ∼ N(μ_A, σ²), M(B) ∼ N(μ_B, σ²)

2 H0: μ_A = μ_B

3 H1: μ_A ≠ μ_B

4 T = (Ā − B̄) / (σ̂/√n) ∼ t(n−1) – Student's t-distribution

5 Use the standard hypothesis testing procedure with α ∈ {0.01, 0.05}

Picture taken from https://en.wikipedia.org/wiki/Student%27s_t-distribution

Page 38

T-test example

B̄ − Ā = 21.4
σ̂ = 29.1

T = 21.4 / (29.1 / √10) = 2.33

p = P(|T| > 2.33 | H0) = 0.02

If α = 0.05, reject H0
If α = 0.01, do not reject H0
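Recomputing the example: the critical values below are the standard two-sided Student-t quantiles for n − 1 = 9 degrees of freedom (2.262 at α = 0.05, 3.250 at α = 0.01), taken from tables rather than from the slide:

```python
# Paired t-test from the summary statistics of the example.
from math import sqrt

mean_diff, s, n = 21.4, 29.1, 10      # mean(B - A), sample std, #queries
t = mean_diff / (s / sqrt(n))
print(round(t, 2))        # 2.33
print(t > 2.262)          # True:  reject H0 at alpha = 0.05
print(t > 3.250)          # False: cannot reject H0 at alpha = 0.01
```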

Croft et al., “Search Engines, Information Retrieval in Practice”

Page 39

Wilcoxon signed-ranks test

1 Get measurements for systems A and B

2 For each item i, compute |m_{A,i} − m_{B,i}| and sgn(m_{A,i} − m_{B,i})

3 Exclude items with |m_{A,i} − m_{B,i}| = 0

4 Order the remaining N_nz items by |m_{A,i} − m_{B,i}|

5 Assign ranks R_i from smallest to largest

6 Compute the test statistic

W = Σ_{i=1}^{N_nz} sgn(m_{A,i} − m_{B,i}) · R_i

7 For large N_nz, W is approximately normally distributed

8 Use the standard hypothesis testing procedure with α ∈ {0.05, 0.01}

Page 40

Wilcoxon test example

Ranked non-zero differences: 2, 9, 10, 24, 25, 25, 41, 60, 70

Signed ranks: −1, +2, +3, −4, +5.5, +5.5, +7, +8, +9

W = 35

p = P(|W| > 35 | H0) = 0.025

If α = 0.05, reject H0
If α = 0.01, do not reject H0
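The statistic W from this example can be recomputed with a short function (a sketch; ties get average ranks, which is how the two differences of 25 both receive rank 5.5):

```python
# Wilcoxon signed-rank statistic W with average ranks for tied |differences|.
def wilcoxon_w(diffs):
    nz = sorted((d for d in diffs if d != 0), key=abs)  # drop zero differences
    ranks, i = [], 0
    while i < len(nz):
        j = i
        while j < len(nz) and abs(nz[j]) == abs(nz[i]):  # find the tie group
            j += 1
        avg = (i + 1 + j) / 2            # mean of ranks i+1 .. j (1-based)
        ranks += [avg] * (j - i)
        i = j
    return sum(r if d > 0 else -r for d, r in zip(nz, ranks))

diffs = [-2, 9, 10, -24, 25, 25, 41, 60, 70]
print(wilcoxon_w(diffs))   # 35.0
```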

Croft et al., “Search Engines, Information Retrieval in Practice”

Page 41

Hypothesis testing summary

IR must use statistical testing

The most common, and one of the most powerful, is the paired t-test

Page 42

Materials

M. Smucker, J. Allan, B. Carterette
A Comparison of Statistical Significance Tests for Information Retrieval Evaluation
Proceedings of CIKM, pages 623–632, 2007

Page 43

Course overview

Data Acquisition

Data Processing

Data Storage

Evaluation, Ranking, Query Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced

Page 44

See you tomorrow at 11:15

Data Acquisition

Data Processing

Data Storage

Evaluation, Ranking, Query Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced
