Information Retrieval: A Primer Ellen Voorhees



IR Primer (Parts based on an outline by James Allan, UMass)

• Basic IR processing
  – bag of words
  – alternate approaches
• Web searching
  – differences from non-web searching
  – “advanced” features
• Available systems
  – SMART, MG
  – Lemur, Lucene


IR Problem Definition

• Find “documents” that are relevant to a user’s information need
  – unstructured, natural language text
  – lack of structure precludes use of traditional database technologies
  – a document is the unit of text of interest, traditionally at least a paragraph in length


IR Issues

• How to represent text
  – indexing
• How to represent information need
  – free text vs. formal query language
• How to compare representations
  – retrieval models
• How to evaluate quality of the search
  – recall/precision variants


Original Approach

• Manually assign descriptors
  – expensive
  – human agreement on descriptors is poor
  – controlled vocabulary increases annotator consistency but requires searchers to use the same vocabulary
• Still used
  – MEDLINE
  – motivation for many semantic web proposals


Current Standard Approach

• Statistical: compute degree of match between query and document
  – weighted “bag of words”
  – rank documents by how well they match
• Assume top documents are good and modify query based on their words
• Rerun search with modified query


Single biggest factor in IR Effectiveness

• The query!
  • many studies have shown the topic effect is bigger than the system effect or interaction effects
  • TREC query track demonstrated effect of different queries for same topic
• Creating good queries
  • select correct document set to ask
  • use discriminative, high-content query words
  • include informative alternate expressions
  • prefer specific examples over general concepts


How to Represent Text

• Tokenization
  • might include identifying phrases
  • might include identifying other lexical structures such as names, amounts, etc.
  • increased importance for (factoid) QA
• Remove “stop words”
• Perform stemming
• Weight terms
  • might include mining other data sources
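The first three steps (tokenize, remove stop words, stem) can be sketched roughly as follows. This is a minimal illustration, not any particular system's indexer: the stop list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real lists hold a few hundred words.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "is", "of", "on", "the", "to", "was", "with"}

# Suffixes tried in order by the crude stemmer below (an assumption, not Porter's rules).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip at most one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(text):
    """Tokenize, drop stop words, stem, and count the remaining terms."""
    stems = (crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS)
    return Counter(stems)


bag = bag_of_words("Twenty Czechs broke the world record for indoor soccer")
```

The resulting counter is an unweighted bag of tokens; term weighting would be applied to these counts as a separate step.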


Original Text

Czechs Play Indoor Soccer for More Than Four Days Straight for Record

Twenty Czechs broke the world record for indoor soccer last month, playing the game continuously for 107 hours and 15 minutes, the official Czechoslovak news agency CTK reported.

Two teams began the endeavor in the West Bohemian town of Holysov on Dec. 13 and ended with the new world record on Dec. 17, CTK said in the dispatch Monday.

According to the news agency, the previous record of 106 hours 10 minutes was held by English players. The Czechs new record is to be recorded in the Guinness Book of World Records, CTK said.


Bag of Tokens

agency(2); began; bohemian; book; broke; continuously; ctk(3); czechoslovak; czechs(3); days; dec(2); dispatch; ended; endeavor; english; game; guinness; held; holysov; hour(2); indoor(2); minutes(2); monday; month; news(2); official; play(3); previous; record(7); reported; soccer(2); straight; teams; town; twenty; west; world(3)

official czechoslovak; previous record; world record(3); days straight; news agency(2); ctk reported; guinness book; ctk agency; teams began


Final Document Representation

agent 2.80; begin 1.55; bohem 6.01; book 2.63; brok 2.60; continu 1.55; ctk 13.35; czechoslovak 4.36; czech 11.65; day 1.38; dec 4.34; dispatch 4.12; end 1.36; endeavor 5.03; engl 3.51; game 3.14; guin 5.40; held 2.05; holysov 10.75; hour 3.44; indoor 8.13; minut 4.38; monday 2.12; month 1.34; new 2.62; offic .98; play 4.82; prev 1.89; record 5.80; report 1.04; socc 9.16; straight 3.61; team 2.86; town 2.86; twent 3.91; west 2.14; world 3.85

czechoslovak offic 8.64; prev record 6.09; record world 13.69; day straight 6.41; agent new 6.69; ctk report 8.51; book guin 7.15; agent ctk 7.67; begin team 8.20


How to Represent & Compare Information Need

• Formal query language
  – query is a pattern to be matched by doc
  – Boolean systems are the best known
  – others
    • density measures such as Waterloo’s MultiText
    • Inquery’s structured query operators


Information Need

• Free text: query is a (short) document
• Vector space derivatives:
  – compute similarity function between document and query vectors
    • length normalization of vectors key
    • cosine traditional, but highly biased toward short docs; pivot length normalization now standard
• Language models
  – select document most likely to have generated query
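As a rough illustration of the vector-space comparison, the sketch below ranks documents by the traditional cosine measure over sparse term-weight dictionaries. The function names and the toy weights are assumptions for the example; pivoted length normalization, which the slide notes is now standard, is not shown.

```python
import math


def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)


def rank(query_vec, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors with made-up weights.
docs = {
    "d1": {"soccer": 2.0, "record": 3.0},
    "d2": {"record": 1.0, "music": 4.0},
}
ranking = rank({"soccer": 1.0, "record": 1.0}, docs)
```

Dividing by the vector norms is the length normalization step: without it, long documents with many large weights would dominate the ranking.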


Beneficial techniques for automatic IR

• Term weighting

• Query expansion

• Phrasing

• Passages


Term weighting

• Largest single system factor that affects retrieval effectiveness
• Current best weights are combination of three factors:
  – term frequency: how often the term occurs in a text
  – collection frequency: how many documents the term occurs in
  – length normalization: compensating factor for widely varying document lengths
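One way to combine the three factors is sketched below. The specific formula (log-damped term frequency, log inverse document frequency, cosine normalization) is a common textbook variant chosen for illustration; it is an assumption, not necessarily the weighting used to produce the numbers on the earlier slide.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each doc id to a unit-length tf*idf vector.

    docs: dict of doc_id -> list of (already stemmed) terms.
    """
    n_docs = len(docs)
    # Collection frequency: number of documents each term occurs in.
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        raw = {
            # Log-damped term frequency times inverse document frequency.
            term: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        # Length normalization: scale the vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors[doc_id] = {t: (w / norm if norm else 0.0) for t, w in raw.items()}
    return vectors


docs = {
    "d1": ["czech", "record", "record", "soccer"],
    "d2": ["record", "music"],
    "d3": ["soccer", "game"],
}
vectors = tfidf_vectors(docs)
```

In this toy collection "czech" gets the highest weight in d1 even though "record" occurs more often there, because "czech" appears in only one document.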


Query expansion

• Good query is essential, yet users tend to provide only a few keywords
• Variety of query expansion techniques, both interactive and automatic
• Most often used automatic technique is blind feedback:
  – assume the top ranked documents are relevant and apply feedback to produce new query
  – run new query and present results of this query to user
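The blind feedback loop can be sketched as below, using a simplified Rocchio-style update with no negative feedback. The function name, the parameters `alpha`, `beta`, `top_docs`, and `new_terms`, and their default values are all illustrative assumptions rather than settings from any particular system.

```python
from collections import Counter


def blind_feedback(query_vec, ranked_doc_ids, doc_vectors,
                   top_docs=2, new_terms=3, alpha=1.0, beta=0.5):
    """Expand a query by treating the top-ranked documents as relevant."""
    top = ranked_doc_ids[:top_docs]
    if not top:
        return dict(query_vec)
    # Average the term weights of the assumed-relevant documents.
    centroid = Counter()
    for doc_id in top:
        for term, weight in doc_vectors[doc_id].items():
            centroid[term] += weight / len(top)
    # Keep the original query terms and boost them by their centroid weight.
    expanded = {t: alpha * w + beta * centroid.get(t, 0.0) for t, w in query_vec.items()}
    # Add the highest-weighted centroid terms that were not in the query.
    candidates = [(t, w) for t, w in centroid.items() if t not in query_vec]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    for term, weight in candidates[:new_terms]:
        expanded[term] = beta * weight
    return expanded


doc_vectors = {
    "d1": {"soccer": 1.0, "indoor": 0.8, "record": 0.5},
    "d2": {"soccer": 0.9, "czech": 0.7},
    "d3": {"music": 1.0},
}
new_query = blind_feedback({"soccer": 1.0}, ["d1", "d2", "d3"], doc_vectors,
                           top_docs=2, new_terms=2)
```

The expanded query would then be run against the collection and its results shown to the user. Because nothing checks that the top documents really are relevant, a poor initial ranking can pull the query off topic.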


Phrasing

• Creation of compound index terms (i.e., terms that correspond to >1 word stem in original text)
• Usually found statistically by searching for word pairs that co-occur (much) more frequently in the corpus than expected by chance
• Lots of work on linguistically-motivated phrasing
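A minimal sketch of the statistical approach follows: it compares each adjacent word pair's observed count with the count expected if the two words occurred independently. Restricting the search to adjacent pairs, and the `min_count` and `min_ratio` thresholds, are assumptions made for the example.

```python
from collections import Counter


def find_phrases(docs, min_count=2, min_ratio=2.0):
    """Return adjacent word pairs that co-occur more often than chance predicts.

    docs: list of token lists. Returns {(first, second): observed / expected}.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    phrases = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        # Expected number of pairs if the two words occurred independently.
        p_first = unigrams[first] / total_words
        p_second = unigrams[second] / total_words
        expected = total_pairs * p_first * p_second
        ratio = count / expected
        if ratio >= min_ratio:
            phrases[(first, second)] = ratio
    return phrases


docs = [
    ["world", "record", "for", "indoor", "soccer"],
    ["new", "world", "record", "set"],
    ["record", "for", "the", "world"],
]
phrases = find_phrases(docs)
```

On a corpus this small the ratios are not meaningful; the minimum-count filter is there to keep pairs seen only once from qualifying on tiny expected counts.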


Passages

• Passage is a document subpart

• Useful for breaking long, multi-subject documents into areas of homogeneous content

• Also useful for mitigating effects of widely varying document lengths if the weighting scheme is sub-par
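One simple way to form passages is a fixed-size sliding window over the token stream, sketched below. The window size, the overlap, and scoring a document by its best passage are illustrative assumptions; passages can equally be defined by paragraph or section boundaries.

```python
def passages(tokens, size=50, overlap=25):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return windows


def best_passage_score(tokens, score_passage, size=50, overlap=25):
    """Score a document by its best-scoring passage (one option among several)."""
    scores = (score_passage(p) for p in passages(tokens, size, overlap))
    return max(scores, default=0.0)


windows = passages(list(range(10)), size=4, overlap=2)
```

Since every passage has roughly the same length, comparing passage scores sidesteps much of the length bias that a weak weighting scheme would otherwise introduce.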


(Factoid) QA System Applications

• Predictive Annotation (IBM)
  – index entity types as well as content words
• Controlled query expansion (LCC)
  – use progressively more aggressive term expansion in different iterations of querying
  – new iteration invoked if predetermined conditions not met


IR Evaluation

• Current IR evaluation assumes ranked retrieval
  – some methods (e.g., Boolean matching) don’t easily produce ranked results
• Measures
  – variations on recall, precision
  – MAP most common measure reported

$$\text{recall} = \frac{\#\ \text{relevant retrieved}}{\#\ \text{relevant}} \qquad \text{precision} = \frac{\#\ \text{relevant retrieved}}{\#\ \text{retrieved}}$$
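The two set-based definitions translate directly into code. This is a small sketch; the function name and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved doc ids and relevant doc ids."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # number of relevant documents retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Four documents retrieved, two of which are among the three relevant ones.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
```

Here precision is 2/4 and recall is 2/3. Both measures ignore the order of the retrieved list, which is why ranked evaluation relies on variants computed at different cutoffs.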


Relevance

• Fundamental concept in IR evaluation

• restricted to topical relevance
  – document and query discuss same topic

• operational definition used in TREC: if you were writing a report on the subject of the query and would use any information included in the document in that report, mark the document relevant

• known that different assessors judge documents differently

• assessor differences affect absolute scores of systems, but generally not relative scores


Average Precision by Qrel

[Figure: average precision (y-axis, 0 to 0.4) per system (x-axis); series: Line 1, Line 2, Mean, Original, Union, Intersection.]


Recall-Precision Graph (Interpolated, Extrapolated, Averaged)

[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1) for the runs ok7ax, att98atdc, INQ502, mds98td, bbn1, tno7exp1, pirc8Aa2, and Cor7A3rrf.]


Averages Do Hide Variability

[Figure: recall-precision graph; precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1).]


Mean Average Precision

• Most frequently used summary measure of a ranked retrieval run
  – average precision of a single query is the mean of the precision scores after each relevant document retrieved
  – value for run is the mean of the individual average precision scores
• Contains both recall- and precision-oriented aspects; sensitive to entire ranking
• Interpretation is less obvious than other measures (e.g., P(20))
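The definition above can be sketched as follows. The code assumes the usual TREC convention that relevant documents never retrieved contribute a precision of zero, so the sum is divided by the total number of relevant documents, and it assumes a ranking contains no duplicate ids. The names `run` and `qrels` are illustrative.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Unretrieved relevant documents count as precision 0, hence the division
    by the total number of relevant documents.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision after this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """run: topic -> ranked doc ids; qrels: topic -> set of relevant doc ids."""
    scores = [average_precision(run.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


run = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d6", "d7"}}
map_score = mean_average_precision(run, qrels)
```

For topic t1 the relevant documents sit at ranks 1 and 3, giving (1/1 + 2/3) / 2. For t2 only one of the two relevant documents is retrieved, at rank 2, giving (1/2) / 2. The run's score is the mean of those two values.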


IR Evaluation for (factoid) QA

• Lots of interest currently
  • SIGIR 2004 workshop
  • Christof Monz thesis
  • papers by MIT group
• Suggested measures
  – p@n, r@n, a@n (Monz)
  – coverage, redundancy (Roberts & Gaizauskas)
    • coverage(n) = % of questions with answer in top n docs
    • redundancy(n) = average number of answers per question in top n docs
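Coverage and redundancy can be computed from per-question judgments as sketched below. The input format (one list of booleans per question, in rank order) is an assumption for the example, redundancy is read here as the number of answer-bearing documents, and coverage is returned as a fraction rather than a percentage.

```python
def coverage(answer_flags, n):
    """Fraction of questions with at least one answer-bearing doc in the top n.

    answer_flags: one list per question; element i is True when the document
    at rank i + 1 contains a correct answer.
    """
    if not answer_flags:
        return 0.0
    covered = sum(1 for flags in answer_flags if any(flags[:n]))
    return covered / len(answer_flags)


def redundancy(answer_flags, n):
    """Average number of answer-bearing docs per question in the top n."""
    if not answer_flags:
        return 0.0
    return sum(sum(flags[:n]) for flags in answer_flags) / len(answer_flags)


flags = [
    [True, False, True],    # question 1: answers at ranks 1 and 3
    [False, False, False],  # question 2: no answer retrieved
    [False, True, False],   # question 3: answer at rank 2
]
cov_at_2 = coverage(flags, 2)
red_at_3 = redundancy(flags, 3)
```

With these flags, two of the three questions are covered at n = 2, and the top three documents hold one answer-bearing document per question on average.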


Issues with IR Eval for QA

• Increased retrieval effectiveness can harm overall QA system performance

• Optimizing average effectiveness for a single retrieval strategy unlikely to be as effective as iterative strategies

• Beware! TREC QA track data sets were never intended for this purpose
  • see the Bilotti, Katz, and Lin SIGIR 2004 workshop paper for examples


Web Searching

• Differences from standard ad hoc searching
  – quality of spider affects overall effectiveness
  – web search engines exploit link structure
  – web search engines must defend against deliberate spamming
  – web search engines operate under severe efficiency constraints
  – web search engines usually optimized for precision


Quality of Spider

• Spider sets upper bound on coverage, recency of retrieval

• Bigger index is not necessarily better
  – eliminating “junk pages” at this stage is big win for efficiency, possible win for effectiveness
  – but, need to carefully define junk

• Affects user queries to the extent that retrieval suffers if not well done


Exploiting Link Structure

• Key to effective web retrieval
  – web retrieval tasks not always “Find docs about…”
  – in-links treated as recommendations for page
  – anchor text a form of manual keyword assignment
• Web engines generally need that structure
  – e.g., Google-in-a-Box for newswire unlikely to work well


Defending Against Spam

• IR systems generally trust input text, but can’t on the web
  – link exploitation major way of coping

• Assuming engine does a reasonable job, not a big impact on user queries


Efficiency Constraints

• Massive size of web and volume of queries mean query processing must be very fast
  – stemming used rarely, if at all
  – no complicated NLP

• Economics supports caching/special processing (even manual) for extremely frequent queries


Optimized for Early Precision

• Default user model assumes precision is only measure of interest

• “I’m feeling lucky” is Prec(1)


Advanced Search

• “Phrase operator”
  – requires all words in phrase to appear contiguously in document
  – close to turning search into a giant grep
    • provides a type of pattern matching, but be sure that’s what you really want
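The semantics of the phrase operator can be sketched as a contiguous, in-order match over document tokens. This is a naive scan for illustration only; a real engine would answer such queries from positional index entries, and the function name is hypothetical.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """True when the phrase tokens occur contiguously and in order in the document."""
    n = len(phrase_tokens)
    if n == 0:
        return True
    # Slide a window of the phrase's length across the document.
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


doc = "the czechs set a new world record".split()
exact = contains_phrase(doc, ["world", "record"])
reordered = contains_phrase(doc, ["record", "world"])
```

The match is all or nothing: a document that says "record for the world" contributes nothing to a "world record" phrase query, which is the grep-like behavior the slide warns about.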


Advanced Search

• -forbidden words
  – same semantics as Boolean NOT operator, with the same pitfalls


Available Systems (with input from Ian Soboroff of NIST)

• SMART
• MG
• Lucene
• Lemur


SMART

• Vector-space system written by Chris Buckley
  • available from Cornell ftp site
• Designed to support experimentation
  • extremely easy to switch components of indexing/search
  • hard to change fundamental search process (from ad hoc retrieval to filtering, say)
• Freely available version old, but well-known
  • many different groups have used it in TREC
• Little user documentation
  • not supported


MG

• Written largely at RMIT using research of Bell, Moffat, Witten, Zobel
  • retrieval system accompanying the book “Managing Gigabytes”
• Intended to be used as is, as a black box
  • implements tf*idf vector model
  • research emphasis on efficiency
• Not officially supported
  • documentation essentially limited to book
• Zettair new system by same group


Lucene

• Retrieval toolkit written in Java
  • open source from Jakarta Apache
  • Doug Cutting main architect
• Target audience is people looking to put search on a web site
  • although it is a toolkit and so extensible, it is generally used more out of the box
  • vector space, tf*idf system
  • no relevance feedback
• Well-documented with active user community


Lemur

• Retrieval toolkit written largely in C++
  • available from CMU website
  • written at CMU (Jamie Callan) and UMass (Bruce Croft, James Allan) with support from ARDA
• Target audience is IR research community
  • primary retrieval model is language modeling approach, also supports tf*idf, Okapi weighting
  • has components for major areas of IR research including relevance feedback, distributed IR, etc.

• Active user community; good documentation