Information Retrieval: A Primer
Ellen Voorhees
IR Primer (Parts based on an outline by James Allan, UMass)
• Basic IR processing
  – bag of words
  – alternate approaches
• Web searching
  – differences from non-web searching
  – “advanced” features
• Available systems
  – SMART, MG
  – Lemur, Lucene
IR Problem Definition
• Find “documents” that are relevant to a user’s information need
  – unstructured, natural language text
  – lack of structure precludes use of traditional database technologies
  – a document is the unit of text of interest, traditionally at least a paragraph in length
IR Issues
• How to represent text
  – indexing
• How to represent information need
  – free text vs. formal query language
• How to compare representations
  – retrieval models
• How to evaluate quality of the search
  – recall/precision variants
Original Approach
• Manually assign descriptors
  – expensive
  – human agreement on descriptors is poor
  – controlled vocabulary increases annotator consistency but requires searchers to use the same vocabulary
• Still used
  – MEDLINE
  – motivation for many semantic web proposals
Current Standard Approach
• Statistical: compute degree of match between query and document
  – weighted “bag of words”
  – rank documents by how well they match
• Assume top documents are good & modify the query based on their words
• Rerun search with modified query
Single Biggest Factor in IR Effectiveness
• The query!
  – many studies have shown the topic effect is bigger than the system effect or interaction effects
  – TREC query track demonstrated the effect of different queries for the same topic
• Creating good queries
  – select the correct document set to ask
  – use discriminative, high-content query words
  – include informative alternate expressions
  – prefer specific examples over general concepts
How to Represent Text

• Tokenization
  – might include identifying phrases
  – might include identifying other lexical structures such as names, amounts, etc.
  – increased importance for (factoid) QA
• Remove “stop words”
• Perform stemming
• Weight terms
  – might include mining other data sources
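The pipeline can be sketched as follows. This is a minimal illustration: the stop-word list is abbreviated, and the suffix-stripping stemmer is a toy stand-in for a real stemmer such as Porter's; all function names are assumptions, not part of any particular system.

```python
import re

# Illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "for", "of", "in", "on", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Toy suffix-stripping stemmer (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def bag_of_words(text):
    """Tokenize, drop stop words, stem, and count: {stem: frequency}."""
    bag = {}
    for tok in tokenize(text):
        if tok in STOP_WORDS:
            continue
        s = stem(tok)
        bag[s] = bag.get(s, 0) + 1
    return bag
```

Run on the sample article that follows, a pipeline like this yields the kind of stemmed, counted bag of tokens shown on the next slides.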
Original Text
Czechs Play Indoor Soccer for More Than Four Days Straight for Record
Twenty Czechs broke the world record for indoor soccer last month, playing the game continuously for 107 hours and 15 minutes, the official Czechoslovak news agency CTK reported.
Two teams began the endeavor in the West Bohemian town of Holysov on Dec. 13 and ended with the new world record on Dec. 17, CTK said in the dispatch Monday.
According to the news agency, the previous record of 106 hours 10 minutes was held by English players. The Czechs new record is to be recorded in the Guinness Book of World Records, CTK said.
Bag of Tokens
agency(2); began; bohemian; book; broke; continuously; ctk(3); czechoslovak; czechs(3); days; dec(2); dispatch; ended; endeavor; english; game; guinness; held; holysov; hour(2); indoor(2); minutes(2); monday; month; news(2); official; play(3); previous; record(7); reported; soccer(2); straight; teams; town; twenty; west; world(3)
official czechoslovak; previous record; world record(3); days straight; news agency(2); ctk reported; guinness book; ctk agency; teams began
Final Document Representation
agent 2.80; begin 1.55; bohem 6.01; book 2.63; brok 2.60; continu 1.55; ctk 13.35; czechoslovak 4.36; czech 11.65; day 1.38; dec 4.34; dispatch 4.12; end 1.36; endeavor 5.03; engl 3.51; game 3.14; guin 5.40; held 2.05; holysov 10.75; hour 3.44; indoor 8.13; minut 4.38; monday 2.12; month 1.34; new 2.62; offic .98; play 4.82; prev 1.89; record 5.80; report 1.04; socc 9.16; straight 3.61; team 2.86; town 2.86; twent 3.91; west 2.14; world 3.85
czechoslovak offic 8.64; prev record 6.09; record world 13.69; day straight 6.41; agent new 6.69; ctk report 8.51; book guin 7.15; agent ctk 7.67; begin team 8.20
How to Represent & Compare Information Need
• Formal query language
  – query is a pattern to be matched by the doc
  – Boolean systems best known
  – others
    • density measures such as Waterloo’s MultiText
    • Inquery’s structured query operators
Information Need
• Free text: query is a (short) document
• Vector space derivatives:
  – compute a similarity function between document and query vectors
    • length normalization of vectors is key
    • cosine traditional, but highly biased toward short docs; pivoted length normalization now standard
• Language models
  – select the document most likely to have generated the query
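The vector-space comparison can be sketched as the cosine between sparse query and document term-weight vectors (a minimal illustration; the function name and representation are assumptions):

```python
import math

def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
```

Dividing by both vector norms is exactly the length normalization the slide warns about: it removes magnitude entirely, which is why plain cosine favors short documents and pivoted normalization was introduced.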
Beneficial techniques for automatic IR
• Term weighting
• Query expansion
• Phrasing
• Passages
Term Weighting

• Largest single system factor that affects retrieval effectiveness
• Current best weights are a combination of three factors:
  – term frequency: how often the term occurs in a text
  – collection frequency: how many documents the term occurs in
  – length normalization: compensating factor for widely varying document lengths
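One common way to combine the three factors is a log-scaled tf multiplied by idf, followed by cosine (L2) length normalization. This is a hedged sketch of one instantiation, not the exact weights of any particular system:

```python
import math

def tfidf_vector(term_freqs, doc_freqs, num_docs):
    """Combine the three factors: log-tf * idf, then length-normalize.

    term_freqs: {term: count in this document}      (term frequency)
    doc_freqs:  {term: # documents containing term} (collection frequency)
    """
    weights = {
        t: (1.0 + math.log(tf)) * math.log(num_docs / doc_freqs[t])
        for t, tf in term_freqs.items()
    }
    # Length normalization: scale so the vector has unit L2 length.
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}
```

The log on tf damps repeated occurrences, the idf factor rewards terms that occur in few documents, and the final division compensates for document length.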
Query Expansion

• A good query is essential, yet users tend to provide only a few keywords
• Variety of query expansion techniques, both interactive and automatic
• Most often used automatic technique is blind feedback:
  – assume the top-ranked documents are relevant and apply feedback to produce a new query
  – run the new query and present its results to the user
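Blind (pseudo-relevance) feedback can be sketched in the Rocchio style: keep the original query terms and mix in the averaged term weights of the assumed-relevant top documents. The alpha/beta mixing weights here are illustrative defaults, not values prescribed by any system:

```python
def blind_feedback(query_vec, top_doc_vecs, alpha=1.0, beta=0.5):
    """Rocchio-style expansion: original query weights scaled by alpha,
    plus beta times the average weight of each term in the top docs."""
    new_query = {t: alpha * w for t, w in query_vec.items()}
    for doc_vec in top_doc_vecs:
        for t, w in doc_vec.items():
            new_query[t] = new_query.get(t, 0.0) + beta * w / len(top_doc_vecs)
    return new_query
```

The expanded query picks up vocabulary the user never typed, which is why blind feedback helps on average even though it can hurt badly when the top documents are not, in fact, relevant.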
Phrasing

• Creation of compound index terms (i.e., terms that correspond to >1 word stem in the original text)
• Usually found statistically by searching for word pairs that co-occur (much) more frequently in the corpus than expected by chance
• Lots of work on linguistically motivated phrasing
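The statistical test can be sketched with pointwise mutual information: keep pairs whose observed co-occurrence count far exceeds the count expected if the two words occurred independently. The threshold and counting scheme here are illustrative assumptions:

```python
import math

def find_phrases(pair_counts, word_counts, total_pairs, min_pmi=3.0):
    """Keep adjacent word pairs whose observed count far exceeds the
    count expected under independence of the two words."""
    phrases = []
    for (w1, w2), observed in pair_counts.items():
        expected = word_counts[w1] * word_counts[w2] / total_pairs
        pmi = math.log2(observed / expected)  # pointwise mutual information
        if pmi >= min_pmi:
            phrases.append((w1, w2))
    return phrases
```

A pair like “world record” scores high because both words concentrate in the same contexts, while a pair built around a very common word scores low even if it co-occurs often.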
Passages
• Passage is a document subpart
• Useful for breaking long, multi-subject documents into areas of homogeneous content
• Also useful for mitigating effects of widely varying document lengths if the weighting scheme is sub-par
(Factoid) QA System Applications
• Predictive Annotation (IBM)
  – index entity types as well as content words
• Controlled query expansion (LCC)
  – use progressively more aggressive term expansion in different iterations of querying
  – new iteration invoked if predetermined conditions not met
IR Evaluation

• Current IR evaluation assumes ranked retrieval
  – some methods (e.g., Boolean matching) don’t easily produce ranked results
• Measures
  – variations on recall, precision
  – MAP most common measure reported
recall = (# relevant retrieved) / (# relevant)

precision = (# relevant retrieved) / (# retrieved)
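Both measures compute directly from a retrieved set and the set of judged-relevant documents (a minimal sketch; the function name is an assumption):

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a retrieved set against the
    set of judged-relevant documents."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```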
Relevance

• Fundamental concept in IR evaluation
• Restricted to topical relevance
  – document and query discuss the same topic
• Operational definition used in TREC: if you were writing a report on the subject of the query and would use any information included in the document in that report, mark the document relevant
• Known that different assessors judge documents differently
• Assessor differences affect absolute scores of systems, but generally not relative scores
[Figure: “Average Precision by Qrel” — per-system average precision (0 to 0.4) computed with the Original, Union, and Intersection qrels, with mean lines overlaid]
Recall-Precision Graph

[Figure: interpolated, extrapolated, averaged recall-precision curves for eight runs: ok7ax, att98atdc, INQ502, mds98td, bbn1, tno7exp1, pirc8Aa2, Cor7A3rrf]
Averages Do Hide Variability

[Figure: per-topic recall-precision curves, showing the wide variability across topics that the averaged curve conceals]
Mean Average Precision

• Most frequently used summary measure of a ranked retrieval run
  – average precision of a single query is the mean of the precision scores after each relevant document retrieved
  – value for a run is the mean of the individual average precision scores
• Contains both recall- and precision-oriented aspects; sensitive to the entire ranking
• Interpretation is less obvious than that of other measures (e.g., P(20))
IR Evaluation for (Factoid) QA

• Lots of current interest
  – SIGIR 2004 workshop
  – Christof Monz’s thesis
  – papers by the MIT group
• Suggested measures
  – p@n, r@n, a@n (Monz)
  – coverage, redundancy (Roberts & Gaizauskas)
    • coverage(n) = % of questions with an answer in the top n docs
    • redundancy(n) = average number of answers per question in the top n docs
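The two measures can be sketched directly from their definitions (names and data layout are illustrative assumptions):

```python
def coverage_and_redundancy(runs, n):
    """runs: list of (ranking, answer_docs) pairs, one per question.
    Returns (coverage(n), redundancy(n)): the fraction of questions
    with at least one answer-bearing doc in the top n, and the mean
    number of answer-bearing docs in the top n."""
    covered, total_hits = 0, 0
    for ranking, answer_docs in runs:
        hits = sum(1 for doc in ranking[:n] if doc in answer_docs)
        covered += 1 if hits else 0
        total_hits += hits
    return covered / len(runs), total_hits / len(runs)
```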
Issues with IR Eval for QA
• Increased retrieval effectiveness can harm overall QA system performance
• Optimizing average effectiveness for a single retrieval strategy unlikely to be as effective as iterative strategies
• Beware! TREC QA track data sets were never intended for this purpose
  – see the Bilotti, Katz, and Lin SIGIR 2004 workshop paper for examples
Web Searching

• Differences from standard ad hoc searching
  – quality of the spider affects overall effectiveness
  – web search engines exploit link structure
  – web search engines must defend against deliberate spamming
  – web search engines operate under severe efficiency constraints
  – web search engines usually optimized for precision
Quality of Spider
• Spider sets the upper bound on coverage and recency of retrieval
• Bigger index is not necessarily better
  – eliminating “junk pages” at this stage is a big win for efficiency, a possible win for effectiveness
  – but “junk” needs to be carefully defined
• Affects user queries to the extent that retrieval suffers if not well done
Exploiting Link Structure
• Key to effective web retrieval
  – web retrieval tasks not always “Find docs about…”
  – in-links treated as recommendations for a page
  – anchor text is a form of manual keyword assignment
• Web engines generally need that structure
  – e.g., Google-in-a-Box for newswire unlikely to work well
Defending Against Spam
• IR systems generally trust input text, but can’t on the web
  – link exploitation is a major way of coping
• Assuming the engine does a reasonable job, not a big impact on user queries
Efficiency Constraints
• Massive size of the web and volume of queries mean query processing must be very fast
  – stemming used rarely, if at all
  – no complicated NLP
• Economics supports caching/special processing (even manual) for extremely frequent queries
Optimized for Early Precision
• Default user model assumes precision is only measure of interest
• “I’m feeling lucky” is Prec(1)
Advanced Search
• “Phrase operator”
  – requires all words in the phrase to appear contiguously in the document
  – close to turning search into a giant grep
    • provides a type of pattern matching, but be sure that’s what you really want
Advanced Search
• -forbidden words
  – same semantics as the Boolean NOT operator, with the same pitfalls
Available Systems(with input from Ian Soboroff of NIST)
• SMART
• MG
• Lucene
• Lemur
SMART

• Vector-space system written by Chris Buckley
  – available from the Cornell ftp site
• Designed to support experimentation
  – extremely easy to switch components of indexing/search
  – hard to change the fundamental search process (from ad hoc retrieval to filtering, say)
• Freely available version is old, but well-known
  – many different groups have used it in TREC
• Little user documentation
  – not supported
MG

• Written largely at RMIT using research of Bell, Moffat, Witten, and Zobel
  – retrieval system accompanying the book “Managing Gigabytes”
• Intended to be used as-is, as a black box
  – implements a tf*idf vector model
  – research emphasis on efficiency
• Not officially supported
  – documentation essentially limited to the book
• Zettair is a new system by the same group
Lucene
• Retrieval toolkit written in Java
  – open source from Jakarta Apache
  – Doug Cutting main architect
• Target audience is people looking to put search on a web site
  – while a toolkit, and so extensible, general use is more out-of-the-box
  – vector space, tf*idf system
  – no relevance feedback
• Well-documented with active user community
Lemur
• Retrieval toolkit written largely in C++
  – available from the CMU website
  – written at CMU (Jamie Callan) and UMass (Bruce Croft, James Allan) with support from ARDA
• Target audience is the IR research community
  – primary retrieval model is the language modeling approach; also supports tf*idf, Okapi weighting
  – has components for major areas of IR research including relevance feedback, distributed IR, etc.
• Active user community; good documentation