2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 1
龙星计划课程: 信息检索 (Dragon Star Program Course: Information Retrieval)
Overview of Text Retrieval: Part 2
ChengXiang Zhai (翟成祥 ) Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
Outline
• Other retrieval models
• Implementation of a TR System
• Applications of TR techniques
P-norm (Extended Boolean)(Salton et al. 83)
• Motivation: how to rank documents with a Boolean query?
• Intuitions
– Docs satisfying the query constraint should get the highest ranks
– Partial satisfaction of query constraint can be used to rank other docs
• Question: How to capture “partial satisfaction”?
P-norm: Basic Ideas
• Normalized term weights for doc rep ([0,1])
• Define similarity between a Boolean query and a doc vector
Think of a two-term query as a point (x, y) in the unit square spanned by the normalized term weights, with corners (0,0), (1,0), (0,1), (1,1):

Q = T1 AND T2 (ideal point is (1,1)):
sim("T1 AND T2", d) = 1 − sqrt( [ (1−x)² + (1−y)² ] / 2 )

Q = T1 OR T2 (worst point is (0,0)):
sim("T1 OR T2", d) = sqrt( (x² + y²) / 2 )
P-norm: Formulas
Since the similarity value is normalized to [0,1], these two formulas can be applied recursively.
sim("T1 AND … AND Tn", d) = 1 − dist(d, (1,…,1)) = 1 − [ ( (1−x1)^p + … + (1−xn)^p ) / n ]^{1/p}

sim("T1 OR … OR Tn", d) = dist(d, (0,…,0)) = [ ( x1^p + … + xn^p ) / n ]^{1/p}

1 ≤ p ≤ +∞: p = 1 gives vector-space-like behavior; p → ∞ gives Boolean/fuzzy logic.
P-norm: Summary
• A general (and elegant) similarity function for Boolean query and a regular document vector
• Connecting Boolean model and vector space model with models in between
• Allowing different “confidence” on Boolean operators (different p for different operators)
• A model worth more exploration (how to learn optimal p values from feedback?)
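The two p-norm formulas can be sketched in a few lines. This is an illustrative implementation only (the function names are mine, not from any particular system):

```python
def pnorm_and(weights, p):
    """P-norm similarity for Q = T1 AND ... AND Tn.

    weights: normalized term weights x_i in [0,1] for the document.
    sim = 1 - ( sum((1-x_i)^p) / n )^(1/p)
    """
    n = len(weights)
    return 1 - (sum((1 - x) ** p for x in weights) / n) ** (1 / p)

def pnorm_or(weights, p):
    """P-norm similarity for Q = T1 OR ... OR Tn.

    sim = ( sum(x_i^p) / n )^(1/p)
    """
    n = len(weights)
    return (sum(x ** p for x in weights) / n) ** (1 / p)

# p=1 reduces both to the average of the weights (vector-space-like);
# a large p approaches strict Boolean/fuzzy-logic behavior (min/max).
print(pnorm_and([0.8, 0.6], p=1))    # average: 0.7
print(pnorm_or([0.8, 0.6], p=1))     # average: 0.7
print(pnorm_or([0.8, 0.6], p=100))   # close to max(weights) = 0.8
```

Note how a single parameter p interpolates between the vector-space and Boolean ends of the spectrum, which is exactly the point of the model.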
Probabilistic Retrieval Models
Overview of Retrieval Models
• Similarity-based: relevance ≈ similarity(Rep(q), Rep(d)), with different representations & similarity functions
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)
– …
• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
– Regression model (Fox 83)
– Generative models:
  · Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
  · Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
– Learning to rank (Joachims 02; Burges et al. 05)
• Probabilistic inference: P(d→q) or P(q→d), with different inference systems
– Prob. concept space model (Wong & Yao, 95)
– Inference network model (Turtle & Croft, 91)
The Basic Question
What is the probability that THIS document is relevant to THIS query?
Formally…
3 random variables: query Q, document D, relevance R ∈ {0,1}
Given a particular query q, a particular document d, p(R=1|Q=q,D=d)=?
Probability of Relevance
• Three random variables
– Query Q
– Document D
– Relevance R ∈ {0,1}
• Goal: rank D based on P(R=1|Q,D)
– Evaluate P(R=1|Q,D)
– Actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank documents
• Several different ways to refine P(R=1|Q,D)
Refining P(R=1|Q,D) Method 1: conditional models
• Basic idea: relevance depends on how well a query matches a document
– Define features on Q × D, e.g., # matched terms, highest IDF of a matched term, doc length, …
– P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ)
– Use training data (known relevance judgments) to estimate the parameters θ
– Apply the model to rank new documents
• Special case: logistic regression
Logistic Regression (Cooper 92, Gey 94)
P(R=1|Q,D) = 1 / ( 1 + exp( −(β0 + Σ_{i=1}^{6} βi Xi) ) )

Equivalently, log [ P(R=1|Q,D) / (1 − P(R=1|Q,D)) ] = β0 + Σ_{i=1}^{6} βi Xi

logit function: logit(x) = log [ x / (1−x) ]
logistic (sigmoid) function: σ(x) = 1/(1+exp(−x)) = exp(x)/(1+exp(x))

Uses 6 features X1, …, X6; the sigmoid maps the linear score to a probability in (0, 1).
Features/Attributes
For a query-document pair with M terms in common:

X1 = (1/M) Σ_{j=1}^{M} log QAF(tj) — Average Absolute Query Frequency
X2 = QL — Query Length
X3 = (1/M) Σ_{j=1}^{M} log DAF(tj) — Average Absolute Document Frequency
X4 = DL — Document Length
X5 = (1/M) Σ_{j=1}^{M} log IDF(tj), with IDF(tj) = log [(N − nj)/nj] — Average Inverse Document Frequency
X6 = log M — Number of terms in common between query and document, logged
Logistic Regression: Pros & Cons
• Advantages
– Absolute probability of relevance available
– May re-use all the past relevance judgments
• Problems
– Performance is very sensitive to the selection of features
– Not much guidance on feature selection
• In practice, performance tends to be average
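As a sketch of how such a conditional model scores a document once trained: the coefficients below are made up for illustration (in practice they are estimated from relevance judgments), and the six feature values stand in for X1..X6 of one (Q,D) pair.

```python
import math

def logistic(x):
    """Sigmoid: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relevance_prob(features, beta0, betas):
    """P(R=1|Q,D) = logistic(beta0 + sum_i beta_i * X_i)."""
    z = beta0 + sum(b * x for b, x in zip(betas, features))
    return logistic(z)

# Hypothetical coefficients; a real model fits these on judged (Q,D) pairs.
beta0, betas = -3.5, [0.1, -0.02, 0.1, -0.005, 0.8, 1.0]
X = [1.2, 3, 0.9, 120, 2.1, 1.1]   # six feature values for one (Q,D) pair
print(relevance_prob(X, beta0, betas))
```

Documents are then ranked by this probability; because the sigmoid is monotonic, ranking by the linear score z gives the same order.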
Refining P(R=1|Q,D) Method 2:generative models
• Basic idea
– Define P(Q,D|R)
– Compute O(R=1|Q,D) using Bayes’ rule
• Special cases
– Document “generation”: P(Q,D|R)=P(D|Q,R)P(Q|R)
– Query “generation”: P(Q,D|R)=P(Q|D,R)P(D|R)
O(R=1|Q,D) = P(R=1|Q,D) / P(R=0|Q,D) = [ P(Q,D|R=1) / P(Q,D|R=0) ] · [ P(R=1) / P(R=0) ]

The second factor does not depend on Q or D and is ignored for ranking D.
Document Generation
O(R=1|Q,D) ∝ [ P(D|Q,R=1) P(Q|R=1) ] / [ P(D|Q,R=0) P(Q|R=0) ] ∝rank P(D|Q,R=1) / P(D|Q,R=0)

P(D|Q,R=1): model of relevant docs for Q
P(D|Q,R=0): model of non-relevant docs for Q

Assume independent attributes A1…Ak (why?). Let D = d1…dk, where di ∈ {0,1} is the value of attribute Ai (similarly Q = q1…qk). Then

P(D|Q,R=1) / P(D|Q,R=0) = Π_{i=1}^{k} [ P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0) ]
= Π_{i: di=1} [ P(Ai=1|Q,R=1)/P(Ai=1|Q,R=0) ] · Π_{i: di=0} [ P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0) ]

Assume P(Ai=1|Q,R=1) = P(Ai=1|Q,R=0) if qi=0, i.e., non-query terms are equally likely to appear in relevant and non-relevant docs. Then only query terms matter:

= Π_{i: di=qi=1} [ P(Ai=1|Q,R=1)/P(Ai=1|Q,R=0) ] · Π_{i: di=0, qi=1} [ P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0) ]
Robertson-Sparck Jones Model(Robertson & Sparck Jones 76)
Two parameters for each term Ai:
pi = P(Ai=1|Q,R=1): probability that term Ai occurs in a relevant doc
qi = P(Ai=1|Q,R=0): probability that term Ai occurs in a non-relevant doc

log O(R=1|Q,D) ∝rank Σ_{i: di=qi=1} log [ pi(1−qi) / (qi(1−pi)) ]   (RSJ model)

How to estimate the parameters? Suppose we have relevance judgments:
p̂i = (# rel. docs with Ai + 0.5) / (# rel. docs + 1)
q̂i = (# nonrel. docs with Ai + 0.5) / (# nonrel. docs + 1)
“+0.5” and “+1” can be justified by Bayesian estimation
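A minimal sketch of RSJ scoring with the smoothed estimates above; the collection statistics here are toy numbers, and the helper names are mine:

```python
import math

def rsj_weight(rel_with, rel_total, nonrel_with, nonrel_total):
    """RSJ term weight log[p(1-q)/(q(1-p))] with +0.5/+1 smoothing."""
    p = (rel_with + 0.5) / (rel_total + 1)
    q = (nonrel_with + 0.5) / (nonrel_total + 1)
    return math.log(p * (1 - q) / (q * (1 - p)))

def rsj_score(doc_terms, query_terms, stats):
    """Sum RSJ weights over terms present in both document and query."""
    return sum(rsj_weight(*stats[t]) for t in query_terms & doc_terms)

# stats[term] = (#rel docs with term, #rel docs, #nonrel with term, #nonrel)
stats = {"info": (8, 10, 20, 90), "security": (6, 10, 5, 90)}
print(rsj_score({"info", "security", "the"}, {"info", "security"}, stats))
```

A term that occurs in most relevant docs but few non-relevant ones gets a large positive weight; terms absent from the query contribute nothing, matching the binary independence derivation.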
RSJ Model: No Relevance Info(Croft & Harper 79)
log O(R=1|Q,D) ∝rank Σ_{i: di=qi=1} log [ pi(1−qi) / (qi(1−pi)) ]   (RSJ model)

How to estimate the parameters? Suppose we do not have relevance judgments:
– Assume pi to be a constant
– Estimate qi by assuming all documents to be non-relevant

log O(R=1|Q,D) ∝rank Σ_{i: di=qi=1} log [ (N − ni + 0.5) / (ni + 0.5) ]

N: # documents in the collection; ni: # documents in which term Ai occurs. The weight is essentially an IDF: IDF'(Ai) = log [(N − ni)/ni].
RSJ Model: Summary
• The most important classic prob. IR model
• Use only term presence/absence, thus also referred to as Binary Independence Model
• Essentially Naïve Bayes for doc ranking
• Most natural for relevance/pseudo feedback
• Without relevance judgments, the model parameters must be estimated in an ad hoc way
• Performance isn’t as good as tuned VS model
Improving RSJ: Adding TF
Let D = d1…dk, where di is the frequency count of term Ai.

Basic doc. generation model:
O(R=1|Q,D) ∝rank P(D|Q,R=1) / P(D|Q,R=0) = Π_{i=1}^{k} [ P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0) ]
= Π_{i: di>0} [ P(Ai=di|Q,R=1)/P(Ai=di|Q,R=0) ] · Π_{i: di=0} [ P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0) ]

Model each term frequency with a 2-Poisson mixture, where E ∈ {0,1} is a hidden "elite" variable (whether the doc is truly about the concept of term Ai):

p(Ai = f | Q, R) = Σ_{E∈{0,1}} p(E|Q,R) · p(f|E) = Σ_{E∈{0,1}} p(E|Q,R) · λE^f e^{−λE} / f!

Many more parameters to estimate! (how many exactly?)
BM25/Okapi Approximation(Robertson et al. 94)
• Idea: Approximate p(R=1|Q,D) with a simpler function that shares similar properties
• Observations:
– log O(R=1|Q,D) is a sum of term weights Wi
– Wi= 0, if TFi=0
– Wi increases monotonically with TFi
– Wi has an asymptotic limit
• The simple function is

Wi = [ TFi (k1 + 1) / (k1 + TFi) ] · log [ pi(1−qi) / (qi(1−pi)) ]

i.e., the RSJ weight scaled by a saturating TF transformation with asymptotic limit k1 + 1.
Adding Doc. Length & Query TF
• Incorporating doc length
– Motivation: The 2-Poisson model assumes equal document length
– Implementation: “Carefully” penalize long doc
• Incorporating query TF
– Motivation: Appears to be not well-justified
– Implementation: A similar TF transformation
• The final formula is called BM25, achieving top TREC performance
The BM25 Formula
(Formula shown as an image in the original slides: the "Okapi TF/BM25 TF" saturating transformation combined with the RSJ term weight and document length normalization.)
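The exact formula was an image in the original slides; below is a sketch of one common textbook form of BM25 (parameter names k1, b, k3 and their defaults vary across descriptions, and the statistics here are toy values):

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=8.0):
    """One common form of BM25: IDF x saturated doc TF x saturated query TF.

    query_tf/doc_tf: term -> frequency; N: collection size; df: doc freq.
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        # Length-normalized saturation point: long docs are penalized via b.
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        score += (idf
                  * (tf * (k1 + 1)) / (tf + norm)
                  * ((k3 + 1) * qtf) / (k3 + qtf))
    return score

print(bm25_score({"info": 1, "security": 1},
                 {"info": 3, "security": 1, "the": 10},
                 doc_len=14, avg_doc_len=12.0, N=1000,
                 df={"info": 50, "security": 20, "the": 990}))
```

Note the three ingredients the slides motivate: the RSJ/IDF weight, the asymptotic TF saturation, and the "careful" length penalty through b.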
Extensions of “Doc Generation” Models
• Capture term dependence (Rijsbergen & Harper 78)
• Alternative ways to incorporate TF (Croft 83, Kalt96)
• Feature/term selection for feedback (Okapi’s TREC reports)
• Other Possibilities (machine learning … )
Query Generation
O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0) = [ P(Q|D,R=1) P(D|R=1) ] / [ P(Q|D,R=0) P(D|R=0) ]

Assume P(Q|D,R=0) = P(Q|R=0), i.e., a non-relevant document tells us nothing about the query. Then

O(R=1|Q,D) ∝rank P(Q|D,R=1) · [ P(D|R=1) / P(D|R=0) ]
            (query likelihood p(q|d)) · (document prior)

Assuming a uniform prior, we have O(R=1|Q,D) ∝rank P(Q|D,R=1)

Now, the question is how to compute P(Q|D,R=1).
Generally involves two steps: (1) estimate a language model based on D; (2) compute the query likelihood according to the estimated model.
Leading to the so-called “Language Modeling Approach” …
What is a Statistical LM?
• A probability distribution over word sequences
– p("Today is Wednesday") ≈ 0.001
– p("Today Wednesday is") ≈ 0.0000000000001
– p("The eigenvalue is positive") ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
The Simplest Language Model(Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM
A unigram LM p(w|θ) acts as a probabilistic mechanism for generating documents by sampling words:
• Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
• Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
Sampling from the first model tends to produce a text mining paper; sampling from the second, a food nutrition paper.
Estimation of Unigram LM
Given a document — a "text mining paper" with 100 words in total — estimate the model p(w|θ) = ? from the observed counts:

text 10 → 10/100, mining 5 → 5/100, association 3 → 3/100, database 3 → 3/100, algorithm 2 → 2/100, …, query 1 → 1/100, efficient 1 → 1/100, …
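The maximum likelihood estimate is just relative frequency; a toy sketch (with a smaller document than the 100-word example):

```python
from collections import Counter

def mle_unigram(text_tokens):
    """Maximum likelihood estimate: p(w|d) = count(w, d) / |d|."""
    counts = Counter(text_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["query"]
lm = mle_unigram(doc)
print(lm["text"])   # 10/19 for this 19-word toy doc
```

The estimated probabilities sum to one, and any word not in the document gets probability zero, which is precisely the problem smoothing addresses later.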
Language Models for Retrieval(Ponte & Croft 98)
Estimate a language model for each document (e.g., for a text mining paper: text ?, mining ?, association ?, clustering ?, …, food ?; for a food nutrition paper: food ?, nutrition ?, healthy ?, diet ?, …).

Query = "data mining algorithms" — which model would most likely have generated this query?
Ranking Docs by Query Likelihood
Estimate a doc LM θdi for each document di, then rank documents by the query likelihood:
d1 → p(q|θd1), d2 → p(q|θd2), …, dN → p(q|θdN)
Retrieval as Language Model Estimation
• Document ranking based on query likelihood
log p(q|d) = Σ_{i=1}^{n} log p(wi|d), where q = w1 w2 … wn and p(wi|d) is the document language model

• Retrieval problem ⇒ estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
How to Estimate p(w|d)?
• Simplest solution: Maximum Likelihood Estimator
– P(w|d) = relative frequency of word w in d
– What if a word doesn’t appear in the text? P(w|d)=0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …
Language Model Smoothing (Illustration)
The maximum likelihood estimate p_ML(w) = (count of w) / (count of all words) gives zero probability to unseen words; a smoothed LM lowers the probabilities of seen words and spreads the saved mass over unseen ones.
How to Smooth?
• All smoothing methods try to
– discount the probability of words seen in a document
– re-allocate the extra counts so that unseen words will have a non-zero count
• A simple method (additive smoothing): Add a constant to the counts of each word
• Problems?
p(w|d) = ( c(w,d) + 1 ) / ( |d| + |V| )    ("add one", Laplace smoothing)

c(w,d): counts of w in d; |d|: length of d (total counts); |V|: vocabulary size
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words
p(w|d) = p_seen(w|d)   if w is seen in d
       = αd · p(w|C)   otherwise

p_seen(w|d): discounted ML estimate; p(w|C): collection language model; αd: normalizing coefficient
Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

log p(q|d) = Σ_{wi: c(wi,d)>0} log [ p_seen(wi|d) / (αd p(wi|C)) ] + n log αd + Σ_{i=1}^{n} log p(wi|C)

The first sum gives TF weighting with an IDF-like effect from dividing by p(wi|C); the n log αd term provides doc length normalization (a long doc is expected to have a smaller αd); the last term is the same for all docs and is ignored for ranking.

• Smoothing with p(w|C) ⇒ TF-IDF weighting + length normalization
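A sketch of query likelihood scoring with linear-interpolation (Jelinek-Mercer) smoothing, one of the smoothing methods covered in this lecture; the collection probabilities are made up:

```python
import math

def query_likelihood(query, doc_counts, doc_len, coll_prob, lam=0.5):
    """log p(q|d) with Jelinek-Mercer smoothing:
    p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)."""
    score = 0.0
    for w in query:
        p = (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * coll_prob[w]
        score += math.log(p)   # never log(0): the collection model backs off
    return score

coll_prob = {"data": 0.01, "mining": 0.005, "algorithms": 0.002}
d1 = {"data": 3, "mining": 2}   # |d1| = 20
d2 = {"algorithms": 1}          # |d2| = 20
q = ["data", "mining", "algorithms"]
print(query_likelihood(q, d1, 20, coll_prob) >
      query_likelihood(q, d2, 20, coll_prob))   # True: d1 matches more
```

Because every query word has non-zero smoothed probability, a document missing a query term is penalized rather than scored zero, which is the practical payoff of smoothing.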
Derivation of the Query Likelihood Retrieval Formula
With the general smoothing scheme
p(w|d) = p_seen(w|d) if w is seen in d, and αd p(w|C) otherwise,

log p(q|d) = Σ_{w∈V, c(w,q)>0} c(w,q) log p(w|d)
= Σ_{c(w,q)>0, c(w,d)>0} c(w,q) log p_seen(w|d) + Σ_{c(w,q)>0, c(w,d)=0} c(w,q) log [ αd p(w|C) ]
= Σ_{c(w,q)>0, c(w,d)>0} c(w,q) log [ p_seen(w|d) / (αd p(w|C)) ] + |q| log αd + Σ_{c(w,q)>0} c(w,q) log p(w|C)

Key rewriting step: for the seen words, add and subtract c(w,q) log [αd p(w|C)] so that the collection-model sum runs over all query words. Similar rewritings are very common when using LMs for IR.
More Smoothing Methods
• Method 1 (Absolute discounting): Subtract a constant from the counts of each word
• Method 2 (Linear interpolation, Jelinek-Mercer): “Shrink” uniformly toward p(w|REF)
p(w|d) = max(c(w,d) − δ, 0)/|d| + δ |d|u p(w|REF) / |d|    (Method 1; |d|u = # unique words in d, δ the discount constant)

p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)    (Method 2; λ is the smoothing parameter, c(w,d)/|d| the ML estimate)
More Smoothing Methods (cont.)
• Method 3 (Dirichlet Prior/Bayesian): Assume pseudo counts p(w|REF)
• Method 4 (Good Turing): Assume total # unseen events to be n1 (# of singletons), and adjust the seen events in the same way
p(w|d) = [ c(w,d) + μ p(w|REF) ] / ( |d| + μ ) = (|d|/(|d|+μ)) · c(w,d)/|d| + (μ/(|d|+μ)) · p(w|REF)    (Method 3; μ is the smoothing parameter)

p(w|d) = c*(w,d)/|d|, where c*(w,d) = (r+1) n_{r+1}/n_r, r = c(w,d), and n_r = # words occurring r times    (Method 4)
e.g., 0* = n1/n0, 1* = 2 n2/n1, … What if n_{r+1} = 0? What about p(w|REF)?
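A sketch of Dirichlet prior smoothing (Method 3) with toy counts:

```python
def dirichlet_smooth(doc_counts, doc_len, coll_prob, mu=2000):
    """Dirichlet prior smoothing:
    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    def p(w):
        return (doc_counts.get(w, 0) + mu * coll_prob[w]) / (doc_len + mu)
    return p

coll_prob = {"text": 0.02, "mining": 0.001}
p = dirichlet_smooth({"text": 10, "mining": 5}, doc_len=100,
                     coll_prob=coll_prob, mu=1000)
print(p("text"), p("mining"))   # (10+20)/1100 and (5+1)/1100
```

The interpolation coefficient μ/(|d|+μ) shrinks as |d| grows, so short documents are smoothed more heavily than long ones, unlike the fixed λ of Jelinek-Mercer.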
Dirichlet Prior Smoothing
• ML estimator: θ̂_ML = argmax_θ p(d|θ)
• Bayesian estimator:
– First consider the posterior: p(θ|d) = p(d|θ)p(θ)/p(d)
– Then take the mean or mode of the posterior distribution
• p(d|θ): sampling distribution (of the data)
• p(θ) = p(θ1,…,θN): our prior on the model parameters
• Conjugate prior: the prior can be interpreted as "extra"/"pseudo" data
• The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution:

Dir(θ | α1,…,αN) = [ Γ(α1+…+αN) / (Γ(α1)…Γ(αN)) ] · Π_{i=1}^{N} θi^{αi−1}

with "extra"/"pseudo" word counts αi = μ · p(wi|REF)
Dirichlet Prior Smoothing (cont.)
Posterior distribution of the parameters:

p(θ|d) ∝ p(d|θ) p(θ) = Dir(θ | c(w1,d)+α1, …, c(wN,d)+αN)

Property: if θ ~ Dir(θ|α), then E(θi) = αi / Σj αj

The predictive distribution is the same as the mean:

p(wi|θ̂) = ∫ p(wi|θ) Dir(θ|d) dθ = [ c(wi,d) + αi ] / ( |d| + Σj αj ) = [ c(wi,d) + μ p(wi|REF) ] / ( |d| + μ )

— exactly Dirichlet prior smoothing.
Advantages of Language Models
• Solid statistical foundation
• Parameters can be optimized automatically using statistical estimation methods
• Can easily model many different retrieval tasks
• To be covered more later
What You Should Know
• Global relationship among different probabilistic models
• How logistic regression works
• How the Robertson-Sparck Jones model works
• The BM25 formula
• All document-generation models have trouble when no relevance judgments are available
• How the language modeling approach (query likelihood scoring) works
• How Dirichlet prior smoothing works
• 3 state-of-the-art retrieval models: pivoted length normalization (VS), Okapi/BM25, and query likelihood with Dirichlet prior smoothing
Implementation of an IR System
IR System Architecture
The user interacts through an INTERFACE: the query becomes a query representation (QueryRep), docs are processed by INDEXING into a doc representation (DocRep), SEARCHING ranks docs against the query and returns results, and the user's judgments drive FEEDBACK/QUERY MODIFICATION to refine the query.
Indexing
• Indexing = Convert documents to data structures that enable fast search
• Inverted index is the dominating indexing method (used by all search engines)
• Other indices (e.g., document index) may be needed for feedback
Inverted Index
• Fast access to all docs containing a given term (along with freq and pos information)
• For each term, we get a list of tuples (docID, freq, pos).
• Given a query, we can fetch the lists for all query terms and work on the involved documents.
– Boolean query: set operation
– Natural language query: term weight summing
• More efficient than scanning docs (why?)
Inverted Index Example
Doc 1: "This is a sample document with one sample sentence"
Doc 2: "This is another sample document"

Dictionary                      Postings
Term     #docs  Total freq      (doc id, freq)
This     2      2               (1,1), (2,1)
is       2      2               (1,1), (2,1)
sample   2      3               (1,2), (2,1)
another  1      1               (2,1)
…        …      …               …
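The dictionary/postings layout above can be reproduced with a small index builder. This is a sketch: a real system stores integer termIDs, positions, and compressed postings rather than raw strings:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: list of tokens}.
    Returns {term: sorted list of (doc_id, freq) postings}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t][doc_id] += 1
    return {t: sorted(postings.items()) for t, postings in index.items()}

docs = {1: "this is a sample document with one sample sentence".split(),
        2: "this is another sample document".split()}
index = build_inverted_index(docs)
print(index["sample"])    # [(1, 2), (2, 1)]
print(index["another"])   # [(2, 1)]
```

Given a query, only the listed documents need to be touched, which is why this beats scanning every document.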
Data Structures for Inverted Index
• Dictionary: modest size
– Needs fast random access
– Preferred to be in memory
– Hash table, B-tree, trie, …
• Postings: huge
– Sequential access is expected
– Can stay on disk
– May contain docID, term freq., term pos, etc
– Compression is desirable
Inverted Index Compression
• Observations
– Inverted lists are sorted (e.g., by docid or term freq)
– Small numbers tend to occur more frequently
• Implications
– "d-gap" (store differences): d1, d2−d1, d3−d2, …
– Exploit skewed frequency distribution: fewer bits for small (high frequency) integers
• Binary code, unary code, γ-code, δ-code
Integer Compression Methods
• In general, to exploit skewed distribution
• Binary: equal-length coding
• Unary: x ≥ 1 is coded as x−1 one bits followed by a 0, e.g., 3 => 110; 5 => 11110
• γ-code: x => unary code for 1+⌊log x⌋ followed by a uniform code for x − 2^⌊log x⌋ in ⌊log x⌋ bits, e.g., 3 => 101, 5 => 11001
• δ-code: same as γ-code, but replace the unary prefix with its γ-code, e.g., 3 => 1001, 5 => 10101
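A sketch of the unary and Elias γ encoders; the outputs match the examples above:

```python
def unary(x):
    """x >= 1 coded as (x-1) one-bits followed by a zero: 3 -> '110'."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma: unary code for 1 + floor(log2 x), then the low
    floor(log2 x) bits of x (the offset x - 2^floor(log2 x))."""
    n = x.bit_length() - 1          # floor(log2 x)
    offset_bits = format(x, "b")[1:]  # drop the leading 1-bit
    return unary(n + 1) + offset_bits

print(unary(3), unary(5))   # 110 11110
print(gamma(3), gamma(5))   # 101 11001
```

Small integers get short codes (γ(1) is a single bit), which is exactly the skew that d-gapped postings exhibit.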
Constructing Inverted Index
• The main difficulty is to build a huge index with limited memory
• Memory-based methods: not usable for large collections
• Sort-based methods:
– Step 1: collect local (termID, docID, freq) tuples
– Step 2: sort local tuples (to make “runs”)
– Step 3: pair-wise merge runs
– Step 4: Output inverted file
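The four steps can be sketched in miniature. Here the runs are in-memory lists and terms stand in for integer termIDs; a real system keeps the runs on disk and merges them pair-wise:

```python
import heapq
from collections import Counter

def invert(docs):
    """Sort-based inversion: collect (term, docID, freq) tuples,
    sort them into runs, merge the runs, output the inverted file."""
    # Step 1: collect local tuples
    tuples = []
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            tuples.append((term, doc_id, freq))
    # Step 2: sort local runs (one run per doc here, for illustration)
    runs = [sorted(t for t in tuples if t[1] == d) for d in docs]
    # Step 3: k-way merge of the sorted runs
    merged = heapq.merge(*runs)
    # Step 4: output the inverted file grouped by term
    inverted = {}
    for term, doc_id, freq in merged:
        inverted.setdefault(term, []).append((doc_id, freq))
    return inverted

print(invert({1: "a b b".split(), 2: "b c".split()}))
```

The point of the run-and-merge structure is that each step needs only sequential I/O and bounded memory, which is what makes indexing huge collections feasible.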
Sort-based Inversion
Parse & count emits local (termID, docID, freq) tuples per document, using a term lexicon (the→1, cold→2, days→3, a→4, …) and a docID lexicon (doc1→1, doc2→2, …), e.g., <1,1,3> <2,1,2> <3,1,1> for doc1. A "local" sort orders each run by term-id, e.g., <1,1,3> <1,2,2> <2,1,2> …. A merge sort then combines the runs into one term-ordered file, <1,1,3> <1,2,2> <1,5,2> … <5000,300,1>, so that all info about term 1 appears together; this is written out as the inverted file.
Searching
• Given a query, score documents efficiently
• Boolean query
– Fetch the inverted list for all query terms
– Perform set operations to get the subset of docs that satisfy the Boolean condition
– E.g., Q1=“info” AND “security” , Q2=“info” OR “security”
• info: d1, d2, d3, d4
• security: d2, d4, d6
• Results: {d2,d4} (Q1) {d1,d2,d3,d4,d6} (Q2)
Ranking Documents
• Assumption:score(d,q)=f[g(w(d,q,t1),…w(d,q,tn)), w(d),w(q)], where, ti’s are the matched terms
• Maintain a score accumulator for each doc to compute function g
• For each query term ti
– Fetch the inverted list {(d1,f1),…,(dn,fn)}
– For each entry (dj,fj), compute w(dj,q,ti) and update the score accumulator for doc dj
• Adjust the score to compute f, and sort
Ranking Documents: Example
Query = "info security"; S(d,q) = g(t1)+…+g(tn) [sum of freq of matched terms]

info: (d1,3), (d2,4), (d3,1), (d4,5)
security: (d2,3), (d4,1), (d5,3)

Accumulators:        d1 d2 d3 d4 d5
start:                0  0  0  0  0
info (d1,3):          3  0  0  0  0
info (d2,4):          3  4  0  0  0
info (d3,1):          3  4  1  0  0
info (d4,5):          3  4  1  5  0
security (d2,3):      3  7  1  5  0
security (d4,1):      3  7  1  6  0
security (d5,3):      3  7  1  6  3
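The accumulator loop can be sketched as follows, with g = raw term frequency as in the example above:

```python
from collections import defaultdict

def rank(inverted_lists, query_terms):
    """One score accumulator per doc; each posting adds its term's
    contribution (here simply the raw frequency)."""
    acc = defaultdict(int)
    for t in query_terms:
        for doc_id, freq in inverted_lists.get(t, []):
            acc[doc_id] += freq
    return sorted(acc.items(), key=lambda kv: -kv[1])

lists = {"info": [(1, 3), (2, 4), (3, 1), (4, 5)],
         "security": [(2, 3), (4, 1), (5, 3)]}
print(rank(lists, ["info", "security"]))
# [(2, 7), (4, 6), (1, 3), (5, 3), (3, 1)]
```

Only documents appearing in some query term's posting list ever get an accumulator, which is the efficiency argument behind inverted files.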
Further Improving Efficiency
• Keep only the most promising accumulators
• Sort the inverted list in decreasing order of weights and fetch only N entries with the highest weights
• Pre-compute as much as possible
Open Source IR Toolkits
• Smart (Cornell)
• MG (RMIT & Melbourne, Australia; Waikato, New Zealand),
• Lemur (CMU/Univ. of Massachusetts)
• Terrier (Glasgow)
• Lucene (Open Source)
Smart
• The most influential IR system/toolkit
• Developed at Cornell since 1960’s
• Vector space model with lots of weighting options
• Written in C
• The Cornell/AT&T groups have used the Smart system to achieve top TREC performance
MG
• A highly efficient toolkit for retrieval of text and images
• Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in 1990’s
• Written in C, running on Unix
• Vector space model with lots of compression and speed up tricks
• People have used it to achieve good TREC performance
Lemur/Indri
• An IR toolkit emphasizing language models
• Developed at CMU and Univ. of Massachusetts in 2000’s
• Written in C++, highly extensible
• Vector space and probabilistic models including language models
• Achieving good TREC performance with a simple language model
Terrier
• A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support
• Developed at University of Glasgow, UK
• Written in Java, open source
• “Divergence from randomness” retrieval model and other modern retrieval formulas
Lucene
• Open Source IR toolkit
• Initially developed by Doug Cutting in Java
• Now has been ported to some other languages
• Good for building IR/Web applications
• Many applications have been built using Lucene (e.g., Nutch Search Engine)
• Currently the retrieval algorithms have poor accuracy
What You Should Know
• What is an inverted index
• Why does an inverted index help make search fast
• How to construct a large inverted index
• Simple integer compression methods
• How to use an inverted index to rank documents efficiently
• HOW TO IMPLEMENT A SIMPLE IR SYSTEM
Applications of Basic IR Techniques
Some “Basic” IR Techniques
• Stemming
• Stop words
• Weighting of terms (e.g., TF-IDF)
• Vector/Unigram representation of text
• Text similarity (e.g., cosine)
• Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!
Generality of Basic Techniques
Raw text is first processed with stemming & stop word removal into tokenized text; term weighting then yields a doc-term matrix (rows d1…dm, columns t1…tn, entries wij). From this common representation: term similarity and doc similarity feed CLUSTERING, vector centroids feed CATEGORIZATION and META-DATA/ANNOTATION, and sentence selection feeds SUMMARIZATION.
Sample Applications
• Information Filtering (covered earlier)
• Text Categorization
• Document/Term Clustering
• Text Summarization
Text Categorization
• Pre-given categories and labeled document examples (Categories may form hierarchy)
• Classify new documents
• A standard supervised learning problem
A categorization system takes incoming documents and assigns them to the pre-given categories (Sports, Business, Education, Science, …).
“Retrieval-based” Categorization
• Treat each category as representing an “information need”
• Treat examples in each category as “relevant documents”
• Use feedback approaches to learn a good “query”
• Match all the learned queries to a new document
• A document gets the category(categories) represented by the best matching query(queries)
Prototype-based Classifier
• Key elements ("retrieval techniques")
– Prototype/document representation (e.g., term vector)
– Document-prototype distance measure (e.g., dot product)
– Prototype vector learning: Rocchio feedback
• Example
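A minimal prototype-based classifier along these lines. Here the prototype is simply the centroid of raw term-frequency vectors, without the full Rocchio positive/negative feedback weights, and the data is a toy illustration:

```python
from collections import Counter, defaultdict

def centroid(doc_vectors):
    """Prototype = average vector of a category's training docs."""
    proto = defaultdict(float)
    for vec in doc_vectors:
        for term, w in vec.items():
            proto[term] += w / len(doc_vectors)
    return proto

def dot(u, v):
    """Document-prototype distance measure: dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def classify(doc_vec, prototypes):
    """Assign the category whose learned 'query' matches best."""
    return max(prototypes, key=lambda c: dot(doc_vec, prototypes[c]))

sports = [Counter("game team win".split()), Counter("team score game".split())]
biz = [Counter("stock market price".split()), Counter("market trade price".split())]
protos = {"sports": centroid(sports), "business": centroid(biz)}
print(classify(Counter("big game tonight team".split()), protos))  # sports
```

Matching a document against every category prototype is exactly "retrieval with learned queries", as described above.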
K-Nearest Neighbor Classifier
• Keep all training examples
• Find k examples that are most similar to the new document (“neighbor” documents)
• Assign the category that is most common in these neighbor documents (neighbors vote for the category)
• Can be improved by considering the distance of a neighbor (a closer neighbor has more influence)
• Technical elements ("retrieval techniques")
– Document representation
– Document distance measure
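A minimal k-NN sketch with cosine similarity and unweighted voting (toy data; a distance-weighted vote would implement the improvement mentioned above):

```python
from collections import Counter

def cosine(u, v):
    num = sum(w * v.get(t, 0) for t, w in u.items())
    den = (sum(w * w for w in u.values()) ** 0.5 *
           sum(w * w for w in v.values()) ** 0.5)
    return num / den if den else 0.0

def knn_classify(doc, training, k=3):
    """training: list of (term-vector, label). Neighbors vote for the label."""
    neighbors = sorted(training, key=lambda ex: -cosine(doc, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [(Counter("game team win".split()), "sports"),
         (Counter("team score".split()), "sports"),
         (Counter("stock market".split()), "business"),
         (Counter("market price trade".split()), "business")]
print(knn_classify(Counter("team game score".split()), train, k=3))  # sports
```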
Example of K-NN Classifier
(Figure: the same new document classified with k=1 vs. k=4 neighbors.)
Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic Email sorting
• Web page classification
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
• Example
Similarity-based Clustering (as opposed to "model-based")
• Define a similarity function to measure similarity between two objects
• Gradually group similar objects together in a bottom-up fashion
• Stop when some stopping criterion is met
• Variations: different ways to compute group similarity based on individual object similarity
Similarity-induced Structure
How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods:

• Single-link algorithm: s(g1,g2) = similarity of the closest pair
• Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
• Average-link algorithm: s(g1,g2) = average similarity of all pairs
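The three group-similarity definitions in miniature; the similarity function here is a toy negative distance on 1-D points:

```python
def single_link(sim, g1, g2):
    """Similarity of the closest pair = maximum pairwise similarity."""
    return max(sim(a, b) for a in g1 for b in g2)

def complete_link(sim, g1, g2):
    """Similarity of the farthest pair = minimum pairwise similarity."""
    return min(sim(a, b) for a in g1 for b in g2)

def average_link(sim, g1, g2):
    """Average similarity over all cross-group pairs."""
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

sim = lambda a, b: -abs(a - b)   # toy similarity: negative distance
g1, g2 = [0.0, 1.0], [3.0, 5.0]
print(single_link(sim, g1, g2))    # -2.0 (closest pair: 1 and 3)
print(complete_link(sim, g1, g2))  # -5.0 (farthest pair: 0 and 5)
print(average_link(sim, g1, g2))   # -3.5
```

Single-link tends to produce elongated "chained" clusters, complete-link compact ones, with average-link in between.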
Three Methods Illustrated
(Figure: two groups g1 and g2, highlighting the pairs used by the single-link, complete-link, and average-link algorithms.)
Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
The Summarization Problem
• Essentially “semantic compression” of text
• Selection-based vs. generation-based summary
• In general, we need a purpose for summarization, but it’s hard to define it
“Retrieval-based” Summarization
• Observation: term vector summary?
• Basic approach
– Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
– Based on term weights
– Based on position of sentences
– Based on the similarity of sentence and document vector
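A sketch of the third ranking method: score each sentence by its cosine similarity to the whole-document term vector and keep the top N (toy sentences, raw term frequencies):

```python
from collections import Counter

def cosine(u, v):
    num = sum(w * v.get(t, 0) for t, w in u.items())
    den = (sum(w * w for w in u.values()) ** 0.5 *
           sum(w * w for w in v.values()) ** 0.5)
    return num / den if den else 0.0

def summarize(sentences, n=1):
    """Rank sentences by similarity to the document vector; keep top n."""
    doc_vec = Counter(t for s in sentences for t in s.split())
    ranked = sorted(sentences,
                    key=lambda s: -cosine(Counter(s.split()), doc_vec))
    return ranked[:n]

sents = ["retrieval models rank documents",
         "language models smooth probabilities",
         "retrieval systems rank documents with models"]
print(summarize(sents, n=1))
```

Sentences sharing the document's dominant vocabulary rise to the top; term weighting (e.g., TF-IDF instead of raw counts) and position features would slot into the same ranking loop.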
Simple Discourse Analysis
(Figure: a document split into segments represented as vectors 1…n, with similarity computed between adjacent vectors to find topic boundaries.)
A Simple Summarization Method
(Figure: the document is segmented; within each segment, the sentence most similar to the document vector is selected for the summary.)
Examples of Summarization
• News summary
• Summarize retrieval results
– Single doc summary
– Multi-doc summary
• Summarize a cluster of documents (automatic label creation for clusters)
What You Should Know
• Retrieval touches some basic issues in text information management (what are these basic issues?)
• How to apply simple retrieval techniques, such as the vector space model, to information filtering, text categorization, clustering, and summarization
• There are many other tasks that can potentially benefit from simple IR techniques
Roadmap
• This lecture
– Other retrieval models
– IR system implementation
– Applications of basic TR techniques
• Next lecture: more in-depth treatment of language models