2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 1
龙星计划课程: 信息检索 (Dragon Star Program Course: Information Retrieval)
Overview of Text Retrieval: Part 2
ChengXiang Zhai (翟成祥 ) Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
Outline
• Other retrieval models
• Implementation of a TR System
• Applications of TR techniques
P-norm (Extended Boolean)(Salton et al. 83)
• Motivation: how to rank documents with a Boolean query?
• Intuitions
– Docs satisfying the query constraint should get the highest ranks
– Partial satisfaction of query constraint can be used to rank other docs
• Question: How to capture “partial satisfaction”?
P-norm: Basic Ideas
• Normalized term weights for doc rep ([0,1])
• Define similarity between a Boolean query and a doc vector
Think of a two-term query as a point (x, y) in the unit square spanned by the normalized term weights, with corners (0,0), (1,0), (0,1), (1,1):

Q = T1 AND T2 (ideal point is (1,1)):
sim("T1 AND T2", d) = 1 − sqrt( [ (1−x)² + (1−y)² ] / 2 )

Q = T1 OR T2 (worst point is (0,0)):
sim("T1 OR T2", d) = sqrt( (x² + y²) / 2 )
P-norm: Formulas
Since the similarity value is normalized to [0,1], these two formulas can be applied recursively.
sim("T1 AND … AND Tn", d) = 1 − dist(d, (1,…,1)) = 1 − [ ( (1−x1)^p + … + (1−xn)^p ) / n ]^{1/p}

sim("T1 OR … OR Tn", d) = dist(d, (0,…,0)) = [ ( x1^p + … + xn^p ) / n ]^{1/p}

1 ≤ p ≤ +∞: p = 1 gives vector-space-like behavior; p → ∞ gives Boolean/fuzzy logic.
P-norm: Summary
• A general (and elegant) similarity function for Boolean query and a regular document vector
• Connecting Boolean model and vector space model with models in between
• Allowing different “confidence” on Boolean operators (different p for different operators)
• A model worth more exploration (how to learn optimal p values from feedback?)
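The two p-norm formulas can be sketched in a few lines. This is an illustrative implementation only (the function names are mine, not from any particular system):

```python
def pnorm_and(weights, p):
    """P-norm similarity for Q = T1 AND ... AND Tn.

    weights: normalized term weights x_i in [0,1] for the document.
    sim = 1 - ( sum((1-x_i)^p) / n )^(1/p)
    """
    n = len(weights)
    return 1 - (sum((1 - x) ** p for x in weights) / n) ** (1 / p)

def pnorm_or(weights, p):
    """P-norm similarity for Q = T1 OR ... OR Tn.

    sim = ( sum(x_i^p) / n )^(1/p)
    """
    n = len(weights)
    return (sum(x ** p for x in weights) / n) ** (1 / p)

# p=1 reduces both to the average of the weights (vector-space-like);
# a large p approaches strict Boolean/fuzzy-logic behavior (min/max).
print(pnorm_and([0.8, 0.6], p=1))    # average: 0.7
print(pnorm_or([0.8, 0.6], p=1))     # average: 0.7
print(pnorm_or([0.8, 0.6], p=100))   # close to max(weights) = 0.8
```

Note how a single parameter p interpolates between the vector-space and Boolean ends of the spectrum, which is exactly the point of the model.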
Probabilistic Retrieval Models
Overview of Retrieval Models
• Similarity-based: relevance ≈ similarity(Rep(q), Rep(d)), with different representations & similarity functions
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)
– …
• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
– Regression model (Fox 83)
– Generative models:
  · Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
  · Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
– Learning to rank (Joachims 02; Burges et al. 05)
• Probabilistic inference: P(d→q) or P(q→d), with different inference systems
– Prob. concept space model (Wong & Yao, 95)
– Inference network model (Turtle & Croft, 91)
The Basic Question
What is the probability that THIS document is relevant to THIS query?
Formally…
3 random variables: query Q, document D, relevance R ∈ {0,1}
Given a particular query q, a particular document d, p(R=1|Q=q,D=d)=?
Probability of Relevance
• Three random variables
– Query Q
– Document D
– Relevance R ∈ {0,1}
• Goal: rank D based on P(R=1|Q,D)
– Evaluate P(R=1|Q,D)
– Actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank documents
• Several different ways to refine P(R=1|Q,D)
Refining P(R=1|Q,D) Method 1: conditional models
• Basic idea: relevance depends on how well a query matches a document
– Define features on Q × D, e.g., # matched terms, highest IDF of a matched term, doc length, …
– P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ)
– Use training data (known relevance judgments) to estimate the parameters θ
– Apply the model to rank new documents
• Special case: logistic regression
Logistic Regression (Cooper 92, Gey 94)
P(R=1|Q,D) = 1 / ( 1 + exp( −(β0 + Σ_{i=1}^{6} βi Xi) ) )

Equivalently, log [ P(R=1|Q,D) / (1 − P(R=1|Q,D)) ] = β0 + Σ_{i=1}^{6} βi Xi

logit function: logit(x) = log [ x / (1−x) ]
logistic (sigmoid) function: σ(x) = 1/(1+exp(−x)) = exp(x)/(1+exp(x))

Uses 6 features X1, …, X6; the sigmoid maps the linear score to a probability in (0, 1).
Features/Attributes
For a query-document pair with M terms in common:

X1 = (1/M) Σ_{j=1}^{M} log QAF(tj) — Average Absolute Query Frequency
X2 = QL — Query Length
X3 = (1/M) Σ_{j=1}^{M} log DAF(tj) — Average Absolute Document Frequency
X4 = DL — Document Length
X5 = (1/M) Σ_{j=1}^{M} log IDF(tj), with IDF(tj) = log [(N − nj)/nj] — Average Inverse Document Frequency
X6 = log M — Number of terms in common between query and document, logged
Logistic Regression: Pros & Cons
• Advantages
– Absolute probability of relevance available
– May re-use all the past relevance judgments
• Problems
– Performance is very sensitive to the selection of features
– Not much guidance on feature selection
• In practice, performance tends to be average
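As a sketch of how such a conditional model scores a document once trained: the coefficients below are made up for illustration (in practice they are estimated from relevance judgments), and the six feature values stand in for X1..X6 of one (Q,D) pair.

```python
import math

def logistic(x):
    """Sigmoid: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relevance_prob(features, beta0, betas):
    """P(R=1|Q,D) = logistic(beta0 + sum_i beta_i * X_i)."""
    z = beta0 + sum(b * x for b, x in zip(betas, features))
    return logistic(z)

# Hypothetical coefficients; a real model fits these on judged (Q,D) pairs.
beta0, betas = -3.5, [0.1, -0.02, 0.1, -0.005, 0.8, 1.0]
X = [1.2, 3, 0.9, 120, 2.1, 1.1]   # six feature values for one (Q,D) pair
print(relevance_prob(X, beta0, betas))
```

Documents are then ranked by this probability; because the sigmoid is monotonic, ranking by the linear score z gives the same order.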
Refining P(R=1|Q,D) Method 2:generative models
• Basic idea
– Define P(Q,D|R)
– Compute O(R=1|Q,D) using Bayes’ rule
• Special cases
– Document “generation”: P(Q,D|R)=P(D|Q,R)P(Q|R)
– Query “generation”: P(Q,D|R)=P(Q|D,R)P(D|R)
O(R=1|Q,D) = P(R=1|Q,D) / P(R=0|Q,D) = [ P(Q,D|R=1) / P(Q,D|R=0) ] · [ P(R=1) / P(R=0) ]

The second factor does not depend on Q or D and is ignored for ranking D.
Document Generation
O(R=1|Q,D) ∝ [ P(D|Q,R=1) P(Q|R=1) ] / [ P(D|Q,R=0) P(Q|R=0) ] ∝rank P(D|Q,R=1) / P(D|Q,R=0)

P(D|Q,R=1): model of relevant docs for Q
P(D|Q,R=0): model of non-relevant docs for Q

Assume independent attributes A1…Ak (why?). Let D = d1…dk, where di ∈ {0,1} is the value of attribute Ai (similarly Q = q1…qk). Then

P(D|Q,R=1) / P(D|Q,R=0) = Π_{i=1}^{k} [ P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0) ]
= Π_{i: di=1} [ P(Ai=1|Q,R=1)/P(Ai=1|Q,R=0) ] · Π_{i: di=0} [ P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0) ]

Assume P(Ai=1|Q,R=1) = P(Ai=1|Q,R=0) if qi=0, i.e., non-query terms are equally likely to appear in relevant and non-relevant docs. Then only query terms matter:

= Π_{i: di=qi=1} [ P(Ai=1|Q,R=1)/P(Ai=1|Q,R=0) ] · Π_{i: di=0, qi=1} [ P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0) ]
Robertson-Sparck Jones Model(Robertson & Sparck Jones 76)
Two parameters for each term Ai:
pi = P(Ai=1|Q,R=1): probability that term Ai occurs in a relevant doc
qi = P(Ai=1|Q,R=0): probability that term Ai occurs in a non-relevant doc

log O(R=1|Q,D) ∝rank Σ_{i: di=qi=1} log [ pi(1−qi) / (qi(1−pi)) ]   (RSJ model)

How to estimate the parameters? Suppose we have relevance judgments:
p̂i = (# rel. docs with Ai + 0.5) / (# rel. docs + 1)
q̂i = (# nonrel. docs with Ai + 0.5) / (# nonrel. docs + 1)
“+0.5” and “+1” can be justified by Bayesian estimation
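A minimal sketch of RSJ scoring with the smoothed estimates above; the collection statistics here are toy numbers, and the helper names are mine:

```python
import math

def rsj_weight(rel_with, rel_total, nonrel_with, nonrel_total):
    """RSJ term weight log[p(1-q)/(q(1-p))] with +0.5/+1 smoothing."""
    p = (rel_with + 0.5) / (rel_total + 1)
    q = (nonrel_with + 0.5) / (nonrel_total + 1)
    return math.log(p * (1 - q) / (q * (1 - p)))

def rsj_score(doc_terms, query_terms, stats):
    """Sum RSJ weights over terms present in both document and query."""
    return sum(rsj_weight(*stats[t]) for t in query_terms & doc_terms)

# stats[term] = (#rel docs with term, #rel docs, #nonrel with term, #nonrel)
stats = {"info": (8, 10, 20, 90), "security": (6, 10, 5, 90)}
print(rsj_score({"info", "security", "the"}, {"info", "security"}, stats))
```

A term that occurs in most relevant docs but few non-relevant ones gets a large positive weight; terms absent from the query contribute nothing, matching the binary independence derivation.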
RSJ Model: No Relevance Info(Croft & Harper 79)
log O(R=1|Q,D) ∝rank Σ_{i: di=qi=1} log [ pi(1−qi) / (qi(1−pi)) ]   (RSJ model)

How to estimate the parameters? Suppose we do not have relevance judgments:
– Assume pi to be a constant
– Estimate qi by assuming all documents to be non-relevant

log O(R=1|Q,D) ∝rank Σ_{i: di=qi=1} log [ (N − ni + 0.5) / (ni + 0.5) ]

N: # documents in the collection; ni: # documents in which term Ai occurs. The weight is essentially an IDF: IDF'(Ai) = log [(N − ni)/ni].
RSJ Model: Summary
• The most important classic prob. IR model
• Use only term presence/absence, thus also referred to as Binary Independence Model
• Essentially Naïve Bayes for doc ranking
• Most natural for relevance/pseudo feedback
• Without relevance judgments, the model parameters must be estimated in an ad hoc way
• Performance isn’t as good as tuned VS model
Improving RSJ: Adding TF
Let D = d1…dk, where di is the frequency count of term Ai.

Basic doc. generation model:
O(R=1|Q,D) ∝rank P(D|Q,R=1) / P(D|Q,R=0) = Π_{i=1}^{k} [ P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0) ]
= Π_{i: di>0} [ P(Ai=di|Q,R=1)/P(Ai=di|Q,R=0) ] · Π_{i: di=0} [ P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0) ]

Model each term frequency with a 2-Poisson mixture, where E ∈ {0,1} is a hidden "elite" variable (whether the doc is truly about the concept of term Ai):

p(Ai = f | Q, R) = Σ_{E∈{0,1}} p(E|Q,R) · p(f|E) = Σ_{E∈{0,1}} p(E|Q,R) · λE^f e^{−λE} / f!

Many more parameters to estimate! (how many exactly?)
BM25/Okapi Approximation(Robertson et al. 94)
• Idea: Approximate p(R=1|Q,D) with a simpler function that shares similar properties
• Observations:
– log O(R=1|Q,D) is a sum of term weights Wi
– Wi= 0, if TFi=0
– Wi increases monotonically with TFi
– Wi has an asymptotic limit
• The simple function is

Wi = [ TFi (k1 + 1) / (k1 + TFi) ] · log [ pi(1−qi) / (qi(1−pi)) ]

i.e., the RSJ weight scaled by a saturating TF transformation with asymptotic limit k1 + 1.
Adding Doc. Length & Query TF
• Incorporating doc length
– Motivation: The 2-Poisson model assumes equal document length
– Implementation: “Carefully” penalize long doc
• Incorporating query TF
– Motivation: Appears to be not well-justified
– Implementation: A similar TF transformation
• The final formula is called BM25, achieving top TREC performance
The BM25 Formula
(Formula shown as an image in the original slides: the "Okapi TF/BM25 TF" saturating transformation combined with the RSJ term weight and document length normalization.)
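The exact formula was an image in the original slides; below is a sketch of one common textbook form of BM25 (parameter names k1, b, k3 and their defaults vary across descriptions, and the statistics here are toy values):

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=8.0):
    """One common form of BM25: IDF x saturated doc TF x saturated query TF.

    query_tf/doc_tf: term -> frequency; N: collection size; df: doc freq.
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        # Length-normalized saturation point: long docs are penalized via b.
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        score += (idf
                  * (tf * (k1 + 1)) / (tf + norm)
                  * ((k3 + 1) * qtf) / (k3 + qtf))
    return score

print(bm25_score({"info": 1, "security": 1},
                 {"info": 3, "security": 1, "the": 10},
                 doc_len=14, avg_doc_len=12.0, N=1000,
                 df={"info": 50, "security": 20, "the": 990}))
```

Note the three ingredients the slides motivate: the RSJ/IDF weight, the asymptotic TF saturation, and the "careful" length penalty through b.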
Extensions of “Doc Generation” Models
• Capture term dependence (Rijsbergen & Harper 78)
• Alternative ways to incorporate TF (Croft 83, Kalt96)
• Feature/term selection for feedback (Okapi’s TREC reports)
• Other Possibilities (machine learning … )
Query Generation
O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0) = [ P(Q|D,R=1) P(D|R=1) ] / [ P(Q|D,R=0) P(D|R=0) ]

Assume P(Q|D,R=0) = P(Q|R=0), i.e., a non-relevant document tells us nothing about the query. Then

O(R=1|Q,D) ∝rank P(Q|D,R=1) · [ P(D|R=1) / P(D|R=0) ]
            (query likelihood p(q|d)) · (document prior)

Assuming a uniform prior, we have O(R=1|Q,D) ∝rank P(Q|D,R=1)

Now, the question is how to compute P(Q|D,R=1).
Generally involves two steps: (1) estimate a language model based on D; (2) compute the query likelihood according to the estimated model.
Leading to the so-called “Language Modeling Approach” …
What is a Statistical LM?
• A probability distribution over word sequences
– p("Today is Wednesday") ≈ 0.001
– p("Today Wednesday is") ≈ 0.0000000000001
– p("The eigenvalue is positive") ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
The Simplest Language Model(Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM
A unigram LM p(w|θ) acts as a probabilistic mechanism for generating documents by sampling words:
• Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
• Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
Sampling from the first model tends to produce a text mining paper; sampling from the second, a food nutrition paper.
Estimation of Unigram LM
Given a document — a "text mining paper" with 100 words in total — estimate the model p(w|θ) = ? from the observed counts:

text 10 → 10/100, mining 5 → 5/100, association 3 → 3/100, database 3 → 3/100, algorithm 2 → 2/100, …, query 1 → 1/100, efficient 1 → 1/100, …
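The maximum likelihood estimate is just relative frequency; a toy sketch (with a smaller document than the 100-word example):

```python
from collections import Counter

def mle_unigram(text_tokens):
    """Maximum likelihood estimate: p(w|d) = count(w, d) / |d|."""
    counts = Counter(text_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["query"]
lm = mle_unigram(doc)
print(lm["text"])   # 10/19 for this 19-word toy doc
```

The estimated probabilities sum to one, and any word not in the document gets probability zero, which is precisely the problem smoothing addresses later.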
Language Models for Retrieval(Ponte & Croft 98)
Estimate a language model for each document (e.g., for a text mining paper: text ?, mining ?, association ?, clustering ?, …, food ?; for a food nutrition paper: food ?, nutrition ?, healthy ?, diet ?, …).

Query = "data mining algorithms" — which model would most likely have generated this query?
Ranking Docs by Query Likelihood
Estimate a doc LM θdi for each document di, then rank documents by the query likelihood:
d1 → p(q|θd1), d2 → p(q|θd2), …, dN → p(q|θdN)
Retrieval as Language Model Estimation
• Document ranking based on query likelihood
log p(q|d) = Σ_{i=1}^{n} log p(wi|d), where q = w1 w2 … wn and p(wi|d) is the document language model

• Retrieval problem ⇒ estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
How to Estimate p(w|d)?
• Simplest solution: Maximum Likelihood Estimator
– P(w|d) = relative frequency of word w in d
– What if a word doesn’t appear in the text? P(w|d)=0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …
Language Model Smoothing (Illustration)
The maximum likelihood estimate p_ML(w) = (count of w) / (count of all words) gives zero probability to unseen words; a smoothed LM lowers the probabilities of seen words and spreads the saved mass over unseen ones.
How to Smooth?
• All smoothing methods try to
– discount the probability of words seen in a document
– re-allocate the extra counts so that unseen words will have a non-zero count
• A simple method (additive smoothing): Add a constant to the counts of each word
• Problems?
p(w|d) = ( c(w,d) + 1 ) / ( |d| + |V| )    ("add one", Laplace smoothing)

c(w,d): counts of w in d; |d|: length of d (total counts); |V|: vocabulary size
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words
p(w|d) = p_seen(w|d)   if w is seen in d
       = αd · p(w|C)   otherwise

p_seen(w|d): discounted ML estimate; p(w|C): collection language model; αd: normalizing coefficient
Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

log p(q|d) = Σ_{wi: c(wi,d)>0} log [ p_seen(wi|d) / (αd p(wi|C)) ] + n log αd + Σ_{i=1}^{n} log p(wi|C)

The first sum gives TF weighting with an IDF-like effect from dividing by p(wi|C); the n log αd term provides doc length normalization (a long doc is expected to have a smaller αd); the last term is the same for all docs and is ignored for ranking.

• Smoothing with p(w|C) ⇒ TF-IDF weighting + length normalization
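A sketch of query likelihood scoring with linear-interpolation (Jelinek-Mercer) smoothing, one of the smoothing methods covered in this lecture; the collection probabilities are made up:

```python
import math

def query_likelihood(query, doc_counts, doc_len, coll_prob, lam=0.5):
    """log p(q|d) with Jelinek-Mercer smoothing:
    p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)."""
    score = 0.0
    for w in query:
        p = (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * coll_prob[w]
        score += math.log(p)   # never log(0): the collection model backs off
    return score

coll_prob = {"data": 0.01, "mining": 0.005, "algorithms": 0.002}
d1 = {"data": 3, "mining": 2}   # |d1| = 20
d2 = {"algorithms": 1}          # |d2| = 20
q = ["data", "mining", "algorithms"]
print(query_likelihood(q, d1, 20, coll_prob) >
      query_likelihood(q, d2, 20, coll_prob))   # True: d1 matches more
```

Because every query word has non-zero smoothed probability, a document missing a query term is penalized rather than scored zero, which is the practical payoff of smoothing.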
Derivation of the Query Likelihood Retrieval Formula
With the general smoothing scheme
p(w|d) = p_seen(w|d) if w is seen in d, and αd p(w|C) otherwise,

log p(q|d) = Σ_{w∈V, c(w,q)>0} c(w,q) log p(w|d)
= Σ_{c(w,q)>0, c(w,d)>0} c(w,q) log p_seen(w|d) + Σ_{c(w,q)>0, c(w,d)=0} c(w,q) log [ αd p(w|C) ]
= Σ_{c(w,q)>0, c(w,d)>0} c(w,q) log [ p_seen(w|d) / (αd p(w|C)) ] + |q| log αd + Σ_{c(w,q)>0} c(w,q) log p(w|C)

Key rewriting step: for the seen words, add and subtract c(w,q) log [αd p(w|C)] so that the collection-model sum runs over all query words. Similar rewritings are very common when using LMs for IR.
More Smoothing Methods
• Method 1 (Absolute discounting): Subtract a constant from the counts of each word
• Method 2 (Linear interpolation, Jelinek-Mercer): “Shrink” uniformly toward p(w|REF)
p(w|d) = max(c(w,d) − δ, 0)/|d| + δ |d|u p(w|REF) / |d|    (Method 1; |d|u = # unique words in d, δ the discount constant)

p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)    (Method 2; λ is the smoothing parameter, c(w,d)/|d| the ML estimate)
More Smoothing Methods (cont.)
• Method 3 (Dirichlet Prior/Bayesian): Assume pseudo counts p(w|REF)
• Method 4 (Good Turing): Assume total # unseen events to be n1 (# of singletons), and adjust the seen events in the same way
p(w|d) = [ c(w,d) + μ p(w|REF) ] / ( |d| + μ ) = (|d|/(|d|+μ)) · c(w,d)/|d| + (μ/(|d|+μ)) · p(w|REF)    (Method 3; μ is the smoothing parameter)

p(w|d) = c*(w,d)/|d|, where c*(w,d) = (r+1) n_{r+1}/n_r, r = c(w,d), and n_r = # words occurring r times    (Method 4)
e.g., 0* = n1/n0, 1* = 2 n2/n1, … What if n_{r+1} = 0? What about p(w|REF)?
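A sketch of Dirichlet prior smoothing (Method 3) with toy counts:

```python
def dirichlet_smooth(doc_counts, doc_len, coll_prob, mu=2000):
    """Dirichlet prior smoothing:
    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    def p(w):
        return (doc_counts.get(w, 0) + mu * coll_prob[w]) / (doc_len + mu)
    return p

coll_prob = {"text": 0.02, "mining": 0.001}
p = dirichlet_smooth({"text": 10, "mining": 5}, doc_len=100,
                     coll_prob=coll_prob, mu=1000)
print(p("text"), p("mining"))   # (10+20)/1100 and (5+1)/1100
```

The interpolation coefficient μ/(|d|+μ) shrinks as |d| grows, so short documents are smoothed more heavily than long ones, unlike the fixed λ of Jelinek-Mercer.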
Dirichlet Prior Smoothing
• ML estimator: θ̂_ML = argmax_θ p(d|θ)
• Bayesian estimator:
– First consider the posterior: p(θ|d) = p(d|θ)p(θ)/p(d)
– Then take the mean or mode of the posterior distribution
• p(d|θ): sampling distribution (of the data)
• p(θ) = p(θ1,…,θN): our prior on the model parameters
• Conjugate prior: the prior can be interpreted as "extra"/"pseudo" data
• The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution:

Dir(θ | α1,…,αN) = [ Γ(α1+…+αN) / (Γ(α1)…Γ(αN)) ] · Π_{i=1}^{N} θi^{αi−1}

with "extra"/"pseudo" word counts αi = μ · p(wi|REF)
Dirichlet Prior Smoothing (cont.)
Posterior distribution of the parameters:

p(θ|d) ∝ p(d|θ) p(θ) = Dir(θ | c(w1,d)+α1, …, c(wN,d)+αN)

Property: if θ ~ Dir(θ|α), then E(θi) = αi / Σj αj

The predictive distribution is the same as the mean:

p(wi|θ̂) = ∫ p(wi|θ) Dir(θ|d) dθ = [ c(wi,d) + αi ] / ( |d| + Σj αj ) = [ c(wi,d) + μ p(wi|REF) ] / ( |d| + μ )

— exactly Dirichlet prior smoothing.
Advantages of Language Models
• Solid statistical foundation
• Parameters can be optimized automatically using statistical estimation methods
• Can easily model many different retrieval tasks
• To be covered more later
What You Should Know
• Global relationship among different probabilistic models
• How logistic regression works
• How the Robertson-Sparck Jones model works
• The BM25 formula
• All document-generation models have trouble when no relevance judgments are available
• How the language modeling approach (query likelihood scoring) works
• How Dirichlet prior smoothing works
• 3 state-of-the-art retrieval models: pivoted length normalization (VS), Okapi/BM25, and query likelihood with Dirichlet prior smoothing
Implementation of an IR System
IR System Architecture
The user interacts through an INTERFACE: the query becomes a query representation (QueryRep), docs are processed by INDEXING into a doc representation (DocRep), SEARCHING ranks docs against the query and returns results, and the user's judgments drive FEEDBACK/QUERY MODIFICATION to refine the query.
Indexing
• Indexing = Convert documents to data structures that enable fast search
• Inverted index is the dominating indexing method (used by all search engines)
• Other indices (e.g., document index) may be needed for feedback
Inverted Index
• Fast access to all docs containing a given term (along with freq and pos information)
• For each term, we get a list of tuples (docID, freq, pos).
• Given a query, we can fetch the lists for all query terms and work on the involved documents.
– Boolean query: set operation
– Natural language query: term weight summing
• More efficient than scanning docs (why?)
Inverted Index Example
Doc 1: "This is a sample document with one sample sentence"
Doc 2: "This is another sample document"

Dictionary                      Postings
Term     #docs  Total freq      (doc id, freq)
This     2      2               (1,1), (2,1)
is       2      2               (1,1), (2,1)
sample   2      3               (1,2), (2,1)
another  1      1               (2,1)
…        …      …               …
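The dictionary/postings layout above can be reproduced with a small index builder. This is a sketch: a real system stores integer termIDs, positions, and compressed postings rather than raw strings:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: list of tokens}.
    Returns {term: sorted list of (doc_id, freq) postings}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t][doc_id] += 1
    return {t: sorted(postings.items()) for t, postings in index.items()}

docs = {1: "this is a sample document with one sample sentence".split(),
        2: "this is another sample document".split()}
index = build_inverted_index(docs)
print(index["sample"])    # [(1, 2), (2, 1)]
print(index["another"])   # [(2, 1)]
```

Given a query, only the listed documents need to be touched, which is why this beats scanning every document.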
Data Structures for Inverted Index
• Dictionary: modest size
– Needs fast random access
– Preferred to be in memory
– Hash table, B-tree, trie, …
• Postings: huge
– Sequential access is expected
– Can stay on disk
– May contain docID, term freq., term pos, etc
– Compression is desirable
Inverted Index Compression
• Observations
– Inverted lists are sorted (e.g., by docid or term freq)
– Small numbers tend to occur more frequently
• Implications
– "d-gap" (store differences): d1, d2−d1, d3−d2, …
– Exploit skewed frequency distribution: fewer bits for small (high frequency) integers
• Binary code, unary code, γ-code, δ-code
Integer Compression Methods
• In general, to exploit skewed distribution
• Binary: equal-length coding
• Unary: x ≥ 1 is coded as x−1 one bits followed by a 0, e.g., 3 => 110; 5 => 11110
• γ-code: x => unary code for 1+⌊log x⌋ followed by a uniform code for x − 2^⌊log x⌋ in ⌊log x⌋ bits, e.g., 3 => 101, 5 => 11001
• δ-code: same as γ-code, but replace the unary prefix with its γ-code, e.g., 3 => 1001, 5 => 10101
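A sketch of the unary and Elias γ encoders; the outputs match the examples above:

```python
def unary(x):
    """x >= 1 coded as (x-1) one-bits followed by a zero: 3 -> '110'."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma: unary code for 1 + floor(log2 x), then the low
    floor(log2 x) bits of x (the offset x - 2^floor(log2 x))."""
    n = x.bit_length() - 1          # floor(log2 x)
    offset_bits = format(x, "b")[1:]  # drop the leading 1-bit
    return unary(n + 1) + offset_bits

print(unary(3), unary(5))   # 110 11110
print(gamma(3), gamma(5))   # 101 11001
```

Small integers get short codes (γ(1) is a single bit), which is exactly the skew that d-gapped postings exhibit.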
Constructing Inverted Index
• The main difficulty is to build a huge index with limited memory
• Memory-based methods: not usable for large collections
• Sort-based methods:
– Step 1: collect local (termID, docID, freq) tuples
– Step 2: sort local tuples (to make “runs”)
– Step 3: pair-wise merge runs
– Step 4: Output inverted file
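The four steps can be sketched in miniature. Here the runs are in-memory lists and terms stand in for integer termIDs; a real system keeps the runs on disk and merges them pair-wise:

```python
import heapq
from collections import Counter

def invert(docs):
    """Sort-based inversion: collect (term, docID, freq) tuples,
    sort them into runs, merge the runs, output the inverted file."""
    # Step 1: collect local tuples
    tuples = []
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            tuples.append((term, doc_id, freq))
    # Step 2: sort local runs (one run per doc here, for illustration)
    runs = [sorted(t for t in tuples if t[1] == d) for d in docs]
    # Step 3: k-way merge of the sorted runs
    merged = heapq.merge(*runs)
    # Step 4: output the inverted file grouped by term
    inverted = {}
    for term, doc_id, freq in merged:
        inverted.setdefault(term, []).append((doc_id, freq))
    return inverted

print(invert({1: "a b b".split(), 2: "b c".split()}))
```

The point of the run-and-merge structure is that each step needs only sequential I/O and bounded memory, which is what makes indexing huge collections feasible.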
Sort-based Inversion
Parse & count emits local (termID, docID, freq) tuples per document, using a term lexicon (the→1, cold→2, days→3, a→4, …) and a docID lexicon (doc1→1, doc2→2, …), e.g., <1,1,3> <2,1,2> <3,1,1> for doc1. A "local" sort orders each run by term-id, e.g., <1,1,3> <1,2,2> <2,1,2> …. A merge sort then combines the runs into one term-ordered file, <1,1,3> <1,2,2> <1,5,2> … <5000,300,1>, so that all info about term 1 appears together; this is written out as the inverted file.
Searching
• Given a query, score documents efficiently
• Boolean query
– Fetch the inverted list for all query terms
– Perform set operations to get the subset of docs that satisfy the Boolean condition
– E.g., Q1=“info” AND “security” , Q2=“info” OR “security”
• info: d1, d2, d3, d4
• security: d2, d4, d6
• Results: {d2,d4} (Q1) {d1,d2,d3,d4,d6} (Q2)
Ranking Documents
• Assumption:score(d,q)=f[g(w(d,q,t1),…w(d,q,tn)), w(d),w(q)], where, ti’s are the matched terms
• Maintain a score accumulator for each doc to compute function g
• For each query term ti
– Fetch the inverted list {(d1,f1),…,(dn,fn)}
– For each entry (dj,fj), compute w(dj,q,ti) and update the score accumulator for doc dj
• Adjust the score to compute f, and sort
Ranking Documents: Example
Query = "info security"; S(d,q) = g(t1)+…+g(tn) [sum of freq of matched terms]

info: (d1,3), (d2,4), (d3,1), (d4,5)
security: (d2,3), (d4,1), (d5,3)

Accumulators:        d1 d2 d3 d4 d5
start:                0  0  0  0  0
info (d1,3):          3  0  0  0  0
info (d2,4):          3  4  0  0  0
info (d3,1):          3  4  1  0  0
info (d4,5):          3  4  1  5  0
security (d2,3):      3  7  1  5  0
security (d4,1):      3  7  1  6  0
security (d5,3):      3  7  1  6  3
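The accumulator loop can be sketched as follows, with g = raw term frequency as in the example above:

```python
from collections import defaultdict

def rank(inverted_lists, query_terms):
    """One score accumulator per doc; each posting adds its term's
    contribution (here simply the raw frequency)."""
    acc = defaultdict(int)
    for t in query_terms:
        for doc_id, freq in inverted_lists.get(t, []):
            acc[doc_id] += freq
    return sorted(acc.items(), key=lambda kv: -kv[1])

lists = {"info": [(1, 3), (2, 4), (3, 1), (4, 5)],
         "security": [(2, 3), (4, 1), (5, 3)]}
print(rank(lists, ["info", "security"]))
# [(2, 7), (4, 6), (1, 3), (5, 3), (3, 1)]
```

Only documents appearing in some query term's posting list ever get an accumulator, which is the efficiency argument behind inverted files.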
Further Improving Efficiency
• Keep only the most promising accumulators
• Sort the inverted list in decreasing order of weights and fetch only N entries with the highest weights
• Pre-compute as much as possible
Open Source IR Toolkits
• Smart (Cornell)
• MG (RMIT & Melbourne, Australia; Waikato, New Zealand),
• Lemur (CMU/Univ. of Massachusetts)
• Terrier (Glasgow)
• Lucene (Open Source)
Smart
• The most influential IR system/toolkit
• Developed at Cornell since 1960’s
• Vector space model with lots of weighting options
• Written in C
• The Cornell/AT&T groups have used the Smart system to achieve top TREC performance
MG
• A highly efficient toolkit for retrieval of text and images
• Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in 1990’s
• Written in C, running on Unix
• Vector space model with lots of compression and speed up tricks
• People have used it to achieve good TREC performance
Lemur/Indri
• An IR toolkit emphasizing language models
• Developed at CMU and Univ. of Massachusetts in 2000’s
• Written in C++, highly extensible
• Vector space and probabilistic models including language models
• Achieving good TREC performance with a simple language model
Terrier
• A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support
• Developed at University of Glasgow, UK
• Written in Java, open source
• “Divergence from randomness” retrieval model and other modern retrieval formulas
Lucene
• Open Source IR toolkit
• Initially developed by Doug Cutting in Java
• Now has been ported to some other languages
• Good for building IR/Web applications
• Many applications have been built using Lucene (e.g., Nutch Search Engine)
• Currently the retrieval algorithms have poor accuracy
What You Should Know
• What is an inverted index
• Why does an inverted index help make search fast
• How to construct a large inverted index
• Simple integer compression methods
• How to use an inverted index to rank documents efficiently
• HOW TO IMPLEMENT A SIMPLE IR SYSTEM
Applications of Basic IR Techniques
Some “Basic” IR Techniques
• Stemming
• Stop words
• Weighting of terms (e.g., TF-IDF)
• Vector/Unigram representation of text
• Text similarity (e.g., cosine)
• Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!
Generality of Basic Techniques
Raw text is first processed with stemming & stop word removal into tokenized text; term weighting then yields a doc-term matrix (rows d1…dm, columns t1…tn, entries wij). From this common representation: term similarity and doc similarity feed CLUSTERING, vector centroids feed CATEGORIZATION and META-DATA/ANNOTATION, and sentence selection feeds SUMMARIZATION.
Sample Applications
• Information Filtering (covered earlier)
• Text Categorization
• Document/Term Clustering
• Text Summarization
Text Categorization
• Pre-given categories and labeled document examples (Categories may form hierarchy)
• Classify new documents
• A standard supervised learning problem
A categorization system takes incoming documents and assigns them to the pre-given categories (Sports, Business, Education, Science, …).
“Retrieval-based” Categorization
• Treat each category as representing an “information need”
• Treat examples in each category as “relevant documents”
• Use feedback approaches to learn a good “query”
• Match all the learned queries to a new document
• A document gets the category(categories) represented by the best matching query(queries)
Prototype-based Classifier
• Key elements ("retrieval techniques")
– Prototype/document representation (e.g., term vector)
– Document-prototype distance measure (e.g., dot product)
– Prototype vector learning: Rocchio feedback
• Example
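A minimal prototype-based classifier along these lines. Here the prototype is simply the centroid of raw term-frequency vectors, without the full Rocchio positive/negative feedback weights, and the data is a toy illustration:

```python
from collections import Counter, defaultdict

def centroid(doc_vectors):
    """Prototype = average vector of a category's training docs."""
    proto = defaultdict(float)
    for vec in doc_vectors:
        for term, w in vec.items():
            proto[term] += w / len(doc_vectors)
    return proto

def dot(u, v):
    """Document-prototype distance measure: dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def classify(doc_vec, prototypes):
    """Assign the category whose learned 'query' matches best."""
    return max(prototypes, key=lambda c: dot(doc_vec, prototypes[c]))

sports = [Counter("game team win".split()), Counter("team score game".split())]
biz = [Counter("stock market price".split()), Counter("market trade price".split())]
protos = {"sports": centroid(sports), "business": centroid(biz)}
print(classify(Counter("big game tonight team".split()), protos))  # sports
```

Matching a document against every category prototype is exactly "retrieval with learned queries", as described above.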
K-Nearest Neighbor Classifier
• Keep all training examples
• Find k examples that are most similar to the new document (“neighbor” documents)
• Assign the category that is most common in these neighbor documents (neighbors vote for the category)
• Can be improved by considering the distance of a neighbor (a closer neighbor has more influence)
• Technical elements ("retrieval techniques")
– Document representation
– Document distance measure
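A minimal k-NN sketch with cosine similarity and unweighted voting (toy data; a distance-weighted vote would implement the improvement mentioned above):

```python
from collections import Counter

def cosine(u, v):
    num = sum(w * v.get(t, 0) for t, w in u.items())
    den = (sum(w * w for w in u.values()) ** 0.5 *
           sum(w * w for w in v.values()) ** 0.5)
    return num / den if den else 0.0

def knn_classify(doc, training, k=3):
    """training: list of (term-vector, label). Neighbors vote for the label."""
    neighbors = sorted(training, key=lambda ex: -cosine(doc, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [(Counter("game team win".split()), "sports"),
         (Counter("team score".split()), "sports"),
         (Counter("stock market".split()), "business"),
         (Counter("market price trade".split()), "business")]
print(knn_classify(Counter("team game score".split()), train, k=3))  # sports
```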
Example of K-NN Classifier
(Figure: the same new document classified with k=1 vs. k=4 neighbors.)
Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic Email sorting
• Web page classification
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
• Example
Similarity-based Clustering (as opposed to "model-based")
• Define a similarity function to measure similarity between two objects
• Gradually group similar objects together in a bottom-up fashion
• Stop when some stopping criterion is met
• Variations: different ways to compute group similarity based on individual object similarity
Similarity-induced Structure
How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods:

• Single-link algorithm: s(g1,g2) = similarity of the closest pair
• Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
• Average-link algorithm: s(g1,g2) = average similarity of all pairs
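The three group-similarity definitions in miniature; the similarity function here is a toy negative distance on 1-D points:

```python
def single_link(sim, g1, g2):
    """Similarity of the closest pair = maximum pairwise similarity."""
    return max(sim(a, b) for a in g1 for b in g2)

def complete_link(sim, g1, g2):
    """Similarity of the farthest pair = minimum pairwise similarity."""
    return min(sim(a, b) for a in g1 for b in g2)

def average_link(sim, g1, g2):
    """Average similarity over all cross-group pairs."""
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

sim = lambda a, b: -abs(a - b)   # toy similarity: negative distance
g1, g2 = [0.0, 1.0], [3.0, 5.0]
print(single_link(sim, g1, g2))    # -2.0 (closest pair: 1 and 3)
print(complete_link(sim, g1, g2))  # -5.0 (farthest pair: 0 and 5)
print(average_link(sim, g1, g2))   # -3.5
```

Single-link tends to produce elongated "chained" clusters, complete-link compact ones, with average-link in between.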
Three Methods Illustrated
(Figure: two groups g1 and g2, highlighting the pairs used by the single-link, complete-link, and average-link algorithms.)
Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
The Summarization Problem
• Essentially “semantic compression” of text
• Selection-based vs. generation-based summary
• In general, we need a purpose for summarization, but it’s hard to define it
“Retrieval-based” Summarization
• Observation: term vector summary?
• Basic approach
– Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
– Based on term weights
– Based on position of sentences
– Based on the similarity of sentence and document vector
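A sketch of the third ranking method: score each sentence by its cosine similarity to the whole-document term vector and keep the top N (toy sentences, raw term frequencies):

```python
from collections import Counter

def cosine(u, v):
    num = sum(w * v.get(t, 0) for t, w in u.items())
    den = (sum(w * w for w in u.values()) ** 0.5 *
           sum(w * w for w in v.values()) ** 0.5)
    return num / den if den else 0.0

def summarize(sentences, n=1):
    """Rank sentences by similarity to the document vector; keep top n."""
    doc_vec = Counter(t for s in sentences for t in s.split())
    ranked = sorted(sentences,
                    key=lambda s: -cosine(Counter(s.split()), doc_vec))
    return ranked[:n]

sents = ["retrieval models rank documents",
         "language models smooth probabilities",
         "retrieval systems rank documents with models"]
print(summarize(sents, n=1))
```

Sentences sharing the document's dominant vocabulary rise to the top; term weighting (e.g., TF-IDF instead of raw counts) and position features would slot into the same ranking loop.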
Simple Discourse Analysis
(Figure: a document split into segments represented as vectors 1…n, with similarity computed between adjacent vectors to find topic boundaries.)
A Simple Summarization Method
(Figure: the document is segmented; within each segment, the sentence most similar to the document vector is selected for the summary.)
Examples of Summarization
• News summary
• Summarize retrieval results
– Single doc summary
– Multi-doc summary
• Summarize a cluster of documents (automatic label creation for clusters)
What You Should Know
• Retrieval touches some basic issues in text information management (what are these basic issues?)
• How to apply simple retrieval techniques, such as the vector space model, to information filtering, text categorization, clustering, and summarization
• There are many other tasks that can potentially benefit from simple IR techniques
Roadmap
• This lecture
– Other retrieval models
– IR system implementation
– Applications of basic TR techniques
• Next lecture: more in-depth treatment of language models