Information Retrieval
PengBo, Oct 28, 2010
Outline of this lecture
Introduction to Information Retrieval
Index Techniques
Scoring and Ranking
Evaluation
Basic Index Techniques
Document Collection
site:pkunews.pku.edu.cn — Baidu reports 12,800 pages; Google reports 6,820 pages
User Information Need
Within this news site, find articles that talk about the culture of China and Japan, and don't talk about students abroad.
QUERY: “中国 日本 文化 —留学生”
中国 日本 文化 -留学生 site:pkunews.pku.edu.cn — Baidu reports 38 results; Google reports 361 results
How to do it?
String matching, e.g., grep over all web pages: find the pages containing 中国, 文化 and 日本, then remove those containing 留学生?
Slow (for large corpora)
NOT 留学生 is non-trivial
Other operations (e.g., find 中国 NEAR 日本) are not feasible
Document Representation
Bag of words model. Document-term incidence matrix:

     中国  文化  日本  留学生  教育  北京  …
D1    1    1    0    0     1    1
D2    0    1    1    1     0    0
D3    1    0    1    1     0    0
D4    1    0    0    1     1    0
D5    1    1    1    0     0    1
D6    0    0    1    0     0    1

1 if the page contains the word, 0 otherwise.
Incidence Vector
       D1  D2  D3  D4  D5  D6  …
中国    1   0   1   1   1   0
文化    1   1   0   0   1   0
日本    0   1   1   0   1   1
留学生  0   1   1   1   0   0
教育    1   0   0   1   0   0
北京    1   0   0   0   1   1

Transpose: transposing the document-term matrix gives the term-document incidence matrix; each term then corresponds to a 0/1 vector, its incidence vector.
Retrieval
Information Need: within this news site, find articles that talk about the culture of China and Japan, and don't talk about students abroad.
To answer the query: take the term vectors of 中国, 文化, 日本 and 留学生 (complemented), then bitwise AND them:
101110 AND 110010 AND 011011 AND 100011 = 000010
i.e., document D5.
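The bitwise AND above can be sketched in a few lines of Python (a minimal illustration; the term vectors are copied from the incidence matrix on the previous slide):

```python
# Term-document incidence vectors for D1..D6, copied from the slide's matrix.
VECTORS = {
    "中国":   [1, 0, 1, 1, 1, 0],
    "文化":   [1, 1, 0, 0, 1, 0],
    "日本":   [0, 1, 1, 0, 1, 1],
    "留学生": [0, 1, 1, 1, 0, 0],
}

def boolean_and(include, exclude):
    """AND the vectors of `include` terms with the complemented
    vectors of `exclude` terms; return the matching doc names."""
    n = len(next(iter(VECTORS.values())))
    result = [1] * n
    for term in include:
        result = [a & b for a, b in zip(result, VECTORS[term])]
    for term in exclude:
        result = [a & (1 - b) for a, b in zip(result, VECTORS[term])]
    return [f"D{i + 1}" for i, bit in enumerate(result) if bit]

print(boolean_and(["中国", "文化", "日本"], ["留学生"]))  # ['D5']
```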
Let’s build a search system!
Consider the scale of the system: N = 1 million documents, each with about 1K terms; at roughly 6 bytes/term, that is about 6 GB of text. Number of distinct terms: M = 500K.
How big is the matrix? 500K × 1M, and extremely sparse: no more than one billion 1's. What's a better representation?
In 1875, Mary Cowden Clarke compiled a concordance to the works of Shakespeare. In the preface she proudly wrote that she had contributed “a reliable guide to the treasure-house of wisdom…”, hoping that the sixteen years of hard work had lived up to that ideal.
In 1911, Professor Lane Cooper published a concordance to the poems of William Wordsworth. It took 7 months and 67 people, working with tools such as index cards, scissors, glue and stamps.
By 1965, a computer could compile such material in a few days, and do it better…
Inverted index
For each term T, store the list of (the IDs of) the documents that contain T:

中国   → 2 4 8 16 32 64 128
文化   → 1 2 3 5 8 13 21 34
留学生 → 13 16

(Dictionary → Postings.) Sorted by docID (more later on why).
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer → token stream: Friends Romans Countrymen
Linguistic modules → modified tokens: friend roman countryman
Indexer → inverted index:
friend     → 2 4
roman      → 1 2
countryman → 13 16
Output: a sequence of <modified token, document ID> pairs.
Indexer steps
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
<term, docID> pairs: I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
Core indexing step: sort the pairs by term.
Sorted: ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2
Merge multiple occurrences within a document and add term frequency information.
<term, docID, freq>: ambitious 2 1; be 2 1; brutus 1 1; brutus 2 1; capitol 1 1; caesar 1 1; caesar 2 2; did 1 1; enact 1 1; hath 2 1; I 1 2; i' 1 1; it 2 1; julius 1 1; killed 1 2; let 2 1; me 1 1; noble 2 1; so 2 1; the 1 1; the 2 1; told 2 1; you 2 1; was 1 1; was 2 1; with 2 1
The result is split into a Dictionary file and a Postings file.
Dictionary <term, #docs, total freq>: ambitious 1 1; be 1 1; brutus 2 2; capitol 1 1; caesar 2 3; did 1 1; enact 1 1; hath 1 1; I 1 2; i' 1 1; it 1 1; julius 1 1; killed 1 2; let 1 1; me 1 1; noble 1 1; so 1 1; the 2 2; told 1 1; you 1 1; was 2 2; with 1 1
Postings <docID, freq>: 2 1; 2 1; 1 1; 2 1; 1 1; 1 1; 2 2; 1 1; 1 1; 2 1; 1 2; 1 1; 2 1; 1 1; 1 2; 2 1; 1 1; 2 1; 2 1; 1 1; 2 1; 2 1; 2 1; 1 1; 2 1; 2 1
Why split?
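The whole pipeline — emit pairs, sort, merge into a dictionary and postings — fits in a few lines of Python (a sketch; tokenization here is just lowercased whitespace splitting):

```python
from collections import Counter, defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: emit <term, docID> pairs.  Step 2: sort by term, then docID.
pairs = sorted((token.lower(), doc_id)
               for doc_id, text in docs.items()
               for token in text.split())

# Step 3: merge duplicates into per-document frequencies, then split the
# result into a dictionary (term -> df) and postings (term -> [(docID, tf)]).
postings = defaultdict(list)
for (term, doc_id), tf in sorted(Counter(pairs).items()):
    postings[term].append((doc_id, tf))
dictionary = {term: len(plist) for term, plist in postings.items()}

print(dictionary["caesar"], postings["caesar"])  # 2 [(1, 1), (2, 2)]
```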
Boolean Query processing
Query: 中国 AND 文化
Look up 中国 in the dictionary; retrieve its postings.
Look up 文化 in the dictionary; retrieve its postings.
“Merge” (AND) the two postings lists:
中国 → 2 4 8 16 32 64 128
文化 → 1 2 3 5 8 13 21 34
The merge
The algorithm for intersecting the two lists:
中国 → 2 4 8 16 32 64 128
文化 → 1 2 3 5 8 13 21 34
Result: 2 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
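The two-pointer merge can be sketched as follows (a minimal version; real systems add optimizations such as skip pointers):

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID; O(x + y) because
    each comparison advances at least one of the two pointers."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# The slide's example lists:
print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```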
Boolean queries: Exact match
Queries using AND, OR and NOT together with query terms
Primary commercial retrieval tool for 3 decades.
Professional searchers (e.g., Lawyers) still like Boolean queries:
You know exactly what you’re getting.
Example: WestLaw
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
About 7 terabytes of data; 700,000 users. The majority of users still use Boolean queries.
Example query: What is the statute of limitations in cases involving the federal
tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
Characteristics: long, precise queries; proximity operators; incrementally developed; not like web search.
Beyond Boolean term search
短语 phrase : Find “Bill Gates” , not “Bill and Gates”
词的临近关系 Proximity: Find Gates NEAR Microsoft.
文档中的区域限定 : Find documents with (author = Ullman) AND
(text contains automata). Solution :
记录 term 的 field property 记录 term 在 docs 中的 position information.
25
REVIEW OF THE LAST LECTURE
Bag of words model
Vector representation doesn’t consider the ordering of words in a document
“John is quicker than Mary” and “Mary is quicker than John” have the same vectors
This is called the bag of words model. In a sense, this is a step back: The
positional index was able to distinguish these two documents.
We will look at “recovering” positional information later in this course.
For now: bag of words model
Inverted index
For each term T, store the list of (the IDs of) the documents that contain T:

中国   → 2 4 8 16 32 64 128
文化   → 1 2 3 5 8 13 21 34
留学生 → 13 16

(Dictionary → Postings.) Sorted by docID (more later on why).
Simple inverted index
Inverted index with counts: supports better ranking algorithms
Inverted index with positions: supports proximity matches
Query Processing
Document-at-a-time: calculates complete scores for documents by processing all term lists, one document at a time.
Term-at-a-time: accumulates scores for documents by processing term lists one at a time.
Both approaches have optimization techniques that significantly reduce the time required to generate scores.
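Term-at-a-time scoring can be sketched with a dictionary of score accumulators (the tiny index and its weights below are made-up illustration data):

```python
from collections import defaultdict

# Hypothetical index: term -> postings list of (docID, precomputed weight).
INDEX = {
    "中国": [(1, 2.0), (3, 1.5), (5, 1.0)],
    "文化": [(1, 1.5), (5, 2.0)],
}

def term_at_a_time(query_terms):
    """Accumulate document scores one term list at a time,
    then sort the accumulators by descending score."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in INDEX.get(term, []):
            accumulators[doc_id] += weight
    return sorted(accumulators.items(), key=lambda item: -item[1])

print(term_at_a_time(["中国", "文化"]))  # [(1, 3.5), (5, 3.0), (3, 1.5)]
```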
Scoring and Ranking
Beyond Boolean Search
For most users…  LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM is not an option.
Most users would type bill rights or bill of rights as the query. How do we interpret and process such full-text queries? There are no Boolean connectives (AND, OR, NOT), and some query terms need not appear in the result documents.
Users expect the results to come back in some order, with the documents most likely to be useful at the front.
Scoring: density-based
Score each document against the query, and sort by score.
Idea: if a document talks about a topic more, then it is a better match; a document containing many occurrences of the query terms is relevant → term weighting.
Term frequency vectors
Let tf_{t,d} denote the number of occurrences of term t in document d.

       D1  D2  D3  D4  D5  D6  …
中国   11   0   7  13   4   0
文化    2   2   0   0   6   0
日本    0   5   2   0   1   9
留学生  0   1   2   6   0   0
教育    3   0   0   2   0   0
北京   17   0   0   0  11   8

For a free-text query q: Score(q,d) = Σ_{t∈q} tf_{t,d}
Problem of TF scoring
Word order is ignored → positional information index.
Long documents have an advantage → normalize for document length: wf_{t,d} = tf_{t,d} / |d|.
Importance does not grow in proportion to the count: the step from 0 occurrences to 1 means far more than the step from 100 to 101 → smoothing, e.g.
wf_{t,d} = 1 + log tf_{t,d} if tf_{t,d} > 0, and 0 otherwise.
Different words differ in importance: consider the query 日本 的 汉字 丼 → discrimination of terms.
Discrimination of terms
How do we measure how common a term is?
collection frequency (cf): total number of occurrences of the term in the collection.
document frequency (df): number of documents in the collection containing the term.

Word        cf     df
try        10422  8760
insurance  10440  3997
tf x idf term weights
The tf × idf weighting formula combines:
term frequency (tf), or wf — some measure of term density in a doc;
inverse document frequency (idf) — a measure of the term's importance (rarity). The raw form is idf_t = 1/df_t; as with tf, it is usually smoothed:
idf_t = log(N / df_t)
The weight of each term in a document is then:
w_{t,d} = tf_{t,d} × log(N / df_t)
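Combining the log-smoothed wf from the previous slide with idf gives a weight function like this (a sketch using base-10 logs; the N and df values are illustrative):

```python
import math

def tf_idf(tf, df, N):
    """w = (1 + log10 tf) * log10(N / df), and 0 when tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# With N = 1,000,000 documents: one occurrence of a rare term (df = 1,000)
# outweighs ten occurrences of a common one (df = 100,000).
print(tf_idf(1, 1_000, 1_000_000))     # 3.0
print(tf_idf(10, 100_000, 1_000_000))  # 2.0
```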
Documents as vectors
Each document j can be viewed as a vector with one dimension per term, whose components are tf.idf values.
So we have a vector space: terms are axes, docs live in this space. It is high-dimensional: even with stemming, there may be 20,000+ dimensions.

       D1   D2   D3   D4    D5   D6  …
中国   4.1  0.0  3.7  5.9   3.1  0.0
文化   4.5  4.5  0.0  0.0  11.6  0.0
日本   0.0  3.5  2.9  0.0   2.1  3.9
留学生 0.0  3.1  5.1 12.8   0.0  0.0
教育   2.9  0.0  0.0  2.2   0.0  0.0
北京   7.1  0.0  0.0  0.0   4.4  3.8
Intuition
Postulate: documents that are “close together” in the vector space talk about the same things.
(Figure: documents d1–d5 plotted against term axes t1, t2, t3, with angles θ and φ between them.)
Use cases: query-by-example; treating a free-text query as a vector.
Formalizing vector space proximity
First cut: the distance between two points (= the distance between the end points of the two vectors). Euclidean distance? Euclidean distance is a bad idea… because it is large for vectors of different lengths.
Sec. 6.3
Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Cosine similarity
The “closeness” of two vectors d_j and d_k can be measured by the size of the angle between them; concretely, the cosine of the angle is used as the similarity, which normalizes the vectors by length:

sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|) = Σ_{i=1..M} w_{i,j} w_{i,k} / ( sqrt(Σ_{i=1..M} w_{i,j}²) · sqrt(Σ_{i=1..M} w_{i,k}²) )
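The formula translates directly into code (a minimal sketch, applied here to the raw affection/jealous/gossip counts from the novels example):

```python
import math

def cosine(v1, v2):
    """sim(d1, d2) = (d1 · d2) / (|d1| |d2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Sense and Sensibility vs Wuthering Heights term-count vectors:
print(round(cosine([115, 10, 2], [20, 11, 6]), 3))  # 0.889
```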
Example
Docs: Austen's Sense and Sensibility (SaS), Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH).

Term counts:
           SaS  PaP  WH
affection  115   58  20
jealous     10    7  11
gossip       2    0   6

Normalized by vector length:
           SaS    PaP    WH
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254

cos(SaS, PaP) = 0.996×0.993 + 0.087×0.120 + 0.017×0.0 ≈ 0.999
cos(SaS, WH) = 0.996×0.847 + 0.087×0.466 + 0.017×0.254 ≈ 0.889
Notes on Index Structure
How should the normalized tf-idf values be stored? In each postings entry? Storing tf/normalization would cause a space blowup because of the floats.
Usually: tf is stored as an integer (enabling index compression), and the document length and idf are stored only once per document and per term.
tf-idf weighting has many variants
Columns headed ‘n’ are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
Sec. 6.4
Weighting may differ in queries vs documents
Many search engines allow for different weightings for queries vs. documents
SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
A very standard weighting scheme is lnc.ltc.
Document: logarithmic tf (l as first character), no idf, and cosine normalization.
Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization…
A bad idea?
tf-idf example: lnc.ltc
Query: best car insurance.  Document: car insurance auto insurance.

           Query                                Document                  Prod
Term       tf-raw tf-wt df     idf  wt   n'lize  tf-raw tf-wt wt   n'lize
auto       0      0     5000   2.3  0    0       1      1     1    0.52   0
best       1      1     50000  1.3  1.3  0.34    0      0     0    0      0
car        1      1     10000  2.0  2.0  0.52    1      1     1    0.52   0.27
insurance  1      1     1000   3.0  3.0  0.78    2      1.3   1.3  0.68   0.53

Exercise: what is N, the number of docs?
Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
Thus far
We can build an information retrieval system that supports Boolean queries, free-text queries, and ranked results.
IR Evaluation
Measures for a search engine
Speed of index construction: number of documents per hour; document size.
Speed of search: latency as a function of index size; throughput as a function of index size.
Expressiveness of the query language: ability to express complex information needs; speed on complex queries.
These criteria are measurable, but the more crucial measure is user happiness. How can it be measured quantitatively?
Measuring user happiness
Issue: who is the user?
Web engine: the user finds what they want and returns to the engine. Can measure the rate of returning users.
eCommerce site: the user finds what they want and makes a purchase. Is it the end-user, or the eCommerce site, whose happiness we measure? Measure the time to purchase, or the fraction of searchers who become buyers?
Enterprise (company/govt/academic): care about “user productivity”. How much time do my users save when looking for information? Many other criteria have to do with breadth of access, secure access, etc.
Happiness: elusive to measure
Commonest proxy: the relevance of search results. But how do you measure relevance?
Methodology: a test collection, consisting of
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
There is some work on more-than-binary assessments, but they are not the standard.
Evaluating an IR system
Note: the information need is translated into a query, and relevance is assessed relative to the information need, not the query.
E.g., information need: “I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has those words.
Standard Test Collections
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters, CWT100G/CWT200G, etc.
Human experts mark each doc Relevant or Irrelevant for each query — or at least the subset of docs that some system returned for that query.
Unranked retrieval evaluation: Precision and Recall
Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant)

               Relevant  Not Relevant
Retrieved        tp         fp
Not Retrieved    fn         tn

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
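From sets of retrieved and relevant docIDs, both measures are a few lines (a minimal sketch):

```python
def precision_recall(retrieved, relevant):
    """P = tp / |retrieved|, R = tp / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 2 of them among the 6 relevant ones:
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8})
print(p, round(r, 3))  # 0.5 0.333
```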
Accuracy
Given a query, the engine classifies each document as “Relevant” or “Irrelevant”.
The accuracy of the engine is the fraction of these classifications that are correct:
Accuracy = (tp + tn) / (tp + fp + tn + fn)
Is this a very useful evaluation measure in IR?
Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget: return nothing. Almost all documents are irrelevant to any given query, so an engine that retrieves nothing classifies almost every document correctly.
People doing information retrieval want to find something, and have a certain tolerance for junk.
Search for: …
0 matching results found.
Precision and recall when ranked
Extend the set-based definitions to a ranked list: at each document in the ranked list, compute a P/R point. Which of these values are useful?
Consider a P/R point for each relevant document.
Consider values only at fixed rank cutoffs, e.g., precision at rank 20.
Consider values only at fixed recall points, e.g., precision at 20% recall; there may be more than one precision value at a recall point.
Precision and Recall example
Average precision of a query
Often we want a single-number effectiveness measure. Average precision is widely used in IR: average the precision values obtained at each point where recall increases (i.e., at each relevant document retrieved).
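A sketch of that computation — precision is sampled at the rank of each relevant document, and relevant documents never retrieved contribute zero:

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant document's rank."""
    relevant = set(relevant)
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant docs a, b, c retrieved at ranks 1, 3 and 5:
ap = average_precision(["a", "x", "b", "y", "c"], {"a", "b", "c"})
print(round(ap, 3))  # 0.756
```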
Recall/precision graphs
Average precision vs. the P/R graph: AP hides information, but a recall/precision graph has an odd saw-tooth shape if plotted directly, and P/R graphs are hard to compare.
Precision and Recall, toward averaging
Averaging graphs: a false start
How can graphs be averaged? Different queries have different recall values.
What is the precision at 25% recall? Interpolate — but how?
Interpolation of graphs
Possible interpolation methods:
No interpolation — not very useful.
Connect the dots — not a function.
Connect the max, connect the min, connect the average, …
How should 0% recall be handled? Assume 0? Assume the best? A constant start?
How to choose?
A good retrieval system has the property that, on average, its precision decreases as recall increases. This has been verified time and time again (on average).
Interpolate so that the function is monotonically decreasing: going from left to right, take the interpolated precision at recall level R to be the maximum precision observed at any recall level R' ≥ R:
P_interp(R) = max { P' : (R', P') ∈ S, R' ≥ R }
where S is the set of observed (R, P) points. The result is a step function.
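The max-to-the-right rule is a one-liner over the observed points (a sketch; the (R, P) points below are illustrative):

```python
def interpolated_precision(points, r):
    """Max precision at any observed recall level >= r; yields a
    monotonically decreasing step function, defined even at r = 0."""
    candidates = [p for recall, p in points if recall >= r]
    return max(candidates) if candidates else 0.0

OBSERVED = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
print(interpolated_precision(OBSERVED, 0.0))  # 1.0
print(interpolated_precision(OBSERVED, 0.7))  # 0.5
```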
Our example, interpolated this way
The result is monotonically decreasing and handles 0% recall smoothly.
Averaging graphs: using interpolation
Asked: what is the precision at 25% recall? Interpolate the values.
Averaging across queries
Averaging over multiple queries:
Micro-average: each relevant document is one point in the average.
Macro-average: each query is one point in the average.
The average of many queries' average-precision values is called mean average precision (MAP) — “average average precision” would sound weird. MAP is the most common measure.
Interpolated average precision
Average precision at standard recall points: for a given query, compute a P/R point for every relevant doc, then interpolate the precision at standard recall levels.
11-pt is usually 100%, 90%, 80%, …, 10%, 0% (yes, 0% recall); 3-pt is usually 75%, 50%, 25%.
Average over all queries to get the average precision at each recall level, then average the interpolated levels to get a single result, called “interpolated average precision”.
Not used much anymore; MAP (“mean average precision”) is more common, but values at specific interpolated points are still commonly used.
Interpolation and averaging
A combined measure: F
A combined measure of P and R: the F measure (a weighted harmonic mean):

F = 1 / ( α(1/P) + (1 − α)(1/R) ) = (β² + 1) P R / (β² P + R),  with β² = (1 − α)/α

Usually the balanced F1 measure is used (β = 1, i.e., α = ½).
The harmonic mean is a conservative average: it heavily penalizes low values of P or R.
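In the β form the measure reads as follows (a sketch; note how a single low value drags F down):

```python
def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives F1."""
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# The harmonic mean heavily penalizes a low P or R:
print(round(f_measure(0.9, 0.1), 2))  # 0.18 (the arithmetic mean would be 0.5)
```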
Averaging F, example
Q-bad has 1 relevant document, retrieved at rank 1000: (R, P) = (1, 0.001), an F value of 0.2%, so AvgF = 0.2%.
Q-perfect has 10 relevant documents, retrieved at ranks 1-10: (R, P) = (.1, 1), (.2, 1), …, (1, 1), giving F values of 18%, 33%, …, 100%, so AvgF = 66.2%.
Macro average: (0.2% + 66.2%) / 2 = 33.2%
Micro average: (0.2% + 18% + … + 100%) / 11 = 60.2%
Summary of this lecture
Basic index techniques: the inverted index — dictionary & postings.
Scoring and ranking: term weighting with tf·idf; the vector space model; cosine similarity.
IR evaluation: precision, recall, F; interpolation; MAP and interpolated AP.
Thank You!
Q&A
Reading
[1] IIR Ch. 1, 6.2, 6.3, 8.1-8.4
[2] J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia: ACM Press, 1998.