
Page 1: Information Retrieval

Information Retrieval

PengBo, Oct 28, 2010

Page 2: Information Retrieval


Page 3: Information Retrieval


Page 4: Information Retrieval

Today's Outline

- Introduction to Information Retrieval
- Index Techniques
- Scoring and Ranking
- Evaluation

Page 5: Information Retrieval

Basic Index Techniques

Page 6: Information Retrieval

Document Collection

site:pkunews.pku.edu.cn: Baidu reports 12,800 pages; Google reports 6,820 pages.

Page 7: Information Retrieval

User Information Need

Within this news site, find articles that talk about the culture of China and Japan, and don't talk about students studying abroad.

QUERY: "中国 日本 文化 -留学生"

中国 日本 文化 -留学生 site:pkunews.pku.edu.cn: Baidu reports 38 results; Google reports 361 results.

Page 8: Information Retrieval

How to do it?

String matching, e.g., grep over all web pages: find the pages containing "中国", "文化" and "日本", then remove those containing "留学生"?

- Slow (for large corpora)
- NOT "留学生" is non-trivial
- Other operations (e.g., find "中国" NEAR "日本") are not feasible

Page 9: Information Retrieval

Document Representation

Bag of words model: the document-term incidence matrix

        中国  文化  日本  留学生  教育  北京  ...
D1       1     1     0     0      1     1
D2       0     1     1     1      0     0
D3       1     0     1     1      0     0
D4       1     0     0     1      1     0
D5       1     1     1     0      0     1
D6       0     0     1     0      0     1
...

1 if the page contains the word, 0 otherwise.

Page 10: Information Retrieval

Incidence Vector

        D1  D2  D3  D4  D5  D6  ...
中国     1   0   1   1   1   0
文化     1   1   0   0   1   0
日本     0   1   1   0   1   1
留学生   0   1   1   1   0   0
教育     1   0   0   1   0   0
北京     1   0   0   0   1   1

Transpose: transposing the document-term matrix yields the term-document incidence matrix. Each term corresponds to a 0/1 vector, its incidence vector.

Page 11: Information Retrieval

Retrieval

Information Need: within this news site, find articles that talk about the culture of China and Japan, and don't talk about students abroad.

To answer the query: take the term vectors for "中国", "文化", "日本", and "留学生" (complemented), then bitwise AND them:

101110 AND 110010 AND 011011 AND 100011 = 000010

Page 12: Information Retrieval

Answer: D5

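To make the mechanics concrete, here is a minimal Python sketch of this incidence-vector retrieval; the bit vectors are taken from the matrix on page 10, with bit i (counting from the left) standing for document Di:

```python
# A sketch of Boolean retrieval over term incidence vectors.
vectors = {
    "中国":   0b101110,
    "文化":   0b110010,
    "日本":   0b011011,
    "留学生": 0b011100,
}
ALL = 0b111111  # mask covering the six documents

def retrieve(include, exclude):
    """AND the vectors of required terms with the complements
    of the vectors of excluded terms."""
    result = ALL
    for term in include:
        result &= vectors[term]
    for term in exclude:
        result &= ALL & ~vectors[term]
    return result

hits = retrieve(["中国", "文化", "日本"], ["留学生"])
print(f"{hits:06b}")                                         # 000010
print([f"D{i+1}" for i in range(6) if hits >> (5 - i) & 1])  # ['D5']
```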

Page 13: Information Retrieval

Let’s build a search system!

Consider the scale of the system:

- Number of documents: N = 1 million, each with about 1K terms
- At an average of 6 bytes/term, that is about 6 GB of data in the documents
- Number of distinct terms: M = 500K

How big is this matrix? 500K x 1M, and extremely sparse: no more than one billion 1's. What's a better representation?
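To see the sparsity in numbers: the matrix has 500K x 1M = 5 x 10^11 cells, roughly 60 GB even at one bit per cell, yet 1M documents of about 1K terms each can produce at most 10^9 distinct (term, document) pairs, so fewer than one cell in 500 can be a 1. Recording only the 1's is exactly the idea behind the inverted index that follows.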

Page 14: Information Retrieval

In 1875, Mary Cowden Clarke compiled a concordance to the works of Shakespeare. In the preface she proudly wrote that she had "dedicated a reliable guide to the treasure-house of wisdom..., hoping that these sixteen years of hard work have not fallen short of that ideal...".

In 1911, Professor Lane Cooper published a concordance to the poems of William Wordsworth. It took 7 months and 67 people, working with tools such as index cards, scissors, glue, and stamps.

By 1965, a computer could compile such material in just a few days, and do a better job...

Page 15: Information Retrieval

Inverted index

For each term T: store the list of (IDs of) the documents that contain T.

Dictionary        Postings
中国       →      2 4 8 16 32 64 128
文化       →      1 2 3 5 8 13 21 34
留学生     →      13 16

Sorted by docID (more later on why).

Page 16: Information Retrieval

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.

↓ Tokenizer

Token stream: Friends Romans Countrymen

↓ Linguistic modules

Modified tokens: friend roman countryman

↓ Indexer

Inverted index:
friend      → 2 4
roman       → 1 2
countryman  → 13 16

Page 17: Information Retrieval

Indexer steps. Output: a sequence of <modified token, document ID> tuples.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term/Doc# pairs, in order of occurrence: I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2.

Page 18: Information Retrieval

Sort by terms: the core indexing step.

Sorted Term/Doc# pairs: ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2.

Page 19: Information Retrieval

Merge multiple occurrences within the same document, and add the term's frequency information.

Term Doc# Freq: ambitious 2 1, be 2 1, brutus 1 1, brutus 2 1, capitol 1 1, caesar 1 1, caesar 2 2, did 1 1, enact 1 1, hath 2 1, I 1 2, i' 1 1, it 2 1, julius 1 1, killed 1 2, let 2 1, me 1 1, noble 2 1, so 2 1, the 1 1, the 2 1, told 2 1, you 2 1, was 1 1, was 2 1, with 2 1.

Page 20: Information Retrieval

The result is split into a Dictionary file and a Postings file.

Dictionary (Term, N docs, Tot Freq): ambitious 1 1, be 1 1, brutus 2 2, capitol 1 1, caesar 2 3, did 1 1, enact 1 1, hath 1 1, I 1 2, i' 1 1, it 1 1, julius 1 1, killed 1 2, let 1 1, me 1 1, noble 1 1, so 1 1, the 2 2, told 1 1, you 1 1, was 2 2, with 1 1.

Postings (Doc#, Freq), one run per dictionary entry: (2,1) (2,1) (1,1) (2,1) (1,1) (1,1) (2,2) (1,1) (1,1) (2,1) (1,2) (1,1) (2,1) (1,1) (1,2) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (2,1) (1,1) (2,1) (2,1).

Why split?
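A compact sketch of the indexer steps on pages 17-20, using the two example documents; the tokenizer is deliberately naive (whitespace split plus punctuation stripping), so this is an illustration rather than a production analyzer:

```python
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: emit <modified token, docID> pairs (naive tokenizing + lowercasing).
pairs = [(tok.strip(".,;").lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# Step 2: sort by term, then docID -- the core indexing step.
pairs.sort()

# Step 3: collapse multiple occurrences within a document into a frequency.
tf = Counter(pairs)  # (term, docID) -> tf

# Step 4: split into a dictionary (term -> df) and postings of (docID, tf).
postings = {}
for (term, doc_id), freq in sorted(tf.items()):
    postings.setdefault(term, []).append((doc_id, freq))
dictionary = {term: len(plist) for term, plist in postings.items()}

print(dictionary["caesar"], postings["caesar"])  # 2 [(1, 1), (2, 2)]
```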

Page 21: Information Retrieval

Boolean Query processing

Query: 中国 AND 文化

1. Look up 中国 in the Dictionary; retrieve its postings.
2. Look up 文化 in the Dictionary; retrieve its postings.
3. "Merge" (AND) the two postings lists:

中国 → 2 4 8 16 32 64 128
文化 → 1 2 3 5 8 13 21 34

Page 22: Information Retrieval

The merge: the algorithm for intersecting the two lists, walking through both postings simultaneously:

中国 → 2 4 8 16 32 64 128
文化 → 1 2 3 5 8 13 21 34

Result: 2 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
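A minimal sketch of this linear-time merge, applied to the two postings lists from the slide:

```python
def intersect(p1, p2):
    """Merge two postings lists sorted by docID in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128],
                [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```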

Page 23: Information Retrieval

Boolean queries: Exact match

Queries using AND, OR and NOT together with query terms

Primary commercial retrieval tool for 3 decades.

Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.

Page 24: Information Retrieval

Example: WestLaw

Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

About 7 terabytes of data; 700,000 users. The majority of users still use Boolean queries.

Example query: What is the statute of limitations in cases involving the federal tort claims act?
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

Characteristics: long, precise queries; proximity operators; incrementally developed; not like web search.

Page 25: Information Retrieval

Beyond Boolean term search

Phrases: find "Bill Gates", not "Bill and Gates".
Word proximity: find Gates NEAR Microsoft.
Zones within documents: find documents with (author = Ullman) AND (text contains automata).

Solution: record each term's field property, and record each term's position information within docs.

Page 26: Information Retrieval

LAST COURSE REVIEW


Page 27: Information Retrieval

Bag of words model

Vector representation doesn't consider the ordering of words in a document: "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.

This is called the bag of words model. In a sense, this is a step back: the positional index was able to distinguish these two documents. We will look at "recovering" positional information later in this course.

For now: bag of words model.

Page 28: Information Retrieval

Inverted index

For each term T: store the list of (IDs of) the documents that contain T.

Dictionary        Postings
中国       →      2 4 8 16 32 64 128
文化       →      1 2 3 5 8 13 21 34
留学生     →      13 16

Sorted by docID (more later on why).

Page 29: Information Retrieval

Simple Inverted Index

Page 30: Information Retrieval

Inverted Index with counts: supports better ranking algorithms

Page 31: Information Retrieval

Inverted Index with positions: supports proximity matches

Page 32: Information Retrieval

Query Processing

Document-at-a-time: calculates complete scores for documents by processing all term lists, one document at a time.

Term-at-a-time: accumulates scores for documents by processing term lists one at a time.

Both approaches have optimization techniques that significantly reduce the time required to generate scores; a term-at-a-time sketch follows.
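A minimal term-at-a-time sketch; the tiny index and its weights here are made up for illustration (the weight stands in for tf or tf-idf):

```python
from collections import defaultdict

def term_at_a_time(query_terms, index):
    """Accumulate document scores one term list at a time.
    `index` maps term -> list of (docID, weight) postings."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Highest score first.
    return sorted(accumulators.items(), key=lambda kv: -kv[1])

index = {
    "中国": [(1, 2.0), (3, 1.0)],
    "文化": [(1, 1.5), (2, 0.5)],
}
print(term_at_a_time(["中国", "文化"], index))
# [(1, 3.5), (3, 1.0), (2, 0.5)]
```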

Page 33: Information Retrieval

Document-At-A-Time

Page 34: Information Retrieval

Term-At-A-Time

Page 35: Information Retrieval

Scoring and Ranking

Page 36: Information Retrieval

Beyond Boolean Search

For most users, queries like LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM are out of reach.

Most users would type bill rights or bill of rights as the query. How should such full-text queries be interpreted and processed?

- There are no Boolean connectives such as AND, OR, NOT
- A query term need not appear in a result document
- Users expect results returned in some order, with the documents most likely to be useful at the front

Page 37: Information Retrieval

Scoring: density-based

Score documents against the query, and sort by score.

Idea: if a document talks about a topic more, then it is a better match; a document containing many occurrences of the query terms is relevant → term weighting.

Page 38: Information Retrieval

Term frequency vectors

Consider the number of occurrences of term t in document d, written $tf_{t,d}$:

        D1  D2  D3  D4  D5  D6  ...
中国    11   0   7  13   4   0
文化     2   2   0   0   6   0
日本     0   5   2   0   1   9
留学生   0   1   2   6   0   0
教育     3   0   0   2   0   0
北京    17   0   0   0  11   8

For a free-text query q: $Score(q,d) = \sum_{t \in q} tf_{t,d}$

Page 39: Information Retrieval

Problem of TF scoring

- Word order is not distinguished → positional information index
- Long documents have an advantage → normalize for document length: $wf_{t,d} = tf_{t,d} / |d|$
- The importance of an occurrence is not proportional to its count (going from 0 occurrences to 1 means far more than going from 100 to 101) → smoothing:

$$wf_{t,d} = \begin{cases} 1 + \log tf_{t,d} & \text{if } tf_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$

- Different words differ in importance (consider the query 日本 的 汉字 丼) → discrimination of terms

Page 40: Information Retrieval

Discrimination of terms

How do we measure how common a term is?

- collection frequency (cf): total number of occurrences of the term in the collection
- document frequency (df): number of documents in the collection that contain the term

Word        cf     df
try         10422  8760
insurance   10440  3997

Page 41: Information Retrieval

tf x idf term weights

The tf x idf weighting formula combines:

- term frequency (tf), or wf: some measure of term density in a doc
- inverse document frequency (idf): expressing the importance (rarity) of a term. The raw value is $idf_t = 1/df_t$; as before, it is usually smoothed with a log:

$$idf_t = \log \frac{N}{df_t}$$

The tf.idf weight is computed for each term in a document:

$$w_{t,d} = tf_{t,d} \times \log(N / df_t)$$
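A one-function sketch of this weight; base-10 logarithms are assumed here (page 49 notes the base is immaterial), and the example numbers are made up:

```python
import math

def tf_idf(tf, df, N):
    """w_{t,d} = tf_{t,d} * log10(N / df_t), the weight from this slide."""
    return tf * math.log10(N / df)

# e.g. a term occurring 3 times in a doc, in 1,000 of N = 1,000,000 docs:
print(round(tf_idf(tf=3, df=1000, N=1_000_000), 2))  # 9.0
```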

Page 42: Information Retrieval

Documents as vectors

Each document j can be viewed as a vector with one dimension per term, whose value is the tf.idf weight.

So we have a vector space: terms are axes, and docs live in this space. It is high-dimensional: even with stemming, there may be 20,000+ dimensions.

        D1   D2   D3   D4    D5    D6  ...
中国    4.1  0.0  3.7  5.9   3.1   0.0
文化    4.5  4.5  0    0     11.6  0
日本    0    3.5  2.9  0     2.1   3.9
留学生  0    3.1  5.1  12.8  0     0
教育    2.9  0    0    2.2   0     0
北京    7.1  0    0    0     4.4   3.8

Page 43: Information Retrieval

Intuition

Postulate: documents that are "close together" in the vector space talk about the same things.

[Figure: document vectors d1 ... d5 in the term space t1, t2, t3, with angles θ and φ between pairs of them.]

Use cases: query-by-example; a free-text query treated as a vector.

Page 44: Information Retrieval

Formalizing vector space proximity

First cut: distance between two points (= distance between the end points of the two vectors). Euclidean distance? Euclidean distance is a bad idea... because Euclidean distance is large for vectors of different lengths.

Sec. 6.3

Page 45: Information Retrieval

Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Sec. 6.3

Page 46: Information Retrieval

Cosine similarity

The "closeness" of vectors d1 and d2 can be measured by the angle between them; concretely, the cosine of the angle θ is used as the vector similarity. Vectors are normalized by length:

$$|\vec{d_j}| = \sqrt{\sum_{i=1}^{M} w_{i,j}^2}$$

$$sim(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{M} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{M} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{M} w_{i,k}^2}}$$

[Figure: document vectors d1 and d2 at angle θ in the term space t1, t2, t3.]
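A direct transcription of the cosine formula into Python; the usage lines reuse the raw term counts from the example on the next slide:

```python
import math

def cosine_sim(d1, d2):
    """cos(theta) = dot(d1, d2) / (|d1| * |d2|)."""
    dot = sum(x * y for x, y in zip(d1, d2))
    return dot / (math.sqrt(sum(x * x for x in d1)) *
                  math.sqrt(sum(x * x for x in d2)))

# Raw term counts (affection, jealous, gossip) from the next slide:
sas = [115, 10, 2]   # Sense and Sensibility
pap = [58, 7, 0]     # Pride and Prejudice
wh  = [20, 11, 6]    # Wuthering Heights
print(round(cosine_sim(sas, pap), 3))  # 0.999
print(round(cosine_sim(sas, wh), 3))   # 0.888
```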

Page 47: Information Retrieval

Example

Docs: Austen's Sense and Sensibility (SaS) and Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH).

Term counts:
           SaS   PaP   WH
affection  115   58    20
jealous    10    7     11
gossip     2     0     6

Length-normalized:
           SaS    PaP    WH
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254

cos(SaS, PaP) = 0.996 x 0.993 + 0.087 x 0.120 + 0.017 x 0.0 ≈ 0.999
cos(SaS, WH) = 0.996 x 0.847 + 0.087 x 0.466 + 0.017 x 0.254 ≈ 0.888

Page 48: Information Retrieval

Notes on Index Structure

How should the normalized tf-idf values be stored? In each postings entry? Storing tf/normalization there means a space blowup because of floats.

Typically: tf is stored as an integer (index compression); the document length and idf need only be stored once each, per document and per term respectively.

Page 49: Information Retrieval

tf-idf weighting has many variants

Columns headed ‘n’ are acronyms for weight schemes.

Why is the base of the log in idf immaterial?

Sec. 6.4

Page 50: Information Retrieval

Weighting may differ in queries vs documents

Many search engines allow for different weightings for queries vs. documents

SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table

A very standard weighting scheme is lnc.ltc:

- Document: logarithmic tf (l as first character), no idf, and cosine normalization
- Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization ...

A bad idea?

Sec. 6.4

Page 51: Information Retrieval

tf-idf example: lnc.ltc

Query: best car insurance. Document: car insurance auto insurance.

           Query                                     Document                      Prod
Term       tf-raw  tf-wt  df     idf  wt   n'lize    tf-raw  tf-wt  wt   n'lize
auto       0       0      5000   2.3  0    0         1       1      1    0.52     0
best       1       1      50000  1.3  1.3  0.34      0       0      0    0        0
car        1       1      10000  2.0  2.0  0.52      1       1      1    0.52     0.27
insurance  1       1      1000   3.0  3.0  0.78      2       1.3    1.3  0.68     0.53

Doc length = $\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92$

Score = 0 + 0 + 0.27 + 0.53 = 0.8

Exercise: what is N, the number of docs?

Sec. 6.4
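A sketch that reproduces this table's score; N = 1,000,000 is an assumption, chosen because it makes idf(auto) = log10(N/5000) ≈ 2.3 consistent with the table (and with the collection size on page 13), which also answers the exercise:

```python
import math

def lnc(tfs):
    """Document side: logarithmic tf, no idf, cosine normalization."""
    w = [1 + math.log10(tf) if tf > 0 else 0.0 for tf in tfs]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

def ltc(tfs, dfs, N):
    """Query side: logarithmic tf, idf, cosine normalization."""
    w = [(1 + math.log10(tf)) * math.log10(N / df) if tf > 0 else 0.0
         for tf, df in zip(tfs, dfs)]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

# Terms in order: auto, best, car, insurance
q = ltc([0, 1, 1, 1], [5000, 50000, 10000, 1000], N=1_000_000)
d = lnc([1, 0, 1, 2])
print(round(sum(qi * di for qi, di in zip(q, d)), 2))  # 0.8
```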

Page 52: Information Retrieval

Thus far

We can now build an Information Retrieval system that supports:

- Boolean queries
- Free-text queries
- Ranked results

Page 53: Information Retrieval

IR Evaluation

Page 54: Information Retrieval

Measures for a search engine

- Speed of index construction: number of documents/hour; document size
- Search speed: latency as a function of index size; throughput as a function of index size
- Expressiveness of the query language: ability to express complex information needs; speed on complex queries

All these criteria are measurable, but the more critical measure is user happiness. How can it be measured quantitatively?

Page 55: Information Retrieval

Measuring user happiness

Issue: who is the user?

- Web engine: users find what they want and return to the engine; can measure the rate of return users.
- eCommerce site: users find what they want and make a purchase. Is it the end-user, or the eCommerce site, whose happiness we measure? Measure time to purchase, or the fraction of searchers who become buyers?
- Enterprise (company/govt/academic): care about "user productivity". How much time do my users save when looking for information? Many other criteria having to do with breadth of access, secure access, etc.

Page 56: Information Retrieval

Happiness: elusive to measure

Commonest proxy: relevance of search results. But how do you measure relevance?

Methodology: a test collection, consisting of
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair

Some work uses more-than-binary assessments, but they are not the standard.

Page 57: Information Retrieval

Evaluating an IR system

Note: the information need is translated into a query; relevance is assessed relative to the information need, not the query.

E.g., information need: "I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
Query: wine red white heart attack effective

You evaluate whether the doc addresses the information need, not whether it has those words.

Page 58: Information Retrieval

Standard Test Collections

TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years. Also Reuters, CWT100G/CWT200G, etc.

Human experts mark, for each query and each doc, Relevant or Irrelevant, or at least for the subset of docs that some system returned for that query.

Page 59: Information Retrieval

Unranked retrieval evaluation:Precision and Recall

Precision: the fraction of retrieved documents that are relevant = P(relevant|retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved|relevant)

                Relevant  Not Relevant
Retrieved       tp        fp
Not Retrieved   fn        tn

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
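A set-based sketch of the two measures; the document IDs are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall via the contingency table."""
    tp = len(retrieved & relevant)          # relevant AND retrieved
    return tp / len(retrieved), tp / len(relevant)

# 4 docs retrieved, 5 relevant overall, 3 in common:
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 9}))  # (0.75, 0.6)
```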

Page 60: Information Retrieval

Accuracy

Given a query, the engine classifies each document as "Relevant" or "Irrelevant".

Accuracy of an engine: the fraction of these classifications that are correct.

Accuracy = (tp + tn) / (tp + fp + tn + fn)

Is this a very useful evaluation measure in IR?

                Relevant  Not Relevant
Retrieved       tp        fp
Not Retrieved   fn        tn

Page 61: Information Retrieval

Why not just use accuracy?

How to build a 99.9999% accurate search engine on a low budget: return nothing for every query. Since almost every document is irrelevant, tn dominates and accuracy is nearly perfect.

People doing information retrieval want to find something and have a certain tolerance for junk.

Search for:

0 matching results found.

Page 62: Information Retrieval

Precision and recall when ranked

Extend the set-based definitions to a ranked list: compute a P/R point at each document in the ranked list. Which of these values are useful?

- Consider a P/R point for each relevant document
- Consider the value only at fixed rank cutoffs, e.g., precision at rank 20
- Consider the value only at fixed recall points, e.g., precision at 20% recall; there may be more than one precision value at a recall point

Page 63: Information Retrieval

Precision and Recall example


Page 64: Information Retrieval

Average precision of a query

Often we want a single-number effectiveness measure. Average precision is widely used in IR: calculate it by averaging the precision values obtained each time recall increases, i.e., at each relevant document in the ranking, as in the sketch below.
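A minimal sketch; the ranking and relevant set are made up for illustration:

```python
def average_precision(ranking, relevant):
    """Average the precision measured each time recall increases,
    i.e. at each relevant document in the ranked list."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Relevant documents at ranks 1, 3 and 6:
print(round(average_precision(list("abcdef"), {"a", "c", "f"}), 3))
# (1/1 + 2/3 + 3/6) / 3 = 0.722
```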

Page 65: Information Retrieval

Recall/precision graphs

Average precision vs. the P/R graph:

- AP hides information
- The recall/precision graph has an odd saw-tooth shape if drawn directly
- But P/R graphs are hard to compare

Page 66: Information Retrieval

Precision and Recall, toward averaging


Page 67: Information Retrieval

Averaging graphs: a false start

How can graphs be averaged? Different queries have different recall values.

What is precision at 25% recall? Interpolate; but how?

Page 68: Information Retrieval

Interpolation of graphs

Possible interpolation methods:

- No interpolation: not very useful
- Connect the dots: not a function
- Connect max, connect min, connect average, ...

How should 0% recall be handled? Assume 0? Assume best? A constant start?

Page 69: Information Retrieval

How to choose?

A good retrieval system has this property: on average, as recall increases, its precision decreases. This has been verified time and time again (on average).

Interpolate so that the function becomes monotonically decreasing. For example, sweeping from left to right, take the maximum of the precision values to the right as the interpolated value:

$$P_{interp}(R) = \max \{ P' \mid (R', P') \in S,\ R' \ge R \}$$

where S is the set of observed (R, P) points. The result is a step function, as in the sketch below.
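A short sketch of this step-function interpolation, over made-up (R, P) points:

```python
def interpolated_precision(points, recall):
    """Max precision observed at any recall level >= the requested one."""
    return max((p for r, p in points if r >= recall), default=0.0)

# Observed (recall, precision) points for one query:
observed = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
print(interpolated_precision(observed, 0.25))  # 0.67
```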

Page 70: Information Retrieval

Our example, interpolated this way

The result is monotonically decreasing, and handles 0% recall smoothly.

Page 71: Information Retrieval

Averaging graphs: using interpolation

Asked: what is precision at 25% recall?

Interpolate values


Page 72: Information Retrieval

Averaging across queries

Averaging across multiple queries:

- Micro-average: each relevant document is a point used to compute the average
- Macro-average: each query is a point used to compute the average
- The average of many queries' average precision values is called mean average precision (MAP): the most common choice ("average average precision" sounds weird)

Page 73: Information Retrieval

Interpolated average precision

Average precision at standard recall points. For a given query:

- Compute a P/R point for every relevant doc
- Interpolate precision at standard recall levels: 11-pt is usually 100%, 90, 80, ..., 10, 0% (yes, 0% recall); 3-pt is usually 75%, 50%, 25%
- Average over all queries to get the average precision at each recall level
- Average over the interpolated recall levels to get a single result

Called "interpolated average precision". Not used much anymore; MAP ("mean average precision") is more common, but values at specific interpolated points are still commonly used.

Page 74: Information Retrieval

Interpolation and averaging


Page 75: Information Retrieval

A combined measure: F

A combined P/R metric: the F measure (a weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \qquad \beta^2 = \frac{1 - \alpha}{\alpha}$$

Usually the balanced F1 measure is used (β = 1, i.e., α = 1/2):

$$F_1 = \frac{2 P R}{P + R}$$

The harmonic mean is a conservative average; it heavily penalizes low values of P or R.

Page 76: Information Retrieval

Averaging F, example

Q-bad has 1 relevant document Retrieved at rank 1000 (R P) = (1, 0.001) F value of 0.2%, so AvgF = 0.2%

Q-perfect has 10 relevant documents Retrieved at ranks 1-10 (R,P) = (.1,1), (.2,1), …, (1,1) F values of 18%, 33%, …, 100%, so AvgF = 66.2%

Macro average (0.2% + 66.2%) / 2 = 33.2%

Micro average (0.2% + 18% + … 100%) / 11 = 60.2%

76
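As a quick check of these F values against the balanced formula: for Q-bad, $F_1 = \frac{2 \times 0.001 \times 1}{0.001 + 1} \approx 0.002 = 0.2\%$; for Q-perfect's first relevant document, $F_1 = \frac{2 \times 1 \times 0.1}{1 + 0.1} \approx 0.18 = 18\%$, for its second $\frac{2 \times 1 \times 0.2}{1.2} \approx 33\%$, rising to 100% at rank 10.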

Page 77: Information Retrieval

Summary of This Lecture

- Basic index techniques: inverted index; dictionary & postings
- Scoring and ranking: term weighting tf·idf; vector space model; cosine similarity
- IR evaluation: precision, recall, F; interpolation; MAP, interpolated AP

Page 78: Information Retrieval

Thank You!

Q&A

Page 79: Information Retrieval

Reading Materials

[1] IIR Ch. 1, 6.2, 6.3, 8.1-8.4
[2] J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia: ACM Press, 1998.