Information Retrieval
PengBo, Oct 28, 2010
Outline of this lecture
Introduction to Information Retrieval
Index Techniques
Scoring and Ranking
Evaluation
Basic Index Techniques
Document Collection
site:pkunews.pku.edu.cn — Baidu reports 12,800 pages; Google reports 6,820 pages
User Information Need
Within this news site, find articles that talk about the culture of China and Japan, and don't talk about students abroad.
QUERY: “中国 日本 文化 —留学生”
中国 日本 文化 -留学生 site:pkunews.pku.edu.cn — Baidu reports 38 results; Google reports 361 results
How to do it?
String matching, e.g., grep over all web pages: find the pages containing 中国, 文化 and 日本, then remove those containing 留学生?
Slow (for large corpora)
NOT 留学生 is non-trivial
Other operations (e.g., find 中国 NEAR 日本) are not feasible
Document Representation
Bag of words model. Document-term incidence matrix:

     中国  文化  日本  留学生  教育  北京  …
D1    1    1    0    0     1    1
D2    0    1    1    1     0    0
D3    1    0    1    1     0    0
D4    1    0    0    1     1    0
D5    1    1    1    0     0    1
D6    0    0    1    0     0    1

1 if the page contains the word, 0 otherwise.
Incidence Vector
       D1  D2  D3  D4  D5  D6  …
中国    1   0   1   1   1   0
文化    1   1   0   0   1   0
日本    0   1   1   0   1   1
留学生  0   1   1   1   0   0
教育    1   0   0   1   0   0
北京    1   0   0   0   1   1

Transpose: transposing the document-term matrix gives the term-document incidence matrix; each term then corresponds to a 0/1 vector, its incidence vector.
Retrieval
Information Need: within this news site, find articles that talk about the culture of China and Japan, and don't talk about students abroad.
To answer the query: take the term vectors of 中国, 文化, 日本 and 留学生 (complemented), then bitwise AND them:
101110 AND 110010 AND 011011 AND 100011 = 000010
i.e., document D5.
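The bitwise AND above can be sketched in a few lines of Python (a minimal illustration; the term vectors are copied from the incidence matrix on the previous slide):

```python
# Term-document incidence vectors for D1..D6, copied from the slide's matrix.
VECTORS = {
    "中国":   [1, 0, 1, 1, 1, 0],
    "文化":   [1, 1, 0, 0, 1, 0],
    "日本":   [0, 1, 1, 0, 1, 1],
    "留学生": [0, 1, 1, 1, 0, 0],
}

def boolean_and(include, exclude):
    """AND the vectors of `include` terms with the complemented
    vectors of `exclude` terms; return the matching doc names."""
    n = len(next(iter(VECTORS.values())))
    result = [1] * n
    for term in include:
        result = [a & b for a, b in zip(result, VECTORS[term])]
    for term in exclude:
        result = [a & (1 - b) for a, b in zip(result, VECTORS[term])]
    return [f"D{i + 1}" for i, bit in enumerate(result) if bit]

print(boolean_and(["中国", "文化", "日本"], ["留学生"]))  # ['D5']
```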
Let’s build a search system!
Consider the scale of the system: N = 1 million documents, each with about 1K terms; at roughly 6 bytes/term, that is about 6 GB of text. Number of distinct terms: M = 500K.
How big is the matrix? 500K × 1M, and extremely sparse: no more than one billion 1's. What's a better representation?
In 1875, Mary Cowden Clarke compiled a concordance to the works of Shakespeare. In the preface she proudly wrote that she had contributed “a reliable guide to the treasure-house of wisdom…”, hoping that the sixteen years of hard work had lived up to that ideal.
In 1911, Professor Lane Cooper published a concordance to the poems of William Wordsworth. It took 7 months and 67 people, working with tools such as index cards, scissors, glue and stamps.
By 1965, a computer could compile such material in a few days, and do it better…
Inverted index
For each term T, store the list of (the IDs of) the documents that contain T:

中国   → 2 4 8 16 32 64 128
文化   → 1 2 3 5 8 13 21 34
留学生 → 13 16

(Dictionary → Postings.) Sorted by docID (more later on why).
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer → token stream: Friends Romans Countrymen
Linguistic modules → modified tokens: friend roman countryman
Indexer → inverted index:
friend     → 2 4
roman      → 1 2
countryman → 13 16
Output: a sequence of <modified token, document ID> pairs.
Indexer steps
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
<term, docID> pairs: I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
Core indexing step: sort the pairs by term.
Sorted: ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2
Merge multiple occurrences within a document and add term frequency information.
<term, docID, freq>: ambitious 2 1; be 2 1; brutus 1 1; brutus 2 1; capitol 1 1; caesar 1 1; caesar 2 2; did 1 1; enact 1 1; hath 2 1; I 1 2; i' 1 1; it 2 1; julius 1 1; killed 1 2; let 2 1; me 1 1; noble 2 1; so 2 1; the 1 1; the 2 1; told 2 1; you 2 1; was 1 1; was 2 1; with 2 1
The result is split into a Dictionary file and a Postings file.
Dictionary <term, #docs, total freq>: ambitious 1 1; be 1 1; brutus 2 2; capitol 1 1; caesar 2 3; did 1 1; enact 1 1; hath 1 1; I 1 2; i' 1 1; it 1 1; julius 1 1; killed 1 2; let 1 1; me 1 1; noble 1 1; so 1 1; the 2 2; told 1 1; you 1 1; was 2 2; with 1 1
Postings <docID, freq>: 2 1; 2 1; 1 1; 2 1; 1 1; 1 1; 2 2; 1 1; 1 1; 2 1; 1 2; 1 1; 2 1; 1 1; 1 2; 2 1; 1 1; 2 1; 2 1; 1 1; 2 1; 2 1; 2 1; 1 1; 2 1; 2 1
Why split?
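The whole pipeline — emit pairs, sort, merge into a dictionary and postings — fits in a few lines of Python (a sketch; tokenization here is just lowercased whitespace splitting):

```python
from collections import Counter, defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: emit <term, docID> pairs.  Step 2: sort by term, then docID.
pairs = sorted((token.lower(), doc_id)
               for doc_id, text in docs.items()
               for token in text.split())

# Step 3: merge duplicates into per-document frequencies, then split the
# result into a dictionary (term -> df) and postings (term -> [(docID, tf)]).
postings = defaultdict(list)
for (term, doc_id), tf in sorted(Counter(pairs).items()):
    postings[term].append((doc_id, tf))
dictionary = {term: len(plist) for term, plist in postings.items()}

print(dictionary["caesar"], postings["caesar"])  # 2 [(1, 1), (2, 2)]
```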
Boolean Query processing
Query: 中国 AND 文化
Look up 中国 in the dictionary; retrieve its postings.
Look up 文化 in the dictionary; retrieve its postings.
“Merge” (AND) the two postings lists:
中国 → 2 4 8 16 32 64 128
文化 → 1 2 3 5 8 13 21 34
The merge
The algorithm for intersecting the two lists:
中国 → 2 4 8 16 32 64 128
文化 → 1 2 3 5 8 13 21 34
Result: 2 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
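The two-pointer merge can be sketched as follows (a minimal version; real systems add optimizations such as skip pointers):

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID; O(x + y) because
    each comparison advances at least one of the two pointers."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# The slide's example lists:
print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```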
Boolean queries: Exact match
Queries using AND, OR and NOT together with query terms
Primary commercial retrieval tool for 3 decades.
Professional searchers (e.g., Lawyers) still like Boolean queries:
You know exactly what you’re getting.
Example: WestLaw
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
About 7 terabytes of data; 700,000 users. The majority of users still use Boolean queries.
Example query: What is the statute of limitations in cases involving the federal
tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
Characteristics: long, precise queries; proximity operators; incrementally developed; not like web search.
Beyond Boolean term search
短语 phrase : Find “Bill Gates” , not “Bill and Gates”
词的临近关系 Proximity: Find Gates NEAR Microsoft.
文档中的区域限定 : Find documents with (author = Ullman) AND
(text contains automata). Solution :
记录 term 的 field property 记录 term 在 docs 中的 position information.
25
REVIEW OF THE LAST LECTURE
Bag of words model
Vector representation doesn’t consider the ordering of words in a document
“John is quicker than Mary” and “Mary is quicker than John” have the same vectors
This is called the bag of words model. In a sense, this is a step back: The
positional index was able to distinguish these two documents.
We will look at “recovering” positional information later in this course.
For now: bag of words model
Inverted index
For each term T, store the list of (the IDs of) the documents that contain T:

中国   → 2 4 8 16 32 64 128
文化   → 1 2 3 5 8 13 21 34
留学生 → 13 16

(Dictionary → Postings.) Sorted by docID (more later on why).
Simple inverted index
Inverted index with counts: supports better ranking algorithms
Inverted index with positions: supports proximity matches
Query Processing
Document-at-a-time: calculates complete scores for documents by processing all term lists, one document at a time.
Term-at-a-time: accumulates scores for documents by processing term lists one at a time.
Both approaches have optimization techniques that significantly reduce the time required to generate scores.
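Term-at-a-time scoring can be sketched with a dictionary of score accumulators (the tiny index and its weights below are made-up illustration data):

```python
from collections import defaultdict

# Hypothetical index: term -> postings list of (docID, precomputed weight).
INDEX = {
    "中国": [(1, 2.0), (3, 1.5), (5, 1.0)],
    "文化": [(1, 1.5), (5, 2.0)],
}

def term_at_a_time(query_terms):
    """Accumulate document scores one term list at a time,
    then sort the accumulators by descending score."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in INDEX.get(term, []):
            accumulators[doc_id] += weight
    return sorted(accumulators.items(), key=lambda item: -item[1])

print(term_at_a_time(["中国", "文化"]))  # [(1, 3.5), (5, 3.0), (3, 1.5)]
```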
Scoring and Ranking
Beyond Boolean Search
For most users…  LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM is not an option.
Most users would type bill rights or bill of rights as the query. How do we interpret and process such full-text queries? There are no Boolean connectives (AND, OR, NOT), and some query terms need not appear in the result documents.
Users expect the results to come back in some order, with the documents most likely to be useful at the front.
Scoring: density-based
Score each document against the query, and sort by score.
Idea: if a document talks about a topic more, then it is a better match; a document containing many occurrences of the query terms is relevant → term weighting.
Term frequency vectors
Let tf_{t,d} denote the number of occurrences of term t in document d.

       D1  D2  D3  D4  D5  D6  …
中国   11   0   7  13   4   0
文化    2   2   0   0   6   0
日本    0   5   2   0   1   9
留学生  0   1   2   6   0   0
教育    3   0   0   2   0   0
北京   17   0   0   0  11   8

For a free-text query q: Score(q,d) = Σ_{t∈q} tf_{t,d}
Problem of TF scoring
Word order is ignored → positional information index.
Long documents have an advantage → normalize for document length: wf_{t,d} = tf_{t,d} / |d|.
Importance does not grow in proportion to the count: the step from 0 occurrences to 1 means far more than the step from 100 to 101 → smoothing, e.g.
wf_{t,d} = 1 + log tf_{t,d} if tf_{t,d} > 0, and 0 otherwise.
Different words differ in importance: consider the query 日本 的 汉字 丼 → discrimination of terms.
Discrimination of terms
How do we measure how common a term is?
collection frequency (cf): total number of occurrences of the term in the collection.
document frequency (df): number of documents in the collection containing the term.

Word        cf     df
try        10422  8760
insurance  10440  3997
tf x idf term weights
The tf × idf weighting formula combines:
term frequency (tf), or wf — some measure of term density in a doc;
inverse document frequency (idf) — a measure of the term's importance (rarity). The raw form is idf_t = 1/df_t; as with tf, it is usually smoothed:
idf_t = log(N / df_t)
The weight of each term in a document is then:
w_{t,d} = tf_{t,d} × log(N / df_t)
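Combining the log-smoothed wf from the previous slide with idf gives a weight function like this (a sketch using base-10 logs; the N and df values are illustrative):

```python
import math

def tf_idf(tf, df, N):
    """w = (1 + log10 tf) * log10(N / df), and 0 when tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# With N = 1,000,000 documents: one occurrence of a rare term (df = 1,000)
# outweighs ten occurrences of a common one (df = 100,000).
print(tf_idf(1, 1_000, 1_000_000))     # 3.0
print(tf_idf(10, 100_000, 1_000_000))  # 2.0
```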
Documents as vectors
Each document j can be viewed as a vector with one dimension per term, whose components are tf.idf values.
So we have a vector space: terms are axes, docs live in this space. It is high-dimensional: even with stemming, there may be 20,000+ dimensions.

       D1   D2   D3   D4    D5   D6  …
中国   4.1  0.0  3.7  5.9   3.1  0.0
文化   4.5  4.5  0.0  0.0  11.6  0.0
日本   0.0  3.5  2.9  0.0   2.1  3.9
留学生 0.0  3.1  5.1 12.8   0.0  0.0
教育   2.9  0.0  0.0  2.2   0.0  0.0
北京   7.1  0.0  0.0  0.0   4.4  3.8
Intuition
Postulate: documents that are “close together” in the vector space talk about the same things.
(Figure: documents d1–d5 plotted against term axes t1, t2, t3, with angles θ and φ between them.)
Use cases: query-by-example; treating a free-text query as a vector.
Formalizing vector space proximity
First cut: the distance between two points (= the distance between the end points of the two vectors). Euclidean distance? Euclidean distance is a bad idea… because it is large for vectors of different lengths.
Sec. 6.3
Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Cosine similarity
The “closeness” of two vectors d_j and d_k can be measured by the size of the angle between them; concretely, the cosine of the angle is used as the similarity, which normalizes the vectors by length:

sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|) = Σ_{i=1..M} w_{i,j} w_{i,k} / ( sqrt(Σ_{i=1..M} w_{i,j}²) · sqrt(Σ_{i=1..M} w_{i,k}²) )
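The formula translates directly into code (a minimal sketch, applied here to the raw affection/jealous/gossip counts from the novels example):

```python
import math

def cosine(v1, v2):
    """sim(d1, d2) = (d1 · d2) / (|d1| |d2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Sense and Sensibility vs Wuthering Heights term-count vectors:
print(round(cosine([115, 10, 2], [20, 11, 6]), 3))  # 0.889
```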
Example
Docs: Austen's Sense and Sensibility (SaS), Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH).

Term counts:
           SaS  PaP  WH
affection  115   58  20
jealous     10    7  11
gossip       2    0   6

Normalized by vector length:
           SaS    PaP    WH
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254

cos(SaS, PaP) = 0.996×0.993 + 0.087×0.120 + 0.017×0.0 ≈ 0.999
cos(SaS, WH) = 0.996×0.847 + 0.087×0.466 + 0.017×0.254 ≈ 0.889
Notes on Index Structure
How should the normalized tf-idf values be stored? In each postings entry? Storing tf/normalization would cause a space blowup because of the floats.
Usually: tf is stored as an integer (enabling index compression), and the document length and idf are stored only once per document and per term.
tf-idf weighting has many variants
Columns headed ‘n’ are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
Sec. 6.4
Weighting may differ in queries vs documents
Many search engines allow for different weightings for queries vs. documents
SMART Notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
A very standard weighting scheme is lnc.ltc.
Document: logarithmic tf (l as first character), no idf, and cosine normalization.
Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization…
A bad idea?
tf-idf example: lnc.ltc
Query: best car insurance.  Document: car insurance auto insurance.

           Query                                Document                  Prod
Term       tf-raw tf-wt df     idf  wt   n'lize  tf-raw tf-wt wt   n'lize
auto       0      0     5000   2.3  0    0       1      1     1    0.52   0
best       1      1     50000  1.3  1.3  0.34    0      0     0    0      0
car        1      1     10000  2.0  2.0  0.52    1      1     1    0.52   0.27
insurance  1      1     1000   3.0  3.0  0.78    2      1.3   1.3  0.68   0.53

Exercise: what is N, the number of docs?
Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
Thus far
We can build an information retrieval system that supports Boolean queries, free-text queries, and ranked results.
IR Evaluation
Measures for a search engine
Speed of index construction: number of documents per hour; document size.
Speed of search: latency as a function of index size; throughput as a function of index size.
Expressiveness of the query language: ability to express complex information needs; speed on complex queries.
These criteria are measurable, but the more crucial measure is user happiness. How can it be measured quantitatively?
Measuring user happiness
Issue: who is the user?
Web engine: the user finds what they want and returns to the engine. Can measure the rate of returning users.
eCommerce site: the user finds what they want and makes a purchase. Is it the end-user, or the eCommerce site, whose happiness we measure? Measure the time to purchase, or the fraction of searchers who become buyers?
Enterprise (company/govt/academic): care about “user productivity”. How much time do my users save when looking for information? Many other criteria have to do with breadth of access, secure access, etc.
Happiness: elusive to measure
Commonest proxy: the relevance of search results. But how do you measure relevance?
Methodology: a test collection, consisting of
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
There is some work on more-than-binary assessments, but they are not the standard.
Evaluating an IR system
Note: the information need is translated into a query, and relevance is assessed relative to the information need, not the query.
E.g., information need: “I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has those words.
Standard Test Collections
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters, CWT100G/CWT200G, etc.
Human experts mark each doc Relevant or Irrelevant for each query — or at least the subset of docs that some system returned for that query.
Unranked retrieval evaluation: Precision and Recall
Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant)

               Relevant  Not Relevant
Retrieved        tp         fp
Not Retrieved    fn         tn

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
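From sets of retrieved and relevant docIDs, both measures are a few lines (a minimal sketch):

```python
def precision_recall(retrieved, relevant):
    """P = tp / |retrieved|, R = tp / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 2 of them among the 6 relevant ones:
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8})
print(p, round(r, 3))  # 0.5 0.333
```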
Accuracy
Given a query, the engine classifies each document as “Relevant” or “Irrelevant”.
The accuracy of the engine is the fraction of these classifications that are correct:
Accuracy = (tp + tn) / (tp + fp + tn + fn)
Is this a very useful evaluation measure in IR?
Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget: return nothing. Almost all documents are irrelevant to any given query, so an engine that retrieves nothing classifies almost every document correctly.
People doing information retrieval want to find something, and have a certain tolerance for junk.
Search for: …
0 matching results found.
Precision and recall when ranked
Extend the set-based definitions to a ranked list: at each document in the ranked list, compute a P/R point. Which of these values are useful?
Consider a P/R point for each relevant document.
Consider values only at fixed rank cutoffs, e.g., precision at rank 20.
Consider values only at fixed recall points, e.g., precision at 20% recall; there may be more than one precision value at a recall point.
Precision and Recall example
Average precision of a query
Often we want a single-number effectiveness measure. Average precision is widely used in IR: average the precision values obtained at each point where recall increases (i.e., at each relevant document retrieved).
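A sketch of that computation — precision is sampled at the rank of each relevant document, and relevant documents never retrieved contribute zero:

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant document's rank."""
    relevant = set(relevant)
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant docs a, b, c retrieved at ranks 1, 3 and 5:
ap = average_precision(["a", "x", "b", "y", "c"], {"a", "b", "c"})
print(round(ap, 3))  # 0.756
```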
Recall/precision graphs
Average precision vs. the P/R graph: AP hides information, but a recall/precision graph has an odd saw-tooth shape if plotted directly, and P/R graphs are hard to compare.
Precision and Recall, toward averaging
Averaging graphs: a false start
How can graphs be averaged? Different queries have different recall values.
What is the precision at 25% recall? Interpolate — but how?
Interpolation of graphs
Possible interpolation methods:
No interpolation — not very useful.
Connect the dots — not a function.
Connect the max, connect the min, connect the average, …
How should 0% recall be handled? Assume 0? Assume the best? A constant start?
How to choose?
A good retrieval system has the property that, on average, its precision decreases as recall increases. This has been verified time and time again (on average).
Interpolate so that the function is monotonically decreasing: going from left to right, take the interpolated precision at recall level R to be the maximum precision observed at any recall level R' ≥ R:
P_interp(R) = max { P' : (R', P') ∈ S, R' ≥ R }
where S is the set of observed (R, P) points. The result is a step function.
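The max-to-the-right rule is a one-liner over the observed points (a sketch; the (R, P) points below are illustrative):

```python
def interpolated_precision(points, r):
    """Max precision at any observed recall level >= r; yields a
    monotonically decreasing step function, defined even at r = 0."""
    candidates = [p for recall, p in points if recall >= r]
    return max(candidates) if candidates else 0.0

OBSERVED = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
print(interpolated_precision(OBSERVED, 0.0))  # 1.0
print(interpolated_precision(OBSERVED, 0.7))  # 0.5
```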
Our example, interpolated this way
The result is monotonically decreasing and handles 0% recall smoothly.
Averaging graphs: using interpolation
Asked: what is the precision at 25% recall? Interpolate the values.
Averaging across queries
Averaging over multiple queries:
Micro-average: each relevant document is one point in the average.
Macro-average: each query is one point in the average.
The average of many queries' average-precision values is called mean average precision (MAP) — “average average precision” would sound weird. MAP is the most common measure.
Interpolated average precision
Average precision at standard recall points: for a given query, compute a P/R point for every relevant doc, then interpolate the precision at standard recall levels.
11-pt is usually 100%, 90%, 80%, …, 10%, 0% (yes, 0% recall); 3-pt is usually 75%, 50%, 25%.
Average over all queries to get the average precision at each recall level, then average the interpolated levels to get a single result, called “interpolated average precision”.
Not used much anymore; MAP (“mean average precision”) is more common, but values at specific interpolated points are still commonly used.
Interpolation and averaging
A combined measure: F
A combined measure of P and R: the F measure (a weighted harmonic mean):

F = 1 / ( α(1/P) + (1 − α)(1/R) ) = (β² + 1) P R / (β² P + R),  with β² = (1 − α)/α

Usually the balanced F1 measure is used (β = 1, i.e., α = ½).
The harmonic mean is a conservative average: it heavily penalizes low values of P or R.
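In the β form the measure reads as follows (a sketch; note how a single low value drags F down):

```python
def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives F1."""
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# The harmonic mean heavily penalizes a low P or R:
print(round(f_measure(0.9, 0.1), 2))  # 0.18 (the arithmetic mean would be 0.5)
```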
Averaging F, example
Q-bad has 1 relevant document, retrieved at rank 1000: (R, P) = (1, 0.001), an F value of 0.2%, so AvgF = 0.2%.
Q-perfect has 10 relevant documents, retrieved at ranks 1-10: (R, P) = (.1, 1), (.2, 1), …, (1, 1), giving F values of 18%, 33%, …, 100%, so AvgF = 66.2%.
Macro average: (0.2% + 66.2%) / 2 = 33.2%
Micro average: (0.2% + 18% + … + 100%) / 11 = 60.2%
Summary of this lecture
Basic index techniques: the inverted index — dictionary & postings.
Scoring and ranking: term weighting with tf·idf; the vector space model; cosine similarity.
IR evaluation: precision, recall, F; interpolation; MAP and interpolated AP.
Thank You!
Q&A
Reading
[1] IIR Ch. 1, 6.2, 6.3, 8.1-8.4
[2] J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia: ACM Press, 1998.