![Page 1: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/1.jpg)
信息检索与搜索引擎Introduction to Information Retrieval
GESC1007
Philippe Fournier-Viger
Full professor
School of Natural Sciences and Humanities
Spring 20201
![Page 2: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/2.jpg)
Last week
We have discussed about:
◦ Index construction
◦ Index compression
QQ Group: 1059666166
Website: Videos, PPTs…
2
![Page 3: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/3.jpg)
Course schedule (日程安排)
3
Lecture 1IntroductionBoolean retrieval (布尔检索模型)
Lecture 2 Term vocabulary and posting lists
Lecture 3 Dictionaries and tolerant retrieval
Lecture 4 Index construction and compression
Lecture 5Scoring, weighting, and the vector space model
Lecture 6 Computer scores, and a complete search system
Lecture 7 Evaluation in information retrieval
Lecture 8 Web search engines, advanced topics, andconclusion
![Page 4: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/4.jpg)
CHAPTER 6:
SCORING, TERM
WEIGHTING AND
THE VECTOR SPACE MODEL
4pdf p146
![Page 5: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/5.jpg)
Introduction
5
Until now, we have only discussed
Boolean queries (布尔查询)
Shenzhen AND Beijing
car AND rental AND NOT buy
For each query, a document matches the query
or not.
(e.g. each document contains
“Shenzhen AND Beijing” or does not)
![Page 6: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/6.jpg)
Introduction
For small document collections, a Boolean
query may retrieve few documents.
Query = HITSZ AND Basketball AND Football
Only three documents!
![Page 7: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/7.jpg)
Introduction
7
For large document collections (e.g. Web),
the number of documents matching a query may
be huge.
Query = Beijing AND Restaurants
5 million webpages !
![Page 8: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/8.jpg)
Introduction
8
For large document collections, a user may not be
able to look at all the results.
the user may not want or may not have time
to look at 5 million webpages
Thus, a search engine such as Baidu should not
show all the documents to the user.
![Page 9: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/9.jpg)
Introduction
A search engine should:
◦ calculate a score for each document
indicating its relevance for the query.
◦ presents documents to the user by
decreasing order of relevance
(using a score得分).
◦ In other words, the most relevant documents
are presented first.
9
Example →
![Page 10: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/10.jpg)
10
Shenzhen airport
![Page 11: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/11.jpg)
11
More than 2
million results
Most relevant
document
Least relevant
documents
Advertisement
(广告)
![Page 12: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/12.jpg)
The structure of a document
Documents typically contains:
◦ Text
◦ Pictures, videos, and other information
A document may also contain data about the document itself
(metadata -元数据).
Example →
12
![Page 13: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/13.jpg)
Example of metadata (元数据)
FieldAuthor:
Title:
Date of publication:
Date of creation:
Language:
…
13
Values
Shakespeare
Bla Bla Bla
1900/01/01
1899/0101
English
…
Metadata may consist of values for various
fields.
![Page 14: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/14.jpg)
Parametric search
Definition: searching for documents
based on their properties (metadata元数据).
14
A search form
![Page 15: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/15.jpg)
Parametric search
Definition: searching for documents
based on their properties (metadata元数据).
15
e.g. Searching all English documents
published in 1615 by Shakespeare
A search form
1615
Shakespeare
![Page 16: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/16.jpg)
Another example“Find documents authored by Shakespeare in 1601, containing the word “England”
Author = “Shakespeare”Publication year = 1601Body contains “England”
How to find documents for this query?
Find documents containing “England” using a dictionary.
When searching using the dictionary, only retrieve the documents having the field values specified by the query (Author = Shakespeare, Year = 1601)
16
![Page 17: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/17.jpg)
Search forms
Search forms have two types of components:
Field: takes a value from a limited set of values
Zone: the user can write any text
17
Publication date:
Title:
Keywords:
Abstract:
Field
Zones
Travel to Beijing
![Page 18: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/18.jpg)
Indexes for fields and zones
It is easy to create an index for a field (field
index) because the set of terms to use for
searching is restricted and small.
2010, 2011, 2012….
It is more difficult to create a zone index because
all possible terms must be considered.
there can be many!
How to create a “zone index” ?
→
18
![Page 19: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/19.jpg)
Zone index (1st approach)
A zone index is a dictionary, where each terms is
specific to its zone.
19
This indicates that the value “william”
appears in the zone “author” for
documents 2,3,5 and 8.
![Page 20: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/20.jpg)
Zone index (1st approach)
A zone index is a dictionary, where each terms is
specific to its zone.
20
Problem: This would work but there might be too many
terms in the dictionary!
![Page 21: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/21.jpg)
Zone index (2nd approach)
Another way of representing a zone index is the
following:
21
This indicates that the value “william” appears in
the zone “author” and in the “title” of document 2
![Page 22: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/22.jpg)
Zone index (2nd approach)
Another way of representing a zone index is the
following:
22
Is it better?
• The number of terms in the dictionary is
reduced…
• But the posting lists may be longer
• Useful to search for terms appearing
anywhere in the document (title, author etc.)
![Page 23: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/23.jpg)
Ranking documents using a zone index
A zone index is useful for rankingdocuments from most relevant to least relevant.
This technique is called:“weighted zone scoring”.
A score between 0 and 1 will be assigned to each document.
1 = highly relevant0 = irrelevant document
23Note: “Weighted zone scoring” is also called “Ranked Boolean retrieval”
![Page 24: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/24.jpg)
How to assign scores?
The score of a document:
The overall score of a document is
the sum of the scores of its zones.
Each zone of a document contributes
to the document’s score if the query
matches the zone,
Weights may be assigned to each zone
to indicate their relative importance.
(title may be more important than
author name)
24
![Page 25: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/25.jpg)
Example
A collection of documents with three zones:
author, title and body
25
author:
title:
body:
![Page 26: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/26.jpg)
Example
We can set the weights as follow:
weight of “author” = 0.2
weight of “title” = 0.3
weight of “body” = 0.5
26
author:
title:
body:
![Page 27: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/27.jpg)
Example
We can set the weights as follow:
weight of “author” = 0.2
weight of “title” = 0.3
weight of “body” = 0.5
27
author:
title:
body:
Thus, a match in the author zone does not change
much the overall score, the title zone somewhat
more, and the body is the most important zone.
![Page 28: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/28.jpg)
ExampleWe will search for documents using this query:
QUERY = Shakespeare
28
![Page 29: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/29.jpg)
Exampleweight of “author” = 0.2
weight of “title” = 0.3
weight of “body” = 0.5
If a document D1 contains “Shakespeare” in the title and body, its overall score is :
0 + 0.3 + 0.5 = 0.8
If a document D2 contains “Shakespeare” only in the body, its overall score is :
0 + 0 + 0.5 = 0.5
Thus, document D1 is ranked higher than document D2.
29
QUERY = Shakespeare
![Page 30: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/30.jpg)
How to calculate scores using a zone index?
e.g.: query: “william”
Using the index, we can quickly find all the documents containing “william”.
The information in the postling list(s), allows calculating the score of each document:
Doc 2: 0.2 + 0.3 = 0.5Doc 3: 0.2
Doc 4: 0.3Doc 5: 0.2
30
![Page 31: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/31.jpg)
How to choose the weights?
So how do we choose the weights?
Different weights will result in different rankings
of documents.
Weights could be set by a human expert.
31
weight of “author” = 0.2
weight of “title” = 0.3
weight of “body” = 0.5
weight of “author” = 0.4
weight of “title” = 0.4
weight of “body” = 0.2
Which one is better?
![Page 32: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/32.jpg)
How to choose the weights?
But often, the weights are “learned”
automatically using “machine learning
methods” (机器学习方法).
The idea is to try to find the best weights
automatically.
32
![Page 33: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/33.jpg)
How does it works?We have a set of training examples:
a query + some documents + a relevance judgment on
each document
Relevance judgement: {relevant, non_relevant} or a score.
33
The query matches the title
of this document
0 = no 1 = yes
The query matches
the body of this
document
0 = no 1 = yes
![Page 34: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/34.jpg)
How does it works?
The weights are then “learned” from these examples so that they will approximate the relevance judgments in the examples.
Let’say that: relevant = 1 non_relevant = 0
What should be the weights?
weight of body = ? 0.5 ? 0.6 ? 0.3?
weight of title = ? 0.5 ? 0.4 ? 0.7 ?
It is an optimization problem (优化问题). We should find the weights that provide the best approximation (minimizes the error).
34
![Page 35: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/35.jpg)
TERM FREQUENCY AND WEIGHTING
35
![Page 36: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/36.jpg)
Introduction
Until now, we have considered that a term
appears or does not appear in a document.
We did not consider how often a term appears in
a document!
«shenzhen » appears 3 times in webpage 1
«shenzhen » appears 30 times in webpage 2
36
We need to consider term frequency !
(how many times a term appears)
![Page 37: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/37.jpg)
Free text queries
Users of a search engine like Baidu will often search using a
list of keywords (terms) as query.
e.g. shenzhen airport schedule
This query is said to contains three terms or keywords.
This type of query is called a «free form query »
No Boolean operators are used.
37
![Page 38: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/38.jpg)
Scoring documents for free text queries
To answer free text queries, the score of a
document should be based on:
◦ which keywords appear in the document,
◦ how many times they appear in the
document
38
![Page 39: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/39.jpg)
Scoring documents for free text queries
Simple approach: The score of a document is the sum of the number of occurrences of each term in that document (Term Frequency –TF -词频).
QUERY: shenzhen aiport
DOCUMENT: Shenzhen is a big city, founded in 1979. Shenzhen has an aiport. There is also a few ports in Shenzhen
SCORE = 3 + 1 = 4
39
3 times «shenzhen »
+ 1 time «airport »
![Page 40: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/40.jpg)
Bag of word model (词袋模型)
This approach for scoring document is
called the «bag of word model »
because it does not consider the order
between words.
40
![Page 41: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/41.jpg)
All words are equally important?
No!
Example: airports in Shenzhen
The word «in » is a very common word that is
in general not important (it is a stop word 停用词). We may ignore it.
41
![Page 42: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/42.jpg)
Problem with term frequency
all terms are viewed as equally important by the previous approach.
words that are very frequent will not allow to discriminate between documents.
e.g. documents related to the automobile industry will likely all contain the word «auto »
42
![Page 43: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/43.jpg)
Solution
Idea: we want to assign smaller weights to terms
that are very frequent.
Inverse document frequency (IDF) of a term:
(逆文档频率)
43
IDFt = log (𝐍
𝐃𝐅𝐭)
N = number of documents in the collection
DFt = number of documents containing the term
(document frequency)
![Page 44: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/44.jpg)
Example
N = 806, 791 documents
44
IDFt = log (𝐍
𝐃𝐅𝐭)
![Page 45: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/45.jpg)
45
![Page 46: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/46.jpg)
How to combine document
frequency and term frequency?
To score each term in a document, a
popular measure is TF-IDF
TF = term frequency
IDF = inverse document frequency
46
![Page 47: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/47.jpg)
TF-IDF
47
TF-IDFt = TFt × IDFt
Term frequency -词频): number of times
that the term
appears in the
document
Inverse
document
frequency
(逆文档频率) of
the term
![Page 48: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/48.jpg)
TF-IDF
For a document, a term will get:
a high score if it appears many
times in the document but rarely
in other documents.
a lower score if the term appears
few times in the document, or if it
appears in many documents
a low score if the term appears in
almost all documents
48
![Page 49: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/49.jpg)
To score a document for a query
Based on the TF-IDF, we can calculate the score of a document for a query:
Score(query, document) = sum of the TD-IDF of all terms in the query for that
document.
Consider the query: Shenzhen Beijing
Score(Shenzhen + Beijing, Doc1) =
TD-IDFShenzhen + TD-IDFBeijing
49
![Page 50: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/50.jpg)
THE VECTOR SPACEMODEL FOR SCORING
50
![Page 51: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/51.jpg)
Vector model (矢量模型)
The vector space model is a way of
representing documents that is often used
in information retrieval.
Each document is represented as a
vector (矢量).
51
![Page 52: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/52.jpg)
Vector model (矢量模型)
A vector (矢量) representing a document
contains a score for each word of the
dictionary (using the TF-IDF measure or other
measure).
Suppose that we have only two words in the
dictionary: Shenzhen and Beijing
vector(doc1) = [0.7, 0.1]
…
52
Score of Shenzhen in
that document
Score of Beijing
in that document
A document
![Page 53: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/53.jpg)
Vector model (矢量模型)
vector(doc1) = [0.7, 0.1]
…
53
Score of Shenzhen in
that document
Score of Beijing
in that document
A document
The vector of a document provides information about:
• the words contained in the document
• how often they appear in the document
But it does not keep information about the order between
the words
![Page 54: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/54.jpg)
Vector model (矢量模型)
The collection of all documents can be
viewed as a set of vectors.
vector(doc1) = [0.4, 0.5]
vector(doc2) = [0.7, 0.1]
vector(doc3) = [0.4, 0.8]
vector(doc4) = [0.7, 0.9]
…
… …
..
54
![Page 55: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/55.jpg)
Vector model (矢量模型)
The vectors of documents can be viewed as a
chart:
vector(doc1) = [0.4, 0.5]
vector(doc2) = [0.7, 0.1]
vector(doc3) = [0.4, 0.8]
vector(doc4) = [0.7, 0.9]
…
… …
..
55
0
0.2
0.4
0.6
0.8
1
0 0.5 1
Each document is a point or vector
(an arrow on the chart)
![Page 56: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/56.jpg)
Vector model (矢量模型)
The vectors of documents can be viewed as a
chart:
vector(doc1) = [0.4, 0.5]
vector(doc2) = [0.7, 0.1]
vector(doc3) = [0.4, 0.8]
vector(doc4) = [0.7, 0.9]
…
… …
..
56
0
0.2
0.4
0.6
0.8
1
0 0.5 1
In our case, we have two dimensions (2D) because we
assume that two words “Shenzhen” and “Beijing” are
the only words in the dictionary
![Page 57: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/57.jpg)
Vector model (矢量模型)
The vectors of documents can be viewed as a
chart:
vector(doc1) = [0.4, 0.5]
vector(doc2) = [0.7, 0.1]
vector(doc3) = [0.4, 0.8]
vector(doc4) = [0.7, 0.9]
…
… …
..
57
0
0.2
0.4
0.6
0.8
1
0 0.5 1
If we create the vectors using 3 words, we would
have 3 dimensions (3D) instead, and so on…
![Page 58: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/58.jpg)
Similarity between documents
Using the vector space model, we can try to calculate
how similar two documents are.
58
Intuitively, two
documents are
expected to be
similar if their
vectors are close
to each other
![Page 59: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/59.jpg)
Similarity between documents
However, longer documents, will tend to have larger values in
their vectors (when calculating scores using TF-IDF)
59
[0.7, 0.9]
[0.4, 0.5]
This would be
a longer
document
This would be
a shorter
document
![Page 60: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/60.jpg)
Similarity between documents
How to calculate the similarity between two
vectors while considering the length of
documents?
Solution: calculate the cosine similarity (余弦相似度 ) of two documents d1 and d2.
Example →
60
![Page 61: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/61.jpg)
cosine similarity (余弦相似度 )
vector(doc1) = [a1, a2, … an]
vector(doc2) = [b1, b2, … bn]
cossimilarity(doc1,doc2) = (𝑎1×𝑏1 + 𝑎2×𝑏2 + … 𝑎𝑛 × 𝑏𝑛)
𝑎12+𝑎22+⋯+𝑎𝑛2× 𝑏12+𝑏22+⋯+𝑏𝑛2
61
Length of the vector (arrow)
of document doc1
Length of the vector
(arrow) of document doc2
![Page 62: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/62.jpg)
If we use the standard math notation:
𝑑𝑜𝑐1 = [a1, a2, … an]
𝑑𝑜𝑐2 = [b1, b2, … bn]
cossimilarity(𝑑𝑜𝑐1, 𝑑𝑜𝑐2) = 𝑑𝑜𝑐1 ∙𝑑𝑜𝑐2
|𝑑𝑜𝑐1||𝑑𝑜𝑐2|
62
Length
of this
vector
Length
of this
vector
dot
product
![Page 63: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/63.jpg)
Example – 3 documents
63
We will use the Term frequency (TF) measure
instead of TF-IDF for the vectors
We will consider three documents with the
following term frequencies:
| 𝑑𝑜𝑐1| = 30.56 | 𝑑𝑜𝑐2| = 46.86 | 𝑑𝑜𝑐3| = 41.30
![Page 64: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/64.jpg)
Example – 3 documents
64
• Then, we length-normalize each vector.
• This means to divide each values in a vector
Ԧv by the length of the vector | Ԧv |
With these values, we could easily calculate the
cosine similarity between any documents….
![Page 65: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/65.jpg)
Example – 3 documents
…
We will not do all the calculations!
What is important is that we can now
calculate the similarity between two
vectors (documents)
65
![Page 66: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/66.jpg)
Queries as vectors
Just like documents, a query can be
represented as a vector.
Example: assume a dictionary of three
words: affection, jealous, gossip
The query: jealous gossip
can be viewed as a vector:
vector(query) = [0, 0.7, 0.7]
66
![Page 67: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/67.jpg)
Queries as vectors
The score of a document for a query can be
defined as the cosine similarity between the
query and the document.
score(query,document) =
cossimilarity(query, document)
67
query
![Page 68: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/68.jpg)
Queries as vectors
Using this approach, a document may have a
high score for a query even if it does not
contain all the query terms
Note: in practice, we can use TF, TF-IDF or
other measures for calculating the scores of
terms in documents.
68
![Page 69: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/69.jpg)
Queries as vectors (cont’d)
To search for documents:
◦ Calculate the cosine similarity between the query
and each document.
◦ Show only the documents to the user that have the
highest similarity with the query
69
top 10
documents
![Page 70: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/70.jpg)
Queries as vectors (cont’d)
However, calculating the similarity between a
query and all documents is time-consuming!
We want a search engine to be fast!
70
top 10
documents
![Page 71: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/71.jpg)
Computing vector scores
1) Set the score of all documents to 0.
2) For each query term t :
◦ calculate the weight of the query term t.
◦ obtain the posting list of t.
◦ increase the score of each document appearing in the posting list of t.
3) Divide the score of each document by its vector length.
4) Show the documents having the highest score to the user.
71
Some details…
![Page 72: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/72.jpg)
Computing vector scores
A key observation is that:
If we have multiple computers, we can
calculate the score using several
computers working in parallel.
Each computer may score different
documents for a same query.
72
Some details…
![Page 73: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/73.jpg)
SUMMARY
73
![Page 74: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/74.jpg)
Term frequency (TF): (词频) how many times
a term appears in a document
Document frequency (DF) (文档频率):
how many documents contain a term in a
collection of documents.
74
Inverse document frequency (IDF) of a term t:
(逆文档频率)
IDFt = log (𝐍
𝐃𝐅𝐭)
N = number of documents in the collection
DFt = document frequency of the term t
![Page 75: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/75.jpg)
Conclusion
Today, we discussed chapter 6 about
weighted Boolean retrieval.
We will continue next week…
The PPT slides are on the website.
75
![Page 76: Cours INF1025 - Groupe 50 Outils de bureautique et internet Ivector(doc1) = [0.7, 0.1] … 53 Score of Shenzhen in that document Score of Beijing in that document A document The vector](https://reader034.vdocuments.pub/reader034/viewer/2022051907/5ffa590c6d1c64226a001cb6/html5/thumbnails/76.jpg)
References
Manning, C. D., Raghavan, P., Schütze, H.
Introduction to information retrieval. Cambridge:
Cambridge University Press, 2008
76