Introduction to Information Retrieval (信息检索与搜索引擎)
GESC1007
Philippe Fournier-Viger
Full professor
School of Natural Sciences and Humanities
Spring 2020
Last week
We have discussed:
◦ Evaluation in an information retrieval system
Today:
◦ Web search engines
◦ Second assignment
◦ About the final exam
Course schedule (日程安排)
Week 1 Introduction (Chapter 1), Boolean retrieval
Week 2 Term vocabulary and posting lists (Chapter 2)
Week 3 Dictionaries and tolerant retrieval (Chapter 3)
Week 4 Index construction (Chapter 4)
Week 5 Scoring, term weighting, the vector space model (Chapter 6)
Week 6 A complete search system (Chapter 7)
Week 7 Evaluation in information retrieval
Week 8 Web search engines, advanced topics, conclusion
Final exam (to be announced)
WEB SEARCH ENGINES
The Web
What is special about the Web?
◦ The number of documents (very large)
◦ Lack of coordination in the creation of
documents,
◦ Diversity of backgrounds and motives of
content creators.
(Book section 19.1)
The Web
The Web is a set of webpages (网页)
Webpages are created using a language called HTML
[Figure: a webpage and its HTML source. Source: http://www.wikihow.com/Create-a-Simple-Web-Page-with-HTML]
The Web
Webpages are stored on servers (服务器)
To access a webpage, one must use software
called a Web browser (浏览器).
Webpages are sent over the Internet using the HTTP
protocol (HTTP协议).
[Figure: a browser at home connected through the Internet to the server of HITSZ]
The Web
The idea of the Web:
each webpage contains links to other webpages
(hyperlinks, 超链接).
Each webpage has an address (URL)
e.g. http://www.hitsz.edu.cn
Creating a simple webpage is not very difficult.
Webpages have become one of the best ways to
supply and consume information.
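The idea that each webpage has a URL and links to other webpages can be illustrated with a minimal sketch. It uses Python's standard `html.parser` and `urllib.parse` modules; the page content and the `LinkExtractor` class are hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the URL of the page.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, for illustration only.
PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <a href="http://www.example.com/">Example</a>
</body></html>
"""

links = extract_links(PAGE, "http://www.hitsz.edu.cn")
print(links)
```

The relative link is turned into a full URL on the same server, while the second link points to a different website.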
The Web
Billions of webpages containing information.
But if we cannot search this information, it is
useless.
Historically, two ways of searching for
information:
◦ Web directories (Yahoo!, etc.)
◦ Web search engines
Web directories (网络目录)
Web directory: a list of websites, separated by
categories.
Web directories (网络目录)
A Web directory contains only the “best” webpages for each category.
Problems:
◦ Web directories are created by humans.
◦ This takes a lot of time.
◦ It is not convenient for searching. A user needs to know how to find information within the categories.
◦ There can be thousands of categories.
◦ Information in categories is often old.
For these reasons, Web directories have mostly disappeared.
Web search engines
Examples: Baidu, Bing
They adapt information retrieval techniques
to search billions of documents.
Adapted in terms of:
◦ Indexing,
◦ Processing queries,
◦ Ranking documents
Web search engines
Why are they popular?
◦ ability to quickly answer queries.
◦ ability to index millions of documents.
◦ almost always up-to-date.
Fifteen years ago, the results returned by Web search
engines were not very good.
Novel ranking techniques (排序技术) and
spam-fighting techniques (反垃圾邮件技术)
have made it possible to obtain better results.
Web characteristics
The Web is mainly decentralized (分散).
Many languages.
Many different types of content.
Some webpages contain only pictures and
no text.
The Web contains a lot of unreliable
information.
How can a search engine know which
websites can be trusted?
(Book section 19.2)
Size of the Web
1995: 30 million webpages indexed by AltaVista
2017: 4.48 billion webpages (source: http://www.worldwidewebsize.com/)
Note: only static webpages are counted.
Dynamic webpage: the content is generated in real-time for the user.
The Web graph
The Web can be viewed as a graph (图)
◦ Each webpage is a vertex (顶点)
◦ A link between two webpages is an edge (图的边)
◦ The Web is a directed graph (有向图)
[Figure: example Web graph with webpages A, B, C, …, F]
The Web graph
Two types of links:
In-links: links that go to a page
Out-links: links that leave a page
Example: node B has 3 in-links and 1 out-link.
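Counting in-links and out-links can be sketched in a few lines. The edge list below is hypothetical (it is not the graph of the slide); it is only chosen so that B has 3 in-links and 1 out-link.

```python
from collections import defaultdict

# Hypothetical directed edges (source page, target page).
EDGES = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "E"), ("E", "F"), ("F", "A")]


def count_links(edges):
    """Return two dictionaries: in-link counts and out-link counts per page."""
    in_links = defaultdict(int)
    out_links = defaultdict(int)
    for source, target in edges:
        out_links[source] += 1   # the link leaves `source`
        in_links[target] += 1    # the link goes to `target`
    return in_links, out_links


in_links, out_links = count_links(EDGES)
print("B:", in_links["B"], "in-links,", out_links["B"], "out-link")
```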
The Web graph
Not all web pages are equally popular
◦ Many web pages have few in-links
◦ Few web pages have many in-links
◦ The number of in-links per website follows a
power law distribution (幂律分布)
[Figure: plot with axes "Number of in-links" and "Number of webpages"]
Spam
For some queries, there is a huge competition to
appear high in the results of search engines.
e.g. Beijing real-estate (房地产)
Thus, many people modify their website to try
to appear first in the search results.
◦ e.g. write “Beijing real-estate” many times in a webpage to
increase its term frequency (TF).
◦ e.g. write invisible text using the background color of the
webpage (e.g. white).
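A small sketch shows why repeating a phrase works against a ranking that relies only on term frequency. The two page texts are made up, and `phrase_frequency` is a hypothetical helper, not a function of any search engine.

```python
import re


def phrase_frequency(text, phrase):
    """Count the occurrences of `phrase` in `text`, ignoring case."""
    return len(re.findall(re.escape(phrase.lower()), text.lower()))


# Made-up page texts, for illustration only.
honest_page = "We list apartments for sale. Beijing real-estate prices are rising."
stuffed_page = honest_page + " Beijing real-estate" * 20

print(phrase_frequency(honest_page, "Beijing real-estate"))
print(phrase_frequency(stuffed_page, "Beijing real-estate"))
```

The stuffed page contains the phrase 21 times instead of once, although it offers no additional information to a reader.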
Spam detection
Nowadays, search engines use many sophisticated
methods to detect spam (repeated keywords,
etc.).
Websites that are trying to cheat may be blocked
from search engines.
Thus, some people have developed new techniques
to cheat search engines (欺骗搜索引擎) →
Cloaking (伪装)
One such technique is cloaking.
Some websites try to cheat by showing
different content to search engines and
users.
This is a problem that did not exist in traditional IR.
Paid inclusion
Paid inclusion: a website can also pay a
search engine to appear high in the
results.
Some search engines do not allow paid
inclusion.
Doorway page
◦ a page containing carefully chosen text to
rank highly in search engines for some
keywords.
◦ the page then links to another page containing
commercial content.
◦ a website may have many doorway pages.
[Figure: two doorway pages linking to another webpage]
Link analysis
To reduce the problem of spam on the Web, many search
engines perform link analysis.
Basic idea: to rank a page higher or treat it as more reliable if
it has many in-links.
e.g. PageRank algorithm
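The following is a minimal sketch of a simplified PageRank computation, not the implementation of any real search engine. Unlike plain in-link counting, a link from a highly ranked page counts more than a link from a lowly ranked page. The graph, the damping factor of 0.85 and the number of iterations are assumptions made for the illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank. `graph` maps each page to the pages it links to.

    Every linked page must also appear as a key of `graph`.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, out_links in graph.items():
            if out_links:
                # A page shares its score equally among its out-links.
                share = damping * rank[page] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: A, C and D link to B, and B links to A.
GRAPH = {"A": ["B"], "B": ["A"], "C": ["B"], "D": ["B"]}
ranks = pagerank(GRAPH)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this example B obtains the highest score because it has the most in-links, and A comes second because its single in-link comes from B.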
Link analysis
But some people create fake links to increase the
popularity of their webpages.
There is thus a continuing battle between spammers
and search engines.
Advertising (广告)
Two main advertisement models:
1) cost per view:
The goal is to show some content to the user
(branding).
An image is typically used.
A company may pay to display the image 1000
times.
(Book section 19.3)
Advertising (广告)
Two main advertisement models:
2) cost per click:
The goal is that some people click on an
advertisement to visit the website of the advertiser
(initiate a transaction).
The website may ask the person to buy something.
An image or text may be used with a link.
A company may pay for 1000 clicks.
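The two advertisement models can be compared with simple arithmetic. All prices and counts below are made up for the illustration; real prices vary widely.

```python
def cost_per_view_total(views, price_per_1000_views):
    """Cost per view: the advertiser pays for each display of the ad."""
    return views / 1000 * price_per_1000_views


def cost_per_click_total(clicks, price_per_click):
    """Cost per click: the advertiser pays only when a user clicks."""
    return clicks * price_per_click


# Hypothetical campaign: the ad is displayed 100,000 times and clicked 1,000 times.
views = 100_000
clicks = 1_000

view_cost = cost_per_view_total(views, price_per_1000_views=2.0)
click_cost = cost_per_click_total(clicks, price_per_click=0.25)
print("cost per view model: ", view_cost)
print("cost per click model:", click_cost)
```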
Advertising (广告)
Today, many search engines earn money
from advertising.
◦ Some will display search results and
advertisement separately.
Search results
Sponsored search results
◦ Some other search engines will combine
search results and advertisement.
[Figure: screenshot of a results page showing search results and sponsored search results]
Online advertisement networks
There are many advertisement networks:
◦ Bing Ads: provides pay-per-click
advertisements for Bing and Yahoo,
◦ Adwords: sells advertisements on various
websites.
◦ …
Click-spam
Click spam: a company clicks on the
advertisement of its competitors to spend
their money.
This may be done using some automatic
software.
A search engine must use some
techniques to block click spam.
Example: AllAdvantage (1999-2001)
It was an online advertisement company.
Search user experience (用户体验)
It is also important to understand the users of
search engines.
For traditional IR systems:
◦ Users often received training on how to search and write
queries.
For Web search engines:
◦ Users may not know or care about how to write queries.
◦ Usually, people use 2 or 3 keywords in a query.
◦ Usually people do not use special operators (wildcard queries,
Boolean operators…)
(Book section 19.4)
Search user experience (用户体验)
The more people use a search engine, the more money
it can earn.
How can a search engine get more users?
◦ By increasing the precision in the first few results,
◦ By updating the index frequently,
◦ By having a larger index,
◦ By offering a website that is simple and easy to use, and
that is very fast.
This way, a user can quickly find what they are looking for.
Three types of user queries
1) Informational queries: seek general
information on a broad topic.
◦ e.g. how to play piano
◦ There is not a single webpage that contains all
the information that the user wants.
◦ The user generally wants to combine
information from several webpages.
Three types of user queries
2) Navigational queries: seek the website
or home page of a given entity.
◦ e.g. find the webpage of Huawei (华为)
◦ The user expects that the first result is the
webpage of the entity (e.g. Huawei)
◦ The user only needs one document.
◦ The user wants very high precision (1).
Three types of user queries
3) Transactional queries: the user wants
to make a transaction.
◦ e.g. reserve a hotel room in Guangzhou,
e.g. buy train tickets…
◦ The search engine should provide links to
service providers.
Three types of user queries
For a given query, it can be difficult to identify
the type of the query.
Identifying the type of a query is useful:
◦ for selecting the most relevant results,
◦ for displaying relevant advertisements
(e.g. advertisement about train tickets)
Components of a Web search engine
[Figure: components of a Web search engine, including the Web crawler (网络爬虫)]
Index size
How can we compare the sizes of the indexes of
two search engines (e.g. Baidu vs Bing)?
This may be difficult to evaluate
◦ A search engine may only index the first few thousand words
in a page.
◦ Search engines may organize their indexes in tiers using tiered
indexes.
For general queries, only the main page of a website may be
shown, and other pages may not be shown.
Index size
Some techniques have been developed to
compare the size of search engines’ indexes.
Hypothesis: each search engine indexes only
one part of the Web, chosen randomly.
The “Capture-recapture” method →
Capture-recapture method
Two search engines E1 and E2.
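Take random pages from E1 and check whether each one is in E2.
This gives the fraction x of the pages of E1 that are also in E2.
Take random pages from E2 and check whether each one is in E1.
This gives the fraction y of the pages of E2 that are also in E1.
If E1 and E2 are independent and uniform random
subsets of the Web, we should have:
$$x\,|E_1| \approx y\,|E_2| \quad\Rightarrow\quad \frac{|E_1|}{|E_2|} \approx \frac{y}{x}$$
More details in the book…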
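The method can be tried on a simulated Web, as a sketch under the assumption stated above (both indexes are independent uniform random subsets). The sizes, the sample size and the random seed are arbitrary choices, and `overlap_fraction` is a hypothetical helper.

```python
import random


def overlap_fraction(index_a, index_b, sample_size, rng):
    """Sample pages from index_a and return the fraction also found in index_b."""
    sample = rng.sample(sorted(index_a), sample_size)
    return sum(1 for page in sample if page in index_b) / sample_size


rng = random.Random(0)                  # fixed seed so the run is repeatable
web = range(100_000)                    # simulated Web: pages 0 .. 99,999
e1 = set(rng.sample(web, 40_000))       # index of search engine E1
e2 = set(rng.sample(web, 20_000))       # index of search engine E2

x = overlap_fraction(e1, e2, 5_000, rng)   # fraction of E1's pages that are in E2
y = overlap_fraction(e2, e1, 5_000, rng)   # fraction of E2's pages that are in E1

estimate = y / x
print("estimated |E1| / |E2|:", round(estimate, 2))
print("true      |E1| / |E2|:", len(e1) / len(e2))
```

The estimate should be close to the true ratio of 2, without ever counting the pages of either index directly.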
Near-duplicates (近似重复)
Another issue: the Web may contain multiple
copies of the same webpage.
Up to 40% of the webpages are duplicates
(重复) of other pages.
Some of these copies are legitimate (合法的).
Others are not.
Search engines try to avoid indexing duplicates
to reduce the size of their indexes.
(Book section 19.6)
Detecting duplicates
How to detect duplicates?
We do not want to compare billions of webpages
with each other.
Simple approach:
◦ calculate a fingerprint (hash) for each webpage, i.e. a number
computed from its content.
◦ If two pages have the same fingerprint, they may be
duplicates, so we need to compare them.
◦ If they are duplicates, only one of them is indexed.
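The simple approach above can be sketched as follows. The use of SHA-256, the 64-bit fingerprint size and the text normalization are assumptions; the sketch only finds exact copies (up to spacing and letter case), not near-duplicates.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Collapse whitespace and lowercase, so trivial differences are ignored."""
    return " ".join(text.split()).lower()


def fingerprint(text):
    """Return a 64-bit integer computed from the content of a page."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pages_to_index(pages):
    """Return the URLs to index, keeping one page per group of duplicates."""
    buckets = defaultdict(list)   # fingerprint -> URLs already kept
    kept = []
    for url, text in pages.items():
        candidates = buckets[fingerprint(text)]
        # Equal fingerprints only suggest a duplicate: confirm by comparing.
        if any(normalize(pages[c]) == normalize(text) for c in candidates):
            continue
        candidates.append(url)
        kept.append(url)
    return kept


# Hypothetical pages, for illustration only.
PAGES = {
    "http://a.example/": "Beijing real-estate   listings",
    "http://b.example/copy": "beijing real-estate listings",
    "http://c.example/": "How to play piano",
}
kept = pages_to_index(PAGES)
print(kept)
```

Only pages that share a fingerprint are compared with each other, which avoids comparing every page with every other page.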
Web crawling (Web信息发现)
Web crawling: the process by which a
search engine gathers pages from the Web
to index them.
Goal:
◦ Collect information about webpages,
◦ Collect information about links between
webpages,
◦ Do this quickly!
(Book chapter 20)
Web crawler (网络爬虫)
A web crawler must have the following features
(特征):
1) Robustness:
◦ Some websites try to cheat and may
generate an infinite number of pages to
mislead web crawlers.
◦ Web crawlers must be able to avoid these
“traps” (陷阱).
Web crawler (网络爬虫)
2) Politeness (礼貌):
◦ A Web crawler should be polite.
◦ It should not visit a website too often.
◦ Otherwise, the owner of the website may not
be happy.
3) Efficiency
◦ The Web crawler should be able to efficiently
index a huge number of webpages.
Web crawler (网络爬虫)
4) Quality
◦ The Web crawler should try to index the high
quality or most useful webpages first
◦ The Web crawler must be able to assign different
priority levels to different webpages.
5) Extensibility
◦ A Web crawler should work with different
technologies, different languages,
different data formats, etc.
Crawling
How does a Web crawler index websites?
The crawler begins with one or more URLs
(web page addresses).
The crawler visits one of these webpages.
◦ The crawler extracts the text and links.
◦ The text is indexed.
◦ The links are used to find more webpages.
The crawler then continues by visiting other
webpages.
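The steps above can be sketched as a loop over a queue of URLs. To stay self-contained, the sketch crawls a tiny hypothetical Web stored in a dictionary instead of downloading real pages; `FAKE_WEB`, `crawl` and the `fetch` parameter are assumptions made for the illustration.

```python
from collections import deque

# Hypothetical Web: URL -> (text of the page, links found in the page).
FAKE_WEB = {
    "http://a.example/": ("page a", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://a.example/"]),
    "http://c.example/": ("page c", ["http://b.example/", "http://d.example/"]),
    "http://d.example/": ("page d", []),
}


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages starting from the seed URLs and return {url: text}."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid repeat visits
    index = {}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:          # the page could not be retrieved
            continue
        text, links = page
        index[url] = text         # "index" the text (here: just store it)
        for link in links:        # use the links to find more webpages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


index = crawl(["http://a.example/"], FAKE_WEB.get)
print(list(index))
```

A real crawler would replace `fetch` by an HTTP download followed by text and link extraction, and would add the politeness and robustness features described earlier.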
Crawling
A Web crawler should not visit the same
webpage twice.
How fast must a crawler be to crawl the Web?
◦ 4 billion webpages
◦ Crawling them in 1 month requires about 1540 webpages / second!
A Web Crawler may be designed to visit
popular websites more often than less
popular websites.
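The crawling speed given above can be checked with a short calculation, assuming a month of 30 days.

```python
PAGES = 4_000_000_000                     # 4 billion webpages
SECONDS_PER_MONTH = 30 * 24 * 60 * 60     # assumption: a 30-day month

rate = PAGES / SECONDS_PER_MONTH
print(round(rate), "webpages per second")  # 1543, i.e. about 1540
```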
Robot exclusion
Some people do not want Web
crawlers to index their website.
To do this, we can put a file named robots.txt on a
website to tell Web crawlers to
ignore the website.
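A polite crawler can check a robots.txt file with Python's standard `urllib.robotparser` module, as in the sketch below. The file content, the crawler names `ExampleBot` and `OtherBot`, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is excluded from the whole website,
# and all other crawlers are excluded from the /private/ directory only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked_bot = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
other_bot = parser.can_fetch("OtherBot", "http://www.example.com/index.html")
other_bot_private = parser.can_fetch("OtherBot", "http://www.example.com/private/a.html")
print(blocked_bot, other_bot, other_bot_private)
```

Here the content is parsed from a string; in practice a crawler would first download the file from the website it wants to visit.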
[Figure: example robots.txt file, with a label pointing to the name of a search engine]
Crawling
Generally, a search engine will have many
computers working as Web crawlers.
These Web crawlers could be in different
locations:
China, Europe, America, etc.
These Web crawlers must work together.
They must share the work and avoid visiting
the same websites multiple times.
This can be challenging!
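One possible way to share the work, shown here only as a sketch, is to assign each website to exactly one crawler by hashing its host name. This scheme, the function `assign_crawler` and the URLs are assumptions and are not described in the slides.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Return the number (0 .. num_crawlers - 1) of the crawler responsible for url."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # All URLs of one host get the same number, so the host has a single crawler.
    return int.from_bytes(digest[:8], "big") % num_crawlers


# Hypothetical URLs, for illustration only.
URLS = [
    "http://www.example.com/index.html",
    "http://www.example.com/news/today.html",
    "http://www.example.org/",
]
assignments = {url: assign_crawler(url, 3) for url in URLS}
for url, crawler in assignments.items():
    print(crawler, url)
```

Because a website is always handled by the same crawler, two crawlers never visit the same page, and each crawler can enforce politeness for its own websites.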
Distributed index
For a Web search engine, the index may
be very large.
Moreover, many users may want to access
the index at the same time.
Thus an index will be stored on several
computers.
Link analysis
Many search engines consider the links between websites
as important information for ranking webpages.
Link analysis: analyzing the links between websites to
derive useful information.
A link from a website A to another website B is
considered as an endorsement (认可) of the website B
by A.
[Figure: a link from website A to website B]
Link analysis
When analyzing links, we can also analyze the text of
each link in a webpage (its anchor text).
e.g. The real-estate market in Shenzhen (…)
This is useful because webpage B may not describe
itself accurately.
61
[Figure: webpage A links to webpage B]
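To analyze the text of links, a search engine first has to extract it from the HTML of page A. The sketch below uses Python's standard `html.parser` module. The class name `AnchorTextParser` and the example URL are hypothetical, and the parser ignores details such as nested links or image-only links.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (target URL, link text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []     # list of (href, text of the link)
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and normalize whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


if __name__ == "__main__":
    page_a = '<p>See <a href="http://b.example/">The real-estate market in Shenzhen</a>.</p>'
    parser = AnchorTextParser()
    parser.feed(page_a)
    print(parser.links)
```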
![Page 62: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/62.jpg)
Link analysis
In fact, there is often a gap between the terms in a
webpage and how web users would describe that page.
The text used in a link is useful, but some of its
terms may not be.
e.g. Click here for information about Shenzhen.
We can use the TF-IDF measure to filter out unimportant
words.
62
[Figure: webpage A links to webpage B]
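The filtering idea can be sketched as follows: compute a TF-IDF weight for each term of a link text, using a collection of link texts to estimate how common each term is, and keep only the terms whose weight reaches a threshold. The function names, the tiny collection, and the threshold of 1.0 are assumptions for this example. The weight used is term frequency times log(N / document frequency).

```python
import math
from collections import Counter


def tfidf_weights(anchor, collection):
    """Return a TF-IDF weight for each term of `anchor`.

    `collection` is a list of link texts used to compute document
    frequencies; terms that never occur in it are skipped.
    """
    n = len(collection)
    df = Counter()
    for text in collection:
        df.update(set(text.lower().split()))
    tf = Counter(anchor.lower().split())
    return {t: tf[t] * math.log(n / df[t]) for t in tf if df[t] > 0}


def important_terms(anchor, collection, min_weight):
    """Keep the terms of `anchor` whose TF-IDF weight is at least `min_weight`."""
    weights = tfidf_weights(anchor, collection)
    return [t for t in anchor.lower().split() if weights.get(t, 0.0) >= min_weight]


if __name__ == "__main__":
    # Hypothetical link texts collected from several pages.
    anchors = [
        "click here for information about shenzhen",
        "click here for information about beijing",
        "click here for more",
        "information about hotels",
        "click here",
    ]
    print(important_terms(anchors[0], anchors, min_weight=1.0))
```

In this small collection, words such as "click" and "here" occur in most link texts and receive low weights, so only "shenzhen" passes the threshold.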
![Page 63: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/63.jpg)
Link analysis
Thanks to the analysis of the text of links:
If we search for «big blue», we may find the
webpage of IBM.
This is great.
But there can be some side effects.
For example, if we search for «miserable failure»,
we can find the page of George W. Bush.
63
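The behaviour described above can be reproduced with a small index built only from the text of links: each term of a link text is indexed under the page the link points to, not the page that contains it. This is a minimal sketch, and the function names and example URLs are hypothetical. Ranking by the number of matching links is a simplification chosen for this example.

```python
from collections import Counter, defaultdict


def build_anchor_index(links):
    """Index target pages by the terms of the links pointing to them.

    `links` is an iterable of (source_url, target_url, link_text).
    """
    index = defaultdict(Counter)  # term -> Counter of target pages
    for _source, target, text in links:
        for term in set(text.lower().split()):
            index[term][target] += 1
    return index


def search(index, query):
    """Return pages whose incoming link texts contain every query term,
    ordered by the number of matching links (ties broken by URL)."""
    terms = query.lower().split()
    if not terms:
        return []
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    scores = {page: sum(index[t][page] for t in terms) for page in candidates}
    return sorted(scores, key=lambda page: (-scores[page], page))


if __name__ == "__main__":
    links = [
        ("http://news.example/", "http://ibm.example/", "Big Blue"),
        ("http://blog.example/", "http://ibm.example/", "big blue computers"),
        ("http://shop.example/", "http://sky.example/", "big blue sky photos"),
    ]
    print(search(build_anchor_index(links), "big blue"))
```

The target page is found even if it never contains the query terms itself. The same property means that whoever writes many links with a chosen text can influence which page is returned.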
![Page 64: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/64.jpg)
This is because many people have purposely linked to the
page of George W. Bush with the text «miserable
failure» to fool the search engines.
64
![Page 65: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/65.jpg)
Another example
65
![Page 66: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/66.jpg)
Link analysis
Search engines use various techniques to limit
this problem.
Some search engines consider not only the
text of links, but also the text before and
after a link.
66
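Considering the text around a link can be sketched as taking a window of a few words on each side of the link text. The function name, the three-part input, and the window size below are assumptions for this example; the slides do not say how much surrounding text search engines actually use.

```python
def extended_anchor_text(before, anchor, after, window=3):
    """Return the link text together with up to `window` words
    taken from the text just before and just after the link."""
    left = before.split()[-window:] if window > 0 else []
    right = after.split()[:window] if window > 0 else []
    return " ".join(left + anchor.split() + right)


if __name__ == "__main__":
    print(extended_anchor_text(
        "To learn about housing prices",
        "click here",
        "for the Shenzhen real-estate market report",
        window=3,
    ))
```

Here the link text "click here" says nothing about the target page, but the surrounding words add terms such as "housing", "prices" and "Shenzhen".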
![Page 67: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/67.jpg)
FINAL EXAM
67
![Page 68: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/68.jpg)
68
![Page 69: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/69.jpg)
Some questions
69
![Page 70: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/70.jpg)
Some questions
70
![Page 71: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/71.jpg)
Some questions
71
![Page 72: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/72.jpg)
SECOND ASSIGNMENT
72
![Page 73: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/73.jpg)
Conclusion
Today:
◦ Web search engines
Good luck preparing for the final
exam! Goodbye (再见)!
73
![Page 74: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate](https://reader035.vdocuments.pub/reader035/viewer/2022071001/5fbe6a7af4bb8814b12e5639/html5/thumbnails/74.jpg)
References
Manning, C. D., Raghavan, P., Schütze, H.
Introduction to Information Retrieval. Cambridge:
Cambridge University Press, 2008.
74