Post on 17-Aug-2020

Page 1: Cours INF1025 - Groupe 50 Outils de bureautique et internet ISpam For some queries, there is a huge competition to appear high in the results of search engines. e.g. Beijing real-estate

信息检索与搜索引擎 Introduction to Information Retrieval

GESC1007

Philippe Fournier-Viger

Full professor

School of Natural Sciences and Humanities

[email protected]

Spring 2020

Page 2:

Last week

We have discussed:

◦ Evaluation in an information retrieval system

Today:

◦ Web search engines

◦ Second assignment

◦ About the final exam

Page 3:

Course schedule (日程安排)

Week 1 Introduction (Chapter 1), Boolean retrieval

Week 2 Term vocabulary and posting lists (Chapter 2)

Week 3 Dictionaries and tolerant retrieval (Chapter 3)

Week 4 Index construction (Chapter 4)

Week 5 Scoring, term weighting, the vector space model (Chapter 6)

Week 6 A complete search system (Chapter 7)

Week 7 Evaluation in information retrieval

Week 8 Web search engines, advanced topics, conclusion

Final exam (to be announced)

Page 4:

WEB SEARCH ENGINES

Page 5:

The Web

What is special about the Web?

◦ The number of documents (very large),

◦ Lack of coordination in the creation of documents,

◦ Diversity of backgrounds and motives of content creators.

(Textbook section 19.1)

Page 6:

The Web

The Web is a set of webpages (网页).

Webpages are created using a language called HTML.

[Figure: a webpage and its HTML source. Source: http://www.wikihow.com/Create-a-Simple-Web-Page-with-HTML]

Page 7:

The Web

Webpages are stored on servers (服务器).

To access a webpage, one must use software called a Web browser (浏览器).

[Figure: a browser at home connected through the Internet to the server of HITSZ]

Page 8:

The Web

Webpages are sent over the Internet using the HTTP protocol (HTTP协议).

[Figure: a browser at home receiving a webpage over the Internet from the server of HITSZ]

Page 9:

The Web

The idea of the Web: each webpage contains links to other webpages (hyperlinks, 超链接).

Each webpage has an address (URL), e.g. http://www.hitsz.edu.cn

Creating a simple webpage is not very difficult.

Webpages have become one of the best ways to supply and consume information.

Page 10:

The Web

Billions of webpages containing information.

But if we cannot search this information, it is useless.

Historically, two ways of searching for information:

◦ Web directories (Yahoo!, etc.)

◦ Web search engines

Page 11:

Web directories (网络目录)

Web directory: a list of websites, organized by categories.

Page 12:

Web directories (网络目录)

A Web directory contains only the “best” webpages for each category.

Problems:

◦ Web directories are created by humans.

◦ This takes a lot of time.

◦ It is not convenient for searching. A user needs to know how to find information within the categories.

◦ There can be thousands of categories.

◦ Information in categories is often out of date.

For these reasons, Web directories have mostly disappeared.

Page 13:

Web search engines

Examples: Baidu, Bing

They adapt information retrieval techniques to search billions of documents.

Adapted in terms of:

◦ Indexing,

◦ Processing queries,

◦ Ranking documents

Page 14:

Web search engines

Why are they popular?

◦ ability to quickly answer queries.

◦ ability to index millions of documents.

◦ almost always up-to-date.

Fifteen years ago, results returned by Web search engines were not very good.

Novel ranking techniques (排序技术) and spam-fighting techniques (反垃圾邮件技术) have since made it possible to obtain better results.

Page 15:

Web characteristics

The Web is mainly decentralized (分散).

Many languages.

Many different types of content.

Some webpages contain only pictures and no text.

The Web contains a lot of unreliable information.

How can a search engine know which websites can be trusted?

(Textbook section 19.2)

Page 16:

Size of the Web

1995: 30 million webpages indexed by AltaVista

2017: 4.48 billion webpages (source: http://www.worldwidewebsize.com/)

Note: only static webpages are counted.

Dynamic webpage: the content is generated in real-time for the user.

Page 17:

The Web graph

The Web can be viewed as a graph (图)

◦ Each webpage is a vertex (顶点)

◦ A link between two webpages is an edge (图的边)

◦ The Web is a directed graph (有向图)

[Figure: an example Web graph with webpages A, B, C, …, F]

Page 18:

The Web graph

Two types of links:

In-links: links that go to a page

Out-links: links that leave a page

[Figure: in the example graph, node B has 3 in-links and 1 out-link]
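Counting in-links and out-links can be sketched in a few lines of Python. The graph below is hypothetical: the edges of the figure are not recoverable from the slides, so they were invented here so that node B has 3 in-links and 1 out-link. The function name `count_links` is also only illustrative.

```python
# Hypothetical directed Web graph: each page maps to the pages it links to.
# The edges are invented for illustration (B gets 3 in-links and 1 out-link).
web_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["B"],
    "D": [],
    "E": ["B", "F"],
    "F": ["A"],
}

def count_links(graph):
    """Return (in_links, out_links) count dictionaries for a directed graph."""
    out_links = {page: len(targets) for page, targets in graph.items()}
    in_links = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            in_links[target] = in_links.get(target, 0) + 1
    return in_links, out_links

in_links, out_links = count_links(web_graph)
print("B in-links:", in_links["B"], "out-links:", out_links["B"])
```

Storing the graph as a dictionary of out-link lists mirrors what a crawler sees: a page lists its own out-links, while in-links only become known after other pages have been read.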

Page 19:

The Web graph

Not all web pages are equally popular

◦ Many web pages have few in-links

◦ Few web pages have many in-links

◦ The number of in-links per webpage follows a power law distribution (幂律分布)

[Figure: plot of the number of webpages against the number of in-links]

Page 20:

Spam

For some queries, there is a huge competition to appear high in the results of search engines.

e.g. Beijing real-estate (房地产)

Thus, many people modify their website to try to appear first in the search results.

◦ e.g. write “Beijing real-estate” many times in a webpage to increase the term frequency (TF).

◦ e.g. write invisible text using the background color of the webpage (e.g. white)
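To see why repeated keywords matter, here is a minimal sketch of how a raw term-frequency count reacts to keyword stuffing. The page texts and the helper `term_frequency` are made up for illustration; real engines use more elaborate tokenization and weighting.

```python
import re
from collections import Counter

def term_frequency(text, term):
    """Count occurrences of `term` among the lowercase word tokens of `text`."""
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    return Counter(tokens)[term]

# Hypothetical page texts.
honest_page = "We list apartments for sale in Beijing. Contact us about real-estate."
stuffed_page = honest_page + " real-estate" * 20  # the same page with the keyword repeated

tf_honest = term_frequency(honest_page, "real-estate")
tf_stuffed = term_frequency(stuffed_page, "real-estate")
print(tf_honest, tf_stuffed)
```

A ranking function that rewards raw TF would score the second page far higher for this term, which is why search engines look for this kind of repetition.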

Page 21:

Spam detection

Nowadays, search engines use many sophisticated methods to detect spam (repeated keywords, etc.).

Websites that are trying to cheat may be blocked from search engines.

Thus, some people have developed new techniques to cheat search engines (欺骗搜索引擎) →

Page 22:

Cloaking (伪装)

One such technique is cloaking.

Some websites try to cheat by showing different content to search engines and users.

This is a problem that did not exist in traditional IR.

Page 23:

Paid inclusion

Paid inclusion: a website can also pay a search engine to appear high in the results.

Some search engines do not allow paid inclusion.

Page 25:

Doorway page

◦ a page containing carefully chosen text to rank highly in search engines for some keywords.

◦ the page then links to another page containing commercial content.

◦ a website may have many doorway pages.

[Figure: several doorway pages linking to another webpage]

Page 26:

Link analysis

To reduce the problem of spam on the Web, many search engines perform link analysis.

Basic idea: rank a page higher, or treat it as more reliable, if it has many in-links.

e.g. PageRank algorithm
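As an illustration of link-based ranking, the sketch below runs a basic power-iteration version of PageRank on a tiny invented graph. It is a simplified teaching version, not the implementation of any search engine. The damping factor 0.85, the 50 iterations and the graph `toy_web` are assumptions made for the example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}.

    Assumes every link target is itself a key of `graph`.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base score (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in graph.items():
            if targets:
                # A page shares its rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical graph: B receives in-links from A, C and D.
toy_web = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best, {page: round(score, 3) for page, score in ranks.items()})
```

Unlike a plain in-link count, the score of a page here also depends on the scores of the pages that link to it, so C ranks well only because it is linked from the highly ranked page B.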

Page 27:

Link analysis

But some people create fake links to increase the popularity of their webpages.

There is thus a continuing battle between spammers and search engines.

Page 28:

Advertising (广告)

Two main advertisement models:

1) cost per view:

The goal is to show some content to the user (branding).

An image is typically used.

A company may pay, for example, to have the image displayed 1000 times.

(Textbook section 19.3)

Page 29:

Advertising (广告)

Two main advertisement models:

2) cost per click:

The goal is for people to click on an advertisement and visit the website of the advertiser (initiate a transaction).

The website may ask the person to buy something.

An image or text may be used with a link.

A company may pay, for example, for 1000 clicks.

Page 30:

Advertising (广告)

Today, many search engines earn money from advertising.

◦ Some display search results and advertisements (sponsored search results) separately.

◦ Some other search engines combine search results and advertisements.

(Textbook section 19.3)

Page 31:

[Figure: a results page showing search results and sponsored search results]

Page 32:

Online advertisement networks

There are many advertisement networks:

◦ Bing Ads: provides pay-per-click advertisements for Bing and Yahoo,

◦ Adwords: sells advertisements on various websites.

◦ …

Page 33:

Click-spam

Click spam: a company clicks on the advertisements of its competitors to use up their advertising budget.

This may be done using automatic software.

A search engine must use some techniques to block click spam.

Page 34:

Example: AllAdvantage (1999-2001)

It was an online advertisement company.

Page 35:

Search user experience (用户体验)

It is also important to understand users of search engines.

For traditional IR systems:

◦ Users often received training on how to search and write queries.

For Web search engines:

◦ Users may not know or care about how to write queries.

◦ Usually, people use 2 or 3 keywords in a query.

◦ Usually, people do not use special operators (wildcard queries, Boolean operators…)

(Textbook section 19.4)

Page 36:

Search user experience (用户体验)

The more people use a search engine, the more money it can earn.

How can a search engine get more users?

◦ By increasing the precision in the first few results,

◦ By updating the index frequently,

◦ By having a larger index,

◦ By offering a website that is simple, easy to use, and very fast.

A user can then quickly find what they are looking for.

Page 37:

Three types of user queries

1) Informational queries: seek general information on a broad topic.

◦ e.g. how to play piano

◦ There is not a single webpage that contains all the information that the user wants.

◦ The user generally wants to combine information from several webpages.

Page 38:

Three types of user queries

2) Navigational queries: seek the website or home page of a given entity.

◦ e.g. find the webpage of Huawei (华为)

◦ The user expects that the first result is the webpage of the entity (e.g. Huawei)

◦ The user only needs one document.

◦ The user wants very high precision at the first result (precision at 1).

Page 39:

Three types of user queries

3) Transactional queries: the user wants to make a transaction.

◦ e.g. reserve a hotel room in Guangzhou, e.g. buy train tickets…

◦ The search engine should provide links to service providers.

Page 40:

Three types of user queries

For a given query, it can be difficult to identify the type of the query.

Identifying the type of a query is useful:

◦ for selecting the most relevant results,

◦ for displaying relevant advertisements (e.g. advertisement about train tickets)

Page 45:

Components of a Web search engine

[Figure: components of a Web search engine, including the Web crawler (网络爬虫)]

Page 46:

Index size

How can we compare the sizes of the indexes of two search engines (e.g. Baidu vs Bing)?

This may be difficult to evaluate:

◦ A search engine may only index the first few thousand words in a page.

◦ Search engines may organize their indexes in tiers (tiered indexes).

For general queries, only the main page of a website may be shown, and other pages may not be shown.

Page 47:

Index size

Some techniques have been developed to compare the size of search engines’ indexes.

Hypothesis: each search engine indexes only one part of the Web, chosen randomly.

The “Capture-recapture” method →

Page 48:

Capture-recapture method

Two search engines E1 and E2.

Take a random page from E1 and check if it is in E2. Repeated over many pages, this gives a ratio x (the fraction of E1's pages that are also in E2).

Take a random page from E2 and check if it is in E1. This gives a ratio y (the fraction of E2's pages that are also in E1).

If E1 and E2 are independent and uniform random subsets of the Web, we should have:

x · |E1| ≈ y · |E2|, that is, |E1| / |E2| ≈ y / x

More details in the book…
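The reasoning behind the method is that x · |E1| and y · |E2| both estimate the number of pages that the two engines have in common, so the size ratio |E1| / |E2| can be estimated as y / x. The simulation below is a sketch under invented assumptions: a "Web" of 100,000 page ids, two indexes drawn uniformly at random, and a hypothetical helper `estimate_size_ratio`. Real engines cannot be sampled this directly.

```python
import random

def estimate_size_ratio(index1, index2, sample_size, rng):
    """Estimate |E1| / |E2| as y / x from random samples of each index."""
    sample1 = rng.sample(sorted(index1), sample_size)
    sample2 = rng.sample(sorted(index2), sample_size)
    # x: fraction of the E1 sample that is also in E2.
    x = sum(page in index2 for page in sample1) / sample_size
    # y: fraction of the E2 sample that is also in E1.
    y = sum(page in index1 for page in sample2) / sample_size
    return y / x

rng = random.Random(0)                 # fixed seed so the run is repeatable
web = range(100_000)                   # hypothetical Web of 100,000 page ids
e1 = set(rng.sample(web, 40_000))      # E1 indexes 40,000 random pages
e2 = set(rng.sample(web, 20_000))      # E2 indexes 20,000 random pages
ratio = estimate_size_ratio(e1, e2, 5_000, rng)
print(round(ratio, 2))                 # the true ratio is 40,000 / 20,000 = 2
```

The estimate is only as good as the hypothesis: if the two indexes are not independent random subsets (for example, both favor popular pages), x and y are biased.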

Page 49:

Near-duplicates (近似重复)

Another issue: the Web may contain multiple copies of the same webpage.

Up to 40% of the webpages are duplicates (重复) of other pages.

Some of these copies are legitimate (合法的). Others are not.

Search engines try to avoid indexing duplicates to reduce the size of their indexes.

(Textbook section 19.6)

Page 50:

Detecting duplicates

How to detect duplicates?

We do not want to compare billions of webpages with each other.

Simple approach:

◦ calculate a fingerprint (a hash, which is a number) for each webpage.

◦ If two pages have the same fingerprint, they may be duplicates, so we need to compare them.

◦ If they are duplicates, only one of them is indexed.
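The simple approach can be sketched as follows. This is a minimal illustration, assuming the fingerprint is the first 64 bits of a SHA-256 hash of the page text; the URLs, texts and function names are invented. It only catches exact copies, and near-duplicates need other techniques described in the book.

```python
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Return a 64-bit integer fingerprint of the page text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def find_duplicates(pages):
    """Group URLs whose page text is identical, using fingerprints as a filter."""
    buckets = defaultdict(list)
    for url, text in pages.items():
        buckets[fingerprint(text)].append(url)
    groups = []
    for urls in buckets.values():
        # Same fingerprint: confirm by comparing the actual text.
        by_text = defaultdict(list)
        for url in urls:
            by_text[pages[url]].append(url)
        groups.extend(group for group in by_text.values() if len(group) > 1)
    return groups

# Hypothetical pages: the first two have identical text.
pages = {
    "http://a.example/1": "Shenzhen real-estate news",
    "http://b.example/copy": "Shenzhen real-estate news",
    "http://c.example/2": "How to play piano",
}
duplicate_groups = find_duplicates(pages)
print(duplicate_groups)
```

Grouping by fingerprint means each page is hashed once and only pages that share a fingerprint are compared, instead of comparing every pair of pages.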

Page 51:

Web crawling (Web信息发现)

Web crawling: the process by which a search engine gathers pages from the Web to index them.

Goal:

◦ Collect information about webpages,

◦ Collect information about links between webpages,

◦ Do this quickly!

(Textbook chapter 20)

Page 52:

Web crawler (网络爬虫)

A web crawler must have the following features (特征):

1) Robustness:

◦ Several websites try to cheat and may try to generate an infinite number of pages to mislead web crawlers.

◦ Web crawlers must be able to avoid these “traps” (陷阱).

Page 53:

Web crawler (网络爬虫)

2) Politeness (礼貌):

◦ A Web crawler should be polite.

◦ It should not visit a website too often.

◦ Otherwise, the owner of the website may not be happy.

3) Efficiency:

◦ The Web crawler should be able to efficiently index a huge number of webpages.

Page 54:

Web crawler (网络爬虫)

4) Quality:

◦ The Web crawler should try to index the high quality or most useful webpages first.

◦ The Web crawler must be able to assign different priority levels to different webpages.

5) Extensibility:

◦ A Web crawler should work with different technologies, different languages, different data formats, etc.

Page 55:

Crawling

How does a Web crawler index websites?

The crawler begins with one or more URLs (web page addresses).

The crawler visits one of these webpages.

◦ The crawler extracts the text and links.

◦ The text is indexed.

◦ The links are used to find more webpages.

The crawler then continues visiting other webpages.
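The crawling loop above can be sketched as a breadth-first traversal. To keep the example runnable without network access, it crawls a small invented in-memory "Web" (`TOY_WEB`) instead of real sites; the `fetch` parameter is a hypothetical stand-in for downloading and parsing a page.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, list of out-links).
TOY_WEB = {
    "http://a.example/": ("home page", ["http://a.example/news", "http://b.example/"]),
    "http://a.example/news": ("news about Shenzhen", ["http://a.example/"]),
    "http://b.example/": ("piano lessons", ["http://c.example/"]),
    "http://c.example/": ("train tickets", []),
}

def crawl(seed_urls, fetch):
    """Breadth-first crawl; returns {url: text} for every page reached."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid repeat visits
    index = {}
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:          # unknown or unreachable page
            continue
        text, links = page
        index[url] = text         # "index" the text
        for link in links:        # use the links to discover more pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl(["http://a.example/"], TOY_WEB.get)
print(list(index))
```

The `seen` set is what keeps the crawler from visiting the same webpage twice, even though the home page and the news page link to each other.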

Page 56:

Crawling

A Web crawler should not visit the same webpage twice.

How fast must a crawler be to crawl the Web?

◦ 4 billion webpages

◦ To crawl them in 1 month: about 1,540 webpages / second!

A Web crawler may be designed to visit popular websites more often than less popular websites.
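The crawl rate follows from a simple division, shown here as a short calculation. A 30-day month is assumed; the slide does not state the length of the month used.

```python
pages = 4_000_000_000                      # 4 billion webpages
seconds_per_month = 30 * 24 * 60 * 60      # assuming a 30-day month
rate = pages / seconds_per_month
print(f"{rate:.0f} webpages per second")
```

With these assumptions the result is a little over 1,540 webpages per second, in line with the figure on the slide.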

Page 57:

Robot exclusion

Some people do not want Web crawlers to index their website.

To do this, they can put a file named robots.txt on the website to tell Web crawlers to ignore the website.

[Figure: an example robots.txt file, with a callout marking the name of a search engine]
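A polite crawler checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module on an invented robots.txt file; the crawler names `ExampleBot` and `OtherBot` and the URLs are hypothetical. The file is parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleBot" is an invented crawler name:
# it is asked to ignore the whole site, other crawlers only /private/.
robots_txt = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

blocked = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
allowed = parser.can_fetch("OtherBot", "http://www.example.com/index.html")
private = parser.can_fetch("OtherBot", "http://www.example.com/private/a.html")
print(blocked, allowed, private)
```

The `User-agent` line is where the name of a search engine's crawler goes; the `*` entry applies to every crawler that has no entry of its own.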

Page 58:

Crawling

Generally, a search engine will have many computers working as Web crawlers.

These Web crawlers could be in different locations: China, Europe, America, etc.

These Web crawlers must work together.

They must share the work and avoid visiting the same websites multiple times.

This can be challenging!
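One common way to share the work, sketched here as an assumption rather than as the method of any particular search engine, is to assign each website to exactly one crawler by hashing its host name. The function `assign_crawler` and the number of crawlers are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler id (0 .. num_crawlers - 1) by hashing its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

urls = [
    "http://www.hitsz.edu.cn/",
    "http://www.hitsz.edu.cn/news",
    "http://www.example.com/",
]
assignment = {url: assign_crawler(url, 3) for url in urls}
for url, crawler_id in assignment.items():
    print(crawler_id, url)
```

Because all pages of one host go to the same crawler, no coordination is needed to avoid duplicate visits to that website, and that crawler alone can control how often the website is visited.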

Page 59:

Distributed index

For a Web search engine, the index may be very large.

Moreover, many users may want to access the index at the same time.

Thus, an index will be stored on several computers.

Page 60:

Link analysis

Many search engines consider the links between websites as important information to rank webpages.

Link analysis: analyzing the links between websites to derive useful information.

A link from a website A to another website B is considered as an endorsement (认可) of the website B by A.

[Figure: website A links to website B]

Page 61:

Link analysis

When analyzing links, we can also analyze the context of each link in a webpage (the text of the link).

e.g. The real-estate market in Shenzhen (…)

This is useful because the webpage B may not provide an accurate description of itself.

[Figure: website A links to website B]

Page 62:

Link analysis

In fact, there is often a gap between the terms in a webpage and how web users would describe a page.

The text used in a link is useful. But some terms may not be useful.

e.g. Click here for information about Shenzhen.

We can use the TF-IDF measure to filter unimportant words.

[Figure: website A links to website B]
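As a rough illustration of filtering link text with TF-IDF, the sketch below weights each term in the link texts pointing to one page by tf × log(N / df), where each link text in the collection counts as one document. The link texts, the whitespace tokenization and the function name `tfidf_for_target` are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_for_target(target_anchors, corpus_anchors):
    """Weight each link-text term for one target page by tf * log(N / df)."""
    n = len(corpus_anchors)
    df = Counter()                       # number of link texts containing each term
    for text in corpus_anchors:
        df.update(set(text.lower().split()))
    tf = Counter()                       # term counts over links to the target page
    for text in target_anchors:
        tf.update(text.lower().split())
    return {term: tf[term] * math.log(n / df[term]) for term in tf}

# Hypothetical link texts: the first three point to the same page B.
anchors_to_b = [
    "click here for information about Shenzhen",
    "Shenzhen real-estate market",
    "click here Shenzhen apartments",
]
all_anchors = anchors_to_b + [
    "click here to buy train tickets",
    "click here for piano lessons",
    "click here for hotel rooms",
]

scores = tfidf_for_target(anchors_to_b, all_anchors)
top_term = max(scores, key=scores.get)
print(top_term, round(scores["shenzhen"], 2), round(scores["click"], 2))
```

Words such as "click" and "here" appear in most link texts, so their IDF is low and they receive a small weight, while "shenzhen" stands out as a description of page B.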

Page 63:

Link analysis

Thanks to the analysis of the text of links:

If we search “big blue”, we may find the webpage of IBM.

This is great.

But there can be some side-effects.

For example, if we search “miserable failure”, we can find the page of George W. Bush. →

Page 64:

This is because many people have purposely linked to the page of George W. Bush with the text “miserable failure” to fool the search engines.

Page 65:

Another example

Page 66:

Link analysis

Search engines try to use various techniques to avoid this problem.

Some search engines will not only consider the text of links, but also the text before and after a link.

Page 67:

FINAL EXAM

Page 69:

Some questions

Page 70:

Some questions

Page 71:

Some questions

Page 72:

SECOND ASSIGNMENT

Page 73:

Conclusion

Today:

◦ Web search engines

I wish you good preparation for the final exam! Goodbye (再见)!

Page 74:

References

Manning, C. D., Raghavan, P., Schütze, H. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.