Introduction to Search Engines (搜尋引擎簡介)


Upload: zanna

Post on 09-Jan-2016



TRANSCRIPT

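Page 1: Introduction to Search Engines

Introduction to Search Engines (搜尋引擎簡介)

吳昇, Associate Professor, Graduate Institute of Computer Science and Information Engineering, National Chung Cheng University ([email protected])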

Page 2: Introduction to Search Engines

What is a search engine?

◆ A web service site where Internet users find information on the Internet

◆ The software that provides the web search service

Page 3: Introduction to Search Engines

Use of search engines?

◆ Search for the URL of a company or website

◆ Look for the contact information of a person or an organization

◆ Search for information related to a term, e.g., to collect information about the Formosan landlocked salmon (櫻花鉤吻鮭)

◆ Look for news regarding XXX

◆ Treat the search engine as a big dictionary

◆ ...

Page 4: Introduction to Search Engines

Types of search engines

◆ Directory browse/search

◆ Web page search

◆ USENET news search

◆ FTP search

◆ People/organization search

◆ Daily-life information search

◆ Library search

◆ Commercial product search

Page 5: Introduction to Search Engines

Example search engines

◆ Yahoo,

◆ Google,

◆ AltaVista,

◆ MSN,

◆ Excite,

◆ Lycos, ...

◆ YAM, Kimo, PCHome

◆ GAIS, Openfind, ...

◆ DejaNews,

◆ Archie, ...

Page 6: Introduction to Search Engines

Portal Services

◆ Directory / Search

◆ Daily information: weather, maps, TV, ...

◆ Free email, free web pages, calendar

◆ Personalized services, channel subscription

◆ Web chat

◆ E-commerce

◆ Content aggregation

◆ ...

Page 7: Introduction to Search Engines

Directory implementation

◆ Each URL entry is stored as a record

◆ The URL data is managed by a database system

◆ A search function is provided for searching the data in the directory tree

Page 8: Introduction to Search Engines

Directory implementation

◆ The search is, in general, for locating a website or a category of websites

◆ The data is entered through manual registration by the website owner or by a web surfer

◆ Managing the directory tree is labor-intensive and requires people who are familiar with the relevant domain knowledge

Page 9: Introduction to Search Engines

The Advantages/Disadvantages of Directory search engine

◆ Advantages

– The data is manually maintained, and thus contains less noise and is more precise

– The output of a search can be categorized and can be more organized

– Can support search within a category

Page 10: Introduction to Search Engines

The Advantages/Disadvantages of Directory search engine

◆ Disadvantages

– The data coverage is limited, so the wanted item sometimes cannot be found

– Does not support relevance ranking

– Labor-intensive

Page 11: Introduction to Search Engines

Implementation of Webpage search engine

1. Feature consideration

2. Data gathering

3. Data preprocessing

4. Data indexing

5. Query processing

6. Interaction

7. Service tools

8. Personalization

Page 12: Introduction to Search Engines

Requirements for WebPage search engines

0. The quality of the search results of a search engine basically depends on

– a. the quality of the underlying data

– b. the search techniques, such as the ranking technique

1. Data coverage should be large enough

2. Data needs to be filtered, e.g., by removing redundant pages

Page 13: Introduction to Search Engines

Requirements for WebPage search engines

3. Full-text search capability should be provided

4. A relevance ranking mechanism should be provided

5. Search speed should be fast enough

6. Search features

I.e., the evaluation points are: quality, speed, scale, robustness, features

Page 14: Introduction to Search Engines

Data Gatherer

◆ Also known as a spider, crawler, robot, ...

◆ Periodically traverses the web to collect web pages

◆ Needs a URL list manager to decide which pages to collect and when

◆ Needs a link analyzer to generate the list of new URLs

◆ Needs to decide what to collect and what not to collect

Page 15: Introduction to Search Engines

Data Gatherer

◆ The basic function is a get-file function over the HTTP protocol

◆ A webpage parser module extracts link information from a retrieved page

◆ A URL bank manager module manages the URLs to be fetched

◆ A robot-controller module manages data collection using multiple clients
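The parser and URL bank modules listed above can be sketched as follows. This is a minimal single-client illustration, not the deck's implementation: the names `LinkParser`, `extract_links` and `crawl` are hypothetical, and the get-file function is passed in as `fetch` so the sketch stays independent of any particular HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Webpage parser: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs found in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, link)).url for link in parser.links]


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    url_bank = deque(seed_urls)   # URLs waiting to be fetched (FIFO gives BFS order)
    seen = set(seed_urls)         # every URL ever queued, to avoid refetching
    pages = {}
    while url_bank and len(pages) < max_pages:
        url = url_bank.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                url_bank.append(link)
    return pages
```

A robot controller would run several such loops against a shared URL bank; that coordination is left out here.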

Page 16: Introduction to Search Engines

Issues of Robot

◆ Site-based vs. URL-based

– Site-based is popular, e.g., wget, Teleport

– robots.txt is easier to implement in a site-based robot

– A URL-based robot is more appropriate for large-scale search engines

◆ Retrieval schedule: BFS is better

◆ Incremental retrieval
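One reason robots.txt handling is harder for a URL-based robot is that consecutive URLs usually belong to different sites, so the rules have to be looked up per host. A possible sketch uses Python's standard `urllib.robotparser` with a per-site cache. The class name `RobotsCache`, the user agent string `example-bot` and the injected `fetch_robots_txt` function are assumptions made for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches one parsed robots.txt per site for a URL-based robot."""

    def __init__(self, fetch_robots_txt, user_agent="example-bot"):
        # fetch_robots_txt(url) returns the file's text, or None if missing
        self.fetch_robots_txt = fetch_robots_txt
        self.user_agent = user_agent
        self._rules = {}

    def allowed(self, url):
        parts = urlsplit(url)
        site = f"{parts.scheme}://{parts.netloc}"
        if site not in self._rules:
            rules = RobotFileParser()
            text = self.fetch_robots_txt(site + "/robots.txt") or ""
            rules.parse(text.splitlines())   # an empty file allows everything
            self._rules[site] = rules
        return self._rules[site].can_fetch(self.user_agent, url)
```

A site-based robot needs only one such lookup per run, which is why the rule is simpler to honor there.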

Page 17: Introduction to Search Engines

Robot Issues

◆ What to gather and what not to?

◆ Hidden web data collection

◆ Focused crawling

– Targets specialized content of web pages

– Suitable for special search engines

– Evaluated by precision and recall
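For reference, precision is the fraction of collected pages that are relevant, and recall is the fraction of relevant pages that were collected. A small sketch, assuming the set of relevant pages is known (for example from a hand-labeled sample); the function name is hypothetical:

```python
def precision_recall(collected, relevant):
    """Return (precision, recall) of a crawl against a labeled relevant set."""
    collected, relevant = set(collected), set(relevant)
    hits = len(collected & relevant)
    precision = hits / len(collected) if collected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a crawl that collects four pages of which two are relevant, when three relevant pages exist, has precision 2/4 and recall 2/3.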

Page 18: Introduction to Search Engines

Data Preprocessing

◆ Remove redundant pages

◆ Transform each page into the internal data format

◆ Perform web cross-link analysis to generate a URL databank

◆ Filter the data to remove data that should not be indexed

◆ Partition the data space*

Page 19: Introduction to Search Engines

Redundancy removal

◆ 15% to 20% of web pages are replicated on different websites, e.g., tutorials for Java, Perl, Python, …

◆ Can be implemented by partitioned hashing or external sorting
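A minimal in-memory sketch of the partitioned-hashing idea: each page is reduced to a content fingerprint, the fingerprint selects a partition, and duplicates are detected inside that partition only. The whitespace and case normalization, the use of SHA-1, and the partition count are assumptions; at scale each partition would be a separate file processed on its own, and only exact copies after normalization are caught.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash of the page text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(pages, num_partitions=16):
    """pages: iterable of (url, text). Keeps the first copy of each page."""
    partitions = [dict() for _ in range(num_partitions)]
    unique = []
    for url, text in pages:
        fingerprint = content_fingerprint(text)
        bucket = partitions[int(fingerprint, 16) % num_partitions]
        if fingerprint not in bucket:
            bucket[fingerprint] = url     # remember which URL owns this content
            unique.append((url, text))
    return unique
```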

Page 20: Introduction to Search Engines

Ranking the URLs

◆ Link analysis counts the references between web pages

◆ A URL that receives more references gets a higher score

– Weighted links

– Discount internal links, such as "back to home" links

◆ Sort the web pages by score so that a page with a higher rank gets a lower ID
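The scoring and ID assignment described above might look like the following sketch. It is a plain weighted in-link count, not the deck's actual formula: the discount of 0.1 for same-site links and the function names are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def score_urls(links, internal_weight=0.1):
    """links: iterable of (source_url, target_url) pairs."""
    scores = defaultdict(float)
    for source, target in set(links):          # count each distinct link once
        scores[source] += 0.0                   # make sure every page gets a score
        same_site = urlsplit(source).netloc == urlsplit(target).netloc
        scores[target] += internal_weight if same_site else 1.0
    return dict(scores)


def assign_ids(scores):
    """Higher score gets a lower ID; ties are broken by URL for determinism."""
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return {url: doc_id for doc_id, url in enumerate(ordered)}
```

Giving well-referenced pages low IDs means that index lists sorted by ID are also roughly sorted by URL score.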

Page 21: Introduction to Search Engines

Data Partition

◆ The data is partitioned by language type

◆ The language partition can be done as follows:

– For each known language, collect a certain number of web pages in that language

– Build a high-frequency term set for each language from an analysis of the sample data

– Determine the language type of a page by term analysis
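The three steps above can be sketched as follows, assuming whitespace-delimited languages and a simple overlap count as the "term analysis". For languages written without spaces, such as Chinese, character n-grams would be a more suitable unit than `\w+` tokens. All names and the `top_n` value are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_profiles(samples, top_n=50):
    """samples: {language: [sample texts]} -> {language: frequent-term set}."""
    profiles = {}
    for language, texts in samples.items():
        counts = Counter(tok for text in texts for tok in tokenize(text))
        profiles[language] = {term for term, _ in counts.most_common(top_n)}
    return profiles


def detect_language(text, profiles):
    """Pick the language whose frequent terms cover most tokens of the text."""
    tokens = tokenize(text)
    overlap = {
        language: sum(tok in terms for tok in tokens)
        for language, terms in profiles.items()
    }
    best = max(overlap, key=overlap.get)
    return best if overlap[best] > 0 else None   # None: no known term matched
```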

Page 22: Introduction to Search Engines

Indexer

◆ In general, an inverted file is used as the index

◆ The indexing task needs a large amount of data space

◆ For each indexed term, an index list is generated that records the files and locations in which the term appears

◆ The index needs about as much space as the original data, or more
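As a small illustration of an inverted file, the sketch below maps each term to the documents and word positions where it occurs. The nested-dictionary layout and the `\w+` tokenization are assumptions chosen for readability, not the on-disk format a production indexer would use.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """documents: {doc_id: text} -> {term: {doc_id: [word positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)
```

Because every word occurrence produces one entry, the index grows in step with the text, which is consistent with the space estimate above.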

Page 23: Introduction to Search Engines

Indexer - implementation issue

◆ A data filter module is used to cope with different data sources

◆ The inversion module is the kernel module

◆ Needs to scale to continuously growing data sizes

– Hundreds of gigabytes

– Terabytes

◆ Distributed/concurrent indexing

Page 24: Introduction to Search Engines

Indexer - implementation issue

◆ Temporary space minimization

◆ Indexing speed is crucial

◆ Memory can be utilized to improve indexing performance

◆ Hashing and sorting are the key!
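One common way sorting enters the inversion step is sort-based inversion: postings are collected in bounded-size runs, each run is sorted in memory, and the sorted runs are then merged. The sketch below keeps the runs in memory for brevity, whereas a real indexer would write each run to temporary disk space. The function names and `run_size` are assumptions, and this is not claimed to be the method used in the deck.

```python
import heapq
import re
from itertools import groupby


def sorted_runs(documents, run_size):
    """Yield sorted lists of (term, doc_id) postings, each at most run_size long."""
    run = []
    for doc_id, text in documents:
        for term in set(re.findall(r"\w+", text.lower())):
            run.append((term, doc_id))
            if len(run) >= run_size:
                yield sorted(run)
                run = []
    if run:
        yield sorted(run)


def merge_runs(runs):
    """Merge sorted runs into {term: [doc_ids in ascending order]}."""
    merged = heapq.merge(*runs)
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(merged, key=lambda posting: posting[0])
    }
```

The run size bounds the memory used per run, so a larger memory budget means fewer runs and a cheaper merge.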

Page 25: Introduction to Search Engines

Query Processing

◆ Use a dictionary/stop-list to preprocess the query string

◆ Parse the query into expressions of tokens

◆ Use the index structure to locate the matched documents

◆ Use a TF*IDF-type technique to score the matched documents

◆ Combine the URL scores to rank the result
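The pipeline above can be sketched end to end on a toy index. The stop list, the particular TF*IDF variant (term count times `log(N/df) + 1`), and the linear combination with the URL score through `url_weight` are all assumptions; the slides only say that a "TF*IDF type" score is combined with the URL scores.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}   # illustrative stop list


def tokenize(text):
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def build_index(documents):
    """documents: {doc_id: text} -> {term: Counter({doc_id: term frequency})}."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def search(query, index, n_docs, url_scores=None, url_weight=0.5):
    """Return matching doc_ids, best first."""
    url_scores = url_scores or {}
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term)
        if not postings:
            continue
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    ranked = {d: s + url_weight * url_scores.get(d, 0.0) for d, s in scores.items()}
    return sorted(ranked, key=lambda d: (-ranked[d], d))
```

Without URL scores the document that repeats the query term most often wins; a large enough URL score can reorder the result.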

Page 26: Introduction to Search Engines

Search CGI programs

◆ Search agent CGI:

– Parse the query and fork a searcher process to do the search (or use IPC to query the searcher)

– When the searcher returns, analyze and process the result for formatted output

– Process the result and store it in a temporary result store

– Log the query and some status information

◆ CGI for view-next-page

◆ showmatch CGI

Page 27: Introduction to Search Engines

Output control

◆ Site grouping: group the pages from the same website together

◆ Title grouping: group the pages with similar titles

◆ Sort the output according to certain criteria
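Site grouping can be illustrated with a few lines that bucket a ranked result list by host name while keeping the original ranking order inside and across groups. Using the URL's network location as the definition of "same website" is an assumption; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls):
    """Group ranked URLs by host; groups appear in order of their best hit."""
    groups = {}
    for url in ranked_urls:
        groups.setdefault(urlsplit(url).netloc, []).append(url)
    return list(groups.values())
```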

Page 28: Introduction to Search Engines

Interaction

◆ Term suggestion:

– Related terms

– Thesaurus

– Term expansion

– Error correction

• phonetic

• spelling
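As one possible illustration of spelling-based error correction, the sketch below suggests vocabulary terms that are close to a query term, using the similarity matching in Python's standard `difflib` module. The vocabulary, the cutoff of 0.75 and the function name are assumptions; phonetic correction would need a different similarity measure.

```python
import difflib


def suggest_terms(term, vocabulary, max_suggestions=3):
    """Return vocabulary terms similar to a term that is not in the vocabulary."""
    if term in vocabulary:
        return []                       # known term, nothing to correct
    return difflib.get_close_matches(
        term, sorted(vocabulary), n=max_suggestions, cutoff=0.75
    )
```

In practice the vocabulary would come from the index's term dictionary or from the query log.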

Page 29: Introduction to Search Engines

Personalization

◆ Keep track of a user's interests so that the search results can be tuned to improve user satisfaction

◆ Query tracking and classification

Page 30: Introduction to Search Engines

Service tools

◆ A query cache improves search performance for queries that have already been served

◆ Use a memory-cached file system to reduce disk access overhead

◆ A mechanism for special-case handling

◆ Log analyzer
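The query cache in the first bullet could be sketched as a small least-recently-used cache placed in front of the searcher. The LRU eviction policy, the capacity, the whitespace and case normalization of the key, and the class name are assumptions made for illustration.

```python
from collections import OrderedDict


class QueryCache:
    """Serves repeated queries from memory; evicts the least recently used."""

    def __init__(self, search_fn, capacity=1000):
        self.search_fn = search_fn      # the real (slow) search function
        self.capacity = capacity
        self.hits = 0
        self._cache = OrderedDict()

    def query(self, text):
        key = " ".join(text.lower().split())    # normalize case and spacing
        if key in self._cache:
            self._cache.move_to_end(key)         # mark as most recently used
            self.hits += 1
            return self._cache[key]
        result = self.search_fn(key)
        self._cache[key] = result
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # drop the oldest entry
        return result
```

The `hits` counter gives the log analyzer a simple number for judging whether the cache is worth its memory.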

Page 31: Introduction to Search Engines

Research Issues

◆ Hidden web data collection

◆ Distributed index/search

◆ Index minimization, incremental indexing

◆ Smart robot

◆ Intelligent retrieval

◆ Automatic classification/clustering of output results

◆ Data source clustering/classification

– Classifying/clustering the whole web

Page 32: Introduction to Search Engines

Conclusion

◆ Size does matter

◆ We are still searching for a better engine!