Introduction to Search Engines

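Associate Professor 吳昇, Graduate Institute of Computer Science and Information Engineering, National Chung Cheng University ([email protected])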

What is a search engine?

◆ A web service site that lets Internet users find information on the Internet.

◆ The software that provides the web search service.

Uses of search engines

◆ Search for the url of a company/website

◆ Look for the contact info about a person or an organization

◆ Search for information related to a term, e.g., to collect information about the Formosan landlocked salmon (櫻花鉤吻鮭)

◆ Look for news regarding XXX

◆ Treat the search engine as a big dictionary

◆ …...

Types of search engines

◆ Directory browse/search

◆ Web page search

◆ USENET news search

◆ FTP search

◆ People/organization search

◆ Daily-life information search

◆ Library search

◆ Commercial product search

Example search engines

◆ Yahoo,

◆ Google,

◆ AltaVista,

◆ MSN,

◆ Excite,

◆ Lycos, ...

◆ YAM, Kimo, PCHome

◆ GAIS, Openfind, ...

◆ DejaNews,

◆ Archie, ...

Portal Services

◆ Directory / Search

◆ Daily information: Weather, Maps, TV, ...

◆ Free email, free web pages, calendar

◆ Personalized services, channel subscription

◆ Web Chat,

◆ E-Commerce,

◆ Content Aggregation

◆ …...

Directory implementation

◆ Each URL entry is stored as a record

◆ The URL data is managed by a database system

◆ A search function is provided for searching the data in the directory tree

Directory implementation (continued)

◆ The search is in general for locating a website or a category of websites.

◆ Data is entered through manual registration by the website owner or by a web surfer

◆ Managing the directory tree requires labor-intensive work by people who are familiar with the relevant domain knowledge

The Advantages/Disadvantages of Directory search engine

◆ Advantages
– The data is manually maintained, and thus contains less noise and is more precise.
– The output of a search can be categorized and can be more organized
– Can support search within a category

The Advantages/Disadvantages of Directory search engine (continued)

◆ Disadvantages:
– The data coverage is limited, so the search sometimes cannot find what is wanted
– Does not support relevance ranking
– Labor intensive

Implementation of Webpage search engine

1. Feature consideration

2. Data Gathering

3. Data Preprocessing

4. Data Indexing

5. Query Processing

6. Interaction

7. Service tools

8. Personalization

Requirements for WebPage search engines

0. The quality of the search results in a search engine basically depends on
– a. the quality of the underlying data
– b. the search techniques, such as ranking techniques

1. Data coverage should be large enough

2. Data needs to be filtered, e.g., by removing redundant pages

3. Full-text search capability should be provided

4. A relevance ranking mechanism should be provided

5. Search speed should be fast enough

6. Search features

That is, the evaluation points are: quality, speed, scale, robustness, features.


Data Gatherer

◆ Also known as spider, crawler, robot, ...

◆ Periodically traverses the web to collect web pages

◆ Needs a list manager to decide which pages to collect and when

◆ Needs a link analyzer to generate the list of new URLs

◆ Needs to decide what to collect and what not to.

Data Gatherer (continued)

◆ Fetching a file over the HTTP protocol is the basic function

◆ A web page parser module extracts link information from each retrieved page

◆ A URL bank manager module manages the URLs to be fetched

◆ A robot-controller module manages the data collection using multiple clients
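
The parser and URL bank modules listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `LinkExtractor` and `UrlBank`, the example.com URLs, and the sample page are hypothetical, and the HTTP fetch itself is left out so that the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Web page parser: collects absolute URLs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Drop the #fragment so the same page is not queued twice.
                self.links.append(urldefrag(absolute).url)


class UrlBank:
    """URL bank manager: FIFO queue of URLs to fetch, each queued only once."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def next_url(self):
        return self.pending.popleft() if self.pending else None


# Hypothetical page content standing in for a fetched document.
page = '<a href="/about.html">About</a> <a href="news.html#top">News</a>'
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)

bank = UrlBank()
bank.add("http://example.com/index.html")
for link in extractor.links:
    bank.add(link)
```

Because the bank is a FIFO queue, taking URLs from it in order gives a breadth-first traversal; a robot controller would hand these URLs out to several fetching clients.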

Issues of Robot

◆ Site-based vs. URL-based
– Site-based is popular, e.g., wget, teleport
– robots.txt is easier to implement in a site-based robot
– A URL-based robot is more appropriate for large-scale search engines

◆ Retrieval schedule: BFS (breadth-first search) is better

◆ Incremental retrieval
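
As a rough illustration of the robots.txt point, the sketch below checks candidate URLs against one site's rules using Python's standard `urllib.robotparser`. The robots.txt content, the robot name `demo-robot`, and the example.com URLs are made up for the example; a real robot would download the file from the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for one site; a site-based robot only has
# to load this once before collecting all pages of that site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rules = RobotFileParser()
rules.parse(robots_lines)

candidates = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
]
# Keep only the URLs this robot is allowed to fetch.
allowed = [url for url in candidates if rules.can_fetch("demo-robot", url)]
```

A URL-based robot sees URLs from many sites interleaved, so it would presumably need to keep one such rule set per site, which is why the check is simpler in the site-based case.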

Robot Issues

◆ What to gather and what not to?

◆ Hidden web data collection

◆ Focused crawling
– targeting specialized content of web pages
– suitable for special search engines
– evaluated by precision and recall
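
Precision and recall for a focused crawl can be computed as in the minimal sketch below. The function name and the page identifiers are hypothetical; it assumes the set of truly on-topic pages is known, which in practice has to be estimated.

```python
def precision_recall(collected, relevant):
    """Precision: share of collected pages that are on topic.
    Recall: share of on-topic pages that were collected."""
    collected, relevant = set(collected), set(relevant)
    hits = len(collected & relevant)
    precision = hits / len(collected) if collected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page identifiers.
collected_pages = {"a", "b", "c", "d"}
on_topic_pages = {"b", "c", "e"}
precision, recall = precision_recall(collected_pages, on_topic_pages)
```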

Data Preprocessing

◆ Remove redundant pages

◆ Transform each page into the internal data format.

◆ Perform web cross-link analysis to generate a URL databank.

◆ Filter out data that should not be indexed

◆ Partition the data space*

Redundancy removal

◆ 15% to 20% of web pages are replicated on different websites, e.g., tutorials for Java, Perl, Python, …

◆ Can be implemented by partitioned hashing or external sorting
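
One way to read "partitioned hashing" is sketched below: pages are assigned to partitions by a hash of their content, so identical pages always meet in the same partition and each partition can be deduplicated on its own. This is an assumption about the technique, not the author's implementation; the function name, the use of SHA-1, and the sample URLs are illustrative, and only exact duplicates are detected.

```python
import hashlib
from collections import defaultdict


def remove_duplicates(pages, num_partitions=4):
    """pages: dict of url -> page text. Returns the sorted URLs to keep."""
    # Pass 1: assign each page to a partition by its content hash, so that
    # identical pages always land in the same partition.
    partitions = defaultdict(list)
    for url, text in pages.items():
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        partitions[int(digest, 16) % num_partitions].append((digest, url))

    # Pass 2: deduplicate each partition independently; the first URL seen
    # for a given content hash is the one that is kept.
    kept = []
    for bucket in partitions.values():
        seen = set()
        for digest, url in bucket:
            if digest not in seen:
                seen.add(digest)
                kept.append(url)
    return sorted(kept)


# Hypothetical pages: the second one is a mirror of the first.
pages = {
    "http://a.example/java.html": "Java tutorial",
    "http://b.example/mirror/java.html": "Java tutorial",
    "http://a.example/perl.html": "Perl tutorial",
}
unique_urls = remove_duplicates(pages)
```

With data too large for memory, each partition would be written to its own file in pass 1 and processed separately in pass 2.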

Ranking the URLs

◆ Link analysis is done to count the mutual references between web pages

◆ A URL that receives a higher number of references gets a higher score
– weighted link
– discount internal links, such as "back to home"

◆ Order the web pages by score, so that a page with a higher rank has a lower ID
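
The scoring and ID assignment described above might look like the following sketch. The weight of 0.1 for internal links, the function name `rank_urls`, and the sample links are assumptions chosen for illustration; the slides do not give concrete weights.

```python
from collections import defaultdict
from urllib.parse import urlparse


def rank_urls(links, internal_weight=0.1):
    """links: list of (source_url, target_url). Returns {url: page_id}."""
    scores = defaultdict(float)
    for source, target in links:
        scores[source] += 0.0  # make sure every page gets a score
        same_site = urlparse(source).netloc == urlparse(target).netloc
        # Discount internal links such as "back to home".
        scores[target] += internal_weight if same_site else 1.0
    # Higher score first; ties are broken by URL for a stable order.
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return {url: page_id for page_id, url in enumerate(ordered)}


# Hypothetical link graph.
links = [
    ("http://a.example/", "http://b.example/"),
    ("http://c.example/", "http://b.example/"),
    ("http://b.example/page", "http://b.example/"),  # internal link
    ("http://a.example/", "http://c.example/"),
]
page_ids = rank_urls(links)
```

Giving better-ranked pages lower IDs means that an index list sorted by ID is already roughly sorted by URL score.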

Data Partition

◆ The data is partitioned by language type

◆ The language partition can be done as follows:
– for each known language, collect a certain number of web pages in that language
– build a set of high-frequency terms for each language from the analysis of the sample data
– determine the language type of a page by term analysis
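
The three steps above can be illustrated with a small sketch. The function names, the tiny sample texts, and the whitespace tokenization are all assumptions; languages written without spaces, such as Chinese, would need a different tokenizer.

```python
from collections import Counter


def build_profile(sample_texts, top_n=50):
    """Return the set of the most frequent terms in the sample pages."""
    counts = Counter()
    for text in sample_texts:
        counts.update(text.lower().split())
    return {term for term, _ in counts.most_common(top_n)}


def detect_language(text, profiles):
    """Pick the language whose frequent-term set covers the most tokens."""
    tokens = text.lower().split()
    hits = {lang: sum(token in terms for token in tokens)
            for lang, terms in profiles.items()}
    return max(hits, key=hits.get)


# Tiny hypothetical samples; real profiles would be built from many pages.
profiles = {
    "english": build_profile(["the cat is on the mat",
                              "this is the index of the site"]),
    "german": build_profile(["der hund ist auf der matte",
                             "das ist der index der seite"]),
}
```

For example, `detect_language("der index ist klein", profiles)` picks `"german"` because three of its tokens appear in the German term set and only one in the English set.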

Indexer

◆ In general, an inverted file is used to generate the index

◆ The indexing task needs a large amount of data space.

◆ For each indexed term, an index list is generated to record the files/locations in which the term appears.

◆ The index needs about the same space as the original data, or more
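
A minimal in-memory sketch of an inverted file is shown below: for each term it records the documents and word positions where the term appears. The function name, the sample documents, and the whitespace tokenization are assumptions for illustration only.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: dict of doc_id -> text.
    Returns term -> list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for position, term in enumerate(documents[doc_id].lower().split()):
            index[term].append((doc_id, position))
    return index


# Hypothetical documents.
docs = {
    1: "search engine index",
    2: "web search",
}
index = build_inverted_index(docs)
```

Since every occurrence of every term produces one entry, it is easy to see why the index can take as much space as the original data.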

Indexer - implementation issues

◆ A data filter module is used to cope with different data sources

◆ The inversion module is the kernel module

◆ Needs to be scalable to handle continuously growing data sizes
– Hundreds of gigabytes
– Terabytes

◆ Distributed/concurrent indexing

Indexer - implementation issues (continued)

◆ Temporary space minimization

◆ Index speed is crucial

◆ Memory can be utilized to improve the indexing performance

◆ Hashing and sorting are the key!
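
To show why sorting matters for inversion, here is a hedged sketch of a sort-based approach: (term, document) pairs are collected into bounded runs, each run is sorted, and the sorted runs are merged. The function names and the run size are hypothetical, and the runs are kept in memory here, whereas a large-scale indexer would write each run to temporary disk space.

```python
import heapq
from itertools import groupby


def sorted_runs(documents, run_size):
    """Yield sorted runs of (term, doc_id) pairs, each at most run_size long."""
    run = []
    for doc_id, text in documents:
        for term in text.lower().split():
            run.append((term, doc_id))
            if len(run) >= run_size:
                yield sorted(run)
                run = []
    if run:
        yield sorted(run)


def invert(documents, run_size=4):
    """Merge the sorted runs and group them into term -> sorted doc_id list."""
    merged = heapq.merge(*sorted_runs(documents, run_size))
    index = {}
    for term, pairs in groupby(merged, key=lambda pair: pair[0]):
        # A set removes repeats when a term occurs twice in one document.
        index[term] = sorted({doc_id for _, doc_id in pairs})
    return index


# Hypothetical documents as (doc_id, text) pairs.
documents = [(1, "web search engine"), (2, "search the web"), (3, "engine index")]
inverted = invert(documents)
```

The run size bounds how much memory is used at once, which is the trade-off between memory use and temporary space that the bullets above point at.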

Query Processing

◆ Use a dictionary/stop-list to preprocess the query string

◆ Parse the query into expressions of tokens

◆ Use the index structure to locate the matched documents

◆ Use a TF*IDF-type technique to score the matched documents

◆ Combine with the URL scores to rank the result
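
The steps above can be put together in a small sketch. Several choices here are assumptions rather than anything stated in the slides: the stop list, the `log(N/df) + 1` form of IDF, and combining the text score with the URL score by multiplication. All names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

STOP_LIST = {"the", "a", "of", "for"}  # hypothetical stop list


def build_index(documents):
    """documents: doc_id -> text. Returns term -> Counter of doc_id -> TF."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index


def search(query, index, num_docs, url_scores):
    """Return matching doc_ids, best first."""
    # Preprocess: lowercase, split into tokens, drop stop words.
    terms = [t for t in query.lower().split() if t not in STOP_LIST]
    text_scores = Counter()
    for term in terms:
        postings = index.get(term)  # locate matched documents via the index
        if not postings:
            continue
        idf = math.log(num_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            text_scores[doc_id] += tf * idf
    # Combine the TF*IDF score with the URL score (here: a simple product).
    combined = {d: s * url_scores.get(d, 1.0) for d, s in text_scores.items()}
    return sorted(combined, key=lambda d: (-combined[d], d))


# Hypothetical documents and URL scores.
documents = {
    1: "search engine for the web",
    2: "search search tips",
    3: "cooking tips",
}
url_scores = {1: 3.0, 2: 1.0, 3: 1.0}
index = build_index(documents)
ranked = search("the search engine", index, len(documents), url_scores)
```

Document 1 ranks first here because it matches both query terms and has the higher URL score.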

Search CGI programs

◆ Search agent CGI:
– parse the query and fork a searcher process to do the search (or use IPC to query the searcher)
– when the searcher returns, analyze and process the result for formatted output
– process the result and store it in a temporary result store
– log the query and some status info

◆ CGI for view-next-page

◆ showmatch CGI

Output control

◆ Site grouping: group the pages from the same website together

◆ Title grouping: group the pages with similar titles

◆ Sort the output according to certain criteria
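
Site grouping can be sketched as follows, assuming that "same website" means "same host name" and that the input list is already in rank order. The function name and URLs are hypothetical.

```python
from urllib.parse import urlparse


def group_by_site(ranked_urls):
    """Group a ranked URL list by host, keeping the original rank order."""
    groups = {}
    for url in ranked_urls:
        groups.setdefault(urlparse(url).netloc, []).append(url)
    return list(groups.items())


# Hypothetical ranked result list.
results = [
    "http://a.example/x",
    "http://b.example/y",
    "http://a.example/z",
]
grouped = group_by_site(results)
```

Each site appears at the position of its best-ranked page, with its other pages listed under it.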

Interaction

◆ Term suggestion:
– Related terms
– Thesaurus
– Term expansion
– Error correction
• phonetic
• spelling
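
Spelling correction can be approximated with Python's standard `difflib`, as in the sketch below. The vocabulary and the cutoff of 0.7 are assumptions; in practice the vocabulary would likely come from the indexed terms or the query log. Phonetic correction is not covered by this sketch.

```python
import difflib

# Hypothetical vocabulary of known terms.
VOCABULARY = ["search", "engine", "index", "crawler", "ranking"]


def suggest(term, max_suggestions=3):
    """Return vocabulary terms that are close in spelling to the query term."""
    return difflib.get_close_matches(
        term.lower(), VOCABULARY, n=max_suggestions, cutoff=0.7
    )
```

For example, `suggest("serch")` proposes `"search"`, and a term with no close neighbor yields an empty list.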

Personalization

◆ Keep track of a user's interests so that the search results can be tuned to better satisfy that user

◆ Query Tracking and classification

Service tools

◆ A query cache to improve search performance for queries that have already been served.

◆ Use a memory-cached file system to reduce the disk access overhead

◆ Mechanism for special case handling

◆ Log analyzer
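
A query cache could be as simple as the sketch below, which keeps the results of recently served queries and evicts the least recently used one when full. The class name, the capacity, and the LRU eviction policy are assumptions; the slides do not say which policy is used.

```python
from collections import OrderedDict


class QueryCache:
    """Keeps the results of the most recently served queries in memory."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)  # mark as most recently used
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


# Hypothetical usage with a very small cache.
cache = QueryCache(capacity=2)
cache.put("search engine", [1, 2])
cache.put("web crawler", [3])
cache.get("search engine")      # refreshes this entry
cache.put("inverted file", [4])  # evicts "web crawler"
```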

Research Issues

◆ Hidden Web data collection

◆ Distributed index/search

◆ Index minimization, incremental Indexing

◆ Smart robot

◆ Intelligent Retrieval

◆ Output result auto classification/clustering

◆ Data source clustering/classification
– classifying/clustering the whole web

Conclusion

◆ Size does matter

◆ We are still searching for a better engine!
