moogle: a model search engine - springer

11
اﻻدارﻳﺔ ﻟﻠﺘﻨﻤﻴﺔ اﻟﻌﺮﺑﻴﺔ اﻟﻤﻨﻈﻤﺔ اﻟﻌــﺮﺑﻴـ اﻟﺪول ﺟـﺎﻡﻌــــﺔ ـ ـﺔ ﻥﺪوة اﻹﻧﺘﺮﻧﺖ إدارة ﻋﻠﻰ اﻟﺒﺤﺚ ﻣﺤﺮآﺎت ﺗﺄﺛﻴﺮ وﺗﻄﺒﻴﻘﻴﺔ ﻋﻤﻠﻴﺔ ﺡﺎﻻت ﻋﻤﻞ وورﺷﺔ31 / 7 - 4 / 8 / 2005 اﻹﺳﻜﻨﺪرﻳﺔ- اﻟﻌﺮﺑﻴﺔ ﻡﺼﺮ ﺟﻤﻬﻮرﻳﺔ اﻟﻤﻨﺘﺰة ﺷﻴﺮاﺕﻮن ﻓﻨﺪق ﻋﻦ ورﻗﺔA Model for Multilingual Search Engine أ. اﻟﻄﻴﺐ اﻟﺤﻤﻴﺪ ﻋﺒﺪ وﺟﺪان اﻟﺘﻜﻨﻮﻟﻮﺟﻴﺎ و ﻟﻠﻌﻠﻮم اﻟﺴﻮدان ﺟﺎﻡﻌﺔAbstract How to find needed information from the web is a critical issue in the Internet. Fortunately, search engines are useful tool to retrieve information from Internet. Although, Internet users speak different languages most of resources are written and published in the English. All Internet search engines provide a lingual search, although some of them enable the searcher to select the language of the results, Internet users need a multilingual search engine, which can give them results in English language and in their own language. This paper investigates multilingual search and develops a multilingual search engine model using UML. The purpose of

Upload: beachboysam

Post on 11-Apr-2015

14 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: MOOGLE: A Model Search Engine - Springer

المنظمة العربية للتنمية االدارية ـةـجـامعــــة الدول العــربيـ

ندوة

تأثير محرآات البحث على إدارة اإلنترنت وورشة عمل حاالت عملية وتطبيقية

31/7-4/8/2005

جمهورية مصر العربية-اإلسكندرية فندق شيراتون المنتزة

ورقة عن

A Model for Multilingual Search Engine وجدان عبد الحميد الطيب.أ

جامعة السودان للعلوم و التكنولوجيا

Abstract How to find needed information from the web is a critical issue in the Internet. Fortunately, search

engines are useful tool to retrieve information from Internet.

Although, Internet users speak different languages most of resources are written and published in the

English. All Internet search engines provide a lingual search, although some of them enable the

searcher to select the language of the results, Internet users need a multilingual search engine, which

can give them results in English language and in their own language. This paper investigates

multilingual search and develops a multilingual search engine model using UML. The purpose of

Page 2: MOOGLE: A Model Search Engine - Springer

2

multilingual search model is to help people finding useful contents stored in multi-languages databases.

The Paper examines the case of English and Arabic languages.

1- Introduction A great number of search engines can be used to retrieve relevant information on the Internet. These

engines though have optimum search performance and quality do not satisfy non-English users, since

they do not support other languages within English. Internet users may not have English skills and need

to find information in their mother languages. Search

engines index English resources more than any other language.

This paper focuses on multilingual information retrieval to help searchers find useful contents and

documents in their mother languages as well as English.

The paper gives a general overview of search engines and multilingual search. Optimizing query

execution and caching query results are important factors that affect search processes.

There are some differences in search process in Internet search engines and intranet search engines.

The paper attempts to solve the multilingual search problem and develops a UML model which,

describes system classes and their relations through the class model, proposes functionality of the

system defined with use case model. The business process is described in activity diagrams, and

interactions between object instances at run-time are shown in sequence diagrams.

2- Search Engines Search engines are useful tools of the Internet, since they enable people to explore the Internet and find

what they are looking for according to a specific topic or query.

Search engines must be up-to-date. So the critical problem that faces them is updating their databases.

Daily there are many new web pages published, more web pages updated and some webpages deleted.

In general, search services on the Internet derive from two basic paradigms, directory services and

query-based search engines.

2.1 How Search Engines Work? Web search engines work by storing information about a large number of web pages, which they

retrieve from the WWW itself. These pages are retrieved by a web crawler, an automated web browser

which follows every link it sees. The contents of each page are then analyzed to determine how it

should be indexed (for example, words are extracted from the titles, headings, or special fields called

meta tags). Data about web pages is stored in an index database for use in later queries. Some search

engines, such as Google, store all or part of the source page (referred to as a cache) as well as

information about the web pages.

When a user accesses a search engine and makes a query, typically by giving keywords, the engine

looks up the index and provides a listing of best-matching web pages according to its criteria, usually

with a short summary containing the document's title and sometimes parts of the text. [3]

2.2 Search Engine Components Search engines are composed of three basic components: crawler, indexer, and query engines. The

crawler (also called spider) gathers web pages from the Internet. The indexer examines each page,

Page 3: MOOGLE: A Model Search Engine - Springer

3

extracts useful keywords and creates an index. The query engine maps keywords against the index,

ranks the results and returns them to the user. [10]

2.3 Optimizing Query Execution

2.3.1 Caching Query Results Caching of Search Engine Query Results is to reduce the computing and I/O requirements needed to

support the functionality of a search engine. Performance evaluation of caching query results can be

measured by comparing the effectiveness of several cache replacement algorithms:

• LRU (Least Recently Used). • FBR (Frequency-Based Replacement). • LRU/2 (Least Recently Used). • SLRU(size and LRU).

2.3.1.1 The Limitation of Caching

Although the hit rate increases with cache size, it reaches a maximum at 25%. %. A study of a very

large Alta-Vista trace suggests that the maximum hit rate is close to 75%. So, when caches are warmed

up, their effectiveness increases with the size of the trace and may reach close to 30%, or to put it

simply, (roughly) one out of three queries submitted will be served from the cache. [7]

2.3.2 Improving Query Throughput The following are some of the major ways for improving query throughput:

• Compressing inverted lists.

• Optimizing intersection computation.

- Avoiding scanning entire inverted lists of query terms.

• Pruning during query processing or indexing.

- Computing top-k without looking at the whole list or removing low-value entries entirely

from index.

• Caching

- Result caching: if repeated query, use cached old result

List caching: keep frequently used lists in main memory, or keep entire index in main

memory.

2.3.3 Improving Websites Ranking. [10]

3- Multilingual Search More people on the Internet are non-English speaking and web documents must be found in other

languages besides English. Multilingual search engine is strongly required for users to be able to

retrieve information from multilingual databases.

Recently, the Internet is positioned to be an international mechanism for communications and

information exchange. It is indispensable to develop systems that can efficiently search, index and

retrieve multilingual information.

4-Information Retrieval

Page 4: MOOGLE: A Model Search Engine - Springer

4

The goal of an Information Retrieval (IR) system is to select the documents relevant to a given

information need out of a document database. It is difficult to find out relevant information from a huge

information mass like Internet. Traditional approaches to IR, which are popular in current search

engines, use direct keyword matching between documents and query representations in order to select

relevant documents. The main problem is that if a document is described by a keyword different from

those given in a query, then the document cannot be selected although it may be highly related.

Recently, the most adequate relevance estimation should be based on inference rather than direct

keyword matching. That is, the relevance relationship between a document and a query should be

inferred using available knowledge. This inference, however, cannot be performed with complete

certainty in the concept of relevance. In IR, uncertainty is always associated with the inference process.

In order to deal with this uncertainty, probability theory has been a commonly used tool in IR.

Probabilistic models usually attempt to determine the relationship between a document and a query

through a set of terms. The main method adopted by probability theory to determine the degree of

relevance among terms is by considering term co-occurrences in the document collection. [11]

5- Search Engines Challenges

The following are among the most challenges facing the search engines:

• The web is growing much faster than any present-technology search engine can

possibly index.

• Many web pages are updated frequently, which forces the search engine to revisit

them periodically.

• The queries one can make are currently limited to searching for keywords.

• Dynamically generated sites, which may be slow or difficult to index, or may result

in excessive results from a single site.

• Many dynamically generated sites are not indexable by search engines; this

phenomenon is known as the invisible web.

• Some search engines do not order the results by relevance, but rather according to

how much money the sites have paid them.

• Some sites use tricks to manipulate the search engine to display them as the first

result returned for some keywords. This can lead to some search results being

polluted, with more relevant links being pushed down in the result list. [1]

6- Search Engines Problems The following are some of the problems that faced search engines:

• Search engine coverage has decreased:

Page 5: MOOGLE: A Model Search Engine - Springer

5

Search engines indexing capability is increasingly falling behind the fast growth of the Web.

Relative to the estimated size of the publicly indexable web contents, search engine coverage has

decreased substantially since 1997.

• Unequal access:

Web search engines typically index a biased sample of web. Search engines either follow the links

(i.e. URL) to find new pages, or by analyzing user registration. They are typically more likely to

index sites that have more links to them (more 'popular' sites). They are also typically more likely

to index commercial sites than educational sites.

• Out of date:

By looking at the percentage of documents reported by each engine that are no longer valid

(because the page has moved or no longer exists), and age of new documents found that match

given queries, it was found that the mean age of a matching page when indexed by the first of the

nine engines is 186 days.

• Low metadata use:

Many search engines index web information based on web page defined metadata. However, the

simple HTML "keywords" and "description" meta-tags are only used on the homepages of 34% of

sites. Only 0.3% of sites use the Core meta-data standard. [5]

7- Search Engine Cloaking Cloaking is a search engine optimization technique where

one version of the page is delivered to the search engine robot and a different version is delivered from

the same address to the human visitor. This is done by tracking whether a browser or a spider has made

the request.

There are several reasons for using search engine cloaking technology:

• Hiding the aggressively optimized website from the human visitor.

• Custom language delivery.

• IP based delivery for broadband users.

Cloaking is a program that resides in the server and detects requests to the web page. It works by

identifying the search engine crawlers by their IP address, or user-agent string. The use of IP address is

more secure and is a much better solution. This, however, requires an updated database of known

spider IPs. It takes a lot of time to create this database and to keep it up-to-date.

How does cloaking work? When a request comes to browse a website, the cloaking script picks up the

IP address and searches for it in the IPs database. If it finds a match, the request has been made from a

search engine. If not a match , the request came from a browser.

8- Search Engine Spamming

Page 6: MOOGLE: A Model Search Engine - Springer

6

Search engine spamming is the technique used to improve the position of a web site in a search engine.

Some website owners use spamming techniques to manipulate their positions in search engine

rankings, and thereby try to fool the search engines. Each search engine's objective is to produce

relevant results. This is because; producing the most relevant results of any particular search query is

the determining factor of being a successful search engine.

8.1 Spam Techniques Classification

spamming techniques can be classified as follows:

• Content Spam (Invisible Text, Keyword Stuffing).

• Meta Spam (Unrelated Keywords, Hidden Tags).

• Identical Pages

• Code Swapping

• Page Redirects

• No content

• Over Submitting

• Agent-Based

• Cloaking [9]

9- Intranet Search Engines As intranets grow, providing access to more and more documents, their value grows. An intranet search

engine helps locate information, including archives and unstructured data. Search engines need to be

tuned and indexed to provide the best answers. [8]

10- Intranets Search Engine vs. Internet Search Engine While intranets have the same basic overall structure as the Internet, there are several qualitative

differences between them. These differences stem from the fact that the content generation processes

for them are fundamentally different. While the Internet tends to grow more democratically, content

generation in intranets often tends to be autocratic. Here are the main differences between them:

• Intranet documents are often created for simple dissemination of information, rather

than to attract and hold the attention of any specific group of users.

• A large fraction of queries tend to have a small set of correct answers.

• Intranets are essentially spam free.

• Large portions of intranets are not search-engine-friendly. They are usually accessed

through database queries, and other specialized interfaces.

• It is difficult to obtain unbiased data of sufficient quantity to draw conclusions from.

The problem arises when working for very big corporations that have a very

heterogeneous intranet. [2]

Page 7: MOOGLE: A Model Search Engine - Springer

7

11- Multilingual Search Engine Model The tool we use here for the designing process of the model is UML. The main design phases are

logical model, use case model, and dynamic model.

Logical model is a static view of the objects and classes. The Class model shows the search engine

classes and their attributes, methods or operations, and relationships.

Use case model describes the proposed functionality of the system. Use case views show the main

operations of the system.

Dynamic model describes the dynamic aspects of software systems. They are represented by activity

and sequence diagrams. Activity diagrams describe the business process, while sequence diagrams

show interactions between objects at run-time.

11.1 Class Model

Figure 1: Class Diagram[13]

11.2 Use Case Model

Figure 2: Searching Use case

u d s e a r c h i n g

B r o w s e r S e a r c h E n g i n eS e r v e r

I n d e x S e r v e r

Q u e r y E n g i n e

C r a w l e r

S e a r c h i n g

R e q u e t i n gP r o c e s s i n g Q u e r y

P a r s i n g Q u e r y S e a r c h i n g I n d e x

R e l a v a n c y C h e c k

R a n k i n g C h e c k

I n d e x i n g

D o w n l o a d i n g P a g eR e d o w n l o a d i n g P a g e

Tr a n s l a t i n g K e y w o r d

D i s p l a y R e s u l t s

« i n c l u d e »

« i n c l u d e »

« i n c l u d e »

« i n c l u d e »

« e x t e n d »

« i n c l u d e »

« i n c l u d e »

« i n c l u d e »« i n c l u d e »

« e x t e n d »

« i n c l u d e »

cd Class Diagram

Server

- serverID: Integer- OS: Variant- IPAdress: Variant

WebServ er

+ ResponseForIndex(plainMsg) : void+ sendHeadResponse(metaInformation) : String+ SendResource() : web page

URLServ er

+ Send() : URL+ SaveURL() : void

IndexServ er

+ Store(webpage) : boolean+ SortIndex() : void

Crawler

Browser

+ Create(query) : void+ Display(results) : Results

SearchEngine

QueryEngine

Results

- Auto_NO: Integer- URL: String- Summary: String

+ «display» NumberOfResultsPerPage(Integer*) : Results

Page

- URL: string- Summary: string

Dictionary

used_to_translate

store_on

consist_of

displayed

Request(URLs)

crawl{Modify(page)}

Store(Page)

is a

is a

is a

u d C r a w l

C r a w l e rS e r v e r

I n d e x S e r v e r W e b S e r v e r U R L S e r v e r

C r a w l

Page 8: MOOGLE: A Model Search Engine - Springer

8

Figure 3: Crawl Use case

11.3 Activity Diagrams

Figure 4: Crawling Activity Diagram 11.4 Sequence Diagrams

Figure 5: Crawling Sequence Diagram

sd Crawl a Page

Crawler URLServer IndexServer WebServer

Request(URLs)

Send (URLs)

Make (URLsQueue )

Fetch(page)

check(existance)

[if existed page(send information)]: sendMsg(ExistPage)

response(PageInformation)

[to check update]: request(PageInformation)

{if web server found }response(PageInformation)

compare(WebServer response,IndexServer response)

[if updated]: record(PageProcessed)

[if page not update]: insert(PageInformation)

[if not exist page]: sendMsg(Not exist)

index(PageInformation)

ad Crawls page

Start

Request URLs

Build URLs queue

Take a URL

Record URL information Recorod page informationChoose crawl algorithm

Add links to the queue

Compute page rank

Record page isprocessed

End

[Processed]

[Not processed]

[Processed]

[Go to next URL]

[Go to next URL]

Page 9: MOOGLE: A Model Search Engine - Springer

9

12- Related Work 12.1 MULINEX Multilingual Search Engine MULINX is a multilingual search engine which is available in German, French and English, was

developed in 1997. The goal of MULINX development is to find techniques for the effective retrieval

of multilingual documents from the Internet. [6]

12.2 Chinese_English Search Engine This is a Chinese Academy Sciences project. It has been accomplished in June 1998. The project

provides a solution to the multilingual search problems. The languages accepted are simplified Chinese

and English.

The search engine provides Chinese and English indexing, retrieval and searching using English and

Chinese. [12]

12.3 Tamil Search Engine Tamil search engine project was started in 2001. The objective is to search Tamil websites. When the

user enters the text criteria in Tamil, there is a tool that would convert the text into English and then

searches the database and returns the results to the user in Tamil after translating it back from English

to Tamil. [4]

13- Future Works Search engines are an unlimited field to study. Here are recommended areas for future studies:

• The process of query execution and optimization.

• Results caching to improve the search quality and response time.

• More through investigation of crawling algorithms and problems of invisible sites,

dead links, and updated links.

• Ranking algorithms and information retrieval relevancy.

• Browsers encoding and versions and their influence on results displayed to users

(frame pages, javascript, and dynamic pages, graphics).

• The implementation of a bilingual Arabic/English search engines.

• Extension of multilingual search model.

• Automation of translation process from dictionary programs.

14- Conclusion Internet is not only the huge network with millions of computers but it is the place where peoples, can

exchange knowledge, information, culture and languages.

The most important aspect of searching web sites is to enable Internet users to find what they need.

Although Internet users differ in their nationalities and languages yet those sites do not offer a

multilingual search. Most of the search engines support one language search i.e English.

The paper concerns itself with the multilingual search process in an attempt to enable users finding

information they need in Arabic and English. The important points on search engines area are to crawl

the Internet sites, index the resources, give every resource a rank, and check information retrieval

relevancy.

Page 10: MOOGLE: A Model Search Engine - Springer

10

Other important issues for multilingual search are the translation of keywords from one language to the

others and the multilingual database. In this case the keywords and their translation are stored in one

database so on to facilitate the search, and the Arabic and English resources are stored in one database.

This would improve the response time, because there is one connection to one database.

Page 11: MOOGLE: A Model Search Engine - Springer

11

References [1] Brin Sergey, Page Lawrence, The Anatomy of a Large-Scale Hypertextual Web Search Engine,

Accessed on [April 2005], Available on

http://www-db.stanford.edu/pub/papers/google.pdf.

[2] Fagin Ronald et al, Searching the Workplace Web, Accessed on [April 2005], Published on May

20-24, 2003,Available on http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html. [3] Franklin Curt, How Internet Search Engines Work, Accessed on [July 2004], Available on

http://computer.howstuffworks.com/search-engine1.htm.

[4] Krishnan C N, New Tamil search engine, Accessed on [March 2005], Published on Nov 18th,

2004, Available on http://www.chennaionline.com/science/Technology/11tamilsearch.asp

[5] Li Zheng et al, A NEW ARCHITECTURE FOR WEB META-SEARCH ENGINES, Accessed

on [April 2005], Published on 4 22, 2001, Available on

http://web.njit.edu/~zxl8078/Publication/BB040.PDF.

[6] Mulinex, Welcome to MULINEX, your multilingual search engine, Accessed on [March 2005],

Published on 08.02.99, Available on http://mulinex.dfki.de/demo.html.

[7] P. Markatos Evangelos, On Caching Search Engine Query Results, Accessed on [April 2005],

Published on 2000-05-15, Available on

http://www.ics.forth.gr/carv/r-d-activities/wwwPerf/TR241/paper.html

[8] Rappoport Avi, Intranet Search Engines, Accessed on [April 2005], Published on May 1,

2002, Available on

http://www.Intranetstoday.com/Articles/?ArticleID=5064&ContextSubtypeID=56. [9] Search engines optimization Ethics, Search engine spamming, Accessed on [April 2005],

Published on 3 May 2002, Available on http://www.searchengineethics.com/spamming.htm. [10] Suel Torsten, Optimizing Query Execution in Large-Scale Web Search Engines, Accessed on

[April 2005], Available on

http://cis.poly.edu/cs912/scalability6.pdf.

[11] Z. LIU, Y. GAO, AN INTELLIGENT GIS SEARCH ENGINE TO RETRIEVE

INFORMATION FROM INTERNET, ACCESSED ON [APRIL 2005], AVAILABLE ON

HTTP://WWW.DIGITALEARTH.CA/HTML_PAPERS/DE-A-214.HTM.

[12] ZHANG Wen-hui et al, A Multilingual (Chinese, English) Indexing,Retrieval, Searching Search

Engine, Accessed on [March 2005], Available on http://www.isoc.org/inet99/proceedings/posters/210/

[13] Altayib, Wegdan, Search Engines with Particular Reference to Multilingual Search, Msc

,Sudan University of Science and Technology, June 2005.