A Novel And Efficient Approach For Near Duplicate Page Detection In Web Crawling
VIPIN KP (08103066), S7 CSE A
Guided by: Mr. Aneesh M Haneef, Asst. Professor
Department of CSE, MESCE
04/12/2023 2
Presentation Outline
• Introduction
• What are near duplicates
• Drawbacks of near duplicate pages
• What is a Web crawler
• Simplified Crawl Architecture
• Near duplicate detection
• Advantages
• Conclusion
• References
Introduction
Search engines are the main gateways for accessing information on the web.
A search engine operates in the following order: web crawling, indexing, searching.
Web crawling is the process that creates the indexed repository used by the search engine.
The enormous number of documents on the web poses huge challenges for search engines, making their results less relevant to the user.
Introduction cont’d…
Web search engines face additional problems due to near-duplicate web pages.
It is an important requirement for search engines to provide users with relevant results without duplication.
Near-duplicate page detection is a challenging problem.
What are near duplicates?
Near duplicates are not considered "exact duplicates", but are files with minute differences.
They differ slightly in advertisements, counters, timestamps, etc.
Most web sites also share boilerplate code.
What are near duplicates ?
http://shop.asus.co.uk/shop/gb/en-gb/home.aspx
What are near duplicates ?
http://shop.asus.es/shop/gb/en-gb/home.aspx
Drawbacks of Near Duplicate web pages
• Waste network bandwidth
• Increase storage cost
• Affect the quality of search indexes
• Increase the load on the remote host that is serving such web pages
• Affect customer satisfaction
Web Crawler
A Web crawler is a computer program that browses the World Wide Web in an orderly fashion.
Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, and Web robots.
Search engines use web crawlers to create a copy of all the visited pages for later processing: the search engine indexes the downloaded pages to provide fast searches.
This indexed database is then used for the searching process.
A crawler may examine a URL to check whether it ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx, or a slash.
Some crawlers may also avoid requesting any resource that has a "?" in it (a query string).
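This URL examination can be sketched as a small filter. The suffix list below is taken from the slide; the helper name and behavior are illustrative assumptions, not part of the original approach.

```python
from urllib.parse import urlparse

# Suffixes the crawler treats as likely HTML content (list from the slide).
CRAWLABLE_SUFFIXES = (".html", ".htm", ".asp", ".aspx",
                      ".php", ".jsp", ".jspx", "/")

def should_crawl(url: str) -> bool:
    """Return True if the URL looks like a crawlable HTML page."""
    parsed = urlparse(url)
    # Avoid dynamic resources that carry a "?" (query string).
    if parsed.query:
        return False
    path = parsed.path or "/"
    return path.endswith(CRAWLABLE_SUFFIXES)

print(should_crawl("http://example.com/index.html"))   # True
print(should_crawl("http://example.com/search?q=asus"))  # False
```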
Simplified Crawl Architecture
[Diagram: newly-crawled HTML documents from the Web are checked, one document at a time, against the entire index for near-duplicates. If yes, the document is trashed; if no, it is inserted into the Web index and its links are traversed to continue the crawl.]
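The crawl flow can be sketched as a simple loop. Everything here is a stand-in: `fetch`, `is_near_duplicate`, and the index structure are hypothetical placeholders for the real components, shown only to make the insert/trash/traverse cycle concrete.

```python
def crawl(seed_urls, fetch, is_near_duplicate, index):
    """Simplified crawl loop: fetch a document, check it against the
    index, insert or trash it, then traverse its links.
    `fetch(url)` is assumed to return (document_text, outgoing_links)."""
    frontier = list(seed_urls)
    seen = set()
    while frontier:
        url = frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        doc, links = fetch(url)
        if is_near_duplicate(doc, index):
            continue  # trash: do not add to the repository
        index[url] = doc          # insert into the web index
        frontier.extend(links)    # traverse links
    return index
```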
Near Duplicate Detection
The steps involved in this approach are:
• Web document parsing
• Stemming algorithm
• Keyword representation
• Similarity score calculation
Near Duplicate Detection cont’d…
Web Document Parsing:
• Parsing may be as simple as URL extraction or as complex as removing the HTML tags and JavaScript from a web page.
• Stop Word Removal: remove commonly used words such as 'an', 'and', 'the', 'to', 'with', 'by', 'for', etc. This helps reduce the size of the indexing file.
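Stop-word removal can be sketched in a few lines. The stop-word set below is a small illustrative sample (real systems use much larger lists); the function name is an assumption.

```python
# A small illustrative stop-word list (production systems use larger ones).
STOP_WORDS = {"a", "an", "and", "the", "to", "with", "by", "for", "of", "in"}

def remove_stop_words(tokens):
    """Drop common function words to shrink the indexing file."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the crawler parses a web page for keywords".split()
print(remove_stop_words(tokens))  # ['crawler', 'parses', 'web', 'page', 'keywords']
```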
Near Duplicate Detection cont’d…
Stemming Algorithm:
• Stemming is the process of reducing derived words to their stem (base or root form), generally a written word form.
• The relation between a query and a document is determined by the number and frequency of the terms they have in common.
• Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem.
e.g., "connect", "connected", and "connecting" are all condensed to "connect".
Near Duplicate Detection cont’d…
Stemming Algorithm cont'd:
• The prefix removal algorithm removes prefixes such as: anti, bi, co, contra, de, di, des, en, inter, intra, mini, multi, pre, pro, …
• The suffix removal algorithm removes suffixes such as: ly, ness, ioc, iez, able, ance, ary, ce, y, dom, ee, eer, ence, ory, ous, …
• The derivations are converted to stems that are related to the original words in both form and semantics.
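The suffix-removal half of an affix-removal algorithm can be sketched as below. The suffix list and the minimum-stem-length guard are illustrative assumptions (naive prefix stripping would over-strip words like "connect", which starts with "co", so this sketch handles suffixes only); production stemmers such as Porter's use ordered rules and conditions.

```python
# Illustrative suffix list; real stemmers apply ordered rules.
SUFFIXES = ("ing", "ed", "ly", "ness", "able", "ance", "ence", "ous", "s")

def stem(word, min_stem=3):
    """Strip one known suffix, keeping a minimum stem length so that
    short words are not destroyed."""
    w = word.lower()
    for s in SUFFIXES:
        if w.endswith(s) and len(w) - len(s) >= min_stem:
            return w[:-len(s)]
    return w

print([stem(w) for w in ("connect", "connected", "connecting")])
# ['connect', 'connect', 'connect']
```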
Near Duplicate Detection cont'd…
Keyword Representation:
• Keywords and their counts in each crawled page are the result of stemming.
• Keywords are sorted in descending order of their counts.
• The keywords with the highest counts, called prime keywords, are stored in one table; the remaining keywords are indexed and stored in another table.
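The keyword-table construction can be sketched with a counter. The number of prime keywords (`n_prime`) is an assumed parameter, not something the slide specifies.

```python
from collections import Counter

def keyword_tables(tokens, n_prime=3):
    """Count stemmed keywords, sort by descending count, and split them
    into a prime-keyword table and a table of remaining keywords.
    `n_prime` (how many keywords count as prime) is an assumption."""
    counts = Counter(tokens)
    ordered = counts.most_common()      # sorted by descending count
    prime = dict(ordered[:n_prime])
    remaining = dict(ordered[n_prime:])
    return prime, remaining

tokens = ["crawl", "page", "crawl", "index", "page", "crawl", "link"]
prime, remaining = keyword_tables(tokens, n_prime=2)
print(prime)      # {'crawl': 3, 'page': 2}
print(remaining)  # {'index': 1, 'link': 1}
```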
Near Duplicate Detection cont’d…
Similarity Score Calculation:
• If the prime keywords of the new web page do not match the prime keywords of any page in the table, the new page is added to the repository.
• If all the keywords of both pages are the same, the new page is a duplicate.
• If the prime keywords of both pages are the same, a similarity score (SSM) is calculated as follows.
Near Duplicate Detection cont’d…
Table T1 (web page in the repository): keywords and counts
    K1  K2  …  Kn
    C1  C2  …  Cn

Table T2 (new web page): keywords and counts
    K1  K2  …  Kn
    C1  C2  …  Cn

If a keyword ki is present in both tables, let a = Δ[ki] in T1 and b = Δ[ki] in T2 (the counts of ki in the two tables); then, using the formula,

    SDc = log(count(a) / count(b)) * Abs(1 + (a - b))
Near Duplicate Detection cont’d…
• If keywords are present in T1 but not in T2, and NT1 is the set of such keywords, then

    SDT1 = log(count(a)) * Abs(1 + |T2|)

• If keywords are present in T2 but not in T1, and NT2 is the set of such keywords, then

    SDT2 = log(count(b)) * Abs(1 + |T1|)

• The similarity score of one page against another is calculated as

    SSM = ( Σ_{i=1..|NC|} SDc + Σ_{i=1..|NT1|} SDT1 + Σ_{i=1..|NT2|} SDT2 ) / N,

  where NC is the set of common keywords and N = (|T1| + |T2|) / 2.
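A minimal sketch of the similarity-score computation, under two assumptions: that count(a)/count(b) denotes the ratio of a keyword's counts in the two tables, and that the three sums run over the common keywords, the T1-only keywords, and the T2-only keywords respectively.

```python
import math

def ssm(t1, t2):
    """Similarity score between two keyword-count tables (dicts mapping
    keyword -> count), following the formulas above. A sketch under the
    stated interpretation, not a definitive implementation."""
    common  = t1.keys() & t2.keys()
    only_t1 = t1.keys() - t2.keys()
    only_t2 = t2.keys() - t1.keys()

    # SDc over keywords in both tables
    sd_c  = sum(math.log(t1[k] / t2[k]) * abs(1 + (t1[k] - t2[k]))
                for k in common)
    # SDT1 over keywords only in T1, SDT2 over keywords only in T2
    sd_t1 = sum(math.log(t1[k]) * abs(1 + len(t2)) for k in only_t1)
    sd_t2 = sum(math.log(t2[k]) * abs(1 + len(t1)) for k in only_t2)

    n = (len(t1) + len(t2)) / 2
    return (sd_c + sd_t1 + sd_t2) / n
```

Note that two identical tables score 0 (every log ratio is log 1 = 0), while pages with diverging keywords score higher, so the score behaves as a distance under this reading.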
Near Duplicate Detection cont’d…
• Web documents with a similarity score greater than a predefined threshold are considered near duplicates.
• These near-duplicate pages are not added to the search engine's repository.
Advantages
• Saves network bandwidth
• Reduces the storage cost of search engines
• Improves the quality of the search index
Conclusion
• The proposed method addresses difficulties of information retrieval from the web.
• The approach detects near-duplicate web pages efficiently, based on the keywords extracted from the pages.
• It reduces the memory space required for web repositories.
• Near-duplicate detection increases the quality of search engines.
References
• Brin, S., Davis, J. and Garcia-Molina, H. (1995) "Copy detection mechanisms for digital documents", Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1995), ACM Press.
• Pandey, S. and Olston, C. (2005) "User-centric Web crawling", Proceedings of the 14th International Conference on World Wide Web, pp. 401-411.
• Xiao, C., Wang, W., Lin, X. and Xu Yu, J. (2008) "Efficient Similarity Joins for Near Duplicate Detection", Proceedings of the 17th International Conference on World Wide Web, pp. 131-140.
• Lovins, J.B. (1968) "Development of a stemming algorithm", Mechanical Translation and Computational Linguistics.
Questions
Thank you