A Novel And Efficient Approach For Near Duplicate Page Detection In Web Crawling
VIPIN KP (08103066), S7 CSE A
Guided by: Mr. Aneesh M Haneef, Asst. Professor
Department of CSE, MESCE
04/12/2023 2
Presentation Outline
• Introduction
• What are near duplicates
• Drawbacks of near duplicate pages
• What is a Web crawler
• Simplified Crawl Architecture
• Near duplicate detection
• Advantages
• Conclusion
• References
Introduction
Search engines are the main gateways for accessing information on the web.
A search engine operates in the following order: web crawling, indexing, searching.
Web crawling is the process that creates the indexed repository used by the search engine.
The enormous number of documents on the web poses huge challenges for search engines, making their results less relevant to the user.
Introduction cont’d…
Web search engines face additional problems due to near-duplicate web pages.
It is an important requirement for search engines to provide users with relevant results without duplication.
Near-duplicate page detection is a challenging problem.
What are near duplicates?
Near duplicates are not considered "exact duplicates", but are files with minute differences.
They differ slightly in advertisements, counters, timestamps, etc.
Most web sites also share boilerplate code.
What are near duplicates ?
http://shop.asus.co.uk/shop/gb/en-gb/home.aspx
What are near duplicates ?
http://shop.asus.es/shop/gb/en-gb/home.aspx
Drawbacks of Near Duplicate web pages
• Waste network bandwidth
• Increase storage cost
• Affect the quality of search indexes
• Increase the load on the remote host that is serving such web pages
• Affect customer satisfaction
Web Crawler
A Web crawler is a computer program that browses the World Wide Web in an orderly fashion.
Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, and Web robots.
Search engines use web crawlers to create a copy of all the visited pages for later processing: the search engine indexes the downloaded pages to provide fast searches.
This indexed database is then used for the searching process.
A crawler may examine a URL to check whether it ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx, or a slash.
Some crawlers may also avoid requesting any resource that has a "?" in it (a query string).
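This URL examination can be sketched as a small filter. The suffix list below is taken from the slide; the helper name and behavior are illustrative assumptions, not part of the original approach.

```python
from urllib.parse import urlparse

# Suffixes the crawler treats as likely HTML content (list from the slide).
CRAWLABLE_SUFFIXES = (".html", ".htm", ".asp", ".aspx",
                      ".php", ".jsp", ".jspx", "/")

def should_crawl(url: str) -> bool:
    """Return True if the URL looks like a crawlable HTML page."""
    parsed = urlparse(url)
    # Avoid dynamic resources that carry a "?" (query string).
    if parsed.query:
        return False
    path = parsed.path or "/"
    return path.endswith(CRAWLABLE_SUFFIXES)

print(should_crawl("http://example.com/index.html"))   # True
print(should_crawl("http://example.com/search?q=asus"))  # False
```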
Simplified Crawl Architecture
[Diagram: newly-crawled HTML documents from the Web are checked, one document at a time, against the entire index for near-duplicates. If yes, the document is trashed; if no, it is inserted into the Web index and its links are traversed to continue the crawl.]
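The crawl flow can be sketched as a simple loop. Everything here is a stand-in: `fetch`, `is_near_duplicate`, and the index structure are hypothetical placeholders for the real components, shown only to make the insert/trash/traverse cycle concrete.

```python
def crawl(seed_urls, fetch, is_near_duplicate, index):
    """Simplified crawl loop: fetch a document, check it against the
    index, insert or trash it, then traverse its links.
    `fetch(url)` is assumed to return (document_text, outgoing_links)."""
    frontier = list(seed_urls)
    seen = set()
    while frontier:
        url = frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        doc, links = fetch(url)
        if is_near_duplicate(doc, index):
            continue  # trash: do not add to the repository
        index[url] = doc          # insert into the web index
        frontier.extend(links)    # traverse links
    return index
```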
Near Duplicate Detection
The steps involved in this approach are:
• Web document parsing
• Stemming algorithm
• Keyword representation
• Similarity score calculation
Near Duplicate Detection cont’d…
Web Document Parsing:
• Parsing may be as simple as URL extraction or as complex as removing the HTML tags and JavaScript from a web page.
• Stop Word Removal: remove commonly used words such as 'an', 'and', 'the', 'to', 'with', 'by', 'for', etc. This helps reduce the size of the indexing file.
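Stop-word removal can be sketched in a few lines. The stop-word set below is a small illustrative sample (real systems use much larger lists); the function name is an assumption.

```python
# A small illustrative stop-word list (production systems use larger ones).
STOP_WORDS = {"a", "an", "and", "the", "to", "with", "by", "for", "of", "in"}

def remove_stop_words(tokens):
    """Drop common function words to shrink the indexing file."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the crawler parses a web page for keywords".split()
print(remove_stop_words(tokens))  # ['crawler', 'parses', 'web', 'page', 'keywords']
```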
Near Duplicate Detection cont’d…
Stemming Algorithm:
• Stemming is the process of reducing derived words to their stem (base or root form), generally a written word form.
• The relation between a query and a document is determined by the number and frequency of the terms they have in common.
• Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem.
e.g., "connect", "connected", and "connecting" are all condensed to "connect".
Near Duplicate Detection cont’d…
Stemming Algorithm cont'd:
• The prefix removal algorithm removes prefixes such as: anti, bi, co, contra, de, di, des, en, inter, intra, mini, multi, pre, pro, …
• The suffix removal algorithm removes suffixes such as: ly, ness, ioc, iez, able, ance, ary, ce, y, dom, ee, eer, ence, ory, ous, …
• The derivations are converted to stems that are related to the original words in both form and semantics.
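The suffix-removal half of an affix-removal algorithm can be sketched as below. The suffix list and the minimum-stem-length guard are illustrative assumptions (naive prefix stripping would over-strip words like "connect", which starts with "co", so this sketch handles suffixes only); production stemmers such as Porter's use ordered rules and conditions.

```python
# Illustrative suffix list; real stemmers apply ordered rules.
SUFFIXES = ("ing", "ed", "ly", "ness", "able", "ance", "ence", "ous", "s")

def stem(word, min_stem=3):
    """Strip one known suffix, keeping a minimum stem length so that
    short words are not destroyed."""
    w = word.lower()
    for s in SUFFIXES:
        if w.endswith(s) and len(w) - len(s) >= min_stem:
            return w[:-len(s)]
    return w

print([stem(w) for w in ("connect", "connected", "connecting")])
# ['connect', 'connect', 'connect']
```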
Near Duplicate Detection cont'd…
Keyword Representation:
• Keywords and their counts in each crawled page are the result of stemming.
• Keywords are sorted in descending order of their counts.
• The keywords with the highest counts, called prime keywords, are stored in one table; the remaining keywords are indexed and stored in another table.
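The keyword-table construction can be sketched with a counter. The number of prime keywords (`n_prime`) is an assumed parameter, not something the slide specifies.

```python
from collections import Counter

def keyword_tables(tokens, n_prime=3):
    """Count stemmed keywords, sort by descending count, and split them
    into a prime-keyword table and a table of remaining keywords.
    `n_prime` (how many keywords count as prime) is an assumption."""
    counts = Counter(tokens)
    ordered = counts.most_common()      # sorted by descending count
    prime = dict(ordered[:n_prime])
    remaining = dict(ordered[n_prime:])
    return prime, remaining

tokens = ["crawl", "page", "crawl", "index", "page", "crawl", "link"]
prime, remaining = keyword_tables(tokens, n_prime=2)
print(prime)      # {'crawl': 3, 'page': 2}
print(remaining)  # {'index': 1, 'link': 1}
```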
Near Duplicate Detection cont’d…
Similarity Score Calculation:
• If the prime keywords of the new web page do not match the prime keywords of any page in the table, the new page is added to the repository.
• If all the keywords of both pages are the same, the new page is a duplicate.
• If the prime keywords of both pages are the same, a similarity score (SSM) is calculated as follows.
Near Duplicate Detection cont’d…
Table T1 (web page in the repository): keywords and counts
    K1  K2  …  Kn
    C1  C2  …  Cn

Table T2 (new web page): keywords and counts
    K1  K2  …  Kn
    C1  C2  …  Cn

If a keyword ki is present in both tables, let a = Δ[ki] in T1 and b = Δ[ki] in T2 (the counts of ki in the two tables); then, using the formula,

    SDc = log(count(a) / count(b)) * Abs(1 + (a - b))
Near Duplicate Detection cont’d…
• If keywords are present in T1 but not in T2, and NT1 is the set of such keywords, then

    SDT1 = log(count(a)) * Abs(1 + |T2|)

• If keywords are present in T2 but not in T1, and NT2 is the set of such keywords, then

    SDT2 = log(count(b)) * Abs(1 + |T1|)

• The similarity score of one page against another is calculated as

    SSM = ( Σ_{i=1..|NC|} SDc + Σ_{i=1..|NT1|} SDT1 + Σ_{i=1..|NT2|} SDT2 ) / N,

  where NC is the set of common keywords and N = (|T1| + |T2|) / 2.
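A minimal sketch of the similarity-score computation, under two assumptions: that count(a)/count(b) denotes the ratio of a keyword's counts in the two tables, and that the three sums run over the common keywords, the T1-only keywords, and the T2-only keywords respectively.

```python
import math

def ssm(t1, t2):
    """Similarity score between two keyword-count tables (dicts mapping
    keyword -> count), following the formulas above. A sketch under the
    stated interpretation, not a definitive implementation."""
    common  = t1.keys() & t2.keys()
    only_t1 = t1.keys() - t2.keys()
    only_t2 = t2.keys() - t1.keys()

    # SDc over keywords in both tables
    sd_c  = sum(math.log(t1[k] / t2[k]) * abs(1 + (t1[k] - t2[k]))
                for k in common)
    # SDT1 over keywords only in T1, SDT2 over keywords only in T2
    sd_t1 = sum(math.log(t1[k]) * abs(1 + len(t2)) for k in only_t1)
    sd_t2 = sum(math.log(t2[k]) * abs(1 + len(t1)) for k in only_t2)

    n = (len(t1) + len(t2)) / 2
    return (sd_c + sd_t1 + sd_t2) / n
```

Note that two identical tables score 0 (every log ratio is log 1 = 0), while pages with diverging keywords score higher, so the score behaves as a distance under this reading.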
Near Duplicate Detection cont’d…
• Web documents with a similarity score greater than a predefined threshold are considered near duplicates.
• These near-duplicate pages are not added to the search engine's repository.
Advantages
• Saves network bandwidth
• Reduces the storage cost of search engines
• Improves the quality of the search index
Conclusion
• The proposed method addresses difficulties of information retrieval from the web.
• The approach detects near-duplicate web pages efficiently, based on the keywords extracted from the pages.
• It reduces the memory space required for web repositories.
• Near-duplicate detection increases the quality of search engines.
References
• Brin, S., Davis, J. and Garcia-Molina, H. (1995) "Copy detection mechanisms for digital documents", Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1995), ACM Press.
• Pandey, S. and Olston, C. (2005) "User-centric Web crawling", Proceedings of the 14th International Conference on World Wide Web, pp. 401-411.
• Xiao, C., Wang, W., Lin, X. and Xu Yu, J. (2008) "Efficient Similarity Joins for Near Duplicate Detection", Proceedings of the 17th International Conference on World Wide Web, pp. 131-140.
• Lovins, J.B. (1968) "Development of a stemming algorithm", Mechanical Translation and Computational Linguistics.
Questions
Thank you