Web Noises Detection and Elimination PengBo Dec 3, 2010

Upload: sandra-gilmore

Post on 25-Dec-2015


Page 1: Web Noises Detection and Elimination PengBo Dec 3, 2010

Web Noises Detection and Elimination

PengBo, Dec 3, 2010

Page 2: Web Noises Detection and Elimination PengBo Dec 3, 2010

What are Web Noises?

Page 3: Web Noises Detection and Elimination PengBo Dec 3, 2010

Topic

Navigation guide

Advertisement

Page 4: Web Noises Detection and Elimination PengBo Dec 3, 2010
Page 5: Web Noises Detection and Elimination PengBo Dec 3, 2010

Call them Noises

Although this information is useful to people browsing the Web, it often has a negative effect on automated Web information processing, such as Web page clustering, classification, information retrieval, and information extraction.

Noises hamper automated information gathering and Web data mining.

“Template Detection via Data Mining and its Applications”

Page 6: Web Noises Detection and Elimination PengBo Dec 3, 2010

Non-Relevant Data on the Web

A fundamental problem on the Web: many pages contain lots of non-relevant data

“Non-relevant”: not directly related to the main topic / functionality of the page

Local (intra-page) noise: irrelevant items within a Web page, e.g., banner ads, navigational guides

Page 7: Web Noises Detection and Elimination PengBo Dec 3, 2010

Duplicate data on the Web

Another problem on the Web: there is much duplicate or near-duplicate data (mirrors, news copies, etc.)

Global noise: redundant objects larger than an individual page, e.g., mirror sites, duplicated Web pages

Page 8: Web Noises Detection and Elimination PengBo Dec 3, 2010

Why does it matter?

Hypertext IR Principles (the principles behind all link-based IR tools):

• Relevant Linkage Principle: p links to q ⇒ q is relevant to p
• Topical Unity Principle: q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other
• Lexical Affinity Principle: the closer the links to q1 and q2 are, the stronger the relation between them

Page 9: Web Noises Detection and Elimination PengBo Dec 3, 2010

Violations of Relevant Linkage Principle

Navigational links http://www.ibm.com/

Download links http://www.beethoven.com/

Advertisement links http://www.yahoo.com/

Endorsement links http://www.ebay.com/

Spam links

Page 10: Web Noises Detection and Elimination PengBo Dec 3, 2010

Violations of Topical Unity Principle

Violations of the Relevant Linkage Principle

• Bookmark pages: http://bookmark.yinsha.com/ (online bookmarks)
• General resource lists: http://sewm.pku.edu.cn/IR-Guide.txt (IR Guide)
• Personal homepages: http://www.cse.iitb.ac.in/~soumen/ (Soumen’s Home Page)

Page 11: Web Noises Detection and Elimination PengBo Dec 3, 2010

Violations of Lexical Affinity Principle

• Alphabetical index lists: Computer and Communication Companies ("M" entries)
• HTML representation: adjacent cells in the same column are far from each other in the HTML text

Page 12: Web Noises Detection and Elimination PengBo Dec 3, 2010

IR Tool Problems

• Generalization: search for “Frequency Division Multiplexing” and get back general Electrical Engineering sites
• Topic drift: search for “Finite Model Theory” and get SF 49’ers fan web sites
• Irrelevance: get “Yahoo” as a result regardless of the query
• Bias: search for “computing companies” and get Microspy highly ranked

Page 13: Web Noises Detection and Elimination PengBo Dec 3, 2010

Hypertext Improvement Problem

Main Goal: develop hypertext processing techniques that:

• automatically improve hypertext data (remove violations of the Hypertext IR principles)
• are efficient and scalable (quickly process millions of pages)

Page 14: Web Noises Detection and Elimination PengBo Dec 3, 2010

Hypertext Cleaning

Web → Crawler → Hypertext Cleaner → IR Tool

Page 15: Web Noises Detection and Elimination PengBo Dec 3, 2010

Template detection

Page 16: Web Noises Detection and Elimination PengBo Dec 3, 2010

DOM Tree

Template

Page 17: Web Noises Detection and Elimination PengBo Dec 3, 2010
Page 18: Web Noises Detection and Elimination PengBo Dec 3, 2010
Page 19: Web Noises Detection and Elimination PengBo Dec 3, 2010

Templates

Page 20: Web Noises Detection and Elimination PengBo Dec 3, 2010

Templates Detection

Semantic definition: a template is a master HTML shell page that is used as a basis for composing new pages

• Content of new pages is plugged into the template shell
• All pages share a common look & feel
• Usually controlled by a central authority
• Not necessarily confined to a single site
• May include a variety of data: navigational bars, advertisements, company info and policies

Page 21: Web Noises Detection and Elimination PengBo Dec 3, 2010

Search pagelet

Navigation pagelet

Services pagelet

Company info pagelet

Ad pagelet

Page 22: Web Noises Detection and Elimination PengBo Dec 3, 2010

Pagelets

Semantic definition: a pagelet is a maximal region of a page that has a single topic or functionality

• Not too large: it has only one topic / functionality
• Not too small: any larger region that contains it has other topics / functionalities

Page 23: Web Noises Detection and Elimination PengBo Dec 3, 2010

IR with Pagelets

Main Idea 1: use pagelets rather than pages as atomic units for information retrieval

Main Idea 2: eliminate pagelets belonging to templates

Page 24: Web Noises Detection and Elimination PengBo Dec 3, 2010

Pagelets: Syntactic Definition

A pagelet is a node in the HTML parse tree of a page satisfying the following:

• Its HTML tag is one of the following: <TABLE>, <OL>, <UL>, <AREA>, <P>, <DL>, …
• None of its children contains more than k hyperlinks
• None of its ancestors is a pagelet
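The three syntactic conditions can be checked with a top-down walk of the parse tree. The sketch below is only an illustration under stated assumptions: the `Node` class is a hypothetical stand-in for a real HTML parse tree, the tag set stops where the slide's list is elided, and hyperlinks are taken to be `a` elements.

```python
from dataclasses import dataclass, field

# Tags named on the slide; the slide's list is elided ("…"), so this set is incomplete.
PAGELET_TAGS = {"table", "ol", "ul", "area", "p", "dl"}


@dataclass
class Node:
    """Hypothetical parse-tree node: a tag name plus child nodes."""
    tag: str
    children: list = field(default_factory=list)


def count_links(node):
    """Number of hyperlinks (assumed to be <a> elements) in the subtree rooted at node."""
    own = 1 if node.tag == "a" else 0
    return own + sum(count_links(child) for child in node.children)


def find_pagelets(node, k):
    """Return the pagelets below (and including) node.

    Walking top-down and stopping at the first match enforces the
    "none of its ancestors is a pagelet" condition.
    """
    eligible = node.tag in PAGELET_TAGS
    if eligible and all(count_links(child) <= k for child in node.children):
        return [node]
    found = []
    for child in node.children:
        found.extend(find_pagelets(child, k))
    return found


# A navigation table with three links in one row, and a content table with one link.
nav = Node("table", [Node("tr", [Node("a"), Node("a"), Node("a")])])
content = Node("table", [Node("p", [Node("a")])])
body = Node("body", [nav, content])
```

With k = 2 only the content table qualifies, because the navigation table has a child holding three links; with k = 3 both tables are pagelets.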

Page 25: Web Noises Detection and Elimination PengBo Dec 3, 2010
Page 26: Web Noises Detection and Elimination PengBo Dec 3, 2010

Templates: Syntactic Definition

A template is a collection T = (p1,…,pk) of pagelets satisfying:

• Similarity: p1,…,pk are identical or almost identical
• Connectivity: every two pages owning pagelets in T are reachable from each other (undirectedly) through other pages owning pagelets in T

(figure: linked pages with pagelets labeled p1, p2, p3, p4, p5)

Template Recognition Problem: given a set of pages S, find all the templates in S.

Page 27: Web Noises Detection and Elimination PengBo Dec 3, 2010

Template Recognition in Large Sets

Calculate shingle(p) for each pagelet p ∈ S

Cluster pagelets in S according to shingle

Discard clusters of size 1

For each remaining cluster C:

• Construct graph Gc of pages that own pagelets in C
• Find undirected connected components of Gc
• Output components of size > 1
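The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: a SHA-1 hash of whitespace-normalized text stands in for the shingle, so only exactly matching pagelets cluster together, and the input structures (`pagelets_by_page`, `links`) are assumed names.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Stand-in for shingle(p): hash of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_templates(pagelets_by_page, links):
    """pagelets_by_page: page id -> list of pagelet texts.
    links: iterable of (source page, target page) hyperlinks.
    Returns a list of (fingerprint, set of pages) templates."""
    clusters = defaultdict(set)  # fingerprint -> pages owning such a pagelet
    for page, pagelets in pagelets_by_page.items():
        for text in pagelets:
            clusters[fingerprint(text)].add(page)

    templates = []
    for fp, owners in clusters.items():
        if len(owners) < 2:  # discard clusters of size 1
            continue
        # Graph Gc: links between owner pages, treated as undirected.
        adjacent = {page: set() for page in owners}
        for src, dst in links:
            if src in owners and dst in owners:
                adjacent[src].add(dst)
                adjacent[dst].add(src)
        # Connected components by depth-first search.
        seen = set()
        for start in sorted(owners):
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                page = stack.pop()
                if page not in component:
                    component.add(page)
                    stack.extend(adjacent[page] - component)
            seen |= component
            if len(component) > 1:  # output components of size > 1
                templates.append((fp, component))
    return templates


pages = {
    "a": ["Home | About", "story about cats"],
    "b": ["Home | About", "story about dogs"],
    "c": ["home |  about", "story about fish"],
    "d": ["Home | About", "unrelated site"],
}
links = [("a", "b"), ("b", "c")]
templates = find_templates(pages, links)
```

Page d carries the same navigation pagelet but is not linked to the others, so it falls in a component of size 1 and is left out of the template.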

Page 28: Web Noises Detection and Elimination PengBo Dec 3, 2010

Evaluation

Question:

How to evaluate the performance/effectiveness of this cleaning algorithm?

Page 29: Web Noises Detection and Elimination PengBo Dec 3, 2010

Benefits of template detection

Page 30: Web Noises Detection and Elimination PengBo Dec 3, 2010

Cleaning via feature weighting

Page 31: Web Noises Detection and Elimination PengBo Dec 3, 2010

Cleaning via feature weighting

In a given Web site:

• Noisy blocks share common contents or presentation styles
• Meaningful (or main) blocks are diverse in contents and presentation style

Weighting features makes cleaning automatic (nothing is eliminated)

“Eliminating noisy information in Web pages for data mining”

Page 32: Web Noises Detection and Elimination PengBo Dec 3, 2010

DOM trees

<BODY bgcolor=WHITE> <TABLE width=800 height=200 > … </TABLE> <IMG src="image.gif" width=800> <TABLE bgcolor=RED> … </TABLE></BODY>

(figure: the corresponding DOM tree: root → BODY (bc=white) → TABLE (width=800, height=200), IMG (width=800), TABLE (bc=red))

Page 33: Web Noises Detection and Elimination PengBo Dec 3, 2010

Build Site style tree (SST)

common

Page 34: Web Noises Detection and Elimination PengBo Dec 3, 2010

SST

Style Node S = (ELEMENTs, n)

• ELEMENTs: a sequence of element nodes
• n: number of pages that have this style

Element Node E = (Tag, Attr, STYLEs)

• Tag: tag name, e.g., TABLE, IMG
• Attr: display attributes of Tag, e.g., bgcolor=RED
• STYLEs: style nodes below E

Page 35: Web Noises Detection and Elimination PengBo Dec 3, 2010

Quantify the importance

Inner Node

Leaf Node

Page 36: Web Noises Detection and Elimination PengBo Dec 3, 2010

Weighting policy

Inner Node Importance

$$\mathrm{NodeImp}(E) = \begin{cases} -\sum_{i=1}^{l} p_i \log_m p_i & \text{if } m > 1 \\ 1 & \text{if } m = 1 \end{cases} \qquad (1)$$

• l = |E.STYLEs|
• m = number of pages containing E, |E.parent.n|
• p_i: percentage of tag nodes (in E.parent.n) using the i-th presentation style
• Inner NodeImp(E): diversity of presentation styles

Page 37: Web Noises Detection and Elimination PengBo Dec 3, 2010

$$\mathrm{NodeImp}(\mathrm{Body}) = -1 \log_{100} 1 = 0$$

$$\mathrm{NodeImp}(\mathrm{Table}) = -(0.35 \log_{100} 0.35 + 2 \times 0.25 \log_{100} 0.25 + 0.15 \log_{100} 0.15) = 0.29 > 0$$
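A minimal Python sketch of the inner-node importance, i.e., the entropy of presentation-style usage with logarithm base m, and importance 1 when m = 1. It reproduces the Body and Table numbers of this slide. The function name and the count-based input are assumptions made for illustration; the counts are assumed to sum to m ≥ 1.

```python
import math


def inner_node_importance(style_counts):
    """style_counts[i] = number of pages (out of m) using the i-th presentation style."""
    m = sum(style_counts)
    if m == 1:
        return 1.0
    importance = 0.0
    for count in style_counts:
        if count > 0:  # a style used by no page contributes nothing
            p = count / m
            importance -= p * math.log(p, m)
    return importance


# 100 pages: Body always has the same style, Table has four styles (35/25/25/15 pages).
body_importance = inner_node_importance([100])
table_importance = inner_node_importance([35, 25, 25, 15])
```

Here `body_importance` is 0 and `table_importance` rounds to 0.29, matching the worked values.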

Page 38: Web Noises Detection and Elimination PengBo Dec 3, 2010

Weighting policy

Features (terms) of a leaf node

Importance of a leaf node's features:

$$H_E(a_i) = \begin{cases} -\sum_{j=1}^{m} p_{ij} \log_m p_{ij} & \text{if } m > 1 \\ 0 & \text{if } m = 1 \end{cases} \qquad (3)$$

• m = number of pages containing E, |E.parent.n|
• p_ij: probability that a_i appears in E of page j
• H_E(a_i): information entropy of a_i
• The higher H_E(a_i), the less important a_i

Page 39: Web Noises Detection and Elimination PengBo Dec 3, 2010

Weighting policy

Leaf Node Importance

$$\mathrm{NodeImp}(E) = \frac{\sum_{i=1}^{N} \left(1 - H_E(a_i)\right)}{N} = 1 - \frac{\sum_{i=1}^{N} H_E(a_i)}{N} \qquad (2)$$

• N: number of features in E
• a_i: a feature of content in E
• (1 - H_E(a_i)): information contained in a_i
• Leaf NodeImp(E): content diversity of E

Page 40: Web Noises Detection and Elimination PengBo Dec 3, 2010

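(figure: SST with nodes root, TABLE, IMG, Ep and a leaf element node E that occurs in 3 pages)

Contents of E in the three pages:

• t1: PCMag, samsung
• t2: PCMag, epson
• t3: PCMag, canon

m = 3

N = |{PCMag, samsung, epson, canon}| = 4

$$H_E(\mathrm{PCMag}) = -3 \times \left(\tfrac{1}{3} \log_3 \tfrac{1}{3}\right) = 1$$

$$H_E(\mathrm{samsung}) = H_E(\mathrm{epson}) = H_E(\mathrm{canon}) = -(0 + 0 + 1 \log_3 1) = 0$$

$$\mathrm{NodeImp}(E) = \frac{(1-1) + 3 \times (1-0)}{4} = 0.75$$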
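The leaf-node computation can be replayed in Python on the PCMag example. This sketch assumes that p_ij is the share of a feature's occurrences that fall in page j, which matches the 1/3 values used for PCMag; the function names are illustrative, and every page list is assumed to be non-empty.

```python
import math
from collections import Counter


def feature_entropy(counts, m):
    """Entropy (log base m) of how one feature's occurrences spread over m pages."""
    if m == 1:
        return 0.0
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, m) for c in counts if c > 0)


def leaf_node_importance(pages):
    """pages: one list of features (terms) per page containing the leaf node E."""
    m = len(pages)
    counters = [Counter(terms) for terms in pages]
    features = set().union(*counters)  # the N distinct features of E
    entropies = [feature_entropy([c[a] for c in counters], m) for a in features]
    return sum(1.0 - h for h in entropies) / len(entropies)


pages = [["PCMag", "samsung"], ["PCMag", "epson"], ["PCMag", "canon"]]
importance = leaf_node_importance(pages)
```

PCMag appears on every page and contributes nothing, while each brand name appears on one page only and contributes fully, giving 3/4 = 0.75.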

Page 41: Web Noises Detection and Elimination PengBo Dec 3, 2010

Transitive Weighting policy

Composite Importance

0

0.290

0.75

Page 42: Web Noises Detection and Elimination PengBo Dec 3, 2010

Page noise

Noisy element node: for an element node E in the SST, if all of its descendants and E itself have composite importance less than a specified threshold t, then we say element node E is noisy.

Maximal noisy element node

Meaningful element node: if an element node E in the SST does not contain any noisy descendant, we say that E is meaningful.

Maximal meaningful element node

Page 43: Web Noises Detection and Elimination PengBo Dec 3, 2010

Web page cleaning via block elimination

We can use the SST (site style tree) to identify and eliminate noisy content blocks in a page:

• Build the SST from sample pages crawled from a site.
• Compute an importance value for each block, using a specified threshold t to decide whether the block is noisy.
• Given a new page, match its blocks to the noisy and non-noisy blocks in the tree.

Page 44: Web Noises Detection and Elimination PengBo Dec 3, 2010

Noise Detection and Elimination

(figure: DOM tree of a page before cleaning, with nodes root, Body, Table, Img, Tr, Text, P, A)

Page 45: Web Noises Detection and Elimination PengBo Dec 3, 2010

After simplification

(figure: the tree after simplification, with remaining nodes root, Body, Table, Img, Table, Table, Tr, Tr, Text)

Page 46: Web Noises Detection and Elimination PengBo Dec 3, 2010

Summary of the technique

• Evaluate commonality and diversity of content and styles: DOM trees, SST
• Information-entropy-based evaluation: node importance, composite importance
• Noise detection and automatic matching

Page 47: Web Noises Detection and Elimination PengBo Dec 3, 2010

Near duplicate detection

Page 48: Web Noises Detection and Elimination PengBo Dec 3, 2010

Syntactic clustering of the web contents, WWW6, 1997

Page 49: Web Noises Detection and Elimination PengBo Dec 3, 2010

Document Representation

How to represent a document?

• Represent document content by a feature set, in preparation for computing resemblance or similarity.
• For document D, extract its feature set as S(D)

Page 50: Web Noises Detection and Elimination PengBo Dec 3, 2010

Defining similarity of documents

How to express the concept “roughly the same” precisely?

Quantitative definition: resemblance. The resemblance of two documents A and B is a number between 0 and 1.

Page 51: Web Noises Detection and Elimination PengBo Dec 3, 2010

Defining similarity of documents(cont’d)

Resemblance (Jaccard coefficient)

$$r(A,B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$$

• Symmetric, reflexive, not transitive, not a metric
• Note r(A,A) = 1, but r(A,B) = 1 does not mean A and B are identical!
• Forgives any number of occurrences and any permutations of the terms

Resemblance distance

$$d(A,B) = 1 - r(A,B)$$

Page 52: Web Noises Detection and Elimination PengBo Dec 3, 2010

Feature Selection

Assume we have converted the page into a sequence of tokens: eliminate punctuation and HTML markup, convert to lower case, etc.

How to do feature selection, S(D) = ?

• Document level
• Character/word level
• Shingle level

Page 53: Web Noises Detection and Elimination PengBo Dec 3, 2010

Shingling

A contiguous subsequence contained in D is called a shingle.

Given a document D we define its w-shingling S(D, w) as the set of all unique shingles of size w contained in D.

• D = (a,rose,is,a,rose,is,a,rose)
• S(D,4) = {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
• “a rose is a rose is a rose” => a_rose_is_a, rose_is_a_rose, is_a_rose_is

Why shingling? S(D,4) vs. S(D,1)

What is a good value for w?
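A small Python sketch of w-shingling and of the resemblance computed on the resulting sets, using the rose example. The function names are illustrative assumptions; tokens are assumed to be already cleaned and lowercased.

```python
def shingles(tokens, w):
    """S(D, w): the set of all unique contiguous token subsequences of length w."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(tokens_a, tokens_b, w):
    """Jaccard coefficient of the two w-shinglings."""
    sa, sb = shingles(tokens_a, w), shingles(tokens_b, w)
    if not sa and not sb:
        return 1.0  # convention chosen here for two empty shingle sets
    return len(sa & sb) / len(sa | sb)


D = "a rose is a rose is a rose".split()
S4 = shingles(D, 4)

# Same words in a different order: identical under S(D,1), different under S(D,4).
A = "a rose is a rose".split()
B = "rose is a rose a".split()
r1 = resemblance(A, B, 1)
r4 = resemblance(A, B, 4)
```

With w = 1 the two reordered sentences have resemblance 1, while with w = 4 they share only one of three distinct shingles, which is one way to see why longer shingles capture word order.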

Page 54: Web Noises Detection and Elimination PengBo Dec 3, 2010

Sketches

• The set of all shingles is large: bigger than the original document
• Can we create a document sketch by sampling only a few shingles?
• Requirement: sketch resemblance should be a good estimate of document resemblance

Page 55: Web Noises Detection and Elimination PengBo Dec 3, 2010

Choosing a sketch

Random sampling

• E.g., suppose we have identical documents A & B, each with n shingles
• M(A) = set of s shingles from A, chosen uniformly at random; similarly M(B)

Does it work?

• For s = 1: E[|M(A) ∩ M(B)|] = 1/n
• But r(A,B) = 1
• So the sketch overlap is an underestimate
• Verify that this is true for any value of s

Page 56: Web Noises Detection and Elimination PengBo Dec 3, 2010

Choosing a sketch

Improvements:

• Random sampling + compare “special” item
• Random permutations + compare “smallest” shingle

Random permutation

• Let Ω be a set (e.g., 1..N)
• Pick a permutation π: Ω → Ω uniformly at random
• π = {3,7,1,4,6,2,5}, A = {2,3,6}, MIN(π(A)) = ?
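The question on this slide can be worked out mechanically. The snippet assumes the listed permutation is read position by position, i.e., 1 maps to 3, 2 maps to 7, and so on up to 7 maps to 5; that reading is an assumption about the slide's notation.

```python
# Permutation of {1..7}, read as: element k is mapped to the k-th listed value.
listed = [3, 7, 1, 4, 6, 2, 5]
pi = {k: value for k, value in enumerate(listed, start=1)}

A = {2, 3, 6}
image = {pi[x] for x in A}   # images of 2, 3 and 6 under the permutation
min_image = min(image)
```

Under that reading the image of A is {7, 1, 2}, so the minimum is 1.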

Page 57: Web Noises Detection and Elimination PengBo Dec 3, 2010

Estimating Jaccard Coefficient

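Theorem: if permutations π are picked uniformly at random from the n! possible permutations, then

$$\Pr[\mathrm{MIN}(\pi(S(A))) = \mathrm{MIN}(\pi(S(B)))] = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|} = r(A,B)$$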

Page 58: Web Noises Detection and Elimination PengBo Dec 3, 2010

Choosing a sketch

Create a “sketch vector” (e.g., of size 200) for each document

• Documents which share more than t (say 80%) corresponding vector elements are similar
• For doc d, sketch_d[i] is computed as follows:
  • Let f map all shingles in the universe to 0..2^m
  • Let π_i be a specific random permutation on 0..2^m
  • Pick MIN π_i(f(s)) over all shingles s in d
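A hedged Python sketch of the sketch-vector idea. Drawing a truly uniform random permutation of a 2^m-sized universe is impractical, so this example swaps in affine maps x → (a·x + b) mod p with p prime, which are genuine permutations of 0..p-1 but come from a much smaller family; that substitution, the SHA-1 based fingerprint, and all names are assumptions of this sketch. Shingle sets are assumed to be non-empty.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # a Mersenne prime; fingerprints live in 0..PRIME-1


def shingle_fp(shingle):
    """f: map a shingle (tuple of tokens) to an integer fingerprint."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def make_permutations(n, seed=0):
    """n affine permutations of 0..PRIME-1, each stored as a pair (a, b) with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def sketch(shingle_set, permutations):
    """sketch[i] = minimum permuted fingerprint over all shingles of the document."""
    fps = [shingle_fp(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in fps) for a, b in permutations]


def estimated_resemblance(sketch_a, sketch_b):
    """Fraction of corresponding sketch elements that agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


perms = make_permutations(200, seed=1)
doc_a = {("a", "rose", "is", "a"), ("rose", "is", "a", "rose"), ("is", "a", "rose", "is")}
doc_c = {("completely", "different", "text", "here")}
same = estimated_resemblance(sketch(doc_a, perms), sketch(doc_a, perms))
different = estimated_resemblance(sketch(doc_a, perms), sketch(doc_c, perms))
```

Two documents would then be called similar when the estimate exceeds the chosen threshold t, for instance 0.8.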

Page 59: Web Noises Detection and Elimination PengBo Dec 3, 2010

Computing Sketch[i] for Doc1

Document 1

(figure: number lines from 0 to 2^64)

• Start with 64-bit shingles
• Permute on the number line with π_i
• Pick the min value

Page 60: Web Noises Detection and Elimination PengBo Dec 3, 2010

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

Test for 200 random permutations: π_1, π_2, …, π_200

(figure: number lines from 0 to 2^64 for Document 1 and Document 2, with the minimum value marked on each)

Are these equal?

Page 61: Web Noises Detection and Elimination PengBo Dec 3, 2010

However…

A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection).

This happens with probability Size_of_intersection / Size_of_union.

(figure: number lines from 0 to 2^64 for Document 1 and Document 2, with minimum values A and B)

Page 62: Web Noises Detection and Elimination PengBo Dec 3, 2010

Finding all near-duplicates

Naïve implementation makes O(N^2) sketch comparisons

• Suppose N = 100 million

How can you do it faster?

Page 63: Web Noises Detection and Elimination PengBo Dec 3, 2010

Summary of this lecture

• Web Noises: Hypertext IR Principles
• Template Detection: semantic and syntactic definitions
• Feature weighting: information entropy, SST
• Near-duplicate detection: Jaccard similarity, shingling, sketch

Page 64: Web Noises Detection and Elimination PengBo Dec 3, 2010

References

[1] Z. Bar-Yossef and S. Rajagopalan, "Template detection via data mining and its applications," in Proceedings of the 11th international conference on World Wide Web. Honolulu, Hawaii, USA: ACM Press, 2002.

[2] L. Yi, B. Liu, and X. Li, "Eliminating noisy information in Web pages for data mining," in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. Washington, D.C.: ACM Press, 2003.

[3] D. Gibson, K. Punera, and A. Tomkins, "The volume and evolution of web page templates," in Special interest tracks and posters of the 14th international conference on World Wide Web. Chiba, Japan: ACM Press, 2005.

[4] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the Web," in Selected papers from the sixth international conference on World Wide Web. Santa Clara, California, United States: Elsevier Science Publishers Ltd., 1997.

[5] N. Shivakumar and H. Garcia-Molina, "Finding near-replicas of documents on the web," presented at Proceedings of Workshop on Web Databases (WebDB'98), Mar, 1998.

Page 65: Web Noises Detection and Elimination PengBo Dec 3, 2010

Related Resources

Html-tidy Code http://code.google.com/p/html-tidy/

Shingle Code http://research.microsoft.com/research/downloads/Details/4e0d0535-ff4c-4259-99fa-ab34f3f57d67/Details.aspx?0sr=d

Page 66: Web Noises Detection and Elimination PengBo Dec 3, 2010

Thank You!

Q&A

Page 67: Web Noises Detection and Elimination PengBo Dec 3, 2010

Reading material

[1] IIR Chapter 19.6

[2] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, pp. 513-523, 1988.

Page 68: Web Noises Detection and Elimination PengBo Dec 3, 2010

DOM Tree

The W3C Document Object Model allows programs and scripts to dynamically access and update the content, structure and style of documents. The document can be further processed and the results of that processing can be incorporated back into the presented page.

Page 69: Web Noises Detection and Elimination PengBo Dec 3, 2010

Information Entropy

In information theory, entropy is a measure of the uncertainty associated with a random variable. The term by itself in this context usually refers to the Shannon entropy, which quantifies, in the sense of an expected value, the information contained in a message, usually in units such as bits.

Page 70: Web Noises Detection and Elimination PengBo Dec 3, 2010

Estimating algorithm

1. Generate a set of m random permutations π
2. for each π do
3.   compute π(S(d1)) and π(S(d2))
4.   check if min(π(S(d1))) = min(π(S(d2)))
5. end for
6. if equality was observed in k cases, estimate r'(d1,d2) = k/m

Page 71: Web Noises Detection and Elimination PengBo Dec 3, 2010

Some other approaches

For a set W of shingles, let MIN_s(W) = set of the s smallest shingles in W

• Assume documents have at least s shingles
• Define
  • M(A) = MIN_s(S(A))
  • M(A∪B) = MIN_s(M(A) ∪ M(B))
  • r'(A,B) = |M(A∪B) ∩ M(A) ∩ M(B)| / s

By increasing the sample size (s) we can make it very unlikely that r'(A,B) is significantly different from r(A,B)

• 100-200 shingles is sufficient in practice

Compute a fingerprint f for each shingle (e.g., Rabin fingerprint)

• 40 bits is usually enough to keep estimates reasonably accurate
• Fingerprint also eliminates the need for random permutation
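The "s smallest fingerprints" estimator can be sketched as follows. Inputs are assumed to be sets of integer shingle fingerprints with at least s elements each; in practice the fingerprints should behave like random numbers (e.g., Rabin fingerprints), whereas the tiny integers below are chosen only so the arithmetic can be followed by hand.

```python
def min_s(values, s):
    """MIN_s(W): the set of the s smallest values in W."""
    return set(sorted(values)[:s])


def sketch_resemblance(fps_a, fps_b, s):
    """r'(A,B) = |MIN_s(M(A) ∪ M(B)) ∩ M(A) ∩ M(B)| / s."""
    m_a, m_b = min_s(fps_a, s), min_s(fps_b, s)
    m_union = min_s(m_a | m_b, s)
    return len(m_union & m_a & m_b) / s


fps_a = {1, 3, 5, 7}
fps_b = {2, 3, 6, 7}
estimate = sketch_resemblance(fps_a, fps_b, 3)
exact = len(fps_a & fps_b) / len(fps_a | fps_b)
```

Here M(A) = {1, 3, 5}, M(B) = {2, 3, 6}, the three smallest values of their union are {1, 2, 3}, and only 3 lies in both sketches, so the estimate is 1/3; the exact resemblance of the two sets happens to be 2/6 as well.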

Page 72: Web Noises Detection and Elimination PengBo Dec 3, 2010

Finding all near-duplicates

Naïve implementation makes O(N^2) sketch comparisons

• Suppose N = 100 million

Divide-Compute-Merge (DCM)

• Divide data into batches that fit in memory
• Operate on each individual batch and write out partial results in sorted order
• Merge partial results

Generalization of external sorting

Page 73: Web Noises Detection and Elimination PengBo Dec 3, 2010

DCM Steps

Input: doc1: s11,s12,…,s1k; doc2: s21,s22,…,s2k; …

→ Invert: (s11,doc1), (s12,doc1), …, (s1k,doc1), (s21,doc2), …

→ Sort on shingle_fp: (t1,doc1), (t1,docX), …, (t2,doc1), (t2,docY), …

→ Invert and pair: (doc1,docX,1), (doc1,docZ,1), …, (doc1,docY,1), …

→ Sort on <docid1,docid2>: (doc1,docX,1), (doc1,docX,1), …, (doc1,docY,1), …

→ Merge: (doc1,docX,2), (doc1,docY,10), …

Page 74: Web Noises Detection and Elimination PengBo Dec 3, 2010

Finding all near-duplicates

1. Calculate a sketch for each document
2. For each document, write out the pairs <shingle_fp, docId>
3. Sort by shingle_fp (DCM)
4. In a sequential scan, generate triplets of the form <docId1,docId2,1> for pairs of docs that share a shingle (DCM)
5. Sort on <docId1,docId2> (DCM)
6. Merge the triplets with common docids to generate triplets of the form <docId1,docId2,count> (DCM)
7. Output document pairs whose resemblance exceeds the threshold
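An in-memory Python sketch of the seven steps, with `list.sort` and `itertools.groupby` standing in for the external sort and merge passes of DCM. Sketches are assumed to be precomputed lists of integer fingerprints keyed by document id, and the final resemblance is estimated as the Jaccard coefficient of the two sketch sets; both choices are assumptions of this example.

```python
from itertools import combinations, groupby


def near_duplicate_pairs(sketches, threshold):
    """sketches: doc id -> list of shingle fingerprints (step 1 already done)."""
    # Step 2: write out <shingle_fp, docId> pairs.
    records = [(fp, doc) for doc, sk in sketches.items() for fp in set(sk)]
    # Step 3: sort by shingle_fp (ties broken by doc id).
    records.sort()
    # Step 4: sequential scan, one <docId1, docId2, 1> triplet per shared shingle.
    triplets = []
    for _, run in groupby(records, key=lambda record: record[0]):
        docs = [doc for _, doc in run]
        triplets.extend((d1, d2, 1) for d1, d2 in combinations(docs, 2))
    # Step 5: sort on <docId1, docId2>.
    triplets.sort()
    # Steps 6 and 7: merge counts per pair and keep pairs above the threshold.
    result = []
    for (d1, d2), run in groupby(triplets, key=lambda t: (t[0], t[1])):
        count = sum(c for _, _, c in run)
        union = len(set(sketches[d1])) + len(set(sketches[d2])) - count
        estimate = count / union
        if estimate > threshold:
            result.append((d1, d2, estimate))
    return result


sketches = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 5], "c": [9, 10, 11, 12]}
pairs = near_duplicate_pairs(sketches, threshold=0.5)
```

Only documents that share at least one fingerprint are ever paired, which is what avoids the O(N^2) comparison of all sketches; in this toy input only a and b are reported.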

Page 75: Web Noises Detection and Elimination PengBo Dec 3, 2010

DCM algorithm

1. for each random permutation π do
2.   create a file f_π
3.   for each document d do
4.     write out <s = min(π(S(d))), d> to f_π
5.   end for
6.   sort f_π using key s -- this results in contiguous blocks with fixed s containing all associated d
7.   create a file g_π
8.   for each pair (d1,d2) within a run of f_π having a given s do
9.     write out a document-pair record <d1,d2> to g_π
10.   end for
11.   sort g_π on key (d1,d2)
12. end for
13. merge g_π for all π in (d1,d2) order, counting the number of (d1,d2) entries

Page 76: Web Noises Detection and Elimination PengBo Dec 3, 2010

Some optimizations

“Invert and Pair” is the most expensive step

We can speed it up by eliminating very common shingles

• Common headers, footers, etc.
• Do it as a preprocessing step

Also, eliminate exact duplicates up front

Probabilistic Counting [5]

Page 77: Web Noises Detection and Elimination PengBo Dec 3, 2010

Detecting duplicate pages

Page 78: Web Noises Detection and Elimination PengBo Dec 3, 2010

State of the art Technology

Page 79: Web Noises Detection and Elimination PengBo Dec 3, 2010

Volume and Evolution of Page Templates

Our results show that 40–50% of the content on the web is template content.

Over the last eight years, the fraction of template content has doubled, and the growth shows no sign of abating.

Text, links, and total HTML bytes within templates are all growing as a fraction of total content at a rate of between 6 and 8% per year.

Question: how to design the experiment to reach these conclusions?

Page 80: Web Noises Detection and Elimination PengBo Dec 3, 2010