Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections
Presenter: Tsai Tzung Ruei
Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke
SIGIR 2008
National Yunlin University of Science and Technology
Outline
Motivation
Objective
Methodology
Experiments
Conclusion
Comments
Motivation
Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.
Objective
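Exact duplicates are easy to avoid during the collection of Web archives, but near duplicates frequently slip into the corpus. The objective is to detect these near duplicates robustly and efficiently in large Web collections.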
Methodology
SPOT SIGNATURE EXTRACTION
MATCHING
[Figure: Web, Database, document]
Methodology
SPOT SIGNATURE EXTRACTION
A = {a_j(d_j, c_j)}, where each antecedent a_j has a spot distance d_j and a chain length c_j
Example
Antecedents: a(1,2), an(1,2), the(1,2), and is(1,2)
“At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.”
Result
S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
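The extraction step in this example can be sketched in a few lines of Python. This is only an illustration under assumptions, not the authors' implementation: the function name `spot_signatures`, the small `STOPWORDS` list, and the reading of the spot distance d as "take every d-th non-stopword token after the antecedent" are all choices made here for the sketch.

```python
import re
from collections import Counter

# Hypothetical stopword list, just large enough for the slide's sentence.
STOPWORDS = {"a", "an", "the", "is", "at", "to", "off", "for", "from",
             "on", "that", "into", "against", "and", "of"}

# Antecedent -> (spot distance d, chain length c), as on the slide.
ANTECEDENTS = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}

def spot_signatures(text, antecedents, stopwords=STOPWORDS):
    """Return a multi-set (Counter) of spot signatures found in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = Counter()
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        d, c = antecedents[token]
        # Non-stopword tokens that follow the antecedent.
        content = [t for t in tokens[i + 1:] if t not in stopwords]
        # Assumed meaning of d: keep every d-th content token, c of them.
        chain = content[d - 1::d][:c]
        if len(chain) == c:
            signatures[":".join([token] + chain)] += 1
    return signatures

TEXT = ("At a rally to kick off a weeklong campaign for the South Carolina "
        "primary, Obama tried to set the record straight from an attack "
        "circulating widely on the Internet that is designed to play into "
        "prejudices against Muslims and fears of terrorism.")

S = spot_signatures(TEXT, ANTECEDENTS)
for signature in sorted(S):
    print(signature, S[signature])
```

With this stopword list, the sketch yields the seven signatures listed in the example, each occurring once.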
Methodology
SPOT SIGNATURE MATCHING
Jaccard Similarity for Sets
Generalization for Multi-Sets
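The formulas on this slide did not survive in the transcript. As a rough guide, the sketch below uses the standard definitions: Jaccard similarity for sets is |A ∩ B| / |A ∪ B|, and the usual multi-set generalization divides the sum of per-signature minimum frequencies by the sum of per-signature maximum frequencies. The function names and the toy inputs are assumptions for illustration.

```python
from collections import Counter

def jaccard_set(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def jaccard_multiset(a, b):
    """Multi-set Jaccard: sum of min frequencies over sum of max frequencies."""
    a, b = Counter(a), Counter(b)
    keys = set(a) | set(b)
    inter = sum(min(a[k], b[k]) for k in keys)  # missing keys count as 0
    union = sum(max(a[k], b[k]) for k in keys)
    return inter / union if union else 1.0

# Toy signature multi-sets (hypothetical data).
A = Counter({"x": 2, "y": 1})
B = Counter({"x": 1, "y": 1, "z": 1})

print(jaccard_set(A, B))       # compares only the distinct signatures
print(jaccard_multiset(A, B))  # also takes frequencies into account
```

When every frequency is 1, the multi-set version reduces to the set version.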
Methodology
SPOT SIGNATURE MATCHING
[Figure: SPOT SIGNATURE, partitions, Inverted Index Pruning, Jaccard Similarity for Sets]
Methodology
Optimal Partitioning
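The slide body is missing from the transcript, so the following is only a sketch of the idea behind length-based partitioning, not the paper's exact construction. It relies on a fact that follows directly from the multi-set Jaccard definition: the similarity of two documents can never exceed min(|A|, |B|) / max(|A|, |B|). If partition boundaries grow by a factor of at least 1/τ, any pair that can still reach the threshold τ falls into the same or a neighboring partition. The function names and the boundary rule are assumptions.

```python
import math
from bisect import bisect_right

def length_bound(len_a, len_b):
    """Upper bound on multi-set Jaccard similarity from lengths alone."""
    return min(len_a, len_b) / max(len_a, len_b)

def partition_boundaries(max_len, tau):
    """Boundaries p_0 < p_1 < ... with p_{k+1} >= p_k / tau (assumed rule)."""
    if not 0 < tau < 1:
        raise ValueError("tau must be strictly between 0 and 1")
    bounds = [1]
    while bounds[-1] <= max_len:
        bounds.append(math.ceil(bounds[-1] / tau))
    return bounds

def partition_of(doc_len, bounds):
    """Index k of the partition [p_k, p_{k+1}) that contains doc_len."""
    return bisect_right(bounds, doc_len) - 1

BOUNDS = partition_boundaries(20, 0.8)
print(BOUNDS)
print(partition_of(13, BOUNDS), partition_of(14, BOUNDS))
```

With such boundaries, a document only needs to be compared against documents in its own partition and the adjacent one.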
Methodology
Inverted Index Pruning
Example
d1 = {s1:5, s2:4, s3:4}, with |d1| = 13
d2 = {s1:8, s2:4}, |d2| = 12
d3 = {s1:4, s2:5, s3:5}, |d3| = 14
τ = 0.8
δ1 = 0
δ2 = |d1| − |d3| = −1
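To make the example concrete, the sketch below builds a small inverted index over the three documents and looks up near duplicates of d1 at τ = 0.8. It is a simplified illustration: it does not reproduce the δ1/δ2 threshold pruning shown on the slide, and instead filters candidates with the simpler length bound min(|d|, |d'|) / max(|d|, |d'|) before computing the exact multi-set Jaccard similarity. All function names are assumptions.

```python
from collections import Counter, defaultdict

# Documents from the slide, as multi-sets of spot signatures.
DOCS = {
    "d1": Counter({"s1": 5, "s2": 4, "s3": 4}),
    "d2": Counter({"s1": 8, "s2": 4}),
    "d3": Counter({"s1": 4, "s2": 5, "s3": 5}),
}
TAU = 0.8

def length(doc):
    """Multi-set length |d|: total number of signature occurrences."""
    return sum(doc.values())

def multiset_jaccard(a, b):
    keys = set(a) | set(b)
    inter = sum(min(a[k], b[k]) for k in keys)
    union = sum(max(a[k], b[k]) for k in keys)
    return inter / union if union else 1.0

def build_index(docs):
    """Inverted index: signature -> list of document ids containing it."""
    index = defaultdict(list)
    for doc_id, signatures in docs.items():
        for signature in signatures:
            index[signature].append(doc_id)
    return index

def near_duplicates(query_id, docs, index, tau):
    query = docs[query_id]
    # Candidates share at least one signature with the query document.
    candidates = {d for s in query for d in index[s] if d != query_id}
    result = []
    for doc_id in sorted(candidates):
        other = docs[doc_id]
        lo, hi = sorted((length(query), length(other)))
        if lo / hi < tau:
            continue  # length bound already rules this pair out
        if multiset_jaccard(query, other) >= tau:
            result.append(doc_id)
    return result

INDEX = build_index(DOCS)
print(near_duplicates("d1", DOCS, INDEX, TAU))
```

For these inputs, sim(d1, d2) = 9/16 and sim(d1, d3) = 12/15, so only d3 meets the threshold of 0.8.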
Experiments
Gold Set of Near Duplicate News Articles: SpotSigs vs. Shingling, Choice of Spot Signatures, SpotSigs vs. Hashing
TREC WT10g: SpotSigs vs. Hashing
Experiments
Gold Set of Near Duplicate News Articles
SpotSigs vs. Shingling
Choice of Spot Signatures
SpotSigs vs. Hashing
Experiments
TREC WT10g: SpotSigs vs. Hashing
Conclusion
MAJOR CONTRIBUTION
SpotSigs provides both more robust signatures and highly efficient deduplication compared to various state-of-the-art approaches.
FUTURE WORK
Future work will focus on efficient access to disk-based index structures, as well as generalizing the bounding approach toward other metrics such as Cosine.
Comments
Advantage
The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient.
Drawback
…..
Application
Information retrieval