TRANSCRIPT
Efficient Duplicate Detection Over Massive Data Sets
Pradeeban Kathiravelu
INESC-ID Lisboa, Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 4. April 21, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Duplicate Detection 1 / 21
Dedoop: Efficient Deduplication with Hadoop
Introduction
Blocking
Grouping of entities that are “somehow similar”.
Comparisons restricted to entities from the same block.
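As a concrete illustration, blocking can be sketched in a few lines of Python. The records and the prefix-based blocking key below are hypothetical, chosen only to show how the number of comparisons shrinks:

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(entities, key_fn):
    """Group entities into blocks by a blocking key."""
    blocks = defaultdict(list)
    for e in entities:
        blocks[key_fn(e)].append(e)
    return blocks

def candidate_pairs(blocks):
    """Comparisons are restricted to pairs within the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical records; the blocking key is the first three letters.
people = ["John Smith", "Jon Smith", "Mary Jones", "John Smyth"]
blocks = block_by_key(people, key_fn=lambda name: name[:3].lower())
pairs = list(candidate_pairs(blocks))
# 1 candidate pair instead of the 6 an all-pairs comparison would need.
```

The trade-off, which motivates the multiple-blocking-key techniques discussed later, is that true duplicates falling into different blocks ("Jon Smith" above) are missed.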
Entity Resolution (ER; also known as object matching or deduplication) is costly.
Traditional blocking approaches are not effective.
Motivation
Advantages of leveraging parallel and cloud environments.
Manual tuning of ER parameters is facilitated, as ER results can be quickly generated and evaluated.
⇓ Execution times for large data sets ⇒ speed up common data management processes.
Dedoop
http://dbs.uni-leipzig.de/dedoop
MapReduce-based entity resolution of large datasets.
Pair-wise similarity computation [O(n²)] executed in parallel.
Automatic transformation: workflow definition ⇒ executable MapReduce workflow.
Avoid unnecessary entity pair comparisons that result from the utilization of multiple blocking keys.
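One standard way to avoid such redundant comparisons — a sketch of the general idea, not necessarily Dedoop's exact mechanism — is to compare a pair only under the smallest blocking key both entities share:

```python
from collections import defaultdict
from itertools import combinations

def multi_key_pairs(entities, key_fns):
    """With several blocking keys per entity, the same pair can co-occur
    in more than one block; yield it only in the smallest shared key."""
    keys = {e: {fn(e) for fn in key_fns} for e in entities}
    blocks = defaultdict(list)
    for e in entities:
        for k in keys[e]:
            blocks[k].append(e)
    for k, members in blocks.items():
        for a, b in combinations(members, 2):
            if min(keys[a] & keys[b]) == k:  # smallest shared key wins
                yield a, b

# Hypothetical keys: 3-letter prefixes of the first and the last name.
people = ["Anna Bell", "Anna Belle", "Annie Bell"]
fns = [lambda s: s.split()[0][:3].lower(),
       lambda s: s.split()[1][:3].lower()]
pairs = list(multi_key_pairs(people, fns))
# Every pair shares both keys ("ann", "bel") but is emitted exactly once.
```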
Features
Several load balancing strategies
In combination with its blocking techniques, to achieve balanced workloads across all employed nodes of the cluster.
User Interface
Users easily specify advanced ER workflows in a web browser.
Choose from a rich toolset of common ER components:
Blocking techniques.
Similarity functions.
Machine learning for automatically building match classifiers.
Visualization of the ER results and the workload of all cluster nodes.
Solution Architecture
Map determines blocking keys for each entity and outputs (blocking key, entity) pairs.
Reduce compares entities that belong to the same block.
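That map/reduce split can be simulated in plain Python; the record layout, the blocking key, and the token-overlap matcher below are illustrative assumptions, not Dedoop's actual components:

```python
from collections import defaultdict
from itertools import combinations

def map_phase(entity):
    # Map: emit a (blocking key, entity) pair; the key here is simply
    # a lowercase two-letter prefix of the name.
    return entity["name"][:2].lower(), entity

def reduce_phase(entities, sim, threshold):
    # Reduce: compare all entities that share the same blocking key.
    return [(a["id"], b["id"]) for a, b in combinations(entities, 2)
            if sim(a, b) >= threshold]

def sim(a, b):
    # Illustrative matcher: Jaccard similarity over name tokens.
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    return len(ta & tb) / len(ta | tb)

records = [{"id": 1, "name": "Alice Cooper"},
           {"id": 2, "name": "Alice Copper"},
           {"id": 3, "name": "Bob Marley"}]

shuffled = defaultdict(list)                 # shuffle: group by block key
for key, rec in (map_phase(r) for r in records):
    shuffled[key].append(rec)
matches = [p for group in shuffled.values()
           for p in reduce_phase(group, sim, 0.3)]
# matches == [(1, 2)]
```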
MapDupReducer: Detecting Near Duplicates over Massive Datasets
Near Duplicate Detection (NDD)
Multi-processor systems are more effective than single-machine approaches.
MapReduce Platform.
Ease of use. High efficiency.
System Architecture
Non-trivial generalization of the PPJoin algorithm into the MapReduce framework.
Redesigning the position and prefix filtering.
Document signature filtering to further reduce the candidate size.
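Prefix filtering, the baseline that PPJoin extends, can be sketched as follows: for a Jaccard threshold t, two records can reach t only if their prefixes — the first |r| − ⌈t·|r|⌉ + 1 tokens under a global ordering — share a token. The token order and threshold below are illustrative:

```python
import math

def prefix(tokens, t, order):
    """First |r| - ceil(t*|r|) + 1 tokens of the record under a global
    token ordering (rarest tokens first makes the filter selective)."""
    s = sorted(tokens, key=order.index)
    plen = len(s) - math.ceil(t * len(s)) + 1
    return set(s[:plen])

def may_match(r1, r2, t, order):
    """Prefix filter: a pair is a candidate only if prefixes overlap."""
    return bool(prefix(r1, t, order) & prefix(r2, t, order))

order = ["xyz", "smith", "john", "the"]   # hypothetical global ordering
r1 = {"john", "smith", "the"}
r2 = {"john", "the", "xyz"}               # Jaccard(r1, r2) = 0.5
r3 = {"smith", "john"}
# With t = 0.8, (r1, r2) is safely pruned; (r1, r3) stays a candidate.
```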
Evaluation
Data sets:
MEDLINE documents: finding plagiarized documents; 18.5 million records.
BING: web pages with an aggregated size of 2 TB.
Hotspot: high update frequency.
Altering the arguments:
Different numbers of map() and reduce() parameters.
Efficient Similarity Joins for Near Duplicate Detection
Similarity Definitions
Efficient Similarity Join Algorithms
Efficient similarity join algorithms by exploiting the ordering of tokens in the records.
Positional filtering and suffix filtering are complementary to the existing prefix filtering technique.
The commonly used strategy depends on the size of the document.
Text documents: edit distance and Jaccard similarity.
Edit distance: the minimum number of edits required to transform one string into another; an insertion, deletion, or substitution of a single character.
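The definition above is the classic Levenshtein distance, computable with dynamic programming:

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to transform s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

# edit_distance("kitten", "sitting") == 3
```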
Web documents: Jaccard or overlap similarity on small or fixed-sized sketches.
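Jaccard similarity over small sketches can be illustrated with character shingles (a simple stand-in for the sketching schemes the papers use):

```python
def shingles(text, k=3):
    """Set of k-character shingles of a document (a crude sketch)."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

s1 = shingles("near duplicate detection")
s2 = shingles("near-duplicate detection")
# The two near-duplicate strings score high but below 1.0.
```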
The near-duplicate object detection problem is a generalization of the well-known nearest-neighbor problem.
Efficient Parallel Set-Similarity Joins Using MapReduce
Introduction
Efficiently perform set-similarity joins in parallel using the popular MapReduce framework.
A 3-stage approach for end-to-end set-similarity joins.
Efficiently partition the data across nodes.
Balance the workload. The need for replication ⇓.
MapReduce
Parallel Set-Similarity Joins Stages
1. Token Ordering: computes data statistics in order to generate good signatures.
The techniques in later stages utilize these statistics.
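Token Ordering can be sketched as sorting tokens by ascending global frequency, so the rarest (most selective) tokens come first in every record's signature; the records below are illustrative:

```python
from collections import Counter

def token_order(records):
    """Count token frequencies over the whole data set and order tokens
    from rarest to most frequent (ties broken alphabetically)."""
    freq = Counter(tok for r in records for tok in r)
    return sorted(freq, key=lambda t: (freq[t], t))

records = [{"data", "quality"}, {"data", "cleaning"}, {"data", "joins"}]
order = token_order(records)
# "data" occurs in every record, so it sorts last (least selective).
```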
2. RID-Pair Generation: extracts the record IDs ("RIDs") and the join-attribute value from each record.
Distributes the (RID, join-attribute value) pairs.
Pairs sharing a signature go to at least one common reducer.
Reducers compute the similarity of the join-attribute values and output RID pairs of similar records.
3. Record Join: generates the actual pairs of joined records.
It uses the list of RID pairs from the second stage and the original data to build the pairs of similar records.
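The second and third stages can be sketched together as a single-process rendering (record IDs, tokens, threshold, and the global token order below are illustrative; the real system distributes the signature groups across reducers):

```python
from collections import defaultdict
from itertools import combinations
import math

def rid_pairs(records, order, t):
    """Stage 2: route each record to the groups ("reducers") of its prefix
    tokens (its signatures); within a group, verify Jaccard >= t."""
    rank = {tok: i for i, tok in enumerate(order)}
    groups = defaultdict(list)
    for rid, toks in records.items():
        s = sorted(toks, key=rank.get)
        plen = len(s) - math.ceil(t * len(s)) + 1
        for sig in s[:plen]:
            groups[sig].append((rid, toks))
    out = set()
    for members in groups.values():
        for (r1, t1), (r2, t2) in combinations(members, 2):
            if len(t1 & t2) / len(t1 | t2) >= t:
                out.add(tuple(sorted((r1, r2))))
    return out

def record_join(records, pairs):
    """Stage 3: use the RID pairs plus the original data to build the
    actual pairs of similar records."""
    return [(records[a], records[b]) for a, b in pairs]

records = {1: {"data", "quality", "rules"},
           2: {"data", "quality", "rule"},
           3: {"graph", "mining"}}
order = ["rules", "rule", "graph", "mining", "quality", "data"]
pairs = rid_pairs(records, order, 0.5)   # {(1, 2)}
joined = record_join(records, pairs)
```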
Token Ordering
Handling Insufficient Memory
Speedup
Scalability
Conclusion
MapReduce frameworks offer an effective platform for near-duplicate detection.
Distributed execution frameworks can be leveraged for scalable data cleaning.
Efficient partitioning for data that cannot fit in the main memory.
Software-Defined Networking and later advances in networking can lead to better data solutions.
References
Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: Efficient deduplication with Hadoop. Proceedings of the VLDB Endowment, 5(12), 1878–1881.
Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 495–506). ACM.
Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., ... & Li, R. (2010, June). MapDupReducer: Detecting near duplicates over massive datasets. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 1119–1122). ACM.
Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.
Thank you!