
Efficient Duplicate Detection Over Massive Data Sets

Pradeeban Kathiravelu

INESC-ID Lisboa, Instituto Superior Técnico, Universidade de Lisboa

Lisbon, Portugal

Data Quality – Presentation 4. April 21, 2015.


Dedoop: Efficient Deduplication with Hadoop

Introduction

Blocking

Grouping of entities that are “somehow similar”.

Comparisons restricted to entities from the same block.

Entity Resolution (ER; also known as object matching or deduplication)

Costly: naively, every entity must be compared against every other.

Traditional blocking approaches are not effective at this scale.
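A minimal sketch of blocking in plain Python (the toy records and the first-letter blocking key are illustrative assumptions, not from the paper):

from collections import defaultdict
from itertools import combinations

# Toy records; blocking on the first letter of the name is illustrative.
records = [
    {"id": 1, "name": "Smith, John"},
    {"id": 2, "name": "Smyth, Jon"},
    {"id": 3, "name": "Brown, Ana"},
    {"id": 4, "name": "Braun, Anna"},
]

def blocking_key(record):
    # Group entities that are "somehow similar": here, same first letter.
    return record["name"][0].upper()

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

# Comparisons are restricted to entities from the same block:
# 2 candidate pairs instead of C(4, 2) = 6.
pairs = [p for block in blocks.values() for p in combinations(block, 2)]
print(len(pairs))  # 2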


Dedoop: Efficient Deduplication with Hadoop

Motivation

Advantages of leveraging parallel and cloud environments.

Manual tuning of ER parameters is facilitated, as ER results can be quickly generated and evaluated.

⇓ Execution times for large data sets ⇒ speed up common data management processes.


Dedoop: Efficient Deduplication with Hadoop

Dedoop

http://dbs.uni-leipzig.de/dedoop

MapReduce-based entity resolution of large datasets.

Pair-wise similarity computation [O(n²)] executed in parallel.

Automatic transformation: workflow definition ⇒ executable MapReduce workflow.

Avoids unnecessary entity pair comparisons that result from the use of multiple blocking keys.


Dedoop: Efficient Deduplication with Hadoop

Features

Several load balancing strategies

In combination with its blocking techniques, to achieve balanced workloads across all employed nodes of the cluster.
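One illustrative strategy in this spirit, sketched in Python: split oversized blocks so that no single reduce task receives a disproportionate share of the pairwise comparisons (a sketch under my own assumptions, not Dedoop's exact algorithm):

from itertools import combinations

def split_block(block, max_size):
    # Split an oversized block into sub-blocks of at most max_size entities.
    return [block[i:i + max_size] for i in range(0, len(block), max_size)]

def balanced_pairs(block, max_size=1000):
    # All pairs are still compared, but the work is divided into bounded
    # units: within each sub-block and across each pair of sub-blocks,
    # each unit small enough to be assigned to a different node.
    subs = split_block(block, max_size)
    for sub in subs:
        yield from combinations(sub, 2)
    for s1, s2 in combinations(subs, 2):
        for a in s1:
            for b in s2:
                yield (a, b)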


Dedoop: Efficient Deduplication with Hadoop

User Interface

Users easily specify advanced ER workflows in a web browser.

Choose from a rich toolset of common ER components:

Blocking techniques.

Similarity functions.

Machine learning for automatically building match classifiers.

Visualization of the ER results and the workload of all cluster nodes.


Dedoop: Efficient Deduplication with Hadoop

Solution Architecture

Map determines blocking keys for each entity and outputs (blocking key, entity) pairs.

Reduce compares entities that belong to the same block.
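A single-process Python sketch of this map/reduce split (the zip-code blocking key, the record layout, and the similarity threshold are illustrative assumptions):

from collections import defaultdict
from itertools import combinations

def map_fn(entity):
    # Map: determine the blocking key(s) for the entity and emit
    # (blocking key, entity) pairs; a zip-code prefix is an illustrative key.
    yield entity["zip"][:3], entity

def reduce_fn(entities, similarity, threshold=0.8):
    # Reduce: compare only the entities that belong to the same block.
    for a, b in combinations(entities, 2):
        if similarity(a, b) >= threshold:
            yield a["id"], b["id"]

def run_job(entities, similarity):
    blocks = defaultdict(list)
    for e in entities:                     # map + shuffle by blocking key
        for key, value in map_fn(e):
            blocks[key].append(value)
    return [match for block in blocks.values()
            for match in reduce_fn(block, similarity)]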


MapDupReducer: Detecting Near Duplicates over Massive Datasets

Near Duplicate Detection (NDD)

Multi-processor systems are more effective for NDD at this scale.

MapReduce Platform.

Ease of use.

High efficiency.


MapDupReducer: Detecting Near Duplicates over Massive Datasets

System Architecture

Non-trivial generalization of the PPJoin algorithm into the MapReduce framework.

Redesigns the positional and prefix filtering.

Document signature filtering further reduces the candidate size.
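A sketch of the prefix filtering principle that PPJoin builds on (the standard formulation for a Jaccard threshold t; function and variable names are my own):

import math

def prefix_tokens(tokens, global_order, t):
    # Prefix filtering for a Jaccard threshold t: two records can reach the
    # threshold only if they share at least one token among the first
    # len(x) - ceil(t * len(x)) + 1 tokens under a fixed global ordering
    # (rarest tokens first). Positional filtering further uses the position
    # of the shared token to tighten the bound on the possible overlap.
    toks = sorted(set(tokens), key=global_order.__getitem__)
    plen = len(toks) - math.ceil(t * len(toks)) + 1
    return toks[:plen]

Only records whose prefixes intersect become candidate pairs; all other pairs are pruned without ever computing the full similarity.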


MapDupReducer: Detecting Near Duplicates over Massive Datasets

Evaluation

Data sets:

MEDLINE documents: finding plagiarized documents; 18.5 million records.

Bing: web pages with an aggregated size of 2 TB.

Hotspot: high update frequency.

Altering the arguments: different numbers of map() and reduce() parameters.


Efficient Similarity Joins for Near Duplicate Detection

Similarity Definitions
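For records viewed as sets of tokens, the measures used throughout these papers are the standard ones below (a minimal Python sketch; the edge-case conventions for empty sets are my own):

def jaccard(x, y):
    # Jaccard similarity: |x ∩ y| / |x ∪ y| for token sets x and y.
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x or y) else 1.0

def overlap(x, y):
    # Overlap similarity: |x ∩ y| / min(|x|, |y|).
    x, y = set(x), set(y)
    return len(x & y) / min(len(x), len(y)) if (x and y) else 0.0

print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5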


Efficient Similarity Joins for Near Duplicate Detection

Efficient Similarity Join Algorithms

Efficient similarity join algorithms by exploiting the ordering of tokens in the records.

Positional filtering and suffix filtering are complementary to the existing prefix filtering technique.

The commonly used strategy depends on the size of the documents.

Text documents: edit distance and Jaccard similarity.

Edit distance: the minimum number of edits required to transform one string into another, where an edit is an insertion, deletion, or substitution of a single character (see the sketch after this list).

Web documents: Jaccard or overlap similarity on small or fixed-size sketches.

The near duplicate object detection problem is a generalization of the well-known nearest neighbor problem.
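A compact sketch of the edit distance just defined, via the standard dynamic program kept to two rows of memory:

def edit_distance(s, t):
    # Minimum number of single-character insertions, deletions, or
    # substitutions needed to transform s into t.
    m, n = len(s), len(t)
    dp = list(range(n + 1))              # row for the empty prefix of s
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # delete s[i-1]
                        dp[j - 1] + 1,   # insert t[j-1]
                        prev + (s[i - 1] != t[j - 1]))  # substitute or keep
            prev = cur
    return dp[n]

print(edit_distance("kitten", "sitting"))  # 3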


Efficient Parallel Set-Similarity Joins Using MapReduce

Introduction

Efficiently perform set-similarity joins in parallel using the popular MapReduce framework.

A 3-stage approach for end-to-end set-similarity joins.

Efficiently partition the data across nodes.

Balance the workload.

The need for replication ⇓.


Efficient Parallel Set-Similarity Joins Using MapReduce

MapReduce


Efficient Parallel Set-Similarity Joins Using MapReduce

Parallel Set-Similarity Joins Stages

1. Token Ordering: computes data statistics in order to generate good signatures.

The techniques in later stages utilize these statistics.

2. RID-Pair Generation: extracts the record IDs ("RID") and the join-attribute value from each record.

Distributes the (RID, join-attribute value) pairs.

Pairs sharing a signature go to at least one common reducer.

Reducers compute the similarity of the join-attribute values and output RID pairs of similar records.

3. Record Join: generates the actual pairs of joined records.

It uses the list of RID pairs from the second stage and the original data to build the pairs of similar records.
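A compressed, single-process Python sketch of the three stages (the toy records, the threshold, and the prefix-token signature are illustrative assumptions; the paper runs these steps as separate MapReduce jobs):

from collections import Counter, defaultdict
from itertools import combinations
import math

records = {1: "apple pie recipe", 2: "apple pie recipes", 3: "banana bread"}
t = 0.5  # Jaccard threshold (illustrative)

# Stage 1 - Token Ordering: global token frequencies define the ordering
# (rarest first) used to build signatures.
freq = Counter(tok for text in records.values() for tok in text.split())
order = {tok: i for i, (tok, _) in
         enumerate(sorted(freq.items(), key=lambda kv: kv[1]))}

# Stage 2 - RID-Pair Generation: route each RID to "reducers" keyed by its
# prefix tokens, so records sharing a signature meet at a common reducer.
buckets = defaultdict(list)
for rid, text in records.items():
    toks = sorted(set(text.split()), key=order.get)
    plen = len(toks) - math.ceil(t * len(toks)) + 1
    for tok in toks[:plen]:
        buckets[tok].append((rid, set(toks)))

rid_pairs = set()
for group in buckets.values():
    for (r1, s1), (r2, s2) in combinations(group, 2):
        if len(s1 & s2) / len(s1 | s2) >= t:
            rid_pairs.add(tuple(sorted((r1, r2))))

# Stage 3 - Record Join: use the RID pairs and the original data to build
# the actual pairs of similar records.
print([(records[a], records[b]) for a, b in rid_pairs])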


Efficient Parallel Set-Similarity Joins Using MapReduce

Token Ordering


Efficient Parallel Set-Similarity Joins Using MapReduce

Handling Insufficient Memory


Efficient Parallel Set-Similarity Joins Using MapReduce

Speedup


Efficient Parallel Set-Similarity Joins Using MapReduce

Scalability


Conclusion

Conclusion

MapReduce frameworks offer an effective platform for near duplicate detection.

Distributed execution frameworks can be leveraged for scalable data cleaning.

Efficient partitioning for data that cannot fit in main memory.

Software-Defined Networking and later advances in networking can lead to better data solutions.


Conclusion

References

Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: efficient deduplication with Hadoop. Proceedings of the VLDB Endowment, 5(12), 1878-1881.

Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 495-506). ACM.

Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., ... & Li, R. (2010, June). MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 1119-1122). ACM.

Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.

Thank you!