Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
Association for Computational Linguistics, 2008

May 15, 2014
Kyung-Bin Lim




Outline

Introduction
Methodology
Results
Conclusion


Pairwise Similarity of Documents

PubMed – “More like this”
Similar blog posts
Google – Similar pages


Abstract Problem

Applications:
– Clustering
– “more-like-that” queries

[Figure: a collection of documents and the similarity scores computed between them; the values shown are 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74]




Trivial Solution

Load each vector O(N) times
O(N^2) dot products

Goal: a scalable and efficient solution for large collections
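
As a point of reference, the trivial approach can be sketched in a few lines of Python. The sparse dictionary representation, the toy collection, and the names `dot` and `all_pairs_naive` are assumptions made for illustration; they do not come from the original work.

```python
from itertools import combinations

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)

def all_pairs_naive(vectors):
    """Compute one dot product per document pair: O(N^2) of them in total."""
    return {
        (a, b): dot(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }

# Toy collection (hypothetical): term weights are raw term counts.
docs = {
    "d1": {"A": 2, "B": 1, "C": 1},
    "d2": {"B": 1, "D": 2},
    "d3": {"A": 1, "B": 2, "E": 1},
}
scores = all_pairs_naive(docs)
print(scores)
```

Every document vector is touched once per partner document, which is what makes this approach impractical for large collections.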


Better Solution

Load weights for each term once
Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair
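
Written out under the assumption of a standard inner-product similarity, the observation looks as follows. The symbols are introduced here for illustration: $w_{t,d}$ is the weight of term $t$ in document $d$, and $V$ is the vocabulary.

```latex
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i} \, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i} \, w_{t,d_j}
```

A term with document frequency $df_t$ occurs in $\binom{df_t}{2} = df_t (df_t - 1)/2$ document pairs, which is where the $O(df_t^2)$ partial scores per term come from.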


Better Solution

A term contributes to each pair of documents that contains it

Inverted index: the list of documents that contain a particular term

For example, if a term t1 appears in documents x, y, z, then t1 contributes to the pairs:

(x, y) (x, z) (y, z)
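
The pair enumeration in this example can be reproduced with a short Python sketch. The variable name `postings_t1` is hypothetical; `itertools.combinations` is from the standard library.

```python
from itertools import combinations

# Hypothetical posting list for term t1: the documents that contain it.
postings_t1 = ["x", "y", "z"]

# Every unordered pair of documents in the posting list receives a
# contribution from t1.
pairs = list(combinations(postings_t1, 2))
print(pairs)  # [('x', 'y'), ('x', 'z'), ('y', 'z')]
```

A posting list of length df_t yields df_t(df_t - 1)/2 pairs, so long posting lists dominate the work.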


Algorithm


MapReduce Programming

Framework that supports distributed computing on clusters of computers

Introduced by Google in 2004
Map step
Reduce step
Combine step (optional)
Applications
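
To make the three steps concrete, here is a minimal single-process sketch of one map, combine, shuffle, and reduce round, applied to word counting. It is a toy model only: a real framework such as Hadoop distributes these steps across machines, and the names `run_mapreduce`, `count_mapper`, and `sum_reducer` are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, combiner=None):
    """Run one map/shuffle/reduce round in memory (single process)."""
    grouped = defaultdict(list)
    for record in records:
        # Map step: each input record yields zero or more (key, value) pairs.
        local = defaultdict(list)
        for key, value in mapper(record):
            local[key].append(value)
        for key, values in local.items():
            # Optional combine step: pre-aggregate one mapper's output.
            if combiner is not None:
                values = [combiner(key, values)]
            grouped[key].extend(values)  # shuffle: group values by key
    # Reduce step: one call per distinct key.
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}

def count_mapper(line):
    for word in line.split():
        yield word, 1

def sum_reducer(key, values):
    return sum(values)

lines = ["a b a", "b c"]
counts = run_mapreduce(lines, count_mapper, sum_reducer, combiner=sum_reducer)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

The combiner reuses the reducer here because summation is associative and commutative, so pre-aggregating per mapper does not change the result.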


MapReduce Model


Computation Decomposition

map: load weights for each term once; each term contributes O(df_t^2) partial scores
reduce: sum the partial scores for each pair of documents

Each term contributes only if it appears in both documents of a pair


MapReduce Jobs

(1) Inverted Index Computation

(2) Pairwise Similarity


Job1: Inverted Index

Input documents:
d1: A A B C
d2: B D D
d3: A B B E

map output, one (term, (document, term frequency)) tuple per distinct term:
d1: (A,(d1,2)) (B,(d1,1)) (C,(d1,1))
d2: (B,(d2,1)) (D,(d2,2))
d3: (A,(d3,1)) (B,(d3,2)) (E,(d3,1))

shuffle, then reduce output, one posting list per term:
(A,[(d1,2),(d3,1)])
(B,[(d1,1),(d2,1),(d3,2)])
(C,[(d1,1)])
(D,[(d2,2)])
(E,[(d3,1)])
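
The inverted-index job on this slide can be imitated in plain Python. This is a single-process sketch, not Hadoop code; the function names `index_map` and `build_inverted_index` are hypothetical, and the three documents are the ones from the slide.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    """Emit (term, (doc_id, term frequency)) for each distinct term."""
    for term, tf in sorted(Counter(text.split()).items()):
        yield term, (doc_id, tf)

def build_inverted_index(docs):
    """Group the mapper output by term; each group is a posting list."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, posting in index_map(doc_id, docs[doc_id]):
            postings[term].append(posting)  # shuffle + reduce
    return dict(sorted(postings.items()))

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
index = build_inverted_index(docs)
for term, plist in index.items():
    print(term, plist)
```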


Job2: Pairwise Similarity

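Input, the posting lists from Job1:
(A,[(d1,2),(d3,1)])
(B,[(d1,1),(d2,1),(d3,2)])
(C,[(d1,1)])
(D,[(d2,2)])
(E,[(d3,1)])

map output, one partial score per pair of documents that share the term:
A: ((d1,d3),2)
B: ((d1,d2),1) ((d1,d3),2) ((d2,d3),2)
C, D, E: no output, since each appears in only one document

shuffle:
((d1,d2),[1])
((d1,d3),[2,2])
((d2,d3),[2])

reduce output:
((d1,d2),1)
((d1,d3),4)
((d2,d3),2)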
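
The second job can be sketched the same way, again in a single process and with hypothetical function names (`pairs_map`, `pairwise_similarity`). The weights are the raw term frequencies from the slide's example; the inverted index is written out literally so the block runs on its own.

```python
from collections import defaultdict
from itertools import combinations

def pairs_map(term, postings):
    """Emit ((doc_a, doc_b), partial score) for each pair sharing the term."""
    for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
        yield (doc_a, doc_b), w_a * w_b

def pairwise_similarity(index):
    grouped = defaultdict(list)
    for term, postings in index.items():
        for pair, partial in pairs_map(term, postings):
            grouped[pair].append(partial)  # shuffle: group by document pair
    # Reduce: sum the partial scores of each pair.
    return {pair: sum(partials) for pair, partials in sorted(grouped.items())}

# Posting lists produced by the inverted-index job in the slide's example.
index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
scores = pairwise_similarity(index)
print(scores)  # {('d1', 'd2'): 1, ('d1', 'd3'): 4, ('d2', 'd3'): 2}
```

Terms C, D, and E emit nothing because a posting list with a single document contains no pairs.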


Implementation Issues

df-cut
– Drop common terms
– Intermediate tuples are dominated by very high df terms
– Implemented a 99% cut (the most frequent 1% of terms are dropped)

Efficiency vs. effectiveness
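
A df-cut can be sketched as a filter over the inverted index. The sketch assumes that a cut at fraction p keeps the p share of terms with the lowest document frequency; the names `apply_df_cut` and `count_pairs` are hypothetical, and the toy index uses a 75% cut only because it has four terms.

```python
def apply_df_cut(index, keep_fraction):
    """Keep the `keep_fraction` of terms with the lowest document frequency."""
    # Sort by document frequency; break ties by term for determinism.
    by_df = sorted(index, key=lambda term: (len(index[term]), term))
    n_keep = int(len(by_df) * keep_fraction)
    kept = set(by_df[:n_keep])
    return {term: plist for term, plist in index.items() if term in kept}

def count_pairs(index):
    """Number of intermediate (pair, partial score) tuples the index generates."""
    return sum(len(p) * (len(p) - 1) // 2 for p in index.values())

# Toy index (hypothetical): term -> list of documents containing it.
index = {
    "A": ["d1", "d3"],
    "B": ["d1", "d2", "d3"],
    "C": ["d1"],
    "D": ["d2"],
}
cut = apply_df_cut(index, 0.75)
print(sorted(cut), count_pairs(index), count_pairs(cut))
```

Dropping the single most frequent term, B, reduces the intermediate tuples from 4 to 1 in this toy case, at the price of ignoring B's contribution to every score.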




Experimental Setup

Hadoop 0.16.0
Cluster of 19 machines
– Each with two processors (single core)
Aquaint-2 collection
– 2.5 GB of text
– 906k documents
Okapi BM25
Subsets of collection


Running Time of Pairwise Similarity Comparisons

[Figure: computation time (minutes, 0 to 140) versus corpus size (%, 0 to 100), with R^2 = 0.997]


Number of Intermediate Pairs

[Figure: intermediate pairs (billions, 0 to 9,000) versus corpus size (%, 0 to 100), with one curve each for df-cut at 99%, 99.9%, 99.99%, 99.999%, and no df-cut]




Conclusion

Simple and efficient MapReduce solution
– 2 hours for a ~million-document collection

Effective linear-time-scaling approximation
– 99.9% df-cut achieves 98% relative accuracy
– df-cut controls the efficiency vs. effectiveness tradeoff