Processing Theta-Joins using MapReduce
Alper Okcan, Mirek Riedewald
Northeastern University, Boston, MA
SIGMOD '11

12 April 2013
SNU IDB Lab.
Hyesung Oh



Outline

Introduction
Optimization Goal
Mapping Join Matrix Cells to Reducers
1-Bucket Theta
Experiments
Conclusion


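Introduction - 1

Internet companies want to analyze terabytes of data
– Parallel computation is essential

Join
– Equi-join: joins tuples whose join attributes have exactly the same value
– Theta-join: joins tuples under an arbitrary comparison condition (for example <, <=, != or a range condition)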
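
As a minimal sketch (not taken from the slides), both join types can be written as one nested-loop filter over the cross-product. The function name `theta_join` and the sample lists are illustrative assumptions.

```python
def theta_join(s, t, theta):
    """Return every pair (a, b) of the cross-product s x t that satisfies theta."""
    return [(a, b) for a in s for b in t if theta(a, b)]

# Hypothetical sample data, chosen only for illustration.
S = [1, 5, 9]
T = [4, 5, 6]

equi = theta_join(S, T, lambda a, b: a == b)           # equi-join is a special case
band = theta_join(S, T, lambda a, b: abs(a - b) <= 1)  # range (band) condition
less = theta_join(S, T, lambda a, b: a < b)            # inequality condition
```

The equi-join keeps only (5, 5), while the band condition also keeps (5, 4) and (5, 6).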


Introduction - 2

MapReduce overview


Introduction - 3

MapReduce
– Key-value pairs
– Map and reduce jobs
– Good for equi-joins
– What about other types of joins?

Reducer-centered cost model and a join model
– Simplify creating and reasoning about theta-joins
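
To illustrate why MapReduce suits equi-joins, here is a small in-memory sketch of the usual pattern: the map step emits the join key, so all rows with equal keys meet at the same reducer. The helper names (`map_phase`, `shuffle`, `reduce_phase`) and the data are assumptions for illustration, not a real Hadoop API.

```python
from collections import defaultdict

def map_phase(records, tag):
    # Emit (join_key, (tag, payload)) so that equal keys are grouped together.
    for key, payload in records:
        yield key, (tag, payload)

def shuffle(pairs):
    # Stand-in for the framework's grouping of intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Pair every S row with every T row that shares this key.
    s_rows = [p for tag, p in values if tag == "S"]
    t_rows = [p for tag, p in values if tag == "T"]
    return [(key, s, t) for s in s_rows for t in t_rows]

# Hypothetical input relations as (join_key, payload) pairs.
S = [(1, "s1"), (2, "s2"), (2, "s3")]
T = [(2, "t1"), (3, "t2")]

groups = shuffle(list(map_phase(S, "S")) + list(map_phase(T, "T")))
result = [row for key in sorted(groups) for row in reduce_phase(key, groups[key])]
```

A condition such as S.a < T.b has no single key to group on, which is the gap the rest of the talk addresses.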


Optimization Goal

How to minimize job completion time
– max-reducer-input: the largest input assigned to any reducer
– max-reducer-output: the largest output produced by any reducer
– Problem classes: input-size dominated, output-size dominated, input-output balanced


Mapping Join Matrix Cells to Reducers

Figure: standard equi-join (left), random (center), and balanced (right) mappings


Comparison of Reduce Allocation Methods

Simple allocation
– Minimizes the maximum input size of reduce functions
– Output size may be skewed

Random allocation
– Minimizes the maximum output size of reduce functions
– Input size may increase due to duplication

Balanced allocation
– Minimizes both maximum input and output sizes
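
The trade-off between the first two methods can be checked on a toy equi-join. The sketch below is an illustration under stated assumptions: the sample values, the choice of 3 reducers, and the helper `reducer_loads` are hypothetical, and a deterministic round-robin spread of cells stands in for the random allocation.

```python
from collections import defaultdict

def reducer_loads(cells, assign):
    """Return (max-reducer-input, max-reducer-output) for a cell-to-reducer mapping."""
    s_rows, t_rows, out = defaultdict(set), defaultdict(set), defaultdict(int)
    for idx, (i, j) in enumerate(cells):
        r = assign(idx, i, j)
        s_rows[r].add(i)   # reducer r must receive S row i
        t_rows[r].add(j)   # reducer r must receive T row j
        out[r] += 1        # each matching cell yields one output tuple
    max_in = max(len(s_rows[r]) + len(t_rows[r]) for r in out)
    return max_in, max(out.values())

# Hypothetical join-attribute values.
S = [5, 7, 7, 7, 8, 9]
T = [5, 7, 7, 8, 9, 9]
cells = [(i, j) for i, a in enumerate(S) for j, b in enumerate(T) if a == b]

by_key = reducer_loads(cells, lambda idx, i, j: S[i] % 3)  # simple: group by key
spread = reducer_loads(cells, lambda idx, i, j: idx % 3)   # cells spread evenly
```

On this data, grouping by key gives a maximum input of 5 but a maximum output of 6, while spreading the cells lowers the maximum output to 4 at the cost of a maximum input of 8.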


1-Bucket Theta

MapReduce algorithm
“Computes” the cross-product
Goals:
– Each pair of tuples is matched at exactly one reducer
– Minimal input to each reducer
– Minimal output from each reducer

“1-Bucket” refers to using no statistics about the data distribution


Algorithm

Precompute regions of the cross-product S x T
– Use the sizes of S (|S|) and T (|T|)
– Regions are disjoint
– The union of the regions covers the cross-product
– Each region is assigned to a single reducer

1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
1 1 1 1 2 2 2 2
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4

|S|=8; |T|=8; #reducers=4
Rows are tuples in S; columns are tuples in T
Value is the region for the <s,t> pair
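
The matrix above can be reproduced with a few lines. This is a minimal sketch that assumes the regions form a regular grid; the function name `region_matrix` and the parameters `grid_rows` and `grid_cols` are illustrative, not names from the paper.

```python
def region_matrix(n_s, n_t, grid_rows, grid_cols):
    """Label each cell of the n_s x n_t join matrix with a 1-based region id."""
    band_h = -(-n_s // grid_rows)  # rows of S per region (ceiling division)
    band_w = -(-n_t // grid_cols)  # columns of T per region
    return [
        [(i // band_h) * grid_cols + (j // band_w) + 1 for j in range(n_t)]
        for i in range(n_s)
    ]

# |S| = 8, |T| = 8, 4 reducers arranged as a 2 x 2 grid.
matrix = region_matrix(8, 8, 2, 2)
for row in matrix:
    print(" ".join(str(v) for v in row))
```

Each region id doubles as the reducer key, so the four regions map directly to the four reducers.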


Algorithm: Mapper

Each row in S
– Randomly assign a value x from 1 to |S|
– Output <region, row+‘S’> for each region containing matrix row x
– Example: assume x=3. Output <1, row+‘S’> and <2, row+‘S’>

Each row in T
– Same, except that x ranges from 1 to |T|, the regions are those containing matrix column x, and the output is <region, row+‘T’>
– Example: assume x=3. Output <1, row+‘T’> and <3, row+‘T’>
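
A sketch of the mapper for the 8 x 8 example with four regions follows. The constants, the function names, and the `(row, tag)` value layout are assumptions made for illustration; the slides only specify the emitted <region, row+tag> pairs.

```python
import random

# Assumed setup from the 8 x 8 example: a 2 x 2 grid of regions numbered 1..4.
N_S, N_T = 8, 8
GRID_ROWS, GRID_COLS = 2, 2
BAND_H = -(-N_S // GRID_ROWS)  # matrix rows per region (ceiling division)
BAND_W = -(-N_T // GRID_COLS)  # matrix columns per region

def regions_for_row(x):
    """Regions that intersect matrix row x (1-based)."""
    band = (x - 1) // BAND_H
    return [band * GRID_COLS + c + 1 for c in range(GRID_COLS)]

def regions_for_col(x):
    """Regions that intersect matrix column x (1-based)."""
    band = (x - 1) // BAND_W
    return [r * GRID_COLS + band + 1 for r in range(GRID_ROWS)]

def map_s(row, rng=random):
    x = rng.randint(1, N_S)  # random matrix row for this S tuple
    return [(region, (row, "S")) for region in regions_for_row(x)]

def map_t(row, rng=random):
    x = rng.randint(1, N_T)  # random matrix column for this T tuple
    return [(region, (row, "T")) for region in regions_for_col(x)]
```

With x=3, `regions_for_row` yields regions 1 and 2 and `regions_for_col` yields regions 1 and 3, matching the examples on the slide.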


Algorithm: Reducer

Joins all S rows it receives with all T rows it receives
Can use any join algorithm appropriate for the join condition
Outputs the cross-product, a theta-join, or an equi-join
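
A minimal reducer sketch, assuming the values arrive as `(row, tag)` pairs and using a plain nested loop as the local join algorithm. The name `reduce_region` and the sample values are hypothetical.

```python
def reduce_region(values, theta):
    """Join the S rows and T rows that were sent to one region."""
    s_rows = [row for row, tag in values if tag == "S"]
    t_rows = [row for row, tag in values if tag == "T"]
    # Nested-loop join; any algorithm suited to the condition could replace it.
    return [(s, t) for s in s_rows for t in t_rows if theta(s, t)]

values = [(1, "S"), (4, "S"), (2, "T"), (5, "T")]
less_than = reduce_region(values, lambda s, t: s < t)
cross = reduce_region(values, lambda s, t: True)  # always-true condition
```

Passing a condition that is always true returns the full cross-product of the region.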


Algorithm: Correctness

Random assignment of tuples
– Since the actual row number is unknown, any row number works
– Some reducer will compare each tuple with every tuple in the other table

Therefore, every pair is compared (as in a block nested-loop join) at exactly one reducer
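
The claim can be checked empirically with a small simulation. This sketch assumes a regular grid of regions and 0-based region ids; the function name `simulate` and the fixed seed are illustrative choices.

```python
import random
from collections import Counter, defaultdict

def simulate(n_s, n_t, grid_rows, grid_cols, seed=0):
    """Count at how many reducers each (s, t) pair meets under random assignment."""
    rng = random.Random(seed)
    band_h = -(-n_s // grid_rows)
    band_w = -(-n_t // grid_cols)
    buckets = defaultdict(lambda: ([], []))  # region -> (S ids, T ids)
    for s in range(n_s):
        band = (rng.randint(1, n_s) - 1) // band_h   # random matrix row
        for c in range(grid_cols):
            buckets[band * grid_cols + c][0].append(s)
    for t in range(n_t):
        band = (rng.randint(1, n_t) - 1) // band_w   # random matrix column
        for r in range(grid_rows):
            buckets[r * grid_cols + band][1].append(t)
    return Counter(
        (s, t) for s_ids, t_ids in buckets.values() for s in s_ids for t in t_ids
    )

meetings = simulate(8, 8, 2, 2)
```

An S tuple is sent to every region in one row band and a T tuple to every region in one column band, so the two meet only in the single region where those bands cross.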


Optimal Partitioning

Basis for minimal input and minimal output
Let |S| and |T| be the sizes of tables S and T, and r the number of reducers
Optimal output per reducer: |S||T|/r
Optimal input per reducer: sqrt(|S||T|/r) from each table


Example

|S|=8; |T|=8; r=4; sqrt(|S||T|/r)=4; s=t=2 (two bands of regions along each table)
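
The arithmetic of this example can be verified directly. The function name `optimal_partition` is an illustrative assumption; the formulas are the ones stated on the Optimal Partitioning slide.

```python
import math

def optimal_partition(n_s, n_t, r):
    """Ideal per-reducer output, region side length, and bands per table."""
    area = n_s * n_t / r    # optimal output per reducer: |S||T|/r
    side = math.sqrt(area)  # optimal input from each table
    return area, side, n_s / side, n_t / side

area, side, s, t = optimal_partition(8, 8, 4)
```

Here each of the 4 reducers covers a 4 x 4 square, that is 16 of the 64 cells.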


Near Optimal Partitioning

Optimal case is rare
General case
– s=floor(|S|/sqrt(|S||T|/r))
– t=floor(|T|/sqrt(|S||T|/r))
– Side length: floor((1+1/min(s,t)) * sqrt(|S||T|/r))
– Note: floor function omitted from paper

Example: |S|=|T|=8; r=9
– s=t=floor(8/sqrt(64/9))=3
– Side length = floor((1+1/3)*sqrt(64/9))=3
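
A sketch of the general-case computation follows. To avoid floating-point rounding at exact boundaries, it rewrites floor(|S|/sqrt(|S||T|/r)) as floor(sqrt(|S|*r/|T|)) and evaluates it with integer arithmetic; that rewriting, the function name `near_optimal_grid`, and the assumption that r is large enough for s and t to be at least 1 are mine, not the slides'.

```python
import math

def near_optimal_grid(n_s, n_t, r):
    """Return (s, t, side) for the general case, using the slide's formulas."""
    # |S| / sqrt(|S||T|/r) == sqrt(|S| * r / |T|); isqrt gives the exact floor.
    s = math.isqrt((n_s * r) // n_t)
    t = math.isqrt((n_t * r) // n_s)
    # Assumes s >= 1 and t >= 1, otherwise the division below fails.
    side = math.floor((1 + 1 / min(s, t)) * math.sqrt(n_s * n_t / r))
    return s, t, side

s, t, side = near_optimal_grid(8, 8, 9)
```

For |S|=|T|=8 and r=9 this reproduces s=t=3 and a side length of 3.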


Example: Near-Optimal Partitioning

Assumed partitioning
Note: 64/9=7.111...
Eight partitions with 7 cells and one with 8 is better
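
The note can be made concrete by comparing region sizes. The figure itself is not preserved in this transcript, so the sketch below assumes squares of side 3 clipped at the edge of the 8 x 8 matrix; `band_sizes` is a hypothetical helper.

```python
def band_sizes(n, bands, side):
    """Split n rows (or columns) into bands of at most `side`, clipping the last."""
    sizes, remaining = [], n
    for _ in range(bands):
        sizes.append(min(side, remaining))
        remaining -= sizes[-1]
    return sizes

rows = band_sizes(8, 3, 3)
cols = band_sizes(8, 3, 3)
areas = [h * w for h in rows for w in cols]  # cells covered by each of 9 regions

# Perfectly even split of 64 cells over 9 reducers: quotient 7, remainder 1.
ideal = divmod(8 * 8, 9)
```

Under this assumption the largest region holds 9 cells, whereas an even split would need no more than 8 on any reducer.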


Experiments

Cloud data set
– Information about cloud cover
– 382 million records, 28.8 GB
– Cloud-5-i is a 5 million record subset

SELECT S.date, S.longitude, S.latitude
FROM Cloud S, Cloud T
WHERE S.date = T.date
  AND S.longitude = T.longitude
  AND ABS(S.latitude - T.latitude) <= 10

SELECT S.latitude, T.latitude
FROM Cloud-5-1 S, Cloud-5-2 T
WHERE ABS(S.latitude - T.latitude) < 2


Experimental Results

Figure 6: Input imbalance for 1-Bucket-Theta (#buckets=1) and M-Bucket-I on Cloud
Figure 7: Max-reducer-input for 1-Bucket-Theta and M-Bucket-I on Cloud
Figure 8: MapReduce time for 1-Bucket-Theta and M-Bucket-I on Cloud

Figure 9: Output imbalance for 1-Bucket-Theta (#buckets=1) and M-Bucket-O on Cloud-5
Figure 10: Max-reducer-output for 1-Bucket-Theta and M-Bucket-O on Cloud-5
Figure 11: MapReduce time for 1-Bucket-Theta and M-Bucket-O on Cloud-5


Conclusion

MapReduce algorithm for arbitrary joins
Always applicable
Effective for large-scale data analysis
Additional statistics provide better performance