HKU CSIS DB Seminar Processing Ad-Hoc Joins on Mobile Devices HKU CSIS DB Seminar 10 Oct 2003 Speaker: Eric Lo

Page 1

Processing Ad-Hoc Joins on Mobile Devices

HKU CSIS DB Seminar

10 Oct 2003

Speaker: Eric Lo

Page 2

Mobile Devices and Databases

Cellular phones and Personal Digital Assistants (PDAs) can request information from remote databases anywhere, anytime

The connection channel is wireless, e.g., WAP, IEEE 802.11 (Wi-Fi), GPRS, 3G

Page 3

Example

HK Stock Exchange

11:55am: What is the stock price of “PCCW” now?

SELECT Stock_Price
FROM DB
WHERE Stock_Code = '8'

8 - PCCW
11:56am: HKD 2.5

Page 4

There is no free lunch
Option 1: Charged by airtime

HK Stock Exchange

11:55am: What is the stock price of “PCCW” now?

SELECT Stock_Price
FROM DB
WHERE Stock_Code = '8'

8 - PCCW
11:56am: HKD 2.5

Charge shown accumulating with airtime: $1, $2.8, $4.6, $10.2

Page 5

Option 2: Charged by amount of data transferred

Network traffic and QoS of wireless data networking depend strongly on factors such as:
• Network workload
• Availability of network stations

When charged by the amount of data transferred, the goal is to minimize the transfer cost (in dollars)!

Page 6

Query more than one data source

Mobile users may wish to combine information from more than one remote database

E.g., a vegetarian visits Hong Kong and looks for restaurants recommended by both the HK tourist office and the HK vegetarian community

Page 7

Example relations

Join Query

SELECT R1.Name, R1.Address, R2.Cost
FROM R1, R2
WHERE R1.Name = R2.Name

Page 8

Motivations

Evaluating join queries on mobile devices

Considerations:

1. The mobile device has limited memory

2. Minimize the transfer cost (dollars $$$$$$)

3. Databases are non-collaborative

4. Query results are small compared to the input relations (ad-hoc)

Page 9

Download all relations?

Download both relations (HK tourist office and HK vegetarian directory) onto the mobile device and evaluate the join locally

• Most mobile devices cannot hold the large amount of data from the remote databases
• The transfer cost is very high even though the result is very small

Page 10

Outline

• Introduction and motivation
• A simple late-projection strategy
• Block-merge join
• Ship-data-as-queries join
• RAMJ: Recursive and Adaptive Mobile Join
• Experiment results
• Conclusions and future work

Page 11

A simple late-projection strategy

Traditional distributed query processing techniques like semi-join involve:
• Shipping join columns and (whole) tuples
• Shipping them directly across trusted distributed nodes

In a highly selective join, most tuples fail to contribute to the final result
• Semi-join downloads non-key attributes that may not be included in the result

Late projection:
• Download and join the distinct values of the join keys only (do not download the non-key attributes)
• Only tuples that belong to the join result require downloading the remaining non-key attributes

Page 12

A late projection strategy

SELECT R1.Name, R1.Address, R2.Cost
FROM R1, R2
WHERE R1.Name = R2.Name

Page 13

Step 1

Download $\pi_{Name}(R_1)$

Page 14

Step 2

Download $\pi_{Name}(R_2)$

Page 15

Step 3

Evaluate $T = \pi_{Name}(R_1) \bowtie \pi_{Name}(R_2)$ locally

Page 16

Step 4

Evaluate $\pi_{Name,Address}(\sigma_{Name \in T}(R_1))$

Page 17

Step 5

Evaluate $\pi_{Name,Cost}(\sigma_{Name \in T}(R_2))$

Page 18

Step 6

Join the two result sets
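The six steps can be sketched in Python as follows. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the remote servers, and the helper names (`make_db`, `late_projection_join`) and all sample rows are made up for the example rather than taken from the talk.

```python
import sqlite3

def make_db(table, columns, rows):
    """Create an in-memory SQLite database standing in for one remote server."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    marks = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    return conn

def late_projection_join(db1, db2):
    # Steps 1-2: download only the distinct join keys from each server.
    names1 = {row[0] for row in db1.execute("SELECT DISTINCT Name FROM R1")}
    names2 = {row[0] for row in db2.execute("SELECT DISTINCT Name FROM R2")}
    # Step 3: join the key sets locally on the device.
    t = sorted(names1 & names2)
    if not t:
        return []
    marks = ", ".join("?" for _ in t)
    # Steps 4-5: fetch non-key attributes only for keys that actually join.
    left = db1.execute(
        f"SELECT Name, Address FROM R1 WHERE Name IN ({marks})", t).fetchall()
    right = db2.execute(
        f"SELECT Name, Cost FROM R2 WHERE Name IN ({marks})", t).fetchall()
    # Step 6: join the two small result sets on the device.
    return sorted((name, address, cost)
                  for name, address in left
                  for other, cost in right
                  if name == other)

# Hypothetical sample data.
db1 = make_db("R1", ["Name", "Address"],
              [("Alpha Food", "1 Main St"), ("Beta Food", "2 Hill Rd"),
               ("Ceta Food", "3 Bay Rd")])
db2 = make_db("R2", ["Name", "Cost"],
              [("Beta Food", 15), ("Ceta Food", 30), ("Delta Food", 20)])
print(late_projection_join(db1, db2))
```

Only the `Name` column crosses the network in full; `Address` and `Cost` are requested for the two matching restaurants alone.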

Page 19

Block-merge join (BMJ)

Late projection is still insufficient if the whole join column cannot fit into the memory of the mobile device

Apply sort-merge join, with the sorting done on the servers
• One block of ordered join keys is downloaded from each server and the blocks are joined locally, until one block is exhausted

Each block must
• Cover the same data range
• Be sorted in the same order (e.g., both ascending)
• Be small enough to reside in memory

Each block can be downloaded using ROWNUM or LIMIT clauses in SQL
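A minimal sketch of the block-merge idea, assuming a simplified variant in which an exhausted block is simply refilled from its own server. The `make_fetcher` helper is hypothetical: it simulates a server answering something like `SELECT DISTINCT Name FROM R ORDER BY Name LIMIT size OFFSET offset`, and the key lists are made up.

```python
def make_fetcher(keys):
    """Simulate a server that returns sorted, distinct join keys block by block."""
    ordered = sorted(set(keys))
    def fetch(offset, size):
        return ordered[offset:offset + size]
    return fetch

def block_merge_join(fetch1, fetch2, block_size):
    """Merge-join two sorted key streams while holding one block per server."""
    matches = []
    off1 = off2 = 0
    b1, b2 = fetch1(off1, block_size), fetch2(off2, block_size)
    i = j = 0
    while b1 and b2:
        if b1[i] == b2[j]:
            matches.append(b1[i])
            i += 1
            j += 1
        elif b1[i] < b2[j]:
            i += 1
        else:
            j += 1
        # Refill whichever block has been fully consumed.
        if i == len(b1):
            off1 += len(b1)
            b1, i = fetch1(off1, block_size), 0
        if j == len(b2):
            off2 += len(b2)
            b2, j = fetch2(off2, block_size), 0
    return matches

r1 = make_fetcher(["Alpha", "Beta", "Ceta", "Echo", "Zulu"])
r2 = make_fetcher(["Beta", "Ceta", "Delta", "Zulu"])
print(block_merge_join(r1, r2, block_size=2))
```

Memory use is bounded by two blocks, but every join key of both relations is still downloaded once.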

Page 20

Ship-data-as-queries join (SDAQ)

If |R1| << |R2|, the transfer cost can be reduced by:

1. Download the join column of R1 to the mobile device:
SELECT Name FROM R1

2. Send the join keys to R2 in the form of SQL selection queries (e.g., if two keys were returned in step 1):
SELECT Name FROM R2
WHERE Name IN ('Beta Food', 'Ceta Food')

3. The results from R2 are the joined keys

Very few results returned
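The three SDAQ steps can be sketched as below, again with in-memory SQLite databases standing in for the remote servers. The batching parameter `batch` and the helper names are assumptions added for illustration; the slide only shows a single `IN` query.

```python
import sqlite3

def make_db(table, names):
    """In-memory stand-in for a remote server holding one Name column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (Name TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?)", [(n,) for n in names])
    return conn

def ship_data_as_queries(small_db, large_db, batch=100):
    # Step 1: download the join column of the small relation R1.
    keys = [row[0] for row in small_db.execute("SELECT DISTINCT Name FROM R1")]
    joined = []
    # Step 2: ship the keys to R2 as selection queries, a batch at a time.
    for start in range(0, len(keys), batch):
        chunk = keys[start:start + batch]
        marks = ", ".join("?" for _ in chunk)
        rows = large_db.execute(
            f"SELECT DISTINCT Name FROM R2 WHERE Name IN ({marks})", chunk)
        # Step 3: whatever R2 returns is a joined key.
        joined.extend(row[0] for row in rows)
    return sorted(joined)

# Hypothetical sample data.
r1_db = make_db("R1", ["Beta Food", "Ceta Food"])
r2_db = make_db("R2", ["Alpha Food", "Beta Food", "Delta Food", "Echo Food"])
print(ship_data_as_queries(r1_db, r2_db))
```

The keys of R2 that do not join never leave the server, which is why this pays off when R1 is much smaller than R2.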

Page 21

Can we do even better?

Block-merge join (BMJ) handles the limited-memory problem, BUT essentially downloads all join keys

Ship-data-as-queries (SDAQ) does better ONLY if the sizes of the two relations differ greatly

Pay a small overhead
• Build histograms that capture the data distribution of the target relations
• Join space pruning
• Bucket-based joining

Page 22

Pruning the data space

Page 23

Constructing Histogram

Problem
• Mobile devices cannot receive the servers' own histograms (they are internal data structures of the remote databases)

Solution
• Construct queries that build the histogram through SQL

Page 24

String Histogram

Using the SUBSTRING function

SELECT SUBSTRING(Name,1,1) AS Bucket, COUNT(Name) AS Count
FROM R1
GROUP BY SUBSTRING(Name,1,1)
HAVING COUNT(Name) > 0

Page 25

Numeric Histogram

Using the ROUND function

SELECT ROUND(Cost/(D/G)) AS Bucket, COUNT(ROUND(Cost/(D/G))) AS Count
FROM R2
GROUP BY ROUND(Cost/(D/G))
HAVING COUNT(ROUND(Cost/(D/G))) > 0

G is the granularity, which specifies the number of buckets

D is the size of the numeric domain
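To make the bucketing arithmetic concrete, here is a small Python sketch of the same ROUND(Cost/(D/G)) rule applied on the client side. It assumes non-negative values and SQL-style rounding of halves upward (Python's built-in `round` rounds halves to even, so it is avoided); the function names and the cost values are made up.

```python
import math
from collections import Counter

def bucket_of(cost, domain, granularity):
    """Mirror ROUND(Cost / (D / G)) for non-negative costs, rounding halves up."""
    width = domain / granularity
    return math.floor(cost / width + 0.5)

def numeric_histogram(costs, domain, granularity):
    """Return {bucket: count}, listing only non-empty buckets (HAVING COUNT > 0)."""
    counts = Counter(bucket_of(c, domain, granularity) for c in costs)
    return dict(sorted(counts.items()))

costs = [5, 12, 18, 47, 52, 95]  # hypothetical Cost values
print(numeric_histogram(costs, domain=100, granularity=10))
# prints {1: 2, 2: 1, 5: 2, 10: 1}
```

Note that rounding (rather than truncating) produces bucket ids from 0 to G inclusive, so a value near the top of the domain lands in bucket G.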

Page 26

A Bucket-based Approach

So far we know that:
• SDAQ is good when the input relations differ greatly in size
• But BMJ is better when the two input relations have similar sizes (Why?)
• Histograms help to prune the join space

“Which method is better?” The histogram can do more!
• The histogram already partitions the data space into buckets
• Depending on the data distribution of each bucket, assign the best action to it adaptively

Page 27

Direct Join (~BMJ)

Page 28

Ship-join-keys (~SDAQ)

Page 29

Recursive Partitioning

Page 30

Recursive Partitioning

Break a partition into finer ones and request a histogram for each
• Hoping that some sub-buckets will be pruned
• Or hoping that the cheap ship-join-keys join can be applied to some sub-buckets

Page 31

Recursive Partitioning

SELECT SUBSTRING(Name,1,2) AS Bucket, COUNT(Name) AS Count
FROM R2
WHERE SUBSTRING(Name,1,1) = 'A'
GROUP BY SUBSTRING(Name,1,2)
HAVING COUNT(Name) > 0

First histogram:

Bucket   Count
AA       10
AB       54
AC       105
AX       85
AY       12
AZ       32

Second histogram:

Bucket   Count
AB       54
AC       5
AX       90
AZ       32
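Using the two example histograms above, join-space pruning can be sketched as follows. Because the histogram queries return only non-empty buckets, a bucket missing from either side cannot contribute to the join. The function name `prune` and the labels `first` and `second` are assumptions; the transcript does not say which table belongs to which relation.

```python
def prune(hist_a, hist_b):
    """Keep only buckets that are non-empty on both sides of the join."""
    common = sorted(hist_a.keys() & hist_b.keys())
    return {bucket: (hist_a[bucket], hist_b[bucket]) for bucket in common}

# Counts copied from the two tables on this slide.
first = {"AA": 10, "AB": 54, "AC": 105, "AX": 85, "AY": 12, "AZ": 32}
second = {"AB": 54, "AC": 5, "AX": 90, "AZ": 32}

survivors = prune(first, second)
pruned = sorted(set(first) ^ set(second))
print(survivors)  # bucket -> (count in first, count in second)
print(pruned)     # buckets that need no further work
```

Here `AA` and `AY` are pruned, while `AC` (105 keys against 5) is the kind of skewed pair where shipping the smaller side's keys looks attractive.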

Page 32

Which action is the best for each bucket? The cost model!

The largest amount of data that can be transferred in one packet is called the MTU (Maximum Transmission Unit)

The largest segment of TCP data that can be transmitted is called the MSS (Maximum Segment Size)

$MTU = MSS + B_H$ ($B_H$ is the size of the headers)

To transfer B bytes of data, the actual number of bytes transferred is $T(B)$ [formula not preserved in this transcript]
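Since the formula itself did not survive, the following is only one plausible reading, not the speaker's confirmed definition: each packet carries at most MSS payload bytes plus one header of $B_H$ bytes, so B payload bytes need ceil(B / MSS) headers. The default values for `mtu` and `header` (1500 and 40 bytes) are assumptions chosen for illustration.

```python
import math

def transfer_bytes(b, mtu=1500, header=40):
    """Assumed form of T(B): payload plus one header per packet needed."""
    mss = mtu - header              # MTU = MSS + B_H
    packets = math.ceil(b / mss)    # packets needed to carry b payload bytes
    return b + packets * header

print(transfer_bytes(1460))  # exactly one full packet
print(transfer_bytes(1461))  # one byte more forces a second header
```

Under this reading the cost is a step function of B, which is why many small requests can cost noticeably more than one large request.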

Page 33

Cost Model

Let $C_{R1}$ and $C_{R2}$ be the costs of accessing R1 and R2, respectively

Sending a selection query Q to a server takes $T(B_{SQL} + B_{key})$ bytes
• $B_{key} = 4$ bytes for numeric attributes
• $B_{key} = 2L$ bytes for string attributes of length L

Page 34

Cost Model

Under these settings, we have to determine the cost of:
• Direct Join: $C_1$
• Ship-join-keys Join: $C_2$
• Recursive Partitioning: $C_3$

Execute the minimum-cost action for each bucket adaptively

Page 35

$C_1$: Direct Join

Let $\alpha_i$ and $\beta_i$ be the i-th histogram buckets of R1 and R2, summarizing the same data region

1. Send a selection query to R1: $C_{R1}\,T(B_{SQL} + B_{key})$
2. Receive the keys of $\alpha_i$: $C_{R1}\,T(|\alpha_i| B_{key})$
3. Send a selection query to R2: $C_{R2}\,T(B_{SQL} + B_{key})$
4. Receive the keys of $\beta_i$: $C_{R2}\,T(|\beta_i| B_{key})$

$$C_1(\alpha_i,\beta_i) = (C_{R1} + C_{R2})\,T(B_{SQL} + B_{key}) + C_{R1}\,T(|\alpha_i| B_{key}) + C_{R2}\,T(|\beta_i| B_{key})$$

Page 36

$C_2$: Ship-join-keys

Assume $|\alpha_i| \le |\beta_i|$

1. Send a selection query to R1, which holds the smaller bucket $\alpha_i$: $C_{R1}\,T(B_{SQL} + B_{key})$
2. Receive the keys of $\alpha_i$ from R1: $C_{R1}\,T(|\alpha_i| B_{key})$
3. Send a selection query to R2, which holds the larger bucket, to check the existence of the $|\alpha_i|$ keys: $C_{R2}\,T(B_{SQL} + |\alpha_i| B_{key})$
4. Receive at most $|\alpha_i|$ keys from R2: $C_{R2}\,T(|\alpha_i| B_{key})$

$$C_2(\alpha_i,\beta_i) = C_{R1}\,\bigl(T(B_{SQL} + B_{key}) + T(|\alpha_i| B_{key})\bigr) + C_{R2}\,T(B_{SQL} + 2|\alpha_i| B_{key})$$
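The two closed-form costs can be compared numerically with a short sketch. The packet function `T` is the assumed payload-plus-headers form (the slide's own definition is not preserved), and the byte sizes `b_sql=100`, `b_key=20` and unit access costs are made-up illustration values. Bucket counts are taken from the example histograms on Page 31.

```python
import math

def T(b, mtu=1500, header=40):
    """Assumed transfer function: payload plus one header per packet."""
    return b + math.ceil(b / (mtu - header)) * header

def cost_direct_join(a, b, c_r1, c_r2, b_sql, b_key):
    """C1: query both servers and download both buckets (a = |alpha|, b = |beta|)."""
    return ((c_r1 + c_r2) * T(b_sql + b_key)
            + c_r1 * T(a * b_key) + c_r2 * T(b * b_key))

def cost_ship_join_keys(a, c_r1, c_r2, b_sql, b_key):
    """C2: download the smaller bucket, then ship its keys to the other server."""
    return (c_r1 * (T(b_sql + b_key) + T(a * b_key))
            + c_r2 * T(b_sql + 2 * a * b_key))

params = dict(c_r1=1, c_r2=1, b_sql=100, b_key=20)
# Skewed bucket pair (5 keys vs 105 keys): shipping keys is cheaper.
print(cost_direct_join(5, 105, **params), cost_ship_join_keys(5, **params))
# Balanced bucket pair (54 keys vs 54 keys): the direct join is cheaper.
print(cost_direct_join(54, 54, **params), cost_ship_join_keys(54, **params))
```

With these assumed numbers the skewed pair costs 2640 bytes by direct join against 640 by ship-join-keys, while the balanced pair costs 2560 against 3620, matching the intuition stated on the bucket-based approach slide.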

Page 37

$C_3$: Recursive Partitioning

Have to estimate the cost of:
1. Asking for finer histograms for that bucket from R1
2. Asking for finer histograms for that bucket from R2
3. For each pair of (future/virtual) sub-buckets, executing direct join, ship-join-keys, or recursive partitioning again

[Figure: the two example histograms for prefix 'A', as on Page 31]

Page 38

Recursive Partitioning $C_3$

Ask for finer histograms from R1: $C_h(G,R_1) = C_{R1}\,\bigl(T(G(B_{key}+4)) + T(B_{SQL}+B_{key})\bigr)$

Ask for finer histograms from R2: $C_h(G,R_2) = C_{R2}\,\bigl(T(G(B_{key}+4)) + T(B_{SQL}+B_{key})\bigr)$

For each pair of (future) sub-buckets, the pair may recursively follow direct join, ship-join-keys, or recursive partitioning again: $C_{RP}(\alpha_i,\beta_i)$

Page 39

Recursive Partitioning $C_3$

$$C_3(\alpha_i,\beta_i) = C_h(G,R_1) + C_h(G,R_2) + C_{RP}(\alpha_i,\beta_i)$$

[Figure: the two example histograms for prefix 'A', as on Page 31, annotated with a question mark]

Page 40

Recursive Partitioning – Optimistic Estimation

Optimistically assume that the buckets at the next level are all pruned
• This holds if the data distributions of the two datasets are very different
• Since all future (next-level) sub-buckets are pruned, they incur NO further actions

Therefore:

$$C_3(\alpha_i,\beta_i) = C_h(G,R_1) + C_h(G,R_2) + C_{RP}(\alpha_i,\beta_i), \quad \text{with } C_{RP}(\alpha_i,\beta_i) = 0$$
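A sketch of the optimistic estimate and of picking the cheapest action for a bucket pair. The histogram cost follows the $C_h(G,R)$ expression from the recursive partitioning slides; the packet function `T`, the byte sizes, and the example costs passed to `best_action` are assumptions for illustration only.

```python
import math

def T(b, mtu=1500, header=40):
    """Assumed transfer function: payload plus one header per packet."""
    return b + math.ceil(b / (mtu - header)) * header

def cost_histogram(g, c_r, b_sql, b_key):
    """C_h(G, R) = C_R * (T(G * (B_key + 4)) + T(B_SQL + B_key))."""
    return c_r * (T(g * (b_key + 4)) + T(b_sql + b_key))

def cost_rp_optimistic(g, c_r1, c_r2, b_sql, b_key):
    """Optimistic C3: pay for two finer histograms, assume C_RP = 0."""
    return (cost_histogram(g, c_r1, b_sql, b_key)
            + cost_histogram(g, c_r2, b_sql, b_key))

def best_action(costs):
    """Return the name of the minimum-cost action for one bucket pair."""
    return min(costs, key=costs.get)

c3 = cost_rp_optimistic(g=20, c_r1=1, c_r2=1, b_sql=100, b_key=20)
print(c3)
# Hypothetical C1 and C2 values for a balanced bucket pair.
print(best_action({"direct join": 2560, "ship-join-keys": 3620,
                   "recursive partitioning": c3}))
```

Because the optimistic estimate ignores all next-level work, it tends to favour partitioning; it is accurate only when most sub-buckets really are pruned.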

Page 41

Recursive Partitioning – Linear Interpolation Estimation

• More accurate, but higher computational cost
• Exploit the histogram at the current level to estimate the distribution at the next level

[Figure labels: “We have histograms in this level” (current level); “We DON’T have histograms in this level” (next level)]

Page 42

Linear Interpolation Estimation

[Figure: histogram buckets b1 to b5, shown twice]

• Select adjacent buckets as interpolation points
• Preserve the current trend
• Resistant to fluctuating distributions

Page 43

One problem left

1. Level 1: what is the cost of RP on $b_2$?
2. Estimate Level 2 by linear interpolation: $b_{2,1}, b_{2,2}, \dots, b_{2,5}$ are found
3. With $b_{2,1}, b_{2,2}, \dots, b_{2,5}$ found, determine which action is the most cost-efficient ($C_1$, $C_2$ or $C_3$) for each sub-bucket
• $C_1$ and $C_2$ of $b_{2,1}, b_{2,2}, \dots, b_{2,5}$ can be determined
• $C_3$ of $b_{2,1}, b_{2,2}, \dots, b_{2,5}$? Start from step 1 again

[Figure: Level 1 buckets b1 to b5; Level 2 sub-buckets of b2; below $b_{2,1}$, the next-level sub-buckets $b_{2,1,1}, b_{2,1,2}, b_{2,1,3}, b_{2,1,4}, b_{2,1,5}$]

Page 44

If you don’t understand…

$C_3$ of one level depends on the next level

Fortunately, the cost $C_{RP}(\alpha_i,\beta_i)$ is bounded by the following inequality:

[Inequality not preserved in this transcript. Figure: the bucket pair $(\alpha_i,\beta_i)$ branching into $C_1$, $C_2$, $C_3$ at successive levels]

Page 45

Recursive Partitioning – Linear Interpolation Estimation

$$C_3(\alpha_i,\beta_i) = C_h(G,R_1) + C_h(G,R_2) + C_{RP}(\alpha_i,\beta_i)$$

• In optimistic estimation, we optimistically omit the last term
• $C_{RP}(\alpha_i,\beta_i)$ is bounded by the inequality: [not preserved in this transcript]

Summing up everything: [formula not preserved in this transcript]

Page 46

RAMJ Algorithm

Recursive and Adaptive Mobile Join

Page 47

Real Data Experiment

Real Data Set 1: DBLP

Join relations “Conference” (235K tuples) and “Journal” (125K tuples) to find the publications whose title appears as both a conference paper and a journal paper

3836 publications have the same title in both the conference and journal relations

SELECT R1.Title
FROM Conference R1, Journal R2
WHERE R1.Title = R2.Title

Page 48

Real Data Experiment

Real Data Set 2: Restaurants Data Set

Crawled from www.restaurantrow.com

Join relations “Steak” (4573 tuples) and “Vegetarian” (2098 tuples) to find the restaurants that offer both steak and vegetarian dishes (163 joined)

SELECT R1.Name
FROM Steak R1, Vegetarian R2
WHERE R1.Name = R2.Name

Page 49

Real Data Experiment Result

Algorithm   DBLP     Restaurant
BMJ         45.89M   266.22K
SDAQ        43.28M   180.15K
RAMJ-OPT    30.11M   116.67K

Page 50

Synthetic Data Experiment

Generate 3 relations with different distributions:
• Gaussian
• Negative Exponential
• Zipf (skewness θ = 1)

Default:
• 10,000 tuples
• Domain = 100,000
• G = 20

Page 51

Synthetic Data Experiment Result

Transferred (Bytes)

Algorithm            NegExp-Gaussian   Zipf-Gaussian   Zipf-NegExp
BMJ                  80116             80124           80120
SDAQ                 181944            139728          143580
RAMJ-OPT             48654             35056           77114
RAMJ-LI              47864             35056           76200

No. of joined keys   420               184             1148

Page 52

The impact of data skew

Page 53

The impact of memory size

Page 54

Conclusions and Future Work

• Identified the requirements and limitations of evaluating ad-hoc joins on mobile devices
• A recursive and adaptive algorithm: RAMJ
• Extension to multi-way joins and multi-attribute joins
• Extension to Top-K join: existing approaches to the Top-K problem work ONLY on collaborative databases

Page 55

Q & A

?

Page 56

Approach 2: Mediator

User queries are free-form …
• E.g., user KY may issue a join query that involves DB x and DB y, whereas user BY may issue a join query that involves DB e and DB f
• A mediator cannot answer those queries without prior preparation such as data integration, schema matching …

Mediator services may charge the users as well

Page 57

Approach 2: Distributed query processing?

Existing distributed database techniques work in trusted environments only

Semi-join of DB1.Name and DB2.CustomerName:

1. Site DB1: evaluate $J = \pi_{Name}(DB1)$ [J = all names]

2. Send J from DB1 to DB2

3. Site DB2: evaluate $K = \sigma_{CustomerName \in J}(DB2)$ [find all tuples whose CustomerName equals a Name in DB1]

4. Send K from DB2 to DB1

Page 58

Approach 2: Distributed query processing?

Does not work for our problem!
• DB1 and DB2 are non-collaborative
• They would not accept “data structures” as input (e.g., a “join column” in semi-join or a “hash table” in bloom-join)
• They accept SQL only

Semi-join can be made to work with a modification:
• Send J and K through the mobile device
• High transfer cost

Page 59

References

1. Processing Ad-hoc Joins on Mobile Devices, submitted to EDBT 04

2. P.A. Bernstein and N. Goodman. Power of natural semijoins. SIAM Journal on Computing, 1981

3. P.A. Bernstein, N. Goodman, et al. Query processing in a system for distributed databases (SDD-1). ACM TODS, 1981

4. N. Mamoulis, P. Kalnis, et al. Optimization of spatial joins on mobile devices. SSTD, 2003

Page 60

Other Join Types

Equi-join with selection constraints
E.g., we are interested in restaurants that appear in both datasets and whose cost is less than $20

SELECT R1.Name, R1.Address, R2.Cost
FROM R1, R2
WHERE R1.Name = R2.Name
AND R2.Cost < 20

Add this condition in the histogram construction for R2:

SELECT SUBSTRING(Name,1,1) AS Bucket, COUNT(Name) AS Count
FROM R2
WHERE Cost < 20
GROUP BY SUBSTRING(Name,1,1)
HAVING COUNT(Name) > 0

Page 61

Iceberg Semi-join

Find all restaurants in R1 that are recommended by at least 10 users in a discussion group R2

Properties:
• Equi-join between R1 and R2
• Results come from R1 only
• Condition is applied on R2 only (>10 users)

SELECT SUBSTRING(Name,1,1) AS Bucket, COUNT(Name) AS Count
FROM R2
GROUP BY SUBSTRING(Name,1,1)
HAVING COUNT(Name) > t