fine-grained partitioning for aggressive data skipping calvin 2015-06-03 sigmod 2014 uc berkeley

Fine-grained Partitioning for

Aggressive Data Skipping

Calvin2015-06-03

SIGMOD 2014

UC Berkeley

Contents

• Background• Contribution• Overview• Algorithm• Data skipping• Experiment

Background

• How to get insights of enormous datasets interactively ?• How to shorten query response time on huge datasets ?

• Drawbacks1. Coarse-grained block(partition)s2. Not balance3. The remaining block(partition)s still contain many tuples4. Blocks do not match the workload skew5. Data and query filter correlation

• Block / Partition• Oracle / Hbase / Hive / LogBase

Prune data block(partition) according to metadata

Goals

• Workload-driven blocking techniqueFined-grainedBalance-sizedOfflineRe-executableCo-exists with original partitioning techniques

Example

Extract features

Vectorization

1.Split block2.Storage

How to choose

How to split

Condition Skip

F3 P1, P3

F1^F2 P2, P3

Contribution

• Feature selectionIdentity representative filtersModeled as Frequent itemset mining

• Optimal partitioningBalanced-Max-Skip partitioning problem – NP HardA bottom-up framework for approximate solution

Overview

(1)extract features from workload(2)scan table and transform tuple to (vector, tuple)-pair(3)count by vector to reduce partitioner input(4)generate blocking map(vector -> blockId)(5)route each tuple to its destination block(6)update union block feature to catalog

Workload Assumptions

• Filters in query of the workload have commonality and stability Scheduled or reporting queries Template query with different value range

Workload Modeling

• Q={Q1,Q2,…Qm}• Examples:

Q1: product=‘shoes’Q2: product in (‘shoes’, ‘shirts’), revenue > 32Q3: product=‘shirts’, revenue > 21

• F: All predicates in Q• Fi: Qi’s predicates• fij: Each item in Fi

product in (‘shoes’, ‘shirts’) vs product= ‘shoes’product in (‘shoes’, ‘shirts’) vs revenue > 21

Filter augmentation

• Examples:Q1: product=‘shoes’Q2: product in (‘shoes’, ‘shirts’), revenue > 32Q3: product=‘shirts’, revenue > 21

• Examples:Q1: product=‘shoes’, product in (‘shoes’, ‘shirts’)Q2: product in (‘shoes’, ‘shirts’), revenue > 32, revenue > 21Q3: product=‘shirts’, revenue > 21, product in (‘shoes’, ‘shirts’)

Frequent itemset mining with threshold T(=2) numFeat

Partitioning problem modeling

• ={F1,F2,…Fm} as features, weight wi

• V={v1,v2,…vn} as transformed tupleVij indicates whether vi satisfies Fj

• P={P1,P2,P3} as a partition𝑣 (𝑃 𝑖)=𝑂𝑅𝑣 𝑗∈𝑃 𝑖

𝑣 𝑗

• Cost function C(Pi) as sum of tuples that Pi can skip for all queries in workload :

Max(C(P)) NP-Hard

The bottom up framework

Ward’s method: Hierarchical grouping to optimize an objective function

n2log(n)

R: {vector -> blockId, …}

Data skipping1. Generate vector2. OR with each partition vector3. Block with at least one 0 bit can be skipped

Experiment

• EnvironmentAmazon Spark EC2 cluster with 25 instances8*2.66GHz CPU cores64 GB RAM2*840 GB disk storage

• Implement and experiment on Shark (SQL on spark)

Datasets• TPC-H

• 600 million rows, 700GB in size• Query templates (q3,q5,q6,q8,q10,q12,q14,q19)• 800 queries as training workload, 100 from each• 80 testing queries, 10 from each

• TPC-H Skewed• TPC-H query generator has a uniform distribution• 800 queries as training workload, 100 from each under Zipf distribution

• Conviva• User access log of video streams• 104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, …• 674 training queries and 61 testing queries• 680 million tuples, 1TB in size

TPC-H 相关说明： http://blog.csdn.net/fivedoumi/article/details/12356807

http://blog.csdn.net/fivedoumi/article/details/12356807

TPC-H results

• Query performance• Measure number of tuples scanned and response time for different blocking

and skipping schemas• Full scan: no data skipping, baseline• Range1: filter on o_orderdate, about 2300 partitions. Shark’s data skipping used• Range2: filter on {o_orderdate, r_name, c_mkt_segment, quantity}, about 9000

partitions. Shark’s data skipping used• Fineblock: numFeature=15 features from 800 training queries, minSize=50k,

Shark’s data skipping and feature-based data skipping are used

TPC-H results - efficiency

TPC-H results – effect of minSize

• The smaller the block size is, the more chance we can skip data• numFeature=15 and various minSize

Y-value : ratio of number scanned to number must be scanned

TPC-H results – effect of numFeat

TPC-H results – blocking time

• A month partition in TPC-H• 7.7 million tuples, 8GB in size• 1000 blocks• numFeat=15,minSize=50• One minute

Convia results

• Query performance• Fullscan: no data skipping• Range: partition on date and a frequently queried column, Shark’s skipping used• Fineblock: first partition on date, numFeature=40, minSize=50k, Shark’s skipping

and feature-based skipping used

fine-grained partitioning for aggressive data skipping calvin 2015-06-03 sigmod 2014 uc berkeley

Documents