Johns Hopkins University Xiaodan Wang Eric Perlman Randal Burns Tamas Budavari Charles Meneveau Alexander Szalay Purdue University Tanu Malik JAWS: Job-Aware Workload Scheduling for the Exploration of Turbulence Simulations

Upload: helen-truss

Post on 14-Dec-2015


TRANSCRIPT


Johns Hopkins University

Xiaodan Wang, Eric Perlman, Randal Burns, Tamas Budavari, Charles Meneveau, Alexander Szalay

Purdue University

Tanu Malik

JAWS: Job-Aware Workload Scheduling for the Exploration of Turbulence Simulations


JAWS: Job-Aware Workload Scheduling

Problem

Ensure high throughput for concurrent accesses to peta-scale scientific datasets

Turbulence Database Cluster
– A new approach to data exploration
    Traditionally analyze dynamics on the fly
    Large simulations out of reach for many scientists
– Stores complete space-time histories of DNS
– Exploration by querying simulation result
– 27TB (velocity and pressure data on 1024^3 grid)
– Available to wide community over the Web


Pitfalls of Success

Enable new class of applications
– Iterative exploration over large space-time
– Correlate, mine, extract at petabyte scale

Heavily used and data intensive queries
– 50,275,005,460 points queried
– Hundreds of thousands of queries/month
– I/O bound queries (79-88% of time spent loading data)
– Scan large portions of DB lasting hours to days

Single user can occupy the entire system for hours


Addressing I/O Challenges

I/O contention and congestion from concurrent use

Significant data reuse between queries
– Many large queries access the same data
– Lends itself to batch scheduling
– E.g. particles may cluster in turbulence structures


A Batch Scheduling Approach

Co-schedule queries accessing the same data
– Eliminate redundant accesses to the disk
– Amortize I/O cost over multiple queries

Job-aware schedule for queries w/ data dependencies

Trade-offs b/w arrival order and throughput

Scales with workload saturation
– Up to 4x improvement in throughput
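The I/O saving from co-scheduling can be illustrated with a small count of region reads. This is only a sketch: the three-query workload and the function names are hypothetical, and it assumes no caching and that all queries are pending at the same time.

```python
from collections import defaultdict

# Hypothetical toy workload: each query lists the data regions it reads.
queries = {
    "Q1": ["R1", "R2", "R3"],
    "Q2": ["R2", "R3", "R4"],
    "Q3": ["R1", "R2"],
}

def reads_without_sharing(queries):
    """Every query loads each of its regions from disk on its own."""
    return sum(len(regions) for regions in queries.values())

def reads_with_sharing(queries):
    """Each region is loaded once and served to every query that needs it."""
    by_region = defaultdict(list)
    for name, regions in queries.items():
        for region in regions:
            by_region[region].append(name)
    return len(by_region), dict(by_region)

no_share = reads_without_sharing(queries)
shared, by_region = reads_with_sharing(queries)
print(f"reads without sharing: {no_share}, with co-scheduling: {shared}")
```

In this toy case the read count drops from one read per (query, region) pair to one read per distinct region; the gain on a real workload depends on how much the queries overlap.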


Architecture

Universal addressing scheme for partitioning, addressing, and scheduling

Data organization
– 64^3 atoms (8MB)
– Morton order index
– Spatial and temporal partitioning

JAWS scheduling at each node
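A Morton (Z-order) index interleaves the bits of the three atom coordinates so that atoms close in space tend to get nearby keys. The sketch below is a generic bit-interleaving routine, not the database's actual code; the function names, the choice of x as the lowest bit, and the 4 bits per axis (1024 / 64 = 16 atoms per axis) are assumptions drawn from the grid and atom sizes on the slides.

```python
ATOM_SIDE = 64  # atom edge length in grid points, as stated on the slide

def morton3(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z (x takes the lowest bit)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def atom_key(x: int, y: int, z: int) -> int:
    """Hypothetical helper: Morton key of the atom containing grid point (x, y, z)."""
    return morton3(x // ATOM_SIDE, y // ATOM_SIDE, z // ATOM_SIDE, bits=4)

# Points inside the same 64^3 atom share a key; neighbouring atoms differ.
print(atom_key(0, 0, 0), atom_key(63, 63, 63), atom_key(64, 0, 0))
```

Because the key is a single integer, the same value can serve for partitioning, addressing, and scheduling, which matches the "universal addressing scheme" idea on the slide.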


LifeRaft: Data-Driven Batch Scheduling

Decompose into sub-queries based on data access

Co-schedule sub-queries to amortize I/O

Evaluate data atoms based on utility metric
– Amount of contention (queries per data atom)
– Age (queuing time) of oldest query (arrival order)
– Balance contention with age via tunable parameter
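One way to picture the utility metric is as a weighted blend of normalized contention and normalized age. The slides do not give the formula, so the linear form, the parameter name `alpha`, and the `AtomQueue` type below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AtomQueue:
    atom: str
    arrival_times: List[float]  # arrival time of each pending sub-query on this atom

def utility(q: AtomQueue, now: float, alpha: float,
            max_contention: int, max_age: float) -> float:
    """Assumed form: (1 - alpha) * contention + alpha * age, both scaled to [0, 1]."""
    contention = len(q.arrival_times) / max_contention
    age = (now - min(q.arrival_times)) / max_age if max_age > 0 else 0.0
    return (1 - alpha) * contention + alpha * age

def pick_next(queues: List[AtomQueue], now: float, alpha: float) -> AtomQueue:
    """Return the data atom with the highest utility at time `now`."""
    max_contention = max(len(q.arrival_times) for q in queues)
    max_age = max(now - min(q.arrival_times) for q in queues)
    return max(queues, key=lambda q: utility(q, now, alpha, max_contention, max_age))

queues = [
    AtomQueue("R1", [0.0]),            # one old sub-query
    AtomQueue("R2", [5.0, 6.0, 7.0]),  # three younger sub-queries
]
print(pick_next(queues, now=10.0, alpha=0.0).atom)  # contention only
print(pick_next(queues, now=10.0, alpha=1.0).atom)  # age only
```

With `alpha = 0` the sketch favours the most contended atom (throughput); with `alpha = 1` it favours the atom holding the oldest query (arrival order). Intermediate values trade one against the other.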

Figure: LifeRaft pipeline against the Turbulence DB.
Decomposition: Q1 reads R1 R2 R3; Q2 reads R2 R3 R4; Q3 reads R1 R2.
Data access by query: R2 is read by Q1 Q2 Q3; R1 by Q1 Q3; R3 by Q1 Q2.
Batch scheduling: sub-queries are co-scheduled by data region, then query results are returned.


A Case for Job-Aware Scheduling

Job-awareness yields additional I/O savings
– Greedy LifeRaft misses data sharing between jobs
– Incorporate data dependencies to identify redundancy

Figure: execution time of Job1 (R1 R3 R4), Job2 (R2 R3 R4), and Job3 (R2 R3 R4) under LifeRaft and under JAWS.


JAWS: Poly-Time Greedy Algorithm

Figure: job graph.
j1: R1 R2 R4 R5
j2: R2 R3 R4 R6
j3: R1 R4 R5 R6

Precedence edge: subsequent queries in a job must wait for their predecessors

Gating edge: queries that share data are evaluated at the same time

Scheduler evaluates queries in the graph from left to right


JAWS: Poly-Time Greedy Algorithm

Dynamic program phase: identify data sharing b/w job pairs
– DP based on Needleman-Wunsch algorithm for every pair of jobs
– Maximize score (i.e. data sharing): 1 if two queries exhibit data sharing and are co-scheduled, 0 otherwise
– Complexity O(n^2 m^2)

Figure: DP table for j1 (R1 R2 R4 R5, columns) and j2 (R1 R4 R5 R6, rows), with gating and precedence edges marked on the job graph.

        -   R1  R2  R4  R5
  -     0   0   0   0   0
  R1    0   1   1   1   1
  R4    0   1   1   2   2
  R5    0   1   1   2   3
  R6    0   1   1   2   3
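The dynamic-program phase can be sketched as a Needleman-Wunsch-style alignment with a match score of 1 and no gap penalty. This is a simplified illustration rather than the authors' implementation: each query is reduced to the single region it reads, "data sharing" is taken to mean "same region", and the function names are hypothetical.

```python
def sharing_table(job_a, job_b):
    """table[r][c] = max co-scheduled sharing between job_b[:r] and job_a[:c]."""
    rows, cols = len(job_b) + 1, len(job_a) + 1
    table = [[0] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            if job_b[r - 1] == job_a[c - 1]:
                table[r][c] = table[r - 1][c - 1] + 1  # score 1 for shared data
            else:
                table[r][c] = max(table[r - 1][c], table[r][c - 1])
    return table

def gating_edges(job_a, job_b):
    """Trace back through the table; return (index in job_a, index in job_b) pairs."""
    table = sharing_table(job_a, job_b)
    r, c = len(job_b), len(job_a)
    edges = []
    while r > 0 and c > 0:
        if job_b[r - 1] == job_a[c - 1]:
            edges.append((c - 1, r - 1))
            r, c = r - 1, c - 1
        elif table[r - 1][c] >= table[r][c - 1]:
            r -= 1
        else:
            c -= 1
    return edges[::-1]

j1 = ["R1", "R2", "R4", "R5"]
j2 = ["R1", "R4", "R5", "R6"]
print(sharing_table(j1, j2)[-1][-1])  # best sharing score for the pair
print(gating_edges(j1, j2))           # which query positions to gate together
```

Each table has O(m^2) cells for jobs of m queries, and there are O(n^2) job pairs, which is consistent with the O(n^2 m^2) bound quoted on the slide.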


JAWS: Poly-Time Greedy Algorithm

Merge phase: merge pairwise DP solutions
– Sort job pairs based on # of gating edges
– Merge gating edges b/w pairs of jobs greedily
– Complexity O(n^3 m^2) (typically sparse graphs up to ~3000 edges)

Figure: job graph for j1 (R1 R2 R4 R5), j2 (R2 R3 R4 R6), and j3 (R1 R4 R5 R6), shown at successive steps of the merge.


JAWS: Scheduling Example

Example: three jobs j1, j2, j3. No caching. Single region at a time.

Before scheduling:
j1: R1 QUEUE, R2 WAIT, R4 WAIT, R5 WAIT
j2: R2 READY, R3 WAIT, R4 WAIT, R6 WAIT
j3: R1 QUEUE, R4 WAIT, R5 WAIT, R6 WAIT


JAWS: Scheduling Example

Time 1: R1 evaluated for j1 and j3

j1: R1 DONE, R2 QUEUE, R4 WAIT, R5 WAIT
j2: R2 QUEUE, R3 WAIT, R4 WAIT, R6 WAIT
j3: R1 DONE, R4 READY, R5 WAIT, R6 WAIT


JAWS: Scheduling Example

Time 2: R2 evaluated for j1 and j2

j1: R1 DONE, R2 DONE, R4 READY, R5 WAIT
j2: R2 DONE, R3 QUEUE, R4 WAIT, R6 WAIT
j3: R1 DONE, R4 READY, R5 WAIT, R6 WAIT


JAWS: Scheduling Example

Time 3: R3 evaluated for j2

j1: R1 DONE, R2 DONE, R4 QUEUE, R5 WAIT
j2: R2 DONE, R3 DONE, R4 QUEUE, R6 WAIT
j3: R1 DONE, R4 QUEUE, R5 WAIT, R6 WAIT


JAWS: Scheduling Example

Time 4: R4 evaluated for j1, j2, and j3

j1: R1 DONE, R2 DONE, R4 DONE, R5 QUEUE
j2: R2 DONE, R3 DONE, R4 DONE, R6 READY
j3: R1 DONE, R4 DONE, R5 QUEUE, R6 WAIT


JAWS: Scheduling Example

Time 5: R5 evaluated for j1 and j3

j1: R1 DONE, R2 DONE, R4 DONE, R5 DONE
j2: R2 DONE, R3 DONE, R4 DONE, R6 QUEUE
j3: R1 DONE, R4 DONE, R5 DONE, R6 QUEUE


JAWS: Scheduling Example

Time 6: R6 evaluated for j2 and j3

j1: R1 DONE, R2 DONE, R4 DONE, R5 DONE
j2: R2 DONE, R3 DONE, R4 DONE, R6 DONE
j3: R1 DONE, R4 DONE, R5 DONE, R6 DONE

In comparison, LifeRaft requires 8 time steps


Additional Optimizations

Two-level scheduling
– Exploit locality of reference
– Group and evaluate multiple data atoms

Adaptive Starvation Resistance
– Trade-offs b/w query throughput and response time
– Incremental changes by workload saturation (i.e. query arrival rate)

Coord. Cache Replacement w/ Scheduling


Experimental Setup

800GB sample DB: 31 time steps (0.062 sec of simulation time)

Workload
– 8 million queries (11/2007-09/2009), 83k unique jobs
– 63% of jobs persist between 1 and 30 min
– 88% of jobs access data from one time step, 3% iterate over 0.2 sec of simulation time (10% of DB)
– Use 50k query trace (1k jobs) from week of 07/20/2009

Algorithms compared
– NoShare: queries in arrival order with no I/O sharing
– LifeRaft1 (arrival order) and LifeRaft2 (contention order)
– JAWS1: JAWS without job awareness
– JAWS2: includes all optimizations


Query Throughput

Figure: query throughput by algorithm. Annotations: 3x improvement; 30% from job-awareness; 12% from 2-level sched.; 22% from qry reordering.


Sensitivity to Workload Saturation

- JAWS2 scales with workload
- NoShare and LifeRaft1 plateau @ 0.3
- Gap insensitive to saturation changes
- JAWS2 keeps response time low and adapts to workload saturation


Future Directions

Quality of service guarantees
– Supporting interactive queries
– Bounded completion time in proportion to query size

Declarative style interfaces for job optimizations
– Explicitly link related queries
– Pre-declare time and space of interest
– Pre-packaged op. that iterate over space/time inside DB

Job-awareness crucial for scientific workloads
– Alleviates I/O contention across jobs
– Up to 4x increase in throughput
– Scales with workload


Questions?


Sensitivity to Batch Size k

Small k fails to exploit locality of reference in the computation

Large k impacts cache reuse and conforms less to workload throughput


Sensitivity to Cache Replacement

Compare w/ SQL Server's LRU-K based replacement
– Workload knowledge improves cache hit rate modestly
– URC and SLRU improve performance by 16% and 4%, respectively
– Low overhead optimizations for data intensive queries