
Page 1: Exploiting GPUs in Spark

Kazuaki Ishizaki, IBM Research – Tokyo (IBM Japan, Tokyo Research Laboratory)

Exploiting GPUs in Spark


Page 2: Exploiting GPUs in Spark

Who am I?

Kazuaki Ishizaki – lives in Tokyo, Japan

Research staff member at IBM Research – Tokyo – http://ibm.co/kiszk

Research interests – compiler optimizations, language runtime, and parallel processing

Has worked on the Java virtual machine and just-in-time compiler for over 20 years – from JDK 1.0 to Java SE 8

Twitter: @kiszk

Slideshare: http://www.slideshare.net/ishizaki

GitHub: https://github.com/kiszk

Page 3: Exploiting GPUs in Spark

My message is “Spark can meet GPUs”

Let us discuss use cases, opportunities, and requirements in meetups, conferences, and the Spark dev or user mailing lists


While the GPU is not a first-class citizen in Spark, 4 GPU-related talks will be given at Spark Summit SF

Page 4: Exploiting GPUs in Spark

Agenda

Motivation & Goal

Activities to Exploit GPUs in Spark

Introduction of GPUs

Design & New Components
– Binary columnar
– GPU enabler

Two Approaches to Exploit GPUs in Spark
– Spark plug-in
– Enhancement of Catalyst in Spark runtime

Conclusion


Page 5: Exploiting GPUs in Spark

Want to Accelerate a Computation-heavy Application

Motivation – want to shorten the execution time of long-running Spark applications:

Computation-heavy

Shuffle-heavy

I/O-heavy

Goal – accelerate computation-heavy Spark applications

According to Reynold's talk (p. 21), the CPU will become the bottleneck in Spark


Page 6: Exploiting GPUs in Spark

Accelerate a Spark Application by GPUs

Our approach – accelerate a Spark application by using GPUs effectively and transparently

Exploit high performance of GPUs

Do not ask users to change their Spark programs

New components for acceleration
– Binary columnar (e.g. Apache Arrow)
  Efficient data representations for GPUs and CPUs
– GPU enabler
  Automatically handles execution on GPUs
  • GPU memory allocation, data copy between GPU and CPU, etc.


Page 7: Exploiting GPUs in Spark

Motivation & Goal

Projects to Exploit GPUs in Spark

Introduction of GPUs

Design & New Components

Two approaches to Exploit GPUs in Spark

Conclusion

Page 8: Exploiting GPUs in Spark

About 10 Existing Projects to Exploit GPUs in Spark

There are several activities, but none has been merged into master – the community should make the GPU a first-class citizen in Spark


The projects can be categorized along two axes: who prepares the GPU code (the Spark system programmer or the Spark application programmer), and how the GPU code is called (generated from a Spark application that uses the standard APIs, or called through unique APIs).

Generated from a Spark application with standard APIs (RDD, Dataset, DataFrame):
– mllib (N/A on github)
– Deeplearning4J on Spark
– Our GPU enabler (spark-gpu)
– Spark SWAT
– Columnar DataFrame (N/A on github)
– NUWA (product)
– Our on-going work

Unique APIs:
– Caffe on Spark
– BidMach Spark
– CSR in Spark
– HeteroSpark (N/A on github)

Page 9: Exploiting GPUs in Spark

Existing Resource Managers to Support GPU for Spark

Spark on Mesos – https://spark-summit.org/2016/events/spark-on-mesos-the-state-of-the-art/

YARN Node Labels – https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html


Page 10: Exploiting GPUs in Spark

Motivation & Goal

Projects to Exploit GPUs in Spark

Introduction of GPUs

Design & New Components

Two approaches to Exploit GPUs in Spark

Conclusion

Page 11: Exploiting GPUs in Spark

GPU Programming Model

Five steps:
1. Allocate GPU device memory

2. Copy data on CPU main memory to GPU device memory

3. Launch a GPU kernel to be executed in parallel on cores

4. Copy back data on GPU device memory to CPU main memory

5. Free GPU device memory

Usually, a programmer has to write these steps in CUDA or OpenCL


[Figure: a CPU (a dozen cores per socket, main memory up to 1 TB/socket) is connected over PCIe to a GPU (thousands of cores, device memory up to 12 GB); data is copied between the two memories over PCIe.]
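To make the five steps concrete, here is a minimal host-side sketch in Scala using the JCuda driver bindings. This is only an illustration and not part of this talk's code: the use of JCuda, the kernel name times2, and the file sample.ptx are all assumptions.

import jcuda.{Pointer, Sizeof}
import jcuda.driver.{CUcontext, CUdevice, CUdeviceptr, CUfunction, CUmodule}
import jcuda.driver.JCudaDriver._

object FiveSteps {
  def main(args: Array[String]): Unit = {
    val n = 1024
    val hostIn = Array.tabulate(n)(identity)
    val hostOut = new Array[Int](n)

    cuInit(0)
    val dev = new CUdevice(); cuDeviceGet(dev, 0)
    val ctx = new CUcontext(); cuCtxCreate(ctx, 0, dev)

    // 1. Allocate GPU device memory
    val dIn = new CUdeviceptr(); cuMemAlloc(dIn, n * Sizeof.INT)
    val dOut = new CUdeviceptr(); cuMemAlloc(dOut, n * Sizeof.INT)

    // 2. Copy data on CPU main memory to GPU device memory
    cuMemcpyHtoD(dIn, Pointer.to(hostIn), n * Sizeof.INT)

    // 3. Launch a GPU kernel (times2, compiled beforehand into sample.ptx)
    val mod = new CUmodule(); cuModuleLoad(mod, "sample.ptx")
    val fn = new CUfunction(); cuModuleGetFunction(fn, mod, "times2")
    val params = Pointer.to(Pointer.to(dIn), Pointer.to(dOut), Pointer.to(Array(n)))
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, null, params, null)
    cuCtxSynchronize()

    // 4. Copy back data on GPU device memory to CPU main memory
    cuMemcpyDtoH(Pointer.to(hostOut), dOut, n * Sizeof.INT)

    // 5. Free GPU device memory
    cuMemFree(dIn); cuMemFree(dOut)
  }
}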

Page 12: Exploiting GPUs in Spark

How We Can Run Program Faster on GPU

Assign many parallel computations to the cores

Make memory accesses coalesced – an example is shown below
– A column-oriented layout achieves better performance; this paper reports about a 3x performance improvement in GPU kernel execution of kmeans over a row-oriented layout


[Figure: assume 4 consecutive data elements can be coalesced by the GPU hardware. For Pt(x: Int, y: Int), loading four Pt.x and four Pt.y values takes 2 memory accesses to GPU device memory with a column-oriented layout versus 4 with a row-oriented layout, because each column's values are adjacent.]
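As a minimal Scala sketch of the two layouts (the values are illustrative, not the slide's):

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8))

// Row-oriented: x and y interleaved, so four Pt.x values are spread
// across the array and need more memory transactions to gather
val rowLayout: Array[Int] = pts.flatMap(p => Array(p.x, p.y)) // 1,5,2,6,3,7,4,8

// Column-oriented: all x values adjacent, then all y values, so four
// consecutive Pt.x loads can be coalesced into a single transaction
val colX: Array[Int] = pts.map(_.x) // 1,2,3,4
val colY: Array[Int] = pts.map(_.y) // 5,6,7,8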

Page 13: Exploiting GPUs in Spark

Motivation & Goal

Projects to Exploit GPUs in Spark

Introduction of GPUs

Design & New Components

Two approaches to Exploit GPUs in Spark

Conclusion

Page 14: Exploiting GPUs in Spark

High Level View of GPU Exploitation

Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on GPU

Transparent
– Map parallelism in a program into GPU native code

User's Spark program (Scala):

case class Pt(x: Int, y: Int)
rdd1 = sc.parallelize(Array(
  Pt(1, 4), Pt(2, 5), Pt(3, 6), Pt(4, 7), Pt(5, 8), Pt(6, 9)), 3)
rdd2 = rdd1.map(p => Pt(p.x*2, p.y-1))
cnt = rdd2.reduce((p1, p2) => p1.x + p2.x)

[Figure: the GPU enabler translates the program to GPU native code. rdd1's x and y columns, kept by binary columnar on off-heap memory, are transferred to GPU device memory; the GPU kernel computes x*2 and y-1 for each element; the results are transferred back as rdd2. The GPU can exploit parallelism both among blocks in an RDD and within a block of an RDD.]

Page 15: Exploiting GPUs in Spark

What Does Binary Columnar Do?

Keep data as binary representation (not Java object representation)

Keep data as column-oriented layout

Keep data on off-heap or GPU device memory


Example:

case class Pt(x: Int, y: Int)
Array(Pt(1, 4), Pt(2, 5))

[Figure: on off-heap memory, the columnar (column-oriented) layout stores all x values followed by all y values (1, 2, 4, 5), while a row-oriented layout stores each element's fields together (1, 4, 2, 5).]
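One way to picture "binary representation on off-heap" in plain Scala is with direct ByteBuffers. This is only an illustrative sketch, not the project's actual implementation of binary columnar:

import java.nio.{ByteBuffer, ByteOrder}

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5))

// One direct (off-heap) buffer per column, in native byte order
val xCol = ByteBuffer.allocateDirect(pts.length * 4).order(ByteOrder.nativeOrder)
val yCol = ByteBuffer.allocateDirect(pts.length * 4).order(ByteOrder.nativeOrder)
pts.foreach { p => xCol.putInt(p.x); yCol.putInt(p.y) }

// The values are plain ints with no Java object headers; reading back:
val x0 = xCol.getInt(0) // 1
val y1 = yCol.getInt(4) // 5: second element, at byte offset 4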

Page 16: Exploiting GPUs in Spark

Current RDD as Java objects on Java heap


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))

[Figure: on the Java heap, each Pt is a Java object with an object header for the Java virtual machine. The current RDD uses a row-oriented layout and a Java object representation on the Java heap.]

Page 17: Exploiting GPUs in Spark

Binary Columnar on off-heap


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))

[Figure: the current RDD keeps each Pt as a Java object (with an object header) in a row-oriented layout on the Java heap; binary columnar keeps the same data in a column-oriented layout as a binary representation on off-heap memory.]

Page 18: Exploiting GPUs in Spark


Long Path from Current RDD to GPU

Three steps to send data from RDD to GPU:
1. Java objects to a column-oriented binary representation on the Java heap
   – From a Java object to a binary representation
   – From a row-oriented format to columnar
2. Binary representation on the Java heap to binary columnar on off-heap
   – Garbage collection may move objects on the Java heap during GPU-related operations
3. Off-heap to GPU device memory


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU

[Figure: Pt objects on the Java heap -> row-oriented ByteBuffer on the Java heap -> binary columnar on off-heap -> GPU device memory.]

This thread in the dev ML also discusses the overhead of copying data between RDD and GPU

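A plain-Scala sketch of these steps, assuming each Pt is serialized as two 4-byte ints (an illustration, not the actual Spark code path):

import java.nio.{ByteBuffer, ByteOrder}

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5))

// Step 1: Java objects -> row-oriented binary representation on the Java heap
val rowBuf = ByteBuffer.allocate(pts.length * 8) // on-heap buffer
pts.foreach { p => rowBuf.putInt(p.x); rowBuf.putInt(p.y) }

// Step 2: row-oriented on-heap -> column-oriented binary on off-heap
val xCol = ByteBuffer.allocateDirect(pts.length * 4).order(ByteOrder.nativeOrder)
val yCol = ByteBuffer.allocateDirect(pts.length * 4).order(ByteOrder.nativeOrder)
for (i <- pts.indices) {
  xCol.putInt(rowBuf.getInt(i * 8))     // x of element i
  yCol.putInt(rowBuf.getInt(i * 8 + 4)) // y of element i
}

// Step 3: the off-heap buffers can now be copied to GPU device memory as-is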

Page 19: Exploiting GPUs in Spark

Long Path from Current Dataset to GPU

Two steps to send data from Dataset to GPU:
1. Binary representation on the Java heap to binary columnar on off-heap
   – From a row-oriented format to columnar
2. Off-heap to GPU device memory


case class Pt(x: Int, y: Int)
ds = Array(Pt(1, 4), Pt(2, 5)).toDS()
ds.map(…).reduce(…) // execute on GPU

[Figure: row-oriented binary data on the Java heap (step 1) -> binary columnar on off-heap (step 2) -> GPU device memory.]

Page 20: Exploiting GPUs in Spark

Shorter Path from Binary Columnar RDD to GPU

An RDD with binary columnar can simply be copied to GPU device memory


case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU

[Figure: with binary columnar, the data already resides on off-heap memory and is copied directly to GPU device memory; the Java-heap and row-to-column conversion steps of the previous path are eliminated.]

Page 21: Exploiting GPUs in Spark

Can Execute map() in Parallel Using Binary Columnar

Adjacent elements in binary columnar can be accessed in parallel

The same type of operation (* or -) can be executed in parallel on data that is loaded in parallel


case class Pt(x: Int, y: Int)
...
res = rdd.map(p => Pt(p.x*2, p.y-1)) // rdd or ds

[Figure: with the current RDD and Dataset, the values on the Java heap are accessed one after another (memory access order 1, 2, 3, 4); with binary columnar on off-heap, adjacent x values and adjacent y values can be accessed in parallel (order 1, 1, 2, 2).]
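A CPU-side Scala sketch of this map over the two columns. Each loop iteration touches only index i, so there is no loop-carried dependency, which is exactly what lets GPU cores (or CPU SIMD lanes) process the elements in parallel:

// map(p => Pt(p.x*2, p.y-1)) expressed over the x and y columns
def mapColumns(xs: Array[Int], ys: Array[Int]): (Array[Int], Array[Int]) = {
  val outX = new Array[Int](xs.length)
  val outY = new Array[Int](ys.length)
  var i = 0
  while (i < xs.length) { // iterations are independent of each other
    outX(i) = xs(i) * 2
    outY(i) = ys(i) - 1
    i += 1
  }
  (outX, outY)
}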

Page 22: Exploiting GPUs in Spark

Advantages of Binary Columnar

Can exploit high performance of GPUs

Can reduce overhead of data copy between CPU and GPU

Consumes a smaller memory footprint than RDD

Can directly compute on data stored in columnar formats such as Apache Parquet and Apache Arrow

Can exploit SIMD instructions on CPU, too


Page 23: Exploiting GPUs in Spark

What Does the GPU Enabler Do?

Copy data in a binary columnar RDD between CPU main memory and GPU device memory

Launch GPU kernels

Cache GPU native code for kernels

Generate GPU native code from transformations and actions in a program
– We have already productized the IBM Java just-in-time compiler that generates GPU native code from a lambda expression in Java 8


Page 24: Exploiting GPUs in Spark

Motivation & Goal

Projects to Exploit GPUs in Spark

Introduction of GPUs

Design & New Components

Two approaches to Exploit GPUs in Spark

Conclusion

Page 25: Exploiting GPUs in Spark

How to Exploit GPUs in Spark

The bottom line is to enable columnar storage and the GPU enabler in Spark – any approach can use both of them to exploit GPUs in Spark effectively and transparently


Page 26: Exploiting GPUs in Spark


Comparisons among DataFrame, Dataset, and RDD

DataFrame (with relational operations) and Dataset (with lambda functions) use Catalyst and a row-oriented data representation on off-heap


case class Pt(x: Int, y: Int)
d = Array(Pt(1, 4), Pt(2, 5))

Frontend API:
– DataFrame (v1.3-): df = d.toDF(…); df.filter("x>1").count()
– Dataset (v1.6-): ds = d.toDS(); ds.filter(p => p.x>1).count()
– RDD (v0.5-): rdd = sc.parallelize(d); rdd.filter(p => p.x>1).count()

Backend computation:
– DataFrame and Dataset: Java bytecode generated by Catalyst
– RDD: Java bytecode in the Spark program and runtime

Data representation:
– DataFrame and Dataset: row-oriented binary data
– RDD: Java objects on the Java heap

Page 27: Exploiting GPUs in Spark

Two Approaches to Exploit GPUs

Devising a Spark package for RDD
– Library developers can use this to enable their GPU code in Spark libraries
– Application programmers can use this to run their code in their Spark applications

Enhance Catalyst for DataFrame/Dataset
– Spark programs with DataFrame/Dataset will be translated to GPU code transparently
– Plan:
  1. As the first step, generate code for specific columnar storages for CPU
     • https://github.com/apache/spark/pull/11636 for ColumnarBatch
     • https://github.com/apache/spark/pull/11956 for CachedBatch
  2. Introduce a generic columnar storage (UnsafeColumn?) for CPU
  3. Generate code for the generic columnar storage for CPU
  4. Generate code for the generic columnar storage for GPU


Page 28: Exploiting GPUs in Spark

Software Stack for RDD in Spark 2.0

RDD keeps data on the Java heap


[Figure: software stack – the user's/library's Spark program calls the RDD API, and the RDD data resides on the Java heap.]

Page 29: Exploiting GPUs in Spark


GPU Exploitation for RDD

Current RDD and binary columnar can co-exist

User/library-provided GPU code is managed by GPU enabler


[Figure: the user's/library's Spark program calls the RDD API; RDD data on the Java heap co-exists with columnar data on off-heap, and the GPU enabler manages the columnar data in GPU device memory.]

Page 30: Exploiting GPUs in Spark

Software Stack for Dataset/DataFrame in Spark 2.0

Dataset becomes the primary data structure for computation

Dataset keeps data in UnsafeRow on the Java heap


[Figure: the user's/library's Spark program uses DataFrame and Dataset on top of Catalyst (logical optimizer and CPU code generator) and Tungsten; the UnsafeRow data resides on the Java heap.]

Page 31: Exploiting GPUs in Spark

GPU Exploitation for DataFrame/Dataset

UnsafeRow and Columnar can co-exist

Catalyst will generate GPU code from a Spark program


[Figure: the user's/library's Spark program uses DataFrame and Dataset on top of Catalyst and Tungsten; Catalyst gains a GPU code generator next to the logical optimizer and CPU code generator. UnsafeRow on the Java heap co-exists with columnar data on off-heap, and the GPU enabler manages columnar data in GPU device memory.]

Page 32: Exploiting GPUs in Spark

Exploit GPUs for RDD

Execute user-provided GPU kernels from map()/reduce() functions
– GPU memory management and data copies are handled automatically

Generate GPU native code for simple map()/reduce() methods
– Set "spark.gpu.codegen=true" in spark-defaults.conf


rdd1 = sc.parallelize(1 to n, 2).convert(ColumnFormat) // rdd1 uses binary columnar RDD
sum = rdd1.map(i => i * 2)
          .reduce((x, y) => (x + y))

// CUDA
__global__ void sample_map(int *inX, int *inY, int *outX, int *outY, long size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x;
  if (size <= ix) return;
  outX[ix] = inX[ix] * 2;
  outY[ix] = inY[ix] - 1;
}

// Spark
mapFunction = new CUDAFunction("sample_map",   // CUDA method name
  Array("this.x", "this.y"),                   // input object has two fields
  Array("this.x", "this.y"),                   // output object has two fields
  this.getClass.getResource("/sample.ptx"))    // ptx is generated by the CUDA compiler
rdd1 = sc.parallelize(…).convert(ColumnFormat) // rdd1 uses binary columnar RDD
rdd2 = rdd1.mapExtFunc(p => Pt(p.x*2, p.y-1), mapFunction)

Page 33: Exploiting GPUs in Spark

How to Use GPU Exploitation for RDD

Easy to install with a one-liner and to run with a one-liner – on x86_64, Mac, and ppc64le, with CUDA 7.0 or later, and with any JVM such as IBM JDK or OpenJDK

A run script for AWS EC2 is available; it supports spot instances

$ wget https://s3.amazonaws.com/spark-gpu-public/spark-gpu-latest-bin-hadoop2.4.tgz && tar xf spark-gpu-latest-bin-hadoop2.4.tgz && cd spark-gpu

$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 MASTER='local[2]' ./bin/run-example SparkGPULR 8 3200 32 5
…
numSlices=8, N=3200, D=32, ITERATIONS=5
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Elapsed time: 431 ms
$

Available at http://kiszk.github.io/spark-gpu/

• 3 contributors
• Private communications with other developers

Page 34: Exploiting GPUs in Spark

Achieved 3.15x Performance Improvement by GPU

Ran a naïve implementation of logistic regression

Achieved a 3.15x performance improvement of logistic regression over the version without GPU, on a 16-core IvyBridge box with an NVIDIA K40 GPU card
– There is still room to improve performance


Details are available at https://github.com/kiszk/spark-gpu/wiki/Benchmark

Program parameters: N=1,000,000 (# of points), D=400 (# of features), ITERATIONS=5; Slices=128 (without GPU), 16 (with GPU); MASTER=local[8] (without and with GPU)

Hardware and software: nx360 M4 machine, 2 sockets of 8-core Intel Xeon E5-2667 3.3 GHz, 256 GB memory, one NVIDIA K40m card; OS: RedHat 6.6; CUDA: 7.0

Page 35: Exploiting GPUs in Spark

We Are Planning to Release a Spark Package Version

You can use any Spark runtime – Spark 1.6, 1.6.1, 2.0.0-SNAPSHOT, your own Spark, …

Live demo


Page 36: Exploiting GPUs in Spark

Motivation & Goal

Projects to Exploit GPUs in Spark

Introduction of GPUs

Design & New Components

Two approaches to Exploit GPUs in Spark

Conclusion

Page 37: Exploiting GPUs in Spark

Takeaway

Accelerate a Spark application by using GPUs effectively and transparently

More than 10 approaches exist for GPU exploitation

Two fundamental components
– Binary columnar to alleviate the overhead of GPU exploitation
– GPU enabler to manage GPU kernel execution from a Spark program
  • Calls pre-compiled GPU libraries
  • Generates GPU native code at runtime

Two approaches
– Spark plug-in for RDD
– Enhancement of Catalyst for DataFrame/Dataset

Looking for input from the community – use cases, discussions, requests, …


We appreciate your feedback and contributions