Exploiting GPUs in Spark
TRANSCRIPT
Kazuaki Ishizaki
IBM Research – Tokyo (IBM Japan, Ltd., Tokyo Research Laboratory)
Who am I?
Kazuaki Ishizaki – lives in Tokyo, Japan
Research staff member at IBM Research – Tokyo – http://ibm.co/kiszk
Research interests – compiler optimizations, language runtimes, and parallel processing
Worked on Java virtual machines and just-in-time compilers for over 20 years – from JDK 1.0 to Java SE 8
Twitter: @kiszk
Slideshare: http://www.slideshare.net/ishizaki
Github: https://github.com/kiszk
My message is “Spark can meet GPUs”
Let us discuss use cases, opportunities, and requirements in meetups, conferences, and the Spark dev or user mailing lists
While GPUs are not yet first-class citizens in Spark, four GPU-related talks will be given at Spark Summit SF
Agenda
Motivation & Goal
Activities to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
– Binary columnar
– GPU enabler
Two Approaches to Exploit GPUs in Spark
– Spark plug-in
– Enhancement of Catalyst in the Spark runtime
Conclusion
Want to Accelerate Computation-heavy Applications
Motivation – Want to shorten the execution time of a long-running Spark application, which may be
Computation-heavy
Shuffle-heavy
I/O-heavy
Goal – Accelerate a computation-heavy Spark application
According to Reynold’s talk (p. 21), the CPU will become the bottleneck in Spark
Accelerate a Spark Application by GPUs
Our Approach – Accelerate a Spark application by using GPUs effectively and transparently
Exploit the high performance of GPUs
Do not ask users to change their Spark programs
New components for acceleration
– Binary columnar (e.g. Apache Arrow): efficient data representation for GPUs and CPUs
– GPU enabler: automatically handles execution on GPUs
• GPU memory allocation, data copy between GPU and CPU, etc.
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
About 10 Existing Projects to Exploit GPUs in Spark
There are several activities, but none has been merged into master – The community will make GPUs a first-class citizen in Spark
The projects can be classified by how the GPU code is called and who prepares the GPU code:

Called through Spark standard APIs (RDD, Dataset, DataFrame)
– GPU code prepared by a Spark system programmer: mllib (N/A on github), Deeplearning4J on Spark
– GPU code prepared by a Spark application programmer: Our GPU enabler (spark-gpu), Spark SWAT
– GPU code generated from the Spark application: Columnar DataFrame (N/A on github), NUWA (product), Our on-going work

Called through unique APIs
– GPU code prepared by a Spark system programmer: Caffe on Spark, BidMach Spark, CSR in Spark
– GPU code prepared by a Spark application programmer: HeteroSpark (N/A on github)
Existing Resource Managers that Support GPUs for Spark
Spark on Mesos – https://spark-summit.org/2016/events/spark-on-mesos-the-state-of-the-art/
YARN Node Labels – https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
GPU Programming Model
Five steps
1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on the cores
4. Copy data back from GPU device memory to CPU main memory
5. Free GPU device memory
Usually, a programmer has to write these steps in CUDA or OpenCL; a host-side sketch of these steps is shown below
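To make the five steps concrete, here is a minimal host-side sketch in Scala. It assumes the JCuda driver bindings, a pre-compiled sample.ptx module, and a hypothetical kernel named sample_inc; it illustrates the programming model and is not the GPU enabler's actual code.

// Minimal host-side sketch of the five steps (assumption: JCuda bindings, sample.ptx, sample_inc kernel)
import jcuda.{Pointer, Sizeof}
import jcuda.driver.JCudaDriver._
import jcuda.driver.{CUcontext, CUdevice, CUdeviceptr, CUfunction, CUmodule}

object FiveSteps {
  def main(args: Array[String]): Unit = {
    val n = 1024
    val hostIn  = Array.tabulate(n)(_.toFloat)
    val hostOut = new Array[Float](n)

    cuInit(0)
    val dev = new CUdevice();  cuDeviceGet(dev, 0)
    val ctx = new CUcontext(); cuCtxCreate(ctx, 0, dev)
    val mod = new CUmodule();  cuModuleLoad(mod, "sample.ptx")          // pre-compiled kernel module
    val fn  = new CUfunction(); cuModuleGetFunction(fn, mod, "sample_inc")

    // 1. Allocate GPU device memory
    val dIn  = new CUdeviceptr(); cuMemAlloc(dIn,  n * Sizeof.FLOAT)
    val dOut = new CUdeviceptr(); cuMemAlloc(dOut, n * Sizeof.FLOAT)

    // 2. Copy data from CPU main memory to GPU device memory
    cuMemcpyHtoD(dIn, Pointer.to(hostIn), n * Sizeof.FLOAT)

    // 3. Launch a GPU kernel, one thread per element
    val params = Pointer.to(Pointer.to(dIn), Pointer.to(dOut), Pointer.to(Array(n)))
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, null, params, null)
    cuCtxSynchronize()

    // 4. Copy the result back from GPU device memory to CPU main memory
    cuMemcpyDtoH(Pointer.to(hostOut), dOut, n * Sizeof.FLOAT)

    // 5. Free GPU device memory
    cuMemFree(dIn); cuMemFree(dOut)
  }
}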
[Figure: CPU with main memory (up to 1 TB/socket) and a dozen cores per socket, GPU with device memory (up to 12 GB) and thousands of cores, connected by data copy over PCIe]
How We Can Run a Program Faster on a GPU
Assign many parallel computations to the cores
Make memory accesses coalesced – an example follows
– A column-oriented layout achieves better performance: this paper reports about a 3x performance improvement in GPU kernel execution of kmeans over a row-oriented layout (a layout sketch in Scala follows the figure below)
[Figure: row-oriented vs. column-oriented layout of Pt(x: Int, y: Int); assuming 4 consecutive data elements can be coalesced by the GPU hardware, loading four Pt.x and then four Pt.y values takes 2 memory accesses to GPU device memory with the column-oriented layout vs. 4 with the row-oriented layout]
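As a small illustration of the two layouts above, the following plain-Scala sketch flattens an array of Pt into a row-oriented and a column-oriented array; it is illustrative only and is not part of Spark or the GPU enabler.

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8))

// Row-oriented: x and y interleaved, so four threads loading x1..x4
// touch every other word and need more memory transactions.
val rowOriented: Array[Int] = pts.flatMap(p => Array(p.x, p.y)) // 1,5,2,6,3,7,4,8

// Column-oriented: all x values first, then all y values, so four threads
// loading x1..x4 hit consecutive words that the GPU hardware can coalesce.
val columnOriented: Array[Int] = pts.map(_.x) ++ pts.map(_.y)   // 1,2,3,4,5,6,7,8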
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
High-Level View of GPU Exploitation
Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on the GPU
Transparent
– Map the parallelism in a program onto GPU native code
User’s Spark Program (Scala)
case class Pt(x: Int, y: Int)
val rdd1 = sc.parallelize(Array(
  Pt(1, 4), Pt(2, 5), Pt(3, 6), Pt(4, 7), Pt(5, 8), Pt(6, 9)), 3)
val rdd2 = rdd1.map(p => Pt(p.x*2, p.y-1))
val cnt = rdd2.map(p => p.x).reduce((x1, x2) => x1 + x2)
[Figure: the map and reduce in the user's program are translated into GPU native code; the x and y columns of rdd1 (binary columnar, off-heap) are transferred to GPU device memory, the GPU kernel computes x*2 and y-1 for every element, and the results are transferred back as rdd2]
The GPU can exploit parallelism both among the blocks of an RDD and within a block of an RDD.
What Does Binary Columnar Do?
Keeps data in a binary representation (not as Java objects)
Keeps data in a column-oriented layout
Keeps data in off-heap or GPU device memory
[Figure: example for case class Pt(x: Int, y: Int) with Array(Pt(1, 4), Pt(2, 5)) stored off-heap in a columnar (column-oriented) layout vs. a row-oriented layout]
Current RDD as Java objects on Java heap
case class Pt(x: Int, y: Int)
val rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
[Figure: two Pt objects on the Java heap, each with an object header for the Java virtual machine, holding (1, 4) and (2, 5)]
Current RDD: row-oriented layout, Java object representation, on the Java heap
Binary Columnar on off-heap
case class Pt(x: Int, y: Int)
val rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
[Figure: the same data kept both as Pt objects with JVM object headers on the Java heap and as binary columnar data (x column 1, 2 and y column 4, 5) on off-heap memory]
Current RDD: row-oriented layout, Java object representation, on the Java heap
Binary columnar: column-oriented layout, binary representation, on off-heap memory
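A minimal Scala sketch of what the off-heap binary columnar layout amounts to, using plain java.nio direct buffers; this is an illustration, not the actual binary columnar implementation.

import java.nio.{ByteBuffer, ByteOrder}

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5))

// One direct (off-heap) buffer per column: no JVM object headers, binary values only.
val n = pts.length
val xCol = ByteBuffer.allocateDirect(4 * n).order(ByteOrder.nativeOrder())
val yCol = ByteBuffer.allocateDirect(4 * n).order(ByteOrder.nativeOrder())
pts.foreach { p => xCol.putInt(p.x); yCol.putInt(p.y) }
// xCol now holds 1, 2 and yCol holds 4, 5 as raw ints,
// ready to be copied to GPU device memory in one transfer per column.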
Long Path from Current RDD to GPU
Three steps to send data from a current RDD to the GPU (a sketch of the first two steps follows below)
1. Java objects to a column-oriented binary representation on the Java heap
From Java objects to a binary representation
From a row-oriented format to a columnar one
2. Binary representation on the Java heap to binary columnar on off-heap memory
Garbage collection may move objects on the Java heap during GPU-related operations
3. Off-heap memory to GPU device memory
case class Pt(x: Int, y: Int)
val rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: Pt objects on the Java heap → ByteBuffers on the Java heap → binary columnar on off-heap memory → GPU device memory]
This thread on the dev ML also discusses the overhead of copying data between an RDD and the GPU
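To make the extra copies concrete, the following plain-Scala sketch walks through steps 1 and 2 with java.nio buffers; the real conversion happens inside the GPU enabler, so this is illustrative only.

import java.nio.{ByteBuffer, ByteOrder}

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5))
val n = pts.length

// Step 1: Java objects -> column-oriented binary representation on the Java heap
val onHeap = ByteBuffer.allocate(8 * n).order(ByteOrder.nativeOrder())
pts.foreach(p => onHeap.putInt(p.x)); pts.foreach(p => onHeap.putInt(p.y))

// Step 2: binary representation on the Java heap -> binary columnar on off-heap memory
// (needed because the GC may move on-heap buffers during GPU-related operations)
onHeap.flip()
val offHeap = ByteBuffer.allocateDirect(8 * n).order(ByteOrder.nativeOrder())
offHeap.put(onHeap)

// Step 3: the off-heap buffer is what finally gets copied to GPU device memory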
Long Path from Current Dataset to GPU
Two steps to send data from a current Dataset to the GPU
1. Binary representation on the Java heap to binary columnar on off-heap memory
From a row-oriented format to a columnar one
2. Off-heap memory to GPU device memory
case class Pt(x: Int, y: Int)
val ds = Array(Pt(1, 4), Pt(2, 5)).toDS()
ds.map(…).reduce(…) // execute on GPU
[Figure: row-oriented binary data on the Java heap → binary columnar on off-heap memory → GPU device memory]
Shorter Path from Binary Columnar RDD to GPU
An RDD stored as binary columnar can simply be copied to GPU device memory
case class Pt(x: Int, y: Int)
val rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: the Java-heap conversion steps are eliminated; the binary columnar data on off-heap memory is copied directly to GPU device memory]
Can Execute map() in Parallel Using Binary Columnar
Adjacent elements in binary columnar can be accessed in parallel
The same operation (* or -) can be executed in parallel on the data that is loaded in parallel (see the sketch below)
case class Pt(x: Int, y: Int)
...
val res = rdd.map(p => Pt(p.x*2, p.y-1)) // likewise for a Dataset
[Figure: memory access order of the map over a current RDD (Java heap), a current Dataset (Java heap), and binary columnar (off-heap); with binary columnar, the x and y columns are walked over adjacent elements in lock-step]
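The sketch below (plain Scala, not the GPU enabler's generated code) shows why the columnar form parallelizes: the map body turns into independent element-wise updates over the x and y columns, one GPU thread (or SIMD lane) per index.

// Sketch: map(p => Pt(p.x*2, p.y-1)) over columnar storage becomes two
// independent element-wise loops; every iteration can run in parallel.
def mapColumnar(xs: Array[Int], ys: Array[Int]): (Array[Int], Array[Int]) = {
  val outX = new Array[Int](xs.length)
  val outY = new Array[Int](ys.length)
  var i = 0
  while (i < xs.length) { // on a GPU, index i maps to one thread
    outX(i) = xs(i) * 2
    outY(i) = ys(i) - 1
    i += 1
  }
  (outX, outY)
}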
Advantages of Binary Columnar
Can exploit the high performance of GPUs
Can reduce the overhead of data copy between CPU and GPU
Has a smaller memory footprint than the current RDD
Can directly compute on columnar data coming from Apache Parquet or Apache Arrow
Can exploit SIMD instructions on the CPU, too
What Does the GPU Enabler Do?
Copies data in a binary columnar RDD between CPU main memory and GPU device memory
Launches GPU kernels
Caches GPU native code for kernels
Generates GPU native code from the transformations and actions in a program
– We have already productized the IBM Java just-in-time compiler that generates GPU native code from a lambda expression in Java 8
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
How to Exploit GPUs in Spark
The bottom line is to enable columnar storage and the GPU enabler in Spark
– Any approach can use both of them to exploit GPUs in Spark effectively and transparently
Comparisons among DataFrame, Dataset, and RDD
DataFrame (with relational operations) and Dataset (with lambda functions) use Catalyst and a row-oriented data representation on off-heap memory
case class Pt(x: Int, y: Int)
val d = Array(Pt(1, 4), Pt(2, 5))

Frontend API
– DataFrame (v1.3-): df = d.toDF(…); df.filter("x>1").count()
– Dataset (v1.6-): ds = d.toDS(); ds.filter(p => p.x>1).count()
– RDD (v0.5-): rdd = sc.parallelize(d); rdd.filter(p => p.x>1).count()
Data
– DataFrame and Dataset: row-oriented binary data
– RDD: Java objects on the Java heap
Backend computation
– DataFrame and Dataset: Java bytecode generated by Catalyst
– RDD: Java bytecode in the Spark program and runtime
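For reference, a self-contained Scala sketch that runs the three front-ends on the same data; it assumes a local Spark 2.0-style SparkSession and is not taken from the original slides.

import org.apache.spark.sql.SparkSession

case class Pt(x: Int, y: Int)

object CompareApis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("compare").getOrCreate()
    import spark.implicits._
    val d = Seq(Pt(1, 4), Pt(2, 5))

    val rdd = spark.sparkContext.parallelize(d)  // RDD: Java objects on the heap
    println(rdd.filter(p => p.x > 1).count())

    val ds = d.toDS()                            // Dataset: lambdas on Tungsten rows
    println(ds.filter(p => p.x > 1).count())

    val df = d.toDF("x", "y")                    // DataFrame: relational operations
    println(df.filter("x > 1").count())

    spark.stop()
  }
}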
Two Approaches to Exploit GPUs
Devising a Spark package for RDD
– Library developers can use this to enable their GPU code in Spark libraries
– Application programmers can use this to run their code in their Spark applications
Enhancing Catalyst for DataFrame/Dataset
– Spark programs with DataFrame/Dataset will be translated to GPU code transparently
1. As the first step, we are generating code for specific columnar storages for CPUs
• https://github.com/apache/spark/pull/11636 for ColumnarBatch
• https://github.com/apache/spark/pull/11956 for CachedBatch
2. Introduce a generic columnar storage (UnsafeColumn?) for the CPU
3. Generate code for the generic columnar storage for the CPU
4. Generate code for the generic columnar storage for the GPU
Software Stack for RDD in Spark 2.0
RDD keeps data on Java heap
[Figure: the user's/library's Spark program uses the RDD API; RDD data lives on the Java heap, and the off-heap area is unused]
GPU Exploitation for RDD
Current RDD and binary columnar can co-exist
User/library-provided GPU code is managed by the GPU enabler
[Figure: the user's/library's Spark program uses the RDD API; RDD data on the Java heap co-exists with columnar data on off-heap memory, and the GPU enabler copies the columnar data to GPU device memory]
Software Stack for Dataset/DataFrame in Spark 2.0
Dataset becomes the primary data structure for computation
Dataset keeps data in UnsafeRow on the Java heap
[Figure: the user's/library's Spark program uses the DataFrame and Dataset APIs; Catalyst (logical optimizer, CPU code generator) and Tungsten operate on UnsafeRow data on the Java heap]
GPU Exploitation for DataFrame/Dataset
UnsafeRow and Columnar can co-exist
Catalyst will generate GPU code from a Spark program
[Figure: the user's/library's Spark program uses the DataFrame and Dataset APIs; Catalyst (logical optimizer, CPU code generator) and Tungsten work with UnsafeRow on the Java heap and columnar data on off-heap memory, and the GPU enabler copies the columnar data to GPU device memory]
Exploit GPUs for RDD
Execute user-provided GPU kernels from map()/reduce() functions
– GPU memory management and data copy are handled automatically
Generate GPU native code for simple map()/reduce() methods
– Set "spark.gpu.codegen=true" in spark-defaults.conf

val rdd1 = sc.parallelize(1 to n, 2).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
val sum = rdd1.map(i => i * 2)
              .reduce((x, y) => x + y)

// CUDA
__global__ void sample_map(int *inX, int *inY, int *outX, int *outY, long size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x;
  if (size <= ix) return;
  outX[ix] = inX[ix] * 2;
  outY[ix] = inY[ix] - 1;
}

// Spark
val mapFunction = new CUDAFunction("sample_map", // CUDA method name
  Array("this.x", "this.y"),                     // input object has two fields
  Array("this.x", "this.y"),                     // output object has two fields
  this.getClass.getResource("/sample.ptx"))      // the ptx is generated by the CUDA compiler
val rdd1 = sc.parallelize(…).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
val rdd2 = rdd1.mapExtFunc(p => Pt(p.x*2, p.y-1), mapFunction)
How to Use the GPU Exploitation for RDD
Easy to install with a one-liner and to run with a one-liner
– on x86_64, Mac, and ppc64le, with CUDA 7.0 or later and any JVM such as IBM JDK or OpenJDK
A run script for AWS EC2 is available, which supports spot instances

$ wget https://s3.amazonaws.com/spark-gpu-public/spark-gpu-latest-bin-hadoop2.4.tgz && tar xf spark-gpu-latest-bin-hadoop2.4.tgz && cd spark-gpu

$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 MASTER='local[2]' ./bin/run-example SparkGPULR 8 3200 32 5
…
numSlices=8, N=3200, D=32, ITERATIONS=5
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Elapsed time: 431 ms
$

Available at http://kiszk.github.io/spark-gpu/
• 3 contributors
• Private communications with other developers
Achieved 3.15x Performance Improvement by GPU
Ran a naïve implementation of logistic regression
Achieved a 3.15x performance improvement of logistic regression over the run without a GPU, on a 16-core IvyBridge box with an NVIDIA K40 GPU card
– There is still room to improve performance
Details are available at https://github.com/kiszk/spark-gpu/wiki/Benchmark
Program parameters
N=1,000,000 (# of points), D=400 (# of features), ITERATIONS=5
Slices=128 (without GPU), 16 (with GPU)
MASTER=local[8] (without and with GPU)
Hardware and software
Machine: nx360 M4, 2 sockets of 8-core Intel Xeon E5-2667 3.3GHz, 256GB memory, one NVIDIA K40m card
OS: RedHat 6.6, CUDA: 7.0
We are planning to release a Spark Package version
You can use any Spark runtime – Spark 1.6, 1.6.1, 2.0.0-SNAPSHOT, your own Spark, …
Live demo
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
Takeaway
Accelerate a Spark application by using GPUs effectively and transparently
More than 10 approaches exist for GPU exploitation
Two fundamental components
– Binary columnar to alleviate the overhead of GPU exploitation
– GPU enabler to manage GPU kernel execution from a Spark program
Call pre-compiled GPU libraries
Generate GPU native code at runtime
Two approaches
– Spark plug-in for RDD
– Enhancement of Catalyst for DataFrame/Dataset
Looking for anything from the community
– Use cases, discussions, requests, …
We appreciate your feedback and contributions