Exploiting GPUs in Spark
TRANSCRIPT
Kazuaki Ishizaki
IBM Research – Tokyo (IBM Japan, Ltd., Tokyo Research Laboratory)
Who am I?
Kazuaki Ishizaki – lives in Tokyo, Japan
Research staff member at IBM Research – Tokyo – http://ibm.co/kiszk
Research interests – compiler optimizations, language runtimes, and parallel processing
Worked on Java virtual machines and just-in-time compilers for over 20 years – from JDK 1.0 to Java SE 8
Twitter: @kiszk
Slideshare: http://www.slideshare.net/ishizaki
Github: https://github.com/kiszk
My message is “Spark can meet GPUs”
Let us discuss use cases, opportunities, and requirements in meetups, conferences, and the Spark dev or user mailing lists
While GPUs are not yet first-class citizens in Spark, four GPU-related talks will be given at Spark Summit SF
Agenda
Motivation & Goal
Activities to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
– Binary columnar
– GPU enabler
Two Approaches to Exploit GPUs in Spark
– Spark plug-in
– Enhancement of Catalyst in the Spark runtime
Conclusion
Want to Accelerate Computation-heavy Applications
Motivation – Want to shorten the execution time of a long-running Spark application, which may be
Computation-heavy
Shuffle-heavy
I/O-heavy
Goal – Accelerate a computation-heavy Spark application
According to Reynold’s talk (p. 21), the CPU will become the bottleneck in Spark
Accelerate a Spark Application by GPUs
Our Approach – Accelerate a Spark application by using GPUs effectively and transparently
Exploit the high performance of GPUs
Do not ask users to change their Spark programs
New components for acceleration
– Binary columnar (e.g. Apache Arrow): efficient data representation for GPUs and CPUs
– GPU enabler: automatically handles execution on GPUs
• GPU memory allocation, data copy between GPU and CPU, etc.
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
About 10 Existing Projects to Exploit GPUs in Spark
There are several activities, but none has been merged into master – The community will make GPUs a first-class citizen in Spark
The projects can be classified by how the GPU code is called and who prepares the GPU code:

Called through Spark standard APIs (RDD, Dataset, DataFrame)
– GPU code prepared by a Spark system programmer: mllib (N/A on github), Deeplearning4J on Spark
– GPU code prepared by a Spark application programmer: Our GPU enabler (spark-gpu), Spark SWAT
– GPU code generated from the Spark application: Columnar DataFrame (N/A on github), NUWA (product), Our on-going work

Called through unique APIs
– GPU code prepared by a Spark system programmer: Caffe on Spark, BidMach Spark, CSR in Spark
– GPU code prepared by a Spark application programmer: HeteroSpark (N/A on github)
Existing Resource Managers that Support GPUs for Spark
Spark on Mesos – https://spark-summit.org/2016/events/spark-on-mesos-the-state-of-the-art/
YARN Node Labels – https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
GPU Programming Model
Five steps
1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on the cores
4. Copy data back from GPU device memory to CPU main memory
5. Free GPU device memory
Usually, a programmer has to write these steps in CUDA or OpenCL; a host-side sketch of these steps is shown below
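To make the five steps concrete, here is a minimal host-side sketch in Scala. It assumes the JCuda driver bindings, a pre-compiled sample.ptx module, and a hypothetical kernel named sample_inc; it illustrates the programming model and is not the GPU enabler's actual code.

// Minimal host-side sketch of the five steps (assumption: JCuda bindings, sample.ptx, sample_inc kernel)
import jcuda.{Pointer, Sizeof}
import jcuda.driver.JCudaDriver._
import jcuda.driver.{CUcontext, CUdevice, CUdeviceptr, CUfunction, CUmodule}

object FiveSteps {
  def main(args: Array[String]): Unit = {
    val n = 1024
    val hostIn  = Array.tabulate(n)(_.toFloat)
    val hostOut = new Array[Float](n)

    cuInit(0)
    val dev = new CUdevice();  cuDeviceGet(dev, 0)
    val ctx = new CUcontext(); cuCtxCreate(ctx, 0, dev)
    val mod = new CUmodule();  cuModuleLoad(mod, "sample.ptx")          // pre-compiled kernel module
    val fn  = new CUfunction(); cuModuleGetFunction(fn, mod, "sample_inc")

    // 1. Allocate GPU device memory
    val dIn  = new CUdeviceptr(); cuMemAlloc(dIn,  n * Sizeof.FLOAT)
    val dOut = new CUdeviceptr(); cuMemAlloc(dOut, n * Sizeof.FLOAT)

    // 2. Copy data from CPU main memory to GPU device memory
    cuMemcpyHtoD(dIn, Pointer.to(hostIn), n * Sizeof.FLOAT)

    // 3. Launch a GPU kernel, one thread per element
    val params = Pointer.to(Pointer.to(dIn), Pointer.to(dOut), Pointer.to(Array(n)))
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, null, params, null)
    cuCtxSynchronize()

    // 4. Copy the result back from GPU device memory to CPU main memory
    cuMemcpyDtoH(Pointer.to(hostOut), dOut, n * Sizeof.FLOAT)

    // 5. Free GPU device memory
    cuMemFree(dIn); cuMemFree(dOut)
  }
}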
[Figure: CPU with main memory (up to 1 TB/socket) and a dozen cores per socket, GPU with device memory (up to 12 GB) and thousands of cores, connected by data copy over PCIe]
How We Can Run a Program Faster on a GPU
Assign many parallel computations to the cores
Make memory accesses coalesced – an example follows
– A column-oriented layout achieves better performance: this paper reports about a 3x performance improvement in GPU kernel execution of kmeans over a row-oriented layout (a layout sketch in Scala follows the figure below)
[Figure: row-oriented vs. column-oriented layout of Pt(x: Int, y: Int); assuming 4 consecutive data elements can be coalesced by the GPU hardware, loading four Pt.x and then four Pt.y values takes 2 memory accesses to GPU device memory with the column-oriented layout vs. 4 with the row-oriented layout]
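As a small illustration of the two layouts above, the following plain-Scala sketch flattens an array of Pt into a row-oriented and a column-oriented array; it is illustrative only and is not part of Spark or the GPU enabler.

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8))

// Row-oriented: x and y interleaved, so four threads loading x1..x4
// touch every other word and need more memory transactions.
val rowOriented: Array[Int] = pts.flatMap(p => Array(p.x, p.y)) // 1,5,2,6,3,7,4,8

// Column-oriented: all x values first, then all y values, so four threads
// loading x1..x4 hit consecutive words that the GPU hardware can coalesce.
val columnOriented: Array[Int] = pts.map(_.x) ++ pts.map(_.y)   // 1,2,3,4,5,6,7,8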
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
High-Level View of GPU Exploitation
Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on the GPU
Transparent
– Map the parallelism in a program onto GPU native code
User’s Spark Program (Scala)
case class Pt(x: Int, y: Int)
val rdd1 = sc.parallelize(Array(
  Pt(1, 4), Pt(2, 5), Pt(3, 6), Pt(4, 7), Pt(5, 8), Pt(6, 9)), 3)
val rdd2 = rdd1.map(p => Pt(p.x*2, p.y-1))
val cnt = rdd2.map(p => p.x).reduce((x1, x2) => x1 + x2)
[Figure: the map and reduce in the user's program are translated into GPU native code; the x and y columns of rdd1 (binary columnar, off-heap) are transferred to GPU device memory, the GPU kernel computes x*2 and y-1 for every element, and the results are transferred back as rdd2]
The GPU can exploit parallelism both among the blocks of an RDD and within a block of an RDD.
What Does Binary Columnar Do?
Keeps data in a binary representation (not as Java objects)
Keeps data in a column-oriented layout
Keeps data in off-heap or GPU device memory
[Figure: example for case class Pt(x: Int, y: Int) with Array(Pt(1, 4), Pt(2, 5)) stored off-heap in a columnar (column-oriented) layout vs. a row-oriented layout]
Current RDD as Java objects on Java heap
case class Pt(x: Int, y: Int)
val rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
[Figure: two Pt objects on the Java heap, each with an object header for the Java virtual machine, holding (1, 4) and (2, 5)]
Current RDD: row-oriented layout, Java object representation, on the Java heap
Binary Columnar on off-heap
case class Pt(x: Int, y: Int)
val rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
[Figure: the same data kept both as Pt objects with JVM object headers on the Java heap and as binary columnar data (x column 1, 2 and y column 4, 5) on off-heap memory]
Current RDD: row-oriented layout, Java object representation, on the Java heap
Binary columnar: column-oriented layout, binary representation, on off-heap memory
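A minimal Scala sketch of what the off-heap binary columnar layout amounts to, using plain java.nio direct buffers; this is an illustration, not the actual binary columnar implementation.

import java.nio.{ByteBuffer, ByteOrder}

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5))

// One direct (off-heap) buffer per column: no JVM object headers, binary values only.
val n = pts.length
val xCol = ByteBuffer.allocateDirect(4 * n).order(ByteOrder.nativeOrder())
val yCol = ByteBuffer.allocateDirect(4 * n).order(ByteOrder.nativeOrder())
pts.foreach { p => xCol.putInt(p.x); yCol.putInt(p.y) }
// xCol now holds 1, 2 and yCol holds 4, 5 as raw ints,
// ready to be copied to GPU device memory in one transfer per column.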
Long Path from Current RDD to GPU
Three steps to send data from a current RDD to the GPU (a sketch of the first two steps follows below)
1. Java objects to a column-oriented binary representation on the Java heap
From Java objects to a binary representation
From a row-oriented format to a columnar one
2. Binary representation on the Java heap to binary columnar on off-heap memory
Garbage collection may move objects on the Java heap during GPU-related operations
3. Off-heap memory to GPU device memory
case class Pt(x: Int, y: Int)
val rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: Pt objects on the Java heap → ByteBuffers on the Java heap → binary columnar on off-heap memory → GPU device memory]
This thread on the dev ML also discusses the overhead of copying data between an RDD and the GPU
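To make the extra copies concrete, the following plain-Scala sketch walks through steps 1 and 2 with java.nio buffers; the real conversion happens inside the GPU enabler, so this is illustrative only.

import java.nio.{ByteBuffer, ByteOrder}

case class Pt(x: Int, y: Int)
val pts = Array(Pt(1, 4), Pt(2, 5))
val n = pts.length

// Step 1: Java objects -> column-oriented binary representation on the Java heap
val onHeap = ByteBuffer.allocate(8 * n).order(ByteOrder.nativeOrder())
pts.foreach(p => onHeap.putInt(p.x)); pts.foreach(p => onHeap.putInt(p.y))

// Step 2: binary representation on the Java heap -> binary columnar on off-heap memory
// (needed because the GC may move on-heap buffers during GPU-related operations)
onHeap.flip()
val offHeap = ByteBuffer.allocateDirect(8 * n).order(ByteOrder.nativeOrder())
offHeap.put(onHeap)

// Step 3: the off-heap buffer is what finally gets copied to GPU device memory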
Long Path from Current Dataset to GPU
Two steps to send data from a current Dataset to the GPU
1. Binary representation on the Java heap to binary columnar on off-heap memory
From a row-oriented format to a columnar one
2. Off-heap memory to GPU device memory
case class Pt(x: Int, y: Int)
val ds = Array(Pt(1, 4), Pt(2, 5)).toDS()
ds.map(…).reduce(…) // execute on GPU
[Figure: row-oriented binary data on the Java heap → binary columnar on off-heap memory → GPU device memory]
Shorter Path from Binary Columnar RDD to GPU
An RDD stored as binary columnar can simply be copied to GPU device memory
case class Pt(x: Int, y: Int)
val rdd = sc.parallelize(Array(Pt(1, 4), Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: the Java-heap conversion steps are eliminated; the binary columnar data on off-heap memory is copied directly to GPU device memory]
Can Execute map() in Parallel Using Binary Columnar
Adjacent elements in binary columnar can be accessed in parallel
The same operation (* or -) can be executed in parallel on the data that is loaded in parallel (see the sketch below)
case class Pt(x: Int, y: Int)
...
val res = rdd.map(p => Pt(p.x*2, p.y-1)) // likewise for a Dataset
[Figure: memory access order of the map over a current RDD (Java heap), a current Dataset (Java heap), and binary columnar (off-heap); with binary columnar, the x and y columns are walked over adjacent elements in lock-step]
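The sketch below (plain Scala, not the GPU enabler's generated code) shows why the columnar form parallelizes: the map body turns into independent element-wise updates over the x and y columns, one GPU thread (or SIMD lane) per index.

// Sketch: map(p => Pt(p.x*2, p.y-1)) over columnar storage becomes two
// independent element-wise loops; every iteration can run in parallel.
def mapColumnar(xs: Array[Int], ys: Array[Int]): (Array[Int], Array[Int]) = {
  val outX = new Array[Int](xs.length)
  val outY = new Array[Int](ys.length)
  var i = 0
  while (i < xs.length) { // on a GPU, index i maps to one thread
    outX(i) = xs(i) * 2
    outY(i) = ys(i) - 1
    i += 1
  }
  (outX, outY)
}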
Advantages of Binary Columnar
Can exploit the high performance of GPUs
Can reduce the overhead of data copy between CPU and GPU
Has a smaller memory footprint than the current RDD
Can directly compute on columnar data coming from Apache Parquet or Apache Arrow
Can exploit SIMD instructions on the CPU, too
What Does the GPU Enabler Do?
Copies data in a binary columnar RDD between CPU main memory and GPU device memory
Launches GPU kernels
Caches GPU native code for kernels
Generates GPU native code from the transformations and actions in a program
– We have already productized the IBM Java just-in-time compiler that generates GPU native code from a lambda expression in Java 8
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
How to Exploit GPUs in Spark
The bottom line is to enable columnar storage and the GPU enabler in Spark
– Any approach can use both of them to exploit GPUs in Spark effectively and transparently
Comparisons among DataFrame, Dataset, and RDD
DataFrame (with relational operations) and Dataset (with lambda functions) use Catalyst and a row-oriented data representation on off-heap memory
case class Pt(x: Int, y: Int)
val d = Array(Pt(1, 4), Pt(2, 5))

Frontend API
– DataFrame (v1.3-): df = d.toDF(…); df.filter("x>1").count()
– Dataset (v1.6-): ds = d.toDS(); ds.filter(p => p.x>1).count()
– RDD (v0.5-): rdd = sc.parallelize(d); rdd.filter(p => p.x>1).count()
Data
– DataFrame and Dataset: row-oriented binary data
– RDD: Java objects on the Java heap
Backend computation
– DataFrame and Dataset: Java bytecode generated by Catalyst
– RDD: Java bytecode in the Spark program and runtime
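For reference, a self-contained Scala sketch that runs the three front-ends on the same data; it assumes a local Spark 2.0-style SparkSession and is not taken from the original slides.

import org.apache.spark.sql.SparkSession

case class Pt(x: Int, y: Int)

object CompareApis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("compare").getOrCreate()
    import spark.implicits._
    val d = Seq(Pt(1, 4), Pt(2, 5))

    val rdd = spark.sparkContext.parallelize(d)  // RDD: Java objects on the heap
    println(rdd.filter(p => p.x > 1).count())

    val ds = d.toDS()                            // Dataset: lambdas on Tungsten rows
    println(ds.filter(p => p.x > 1).count())

    val df = d.toDF("x", "y")                    // DataFrame: relational operations
    println(df.filter("x > 1").count())

    spark.stop()
  }
}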
Two Approaches to Exploit GPUs
Devising a Spark package for RDD
– Library developers can use this to enable their GPU code in Spark libraries
– Application programmers can use this to run their code in their Spark applications
Enhancing Catalyst for DataFrame/Dataset
– Spark programs with DataFrame/Dataset will be translated to GPU code transparently
1. As the first step, we are generating code for specific columnar storages for CPUs
• https://github.com/apache/spark/pull/11636 for ColumnarBatch
• https://github.com/apache/spark/pull/11956 for CachedBatch
2. Introduce a generic columnar storage (UnsafeColumn?) for the CPU
3. Generate code for the generic columnar storage for the CPU
4. Generate code for the generic columnar storage for the GPU
Software Stack for RDD in Spark 2.0
RDD keeps data on Java heap
[Figure: the user's/library's Spark program uses the RDD API; RDD data lives on the Java heap, and the off-heap area is unused]
GPU Exploitation for RDD
Current RDD and binary columnar can co-exist
User/library-provided GPU code is managed by the GPU enabler
[Figure: the user's/library's Spark program uses the RDD API; RDD data on the Java heap co-exists with columnar data on off-heap memory, and the GPU enabler copies the columnar data to GPU device memory]
Software Stack for Dataset/DataFrame in Spark 2.0
Dataset becomes the primary data structure for computation
Dataset keeps data in UnsafeRow on the Java heap
[Figure: the user's/library's Spark program uses the DataFrame and Dataset APIs; Catalyst (logical optimizer, CPU code generator) and Tungsten operate on UnsafeRow data on the Java heap]
GPU Exploitation for DataFrame/Dataset
UnsafeRow and Columnar can co-exist
Catalyst will generate GPU code from a Spark program
[Figure: the user's/library's Spark program uses the DataFrame and Dataset APIs; Catalyst (logical optimizer, CPU code generator) and Tungsten work with UnsafeRow on the Java heap and columnar data on off-heap memory, and the GPU enabler copies the columnar data to GPU device memory]
Exploit GPUs for RDD
Execute user-provided GPU kernels from map()/reduce() functions
– GPU memory management and data copy are handled automatically
Generate GPU native code for simple map()/reduce() methods
– Set "spark.gpu.codegen=true" in spark-defaults.conf

val rdd1 = sc.parallelize(1 to n, 2).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
val sum = rdd1.map(i => i * 2)
              .reduce((x, y) => x + y)

// CUDA
__global__ void sample_map(int *inX, int *inY, int *outX, int *outY, long size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x;
  if (size <= ix) return;
  outX[ix] = inX[ix] * 2;
  outY[ix] = inY[ix] - 1;
}

// Spark
val mapFunction = new CUDAFunction("sample_map", // CUDA method name
  Array("this.x", "this.y"),                     // input object has two fields
  Array("this.x", "this.y"),                     // output object has two fields
  this.getClass.getResource("/sample.ptx"))      // the ptx is generated by the CUDA compiler
val rdd1 = sc.parallelize(…).convert(ColumnFormat) // rdd1 uses a binary columnar RDD
val rdd2 = rdd1.mapExtFunc(p => Pt(p.x*2, p.y-1), mapFunction)
How to Use the GPU Exploitation for RDD
Easy to install with a one-liner and to run with a one-liner
– on x86_64, Mac, and ppc64le, with CUDA 7.0 or later and any JVM such as IBM JDK or OpenJDK
A run script for AWS EC2 is available, which supports spot instances

$ wget https://s3.amazonaws.com/spark-gpu-public/spark-gpu-latest-bin-hadoop2.4.tgz && tar xf spark-gpu-latest-bin-hadoop2.4.tgz && cd spark-gpu

$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 MASTER='local[2]' ./bin/run-example SparkGPULR 8 3200 32 5
…
numSlices=8, N=3200, D=32, ITERATIONS=5
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Elapsed time: 431 ms
$

Available at http://kiszk.github.io/spark-gpu/
• 3 contributors
• Private communications with other developers
Achieved 3.15x Performance Improvement by GPU
Ran a naïve implementation of logistic regression
Achieved a 3.15x performance improvement of logistic regression over the run without a GPU, on a 16-core IvyBridge box with an NVIDIA K40 GPU card
– There is still room to improve performance
Details are available at https://github.com/kiszk/spark-gpu/wiki/Benchmark
Program parameters
N=1,000,000 (# of points), D=400 (# of features), ITERATIONS=5
Slices=128 (without GPU), 16 (with GPU)
MASTER=local[8] (without and with GPU)
Hardware and software
Machine: nx360 M4, 2 sockets of 8-core Intel Xeon E5-2667 3.3GHz, 256GB memory, one NVIDIA K40m card
OS: RedHat 6.6, CUDA: 7.0
We are planning to release a Spark Package version
You can use any Spark runtime – Spark 1.6, 1.6.1, 2.0.0-SNAPSHOT, your own Spark, …
Live demo
Motivation & Goal
Projects to Exploit GPUs in Spark
Introduction of GPUs
Design & New Components
Two approaches to Exploit GPUs in Spark
Conclusion
Takeaway
Accelerate a Spark application by using GPUs effectively and transparently
More than 10 approaches exist for GPU exploitation
Two fundamental components
– Binary columnar to alleviate the overhead of GPU exploitation
– GPU enabler to manage GPU kernel execution from a Spark program
Call pre-compiled GPU libraries
Generate GPU native code at runtime
Two approaches
– Spark plug-in for RDD
– Enhancement of Catalyst for DataFrame/Dataset
Looking for anything from the community
– Use cases, discussions, requests, …
We appreciate your feedback and contributions