Apache Spark Performance Observations


Page 1: Apache Spark Performance Observations


Apache Spark performance
Adam Roberts, IBM Runtimes, Hursley, UK

● Sharing observations from IBM Runtimes
● High-level techniques and tools
● Writing efficient code
● Hardware accelerators
  ● RDMA for networking
  ● GPUs for computation

Page 2: Apache Spark Performance Observations


Workloads we're especially interested in

● HiBench
● SparkSqlPerf, all 100 TPC-DS queries
● Real customer applications
● PoCs and Spark demos

Page 3: Apache Spark Performance Observations


What I will be covering
✔ Best practices for Java/Scala code
✔ Writing code that works well with a JIT compiler
✔ Profiling techniques you can use
✔ How to use RDMA for fast networking
✔ How to use GPUs for fast data processing
✔ How we can use the above to dramatically increase our Spark performance: get results faster
✔ Package for anyone to try

Page 4: Apache Spark Performance Observations


What I won't be covering
● High-level application design decisions
● Avoiding the shuffle: knowing which Spark methods to use
● File systems, operating systems, and file types to use
● Conventional Spark options, e.g. spark.shuffle.*, compression codecs, spark.memory.*, spark.rpc.*, spark.streaming.*, spark.dynamicAllocation.*
● Java options in depth: though a matching -Xms and -Xmx shows good results in Spark 2 (omitted by default in a PR), and we use the Kryo serializer

Page 5: Apache Spark Performance Observations


Tooling we use, all freely available
● Health Center
● TPROF with Visual Performance Analyzer
● GCMV: garbage collection and memory visualizer
● MAT: diagnose and resolve memory leaks
● Linux perf tools
● Jenkins, Slack, Maven, ScalaTest, Eclipse, IntelliJ Community Edition

Page 6: Apache Spark Performance Observations


Profiling Spark with Health Center

-Xhealthcenter:level=headless
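A minimal sketch of one way to attach Health Center to the Spark executors, assuming the executors run on an IBM JDK (spark.executor.extraJavaOptions is a standard Spark setting; the -Xhealthcenter option itself is IBM-specific):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("healthcenter-profiling")   // hypothetical application name
  .set("spark.executor.extraJavaOptions", "-Xhealthcenter:level=headless")

The same property can equally be placed in spark-defaults.conf instead of application code.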

Page 7: Apache Spark Performance Observations


Profiling Java with TPROF

-agentlib:jprof=tprof

Page 8: Apache Spark Performance Observations


Tips for performance in Java and Scala (see the sketch after this list)
● Locals are faster than globals
  The JIT can prove a closed set of storage readers/modifiers; fields and statics are slow, parameters and locals are fast
● Constants are faster than variables
  Constants can be copied inline or across memory caches; Java's final and Scala's val are your friends
● private is faster than public
  private methods can't be dynamically redefined; protected and "package private" are just as slow as public
● Small methods (≤100 bytecodes) are good
  More opportunities to inline them
● Simple is faster than complex
  Easier for the JIT to reason about the effects
● Limit extension points and architectural complexity when practical
  Makes call sites concrete
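A toy Scala sketch of a few of these tips (illustrative only, not from the original deck): constants as val, a local accumulator instead of a per-element field update, and a small private helper the JIT can inline.

final class Accumulator {
  private val Scale = 2.5   // constant: easy to fold and propagate
  private var total = 0.0   // field: touched once per call, not once per element

  def addAll(values: Array[Double]): Unit = {
    var sum = 0.0           // local accumulator
    var i = 0
    while (i < values.length) {
      sum += values(i) * Scale
      i += 1
    }
    total += sum
  }

  private def clamp(x: Double): Double = if (x < 0) 0 else x   // small private method, concrete call site
  def snapshot: Double = clamp(total)
}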

Page 9: Apache Spark Performance Observations


Scala has lots of features; not all of them are fast
● Understand the implementation of Scala language features and use them judiciously
● Reduce uncertainty for the compiler in your coding style: use type ascription, avoid ambiguous polymorphism (see the sketch below)
● Stick to common coding patterns: the JIT is tuned for them, and as new workloads emerge the latest JITs will change too
● Focus on performance hotspots using the profiling tools I mentioned
● Too much emphasis on performance can compromise maintainability!
● Too much emphasis on maintainability can compromise performance!
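A small sketch of type ascription (my illustrative example, not from the deck): state the type you mean so neither the compiler nor the reader has to guess.

def parseIds(line: String): Seq[Int] =            // explicit result type on the public method
  (line.split(",").map(_.trim.toInt): Seq[Int])   // ascription: expose Seq[Int] rather than Array[Int]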

Page 10: Apache Spark Performance Observations


Idiomatic vs imperative Scala

Page 11: Apache Spark Performance Observations


Scala for loops

for (x <- 1 to 10) { println("Value of x: " + x) }

val values = List(1, 2, 3, 4, 5, 6)
for (x <- values) { println("Value of x: " + x) }

var x = 1
while (x <= 10) {
  println("Value of x: " + x)
  x = x + 1
}

val values = List(1, 2, 3, 4, 5, 6)
var x = 0
while (x < values.length) {
  println("Value of x: " + values(x))
  x = x + 1
}

Page 12: Apache Spark Performance Observations


The takeaway is to avoid boxing/unboxing (it involves object allocation): avoid collections of type AnyRef! Know your types

Convert to AnyRef with care
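A small illustrative sketch (assumed, not from the deck) of the difference: the Any version boxes every element, the primitive array does not.

val boxed: List[Any] = List(1, 2, 3, 4, 5)       // each Int becomes a java.lang.Integer object
val unboxed: Array[Int] = Array(1, 2, 3, 4, 5)   // primitive ints, no per-element allocation

def sumUnboxed(xs: Array[Int]): Int = {
  var total = 0
  var i = 0
  while (i < xs.length) { total += xs(i); i += 1 }
  total
}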

Page 13: Apache Spark Performance Observations


Observations with Java options

● Max heap size, initial heap size, and quickstart can make a big difference: for Spark 2 we've noticed that a matching -Xms and -Xmx improves performance on HiBench and SparkSqlPerf
● O*JDK has a method size bytecode limit for the JIT, ours does not; if you do use O*JDK, try toggling -XX:DontCompileHugeMethods (-XX:-DontCompileHugeMethods lifts the limit) if you find certain queries become very slow
● Experiment then profile: spend your time on what's actually used the most, not nitpicking over barely used code paths!
● spark-env.sh for environment variables
● spark-defaults.conf for Spark settings (a sketch of the heap settings follows)
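An illustrative sketch only (the specific values are assumptions, echoing the 25G heap used elsewhere in this deck): match the executor's initial heap to its maximum. spark.executor.memory becomes the executor -Xmx, and -Xms can be passed through extraJavaOptions; the same settings usually live in spark-defaults.conf.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "25g")                // executor maximum heap (-Xmx)
  .set("spark.executor.extraJavaOptions", "-Xms25g")  // matching initial heap
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")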

Page 14: Apache Spark Performance Observations


● The VM searches the JAR, loads and verifies bytecodes into an internal representation, and runs the bytecode form directly
● After many invocations (or via sampling) code gets compiled at 'cold' or 'warm' level
● An internal, low-overhead sampling thread is used to identify frequently used methods
● Methods may get recompiled at 'hot' or 'scorching' levels (for more optimizations)
● Transition to 'scorching' goes through a temporary profiling step

[Diagram: compilation levels, from the interpreter through cold, warm and hot, via profiling, up to scorching]

Java's intermediate bytecodes are compiled as required and based on runtime profiling:
- code is compiled 'just in time' as required
- dynamic compilation can determine the target machine capabilities and app demands

The JIT takes a holistic view of the application, looking for global optimizations based on actual usage patterns and speculative assumptions

Page 15: Apache Spark Performance Observations


export IBM_JAVA_OPTIONS="-Xint" to run without it, see the difference for yourself

● For more JIT goodness check out this talk

What a difference a JIT makes...

Page 16: Apache Spark Performance Observations


Writing JIT-friendly code: guidelines
● Using type ascription
● Avoiding ambiguities
● Preferring val/final and private
● Reducing non-obvious polymorphism
● Avoiding collections of AnyRef
● Avoiding JNI

Page 17: Apache Spark Performance Observations


Can we tune a JDK to work well with Spark?

Performance compared with OpenJDK 8 (1 / geometric mean of HiBench time on zLinux, 32 cores, 25G heap; higher is better):

Workload     IBM JDK8 SR3 (tuned)   IBM JDK8 SR3 (out of the box)
PageRank     160%                   148%
Sleep        187%                   113%
Sort         103%                   147%
WordCount    130%                   146%
Bayes        100%                   91%
Terasort     160%                   131%

Improvements in successive IBM Java 8 releases: 1.35x on HiBench huge, Spark 2.0.1, Linux Power8 12 core * 8-way SMT

Page 18: Apache Spark Performance Observations


Contributing back changes to Spark core

[SPARK-18231]: optimising the SizeEstimator

Hot methods in these classes with PageRank:
● [SPARK-18196]: optimising CompactBuffer
● [SPARK-18197]: optimising AppendOnlyMap
● [SPARK-18224]: optimising PartitionedPairBuffer

Blog post for more details here

Page 19: Apache Spark Performance Observations


Takeaways
● Profile Lots In Pre-production (PLIP); our tools will help
● Not all Java implementations are the same
● Remember to focus on what's hot in the profiles! Make a change, rebuild and reprofile, repeat
● Many ways to achieve the same goal in Scala: use convenient code in most places and simple imperative code for what's critical

Page 20: Apache Spark Performance Observations


We can only get so far writing fast code; I'll talk about RDMA for fast networking and how we can use GPUs for fast processing

Beyond optimum code...

Page 21: Apache Spark Performance Observations


● Feature available in our SDK for Java: Java Sockets over RDMA

● Requires an RDMA-capable network adapter
● Investigating other RDMA implementations so we can avoid marshalling and data (de)serialization costs
● Breaking Sorting World Records with RDMA
● Getting started with RDMA

Remote Direct Memory Access (RDMA)

Page 22: Apache Spark Performance Observations


[Diagram: Spark node #1 and Spark node #2, each with a Spark VM, an on-heap Buffer and an OffHeap Buffer, plus an RDMA NIC/HCA, connected through an Ethernet/InfiniBand switch; DMA between the NICs is zero-copy (Z-Copy), bypassing the OS, with buffer copies (B-Copy) at each end]

Acronyms:
Z-Copy – Zero Copy
B-Copy – Buffer Copy
IB – InfiniBand
Ether – Ethernet
NIC – Network Interface Card
HCA – Host Channel Adapter

● Low-latency, high-throughput networking
● Direct 'application to application' memory pointer exchange between remote hosts
● Off-load network processing to the RDMA NIC/HCA – OS/kernel bypass (zero-copy)
● Introduces new IO characteristics that can influence the Apache Spark transfer plan

Page 23: Apache Spark Performance Observations


[Chart: TCP/IP vs RDMA; RDMA exhibits improved throughput and reduced latency]

Our JVM makes RDMA available transparently via java.net.Socket APIs (JSoR) or explicitly via com.ibm jVerbs calls (see the sketch below)
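JSoR needs no code changes: ordinary java.net.Socket usage like this sketch (hypothetical host and port) can run over RDMA when the IBM JDK is started with the JSoR options on RDMA-capable hardware.

import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import java.net.Socket

val socket = new Socket("spark-worker-1", 9999)   // assumed host and port, for illustration only
val out = new PrintWriter(socket.getOutputStream, true)
val in = new BufferedReader(new InputStreamReader(socket.getInputStream))
out.println("ping")
println(in.readLine())
socket.close()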

Page 24: Apache Spark Performance Observations


Spark HiBench TeraSort [30GB]: elapsed time with 30 GB of data, 32 GB executor
32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node), IBM Java 8

TCP/IP: 556s
JSoR:   159s

Page 25: Apache Spark Performance Observations


TPC-H benchmark, 100 GB: 30% improvement in database operations
HiBench PageRank, 3 GB: 40% faster, lower CPU usage

Shuffle-intensive benchmarks show 30% - 40% better performance with RDMA

32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node)

Page 26: Apache Spark Performance Observations


Fast computation: Graphics Processing Units

Why?
● Faster computation of results, or the ability to process more data in the same amount of time: we want to improve the accuracy of systems and free up CPUs for boring work
● GPUs are becoming available in servers and many modern computers for us to use
● Drivers and SDKs are freely available

Page 27: Apache Spark Performance Observations


How popular is Java?

[Slide shows examples including z13 and BigInsights]

Page 28: Apache Spark Performance Observations


Who's interested in GPUs?
● AlphaGo: 1,202 CPUs, 176 GPUs
● Titan: 18,688 GPUs, 18,688 CPUs
● CERN and Geant: reported to be using GPUs
● Oak Ridge and IBM, "the world's fastest supercomputers by 2017": two, $325m
● Databricks: a recent blog post mentions deep learning with GPUs and Spark

Page 29: Apache Spark Performance Observations


GPUs excel at executing many of the same operations at once (Single Instruction Multiple Data programming)

We'll program using CUDA or OpenCL (similar to C and C++) and write JNI code to access data in our Java world from the GPU

We'll run code on computers that ship with graphics cards; free CUDA drivers are available for x86-64 Windows, Linux, and IBM's Power LE, and OpenCL drivers, SDK and source are also widely available

[Diagram: CPU vs GPU core layout]

Page 30: Apache Spark Performance Observations


Assume we have an integer array in CUDA C called myData

Allocate space on the GPU (device side) using cudaMalloc; this returns a pointer we'll use later. Let's call this variable myDataOnGPU

Copy myData from the host to your allocated space (myDataOnGPU) using cudaMemcpy with cudaMemcpyHostToDevice

Process your data on the GPU in a kernel (we use <<< and >>>)

Copy the result back (what's at myDataOnGPU replaces myData on the host) using cudaMemcpy with cudaMemcpyDeviceToHost

How do we use a GPU?

Page 31: Apache Spark Performance Observations


__global__ void addingKernel(int* array1, int* array2){ array1[threadIdx.x] += array2[threadIdx.x]; }

__global__: it's a function we can call from the host (CPU); it runs on the GPU (device)

How is the data arranged and how can I access it?
Sequentially; a kernel runs on a grid (blocks x threads), which is how we can run many threads that work on different parts of the data

int*? A pointer to integers we've copied to the GPU

threadIdx.x? We use this as an index into our array; remember lots of threads run on the GPU. Access each item for our example using this

Page 32: Apache Spark Performance Observations


● Assume we have an integer array on the Java heap: myData
● Create a native method in Java or Scala
● Write .cpp or .c code with a matching signature for your native method
● In your native code, use JNI to get a pointer to your data
● With this pointer, we can figure out how much memory we need
● Allocate space on the GPU (device side): cudaMalloc, returns myDataOnTheGPU
● Copy myData to your allocated space (myDataOnTheGPU) using cudaMemcpyHostToDevice
● Process your data on the GPU in a kernel (look for <<< and >>>)
● Copy the result back (what's now at myDataOnTheGPU replaces myData on the host) using cudaMemcpyDeviceToHost
● Release the elements (updating your JNI pointer so the data in our JVM heap is now the result)

How would we use a GPU with Java or Scala?

Easier ways?

Page 33: Apache Spark Performance Observations


Making it simple: Java class library modification

Our option: -Dcom.ibm.gpu.enable/enforce/disable

[Chart: sorting throughput for ints (ints sorted per second, roughly 40m to 400m per second) against array lengths from 30,000 to 300,000,000]

Details online here

Page 34: Apache Spark Performance Observations


Making it simple: Java JIT compiler modification

Our option: -Xjit:enableGPU

Use an IntStream and specify our JIT option

Primitive types can be used (byte, char, short, int, float, double, long)

Page 35: Apache Spark Performance Observations


Measured performance improvement with a GPU using four programs, against:
● 1-CPU-thread sequential execution
● 160-CPU-thread parallel execution

Experimental environment used: IBM Java 8 Service Release 2 for PowerPC Little Endian

Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory (160 hardware threads in total) with one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz with 12GB global memory (ECC off)

Performance of GPU enabled lambdas

Page 36: Apache Spark Performance Observations


Name      Summary                                     Data size                  Data type
MM        A dense matrix multiplication: C = A.B      1024 x 1024 (1m) items     double
SpMM      As above, sparse matrix                     500k x 500k (250m) items   double
Jacobi2D  Solve an equation using the Jacobi method   8192 x 8192 (67m) items    double
LifeGame  Conway's Game of Life with 10k iterations   512 x 512 (262k) items     byte

Page 37: Apache Spark Performance Observations


[Chart: GPU execution-time speedup for the four programs, compared with 1 CPU thread (blue) and 160 CPU threads (yellow)]

The higher the bar, the bigger the speedup!

Page 38: Apache Spark Performance Observations


Similar to JCuda but provides a higher level abstraction, production ready and supported by us

● No arbitrary and unrestricted use of Pointer(long)
● Still feels like Java instead of C

Write your kernel and compile it into a fat binary

nvcc --fatbin AdamKernel.cu

Add your Java code

import com.ibm.cuda.*;

import com.ibm.cuda.CudaKernel.*;

Load your fat binary

module = new Loader().loadModule("AdamDoubler.fatbin", device);

Build and run as you would any other Java application

Making it simple: CUDA4J API

Page 39: Apache Spark Performance Observations


Only doubling integers; could be any use case where we're doing the same operation to lots of elements at once

Full code listing at the end. Javadocs: search the IBM Java 8 API for com.ibm.cuda*
Tip: the offsets are byte offsets, so you'll want your index in Java * the size of the object!

module = new Loader().loadModule("AdamDoubler.fatbin", device);
kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
stream = new CudaStream(device);

numElements = 100;
myData = new int[numElements];
Util.fillWithInts(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);

Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);

buffer1.copyTo(myData);

If our dynamically created grid dimensions are too big we need to break down the problem and use the slice* API: doChunkingProblem()

Our kernel compiles into AdamDoubler.fatbin

Page 40: Apache Spark Performance Observations


Improving MLlib

● Recommendation algorithms such as Alternating Least Squares
  ● Movie recommendations on Netflix
  ● Recommended purchases on Amazon
  ● Similar songs with Spotify
● Clustering algorithms such as K-means (unsupervised learning; see the sketch below)
  ● Produce clusters from data to determine which cluster a new item can be categorised as
  ● Identify anomalies: transaction fraud or erroneous data
● Classification algorithms such as logistic regression
  ● Create a model that we can use to predict where to plot the next item in a sequence
  ● Healthcare: predict adverse drug reactions based on known interactions between similar drugs
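A minimal K-means sketch with Spark ML (the data here is made up for illustration, not from the deck):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()
import spark.implicits._

// Hypothetical two-dimensional points; real features would come from your pipeline
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
).map(Tuple1.apply).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(points)
model.clusterCenters.foreach(println)   // the two learned cluster centres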

Page 41: Apache Spark Performance Observations


Improving Alternating Least Squares

● Under-the-covers optimisation: set the spark.mllib.ALS.useGPU property (a sketch follows)
● Full paper: http://arxiv.org/abs/1603.03820
● Full implementation: https://github.com/IBMSparkGPU

Netflix 1.5 GB       12 threads, CPU   64 threads, CPU   GPU
Intel, IBM Java 8    676 seconds       N/A               140 seconds

Currently always sends work to a GPU regardless of size; remember we have limited device memory!

2x Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz, 16 cores in the machine (SMT-2), 256 GB RAM vs 2x Nvidia Tesla K80Ms

Also available for Power LE.
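A sketch of switching the optimisation on, assuming the IBMSparkGPU ALS build with libGPUALS.so on the workers (the property name is from this slide; the boolean-style value is an assumption, and stock Spark ignores it):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("als-gpu-sketch")
  .set("spark.mllib.ALS.useGPU", "true")   // assumed value; checked by the modified computeFactors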

Page 42: Apache Spark Performance Observations


We modified the existing ALS (.scala) implementation's computeFactors method
● Added code to check if spark.mllib.ALS.useGPU is set
● If set, we'll then call our native method written to use JNI (.cpp)
● Our JNI method calls the native CUDA (.cu) method
● CUDA is used to send our data to the GPU, call our kernel, and return the results over JNI back to the Java heap
● Built with our Spark distribution and the shared library is included: libGPUALS.so
● Remember this will require the CUDA runtime (libcudart) and a capable GPU

Call flow: ALS.scala computeFactors -> CuMFJNIInterface.cpp -> ALS.cu -> libGPUALS.so

Page 43: Apache Spark Performance Observations


We can send code to a GPU with APIs or by making substantial changes to existing implementations, but we can also make our changes at a higher level to be more pervasive

Input: user application using DataFrames or Datasets, data stored in Parquet format for now

✔ Spark with Tungsten. Uses UnsafeRow and sun.misc.Unsafe; the idea is to bring Spark closer to the hardware than previously, exploit CPU caches, improve memory and CPU efficiency, reduce GC times, and avoid Java object overheads. Good deep dive here

✔ Spark with Catalyst. Optimiser for the Spark SQL APIs (good deep dive here); transforms a query plan (an abstraction of a user's program) into an optimised version and generates optimised code with the Janino compiler

✔ Spark with our changes: Java and core Spark class optimisations, optimised JIT

Pervasive GPU opportunities for Spark

Page 44: Apache Spark Performance Observations


Output: generated code able to leverage auto-SIMD and GPUs

We want generated code that:
✔ has a counted loop, e.g. one controlled by an automatic induction variable that increases from a lower to an upper bound
✔ accesses data in a linear fashion
✔ has as few branches as possible (simple for the GPU's kernel)
✔ does not have external method calls, or contains only calls that can be easily inlined

These help a JIT to either use auto-SIMD capabilities or GPUs; a sketch of the shape follows
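A sketch (mine, not the generated code itself) of the shape being described: a counted loop, linear access, no branches or outside calls in the body.

def scale(values: Array[Double], factor: Double): Unit = {
  var i = 0                          // automatic induction variable
  while (i < values.length) {        // counted loop, lower bound to upper bound
    values(i) = values(i) * factor   // linear, in-place access, no external calls
    i += 1
  }
}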

Page 45: Apache Spark Performance Observations


Problems
1) Data representation of columnar storage (CachedBatch with Array[Byte]) isn't commonly used
2) Compression schemes are specific to CachedBatch, limited to just several data types
3) Building the in-memory cache involves a long code path -> virtual method calls, conditional branches
4) Generated whole-stage code -> unnecessary conversion from CachedBatch or ColumnarBatch to UnsafeRow

Solutions
1) Use the ColumnarBatch format instead of CachedBatch for the in-memory cache generated by the cache() method. ColumnarBatch and ColumnVector are commonly used data representations for columnar storage

2) Use a common compression scheme (e.g. lz4) for all of the data types in a ColumnVector

3) Generate code at runtime that is simple and specialized for building a concrete instance of the in-memory cache

4) Generate whole-stage code that directly reads data from columnar storage

(1) and (2) increase code reuse, (3) improves runtime performance of executing the cache() method and (4) improves performance of user defined DataFrame and Dataset operations

Page 46: Apache Spark Performance Observations


We propose a new columnar format, CachedColumnarBatch, that has a pointer to ColumnarBatch (used by the Parquet reader) and keeps each column as OnHeapUnsafeColumnVector instead of OnHeapColumnVector. Not yet using GPUs!

● [SPARK-13805], merged into 2.0, performance improvement: 1.2x
  Get data from ColumnVector directly by avoiding a copy from ColumnVector to UnsafeRow when a program reads data in Parquet format
● [SPARK-14098], will be merged into 2.2, performance improvement: 3.4x
  Generate optimized code to build CachedColumnarBatch, get data from a ColumnVector directly by avoiding a copy from the ColumnVector to UnsafeRow, and use lz4 to compress ColumnVector when df.cache() or ds.cache() is executed
● [SPARK-15962], merged into 2.1, performance improvement: 1.7x
  Remove indirection at the offsets field when accessing each element in UnsafeArrayData, reducing the memory footprint of UnsafeArrayData

Page 47: Apache Spark Performance Observations


● [SPARK-16043], performance improvement: 1.2x
  Use a Scala primitive array (e.g. Array[Int]) instead of Array[Any] to avoid boxing operations when putting a primitive array into GenericArrayData
● [SPARK-15985], merged into 2.1, performance improvement: 1.3x
  Eliminate boxing operations to put a primitive array into GenericArrayData when a Dataset program with a primitive array is run
● [SPARK-16213], to be merged into 2.2, performance improvement: 16.6x
  Eliminate boxing operations to put a primitive array into GenericArrayData when a DataFrame program with a primitive array is run
● [SPARK-17490], merged into 2.1, performance improvement: 2.0x
  Eliminate boxing operations to put a primitive array into GenericArrayData when a DataFrame program with a primitive array is used

A tiny example of the kind of program these changes help follows.
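A small sketch (illustrative, not from the deck) of that pattern: a Dataset whose rows carry primitive arrays.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("primitive-array-sketch").getOrCreate()
import spark.implicits._

val ds = spark.createDataset(Seq(Array(1, 2, 3), Array(4, 5, 6)))   // Array[Int] elements
val doubled = ds.map(a => a.map(_ * 2))                             // stays a primitive array end to end
doubled.show()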

Page 48: Apache Spark Performance Observations


● Improving a commonly used API and contributing the code
● Ensuring generated code is in the right format for exploitation
● Making it simple for any Spark user to exploit hardware accelerators, be it GPU or auto-SIMD code for the latest processors
● We know how to build GPU-based applications
● We can figure out if a GPU is available
● We can figure out what code to generate
● We can figure out which GPU to send that code to
● All while retaining Java safety features such as exceptions, bounds checking, serviceability, tracing and profiling hooks
● Assuming you have the hardware, add an option and watch performance improve: this is our goal

What's in it for me?

Page 49: Apache Spark Performance Observations


● We provide an optimised JDK with Spark bundle that includes hardware offloading, profiling, a tuned JIT and is under constant development

● We can talk more about performance aspects not covered here: FPGAs, CAPI flash, an improved serializer, GC optimisations, object layout, monitoring...
● Upcoming blog post at spark.tc outlining the Catalyst-related work
● Look out for more pull requests and involvement from IBM; we want to improve performance for everybody and maintain Spark's status
● Open to ideas and wanting to work in communities for everyone's benefit

http://ibm.biz/spark-kit

Feedback and suggestions welcome: [email protected]

Wrapping it all up...

Page 50: Apache Spark Performance Observations


Backup slides, code listing, legal information and disclaimers beyond this point

Page 51: Apache Spark Performance Observations


CUDA core: part of the GPU, they execute groups of threads

Kernel: a function we'll run on the GPU

Grid: think of it as a CUBE of BLOCKS which lay out THREADS; our GPU functions (KERNELS) run on one of these, we need to know the grid dimensions for each kernel

Threads: these do our computation, much more available than with CPUs

Blocks: groups of threads

Recommended reading: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy

The nvidia-smi command tells you about your GPU's limits

One GPU can have MANY CUDA cores, each CUDA core executes many threads

Page 52: Apache Spark Performance Observations


CUDA grid: why is this important?
To achieve parallelism: a layout of threads we can use to solve our big data problems

Block dimensions? How many threads can run on a block

Grid dimensions? How many blocks we can have

threadIdx.x? (BLOCKS contain THREADS)
Built-in variable to get the current x coordinate of a given THREAD (can have an x, y, z coordinate too)

blockIdx.x? (GRIDS contain BLOCKS)
Built-in variable to get the current x coordinate of a given BLOCK (can have an x, y, z coordinate too)

Grid image is fully credited to http://www.karimson.com/posts/introduction-to-cuda/

Page 53: Apache Spark Performance Observations


For figuring out the dimensions we can use the following Java code; we want 512 threads per block and as many blocks as the problem size requires

int log2BlockDim = 9;

int numBlocks = (numElements + 511) >> log2BlockDim;

int numThreads = 1 << log2BlockDim;

Size        Blocks   Threads
500         1        512
1,024       2        512
32,000      63       512
64,000      125      512
100,000     196      512
512,000     1,000    512
1,024,000   2,000    512

Page 54: Apache Spark Performance Observations

CUDA4J sample, part 1 of 3

import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;

public class Sample {
  private static final boolean PRINT_DATA = false;
  private static int numElements;
  private static int[] myData;
  private static CudaBuffer buffer1;
  private static CudaDevice device = new CudaDevice(0);
  private static CudaModule module;
  private static CudaKernel kernel;
  private static CudaStream stream;

  public static void main(String[] args) {
    try {
      module = new Loader().loadModule("AdamDoubler.fatbin", device);
      kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
      stream = new CudaStream(device);
      doSmallProblem();
      doMediumProblem();
      doChunkingProblem();
    } catch (CudaException e) {
      e.printStackTrace();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  private static void doSmallProblem() throws Exception {
    System.out.println("Doing the small sized problem");
    numElements = 100;
    myData = new int[numElements];
    Util.fillWithInts(myData);
    CudaGrid grid = Util.makeGrid(numElements, stream);
    System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
    buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
    buffer1.copyFrom(myData);
    Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
    kernel.launch(grid, kernelParams);

    int[] originalArrayCopy = new int[myData.length];
    System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
    buffer1.copyTo(myData);
    Util.checkArrayResultsDoubler(myData, originalArrayCopy);
  }

Page 55: Apache Spark Performance Observations

  private static void doMediumProblem() throws Exception {
    System.out.println("Doing the medium sized problem");
    numElements = 5_000_000;
    myData = new int[numElements];
    Util.fillWithInts(myData);
    // This is only when handling more than max blocks * max threads per kernel
    // Grid dim is the number of blocks in the grid
    // Block dim is the number of threads in a block
    // buffer1 is how we'll use our data on the GPU
    buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
    // myData is on CPU, transfer it
    buffer1.copyFrom(myData);
    // Our stream executes the kernel, can launch many streams at once
    CudaGrid grid = Util.makeGrid(numElements, stream);
    System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
    Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
    kernel.launch(grid, kernelParams);

    int[] originalArrayCopy = new int[myData.length];
    System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
    buffer1.copyTo(myData);
    Util.checkArrayResultsDoubler(myData, originalArrayCopy);
  }

CUDA4J sample, part 2 of 3

Page 56: Apache Spark Performance Observations

  private static void doChunkingProblem() throws Exception {
    // I know 5m doesn't require chunking on the GPU but this does
    System.out.println("Doing the too big to handle in one kernel problem");
    numElements = 70_000_000;
    myData = new int[numElements];
    Util.fillWithInts(myData);
    buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
    buffer1.copyFrom(myData);
    CudaGrid grid = Util.makeGrid(numElements, stream);
    System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");

    // Check we can actually launch a kernel with this grid size
    try {
      Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
      kernel.launch(grid, kernelParams);
      int[] originalArrayCopy = new int[numElements];
      System.arraycopy(myData, 0, originalArrayCopy, 0, numElements);
      buffer1.copyTo(myData);
      Util.checkArrayResultsDoubler(myData, originalArrayCopy);
    } catch (CudaException ce) {
      if (ce.getMessage().equals("invalid argument")) {
        System.out.println("it was invalid argument, too big!");
        int maxThreadsPerBlockX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_BLOCK_DIM_X);
        int maxBlocksPerGridX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_GRID_DIM_Y);
        long maxThreadsPerGrid = maxThreadsPerBlockX * maxBlocksPerGridX;
        // 67,107,840 on my Windows box
        System.out.println("Max threads per grid: " + maxThreadsPerGrid);
        long numElementsAtOnce = maxThreadsPerGrid;
        long elementsDone = 0;
        grid = new CudaGrid(maxBlocksPerGridX, maxThreadsPerBlockX, stream);
        System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
        while (elementsDone < numElements) {
          if ((elementsDone + numElementsAtOnce) > numElements) {
            numElementsAtOnce = numElements - elementsDone; // Just do the remainder
          }
          long toOffset = numElementsAtOnce + elementsDone; // It's the byte offset not the element index offset
          CudaBuffer slicedSection = buffer1.slice(elementsDone * Integer.BYTES, toOffset * Integer.BYTES);
          Parameters kernelParams = new Parameters(2).set(0, slicedSection).set(1, numElementsAtOnce);
          kernel.launch(grid, kernelParams);
          elementsDone += numElementsAtOnce;
        }
        int[] originalArrayCopy = new int[myData.length];
        System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
        buffer1.copyTo(myData);
        Util.checkArrayResultsDoubler(myData, originalArrayCopy);
      } else {
        System.out.println(ce.getMessage());
      }
    }
  }

CUDA4J sample, part 3 of 3

Page 57: Apache Spark Performance Observations

CUDA4J kernel

#include <stdint.h>
#include <stdio.h>

/*
 * 2D grid so we can have 1024 threads and many blocks
 * Remember 1 grid -> has blocks/threads and one kernel runs on one grid
 * In CUDA 6.5 we have cudaOccupancyMaxPotentialBlockSize which helps
 *
 * Let's say we have 100 ints to double, keeping it simple
 * Assume we want to run with 256 threads at once
 * For this size our kernel will be set up as follows
 * 1 grid, 1 block, 512 threads
 * blockDim.x is going to be 1
 * threadIdx.x will remain at 0
 * threadIdx.y will range from 0 to 512
 * So we'll go from 1 to 512 and we'll limit access to how many elements we have
 */
extern "C" __global__ void Cuda_cuda4j_AdamDoubler(int* toDouble, int numElements) {
  int index = blockDim.x * threadIdx.x + threadIdx.y;
  if (index < numElements) { // Don't go out of bounds
    toDouble[index] *= 2;    // Just double it
  }
}

extern "C" __global__ void Cuda_cuda4j_AdamDoubler_Strider(int* toDouble, int numElements) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < numElements) { // don't go overboard
    toDouble[i] *= 2;
  }
}

Page 58: Apache Spark Performance Observations

Lambda example, part 1 of 2

import java.util.stream.IntStream;

public class Lambda {

  private static long startTime = 0;

  // -Xjit:enableGPU is our JVM option
  public static void main(String[] args) {
    boolean timeIt = true;
    int numElements = 500_000_000;
    int[] toDouble = new int[numElements];
    Util.fillWithInts(toDouble);
    myDoublerWithALambda(toDouble, timeIt);

    double[] toHalf = new double[numElements];
    Util.fillWithDoubles(toHalf);
    myHalverWithALambda(toHalf, timeIt);

    double[] toRandomFunc = new double[numElements];
    Util.fillWithDoubles(toRandomFunc);
    myRandomFuncWithALambda(toRandomFunc, timeIt);
  }

  private static void myDoublerWithALambda(int[] myArray, boolean timeIt) {
    if (timeIt) startTime = System.currentTimeMillis();
    IntStream.range(0, myArray.length).parallel().forEach(i -> {
      myArray[i] = myArray[i] * 2; // Done on GPU for us
    });
    if (timeIt) {
      System.out.println("Done doubling with a lambda, time taken: "
          + (System.currentTimeMillis() - startTime) + " milliseconds");
    }
  }

Page 59: Apache Spark Performance Observations

  private static void myHalverWithALambda(double[] myArray, boolean timeIt) {
    if (timeIt) startTime = System.currentTimeMillis();
    IntStream.range(0, myArray.length).parallel().forEach(i -> {
      myArray[i] = myArray[i] / 2; // Again on GPU
    });
    if (timeIt) {
      System.out.println("Done halving with a lambda, time taken: "
          + (System.currentTimeMillis() - startTime) + " milliseconds");
    }
  }

  private static void myRandomFuncWithALambda(double[] myArray, boolean timeIt) {
    if (timeIt) startTime = System.currentTimeMillis();
    IntStream.range(0, myArray.length).parallel().forEach(i -> {
      myArray[i] = myArray[i] * 3.142; // Double so we don't lose precision
    });
    if (timeIt) {
      System.out.println("Done with the random func with a lambda, time taken: "
          + (System.currentTimeMillis() - startTime) + " milliseconds");
    }
  }
}

Lambda example, part 2 of 2

Page 60: Apache Spark Performance Observations

Utility methods, part 1 of 2

import com.ibm.cuda.*;

public class Util {

  protected static void fillWithInts(int[] toFill) {
    for (int i = 0; i < toFill.length; i++) {
      toFill[i] = i;
    }
  }

  protected static void fillWithDoubles(double[] toFill) {
    for (int i = 0; i < toFill.length; i++) {
      toFill[i] = i;
    }
  }

  protected static void printArray(int[] toPrint) {
    System.out.println();
    for (int i = 0; i < toPrint.length; i++) {
      if (i == toPrint.length - 1) {
        System.out.print(toPrint[i] + ".");
      } else {
        System.out.print(toPrint[i] + ", ");
      }
    }
    System.out.println();
  }

  protected static CudaGrid makeGrid(int numElements, CudaStream stream) {
    int numThreads = 512;
    int numBlocks = (numElements + (numThreads - 1)) / numThreads;
    return new CudaGrid(numBlocks, numThreads, stream);
  }

Page 61: Apache Spark Performance Observations

  /*
   * Array will have been doubled at this point
   */
  protected static void checkArrayResultsDoubler(int[] toCheck, int[] originalArray) {
    long errorCount = 0; // Check result, data has been copied back here
    if (toCheck.length != originalArray.length) {
      System.err.println("Something's gone horribly wrong, different array length");
    }
    for (int i = 0; i < originalArray.length; i++) {
      if (toCheck[i] != (originalArray[i] * 2)) {
        errorCount++;
        /* System.err.println("Got an error, " + originalArray[i] + " is incorrect: wasn't doubled correctly!"
            + " Got " + toCheck[i] + " but should be " + originalArray[i] * 2); */
      } else {
        // System.out.println("Correct, doubled " + originalArray[i] + " and it became " + toCheck[i]);
      }
    }
    System.err.println("Incorrect results: " + errorCount);
  }
}

Utility methods, part 2 of 2

Page 62: Apache Spark Performance Observations

CUDA4J module loader

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import com.ibm.cuda.CudaDevice;
import com.ibm.cuda.CudaException;
import com.ibm.cuda.CudaModule;

public class Loader {
  private final CudaModule.Cache moduleCache = new CudaModule.Cache();

  CudaModule loadModule(String moduleName, CudaDevice device) throws CudaException, IOException {
    CudaModule module = moduleCache.get(device, moduleName);
    if (module == null) {
      try (InputStream stream = getClass().getResourceAsStream(moduleName)) {
        if (stream == null) {
          throw new FileNotFoundException(moduleName);
        }
        module = new CudaModule(device, stream);
        moduleCache.put(device, moduleName, module);
      }
    }
    return module;
  }
}

Page 63: Apache Spark Performance Observations

CUDA4J build script on Windows

nvcc -fatbin AdamDoubler.cu
"C:\ibm8sr3ga\sdk\bin\java" -version
"C:\ibm8sr3ga\sdk\bin\javac" *.java
"C:\ibm8sr3ga\sdk\bin\java" -Xmx2g Sample
"C:\ibm8sr3ga\sdk\bin\java" -Xmx4g Lambda
"C:\ibm8sr3ga\sdk\bin\java" -Xjit:enableGPU={verbose} -Xmx4g Lambda

Page 64: Apache Spark Performance Observations

Set the PATH to include the CUDA library. For example, set PATH=<CUDA_LIBRARY_PATH>;%PATH%, where the <CUDA_LIBRARY_PATH> variable is the full path to the CUDA library. The <CUDA_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin, which assumes CUDA is installed to the default directory.

Note: If you are using Just-In-Time Compiler (JIT) based GPU support, you must also include paths to the NVIDIA Virtual Machine (NVVM) library, and to the NVIDIA Management Library (NVML). For example, the <CUDA_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin;<NVVM_LIBRARY_PATH>;<NVML_LIBRARY_PATH>.

If the NVVM library is installed to the default directory, the <NVVM_LIBRARY_PATH> variable is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\nvvm\bin. You can find the NVML library in your NVIDIA drivers directory. The default location of this directory is C:\Program Files\NVIDIA Corporation\NVSMI.

From IBM's Java 8 docs

Environment example, see the docs for details

Page 65: Apache Spark Performance Observations

Notices and DisclaimersCopyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS document is distributed "AS IS" without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.

Page 66: Apache Spark Performance Observations

Notices and Disclaimers (con’t)Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™, Global Business Services ®, Global Technology Services ®, Information on Demand, ILOG, LinuxONE™, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Databricks is a registered trademark of Databricks, Inc.

Apache Spark, Apache Cassandra, Apache Hadoop, Apache Maven, Spark, Apache, any other Apache project mentioned here and the Apache product logos including the Spark logo are trademarks of The Apache Software Foundation