CHARACTERIZING APU PERFORMANCE IN HADOOPCL ON HETEROGENEOUS DISTRIBUTED PLATFORMS
MAX GROSSMAN, MAURICIO BRETERNITZ, AND VIVEK SARKAR
RICE UNIVERSITY & AMD
2 | PRESENTATION TITLE | NOVEMBER 21, 2013 | CONFIDENTIAL
MOTIVATION
! Cloud offers elasticity, lowered startup costs, and a unified platform for all
! Generally see worse and less predictable performance
‒ Noisy neighbor
! Economies of scale => the cloud is here to stay

"I don't care where my code runs, as long as it finishes… someday" – Bob the Cloud User
STATE-OF-THE-ART
! Hadoop
‒ Java programming language
‒ JDK libraries
‒ Arbitrary data types
‒ Reliability
‒ Simple MapReduce distributed programming model
! Abstractions built on Hadoop
‒ H2O from 0xdata
‒ Mahout machine learning framework
PROBLEMS
1. Poor computational performance
‒ JVM execution; short-lived tasks imply poor JIT and high startup cost for creating child processes
2. Poor I/O performance
‒ Serialization and deserialization of arbitrary data types
3. Manual tweaking of intertwined tunables
‒ In an unstable cloud environment, you never have it right
4. Scheduling execution & communication with a holistic view of the platform

[Figure: a small sampling of Hadoop tunables…]
A POTENTIAL SOLUTION
! OpenCL
‒ SIMD programming model
‒ Multi-architecture and multi-vendor support
‒ APIs for launching compute and copy tasks
! An expert programmer could:
1. Translate all application code to OpenCL kernels
2. Compile the OpenCL kernels and API calls into a native library
3. Call the native library from Java via JNI
4. Spend a lot of time debugging performance and correctness
! Still not good enough!

[Diagram: a Host Application on the Host calls clEnqueueNDRange() to launch work on the Device]
HADOOPCL ARCHITECTURE
[Diagram: HadoopCL layers: Hadoop (reliability, distributed platform); OpenCL (multi-architecture execution in native threads); APARAPI (bytecode to OpenCL kernels)]
! Hardware-aware platform manager
! Machine-learning, multi-device scheduler based on device occupancy and past kernel performance
! Architecture-aware optimizing compiler
! Hadoop-like API
HADOOPCL ARCHITECTURE
class PiMapper extends DoubleDoubleBoolIntHadoopCLMapper {
    public void map(double x, double y) {
        if (x * x + y * y > 0.25) {
            write(false, 1);
        } else {
            write(true, 1);
        }
    }
}
...
job.waitForCompletion(true);

[Diagram: javac compiles the mapper source to .class bytecode]
! The HadoopCL programming model supports
‒ Java syntax
‒ MapReduce abstractions
‒ Dynamic memory allocation
‒ A variety of data types (primitives, sparse vectors, tuples, etc.), and can be extended to more
‒ Constant globals accessible from anywhere
! HadoopCL does not support
‒ Arbitrary inputs and outputs
‒ Massive data elements (i.e. sparse vectors larger than device memory)
‒ Object references
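The write()-based commit model above can be sketched in plain Java. The stub base class below is a hypothetical stand-in for the generated DoubleDoubleBoolIntHadoopCLMapper (the real superclass is produced by HadoopCL, so the stub's fields and method names are our own):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated HadoopCL mapper base class:
// it only collects the (key, value) pairs passed to write().
abstract class BoolIntMapperStub {
    final List<Boolean> keys = new ArrayList<>();
    final List<Integer> vals = new ArrayList<>();

    protected void write(boolean key, int val) {
        keys.add(key);
        vals.add(val);
    }

    public abstract void map(double x, double y);
}

// The Pi-estimation mapper from the slide: classify each sample point as
// inside (true) or outside (false) the quarter circle of radius 0.5.
public class PiMapperSketch extends BoolIntMapperStub {
    @Override
    public void map(double x, double y) {
        if (x * x + y * y > 0.25) {
            write(false, 1);
        } else {
            write(true, 1);
        }
    }

    public static void main(String[] args) {
        PiMapperSketch m = new PiMapperSketch();
        m.map(0.1, 0.1); // inside:  0.02 <= 0.25
        m.map(0.5, 0.5); // outside: 0.5  >  0.25
        System.out.println(m.keys); // [true, false]
    }
}
```

Because outputs only accumulate through write(), the framework is free to batch, reorder, or re-execute map calls, which the restartable-kernel mechanism later relies on.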
HADOOPCL ARCHITECTURE
$ hadoop jar Pi.jar input output

[Diagram: the job is submitted to the NameNode + JobTracker, which assigns Map and Reduce Tasks to the TaskTracker on each DataNode; each TaskTracker hands Tasks to multiple HadoopCL Child processes through the HadoopCL ML Device Scheduler]
HADOOPCL ARCHITECTURE
[Diagram: inside a HadoopCL Child, a Map or Reduce Task feeds an Input Collector, which fills buffers from the Input Buffer Manager into an Input Buffer Queue; the Kernel Executor launches (and retries) kernels over those buffers on the OpenCL Device, placing results in an Output Buffer Queue; the Output Buffer Manager releases buffers after output]
! Each Child JVM encloses a data-driven pipeline of communication and computation tasks
‒ Data is buffered in chunks for processing on the OpenCL device
! HadoopCL explicitly manages buffers to prevent large GC overheads
! The Kernel Executor handles
‒ Auto-generation and optimization of OpenCL kernels from JVM bytecode
‒ Transfer of inputs and outputs to and from the device
‒ Asynchronous launch of OpenCL kernels
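The buffer-managed pipeline can be illustrated with a plain-Java sketch: a fixed pool of reusable buffers circulates between a collector and an executor thread, so no per-chunk garbage is created. Pool size, chunk size, and the squaring "kernel" are invented here; the real executor drives an OpenCL device.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BufferPipelineSketch {
    static final int POOL_SIZE = 2, CHUNK = 4, N_CHUNKS = 3;

    static double sum(double[] a) {
        double s = 0;
        for (double v : a) s += v;
        return s;
    }

    public static void main(String[] args) throws InterruptedException {
        // Free buffers flow back from the executor; ready buffers flow toward it.
        BlockingQueue<double[]> free = new ArrayBlockingQueue<>(POOL_SIZE);
        BlockingQueue<double[]> ready = new ArrayBlockingQueue<>(POOL_SIZE);
        for (int i = 0; i < POOL_SIZE; i++) free.put(new double[CHUNK]);

        // Executor: "launch" a kernel (here: square each element), then release.
        Thread executor = new Thread(() -> {
            try {
                for (int c = 0; c < N_CHUNKS; c++) {
                    double[] buf = ready.take();              // launch on next chunk
                    for (int i = 0; i < CHUNK; i++) buf[i] *= buf[i];
                    System.out.println("chunk sum = " + sum(buf));
                    free.put(buf);                            // release buffer
                }
            } catch (InterruptedException ignored) { }
        });
        executor.start();

        // Collector: fill buffers in chunks, blocking when the pool is empty.
        for (int c = 0; c < N_CHUNKS; c++) {
            double[] buf = free.take();
            for (int i = 0; i < CHUNK; i++) buf[i] = c + 1;
            ready.put(buf);
        }
        executor.join();
    }
}
```

The bounded queues give the same back-pressure the slide describes: when the device falls behind, the collector blocks instead of allocating new buffers, keeping GC pressure flat.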
TOPICS IN HADOOPCL
! Extending APARAPI with architecture- and data-aware compiler optimizations
1. A number of HadoopCL-specific functions are auto-generated from APARAPI at runtime
2. When GPU execution is detected and a vector data type is in use, the HadoopCL runtime auto-strides input vectors before copying them to the device
‒ APARAPI must emit strided code to match the data layout; this fails in certain cases
// Strided version (GPU data layout):
double MahoutKMeansMapper__dot(...) {
    double agg = 0.0;
    for (int i = 0; i < length1; i++) {
        int currentIndex = index1[(i) * this->nPairs];
        int j = 0;
        for (; j < length2 && currentIndex != index2[j]; j++) ;
        if (j != length2)
            agg = agg + (val1[(i) * this->nPairs] * val2[j]);
    }
    return agg;
}
// Unstrided version (original data layout):
double MahoutKMeansMapper__dot(...) {
    double agg = 0.0;
    for (int i = 0; i < length1; i++) {
        int currentIndex = index1[i];
        int j = 0;
        for (; j < length2 && currentIndex != index2[j]; j++) ;
        if (j != length2)
            agg = agg + (val1[i] * val2[j]);
    }
    return agg;
}
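The layout change behind the two kernel versions can be sketched as a simple interleaving transform. This is our own illustrative code, assuming N equal-length vectors; the real runtime must also cope with unequal lengths (by padding, or by falling back to unstrided code in the cases where strided code generation fails):

```java
import java.util.Arrays;

// Sketch of auto-striding: interleave N vectors so that element i of vector p
// lands at index i * N + p. Consecutive GPU work-items (one per vector) then
// read consecutive memory words, matching the index1[(i) * nPairs] access
// pattern in the strided kernel above.
public class StrideSketch {
    static int[] stride(int[][] vectors) {
        int n = vectors.length;      // plays the role of nPairs
        int len = vectors[0].length;
        int[] out = new int[n * len];
        for (int p = 0; p < n; p++) {
            for (int i = 0; i < len; i++) {
                out[i * n + p] = vectors[p][i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] strided = stride(new int[][]{{10, 11, 12}, {20, 21, 22}});
        System.out.println(Arrays.toString(strided)); // [10, 20, 11, 21, 12, 22]
    }
}
```

On a GPU this turns N scattered single-word loads per step into one coalesced access, which is the point of the transform.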
TOPICS IN HADOOPCL
! Enabling OpenCL dynamic memory allocation through restartable kernels
‒ Note: mappers and reducers have no side effects until they commit (i.e. write())
[Diagram: the OpenCL Device holds a Heap with a free pointer, plus per-input nWrites, nInputs, and writeOffsetLookup arrays]
// Mapper.java
public void map(int key, double val) {
    int[] outputVec = new int[10];
    ...
    write(key, outputVec);
}
// Mapper.cl
__kernel void map(int key, double val) {
    int oldOffset = atomic_add(free, 10);
    if (oldOffset + 10 >= heapSize) {
        nWrites[inputIndex] = -1;
        return;
    }
    ...
    writeOffsetLookup[inputIndex] = oldOffset;
    nWrites[inputIndex] = nWrites[inputIndex] + 1;
}
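Host-side handling of the -1 marker can be simulated in plain Java. This is a sketch only: the bump-pointer "kernel", heap sizes, and retry policy are illustrative, and a real run would copy committed outputs off the device before reusing the heap for the retry pass. The restart is safe precisely because a mapper has no side effects until its writes commit.

```java
import java.util.Arrays;

public class RestartableKernelSketch {
    static int free = 0;       // bump pointer into the device heap
    static int heapSize = 25;
    static int[] nWrites;      // -1 marks "heap exhausted, retry me"

    // Each input "allocates" 10 ints; mark -1 and bail if the heap is full.
    static void kernel(int inputIndex) {
        int oldOffset = free;
        free += 10;
        if (oldOffset + 10 >= heapSize) {
            nWrites[inputIndex] = -1;
            return;
        }
        nWrites[inputIndex] = 1;  // commit one output
    }

    public static void main(String[] args) {
        int nInputs = 4;
        nWrites = new int[nInputs];
        for (int i = 0; i < nInputs; i++) kernel(i);
        System.out.println("first pass:  " + Arrays.toString(nWrites)); // [1, 1, -1, -1]

        // Restart: reset (or enlarge) the heap and retry only the failed inputs.
        free = 0;
        heapSize = 50;
        for (int i = 0; i < nInputs; i++) {
            if (nWrites[i] == -1) kernel(i);
        }
        System.out.println("after retry: " + Arrays.toString(nWrites)); // [1, 1, 1, 1]
    }
}
```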
TOPICS IN HADOOPCL
! Auto-scheduling OpenCL kernels across execution platforms through machine learning
‒ The HadoopCL TaskTracker is responsible for:
1. Assigning each Task an execution platform (GPU, CPU, or JVM)
2. Recording the execution time of each task, along with the kernel executed and the average device occupancy during that task's execution
! Device assignment is based on programmer hints and/or recorded data from previous runs
‒ Data is recorded in files to be used across Jobs
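A minimal version of history-based device assignment might look like the sketch below. The names, the mean-time heuristic, and the fallback-to-hint rule are our own simplifications; the actual scheduler also weighs device occupancy and persists its records to files.

```java
import java.util.HashMap;
import java.util.Map;

public class DeviceSchedulerSketch {
    enum Device { GPU, CPU, JVM }

    // kernel name -> device -> {total seconds, sample count}
    final Map<String, Map<Device, double[]>> history = new HashMap<>();

    // Record one observed execution time for (kernel, device).
    void record(String kernel, Device d, double seconds) {
        double[] acc = history
            .computeIfAbsent(kernel, k -> new HashMap<>())
            .computeIfAbsent(d, k -> new double[2]);
        acc[0] += seconds;
        acc[1] += 1;
    }

    // Pick the device with the lowest mean recorded time for this kernel,
    // falling back to the programmer's hint when no history exists.
    Device pick(String kernel, Device hint) {
        Map<Device, double[]> byDevice = history.get(kernel);
        if (byDevice == null) return hint;
        Device best = hint;
        double bestMean = Double.MAX_VALUE;
        for (Map.Entry<Device, double[]> e : byDevice.entrySet()) {
            double mean = e.getValue()[0] / e.getValue()[1];
            if (mean < bestMean) {
                bestMean = mean;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        DeviceSchedulerSketch s = new DeviceSchedulerSketch();
        s.record("PiMapper", Device.JVM, 4.0);
        s.record("PiMapper", Device.GPU, 1.5);
        System.out.println(s.pick("PiMapper", Device.JVM)); // history favors GPU
        System.out.println(s.pick("KMeans", Device.CPU));   // no data: use hint
    }
}
```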
EVALUATION
! Mahout KMeans
‒ Mahout provides Hadoop MapReduce implementations of a variety of ML algorithms
‒ KMeans iteratively searches for K clusters
! HadoopCL KMeans port
‒ The mapper is trivial: for each point, it iterates through all clusters and outputs the closest
‒ The reducer is more complex
‒ Both OpenCL and Java versions were implemented, as HadoopCL allows the programmer to force JVM execution
EVALUATION
! Evaluated on a 10-node AMD APU cluster
! Two datasets tested with varying parameters
‒ Wiki data set
‒ ASF e-mail archives data set
‒ Varied K, the number of clusters
‒ Varied the type of pruning done on the input data (prune all but the N most frequent tokens vs. prune each vector to be at most length M)
‒ Varied the amount of pruning done (i.e. the values of N and M)
‒ Enabled and disabled HadoopCL features to observe their impact on performance
EVALUATION
[Graphs here]
CONCLUSION
! HadoopCL offers the flexibility, reliability, and programmability of Hadoop, accelerated by native, heterogeneous OpenCL threads
! Using HadoopCL is a tradeoff: you lose parts of the Java language but gain improved performance
! Evaluation of KMeans with real-world data sets shows that HadoopCL is flexible and efficient enough to improve the performance of real-world applications
! Future work will target HSA instead of OpenCL

Max Grossman, [email protected]
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.