cc-4000, characterizing apu performance in hadoopcl on heterogeneous distributed platforms, by max...

CHARACTERIZING APU PERFORMANCE IN HADOOPCL ON HETEROGENEOUS DISTRIBUTED PLATFORMS

MAX GROSSMAN, MAURICIO BRETERNITZ, AND VIVEK SARKAR RICE UNIVERSITY & AMD

2 | PRESENTATION TITLE | NOVEMBER 21, 2013 | CONFIDENTIAL

! Cloud offers elasHcity, lowered startup costs, unified plaQorm for all ! Generally see worse and less predictable performance

‒ Noisy neighbor ! Economics of scale => cloud is here to stay “I don’t care where my code runs, as long as it finishes… someday” – Bob the Cloud User

MOTIVATION


! Hadoop ‒ Java programming language ‒ JDK libraries ‒ Arbitrary data types ‒ Reliability ‒ Simple MapReduce distributed programming model

! AbstracHons built on Hadoop ‒ H2O from 0xdata ‒ Mahout machine learning framework

STATE-‐OF-‐THE-‐ART


1.  Poor computaHonal performance ‒  JVM execuHon, short-‐lived tasks implies poor JIT,

high startup cost for creaHng child processes

2.  Poor I/O performance ‒  SerializaHon, deserializaHon of arbitrary data types

3.  Manual tweaking of intertwined tunables ‒  In an unstable cloud environment, you never have

it right

4.  Scheduling execuHon & communicaHon with a holisHc view of the plaQorm

" A small sampling of Hadoop tunables…

PROBLEMS


! OpenCL ‒ SIMD programming model ‒ MulH-‐architecture and mulH-‐vendor support ‒ APIs for launching compute and copy tasks

! An expert programmer could: 1.  Translate all applicaHon code to OpenCL kernels 2.  Compile OpenCL kernels, API calls into naHve library 3.  Call naHve library from Java via JNI 4.  Spend a lot of Hme debugging performance and

correctness

! SHll not good enough!

Host

Device

clEnqueueNDRange()

Host ApplicaHon

A POTENTIAL SOLUTION


Hadoop

Reliability Distributed PlaQorm

OpenCL

MulH-‐architecture execuHon in naHve threads

APARAPI

bytecode to OpenCL kernels ! Hardware aware plaQorm manager

! Machine-‐learning, mulH-‐device scheduler based on device occupancy and past kernel performance

! Architecture aware opHmizing compiler ! Hadoop-‐like API


HADOOPCL ARCHITECTURE

class PiMapper extends DoubleDoubleBoolIntHadoopCLMapper { public void map(double x, double y) { if(x * x + y * y > 0.25) { write(false, 1); } else { write(true, 1); } } } job.waitForCompletion(true);

javac .class

!  HadoopCL programming model supports ‒  Java syntax ‒ MapReduce abstracHons ‒ Dynamic memory allocaHon ‒ Variety of data types (primiHves, sparse vectors, tuples, etc) and can be extended to more

‒ Constant globals accessible from anywhere

!  HadoopCL does not support ‒ Arbitrary inputs, outputs ‒ Massive data elements (i.e. sparse vectors larger than device memory)

‒ Object references



$ hadoop jar Pi.jar input output

NameNode + JobTracker

DataNode

Hadoop DataNode

Task Map or Reduce

DataNode

TaskTracker

HadoopCL Child

HadoopCL Child

HadoopCL Child

HadoopCL Child

HadoopCL ML Device Scheduler


HadoopCL Child


Task Map or Reduce

Input Collector

Input Buffer Queue

Kernel Executor

Output Buffer Queue

Launch

Retry

Input Buffer

Output

Input Buffer

Manager

Output Buffer

Manager

Release

!  Each Child JVM encloses a data-‐driven pipeline of communicaHon and computaHon tasks ‒ Data is buffered in chunks for processing on the OpenCL device

!  HadoopCL explicitly manages buffers to prevent large GC overheads

!  Kernel Executor handles ‒ Auto-‐generaHon and opHmizaHon of OpenCL kernels from JVM bytecode

‒ Transfer of inputs, outputs to device ‒ Asynchronous launch of OpenCL kernels

OpenCL Device


TOPICS IN HADOOPCL

! Extending APARAPI with architecture-‐ and data-‐aware compiler opHmizaHons 1.  A number of HadoopCL-‐specific funcHons are auto-‐generated from APARAPI at runHme 2.  When GPU execuHon is detected and a vector data-‐type is in use, the HadoopCL runHme

auto-‐strides input vectors before copying to the device ‒  APARAPI must emit strided code to match data layout, fails in certain cases

double MahoutKMeansMapper__dot(...){ double agg = 0.0; for (int i = 0; i < length1; i++){ int currentIndex = index1[(i) * this-‐>nPairs]; int j = 0; for (; j<length2 && currentIndex!=index2[j]; j++) ; if (j != length2) agg = agg + (val1[(i) * this-‐>nPairs] * val2[j]); } return(agg); }

double MahoutKMeansMapper__dot(...){ double agg = 0.0; for (int i = 0; i < length1; i++){ int currentIndex = index1[i]; int j = 0; for (; j<length2 && currentIndex!=index2[j]; j++) ; if (j!=length2) agg = agg + (val1[i] * val2[j]); } return(agg); }


TOPICS IN HADOOPCL

! Enabling OpenCL dynamic memory allocaHon through restart-‐able kernels ‒ Note: there are no side effects of mappers or reducers unHl they commit (i.e. write())

OpenCL Device

Heap

free

nWrites

nInputs

public void map(int key, double val) { int[] outputVec = new int[10]; ... write(key, outputVec); } Mapper.java

writeOffsetLookup

__kernel void map(int key, double val) { int oldOffset = atomic_add(free, 10); if (oldOffset + 10 >= heapSize) { nWrites[inputIndex] = -‐1; return; } ... writeOffsetLookup[inputIndex] = oldOffset; nWrites[inputIndex] = nWrites[inputIndex] + 1; } Mapper.cl


TOPICS IN HADOOPCL

! Auto-‐scheduling OpenCL kernels across execuHon plaQorms through machine learning ‒ HadoopCL TaskTracker is responsible for

1.  Assigning each Task an execuHon plaQorm (GPU, CPU, or JVM) 2.  Recording execuHon Hme for each task along with the kernel executed and average device

occupancy during that task’s execuHon

!  Device assignment is based on programmer hints and/or recorded data from previous runs

‒  Data is recorded in files to be used across Jobs


! Mahout Kmeans ‒ Mahout provides Hadoop MapReduce implementaHons of a variety of ML algorithms

‒ KMeans iteraHvely searches for K clusters

! HadoopCL KMeans port ‒ Mapper is trivial, for each point iterates through all clusters and outputs the closest

‒ Reducer is more complex ‒ Both OpenCL and Java versions implemented, as HadoopCL allows the programmer to force JVM execuHon

EVALUATION


! Evaluated on a 10-‐node AMD APU cluster !  Two datasets with varying parameters tested

‒ Wiki data set ‒ ASF e-‐mail archives data set ‒ Varied K, the number of clusters ‒ Varied the type of pruning done on the input data (prune all but the N most frequent tokens vs. prune each vector to be at most length M)

‒ Varied the amount of pruning done (i.e. the values of N and M)

‒ Enable and disable HadoopCL features to observe impact on performance

EVALUATION


! Graphs here

EVALUATION


! HadoopCL offers the flexibility, reliability, and programmability of Hadoop accelerated by naHve, heterogeneous OpenCL threads

! Using HadoopCL is a tradeoff: lose parts of the Java language but gain improved performance

! EvaluaHon of KMeans with real-‐world data sets shows that HadoopCL is flexible and efficient enough to improve performance of real-‐world applicaHons

! Future work to target HSA instead of OpenCL

Max Grossman, [email protected]

CONCLUSION


DISCLAIMER & ATTRIBUTION

The informaHon presented in this document is for informaHonal purposes only and may contain technical inaccuracies, omissions and typographical errors.

The informaHon contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, souware changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligaHon to update or otherwise correct or revise this informaHon. However, AMD reserves the right to revise this informaHon and to make changes from Hme to Hme to the content hereof without obligaHon of AMD to noHfy any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinaHons thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdicHons. SPEC is a registered trademark of the Standard Performance EvaluaHon CorporaHon (SPEC). Other names are for informaHonal purposes only and may be trademarks of their respecHve owners.


SAMPLE SHAPES

cc-4000, characterizing apu performance in hadoopcl on heterogeneous distributed platforms, by max...

Technology