cc-4000, characterizing apu performance in hadoopcl on heterogeneous distributed platforms, by max...

21
CHARACTERIZING APU PERFORMANCE IN HADOOPCL ON HETEROGENEOUS DISTRIBUTED PLATFORMS MAX GROSSMAN, MAURICIO BRETERNITZ, AND VIVEK SARKAR RICE UNIVERSITY & AMD

Upload: amd-developer-central

Post on 27-Jan-2015

106 views

Category:

Technology


3 download

DESCRIPTION

Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.

TRANSCRIPT

Page 1: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

CHARACTERIZING  APU  PERFORMANCE  IN  HADOOPCL  ON  HETEROGENEOUS  DISTRIBUTED  PLATFORMS  

MAX  GROSSMAN,  MAURICIO  BRETERNITZ,  AND  VIVEK  SARKAR  RICE  UNIVERSITY  &  AMD  

Page 2: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

2   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

! Cloud  offers  elasHcity,  lowered  startup  costs,  unified  plaQorm  for  all  ! Generally  see  worse  and  less  predictable  performance  

‒ Noisy  neighbor  ! Economics  of  scale  =>  cloud  is  here  to  stay    “I  don’t  care  where  my  code  runs,  as  long  as  it  finishes…  someday”  –  Bob  the  Cloud  User  

MOTIVATION  

Page 3: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

3   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

! Hadoop  ‒ Java  programming  language  ‒ JDK  libraries  ‒ Arbitrary  data  types  ‒ Reliability  ‒ Simple  MapReduce  distributed  programming  model  

! AbstracHons  built  on  Hadoop  ‒ H2O  from  0xdata  ‒ Mahout  machine  learning  framework  

STATE-­‐OF-­‐THE-­‐ART  

Page 4: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

4   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

1.  Poor  computaHonal  performance  ‒  JVM  execuHon,  short-­‐lived  tasks  implies  poor  JIT,  

high  startup  cost  for  creaHng  child  processes  

2.  Poor  I/O  performance  ‒  SerializaHon,  deserializaHon  of  arbitrary  data  types  

3.  Manual  tweaking  of  intertwined  tunables  ‒  In  an  unstable  cloud  environment,  you  never  have  

it  right  

4.  Scheduling  execuHon  &  communicaHon  with  a  holisHc  view  of  the  plaQorm  

" A  small  sampling  of  Hadoop  tunables…  

PROBLEMS  

Page 5: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

5   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

! OpenCL  ‒ SIMD  programming  model  ‒ MulH-­‐architecture  and  mulH-­‐vendor  support  ‒ APIs  for  launching  compute  and  copy  tasks  

! An  expert  programmer  could:  1.  Translate  all  applicaHon  code  to  OpenCL  kernels  2.  Compile  OpenCL  kernels,  API  calls  into  naHve  library  3.  Call  naHve  library  from  Java  via  JNI  4.  Spend  a  lot  of  Hme  debugging  performance  and  

correctness  

! SHll  not  good  enough!  

Host  

Device  

clEnqueueNDRange()  

Host  ApplicaHon  

A  POTENTIAL  SOLUTION  

Page 6: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

6   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

Hadoop    

Reliability  Distributed  PlaQorm  

OpenCL    

MulH-­‐architecture  execuHon  in  naHve  threads  

   

APARAPI    

bytecode  to  OpenCL  kernels   ! Hardware  aware  plaQorm  manager  

! Machine-­‐learning,  mulH-­‐device  scheduler  based  on  device  occupancy  and  past  kernel  performance  

! Architecture  aware  opHmizing  compiler  ! Hadoop-­‐like  API  

Page 7: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

7   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

HADOOPCL  ARCHITECTURE  

 class  PiMapper  extends          DoubleDoubleBoolIntHadoopCLMapper  {        public  void  map(double  x,                double  y)  {          if(x  *  x  +  y  *  y  >  0.25)  {              write(false,  1);          }  else  {              write(true,  1);          }      }  }    job.waitForCompletion(true);  

javac   .class  

!  HadoopCL  programming  model  supports  ‒  Java  syntax  ‒ MapReduce  abstracHons  ‒ Dynamic  memory  allocaHon  ‒ Variety  of  data  types  (primiHves,  sparse  vectors,  tuples,  etc)  and  can  be  extended  to  more  

‒ Constant  globals  accessible  from  anywhere  

!  HadoopCL  does  not  support  ‒ Arbitrary  inputs,  outputs  ‒ Massive  data  elements  (i.e.  sparse  vectors  larger  than  device  memory)  

‒ Object  references  

Page 8: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

8   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

HADOOPCL  ARCHITECTURE  

$  hadoop  jar  Pi.jar  input  output  

NameNode  +  JobTracker  

DataNode  

Hadoop  DataNode  

Task  Map  or  Reduce  

DataNode  

TaskTracker  

HadoopCL  Child  

HadoopCL  Child  

HadoopCL  Child  

HadoopCL  Child  

HadoopCL  ML  Device  Scheduler  

Page 9: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

9   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

HadoopCL  Child  

HADOOPCL  ARCHITECTURE  

Task  Map  or  Reduce  

Input  Collector  

Input  Buffer  Queue  

Kernel  Executor  

Output  Buffer  Queue    

Launch  

Retry  

Input  Buffer  

Output  

Input  Buffer  

Manager  

Output  Buffer  

Manager  

Release  

!  Each  Child  JVM  encloses  a  data-­‐driven  pipeline  of  communicaHon  and  computaHon  tasks  ‒ Data  is  buffered  in  chunks  for  processing  on  the  OpenCL  device  

!  HadoopCL  explicitly  manages  buffers  to  prevent  large  GC  overheads  

!  Kernel  Executor  handles  ‒ Auto-­‐generaHon  and  opHmizaHon  of  OpenCL  kernels  from  JVM  bytecode  

‒ Transfer  of  inputs,  outputs  to  device  ‒ Asynchronous  launch  of  OpenCL  kernels  

OpenCL  Device  

Page 10: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

10   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

TOPICS  IN  HADOOPCL  

! Extending  APARAPI  with  architecture-­‐  and  data-­‐aware  compiler  opHmizaHons  1.  A  number  of  HadoopCL-­‐specific  funcHons  are  auto-­‐generated  from  APARAPI  at  runHme  2.  When  GPU  execuHon  is  detected  and  a  vector  data-­‐type  is  in  use,  the  HadoopCL  runHme  

auto-­‐strides  input  vectors  before  copying  to  the  device  ‒  APARAPI  must  emit  strided  code  to  match  data  layout,  fails  in  certain  cases  

double  MahoutKMeansMapper__dot(...){      double  agg  =  0.0;      for  (int  i  =  0;  i  <  length1;  i++){          int  currentIndex  =  index1[(i)  *  this-­‐>nPairs];          int  j  =  0;          for  (;  j<length2  &&  currentIndex!=index2[j];  j++)  ;          if  (j  !=  length2)              agg  =  agg  +  (val1[(i)  *  this-­‐>nPairs]  *  val2[j]);      }      return(agg);  }  

double  MahoutKMeansMapper__dot(...){      double  agg  =  0.0;      for  (int  i  =  0;  i  <  length1;  i++){          int  currentIndex  =  index1[i];          int  j  =  0;          for  (;  j<length2  &&  currentIndex!=index2[j];  j++)  ;          if  (j!=length2)              agg  =  agg  +  (val1[i]  *  val2[j]);      }      return(agg);  }  

Page 11: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

11   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

TOPICS  IN  HADOOPCL  

! Enabling  OpenCL  dynamic  memory  allocaHon  through  restart-­‐able  kernels  ‒ Note:  there  are  no  side  effects  of  mappers  or  reducers  unHl  they  commit  (i.e.  write())  

OpenCL  Device  

Heap  

free  

nWrites  

nInputs  

public  void  map(int  key,  double  val)  {      int[]  outputVec  =  new  int[10];      ...      write(key,  outputVec);  }                  Mapper.java  

writeOffsetLookup  

__kernel  void  map(int  key,  double  val)  {      int  oldOffset  =  atomic_add(free,  10);      if  (oldOffset  +  10  >=  heapSize)  {          nWrites[inputIndex]  =  -­‐1;          return;      }      ...      writeOffsetLookup[inputIndex]  =  oldOffset;      nWrites[inputIndex]  =  nWrites[inputIndex]  +  1;  }                      Mapper.cl  

Page 12: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

12   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

TOPICS  IN  HADOOPCL  

! Auto-­‐scheduling  OpenCL  kernels  across  execuHon  plaQorms  through  machine  learning  ‒ HadoopCL  TaskTracker  is  responsible  for  

1.  Assigning  each  Task  an  execuHon  plaQorm  (GPU,  CPU,  or  JVM)  2.  Recording  execuHon  Hme  for  each  task  along  with  the  kernel  executed  and  average  device  

occupancy  during  that  task’s  execuHon  

!  Device  assignment  is  based  on  programmer  hints  and/or  recorded  data  from  previous  runs  

‒  Data  is  recorded  in  files  to  be  used  across  Jobs  

Page 13: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

13   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

! Mahout  Kmeans  ‒ Mahout  provides  Hadoop  MapReduce  implementaHons  of  a  variety  of  ML  algorithms  

‒ KMeans  iteraHvely  searches  for  K  clusters  

! HadoopCL  KMeans  port  ‒ Mapper  is  trivial,  for  each  point  iterates  through  all  clusters  and  outputs  the  closest  

‒ Reducer  is  more  complex  ‒ Both  OpenCL  and  Java  versions  implemented,  as  HadoopCL  allows  the  programmer  to  force  JVM  execuHon  

EVALUATION  

Page 14: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

14   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

! Evaluated  on  a  10-­‐node  AMD  APU  cluster  !  Two  datasets  with  varying  parameters  tested  

‒ Wiki  data  set  ‒ ASF  e-­‐mail  archives  data  set  ‒ Varied  K,  the  number  of  clusters  ‒ Varied  the  type  of  pruning  done  on  the  input  data  (prune  all  but  the  N  most  frequent  tokens  vs.  prune  each  vector  to  be  at  most  length  M)  

‒ Varied  the  amount  of  pruning  done  (i.e.  the  values  of  N  and  M)  

‒ Enable  and  disable  HadoopCL  features  to  observe  impact  on  performance  

EVALUATION  

Page 15: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

15   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

! Graphs  here  

EVALUATION  

Page 16: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

16   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

! HadoopCL  offers  the  flexibility,  reliability,  and  programmability  of  Hadoop  accelerated  by  naHve,  heterogeneous  OpenCL  threads  

! Using  HadoopCL  is  a  tradeoff:  lose  parts  of  the  Java  language  but  gain  improved  performance  

! EvaluaHon  of  KMeans  with  real-­‐world  data  sets  shows  that  HadoopCL  is  flexible  and  efficient  enough  to  improve  performance  of  real-­‐world  applicaHons  

! Future  work  to  target  HSA  instead  of  OpenCL      

Max  Grossman,  [email protected]  

CONCLUSION  

Page 17: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

17   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

DISCLAIMER  &  ATTRIBUTION  

The  informaHon  presented  in  this  document  is  for  informaHonal  purposes  only  and  may  contain  technical  inaccuracies,  omissions  and  typographical  errors.    

The  informaHon  contained  herein  is  subject  to  change  and  may  be  rendered  inaccurate  for  many  reasons,  including  but  not  limited  to  product  and  roadmap  changes,  component  and  motherboard  version  changes,  new  model  and/or  product  releases,  product  differences  between  differing  manufacturers,  souware  changes,  BIOS  flashes,  firmware  upgrades,  or  the  like.  AMD  assumes  no  obligaHon  to  update  or  otherwise  correct  or  revise  this  informaHon.  However,  AMD  reserves  the  right  to  revise  this  informaHon  and  to  make  changes  from  Hme  to  Hme  to  the  content  hereof  without  obligaHon  of  AMD  to  noHfy  any  person  of  such  revisions  or  changes.    

AMD  MAKES  NO  REPRESENTATIONS  OR  WARRANTIES  WITH  RESPECT  TO  THE  CONTENTS  HEREOF  AND  ASSUMES  NO  RESPONSIBILITY  FOR  ANY  INACCURACIES,  ERRORS  OR  OMISSIONS  THAT  MAY  APPEAR  IN  THIS  INFORMATION.    

AMD  SPECIFICALLY  DISCLAIMS  ANY  IMPLIED  WARRANTIES  OF  MERCHANTABILITY  OR  FITNESS  FOR  ANY  PARTICULAR  PURPOSE.  IN  NO  EVENT  WILL  AMD  BE  LIABLE  TO  ANY  PERSON  FOR  ANY  DIRECT,  INDIRECT,  SPECIAL  OR  OTHER  CONSEQUENTIAL  DAMAGES  ARISING  FROM  THE  USE  OF  ANY  INFORMATION  CONTAINED  HEREIN,  EVEN  IF  AMD  IS  EXPRESSLY  ADVISED  OF  THE  POSSIBILITY  OF  SUCH  DAMAGES.  

 

ATTRIBUTION  

©  2013  Advanced  Micro  Devices,  Inc.  All  rights  reserved.  AMD,  the  AMD  Arrow  logo  and  combinaHons  thereof  are  trademarks  of  Advanced  Micro  Devices,  Inc.  in  the  United  States  and/or  other  jurisdicHons.    SPEC    is  a  registered  trademark  of  the  Standard  Performance  EvaluaHon  CorporaHon  (SPEC).  Other  names  are  for  informaHonal  purposes  only  and  may  be  trademarks  of  their  respecHve  owners.  

Page 18: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman

18   |      PRESENTATION  TITLE      |      NOVEMBER  21,  2013      |      CONFIDENTIAL  

SAMPLE  SHAPES  

Page 19: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman
Page 20: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman
Page 21: CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman