Performance Optimization of Deep Learning Frameworks on Modern Intel Architectures

ElMoustapha Ould-Ahmed-Vall, AG Ramesh, Vamsi Sripathi and Karthik Raman

Representing the work of many at Intel

Page 1

Performance Optimization of Deep Learning Frameworks on Modern Intel Architectures

ElMoustapha Ould-Ahmed-Vall, AG Ramesh, Vamsi Sripathi and Karthik Raman

Representing the work of many at Intel

Page 2

Agenda

• Optimization matters on modern architectures
• Intel's recent Xeon and Xeon Phi products
• Introduction to Deep Learning
• Optimizing DL frameworks on IA
  • Key challenges
  • Optimization techniques
  • Performance data
  • DL scaling

Page 3

Moore's Law Goes On!

Scaling now comes not from increasing clock speeds but from more cores + wider SIMD (hierarchical parallelism).

Page 4

Combined Amdahl's Law for Vector Multicores*

$$\text{Speedup} = \left(\frac{1}{\text{Serial}_{\text{frac}} + \dfrac{1-\text{Serial}_{\text{frac}}}{\text{NumCores}}}\right) \times \left(\frac{1}{\text{Scalar}_{\text{frac}} + \dfrac{1-\text{Scalar}_{\text{frac}}}{\text{VectorLength}}}\right)$$

Goal: reduce both the serial fraction and the scalar fraction of the code.

Ideal speedup: NumCores × VectorLength (requires zero scalar and zero serial work).
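To make the formula concrete, here is a minimal sketch (the fractions are hypothetical; the core and lane counts are loosely KNL-like):

```python
# A minimal sketch (hypothetical fractions; KNL-like core/lane counts):
# evaluating the combined Amdahl's law above.

def combined_amdahl(serial_frac, scalar_frac, num_cores, vector_length):
    """Speedup = parallel-scaling term * vector-scaling term."""
    parallel = 1.0 / (serial_frac + (1.0 - serial_frac) / num_cores)
    vector = 1.0 / (scalar_frac + (1.0 - scalar_frac) / vector_length)
    return parallel * vector

# 1% serial and 5% scalar work on 68 cores with 16 SP SIMD lanes:
print(combined_amdahl(0.01, 0.05, num_cores=68, vector_length=16))
# ~372x, versus the ideal 68 * 16 = 1088x; even small serial/scalar
# fractions cost roughly a factor of 3.
```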

Compute-bound performance: most kernels in ML codes are compute bound, i.e., raw FLOPS matter.

Roofline model: Attainable Gflops/s = min(Peak Gflops/s, Stream BW × compute intensity in flops/byte)

[Roofline figure: attainable Gflops/s vs. compute intensity (flops/byte), with ceilings at peak "compute" Gflops/s and at peak Gflops/s without SIMD]
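A minimal sketch of reading the roofline, with illustrative machine numbers (not from the slides):

```python
# A minimal sketch (illustrative peak/bandwidth numbers): where a kernel
# lands on the roofline given its compute intensity.

def attainable_gflops(peak_gflops, stream_bw_gbs, flops_per_byte):
    """Roofline: attainable = min(peak, bandwidth * compute intensity)."""
    return min(peak_gflops, stream_bw_gbs * flops_per_byte)

# Hypothetical machine: 6000 SP Gflops/s peak, 450 GB/s stream bandwidth.
for intensity in (0.5, 2.0, 13.3, 40.0):
    print(intensity, attainable_gflops(6000.0, 450.0, intensity))
# Below ~13.3 flops/byte the kernel is bandwidth bound; above it, compute
# bound, which is where convolution/GEMM kernels sit.
```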

Page 5

Overview of Current Generation of Intel Xeon and Xeon Phi Products

Page 6

Current Intel® Xeon Platforms (tick-tock cadence)

• Nehalem: TOCK, 45nm process technology, NEW Intel® Microarchitecture (Nehalem)
• Westmere: TICK, 32nm process technology, Intel Microarchitecture (Nehalem)
• Sandy Bridge: TOCK, 32nm process technology, NEW Intel Microarchitecture (Sandy Bridge)
• Ivy Bridge: TICK, 22nm process technology, Intel Microarchitecture (Sandy Bridge)
• Haswell: TOCK, 22nm process technology, NEW Intel Microarchitecture (Haswell)
• Broadwell: TICK, 14nm process technology, Intel Microarchitecture (Haswell)

Latest released: Broadwell (14nm process)
• Intel's foundation of HPC and ML performance
• Suited for the full scope of workloads
• Industry-leading performance/watt for serial & highly parallel workloads
• Up to 22 cores/socket (Broadwell-EP), with Hyper-Threading technology

Software optimization helps maximize the benefit and adoption of new features.

Page 7

2nd Generation Intel® Xeon Phi™ Platform

Page 8

Intel® AVX Technology

• SNB/IVB: AVX (256-bit basic FP, 16 registers, NDS (and AVX128), improved blend, MASKMOV, implicit unaligned), plus Float16 (IVB 2012). 256b AVX1 flops/cycle: 16 SP / 8 DP.
• HSW/BDW: AVX2 (256-bit FP FMA, 256-bit integer, PERMD, gather). 256b AVX2 flops/cycle: 32 SP / 16 DP (FMA).
• SKX & KNL: AVX512 (512-bit FP/integer, 32 registers, 8 mask registers, embedded rounding, embedded broadcast, scalar/SSE/AVX "promotions", native media additions, HPC additions, transcendental support, gather/scatter). 512b AVX512 flops/cycle: 64 SP / 32 DP (FMA).
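The flops/cycle figures follow from vector width and execution units; a minimal sketch under stated assumptions:

```python
# A minimal sketch (assumptions: an FMA counts as 2 flops; 2 vector units
# per core on the FMA-capable parts; AVX1's separate add and mul ports
# modeled as 2 single-op units): deriving the flops/cycle figures above.

def sp_flops_per_cycle(vector_bits, fma=True, units=2):
    lanes = vector_bits // 32          # 32-bit single-precision lanes
    return lanes * (2 if fma else 1) * units

print(sp_flops_per_cycle(256, fma=False))  # AVX1:   16 SP (DP is half: 8)
print(sp_flops_per_cycle(256))             # AVX2:   32 SP (16 DP)
print(sp_flops_per_cycle(512))             # AVX512: 64 SP (32 DP)
```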

Page 9

Overview of Deep Learning and DL Frameworks

Page 10

Deep Learning: Convolutional Neural Network

Convolution parameters:
• Number of outputs/feature maps: 4
• Filter size: 3 × 3
• Stride: 2
• Pad_size (for the corner cases): 1
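These parameters determine the output size through the standard convolution arithmetic; a small sketch (the 5×5 input size is chosen for illustration and is not on the slide):

```python
# A small sketch of the standard convolution output-size arithmetic.

def conv_output_size(in_size, filter_size, stride, pad):
    return (in_size + 2 * pad - filter_size) // stride + 1

# The slide's parameters: 3x3 filter, stride 2, pad 1, on a 5x5 input:
print(conv_output_size(5, filter_size=3, stride=2, pad=1))  # 3
# Each of the 4 output feature maps is 3x3.
```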

Page 11

Deep Learning: Train Once, Use Many Times

• Step 1: Training (over hours/days/weeks)
  Input data → create deep network → output classification (e.g., 90% person, 8% traffic light for an image labeled "Person"), yielding a trained model.

• Step 2: Inference (real time)
  New input from camera and sensors → trained neural network model → output classification (e.g., 97% person).

Page 12

Deep Learning: Why Now?

• Bigger data: image ~1,000 KB per picture; audio ~5,000 KB per song; video ~5,000,000 KB per movie
• Better hardware: transistor density doubles every 18 months; cost/GB in 1995: $1000.00; cost/GB in 2015: $0.03
• Smarter algorithms: advances in algorithm innovation, including neural networks, leading to better accuracy in training models

Page 13

Intel Caffe: ML Framework Optimized for Xeon and Xeon Phi Products

• Fork of BVLC Caffe by Intel to optimize for IA
• Leverages the Intel MKL Deep Neural Network (DNN) APIs
• Optimized for BDW (AVX2) and KNL (MIC_AVX512)
• https://github.com/intel/caffe

[Diagram: Intel Caffe stack: a compute layer (Convolution, ReLU, Pooling) and a data layer (LMDB, LevelDB, HDF5) built on Intel MKL DNN, targeting BDW and KNL]

Page 14

TensorFlow™: Open Source ML Framework (Google)

• Computation is a dataflow graph with tensors
• General mathematical computing framework, widely used for:
  • Deep neural networks
  • Other machine learning algorithms
  • HPC applications
• Key computational kernels; extendable user operations
• Core in C++; front-end wrapper in Python
• Multi-node support using gRPC (Google Remote Procedure Calls)

Example from Jeff Dean's presentation.
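As a hedged illustration of the dataflow-graph model (TF 1.x-era API, matching this deck's vintage; the shapes and tensor names are illustrative):

```python
# A minimal sketch of TensorFlow's dataflow-graph model (TF 1.x-era API;
# shapes and names are illustrative, not from the slides).
import numpy as np
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
    w = tf.Variable(tf.zeros([784, 10]), name="w")
    b = tf.Variable(tf.zeros([10]), name="b")
    y = tf.nn.softmax(tf.matmul(x, w) + b)  # a graph node, not a value yet

# Nothing has executed so far; a Session runs the graph (locally, or on a
# multi-node cluster where workers communicate over gRPC).
with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.zeros((1, 784), dtype=np.float32)})
    print(out.shape)  # (1, 10)
```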

Page 15

Optimizing Deep Learning Frameworks

Page 16

Performance Optimization on Modern Platforms

• Utilize all the cores: OpenMP, MPI, TBB…; reduce synchronization events and serial code; improve load balancing
• Vectorize/SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment
• Efficient memory/cache use: blocking; data reuse; prefetching; memory allocation

Hierarchical Parallelism
• Fine-grained parallelism / within node. Sub-domain: 1) multi-level domain decomposition (e.g., across layers); 2) data decomposition (layer parallelism)
• Coarse-grained / multi-node: domain decomposition
• Scaling: improve load balancing; reduce synchronization events and all-to-all comms

Page 17

Intel Strategy: Optimized Deep Learning Environment

• Fuel the development of vertical solutions
• Accelerate design, training, and deployment: Intel® Deep Learning SDK
• Drive optimizations across open source deep learning frameworks (training + inference)
• Maximum performance on Intel architecture: Intel® Math Kernel Library (Intel® MKL) + Intel® MKL-DNN
• Deliver best single-node and multi-node performance: Intel® Omni-Path Architecture (Intel® OPA)

Page 18

Example Challenge 1: Data Layout Has Big Impact on Performance

• Data layout impacts performance
  • Sequential access to avoid gather/scatter
  • Keep iterations in the innermost loop to ensure high vector utilization
  • Maximize data reuse, e.g., weights in a convolution layer
• Converting to/from an optimized layout is sometimes less expensive than operating on an unoptimized layout (see the sketch below)

[Figure: two 5×5 feature maps stored two ways: channel-interleaved (21 8 18 92 … 1 11 …) vs. plane-by-plane (21 18 … 1 … 8 92 …); each layout is better optimized for some operations]
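A minimal sketch in NumPy (a stand-in for what MKL does internally; the block size of 8 is an assumption matching one 8-lane AVX2 SP register):

```python
# A minimal NumPy sketch (not MKL's actual code): convert NCHW activations
# to a channel-blocked layout (often written nChw8c) so that 8 adjacent
# channels sit contiguously per pixel and map onto one SIMD register.
import numpy as np

def nchw_to_nchw8c(x):
    """x: (N, C, H, W) with C a multiple of 8 -> (N, C//8, H, W, 8)."""
    n, c, h, w = x.shape
    assert c % 8 == 0, "pad channels up to a multiple of the block size"
    return np.ascontiguousarray(
        x.reshape(n, c // 8, 8, h, w).transpose(0, 1, 3, 4, 2))

x = np.random.rand(1, 16, 5, 5).astype(np.float32)
print(nchw_to_nchw8c(x).shape)  # (1, 2, 5, 5, 8): 8 channels unit-stride
```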

Page 19

Example Challenge 2: Minimize Conversion Overhead

• End-to-end optimization can reduce conversions; staying in the optimized layout as long as possible becomes one of the tuning goals
• Minimize the number of back-and-forth conversions through graph optimization techniques (see the sketch below)

[Diagram: native-to-MKL-layout conversion → Convolution → MKL-layout-to-native → native-to-MKL-layout → Convolution → MKL-layout-to-native → Max Pool; the round trip between the two convolutions can be eliminated]
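A toy sketch of such a graph pass (the op names and list-based IR are invented for illustration; real frameworks run an equivalent pass on their own graph representation):

```python
# A toy sketch (invented IR, not a real framework pass): cancel adjacent
# to_native/to_mkl conversion pairs so consecutive MKL ops stay in the
# optimized layout.

def remove_redundant_conversions(ops):
    """ops: list of op-name strings in execution order."""
    out = []
    for op in ops:
        # A to_mkl immediately after a to_native is a round trip: drop both.
        if op == "to_mkl" and out and out[-1] == "to_native":
            out.pop()
        else:
            out.append(op)
    return out

graph = ["to_mkl", "conv", "to_native", "to_mkl", "conv", "to_native", "maxpool"]
print(remove_redundant_conversions(graph))
# ['to_mkl', 'conv', 'conv', 'to_native', 'maxpool']
```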

Page 20

Example Challenge 3: Ensuring Enough Parallelism to Leverage All Cores

• Maximize parallelism to use all cores efficiently
• Intra-operation/layer parallelism within operators (OpenMP): e.g., tiles of one convolution computed in parallel
• Inter-operation parallelism across operators: e.g., 1×1, 3×3, and 5×5 convolutions feeding a concat executed in parallel

[Diagram: a 4×4 input whose tiles are convolved in parallel into a 2×2 output; and parallel execution of 1×1/3×3/5×5 convolution branches into a concat]

Both knobs are exposed by frameworks, as sketched below.
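A minimal sketch (TF 1.x-era API; the thread counts are illustrative):

```python
# A minimal sketch (TF 1.x-era API; thread counts are illustrative):
# TensorFlow exposes both kinds of parallelism as session options.
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=44,  # threads inside one op (OpenMP-style)
    inter_op_parallelism_threads=2,   # independent ops run concurrently
)

with tf.Session(config=config) as sess:
    pass  # build and run the graph as usual under this thread budget
```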

Page 21

Example Challenge 4: Optimizing the Data Layer

• The data layer comprises 3 major ops:
  o Read data
  o Decode data: e.g., JPEG decode, decompression
  o Transform data
• The result of read, decode & transform is the input to the DNN layers
• Reduce the number of cores dedicated to feeding the DNN:
  o IO optimization: consider compression
  o Decode: consider LMDB instead of JPEG
  o Resizing/data processing: consider pre-processing
  o Then vectorize and parallelize (a pipeline sketch follows below)

[Core assignment: C0 and C1 run Boost I/O threads; C2 through Cn-1 run OpenMP compute threads]
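A minimal sketch of that split (a stand-in for Caffe's data layer; the Pillow decode, file names, and two-IO-thread division are assumptions mirroring the core assignment above):

```python
# A minimal sketch (assumptions noted above): overlap read/decode/transform
# with DNN compute via a bounded queue fed by dedicated IO threads.
import queue
import threading
from PIL import Image

batch_queue = queue.Queue(maxsize=4)  # bounded so IO can't run far ahead

def data_worker(paths):
    for p in paths:
        img = Image.open(p).resize((224, 224))  # read + decode + transform
        batch_queue.put(img)

paths = ["img0.jpg", "img1.jpg", "img2.jpg", "img3.jpg"]  # illustrative
for shard in (paths[0::2], paths[1::2]):  # two IO threads, like C0/C1
    threading.Thread(target=data_worker, args=(shard,), daemon=True).start()

# The remaining cores run the OpenMP-parallel DNN compute, consuming items:
first = batch_queue.get()  # blocks until the IO side produces one
```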

Page 22

Optimizing Deep Learning Frameworks for Intel® Architecture

• Leverage high-performance compute libraries and tools: e.g., Intel® Math Kernel Library, Intel® Python, Intel® Compiler, etc.
• Data format/shape: right format/shape for max performance: blocking, gather/scatter
• Data layout: minimize the cost of data layout conversions
• Parallelism: use all cores; eliminate serial sections and load imbalance
• Other functions/primitives (un-optimized in libraries): optimize via compiler knobs, improve existing implementations
• Memory allocation: unique characteristics and the ability to reuse buffers
• Data layer optimizations: parallelization, vectorization, IO
• Optimize hyperparameters: e.g., batch size for more parallelism; learning rate and optimizer to ensure accuracy/convergence

Page 23

AlexNet Optimization Progression

[Chart: cumulative speedup over the unoptimized baseline]
• Broadwell: 1.00x → 2.20x → 4.18x → 6.96x → 7.72x → 9.27x → 13.36x → 13.72x
• Knights Landing: 2.16x → 12.48x → 18.28x → 25.49x → 27.17x → 40.71x → 49.07x

Page 24

VGG Optimization Progression

[Chart: cumulative speedup over the baseline; optimization stages: Baseline → MKL Integration → Thread Optimization → Compiler Knobs Tuning → Matrix Transpose/Data Transformations → Memory Allocations → Conversions Optimization]
• Broadwell: 1.00x → 3.15x → 5.40x → 10.18x → 13.29x → 14.65x → 19.27x
• Knights Landing: 1.00x → 15.80x → 23.60x → 122.50x → 164.95x → 171.20x → 273.50x

Page 25

Configuration details

• Intel® Xeon™ processor E5-2699 v4 (22 cores, 2.2 GHz), 128 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2
• Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: flat mode), 96 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2
• AlexNet and VGG benchmarks: https://github.com/soumith/convnet-benchmarks

Page 26

Multi-Node Distributed Training

• Model parallelism
  • Break the model across N nodes
  • The same data is on all the nodes
• Data parallelism
  • Break the dataset across N nodes
  • The same model is on all the nodes
  • Good for networks with few weights, e.g., GoogLeNet
• You can use model parallelism, data parallelism, or a hybrid of both (a data-parallel step is sketched below)
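A minimal sketch of one data-parallel step using MPI (mpi4py; the gradient array and plain averaging are illustrative, not Intel's actual implementation):

```python
# A minimal sketch (illustrative, not Intel's implementation): each node
# computes gradients on its own data shard; an allreduce sums them so
# every node applies the identical weight update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_grads = np.random.rand(1000).astype(np.float32)  # this node's shard

comm.Allreduce(MPI.IN_PLACE, local_grads, op=MPI.SUM)
local_grads /= comm.Get_size()  # average across nodes

# weights -= learning_rate * local_grads  # same update on every node
```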

Page 27

Data Parallelism

Page 28

Scaling Efficiency: Intel® Xeon Phi™ Processor
Deep Learning Image Classification Training Performance: MULTI-NODE Scaling

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. *Other names and brands may be property of others.

Configurations:
• Intel® Xeon Phi™ Processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Framework

[Chart: scaling efficiency (%) vs. number of Intel® Xeon Phi™ processor 7250 (68-core, 1.4 GHz, 16 GB) nodes, from 1 to 128, for OverFeat, AlexNet, VGG-A and GoogLeNet; labeled points show 62% and 87% efficiency]

Page 29

Multi-node Challenges

[Plot: Time To Train (TTT) vs. batch size, with a marked sweet spot]

• Need to optimize both compute (iteration) and communication (weight updates)
• More nodes mean a larger total batch per iteration
  • Need enough work for each node
• Optimized hyperparameters (e.g., batch size)
  • Time to train: increases with batch size
  • Accuracy: batch size impacts convergence and accuracy
• Communication overheads dominate if the per-node batch is small; e.g., with a total batch size of 1024 (arithmetic sketched below):
  • 1024 nodes: batch size = 1 per node; communication dominates
  • 64 nodes: batch size = 16 per node; computation dominates
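For completeness, the per-node arithmetic behind that example:

```python
# A trivial sketch of the arithmetic above: the per-node batch shrinks
# with node count, shifting the balance from computation to communication.
total_batch = 1024
for nodes in (64, 1024):
    print(nodes, "nodes ->", total_batch // nodes, "images per node per iteration")
```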

Page 30

Summary

• Don't be fooled by the performance of DL workloads when using unoptimized frameworks
• There is significant performance headroom from optimization on Xeon and Xeon Phi: close to 300x speedup on certain topologies
• Traditional vectorization and parallelization strategies apply
• Other unique performance challenges: hyperparameters, the data layer, inter/intra-layer parallelization, etc.
• Call to action: try the Intel-optimized frameworks available today; more to come soon

Page 31

Legal Disclaimers

• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: Learn About Intel® Processor Numbers http://www.intel.com/products/processor_number
• Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

•  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

•  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.

•  SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjbb, SPECompG, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.

• TPC Benchmark, TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.
• No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Xeon® processor E7-8800/4800/2800 v2 product families or Intel® Itanium® 9500 series-based system (or follow-on generations of either.) Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details. For systems also featuring Resilient System Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Run Sure Technology-enabled system, including an enabled Intel processor and enabled technology(ies). Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details. For systems also featuring Resilient Memory Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Run Sure Technology-enabled system, including an enabled Intel® processor and enabled technology(ies). Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details.

Page 32


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Page 33