TRANSCRIPT
Performance Optimization of Deep Learning Frameworks on Modern Intel Architectures
ElMoustapha Ould-Ahmed-Vall, AG Ramesh, Vamsi Sripathi and Karthik Raman
Representing the work of many at Intel
Agenda
• Optimization matters on modern architectures
• Intel's recent Xeon and Xeon Phi products
• Introduction to Deep Learning
• Optimizing DL frameworks on IA
  • Key challenges
  • Optimization techniques
  • Performance data
  • DL scaling
Moore’s Law Goes on!
Instead of increasing clock speeds → more cores + wider SIMD (hierarchical parallelism)
Combined Amdahl's Law for Vector Multicores*:
$\text{Speedup} = \dfrac{1}{\text{Serial}_{\text{frac}} + \frac{1-\text{Serial}_{\text{frac}}}{\text{NumCores}}} \times \dfrac{1}{\text{Scalar}_{\text{frac}} + \frac{1-\text{Scalar}_{\text{frac}}}{\text{VectorLength}}}$
Goal: Reduce Serial Fraction and Reduce Scalar Fraction of Code
Ideal Speedup: NumCores*VectorLength (requires zero scalar, zero serial work)
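As a rough illustration, the combined formula can be evaluated directly; the core count, vector width, and fractions below are made-up example values, not figures from this talk:

```python
def combined_speedup(serial_frac, scalar_frac, num_cores, vector_length):
    """Combined Amdahl's law: core-scaling term times vector-scaling term."""
    core_term = 1.0 / (serial_frac + (1.0 - serial_frac) / num_cores)
    simd_term = 1.0 / (scalar_frac + (1.0 - scalar_frac) / vector_length)
    return core_term * simd_term

# Hypothetical workload: 5% serial, 10% scalar, on 24 cores with 16 SIMD lanes.
print(combined_speedup(0.05, 0.10, 24, 16))  # ~71x, versus the 384x ideal
```

Even small serial and scalar fractions cost most of the ideal NumCores*VectorLength speedup, which is why the goal above targets both.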
Compute-Bound Performance
Most kernels of ML codes are compute bound, i.e. raw FLOPS matter.
Roofline Model: Gflops/s = min(Peak Gflops/s, Stream BW × flops/byte)
[Roofline figure: attainable Gflops/s vs. compute intensity (flops/byte), with ceilings at peak "compute" Gflops/s with and without SIMD]
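A minimal sketch of the roofline formula above (the peak and bandwidth numbers are illustrative placeholders):

```python
def attainable_gflops(peak_gflops, stream_bw_gb_s, flops_per_byte):
    """Roofline: capped by peak compute or by bandwidth x arithmetic intensity."""
    return min(peak_gflops, stream_bw_gb_s * flops_per_byte)

# Illustrative machine: 3000 Gflop/s peak, 100 GB/s STREAM bandwidth.
for intensity in (0.5, 2.0, 8.0, 32.0):
    print(intensity, attainable_gflops(3000.0, 100.0, intensity))
# Below 30 flops/byte this machine is bandwidth bound; above it, compute bound.
```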
Overview of Current Generation of Intel Xeon and Xeon Phi Products
Current Intel® Xeon Platforms

Process  Tick/Tock  Platform      Microarchitecture
45nm     TOCK       Nehalem       NEW Intel® Microarchitecture (Nehalem)
32nm     TICK       Westmere      Intel Microarchitecture (Nehalem)
32nm     TOCK       Sandy Bridge  NEW Intel Microarchitecture (Sandy Bridge)
22nm     TICK       Ivy Bridge    Intel Microarchitecture (Sandy Bridge)
22nm     TOCK       Haswell       NEW Intel Microarchitecture (Haswell)
14nm     TICK       Broadwell     Intel Microarchitecture (Haswell)
Latest released – Broadwell (14nm process)
• Intel's foundation of HPC and ML performance
• Suited for the full scope of workloads
• Industry-leading performance/watt for serial & highly parallel workloads
• Up to 22 cores/socket (Broadwell-EP), with Hyper-Threading technology
Software optimization helps maximize the benefit and adoption of new features.
2nd Generation Intel® Xeon Phi™ Platform
Intel® AVX Technology

SNB/IVB – 256b AVX1, Flops/Cycle: 16 SP / 8 DP
  AVX: 256-bit basic FP, 16 registers, NDS (and AVX128), improved blend, MASKMOV, implicit unaligned
  Float16 (IVB 2012)
HSW/BDW – 256b AVX2, Flops/Cycle: 32 SP / 16 DP (FMA)
  AVX2: 256-bit FP FMA, 256-bit integer, PERMD, Gather
SKX & KNL – 512b AVX512, Flops/Cycle: 64 SP / 32 DP (FMA)
  AVX512: 512-bit FP/integer, 32 registers, 8 mask registers, embedded rounding, embedded broadcast, scalar/SSE/AVX "promotions", native media additions, HPC additions, transcendental support, gather/scatter
Overview of Deep Learning and DL Frameworks
Deep Learning – Convolutional Neural Network

Convolution parameters:
• Number of outputs/feature maps: 4
• Filter size: 3 x 3
• Stride: 2
• Pad_size (for corner case): 1
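With these parameters, the standard output-size formula gives the feature-map dimensions; the 5x5 input below is a hypothetical example, since the slide does not state the input size:

```python
def conv_output_size(in_size, filter_size, stride, pad):
    """Output size of a convolution along one spatial dimension."""
    return (in_size + 2 * pad - filter_size) // stride + 1

# Slide parameters (filter 3x3, stride 2, pad 1) on an assumed 5x5 input:
print(conv_output_size(5, 3, 2, 1))  # -> 3, so each feature map is 3x3
# With 4 outputs, the layer produces four 3x3 feature maps.
```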
Deep Learning: Train Once, Use Many Times

• Step 1: Training (over hours/days/weeks)
  Input data → create deep network → output classification (e.g. "person": 90% person, 8% traffic light) → trained model
• Step 2: Inference (real time)
  New input from camera and sensors → trained neural network model → output classification (97% person)
Deep Learning: Why Now?

• Bigger data: image 1,000 KB/picture; audio 5,000 KB/song; video 5,000,000 KB/movie
• Better hardware: transistor density doubles every 18 months; cost/GB in 1995: $1000.00, in 2015: $0.03
• Smarter algorithms: advances in algorithm innovation, including neural networks, leading to better accuracy in training models
Intel Caffe – ML Framework Optimized for Xeon and Xeon Phi Products

• Fork of BVLC Caffe by Intel to optimize for IA
• Leverages Intel MKL Deep Neural Network (DNN) APIs
• Optimized for BDW (AVX2) and KNL (MIC_AVX512)
• https://github.com/intel/caffe

[Diagram: Intel Caffe stack – compute layer (Convolution, ReLU, Pooling) and data layer (LMDB, LevelDB, HDF5) built on Intel MKL DNN, targeting BDW and KNL]
TensorFlow™: Open Source ML Framework (Google)

• Computation is a dataflow graph with tensors
• General mathematical computing framework, widely used for:
  • Deep neural networks
  • Other machine learning algorithms
  • HPC applications
• Key computational kernels; extendable user operations
• Core in C++, front-end wrapper in Python
• Multi-node support using gRPC (Google Remote Procedure Calls)

Example from Jeff Dean's presentation.
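For illustration, a minimal dataflow graph in the TensorFlow 1.x-era Python front end; this toy graph is ours, not from the talk:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Nodes are operations; edges carry tensors through the dataflow graph.
x = tf.placeholder(tf.float32, shape=(None, 4), name="input")
w = tf.Variable(tf.ones((4, 2)), name="weights")
y = tf.nn.relu(tf.matmul(x, w), name="activation")

# The graph only executes when a session runs it with concrete inputs.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]}))  # [[10. 10.]]
```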
Optimizing Deep Learning Frameworks
Performance Optimization on Modern Platforms

• Utilize all the cores: OpenMP, MPI, TBB…; reduce synchronization events and serial code; improve load balancing
• Vectorize/SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment
• Efficient memory/cache use: blocking; data reuse; prefetching; memory allocation
Hierarchical Parallelism

• Fine-grained parallelism / within node
  • Sub-domain: 1) multi-level domain decomposition (e.g. across layers); 2) data decomposition (layer parallelism)
• Coarse-grained / multi-node
  • Domain decomposition
  • Scaling: improve load balancing; reduce synchronization events and all-to-all comms
Intel Strategy: Optimized Deep Learning Environment

• Fuel the development of vertical solutions
• Deliver best single-node and multi-node performance
• Accelerate design, training, and deployment
• Drive optimizations across open source deep learning frameworks
• Maximum performance on Intel architecture

[Diagram: Intel® Deep Learning SDK for training and inference, built on Intel® Math Kernel Library (Intel® MKL) + Intel® MKL-DNN, with Intel® Omni-Path Architecture (Intel® OPA) for multi-node]
Example Challenge 1: Data Layout Has Big Impact on Performance

• Data layout impacts performance
  • Sequential access to avoid gather/scatter
  • Keep iterations in the innermost loop to ensure high vector utilization
  • Maximize data reuse, e.g. weights in a convolution layer
• Converting to/from an optimized layout is sometimes less expensive than operating on an unoptimized layout
[Figure: the same 2D data flattened two ways – row-major per array (21 18 … 1 … 8 92 …) vs. interleaved/blocked across arrays (21 8 18 92 … 1 11 …); the blocked form is better optimized for some operations]
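As an illustration of blocking, a sketch that re-blocks an NCHW tensor so a small channel group becomes the innermost, unit-stride dimension, similar in spirit to MKL's blocked formats (the group size of 8 and the helper name are our assumptions):

```python
import numpy as np

def nchw_to_blocked(x):
    """Split channels into groups of 8 and make the 8-channel block the
    innermost (unit-stride) dimension: NCHW -> (N, C//8, H, W, 8)."""
    n, c, h, w = x.shape
    assert c % 8 == 0, "sketch assumes channels divisible by 8"
    return x.reshape(n, c // 8, 8, h, w).transpose(0, 1, 3, 4, 2).copy()

x = np.random.rand(1, 16, 5, 5).astype(np.float32)
print(nchw_to_blocked(x).shape)  # (1, 2, 5, 5, 8)
```

In the blocked form, a vectorized kernel can load 8 adjacent channel values with one contiguous (unit-stride) read instead of a gather.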
Example Challenge 2: Minimize Conversion Overhead

• End-to-end optimization can reduce conversions
  • Staying in the optimized layout as long as possible becomes one of the tuning goals
• Minimize the number of back-and-forth conversions
  • Use graph optimization techniques
[Diagram: Convolution → Convolution → Max Pool pipeline with "native to MKL layout" and "MKL layout to native" conversions inserted around each op; graph optimization removes the conversion pairs between consecutive MKL-layout ops]
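A toy graph-rewrite pass in this spirit: scan the op sequence and drop conversion pairs that immediately undo each other (the op names are hypothetical, not actual Caffe/TensorFlow identifiers):

```python
def remove_redundant_conversions(ops):
    """Drop adjacent to_native/to_mkl pairs so consecutive MKL-layout ops
    stay in the optimized layout."""
    out = []
    for op in ops:
        if out and {out[-1], op} == {"to_native", "to_mkl"}:
            out.pop()  # conversion immediately undone: drop both
        else:
            out.append(op)
    return out

ops = ["to_mkl", "conv", "to_native", "to_mkl", "conv", "to_native",
       "to_mkl", "maxpool", "to_native"]
print(remove_redundant_conversions(ops))
# ['to_mkl', 'conv', 'conv', 'maxpool', 'to_native']
```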
Example Challenge 3: Ensuring Enough Parallelism to Leverage All Cores

• Maximize parallelism to use all cores efficiently
• Intra-operation/layer parallelism within operators (OpenMP): convolution of tiles in parallel
• Inter-operation parallelism across operators: independent operators execute in parallel

[Figure: a matrix convolved tile-by-tile in parallel, and an inception-style subgraph (1x1 Conv, 3x3 Conv, 5x5 Conv → concat) with branches executed in parallel]
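A sketch of inter-operation parallelism on inception-style branches; the convolution bodies are trivial placeholders, and Python threads merely illustrate the graph-level idea that real frameworks implement with native thread pools:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def conv1x1(x): return x          # trivial placeholder branch bodies
def conv3x3(x): return x + 1.0
def conv5x5(x): return x + 2.0

x = np.zeros((8, 8), dtype=np.float32)
# The three branches have no mutual dependencies, so they run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(f, x) for f in (conv1x1, conv3x3, conv5x5)]
    out = np.concatenate([f.result() for f in futures])  # the concat node
print(out.shape)  # (24, 8)
```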
Example Challenge 4: Optimizing the Data Layer

• The data layer comprises 3 major ops:
  o Read data
  o Decode data: e.g. JPEG decode, decompression
  o Transform data
• The result of read, decode & transform is the input to the DNN layers
• Reduce the number of cores dedicated to feeding the DNN:
  o IO optimization: consider compression
  o Decode: consider LMDB instead of JPEG
  o Resizing/data processing: consider pre-processing
  o Then vectorize, parallelize
[Diagram: cores C0–C1 run Boost threads for the data layer; cores C2 … Cn-1 run OpenMP threads for compute]
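A sketch of this split, assuming a hypothetical read_and_decode() helper: a couple of producer threads feed a bounded queue while the compute side consumes batches:

```python
import queue
import threading

BATCH_SIZE = 32
q = queue.Queue(maxsize=4 * BATCH_SIZE)  # bounded buffer between I/O and compute

def read_and_decode(i):
    return i  # stand-in for file read + JPEG decode + resize

def producer(worker_id, num_workers, total=1024):
    for i in range(worker_id, total, num_workers):  # shard the sample indices
        q.put(read_and_decode(i))

for w in range(2):  # e.g. cores C0-C1 feed the OpenMP compute cores
    threading.Thread(target=producer, args=(w, 2), daemon=True).start()

batch = [q.get() for _ in range(BATCH_SIZE)]  # compute side pulls one batch
print(len(batch))  # 32
```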
Optimizing Deep Learning Frameworks for Intel® Architecture

• Leverage high-performance compute libraries and tools
  • e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler, etc.
• Data format/shape:
  • Right format/shape for max performance: blocking, gather/scatter
• Data layout:
  • Minimize cost of data layout conversions
• Parallelism:
  • Use all cores; eliminate serial sections and load imbalance
• Other functions/primitives (unoptimized in libraries):
  • Optimize via compiler knobs; improve existing implementations
• Memory allocation:
  • Unique characteristics and ability to reuse buffers
• Data layer optimizations:
  • Parallelization, vectorization, IO
• Optimize hyperparameters:
  • e.g. batch size for more parallelism
  • learning rate and optimizer to ensure accuracy/convergence
AlexNet Optimization Progression

[Chart: cumulative speedup over baseline. Broadwell: 1.00x → 2.20x → 4.18x → 6.96x → 7.72x → 9.27x → 13.36x → 13.72x. Knights Landing: 2.16x → 12.48x → 18.28x → 25.49x → 27.17x → 40.71x → 49.07x.]
VGG Optimization Progression

Cumulative speedup                       Broadwell  Knights Landing
Baseline                                 1.00x      1.00x
MKL Integration                          3.15x      15.80x
Thread Optimization                      5.40x      23.60x
Compiler Knobs Tuning                    10.18x     122.50x
Matrix Transpose/Data Transformations    13.29x     164.95x
Memory Allocations                       14.65x     171.20x
Conversions Optimization                 19.27x     273.50x
Configuration details

Intel® Xeon™ processor E5-2699 v4 (22 cores, 2.2 GHz), 128 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2.
Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: flat mode), 96 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2.
AlexNet and VGG benchmarks: https://github.com/soumith/convnet-benchmarks
Multi-Node Distributed Training

• Model parallelism
  • Break the model across N nodes
  • The same data is on all the nodes
• Data parallelism
  • Break the dataset across N nodes
  • The same model is on all the nodes
  • Good for networks with few weights, e.g. GoogLeNet
• You can use either model or data parallelism, or a hybrid of both
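A minimal data-parallel update sketch; mpi4py is our assumption for the communication layer (the slides do not name one). Every node computes gradients on its own data shard, gradients are summed across nodes, and all replicas apply the same weight update:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
weights = np.ones(4)                            # identical replica on every node
local_grad = np.full(4, comm.Get_rank() + 1.0)  # gradient from this node's shard

global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)  # sum gradients across nodes
weights -= 0.01 * global_grad / comm.Get_size()      # same update everywhere
print(comm.Get_rank(), weights)
```

Run with, e.g., `mpirun -n 4 python train.py`; the allreduce is the per-iteration communication step whose cost the next slides weigh against compute.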
Scaling Efficiency: Intel® Xeon Phi™ Processor
Deep Learning Image Classification Training Performance: MULTI-NODE Scaling

Configuration: Intel® Xeon Phi™ Processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Framework. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors; see the Legal Disclaimers at the end of this deck and http://www.intel.com/performance. *Other names and brands may be property of others.

[Chart: scaling efficiency (%) vs. number of Intel® Xeon Phi™ processor 7250 (68-core, 1.4 GHz, 16 GB) nodes, from 1 to 128, for OverFeat, AlexNet, VGG-A and GoogLeNet; chart callouts: 62 and 87]
[Chart: Time To Train (TTT) vs. batch size, with a sweet spot marked]
Multi-node Challenges

• Need to optimize both compute (iteration) and communication (weight updates)
• More nodes mean a larger total batch per iteration
  • Enough work for each node
• Optimized hyperparameters (e.g. batch size)
  • Time to train: increases with batch size
  • Accuracy: batch size impacts convergence and accuracy
• Communication overheads dominate if the per-node batch is small
  • e.g. total batch size = 1024:
    • 1024 nodes: batch size = 1 per node – communication dominates
    • 64 nodes: batch size = 16 per node – computation dominates
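The arithmetic behind the example, spelled out (the regime labels simply mirror the slide's two cases, not a general rule):

```python
TOTAL_BATCH = 1024
for nodes in (64, 1024):
    per_node = TOTAL_BATCH // nodes
    # Per-node batch sets how much compute each node has to amortize its
    # roughly constant per-iteration weight-update communication.
    regime = "computation dominates" if per_node >= 16 else "communication dominates"
    print(f"{nodes:4d} nodes -> batch {per_node:2d} per node: {regime}")
```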
Summary

• Don't be fooled by the performance of DL workloads when using unoptimized frameworks
• Significant performance headroom from optimization on Xeon and Xeon Phi
  • Close to 300x speedup in certain topologies
• Traditional vectorization and parallelization strategies apply
• Other unique performance challenges: hyperparameters, data layer, inter/intra-layer parallelization, etc.
• Call to action:
  • Try Intel optimized frameworks available today; more to come soon
Legal Disclaimers

• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers http://www.intel.com/products/processor_number
• Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
• Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
• Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
• SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjbb, SPECompG, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.
• TPC Benchmark, TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.
• No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Xeon® processor E7-8800/4800/2800 v2 product families or Intel® Itanium® 9500 series-based system (or follow-on generations of either). Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details. For systems also featuring Resilient System Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Run Sure Technology-enabled system, including an enabled Intel processor and enabled technology(ies). Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details. For systems also featuring Resilient Memory Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Run Sure Technology-enabled system, including an enabled Intel® processor and enabled technology(ies). Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804