TRANSCRIPT
Performance Optimization of Deep Learning Frameworks on Modern Intel Architectures
ElMoustapha Ould-Ahmed-Vall, AG Ramesh, Vamsi Sripathi and Karthik Raman
Representing the work of many at Intel
Agenda
• Optimization matters on modern architectures
• Intel's recent Xeon and Xeon Phi products
• Introduction to Deep Learning
• Optimizing DL frameworks on IA
  • Key challenges
  • Optimization techniques
  • Performance data
  • DL scaling
Moore’s Law Goes on!
Instead of increasing clock speeds → more cores + wider SIMD (hierarchical parallelism)
Combined Amdahl's Law for Vector Multicores*:
$\text{Speedup} = \dfrac{1}{\text{Serial}_{\text{frac}} + \frac{1-\text{Serial}_{\text{frac}}}{\text{NumCores}}} \times \dfrac{1}{\text{Scalar}_{\text{frac}} + \frac{1-\text{Scalar}_{\text{frac}}}{\text{VectorLength}}}$
Goal: Reduce Serial Fraction and Reduce Scalar Fraction of Code
Ideal Speedup: NumCores*VectorLength (requires zero scalar, zero serial work)
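As a rough illustration, the combined formula can be evaluated directly; the core count, vector width, and fractions below are made-up example values, not figures from this talk:

```python
def combined_speedup(serial_frac, scalar_frac, num_cores, vector_length):
    """Combined Amdahl's law: core-scaling term times vector-scaling term."""
    core_term = 1.0 / (serial_frac + (1.0 - serial_frac) / num_cores)
    simd_term = 1.0 / (scalar_frac + (1.0 - scalar_frac) / vector_length)
    return core_term * simd_term

# Hypothetical workload: 5% serial, 10% scalar, on 24 cores with 16 SIMD lanes.
print(combined_speedup(0.05, 0.10, 24, 16))  # ~71x, versus the 384x ideal
```

Even small serial and scalar fractions cost most of the ideal NumCores*VectorLength speedup, which is why the goal above targets both.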
Compute-Bound Performance
Most kernels of ML codes are compute bound, i.e. raw FLOPS matter.
Roofline Model: Gflops/s = min(Peak Gflops/s, Stream BW × flops/byte)
[Roofline figure: attainable Gflops/s vs. compute intensity (flops/byte), with ceilings at peak "compute" Gflops/s with and without SIMD]
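A minimal sketch of the roofline formula above (the peak and bandwidth numbers are illustrative placeholders):

```python
def attainable_gflops(peak_gflops, stream_bw_gb_s, flops_per_byte):
    """Roofline: capped by peak compute or by bandwidth x arithmetic intensity."""
    return min(peak_gflops, stream_bw_gb_s * flops_per_byte)

# Illustrative machine: 3000 Gflop/s peak, 100 GB/s STREAM bandwidth.
for intensity in (0.5, 2.0, 8.0, 32.0):
    print(intensity, attainable_gflops(3000.0, 100.0, intensity))
# Below 30 flops/byte this machine is bandwidth bound; above it, compute bound.
```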
Overview of Current Generation of Intel Xeon and Xeon Phi Products
Current Intel® Xeon Platforms

Process  Tick/Tock  Platform      Microarchitecture
45nm     TOCK       Nehalem       NEW Intel® Microarchitecture (Nehalem)
32nm     TICK       Westmere      Intel Microarchitecture (Nehalem)
32nm     TOCK       Sandy Bridge  NEW Intel Microarchitecture (Sandy Bridge)
22nm     TICK       Ivy Bridge    Intel Microarchitecture (Sandy Bridge)
22nm     TOCK       Haswell       NEW Intel Microarchitecture (Haswell)
14nm     TICK       Broadwell     Intel Microarchitecture (Haswell)
Latest released – Broadwell (14nm process)
• Intel's foundation of HPC and ML performance
• Suited for the full scope of workloads
• Industry-leading performance/watt for serial & highly parallel workloads
• Up to 22 cores/socket (Broadwell-EP), with Hyper-Threading technology
Software optimization helps maximize the benefit and adoption of new features.
2nd Generation Intel® Xeon Phi™ Platform
Intel® AVX Technology

SNB/IVB – 256b AVX1, Flops/Cycle: 16 SP / 8 DP
  AVX: 256-bit basic FP, 16 registers, NDS (and AVX128), improved blend, MASKMOV, implicit unaligned
  Float16 (IVB 2012)
HSW/BDW – 256b AVX2, Flops/Cycle: 32 SP / 16 DP (FMA)
  AVX2: 256-bit FP FMA, 256-bit integer, PERMD, Gather
SKX & KNL – 512b AVX512, Flops/Cycle: 64 SP / 32 DP (FMA)
  AVX512: 512-bit FP/integer, 32 registers, 8 mask registers, embedded rounding, embedded broadcast, scalar/SSE/AVX "promotions", native media additions, HPC additions, transcendental support, gather/scatter
Overview of Deep Learning and DL Frameworks
Deep Learning – Convolutional Neural Network

Convolution parameters:
• Number of outputs/feature maps: 4
• Filter size: 3 x 3
• Stride: 2
• Pad_size (for corner case): 1
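With these parameters, the standard output-size formula gives the feature-map dimensions; the 5x5 input below is a hypothetical example, since the slide does not state the input size:

```python
def conv_output_size(in_size, filter_size, stride, pad):
    """Output size of a convolution along one spatial dimension."""
    return (in_size + 2 * pad - filter_size) // stride + 1

# Slide parameters (filter 3x3, stride 2, pad 1) on an assumed 5x5 input:
print(conv_output_size(5, 3, 2, 1))  # -> 3, so each feature map is 3x3
# With 4 outputs, the layer produces four 3x3 feature maps.
```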
Deep Learning: Train Once, Use Many Times

• Step 1: Training (over hours/days/weeks)
  Input data → create deep network → output classification (e.g. "person": 90% person, 8% traffic light) → trained model
• Step 2: Inference (real time)
  New input from camera and sensors → trained neural network model → output classification (97% person)
Deep Learning: Why Now?

• Bigger data: image 1,000 KB/picture; audio 5,000 KB/song; video 5,000,000 KB/movie
• Better hardware: transistor density doubles every 18 months; cost/GB in 1995: $1000.00, in 2015: $0.03
• Smarter algorithms: advances in algorithm innovation, including neural networks, leading to better accuracy in training models
Intel Caffe – ML Framework Optimized for Xeon and Xeon Phi Products

• Fork of BVLC Caffe by Intel to optimize for IA
• Leverages Intel MKL Deep Neural Network (DNN) APIs
• Optimized for BDW (AVX2) and KNL (MIC_AVX512)
• https://github.com/intel/caffe

[Diagram: Intel Caffe stack – compute layer (Convolution, ReLU, Pooling) and data layer (LMDB, LevelDB, HDF5) built on Intel MKL DNN, targeting BDW and KNL]
TensorFlow™: Open Source ML Framework (Google)

• Computation is a dataflow graph with tensors
• General mathematical computing framework, widely used for:
  • Deep neural networks
  • Other machine learning algorithms
  • HPC applications
• Key computational kernels; extendable user operations
• Core in C++, front-end wrapper in Python
• Multi-node support using gRPC (Google Remote Procedure Calls)

Example from Jeff Dean's presentation.
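For illustration, a minimal dataflow graph in the TensorFlow 1.x-era Python front end; this toy graph is ours, not from the talk:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Nodes are operations; edges carry tensors through the dataflow graph.
x = tf.placeholder(tf.float32, shape=(None, 4), name="input")
w = tf.Variable(tf.ones((4, 2)), name="weights")
y = tf.nn.relu(tf.matmul(x, w), name="activation")

# The graph only executes when a session runs it with concrete inputs.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]}))  # [[10. 10.]]
```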
Optimizing Deep Learning Frameworks
Performance Optimization on Modern Platforms

• Utilize all the cores: OpenMP, MPI, TBB…; reduce synchronization events and serial code; improve load balancing
• Vectorize/SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment
• Efficient memory/cache use: blocking; data reuse; prefetching; memory allocation
Hierarchical Parallelism

• Fine-grained parallelism / within node
  • Sub-domain: 1) multi-level domain decomposition (e.g. across layers); 2) data decomposition (layer parallelism)
• Coarse-grained / multi-node
  • Domain decomposition
  • Scaling: improve load balancing; reduce synchronization events and all-to-all comms
Intel Strategy: Optimized Deep Learning Environment

• Fuel the development of vertical solutions
• Deliver best single-node and multi-node performance
• Accelerate design, training, and deployment
• Drive optimizations across open source deep learning frameworks
• Maximum performance on Intel architecture

[Diagram: Intel® Deep Learning SDK for training and inference, built on Intel® Math Kernel Library (Intel® MKL) + Intel® MKL-DNN, with Intel® Omni-Path Architecture (Intel® OPA) for multi-node]
Example Challenge 1: Data Layout Has Big Impact on Performance

• Data layout impacts performance
  • Sequential access to avoid gather/scatter
  • Keep iterations in the innermost loop to ensure high vector utilization
  • Maximize data reuse, e.g. weights in a convolution layer
• Converting to/from an optimized layout is sometimes less expensive than operating on an unoptimized layout
[Figure: the same 2D data flattened two ways – row-major per array (21 18 … 1 … 8 92 …) vs. interleaved/blocked across arrays (21 8 18 92 … 1 11 …); the blocked form is better optimized for some operations]
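As an illustration of blocking, a sketch that re-blocks an NCHW tensor so a small channel group becomes the innermost, unit-stride dimension, similar in spirit to MKL's blocked formats (the group size of 8 and the helper name are our assumptions):

```python
import numpy as np

def nchw_to_blocked(x):
    """Split channels into groups of 8 and make the 8-channel block the
    innermost (unit-stride) dimension: NCHW -> (N, C//8, H, W, 8)."""
    n, c, h, w = x.shape
    assert c % 8 == 0, "sketch assumes channels divisible by 8"
    return x.reshape(n, c // 8, 8, h, w).transpose(0, 1, 3, 4, 2).copy()

x = np.random.rand(1, 16, 5, 5).astype(np.float32)
print(nchw_to_blocked(x).shape)  # (1, 2, 5, 5, 8)
```

In the blocked form, a vectorized kernel can load 8 adjacent channel values with one contiguous (unit-stride) read instead of a gather.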
Example Challenge 2: Minimize Conversion Overhead

• End-to-end optimization can reduce conversions
  • Staying in the optimized layout as long as possible becomes one of the tuning goals
• Minimize the number of back-and-forth conversions
  • Use graph optimization techniques
[Diagram: Convolution → Convolution → Max Pool pipeline with "native to MKL layout" and "MKL layout to native" conversions inserted around each op; graph optimization removes the conversion pairs between consecutive MKL-layout ops]
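A toy graph-rewrite pass in this spirit: scan the op sequence and drop conversion pairs that immediately undo each other (the op names are hypothetical, not actual Caffe/TensorFlow identifiers):

```python
def remove_redundant_conversions(ops):
    """Drop adjacent to_native/to_mkl pairs so consecutive MKL-layout ops
    stay in the optimized layout."""
    out = []
    for op in ops:
        if out and {out[-1], op} == {"to_native", "to_mkl"}:
            out.pop()  # conversion immediately undone: drop both
        else:
            out.append(op)
    return out

ops = ["to_mkl", "conv", "to_native", "to_mkl", "conv", "to_native",
       "to_mkl", "maxpool", "to_native"]
print(remove_redundant_conversions(ops))
# ['to_mkl', 'conv', 'conv', 'maxpool', 'to_native']
```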
Example Challenge 3: Ensuring Enough Parallelism to Leverage All Cores

• Maximize parallelism to use all cores efficiently
• Intra-operation/layer parallelism within operators (OpenMP): convolution of tiles in parallel
• Inter-operation parallelism across operators: independent operators execute in parallel

[Figure: a matrix convolved tile-by-tile in parallel, and an inception-style subgraph (1x1 Conv, 3x3 Conv, 5x5 Conv → concat) with branches executed in parallel]
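A sketch of inter-operation parallelism on inception-style branches; the convolution bodies are trivial placeholders, and Python threads merely illustrate the graph-level idea that real frameworks implement with native thread pools:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def conv1x1(x): return x          # trivial placeholder branch bodies
def conv3x3(x): return x + 1.0
def conv5x5(x): return x + 2.0

x = np.zeros((8, 8), dtype=np.float32)
# The three branches have no mutual dependencies, so they run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(f, x) for f in (conv1x1, conv3x3, conv5x5)]
    out = np.concatenate([f.result() for f in futures])  # the concat node
print(out.shape)  # (24, 8)
```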
Example Challenge 4: Optimizing the Data Layer

• The data layer comprises 3 major ops:
  o Read data
  o Decode data: e.g. JPEG decode, decompression
  o Transform data
• The result of read, decode & transform is the input to the DNN layers
• Reduce the number of cores dedicated to feeding the DNN:
  o IO optimization: consider compression
  o Decode: consider LMDB instead of JPEG
  o Resizing/data processing: consider pre-processing
  o Then vectorize, parallelize
[Diagram: cores C0–C1 run Boost threads for the data layer; cores C2 … Cn-1 run OpenMP threads for compute]
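A sketch of this split, assuming a hypothetical read_and_decode() helper: a couple of producer threads feed a bounded queue while the compute side consumes batches:

```python
import queue
import threading

BATCH_SIZE = 32
q = queue.Queue(maxsize=4 * BATCH_SIZE)  # bounded buffer between I/O and compute

def read_and_decode(i):
    return i  # stand-in for file read + JPEG decode + resize

def producer(worker_id, num_workers, total=1024):
    for i in range(worker_id, total, num_workers):  # shard the sample indices
        q.put(read_and_decode(i))

for w in range(2):  # e.g. cores C0-C1 feed the OpenMP compute cores
    threading.Thread(target=producer, args=(w, 2), daemon=True).start()

batch = [q.get() for _ in range(BATCH_SIZE)]  # compute side pulls one batch
print(len(batch))  # 32
```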
Optimizing Deep Learning Frameworks for Intel® Architecture

• Leverage high-performance compute libraries and tools
  • e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler, etc.
• Data format/shape:
  • Right format/shape for max performance: blocking, gather/scatter
• Data layout:
  • Minimize cost of data layout conversions
• Parallelism:
  • Use all cores; eliminate serial sections and load imbalance
• Other functions/primitives (unoptimized in libraries):
  • Optimize via compiler knobs; improve existing implementations
• Memory allocation:
  • Unique characteristics and ability to reuse buffers
• Data layer optimizations:
  • Parallelization, vectorization, IO
• Optimize hyperparameters:
  • e.g. batch size for more parallelism
  • learning rate and optimizer to ensure accuracy/convergence
AlexNet Optimization Progression

[Chart: cumulative speedup over baseline. Broadwell: 1.00x → 2.20x → 4.18x → 6.96x → 7.72x → 9.27x → 13.36x → 13.72x. Knights Landing: 2.16x → 12.48x → 18.28x → 25.49x → 27.17x → 40.71x → 49.07x.]
VGG Optimization Progression

Cumulative speedup                       Broadwell  Knights Landing
Baseline                                 1.00x      1.00x
MKL Integration                          3.15x      15.80x
Thread Optimization                      5.40x      23.60x
Compiler Knobs Tuning                    10.18x     122.50x
Matrix Transpose/Data Transformations    13.29x     164.95x
Memory Allocations                       14.65x     171.20x
Conversions Optimization                 19.27x     273.50x
Configuration details

Intel® Xeon™ processor E5-2699 v4 (22 cores, 2.2 GHz), 128 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2.
Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: flat mode), 96 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2.
AlexNet and VGG benchmarks: https://github.com/soumith/convnet-benchmarks
Multi-Node Distributed Training

• Model parallelism
  • Break the model across N nodes
  • The same data is on all the nodes
• Data parallelism
  • Break the dataset across N nodes
  • The same model is on all the nodes
  • Good for networks with few weights, e.g. GoogLeNet
• You can use either model or data parallelism, or a hybrid of both
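A minimal data-parallel update sketch; mpi4py is our assumption for the communication layer (the slides do not name one). Every node computes gradients on its own data shard, gradients are summed across nodes, and all replicas apply the same weight update:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
weights = np.ones(4)                            # identical replica on every node
local_grad = np.full(4, comm.Get_rank() + 1.0)  # gradient from this node's shard

global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)  # sum gradients across nodes
weights -= 0.01 * global_grad / comm.Get_size()      # same update everywhere
print(comm.Get_rank(), weights)
```

Run with, e.g., `mpirun -n 4 python train.py`; the allreduce is the per-iteration communication step whose cost the next slides weigh against compute.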
Scaling Efficiency: Intel® Xeon Phi™ Processor
Deep Learning Image Classification Training Performance: MULTI-NODE Scaling

Configuration: Intel® Xeon Phi™ Processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Framework. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors; see the Legal Disclaimers at the end of this deck and http://www.intel.com/performance. *Other names and brands may be property of others.

[Chart: scaling efficiency (%) vs. number of Intel® Xeon Phi™ processor 7250 (68-core, 1.4 GHz, 16 GB) nodes, from 1 to 128, for OverFeat, AlexNet, VGG-A and GoogLeNet; chart callouts: 62 and 87]
[Chart: Time To Train (TTT) vs. batch size, with a sweet spot marked]
Multi-node Challenges

• Need to optimize both compute (iteration) and communication (weight updates)
• More nodes mean a larger total batch per iteration
  • Enough work for each node
• Optimized hyperparameters (e.g. batch size)
  • Time to train: increases with batch size
  • Accuracy: batch size impacts convergence and accuracy
• Communication overheads dominate if the per-node batch is small
  • e.g. total batch size = 1024:
    • 1024 nodes: batch size = 1 per node – communication dominates
    • 64 nodes: batch size = 16 per node – computation dominates
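The arithmetic behind the example, spelled out (the regime labels simply mirror the slide's two cases, not a general rule):

```python
TOTAL_BATCH = 1024
for nodes in (64, 1024):
    per_node = TOTAL_BATCH // nodes
    # Per-node batch sets how much compute each node has to amortize its
    # roughly constant per-iteration weight-update communication.
    regime = "computation dominates" if per_node >= 16 else "communication dominates"
    print(f"{nodes:4d} nodes -> batch {per_node:2d} per node: {regime}")
```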
Summary

• Don't be fooled by the performance of DL workloads when using unoptimized frameworks
• Significant performance headroom from optimization on Xeon and Xeon Phi
  • Close to 300x speedup in certain topologies
• Traditional vectorization and parallelization strategies apply
• Other unique performance challenges: hyperparameters, data layer, inter/intra-layer parallelization, etc.
• Call to action:
  • Try Intel optimized frameworks available today; more to come soon
Legal Disclaimers

• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers http://www.intel.com/products/processor_number
• Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
• Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
• Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
• SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjbb, SPECompG, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.
• TPC Benchmark, TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.
• No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Xeon® processor E7-8800/4800/2800 v2 product families or Intel® Itanium® 9500 series-based system (or follow-on generations of either). Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details. For systems also featuring Resilient System Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Run Sure Technology-enabled system, including an enabled Intel processor and enabled technology(ies). Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details. For systems also featuring Resilient Memory Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Run Sure Technology-enabled system, including an enabled Intel® processor and enabled technology(ies). Built-in reliability features available on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon configuration. Consult your system manufacturer for more details.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804