Scaling Deep Learning
TRANSCRIPT
Bryan Catanzaro
What do we want AI to do?
Drive us to work?
Serve drinks?
Help us communicate
帮助我们沟通 (Chinese: "Help us communicate")
Keep us organized
Help us find things
Guide us to content
Image Q&A (Baidu IDL)
Sample questions and answers
Medical Diagnostics App (Baidu BDL)
AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.
Progress in AI
Idea → Code → Test → Idea
Latency from idea to idea is the limiting factor
Why Deep Learning?
1. Scale Matters – Bigger models usually win
2. Data Matters – More data means less cleverness necessary
3. Productivity Matters – Teams with better tools can try out more ideas
[Plot: Accuracy vs. Data & Compute, comparing Deep Learning against many previous methods]
Scaling up
• Make progress on AI by focusing on systems
– Make models bigger
– Tackle more data
– Reduce research cycle time
• Accelerate large-scale experiments
Training Deep Neural Networks
• Computation dominated by dot products
• GEMM (Compute bound!)
• Convolutional layers even more compute bound
20 Exaflops to train one model
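Concretely, the forward pass of a fully-connected layer is a single matrix-matrix multiply, so it maps directly onto one GEMM call. A minimal sketch using cuBLAS; the wrapper name and sizes are illustrative, not from the talk:

```cpp
#include <cublas_v2.h>

// Forward pass of one fully-connected layer as a single GEMM:
//   Y = W * X, with W [outputs x inputs], X [inputs x batch], Y [outputs x batch].
// cuBLAS assumes column-major storage, hence the leading dimensions below.
void fc_forward(cublasHandle_t handle,
                const float* d_W, const float* d_X, float* d_Y,
                int outputs, int inputs, int batch) {
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                outputs, batch, inputs,
                &alpha,
                d_W, outputs,   // A: outputs x inputs
                d_X, inputs,    // B: inputs  x batch
                &beta,
                d_Y, outputs);  // C: outputs x batch
}
```

The larger the batch dimension, the higher the arithmetic intensity of the multiply, which is why GEMM-dominated training is compute bound rather than bandwidth bound.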
Natural User Interfaces
• Goal: Make interacting with computers as natural as interacting with humans
• AI problems:
– Speech recognition
– Emotion recognition
– Semantic understanding
– Dialog systems
– Speech synthesis
End-to-end speech with Deep Learning
• Deep neural network predicts characters directly from audio
T H _ E … D O G
Note: Language model is separate.
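The per-frame character predictions become text with the simplest possible decoder: take the most likely character at each frame, collapse consecutive repeats, and drop the blank symbol. A hedged sketch (the '_' blank follows the slide's example; the helper name is my illustration, and the real pipeline rescores candidates with the separate language model):

```cpp
#include <string>
#include <vector>

// Greedy (best-path) decoding of per-frame character predictions:
// take the argmax character at each frame, collapse consecutive
// repeats, then drop the blank symbol.
std::string greedy_decode(const std::vector<char>& frame_argmax, char blank = '_') {
    std::string out;
    char prev = blank;
    for (char c : frame_argmax) {
        if (c != prev && c != blank) out.push_back(c);
        prev = c;
    }
    return out;  // e.g. {'T','T','H','_','E'} -> "THE"
}
```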
Bidirectional Recurrent Network
• RNNs model temporal dependence
• Various flavors used in many applications
– Especially time series data
• Sequential dependence complicates parallelism
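For reference, one common formulation of the bidirectional recurrence (notation mine, not from the slides). The forward state depends on the previous time step and the backward state on the next, which is exactly the sequential dependence that complicates parallelism:

```latex
\begin{aligned}
\overrightarrow{h}_t &= f\bigl(W^{(f)} x_t + U^{(f)} \overrightarrow{h}_{t-1} + b^{(f)}\bigr)\\
\overleftarrow{h}_t  &= f\bigl(W^{(b)} x_t + U^{(b)} \overleftarrow{h}_{t+1} + b^{(b)}\bigr)\\
y_t &= g\bigl(V\,[\overrightarrow{h}_t;\,\overleftarrow{h}_t] + c\bigr)
\end{aligned}
```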
Connectionist Temporal Classification
• How to connect speech data with transcription?
– Use the CTC loss function, from [Graves 06]
• Efficient dynamic programming over all possible alignments computes the error of an {audio, transcription} pair (see the sketch below)
• The GPU implementation uses ModernGPU plus custom kernels for a 10-30× speedup over a simple OpenMP implementation
[Diagram: which audio frames align to "T H _ E … D O G"?]
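The dynamic program behind CTC is the forward (alpha) recursion over the label sequence with blanks interleaved. A simplified log-space CPU reference, roughly the kind of simple baseline the GPU kernels are compared against (variable names and the CPU-only structure are my illustration of [Graves 06], not SVAIL's code):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

const double NEG_INF = -std::numeric_limits<double>::infinity();

double log_add(double a, double b) {  // log(exp(a) + exp(b)), stably
    if (a == NEG_INF) return b;
    if (b == NEG_INF) return a;
    double m = std::max(a, b);
    return m + std::log1p(std::exp(std::min(a, b) - m));
}

// CTC forward pass: log P(label | audio) from per-frame log-probs.
// logp[t][k] = log-prob of symbol k at frame t; 'blank' is the blank index.
double ctc_log_prob(const std::vector<std::vector<double>>& logp,
                    const std::vector<int>& label, int blank) {
    int T = logp.size();
    // Extended label with blanks interleaved: _ l1 _ l2 _ ... _ lL _
    std::vector<int> ext;
    ext.push_back(blank);
    for (int l : label) { ext.push_back(l); ext.push_back(blank); }
    int S = ext.size();

    std::vector<double> alpha(S, NEG_INF), next(S);
    alpha[0] = logp[0][ext[0]];             // start on the leading blank...
    if (S > 1) alpha[1] = logp[0][ext[1]];  // ...or on the first symbol

    for (int t = 1; t < T; ++t) {
        for (int s = 0; s < S; ++s) {
            double a = alpha[s];                                   // stay
            if (s >= 1) a = log_add(a, alpha[s - 1]);              // advance
            if (s >= 2 && ext[s] != blank && ext[s] != ext[s - 2])
                a = log_add(a, alpha[s - 2]);                      // skip a blank
            next[s] = a + logp[t][ext[s]];
        }
        alpha.swap(next);
    }
    // Valid paths end on the last symbol or the trailing blank.
    return S > 1 ? log_add(alpha[S - 1], alpha[S - 2]) : alpha[S - 1];
}
```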
SVAIL Infrastructure
• Hardware:
– 8 × NVIDIA GeForce GTX Titan X per node (Tyan FT77CB7079 chassis¹)
– Mellanox FDR InfiniBand
– ~5 petaflops aggregate, single precision
• Software: CUDA, MPI, Majel (SVAIL library)
¹ Image: http://www.tyan.com, FT77CB7079 Service Engineer's Manual
Parallelism
[Diagram: model parallelism splits one model across GPUs; data parallelism gives each replica its own shard of the training data and synchronizes gradients with MPI_Allreduce()]
For these models, Data Parallelism works best
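In the data-parallel scheme, each worker computes gradients on its own shard of the batch, then all workers combine them with a single collective. A minimal sketch of that synchronization step (buffer name illustrative):

```cpp
#include <mpi.h>

// Average gradients across all data-parallel workers in place.
// 'grad' holds this worker's local gradient; after the call every
// worker holds the same averaged gradient and applies the same update.
void allreduce_gradients(float* grad, int n) {
    int world;
    MPI_Comm_size(MPI_COMM_WORLD, &world);
    MPI_Allreduce(MPI_IN_PLACE, grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < n; ++i)
        grad[i] /= world;  // turn the sum into a mean
}
```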
Performance for RNN training
• 55% of GPU FMA peak using a single GPU
• ~48% of peak using 8 GPUs in one node
• Weak scaling is very efficient, albeit algorithmically challenged: it requires growing the overall batch size
[Plot: sustained TFLOP/s (1 to 512) vs. number of GPUs (1 to 128), single-node and multi-node regimes, with a typical training run marked]
Scalability
• Batch size is hard to increase (algorithmic and memory limits)
• Performance at small batch sizes (32, 64) therefore sets the scalability limit
Determinism
• Determinism is very important
• With so much randomness, it is hard to tell whether you have a bug
• Networks train despite bugs, although accuracy is impaired
• Reproducibility is important
– For the usual scientific reasons
– Progress is not possible without reproducibility (a minimal seeding sketch follows)
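One small, hedged ingredient of reproducibility is pinning every source of randomness to an explicit seed so a run can be replayed, e.g. for GPU-side weight initialization with cuRAND (the generator choice and wrapper are my illustration; fixed reduction order inside kernels is needed as well):

```cpp
#include <curand.h>

// Create a GPU RNG with a fixed seed so weight initialization is
// reproducible across runs. Philox is a counter-based generator,
// convenient for reproducible parallel streams.
curandGenerator_t make_seeded_generator(unsigned long long seed) {
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    return gen;
}
```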
Precision
• FP32 works
– No need for FP64
• FP16 also works
– Use FP32 for softmax and the weight updates (see the sketch after the histogram)
[Histogram: weight magnitude distribution; count on a log scale (1 to 10^8) vs. magnitude exponent from -31 to 0]
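A standard way to realize "FP32 for the weight updates" is to keep an FP32 master copy of each weight, update it in FP32, and round back to FP16 for the fast math. A hedged sketch of such an update kernel (it follows the stated recipe, but the kernel itself is my illustration, not SVAIL's code):

```cpp
#include <cuda_fp16.h>

// SGD step with FP32 master weights: gradients arrive in FP16,
// the update accumulates in FP32 (so small updates are not lost),
// and the FP16 copy used by the GEMMs is refreshed from the master.
__global__ void sgd_fp16_update(float* w_master, __half* w_half,
                                const __half* grad, float lr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        w_master[i] -= lr * __half2float(grad[i]);  // FP32 weight update
        w_half[i] = __float2half(w_master[i]);      // rounded copy for FP16 math
    }
}
```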
FP16 HGEMM for deployment
• We batch, but n is still small
• Custom kernels for HGEMM help
– 2-2.5× more performance at small batch sizes
[Plot: Tflop/s (0.0 to 0.5) vs. outer dimension n of x in Ax = b (n = 1 to 10), nervana vs. baidu kernels, measured on a Quadro K1200 (1.1 Tflop peak, 45 W)]
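For context, the stock path the custom kernels are compared against is a single half-precision GEMM per batch. A minimal cuBLAS baseline sketch (dimension names follow the slide's Ax = b; the wrapper is illustrative):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Deployment-time layer as an FP16 GEMM, B = A * x, where the batch
// size n (columns of x) is small (1-10). Custom kernels reportedly
// beat this stock call by 2-2.5x at such small n.
void hgemm_layer(cublasHandle_t handle,
                 const __half* d_A, const __half* d_x, __half* d_b,
                 int m, int k, int n) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha,
                d_A, m,   // A: m x k
                d_x, k,   // x: k x n
                &beta,
                d_b, m);  // b: m x n
}
```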
Conclusion
• Deep Learning is extreme HPC
• Systems matter a lot for deep learning
• We favor dense clusters of GPUs for training
• Custom software makes it efficient
– 50 Tflops sustained
• GPUs work for deployment as well
• Thanks to Andrew Ng, Adam Coates, Awni Hannun, Patrick LeGresley … and all of SVAIL
Bryan Catanzaro
@ctnzr