Scaling Deep Learning
TRANSCRIPT
Bryan Catanzaro
What do we want AI to do?
Drive us to work?
Serve drinks?
Help us communicate
帮助我们沟通 (Chinese: "Help us communicate")
Keep us organized
Help us find things
Guide us to content
Image Q&A (Baidu IDL)
Sample questions and answers
Medical Diagnostics App (Baidu BDL)
AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.
Progress in AI
Idea → Code → Test → Idea
Latency from idea to idea is the limiting factor
Why Deep Learning?
1. Scale Matters – Bigger models usually win
2. Data Matters – More data means less cleverness necessary
3. Productivity Matters – Teams with better tools can try out more ideas
[Plot: Accuracy vs. Data & Compute, comparing Deep Learning against many previous methods]
Scaling up
• Make progress on AI by focusing on systems
– Make models bigger
– Tackle more data
– Reduce research cycle time
• Accelerate large-scale experiments
Training Deep Neural Networks
• Computation dominated by dot products
• GEMM (Compute bound!)
• Convolutional layers even more compute bound
20 Exaflops to train one model
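Concretely, the forward pass of a fully-connected layer is a single matrix-matrix multiply, so it maps directly onto one GEMM call. A minimal sketch using cuBLAS; the wrapper name and sizes are illustrative, not from the talk:

```cpp
#include <cublas_v2.h>

// Forward pass of one fully-connected layer as a single GEMM:
//   Y = W * X, with W [outputs x inputs], X [inputs x batch], Y [outputs x batch].
// cuBLAS assumes column-major storage, hence the leading dimensions below.
void fc_forward(cublasHandle_t handle,
                const float* d_W, const float* d_X, float* d_Y,
                int outputs, int inputs, int batch) {
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                outputs, batch, inputs,
                &alpha,
                d_W, outputs,   // A: outputs x inputs
                d_X, inputs,    // B: inputs  x batch
                &beta,
                d_Y, outputs);  // C: outputs x batch
}
```

The larger the batch dimension, the higher the arithmetic intensity of the multiply, which is why GEMM-dominated training is compute bound rather than bandwidth bound.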
Natural User Interfaces
• Goal: Make interacting with computers as natural as interacting with humans
• AI problems:
– Speech recognition
– Emotion recognition
– Semantic understanding
– Dialog systems
– Speech synthesis
End-to-end speech with Deep Learning
• Deep neural network predicts characters directly from audio
T H _ E … D O G
Note: Language model is separate.
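The per-frame character predictions become text with the simplest possible decoder: take the most likely character at each frame, collapse consecutive repeats, and drop the blank symbol. A hedged sketch (the '_' blank follows the slide's example; the helper name is my illustration, and the real pipeline rescores candidates with the separate language model):

```cpp
#include <string>
#include <vector>

// Greedy (best-path) decoding of per-frame character predictions:
// take the argmax character at each frame, collapse consecutive
// repeats, then drop the blank symbol.
std::string greedy_decode(const std::vector<char>& frame_argmax, char blank = '_') {
    std::string out;
    char prev = blank;
    for (char c : frame_argmax) {
        if (c != prev && c != blank) out.push_back(c);
        prev = c;
    }
    return out;  // e.g. {'T','T','H','_','E'} -> "THE"
}
```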
Bidirectional Recurrent Network
• RNNs model temporal dependence
• Various flavors used in many applications
– Especially time series data
• Sequential dependence complicates parallelism
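For reference, one common formulation of the bidirectional recurrence (notation mine, not from the slides). The forward state depends on the previous time step and the backward state on the next, which is exactly the sequential dependence that complicates parallelism:

```latex
\begin{aligned}
\overrightarrow{h}_t &= f\bigl(W^{(f)} x_t + U^{(f)} \overrightarrow{h}_{t-1} + b^{(f)}\bigr)\\
\overleftarrow{h}_t  &= f\bigl(W^{(b)} x_t + U^{(b)} \overleftarrow{h}_{t+1} + b^{(b)}\bigr)\\
y_t &= g\bigl(V\,[\overrightarrow{h}_t;\,\overleftarrow{h}_t] + c\bigr)
\end{aligned}
```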
Connectionist Temporal Classification
• How to connect speech data with transcription?
– Use the CTC loss function, from [Graves 06]
• Efficient dynamic programming over all possible alignments computes the error of an {audio, transcription} pair (see the sketch below)
• The GPU implementation uses ModernGPU plus custom kernels for a 10-30× speedup over a simple OpenMP implementation
[Diagram: which audio frames align to "T H _ E … D O G"?]
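The dynamic program behind CTC is the forward (alpha) recursion over the label sequence with blanks interleaved. A simplified log-space CPU reference, roughly the kind of simple baseline the GPU kernels are compared against (variable names and the CPU-only structure are my illustration of [Graves 06], not SVAIL's code):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

const double NEG_INF = -std::numeric_limits<double>::infinity();

double log_add(double a, double b) {  // log(exp(a) + exp(b)), stably
    if (a == NEG_INF) return b;
    if (b == NEG_INF) return a;
    double m = std::max(a, b);
    return m + std::log1p(std::exp(std::min(a, b) - m));
}

// CTC forward pass: log P(label | audio) from per-frame log-probs.
// logp[t][k] = log-prob of symbol k at frame t; 'blank' is the blank index.
double ctc_log_prob(const std::vector<std::vector<double>>& logp,
                    const std::vector<int>& label, int blank) {
    int T = logp.size();
    // Extended label with blanks interleaved: _ l1 _ l2 _ ... _ lL _
    std::vector<int> ext;
    ext.push_back(blank);
    for (int l : label) { ext.push_back(l); ext.push_back(blank); }
    int S = ext.size();

    std::vector<double> alpha(S, NEG_INF), next(S);
    alpha[0] = logp[0][ext[0]];             // start on the leading blank...
    if (S > 1) alpha[1] = logp[0][ext[1]];  // ...or on the first symbol

    for (int t = 1; t < T; ++t) {
        for (int s = 0; s < S; ++s) {
            double a = alpha[s];                                   // stay
            if (s >= 1) a = log_add(a, alpha[s - 1]);              // advance
            if (s >= 2 && ext[s] != blank && ext[s] != ext[s - 2])
                a = log_add(a, alpha[s - 2]);                      // skip a blank
            next[s] = a + logp[t][ext[s]];
        }
        alpha.swap(next);
    }
    // Valid paths end on the last symbol or the trailing blank.
    return S > 1 ? log_add(alpha[S - 1], alpha[S - 2]) : alpha[S - 1];
}
```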
SVAIL Infrastructure
• Hardware:
– 8 × NVIDIA GeForce GTX Titan X per node (Tyan FT77CB7079 chassis¹)
– Mellanox FDR InfiniBand
– ~5 petaflops aggregate, single precision
• Software: CUDA, MPI, Majel (SVAIL library)
¹ Image: http://www.tyan.com, FT77CB7079 Service Engineer's Manual
Parallelism
[Diagram: model parallelism splits one model across GPUs; data parallelism gives each replica its own shard of the training data and synchronizes gradients with MPI_Allreduce()]
For these models, Data Parallelism works best
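In the data-parallel scheme, each worker computes gradients on its own shard of the batch, then all workers combine them with a single collective. A minimal sketch of that synchronization step (buffer name illustrative):

```cpp
#include <mpi.h>

// Average gradients across all data-parallel workers in place.
// 'grad' holds this worker's local gradient; after the call every
// worker holds the same averaged gradient and applies the same update.
void allreduce_gradients(float* grad, int n) {
    int world;
    MPI_Comm_size(MPI_COMM_WORLD, &world);
    MPI_Allreduce(MPI_IN_PLACE, grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < n; ++i)
        grad[i] /= world;  // turn the sum into a mean
}
```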
Performance for RNN training
• 55% of GPU FMA peak using a single GPU
• ~48% of peak using 8 GPUs in one node
• Weak scaling is very efficient, albeit algorithmically challenged: it requires growing the overall batch size
[Plot: sustained TFLOP/s (1 to 512) vs. number of GPUs (1 to 128), single-node and multi-node regimes, with a typical training run marked]
Scalability
• Batch size is hard to increase (algorithmic and memory limits)
• Performance at small batch sizes (32, 64) therefore sets the scalability limit
Determinism
• Determinism is very important
• With so much randomness, it is hard to tell whether you have a bug
• Networks train despite bugs, although accuracy is impaired
• Reproducibility is important
– For the usual scientific reasons
– Progress is not possible without reproducibility (a minimal seeding sketch follows)
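One small, hedged ingredient of reproducibility is pinning every source of randomness to an explicit seed so a run can be replayed, e.g. for GPU-side weight initialization with cuRAND (the generator choice and wrapper are my illustration; fixed reduction order inside kernels is needed as well):

```cpp
#include <curand.h>

// Create a GPU RNG with a fixed seed so weight initialization is
// reproducible across runs. Philox is a counter-based generator,
// convenient for reproducible parallel streams.
curandGenerator_t make_seeded_generator(unsigned long long seed) {
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    return gen;
}
```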
Precision
• FP32 works
– No need for FP64
• FP16 also works
– Use FP32 for softmax and the weight updates (see the sketch after the histogram)
[Histogram: weight magnitude distribution; count on a log scale (1 to 10^8) vs. magnitude exponent from -31 to 0]
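A standard way to realize "FP32 for the weight updates" is to keep an FP32 master copy of each weight, update it in FP32, and round back to FP16 for the fast math. A hedged sketch of such an update kernel (it follows the stated recipe, but the kernel itself is my illustration, not SVAIL's code):

```cpp
#include <cuda_fp16.h>

// SGD step with FP32 master weights: gradients arrive in FP16,
// the update accumulates in FP32 (so small updates are not lost),
// and the FP16 copy used by the GEMMs is refreshed from the master.
__global__ void sgd_fp16_update(float* w_master, __half* w_half,
                                const __half* grad, float lr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        w_master[i] -= lr * __half2float(grad[i]);  // FP32 weight update
        w_half[i] = __float2half(w_master[i]);      // rounded copy for FP16 math
    }
}
```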
FP16 HGEMM for deployment
• We batch, but n is still small
• Custom kernels for HGEMM help
– 2-2.5× more performance at small batch sizes
[Plot: Tflop/s (0.0 to 0.5) vs. outer dimension n of x in Ax = b (n = 1 to 10), nervana vs. baidu kernels, measured on a Quadro K1200 (1.1 Tflop peak, 45 W)]
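For context, the stock path the custom kernels are compared against is a single half-precision GEMM per batch. A minimal cuBLAS baseline sketch (dimension names follow the slide's Ax = b; the wrapper is illustrative):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Deployment-time layer as an FP16 GEMM, B = A * x, where the batch
// size n (columns of x) is small (1-10). Custom kernels reportedly
// beat this stock call by 2-2.5x at such small n.
void hgemm_layer(cublasHandle_t handle,
                 const __half* d_A, const __half* d_x, __half* d_b,
                 int m, int k, int n) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha,
                d_A, m,   // A: m x k
                d_x, k,   // x: k x n
                &beta,
                d_b, m);  // b: m x n
}
```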
Conclusion
• Deep Learning is extreme HPC
• Systems matter a lot for deep learning
• We favor dense clusters of GPUs for training
• Custom software makes it efficient
– 50 Tflops sustained
• GPUs work for deployment as well
• Thanks to Andrew Ng, Adam Coates, Awni Hannun, Patrick LeGresley … and all of SVAIL
Bryan Catanzaro
@ctnzr