Scaling Deep Learning - NVIDIA (images.nvidia.com/.../SC5108-scaling-deep-learning...)


Page 1:

Scaling Deep Learning

Bryan Catanzaro

@ctnzr

Page 2:

What do we want AI to do?

Drive us to work

Serve drinks?

Help us communicate

帮助我们沟通 ("Help us communicate")

Keep us organized

Help us find things

Guide us to content

Page 3:

Image Q&A (Baidu IDL)

Sample questions and answers

Page 4:

Medical Diagnostics App (Baidu BDL)

AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.

Page 5:

Progress in AI

Idea → Code → Test → Idea

Latency from idea to idea is the limiting factor

Page 6:

Why Deep Learning?

1. Scale Matters – Bigger models usually win

2. Data Matters – More data means less cleverness necessary

3. Productivity Matters – Teams with better tools can try out more ideas

[Figure: accuracy vs. data & compute; deep learning keeps improving with scale, while many previous methods plateau.]

Page 7:

Scaling up

• Make progress on AI by focusing on systems

– Make models bigger

– Tackle more data

– Reduce research cycle time

• Accelerate large-scale experiments

Page 8:

Training Deep Neural Networks

• Computation dominated by dot products

• GEMM (Compute bound!)

• Convolutional layers even more compute bound

20 Exaflops to train one model
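The GEMM framing above can be sketched in a few lines of numpy. This is an illustrative sketch (toy shapes, not SVAIL's Majel code): a fully connected layer's forward pass is one matrix multiply, and its cost is roughly two floating-point operations per inner-product term.

```python
import numpy as np

def fc_forward(W, x):
    """Forward pass of a fully connected layer: a single GEMM."""
    return W @ x  # (m, k) @ (k, n) -> (m, n)

def gemm_flops(m, n, k):
    """A GEMM of shapes (m, k) @ (k, n) does ~2*m*n*k floating-point ops
    (one multiply and one add per inner-product term)."""
    return 2 * m * n * k

# Toy layer: 2048 outputs, 1024 inputs, minibatch of 64 (illustrative sizes).
W = np.random.randn(2048, 1024).astype(np.float32)
x = np.random.randn(1024, 64).astype(np.float32)
y = fc_forward(W, x)
flops = gemm_flops(2048, 64, 1024)
```

Summing such counts over every layer and every minibatch of a full training run is how a figure like "20 exaflops per model" arises.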

Page 9:

Natural User Interfaces

• Goal: Make interacting with computers as natural as interacting with humans

• AI problems:

– Speech recognition

– Emotion recognition

– Semantic understanding

– Dialog systems

– Speech synthesis

Page 10:

End-to-end speech with Deep Learning

• Deep neural network predicts characters directly from audio

[Figure: stacked network layers map audio frames to the output character sequence "T H _ E … D O G".]

Note: Language model is separate.

Page 11:

Bidirectional Recurrent Network

• RNNs model temporal dependence

• Various flavors used in many applications

– Especially time series data

• Sequential dependence complicates parallelism
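A minimal numpy sketch of a bidirectional RNN forward pass (illustrative, not the actual speech model): one recurrence runs forward in time, one runs backward, and the two hidden states are combined at every timestep. Note that each loop iteration depends on the previous hidden state, which is the sequential dependence that complicates parallelism.

```python
import numpy as np

def birnn_forward(xs, Wx, Wh_f, Wh_b):
    """xs: (T, d) inputs; Wx: (d, h) input weights;
    Wh_f, Wh_b: (h, h) recurrent weights for each direction."""
    T = xs.shape[0]
    h = Wh_f.shape[0]
    h_f = np.zeros((T, h))
    h_b = np.zeros((T, h))
    prev = np.zeros(h)
    for t in range(T):                # forward in time: step t needs step t-1
        prev = np.tanh(xs[t] @ Wx + prev @ Wh_f)
        h_f[t] = prev
    prev = np.zeros(h)
    for t in reversed(range(T)):      # backward in time: step t needs step t+1
        prev = np.tanh(xs[t] @ Wx + prev @ Wh_b)
        h_b[t] = prev
    # Combine both directions (concatenation is one common choice).
    return np.concatenate([h_f, h_b], axis=1)

rng = np.random.default_rng(0)
T, d, h = 5, 3, 4
out = birnn_forward(rng.standard_normal((T, d)),
                    rng.standard_normal((d, h)) * 0.1,
                    rng.standard_normal((h, h)) * 0.1,
                    rng.standard_normal((h, h)) * 0.1)
```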

Page 12:

Connectionist Temporal Classification

• How to connect speech data with transcription?

– Use the CTC loss function, from [Graves 06]

• Efficient dynamic programming over all possible alignments computes the error of an {audio, transcription} pair

• GPU implementation uses ModernGPU + custom kernels to get a 10-30X speedup over a simple OpenMP implementation

[Figure: many possible frame-level alignments collapse to the transcription "T H _ E … D O G".]
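The collapsing rule on the decoding side can be sketched in a few lines (the training loss itself is the dynamic program over all alignments from [Graves 06], not shown here): merge repeated characters, then drop the blank symbol '_'.

```python
def ctc_collapse(path, blank='_'):
    """Collapse a frame-level CTC path into a transcription:
    merge repeated characters, then remove blanks."""
    out = []
    prev = None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return ''.join(out)

print(ctc_collapse("TTH__E"))  # -> THE
```

Many distinct frame-level paths collapse to the same transcription, which is why the loss sums over all of them; the blank also lets the network emit genuinely doubled letters, since "D_O_OG" collapses to "DOOG" while "DOOG" collapses to "DOG".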

Page 13:

SVAIL Infrastructure

• Hardware:

– Tyan FT77CB7079 servers (http://www.tyan.com)

– 8 × NVIDIA GeForce GTX Titan X per node

– Mellanox FDR InfiniBand

– ~5 Petaflops, SP

• Software: CUDA, MPI, Majel (SVAIL library)

Page 14:

Parallelism

[Figure: model parallelism splits one model across GPUs; data parallelism gives each worker its own shard of the training data and synchronizes gradients with MPI_Allreduce().]

For these models, data parallelism works best
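The data-parallel step can be sketched without a cluster (a toy sketch: the MPI call is replaced by a plain numpy mean so it runs anywhere): each "worker" computes gradients on its own shard, then an allreduce leaves every replica holding the same averaged gradient, so all replicas apply the same update.

```python
import numpy as np

def allreduce_mean(grads):
    """Stand-in for MPI_Allreduce with a sum, followed by division by the
    number of ranks: every worker receives the same averaged gradient."""
    avg = np.mean(grads, axis=0)
    return [avg.copy() for _ in grads]

rng = np.random.default_rng(1)
# 8 workers, each with its own gradient from its own data shard.
per_worker_grads = [rng.standard_normal(4) for _ in range(8)]
synced = allreduce_mean(per_worker_grads)
```

After the allreduce, the replicas stay bit-identical, which is also what makes the data-parallel scheme easy to reason about.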

Page 15:

Performance for RNN training

• 55% of GPU FMA peak using a single GPU

• ~48% of peak using 8 GPUs in one node

• Weak scaling very efficient, albeit algorithmically challenged

[Figure: sustained TFLOP/s (1-512, log scale) vs. number of GPUs (1-128, log scale), with one-node and multi-node regimes and a typical training run marked.]

Page 16:

Scalability

• Batch size is hard to increase – limited by the algorithm and by memory

• Performance at small batch sizes (32, 64) leads to scalability limits

Page 17:

Determinism

• Determinism is very important

• With so much randomness, it is hard to tell if you have a bug

• Networks train despite bugs, although accuracy is impaired

• Reproducibility is important

– For the usual scientific reasons

– Progress is not possible without reproducibility
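The first ingredient of reproducibility is explicit seeding, sketched below in numpy (a toy illustration; real determinism also requires deterministic kernels and a fixed reduction order, which this does not capture): two runs with the same seed produce bit-identical initial weights.

```python
import numpy as np

def init_weights(seed, shape=(3, 3)):
    """Seeded weight initialization: same seed -> bit-identical weights."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

w1 = init_weights(1234)
w2 = init_weights(1234)
# w1 and w2 are bit-identical, so a training run can be replayed exactly
# and any divergence between two runs points at a real bug.
```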

Page 18:

Precision

• FP32 works

– No need for FP64

• FP16 also works

– Use FP32 for softmax and weight updates

[Figure: weight distribution; histogram of weight magnitudes (powers of two, 2^-31 to 2^0) vs. count (log scale, 1 to 10^8).]
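The "use FP32 for weight updates" recipe can be illustrated with a toy numeric example (illustrative values, not the actual training code): a small FP16 gradient step is rounded away entirely when the weight itself is stored in FP16, but survives in an FP32 master copy.

```python
import numpy as np

master_w = np.float32(1.0)   # FP32 master weight
grad = np.float16(1e-4)      # small gradient, stored in FP16
lr = np.float32(0.1)

# Updating directly in FP16 loses the tiny step: 1.0 - 1e-5 rounds back
# to 1.0, because FP16 has only ~11 bits of significand near 1.0.
w16 = np.float16(master_w)
w16_updated = np.float16(w16 - np.float16(lr) * grad)

# Updating the FP32 master copy preserves the step.
master_w = master_w - lr * np.float32(grad)
```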

Page 19:

FP16 HGEMM for deployment

• We batch, but n is still small

• Custom kernels for HGEMM help

– 2-2.5X more performance at small batches

[Figure: Tflop/s (0.0-0.5) vs. outer dimension n of x in Ax = b (1-10), comparing nervana and baidu kernels; measured on a Quadro K1200 (1.1 Tflop peak, 45W).]
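A back-of-envelope calculation shows why small n needs custom kernels (a rough sketch that ignores caching and tiling): a GEMM of shapes (m, k) @ (k, n) does 2·m·n·k flops but must move roughly (m·k + k·n + m·n) values, so as n shrinks the flops-per-byte ratio collapses and the kernel becomes memory bound.

```python
def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """Flops per byte of a (m, k) @ (k, n) GEMM; 2 bytes/element for FP16."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

big = arithmetic_intensity(2048, 2048, 2048)   # square training-style GEMM
small = arithmetic_intensity(2048, 4, 2048)    # deployment-style tiny batch
```

With illustrative sizes, the square GEMM has hundreds of flops per byte while the n=4 case has only a handful, which is why stock GEMM kernels leave 2-2.5X on the table at small batches.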

Page 20:

Conclusion

• Deep Learning is extreme HPC

• Systems matter a lot for deep learning

• We favor dense clusters of GPUs for training

• Custom software makes it efficient

– 50 Tflops sustained

• GPUs work for deployment as well

• Thanks to Andrew Ng, Adam Coates, Awni Hannun, Patrick LeGresley … and all of SVAIL

Bryan Catanzaro

@ctnzr