New Technology of NVIDIA (Using DGX-2)
Hyungon Ryu (유현곤 부장), Sr. Solutions Architect at NVIDIA


Page 1:

Hyungon Ryu (유현곤 부장)

Sr. Solutions Architect at NVIDIA

New Technology of NVIDIA (Using DGX-2)

Page 2:

Perf. in DGX-2

DGX-2 has 2 PetaFLOPS

...as one giant GPU: 81,920 CUDA cores, 512 GB HBM2

(= 16x Tesla V100, each with 5,120 CUDA cores and 32 GB HBM2)

Page 3:

$ nvidia-smi topo -m

G0 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 G14 G15 CPU Affinity

GPU0 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 0-23,48-71

GPU1 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 0-23,48-71

GPU2 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 0-23,48-71

GPU3 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 0-23,48-71

GPU4 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 0-23,48-71

GPU5 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 0-23,48-71

GPU6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 0-23,48-71

GPU7 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 0-23,48-71

GPU8 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 24-47,72-95

GPU9 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 24-47,72-95

GPU10 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 24-47,72-95

GPU11 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 24-47,72-95

GPU12 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 24-47,72-95

GPU13 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 24-47,72-95

GPU14 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 24-47,72-95

GPU15 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X 24-47,72-95

Comm. in DGX-2

NVSwitch

6 NVLink lanes per GPU

16 GPUs fully connected

2.4 TB/s of bisection bandwidth

$ nvidia-smi nvlink -s
GPU 0: Tesla V100-SXM3-32GB

Link 0: 25.781 GB/s

Link 1: 25.781 GB/s

Link 2: 25.781 GB/s

Link 3: 25.781 GB/s

Link 4: 25.781 GB/s

Link 5: 25.781 GB/s

All 96 lanes form a non-blocking fat tree.
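Not from the slides: a minimal PyTorch sketch (assuming a reasonably recent PyTorch build) to confirm peer-to-peer access and roughly time a GPU-to-GPU copy, which on DGX-2 traverses NVLink/NVSwitch.

# Minimal sketch (assumption: recent PyTorch; not code from the slides).
# Checks P2P access and roughly measures a GPU0 -> GPU1 copy.
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

n = 256 * 1024 * 1024                      # 1 GiB of float32
src = torch.empty(n, device="cuda:0")
dst = torch.empty(n, device="cuda:1")

dst.copy_(src)                             # warm-up copy
torch.cuda.synchronize(0); torch.cuda.synchronize(1)
t0 = time.time()
dst.copy_(src)
torch.cuda.synchronize(0); torch.cuda.synchronize(1)
print(f"{(n * 4 / 2**30) / (time.time() - t0):.1f} GiB/s")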

Page 4:

0. NVSwitch Benefit: All-Reduce Benchmark

Used for exchanging Deep Learning parameters (gradients) during training.

Performance gains are largest for bulk transfers; the same benefit applies to Model Parallel workloads (see the sketch below).
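The benchmark code itself is not shown on the slide; below is a minimal stand-in for the same collective, assuming PyTorch with the NCCL backend (the allreduce_bench.py name and launch line are hypothetical).

# allreduce_bench.py -- minimal stand-in for an all-reduce benchmark
# (assumption: PyTorch with NCCL; launch one process per GPU, e.g.
#   python -m torch.distributed.launch --nproc_per_node=16 allreduce_bench.py)
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")           # reads rank/world size from env
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

x = torch.ones(64 * 1024 * 1024, device="cuda")   # 256 MB of float32
torch.cuda.synchronize()
dist.all_reduce(x)                                # summed across all GPUs via NVSwitch
torch.cuda.synchronize()
if rank == 0:
    print("all-reduce ok:", x[0].item() == dist.get_world_size())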

Page 5:

0. NVSwitch Benefit: 3D FFT

Page 6:

0. NVSwitch Benefit

Going from 8 GPUs to 16 GPUs, NVSwitch removes the I/O bottleneck, yielding more than a 4x performance improvement.

http://on-demand.gputechconf.com/gtc/2018/presentation/s8688-extending-the-connectivity-and-reach-of-the-gpu.pdf

Huge MLP, NMT, LM, RNN-T

Page 7:

The more I/O-bound the model, the larger the performance gain on DGX-2.

0. NVSwitch Benefit: FairSeq Model Parallel

http://on-demand.gputechconf.com/gtc/2018/presentation/s8688-extending-the-connectivity-and-reach-of-the-gpu.pdf

Page 8:

Real-World Performance Tests on DGX-2

Page 9:

Test Plan

0. Model Parallel
1. Image Classification: ResNet-50
2. Image GAN: Pix2PixHD
3. TTS: Tacotron2 + WaveNet / WaveGlow

Page 10:

DGX Data Center Reference Design Whitepaper

All DL training runs on NVIDIA Saturn V.

Allocated resources:
- DGX-2
- DGX-1 (16 GB)

Using NGC Docker images with custom builds for utilities.

All benchmarks were run at base clock; there is more room to optimize.

These numbers are not official NVIDIA performance data.

Saturn V - DGX POD

Page 11:

1. What is ResNet-50?

[Figure] Example images (DOG / CAT) and a Residual Block diagram.

Image classification: a deep learning model that solves the image-classification problem and is easily parallelized across multiple GPUs with Data Parallelism (a sketch follows below).

*Images from ImageNet
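The benchmark numbers on the next page come from an NGC image; as a hedge, here is only a generic PyTorch data-parallel training step illustrating the pattern, not the actual benchmark code.

# Generic data-parallel training step (a sketch, not the NGC benchmark).
# nn.DataParallel scatters the batch across all visible GPUs, replicates
# the model, gathers outputs, and reduces gradients back to GPU 0.
import torch
import torch.nn as nn
from torchvision import models

model = nn.DataParallel(models.resnet50()).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

images = torch.randn(256, 3, 224, 224, device="cuda")   # dummy ImageNet-sized batch
labels = torch.randint(0, 1000, (256,), device="cuda")

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")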

Page 12:

1. ResNet-50 Perf.

1 GPU (DGX-1, V100 16GB)

Step Epoch Img/sec Loss Total-Loss LR

390 390.0 686.5 0.001 0.997 0.00000

400 400.0 686.1 0.001 0.997 0.00000 *

8 GPUs (DGX-1, V100 16GB)

390 390.0 4619.3 0.000 0.980 0.00000

400 400.0 4648.1 0.000 0.980 0.00000

16 GPUs (DGX-2, V100 32GB)

390 12.9 11886.2 4.409 5.090 1.47065

400 13.2 11896.4 4.163 4.841 1.45808

[Chart] Img/sec: DGX-1 (1 GPU) 686.5; DGX-1 (8 GPUs) 4648.1; DGX-2 (16 GPUs) 11896.4 (2.5x over 8 GPUs).

➢ Perfect Scaling with Data Parallelism.

➢ With more memory (larger batches), 17.3x over a single GPU (see the arithmetic check below).

* On a Tesla V100 32GB, up to 780 img/s on a single GPU.

Benefit of 16 GPUs & 32 GB memory.

* The entire training was run at base clock.
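The speed-up figures follow directly from the measured img/sec; a quick check of the arithmetic, using the values from the table above:

# Speed-ups implied by the throughput table above.
one_gpu, eight_gpu, sixteen_gpu = 686.5, 4648.1, 11896.4
print(f"DGX-2 vs DGX-1 (8 GPUs): {sixteen_gpu / eight_gpu:.2f}x")  # ~2.56x (slide rounds to 2.5x)
print(f"DGX-2 vs 1 GPU:          {sixteen_gpu / one_gpu:.1f}x")    # ~17.3x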

Page 13:

1. GPU Util. on ResNet-50

+--------------------------------------------------------------------------------------------------+

| NVIDIA-SMI 410.33 Driver Version: 410.33 |

| N/A 54C P0 271W / 350W | 23596MiB / 32510MiB | 92% Default |

| N/A 53C P0 325W / 350W | 23596MiB / 32510MiB | 96% Default |

| N/A 69C P0 309W / 350W | 23596MiB / 32510MiB | 93% Default |

| N/A 67C P0 301W / 350W | 23596MiB / 32510MiB | 90% Default |

| N/A 56C P0 159W / 350W | 23596MiB / 32510MiB | 96% Default |

| N/A 70C P0 360W / 350W | 23591MiB / 32510MiB | 92% Default |

| N/A 55C P0 284W / 350W | 23596MiB / 32510MiB | 98% Default |

| N/A 68C P0 311W / 350W | 23596MiB / 32510MiB | 93% Default |

| N/A 53C P0 311W / 350W | 23596MiB / 32510MiB | 92% Default |

| N/A 53C P0 140W / 350W | 23596MiB / 32510MiB | 98% Default |

| N/A 68C P0 328W / 350W | 23596MiB / 32510MiB | 94% Default |

| N/A 68C P0 319W / 350W | 23596MiB / 32510MiB | 90% Default |

| N/A 53C P0 249W / 350W | 23596MiB / 32510MiB | 97% Default |

| N/A 53C P0 313W / 350W | 23596MiB / 32510MiB | 92% Default |

| N/A 67C P0 291W / 350W | 23596MiB / 32510MiB | 93% Default |

| N/A 67C P0 296W / 350W | 23596MiB / 32510MiB | 94% Default |

+----------------------------------------+------------------------------+-------------------------+

GPU utilization during ResNet-50 training: all GPUs run at 90-98% (work is well distributed).

Page 14:

2. GAN: Pix2PixHD. Training a GAN for high-quality image synthesis takes a long time (roughly one week).

Model High-Quality Setup Time to Train

Pix2PixHD 2 stages for 1080p ~ few days

Vid2Vid 2 stages + FlowNet ~ few days

OpenAI Glow Single Stage 40 GPUs for 18K epochs

PGGAN multiple Stages ~ 2 weeks

[Figure] Input data for synthesis (a segmentation map) and the image synthesized from it by the GAN model (Pix2PixHD).

Page 15:

2. Perf. of Pix2PixHD

With DGX-2, GAN training that otherwise takes 56 hours (more than 2 days) completes within 8 hours.

[Chart] Speed-up vs. GPU count (1, 2, 4, 8, 16); measured sec/epoch: 1319, 781, 460, 283, 184.

saving the latest model (epoch 139, total_steps 411000)

(epoch: 139, iters: 2268, time: 0.142) G_GAN: 1.447 G_GAN_Feat: 2.043 G_VGG: 0.798 D_real: 0.479 D_fake: 0.087

(epoch: 139, iters: 2568, time: 0.148) G_GAN: 0.994 G_GAN_Feat: 1.399 G_VGG: 0.773 D_real: 0.293 D_fake: 0.307

(epoch: 139, iters: 2868, time: 0.153) G_GAN: 1.094 G_GAN_Feat: 1.542 G_VGG: 0.873 D_real: 0.323 D_fake: 0.305

End of epoch 139 / 200 Time Taken: 461 sec

(epoch: 200, iters: 1360, time: 0.098) G_GAN: 1.325 G_GAN_Feat: 1.638 G_VGG: 0.801 D_real: 0.230 D_fake: 0.224

(epoch: 200, iters: 1760, time: 0.091) G_GAN: 1.336 G_GAN_Feat: 1.669 G_VGG: 0.728 D_real: 0.175 D_fake: 0.220

(epoch: 200, iters: 2160, time: 0.089) G_GAN: 1.371 G_GAN_Feat: 1.961 G_VGG: 0.906 D_real: 0.223 D_fake: 0.208

(epoch: 200, iters: 2560, time: 0.084) G_GAN: 1.298 G_GAN_Feat: 1.458 G_VGG: 0.724 D_real: 0.235 D_fake: 0.211

(epoch: 200, iters: 2960, time: 0.089) G_GAN: 1.325 G_GAN_Feat: 1.905 G_VGG: 0.867 D_real: 0.189 D_fake: 0.227

batchSize: 12

beta1: 0.5

checkpoints_dir: ./checkpoints

continue_train: False

data_type: 32

dataroot: ./datasets/cityscapes/

debug: False

display_freq: 100

display_winsize: 512

feat_num: 3

fineSize: 512

gpu_ids: [0, 1, 2, 3]

Training logs: roughly a 7.1x speed-up (1319 → 184 sec/epoch). A multi-GPU launch sketch assembled from these options follows.
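A plausible 16-GPU launch, assembled from the option names dumped in the log above; the label2city experiment name is hypothetical, and the flags assume the public NVIDIA/pix2pixHD train.py.

# Hypothetical 16-GPU launch for pix2pixHD, built from the options in the
# training log above (assumption: public NVIDIA/pix2pixHD train.py flags).
import subprocess

subprocess.run([
    "python", "train.py",
    "--name", "label2city",                 # hypothetical experiment name
    "--dataroot", "./datasets/cityscapes/",
    "--checkpoints_dir", "./checkpoints",
    "--batchSize", "12",
    "--fineSize", "512",
    "--gpu_ids", ",".join(str(i) for i in range(16)),  # 0,1,...,15
], check=True)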

Page 16:

2. GPU Util. on Pix2PixHD

Tue Oct 30 06:21:05 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.33 Driver Version: 410.33 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM3... On | 00000000:34:00.0 Off | 0 |
| N/A 51C P0 273W / 350W | 29886MiB / 32510MiB | 100% Default |
| N/A 52C P0 344W / 350W | 27774MiB / 32510MiB | 99% Default |
| N/A 61C P0 97W / 350W | 27806MiB / 32510MiB | 100% Default |
| N/A 62C P0 371W / 350W | 27748MiB / 32510MiB | 87% Default |
| N/A 47C P0 86W / 350W | 30222MiB / 32510MiB | 85% Default |
| N/A 58C P0 91W / 350W | 27782MiB / 32510MiB | 63% Default |
| N/A 47C P0 84W / 350W | 27896MiB / 32510MiB | 1% Default |
| N/A 59C P0 207W / 350W | 27748MiB / 32510MiB | 19% Default |
| N/A 51C P0 349W / 350W | 27844MiB / 32510MiB | 40% Default |
| N/A 52C P0 344W / 350W | 27762MiB / 32510MiB | 89% Default |
| N/A 62C P0 336W / 350W | 27742MiB / 32510MiB | 99% Default |
| N/A 62C P0 272W / 350W | 30062MiB / 32510MiB | 100% Default |
| N/A 54C P0 293W / 350W | 27864MiB / 32510MiB | 100% Default |
| N/A 49C P0 83W / 350W | 27806MiB / 32510MiB | 89% Default |
| N/A 60C P0 92W / 350W | 27742MiB / 32510MiB | 72% Default |
| N/A 58C P0 87W / 350W | 27768MiB / 32510MiB | 8% Default |
+-------------------------------+----------------------+----------------------+

➢ At 512p, the problem cannot fully utilize the performance of 16 GPUs.
➢ Of Cityscapes (fine annotations, 5,000 images, 414 MB), only 91 MB + 76 MB is training data; the rest is the test and validation sets.
➢ 128 batch x 16 GPUs = 2,048 images.

Page 17:

2. Result Comparison

Results after training for the same amount of time:

8 GPUs (DGX-1): more training is still needed to synthesize vehicle side views accurately.
DGX-2 (16 GPUs): accurately synthesizes the side views of small vehicles.

Page 18:

3. TTS (Speech Synthesis)

Text (sentence) → Tacotron2 (Txt2Mel) → Mel spectrum → WaveNet (Mel2Wav) → Waveform (voice)

Example text: "This will was a deliberate forgery"

One stage takes about 14 days to train*, the other about 10 days.

* With multiple speakers (male/female, individual voices), training time increases by a multiple.

For high-quality speech synthesis, both stages (T2M, M2W) are needed: about 3 weeks of training.

Page 19:

3.1 TACOTRON2

https://github.com/NVIDIA/tacotron2

https://github.com/yhgon/tacotron2 Patch for NGC in DGX-1, DGX-2

Stage 1 (T2M, Txt2Mel): Text (sentence) → Tacotron2 → Mel spectrum. Example text: "This will was a deliberate forgery".

Supports multi-GPU data parallelism and FP16 (mixed-precision) training.

The T2M stage is the mel-generation stage. Tacotron2 improves on Google's Tacotron: given text-audio pairs as training data, it learns to generate a simplified (quantized) spectrum (the mel spectrum) that captures pronunciation and vocalization. Baidu's Deep Voice is a similar model. A distributed FP16 launch sketch follows.
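A sketch of the distributed FP16 launch, assuming the multiproc helper and hparam names of the NVIDIA/tacotron2 repository linked above (verify against the patched fork's README).

# Sketch: distributed FP16 Tacotron2 training launch (assumption: the
# multiproc helper and hparam names of github.com/NVIDIA/tacotron2).
import subprocess

subprocess.run([
    "python", "-m", "multiproc", "train.py",
    "--output_directory", "outdir",
    "--log_directory", "logdir",
    "--hparams", "distributed_run=True,fp16_run=True,batch_size=96",
], check=True)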

Page 20:

3.1 T2 performance results: on DGX-2, about 5 hours of training suffices to generate mels for speech synthesis; about 10 hours yields high-quality synthesis.

tr_wavs GPUs batch s/iter target(epochs) hours days

12500 1 48 3.0 1000 217.0 9.04

12500 4 48 3.3 1000 58.8 2.45

12500 8 48 4.0 1000 36.2 1.51

12500 16 48 3.1 1000 14.0 0.58

12500 16 96 3.4 1000 7.6 0.32

12500 16 192 4.4 1000 4.9 0.21

[Chart] Speed-up by GPU count (1, 4, 8, 16). *FP16, distributed run.

Train loss 82 2.123065 Grad Norm 1.548152 4.59s/it
Train loss 83 2.179360 Grad Norm 3.380926 4.46s/it

Log

Benefit from 32 GB memory & 16 GPUs. (The hours column is estimated as sketched below.)
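How the hours column appears to be derived (reverse-engineered from the table, so treat the formula as an assumption): iterations per epoch = tr_wavs / (GPUs x batch), multiplied by s/iter and the 1000-epoch target.

# Reverse-engineered estimate for the "hours" column above (assumption:
# target = 1000 epochs, iters/epoch = tr_wavs / (gpus * batch)).
def train_hours(wavs, gpus, batch, s_per_iter, target_epochs=1000):
    iters_per_epoch = wavs / (gpus * batch)
    return s_per_iter * iters_per_epoch * target_epochs / 3600.0

print(f"{train_hours(12500, 1, 48, 3.0):.1f} h")   # ~217.0 h, matches the 1-GPU row
print(f"{train_hours(12500, 16, 96, 3.4):.1f} h")  # ~7.7 h; table rounds to 7.6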

Page 21:

3.1 T2: Audio Quality vs. Training

[Figure] Original vs. generated mel spectrograms and attention alignments at iteration 1K, 32K, and 300K (DGX-2).

Page 22:

3.2 WaveNet

Stage 2 (M2W, Mel2Wav): Mel spectrum → WaveNet → Waveform (voice)

The M2W stage is called the vocoder. WaveNet, proposed by Google DeepMind, is a high-quality deep-learning speech-synthesis model that generates audio autoregressively via dilated convolutions. Griffin-Lim and the WORLD vocoder also exist, as do models such as WaveGAN and WaveRNN.

- Google DeepMind model: only the paper and results are available.
- Google NSynth WaveNet: medium quality and slow; 18 minutes to generate 1 second of audio (unsuitable for speech).
- Ryuichi's WaveNet: high quality but slow; about 2 weeks to train, about 20 minutes to generate 1 second of audio.
- nv-wavenet: fast synthesis model (supports multi-GPU training); generates 1 second of audio in under 1 second, trading audio quality for speed.

Page 23:

3.2 WN performance results: trainable in about 8 hours on DGX-2; high-quality speech needs 16+ hours of training.

* 32 batch per GPU (using 30 GB of memory)

[Chart] sec/epoch: 1 GPU 63.4; 16 GPUs 9.6; 16 GPUs (32 batch) 3.72.

Log

Iter 138 : reduced loss : 3.005531311 0.43s/it
Iter 139 : reduced loss : 2.985509396 0.44s/it 63.35s/epoch
Iter 110 : reduced loss : 3.051481247 1.74s/it
Iter 111 : reduced loss : 3.024585962 1.71s/it 3.72s/epoch
Iter 6 : reduced loss : 5.374511242 0.45s/it
Iter 7 : reduced loss : 5.339864254 0.45s/it 9.60s/epoch

samples batch GPUs iter/epoch sec/epoch target hours days

12500 8 1 1563 63.4 100000 140.8 5.9

12500 8 16 98 9.6 100000 21.3 0.9

12500 32 16 24 3.72 100000 8.3 0.3 *

Page 24:

3.2 Util. for WaveNet

Sun Nov 4 18:49:23 2018

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 410.33 Driver Version: 410.33 |

|-------------------------------+----------------------+----------------------+

| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| N/A 46C P0 320W / 350W | 8432MiB / 32510MiB | 95% Default |

| N/A 42C P0 253W / 350W | 8432MiB / 32510MiB | 100% Default |

| N/A 45C P0 282W / 350W | 8432MiB / 32510MiB | 95% Default |

| N/A 48C P0 240W / 350W | 8432MiB / 32510MiB | 93% Default |

| N/A 41C P0 224W / 350W | 8432MiB / 32510MiB | 94% Default |

| N/A 44C P0 222W / 350W | 8432MiB / 32510MiB | 99% Default |

| N/A 42C P0 272W / 350W | 8432MiB / 32510MiB | 100% Default |

| N/A 47C P0 264W / 350W | 8432MiB / 32510MiB | 100% Default |

| N/A 43C P0 295W / 350W | 8432MiB / 32510MiB | 100% Default |

| N/A 44C P0 327W / 350W | 8434MiB / 32510MiB | 100% Default |

| N/A 46C P0 245W / 350W | 8432MiB / 32510MiB | 94% Default |

| N/A 46C P0 223W / 350W | 8432MiB / 32510MiB | 95% Default |

| N/A 44C P0 272W / 350W | 8432MiB / 32510MiB | 94% Default |

| N/A 41C P0 236W / 350W | 8432MiB / 32510MiB | 99% Default |

| N/A 46C P0 247W / 350W | 8432MiB / 32510MiB | 100% Default |

| N/A 45C P0 235W / 350W | 8432MiB / 32510MiB | 100% Default |

+-------------------------------+----------------------+----------------------+

Sun Nov 4 19:06:57 2018

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 410.33 Driver Version: 410.33 |

|-------------------------------+----------------------+----------------------+

| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| N/A 46C P0 218W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 44C P0 190W / 350W | 30498MiB / 32510MiB | 100% Default |

| N/A 49C P0 259W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 48C P0 279W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 45C P0 208W / 350W | 30498MiB / 32510MiB | 100% Default |

| N/A 48C P0 308W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 46C P0 199W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 51C P0 267W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 46C P0 267W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 47C P0 234W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 50C P0 300W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 50C P0 201W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 47C P0 295W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 44C P0 268W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 51C P0 310W / 350W | 30496MiB / 32510MiB | 100% Default |

| N/A 50C P0 295W / 350W | 30496MiB / 32510MiB | 100% Default |

+-------------------------------+----------------------+----------------------+

GPUs fully utilized for WaveNet (first snapshot: batch 8; second snapshot: batch 32).

Page 25:

3.2 WaveNet: Audio Quality vs. Training

[Audio] arctic_a0010.wav: the original voice vs. Ryuichi WaveNet output at 10K, 100K, 200K, and 250K iterations.

Page 26:

https://arxiv.org/abs/1811.00002

"Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation."

3.3 WaveGlow

Inference: WaveNet takes 10 minutes to generate 1 second of audio; WaveGlow takes 0.2 seconds.

The paper and samples have been published.

The OpenAI Glow approach is applied to training the DeepMind WaveNet model; for inference, invertible (INV) 1x1 convolutions are used. A sketch of the inference launch follows.
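A sketch of the published inference path, assuming the inference.py interface of github.com/NVIDIA/waveglow; flag names may differ by version, and the file names below are placeholders.

# Sketch: WaveGlow inference (assumption: the inference.py interface of
# github.com/NVIDIA/waveglow; file names are placeholders).
import subprocess

subprocess.run([
    "python3", "inference.py",
    "-f", "mel_files.txt",             # text file listing mel-spectrogram .pt tensors
    "-w", "waveglow_256channels.pt",   # pretrained WaveGlow checkpoint
    "-o", "outputs/",
    "--is_fp16",                       # FP16 inference on V100 tensor cores
    "-s", "0.6",                       # sampling sigma
], check=True)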

Page 27:

3.2 WG performance results: with DGX-2, about 4 hours suffices to train high-quality speech synthesis from mels.

System Mem wavs batch gpu s/iter iter sec/ep Hr day

DGX-1 16GB 13050 3 1 1.46 4350 6351 68 2.81

DGX-1 16GB 13050 3 8 1.59 544 865 10 0.40

DGX-2 32GB 13050 3 1 1.46 4350 6351 68 2.81

DGX-2 32GB 13050 6 1 2.16 2175 4698 50 2.09

DGX-2 32GB 13050 8 1 2.82 1631 4600 49 2.04

DGX-2 32GB 13050 3 8 1.87 544 1017 11 0.46

DGX-2 32GB 13050 6 8 2.64 272 718 8 0.33

DGX-2 32GB 13050 8 8 3.32 204 677 7 0.31

DGX-2 32GB 13050 3 16 1.92 272 522 6 0.25

DGX-2 32GB 13050 6 16 2.85 136 387 5 0.19

DGX-2 32GB 13050 8 16 3.65 102 372 4 0.18

(Column groups: Configure = System through gpu; Measure = s/iter through sec/ep; Estimate = Hr, day)

[Chart] Time to train: 1 GPU* 67 Hr; 1 GPU** 67 Hr; 8 GPUs** 11.1 Hr; 16 GPUs** 6.0 Hr; 16 GPUs*** 4.3 Hr.

* DGX-1, 3 batch
** DGX-2, 3 batch
*** DGX-2, 8 batch

Page 28:

3.2 Util. for WaveGlow

Tue Nov 6 12:01:28 2018

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 410.33 Driver Version: 410.33 |

|-------------------------------+----------------------+----------------------+

| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| N/A 54C P0 348W / 350W | 27296MiB / 32510MiB | 81% Default |

| N/A 48C P0 90W / 350W | 27194MiB / 32510MiB | 73% Default |

| N/A 63C P0 160W / 350W | 27194MiB / 32510MiB | 96% Default |

| N/A 65C P0 343W / 350W | 27194MiB / 32510MiB | 85% Default |

| N/A 52C P0 348W / 350W | 27194MiB / 32510MiB | 99% Default |

| N/A 70C P0 351W / 350W | 27194MiB / 32510MiB | 99% Default |

| N/A 51C P0 242W / 350W | 27194MiB / 32510MiB | 93% Default |

| N/A 67C P0 294W / 350W | 27194MiB / 32510MiB | 55% Default |

| N/A 51C P0 337W / 350W | 27194MiB / 32510MiB | 68% Default |

| N/A 51C P0 89W / 350W | 27194MiB / 32510MiB | 69% Default |

| N/A 68C P0 352W / 350W | 27194MiB / 32510MiB | 78% Default |

| N/A 65C P0 102W / 350W | 27194MiB / 32510MiB | 63% Default |

| N/A 49C P0 124W / 350W | 27194MiB / 32510MiB | 100% Default |

| N/A 46C P0 198W / 350W | 27194MiB / 32510MiB | 82% Default |

| N/A 67C P0 342W / 350W | 27194MiB / 32510MiB | 98% Default |

| N/A 67C P0 355W / 350W | 27194MiB / 32510MiB | 99% Default |

+-------------------------------+----------------------+----------------------+

GPU Utilization for WaveGlow (8 batch, 16-GPU test)

This run used an older NGC image with PyTorch 0.4.0; with PyTorch 0.4.1+, further performance improvement is expected, as with NV-WaveNet.

Page 29:

3.3 Quality Comparison

[Audio] LJ001-0015 (9.2 sec): real voice vs. WaveGlow, WaveNet, and Griffin-Lim.

https://nv-adlr.github.io/WaveGlow

Page 30:

3. TTS (Speech Synthesis): Comparison

Sample LJ018-0227, "This Will was a deliberate forgery", against the real voice.

M T2 + GL (Griffin-Lim):
- Inference speed: real time
- Quality: cannot recover the information lost in the mel (256/4096 compression) during synthesis
- Training: no training needed

M T2 + WN (WaveNet):
- Inference speed: 25 min (K80); about 2 hours on CPU
- Quality: after 320K training iterations, recovers the information from the mel for high-quality 22 kHz synthesis
- Training: about 2 weeks on 1 GPU; no benefit from multi-GPU parallelism

NV T2 + WG (WaveGlow):
- Inference speed: real time (V100)
- Quality: after 540K+ training iterations, recovers the information from the mel for high-quality 22 kHz synthesis
- Training: under 1 day on 16 GPUs; 16-GPU scaling guaranteed

Reproduction: https://github.com/yhgon/KR_AICONF_2018/blob/master/TTS_ryuich_taco2%2BWaveNnet.ipynb
Tacotron2: https://github.com/Rayhane-mamah/Tacotron-2
WaveNet: https://github.com/r9y9/wavenet_vocoder

Page 31:

Summary

0. Model Parallel
1. Image Classification: ResNet-50 (Data Parallel)
2. Image GAN: Pix2PixHD (Data Parallel)
3. TTS: Tacotron2 + WaveNet / WaveGlow (Data Parallel)

Materials with detailed instructions for reproduction will be distributed as appendix attachments; request them in the survey or email [email protected].

https://github.com/yhgon/KR_AICONF_2018

Page 32:

NVIDIA Booth for Demos (Video)

Developer Meet-up in December 2018

Page 33:

SEOUL | NOVEMBER 7-8, 2018

www.nvidia.com/ko-kr/ai-conference/