NNabla on Minsky: Introduction to the Operation and Performance Verification

© Copyright IBM Corporation 2017

NNabla on Minsky: Operation and Performance Verification

- Benchmark Results Report -

IBM Japan, Ltd. / Systems Hardware Division / OSS Solutions / 中島康裕


Installing NNabla on Minsky

NNabla can be installed on Minsky by building it from source. Two GitHub repositories are involved:

• sony/nnabla : the main NNabla module

• sony/nnabla-ext-cuda : the module for GPU support

1. Download the source from GitHub
   => Clone both repositories.

2. Build a wheel from sony/nnabla and install it

   # pip install -U nnabla-<package version>-<package-arch>.whl
   (the whl file built from source)

3. Build a wheel from sony/nnabla-ext-cuda and install it
   => Multi-GPU support requires the "WITH_NCCL=ON" option at build time. Without it, the build runs on a single GPU only.

   # pip install -U nnabla_ext_cuda-<package version>-<package-arch>.whl
   (the whl file built from source)

Build tip!!
PowerAI provides an NCCL build optimized for Minsky. For this verification, the CMakeLists.txt of nnabla-ext-cuda was edited so that this NCCL is used explicitly.

nnabla-ext-cuda/src/nbla/cuda/CMakeLists.txt

Line 25: $ENV{NCCL_HOME}/build/include → /opt/DL/nccl/include
Line 29: $ENV{NCCL_HOME}/build/lib/libnccl.so → /opt/DL/nccl/lib/libnccl.so
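The two path substitutions in the build tip can also be applied by script rather than by hand. The following is a minimal sketch, not part of the original procedure: the helper names `patch_nccl_paths` and `patch_file` are hypothetical, and it assumes the two `$ENV{NCCL_HOME}` strings appear in the file exactly as written on the slide.

```python
from pathlib import Path

# Substitutions taken from the slide: the NCCL_HOME-based paths are replaced
# with the location of the NCCL that PowerAI installs.
REPLACEMENTS = [
    ("$ENV{NCCL_HOME}/build/include", "/opt/DL/nccl/include"),
    ("$ENV{NCCL_HOME}/build/lib/libnccl.so", "/opt/DL/nccl/lib/libnccl.so"),
]


def patch_nccl_paths(text: str) -> str:
    """Return the CMake text with the NCCL paths rewritten (hypothetical helper)."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


def patch_file(path: str) -> bool:
    """Rewrite the file in place; return True if anything changed (hypothetical helper)."""
    cmake_file = Path(path)
    original = cmake_file.read_text()
    patched = patch_nccl_paths(original)
    if patched != original:
        cmake_file.write_text(patched)
    return patched != original


if __name__ == "__main__":
    target = "nnabla-ext-cuda/src/nbla/cuda/CMakeLists.txt"
    if Path(target).exists():
        print("patched" if patch_file(target) else "nothing to change")
    else:
        print(f"{target} not found; run this from the directory holding the clone")
```

Matching on the path strings instead of on line numbers 25 and 29 keeps the sketch usable if the file shifts by a few lines between revisions.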


CIFAR-10: NNabla v0.9.4 Operation Verification on Minsky

• Verification environment
- Server : S822LC for HPC (Minsky), i.e. S822LC for High Performance Computing
- CPU : 10-core 2.860 GHz (3.492 GHz turbo) x 2
- SMT : 1 (the equivalent of Hyper-Threading; configurable up to 8 threads per core)
- GPU : P100 x 4
- Memory : 256GB
- Local Disk : SSD 960GB
- OS : Ubuntu16.04.2

• Installed NNabla
- Version : v0.9.4
- Installed modules
  nnabla (0.9.4.post32+g7c5c75d)
  nnabla-ext-cuda (0.9.4.post18+g859268c)
- Test script
  1. multi_device_multi_process_classification.py
- Command (Case : 1GPU/Batch 32/300epoch)
  1. mpirun --allow-run-as-root -n 1 python multi_device_multi_process_classification.py --context "cuda.cudnn" -b 32 --max-iter 468600 --val-iter 312
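The `--max-iter` and `--val-iter` values in the command follow from the dataset size, the batch size and the epoch count. The sketch below shows the arithmetic, assuming the standard CIFAR-10 split (50,000 training and 10,000 test images) and assuming partial batches are dropped (integer division). The slides do not state how the values were derived, so the batch 64 and 128 rows are only what the same rule would give.

```python
CIFAR10_TRAIN = 50_000  # standard CIFAR-10 training set size
CIFAR10_TEST = 10_000   # standard CIFAR-10 test set size
EPOCHS = 300            # from the case description "300epoch"


def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of full batches in one pass over the data (assumes partial batches are dropped)."""
    return num_samples // batch_size


def iteration_flags(batch_size: int, epochs: int = EPOCHS) -> dict:
    """Candidate --max-iter / --val-iter values for a given batch size."""
    return {
        "max_iter": iterations_per_epoch(CIFAR10_TRAIN, batch_size) * epochs,
        "val_iter": iterations_per_epoch(CIFAR10_TEST, batch_size),
    }


if __name__ == "__main__":
    for batch_size in (32, 64, 128):
        flags = iteration_flags(batch_size)
        print(f"batch={batch_size}: --max-iter {flags['max_iter']} --val-iter {flags['val_iter']}")
```

For batch size 32 this reproduces the values in the command (468600 and 312).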

[Figure: CPU-GPU data transfer]
- "Minsky" (POWER8 + NVLink) : CPU-GPU data transfer over NVLink, 80 GB/s
- x86 servers (PCIe) : CPU-GPU data transfer over PCIe, 32 GB/s
=> NVLink gives 2.5x the bandwidth


<Reference> CIFAR-10


CIFAR-10: Error vs. Iteration

[Figure: Error (y axis, 0 to 0.8) against Iteration (x axis) for GPU1, GPU2 and GPU4. Panels: Batch_size 32 (iterations 1562 to 465476) and Batch_size 64 (iterations 781 to 232738), each with an enlarged view of the early iterations (Batch_size 32 up to iteration 45298, Batch_size 64 up to iteration 22649).]


CIFAR-10: Error vs. Iteration

[Figure: Error (y axis, 0 to 0.8) against Iteration (x axis) for GPU1, GPU2 and GPU4. Panels: Batch_size 128 (iterations 390 to 113490), with an enlarged view of the early iterations (up to iteration 11310).]


CIFAR-10: Throughput

[Figure: Throughput (y axis, 0 to 5) against number of GPUs (x axis, 1 to 4). Series: Ideal, Batch_size=32, Batch_size=64, Batch_size=128. Data labels shown: 1, 1.835114303, 3.081221549.]

        Batch_size=32/worker     Batch_size=64/worker     Batch_size=128/worker
        Time(s)    Throughput    Time(s)    Throughput    Time(s)    Throughput
GPU1    6212.25    1             4357.83    1             3438.19    1
GPU2    3403.44    1.83          2374.69    1.84          1866.90    1.84
GPU4    2010.94    3.09          1414.32    3.08          1114.79    3.08

Throughput here is the speedup relative to the 1-GPU time at the same batch size (1-GPU time divided by N-GPU time).
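The Throughput columns in the CIFAR-10 table can be recomputed from the Time(s) columns. The sketch below does that and also derives a parallel efficiency (speedup divided by GPU count); the efficiency figure is not in the original slides and is added here only as a convenient way to read the same numbers.

```python
# Times in seconds copied from the CIFAR-10 table,
# keyed by per-worker batch size, then by number of GPUs.
TIMES = {
    32: {1: 6212.25, 2: 3403.44, 4: 2010.94},
    64: {1: 4357.83, 2: 2374.69, 4: 1414.32},
    128: {1: 3438.19, 2: 1866.90, 4: 1114.79},
}


def speedup(times_by_gpu: dict) -> dict:
    """Speedup relative to the 1-GPU run: t(1 GPU) / t(N GPUs)."""
    base = times_by_gpu[1]
    return {n: base / t for n, t in times_by_gpu.items()}


def efficiency(times_by_gpu: dict) -> dict:
    """Parallel efficiency: speedup divided by the number of GPUs (derived, not in the slides)."""
    return {n: s / n for n, s in speedup(times_by_gpu).items()}


if __name__ == "__main__":
    for batch_size, times_by_gpu in TIMES.items():
        s = speedup(times_by_gpu)
        e = efficiency(times_by_gpu)
        for n in sorted(times_by_gpu):
            print(f"batch={batch_size} gpus={n} speedup={s[n]:.2f} efficiency={e[n]:.0%}")
```

Rounded to two decimals, the computed speedups match the table: about 1.83 to 1.84 on 2 GPUs and 3.08 to 3.09 on 4 GPUs.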


ImageNet (ResNet34/50): NNabla v0.9.4 Operation Verification on Minsky

• Verification environment
- Server : S822LC for HPC (Minsky), i.e. S822LC for High Performance Computing
- CPU : 10-core (3.990 GHz turbo) x 2
- SMT : 1 (the equivalent of Hyper-Threading; configurable up to 8 threads per core)
- GPU : P100 x 4
- Memory : 512GB
- Local Disk : SSD 960GB / Data area : NVMe 3.2TB
- OS : Ubuntu16.04.2

• Installed NNabla
- Version : v0.9.4
- Installed modules
  nnabla (0.9.4.post94+g4e6e2d1)
  nnabla-ext-cuda (0.9.4.post42+g0659db6)
- Test script
  1. multi_device_multi_process_classification.py
- Command (Case : 1GPU/Batch 32/1epoch / ResNet50)
  (If the network is not specified by an option, the script runs ResNet34.)
  1. mpirun --allow-run-as-root -n 1 python multi_device_multi_process_classification.py -b 32 -a 4 -L 50 -c cuda.cudnn -T train_cache -V val_cache --val-interval 10016 --max-iter 10016 --val-iter 1562
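The slide gives only the 1-GPU command. A small helper can assemble the same command line for other GPU counts. This is a sketch under one stated assumption: that the 2-GPU and 4-GPU runs differ only in the process count passed to `mpirun -n`. The slides do not show those commands, and the function name `build_command` is hypothetical. Every flag is copied from the 1-GPU command.

```python
import shlex


def build_command(num_gpus: int) -> list:
    """Assemble the ResNet50 / batch 32 command for a given number of MPI processes.

    Assumption: one process per GPU, and nothing but `-n` changes between runs.
    """
    return [
        "mpirun", "--allow-run-as-root", "-n", str(num_gpus),
        "python", "multi_device_multi_process_classification.py",
        "-b", "32", "-a", "4", "-L", "50",
        "-c", "cuda.cudnn",
        "-T", "train_cache", "-V", "val_cache",
        "--val-interval", "10016", "--max-iter", "10016", "--val-iter", "1562",
    ]


if __name__ == "__main__":
    for num_gpus in (1, 2, 4):
        # shlex.join quotes arguments where needed, so the line can be pasted into a shell.
        print(shlex.join(build_command(num_gpus)))
```

Building the argument list first and joining it only for display also lets the same list be passed directly to `subprocess.run` without shell quoting issues.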



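NNabla v0.9.4 Operation Verification on Minsky: ImageNet (ResNet34) Throughput

[Figure: Throughput (y axis, 0 to 5) against number of GPUs (x axis, 1 to 4). Series: Ideal, Batch=32, Batch=64, Batch=128.]

        Batch=32                 Batch=64                 Batch=128
        Time(s)    Throughput    Time(s)    Throughput    Time(s)    Throughput
GPU1    3439.57    1             3210.57    1             3056.79    1
GPU2    1787.70    1.92          1656.55    1.94          1569.75    1.95
GPU4    940.03     3.66          861.46     3.73          808.96     3.78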


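NNabla v0.9.4 Operation Verification on Minsky: ImageNet (ResNet50) Throughput

[Figure: Throughput (y axis, 0 to 5) against number of GPUs (x axis, 1 to 4). Series: Ideal, Batch=32, Batch=64.]

        Batch=32                 Batch=64                 Batch=128
        Time(s)    Throughput    Time(s)    Throughput    Time(s)    Throughput
GPU1    6515.62    1             6226.98    1             -          -
GPU2    3356.31    1.941         3186.38    1.95          -          -
GPU4    1746.193   3.73          1652.93    3.77          -          -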
