openacc cudaによる gpuコンピューティングgpu computing 5 gpu コンピューティング...

Akira Naruse, 19th Jul. 2018

OpenACC・CUDAによるGPUコンピューティング

自己紹介

▪ 成瀬彰 (Naruse, Akira)

▪ 2013年～: NVIDIA、シニア・デベローパーテクノロジー・エンジニア

▪ 1996～2013年: 富士通研究所、研究員など

▪ 専門・興味: 並列処理、性能最適化、スパコン・HPC、 GPUコンピューティング、 DeepLearning、…

▪ 詳しくは…

github.com/anaruse

AGENDA

▪ GPU Computing

▪ Volta GPUs

▪ OpenACC

▪ CUDA

GPU Computing

GPUコンピューティングLow latency + High throughput

CPU GPU

アプリケーション実行

アプリケーション・コード

並列部分はGPUで実行

計算の重い部分

逐次部分はCPU上で実行

Do i=1,N

End do

GPUの構造 (TESLA P100)

64 CUDA core/SM

大量のCUDAコア並列性が鍵

56 SM/chip

3584 CUDA core/chip

GPU APPLICATIONS

数百のアプリケーションがGPUに対応www.nvidia.com/object/gpu-applications.html

LEADING APPLICATIONS

DL FRAMEWORKS

アプリをGPU対応する方法

Application

CUDAOpenACCLibrary

GPU対応ライブラリにチェンジ

簡単に開始主要処理をCUDAで記述

高い自由度既存コードにディレクティブを挿入

簡単に加速

NVIDIA LIBRARIES

cuRAND Thrust

AmgXcuBLAS

cuSPARSE cuSOLVER

Performance Primitives NCCL

nvGRAPH

TensorRT

PARTNER LIBRARIES

Computer Vision

Sparse direct solvers

Sparse Iterative MethodsLinear Algebra

Linear Algebra

Audio and Video Matrix, Signal and Image

Math, Signal and Image Linear Algebra

Computational Geometry

Real-time visual simulation

Application

Library

簡単に開始

CUDAOpenACC

主要処理をCUDAで記述

簡単に加速

void saxpy(int n,

float a,

float *x,

float *restrict y)

#pragma acc parallel copy(y[:n]) copyin(x[:n])

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

saxpy(N, 3.0, x, y);

SAXPY (Y=A*X+Y)

OpenMP OpenACC

void saxpy(int n,

float a,

float *x,

float *restrict y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

void saxpy(int n,

float a,

float *x,

float *restrict y)

#pragma omp parallel for

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

Application

Library OpenACC CUDA

簡単に開始主要処理をCUDAで記述

簡単に加速

__global__ void saxpy(int n, float a,

float *x, float *y)

int i = threadIdx.x + blodkDim.x * blockIdx;

if (i < n)

y[i] += a*x[i];

size_t size = sizeof(float) * N;

cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost);

SAXPY (Y=A*X+Y)

CPU CUDA

void saxpy(int n, float a,

float *x, float *y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

Volta GPUs

ance /

2012 20142008 2010 2016 2018

GPUロードマップ

TeslaFermi

Kepler

Maxwell

Pascal

TESLA V100の概要

Deep LearningとHPC、両方に最適なGPU

Volta Architecture

Most Productive GPU

Tensor Core

125 Programmable

TFLOPS Deep Learning

Improved SIMT Model

New Algorithms

Volta MPS

Inference Utilization

Improved NVLink &

Efficient Bandwidth

VOLTAHPC性能を大きく向上

P100に対する相対性能

HPCアプリケーション性能

System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.

Summit

Supercomputer

200+ PetaFlops

~3,400 Nodes

10 Megawatts

3+EFLOPSTensor Ops

AI Exascale Today

DIRAC FLASH GTC

HACC LSDALTON NAMD

NUCCOR NWCHEM QMCPACK

RAPTOR SPECFEM XGC

AcceleratedScience

10XPerf Over Titan

200 PF

Performance Leadership

VOLTA 米国トップスパコンのエンジンSUMMIT

5-10XApplication Perf Over Titan

トランジスタ数:21B 815 mm2

80 SM5120 CUDAコア640 Tensorコア

HBM232 GB, 900 GB/s

NVLink 300 GB/s

TESLA V100

*full GV100 chip contains 84 SMs

P100 V100 性能UP

トレーニング性能 10 TOPS 125 TOPS 12x

インファレンス性能 21 TFLOPS 125 TOPS 6x

FP64/FP32 5/10 TFLOPS 7.8/15.6 TFLOPS 1.5x

HBM2 バンド幅 720 GB/s 900 GB/s 1.2x

NVLink バンド幅 160 GB/s 300 GB/s 1.9x

L2 キャッシュ 4 MB 6 MB 1.5x

L1 キャッシュ 1.3 MB 10 MB 7.7x

GPUピーク性能比較: P100 vs v100

HBM2メモリ、使用効率UP

STREAM

Delivere

P100 V100

76% 95%

実効バンド幅1.5倍

V100 measured on pre-production hardware.

HBM2 stack

使用効率

VOLTA NVLINK

P100 V100

リンク数 4 6

バンド幅 / リンク

40 GB/s 50 GB/s

トータルバンド幅

160 GB/s 300 GB/s(*) バンド幅は双方向

VOLTA GV100 SM

FP32ユニット 64

FP64ユニット 32

INT32ユニット 64

Tensorコア 8

レジスタファイル 256 KB

統合L1・共有メモリ

128 KB

Activeスレッド 2048(*) SMあたり

VOLTA GV100 SM

命令セットを一新

スケジューラを2倍

命令発行機構をシンプルに

L1キャッシュの大容量・高速化

SIMTモデルの改善

テンソル計算の加速

最もプログラミングの簡単なSM

生産性の向上

OpenACC

OPENACCプログラミング

▪ 概要紹介

▪ プログラムのOpenACC化

▪ OpenACC化事例

OPENACC

Program myscience

... serial code ...

!$acc kernels

do k = 1,n1

do i = 1,n2

... parallel code ...

!$acc end kernels

... serial code …

End Program myscience

既存のC/Fortranコード

簡単: 既存のコードにコンパイラへのヒントを追加

強力: 相応の労力で、コンパイラがコードを自動で並列化

オープン: 複数コンパイラベンダが、様々なプロセッサをサポート

NVIDIA GPU, AMD GPU,

x86 CPU, …

ヒントの追加

GPUコンピューティング

アプリケーション・コード

並列部分はGPUで実行

計算の重い部分

逐次部分はCPU上で実行

Do i=1,N

End do

OpenACC

float *x, float *y)

if (i < n)

y[i] += a*x[i];

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

SAXPY (Y=A*X+Y)

CPU CUDA

float *x, float *y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

void saxpy(int n,

float a,

float *x,

float *restrict y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

SAXPY (y=a*x+y)

▪ omp → acc ▪ データの移動

OpenMP OpenACC

void saxpy(int n,

float a,

float *x,

float *restrict y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

subroutine saxpy(n, a, X, Y)

real :: a, Y(:), Y(:)

integer :: n, i

!$acc parallel copy(Y(:)) copyin(X(:))

do i=1,n

Y(i) = a*X(i)+Y(i)

!$acc end parallel

end subroutine saxpy

call saxpy(N, 3.0, x, y)

SAXPY (y=a*x+y, FORTRAN)

▪ FORTRANも同様

OpenMP OpenACC

subroutine saxpy(n, a, X, Y)

real :: a, X(:), Y(:)

integer :: n, i

!$omp parallel do

do i=1,n

Y(i) = a*X(i)+Y(i)

!$omp end parallel do

end subroutine saxpy

call saxpy(N, 3.0, x, y)

簡単にコンパイル

OpenMP / OpenACC

float *x,

float *restrict y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

$ pgcc -acc –Minfo=acc saxpy.csaxpy:

16, Generating present_or_copy(y[:n])

Generating present_or_copyin(x[:n])

Generating Tesla code

19, Loop is parallelizable

Accelerator kernel generated

19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

簡単に実行

OpenMP / OpenACC

float *x,

float *restrict y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

$ pgcc -Minfo -acc saxpy.csaxpy:

16, Generating present_or_copy(y[:n])

Generating present_or_copyin(x[:n])

$ nvprof ./a.out ==10302== NVPROF is profiling process 10302, command: ./a.out

==10302== Profiling application: ./a.out

==10302== Profiling result:

Time(%) Time Calls Avg Min Max Name

62.95% 3.0358ms 2 1.5179ms 1.5172ms 1.5186ms [CUDA memcpy HtoD]

31.48% 1.5181ms 1 1.5181ms 1.5181ms 1.5181ms [CUDA memcpy DtoH]

5.56% 268.31us 1 268.31us 268.31us 268.31us saxpy_19_gpu

▪ 概要紹介

事例: ヤコビ反復法

while ( error > tol ) {

error = 0.0;

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]));

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

A[j][i] = Anew[j][i];

A(i,j) A(i+1,j)A(i-1,j)

A(i,j-1)

A(i,j+1)

並列領域の指定 (parallel/kernelsディレクティブ)

▪ Parallel と Kernels

▪ Parallel

▪ OpenMPと親和性

▪ 開発者主体

▪ Kernels

▪ 複数kernelの生成

▪ コンパイラ主体

error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

error = max(error, abs(Anew[j][i] - A[j][i]);

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

[PGI tips] コンパイラメッセージ

$ pgcc –acc –Minfo=accel jacobi.cjacobi:

44, Generating copyout(Anew[1:4094][1:4094])

Generating copyin(A[:][:])

45, #pragma acc loop gang /* blockIdx.y */

49, Max reduction generated for error

並列領域の指定 (kernels)

▪ 並列領域の指定▪ Parallels と

Kernels

▪ Parallels

▪ 開発者主体

▪ Kernels

error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

$ pgcc -Minfo=acc -acc jacobi.cjacobi:

59, Generating present_or_copyout(Anew[1:4094][1:4094])

Generating present_or_copyin(A[:][:])

Max reduction generated for error

データの転送 (data clause)

▪ 並列領域の指定▪ Parallels と

Kernels

▪ Parallels

▪ 開発者主体

▪ Kernels

error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

$ pgcc -Minfo=acc -acc jacobi.cjacobi:

▪ copyin (Host→GPU)

▪ copyout (HostGPU)

▪ copy

▪ create

▪ present

error = 0.0;

#pragma acc kernels \

pcopyout(Anew[1:4094][1:4094]) pcopyin(A[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

pcopyout(A[1:4094][1:4094]) pcopyin(Anew[1:4094][1:4094])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

} N=M=4096

配列形状の指定

▪ 配列は、全要素でなく、一部だけ指定して転送することも可能▪ 注意: C/C++ と Fortranでは指定方法が異なる

▪ C/C++: array[ start : size ]float Anew[4096][4096]

pcopyout( Anew[1:4094][1:4094]) pcopyin( A[:][:]) )

▪ Fortran: array( start : end )real Anew(4096,4096)

pcopyout( Anew(2:4095, 2:4095) ) pcopyin( A(:,:) )

▪ copyin (Host→GPU)

▪ copyout (HostGPU)

▪ copy

▪ create

▪ present

error = 0.0;

pcopy(Anew[:][:]) pcopyin(A[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

} N=M=4096

[PGI tips] PGI_ACC_TIME

$ PGI_ACC_TIME=1 ./a.outAccelerator Kernel Timing data

/home/anaruse/src/OpenACC/jacobi/C/task1-solution/jacobi.c

jacobi NVIDIA devicenum=0

time(us): 649,886

44: data region reached 200 times

44: data copyin transfers: 800

device time(us): total=14,048 max=41 min=15 avg=17

53: data copyout transfers: 800

44: compute region reached 200 times

46: kernel launched 200 times

grid: [32x4094] block: [128]

device time(us): total=382,798 max=1,918 min=1,911 avg=1,913

elapsed time(us): total=391,408 max=1,972 min=1,953 avg=1,957

46: reduction kernel launched 200 times

grid: [1] block: [256]

elapsed time(us): total=53,510 max=280 min=266 avg=267

53: data region reached 200 times

統計データ

[PGI tips] PGI_ACC_NOTIFY

$ PGI_ACC_NOTIFY=3 ./a.out...

upload CUDA data file=/home/anaruse/src/OpenACC/jacobi/C/task1-

solution/jacobi.c function=jacobi line=44 device=0 variable=A bytes=16777216

launch CUDA kernel file=/home/anaruse/src/OpenACC/jacobi/C/task1-

solution/jacobi.c function=jacobi line=46 device=0 num_gangs=131008

num_workers=1 vector_length=128 grid=32x4094 block=128 shared memory=1024

download CUDA data file=/home/anaruse/src/OpenACC/jacobi/C/task1-

solution/jacobi.c function=jacobi line=53 device=0 variable=Anew bytes=16736272

トレースデータ

NVIDIA VISUAL PROFILER (NVVP)

NVVPによる解析: データ転送がボトルネック

1 cycle

kernel

利用率:低い

過剰なデータ転送

error = 0.0;

pcopy(Anew[:][:]) pcopyin(A[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

過剰なデータ転送

error = 0.0;

pcopy(Anew[:][:]) \

pcopyin(A[:][:])

pcopy(A[:][:]) \

pcopyin(Anew[:][:])

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Host GPU

copyin

copyout

データ領域の指定 (dataディレクティブ)

▪ copyin (CPU→GPU)

▪ copyout (CPUGPU)

▪ copy

▪ create

▪ present

#pragma acc data pcopy(A, Anew)

error = 0.0;

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

#pragma acc kernels pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

データ領域の指定 (dataディレクティブ)

▪ copyin (CPU→GPU)

▪ copyout (CPUGPU)

▪ copy

▪ create

▪ present

#pragma acc data pcopy(A) create(Anew)

error = 0.0;

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

#pragma acc kernels pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

適正なデータ転送

#pragma acc data \

pcopy(A) create(Anew)

error = 0.0;

pcopy(Anew[:][:]) \

pcopyin(A[:][:])

pcopy(A[:][:]) \

pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

copyin

copyout

Host GPU

データ転送が減少 (NVVP)

利用率:高い1 cycle

2つの処理

データ転送

計算オフロード

計算オフロード、データ転送、両方を考慮する必要がある

GPU MemoryCPU Memory

float *array;

Init( ) {

array = (float*)malloc( … );

input_array( array );

#pragma enter data copyin(array)

Fin( ) {

#pragma exit data copyout(array)

output_array( array );

free( array );

その他のデータ管理方法

▪ Enter data▪ Copyin

▪ Create

▪ Exit data▪ Copyout

▪ Delete

#pragma acc data pcopy(A,B)

for (k=0; k<LOOP; k++) {

#pragma acc kernels present(A,B)

for (i=0; i<N; i++) {

A[i] = subA(i,A,B);

#pragma acc update self(A[0:1])

output[k] = A[0];

A[N-1] = input[k];

#pragma acc update device(A[N-1:1])

#pragma acc kernels present(A,B)

for (i=0; i<N; i++) {

B[i] = subB(i,A,B);

▪ Update selfCPU GPU

▪ Update deviceCPU → GPU

▪ Unified Memoryの利用

▪ PGIコンパイラオプション: -acc –ta=…,managed,…

▪ プログラム実行中に動的に確保される配列は（C: malloc, Fortran: allocate）、Unified Memoryで管理される

▪ OpenACCのデータディレクティブで移動を指示する必要がない

リダクション(縮約計算)

error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

リダクション(縮約計算)

▪ 演算の種類+ 和

Max 最大

Min 最小

| ビット和

& ビット積

|| 論理和

&& 論理積

error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:

リダクション (REDUCTION CLAUSE)

▪ 演算種類(C/C++)

max 最大

min 最小

| ビット和

& ビット積

|| 論理和

&& 論理積

error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

並列方法の指示

error = 0.0;

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

並列方法の指示

error = 0.0;

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:

並列方法の指示 (loopディレクティブ)

▪ Gang

▪ Worker

▪ Vector … SIMD幅

▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

error = 0.0;

#pragma acc loop gang vector(1) reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop gang vector(128)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

実行条件設定 (gang, vector)

for (j = 0; j < 16; j++) {

for (i = 0; i < 16; i++) {

4 x 16

for (j = 1; j < 16; j++) {

for (i = 0; i < 16; i++) {

8 x 8 8 x 8

ループを融合 (collapse)

▪ Gang

▪ Worker

▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

▪ ...

error = 0.0;

#pragma acc loop reduction(max:error) \

collapse(2) gang vector(128)

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

ループを融合 (collapse)

▪ Gang

▪ Worker

▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

▪ ...

error = 0.0;

#pragma acc loop reduction(max:error) gang vector(128)

for (int ji = 0; ji < (N-2)*(M-2); ji++) {

j = (ji / (M-2)) + 1;

i = (ji % (M-2)) + 1;

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

並列実行可能(independent)

▪ Gang

▪ Worker

▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

▪ ...

error = 0.0;

#pragma acc loop reduction(max:error) independent

for (int jj = 1; jj < NN-1; jj++) {

int j = list_j[jj];

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;

逐次に実行 (seq)

▪ Gang

▪ Worker

▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

▪ ...

#pragma acc loop seq

for (int k = 3; k < NK-3; k++) {

#pragma acc loop

for (int j = 0; j < NJ; j++) {

#pragma acc loop

for (int i = 0; i < NI; i++) {

Anew[k][j][i] = func(

A[k-1][j][i], A[k-2][j][i], A[k-3][j][i],

A[k+1][j][i], A[k+2][j][i], A[k+3][j][i], ...

▪ 概要紹介

LSDalton

Quantum ChemistryAarhus University

12X speedup 1 week

PowerGrid

Medical ImagingUniversity of Illinois

40 days to2 hours

INCOMP3D

CFDNC State University

4X speedup

NekCEM

Comp ElectromagneticsArgonne National Lab

2.5X speedup60% less energy

CloverLeaf

Comp HydrodynamicsAWE

4X speedupSingle CPU/GPU code

MAESTROCASTRO

AstrophysicsStony Brook University

4.4X speedup4 weeks effort

FINE/Turbo

CFDNUMECA International

10X faster routines2X faster app

OPENACC ACCELERATES COMPUTATIONAL SCIENCE

Janus Juul Eriksen, PhD Fellow

qLEAP Center for Theoretical Chemistry, Aarhus University

OpenACC makes GPU computing approachable for

domain scientists. Initial OpenACC implementation

required only minor effort, and more importantly,

no modifications of our existing CPU implementation.

LSDALTON

Large-scale application for calculating high-accuracy molecular energies

Lines of Code

Modified

# of Weeks

Required

# of Codes to

Maintain

<100 Lines 1 Week 1 Source

Big Performance

7.9x8.9x

ALANINE-113 ATOMS

ALANINE-223 ATOMS

ALANINE-333 ATOMS

Speedup v

Minimal Effort

LS-DALTON CCSD(T) ModuleBenchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)

https://developer.nvidia.com/openacc/success-stories

OpenACC enabled us to

target routines for GPU

acceleration without

rewriting code, allowing us

to maintain portability on a

code that is 20-years old

David Gutzwiller, Head of HPC

NUMECA International

CHALLENGE• Accelerate 20 year old highly optimized code

on GPUs

SOLUTION• Accelerated computationally intensive routines

with OpenACC

RESULTS• Achieved 10x or higher speed-up on key

routines

• Full app speedup of up to 2x on the Oak Ridge

Titan supercomputer

• Total time spent on optimizing various routines

with OpenACC was just five person-months

NUMECA FINE/TURBOCommercial CFD Application

Brad Sutton, Associate Professor of Bioengineering and

Technical Director of the Biomedical Imaging Center

University of Illinois at Urbana-Champaign

“Now that we’ve seen how easy it is to program the

GPU using OpenACC and the PGI compiler, we’re

looking forward to translating more of our projects

CHALLENGE• Produce detailed and accurate brain images by applying

computationally intensive algorithms to MRI data

• Reduce reconstruction time to make diagnostic use possible

SOLUTION• Accelerated MRI reconstruction application with OpenACC

using NVIDIA GPUs

RESULTS• Reduced reconstruction time for a single high-resolution MRI

scan from 40 days to a couple of hours

• Scaled on Blue Waters at NCSA to reconstruct 3000 images in

under 24 hours

POWERGRIDAdvanced MRI Reconstruction Model

Lixiang Luo, Researcher

Aerospace Engineering Computational Fluid Dynamics Laboratory

North Carolina State University (NCSU)

OpenACC is a highly effective tool for programming fully

implicit CFD solvers on GPU to achieve true 4X speedup“

* CPU Speedup on 6 cores of Xeon E5645, with additional cores performance reduces due to partitioning and MPI overheads

CHALLENGE• Accelerate a complex implicit CFD solver on GPU

SOLUTION• Used OpenACC to run solver on GPU with

minimal code changes.

RESULTS• Achieved up to 4X speedups on Tesla GPU over

parallel MPI implementation on x86 multicore

processor

• Structured approach to parallelism provided by

OpenACC allowed better algorithm design

without worrying about GPU architecture

INCOMP3D3D Fully Implicit CFD Solver

The most significant result from

our performance studies is faster

computation with less energy

consumption compared with our

CPU-only runs.

Dr. Misun Min, Computation Scientist

Argonne National Laboratory

CHALLENGE• Enable NekCEM to deliver strong scaling on

next-generation architectures while

maintaining portability

SOLUTION• Use OpenACC to port the entire program

to GPUs

RESULTS• 2.5x speedup over a highly tuned CPU-only

version

• GPU used only 39 percent of the energy

needed by 16 CPUs to do the same

computation in the same amount of time

NEKCEMComputational Electromagnetics Application

Adam Jacobs, PhD candidate in the Department of Physics and

Astronomy at Stony Brook University

On the reactions side, accelerated calculations allow us to model larger

networks of nuclear reactions for similar computational costs as the

simple networks we model now

MAESTRO & CASTROAstrophysics 3D Simulation

““

CHALLENGE• Pursuing strong scaling and portability to run

code on GPU-powered supercomputers

SOLUTION• OpenACC compiler on OLCF’s Titan

supercomputer

RESULTS• 4.4x faster reactions than on a multi-core

computer with 16 cores

• Accelerated calculations allow modeling larger

networks of nuclear reactions for similar

computational costs as simple networks

2 weeks to learn OpenACC

2 weeks to modify code

Wayne Gaudin and Oliver Perks

Atomic Weapons Establishment, UK

We were extremely impressed that we can run OpenACC on a CPU with

no code change and get equivalent performance to our OpenMP/MPI

implementation.

CLOVERLEAFPerformance Portability for a Hydrodynamics Application

Benchmarked Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, Accelerator: Tesla K80

““

CHALLENGE• Application code that runs across architectures

without compromising on performance

SOLUTION• Use OpenACC to port the CloverLeaf mini app

to GPUs and then recompile and run the same

source code on multi-core CPUs

RESULTS• Same performance as optimized OpenMP

version on x86 CPU

• 4x faster performance using the same code on

Earthquake Simulation (理研、東大地震研)

▪ WACCPD 2016(SC16併催WS): Best Paper Award

http://waccpd.org/wp-content/uploads/2016/04/SC16_WACCPD_fujita.pdfより引用

▪ 気象・気候モデル by 理研AICS/東大▪ 膨大なコード (数十万行)

▪ ホットスポットがない (パレートの法則)

▪ 特性の異なる2種類の処理▪ 力学系 … メモリバンド幅ネック

▪ 物理系 … 演算ネック

NICAM: 力学系(NICAM-DC)

▪ OpenACCによるGPU化▪ 主要サブルーチンは、全てGPU上で動作(50以上)

▪ MPI対応済み

▪ 2週間

▪ 良好なスケーラビリティ▪ Tsubame 2.5, 最大2560

▪ Scaling factor: 0.8

Weak scaling

1.E+01

1.E+02

1.E+03

1.E+04

1.E+05

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04Perf

ance (

Number of CPUs or GPUs

Tsubame 2.5 (GPU:K20X)

K computer

Tsubame 2.5 (CPU:WSM)

(*) weak scaling

Courtesy of Dr. Yashiro from RIKEN AICS

NICAM: 力学系(NICAM-DC)

1.E+01

1.E+02

1.E+03

1.E+04

1.E+05

1.E+02 1.E+03 1.E+04 1.E+05 1.E+06

Aggregate Peak Memory Bandwidth (GB/s)

Tsubame 2.5 (GPU:K20X)

K computer

Tsubame 2.5 (CPU:WSM)

Courtesy of Dr. Yashiro from RIKEN AICS

NICAM: 物理系(SCALE-LES)

▪ Atmospheric radiation transfer

▪ 物理系の中で、最も重い計算

▪ OpenACCによるGPU対応済

1.00 1.99 3.88 8.51

1 core 2 core 4 core 10 core 1 GPU 2 GPUs 4 GPUs

Xeon E5-2690v2(3.0GHz,10-core) Tesla K40

Speedup

(*) PCIデータ転送時間込み, グリッドサイズ:1256x32x32

Better

SEISM3D

▪ 地震シミュレーション by 古村教授(東大地震研)

▪ 主要サブルーチンのGPU対応が完了▪ メモリバンド幅ネック、 3次元モデル(2次元分割)、隣接プロセス間通信

K: 8x SPARC64VIIIfx

CPU: 8x XeonE5-2690v2

GPU: 8x TeslaK40

SEISM3D (480x480x1024, 1K steps)

3.4x speedup

(アプリ全体)

GPU: 8x Tesla K40

Others (CPU, MPI and so on)

[CUDA memcpy DtoH]

[CUDA memcpy HtoD]

(other subroutines)

update_vel_pml

update_vel

update_stress_pml

update_stress

diff3d_*

GPUの実行時間内訳

Better

FFR/BCM (仮称)

▪ 次世代CFDコード by 坪倉准教授(理研AICS/北大)

▪ MUSCL_bench:

▪ MUSCLスキームに基づくFlux計算 (とても複雑な計算)

▪ CFD計算の主要部分 (60-70%)

▪ OpenACCによるGPU対応、完了

1.00 1.934.55

101520253035

1 core 2 core 5 core 10 core 1 GPU

Xeon E5-2690v2(3.0GHz,10-core) Tesla K40

Speedup

(*) PCIデータ転送時間込み、サイズ:80x32x32x32

Better

CM-RCM IC-CG

▪ IC-CG法のベンチマークコード by 中島教授(東大)▪ CM-RCM法(Cyclic Multi-coloring Reverse Cuthill-Mckee)を使用

▪ メインループ内のサブルーチンを全てOpenACCでGPU化

1.58 1.53 1.39

OpenMP OpenMP OpenMP OpenMP OpenACC

Opteron 6386SE(2.8GHz,16core)

SPARC64 Ixfx(1.85GHz,16core)

Xeon E5-2680v2(2.8GHz,10core)

Xeon-Phi 5110P Tesla K40

second)

CM-RCM ICCG (100x100x100)

Better

Courtesy of Dr. Ohshima from U-Tokyo

CCS-QCD

▪ QCDコード by 石川准教授(広島大)

▪ BiCGStab計算を全てOpenACCでGPU化▪ データレイアウトを変更: AoS → SoA

32.324.3 22.1 20.9 19.7

53.357.9 60.8 63.4

8x8x8x32 8x8x8x64 8x8x16x64 8x16x16x64 16x16x16x64

Problem Size

CCS-QCD: BiCGStab Total FLOPS

Xeon E5-2690v2(3.0GHz,10core,OpenMP) Tesla K40(OpenACC)Better

CUDAプログラミング

▪ プログラミングモデル

▪ アーキテクチャ

▪ 性能Tips

GPUコンピューティング

▪ 高スループット指向のプロセッサ

▪ 分離されたメモリ空間

GPU MemoryCPU Memory

CPU GPU

GPUプログラム

float *x, float *y)

if (i < n)

y[i] += a*x[i];

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

cudaDeviceSynchronize();

float *x, float *y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

CPU GPU

GPU実行の基本的な流れ

▪ GPUは、CPUからの制御で動作

▪ 入力データ: CPUからGPUに転送 (H2D)

▪ GPUカーネル: CPUから投入

▪ 出力データ: GPUからCPUに転送 (D2H)

CPU GPU

入力データ転送

GPUカーネル投入

同期

出力データ転送

GPU上で演算

GPUプログラム

float *x, float *y)

if (i < n)

y[i] += a*x[i];

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

float *x, float *y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

CPU GPU

入力データ転送

カーネル起動

同期

出力データ転送

GPUプログラム (Unified Memory)

float *x, float *y)

if (i < n)

y[i] += a*x[i];

saxpy<<< N/128, 128 >>>(N, 3.0, x, y);

float *x, float *y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

CPU GPU

カーネル起動

同期

GPUカーネル

▪ GPUカーネル: 1つのGPUスレッドの処理内容を記述▪ 基本: 1つのGPUスレッドが、1つの配列要素を担当

float *x, float *y)

int i = threadIdx.x + blodkDim.x * blockIdx.x;

if (i < n)

y[i] += a*x[i];

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

float *x, float *y)

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

CPU GPU

Global スレッドID

Execution Configuration(ブロック数とブロックサイズ)

float *x, float *y)

if (i < n)

y[i] += a*x[i];

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

ブロック数ブロックサイズ

ブロック数 x ブロックサイズ = 配列要素数

スレッドIDブロックID

ブロックサイズ

スレッド階層(スレッド、ブロック、グリッド)

グリッド

x[] 0 127 128 255

y[] 0 127 128 255

y[i] = a*x[i] + y[i]

256 383 384

ブロック2ブロック1ブロック0 ブロック

スレッド(global)

0 127 128 255 256 383 384

▪ ブロックサイズ(スレッド数/ブロック)は、カーネル毎に設定可能▪ 推奨: 128 or 256 スレッド

スレッド(local)

0 127 0 127 0 127 0

Execution Configuration(ブロック数とブロックサイズ)

float *x, float *y)

if (i < n)

y[i] += a*x[i];

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

ブロック数ブロックサイズ

ブロック数 x ブロックサイズ = 配列要素数

N/ 64, 64N/256, 256

2D配列のGPUカーネル例

▪ ブロックサイズ(ブロック形状)は、1D～3Dで表現可能

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])

int j = threadIdx.y + blodkDim.y * blockIdy.y;

if ( i < N && j < N )

C[i][i] = A[i][j] + B[i][j];

dim3 sizeBlock( 16, 16 );

dim3 numBlocks( N/sizeBlock.x, N/sizeBlock.y );

MatAdd<<< numBlocks, sizeBlock >>>(A, B, C);

32, 864, 4

Globalスレッド ID (x)

GlobalスレッドID (y)

ブロックサイズ (x,y)

ブロック数 (x,y)

ブロック・マッピング、スレッド・マッピング

dim3 sizeBlock(16,16)

(0,0) (1,0) (2,0)

(0,1) (1,1) (2,1)

(0,2) (1,2) (2,2)

ブロックID(blockIdx)

dim3 sizeBlock(32,8)

(0,0) (1,0)

(0,1) (1,1)

(0,2) (1,2)

(0,3) (1,3)

(0,4) (1,4)

(15,15)

(31,7)

スレッドID(threadIdx)

▪ 性能Tips

GPUアーキテクチャ概要

▪ PCI I/F▪ ホスト接続インタフェース

▪ Giga Thread Engine▪ SMに処理を割り振るスケジューラ

▪ DRAM I/F (HBM2)▪ 全SM、PCI I/Fからアクセス可能なメモリ (デバイスメモリ, フレームバッファ)

▪ L2 cache (4MB)▪ 全SMからアクセス可能なR/Wキャッシュ

▪ SM (Streaming Multiprocessor)▪ 「並列」プロセッサ、GP100:最多60Pascal GP100

SM (Stream Multi-Processor)

▪ CUDAコア▪ GPUスレッドはこの上で動作

▪ GP100: 64個 /SM

▪ Other units▪ LD/ST, SFU, etc

▪ レジスタ(32bit): 64K個

▪ 共有メモリ: 64KB

▪ Tex/L1キャッシュPascal GP100

GPUカーネル実行の流れ

▪ CPUが、GPUに、グリッドを投入▪ 具体的な投入先は、Giga Thread Engine

グリッド

ブロック

スレッド

▪ Giga Thread Engine(GTE)が、SMに、ブロックを投入▪ GTEは、ブロックスケジューラ

▪ グリッドをブロックに分解して、ブロックを、空いているSMに割当てる

グリッド

ブロック

スレッド

ブロックをSMに割り当て

▪ 各ブロックは、互いに独立に実行▪ ブロック間では同期しない、実行順序の保証なし

▪ 1つのブロックは複数SMにまたがらない▪ 1つのSMに、複数ブロックが割当てられることはある

グリッド

ブロック

▪ SM内のスケジューラが、スレッドをCUDAコアに投入

グリッド

ブロック

スレッド

▪ SM内のスケジューラが、ワープをCUDAコアに投入▪ ワープ: 32スレッドの塊

▪ ブロックをワープに分割、実行可能なワープを、空CUDAコアに割当てる

グリッド

ブロック

ワープ

ワープのCUDAコアへの割り当て

▪ ワープ内の32スレッドは、同じ命令を同期して実行

▪ 各ワープは、互いに独立して実行▪ 同じブロック内のワープは、明示的に同期可能(__syncthreads())

グリッド

ブロック

ワープ

SIMT(Single Instruction Multiple Threads)

ワープ

GPUアーキの変化を問題としないプログラミングモデル

Kepler, CC3.5

192 cores /SM

Pascal, CC6.0

64 cores /SM

Maxwell, CC5.0

128 cores /SM

▪ 性能Tips

リソース使用率 (Occupancy)

SMの利用効率を上げる≒ SMに割当て可能なスレッド数を、

上限に近づける

▪ レジスタ使用量(/スレッド)▪ できる限り減らす

▪ DP(64bit)は、2レジスタ消費

▪ レジスタ割り当て単位は8個

▪ レジスタ使用量と、割当て可能なスレッド数の関係▪ 32レジスタ: 2048(100%), 64レジスタ: 1024(50%)

▪ 128レジスタ:512(25%), 256レジスタ: 256(12.5%)

CUDAコア数: 64最大スレッド数: 2048最大ブロック数: 32共有メモリ: 64KB

レジスタ数(32-bit): 64K個

リソース量/SM (GP100)

SMの利用効率を上げる≒ SMに割当て可能なスレッド数を、上限に近づける

▪ スレッド数(/ブロック)▪ 64以上にする▪ 64未満だと最大ブロック数がネックになる

▪ 共有メモリ使用量(/ブロック)▪ できる限り減らす▪ 共有メモリ使用量と、割当て可能なブロック数の関係

▪ 32KB:2ブロック, 16KB:4ブロック, 8KB:8ブロック

CUDAコア数: 64最大スレッド数: 2048最大ブロック数: 32共有メモリ: 64KB

レジスタ数(32-bit): 64K個

リソース量/SM (GP100)

空き時間を埋める

▪ CUDAストリーム (≒キュー)▪ 同じCUDAストリームに投入した操作: 投入順に実行

▪ 別のCUDAストリームに投入した操作: 非同期に実行 (オーバラップ実行)

(*) 操作: GPUカーネル, データ転送

[CUDAストリームの効果例]GPUカーネルとデータ転送が

オーバラップして同時に実行されている

デバイスメモリへのアクセスは、まとめて

▪ コアレス・アクセス▪ 32スレッド(ワープ)のロード/ストアをまとめて、メモリトランザクションを発行

▪ トランザクションサイズ: 32B, 64B or 128B

▪ トランザクション数は、少ないほど良い

配列 128B境界 128B境界

0 1 2 29 30 31283

0 1 14 17 30 311615

128B x1

64B x2

32B x32

分岐を減らす

▪ ワープ内のスレッドが別パスを選択すると遅くなる▪ ワープ内のスレッドは、命令を共有 (SIMT)

▪ ワープ内のスレッドが選んだ全パスの命令を実行

▪ あるパスの命令を実行中、そのパスにいないスレッドはinactive状態

▪ Path divergenceを減らす▪ できる限り、同ワープ内のスレッドは

同じパスを選択させる

A[id] > 0

処理 X

処理 Y

true false

まとめ

▪ GPU Computing

▪ Volta GPUs

▪ OpenACC

▪ CUDA

openacc cudaによる gpuコンピューティングgpu computing 5 gpu コンピューティング...

Documents

nvidia gpu...

検査装置でのgpuの活用と -...

bsa グローバルクラウドコンピューティング...

cloud...

ai/ディープラーニングセミナー - fujitsu ·...

Вычисления на gpu с...

gpuコンピューティング...

cuda openaccによる gpuコンピューティング...4...

nvidia deep leaning 説明資料 - networld ·...

【旧版】2009/12/10...

gpuコンピューティング（cuda)...

ディープラーニングとpascal gpu...

gpu portfolio

atlas コンピューティングとグリッド

cudaやopenaccによる gpuコンピューティング ·...

cmsi計算科学技術特論b(14)...

gpuコンピューティング（cuda)...

gpu programming

1071: gpuコンピューティング最新情報～ cuda...

モデルベース開発ソリューション - nsw ·...