
Page 1:

NVIDIA Tutorial

CUDA & OpenACC

Hyungon Ryu 유현곤 (NVIDIA Korea)

[email protected]

Page 2:

NVIDIA Tutorial

Agenda

• Tutorial 1: GPGPU & CUDA
  – CUDA H/W architecture
• Tutorial 2: CUDA C/C++ and Fortran
• Tutorial 2: CUDA Libraries (cuBLAS / cuFFT)
• Tutorial 3: OpenACC
• Tutorial 4: CUDA 5.0 and the Kepler architecture

Page 3:

NVIDIA Tutorial

Tutorial 1 CUDA H/W architecture

Page 4:

NVIDIA Tutorial

Moore’s law is OK….

Page 5:

NVIDIA Tutorial

Conceptual GPU vs. CPU

[Diagram: a single CPU-style core — instruction fetch (IB fetch), LSU, cache, four registers, FPU/ALU with 128-bit SIMD registers — connected through a cache to many memory banks.]

Page 6:

NVIDIA Tutorial

Conceptual GPU vs. CPU

[Diagram: the same memory banks and cache, now feeding two such cores (each with IB fetch, LSU, cache, registers, FPU/ALU).]

Page 7:

NVIDIA Tutorial

Conceptual GPU vs. CPU

[Diagram: the same memory system feeding three fetch/LSU/cache groups, each driving eight FPU/ALU lanes over a shared register set — the GPU-style layout.]

Page 8:

NVIDIA Tutorial

Conceptual GPU vs. CPU

CPU core: general purpose — it even loads the OS kernel.
GPU core: focused on computing.

Page 9:

NVIDIA Tutorial

CUDA parallel model

[Diagram: a CPU program launches a kernel, which runs as many GPU threads.]

Page 10:

NVIDIA Tutorial

CUDA parallel model (real)

The CPU program launches a kernel with <<< Block, Threads >>>; each block of threads is scheduled onto the GPU's streaming multiprocessors.

[Diagram: a TPC with a Geometry Controller, SMC, Texture Unit and Tex L1 cache, containing several SMs; each SM has an I-Cache, MT Issue, C-Cache, eight SPs, a DP unit, two SFUs, and Shared Memory.]

Page 11:

NVIDIA Tutorial

Saxpy Example: Serial and OpenMP

Serial version:

void saxpy(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
  }
}

OpenMP version:

void saxpy(int n, double a, double *x, double *y) {
  int i;
  #pragma omp parallel shared(n,a,x,y) private(i)
  #pragma omp for
  for (i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
  }
}

Page 12:

NVIDIA Tutorial

Saxpy Parallel: CUDA

__global__ void
saxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x;
  y[i] = a*x[i] + y[i];
}

saxpy <<< n, 1 >>> (n, 2.0, x, y);

Loop / configuration / index control:
  CUDA:   for loop → __global__ kernel launched with <<< A, B >>>; index from blockIdx.x
  OpenMP: for loop → #pragma omp for; index from tid = omp_get_thread_num()
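The kernel above uses one thread per block, so blockIdx.x alone indexes the data. As a hedged sketch (not from the original slide), the more common launch uses many threads per block and combines blockIdx.x, blockDim.x and threadIdx.x, with a bounds check for the final partial block:

__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)                                       /* guard the final partial block */
        y[i] = a * x[i] + y[i];
}

/* launch with 256 threads per block and enough blocks to cover n elements: */
/* saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);                          */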

Page 13:

NVIDIA Tutorial

Template for CUDA

#include <stdio.h>

main(){
  // Initialize the GPU
  // Memory allocation                    (memory control)
  // Memory copy                          (upload)
  FunctionG <<< N, M >>> (parameters);    // execute kernel
  // Memory copy                          (download)
  Function_host( );                       // CPU algorithm
}

__global__ void functionG(parameters){    // define global function
  functionA( );                           // use device function
  functionB( );
}

__device__ functionA(){                   // define device subroutines
  . . .
}
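The template above is pseudo-code. As a minimal sketch (the array size, kernel body, and names beyond the template are illustrative), the same structure compiles with nvcc like this:

#include <stdio.h>
#include <cuda_runtime.h>

__device__ float functionA(float v) { return v + 1.0f; }      // device subroutine

__global__ void functionG(float *data, int n)                  // global function
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = functionA(data[i]);                          // use device function
}

int main()
{
    const int N = 1024;
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));                        // memory allocation
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);  // memory copy (upload)

    functionG<<<(N + 255) / 256, 256>>>(d_data, N);                         // execute kernel

    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);  // memory copy (download)
    cudaFree(d_data);

    printf("h_data[0] = %f\n", h_data[0]);    // expect 1.0
    return 0;
}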

Page 14:

NVIDIA Tutorial

CUDA C extensions

Launch the kernel:
  Function <<< Grid, Block >>> (parameters);

Additional C APIs for memory control:
  cudaXXX (runtime API): cudaMalloc, cudaMemcpy, …
  cuXXX (driver API): cuMemAlloc, cuMemcpy, …
  cutXXX (SDK cutil helpers): cutXXX

Function qualifiers:
  __global__, __device__, __host__, __device__ __host__

Memory qualifiers:
  __shared__, __device__, registers / local memory

Pre-defined variables and constants:
  blockDim, blockIdx, threadIdx, cudaMemcpyHostToDevice

Pre-defined functions:
  __syncthreads(), __mul24(), etc.

Page 15:

NVIDIA Tutorial

Overview of developing

Serial:        Algorithm → serial programming → compile → debugging → release

MPI parallel:  Algorithm → serial programming → compile → parallelize → parallel programming → debugging [TotalView] → optimize → compile → profile → mpirun → release

CUDA parallel: Algorithm → serial programming → compile → CUDA convert → debugging → compile → parallelize → optimize/profile → profile → release

Page 16:

NVIDIA Tutorial

CUDA Parallel programming Steps

Page 17:

NVIDIA Tutorial

Demo system

OS: Linux CentOS 6.3
Compiler: C/C++ — GCC + NVCC; Fortran — PGI compiler
H/W: NVIDIA GPU (Tesla C2075)

Page 18:

NVIDIA Tutorial

CUDA C/C++ model

Upload:   cudaMemcpy (CPU → GPU)
Compute:  GPU_FUNCTION<<<A,B>>>( );
Download: cudaMemcpy (GPU → CPU)

Page 19:

NVIDIA Tutorial

CUDA Fortran model

Upload:   d_data = h_data            (CPU → GPU)
Compute:  call GPU_FUNCTION<<<A,B>>>
Download: h_result = d_result        (GPU → CPU)

Page 20:

NVIDIA Tutorial

CUDA C/C++ Makefile example

#NVCC Makefile Example
include ./Makefile.inc
###########################################################
PROG     = saxpy
#TARGET  = exe.$(PROG)
TARGET   = saxpy_gpu
GPU_OBJ  = $(PROG).gpu.o
CPU_OBJ  = $(PROG).cpu.o
MAIN_OBJ = $(PROG).main.o
OBJS     = $(GPU_OBJ) $(CPU_OBJ) $(MAIN_OBJ)

all: build

build: $(TARGET)

$(GPU_OBJ): $(PROG)_gpu.cu
	@echo "Compile GPU part..."
	@$(NVCC) $(NVCCFLAGS) $(EXTRA_NVCCFLAGS) $(INCLUDES) $(GPU_ARCH) -o $@ -c $<

$(CPU_OBJ): $(PROG)_cpu.c
	@echo "Compile CPU part..."
	@$(CC) $(CCFLAGS) $(EXTRA_CCFLAGS) $(INCLUDES) -o $@ -c $<

$(MAIN_OBJ): $(PROG)_main.c
	@echo "Compile main part..."
	@$(CC) $(CCFLAGS) $(EXTRA_CCFLAGS) $(INCLUDES) -o $@ -c $<

$(TARGET): $(OBJS)
	@echo "Link..."
	@$(CC) $(CCFLAGS) $(EXTRA_CCFLAGS) -o $@ $+ $(LDFLAGS) $(EXTRA_LDFLAGS)

run: build
	./$(TARGET)

clean:
	@rm -rf *.o *.ptx $(TARGET)
	@echo "finish remove all tmp files"

Page 21:

NVIDIA Tutorial

CUDA C/C++ example

__global__ void
saxpy_kernel(float *x, float *y, float alpha, int N){
    int i = blockIdx.x;
    y[i] = alpha*x[i] + y[i];
    return;
}

extern "C"
void saxpyGPU(float *x, float *y, float alpha, int N){
    float *x_d, *y_d;
    cudaMalloc( (void**)&x_d, sizeof(float)*N );
    cudaMalloc( (void**)&y_d, sizeof(float)*N );
    cudaMemcpy( x_d, x, sizeof(float)*N, cudaMemcpyHostToDevice );
    cudaMemcpy( y_d, y, sizeof(float)*N, cudaMemcpyHostToDevice );
    saxpy_kernel <<< N, 1 >>> (x_d, y_d, alpha, N);
    cudaMemcpy( y, y_d, sizeof(float)*N, cudaMemcpyDeviceToHost );
    cudaFree(x_d);      /* release device buffers */
    cudaFree(y_d);
    return;
}
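A small host-side driver for this wrapper might look like the sketch below; the array size and the check value are illustrative, not part of the slide:

#include <stdio.h>
#include <stdlib.h>

void saxpyGPU(float *x, float *y, float alpha, int N);   /* from the .cu file above */

int main(void)
{
    const int N = 1 << 20;
    float *x = (float *)malloc(N * sizeof(float));
    float *y = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpyGPU(x, y, 3.0f, N);                              /* y = 3*x + y on the GPU */

    printf("y[0] = %f (expected 5.0)\n", y[0]);
    free(x);
    free(y);
    return 0;
}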

Page 22:

NVIDIA Tutorial

CUDA Fortran Makefile example

cat Makefile
#F90FLAGS = -fast
CUDAFLAGS = -Mcuda=cc20
OBJS = increment op_m

all: $(OBJS)

# increment
increment: increment.cuf op_m.o
	pgfortran $(CUDAFLAGS) $(F90FLAGS) -o $@ $^

op_m.o: op_m.f
	pgfortran $(CUDAFLAGS) $(F90FLAGS) -c $<

clean:
	rm -rf *.o *.mod $(OBJS) *~

Page 23:

NVIDIA Tutorial

Check the GPU device with cudainfo

program cufinfo
  use cudafor
  integer istat, num, numdevices
  type(cudadeviceprop) :: prop
  istat = cudaGetDeviceCount(numdevices)
  do num = 0, numdevices-1
     istat = cudaGetDeviceProperties(prop, num)
     call printDeviceProperties(prop, num)
  end do
end

subroutine printDeviceProperties(prop, num)
  use cudafor
  type(cudadeviceprop) :: prop
  integer num
  ilen = verify(prop%name, ' ', .true.)
  write (*,900) "Device Number: ", num
  write (*,901) "Device Name: ", prop%name(1:ilen)
  write (*,905) "SM: ", prop%major, prop%minor
900 format (a,i0)
901 format (a,a)
905 format (a,i0,'.',i0)
  return
end

See /opt/pgi/linux86-64/2012/src/cudafor.f90

Page 24:

NVIDIA Tutorial

Simple CUDA Fortran example

program incTest
  use cudafor
  use simpleOps_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n), b
  integer, device :: a_d(n)

  a = 1
  b = 3

  a_d = a
  call inc<<<1,n>>>(a_d, b)
  a = a_d

  if (all(a == 4)) then
     write(*,*) 'Success'
  endif
end program incTest

Page 25:

NVIDIA Tutorial

cuBLAS / cuFFT

Page 26:

NVIDIA Tutorial

CPU version of the DGEMM routine

static void simple_dgemm(int n, double alpha, const double *A, const double *B,
                         double beta, double *C)
{
    int i, j, k;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            double prod = 0;
            for (k = 0; k < n; ++k) {
                prod += A[k * n + i] * B[j * n + k];
            }
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}

Page 27:

NVIDIA Tutorial

cuBLAS model

Malloc → Memcpy (CPU → GPU) → Compute → Memcpy (GPU → CPU) → free

Page 28:

NVIDIA Tutorial

cuBLAS call steps

• GPU memory alloc:          cublasAlloc( n_size, memSize, GPU_ptr );
• GPU memory upload (set):   cublasSetVector( n_size, memSize, CPU_ptr, 1, GPU_ptr, 1 );
• GPU compute:               cublasSgemm( 'n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N );
• GPU memory download (get): cublasGetVector( n2, memSize, GPU_ptr, 1, CPU_ptr, 1 );

Build: nvcc mysrc.c -lcuda -lcublas

Page 29:

NVIDIA Tutorial

Main code in cuBLAS

h_A = (float*) malloc(n2 * sizeof(h_A[0]));
h_B = (float*) malloc(n2 * sizeof(h_B[0]));
h_C = (float*) malloc(n2 * sizeof(h_C[0]));

cublasAlloc(n2, sizeof(d_A[0]), (void**)&d_A);
cublasAlloc(n2, sizeof(d_B[0]), (void**)&d_B);
cublasAlloc(n2, sizeof(d_C[0]), (void**)&d_C);

cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);

cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);
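This snippet omits the set-up and tear-down that the legacy cuBLAS API expects; a hedged completion of the same sequence would add roughly the following calls:

cublasInit();                        /* once, before any other cuBLAS call */
/* ... alloc / set / sgemm / get as above ... */
cublasFree(d_A);                     /* release GPU buffers */
cublasFree(d_B);
cublasFree(d_C);
free(h_A); free(h_B); free(h_C);     /* release host buffers */
cublasShutdown();                    /* once, at the end */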

Page 30:

NVIDIA Tutorial

Tutorial 3 OpenACC

Page 31:

NVIDIA Tutorial

OpenACC Specification and Website

• Full OpenACC 1.0 Specification available online

www.openacc.org

• Quick reference card also available

• Beta implementations available now from PGI, Cray, and CAPS

Page 32:

NVIDIA Tutorial

Case study

With directives, tuning work focuses on exposing parallelism, which makes codes inherently better.

Example: application tuning work using directives for the new Titan system at ORNL

S3D — research more efficient combustion with next-generation fuels
• Tuned the top 3 kernels (90% of runtime)
• 3 to 6x faster on CPU+GPU vs. CPU+CPU
• Also improved the all-CPU version by 50%

CAM-SE — answer questions about specific climate change adaptation and mitigation scenarios
• Tuned the top key kernel (50% of runtime)
• 6.5x faster on CPU+GPU vs. CPU+CPU
• Improved performance of the CPU version by 100%

Page 33:

NVIDIA Tutorial

Familiar to OpenMP Programmers

OpenMP (CPU):

main() {
    double pi = 0.0; long i;
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.05) / N);
        pi += 4.0 / (1.0 + t*t);
    }
    printf("pi = %f\n", pi/N);
}

OpenACC (CPU + GPU):

main() {
    double pi = 0.0; long i;
    #pragma acc kernels
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.05) / N);
        pi += 4.0 / (1.0 + t*t);
    }
    printf("pi = %f\n", pi/N);
}

Page 34:

NVIDIA Tutorial

Directive syntax

• Fortran
  !$acc directive [clause [,] clause] …]
  ...often paired with a matching end directive surrounding a structured code block:
  !$acc end directive

• C
  #pragma acc directive [clause [,] clause] …]
  ...often followed by a structured code block

Page 35:

NVIDIA Tutorial

kernels: your first OpenACC directive

Each loop is executed as a separate kernel on the GPU. (Kernel: a parallel function that runs on the GPU.)

!$acc kernels
  do i=1,n
     a(i) = 0.0
     b(i) = 1.0
     c(i) = 2.0
  end do           ! kernel 1

  do i=1,n
     a(i) = b(i) + c(i)
  end do           ! kernel 2
!$acc end kernels

Page 36:

NVIDIA Tutorial

Kernels construct

Fortran:
  !$acc kernels [clause …]
    structured block
  !$acc end kernels

C:
  #pragma acc kernels [clause …]
  { structured block }

Clauses:
  if( condition )
  async( expression )
  Also, any data clause (more later)
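As a small illustration (not taken from the slides) of the clauses listed above, the C form of the construct with an if clause could look like this; useGPU is a hypothetical run-time switch:

/* vector add inside a kernels region; if(useGPU) falls back to the host when it is 0 */
void vadd(int n, const float *b, const float *c, float *a, int useGPU)
{
    #pragma acc kernels if(useGPU)
    {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    }
}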

Page 37:

NVIDIA Tutorial

OpenACC (C/C++) example MatrixMul

Page 38:

NVIDIA Tutorial

Matrix Mul CPU code

for ( int row=0; row<n; row++ ) {

for ( int col=0; col<n; col++ ) {

double val = 0;

for ( int k=0; k<n; k++ ) {

val += a[row*n+k] * b[k*n+col];

}

c[row*n+col] = val;

}

}

Page 39:

NVIDIA Tutorial

CPU

[cudaguest@gpu03 Ex3]$ ./ex1-gcc

Time : 12.1871 seconds Perf. : 0.1762 Gflops/s

Page 40:

NVIDIA Tutorial

CUDA converting

for ( int row=0; row<n; row++ ) {

for ( int col=0; col<n; col++ ) {

double val = 0;

for ( int k=0; k<n; k++ ) {

val += a[row*n+k] * b[k*n+col];

}

c[row*n+col] = val;

}

To convert this loop to CUDA code you need:
• cudaMalloc, cudaMemcpy (host → device), a kernel launch, cudaMemcpy (device → host)
• block/thread indexing in place of the loop indices
• data placed in global memory (optionally shared or texture memory)
• a choice of the number of blocks and threads, mapped onto the CUDA cores / multiprocessors
• a __global__ kernel function
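As a sketch of that conversion (this kernel is illustrative, not the exercise code), each CUDA thread computes one element of c, and the two outer loops become the block/thread indices:

__global__ void matmul_kernel(const double *a, const double *b, double *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   /* replaces the row loop */
    int col = blockIdx.x * blockDim.x + threadIdx.x;   /* replaces the col loop */
    if (row < n && col < n) {
        double val = 0;
        for (int k = 0; k < n; ++k)
            val += a[row*n + k] * b[k*n + col];
        c[row*n + col] = val;
    }
}

/* launch (after cudaMalloc/cudaMemcpy of a, b, c):    */
/*   dim3 block(16, 16);                               */
/*   dim3 grid((n + 15) / 16, (n + 15) / 16);          */
/*   matmul_kernel<<<grid, block>>>(a_d, b_d, c_d, n); */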

Page 41:

NVIDIA Tutorial

OpenACC Converting

#pragma acc data region copyin(a[0:n*n-1],b[0:n*n-1]) copyout(c[0:n*n-1])

{

#pragma acc region

{

#pragma acc for independent

for ( int row=0; row<n; row++ ) {

#pragma acc for independent

for ( int col=0; col<n; col++ ) {

double val = 0;

for ( int k=0; k<n; k++ ) {

val += a[row*n+k] * b[k*n+col];

}

c[row*n+col] = val;

}

}

}

}

Original serial loop for comparison:

for ( int row=0; row<n; row++ ) {
  for ( int col=0; col<n; col++ ) {
    double val = 0;
    for ( int k=0; k<n; k++ ) {
      val += a[row*n+k] * b[k*n+col];
    }
    c[row*n+col] = val;
  }
}

Page 42:

NVIDIA Tutorial

OpenACC compile

[cudaguest@gpu03 Ex3]$ make acc

pgcc -lgomp -fast -ta=nvidia:cc20 -ta=nvidia:fastmath -ta=nvidia:cuda4.0 -Minfo -Minline exercise3.c -o ex3 -lgomp

NOTE: your trial license will expire in 13 days, 8.55 hours.

NOTE: your trial license will expire in 13 days, 8.55 hours.

main:

28, Loop not vectorized: data dependency

38, Generating copyout(c[0:n*n-1])

Generating copyin(b[0:n*n-1])

Generating copyin(a[0:n*n-1])

43, Generating compute capability 2.0 binary

47, Loop is parallelizable

51, Loop is parallelizable

Accelerator kernel generated

47, #pragma acc for parallel, vector(16) /* blockIdx.y threadIdx.y */

51, #pragma acc for parallel, vector(16) /* blockIdx.x threadIdx.x */

CC 2.0 : 28 registers; 8 shared, 80 constant, 0 local memory bytes

57, Loop is parallelizable

[cudaguest@gpu03 Ex3]$

Page 43:

NVIDIA Tutorial

OpenACC result

[cudaguest@gpu03 Ex3]$ ./ex3-cpu
Time : 12.1871 seconds   Perf. : 0.1762 Gflops/s

[cudaguest@gpu03 Ex3]$ ./ex3-acc
Time : 0.8621 seconds    Perf  : 2.4911 Gflops/s

Page 44:

NVIDIA Tutorial

OpenACC (Fortran) example: HEAT

Page 45:

NVIDIA Tutorial

Simple heat equation

wtime = omp_get_wtime ( )
diff = eps

! Iteration start
do while ( eps <= diff )
  diff = 0.0

  do j = 1, n
    do i = 1, m
      u(i,j) = w(i,j)
    end do
  end do

  do j = 2, n - 1
    do i = 2, m - 1
      w(i,j) = 0.25 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
    end do
  end do

  do j = 1, n
    do i = 1, m
      diff = max ( diff, abs ( u(i,j) - w(i,j) ) )
    end do
  end do

  iterations = iterations + 1
end do

Sample code from:
! Licensing: this code is distributed under the GNU LGPL license.
! Modified: 18 October 2011
! Author: original FORTRAN90 version by Michael Quinn; this version by John Burkardt.

Page 46:

NVIDIA Tutorial

Simple heat equation

(Same serial iteration loop as on Page 45.)

Page 47:

NVIDIA Tutorial

Simple heat equation with OpenMP

do while ( eps <= diff )
  diff = 0.0

!$omp parallel shared ( u, w ) private ( i, j )

!$omp do
  do j = 1, n
    do i = 1, m
      u(i,j) = w(i,j)
    end do
  end do
!$omp end do

!$omp do
  do j = 2, n - 1
    do i = 2, m - 1
      w(i,j) = 0.25 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
    end do
  end do
!$omp end do

!$omp do reduction ( max : diff )
  do j = 1, n
    do i = 1, m
      diff = max ( diff, abs ( u(i,j) - w(i,j) ) )
    end do
  end do
!$omp end do

!$omp end parallel

  iterations = iterations + 1
end do

Page 48:

NVIDIA Tutorial

Simple heat equation with OpenACC

do while ( eps <= diff )
  diff = 0.0

!$omp parallel shared ( u, w ) private ( i, j )
!$acc kernels

!$omp do
  do j = 1, n
    do i = 1, m
      u(i,j) = w(i,j)
    end do
  end do
!$omp end do

!$omp do
  do j = 2, n - 1
    do i = 2, m - 1
      w(i,j) = 0.25 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
    end do
  end do
!$omp end do

!$acc end kernels

!$omp do reduction ( max : diff )
!$acc kernels
  do j = 1, n
    do i = 1, m
      diff = max ( diff, abs ( u(i,j) - w(i,j) ) )
    end do
  end do
!$omp end do
!$acc end kernels

!$omp end parallel

Page 49:

NVIDIA Tutorial

Basic concepts

The CPU and its memory sit on one side of the PCI bus, the GPU and its memory on the other; data is transferred across the bus and computation is offloaded to the GPU.

For efficiency, decouple data movement and compute off-load.

[Diagram: CPU + CPU memory ↔ PCI bus ↔ GPU + GPU memory, with arrows for "transfer data" and "offload computation".]

Page 50:

NVIDIA Tutorial

Simple heat equation with OpenACC (data region)

!$acc data copy(u,w)
do while ( eps <= diff )
  diff = 0.0

!$acc kernels
  do j = 1, n
    do i = 1, m
      u(i,j) = w(i,j)
    end do
  end do

  do j = 2, n - 1
    do i = 2, m - 1
      w(i,j) = 0.25 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
    end do
  end do
!$acc end kernels

!$acc kernels
  do j = 1, n
    do i = 1, m
      diff = max ( diff, abs ( u(i,j) - w(i,j) ) )
    end do
  end do
!$acc end kernels

  iterations = iterations + 1
end do
!$acc end data
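For C programmers, the same data-region pattern (create the device copies once, keep every kernel inside the region) looks roughly like the sketch below, assuming m x n arrays u and w stored contiguously; the reduction clause stands in for the Fortran max intrinsic:

#include <math.h>

void heat_solve(int m, int n, double eps, double *u, double *w)
{
    double diff = eps;                           /* force at least one sweep */
    #pragma acc data copy(u[0:m*n], w[0:m*n])    /* device copies live for the whole loop */
    while (eps <= diff) {
        diff = 0.0;

        #pragma acc kernels
        {
            for (int j = 0; j < n; ++j)
                for (int i = 0; i < m; ++i)
                    u[j*m + i] = w[j*m + i];

            for (int j = 1; j < n - 1; ++j)
                for (int i = 1; i < m - 1; ++i)
                    w[j*m + i] = 0.25 * ( u[j*m + (i-1)] + u[j*m + (i+1)]
                                        + u[(j-1)*m + i] + u[(j+1)*m + i] );
        }

        #pragma acc kernels loop reduction(max:diff)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                diff = fmax(diff, fabs(u[j*m + i] - w[j*m + i]));
    }
}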

Page 51:

NVIDIA Tutorial

Kepler Architecture and CUDA 5.0

Page 52:

NVIDIA Tutorial

Tesla CUDA Architecture Roadmap

[Chart: DP GFLOPS per watt (2 to 16) versus year (2008–2014) for the Tesla, Fermi, Kepler, and Maxwell generations.]

Page 53:

NVIDIA Tutorial

[Charts: peak memory bandwidth (GBytes/sec, 0–300) and peak double-precision FP (GFlops/sec, 0–1600) from 2007 to 2012, comparing NVIDIA GPUs (GT200, GF100, GF110, Kepler II; ECC off) against x86 CPUs (Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz).]

Page 54:

NVIDIA Tutorial

Kepler GK110 Block Diagram

Architecture:

• 7.1B Transistors

• 15 SMX units

• > 1 TFLOP FP64

• 1.5 MB L2 Cache

• 384-bit GDDR5

• PCI Express Gen3

Page 55:

NVIDIA Tutorial

Power vs. clock speed example

[Diagram: the same two work units A and B processed by one pipeline at 2x clock (Fermi) or by two parallel pipelines at 1x clock (Kepler).]

            Fermi (2x clock)        Kepler (1x clock)
Logic:      Area 1.0x, Power 1.0x   Area 1.8x, Power 0.9x
Clocking:   Area 1.0x, Power 1.0x   Area 1.0x, Power 0.5x

Page 56:

NVIDIA Tutorial

SMX Balance of Resources

Resource Kepler GK110 vs Fermi

Floating point throughput 2-3x

Max Blocks per SMX 2x

Max Threads per SMX 1.3x

Register File Bandwidth 2x

Register File Capacity 2x

Shared Memory Bandwidth 2x

Shared Memory Capacity 1x

Page 57:

NVIDIA Tutorial

New instruction: SHFL

Data exchange between threads within a warp

• Avoids use of shared memory
• One 32-bit value per exchange
• 4 variants:

  __shfl()       — indexed any-to-any
  __shfl_up()    — shift right to nth neighbour
  __shfl_down()  — shift left to nth neighbour
  __shfl_xor()   — butterfly (XOR) exchange

[Diagram: the four shuffle patterns applied to warp values a b c d e f g h.]
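As an illustrative sketch (not from the slides), __shfl_down() lets a warp sum a value without shared memory; newer CUDA releases replace these with the _sync variants that take an explicit lane mask:

__inline__ __device__ float warpReduceSum(float val)
{
    /* each step folds the upper half of the active lanes onto the lower half */
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;   /* lane 0 ends up holding the warp-wide sum */
}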

Page 58:

NVIDIA Tutorial

Dynamic Parallelism

The ability to launch new grids from the GPU

– Dynamically

– Simultaneously

– Independently

[Diagram: CPU ↔ GPU launch patterns on Fermi vs. Kepler.]

Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.
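A minimal sketch of what this enables is below (illustrative only; dynamic parallelism needs a GK110-class GPU, sm_35, and compilation with -rdc=true -lcudadevrt):

#include <stdio.h>

__global__ void child(int parent_block)
{
    printf("child thread %d launched by parent block %d\n", threadIdx.x, parent_block);
}

__global__ void parent(void)
{
    /* one thread per block launches a child grid from the GPU itself */
    if (threadIdx.x == 0)
        child<<<1, 4>>>(blockIdx.x);
}

/* host side: parent<<<8, 32>>>();  cudaDeviceSynchronize(); */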

Page 59:

NVIDIA Tutorial

Dynamic Parallelism

[Diagram: Fermi — GPU as co-processor, all work launched from the CPU; Kepler — autonomous, dynamic parallelism, the GPU launches nested work itself.]

Page 60:

NVIDIA Tutorial

Better Programming Model

LU decomposition (Fermi)

CPU code:
dgetrf(N, N) {
  for j=1 to N
    for i=1 to 64
      idamax<<<>>>
      memcpy
      dswap<<<>>>
      memcpy
      dscal<<<>>>
      dger<<<>>>
    next i
    memcpy
    dlaswap<<<>>>
    dtrsm<<<>>>
    dgemm<<<>>>
  next j
}

GPU code:
idamax(); dswap(); dscal(); dger(); dlaswap(); dtrsm(); dgemm();

LU decomposition (Kepler)

CPU code:
dgetrf(N, N) {
  dgetrf<<<>>>
  synchronize();
}

GPU code:
dgetrf(N, N) {
  for j=1 to N
    for i=1 to 64
      idamax<<<>>>
      dswap<<<>>>
      dscal<<<>>>
      dger<<<>>>
    next i
    dlaswap<<<>>>
    dtrsm<<<>>>
    dgemm<<<>>>
  next j
}

Page 61:

NVIDIA Tutorial

Concurrency: Fermi

Fermi allows 16-way concurrency
– Up to 16 grids can run at once
– But CUDA streams multiplex into a single queue
– Overlap only at stream edges

Stream 1: A -- B -- C
Stream 2: P -- Q -- R
Stream 3: X -- Y -- Z

Hardware work queue: A--B--C  P--Q--R  X--Y--Z

Page 62:

NVIDIA Tutorial

Concurrency: Kepler

Kepler allows 32-way concurrency
– One work queue per stream
– Concurrency at full-stream level
– No inter-stream dependencies

Stream 1: A -- B -- C
Stream 2: P -- Q -- R
Stream 3: X -- Y -- Z

Multiple hardware work queues:
A--B--C
P--Q--R
X--Y--Z
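The picture above corresponds to host code along these lines (a sketch with a placeholder kernel): three streams each receive an independent sequence of launches; on Fermi they funnel into one hardware queue, on Kepler each stream can use its own.

#include <cuda_runtime.h>

__global__ void work(int tag) { /* placeholder kernel standing in for A..Z */ }

int main(void)
{
    cudaStream_t s[3];
    for (int i = 0; i < 3; ++i)
        cudaStreamCreate(&s[i]);

    /* three independent sequences, one per stream: A--B--C, P--Q--R, X--Y--Z */
    for (int step = 0; step < 3; ++step)
        for (int i = 0; i < 3; ++i)
            work<<<1, 64, 0, s[i]>>>(i * 3 + step);

    cudaDeviceSynchronize();
    for (int i = 0; i < 3; ++i)
        cudaStreamDestroy(s[i]);
    return 0;
}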

Page 63:

NVIDIA Tutorial

Hyper-Q: Fermi (without Hyper-Q)

[Chart: GPU utilization % (0–100) over time for work A–F without Hyper-Q.]

Page 64:

NVIDIA Tutorial

Hyper-Q: Kepler (with Hyper-Q)

[Chart: GPU utilization % (0–100) over time for work A–F with Hyper-Q; the work from A–F overlaps.]

Page 65:

NVIDIA Tutorial

Thank you.