GPGPU Programming with CUDA
Leandro Avila - University of Northern Iowa
Mentor:
Dr. Paul Gray
Computer Science Department
University of Northern Iowa
-
Outline
Introduction
Architecture Description
Introduction to CUDA API
-
Introduction
Shift in the traditional paradigm of sequential programming towards parallel processing.
Scientific computing needs to change in order to deal with vast amounts of data.
Hardware changes have contributed to the move towards parallel processing.
-
Three Walls of Serial Performance
Manferdelli, J. (2007) - The Many-Core Inflection Point for Mass Market Computer Systems
Memory Wall
Discrepancy between memory and CPU performance
Instruction Level Parallelism Wall
Effort invested in ILP keeps increasing, with diminishing returns
Power Wall
Clock frequency gains are limited by heat dissipation.
-
Accelerators
In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload.
In the old days (1980s), supercomputers had array processors for vector operations on arrays, and floating point accelerators.
More recently, Field Programmable Gate Arrays (FPGAs) allow reprogramming deep into the hardware.
Courtesy of Henry Neeman - http://www.oscer.ou.edu/
-
Accelerators
Advantages
They make your code run faster
Disadvantages
More expensive
Harder to program
Code is not portable from one accelerator to another. (OpenCL attempts to change this.)
Courtesy of Henry Neeman - http://www.oscer.ou.edu/
-
Introducing GPGPU
General Purpose Computing on Graphics Processing Units
Great example of the trend of moving away from the traditional model.
-
Why GPUs?
Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering.
They became very popular with video gamers, because they've produced better and better images, and lightning fast.
And prices have been extremely good, ranging from three figures at the low end to four figures at the high end.
GPUs mostly do stuff like rendering images.
This is done mostly through floating point arithmetic, the same stuff people use supercomputing for!
Courtesy of Henry Neeman - http://www.oscer.ou.edu/
-
GPU vs. CPU Flop Rate
From the Nvidia CUDA Programming Guide
-
Architecture
-
Architecture Comparison
                         Nvidia Tesla C1060        Intel i7 975 Extreme
Processing Cores         240                       4
Memory                   4 GB                      L1 32 KB/core, L2 256 KB/core, L3 8 MB (shared)
Clock Speed              1.3 GHz                   3.33 GHz
Memory Bandwidth         102 GB/sec                25 GB/sec
Flop Rate (GFLOP/sec)    933 single / 78 double    70 double
-
CPU vs. GPU
From the Nvidia CUDA Programming Guide
-
Components
Texture Processor Clusters
Streaming Multiprocessors
Streaming Processors
From http://www.tomshardware.com/reviews/nvidia-cuda-gpu,1954-7.html
-
Streaming Multiprocessors
Blocks of threads are assigned to SMs
An SM contains 8 Scalar Processors
Tesla C1060
Number of SMs = 30
Number of Cores = 240
The more SMs you have, the better
-
Hardware Hierarchy
Stream Processor Array
Contains 10 Texture Processor Clusters
Texture Processor Clusters
Contains 3 Streaming Multiprocessors
Streaming Multiprocessors
Contains 8 Scalar Processors
Scalar Processors
They do the work :)
-
Connecting some dots...
Great! We see the GPU architecture is different from what we see in the traditional CPU.
So... Now what?
What does all this mean?
How do we use it?
-
Glossary
The HOST is the machine executing the main program
The DEVICE is the card with the GPU
The KERNEL is the routine that runs on the GPU
A THREAD is the basic execution unit in the GPU
A BLOCK is a group of threads
A GRID is a group of blocks
A WARP is a group of 32 threads
-
CUDA Kernel Execution
Recall that threads are organized in BLOCKS, and at the same time BLOCKS are organized in a GRID.
The GRID can have 2 dimensions: X and Y
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
The BLOCK(S) can have 3 dimensions: X, Y, Z
Maximum sizes of each dimension of a block: 512 x 512 x 64
Prior to kernel execution, we need to set it up by specifying the dimensions of the GRID and the dimensions of the BLOCKS.
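The maximums above are for this hardware generation and vary by device. As a minimal sketch (assuming the CUDA runtime API and a single device), cudaGetDeviceProperties() reports the actual limits:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* query device 0 */

    printf("Max grid size:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max block size: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}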
-
Scheduling in Hardware
Grid is launched
Blocks are distributed to the available SMs
SM initiates processing of warps
SM schedules warps that are ready
As warps finish and resources are freed, new warps are scheduled
An SM can take up to 1024 threads
Ex: 256 threads x 4 blocks OR 128 threads x 8 blocks
[Figure: the host launches Kernel 1 on Grid 1 (a 3 x 2 arrangement of blocks) and Kernel 2 on Grid 2; within Block (1,1), threads are arranged in a 5 x 3 layout. Kirk & Hwu, University of Illinois Urbana-Champaign]
-
Memory Layout
Registers and shared memory are the fastest
Local Memory is virtual memory
Global Memory is the slowest
From the Nvidia CUDA Programming Guide
-
Thread Memory Access
Threads access memory as follows:
Registers: Read & Write
Local Memory: Read & Write
Shared Memory: Read & Write (block level)
Global Memory: Read & Write (grid level)
Constant Memory: Read only (grid level)
Remember that Local Memory is implemented as virtual memory from a region that resides in Global Memory.
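A minimal kernel sketch illustrating these access scopes (hypothetical names, not from the slides; the constant would be set from the host with cudaMemcpyToSymbol):

__constant__ float scale;                  /* constant memory: read-only across the grid */

__global__ void accessDemo(float *out)     /* out points to global memory (grid level)   */
{
    __shared__ float tile[128];            /* shared memory: read/write within one block */
    int t = threadIdx.x;                   /* t is held in a register                    */
    tile[t] = (float)t;                    /* write to shared memory                     */
    __syncthreads();                       /* make shared writes visible to the block    */
    out[blockIdx.x * blockDim.x + t] = tile[t] * scale;  /* read shared and constant, write global */
}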
-
CUDA API
-
Programming Pattern
Host reads input and allocates memory on the device
Host copies data to the device
Host invokes a kernel that gets executed in parallel, using the data and hardware on the device, to do some useful work
Host copies the results back from the device for post-processing
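A minimal host-side sketch of this pattern (hypothetical names, error checking omitted), previewing the API calls described on the following slides:

#include <cuda_runtime.h>

__global__ void scaleKernel(float *d_a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;       /* global thread index */
    if (i < n)
        d_a[i] *= 2.0f;                                  /* some useful work    */
}

int main(void)
{
    const int n = 1024;
    size_t size = n * sizeof(float);
    float h_a[1024];
    for (int i = 0; i < n; i++) h_a[i] = (float)i;       /* host prepares input */

    float *d_a;
    cudaMalloc((void **)&d_a, size);                     /* allocate memory on the device */
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);  /* copy data to the device       */

    scaleKernel<<<n / 256, 256>>>(d_a, n);               /* invoke the kernel             */

    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);  /* copy the results back         */
    cudaFree(d_a);                                       /* free device memory            */
    return 0;
}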
-
Kernel Setup
__global__ void myKernel(float *b, float *a); // declaration
dim3 dimGrid(2,2,1);
dim3 dimBlock(4,8,8);
myKernel<<< dimGrid, dimBlock >>>( d_b, d_a ); // launch with grid and block dimensions
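With these settings, the launch creates a 2 x 2 x 1 grid of 4 blocks, each holding 4 x 8 x 8 = 256 threads, for 4 x 256 = 1024 threads in total. (d_b and d_a are assumed to be device pointers prepared as on the following slides.)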
-
Device Memory Allocation
cudaMalloc(&myDataAddress, sizeOfData)
Requires the address of a pointer that will receive the allocation, and the size of such data.
cudaFree(myDataPointer)
Used to free the allocated memory on the device.
Also check cudaMallocHost() and cudaFreeHost() in the CUDA Reference Manual.
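A minimal allocation sketch (hypothetical names) adding the error checking the slide leaves out:

#include <stdio.h>
#include <cuda_runtime.h>

float *allocateOnDevice(size_t size)
{
    float *d_data = NULL;                                   /* will hold the device address */
    cudaError_t err = cudaMalloc((void **)&d_data, size);   /* allocate on the device       */
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return d_data;                                          /* pass to kernels; release later with cudaFree(d_data) */
}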
-
7/29/2019 CUDA_odp
26/29
Device Data Transfer
cudaMemcpy()
Requires: pointer to destination, pointer to source, size, type of transfer
Examples:
cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(elements_h, elements_d, size, cudaMemcpyDeviceToHost);
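Like cudaMalloc(), cudaMemcpy() returns a cudaError_t that is worth checking; a minimal sketch reusing the slide's names:

cudaError_t err = cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "copy to device failed: %s\n", cudaGetErrorString(err));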
-
Function Declaration
__global__ is used to declare a kernel. It must return void.

Declaration                        Executes On    Callable From
__device__ float myDeviceFunc()    Device         Device
__host__ float myHostFunc()        Host           Host
__global__ void myKernel()         Device         Host
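A minimal sketch (hypothetical names) combining the three qualifiers:

__device__ float square(float x)        /* runs on the device, callable from kernels        */
{
    return x * x;
}

__global__ void squareAll(float *d_a)   /* kernel: runs on the device, launched by the host */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[i] = square(d_a[i]);
}

__host__ void runOnHost(float *d_a)     /* ordinary host function (__host__ is the default) */
{
    squareAll<<<4, 256>>>(d_a);
}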
-
Useful Variables
gridDim.(x|y) = grid dimensions in x and y
blockDim.(x|y|z) = number of threads in a block, per dimension
blockIdx.(x|y) = block index within the grid
threadIdx.(x|y|z) = thread index within a block
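Combined, these variables give each thread a unique position in the overall problem domain; a common sketch for 1D and 2D layouts:

/* 1D: unique index across the whole grid */
int i = blockIdx.x * blockDim.x + threadIdx.x;

/* 2D: row and column of this thread across the whole grid */
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;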
-
Variable Type Qualifiers
Variable type qualifiers specify the memory location of a variable in the device's memory.
__device__
Declares a variable in device (global) memory
__constant__
Declares a constant in constant memory on the device
__shared__
Declares a variable in the shared memory of a thread block
Note: All dynamically allocated (extern) shared memory variables start at the same address. You must use offsets if multiple such variables are declared in shared memory.
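A minimal sketch of that offset technique (hypothetical names), carving two logical arrays out of one dynamic shared memory allocation:

extern __shared__ char smem[];          /* dynamic shared memory, sized at launch        */

__global__ void twoArrays(int n)
{
    /* both logical arrays start at the same base address; separate them with offsets */
    float *a = (float *)smem;           /* first n floats                                */
    int   *b = (int *)&a[n];            /* n ints, placed right after a                  */
    /* ... use a[] and b[] ... */
}

/* launch with enough dynamic shared memory for both arrays:           */
/* twoArrays<<<grid, block, n * sizeof(float) + n * sizeof(int)>>>(n); */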