GPGPU Programming with CUDA
Leandro Avila - University of Northern Iowa
Mentor:
Dr. Paul Gray
Computer Science Department
University of Northern Iowa
-
Outline
Introduction
Architecture Description
Introduction to CUDA API
-
Introduction
Shift in the traditional paradigm of sequential programming towards parallel processing.
Scientific computing needs to change in order to deal with vast amounts of data.
Hardware changes have contributed to the move towards parallel processing.
-
Three Walls of Serial Performance
Manferdelli, J. (2007) - The Many-Core Inflection Point for Mass Market Computer Systems
Memory Wall
Discrepancy between memory and CPU performance
Instruction Level Parallelism Wall
Effort invested in ILP keeps increasing, with diminishing returns
Power Wall
Clock frequency gains are limited by heat dissipation.
-
Accelerators
In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload.
In the old days (1980s), supercomputers had array processors for vector operations on arrays, and floating point accelerators.
More recently, Field Programmable Gate Arrays (FPGAs) allow reprogramming deep into the hardware.
Courtesy of Henry Neeman - http://www.oscer.ou.edu/
-
Accelerators
Advantages
They make your code run faster
Disadvantages
More expensive
Harder to program
Code is not portable from one accelerator to another. (OpenCL attempts to change this.)
Courtesy of Henry Neeman - http://www.oscer.ou.edu/
-
Introducing GPGPU
General Purpose Computing on Graphics Processing Units
Great example of the trend of moving away from the traditional model.
-
Why GPUs?
Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering.
They became very popular with video gamers, because they've produced better and better images, and lightning fast.
And prices have been extremely good, ranging from three figures at the low end to four figures at the high end.
GPUs mostly do stuff like rendering images.
This is done mostly through floating point arithmetic, the same stuff people use supercomputing for!
Courtesy of Henry Neeman - http://www.oscer.ou.edu/
-
GPU vs. CPU Flop Rate
From the Nvidia CUDA Programming Guide
-
Architecture
-
Architecture Comparison
                         Nvidia Tesla C1060        Intel i7 975 Extreme
Processing Cores         240                       4
Memory                   4 GB                      L1 32 KB/core, L2 256 KB/core, L3 8 MB (shared)
Clock Speed              1.3 GHz                   3.33 GHz
Memory Bandwidth         102 GB/sec                25 GB/sec
Flop Rate (GFLOP/sec)    933 single / 78 double    70 double
-
CPU vs. GPU
From the Nvidia CUDA Programming Guide
-
Components
Texture Processor Clusters
Streaming Multiprocessors
Streaming Processors
From http://www.tomshardware.com/reviews/nvidia-cuda-gpu,1954-7.html
-
Streaming Multiprocessors
Blocks of threads are assigned to SMs
An SM contains 8 Scalar Processors
Tesla C1060
Number of SMs = 30
Number of Cores = 240
The more SMs you have, the better
-
Hardware Hierarchy
Stream Processor Array
Contains 10 Texture Processor Clusters
Texture Processor Clusters
Contains 3 Streaming Multiprocessors
Streaming Multiprocessors
Contains 8 Scalar Processors
Scalar Processors
They do the work :)
-
Connecting some dots...
Great! We see the GPU architecture is different from what we see in the traditional CPU.
So... Now what?
What does all this mean?
How do we use it?
-
Glossary
The HOST is the machine executing the main program
The DEVICE is the card with the GPU
The KERNEL is the routine that runs on the GPU
A THREAD is the basic execution unit in the GPU
A BLOCK is a group of threads
A GRID is a group of blocks
A WARP is a group of 32 threads
-
CUDA Kernel Execution
Recall that threads are organized in BLOCKS, and at the same time BLOCKS are organized in a GRID.
The GRID can have 2 dimensions: X and Y
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
The BLOCK(S) can have 3 dimensions: X, Y, Z
Maximum sizes of each dimension of a block: 512 x 512 x 64
Prior to kernel execution, we need to set it up by specifying the dimensions of the GRID and the dimensions of the BLOCKS.
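The maximums above are for this hardware generation and vary by device. As a minimal sketch (assuming the CUDA runtime API and a single device), cudaGetDeviceProperties() reports the actual limits:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* query device 0 */

    printf("Max grid size:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max block size: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}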
-
Scheduling in Hardware
Grid is launched
Blocks are distributed to the available SMs
SM initiates processing of warps
SM schedules warps that are ready
As warps finish and resources are freed, new warps are scheduled
An SM can take up to 1024 threads
Ex: 256 threads x 4 blocks OR 128 threads x 8 blocks
[Figure: the host launches Kernel 1 on Grid 1 (a 3 x 2 arrangement of blocks) and Kernel 2 on Grid 2; within Block (1,1), threads are arranged in a 5 x 3 layout. Kirk & Hwu, University of Illinois Urbana-Champaign]
-
Memory Layout
Registers and shared memory are the fastest
Local Memory is virtual memory
Global Memory is the slowest
From the Nvidia CUDA Programming Guide
-
Thread Memory Access
Threads access memory as follows:
Registers: Read & Write
Local Memory: Read & Write
Shared Memory: Read & Write (block level)
Global Memory: Read & Write (grid level)
Constant Memory: Read only (grid level)
Remember that Local Memory is implemented as virtual memory from a region that resides in Global Memory.
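A minimal kernel sketch illustrating these access scopes (hypothetical names, not from the slides; the constant would be set from the host with cudaMemcpyToSymbol):

__constant__ float scale;                  /* constant memory: read-only across the grid */

__global__ void accessDemo(float *out)     /* out points to global memory (grid level)   */
{
    __shared__ float tile[128];            /* shared memory: read/write within one block */
    int t = threadIdx.x;                   /* t is held in a register                    */
    tile[t] = (float)t;                    /* write to shared memory                     */
    __syncthreads();                       /* make shared writes visible to the block    */
    out[blockIdx.x * blockDim.x + t] = tile[t] * scale;  /* read shared and constant, write global */
}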
-
CUDA API
-
Programming Pattern
Host reads input and allocates memory on the device
Host copies data to the device
Host invokes a kernel that gets executed in parallel, using the data and hardware on the device, to do some useful work
Host copies the results back from the device for post-processing
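A minimal host-side sketch of this pattern (hypothetical names, error checking omitted), previewing the API calls described on the following slides:

#include <cuda_runtime.h>

__global__ void scaleKernel(float *d_a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;       /* global thread index */
    if (i < n)
        d_a[i] *= 2.0f;                                  /* some useful work    */
}

int main(void)
{
    const int n = 1024;
    size_t size = n * sizeof(float);
    float h_a[1024];
    for (int i = 0; i < n; i++) h_a[i] = (float)i;       /* host prepares input */

    float *d_a;
    cudaMalloc((void **)&d_a, size);                     /* allocate memory on the device */
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);  /* copy data to the device       */

    scaleKernel<<<n / 256, 256>>>(d_a, n);               /* invoke the kernel             */

    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);  /* copy the results back         */
    cudaFree(d_a);                                       /* free device memory            */
    return 0;
}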
-
Kernel Setup
__global__ void myKernel(float *b, float *a); // declaration
dim3 dimGrid(2,2,1);
dim3 dimBlock(4,8,8);
myKernel<<< dimGrid, dimBlock >>>( d_b, d_a ); // launch with grid and block dimensions
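With these settings, the launch creates a 2 x 2 x 1 grid of 4 blocks, each holding 4 x 8 x 8 = 256 threads, for 4 x 256 = 1024 threads in total. (d_b and d_a are assumed to be device pointers prepared as on the following slides.)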
-
Device Memory Allocation
cudaMalloc(&myDataAddress, sizeOfData)
Requires the address of a pointer that will receive the allocation, and the size of such data.
cudaFree(myDataPointer)
Used to free the allocated memory on the device.
Also check cudaMallocHost() and cudaFreeHost() in the CUDA Reference Manual.
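A minimal allocation sketch (hypothetical names) adding the error checking the slide leaves out:

#include <stdio.h>
#include <cuda_runtime.h>

float *allocateOnDevice(size_t size)
{
    float *d_data = NULL;                                   /* will hold the device address */
    cudaError_t err = cudaMalloc((void **)&d_data, size);   /* allocate on the device       */
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return d_data;                                          /* pass to kernels; release later with cudaFree(d_data) */
}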
-
7/29/2019 CUDA_odp
26/29
Device Data Transfer
cudaMemcpy()
Requires: pointer to destination, pointer to source, size, type of transfer
Examples:
cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(elements_h, elements_d, size, cudaMemcpyDeviceToHost);
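Like cudaMalloc(), cudaMemcpy() returns a cudaError_t that is worth checking; a minimal sketch reusing the slide's names:

cudaError_t err = cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "copy to device failed: %s\n", cudaGetErrorString(err));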
-
Function Declaration
__global__ is used to declare a kernel. It must return void.

Declaration                        Executes On    Callable From
__device__ float myDeviceFunc()    Device         Device
__host__ float myHostFunc()        Host           Host
__global__ void myKernel()         Device         Host
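A minimal sketch (hypothetical names) combining the three qualifiers:

__device__ float square(float x)        /* runs on the device, callable from kernels        */
{
    return x * x;
}

__global__ void squareAll(float *d_a)   /* kernel: runs on the device, launched by the host */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[i] = square(d_a[i]);
}

__host__ void runOnHost(float *d_a)     /* ordinary host function (__host__ is the default) */
{
    squareAll<<<4, 256>>>(d_a);
}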
-
Useful Variables
gridDim.(x|y) = grid dimensions in x and y
blockDim.(x|y|z) = number of threads in a block, per dimension
blockIdx.(x|y) = block index within the grid
threadIdx.(x|y|z) = thread index within a block
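Combined, these variables give each thread a unique position in the overall problem domain; a common sketch for 1D and 2D layouts:

/* 1D: unique index across the whole grid */
int i = blockIdx.x * blockDim.x + threadIdx.x;

/* 2D: row and column of this thread across the whole grid */
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;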
-
Variable Type Qualifiers
Variable type qualifiers specify the memory location of a variable in the device's memory.
__device__
Declares a variable in device (global) memory
__constant__
Declares a constant in constant memory on the device
__shared__
Declares a variable in the shared memory of a thread block
Note: All dynamically allocated (extern) shared memory variables start at the same address. You must use offsets if multiple such variables are declared in shared memory.
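A minimal sketch of that offset technique (hypothetical names), carving two logical arrays out of one dynamic shared memory allocation:

extern __shared__ char smem[];          /* dynamic shared memory, sized at launch        */

__global__ void twoArrays(int n)
{
    /* both logical arrays start at the same base address; separate them with offsets */
    float *a = (float *)smem;           /* first n floats                                */
    int   *b = (int *)&a[n];            /* n ints, placed right after a                  */
    /* ... use a[] and b[] ... */
}

/* launch with enough dynamic shared memory for both arrays:           */
/* twoArrays<<<grid, block, n * sizeof(float) + n * sizeof(int)>>>(n); */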