

    GPGPU Programming with CUDA

    Leandro Avila - University of Northern Iowa

    Mentor:

    Dr. Paul Gray

    Computer Science Department

    University of Northern Iowa


    Outline

    Introduction

    Architecture Description

    Introduction to CUDA API


    Introduction

Shift from the traditional paradigm of sequential programming towards parallel processing.

Scientific computing needs to change in order to deal with vast amounts of data.

Hardware changes have contributed to the move towards parallel processing.


    Three Walls of Serial Performance

    Manferdelli, J. (2007) - The Many-Core Inflection Point for Mass Market Computer Systems

    Memory Wall

Discrepancy between memory and CPU performance

    Instruction Level Parallelism Wall

Effort put into ILP keeps increasing, with diminishing returns

    Power Wall

Clock frequency vs. heat dissipation


    Accelerators

In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload.

In the old days (1980s), supercomputers had array processors for vector operations on arrays, and floating point accelerators. More recently, Field Programmable Gate Arrays (FPGAs) allow reprogramming deep into the hardware.

    Courtesy of Henry Neeman - http://www.oscer.ou.edu/


    Accelerators

    Advantages

    They make your code run faster

    Disadvantages

More expensive

Harder to program

Code is not portable from one accelerator to another. (OpenCL attempts to change this.)

    Courtesy of Henry Neeman - http://www.oscer.ou.edu/


    Introducing GPGPU

    General Purpose Computing on Graphics Processing Units

A great example of the trend of moving away from the traditional model.


    Why GPUs?

Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering.

They became very popular with video gamers, because they've produced better and better images, lightning fast. And prices have been extremely good, ranging from three figures at the low end to four figures at the high end.

    GPUs mostly do stuff like rendering images.

This is done mostly through floating point arithmetic, the same stuff people use supercomputing for!

    Courtesy of Henry Neeman - http://www.oscer.ou.edu/


    GPU vs. CPU Flop Rate

From the Nvidia CUDA Programming Guide


    Architecture


    Architecture Comparison

                      Nvidia Tesla C1060        Intel i7 975 Extreme
Processing Cores      240                       4
Memory                4 GB                      L1 Cache 32 KB/core, L2 Cache 256 KB/core, L3 Cache 8 MB (shared)
Clock Speed           1.3 GHz                   3.33 GHz
Memory Bandwidth      102 GB/sec                25 GB/sec
GFLOPS                933 Single / 78 Double    70 Double


    CPU vs. GPU

From the Nvidia CUDA Programming Guide


    Components

Texture Processor Clusters

    Streaming Multiprocessors

    Streaming Processor

    From http://www.tomshardware.com/reviews/nvidia-cuda-gpu,1954-7.html


    Streaming Multiprocessors

Blocks of threads are assigned to SMs

An SM contains 8 Scalar Processors

    Tesla C1060

Number of SMs = 30

    Number of Cores = 240

The more SMs you have, the better


    Hardware Hierarchy

Stream Processor Array

Contains 10 Texture Processor Clusters

Texture Processor Cluster

Contains 3 Streaming Multiprocessors

Streaming Multiprocessor

Contains 8 Scalar Processors

Scalar Processors

They do the work :) (10 x 3 x 8 = 240, matching the Tesla C1060's 240 cores)


    Connecting some dots...

Great! We see the GPU architecture is different from what we see in the traditional CPU.

    So... Now what?

What does this all mean?

    How do we use it?


    Glossary

The HOST is the machine executing the main program

The DEVICE is the card with the GPU

The KERNEL is the routine that runs on the GPU

A THREAD is the basic execution unit in the GPU

A BLOCK is a group of threads

A GRID is a group of blocks

A WARP is a group of 32 threads
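
To tie these terms together, a minimal sketch (the kernel name and d_data are illustrative, not from the slides):

__global__ void myKernel(float *data)   // the KERNEL: runs on the DEVICE
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this THREAD's global position
    data[i] += 1.0f;
}

// On the HOST: launch a GRID of 4 BLOCKS of 64 threads (2 WARPS) each:
// myKernel<<<4, 64>>>(d_data);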


    CUDA Kernel Execution

Recall that threads are organized in BLOCKS, and BLOCKS in turn are organized in a GRID.

The GRID can have 2 dimensions: X and Y

    Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

A BLOCK can have 3 dimensions: X, Y, and Z

    Maximum sizes of each dimension of a block: 512 x 512 x 64

Prior to kernel execution we need to set it up by specifying the dimensions of the GRID and the dimensions of the BLOCKS, as in the sketch below.
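
A sketch of a launch configuration within these limits (the kernel name and sizes are assumptions):

dim3 dimGrid(256, 256);   // 65,536 blocks in a 2D grid (within 65535 x 65535)
dim3 dimBlock(8, 8, 4);   // 8 x 8 x 4 = 256 threads per block (within 512 x 512 x 64)
myKernel<<<dimGrid, dimBlock>>>(d_data);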


    Scheduling in Hardware

    Grid is launched

Blocks are distributed to the available SMs

The SM initiates processing of warps

The SM schedules warps that are ready

As warps finish and resources are freed, new warps are scheduled

An SM can take up to 1024 threads

Ex: 4 blocks of 256 threads, or 8 blocks of 128 threads

[Figure: the Host launches Kernel 1 into Grid 1 on the Device, a 3 x 2 arrangement of blocks, and Kernel 2 into Grid 2; Block (1,1) of Grid 2 expands into a 5 x 3 arrangement of threads.]

Kirk & Hwu, University of Illinois Urbana-Champaign


    Memory Layout

Registers and shared memory are the fastest

Local memory is virtual memory

Global memory is the slowest

From the Nvidia CUDA Programming Guide


    Thread Memory Access

Threads access memory as follows:

Registers: read & write

Local memory: read & write

Shared memory: read & write (block level)

Global memory: read & write (grid level)

Constant memory: read only (grid level)

Remember that Local Memory is implemented as virtual memory from a region that resides in Global Memory. The sketch below illustrates these access rules.
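
A minimal sketch touching each space (the kernel name and array sizes are assumptions; it assumes blockDim.x <= 128):

__constant__ float coeff[16];              // constant memory: read-only in kernels

__global__ void accessDemo(float *out)     // out points into global memory
{
    __shared__ float tile[128];            // shared memory: per-block read & write
    float r = coeff[threadIdx.x % 16];     // r lives in a register: read & write
    tile[threadIdx.x] = r;                 // write to shared memory
    __syncthreads();                       // make shared writes visible block-wide
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tile[threadIdx.x];            // write to global memory
}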


    CUDA API


    Programming Pattern

    Host reads input and allocates memory in the device

    Host copies data to the device

Host invokes a kernel that gets executed in parallel, using the data and hardware in the device, to do some useful work.

Host copies back the results from the device for post-processing. (A sketch of the whole pattern follows.)
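
A minimal end-to-end sketch of this pattern; the kernel, buffer names, and N are assumptions chosen for illustration:

#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element in place.
__global__ void doubleElements(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 2.0f * data[i];
}

int main()
{
    const int N = 1024;
    size_t size = N * sizeof(float);

    float *elements_h = (float *)malloc(size);        // host reads/creates input
    for (int i = 0; i < N; i++) elements_h[i] = (float)i;

    float *elements_d;
    cudaMalloc((void **)&elements_d, size);           // 1. allocate on the device
    cudaMemcpy(elements_d, elements_h, size,
               cudaMemcpyHostToDevice);               // 2. copy data to the device

    doubleElements<<<N / 256, 256>>>(elements_d);     // 3. kernel does the work

    cudaMemcpy(elements_h, elements_d, size,
               cudaMemcpyDeviceToHost);               // 4. copy results back

    cudaFree(elements_d);
    free(elements_h);
    return 0;
}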



    Kernel Setup

__global__ void myKernel(float *b, float *a); // declaration

dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 8, 8);

myKernel<<< dimGrid, dimBlock >>>( d_b, d_a );
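
For context, a matching definition might look like this sketch (the body and parameter names are assumptions; d_b and d_a are device pointers allocated beforehand):

__global__ void myKernel(float *b, float *a)
{
    // 2 x 2 grid, each block 4 x 8 x 8 = 256 threads: 1,024 threads total
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    int local = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    int i     = block * (blockDim.x * blockDim.y * blockDim.z) + local;
    b[i] = a[i];   // each thread handles one element
}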



    Device Memory Allocation

cudaMalloc(&myDataAddress, sizeOfData)

Takes the address of a pointer that will point to the allocated data, and the size of such data.

cudaFree(myDataPointer)

Used to free the allocated memory on the device.

Also check cudaMallocHost() and cudaFreeHost() in the CUDA Reference Manual.
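
A minimal sketch of this pair of calls (the pointer name and size are assumptions):

float *d_data;                        // device pointer, filled in by cudaMalloc
size_t size = 1024 * sizeof(float);   // number of bytes to allocate
cudaMalloc((void **)&d_data, size);   // pass the ADDRESS of the pointer
/* ... use d_data in kernel launches ... */
cudaFree(d_data);                     // release the device memory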



    Device Data Transfer

    cudaMemcpy()

Requires: pointer to destination, pointer to source, size, type of transfer

Examples:

cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);

cudaMemcpy(elements_h, elements_d, size, cudaMemcpyDeviceToHost);
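
In context, a round trip might look like this sketch (the buffer names carry over from the examples above; the size is an assumption):

float elements_h[256];                // host buffer
float *elements_d;                    // device buffer
size_t size = sizeof(elements_h);

cudaMalloc((void **)&elements_d, size);
cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);  // host -> device
/* ... launch kernels that read and write elements_d ... */
cudaMemcpy(elements_h, elements_d, size, cudaMemcpyDeviceToHost);  // device -> host
cudaFree(elements_d);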



    Function Declaration

__global__ is used to declare a kernel. It must return void.

                                     Executes On    Callable From
__device__ float myDeviceFunc()      Device         Device
__host__ float myHostFunc()          Host           Host
__global__ void myKernel()           Device         Host
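
A sketch of the three qualifiers in use (the function names and bodies are illustrative):

__device__ float square(float x)        // device-only helper
{
    return x * x;
}

__host__ float squareOnHost(float x)    // ordinary host function
{
    return x * x;
}

__global__ void squareAll(float *data)  // kernel: runs on device, called from host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = square(data[i]);          // kernels may call __device__ functions
}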



    Useful Variables

gridDim.(x|y) = grid dimensions in x and y

blockDim.(x|y|z) = number of threads in a block, per dimension

blockIdx.(x|y) = block index within the grid

threadIdx.(x|y|z) = thread index within the block
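
These combine into the idiomatic global index computation, as in this sketch (the kernel name is illustrative):

__global__ void indexDemo(float *out)
{
    // unique global index for a 1D grid of 1D blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)i;
}

// e.g. launched as: indexDemo<<<gridSize, blockSize>>>(d_out);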



    Variable Type Qualifiers

Variable type qualifiers specify the memory location of a variable in the device's memory

__device__

Declares a variable in device (global) memory

__constant__

Declares a constant in constant memory on the device

__shared__

Declares a variable in per-block shared memory

Note: all dynamically allocated shared memory variables start at the same address. You must use offsets if multiple variables share that region.
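
A sketch of the qualifiers, including the offset trick for dynamically allocated shared memory (the kernel name, array names, and sizes are assumptions):

__device__ float d_scale = 2.0f;        // variable in device (global) memory
__constant__ float c_bias[16];          // constant-memory array

extern __shared__ char smem[];          // dynamic shared memory, sized at launch

__global__ void qualifiersDemo(float *out, int n)
{
    // both arrays live in the same shared region; carve it up with offsets
    float *a = (float *)smem;           // first n floats
    float *b = (float *)smem + n;       // next n floats
    int i = threadIdx.x;
    if (i < n) {
        a[i] = d_scale * i;
        b[i] = a[i] + c_bias[i % 16];
        out[blockIdx.x * n + i] = b[i];
    }
}

// launched with the shared size as the third config argument, e.g.:
// qualifiersDemo<<<grid, block, 2 * n * sizeof(float)>>>(d_out, n);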