CS 179: Lecture 4, Lab Review 2

Post on 24-Feb-2016


Groups of Threads (Hierarchy), largest to smallest

“Grid”: All of the threads
Size: (number of threads per block) * (number of blocks)

“Block”:
Size: User-specified
Should be a multiple of 32 (often, higher is better)
Upper limit given by hardware (512 in Tesla, 1024 in Fermi)
Features: Shared memory, synchronization
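To make the grid and block sizes concrete, here is a minimal sketch of a launch in which every thread records its own global index. The names write_global_index and run_write_global_index are hypothetical and not part of the lab; error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread records its own global index.
__global__ void write_global_index(int *out, int n) {
    // Global index = (block's position in the grid) * (threads per block)
    //              + (thread's position within its block).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // the grid may hold more threads than there are elements
        out[idx] = idx;
    }
}

// Hypothetical host helper: launches the kernel and copies the result back.
std::vector<int> run_write_global_index(int n, int threads_per_block) {
    // Round up so that (threads per block) * (number of blocks) >= n.
    int num_blocks = (n + threads_per_block - 1) / threads_per_block;

    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    write_global_index<<<num_blocks, threads_per_block>>>(d_out, n);

    std::vector<int> h_out(n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

For example, run_write_global_index(1000, 256) launches 4 blocks of 256 threads (a grid of 1024 threads), and the bounds check keeps the last 24 threads from writing.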

Groups of Threads

“Warp”:
Group of 32 threads
Execute in lockstep (same instructions)
Susceptible to divergence!

Divergence

“Two roads diverged in a wood… and I took both”

Divergence

What happens:
Executes normally until if-statement
Branches to calculate Branch A (blue threads)
Goes back (!) and branches to calculate Branch B (red threads)
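As an illustration of the behavior described above, the sketch below contrasts a branch that splits threads within a warp with one that keeps every warp on a single path. Both kernels and the run_branch_kernel helper are hypothetical, and the second kernel assumes a one-dimensional block so that consecutive values of threadIdx.x fall in the same warp.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: even and odd threads of the same warp disagree,
// so the warp runs Branch A and then goes back to run Branch B.
__global__ void divergent_branches(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    if (threadIdx.x % 2 == 0) {
        data[idx] *= 2.0f;   // "Branch A"
    } else {
        data[idx] += 1.0f;   // "Branch B"
    }
}

// Hypothetical variant: the condition depends only on the warp's position in
// the block, so all 32 threads of a warp take the same branch.
__global__ void warp_uniform_branches(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    int warp_in_block = threadIdx.x / warpSize;  // warpSize is a CUDA built-in
    if (warp_in_block % 2 == 0) {
        data[idx] *= 2.0f;   // "Branch A"
    } else {
        data[idx] += 1.0f;   // "Branch B"
    }
}

// Hypothetical host helper: fills n elements with `initial`, runs one of the
// two kernels with 64 threads per block, and returns the result.
std::vector<float> run_branch_kernel(bool warp_uniform, int n, float initial) {
    std::vector<float> h(n, initial);
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    if (warp_uniform) {
        warp_uniform_branches<<<blocks, threads>>>(d, n);
    } else {
        divergent_branches<<<blocks, threads>>>(d, n);
    }

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

The two kernels assign work to different threads, so their outputs differ; the point is only the branching pattern within each warp.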

“Divergent tree”

[Figure: reduction tree. Assume 512 threads in block. Thread indices shown at successive levels: “… 506, 508, 510”; “… 500, 504, 508”; “… 488, 496, 504”; “… 464, 480, 496”]

“Divergent tree”

// Let our shared memory block be partial_outputs[]
...
synchronize threads before starting
...
set offset to 1
while ((offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads

Get thread 0 to atomicAdd() partial_outputs[0] to output

Assumes block size is power of 2…
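The pseudocode above could be written in CUDA roughly as follows. This is a sketch under assumptions: BLOCK_SIZE, the kernel name, and the host helper are hypothetical, and the shared array is filled from a plain input array rather than from the lab's polynomial evaluation. The helper assumes a non-empty input and skips error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr unsigned int BLOCK_SIZE = 512;  // assumed; must be a power of 2

// Hypothetical kernel: sums `in` into *output using the "divergent tree".
__global__ void sum_divergent_tree(const float *in, float *output,
                                   unsigned int n) {
    __shared__ float partial_outputs[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + tid;

    // Assumption: each thread contributes one input element (0 past the end).
    partial_outputs[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();  // synchronize threads before starting

    for (unsigned int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        if (tid % (offset * 2) == 0) {
            partial_outputs[tid] += partial_outputs[tid + offset];
        }
        // Every thread must reach the barrier, so it stays outside the if.
        __syncthreads();
    }

    if (tid == 0) {
        atomicAdd(output, partial_outputs[0]);
    }
}

// Hypothetical host helper: returns the sum of `values` computed on the GPU.
float sum_with_divergent_tree(const std::vector<float> &values) {
    unsigned int n = static_cast<unsigned int>(values.size());
    unsigned int num_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));  // all-zero bytes are 0.0f

    sum_divergent_tree<<<num_blocks, BLOCK_SIZE>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```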

“Non-divergent tree”

[Figure: reduction tree] Example purposes only! Real blocks are way bigger!

“Non-divergent tree”

// Let our shared memory block be partial_outputs[]
...
set offset to highest power of 2 that’s less than the block dimension
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads

Get thread 0 to atomicAdd() partial_outputs[0] to output

Assumes block size is power of 2…
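A CUDA sketch of the non-divergent version follows, under the same kind of assumptions: BLOCK_SIZE, the kernel name, and the host helper are hypothetical, and the shared array is loaded from a plain input array instead of the lab's polynomial evaluation. With a power-of-2 block size, the starting offset is simply half the block dimension. The barrier after the load is added here because the loop reads values written by other threads.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr unsigned int BLOCK_SIZE = 512;  // assumed; must be a power of 2

// Hypothetical kernel: sums `in` into *output using the "non-divergent tree".
__global__ void sum_nondivergent_tree(const float *in, float *output,
                                      unsigned int n) {
    __shared__ float partial_outputs[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + tid;

    // Assumption: each thread contributes one input element (0 past the end).
    partial_outputs[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();  // make every load visible before the first addition

    // Highest power of 2 below a power-of-2 block dimension is half of it.
    for (unsigned int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
        if (tid < offset) {
            // Active threads are contiguous: 0 .. offset-1.
            partial_outputs[tid] += partial_outputs[tid + offset];
        }
        __syncthreads();  // outside the if so that every thread reaches it
    }

    if (tid == 0) {
        atomicAdd(output, partial_outputs[0]);
    }
}

// Hypothetical host helper: returns the sum of `values` computed on the GPU.
float sum_with_nondivergent_tree(const std::vector<float> &values) {
    unsigned int n = static_cast<unsigned int>(values.size());
    unsigned int num_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));  // all-zero bytes are 0.0f

    sum_nondivergent_tree<<<num_blocks, BLOCK_SIZE>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```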

“Divergent tree”

Where is the divergence?
Two branches:
Accumulate
Do nothing

If the second branch does nothing, then where is the performance loss?

“Divergent tree” – Analysis

First iteration: (Reduce 512 -> 256):

Warp of threads 0-31: (After calculating polynomial)
Thread 0: Accumulate
Thread 1: Do nothing
Thread 2: Accumulate
Thread 3: Do nothing
…

Warp of threads 32-63: (same thing!)

… (up to) Warp of threads 480-511

Number of executing warps: 512 / 32 = 16

“Divergent tree” – Analysis

Second iteration: (Reduce 256 -> 128):

Warp of threads 0-31: (After calculating polynomial)
Thread 0: Accumulate
Threads 1-3: Do nothing
Thread 4: Accumulate
Threads 5-7: Do nothing
…

Warp of threads 32-63: (same thing!)

… (up to) Warp of threads 480-511

Number of executing warps: 16 (again!)

“Divergent tree” – Analysis

(Process continues until the offset is large enough to separate warps)

“Non-divergent tree” – Analysis

First iteration: (Reduce 512 -> 256): (Part 1)

Warp of threads 0-31: Accumulate

Warp of threads 32-63: Accumulate

… (up to) Warp of threads 224-255

Then what?

“Non-divergent tree” – Analysis

First iteration: (Reduce 512 -> 256): (Part 2)

Warp of threads 256-287: Do nothing!

… (up to) Warp of threads 480-511

Number of executing warps: 256 / 32 = 8 (Was 16 previously!)

“Non-divergent tree” – Analysis

Second iteration: (Reduce 256 -> 128):

Warp of threads 0-31, …, 96-127: Accumulate

Warp of threads 128-159, …, 480-511: Do nothing!

Number of executing warps: 128 / 32 = 4 (Was 16 previously!)
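The warp counts in this analysis can be reproduced with a small host-side calculation. The sketch below counts, for each iteration, the warps that contain at least one accumulating thread; the function name active_warps and the WARP_SIZE constant are hypothetical, and the code runs entirely on the CPU even when compiled as part of a CUDA source file.

```cuda
#include <vector>

constexpr int WARP_SIZE = 32;  // assumed warp size, as in the slides

// Returns true if thread `tid` accumulates in the iteration with this offset.
bool accumulates(int tid, int offset, bool divergent_tree) {
    return divergent_tree ? (tid % (offset * 2) == 0) : (tid < offset);
}

// Hypothetical helper: for each iteration of the chosen reduction, counts the
// warps that have at least one accumulating thread.
std::vector<int> active_warps(int block_size, bool divergent_tree) {
    // Offsets in the order each scheme visits them.
    std::vector<int> offsets;
    if (divergent_tree) {
        for (int o = 1; o * 2 <= block_size; o *= 2) offsets.push_back(o);
    } else {
        for (int o = block_size / 2; o >= 1; o /= 2) offsets.push_back(o);
    }

    std::vector<int> counts;
    for (int offset : offsets) {
        int active = 0;
        for (int start = 0; start < block_size; start += WARP_SIZE) {
            bool any = false;
            for (int tid = start; tid < start + WARP_SIZE && tid < block_size; ++tid) {
                if (accumulates(tid, offset, divergent_tree)) any = true;
            }
            if (any) ++active;
        }
        counts.push_back(active);
    }
    return counts;
}
```

For a block of 512 threads, active_warps(512, true) gives 16, 16, 16, 16, 16, 8, 4, 2, 1, while active_warps(512, false) gives 8, 4, 2, 1, 1, 1, 1, 1, 1. The divergent tree keeps all 16 warps busy until the offset reaches the warp size.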

What happened?
“Implicit divergence”

Why did we do this?
Performance improvements
Reveals GPU internals!

Final Puzzle

What happens when the polynomial order increases?

All these threads that we think are competing… are they?

The Real World

In medicine…
More sensitive devices -> more data!
More intensive algorithms
Real-time imaging and analysis

Most are parallelizable problems!

http://www.varian.com

MRI
“k-space” – Inverse FFT
Real-time and high-resolution imaging

http://oregonstate.edu

CT, PET
Low-dose techniques: Safety!
4D CT imaging
X-ray CT vs. PET CT

Texture memory!

http://www.upmccancercenter.com/

Radiation Therapy
Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells
More accurate algorithms possible!
Accuracy = safety!
40 minutes -> 10 seconds

http://en.wikipedia.org

Notes

Office hours:
Kevin: Monday 8-10 PM
Ben: Tuesday 7-9 PM
Connor: Tuesday 8-10 PM

Lab 2: Due Wednesday (4/16), 5 PM
