CS 179: Lecture 4 Lab Review 2

Page 1:

CS 179: Lecture 4
Lab Review 2

Page 2:

Groups of Threads (Hierarchy), largest to smallest:

“Grid”: all of the threads.
Size: (number of threads per block) * (number of blocks)

“Block”:
Size: user-specified. Should be a multiple of 32 at minimum (often, higher is better). Upper limit given by hardware (512 on Tesla, 1024 on Fermi).
Features: shared memory, synchronization
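
For concreteness, here is how those sizes appear in a kernel launch. This is a minimal sketch; the kernel and variable names are made up for illustration:

// Hypothetical kernel; the launch syntax below is standard CUDA.
__global__ void myKernel(float *data) { /* per-thread work */ }

void launch(float *d_data) {
    int threadsPerBlock = 512;  // user-specified; a multiple of 32 (Tesla-era upper limit)
    int numBlocks = 100;        // number of blocks in the grid
    // Grid size = numBlocks * threadsPerBlock = 51,200 threads total
    myKernel<<<numBlocks, threadsPerBlock>>>(d_data);
}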

Page 3:

Groups of Threads

“Warp”: group of 32 threads that execute in lockstep (same instructions).

Susceptible to divergence!
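
As a quick illustration of the grouping, threads are assigned to warps by consecutive index within their block (warpSize is a built-in CUDA constant, equal to 32 on NVIDIA hardware to date):

int warpId = threadIdx.x / warpSize;  // which warp within the block
int lane   = threadIdx.x % warpSize;  // this thread's position in its warp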

Page 4:

Divergence

“Two roads diverged in a wood…

…and I took both”

Page 5:

Divergence

What happens:
Executes normally until the if-statement
Branches to calculate Branch A (blue threads)
Goes back (!) and branches to calculate Branch B (red threads)
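
A minimal kernel sketch of that behavior; the branch condition and the work done in each branch are invented for illustration:

__global__ void divergentKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = 2.0f * i;  // Branch A (the "blue" threads)
    else
        out[i] = 0.5f * i;  // Branch B (the "red" threads)
    // Both branches occur within every warp, so each warp runs Branch A
    // with the odd threads masked off, then goes back and runs Branch B
    // with the even threads masked off.
}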

Page 6:
Page 7:
Page 8:

“Divergent tree”

(Diagram: tree reduction, assuming 512 threads in the block. At each level the accumulating thread indices spread further apart: …506, 508, 510 at the first level, then …500, 504, 508, then …488, 496, 504, then …464, 480, 496.)

Page 9:

“Divergent tree”

// Let our shared memory block be partial_outputs[]...
synchronize threads before starting
set offset to 1
while ((offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads

Get thread 0 to atomicAdd() partial_outputs[0] to output

Assumes block size is a power of 2…
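
The same reduction as a runnable CUDA sketch; the kernel name and the 512-thread block size are assumptions, and loading in[i] stands in for whatever per-thread value (e.g. the polynomial) the lab actually computes:

__global__ void reduceDivergent(const float *in, float *output) {
    __shared__ float partial_outputs[512];     // assumes blockDim.x == 512
    int tid = threadIdx.x;
    partial_outputs[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // synchronize before starting

    // Accumulating threads get sparser every iteration, but remain
    // scattered across all warps (tid % (offset * 2) == 0).
    for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        if (tid % (offset * 2) == 0)
            partial_outputs[tid] += partial_outputs[tid + offset];
        __syncthreads();                       // whole block syncs each level
    }

    if (tid == 0)
        atomicAdd(output, partial_outputs[0]); // one atomic add per block
}
// Assumes blockDim.x is a power of 2.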

Page 10:

“Non-divergent tree”

(Example purposes only! Real blocks are way bigger!)

Page 11:

“Non-divergent tree”

// Let our shared memory block be partial_outputs[]...
set offset to highest power of 2 that’s less than the block dimension
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads

Get thread 0 to atomicAdd() partial_outputs[0] to output

Assumes block size is a power of 2…
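
And the non-divergent version under the same assumptions. The only change is the loop: the accumulating threads are always a contiguous prefix of the block (for a power-of-2 block size, the starting offset is simply blockDim.x / 2):

__global__ void reduceNonDivergent(const float *in, float *output) {
    __shared__ float partial_outputs[512];     // assumes blockDim.x == 512
    int tid = threadIdx.x;
    partial_outputs[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Accumulating threads are always threads 0..offset-1, so whole warps
    // either all take the branch or all skip it.
    for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
        if (tid < offset)
            partial_outputs[tid] += partial_outputs[tid + offset];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(output, partial_outputs[0]);
}
// Assumes blockDim.x is a power of 2.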

Page 12:

“Divergent tree”

Where is the divergence? Two branches:
Accumulate
Do nothing

If the second branch does nothing, then where is the performance loss?

Page 13:
Page 14:

“Divergent tree” – Analysis

First iteration (reduce 512 -> 256):

Warp of threads 0-31 (after calculating the polynomial):
Thread 0: accumulate
Thread 1: do nothing
Thread 2: accumulate
Thread 3: do nothing
…

Warp of threads 32-63: (same thing!)

… (up to) warp of threads 480-511

Number of executing warps: 512 / 32 = 16

Page 15:

“Divergent tree” – Analysis

Second iteration (reduce 256 -> 128):

Warp of threads 0-31 (after calculating the polynomial):
Thread 0: accumulate
Threads 1-3: do nothing
Thread 4: accumulate
Threads 5-7: do nothing
…

Warp of threads 32-63: (same thing!)

… (up to) warp of threads 480-511

Number of executing warps: 16 (again!)

Page 16:

“Divergent tree” – Analysis

(The process continues until the offset is large enough to separate warps: whole warps can only drop out once the stride between accumulating threads, offset * 2, exceeds the warp size of 32.)

Page 17:

“Non-divergent tree” – Analysis

First iteration (reduce 512 -> 256), part 1:

Warp of threads 0-31: accumulate
Warp of threads 32-63: accumulate
… (up to) warp of threads 224-255

Then what?

Page 18:

“Non-divergent tree” – Analysis

First iteration (reduce 512 -> 256), part 2:

Warp of threads 256-287: do nothing!
… (up to) warp of threads 480-511

Number of executing warps: 256 / 32 = 8 (was 16 previously!)

Page 19:

“Non-divergent tree” – Analysis

Second iteration (reduce 256 -> 128):

Warps of threads 0-31, …, 96-127: accumulate
Warps of threads 128-159, …, 480-511: do nothing!

Number of executing warps: 128 / 32 = 4 (was 16 previously!)
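
The warp counts on the last few slides can be checked with a few lines of host-side C. This sketch (a standalone program, names invented) counts, for each iteration of each scheme, how many warps contain at least one accumulating thread:

#include <stdio.h>

int main(void) {
    const int N = 512, WARP = 32;

    printf("divergent tree:\n");
    for (int offset = 1; offset * 2 <= N; offset *= 2) {
        int warps = 0;
        for (int w = 0; w < N / WARP; w++) {
            int active = 0;                       // does warp w do any work?
            for (int t = w * WARP; t < (w + 1) * WARP; t++)
                if (t % (offset * 2) == 0) active = 1;
            warps += active;
        }
        printf("  offset %3d: %2d executing warps\n", offset, warps);
    }

    printf("non-divergent tree:\n");
    for (int offset = N / 2; offset >= 1; offset /= 2) {
        int warps = 0;
        for (int w = 0; w < N / WARP; w++) {
            int active = 0;
            for (int t = w * WARP; t < (w + 1) * WARP; t++)
                if (t < offset) active = 1;
            warps += active;
        }
        printf("  offset %3d: %2d executing warps\n", offset, warps);
    }
    return 0;
}

It prints 16 executing warps for the first five divergent iterations (the stride only exceeds the warp size once offset reaches 32), versus 8, 4, 2, 1, … for the non-divergent tree.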

Page 20:

What happened? “Implicit divergence”

Page 21:

Why did we do this?

Performance improvements
Reveals GPU internals!

Page 22:

Final Puzzle

What happens when the polynomial order increases?

All these threads that we think are competing… are they?

Page 23:

The Real World

Page 24:

In medicine…

More sensitive devices -> more data!
More intensive algorithms
Real-time imaging and analysis

Most are parallelizable problems!

http://www.varian.com

Page 25:

MRI

“k-space” – inverse FFT
Real-time and high-resolution imaging

http://oregonstate.edu
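
As a sketch of how this maps onto the GPU: at its simplest, reconstructing an image from k-space samples is a 2D inverse FFT, which cuFFT performs in one call. The 256x256 size, function name, and array names below are made up, and error checking is omitted:

#include <cufft.h>

// Reconstruct a hypothetical 256x256 image from complex k-space samples.
// d_kspace and d_image are device arrays of 256*256 cufftComplex values.
void reconstruct(cufftComplex *d_kspace, cufftComplex *d_image) {
    cufftHandle plan;
    cufftPlan2d(&plan, 256, 256, CUFFT_C2C);               // 2D complex-to-complex plan
    cufftExecC2C(plan, d_kspace, d_image, CUFFT_INVERSE);  // inverse FFT on the GPU
    cufftDestroy(plan);
    // cuFFT transforms are unnormalized: scale the result by 1/(256*256)
    // to recover properly normalized image values.
}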

Page 26:

CT, PET

Low-dose techniques (safety!)
4D CT imaging
X-ray CT vs. PET CT

Texture memory!

http://www.upmccancercenter.com/

Page 27:

Radiation Therapy

Goal: give sufficient dose to cancerous cells, minimize dose to healthy cells
More accurate algorithms possible!
Accuracy = safety!
40 minutes -> 10 seconds

http://en.wikipedia.org

Page 28:

Notes

Office hours:
Kevin: Monday 8-10 PM
Ben: Tuesday 7-9 PM
Connor: Tuesday 8-10 PM

Lab 2: due Wednesday (4/16), 5 PM