TRANSCRIPT
ICAL
Capabilities and Limitations of Distributed Computing Provided by the GPU Architecture
11/17/09 ICAL 2
Outline
• Parallel computing with GPU
• NVIDIA CUDA
• SVD matrix computation
• Conclusion
Parallel computing with GPU
• Parallel computing
• Flynn’s Taxonomy
• Algorithm decomposition
• Amdahl’s Law
• Correctness concepts
Parallel computing
• Parallel computing is a form of computation in which many calculations are carried out simultaneously.
• Parallel computer hardware:
  – Single machine: multi-core CPU, GPU
  – Multiple machines: clusters, MPPs, grids
Parallel computing (cont.)
• There are several kinds of parallelism, such as:
  – Bit-level
  – Instruction-level
  – Data-level
  – Task-level
• Parallel computing has an inherent speedup limit (see Amdahl’s Law).
Algorithm decomposition
• Example: preparing dinner breaks into subtasks (purchasing, cooking, cleaning the table, washing dishes) before enjoying the dinner.
• Task decomposition: different people do different tasks, e.g. John cleans the table while Mary goes shopping.
• Data decomposition: several people perform the same task on different items, e.g. John and Mary wash the dishes together.
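The dinner example above can be sketched with Python's standard thread pool; the chore functions here are hypothetical stand-ins for real work items.

```python
from concurrent.futures import ThreadPoolExecutor

def go_shopping():
    return "shopping done"          # Mary's task

def clean_table():
    return "table cleaned"          # John's task

def wash(dish):
    return dish + " washed"         # one unit of data-parallel work

with ThreadPoolExecutor() as pool:
    # Task decomposition: different tasks run concurrently.
    shopping = pool.submit(go_shopping)
    cleaning = pool.submit(clean_table)

    # Data decomposition: the same task applied to many data items.
    washed = list(pool.map(wash, ["plate", "bowl", "cup"]))

print(shopping.result(), cleaning.result(), washed)
```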
Flynn’s Taxonomy
                      Single data   Multiple data
Single instruction    SISD          SIMD
Multiple instruction  MISD          MIMD
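The SIMD quadrant is the one CUDA targets; it can be illustrated with NumPy (assumed available here): conceptually a single add operation applied to multiple data elements, versus an SISD-style element-at-a-time loop.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# SISD style: one instruction operates on one data element at a time.
sisd = [x + y for x, y in zip(a, b)]

# SIMD style: one vectorized "add" conceptually covers all elements.
simd = a + b

print(sisd, simd)
```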
Amdahl’s Law
• Amdahl’s law models the expected overall speedup when only part of a program is improved:

  Speedup = 1 / ((1 − P) + P/S)

  P: parallel portion of the program
  S: speedup of the parallel portion
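The formula is easy to evaluate directly; a minimal sketch (the 90% parallel portion is an assumed example, not a figure from the talk):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Assumed example: 90% parallel portion on 96 cores
# (the core count of the GPU used later in the talk).
print(round(amdahl_speedup(0.9, 96), 2))   # ~9.14, far below 96
# Even with infinitely many cores the limit is 1 / (1 - P) = 10.
```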
Correctness concepts
• Race condition: two threads both read a shared variable (say a = 19), each computes a = a + 1, and each writes its result back; one update overwrites the other, so a ends at 20 instead of 21. ERROR!
• Deadlock: threads block forever, each waiting for a resource another holds.
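The lost-update race above is avoided by making the read-modify-write atomic; a minimal sketch with Python's standard threading module (the counter and thread counts are arbitrary):

```python
import threading

counter = 0
lock = threading.Lock()

def increment_many(n):
    global counter
    for _ in range(n):
        with lock:          # protects the read-modify-write; without it,
            counter += 1    # two threads can read the same value and one
                            # update is lost, as in the a = a + 1 example

threads = [threading.Thread(target=increment_many, args=(10_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 20000: no update is lost
```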
NVIDIA CUDA
• Historical Trends
• CUDA
• Programming Languages
• Reported Speedup
Historical Trends
CUDA
• CUDA stands for Compute Unified Device Architecture.
• CUDA is a parallel computing engine built into NVIDIA GPUs (graphics processing units).
Programming Languages
Application
C/C++ Fortran OpenCL ......
NVIDIA GPU with the CUDA Parallel Computing Architecture
Reported Speedup
CUDA Architecture
• Physical Reality behind CUDA
• CUDA Architectures
• Introducing the “Fermi” Architecture
• SM Architecture
• CUDA Core Architecture
Physical Reality behind CUDA
[Diagram: the CPU (host) with its main memory, connected to the GPU (device)]
CUDA Architectures
• G80
  – First CUDA-capable processor
• G8x, G9x
  – Global memory
• GT200
  – Double precision
  – Shared memory
  – Larger register file
  – Relaxed memory coalescing rules
[Figure: basic CUDA architecture]
“Fermi” Architecture
• 3 billion transistors
• Over 2x the cores (512 total)
• 8x the peak DP performance
• L1 and L2 caches
• ~2x memory bandwidth
• Up to 1 terabyte of GPU memory
SM Architecture
• 32 CUDA cores per SM (Streaming Multiprocessor)
• 8x peak double precision floating point performance
• Dual Thread Scheduler
• 64 KB of RAM for shared memory and L1 cache
CUDA Core Architecture
• New IEEE 754-2008 floating-point standard
• Fused multiply-add (FMA) instruction for both single and double precision
• Newly designed integer ALU optimized for 64-bit and extended-precision operations
SVD matrix computation
• SVD
• SVD matrix computation
• Experiment Datasets
• Experiment Environment
• Experiment Results
SVD
• The singular value decomposition (SVD) is an important factorization of a matrix, with many applications in signal processing and statistics.
• Suppose M is an m-by-n matrix; then there exists a factorization of the form

  M = U Σ V*

  where U is an m-by-m unitary matrix, Σ is an m-by-n diagonal matrix with non-negative entries (the singular values), and V* is the conjugate transpose of an n-by-n unitary matrix V.
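The factorization can be checked numerically; a minimal sketch assuming NumPy is available (the 2x3 matrix is an arbitrary example, not one of the talk's datasets):

```python
import numpy as np

M = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Reduced SVD: U is 2x2, s holds the singular values, Vh is V*.
U, s, Vh = np.linalg.svd(M, full_matrices=False)
Sigma = np.diag(s)

print(np.allclose(M, U @ Sigma @ Vh))  # True: the product recovers M
```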
SVD matrix computation
• Each test image is split into R, G, and B pixel matrices, and each channel matrix is decomposed independently:

  M_R = U_R Σ_R V_R*
  M_G = U_G Σ_G V_G*
  M_B = U_B Σ_B V_B*

[Figure: image → RGB pixel matrices → per-channel SVD matrices]
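The per-channel decomposition can be sketched as follows, with a small random array standing in for the talk's 1024x1024 test images (NumPy assumed available; the rank-k truncation shown is a common use of per-channel SVD, not necessarily the talk's exact computation):

```python
import numpy as np

def rank_k(channel, k):
    """Best rank-k approximation of one color channel via its SVD."""
    U, s, Vh = np.linalg.svd(channel, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vh[:k, :]

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))          # stand-in RGB image

# Decompose and truncate each of the R, G, B channels independently.
compressed = np.dstack([rank_k(image[..., c], 8) for c in range(3)])

print(compressed.shape)                  # (64, 64, 3)
```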
Experiment Datasets
• 3 test images
• RGB full color
• 1024×1024 (1,048,576 pixels each)
Experiment Environment
GPU
  Device             NVIDIA GeForce 9600 GSO
  Cores              96
  Processor clock    1375 MHz
  Standard memory    384 MB
  Memory bandwidth   38.4 GB/sec

CPU
  Device             Intel Core 2 Quad Q9300
  Cores              4
  Processor clock    2.5 GHz
  FSB speed          1333 MHz
  L2 cache           6 MB
Experiment Results
Conclusion
• Using the GPU to speed up programs is feasible.
• NVIDIA CUDA is well suited to SIMD parallel computing.
• However, there is an additional cost: data must be passed between main memory and GPU memory.