TRANSCRIPT
ICAL
Capabilities and Limitations of Distributed Computing Provided by the GPU Architecture
11/17/09 ICAL 2
Outline
• Parallel computing with GPU
• NVIDIA CUDA
• SVD matrix computation
• Conclusion
Parallel computing with GPU
• Parallel computing
• Flynn’s Taxonomy
• Algorithm decomposition
• Amdahl’s Law
• Correctness concepts
Parallel computing
• Parallel computing is a form of computation in which many calculations are carried out simultaneously.
• Parallel computer hardware:
  – Single machine: multi-core CPU, GPU
  – Multiple machines: clusters, MPPs, grids
Parallel computing (cont.)
• There are several kinds of parallelism, such as:
  – Bit-level
  – Instruction-level
  – Data-level
  – Task-level
• Parallel computing has an inherent speedup limit (see Amdahl’s Law).
Algorithm decomposition
• Example: preparing dinner breaks into subtasks (purchasing, cooking, cleaning the table, washing dishes) before enjoying the dinner.
• Task decomposition: different people do different tasks, e.g. John cleans the table while Mary goes shopping.
• Data decomposition: several people perform the same task on different items, e.g. John and Mary wash the dishes together.
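The dinner example above can be sketched with Python's standard thread pool; the chore functions here are hypothetical stand-ins for real work items.

```python
from concurrent.futures import ThreadPoolExecutor

def go_shopping():
    return "shopping done"          # Mary's task

def clean_table():
    return "table cleaned"          # John's task

def wash(dish):
    return dish + " washed"         # one unit of data-parallel work

with ThreadPoolExecutor() as pool:
    # Task decomposition: different tasks run concurrently.
    shopping = pool.submit(go_shopping)
    cleaning = pool.submit(clean_table)

    # Data decomposition: the same task applied to many data items.
    washed = list(pool.map(wash, ["plate", "bowl", "cup"]))

print(shopping.result(), cleaning.result(), washed)
```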
Flynn’s Taxonomy
                      Single data   Multiple data
Single instruction    SISD          SIMD
Multiple instruction  MISD          MIMD
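The SIMD quadrant is the one CUDA targets; it can be illustrated with NumPy (assumed available here): conceptually a single add operation applied to multiple data elements, versus an SISD-style element-at-a-time loop.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# SISD style: one instruction operates on one data element at a time.
sisd = [x + y for x, y in zip(a, b)]

# SIMD style: one vectorized "add" conceptually covers all elements.
simd = a + b

print(sisd, simd)
```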
Amdahl’s Law
• Amdahl’s law models the expected overall speedup when only part of a program is improved:

  Speedup = 1 / ((1 − P) + P/S)

  P: parallel portion of the program
  S: speedup of the parallel portion
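The formula is easy to evaluate directly; a minimal sketch (the 90% parallel portion is an assumed example, not a figure from the talk):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Assumed example: 90% parallel portion on 96 cores
# (the core count of the GPU used later in the talk).
print(round(amdahl_speedup(0.9, 96), 2))   # ~9.14, far below 96
# Even with infinitely many cores the limit is 1 / (1 - P) = 10.
```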
Correctness concepts
• Race condition: two threads both read a shared variable (say a = 19), each computes a = a + 1, and each writes its result back; one update overwrites the other, so a ends at 20 instead of 21. ERROR!
• Deadlock: threads block forever, each waiting for a resource another holds.
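The lost-update race above is avoided by making the read-modify-write atomic; a minimal sketch with Python's standard threading module (the counter and thread counts are arbitrary):

```python
import threading

counter = 0
lock = threading.Lock()

def increment_many(n):
    global counter
    for _ in range(n):
        with lock:          # protects the read-modify-write; without it,
            counter += 1    # two threads can read the same value and one
                            # update is lost, as in the a = a + 1 example

threads = [threading.Thread(target=increment_many, args=(10_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 20000: no update is lost
```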
NVIDIA CUDA
• Historical Trends
• CUDA
• Programming Languages
• Reported Speedup
Historical Trends
CUDA
• CUDA stands for Compute Unified Device Architecture.
• CUDA is a parallel computing engine built into NVIDIA GPUs (graphics processing units).
Programming Languages
Application
C/C++ Fortran OpenCL ......
NVIDIA GPU with the CUDA Parallel Computing Architecture
Reported Speedup
CUDA Architecture
• Physical Reality behind CUDA
• CUDA Architectures
• Introducing the “Fermi” Architecture
• SM Architecture
• CUDA Core Architecture
Physical Reality behind CUDA
[Diagram: the CPU (host) with its main memory, connected to the GPU (device)]
CUDA Architectures
• G80
  – First CUDA-capable processor
• G8x, G9x
  – Global memory
• GT200
  – Double precision
  – Shared memory
  – Larger register file
  – Relaxed memory coalescing rules
[Figure: basic CUDA architecture]
“Fermi” Architecture
• 3 billion transistors
• Over 2x the cores (512 total)
• 8x the peak DP performance
• L1 and L2 caches
• ~2x memory bandwidth
• Up to 1 terabyte of GPU memory
SM Architecture
• 32 CUDA cores per SM (Streaming Multiprocessor)
• 8x peak double precision floating point performance
• Dual Thread Scheduler
• 64 KB of RAM for shared memory and L1 cache
CUDA Core Architecture
• New IEEE 754-2008 floating-point standard
• Fused multiply-add (FMA) instruction for both single and double precision
• Newly designed integer ALU optimized for 64-bit and extended-precision operations
SVD matrix computation
• SVD
• SVD matrix computation
• Experiment Datasets
• Experiment Environment
• Experiment Results
SVD
• The singular value decomposition (SVD) is an important factorization of a matrix, with many applications in signal processing and statistics.
• Suppose M is an m-by-n matrix; then there exists a factorization of the form

  M = U Σ V*

  where U is an m-by-m unitary matrix, Σ is an m-by-n diagonal matrix with non-negative entries (the singular values), and V* is the conjugate transpose of an n-by-n unitary matrix V.
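The factorization can be checked numerically; a minimal sketch assuming NumPy is available (the 2x3 matrix is an arbitrary example, not one of the talk's datasets):

```python
import numpy as np

M = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Reduced SVD: U is 2x2, s holds the singular values, Vh is V*.
U, s, Vh = np.linalg.svd(M, full_matrices=False)
Sigma = np.diag(s)

print(np.allclose(M, U @ Sigma @ Vh))  # True: the product recovers M
```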
SVD matrix computation
• Each test image is split into R, G, and B pixel matrices, and each channel matrix is decomposed independently:

  M_R = U_R Σ_R V_R*
  M_G = U_G Σ_G V_G*
  M_B = U_B Σ_B V_B*

[Figure: image → RGB pixel matrices → per-channel SVD matrices]
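The per-channel decomposition can be sketched as follows, with a small random array standing in for the talk's 1024x1024 test images (NumPy assumed available; the rank-k truncation shown is a common use of per-channel SVD, not necessarily the talk's exact computation):

```python
import numpy as np

def rank_k(channel, k):
    """Best rank-k approximation of one color channel via its SVD."""
    U, s, Vh = np.linalg.svd(channel, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vh[:k, :]

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))          # stand-in RGB image

# Decompose and truncate each of the R, G, B channels independently.
compressed = np.dstack([rank_k(image[..., c], 8) for c in range(3)])

print(compressed.shape)                  # (64, 64, 3)
```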
Experiment Datasets
• 3 test images
• RGB full color
• 1024×1024 (1,048,576 pixels each)
Experiment Environment
GPU
  Device             NVIDIA GeForce 9600 GSO
  Cores              96
  Processor clock    1375 MHz
  Standard memory    384 MB
  Memory bandwidth   38.4 GB/sec

CPU
  Device             Intel Core 2 Quad Q9300
  Cores              4
  Processor clock    2.5 GHz
  FSB speed          1333 MHz
  L2 cache           6 MB
Experiment Results
Conclusion
• Using the GPU to speed up programs is feasible.
• NVIDIA CUDA is well suited to SIMD parallel computing.
• However, there is an additional cost: data must be passed between main memory and GPU memory.