NAS EP Algorithm
Random Number Generation using OpenCL
신성원 나정호 배성호 김종수
Contents
Introduction
Theory
Result
Conclusion
MultiGrid
Conjugate Gradient
Fast Fourier Transform
Embarrassingly Parallel
Integer Sort
Data Cube operator
Data Traffic
Marsaglia Polar Method
Get Random Numbers: $(x, y)$ with $s = x^2 + y^2 < 1$
Get Gaussian Pairs: $x \sqrt{\frac{-2 \ln s}{s}}$, $y \sqrt{\frac{-2 \ln s}{s}}$
Pseudo code
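#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

static double sparse = 0.0;        /* second value of the last generated pair */
static bool sparseready = false;   /* true while sparse holds an unused value */

/* Uniform double in [0, 1); stands in for random() in the pseudo code. */
static double uniform01(void) {
    return rand() / (RAND_MAX + 1.0);
}

double getGaussian(double center, double stdDev) {
    if (sparseready) {             /* reuse the cached second value */
        sparseready = false;
        return sparse * stdDev + center;
    }
    double u, v, s;
    do {                           /* reject points outside the unit circle */
        u = uniform01() * 2.0 - 1.0;
        v = uniform01() * 2.0 - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    double mul = sqrt(-2.0 * log(s) / s);
    sparse = v * mul;              /* keep the second value for the next call */
    sparseready = true;
    return center + stdDev * u * mul;
}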
Profiling Result
Gaussian pairs: 54%
Random numbers: 46%
Serial portions: 0.01%
Mapping Instance to Kernel
Optimization
Increasing memory bandwidth by using coalesced memory access
3x4 matrix (Conceptual):
0 1 2 3
4 5 6 7
8 9 A B
In memory (Linear mapping, ※ row-wise order):
0 1 2 3 4 5 6 7 8 9 A B
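The row-wise linear mapping on this slide can be written as a pair of index functions. This is a small sketch; the function names are hypothetical.

```c
#include <stddef.h>

/* Linear offset of element (row, col) in a row-wise (row-major) layout. */
size_t linear_index(size_t row, size_t col, size_t num_cols) {
    return row * num_cols + col;
}

/* Inverse mapping: recover (row, col) from a linear offset. */
void matrix_position(size_t offset, size_t num_cols, size_t *row, size_t *col) {
    *row = offset / num_cols;
    *col = offset % num_cols;
}
```

For the 3x4 matrix above, linear_index(2, 3, 4) gives 11, the slot labelled B.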
Option 1 vs. Option 2 (figure): two ways of assigning the elements of the 3x4 matrix to Work items #1 to #4
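The transcript does not preserve which elements each option gives to which work item, so the following is only a sketch of the two usual candidates, under the assumption that the slide contrasts a blocked assignment with an interleaved one. The names blocked_index and interleaved_index are hypothetical.

```c
#include <stddef.h>

/* Blocked: each work item owns a contiguous chunk of `chunk` elements, so at
   a given step neighbouring work items touch addresses `chunk` apart. */
size_t blocked_index(size_t work_item, size_t step, size_t chunk) {
    return work_item * chunk + step;
}

/* Interleaved: at a given step neighbouring work items touch neighbouring
   addresses, which is the pattern the hardware can coalesce. */
size_t interleaved_index(size_t work_item, size_t step, size_t num_work_items) {
    return step * num_work_items + work_item;
}
```

With 12 elements and 4 work items (numbered from 0 here), the blocked scheme reads offsets 0, 3, 6, 9 in the first step, while the interleaved scheme reads 0, 1, 2, 3, a single contiguous segment.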
Lowering memory access latency by using local memory
Unoptimized
__kernel void EP(...) {
    ...
    for (i = 0; i < NK; i++) {
        ...
        q[l] = q[l] + 1.0;    // array q[] fits into local memory
        ...
    }
}
Optimized
__kernel void local_EP(...) {
    ...
    lq[] = q[];               // copy q[] into local memory once
    for (i = 0; i < NK; i++) {
        ...
        lq[l] = lq[l] + 1.0;  // array q[] fits into local memory
        ...
    }
    q[] = lq[];               // write the result back once
}
Hot spot
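The copy-in, update, copy-out pattern of the optimized kernel can be tried on the host as plain C. This is an emulation of the idea only, not OpenCL code: lq[] is an ordinary stack array standing in for local memory, and NQ, bins and tally_with_local_copy are hypothetical names. In a real kernel, work items sharing lq[] would also need synchronization.

```c
#include <stddef.h>

#define NQ 10  /* length of q[]; an assumption of this sketch */

/* Count bins[0..n-1] into q[] through a small scratch copy, so the
   hot loop only touches lq[] instead of q[]. */
void tally_with_local_copy(double q[NQ], const int *bins, size_t n) {
    double lq[NQ];

    for (int l = 0; l < NQ; l++)       /* lq[] = q[] */
        lq[l] = q[l];

    for (size_t i = 0; i < n; i++) {   /* hot loop works on the copy */
        int l = bins[i];
        if (l >= 0 && l < NQ)
            lq[l] = lq[l] + 1.0;
    }

    for (int l = 0; l < NQ; l++)       /* q[] = lq[] */
        q[l] = lq[l];
}
```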
Exploiting GPU parallelism with optimal NDRange size
$2^{16}$ iterations, all independent
Local_work_size : 64 per work-group
• • •
Exactly fit!
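The "exactly fit" claim can be checked with a little arithmetic, assuming the iteration count on this slide is 2^16 = 65536. The sketch below also shows the usual rounding step for counts that do not divide evenly, since OpenCL 1.x expects the global work size to be a multiple of the local work size. The function names are hypothetical.

```c
#include <stddef.h>

/* Number of work-groups needed to cover `iterations` work items. */
size_t num_work_groups(size_t iterations, size_t local_size) {
    return (iterations + local_size - 1) / local_size;
}

/* Global work size rounded up to a multiple of the local work size. */
size_t padded_global_size(size_t iterations, size_t local_size) {
    return num_work_groups(iterations, local_size) * local_size;
}
```

With 65536 iterations and a local work size of 64, this gives 1024 work-groups and no padding, so no work item sits idle.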
Machine Specification
                 Host                    Compute Device
Processor        2 x Intel Xeon E5520    8 x NVIDIA Tesla C1060
Clock Freq.      2.27 GHz                1296 MHz
Cores per CPU    4                       (N/A)
Cores per GPU    (N/A)                   240
Memory Size      24 GB                   32 GB (4 GB x 8)
OS               Red Hat 4.4             (N/A)
Result
Figure: Execution Time (sec, ※ log scale from 1 to 1000) for CPU, GPU #1, GPU #2, GPU #4, GPU #8
Result
Figure: Speed up (axis from 0 to 350) for GPU #1, GPU #2, GPU #4, GPU #8