NAS EP Algorithm

Random Number Generation using OpenCL

신성원 나정호 배성호 김종수

Contents

Introduction

Theory

Result

Conclusion

MultiGrid

Conjugate Gradient

Fast Fourier Transform

Embarrassingly Parallel

Integer Sort

Data Cube operator

Data Traffic


Marsaglia Polar Method

Get random numbers: $(x, y)$

Accept the point if $s = x^2 + y^2 < 1$

Get Gaussian pairs: $\left( x \sqrt{\frac{-2 \ln s}{s}},\; y \sqrt{\frac{-2 \ln s}{s}} \right)$

Pseudo code

    double spare = 0.0;
    bool spareready = false;

    // random() is assumed to return a uniform double in [0, 1)
    double getGaussian(double center, double stdDev) {
        if (spareready) {
            // Return the second value of the previously generated pair
            spareready = false;
            return spare * stdDev + center;
        }
        double u, v, s;
        do {
            u = random() * 2.0 - 1.0;
            v = random() * 2.0 - 1.0;
            s = u * u + v * v;
        } while (s >= 1 || s == 0);
        double mul = sqrt(-2.0 * log(s) / s);
        spare = v * mul;    // keep the second value for the next call
        spareready = true;
        return center + stdDev * u * mul;
    }

Profiling Result

Gaussian pairs: 54%

Random numbers: 46%

Serial portions: 0.01%
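With only 0.01% of the run time spent in serial portions, nearly all of the work can be spread across work items. As a rough illustration, the upper bound on speed-up for a given serial fraction can be sketched with Amdahl's law. The law is not mentioned on the slides and is brought in here as an assumption, and the function name is hypothetical.

```c
/* Amdahl's law: upper bound on speed-up with n parallel workers when a
 * fraction `serial` of the run time cannot be parallelized.
 * Applying it to the 0.01% figure is an illustrative assumption. */
double amdahl_speedup(double serial, double n)
{
    return 1.0 / (serial + (1.0 - serial) / n);
}
```

For a serial fraction of 0.0001, the bound approaches 1 / 0.0001 = 10000 as n grows.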

Mapping Instance to Kernel

Optimization

Increasing memory bandwidth by using coalesced memory access

3x4 matrix (conceptual):

    0 1 2 3
    4 5 6 7
    8 9 A B

In memory (linear mapping, row-wise order):

    0 1 2 3 4 5 6 7 8 9 A B
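The row-wise linear mapping can be written as a one-line index computation. This is a minimal sketch; the function name `linear_index` is hypothetical.

```c
#include <stddef.h>

/* Row-wise (row-major) mapping: element (row, col) of a matrix with
 * `cols` columns is stored at offset row * cols + col. */
size_t linear_index(size_t row, size_t col, size_t cols)
{
    return row * cols + col;
}
```

For the 3x4 matrix, `linear_index(2, 1, 4)` gives 9, which matches the element labeled 9 in the third row.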

[Figure: two ways of assigning the elements of the 3x4 matrix to work items #1 to #4, labeled Option 1 and Option 2]
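The figure contrasting the two options did not survive extraction, so which assignment each option denotes is not recoverable. A common pair of choices, assumed here purely for illustration, is a blocked assignment (each work item owns a contiguous chunk) versus an interleaved assignment (at each step, neighbouring work items touch neighbouring addresses, which is the pattern a GPU can coalesce). The function names below are hypothetical and work items are numbered from 0.

```c
#include <stddef.h>

/* Blocked: work item `wi` owns elements [wi * per_item, (wi + 1) * per_item).
 * At a given step, neighbouring work items are `per_item` elements apart. */
size_t blocked_index(size_t wi, size_t step, size_t per_item)
{
    return wi * per_item + step;
}

/* Interleaved: at a given step, work items 0..n_items-1 read consecutive
 * elements, so their accesses fall in one contiguous range. */
size_t interleaved_index(size_t wi, size_t step, size_t n_items)
{
    return step * n_items + wi;
}
```

With 12 elements and 4 work items, at step 0 the blocked form touches elements 0, 3, 6, 9 while the interleaved form touches 0, 1, 2, 3.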

Lowering memory access latency by using local memory

Unoptimized

    __kernel void EP(...) {
        ...
        for (i = 0; i < NK; i++) {
            ...
            q[l] = q[l] + 1.0;   // hot spot; array q[] fits into local memory
            ...
        }
    }

Optimized

    __kernel void local_EP(...) {
        ...
        lq[] = q[];              // pseudocode: copy q[] into local memory
        for (i = 0; i < NK; i++) {
            ...
            lq[l] = lq[l] + 1.0; // hot spot now updates the local copy lq[]
            ...
        }
        q[] = lq[];              // pseudocode: write the result back to q[]
    }

Exploiting GPU parallelism with optimal NDRange size

2^16 independent iterations

Local_work_size: 64

2^16 is a multiple of 64, so the iterations fit the work-groups exactly.
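Assuming the iteration count on the slide is 2^16 = 65536 (the superscript appears to have been lost in extraction) and one independent iteration is mapped to each work item, a local work size of 64 divides the range evenly. A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <stddef.h>

/* Round `n` up to the next multiple of `local` so that the global work
 * size is divisible by the local work size. */
size_t round_up_global(size_t n, size_t local)
{
    return ((n + local - 1) / local) * local;
}

/* Number of work-groups launched for a given global and local size. */
size_t num_work_groups(size_t global, size_t local)
{
    return global / local;
}
```

For 65536 iterations and a local size of 64, no padding is needed and 1024 work-groups are launched.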


Machine Specification

                   Host                    Compute Device
    Processor      2 x Intel Xeon E5520    8 x NVIDIA Tesla C1060
    Clock Freq.    2.27 GHz                1296 MHz
    Cores per CPU  4                       (N/A)
    Cores per GPU  (N/A)                   240
    Memory Size    24 GB                   32 GB (4 GB x 8)
    OS             Red Hat 4.4             (N/A)

Result

[Figure: execution time in seconds (log scale, axis from 1 to 1000) for CPU, GPU #1, GPU #2, GPU #4, and GPU #8]

Result

[Figure: speed-up (axis from 0 to 350) for GPU #1, GPU #2, GPU #4, and GPU #8]
