nas ep algorithm

Random Number Generation using OpenCL 신성원 나정호 배성호 김종수

Upload: liam-jongsu-kim

Post on 25-Jan-2015


Page 1: NAS EP Algorithm

Random Number Generation using OpenCL

신성원 나정호 배성호 김종수

Page 2: NAS EP Algorithm

Contents

Introduction

Theory

Result

Conclusion

Page 3: NAS EP Algorithm

MultiGrid

Conjugate Gradient

Fast Fourier Transform

Embarrassingly Parallel

Integer Sort

Data Cube operator

Data Traffic

Page 4: NAS EP Algorithm
Page 5: NAS EP Algorithm
Page 6: NAS EP Algorithm

Contents

Introduction

Theory

Result

Conclusion

Page 7: NAS EP Algorithm

Marsaglia Polar Method

Get Random Numbers: $(x, y)$, accepted when $s = x^2 + y^2 < 1$

Get Gaussian Pairs: $x\sqrt{\dfrac{-2\ln s}{s}}$, $y\sqrt{\dfrac{-2\ln s}{s}}$

Page 8: NAS EP Algorithm

Pseudo code

double spare = 0.0;        // second value of the most recent Gaussian pair
bool spareReady = false;   // true while spare holds an unused value

// random() is assumed to return a uniform value in [0, 1)
double getGaussian(double center, double stdDev) {
    if (spareReady) {
        spareReady = false;
        return spare * stdDev + center;
    }
    double u, v, s;
    do {
        u = random() * 2.0 - 1.0;
        v = random() * 2.0 - 1.0;
        s = u * u + v * v;
    } while (s >= 1 || s == 0);   // reject points outside the unit circle
    double mul = sqrt(-2.0 * log(s) / s);
    spare = v * mul;
    spareReady = true;
    return center + stdDev * u * mul;
}

Page 9: NAS EP Algorithm
Page 10: NAS EP Algorithm

Profiling Result

Gaussian pairs: 54%

Random numbers: 46%

Serial portions: 0.01%

Page 11: NAS EP Algorithm

Mapping Instance to Kernel

Page 12: NAS EP Algorithm

Optimization

Page 13: NAS EP Algorithm

Increasing memory bandwidth by using a coalesced memory access

3x4 matrix (Conceptual)

0 1 2 3
4 5 6 7
8 9 A B

In memory (Linear mapping) ※ row-wise order

0 1 2 3 4 5 6 7 8 9 A B
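The row-wise linear mapping can be written as a one-line index computation. The function name linear_index is hypothetical and only illustrates the mapping.

```c
#include <stddef.h>

/* Row-major (row-wise) linear index of element (row, col)
   in a matrix that has `cols` columns. */
static size_t linear_index(size_t row, size_t col, size_t cols) {
    return row * cols + col;
}
```

For the 3x4 example, linear_index(2, 3, 4) gives 11, which is the element labelled B in the figure.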

Page 14: NAS EP Algorithm

Increasing memory bandwidth by using a coalesced memory access

Option 1        Option 2

0 1 2 3         0 1 2 3
4 5 6 7         4 5 6 7
8 9 A B         8 9 A B

[Figure: elements assigned to Work item #1, Work item #2, Work item #3 and Work item #4 under each option]
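The transcript does not preserve which elements each work-item touches under the two options, so the following C sketch shows two plausible assignments rather than the slide's own. Both layouts and both function names are assumptions. In the first, adjacent work-items read adjacent addresses at each step, which is the pattern a GPU can coalesce. In the second, each work-item owns a contiguous chunk, so adjacent work-items are three elements apart.

```c
#include <stddef.h>

enum { ROWS = 3, COLS = 4 };  /* the 3x4 example matrix, 4 work-items */

/* Layout A (assumed): work-item w owns column w and walks down the rows.
   At a given step, work-items 0..3 touch consecutive addresses. */
static size_t addr_column_owner(size_t w, size_t step) {
    return step * COLS + w;
}

/* Layout B (assumed): work-item w owns a contiguous chunk of ROWS elements.
   At a given step, neighbouring work-items are ROWS addresses apart. */
static size_t addr_chunk_owner(size_t w, size_t step) {
    return w * ROWS + step;
}
```

At step 0, layout A yields addresses 0, 1, 2, 3 for the four work-items, while layout B yields 0, 3, 6, 9. Both layouts cover all twelve elements after three steps; only the access order differs.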

Page 15: NAS EP Algorithm

Lowering memory access latency by using local memory

Unoptimized

__kernel void EP(...) {
    ...
    for (i = 0; i < NK; i++) {
        ...
        q[l] = q[l] + 1.0;   // hot spot: array q[] fits into local memory
        ...
    }
}

Optimized

__kernel void local_EP(...) {
    ...
    lq[] = q[];              // copy q[] into local memory (pseudocode)
    for (i = 0; i < NK; i++) {
        ...
        lq[l] = lq[l] + 1.0; // hot spot now updates the local copy
        ...
    }
    q[] = lq[];              // write the result back (pseudocode)
}
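The copy-in, accumulate, copy-out pattern can be emulated in plain C to check that it yields the same counts as the direct version. This is a host-side sketch and not the deck's kernel: the bins array, NQ = 10, and both function names are hypothetical, and the array lq only stands in for a __local buffer.

```c
#define NQ 10  /* number of counters (assumed, not stated on the slides) */

/* Direct version: every update goes to q[], which stands in for global memory. */
static void tally_direct(const int *bins, int n, double *q) {
    for (int i = 0; i < n; i++) q[bins[i]] += 1.0;
}

/* Staged version: work on a small copy, then write it back once. */
static void tally_staged(const int *bins, int n, double *q) {
    double lq[NQ];                                   /* stands in for __local */
    for (int l = 0; l < NQ; l++) lq[l] = q[l];       /* lq[] = q[] */
    for (int i = 0; i < n; i++) lq[bins[i]] += 1.0;  /* hot spot */
    for (int l = 0; l < NQ; l++) q[l] = lq[l];       /* q[] = lq[] */
}
```

In a real OpenCL kernel the local buffer is shared by the work-items of a work-group, so the copies would typically need barriers and either per-work-item slots or a reduction. The sketch leaves that out and shows only the data movement.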

Page 16: NAS EP Algorithm

Exploiting GPU parallelism with optimal NDRange size

2^16 iterations, each independent

Local_work_size : 64 | Local_work_size : 64 | Local_work_size : 64 | • • •

Exactly fit!
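The "exactly fit" claim is simple arithmetic, assuming the slide's iteration count is 2^16 = 65536 (the superscript appears to have been lost in extraction). The helper names below are hypothetical. The rounding function also covers the general case where the iteration count is not a multiple of the local work size.

```c
#include <stddef.h>

/* Smallest multiple of local_size that is >= n (local_size must be non-zero). */
static size_t round_up_global(size_t n, size_t local_size) {
    return (n + local_size - 1) / local_size * local_size;
}

/* Number of work-groups launched for n iterations. */
static size_t num_work_groups(size_t n, size_t local_size) {
    return round_up_global(n, local_size) / local_size;
}
```

With n = 65536 and a local work size of 64, no padding is needed and 1024 work-groups are launched. An iteration count that is not a multiple of 64 would be rounded up, and the kernel would then need a bounds check on the global ID.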

Page 17: NAS EP Algorithm

Contents

Introduction

Theory

Result

Conclusion

Page 18: NAS EP Algorithm

Machine Specification

                 Host                    Compute Device
Processor        2 x Intel Xeon E5520    8 x NVIDIA Tesla C1060
Clock Freq.      2.27 GHz                1296 MHz
Cores per CPU    4                       (N/A)
Cores per GPU    (N/A)                   240
Memory Size      24GB                    32GB (4GB * 8)
OS               Redhat 4.4              (N/A)

Page 19: NAS EP Algorithm

Result

[Bar chart: Execution Time (sec), log scale from 1 to 1000, for CPU, GPU #1, GPU #2, GPU #4 and GPU #8]

Page 20: NAS EP Algorithm

Result

[Bar chart: Speed up, axis from 0 to 350, for GPU #1, GPU #2, GPU #4 and GPU #8]

Page 21: NAS EP Algorithm