Performance Optimization for GPUs



Page 1: Performance Optimization for GPUs (images.nvidia.com/cn/gtc/downloads/pdf/big-data/204..., 2016-09-24)

Peking University Center for Energy-efficient Computing and Applications

Performance Optimization for GPUs (GPU Performance Optimization Techniques)

Yun (Eric) Liang, 梁云

Center for Energy-efficient Computing and Applications (CECA)

School of EECS, Peking University, China

Page 2

Why GPUs?

Yun (Eric) Liang @ Peking University 2 9/23/2016

Massive Parallelism

Source: Nvidia Inc

Computing Power

[Figure: Graphics Processing Units, drawn as an array of SMs]

Page 3

Applications of GPUs

Embedded system
• NVIDIA Tegra Series
• Samsung Exynos
• Qualcomm Snapdragon

Super computing system (System: Configuration)
• Titan, Oak Ridge National Lab: Cray XK7, Opteron 6274 16C 2.200GHz, NVIDIA K20x
• Piz Daint, CSCS, Switzerland: Cray XC30, Xeon E5-2670 8C 2.600GHz, NVIDIA K20x

Page 4

Ubiquitous GPU Computing

• Augmented Reality
• Electronic Design Automation
• Biology
• 3D Graphics Rendering
• Finance
• Deep Learning

Page 5

GPU Performance Optimization

Performance tuning is difficult

• Many architecture, compiler and application parameters

GPU kernel development

• heavy lifting task

Page 6

Research Summary


Heterogeneous System

Programming model,

Compilation and Run-time

System

MapReduce (TPDS’14, Bigdata’13)

SpMV (CGO’15)

Register (MICRO’15, ASPDAC’16)

Applications

Multitasking (TPDS’15, DATE’16)

Cache Bypassing (HPCA’15, ICCAD’13, TCAD’15)

Divergence, Power (IPDPS’12, DAC’14, TCAD’16)

High Level Synthesis (FPGA’13, DAC’13, FCCM’14, TCAD’16)

Tool DAC’16

Memory (TCAD’15)

LTE (PACT’15)

Real-time (DAC’13)

Page 7

On-chip Storage in GPUs

[Figure: warps sharing the on-chip storage: Cache, Shared Memory, and Register File]

“Coordinated Static and Dynamic Cache Bypassing on GPUs”, International Symposium on High Performance Computer Architecture (HPCA), February, 2015

Page 8

Challenge for Cache: Massive Parallelism

[Figure: Number of Active Threads (y-axis, 0 to 1800) on Fermi GTX 480]

• 16KB cache: 10 ~ 20 bytes per thread
• 48KB cache: 30 ~ 80 bytes per thread

“Coordinated Static and Dynamic Cache Bypassing on GPUs”, International Symposium on High Performance Computer Architecture (HPCA), February, 2015
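The bytes-per-thread figures on this slide come from dividing the L1 capacity by the number of threads resident on an SM. A minimal sketch of that arithmetic, assuming 1536 resident threads per SM (the Fermi hardware limit; the per-application counts are only in the chart) and a hypothetical helper name:

```python
def cache_bytes_per_thread(cache_kb, active_threads):
    """L1 capacity available to each thread if it is shared evenly."""
    return cache_kb * 1024 / active_threads

# Assumption: 1536 resident threads per SM.
for cache_kb in (16, 48):
    share = cache_bytes_per_thread(cache_kb, 1536)
    print(f"{cache_kb}KB L1, 1536 threads: {share:.1f} bytes per thread")
```

Applications that keep fewer threads resident get a proportionally larger share, which is consistent with the ranges quoted on the slide.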

Page 9

Challenge for Cache: Low Cache Hit Rate

[Figure: L1 Cache Hit Rate (y-axis, 0% to 100%) on Fermi GTX 480; series: L1 Hit Rate - 16KB and L1 Hit Rate - 48KB]

Page 10

Challenges: Resource Congestion Stalls

[Figure: Memory Requests go through Memory Coalescing to the L1 Cache. A hit returns data; a miss is recorded in the Miss Status Holding Registers. Annotation: Memory stage stall]

Page 11

Cache Bypassing on GPUs

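[Figure: memory requests are coalesced into cache line requests for the L1 Cache. Regular path: an L1 miss goes through the MSHR to the L2 Cache, an L2 miss goes to Off-chip memory, and returned data is allocated in the caches. Bypass path: requests marked bypass (L1 cache) skip the L1 Cache, and their return data bypasses it as well]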

Page 12

System Overview

Static Cache Bypassing (compile-time)
• Each ld.global … instruction is rewritten as ld.global.ca, ld.global.cg, or ld.global.cm
• Load classes on the slide: good, bad, medium
• Load types on the slide: ca load, cg load, cm load

Dynamic Cache Bypassing
• Cache thread block
• Bypass thread block
• Maintain the thread level parallelism

[Figure: loads flowing to the L1 Cache and the L2 Cache]
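The compile-time step can be pictured as a text rewrite over the PTX. The sketch below is illustrative only: `apply_static_bypass` and the `decisions` list are hypothetical names, the per-load decisions are assumed to come from some earlier analysis, and only the two standard PTX cache operators `.ca` (cache at all levels) and `.cg` (cache in L2, not L1) are handled. The slide's `cm` class is left out.

```python
import re

def apply_static_bypass(ptx, decisions):
    """Tag each untagged ld.global with a cache operator.

    decisions holds one operator per load, in program order:
    "ca" caches in L1 and L2, "cg" caches in L2 only (bypasses L1).
    """
    remaining = iter(decisions)

    def rewrite(match):
        return f"ld.global.{next(remaining)}{match.group(1)}"

    # Match ld.global directly followed by a scalar type such as .f32.
    return re.sub(r"ld\.global(\.[busf]\d+)", rewrite, ptx)

ptx = "ld.global.f32 %f1, [%rd1];\nld.global.u32 %r1, [%rd2];"
print(apply_static_bypass(ptx, ["ca", "cg"]))
```

Loads that already carry a cache operator are not matched by the pattern, so they pass through unchanged.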

Page 13

Performance Model

• Definition: Traffic Reduction Graph G(V, E)
  – v ∈ V: global load instructions
  – e ∈ E: reuses between instructions
  – Weighted graph using L2 cache traffic: weight(v_i), weight(e_{i,j})

• Max-Clique Problem

[Figure: example graph with vertices V1 to V5]

“An Efficient Compiler for Cache Bypassing on GPUs”, International Conference on Computer Aided Design (ICCAD), 2013
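The slide casts the selection as a max-clique problem on the weighted graph but does not show how the weights are combined. As a rough sketch under the assumption that a clique is scored by the sum of its vertex and edge weights, a brute-force search over a small graph looks like this (all weights below are made up for illustration, and `best_clique` is a hypothetical name):

```python
from itertools import combinations

def best_clique(vertex_weight, edge_weight):
    """Exhaustively find the clique with the largest total weight.

    edge_weight keys are (u, v) tuples with u < v; a missing key
    means the two loads share no reuse.
    """
    vertices = sorted(vertex_weight)
    best, best_score = (), 0
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            pairs = list(combinations(subset, 2))
            if not all(pair in edge_weight for pair in pairs):
                continue  # not a clique
            score = (sum(vertex_weight[v] for v in subset)
                     + sum(edge_weight[pair] for pair in pairs))
            if score > best_score:
                best, best_score = subset, score
    return best, best_score

vertex_weight = {"V1": 1, "V2": 1, "V3": 1, "V4": 1, "V5": 1}
edge_weight = {("V1", "V2"): 3, ("V1", "V3"): 2,
               ("V2", "V3"): 4, ("V4", "V5"): 5}
print(best_clique(vertex_weight, edge_weight))
```

Exhaustive search is exponential in the number of loads, so it only serves to make the problem statement concrete.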

Page 14

Performance Results (1/2)

Cache sensitive applications on 16KB cache
– Average 1.32X performance improvement
– 8.6% energy savings

[Figure: Normalized IPC (y-axis, 0 to 2.5) for Default, Static, Dynamic, and Coordinated; 1.32X annotated]

Page 15

Register File on GPUs

[Figure: Register File Size (y-axis, 0 to 4.5) for Tesla, Fermi, and Kepler]

[Figure: thread blocks sharing the register file]

• Large register file: 256 KB
• Register file > L1 cache + shared memory (64KB)
• Keeps increasing

“Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs”, IEEE/ACM International Symposium on Microarchitecture (MICRO), December, 2015

Page 16

Thread Throttling Technique

Mitigate cache contention
• Balance between parallelism and cache contention

[Figure: Performance versus #Thread blocks per SM, with OptTLP and MaxTLP marked]

Page 17

Thread Throttling helps, but …

1.42X

-51.3%

Register under-utilization

Page 18

Performance Impact of Register Allocation

[Figure: the register count per thread affects both Thread-level Parallelism and Single-thread Performance; labels: Register spilling, Code insert]

Page 19

Current Optimization Tool-chain

PTX code

mov %r0, %tid.x;
mov %r1, %ntid.x;
mul %r3, %r2, %r1;
add %r4, %r3, %r0;
…

[Figure: tool-chain: PTX code, Register allocation, assemble, binary, then Thread throttling for the cache. Thread block limits come from register, shared memory, thread limits, and others. Annotation: Register under-utilization]

Page 20

Motivational Example (CFD)

Configurations compared:
• MaxTLP: maximum TLP (TLP = 8, Reg = 32)
• OptTLP: optimal TLP (TLP = 7, Reg = 32)
• OptTLP + Reg (TLP = 7, Reg = 36)
• CRAT: Coordinated (TLP = 5, Reg = 50)

[Figure: three panels comparing the four configurations: IPC (0.8 to 2), L1 Cache Hit Rate (0% to 25%), and Register Utilization (70% to 100%)]
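The link between TLP, registers per thread, and register utilization is simple arithmetic. The sketch below assumes a 32768-entry register file per SM, a hardware cap of 8 thread blocks per SM, and 128 threads per block. The block size is an assumption, chosen because it makes Reg = 32 fill the register file exactly at TLP = 8; the slide does not state it. Function names are hypothetical.

```python
def register_utilization(tlp, regs_per_thread, block_size, regfile_regs):
    """Fraction of the register file occupied by resident threads."""
    return tlp * block_size * regs_per_thread / regfile_regs

def max_tlp_by_registers(regs_per_thread, block_size, regfile_regs,
                         hw_max_blocks):
    """Thread blocks per SM that the register file can hold."""
    return min(hw_max_blocks,
               regfile_regs // (block_size * regs_per_thread))

BLOCK, REGFILE, HW_MAX = 128, 32768, 8   # assumed values
for tlp, reg in [(8, 32), (7, 32), (7, 36), (5, 50)]:
    util = register_utilization(tlp, reg, BLOCK, REGFILE)
    print(f"TLP={tlp} Reg={reg}: utilization {util:.1%}")
```

Under these assumptions, throttling from TLP = 8 to TLP = 7 at Reg = 32 leaves one eighth of the register file idle, and raising the per-thread register count recovers most of it.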

Page 21

Design Space

Single-thread performance

TLP

Complex Design Space Trade-off

Page 22

CRAT: Coordinated Register Allocation and Thread-level Parallelism Optimization

Input: Original GPU PTX Kernel

.entry PTXkernel(){
  …
  mul.lo.s32 %r3, %r2, %r1;
  add.s32 %r4, %r0, %r3;
  add.s32 %r3, %r2, %r1;
  sub.s32 %r5, %r2, %r1;
  …
}

CRAT stages: Design Space Pruning, Register Allocation, Spilling Optimization

Output: Optimized GPU PTX Kernel

.entry PTXkernel(){
  …
  mul.lo.s32 %r1, %r2, %r1;
  add.s32 %r2, %r0, %r1;
  add.s32 %r2, %r0, %r1;
  …
}

Page 23

Design Space Pruning

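Design space
• MaxTLP, OptTLP, MinReg, MaxReg

[Figure: design space over TLP (OptTLP and MaxTLP marked) and registers per thread (MinReg and MaxReg marked); regions labeled Possible solution, Cache contention, and Candidate solutions]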
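One plausible reading of the pruning, offered as an assumption rather than the authors' exact rule: TLP values above OptTLP are discarded because of cache contention, and for each remaining TLP only the largest register count that still fits in the register file is kept. The function name, the 128-thread block size, and the 32768-entry register file are all hypothetical.

```python
def candidate_solutions(opt_tlp, min_reg, max_reg, block_size, regfile_regs):
    """Enumerate (TLP, registers per thread) pairs worth evaluating."""
    candidates = []
    for tlp in range(opt_tlp, 0, -1):
        # Largest per-thread register count that fits at this TLP.
        reg = min(max_reg, regfile_regs // (tlp * block_size))
        # Keep the point only if it offers more registers than the
        # previous (higher-TLP) candidate.
        if reg >= min_reg and (not candidates or reg > candidates[-1][1]):
            candidates.append((tlp, reg))
    return candidates

print(candidate_solutions(opt_tlp=7, min_reg=32, max_reg=63,
                          block_size=128, regfile_regs=32768))
```

Lower TLP values that cannot offer more registers than a higher TLP are dropped, since they give up parallelism without any gain in single-thread performance.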

Page 24

Register Allocation

Register allocator
• GPGPU-Sim (Static Single Assignment, SSA)
• Based on Chaitin-Briggs’ register allocator

Stages: Control-flow analysis, Data-flow analysis, Register coloring, Spill code insert
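The register coloring stage can be illustrated with a compact Chaitin-Briggs-style sketch. It is a simplification and not the allocator used in the talk: the interference graph is given directly (in a real allocator it comes from the control-flow and data-flow analyses), spill candidates are chosen by degree only, and spill code insertion is reduced to returning the spilled names.

```python
def color_registers(interference, k):
    """Color an interference graph with k physical registers.

    interference maps each virtual register to the set of virtual
    registers live at the same time (assumed symmetric).
    Returns (assignment, spilled).
    """
    graph = {v: set(ns) for v, ns in interference.items()}
    stack = []
    while graph:
        # Simplify: prefer a node with fewer than k neighbours.
        low = [v for v in graph if len(graph[v]) < k]
        # Briggs' optimism: otherwise push a high-degree node anyway.
        node = min(low) if low else max(graph, key=lambda v: (len(graph[v]), v))
        stack.append(node)
        for n in graph.pop(node):
            if n in graph:
                graph[n].discard(node)
    assignment, spilled = {}, []
    while stack:
        # Select: give each node a color unused by its neighbours.
        node = stack.pop()
        used = {assignment[n] for n in interference[node] if n in assignment}
        free = [c for c in range(k) if c not in used]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)  # would require spill code
    return assignment, spilled

triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(color_registers(triangle, 3))
print(color_registers(triangle, 2))
```

With three mutually interfering values, three registers suffice, while two registers force one value to be spilled.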

Page 25

Spilling Optimization

• Spill to shared memory if possible

[Figure: Register Coloring produces spilled variables (V0, V2, …, Vn) that would all go to a spill stack in Local Memory; the optimization splits the spill stack into sub spill stacks, placed in Shared memory or Local Memory]
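A sketch of the "spill to shared memory if possible" rule. The ordering heuristic (most frequently accessed variables first), the 4-byte variable size, and the function name are assumptions for illustration; the slide only states that shared memory is preferred when space allows.

```python
def place_spills(spilled, shm_free_bytes, block_size, var_bytes=4):
    """Assign each spilled variable to shared or local memory.

    spilled: list of (name, access_count) pairs.
    shm_free_bytes: shared memory left per thread block.
    """
    placement = {}
    per_var = var_bytes * block_size  # one slot per thread in the block
    for name, _accesses in sorted(spilled, key=lambda s: -s[1]):
        if shm_free_bytes >= per_var:
            placement[name] = "shared"
            shm_free_bytes -= per_var
        else:
            placement[name] = "local"
    return placement

print(place_spills([("V0", 10), ("V2", 50), ("V5", 5)],
                   shm_free_bytes=1024, block_size=128))
```

Because every thread in a block needs its own slot, the shared memory cost of one spilled variable grows with the block size, which is why only part of the spill stack may fit.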

Page 26

Performance Metric

• TPSC: Thread-level Parallelism and Spill Cost

TPSC = TLP_{gain} × Spill_{cost}

Spill_{cost} = Num_{local} × Cost_{local} + Num_{shm} × Cost_{shm} + Num_{others}

TLP_{gain} is computed from TLP, BlockSize, and MaxThread (the expression itself is not recoverable from this transcript).

Slide labels: Main memory Instruction, Shared memory Instruction, Computing Instruction, TLP, Candidate solutions
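A small sketch of the spill cost term as a weighted instruction count. The relative costs (100 for a local memory instruction, 10 for a shared memory instruction) are made-up values, and `tpsc` takes TLP_gain as an input because its formula is not legible on the slide. Combining the two terms by multiplication is likewise an assumption.

```python
def spill_cost(num_local, num_shm, num_others, cost_local, cost_shm):
    """Weighted instruction count: spills to local memory and to
    shared memory are weighted by their relative costs."""
    return num_local * cost_local + num_shm * cost_shm + num_others

def tpsc(tlp_gain, cost):
    """Combine the TLP term and the spill cost (assumed product)."""
    return tlp_gain * cost

cost = spill_cost(num_local=2, num_shm=3, num_others=10,
                  cost_local=100, cost_shm=10)
print(cost, tpsc(0.5, cost))
```

Moving a spilled variable from local to shared memory lowers the cost term, which is how the metric rewards the spilling optimization.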

Page 27

Experimental Evaluation

[Figure, Speedup: Normalized IPC (y-axis, 0.5 to 1.5) for MaxTLP, OptTLP, and CRAT; 1.25X annotated]

[Figure, Energy saving: Normalized Energy (y-axis, 0% to 100%) for OptTLP and CRAT; 16.5% annotated]

Page 28

Performance Analysis

[Figure, Cache Contention: #Thread blocks/SM (y-axis, 0 to 6) for MaxTLP and CRAT; values 5.1 and 2.6 annotated]

[Figure, Register Utilization: 0% to 100% for OptTLP and CRAT on ESP, DTC, FDTD, CFD, HST, BLK, STE]

[Figure, Local Memory Access: 0% to 100% on DTC, FDTD, CFD, STE, and the average (Ave)]

Page 29

Experimental Results

Kepler Architecture

– 1.32X IPC (compared with OptTLP)

[Figure: Overall Performance (y-axis, 0 to 2.5) of MaxTLP, OptTLP, and CRAT on STM, ESP, SPMV, KMN, LBM, DTC, FDTD, CFD, HST, BLK, STE, and the geometric mean (Geo)]

Page 30

CRAT: Open Source Project

http://ceca.pku.edu.cn/crat/

Download: CMU, Michigan, USC, etc. Invited internship at IBM TJ Watson.

Page 31

Multitasking for GPUs: Software Solution

“Efficient GPU Spatial-Temporal Multitasking”, IEEE Transactions on Parallel and Distributed Systems (TPDS), March, 2015

• A set of independent kernels, each an App. binary with a profile
• Thread block interleaving via leaky-bucket
• Spatial-temporal multitasking

[Figure: thread blocks (ids 0 to 5) of several Apps interleaved through a bucket]

Page 32

Multitasking for GPUs: Software Solution

Host (CPU)

compute_mapping(); // computes mapKernel and mapBlk
scheduler(……);

Device (GPU)

__global__ void scheduler(…, mapBlk, mapKernel, gridDim_A, blkDim_A, gridDim_B, blkDim_B) {
    // bid is the block identifier of the scheduler kernel
    kernel_id = mapKernel[bid];
    if (kernel_id == 0)
        Kernel_A(…, mapBlk, blkDim_A, gridDim_A);
    else
        Kernel_B(…, mapBlk, blkDim_B, gridDim_B);
}

[Figure: results on Kepler GTX680 and Kepler K20 (axis range -5 to 45)]

“Efficient GPU Spatial-Temporal Multitasking”, IEEE Transactions on Parallel and Distributed Systems (TPDS), March, 2015
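The host-side `compute_mapping()` step is not shown on the slide. The sketch below is a stand-in, not the leaky-bucket policy itself: it uses a simple credit-based weighted interleaving to fill `mapKernel` (which kernel each scheduler block runs) and `mapBlk` (which of that kernel's blocks it plays). The `rates` parameter is a hypothetical input, for example derived from the application profiles.

```python
def compute_mapping(grid_dims, rates):
    """Interleave the thread blocks of several kernels.

    grid_dims[k]: number of thread blocks of kernel k.
    rates[k]: relative share of scheduler blocks given to kernel k.
    Returns (map_kernel, map_blk), indexed by scheduler block id.
    """
    remaining = list(grid_dims)
    credit = [0] * len(grid_dims)
    next_blk = [0] * len(grid_dims)
    map_kernel, map_blk = [], []
    while any(remaining):
        active = [k for k in range(len(grid_dims)) if remaining[k]]
        for k in active:
            credit[k] += rates[k]
        # Run the kernel with the most accumulated credit.
        pick = max(active, key=lambda k: credit[k])
        credit[pick] -= sum(rates[k] for k in active)
        map_kernel.append(pick)
        map_blk.append(next_blk[pick])
        next_blk[pick] += 1
        remaining[pick] -= 1
    return map_kernel, map_blk

map_kernel, map_blk = compute_mapping([4, 2], [2, 1])
print(map_kernel)
print(map_blk)
```

Each scheduler block then looks up `map_kernel[bid]` to choose the kernel body and `map_blk[bid]` to recover the original block index, as in the device code on this slide.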

Page 33

Multitasking for GPUs: Hardware Solution

• TLP Modulation
• Cache Bypassing

[Figure: blocks of grid A and grid B co-scheduled on SM 0 to SM 14, with a Bypass path between the L1 Cache and the L2 Cache]

[Figure: Normalized IPC (y-axis, 0.8 to 2) of TLP modulation and TLP modulation + Cache bypassing for the kernel pairs BLK_HST, SPM_HST, SRD_KMS, KMS_STC, LBM_BKP, BLK_BKP, SPM_BKP, SPM_BLK, LBM_BLK, HST_KMS, LBM_SPM, SPM_SRD, SPM_KMS, LBM_KMS, LBM_HST, BLK_KMS, BLK_SRD, LBM_SRD, HST_STC, BKP_STC, LBM_STC, BKP_KMS, SPM_STC, BLK_STC, SRD_STC, and GEOMEAN]

"Efficient Kernel Management on GPUs", in the proceedings of the Design Automation and Test in Europe (DATE), March, 2016.

Page 34

Control Flow Divergence Modeling

Program control flow graph

Basic Block Vector

1. sub r0, r1, r2

2. mul r0, r2, 3

3. load r2 cb[r4]

4. madd r1, r2, r3

5. cmp r1

"An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization", Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS), May, 2012.

(1) If statement

D = input[tid];
if (D > 2) {
    // computation
}

(2) If else statement

D = input[tid];
if (D > 2) {
    ….
} else {
    if (….) // nested divergence
}

(3) For loop statement

D = input[tid];
for (i = D; i < 100; i++) {
    // computation
}
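The effect of these data-dependent branches can be made concrete with a small model. The sketch below assumes 32-thread warps and made-up `input[tid]` values. It counts the warps whose threads disagree on `D > 2`, and for the loop case it reports how many iterations each warp runs, since a warp keeps going until its slowest thread finishes. The function names are hypothetical.

```python
WARP_SIZE = 32

def warps(values, warp_size=WARP_SIZE):
    """Split per-thread values into consecutive warps."""
    return [values[i:i + warp_size] for i in range(0, len(values), warp_size)]

def divergent_warps(data, threshold=2):
    """Warps in which `if (D > threshold)` splits the threads."""
    count = 0
    for warp in warps(data):
        taken = [d > threshold for d in warp]
        if any(taken) and not all(taken):
            count += 1
    return count

def loop_iterations_per_warp(data, bound=100):
    """For `for (i = D; i < bound; i++)`, a warp iterates until its
    longest-running thread is done."""
    return [max(max(bound - d, 0) for d in warp) for warp in warps(data)]

data = [0, 5] * 32   # hypothetical input[tid] values for 64 threads
print(divergent_warps(data), divergent_warps(sorted(data)))
print(loop_iterations_per_warp([0, 99] * 16))
```

With alternating values both warps diverge, while the same values grouped by outcome produce no divergent warp.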

Page 35

Control Flow Divergence Modeling

[Figure: thread blocks (tb0 to tb5) assigned to SM0 and SM1 under a Static Schedule and a Dynamic Schedule; variants: un-weighted and weighted]

[Figure: Speedup (y-axis, 0.0 to 3.5) of Sorting, Greedy, and K-means on MC, SW, NW, SL, SM]

• Simple sorting
  – Each thread is represented using its BBV
• Greedy
  – Merges the two closest threads and continues …
• Clustering
  – K-means clustering
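The simple-sorting variant can be sketched as follows. The cost model is an assumption made for illustration: each thread's BBV holds its execution count per basic block, and a warp pays, for every basic block, the count of its most demanding thread. The BBVs are made up and the function names are hypothetical.

```python
WARP_SIZE = 32

def warp_cost(bbvs):
    """Blocks executed by one warp: per basic block, the warp runs
    as many times as its most demanding thread."""
    return sum(max(counts) for counts in zip(*bbvs))

def total_cost(bbvs, warp_size=WARP_SIZE):
    """Sum of warp costs when threads are packed in the given order."""
    return sum(warp_cost(bbvs[i:i + warp_size])
               for i in range(0, len(bbvs), warp_size))

def regroup_by_sorting(bbvs):
    """Thread order after sorting threads by their BBV."""
    return sorted(range(len(bbvs)), key=lambda t: bbvs[t])

bbvs = [(1, 0), (0, 1)] * 32   # 64 threads, two behaviours interleaved
order = regroup_by_sorting(bbvs)
regrouped = [bbvs[t] for t in order]
print(total_cost(bbvs), total_cost(regrouped))
```

Sorting places threads with identical BBVs in the same warp, so each warp executes only the basic blocks its own threads need.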

"An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization", Proceedings of IEEE International Parallel Author Distributed Processing Symposium (IPDPS), May, 2012.

Page 36

Conclusion

• Ubiquitous GPU Computing

– Supercomputer, datacenter, embedded, IoT

• Challenges

– Performance optimization

• Contribution

– Automatic performance analysis and optimization techniques
