SIMD Divergence Optimization through Intra-Warp Compaction
Aniruddha Vaidya, Anahita Shayesteh, Dong Hyuk Woo, Roy Saharoy, Mani Azimi
ISCA '13
Problem
• GPU: wide SIMD lanes – 16 lanes per warp in this work
• SIMD control flow divergence on "if/else" conditions
• Common solution: sequentially execute all the control flow paths for all channels
– Both the "if" and the "else" portion are executed in turn by all channels, while turning off the appropriate channels in each path
• Recent studies: combine threads from different warps that take the same if/else path
– Problem: increased memory divergence (i.e., the number of distinct memory or cache-line requests per SIMD instruction)
Observation
• The number of hardware execution lanes is typically a fraction of the SIMD instruction width – e.g., the 4-wide SIMD ALU in Intel's Ivy Bridge GPU
• A wide SIMD instruction therefore executes over multiple cycles due to the narrower hardware width
Goal
• By exploiting the difference between the logical and physical SIMD width of a GPU pipeline, this work addresses the SIMD control divergence problem with intra-warp compaction
GPU register file
[Figure: register file layout – per-warp registers r0, r1, r2, …, r(n-1), each 16 lanes wide (channels 0–f), shown for Warp 0 and Warp 1]
Basic Cycle Compression (BCC)
• Fused multiply-add (FMA): r3 = r0 * r1 + r2
• Operand fetch: r0 @ cycle 1, r1 @ cycle 2, r2 @ cycle 3
• Issue @ cycles 4, 5, 6, 7 (one quad per cycle)
• Instead, we want to issue the next warp at cycle 5
Basic Cycle Compression (BCC)
• In this example, the compressed execution time equals the execution time without the divergence caused by the "if/else" clause
[Figure: "if" and "else" paths, with fully disabled quads skipped]
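The BCC idea can be sketched as follows – a minimal model of our own (not the paper's hardware), assuming a 16-wide warp executed as four 4-channel quads on a 4-wide ALU, where any quad whose channels are all disabled is skipped:

```python
# Sketch of Basic Cycle Compression (illustrative model, not the
# paper's implementation).

WARP_WIDTH = 16   # logical SIMD width (channels per warp)
ALU_WIDTH = 4     # physical SIMD width (hardware lanes)

def bcc_cycles(mask):
    """Count the ALU cycles one instruction needs under BCC.

    mask: list of WARP_WIDTH booleans, True = channel enabled.
    Each cycle executes one contiguous quad of ALU_WIDTH channels;
    a quad with no enabled channel is skipped entirely.
    """
    quads = [mask[i:i + ALU_WIDTH] for i in range(0, WARP_WIDTH, ALU_WIDTH)]
    return sum(1 for q in quads if any(q))

# Favorable case: only channels 0-7 enabled -> 2 of 4 cycles needed.
favorable = [True] * 8 + [False] * 8
# Unfruitful case: one enabled channel per quad -> no cycle saved.
unfruitful = [i % 4 == 0 for i in range(16)]
```

Both masks have enabled channels, but only the first packs them into whole quads; the second leaves one active channel in every quad, so BCC cannot skip any cycle – the situation the next slides address.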
Unfruitful Cases for BCC
• Turned-off channels in an instruction are not contiguous, or are contiguous but not favorably aligned to the hardware SIMD pipeline width
Swizzled Cycle Compression (SCC)
• The positions of disabled and enabled channels are rearranged so that enabled channels pack into fewer quads
[Figure: "if" and "else" paths after swizzling]
Control Algorithm for Swizzling
• Method:
1. Detect the optimal number of cycles for execution
2. Balance occupancy across lanes
• Example (4 lanes): lanes 0 and 2 each hold 4 enabled channels; lanes 1 and 3 hold 0
• Optimum cycle count: 8 enabled channels / 4 lanes = 2
• For the 1st EXE cycle, fill the idle lanes (1, 3) from the busy lanes; for the 2nd EXE cycle, fill the idle lanes (1, 3) similarly
• After balancing, each lane holds 2 channels (totals 2, 2, 2, 2)
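The two-step control method above can be sketched as follows – a hedged sketch (function and variable names are ours) that computes the optimum cycle count and greedily rebalances channel occupancy from busy lanes to idle lanes:

```python
import math

def swizzle_schedule(lane_occupancy):
    """Balance per-lane work so all lanes finish in the optimum cycle count.

    lane_occupancy: number of enabled channels queued on each lane.
    Returns (balanced occupancy, optimum number of EXE cycles).
    Illustrative model of the control idea, not the paper's hardware.
    """
    total = sum(lane_occupancy)
    lanes = len(lane_occupancy)
    # Step 1: detect the optimal number of execution cycles.
    optimum = math.ceil(total / lanes)
    # Step 2: balance occupancy by moving channels from over-full
    # lanes to under-full lanes.
    balanced = list(lane_occupancy)
    for dst in range(lanes):
        while balanced[dst] < optimum:
            src = max(range(lanes), key=lambda i: balanced[i])
            if balanced[src] <= optimum:   # nothing left to move
                break
            balanced[src] -= 1
            balanced[dst] += 1
    return balanced, optimum

# Slide example: lanes 0 and 2 hold 4 channels each, lanes 1 and 3 none.
# Optimum = 8 / 4 = 2 cycles; occupancy balances to (2, 2, 2, 2).
```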
Simulation Methods
• Execution-driven simulation
– In-house cycle-level Intel GPGPU simulator
– Standalone GPU simulation, or a module in a parallel CPU+GPU simulation
– Entire GPU performance simulation with the entire memory hierarchy
– 50+ OpenCL benchmark applications evaluated
• Trace-driven simulation
– GPU core performance simulation only
– ~600 OpenCL, OpenGL, and multimedia workload traces
Results
ALU cycles saved (OpenGL and OpenCL)
[Chart: ALU cycles saved, 0–50%, BCC% vs. SCC%, for the workloads BFS, HtS, Lava, MD, NW, Part, EV, RT-PR-Conf, RT-PR-AL, RT-PR-BL, RT-PR-WM, RT-AO-AL, RT-AO-BL, RT-AO-WM, LuxMark-sky, LuxMark_sala, luxmark_ocl, cp, bulletphysics, oclprofv1p0, rightware_mandelbulb, tree_search, LuxMark_hdr, OptSAA, sandra_ocl, ati-eigenval, ati_floydwarshall, glbench_egypt, glbench_pro, FD_IntelFinalists, FD_politicians]
Results
• System performance (OpenCL; ray tracing)
• Dependent on data cluster bandwidth (L3 cache)
[Chart: speedup, 0–60%, BCC% vs. SCC%, under ∞ bandwidth, 2 L3$ lines/cycle, and 1 L3$ line/cycle]
• On average (across divergent applications): +12% speedup with 1 $-line/cycle bandwidth, +18% with 2 $-lines/cycle bandwidth
Conclusion
• SIMD control divergence solutions
– Exploiting the multi-cycle execution feature of GPUs
– Intra-Warp Compaction
• Basic cycle compression
• Swizzled cycle compression
Register file organization
• Baseline: use pairs of registers
• BCC: fetch only half-width registers
Register file organization
• Operand fetch (16 lanes, 512b) is done in 1 cycle. The operand is held in a 512b latch.
• Each quad (128b) passes through a four-lane swizzler with individual lane enables.
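A rough behavioral model of this fetch path can clarify the data flow – names and layout here are illustrative assumptions, not the actual hardware design:

```python
# Illustrative model: a 16-lane, 512-bit operand is fetched in one
# cycle into a latch, then consumed one 128-bit quad per cycle; each
# quad passes through a 4-lane swizzler with per-lane enables.

LANES, QUAD = 16, 4

def consume_operand(latch, enables, swizzle):
    """Yield, quad by quad, the value each ALU lane sees.

    latch:   16 fetched values, one per channel (the 512b latch)
    enables: 16 booleans, per-channel execute enable
    swizzle: per-quad permutation of the 4 lane positions
    Disabled channels yield None (the lane is gated off).
    """
    for q in range(0, LANES, QUAD):
        sel = swizzle[q // QUAD]          # this quad's lane mapping
        yield [latch[q + sel[i]] if enables[q + sel[i]] else None
               for i in range(QUAD)]
```

With an identity swizzle and all channels enabled, the four quads stream out unchanged; a non-identity per-quad permutation is what SCC uses to pack enabled channels together.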
Overhead