SIMD Divergence Optimization through Intra-Warp Compaction
Aniruddha Vaidya, Anahita Shayesteh, Dong Hyuk Woo, Roy Saharoy, Mani Azimi
ISCA '13
Problem
• GPU: wide SIMD lanes – 16 lanes per warp in this work
• SIMD control flow divergence on "if/else" conditions
• Common solution: sequentially execute all the control flow paths for all channels
– Both the "if" and the "else" portion are executed in turn by all channels, while turning off the appropriate channels in each path
• Recent studies: combine threads from different warps that take the same if/else path
– Problem: increased memory divergence (i.e., the number of distinct memory or cache-line requests per SIMD instruction)
Observation
• The number of hardware execution lanes is typically a fraction of the SIMD instruction width – e.g., the 4-wide SIMD ALU in Intel's Ivy Bridge GPU
• A wide SIMD instruction therefore executes over multiple cycles due to the narrower hardware width
Goal
• By exploiting the difference between the logical and physical SIMD width of a GPU pipeline, this work addresses the SIMD control divergence problem with intra-warp compaction
GPU register file
[Figure: register file layout – per-warp registers r0, r1, r2, …, r(n-1), each 16 lanes wide (channels 0–f), shown for Warp 0 and Warp 1]
Basic Cycle Compression (BCC)
• Fused multiply-add (FMA): r3 = r0 * r1 + r2
• Operand fetch: r0 @ cycle 1, r1 @ cycle 2, r2 @ cycle 3
• Issue @ cycles 4, 5, 6, 7 (one quad per cycle)
• Instead, we want to issue the next warp at cycle 5
Basic Cycle Compression (BCC)
• In this example, the compressed execution time equals the execution time without the divergence caused by the "if/else" clause
[Figure: "if" and "else" paths, with fully disabled quads skipped]
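The BCC idea can be sketched as follows – a minimal model of our own (not the paper's hardware), assuming a 16-wide warp executed as four 4-channel quads on a 4-wide ALU, where any quad whose channels are all disabled is skipped:

```python
# Sketch of Basic Cycle Compression (illustrative model, not the
# paper's implementation).

WARP_WIDTH = 16   # logical SIMD width (channels per warp)
ALU_WIDTH = 4     # physical SIMD width (hardware lanes)

def bcc_cycles(mask):
    """Count the ALU cycles one instruction needs under BCC.

    mask: list of WARP_WIDTH booleans, True = channel enabled.
    Each cycle executes one contiguous quad of ALU_WIDTH channels;
    a quad with no enabled channel is skipped entirely.
    """
    quads = [mask[i:i + ALU_WIDTH] for i in range(0, WARP_WIDTH, ALU_WIDTH)]
    return sum(1 for q in quads if any(q))

# Favorable case: only channels 0-7 enabled -> 2 of 4 cycles needed.
favorable = [True] * 8 + [False] * 8
# Unfruitful case: one enabled channel per quad -> no cycle saved.
unfruitful = [i % 4 == 0 for i in range(16)]
```

Both masks have enabled channels, but only the first packs them into whole quads; the second leaves one active channel in every quad, so BCC cannot skip any cycle – the situation the next slides address.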
Unfruitful Cases for BCC
• Turned-off channels in an instruction are not contiguous, or are contiguous but not favorably aligned to the hardware SIMD pipeline width
Swizzled Cycle Compression (SCC)
• The positions of disabled and enabled channels are rearranged so that enabled channels pack into fewer quads
[Figure: "if" and "else" paths after swizzling]
Control Algorithm for Swizzling
• Method:
1. Detect the optimal number of cycles for execution
2. Balance occupancy across lanes
• Example (4 lanes): lanes 0 and 2 each hold 4 enabled channels; lanes 1 and 3 hold 0
• Optimum cycle count: 8 enabled channels / 4 lanes = 2
• For the 1st EXE cycle, fill the idle lanes (1, 3) from the busy lanes; for the 2nd EXE cycle, fill the idle lanes (1, 3) similarly
• After balancing, each lane holds 2 channels (totals 2, 2, 2, 2)
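The two-step control method above can be sketched as follows – a hedged sketch (function and variable names are ours) that computes the optimum cycle count and greedily rebalances channel occupancy from busy lanes to idle lanes:

```python
import math

def swizzle_schedule(lane_occupancy):
    """Balance per-lane work so all lanes finish in the optimum cycle count.

    lane_occupancy: number of enabled channels queued on each lane.
    Returns (balanced occupancy, optimum number of EXE cycles).
    Illustrative model of the control idea, not the paper's hardware.
    """
    total = sum(lane_occupancy)
    lanes = len(lane_occupancy)
    # Step 1: detect the optimal number of execution cycles.
    optimum = math.ceil(total / lanes)
    # Step 2: balance occupancy by moving channels from over-full
    # lanes to under-full lanes.
    balanced = list(lane_occupancy)
    for dst in range(lanes):
        while balanced[dst] < optimum:
            src = max(range(lanes), key=lambda i: balanced[i])
            if balanced[src] <= optimum:   # nothing left to move
                break
            balanced[src] -= 1
            balanced[dst] += 1
    return balanced, optimum

# Slide example: lanes 0 and 2 hold 4 channels each, lanes 1 and 3 none.
# Optimum = 8 / 4 = 2 cycles; occupancy balances to (2, 2, 2, 2).
```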
Simulation Methods
• Execution-driven simulation
– In-house cycle-level Intel GPGPU simulator
– Standalone GPU simulation, or a module in a parallel CPU+GPU simulation
– Entire GPU performance simulation with the entire memory hierarchy
– 50+ OpenCL benchmark applications evaluated
• Trace-driven simulation
– GPU core performance simulation only
– ~600 OpenCL, OpenGL, and multimedia workload traces
Results
ALU cycles saved (OpenGL and OpenCL)
[Chart: ALU cycles saved, 0–50%, BCC% vs. SCC%, for the workloads BFS, HtS, Lava, MD, NW, Part, EV, RT-PR-Conf, RT-PR-AL, RT-PR-BL, RT-PR-WM, RT-AO-AL, RT-AO-BL, RT-AO-WM, LuxMark-sky, LuxMark_sala, luxmark_ocl, cp, bulletphysics, oclprofv1p0, rightware_mandelbulb, tree_search, LuxMark_hdr, OptSAA, sandra_ocl, ati-eigenval, ati_floydwarshall, glbench_egypt, glbench_pro, FD_IntelFinalists, FD_politicians]
Results
• System performance (OpenCL; ray tracing)
• Dependent on data cluster bandwidth (L3 cache)
[Chart: speedup, 0–60%, BCC% vs. SCC%, under ∞ bandwidth, 2 L3$ lines/cycle, and 1 L3$ line/cycle]
• On average (across divergent applications): +12% speedup with 1 $-line/cycle bandwidth, +18% with 2 $-lines/cycle bandwidth
Conclusion
• SIMD control divergence solutions
– Exploiting the multi-cycle execution feature of GPUs
– Intra-Warp Compaction
• Basic cycle compression
• Swizzled cycle compression
Register file organization
• Baseline: use pairs of registers
• BCC: fetch only half-width registers
Register file organization
• Operand fetch (16 lanes, 512b) is done in 1 cycle. The operand is held in a 512b latch.
• Each quad (128b) passes through a four-lane swizzler with individual lane enables.
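A rough behavioral model of this fetch path can clarify the data flow – names and layout here are illustrative assumptions, not the actual hardware design:

```python
# Illustrative model: a 16-lane, 512-bit operand is fetched in one
# cycle into a latch, then consumed one 128-bit quad per cycle; each
# quad passes through a 4-lane swizzler with per-lane enables.

LANES, QUAD = 16, 4

def consume_operand(latch, enables, swizzle):
    """Yield, quad by quad, the value each ALU lane sees.

    latch:   16 fetched values, one per channel (the 512b latch)
    enables: 16 booleans, per-channel execute enable
    swizzle: per-quad permutation of the 4 lane positions
    Disabled channels yield None (the lane is gated off).
    """
    for q in range(0, LANES, QUAD):
        sel = swizzle[q // QUAD]          # this quad's lane mapping
        yield [latch[q + sel[i]] if enables[q + sel[i]] else None
               for i in range(QUAD)]
```

With an identity swizzle and all channels enabled, the four quads stream out unchanged; a non-identity per-quad permutation is what SCC uses to pack enabled channels together.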
Overhead