![Page 1: Spatial: A Language and Compiler for Application Acceleratorscsl.stanford.edu/~christos/publications/2018.spatial... · 2018-07-03 · Spatial: A Language and Compiler for Application](https://reader033.vdocuments.pub/reader033/viewer/2022041904/5e629b234919fc5e6a2881d1/html5/thumbnails/1.jpg)
Spatial: A Language and Compiler for Application Accelerators
David Koeplinger Matthew Feldman Raghu Prabhakar
Yaqi Zhang Stefan Hadjis Ruben Fiszel
Tian Zhao Luigi Nardi Ardavan Pedram
Christos Kozyrakis Kunle Olukotun
PLDI June 21, 2018

Instructions Add Overheads
32-bit ADD: ~0.5 pJ
Instruction: 70 pJ
  I-cache access: 25 pJ
  Register file access: 6 pJ
  Control overheads: 38 pJ
Mark Horowitz, Computing’s Energy Problem (and what we can do about it), ISSCC 2014
[Figure: instruction-based CPU block diagram with blocks labeled instruction queue, register file, control, arithmetic/logic, floating point, L1 cache (instructions), L1 cache (data), L2 cache, and DRAM. Legend: control, compute, registers, SRAM.]
Instruction-Based: vectorA · vectorB
    mov r8, rcx
    add r8, 8
    mov r9, rdx
    add r9, 8
    mov rcx, rax
    mov rax, 0
.calc:
    mov rbx, [r9]
    imul rbx, [r8]
    add rax, rbx
    add r8, 8
    add r9, 8
    loop .calc
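
The overhead claim on this slide can be checked with simple arithmetic. The sketch below uses only the two figures quoted on the slide (about 0.5 pJ for the 32-bit ADD and 70 pJ for the whole instruction); the object and method names are illustrative and do not come from the talk.

```scala
object InstructionEnergy {
  // Figures quoted on the slide, in picojoules.
  val addPj: Double = 0.5          // the 32-bit ADD operation itself
  val instructionPj: Double = 70.0 // one full ADD instruction

  // Share of the instruction's energy that is not the arithmetic.
  def overheadFraction: Double = (instructionPj - addPj) / instructionPj

  // How many bare ADD operations fit in one instruction's energy budget.
  def addsPerInstructionBudget: Double = instructionPj / addPj
}
```

Under these figures, `InstructionEnergy.overheadFraction` is roughly 0.99 and `InstructionEnergy.addsPerInstructionBudget` is 140, which is the gap that configuration-based hardware tries to recover.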

A Dark Tale: The CPU Power Wall

A More Efficient Way
Instruction-Based: CPU* (*not to scale)
[Figure: CPU block diagram with blocks labeled instruction queue, register file, control, arithmetic/logic, floating point, L1 cache (instructions), L1 cache (data), L2 cache, and DRAM, shown with the vectorA · vectorB assembly loop.]
Configuration-Based: Custom Hardware* (*also not to scale)
[Figure: custom datapath for vectorA · vectorB with blocks labeled DRAM, vectorA, vectorB, ×, +, acc, ctr, and ctrl. Legend: control, compute, registers, SRAM.]

The Future Is (Probably) Reconfigurable
[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 10,000) plotted against programmability (not programmable, less programmable, more programmable). Classes: dedicated, reconfigurable, instruction-based. Labeled points: ASIC, CGRA, FPGA, GPU, CPU.]
Callouts:
- 25x perf/W vs. CPU: XPU (HotChips ’17)
- 287 MOps/mW: Brainwave (ISCA ’18)

Key Question
How can we more productively target reconfigurable architectures like FPGAs?
- Performance: fast and efficient designs
- Productivity: fast and efficient programmers
- Portability: target-generic solutions

Language Taxonomy
[Figure: languages plotted by domain specificity (domain-specific, multi-domain, general purpose) against abstraction, from higher level (“What?”) to lower level (“How?”), with one abstraction axis for instruction-based architectures (CPUs) and one for reconfigurable architectures (FPGAs). Labeled points: x86, Halide, MyHDL, VHDL, Verilog, netlist.]

Abstracting Hardware Design
[Figure: the taxonomy axes again: domain specificity against abstraction, from higher level (“What?”) to lower level (“How?”), for instruction-based architectures (CPUs) and reconfigurable architectures (FPGAs). Labels: netlist, HDLs, + hardware pragmas.]

HDLs
Hardware Description Languages (HDLs), e.g. Verilog, VHDL, Chisel, Bluespec
✓ Arbitrary RTL
✘ No high-level abstractions
✘ Significant target-specific code
[Figure: Performance / Productivity / Portability triangle.]

C + Pragmas
Existing High Level Synthesis (C + Pragmas), e.g. Vivado HLS, SDAccel, Altera OpenCL
✓ Nested loops
✘ Difficult to optimize
✘ Ad-hoc mix of software/hardware
✓ Portable for single vendor
✘ No memory hierarchy
✘ No arbitrary pipelining
[Figure: Performance / Productivity / Portability triangle; label: HDLs.]

Criteria for Improved HLS
Requirements, compared against C+Pragmas:
- Express control as nested loops: enables analysis of access patterns
- Represent memory hierarchy explicitly: aids on-chip memory optimization, specialization
- Specialize memory transfers: enables customized memory controllers based on access patterns
- Capture design parameters: enables automatic design tuning in compiler
- Support arbitrarily nested pipelining: exploits nested parallelism

Design Space Parameters Example
vectorA · vectorB
Small and simple, but slow!
[Figure: FPGA datapath with blocks labeled tileA, tileB, ×, +, acc, and ctr, connected to vectorA and vectorB in DRAM. Legend: control, compute, registers, SRAM.]

Important Parameters: Buffer Sizes
vectorA · vectorB
- Increases length of DRAM accesses (affects runtime)
- Increases exploited locality (affects runtime)
- Increases local memory sizes (affects area)
[Figure: FPGA datapath with blocks labeled tileA, tileB, ×, +, acc, and ctr, connected to vectorA and vectorB in DRAM. Legend: control, compute, registers, SRAM.]
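
A rough way to see the buffer-size trade-off is to count DRAM load commands against on-chip words for a given tile size. This is a back-of-the-envelope model that assumes each tile is fetched with one load command per vector; `loadsPerVector` and `sramWords` are hypothetical helper names, not part of Spatial.

```scala
object TileSizeModel {
  // Number of DRAM load commands needed to stream one length-n vector
  // through a buffer holding `tile` elements (last tile may be partial).
  def loadsPerVector(n: Int, tile: Int): Int = {
    require(n >= 0 && tile > 0, "n must be non-negative and tile positive")
    (n + tile - 1) / tile
  }

  // On-chip words needed to hold one tile of each input vector.
  def sramWords(tile: Int): Int = 2 * tile
}
```

For a 1024-element vector, moving from a tile of 64 to a tile of 256 cuts `loadsPerVector` from 16 to 4 while `sramWords` grows from 128 to 512, so fewer, longer DRAM accesses are paid for with more on-chip area.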

Important Parameters: Pipelining
vectorA · vectorB
- Overlaps memory and compute (affects runtime)
- Increases local memory sizes (affects area)
- Adds synchronization logic (affects area)
Metapipelining requires buffering
[Figure: FPGA datapath split into Stage 1 and Stage 2, with double-buffered tiles labeled tileA (0), tileA (1), tileB (0), and tileB (1), plus ×, +, and acc, connected to vectorA and vectorB in DRAM. Callout: double buffer. Legend: control, compute, registers, SRAM.]
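
The runtime effect of overlapping memory and compute can be sketched with a two-stage timing model. It assumes every tile takes the same fixed number of cycles to load and to process, which real designs will not match exactly; the names below are illustrative.

```scala
object MetapipelineModel {
  // Cycles when each tile is loaded and then processed, with no overlap.
  def sequentialCycles(tiles: Int, loadCycles: Int, computeCycles: Int): Long =
    tiles.toLong * (loadCycles + computeCycles)

  // Cycles when the load of tile i+1 overlaps the compute of tile i.
  // The first load and the last compute cannot be hidden; every other
  // step costs as much as the slower of the two stages.
  def pipelinedCycles(tiles: Int, loadCycles: Int, computeCycles: Int): Long =
    if (tiles <= 0) 0L
    else loadCycles.toLong + computeCycles +
      (tiles - 1).toLong * math.max(loadCycles, computeCycles)
}
```

With 16 tiles, a 100-cycle load, and a 64-cycle compute stage, the sequential schedule takes 2624 cycles and the overlapped one takes 1664. The overlap is only possible if the next tile can be written while the current one is read, which is why the slide ties metapipelining to double buffering.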

Important Parameters: Parallelization
vectorA · vectorB
- Improves element throughput (affects runtime)
- Duplicates compute resources (affects area)
[Figure: FPGA datapath with several × units feeding + units into acc, several ctr blocks, and tileA and tileB, connected to vectorA and vectorB in DRAM. Legend: control, compute, registers, SRAM.]

Important Parameters: Memory Banking
vectorA · vectorB
- Improves memory bandwidth (affects runtime)
- May duplicate memory resources (affects area)
Parallelization requires banking
[Figure: the parallelized datapath with several × units feeding + units into acc, several ctr blocks, and tileA and tileB, connected to vectorA and vectorB in DRAM. Callout: banked SRAM. Legend: control, compute, registers, SRAM.]
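
Why parallel access needs banking can be illustrated with the simplest scheme, cyclic banking, where address `a` lives in bank `a % banks`. This is a generic sketch of that one scheme and an assumption for illustration, not a description of the banking Spatial actually selects.

```scala
object CyclicBanking {
  // Map a flat, non-negative address to (bank, offset within that bank).
  def locate(addr: Int, banks: Int): (Int, Int) = (addr % banks, addr / banks)

  // True when no two simultaneous accesses fall in the same bank,
  // so every lane can be served in the same cycle.
  def conflictFree(addrs: Seq[Int], banks: Int): Boolean =
    addrs.map(_ % banks).distinct.size == addrs.size
}
```

Sixteen lanes reading addresses 0 to 15 at once are conflict-free with 16 banks but collide with 8, so the banking factor has to follow the parallelization factor.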

Rethinking HLS
✓ Nested loops
✓ Automatic memory banking/buffering
✓ Implicit design parameters (unrolling, banking, etc.)
✓ Memory hierarchy
✓ Arbitrary pipelining
✓ Target-generic source across reconfigurable architectures
✓ Automated design tuning
[Figure: Performance / Productivity / Portability triangle; labels: HDLs, C + Pragmas, Improved HLS.]

Abstracting Hardware Design
[Figure: the taxonomy axes again: domain specificity against abstraction, from higher level (“What?”) to lower level (“How?”), for instruction-based architectures (CPUs) and reconfigurable architectures (FPGAs). Labels: netlist, HDLs, + pragmas, Spatial.]

Spatial: Memory Hierarchy
DDR DRAM (GB):
    val image = DRAM[UInt8](H,W)
On-chip SRAM (MB):
    val buffer = SRAM[UInt8](C)
    val fifo = FIFO[Float](D)
    val lbuf = LineBuffer[Int](R,C)
Local registers (KB):
    val accum = Reg[Double]
    val pixels = RegFile[UInt8](R,C)
Transfers from DRAM to on-chip memory:
    buffer load image(i, j::j+C) // dense
    buffer gather image(a) // sparse

Spatial: Control And Design Parameters
Explicit size parameters for loop step size and buffer sizes (informs compiler it can tune this value):
    val B = 64 (64 → 1024)
    val buffer = SRAM[Float](B)
    Foreach(N by B){i =>
      …
    }
Implicit/explicit parallelization factors (optional, but can be explicitly declared):
    val P = 16 (1 → 32)
    Reduce(0)(N by 1 par P){i =>
      data(i)
    }{(a,b) => a + b}
Implicit/explicit control schemes (also optional, but can be used to override compiler):
    Stream.Foreach(0 until N){i =>
      …
    }
Implicit memory banking and buffering schemes for parallelized access:
    Foreach(64 par 16){i =>
      buffer(i) // Parallel read
    }

Dot Product in Spatial
    val output = ArgOut[Float]
    // Off-chip memory declarations
    val vectorA = DRAM[Float](N)
    val vectorB = DRAM[Float](N)

    Accel {
      Reduce(output)(N by B){ i =>
        val tileA = SRAM[Float](B)
        val tileB = SRAM[Float](B)
        val acc = Reg[Float]

        tileA load vectorA(i :: i+B)
        tileB load vectorB(i :: i+B)

        Reduce(acc)(B by 1){ j =>
          tileA(j) * tileB(j)
        }{(a, b) => a + b}
      }{(a, b) => a + b}
    }
[Figure: vectorA and vectorB in DRAM; output on the FPGA.]

Dot Product in Spatial: build-up slides for the same program, one callout per slide
- Explicit work division in IR
- Tiled reduction (outer): the Reduce(output)(N by B) loop, drawn as the Outer Reduce block on the FPGA
- On-chip memory declarations: tileA, tileB, and acc, drawn as tileA (0), tileA (1), tileB (0), tileB (1), and acc
- DRAM → SRAM transfers (also have store, scatter, and gather): the two load statements, drawn as Stage 1
- Tiled reduction (pipelined): the inner Reduce(acc)(B by 1) loop, drawn as Stage 2 (×, +, acc)
- Outer reduce function: the closing reduction function of the outer Reduce, drawn as Stage 3 (+ into output)

Design parameters annotated on the finished diagram:
- Tile size (B)
- Banking strategy
- Metapipelining toggle
- Parallelism factor #1
- Parallelism factor #2
- Parallelism factor #3

Spatial Program + Design Parameters
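
As a functional cross-check of what the nested `Reduce` loops compute, here is a plain-Scala software model of the tiled dot product. It is only a reference for the arithmetic, not Spatial code: it runs on the host, ignores all hardware concerns, and (unlike the slide, which appears to assume `N` is a multiple of `B`) clamps the last tile. The name `tiledDot` is made up for this sketch.

```scala
object DotProductModel {
  def tiledDot(a: Array[Float], b: Array[Float], tile: Int): Float = {
    require(a.length == b.length, "vectors must have equal length")
    require(tile > 0, "tile size must be positive")
    // Outer reduction: one partial sum per tile, like Reduce(output)(N by B).
    (0 until a.length by tile).map { i =>
      val end = math.min(i + tile, a.length)
      val tileA = a.slice(i, end) // stands in for `tileA load vectorA(i :: i+B)`
      val tileB = b.slice(i, end) // stands in for `tileB load vectorB(i :: i+B)`
      // Inner reduction over the tile, like Reduce(acc)(B by 1).
      tileA.indices.map(j => tileA(j) * tileB(j)).foldLeft(0f)(_ + _)
    }.foldLeft(0f)(_ + _)
  }
}
```

For small exact inputs the result does not depend on the tile size, for example `tiledDot(Array(1f, 2f, 3f, 4f), Array(1f, 1f, 1f, 1f), 2)` and the same call with a tile of 3 both give 10. With general floating-point data, different tile sizes can change rounding slightly because the additions are grouped differently.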

The Spatial Compiler
Inputs: Spatial program, design parameters
[Figure: compiler flow diagram over the Spatial IR. Boxes: Control Inference, Access Pattern Analysis, Mem. Banking/Buffering, Control Scheduling, Pipeline Unrolling, Pipeline Retiming, Area/Runtime Analysis, [Optional] Design Tuning, Host Resource Allocation, Control Signal Inference, Chisel Code Generation. Legend: intermediate representation, design parameters, IR transformation, IR analysis, code generation.]

Control Scheduling
- Creates loop pipeline schedules
- Detects data dependencies across loop intervals
- Calculates initiation interval of pipelines
- Sets maximum depth of buffers
- Supports arbitrarily nested pipelines (commercial HLS tools don’t support this)
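
To make "initiation interval" concrete, the sketch below computes the textbook recurrence bound used in modulo scheduling: a value produced `latency` cycles after it is read, and consumed `distance` iterations later, forces at least `ceil(latency / distance)` cycles between iteration starts. This is a standard bound offered as an assumption about the kind of calculation involved; the slides do not say how Spatial computes it, and `CarriedDep` is a hypothetical type.

```scala
object InitiationInterval {
  // A loop-carried dependency: the result is ready `latency` cycles after
  // the iteration starts and is needed `distance` iterations later.
  final case class CarriedDep(latency: Int, distance: Int)

  // Smallest initiation interval that respects every carried dependency.
  // With no dependencies a new iteration can start every cycle.
  def recurrenceBound(deps: Seq[CarriedDep]): Int =
    deps.foldLeft(1) { (ii, d) =>
      require(d.distance > 0 && d.latency >= 0, "invalid dependency")
      math.max(ii, (d.latency + d.distance - 1) / d.distance)
    }
}
```

For example, an accumulator whose add takes 4 cycles and feeds the very next iteration gives a bound of 4, while the same add feeding the iteration two steps later gives 2.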

Local Memory Analysis
- Insight: determine banking strategy in a single loop nest using the polyhedral model [Wang, Li, Cong FPGA ’14]
- Spatial’s contribution: find the (near) optimal banking/buffering strategy across all loop nests
- Algorithm in a nutshell:
  1. Bank each reader as a separate coherent copy (accounting for reaching writes)
  2. Greedily merge copies if merging is legal and cheaper
[Figure: parallelized dot product datapath with several × and + units, ctr blocks, acc, tileA, and tileB.]
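
The two-step outline above can be sketched as a generic greedy loop. Everything specific here is an assumption made for illustration: `Copy` is a made-up stand-in for one banked memory instance, and the legality check and cost model are passed in as functions because the slides do not define them. "Cheaper" is interpreted as the merged copy costing less than the two separate copies combined.

```scala
object GreedyBanking {
  // Hypothetical stand-in for one banked copy of a memory.
  final case class Copy(readers: Set[String], banks: Int)

  // Start with one copy per reader, then fold each copy into the first
  // existing copy where merging is legal (merge returns Some) and cheaper.
  def greedyMerge(
      copies: List[Copy],
      merge: (Copy, Copy) => Option[Copy], // None when merging is illegal
      cost: Copy => Int
  ): List[Copy] =
    copies.foldLeft(List.empty[Copy]) { (merged, next) =>
      val candidate = merged.zipWithIndex.iterator
        .map { case (existing, idx) => (idx, existing, merge(existing, next)) }
        .collectFirst {
          case (idx, existing, Some(m)) if cost(m) < cost(existing) + cost(next) =>
            (idx, m)
        }
      candidate match {
        case Some((idx, m)) => merged.updated(idx, m)
        case None           => merged :+ next
      }
    }
}
```

A caller supplies the real legality and cost rules; with a `merge` that always returns `None`, the result is simply one copy per reader, which is the starting point described in step 1.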

Design Tuning
Original tuning methods:
- Pre-prune space using simple heuristics
- Randomly sample ~100,000 design points
- Model area/runtime of each point
Proposed tuning method:
- Active learning: HyperMapper (more details in paper)
- Fast: no slow transformers in loop
[Figure: compiler flow diagram in which Design Tuning takes the Spatial IR and design parameters and produces modified parameters.]
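
The sample-and-model approach listed under the original tuning methods can be sketched as follows. The parameter ranges (tile size 64 to 1024, parallelization 1 to 32) are taken from the earlier design-parameter slide; the runtime and area formulas, the `dramLatency` constant, and all names are invented placeholders, not Spatial's actual models.

```scala
import scala.util.Random

object RandomTuning {
  final case class Point(tile: Int, par: Int)

  // Placeholder analytic models (assumptions, for illustration only).
  val dramLatency: Double = 50.0
  def runtimeCycles(n: Int, p: Point): Double = {
    val tiles = math.ceil(n.toDouble / p.tile)
    tiles * (dramLatency + p.tile.toDouble / p.par)
  }
  def area(p: Point): Double = 2.0 * p.tile + 100.0 * p.par

  // Randomly sample design points, keep those within the area budget,
  // and return the one with the lowest modeled runtime (if any fit).
  def tune(n: Int, areaBudget: Double, samples: Int, seed: Long): Option[Point] = {
    val rng = new Random(seed)
    val tiles = 64 to 1024 by 64
    val pars = 1 to 32
    val points = Seq.fill(samples) {
      Point(tiles(rng.nextInt(tiles.size)), pars(rng.nextInt(pars.size)))
    }
    val feasible = points.filter(p => area(p) <= areaBudget)
    if (feasible.isEmpty) None else Some(feasible.minBy(p => runtimeCycles(n, p)))
  }
}
```

Because each point is only evaluated through cheap models, many points can be tried without running the compiler's transformations, which is the property the slide highlights.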

The Spatial Compiler: The Rest
Code generation:
- Synthesizable Chisel
- C++ code for host CPU

Evaluation: Performance
- FPGA:
  - Amazon EC2 F1 Instance: Xilinx VU9P FPGA
  - Fixed clock rate of 150 MHz
- Applications:
  - SDAccel: hand optimized, tuned implementations
  - Spatial: hand written, automatically tuned implementations
- Execution time = FPGA execution time

Performance (Spatial vs. SDAccel)
Average 2.9x faster hardware than SDAccel
[Figure: bar chart of speedup over SDAccel (axis 0 to 15) for BlackScholes, GDA, GEMM, K-Means, PageRank, Smith-Waterman, and TPC-H Q6. Bar labels (order not preserved): 8.5x, 1.4x, 1.6x, 1.4x, 3.5x, 14.1x, 1.3x.]
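
The quoted average can be reproduced from the seven per-benchmark speedups on this slide if it is read as a geometric mean. That reading is an assumption (the slide only says "average"), but the arithmetic mean of the same values is about 4.5, so the geometric mean is the one that matches.

```scala
object SpeedupSummary {
  // Per-benchmark speedups over SDAccel as printed on the slide.
  val speedups: Seq[Double] = Seq(8.5, 1.4, 1.6, 1.4, 3.5, 14.1, 1.3)

  // Geometric mean, computed in log space.
  def geometricMean(xs: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.forall(_ > 0), "need positive values")
    math.exp(xs.map(x => math.log(x)).sum / xs.size)
  }
}
```

`SpeedupSummary.geometricMean(SpeedupSummary.speedups)` comes out at about 2.9, in line with the slide.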

Productivity: Lines of Code
Average 42% shorter programs versus SDAccel
[Figure: bar chart of lines of code (axis 0 to 250), SDAccel versus Spatial, for BlackScholes, GDA, GEMM, K-Means, PageRank, Smith-Waterman, and TPC-H Q6. Reduction labels (order not preserved): 12%, 60%, 47%, 44%, 31%, 66%, 35%.]

Evaluation: Portability
- FPGA 1:
  - Amazon EC2 F1 Instance: Xilinx VU9P FPGA
  - 19.2 GB/s DRAM bandwidth (single channel)
- FPGA 2:
  - Xilinx Zynq ZC706
  - 4.3 GB/s
- Applications:
  - Spatial: hand written, automatically tuned implementations
  - Fixed clock rate of 150 MHz

Portability: VU9P vs. Zynq ZC706
Identical Spatial source, multiple targets
Resource ratios, VU9P / ZC706:
- DRAM bandwidth: 4.5x
- LUTs (GP compute): 47.3x
- DSPs (integer FMA): 7.6x
- On-chip memory*: 4.0x
* No URAM used on VU9P
[Figure: bar chart of speedup (axis 0 to 25) for BlackScholes, GDA, GEMM, K-Means, PageRank, Smith-Waterman, and TPC-H Q6, built up over three slides. Bar labels for each series are listed below in the order they were extracted, which may not match the benchmark order.]
- Porting (speedup, VU9P / Zynq, only from moving to larger FPGA): 2.5x, 1.2x, 2.5x, 2.5x, 1.3x, 2.5x, 4.6x
- Tuning (speedup only from tuning parameters for larger FPGA): 2.6x, 2.1x, 9.4x, 2.7x, 1.7x, 1.0x, 1.1x
- Product = Porting × Tuning: 6.5x, 2.5x, 23.4x, 6.8x, 2.2x, 2.5x, 5.0x
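
The relation Product = Porting × Tuning can be checked against the printed bar labels. The sketch assumes the three series of labels appear in the same benchmark order on each slide, which the numbers themselves support; small mismatches are expected because the labels are rounded to one decimal.

```scala
object PortabilityProduct {
  // Bar labels as printed on the three build-up slides.
  val porting: Seq[Double] = Seq(2.5, 1.2, 2.5, 2.5, 1.3, 2.5, 4.6)
  val tuning: Seq[Double]  = Seq(2.6, 2.1, 9.4, 2.7, 1.7, 1.0, 1.1)
  val product: Seq[Double] = Seq(6.5, 2.5, 23.4, 6.8, 2.2, 2.5, 5.0)

  // Porting speedup multiplied by tuning speedup, per benchmark.
  def combined: Seq[Double] = porting.zip(tuning).map { case (p, t) => p * t }

  // Largest gap between the recomputed and the printed product.
  def maxDeviation: Double =
    combined.zip(product).map { case (c, r) => math.abs(c - r) }.max
}
```

The recomputed products stay within about 0.1 of the printed ones (the largest gap is 2.5 × 9.4 = 23.5 against a printed 23.4), which is consistent with rounding.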

Portability: Plasticine CGRA
Identical Spatial source, multiple targets
Even reconfigurable hardware that isn’t an FPGA!

                DRAM Bandwidth (%)   Resource Utilization (%)   Speedup
Benchmark       Load     Store       PCU     PMU     AG         vs. VU9P
BlackScholes    77.4     12.9        73.4    10.9    20.6        1.6
GDA             24.0      0.2        95.3    73.4    38.2        9.8
GEMM            20.5      2.1        96.8    64.1    11.7       55.0
K-Means          8.0      0.4        89.1    57.8    17.6        6.3
TPC-H Q6        97.2      0.0        29.7    37.5    70.6        1.6

Prabhakar et al. Plasticine: A Reconfigurable Architecture For Parallel Patterns (ISCA ’17)

Conclusion
- Reconfigurable architectures are becoming key for performance / energy efficiency
- Current programming solutions for reconfigurables are still inadequate
- Need to rethink outside of the C box for high level synthesis:
  - Memory hierarchy for optimization
  - Design parameters for tuning
  - Arbitrarily nestable pipelines
- Spatial prototypes these language and compiler criteria:
  - Average speedup of 2.9x versus SDAccel on VU9P
  - Average 42% less code than SDAccel
  - Achieves transparent portability through internal support for automated design tuning (HyperMapper)
Spatial is open source: spatial.stanford.edu
[Figure: Performance / Productivity / Portability triangle.]

The Team
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Ardavan Pedram, Christos Kozyrakis, Kunle Olukotun, Stefan Hadjis, Ruben Fiszel, Luigi Nardi