VLSI Digital Signal Processing Systems
Systolic Architecture Design
Lan-Da Van (范倫達), Ph. D.
Department of Computer Science
National Chiao Tung University
Taiwan, R.O.C.
Fall, 2010
http://www.cs.nctu.tw/~ldvan/
Outline
Introduction
Systolic Array Design Methodology
FIR Systolic Arrays
Selection of Scheduling Vector
Conclusion
Systolic Architecture
What is a systolic architecture (also called a systolic array)?
A network of processing elements (PEs) that rhythmically compute and pass data through the system.
Used as a coprocessor in combination with a host computer. The rhythmic data flow is analogous to the flow of blood through the heart, hence the name systolic.
Characteristics of Systolic Arrays
Synchronization
Modularity
Regularity
Locality
Finite Connection
Parallel/Pipeline
Extendibility
Some relaxations are introduced to increase the utility of systolic arrays:
Neighbor interconnection (near, but not nearest)
Data broadcast operations
Different PEs, especially at the boundaries
Systolic Array Design Methodology
Represent the Algorithm as a Dependence Graph
Applying Projection, Processor, and Scheduling Vectors
(Space-Time Representation)
Edge Mapping
Construct the Final Systolic Architecture
Design Methodology: Basic Vectors
Projection vector $d^T = [d_1\ d_2]$: determines how the DG is compressed. Two nodes that are displaced by $d$ or multiples of $d$ are executed by the same processor.
Processor space vector $p^T = [p_1\ p_2]$: any node with index $I^T = [i, j]$ is executed by processor $p^T I$.
Scheduling vector $s^T = [s_1\ s_2]$: any node with index $I^T = [i, j]$ is executed at time $s^T I$.
Hardware utilization efficiency: $HUE = 1/|s^T d|$, because two tasks executed by the same processor are spaced $|s^T d|$ time units apart.
Feasibility constraints:
The processor space vector and the projection vector must be orthogonal, that is, $p^T d = 0$. If nodes A and B differ by the projection vector, i.e., $I_A - I_B = d$, then they must be executed by the same processor: $p^T I_A = p^T I_B \Rightarrow p^T (I_A - I_B) = 0 \Rightarrow p^T d = 0$.
If A and B are mapped to the same processor, then they cannot be executed at the same time: $s^T I_A \neq s^T I_B \Rightarrow s^T d \neq 0$.
Edge mapping: if an edge $e$ exists in the DG, then an edge $p^T e$ exists in the systolic array with $s^T e$ delays.
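The feasibility constraints and the HUE formula are easy to check numerically. The following is a minimal sketch; the helper names `dot`, `is_feasible`, and `hue` are illustrative and not taken from the slides.

```python
from fractions import Fraction

def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(a, b))

def is_feasible(d, p, s):
    """Check the two feasibility constraints: p^T d = 0 and s^T d != 0."""
    return dot(p, d) == 0 and dot(s, d) != 0

def hue(d, s):
    """Hardware utilization efficiency, HUE = 1 / |s^T d| (requires s^T d != 0)."""
    return Fraction(1, abs(dot(s, d)))

# Vectors used in the FIR example of these slides.
d, p, s = (1, 0), (0, 1), (1, 0)
print(is_feasible(d, p, s), hue(d, s))  # True 1
```

Using `Fraction` keeps the HUE exact (for example 1/2) instead of a rounded float.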
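Space to Space-Time Representation
Space-time representation: interpret one of the spatial dimensions as the temporal dimension.
$$\begin{bmatrix} i' \\ j' \\ t' \end{bmatrix} = T \begin{bmatrix} i \\ j \\ t \end{bmatrix} = \begin{bmatrix} 0\ \ 0 & 1 \\ p^T & 0 \\ s^T & 0 \end{bmatrix} \begin{bmatrix} i \\ j \\ t \end{bmatrix}$$
$j'$: processor axis, $t'$: scheduling time instance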
Outline
Introduction
Systolic Array Design Methodology
FIR Systolic Arrays
Selection of Scheduling Vector
Matrix-Matrix Multiplication and 2D Systolic
Array Design
Systolic Design for Space Representations
Containing Delays
Conclusion
DG of FIR Filter
Dependence Graph (DG)
Ex: FIR filter: y(n) = w0 x(n) + w1 x(n-1) + w2 x(n-2)
[Figure: dependence graph of the FIR filter, with axes i and j]
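To make the dependence graph concrete, the sketch below evaluates the 3-tap FIR filter directly and again by summing the products computed at the DG nodes. It assumes that node (i, j) computes w_j x(i) and contributes to y(i + j); this index convention is inferred from the edge directions used in these slides (weights along i, inputs along j, results along [1 -1]) and is not stated explicitly in the source. Function names are illustrative.

```python
def fir_direct(w, x):
    """y(n) = sum_k w[k] * x(n - k), with x(n) = 0 for n < 0."""
    return [sum(w[k] * x[n - k] for k in range(len(w)) if n - k >= 0)
            for n in range(len(x))]

def fir_from_dg(w, x):
    """Accumulate DG node products: node (i, j) adds w[j] * x[i] to y(i + j)."""
    y = [0] * len(x)
    for i in range(len(x)):          # i: input sample index
        for j in range(len(w)):      # j: weight (tap) index
            if i + j < len(x):       # ignore outputs beyond the input length
                y[i + j] += w[j] * x[i]
    return y

w = [1, 2, 3]
x = [1, 0, 2, -1]
print(fir_direct(w, x))   # [1, 2, 5, 3]
print(fir_from_dg(w, x))  # [1, 2, 5, 3]
```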
Applying Projection and Scheduling (1/2)
Projection vector: $d^T = [1\ 0]$
Processor vector: $p^T = [0\ 1]$
Scheduling vector: $s^T = [1\ 0]$
Part of DG, with $p^T$ and $s^T$ applied to each node $I = [i\ j]^T$:
Node I     Processor p^T I   Time s^T I
[0 0]^T    0                 0
[1 0]^T    0                 1
[0 1]^T    1                 0
[1 1]^T    1                 1
[0 2]^T    2                 0
[1 2]^T    2                 1
…
[Figure: the DG nodes collapse onto processor 0, processor 1, and processor 2, giving an SFG with PE0, PE1, and PE2]
Applying Projection and Scheduling (2/2)
[Figure: applying projection and scheduling maps the dependence graph to its space-time representation]
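The node-by-node mapping can be reproduced with a few lines of code. This is a minimal sketch using the slide's vectors p^T = [0 1] and s^T = [1 0]; the function name `space_time_map` and the choice of a 2 x 3 patch of the DG are illustrative assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(u * v for u, v in zip(a, b))

def space_time_map(nodes, p, s):
    """Map each DG node I to (processor p^T I, time s^T I)."""
    return {I: (dot(p, I), dot(s, I)) for I in nodes}

p, s = (0, 1), (1, 0)                                  # vectors from the slide
nodes = [(i, j) for i in range(2) for j in range(3)]   # small patch of the DG
mapping = space_time_map(nodes, p, s)
for I in nodes:
    proc, t = mapping[I]
    print(f"node {I} -> processor {proc}, time {t}")

# No two nodes may occupy the same processor in the same time slot.
assert len(set(mapping.values())) == len(nodes)
```

Nodes that differ by d = [1 0] land on the same processor at consecutive times, which is the behavior the projection vector is meant to produce.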
Edge Mapping
[Figure: an edge $e$ of the DG in the (i, j) plane]
Edge mapping: $e' = p^T e$
Delay: $s^T e$
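Edge Mapping
Example: $p^T = [0\ 1]$, $s^T = [1\ 0]$, $d^T = [1\ 0]$
DG edges: $e_1 = [0\ 1]^T$ (input), $e_2 = [1\ 0]^T$ (weight), $e_3 = [1\ -1]^T$ (output)
Edge mapping:
$e_1' = p^T e_1 = [0\ 1][0\ 1]^T = 1$, delay $= s^T e_1 = [1\ 0][0\ 1]^T = 0$
$e_2' = p^T e_2 = [0\ 1][1\ 0]^T = 0$, delay $= s^T e_2 = [1\ 0][1\ 0]^T = 1$
$e_3' = p^T e_3 = [0\ 1][1\ -1]^T = -1$, delay $= s^T e_3 = [1\ 0][1\ -1]^T = 1$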
Edge mapping
Edge mapping table for $p^T = [0\ 1]$, $s^T = [1\ 0]$, $d^T = [1\ 0]$:
Edge     e           p^T e   s^T e
Input    [0 1]^T      1      0
Weight   [1 0]^T      0      1
Output   [1 -1]^T    -1      1
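The edge mapping table follows mechanically from the two inner products. A minimal sketch, with the illustrative helper name `edge_mapping`:

```python
def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(u * v for u, v in zip(a, b))

def edge_mapping(edges, p, s):
    """Return {name: (p^T e, s^T e)}: direction in the array and number of delays."""
    return {name: (dot(p, e), dot(s, e)) for name, e in edges.items()}

edges = {"input": (0, 1), "weight": (1, 0), "output": (1, -1)}
table = edge_mapping(edges, p=(0, 1), s=(1, 0))
for name, (direction, delays) in table.items():
    print(f"{name:6s}  p^T e = {direction:2d}  s^T e = {delays}")
```

A mapped edge with p^T e = 0 stays inside its PE, and an edge with s^T e = 0 carries no delay, which corresponds to a broadcast.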
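Construct the Final Systolic Architecture
[Figure: systolic array with PE0, PE1, and PE2. The input is broadcast to all PEs, Weight 0, Weight 1, and Weight 2 stay in their PEs, and the output moves between neighboring PEs through delay elements (D).]
This is called the B1 design.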
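The behavior of the B1 array can be checked with a small cycle-by-cycle model. The sketch below assumes the structure implied by the edge mapping (input broadcast with no delay, weight w_j held in PE j, results moving from PE j+1 to PE j through one register per hop); the register arrangement and the function names are assumptions for illustration, not a transcription of the figure.

```python
def fir_direct(w, x):
    """Reference: y(n) = sum_k w[k] * x(n - k), with x(n) = 0 for n < 0."""
    return [sum(w[k] * x[n - k] for k in range(len(w)) if n - k >= 0)
            for n in range(len(x))]

def simulate_b1(w, x):
    """Cycle-level model of the B1 array; the output is taken from PE0."""
    n_pe = len(w)
    # held[j]: result produced by PE j in the previous cycle.
    # held[n_pe] is a constant 0 feeding the last PE.
    held = [0] * (n_pe + 1)
    y = []
    for sample in x:                                  # one clock cycle per input
        # Every PE sees the broadcast sample and adds its neighbor's delayed result.
        out = [w[j] * sample + held[j + 1] for j in range(n_pe)]
        y.append(out[0])                              # PE0 delivers y(n)
        held[:n_pe] = out                             # clock edge: registers update
    return y

w, x = [1, 2, 3], [1, 0, 2, -1]
print(simulate_b1(w, x))   # [1, 2, 5, 3]
print(fir_direct(w, x))    # [1, 2, 5, 3]
```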
Alternative Designs
B1 (Broadcast inputs, Move results, Weights stay)
B2 (Broadcast inputs, Move weights, Results stay)
F (Fan-in results, Move inputs, Weights stay)
R1 (Results stay, Inputs and Weights move in opposite directions)
R2 and Dual R2 (Results stay, Inputs and Weights move in the same direction but at different speeds)
W1 (Weights stay, Inputs and Results move in opposite directions)
W2 and Dual W2 (Weights stay, Inputs and Results move in the same direction but at different speeds)
Relating systolic designs using transformations
B2 – Broadcast Inputs, Move Weights, Results Stay
$d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ 0]$
$HUE = 1/|s^T d| = 1$
Edge     e           p^T e   s^T e
wt       [1 0]^T      1      1
input    [0 1]^T      1      0
result   [1 -1]^T     0      1
F - Fan-in Results, Move Inputs, Weights Stay
$d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ 1]$
$HUE = 1/|s^T d| = 1$
Edge     e           p^T e   s^T e
wt       [1 0]^T      0      1
input    [0 1]^T      1      1
result   [1 -1]^T    -1      0
R1 - Results Stay, Inputs and Weights Move in Opposite Directions
$d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ -1]$
$HUE = 1/|s^T d| = 1/2$
Edge     e                      p^T e   s^T e
wt       [1 0]^T                 1      1
input    [0 -1]^T (reversed)    -1      1
result   [1 -1]^T                0      2
R2 and Dual R2 - Results Stay, Inputs and Weights Move in the Same Direction but at Different Speeds
R2: $d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [2\ 1]$, $HUE = 1/|s^T d| = 1$
Edge     e           p^T e   s^T e
wt       [1 0]^T      1      2
input    [0 1]^T      1      1
result   [1 -1]^T     0      1
Dual R2: $d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ 2]$, $HUE = 1/|s^T d| = 1$
Edge     e           p^T e   s^T e
wt       [1 0]^T      1      1
input    [0 1]^T      1      2
result   [1 -1]^T     0      1
[Figures: R2 and Dual R2 arrays with PE0, PE1, and PE2. Inputs and weights move in the same direction, one through single delays (D) and the other through double delays (2D), while the results stay in the PEs.]
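W1 – Weights Stay, Inputs and Results Move in Opposite Directions
$d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [2\ 1]$
$HUE = 1/|s^T d| = 1/2$
Edge     e           p^T e   s^T e
wt       [1 0]^T      0      2
input    [0 1]^T      1      1
result   [1 -1]^T    -1      1
[Figure: W1 array with PE0, PE1, and PE2. The weights stay in the PEs (2D), while inputs and results move in opposite directions through single delays (D).]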
W2 and Dual W2 - Weights Stay, Inputs and Results Move in the Same Direction but at Different Speeds
W2: $d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ 2]$, $HUE = 1/|s^T d| = 1$
Edge     e                      p^T e   s^T e
wt       [1 0]^T                 0      1
input    [0 1]^T                 1      2
result   [-1 1]^T (reversed)     1      1
Dual W2: $d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ -1]$, $HUE = 1/|s^T d| = 1$
Edge     e                      p^T e   s^T e
wt       [1 0]^T                 0      1
input    [0 -1]^T (reversed)    -1      1
result   [1 -1]^T               -1      2
[Figures: W2 and Dual W2 arrays with PE0, PE1, and PE2. The weights stay in the PEs, while inputs and results move in the same direction, one through single delays (D) and the other through double delays (2D).]
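The feasibility conditions and HUE values of all the alternative designs can be cross-checked in one pass. In the sketch below the (d, p, s) triples are copied from the slides, with B1 taken from the earlier edge mapping example; the names `DESIGNS` and `summarize` are illustrative.

```python
from fractions import Fraction

def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(u * v for u, v in zip(a, b))

# (d, p, s) for each design, as listed on the slides.
DESIGNS = {
    "B1":      ((1, 0),  (0, 1), (1, 0)),
    "B2":      ((1, -1), (1, 1), (1, 0)),
    "F":       ((1, 0),  (0, 1), (1, 1)),
    "R1":      ((1, -1), (1, 1), (1, -1)),
    "R2":      ((1, -1), (1, 1), (2, 1)),
    "Dual R2": ((1, -1), (1, 1), (1, 2)),
    "W1":      ((1, 0),  (0, 1), (2, 1)),
    "W2":      ((1, 0),  (0, 1), (1, 2)),
    "Dual W2": ((1, 0),  (0, 1), (1, -1)),
}

def summarize(d, p, s):
    """Return (feasible, HUE); HUE is None when s^T d = 0."""
    sd = dot(s, d)
    feasible = dot(p, d) == 0 and sd != 0
    hue = Fraction(1, abs(sd)) if sd != 0 else None
    return feasible, hue

for name, (d, p, s) in DESIGNS.items():
    feasible, hue = summarize(d, p, s)
    print(f"{name:8s} feasible={feasible}  HUE={hue}")
```

Only R1 and W1 come out with HUE = 1/2; every other design keeps each PE busy in every cycle.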
Relating Systolic Designs Using Transformations
Designs that share the same projection vector and processor space vector but use different scheduling vectors can be derived from one another using transformations:
Edge reversal: reverse an edge direction in the DG when there is no precedence constraint
Associativity: when accumulating, (a+b)+c = a+(b+c)
Slow-down
Retiming
Pipelining
Cutset Retiming Transformation
[Figure: designs F and B1 related by cutset retiming]
Scheduling Inequalities (1/3)
Based on the selected scheduling vector $s^T$, the projection vector $d$ and the processor space vector $p^T$ can be selected such that
$p^T (I_A - I_B) = 0 \Rightarrow p^T d = 0$
$s^T I_A \neq s^T I_B \Rightarrow s^T d \neq 0$
Consider the dependence relation $X \to Y$, where $I_x = [i_x\ j_x]^T$ and $I_y = [i_y\ j_y]^T$ are the indices of node X and node Y, respectively. The scheduling inequality for this dependence is given by Eq. (1).
Scheduling Inequalities (2/3)
$S_y \geq S_x + T_x$   Eq. (1)
where $T_x$ is the time to compute node X, and $S_x$, $S_y$ are the scheduling times for nodes X and Y, respectively.
Linear scheduling:   Eq. (2)
$S_x = s^T I_x = [s_1\ s_2][i_x\ j_x]^T$
$S_y = s^T I_y = [s_1\ s_2][i_y\ j_y]^T$
Affine scheduling:   Eq. (3)
$S_x = s^T I_x + \gamma_x = [s_1\ s_2][i_x\ j_x]^T + \gamma_x$
$S_y = s^T I_y + \gamma_y = [s_1\ s_2][i_y\ j_y]^T + \gamma_y$
Scheduling Inequalities (3/3)
Define the edge from node X to node Y as $e_{x \to y} = I_y - I_x$.
From Eqs. (1) & (2), each edge gives the scheduling inequality $s^T e_{x \to y} + \gamma_y - \gamma_x \geq T_x$, which reduces to $s^T e_{x \to y} \geq T_x$ for linear scheduling ($\gamma_x = \gamma_y = 0$).
Hence the selection of the scheduling vector consists of two steps:
Capture all the fundamental edges. The reduced dependence graph (RDG) is used to capture the fundamental edges, and the regular iterative algorithm (RIA) description of the corresponding problem is used to construct the RDG.
Construct the scheduling inequalities and solve them for a feasible $s^T$.
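A single scheduling inequality can be tested directly once the edge vector is known. The sketch below is a minimal illustration; the node indices and the computation time of 8 are hypothetical values chosen for the example, and the helper names are not from the slides.

```python
def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(u * v for u, v in zip(a, b))

def edge(i_x, i_y):
    """Edge from node X to node Y: e = I_y - I_x."""
    return tuple(b - a for a, b in zip(i_x, i_y))

def satisfies(s, e, t_x, gamma_x=0, gamma_y=0):
    """Check s^T e + gamma_y - gamma_x >= T_x for one dependence X -> Y."""
    return dot(s, e) + gamma_y - gamma_x >= t_x

e = edge((2, 3), (3, 2))          # hypothetical indices, giving e = (1, -1)
print(e)                           # (1, -1)
print(satisfies((9, 1), e, 8))     # True:  9 - 1 = 8 >= 8
print(satisfies((2, 1), e, 8))     # False: 2 - 1 = 1 <  8
```

With the default offsets of zero the check is the linear scheduling case; passing `gamma_x` and `gamma_y` gives the affine case.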
Regular Iterative Algorithm (RIA)
The regular iterative algorithm (RIA) description is used to construct the reduced dependence graph (RDG).
The RIA has two standard forms:
The RIA is in standard input RIA form if the indices of the inputs are the same for all equations.
The RIA is in standard output RIA form if the output indices are the same for all equations.
FIR example:
$W(i+1, j) = W(i, j)$
$X(i, j+1) = X(i, j)$
$Y(i+1, j-1) = Y(i, j) + W(i+1, j-1)\, X(i+1, j-1)$
Output RIA form:
$W(i, j) = W(i-1, j)$
$X(i, j) = X(i, j-1)$
$Y(i, j) = Y(i-1, j+1) + W(i, j)\, X(i, j)$
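The output RIA form can be executed as written to confirm that it computes the FIR filter. The sketch below assumes the recurrences W(i, j) = W(i-1, j), X(i, j) = X(i, j-1), and Y(i, j) = Y(i-1, j+1) + W(i, j) X(i, j), together with boundary values that are not given on the slide: X(i, 0) = x(i), W(0, j) = w_j, Y = 0 outside the index space, and y(n) read from Y(n, 0).

```python
def fir_output_ria(w, x):
    """Evaluate the output-RIA recurrences over 0 <= i < len(x), 0 <= j < len(w)."""
    n, m = len(x), len(w)
    W = [[0] * m for _ in range(n)]
    X = [[0] * m for _ in range(n)]
    Y = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            W[i][j] = w[j] if i == 0 else W[i - 1][j]      # weights propagate along i
            X[i][j] = x[i] if j == 0 else X[i][j - 1]      # inputs propagate along j
            carry = Y[i - 1][j + 1] if i > 0 and j + 1 < m else 0
            Y[i][j] = carry + W[i][j] * X[i][j]            # results accumulate along [1 -1]
    return [Y[i][0] for i in range(n)]                     # assumed output: y(n) = Y(n, 0)

print(fir_output_ria([1, 2, 3], [1, 0, 2, -1]))   # [1, 2, 5, 3]
```

Every equation has the same output index (i, j), which is what makes this the standard output form.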
Scheduling Vector and Systolic Array Design Using RDG
Constructing scheduling inequalities using RDG
Determine the scheduling vector using
scheduling inequalities
Systolic mapping using the scheduling vector
This formulation can accommodate different
computation times for various operations due to
its generality.
Example 7.4.1 (1/4)
There are 5 edges in the above RDG.
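Example 7.4.1 (2/4)
Reduced Dependence Graph (RDG): edges and their scheduling inequalities
$W \to Y$: $e = [0\ 0]^T$, $\gamma_y - \gamma_w \geq 0$
$X \to X$: $e = [0\ 1]^T$, $s_2 \geq 1$
$W \to W$: $e = [1\ 0]^T$, $s_1 \geq 1$
$X \to Y$: $e = [0\ 0]^T$, $\gamma_y - \gamma_x \geq 0$
$Y \to Y$: $e = [1\ -1]^T$, $s_1 - s_2 \geq 5 + 2 + 1$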
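Example 7.4.1 (4/4)
Linear scheduling:
$s_1 \geq 1$, $s_2 \geq 1$, $s_1 - s_2 \geq 8$
$s_2 = 1$, $s_1 = 9 \Rightarrow s^T = [9\ 1]$
$d^T = (1, -1)$, $p^T = (1, 1)$
Edge            p^T e   s^T e
wt (1, 0)        1      9
i/p (0, 1)       1      1
Result (1, -1)   0      8
[Figure: systolic array architecture. W moves through 9D delays, X moves through D delays, and the results stay in the PEs with 8D delays.]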
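The inequalities of this example are small enough to solve by exhaustive search. The sketch below looks for the feasible integer vector with the smallest s1 + s2; that selection criterion, the search range, and the function names are assumptions made for illustration, since the slide only states the chosen solution.

```python
from itertools import product

def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(u * v for u, v in zip(a, b))

# (edge vector e, required time T) for each inequality s^T e >= T.
CONSTRAINTS = [
    ((1, 0), 1),     # W -> W: s1 >= 1
    ((0, 1), 1),     # X -> X: s2 >= 1
    ((1, -1), 8),    # Y -> Y: s1 - s2 >= 5 + 2 + 1
]

def feasible(s):
    """True if s satisfies every scheduling inequality."""
    return all(dot(s, e) >= t for e, t in CONSTRAINTS)

def smallest_schedule(limit=20):
    """Brute-force search for the feasible s with the smallest s1 + s2."""
    candidates = [s for s in product(range(-limit, limit + 1), repeat=2) if feasible(s)]
    return min(candidates, key=lambda s: (s[0] + s[1], s[0]))

s = smallest_schedule()
d, p = (1, -1), (1, 1)
print(s)                      # (9, 1)
print(dot(p, d), dot(s, d))   # 0 8  -> feasible, HUE = 1/8
```

The result matches the slide's choice of s^T = [9 1], and s^T d = 8 shows that each PE is busy only one cycle in eight.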
Conclusion
Systolic architecture
Massively parallel processing with limited I/O communication with the host computer
Suitable for many regular iterative operations
Design methodology
Map an N-dimensional DG to an (N-1)-dimensional space-time representation
Three critical vectors need to be determined:
Projection vector
Processor space vector
Scheduling vector