
VLSI Digital Signal Processing Systems

Systolic Architecture Design

Lan-Da Van (范倫達), Ph. D.

Department of Computer Science

National Chiao Tung University

Taiwan, R.O.C.

Fall, 2010

[email protected]

http://www.cs.nctu.tw/~ldvan/


Lan-Da Van VLSI-DSP-7-2

Outline

Introduction

Systolic Array Design Methodology

FIR Systolic Arrays

Selection of Scheduling Vector

Conclusion



Systolic Architecture

What is a systolic architecture (also called a systolic array)?

A network of processing elements (PEs) that rhythmically compute and pass data through the system.

Used as a coprocessor in combination with a host computer. Its behavior is analogous to the flow of blood through the heart, hence the name systolic.



Characteristics of Systolic Arrays

Synchronization

Modularity

Regularity

Locality

Finite Connection

Parallel/Pipeline

Extendibility

Some relaxations are introduced to increase the utility of systolic arrays:

Neighbor interconnections (near, but not nearest)

Data broadcast operations

Different PEs, especially at the boundaries



Outline

Introduction

Systolic Array Design Methodology

FIR Systolic Arrays

Selection of Scheduling Vector

Conclusion




Systolic Array Design Methodology

Represent the algorithm as a dependence graph (DG)

Apply the projection, processor, and scheduling vectors (space-time representation)

Edge mapping

Construct the final systolic architecture



Design Methodology: Basic Vectors

Projection vector $d^T = [d_1\ d_2]$: determines how the DG is compressed. Two nodes that are displaced by $d$ or by a multiple of $d$ are executed by the same processor.

Processor space vector $p^T = [p_1\ p_2]$: any node with index $I^T = [i, j]$ is executed by processor $p^T I$.

Scheduling vector $s^T = [s_1\ s_2]$: any node with index $I^T = [i, j]$ is executed at time $s^T I$.

Hardware utilization efficiency: $HUE = 1/|s^T d|$. This is because two tasks executed by the same processor are spaced $|s^T d|$ time units apart.

Feasibility constraints:

The processor space vector and the projection vector must be orthogonal to each other, that is, $p^T d = 0$. If nodes A and B differ by the projection vector, i.e., $I_A - I_B = d$, then they must be executed by the same processor: $p^T I_A = p^T I_B \Rightarrow p^T (I_A - I_B) = 0 \Rightarrow p^T d = 0$.

If A and B are mapped to the same processor, then they cannot be executed at the same time: $s^T I_A \neq s^T I_B \Rightarrow s^T d \neq 0$.

Edge mapping: if an edge $e$ exists in the DG, then an edge $p^T e$ exists in the systolic array with $s^T e$ delays.
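The definitions on this slide reduce to a few integer dot products. The following is a minimal Python sketch, not part of the original slides. The helper names `dot`, `is_feasible`, `hue`, and `map_edge` are hypothetical, and the sample vectors are the ones used for the FIR example later in these notes.

```python
from fractions import Fraction


def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def is_feasible(p, s, d):
    """Check the two feasibility constraints: p^T d = 0 and s^T d != 0."""
    return dot(p, d) == 0 and dot(s, d) != 0


def hue(s, d):
    """Hardware utilization efficiency, HUE = 1 / |s^T d|."""
    return Fraction(1, abs(dot(s, d)))


def map_edge(p, s, e):
    """Map a DG edge e to (array edge p^T e, number of delays s^T e)."""
    return dot(p, e), dot(s, e)


# Sample vectors (taken from the FIR example): p^T = [0 1], s^T = [1 0], d^T = [1 0].
p, s, d = (0, 1), (1, 0), (1, 0)
print(is_feasible(p, s, d))     # True
print(hue(s, d))                # 1
print(map_edge(p, s, (1, -1)))  # (-1, 1)
```

A scheduling vector with $s^T d = 0$ is rejected by `is_feasible`, so `hue` should only be called on feasible choices.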



Space to Space-Time Representation

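Space-time representation

Interpret one of the spatial dimensions as the temporal dimension.

$$\begin{bmatrix} i' \\ j' \\ t' \end{bmatrix} = T \begin{bmatrix} i \\ j \\ t \end{bmatrix} = \begin{bmatrix} 0^T & 1 \\ p^T & 0 \\ s^T & 0 \end{bmatrix} \begin{bmatrix} i \\ j \\ t \end{bmatrix}$$

Here $0^T = [0\ 0]$, so $j' = p^T I$ and $t' = s^T I$ with $I = [i\ j]^T$.

$j'$: processor axis, $t'$: scheduling time instance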



Outline

Introduction

Systolic Array Design Methodology

FIR Systolic Arrays

Selection of Scheduling Vector

Matrix-Matrix Multiplication and 2D Systolic Array Design

Systolic Design for Space Representations Containing Delays

Conclusion











DG of FIR Filter

Dependence Graph (DG)

Ex: 3-tap FIR filter: $y(n) = w_0 x(n) + w_1 x(n-1) + w_2 x(n-2)$

[Figure: dependence graph of the FIR filter, with axes i and j]
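As a cross-check of the dependence graph, the sketch below evaluates the filter node by node. It assumes the indexing implied by the edge directions used in the edge mapping slides: node (i, j) multiplies weight w_j by input x(i) and adds the product to output y(i + j). The function names and test values are hypothetical.

```python
def fir_from_dg(w, x):
    """Evaluate y(n) = sum_j w[j] * x[n - j] by visiting DG nodes (i, j)."""
    y = [0] * len(x)
    for i in range(len(x)):        # i selects the input sample x(i)
        for j in range(len(w)):    # j selects the weight w_j
            n = i + j              # results accumulate along i + j = n
            if n < len(x):
                y[n] += w[j] * x[i]
    return y


def fir_direct(w, x):
    """Reference: direct evaluation of the FIR sum, with x(n) = 0 for n < 0."""
    return [sum(w[k] * x[n - k] for k in range(len(w)) if n - k >= 0)
            for n in range(len(x))]


w = [1, 2, 3]        # w0, w1, w2 (arbitrary test values)
x = [1, 0, 0, 1, 5]  # arbitrary input samples
print(fir_from_dg(w, x) == fir_direct(w, x))  # True
```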











Applying Projection and Scheduling (1/2)

Processor vector: $p^T = [0\ 1]$

Projection vector: $d^T = [1\ 0]$

Scheduling vector: $s^T = [1\ 0]$

Part of the DG, with each node $I = [i\ j]^T$ mapped to processor $p^T I$ and time $s^T I$:

Node $I$        Processor $p^T I$    Time $s^T I$
$[0\ 0]^T$      0                    0
$[1\ 0]^T$      0                    1
$[0\ 1]^T$      1                    0
$[1\ 1]^T$      1                    1
$[0\ 2]^T$      2                    0
$[1\ 2]^T$      2                    1

[Figure: applying the vectors collapses each row j of the DG onto one processor (processor 0, 1, 2), giving the signal flow graph (SFG) with PE0, PE1, and PE2]
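The node-by-node mapping on this slide can be reproduced mechanically. The sketch below is an illustration only, with hypothetical helper names. It maps the six DG nodes shown above to (processor, time) pairs using $p^T = [0\ 1]$ and $s^T = [1\ 0]$, and checks that no processor is asked to execute two nodes at the same time.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def space_time_map(p, s, nodes):
    """Return {node: (processor, time)} with processor = p^T I and time = s^T I."""
    return {node: (dot(p, node), dot(s, node)) for node in nodes}


p, s = (0, 1), (1, 0)
nodes = [(i, j) for j in range(3) for i in range(2)]  # part of the DG
mapping = space_time_map(p, s, nodes)

for node, (proc, time) in mapping.items():
    print(node, "-> PE", proc, "at time", time)

# A valid schedule never places two nodes on one processor in the same time slot.
conflict_free = len(set(mapping.values())) == len(nodes)
print(conflict_free)  # True
```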



Applying Projection and Scheduling (2/2)

[Figure: applying projection and scheduling transforms the dependence graph into its space-time representation]











Edge Mapping

[Figure: an edge $e$ in the DG, with axes i and j]

Edge mapping: $e' = p^T e$

Delay: $s^T e$



Edge Mapping

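Example: $p^T = [0\ 1]$, $s^T = [1\ 0]$, $d^T = [1\ 0]$

Input edge: $e_1 = [0\ 1]^T$

Weight edge: $e_2 = [1\ 0]^T$

Output edge: $e_3 = [1\ -1]^T$

Edge mapping:

$e_1' = p^T e_1 = [0\ 1][0\ 1]^T = 1$

$e_2' = p^T e_2 = [0\ 1][1\ 0]^T = 0$

$e_3' = p^T e_3 = [0\ 1][1\ -1]^T = -1$

Delays:

$s^T e_1 = [1\ 0][0\ 1]^T = 0$

$s^T e_2 = [1\ 0][1\ 0]^T = 1$

$s^T e_3 = [1\ 0][1\ -1]^T = 1$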



Edge mapping

Edge mapping table

Edge $e$                  $p^T e$    $s^T e$
Input   $[0\ 1]^T$         1          0
Weight  $[1\ 0]^T$         0          1
Output  $[1\ -1]^T$       -1          1

$p^T = [0\ 1]$, $s^T = [1\ 0]$, $d^T = [1\ 0]$
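The table can be regenerated from the vectors alone. The following Python sketch is illustrative, and the names `edge_table` and `describe` are hypothetical. It computes $p^T e$ and $s^T e$ for the three FIR edges and attaches a plain-language reading of each row.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def edge_table(p, s, edges):
    """Return {edge name: (p^T e, s^T e)} for a dict of DG edges."""
    return {name: (dot(p, e), dot(s, e)) for name, e in edges.items()}


def describe(direction, delays):
    """Informal reading of one table row."""
    if direction == 0:
        return "stays in its PE"
    if delays == 0:
        return "reaches the next PE with no delay (broadcast or fan-in)"
    return f"moves by {direction} PE per {delays} delay(s)"


FIR_EDGES = {"input": (0, 1), "weight": (1, 0), "output": (1, -1)}
table = edge_table((0, 1), (1, 0), FIR_EDGES)  # p^T = [0 1], s^T = [1 0]

for name, (direction, delays) in table.items():
    print(f"{name:7s} p^T e = {direction:2d}  s^T e = {delays}  {describe(direction, delays)}")
```

With these vectors the input row has zero delay, the weight row has zero displacement, and the output row moves one PE per delay, which is the combination the slides call the B1 design.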












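[Figure: final systolic array with PE0, PE1, and PE2. The input is broadcast to all PEs, weights 0 to 2 stay in their PEs through delay elements D, and partial results move from PE to PE through delay elements D toward the output.]

This is called the B1 design.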
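A cycle-by-cycle simulation helps confirm that this array computes the FIR filter. The sketch below is one possible software model of the B1 design, written under these assumptions: PE j holds weight w_j, every PE sees the same input sample in a cycle (broadcast), each partial result moves from PE j+1 to PE j through one delay, and the output is taken from PE0. The function names and test data are hypothetical.

```python
def fir_direct(w, x):
    """Reference: y(n) = sum_k w[k] * x[n - k], with x(n) = 0 for n < 0."""
    return [sum(w[k] * x[n - k] for k in range(len(w)) if n - k >= 0)
            for n in range(len(x))]


def fir_b1(w, x):
    """Simulate the B1 array: broadcast inputs, moving results, stationary weights."""
    taps = len(w)
    # regs[j] holds the delayed partial result arriving at PE j from PE j+1.
    regs = [0] * taps
    ys = []
    for sample in x:                       # one loop iteration = one clock cycle
        # Every PE multiplies the broadcast sample by its own stationary weight.
        outs = [w[j] * sample + regs[j] for j in range(taps)]
        ys.append(outs[0])                 # PE0 produces y(n)
        # Clock edge: each result moves one PE toward PE0; the last PE receives 0.
        regs = outs[1:] + [0]
    return ys


w = [1, 2, 3]
x = [1, 0, 0, 1, 5]
print(fir_b1(w, x))                       # [1, 2, 3, 1, 7]
print(fir_b1(w, x) == fir_direct(w, x))   # True
```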



Alternative Designs

B1 (Broadcast inputs, Move results, Weight Stay)

B2 (Broadcast inputs, Move Weight, Results stay)

F (Fan-in results, Move inputs, Weight stay)

R1 (Results stay, Inputs and Weight move in opposite directions)

R2 and Dual R2 (Results stay, Inputs and Weights move in the same direction but at different speeds)

W1 (Weights stay, Inputs and Results move in opposite directions)

W2 and Dual W2 (Weights stay, Inputs and Results move in same direction but at different speeds)

Relating systolic designs using transformations



B2 – Broadcast Inputs, Move Weight, Results Stay

Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         1          1
input   $[0\ 1]^T$         1          0
result  $[1\ -1]^T$        0          1

$d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ 0]$

$HUE = 1/|s^T d| = 1$



F - Fan-in Results, Move Inputs, Weight Stay

Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         0          1
input   $[0\ 1]^T$         1          1
result  $[1\ -1]^T$       -1          0

$d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ 1]$

$HUE = 1/|s^T d| = 1$



R1 - Results Stay, Inputs and Weight Move in Opposite Directions

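Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         1          1
input   $[0\ -1]^T$       -1          1
result  $[1\ -1]^T$        0          2

The input edge is reversed, from $[0\ 1]^T$ to $[0\ -1]^T$, so that its delay $s^T e$ is positive.

$d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ -1]$

$HUE = 1/|s^T d| = 1/2$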



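R2 and Dual R2 - Results Stay, Inputs and Weights Move in the Same Direction but at Different Speeds

R2

Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         1          2
input   $[0\ 1]^T$         1          1
result  $[1\ -1]^T$        0          1

$d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [2\ 1]$

$HUE = 1/|s^T d| = 1$

Dual R2

Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         1          1
input   $[0\ 1]^T$         1          2
result  $[-1\ 1]^T$        0          1

The result edge is reversed, from $[1\ -1]^T$ to $[-1\ 1]^T$, so that its delay $s^T e$ is positive.

$d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ 2]$

$HUE = 1/|s^T d| = 1$

[Figure: R2 and Dual R2 arrays with PE0, PE1, and PE2. Results stay in each PE with one delay D. In R2 the weights pass through 2D and the inputs through D between PEs; in Dual R2 the weights pass through D and the inputs through 2D.]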



W1 – Weights Stay, Inputs and Results Move in Opposite Directions

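Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         0          2
input   $[0\ 1]^T$         1          1
result  $[1\ -1]^T$       -1          1

$d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [2\ 1]$

$HUE = 1/|s^T d| = 1/2$

[Figure: W1 array with PE0, PE1, and PE2. Each weight stays in its PE with 2D; inputs and results move in opposite directions, each through one delay D between PEs.]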



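W2 and Dual W2 - Weights Stay, Inputs and Results Move in the Same Direction but at Different Speeds

W2

Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         0          1
input   $[0\ 1]^T$         1          2
result  $[-1\ 1]^T$        1          1

The result edge is reversed, from $[1\ -1]^T$ to $[-1\ 1]^T$, so that its delay $s^T e$ is positive.

$d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ 2]$

$HUE = 1/|s^T d| = 1$

Dual W2

Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         0          1
input   $[0\ -1]^T$       -1          1
result  $[1\ -1]^T$       -1          2

The input edge is reversed, from $[0\ 1]^T$ to $[0\ -1]^T$, so that its delay $s^T e$ is positive.

$d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ -1]$

$HUE = 1/|s^T d| = 1$

[Figure: W2 and Dual W2 arrays with PE0, PE1, and PE2. Each weight stays in its PE with one delay D. In W2 the inputs pass through 2D and the results through D between PEs; in Dual W2 the inputs pass through D and the results through 2D.]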
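All nine FIR designs in this section follow from the same three DG edges and different choices of $d$, $p$, and $s$. The sketch below tabulates them in one pass. It is an illustration rather than the author's procedure: the names `DESIGNS` and `summarize` are hypothetical, and the rule that any edge with a negative $s^T e$ is automatically reversed is an assumption of this sketch, standing in for the edge reversal and associativity transformations named in the slides.

```python
from fractions import Fraction


def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


BASE_EDGES = {"wt": (1, 0), "input": (0, 1), "result": (1, -1)}

DESIGNS = {
    "B1":      {"d": (1, 0),  "p": (0, 1), "s": (1, 0)},
    "B2":      {"d": (1, -1), "p": (1, 1), "s": (1, 0)},
    "F":       {"d": (1, 0),  "p": (0, 1), "s": (1, 1)},
    "R1":      {"d": (1, -1), "p": (1, 1), "s": (1, -1)},
    "R2":      {"d": (1, -1), "p": (1, 1), "s": (2, 1)},
    "Dual R2": {"d": (1, -1), "p": (1, 1), "s": (1, 2)},
    "W1":      {"d": (1, 0),  "p": (0, 1), "s": (2, 1)},
    "W2":      {"d": (1, 0),  "p": (0, 1), "s": (1, 2)},
    "Dual W2": {"d": (1, 0),  "p": (0, 1), "s": (1, -1)},
}


def summarize(design):
    """Return (HUE, {edge name: (edge used, p^T e, s^T e)}) for one design."""
    p, s, d = design["p"], design["s"], design["d"]
    if dot(p, d) != 0 or dot(s, d) == 0:
        raise ValueError("infeasible choice of p, s, d")
    rows = {}
    for name, e in BASE_EDGES.items():
        if dot(s, e) < 0:                 # assumption: reverse edges with negative delay
            e = tuple(-c for c in e)
        rows[name] = (e, dot(p, e), dot(s, e))
    return Fraction(1, abs(dot(s, d))), rows


for label, design in DESIGNS.items():
    efficiency, rows = summarize(design)
    print(label, "HUE =", efficiency, rows)
```

Only R1 and W1 come out with HUE = 1/2; the other seven designs have HUE = 1.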



Relating Systolic Designs Using Transformations

Designs with the same projection vector and processor space vector but different scheduling vectors can be derived from one another using transformations:

Edge reversal: reverse an edge direction in the DG when there are no precedence constraints

Associativity: when accumulating, (a+b)+c = a+(b+c)

Slow-down

Retiming

Pipelining



Cutset Retiming Transformation

[Figure: designs F and B1 are related by cutset retiming]



Outline

Introduction

Systolic Array Design Methodology

FIR Systolic Arrays

Selection of Scheduling Vector

Conclusion



Scheduling Inequalities (1/3)

Based on the selected scheduling vector $s^T$, the projection vector $d$ and the processor space vector $p^T$ can be selected, subject to the feasibility constraints

$p^T (I_A - I_B) = 0 \Rightarrow p^T d = 0$

$s^T I_A \neq s^T I_B \Rightarrow s^T d \neq 0$

Consider the dependence relation X -> Y, where $I_x = [i_x\ j_x]^T$ and $I_y = [i_y\ j_y]^T$ are the indices of node X and node Y, respectively. The scheduling inequality for this dependence is defined as

$S_y \geq S_x + T_x$    Eq. (1)

where $T_x$ is the time to compute node X and $S_x$, $S_y$ are the scheduling times for nodes X and Y, respectively.

Scheduling Inequalities (2/3)

Linear scheduling

$S_x = s^T I_x = [s_1\ s_2][i_x\ j_x]^T$

$S_y = s^T I_y = [s_1\ s_2][i_y\ j_y]^T$    Eq. (3)

Affine scheduling

$S_x = s^T I_x + \gamma_x = [s_1\ s_2][i_x\ j_x]^T + \gamma_x$

$S_y = s^T I_y + \gamma_y = [s_1\ s_2][i_y\ j_y]^T + \gamma_y$    Eq. (2)




Scheduling Inequalities (3/3)

Define the edge from node X to node Y as

$e_{x \to y} = I_y - I_x$

Then, from Eqs. (1) & (2), the scheduling inequality becomes

$s^T e_{x \to y} + \gamma_y - \gamma_x \geq T_x$

Hence the selection of the scheduling vector consists of two steps:

Capture all the fundamental edges. The reduced dependence graph (RDG) is used to capture the fundamental edges, and the regular iterative algorithm (RIA) description of the corresponding problem is used to construct the RDG.

Construct the scheduling inequalities and solve them for a feasible $s^T$.
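Checking a candidate scheduling vector against a set of edges is a direct application of the inequality $s^T e_{x \to y} + \gamma_y - \gamma_x \geq T_x$. The sketch below shows one way to code that check. The function name `satisfies`, the tuple layout of an edge, and the unit computation times in the sample data are all assumptions made for illustration, not values from the slides.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def satisfies(s, edges, gamma=None):
    """Check s^T e + gamma[dst] - gamma[src] >= T_src for every edge.

    edges: iterable of (src, dst, e, t_src), where e = I_dst - I_src and
           t_src is the time needed to compute the source node.
    gamma: optional {node name: offset} for affine scheduling; missing
           entries default to 0, which gives linear scheduling.
    """
    gamma = gamma or {}
    return all(
        dot(s, e) + gamma.get(dst, 0) - gamma.get(src, 0) >= t_src
        for src, dst, e, t_src in edges
    )


# Hypothetical sample: FIR-style edges with an assumed unit time per node.
edges = [
    ("X", "X", (0, 1), 1),
    ("W", "W", (1, 0), 1),
    ("Y", "Y", (1, -1), 1),
]
print(satisfies((2, 1), edges))  # True:  1 >= 1, 2 >= 1, 1 >= 1
print(satisfies((1, 0), edges))  # False: the X -> X edge gives 0 >= 1
```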



Regular Iterative Algorithm (RIA)

The regular iterative algorithm (RIA) description is used to construct the reduced dependence graph (RDG).

The RIA has two standard forms:

The RIA is in standard input RIA form if the indices of the inputs are the same for all equations.

The RIA is in standard output RIA form if the indices of the outputs are the same for all equations.

FIR example:

$W(i+1, j) = W(i, j)$

$X(i, j+1) = X(i, j)$

$Y(i+1, j-1) = Y(i, j) + W(i+1, j-1) X(i+1, j-1)$

Output RIA Form:

$W(i, j) = W(i-1, j)$

$X(i, j) = X(i, j-1)$

$Y(i, j) = Y(i-1, j+1) + W(i, j) X(i, j)$



Scheduling Vector and Systolic Array Design Using RDG

Construct the scheduling inequalities using the RDG

Determine the scheduling vector using the scheduling inequalities

Perform the systolic mapping using the scheduling vector

Because of its generality, this formulation can accommodate different computation times for the various operations.



Example 7.4.1 (1/4)

There are 5 edges in the above RDG.



Example 7.4.1 (2/4)

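[Figure: reduced dependence graph (RDG)]

Scheduling inequalities for the 5 edges:

W -> Y: $e = [0\ 0]^T$, $\gamma_y - \gamma_w \geq 0$

X -> X: $e = [0\ 1]^T$, $s_2 + \gamma_x - \gamma_x \geq 1$

W -> W: $e = [1\ 0]^T$, $s_1 + \gamma_w - \gamma_w \geq 1$

X -> Y: $e = [0\ 0]^T$, $\gamma_y - \gamma_x \geq 0$

Y -> Y: $e = [1\ -1]^T$, $s_1 - s_2 + \gamma_y - \gamma_y \geq 5 + 2 + 1$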



Example 7.4.1 (3/4)



Example 7.4.1 (4/4)

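Linear scheduling

$s_1 \geq 1$, $s_2 \geq 1$, $s_1 - s_2 \geq 8$

$s_2 = 1$, $s_1 = 9$ $\Rightarrow$ $s^T = [9\ 1]$

$d^T = [1\ -1]$, $p^T = [1\ 1]$

Edge $e$                  $p^T e$    $s^T e$
wt      $[1\ 0]^T$         1          9
i/p     $[0\ 1]^T$         1          1
result  $[1\ -1]^T$        0          8

[Figure: systolic array architecture. The inputs X move through D and the weights W through 9D between PEs; the results stay in each PE with 8D.]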
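The linear-scheduling result of this example can be reproduced with a small brute-force search. The sketch below is illustrative only: the search bound `limit` and the tie-breaking rule (smallest $|s_1| + |s_2|$) are assumptions of this sketch, since the slides only require a feasible $s^T$. It then rebuilds the edge table and evaluates $HUE = 1/|s^T d|$ for the chosen vectors.

```python
from fractions import Fraction
from itertools import product


def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


# Linear-scheduling inequalities of the example, as (edge e, lower bound on s^T e).
INEQUALITIES = [((0, 1), 1), ((1, 0), 1), ((1, -1), 8)]


def feasible(s):
    """True if s satisfies every inequality s^T e >= bound."""
    return all(dot(s, e) >= bound for e, bound in INEQUALITIES)


def smallest_schedule(limit=20):
    """Feasible s with the smallest |s1| + |s2| inside [-limit, limit]^2."""
    candidates = [s for s in product(range(-limit, limit + 1), repeat=2) if feasible(s)]
    return min(candidates, key=lambda s: (abs(s[0]) + abs(s[1]), s))


s = smallest_schedule()
d, p = (1, -1), (1, 1)
edges = {"wt": (1, 0), "i/p": (0, 1), "result": (1, -1)}
table = {name: (dot(p, e), dot(s, e)) for name, e in edges.items()}

print(s)                               # (9, 1)
print(table)                           # {'wt': (1, 9), 'i/p': (1, 1), 'result': (0, 8)}
print(Fraction(1, abs(dot(s, d))))     # 1/8
```

The HUE of 1/8 is a consequence of the formula applied to these vectors; it reflects the long result delay forced by the inequality $s_1 - s_2 \geq 8$.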



Conclusion

Systolic architecture

Massively parallel processing with limited I/O communication with the host computer

Suitable for many regular iterative operations

Design methodology

Map an N-dimensional DG to an (N-1)-dimensional space-time representation

Three critical vectors need to be determined:

Projection vector

Processor space vector

Scheduling vector
