L12 : Lower Power High Level Synthesis(3)

1999. 8

Sungkyunkwan University, Prof. Jun-Dong Cho http://vada.skku.ac.kr

Matrix-vector product algorithm

Retiming

Flip-flop insertion to minimize hazard activity: moving a flip-flop in a circuit

Exploiting spatial locality for interconnect power reduction

[Figure labels: Global, Local, Adder1, Adder2]

Balancing maximal time-sharing and fully-parallel implementation

A fourth-order parallel-form IIR filter: (a) local assignment (2 global transfers), (b) non-local assignment (20 global transfers)

Retiming/pipelining for Critical path

[Figure: cascade of biquad and first-order filter sections from in to out, built from adders (+) and delays (D). Before: minimal-area time-multiplexed solution (meeting the throughput constraint). After retiming and pipelining: the "fastest" solution, with delay elements (D) inserted between sections.]

Supply voltage can be reduced keeping throughput fixed.

Effective Resource Utilization

[Figure: flow graph with adders (+), delays (D), and numbered operations, shown before and after retiming, together with a cycle-by-cycle table of the operations assigned to the multipliers and the adder in each version.]

Can reduce interconnect capacitance.

Hazard propagation elimination by clocked sampling

By sampling a steady-state signal at a register input, no more glitches are propagated through the next stage of combinational logic.

Regularity

• Common patterns enable the design of a less complex architecture and therefore a simpler interconnect structure (muxes, buffers, and buses). Regular designs often have less control hardware.

[Figure: the same data-flow graph of additions (+), multiplications (*), and shifts (<<) bound to adders A1 and A2, multiplier M1, and shifter S1 in two ways; the second binding requires MUXes at the adder inputs.]

(a) Regular module assignment

(b) Irregular module assignment

Module Selection

• Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (i.e., where to put registers), such that minimal hardware cost is obtained under the given timing and throughput constraints.

• Full pipelining can be ineffective because of mismatches between the clock period and the execution times of the operators. Performing operations in sequence without intermediate buffering can reduce the critical path.

• Cluster operations into non-pipelined hardware modules such that the reusability of these modules over the complete computational graph is maximized.

• During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.

Estimation

• Estimate min and max bounds on the required resources to

– delimit the design space (min bounds serve as an initial solution)

– serve as entries in a resource utilization table that guides the transformation, assignment, and scheduling operations

• Max bound on execution time is tmax: topological ordering of the DFG using ASAP and ALAP

• Minimum bound on the number of resources for each resource class:

$N_{R_i} \ge \lceil O_{R_i} \, d_{R_i} / t_{max} \rceil$

where NRi is the number of resources of class Ri, dRi is the duration of a single operation, and ORi is the number of operations.
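A minimal sketch of such a lower bound, assuming the common form ceil(ORi × dRi / tmax): all operations of a class must fit into the available time on the allocated units. The function name and the operation counts below are hypothetical, chosen only for illustration.

```python
import math


def min_resources(op_counts, durations, t_max):
    """Lower bound on the number of units per resource class.

    op_counts[r]  -- number of operations of class r (O_Ri)
    durations[r]  -- duration of one operation of class r, in cycles (d_Ri)
    t_max         -- maximum allowed execution time, in cycles
    """
    return {r: math.ceil(op_counts[r] * durations[r] / t_max) for r in op_counts}


# Hypothetical workload: 6 two-cycle multiplications and 5 one-cycle additions.
bounds = min_resources({"mul": 6, "add": 5}, {"mul": 2, "add": 1}, t_max=4)
print(bounds)  # {'mul': 3, 'add': 2}
```

The bound ignores data dependencies, so it is only a starting point: the real schedule may need more units than this.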

Exploring the Design Space

• Find the minimal-area solution subject to the timing constraints

• Checking the critical paths determines whether the proposed graph violates the timing constraints. If so, retiming, pipelining, and tree height reduction can be applied.

• After an acceptable graph is obtained, the resource allocation process is initiated.

– change the available hardware (FU's, registers, busses)

– redistribute the time allocation over the sub-graphs

– transform the graph to reduce the hardware requirements.

• Use a rejectionless probabilistic iterative search technique (a variant of Simulated Annealing), where moves are always accepted. This approach reduces computational complexity and gives faster convergence.

Data path Synthesis

Scheduling and Binding

• The scheduling task selects the control step in which a given operation will happen, i.e., assigns each operation to an execution cycle

• Sharing: bind a resource to more than one operation

– Operations must not execute concurrently

• Graph is scheduled hierarchically in a bottom-up fashion

• Power tradeoffs

– Shorter schedules enable supply voltage (Vdd) scaling

– Schedule directly impacts resource sharing

– Energy consumption depends on what the previous instruction was

– Reordering to minimize the switching on the control path

• Clock selection

– Eliminate slacks

– Choose optimal system clock period

ASAP Scheduling

• Algorithm • HAL Example
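As an illustration of ASAP scheduling, the sketch below places every operation in the earliest control step permitted by its predecessors. It assumes unit-delay operations and no resource limits; the small graph is a hypothetical example, not the HAL benchmark shown on the slide.

```python
def asap_schedule(preds):
    """Return the earliest control step (1-based) for each unit-delay operation."""
    step = {}

    def visit(op):
        if op not in step:
            # An operation starts one step after its latest predecessor.
            step[op] = 1 + max((visit(p) for p in preds[op]), default=0)
        return step[op]

    for op in preds:
        visit(op)
    return step


# Hypothetical DFG: operation -> list of predecessor operations.
PREDS = {"m1": [], "m2": [], "m3": ["m1", "m2"], "s1": ["m3"], "a1": [], "c1": ["a1"]}
asap = asap_schedule(PREDS)
print(asap)  # {'m1': 1, 'm2': 1, 'm3': 2, 's1': 3, 'a1': 1, 'c1': 2}
```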

ALAP Scheduling

• Algorithm • HAL Example
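ALAP scheduling is the mirror image of ASAP: each operation is pushed to the latest control step that still lets its successors finish within a latency bound. The sketch below assumes unit-delay operations and a hypothetical graph (not the HAL benchmark); `latency` is the number of available control steps.

```python
def alap_schedule(preds, latency):
    """Return the latest control step (1-based) for each unit-delay operation."""
    # Build the successor lists from the predecessor lists.
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)

    step = {}

    def visit(op):
        if op not in step:
            # An operation must finish one step before its earliest successor;
            # sink operations are placed in the last step.
            step[op] = min((visit(s) for s in succs[op]), default=latency + 1) - 1
        return step[op]

    for op in preds:
        visit(op)
    return step


# Hypothetical DFG: operation -> list of predecessor operations.
PREDS = {"m1": [], "m2": [], "m3": ["m1", "m2"], "s1": ["m3"], "a1": [], "c1": ["a1"]}
alap = alap_schedule(PREDS, latency=3)
print(alap)  # {'m1': 1, 'm2': 1, 'm3': 2, 's1': 3, 'a1': 2, 'c1': 3}
```

The difference between an operation's ALAP and ASAP steps is its mobility: here `a1` and `c1` can each slide by one step, while the chain `m1`, `m3`, `s1` is critical.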

Force Directed Scheduling

Used as a priority function. Force is related to concurrency: sort operations for least force. Mechanical analogy:

Force = constant × displacement, where constant = operation-type distribution and displacement = change in probability.
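The analogy can be made concrete with a small sketch. It assumes the usual formulation in which each operation is equally likely to land in any step of its time frame, the distribution graph sums those probabilities per step, and the self-force of a tentative assignment is the sum of distribution × change in probability over the frame. Predecessor and successor forces are omitted, and the time frames are hypothetical.

```python
def distribution_graph(frames):
    """Sum, per control step, of the probabilities that each operation occupies it."""
    dg = {}
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)  # uniform probability over the time frame
        for s in range(lo, hi + 1):
            dg[s] = dg.get(s, 0.0) + p
    return dg


def self_force(frames, op, step, dg):
    """Force of fixing `op` at `step`: sum of distribution x change in probability."""
    lo, hi = frames[op]
    p = 1.0 / (hi - lo + 1)
    return sum(dg[s] * ((1.0 if s == step else 0.0) - p) for s in range(lo, hi + 1))


# Hypothetical time frames (ASAP step, ALAP step) for operations of one type.
FRAMES = {"m1": (1, 1), "m2": (1, 1), "m3": (2, 2), "m4": (1, 2), "m5": (2, 3)}
dg = distribution_graph(FRAMES)
forces = {s: self_force(FRAMES, "m4", s, dg) for s in (1, 2)}
print(dg)      # {1: 2.5, 2: 2.0, 3: 0.5}
print(forces)  # {1: 0.25, 2: -0.25}
```

A negative force means the assignment lowers concurrency, so in this example `m4` would be placed in step 2 rather than the already crowded step 1.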

Force Directed Scheduling

Example : Operation V6

Force-Directed Scheduling

• Algorithm (Paulin)

Force-Directed Scheduling Example

• Probability of scheduling operations into control steps

• Probability of scheduling operations into control steps after operation o3 is scheduled to step s2

• Operator cost for multiplications in a

• Operator cost for multiplications in c

List Scheduling

• The scheduled DFG

• DFG with mobility labeling (inside <>)

• ready operation list/resource constraint
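A minimal sketch of resource-constrained list scheduling, assuming unit-delay operations and mobility as the priority (lower mobility is more urgent). In each control step the ready operations are sorted by priority and started while units of their type remain. The graph, mobilities, and resource limits are hypothetical, and every limit is assumed to be at least 1.

```python
def list_schedule(preds, op_type, limits, priority):
    """Assign a control step to every operation under per-type resource limits."""
    scheduled = {}
    step = 0
    while len(scheduled) < len(preds):
        step += 1
        # Ready list: unscheduled operations whose predecessors finished earlier.
        ready = [op for op in preds
                 if op not in scheduled
                 and all(scheduled.get(p, step) < step for p in preds[op])]
        ready.sort(key=lambda op: (priority[op], op))  # most urgent first
        used = {}
        for op in ready:
            kind = op_type[op]
            if used.get(kind, 0) < limits[kind]:
                used[kind] = used.get(kind, 0) + 1
                scheduled[op] = step
    return scheduled


# Hypothetical DFG, operation types, and mobilities.
PREDS = {"m1": [], "m2": [], "m3": ["m1", "m2"], "s1": ["m3"], "m4": [], "m5": ["m4"]}
TYPES = {"m1": "mul", "m2": "mul", "m3": "mul", "m4": "mul", "m5": "mul", "s1": "alu"}
MOBILITY = {"m1": 0, "m2": 0, "m3": 0, "s1": 0, "m4": 1, "m5": 1}
schedule = list_schedule(PREDS, TYPES, {"mul": 2, "alu": 1}, MOBILITY)
print(schedule)  # {'m1': 1, 'm2': 1, 'm3': 2, 'm4': 2, 's1': 3, 'm5': 3}
```

With two multipliers, `m4` is deferred to step 2 because the zero-mobility operations `m1` and `m2` claim both units in step 1.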

Static-List Scheduling

• DFG

• Partial schedule of five nodes

• Priority list

The final schedule

Divide-and-Conquer to minimize the power consumption

• Decompose a computation into strongly connected components (SCCs);

• Merge any adjacent trivial SCCs into a sub part;

• Use pipelining to isolate the sub parts;

• For each sub part

– Minimize the number of delays using retiming;

– If the sub part is linear, apply optimal unfolding;

– Else, apply unfolding after the isolation of nonlinear operations;

• Merge linear sub parts to further optimize;

• Schedule merged sub parts to minimize memory usage.

Choosing Optimal Clock Period

SCC decomposition step

The SCCs are found using the standard depth-first-search-based algorithm [Tarjan, 1972], which has low-order polynomial (in fact linear) time complexity. For any pair of operations A and B within an SCC, there exist both a path from A to B and a path from B to A. The graph formed by all the SCCs is acyclic. Thus, the SCCs can be isolated from each other using pipeline delays, which enables us to optimize each SCC separately.
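For reference, a compact sketch of Tarjan's depth-first SCC algorithm. The recursive formulation and the example graph are illustrative assumptions; a large computation graph would call for an iterative version to avoid Python's recursion limit.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph.

    graph maps each node to a list of its successors.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []

    def strongconnect(v):
        index[v] = low[v] = len(index)  # discovery order
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs


# Hypothetical computation graph with two feedback loops and one trivial SCC.
GRAPH = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": ["a"]}
components = tarjan_scc(GRAPH)
print(components)  # [['d', 'e'], ['a', 'b', 'c'], ['f']]
```

The components come out in reverse topological order of the acyclic SCC graph, which is a convenient order for deciding where pipeline delays go between them.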

Identifying SCC

• The first step of the approach is to identify the computation's strongly connected components.

Choosing Optimal Clock Period

Supply Voltage Scaling

Lowering Vdd reduces energy, but increases delay.
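The trade-off can be illustrated with a first-order CMOS model, assumed here rather than taken from the slides: switching energy scales as Vdd², and gate delay scales roughly as Vdd / (Vdd − Vt)². The threshold voltage and supply values are hypothetical.

```python
def energy_ratio(vdd, vref):
    """Switching energy at vdd relative to vref (energy proportional to Vdd^2)."""
    return (vdd / vref) ** 2


def delay_ratio(vdd, vref, vt):
    """Gate delay at vdd relative to vref (delay proportional to Vdd / (Vdd - Vt)^2)."""
    def delay(v):
        return v / (v - vt) ** 2
    return delay(vdd) / delay(vref)


# Hypothetical numbers: scale from 5.0 V down to 3.3 V with Vt = 0.8 V.
e = energy_ratio(3.3, 5.0)
d = delay_ratio(3.3, 5.0, vt=0.8)
print(f"energy x{e:.2f}, delay x{d:.2f}")
```

Under these assumptions the energy per operation drops to under half while each operation takes nearly twice as long, which is why a shorter schedule or extra parallelism is needed to keep throughput fixed.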

Multiple Supply Voltages

Filter Example

Scheduling using shut-down: |a-b|

[Figure: alternative schedules for computing |a-b| with a comparator (>), two subtractors (-), and a Mux, over control steps 1 to 3.]
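A minimal sketch of the idea, assuming the usual reading of this example: if both subtractions run in parallel with the comparison, three units switch; if the comparison is scheduled first, its result can shut down the subtractor whose output is not needed, at the cost of an extra control step. The function names and the activity counts are illustrative assumptions.

```python
def abs_diff_parallel(a, b):
    """Comparator and both subtractors run together; the mux picks one result."""
    d1, d2 = a - b, b - a          # both subtractors are active
    result = d1 if a > b else d2   # comparator drives the mux
    return result, 3               # active units: 2 subtractors + 1 comparator


def abs_diff_shutdown(a, b):
    """Comparator runs first; only the selected subtractor is activated."""
    if a > b:                      # step 1: compare
        result = a - b             # step 2: one subtractor, the other is shut down
    else:
        result = b - a
    return result, 2               # active units: 1 comparator + 1 subtractor


print(abs_diff_parallel(3, 7))   # (4, 3)
print(abs_diff_shutdown(3, 7))   # (4, 2)
```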

Loop Scheduling

• Sequential Execution

• Partial loop unrolling

• Loop folding

Loop folding

• Reduce execution delay of a loop.

• Pipeline operations inside a loop.

• Overlap execution of operations.

• Need a prologue and epilogue.

• Use pipeline scheduling for loop graph model.

DFG Restructuring

• DFG2

• DFG2 after redundant operation insertion

Minimizing the bit transitions for constants during Scheduling

Control Synthesis

• Synthesize a circuit that:

– Executes scheduled operations.

– Provides synchronization.

– Supports iteration, branching, hierarchy, and interfaces.

Allocation

• Bind a resource to more than one operation.
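One simple way to share units among already scheduled operations is greedy interval partitioning, in the spirit of the left-edge algorithm: operations of one type are sorted by start step, and each is bound to the first unit that is free. This is a sketch under that assumption, not necessarily the binding method used in the slides, and the schedule below is hypothetical.

```python
def bind_operations(intervals):
    """Bind same-type operations to units so that no unit is used twice at once.

    intervals maps each operation to its (first_step, last_step), inclusive.
    Returns a mapping from operation to unit index.
    """
    busy_until = []  # last occupied control step of each allocated unit
    binding = {}
    ordered = sorted(intervals.items(), key=lambda kv: (kv[1][0], kv[1][1], kv[0]))
    for op, (start, end) in ordered:
        for unit, last in enumerate(busy_until):
            if last < start:           # unit is free before this operation starts
                busy_until[unit] = end
                binding[op] = unit
                break
        else:
            busy_until.append(end)     # no free unit: allocate a new one
            binding[op] = len(busy_until) - 1
    return binding


# Hypothetical schedule of multiplications: operation -> (first step, last step).
SCHEDULE = {"m1": (1, 1), "m2": (1, 1), "m3": (2, 2), "m4": (2, 3), "m5": (3, 3)}
binding = bind_operations(SCHEDULE)
print(binding)  # {'m1': 0, 'm2': 1, 'm3': 0, 'm4': 1, 'm5': 0}
```

Five multiplications end up on two multipliers, which matches the maximum number of multiplications active in any single control step.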

Optimum binding

Example