
  • Slide 1
  • Lower Power Design Guide, 1998.6.7, Jun-Dong Cho, VADA Lab., SungKyunKwan Univ. http://vlsicad.skku.ac.kr
  • Slide 2
  • Contents
    1. Introduction: trends in high-level low-power design
    2. Power Management: clock/cache/memory management
    3. Architecture-Level Design: architecture trade-offs, transformations
    4. RTL-Level Design: retiming, loop unrolling, clock selection, scheduling, resource sharing, register allocation
    5. Partitioning
    6. Logic-Level Design
    7. Circuit-Level Design
    8. Quarter-Submicron Layout Design: low-power clock designs
    9. CAD Tools
    10. References
  • Slide 3
  • 1. Introduction
  • Slide 4
  • Motivation
    - Portable mobile (= ubiquitous = nomadic) systems with limited room for heat sinks.
    - Lowering power at fixed performance: DSPs in modems and cellular phones.
    - Reliability: increasing power means increasing electromigration; a 40-year reliability guarantee is expected (the product life cycle in the telecommunications industry). Adding fans to reduce heat causes reliability to plummet.
    - Higher power leads to higher packaging costs: a 2-watt package can cost four times as much as a 1-watt package.
    - Myriad constraints: timing, power, testability, area, packaging, time-to-market.
    - Ad-hoc design: lacks a systematic process, so techniques are not universally applicable.
  • Slide 5
  • Power! Power! Power!
  • Slide 6
  • Power Dissipation in VLSIs
    [Chart: power breakdown among clock, logic, memory, and I/O for each chip]
    - MPU1: low-end microprocessor for embedded use
    - MPU2: high-end CPU with a large amount of cache
    - ASSP1: MPEG2 decoder
    - ASSP2: ATM switch
  • Slide 7
  • Current Design Issues in Low-Power Design
    - Offload energy-hungry functions to the network server: InfoPad (University of California, Berkeley), weight < 1 pound; 0.5 W (reflective color display) + 0.5 W (computation, communication, I/O support) = 1 W (compare: Alpha chip, 25 W; StrongARM, 215 MHz at 2.0 V, 0.3 W); runtime 50 hours; target: 100 MIPS/mW.
    - Deep submicron (0.35-0.18 µm) with low voltage for a portable full-motion-video terminal; 0.5 µm: 40 AA NiMH; 1 µm: 1 AA NiMH.
    - System-on-a-chip, to reduce external interconnect capacitances.
    - Power management: shut down idle units.
    - Power-optimization techniques in the software, architecture, logic/circuit, and layout phases to reduce operations, frequency, capacitance, and switching activity while maintaining the same throughput.
  • Slide 8
  • Battery Trends
  • Slide 9
  • Road-Map in Semiconductor Device Integration
  • Slide 10
  • Road-Map in Semiconductor Device Complexity
  • Slide 11
  • Power Components
    - Static: leakage current (…)
  • Slide 121
  • List Scheduling
    [Figures: the DFG with mobility labeling (mobility shown inside brackets); the ready-operation list under the resource constraint; the scheduled DFG]
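    The slide shows list scheduling only as figures; as a concrete reference, here is a minimal resource-constrained list scheduler in Python (my own sketch: a single operation type, unit delays, a DFG given as a predecessor map, and mobility = ALAP - ASAP as the priority, all assumptions rather than the slide's exact setup):

```python
# Minimal list scheduling: at each control step, schedule the ready
# operations with the least mobility, up to the resource limit.

def asap(dfg):
    """dfg: {node: [predecessors]}; returns earliest start steps."""
    times = {}
    def t(n):
        if n not in times:
            times[n] = 0 if not dfg[n] else 1 + max(t(p) for p in dfg[n])
        return times[n]
    for n in dfg:
        t(n)
    return times

def list_schedule(dfg, num_units):
    # ALAP = longest-path depth minus ASAP on the reversed graph
    rev = {n: [m for m in dfg if n in dfg[m]] for n in dfg}
    asap_t, rev_t = asap(dfg), asap(rev)
    depth = max(rev_t.values())
    mobility = {n: (depth - rev_t[n]) - asap_t[n] for n in dfg}

    done, schedule, step = set(), {}, 0
    while len(done) < len(dfg):
        ready = [n for n in dfg if n not in done
                 and all(p in done and schedule[p] < step for p in dfg[n])]
        ready.sort(key=lambda n: mobility[n])   # least mobility = most urgent
        for n in ready[:num_units]:             # respect the resource bound
            schedule[n] = step
            done.add(n)
        step += 1
    return schedule

dfg = {'a': [], 'b': [], 'c': ['a', 'b'], 'd': ['c'], 'e': ['b']}
print(list_schedule(dfg, num_units=2))  # {'a': 0, 'b': 0, 'c': 1, 'e': 1, 'd': 2}
```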
  • Slide 122
  • Static-List Scheduling
    [Figures: the DFG; the priority list; a partial schedule of five nodes; the final schedule]
  • Slide 123
  • Choosing Optimal Clock Period
  • Slide 124
  • Supply Voltage Scaling: lowering Vdd reduces energy but increases delay.
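    A standard first-order model (textbook material, not reproduced from the slide) quantifies this trade-off; α is the velocity-saturation index of the alpha-power law, between 1 and 2 for submicron devices:

```latex
E_{\text{switch}} = C_L\,V_{dd}^{2}
\qquad\qquad
t_d \;\propto\; \frac{C_L\,V_{dd}}{(V_{dd}-V_t)^{\alpha}},
\quad 1 < \alpha \le 2
```

    So halving Vdd cuts switching energy by roughly 4x but lengthens gate delay, which is why voltage scaling is usually paired with architectural parallelism or pipelining to recover throughput.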
  • Slide 125
  • Shut-down Scheduling: |a-b|
  • Slide 126
  • Loop Scheduling
    [Figures: sequential execution; partial loop unrolling; loop folding]
  • Slide 127
  • Loop Folding
    - Reduces the execution delay of a loop.
    - Pipelines operations inside the loop, overlapping the execution of operations from successive iterations.
    - Needs a prologue and an epilogue.
    - Uses pipeline scheduling on the loop graph model.
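    As a concrete illustration (my own example, not from the deck), folding a two-stage loop body t = f(x[i]); y[i] = g(t) overlaps stage g of iteration i with stage f of iteration i+1:

```python
# Loop folding / software pipelining of a two-stage loop body.
# Unfolded: for i: t = f(x[i]); y[i] = g(t)  -- two steps per iteration.
# Folded:   g of iteration i overlaps f of iteration i+1 -- ~one step each.

def folded_loop(x, f, g):
    n = len(x)
    y = [None] * n
    t = f(x[0])                  # prologue: fill the pipeline
    for i in range(n - 1):       # steady state: one result per step
        t_next = f(x[i + 1])     # stage f of iteration i+1 ...
        y[i] = g(t)              # ... overlaps stage g of iteration i
        t = t_next
    y[n - 1] = g(t)              # epilogue: drain the pipeline
    return y

print(folded_loop([1, 2, 3], f=lambda v: v * 2, g=lambda v: v + 1))  # [3, 5, 7]
```

    In hardware the two stages of the steady state execute concurrently in the same control step; in this sequential Python model the overlap shows up only as the prologue/epilogue structure.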
  • Slide 128
  • DFG Restructuring
    [Figures: DFG2; DFG2 after redundant-operation insertion]
  • Slide 129
  • Minimizing Bit Transitions for Constants during Scheduling
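    The slide gives only the title; as an illustrative sketch (my own, hypothetical), if several constants are fed over one bus in successive control steps, ordering them to minimize the total Hamming distance between consecutive words reduces switching activity:

```python
from itertools import permutations

def hamming(a, b):
    # number of bit positions that toggle between two bus words
    return bin(a ^ b).count('1')

def total_transitions(order):
    return sum(hamming(a, b) for a, b in zip(order, order[1:]))

consts = [0x0F, 0xF0, 0x0E, 0xF1]          # constants used in successive steps
best = min(permutations(consts), key=total_transitions)
print([hex(c) for c in best], total_transitions(best))
```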
  • Slide 130
  • Control Synthesis: synthesize a circuit that
    - executes the scheduled operations,
    - provides synchronization, and
    - supports iteration, branching, hierarchy, and interfaces.
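    A toy illustration (entirely hypothetical, not the deck's controller) of what the synthesized controller amounts to: a state machine that steps through the control steps and asserts the enables of the operations scheduled in each step:

```python
# Toy controller: one FSM state per control step; each state asserts the
# enables of the operations scheduled in that step, then advances.
SCHEDULE = {0: ['mul1', 'mul2'], 1: ['add1'], 2: ['store1']}

def controller(num_steps):
    state = 0
    while True:
        yield state, SCHEDULE.get(state, [])  # drive datapath control inputs
        state = (state + 1) % num_steps       # wrap around for iteration

fsm = controller(3)
for _ in range(4):
    print(next(fsm))   # (0, ['mul1','mul2']), (1, ['add1']), (2, ['store1']), (0, ...)
```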
  • Slide 131
  • Allocation: bind a resource to more than one operation.
  • Slide 132
  • Optimum Binding
  • Slide 133
  • Example
  • Slide 134
  • Resource Sharing
    - Parallel vs. time-shared buses (or execution units).
    - Resource sharing can destroy signal correlations and increase switching activity, so it should be restricted to operations that are strongly connected.
    - Map operations with correlated input signals to the same units (see the sketch below).
    - Regularity: repeated patterns of computation (e.g., (+, *), (*, *), (+, >)) simplify interconnect (buses, multiplexers, buffers).
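    A small experiment (hypothetical, not from the deck) makes the point: time-sharing one adder between two correlated input streams keeps operand-bus transitions low, while interleaving uncorrelated streams roughly doubles them:

```python
import random

def bus_transitions(seq):
    # total Hamming distance between consecutive words on a shared bus
    return sum(bin(a ^ b).count('1') for a, b in zip(seq, seq[1:]))

def interleave(a, b):
    # time-sharing one unit alternates the two operand streams on its bus
    return [w for pair in zip(a, b) for w in pair]

random.seed(0)
base = [random.randrange(256) for _ in range(1000)]
corr_a = base
corr_b = [v ^ 1 for v in base]                     # strongly correlated stream
uncorr = [random.randrange(256) for _ in range(1000)]

print("share correlated  :", bus_transitions(interleave(corr_a, corr_b)))
print("share uncorrelated:", bus_transitions(interleave(corr_a, uncorr)))
```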
  • Slide 135
  • Datapath Interconnections
    [Figures: multiplexer-oriented datapath; bus-oriented datapath]
  • Slide 136
  • Sequential Execution: example of three micro-operations in the same clock period.
  • Slide 137
  • Latch Insertion (out): latches inserted at the output ports of the functional units.
  • Slide 138
  • Latch Insertion (in/out): latches inserted at both the input and output ports of the functional units.
  • Slide 139
  • Overlapping Data Transfer (in): overlapping read and write data transfers.
  • Slide 140
  • Overlapping Data Transfer (in/out): overlapping data transfer with functional-unit execution.
  • Slide 141
  • Register Allocation Using Clique Partitioning
    [Figures: scheduled DFG; graph model; lifetime intervals of the variables; clique-partitioning solution]
  • Slide 142
  • Left-Edge Algorithm: register allocation using the left-edge algorithm.
  • Slide 143
  • Register Allocation: Left-Edge Algorithm
    [Figures: sorted variable lifetime intervals; five-register allocation result]
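    The left-edge algorithm is simple enough to state in a few lines; a minimal sketch (lifetimes are assumed to be half-open intervals [start, end), my own convention):

```python
# Left-edge register allocation: sort lifetime intervals by left edge,
# then greedily pack non-overlapping intervals into each register in turn.

def left_edge(intervals):
    """intervals: {var: (start, end)} with half-open lifetimes [start, end)."""
    pending = sorted(intervals.items(), key=lambda kv: kv[1][0])
    allocation, reg = {}, 0
    while pending:
        last_end, leftover = -1, []
        for var, (s, e) in pending:          # scan lifetimes left to right
            if s >= last_end:                # fits after the previous lifetime
                allocation[var] = reg
                last_end = e
            else:
                leftover.append((var, (s, e)))
        pending, reg = leftover, reg + 1     # open a new register
    return allocation

lifetimes = {'v1': (0, 2), 'v2': (1, 3), 'v3': (2, 5), 'v4': (3, 6), 'v5': (5, 7)}
print(left_edge(lifetimes))  # {'v1': 0, 'v3': 0, 'v5': 0, 'v2': 1, 'v4': 1}
```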
  • Slide 144
  • Register Allocation
    - Allocation binds registers and functional modules to the variables and operations in the CDFG and specifies the interconnection among modules and registers in terms of MUXes or buses.
    - Reduce capacitance during allocation by minimizing the number of functional modules, registers, and multiplexers.
    - A composite weight w.r.t. transition activity and capacitive load is annotated on the CDFG; find the edge with the highest composite weight and merge the two nodes it joins, i.e., map the corresponding variables to the same register.
    - Allocation continues, updating the composite weights, until no edges are left in the CDFG.
    - Set the maximum number of operations alive in any control step to one.
    - Sequence operations/variables to enhance signal correlations (a greedy sketch follows below).
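    A greedy sketch of this weighted merging (my own reading of the slide; the edge weights, conflict model, and update rule are assumptions):

```python
# Greedy register merging: repeatedly take the variable pair joined by the
# highest composite weight and merge their register groups, unless some
# pair of variables across the two groups has overlapping lifetimes.

def greedy_merge(weights, conflicts):
    """weights: {(u, v): w} for candidate pairs; conflicts: set of frozensets."""
    groups = {v: {v} for pair in weights for v in pair}  # one register per var
    edges = dict(weights)
    while edges:
        (u, v), _ = max(edges.items(), key=lambda kv: kv[1])
        del edges[(u, v)]
        gu, gv = groups[u], groups[v]
        if gu is gv:
            continue
        # merge only if no variable pair across the two groups conflicts
        if any(frozenset((a, b)) in conflicts for a in gu for b in gv):
            continue
        gu |= gv
        for x in gv:
            groups[x] = gu
    return {frozenset(g) for g in groups.values()}

w = {('a', 'b'): 5, ('b', 'c'): 3, ('a', 'c'): 1}
conflicts = {frozenset(('a', 'c'))}     # a and c are alive simultaneously
print(greedy_merge(w, conflicts))       # -> {frozenset({'a', 'b'}), frozenset({'c'})}
```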
  • Slide 145
  • Exploiting Spatial Locality for Interconnect Power Reduction
    - A spatially local cluster is a group of algorithm operations that are tightly connected to each other in the flowgraph representation.
    - Two nodes are tightly connected on the flowgraph if the shortest distance between them, in terms of the number of edges traversed, is low.
    - A spatially local assignment maps the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware.
    - Partitioning the algorithm into spatially local clusters ensures that the majority of data transfers take place within clusters (over local buses) and relatively few occur between clusters (over global buses).
    - The partitioning information is passed on to the architecture netlist and floorplanning tools.
    - Local: a given adder outputs data to its own inputs. Global: a given adder outputs data to another adder's inputs.
  • Slide 146
  • Hardware Mapping
    - The last step in the synthesis process maps the allocated, assigned, and scheduled flow graph (the "decorated" flow graph) onto the available hardware blocks.
    - The result is a structural description of the processor architecture (e.g., sdl input to the Lager IV silicon assembly environment).
    - The mapping transforms the flow graph into three structural sub-graphs: the datapath structure graph, the controller state-machine graph, and the interface graph (between datapath control inputs and controller output signals).
  • Slide 147
  • Spectral Partitioning in High-Level Synthesis
    - The eigenvector placement forms an ordering in which nodes tightly connected to each other are placed close together; the relative distances measure the tightness of the connections (a numpy sketch follows below).
    - Use the eigenvector ordering to generate several candidate partitioning solutions.
    - The area estimates are based on distribution graphs; a distribution graph displays the expected number of operations executed in each time slot.
    - Local bus power: the number of local data transfers times the area of the cluster.
    - Global bus power: the number of global data transfers times the total area.
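    The eigenvector in question is typically the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue); a minimal sketch with numpy, assuming an undirected flowgraph given as an adjacency matrix:

```python
import numpy as np

# Spectral ordering: sort nodes by the Fiedler vector of the graph
# Laplacian. Tightly connected nodes land close together in this
# one-dimensional placement, which seeds the partitioning.
def spectral_order(adj):
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # symmetric -> eigh
    fiedler = eigvecs[:, 1]                        # second-smallest eigenvalue
    return np.argsort(fiedler)

# Two 3-node cliques joined by the single edge (2, 3)
adj = np.array([[0, 1, 1, 0, 0, 0],
                [1, 0, 1, 0, 0, 0],
                [1, 1, 0, 1, 0, 0],
                [0, 0, 1, 0, 1, 1],
                [0, 0, 0, 1, 0, 1],
                [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_order(adj))   # the two cliques appear as contiguous runs
```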
  • Slide 148
  • Finding a Good Partition
  • Slide 149
  • Interconnection Estimation
    - For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about 30-40% of the datapath height.
    - Average global bus length: the square root of the estimated chip area.
    - The chip-area estimate has three terms, representing white space, the active area of the components, and the wiring area; the coefficients are derived statistically.
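    The formula itself did not survive this transcript; a plausible reconstruction of its shape (c0, c1, c2 stand for the statistically fitted coefficients; A_i are component areas and N_wire the wire count, notation mine) is:

```latex
% Assumed shape of the estimate: white space + active area + wiring
A_{\text{chip}} \;\approx\; c_0 \;+\; c_1 \sum_i A_i \;+\; c_2\,N_{\text{wire}},
\qquad
\bar{L}_{\text{bus}} = \sqrt{A_{\text{chip}}}
```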
  • Slide 150
  • Incorporating into HYPER-LP
  • Slide 151
  • Experiments
  • Slide 152
  • Datapath Generation
    - Register-file recognition and multiplexer reduction: individual registers are merged as much as possible into register files, which reduces the number of bus multiplexers, the overall number of buses (since all registers in a file share the input and output buses), and the number of control signals (since a register file uses a local decoder).
    - Minimize the multiplexer and I/O bus counts simultaneously (clique partitioning is NP-complete, so simulated annealing is used; see the sketch below).
    - Datapath partitioning optimizes the processor floorplan; the core idea is to grow pairs of isomorphic regions, each as large as possible, from corresponding pairs of seed nodes.
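    Since exact clique partitioning is intractable, the slide resorts to simulated annealing; here is a generic SA skeleton with a toy register-file objective (the cost function and move set are placeholders of mine, not HYPER's):

```python
import math, random

def anneal(state, cost, neighbor, t=1.0, cooling=0.995, steps=5000):
    """Generic simulated annealing: accept uphill moves with prob. exp(-d/T)."""
    c = cost(state)
    best, best_c = state, c
    for _ in range(steps):
        cand = neighbor(state)
        delta = cost(cand) - c
        if delta <= 0 or random.random() < math.exp(-delta / t):
            state, c = cand, c + delta
            if c < best_c:
                best, best_c = state, c
        t *= cooling                      # geometric cooling schedule
    return best, best_c

# Toy objective: split 8 registers into 2 files; charge one multiplexer for
# every frequently co-accessed pair that ends up split across files.
random.seed(1)
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (1, 2)]
cost = lambda part: sum(part[a] != part[b] for a, b in pairs)

def neighbor(part):
    q = list(part)
    q[random.randrange(len(q))] ^= 1      # move one register to the other file
    return q

print(anneal([random.randrange(2) for _ in range(8)], cost, neighbor))
```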
  • Slide 153
  • Hardware Mapper
  • Slide 154
  • Hyper's Basic Architecture Model
  • Slide 155
  • Hyper's Crossbar Network
  • Slide 156
  • Refined Architecture Model
  • Slide 157
  • Bus Merging
  • Slide 158
  • Fanin Bus Merging
  • Slide 159
  • Fanout Bus Merging
  • Slide 160
  • Global Bus Merging
  • Slide 161
  • Test Example
  • Slide 162
  • Control Signal Assignment
  • Slide 163
  • Efficient High-Level Synthesis Algorithm for Lower Power Design, 1998.5.19
  • Slide 164
  • (Minimum-Cost Flow Algorithm)
  • Slide 165
  • [Slide garbled in transcription; legible fragments: IC, single chip, PDA, LCDs, battery, ULSI microprocessors, parallel computer, cooling]
  • Slide 167
  • High-Level Synthesis
    [Diagram: behavior (instructions, operations, variables, arrays, signals) is mapped onto structure (control, datapath, memory: operators, registers, memory, multiplexers); tasks shown: scheduling, hardware allocation, memory inferencing, register sharing, control inferencing; example code truncated: for(I=0;I…]