raven3: 28nm risc-v vector processor with on-chip dc/dc...
TRANSCRIPT
Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors
Brian Zimmer1, Yunsup Lee1, Alberto Puggelli1, Jaehwa Kwak1, Ruzica Jevtic1, Ben Keller1, Stevo Bailey1, Milovan Blagojevic1,2,
Pi-Feng Chiu1, Hanh-Phuc Le1, Po-Hung Chen1, Nicholas Sutardja1, Rimas Avizienis1, Andrew Waterman1, Brian Richards1,
Philippe Flatresse2, Elad Alon1, Krste Asanovic ́1, and Borivoje Nikolic ́1
1University of California, Berkeley, USA 2STMicroelectronics, Crolles, France
6/30/2015, RISC-V Workshop
Motivation• Dynamic voltage and frequency scaling (DVFS)
maximizes energy efficiency
Time
PerformanceRequirement
Fixed voltage
DVFS voltage
• Off-chip conversionSlow mode transitionsLimited voltage domainsCostly off-chip components
• On-chip conversionFast transitionsMany domainsNo off-chip components
✖ ✖ ✖
✔ ✔ ✔
2
Project Goals
• Fine-grained DVFS• High conversion efficiency
Energy-efficient
• Entirely on-chip converter• Low area overheadlow-cost
• Runs Linux• Realistic digital loadprocessor
3
Integrated Voltage RegulatorsMethod Efficiency
(0.5V output)Off-chip components
Area overhead
Linear regulator (eg. Toprak-Deniz ISSCC2014)
<50%
Buck converter (eg. Kurd ISSCC2014)
?%-90%
Interleaved switched-capacitor (eg. Kim ISSCC20151, Clerc ISSCC20152)
40%-65%1
65%-82%2
Simultaneous switched-capacitor with adaptive clock (Proposed)
85%
4
Videal VoutVlow
VhighVideal VoutVlow
Vhigh
Proposed Solution• Switch all converters simultaneously to avoid
charge sharing
1.0V1.8V
24 switched-capacitorDC-DC unit cells
(0.19 mm2)
DC-DC controller
Adaptive clock generator
Vout...
togg
le
+
FSMVref
clk
Async. FIFO/Level shiftersbetween domains
2Ghzref
CORE (1.19 mm2)
UNCORE
16KB ScalarInst. Cache
(Custom 8T SRAM Macros)
32KB Shared Data Cache
(Custom 8T SRAM Macros)
8KB VectorInst. Cache
(Custom 8T SRAM Macros)
To/from off-chip FPGA FSB and DRAM
Digital IO pads to wire-bonded chip-on-board
To off-chip oscilloscope
1.0V
RISC-V scalar core Vector acceleratorVector Issue Unit
Vector Memory Unit
Crossbar
ScalarRF FPUint
Arbiter
...
...
int int int int int
(16KB Vector RF uses eight custom 8T SRAM macros)
...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA
1.0V1.8V
24 switched-capacitorDC-DC unit cells
(0.19 mm2)
DC-DC controller
Adaptive clock generator
Vout...
togg
le
+
FSMVref
clk
Async. FIFO/Level shiftersbetween domains
2Ghzref
CORE (1.19 mm2)
UNCORE
16KB ScalarInst. Cache
(Custom 8T SRAM Macros)
32KB Shared Data Cache
(Custom 8T SRAM Macros)
8KB VectorInst. Cache
(Custom 8T SRAM Macros)
To/from off-chip FPGA FSB and DRAM
Digital IO pads to wire-bonded chip-on-board
To off-chip oscilloscope
1.0V
RISC-V scalar core Vector acceleratorVector Issue Unit
Vector Memory Unit
Crossbar
ScalarRF FPUint
Arbiter
...
...
int int int int int
(16KB Vector RF uses eight custom 8T SRAM macros)
...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA
1.0V1.8V
24 switched-capacitorDC-DC unit cells
(0.19 mm2)
DC-DC controller
Adaptive clock generator
Vout...
togg
le
+
FSMVref
clk
Async. FIFO/Level shiftersbetween domains
2Ghzref
CORE (1.19 mm2)
UNCORE
16KB ScalarInst. Cache
(Custom 8T SRAM Macros)
32KB Shared Data Cache
(Custom 8T SRAM Macros)
8KB VectorInst. Cache
(Custom 8T SRAM Macros)
To/from off-chip FPGA FSB and DRAM
Digital IO pads to wire-bonded chip-on-board
To off-chip oscilloscope
1.0V
RISC-V scalar core Vector acceleratorVector Issue Unit
Vector Memory Unit
Crossbar
ScalarRF FPUint
Arbiter
...
...
int int int int int
(16KB Vector RF uses eight custom 8T SRAM macros)
...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA
1.0V1.8V
24 switched-capacitorDC-DC unit cells
(0.19 mm2)
DC-DC controller
Adaptive clock generator
Vout...
togg
le+
FSMVref
clk
Async. FIFO/Level shiftersbetween domains
2Ghzref
CORE (1.19 mm2)
UNCORE
16KB ScalarInst. Cache
(Custom 8T SRAM Macros)
32KB Shared Data Cache
(Custom 8T SRAM Macros)
8KB VectorInst. Cache
(Custom 8T SRAM Macros)
To/from off-chip FPGA FSB and DRAM
Digital IO pads to wire-bonded chip-on-board
To off-chip oscilloscope
1.0V
RISC-V scalar core Vector acceleratorVector Issue Unit
Vector Memory Unit
Crossbar
ScalarRF FPUint
Arbiter
...
...
int int int int int
(16KB Vector RF uses eight custom 8T SRAM macros)
...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA
Traditional Interleaving Simultaneous Switching
Charge sharing losses
1.0V1.8V
24 switched-capacitorDC-DC unit cells
(0.19 mm2)
DC-DC controller
Adaptive clock generator
Vout...
togg
le
+
FSMVref
clk
Async. FIFO/Level shiftersbetween domains
2Ghzref
CORE (1.19 mm2)
UNCORE
16KB ScalarInst. Cache
(Custom 8T SRAM Macros)
32KB Shared Data Cache
(Custom 8T SRAM Macros)
8KB VectorInst. Cache
(Custom 8T SRAM Macros)
To/from off-chip FPGA FSB and DRAM
Digital IO pads to wire-bonded chip-on-board
To off-chip oscilloscope
1.0V
RISC-V scalar core Vector acceleratorVector Issue Unit
Vector Memory Unit
Crossbar
ScalarRF FPUint
Arbiter
...
...
int int int int int
(16KB Vector RF uses eight custom 8T SRAM macros)
...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA
1.0V1.8V
24 switched-capacitorDC-DC unit cells
(0.19 mm2)
DC-DC controller
Adaptive clock generator
Vout...
togg
le
+
FSMVref
clk
Async. FIFO/Level shiftersbetween domains
2Ghzref
CORE (1.19 mm2)
UNCORE
16KB ScalarInst. Cache
(Custom 8T SRAM Macros)
32KB Shared Data Cache
(Custom 8T SRAM Macros)
8KB VectorInst. Cache
(Custom 8T SRAM Macros)
To/from off-chip FPGA FSB and DRAM
Digital IO pads to wire-bonded chip-on-board
To off-chip oscilloscope
1.0V
RISC-V scalar core Vector acceleratorVector Issue Unit
Vector Memory Unit
Crossbar
ScalarRF FPUint
Arbiter
...
...
int int int int int
(16KB Vector RF uses eight custom 8T SRAM macros)
...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA
1.0V1.8V
24 switched-capacitorDC-DC unit cells
(0.19 mm2)
DC-DC controller
Adaptive clock generator
Vout...
togg
le
+
FSMVref
clk
Async. FIFO/Level shiftersbetween domains
2Ghzref
CORE (1.19 mm2)
UNCORE
16KB ScalarInst. Cache
(Custom 8T SRAM Macros)
32KB Shared Data Cache
(Custom 8T SRAM Macros)
8KB VectorInst. Cache
(Custom 8T SRAM Macros)
To/from off-chip FPGA FSB and DRAM
Digital IO pads to wire-bonded chip-on-board
To off-chip oscilloscope
1.0V
RISC-V scalar core Vector acceleratorVector Issue Unit
Vector Memory Unit
Crossbar
ScalarRF FPUint
Arbiter
...
...
int int int int int
(16KB Vector RF uses eight custom 8T SRAM macros)
...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA
1.0V1.8V
24 switched-capacitorDC-DC unit cells
(0.19 mm2)
DC-DC controller
Adaptive clock generator
Vout...
togg
le
+
FSMVref
clk
Async. FIFO/Level shiftersbetween domains
2Ghzref
CORE (1.19 mm2)
UNCORE
16KB ScalarInst. Cache
(Custom 8T SRAM Macros)
32KB Shared Data Cache
(Custom 8T SRAM Macros)
8KB VectorInst. Cache
(Custom 8T SRAM Macros)
To/from off-chip FPGA FSB and DRAM
Digital IO pads to wire-bonded chip-on-board
To off-chip oscilloscope
1.0V
RISC-V scalar core Vector acceleratorVector Issue Unit
Vector Memory Unit
Crossbar
ScalarRF FPUint
Arbiter
...
...
int int int int int
(16KB Vector RF uses eight custom 8T SRAM macros)
...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA
No charge sharing!
• Clock frequency adapts to track the voltage ripple5
CORE (1.19 mm2)Chip Architecture
6
Floorplan• ST 28nm FDSOI• Die area: 2.37mm2
• Core area: 1.19mm2
• Converter area: 0.19mm2 (16%)
• MOS+MOM (to M3) density: 11fF/µm2
7
Core
Reconfigurable SC Converters
• Simple lower bound control
1V Mode 1.8V 1/2 Mode 1V 2/3 Mode 1V 1/2 Mode(~1V) (~0.9V) (~0.67V) (~0.5V)
Unit Cell Topologies
Lower-bound Controller
Operational Waveform
1V
1V
1V
1.8V
1V
1.8V
1V
1V
1V
1V
1V
1V
Vout VoutVoutVout
+FSM
2GHz comparator
VrefVout toggle
toggleVref
Vout
Φ1Φ1 Φ1 Φ1
Φ2Φ2Φ2
Φ2
Φ2
Φ2
Φ2 Φ1
Φ1
Φ1
Φ1off
off
Φ2
Φ2 Φ1
Φ1
Φ1
Φ1
off
Φ2off
off
Φ1
Φ1
Φ1
Φ1off
off
Φ2
Φ2
Φ2
Φ2
off
off
off
off
Efficiency ImprovementsInterleaving Rippling (this approach)
Videal VoutVlow
Vhigh VoutVidealVlow
Vhigh
clkAdaptive clock negates losses by increasing the average clock frequency
caused by charge sharing with interleaved converters
clk
No losses�V loss
�V
Vlow
Vin
Vlow
VinVhigh
Vlow
Vin VinVhigh
toggletoggle
• Four output voltages, single unit cell
1V Mode 1.8V 1/2 Mode 1V 2/3 Mode 1V 1/2 Mode(~1V) (~0.9V) (~0.67V) (~0.5V)
Unit Cell Topologies
Lower-bound Controller
Operational Waveform
1V
1V
1V
1.8V
1V
1.8V
1V
1V
1V
1V
1V
1V
Vout VoutVoutVout
+FSM
2GHz comparator
VrefVout toggle
toggleVref
Vout
Φ1Φ1 Φ1 Φ1
Φ2Φ2Φ2
Φ2
Φ2
Φ2
Φ2 Φ1
Φ1
Φ1
Φ1off
off
Φ2
Φ2 Φ1
Φ1
Φ1
Φ1
off
Φ2off
off
Φ1
Φ1
Φ1
Φ1off
off
Φ2
Φ2
Φ2
Φ2
off
off
off
off
Efficiency ImprovementsInterleaving Rippling (this approach)
Videal VoutVlow
Vhigh VoutVidealVlow
Vhigh
clkAdaptive clock negates losses by increasing the average clock frequency
caused by charge sharing with interleaved converters
clk
No losses�V loss
�V
Vlow
Vin
Vlow
VinVhigh
Vlow
Vin VinVhigh
toggletoggle
• Simultaneous switching
1V Mode 1.8V 1/2 Mode 1V 2/3 Mode 1V 1/2 Mode(~1V) (~0.9V) (~0.67V) (~0.5V)
1V1V
1V
1.8V1V
1.8V
1V1V
1V
1V1V
1V
Vout
Vout
Vout
Vout
Φ2
Φ2
Φ2
Φ2 Φ1
Φ1
Φ1
Φ1off
offΦ2
Φ2 Φ1
Φ1
Φ1
Φ1
off
Φ2off
offΦ1
Φ1
Φ1
Φ1off
offΦ2
Φ2
Φ2Φ2
off
off
offoff
8
Self-Adjusting Clock Generator
• Replica tracks critical path with voltage ripple• Controller quantizes clock edge
Block Diagram (Tunable Replica Path)
DLL2Ghz input, 16 phases
ControllerSelect closest edge
TunableReplica
Path
Vout .........
...
...
......
Adaptive system clock
Vout
9
Test setup
• Host: Zedboard (ARM+FPGA, network accessible)
• Motherboard: Programmable supplies• Daughterboard: Chip-on-board
Conf. 1
Conf. 2
Conf. 3
10
Vout Measurements
Fast converter mode transitions enable extremely fine-grain DVFS algorithms
11
GFLOPS/W
28nm FDSOI offers high energy efficiency and 4 converter modes cover wide operating range
• Double-precision matrix-multiplication used as energy-efficiency metric
• GFLOPS/W inverse of energy/operation
12
Conclusion
1. Fine-grained and wide-range DVFS– 20ns transitions– 1V to 0.45V
2. Entirely on-chip voltage conversion– Simultaneous switched-capacitor
3. High efficiency– 80% across wide voltage range
4. Extreme energy efficiency– 26 GFLOPS/W with on-chip conversion
Energy-efficient 28nm processor featuring:
13
Acknowledgements
• Funding: BWRC members, ASPIRE members, DARPA PERFECT Award Number HR0011-12-2-0016, Intel ARO, AMD, GRC, Marie Curie FP7, NSF GRFP, NVIDIA Fellowship
• Fabrication donation by STMicroelectronics.
• Tom Burd, James Dunn, Olivier Thomas, and Andrei Vladimirescu
14