Dirk Koch ([email protected])Partial Runtime Reconfiguration for Industrial Applications – Methods and Tools 1
Dirk Koch
Partial Runtime Reconfiguration for Industrial Applications
– Methods and Tools
Why Use Runtime Reconfiguration?
• Economics of ASIC and FPGA designs
  • FPGA buyers: reduce unit cost
  • FPGA vendors: more attractive for high-volume designs
Source: Electronic News, 16.03.2006
Runtime Reconfiguration: Applications
• Applications for infrequent reconfiguration:
  • Rapid prototyping
  • Searching (text, genetic databases)
  • Mode changing (test equipment, radio)
  • Self-repair / self-optimization
• Applications for high-speed reconfiguration (area saving):
  • Networking (exchange packet filters according to traffic)
  • Mutually exclusive functionality (MP3 playback versus phone calls)
  • Modulation/frequency/encryption hopping in military radios
• Applications for high-speed reconfiguration (acceleration):
  • Crypto (e.g., asymmetric crypto for key exchange and symmetric crypto for data)
  • Sorting (database acceleration): optimize individual sort steps
  • Possibly video processing (en-/decoding, object classification)
Applications: Area Saving
• Networking example: adapt to different protocols over time
[Figure: FPGA network processor. Labels: dispatcher, config., VoIP, SSH, HTTP, FTP, configuration repository. Source: www.caida.org]
Applications: Area Saving
• Time-variant resource usage
[Figure: panels a) and b) showing modules M0, M1, M2 for iterations ν, ν+1, ν+2 scheduled on resource slots S0 to S3 over time t with period τ; one region is labeled "configuration overhead".]
• Example: SSL-encrypted data transfer
  • Asymmetric key exchange (Montgomery multiplication)
  • Symmetric data exchange (AES)
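The area-saving argument behind the SSL example can be sketched in a few lines. This is a minimal sketch under stated assumptions: the module names and LUT counts are hypothetical, and the configuration overhead is ignored. A static design must hold every module at once, while a time-multiplexed design only needs room for the largest phase.

```python
# Hypothetical module sizes in LUTs (illustrative values, not from the talk).
MODULE_AREA = {"montgomery": 1200, "aes": 900}

# SSL-style schedule: asymmetric key exchange first, then symmetric transfer.
PHASES = [["montgomery"], ["aes"]]


def static_area(module_area):
    """Area needed when every module is resident at the same time."""
    return sum(module_area.values())


def reconfigurable_area(phases, module_area):
    """Area needed when only the modules of the current phase are resident."""
    return max(sum(module_area[m] for m in phase) for phase in phases)


if __name__ == "__main__":
    print("static:", static_area(MODULE_AREA), "LUTs")
    print("time-multiplexed:", reconfigurable_area(PHASES, MODULE_AREA), "LUTs")
```

With these assumed numbers the reconfigurable region only has to fit the Montgomery multiplier, since the AES core replaces it after the key exchange.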
Applications: Area vs. Speed/Power
• Runtime reconfiguration utilizes idle parts of the FPGA
• The size of the idle parts can be tuned with the clock speed
• Simplification of the system-level design
• Simple integration of additional functionality
• Optimize clock/power for a particular system
[Figure: two schedules on resource slots S0 to S3 over time t with period τ.]
Applications: Acceleration
• Reduce latency by spending more area on submodules
  • May alternatively allow the clock frequency (and power) to be reduced
  • Lower latency might reduce buffer sizes
  • May also increase throughput
[Figure: schedules of submodules A, B, C on resource slots S0 to S2 over time t, with the latency marked on the time axis.]
Applications: Database Acceleration
• In large databases, roughly 30% of CPU time is spent on sorting.
• Huge problem classes (number and size of the keys can be several GB)
• Different data types (int, txt, ...)
1) Build the longest sorted subsequences that fit into the FPGA
2) Merge the subsequences in one or more steps into a fully sorted result
• Reconfigure to exchange sorters
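The two steps above can be sketched in software as follows. This is only an illustration of the scheme, not the hardware sorter from the talk: `run_len` stands in for "what fits into the FPGA", `fan_in` for the merger width, and all function names are hypothetical.

```python
import heapq


def build_runs(keys, run_len):
    """Step 1: cut the input into chunks of run_len keys and sort each chunk."""
    return [sorted(keys[i:i + run_len]) for i in range(0, len(keys), run_len)]


def merge_runs(runs, fan_in=2):
    """Step 2: merge fan_in runs at a time until a single sorted run remains."""
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []


def two_phase_sort(keys, run_len=4, fan_in=2):
    """Sort keys by building sorted runs first and merging them afterwards."""
    return merge_runs(build_runs(keys, run_len), fan_in)


if __name__ == "__main__":
    print(two_phase_sort([9, 3, 7, 1, 8, 2, 6, 5, 4, 0]))
```

In a reconfigurable system, step 1 and each merge pass of step 2 could use a different sorter configuration, which is the point of exchanging sorters at run time.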
Assumed HW architecture:
[Figure: chain of merge stages 2x2, 2x4, 2x8, ..., 2x4096, buffered in distributed memory for the small stages and in 2 to 24 BRAMs for the large stages, connected to memory through a load unit.]
Applications: Database Acceleration
• More than 5x faster when using runtime reconfiguration as compared to static solutions
• Reduced I/O throughput → lower power!
• Fewer iterations required, because more work is done per step
• Useful for many huge problems (CT image processing, matrix multiplication, matrix inversion, …)
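Why more work per step means fewer iterations can be seen by counting merge passes. A minimal sketch, assuming a plain merge sort model in which the data set makes one trip through memory per pass; `run_len` (length of the initial sorted runs) and `fan_in` (runs merged per step) are hypothetical parameters, and the numbers are not measurements from the talk.

```python
def merge_passes(n_keys, run_len, fan_in=2):
    """Number of merge passes needed to turn sorted runs into one sorted run."""
    runs = -(-n_keys // run_len)  # ceiling division: number of initial runs
    passes = 0
    while runs > 1:
        runs = -(-runs // fan_in)  # each pass merges fan_in runs into one
        passes += 1
    return passes


if __name__ == "__main__":
    n = 1 << 20
    for run_len in (1, 64, 4096):
        print(run_len, merge_passes(n, run_len))
```

Under this model, longer initial runs cut the number of passes over the data, which is one way to read the reduced I/O throughput on the slide.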
[Figure: several parallel sorter chains with merge stages 2x2 to 2x4096 (distributed memory and 2 to 24 BRAMs), each with its own load unit, fed from memory by a round robin memory dispatcher and drained to memory by a round robin memory collector.]
The Runtime Reconfiguration Paradox
Partial runtime reconfiguration is brilliant, but not used!?
[Figure: modules m1 to m6 placed into resource slots. Labels: module size M, overhead, internal fragmentation, communication cost c, c_const.]
• Internal fragmentation dominates the overhead
• Can be optimized with small slots → 2D placement
• 2D placement enhances BRAM/DSP utilization
• Requires adequate communication architectures
• Missing: efficient methodologies for integrating partial modules
OK – Small Slots: But How Small?
[Figure: Impact of Resource Slot Size and Communication Cost on the Average Module Overhead]
Result: slot size of ~200–300 LUTs or ~25–40 CLBs
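The trade-off behind this result can be illustrated with a simple overhead model. This is a sketch under assumptions that are not taken from the slides: every slot loses a fixed number of LUTs to communication logic, a module occupies a whole number of slots, and the module sizes below are hypothetical.

```python
def module_overhead(module_luts, slot_luts, comm_luts):
    """LUTs occupied beyond the module's own logic (fragmentation + communication)."""
    usable = slot_luts - comm_luts
    if usable <= 0:
        raise ValueError("slot must be larger than its communication cost")
    slots = -(-module_luts // usable)  # ceiling division: slots the module needs
    return slots * slot_luts - module_luts


def average_overhead(modules, slot_luts, comm_luts):
    """Mean overhead over a set of module sizes."""
    return sum(module_overhead(m, slot_luts, comm_luts) for m in modules) / len(modules)


if __name__ == "__main__":
    modules = [300, 700, 1000]  # hypothetical module sizes in LUTs
    for slot in (20, 250, 2000):
        print(slot, round(average_overhead(modules, slot, comm_luts=10), 1))
```

In this toy model very small slots are dominated by the per-slot communication cost and very large slots by internal fragmentation, so the average overhead has a minimum at an intermediate slot size.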
Communication Architectures: Buses
[Figure: bus with the data bytes D31...D24, D23...D16, D15...D8, D7...D0 distributed over slots with the repeating index 0 1 2 3, connected to module m1. Legend: start point & mux select value, used connection, unused connection.]
Communication Architectures: Buses
[Figure: bus on a grid of slots (x = 0 to 7, y = 0 to 3) holding modules m1 to m4, with the data bytes D31...D24, D23...D16, D15...D8, D7...D0 combined through OR gates (≥1). Slot indexing: slot y,x (e.g., slot 3,6). Legend: start point & mux select value, used connection, unused connection.]
Communication Architectures: I/O Bars
[Figure: static system and Slot 0 to Slot 5 connected by the ReCoBus and by I/O bars for video in, video out, audio in, and audio out.]
• Selectable read-modify-write connection
• Ideal for data streaming
Communication Architectures: I/O Bars
[Figure: I/O bar with incoming signals, outgoing signals, and route-through signals.]
What else do we need for PR?
• Methods for high-speed reconfiguration
  • Overclocking: more than 1 GB/s on Virtex-5 (Christopher Claus from TU München, to be published at DATE)
    • Swap a fully featured MicroBlaze in less than 100 µs
  • Bitstream decompression (Koch et al., “Hardware Decompression Techniques...”)
  • Configuration prefetching (Hauck, “Configuration Prefetch for Single Context Reconfigurable Coprocessors”)
• Design tools:
  • Analysis and simulation
  • Floorplanning (partitioning between the static part and the reconfigurable regions)
  • Bitstream assembly
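The two figures quoted for overclocked configuration (more than 1 GB/s and less than 100 µs) are linked by simple arithmetic. A minimal sketch, assuming 1 GB/s means 10^9 bytes per second and ignoring any setup time of the configuration port; the function names and the 100,000-byte bitstream are hypothetical.

```python
def reconfiguration_time_us(bitstream_bytes, throughput_bytes_per_s):
    """Time in microseconds to write a partial bitstream at a given throughput."""
    return bitstream_bytes * 1_000_000 / throughput_bytes_per_s


def max_bitstream_bytes(budget_us, throughput_bytes_per_s):
    """Largest partial bitstream that can be written within a time budget."""
    return budget_us * throughput_bytes_per_s // 1_000_000


if __name__ == "__main__":
    one_gb_per_s = 1_000_000_000
    print(reconfiguration_time_us(100_000, one_gb_per_s), "us")
    print(max_bitstream_bytes(100, one_gb_per_s), "bytes")
```

Under these assumptions, a 100 µs budget at 1 GB/s corresponds to a partial bitstream of at most about 100,000 bytes, which also shows why bitstream decompression and prefetching help.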
Design Tools: PlanAhead
• Xilinx floorplanning tool
• Originally designed for densely packed designs and incremental design flows
• Proxy logic for fixing the routing between static and reconfigurable parts
• Resource estimator for fast initial budgeting
• Simple built-in DRC
• Island style → may result in low resource utilization!
• No strict module encapsulation: the static design may route through a reconfigurable region → many limitations!
http://www.xilinx.com/tools/planahead.htm
Design Tools: ReCoBus-Builder (www.ReCoBus.de)
[Figure: design flow. Steps: budgeting [Xilinx XST]; floorplanning and communication synthesis [ReCoBus-Builder]; place & route static [PAR] and place & route module1 [PAR]; build static bitstream [bitgen] and build module1 bitstream [bitgen]; partial bitstream extraction [bitscan]; bitstream linking [bitscan]. Artifacts: ReCoBus and I/O bar templates, static netlist and constraints, module1 netlist and constraints, static system bitfile, full and partial module1 bitfiles, initial system bitfile, repository for the run-time system.]
bitlink module.bit X Y \
static.bit initial.bit
Design Tools: ReCoBus-Builder (www.ReCoBus.de)
• Example system
  • PPC
  • External SDRAM
  • Video I/O
  • 96 slots
  • PLB-compatible backplane
  • 600 MB/s DMA
  • All slots can read the video stream
• Easy to implement with ReCoBus-Builder and Xilinx ISE / EDK
[Figure: floorplan with the PPC, static regions, a top system and a bottom system, and the macros ReCoBus_TT, ReCoBus_TB, ReCoBus_BT, ReCoBus_BB, IOBar_TT, IOBar_TB, IOBar_BT, IOBar_BB.]
Design Tools: ReCoBus-Builder (www.ReCoBus.de)
• Easy-to-use builder for reconfigurable systems
• Available at www.recobus.de
[Figure: the system specification (communication architecture & floorplan) is used to generate the static system and to generate the module repository.]
bitlink module.bit -pos X,Y static.bit -outfile initial.bit
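A build script could assemble this command line as follows. This is a hypothetical helper that only reproduces the argument order shown above; the meaning of each argument is assumed from the slide, no other bitlink options are implied, and the command is printed rather than executed.

```python
import shlex


def bitlink_args(module_bit, x, y, static_bit, out_bit):
    """Build the argument list for the bitlink call shown on the slide."""
    return ["bitlink", module_bit, "-pos", f"{x},{y}",
            static_bit, "-outfile", out_bit]


if __name__ == "__main__":
    # Example position 3,6 is an arbitrary illustrative value.
    print(shlex.join(bitlink_args("module.bit", 3, 6, "static.bit", "initial.bit")))
```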
Why Use Component-Based Design?
• Closing the design productivity gap
• Enhance design reuse (e.g., by using standardized interfaces)
Source: Michael Flynn (ASAP 2005)
Context-Switching on FPGAs
Software: run in time
FPGA / ASIC: run in space
• Combine the advantages of both by using run-time reconfiguration:
  • High performance
  • Enhanced resource efficiency
  • Simplified design
COSRECOS (Context Switching Reconfigurable Hardware for Communication Systems)
Context-Switching on FPGAs
• What is the context of an FPGA?
1) Present FPGA configuration (technology level)
   Access via configuration port
2) State of a module (logic level)
   • Register snapshot
   • RAM blocks
   • External state
   Access via configuration port or extra logic (e.g., scan-chain)
Context-Switching on FPGAs
Technology level (FPGA) versus logic level (module), each static or dynamic:
• Technology static, logic static: module runs forever; single configuration / module context; ASIC-like (e.g., memory controller)
• Technology dynamic, logic static: configuration swapping; run-to-completion model (no module context is considered at start) (e.g., motion-JPEG)
• Technology static, logic dynamic: multiple module contexts on a single configuration (e.g., multi-channel crypto)
• Technology dynamic, logic dynamic: module preemption and resuming; configuration swapping; transparent (like software) (COSRECOS)
• All variants may co-exist in a reconfigurable SoC
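Module preemption and resuming can be illustrated with a small software model. This is a sketch under assumptions, not the COSRECOS implementation: the `ModuleContext` layout, the toy `Accumulator` module, and all method names are hypothetical stand-ins for a register snapshot and RAM contents read out via the configuration port or a scan chain.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleContext:
    """Logic-level state of a module: register snapshot plus RAM contents."""
    registers: dict = field(default_factory=dict)
    ram: list = field(default_factory=list)


class Accumulator:
    """Toy module: sums its inputs and logs them in a small RAM."""

    def __init__(self):
        self.total = 0
        self.log = []

    def step(self, value):
        self.total += value
        self.log.append(value)

    def save_context(self):
        """Preemption: capture the state before the slot is reconfigured."""
        return ModuleContext(registers={"total": self.total}, ram=list(self.log))

    def restore_context(self, ctx):
        """Resuming: write the saved state into a freshly configured module."""
        self.total = ctx.registers["total"]
        self.log = list(ctx.ram)


def run_with_preemption(data, split):
    """Process data with one preemption after `split` inputs."""
    first = Accumulator()
    for value in data[:split]:
        first.step(value)
    ctx = first.save_context()      # slot may now host another module
    resumed = Accumulator()         # same module configured again later
    resumed.restore_context(ctx)
    for value in data[split:]:
        resumed.step(value)
    return resumed.total


if __name__ == "__main__":
    print(run_with_preemption([1, 2, 3, 4, 5], split=2))
```

The result matches an uninterrupted run, which is the sense in which preemption is meant to be transparent to the module.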
Thanks for your attention
• COSRECOS people:
  • Jim Tørresen
  • Dirk Koch
  • Simen Gimle Hansen
  • Alexander Wold
  • Christian Beckhoff
  • + Students
• Very active research project → follow us on
Questions, suggestions, comments
This project is funded by the Research Council of Norway