Post on 20-Dec-2015
SKKU Mobile Phone Department (휴대폰학과) © 조준동 (Jun-Dong Cho), 2008.1
Digital Signal Processing With FPGAs
Paul Ekas, Jean-Charles Bouzigues

Multiplier Options In FPGAs
Option                 Resource                       Area Usage
1 Logic Multipliers    Logic Elements (Traditional)   500 LEs per 18x18 Multiplier
2 Hard Multipliers     DSP Blocks                     4 18x18 Multipliers per DSP Block
3 Soft Multipliers     RAM                            1 to 2 Embedded Memory Blocks

Logic Elements
• Smallest Unit of Logic
• Grouped into Logic Array Blocks (LABs) of Ten LEs
• Features
– Four-Input Look-Up Table (LUT)
– Configurable Register
– Dynamic Add/Subtract Control
– Carry-Select Chain Logic
[Figure: Logic Array Block containing Logic Elements LE1 to LE10, with Control Signals and Local Interconnect]

DSP Block: Optimized Hard MAC
18 Bit x 18 Bit
• 4 Multiplies
• 2 Multiplies with Accumulate
• 1 Sum of 2 Multipliers (Complex Multiply)
• 1 Sum of 4 Multiplies
9 Bit x 9 Bit
• 8 Multiplies
• 2 Multiplies with Accumulate
• 2 Sum of 2 Multipliers (Complex Multipliers)
• 2 Sum of 4 Multiplies
36 Bit x 36 Bit
• 1 Multiply
[Figure: DSP block datapath with Input Register Unit, multipliers, Optional Pipelining, add/subtract stages, Output Register Unit and Output MUX; bus widths of 36, 37, 38 and 144 bits are marked]
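A complex multiply needs four real multiplies, which can be grouped as two two-multiplier sums: one with subtraction for the real part and one with addition for the imaginary part. The Python sketch below only illustrates that arithmetic; the function name and the plain integer operands are assumptions for illustration and are not taken from any Altera tool.

```python
def complex_multiply(a_re, a_im, b_re, b_im):
    """Complex product built from two 'sum of 2 multipliers' operations."""
    # Real part: difference of two products (adder used in subtract mode).
    re = a_re * b_re - a_im * b_im
    # Imaginary part: sum of two products (adder used in add mode).
    im = a_re * b_im + a_im * b_re
    return re, im


# (3 + 4j) * (5 - 2j)
print(complex_multiply(3, 4, 5, -2))
```

Four multipliers and two adders are used in total, which is why a single 18-bit complex multiply occupies a whole four-multiplier block.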

Soft Multipliers: Lookup Based Multiplication
• Use Embedded RAM Blocks as Look-Up Tables (LUTs) for Generating Partial Products
• Coefficient or Sum of Coefficients Values Stored in RAM Blocks
• MSB Partial Product Shifted & Added to LSB Partial Product
• Example
– Multiplication of 5-Bit Input with 13-Bit Coefficient
– All Possible 18-Bit Results Stored in a 32*18 Look-Up Table (M512)
Multiplier Table (5-Bit Address, 18-Bit Data Output, C = Coefficient[12:0])
ADDRESS   MULT_RESULT
00000     0
00001     C
00010     2*C
00011     3*C
…         …
11111     31*C
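The lookup scheme can be sketched in a few lines of Python. The coefficient value and all function names below are hypothetical and chosen only for illustration; the 10-bit helper shows, under the assumption of an unsigned input, how an MSB partial product is shifted and added to an LSB partial product when the input is wider than the table address.

```python
ADDRESS_BITS = 5        # 5-bit input, so the table has 2**5 = 32 entries
COEFFICIENT = 0x1ABC    # hypothetical 13-bit coefficient (must be < 2**13)


def build_multiplier_table(coefficient, address_bits=ADDRESS_BITS):
    """Precompute address * coefficient for every possible address."""
    return [address * coefficient for address in range(1 << address_bits)]


def lut_multiply(table, sample):
    """Multiply a 5-bit sample by the coefficient with a single table read."""
    return table[sample]


def lut_multiply_10bit(table, sample, address_bits=ADDRESS_BITS):
    """Multiply an unsigned 10-bit sample using two reads of the same table.

    The MSB partial product is shifted left by the width of the LSB field
    and added to the LSB partial product.
    """
    lsb = sample & ((1 << address_bits) - 1)
    msb = sample >> address_bits
    return (table[msb] << address_bits) + table[lsb]


table = build_multiplier_table(COEFFICIENT)
print(lut_multiply(table, 3) == 3 * COEFFICIENT)
print(lut_multiply_10bit(table, 1000) == 1000 * COEFFICIENT)
```

The largest entry is 31 times a 13-bit value, which stays below 2**18, so every result fits the 18-bit data output.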

Altera FPGA Memory Architectures
• Today’s applications need more high performance memory
• One size does not fit all
• Wide choice of modes and widths
Memory options: M512 Blocks, M4K Blocks, M-RAM, External Memory Devices (DDR SDRAM & SRAM, SDR SDRAM, QDR & QDRII SRAM, ZBT SRAM, DDR FCRAM)
Embedded block features listed:
• True Dual Port RAM, Embedded Shift Register Mode, 512K bits, Operates Up to 300 MHz, Mixed Clock Mode
• True Dual Port RAM, Embedded Shift Register Mode, Operates Up to 312 MHz, Mixed Clock Mode
• Rate Changing, Embedded Shift Register Mode, Operates Up to 312 MHz, Mixed Clock Mode
Figure axes: More Bits For Larger Memory Buffering; More Data Ports for Greater Memory Bandwidth

Soft Multiplier: Sum of Multiplications
Example: FIR Filter (Sample 16-Bit, Coefficient 16-Bit), Memory: 2 M512
Sum of Multiplications Table
ADDRESS   MULT_RESULT
0000      0
0001      C0
0010      C1
0011      C0+C1
…         …
1111      C0+C1+C2+C3
[Figure: the Input feeds two groups of 16-Bit Serial Shift Registers; each group drives the 4-bit address of an M512 (32*18) table, and the table outputs are combined by adders to form the Output]
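The sum-of-multiplications table can be sketched in Python as a bit-serial dot product. This is a minimal illustration that assumes four taps, unsigned 16-bit samples and made-up coefficient values; the function names are hypothetical, and signed samples would need the most significant bit to be weighted negatively.

```python
def build_sum_table(coeffs):
    """Entry 'addr' holds the sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << n)
    ]


def bit_serial_dot_product(table, samples, sample_bits=16):
    """Compute sum(coeff[i] * sample[i]) with one table read per bit position."""
    acc = 0
    for bit in range(sample_bits):
        # Take the same bit from every sample to form the table address.
        addr = 0
        for i, sample in enumerate(samples):
            addr |= ((sample >> bit) & 1) << i
        # Weight the looked-up partial sum by the bit position.
        acc += table[addr] << bit
    return acc


coeffs = [3, 5, 7, 11]             # hypothetical C0..C3
samples = [1000, 65535, 0, 12345]  # hypothetical unsigned 16-bit samples
table = build_sum_table(coeffs)
print(bit_serial_dot_product(table, samples))
print(sum(c * s for c, s in zip(coeffs, samples)))
```

Both printed values agree, and no multiplier is used: each of the 16 steps is one table read, one shift and one add.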

Example: Direct Sequence Spread Spectrum (DSSS) Modem

DSSS Modem
• Five Independent Data Channels Spread to 3.84 Mcps
• Three-Stage FIR Interpolation-by-32
• Root-Raised Cosine Pulse Shaping with 22% Excess Bandwidth
• 112 dB SFDR 15.36 MHz Quadrature Carriers
• 122.88 MSPS Transmitter Output with 5 MHz Bandwidth & Over 78 dB Out-of-Band Rejection
• Automatic Gain Control (AGC) Compensating for Channel Attenuation of up to 30 dB
• Costas Loop Carrier Recovery
• 4x Oversampling Code Synchronization
[Figure: data channels DCH0 to DCH4 enter the DSSS Modulator, pass through the Channel Model, and are recovered as DCH0 to DCH4 by the DSSS Demodulator]
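The rates quoted for the modem are consistent with each other, as the short Python check below shows. The stage factors (x2, x4, x4) are the interpolation factors listed for the three FIR stages on the DSSS Modulator slide; the variable names are illustrative.

```python
CHIP_RATE_KCPS = 3840       # 3.84 Mcps
STAGE_FACTORS = (2, 4, 4)   # interpolation factors of the three FIR stages
CARRIER_KHZ = 15360         # 15.36 MHz quadrature carrier

# Total interpolation is the product of the per-stage factors.
interpolation = 1
for factor in STAGE_FACTORS:
    interpolation *= factor

output_rate_ksps = CHIP_RATE_KCPS * interpolation
samples_per_carrier_cycle = output_rate_ksps / CARRIER_KHZ

print(interpolation)              # total interpolation factor
print(output_rate_ksps / 1000)    # transmitter output rate in MSPS
print(samples_per_carrier_cycle)  # output samples per carrier period
```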

DSSS Modulator
[Block diagram] Blocks and signals shown:
• Inputs: DCH0 to DCH4, PCH and SCH
• Channelization codes: Cch,16,0, Cch,16,1, Cch,16,2, Cch,16,8, Cch,16,9, Cch,16,10
• Length 256 Gold Code Spreader
• FIR1 LPF: 2-Channel 87-Tap FIR Filter, Interpolation x2
• FIR2 LPF: 2-Channel 47-Tap FIR Filter, Interpolation x4
• FIR3 RRC: 25-Tap FIR Filter, Interpolation x4, Ex BW: 22% (shown twice)
• NCO: Frequency Resolution 0.03 Hz, SFDR 112 dB, with Carrier Phase Increment input and Sin(wn), Cos(wn) outputs
• Gains K, gi and gq; outputs Re[] and Im[]
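The quoted NCO frequency resolution can be related to the phase accumulator width with the usual relation resolution = f_clk / 2^N. The sketch below assumes a 32-bit accumulator clocked at the 122.88 MSPS output rate; neither figure is stated for the NCO on the slides, and the function names are hypothetical.

```python
def nco_frequency_resolution(clock_hz, accumulator_bits):
    """Smallest frequency step of a phase-accumulator NCO, in Hz."""
    return clock_hz / (1 << accumulator_bits)


def phase_increment(carrier_hz, clock_hz, accumulator_bits):
    """Accumulator increment (rounded to nearest integer) for a carrier."""
    scaled = carrier_hz * (1 << accumulator_bits)
    return (scaled + clock_hz // 2) // clock_hz


CLOCK_HZ = 122_880_000     # assumed NCO clock: the 122.88 MSPS output rate
CARRIER_HZ = 15_360_000    # 15.36 MHz carrier
ACCUMULATOR_BITS = 32      # assumed accumulator width

print(nco_frequency_resolution(CLOCK_HZ, ACCUMULATOR_BITS))
print(phase_increment(CARRIER_HZ, CLOCK_HZ, ACCUMULATOR_BITS))
```

Under these assumptions the step is about 0.029 Hz, in line with the 0.03 Hz on the slide, and the carrier corresponds to an increment of one eighth of the accumulator range.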

DSSS Demodulator
[Block diagram] Blocks and signals shown:
• NCO: Frequency Resolution 0.03 Hz, SFDR 112 dB, with Free-Running Phase Increment
• Two Altera RRC FIR filters: 31-Tap FIR Filter, Excess BW: 22%, Fixed Rate
• AGC
• Carrier Recovery Loop
• 8 Gold Code Correlator, 4x Oversampling
• Peak Detector
• Buffer, I-Q Derotate
• Pilot Monitor, Pilot Output
• Hadamard Despreader, Data Channels Output 1…5
• Signals: pn_lock, max_index

DSSS Modem Resources
Resource Usage Summary
Design Entity   Logic Elements   M512 RAM   M4K RAM   MegaRAM   DSP Block Elements
Modulator       9943             1          8         0         12
Demodulator     12196            60         8         1         60
Power Usage Estimates
Power                                 mW
Total Standby Internal Power          75
Total Logic Element Internal Power    283
Total Clocktree Internal Power        175
Total DSP Internal Power              23
Other Internal Power                  92
Total Power                           505

FIR Filter Example* – 16X Cost/Performance Improvement
Device          Solution                  FIR Performance (MHz)   Device Cost****   Cost per FIR MHz
TI C6713-200    64 cycles** @ 200 MHz     3.125                   $24.59            $7.87
TI C6416-600    32 cycles** @ 600 MHz     18.75                   $160              $8.53
Altera 1C3-8    8 cycles*** @ 230 MHz     28.75                   $14               $0.49
Altera 1C12-8   1 cycle*** @ 170 MHz      170                     $84               $0.49
* FIR 128 Tap, 16 bit data, 14 bit coefficients
** DSPLib Optimized Assembly Libraries from Texas Instruments
*** MegaCore Optimized FIR Compiler from Altera
**** Pricing in quantity of 100 at Arrow 6/25/03
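The two derived columns follow from the others: FIR performance is the clock rate divided by the cycles per output, and cost per FIR MHz is the device cost divided by that performance. A small Python sketch reproduces the table values; the data structure and function names are illustrative.

```python
# (device, cycles per FIR output, clock in MHz, device cost in USD)
DEVICES = [
    ("TI C6713-200", 64, 200, 24.59),
    ("TI C6416-600", 32, 600, 160.00),
    ("Altera 1C3-8", 8, 230, 14.00),
    ("Altera 1C12-8", 1, 170, 84.00),
]


def fir_performance_mhz(cycles, clock_mhz):
    """FIR output rate in MHz: one output every 'cycles' clock cycles."""
    return clock_mhz / cycles


def cost_per_fir_mhz(cost, cycles, clock_mhz):
    """Device cost divided by the FIR output rate."""
    return cost / fir_performance_mhz(cycles, clock_mhz)


for name, cycles, clock_mhz, cost in DEVICES:
    perf = fir_performance_mhz(cycles, clock_mhz)
    print(f"{name:14s} {perf:8.3f} MHz  ${cost_per_fir_mhz(cost, cycles, clock_mhz):.2f}")

# Ratio between the cheaper TI option and the Altera 1C3-8, per FIR MHz.
ratio = cost_per_fir_mhz(24.59, 64, 200) / cost_per_fir_mhz(14.00, 8, 230)
print(round(ratio, 1))
```

The ratio comes out at roughly 16, which is where the headline improvement figure comes from.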
Reconfigurable video processor for SDRAM access optimization
(Henriss, Ernst et al.)

Reconfigurable video platform
· SDRAM memory centered design
· FPGA-based scheduler merges different streams and random accesses, exploiting the SDRAM bank structure
· supports 2 HDTV streams at 1.48 Gbit/s each plus DSP and filter unit access
· reaches 700 MByte/s in practical application for 4 Byte SDRAM memory word
· extremely cost efficient design
· used in professional video product line

Fine-Grained RSOCs: Triscend A7 CSOC
• A7 Family
• 32-bit ARM 7 with 8kB Cache
• 3200 logic cells max. (40K gates)
• Up to 3800 FF’s
• Up to 300 Prog. I/O pins
• www.triscend.com

Coarse-Grained RSOCs: Chameleon Structure (2000)
Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters, www.chameleonsystems.com
• 32-bit ARC control processor
• Up to 84 32-bit Datapath Units
• DPU = a 32-bit ALU + a 32-bit barrel shifter
• Up to 24 of 16x24-bit multipliers
• Up to 48 of 128x32-bit local memory modules
• Up to 160 Prog. I/O pins
• Targeted at 3rd gen. wireless basestation, wireless local loop, SW radio, etc.
Design a battery powered personal mobile computing device that has multimedia functionality and can operate in a dynamic environment.
- Do just enough and not too much for a given task (QoS)

Field Programmable Function Array
• The FPFA concept has a number of advantages
– The FPFA has a highly regular organisation
– We use general purpose processor cores
– Its scalability stands in contrast to the dedicated chips designed nowadays
– The FPFA can do media processing tasks such as compression/decompression efficiently

Field Programmable Function Array
[Figure: processor tile with five ALUs, ten memories (M), a CrossBar and Registers]
• Processor tiles
– Consist of five identical blocks, which share a control unit and a communication unit
– An individual block contains an ALU, two memories and four register banks of four 20-bit wide registers
– A crossbar switch provides flexible routing between the ALUs, registers and memories
– This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the finite impulse response filter

DSP System Architecture Options
[Figure: architecture options plotted against Performance (MMACs/sec): Stand-Alone Processor (one DSP), Processor + Co-Processor, Processor Array (4 x 4 DSPs), Dedicated Hardware Architecture]

Optional Coprocessor Mappings
Processor External to FPGA (Processor and Memory alongside the FPGA)
• TI c6x (EMIF)
• Mot PPC (MPX)
• Mot Starcore (MPX, AHB)
• Intel 2850 (PCI Express)
• ARM (AHB)
• …..
Processor On FPGA
• Nios
• ARM (AHB)

Mapping of DSP Algorithms on the FPFA
• Fast Fourier Transform
– FFT recursively divides a DFT into smaller DFTs
[Figure: Recursion of a radix 2 FFT with 8 inputs; the N=8 transform is split step by step down to N=2 DFTs]
[Figure: The radix 2 FFT butterfly, with inputs a and b, twiddle factor W, and sum (+) and difference (-) outputs]
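The recursion and the butterfly can be sketched in Python as follows. This is a textbook decimation-in-time radix-2 FFT given for illustration only, not the FPFA mapping itself; the function names and the test vector are assumptions, and the input length must be a power of two.

```python
import cmath


def butterfly(a, b, w):
    """Radix-2 butterfly: returns (a + W*b, a - W*b)."""
    t = w * b
    return a + t, a - t


def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # Split the DFT into two half-size DFTs (even and odd samples).
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out


spectrum = fft_radix2([1, 2, 3, 4, 0, -1, -2, -3])
print([round(abs(v), 3) for v in spectrum])
```

With complex a, b and W, each butterfly takes six real inputs and produces four real outputs.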

OMAP™ (Open Multimedia Application Platform)
• The OMAP architecture has SW/OS that can control the clocking and the idle modes of the entire platform.
• The dual-core architecture makes it possible to assign each task to the most suitable processor.

Mapping of DSP Algorithms on the FPFA
• Five-tap finite-impulse response filter
[Figure: five blocks numbered 1 to 5 holding coefficients h4, h3, h2, h1 and h0, connected through the Cross Bar (Level 2)]
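A five-tap FIR filter computes each output as the weighted sum of the five most recent inputs, y[n] = h0·x[n] + h1·x[n-1] + … + h4·x[n-4]. The Python sketch below is a plain direct-form reference model with made-up coefficient values; it does not model how the five multiply-accumulates are distributed over the FPFA blocks.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: each output is the dot product of taps and history."""
    history = [0] * len(taps)  # most recent sample first
    outputs = []
    for sample in samples:
        # Shift the delay line and insert the new sample at the front.
        history = [sample] + history[:-1]
        outputs.append(sum(h * x for h, x in zip(taps, history)))
    return outputs


TAPS = [1, 4, 6, 4, 1]  # hypothetical h0..h4
print(fir_filter([1, 0, 0, 0, 0], TAPS))  # impulse response
print(fir_filter([1, 1, 1, 1, 1, 1], TAPS))  # step response
```

Feeding in a unit impulse returns the coefficients in order, which is a quick way to check any FIR implementation.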

MorphoSys (1999)
Reconfigurable cell
RC Array
• Array of reconfigurable cells
• 64 cells in a 2-D matrix
• SIMD model
• Same row (column) shares configuration
• Each RC operates on different data

TinyRISC (Cont’d)
Implementation & Performance
• 0.35 micron technology
• 4 metal layers
• Operation at 100 MHz
• 170 mm²
Motion Estimation
Block size: 16x16 pixels, Image size: 352x288 pixels

Lx from STMicroelectronics
DART, Raphael David, IRISA/ENSSAT
With STMicroelectronics, UBO univ.
• Reconfigurable multigrain = DPR + FPGA
• Dynamic reconfiguration
• Low power consumption
• Hierarchical distribution of resources
• SCMD (Single Configuration Multiple Data)
DART Cluster
• 11 GOPS/cluster
• 1.6 GMACS/cluster
• 0.64 W @ 11 GOPS
• 16 MIPS/mW @ 11 GOPS
• 0.18u CMOS

Cluster architecture
[Figure: Control, Config mem., FPGA, DMA ctrl, Data mem and DPR1 to DPR6 connected by a Segmented network]

DPR architecture
[Figure: reg1, reg2, MUL1, ALU1, MUL2 and ALU2 on a Multibus network with Data mem1 to Data mem4, AG1 to AG4, Loop management and the Global bus]

• Run-time configurable ASIC: DS spreading, Chip shaping (FIR filter), Timing recovery, Antijam, transmission security, Correlator (low precision arithmetic to reduce power consumption)
• Maximize the number of functions performed by the DSP: Data burst, FEC, Interleaving
• Adaptive S.P. Deinterleaver, Adaptive Decoder
• Areas where SDR technology can be applied
[Table: columns Hardware, Software-Controlled Hardware, Programmable Software, Post-Shipping Programmable Software; entries listed: Antenna, VCO, Baseband B/W, Output Power, Modulator (Switched), Encryption, RF Selectivity, IF, Chip-rate processing, Modulation, Encryption, Smart Antenna Signal Processing, Source coding, IF Selectivity, Power-Management, Symbol-rate processing, User-interface]

• Typical Signal Processing blocks in Software Defined Radio – SDR Forum Recommended
[Figure: blocks shown: AIR, ANTENNA, RF, Channel Selector/Combiner, BB/IF (Real/Complex, Digital/Analog), Baseband Processing (DSP), Call/Message Processing & I/O, NSS/Network (Voice/PSTN, Data/IP, Flow Control, Multimedia/WAP, ROUTING), MONITOR/CONTROL and Common System Equipment; interfaces labelled I/O, I, C, AUX, RF, BB, Text, Flow Control bits; support signals: Ext. Ref, Clock/Strobe, Ref, Power, Remote Control/Display, Local Control]

• ADC sampling rate
• dynamic range (determines precision of arithmetic operations)
• translation of digital IF to baseband
• modulation/demodulation algorithms
• error coding/decoding algorithms
• synchronization algorithms

Soft Radio Research Group
• DARPA’s Adaptive Computing Systems Project
• Virginia Tech
• University of California at Berkeley
• Brigham Young University
• Chameleon Systems Inc.
• Morphic Inc.
• Quicksilver Technology Inc.
• Sirius Inc.

• low power: low-power DSP and MCU processor in combination with a small, low power programmable logic device (PLD)
– Functions needed for GSM Phase 2+ or UMTS terminal
– DSP16000 and ARM7 MCU, Xilinx’s CoolRunner PLD with extremely low power consumption (<0.5mA)
• serves as HW co-processor for MCU, DSP or both
• reconfigurable coprocessor
• SW part designed in Processor Expert™
• Embedded Beans library

• Object oriented, component based embedded application CASE development tool
– code portability, component reusability
– expert knowledge system assistance
– virtual prototyping
– IP sharing by embedded components exchange
• GSM - UMTS
– components (Embedded Beans) as building blocks
• MCU expert knowledge system
– calculates overall system timing propagation
– automatic connection of peripherals
– verifies the application timing
• Processor Expert™ generates resulting source code (in selected language – typically C, ASM, C++ or VHDL).

Xilinx Virtex FPGA: intelligent configuration mechanism for fast and partial reconfiguration
• Increasing density and reducing power
• Includes extra functions to support digital signal operations, such as extra arithmetic support and increased RAM
• Dynamic reconfiguration is also supported
LUT:
o look-up table for logic functions
o wide RAM or ROM
o shift register
Control:
o Combination of both LUTs
o Arithmetic support
o Carry control
o Route through
Configurable storage element:
o clocking mode
o polarity, asynchronous reset
Block RAM: large resource for storage of application data
Input Output Blocks (IOBs): configurable interfacing
[Figure: Virtex floorplan: standard array of CLBs (each with LUTs, Control and Configurable storage elements), BRAMs, the surrounding VersaRing and IOBs, and four DLLs]

[Figure: design flow blocks: Algorithm Definition & Specification, Optimization of Hardware Structure, Performance Est., DSP/MCU Requirement, ASIC/FPGA, Verification]
Complexity of Reconfiguration
• processor technology, such as DSPs, FPGAs
• Complexity & Levels of Reconfiguration Complexity
• Software Repository and Access Methods
• Transparent Reconfiguration, Reconfiguration Signalling, Verifying the Reconfiguration
• Selective Redefinition of Module(s)
• Micro and Macro level Process Management

[Figure: three terminal architectures]
• Multi-mode terminal with parallel modes: Mode 1, Mode 2, … Mode n, each with its own RF and BB signal processing
• Multi-mode terminal with software defined signal processing: several RF sections, a memory for the parameter set and programmable high power baseband signal processing
• Fully adaptive software reconfigurable system: flexible and adaptive RF front end with baseband signal processing
RF: converts the received signal to an IF or baseband signal
BaseBand: modulation unit, channel codec unit, channelizer, encryption unit, time/phase tracking unit
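
• Multi-band antenna
• Linear wideband RF components
• Wideband A/D and D/A converters
• High-performance DSP / reconfigurable logic
Antenna
• Smart antenna
• High-efficiency linear antenna
RF
• Wideband, miniaturized
• High-efficiency, linear RF power amplifier
• Design free of interference and noise when operating at the same time as other signals
• High-frequency components with characteristics equal to single-mode parts
ADC
• First IF stage (analog down-conversion) - ADC - second IF stage (digital down-conversion)
• Band pass sigma delta architecture
DSP
• Performance high enough to implement the baseband section in software
• TMS320C62X: peak performance 1600 MIPS, TMS320C64X: 4800 MIPS
Reconfigurable Logic
• FPGA, RC (Reconfigurable Computing) ASIC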

[Figure: blocks shown: RF Conversion to IF and A/D, I/O controller, Process Controller, Temporary Storage Buffer, Output and interface with host PC, two Program Memories, Formation of Stream Packets/Interpretation, Interconnecting Array of Processing Elements]
Configurable ASIC
• Best solution when an appropriate level of programmability and integration is to be provided
• Low programmability, integration level
FPGA
• Best programmable solution for high-speed parallel linear signal processing
• High power consumption, large chip size
DSP
• Best programmable solution for functions that involve complex analysis and decision making
• Lower performance than ASIC and FPGA
Comparison criteria: Programmability, Level of Integration, Development/Implementation/Test Cycle, Performance in required processing time, Power.

[Figure: baseband functions: Multiplexing & Burst Construction, Encryption, Channel Coding, Interleaving, Data Processing, CRC insertion, Modulation, Sequencer, Spreading, Equalization, Rate matching, Channelization, Segmentation, Radio Resource]
Data path routing: macro functions composed of ASIC or FPGA or both, Routing Device, Sequence
Advantage: Only simple program (scheduling), factorization for common function
Drawback: Restricts re-configurability to within a macro

Systematic re-programming of whole baseband module; the new standard is installed on the same hardware
Advantage: Low complexity of hardware
Drawback: Slower reconfiguration process; if reconfiguration fails, the system will not operate, so a default mode is necessary
[Figure: an MPU with four FPGAs shown in three phases: Previous Standard is running, Reconfiguration, Present Standard is running]

Systolic Ring: Scalable Structure
Pascal BENOIT, G. Sassatelli, L. Torres, D. Demigny, M. Robert, G. Cambon

Systolic Ring
• Based on a coarse-grained configurable PE
• Circular datapaths
– C: # of layers (C = 4)
– N: # of Dnodes per layer (N = 2)
– S: # of Rings (S = 1)
• Control Units (sequencer)
– Local Dnode unit
– Local Ring unit
– Global unit
[Figure: ring of four layers (layer 1 to layer 4) with two Dnodes each, linked by Switches, with a Dnode Sequencer and a Local Ring Sequencer]

Remanence
$$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$$
• N_PE: # of processing elements (PE)
• N_c: # of PEs configurable per cycle
• F_e: operating frequency
• F_c: configuration frequency
• Characterizes the dynamism
• # of cycles to (re)configure the whole architecture
• Amount of data to compute between 2 configurations
[Figure: architecture model: Sequencer (Sequencing Unit), Configuration Memory (inst0, inst1, inst2, inst3, …, instn), Routing (Interconnection) and Processing Elements (PE)]

Operative Density
$$OD(N_{PE}) = \frac{N_{PE}}{A(N_{PE})}$$
• N_PE: # of PE
• A: Core Area (relative unit²)
• Area can be expressed as a function of N_PE
[Figure: architecture model: Sequencer (Sequencing Unit), Configuration Memory (inst0, inst1, inst2, inst3, …, instn), Routing (Interconnection) and Processing Elements (PE)]

Remanence formalisation
• # of layers: C = 8
• # of Dnodes per layer: N = 2
• 1 Systolic Ring: S = 1
$$R(N_{PE}) = \sqrt{k \cdot N_{PE}}, \quad k = C/N$$
[Figure: Systolic Ring with eight layers (layer 1 to layer 8) of two Dnodes each, linked by Switches]
[Plot: REMANENCE (0 to 40) versus # Dnodes (0 to 180), with curves for k = 1, k = 2, k = 4 and k = 8]

Architectural model Characterization
• # of layers: 4 (C = 4)
• # of Dnodes per layer: 2 (N = 2)
• 4 Systolic Rings (S = 4)
Control Units
• Local Dnode unit
• Local Ring unit
• Global unit
[Figure: four Systolic Rings of Dnodes and Switches, each with a Local Ring Sequencer, connected by a Global Bus to the Global Sequencer]
• www.qstech.com

Best OD and remanence
[Plot: Operative Density (0.000 to 0.040) and Remanence (0 to 20) versus # Dnodes (0 to 140), with curves for S = 1, S = 2, S = 4 and S = 8]
Design Space: Worst interconnect resources and processing power

Worst OD and remanence
[Plot: Operative Density (0.000 to 0.040) and Remanence (0 to 20) versus # Dnodes (0 to 140), with curves for S = 1, S = 2, S = 4 and S = 8]
Design Space: Best interconnect resources and processing power

Comparisons of RA
1. Only 1 cycle to (re)configure the DSP
2. Few cycles to (re)configure coarse grain RA (8)
3. Many cycles to (re)configure fine grain RA
Name            Type              N_PE   N_c    F (MHz)   R
ARDOISE         Fine Grain RA     2304   0.14   33        16457
Systolic Ring   Coarse Grain RA   24     4      200       6
DART            Coarse Grain RA   24     4      130       6
MorphoSys       Coarse Grain RA   128    16     100       8
TMS320C62       DSP VLIW          8      8      300       1
$$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$$
Pascal BENOIT
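The R column can be recomputed from the remanence definition. The Python sketch below assumes that the operating and configuration frequencies are equal (F_e = F_c), since the comparison lists a single frequency per architecture; under that assumption R reduces to N_PE / N_c. The dictionary layout and function name are illustrative.

```python
def remanence(n_pe, n_c, f_e=1.0, f_c=1.0):
    """R = (N_PE * F_e) / (N_c * F_c): cycles needed to reconfigure everything."""
    return (n_pe * f_e) / (n_c * f_c)


# name: (N_PE, N_c), values as listed in the comparison
ARCHITECTURES = {
    "ARDOISE": (2304, 0.14),
    "Systolic Ring": (24, 4),
    "DART": (24, 4),
    "MorphoSys": (128, 16),
    "TMS320C62": (8, 8),
}

for name, (n_pe, n_c) in ARCHITECTURES.items():
    # F_e = F_c is assumed, so the frequency terms cancel.
    print(f"{name:14s} R = {round(remanence(n_pe, n_c))}")
```

A small R means the whole architecture can be reconfigured in a few cycles, which is the case for the DSP and the coarse grain architectures but not for the fine grain one.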