From Embedded System to Digital IC Design. Prof. Pei-Yin Chen (陳培殷), Department of Computer Science and Information Engineering, National Cheng Kung University
Post on 15-Jan-2016
PC: a general-purpose computing system
[Figure: a PC built around a Pentium processor on a PCB]
Embedded System: a special-purpose computing system
[Figure: an embedded system on a PCB]
Most embedded systems are designed for
1. special-purpose applications (customized and non-programmable)
2. real-time applications
3. stable applications
4. automatic applications
Embedded System (1/2)
[Block diagram: CPUs (uP), UART, MPEG, ROM, RAM]
Embedded System (2/2)
Traditional embedded systems use only low-end processors.
Advanced embedded systems: multi-core
[Block diagram: ARM, DSP, PCI, MPEG, USB, FLASH, ROM, RAM connected by AMBA]
Applications
Information Appliances (IA):
1. Smart phone, VOIP
2. Digital TV, set-top box
3. PlayStation
4. PDA, MP3 player
5. Camera, DV
6. Air conditioner, microwave oven, refrigerator,
vacuum cleaner, sensor network
7. Motorcycles
8. Cars (ABS, engine ignition, airbags): more than 100 processors
9. … Ubiquitous computing (many computers for everyone)
Applications everywhere!
Requirements
1. Friendly user interface
2. Multiple-rate matching
3. Short time-to-market
4. Real-time (Speed)
5. Cost
6. Power consumption/dissipation
(cooling strategy and battery life)
7. Distributed property
Power Consumption
The basic equation for the average power consumption in CMOS:
$P_{avg} = \alpha \, C \, V^2 f$
$V$: supply voltage
$C$: capacitance
$f$: clock frequency (*)
$\alpha$: average number of 0-to-1 transitions (*)
(*) Reduction techniques: transition reduction, sleep mode
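The power equation above can be explored numerically. The following is a minimal sketch, assuming the standard form P = alpha * C * V^2 * f; the symbol `alpha` for the average number of 0-to-1 transitions and every operating-point value are assumptions chosen for illustration, not figures from the slides.

```python
def dynamic_power(alpha, capacitance_f, vdd_v, freq_hz):
    """Average CMOS switching power in watts: alpha * C * V^2 * f."""
    return alpha * capacitance_f * vdd_v ** 2 * freq_hz

# Hypothetical operating point (illustrative values only).
base = dynamic_power(alpha=0.2, capacitance_f=1e-9, vdd_v=1.8, freq_hz=100e6)

# Halving the clock frequency halves the power.
half_clock = dynamic_power(alpha=0.2, capacitance_f=1e-9, vdd_v=1.8, freq_hz=50e6)

# Halving the supply voltage cuts the power to one quarter (V is squared).
low_vdd = dynamic_power(alpha=0.2, capacitance_f=1e-9, vdd_v=0.9, freq_hz=100e6)

print(base, half_clock, low_vdd)
```

Because the voltage term is squared, lowering the supply voltage pays off more than lowering the frequency or the transition count by the same factor.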
Distributed property
[Figure: Machines #1 to #4, each running its own OS (OS#1 to OS#4), connected by a network; a middleware layer spans the machines and supports the distributed applications]
Design Flow
Specification
System Architecture
Hardware Design Software Design
System Integration
System Verification/Testing
Hardware/software partitioning is
very difficult (it is time-consuming)
Synthesis
System Development Flow
Applications
Specifications
  Transmission distance: 100 m
  Power consumption: 2 W
  Functions: video transmission, voice transmission
  Screen: 176 x 220 pixels, 65,535 colors, 1.8-inch TFT
  Others
System Design
  Hardware: CPU, RAM, I/O, …
  Software: C, C++
Component Design
  always @(posedge clk) begin
    if (sel1) out <= in1;
    else out <= in2;
  end
Layout
Placement & Routing
Fabrication
Testing
Marketing
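IC Industry in Taiwan
[Figure: the IC industry chain. Main stages: design, mask, manufacturing, packaging, testing. Related steps and suppliers: logic design, mask design, crystal growth, wafer slicing, wafers, chemicals, lead frames, die testing and dicing, packaging, final product testing]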
Hardware Design -- Chip (1/4)
[Block diagram: ARM, DSP, PCI, MPEG, USB, FLASH, ROM, RAM and an ASIC connected by AMBA]
ASIC: application-specific integrated circuit, such as USB, MPEG, ….
Full-custom design vs. semi-custom (cell-based) design
The basic design flow for a digital cell-based ASIC:
1. Describe the circuits in a hardware description language (HDL): VHDL or Verilog
2. Synthesize the circuits ….
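Hardware Design -- Chip (2/4)
Example:
  always @(IN) begin
    OUT = (IN[0] | IN[1]) & (IN[2] | IN[3]);
  end
[Figure: the corresponding circuit. IN[0] and IN[1] feed one OR gate, IN[2] and IN[3] feed another, and the two OR outputs feed an AND gate that drives OUT]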
Hardware Design -- Chip (3/4)
Process of logic synthesis
Synthesis = Translation + Optimization + Mapping
HDL Source -> Translate into Boolean Representation -> Optimize + Map -> Target Technology
Example HDL source:
  always @(…)
    if (a==b)
      if (c==1) d=f;
      else d=1;
    else d=0;
[Figures: two gate-level circuits with inputs a, b, c, f and output d]
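To make the translation step concrete, the sketch below checks one candidate Boolean form of the nested if/else against the behavioral description over all input combinations. The expression d = XNOR(a, b) AND (NOT c OR f) is a hand derivation for 1-bit signals, offered as an assumption; it is not the output of any synthesis tool, and the function names are hypothetical.

```python
from itertools import product

def behavioral(a, b, c, f):
    """Mirror of the nested if/else from the HDL source (1-bit signals)."""
    if a == b:
        if c == 1:
            return f
        return 1
    return 0

def boolean_form(a, b, c, f):
    """Candidate Boolean expression: d = XNOR(a, b) AND (NOT c OR f)."""
    xnor_ab = 1 - (a ^ b)
    return xnor_ab & ((1 - c) | f)

# Exhaustively compare the two descriptions on all 16 input combinations.
mismatches = [bits for bits in product((0, 1), repeat=4)
              if behavioral(*bits) != boolean_form(*bits)]
print("mismatches:", mismatches)
```

An exhaustive comparison like this is only practical for small input counts; it stands in here for the equivalence that a synthesis flow must preserve between the HDL source and the mapped netlist.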
Hardware Design -- Chip (4/4)
Two implementations:
1. Real ASIC chip (standard cell), manufactured by a fab (TSMC, UMC, ..):
less flexible, long design cycle, higher speed,
larger-scale production to reduce price
2. FPGA or CPLD (PLD), from Xilinx, Altera:
more flexible, shorter design cycle, lower speed, lower utilization,
suitable for smaller production
Hardware Design -- System
[Block diagram: ARM, DSP, PCI, MPEG, USB, FLASH, ROM, RAM and ASIC connected by AMBA, with input devices and output devices attached]
Input devices: keyboard, touch screen, switch, button, ..
Output devices: monitor, LCD, LED, …
Extended devices: CompactFlash (CF) card, PCMCIA, SD
(for storage, wireless communication, I/O)
Power system:
Transmission interface: PCI, USB, IEEE 1394, UART, Bluetooth, …
Bus: AMBA (Advanced Microcontroller Bus Architecture)
Firmware Design
[Block diagram: ARM, DSP, PCI, MPEG, USB, FLASH, ROM, RAM and ASIC connected by AMBA, with input devices and output devices attached]
Device drivers for I/O devices, extended devices, and transmission interfaces
Assembly and C code for specific CPUs (ARM, 8086, ..)
Architectures and instruction sets of different CPUs, DMA, …
Software Design
[Block diagram: ARM, DSP, PCI, MPEG, USB, FLASH, ROM, RAM and ASIC connected by AMBA, with input devices and output devices attached]
Embedded OS: WinCE, Palm OS, uC/OS, Linux, Java
Real-time OS (time), as small as possible (memory)
Distributed embedded system (+ fault tolerance)
Application software:
wireless communication, network, multimedia,
health, convenience, Web, ….
Porting a customized embedded system to a different
machine is very difficult (it requires extensive modification)
Future
Chip: tens of millions of transistors or more (0.35, 0.25, 0.18, 0.09 µm)
Design shifts from ASIC/board to system
System on a board (printed circuit board)
[Figure: separate uP, FPGA, MPEG, ASIC, ATM and ROM chips, each with its own software (SW), mounted on a PCB]
System on a chip (SOC)
[Figure: a single chip containing a uP core, SRAM, ROM, ATM, MPEG, FPGA, glue logic and an A/D block, mounted on a PCB]
System-on-a-chip is possible
(the whole system is
built in a single chip)
SOC is the industry trend
Example: Mobile Phone
Yesterday: voice only; 2 processors; 4-year product life cycle; short talk time
Today: voice, data, video, SMS; <12-month product life cycle; lower power; longer talk time
• 5~8 processors
• Memory
• Graphics
• Bluetooth
• GPS
• Radio
• WLAN
[Figure: a single chip integrating DSP, radio, flash memory and processor]
Source: EI-SONICS
Current Work in My DIC LAB
Hardware Algorithm and VLSI Implementation for
1. H.264
2. Color Filter Array
3. Image Scaling
4. Image Noise Suppression
5. Wide Angle Correction
Example: Barrel Distortion Correction
Wide-angle cameras are widely used in imaging applications. Images captured through a wide-angle lens suffer from
barrel distortion.
DIS: Distorted Image Space
CIS: Corrected Image Space
Barrel Distortion Correction
Motivation: low cost, real time, best possible quality
Previous work: T.H. Ngo and K.V. Asari,
A Pipeline Architecture for Real-Time Correction of Barrel Distortion in Wide-Angle Camera Images
(IEEE Trans. Circuits and Systems for Video Technology, vol. 15, no. 3, March 2005)
Ngo's method has four steps:
1. CORDIC (Cartesian to Polar)
2. Back Mapping
3. CORDIC (Polar to Cartesian)
4. Linear Interpolation
[Figure: Block diagram of Ngo's architecture. $(u, v)$ -> Cartesian to Polar Coordinate Transformation -> Back Mapping -> Polar to Cartesian Coordinate Transformation -> $(u', v')$ -> Linear Interpolation -> $I(u, v)$. Back mapping relates the point $(u, v)$ in the CIS to the point $(u', v')$ in the DIS.]
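Input and output of each step in Ngo's method:
Step 1: Cartesian to Polar Coordinate
  Inputs: $(u, v)$, $(u_c, v_c)$
  Outputs: $\rho = \sqrt{(u - u_c)^2 + (v - v_c)^2}$, $\theta = \arctan\left(\frac{v - v_c}{u - u_c}\right)$
Step 2: Back Mapping
  Inputs: $(\rho, \theta)$, $b_1 \sim b_N$
  Output: $\rho' = \sum_{n=1}^{N} b_n \rho^n$
Step 3: Polar to Cartesian Coordinate
  Inputs: $(\rho', \theta)$, $(u'_c, v'_c)$
  Outputs: $u' = u'_c + \rho' \cos\theta$, $v' = v'_c + \rho' \sin\theta$
Step 4: Linear Interpolation
  Input: $(u', v')$
  Output: $I(u, v)$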
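The first three steps of Ngo's method can be sketched in software as follows. This is an illustration of the equations only, not of the CORDIC hardware: the function name `ngo_back_map` and the sample coefficients are hypothetical, and `math.atan2` is used in place of a plain arctangent so that all four quadrants are handled.

```python
import math

def ngo_back_map(u, v, center_cis, center_dis, b):
    """Map a CIS pixel (u, v) to DIS coordinates (u', v') via polar form."""
    uc, vc = center_cis
    ucd, vcd = center_dis
    # Step 1: Cartesian to polar, relative to the CIS center.
    rho = math.hypot(u - uc, v - vc)
    theta = math.atan2(v - vc, u - uc)
    # Step 2: back mapping, rho' = sum of b_n * rho^n for n = 1..N.
    rho_d = sum(bn * rho ** n for n, bn in enumerate(b, start=1))
    # Step 3: polar to Cartesian, relative to the DIS center.
    return ucd + rho_d * math.cos(theta), vcd + rho_d * math.sin(theta)

# Hypothetical coefficients b1 = 1, b2 = 0, b3 = 0.01.
print(ngo_back_map(3.0, 4.0, (0.0, 0.0), (0.0, 0.0), [1.0, 0.0, 0.01]))
```

Step 4 would then interpolate the DIS intensities around the returned (u', v'). The square root and the trigonometric functions in steps 1 and 3 are the operations that the CORDIC units implement in hardware.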
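Proposed Method (1/2)
Drawback of Ngo: CORDIC
Goal: simplify Ngo's method
Proposed steps:
1. Modified Back Mapping
2. Linear Interpolation
[Figure: Ngo's pipeline, $(u, v)$ -> Cartesian to Polar Coordinate Transformation -> Back Mapping -> Polar to Cartesian Coordinate Transformation -> $(u', v')$ -> Linear Interpolation -> $I(u, v)$, compared with the proposed pipeline, $(u, v)$ -> Modified Back Mapping -> $(u', v')$ -> Linear Interpolation -> $I(u, v)$]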
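Proposed Method (2/2)
Input and output of Ngo:
Step 1: Cartesian to Polar Coordinate
  Inputs: $(u, v)$, $(u_c, v_c)$
  Outputs: $\rho = \sqrt{(u - u_c)^2 + (v - v_c)^2}$, $\theta = \arctan\left(\frac{v - v_c}{u - u_c}\right)$
Step 2: Back Mapping
  Inputs: $(\rho, \theta)$, $b_1 \sim b_N$
  Output: $\rho' = \sum_{n=1}^{N} b_n \rho^n$
Step 3: Polar to Cartesian Coordinate
  Inputs: $(\rho', \theta)$, $(u'_c, v'_c)$
  Outputs: $u' = u'_c + \rho' \cos\theta$, $v' = v'_c + \rho' \sin\theta$
Step 4: Linear Interpolation
  Input: $(u', v')$
  Output: $I(u, v)$
Input and output of our circuit:
Step 1: Modified Back Mapping
  Inputs: $(u, v)$, $(u_c, v_c)$, $c_1 \sim c_N$, $(u'_c, v'_c)$
  Outputs: $\rho^2 = (u - u_c)^2 + (v - v_c)^2$
           $u' = u'_c + (u - u_c)(1 + c_1 \rho^2 + c_2 \rho^4 + c_3 \rho^6 + \ldots)$
           $v' = v'_c + (v - v_c)(1 + c_1 \rho^2 + c_2 \rho^4 + c_3 \rho^6 + \ldots)$
Step 2: Linear Interpolation
  Input: $(u', v')$
  Output: $I(u, v)$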
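The modified back mapping needs only $\rho^2$, so no square root, arctangent, sine or cosine is required. The sketch below compares it with a polar-form reference under the assumption that the radial polynomial is $\rho' = \rho(1 + c_1\rho^2 + c_2\rho^4 + \ldots)$; the function names and coefficient values are hypothetical.

```python
import math

def modified_back_map(u, v, center_cis, center_dis, c):
    """u' = u'_c + (u - u_c) * (1 + c1*rho^2 + c2*rho^4 + ...), same for v'."""
    uc, vc = center_cis
    ucd, vcd = center_dis
    du, dv = u - uc, v - vc
    rho2 = du * du + dv * dv
    scale, power = 1.0, 1.0
    for cn in c:
        power *= rho2          # rho^2, rho^4, rho^6, ...
        scale += cn * power
    return ucd + du * scale, vcd + dv * scale

def polar_reference(u, v, center_cis, center_dis, c):
    """Same mapping computed through (rho, theta), for comparison only."""
    uc, vc = center_cis
    ucd, vcd = center_dis
    rho = math.hypot(u - uc, v - vc)
    theta = math.atan2(v - vc, u - uc)
    rho_d = rho * (1.0 + sum(cn * rho ** (2 * n)
                             for n, cn in enumerate(c, start=1)))
    return ucd + rho_d * math.cos(theta), vcd + rho_d * math.sin(theta)

args = (13.0, 9.0, (10.0, 5.0), (12.0, 6.0), [0.001, 0.00001])
print(modified_back_map(*args))
print(polar_reference(*args))
```

Both functions return the same point up to rounding, because $\rho\cos\theta = u - u_c$ and $\rho\sin\theta = v - v_c$; the common factor is the polynomial in $\rho^2$.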
Proposed VLSI Architecture (1/7)
The first three steps are combined into one step:
1. Mapping (Modified Back Mapping)
2. Linear Interpolation
We develop a low-cost 21-stage pipelined VLSI architecture.
[Figure: $(u, v)$ -> Mapping -> $(u', v')$ -> Linear Interpolation -> $I(u, v)$]
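Proposed VLSI Architecture (2/7)
State flow chart of our two-step procedure (states S1 to S17)
INPUT: $(u_c, v_c)$, $c_1, c_2, c_3, c_4$, $(u'_c, v'_c)$
OUTPUT: $I(u, v)$
Start
Get new $(u, v)$
Mapping:
  $t_1 = u - u_c$, $t_2 = v - v_c$
  $t_3 = t_1 \times t_1$, $t_4 = t_2 \times t_2$
  $t_5 = t_3 + t_4$ $(= \rho^2)$
  $t_6 = t_5 \times t_5$ $(= \rho^4)$, $t_7 = t_5 \times c_1$ $(= c_1\rho^2)$, $t_8 = t_5 \times c_3$ $(= c_3\rho^2)$
  $t_9 = t_6 \times t_6$ $(= \rho^8)$, $t_{10} = t_8 \times t_6$ $(= c_3\rho^6)$, $t_{11} = t_6 \times c_2$ $(= c_2\rho^4)$
  $t_{12} = t_9 \times c_4$ $(= c_4\rho^8)$, $t_{13} = t_{10} + t_{11}$ $(= c_3\rho^6 + c_2\rho^4)$, $t_{14} = t_7 + 1$ $(= 1 + c_1\rho^2)$
  $t_{15} = t_{13} + t_{14}$
  $t_{16} = t_{15} + t_{12}$ $(= 1 + c_1\rho^2 + c_2\rho^4 + c_3\rho^6 + c_4\rho^8)$
  $t_{17} = t_{16} \times t_1$, $t_{18} = t_{16} \times t_2$
  $u' = u'_c + t_{17}$, $v' = v'_c + t_{18}$
Linear Interpolation:
  $x = \lfloor u' \rfloor$, $x' = u' - x$, $y = \lfloor v' \rfloor$, $y' = v' - y$
  $t_{19} = x + 1$, $t_{20} = y + 1$, $t_{21} = 1 - x'$, $t_{22} = 1 - y'$
  Read $I(x, y)$, $I(t_{19}, y)$, $I(x, t_{20})$, $I(t_{19}, t_{20})$
  $t_{23} = t_{21} \times t_{22}$, $t_{24} = x' \times t_{22}$, $t_{25} = t_{21} \times y'$, $t_{26} = x' \times y'$
  $t_{27} = I(x, y) \times t_{23}$, $t_{28} = I(t_{19}, y) \times t_{24}$, $t_{29} = I(x, t_{20}) \times t_{25}$, $t_{30} = I(t_{19}, t_{20}) \times t_{26}$
  $t_{31} = t_{27} + t_{28}$, $t_{32} = t_{29} + t_{30}$
  $I(u, v) = t_{31} + t_{32}$
Another pixel? If yes, get a new $(u, v)$; if no, stop.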
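The state flow can be followed in software one temporary at a time. The sketch below is a floating-point model under stated assumptions: the temporaries t1 to t32 follow the reading of the flow chart given by the modified back mapping and bilinear interpolation formulas, `image[y][x]` is a hypothetical row-major layout for the DIS intensity I(x, y), and no bounds checking or fixed-point rounding is modeled.

```python
import math

def correct_pixel(u, v, image, center_cis, center_dis, c):
    """Compute I(u, v) for one CIS pixel by following t1..t32 in order."""
    uc, vc = center_cis
    ucd, vcd = center_dis
    c1, c2, c3, c4 = c
    # Mapping (modified back mapping).
    t1, t2 = u - uc, v - vc
    t3, t4 = t1 * t1, t2 * t2
    t5 = t3 + t4                                   # rho^2
    t6, t7, t8 = t5 * t5, t5 * c1, t5 * c3
    t9, t10, t11 = t6 * t6, t8 * t6, t6 * c2
    t12, t13, t14 = t9 * c4, t10 + t11, t7 + 1
    t15 = t13 + t14
    t16 = t15 + t12                                # 1 + c1*rho^2 + ... + c4*rho^8
    t17, t18 = t16 * t1, t16 * t2
    ud, vd = ucd + t17, vcd + t18                  # (u', v') in the DIS
    # Linear interpolation over the four neighboring DIS pixels.
    x, y = math.floor(ud), math.floor(vd)
    xf, yf = ud - x, vd - y                        # fractional parts x', y'
    t19, t20, t21, t22 = x + 1, y + 1, 1 - xf, 1 - yf
    t23, t24, t25, t26 = t21 * t22, xf * t22, t21 * yf, xf * yf
    t27 = image[y][x] * t23
    t28 = image[y][t19] * t24
    t29 = image[t20][x] * t25
    t30 = image[t20][t19] * t26
    t31, t32 = t27 + t28, t29 + t30
    return t31 + t32

# Hypothetical 8x8 test image with I(x, y) = 10*x + y.
ramp = [[10 * x + y for x in range(8)] for y in range(8)]
print(correct_pixel(2, 3, ramp, (0.0, 0.0), (0.5, 0.25), (0.0, 0.0, 0.0, 0.0)))
```

On a linear ramp image bilinear interpolation is exact, so the result equals 10*u' + v', which makes the ramp a convenient sanity check for the sequence of temporaries.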
Proposed VLSI Architecture (3/7)
Block diagram of our VLSI architecture
It generates the intensity value of one pixel in
the CIS at every clock cycle.
Components:
1. The mapping unit
2. The linear interpolation unit
3. The memory bank
4. The controller
[Figure: $(u, v)$ -> Mapping Unit -> $(u', v')$ -> Linear Interpolation Unit -> $I(u, v)$. The memory bank (RAM1 to RAM4) is addressed with $(x, y)$, $(x+1, y)$, $(x, y+1)$, $(x+1, y+1)$ and returns $I(x, y)$, $I(x+1, y)$, $I(x, y+1)$, $I(x+1, y+1)$ to the linear interpolation unit. The controller coordinates the units.]
Proposed VLSI Architecture (4/7)
15-stage pipelined architecture of the mapping unit
[Figure: data-flow graph from Stage-1 to Stage-15. Inputs: $u$, $v$, $u_c$, $v_c$, $c_1$ to $c_4$, the constant 1, $u'_c$, $v'_c$. Intermediate values: $t_1$ to $t_{18}$. Outputs: $u'$, $v'$.]
Cf. $t_1 = u - u_c$, $t_2 = v - v_c$, $t_3 = (u - u_c)^2$, $t_4 = (v - v_c)^2$,
$u' = u'_c + t_{16} \times (u - u_c)$, $v' = v'_c + t_{16} \times (v - v_c)$
Proposed VLSI Architecture (5/7)
Mapping Hardware Architecture
[Figure: hardware diagram of the mapping unit built from subtractors, multipliers and adders, each followed by a pipeline register (reg). Inputs: $u$, $v$, $u_c$, $v_c$, $c_1$ to $c_4$, the constant 1, $u'_c$, $v'_c$. Internal signals: $t_1$ to $t_{20}$. Outputs: $u'$, $v'$.]
Proposed VLSI Architecture (6/7)
Linear Interpolation Unit
Four neighboring pixels -> one output.
6-stage pipelined architecture of the linear interpolation unit, from state S13 to state S17.
State flow: S13 = Stage-16; S14 = Stage-16, 17; S15 = Stage-18, 19; S16 = Stage-20; S17 = Stage-21
Memory read: $I(x, y)$, $I(t_{19}, y)$, $I(x, t_{20})$, $I(t_{19}, t_{20})$
[Figure: data-flow graph from Stage-16 to Stage-21. Inputs: $x$, $x'$, $y$, $y'$ and the constant 1. Intermediate values: $t_{19}$ to $t_{32}$. Output: $I(u, v)$.]
Proposed VLSI Architecture (7/7)
Linear Interpolation Hardware Architecture
[Figure: $u'$ and $v'$ are each split into an integer part ($x$, $y$) and a fractional part ($x'$, $y'$). Four DIS RAMs are addressed with $(x, y)$, $(x+1, y)$, $(x, y+1)$, $(x+1, y+1)$ and return $I(x, y)$, $I(x+1, y)$, $I(x, y+1)$, $I(x+1, y+1)$. Multipliers form the weights $(1-x')(1-y')$, $x'(1-y')$, $(1-x')y'$, $x'y'$ and apply them to the four intensities; adders sum the products. Each operator is followed by a register (reg). Internal signals: $t_{23}$ to $t_{33}$.]
Results & Discussions (1/4)
Our circuit requires less hardware and runs at a higher clock rate.
FPGA implementation (Altera EP20K600EBC652-1X):
Feature    Total Logic Elements  Flip-Flops  Clock rate  Clock cycle  Throughput     Pipeline Latency
[5]        18,344 (75%)          15,355      40 MHz      25 ns        30 M pixels/s  91 clock cycles
Proposed   7,163 (29%)           2,811       56.98 MHz   17.55 ns     40 M pixels/s  21 clock cycles
Cell-based implementation (TSMC 0.18 μm):
Total cell area  Gate count  Clock period  Clock rate
449928.875       45128.272   6.6 ns        150 MHz
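The clock-cycle and latency figures in the FPGA comparison are related by simple arithmetic, sketched below. The helper names are hypothetical; the clock rates and stage counts are taken from the table, and the latency in nanoseconds assumes one pipeline stage per clock cycle.

```python
def period_ns(clock_mhz):
    """Clock period in nanoseconds for a clock rate given in MHz."""
    return 1e3 / clock_mhz

def latency_ns(stages, clock_mhz):
    """Time for one pixel to traverse the pipeline, one stage per cycle."""
    return stages * period_ns(clock_mhz)

proposed_period = period_ns(56.98)          # about 17.55 ns
reference_period = period_ns(40.0)          # 25 ns
proposed_latency = latency_ns(21, 56.98)    # about 369 ns
reference_latency = latency_ns(91, 40.0)    # 2275 ns

print(proposed_period, reference_period, proposed_latency, reference_latency)
```

Under that assumption, the shorter pipeline combined with the faster clock gives the proposed design roughly one sixth of the reference design's pipeline latency.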
Results & Discussions (2/4)
[Figure: dot pattern, DIS (left) and CIS (right)]
Results & Discussions (3/4)
[Figure: grid pattern, DIS (left) and CIS (right)]
Results & Discussions (4/4)
[Figure: lab scene, DIS (left) and CIS (right)]
Demo
Hardware/Software Co-Simulation/Verification
[Figure: a software program on the PC sends the source image over USB to the hardware platform (SMIMS board, an FPGA board carrying the barrel distortion correction circuit); the PC receives the corrected image and displays the result image]