vada lab.sungkyunkwan univ. 1 lower power algorithm for multimedia systems 1999. 8...


Page 1: VADA Lab.SungKyunKwan Univ. 1 Lower Power Algorithm for Multimedia Systems 1999. 8 성균관대학교 조 준 동

SungKyunKwan Univ.

1VADA Lab.

Lower Power Algorithm for Multimedia Systems

1999. 8

SungKyunKwan University, J. D. Cho http://vada.skku.ac.kr

Page 2

Contents

• Algorithmic Effects on Low Power

• Low Power Management

• Low Power Applications

– Low Power Video Processor

– Single Chip Video Camera

– Vector Quantization

– Data Encoding

– CDMA Searcher

– Viterbi Decoder

Page 3

Low Power Algorithm

Page 4

Algorithm Selection

• Example: 8x8 matrix DCT

Page 5

Strength Reduction: DIGLOG multiplier

$C_{mult}(n) = 253\,n^2, \quad C_{add}(n) = 214\,n$, where $n$ = word length in bits

$A = 2^j + A_R, \quad B = 2^k + B_R$

$A \cdot B = (2^j + A_R)(2^k + B_R) = 2^{j+k} + 2^j B_R + 2^k A_R + A_R B_R$

                      1st Iter   2nd Iter   3rd Iter
Worst-case error      -25%       -6%        -1.6%
Prob. of Error<1%     10%        70%        99.8%

With an 8 by 8 multiplier, the exact result can be obtained at a maximum of seven iteration steps (worst case)
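The DIGLOG iteration can be sketched in a few lines. This is a minimal illustration, assuming unsigned integer operands; the function name `diglog_multiply` and its `iterations` parameter are hypothetical and not taken from the slides. Each pass keeps the three shift-and-add terms and defers the residual product A_R · B_R to the next pass.

```python
def diglog_multiply(a: int, b: int, iterations: int) -> int:
    """Approximate a * b for non-negative integers with shifts and adds only."""
    result = 0
    for _ in range(iterations):
        if a == 0 or b == 0:
            break                   # residual product is zero: result is exact
        j = a.bit_length() - 1      # position of the leading one of A
        k = b.bit_length() - 1      # position of the leading one of B
        a_r = a - (1 << j)          # A_R = A - 2^j
        b_r = b - (1 << k)          # B_R = B - 2^k
        # 2^(j+k) + 2^j * B_R + 2^k * A_R
        result += (1 << (j + k)) + (b_r << j) + (a_r << k)
        a, b = a_r, b_r             # the next pass approximates A_R * B_R
    return result
```

For 255 x 255 a single pass returns 48896 against the exact 65025, an error of about -25%, which agrees with the worst-case figure in the table.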

Page 6

Logarithmic Number System

$L_x = \log_2 |x|$

$L_{A \cdot B} = L_A + L_B, \quad L_{A/B} = L_A - L_B$

$L_{A^2} = L_A \ll 1, \quad L_{\sqrt{A}} = L_A \gg 1$

--> Significant Strength Reduction
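A small sketch of why the logarithmic number system reduces operator strength: multiplication and division become addition and subtraction of the logarithms, while squaring and square root become doubling and halving. The helper names below are illustrative assumptions, floating point stands in for the fixed-point logarithm a hardware LNS would store, and sign handling (which a real LNS tracks in a separate bit) is left out.

```python
import math

def to_lns(x: float) -> float:
    """L_x = log2|x| (x must be non-zero; the sign is handled separately)."""
    return math.log2(abs(x))

def from_lns(l: float) -> float:
    return 2.0 ** l

def lns_mul(la: float, lb: float) -> float:
    return la + lb          # multiply -> add

def lns_div(la: float, lb: float) -> float:
    return la - lb          # divide -> subtract

def lns_square(la: float) -> float:
    return 2 * la           # square -> shift left by one in fixed point

def lns_sqrt(la: float) -> float:
    return la / 2           # square root -> shift right by one in fixed point
```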

Page 7

Switching Activity Reduction

(a) Average activity in a multiplier as a function of the constant value

(b) Parallel and serial implementations of an adder tree.

Page 8

System-Level Solutions

• System management, System partitioning, Algorithm selection

• Precompute physical capacitance of Interconnect and switching activity (number of bus accesses)

• Regularity: to minimize the power in the control hardware and the interconnection network.

• Modularity: to exploit data locality through distributed processing units, memories and control.

– Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity

– Temporal locality: average lifetimes of variables (less temporary storage; data referenced in the recent past is likely to be accessed again).

• Few memory references: since references to memories are expensive in terms of power.

Page 9

System-Level Solutions - cont.

• Simulator: Instruction-level Energy Estimation

• Software: Energy Efficient Algorithms

• OS: Voltage Scheduling Algorithms

• OS: Multiprocessing for Energy

• Microprocessor: Dynamic Caches

Page 10

Processor Systems: High Power

• Thinkpad (Pentium) 0.3 Hours/AA

• InfoPad (ARM) 0.8 Hours/AA

• Toshiba Portable (486) 0.9 Hours/AA

• Newton (ARM) 2.0 Hours/AA

Operations per Battery Life: Minimize Energy Consumed per Operation

Operations per Second: Maximize Throughput (Operations/second)

Page 11

DPM vs SPM

• DPM (Dynamic Power Management): stops the clock, produced by the clock generators, from switching in a specific unit.

• SPM (Static Power Management): when the system remains idle for a significant period of time, it is shut down.

Identify power hungry modules and look for opportunities to reduce power

Page 12

Vdd vs Delay

• Use Variable Voltage Scaling or Scheduling for Real-time Processing

• Use architecture optimization to compensate for slower operation, e.g., Parallel Processing and Pipelining to increase concurrency and shorten the critical path.

• Scale down device sizes to compensate for delay (interconnects do not scale proportionately and can become dominant)

Page 13

Power PC 603 Strategy

• Baseline: apply the right supply and the right frequency to each part of the system. If the system has to wait for some input to occur, only a small circuit needs to wait; it wakes up the main circuit when the input occurs.

• PowerPC 603 is a 2-issue processor (2 instructions read at a time) with 5 parallel execution units.

• 4 modes:
– Full on mode for full speed
– Doze mode, in which the execution units are not running
– Nap mode, which also stops the bus clocking
– Sleep mode, which stops the clock generator with or without the PLL (20-100mW).

Page 14

Power PC 603 Power Management

Page 15

TI Structures

• Two DSPs, the TMS320C541 and TMS320C542, reduce power, chip count, and system cost for wireless communication applications.

• C54X DSPs (2.7V, 5V), Low-Power Enhanced Architecture DSP (LEAD) family: with three different power-down modes, these devices are well suited for wireless communications products such as digital cellular phones, personal digital assistants, and wireless modems, and offer low power for voice coding and decoding.

• The TMS320LC548 features:
– 15-ns (66 MIPS) or 20-ns (50 MIPS) instruction cycle times
– 3.0- and 3.3-V operation

• 32K 16-bit words of RAM and 2K 16-bit words of boot ROM on-chip

• Integrated Viterbi accelerator that reduces the Viterbi butterfly update to four instruction cycles for GSM channel decoding

• Powerful single-cycle instructions (dual operand, parallel instructions, conditional instructions)

Page 16

InfoPad Architecture, UC-Berkeley

[Figure labels: Speech Recognizer, "PadServer", Wireless Basestation, InfoPad, Web Browser, Internet]

• Maintain state in the network, not on the Pad

• Transmit audio and raw bitmaps across the wireless link

• Example: hand-held speech-enabled web browser

• Perform all computation in the network to minimize client energy dissipation

Page 17

InfoPad Hardware Flexibility

[Figure: an RX packet arrives from the Radio. Only the packet header is sent to the 10 MIPS μProcessor (control, statistics, reliability, debugging); the entire packet is routed to dedicated hardware for the frame-buffer update.]

• Embedded software responsible for high-level functions

• Main data-flow handled by custom low-power ASICs

• Use hardware/software integration to provide energy-efficient high-level functionality

Page 18

Multimedia I/O Terminal.

Page 19

Multimedia I/O terminal

Page 20

InfoPad Evolution

Total Power: ~7 W

Where did the power go?
– No local computation?
– Commercial radios
– Commercial DC/DC
– Inefficient implementation

[Figure labels: InfoPad, Intercom, Energy-Efficient Processors]

• High-level system design optimizes complete solution and drives new research

Page 21

Power-Down Techniques

Page 22

Low Power Memory

Page 23

Low Power Video Processor

Uzi Zangi, Technion - VLSI Systems Research Center, 1997

Asynchronous logic to save power. Didn't work because:
– Slow design (13.5 MHz) and small circuit (<100K gates): the clock load is small.
– Adding asynchronous control costs more than clocking.

Gated clock. Didn't work because:
– Frequency is very low (13.5 MHz).
– Register activity is very high.
– No need for a clock tree.

Page 24

Minimizing bus switching

• Transfer either the value or its negative (the inverted value) on the bus, whichever toggles fewer bits.

• Add one bit that indicates the polarity of the bus.

• Good for buses with:
– a large number of bits (more than 10)
– high capacitance (more than 2 pF)
– high toggle activity (more than 1/2)

• Overheads: routing of one more bit; extra logic for the decision (timing, area).
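The scheme described on this slide can be sketched as follows. This is an illustrative model, not the chip's decision logic: the function names are hypothetical, the bus is assumed to start at all zeros, and toggles on the polarity line itself are ignored when deciding whether to invert.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def bus_invert_encode(words, width):
    """Return (bus_value, polarity) pairs; polarity 1 means the word was inverted."""
    mask = (1 << width) - 1
    prev = 0                                  # assumed initial bus state
    encoded = []
    for w in words:
        w &= mask
        if popcount(w ^ prev) > width // 2:   # more than half the lines would toggle
            w ^= mask                         # send the inverted word instead
            polarity = 1
        else:
            polarity = 0
        encoded.append((w, polarity))
        prev = w
    return encoded

def bus_invert_decode(encoded, width):
    mask = (1 << width) - 1
    return [value ^ mask if polarity else value for value, polarity in encoded]

def data_line_toggles(values, start=0):
    """Count bit transitions on the data lines for a sequence of bus values."""
    total, prev = 0, start
    for v in values:
        total += popcount(v ^ prev)
        prev = v
    return total
```

On a 12-bit bus carrying 0x000, 0xFFF, 0x001, 0xFFE, the data lines toggle far less often once inversion is allowed, at the cost of the extra polarity line noted under overheads.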

Page 25

Minimizing bus switching (Cont.)

Didn't work because:
– Largest bus is 8 bits.
– Capacitance is less than 1 pF.
– Toggle activity is not very high.

[Figure labels: Block A, decision unit (Cx), n-bit Bus (Ct) with extra line E, Block B]

Page 26

Power Reduction in InfoPad

Approach              Power Reduction   Comments
Voltage Scaling       x21               1.1V vs 5V
Optimized Cell Lib.   x3-4              TR sizing, reduced swing and self-timed FIFO…
Gated Clocks          x2-3              error checking for address only
Block decoding        x8                enabling only one block in the SRAM
Algorithm Selection   x5-10             VQ vs DCT
Bit swing reduction   x3.7              1.1V vs 300mV in memory
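Two of the factors in this table can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes, as a simplification not stated on the slide, that dynamic power scales with the square of the supply voltage for full-swing logic and linearly with the signal swing when the supply is held fixed; the function names are illustrative.

```python
def supply_scaling_gain(v_high: float, v_low: float) -> float:
    """Power ratio for full-swing CMOS, assuming P is proportional to Vdd^2."""
    return (v_high / v_low) ** 2

def swing_reduction_gain(v_supply: float, v_swing: float) -> float:
    """Power ratio for reduced swing, assuming P is proportional to Vdd * Vswing."""
    return v_supply / v_swing

voltage_scaling = supply_scaling_gain(5.0, 1.1)   # roughly 20.7, i.e. about x21
bit_swing = swing_reduction_gain(1.1, 0.3)        # roughly 3.67, i.e. about x3.7
```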

Page 27

Power Management by Gated Clock

• Power Management Scheme by Enabling Clock

• Power Management Scheme by adding Clock Generation block

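[Figure: two gated-clock schemes. In the first, clk reaches each block through its own enable signal (enable 1, enable 2, enable 3). In the second, a clock management block receives clk and the enables and distributes the gated clocks to the blocks.]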

Page 28

Method That Works: Pixel Differentials

• Pixel values show area locality. This is exploited most heavily in compression (to save on storage and transmission).

• Most of the functions are linear and are able to work on differences.

• The entire algorithm was rewritten (interpolations, filters, matrices, etc.).

• The new algorithm differs from the original by no more than 1 LSB per pixel.

Page 29

Methodology

[Figure: design and verification flow. Labels: Algorithm, C++ Simulator, Image, Compare, RTL, Verilog Simulator, Image, Synopsys, 0.35 Lib (Compass), Netlist, P&R (Cadence Opus), Spice Netlist, Epic Powermill, Currents/power.]

Page 30

Pixel Difference

[Figure: Register Current, Logic Current and Total Current [mA] (axis 0 to 12) plotted against Pixel Differential (0% to 100%).]

Page 31

Pixel Differentials Algorithm Results

Number of   Differential   Differential   Register       Logic          Total          Current   Power
Pixels      Pixels         Ratio          Current (mA)   Current (mA)   Current (mA)   Saving    Saving
3600        0              0.00%          3.8            6.8            10.6
3646        424            11.63%         1.5            3.6            5.1            52%       77%
3646        616            16.90%         1.4            3.22           4.62           56%       81%
3190        1536           48.15%         1.02           2.2            3.22           70%       91%
3494        2730           78.13%         0.82           1.21           2.03           81%       96%
3190        3116           97.68%         0.8            1.16           1.96           82%       97%
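The two saving columns can be recomputed from the total-current column. The sketch below takes the 10.6 mA row (no differential pixels) as the baseline. The current saving is one minus the current ratio; the power saving figures match one minus the square of that ratio, which is an observation about the reported numbers rather than a formula given in the slides.

```python
BASELINE_MA = 10.6                                # total current with 0% differential pixels
TOTAL_CURRENT_MA = [5.1, 4.62, 3.22, 2.03, 1.96]  # remaining rows of the table

def savings(total_ma: float, baseline_ma: float = BASELINE_MA):
    """Return (current saving %, power saving %) rounded to whole percent."""
    ratio = total_ma / baseline_ma
    current_saving = round(100 * (1 - ratio))
    power_saving = round(100 * (1 - ratio ** 2))  # assumed square-law relation
    return current_saving, power_saving

table = [savings(i) for i in TOTAL_CURRENT_MA]
```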

Page 32

Summary

• Attempted to save power on a battery-operated chip with application-specific algorithmic/architectural techniques: asynchronous logic, gated clock, minimizing bus switching.

• All attempts failed. These methods may still apply to very large, very fast chips and to variable-load applications.

• Successfully applied an algorithmic change inspired by image compression. It may not work on non-compressible data but works exceptionally well on images.

• Easily saved 80% power; potentially can save more than 90%.

Page 33

A SINGLE-CHIP DIGITAL CAMERA

H. Teresa H. Meng, "Low-Power Wireless Video System", IEEE Communication Magazine, June, 1998

◈ Given the recent development in CMOS RF transceiver design, wireless transmission at a bandwidth in excess of 10Mb/s will soon become possible using next-generation CMOS technology.

◈ The design of a low-power large-scale parallel MPEG2 encoder architecture to be used in a single-chip digital CMOS video camera.

◈ The single-chip digital camera architecture includes a 640 x 480 array of CMOS photo diodes, embedded DRAM for storing four frames of color data, and parallel array processor for video signal processing

◈ The parallel processor architecture is designed to implement highly computationally intensive image and video processing tasks such as color conversion, discrete cosine transform (DCT), and motion estimation for MPEG2.

Page 34

A SINGLE-CHIP DIGITAL CAMERA

[Figure: side view and top view of the chip. Side view: CMOS photo sensors on the silicon surface, embedded DRAM (pixel memory), and parallel video processors. Top view: a 640 x 480 pixel array divided among column processors 1 to 40, each covering 16 columns x 480 pixels.]

Page 35

A SINGLE-CHIP DIGITAL CAMERA

Module/operation             Word size   Energy/op (pJ)   Normalized to adder
External I/O access          16 bits     160              10
8 x 128 x 16 SRAM (write)    16 bits     180              9
8 x 128 x 126 SRAM (read)    16 bits     80               4.4
Latch                        16 bits     4                0.22
Multiplier                   16 bits     64               3.6
Carry-select adder           16 bits     18               1

Energy per operation at a 1.5V supply in 0.8 µm CMOS technology

Page 36

A SINGLE-CHIP DIGITAL CAMERA

◈ Design Consideration

The proposed architecture considers three algorithms commonly used in video coding standards: red-green-blue (RGB) to luminance-chrominance (YUV) conversion, discrete cosine transform (DCT), and motion estimation.

To reduce power consumption, as many parallel processors as practically feasible should be used to reduce the clock frequency, because a reduced clock frequency permits a lower supply voltage.

For MPEG-2 encoding, the computational demand of motion estimation (1.6 BOPS for 30 frames/s, based on the algorithm proposed by Chalidabhongse and Kuo) limits the number of columns in each processor domain to 16, because otherwise the required clock speed for each processor would be too high for a low-power design.

Page 37

A SINGLE-CHIP DIGITAL CAMERA

◈ PERFORMANCE

To sustain this computational demand, each processor is required to run at a clock frequency equal to or higher than 40 MHz.

When implemented in a 0.2 µm CMOS technology, a 1V supply voltage should be more than enough to support 40 MHz operation.

Under these conditions, this parallel processor architecture delivers a processing rate of 1.6 BOPS with a power consumption of 40mW.
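The figures quoted for the camera fit together with simple arithmetic. The sketch below assumes one operation per processor per clock cycle, which the slides do not state explicitly; the constant names are illustrative.

```python
IMAGE_COLUMNS = 640            # width of the photo-diode array
COLUMNS_PER_PROCESSOR = 16     # columns in each processor domain
TOTAL_OPS_PER_SECOND = 1.6e9   # 1.6 BOPS for motion estimation at 30 frames/s
POWER_WATTS = 40e-3            # quoted power consumption

processors = IMAGE_COLUMNS // COLUMNS_PER_PROCESSOR
ops_per_processor = TOTAL_OPS_PER_SECOND / processors        # ops/s each processor must sustain
clock_mhz = ops_per_processor / 1e6                          # assuming one op per cycle
energy_per_op_pj = POWER_WATTS / TOTAL_OPS_PER_SECOND * 1e12
```

With these numbers there are 40 processors, each needing 40 MHz, and the average energy per operation is about 25 pJ.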

Page 38

Vector Quantization

• Lossy compression technique that exploits the correlation between neighboring samples and quantizes samples together

Page 39

Complexity of VQ Encoding

The distortion metric between an input vector X and a codebook vector C_i is computed as follows:

$D_i = \sum_{j=0}^{15} (x_j - c_{ij})^2$

Three VQ encoding algorithms will be evaluated: full search, tree search and differential codebook tree-search.

Page 40

Full Search

• Brute-force VQ: the distortion between the input vector and every entry in the codebook is computed, and the code index that corresponds to the minimum distortion is determined and sent to the decoder.

• For each distortion computation, there are 16 8-bit memory accesses (to fetch the entries in the codeword), 16 subtractions, 16 multiplications, and 15 additions. In addition, the minimum of 256 distortion values, which involves 255 comparison operations, must be determined.
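A minimal sketch of the full-search encoder, assuming a squared-error distortion (consistent with the 16 subtractions, 16 multiplications and 15 additions counted above). The function names and the operation-count helper are illustrative, not part of the original design.

```python
def squared_distortion(x, c):
    """Sum of squared differences between input vector x and codeword c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def full_search_encode(x, codebook):
    """Return the index of the codeword with minimum distortion."""
    best_index = 0
    best = squared_distortion(x, codebook[0])
    for i in range(1, len(codebook)):
        d = squared_distortion(x, codebook[i])
        if d < best:                      # one comparison per remaining codeword
            best_index, best = i, d
    return best_index

def full_search_op_counts(codebook_size=256, dim=16):
    """Operation counts per input vector for the brute-force search."""
    return {
        "memory_accesses": codebook_size * dim,
        "subtractions": codebook_size * dim,
        "multiplications": codebook_size * dim,
        "additions": codebook_size * (dim - 1),
        "comparisons": codebook_size - 1,
    }
```

For a 256-entry codebook of 16-element vectors this gives 4096 memory accesses, subtractions and multiplications, 3840 additions and 255 comparisons per encoded vector.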

Page 41

Tree-structured Vector Quantization

If, for example, at level 1 the input vector is closer to the left entry, then the right portion of the tree is never compared below level 2, and an index bit 0 is transmitted.

Here only 2 x log_2 256 = 16 distortion calculations with 8 comparisons are needed.
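The tree search can be sketched as below. The `tree[level][node]` layout, where level 0 is an unused root placeholder, is a hypothetical data structure chosen for brevity; a squared-error distortion is assumed.

```python
def sq_dist(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def tsvq_encode(x, tree, depth):
    """Walk a binary codebook tree; return (index, number of distortion calculations).

    tree[level][node] is the test vector of that node (assumed layout);
    the children of node n at the next level are 2n (left) and 2n + 1 (right).
    """
    index = 0
    distortions = 0
    for level in range(1, depth + 1):
        left = tree[level][2 * index]
        right = tree[level][2 * index + 1]
        distortions += 2                              # two distortions per level
        bit = 0 if sq_dist(x, left) <= sq_dist(x, right) else 1   # one comparison
        index = 2 * index + bit                       # append the index bit
    return index, distortions
```

With a depth of 8 (256 leaves) the loop performs 16 distortion calculations and 8 comparisons, as stated on the slide, instead of the 256 distortions of the full search.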

Page 42

Differential Codebook Tree-structure Vector Quantization

• The distortion difference between the left and right nodes needs to be computed. This equation can be manipulated to reduce the number of operations.
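One plausible form of that manipulation, assuming a squared-error distortion over 16-element vectors (the slides do not show the equation, so the symbols below are illustrative): with input vector $x$, left test vector $l$ and right test vector $r$, the difference of the two distortions expands to

```latex
\begin{aligned}
D_L - D_R &= \sum_{j=0}^{15} (x_j - l_j)^2 - \sum_{j=0}^{15} (x_j - r_j)^2 \\
          &= \sum_{j=0}^{15} \left(l_j^2 - r_j^2\right) - 2 \sum_{j=0}^{15} x_j \,(l_j - r_j)
\end{aligned}
```

The first sum does not depend on the input and can be stored as one constant per node, and the differences $l_j - r_j$ can be stored in place of the two test vectors. Each tree level then needs 16 multiplications and roughly half the memory accesses of two separate distortion computations, and only the sign of the result is needed to choose the branch.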

Page 43

Comparisons

• The number of memory access operations can be reduced; that is, by changing the contents of the codebook through computational transformations, the number of switching events (multiplications, additions/subtractions, and memory accesses) can be reduced.

Page 44

Multiplication with Constants

• Techniques and tools have been developed to scale coefficients so as to minimize the number of 1's in the coefficients, and thereby the number of shift-add operations.
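The link between the number of 1's in a coefficient and the shift-add count can be illustrated with a short sketch. The function name is hypothetical; it multiplies by a non-negative constant using one shifted term per set bit.

```python
def shift_add_multiply(x: int, coeff: int):
    """Multiply x by a non-negative constant using shifts and adds.

    Returns (product, number of shifted terms); the number of additions
    is one less than the number of terms.
    """
    product, shift, terms = 0, 0, 0
    while coeff:
        if coeff & 1:              # each 1 bit contributes one shifted copy of x
            product += x << shift
            terms += 1
        coeff >>= 1
        shift += 1
    return product, terms
```

Multiplying by 15 (binary 1111) needs four shifted terms, while a coefficient scaled to 16 (binary 10000) needs a single shift, which is the kind of saving coefficient scaling aims for.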

Page 45

Gated clocks to shut down modules when not used.

Page 46

Lower Power Data Encoding

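• S.S.Chun and J.D.Cho'97

• A method that reorganizes the Huffman code to reduce the number of switching events while keeping the compression ratio produced by the Huffman coding algorithm.

• For sub-streams that share many common subsequences, a coding scheme with few switching events, such as Gray code, is adopted.

• Among RISC instruction addressing schemes, Gray-code addressing gives a power reduction of up to 50% compared with binary-code addressing.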
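The addressing claim can be checked by counting bit toggles for sequential addresses. This sketch uses the standard reflected Gray code; the helper names are illustrative.

```python
def to_gray(n: int) -> int:
    """Standard reflected binary Gray code of n."""
    return n ^ (n >> 1)

def toggles(seq) -> int:
    """Total number of bit flips between consecutive values in seq."""
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))

addresses = list(range(16))                              # sequential instruction addresses
binary_toggles = toggles(addresses)                      # plain binary addressing
gray_toggles = toggles([to_gray(a) for a in addresses])  # exactly one flip per step
```

For 16 sequential addresses binary addressing toggles 26 bits and Gray-code addressing toggles 15. As the address range grows the ratio approaches one half, in line with the "up to 50%" figure for sequential access.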

Page 47

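Gray Code

• Define the Hamming distance between two n-dimensional (n-bit) vectors U = u_1, u_2, … , u_n and V = v_1, v_2, … , v_n as $h(U,V) = \sum_{i=1}^{n} (u_i \oplus v_i)$, where $(u_i \oplus v_i)$ is 1 if the bit values of u and v differ and 0 otherwise. This can also be expressed as the distance traveled along the edges of an n-dimensional hypercube G. Gray code = shortest path in G

• Because Huffman codewords can have different lengths and must remain prefix-free, an exact conversion to Gray code is impossible; a compression coding that minimizes the amount of bit change is needed instead.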

Page 48

2-D Traveling Salesman Problem

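• The proposed problem assigns code pairs with a small Hamming distance to character pairs that are frequently adjacent, so it is expressed as a new problem that handles two or more TSPs simultaneously.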

• Using heuristic: 10% reduction in switching activity for random un-correlated data

Page 49

Lower Power CDMA Searcher

1999. 8

S. Kim and J.D.Cho

SungKyunKwan University http://vada.skku.ac.kr

Page 50

Searcher (Using a Common Double Dwell Method)

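◈ The initial synchronization acquisition process for accurately synchronizing the PN code between the transmitter and receiver of a CDMA system.

[Figure: double-dwell searcher block diagram. Recoverable signal definitions, with $a_I$ = Local PN_I and $a_Q$ = Local PN_Q:]

$O_I = (RX_I \oplus a_I) + (RX_Q \oplus a_Q)$

$O_Q = (RX_I \oplus a_Q) + (RX_Q \oplus (-a_I))$

$Y_I = G \sum_{N_C} O_I, \quad Y_Q = G \sum_{N_C} O_Q$

$Z = Y_I^2 + Y_Q^2$

$Z > \theta_1$? Yes (Switch ON): accumulate $\sum_{N_N} Z$. No: Search_Slew.

$\sum_{N_N} Z > \theta_2$? No: Search_Slew. Yes: Search Done !!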

Page 51

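Operation Flow

1. The pilot channel transmitted by the base station is despread with the PN code sequence generated in the mobile terminal.

2. The despread result is accumulated Nc times (the coherent accumulation count) and then passed through the energy calculation (squaring).

3. The energy values are compared with the first threshold (θ1); if they exceed it, non-coherent accumulation (Nn) is performed in the following stage.

4. Otherwise, the PN code sequence is generated one chip earlier and the process above is repeated on the incoming signal.

5. The result of the non-coherent accumulation is compared with the second threshold (θ2).

6. If it exceeds θ2, the search ends; otherwise the PN code sequence is generated one chip earlier and the process above is repeated.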
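The six steps above can be sketched as a simplified, real-valued model. It uses a single channel instead of the I/Q pair, ±1 chips, and a length-15 m-sequence as a stand-in PN code; the function names, thresholds and sequence are assumptions for illustration and are not taken from the searcher design.

```python
def to_chips(bits):
    """Map bits 0/1 to chips +1/-1."""
    return [1 - 2 * b for b in bits]

# Length-15 m-sequence (recurrence s[n] = s[n-3] XOR s[n-4]), used as a toy PN code.
PN_BITS = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]

def double_dwell_search(rx, pn, nc, nn, theta1, theta2):
    """Return the PN offset that passes both dwells, or None if none does."""
    period = len(pn)

    def energy(offset, block):
        start = block * nc
        # Steps 1-2: despread, accumulate coherently over nc chips, then square.
        acc = sum(rx[start + i] * pn[(offset + start + i) % period] for i in range(nc))
        return acc * acc

    for offset in range(period):          # each iteration slews the local PN by one chip
        if energy(offset, 0) <= theta1:   # steps 3-4: first dwell fails, slew
            continue
        # Steps 3 and 5: non-coherent accumulation over nn blocks, second threshold.
        if sum(energy(offset, blk) for blk in range(nn)) > theta2:
            return offset                 # step 6: search done
    return None
```

The first threshold rejects most wrong offsets after only Nc chips, so the longer non-coherent accumulation runs only for promising candidates.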

Page 52

Data Flow Graph of Searcher Operation

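[Figure: data flow graph. Labels: four XOR operations on the input pairs (RXI, TXI), (RXQ, TXQ), (RXI, TXQ), (RXQ, -TXI); adders in the coherent accumulation stage; two squarers ( )^2 and an adder in the energy calculation stage; max value selection; comparison with θ1; non-coherent accumulation stage; comparison with θ2.]

• Coherent accumulation stage – 4 additions

• Energy calculation stage – 2 multiplications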

Page 53

Rescheduled Data Flow Graph

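[Figure: rescheduled data flow graph. Labels: four XOR operations on the input pairs (RXI, TXI), (RXQ, TXQ), (RXI, TXQ), (RXQ, -TXI); two carry-save adders (CSA) in the coherent accumulation stage; absolute values | |; max value selection; comparison with θ1; a single squarer ( )^2 in the energy calculation stage; non-coherent accumulation stage; comparison with θ2.]

• Coherent accumulation stage – uses a Carry Save Adder (or 3-input ALU)

• Threshold comparison – applies pre-computation

• Energy calculation stage – changes the data flow order to reduce the number of multiplications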

Page 54

Pre-computation

Power saving

– Reduces power dissipation of combinational logic

– Reduces internal power to precomputed registers

Cost

– Increase area

– Impact circuit timing

– Increase design complexity

• number of bits to precompute

– Testability

• may generate redundant logic

Page 55

Pre-computation

◈ A comparator example : Shrinivas Devadas, 1994

◈ Precomputation for external idleness : M. Alidina, 1994

Page 56

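Low Power Comparator

• The MSBs of YI and YQ are the sign bits of absolute values, so both are '0'.

• Pre-computation is performed using the upper 2 bits, excluding the MSB.

• Based on the pre-computation result, the larger of |YI| and |YQ| is selected.

• For the comparison with the threshold θ1, a multiplexer is used instead of a comparator.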

Page 57

Three Input ALU ( Ovadia Bat-Sheva, 1998 )

The three input ALU consumes much less power than an ALU and an ASU

A drawback of using a 3IALU is the added complexity in calculating the carry and overflow.

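[Figure: Two ALUs Structure (MUL0 and MUL1 produce P0 and P1, which feed an ALU and an ALU/ASU with accumulators acc0 and acc1) versus Three Input ALU Structure (MUL0 and MUL1 produce P0 and P1, which feed a single 3IALU with accumulator acc1).]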

Page 58

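Experimental Results and Conclusion

• RTL-level low-power design and implementation of the searcher engine of an MSM (Mobile Station Modem) chip for mobile terminals in IS-95 based DS/CDMA systems.

– Operating frequency: 12.5MHz

• Using the data flow graph, rescheduling, pre-computation, strength reduction and similar techniques were applied, reducing area and power by up to 67.68% and 41.35%, respectively.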

Page 59

Lower Power Viterbi Decoder

1999. 8

J.H. Ryu and J.D.Cho

SungKyunKwan University http://vada.skku.ac.kr

Page 60

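Viterbi Decoder

◈ Convolutional Encoder

K = 3 (Constraint Length), R = 1/2 (Rate)

$a_j = u_j + u_{j-1} + u_{j-2}$

$b_j = u_j + u_{j-2}$

[Figure: A (3,1/2) convolutional encoder. The information sequence U enters shift registers A1 and A0; modulo-2 adders form a_j and b_j, which make up the codeword V.]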

Page 61

Viterbi Decoder

[Figure: trellis over states 00, 10, 01, 11 and time 0 to 6, with branches labeled by the output pairs 00, 11, 10, 01.]

Fig. 2. Trellis diagram for a (2,1/ 2) convolutional code

Information sequence : U = (0,0,1,0,1,0,...) Output codeword : V = (00,00,11,10,00,10,...)
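The encoder equations a_j = u_j + u_{j-1} + u_{j-2} and b_j = u_j + u_{j-2} (modulo-2 sums) can be checked against this example with a short sketch. The function name is illustrative, and the shift register is assumed to start at zero.

```python
def conv_encode(bits):
    """Rate-1/2, K = 3 convolutional encoder; returns a list of (a_j, b_j) pairs."""
    u1 = u2 = 0                      # u_{j-1} and u_{j-2}, assumed to start at 0
    codeword = []
    for u in bits:
        a = u ^ u1 ^ u2              # a_j = u_j + u_{j-1} + u_{j-2} (mod 2)
        b = u ^ u2                   # b_j = u_j + u_{j-2} (mod 2)
        codeword.append((a, b))
        u1, u2 = u, u1               # shift the register
    return codeword

encoded = conv_encode([0, 0, 1, 0, 1, 0])
```

Encoding U = (0,0,1,0,1,0) gives the pairs 00, 00, 11, 10, 00, 10, which matches the output codeword V listed above.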

Page 62

Viterbi Decoder

◈ Viterbi Decoder

[Figure: Viterbi decoder structure. Blocks: BMU, ACSU, SMU and PMM. Signals: Received Signal, BM, SP, Decoded Data.]

Page 63

Branch Metric Unit (BMU): The branch metrics measure the difference between the received symbol and the symbol that causes the transition between states in the trellis.

Add-Compare-Select Unit (ACSU): To find the survivor path entering each state, the branch metric of a given transition is added to its corresponding partial path metric (PM) stored in the path metric memory (PMM). This new partial path metric is compared with the new partial path metrics of all the other transitions entering that state. The transition with the minimum partial path metric is chosen as the survivor path of the state. The path metric of the survivor path of each state is updated and stored back into the PMM.

Survivor Memory Unit (SMU): The survivor paths are stored in the SMU. A traceback mechanism is applied on the SMU during the decoding stage to output the decoded data.
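The three units can be seen working together in a compact software model. This is a hard-decision sketch for the K = 3, rate-1/2 code with a_j = u_j + u_{j-1} + u_{j-2} and b_j = u_j + u_{j-2}; it assumes Hamming-distance branch metrics, a start in state 00, and a full traceback at the end of the block, none of which is prescribed by the slides. The function name and state encoding are illustrative.

```python
def viterbi_decode(received):
    """Decode a list of (a, b) hard-decision pairs; returns the information bits."""
    INF = float("inf")
    pm = [0, INF, INF, INF]              # path metrics; state = (u_{j-1} << 1) | u_{j-2}
    survivors = []                       # SMU: predecessor of each state at each step
    for ra, rb in received:
        new_pm = [INF] * 4
        pred = [None] * 4
        for s in range(4):
            if pm[s] == INF:
                continue                 # state not reachable yet
            u1, u2 = s >> 1, s & 1
            for u in (0, 1):
                a, b = u ^ u1 ^ u2, u ^ u2
                bm = (a != ra) + (b != rb)       # BMU: Hamming branch metric
                ns = (u << 1) | u1
                cand = pm[s] + bm                # ACSU: add
                if cand < new_pm[ns]:            # ACSU: compare and select
                    new_pm[ns] = cand
                    pred[ns] = s
        pm = new_pm
        survivors.append(pred)
    state = min(range(4), key=lambda s: pm[s])   # start from the best final state
    bits = []
    for pred in reversed(survivors):             # traceback through the SMU
        bits.append(state >> 1)                  # the input bit is the newest state bit
        state = pred[state]
    bits.reverse()
    return bits
```

Decoding the error-free sequence 00, 00, 11, 10, 00, 10 recovers the information bits 0, 0, 1, 0, 1, 0.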

Viterbi Decoder

Page 64

Viterbi Decoder

⑴ Low power ACSU VLSI architecture

▶ Conventional ACSU VLSI architecture

[Figure: butterfly structure connecting predecessor states s_a and s_b to successor states S0 and S1.]

Page 65

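Viterbi Decoder

[Figure: architecture of the conventional ACSU. Two adder-adder-comparator groups: one adds BM_i(s_a, S0) to PM_{i-1}(s_a) and BM_i(s_b, S0) to PM_{i-1}(s_b) and compares the sums to produce M_i(S0); the other does the same with BM_i(s_a, S1) and BM_i(s_b, S1) to produce M_i(S1).]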

Page 66

Viterbi Decoder [SKKU. Solution]

― Algorithm

☞ The area and power of the lower power ACSU design are reduced by 20% and 30%, respectively, compared with the conventional ACSU design.

$PM_{i-1}(s_a) + BM_i(s_a, S_0) > PM_{i-1}(s_b) + BM_i(s_b, S_0)$

$PM_{i-1}(s_a) - PM_{i-1}(s_b) > BM_i(s_b, S_0) - BM_i(s_a, S_0)$

Page 67

Viterbi Decoder [SKKU. Solution]

▶ Low power ACSU VLSI architecture [C-Y Tsui, ISLPED'99]

Page 68

※ Glitch minimization [Raghunathan, DAC’96]

(a) Lower power ACSU architecture (b) Conventional ACSU architecture

☞ The power consumption of architecture (a) is larger than that of architecture (b) by more than 17% because of glitch power dissipation

Viterbi Decoder [SKKU. Solution]

[Figure: (a) compare-add and (b) add-compare datapaths, each built from inputs A, B, C, D, adders, a "<" comparator and 0/1 multiplexers, producing outputs X and Y.]

Page 69

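Viterbi Decoder [SKKU. Solution]

※ Glitches in control logic

[Figure: compare-add datapath with inputs A, B, C, D, a "<" comparator, 0/1 multiplexers and outputs X, Y; the select signal S is combined with CLK through an AND gate (&). Signal labels: Fs=0, Fs=1.]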

Page 70

Viterbi Decoder

⑵ Low power traceback VLSI architecture

▶ Systolic Viterbi, traceback decoder [J. Sparso'91]

[Figure: the structure of the systolic Viterbi decoder. The ACSU feeds a chain of Trace-Back Units 1, 2, 3, ..., 10.]

Page 71

Viterbi Decoder

[Figure: sequence of states of the trace-back method. Trellis over states 00, 10, 01, 11 and time 0 to 6, annotated with the path metric and the decision vector at each node.]

Received codeword : V = (00,00,11,10,00,10,...)

Page 72

Viterbi Decoder

[Figure: systolic trace-back contents at time units 1 to 4. At each time unit the ACSU supplies a decision vector (00XX, 0000, 0000, 1101) and the state with the smallest path metric.]

Page 73

Viterbi Decoder

[Figure: systolic trace-back contents at time units 10, 11 and 12. Decision vectors shift through trace-back units T1 to T10 (and T11), with survivor depth = 5K; decoded bits "0" and "1" appear at the output.]

Page 74

Viterbi Decoder

[Figure: systolic trace-back contents at time units 19, 20 and 24, showing the decision vectors held in the trace-back units and the traced state sequence.]


※ Problems with the systolic array decoder

The systolic array Viterbi decoder takes the decision vector and the state with the smallest path metric from the ACSU as input, and produces each decoded bit by shifting every register on every cycle.

Because every data word in the TBU shifts on every cycle, register switching activity dominates the dynamic power: it accounts for almost 80% of the total power consumption.
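A rough way to see why shifting is expensive is to count bit flips in the registers, a common proxy for switching activity. The sketch below is an assumption-laden toy model, not the circuit from the slides: `shift_register_toggles` rewrites every stage each cycle, while the hypothetical `addressed_write_toggles` writes only one slot chosen by a counter. The function names and the 4-bit example words are illustrative.

```python
def hamming(a, b):
    """Number of bit positions in which two register words differ."""
    return bin(a ^ b).count("1")

def shift_register_toggles(vectors, depth):
    """Bit flips when every stage takes its neighbour's value each cycle."""
    regs = [0] * depth
    toggles = 0
    for v in vectors:
        new = [v] + regs[:-1]             # whole chain shifts by one stage
        toggles += sum(hamming(a, b) for a, b in zip(regs, new))
        regs = new
    return toggles

def addressed_write_toggles(vectors, depth):
    """Bit flips when only one slot, selected by a counter, is written each cycle."""
    regs = [0] * depth
    toggles = 0
    for i, v in enumerate(vectors):
        slot = i % depth                  # counter-selected slot (hypothetical)
        toggles += hamming(regs[slot], v)
        regs[slot] = v
    return toggles

example = [0b1111, 0b0000, 0b1111]
shifted = shift_register_toggles(example, depth=2)      # 20 bit flips
addressed = addressed_write_toggles(example, depth=2)   # 4 bit flips
```

In this model the addressed write never flips more bits than the shift chain, because a word that travels through the chain is rewritten once per stage instead of once in total.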

Viterbi Decoder


Viterbi Decoder [SKKU. Solution]

▶ Our low power trace-back unit

[Figure: Low-power trace-back unit at time units 1 to 3. The ACSU feeds a control block, which writes the decision vectors (00XX, 0000, ...) into the register array.]


Viterbi Decoder [SKKU. Solution]

[Figure: Low-power trace-back unit at ACSU time units 9, 10, and 11. The control block places each new decision vector in the register array, and trace-back proceeds through stages T1 to T9.]


Viterbi Decoder [SKKU. Solution]

[Figure: Low-power trace-back unit at ACSU time units 19, 20, and 21. Each panel shows the control block, the 4-bit decision vectors held in the register array, and the 2-bit state values attached to them.]


After the decision vector and the state with the smallest path metric are generated by the ACSU and transferred to the Control Block (CB), the CB outputs them in the right cycle using a counter and a multiplexer.

The register array stores the trace-back values from the CB and produces the final decoded bit. Unlike the classical TBU, it does not shift the upper 4-bit decision vectors; it shifts only the lower 2 bits, which identify the state with the smallest path metric, to the left.

Viterbi Decoder [SKKU. Solution]


Viterbi Decoder [SKKU. Solution]

◈ Experimental Result (area 11%, power 40%)

[Figure: Area in gates (axis 0 to 8000) versus K = 2, 3, 4, comparing the Trace-back Unit with the Low Power Trace-back Unit.]

[Figure: Power Dissipation in uW (axis 0 to 1600) versus K = 2, 3, 4, comparing the Trace-back Unit with the Low Power Trace-back Unit.]


⑶ Low Power Asynchronous Viterbi Decoder [Y.H. Lee, Stanford]

▶ Algorithm

Viterbi Decoder [Stanford Solution]

[Figure: Traceback processing at time n and time n+1. The two traceback routes meet at the converge point.]


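① Initialization: trace back through a trellis five times the constraint length and store that route.

② Loop

A. Trace and compare: choose an arbitrary initial state and start the trace-back. While following the route, compare it at each node with the stored route.

B. If the compared values are equal, stop tracing and discard the stored route. If they are not equal, repeat step A.

③ Repeat step ② for each input signal.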
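The steps above can be sketched as follows. This is a minimal software model under stated assumptions, not the Stanford implementation: it assumes a 4-state shift-register trellis in which a decision bit selects the predecessor, and the names `predecessor`, `full_traceback`, and `incremental_traceback` are hypothetical. The point it illustrates is that the new trace-back can stop as soon as it meets the stored route, because both routes share the same history from the converge point backwards.

```python
def predecessor(state, d):
    """Predecessor of a 2-bit state given its decision bit (assumed trellis)."""
    return ((state & 1) << 1) | d

def full_traceback(decisions, start_state):
    """Step 1: trace the whole window; returns path[t] = state at time t."""
    path = [start_state]
    state = start_state
    for dec in reversed(decisions):
        state = predecessor(state, dec[state])
        path.append(state)
    path.reverse()
    return path

def incremental_traceback(decisions, start_state, stored_path):
    """Step 2: trace back from the newest step until the route meets stored_path.

    Returns the updated path and the number of trace-back steps actually taken.
    """
    t = len(decisions)
    state = start_state
    tail = []                              # states newer than the converge point
    steps = 0
    while t > 0 and not (t < len(stored_path) and stored_path[t] == state):
        tail.append(state)
        state = predecessor(state, decisions[t - 1][state])
        t -= 1
        steps += 1
    tail.append(state)                     # state at the converge point (or time 0)
    tail.reverse()
    return stored_path[:t] + tail, steps

# Example with made-up decision vectors (one bit per state and time step).
decisions = [[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1],
             [1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
stored = full_traceback(decisions[:-1], start_state=2)
path, steps = incremental_traceback(decisions, start_state=1, stored_path=stored)
```

Fewer trace-back steps mean fewer memory reads per decoded symbol, which is where the power saving in this scheme would come from.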

Viterbi Decoder [Stanford Solution]


▶ Implementation

Viterbi Decoder [Stanford Solution]

[Figure: Self-timed TBU block diagram. Blocks: Input Port, Shift Register, MUX, Memory Management Unit (Address, RD/WR Control), Surviving Path Memory, Previous Path Memory (Address, RD/WR Control), Comparison Logic, Ring Oscillator, and the Trace Back Unit. Signals: Request from ACS; self-precharge and self-requesting (Request) if the path is not found; Acknowledge to ACS if the path is found.]


① The self-timed TBU consumes no power while it waits for a request signal.

② The ACS issues a request signal in order to dump its state decision data.

③ The TBU reads the previous surviving path memory and the previous path memory and compares them.

④ If they are not equal, the TBU updates the previous path memory, performs self-precharging and self-requesting, and then repeats step ③. If they are equal, it goes to step ⑤.

⑤ The TBU sends an acknowledgement signal to the ACS and self-precharges for the next request signal from the ACS.
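The request/acknowledge sequence in steps ① to ⑤ can be mimicked with a small sequential model. This is a hedged sketch only: the real TBU is an asynchronous circuit, and the function name `self_timed_tbu`, the event labels, and the memory layout (two lists of states indexed by the same trellis position) are assumptions made for illustration.

```python
def self_timed_tbu(surviving_path, previous_path):
    """Model one ACS request; returns the event log and the updated previous path."""
    events = ["request"]                  # step 2: the ACS raises a request
    previous = list(previous_path)
    for t, state in enumerate(surviving_path):
        # Step 3: read both memories at this position and compare.
        if t < len(previous) and previous[t] == state:
            events.append("match")        # path found: stop tracing
            break
        # Step 4: update the previous-path memory, then self-request again.
        if t < len(previous):
            previous[t] = state
        else:
            previous.append(state)
        events.append("update")
    events.append("acknowledge")          # step 5: acknowledge to the ACS
    return events, previous

events, previous = self_timed_tbu([3, 1, 2, 0], [2, 3, 2, 0])
# events   -> ["request", "update", "update", "match", "acknowledge"]
# previous -> [3, 1, 2, 0]
```

Between calls the model does nothing at all, which corresponds to step ①: no activity while waiting for the next request.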

Viterbi Decoder


References

• David Johnson, Venkatesh Akella, and Brett Stott, “Micropipelined Asynchronous Discrete Cosine Transform (DCT/IDCT) Processor,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 4, December 1998

• T. K. Truong, Ming-Tang Shih, Irving S. Reed, and E. H. Satorius, “A VLSI Design for a Trace-Back Viterbi Decoder,” IEEE Trans. Commun., vol. 40, Mar. 1992

• G. Fettweis and H. Meyr, “High-Speed Parallel Viterbi Decoding Algorithm and VLSI-Architecture,” IEEE Communications, May 1991

• G. Feygin and P. Glenn Gulak, “Survivor Sequence Memory Management in Viterbi Decoders,” IEEE, 1991