Low Power Multiprocessor System on Chip

Jun-Dong Cho (조준동), 2008.1

SKKU Department of Mobile Phone Engineering (휴대폰학과) © Jun-Dong Cho 2008


Motivation for low-power IC design


Driving Forces for Low-Power: Deep-Submicron Technology

ADVANTAGES
• Smaller geometries
• Higher clock frequencies

DISADVANTAGES
• Higher power consumption
• Lower reliability


Example of System on Chip (SoC)

• Processor: AP - MC
• Modem: GSM/GPRS - WCDMA - CDMA2000
• Connectivity: Wireless LAN - GPS - Bluetooth
• RF/Analog: Rx - Tx - Zero IF - PM
• Camera Chipset: CIS - CCD - ISP
• Display Driver IC (DDI): STN - TFT - OLED
• Smart Card: SIM
• Flash Memory: Code/Data Storage
• RAM: Mobile DRAM - SRAM - UtRAM
• SIP / MCP


Five Minds for the future - Howard Gardner

1. Disciplinary Mind - Mastery of major schools of thought and of at least one powerful craft

2. Synthesizing mind - ability of integrating ideas from different disciplines

3. Creative mind - capacity to uncover and clarify new problems, questions and phenomena.

4. Respectful mind - Awareness of and appreciation for differences among human beings

5. Ethical mind - Fulfillment of one's responsibilities as a worker and a citizen


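IDC's picks: nine new technologies that will change the world within the next 10 years

• (1) Smart dust: intelligent miniature sensors used for logistics, monitoring, and similar tasks
• (2) Ratbots: technology for exchanging information between living organisms or between computers
• (3) Nanotube: circuit design using ultra-intense light; used in FEDs
• (4) Semantic Web: a next-generation web that also analyzes the meaning of words
• (5) Nano machine: advanced devices at nanometer scale
• (6) Quantum computing: supercomputing capability
• (7) Plastic transistors: organic light-emitting devices and transistors attached to plastic, so they can be bent flexibly
• (8) Grid computing: connecting distributed computers, mass storage, advanced equipment, and other resources over a high-speed network so that they can be shared
• (9) Lily pads: a concept for linking wireless networks together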


SOC Design Trends

• Expected to integrate more and more complex functionality: web browsing, real-time video processing, speech recognition and synthesis

• Average operating power at or below 100 mW and standby power at or below 2 mW

• Performance must increase from 300 million operations per second (MOPS) today to 2500 MOPS in 2016
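The power and performance targets above imply an energy-efficiency target. The sketch below works out that arithmetic, assuming (this is not stated on the slide) that the full 100 mW operating budget is spent at both performance levels. The function name `mw_per_mops` is illustrative.

```python
def mw_per_mops(power_mw: float, mops: float) -> float:
    """Figure of merit: milliwatts spent per MOPS delivered."""
    return power_mw / mops

# Assumption: the whole 100 mW operating budget is used in both cases.
today = mw_per_mops(100.0, 300.0)     # 300 MOPS today
target = mw_per_mops(100.0, 2500.0)   # 2500 MOPS in 2016

print(f"today : {today:.3f} mW/MOPS")
print(f"2016  : {target:.3f} mW/MOPS")
print(f"required efficiency gain: {today / target:.1f}x")
```

Under that assumption, the same power budget has to deliver roughly eight times more operations, which is the gap that the low-power techniques in the rest of the deck are meant to close.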


The Need for High Performance and Low Power

[Figure: technology trends across 1995, 2003, and 2012. Curves: Moore’s law, Shannon’s law (2.8x / 18 months), battery capacity, design complexity. Wireless data rates: 2G (IS-95) 9.6 kbps, 3G (CDMA 1xEV) 3,100 kbps, 4G 100 Mbps to 1 Gbps. Mobile multimedia: QVGA, D1, HD (720p), Full HD (1080i), 3D graphics.]

• Productivity Gap: Design complexity vs. Moore’s law

• Power Gap: Design complexity vs. Battery


Gene’s Law and Power

[Figure: power consumption (mW/MOPS) versus flexibility (changeability) for embedded processor, DSP, reconfigurable processor, embedded FPGA, and ASIC. Marked ranges: 0.001~0.01 mW/MOPS, 0.1~1 mW/MOPS, 1~10 mW/MOPS. Source: Berkeley Wireless Research Center]

[Figure: Gene’s Law: mW/MIPS falls by one half every 18 months. Source: Texas Instruments]

Candidates for baseband SDR modem: embedded processor, DSP, reconfigurable processor, embedded FPGA, ASIC. Should be 0.01~0.1 mW/MOPS?
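Gene's Law, as stated above, is a simple exponential: power per MIPS halves every 18 months. A minimal sketch of that projection follows; the function names and the starting value of 1 mW/MIPS are illustrative assumptions, not data from the slides.

```python
import math

def efficiency_after(start_mw_per_mips: float, months: float,
                     halving_months: float = 18.0) -> float:
    """mW/MIPS after `months`, if it halves every `halving_months`."""
    return start_mw_per_mips * 0.5 ** (months / halving_months)

def months_to_reach(start: float, target: float,
                    halving_months: float = 18.0) -> float:
    """Months needed for mW/MIPS to fall from `start` to `target`."""
    return halving_months * math.log2(start / target)

# Hypothetical starting point of 1 mW/MIPS.
print(f"after 3 years: {efficiency_after(1.0, 36.0):.2f} mW/MIPS")
print(f"months for a 10x improvement: {months_to_reach(1.0, 0.1):.1f}")
```

At this rate, a tenfold improvement in mW/MIPS takes about five years.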


Darwin’s Philosophy


Soft eye?


Power Metrics


Dynamic Power Consumption

• Average power consumption by a node cycling at each period T (each period has a 0→1 and a 1→0 transition):

$$P_{switching}^{battery} = \frac{E_{cycle}}{T} = C\,V_{DD}^2\,f_{CLK}$$

• Average power consumed by a node with partial activity (only a fraction $\alpha$ of the periods has a transition):

$$P_{switching}^{battery} = \alpha\,C\,V_{DD}^2\,f_{CLK}$$


Dynamic Power: energy drawn from the supply

[Figure: CMOS gate with a PMOS network and an NMOS network between V_DD and ground; input V_in, output V_o across the load capacitance C_L; supply current i_DD.]

$$P(t) = \frac{dE}{dt} = V_{DD}\,i_{DD}(t), \qquad i_{DD}(t) = C_L\,\frac{dV_o}{dt}$$

$$E_{0\to1} = \int_0^T P(t)\,dt = V_{DD}\int_0^{V_{DD}} C_L\,dV_o = C_L\,V_{DD}^2$$


CMOS Energy and Power

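• Energy per clock cycle: $E = C_L V_{DD}^2 P_{0\to1} + t_{sc} V_{DD} I_{peak} P_{0/1\to1/0} + V_{DD} I_{leak} / f_{clock}$

• Power: $P = C_L V_{DD}^2 f + t_{sc} V_{DD} I_{peak} f + V_{DD} I_{leak}$, where $f = P_{0\to1} \cdot f_{clock}$

The three terms are:

• Dynamic power (~80% today and decreasing relatively)

• Short-circuit power (~5% today and decreasing absolutely)

• Leakage power (~15% today and increasing)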
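The power expression above can be evaluated term by term. The sketch below does so for one set of made-up parameter values; every number in the example call is a hypothetical placeholder, chosen only to show how the three terms are computed and how the dynamic term responds to supply voltage.

```python
def cmos_power(c_load, vdd, f_clock, p01, t_sc, i_peak, i_leak):
    """Return (dynamic, short_circuit, leakage) power in watts.

    f = p01 * f_clock is the effective switching frequency.
    """
    f = p01 * f_clock
    dynamic = c_load * vdd ** 2 * f
    short_circuit = t_sc * vdd * i_peak * f
    leakage = vdd * i_leak
    return dynamic, short_circuit, leakage

# Hypothetical values: 1 nF lumped switched capacitance, 1.2 V supply,
# 200 MHz clock, 10% switching probability, 50 ps short-circuit time,
# 100 mA peak short-circuit current, 5 mA leakage current.
params = dict(c_load=1e-9, f_clock=200e6, p01=0.1,
              t_sc=50e-12, i_peak=0.1, i_leak=5e-3)

full = cmos_power(vdd=1.2, **params)
half = cmos_power(vdd=0.6, **params)

for name, a, b in zip(("dynamic", "short-circuit", "leakage"), full, half):
    print(f"{name:14s} {a:.3e} W at 1.2 V, {b:.3e} W at 0.6 V")
```

Halving the supply voltage cuts the dynamic term to one quarter, while the short-circuit and leakage terms, as modeled here with fixed currents, only halve. This is why supply voltage reduction is the main lever on dynamic power.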


Static Power Consumption

$P_{static} = V_{CC} \times N_{tr} \times I_{leak}$


SCALING TREND

Goals: keep pace with Gene’s Law (a DSP chip’s energy efficiency in MIPS/Watt doubles every 18 months), low cost, high flexibility, and reduced static power in the idle state.

• Gene’s Law: technology and circuits: voltage islands; architecture: MPSoC
• Low cost: integrate, but only when cost-effective; push towards analog and digital (A & D) integration
• High flexibility: software radios, reconfigurable architectures
• Reduce static power in idle state: variable Vdd, VT


Gartner's top 10 technologies for 2007

• Open Source
• Virtualization
• Information Access
• Ubiquitous Computing
• Grid Computing
• Compute Utilities
• Multicore Processors
• Web 2.0
• Network Convergence
• Water Cooling


The Future of Mobile Computing

Mudge et al.:

• Real-time mobile supercomputing
– Speech recognition, cryptography
– Augmented reality

• Requires 16 Pentium 4 processors
– 2004: Intel P4 @ 3 GHz; 55M transistors, 122 mm², 0.09 µm
– 2014: 20 GHz, 0.03 µm

• High performance within a low-power budget
– Requires (massive) parallelism
– Multi-processor systems
– Subsystem integration


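Problems with Existing Technology

▷ Under a power constraint, a monolithic processor consumes too much power.

▷ A homogeneous MP-SoC has low resource utilization, so its power consumption grows linearly.

▷ Processors are bound by wire and memory latencies.

▷ Peak performance is reached only for specific application domains.

▷ On-chip interconnect design: task mapping and IPC selection (pipes, FIFOs, message queues, semaphores, shared memory, etc.)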


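Low-Power PC and Embedded Processors

• High-end processor cores
– AMD: memory controller and Northbridge integrated into the CPU; graphics processor integration through the ATI acquisition. This removes the external memory access bottleneck and reduces power consumption.
– Intel: larger internal cache memory, improved prefetch mechanism
– IBM: multimedia functions integrated in the Cell processor

• As multicore processors become the norm for high performance at low power, symmetric multiprocessors are expected to be adopted in next-generation mobile handset chipsets as well.

• ARM plans to announce AMBA4 as the successor to AXI. The SonicsMX technology from Sonics, a NoC pioneer, has been adopted in OMAP and other platforms.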


Road Map to MP-SoC Trends

• Mask NRE: over $1M; design NRE: $10M to $75M
– ASICs replaced by programmable ASSP, FPGA

• Number of embedded processors
– DVD/STB/HDTV, mobile phones: 5 to 8
– Image processing, networking, basestation: 8 to 100+

• Enabled and compelled by Moore’s Law
– ITRS: 2009, 90nm process, 100M gates = 2500 ARM7 cores


Dual-Core (DSP+ARM) Platform


MP-SoC Microprocessor


Cell Processor


MP-SoC Microprocessor


# of Processors per chip


Parallelism favors lower power solutions

P. G. Paulin et al., “Parallel Programming Models for a Multiprocessor SoC Platform Applied to Networking and Multimedia”, IEEE Transactions on VLSI Systems, Vol. 14, No. 7, July 2006


Parallelism Inside the Processor

Chris Rowen, President and CEO, Tensilica, Inc.


Multiple concurrent processors: much lower energy

Chris Rowen, President and CEO, Tensilica, Inc.
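One common first-order argument for why several slower processors can use less power than one fast processor is sketched below. It is not taken from the Tensilica slides. It assumes dynamic power C·V²·f per core, a perfectly parallel workload split across N cores each running at f/N, and a supply voltage that can be lowered in proportion to clock frequency down to a floor. The function name, the `v_min_ratio` floor, and the `overhead` factor are all hypothetical.

```python
def parallel_power(n_cores: int, c_eff: float, vdd: float, f: float,
                   v_min_ratio: float = 0.5, overhead: float = 0.0) -> float:
    """Total dynamic power of n_cores, each at f / n_cores.

    Assumption: voltage scales linearly with frequency but cannot go
    below v_min_ratio * vdd. `overhead` models extra interconnect power.
    """
    scale = max(1.0 / n_cores, v_min_ratio)
    per_core = c_eff * (vdd * scale) ** 2 * (f / n_cores)
    return n_cores * per_core * (1.0 + overhead)

baseline = parallel_power(1, c_eff=1.0, vdd=1.0, f=1.0)
for n in (1, 2, 4, 8):
    rel = parallel_power(n, c_eff=1.0, vdd=1.0, f=1.0) / baseline
    print(f"{n} core(s): relative power {rel:.2f} at equal throughput")
```

In this model the saving comes entirely from the lower supply voltage. Once the voltage floor is reached, adding cores no longer reduces power, and any interconnect overhead starts to dominate.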


Keys to Efficient MP

Flexible range of topologies

Chris Rowen, President and CEO, Tensilica, Inc.


Parallel Architectures


MP-ARM Platform


Homogeneous MP-SOC

• 32-bit ARM processors
• Private memory
• Shared memory
• Hardware interrupt module
• Hardware semaphore module
• 32-bit interconnection (AMBA Bus or STBus)
• Processor core modeling: C++
• Hardware interconnection modeling: SystemC

[Figure: four ARM cores, each with its own private memory, connected through the interconnection (AMBA or STBus) to a shared memory, a semaphore device, and an interrupt device.]
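The role of the hardware semaphore module in this platform can be illustrated with a small software model. The sketch below is only an analogy: Python threads stand in for the four ARM cores, a `threading.Semaphore` stands in for the semaphore device, and dictionaries stand in for the private and shared memories. None of these names come from the MP-ARM platform itself.

```python
import threading

NUM_CORES = 4
ITERATIONS = 1000

def run_platform():
    """Model four cores updating private memory freely and shared memory
    only while holding the (single) semaphore."""
    shared_memory = {"counter": 0}
    hw_semaphore = threading.Semaphore(1)   # stand-in for the semaphore device
    private_memories = [{"work_done": 0} for _ in range(NUM_CORES)]

    def core(core_id: int) -> None:
        private = private_memories[core_id]
        for _ in range(ITERATIONS):
            private["work_done"] += 1        # private memory: no arbitration
            with hw_semaphore:               # acquire before touching shared data
                shared_memory["counter"] += 1

    threads = [threading.Thread(target=core, args=(i,)) for i in range(NUM_CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_memory["counter"], [m["work_done"] for m in private_memories]

counter, work = run_platform()
print("shared counter:", counter)
print("per-core private work:", work)
```

Private memory accesses need no synchronization, while every shared-memory update is serialized by the semaphore, which is the cost the shared bus and shared memory impose on a homogeneous MP-SoC.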


RTEMS: An OS with Multiprocessor Support

• RTEMS is an OS that supports C and assembly code; both semaphores and interrupts can be used.

• RTEMS supports the following processors:
– ARM: ARM V7 and above
– C4x: TI C3x and C4x DSPs
– H8300: Hitachi H8 Family
– Hppa1.1: HP PA-RISC
– I386: Intel i386, i486, Pentium and above, AMD Athlon and above
– I960: Intel i960 family
– M68k: Motorola m680x0, m683xx, CPU32, and Coldfire CPUs
– MIPS: MIPS ISA Levels 1 and above for 32 and 64 bit CPU models
– PowerPC: IBM and Motorola PowerPC 4xx, 5xx, 6xx, 7xx, 8xx, 74xx, and 75xx


VDSL Modem Application


Mobile Application Processor MP211


Power Distribution

Intel Xeon processor


Cell Processor


Cradle’s CT 3400 Multi-core DSP

• Eight 32-bit DSP cores

• Six 32-bit general-purpose processor cores

• 128-pin programmable I/O subsystem

• C-programmable

• Supports H.264 and MPEG4 codecs

http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

H.264 encoder, decoder, audio codecs, and system control


CT 3616 Multi-core DSP

http://www.cradle.com/downloads/CT3600-PB.pdf


SODA System for 3G

2-Level scratchpad memories

- 12KB Local memory for stream queues

- 64KB global memory for larger buffers

Low-throughput shared bus

- 200 MHz 32-bit bus

- Inter-PE communication using DMA

Scott Mahlke, U. of Michigan


Task level parallelism



Heterogeneous MP Core

If two or more cores share an L2 cache, as many present CMPs do, a crossbar provides a high-bandwidth connection.


Heterogeneous Chip Multiprocessors

▷ A single-ISA heterogeneous multicore architecture outperforms designs that rely on voltage scaling, clock gating, speculation control, and similar techniques.

▷ Compared with a homogeneous CMP (chip multiprocessor), a heterogeneous CMP (or asymmetric CMP) has many advantages. Many applications want to use small cores as well as large ones. The best choice also depends on the system context, such as running on battery versus mains power. Using cores of different complexity is therefore efficient.

▷ A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time. In a single-ISA heterogeneous CMP, however, every core executes the same ISA, so each application can be mapped to any core.

▷ For an 8-core processor, the interconnect consumes as much power as one core.

▷ Applying dynamic voltage scaling and gating unused cores improves the energy-delay product by 75%.

▷ For a dual processor, heterogeneous processors that exploit both low and high thread-level parallelism perform 63% better than homogeneous ones.

▷ With 5 to 8 threads, the average improvement is 29%. According to Amdahl's law, the speedup of parallel applications is limited by their serial portion.


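Amdahl's Law

• Used to calculate the maximum overall performance improvement when part of a computer system is improved.

• According to Amdahl's law, if a fraction P of a system is improved by a factor of S, the maximum improvement of the whole system is

$$\frac{1}{(1 - P) + \dfrac{P}{S}}$$

• For example, if the part that makes up 40% of a task can be made twice as fast, then P is 0.4, S is 2, and the maximum performance improvement is $1 / (0.6 + 0.2) = 1.25$.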
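A minimal sketch of Amdahl's law as a function, using the slide's symbols P (improved fraction) and S (speedup of that fraction). The helper `parallel_speedup` is an illustrative name for the multiprocessor special case, where the parallel fraction is 1 − F and the speedup of that fraction is the number of processors N.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

def parallel_speedup(f_sequential: float, n_processors: int) -> float:
    """Special case: fraction (1 - f_sequential) runs on n_processors."""
    return amdahl_speedup(1.0 - f_sequential, n_processors)

# The slide's example: 40% of the task made twice as fast.
print(f"P = 0.4, S = 2 -> {amdahl_speedup(0.4, 2.0):.2f}")

# With a 10% sequential part, adding processors gives diminishing returns.
for n in (2, 8, 64, 1024):
    print(f"F = 0.1, N = {n:4d} -> {parallel_speedup(0.1, n):.2f}")
```

As N grows, the speedup approaches 1/F, so a 10% sequential portion caps the speedup below 10 no matter how many processors are added.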


Amdahl’s law: Parallelization

In the special case of parallelization, Amdahl's law states that if F is the fraction of a calculation that is sequential (i.e. cannot benefit from parallelization), and (1 − F) is the fraction that can be parallelized, then the maximum speedup that can be achieved by using N processors is

$$\frac{1}{F + \dfrac{1 - F}{N}}$$


Single-ISA heterogeneous multi-core

Power and relative performance of Alpha cores scaled to 0.1 µm. Performance is normalized to that of EV4.

EV8 is 80 times bigger but provides only two to three times more single-threaded performance.


Equal-area heterogeneous architectures with multithreaded cores


Exploring the potential from heterogeneity


Heterogeneous MP-SoC: Problems and Areas for Improvement

- Processors are bound by wire and memory latencies

- Peak performance on only a small class of applications

- How well they map to a given design

- Diversification of workloads

- Increased hardware complexity

- Poor resource utilization


Multimode Embedded Systems

• MP3 players & video decoders
• Modes with higher probability are implemented more energy-efficiently
• Task mapping
• Communication mapping
• Timing schedule
• Voltage schedule


Energy-Aware Task mapping

Minimize energy consumption, given a CTG and a heterogeneous NoC. Addressed by Hu et al., 2002:

• Find:
– A mapping function M : tasks (T) => PEs (P)
– Assuming the tasks are already scheduled and partitioned

• Solution formulated as a quadratic assignment problem and solved using branch and bound.

• Communication-optimal task mapping
– Minimal hardware (buffers and wires) required to meet the timing requirements defined in the specification
– Given a multiprocessor network, find a mapping of the application that satisfies the timing constraints

• Genetic algorithm (chromosome, generation, crossover, mutation)
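The branch-and-bound idea for energy-aware mapping can be sketched on a toy instance. This is not the formulation of Hu et al.; it is a minimal illustration that assumes one task per PE, non-negative energies, and a cost made of per-PE execution energy plus communication volume times a per-pair hop energy. All tables and names below are hypothetical.

```python
def mapping_energy(mapping, exec_energy, comm_volume, hop_energy):
    """Energy of a complete mapping: mapping[t] is the PE running task t."""
    total = sum(exec_energy[t][pe] for t, pe in enumerate(mapping))
    for (a, b), volume in comm_volume.items():
        total += volume * hop_energy[mapping[a]][mapping[b]]
    return total

def branch_and_bound(exec_energy, comm_volume, hop_energy):
    """Find the minimum-energy one-task-per-PE mapping."""
    n_tasks, n_pes = len(exec_energy), len(exec_energy[0])
    best = {"energy": float("inf"), "mapping": None}

    def partial_energy(mapping):
        # Energy of the tasks placed so far. Since all terms are
        # non-negative, this is a lower bound on any completion.
        cost = sum(exec_energy[t][pe] for t, pe in enumerate(mapping))
        for (a, b), volume in comm_volume.items():
            if a < len(mapping) and b < len(mapping):
                cost += volume * hop_energy[mapping[a]][mapping[b]]
        return cost

    def search(mapping):
        cost = partial_energy(mapping)
        if cost >= best["energy"]:
            return                      # prune: cannot beat the best found
        if len(mapping) == n_tasks:
            best["energy"], best["mapping"] = cost, tuple(mapping)
            return
        for pe in range(n_pes):
            if pe not in mapping:       # each PE hosts at most one task
                search(mapping + [pe])

    search([])
    return best["mapping"], best["energy"]

# Hypothetical 3-task, 3-PE instance.
exec_energy = [[4, 2, 3], [3, 5, 1], [2, 2, 4]]   # exec_energy[task][pe]
comm_volume = {(0, 1): 2, (1, 2): 1}              # edges of the task graph
hop_energy = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]    # energy per unit between PEs

print(branch_and_bound(exec_energy, comm_volume, hop_energy))
```

The bound used here is deliberately weak; a practical solver would add a lower bound on the energy of the tasks not yet placed so that more of the search tree is pruned.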


MPSoC Clock and Power

Olivier Franza, Intel

• Increased uncertainty with process scaling
– Process, voltage, temperature variations, noise, coupling

• Affects design margin: over-design, power & performance loss
– Increased power constraints
– Increasing leakage, power (density, delivery) limitations

• More transistors mean:
– Larger clock distribution networks
– Higher capacitance (more load and parasitics)

• With each new technology:
– Gate delay decreases ~25%
– Wire delay increases ~100%
– Cross-chip communication increases
– Clock needs multiple cycles to cover die


Multiple clock domains

• Low skew and jitter ALWAYS a must
• Clock modeling requires more accuracy
– Within-die variations, inductance, crosstalk, electromigration, self-heat, …
• Floor plan modularity
– Think adding/removing cores seamlessly!
• Hierarchical clock partitioning
– Reduce global clock and possibly relax its requirements
– Generate “locally”-used clock “locally”
– Implement clock domain deskewing techniques
– Bound clock problem into simple, reliable, efficient domains


Clock and Power Convergence: Intel® Itanium® Montecito

• Each core split into 3 clock domains on variable power supply

• Each domain controlled by a Digital Frequency Divider (DFD) generating low-skew variable-frequency clocks; fed by central PLL and aligned through phase detectors

• Regional Voltage Detector (RVD): supply voltage monitor

• Second level clock buffer (SLCB): digitally controlled delay buffer for active deskewing

• Regional Active Deskew (RAD): phase comparators monitoring and adjusting delay difference between SLCBs

• Clock Vernier Device (CVD): digitally controlled delay buffer


On-Chip Interconnects: Circuits and Signaling, Wayne Burleson

• Using Vdd programmability
– High Vdd to devices on critical path
– Low Vdd to devices on non-critical paths
– VddOff for inactive paths

[Figure: A – Baseline Fabric; B – Fabric with Vdd Configurable Interconnect]

This work builds on a similar idea for FPGAs described in: Fei Li, Yan Lin and Lei He, “Vdd Programmability to Reduce FPGA Interconnect Power”, IEEE/ACM International Conference on Computer-Aided Design, Nov. 2004


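Quiz

1. Which of the following is not one of the five requirements for improving the productivity of next-generation SoCs? (Answer: 1)

1) DSP 2) DfM 3) Small Form Factor 4) Low Power Solutions

2. Which of the following is not a verification language used in ESL? (Answer: 3)

1) UML 2) Java 3) RTL 4) SystemC

3. Which of the following is not a low-power technique? (Answer: 4)

1) TR-sizing 2) DPM/DVS 3) MTCMOS 4) ESL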


Summary

1. Trade-off between power and performance
2. Logic and architecture level power optimization
3. Switching activity and supply voltage reductions
4. Multiprocessor SoC for voltage scaling
5. Reconfigurability helps power reduction


Efficient Shared DRAM Subsystems for SOCs, Sonics Inc.

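• Current MP-SOC platforms consist of several master processors that exchange data through shared memory. With shared memory, the bandwidth available to each processor shrinks and access is much slower than to local memory attached directly to each processor, which has become a major cause of MP-SOC performance degradation. To avoid this degradation, multi-port memories, which allow parallel access and therefore improve bandwidth, are currently used. However, because of their large wiring area and high power consumption, these memories do not provide an efficient solution for MP-SOCs. For this reason, bus-switching architectures built on single-port memory are used to reduce area and power consumption.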


• Increases SOC performance
– Improves efficiency of off-chip DRAM by up to 40%
– Guarantees Quality of Service for on-chip cores

• Lowers SOC costs
– Consolidates and reduces multiple distributed buffers
– Single Smart Interconnect replaces multiple layered busses

• Shortens time to market
– Smart Interconnect removes wire routing problem of classical architectures
– Accurate architectural exploration ensures functionality in a day development

• Increased market penetration
– DRAM technology selection decoupled from the rest of the SOC
– Threaded architecture enables easy scalability without re-design of memory subsystem


Star Topology Access to a Shared DRAM Subsystem

• The DRAM controller sees all initiator requests at the same time and can select the order in which to service them.

• This lets it optimize the performance of the DRAM subsystem by reordering requests.

• It can also provide flexible quality of service to each of the initiators.

• Drawback: a large number of wires converge on the DRAM controller, producing physical problems for the design.


SiliconBackplane and DRAM Scheduler

• The shared μnetwork remedies the wire congestion problem. The DRAM scheduler addresses both the DRAM performance issues and the quality-of-service guarantees by selectively scheduling the DRAM accesses.

• Round-Robin Arbitration: Many of the initiators receive less than their required bandwidth so overall application requirements are unsatisfied.

• Bandwidth Profile of Bus with Priority Arbitration: From 5000 to 9000 cycles all but two initiators (CPU and DSP) receive no service at all. Clearly, this is unacceptable for the set-top-box application.
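The difference between the two arbitration schemes above can be reproduced with a toy single-grant-per-cycle bus model. This is not the Sonics experiment: the initiator names and their demand rates (requests per cycle) are hypothetical, chosen so that total demand exceeds the bus capacity.

```python
def simulate(demands, cycles, policy):
    """Grant at most one request per cycle; return requests served per initiator.

    demands: requests generated per cycle, listed in priority order.
    policy:  "priority" (lowest index always wins) or "round_robin".
    """
    n = len(demands)
    backlog = [0.0] * n
    served = [0] * n
    rr_next = 0
    for _ in range(cycles):
        for i, rate in enumerate(demands):
            backlog[i] += rate
        ready = [i for i in range(n) if backlog[i] >= 1.0]
        if not ready:
            continue
        if policy == "priority":
            winner = ready[0]
        else:
            # Pick the first ready initiator at or after the rotating pointer.
            winner = min(ready, key=lambda i: (i - rr_next) % n)
            rr_next = (winner + 1) % n
        backlog[winner] -= 1.0
        served[winner] += 1
    return served

names = ["CPU", "DSP", "video", "audio"]   # hypothetical initiators
demands = [0.6, 0.5, 0.2, 0.1]             # total 1.4 > 1: bus is oversubscribed
cycles = 1000

for policy in ("priority", "round_robin"):
    served = simulate(demands, cycles, policy)
    shares = ", ".join(f"{nm}={s / cycles:.2f}" for nm, s in zip(names, served))
    print(f"{policy:12s} {shares}")
```

In this model fixed priority lets the two top initiators absorb the whole bus and starves the rest, while round-robin serves the small initiators but leaves the large ones below their demand. Neither matches allocation to need, which is the case for a scheduler that allocates bandwidth per initiator.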


Bandwidth Profile of Sonics Solution

• Each of the initiators is connected to the DRAM through a Silicon Backplane μNetwork, a DRAM scheduler, and a DRAM controller. The Silicon Backplane and DRAM bandwidth have been allocated to the different initiators according to their needs.

• All application requirements are met and overall DRAM utilization is steady at around 70%.

• A system [11] that can automatically configure core size and voltage/frequency according to the benchmarks and applications still needs to be developed.
