Multi-Core System on Chip Design Trends 2. Presenter: Prof. Jun-Dong Cho (조준동), December 2003


Page 1

Multi-Core System on Chip

Design Trends 2

Presenter: Prof. Jun-Dong Cho (조준동), December 2003

Page 2

What is Software Radio

- A transceiver in which all aspects of its operation are determined using versatile general purpose hardware whose configuration is under software control

- Flexible all-purpose radios that can implement new and different standards or protocols through reprogramming.

- Same hardware for all air interfaces and modulation schemes

Page 3

Key Technological Constraints

• High-speed wideband ADCs

• High-speed DSPs

• Real-time operating systems (isochronous software)

• Power consumption

Page 4

Research and Commercialization

• DARPA’s Adaptive computing system project

• Virginia Tech – algorithms and architecture ; multi user receiver based on reconfigurable computing ; generic soft radio architecture for reconfigurable hardware

• UC Berkeley – Pleiades, ultra low power, high performance multimedia computing ; high power efficiency by providing programmability

• Sirius Inc – Software Reconfigurable Code Division Multiple Access (CDMAx)

Page 5

Research and Commercialization

• Brigham Young University – Development of JHDL to facilitate hardware synthesis in reconfigurable processors

• Chameleon Systems – Reconfigurable Platform Architecture for wireless base station

• MorphIC Inc – Programmable hardware reconfigurable code using DRL

• Quicksilver Tech. Inc – Universal Wireless Ngine (WunChip) baseband algorithms

Page 6

Applications

• User applications and base station applications

• Evolve as a universal terminal

• Spectrum management: reconfigurability is a big advantage

• Application updates, service enhancements and personalization

Page 7

Programmable OFDM-CDMA Transceiver

• CDMA suffers from Multiple access interference and ISI.

• OFDM reduces interference and helps better spectrum utilization and attainment of satisfactory BER.

• It is proposed that this might be implemented by using SDR.
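As a rough illustration of the OFDM half of such a transceiver, the sketch below maps a block of symbols onto orthogonal subcarriers with an inverse DFT and prepends a cyclic prefix, the usual guard against ISI. The function names, the naive O(N²) DFT, and the QPSK test values are assumptions made for illustration; the slides do not specify an implementation.

```python
import cmath

def idft(freq_bins):
    """Naive inverse DFT: frequency-domain symbols -> time-domain samples."""
    n_sub = len(freq_bins)
    return [sum(freq_bins[k] * cmath.exp(2j * cmath.pi * k * n / n_sub)
                for k in range(n_sub)) / n_sub
            for n in range(n_sub)]

def dft(samples):
    """Naive forward DFT: time-domain samples -> frequency-domain symbols."""
    n_sub = len(samples)
    return [sum(samples[n] * cmath.exp(-2j * cmath.pi * k * n / n_sub)
                for n in range(n_sub))
            for k in range(n_sub)]

def ofdm_modulate(symbols, cp_len):
    """One OFDM symbol: IDFT of the subcarrier symbols plus a cyclic prefix."""
    body = idft(symbols)
    prefix = body[len(body) - cp_len:]   # copy of the last cp_len samples
    return prefix + body

def ofdm_demodulate(samples, n_sub, cp_len):
    """Drop the cyclic prefix and recover the subcarrier symbols."""
    return dft(samples[cp_len:cp_len + n_sub])

# Hypothetical QPSK symbols on four subcarriers.
qpsk = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
tx = ofdm_modulate(qpsk, cp_len=2)
rx = ofdm_demodulate(tx, n_sub=4, cp_len=2)
```

Because both functions are plain software, changing the subcarrier count or prefix length is a parameter change, which is the kind of reconfiguration an SDR is meant to allow.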

Page 8

SDR Architecture

[Figure: SDR block diagram. An RF unit (Rx SYN, Tx SYN, RX, TX, EX., PA, LNA blocks, with receive/transmit paths) connects to a signal processing/control unit made up of data converter, quadrature MODEM, baseband MODEM, and interface/control modules on a C-PCI bus, with an HMI terminal and input/output.]

Hitachi Kokusai Electric Inc.

Page 9

Signal processing/control unit

• The signal processing/control unit consists of the following modules:
– Data converter
– Quadrature modem
– Baseband modem
– Interface/control

• Every module is connected to the others by the PCI bus and provides a CPU in addition to the FPGA and DSP devices.

Page 10

Quadrature modem module

• The quadrature modem uses FPGAs for the processing that produces the baseband sampling rate:

– Quadrature modulation
– Quadrature detection
– Sampling rate conversion
– Filtering
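The four functions listed for the quadrature modem can be sketched in software as follows. This is a minimal floating-point model, not the FPGA design from the slides: the function names, the moving-average filter, and the carrier at one quarter of the sample rate are all assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def moving_average(samples, taps):
    """Crude low-pass FIR filter: mean over a sliding window of `taps` samples."""
    return [sum(samples[i:i + taps]) / taps
            for i in range(len(samples) - taps + 1)]

def quadrature_modulate(i_samples, q_samples, cycles_per_sample):
    """Mix baseband I/Q onto a carrier: s[n] = I[n]cos(wn) - Q[n]sin(wn)."""
    w = 2.0 * math.pi * cycles_per_sample
    return [i * math.cos(w * n) - q * math.sin(w * n)
            for n, (i, q) in enumerate(zip(i_samples, q_samples))]

def quadrature_detect(passband, cycles_per_sample, taps):
    """Mix down with cos and -sin, then low-pass to remove the 2w terms."""
    w = 2.0 * math.pi * cycles_per_sample
    i_mixed = [2.0 * s * math.cos(w * n) for n, s in enumerate(passband)]
    q_mixed = [-2.0 * s * math.sin(w * n) for n, s in enumerate(passband)]
    return moving_average(i_mixed, taps), moving_average(q_mixed, taps)

def decimate(samples, factor):
    """Sampling rate conversion: keep every `factor`-th filtered sample."""
    return samples[::factor]

# Constant I/Q test signal; carrier frequency and tap count are hypothetical.
passband = quadrature_modulate([0.5] * 16, [-0.25] * 16, cycles_per_sample=0.25)
i_out, q_out = quadrature_detect(passband, cycles_per_sample=0.25, taps=4)
i_base = decimate(i_out, 4)
```

With the carrier at a quarter of the sample rate, a 4-tap average cancels the double-frequency term exactly, so the detector returns the original I and Q levels.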


Page 11

Baseband modem module

• The baseband modem processes:
– Multi-channel modulation
– Multi-channel demodulation

• It uses four floating-point DSP devices.

• An individual DSP is assigned to each channel. Therefore, even while one channel is busy processing, a program can be downloaded to another channel.


Page 12

An SDR/Multimedia Solution

W-CDMA / DAB / DVB / IEEE 802.11x; MPEG / JPEG codecs

Page 13

PACT’s SDR XPP eXtreme Processor Platform

Page 14

PACT’s SDR XPP

Page 15

Architecture Goals

• Provide a template for the exploration of a range of architectures

• Retarget compiler and simulator to the architecture

• Enable compiler to exploit the architecture

• Concurrency
– Multiple instructions per processing element
– Multiple threads per and across processing elements
– Multiple processes per and across processing elements

• Support for efficient computation
– Special-purpose functional units, intelligent memory, processing elements

• Support for efficient communication
– Configurable network topology
– Combined shared memory and message passing

Page 16

Architecture Template

• Prototyping template for an array of processing elements
– Configure processing elements for efficient computation
– Configure memory elements for efficient retiming
– Configure the network topology for efficient communication

[Figure: processing elements built from functional units (FU), register files, memory, instruction caches, and DCT and HUF blocks, shown in three steps: configure the PE, configure the memory elements, configure the PEs and network to match the application.]

Page 17

Future Processing Element

• Specialized memory systems for efficient memory utility
– Multi-ported, banked, levels, and intelligent memory

• Split register file allows greater register bandwidth to FUs
– Groups of functional units have dedicated register files

• Multiple contexts for a processing element provide latency tolerance
– Hardware for efficient context switching to fill empty instruction slots

• Specialized functional units and processing elements
– SIMD instructions
– Reconfigurable fabrics for bit-level operations
– Reuse of IP blocks for more efficient computation
– Custom hardware for the highest performance

Page 18

Initial Distributed Architecture

• Array of concurrent PEs and supporting network

• Malleable network topology
– Topology matches application

• Efficient communication

• Memory organized around a PE
– Each PE has physical memory
– Message passing between PEs

[Figure: two arrays of PEs joined by differently configured networks.]
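The organization on this slide (private memory per PE, explicit messages over a configurable topology) can be modeled in a few lines. The class names, the ring topology, and the message format are hypothetical and only illustrate the idea.

```python
from collections import deque

class ProcessingElement:
    """A PE with its own local memory and an inbox for incoming messages."""
    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.local_memory = {}
        self.inbox = deque()

    def receive(self):
        """Copy the oldest pending message into local memory; return its key."""
        if not self.inbox:
            return None
        key, value = self.inbox.popleft()
        self.local_memory[key] = value
        return key

class Network:
    """Topology is just a set of directed (src, dst) links, so it can be
    reshaped to match the application."""
    def __init__(self, pes, links):
        self.pes = {pe.pe_id: pe for pe in pes}
        self.links = set(links)

    def send(self, src, dst, key, value):
        if (src, dst) not in self.links:
            raise ValueError(f"no link from PE {src} to PE {dst}")
        self.pes[dst].inbox.append((key, value))

# Four PEs connected as a one-way ring (an assumed topology).
pes = [ProcessingElement(i) for i in range(4)]
ring = Network(pes, [(i, (i + 1) % 4) for i in range(4)])
ring.send(0, 1, "block", [1, 2, 3])
pes[1].receive()
```

Data reaches PE 1 only through the message; PE 0 and PE 1 never share an address, which is what distinguishes this from the shared-memory organization.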

Page 19

Future Distributed Architecture

• Multiple processing elements share a memory space
– Shared memory communication
  • Snooping cache coherency protocol
  • Directory-based protocol required if the number of PEs in a shared memory space is large

• Introspective processing elements
– Use processing elements to analyze the computation or communication
  • Identify dynamic bottlenecks and remove them on the fly
  • Reschedule and bind tasks as the introspective elements report

Page 20

So What’s Different?

• Traditional application hw/sw design requires
– Hand selection of traditional general-purpose OS components
– Hand-written customization of
  • device drivers
  • memory management…

• Instead…
– Application-specific synthesis of OS components
  • scheduling
  • synchronization…
– Automatic synthesis of hardware-specific code from specifications
  • device drivers
  • memory management…

Page 21

ASIP Design

• Given a set of applications, determine the microarchitecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set).

• To accurately evaluate the performance of a processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code.

• The microarchitecture of the processor is a design parameter!

Page 22

ASIP Design Flow

Page 23

Compiler Goals

• Develop a retargetable compiler infrastructure that enables a set of interesting applications to be efficiently mapped onto a family of fully programmable architectures and microarchitectures.

• 10-year vision:
– Fully automatically retargetable compilation, OS synthesis, and simulation for a class of architectures consisting of multiple heterogeneous processing elements with specialized functional units / memories
– Compiled code size and performance will be within 10% of hand-coding

Page 24

Compiler Research Issues

• Synthesis of RTOS elements in the compiler
– On the application side: generation of an efficient application-specific static/run-time scheduler and synchronization
– On the hardware side: generation of device drivers, memory management primitives, etc. using hardware specifications

• Automatic retargetability for a family of target architectures while preserving aggressive optimization

• Automatic application partitioning
– Mapping of process/task-level concurrency onto multiple PEs using programmer guidance in the programmer’s model

• Effective visualization for a family of target architectures

Page 25

An Efficient Architecture Model for Systematic Design of Application-Specific Multiprocessor SoC

DATE 2001

Amer Baghdadi, Damien Lyonnard, Nacer-E. Zergainoh, Ahmed A. Jerraya

TIMA Laboratory, Grenoble, France

Page 26

Efficient application-specific multiprocessor design

• Modularity

• Flexibility

• Scalability

Page 27

A multiprocessor architecture platform for application-specific SoC design (1)

Figure 1. A multiprocessor architecture platform

Page 28

A multiprocessor architecture platform for application-specific SoC design (2)

• Architecture platform parameters

1. Number of CPUs

2. Memory sizes for each processor

3. I/O ports for each processor

4. Interconnections between processors

5. Communication protocols and the external connections (peripherals)
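One way to picture these parameters is as a configuration record that a generation flow would consume. The sketch below is an assumption about how such a record might look; the class names, field names, and the consistency check are invented for illustration and are not taken from the DATE 2001 paper.

```python
from dataclasses import dataclass, field

@dataclass
class Processor:
    name: str
    memory_bytes: int    # memory size for this processor
    io_ports: int        # number of I/O ports on this processor

@dataclass
class Link:
    src: str
    dst: str
    protocol: str        # communication protocol used on this connection

@dataclass
class Platform:
    processors: list                                   # number of CPUs = len(processors)
    links: list = field(default_factory=list)          # interconnections
    peripherals: list = field(default_factory=list)    # external connections

    def check(self):
        """Return a list of problems; an empty list means the record is consistent."""
        problems = []
        ports = {p.name: p.io_ports for p in self.processors}
        endpoints = set(ports) | set(self.peripherals)
        used = {name: 0 for name in ports}
        for link in self.links:
            for end in (link.src, link.dst):
                if end not in endpoints:
                    problems.append(f"unknown endpoint {end}")
                elif end in used:
                    used[end] += 1
        for name, count in used.items():
            if count > ports[name]:
                problems.append(f"{name} needs {count} ports, has {ports[name]}")
        return problems

# Hypothetical two-CPU platform with one peripheral.
platform = Platform(
    processors=[Processor("cpu0", 64 * 1024, 2), Processor("cpu1", 32 * 1024, 1)],
    links=[Link("cpu0", "cpu1", "fifo"), Link("cpu0", "uart", "handshake")],
    peripherals=["uart"],
)
```

Calling `platform.check()` on this example returns an empty list; a link to an undeclared block would be reported instead.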

Page 29

Application-specific multiprocessor SoC design flow (1)

Figure 2. The Y-chart: MFSAM-based architecture generation scheme

Page 30

Application-specific multiprocessor SoC design flow (2)

Figure 3. MFSAM-based architecture generation flow for multiprocessor SoC

Page 31

Architecture design (1)

Figure 4. Communication Interface

Page 32

Architecture design (2)

Figure 5. Block diagram of the packet routing switch (Point to Point network)

Page 33

Architecture validation

Figure 6. A 4-processor cosimulation architecture of the packet routing switch

Page 34

Analyzing the design cycle (1)

Figure 7. A 4-processor cosimulation architecture of the IS-95 CDMA

Page 35

Analyzing the design cycle (2)

Table 1. Time needed to fit the IS-95 CDMA on the multiprocessor platform

Page 36

Conclusion

1. Presented a generic architecture model for application-specific multiprocessor system-on-chip design

2. The proposed model is modular, flexible and scalable.

3. Definition of the architecture model and a systematic design flow that can be automated.

Page 37

A Single-Chip Multiprocessor

• Currently, processor designs dynamically extract parallelism by executing many instructions within a single, sequential program in parallel.

• Future performance improvements will require processors to be enlarged to execute more instructions per clock cycle.

• Two alternative micro-architectures that exploit multiple threads of control

– SMT: simultaneous multithreading
– CMP: chip multiprocessor

Page 38

A Single-Chip Multiprocessor

• Exploiting parallelism

– Loop level parallelism results when the instruction level parallelism comes from data independent loop iterations.

– Some compilers can also divide a program into multiple threads of control, exposing thread level parallelism.

– A third form of very coarse parallelism, process level parallelism, involves completely independent applications running in independent processes controlled by the operating system.
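To make the loop-level case concrete, the sketch below contrasts a loop whose iterations are data independent, and can therefore be handed to separate workers in any order, with a loop that carries a dependence from one iteration to the next. The function names are hypothetical, and Python threads are used only to show that the iterations are independent, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def scale(sample):
    """Work done by one loop iteration; it depends only on its own input."""
    return 3 * sample + 1

def sequential(samples):
    return [scale(s) for s in samples]

def parallel(samples, workers=4):
    """Same loop, with iterations distributed across worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, samples))

def running_sum(samples):
    """Loop-carried dependence: iteration i needs the result of i - 1,
    so the iterations cannot simply be run side by side."""
    total, out = 0, []
    for s in samples:
        total += s
        out.append(total)
    return out

data = list(range(8))
```

`parallel(data)` and `sequential(data)` return the same list, which is what data independence guarantees.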

Page 39

Exploiting Program Parallelism

[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size in instructions, on an axis running from 1 to 1M.]

Page 40

SMT (simultaneous multithreading)

• SMT processors augment wide (issuing many instructions at once) superscalar processors with hardware that allows the processor to execute instructions from multiple threads of control concurrently

• Dynamically selecting and executing instructions from many active threads simultaneously.

• Higher utilization of the processor’s execution resources

• Provides latency tolerance in case a thread stalls due to cache misses or data dependencies.

• When multiple threads are not available, however, the SMT simply looks like a conventional wide-issue superscalar.

Page 41

Single- vs. Multi-threaded

multithreaded/non-blocking: CPU continues to execute along with the accelerator.

single-threaded/blocking: CPU waits for the accelerator.
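The two modes can be sketched with a worker thread standing in for the accelerator. Everything here is illustrative: `accelerator` and `cpu_side_work` are made-up stand-ins, and a thread pool is only an analogy for a hardware unit running alongside the CPU.

```python
from concurrent.futures import ThreadPoolExecutor

def accelerator(block):
    """Stand-in for an offloaded kernel (here: sum of squares)."""
    return sum(x * x for x in block)

def cpu_side_work(block):
    """Stand-in for work the CPU can do without the accelerator's result."""
    return max(block)

def run_blocking(block):
    accel_result = accelerator(block)          # CPU waits for the accelerator
    cpu_result = cpu_side_work(block)
    return accel_result, cpu_result

def run_non_blocking(block, pool):
    pending = pool.submit(accelerator, block)  # start the accelerator and return at once
    cpu_result = cpu_side_work(block)          # CPU continues in the meantime
    return pending.result(), cpu_result        # synchronize only when the result is needed

with ThreadPoolExecutor(max_workers=1) as accel_pool:
    overlapped = run_non_blocking([1, 2, 3, 4], accel_pool)
```

Both modes produce the same values; the difference is only whether the CPU sits idle while the accelerator runs.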

Page 42

Multithreading

– Multiple threads share the functional units of a single processor in an overlapping fashion.

– The processor must duplicate the independent state of each thread (register file, a separate PC, page table).

– Memory can be shared through the virtual memory mechanisms, which already support multiprocessing.

– Needs hardware support for changing the threads.

Page 43

Single-Chip Multiprocessor

• CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.

• If an application cannot be effectively decomposed into threads, CMPs will be underutilized.

Page 44

Basic Out-of-order Pipeline

Page 45

SMT Pipeline

Page 46

Instruction Issue

Reduced function unit utilization due to dependencies

Page 47

Superscalar Issue

Superscalar leads to more performance, but lower utilization

Page 48

Simultaneous Multithreading

Maximum utilization of function units by independent operations
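A toy model shows why filling issue slots from several threads raises utilization. Each thread is described by how many independent instructions it has ready in each cycle; the processor issues at most `width` per cycle. The model is an assumption made for illustration (unissued instructions are not carried over, and threads are served in a fixed order), and the numbers are invented.

```python
def utilization(ready_per_thread, width):
    """Fraction of issue slots filled over all cycles.

    ready_per_thread: one list per thread, giving the number of
    independent instructions that thread has ready in each cycle.
    """
    cycles = len(ready_per_thread[0])
    issued = 0
    for cycle in range(cycles):
        free = width
        for thread in ready_per_thread:
            take = min(free, thread[cycle])   # fill remaining slots from this thread
            issued += take
            free -= take
    return issued / (width * cycles)

# Hypothetical ready counts over four cycles on a 4-wide machine.
thread_a = [1, 2, 0, 1]
thread_b = [2, 1, 3, 0]
single = utilization([thread_a], width=4)           # one thread: superscalar issue
smt = utilization([thread_a, thread_b], width=4)    # two threads share the slots
```

In this example one thread fills 4 of 16 slots (0.25), while two threads together fill 10 of 16 (0.625).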

Page 49

Superscalar Architecture

Issue up to 12 instructions per cycle

Page 50

SMT Architecture

8 separate PCs; executes instructions from 8 different threads concurrently

Multi-bank caches

Page 51

Chip multiprocessor architecture

8 small 2-issue superscalar processors; depends on TLP

Page 52

Single-chip multiprocessor

Kunle Olukotun, http://www-hydra.stanford.edu

– Four processors
– Separate primary caches
– Write-through data caches to maintain coherence
– Shared 2nd-level cache
– Low latency interprocessor communication (10 cycles)
– Separate read and write buses

[Figure: CPU 0 to CPU 3, each with an L1 instruction cache, an L1 data cache, and a memory controller, connected through centralized bus arbitration mechanisms to a write-through bus (64b) and a read/replace bus (256b). The buses link the on-chip L2 cache, the Rambus memory interface to DRAM main memory, and the I/O bus interface to I/O devices.]
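The slide states that write-through data caches maintain coherence. One common way this works, sketched below, is that every store goes to the shared L2 over the write bus while the other L1 caches watch that bus and drop their stale copy. The invalidate-on-snoop policy and all class names are assumptions for illustration; the slide does not spell out the exact protocol.

```python
class WriteBus:
    """Carries every store to the shared L2 and lets other caches snoop it."""
    def __init__(self):
        self.l2 = {}
        self.caches = []

    def write_through(self, writer, addr, value):
        self.l2[addr] = value                   # L2 always holds the latest value
        for cache in self.caches:
            if cache is not writer:
                cache.invalidate(addr)          # snoopers drop their stale copy

class L1DataCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:              # miss: refill from the shared L2
            self.lines[addr] = self.bus.l2.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.write_through(self, addr, value)

    def invalidate(self, addr):
        self.lines.pop(addr, None)

bus = WriteBus()
cpus = [L1DataCache(bus) for _ in range(4)]
cpus[0].read(0x40)        # CPU 0 caches the old value
cpus[1].write(0x40, 7)    # CPU 1 stores; CPU 0's copy is invalidated
```

A later `cpus[0].read(0x40)` misses in L1 and picks up the new value from L2.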

Page 53

Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor

Page 54

CMP and Memory

• A 12-issue superscalar or SMT processor can place large demands on the memory system.

• The CMP architecture features sixteen 16-Kbyte caches.
– The small cache size and tight connection to these caches allow single-cycle access.

Page 55

CMP Solution

• A short cycle time can be targeted with relatively little design effort, since the hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components.

• The OS allocates a single software thread of control to each processor, so no hardware is required to dynamically allocate instructions to different clusters.

• Heavy reliance on software to direct instructions to clusters limits the amount of ILP a CMP can exploit but allows the clusters within the CMP to be small and fast.

Page 56

A Single-Chip Multiprocessor

• Relative performance of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Page 57

Multi-core SoC Platform Integration using AMBA

DesignCon 2002 System on Chip and IP Design Conference

Robert L. Veal, Levon Petrosian, Neal Stollon

Page 58

Outline

Overview of AMBA AHB

AMBA Application to Multiprocessor Systems (RAMA)

Summary

Page 59

Overview of AMBA AHB

AMBA Based Integration for SoC Platforms

Core integration is a significant part of SoC design
- Including both RISC and signal processing engines
- Well-defined bus strategies make it easier

AHB is being adopted based on both features and standardization
- Low overhead for core-to-memory communication
- Standard interface increases IP value
- RAMA integrates RADcore and OMNIcore using AHB along with memory blocks, arbiters and external interfaces

AMBA: Advanced Microcontroller Bus Architecture
AHB: Advanced High-performance Bus
RAMA: Reconfigurable Array Multimedia Architecture
RADcore: Infinite Technology Corporation’s proprietary cores for reconfigurable signal processing
OMNIcore: Infinite Technology Corporation’s proprietary core for general purpose RISC processing

Page 60

Overview of AMBA AHB

Value of AMBA Interfaces in Core Integration

Key to AHB
- Definition of master and slave AHB components
- Master: initiates operations by sourcing address and control signals for a bus operation
- Slave: responds and performs operations under the control of a master; memories and peripherals

Attractive key features of AMBA AHB
- Configurable data bus size (8 to 1024 bits)
- Dedicated request/grant and bus locking signals
- Flexible (user-defined) arbiter-based bus control
- State-based handshaking between master and slave
- No tri-stated buses; mux-based unidirectional operation

Page 61

Overview of AMBA AHB

AHB Principle of Operations

Page 62

Overview of AMBA AHB

AHB Principle of Operations

Specific datapath structure and signaling of a multiplexed bus
- Interconnection of multiple masters and slaves is handled by multiplexors
- On-chip bussing based on an arbitrated request/grant approach
- Bussing of two types of interface
  • Master interfaces: initiate transactions through granted requests and source the address and communication parameters of a data transfer
  • Slave interfaces: respond to master requests and provide status of requested transactions

High-performance system bus
- Supports multiple bussed cores and provides high-bandwidth operation
- Single-edge timed, multiplexed data bus controlled by arbitration logic
- All busses and signals are unidirectional as an on-chip bus structure

Page 63

Overview of AMBA AHB

AHB Variants

Specifics of interconnection structure
- Open to the user

Different bus structures and levels of transfer bandwidth
- Characterized by number of masters and bus layers (sub-buses)
- Efficient customization of the architecture within the standardized platform framework

Usage for multi-processor core platforms
- Several types of busses are concurrently used for control and high data transfer in inter-core communications

Page 64

Overview of AMBA AHB

Single-layer/Single-master AHB
- Known as AHB-Lite, a reduced complexity version
- A single master: no contention for bus ownership, no arbitration
- No arbitration: no implementation of request and grant signals

Single-layer/Multi-master AHB
- Ensure that a given master gains and maintains access to the bus
- Increase the performance of data transfers between multiple signal processors and memories
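In the multi-master case, an arbiter decides each cycle which requesting master is granted the bus, and a lock lets the current owner keep it. The sketch below assumes a fixed-priority policy; AHB leaves the arbitration scheme to the designer, and the master names and priority order here are hypothetical.

```python
def arbitrate(requests, priority, owner=None, locked=False):
    """Return the master granted the bus this cycle, or None if the bus is idle.

    requests: set of masters currently asserting their request signal
    priority: list of master names, highest priority first
    owner, locked: the current owner keeps the bus while it holds the lock
    """
    if locked and owner is not None:
        return owner
    for master in priority:
        if master in requests:
            return master
    return None

# Assumed priority order, highest first.
priority = ["dma", "omnicore", "radcore0", "radcore1"]
grant = arbitrate({"radcore1", "omnicore"}, priority)
```

Here `grant` is `"omnicore"`, the highest-priority requester; with `locked=True` the current owner would be returned regardless of other requests.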

Page 65

Overview of AMBA AHB

Multi-layer/Single-master AHB
- Concurrently accessing common slave resources
- The number of masters determines the number of bus layers
- Each master has a dedicated bus

Multi-layer/Multi-master AHB
- Each master has a dedicated bus in multi-layer
- Both masters and slaves access a common set of bus resources
- The number of bus layers is defined by the number of slaves requiring concurrent data transfer

Page 66

Overview of AMBA AHB

Master Slave Communication

Both the AHB master and slave have embedded (4-state) state machines
- Allow communication for master-slave and multiple-master status

Specifics of the FSM operation
- Driven by the features of the processor (transfer FSM) and memory (response FSM) blocks being used
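One plausible reading of the two 4-state machines is the AHB transfer type driven by the master (IDLE, BUSY, NONSEQ, SEQ) and the response driven by the slave (OKAY, ERROR, RETRY, SPLIT). The sketch below steps a master through a burst against a slave model under that assumption. It is heavily simplified (no address phase pipelining, no wait states, no re-arbitration), and `run_burst` and `slow_slave` are hypothetical names.

```python
from enum import IntEnum

class HTrans(IntEnum):    # master side: transfer FSM
    IDLE = 0
    BUSY = 1
    NONSEQ = 2
    SEQ = 3

class HResp(IntEnum):     # slave side: response FSM
    OKAY = 0
    ERROR = 1
    RETRY = 2
    SPLIT = 3

def run_burst(beats, slave):
    """Issue `beats` transfers; return a trace of (HTrans, beat, HResp).

    slave(beat, attempt) -> HResp models the memory's response.
    """
    trace = []
    beat, attempt, restart = 0, 0, True
    while beat < beats:
        htrans = HTrans.NONSEQ if restart else HTrans.SEQ
        resp = slave(beat, attempt)
        trace.append((htrans, beat, resp))
        if resp == HResp.OKAY:
            beat, attempt, restart = beat + 1, 0, False
        elif resp == HResp.ERROR:
            break                          # abandon the burst
        else:                              # RETRY or SPLIT: reissue the same beat
            attempt, restart = attempt + 1, True
    return trace

def slow_slave(beat, attempt):
    """Hypothetical memory that is not ready the first time beat 1 arrives."""
    return HResp.RETRY if (beat == 1 and attempt == 0) else HResp.OKAY

trace = run_burst(3, slow_slave)
```

The trace shows the handshake: the first beat is NONSEQ, following beats are SEQ, and a beat that drew a RETRY is reissued as NONSEQ.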

Page 67

AMBA Applications to Multiprocessor Systems

RAMA Block Diagram

Page 68

AMBA Applications to Multiprocessor Systems

RAMA

High-performance multi-core platform for addressing datapath applications
- Standardizes the on-chip bus operation by adopting AMBA AHB

Integration of ITC’s RADcore and OMNIcore processor cores
- Driven by the features of the processor (transfer FSM) and memory (response FSM) blocks being used

RADcore
- Signal processing engine: parallel processing, Reconfigurable Arithmetic Datapath (RAD) features
- Data interface: initialization I/O EXU, memory bus interfaces, RADbus interface

OMNIcore
- 32-bit cryptographic/RISC architecture
- High-performance RISC processor with a dual memory bus interface
- Uses AHB as its central bus structure

Other elements
- Memory blocks, an external memory interface core, arbitration logic

Page 69

AMBA Applications to Multiprocessor Systems

RAMA Multi-layer using AHB

Page 70

AMBA Applications to Multiprocessor Systems

RAMA Multi-layer using AHB

Inter-core communication is based on two AHB busses
- Separate and reduce any interdependence of control and data access
- Control interface: a single-layer AHB with the OMNIcore control/ROM port and the external memory DMA
- Slaves are the boot and local (instruction) memory and the RADcore control interfaces
- Single layer, since the inter-core control and memory update operations are intermittent

Data transfer AHB has up to six masters
- OMNIcore data/RAM port and the external memory DMA, along with up to four RADcore I/O ports
- To facilitate high-bandwidth multi-core performance, the data transfer AHB is a multi-layer AHB structure

Page 71

AMBA Applications to Multiprocessor Systems

Interfaces of Multiple System Domains in RAMA

Key communications interfaces of RAMA
① RADcore to on-chip memory array data read/write operations
② OMNIcore to on-chip memory array data read/write operations
③ RADcore to External Memory Buffer read/write operations
④ OMNIcore to External Memory Buffer read/write operations
⑤ RADcore-to-RADcore data transfers
⑥ RADcore to external logic data transfers
⑦ External memory (DMA) to internal memory array read/write operations
⑧ OMNIcore to RADcore control read/write operations
⑨ OMNIcore to local (scratch) RAM read/write operations
⑩ OMNIcore to (boot) ROM read operations

Page 72

AMBA Applications to Multiprocessor Systems

RADcore Overview

A High Performance Reconfigurable Signal Processor with Distributed IW Architecture

A core controller/sequence block, a DIW Instruction Memory, a set of Execution Units (EXUs), data I/O, external logic interface

The initialization busses, Reconfigurable Channel Bus (RCB) and the supporting Flags encapsulate and interconnect each EXU

Key features
• 15-channel reconfigurable data bus based architecture
• Reduces register based operations
• User definable pipeline depth
• Distributed instruction word driven parallel operation
• Supports highly pipelined dataflow
• Configuration selectable by designer (up to 11 EXUs)
• AMBA compatible memory and core-to-core busses
• Spreadsheet based RADware programming environment

Page 73

AMBA Applications to Multiprocessor Systems

RADcore Interfaces

Controller interface
- between the RADcore and host processor

Memory interface
- both on-chip RAM block and off-chip memory interfaces

RADbus interface
- RADcore-to-RADcore, initialization I/O EXU

External Logic Buffer
- co-processing with arbitrary external logic

Page 74

AMBA Applications to Multiprocessor Systems

OMNIcore Overview

Key features
• 32-bit RISC engine
• Cryptographic support
• AHB compliant control and RAM busses
  → User-selectable 8 to 32-bit operation
• 4 stage pipeline
  → Low interrupt latency
• Two privilege levels: user, system
  → Supports smart card applications

Page 75

AMBA Applications to Multiprocessor Systems

OMNIcore Overview

Two primary interfaces, for instruction operation (Ctrl) and data read/write (RAM)
- Access to memory bus for on-chip memory and external memory operation using its RAM interface
- Access to a local control bus for loading of instruction data into instruction cache and for supervisory and status communications with RADcore control blocks using its Control interface

Dual master AHB interface to integrate control and data functions
- Data output bus is shared
- Instruction cache internal to the OMNIcore subsystems is used to avoid stalling

Page 76

AMBA Applications to Multiprocessor Systems

OMNIcore Cryptographic Features

Public-private key cryptographic algorithms
- DES, RSA, DSA and Diffie-Hellman
- Controlled by a set of cryptographic instructions

Cryptographic instruction support for
- Compression Permutation
- Expansion Permutation
- Initial Permutation
- Final Permutation
- Key Permutation
- Key Rotation
- P-Box Permutation
- S-Box Permutation
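Most of the listed operations are fixed bit permutations, which are cheap in hardware but slow with ordinary shift-and-mask code. The sketch below shows what one such instruction would compute, using the standard DES initial permutation and its inverse, the final permutation. The `permute` helper is a hypothetical software model and says nothing about how the OMNIcore instructions are actually encoded.

```python
def permute(value, table, in_bits):
    """Build an output word whose i-th bit (MSB first) is input bit table[i].

    Bit positions in `table` are 1-based and counted from the MSB, as in DES.
    """
    out = 0
    for position in table:
        bit = (value >> (in_bits - position)) & 1
        out = (out << 1) | bit
    return out

# DES initial permutation: each row of eight steps down by 8 from its first entry.
INITIAL = [start - 8 * k
           for start in (58, 60, 62, 64, 57, 59, 61, 63)
           for k in range(8)]

# The final permutation is the inverse of the initial permutation.
FINAL = [INITIAL.index(i) + 1 for i in range(1, 65)]

block = 0x0123456789ABCDEF            # arbitrary 64-bit test block
scrambled = permute(block, INITIAL, 64)
restored = permute(scrambled, FINAL, 64)
```

Applying the final permutation to the output of the initial permutation returns the original block, which is a quick sanity check on both tables.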

Page 77

AMBA Applications to Multiprocessor Systems

RAMA Memory Subsystem

Distributed memory block architecture, consisting of dual port memory blocks

Key features
- Dual port RAM blocks
- Multi-layer AHB for simultaneous memory access
- Dual mode external memory interface
  → DMA interface for internal – external memory transfer (AHB master)
  → Buffer for processor – external memory transfers (AHB slaves)
- Multi-layer arbiter
  → Priority based

Page 78: Multi-Core   System on Chip 설계 동향  2 발표 :  조준동 교수  2003 년  12 월

AHB ArbitrationAHB Arbitration

AMBA Applications to Mutiprocessor SystemsAMBA Applications to Mutiprocessor Systems

Multi-layer arbitration scheme
- Coordinates concurrent processor-memory transfers between masters (OMNIcore, multiple RADcores, external memory DMA) and slaves (memory, external memory buffer)
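The idea of multi-layer, priority-based arbitration can be sketched as a small behavioral model: masters that target different slaves are all granted in the same cycle, and only masters competing for the same slave are ordered by priority. The function name `arbitrate`, the slave names, and the priority order used in the usage note are assumptions for illustration; the slides do not specify them.

```python
# Behavioral sketch, not RTL. Names and priority order are hypothetical.

def arbitrate(requests, priority):
    """Grant at most one master per slave for the current cycle.

    requests: dict mapping master name -> slave it wants to access
    priority: list of master names, highest priority first
    Returns a dict mapping slave name -> granted master name.
    """
    grants = {}
    for master in priority:
        slave = requests.get(master)
        # A slave already claimed by a higher-priority master is skipped.
        if slave is not None and slave not in grants:
            grants[slave] = master
    return grants
```

For example, with the assumed priority order `["OMNIcore", "RADcore0", "RADcore1", "DMA"]`, two RADcores requesting the same memory block are serialized, while a DMA transfer to the external memory buffer proceeds in parallel.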

A Configurable Master/Slave Port

RADbus AMBA AHB features
- Allows direct processor-to-processor communication
- Hybrid (configurable) master/slave interface
- Mode-dependent changes in AHB operation
- All write operations in master mode
- All read operations in slave mode
- Uses a first-come, first-served method for arbitration
- Low overhead ensures fast operation


A Configurable Master/Slave Port

Structure of RADbus AHB scheme
- All out-ports are defined as bus masters, structured as a write-only master
- All in-ports are defined as bus slaves, structured as a write-only slave
- The number of RADcores connecting to the RADbus determines the size of the address
- Selection of which bus channel (A, B, C) is read into the RADcore is defined as a function of decoded address bits from the master, in conjunction with the state of the slave
- The selection algorithm is based on a "first-come, first-served" mechanism in the read mux, controlled by an address-decoded select signal (a, b, c) for each bus
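A first-come, first-served read mux of this kind can be modeled in a few lines. This is only a behavioral sketch under simplifying assumptions: address decoding is reduced to a channel name, arrival order is the order of the `write` calls, and the class name `ReadMux` is hypothetical.

```python
from collections import deque


class ReadMux:
    """First-come, first-served selection among bus channels (sketch)."""

    def __init__(self, channels=("A", "B", "C")):
        self.channels = channels
        self.pending = deque()  # writes in arrival order

    def write(self, channel, data):
        # A master (out-port) writes on one channel; the channel name
        # stands in for the decoded address bits.
        if channel not in self.channels:
            raise ValueError("unknown channel: %r" % (channel,))
        self.pending.append((channel, data))

    def read(self):
        # The slave (in-port) receives the oldest pending write, if any.
        return self.pending.popleft() if self.pending else None
```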


Summary

- RAMA discussed as a platform-based solution
- Uses multiple AHBs for core-to-core integration
- AHB easily integrated into the RAMA architecture
- AHB provides well-understood, flexible interfaces
- RADbus example shows AHB can be flexible
- Combination of OMNIcore and RADcore provides enhanced DSP and data processing
- Extends the platform to reach emerging SoC applications

Cores + Infrastructure + Integration = SoC Platform
(OMNIcore + RADcore) + (RAM) + AMBA AHB = RAMA


Lightweight Implementation of the POSIX Threads API for an On-Chip MIPS Multiprocessor with VCI Interconnect


Contents

• Target architecture
• MIPS CPU properties
• The architecture needs
• Pthread specification
• Implementation
• Experimental setup
• Conclusion


Target architecture

General VCI based SoC architecture

• System consists of one or more MIPS R3000 CPUs

• Virtual Component Interface (VCI) compliant interconnect


MIPS CPU properties

• Two separate caches for instructions and data.
• Direct-mapped caches.
• Write buffer with write-update and write-through policy.
• No memory management unit (MMU); logical addresses are physical addresses (the total memory is fixed at design time).


The architecture needs

• Protected access to shared data: use spin locks
  – A spin lock is acquired using pthread_spin_lock
  – A spin lock is released using pthread_spin_unlock
• Cache coherency
  – If the interconnect is a shared bus, use snoopy caches.
    • Reduce main memory traffic.
  – If the interconnect is VCI compliant (bus or network), caches need to be flushed.
• Processor identification
  – CPUs must have an internal register allowing their identification within the system.
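The spin-lock requirement above can be illustrated with a small model. This is a sketch, not the kernel's implementation: the class `SpinLock` is hypothetical, a non-blocking `threading.Lock.acquire` stands in for the hardware test-and-set access, and Python threads stand in for CPUs.

```python
import threading
import time


class SpinLock:
    """Busy-waiting lock; models pthread_spin_lock / pthread_spin_unlock."""

    def __init__(self):
        # Stands in for the memory word targeted by an atomic test-and-set.
        self._flag = threading.Lock()

    def lock(self):
        # Retry the test-and-set until it succeeds (spinning).
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # let another Python thread run while we spin

    def unlock(self):
        self._flag.release()


def run_counter(num_threads=4, increments=500):
    """Increment a shared counter from several threads under the spin lock."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.lock()
            total[0] += 1  # protected access to shared data
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

With the lock in place the final count always equals `num_threads * increments`, since no increment can interleave with another.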


Pthread specification

• Main kernel objects are the threads and the scheduler.

Execute the thread : ‘start’ function call

Thread attribute : stack size, stack addr, scheduling policy

Unique identifier for the thread


Pthread specification

• A thread changes state when a pthread function is called on a shared object.

• The transition from RUNNABLE to RUN is done by the scheduler. The reverse transition, from RUN to RUNNABLE, is done using sched_yield.

• A thread structure contains the context of execution of a thread and pointers to other threads.
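The RUNNABLE/RUN transitions can be sketched with a toy cooperative scheduler. This is an illustration under stated assumptions: a Python generator plays the role of the saved execution context, each `yield` plays the role of `sched_yield`, and the `DONE` state and the names `Thread`, `worker` and `run_all` are hypothetical additions needed to make the example terminate.

```python
from collections import deque

RUNNABLE, RUN, DONE = "RUNNABLE", "RUN", "DONE"


class Thread:
    def __init__(self, tid, body):
        self.tid = tid
        self.state = RUNNABLE
        self.context = body(tid)  # generator object holds the execution context


def worker(tid):
    for step in range(2):
        yield (tid, step)  # plays the role of sched_yield()


def run_all(threads):
    """Round-robin over the runnable list; returns the order of execution."""
    runnable = deque(threads)
    trace = []
    while runnable:
        t = runnable.popleft()
        t.state = RUN  # RUNNABLE -> RUN: decided by the scheduler
        try:
            trace.append(next(t.context))
            t.state = RUNNABLE  # RUN -> RUNNABLE: the thread yielded
            runnable.append(t)
        except StopIteration:
            t.state = DONE  # thread body finished
    return trace
```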


Pthread specification

• The scheduler manages five lists of threads.
  – Symmetric Multi-Processor (SMP): the scheduler may be shared by all processors.
  – Distributed: a scheduler exists on every processor.
• Access to the scheduler must be performed in a critical section, under the protection of a lock.
• Other implemented objects
  – Spin lock: the low-level test-and-set access
  – Mutex: serializes access to shared data
  – Semaphore: sem_post is the only function that can be called in interrupt handlers.
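The rule that the scheduler is only accessed inside a lock-protected critical section can be sketched as follows. This is a simplified model: only a single runnable list is shown (the slides mention five lists without naming them), Python threads stand in for CPUs, and the names `Scheduler`, `elect` and `run_cpus` are hypothetical.

```python
import threading
from collections import deque


class Scheduler:
    """Shared scheduler whose list is only touched while holding its lock."""

    def __init__(self, thread_ids):
        self._lock = threading.Lock()
        self._runnable = deque(thread_ids)

    def elect(self):
        # Critical section: pick the next runnable thread, or None if empty.
        with self._lock:
            return self._runnable.popleft() if self._runnable else None


def run_cpus(num_cpus, thread_ids):
    """Let several CPUs pull work from one scheduler; returns per-CPU logs."""
    sched = Scheduler(thread_ids)
    executed = [[] for _ in range(num_cpus)]

    def cpu(cpu_id):
        while True:
            tid = sched.elect()
            if tid is None:
                return
            executed[cpu_id].append(tid)  # "run" the elected thread

    cpus = [threading.Thread(target=cpu, args=(i,)) for i in range(num_cpus)]
    for c in cpus:
        c.start()
    for c in cpus:
        c.join()
    return executed
```

Because election happens under the lock, every thread identifier is handed to exactly one CPU, whichever CPU gets there first.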


Implementation

◈ Booting sequence

• The scheduler_created variable must be declared with the volatile type qualifier, so that the compiler does not optimize away the seemingly infinite loop that polls it.
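The boot-time handshake can be sketched as a small model. The assumption here, not stated on the slide, is that one CPU (CPU 0 in this sketch) creates the scheduler and sets `scheduler_created`, while the other CPUs poll the flag. Python has no `volatile` qualifier, so the sketch only shows the ordering of the handshake; the function name `boot` is hypothetical.

```python
import threading
import time


def boot(num_cpus=3):
    """Model of the boot handshake; returns the scheduler and what each CPU saw."""
    shared = {"scheduler_created": False, "scheduler": None}
    seen = [None] * num_cpus

    def cpu(cpu_id):
        if cpu_id == 0:
            # Assumed role of CPU 0: build the scheduler, then publish the flag.
            shared["scheduler"] = {"runnable": []}
            shared["scheduler_created"] = True
        else:
            # The "seemingly infinite" loop: nothing in its body changes the flag.
            while not shared["scheduler_created"]:
                time.sleep(0)
        seen[cpu_id] = shared["scheduler"]

    cpus = [threading.Thread(target=cpu, args=(i,)) for i in range(num_cpus)]
    for c in cpus:
        c.start()
    for c in cpus:
        c.join()
    return shared["scheduler"], seen
```

In C, a compiler that sees no write to the flag inside the loop may hoist the load out of it unless the variable is `volatile`, which is the hazard the slide warns about.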


Implementation

• Context switch
  – Saves the current values of the CPU registers into the context variable of the thread that is currently executing
  – Sets the CPU registers to the values stored in the context variable of the new thread to execute
  – The return address of the function is one of the registers in the context
  – Restoring a context therefore sends the program back to where the context was saved, not to the current caller of the context-switching routine.
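The save/restore steps can be modeled with plain dictionaries. This is only a sketch: the register subset (`sp`, `ra`, `s0`) is chosen for illustration, and the names `Cpu`, `Thread` and `context_switch` are hypothetical rather than taken from the kernel.

```python
class Cpu:
    def __init__(self):
        # Small illustrative subset of MIPS registers.
        self.regs = {"sp": 0, "ra": 0, "s0": 0}


class Thread:
    def __init__(self, name, sp, ra):
        self.name = name
        # Context variable: register values saved at the last switch.
        self.context = {"sp": sp, "ra": ra, "s0": 0}


def context_switch(cpu, current, new):
    """Save `current`, restore `new`, and return where execution resumes."""
    current.context = dict(cpu.regs)  # save registers of the running thread
    cpu.regs = dict(new.context)      # load registers of the new thread
    # The restored return address decides where the routine "returns":
    # the point where `new` was saved, not the present caller.
    return cpu.regs["ra"]
```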


Implementation

◈ CPU idle loop

• All idle CPUs enter the same idle loop.


Experimental setup

• Review several types of scheduler
  – Symmetric Multiprocessor (SMP)
    • Unique scheduler shared by all processors and protected by a lock
    • Threads can run on any processor, and migrate
  – Centralized Non-SMP (NON_SMP_CS)
    • Unique scheduler shared by all processors and protected by a lock
    • Every thread is assigned to a given processor and can run only on it
  – Distributed Non-SMP (NON_SMP_DS)
    • As many schedulers as processors, and as many locks as schedulers
    • Every thread is assigned to a given processor and can run only on it
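The difference between the three organizations comes down to how many queues and locks exist and whether a CPU may elect a thread assigned to another CPU. A minimal sketch follows; the class `SchedulerSet` and its methods are hypothetical, and only the kind labels `SMP`, `NON_SMP_CS` and `NON_SMP_DS` come from the slides.

```python
import threading
from collections import deque


class SchedulerSet:
    """Models the SMP, NON_SMP_CS and NON_SMP_DS scheduler organizations."""

    def __init__(self, kind, num_cpus):
        self.kind = kind
        # Distributed: one queue and one lock per CPU. Otherwise: one of each.
        count = num_cpus if kind == "NON_SMP_DS" else 1
        self.queues = [deque() for _ in range(count)]
        self.locks = [threading.Lock() for _ in range(count)]

    def _index(self, cpu):
        return cpu if self.kind == "NON_SMP_DS" else 0

    def add(self, tid, home_cpu):
        i = self._index(home_cpu)
        with self.locks[i]:
            self.queues[i].append((tid, home_cpu))

    def elect(self, cpu):
        """Return a thread this CPU is allowed to run, or None."""
        i = self._index(cpu)
        with self.locks[i]:
            for k, (tid, home) in enumerate(self.queues[i]):
                # Only SMP lets a thread migrate to a different CPU.
                if self.kind == "SMP" or home == cpu:
                    del self.queues[i][k]
                    return tid
        return None
```

In this model the centralized non-SMP variant pays for a shared lock without gaining migration, while the distributed variant avoids lock contention between CPUs at the cost of a fixed thread placement.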


Experimental setup

◈ Motion JPEG application

[Figures: execution times of the MJPEG application; cycles spent in the CPU idle loop]


Experimental setup

◈ COMM application

• Does not exchange data between processors.

• The only resource shared here is the bus

• The application uses the processors at about full power.


Conclusion

• The implementation is a bit tricky, but quite compact and efficient.

• Experiments have shown that a POSIX-compliant SMP kernel allowing task migration is an acceptable solution for SoCs in terms of generality, performance, and memory footprint.