Multi-Core System on Chip Design Trends 2. Presented by: Prof. Jun-Dong Cho (조준동), December 2003

1

Multi-Core System on Chip

Design Trends 2

Presented by: Prof. Jun-Dong Cho (조준동), December 2003

http://www.skku.ac.kr/

2

What is Software Radio

- A transceiver in which all aspects of its operation are determined using versatile general purpose hardware whose configuration is under software control

- Flexible all-purpose radios that can implement new and different standards or protocols through reprogramming.

- Same hardware for all air interfaces and modulation schemes


3

Key Technological Constraints

• High-speed wideband ADCs
• High-speed DSPs
• Real-time operating systems (isochronous software)
• Power consumption


4

Research and Commercialization

• DARPA’s Adaptive computing system project

• Virginia Tech – algorithms and architecture ; multi user receiver based on reconfigurable computing ; generic soft radio architecture for reconfigurable hardware

• UC Berkeley – Pleiades, ultra low power, high performance multimedia computing ; high power efficiency by providing programmability

• Sirius Inc – Software Reconfigurable Code Division Multiple Access (CDMAx)

http://www.darpa.mil/ito/Solicitations/CBD_9902.html

http://www.mprg.ee.vt.edu/research/glomo/index.html

http://bwrc.eecs.berkeley.edu/research/configurable_architectures/

http://www.siriuscomm.com/


5

Research and Commercialization

• Brigham Young University – Development of JHDL to facilitate hardware synthesis in reconfigurable processors

• Chameleon Systems – Reconfigurable Platform Architecture for wireless base station

• Morphics Inc. – Programmable hardware reconfigurable code using DRL

• Quicksilver Tech. Inc. – Universal Wireless 'Ngine (WunChip) baseband algorithms

http://jhdl.ee.byu.edu/docs/jhdlDocs.html

http://www.chameleonsystems.com/whitepapers/WirelessBasestation.pdf

http://www.morphics.com/


http://www.quicksilvertech.com/


6

Applications

• User Applications and Base Station Applications

• Evolve as a universal terminal
• Spectrum management: reconfigurability is a big advantage
• Application updates, service enhancements and personalization


7

Programmable OFDM-CDMA Transceiver

• CDMA suffers from Multiple access interference and ISI.

• OFDM reduces interference and helps better spectrum utilization and attainment of satisfactory BER.

• It is proposed that this might be implemented by using SDR.
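As a rough illustration of the OFDM half of such a transceiver, the sketch below maps one symbol onto each subcarrier with an inverse DFT and recovers the symbols with a forward DFT. It is a minimal sketch under stated assumptions: the function names, the QPSK test symbols and the cyclic-prefix length are all hypothetical, the channel is ideal, and the CDMA spreading stage is not modeled.

```python
import cmath


def idft(symbols):
    """Inverse DFT: one input symbol per subcarrier, time-domain samples out."""
    n = len(symbols)
    return [
        sum(s * cmath.exp(2j * cmath.pi * k * t / n) for k, s in enumerate(symbols)) / n
        for t in range(n)
    ]


def dft(samples):
    """Forward DFT: time-domain samples in, one symbol per subcarrier out."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def ofdm_modulate(symbols, cp_len):
    """Build one OFDM symbol and prepend a cyclic prefix of cp_len samples."""
    body = idft(symbols)
    return body[len(body) - cp_len:] + body


def ofdm_demodulate(samples, cp_len):
    """Drop the cyclic prefix and recover the per-subcarrier symbols."""
    return dft(samples[cp_len:])


# Hypothetical test data: eight QPSK symbols, one per subcarrier.
qpsk = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j, 1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
tx = ofdm_modulate(qpsk, cp_len=2)
rx = ofdm_demodulate(tx, cp_len=2)
```

In a software radio, changing the number of subcarriers or the prefix length is then a parameter change rather than a hardware change, which is the point the slide makes about SDR.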


8

SDR Architecture

[Figure: SDR architecture block diagram. Labels: RF unit (Rx SYN, Tx SYN, RX, TX, EX., PA, LNA); signal processing/control unit (data converter, quadrature modem, baseband modem, interface/control); C-PCI bus; HMI terminal; input/output; receive/transmit. Source: Hitachi Kokusai Electric Inc.]


9

Signal processing/control unit

• The signal processing/control unit consists of the following modules
– Data converter
– Quadrature modem
– Baseband modem
– Interface/control

• Every module is connected to the others by the PCI bus, and provides a CPU in addition to the FPGA and DSP devices.


10

Quadrature modem module

• The quadrature modem uses FPGAs to process signals and generate the baseband sampling rate
– Quadrature modulation
– Quadrature detection
– Sampling rate conversion
– Filtering
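The first two functions in this list can be sketched in a few lines. The example below is an illustration only, not the Hitachi design: it modulates a constant (I, Q) pair onto a carrier and recovers it by mixing with the same carrier and averaging over a whole number of carrier cycles, which acts as a crude low-pass filter. The function names, carrier frequency and sample rate are assumptions.

```python
import math


def quadrature_modulate(i, q, fc, fs, n):
    """Return n samples of i*cos(wt) - q*sin(wt) at sample rate fs."""
    return [
        i * math.cos(2 * math.pi * fc * k / fs) - q * math.sin(2 * math.pi * fc * k / fs)
        for k in range(n)
    ]


def quadrature_detect(samples, fc, fs):
    """Mix with cos and -sin, then average; assumes whole carrier cycles."""
    n = len(samples)
    i = 2.0 / n * sum(s * math.cos(2 * math.pi * fc * k / fs) for k, s in enumerate(samples))
    q = -2.0 / n * sum(s * math.sin(2 * math.pi * fc * k / fs) for k, s in enumerate(samples))
    return i, q


# Hypothetical numbers: 1 kHz carrier, 8 kHz sampling, 64 samples (8 full cycles).
signal = quadrature_modulate(0.6, -0.8, 1000.0, 8000.0, 64)
i_hat, q_hat = quadrature_detect(signal, 1000.0, 8000.0)
```

A real modem would replace the plain average with a proper filter and would handle sampling rate conversion separately.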



11

Baseband modem module

• The baseband modem processes
– Multi-channel modulation
– Multi-channel demodulation

• Uses four floating-point DSP devices

• An individual DSP is assigned to each channel. Therefore, even while one channel is being processed, a program can be downloaded to another channel.



12

An SDR/Multimedia Solution

W-CDMA / DAB / DVB / IEEE 802.11x; MPEG / JPEG codecs


13

PACT’s SDR XPP eXtreme Processor Platform


14

PACT’s SDR XPP


15

Architecture Goals

• Provide a template for the exploration of a range of architectures
• Retarget compiler and simulator to the architecture
• Enable compiler to exploit the architecture
• Concurrency
– Multiple instructions per processing element
– Multiple threads per and across processing elements
– Multiple processes per and across processing elements
• Support for efficient computation
– Special-purpose functional units, intelligent memory, processing elements
• Support for efficient communication
– Configurable network topology
– Combined shared memory and message passing


16

Architecture Template

• Prototyping template for array of processing elements
– Configure processing element for efficient computation
– Configure memory elements for efficient retiming
– Configure the network topology for efficient communication

[Figure: processing elements built from functional units (FU), register files, memories, instruction caches and special-purpose units (DCT, HUF), shown in three steps: configure PE; configure memory elements; configure PEs and network to match the application.]


17

Future Processing Element

• Specialized memory systems for efficient memory utility
– Multi-ported, banked, levels, and intelligent memory
• Split register file allows greater register bandwidth to FUs
– Groups of functional units have dedicated register files
• Multiple contexts for a processing element provide latency tolerance
– Hardware for efficient context switching to fill empty instruction slots
• Specialized functional units and processing elements
– SIMD instructions
– Re-configurable fabrics for bit-level operations
– Re-use IP blocks for more efficient computation
– Custom hardware for the highest performance


18

Initial Distributed Architecture

• Array of concurrent PEs and supporting network

• Malleable network topology

– Topology matches application

• Efficient communication

• Memory organized around a PE

– Each PE has physical memory

– Message passing between PEs

[Figure: arrays of PEs connected by a network.]


19

Future Distributed Architecture

• Multiple processing elements share a memory space
– Shared memory communication
• Snooping cache coherency protocol
• Directory-based protocol required if the number of PEs in a shared memory space is large
• Introspective processing elements
– Use processing elements to analyze the computation or communication
    • Identify dynamic bottlenecks and remove them on the fly
    • Reschedule and bind tasks as the introspective elements report


20

So What’s Different?

• Traditional application hw/sw design requires
– Hand selection of traditional general purpose OS components
– Hand written customization of
    • device drivers
    • memory management…

• Instead…
– Application specific synthesis of OS components
    • scheduling
    • synchronization…
– Automatic synthesis of hardware specific code from specifications
    • device drivers
    • memory management…


21

ASIP Design

• Given a set of applications, determine the microarchitecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set).

• To accurately evaluate the performance of a processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code.

• The microarchitecture of the processor is a design parameter!


22

ASIP Design Flow


23

Compiler Goals

• Develop a retargetable compiler infrastructure that enables a set of interesting applications to be efficiently mapped onto a family of fully programmable architectures and microarchitectures.

• 10 Year Vision:
– Will have fully automatically-retargetable compilation, OS synthesis, and simulation for a class of architectures consisting of multiple heterogeneous processing elements with specialized functional units / memories
– Compiled code size and performance will be within 10% of hand-coding


24

Compiler Research Issues

• Synthesis of RTOS elements in the compiler
– On the application side: generation of an efficient application-specific static/run-time scheduler and synchronization
– On the hardware side: generation of device drivers, memory management primitives, etc. using hardware specifications

• Automatic retargetability for family of target architectures while preserving aggressive optimization

• Automatic application partitioning

– Mapping of process/task-level concurrency onto multiple PEs using programmer guidance in programmer’s model

• Effective visualization for family of target architectures


25

An Efficient Architecture Model for Systematic Design of Application-Specific Multiprocessor SoC

DATE 2001

Amer Baghdadi, Damien Lyonnard, Nacer-E. Zergainoh, Ahmed A. Jerraya

TIMA Laboratory, Grenoble, France


26

Efficient application-specific multiprocessor design

• Modularity

• Flexibility

• Scalability


27

A multiprocessor architecture platform for application-specific SoC design(1)

Figure 1. A multiprocessor architecture platform


28

A multiprocessor architecture platform for application-specific SoC design(2)

• Architecture platform parameters

1. Number of CPUs

2. Memory sizes for each processor

3. I/O ports for each processor

4. Interconnections between processors

5. Communication protocols and the external connections (peripherals)
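One way to picture these parameters is as a small configuration record that a generation flow could consume. The sketch below is purely illustrative: the class names, field names, the protocol string and the validation rules are assumptions and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProcessorConfig:
    name: str
    memory_bytes: int  # memory size for this processor
    io_ports: int      # number of I/O ports for this processor


@dataclass
class PlatformConfig:
    processors: List[ProcessorConfig]
    links: List[Tuple[str, str]]  # interconnections between processors
    protocol: str                 # communication protocol (hypothetical label)
    peripherals: List[str] = field(default_factory=list)  # external connections

    @property
    def num_cpus(self) -> int:
        return len(self.processors)

    def validate(self) -> None:
        """Reject duplicate processor names and links to unknown processors."""
        names = {p.name for p in self.processors}
        if len(names) != len(self.processors):
            raise ValueError("processor names must be unique")
        for a, b in self.links:
            if a not in names or b not in names:
                raise ValueError(f"link ({a}, {b}) references an unknown processor")


# Hypothetical 4-processor instance of the platform.
platform = PlatformConfig(
    processors=[ProcessorConfig(f"cpu{i}", 64 * 1024, 2) for i in range(4)],
    links=[("cpu0", "cpu1"), ("cpu1", "cpu2"), ("cpu2", "cpu3")],
    protocol="fifo-handshake",
    peripherals=["uart"],
)
platform.validate()
```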


29

Application-specific multiprocessor SoC design flow (1)

Figure 2. The Y-chart: MFSAM-based architecture generation scheme


30

Application-specific multiprocessor SoC design flow(2)

Figure 3. MFSAM-based architecture generation flow for multiprocessor SoC


31

Architecture design(1)

Figure 4. Communication Interface


32

Architecture design(2)

Figure 5. Block diagram of the packet routing switch (Point to Point network)


33

Architecture validation

Figure 6. A 4-processor cosimulation architecture of the packet routing switch


34

Analyzing the design cycle (1)

Figure 7. A 4-processor cosimulation architecture of the IS-95 CDMA


35

Analyzing the design cycle (2)

Table 1. Time needed to fit the IS95 CDMA on the multiprocessor platform


36

Conclusion

1. Presented a generic architecture model for application-specific multiprocessor system-on-chip design.

2. The proposed model is modular, flexible and scalable.

3. Definition of the architecture model and a systematic design flow that can be automated.


37

A Single-Chip Multiprocessor

• Currently, processor designs dynamically extract parallelism by executing many instructions within a single, sequential program in parallel.

• Future performance improvements will require processors to be enlarged to execute more instructions per clock cycle.

• Two alternative micro-architectures that exploit multiple threads of control

– SMT : simultaneous multithreading
– CMP : chip multiprocessor


38


• Exploiting parallelism

– Loop level parallelism results when the instruction level parallelism comes from data independent loop iterations.

– Some compilers can also divide a program into multiple threads of control, exposing thread level parallelism.

– A third form of very coarse parallelism, process level parallelism, involves completely independent applications running in independent processes controlled by the operating system.
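Loop level parallelism can be made concrete with a loop whose iterations do not depend on each other, so they may run in any order or side by side. The sketch below is a minimal illustration: the function names and the arithmetic in the loop body are hypothetical, and CPython threads are used only to show the structure, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    """Loop body: depends only on its own input, not on other iterations."""
    return 3 * x + 1


def sequential(data):
    """Plain loop: iterations run one after another."""
    return [scale(x) for x in data]


def parallel(data, workers=4):
    """Same loop with iterations handed out to a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, data))  # map preserves input order


data = list(range(100))
same = parallel(data) == sequential(data)
```

A loop with a carried dependence, such as a running sum where each iteration reads the previous result, could not be split this way without restructuring.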


39

Exploiting Program Parallelism

[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size in instructions, on a scale from 1 to 1M.]


40

SMT (simultaneous multithreading)

• SMT processors augment wide (issuing many instructions at once) superscalar processors with hardware that allows the processor to execute instructions from multiple threads of control concurrently

• Dynamically selecting and executing instructions from many active threads simultaneously.

• Higher utilization of the processor’s execution resources

• Provides latency tolerance in case a thread stalls due to cache misses or data dependencies.

• When multiple threads are not available, however, the SMT simply looks like a conventional wide-issue superscalar.
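The utilization argument can be illustrated with a toy issue-slot model. In the sketch below each thread is described by the number of independent instructions it has ready in each cycle; a superscalar core fills its issue slots from one thread only, while an SMT core fills leftover slots from other threads. The model and its numbers are assumptions made for illustration: instructions that are not issued do not carry over, and no real pipeline is simulated.

```python
def superscalar_utilization(ready, width):
    """Fraction of issue slots used when only one thread can issue."""
    issued = sum(min(width, r) for r in ready)
    return issued / (width * len(ready))


def smt_utilization(threads, width):
    """Fraction of issue slots used when all threads share the slots."""
    cycles = len(threads[0])
    issued = 0
    for c in range(cycles):
        free = width
        for ready in threads:
            take = min(free, ready[c])  # fill remaining slots from this thread
            issued += take
            free -= take
    return issued / (width * cycles)


# Hypothetical per-cycle ready counts for two threads on a 4-wide core.
thread_a = [2, 0, 1, 3]
thread_b = [1, 3, 2, 0]
single = superscalar_utilization(thread_a, 4)    # 6 of 16 slots
shared = smt_utilization([thread_a, thread_b], 4)  # 12 of 16 slots
```

With only one thread in the list, `smt_utilization` gives the same value as `superscalar_utilization`, matching the last bullet above.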


41

Single- vs Multi-threaded

multithreaded/non-blocking: CPU continues to execute along with accelerator.

single-threaded/blocking: CPU waits for accelerator.


42

Multithreading

– Multiple threads share the functional units of a single processor in an overlapping fashion.
– The processor must duplicate the independent state of each thread (register file, a separate PC, page table).
– Memory can be shared through the virtual memory mechanisms, which already support multiprocessing.
– Needs hardware support for changing the threads.


43

Single-Chip Multiprocessor

• CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.

• If an application cannot be effectively decomposed into threads, CMPs will be underutilized.


44

Basic Out-of-order Pipeline


45

SMT Pipeline


46

Instruction Issue

Reduced function unit utilization due to dependencies


47

Superscalar Issue

Superscalar leads to more performance, but lower utilization


48

Maximum utilization of function units by independent operations

Simultaneous Multithreading


49

Superscalar Architecture

Issue up to 12 instructions per cycle


50

SMT Architecture

8 separate PCs; executes instructions from 8 different threads concurrently

Multi-bank caches


51

Chip multiprocessor architecture

Eight small 2-issue superscalar processors; depends on TLP (thread level parallelism)


52

Single-chip multiprocessor

Kunle Olukotun, http://www-hydra.stanford.edu

• Four processors
• Separate primary caches
• Write-through data caches to maintain coherence
• Shared 2nd-level cache
• Low latency interprocessor communication (10 cycles)
• Separate read and write buses

[Figure: block diagram. Labels: CPU 0 to CPU 3, each with L1 instruction cache, L1 data cache and memory controller; centralized bus arbitration mechanisms; write-through bus (64b); read/replace bus (256b); on-chip L2 cache; Rambus memory interface; DRAM main memory; I/O bus interface; I/O devices.]


53

Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor


54

CMP and Memory

• A 12-issue superscalar or SMT processor can place large demands on the memory system.

• The CMP architecture features sixteen 16-Kbyte caches.
– The small cache size and tight connection to these caches allows single-cycle access.


55

CMP Solution

• A short cycle time can be targeted with relatively little design effort, since the hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components.

• The OS allocates a single software thread of control to each processor, so no hardware is required to dynamically allocate instructions to different clusters.

• Heavy reliance on software to direct instructions to clusters limits the amount of ILP the CMP can exploit, but allows the clusters within the CMP to be small and fast.


56


• Relative performance of superscalar, simultaneous multithreading, and chip multiprocessor architectures


57

Multi-core SoC Platform Integration using AMBA

DesignCon 2002 System on Chip and IP Design Conference

Robert L. Veal, Levon Petrosian, Neal Stollon


58

Outline

Overview of AMBA AHB

AMBA Application to Multiprocessor Systems (RAMA)

Summary


AMBA Based Integration for SoC Platforms

Overview of AMBA AHB

Core integration is a significant part of SoC design
- Including both RISC and signal processing engines
- Well defined bus strategies make it easier

AHB being adopted based on both features and standardization
- Low overhead for core-to-memory communication
- Standard interface increases IP value
- RAMA integrates RADcore and OMNIcore using AHB along with memory blocks, arbiters and external interfaces

AMBA : Advanced Microcontroller Bus Architecture
AHB : Advanced High-performance Bus
RAMA : Reconfigurable Array Multimedia Architecture
RADcore : Infinite Technology Corporation’s proprietary cores for reconfigurable signal processing
OMNIcore : Infinite Technology Corporation’s proprietary core for general purpose RISC processing


Value of AMBA Interfaces in Core Integration

Overview of AMBA AHB

Key to AHB
- Definition of master and slave AHB components
- Master : initiates operations by sourcing address and control signals for a bus operation
- Slave : responds and performs operations under the control of a master; memories and peripherals

Attractive Key Features of AMBA AHB
- Configurable data bus size (8 ~ 1024 bits)
- Dedicated request/grant and bus locking signals
- Flexible (user-defined) arbiter based bus control
- State based handshaking between master and slave
- No tri-stated busses; mux based unidirectional operation


AHB Principle of Operations


Specific datapath structure and signaling of a multiplexed bus
- Interconnection of multiple masters and slaves is handled by multiplexors
- On-chip bussing based on an arbitrated request/grant approach
- Bussing of two types of interface
• Master interfaces : initiate transactions through granted requests and source the address and communication parameters of a data transfer
• Slave interfaces : respond to master requests and provide status of requested transactions

High-performance system bus
- Supports multiple bussed cores and provides high-bandwidth operation
- Single-edge timed, multiplexed data bus controlled by arbitration logic
- All busses and signals are unidirectional as an on-chip bus structure


AHB Variants

Specifics of interconnection structure
- Open to the user

Different bus structures and levels of transfer bandwidth
- Characterized by number of masters and bus layers (sub-buses)
- Efficient customization of the architecture within the standardized platform framework

Usage for multi-processor core platforms
- Several types of busses are concurrently used for control and high data transfer in inter-core communications


Single-layer/Single-master AHB

- Known as AHB-Lite, a reduced complexity version
- A single master : no contention for bus ownership, no arbitration
- No arbitration : no implementation of request and grant signals

Single-layer/Multi-master AHB

- Ensures that a given master gains and maintains access to the bus
- Increases the performance of data transfers between multiple signal processors and memories
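How a multi-master bus lets one master gain and keep the bus can be sketched with a fixed-priority arbiter. The example below is an assumption-laden illustration: AHB leaves the arbitration policy to the designer, and the function name, the priority order and the master names are hypothetical.

```python
def arbitrate(requests, priority, current=None, locked=False):
    """Return the master granted the bus for the next transfer, or None.

    requests: set of masters currently requesting the bus
    priority: masters ordered from highest to lowest priority
    current:  master that holds the bus now, if any
    locked:   True while the current master asserts its lock
    """
    if locked and current is not None:
        return current  # a locked transfer keeps the bus regardless of requests
    for master in priority:
        if master in requests:
            return master
    return None


# Hypothetical priority order for three masters.
order = ["cpu", "dsp", "dma"]
first = arbitrate({"dma", "dsp"}, order)                           # "dsp"
kept = arbitrate({"cpu"}, order, current="dma", locked=True)       # "dma"
```

With a single master there is nothing to choose between, which is why the AHB-Lite variant above needs no request and grant signals.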


Multi-layer/Single-master AHB

- Concurrently accessing common slave resources
- The number of masters determines the number of bus layers
- Each master has a dedicated bus

Multi-layer/Multi-master AHB

- Each master has a dedicated bus in multi-layer
- Both masters and slaves access a common set of bus resources
- The number of bus layers is defined by the number of slaves requiring concurrent data transfer
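The benefit of multiple layers can be estimated with an idealized count of bus cycles. In the sketch below every transfer takes one cycle, a single-layer bus carries one transfer per cycle, and a multi-layer bus lets transfers proceed together as long as they involve different masters and different slaves. These are simplifying assumptions (no wait states, no bursts, best-case scheduling), and the master and slave names are hypothetical.

```python
from collections import Counter


def single_layer_cycles(transfers):
    """One shared bus: transfers are fully serialized."""
    return len(transfers)


def multi_layer_cycles(transfers):
    """Best case with one layer per master: the busiest master or slave sets the time."""
    if not transfers:
        return 0
    per_master = Counter(master for master, _ in transfers)
    per_slave = Counter(slave for _, slave in transfers)
    return max(max(per_master.values()), max(per_slave.values()))


# Hypothetical workload: (master, slave) pairs.
transfers = [
    ("OMNIcore", "mem0"),
    ("RADcore0", "mem1"),
    ("RADcore1", "mem2"),
    ("DMA", "mem0"),
]
serial = single_layer_cycles(transfers)   # 4
layered = multi_layer_cycles(transfers)   # 2, because mem0 is accessed twice
```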


Master Slave Communication

Both the AHB master and slave have embedded (4-state) state machines
- Allow communication for master-slave and multiple master status

Specifics of the FSM operation
- Driven by the features of the processor (transfer FSM) and memory (response FSM) blocks being used


RAMA Block Diagram

AMBA Applications to Multiprocessor Systems

RAMA

High-performance multi-core platform for addressing datapath applications
- Standardizes the on-chip bus operation by adopting AMBA AHB

Integration of ITC’s RADcore and OMNIcore processor cores
- Driven by the features of the processor (transfer FSM) and memory (response FSM) blocks being used

RADcore
- Signal processing engine : parallel processing, Reconfigurable Arithmetic Datapath (RAD) features
- Data interface : Initialization I/O EXU, memory bus interfaces, RADbus interface

OMNIcore
- 32-bit cryptographic/RISC architecture
- High-performance RISC processor with a dual memory bus interface
- Uses AHB as its central bus structure

Other elements
- Memory blocks, an external memory interface core, arbitration logic


RAMA Multi-layer using AHB

Inter-core communication is based on two AHB busses
- Separate and reduce any interdependence of control and data access
- Control interface : a single-layer AHB with the OMNIcore control/ROM port and the external memory DMA
- Slaves are the boot and local (instruction) memory and the RADcore control interfaces
- A single layer is used since the inter-core control and memory update operations are intermittent

Data transfer AHB has up to six masters
- OMNIcore Data/RAM port and the external memory DMA, along with up to four RADcore I/O ports
- To facilitate high-bandwidth multi-core performance, the data transfer AHB is a multi-layer AHB structure


Interfaces of Multiple System Domains in RAMA

Key communications interfaces of RAMA
① RADcore to on chip memory array data read/write operations
② OMNIcore to on chip memory array data read/write operations
③ RADcore to External Memory Buffer read/write operations
④ OMNIcore to External Memory Buffer read/write operations
⑤ RADcore-to-RADcore data transfers
⑥ RADcore to external logic data transfers
⑦ External memory (DMA) to internal memory array read/write operations
⑧ OMNIcore to RADcore control read/write operations
⑨ OMNIcore to local (scratch) RAM read/write operations
⑩ OMNIcore to (boot) ROM read operations


RADcore Overview

A High Performance Reconfigurable Signal Processor with Distributed Instruction Word (DIW) Architecture

A core controller/sequence block, a DIW Instruction Memory, a set of Execution Units (EXUs), data I/O, external logic interface

The initialization busses, Reconfigurable Channel Bus (RCB) and the supporting Flags encapsulate and interconnect each EXU

Key features
• 15 channel Reconfigurable data bus based architecture
• Reduces register based operations
• User definable pipeline depth
• Distributed instruction word driven parallel operation
• Supports highly pipelined dataflow
• Configuration selectable by designer (up to 11 EXUs)
• AMBA compatible Memory and core to core busses
• Spreadsheet based RADware programming environment


RADcore Interfaces

Controller interface
- between the RADcore and host processor

Memory interface
- both on chip RAM block and off-chip memory interfaces

RADbus interface
- RADcore-to-RADcore, initialization I/O EXU

External Logic Buffer
- co-processing with arbitrary external logic


OMNIcore Overview

Key features
• 32-bit RISC engine
• Cryptographic support
• AHB compliant control and RAM busses
→ User-selectable 8 to 32-bit operation
• 4 stage pipeline
→ Low interrupt latency
• Two privilege levels: user, system
→ Supports smart card applications


OMNIcore Overview

Two primary interfaces for instruction operation (Ctrl) and data read/write (RAM)
- Access to memory bus for on chip memory and external memory operation using its RAM interface
- Access to a local control bus for loading of instruction data into instruction cache and for supervisory and status communications with RADcore control blocks using its Control interface

Dual master AHB interface to integrate control and data functions
- Data output bus is shared
- Instruction cache internal to the OMNIcore subsystems is used to avoid stalling


OMNIcore Cryptographic Features

Public-key and private-key cryptographic algorithms
- DES, RSA, DSA and Diffie-Hellman
- Controlled by a set of cryptographic instructions

Cryptographic instruction support for
- Compression Permutation
- Expansion Permutation
- Initial Permutation
- Final Permutation
- Key Permutation
- Key Rotation
- P-Box Permutation
- S-Box Permutation


RAMA Memory Subsystem

Distributed memory block architecture, consisting of dual port memory blocks

Key features
- Dual Port RAM blocks
- Multi-layer AHB for simultaneous memory access
- Dual Mode External Memory Interface
→ DMA interface for internal – external memory transfer (AHB Master)
→ Buffer for processor – external memory transfers (AHB Slaves)
- Multi-layer Arbiter
→ Priority based


AHB Arbitration

Multi-layer arbitration scheme
- To coordinate concurrent processor-memory transfers between masters (OMNIcore, multiple RADcores, external memory DMA) and slaves (memory, external memory buffer)

A Configurable Master/Slave Port

RADbus AMBA AHB features
- Allows direct processor to processor communication
- Hybrid (configurable) Master/Slave interface
- Mode dependent changes in AHB operation
- All write operations in master mode
- All read operations in slave mode
- Uses first-come, first-serve method for arbitration
- Low overhead ensures fast operation



A Configurable Master/Slave Port

Structure of RADbus AHB scheme
- All Out-ports are defined as bus masters, structured as a Write-only Master
- All In-ports are defined as bus slaves, structured as a Write-only Slave
- The number of RADcores connecting to the RADbus determines the size of the address
- Selection of which bus channel (A, B, C) is read into the RADcore is defined as a function of decoded address bits from the master in conjunction with the state of the slave
- Selection algorithm is based on a “first-come, first-serve” selection mechanism by the read mux, controlled by an address decoded select signal (a, b, c) for each bus
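A first-come, first-serve selection can be modeled as a queue of pending bus channels. The sketch below is a behavioral illustration only, not the RADbus read-mux logic: the class name, method names and channel labels are assumptions.

```python
from collections import deque


class FcfsArbiter:
    """Grants pending channels strictly in the order their requests arrived."""

    def __init__(self):
        self._pending = deque()

    def request(self, channel):
        """Record a request; a channel already waiting is not queued twice."""
        if channel not in self._pending:
            self._pending.append(channel)

    def grant(self):
        """Return the oldest waiting channel, or None if nothing is pending."""
        return self._pending.popleft() if self._pending else None


# Hypothetical request order on three bus channels.
arb = FcfsArbiter()
for channel in ["B", "A", "B", "C"]:
    arb.request(channel)
grants = [arb.grant(), arb.grant(), arb.grant(), arb.grant()]
```

Because no priorities are compared, the decision needs very little logic, which fits the low-overhead goal stated for the RADbus.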


Summary

- RAMA discussed as platform-based solution
- Uses multiple AHB for core-to-core integration
- AHB easily integrated into RAMA architecture
- AHB provides well understood, flexible interfaces
- RADbus example shows AHB can be flexible
- Combination of OMNIcore and RADcore provides enhanced DSP and data processing
- Extends platform to reach emerging SoC applications

Cores + Infrastructure + Integration = SoC Platform
(OMNIcore + RADcore) + (RAM) + AMBA AHB = RAMA


81

Lightweight Implementation of the POSIX Threads API for an On-Chip MIPS Multiprocessor with VCI Interconnect


82

Contents

• Target architecture
• MIPS CPU properties
• The architecture needs
• Pthread specification
• Implementation
• Experimental setup
• Conclusion


83

Target architecture

General VCI based SoC architecture

• The system consists of one or more MIPS R3000 CPUs

• Virtual Chip Interconnect (VCI) compliant interconnect


84

MIPS CPU properties

• Two separated caches for instruction and data.

• Direct mapped caches.

• Write buffer with write update and write through policy.

• No memory management unit (MMU): logical addresses are physical addresses (the total memory is fixed at design time).


85

The architecture needs

• Protected access to shared data : use spin locks
– A spin lock is acquired using pthread_spin_lock
– A spin lock is released using pthread_spin_unlock

• Cache coherency
– If the interconnect is a shared bus, use snoopy caches.
    • Reduces main memory traffic.
– If the interconnect is VCI compliant (either bus or network), caches need to be flushed.

• Processor identification
– CPUs must have an internal register allowing their identification within the system.
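The behavior of a spin lock can be modeled in a few lines. The sketch below is a Python model, not the kernel's C code: the `SpinLock` class and `run_demo` function are hypothetical, and a `threading.Lock` stands in for the atomic test-and-set instruction that real hardware would provide.

```python
import threading
import time


class SpinLock:
    """Busy-waiting lock built on a modeled atomic test-and-set."""

    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()  # stand-in for the hardware atomic operation

    def _test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._atomic:
            old = self._flag
            self._flag = True
            return old

    def lock(self):
        while self._test_and_set():  # True means someone else holds the lock
            time.sleep(0)            # let other Python threads run; hardware would just spin

    def unlock(self):
        self._flag = False


def run_demo(num_threads, increments):
    """Increment a shared counter from several threads under the spin lock."""
    shared = {"count": 0}
    spin = SpinLock()

    def worker():
        for _ in range(increments):
            spin.lock()
            shared["count"] += 1  # protected access to shared data
            spin.unlock()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


total = run_demo(4, 1000)
```

In C on the target platform, the same roles would be played by `pthread_spin_lock` and `pthread_spin_unlock`.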


86

Pthread specification

• Main kernel objects are the threads and the scheduler.

Execute the thread : ‘start’ function call

Thread attribute : stack size, stack addr, scheduling policy

Unique identifier for the thread


87


• Changing state is done using some pthread function on a shared object.

• From RUNNABLE to RUN is done by the scheduler. Backward from RUN to RUNNABLE using sched_yield.

• A thread structure contains the context of execution of a thread and pointers to other threads.


88


• The scheduler manages 5 lists of threads.
– Symmetric Multi-Processor (SMP) : the scheduler may be shared by all processors.
– Distributed : a scheduler exists on every processor.

• Access to the scheduler must be performed in a critical section, under the protection of a lock.

• Other implemented objects
– Spin lock : the low level test and set access
– Mutex : sequentializes access to shared data
– Semaphore : sem_post is the only function that can be called in interrupt handlers.


89

Implementation

◈ Booting sequence

• The scheduler_created variable must be declared with the volatile type qualifier to ensure that the compiler will not optimize this seemingly infinite loop.


90

Implementation

• Context Switch
– Saves the current values of the CPU registers into the context variable of the thread that is currently executing.

– Sets the CPU registers to the values held in the context variable of the new thread to execute.

– The return address of the function is a register of the context.

– Restoring a context sends the program back to where the context was saved, not to the current caller of the context switching routine.
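The last point, that restoring a context resumes execution where the context was saved, can be illustrated with Python generators. This is an analogy rather than the kernel's register-level code: a generator's suspended frame plays the role of the saved context, `yield` plays the role of a voluntary switch such as sched_yield, and the function names are hypothetical.

```python
from collections import deque


def thread_body(name, steps, log):
    """A cooperative thread: does one step of work, then gives up the CPU."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # save this thread's context and return to the scheduler


def run(threads):
    """Round-robin scheduler over a single list of runnable threads."""
    runnable = deque(threads)
    while runnable:
        thread = runnable.popleft()   # pick the next runnable thread
        try:
            next(thread)              # restore context: resumes right after its last yield
            runnable.append(thread)   # still alive, so it becomes runnable again
        except StopIteration:
            pass                      # thread function returned; drop it


log = []
run([thread_body("A", 2, log), thread_body("B", 2, log)])
```

Each call to `next` continues inside the loop of `thread_body`, not at the top of the function, which mirrors how a restored context returns to the point where it was saved.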


91

Implementation

• All idle CPUs enter the same idle loop.

◈ CPU Idle Loop


92

Experimental setup

• Review several types of scheduler
– Symmetric Multiprocessor (SMP)
    • Unique scheduler shared by all processors and protected
    • The threads can run on any processor, and migrate

– Centralized Non SMP (NON_SMP_CS)
    • Unique scheduler shared by all processors and protected
    • Every thread is assigned to a given processor and can run only on it

– Distributed Non SMP (NON_SMP_DS)
    • As many schedulers as processors, and as many locks as schedulers
    • Every thread is assigned to a given processor and can run only on it
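The difference between letting threads migrate and pinning them to a processor can be seen in a toy completion-time model. The sketch below is an assumption-heavy illustration and does not reproduce the paper's measurements: threads run to completion without preemption, scheduler lock contention (which a distributed scheduler reduces) is ignored, and the durations and assignment are made up.

```python
import heapq


def smp_makespan(durations, num_cpus):
    """Shared run list: whichever CPU becomes free first takes the next thread."""
    free_at = [0] * num_cpus
    heapq.heapify(free_at)
    for d in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + d)
    return max(free_at)


def pinned_makespan(durations, assignment, num_cpus):
    """Each thread is bound to one CPU and can run only on it."""
    load = [0] * num_cpus
    for d, cpu in zip(durations, assignment):
        load[cpu] += d
    return max(load)


# Hypothetical workload: one long thread and four short ones on two CPUs.
durations = [4, 1, 1, 1, 1]
migrating = smp_makespan(durations, 2)                    # 4
pinned = pinned_makespan(durations, [0, 1, 0, 1, 0], 2)   # 6
```

A better static assignment would close the gap in this example; the model only shows that a pinned schedule depends on the quality of that assignment, while a shared run list adapts at run time.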


93

Experimental setup

◈ Motion JPEG application

[Figures: execution times of the MJPEG application; cycles spent in the CPU idle loop.]


94

Experimental setup

◈ COMM application

• Does not exchange data between processors.

• The only resource shared here is the bus

• The application uses the processors at about full power.


95

Conclusion

• The implementation is a bit tricky, but quite compact and efficient.

• Experiments have shown that a POSIX-compliant SMP kernel allowing task migration is an acceptable solution in terms of generality, performance and memory footprint for SoC.

