Chapter 3

CPUs

Prof. Chung-Ta King, Department of Computer Science, National Tsing Hua University

(Slides are taken from the textbook slides)

CPUs-2

Outline
  Input and Output Mechanisms
  Supervisor Mode, Exceptions, Traps
  Memory Management
  Performance and Power Consumption

CPUs-3

I/O devices
An embedded system usually includes some input/output devices.
A typical I/O interface to the CPU:
[Figure: the CPU connects to a device's status register and data register, which front the device mechanism.]

CPUs-4

Application: 8251 UART
Universal asynchronous receiver transmitter (UART): provides serial communication.
8251 functions are usually integrated into the standard PC interface chip.
Allows many communication parameters to be set:
  Baud (bit) rate: e.g. 9600 bits/sec
  Number of bits per character
  Parity/no parity
  Even/odd parity
  Length of stop bit (1, 1.5, 2 bits)

CPUs-5

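8251 CPU interface
[Figure: the CPU connects to the 8251 through an 8-bit status register and an 8-bit data register; the 8251 drives the serial port (xmit/rcv). Timing diagram over time: the line idles (no char), then carries a start bit, bit 0 through bit n-1, and a stop bit.]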

CPUs-6

Programming I/O
Two types of instructions can support I/O:
  special-purpose I/O instructions
  memory-mapped load/store instructions
Intel x86 provides in and out instructions.
Most other CPUs use memory-mapped I/O.
Memory-mapped I/O assigns addresses to the I/O registers.

CPUs-7

ARM memory-mapped I/O
Define location for device:
  DEV1 EQU 0x1000
Read/write code:
  LDR r1,=DEV1 ; set up device address
  LDR r0,[r1]  ; read DEV1
  MOV r0,#8    ; set up value to write
  STR r0,[r1]  ; write value to device

CPUs-8

SHARC memory-mapped I/O
Device must be in external memory space (above 0x400000).
Use DM to control access:
  I0 = 0x400000;
  M0 = 0;
  R1 = DM(I0,M0);

CPUs-9

Peek and poke
Traditional C interfaces:
  int peek(char *location) {
    return *location;
  }

  void poke(char *location, char newval) {
    (*location) = newval;
  }
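On a real target, device registers are usually reached through volatile pointers so that the compiler does not cache or drop the accesses. The sketch below is one possible variant of peek and poke under that assumption; the names peek8, poke8, and fake_device are hypothetical, and fake_device is only a stand-in that lets the code run on a host machine.

```c
#include <stdint.h>

/* Hypothetical stand-in for a device address space so the sketch runs on a
   host machine; on a real target the pointer would hold the device address. */
static volatile uint8_t fake_device[2];

/* Read one byte from a (device) location. */
static int peek8(volatile uint8_t *location) {
    return *location;
}

/* Write one byte to a (device) location. */
static void poke8(volatile uint8_t *location, uint8_t newval) {
    *location = newval;
}
```

A call such as poke8(&fake_device[0], 8) followed by peek8(&fake_device[0]) reads back the value just written.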

CPUs-10

Busy/wait output
Simplest way to program a device: use instructions to test when the device is ready.
  #define OUT_CHAR 0x1000   // device data register
  #define OUT_STATUS 0x1001 // device status register

  current_char = mystring;
  while (*current_char != '\0') {
    poke(OUT_CHAR, *current_char);
    while (peek(OUT_STATUS) != 0); // busy waiting
    current_char++;
  }

CPUs-11

Simultaneous busy/wait input and output
  while (TRUE) {
    /* read */
    while (peek(IN_STATUS) == 0);
    achar = (char)peek(IN_DATA);
    /* write */
    poke(OUT_DATA, achar);
    poke(OUT_STATUS, 1);
    while (peek(OUT_STATUS) != 0);
  }

CPUs-12

Interrupt I/O
Busy/wait is very inefficient:
  CPU can’t do other work while testing the device
  Hard to do simultaneous I/O
Interrupts allow a device to change the flow of control in the CPU.
  Causes a subroutine call to handle the device
[Figure: the CPU (with its PC and IR) connects to the device's status and data registers through interrupt request, interrupt acknowledge, and data/address lines.]

CPUs-13

Interrupt behavior
Based on the subroutine call mechanism.
Interrupt forces the next instruction to be a subroutine call to a predetermined location.
  Return address is saved to resume executing the foreground program.

CPUs-14

Interrupt physical interface
CPU and device are connected by the CPU bus.
CPU and device handshake:
  device asserts interrupt request
  CPU checks the interrupt request line at the beginning of each instruction cycle
  CPU asserts interrupt acknowledge when it can handle the interrupt
  CPU fetches the next instruction from the interrupt handler routine

CPUs-15

Example: character I/O handlers
  void input_handler() {
    achar = peek(IN_DATA);
    gotchar = TRUE;
    poke(IN_STATUS, 0);
  }

  void output_handler() {
  }

CPUs-16

Example: interrupt-driven main program
  main() {
    while (TRUE) {
      if (gotchar) {
        poke(OUT_DATA, achar);
        poke(OUT_STATUS, 1);
        gotchar = FALSE;
      }
    }
  }

CPUs-17

Example: interrupt I/O with buffers
Queue for characters:
[Figure: a character queue with head and tail pointers, shown in two states; one state holds the character a.]

CPUs-18

Buffer-based input handler
  void input_handler() {
    char achar;
    if (full_buffer()) error = 1;
    else { achar = peek(IN_DATA); add_char(achar); }
    poke(IN_STATUS, 0);
    if (nchars == 1) {
      poke(OUT_DATA, remove_char());
      poke(OUT_STATUS, 1);
    }
  }
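The handler relies on full_buffer(), add_char(), remove_char(), and nchars, which the slides do not define. A minimal sketch of one way they could be written as a circular queue follows; BUF_SIZE, empty_buffer(), and the array layout are assumptions made for illustration.

```c
#include <stdbool.h>

#define BUF_SIZE 8              /* assumed capacity, not from the slides */

static char buf[BUF_SIZE];
static int head = 0;            /* index of the oldest character */
static int tail = 0;            /* index of the next free slot */
static int nchars = 0;          /* number of characters currently queued */

static bool full_buffer(void)  { return nchars == BUF_SIZE; }
static bool empty_buffer(void) { return nchars == 0; }

/* Append a character at the tail; the caller checks full_buffer() first. */
static void add_char(char c) {
    buf[tail] = c;
    tail = (tail + 1) % BUF_SIZE;
    nchars++;
}

/* Remove and return the character at the head; the caller checks
   empty_buffer() first. */
static char remove_char(void) {
    char c = buf[head];
    head = (head + 1) % BUF_SIZE;
    nchars--;
    return c;
}
```

In a real system the queue state is shared between the handlers and the foreground program, so updates would likely need to run with interrupts disabled; the sketch leaves that out.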

CPUs-19

Debugging interrupt code
What if the handler forgets to save and restore the registers it changes?
  Foreground program can exhibit mysterious bugs.
  Bugs will be hard to repeat because they depend on interrupt timing.

CPUs-20

Priorities and Vectors
Two mechanisms allow us to make interrupts more specific:
  Priorities determine which interrupt gets the CPU first
  Vectors determine what code (handler routine) is called for each type of interrupt
Mechanisms are orthogonal: most CPUs provide both.

CPUs-21

Prioritized interrupts
[Figure: device 1 through device n each drive their own interrupt request line (L1, L2, ..., Ln) into the CPU; the CPU returns an interrupt acknowledge.]

CPUs-22

Interrupt prioritization
Masking: an interrupt with priority lower than the current priority is not recognized until the pending interrupt is complete.
Non-maskable interrupt (NMI): highest priority, never masked
  Often used for power-down

CPUs-23

Interrupt Vectors
Allow different devices to be handled by different code.
Require an additional vector line from device to CPU.
[Figure: the device sends an interrupt request and a vector to the CPU and receives an acknowledge; the vector selects one of handler 0 through handler 3 in the interrupt vector table.]
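In software terms, an interrupt vector table behaves like an array of handler addresses indexed by the vector the device supplies. The sketch below models that idea with C function pointers; the handler names, the table size of four, and the dispatch() helper are hypothetical and stand in for what the hardware does.

```c
#include <stddef.h>

#define NUM_VECTORS 4           /* matches handler 0..3 in the figure */

typedef void (*handler_t)(void);

static int last_handled = -1;   /* records which handler ran, for illustration */

static void handler0(void) { last_handled = 0; }
static void handler1(void) { last_handled = 1; }
static void handler2(void) { last_handled = 2; }
static void handler3(void) { last_handled = 3; }

/* The vector table: entry i holds the handler for vector i. */
static handler_t vector_table[NUM_VECTORS] = {
    handler0, handler1, handler2, handler3
};

/* Call the handler selected by the vector.
   Returns 0 on success, -1 if the vector is out of range or unset. */
static int dispatch(unsigned vector) {
    if (vector >= NUM_VECTORS || vector_table[vector] == NULL)
        return -1;
    vector_table[vector]();
    return 0;
}
```

Calling dispatch(2) runs handler2, while an out-of-range vector such as 7 is rejected without running any handler.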

CPUs-24

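Generic interrupt mechanism
Assume priority selection is handled before this point.
[Flowchart:]
  intr? If no, continue execution.
  If yes: intr priority > current priority? If no, ignore.
  If yes: ack, then wait for the vector.
  vector? If yes, call table[vector].
  If no: timeout? If yes, bus error; if no, keep waiting for the vector.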

CPUs-25

Interrupt sequence
  CPU acknowledges request
  Device sends vector
  CPU calls handler
  Software processes request
  CPU restores state to foreground program

CPUs-26

Sources of interrupt overhead
  Handler execution time
  Interrupt mechanism overhead
  Register save/restore
  Pipeline-related penalties
  Cache-related penalties

CPUs-27

ARM interrupts
ARM7 supports two types of interrupts:
  Fast interrupt requests (FIQs)
  Interrupt requests (IRQs)
  FIQ takes priority over IRQ
Interrupt table starts at location 0.

CPUs-28

ARM interrupt procedure
CPU actions:
  Save PC; copy CPSR to SPSR.
  Force bits in CPSR to record interrupt.
  Force PC to vector.
Handler responsibilities:
  Restore proper PC.
  Restore CPSR from SPSR.
  Clear interrupt disable flags.

CPUs-29

ARM interrupt latency
Worst-case latency to respond to an interrupt is 27 cycles:
  Two cycles to synchronize external request.
  Up to 20 cycles to complete current instruction.
  Three cycles for data abort.
  Two cycles to enter interrupt handling state.

CPUs-30

SHARC interrupt structure
Interrupts are vectored and prioritized.
Priorities are fixed: reset highest, user SW interrupt 3 lowest.
Vectors are also fixed. Vector is an offset in the vector table. Table starts at 0x20000 in internal memory, 0x40000 in external memory.

Pads Lab

SHARC interrupt sequence
Start: must be executing or in IDLE/IDLE16.
  1. Output appropriate interrupt vector address.
  2. Push PC value onto PC stack.
  3. Set bit in interrupt latch register.
  4. Set IMASKP to current nesting state.

Pads Lab

SHARC interrupt return
Initiated by RTI instruction.
  1. Return to address at top of PC stack.
  2. Pop PC stack.
  3. Pop status stack if appropriate.
  4. Clear bits in interrupt latch register and IMASKP.

Pads Lab

SHARC interrupt performance
Three stages of response:
  1 cycle: synchronization and latching
  1 cycle: recognition
  2 cycles: branching to vector
Total latency: 3 cycles.
Multiprocessor vector interrupts have 6 cycle latency.

CPUs-34


CPUs-35

Supervisor mode
May want to provide protective barriers between programs.
  e.g., avoid memory corruption
Need supervisor mode to manage the various programs.
SHARC does not have a supervisor mode.

CPUs-36

ARM supervisor mode
Use SWI instruction to enter supervisor mode, similar to a subroutine call:
  SWI CODE_1
Sets PC to 0x08.
Argument to SWI is passed to supervisor mode code to request various services.
Saves CPSR in SPSR.

CPUs-37

Exception
Exception: internally detected error
Exceptions are synchronous with instructions but unpredictable.
Build exception mechanism on top of interrupt mechanism.
Exceptions are usually prioritized and vectorized.
  A single instruction may generate more than one exception.

CPUs-38

Trap
Trap (software interrupt): an exception generated by an instruction
  Call supervisor mode
ARM uses SWI instruction for traps.
SHARC offers three levels of software interrupts.
  Called by setting bits in IRPTL register

CPUs-39

Co-processor
Co-processor: added function unit that is called by instruction
  e.g. floating-point operations
A co-processor instruction can cause a trap and be handled by software (if no such co-processor exists).
ARM allows up to 16 co-processors.

CPUs-40


CPUs-41

Caches and CPUs
[Figure: the CPU sends addresses to the cache controller, which exchanges addresses and data with the cache and main memory and returns data to the CPU.]
Memory access speed is falling further and further behind CPU speed.
Cache: reduces the speed gap

CPUs-42

Cache operation
May have caches for:
  instructions
  data
  data + instructions (unified)
Memory access time is no longer deterministic.
  Cache hit: required location is in cache
  Cache miss: required location is not in cache
  Working set: set of locations used by program in a time interval

CPUs-43

Types of cache misses
Compulsory (cold): location has never been accessed.
Capacity: working set is too large.
Conflict: multiple locations in working set map to same cache entry.

CPUs-44

Memory system performance
Cache performance benefits:
  Keep frequently-accessed locations in fast cache.
  Cache retrieves more than one word at a time.
    Sequential accesses are faster after first access.
$h$ = cache hit rate.
$t_{cache}$ = cache access time, $t_{main}$ = main memory access time.
Average memory access time:
  $t_{av} = h\,t_{cache} + (1-h)\,t_{main}$
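To get a feel for the average access time formula, it can be evaluated for a few hit rates. The function below is a direct transcription of the single-level formula; the function name and the example numbers (a 1-cycle cache and a 10-cycle main memory) are illustrative assumptions, not measurements.

```c
/* Average memory access time for a single-level cache:
   h is the hit rate, t_cache and t_main are access times in the same unit. */
static double t_av(double h, double t_cache, double t_main) {
    return h * t_cache + (1.0 - h) * t_main;
}
```

With the assumed values, a 90% hit rate gives about 1.9 cycles on average, while a hit rate of 0 degenerates to the main memory time of 10 cycles.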

CPUs-45

Multi-level caches

$h_1$ = L1 cache hit rate.
$h_2$ = rate for miss on L1, hit on L2.
Average memory access time:
  $t_{av} = h_1 t_{L1} + h_2 t_{L2} + (1 - h_1 - h_2)\,t_{main}$
[Figure: CPU, L1 cache, and L2 cache connected in series.]

CPUs-46

Replacement policies
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
  Random.
  Least-recently used (LRU).

CPUs-47

Cache organizations
Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented)
Direct-mapped: each memory location maps onto exactly one cache entry
N-way set-associative: each memory location can go into one of n sets

CPUs-48

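Direct-mapped cache
[Figure: the address is split into tag, index, and offset fields. The index selects a cache block holding a valid bit (e.g. 1), a tag (e.g. 0xabcd), and data bytes; the stored tag is compared (=) with the address tag to produce hit, and the offset selects the byte returned as value.]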

CPUs-49

Direct-mapped cache locations
Many locations map onto the same cache block.
Conflict misses are easy to generate:
  Array a[] uses locations 0, 1, 2, …
  Array b[] uses locations 1024, 1025, 1026, …
  Operation a[i] + b[i] generates conflict misses if locations 0 and 1024 are mapped to the same block in the cache.
Write operations:
  Write-through: immediately copy write to main memory
  Write-back: write to main memory only when location is removed from cache
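Whether locations 0 and 1024 collide depends on the cache geometry. The sketch below assumes a 1024-byte direct-mapped cache with 4-byte blocks (both values are assumptions chosen so the example collides) and shows how the index and tag would be derived from an address.

```c
#include <stdint.h>

#define BLOCK_SIZE  4u     /* bytes per cache block (assumed) */
#define NUM_BLOCKS  256u   /* blocks in the cache (assumed): 1024 bytes total */

/* Index of the cache block that an address maps to. */
static uint32_t cache_index(uint32_t addr) {
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}

/* Tag stored with the block to tell apart addresses sharing an index. */
static uint32_t cache_tag(uint32_t addr) {
    return addr / (BLOCK_SIZE * NUM_BLOCKS);
}
```

Under these assumptions, addresses 0 and 1024 share index 0 but carry tags 0 and 1, so alternating accesses to a[i] and b[i] keep evicting each other's block.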

CPUs-50

Set-associative cache
A set of direct-mapped caches:
[Figure: Set 1, Set 2, ..., Set n are looked up together and produce hit and data.]

CPUs-51

Memory Management Units
Memory management unit (MMU) translates addresses.
MMUs are not common in embedded systems, which rarely have secondary storage.
[Figure: the CPU sends a logical address to the MMU, which sends a physical address to main memory; data returns to the CPU, and main memory swaps with secondary storage.]

CPUs-52

Memory management tasks
Allows programs to move in physical memory during execution.
Allows virtual memory:
  memory images kept in secondary storage;
  images returned to main memory on demand during execution.
Page fault: request for location not resident in memory, which generates an exception

CPUs-53

Address translation
Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses.
Two basic schemes:
  segmented
  paged
Segmentation and paging can be combined (x86).

CPUs-54

Segment address translation
[Figure: the logical address from the CPU is added (+) to the segment base address to form the physical address sent to memory; a range check against the segment lower bound and segment upper bound raises a range error.]
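The segment translation in the figure amounts to an addition followed by a bounds check. A minimal sketch follows, assuming 32-bit addresses; the segment_t layout and the seg_translate() name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t base;   /* segment base address */
    uint32_t lower;  /* lowest legal physical address in the segment */
    uint32_t upper;  /* highest legal physical address in the segment */
} segment_t;

/* Add the base to the logical address and range-check the result.
   Returns true and stores the physical address when it is in range;
   returns false (a range error) and leaves *physical untouched otherwise. */
static bool seg_translate(const segment_t *seg, uint32_t logical,
                          uint32_t *physical) {
    uint32_t addr = seg->base + logical;
    if (addr < seg->lower || addr > seg->upper)
        return false;
    *physical = addr;
    return true;
}
```

For a segment based at 0x2000 with bounds 0x2000 to 0x2FFF, logical address 0x10 translates to 0x2010, while logical address 0x1000 falls outside the segment and is reported as a range error.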

CPUs-55

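Page address translation
[Figure: the logical address from the CPU is split into page and offset; the page number indexes the page table to fetch the page i base, which is concatenated with the offset to form the physical address sent to memory.]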
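Paged translation replaces the addition with a table lookup and a concatenation. The sketch below assumes 4-kbyte pages and a tiny flat page table with made-up base addresses; the names page_table and page_translate() are hypothetical.

```c
#include <stdint.h>

#define PAGE_BITS 12u                      /* assumed 4-kbyte pages */
#define PAGE_MASK ((1u << PAGE_BITS) - 1u)
#define NUM_PAGES 4u                       /* tiny table for illustration */

/* Flat page table: entry i holds the physical base address of page i.
   The base addresses are invented for the example. */
static const uint32_t page_table[NUM_PAGES] = {
    0x00008000u, 0x00003000u, 0x0000A000u, 0x00001000u
};

/* Translate a logical address; the caller guarantees that the page
   number is below NUM_PAGES. */
static uint32_t page_translate(uint32_t logical) {
    uint32_t page   = logical >> PAGE_BITS;   /* upper bits: page number */
    uint32_t offset = logical & PAGE_MASK;    /* lower bits: offset in page */
    return page_table[page] | offset;         /* concatenate base and offset */
}
```

With this table, logical address 0x1234 lies in page 1 at offset 0x234 and maps to physical address 0x3234. Because page bases are aligned to the page size, the concatenation can be done with a bitwise OR.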

CPUs-56

Page table organizations
[Figure: two organizations, flat and tree, each leading to page descriptors.]

CPUs-57

Caching address translations
Large translation tables require main memory access.
TLB: cache for address translation.
  Typically small.

CPUs-58

ARM & SHARC memory management
Memory region types:
  section: 1 Mbyte block
  large page: 64 kbytes
  small page: 4 kbytes
An address is marked as section-mapped or page-mapped.
Two-level translation scheme.
SHARC does not have an MMU.

CPUs-59

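ARM address translation
[Figure: the logical address is split into 1st index, 2nd index, and offset. The translation table base register is concatenated with the 1st index to find the 1st-level table descriptor; that descriptor is concatenated with the 2nd index to find the 2nd-level table descriptor, which combines with the offset to form the physical address.]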

CPUs-60


CPUs-61

Performance Acceleration
There are 3 factors that can substantially improve system performance:
  Pipelining
  Superscalar execution
  Caching
Need to take advantage of them where possible.
But they also cause problems in analyzing the performance.

CPUs-62

Pipelining
Several instructions are executed simultaneously at different stages of completion.
Various conditions can cause pipeline stalls that reduce utilization:
  branches
  memory system delays
  data hazards, etc.
Both ARM and SHARC have 3-stage pipes:
  fetch instruction from memory;
  decode opcode and operands;
  execute.

CPUs-63

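ARM pipeline execution
[Figure: pipeline diagram over time (cycles 1, 2, 3 marked); each instruction starts one stage behind the previous one.]
  add r0,r1,#5    fetch  decode  execute
  sub r2,r3,r6           fetch   decode   execute
  cmp r2,#3                      fetch    decode   execute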

CPUs-64

Pipeline performance
Latency: time it takes for an instruction to get through the pipeline.
Throughput: number of instructions executed per time period.
Pipelining increases throughput without reducing latency.
Pipeline stall:
  If a step cannot be completed in the same amount of time, pipeline stalls.
  Bubbles introduced by stall increase latency, reduce throughput.

CPUs-65

Data stall
Multi-cycle execution and data stall
LDMIA: load multiple
[Figure: pipeline diagram over time; stages shown for each instruction:]
  ldmia r0,{r2,r3}: fetch, decode, ex ld r2, ex ld r3
  sub r2,r3,r6: fetch, decode, ex sub
  cmp r2,#3: fetch, decode, ex cmp

CPUs-66

Control stalls
Branches often introduce stalls (branch penalty).
  Stall time may depend on whether branch is taken.
May have to squash instructions that already started executing.
Don’t know what to fetch until condition is evaluated.

CPUs-67

ARM pipelined branch
[Figure: pipeline diagram over time; stages shown for each instruction:]
  bne foo: fetch, decode, ex bne, ex bne, ex bne
  sub r2,r3,r6: fetch, decode
  foo add r0,r1,r2: fetch, decode, ex add

CPUs-68

Delayed branch
To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch are always executed, whether or not the branch is taken.
SHARC supports delayed and non-delayed branches.
  Specified by bit in branch instruction
  2 instruction branch delay slot

CPUs-69

Example: SHARC code scheduling
  L1=5;
  DM(I0,M1)=R1;
  L8=8;
  DM(I8,M9)=R2;
CPU cannot use DAG on cycle just after loading DAG’s register, because both need the same internal bus.
  CPU performs NOP between register assign and DM.

CPUs-70

Rescheduled SHARC code
  L1=5;
  L8=8;
  DM(I0,M1)=R1;
  DM(I8,M9)=R2;
Avoids two NOP cycles.

CPUs-71

Example: ARM execution time
Determine execution time of FIR filter:
  for (i=0; i<N; i++)
    f = f + c[i]*x[i];
Only branch in loop test may take more than one cycle.
  BLT loop takes 1 cycle best case, 3 worst case.

CPUs-72

Superscalar execution
Superscalar processor can execute several instructions per cycle.
  Uses multiple pipelined data paths
Programs execute faster, but it is harder to determine how much faster.

CPUs-73

Superscalar Processor
[Figure: a control unit issues Instruction 1 and Instruction 2 to two instruction units, which share the registers.]

CPUs-74

Data dependencies
Execution time depends on operands, not just opcode.
Superscalar CPU checks data dependencies dynamically:
  add r2,r0,r1
  add r3,r2,r5
[Figure: data dependency graph; r0 and r1 produce r2, and r2 and r5 produce r3.]

CPUs-75

Memory system performance
Caches introduce indeterminacy in execution time.
  Depends on order of execution.
Cache miss penalty: added time due to a cache miss.
Several reasons for a miss: compulsory, conflict, capacity.

CPUs-76

CPU power consumption
Most modern CPUs are designed with power consumption in mind to some degree.
Power vs. energy:
  Power is energy consumption per unit time
  heat depends on power consumption
  battery life depends on energy consumption

CPUs-77

CMOS power consumption
Voltage drops: power consumption proportional to $V^2$
Toggling: more activity means more power
Leakage: basic circuit characteristics; can be eliminated by disconnecting power

CPUs-78

CPU power-saving strategies
Reduce power supply voltage
Run at lower clock frequency
Disable function units with control signals when not in use
Disconnect parts from power supply when not in use to eliminate leakage currents

CPUs-79

Power management styles
Static power management: does not depend on CPU activity
  Example: user-activated power-down mode
Dynamic power management: based on CPU activity
  Example: disabling function units

CPUs-80

Application: PowerPC 603
Provides doze, nap, sleep modes for static power management
Dynamic power management features:
  Can shut down unused execution units
  Cache organized into subarrays to minimize amount of active circuitry

CPUs-81

PowerPC 603 activity
Percentage of time units are idle for SPEC integer/floating-point:
  unit              Specint92   Specfp92
  D cache           29%         28%
  I cache           29%         17%
  load/store        35%         17%
  fixed-point       38%         76%
  floating-point    99%         30%
  system register   89%         97%
Idle units are turned off by switching off clocks.
Pipeline stages can be turned on or off.

CPUs-82

Power-down costs
Going into a power-down mode costs:
  time
  energy
Must determine if going into mode is worthwhile.
Can model CPU power states with power state machine.

CPUs-83

Application: StrongARM SA-1100
Processor takes two supplies:
  VDD is main 3.3V supply (on & off)
  VDDX is 1.5V (always remains on)
Three power modes:
  Run: normal operation.
  Idle: stops CPU clock, with logic still powered
  Sleep: shuts off most of chip activity; 3 steps, each about 30 µs; wakeup takes > 10 ms

CPUs-84

SA-1100 power state machine
[Figure: three states, run (Prun = 400 mW), idle (Pidle = 50 mW), and sleep (Psleep = 0.16 mW), joined by transitions labeled 10 µs, 10 µs, 90 µs, 90 µs, and 160 ms.]
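A power state machine can be held in a small table, which makes it easy to reason about whether entering a low-power state is worthwhile. The sketch below uses the state powers from the figure; the assignment of transition times to arcs (10 µs between run and idle in both directions, 90 µs into sleep, 160 ms from sleep back to run) is the usual reading of this machine and should be treated as an assumption, as should all the identifiers.

```c
typedef enum { RUN = 0, IDLE = 1, SLEEP = 2 } pstate_t;

/* Power drawn in each state, in milliwatts (values from the figure). */
static const double state_power_mw[3] = { 400.0, 50.0, 0.16 };

/* Transition latency in microseconds, indexed [from][to].
   The arc assignment is assumed; -1 marks a transition not shown. */
static const double transition_us[3][3] = {
    /* to:        RUN        IDLE   SLEEP */
    /* RUN   */ { 0.0,       10.0,  90.0 },
    /* IDLE  */ { 10.0,      0.0,   90.0 },
    /* SLEEP */ { 160000.0,  -1.0,  0.0  },
};

/* Energy in microjoules spent residing in state s for t_us microseconds.
   mW times us gives nJ, so divide by 1000 to get uJ. */
static double residency_energy_uj(pstate_t s, double t_us) {
    return state_power_mw[s] * t_us / 1000.0;
}
```

For example, one millisecond in idle costs 50 µJ under these numbers. A fuller model would also charge energy for the transitions themselves, which the slides do not quantify.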

CPUs-85

Outline
  Input and Output Mechanisms
  Supervisor Mode, Exceptions, Traps
  Memory Management
  Performance and Power Consumption
  Example Design: Data Compressor

CPUs-86

Goals
Compress data transmitted over serial line.
  Receives byte-size input symbols.
  Produces output symbols packed into bytes.
Will build software module only here.

CPUs-87

Collaboration diagram for compressor
[Figure: :input sends message 1..n: input symbols to :data compressor, which sends message 1..m: packed output symbols to :output.]

CPUs-88

Huffman coding
Early statistical text compression algorithm.
Select non-uniform size codes.
  Use shorter codes for more common symbols.
  Use longer codes for less common symbols.
To allow decoding, codes must have unique prefixes.
  No code can be a prefix of a longer valid code.

CPUs-89

Huffman example
  character   P
  a           .45
  b           .24
  c           .11
  d           .08
  e           .07
  f           .05
[Figure: Huffman tree with internal nodes P=.12, P=.19, P=.31, P=.55, and root P=1.]

CPUs-90

Example Huffman code
Read code from root to leaves:
  a   1
  b   01
  c   0000
  d   0001
  e   0010
  f   0011
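The code table above can be exercised with a small encoder. The sketch below writes the output as a string of '0' and '1' characters rather than packed bits, purely to keep it readable; the function name huff_encode and its interface are hypothetical.

```c
#include <string.h>

/* Code table from the example; index 0 is 'a', index 5 is 'f'. */
static const char *huff_code[6] = {
    "1", "01", "0000", "0001", "0010", "0011"
};

/* Encode a string of symbols a..f as a string of '0'/'1' characters.
   Returns the number of bits produced, or -1 on an unknown symbol or
   when out (of size out_size, including the terminator) is too small. */
static int huff_encode(const char *in, char *out, size_t out_size) {
    size_t n = 0;
    if (out_size == 0) return -1;
    out[0] = '\0';
    for (; *in != '\0'; in++) {
        if (*in < 'a' || *in > 'f') return -1;
        const char *code = huff_code[*in - 'a'];
        size_t len = strlen(code);
        if (n + len + 1 > out_size) return -1;
        memcpy(out + n, code, len + 1);   /* copies the terminator too */
        n += len;
    }
    return (int)n;
}
```

Encoding "abc" yields the 7-bit string 1010000. Weighting the code lengths by the probabilities in the example gives an average of 0.45(1) + 0.24(2) + 0.31(4) = 2.17 bits per symbol, compared with 3 bits for a fixed-length code over six symbols.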

CPUs-91

Huffman coder requirements table
  name                   data compression module
  purpose                code module for Huffman compression
  inputs                 encoding table, uncoded byte-size inputs
  outputs                packed compression output symbols
  functions              Huffman coding
  performance            fast
  manufacturing cost     N/A
  power                  N/A
  physical size/weight   N/A

CPUs-92

Building a specification
Collaboration diagram shows only steady-state input/output.
A real system must:
  Accept an encoding table.
  Allow a system reset that flushes the compression buffer.

CPUs-93

data-compressor class
  data-compressor
  attributes:
    buffer: data-buffer
    table: symbol-table
    current-bit: integer
  operations:
    encode(): boolean, data-buffer
    flush()
    new-symbol-table()

CPUs-94

Data-compressor behaviors
encode: takes one-byte input, generates packed encoded symbols and a Boolean indicating whether the buffer is full.
new-symbol-table: installs new symbol table in object, throws away old table.
flush: returns current state of buffer, including number of valid bits in buffer.

CPUs-95

Auxiliary classes

  data-buffer
  attributes:
    databuf[databuflen] : character
    len : integer
  operations:
    insert()
    length() : integer

  symbol-table
  attributes:
    symbols[nsymbols] : data-buffer
    len : integer
  operations:
    value() : symbol
    load()

CPUs-96

Auxiliary class roles
data-buffer holds both packed and unpacked symbols.
  Longest Huffman code for 8-bit inputs is 256 bits.
symbol-table indexes encoded version of each symbol.
  load() puts data in a new symbol table.

CPUs-97

Class relationships
[Figure: data-compressor is associated one-to-one (1, 1) with data-buffer and one-to-one (1, 1) with symbol-table.]

CPUs-98

Encode behavior
[Flowchart:]
  encode receives an input symbol.
  buffer filled?
    T: create new buffer, add to buffers, return true
    F: add to buffer, return false

CPUs-99

Insert behavior
[Flowchart:]
  insert receives an input symbol.
  fills buffer?
    T: pack bottom bits into this buffer, top bits into overflow buffer
    F: pack into this buffer
  update length
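The insert behavior can be sketched as bit packing with spill-over. The code below is one possible reading, not the module's actual implementation: it assumes a 16-bit buffer, takes the symbol's code as a string of '0' and '1' characters, packs bits most-significant first, and sends whatever does not fit into an overflow buffer. All names and sizes are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define BUF_BITS 16   /* assumed buffer capacity in bits */

typedef struct {
    unsigned char bits[BUF_BITS / 8];
    int len;                      /* number of valid bits */
} bitbuf_t;

static void bitbuf_clear(bitbuf_t *b) { memset(b, 0, sizeof *b); }

/* Append one bit, most-significant bit of each byte first.
   Bits beyond the capacity are silently dropped in this sketch. */
static void bitbuf_put(bitbuf_t *b, int bit) {
    if (b->len >= BUF_BITS) return;
    if (bit) b->bits[b->len / 8] |= (unsigned char)(0x80u >> (b->len % 8));
    b->len++;
}

/* Pack a code into buf; once buf is full, remaining bits go to overflow.
   Returns true when buf is full after the insert. */
static bool bitbuf_insert(bitbuf_t *buf, bitbuf_t *overflow, const char *code) {
    for (; *code != '\0'; code++) {
        bitbuf_t *dst = (buf->len < BUF_BITS) ? buf : overflow;
        bitbuf_put(dst, *code == '1');
    }
    return buf->len == BUF_BITS;
}
```

Inserting the 7-bit code 1010000 into an empty buffer leaves 7 valid bits and a first byte of 0xA0; once the buffer reaches 16 bits, later bits land in the overflow buffer and the function reports the buffer as full.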

CPUs-100

Program design
In an object-oriented language, we can reflect the UML specification in the code more directly.
In a non-object-oriented language, we must either:
  add code to provide object-oriented features;
  diverge from the specification structure.

CPUs-101

C++ classes
  class data_buffer {
    char databuf[databuflen];
    int len;
    int length_in_chars() { return len/bitsperbyte; }
  public:
    void insert(data_buffer, data_buffer&);
    int length() { return len; }
    int length_in_bytes() { return (int)ceil(len/8.0); }
    int initialize();
    ...

CPUs-102

C++ classes, cont’d.
  class data_compressor {
    data_buffer buffer;
    int current_bit;
    symbol_table table;
  public:
    boolean encode(char, data_buffer&);
    void new_symbol_table(symbol_table);
    int flush(data_buffer&);
    data_compressor();
    ~data_compressor();
  };

CPUs-103

C code
  struct data_compressor_struct {
    data_buffer buffer;
    int current_bit;
    sym_table table;
  };
  typedef struct data_compressor_struct data_compressor, *data_compressor_ptr;

  boolean data_compressor_encode(data_compressor_ptr mycmptrs, char isymbol, data_buffer *fullbuf) ...

CPUs-104

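Testing
Test by encoding, then decoding:
[Figure: input symbols and the symbol table feed the encoder; the encoder output and the symbol table feed the decoder; the decoder output is compared with the input symbols to produce the result.]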

CPUs-105

Code inspection tests
Look at the code for potential problems:
  Can we run past end of symbol table?
  What happens when the next symbol does not fill the buffer? Does fill it?
  Do very long encoded symbols work properly? Very short symbols?
  Does flush() work properly?
