chapter 3 cpus
DESCRIPTION
Chapter 3 CPUs. Prof. Chung-Ta King (金仲達), Department of Computer Science, National Tsing Hua University (slides are taken from the textbook slides). Outline: Input and Output Mechanisms; Supervisor Mode, Exceptions, Traps; Memory Management; Performance and Power Consumption. I/O devices: an embedded system usually includes some input/output devices.
![Page 1: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/1.jpg)
Chapter 3
CPUs
Prof. Chung-Ta King (金仲達), Department of Computer Science, National Tsing Hua University
(Slides are taken from the textbook slides)
![Page 2: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/2.jpg)
CPUs-2
Outline
  Input and Output Mechanisms
  Supervisor Mode, Exceptions, Traps
  Memory Management
  Performance and Power Consumption
![Page 3: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/3.jpg)
CPUs-3
I/O devices
An embedded system usually includes some input/output devices
A typical I/O interface to CPU:
[Figure: the CPU connects to a device through its status register and data register; the registers front the device mechanism]
![Page 4: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/4.jpg)
CPUs-4
Application: 8251 UART
Universal asynchronous receiver transmitter (UART): provides serial communication
8251 functions are usually integrated into a standard PC interface chip
Allows many communication parameters:
  Baud (bit) rate: e.g. 9600 bits/sec
  Number of bits per character
  Parity/no parity
  Even/odd parity
  Length of stop bit (1, 1.5, 2 bits)
![Page 5: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/5.jpg)
CPUs-5
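8251 CPU interface
[Figure: the CPU talks to the 8251 through an 8-bit status register and an 8-bit data register; the 8251 drives the serial port (xmit/rcv). Timing diagram over time: no char, start bit, bit 0, bit 1, ..., bit n-1, stop bit]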
![Page 6: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/6.jpg)
CPUs-6
Programming I/O
Two types of instructions can support I/O:
  special-purpose I/O instructions
  memory-mapped load/store instructions
Intel x86 provides in, out instructions
Most other CPUs use memory-mapped I/O
Memory-mapped I/O assigns memory addresses to the I/O registers
![Page 7: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/7.jpg)
CPUs-7
ARM memory-mapped I/O
Define location for device:
    DEV1 EQU 0x1000
Read/write code:
    LDR r1,=DEV1   ; set up device address
    LDR r0,[r1]    ; read DEV1
    MOV r0,#8      ; set up value to write
    STR r0,[r1]    ; write value to device
![Page 8: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/8.jpg)
CPUs-8
SHARC memory-mapped I/O
Device must be in external memory space (above 0x400000).
Use DM to control access:
    I0 = 0x400000;
    M0 = 0;
    R1 = DM(I0,M0);
![Page 9: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/9.jpg)
CPUs-9
Peek and poke
Traditional C interfaces:
    int peek(char *location) { return *location; }
    void poke(char *location, char newval) { (*location) = newval; }
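The traditional interfaces access the device through a plain `char *`, which an optimizing compiler is free to cache or reorder. A minimal sketch of a safer variant, assuming 8-bit device registers; the names `peek8` and `poke8` are hypothetical and not from the slides:

```c
#include <stdint.h>

/* Hypothetical variants of peek/poke for 8-bit device registers.
   'volatile' tells the compiler that every access has a side effect,
   so reads are not cached and writes are not optimized away. */
static inline uint8_t peek8(volatile uint8_t *location) {
    return *location;
}

static inline void poke8(volatile uint8_t *location, uint8_t newval) {
    *location = newval;
}
```

On a real target the argument would be a register address cast to a pointer, for example `(volatile uint8_t *)0x1000`; on a host machine the functions can be exercised against an ordinary variable.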
![Page 10: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/10.jpg)
CPUs-10
Busy/wait output
Simplest way to program device
Use instructions to test when device is ready
    #define OUT_CHAR   ((char *)0x1000) // device data register
    #define OUT_STATUS ((char *)0x1001) // device status register
    current_char = mystring;
    while (*current_char != '\0') {
        poke(OUT_CHAR, *current_char);     // send one character
        while (peek(OUT_STATUS) != 0);     // busy waiting
        current_char++;
    }
![Page 11: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/11.jpg)
CPUs-11
Simultaneous busy/wait input and output
    while (TRUE) {
        /* read */
        while (peek(IN_STATUS) == 0);
        achar = (char)peek(IN_DATA);
        /* write */
        poke(OUT_DATA, achar);
        poke(OUT_STATUS, 1);
        while (peek(OUT_STATUS) != 0);
    }
![Page 12: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/12.jpg)
CPUs-12
Interrupt I/O
Busy/wait is very inefficient
  CPU can't do other work while testing device
  Hard to do simultaneous I/O
Interrupts allow a device to change the flow of control in the CPU
  Causes subroutine call to handle device
[Figure: the CPU (with PC and IR) connects to the device status and data registers through intr request, intr ack, and data/address lines; the registers front the device mechanism]
![Page 13: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/13.jpg)
CPUs-13
Interrupt behavior
Based on subroutine call mechanism.
Interrupt forces next instruction to be a subroutine call to a predetermined location.
  Return address is saved to resume executing foreground program.
![Page 14: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/14.jpg)
CPUs-14
Interrupt physical interface
CPU and device are connected by CPU bus
CPU and device handshake:
  device asserts interrupt request
  CPU checks the interrupt request line at the beginning of each instruction cycle
  CPU asserts interrupt acknowledge when it can handle the interrupt
  CPU fetches the next instruction from the interrupt handler routine
![Page 15: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/15.jpg)
CPUs-15
Example: character I/O handlers
    void input_handler() {
        achar = peek(IN_DATA);
        gotchar = TRUE;
        poke(IN_STATUS, 0);
    }
    void output_handler() {
    }
![Page 16: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/16.jpg)
CPUs-16
Example: interrupt-driven main program
    main() {
        while (TRUE) {
            if (gotchar) {
                poke(OUT_DATA, achar);
                poke(OUT_STATUS, 1);
                gotchar = FALSE;
            }
        }
    }
![Page 17: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/17.jpg)
CPUs-17
Example: interrupt I/O with buffers
Queue for characters:
[Figure: character queue with head and tail pointers, shown empty and then holding the character a]
![Page 18: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/18.jpg)
CPUs-18
Buffer-based input handler
    void input_handler() {
        char achar;
        if (full_buffer()) error = 1;
        else { achar = peek(IN_DATA); add_char(achar); }
        poke(IN_STATUS, 0);
        if (nchars == 1) {
            /* buffer was empty: start the output device */
            poke(OUT_DATA, remove_char());
            poke(OUT_STATUS, 1);
        }
    }
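The handler relies on queue helpers that the slides do not show. A minimal sketch of how they might look as a circular buffer, assuming a fixed capacity; `BUF_SIZE`, `buf`, and `empty_buffer` are hypothetical names, and the capacity of 8 is an arbitrary choice:

```c
#include <stdbool.h>

#define BUF_SIZE 8            /* assumed capacity; not from the slides */

static char buf[BUF_SIZE];    /* storage for queued characters */
static int head = 0;          /* index of the oldest character */
static int tail = 0;          /* index where the next character goes */
static int nchars = 0;        /* number of characters currently queued */

bool full_buffer(void)  { return nchars == BUF_SIZE; }
bool empty_buffer(void) { return nchars == 0; }

/* Append one character; the caller checks full_buffer() first. */
void add_char(char c) {
    buf[tail] = c;
    tail = (tail + 1) % BUF_SIZE;   /* wrap around at the end */
    nchars++;
}

/* Remove and return the oldest character; the caller checks
   empty_buffer() first. */
char remove_char(void) {
    char c = buf[head];
    head = (head + 1) % BUF_SIZE;   /* wrap around at the end */
    nchars--;
    return c;
}
```

In a real system these variables are shared between handlers and the foreground program, so they would likely need to be `volatile` and updated with interrupts masked.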
![Page 19: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/19.jpg)
CPUs-19
Debugging interrupt code
What if the handler forgets to save and restore the registers it changes?
  Foreground program can exhibit mysterious bugs.
  Bugs will be hard to repeat because they depend on interrupt timing.
![Page 20: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/20.jpg)
CPUs-20
Priorities and Vectors
Two mechanisms allow us to make interrupts more specific:
  Priorities determine what interrupt gets CPU first
  Vectors determine what code (handler routine) is called for each type of interrupt
Mechanisms are orthogonal: most CPUs provide both
![Page 21: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/21.jpg)
CPUs-21
Prioritized interrupts
[Figure: devices 1 to n drive prioritized request lines L1 to Ln into the CPU; the CPU returns an interrupt acknowledge to the devices]
![Page 22: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/22.jpg)
CPUs-22
Interrupt prioritization
Masking: interrupt with priority lower than current priority is not recognized until pending interrupt is complete
Non-maskable interrupt (NMI): highest-priority, never masked
  Often used for power-down
![Page 23: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/23.jpg)
CPUs-23
Interrupt Vectors
Allow different devices to be handled by different code
Require additional vector line from device to CPU
[Figure: the device sends int req and a vector to the CPU and receives ack; the vector indexes the interrupt vector table, whose entries point to handler 0, handler 1, handler 2, handler 3]
![Page 24: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/24.jpg)
CPUs-24
Generic interrupt mechanism
Assume priority selection is handled before this point.
[Figure: flowchart]
  intr? N: continue execution
  intr? Y: intr priority > current priority?
    N: ignore
    Y: ack, then wait for the vector
      vector? Y: call table[vector]
      vector? N: timeout? Y: bus error; N: keep waiting
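The generic mechanism can be modeled in software to make the decision order concrete. The sketch below is an illustration under stated assumptions, not how any particular CPU implements it: `dispatch`, `intr_result`, `sample_handler`, and the use of a negative vector to stand for a timeout are all hypothetical.

```c
#include <stddef.h>

#define NUM_VECTORS 4                 /* handler 0..3, as in the vector table figure */

typedef void (*handler_t)(void);
typedef enum { IGNORED, HANDLED, BUS_ERROR } intr_result;

static handler_t table[NUM_VECTORS];  /* interrupt vector table */
static int current_priority = 0;      /* priority of the code now running */

static int handled_count = 0;         /* example handler state */
static void sample_handler(void) { handled_count++; }

/* Model one interrupt request. A negative 'vector' stands for a device
   that never supplies a vector before the timeout (an assumption). */
intr_result dispatch(int intr_priority, int vector) {
    if (intr_priority <= current_priority)
        return IGNORED;               /* masked: not recognized yet */
    /* interrupt acknowledge would be asserted here */
    if (vector < 0 || vector >= NUM_VECTORS || table[vector] == NULL)
        return BUS_ERROR;             /* no usable vector arrived */
    int saved = current_priority;
    current_priority = intr_priority; /* handler runs at the new priority */
    table[vector]();                  /* call table[vector] */
    current_priority = saved;         /* resume the foreground priority */
    return HANDLED;
}
```

Installing `sample_handler` in `table[2]` and calling `dispatch(3, 2)` runs the handler once; a request at priority 0 is ignored because it does not exceed the current priority.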
![Page 25: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/25.jpg)
CPUs-25
Interrupt sequence
  CPU acknowledges request
  Device sends vector
  CPU calls handler
  Software processes request
  CPU restores state to foreground program
![Page 26: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/26.jpg)
CPUs-26
Sources of interrupt overhead
  Handler execution time
  Interrupt mechanism overhead
  Register save/restore
  Pipeline-related penalties
  Cache-related penalties
![Page 27: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/27.jpg)
CPUs-27
ARM interrupts
ARM7 supports two types of interrupts:
  Fast interrupt requests (FIQs)
  Interrupt requests (IRQs)
  FIQ takes priority over IRQ
Interrupt table starts at location 0
![Page 28: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/28.jpg)
CPUs-28
ARM interrupt procedure
CPU actions:
  Save PC; copy CPSR to SPSR.
  Force bits in CPSR to record interrupt.
  Force PC to vector.
Handler responsibilities:
  Restore proper PC.
  Restore CPSR from SPSR.
  Clear interrupt disable flags.
![Page 29: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/29.jpg)
CPUs-29
ARM interrupt latency
Worst-case latency to respond to interrupt is 27 cycles:
  Two cycles to synchronize external request.
  Up to 20 cycles to complete current instruction.
  Three cycles for data abort.
  Two cycles to enter interrupt handling state.
![Page 30: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/30.jpg)
CPUs-30
SHARC interrupt structure
Interrupts are vectored and prioritized.
Priorities are fixed: reset highest, user SW interrupt 3 lowest.
Vectors are also fixed. Vector is offset in vector table. Table starts at 0x20000 in internal memory, 0x40000 in external memory.
![Page 31: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/31.jpg)
CPUs-31
SHARC interrupt sequence
Start: must be executing or IDLE/IDLE16.
  1. Output appropriate interrupt vector address.
  2. Push PC value onto PC stack.
  3. Set bit in interrupt latch register.
  4. Set IMASKP to current nesting state.
![Page 32: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/32.jpg)
CPUs-32
SHARC interrupt return
Initiated by RTI instruction.
  1. Return to address at top of PC stack.
  2. Pop PC stack.
  3. Pop status stack if appropriate.
  4. Clear bits in interrupt latch register and IMASKP.
![Page 33: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/33.jpg)
CPUs-33
SHARC interrupt performance
Three stages of response:
  1 cycle: synchronization and latching
  1 cycle: recognition
  2 cycles: branching to vector
Total latency: 3 cycles.
Multiprocessor vector interrupts have 6-cycle latency.
![Page 34: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/34.jpg)
CPUs-34
Outline
  Input and Output Mechanisms
  Supervisor Mode, Exceptions, Traps
  Memory Management
  Performance and Power Consumption
![Page 35: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/35.jpg)
CPUs-35
Supervisor mode
May want to provide protective barriers between programs.
  e.g., avoid memory corruption
Need supervisor mode to manage the various programs
SHARC does not have a supervisor mode
![Page 36: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/36.jpg)
CPUs-36
ARM supervisor mode
Use SWI instruction to enter supervisor mode, similar to subroutine:
    SWI CODE_1
  Sets PC to 0x08
  Argument to SWI is passed to supervisor mode code to request various services
  Saves CPSR in SPSR
![Page 37: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/37.jpg)
CPUs-37
Exception
Exception: internally detected error
Exceptions are synchronous with instructions but unpredictable
Build exception mechanism on top of interrupt mechanism
Exceptions are usually prioritized and vectorized
  A single instruction may generate more than one exception
![Page 38: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/38.jpg)
CPUs-38
Trap
Trap (software interrupt): an exception generated by an instruction
  Call supervisor mode
ARM uses SWI instruction for traps
SHARC offers three levels of software interrupts.
  Called by setting bits in IRPTL register
![Page 39: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/39.jpg)
CPUs-39
Co-processor
Co-processor: added function unit that is called by instruction
  e.g. floating-point operations
A co-processor instruction can cause a trap and be handled by software (if no such co-processor exists)
ARM allows up to 16 co-processors
![Page 40: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/40.jpg)
CPUs-40
Outline
  Input and Output Mechanisms
  Supervisor Mode, Exceptions, Traps
  Memory Management
  Performance and Power Consumption
![Page 41: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/41.jpg)
CPUs-41
Caches and CPUs
[Figure: the CPU sends addresses to the cache controller, which connects to the cache and to main memory over address and data lines; data returns to the CPU]
Memory access speed is falling further and further behind CPU speed
Cache: reduces the speed gap
![Page 42: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/42.jpg)
CPUs-42
Cache operation
May have caches for:
  instructions
  data
  data + instructions (unified)
Memory access time is no longer deterministic
  Cache hit: required location is in cache
  Cache miss: required location is not in cache
  Working set: set of locations used by program in a time interval.
![Page 43: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/43.jpg)
CPUs-43
Types of cache misses
Compulsory (cold): location has never been accessed.
Capacity: working set is too large
Conflict: multiple locations in working set map to same cache entry.
![Page 44: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/44.jpg)
CPUs-44
Memory system performance
Cache performance benefits:
  Keep frequently-accessed locations in fast cache.
  Cache retrieves more than one word at a time.
    Sequential accesses are faster after first access.
$h$ = cache hit rate.
$t_{cache}$ = cache access time, $t_{main}$ = main memory access time.
Average memory access time:
$t_{av} = h\,t_{cache} + (1-h)\,t_{main}$
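The average access time is a weighted sum of the hit and miss cases. A small sketch of the calculation; the function name `avg_access_time` and the sample numbers are illustrative assumptions, not measurements:

```c
/* Average memory access time for a single-level cache.
   h: hit rate in [0,1]; t_cache and t_main: access times in the
   same unit (for example nanoseconds). */
double avg_access_time(double h, double t_cache, double t_main) {
    return h * t_cache + (1.0 - h) * t_main;
}
```

For instance, with an assumed 90% hit rate, a 5 ns cache, and 80 ns main memory, `avg_access_time(0.9, 5.0, 80.0)` gives 0.9 × 5 + 0.1 × 80 = 12.5 ns.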
![Page 45: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/45.jpg)
CPUs-45
Multi-level caches
$h_1$ = L1 cache hit rate.
$h_2$ = rate for miss on L1, hit on L2.
Average memory access time:
$t_{av} = h_1\,t_{L1} + h_2\,t_{L2} + (1 - h_1 - h_2)\,t_{main}$
[Figure: CPU, L1 cache, L2 cache connected in series]
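A two-level version can be sketched the same way. This sketch reads h1 and h2 as disjoint fractions of all accesses (hit in L1; miss in L1 but hit in L2), so that the three weights sum to one; that reading, the function name, and the sample numbers are assumptions:

```c
/* Average memory access time for a two-level cache.
   h1: fraction of accesses that hit in L1.
   h2: fraction of accesses that miss in L1 but hit in L2.
   The remaining 1 - h1 - h2 of accesses go to main memory. */
double avg_access_time_2level(double h1, double h2,
                              double t_l1, double t_l2, double t_main) {
    return h1 * t_l1 + h2 * t_l2 + (1.0 - h1 - h2) * t_main;
}
```

With assumed values h1 = 0.90, h2 = 0.08, and access times of 1, 10, and 100 cycles, the result is 0.9 + 0.8 + 2.0 = 3.7 cycles.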
![Page 46: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/46.jpg)
CPUs-46
Replacement policies
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
  Random.
  Least-recently used (LRU).
![Page 47: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/47.jpg)
CPUs-47
Cache organizations
Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented)
Direct-mapped: each memory location maps onto exactly one cache entry
N-way set-associative: each memory location can go into one of n sets
![Page 48: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/48.jpg)
CPUs-48
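Direct-mapped cache
[Figure: the address is split into tag, index, and offset fields. The index selects a cache block holding a valid bit (1), a tag (e.g. 0xabcd), and data bytes. The stored tag is compared (=) with the address tag to produce the hit signal, and the offset selects the byte returned as the value]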
![Page 49: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/49.jpg)
CPUs-49
Direct-mapped cache locations
Many locations map onto the same cache block.
Conflict misses are easy to generate:
  Array a[] uses locations 0, 1, 2, …
  Array b[] uses locations 1024, 1025, 1026, …
  Operation a[i] + b[i] generates conflict misses if locations 0 and 1024 are mapped to the same block in the cache.
Write operations:
  Write-through: immediately copy write to main memory
  Write-back: write to main memory only when location is removed from cache
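The conflict between locations 0 and 1024 follows from how a direct-mapped cache splits an address into tag, index, and offset. A minimal sketch, assuming 4-byte blocks and 256 blocks (a 1024-byte cache); the geometry and the function names are illustrative assumptions:

```c
#include <stdint.h>

#define BLOCK_BITS 2    /* assumed: 4-byte cache blocks */
#define INDEX_BITS 8    /* assumed: 256 blocks, so a 1024-byte cache */

/* Byte offset within the cache block. */
uint32_t cache_offset(uint32_t addr) {
    return addr & ((1u << BLOCK_BITS) - 1u);
}

/* Which cache block the address maps to. */
uint32_t cache_index(uint32_t addr) {
    return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1u);
}

/* Remaining high bits, stored with the block and compared on lookup. */
uint32_t cache_tag(uint32_t addr) {
    return addr >> (BLOCK_BITS + INDEX_BITS);
}
```

Under these assumptions addresses 0 and 1024 share index 0 but have different tags, so alternating accesses to a[i] and b[i] keep evicting each other's block.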
![Page 50: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/50.jpg)
CPUs-50
Set-associative cache
A set of direct-mapped caches:
[Figure: Set 1, Set 2, ..., Set n are looked up in parallel and combined to produce the hit signal and the data]
![Page 51: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/51.jpg)
CPUs-51
Memory Management Units
Memory management unit (MMU) translates addresses
MMUs are not common in embedded systems, which rarely have secondary storage
[Figure: the CPU sends a logical address to the MMU; the MMU sends a physical address to main memory; data returns to the CPU; main memory swaps pages with secondary storage]
![Page 52: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/52.jpg)
CPUs-52
Memory management tasks
Allows programs to move in physical memory during execution.
Allows virtual memory:
  memory images kept in secondary storage;
  images returned to main memory on demand during execution.
Page fault: request for location not resident in memory, which generates an exception
![Page 53: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/53.jpg)
CPUs-53
Address translation
Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses
Two basic schemes:
  segmented
  paged
Segmentation and paging can be combined (x86)
![Page 54: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/54.jpg)
CPUs-54
Segment address translation
[Figure: the logical address from the CPU is added (+) to the segment base address to form the physical address sent to memory; a range check against the segment lower bound and segment upper bound raises a range error]
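Segment translation amounts to an add followed by a bounds check. A minimal sketch, assuming the bounds are expressed as physical addresses; the type `segment_t` and the function `segment_translate` are hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t base;    /* segment base address */
    uint32_t lower;   /* segment lower bound (assumed physical) */
    uint32_t upper;   /* segment upper bound (assumed physical) */
} segment_t;

/* Returns true and stores the physical address on success;
   returns false to signal a range error. */
bool segment_translate(const segment_t *seg, uint32_t logical,
                       uint32_t *physical) {
    uint32_t addr = seg->base + logical;        /* base + logical address */
    if (addr < seg->lower || addr > seg->upper)
        return false;                           /* range error */
    *physical = addr;
    return true;
}
```

For a segment with base 0x4000 and bounds 0x4000 to 0x4FFF, logical address 0x10 translates to 0x4010, while 0x1000 falls outside the segment and is rejected.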
![Page 55: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/55.jpg)
CPUs-55
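Page address translation
[Figure: the logical address from the CPU is split into a page number and an offset; the page number indexes the page table to fetch the page i base, which is concatenated with the unchanged offset to form the physical address sent to memory]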
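Page translation replaces the add with a table lookup and a concatenation. A minimal sketch, assuming 4-KB pages and a flat table of 16 entries; the sizes and names are illustrative, and out-of-range page numbers simply wrap to keep the example short:

```c
#include <stdint.h>

#define PAGE_BITS 12                      /* assumed: 4-KB pages */
#define NUM_PAGES 16                      /* assumed: small logical space */

static uint32_t page_table[NUM_PAGES];    /* entry i holds the base of page i */

/* Split the logical address into page number and offset, look up the
   page base, and concatenate base and offset. */
uint32_t page_translate(uint32_t logical) {
    uint32_t page   = (logical >> PAGE_BITS) % NUM_PAGES;
    uint32_t offset = logical & ((1u << PAGE_BITS) - 1u);
    /* The base is page-aligned, so OR acts as concatenation. */
    return page_table[page] | offset;
}
```

If page 2 is mapped to base 0x0009A000, logical address 0x2ABC translates to 0x0009AABC: the offset 0xABC passes through unchanged.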
![Page 56: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/56.jpg)
CPUs-56
Page table organizations
[Figure: two organizations of page descriptors: a flat table and a tree]
![Page 57: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/57.jpg)
CPUs-57
Caching address translations
Large translation tables require main memory access.
TLB: cache for address translation.
  Typically small.
![Page 58: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/58.jpg)
CPUs-58
ARM & SHARC memory management
Memory region types:
  section: 1 Mbyte block
  large page: 64 kbytes
  small page: 4 kbytes
An address is marked as section-mapped or page-mapped
Two-level translation scheme
SHARC does not have an MMU
![Page 59: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/59.jpg)
CPUs-59
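ARM address translation
[Figure: the logical address is split into 1st index, 2nd index, and offset. The translation table base register is concatenated with the 1st index to select the 1st level table descriptor; that descriptor is concatenated with the 2nd index to select the 2nd level table descriptor; the result is concatenated with the offset to form the physical address]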
![Page 60: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/60.jpg)
CPUs-60
Outline
  Input and Output Mechanisms
  Supervisor Mode, Exceptions, Traps
  Memory Management
  Performance and Power Consumption
![Page 61: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/61.jpg)
CPUs-61
Performance Acceleration
There are 3 factors that can substantially improve system performance:
  Pipelining
  Superscalar execution
  Caching
Need to take advantage of them where possible
But they also make performance harder to analyze
![Page 62: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/62.jpg)
CPUs-62
Pipelining
Several instructions are executed simultaneously at different stages of completion
Various conditions can cause pipeline stalls that reduce utilization:
  branches
  memory system delays
  data hazards, etc.
Both ARM and SHARC have 3-stage pipes:
  fetch instruction from memory;
  decode opcode and operands;
  execute.
![Page 63: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/63.jpg)
CPUs-63
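ARM pipeline execution
[Figure: pipeline diagram over time for add r0,r1,#5, sub r2,r3,r6, and cmp r2,#3. Each instruction passes through fetch, decode, and execute, and each starts one cycle after the previous one, so three instructions are in flight at once]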
![Page 64: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/64.jpg)
CPUs-64
Pipeline performance
Latency: time it takes for an instruction to get through the pipeline.
Throughput: number of instructions executed per time period.
Pipelining increases throughput without reducing latency.
Pipeline stall:
  If a step cannot be completed in the same amount of time, pipeline stalls.
  Bubbles introduced by stall increase latency, reduce throughput.
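The effect on throughput can be estimated with a simple cycle count. A sketch under idealized assumptions (one cycle per stage, a fixed number of bubble cycles supplied by the caller); the function name `pipeline_cycles` is hypothetical:

```c
/* Cycles needed to run n instructions on a k-stage pipeline with one
   cycle per stage and 'stalls' bubble cycles in total. The first
   instruction needs k cycles; each later instruction completes one
   cycle after its predecessor. */
unsigned pipeline_cycles(unsigned n, unsigned k, unsigned stalls) {
    if (n == 0) return 0;
    return k + (n - 1) + stalls;
}
```

On a 3-stage pipeline, 100 instructions with no stalls take 102 cycles rather than the 300 an unpipelined machine would need, even though each instruction still takes 3 cycles from fetch to completion.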
![Page 65: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/65.jpg)
CPUs-65
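Data stall
Multi-cycle execution and data stall
LDMIA: load multiple
[Figure: pipeline diagram over time. ldmia r0,{r2,r3} is fetched and decoded, then occupies execute for two cycles (ex ld r2, ex ld r3). sub r2,r3,r6 is fetched and decoded behind it, and its execute (ex sub) waits until the load finishes. cmp r2,#3 follows with fetch, decode, ex cmp]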
![Page 66: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/66.jpg)
CPUs-66
Control stalls
Branches often introduce stalls (branch penalty).
  Stall time may depend on whether branch is taken.
May have to squash instructions that already started executing.
Don’t know what to fetch until condition is evaluated.
![Page 67: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/67.jpg)
CPUs-67
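ARM pipelined branch
[Figure: pipeline diagram over time. bne foo is fetched and decoded, then spends three cycles in execute (ex bne). sub r2,r3,r6 is fetched and decoded behind it but not executed. The branch target, foo add r0,r1,r2, is then fetched, decoded, and executed (ex add)]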
![Page 68: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/68.jpg)
CPUs-68
Delayed branch
To increase pipeline efficiency, the delayed branch mechanism always executes the n instructions after the branch, whether the branch is taken or not
SHARC supports delayed and non-delayed branches
  Specified by bit in branch instruction
  2 instruction branch delay slot
![Page 69: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/69.jpg)
CPUs-69
Example: SHARC code scheduling
    L1=5;
    DM(I0,M1)=R1;
    L8=8;
    DM(I8,M9)=R2;
CPU cannot use DAG on cycle just after loading DAG’s register, because both need the same internal bus.
CPU performs NOP between register assign and DM.
![Page 70: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/70.jpg)
CPUs-70
Rescheduled SHARC code
    L1=5;
    L8=8;
    DM(I0,M1)=R1;
    DM(I8,M9)=R2;
Avoids two NOP cycles.
![Page 71: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/71.jpg)
CPUs-71
Example: ARM execution time
Determine execution time of FIR filter:
    for (i=0; i<N; i++)
        f = f + c[i]*x[i];
Only branch in loop test may take more than one cycle.
  BLT loop takes 1 cycle best case, 3 worst case.
![Page 72: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/72.jpg)
CPUs-72
Superscalar execution
Superscalar processor can execute several instructions per cycle
  Uses multiple pipelined data paths
Programs execute faster, but it is harder to determine how much faster
![Page 73: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/73.jpg)
CPUs-73
Superscalar Processor
[Figure: a control block issues instruction 1 and instruction 2 to two instruction units, which share the registers]
![Page 74: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/74.jpg)
CPUs-74
Data dependencies
Execution time depends on operands, not just opcode.
Superscalar CPU checks data dependencies dynamically:
    add r2,r0,r1
    add r3,r2,r5
[Figure: data dependency graph: r0 and r1 feed r2; r2 and r5 feed r3]
![Page 75: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/75.jpg)
CPUs-75
Memory system performance
Caches introduce indeterminacy in execution time.
  Depends on order of execution.
Cache miss penalty: added time due to a cache miss.
Several reasons for a miss: compulsory, conflict, capacity.
![Page 76: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/76.jpg)
CPUs-76
CPU power consumption
Most modern CPUs are designed with power consumption in mind to some degree
Power vs. energy:
  Power is energy consumption per unit time
  heat depends on power consumption
  battery life depends on energy consumption
![Page 77: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/77.jpg)
CPUs-77
CMOS power consumption
Voltage drops: power consumption proportional to $V^2$
Toggling: more activity means more power
Leakage: basic circuit characteristics; can be eliminated by disconnecting power
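The first two effects are often combined into the standard approximation that dynamic CMOS power scales with $V^2 f$. A small sketch of the resulting ratio; the function name and the sample voltages are illustrative assumptions, and leakage is ignored:

```c
/* Ratio of dynamic power after and before a change in supply voltage
   and clock frequency, assuming dynamic power is proportional to
   V^2 * f. Leakage power is not modeled. */
double dynamic_power_ratio(double v_old, double v_new,
                           double f_old, double f_new) {
    return (v_new * v_new * f_new) / (v_old * v_old * f_old);
}
```

Halving the supply voltage at the same clock rate cuts dynamic power to about one quarter; halving the clock as well brings it to about one eighth.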
![Page 78: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/78.jpg)
CPUs-78
CPU power-saving strategies
Reduce power supply voltage
Run at lower clock frequency
Disable function units with control signals when not in use
Disconnect parts from power supply when not in use to eliminate leakage currents
![Page 79: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/79.jpg)
CPUs-79
Power management styles
Static power management: does not depend on CPU activity
  Example: user-activated power-down mode
Dynamic power management: based on CPU activity
  Example: turning off function units
![Page 80: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/80.jpg)
CPUs-80
Application: PowerPC 603
Provides doze, nap, sleep modes for static power management
Dynamic power management features:
  Can shut down unused execution units
  Cache organized into subarrays to minimize amount of active circuitry
![Page 81: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/81.jpg)
CPUs-81
PowerPC 603 activity
Percentage of time units are idle for SPEC integer/floating-point:

| unit | Specint92 | Specfp92 |
|---|---|---|
| D cache | 29% | 28% |
| I cache | 29% | 17% |
| load/store | 35% | 17% |
| fixed-point | 38% | 76% |
| floating-point | 99% | 30% |
| system register | 89% | 97% |

Idle units are turned off by switching off clocks
Pipeline stages can be turned on or off
![Page 82: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/82.jpg)
CPUs-82
Power-down costs
Going into a power-down mode costs:
  time
  energy
Must determine if going into mode is worthwhile
Can model CPU power states with power state machine
![Page 83: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/83.jpg)
CPUs-83
Application: StrongARM SA-1100
Processor takes two supplies:
  VDD is main 3.3V supply (on & off)
  VDDX is 1.5V (always remains on)
Three power modes:
  Run: normal operation.
  Idle: stops CPU clock, with logic still powered
  Sleep: shuts off most of chip activity; 3 steps, each about 30 µs; wakeup takes > 10 ms
![Page 84: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/84.jpg)
CPUs-84
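SA-1100 power state machine
[Figure: three states with their power levels and transition times]
  run: Prun = 400 mW
  idle: Pidle = 50 mW
  sleep: Psleep = 0.16 mW
  run to idle: 10 µs; idle to run: 10 µs
  run to sleep: 90 µs; idle to sleep: 90 µs
  sleep to run: 160 ms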
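A power state machine makes it possible to check whether entering sleep is worthwhile for a given waiting period. The sketch below takes the power levels from the state machine and the transition times as 90 µs into sleep and 160 ms to wake. It further assumes, as a simplification not stated in the slides, that the chip draws run power during both transitions; all function names are hypothetical.

```c
#define P_RUN_MW      400.0   /* run power */
#define P_IDLE_MW      50.0   /* idle power */
#define P_SLEEP_MW      0.16  /* sleep power */
#define T_TO_SLEEP_MS   0.09  /* 90 us to enter sleep */
#define T_WAKE_MS     160.0   /* sleep to run */

/* Energy in microjoules (mW * ms) spent waiting t_ms in idle mode. */
double energy_idle_uj(double t_ms) {
    return P_IDLE_MW * t_ms;
}

/* Energy in microjoules for sleeping through the same interval.
   Assumption: run power is drawn during both transitions. */
double energy_sleep_uj(double t_ms) {
    double t_trans = T_TO_SLEEP_MS + T_WAKE_MS;
    if (t_ms < t_trans) t_ms = t_trans;   /* cannot wake any sooner */
    return P_RUN_MW * t_trans + P_SLEEP_MW * (t_ms - t_trans);
}

/* Sleep pays off only if it costs less energy than idling. */
int sleep_is_worthwhile(double t_ms) {
    return energy_sleep_uj(t_ms) < energy_idle_uj(t_ms);
}
```

Under these assumptions the long wakeup dominates: sleeping through a one-second wait costs more energy than idling, while sleeping through a two-second wait costs less.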
![Page 85: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/85.jpg)
CPUs-85
Outline
  Input and Output Mechanisms
  Supervisor Mode, Exceptions, Traps
  Memory Management
  Performance and Power Consumption
  Example Design: Data Compressor
![Page 86: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/86.jpg)
CPUs-86
Goals
Compress data transmitted over serial line.
  Receives byte-size input symbols.
  Produces output symbols packed into bytes.
Will build software module only here.
![Page 87: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/87.jpg)
CPUs-87
Collaboration diagram for compressor
[Figure: :input sends 1..n: input symbols to :data compressor, which sends 1..m: packed output symbols to :output]
![Page 88: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/88.jpg)
CPUs-88
Huffman coding
Early statistical text compression algorithm.
Select non-uniform size codes.
  Use shorter codes for more common symbols.
  Use longer codes for less common symbols.
To allow decoding, codes must have unique prefixes.
  No code can be a prefix of a longer valid code.
![Page 89: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/89.jpg)
CPUs-89
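Huffman example

| character | P |
|---|---|
| a | .45 |
| b | .24 |
| c | .11 |
| d | .08 |
| e | .07 |
| f | .05 |

[Figure: Huffman tree. e and f combine (P=.12); c and d combine (P=.19); those two nodes combine (P=.31); b joins (P=.55); a joins at the root (P=1)]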
![Page 90: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/90.jpg)
CPUs-90
Example Huffman code
Read code from root to leaves:

| character | code |
|---|---|
| a | 1 |
| b | 01 |
| c | 0000 |
| d | 0001 |
| e | 0010 |
| f | 0011 |
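Encoding with this table is a per-symbol lookup followed by concatenation. A minimal sketch using the example code table; for readability it writes the bits as a string of '0' and '1' characters rather than packing them into bytes, and the names `huff_code` and `huff_encode` are hypothetical:

```c
#include <string.h>

/* Code table from the example, indexed by symbol - 'a'. */
static const char *huff_code[6] = {
    "1", "01", "0000", "0001", "0010", "0011"
};

/* Encode a string of symbols 'a'..'f' as a string of '0'/'1' characters.
   Returns the number of bits written, or -1 on an unknown symbol or if
   'out' (capacity out_size, including the terminator) is too small. */
int huff_encode(const char *in, char *out, size_t out_size) {
    size_t used = 0;
    if (out_size == 0) return -1;
    for (; *in != '\0'; in++) {
        if (*in < 'a' || *in > 'f') return -1;      /* not in the table */
        const char *code = huff_code[*in - 'a'];
        size_t len = strlen(code);
        if (used + len + 1 > out_size) return -1;   /* no room left */
        memcpy(out + used, code, len);              /* append the code */
        used += len;
    }
    out[used] = '\0';
    return (int)used;
}
```

Encoding the input "abc" concatenates 1, 01, and 0000 into the 7-bit sequence 1010000. Because no code is a prefix of another, a decoder can split this sequence back into symbols without separators.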
![Page 91: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/91.jpg)
CPUs-91
Huffman coder requirements table

| name | data compression module |
|---|---|
| purpose | code module for Huffman compression |
| inputs | encoding table, uncoded byte-size inputs |
| outputs | packed compression output symbols |
| functions | Huffman coding |
| performance | fast |
| manufacturing cost | N/A |
| power | N/A |
| physical size/weight | N/A |
![Page 92: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/92.jpg)
CPUs-92
Building a specification
Collaboration diagram shows only steady-state input/output.
A real system must:
  Accept an encoding table.
  Allow a system reset that flushes the compression buffer.
![Page 93: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/93.jpg)
CPUs-93
data-compressor class
data-compressor
  buffer: data-buffer
  table: symbol-table
  current-bit: integer
  encode(): boolean, data-buffer
  flush()
  new-symbol-table()
![Page 94: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/94.jpg)
CPUs-94
Data-compressor behaviors
encode: takes one-byte input, generates packed encoded symbols and a Boolean indicating whether the buffer is full.
new-symbol-table: installs new symbol table in object, throws away old table.
flush: returns current state of buffer, including number of valid bits in buffer.
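The three behaviors can be sketched as a small class. This is a simplified illustration under several assumptions, not the design from the slides: bits are held in a `std::vector<bool>`, codes are given as '0'/'1' strings, the buffer size `kBufferBits` is an arbitrary 8 bits, and every code is assumed to be shorter than the buffer. The class name `CompressorSketch` is hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

constexpr std::size_t kBufferBits = 8;  // assumed buffer size for the sketch

class CompressorSketch {
    std::map<char, std::string> table_;  // symbol -> code as '0'/'1' string
    std::vector<bool> pending_;          // encoded bits not yet handed out
public:
    // Installs a new symbol table; the old one is discarded.
    void new_symbol_table(const std::map<char, std::string>& table) {
        table_ = table;
    }

    // Appends the code for one input symbol. Once kBufferBits bits are
    // pending, moves them into `full` and returns true; otherwise false.
    bool encode(char symbol, std::vector<bool>& full) {
        for (char bit : table_.at(symbol)) pending_.push_back(bit == '1');
        if (pending_.size() < kBufferBits) return false;
        const auto split = pending_.begin() + static_cast<std::ptrdiff_t>(kBufferBits);
        full.assign(pending_.begin(), split);
        pending_.erase(pending_.begin(), split);
        return true;
    }

    // Hands out whatever is pending and returns the number of valid bits.
    std::size_t flush(std::vector<bool>& out) {
        out = pending_;
        pending_.clear();
        return out.size();
    }
};
```

With codes "1" for a and "0000" for c, encoding c twice fills the 8-bit buffer on the second call, so that call returns true. Encoding a afterwards returns false, and flush() then reports one valid bit.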
![Page 95: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/95.jpg)
CPUs-95
Auxiliary classes
data-buffer
  databuf[databuflen] : character
  len : integer
  insert()
  length() : integer
symbol-table
  symbols[nsymbols] : data-buffer
  len : integer
  value() : symbol
  load()
![Page 96: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/96.jpg)
CPUs-96
Auxiliary class roles
data-buffer holds both packed and unpacked symbols.
  A Huffman code for 8-bit inputs (256 symbols) can be up to 255 bits long, so a 256-bit buffer holds any code.
symbol-table indexes encoded version of each symbol.
  load() puts data in a new symbol table.
![Page 97: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/97.jpg)
CPUs-97
Class relationships
data-compressor 1 --- 1 data-buffer
data-compressor 1 --- 1 symbol-table
![Page 98: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/98.jpg)
CPUs-98
Encode behavior
input symbol -> encode -> buffer filled?
  T: create new buffer, add to buffers, return true
  F: add to buffer, return false
![Page 99: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/99.jpg)
CPUs-99
Insert behavior
input symbol -> fills buffer?
  T: pack bottom bits into this buffer, top bits into overflow buffer
  F: pack into this buffer
then update length
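The insert behavior can be sketched as follows. `BitBuffer` and `insert_bits` are hypothetical names used only for this illustration; the bit order within a byte (least significant bit first) and the requirement that the overflow buffer has enough room are assumptions of the sketch, not details given in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bit buffer for this sketch; it is not the data_buffer class
// from the slides.
struct BitBuffer {
    std::vector<unsigned char> bytes;  // packed storage, LSB first in each byte
    std::size_t capacity;              // capacity in bits
    std::size_t len = 0;               // number of valid bits

    explicit BitBuffer(std::size_t capacity_bits)
        : bytes((capacity_bits + 7) / 8, 0), capacity(capacity_bits) {}

    bool full() const { return len == capacity; }

    bool get(std::size_t i) const {
        assert(i < len);
        return ((bytes[i / 8] >> (i % 8)) & 1u) != 0;
    }

    void push(bool bit) {
        assert(!full());
        if (bit) {
            bytes[len / 8] =
                static_cast<unsigned char>(bytes[len / 8] | (1u << (len % 8)));
        }
        ++len;  // update length after packing the bit
    }
};

// Packs the bits of `symbol` into `current`; bits that no longer fit go to
// `overflow`. Returns true when `current` ends up full.
// Assumes `overflow` has room for the spilled bits.
bool insert_bits(const BitBuffer& symbol, BitBuffer& current, BitBuffer& overflow) {
    for (std::size_t i = 0; i < symbol.len; ++i) {
        BitBuffer& target = current.full() ? overflow : current;
        target.push(symbol.get(i));
    }
    return current.full();
}
```

If an 8-bit buffer already holds 6 bits and a 4-bit symbol is inserted, the first two bits of the symbol complete the current buffer and the remaining two land in the overflow buffer, so the call returns true.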
![Page 100: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/100.jpg)
CPUs-100
Program design
In an object-oriented language, we can reflect the UML specification in the code more directly.
In a non-object-oriented language, we must either:
  add code to provide object-oriented features;
  diverge from the specification structure.
![Page 101: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/101.jpg)
CPUs-101
C++ classes
class data_buffer {
  char databuf[databuflen];   // packed storage
  int len;                    // number of valid bits
  int length_in_chars() { return len/bitsperbyte; }
public:
  void insert(data_buffer,data_buffer&);
  int length() { return len; }   // length in bits
  int length_in_bytes() { return (int)ceil(len/8.0); }   // rounded up to whole bytes
  int initialize();
  ...
![Page 102: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/102.jpg)
CPUs-102
C++ classes, cont’d.
class data_compressor {
  data_buffer buffer;
  int current_bit;
  symbol_table table;
public:
  boolean encode(char,data_buffer&);
  void new_symbol_table(symbol_table);
  int flush(data_buffer&);
  data_compressor();
  ~data_compressor();
};
![Page 103: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/103.jpg)
CPUs-103
C code
struct data_compressor_struct {
  data_buffer buffer;
  int current_bit;
  sym_table table;
};
typedef struct data_compressor_struct data_compressor, *data_compressor_ptr;
boolean data_compressor_encode(data_compressor_ptr mycmptrs, char isymbol, data_buffer *fullbuf) ...
![Page 104: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/104.jpg)
CPUs-104
Testing
Test by encoding, then decoding:
input symbols + symbol table -> encoder -> decoder -> compare (against input symbols) -> result
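A round-trip test of this kind can be written as a small harness. The sketch below assumes the encoder and decoder are available as functions from string to string; the name `round_trip_failures` and that interface are hypothetical, chosen only to keep the example self-contained.

```cpp
#include <functional>
#include <string>
#include <vector>

// Runs each input through encode and then decode, and returns the inputs
// for which the decoded result differs from the original.
std::vector<std::string> round_trip_failures(
    const std::function<std::string(const std::string&)>& encode,
    const std::function<std::string(const std::string&)>& decode,
    const std::vector<std::string>& inputs) {
    std::vector<std::string> failures;
    for (const auto& input : inputs) {
        if (decode(encode(input)) != input) failures.push_back(input);
    }
    return failures;
}
```

An empty result means every input survived the round trip. Returning the failing inputs rather than a single pass/fail flag makes it easier to see which symbol sequences break the coder.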
![Page 105: Chapter 3 CPUs](https://reader036.vdocuments.pub/reader036/viewer/2022062309/568144cc550346895db19749/html5/thumbnails/105.jpg)
CPUs-105
Code inspection tests
Look at the code for potential problems:
  Can we run past end of symbol table?
  What happens when the next symbol does not fill the buffer? When it does fill it?
  Do very long encoded symbols work properly? Very short symbols?
  Does flush() work properly?