csci-2500: computer organization

183
CSCI-2500: Computer Organization Processor Design

Upload: mira-sparks

Post on 01-Jan-2016

17 views

Category:

Documents


1 download

DESCRIPTION

CSCI-2500: Computer Organization. Processor Design. Datapath. The datapath is the interconnection of the components that make up the processor. The datapath must provide connections for moving bits between memory, registers and the ALU. Control. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CSCI-2500: Computer Organization

CSCI-2500:Computer Organization

Processor Design

Page 2: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 2

Datapath The datapath is the interconnection

of the components that make up the processor.

The datapath must provide connections for moving bits between memory, registers and the ALU.

Page 3: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 3

Control The control is a collection of signals

that enable/disable the inputs/outputs of the various components.

You can think of the control as the brain, and the datapath as the body. the datapath does only what the brain

tells it to do.

Page 4: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 4

Processor Design

The sequencing and execution of instructions

We already know about many of the individual components that are necessary: ALU, Multiplexors, Decoders, Flip-Flops

We need to discuss how to use a clock

We need to think about registers and memory.

Page 5: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 5

The Clock

The clock generates a never-ending sequence of alternating 1s and 0s.

All operations are synchronized to the clock.

Page 6: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 6

Clocking Methodology Determines when (relative to the

clock) a signal can be read and written. Read: signal value is used by some

component. Written: a signal value is generated by

some component.

Page 7: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 7

Simple Example: Enabled AND

We want an AND gate that holds it’s output value constant until the clock switches from 0 (lo) to 1 (hi).

We can use a flip-flop to hold the inputs to the AND gate constant during the time we want the output constant.

We use a clocked flip-flop to make things happen when the clock changes.

Page 8: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 8

D Flip-Flop Reminder

DQ

Q

Clock

The output (Q) changes to reflect D only when the Clock is a 1.

Page 9: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 9

D Flip-Flop Timing

Q

D

C1

0

1

0

1

0

Page 10: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 10

Clocked AND gate

Dflip-flop

Dflip-flop

D

CQ

D

CQ

A

B

A•B (clocked)

Clock

Page 11: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 11

Edge-triggered Clocking Values stored are updated (can

change) only on a clock edge. When the clock switches from 0 to 1

everybody allows signals in. everybody means state elements combinational elements always do the same

thing, they don’t care about the clock (that’s why we added the flip-flops to our AND gate).

Page 12: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 12

State Elements Any component that stores one or

more values is a state element. The entire processor can be viewed as a

circuit that moves from one state (collection of all the state elements) to another state.

At time i a component uses values generated at time i-1.

Page 13: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 13

Register File

Read registernumber 1 Read

data 1

Readdata 2

Read registernumber 2

Register fileWriteregister

Writedata Write

Contains multiple registers•each holds 32 bits

•Two output ports (read ports)

•One input port (write port)

•To change the value of a register:•supply register number•supply data•clock (the Write control signal)

Page 14: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 14

Implementation of Read Ports

Figure B.19

Mux

Register 0

Register 1

Register n – 1

Register n

Mux

Read data 1

Read data 2

Read registernumber 1

Read registernumber 2

Page 15: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 15

Implementation of Write

n - t o - 1

d e c o d e r

R e g i s t e r 0

R e g i s t e r 1

R e g i s t e r n – 1

C

C

D

D

R e g i s t e r n

C

C

D

D

R e g i s t e r n u m b e r

W r i t e

R e g i s t e r d a t a

0

1

n – 1

n

Page 16: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 16

Memory Memory is similar to a very large register

file: single read port (output) chip select input signal output enable input signal write enable input signal address lines (determine which memory

element) data input lines (used to write a memory

element)

Page 17: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 17

4 x 2 Memory (SRAM)D

latch Q

D

C

Enable

D

latch Q

D

C

Enable

D

latch Q

D

C

Enable

D

latch Q

D

C

Enable

D

latch Q

D

C

Enable

D

latch Q

D

C

Enable

D

latch Q

D

C

Enable

D

latch Q

D

C

Enable

2-to-4

decoder

Write enable

Address

Din[0]Din[1]

Dout[1] Dout[0]

0

1

2

3

Page 18: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 18

Memory Usage For now, we treat memory as a single

component that supports 2 operations: write (we change the value stored in a

memory location) read (we get the value currently stored

in a memory location). We can only do one operation at a

time!

Page 19: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 19

Instruction & Data Memory It is useful to treat the memory that

holds instructions as a separate component. instruction memory is read-only

Typically there is really one memory that holds both instructions and data. as we will see when we talk more about

memory, the processor often has two interfaces to the memory, one for instructions and one for data!

Page 20: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 20

Designing a Datapath for MIPS

We start by looking at the datapaths needed to support a simple subset of MIPS instructions: a few arithmetic and logical instructions load and store word beq and j instructions

Page 21: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 21

Functions for MIPS Instructions

We can generalize the functions we need to: using the PC register as the address,

read a value from the memory (read the instruction)

Read one or two register values (depends on the specific instruction).

ALU Operation , Memory read or write, …

Possibly change the value of a register.

Page 22: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 22

Fetching the next instruction

PC Register holds the address Memory holds the instruction

we need to read from memory. Need to update the PC

add 4 to current value

Page 23: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 23

Instruction Fetch DataPath

PC

Instructionmemory

Readaddress

Instruction

4

Add

Page 24: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 24

Supporting R-format instructions

Includes add, sub, slt, and & or instructions.

Generalization: read 2 registers and send to ALU. perform ALU operation store result in a register

rsop rt rd shamt funct

6 bits5 bits6 bits 5 bits 5 bits 5 bits

Page 25: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 25

MIPS Registers MIPS has 32 general purpose

registers. Register File holds all 32 registers

need 5 bits to select a register rs, rt & rd fields in R-format

instructions. MIPS Register File has 2 read ports.

can get at both source registers at the same time.

Page 26: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 26

Datapath for R-format Instructions

InstructionRegisters

Writeregister

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Writedata

ALUresult

ALU

Zero

RegWrite

ALU operation3

Page 27: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 27

Load and Store InstructionsNeed to compute the

address offset (part of the

instruction) base (stored in a register).

For Load: read from memory store in a register

For Store: read from register write to memory

Page 28: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 28

Computing the address 16 bit signed offset is part of the

instruction. We have a 32 bit ALU.

need to sign extend the offset (to 32 bits).

Feed the 32 bit offset and the contents of a register to the ALU

Tell the ALU to “add”.

Page 29: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 29

Load/Store Datapath

Instruction

16 32

RegistersWriteregister

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Datamemory

Writedata

Readdata

Writedata

Signextend

ALUresult

ZeroALU

Address

MemRead

MemWrite

RegWrite

ALU operation3

Page 30: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 30

Supporting beq 2 registers compared for equality

16 bit offset used to compute target address. signed offset is relative to the PC offset is in words not in bytes!

Might branch, might not (need to decide).

Page 31: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 31

Computing target address Recall that the offset is actually

relative to the address of the next instruction. we always add 4 to the PC, we must

make sure we use this value as the base.

Word vs. Byte offset we just need to shift the 16 bit offset 2

bits to the right (fill with 2 zeros).

Page 32: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 32

Branch Datapath

1 6 3 2S ig n

ex te n d

Z eroA L U

S u m

S h ift

l e ft 2

T o b r a n c h

c o n tro l l o g ic

B ra nc h ta rg e t

P C + 4 fro m in st r u c ti o n da ta p a t h

I n s tr u c tio n

A d d

R e g is te r sW ritere gi s te r

R e a dd a ta 1

R e a dd a ta 2

R e a dre gi s te r 1

R e a dre gi s te r 2

W rited a t a

R e g W rit e

A L U o p e r a ti o n3

Page 33: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 33

Datapaths

We looked at individual datapaths that support:

1. Fetching Instructions2. Arithmetic/Logical Instructions3. Load & Store Instructions4. Conditional branch

We need to combine these in to a single datapath.

Page 34: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 34

Issues When designing one datapath that

can be used for any operation: the goal is to be able to handle one

instruction per cycle. must make sure no datapath resource

needs to be used more than once at the same time.

if so – we need to provide more than one!

Page 35: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 35

Sharing Resources We can share datapath resources by

adding a multiplexor (and a control line). for example, the second input to the ALU

could come from either: a register (as in an arithmetic instruction) from the instruction (as in a load/store –

when computing the memory address).

Page 36: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 36

Sharing with a Multiplexor Example

Control

ADD

Operand 1

B Operand 2

C

AA+B (Control==0)

A+C (Control==1)

Page 37: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 37

Combining Datapaths for memory instructions and arithmetic instructions

Need to share the ALU For memory instructions used to

compute the address in memory. For Arithmetic/Logical instructions used

to perform arithmetic/logical operation.

Page 38: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 38

I n s t r u c t i o n

1 6 3 2

R e g i s t e r s

W r i t e

r e g i s t e r

R e a d

d a t a 1

R e a d

d a t a 2

R e a d

r e g i s t e r 1

R e a d

r e g i s t e r 2

D a t a

m e m o r y

W r i t e

d a t a

R e a d

d a t a

M

u

x

M

u

xW r i t e

d a t a

S i g n

e x t e n d

A L U

r e s u lt

Z e r o

A L U

A d d r e s s

R e g W r i t e

A L U o p e r a t i o n3

M e m R e a d

M e m W r i t e

A L U S r c

M e m t o R e g

Sharing Multiplexors

New Controls

Page 39: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 39

Adding the Instruction Fetch

One memory for instructions, separate memory for data. otherwise we might need to use the

memory twice in the same instruction. Dedicated Adder for updating the PC

otherwise we might need to use the ALU twice in the same instruction.

Page 40: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 40

P C

I n s t r u c t i o n

m e m o r y

R e a d

a d d r e s s

I n s t r u c t i o n

1 6 3 2

R e g i s t e r s

W r i t e

r e g i s t e r

W r i t e

d a t a

R e a d

d a t a 1

R e a d

d a t a 2

R e a d

r e g i s t e r 1

R e a d

r e g i s t e r 2

S i g n

e x t e n d

A L Ur e s u l t

Z e r o

D a t a

m e m o r y

A d d r e s s

W r i t e

d a t a

R e a d

d a t aMux

4

A d d

Mux

A L U

R e g W r i t e

A L U o p e r a t i o n3

M e m R e a d

M e m W r i t e

A L U S r c

M e m t o R e g

Two Memory Units

Dedicated Adder

Page 41: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 41

Need to add datapath for beq

Register comparison (requires ALU).

Another adder to compute target address. One input to adder is sign extended

offset, shifted by 2 bits. Other input to adder is PC+4

Page 42: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 42

PC

Instructionmemory

Readaddress

Instruction

16 32

Add ALUresult

Mux

Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Shiftleft 2

4

Mux

ALU operation3

RegWrite

MemRead

MemWrite

PCSrc

ALUSrc

MemtoReg

ALUresult

ZeroALU

Datamemory

Address

Writedata

Readdata M

ux

Signextend

Add

New adder and mux

Page 43: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 43

Whew!

Keep in mind that the datapath we now have supports just a few MIPS instructions!

Things get worse (more complex) as we support other instructions:

j jal jr addi

We won’t worry about them now…

Page 44: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 44

Control Unit We need something that can

generate the controls in the datapath.

Depending on what kind of instruction we are executing, different controls should be turned on (asserted) and off (deasserted).

We need to treat each control individually (as a separate boolean function).

Page 45: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 45

Controls Our datapath includes a bunch of

controls: ALU operation (3 bits) RegWrite ALUSrc MemWrite MemtoReg MemRead PCSrc

Page 46: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 46

ALU Operation Control A 3 bit control (assumes the ALU

designed in chapter 3):

ALU Control Input Operation

000 AND

001 OR

010 add

110 subtract

111 slt

Page 47: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 47

ALU Functions for other instructions

lw , sw (load/store): addition

beq: subtraction

add, sub, and, or, slt (arithmetic/logical): All R-format instructions

Page 48: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 48

R-Format Instructions

rsop rt rd shamt funct

6 bits5 bits6 bits 5 bits 5 bits 5 bits

Operation is specified by some bits in the funct field in the instruction.

Page 49: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 49

MIPS Instruction OPCODEs

The MS 6 bits are an OPCODE that identifies the instruction.

R-Format: always 000000 (funct identifies the operation)

lw sw beq

100011 101011 000100

varies depending on instructionop

6 bits

Page 50: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 50

Generating ALU Controls

We can view the 3 bit ALU control as 3 boolean functions. Inputs are: the op field (OPCODE) funct field (for R-format instructions

only)

Page 51: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 51

Simplifying The Opcode

For building the ALU Operation Controls, we are interested in only 4 different opcodes.

We can simplify things by first reducing the 6 bit op field to a 2 bit value we will call ALUOp

Page 52: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 52

Instruction ALUOp funct ALU action

ALU controls

lw 00 ?????? add 010

sw 00 ?????? add 010

beq 01 ?????? subtract 110

add 10 100000 add 010

sub 10 100010 subtract 110

and 10 100100 and 000

or 10 100101 or 001

slt 10 101010 slt 111

Page 53: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 53

Build a Truth Table We can now build a truth table for

the 3 bit ALU control. Inputs are:

2 bit ALUOp 6 bit funct field

Abbreviated Truth Table: only show the rows we care about!

Page 54: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 54

ALUOp funct ALU Control

0 0 x x x x x x 010

x 1 x x x x x x 110

1 x x x 0 0 0 0 010

1 x x x 0 0 1 0 110

1 x x x 0 1 0 0 000

1 x x x 0 1 0 1 001

1 x x x 1 0 1 0 111

x means “don’t care”

Page 55: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 55

Adding the ALU Control We can now add the ALU control to

the datapath: inputs to this control come from the

instruction and from ALUOp If we try to show all the details the

picture becomes too complex: just plop in an “ALU Control” box.

Page 56: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 56

M e m to R e g

M e m R e a d

M e m Wri te

A L U O p

A L U S rc

R e g D st

P C

I n s tru ct io n

m e m o ry

R e a da d d re s s

In s tr u ctio n[3 1 – 0 ]

In str uc tio n [2 0 – 1 6]

In str uc tio n [2 5 – 2 1]

A d d

I n s tru c tio n [5 – 0 ]

R e g Write

4

16 3 2In str u c tio n [1 5 – 0 ]

0

R e gi s te rs

Wr itere g is te r

Wr ited a ta

Writed a ta

R e a dd a ta 1

R e a dd a ta 2

R e a dre g is te r 1

R e a dre g is te r 2

S ig n

e x te nd

A L Ure su l t

Ze ro

D a tam e m o ry

A d d re s s R e a dd a ta

Mux

1

0

Mux

1

0

Mux

1

0

Mux

1

In str uc tio n [1 5 – 1 1]

A L Uco ntro l

S h i f t

le ft 2

P C S rc

A L U

A ddA L U

r e s u lt

Shows which bits from the instructionare fed to register file inputs ALU Control

Page 57: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 57

Implementing Other Controls

The other controls in out datapath must also be specified as functions.

We need to determine the inputs to all the functions. primarily the inputs are part of the

instructions, but there are exceptions. Need to define precisely what

conditions should turn on each control.

Page 58: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 58

RegDst Control Line

Controls a multiplexor that selects on of the fields rt or rd from an R-format or I-format instruction.

I-Format is used for load and store. sw needs to write to the register rt.

rsop rt rd shamt funct

rsop rt address

R-format

I-format

Page 59: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 59

RegDst usage RegDst should be

0 to send rt to the write register # input.

1 to send rd to the write register # input.

RegDst is a function of the opcode field: If instruction is sw, RegDst should be 0 For all other instrs RegDst should be 1

Page 60: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 60

RegWrite Control a 1 tells the register file to write a

register. whatever register is specified by the

write register # input is written with the data on the write register data inputs.

Should be a 1 for arithmetic/logical instructions and for a store.

Should be a 0 for load or beq.

Page 61: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 61

ALUSrc Control

MUX that selects the source for the second ALU operand. 1 means select the second register file

output (read data 2). 0 means select the sign-extended 16 bit

offset (part of the instruction). Should be a 1 for load and store. Should be a 0 for everything else.

Page 62: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 62

MemRead Control

• A 1 tells the memory to put the contents of the memory location (specified by the address lines) on the Read data output.

• Should be a 1 for load.

• Should be a 0 for everything else.

Page 63: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 63

MemWrite Control

• 1 means that memory location (specified by memory address lines) should get the value specified on the memory Write Data input.

• Should be a 1 for store.

• Should be a 0 for everything else.

Page 64: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 64

MemToReg Control

• MUX that selects the value to be stored in a register (that goes to register write data input).– 1 means select the value coming from the memory

data output.– 0 means select value coming from the ALU output.

• Should be a 1 for load and any arithmetic/logical instructions.

• Should be a 0 for everything else (sw, beq).

Page 65: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 65

PCSrc Control

• MUX that selects the source for the value written in to the PC register.– 1 means select the output of the Adder used to

compute the relative address for a branch.– 0 means select the output of the PC+4 adder.

• Should be a 1 for beq if registers are equal!

• Should be a 0 for other instructions or if registers are different.

Page 66: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 66

PCSrc depends on result of ALU operation!

This control line can’t be simply a function of the instruction (all the others can).

PCSrc should be a 1 only when: beq AND ALU zero output is a 1

We will generate a signal called “branch” that we can AND with the ALU zero output.

Page 67: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 67

Truth Table for Control

Instruction

RegDst ALUSrc Memto-

Reg

Reg

Write

Mem

Read

Mem

Write

Branch ALUOp

R-format

1 0 0 1 0 0 0 10

lw 0 1 1 1 1 0 0 00

sw x 1 x 0 0 1 0 00

beq x 0 x 0 0 0 1 01

Page 68: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 68

P C

I n str u c ti o n

m e m o ry

R e a da d dr e s s

I n s t r u ct io n[ 3 1 – 0 ]

I n s tr u c t i o n [ 2 0 1 6]

I n s tr u c t i o n [ 2 5 2 1]

A d d

I n s t ru ct io n [ 5 0 ]

M e mt o R e g

A L U O p

M e m Wr it e

R e g Writ e

M e m R e a d

B r a n c h

R e g D st

A L U S rc

I n s tr u c t i o n [ 3 1 2 6]

4

1 6 3 2I n s tr u c t i o n [ 1 5 0 ]

0

0Mux

0

1

C o n tr o l

A d dA L U

re s u lt

Mux

0

1

R e gi s t er sW r it er e g i ste r

W r it ed a t a

R e a dd a t a 1

R e a dd a t a 2

R e a dr e g i ste r 1

R e a dr e g i st e r 2

S ig n

e x t e n d

Mux

1

A L Ur e s ul t

Z er o

P C S r c

D at a

m e m or yW ri ted a t a

R e a dd a ta

Mux

1

I n s tr u c t i o n [ 1 5 1 1]

A L U

co n tr o l

S h i f t

l e ft 2

A L U

A d d r e s s

Page 69: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 69

Single Cycle Instructions View the entire datapath as a

combinational circuit. We can follow the flow of an

instruction through the datapath. single cycle instruction means that there

are not really any steps – everything just happens and becomes finalized when the clock cycle is over.

Page 70: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 70

add $t1,$t2,$t3 Control Lines:

ALU Controls specify an ALU add operation.

RegWrite will be a 1 so that when the clock cycle ends the value on the Register Write Input lines will be written to a register.

all other control lines are 0.

Page 71: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 71

lw $t1,offset($t2) Control Lines:

ALU Control set for an add operation. ALUSrc is set to 1 to indicate the second

operand is sign extended offset. MemRead would be a 1. RegDst would select the correct bits

from the instruction to specify the dest. register.

RegWrite would be a 1.

Page 72: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 72

Disadvantage of single cycle operation

If we have instructions execute in a single cycle, then the cycle time must be long enough for the slowest instruction. all instructions take the same time as

the slowest.

Page 73: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 73

Multicycle Implementation Chop up the processing of

instructions in to discrete stages. Each stage takes one clock cycle.

we can implement each stage as a big combinational circuit (like we just did for the whole thing).

provide some way to sequence through the stages.

Page 74: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 74

Advantages of Multicycle Only need those stages required by

an instruction. the control unit is more complex, but

instructions only take as long as necessary.

We can share components perhaps 2 different stages can use the

same ALU. We don’t need to duplicate resources.

Page 75: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 75

Additional Resources for Multicycle

To implement a multicycle implementation we need some additional registers that can be used to hold intermediate values. instruction computed address result of ALU operation …

Page 76: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 76

PC

Memory

Address

Instructionor data

Data

Instructionregister

Registers

Register #

Data

Register #

Register #

ALU

Memorydata

register

A

B

ALUOut

Multicycle Datapath

Page 77: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 77

Multicycle Datapath for MIPS

Page 78: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 78

MC DP with Control

Page 79: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 79

Instruction Stages Instruction Fetch Instruction decode/register fetch ALU operation/address computation Memory Access Register Write

Page 80: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 80

Complete Multicyle Datapath & Control

Page 81: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 81

Instruction Fetch/Decode (IF/ID) State Machine

Page 82: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 82

Memory Reference State Machine

Page 83: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 83

R-type Instruction State Machine

Page 84: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 84

Branch/Jump State Machine

Page 85: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 85

Put it all together!

Page 86: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 86

Control for Multicycle Need to define the controls Need to come up with some way to

sequence the controls Two techniques

finite state machine microprogramming

Page 87: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 87

Finite State Machine

Page 88: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 88

MicroProgramming (sec. 5.7)

The idea is to build a (very small) processor to generate the controls signals at the right time.

At each stage (cycle) one microinstruction is executed – the result changes the value of the control signals.

Somebody writes the microinstructions that make up each MIPS instruction.

Page 89: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 89

Example microinstructions

Fetch next instruction: turn on instruction memory read feed PC to memory address input write memory data output in to a holding

register.

Compute Address: route contents of base register to ALU route sign-extended offset to ALU perform ALU add write ALU output in to a holding register.

Control Signals

microinstruction

Page 90: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 90

Sequencing In addition to setting some control

signals, each microinstruction must specify the next microinstruction that should be executed.

3 Options: execute next microinstruction (default) start next MIPS instruction (Fetch) Dispatch (depends on control unit

inputs).

Page 91: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 91

Microinstruction Format A bunch of bits – one for each control

line needed by the control unit. bits specify the values of the control

lines directly. Some bits that are used to determine

the next microinstruction executed.

Page 92: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 92

Dispatch Sequencing Can be implemented as a table

lookup. bits in the microinstruction tell what row

in the table. inputs to the control unit tell what

column. value stored in table determines the

microaddress of the next microinstruction.

This is a simplified description (called a microdescription)

Page 93: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 93

Exceptions & Interrupts Hardest part of control is implementing

exceptions and interrupts – i.e., events that change the normal flow of instruction execution.

MIPS convention Exception refers to any unexpected change in

control flow w/o knowing if the cause is internal or external.

Interrupts refer to only events who are externally caused.

Ex. Interrupts: I/O device request (ignore for now)

Ex. Exceptions: undefined instruction, arithmetic overflow

Page 94: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 94

Handling Exceptions Let’s implemented exceptions for handling

Undefined instruction Overflow

Basic actions Save the offending instruction address in the Exception

Program Counter (EPC). Transfer control to the OS at some specified address Once exception is handled by OS, then either terminate the

program or continue on using the EPC to determine where to restart.

OS actions are determined based on what caused the exception.

So, OS needs a Cause register which determines which path w/i the exception

Alternative implementation – Vectored Interrupts – where each cause of an exception or interrupt is given a specific OS address to jump to.

We’ll use the first method.

Page 95: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 95

Extending the Multicycle D&C

What datapath elements to add? EPC: a 32-bit register used to hold the address

of the affected instruction. Cause: A 32-bit register used to record the

cause of the exception. (undef instruction = 0 and overflow = 1).

What control lines to add? EPCWrite and Cause write control signals to

allow regs to be written. IntCause (1-bit) control signal to set the low-

order bit of the cause register to the appropriate value.

Page 96: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 96

Revised Datapath & Control

Page 97: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 97

Final FSM w/ exception handling

Page 98: CSCI-2500: Computer Organization

Pipelining

Page 99: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 99

Multicycle Instructions Chop each instruction in to stages. Each stage takes one cycle. We need to provide some way to

sequence through the stages: microinstructions

Stages can share resources (ALU, Memory).

Page 100: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 100

Pipelining We can overlap the execution of

multiple instructions. At any time, there are multiple

instructions being executed – each in a different stage.

So much for sharing resources ?!?

Page 101: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 101

The Laundry Analogy

Non-pipelined approach:1. run 1 load of clothes through washer2. run load through dryer3. fold the clothes (optional step for

students)4. put the clothes away (also optional).

Two loads? Start all over.

Page 102: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 102

Pipelined Laundry While the first load is drying, put the

second load in the washing machine. When the first load is being folded and the

second load is in the dryer, put the third load in the washing machine.

Admittedly unrealistic scenario for CS students, as most only own 1 load of clothes…

Page 103: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 103

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Taskorder

Taskorder

Page 104: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 104

Laundry Performance For 4 loads:

non-pipelined approach takes 16 units of time.

pipelined approach takes 7 units of time.

For 816 loads: non-pipelined approach takes 3264 units

of time. pipelined approach takes 819 units of

time.

Page 105: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 105

Execution Time vs. Throughput It still takes the same amount of time

to get your favorite pair of socks clean, pipelining won’t help.

However, the total time spent away from CompOrg homework is reduced.

It's the classic “Socks vs. CompOrg” issue.

Page 106: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 106

Instruction Pipelining

First we need to break instruction execution into discrete stages:

1. Instruction Fetch2. Instruction Decode/ Register Fetch3. ALU Operation4. Data Memory access5. Write result into register

Page 107: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 107

Operation Timings Some estimated timings for each

of the stages:

Instruction Fetch 200 ps

Register Read 100 ps

ALU Operation 200 ps

Data Memory 200 ps

Register Write 100 ps

Page 108: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 108

Comparison

I n s t r u c t i o n

f e t c hR e g A L U

D a t a

a c c e s sR e g

800 p sI n s t r u c t i o n

f e t c hR e g A L U

D a t a

a c c e s sR e g

800 p sI n s t r u c t i o n

f e t c h

800 p s

T i m e

l w $ 1 , 1 0 0 ( $ 0 )

l w $ 2 , 2 0 0 ( $ 0 )

l w $ 3 , 3 0 0 ( $ 0 )

2 4 6 8 1 0 1 2 1 4 1 6 1 8

2 4 6 8 1 0 1 2 1 4

. . .

P r o g r a m

e x e c u t i o n

o r d e r

( i n i n s t r u c t i o n s )

I n s t r u c t i o n

f e t c hR e g A L U

D a t a

a c c e s sR e g

T i m e

l w $ 1 , 1 0 0 ( $ 0 )

l w $ 2 , 2 0 0 ( $ 0 )

l w $ 3 , 3 0 0 ( $ 0 )

200 p sI n s t r u c t i o n

f e t c hR e g A L U

D a t a

a c c e s sR e g

200 p sI n s t r u c t i o n

f e t c hR e g A L U

D a t a

a c c e s sR e g

200 p s 200 p s 200 p s 200 p s 200 p s

P r o g r a m

e x e c u t i o n

o r d e r

( i n i n s t r u c t i o n s )

Page 109: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 109

RISC and Pipelining One of the major advantages of RISC

instruction sets is the relative simplicity of a pipeline implementation. It’s much more complex in a CISC processor!!

RISC (MIPS) design features that make pipelining easy include: single length instruction (always 1 word) relatively few instruction formats load/store instruction set operands must be aligned in memory (a

single data transfer instruction requires a single memory operation).

Page 110: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 110

Pipeline Hazard Something happens that means the

next instruction cannot execute in the following clock cycle.

Three kinds of hazards: structural hazard control hazard data hazard

Page 111: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 111

Structural Hazards Two stages require the same

resource. What if we only had enough electricity to

run either the washer or the dryer at any given time?

What if MIPS datapath had only one memory unit instead of separate instruction and data memory?

Page 112: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 112

Avoiding Structural Hazards

Design the pipeline carefully. Might need to duplicate resources

an Adder to update PC, and ALU to perform other operations.

Detecting structural hazards at execution time (and delaying execution) is not something we want to do (structural hazards are minimized in the design phase).

Page 113: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 113

Control Hazards When one instruction needs to make

a decision based on the results of another instruction that has not yet finished.

Example: conditional branch The instruction that is fed to the pipeline

right after a beq depends on whether or not the branch is taken.

Page 114: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 114

beq Control Hazard

slt $t0,$s0,$s1beq $t0,$zero,skipaddi $s0,$s0,1

skip:lw $s3,0($t3)

slt

beq

???

The instruction to follow the beq could be either the addi or the lw, it depends on the result of the beq instruction.

a = b+c;if (x!=0)

y++;...

Page 115: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 115

One possible solution - stall

We can include in the control unit the ability to stall (to keep new instructions from entering the pipeline until we know which one).

Unfortunately conditional branches are very common operations, and this would slow things down considerably.

Page 116: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 116

A Stall

I n s t r u c t i o n

f e t c hR e g A L U

D a t a

a c c e s sR e g

T i m e

b e q $ 1 , $ 2 , 4 0

a d d $ 4 , $ 5 , $ 6

l w $ 3 , 3 0 0 ( $ 0 )

4 n s

I n s t r u c t i o n

f e t c hR e g A L U

D a t a

a c c e s sR e g

2 n s

I n s t r u c t i o n

f e t c hR e g A L U

D a t a

a c c e s sR e g

2 n s

2 4 6 8 1 0 1 2 1 4 1 6

P r o g r a m

e x e c u t i o n

o r d e r

( i n i n s t r u c t i o n s )

To achieve a 1 cycle stall (as shown above), we need to modify the implementation of the beq instruction so that the decision is made by the end of the second stage.

Page 117: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 117

Another strategy

Predict whether or not the branch will be taken.

Go ahead with the predicted instruction (feed it into the pipeline next).

If your prediction is right, you don't lose any time.

If your prediction is wrong, you need to undo some things and start the correct instruction

Page 118: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 118

Predicting branch not taken

Instructionfetch

Reg ALUData

accessReg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

Instructionfetch

Reg ALUData

accessReg

2 ns

Instructionfetch

Reg ALUData

accessReg

2 ns

Programexecutionorder(in instructions)

Instructionfetch

Reg ALUData

accessReg

Time

beq $1, $2, 40

add $4, $5 ,$6

or $7, $8, $9

Instructionfetch

Reg ALUData

accessReg

2 4 6 8 10 12 14

2 4 6 8 10 12 14

Instructionfetch

Reg ALUData

accessReg

2 ns

4 ns

bubble bubble bubble bubble bubble

Programexecutionorder(in instructions)

Page 119: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 119

Dynamic Branch Prediction The idea is to build hardware that will

come up with a prediction based on the past history of the specific branch instruction.

Predict the branch will be taken if it has been taken more often than not in the recent past. This works great for loops! (90% +

correct). We’ll talk more about this …

Page 120: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 120

Yet another strategy: delayed branch

The compiler rearranges instructions so that the branch actually occurs delayed by one instruction from where its execution starts

This gives the hardware time to compute the address of the next instruction.

The new instruction is hopefully useful whether or not the branch is taken (this is tricky - compilers must be careful!).

Page 121: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 121

Delayed Branch

add $s2,$s3,$s4beq $t0,$zero,skipaddi $s0,$s0,1

skip:lw $s3,0($t3)

beq

add

Order reversed!

The compiler must generate code that differs from what you would expect.

a = b+c;if (x!=0)

y++;...

Page 122: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 122

Data Hazard One of the values needed by an

instruction is not yet available (the instruction that computes it isn't done yet).

This will cause a data hazard:add $t0,$s1,$s2

addi $t0,$t0,17

Page 123: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 123

IF Reg ALUData

AccessReg

IF Reg ALUData

AccessReg

add $t0,$s1,$s2

addi $t0,$t0,17

selects $s1 and $s2 for ALU op

adds $s1 and $s2

stores sum in $t0

selects $t0 for ALU optime

Page 124: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 124

Handling Data Hazards

We can hope that the compiler can arrange instructions so that data hazards never appear. this doesn't work, as programs

generally need to use previously computed values for everything!

Some data hazards aren't real - the value needed is available, just not in the right place.

Page 125: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 125

IF Reg ALUData

AccessReg

IF Reg ALUData

AccessReg

add $t0,$s1,$s2

addi $t0,$t0,17

ALU has finished computing sum

ALU needs sum from the previous ALU operationtime

The sum is available when needed!

Page 126: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 126

Forwarding It's possible to forward the value

directly from one resource to another (in time).

Hardware needs to detect (and handle) these situations automatically! This is difficult, but necessary.

Page 127: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 127

add $s0, $t0, $t1

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBEX

IF ID MEMEX

Time2 4 6 8 10

MEM

WBMEM

Picture of Forwarding

Page 128: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 128

Another Example

Ti m e2 4 6 8 1 0 1 2 1 4

l w $ s0, 2 0( $t1 )

su b $t2 , $ s0 , $ t3

Prog ra m

e xe c utio n

ord er

(in in st ru ctio n s)

IF I D W BM E ME X

I F I D W BM E ME X

bu b ble bu bble bu b ble bu b ble bu bble

Page 129: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 129

Pipelining and CPI If we keep the pipeline full, one

instruction completes every cycle. Another way of saying this: the

average time per instruction is 1 cycle. even though each instruction actually

takes 5 cycles (5 stage pipeline).

CPI=1

Page 130: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 130

CorrectnessPipeline and compiler designers must be

careful to ensure that the various schemes to avoid stalling do not change what the program does! only when and how it does it. It's impossible to test all possible

combinations of instructions (to make sure the hardware does what is expected).

It's impossible to test all combinations even without pipelining!

Page 131: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 131

Pipelined Datapath

We need to use a multicycle datapath.

includes registers that store the result of each stage (to pass on to the next stage).

can't have a single resource used by more than one stage at time.

Page 132: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 132

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

Pipelined Datapath – 5 stages

Page 133: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 133

lw and pipelined datapath We can trace the execution of a load

word instruction through the datapath.

We need to keep in mind that other instructions are using the stages not in use by our lw instruction!

Page 134: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 134

I n s tr u c t i o n

m e m o r y

A d d r e s s

4

3 2

0

A d dA d d

r e s ul t

S h if t

l e ft 2

Ins

tru

cti

on

I F / I D E X/ M E M M E M / W B

M

ux

0

1

A d d

P C

0W ri t e

d a t a

M

u

x

1

R e g i st e r s

R e a d

d at a 1

R e a d

d at a 2

R e a d

r e g i s t er 1

R e a d

r e g i s t er 2

1 6S i g n

e x t e n d

W r i t e

r e g i s t er

W r i t e

d a t a

R e a d

d a t a

1

A L Ur e s u l t

M

u

x

A L U

Z e r o

I D / E X

I n s t r u c t i o n f e t c h

l w

A d d r e s s

D at a

m e m or y

Stage 1: Instruction Fetch (IF)

Page 135: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 135

I n st r u c t i o n

m e m o r y

A d d r e s s

4

3 2

0

A d dA d d

r e s u l t

S h if t

l e ft 2

Ins

tru

ct i

on

I F /I D E X / M E M

M

u

x

0

1

A d d

P C

0W ri t e

d a t a

M

u

x

1

R e g i s t e r s

R e a d

d a t a 1

R e a d

d a t a 2

R e a d

r e g is t e r 1

R e a d

r e g is t e r 2

1 6S i g n

e x t e n d

W r it e

r e g is t e r

W r it e

d a t a

R e a d

d a t a

1

A L U

r e s u ltMu

x

A L U

Z e r o

I D / E X M E M / W B

I n s t r u c t i o n d e c o d e

l w

A d d r e s s

D at a

m e m o r y

Stage 2: Instruction Decode (ID)

Page 136: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 136

I n st r u c t i o n

m e m o r y

A d d r e s s

4

3 2

0

A d dA d d

r e s u l t

S h if t

l e ft 2

Ins

tru

ct i

on

I F /I D E X / M E M

M

u

x

0

1

A d d

P C

0W ri t e

d a t a

M

u

x

1

R e g i s t e r s

R e a d

d a t a 1

R e a d

d a t a 2

R e a d

r e g is t e r 1

R e a d

r e g is t e r 2

1 6S i g n

e x t e n d

W r it e

r e g is t e r

W r it e

d a t a

R e a d

d a t a

1

A L U

r e s u ltMu

x

A L U

Z e r o

I D / E X M E M / W B

I n s t r u c t i o n d e c o d e

l w

A d d r e s s

D at a

m e m o r y

Stage 2: Instruction Decode (ID)

Page 137: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 137

I n s t r u c t i o n

m e m o r y

A d d r e s s

4

3 2

0

A d dA d d

r e s u lt

S h if t

l e f t 2

Ins

tru

ct i

on

I F /I D E X / M E M

M

u

x

0

1

A d d

P C

0W ri t e

d a t a

M

u

x

1

R e g i s t e r s

R e a d

d a t a 1

R e a d

d a t a 2

R e a d

r e g i s t e r 1

R e a d

r e g i s t e r 2

1 6S i g n

e x t e n d

W r i t e

r e g i s t e r

W r i t e

d a t a

R e a d

d a t a

1

A L U

r e s u ltM

u

x

A L U

Z e r o

I D / E X M E M / W B

E x e c u t i o n

l w

A d d r e s s

D a t a

m e m o r y

Stage 3: Execute (EX)

Page 138: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 138

I n s t r u c t i o n

m e m o r y

A d d r e s s

4

3 2

0

A d dA d d

r e s u l t

S h i f t

l e f t 2

Ins

tru

ct i

on

I F / I D E X / M E M

M

u

x

0

1

A d d

P C

0W r i t e

d a t a

M

u

x

1

R e g i s t e r s

R e a d

d a t a 1

R e a d

d a t a 2

R e a d

r e g i s t e r 1

R e a d

r e g i s t e r 2

1 6S i g n

e x t e n d

W r it e

r e g i s t e r

W r it e

d a t a

R e a d

d a t aD a t a

m e m o r y

1

A L U

r e s u l tM

u

x

A L U

Z e r o

I D / E X M E M / W B

M e m o r y

l w

A d d r e s s

Stage 4: Memory Access (MEM)

Page 139: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 139

I n str u c ti on

m e m o r y

A d d r e s s

4

3 2

0

A d dA d d

r e su l t

S h if t

l e f t 2

Ins

tru

ct i

on

I F /I D E X / M E M

M

u

x

0

1

A d d

P C

0W ri t e

d at a

M

u

x

1

R e g i st e r s

R e a d

d at a 1

R e a d

d at a 2

R e a d

r e g is t er 1

R e a dr e g is t er 2

1 6S i g n

e x t e n d

W ri t e

d a t a

R e a d

d a t aD a t a

m e m o r y

1

A L U

r e s u l tM

u

x

A L U

Z e r o

I D / E X M E M / W B

W r it e b a c k

l w

W ri t e

r e g is t erA d d r e s s

Stage 5: WriteBack (WB)

Page 140: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 140

A Bug! When the value read from memory is

written back to the register file, the inputs to the register file (write register #) are from a different instruction!

To fix the bug we need to save the part of the lw instruction (5 bits of it specify which register should get the value from memory).

Page 141: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 141

New Datapath

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0

Address

Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX

Figure 4.41

Page 142: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 142

Store Word (sw) Data Path Flow (EX)

Page 143: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 143

SW Data Path (cont.)

Page 144: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 144

Final Corrected Datapath

Page 145: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 145

Ex. With 5 instructions

Page 146: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 146

Ex: Alt View

Page 147: CSCI-2500: Computer Organization

Pipeline Control

Page 148: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 148

Pipelined DP w/ signals

Page 149: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 149

Control lines for pipeline stages

Page 150: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 150

Pipelined DP w/ Control

Page 151: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 151

Pipelined Dependencies

Page 152: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 152

Pipeline w/ Forwarding Values

Page 153: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 153

ALU & Regs: B4, After Fwding

Page 154: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 154

Datapath w/ forwarding

Page 155: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 155

Forwarding Control Table

ForwardA = 00

ID/EX 1st ALU op from reg file

ForwardA= 10 EX/MEM 1st ALU op fwd from prior ALU result

ForwardA = 01

MEM/WB 1st ALU op fwd from data mem or earlier result

MUX Control Source Reason

Page 156: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 156

Forwarding Control Table (cont.)

ForwardB = 00 ID/EX 2nd ALU op from reg file

ForwardB= 10 EX/MEM 2nd ALU op fwd from prior ALU result

ForwardB = 01 MEM/WB 2nd ALU op fwd from data mem or earlier result

MUX Control Source Reason

Page 157: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 157

EX Stage Hazard Detection & Resolution

if( EX/MEM.RegWrite && EX/MEM.RegisterRd != 0 && EX/MEM.RegisterRd == ID/EX.RegisterRs )

ForwardA = 10 if( EX/MEM.RegWrite &&

EX/MEM.RegisterRd != 0 && EX/MEM.RegisterRd == ID/EX.RegisterRt )

ForwardB = 10

Page 158: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 158

Mem Stage Hazard Detection & Resolution if( MEM/WB.RegWrite && MEM/WB.RegisterRd

!= 0 && EX/MEM.RegisterRd != ID/EX.RegisterRs && MEM/WB.RegisterRd = ID/EX.RegisterRs) ForwardA = 01

if( MEM/WB.RegWrite && MEM/WB.RegisterRd != 0 && EX/MEM.RegisterRd != ID/EX.RegisterRt && MEM/WB.RegisterRd = ID/EX.RegisterRt) ForwardB = 01

Does not consider hazard between WB stage results, MEM stage result & EX stage ALU src

Page 159: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 159

Data Hazards & Stalls Need Hazard detection unit in

addition to forwarding unit. Check for Load Instructions based

on… if( ID/EX.MemRead &&

(ID/EX.RegisterRt==IF/ID.RegisterRs || ID/EX.RegisterRt==IF/ID.RegisterRt))

StallThePipeline

Page 160: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 160

Where Forwarding Fails…must stall

Page 161: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 161

How Stalls Are Inserted

AND

Page 162: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 162

Pipelined control w/ fwding & hazard detection

Page 163: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 163

What about those crazy branches?

Problem: if the branch is taken, PC goes to addr 72, but don’t know until after 3 other instructions are processed

Page 164: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 164

Branch Hazards: Assume Branch Not Taken

Recall stalling until branch is complete is too ssssssllllooooowwww!!

So, assume the branch is not taken… If taken, instructions fetched/decoded must be

discarded or “squashed” discard instructions, just change the original control

values to 0’s (similar to load-use hazard), BIG DIFFERENCE: must flush the pipeline in the IF, ID

and EX stages How can we reduce the “flush” costs when a branch

is taken?

Page 165: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 165

Reducing the Delay of Branches

Let’s move the branch execution earlier in the pipeline.

EFFECT: fewer instructions need to be flushed.

NEED two actions: Compute branch target address (EASY –

can do on IF/ID stage). Eval of branch decision (HARD)

Page 166: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 166

Faster Branch Decision Recall, for BEQ instruction, we would

compare two regs during the ID stage and test for equality.

Equality can be tested by XORing the two regs. (a.k.a. equality unit)

Need additional ID stage forwarding and hazard detection hardware

This has 2 complicating factors…

Page 167: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 167

Faster Branch Decison: Complex Factors

1. In ID stage, now we need to decide whether a “bypass” path to the “equality” unit is needed.

• ALU forwarding logic is not sufficient, and so we need new forwarding logic for the equality unit.

2. Can stall due to a data hazard. • if an r-type instruction comes before the

branch who operands are used in the comparision in the branch, a stall is needed

Page 168: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 168

Example Pipelined Branch

36 sub $10, $4, $840 beq $1, $3, 7 # jmp2 40+4+(7*4) = 7244 and $12, $2, $548 or $13, $2, $652 and $14, $4, $256 slt $15, $6, $7

……..72 lw $4, 50($7)

Page 169: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 169

Branch Processing Example

Page 170: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 170

Dynamic Branch Prediction From the phase “There is no such thing as a

typical program”, this implies that programs will branch is different ways and so there is no “one size fits all” branch algorithm.

Alt approach: keep a history (1 bit) on each branch instruction and see if it was last taken or not.

Implementation: branch prediction buffer or branch history table.

Index based on lower part of branch address Single bit indicates if branch at address was last taken or

not. (1 or 0)

Page 171: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 171

Problem with 1-bit Branch Predictors

Consider a loop branch Suppose it occurs 9 times in a row, then

is not taken. What’s the branch prediction accuracy? ANSWER: 1-bit predictor will mispredict

the entry and exit points of the loop. Yields only an 80% accuracy when there

is potential for 90% (i.e., you have to guess wrong on the exit of the loop).

Page 172: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 172

Solution: 2-bit Branch Predictor

Must be wrong twice before changing predictionLearns if the branch is more biased towards “taken” or “not taken”

Page 173: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 173

Performance: Single vs Multicycle vs. PL

Assume: 200 ps for memory access, 100 ps for ALU ops, 50 ps for register access

Single-cycle clock cycle: 600 ps: 200 + 50 + 100 + 200 + 50

Futher assume instruction mix 25% loads, 10% stores, 11% branches, 2% jumps, 52%

ALU instructions Assume CPI for multi-cycle is 3.50 Multicycle clock cycle: must be longest unit which is

200 ps Total time for an “avg” instruction is 3.5 * 200 ps =

700ps

Page 174: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 174

Pipeline performance (cont)

For pipelined design… Loads take 1 cycle when no load-use

dependence and 2 cycles when there is yielding an average of 1.5 cycles per load.

Stores and ALU instructions take 1 cycle. Branches take 1 cycle when predicted correctly

and 2 cycles when not. Assume 75% accuracy, average branch cycles is 1.25.

Jumps are 2 cycles. Avg CPI then is:

1.5 x 25% + 1 x 10% + 1 x 52% + 1.25 x 11% + 2 x 2% = 1.17

Longest stage is 200 ps, so 200 x 1.17 = 234 ps

Page 175: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 175

Even more performance…

Ultimately we want greater and greater Instruction Level Parallelism (ILP)

How? Multiple instruction issue.

Results in CPI’s less than one. Here, instructions are grouped into “issue slot

s”. So, we usually talk about IPC (instructions per

cycle) Static: uses the compiler to assist with

grouping instructions and hazard resolution. Compiler MUST remove ALL hazards.

Dynamic: (i.e., superscalar) hardware creates the instruction schedule based on dynamically detected hazards

Page 176: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 176

Example Static 2-issue Datapath

Additions include:

•32 bits from intr. Mem

•Two read, 1 write ports on reg file

•1 more ALU (top handles address calc)

Page 177: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 177

Ex. 2-Issue Code Schedule

Loop: lw $t0, 0($s1) #t0=array elementaddiu $t0, $t0, $s2 #add scalar in $s2sw $t0, 0($s1) #store resultaddi $s1, $s1, -4 # dec pointerbne $s1, $zero, Loop # branch $s1!=0

ALU/Branch Data Xfer Inst. Cycles

Loop: lw $t0, 0($s1) 1addi $s1, $s1, -4 2addu $t0, $t0, $s2 3bne $s1, $zero, Loop sw $t0, 4($s1) 4

It take 4 clock cycles for 5 instructions or IPC of 1.25

Page 178: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 178

More Performance: Loop Unrolling

Technique where multiple copies of the loop body are made.

Make more ILP available by removing dependencies.

How? Complier introduces additional registers via “register renaming”.

This removes “name” or “anti” dependence where an instruction order is purely a consequence of

the reuse of a register and not a real data dependence. Ex. lw $t0, 0($s1), addu $t0, $t0, $s2 and sw $t0, 4($s1) No data values flow between one pair and the next pair Let’s assume we unroll a block of 4 interations of the

loop..

Page 179: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 179

Loop Unrolling ScheduleALU/Branch Instructions

Data Xfer Cycles

Loop addi $s1, $s1, -16 lw $t0, 0($s1) 1

lw $t1, 12($s1) 2

addu $t0, $t0, $s2 lw $t2, 8($s1) 3

addu $t1, $t1, $s2 lw $t3, 4($s1) 4

addu $t2, $t2, $s2 sw $t0, 16($s1) 5

addu $t3, $t3, $s2 sw $t1, 12($s1) 6

sw $t2, 8($s1) 7

bne $s1, $zero, loop

sw $t3, 4($s1) 8

Page 180: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 180

Performance of Instruction Schedule

12 of 14 instructions execute in a pair.

Takes 8 clock cycles for 4 loop iterations

Yields 2 clock cycles per iteration CPI = 8/14 0.57 2.19x speedup over regular loop Cost of improvement: 4 temp regs +

lots of additional code

Page 181: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 181

Dynamic Scheduled Pipeline

Page 182: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 182

Intel P4 Dynamic Pipeline

Page 183: CSCI-2500: Computer Organization

CSCI-2500 FALL 2012, Processor Design, Chapter 4 — 183

Summary of Pipeline Technology

We’ve exhausted

this!!