pipelined mips porcessor

7/31/2019 Pipelined MIPS Porcessor

1/48

Erik Jonsson School of En ineerin andComputer Science

e n vers y o exas a a as

The Pipelined MIPS Processor

The Pipelined MIPS Processor

We complete our study of computer architecture by investigating anapproach providing even higher performance for the MIPS CPU.

We first saw how the MIPS CPU erformance could be im roved

by converting the so-called single-cycle CPU to a multi-cycle design.

In the multi-cycle approach, instead of using a single clock cycle for the

whole instruction, the clock is accelerated, and instructions execute in

phases over several clock cycles.

Each instruction phase takes one clock cycle.

This means that as each instruction executes, only one section of the

CPU will be active per clock cycle -- the one executing that phase of theinstruction.

This suggests that perhaps we might redesign the CPU slightly so

N. B. Dodge 01/121 Lecture #21: The Pipeline MIPS Processor

a every sec on can opera e n epen en y on an ns ruc on

at the same time.


2/48



The Laundry ExampleThe Laundry Example

As an introduction to the concept of pipelining, Patterson andHennessy use the example of doing ones laundry.

Most eo le have or have access to a washer and dr er.

Assume that you need to wash several washer loads of clothing.

Would anyone divide the clothing into washer loads and then, ,

second?

No, if you were washing clothes, you would finish washing the first

, , . If there were more loads to wash, you would begin to fold and put

away finished clothing while the later loads were washing and


.

We can see this schematically on the next slide.


3/48



Graphical Example of the Laundry CycleGraphical Example of the Laundry Cycle



4/48



The Pipeline ProcessorThe Pipeline Processor

Patterson and Hennessy applied this simultaneous wash-dry-fold-put away concept to the single-cycle computer model.

, , ,

simultaneously so that the instruction throughput the number of

clock cycles per instructions could be dramatically decreased.,

cycle, but the clock must be as slow as the slowest instruction.

In the multi-cycle implementation, the clock runs faster, instructions

- , . What if, each time the clock ticked, we could process an instruction

in each section of the multicycle processor? Then we could process


,

completing an instruction every clock cycle.


5/48



Pipeline ArchitecturePipeline Architecture

A pipelined computer executes instructions concurrently.

Hardware units are organized into stages:

Execution in each stage takes exactly 1 clock period.

partial results to the next stage.

Unfortunately, as noted earlier, speed = complexity + cost.

e p pe ne approac r ngs a ona expense p us sown set of problems and complications, called hazards,

which we will also study.



6/48



Sequential Versus Pipelined ExecutionSequential Versus Pipelined Execution

(clock

cycles) 0 1 2 3 4 5 6 7 8 9 10

lw $t0, 16($a3) Instruc.Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

lw $t1, 32($a3)

lw $t2, 48($a3)

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Instruc.

Fetch

Reg.

Fetch

AL

Proc

4 clock cycles

4 clock cycles

etc.Timeline

(clock

cycles) 0 1 2 3 4 5 6 7 8 9 105 clock cycles

lw $t0, 16($a3)

lw $t1, 32($a3)

Instruc.Fetch

Reg.Fetch

ALUProcess

Mem. R/Wor ALU Out

Reg.Write

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Instruc. Reg. ALU Mem. R/W Reg.


lw $t2, 48($a3)etc.

e c e c rocess or u r e


7/48



Speed Advantage of the PipelineSpeed Advantage of the Pipeline

e mu t cyc e, ser a processor t at we stu e ast ecture can

execute n instructions in ns clock periods, or ETS= ns, where

A pipelined processor with s stages can execute n instructions in

ET = s + n

1 clock eriods. The ideal pipeline speedup depends on the number of stages, and

can be greater for more stages (hence Intels choice of a 20-stage

i eline for the current P-IV .

Thus the speed advantage of pipeline over multicycle can be

defined as:sET nsS s


Ps n


8/48



Pipeline StagesPipeline Stages

0 1 2 3 4 5

ID/

oc cyc es

The MIPS R2000 pipeline processor is divided into five processing

RF

stages:

1. Instruction fetch (IF)

2. Instruction decode (ID) and register fetch (RF)

3. ALU instruction execution (ALU) ALU processing, branchcondition evaluation, memory address computation, etc. This is also

referred to as execution (EX)


. emory access

5. Write back (WB) to register file


9/48



Overlapped Pipeline ExecutionOverlapped Pipeline Execution

0 1 2 3 4 5 6 7

Clock cycles

ALUIFID/

RFMEM WB Instruction 1

ALUIFID/

RFMEM WB

ID/

Instruction 2

RF


Instruction execution order


10/48



Single-Cycle DatapathSingle-Cycle Datapath

ADDBranch

Mem. Read

32

32

32

32Reg. Dest.

M

U

X

ADDMem. To Reg.ALU Op.

Reg. Write

ALU Srce.

+

Left

shift

2

s0-31

32

5

ControlMem. Write

6 (Bits 26-31)

ALUPInstruction

AddressM

Data

AddressInst.

M

s

Rt

Rd

Read

Data 1

Read

Data 2Read

Instructionbit

32

32

3232

5

5

M

U

X

5

ReadWrite

Mem./Reg.

Select

Lines indicate need for

UX

-

UXReg. Block

WriteData

Sign

Extend

32

Write

Data

Data

16 (Bits 0-15)

32

32

ALU

Instruction

Memory

Data

Memory


storage between stages if

processor is converted to

pipeline

6 (Bits 0-5)


11/48



Single-Cycle Datapath with Pipeline RegistersSingle-Cycle Datapath with Pipeline Registers

MU

X

Inter-stage registers are master-slave D flip-flops; the master can

be receiving new data from the previous stage of the instructionwhile the slave flip-flop is providing data to the next stage

ADD

ADD+4

Memory

Left

shift

2

Compare

resultReg. Block

ALU

P

C

ns ruc on

Address Memory

M

U

Data

Address

Inst.

0-31 Rt

RdM

U

Read

Data 1

Read

Data 2Read

Data

Sign

Extend

Write

Data

16 32

Write

DataMaster side

of register

Slave side


IF/ID ID/EX EX/MEM MEM/WB

of register

Note: Control lines and

logic not shown for clarity After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition


12/48



Instruction Process Through Pipeline (1)Instruction Process Through Pipeline (1)

MU

X

ADD

ADD+4

Memory

Left

shift

2

Compare

resultReg. Block

ALU

P

C

ns ruc on

Address Memory

M

U

Data

Address

Inst.

0-31 Rt

RdM

U

Read

Data 1

Read

Data 2Read

Data

Stage 1: Instruction

loaded into IF/ID

Sign

Extend

Write

Data

16 32

Write

Data


,

After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition



13/48



Instruction Process Through Pipeline (2)Instruction Process Through Pipeline (2)age : ns ruc on

decoded, register data

accessed, immediates

sign-extended

MU

X

ADD

ADD+4

Memory

Left

shift

2Compare

resultReg. Block

ALU

P

C

Instruction

Address Memory

M

U

Data

Address

Inst.

0-31

s

Rt

RdM

U

Read

Data 1

Read

Data 2Read

Data

Sign

Extend

XWriteData

16

X

32

Write

Data





14/48




Stage 3: Instruction

executed / branchaddress computed

M

U

X

ADD

ADD+4

Memory

Left

shift

2

Compare

resultReg. Block

ALU

P

C

ns ruc on

Address Memory

M

U

Data

Address

Inst.

0-31 Rt

RdM

U

Read

Data 1

Read

Data 2Read

Data

Sign

Extend

Write

Data

16 32

Write

Data





15/48




store, branch taken/not taken

ALU results bypass taken to

MEM/WB register

M

U

X

ADD

ADD+4

Memory

Left

shift

2

Compare

resultReg. Block

ALU

P

C

ns ruc on

Address Memory

M

U

Data

Address

Inst.

0-31 Rt

RdM

U

Read

Data 1

Read

Data 2Read

Data

Sign

Extend

Write

Data

16 32

Write

Data





16/48




M

UX

ADD+4

Instruction

Memory

Left

shift

2Compare

resultRs

Reg. Block

ALU

CAddress emory

M

U

X

Data

Address

Inst.

0-31 Rt

Rd

Write

M

U

X

Read

Data 1

Read

Data 2Read

Data

Stage 5: Result

write-back to

Sign

Extend

Data

16 32

WriteData


dest. register




17/48



Adding ControlAdding Control

Control information must be carried along as a part ofthe instruction, since this information is required at

different stages of the pipeline.

This can be done by adding more inter-stage storage

The result is very large inter-stage registers. For

example, the storage capacity required between the

instruction decode and ALU execution stages (ID/EXregister) is more than 120 bits.


is shown on the next slide


18/48



ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

Regis

terWr

P InstructionAddress

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1

Ins

tructionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emor

y/ALUResult

ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

Full PipelineDesign with

MemorySign

Extend

32DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


After David A. Patterson and John L. Hennessy,Computer Organization and Design, 2nd Edition

Reg. Dst.


19/48



The Pipeline in ActionThe Pipeline in Action

The following instruction sequence from the P&H textillustrates the pipeline in action.

w ,

sub $11, $2, $3

, ,

or $13, $6, $7

add $14 $8 $9

Note that registers are identified by number rather

than the letter ids, since that is the way they appear in


t e processor. s a rem n er, = at, - = t -

t6, $2-3=$v0-v1, $4-7=$a0-a3, etc.


20/48

IF: Idle ID/RF: Idle EX: Idle MEM: Idle WB:Idle

M

ID/EX EX/MEM MEM/WB

rite

ADD

ADD+4

U

X

IF/ID

0-31

Decode

rite

ead

Reg

ister

P

Instruction

Address

Memory

Left

shift

2

Inst.0-31

Reg. Block

Rs

Rt Read

Data 1Instructionbit

ALU Srce

Branch

MemoryW

MemoryR

emory/ALUResul

ALUM

U

X

Data

AddressM

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

Memory

Sign

Extend

a as -

Bits 16-20

Bits 11-15

ALU

Cont.

M

U

X

ALU Op

After David A. Patterson and John L. Hennessy,Computer Organization and Design, 2nd Edition20

Reg. Dst.

Lecture #21: The Pipeline MIPS Processor


21/48

IF: lw $10 20 $1 ID/RF: Idle EX: Idle MEM: Idle WB:Idle

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emor

y/ALUResult

ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

MemorySign

Extend

32DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



22/48

IF: sub $11 $2 $3 ID/RF: lw $10 20 $1 EX: Idle MEM: Idle WB:Idle

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

[ $1 ]

$ 1

$ 10

PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emor

y/ALUResult

X ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

$ 10

X

20

MemorySign

ExtendDataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



23/48

IF: and $12 $4 $5 MEM: Idle WB:ID/RF: sub $11 $2 $3 EX: lw $10 20 $1Idle

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

[ $ 2 ] [ $1 ]

$ 3

$ 2

PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emor

y/ALUResult

[ $ 3 ]

add20

ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

$ 11

X

X

20

$ 10

$ 10

MemorySign

Extend

DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



24/48

IF: or $13 $6 $7 WB:ID/RF: and $12 $4 $5 EX: sub $11 $2 $3 MEM: lw $10 20 $1Idle

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

[ $2 ]

$ 4

$ 5

PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emory/ALUResult

[$3]

[ $3 ]sub

ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

X

X

$ 12 $ 11$ 11 $ 10

MemorySign

Extend

DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



25/48

IF: add $14 $8 $9 ID/RF: or $13 $6 $7 EX: and $12 $4 $5 MEM: sub $11 $2 $3 WB: lw $1020($1)

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

[ $6 ]

$ 6

$ 7

[ $4 ]PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emory/ALUResult

[ $7 ] [$5]

[ $5 ]and

$10 ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

X

X

$ 13 $ 12$ 12 $ 11 $ 10

MemorySignExtend

32 DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



26/48

IF: Idle ID/RF: add $14 $8 $9 EX: or $13 $6 $7 MEM: and $12 WB: sub$4, $5 $11, $2, $3

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

[ $8 ]

$ 8

$ 9

[ $6 ]PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emory/ALUResult

[ $9 ]$11 [$7]

[ $7 ] or

ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

X

X

$ 14 $ 13$ 13 $ 12 $ 11

MemorySignExtend

DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



27/48

IF: Idle ID/RF: Idle WB EX: add $14 $8 $9$12, $4, $5

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

[ $8 ]PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emory/ALUResult

[$9]

[ $9 ]add

$12 ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

$ 14$ 14 $ 13 $ 12

MemorySignExtend

32 DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



28/48

IF: Idle ID/RF: Idle WB EX: Idle$13, $6, $7

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emory/ALUResult

$13ALU

M

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

$ 14 $ 13

MemorySignExtend

32 DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



29/48

IF: Idle ID/RF: Idle WB EX: Idle MEM: Idle$14, $8, $9

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emory/ALUResult

$14ALU

M

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

$ 14

MemorySignExtend

32 DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op


Reg. Dst.



30/48

IF: Idle ID/RF: Idle EX: Idle MEM: Idle WB:Idle

ID/EX EX/MEM MEM/WB

ite

ADD+4

U

X

IF/ID

-31

Control

Decode

ite d

RegisterWr

PInstruction

Address

Memory

Left

shift

2

Inst.

Reg. Block

Rs

Rt ReadData 1In

structionbits

ALU Srce

Branch

MemoryWr

MemoryRe

emory/ALUResult

ALUM

U

X

Data

Address

-

M

U

X

Rd

Write

Data

Read

Data 2

Write

Read

Data

M

MemorySignExtend

32 DataBits 0-15

Bits 16-20

Bits 11-15

ALU

Cont.

M

UALU Op

30

Reg. Dst.



31/48

Erik Jonsson School of En ineerin and

Computer Sciencee n vers y o exas a a as

Pipeline Processor Operation SummaryPipeline Processor Operation Summary

Pipelining replaces the single-cycle processor with a

- ,

completing one part of each instruction.

A new instruction is started every clock cycle. Inter-process registers store instruction information

(data, write register, branch conditions) between cycles

between the pipeline stages.

When the pipeline is filled with instructions, an


instruction completes every clock cycle.


32/48



Exercise 1Exercise 1

On the diagram on the next page, identify the

following:

1. Highlight all the control lines that must be active during a load

word instruction.2. As in our exercise in Lecture 20, identify the decoder

locations.

3. The ID/EX Re ister interface stores the most bits of an of the

pipeline section interfaces. Approximately how many bits isthat, according to the diagram?



33/48

Print out a copy of this diagram and bring to class.

M

U

X

ID/EX EX/MEM MEM/WB

Control

Decode

gisterWrite

ADD

ADD+4

Memory

Left

shift

2

Reg. Block

ionbits0-3

1

Branch moryWrite

moryRead

Re

Result

ALU

P

C

Instruction

Address

MData

Inst.

0-31

M

Rt

Rd

Read

Data 1

ReadData 2 Read

Instruct

ALU Srce Me

Me

Memory/AL

Memory

U

X

U

X

Write

Data

Sign

Extend

32Write

Data

a a

Bits 0-15

ALU

Bits 16-20

Bits 11-15

.

MU

X

ALU Op

Reg. Dst.



34/48



HazardsHazards

Hazards occur because data required for executing the

.

An instruction in the register fetch cycle may need

data from a register whose value will be changed by aninstruction downstream but still in process in the

pipeline (in the ALU, memory/memory bypass or

writeback c cle .

Thus an upstream instruction could access a register

and get incorrect data because the register data has not


yet een up ate y a ownstream nstruct on.


35/48



Hazards (2)Hazards (2)

There are two types of hazards, data hazards, andcontrol hazards.

Both occur because an instruction in the ID/RF stage of

the MIPS pipeline needs register data that will be

MEM/Bypass, or WB stage.

Data hazards occur when an instruction needs register

contents for an arithmetic/ logical/memory instruction. Control hazards occur when a branch instruction is


branch is not yet available in the same sort of scenario.


36/48



Data Hazard in the PipelineData Hazard in the Pipelineme ne

(clock


sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2 Instruc. Reg. ALU Mem. R/W Reg.

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

add $14, $2, $2

sw $15, 100($2) Instruc.Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

In the instruction sequence above, the last four instructionsrequire data from $2, which is changed in the first instruction.

The $2 data will not be rewritten until cycle 4, so the AND and OR


n an r ns ruc ons w e c ncorrec a a rom .

Even the add may not get the correct information (sw is okay).


37/48



Control Hazards in the PipelineControl Hazards in the Pipelineme ne

(clock


sub $2, $1, $3

blt $2, $8, wait

b t $2, $7, o Instruc. Reg. ALU Mem. R/W Reg.

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

add $14, $2, $2

sw $15, 100($2) Instruc.Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Here the problem is changed, with two branch instructions added.

Neither branch instruction may be executed correctly, once again


.

This wrong data could cause an incorrect branch.


38/48



Forwarding as a Solution to Data HazardsForwarding as a Solution to Data Hazards

0 1 2 3 4 5

oc cyc es

ID/

RF

ID/

One solution to the problem of data hazards is forwarding.

RF

Forwarding uses the fact that although instruction 2 needs registerdata two clock cycles before instruction 1 enters the WB stage, that

data is already available as the output of the ALU.


If a mechanism were available, instruction 1 could forward required

register data after its ALU cycle to the ID/RF cycle of instruction 2.


39/48



Forwarding Unit in the PipelineForwarding Unit in the Pipeline

Rs

Rt

Read

Data 1

ID/EX EX/MEM MEM/WB

M

U

ALU MU

X

Rd

Write

Data

Read

Data 2

Data

AddressRead

Data

M

U

X

Forward A

Memory

r e

Data

M

X

Rs

RtEX/MEM Register Rd

Forward B

X

Forwarding

UnitMEM/WB Register Rd




40/48



Forwarding Unit OperationForwarding Unit Operation

ALU

Reg. Block

Memory

ForwardingUnit

The forwarding unit samples register ids in the EX/MEM and

MEM/WB registers to determine if source registers in the ID/RF

cyc e are e same.

If so, source register data is replaced by pipeline (as yet unwritten)

data by the forwarding unit.


The correct information is thus processed and the instruction can

proceed to correct execution.


41/48



StallsStalls

Forwarding will not always solve the problems of data hazards.

For exam le su ose an add instruction follows a load word lw

and the add involves the register that receives the memory data.

In this case, forwarding will not work.

,

will not be available until the end of the MEM cycle. Thus the

required data is not available for a forward, and the add

instruction. if it roceeds will rocess the wron data.

A solution to this problem is the stall.

A stall halts the instruction awaiting data, while the key


cycle, after which the desired data is available to the add.


42/48



Result of Stall ApproachResult of Stall Approachme ne

(clock


lw $2, 32($3)

add $14, $6, $2

sw $15, 80 $2 Instruc. Reg. ALU Mem. R/W Reg.

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Consider the 3 instructions above, the last two

depending on the lw.

$2 contents will be available at the beginning of the WBstage in the first instruction, but not before.


,

the add and sw instructions hold place for one cycle.


43/48



Result of Stall Approach (2)Result of Stall Approach (2)(clock


lw $2, 32($3)

add $14, $6, $2 (delayed 1 count)

sw $15, 80($2) (delayed 1 count)Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

Instruc.

Fetch

Reg.

Fetch

ALU

Process

Mem. R/W

or ALU Out

Reg.

Write

.

Fetch

.

Fetch Process

.

or ALU Out

.

Write

With the delay, the lw result feeds the ALU input stage

of the add instruction, and the fetch stage of the sw.

Note that forwarding in still required (this time fromthe MEM/WB interface, not the ALU output).


, ,

following a lw must also be delayed for one clock cycle.


44/48



Other Problems With BranchesOther Problems With Branches

A remaining problem is what to do about instructions following a

branch. Even assuming forwarding and stalls, the branch/no

branch decision is not made until the third stage. This means that

in the MIPS pipeline, two following instructions will enter the pipe

before the branch/no branch decision is made. What if: The following instructions were for the case of branch taken and

the branch was not taken.

The following instructions were for branch not taken and it was

a en.

In either case, the wrong instructions are in the pipe and they must

be eliminated (flushed). How can this problem be prevented?


ew approac es to t e pro em are s own n t e o ow ng s es.


45/48



Control Hazard Approaches (1)Control Hazard Approaches (1)

MIPS R-2000 Pipeline Processor

WBALU/EX

ID/RFIFMEM/

One a roach is to alwa s assume the branch is or is not taken:

Direction of pipeline flow

Say we assume the branch is never taken. Then if the instruction in ALU/EX

is a branch, the instructions in IF and ID/RF will be those in the not taken

program line (branch determination is made in ALU/EX).

s assump on s correc , e p pe ne w con nue o ow w ou e ay.

When the branch is taken, instructions in IF and ID/RF must be flushed,

usually by changing the op code of those instructions to a nop and letting

them proceed to the end of the pipe.


Clearly, a 2-clock time delay is involved here, and it would be worse for longer

pipelines (P-IV pipeline ~ 20 stages).


46/48





WBALU/EXID/RF

IFMEM/

Branch

Comparator

Reducing the cost of taking the branch:

In this case, a branch assumption is still made (taken or not taken).

identified in the ID/RF stage, a comparator can be added there to do thebranch/no-branch determination.

With the branch determination made in this earl sta e onl one


instruction must be flushed, in the IF stage (only a 1-instruction delay).


47/48





WBALU/EXID/RF

IFMEM/

Branch feedback based on History

Branch

History

ynam c ranc pre c on ase on recen ranc s ory:

In this approach, an indicator bit (0/1) gives the last branch condition.

The next branch can be made according to the bit setting.

,

time until a substantial number of calculations are complete. Some schemes use 2 bits and do not change the prediction until the

predictor is wrong twice, after which the alternate behavior is chosen.


In either case, incorrect predictions will still be made, but hopefully notas often.


48/48



Exercise 2Exercise 2

1. Explain forwarding in your own words.

.

problem be solved?

3. Wh could 2-bit d namic branch rediction work toensure about a 1% error rate in branch prediction in

a subroutine that loops about 100 times before

called frequently, and that it always executes 100 ormore loop traversals before returning to the calling


program.

pipelined mips porcessor

Documents