pipeline performance
Pipeline Performance
Assume the time for stages is
  - 100ps for register read or write
  - 200ps for other stages
Compare the pipelined datapath with the single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
Pipeline Performance
Single-cycle (Tc = 800ps)
Pipelined (Tc = 200ps)
Pipeline Speedup
• If all stages are balanced, i.e., all take the same time:
    Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
• If not balanced, speedup is less
• Speedup is due to increased throughput
• Latency (time for each instruction) does not decrease
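
The speedup claim can be checked against the stage times in the table. The following is a minimal sketch, assuming the five stage latencies listed for lw; the names STAGE_PS, single_cycle_period and pipelined_period are illustrative and not from the slides.

```python
# Stage latencies in picoseconds, taken from the lw row of the table.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_period(stages):
    """One long clock cycle must cover every stage of the slowest instruction."""
    return sum(stages.values())


def pipelined_period(stages):
    """The pipelined clock is set by the slowest single stage."""
    return max(stages.values())


t_single = single_cycle_period(STAGE_PS)  # 800 ps
t_pipe = pipelined_period(STAGE_PS)       # 200 ps
speedup = t_single / t_pipe               # 4.0

print(f"single-cycle Tc = {t_single} ps, pipelined Tc = {t_pipe} ps")
print(f"steady-state speedup = {speedup} with {len(STAGE_PS)} stages")
```

Because the stages are not balanced (two of them need only 100ps), the steady-state speedup comes out at 4 rather than the 5 that the number of stages would suggest.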
Pipelining and ISA Design
• MIPS ISA designed for pipelining
  • All instructions are 32 bits
    - Easier to fetch and decode in one cycle
    - cf. x86: 1- to 17-byte instructions
  • Few and regular instruction formats
    - Can decode and read registers in one step
  • Load/store addressing
    - Can calculate address in 3rd stage, access memory in 4th stage
  • Alignment of memory operands
    - Memory access takes only one cycle
Improving performance
Two ideas for improving performance:
1. Split each instruction into multiple steps, each taking 1 cycle
   - steps: IF (instruction fetch), ID (instruction decode), EX (execute ALU operation), MEM (memory access), WB (register write-back)
   - slow instructions take more cycles than fast instructions
   - known as a multi-cycle implementation
2. Crucial observation: each instruction uses only a portion of the datapath in each step
   - can overlap instructions; each uses one portion of the datapath
   - known as a pipelined implementation
   - Examples of pipelining: any assembly process (cars, sandwiches), multiple loads of laundry (washer + dryer can be pipelined), etc.
Pipelining not just Multiprocessing
• Pipelining does involve parallel processing, but in a specific way
• Both multiprocessing and pipelining relate to the processing of multiple “things” using multiple “functional units”
  - In multiprocessing, each thing is processed entirely by a single functional unit (e.g. multiple lanes at the supermarket)
  - In pipelining, each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit (e.g. checker vs. bagger)
• Pipelining and multiprocessing are not mutually exclusive
  - Modern processors do both, with multiple pipelines (e.g. superscalar)
• Pipelining is a general-purpose efficiency technique; used elsewhere in CS:
  - Networking, I/O devices, server software architecture
Instruction Fetch (IF)
[Figure: single-cycle datapath: instruction memory, registers, ALU, data memory]
While IF is executing, the rest of the data path is sitting idle…
Instruction Decode (ID)
[Figure: single-cycle datapath: instruction memory, registers, ALU, data memory]
Then while ID is executing, the IF-related portion becomes idle…
Execute (EX)
[Figure: single-cycle datapath: instruction memory, registers, ALU, data memory]
…and so on for the EX portion…
Memory (MEM)
[Figure: single-cycle datapath: instruction memory, registers, ALU, data memory]
…the MEM portion…
Write back (WB)
[Figure: single-cycle datapath: instruction memory, registers, ALU, data memory]
…and the WB portion
Decoding and fetching together
Why don’t we go ahead and fetch the next instruction while we’re decoding the first one?
[Figure: single-cycle datapath, labeled “Fetch 2nd” and “Decode 1st instruction”]
Executing, decoding and fetching
• Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction.
• But now the instruction memory is free again, so we can fetch the third instruction!
[Figure: single-cycle datapath, labeled “Fetch 3rd”, “Decode 2nd” and “Execute 1st”]
Break datapath into 5 stages
• Each stage has its own functional units
• Full pipeline: the datapath is simultaneously working on 5 instructions!
[Figure: datapath divided into the stages IF, ID, EXE, MEM, WB; the newest instruction is in IF, the oldest in WB]
A pipeline diagram
Clock cycle         1    2    3    4    5    6    7    8    9
lw $t0, 4($sp)      IF   ID   EX   MEM  WB
sub $v0, $a0, $a1        IF   ID   EX   MEM  WB
and $t1, $t2, $t3             IF   ID   EX   MEM  WB
or $s0, $s1, $s2                   IF   ID   EX   MEM  WB
addi $sp, $sp, -4                       IF   ID   EX   MEM  WB

• A pipeline diagram shows the execution of a series of instructions
  - The instruction sequence is shown vertically, from top to bottom
  - Clock cycles are shown horizontally, from left to right
  - Each instruction is divided into its component stages
• This clearly indicates the overlapping of instructions. For example, there are three instructions active in the third cycle above.
  - The “lw” instruction is in its Execute stage.
  - Simultaneously, the “sub” is in its Instruction Decode stage.
  - Also, the “and” instruction is just being fetched.
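
The overlap shown in the pipeline diagram can also be generated programmatically. This is a small sketch, assuming an ideal five-stage pipeline with no stalls; the helper names pipeline_diagram and active_in_cycle are made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = len(STAGES) + len(instructions) - 1
    rows = []
    for i in range(len(instructions)):
        row = [""] * total_cycles
        # Instruction i is fetched in cycle i + 1 and advances one stage per cycle.
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        rows.append(row)
    return rows


def active_in_cycle(instructions, cycle):
    """Map each instruction that is in flight during `cycle` (1-based) to its stage."""
    rows = pipeline_diagram(instructions)
    return {
        instr: row[cycle - 1]
        for instr, row in zip(instructions, rows)
        if row[cycle - 1]
    }


program = [
    "lw $t0, 4($sp)",
    "sub $v0, $a0, $a1",
    "and $t1, $t2, $t3",
    "or $s0, $s1, $s2",
    "addi $sp, $sp, -4",
]

for instr, row in zip(program, pipeline_diagram(program)):
    print(f"{instr:<20}" + "".join(f"{cell:<5}" for cell in row))

print(active_in_cycle(program, 3))
```

Asking for cycle 3 returns the three instructions named on the slide (lw in EX, sub in ID, and in IF), and cycle 5 is the only cycle in which all five instructions are in flight.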
Pipeline terminology
• The pipeline depth is the number of stages (in this case, five)
• In the first four cycles here, the pipeline is filling, since there are unused functional units
• In cycle 5, the pipeline is full. Five instructions are being executed simultaneously, so all hardware units are in use
• In cycles 6-9, the pipeline is emptying

Clock cycle         1    2    3    4    5    6    7    8    9
lw $t0, 4($sp)      IF   ID   EX   MEM  WB
sub $v0, $a0, $a1        IF   ID   EX   MEM  WB
and $t1, $t2, $t3             IF   ID   EX   MEM  WB
or $s0, $s1, $s2                   IF   ID   EX   MEM  WB
addi $sp, $sp, -4                       IF   ID   EX   MEM  WB
filling: cycles 1-4 | full: cycle 5 | emptying: cycles 6-9
Pipelining Performance
• Execution time on ideal pipeline:
  - time to fill the pipeline + one cycle per instruction
  - How long for N instructions? k - 1 + N, where k = pipeline depth
• Alternate way of arriving at this formula: k cycles for the first instruction, plus 1 for each of the remaining N - 1 instructions.
• Compare this pipelined implementation (2ns clock period) vs. a single cycle implementation (8ns clock period). How much faster is pipelining for N=1000?

Clock cycle         1    2    3    4    5    6    7    8    9
lw $t0, 4($sp)      IF   ID   EX   MEM  WB
lw $t1, 8($sp)           IF   ID   EX   MEM  WB
lw $t2, 12($sp)               IF   ID   EX   MEM  WB
lw $t3, 16($sp)                    IF   ID   EX   MEM  WB
lw $t4, 20($sp)                         IF   ID   EX   MEM  WB
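
The N = 1000 comparison can be worked out directly from the k - 1 + N formula. The sketch below assumes an ideal pipeline (no stalls) with depth k = 5, the 2ns and 8ns clock periods given on the slide, and one instruction per cycle in the single-cycle design; the function names are illustrative.

```python
def pipelined_cycles(n_instructions, depth):
    """Ideal pipeline: depth - 1 cycles to fill, then one cycle per instruction."""
    return depth - 1 + n_instructions


def execution_time_ns(cycles, clock_period_ns):
    return cycles * clock_period_ns


N = 1000
K = 5  # pipeline depth

pipelined_ns = execution_time_ns(pipelined_cycles(N, K), 2)  # 1004 cycles * 2 ns
single_cycle_ns = execution_time_ns(N, 8)                    # 1000 cycles * 8 ns
speedup = single_cycle_ns / pipelined_ns

print(f"pipelined:    {pipelined_ns} ns")
print(f"single-cycle: {single_cycle_ns} ns")
print(f"speedup:      {speedup:.2f}x")
```

Under these assumptions the pipelined design needs 2008ns against 8000ns, a speedup just under 4. The fill time of k - 1 cycles is why the result falls slightly short of the ratio of the clock periods.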
Pipeline Datapath: Resource Requirements
Clock cycle         1    2    3    4    5    6    7    8    9
lw $t0, 4($sp)      IF   ID   EX   MEM  WB
lw $t1, 8($sp)           IF   ID   EX   MEM  WB
lw $t2, 12($sp)               IF   ID   EX   MEM  WB
lw $t3, 16($sp)                    IF   ID   EX   MEM  WB
lw $t4, 20($sp)                         IF   ID   EX   MEM  WB

• We need to perform several operations in the same cycle.
  - Increment the PC and add registers at the same time.
  - Fetch one instruction while another one reads or writes data.
• What does that mean for our hardware?
  - Separate ADDER and ALU
  - Two memories (instruction memory and data memory)
Single-cycle datapath, slightly rearranged
[Figure: single-cycle datapath redrawn with the PC, the PC + 4 adder and the branch-target adder (shift left 2), instruction memory, registers, sign extend, ALU and data memory laid out left to right]
Pipeline registers
• In pipelining, we divide instruction execution into multiple cycles
  - IF ID EX MEM WB
• Information computed during one cycle may be needed in a later cycle:
  - Instruction read in IF stage determines which registers are fetched in ID stage, what immediate is used for EX stage, and what destination register is for WB
  - Register values read in ID are used in EX and/or MEM stages
  - ALU output produced in EX is an effective address for MEM or a result for WB
• A lot of information to save!
  - Saved in intermediate registers called pipeline registers
• The registers are named for the stages they connect:
    IF/ID   ID/EX   EX/MEM   MEM/WB
• No register is needed after the WB stage, because after WB the instruction is done
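
One way to picture the pipeline registers is as four slots that each latch their predecessor's contents on every clock edge. The following is a simplified sketch under that assumption: each slot holds a whole illustrative instruction record (a dict with hypothetical "text" and "dest" fields) rather than the individual wires a real pipeline register would carry, and clock_tick is a made-up helper.

```python
PIPELINE_REGS = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]


def clock_tick(regs, fetched):
    """Advance one clock edge: every pipeline register latches the one before it."""
    # Walk from the back so a value is copied onward before it is overwritten.
    for i in range(len(PIPELINE_REGS) - 1, 0, -1):
        regs[PIPELINE_REGS[i]] = regs[PIPELINE_REGS[i - 1]]
    # The newly fetched instruction enters the first pipeline register.
    regs["IF/ID"] = fetched


# Illustrative instruction records; "dest" is the destination register number.
program = [
    {"text": "lw $8, 4($29)", "dest": 8},
    {"text": "sub $2, $4, $5", "dest": 2},
    {"text": "and $9, $10, $11", "dest": 9},
    {"text": "or $16, $17, $18", "dest": 16},
]

regs = {name: None for name in PIPELINE_REGS}
for instr in program:
    clock_tick(regs, instr)

for name in PIPELINE_REGS:
    print(f"{name:<7} {regs[name]['text']:<18} dest = ${regs[name]['dest']}")
```

After four clock edges the lw record, including its destination register number, has reached MEM/WB, which is where the write-back stage needs it. Nothing is stored past MEM/WB, matching the point that no register is needed after WB.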
Pipelined datapath
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB placed between the stages]
Propagating values forward
• Data values required later are propagated through the pipeline registers
• The most extreme example is the destination register (rd or rt)
  - It is retrieved in IF, but isn’t updated until the WB stage
  - Thus, it must be passed through all pipeline stages, as shown in red on the next slide
• Notice that we can’t keep a single “instruction register,” because the pipelined machine needs to fetch a new instruction every clock cycle
The destination register
[Figure: pipelined datapath with the destination register number (Instr [20 - 16] or Instr [15 - 11], selected by RegDst) carried through the pipeline registers to the Write register input]
What about control signals?
• Control signals are generated similarly to the single-cycle processor
  - in the ID stage, the processor decodes the instruction fetched in IF and produces the appropriate control values
• Some of the control signals will not be needed until later stages
  - These signals must be propagated through the pipeline until they reach the appropriate stage
  - We just pass them in the pipeline registers, along with the data
• Control signals can be categorized by the pipeline stage that uses them

Stage  Control signals needed
EX     ALUSrc    ALUOp     RegDst
MEM    MemRead   MemWrite  PCSrc
WB     RegWrite  MemToReg
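
The stage-by-stage grouping of control signals can be sketched as a lookup that says which signals each pipeline register still has to carry. This is an illustration only: the CARRIED table and signals_in_register helper are made up here, and the lw values are assumptions (in particular PCSrc = 0, chosen because lw is not a branch).

```python
# Control signals grouped by the stage that uses them, as in the table.
STAGE_SIGNALS = {
    "EX": ("ALUSrc", "ALUOp", "RegDst"),
    "MEM": ("MemRead", "MemWrite", "PCSrc"),
    "WB": ("RegWrite", "MemToReg"),
}

# Each pipeline register only needs the groups for stages still ahead of it.
CARRIED = {
    "ID/EX": ("EX", "MEM", "WB"),
    "EX/MEM": ("MEM", "WB"),
    "MEM/WB": ("WB",),
}

# Assumed control values for lw, produced once in the ID stage.
LW_SIGNALS = {
    "ALUSrc": 1, "ALUOp": "add", "RegDst": 0,
    "MemRead": 1, "MemWrite": 0, "PCSrc": 0,
    "RegWrite": 1, "MemToReg": 1,
}


def signals_in_register(reg_name, signals):
    """Return the subset of control signals that `reg_name` must hold."""
    return {
        name: signals[name]
        for stage in CARRIED[reg_name]
        for name in STAGE_SIGNALS[stage]
    }


for reg_name in CARRIED:
    print(reg_name, signals_in_register(reg_name, LW_SIGNALS))
```

The set shrinks as the instruction moves along: ID/EX holds all eight signals, EX/MEM drops the three EX signals once they have been used, and MEM/WB holds only RegWrite and MemToReg.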
Pipelined data path and control
[Figure: pipelined datapath with a Control unit in the ID stage; ID/EX carries the EX, M and WB control groups, EX/MEM carries M and WB, and MEM/WB carries WB]
An example execution sequence
Here’s a sample sequence of instructions to execute (addresses in decimal):
    1000: lw  $8, 4($29)
    1004: sub $2, $4, $5
    1008: and $9, $10, $11
    1012: or  $16, $17, $18
    1016: add $13, $14, $0
• We’ll make some assumptions, just so we can show actual data values:
  - Each register contains its number plus 100. For instance, register $8 contains 108, register $29 contains 129, etc.
  - Every data memory location contains 99
• Our pipeline diagrams will follow some conventions:
  - An X indicates values that aren’t important, like the constant field of an R-type instruction
  - Question marks ??? indicate values we don’t know, usually resulting from instructions coming before and after the ones in our example
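
The data values that appear in the cycle-by-cycle diagrams can be predicted from the stated assumptions. The sketch below executes the five instructions one after another rather than simulating the pipeline, which gives the same results here because no instruction reads a register written by an earlier one. The tuple encoding and function names are invented for illustration, and $0 is taken to hold 0 (it is hardwired to zero in MIPS) rather than 100.

```python
def initial_registers():
    """Each register holds its number plus 100; $0 is hardwired to 0 in MIPS."""
    regs = {n: n + 100 for n in range(32)}
    regs[0] = 0
    return regs


def run(program, regs, memory_value=99):
    """Execute sequentially; return (dest, ALU result, value written back) per instruction."""
    trace = []
    for op, dest, a, b in program:
        if op == "lw":
            alu_result = regs[a] + b   # effective address: base register + offset
            written = memory_value     # every data memory location contains 99
        elif op == "sub":
            alu_result = written = regs[a] - regs[b]
        elif op == "and":
            alu_result = written = regs[a] & regs[b]
        elif op == "or":
            alu_result = written = regs[a] | regs[b]
        elif op == "add":
            alu_result = written = regs[a] + regs[b]
        else:
            raise ValueError(f"unsupported operation: {op}")
        regs[dest] = written
        trace.append((dest, alu_result, written))
    return trace


# (operation, destination, source 1, source 2 or offset)
program = [
    ("lw", 8, 29, 4),
    ("sub", 2, 4, 5),
    ("and", 9, 10, 11),
    ("or", 16, 17, 18),
    ("add", 13, 14, 0),
]

trace = run(program, initial_registers())
for dest, alu_result, written in trace:
    print(f"${dest}: ALU result {alu_result}, written back {written}")
```

The ALU results (133, -1, 110, 119, 114) and the written-back values (99, -1, 110, 119, 114) are the numbers to look for in the EX and WB stages of the following cycles.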
Cycle 1 (filling)
IF: lw $8, 4($29)   ID: ???   EX: ???   MEM: ???   WB: ???
[Figure: pipelined datapath and control. PC = 1000, PC + 4 = 1004; all control signals and pipeline register contents are unknown (?)]
Cycle 2
IF: sub $2, $4, $5   ID: lw $8, 4($29)   EX: ???   MEM: ???   WB: ???
[Figure: pipelined datapath and control. PC = 1004, PC + 4 = 1008; ID reads $29 = 129 and sign-extends the offset 4; control signals for EX, MEM and WB are unknown (?)]
Cycle 3
IF: and $9, $10, $11   ID: sub $2, $4, $5   EX: lw $8, 4($29)   MEM: ???   WB: ???
[Figure: pipelined datapath and control. ID reads $4 = 104 and $5 = 105. EX: ALUSrc (1), ALUOp (add), RegDst (0); ALU computes 129 + 4 = 133, destination register 8. MEM and WB control signals are unknown (?)]
Cycle 4
IF: or $16, $17, $18   ID: and $9, $10, $11   EX: sub $2, $4, $5   MEM: lw $8, 4($29)   WB: ???
[Figure: pipelined datapath and control. ID reads $10 = 110 and $11 = 111. EX: ALUSrc (0), ALUOp (sub), RegDst (1); ALU computes 104 - 105 = -1, destination register 2. MEM: MemRead (1), MemWrite (0); address 133 returns 99. WB: RegWrite (?), MemToReg (?)]
Cycle 5 (full)
IF: add $13, $14, $0   ID: or $16, $17, $18   EX: and $9, $10, $11   MEM: sub $2, $4, $5   WB: lw $8, 4($29)
[Figure: pipelined datapath and control. ID reads $17 = 117 and $18 = 118. EX: ALUSrc (0), ALUOp (and), RegDst (1); ALU result 110, destination register 9. MEM: MemRead (0), MemWrite (0). WB: RegWrite (1), MemToReg (1); 99 is written to register 8]
Cycle 6 (emptying)
IF: ???   ID: add $13, $14, $0   EX: or $16, $17, $18   MEM: and $9, $10, $11   WB: sub $2, $4, $5
[Figure: pipelined datapath and control. ID reads $14 = 114 and $0 = 0. EX: ALUSrc (0), ALUOp (or), RegDst (1); ALU result 119, destination register 16. MEM: MemRead (0), MemWrite (0). WB: RegWrite (1), MemToReg (0); -1 is written to register 2]
Cycle 7
IF: ???   ID: ???   EX: add $13, $14, $0   MEM: or $16, $17, $18   WB: and $9, $10, $11
[Figure: pipelined datapath and control. EX: ALUSrc (0), ALUOp (add), RegDst (1); ALU result 114, destination register 13. MEM: MemRead (0), MemWrite (0). WB: RegWrite (1), MemToReg (0); 110 is written to register 9]
Cycle 8
IF: ???   ID: ???   EX: ???   MEM: add $13, $14, $0   WB: or $16, $17, $18
[Figure: pipelined datapath and control. EX control signals are unknown (?). MEM: MemRead (0), MemWrite (0). WB: RegWrite (1), MemToReg (0); 119 is written to register 16]
Cycle 9
IF: ???   ID: ???   EX: ???   MEM: ???   WB: add $13, $14, $0
[Figure: pipelined datapath and control. EX and MEM control signals are unknown (?). WB: RegWrite (1), MemToReg (0); 114 is written to register 13]
That’s a lot of diagrams there
Clock cycle         1    2    3    4    5    6    7    8    9
lw $t0, 4($sp)      IF   ID   EX   MEM  WB
sub $v0, $a0, $a1        IF   ID   EX   MEM  WB
and $t1, $t2, $t3             IF   ID   EX   MEM  WB
or $s0, $s1, $s2                   IF   ID   EX   MEM  WB
add $t5, $t6, $0                        IF   ID   EX   MEM  WB

• Compare the last few slides with the pipeline diagram above
  - You can see how instruction executions are overlapped
  - Each functional unit is used by a different instruction in each cycle
  - The pipeline registers save control and data values generated in previous clock cycles for later use
  - When the pipeline is full in clock cycle 5, all of the hardware units are utilized. This is the ideal situation, and what makes pipelined processors so fast
Note how everything goes left to right, except
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]