Arturo Díaz Pérez, Laboratorio de Tecnologías de Información, …adiaz/arqcomp/14... · 2014. 7....
TRANSCRIPT
Arquitectura de Computadoras — Pipelining: Performance — 1
Laboratorio de Tecnologías de Información

Pipelining: Performance and Limits
Arquitectura de Computadoras
Arturo Díaz Pérez
Centro de Investigación y de Estudios Avanzados del IPN
Laboratorio de Tecnologías de Información
[email protected]
Review: Summary of Pipelining Basics
Pipelines pass control information down the pipe just as data moves down the pipe. Forwarding/stalls are handled by local control. Hazards limit performance:
■ Structural: need more HW resources
■ Data: need forwarding, compiler scheduling
■ Control: early evaluation & PC, delayed branch, prediction

Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency. What about performance?
Is CPI = 1 for our pipeline?

Remember that CPI is an "average # cycles/instruction".
CPI here is 1, since the average throughput is 1 instruction every cycle. What if there are stalls or multi-cycle execution? Usually CPI > 1. How close can we get to 1?

[Figure: four overlapped instructions, each passing through IFetch, Dcd, Exec, Mem, WB]
Speedup Equation for Pipelining

Pipeline speedup = Average instruction time unpipelined / Average instruction time pipelined
                 = (CPI_unpipelined × Clock cycle_unpipelined) / (CPI_pipelined × Clock cycle_pipelined)

Pipeline speedup = (CPI_Ideal × Pipeline depth / CPI_pipelined) × (Clock cycle_unpipelined / Clock cycle_pipelined)

where  CPI_unpipelined = CPI_Ideal × Pipeline depth
and    CPI_pipelined   = CPI_Ideal + Pipeline stall cycles per instruction
Speedup Equation for Pipelining, cont.

Speedup = (CPI_Ideal × Pipeline depth / (CPI_Ideal + Pipeline stall cycles)) × (Clock cycle_unpipelined / Clock cycle_pipelined)

If we ignore cycle time differences:

Speedup = (CPI_Ideal × Pipeline depth) / (CPI_Ideal + Pipeline stall cycles)
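The speedup equation above can be checked with a small calculator (a sketch; the function and parameter names are illustrative, not from the slides):

```python
# Pipeline speedup, ignoring or including cycle-time differences:
# Speedup = (CPI_ideal * depth / (CPI_ideal + stalls)) * clock_ratio,
# where clock_ratio = clock_unpipelined / clock_pipelined.

def pipeline_speedup(pipeline_depth, stall_cycles_per_instr,
                     cpi_ideal=1.0, clock_ratio=1.0):
    cpi_pipelined = cpi_ideal + stall_cycles_per_instr
    return (cpi_ideal * pipeline_depth / cpi_pipelined) * clock_ratio

# Ideal 5-stage pipeline with no stalls: speedup equals the depth.
print(pipeline_speedup(5, 0.0))            # 5.0
# 0.5 stall cycles per instruction cut the speedup to 5/1.5.
print(round(pipeline_speedup(5, 0.5), 2))  # 3.33
```

With stalls, the speedup falls below the pipeline depth, which is the point the slide is making.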
Example: Dual-port vs. single-port

Machine A: dual-ported memory
Machine B: single-ported memory, 1.05 times faster clock rate
CPI_ideal = 1 for both
Loads are 40% of instructions executed

Speedup_B = Pipeline depth/(1 + 0.4 × 1) × (clock_unpipe/(clock_pipe/1.05)) = 0.75 × Pipeline depth
Speedup_A = Pipeline depth/(1 + 0) × (clock_unpipe/clock_pipe) = Pipeline depth
Speedup_A / Speedup_B = Pipeline depth / (0.75 × Pipeline depth) = 1.33

Machine A is 1.33 times faster
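The example's arithmetic can be reproduced directly (a sketch; the helper name is illustrative):

```python
# Machine B stalls one cycle on every load (40% of instructions) because
# its single-ported memory conflicts with instruction fetch, but its clock
# is 1.05x faster. Speedup per unit of pipeline depth:

def relative_speedup(stall_freq, stall_cycles, clock_scale):
    return clock_scale / (1.0 + stall_freq * stall_cycles)

speedup_a = relative_speedup(0.0, 1, 1.0)    # dual-ported: no structural stalls
speedup_b = relative_speedup(0.4, 1, 1.05)   # single-ported, faster clock
print(round(speedup_b, 2))              # 0.75
print(round(speedup_a / speedup_b, 2))  # 1.33
```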
Performance Summary

Just overlap tasks; easy if tasks are independent. Speedup ≤ Pipeline depth; if ideal CPI is 1, then:

Speedup = (Pipeline depth / (1 + Pipeline stall cycles)) × (Clock cycle_unpipelined / Clock cycle_pipelined)

• Hazards limit performance on computers:
  • structural: need more HW resources
  • data: need forwarding, compiler scheduling
  • control: discuss next time
Recap: Pipeline Hazards

[Figure: overlapped pipeline diagrams illustrating a structural hazard, a control hazard (jump), a RAW (read after write) data hazard, a WAW (write after write) data hazard, and a WAR (write after read) data hazard]
Hazard Detection

Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.

A RAW hazard exists on register ρ if ρ ∈ Rregs( i ) ∩ Wregs( j )
■ Keep a record of pending writes (for instructions in the pipe) and compare with the operand registers of the current instruction.
■ When an instruction issues, reserve its result register.
■ When an operation completes, remove its write reservation.

A WAW hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Wregs( j )
A WAR hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Rregs( j )
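The three set-intersection definitions above translate directly into code (a sketch; the function and argument names are illustrative):

```python
# Instruction i is about to issue; j is a predecessor still in the pipe.
# Each hazard class is an intersection of read/write register sets.

def hazards(read_i, write_i, read_j, write_j):
    return {
        "RAW": set(read_i) & set(write_j),   # i reads what j will write
        "WAW": set(write_i) & set(write_j),  # both write the same register
        "WAR": set(write_i) & set(read_j),   # i writes what j still reads
    }

# add r1,r2,r3 (j) followed by sub r4,r1,r3 (i): RAW on r1, nothing else.
print(hazards(read_i={"r1", "r3"}, write_i={"r4"},
              read_j={"r2", "r3"}, write_j={"r1"}))
```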
Record of Pending Writes in Pipeline Registers

Compare the current operand registers against the pending writes:

hazard <= ((rs == rw_ex)  & regW_ex)  OR
          ((rs == rw_mem) & regW_mem) OR
          ((rs == rw_wb)  & regW_wb)  OR
          ((rt == rw_ex)  & regW_ex)  OR
          ((rt == rw_mem) & regW_mem) OR
          ((rt == rw_wb)  & regW_wb)

[Figure: five-stage datapath (IAU, PC, I mem, Regs, alu, D mem) with op/rw fields carried in each pipeline register and rs/rt compared against them]
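The comparator logic above can be sketched as a small predicate (illustrative names; the slide's signal names rw/regW are kept):

```python
# Decode compares its source registers (rs, rt) against the destination
# register (rw) and register-write flag (regW) carried in each downstream
# pipeline register.

def detect_hazard(rs, rt, pipe_regs):
    """pipe_regs: list of (rw, regW) pairs for the EX, MEM, and WB stages."""
    return any(regW and (rs == rw or rt == rw) for rw, regW in pipe_regs)

# add r1,r2,r3 is in EX (writes r1); sub r4,r1,r3 sits in decode.
print(detect_hazard(rs=1, rt=3,
                    pipe_regs=[(1, True), (0, False), (0, False)]))  # True
```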
Resolve RAW by "forwarding" (or bypassing)

Detect the nearest valid write-op operand register and forward it into the op latches, bypassing the remainder of the pipe.
Increase muxes to add paths from the pipeline registers.
Data forwarding = data bypassing.

[Figure: the same datapath with forwarding muxes feeding the ALU inputs from the EX, MEM, and WB pipeline registers]
What about memory operations?

[Figure: datapath with the result (Rd) delayed in a register alongside the memory stage before writing back to the regfile]

If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations! What does delaying WB on arithmetic operations cost?
– cycles?
– hardware?

What about data dependence on loads?
    R1 <- R4 + R5
    R2 <- Mem[ R2 + I ]
    R3 <- R2 + R1
⇒ "Delayed loads." Can recognize this in the decode stage and introduce a bubble while stalling the fetch stage. Tricky situation:
    R1 <- Mem[ R2 + I ]
    Mem[ R3 + 34 ] <- R1
Handle with a bypass in the memory stage!
Software Scheduling

Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
    lw  Rb, b
    lw  Rc, c
    add Ra, Rb, Rc
    sw  a, Ra
    lw  Re, e
    lw  Rf, f
    sub Rd, Re, Rf
    sw  d, Rd

Fast code:
    lw  Rb, b
    lw  Rc, c
    lw  Re, e
    add Ra, Rb, Rc
    lw  Rf, f
    sw  a, Ra
    sub Rd, Re, Rf
    sw  d, Rd
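The difference between the two schedules can be counted mechanically. A minimal sketch, assuming a 1-cycle load delay slot and ignoring all other hazards (the tuple encoding is invented for the example):

```python
# An instruction that uses a load result in the very next slot stalls one cycle.

def load_use_stalls(program):
    """program: list of (op, dest, sources) tuples."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("lw", "Rb", ()), ("lw", "Rc", ()), ("add", "Ra", ("Rb", "Rc")),
        ("sw", None, ("Ra",)), ("lw", "Re", ()), ("lw", "Rf", ()),
        ("sub", "Rd", ("Re", "Rf")), ("sw", None, ("Rd",))]
fast = [("lw", "Rb", ()), ("lw", "Rc", ()), ("lw", "Re", ()),
        ("add", "Ra", ("Rb", "Rc")), ("lw", "Rf", ()),
        ("sw", None, ("Ra",)), ("sub", "Rd", ("Re", "Rf")),
        ("sw", None, ("Rd",))]
print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```

Moving the independent loads up hides both load delay slots, which is exactly the compiler's scheduling job here.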
Compiler Avoiding Load Stalls

% loads stalling pipeline (scheduled vs. unscheduled):

            scheduled   unscheduled
    tex        25%          65%
    spice      14%          42%
    gcc        31%          54%
Pipelined Processor

Separate control at each stage. Stalls propagate backwards to freeze previous stages. Bubbles are introduced into the pipeline by placing "noops" into the local stage while stalling previous stages.

[Figure: pipelined datapath (Next PC, Inst. Mem, Reg File, Exec, Mem Access, Data Mem) with per-stage control (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl), valid bits in IRex/IRmem/IRwb, stalls propagating backwards and bubbles inserted downstream]
FYI: MIPS R3000 clocking discipline

2-phase non-overlapping clocks. A pipeline stage is two (level-sensitive) latches.

[Figure: phi1/phi2 non-overlapping clock waveforms; a phi1 latch plus a phi2 latch behave like one edge-triggered stage]
MIPS R3000 Instruction Pipeline

    Stage:     Inst Fetch | Decode / Reg. Read | ALU / E.A.      | Memory       | Write Reg
    Resources: TLB, I-Cache | RF               | Operation, E.A. | TLB, D-Cache | WB

Write in phase 1, read in phase 2 => eliminates bypass from WB
Recall: Data Hazard on r1

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

[Figure: the five instructions overlapped in time through IF, ID/RF, EX, MEM, WB, showing where r1 is produced and consumed]

With the MIPS R3000 pipeline, there is no need to forward from the WB stage.
MIPS R3000 Multicycle Operations

Ex: multiply, divide, cache miss.
Use the control word of the local stage to step through a multicycle operation.
Stall all stages above the multicycle operation in the pipeline; drain (bubble) the stages below it.
Alternatively, launch multiply/divide into an autonomous unit, and only stall the pipe on an attempt to get the result before it is ready.
- This means stalling mflo/mfhi in the decode stage if a multiply/divide is still executing.

[Figure: datapath with an autonomous multiply unit alongside the ALU; its result (T, Rd) is written back to the regfile when ready]
Case Study: MIPS R4000 (200 MHz)

8-stage pipeline:
■ IF – first half of instruction fetch; PC selection happens here, as well as initiation of instruction cache access.
■ IS – second half of access to instruction cache.
■ RF – instruction decode and register fetch, hazard checking, and instruction cache hit detection.
■ EX – execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
■ DF – data fetch, first half of access to data cache.
■ DS – second half of access to data cache.
■ TC – tag check, determine whether the data cache access hit.
■ WB – write back for loads and register-register operations.

8 stages: what is the impact on load delay? Branch delay? Why?
Case Study: MIPS R4000

[Figure: successive instructions staggered through IF IS RF EX DF DS TC WB, showing a TWO-cycle load latency]

[Figure: the same stagger showing a THREE-cycle branch latency (conditions evaluated during the EX phase): the delay slot plus two stalls. A branch-likely cancels the delay slot if not taken]
MIPS R4000 Floating Point

FP adder, FP multiplier, FP divider. The last step of the FP multiplier/divider uses the FP adder HW. There are 8 kinds of stages in the FP units:

    Stage  Functional unit  Description
    A      FP adder         Mantissa ADD stage
    D      FP divider       Divide pipeline stage
    E      FP multiplier    Exception test stage
    M      FP multiplier    First stage of multiplier
    N      FP multiplier    Second stage of multiplier
    R      FP adder         Rounding stage
    S      FP adder         Operand shift stage
    U                       Unpack FP numbers
MIPS FP Pipe Stages

    FP Instr         1   2    3     4    5    6    7    8 …
    Add, Subtract    U   S+A  A+R   R+S
    Multiply         U   E+M  M     M    M    N    N+A  R
    Divide           U   A    R     D28 …  D+A, D+R, D+R, D+A, D+R, A, R
    Square root      U   E    (A+R)108 …  A    R
    Negate           U   S
    Absolute value   U   S
    FP compare       U   A    R

Stages: A mantissa ADD, D divide pipeline stage, E exception test, M first multiplier stage, N second multiplier stage, R rounding, S operand shift, U unpack FP numbers.
R4000 Performance

Not the ideal CPI of 1:
■ Load stalls (1 or 2 clock cycles)
■ Branch stalls (2 cycles + unfilled slots)
■ FP result stalls: RAW data hazard (latency)
■ FP structural stalls: not enough FP hardware (parallelism)

[Figure: stacked bar chart of CPI (0 to 4.5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv, broken into base, load stalls, branch stalls, FP result stalls, and FP structural stalls]
Improving Instruction Level Parallelism (ILP)

ILP: overlap execution of unrelated instructions. gcc: 17% control transfer:
■ 5 instructions + 1 branch
■ Must go beyond a single block to get more instruction level parallelism

Loop-level parallelism is one opportunity
■ First SW, then HW approaches

DLX floating point as example
■ Measurements suggest R4000 FP execution performance has room for improvement
FP Loop: Where are the Hazards?

    Loop: LD    F0,0(R1)    ;F0=vector element
          ADDD  F4,F0,F2    ;add scalar from F2
          SD    0(R1),F4    ;store result
          SUBI  R1,R1,8     ;decrement pointer 8B (DW)
          BNEZ  R1,Loop     ;branch R1!=zero
          NOP               ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

• Where are the stalls?
FP Loop Showing Stalls

    1 Loop: LD    F0,0(R1)   ;F0=vector element
    2       stall
    3       ADDD  F4,F0,F2   ;add scalar in F2
    4       stall
    5       stall
    6       SD    0(R1),F4   ;store result
    7       SUBI  R1,R1,8    ;decrement pointer 8B (DW)
    8       BNEZ  R1,Loop    ;branch R1!=zero
    9       stall            ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

9 clocks: rewrite the code to minimize stalls?
Revised FP Loop Minimizing Stalls

    1 Loop: LD    F0,0(R1)
    2       stall
    3       ADDD  F4,F0,F2
    4       SUBI  R1,R1,8
    5       BNEZ  R1,Loop    ;delayed branch
    6       SD    8(R1),F4   ;altered when moved past SUBI

Swap BNEZ and SD by changing the address of SD.

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

6 clocks: unroll the loop 4 times to make it faster?
Unroll Loop Four Times (straightforward way)

    1  Loop: LD    F0,0(R1)
    2        ADDD  F4,F0,F2
    3        SD    0(R1),F4      ;drop SUBI & BNEZ
    4        LD    F6,-8(R1)
    5        ADDD  F8,F6,F2
    6        SD    -8(R1),F8     ;drop SUBI & BNEZ
    7        LD    F10,-16(R1)
    8        ADDD  F12,F10,F2
    9        SD    -16(R1),F12   ;drop SUBI & BNEZ
    10       LD    F14,-24(R1)
    11       ADDD  F16,F14,F2
    12       SD    -24(R1),F16
    13       SUBI  R1,R1,#32     ;alter to 4*8
    14       BNEZ  R1,LOOP
    15       NOP

Each LD incurs a 1-cycle stall before its ADDD, and each ADDD incurs a 2-cycle stall before its SD:
15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration. Assumes R1 is a multiple of 4.

Rewrite the loop to minimize stalls?
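The 27-cycle count above is just instruction count plus per-body stalls, which a few lines verify:

```python
# 15 issued instructions (4 x (LD, ADDD, SD) + SUBI + BNEZ + NOP), plus,
# per unrolled body, a 1-cycle LD->ADDD stall and a 2-cycle ADDD->SD stall.

instructions = 15
bodies = 4
stalls_per_body = 1 + 2
total = instructions + bodies * stalls_per_body
print(total, total / bodies)  # 27 6.75
```

27/4 = 6.75, which the slide rounds to 6.8 cycles per iteration.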
Unrolled Loop That Minimizes Stalls

    1  Loop: LD    F0,0(R1)
    2        LD    F6,-8(R1)
    3        LD    F10,-16(R1)
    4        LD    F14,-24(R1)
    5        ADDD  F4,F0,F2
    6        ADDD  F8,F6,F2
    7        ADDD  F12,F10,F2
    8        ADDD  F16,F14,F2
    9        SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,#32
    13       BNEZ  R1,LOOP
    14       SD    8(R1),F16   ; 8-32 = -24

14 clock cycles, or 3.5 per iteration.

What assumptions were made when the code was moved?
■ OK to move the store past SUBI even though it changes the register
■ OK to move loads before stores: do we get the right data?
■ When is it safe for the compiler to make such changes?
Issues in Pipelined Design

° Pipelining
  Limitation: issue rate, FU stalls, FU depth
° Super-pipeline: issue one instruction per (fast) cycle; ALU takes multiple cycles
  Limitation: clock skew, FU stalls, FU depth
° Super-scalar: issue multiple scalar instructions per cycle
  Limitation: hazard resolution
° VLIW ("EPIC"): each instruction specifies multiple scalar operations; compiler determines parallelism
  Limitation: packing
° Vector operations: each instruction specifies a series of identical operations
  Limitation: applicability

[Figure: IF D Ex M W pipeline diagrams contrasting the issue patterns of each approach]
Getting CPI < 1: Issuing Multiple Instructions/Cycle

Two main variations: superscalar and VLIW.

Superscalar: varying number of instructions/cycle (1 to 6)
■ Parallelism and dependencies determined/resolved by HW
■ IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP 7100

Very Long Instruction Words (VLIW): fixed number of instructions (16); parallelism determined by compiler
■ Pipeline is exposed; compiler must schedule delays to get the right result

Explicit Parallel Instruction Computer (EPIC) / Intel
■ 128-bit packets containing 3 instructions (can execute sequentially)
■ Can link 128-bit packets together to allow more parallelism
■ Compiler determines parallelism, HW checks dependencies and forwards/stalls
Getting CPI < 1: Issuing Multiple Instructions/Cycle

Superscalar DLX: 2 instructions, 1 FP & 1 anything else
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue the 2nd instruction if the 1st instruction issues
– More ports for FP registers to do an FP load & FP op in a pair

    Type              Pipe Stages
    Int. instruction  IF ID EX MEM WB
    FP instruction    IF ID EX MEM WB
    Int. instruction     IF ID EX MEM WB
    FP instruction       IF ID EX MEM WB
    Int. instruction        IF ID EX MEM WB
    FP instruction          IF ID EX MEM WB

A 1-cycle load delay expands to 3 instructions in SS:
■ the instruction in the right half can't use it, nor can the instructions in the next slot
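The pairing rule can be sketched as a tiny issue check. This is a simplified model of the slide's "1 FP & 1 anything else" constraint; the opcode set and function name are invented for the example, and real issue logic would also check hazards:

```python
# Superscalar DLX pairing: the left (first) slot takes any instruction,
# the right (second) slot must hold an FP operation, and the second
# instruction only issues if the first one does.

FP_OPS = {"ADDD", "SUBD", "MULTD", "DIVD"}

def can_dual_issue(first_issues, second_op):
    return first_issues and second_op in FP_OPS

print(can_dual_issue(True, "ADDD"))   # True: int + FP pair
print(can_dual_issue(True, "ADDU"))   # False: two integer ops
print(can_dual_issue(False, "ADDD"))  # False: first slot stalled
```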
Unrolled Loop that Minimizes Stalls for Scalar

    1  Loop: LD    F0,0(R1)
    2        LD    F6,-8(R1)
    3        LD    F10,-16(R1)
    4        LD    F14,-24(R1)
    5        ADDD  F4,F0,F2
    6        ADDD  F8,F6,F2
    7        ADDD  F12,F10,F2
    8        ADDD  F16,F14,F2
    9        SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,#32
    13       BNEZ  R1,LOOP
    14       SD    8(R1),F16   ; 8-32 = -24

14 clock cycles, or 3.5 per iteration.
LD to ADDD: 1 cycle. ADDD to SD: 2 cycles.
Loop Unrolling in Superscalar

    Integer instruction    FP instruction      Clock cycle
    Loop: LD F0,0(R1)                          1
    LD F6,-8(R1)                               2
    LD F10,-16(R1)         ADDD F4,F0,F2       3
    LD F14,-24(R1)         ADDD F8,F6,F2       4
    LD F18,-32(R1)         ADDD F12,F10,F2     5
    SD 0(R1),F4            ADDD F16,F14,F2     6
    SD -8(R1),F8           ADDD F20,F18,F2     7
    SD -16(R1),F12                             8
    SD -24(R1),F16                             9
    SUBI R1,R1,#40                             10
    BNEZ R1,LOOP                               11
    SD -32(R1),F20                             12

Unrolled 5 times to avoid delays (+1 due to SS). 12 clocks, or 2.4 clocks per iteration.
Limits of Superscalar

While the Integer/FP split is simple for the HW, you get a CPI of 0.5 only for programs with:
■ Exactly 50% FP operations
■ No hazards

If more instructions issue at the same time, decode and issue become harder:
■ Even 2-scalar => examine 2 opcodes and 6 register specifiers, and decide if 1 or 2 instructions can issue

VLIW: trade instruction space for simple decoding
■ The long instruction word has room for many operations
■ By definition, all the operations the compiler puts in the long instruction word can execute in parallel
■ E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  ● 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
■ Need a compiling technique that schedules across several branches
Loop Unrolling in VLIW

    Memory ref 1      Memory ref 2      FP operation 1     FP op. 2          Int. op/branch   Clock
    LD F0,0(R1)       LD F6,-8(R1)                                                            1
    LD F10,-16(R1)    LD F14,-24(R1)                                                          2
    LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                      3
    LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                    4
                                        ADDD F20,F18,F2    ADDD F24,F22,F2                    5
    SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                       6
    SD -16(R1),F12    SD -24(R1),F16                                                          7
    SD -32(R1),F20    SD -40(R1),F24                                        SUBI R1,R1,#48    8
    SD -0(R1),F28                                                           BNEZ R1,LOOP      9

Unrolled 7 times to avoid delays. 7 results in 9 clocks, or 1.3 clocks per iteration. Need more registers in VLIW (EPIC => 128 int + 128 FP).
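The progression of schedules can be summarized in one place (the clocks and iteration counts below are the figures quoted on the slides; the labels are illustrative):

```python
# Clocks per iteration for each version of the FP loop.
schedules = {
    "original with stalls":     (9, 1),    # (clocks, iterations covered)
    "scheduled scalar":         (6, 1),
    "unrolled x4, scheduled":   (14, 4),
    "superscalar, unrolled x5": (12, 5),
    "VLIW, unrolled x7":        (9, 7),
}
for name, (clocks, iters) in schedules.items():
    print(f"{name}: {clocks / iters:.2f} clocks/iteration")
```

Each transformation trades code size and registers for fewer clocks per iteration, down from 9 to roughly 1.3.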
Software Pipelining

Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations.

Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop.

[Figure: iterations 0-4 overlapped in time; one software-pipelined iteration draws instructions from several original iterations]
Software Pipelining

    sum = 0.0;
    for( i=1; i<=N; i++ ) {      /* sum = sum + a[i]*b[i] */
        load a[i];
        load b[i];
        mult a[i]*b[i];
        add sum[i];
    }

becomes:

    sum = 0.0;
    START-UP-BLOCK;
    for( i=1; i<=N; i++ ) {
        load a[i];               /* Ai   */
        load b[i];               /* Bi   */
        mult a[i-1]*b[i-1];      /* *i-1 */
        add sum[i-2];            /* +i-2 */
    }
    FINISH-UP-BLOCK;
Software Pipelining, cont.

    START-UP-BLOCK;
    for( i=1; i<=N; i++ ) {
        load a[i];               /* Ai   */
        load b[i];               /* Bi   */
        mult a[i-1]*b[i-1];      /* *i-1 */
        add sum[i-2];            /* +i-2 */
    }
    FINISH-UP-BLOCK;

Schedule across iterations:

    START-UP    LOOP (i=3 ... i=N)      FINISH-UP
    A1 A2       A3 ... Ai ... AN
    B1 B2       B3 ... Bi ... BN
    *1          *2 ... *i-1 ... *N-1    *N
                +1 ... +i-2 ... +N-2    +N-1 +N
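The start-up/steady-state/finish-up structure runs correctly as ordinary code. A runnable sketch of the transformation (the function name and bookkeeping dicts are invented for the example):

```python
# A software-pipelined dot product: each steady-state pass mixes work from
# three different iterations (load from i, multiply from i-1, add from i-2).

def dot_software_pipelined(a, b):
    n = len(a)
    loaded, prods, total = {}, {}, 0.0
    # START-UP-BLOCK: loads for the first two iterations, first multiply.
    for i in range(min(2, n)):
        loaded[i] = (a[i], b[i])
    if n >= 1:
        prods[0] = a[0] * b[0]
    # Steady state: Ai/Bi, *i-1, +i-2 in the same pass.
    for i in range(2, n):
        loaded[i] = (a[i], b[i])
        prods[i - 1] = loaded[i - 1][0] * loaded[i - 1][1]
        total += prods[i - 2]
    # FINISH-UP-BLOCK: drain the remaining multiply and adds.
    if n >= 2:
        prods[n - 1] = a[n - 1] * b[n - 1]
        total += prods[n - 2]
    if n >= 1:
        total += prods[n - 1]
    return total

print(dot_software_pipelined([1, 2, 3, 4], [5, 6, 7, 8]))  # 70.0
```

The steady-state loop body contains no dependence between consecutive statements, which is what lets the hardware overlap them.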
Software Pipelining Example

Before: unrolled 3 times
    1  LD    F0,0(R1)
    2  ADDD  F4,F0,F2
    3  SD    0(R1),F4
    4  LD    F6,-8(R1)
    5  ADDD  F8,F6,F2
    6  SD    -8(R1),F8
    7  LD    F10,-16(R1)
    8  ADDD  F12,F10,F2
    9  SD    -16(R1),F12
    10 SUBI  R1,R1,#24
    11 BNEZ  R1,LOOP

After: software pipelined
    1  SD    0(R1),F4     ; stores M[i]
    2  ADDD  F4,F0,F2     ; adds to M[i-1]
    3  LD    F0,-16(R1)   ; loads M[i-2]
    4  SUBI  R1,R1,#8
    5  BNEZ  R1,LOOP

• Symbolic loop unrolling
  – Maximizes result-use distance
  – Less code space than unrolling
  – Fill & drain the pipe only once per loop, vs. once per unrolled iteration in loop unrolling

[Figure: overlapped ops over time for the SW-pipelined loop vs. the unrolled loop]
Software Pipelining with Loop Unrolling in VLIW

    Memory ref 1      Memory ref 2     FP operation 1     FP op. 2   Int. op/branch   Clock
    LD F0,-48(R1)     ST 0(R1),F4      ADDD F4,F0,F2                                  1
    LD F6,-56(R1)     ST -8(R1),F8     ADDD F8,F6,F2                 SUBI R1,R1,#24   2
    LD F10,-40(R1)    ST 8(R1),F12     ADDD F12,F10,F2               BNEZ R1,LOOP     3

Software pipelined across 9 iterations of the original loop.
■ In each iteration of the above loop, we:
  ● Store to m, m-8, m-16 (iterations I-3, I-2, I-1)
  ● Compute for m-24, m-32, m-40 (iterations I, I+1, I+2)
  ● Load from m-48, m-56, m-64 (iterations I+3, I+4, I+5)

9 results in 9 cycles, or 1 clock per iteration.
Average: 3.3 ops per clock, 66% efficiency.
Note: software pipelining needs fewer registers (only 7 registers here, vs. 15 before).
Trace Scheduling

Parallelism across IF branches vs. LOOP branches. Two steps:
■ Trace selection
  ● Find a likely sequence of basic blocks (a trace) forming a (statically predicted or profile-predicted) long sequence of straight-line code
■ Trace compaction
  ● Squeeze the trace into few VLIW instructions
  ● Need bookkeeping code in case the prediction is wrong

The compiler undoes a bad guess (discards values in registers). Subtle compiler bugs mean a wrong answer vs. merely poorer performance; there are no hardware interlocks.
Trace Scheduling

    b[i] = "old"
    a[i] = ...
    if( a[i] > 0 ) then
        b[i] = "new";   /* common case */
    else
        x
    endif
    c[i] = ...

Until done:
    select the most common path -- a trace
    schedule the trace across basic blocks
    repair the other paths
Trace Scheduling, cont.

Trace to be scheduled:

    b[i] = "old"
    a[i] = ...
    b[i] = "new";
    c[i] = ...
    if( a[i] <= 0 ) go to A
    B:

Repair code:

    A: restore old b[i]
       x
       maybe recalculate c[i]
       go to B:

Original code:

    b[i] = "old"
    a[i] = ...
    if( a[i] > 0 ) then
        b[i] = "new";   /* common case */
    else
        x
    endif
    c[i] = ...
Summary

Hazards limit performance:
■ Structural: need more HW resources
■ Data: need forwarding, compiler scheduling
■ Control: early evaluation & PC, delayed branch, prediction

Data hazards must be handled carefully:
■ RAW data hazards handled by forwarding
■ WAW and WAR hazards don't exist in the 5-stage pipeline

The MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load).
Exceptions in the 5-stage pipeline are recorded when they occur, but acted on only at the WB (end of MEM) stage:
■ Must flush all previous instructions

More performance comes from deeper pipelines and parallelism. Instruction-level parallelism is used to get CPI < 1.0:
■ Multiple instructions executed simultaneously
■ Multiple-issue instructions
■ Dynamic scheduling
■ Out-of-order execution
■ Speculation