Arturo Díaz Pérez, Laboratorio de Tecnologías de Información, …adiaz/arqcomp/14... · 2014. 7....
TRANSCRIPT
Arquitectura de Computadoras — Pipelining: Performance — 1
Laboratorio de Tecnologías de Información

Pipelining: Performance and Limits
Arquitectura de Computadoras
Arturo Díaz Pérez
Centro de Investigación y de Estudios Avanzados del IPN
Laboratorio de Tecnologías de Información
[email protected]
Review: Summary of Pipelining Basics
Pipelines pass control information down the pipe just as data moves down the pipe. Forwarding/stalls are handled by local control. Hazards limit performance:
■ Structural: need more HW resources
■ Data: need forwarding, compiler scheduling
■ Control: early evaluation & PC, delayed branch, prediction

Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency. What about performance?
Is CPI = 1 for our pipeline?

Remember that CPI is an "average # cycles/instruction".
CPI here is 1, since the average throughput is 1 instruction every cycle. What if there are stalls or multi-cycle execution? Usually CPI > 1. How close can we get to 1?

[Figure: four overlapped instructions, each passing through IFetch, Dcd, Exec, Mem, WB]
Speedup Equation for Pipelining

Pipeline speedup = Average instruction time unpipelined / Average instruction time pipelined
                 = (CPI_unpipelined × Clock cycle_unpipelined) / (CPI_pipelined × Clock cycle_pipelined)

Pipeline speedup = (CPI_Ideal × Pipeline depth / CPI_pipelined) × (Clock cycle_unpipelined / Clock cycle_pipelined)

where  CPI_unpipelined = CPI_Ideal × Pipeline depth
and    CPI_pipelined   = CPI_Ideal + Pipeline stall cycles per instruction
Speedup Equation for Pipelining, cont.

Speedup = (CPI_Ideal × Pipeline depth / (CPI_Ideal + Pipeline stall cycles)) × (Clock cycle_unpipelined / Clock cycle_pipelined)

If we ignore cycle time differences:

Speedup = (CPI_Ideal × Pipeline depth) / (CPI_Ideal + Pipeline stall cycles)
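The speedup equation above can be checked with a small calculator (a sketch; the function and parameter names are illustrative, not from the slides):

```python
# Pipeline speedup, ignoring or including cycle-time differences:
# Speedup = (CPI_ideal * depth / (CPI_ideal + stalls)) * clock_ratio,
# where clock_ratio = clock_unpipelined / clock_pipelined.

def pipeline_speedup(pipeline_depth, stall_cycles_per_instr,
                     cpi_ideal=1.0, clock_ratio=1.0):
    cpi_pipelined = cpi_ideal + stall_cycles_per_instr
    return (cpi_ideal * pipeline_depth / cpi_pipelined) * clock_ratio

# Ideal 5-stage pipeline with no stalls: speedup equals the depth.
print(pipeline_speedup(5, 0.0))            # 5.0
# 0.5 stall cycles per instruction cut the speedup to 5/1.5.
print(round(pipeline_speedup(5, 0.5), 2))  # 3.33
```

With stalls, the speedup falls below the pipeline depth, which is the point the slide is making.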
Example: Dual-port vs. single-port

Machine A: dual-ported memory
Machine B: single-ported memory, 1.05 times faster clock rate
CPI_ideal = 1 for both
Loads are 40% of instructions executed

Speedup_B = Pipeline depth/(1 + 0.4 × 1) × (clock_unpipe/(clock_pipe/1.05)) = 0.75 × Pipeline depth
Speedup_A = Pipeline depth/(1 + 0) × (clock_unpipe/clock_pipe) = Pipeline depth
Speedup_A / Speedup_B = Pipeline depth / (0.75 × Pipeline depth) = 1.33

Machine A is 1.33 times faster
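The example's arithmetic can be reproduced directly (a sketch; the helper name is illustrative):

```python
# Machine B stalls one cycle on every load (40% of instructions) because
# its single-ported memory conflicts with instruction fetch, but its clock
# is 1.05x faster. Speedup per unit of pipeline depth:

def relative_speedup(stall_freq, stall_cycles, clock_scale):
    return clock_scale / (1.0 + stall_freq * stall_cycles)

speedup_a = relative_speedup(0.0, 1, 1.0)    # dual-ported: no structural stalls
speedup_b = relative_speedup(0.4, 1, 1.05)   # single-ported, faster clock
print(round(speedup_b, 2))              # 0.75
print(round(speedup_a / speedup_b, 2))  # 1.33
```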
Performance Summary

Just overlap tasks; easy if tasks are independent. Speedup ≤ Pipeline depth; if ideal CPI is 1, then:

Speedup = (Pipeline depth / (1 + Pipeline stall cycles)) × (Clock cycle_unpipelined / Clock cycle_pipelined)

• Hazards limit performance on computers:
  • structural: need more HW resources
  • data: need forwarding, compiler scheduling
  • control: discuss next time
Recap: Pipeline Hazards

[Figure: overlapped pipeline diagrams illustrating a structural hazard, a control hazard (jump), a RAW (read after write) data hazard, a WAW (write after write) data hazard, and a WAR (write after read) data hazard]
Hazard Detection

Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.

A RAW hazard exists on register ρ if ρ ∈ Rregs( i ) ∩ Wregs( j )
■ Keep a record of pending writes (for instructions in the pipe) and compare with the operand registers of the current instruction.
■ When an instruction issues, reserve its result register.
■ When an operation completes, remove its write reservation.

A WAW hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Wregs( j )
A WAR hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Rregs( j )
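The three set-intersection definitions above translate directly into code (a sketch; the function and argument names are illustrative):

```python
# Instruction i is about to issue; j is a predecessor still in the pipe.
# Each hazard class is an intersection of read/write register sets.

def hazards(read_i, write_i, read_j, write_j):
    return {
        "RAW": set(read_i) & set(write_j),   # i reads what j will write
        "WAW": set(write_i) & set(write_j),  # both write the same register
        "WAR": set(write_i) & set(read_j),   # i writes what j still reads
    }

# add r1,r2,r3 (j) followed by sub r4,r1,r3 (i): RAW on r1, nothing else.
print(hazards(read_i={"r1", "r3"}, write_i={"r4"},
              read_j={"r2", "r3"}, write_j={"r1"}))
```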
Record of Pending Writes in Pipeline Registers

Compare the current operand registers against the pending writes:

hazard <= ((rs == rw_ex)  & regW_ex)  OR
          ((rs == rw_mem) & regW_mem) OR
          ((rs == rw_wb)  & regW_wb)  OR
          ((rt == rw_ex)  & regW_ex)  OR
          ((rt == rw_mem) & regW_mem) OR
          ((rt == rw_wb)  & regW_wb)

[Figure: five-stage datapath (IAU, PC, I mem, Regs, alu, D mem) with op/rw fields carried in each pipeline register and rs/rt compared against them]
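The comparator logic above can be sketched as a small predicate (illustrative names; the slide's signal names rw/regW are kept):

```python
# Decode compares its source registers (rs, rt) against the destination
# register (rw) and register-write flag (regW) carried in each downstream
# pipeline register.

def detect_hazard(rs, rt, pipe_regs):
    """pipe_regs: list of (rw, regW) pairs for the EX, MEM, and WB stages."""
    return any(regW and (rs == rw or rt == rw) for rw, regW in pipe_regs)

# add r1,r2,r3 is in EX (writes r1); sub r4,r1,r3 sits in decode.
print(detect_hazard(rs=1, rt=3,
                    pipe_regs=[(1, True), (0, False), (0, False)]))  # True
```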
Resolve RAW by "forwarding" (or bypassing)

Detect the nearest valid write-op operand register and forward it into the op latches, bypassing the remainder of the pipe.
Increase muxes to add paths from the pipeline registers.
Data forwarding = data bypassing.

[Figure: the same datapath with forwarding muxes feeding the ALU inputs from the EX, MEM, and WB pipeline registers]
What about memory operations?

[Figure: datapath with the result (Rd) delayed in a register alongside the memory stage before writing back to the regfile]

If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations! What does delaying WB on arithmetic operations cost?
– cycles?
– hardware?

What about data dependence on loads?
    R1 <- R4 + R5
    R2 <- Mem[ R2 + I ]
    R3 <- R2 + R1
⇒ "Delayed loads." Can recognize this in the decode stage and introduce a bubble while stalling the fetch stage. Tricky situation:
    R1 <- Mem[ R2 + I ]
    Mem[ R3 + 34 ] <- R1
Handle with a bypass in the memory stage!
Software Scheduling

Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
    lw  Rb, b
    lw  Rc, c
    add Ra, Rb, Rc
    sw  a, Ra
    lw  Re, e
    lw  Rf, f
    sub Rd, Re, Rf
    sw  d, Rd

Fast code:
    lw  Rb, b
    lw  Rc, c
    lw  Re, e
    add Ra, Rb, Rc
    lw  Rf, f
    sw  a, Ra
    sub Rd, Re, Rf
    sw  d, Rd
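The difference between the two schedules can be counted mechanically. A minimal sketch, assuming a 1-cycle load delay slot and ignoring all other hazards (the tuple encoding is invented for the example):

```python
# An instruction that uses a load result in the very next slot stalls one cycle.

def load_use_stalls(program):
    """program: list of (op, dest, sources) tuples."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("lw", "Rb", ()), ("lw", "Rc", ()), ("add", "Ra", ("Rb", "Rc")),
        ("sw", None, ("Ra",)), ("lw", "Re", ()), ("lw", "Rf", ()),
        ("sub", "Rd", ("Re", "Rf")), ("sw", None, ("Rd",))]
fast = [("lw", "Rb", ()), ("lw", "Rc", ()), ("lw", "Re", ()),
        ("add", "Ra", ("Rb", "Rc")), ("lw", "Rf", ()),
        ("sw", None, ("Ra",)), ("sub", "Rd", ("Re", "Rf")),
        ("sw", None, ("Rd",))]
print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```

Moving the independent loads up hides both load delay slots, which is exactly the compiler's scheduling job here.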
Compiler Avoiding Load Stalls

% loads stalling pipeline (scheduled vs. unscheduled):

            scheduled   unscheduled
    tex        25%          65%
    spice      14%          42%
    gcc        31%          54%
Pipelined Processor

Separate control at each stage. Stalls propagate backwards to freeze previous stages. Bubbles are introduced into the pipeline by placing "noops" into the local stage while stalling previous stages.

[Figure: pipelined datapath (Next PC, Inst. Mem, Reg File, Exec, Mem Access, Data Mem) with per-stage control (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl), valid bits in IRex/IRmem/IRwb, stalls propagating backwards and bubbles inserted downstream]
FYI: MIPS R3000 clocking discipline

2-phase non-overlapping clocks. A pipeline stage is two (level-sensitive) latches.

[Figure: phi1/phi2 non-overlapping clock waveforms; a phi1 latch plus a phi2 latch behave like one edge-triggered stage]
MIPS R3000 Instruction Pipeline

    Stage:     Inst Fetch | Decode / Reg. Read | ALU / E.A.      | Memory       | Write Reg
    Resources: TLB, I-Cache | RF               | Operation, E.A. | TLB, D-Cache | WB

Write in phase 1, read in phase 2 => eliminates bypass from WB
Recall: Data Hazard on r1

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

[Figure: the five instructions overlapped in time through IF, ID/RF, EX, MEM, WB, showing where r1 is produced and consumed]

With the MIPS R3000 pipeline, there is no need to forward from the WB stage.
MIPS R3000 Multicycle Operations

Ex: multiply, divide, cache miss.
Use the control word of the local stage to step through a multicycle operation.
Stall all stages above the multicycle operation in the pipeline; drain (bubble) the stages below it.
Alternatively, launch multiply/divide into an autonomous unit, and only stall the pipe on an attempt to get the result before it is ready.
- This means stalling mflo/mfhi in the decode stage if a multiply/divide is still executing.

[Figure: datapath with an autonomous multiply unit alongside the ALU; its result (T, Rd) is written back to the regfile when ready]
Case Study: MIPS R4000 (200 MHz)

8-stage pipeline:
■ IF – first half of instruction fetch; PC selection happens here, as well as initiation of instruction cache access.
■ IS – second half of access to instruction cache.
■ RF – instruction decode and register fetch, hazard checking, and instruction cache hit detection.
■ EX – execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
■ DF – data fetch, first half of access to data cache.
■ DS – second half of access to data cache.
■ TC – tag check, determine whether the data cache access hit.
■ WB – write back for loads and register-register operations.

8 stages: what is the impact on load delay? Branch delay? Why?
Case Study: MIPS R4000

[Figure: successive instructions staggered through IF IS RF EX DF DS TC WB, showing a TWO-cycle load latency]

[Figure: the same stagger showing a THREE-cycle branch latency (conditions evaluated during the EX phase): the delay slot plus two stalls. A branch-likely cancels the delay slot if not taken]
MIPS R4000 Floating Point

FP adder, FP multiplier, FP divider. The last step of the FP multiplier/divider uses the FP adder HW. There are 8 kinds of stages in the FP units:

    Stage  Functional unit  Description
    A      FP adder         Mantissa ADD stage
    D      FP divider       Divide pipeline stage
    E      FP multiplier    Exception test stage
    M      FP multiplier    First stage of multiplier
    N      FP multiplier    Second stage of multiplier
    R      FP adder         Rounding stage
    S      FP adder         Operand shift stage
    U                       Unpack FP numbers
MIPS FP Pipe Stages

    FP Instr         1   2    3     4    5    6    7    8 …
    Add, Subtract    U   S+A  A+R   R+S
    Multiply         U   E+M  M     M    M    N    N+A  R
    Divide           U   A    R     D28 …  D+A, D+R, D+R, D+A, D+R, A, R
    Square root      U   E    (A+R)108 …  A    R
    Negate           U   S
    Absolute value   U   S
    FP compare       U   A    R

Stages: A mantissa ADD, D divide pipeline stage, E exception test, M first multiplier stage, N second multiplier stage, R rounding, S operand shift, U unpack FP numbers.
R4000 Performance

Not the ideal CPI of 1:
■ Load stalls (1 or 2 clock cycles)
■ Branch stalls (2 cycles + unfilled slots)
■ FP result stalls: RAW data hazard (latency)
■ FP structural stalls: not enough FP hardware (parallelism)

[Figure: stacked bar chart of CPI (0 to 4.5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv, broken into base, load stalls, branch stalls, FP result stalls, and FP structural stalls]
Improving Instruction Level Parallelism (ILP)

ILP: overlap execution of unrelated instructions. gcc: 17% control transfer:
■ 5 instructions + 1 branch
■ Must go beyond a single block to get more instruction level parallelism

Loop-level parallelism is one opportunity
■ First SW, then HW approaches

DLX floating point as example
■ Measurements suggest R4000 FP execution performance has room for improvement
FP Loop: Where are the Hazards?

    Loop: LD    F0,0(R1)    ;F0=vector element
          ADDD  F4,F0,F2    ;add scalar from F2
          SD    0(R1),F4    ;store result
          SUBI  R1,R1,8     ;decrement pointer 8B (DW)
          BNEZ  R1,Loop     ;branch R1!=zero
          NOP               ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

• Where are the stalls?
FP Loop Showing Stalls

    1 Loop: LD    F0,0(R1)   ;F0=vector element
    2       stall
    3       ADDD  F4,F0,F2   ;add scalar in F2
    4       stall
    5       stall
    6       SD    0(R1),F4   ;store result
    7       SUBI  R1,R1,8    ;decrement pointer 8B (DW)
    8       BNEZ  R1,Loop    ;branch R1!=zero
    9       stall            ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

9 clocks: rewrite the code to minimize stalls?
Revised FP Loop Minimizing Stalls

    1 Loop: LD    F0,0(R1)
    2       stall
    3       ADDD  F4,F0,F2
    4       SUBI  R1,R1,8
    5       BNEZ  R1,Loop    ;delayed branch
    6       SD    8(R1),F4   ;altered when moved past SUBI

Swap BNEZ and SD by changing the address of SD.

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

6 clocks: unroll the loop 4 times to make it faster?
Unroll Loop Four Times (straightforward way)

    1  Loop: LD    F0,0(R1)
    2        ADDD  F4,F0,F2
    3        SD    0(R1),F4      ;drop SUBI & BNEZ
    4        LD    F6,-8(R1)
    5        ADDD  F8,F6,F2
    6        SD    -8(R1),F8     ;drop SUBI & BNEZ
    7        LD    F10,-16(R1)
    8        ADDD  F12,F10,F2
    9        SD    -16(R1),F12   ;drop SUBI & BNEZ
    10       LD    F14,-24(R1)
    11       ADDD  F16,F14,F2
    12       SD    -24(R1),F16
    13       SUBI  R1,R1,#32     ;alter to 4*8
    14       BNEZ  R1,LOOP
    15       NOP

Each LD incurs a 1-cycle stall before its ADDD, and each ADDD incurs a 2-cycle stall before its SD:
15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration. Assumes R1 is a multiple of 4.

Rewrite the loop to minimize stalls?
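The 27-cycle count above is just instruction count plus per-body stalls, which a few lines verify:

```python
# 15 issued instructions (4 x (LD, ADDD, SD) + SUBI + BNEZ + NOP), plus,
# per unrolled body, a 1-cycle LD->ADDD stall and a 2-cycle ADDD->SD stall.

instructions = 15
bodies = 4
stalls_per_body = 1 + 2
total = instructions + bodies * stalls_per_body
print(total, total / bodies)  # 27 6.75
```

27/4 = 6.75, which the slide rounds to 6.8 cycles per iteration.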
Unrolled Loop That Minimizes Stalls

    1  Loop: LD    F0,0(R1)
    2        LD    F6,-8(R1)
    3        LD    F10,-16(R1)
    4        LD    F14,-24(R1)
    5        ADDD  F4,F0,F2
    6        ADDD  F8,F6,F2
    7        ADDD  F12,F10,F2
    8        ADDD  F16,F14,F2
    9        SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,#32
    13       BNEZ  R1,LOOP
    14       SD    8(R1),F16   ; 8-32 = -24

14 clock cycles, or 3.5 per iteration.

What assumptions were made when the code was moved?
■ OK to move the store past SUBI even though it changes the register
■ OK to move loads before stores: do we get the right data?
■ When is it safe for the compiler to make such changes?
Issues in Pipelined Design

° Pipelining
  Limitation: issue rate, FU stalls, FU depth
° Super-pipeline: issue one instruction per (fast) cycle; ALU takes multiple cycles
  Limitation: clock skew, FU stalls, FU depth
° Super-scalar: issue multiple scalar instructions per cycle
  Limitation: hazard resolution
° VLIW ("EPIC"): each instruction specifies multiple scalar operations; compiler determines parallelism
  Limitation: packing
° Vector operations: each instruction specifies a series of identical operations
  Limitation: applicability

[Figure: IF D Ex M W pipeline diagrams contrasting the issue patterns of each approach]
Getting CPI < 1: Issuing Multiple Instructions/Cycle

Two main variations: superscalar and VLIW.

Superscalar: varying number of instructions/cycle (1 to 6)
■ Parallelism and dependencies determined/resolved by HW
■ IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP 7100

Very Long Instruction Words (VLIW): fixed number of instructions (16); parallelism determined by compiler
■ Pipeline is exposed; compiler must schedule delays to get the right result

Explicit Parallel Instruction Computer (EPIC) / Intel
■ 128-bit packets containing 3 instructions (can execute sequentially)
■ Can link 128-bit packets together to allow more parallelism
■ Compiler determines parallelism, HW checks dependencies and forwards/stalls
Getting CPI < 1: Issuing Multiple Instructions/Cycle

Superscalar DLX: 2 instructions, 1 FP & 1 anything else
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue the 2nd instruction if the 1st instruction issues
– More ports for FP registers to do an FP load & FP op in a pair

    Type              Pipe Stages
    Int. instruction  IF ID EX MEM WB
    FP instruction    IF ID EX MEM WB
    Int. instruction     IF ID EX MEM WB
    FP instruction       IF ID EX MEM WB
    Int. instruction        IF ID EX MEM WB
    FP instruction          IF ID EX MEM WB

A 1-cycle load delay expands to 3 instructions in SS:
■ the instruction in the right half can't use it, nor can the instructions in the next slot
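The pairing rule can be sketched as a tiny issue check. This is a simplified model of the slide's "1 FP & 1 anything else" constraint; the opcode set and function name are invented for the example, and real issue logic would also check hazards:

```python
# Superscalar DLX pairing: the left (first) slot takes any instruction,
# the right (second) slot must hold an FP operation, and the second
# instruction only issues if the first one does.

FP_OPS = {"ADDD", "SUBD", "MULTD", "DIVD"}

def can_dual_issue(first_issues, second_op):
    return first_issues and second_op in FP_OPS

print(can_dual_issue(True, "ADDD"))   # True: int + FP pair
print(can_dual_issue(True, "ADDU"))   # False: two integer ops
print(can_dual_issue(False, "ADDD"))  # False: first slot stalled
```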
Unrolled Loop that Minimizes Stalls for Scalar

    1  Loop: LD    F0,0(R1)
    2        LD    F6,-8(R1)
    3        LD    F10,-16(R1)
    4        LD    F14,-24(R1)
    5        ADDD  F4,F0,F2
    6        ADDD  F8,F6,F2
    7        ADDD  F12,F10,F2
    8        ADDD  F16,F14,F2
    9        SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,#32
    13       BNEZ  R1,LOOP
    14       SD    8(R1),F16   ; 8-32 = -24

14 clock cycles, or 3.5 per iteration.
LD to ADDD: 1 cycle. ADDD to SD: 2 cycles.
Loop Unrolling in Superscalar

    Integer instruction    FP instruction      Clock cycle
    Loop: LD F0,0(R1)                          1
    LD F6,-8(R1)                               2
    LD F10,-16(R1)         ADDD F4,F0,F2       3
    LD F14,-24(R1)         ADDD F8,F6,F2       4
    LD F18,-32(R1)         ADDD F12,F10,F2     5
    SD 0(R1),F4            ADDD F16,F14,F2     6
    SD -8(R1),F8           ADDD F20,F18,F2     7
    SD -16(R1),F12                             8
    SD -24(R1),F16                             9
    SUBI R1,R1,#40                             10
    BNEZ R1,LOOP                               11
    SD -32(R1),F20                             12

Unrolled 5 times to avoid delays (+1 due to SS). 12 clocks, or 2.4 clocks per iteration.
Limits of Superscalar

While the Integer/FP split is simple for the HW, you get a CPI of 0.5 only for programs with:
■ Exactly 50% FP operations
■ No hazards

If more instructions issue at the same time, decode and issue become harder:
■ Even 2-scalar => examine 2 opcodes and 6 register specifiers, and decide if 1 or 2 instructions can issue

VLIW: trade instruction space for simple decoding
■ The long instruction word has room for many operations
■ By definition, all the operations the compiler puts in the long instruction word can execute in parallel
■ E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  ● 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
■ Need a compiling technique that schedules across several branches
Loop Unrolling in VLIW

    Memory ref 1      Memory ref 2      FP operation 1     FP op. 2          Int. op/branch   Clock
    LD F0,0(R1)       LD F6,-8(R1)                                                            1
    LD F10,-16(R1)    LD F14,-24(R1)                                                          2
    LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                      3
    LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                    4
                                        ADDD F20,F18,F2    ADDD F24,F22,F2                    5
    SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                       6
    SD -16(R1),F12    SD -24(R1),F16                                                          7
    SD -32(R1),F20    SD -40(R1),F24                                        SUBI R1,R1,#48    8
    SD -0(R1),F28                                                           BNEZ R1,LOOP      9

Unrolled 7 times to avoid delays. 7 results in 9 clocks, or 1.3 clocks per iteration. Need more registers in VLIW (EPIC => 128 int + 128 FP).
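The progression of schedules can be summarized in one place (the clocks and iteration counts below are the figures quoted on the slides; the labels are illustrative):

```python
# Clocks per iteration for each version of the FP loop.
schedules = {
    "original with stalls":     (9, 1),    # (clocks, iterations covered)
    "scheduled scalar":         (6, 1),
    "unrolled x4, scheduled":   (14, 4),
    "superscalar, unrolled x5": (12, 5),
    "VLIW, unrolled x7":        (9, 7),
}
for name, (clocks, iters) in schedules.items():
    print(f"{name}: {clocks / iters:.2f} clocks/iteration")
```

Each transformation trades code size and registers for fewer clocks per iteration, down from 9 to roughly 1.3.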
Software Pipelining

Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations.

Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop.

[Figure: iterations 0-4 overlapped in time; one software-pipelined iteration draws instructions from several original iterations]
Software Pipelining

    sum = 0.0;
    for( i=1; i<=N; i++ ) {      /* sum = sum + a[i]*b[i] */
        load a[i];
        load b[i];
        mult a[i]*b[i];
        add sum[i];
    }

becomes:

    sum = 0.0;
    START-UP-BLOCK;
    for( i=1; i<=N; i++ ) {
        load a[i];               /* Ai   */
        load b[i];               /* Bi   */
        mult a[i-1]*b[i-1];      /* *i-1 */
        add sum[i-2];            /* +i-2 */
    }
    FINISH-UP-BLOCK;
Software Pipelining, cont.

    START-UP-BLOCK;
    for( i=1; i<=N; i++ ) {
        load a[i];               /* Ai   */
        load b[i];               /* Bi   */
        mult a[i-1]*b[i-1];      /* *i-1 */
        add sum[i-2];            /* +i-2 */
    }
    FINISH-UP-BLOCK;

Schedule across iterations:

    START-UP    LOOP (i=3 ... i=N)      FINISH-UP
    A1 A2       A3 ... Ai ... AN
    B1 B2       B3 ... Bi ... BN
    *1          *2 ... *i-1 ... *N-1    *N
                +1 ... +i-2 ... +N-2    +N-1 +N
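The start-up/steady-state/finish-up structure runs correctly as ordinary code. A runnable sketch of the transformation (the function name and bookkeeping dicts are invented for the example):

```python
# A software-pipelined dot product: each steady-state pass mixes work from
# three different iterations (load from i, multiply from i-1, add from i-2).

def dot_software_pipelined(a, b):
    n = len(a)
    loaded, prods, total = {}, {}, 0.0
    # START-UP-BLOCK: loads for the first two iterations, first multiply.
    for i in range(min(2, n)):
        loaded[i] = (a[i], b[i])
    if n >= 1:
        prods[0] = a[0] * b[0]
    # Steady state: Ai/Bi, *i-1, +i-2 in the same pass.
    for i in range(2, n):
        loaded[i] = (a[i], b[i])
        prods[i - 1] = loaded[i - 1][0] * loaded[i - 1][1]
        total += prods[i - 2]
    # FINISH-UP-BLOCK: drain the remaining multiply and adds.
    if n >= 2:
        prods[n - 1] = a[n - 1] * b[n - 1]
        total += prods[n - 2]
    if n >= 1:
        total += prods[n - 1]
    return total

print(dot_software_pipelined([1, 2, 3, 4], [5, 6, 7, 8]))  # 70.0
```

The steady-state loop body contains no dependence between consecutive statements, which is what lets the hardware overlap them.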
Software Pipelining Example

Before: unrolled 3 times
    1  LD    F0,0(R1)
    2  ADDD  F4,F0,F2
    3  SD    0(R1),F4
    4  LD    F6,-8(R1)
    5  ADDD  F8,F6,F2
    6  SD    -8(R1),F8
    7  LD    F10,-16(R1)
    8  ADDD  F12,F10,F2
    9  SD    -16(R1),F12
    10 SUBI  R1,R1,#24
    11 BNEZ  R1,LOOP

After: software pipelined
    1  SD    0(R1),F4     ; stores M[i]
    2  ADDD  F4,F0,F2     ; adds to M[i-1]
    3  LD    F0,-16(R1)   ; loads M[i-2]
    4  SUBI  R1,R1,#8
    5  BNEZ  R1,LOOP

• Symbolic loop unrolling
  – Maximizes result-use distance
  – Less code space than unrolling
  – Fill & drain the pipe only once per loop, vs. once per unrolled iteration in loop unrolling

[Figure: overlapped ops over time for the SW-pipelined loop vs. the unrolled loop]
Software Pipelining with Loop Unrolling in VLIW

    Memory ref 1      Memory ref 2     FP operation 1     FP op. 2   Int. op/branch   Clock
    LD F0,-48(R1)     ST 0(R1),F4      ADDD F4,F0,F2                                  1
    LD F6,-56(R1)     ST -8(R1),F8     ADDD F8,F6,F2                 SUBI R1,R1,#24   2
    LD F10,-40(R1)    ST 8(R1),F12     ADDD F12,F10,F2               BNEZ R1,LOOP     3

Software pipelined across 9 iterations of the original loop.
■ In each iteration of the above loop, we:
  ● Store to m, m-8, m-16 (iterations I-3, I-2, I-1)
  ● Compute for m-24, m-32, m-40 (iterations I, I+1, I+2)
  ● Load from m-48, m-56, m-64 (iterations I+3, I+4, I+5)

9 results in 9 cycles, or 1 clock per iteration.
Average: 3.3 ops per clock, 66% efficiency.
Note: software pipelining needs fewer registers (only 7 registers here, vs. 15 before).
Trace Scheduling

Parallelism across IF branches vs. LOOP branches. Two steps:
■ Trace selection
  ● Find a likely sequence of basic blocks (a trace) forming a (statically predicted or profile-predicted) long sequence of straight-line code
■ Trace compaction
  ● Squeeze the trace into few VLIW instructions
  ● Need bookkeeping code in case the prediction is wrong

The compiler undoes a bad guess (discards values in registers). Subtle compiler bugs mean a wrong answer vs. merely poorer performance; there are no hardware interlocks.
Trace Scheduling

    b[i] = "old"
    a[i] = ...
    if( a[i] > 0 ) then
        b[i] = "new";   /* common case */
    else
        x
    endif
    c[i] = ...

Until done:
    select the most common path -- a trace
    schedule the trace across basic blocks
    repair the other paths
Trace Scheduling, cont.

Trace to be scheduled:

    b[i] = "old"
    a[i] = ...
    b[i] = "new";
    c[i] = ...
    if( a[i] <= 0 ) go to A
    B:

Repair code:

    A: restore old b[i]
       x
       maybe recalculate c[i]
       go to B:

Original code:

    b[i] = "old"
    a[i] = ...
    if( a[i] > 0 ) then
        b[i] = "new";   /* common case */
    else
        x
    endif
    c[i] = ...
Summary

Hazards limit performance:
■ Structural: need more HW resources
■ Data: need forwarding, compiler scheduling
■ Control: early evaluation & PC, delayed branch, prediction

Data hazards must be handled carefully:
■ RAW data hazards handled by forwarding
■ WAW and WAR hazards don't exist in the 5-stage pipeline

The MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load).
Exceptions in the 5-stage pipeline are recorded when they occur, but acted on only at the WB (end of MEM) stage:
■ Must flush all previous instructions

More performance comes from deeper pipelines and parallelism. Instruction-level parallelism is used to get CPI < 1.0:
■ Multiple instructions executed simultaneously
■ Multiple-issue instructions
■ Dynamic scheduling
■ Out-of-order execution
■ Speculation