
Page 1:

吳俊興, Department of Computer Science and Information Engineering, National University of Kaohsiung

October 2004

Chapter 3: ILP and Its Dynamic Exploitation –

Branch and Beyond

EEF011 Computer Architecture

Page 2:

2

Chapter Overview

3.1 Instruction Level Parallelism: Concepts and Challenges

3.2 Overcoming Data Hazards with Dynamic Scheduling

3.3 Dynamic Scheduling: Examples and the Algorithm

3.4 Reducing Branch Costs with Dynamic Hardware Prediction

3.5 High Performance Instruction Delivery

3.6 Taking Advantage of More ILP with Multiple Issue

3.7 Hardware-Based Speculation

3.8 Studies of the Limitations of ILP

3.9 Limitations on ILP for Realizable Processors

3.10 The P6 Microarchitecture

Page 3:

3

3.4 Reducing Branch Costs with Dynamic Hardware Prediction

Significance of branches

1. When issuing n instructions per clock cycle, branches arrive up to n times faster in an n-issue processor

2. Amdahl's Law: the relative impact of control stalls is larger because an n-issue processor has a lower potential CPI

Review of basic schemes (static or software prediction):
• Predict taken
• Predict not taken
• Delayed branch

Page 4:

4

Dynamic Hardware Prediction

Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not?

The hardware can look for clues based on the instructions, or it can use past history - we will discuss both of these directions.

Key Concept: A Branch History Table contains information about what a branch did the last time it was executed.

Performance = ƒ(accuracy, cost of misprediction)

Dynamic Branch Prediction

Page 5:

5

Branch Prediction Buffers (Branch History Table)

1-bit branch-prediction buffer: a small memory
– indexed by the lower portion of the address of the branch instruction
– a bit indicating whether the branch was recently taken or not
– fetching begins in the predicted direction; on a misprediction, the bit is inverted

Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution:
– at the end of the loop, when it exits instead of looping as before
– on the first iteration of the next execution, when it predicts exit instead of looping

[Figure: a branch history table with entries 0–1023, indexed by low-order bits of the branch instruction address (labeled bits 13–2 in the figure). Based on the address of the branch, its prediction is stored in and read from the table.]
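The double misprediction of the 1-bit scheme is easy to see in a small simulation. This is a minimal sketch, not from the text: the function name `run_1bit` and the 10-iteration loop are illustrative assumptions; the predictor simply remembers the last outcome.

```python
def run_1bit(outcomes, init=True):
    """Count mispredictions of a single 1-bit predictor."""
    pred, miss = init, 0
    for taken in outcomes:
        if pred != taken:
            miss += 1
        pred = taken  # 1-bit scheme: always remember the last outcome
    return miss

loop = [True] * 9 + [False]   # 10-iteration loop: taken 9 times, then exit
print(run_1bit(loop))         # 1 miss: only the exit
print(run_1bit(loop + loop))  # 3 misses: exit, re-entry, exit again
```

Running the loop body twice shows the two mispredictions per subsequent loop execution: one at the exit, and one on the first iteration of the next pass.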

Page 6:

6

2-bit Dynamic Branch Prediction

2-bit scheme: change prediction only if get misprediction twice

Dynamic Hardware Prediction

Basic Branch Prediction: Branch Prediction Buffers

Figure 3.7 (p. 198)
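The 2-bit scheme of Figure 3.7 can be sketched as a saturating counter. In this hedged sketch (the encoding — states 0–1 predict not taken, 2–3 predict taken — is an assumed convention, not fixed by the slide), a single misprediction no longer flips the prediction:

```python
def run_2bit(outcomes, counter=3):
    """2-bit saturating counter: predict taken when counter >= 2."""
    miss = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            miss += 1
        # move toward taken (max 3) or toward not taken (min 0)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return miss

loop = [True] * 9 + [False]   # 10-iteration loop, executed twice
print(run_2bit(loop + loop))  # 2: only the two loop exits mispredict
```

Unlike the 1-bit buffer, the misprediction at the loop exit leaves the counter still predicting taken, so the loop re-entry is predicted correctly.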

Page 7:

7

Branch History Table

Accuracy

• 4096-entry table: misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%

• 4096 entries are about as good as an infinite table, but 4096 entries are a lot of hardware

Sources of misprediction:
• wrong guess for that branch
• got the history of the wrong branch when indexing the table

(Figure 3.9)

Page 8:

8

Correlating Branches

Dynamic Hardware Prediction

Basic Branch Prediction: Branch Prediction Buffers

What if we have the code:

if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { …

Then the third "if" can be somewhat predicted based on the first two "if"s.

Generated MIPS code:

        DSUBUI R3, R1, #2
        BNEZ   R3, L1        ; b1: taken when aa != 2
        DADD   R1, R0, R0    ; aa = 0
L1:     DSUBUI R3, R2, #2
        BNEZ   R3, L2        ; b2: taken when bb != 2
        DADD   R2, R0, R0    ; bb = 0
L2:     DSUBU  R3, R1, R2    ; R3 = aa - bb
        BEQZ   R3, L3        ; b3: taken when aa == bb

This branch is based on the outcome of the previous 2 branches.

Page 9:

9

Correlating Branches(2-level Predictors)

Dynamic Hardware Prediction

Basic Branch Prediction: Branch Prediction Buffers

Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)

– Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction

• (2,2) predictor: 2-bit global history, 2-bit local predictors
• (m,n) predictor: uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch

Total # of bits = 2^m × n × # of prediction entries

[Figure: a (2,2) predictor. A 4-bit branch address selects a row of 2-bit local predictors; a 2-bit global branch history (e.g., 01 = not taken then taken) selects which of the four predictions to use.]
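The storage formula above can be checked with a trivial helper (the function name is my own, not from the text):

```python
def predictor_bits(m, n, entries):
    """Storage for an (m, n) predictor: 2**m n-bit counters per entry."""
    return (2 ** m) * n * entries

print(predictor_bits(2, 2, 1024))  # 8192 bits: a (2,2) predictor, 1K entries
print(predictor_bits(0, 2, 4096))  # 8192 bits: a plain 2-bit, 4K-entry BHT
```

Note the two configurations cost the same number of bits, which is why comparing a 1K-entry (2,2) predictor against larger simple 2-bit buffers (as in Figure 3.15) is a fair, equal-budget comparison.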

Page 10:

10

Example: multiple consecutive branches

if (d == 0)        // b1: not taken when d == 0, taken otherwise
    d = 1;
if (d == 1)        // b2: not taken when d == 1, taken otherwise
    ...

If b1 is not taken, then b2 will be not taken.

1-bit predictor: consider d alternating between 2 and 0. All branches are mispredicted.

Page 11:

11

(1,1) predictor – a 1-bit predictor with 1 bit of correlation: the outcome of the last branch (taken or not taken) decides which prediction bit is consulted and updated

Two prediction bits per branch: one prediction used if the last branch was not taken, and one if the last branch was taken

if (d == 0)        // b1
    d = 1;
if (d == 1)        // b2
    ...
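The d-alternates example can be simulated. The branch outcomes below are derived from the code above (b1 taken when d != 0, b2 taken when d != 1); the simulator itself is my own sketch, with all predictors initialized to predict not taken:

```python
def outcomes(ds):
    """Outcome sequence (True = taken) for branches b1, b2 per value of d."""
    seq = []
    for d in ds:
        b1 = d != 0              # b1 taken when d != 0
        if not b1:
            d = 1                # the if-body runs when b1 falls through
        seq.append(b1)
        seq.append(d != 1)       # b2 taken when d != 1
    return seq

def run_1bit_per_branch(seq):
    pred = [False, False]        # one 1-bit predictor per branch
    miss = 0
    for i, taken in enumerate(seq):
        b = i % 2
        miss += pred[b] != taken
        pred[b] = taken
    return miss

def run_11(seq):
    pred = [[False, False], [False, False]]  # [branch][last outcome]
    last = False                 # 1-bit global history
    miss = 0
    for i, taken in enumerate(seq):
        b = i % 2
        miss += pred[b][last] != taken
        pred[b][last] = taken
        last = taken
    return miss

seq = outcomes([2, 0] * 4)
print(run_1bit_per_branch(seq), run_11(seq))  # 16 2
```

With d alternating 2, 0, 2, 0, … the per-branch 1-bit predictors mispredict all 16 branches, while the (1,1) predictor mispredicts only the two warm-up branches.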

Page 12:

12

Tournament Predictors: Adaptively Combining Local and Global Predictors

Use several levels of branch-prediction tables together with an algorithm for choosing among the multiple predictors

Advantage: ability for per-branch basis to select the right predictor for the right branch dynamically

Figure 3.16 The state transition diagram for a tournament predictor has four states corresponding to which predictor to use.

[Figure legend: edges are labeled with recent outcome counts for P1/P2, where 0 = incorrect and 1 = correct; labels such as 2+:0, 1:0, 0:1, and 0:2+ move the selector toward whichever predictor has been correct more often.]
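The selector of Figure 3.16 can be sketched as a 2-bit saturating counter per entry. In this sketch the state encoding — 0–1 means use P1, 2–3 means use P2 — is an assumed convention, not fixed by the slide:

```python
def update_selector(state, p1_correct, p2_correct):
    """Move a 2-bit counter (0-1: use P1, 2-3: use P2) toward the
    predictor that was correct; no change when they agree."""
    if p1_correct and not p2_correct:
        return max(state - 1, 0)
    if p2_correct and not p1_correct:
        return min(state + 1, 3)
    return state                  # both right or both wrong: no change

state = 0                         # start by trusting P1
for _ in range(3):                # P2 keeps being right while P1 is wrong
    state = update_selector(state, False, True)
print(state >= 2)                 # True: the selector now chooses P2
```

After a run of P2-only-correct outcomes, the selector saturates on the P2 side; a single disagreement the other way only nudges it back, so the choice is stable per branch.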

Page 13:

13

Figure 3.17

The fraction of predictions coming from the local predictor for a tournament predictor (=local 2-bit predictor + 2-bit local/global predictor) using the SPEC89 benchmarks.

Figure 3.18

The misprediction rate for three different predictors on SPEC89 as the total number of bits is increased

Page 14:

14

Accuracy of Different Schemes

Dynamic Hardware Prediction

Basic Branch Prediction: Branch Prediction Buffers

(Figure 3.15, p. 206)

(2,2) predictor with 1K entries often outperforms a 2-bit predictor with an unlimited number of entries

Page 15:

15

3.5 High Performance Instruction Delivery

Predicting well is not enough for a multiple-issue pipeline
• We expect to deliver a high-bandwidth instruction stream
• Ideal no-penalty branch: we need to know the next fetch address by the end of the IF stage
• For the classic five-stage pipeline, the branch-prediction buffer is accessed during the ID cycle

The goal here is to be able to fetch an instruction from the destination of a branch
• You need the next address at the same time as you make the prediction
• This is difficult if you don't know where that instruction is until the branch has been resolved
• The solution is a table that remembers the destination addresses of previous branches

Three concepts: branch-target buffer, integrated instruction fetch unit, and indirect branches by predicting return addresses

Page 16:

16

Branch Target Buffer

• Branch Target Buffer (BTB): a branch-prediction cache that stores the predicted address for the next instruction after a branch
– use the address of the branch as the index to get the prediction AND the branch-target address (if taken)
– must check for a branch match now; only predicted-taken branches are stored
• Branch-target address
– branch-prediction buffer: accessed during the ID cycle
– branch-target buffer: accessed during the IF stage

Dynamic Hardware Prediction

Basic Branch Prediction: Branch Target Buffers

[Figure 3.19, p. 210: the PC of the instruction being fetched is compared against the Branch PC entries of the buffer; each entry also holds a Predicted PC and extra prediction state bits. If it matches (yes): the instruction is a branch, and the Predicted PC is used as the next PC. If not (no): the branch is not predicted; proceed normally (next PC = PC + 4).]

Page 17:

17

Figure 3.20 Steps for handling BTB

Incorrect prediction:

• 1-clock-cycle delay fetching the wrong instruction

• restart the fetch 1 clock cycle later

Total penalty of 2 clock cycles

Page 18:

18

Example

Example on page 211 (Figure 3.21).

Determine the total branch penalty for a BTB using the above penalties. Assume also the following:

• Prediction accuracy of 90%

• Hit rate in the buffer of 90%

• Assume that 60% of the branches are taken.

Dynamic Hardware Prediction

Basic Branch Prediction: Branch Target Buffers

Case  Instruction in buffer  Prediction  Actual branch  Penalty cycles
1     Yes                    Taken       Taken          0
2     Yes                    Taken       Not taken      2
3     No                     –           Taken          2
4     No                     –           Not taken      0

Branch penalty = (buffer hit rate × percent incorrect predictions × 2) + ((1 − buffer hit rate) × taken branches × 2)

Branch penalty = (90% × 10% × 2) + (10% × 60% × 2) = 0.18 + 0.12 = 0.30 clock cycles

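The penalty calculation can be written as a small function (the function and parameter names are my own; the numbers are the page-211 example's):

```python
def btb_penalty(hit_rate, accuracy, taken_frac, miss_penalty=2):
    """Average branch penalty in clock cycles for a BTB."""
    wrong_hit = hit_rate * (1 - accuracy) * miss_penalty      # case 2
    miss_taken = (1 - hit_rate) * taken_frac * miss_penalty   # case 3
    return wrong_hit + miss_taken

print(round(btb_penalty(0.9, 0.9, 0.6), 2))  # 0.3 clock cycles per branch
```

Only cases 2 and 3 contribute: a buffer hit with a wrong prediction, and a buffer miss on a branch that turns out to be taken.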

Page 19:

19

Integrated Instruction Fetch Units

An integrated instruction fetch unit is a separate autonomous unit that feeds instructions to the rest of the pipeline

– Integrated branch prediction – The branch predictor constantly predicts branches

– Instruction prefetch – The unit autonomously manages the prefetching of instructions, integrating with branch prediction

– Instruction memory access and buffering – Using prefetch to try to hide the cost of crossing cache blocks

Prefetching and trace caches are discussed in Chapter 5

Dynamic Hardware Prediction

Page 20:

20

Return Address Predictors

Predicting indirect jumps – destination address varies at run time

– indirect procedure calls, procedure returns, select or case statements

Approaches
– predicting with a branch-target buffer
– a stack as a return-address predictor: pushing a return address on the stack at a call and popping one off at a return
– multi-path fetching to reduce the misprediction penalty
• caching addresses or instructions from multiple paths in the target buffer
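The return-address stack can be sketched in a few lines (the class name and the depth of 8 are illustrative assumptions, not from the text):

```python
class ReturnAddressStack:
    """Push a return address at each call, pop a prediction at each return."""
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def call(self, return_pc):
        if len(self.stack) == self.depth:   # finite buffer: drop the oldest
            self.stack.pop(0)
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.call(0x1000)                  # outer caller's return address
ras.call(0x2000)                  # nested caller's return address
print(hex(ras.predict_return()))  # 0x2000: innermost return predicted first
```

Because calls and returns nest, a stack naturally predicts returns correctly whenever the call depth does not exceed the buffer depth.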

Dynamic Hardware Prediction

Page 21:

21

3.6 More ILP with Multiple Issue

Previous techniques – achieving an ideal CPI of one
• eliminate data and control stalls

Multiple-issue processors – achieving CPI < 1!
• start more than one instruction in a given cycle
• issue multiple instructions in a clock cycle
• vector processing: explicit coding of independent loops as operations on large vectors of numbers
• multimedia instructions being added to many processors

Two basic flavors
• superscalar processors and
• VLIW (very long instruction word) processors

Page 22:

22

Issuing Multiple Instructions/Cycle

Flavor I:

Superscalar processors issue varying number of instructions per clock, can be either

– statically scheduled (by the compiler, in-order executing) or– dynamically scheduled (by the hardware, out-of-order execution)

Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)

Example: a 4-issue static superscalar processor
– issue packet: the group of instructions received from the fetch unit that could potentially issue in one clock cycle
– the IF unit examines each instruction in the issue packet in order
– an instruction is not issued if it would cause a structural hazard or a data hazard (hardware hazard detection)

IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

Page 23:

23

Flavor II: VLIW (Very Long Instruction Word)

VLIW issues a fixed number of instructions, formatted either
– as one large instruction or
– as a fixed instruction packet with the parallelism among instructions indicated explicitly by the instruction

also known as Explicitly Parallel Instruction Computer (EPIC)

Inherently statically scheduled by compilers (see Chapter 4)
– a fixed number of instructions (4-16) scheduled by the compiler; operations are placed into wide templates
– Intel Architecture-64 (IA-64), 64-bit address

Page 24:

24

Summary - Issuing Multiple Instructions/Cycle

Page 25:

25

Statically Scheduled Superscalar MIPS Processor

– fetch 64 bits/clock cycle; integer instruction on the left, FP on the right
– fetch/prefetch multiple instructions, but may issue/deliver 0 to n instructions
– hardware hazard detection: can only issue the 2nd instruction if the 1st instruction issues
– more ports on the FP register file, to do an FP load and an FP operation as a pair

Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB

• a 1-cycle load delay stalls 3 instructions in this superscalar
– the FP instruction in the right half can't use the result, nor can the instructions in the next slot

In our MIPS example, we can handle 2 instructions/cycle:
• one floating point
• one of anything else

Page 26:

26

Dynamic Scheduling with Tomasulo’s algorithm in Superscalar

• How do we issue two instructions and keep in-order instruction issue for Tomasulo?
– assume 1 integer + 1 floating-point instruction per cycle
– two approaches to removing the constraint of issuing one integer and one FP instruction in a clock:
1) issue in half a clock cycle (or run issue logic at 2× the clock rate)
2) one Tomasulo control for integer, one for floating point

• Only loads/stores might cause a dependence between integer and FP issue:
– replace the load reservation station with a load queue; operands must be read in the order they are fetched
– a load checks addresses in the store queue to avoid a RAW violation
– a store checks addresses in the load queue to avoid WAR and WAW violations

Page 27:

27

Example:

Integer ALU: 1 cycle; load/store: 2 cycles; FP add: 3 cycles

Assume 2 CDBs, 1 integer ALU, 1 FP unit, and perfect branch prediction

Page 28:

28

Integrated ALU

Figure 3.25 Figure 3.26

•A new loop iteration is fetched and issued every 3 clock cycles

•Issue rate: 5/3 = 1.67

•The loop executes in 16 clock cycles

•One CDB is enough

One integer functional unit for both ALU operations and effective address calculations

Execute stage
• L.D and S.D – effective address calculation
• branches – execute when the branch condition can be evaluated

Page 29:

29

Separate ALU

Figure 3.27 / Figure 3.28

•The loop executes in 5 clock cycles less (11 versus 16)

•Two CDBs are needed

Separate functional units for effective address calculations and ALU operations

Page 30:

30

Three factors limit the performance of the example pipeline

1. Imbalance between the functional unit structure of the pipeline and the example loop
– impossible to fully use the FP units
– we need fewer dependent integer operations per loop

2. The amount of overhead per loop iteration is very high
– DADDIU and BNE: 2 out of 5 instructions

3. The control hazard causes a 1-cycle penalty on every loop iteration
– we assume instructions following a branch cannot start execution until after the branch condition has been evaluated
– accurate branch prediction alone is not sufficient

Page 31:

31

3.7 Hardware-based Speculation

• Motivation
– prediction alone is not sufficient to expose a high amount of ILP
– overcome control dependence by speculating on the outcome of branches → execute the program as if our guesses were correct

• Dynamic scheduling vs. speculation
– dynamic scheduling: only fetch and issue instructions as if our branch predictions were always correct
– speculation: fetch, issue, and execute such instructions

• Incorrect speculation → undo

Intel Pentium II/III/4, AMD K5/K6/Athlon, PowerPC 603/604/G3/G4, MIPS R10000/R12000, Alpha 21264

Page 32:

32

Key Ideas

• Design
– dynamic branch prediction to choose which instructions to execute
– speculation to allow the execution of instructions before the control dependences are resolved
• with the ability to undo the effects of an incorrectly speculated sequence
– dynamic scheduling to deal with the scheduling of different combinations of basic blocks

• Implementation
– allow instructions to execute out of order,
– but force them to commit in order, and prevent any irrevocable action until an instruction commits

Page 33:

33

Reorder Buffer (ROB)

• Reorder buffer – an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed
– a source of operands for instructions, like the reservation stations
– in the interval between completion of instruction execution and instruction commit

• Similar to the store buffer in Tomasulo's algorithm
– speculation: the register file is not updated until the instruction commits
– Tomasulo: once an instruction writes its result, any subsequently issued instruction will find the result in the register file

The ROB completely replaces the store buffers of Tomasulo's algorithm

Page 34:

34

ROB Components

• Instruction type field
– indicates whether the instruction is
  • a branch (no destination result),
  • a store (memory address destination), or
  • a register operation (ALU operation or load, which has a register destination)

• Destination field
– supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written

• Value field
– holds the value of the instruction result until the instruction commits

• Ready field
– indicates that the instruction has completed execution and the value is ready

Page 35:

35

Speculative Tomasulo’s algorithm

1. Issue (= dispatch)
• get an instruction from the instruction queue
• issue the instruction if there is
– an empty reservation station and
– an empty slot in the ROB
then mark both as in use; the ROB # for the result is also sent to the RS

2. Execute
• wait until all operands are available

3. Write Result
• write the result on the CDB, tagged with the ROB #, into the value field of the ROB entry (for a store, the value to be stored goes into the value field)
• mark the RS as available

4. Commit – three cases
• normal commit
– occurs when an instruction reaches the head of the ROB and its result is present in the buffer
– update the register with the result and free the ROB entry
• store
– like normal commit, but memory is updated instead
• mispredicted branch
– the ROB is flushed and execution restarts at the correct successor of the branch
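The in-order commit rule can be sketched with a toy ROB. The entry fields are simplified from the four listed earlier, and the register values are made up for illustration:

```python
from collections import deque

regs = {"F0": 0.0}
rob = deque([
    {"dest": "F0", "value": 4.2,  "ready": True,  "mispredict": False},
    {"dest": "F2", "value": None, "ready": False, "mispredict": False},
])

def commit_step(rob, regs):
    """Commit the head entry only when its result is ready (in-order)."""
    if rob and rob[0]["ready"]:
        entry = rob.popleft()
        if entry["mispredict"]:
            rob.clear()              # mispredicted branch: flush the ROB
        else:
            regs[entry["dest"]] = entry["value"]
        return True
    return False                     # head not ready: commit stalls

print(commit_step(rob, regs), regs["F0"])  # True 4.2
print(commit_step(rob, regs))              # False: next head not ready yet
```

Even if later entries were ready, commit would still stall at an unready head: that is exactly what keeps all state updates in program order.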

Page 36:

36

Example of Speculative Tomasulo's Algorithm

Add: 2 cycles; Multiply: 10 cycles; Divide: 40 cycles

Instruction status:

Instruction          Issue   Exec Comp   Write Result
L.D   F6, 34(R2)       1        3             4
L.D   F2, 45(R3)       2        4             5
MUL.D F0, F2, F4       3       15
SUB.D F8, F6, F2       4        7             8
DIV.D F10, F0, F6      5
ADD.D F6, F8, F2       6       10            11

Cycle 15 of the original algorithm, when MUL.D is ready to commit.

With speculation, SUB.D and ADD.D will not commit until MUL.D commits, although their results are available and can be used.

Page 37:

37

Hazards Through Memory – Load/Store RAW Hazard

Question: given a load that follows a store in program order, are the two related?
– (alternatively: is there a RAW hazard between the store and the load?)
– e.g.:  st 0(R2), R5
         ld R6, 0(R3)

Can we go ahead and start the load early?
– the store address could be delayed for a long time by some calculation that leads to R2 (a divide?)
– we might want to issue/begin execution of both operations in the same cycle

Answer: we are not allowed to start the load until we know that address 0(R2) ≠ 0(R3)
– do not allow a load to initiate its second step if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load
– maintain program order for the computation of the effective address of a load with respect to all earlier stores

What about WAR/WAW hazards through memory?
– stores commit in order, so there is no concern
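The conservative check can be sketched as follows (a hypothetical helper; `None` stands for an earlier store whose effective address has not been computed yet, which must also block the load):

```python
def load_may_proceed(load_addr, rob_stores):
    """True only if no active earlier store targets the same address
    and every earlier store's address is already known."""
    return all(a is not None and a != load_addr for a in rob_stores)

print(load_may_proceed(0x100, [0x200, 0x300]))  # True: no conflict
print(load_may_proceed(0x100, [0x100]))         # False: RAW through memory
print(load_may_proceed(0x100, [None]))          # False: store addr unknown
```

Treating an unknown store address as a conflict is what enforces the program-order requirement on effective-address computation.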

Page 38:

38

Multiple Issue

Separate functional units for address calculation, ALU operations, and branch condition

No speculation: the L.D after the BNE must wait until the branch outcome is determined

With speculation: the L.D following the BNE can start execution early

Page 39:

39

Extended Physical Registers

Speculative Tomasulo's algorithm with a ROB
– architecturally visible registers (R0, …, R31 and F0, …, F31)
– values reside in the visible register set and reservation stations, and temporarily in the ROB

Alternative to the ROB: a larger physical set of registers with register renaming
– an extended set of physical registers holds both the architecturally visible registers and temporary values
• the extended registers replace the functions of the ROB and the RS
• a physical register does not become an architectural register until the instruction commits
– during instruction issue, the renaming process maps architectural register names to physical registers via a renaming table, allocating a new unused register for the destination
• WAW and WAR hazards are avoided by renaming the destination register
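A toy rename-at-issue step (the register names and free list are made up) shows how renaming the destination removes WAW and WAR hazards:

```python
free = ["p4", "p5", "p6"]                  # unused physical registers
rename = {"R1": "p1", "R2": "p2", "R3": "p3"}

def issue(dest, srcs):
    """Rename one instruction: read source mappings, then remap the dest."""
    phys_srcs = [rename[s] for s in srcs]  # sources read the OLD mappings
    new = free.pop(0)                      # fresh physical destination
    rename[dest] = new                     # later readers see the new reg
    return new, phys_srcs

print(issue("R1", ["R2", "R3"]))  # ('p4', ['p2', 'p3'])
print(issue("R1", ["R1"]))        # ('p5', ['p4']): second write to R1 gets
                                  # its own register, so no WAW hazard
```

Because each write to R1 lands in a different physical register, an earlier reader of R1 can still find its value: the name conflict, not a true dependence, has been eliminated.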

Page 40:

40

Register Renaming versus Reorder Buffers

• Advantage: simplifies instruction commit
Mark the register as no longer speculative and free the register holding the old value. Requires only two simple actions:
1. record that the mapping between an architectural register number and a physical register number is no longer speculative
2. free up any physical register being used to hold the older value of the architectural register

• Advantage: simplifies instruction issue
All results are in the extended registers, so there is no need to examine both the ROB and the register file

• Disadvantage: deallocating registers is complex
Before freeing a physical register, we must know that
– it no longer corresponds to an architectural register, and
  • rewriting an architectural register causes the renaming table to point elsewhere
– no further uses of the physical register are outstanding (as a source)
  • requires examining the source register specifiers of all instructions in the functional unit queues

20-80 extra registers: Alpha, PowerPC, MIPS, Pentium, …
– their number limits the number of instructions in execution (each is used until commit)

Page 41:

41

3.8 Studies of the Limitations of ILP

• Conflicting studies of the amount of improvement available
– benchmarks (vectorized FP Fortran vs. integer C programs)
– hardware sophistication
– compiler sophistication

• Studies of ILP limitations
– how much ILP is available using existing mechanisms with increasing HW budgets?
– do we need to invent new HW/SW mechanisms to stay on the processor performance curve?

Page 42:

42

Ideal Hardware Model

Initial HW model: MIPS compilers. Assumptions for an ideal/perfect machine:

1. Register renaming – infinite virtual/physical registers
– all WAW and WAR hazards are avoided
– an unbounded number of instructions can begin execution simultaneously
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
– a machine with perfect speculation
– an unbounded buffer of instructions available
4. Memory-address alias analysis – all addresses are known, and a store can be moved before a load provided the addresses are not equal
5. Perfect caches – all loads and stores always complete in one cycle

Assumptions 2+3 eliminate all control dependences; 1+4 eliminate all but true data dependences (RAW)

1-cycle latency for all instructions; unlimited number of instructions issued per clock cycle

Studies of ILP

Page 43:

43

Upper Limit to ILP: Ideal Machine (Figure 3.35, page 242)

This is the amount of parallelism available when there are no branch mispredictions and we are limited only by data dependences.

Measured ILP (instructions per cycle): integer programs 18-63, FP programs 75-150

Page 44:

44

Limitations on Window Size and Maximum Issue Count

• Operand dependence comparisons required to determine whether n issuing instructions have any register dependences among them:

  Σ(i=1 to n−1) 2i = 2 × (n−1)n/2 = n² − n

• Window: the set of instructions that is kept in the processor and examined for simultaneous execution

  # of result comparisons per cycle = maximum completion rate × window size × # of operands per instruction

• Assume a 2K window and 64 issues per clock in later studies
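The quadratic growth of the comparison count is easy to check (the helper name is my own):

```python
def dependence_comparisons(n):
    """Comparisons to check each instruction's 2 source operands against
    every earlier issuing instruction's destination: sum of 2i for i < n."""
    return sum(2 * i for i in range(n))  # closed form: n*n - n

print(dependence_comparisons(4), dependence_comparisons(64))  # 12 4032
```

Going from 4-issue to 64-issue multiplies the per-cycle comparison hardware by more than 300×, which is why window size and issue width are such hard practical limits.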

Figure 3.36 Effects of reducing the window size

Figure 3.37 Effect of window size and average issue rate

FP (59-61) > Int (15-41)

Page 45:

45

Effects of Realistic Branch and Jump Prediction

What parallelism do we get when we don't assume perfect branch prediction, but some realistic model? Possibilities include:

1. Perfect – all branches are perfectly predicted (previous slide)
2. Tournament-based branch predictor – a correlating 2-bit predictor and a non-correlating 2-bit predictor together with a selector; 8K entries for branches and 2K entries for jumps
3. Standard 2-bit predictor with 512 2-bit entries
4. Static – a static predictor uses the profile history of the program and predicts that the branch is always taken or always not taken
5. None – parallelism is limited to the basic block

Assume a 2K window, 64 issues, and a tournament-based predictor in later studies

Page 46:

46

Effects of Branch-Prediction Schemes

Figure 3.40 Branch-prediction accuracy

Figure 3.38 Effect of branch-prediction schemes Figure 3.39 sorted by applications

FP (15-48) > Int (9-12)

Page 47:

47

Effects of Finite Registers for Renaming

• Window size=2K, Max issue=64 instructions, tournament-based branch predictor

• The impact on the integer programs is small primarily because the limitations in window size and branch prediction have limited the ILP substantially

• Assume 256 integer and 256 FP registers available for renaming in later studies

FP (16-45) > Int (10-15)

Page 48:

48

Effects of Imperfect Memory Alias Analysis

Different models for memory alias analysis (memory disambiguation):

1. Perfect – perfect disambiguation of all memory references

2. Global/stack perfect – perfect analysis for global and stack references (the best compiler-based analysis schemes can do); all heap references are assumed to conflict

3. Inspection – examine the accesses to see whether they can be determined not to interfere at compile time
– e.g., Mem[R10 + 20] and Mem[R10 + 100] never conflict (same base register with different offsets)

4. None – all memory references are assumed to conflict

Page 49:

49

Effects of Imperfect Memory Alias Analysis

• Since there are no heap references in Fortran, there is no difference between perfect and global/stack-perfect analysis for Fortran programs

• 2K window, 64 issues

Figure 3.43 Effect of alias analysis; Figure 3.44 sorted by application

Page 50:

50

3.10 The P6 Microarchitecture

– The basis for the Pentium Pro, Pentium II, and Pentium III
– A dynamically scheduled processor that translates each IA-32 instruction into a series of micro-operations (uops) executed by the pipeline

• uops are similar to typical RISC instructions
• Up to 3 IA-32 instructions are fetched, decoded, and translated into uops every clock cycle
• If an IA-32 instruction requires more than 4 uops, it is implemented by a microcoded sequence that generates the necessary uops over multiple clock cycles => max = 6 uops per clock

The three differ in clock rate, cache architecture, and memory interface. The Pentium II added MMX (multimedia extensions); the Pentium III added SSE (Streaming SIMD Extensions)
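The IA-32-to-uop translation above can be illustrated with a register-memory instruction, which decomposes into separate load, operate, and store uops. A toy model of the idea (the uop names and instruction encoding are hypothetical, not Intel's actual decoder output):

```python
# Illustrative decomposition of IA-32 instructions into RISC-like uops.
# This is a toy sketch of the translation idea, not Intel's real decoder.
def translate(instr):
    op, dst, src = instr
    if op == "add_mem":  # add [mem], reg : read-modify-write of memory
        return [("load", "tmp", dst),     # tmp <- Mem[dst]
                ("add", "tmp", src),      # tmp <- tmp + src
                ("store", dst, "tmp")]    # Mem[dst] <- tmp
    if op == "add_reg":  # add reg, reg : already RISC-like, a single uop
        return [("add", dst, src)]
    raise ValueError("unmodeled instruction")

print(len(translate(("add_mem", "[R10+20]", "eax"))))  # 3 uops
print(len(translate(("add_reg", "ebx", "eax"))))       # 1 uop
```

This matches the slide's observation that most instructions need only one uop, while memory-operand forms inflate the average (1.37 uops per IA-32 instruction).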


P6 Microarchitecture Pipeline

• uops are executed by an out-of-order speculative pipeline using register renaming and a ROB (similar to Section 3.7)
– Up to 3 uops per clock can be renamed and dispatched to the reservation stations, or committed

• 14 super-pipelined stages
– 8 stages: in-order instruction fetch, decode, and dispatch
• 512-entry, two-level (correlating) branch predictor
• the decode and issue stages include 40 extended registers for register renaming, and dispatch to one of 20 reservation stations and one of 40 ROB entries
– 3 stages: out-of-order execution in one of 5 separate functional units (ALU, FP, branch, memory address, memory access; execution takes 1-32 cycles)
– 3 stages: instruction commit

A repeat rate of 2 means that an operation can start only every other cycle


Stalls in Decode Cycle

Figure 3.50 Number of instructions decoded per clock (average = 0.87 instructions per cycle)

Figure 3.51 Stall cycles per instruction at decode time (I-cache misses + lack of RS/ROB entries)


Figure 3.52 Number of micro-operations per IA-32 Instruction

• Most instructions take only one uop

• On average, 1.37 uops per IA-32 instruction

• Other than fpppp, the integer programs typically require more uops


Figure 3.53 Number of misses per thousand instructions for L1 and L2 caches

• L1 = 8KB instruction + 8KB data (L1 miss latency largely hidden by speculative execution)
• L2 = 256KB (an L2 access costs about 5 times an L1 access; L2 misses dominate performance)
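Misses per thousand instructions translate directly into a CPI contribution: CPI_mem = (misses/instruction) × (miss penalty in cycles). A small worked sketch, using made-up miss rates and penalties rather than measured P6 values:

```python
def memory_cpi(misses_per_1000_instr, penalty_cycles):
    """CPI contribution of one cache level:
    (misses / instruction) * (stall cycles per miss)."""
    return misses_per_1000_instr / 1000.0 * penalty_cycles

# Hypothetical numbers: 30 L1 misses and 5 L2 misses per 1000 instructions,
# with a 10-cycle L1 miss penalty and a 100-cycle L2 miss penalty.
l1_cpi = memory_cpi(30, 10)   # 0.3 extra CPI from L1 misses
l2_cpi = memory_cpi(5, 100)   # 0.5 extra CPI from L2 misses
print(l1_cpi, l2_cpi)  # the rarer but costlier L2 misses dominate
```

This is why the slide notes that L2 misses dominate performance even though they are far less frequent than L1 misses.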


Figure 3.54 BTB miss frequency (dominant) vs. mispredict frequency

If the BTB misses, a static prediction is used:
• backward branches are predicted taken (1-cycle penalty if correctly predicted)
• forward branches are predicted not taken (no penalty if correctly predicted)

Branch mispredict costs:
• direct penalty: 10-15 cycles
• indirect: incorrectly speculated instructions, hard to measure

On average, about 20% of branches use the simple static predictor rule
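The static fallback rule above reduces to a single comparison of the target address against the branch address. A minimal sketch (the function name is illustrative):

```python
def static_predict(branch_pc, target_pc):
    """P6 static fallback on a BTB miss: backward branches (target
    address below the branch, typically loop branches) are predicted
    taken; forward branches are predicted not taken."""
    return target_pc < branch_pc  # True = predict taken

print(static_predict(0x400100, 0x4000C0))  # backward loop branch -> True
print(static_predict(0x400100, 0x400180))  # forward skip branch  -> False
```

The heuristic works because backward branches are usually loop back-edges, which are taken far more often than not.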


Instruction Commit

Figure 3.55 The fraction of issued instructions that do not commit. On average, each mispredicted branch causes about 20 issued uops to be canceled

Figure 3.56 Breakdown of how often 0-3 uops commit in a cycle (average: 55%, 13%, 8%, 23%)


Figure 3.57 Actual CPI and Individual CPIs

The uop cycles assume that 3 uops complete every cycle and include the number of uops per instruction. The average CPI is 1.15 for the SPECint programs and 2.0 for the SPECFP programs


AMD Athlon

• Similar to the P6 microarchitecture (Pentium III), but with more resources

• Transistors: PIII 24M vs. Athlon 37M
• Die size: 106 mm2 vs. 117 mm2
• Power: 30W vs. 76W
• Cache: 16K/16K/256K vs. 64K/64K/256K
• Window size: 40 vs. 72 uops
• Rename registers: 40 vs. 36 int + 36 FP
• BTB: 512 x 2 vs. 4096 x 2
• Pipeline: 10-12 stages vs. 9-11 stages
• Clock rate: 1.0 GHz vs. 1.2 GHz
• Memory bandwidth: 1.06 GB/s vs. 2.12 GB/s


Pentium 4 – NetBurst Microarchitecture

• Still translates from 80x86 to micro-ops
• A much deeper pipeline: 24 stages (vs. 14)
• Uses register renaming with potentially up to 128 registers rather than the 40-entry ROB
–Window: 40 vs. 126
• 7 execution units (vs. 5: one more ALU and an address computation unit)
• Better branch predictor, more functional units
• Aggressive ALU and data cache (operating on a half clock cycle)
• 8 times larger BTB (4096 vs. 512)
• New SSE2 instructions allow 2 floating-point operations per instruction
• The instruction cache holds micro-operations rather than 80x86 instructions

–no 80x86 decode stages needed on a cache hit
–called the "trace cache" (TC)

• Faster memory bus: 400 MHz vs. 133 MHz
• Caches

–Pentium III: L1I 16KB, L1D 16KB, L2 256KB
–Pentium 4: L1I 12K uops, L1D 8KB, L2 256KB
–Block size: PIII 32B vs. P4 128B; 128 vs. 256 bits/clock

• Clock rates:
–Pentium III 1 GHz vs. Pentium 4 1.5 GHz


Pentium, Pentium Pro, Pentium 4 Pipeline

• Pentium (P5) = 5 stages
• Pentium Pro, II, III (P6) = 10 stages (1-cycle execute)
• Pentium 4 (NetBurst) = 20 stages (excluding decode, which is bypassed on a trace-cache hit)

The Pentium 4


Block Diagram of Pentium 4 Microarchitecture

• BTB = Branch Target Buffer (branch predictor)

• I-TLB = Instruction TLB, Trace Cache = Instruction cache

• RF = Register File; AGU = Address Generation Unit

• "Double-pumped ALU" means the ALU runs at 2x the clock rate => equivalent to 2x the ALU functional units



Pentium 4 Die Photo

• 42M transistors (PIII: 26M)
• 217 mm2 die (PIII: 106 mm2)
• L1 execution (trace) cache: buffers 12,000 micro-ops
• 8KB data cache
• 256KB L2 cache



Benchmarks: Pentium 4 vs. PIII vs. Athlon

• SPEC base2000
– Int: P4@1.5 GHz: 524, PIII@1 GHz: 454, AMD Athlon@1.2 GHz: ?
– FP: P4@1.5 GHz: 549, PIII@1 GHz: 329, AMD Athlon@1.2 GHz: 304

• WorldBench 2000 benchmark (business), PC World magazine, Nov. 20, 2000 (bigger is better)
– P4: 164, PIII: 167, AMD Athlon: 180

• Quake 3 Arena: P4 172, Athlon 151
• SYSmark 2000 composite: P4 209, Athlon 221
• Office productivity: P4 197, Athlon 209
• S.F. Chronicle, 11/20/00: "... the challenge for AMD now will be to argue that frequency is not the most important thing -- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."



Why is the Pentium 4 Slower than the Pentium III?

• Instruction count is the same for x86
• Clock rates: P4 > Athlon > PIII
• How can the P4 be slower?
• Time = Instruction count x CPI x 1/Clock rate
• The average clocks per instruction (CPI) of the P4 must be worse than that of the Athlon and the PIII
• Will CPI ever get below 1.0 for real programs?
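The performance equation above can be checked numerically: with the same instruction count, a higher clock rate loses if CPI grows faster than the frequency gain. The CPI values below are made up for illustration, not measured figures:

```python
def cpu_time(instr_count, cpi, clock_ghz):
    """Performance equation: time = IC * CPI / clock rate (in seconds)."""
    return instr_count * cpi / (clock_ghz * 1e9)

N = 1e9  # same x86 instruction count on both machines
piii = cpu_time(N, cpi=1.1, clock_ghz=1.0)  # hypothetical PIII CPI
p4 = cpu_time(N, cpi=1.8, clock_ghz=1.5)    # hypothetical P4 CPI
print(p4 > piii)  # True: the worse CPI outweighs the 1.5x clock advantage
```

With these numbers the P4's 50% clock advantage is more than canceled by a 64% worse CPI, which is the slide's point: clock rate alone does not determine performance.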



Another Approach: Multithreaded Execution for Servers

• Thread: a process with its own instructions and data
– a thread may be one process of a parallel program of multiple processes, or it may be an independent program
– each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute

• Multithreading: multiple threads share the functional units of one processor via overlapping
– the processor must duplicate the independent state of each thread, e.g., a separate copy of the register file and a separate PC
– memory is shared through the virtual memory mechanisms

• Threads execute overlapped, often interleaved
– when a thread is stalled, perhaps on a cache miss, another thread can be executed, improving throughput
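The switch-on-stall idea can be sketched as a toy scheduler: when the running thread stalls on a miss, the processor issues from another ready thread instead of idling. Everything here (the op encoding, the 3-cycle miss latency) is an illustrative model, not a real pipeline:

```python
def run(threads, cycles):
    """Toy coarse-grained multithreading. Each thread is a string of ops:
    'x' = compute (1 cycle), 'm' = miss (issues, then stalls the thread
    for 3 cycles). Returns a per-cycle trace of which thread issued,
    with '-' for cycles where every thread was stalled or finished."""
    pcs = [0] * len(threads)            # next op index per thread
    stalled_until = [0] * len(threads)  # cycle at which each thread wakes
    cur, trace = 0, []
    for t in range(cycles):
        # find a ready thread, preferring the current one (coarse-grained)
        for i in range(len(threads)):
            cand = (cur + i) % len(threads)
            if stalled_until[cand] <= t and pcs[cand] < len(threads[cand]):
                cur = cand
                break
        else:
            trace.append("-")  # nothing to issue this cycle
            continue
        op = threads[cur][pcs[cur]]
        pcs[cur] += 1
        trace.append(str(cur))
        if op == "m":                   # miss: stall this thread 3 cycles;
            stalled_until[cur] = t + 3  # next cycle another thread runs
    return "".join(trace)

print(run(["xmx", "xxx"], 6))  # thread 1 covers thread 0's miss latency
```

With a single thread the miss cycles would appear as '-' idle slots; with two threads the second thread fills them, which is exactly the throughput argument on the slide.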



Summary

3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & the Algorithm
3.4 Reducing Branch Costs with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of the Limitations of ILP
3.9 Limitations on ILP for Realizable Processors
3.10 The P6 Microarchitecture