lois orosa and rodolfo azevedo university of campinas [email protected] revisiting load value...

Lois Orosa and Rodolfo Azevedo
University of Campinas

[email protected]

Revisiting Load Value Speculation: an Approach to Mainstream Processors



Projects

- Load Value Speculation
- Frequent Value Locality
- On-chip photonics (Jorge González, PhD candidate)


Introduction to Value Speculation (I)

- Proposed in the 1990s
- Improves ILP by breaking true data dependencies (RAW)
- Speculation on all instructions
- The prediction is written to the output register
- Predictors are indexed by PC (at fetch time)
- The proposals were very complex at the time (many changes to the OoO engine)
- Recently, Perais and Seznec revisited the topic [HPCA'13] [ISCA'14] [HPCA'15]
  - They propose simplifications to the implementation
  - They propose new predictors


Introduction to Value Speculation (II)

- Confidence counters [Perais'13] (per instruction) to increase precision
  - Speculate only when the confidence is high
  - Reduces mispredictions
  - Decreases coverage
  - Incremented when the prediction is correct, reset on a misprediction
- Precision
  - If the misprediction penalty is low, the system can tolerate low precision
  - If the misprediction penalty is high, precision should be high (99% or more)
- The prediction has to be available before dispatch time
  - Available cycles: from fetch to dispatch
  - The predictor delay is not critical
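The confidence mechanism and the precision trade-off on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [Perais'13]: the counter width (`bits`), the rule of speculating only when the counter is saturated, and the use of a last-value prediction for a single static load are assumptions made for the example. The helper `break_even_precision` relies on a simple expected-value model (also an assumption): speculation pays off on average only when precision exceeds penalty / (gain + penalty).

```python
class ConfidenceCounter:
    """Saturating counter: incremented on a correct prediction, reset on a miss."""

    def __init__(self, bits=2):
        self.max_value = (1 << bits) - 1  # assumed width; not given in the slides
        self.value = 0

    def is_confident(self):
        # Assumption: speculate only when the counter is saturated.
        return self.value == self.max_value

    def update(self, correct):
        self.value = min(self.value + 1, self.max_value) if correct else 0


def simulate(values, bits=2):
    """Replay one static load with a last-value prediction gated by confidence.

    Returns (used, correct_used): how many predictions were actually used
    for speculation, and how many of those were right.
    """
    counter = ConfidenceCounter(bits)
    last = None
    used = correct_used = 0
    for value in values:
        if last is not None:
            correct = (last == value)
            if counter.is_confident():
                used += 1
                correct_used += correct
            counter.update(correct)
        last = value
    return used, correct_used


def break_even_precision(gain_cycles, penalty_cycles):
    """Minimum precision at which speculation stops losing cycles on average."""
    return penalty_cycles / (gain_cycles + penalty_cycles)
```

For the hypothetical trace `[5, 5, 5, 5, 5, 5, 9, 5, 5]`, `simulate` speculates on only 3 of the 9 loads and gets 2 of them right: the counter gives up coverage in exchange for precision. Under the same model, a hypothetical penalty of 99 cycles against a gain of 1 cycle gives a break-even precision of 0.99.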


Introduction to Value Speculation (III)

- Validation
  - At execution time (OoO changes, small misprediction penalty)
  - At commit time (no OoO changes, higher misprediction penalty)
- Recovering from a misprediction:
  - Selective reissue: faster, more complex (validation at execution time)
  - Pipeline squashing: slower, simpler
- Two main problems:
  - Register port pressure
    - New extra ports (extra writes for predictions, extra reads for validations and predictor updates)
  - Back-to-back predictions
    - Predictors may depend on previous values


Contributions

Analysis of the potential of Value Speculation in a narrow processor for different types of instructions

Reducing complexity in narrow-width-issue processors by speculating only on load instructions

AV predictor: two-phase value predictor with address prediction

XLStride predictor: multilevel stride predictor


Baseline Processor & Benchmarks

- Baseline: real narrow-width-issue processor
- ZSIM simulator:
  - Westmere OoO x86-64, 4-issue, 2-level branch predictor
  - 128-entry ROB, 32-entry load queue, 32-entry store queue
  - L1I & L1D: 32KB, 4-way, LRU, 4-cycle latency
  - L2 cache: 256KB, 8-way, LRU, 12-cycle latency
  - Pipeline squashing, validation at commit
- Benchmarks
  - Splash2, Parsec, SPEC2000, SPEC2006


Potential of Value Speculation (I)

- Six categories of instructions
- Loads are 25% of all dynamic micro-instructions
- High-latency micro-instructions (more than 5 cycles) are not representative (included in "Other")


Potential of Value Speculation (III)

- Oracle predictor (no mispredictions)
- Value Speculation in each category of instruction
- Loads have almost the same potential as speculating on all instructions
- Loads (25%) have more potential gains than all the other instructions together (75%)

[Figure: comparison of the ALL, LOADS, and NOTLOADS categories]


Advantages of Speculating only in Loads in a narrow processor

- Value Speculation in narrow-issue processors
  - Reduced back-to-back prediction: fewer in-flight instructions
  - Approach to mainstream processors
  - Reduced misprediction penalty (smaller pipeline)
- Speculation on ¼ of the instructions (loads), with almost the same potential gains:
  - Reduced register port pressure
  - Reduced back-to-back prediction
- Still need confidence counters to increase precision
  - "mcf" minimum precision: 76.7%
  - "tonto" minimum precision: 99.6%


State of the Art Predictors

- Last Value Predictor (LVP)
- Stride predictor
  - Example: {1,2,3,4,5}
  - Variants: 2D-Stride
- FCM
- VTAGE
- DVTAGE, DFCM
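The two simplest predictors in this list can be sketched as follows. This is a hedged illustration of the general idea only: the tables are unbounded Python dictionaries rather than fixed-size hardware tables, there are no confidence counters, the initial stride of 0 is an assumption, and the names `LastValuePredictor`, `StridePredictor`, and `replay` are chosen for this example.

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""

    def __init__(self):
        self.table = {}  # pc -> last value

    def predict(self, pc):
        return self.table.get(pc)  # None when the PC has not been seen yet

    def update(self, pc, value):
        self.table[pc] = value


class StridePredictor:
    """Predicts last value + last observed stride."""

    def __init__(self):
        self.table = {}  # pc -> (last value, stride)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None
        last, stride = entry
        return last + stride

    def update(self, pc, value):
        entry = self.table.get(pc)
        stride = value - entry[0] if entry is not None else 0  # assumed initial stride
        self.table[pc] = (value, stride)


def replay(predictor, values, pc=0x400):
    """Feed a value sequence for one (hypothetical) PC; return the predictions."""
    predictions = []
    for value in values:
        predictions.append(predictor.predict(pc))
        predictor.update(pc, value)
    return predictions
```

On the sequence {1,2,3,4,5}, the stride sketch predicts `[None, 1, 3, 4, 5]`, so it is right from the third value onward, while the last-value sketch predicts `[None, 1, 2, 3, 4]` and is never right.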


XLStride Predictor

- It detects strides between consecutive values, and also between alternating values
  - Examples: {2,1,1,4,4,3,6,6,7,8}, {4,0,4,9,4,1,4}
- It can have several levels
  - X histories, each one containing stride information about the last X occurrences of the instruction
  - It requires X^2 strides + last value
  - 16-bit strides
  - X predictions: selection by confidence counters
- We implemented a 2LStride predictor (good performance/cost ratio)

Example: 2LStride, 1 bit confidence counter
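A simplified reading of the multilevel idea can be sketched in Python. This is not the authors' design: the sketch keeps a short history of raw values instead of the X^2 stride table mentioned on the slide, it models a single static load, and the class name `XLStrideSketch`, the per-level counters, and the tie-breaking rule (lowest level wins) are assumptions. Level k predicts the next value from the value k occurrences back plus the stride between the values k and 2k occurrences back; the level with the highest non-zero confidence is selected.

```python
from collections import deque


class XLStrideSketch:
    """Multilevel stride sketch for one static load (illustrative only)."""

    def __init__(self, levels=2, counter_bits=1):
        self.levels = levels
        self.max_confidence = (1 << counter_bits) - 1
        self.history = deque(maxlen=2 * levels)  # most recent value last
        self.confidence = [0] * levels           # one counter per level

    def _level_guess(self, distance):
        # Needs two values that are `distance` occurrences apart.
        if len(self.history) < 2 * distance:
            return None
        base = self.history[-distance]
        stride = base - self.history[-2 * distance]
        return base + stride

    def predict(self):
        """Return the guess of the most confident level, or None."""
        best_guess, best_confidence = None, 0
        for distance in range(1, self.levels + 1):
            guess = self._level_guess(distance)
            confidence = self.confidence[distance - 1]
            if guess is not None and confidence > best_confidence:
                best_guess, best_confidence = guess, confidence
        return best_guess

    def update(self, actual):
        """Train every level with the real value, then record it."""
        for distance in range(1, self.levels + 1):
            guess = self._level_guess(distance)
            if guess is None:
                continue
            if guess == actual:
                bumped = self.confidence[distance - 1] + 1
                self.confidence[distance - 1] = min(bumped, self.max_confidence)
            else:
                self.confidence[distance - 1] = 0
        self.history.append(actual)
```

With two levels and 1-bit counters, a hypothetical interleaved sequence such as 10, 100, 20, 200, 30, 300, 40, 400 defeats the level-1 (consecutive) stride, but level 2 becomes confident after the fifth value and then predicts 300, 40, and 400 correctly.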


AV Predictor

- Some benchmarks exhibit patterns in the addresses, not in the values
- Address Table (AT): indexed by PC; result: predicted address
  - Implemented with a state-of-the-art predictor
- Value Table (VT): indexed by predicted address; result: predicted value
  - Implemented with a last value predictor
  - VT is also updated on stores
- Detects patterns in the addresses: results are totally different from traditional predictors
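The two-phase lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluated design: the Address Table here uses a plain stride predictor (the slides only say "a state-of-the-art predictor"), both tables are unbounded dictionaries, there are no confidence counters, and the names `AVPredictorSketch`, `on_load`, and `on_store` are hypothetical.

```python
class AVPredictorSketch:
    """Two-phase predictor: PC -> predicted address -> predicted value."""

    def __init__(self):
        self.address_table = {}  # pc -> (last address, address stride)
        self.value_table = {}    # address -> last value loaded or stored

    def predict(self, pc):
        """Phase 1 predicts the address, phase 2 looks up its last value."""
        entry = self.address_table.get(pc)
        if entry is None:
            return None
        last_address, stride = entry
        predicted_address = last_address + stride
        return self.value_table.get(predicted_address)

    def on_load(self, pc, address, value):
        """Train both tables once the load has executed."""
        entry = self.address_table.get(pc)
        stride = address - entry[0] if entry is not None else 0  # assumed initial stride
        self.address_table[pc] = (address, stride)
        self.value_table[address] = value

    def on_store(self, address, value):
        """Stores also update the Value Table, as stated on the slide."""
        self.value_table[address] = value
```

For example, if stores write the unrelated values 7, 42, 5, and 99 to addresses 100, 108, 116, and 124, and one load then walks those addresses, the sketch predicts 5 and 99 for the third and fourth loads even though the values themselves follow no stride: the pattern is in the addresses.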


Evaluation

- Load speculation
- 7 state-of-the-art predictors
- 2LStride predictor
- 3 AV predictors
- Several hybrid predictors
- Uses half of the entries of the state-of-the-art predictors [Perais and Seznec, HPCA'13]


Results (I)

Individual results:

Hybrid results:

Best of the single predictors

Always better than the best of the single predictors


Results (II)

Multicore experiments with 24 cores

To check the influence of shared memory on the precision

Precision on the value table => No changes in shared memory by remote processors


Conclusions

- We simulate a real processor (Intel Westmere) to bring Value Prediction closer to general-purpose processors (narrow-issue processors)
- Speculating on loads has a better cost/benefit ratio than speculating on all instructions (in narrow processors)
- We propose the XLStride predictor
  - Detects more complex stride patterns
- We propose the AV predictor
  - Complementary to the traditional predictors: ideal for hybrid predictors
- Speed-up of up to 33% (average 10%)
- Shared memory in multicore processors barely affects the precision of predictors

Lois Orosa [email protected]

Thank You!!



Potential of Value Speculation (II)
