![Page 1: Lois Orosa and Rodolfo Azevedo University of Campinas lois.orosa@ic.unicamp.br Revisiting Load Value Speculation: an Approach to Mainstream Processors](https://reader035.vdocuments.pub/reader035/viewer/2022062407/56649e225503460f94b0e6da/html5/thumbnails/1.jpg)
Lois Orosa and Rodolfo Azevedo
University of Campinas
Revisiting Load Value Speculation: an Approach to Mainstream Processors
Projects
- Load Value Speculation
- Frequent Value Locality
- On-chip photonics (Jorge González, PhD candidate)

Introduction to Value Speculation (I)
- It was proposed in the 1990s
- Improves ILP by breaking true data dependencies (RAW)
- Speculation on all instructions
- The prediction is written to the output register
- Predictors are indexed by PC (at fetch time)
- The proposals were very complex at the time (many changes in the OoO engine)
- Recently, Perais and Seznec revisited the topic [HPCA'13][ISCA'14][HPCA'15]
  - Propose simplifications in the implementation
  - Propose new predictors

Introduction to Value Speculation (II)
- Confidence counters [Perais'13] (per instruction) to increase precision
  - Only speculate when the confidence is high
  - Reduce mispredictions
  - Decrease coverage
  - Incremented when the prediction is correct, reset on a misprediction
- Precision
  - If the misprediction penalty is low, the system can tolerate low precision
  - If the misprediction penalty is high, precision should be high (99% or more)
- The prediction has to be available before dispatch time
  - Available cycles: from fetch to dispatch
  - The predictor delay is not critical

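The confidence mechanism described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `ConfidenceCounter`, the 3-bit width, and the rule "speculate only when the counter is saturated" are assumptions chosen for the example.

```python
class ConfidenceCounter:
    """Saturating per-instruction confidence counter (illustrative sketch)."""

    def __init__(self, bits=3):
        # Assumption: a 3-bit counter; the slide does not fix a width.
        self.max_value = (1 << bits) - 1
        self.value = 0

    def should_speculate(self):
        # Assumption: speculate only once the counter is saturated.
        return self.value == self.max_value

    def update(self, prediction_correct):
        if prediction_correct:
            # Increase (saturating) when the prediction was correct.
            self.value = min(self.value + 1, self.max_value)
        else:
            # Reset on a misprediction.
            self.value = 0


counter = ConfidenceCounter(bits=3)
for _ in range(7):
    counter.update(True)                          # seven correct predictions in a row
ready_after_streak = counter.should_speculate()   # True: counter is saturated
counter.update(False)                             # a single misprediction
ready_after_miss = counter.should_speculate()     # False: confidence was reset
```

A wider counter trades coverage for precision: more correct predictions are needed before speculation starts, so fewer loads are speculated but fewer of them are wrong.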
Introduction to Value Speculation (III)
- Validation
  - At execution time (OoO changes, small misprediction penalty)
  - At commit time (no OoO changes, higher misprediction penalty)
- Recovering from misprediction:
  - Selective reissue: faster, more complex (validation at execution time)
  - Pipeline squashing: slower, simpler
- Two main problems:
  - Register port pressure
    - New extra ports (extra writes for predictions, extra reads for validations and predictor updates)
  - Back-to-back predictions
    - Predictors may depend on previous values

Contributions
Analysis of the potential of Value Speculation in a narrow processor for different types of instructions
Reducing complexity in narrow-width-issue processors by speculating only in load instructions
AV predictor: two-phase value predictor that first predicts the load address
XLStride predictor: multilevel stride predictor
Baseline Processor & Benchmarks
- Baseline: real narrow-width-issue processor
  - ZSIM simulator: Westmere OoO x86-64, 4-issue, 2-level branch predictor
  - 128-entry ROB, 32-entry load queue, 32-entry store queue
  - L1I & L1D: 32KB, 4-way, LRU, 4-cycle latency
  - L2 cache: 256KB, 8-way, LRU, 12-cycle latency
  - Pipeline squashing, validation at commit
- Benchmarks
  - Splash2, Parsec, SPEC2000, SPEC2006

Potential of Value Speculation (I)
- Six categories of instructions
- Loads are 25% of all dynamic micro-instructions
- High-latency micro-instructions (more than 5 cycles) are not representative (included in "Other")

Potential of Value Speculation (III)
- Oracle predictor (no mispredictions)
- Value Speculation in each category of instruction
- Loads have almost the same potential as speculating in all instructions
- Loads (25%) have more potential gains than all the other instructions together (75%)
[Figure: potential gains with the oracle predictor; series: ALL, LOADS, NOTLOADS]

Advantages of Speculating only in Loads in a narrow processor
- Value Speculation in narrow-issue processors
  - Reduced back-to-back prediction: fewer in-flight instructions
  - Approach to mainstream processors
  - Reduced misprediction penalty (smaller pipeline)
- Speculation in ¼ of the instructions (loads), with almost the same potential gains:
  - Reduced register port pressure
  - Reduced back-to-back prediction
- Still need confidence counters to increase precision
  - "mcf" minimum precision: 76.7%
  - "tonto" minimum precision: 99.6%

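One way to read a "minimum precision" figure is as a break-even point: the precision above which speculation gains more cycles than it loses. The sketch below works under a deliberately simple assumption that each correct prediction saves `gain_cycles` on average and each misprediction costs `penalty_cycles`. The slides do not say how the per-benchmark values were obtained, and the function name and the cycle counts are hypothetical.

```python
def break_even_precision(gain_cycles, penalty_cycles):
    """Precision p at which p * gain equals (1 - p) * penalty.

    Assumption: average gain and penalty are constant per prediction,
    which real workloads only approximate.
    """
    return penalty_cycles / (gain_cycles + penalty_cycles)


# Hypothetical numbers, chosen only to show the trend.
cheap_recovery = break_even_precision(gain_cycles=4.0, penalty_cycles=4.0)    # 0.5
costly_recovery = break_even_precision(gain_cycles=1.0, penalty_cycles=99.0)  # 0.99
```

Under this model, a benchmark where a correct prediction hides a long latency tolerates a much lower precision than one where each correct prediction saves little.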
State of the Art Predictors
- Last Value Predictor (LVP)
- Stride predictor
  - {1,2,3,4,5}
  - Variants: 2D-Stride
- FCM
- VTAGE
- DVTAGE, DFCM

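The two simplest predictors in this list can be sketched in a few lines. This is an illustrative model only: it uses unbounded dictionaries keyed by PC instead of fixed-size hardware tables, omits confidence counters, and the class names and the `PC` value are made up for the example.

```python
class LastValuePredictor:
    """Predicts that a load returns the same value as last time."""

    def __init__(self):
        self.table = {}  # pc -> last value seen

    def predict(self, pc):
        return self.table.get(pc)  # None if the PC has not been seen

    def update(self, pc, value):
        self.table[pc] = value


class StridePredictor:
    """Predicts last value plus the last observed stride."""

    def __init__(self):
        self.table = {}  # pc -> (last value, stride)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None
        last, stride = entry
        return last + stride

    def update(self, pc, value):
        entry = self.table.get(pc)
        # The first occurrence has no previous value, so start with stride 0.
        stride = value - entry[0] if entry is not None else 0
        self.table[pc] = (value, stride)


PC = 0x400  # hypothetical address of one static load
lvp, stride_predictor = LastValuePredictor(), StridePredictor()
for v in [1, 2, 3, 4]:
    lvp.update(PC, v)
    stride_predictor.update(PC, v)
lvp_guess = lvp.predict(PC)                  # 4 (repeats the last value)
stride_guess = stride_predictor.predict(PC)  # 5 (continues the sequence)
```

On the slide's sequence {1,2,3,4,5}, the last value predictor is always one step behind, while the stride predictor captures the constant increment.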
XLStride Predictor
- It detects strides between consecutive values, and also between alternating values
  - Examples: {2,1,1,4,4,3,6,6,7,8}, {4,0,4,9,4,1,4}
- It can have several levels
- X histories, each one containing stride information about the last X occurrences of the instruction
- It requires X^2 strides + last value
  - 16-bit strides
- X predictions: selection by confidence counters
- We implemented a 2LStride predictor (good performance/cost trade-off)
- Example: 2LStride, 1-bit confidence counter

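The idea of looking for strides at more than one distance can be illustrated with the sketch below. It is not the authors' XLStride design: it stores raw values instead of 16-bit strides, keeps one confidence counter per distance and phase, and does not reproduce the "X^2 strides + last value" storage layout. The class name and all parameters are assumptions.

```python
from collections import deque


class MultiLevelStrideSketch:
    """Illustrative multi-distance stride predictor for a single load."""

    def __init__(self, levels=2, conf_max=1):
        self.levels = levels
        self.conf_max = conf_max                 # conf_max=1 models a 1-bit counter
        self.history = deque(maxlen=2 * levels)  # raw values, most recent last
        self.count = 0                           # occurrences seen so far
        # One counter per (distance k, phase), with phase = count % k.
        self.conf = {k: [0] * k for k in range(1, levels + 1)}

    def _candidate(self, k):
        # Extrapolate the subsequence sampled every k occurrences.
        if len(self.history) < 2 * k:
            return None
        return 2 * self.history[-k] - self.history[-2 * k]

    def predict(self):
        # Return the candidate of the shortest confident distance, if any.
        for k in range(1, self.levels + 1):
            candidate = self._candidate(k)
            if candidate is not None and self.conf[k][self.count % k] >= self.conf_max:
                return candidate
        return None

    def update(self, value):
        for k in range(1, self.levels + 1):
            candidate = self._candidate(k)
            if candidate is None:
                continue
            phase = self.count % k
            if candidate == value:
                self.conf[k][phase] = min(self.conf[k][phase] + 1, self.conf_max)
            else:
                self.conf[k][phase] = 0
        self.history.append(value)
        self.count += 1


predictor = MultiLevelStrideSketch(levels=2, conf_max=1)
for v in [4, 0, 4, 9, 4, 1]:             # first six values of {4,0,4,9,4,1,4}
    predictor.update(v)
alternating_guess = predictor.predict()  # 4, from the distance-2 pattern
```

A plain stride predictor finds no pattern in {4,0,4,9,4,1,4}, because consecutive differences never repeat. Tracking the values two occurrences apart exposes the constant 4 on every other access.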
AV Predictor
- Some benchmarks exhibit patterns in the addresses, not in the values
- Address Table (AT): indexed by PC; result: predicted address
  - Implemented with a state-of-the-art predictor
- Value Table (VT): indexed by predicted address; result: predicted value
  - Implemented with a last value predictor
  - VT is also updated on stores
- Detects patterns in the addresses: results are totally different from traditional predictors

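The two-phase lookup of the AV predictor can be sketched as follows. This is a simplified model under stated assumptions: the Address Table uses a plain stride predictor (the slide only says "a state-of-the-art predictor"), both tables are unbounded dictionaries, there are no confidence counters, and the class and method names are hypothetical.

```python
class AVPredictorSketch:
    """Two-phase sketch: PC -> predicted address -> predicted value."""

    def __init__(self):
        self.address_table = {}  # AT: pc -> (last address, address stride)
        self.value_table = {}    # VT: address -> last value at that address

    def predict(self, pc):
        entry = self.address_table.get(pc)
        if entry is None:
            return None
        last_address, stride = entry
        predicted_address = last_address + stride       # phase 1: address
        return self.value_table.get(predicted_address)  # phase 2: value

    def update_load(self, pc, address, value):
        entry = self.address_table.get(pc)
        stride = address - entry[0] if entry is not None else 0
        self.address_table[pc] = (address, stride)
        self.value_table[address] = value

    def update_store(self, address, value):
        # The VT is also updated on stores, as the slide states.
        self.value_table[address] = value


av = AVPredictorSketch()
LOAD_PC = 0x40  # hypothetical PC of a load walking an array
# Values with no stride pattern, written to consecutive 8-byte slots.
for address, value in [(100, 7), (108, 3), (116, 9)]:
    av.update_store(address, value)
av.update_load(LOAD_PC, 100, 7)
av.update_load(LOAD_PC, 108, 3)
next_value = av.predict(LOAD_PC)  # 9: address 116 is predicted, then looked up
```

The loaded values 7, 3, 9 have no pattern a value-indexed predictor could use, but the addresses advance by a constant stride, so predicting the address first makes the value available from the VT.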
Evaluation
- Load speculation
- 7 state-of-the-art predictors
- 2LStride predictor
- 3 AV predictors
- Several hybrid predictors
- Uses half of the entries of the state-of-the-art predictors [Perais and Seznec, HPCA'13]

Results (I)
- Individual results:
- Hybrid results:
  - Best of the single predictors
  - Always better than the best of the single predictors

Results (II)
Multicore experiments with 24 cores
To check the influence of shared memory on the precision
Precision on the value table => No changes in shared memory by remote processors
Conclusions
- We simulate a real processor (Intel Westmere) to bring Value Prediction closer to general-purpose processors (narrow-issue processors)
- Speculating in loads has a better cost/benefit than speculating in all the instructions (in narrow processors)
- We propose the XLStride predictor
  - Detects more complex stride patterns
- We propose the AV predictor
  - Complementary to the traditional predictors: ideal for hybrid predictors
- Speed-up of up to 33% (average 10%)
- Shared memory in multicore processors barely affects the precision of predictors

Potential of Value Speculation (II)