Computer Architecture Lab at Carnegie Mellon

© 2009 Stephen Somogyi

Spatio-Temporal Memory Streaming

Stephen Somogyi, Thomas F. Wenisch, Anastasia Ailamaki and Babak Falsafi

36th International Symposium on Computer Architecture

June 22, 2009

Memory Wall: Performance Bottleneck

Especially for server apps
•  Large data footprints
•  Pointer-intensive structures

But, no single technique effective for OLTP, DSS & Web

Prefetching can hide long memory latency [ISCA ’05] [ISCA ’06] [Micro ’07]

[Figure: execution time, split into compute and mem stall, for Base vs. Prefetching]

Observation: Temporal, Spatial Predictions Different

Different predictors capture different behaviors
•  Temporal: recurring memory access sequences (pointers)
•  Spatial: recurring data layouts (structs)

Spatial/temporal disjoint; opportunity to exploit both

How to Combine?

Concept: independent temporal & spatial
•  Duplicates prefetches ⇒ inefficient
•  Prefetchers interfere ⇒ confuses training

Refinement: chain spatial predictions via temporal
•  No sequence within spatial predictions
•  Prefetches in wrong order ⇒ not timely

Solution: also learn order within spatial predictions
•  Achieves high, consistent coverage & low mispredictions

Contributions

•  Opportunity for spatio-temporal prediction
   –  70% of read misses are predictable on average

•  Temporal characterization of spatial accesses
   –  Sequences repetitive within & across layouts

•  Spatio-Temporal Memory Streaming
   –  Predicts unified spatio-temporal miss sequence

•  Predicts 62% of read misses on average
   –  Mean speedup 1.31, ≥ temporal or spatial alone

Outline

•  Introduction

•  Spatial and Temporal Prediction

•  Spatio-Temporal Memory Streaming

•  Results

•  Conclusion

Example: Non-Clustered Index Scan

Sequence spans non-contiguous data pages
Similar layouts within data pages

[Figure: Index Pages and Data Pages]

Temporal Memory Streaming (TMS) [Wenisch ’05]

Records & replays recurring miss sequences
•  Code traversals repeat ⇒ data traversals repeat

Sequences contain entire addresses
•  Good for pointer chasing ⇒ breaks dependence chains
•  Startup costs amortized over long streams
•  Cannot predict compulsory misses
•  Large storage required (~2MB / processor)

Q W A B C D E R T A B C D E Y

order repeats!
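A minimal Python sketch of the record-and-replay idea, using the miss sequence shown above. The function name `record_and_replay`, the `depth` parameter and the unbounded history list are assumptions for illustration, not the TMS hardware design.

```python
def record_and_replay(misses, depth=4):
    """On each miss, replay the addresses that followed its previous occurrence."""
    history = []      # global miss log, in program order
    last_seen = {}    # address -> index of its most recent occurrence in history
    predictions = []  # (miss, predicted stream) for every miss found in the index
    for addr in misses:
        if addr in last_seen:
            start = last_seen[addr] + 1
            predictions.append((addr, history[start:start + depth]))
        last_seen[addr] = len(history)
        history.append(addr)
    return predictions


misses = "Q W A B C D E R T A B C D E Y".split()
preds = record_and_replay(misses)
print(preds[0])  # ('A', ['B', 'C', 'D', 'E'])
```

The sketch looks up every miss for simplicity. A real streaming prefetcher would keep following one stream after the first lookup, because the prefetched blocks B, C, D and E would no longer miss.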

Spatial Memory Streaming (SMS) [Somogyi ’06]

Exploits repetitive, large-scale data layouts in memory

[Figure: Memory]

•  Pattern: offsets in logical region
•  Trigger: first miss, used for lookup

Patterns encoded as bit vectors
•  PC lookup ⇒ predicts compulsory misses
•  Efficient storage (~80KB / processor)
•  Trigger miss per pattern ⇒ lost opportunity
•  Unordered ⇒ BW spikes, wrong prioritization

pattern repeats!
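A minimal Python sketch of a bit-vector pattern, to make the encoding concrete. The region size, the `(PC, trigger offset)` lookup key, the PC value and all function names are assumptions for illustration; addresses are in units of cache blocks.

```python
REGION_BLOCKS = 8  # hypothetical region size, in cache blocks


def learn_pattern(offsets):
    """Encode the block offsets touched in one region as a bit vector."""
    pattern = 0
    for off in offsets:
        pattern |= 1 << off
    return pattern


def predict(pattern_table, pc, trigger_offset, region_base):
    """Look up a pattern by (PC, trigger offset) and expand it over a new region."""
    pattern = pattern_table.get((pc, trigger_offset), 0)
    return [region_base + off
            for off in range(REGION_BLOCKS)
            if (pattern >> off) & 1 and off != trigger_offset]


table = {}
# Training: code at a hypothetical PC touched offsets 7, 4, 2, 0 of one region,
# with offset 7 as the trigger (first miss).
table[(0x400, 7)] = learn_pattern([7, 4, 2, 0])
# Prediction: the same PC triggers at offset 7 in a region never seen before.
print(predict(table, 0x400, 7, region_base=64))  # [64, 66, 68]
```

Because the lookup does not depend on the region address, the sketch can predict blocks in a region it has never seen. The bit vector also loses the observed order (4, 2, 0 comes back as 0, 2, 4), which is the "unordered" limitation listed above.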

Hybrid Spatio-Temporal Predictor

Naïve approach
–  Record sequence across patterns
–  Fetch entire pattern at a time

But, triggers many patterns simultaneously
–  Accesses across patterns are interleaved!
–  Bursts of prefetches ⇒ pollution, BW spikes
   •  No priority across patterns

Must prioritize prefetches across patterns

Deconstructing a Miss Sequence

Investigate temporal & spatial relationships

[Figure: overall miss sequence (base addr + offset), interleaving A+7 B+5 C+3 D+3 E+0 with A+4 A+2 A+0 B+6 D+1 E+2]

Sequence across patterns (base addr + offset)
A+7 B+5 C+3 D+3 E+0

Sequences within patterns (base addr, sequence: (offset))
A: (4) (2) (0)
B: (6)
D: (1)
E: (2)

Naïve approach: prefetches pattern A before B. Incorrect order!

Sequence across patterns (base addr + offset, skip)
A+7,0 B+5,1 C+3,3 D+3,0 E+0,1

Sequences within patterns (base addr, sequence: (offset, skip))
A: (4,0) (2,1) (0,1)
B: (6,1)
D: (1,2)
E: (2,0)

Reverse process to reconstruct the miss sequence
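A minimal Python sketch of the deconstruction step, under assumptions not stated on the slide: the first miss to a region is its trigger, and a skip counts the misses that fall between an entry and the previous entry of the same stream. The interleaved input is an assumed ordering, chosen so that the output agrees with the (offset, skip) values shown for A, B, C and D; the tail involving D+1 and E is left out. The name `deconstruct` and its data layout are hypothetical.

```python
def deconstruct(misses):
    """Split an overall miss sequence into a trigger stream and per-region patterns.

    misses: list of (base, offset) pairs in observed order.
    Returns (triggers, patterns) with triggers = [(base, offset, skip)]
    and patterns = {base: [(offset, skip), ...]}.
    """
    triggers, patterns = [], {}
    last_trigger_pos = -1
    last_pos = {}  # base -> position of that region's most recent miss
    for pos, (base, off) in enumerate(misses):
        if base not in patterns:
            # First miss to this region: record it in the temporal stream.
            triggers.append((base, off, pos - last_trigger_pos - 1))
            patterns[base] = []
            last_trigger_pos = pos
        else:
            # Later miss to a known region: record it in that region's pattern.
            patterns[base].append((off, pos - last_pos[base] - 1))
        last_pos[base] = pos
    return triggers, patterns


# Assumed interleaving (not taken from the slide's figure).
overall = [("A", 7), ("A", 4), ("B", 5), ("A", 2),
           ("B", 6), ("A", 0), ("C", 3), ("D", 3)]
triggers, patterns = deconstruct(overall)
print(triggers)       # [('A', 7, 0), ('B', 5, 1), ('C', 3, 3), ('D', 3, 0)]
print(patterns["A"])  # [(4, 0), (2, 1), (0, 1)]
print(patterns["B"])  # [(6, 1)]
```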

Spatio-Temporal Memory Streaming (STeMS)

Goal: reconstruct the overall miss sequence
–  Using both temporal & spatial predictions
⇒ Prefetch cache blocks in order

Training: observe relative interleavings
–  Record skips in temporal seq. and spatial patterns

Prediction: reconstruction buffer for staging
–  Spread temporal predictions according to skips
–  Trigger spatial lookup for each, insert predicted addresses

Generates simple address seq. for throttled streaming
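A minimal Python sketch of the prediction step, assuming a skip counts the misses between an entry and the previous entry of the same stream. Patterns are keyed by region name only to keep the example short; that key, the function name `reconstruct` and the sample inputs are assumptions, not the STeMS hardware design. The sketch also assumes the recorded skips never place two addresses in the same slot.

```python
def reconstruct(triggers, patterns):
    """Rebuild one ordered address sequence from temporal and spatial predictions.

    triggers: [(base, offset, skip)] temporal stream of trigger misses.
    patterns: {base: [(offset, skip), ...]} spatial sequence per trigger.
    """
    buffer = {}  # reconstruction buffer: slot position -> (base, offset)
    pos = -1
    for base, off, skip in triggers:
        pos += skip + 1  # spread temporal predictions according to skips
        buffer[pos] = (base, off)
        slot = pos
        for p_off, p_skip in patterns.get(base, []):
            slot += p_skip + 1  # insert each spatial prediction after its skip
            buffer[slot] = (base, p_off)
    return [buffer[i] for i in sorted(buffer)]


# Sample inputs: the values for A, B, C and D from the example slide.
triggers = [("A", 7, 0), ("B", 5, 1), ("C", 3, 3), ("D", 3, 0)]
patterns = {"A": [(4, 0), (2, 1), (0, 1)], "B": [(6, 1)]}
order = ["%s+%d" % entry for entry in reconstruct(triggers, patterns)]
print(" ".join(order))  # A+7 A+4 B+5 A+2 B+6 A+0 C+3 D+3
```

The result is a single ordered list of addresses, which a prefetcher could issue a few at a time instead of fetching a whole pattern in one burst.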

Outline

•  Introduction

•  Spatial and Temporal Prediction

•  Spatio-Temporal Memory Streaming

•  Results

•  Conclusion

Methodology

Flexus [Wenisch ’06]
•  Full-system trace and OoO timing simulation
•  Leverages SMARTS sampling

Model Parameters
•  16 4GHz SPARC CPUs
•  4-wide OoO; 96-entry ROB
•  64KB 2-way L1
•  8MB 8-way L2, 25-cycle lat.
•  40ns memory
•  25ns per-hop network
•  TSO w/ speculation

Benchmark Applications
•  OLTP: TPC-C
   –  IBM DB2 & Oracle
•  DSS: TPC-H Qry 2,16,17
   –  IBM DB2
•  Web: SPECweb99
   –  Apache & Zeus
•  Scientific
   –  em3d, ocean, sparse

Temporal Repetition Within Patterns

Compare access sequence of successive patterns
–  Evaluate for small reordering windows

Seq. of accesses within patterns extremely repetitive

Temporal Repetition Across Patterns

Evaluate repetition with compression algorithm [Wenisch ’08]
–  Past work (TMS): all miss addresses
–  STeMS: only trigger misses

Similar opportunity for predicting seq. of triggers

STeMS Results

STeMS is effective across commercial workloads
Predicts 62% of read misses, improves performance by 31%

OLTP/DSS: matches TMS/SMS; Web: beats both

Related Work

•  Stream Chaining [Diaz ’09]
   –  Reconstructs overall miss seq. using control flow
   –  Better compulsory, worse temporal coverage

•  Predictor Virtualization [Burcea ’08]
   –  Reduces dedicated on-chip storage for predictors
   –  Can be applied to history structures in STeMS

•  Epoch-based Correlation Prefetching [Chou ’07]
   –  Improves timeliness of temporal predictions
   –  In contrast, STeMS mainly targets spatial timeliness

Conclusion

•  Spatial/temporal predictions disjoint
   –  Opportunity to predict up to 70% of read misses

•  Temporal repetition of spatial patterns
   –  Near-perfect repetition within patterns
   –  Similar repetition across patterns as all addresses

•  Design for Spatio-Temporal Memory Streaming
   –  Reconstructs total miss sequence

•  Predicts 62% of read misses, perf. improvement 31%
   –  Coverage & speedup ≥ temporal or spatial alone

Questions?

STeMS Project
Spatio-Temporal Memory Streaming
www.ece.cmu.edu/~stems

Computer Architecture Laboratory
Carnegie Mellon University
www.ece.cmu.edu/~calcm