2010ipdps kedzierski


  • 7/27/2019 2010IPDPS Kedzierski

    1/31

    IPDPS, April 2010

    Kamil Kedzierski 1

    [email protected]

    Adapting Cache Partitioning Algorithms to Pseudo-LRU Replacement Policies

    Kamil Kedzierski1,3, Miquel Moreto1,3,

    Francisco J. Cazorla2,3, Mateo Valero1,3

    1 Technical University of Catalonia

    2 Spanish National Research Council

    3 Barcelona Supercomputing Center

  • Chip Multiprocessors (CMPs)

    CMPs are a good representative of the transition from instruction-level parallelism (ILP) to thread-level parallelism (TLP)

    Current CMPs share the Last Level Cache (LLC)

      Pros: better utilization than a private LLC, which translates into improved performance

      Cons: the LLC has been identified as a source of contention between threads

      Cache competition may lead to performance degradation

    Cache Partitioning Algorithms (CPAs) control the interaction between threads

      CPAs can deliver a flexible and easy-to-manage infrastructure to control thread behavior in shared caches

      CPAs have become the central element of current QoS frameworks for CMPs

  • Cache Partitioning Algorithms

    We focus on dynamic CPAs

      Execution is divided into time intervals

      At each interval boundary we select a new cache partition based on the behavior in the previous interval(s)

    The cache is partitioned at way granularity

      Each thread is assigned a number of ways between 1 and A - N + 1, so that every other thread keeps at least one way

      A: associativity

      N: number of cores

    Main components of CPAs

      Profiling logic

      Partitioning logic

      Enforcement logic

  • Motivation

    Limiting factors to implementing CPAs in real processors

    Size of the profiling logic (Auxiliary Tag Directory)

      Its size can be similar to the size of the L1 cache

      It has received significant attention

        Sampled profiling logic

        No profiling (check all cases and select the best-performing one)

      We conclude that this problem has been solved

    Replacement scheme

      So far, solutions focus on the LRU replacement scheme

      LRU has a high implementation cost

      High-associativity caches use pseudo-LRU schemes

      It has not been shown how current CPAs work with pseudo-LRU: this problem is not solved

  • Outline

    Replacement schemes

    Problem definition for pseudo-LRU schemes

    Profiling for pseudo-LRU

    Results

    Conclusions

  • Outline

    Replacement schemes

      LRU: Least Recently Used

      NRU: Not Recently Used (UltraSPARC), a pseudo-LRU (pLRU) scheme

      BT: Binary Tree (IBM), a pLRU scheme

    Problem definition for pseudo-LRU schemes

    Profiling for pseudo-LRU

    Results

    Conclusions

  • Least Recently Used (LRU)

    Hit

      The hit line is promoted to the MRU position

      Each line that is between the MRU line and the hit line increments its LRU bits

      In the worst case the positions of all the lines are updated

    Miss

      Search for value 3 in the corresponding replacement bits (the LRU line of a 4-way set)

      Promote the new line to the MRU position and set its bits to 0

      Increase all the other bits

    Example: replacement bits of a 4-way set (0 = MRU, 3 = LRU)

      Line                   A  B  C  D
      Initial state          3  1  0  2
      After B access (hit)   3  0  1  2

      Line                   E  B  C  D
      After E access (miss)  0  2  1  3    (from the initial state; E replaces A)
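
    The hit and miss rules above can be sketched for a single set as follows. This is only an illustration, not the hardware design: the class name `LRUSet` and its fields are hypothetical, and the replacement bits of each line are modelled as a plain integer (0 = MRU, A - 1 = LRU).

```python
class LRUSet:
    """One cache set with true-LRU replacement bits (hypothetical model)."""

    def __init__(self, tags, positions):
        # tags[i] is the line stored in way i; positions[i] is its LRU
        # stack position (0 = MRU, len(tags) - 1 = LRU).
        self.tags = list(tags)
        self.pos = list(positions)

    def access(self, tag):
        """Access `tag`; return True on a hit and False on a miss."""
        assoc = len(self.tags)
        if tag in self.tags:
            way = self.tags.index(tag)
            old = self.pos[way]
            hit = True
        else:
            # Miss: the victim is the line whose bits hold A - 1.
            way = self.pos.index(assoc - 1)
            self.tags[way] = tag
            old = assoc - 1
            hit = False
        # Lines between the MRU position and the old position move down.
        for i in range(assoc):
            if self.pos[i] < old:
                self.pos[i] += 1
        self.pos[way] = 0  # promote the accessed line to MRU
        return hit
```

    Starting from replacement bits 3, 1, 0, 2 for lines A, B, C, D, `access("B")` leaves 3, 0, 1, 2, while `access("E")` replaces A and leaves 0, 2, 1, 3 for E, B, C, D.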

  • Not Recently Used (NRU)

    Hit

      Set the corresponding used bit to 1

      If this causes all used bits to be 1, reset all the other bits

    Miss

      Start looking for a victim at the position pointed to by the replacement pointer

      Search for a used bit equal to 0

      Set the corresponding used bit to 1

      If this causes all used bits to be 1, reset all the other bits

      Rotate the replacement pointer forward one way

    [Figure: used bits of a 4-way set (lines A, B, C, D) and the replacement pointer, before and after a hit (B access) and a miss (E access). The B access changes the used bits from 0, 0, 1, 0 to 0, 1, 1, 0.]
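
    A minimal sketch of the NRU rules listed above, for one set. The class name `NRUSet`, the starting pointer value and the decision to keep the pointer inside the set object are assumptions made for brevity; the slides later note that a single replacement pointer serves all the sets.

```python
class NRUSet:
    """One cache set with NRU used bits (hypothetical model)."""

    def __init__(self, tags, used, pointer=0):
        self.tags = list(tags)
        self.used = list(used)   # one used bit per way
        self.pointer = pointer   # replacement pointer (a way index)

    def _mark_used(self, way):
        self.used[way] = 1
        if all(self.used):
            # Every bit is 1: reset all bits except the one just set.
            self.used = [0] * len(self.used)
            self.used[way] = 1

    def access(self, tag):
        """Access `tag`; return True on a hit and False on a miss."""
        if tag in self.tags:
            self._mark_used(self.tags.index(tag))
            return True
        assoc = len(self.tags)
        # Miss: scan from the replacement pointer for a used bit equal to 0.
        for step in range(assoc):
            way = (self.pointer + step) % assoc
            if self.used[way] == 0:
                break
        self.tags[way] = tag
        self._mark_used(way)
        self.pointer = (self.pointer + 1) % assoc  # rotate forward one way
        return False
```

    With used bits 0, 0, 1, 0, a hit on B yields 0, 1, 1, 0. With the pointer assumed to sit on way 1, a miss on E replaces the line in that way and moves the pointer to way 2.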

  • Binary Tree (BT)

    Hit

      Update the corresponding bits so that they point to the MRU position

    Miss

      Update the corresponding bits so that they point to the MRU position

    [Figure: tree bits of a 4-way set (lines A, B, C, D) with the p-LRU and MRU positions marked, before and after a hit (B access) and a miss (E access).]
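
    The binary-tree scheme can be sketched as below. The encoding is an assumption of this sketch (0 = left, 1 = right, nodes stored heap-style), chosen so that the bits point to the MRU position as the slide states; the victim is then found by following the opposite direction at every node. The class name `BTSet` is hypothetical.

```python
class BTSet:
    """One cache set with binary-tree pseudo-LRU bits (hypothetical model)."""

    def __init__(self, assoc):
        # assoc must be a power of two; the tree has assoc - 1 nodes,
        # stored heap-style: node i has children 2*i + 1 and 2*i + 2.
        self.assoc = assoc
        self.levels = assoc.bit_length() - 1
        self.bits = [0] * (assoc - 1)

    def touch(self, way):
        """Make every bit on the path to `way` point towards it (MRU)."""
        node = 0
        for level in range(self.levels):
            direction = (way >> (self.levels - 1 - level)) & 1
            self.bits[node] = direction
            node = 2 * node + 1 + direction

    def victim(self):
        """Follow the opposite of each bit to reach the pseudo-LRU way."""
        node = 0
        for _ in range(self.levels):
            node = 2 * node + 1 + (1 - self.bits[node])
        return node - (self.assoc - 1)
```

    In a 4-way set, touching ways 0, 1, 2, 3 and then way 0 again makes `victim()` return way 2, although the true LRU line is in way 1. This is the kind of approximation that distinguishes BT from LRU.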

  • Summary

    Replacement bits per set for associativity A

      Scheme  Replacement bits
      LRU     A x log2(A)
      NRU     A
      BT      A - 1

    [Figure: replacement bits (logarithmic axis, 1 to 1000) versus associativity (2 to 64) for LRU, NRU and BT.]

    LRU requires more replacement bits

    LRU requires more information to update

    Current processors available on the market use pseudo-LRU replacement policies
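
    The per-set storage cost of the three schemes (A x log2(A) bits for LRU, A for NRU, A - 1 for BT) can be tabulated with a few lines of code. The function name `replacement_bits` is hypothetical, and the associativity is assumed to be a power of two.

```python
def replacement_bits(assoc):
    """Replacement bits per set for an `assoc`-way cache (power of two)."""
    log2_a = assoc.bit_length() - 1  # exact log2 for a power of two
    return {"LRU": assoc * log2_a, "NRU": assoc, "BT": assoc - 1}


if __name__ == "__main__":
    for assoc in (2, 4, 8, 16, 32, 64):
        print(assoc, replacement_bits(assoc))
```

    For the 16-way cache used in the evaluation this gives 64 bits per set for LRU against 16 for NRU and 15 for BT.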

  • Outline

    Replacement schemes

    Problem definition for pseudo-LRU schemes

      Cache Partitioning Algorithms

      Profiling Logic

    Profiling for pseudo-LRU

    Results

    Conclusions

  • Cache Partitioning Algorithms

    [Figure: Core 0 and Core 1, each with an instruction cache (I$) and a data cache (D$), share an L2 cache. The figure also shows Profiling Logic 0, Profiling Logic 1, the Partitioning Logic and the Enforcement Logic.]

    Profiling Logic

      Observes each thread's behavior in the L2 cache

    Partitioning Logic

      Decides how to partition the cache

      We use way partitioning

    Enforcement Logic

      Puts the partitions into practice
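
    This slide does not say how the partitioning logic reaches its decision. As one possible illustration, the sketch below picks the way allocation with the fewest predicted misses in total, which is what the name minMisses in the backup slides suggests. The function names, the brute-force search (only practical for small configurations) and the miss-curve values are all assumptions of this sketch.

```python
from itertools import product


def allocations(assoc, threads):
    """Yield every split of `assoc` ways that gives each thread >= 1 way."""
    for ways in product(range(1, assoc + 1), repeat=threads):
        if sum(ways) == assoc:
            yield ways


def min_total_misses(miss_curves, assoc):
    """Return the allocation with the fewest predicted misses overall.

    miss_curves[t][w] is the predicted miss count of thread t when it
    owns w ways (index 0 is never selected here).
    """
    return min(
        allocations(assoc, len(miss_curves)),
        key=lambda ways: sum(c[w] for c, w in zip(miss_curves, ways)),
    )


if __name__ == "__main__":
    # Made-up miss curves for two threads sharing a 4-way cache.
    curves = [[9, 8, 3, 2, 2], [9, 6, 4, 4, 4]]
    print(min_total_misses(curves, 4))
```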

  • Profiling Logic for LRU

    [Figure: Core 0 and Core 1, each with I$ and D$, share an L2 cache. Each core has its own ATD and SDH, shown together with the Partitioning Logic and the Enforcement Logic.]

    Auxiliary Tag Directory (ATD)

      Separate copy of the tag directory with the same associativity

      Simulates single-threaded behavior

      On every cache access, reports the LRU stack position to the SDH

    Stack Distance Histogram (SDH)

      Gathers stack positions

      Allows us to derive the miss curve of a thread as a function of the number of ways assigned to it

  • Profiling Background for LRU

    Building the SDH from the ATD content (1 set)

      [Figure: a 4-way ATD set with stack positions 0 (MRU) to 3 (LRU), shown for the accesses C, D and D. Each access adds 1 to one of the SDH registers r0 to r4.]

    Building the miss curve

      Ways  Misses
      0     r0 + r1 + r2 + r3 + r4
      1     r1 + r2 + r3 + r4
      2     r2 + r3 + r4
      3     r3 + r4
      4     r4
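
    The two steps above (filling the SDH from the ATD, then deriving the miss curve) can be sketched as follows for a single set. The function names are hypothetical, the ATD set is modelled as a Python list ordered from MRU to LRU, and the access sequence in the usage note is made up. Registers r0 to r(A-1) count hits at each stack position and rA counts ATD misses, which is what the miss curve on the slide implies.

```python
def stack_distance_histogram(accesses, assoc):
    """Return [r0, ..., rA] for one LRU-managed ATD set."""
    stack = []                     # tags ordered from MRU to LRU
    sdh = [0] * (assoc + 1)
    for tag in accesses:
        if tag in stack:
            position = stack.index(tag)
            sdh[position] += 1     # hit at this stack position
            stack.pop(position)
        else:
            sdh[assoc] += 1        # miss even with all A ways
            if len(stack) == assoc:
                stack.pop()        # evict the LRU tag
        stack.insert(0, tag)       # accessed tag becomes MRU
    return sdh


def miss_curve(sdh):
    """misses(w) = r_w + ... + r_A for w = 0 .. A ways."""
    return [sum(sdh[w:]) for w in range(len(sdh))]
```

    For the sequence A, B, C, D, C, D, D in a 4-way set, the histogram is [1, 2, 0, 0, 4] and the miss curve is [7, 6, 4, 4, 4]: giving this thread more than two ways would not remove any further misses.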

  • Profiling in pseudo-LRU?

    [Figure: the same B access under the three schemes. With LRU, the replacement bits give the stack position of B directly (1). With NRU (used bits) and BT (tree bits), the stack position is marked "don't know".]

    ... but what is the stack position?

  • Outline

    Replacement schemes

    Problem definition for pseudo-LRU schemes

    Profiling for pseudo-LRU

      NRU scheme

      BT scheme

      Limitations

    Results

    Conclusions

  • Profiling in NRU

    Count the number of used bits equal to 1 (U)

      If the current used bit = 1, the stack distance is between 1 and U

      If the current used bit = 0, the stack distance is between U + 1 and A

    [Figure: used bits in a 4-way ATD using NRU for three consecutive accesses, shown for the access sequences CDD and ABC. The arrows point to the line of the last access, with the estimated stack distance next to it.]
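
    The two bounds above translate directly into code. In this sketch the function name is hypothetical, and the used bits are assumed to be read when the access is looked up, before the accessed line's own bit is updated.

```python
def nru_distance_bounds(used_bits, way):
    """Return (low, high) bounds on the stack distance of `way`.

    `used_bits` holds one used bit per way of the ATD set, taken before
    the accessed line's bit is updated (an assumption of this sketch).
    """
    assoc = len(used_bits)
    u = sum(used_bits)            # U: number of used bits equal to 1
    if used_bits[way] == 1:
        return (1, u)             # recently used: distance in 1 .. U
    return (u + 1, assoc)         # not recently used: distance in U+1 .. A
```

    For used bits 0, 1, 1, 0 (U = 2), an access to way 1 has a stack distance between 1 and 2, and an access to way 0 has one between 3 and 4.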

  • Profiling in BT

    [Figure: estimated SDH profiling, and the decoder for ID bits extraction from the way number.]

  • Limitations

    BT: two stacks with the same BT bits affect profiling accuracy

    NRU: over- vs. under-estimation of the position in the pseudo-LRU stack

      We evaluate three scaling factors:

        1.0 x number of used bits equal to 1 (assume stack distance 4)

        0.75 x number of used bits equal to 1 (assume stack distance 3)

        0.5 x number of used bits equal to 1 (assume stack distance 2)

    [Figure: an example NRU set (lines A, B, C, D) and an example BT set (lines E, F, G, H) with their replacement bits.]
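
    The scaling factors can be read as a way of picking one value inside the range 1 to U when the accessed line's used bit is 1. The sketch below shows that arithmetic only; the function name, the rounding up and the clamp to at least 1 are assumptions, and the case where the used bit is 0 is not covered.

```python
import math


def scaled_distance(used_bits, factor):
    """Estimate the stack distance as factor x (used bits equal to 1)."""
    u = sum(used_bits)
    # Rounding up and clamping to at least 1 are assumptions of this sketch.
    return max(1, math.ceil(factor * u))
```

    With four used bits equal to 1, the factors 1.0, 0.75 and 0.5 give estimated stack distances of 4, 3 and 2, matching the values quoted on the slide.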

  • Outline

    Replacement schemes

    Problem definition for pseudo-LRU schemes

    Profiling for pseudo-LRU

    Results

    Conclusions

  • Without Cache Partitioning

    [Figure: performance of LRU, NRU and BT. Analysis for 1-, 2-, 4- and 8-core CMPs using a 16-way 2 MB L2 cache with 128-byte lines.]

    Is it worth developing a complex, area-expensive, power-hungry LRU replacement scheme for high-associativity caches to win 2% to 5% in performance?

  • Cache partitioning

    Analysis done for a 16-way 2 MB L2 cache with 128-byte lines

    Counters (any K ways out of A) vs. masks (specific K ways out of A)

    Negligible difference for a sampling interval of 1 million cycles
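
    The slide only names the two enforcement styles. The sketch below illustrates the difference in which ways a thread may pick a victim from; it is one plausible reading, not the evaluated design. The function names, the `owners` list (which thread owns each way of a set) and the quota rule for counters are assumptions.

```python
def counter_candidates(owners, quotas, thread):
    """Counters: a thread may hold any K ways of the set.

    owners[w] is the thread that owns way w; quotas[t] is K for thread t.
    """
    held = [owners.count(t) for t in range(len(quotas))]
    if held[thread] >= quotas[thread]:
        # At quota: replace one of the thread's own lines.
        return [w for w, t in enumerate(owners) if t == thread]
    # Under quota: take a way from a thread holding more than its share.
    return [w for w, t in enumerate(owners) if held[t] > quotas[t]]


def mask_candidates(masks, thread, assoc):
    """Masks: a thread may only use the specific ways set in its bit mask."""
    return [w for w in range(assoc) if (masks[thread] >> w) & 1]
```

    With counters the candidate ways depend on the current contents of the set, whereas with masks they are fixed by the partition itself.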

  • Cache partitioning

    Analysis done for a 16-way 2 MB L2 cache with 128-byte lines

    We select the 0.75 factor as the winner

    Random-like NRU replacement evicts data that is not the least recently used

      One replacement pointer for all the sets

      This becomes significant when the number of cores increases

  • Cache partitioning

    Analysis done for a 16-way 2 MB L2 cache with 128-byte lines

    Alternating nodes do not evict the least recently used line

    Misses are not evenly distributed within a partition

    This becomes significant when the number of cores increases

    [Figure: a binary tree with nodes BT0, BT1 and BT2 over lines A, B, C and D, annotated with 50%, 25% and 25%.]

  • Outline

    Replacement schemes

    Problem definition for pseudo-LRU schemes

    Profiling for pseudo-LRU

    Results

    Conclusions

  • Conclusions

    We propose a complete partitioning design that targets two pseudo-LRU replacement policies

      Not Recently Used, implemented in the L2 cache of the commercially available UltraSPARC T1/T2 processors

      Binary Tree, proposed by IBM

    We identify the profiling logic as the main source of the lack of CPA implementations so far

    The results show a negligible performance degradation with respect to the LRU-based CPA

      For NRU, our design loses up to 0.3%, 3.6% and 7.3% of throughput for 2-, 4- and 8-core CMP architectures, respectively

      For BT, the proposal degrades throughput by 1.4%, 3.4% and 9.7%, respectively

  • Thank you

    Q & A

  • Backup

  • Without Cache Partitioning

    [Figure: performance of LRU, NRU and BT. Analysis for 1-, 2-, 4- and 8-core CMPs using a 16-way 2 MB L2 cache with 128-byte lines.]

  • BT enforcement logic

    [Figure: enforcement logic for the BT replacement policy.]

  • minMisses improvements

    [Figure: throughput for the LRU, NRU and BT schemes when applying dynamic cache partitioning in a 2-core CMP. The results are relative to the cases without cache partitioning.]

      M-L architecture vs. non-partitioned LRU-based L2 cache

      M-0.75N architecture vs. non-partitioned NRU-based L2 cache

      M-BT architecture vs. non-partitioned BT-based L2 cache