Performance and Energy Implications of Many-Core Caches for Throughput Computing



TRANSCRIPT

Page 1: Performance and Energy Implications of Many-Core Caches for Throughput Computing


Performance and Energy Implications of Many-Core Caches for Throughput Computing

C. J. Hughes, C. Kim, Y. Chen (Intel Labs), IEEE Micro 2010

2013010654 유승요

Page 2: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Throughput Computing

Throughput computing focuses on maximizing the throughput of workloads rather than latency. Such workloads perform a huge number of calculations with abundant parallelism, making them a good fit for many-core processors.

To keep many cores busy, the memory system must feed them data and facilitate efficient core-to-core communication.

Caches serve this role: they hide the latency of lower levels of the memory system, enable fast core-to-core communication, and supply sufficient bandwidth.

Page 3: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Research Objective

Many-core cache design for throughput computing, considering both power and performance.

This differs from traditional CPU cache design: there are more cores and more inter-core communication, and because throughput workloads are latency tolerant, minimizing average access time may not be the best goal.

Page 4: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Throughput Applications

Page 5: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Throughput Applications

Working set size: the model uses a 256 KB L2 cache, so benchmarks with larger working sets may be slowed by the L2 cache size.

Cache miss rate: the L1 cache miss rate; a high miss rate means a high access rate to the lower-level caches.

Prefetch coverage: the percentage reduction in L1 misses when a stride prefetcher is added; a high reduction means a strong streaming access pattern (see the sketch below).
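
Restating that definition as a small sketch (variable names are illustrative, not from the paper):

    /* Prefetch coverage: fractional reduction in L1 misses when the
       stride prefetcher is enabled. Multiply by 100 for the percentage
       the slide reports. */
    double prefetch_coverage(double l1_misses_no_prefetch,
                             double l1_misses_with_prefetch) {
        return (l1_misses_no_prefetch - l1_misses_with_prefetch)
               / l1_misses_no_prefetch;
    }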

Page 6: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Throughput Applications

Data sharing characteristics: the percentage of L2 cache lines, and of L2 cache accesses, for data shared by a given number of cores.

Sharing degree: the number of cores that access a line.

Spatial domain (lines): most data is private; 12 out of 15 benchmarks have more than 70% of lines shared with no other core.

Frequency domain (accesses): the shared percentages are larger than in the spatial domain, i.e., shared lines are accessed disproportionately often (see the sharer-counting sketch below).
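
A minimal sketch of how a directory might compute a line's sharing degree from a per-line sharer bitmask; the mask type and core count are assumptions, not details from the paper:

    #include <stdint.h>

    typedef uint64_t sharer_mask_t;   /* one bit per core, up to 64 cores */

    /* Sharing degree = number of distinct cores that accessed the line. */
    static int sharing_degree(sharer_mask_t sharers) {
        int degree = 0;
        while (sharers) {
            degree += (int)(sharers & 1);
            sharers >>= 1;
        }
        return degree;
    }

    /* A line is private when exactly one core has accessed it. */
    static int is_private(sharer_mask_t sharers) {
        return sharing_degree(sharers) == 1;
    }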

Page 7: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Cache Design Constraints

Two-level caching: a private L1 cache per core, plus a Last-Level Cache (LLC).

Directory-based hardware cache coherence: each entry contains a tag and directory state information.

Tiled processor design.

Flexibility is the key design criterion: it determines where a given line can reside in the LLC, and it affects the distance an LLC request and reply must travel through the on-die network.

How flexibility affects each metric:
• Access latency: a more flexible design is better
• On-die bandwidth usage: a more flexible design is better
• Number of unique lines: a less flexible design is better
• Off-die bandwidth usage: a less flexible design is better
• Effective cache capacity: a less flexible design is better

Page 8: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Cache Design: Private

A private LLC keeps each core's data in a close location, at the cost of fewer unique cache lines.

Each line has a home tile, mapped by an address hashing function; the tag directory in the home tile tracks the line (see the hashing sketch below).
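
A minimal sketch of mapping a line address to its home tile; the XOR-fold hash, tile count, and line size are illustrative assumptions, not the paper's actual function:

    #include <stdint.h>

    #define NUM_TILES  64
    #define LINE_SHIFT 6          /* assume 64-byte cache lines */

    static unsigned home_tile(uint64_t paddr) {
        uint64_t line = paddr >> LINE_SHIFT;   /* drop the offset bits */
        line ^= line >> 16;                    /* fold in high address bits */
        return (unsigned)(line % NUM_TILES);
    }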

Page 9: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Private design variation

Replication policy To increase unique cache lines Uncontrolled replication Controlled replication No replication

Uncontrolled replication Allow unlimited replication when unique line evicted

• Not in home tile, move it to home tile(migration)

Controlled replication Allow replication, but deprioritize line via replacement policy Use a bit for each line(reuse bit)

It follow cache line when it is evicted or transferred to other LLC If new line is inserted, line with 0 reuse bit is evicted.
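
A minimal sketch of victim selection under controlled replication. The slide only says that a line with reuse bit 0 is evicted; the struct layout, associativity, and LRU tie-breaking are assumptions:

    #include <stdint.h>

    #define WAYS 16

    struct llc_line {
        uint64_t tag;
        uint8_t  valid;
        uint8_t  reuse;     /* reuse bit: set when the line is re-referenced */
        uint8_t  lru_rank;  /* 0 = most recent, WAYS-1 = least recent */
    };

    /* Prefer an invalid way, then the oldest line with reuse bit 0,
       then fall back to plain LRU when every reuse bit is set. */
    static int pick_victim(struct llc_line set[WAYS]) {
        int victim = -1, best_rank = -1;
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid) return w;
        for (int w = 0; w < WAYS; w++)
            if (!set[w].reuse && set[w].lru_rank > best_rank) {
                best_rank = set[w].lru_rank;
                victim = w;
            }
        if (victim >= 0) return victim;
        for (int w = 0; w < WAYS; w++)
            if (set[w].lru_rank == WAYS - 1) return w;
        return 0;
    }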

Page 10: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Private design variation

No replication: shared lines live in their home tile, while private lines live in the accessing core's tile. The LLC controller also works as the directory controller, tracking private lines with a Roaming Data Pointer (RDP). When a line becomes shared, the RDP entry is removed and the line migrates to its home tile (see the RDP sketch below).
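
A minimal sketch of an RDP entry and the transition when a second sharer appears; all names and the migration helper are hypothetical, not from the paper:

    #include <stdint.h>

    struct rdp_entry {
        uint64_t line_tag;    /* which line this pointer tracks */
        uint16_t owner_tile;  /* tile whose LLC currently holds the line */
        uint8_t  valid;
    };

    /* When a second core accesses the line, drop the pointer and migrate
       the line back to its home tile, per the slide. */
    static void on_second_sharer(struct rdp_entry *e) {
        /* migrate_line(e->owner_tile, e->line_tag);  -- assumed helper */
        e->valid = 0;   /* line is now tracked normally at its home tile */
    }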

Page 11: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Shared design

The shared design keeps every line in its home tile. This maximizes the number of unique lines, but increases average access latency and on-die traffic.

Page 12: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Experimental setup

L1 caches with a hardware stride prefetcher.

A ring interconnect with 64 switches.

Energy components modeled: LLC, tag directory, and RDP accesses; on-die data messages; on-die coherence messages; and off-die accesses (a combining sketch follows).
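
The list suggests a simple additive energy model; this is an assumption about how the components combine, not the paper's exact accounting:

    /* Total energy as the sum of the modeled components (names are
       illustrative placeholders). */
    double total_energy(double e_llc_dir_rdp_access,
                        double e_ondie_data_msgs,
                        double e_ondie_coherence_msgs,
                        double e_offdie_access) {
        return e_llc_dir_rdp_access + e_ondie_data_msgs
             + e_ondie_coherence_msgs + e_offdie_access;
    }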

Page 13: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Performance and Energy Consumption

[Figure: (a) performance of the 5 LLC designs relative to shared.]

Page 14: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Performance

The least flexible design is better. Flexible designs are intended to minimize access latency, but in throughput computing, cache miss latency is hidden via multithreading and prefetching.

The critical path is heavily read-write shared lines:
• Least flexible designs use centralized storage; the home tile responds directly, requiring no acknowledgment.
• Flexible designs keep lines only in private caches; a line cannot serve multiple cache-to-cache transfers at once, and the tag directory needs an acknowledgment from the owning tile.

Page 15: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Energy Consumption

The more flexible design is better: it saves on-die traffic, and the resulting increase in off-die traffic has only a small effect.

Unique-line policy in the private designs: the controlled and uncontrolled replication variants reduce off-die accesses.

[Figure: energy consumption of the 5 LLC designs relative to shared. P: private, U: uncontrolled replication, C: controlled replication, N: no replication, S: shared, T: tag directory buffer.]

Page 16: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Designing for performance and energy

Tag Directory Buffer (TDB): a small fully associative buffer that holds clean lines and handles read requests for them, so the directory acts like the shared design for that data. A bit added to each line's directory entry identifies concurrently shared lines, saving space and traffic (see the lookup sketch below).

[Figure: tag directory buffer hit rates for different buffer sizes.]
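
A minimal sketch of the TDB read path; the structure, sizes, and names are assumptions about one plausible organization, not the paper's implementation:

    #include <stdint.h>
    #include <string.h>

    #define TDB_ENTRIES 16
    #define LINE_BYTES  64

    struct tdb_entry {
        uint64_t tag;
        uint8_t  valid;                /* only clean lines are buffered */
        uint8_t  data[LINE_BYTES];
    };

    /* On a TDB hit, the directory replies with data directly, like a
       shared design; on a miss, the request follows the normal protocol
       and is forwarded to the owning tile. */
    static int tdb_read(struct tdb_entry tdb[TDB_ENTRIES],
                        uint64_t tag, uint8_t out[LINE_BYTES]) {
        for (int i = 0; i < TDB_ENTRIES; i++) {
            if (tdb[i].valid && tdb[i].tag == tag) {
                memcpy(out, tdb[i].data, LINE_BYTES);
                return 1;   /* hit: data served from the directory */
            }
        }
        return 0;           /* miss: fall back to cache-to-cache transfer */
    }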

Page 17: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Alternatives

Sharing migration: copies read-shared lines to the home LLC tile; it still needs an acknowledgment from the home tile.

Parallel reads: modify the coherence protocol and directory hardware to allow simultaneous transfers. This does not increase data traffic, but it requires changing the protocol and hardware, and it still needs cache-to-cache transfers, making it slower than the tag directory buffer.

Page 18: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Impact of increased read parallelism

[Figure: (a) performance and (b) energy consumption for designs that attempt to increase read parallelism.]

Page 19: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Impact of increased read parallelism

Tag Directory Buffer: faster, but with an increase in energy, from copying data to the home tile and because data replies from the tag directory travel a longer path.

Alternatives:
• Parallel reads: slower than the TDB, with the same energy consumption as controlled replication.
• Sharing migration: no performance boost and an increase in energy.

Increasing read throughput alone is not sufficient.

Page 20: Performance and Energy Implications of Many-Core Caches for Throughput Computing

Conclusion

Tag Directory Buffer: 10% faster than the private designs, with 55% energy savings compared to the shared design.

Future work: more complex hierarchies, and more fundamental changes to the hierarchy.
