CH20. Optimization for the Memory
Hierarchy
2006.5.3 이병현
Outline
Introduction
Instruction-Cache Optimization
Scalar Replacement of Array Elements
Data-Cache Optimization
Introduction
< Processor-memory performance gap, 1980-2000 (log scale): microprocessor performance improves about 60%/year (2x every 1.5 years), DRAM about 9%/year (2x every 10 years), so the gap grows about 50% per year >
Using Hardware Assists: Instruction Prefetching
Hardware provides sequential prefetching of code
Some newer 64-bit RISCs (e.g., SPARC-V9, Alpha) provide software support:
→ Fetching hints to the system's I-cache and instruction-fetch unit
→ Example for SPARC-V9: iprefetch address
Procedure Sorting
Sort the statically linked routines that make up an object module at link time according to their calling relationships and frequency of use.
Objective
→ To place routines near their callers in virtual memory so as to reduce paging traffic
→ To place frequently used and related routines so they are less likely to collide with each other in the I-cache
Procedure Sorting
Begin with the weighted undirected static call graph.
Select the arc with the highest weight and merge its nodes.
→ Coalesce their corresponding arcs
→ Add the weights of the coalesced arcs to compute the label for the coalesced arc
Nodes that are merged are placed next to each other in the final ordering of the procedures.
→ The weights of their connections are used to determine their relative order (a code sketch of the merge follows the example below)
Procedure Sorting
< Example: starting from a weighted call graph over P1-P8, successive highest-weight merges produce [P2,P4], then [P3,P6], then [P5,[P2,P4]], then [[P3,P6],[P5,[P2,P4]]], and finally [P1,[[P3,P6],[P5,[P2,P4]]]] and [P7,P8] >
Resulting order: P1, P3, P6, P5, P2, P4, P7, P8
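A minimal C sketch of the greedy merge described above, using hypothetical data structures (weight[][], group_of[], alive[] are illustrative names, not from the chapter); the caller initializes one node per procedure and fills the arc weights from profile data:

#include <stdio.h>
#include <string.h>

#define MAXP 8

/* Hypothetical representation: each live node is an ordered group of
 * procedures; weight[][] holds the (symmetric) arc weights from profiling. */
static int  weight[MAXP][MAXP];
static char group_of[MAXP][64];   /* e.g. group_of[0] = "P1" initially */
static int  alive[MAXP];          /* caller sets alive[i] = 1 for each node */

/* Coalesce node b into node a: b's procedures are placed right after a's,
 * and b's arcs are folded into a's by adding their weights. */
static void merge(int a, int b) {
    strcat(group_of[a], ",");
    strcat(group_of[a], group_of[b]);
    for (int k = 0; k < MAXP; k++) {
        weight[a][k] += weight[b][k];
        weight[k][a] += weight[k][b];
        weight[b][k] = weight[k][b] = 0;
    }
    alive[b] = 0;
}

/* Repeatedly pick the heaviest remaining arc and merge its endpoints.
 * (The full algorithm also uses connection weights to decide the relative
 * orientation of the two groups; this sketch simply appends.) */
static void sort_procedures(void) {
    for (;;) {
        int best = 0, ba = -1, bb = -1;
        for (int i = 0; i < MAXP; i++)
            for (int j = i + 1; j < MAXP; j++)
                if (alive[i] && alive[j] && weight[i][j] > best) {
                    best = weight[i][j]; ba = i; bb = j;
                }
        if (ba < 0) break;                 /* no arcs left between live nodes */
        merge(ba, bb);
    }
    for (int i = 0; i < MAXP; i++)
        if (alive[i]) printf("%s\n", group_of[i]);
}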
Procedure and Block Placement
Another approach to I-cache optimization that can be combined with procedure sorting.
Modify the system linker to put each routine on an I-cache block boundary, allowing the later phases of the compilation process to position frequently executed code segments.
If most basic blocks are short, this helps to keep the beginnings of basic blocks away from the ends of cache blocks.
The compiler can be instrumented to collect statistics, and profiling feedback can be used.
Intraprocedural Code Positioning
Objective
→ Move infrequently executed code out of the main body of the code
→ Straighten the code
→ A higher fraction of the instructions fetched into the I-cache are actually executed
Intraprocedural Code Positioning
Build the procedure's flowgraph
→ Edges are annotated with their execution frequency
Bottom-up search of the flowgraph:
a. Build chains of basic blocks that should be placed as straight-line code (a code sketch follows the example below)
b. Merge the two chains whose respective tail and head are connected by the edge with the highest execution frequency
c. Select the entry chain and proceed through the other chains according to the weights of their connections
Intraprocedural Code Positioning
< Example flowgraph: entry, B1-B9, exit, with each edge annotated with its execution frequency >
Intraprocedural Code Positioning
< Resulting arrangement: entry, B1, B2, B3, B4, B5, B8, B9, B6, B7, exit >
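As promised above, a rough C sketch (not the chapter's code) of chain-building steps (a) and (b); edge, chain_id, tail_of, and head_of are hypothetical names:

#define MAXB 16

/* Hypothetical edge annotated with its execution frequency. */
struct edge { int from, to, freq; };

static int chain_id[MAXB];   /* which chain each block currently belongs to  */
static int tail_of[MAXB];    /* last block of each chain (indexed by chain)  */
static int head_of[MAXB];    /* first block of each chain                    */

/* Every block starts as its own chain; walking the edges in decreasing
 * frequency order, merge two chains whenever the edge runs from the tail of
 * one chain to the head of another, so that edge becomes fall-through.
 * Step (c), ordering the finished chains starting from the entry chain by
 * connection weight, is omitted here. */
static void build_chains(const struct edge *e, int nedges, int nblocks) {
    for (int b = 0; b < nblocks; b++) {
        chain_id[b] = b;
        head_of[b] = tail_of[b] = b;
    }
    for (int i = 0; i < nedges; i++) {   /* edges assumed sorted by frequency */
        int cf = chain_id[e[i].from], ct = chain_id[e[i].to];
        if (cf == ct) continue;                    /* would close a cycle     */
        if (tail_of[cf] != e[i].from) continue;    /* source must end a chain */
        if (head_of[ct] != e[i].to) continue;      /* target must start one   */
        for (int b = 0; b < nblocks; b++)          /* absorb the target chain */
            if (chain_id[b] == ct) chain_id[b] = cf;
        tail_of[cf] = tail_of[ct];
    }
}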
Procedure Splitting
Divides each procedure into a primary and a secondary component
→ Primary: frequently executed basic blocks
→ Secondary: rarely executed ones, such as exception-handling code
Then collects the secondary components of a series of procedures into a separate secondary section
→ Packs the primary components more tightly together (an illustrative compiler-hint analogue follows the figure below)
Procedure Splitting
< A group of procedure bodies P1-P4, each split into primary (P) and secondary (S) components >
< Result of collecting each type of component: the primaries packed together, followed by the secondaries >
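Modern compilers expose the same idea through source-level hints; as an illustrative analogue (not the scheme the chapter describes), GCC and Clang accept cold/noinline attributes and __builtin_expect, which encourage the compiler to move rarely executed code into a secondary text section:

#include <stdio.h>
#include <stdlib.h>

/* Rarely executed error handling, marked cold so the compiler may place it
 * in a secondary text section (e.g. .text.unlikely), away from the hot path. */
__attribute__((cold, noinline))
static void report_and_exit(const char *msg) {
    fprintf(stderr, "fatal: %s\n", msg);
    exit(1);
}

/* The hot primary path stays compact; __builtin_expect hints that the
 * error branch is unlikely, which also steers basic-block placement. */
int sum_buffer(const int *buf, int n) {
    if (__builtin_expect(buf == NULL, 0))
        report_and_exit("null buffer");
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}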
Combining Intra- and Interprocedural Methods
McFarling's work
- Focuses on optimization of entire programs for direct-mapped I-caches
- Works on object modules
- Depends on rearranging instructions in memory and segregating some instructions so they are not cached at all
Scalar Replacement of Array Elements
Scalar replacement
→ Replaces subscripted variables by scalars
→ Makes them available for register allocation
→ Finds opportunities to reuse array elements and replaces the reuses with references to scalar temporaries
→ Improves speed
→ Decreases the need for D-cache optimization
Scalar Replacement of Array Elements
Example: matrix multiplication

do i = 1,N
  do j = 1,N
    do k = 1,N
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

do i = 1,N
  do j = 1,N
    ct = C(i,j)
    do k = 1,N
      ct = ct + A(i,k)*B(k,j)
    enddo
    C(i,j) = ct
  enddo
enddo

Replacing C(i,j) with the scalar ct reduces the number of memory accesses by 2(N^3 - N^2): C(i,j) is loaded and stored in every innermost iteration in the original, but only once per (i,j) pair afterward.
Another example:

for i ← 1 to n do
  b[i+1] ← b[i] + 1.0
  a[i] ← 2*b[i] + c[i]
endfor

becomes

if n >= 1 then
  t0 ← b[1]
  t1 ← t0 + 1.0
  b[2] ← t1
  a[1] ← 2 * t0 + c[1]
endif
t0 ← t1
for i ← 2 to n do
  t1 ← t0 + 1.0
  b[i+1] ← t1
  a[i] ← 2 * t0 + c[i]
  t0 ← t1
endfor

This reduces memory accesses by about 40% (five array references per iteration drop to three).
Scalar Replacement of Array Elements
Loop interchange may increase opportunities by making loop-carried dependences be carried by the innermost loop.
Loop fusion can create opportunities for scalar replacement by bringing together in one loop multiple uses of a single array element.

Example (interchange):

for i ← 1 to n do
  for j ← 1 to n do
    a[i,j] ← b[i] + 0.5
    a[i+1,j] ← b[i] - 0.5
  endfor
endfor

for j ← 1 to n do
  for i ← 1 to n do
    a[i,j] ← b[i] + 0.5
    a[i+1,j] ← b[i] - 0.5
  endfor
endfor

Example (fusion):

for i ← 1 to n do
  a[i] ← a[i] + 1.0
endfor
for j ← 1 to n do
  b[j] ← a[j] * 0.618
endfor

for i ← 1 to n do
  a[i] ← a[i] + 1.0
  b[i] ← a[i] * 0.618
endfor
Data Cache Optimization
Loop Transformations
Locality and Tiling
Data Prefetching
Loop Transformations
Do things like
→ Interchanging two nested loops
→ Reversing the order of a loop's iterations
→ Fusing two loop bodies
If chosen properly
→ The semantics of the program are preserved and its performance is improved
Three general types
→ Unimodular transformations
→ Loop fusion and distribution
→ Tiling
Unimodular loop transformation
Definition
→ A transformation whose effect can be represented by the product of a unimodular matrix with a distance vector
Unimodular matrix
→ A square matrix with all integral components and with a determinant of 1 or -1
Lexicographically positive vector
→ Has at least one nonzero element, and the first nonzero element in it is positive
Unimodular loop transformation
Loop interchange
→ Reverses the order of two adjacent loops in a loop nest

for i ← 1 to n do
  for j ← 1 to n do
    a[i,j] ← (a[i-1,j] + a[i+1,j])/2.0
  endfor
endfor

for j ← 1 to n do
  for i ← 1 to n do
    a[i,j] ← (a[i-1,j] + a[i+1,j])/2.0
  endfor
endfor

• Loop interchange matrix:
  ( 0 1 )
  ( 1 0 )
• Product with the distance vector (1, 0):
  ( 0 1 ) ( 1 )   ( 0 )
  ( 1 0 ) ( 0 ) = ( 1 )
• The result (0, 1) is legal: lexicographically positive
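The legality check is just a matrix-vector product followed by a lexicographic sign test; a small self-contained C sketch for the interchange example above (the matrix and distance vector are hard-coded purely for illustration):

#include <stdio.h>

#define N 2   /* loop-nest depth for the interchange example */

/* Returns 1 if vector v is lexicographically positive. */
static int lex_positive(const int v[N]) {
    for (int i = 0; i < N; i++) {
        if (v[i] > 0) return 1;
        if (v[i] < 0) return 0;
    }
    return 0;   /* all zeros: not positive */
}

int main(void) {
    int T[N][N] = { {0, 1}, {1, 0} };   /* loop-interchange matrix        */
    int d[N]    = { 1, 0 };             /* distance vector of the example */
    int r[N]    = { 0, 0 };

    for (int i = 0; i < N; i++)         /* r = T * d                      */
        for (int j = 0; j < N; j++)
            r[i] += T[i][j] * d[j];

    printf("transformed distance (%d,%d): %s\n", r[0], r[1],
           lex_positive(r) ? "legal" : "illegal");
    return 0;
}

For this example it prints "legal"; the same test applies to the permutation, reversal, and skewing matrices on the following slides.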
Unimodular loop transformation
Loop permutation
→ Generalizes loop interchange by allowing more than two loops to be moved at once and by not requiring them to be adjacent
• Example: interchanging the first loop with the third and the second with the fourth:
  ( 0 0 1 0 )
  ( 0 0 0 1 )
  ( 1 0 0 0 )
  ( 0 1 0 0 )
Unimodular loop transformation
Loop reversal
→ Reverses the order in which a particular loop's iterations are performed
• Reversing the direction of the middle loop of a triply nested loop:
  ( 1  0  0 )
  ( 0 -1  0 )
  ( 0  0  1 )
• Product with the distance vector (0, 1, 0):
  ( 1  0  0 ) ( 0 )   (  0 )
  ( 0 -1  0 ) ( 1 ) = ( -1 )
  ( 0  0  1 ) ( 0 )   (  0 )
• Illegal transformation: the result is lexicographically negative
Unimodular loop transformation
Loop skewing
→ Changes the shape of a loop's iteration space

for i ← 1 to n do
  for j ← 1 to n do
    a[i,j] ← a[i+j] + 1.0
  endfor
endfor

for i ← 1 to n do
  for j ← i+1 to i+n do
    a[i,j] ← a[j] + 1.0
  endfor
endfor

• Skewing matrix:
  ( 1 0 )
  ( 1 1 )
Unimodular loop transformation
Loop fusion
→ Takes two adjacent loops that have the same iteration-space traversal and combines their bodies into a single loop
→ Legal as long as the loops have the same bounds and fusing them does not violate any flow, anti-, or output dependences

for i ← 1 to n do
  a[i] ← a[i] + 1.0
endfor
for j ← 1 to n do
  b[j] ← a[j] * 0.618
endfor

for i ← 1 to n do
  a[i] ← a[i] + 1.0
  b[i] ← a[i] * 0.618
endfor
Unimodular loop transformation
Loop distribution
→ Takes a loop that contains multiple statements and splits it into two loops with the same iteration space
→ Legal if it does not result in breaking any cycles in the dependence graph of the original loop

for i ← 1 to n do
  a[i] ← a[i] + 1.0
  b[i] ← a[i] * 0.618
endfor

for i ← 1 to n do
  a[i] ← a[i] + 1.0
endfor
for j ← 1 to n do
  b[j] ← a[j] * 0.618
endfor
Data Prefetching
Software Data Prefetching
Hardware Data Prefetching
→ Sequential Prefetching
→ Prefetching with Arbitrary Strides
Integrating Hardware and Software Prefetching
Appendix: Data Cache Prefetching Using a Global History Buffer
Data Prefetching
What is it?
→ A request for a future data need is initiated
→ Useful execution continues during the access
→ Data moves from slow/far memory to a fast/near cache
→ Data is ready in the cache when needed (load/store)
When can it be used?
→ When future data needs are (somewhat) predictable
How is it implemented?
→ In hardware: history-based prediction of future accesses
→ In software: compiler-inserted prefetch instructions
Data Prefetching
< Figure: (a) no prefetching, (b) perfect prefetching, (c) degraded prefetching >
Software Data Prefetching
Most contemporary microprocessors support some form of fetch instruction that can be used to implement prefetching.
Fetch instructions
→ Added by the programmer or by the compiler
→ Can often be inserted effectively by the programmer
Common characteristics
→ Nonblocking memory operation
→ Requires a lockup-free cache
Loops with large array calculations provide excellent prefetching opportunities
→ Common in scientific codes
→ Exhibit poor cache utilization
→ Predictable array-referencing patterns
Software Data Prefetching

for (i=0; i<N; i++)
    ip = ip + a[i]*b[i];

• Example code for loop-based prefetching
• Assume a four-word cache block
• Causes a cache miss every fourth iteration

• Simple prefetching:

for (i=0; i<N; i++){
    fetch(&a[i+1]);
    fetch(&b[i+1]);
    ip = ip + a[i]*b[i];
}

→ Several problems: there is no need to prefetch on every iteration, so most of these prefetches are unnecessary and degrade performance
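On compilers that provide a prefetch intrinsic, the fetch() calls above might map to GCC/Clang's __builtin_prefetch; this is only one possible rendering of the simple (over-eager) version, not the chapter's notation:

/* Same simple prefetching loop, expressed with a real intrinsic.
 * __builtin_prefetch(addr, rw, locality): rw=0 for read, locality 0..3. */
double dot(const double *a, const double *b, int n) {
    double ip = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 1], 0, 3);
        __builtin_prefetch(&b[i + 1], 0, 3);
        ip += a[i] * b[i];
    }
    return ip;
}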
Software Data Prefetching
• Prefetching with loop unrolling:

for (i=0; i<N; i+=4){
    fetch(&a[i+4]);
    fetch(&b[i+4]);
    ip = ip + a[i]*b[i];
    ip = ip + a[i+1]*b[i+1];
    ip = ip + a[i+2]*b[i+2];
    ip = ip + a[i+3]*b[i+3];
}

→ Unroll the loop by a factor of the number of words to be prefetched per cache block
→ Still has room for improvement:
• A cache miss occurs during the first iteration
• Unnecessary prefetches occur in the last iteration of the unrolled loop
Software Data Prefetching
• Software pipelining
→ Assumption: prefetching one iteration ahead of the data's actual use is sufficient to hide the latency of a main-memory access

fetch(&ip);
fetch(&a[0]);
fetch(&b[0]);
for (i=0; i<N-4; i+=4){
    fetch(&a[i+4]);
    fetch(&b[i+4]);
    ip = ip + a[i]*b[i];
    ip = ip + a[i+1]*b[i+1];
    ip = ip + a[i+2]*b[i+2];
    ip = ip + a[i+3]*b[i+3];
}
for ( ; i<N; i++)
    ip = ip + a[i]*b[i];

• Prefetching three iterations (12 elements) ahead:

fetch(&ip);
for (i=0; i<12; i+=4){
    fetch(&a[i]);
    fetch(&b[i]);
}
for (i=0; i<N-12; i+=4){
    fetch(&a[i+12]);
    fetch(&b[i+12]);
    ip = ip + a[i]*b[i];
    ip = ip + a[i+1]*b[i+1];
    ip = ip + a[i+2]*b[i+2];
    ip = ip + a[i+3]*b[i+3];
}
for ( ; i<N; i++)
    ip = ip + a[i]*b[i];

→ Generalizes to loops containing small computational bodies:
• Prolog: prefetching only
• Main loop: prefetching and computation
• Epilog: computation only
→ Initiate prefetches δ = ⌈l / s⌉ iterations ahead, where
• l : average memory latency
• s : estimated cycle time of the shortest possible execution path through one iteration
→ With l = 100 cycles and s = 45 cycles, δ = ⌈100/45⌉ = 3 iterations, i.e. 12 array elements ahead of the unrolled loop above
Software Data Prefetching
The loop transformations involved are fairly mechanical, with some refinements.
The performance penalty must be considered:
→ Adds processor overhead: requires extra execution cycles, and the prefetch source addresses must be calculated and stored
→ Increases register pressure, leading to additional spill code
→ Significant code expansion
→ Unable to detect when a prefetched block has been prematurely evicted and needs to be refetched
Hardware Data Prefetching
Adds prefetching capability to a system without the need for programmer or compiler intervention
→ No changes to existing executables
→ Instruction overhead is completely eliminated
→ Can take advantage of run-time information to make prefetching more effective
Sequential prefetching
By grouping consecutive memory words into single units, caches exploit the principle of spatial locality to implicitly prefetch data that is likely to be referenced in the near future.
Two considerations when simply enlarging cache blocks:
→ Cache pollution: as the cache block size increases, so does the amount of potentially useful data displaced from the cache to make room for the new block
→ False sharing: increasing the cache block size increases the likelihood of two processors sharing data from the same block
Sequential prefetching can take advantage of spatial locality without introducing these problems.
Sequential prefetching
One-block-lookahead (OBL) implementation
→ Initiates a prefetch for block b+1 when block b is accessed
→ Differs from simply doubling the block size
→ Variants differ in what type of access to block b initiates the prefetch of b+1:
• Prefetch-on-miss: simply initiates a prefetch for block b+1 whenever an access to block b results in a cache miss; if b+1 is already cached, no memory access is initiated
• Tagged prefetch: associates a tag bit with every memory block; the tag bit is used to detect when a block is demand-fetched or when a prefetched block is referenced for the first time, and in either of these cases the next sequential block is prefetched
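A toy C simulation of the tagged policy (the block-indexed arrays and the unbounded "cache" are simplifications of mine, not from the text):

#include <stdio.h>

#define NBLOCKS 1024

static int cached[NBLOCKS];  /* toy cache: is the block present?              */
static int tag[NBLOCKS];     /* tagged policy: 1 until the block's first use  */

static void prefetch_block(int b) {
    if (b < NBLOCKS && !cached[b]) { cached[b] = 1; tag[b] = 1; }
}

/* Tagged prefetch: bring in b+1 on a demand miss for b, and also on the
 * first reference to a block that was itself brought in by a prefetch. */
static int access_tagged(int b) {
    int miss = !cached[b];
    if (miss) {
        cached[b] = 1;
        tag[b] = 0;              /* demand-fetched, not prefetched   */
        prefetch_block(b + 1);
    } else if (tag[b]) {
        tag[b] = 0;              /* first use of a prefetched block  */
        prefetch_block(b + 1);
    }
    return miss;
}

int main(void) {
    int misses = 0;
    for (int b = 0; b < 64; b++)      /* strictly sequential block trace */
        misses += access_tagged(b);
    printf("misses with tagged prefetch: %d\n", misses);  /* prints 1 */
    return 0;
}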
Sequential prefetching
< Figure: prefetch-on-miss vs. tagged prefetch on a sequential access pattern >
• Tagged prefetch is more effective than prefetch-on-miss
• Why? With prefetch-on-miss, a strictly sequential access pattern results in a cache miss for every other cache block; with tagged prefetch, just one cache miss occurs
• One shortcoming
→ The prefetch may not be initiated far enough in advance of the actual use to avoid a processor stall
→ To solve this, increase the number of blocks prefetched after a demand fetch
Sequential prefetching
Sequential prefetching with a degree of prefetching K
→ Prefetching K > 1 subsequent blocks aids the memory system in staying ahead of rapid processor requests
→ Additional traffic and cache pollution are generated during program phases that show little spatial locality
→ Remedies: adaptive sequential prefetching (Dahlgren et al. [1993]), FIFO stream buffers (Jouppi [1990])
< Figure: prefetching with degree K = 2 >
Sequential prefetching
Properties
→ No changes to existing executables
→ Implemented with relatively simple hardware
→ Compared to software prefetching, sequential hardware prefetching performs poorly on non-sequential memory access patterns
→ Scalar references or array accesses with large strides can result in unnecessary prefetches
Prefetching with arbitrary strides
Employs special logic that monitors the processor's address-referencing pattern to detect constant-stride array references by comparing successive addresses used by load or store instructions.
Assume a memory instruction m_i references addresses a1, a2, a3 during three successive loop iterations:
→ Prefetching for m_i is initiated if (a2 - a1) = Δ ≠ 0
→ The first prefetch address is A3 = a2 + Δ
→ Prefetching continues as long as each new address preserves the same stride Δ
Implemented with an RPT (reference prediction table)
Prefetching with arbitrary strides
RPT
→ Holds the reference histories for only the most recently used memory instructions
→ Table entries contain:
• the address of the memory instruction
• the previous address accessed by this instruction
• a stride value, for those entries that have established a stride
• a state field that records the entry's current state
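A simplified C sketch of how one RPT entry might be updated (the field names and the reduced two-state policy are assumptions; the real hardware uses a richer state machine and indexes entries by the instruction's PC):

#include <stdint.h>

/* Hypothetical, simplified RPT entry. */
struct rpt_entry {
    uintptr_t pc;          /* address of the load/store instruction     */
    uintptr_t prev_addr;   /* previous data address it referenced       */
    intptr_t  stride;      /* currently established stride (0 = none)   */
    int       valid;       /* has prev_addr been recorded yet?          */
};

/* Update the entry on each execution of the instruction; return a
 * prefetch address, or 0 if no prefetch should be issued. */
static uintptr_t rpt_access(struct rpt_entry *e, uintptr_t addr) {
    uintptr_t prefetch = 0;
    if (!e->valid) {
        e->valid = 1;                      /* first reference: just record it */
    } else {
        intptr_t delta = (intptr_t)(addr - e->prev_addr);
        if (delta != 0 && (e->stride == 0 || delta == e->stride)) {
            e->stride = delta;             /* e.g. (a2 - a1) = delta != 0     */
            prefetch  = addr + (uintptr_t)delta;   /* e.g. A3 = a2 + delta    */
        } else {
            e->stride = 0;                 /* stride broke: stop prefetching  */
        }
    }
    e->prev_addr = addr;
    return prefetch;
}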
Prefetching with arbitrary strides
Prefetching with arbitrary strides
The RPT improves upon sequential policies by correctly handling strided array references.
But the scheme described above limits the prefetch distance to one loop iteration.
Prefetching with a larger distance:
→ prefetch address = effective address + (stride * distance)
→ The RPT entries are maintained under direction of the PC, but prefetches are initiated separately by a pseudo program counter, the lookahead program counter (LA-PC)
→ distance = the difference between the PC and the LA-PC
Integrating HW and SW prefetching
Software prefetching
→ Compile-time analysis to schedule fetch instructions within the user program
Hardware prefetching
→ Prefetching opportunities found at run time without any compiler or instruction-set support
Integrating these approaches:
→ Gornish and Veidenbaum [1994]
→ Zhang and Torrellas [1995]
→ Chen [1995]
Integrating HW and SW prefetching
Gornish and Veidenbaum [1994]
→ A variation on tagged hardware prefetching in which the degree of prefetching for a particular reference stream is calculated at compile time and passed on to the prefetch hardware
Zhang and Torrellas [1995]
→ Enables prefetching for irregular data structures: the compiler initializes tags in memory, and the actual prefetching is handled by hardware
Chen [1995]
→ A programmable prefetch engine, an extension of the RPT in which the tag, address, and stride information are supplied by the program
Prefetching Using a Global History Buffer
Prefetches from main memory into the lowest-level cache.
Conventional table-based prefetching
→ Stride prefetching
→ Correlation prefetching
• Markov prefetching
• Distance prefetching
< Figure: conventional table-based prefetching: a prefetch key indexes a history table, and the prefetch algorithm produces the prefetch address >
Conventional table-based prefetching
Stride prefetching
→ Uses a table to store stride-related history information for individual load instructions
→ Following a cache miss, if the algorithm detects a constant-stride pattern, it triggers prefetches for addresses a+s, a+2s, ..., a+ds (see the sketch below)
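In code form, the degree-d stride policy above might look like this (issue_prefetch is a hypothetical hook standing in for whatever the memory system provides):

#include <stdint.h>

extern void issue_prefetch(uintptr_t addr);

/* After a miss at address a with a detected stride s, a degree-d stride
 * prefetcher issues prefetches for a+s, a+2s, ..., a+ds. */
static void stride_prefetch(uintptr_t a, intptr_t s, int d) {
    for (int k = 1; k <= d; k++)
        issue_prefetch(a + (uintptr_t)((intptr_t)k * s));
}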
Conventional table-based prefetching
Markov prefetching
→ Uses a history table to record consecutive miss addresses; when a cache miss occurs:
• The miss address indexes the correlation table
• The members of the table entry's address list are prefetched, most recent miss address first
Distance prefetching
→ A generalization of Markov prefetching using address deltas (the distance between two consecutive miss addresses)
→ More compact: one delta correlation can represent many miss-address correlations
Problems with conventional tables:
a. Table data can become stale and consequently reduce prefetch accuracy
b. Tables suffer from conflicts when multiple prefetch keys hash to the same table entry
c. Tables hold a fixed, usually small, amount of history per entry
Prefetching Using a Global History Buffer
Index table
→ Prefetch algorithms access the index table with a key
→ The key can be a load instruction's PC, a cache-miss address, or a hashed combination of the two
→ Entries contain pointers into the GHB
Global history buffer (GHB)
→ An n-entry FIFO table (implemented as a circular buffer)
→ Holds the n most recent L2 miss addresses
→ Each entry stores a global miss address and a link pointer
< Figure: a prefetch key indexes the index table, whose entries point into the FIFO global history buffer of miss addresses; the prefetch algorithm walks the buffer to produce prefetch addresses >
Prefetching Using a Global History Buffer
GHB example (using the Markov algorithm):
a. When an L2 cache miss occurs, the miss address indexes the index table
b. On a hit, the index table entry points to the most recent occurrence of the same miss address in the GHB
c. This GHB entry is at the head of a linked list of other entries with the same miss address
d. For each list element, the next entry in FIFO order is the miss address that immediately followed it in the past
e. These next miss addresses are the prefetch candidates
In the example: B and C
< Figure: example GHB contents; the key for the current miss address A points to its most recent occurrence, the linked list connects A's earlier occurrences, and the miss addresses that followed them (B and C) become the prefetch candidates >
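A rough C sketch of the lookup path (the structure and helper names are mine; stale links after FIFO wrap-around are ignored here, which a real design must handle):

#include <stdint.h>

#define GHB_SIZE 256
#define IDX_SIZE 64

/* Circular FIFO of global miss addresses; each entry links to the previous
 * occurrence of the same address. */
struct ghb_entry { uintptr_t miss_addr; int prev; };

static struct ghb_entry ghb[GHB_SIZE];
static int head = 0;                        /* next FIFO slot to overwrite   */

/* Toy direct-mapped index table: miss address -> most recent GHB position. */
static struct { uintptr_t key; int pos; } idx[IDX_SIZE];

static int index_lookup(uintptr_t addr) {
    int h = (int)(addr % IDX_SIZE);
    return (idx[h].key == addr) ? idx[h].pos : -1;
}

static void index_update(uintptr_t addr, int pos) {
    int h = (int)(addr % IDX_SIZE);
    idx[h].key = addr;
    idx[h].pos = pos;
}

/* Markov prefetching on an L2 miss: walk the linked list of previous
 * occurrences of this miss address; the entry that follows each of them in
 * FIFO order is a prefetch candidate (most recent first). */
static void ghb_miss(uintptr_t addr, uintptr_t *cand, int *ncand, int max) {
    int last = index_lookup(addr);
    *ncand = 0;
    for (int p = last; p >= 0 && *ncand < max; p = ghb[p].prev) {
        int next = (p + 1) % GHB_SIZE;      /* the miss that followed p      */
        if (next != head)
            cand[(*ncand)++] = ghb[next].miss_addr;
    }
    ghb[head].miss_addr = addr;             /* append the new miss           */
    ghb[head].prev = last;                  /* link to its prior occurrence  */
    index_update(addr, head);
    head = (head + 1) % GHB_SIZE;
}

For a miss history A, B, ..., A, C, ..., A, the third miss on A yields the candidates C and then B, most recent first.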
Prefetching Using a Global History Buffer
Improvements
1. The GHB FIFO naturally gives table space priority to the most recent history, eliminating the stale-data problem
2. The index table and the GHB can be sized separately
3. A designer can use the ordered global history to create more sophisticated prefetching methods than conventional table-based prefetching
Drawback
→ Collecting prefetch information requires multiple table accesses
→ However, the linked-list walk is short compared with the L2 miss latency
Conclusion for data prefetching
Prefetching schemes are diverse. Three basic questions help categorize a particular approach:
1. When are prefetches initiated?
2. Where are prefetched data placed?
3. What is prefetched?
The majority of prefetching schemes concentrate on numerical, array-based applications.
Despite the many application and system constraints, data prefetching has demonstrated the ability to reduce overall program execution time, both in simulation studies and in real systems.
References
Steven P. VanderWiel and David J. Lilja. 2000. Data Prefetch Mechanisms. ACM Computing Surveys, Vol. 32, No. 2.
Kyle J. Nesbit and James E. Smith. 2005. Data Cache Prefetching Using a Global History Buffer. IEEE Micro, Vol. 25, No. 1, January/February 2005.