Global Load Instruction Aggregation Based on Code Motion
The 2012 International Symposium on Parallel Architectures, Algorithms and Programming
December 18, 2012
Outline
Background
Previous works
Motivations
Partial Redundancy Elimination (PRE)
Lazy Code Motion (LCM)
Global Load Instruction Aggregation (GLIA)
Experiment results
Conclusion
Background
[Diagram: Processor and Main memory, each labeled with a speed]
[Diagram: Cache memory, marked "Important", shown with Processor and Main memory]
Previous works
1. Prefetch instructions
2. Transform loop structures
before:
    for(j=0;j<10;j++)
      for(i=0;i<10;i++)
        ... = a[i][j]
after:
    for(i=0;i<10;i++)
      for(j=0;j<10;j++)
        ... = a[i][j]
[Animation: the order in which the "before" loop (j outer, i inner) touches a[i][j], drawn on a grid with rows i:0, i:1, ... and columns j:0, j:1, ...]
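The effect of the loop transformation can be sketched with a toy cache model. The cache below is a hypothetical direct-mapped cache (two lines of four elements each), and the array is assumed to be stored row by row as in C; neither parameter comes from the slides.

```python
def count_misses(addresses, line_size=4, num_lines=2):
    """Count misses in a tiny direct-mapped cache (hypothetical parameters)."""
    lines = [None] * num_lines          # memory block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_size       # memory block that contains addr
        slot = block % num_lines        # direct-mapped placement
        if lines[slot] != block:
            misses += 1
            lines[slot] = block
    return misses

N = 10
# Assumed row-major layout: a[i][j] lives at offset i*N + j.
before = [i * N + j for j in range(N) for i in range(N)]   # j outer, i inner
after = [i * N + j for i in range(N) for j in range(N)]    # i outer, j inner

print(count_misses(before), count_misses(after))
```

Under these assumptions the "after" order walks memory sequentially and misses once per cache line, while the "before" order jumps N elements on every access and misses every time.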
Problems
1. They are local techniques.
   e.g. target: initial load instruction, loops only.
2. They need to change the structure of the program.
How can we apply cache optimization to any program globally?
    main(){
      ... = a[i]
      ... = b[i]
      ... = a[i+1]
    }
[Animation: main memory holds a[i], a[i+1], b[i], b[i+1]. Loading a[i] brings a[i] and a[i+1] into cache memory. Loading b[i] then replaces them with b[i] and b[i+1], so the following load of a[i+1] is a cache miss.]
We can remove this cache miss by changing the order of accesses.
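A minimal sketch of this scenario, assuming a cache with a single line that holds two adjacent elements, an even i so that a[i] and a[i+1] share a line, and made-up base addresses for the two arrays:

```python
def count_misses(addresses, line_size=2, num_lines=1):
    """Count misses in a tiny direct-mapped cache (hypothetical parameters)."""
    lines = [None] * num_lines
    misses = 0
    for addr in addresses:
        block = addr // line_size
        slot = block % num_lines
        if lines[slot] != block:
            misses += 1
            lines[slot] = block
    return misses

A_BASE, B_BASE = 0, 100   # hypothetical base addresses of arrays a and b
i = 0                     # assumed even, so a[i] and a[i+1] share a cache line

def a(k):
    return A_BASE + k

def b(k):
    return B_BASE + k

original = [a(i), b(i), a(i + 1)]    # a[i], b[i], a[i+1]
reordered = [a(i), a(i + 1), b(i)]   # a[i], a[i+1], b[i]

print(count_misses(original))    # 3: b[i] evicts the line that held a[i+1]
print(count_misses(reordered))   # 2: a[i+1] hits before b[i] is loaded
```

The only difference between the two runs is the order of the three loads, which is the property the rest of the talk exploits.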
Code motion
    x = a[i]
    z = b[i]
    (Expel from cache memory)
    w = a[i+j]
    y = x+1
[Animation: the position of w = a[i+j] changes across frames; the frames are annotated with "Live range of w", the live ranges of w and x, and "Spill".]
    x = a[i]
    t = Load(j)
    z = b[i]
    w = a[i+t]
    y = x+1
Change the access order
[Animation: the final frames are annotated "Delayed".]
Implementation
We use Partial Redundancy Elimination (PRE).
PRE is a code optimization that eliminates redundant expressions.
PRE
[Figure: a control-flow graph before PRE, containing x = a[i] and y = a[i], and after PRE, containing t = a[i]; x = t, a second t = a[i], and y = t]
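The benefit of PRE can be checked by counting loads along each execution path. The sketch below assumes the figure is a diamond-shaped graph: x = a[i] on the left branch, nothing on the right branch, and y = a[i] at the join. The block names are hypothetical.

```python
# Each block is a list of (destination, right-hand side) statements.
before = {
    "left":  [("x", "a[i]")],
    "right": [],
    "join":  [("y", "a[i]")],
}
after = {
    "left":  [("t", "a[i]"), ("x", "t")],
    "right": [("t", "a[i]")],
    "join":  [("y", "t")],
}
paths = [["left", "join"], ["right", "join"]]

def loads_on_path(cfg, path, expr="a[i]"):
    """Number of times expr is evaluated when execution follows path."""
    return sum(rhs == expr for block in path for _, rhs in cfg[block])

for path in paths:
    print(path, loads_on_path(before, path), loads_on_path(after, path))
```

Before PRE the left path loads a[i] twice; afterwards every path loads it exactly once, and no path executes more loads than it did before.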
LCM
LCM determines two insertion nodes: Earliest and Latest.
• Earliest(n) denotes that node n is the closest to the start node among the nodes where the instruction can be inserted.
• Latest(n) denotes that node n is the closest to the nodes that contain the same load instruction.
[Figure: control-flow graph containing x = a[i] and y = a[i]]
Knoop, J., et al.: Lazy Code Motion, Proc. Programming Language Design and Implementation, ACM, pp. 224-234, 1992.
LCM
[Animation: t = a[i] is inserted into the graph containing x = a[i] and y = a[i], then delayed step by step (frames annotated "Delayed"); in the final frame the two loads have become x = t and y = t.]
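Earliest and Latest can be computed with a few dataflow passes. The sketch below follows the block-level formulation commonly given in compiler textbooks (anticipated, available, postponable), handles a single expression, and omits the final pass that drops unneeded temporaries. It is not necessarily the formulation used in the talk; the function name, the block names, and the diamond-shaped example graph are assumptions, and every edge into a join block is assumed to come from a block with a single successor.

```python
def lazy_code_motion(succ, use, kill, entry):
    """Return the (earliest, latest) block sets for one expression.

    succ: dict mapping each block to its list of successors
    use:  blocks that evaluate the expression before killing its operands
    kill: blocks that redefine an operand of the expression
    """
    blocks = list(succ)
    pred = {b: [p for p in blocks if b in succ[p]] for b in blocks}

    def solve(step):
        """Iterate step(b, value) from an all-True start until nothing changes."""
        value = {b: True for b in blocks}
        changed = True
        while changed:
            changed = False
            for b in blocks:
                new = step(b, value)
                if new != value[b]:
                    value[b], changed = new, True
        return value

    # Pass 1 (backward): the expression is anticipated at the entry of b.
    def ant_step(b, ant_in):
        ant_out = bool(succ[b]) and all(ant_in[s] for s in succ[b])
        return b in use or (ant_out and b not in kill)
    ant_in = solve(ant_step)

    # Pass 2 (forward): the expression would be available at the exit of b
    # if it were computed wherever it is anticipated.
    def avail_in(b, avail_out):
        return b != entry and all(avail_out[p] for p in pred[b])
    def avail_step(b, avail_out):
        return (ant_in[b] or avail_in(b, avail_out)) and b not in kill
    avail_out = solve(avail_step)

    earliest = {b for b in blocks if ant_in[b] and not avail_in(b, avail_out)}

    # Pass 3 (forward): the insertion can be postponed past the exit of b.
    def post_in(b, post_out):
        return b != entry and all(post_out[p] for p in pred[b])
    def post_step(b, post_out):
        return (b in earliest or post_in(b, post_out)) and b not in use
    post_out = solve(post_step)

    def can_place(b):
        # The insertion point has been carried down to the entry of b.
        return b in earliest or post_in(b, post_out)

    # b is a latest point if the insertion cannot be pushed any further down.
    latest = {b for b in blocks
              if can_place(b)
              and (b in use or not all(can_place(s) for s in succ[b]))}
    return earliest, latest

# Diamond: B1 branches to B2 (x = a[i]) and B3 (empty); both reach B4 (y = a[i]).
succ = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
earliest, latest = lazy_code_motion(succ, use={"B2", "B4"}, kill=set(), entry="B1")
print(sorted(earliest), sorted(latest))   # ['B1'] ['B2', 'B3']
```

In this example the load could be placed as early as B1, but delaying it to B2 and B3 shortens the live range of the temporary while still removing the repeated load on the path through B2.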
Global Load Instruction Aggregation (GLIA)
Purpose
1. Decrease cache misses.
2. Suppress register spills.
Extension
1. Move load instructions that are not redundant.
2. Delay them while considering the order of memory accesses.
GLIA
[Animation: a graph containing x = a[i], y = b[i], and w = a[i+1]. A new load t = a[i+1] first appears beside x = a[i], then beside y = b[i]; in the final frame w = a[i+1] has become w = t.]
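The idea in these frames can be sketched for a single straight-line block. This is a simplified local illustration, not the global algorithm of the talk: it assumes the block contains only loads (no stores or aliasing), and the tuple encoding, the function name, and the temporaries t0, t1, ... are hypothetical.

```python
from itertools import count

def aggregate_loads(block):
    """Group loads of the same array in a straight-line block.

    block: list of ("load", dest, array, index_var, offset) tuples.
    Returns a new list that may also contain ("copy", dest, src) tuples.
    """
    fresh = (f"t{n}" for n in count())   # hypothetical temporary names
    out = []
    for instr in block:
        _, dest, array, var, offset = instr
        # Positions of earlier loads from the same array.
        same = [k for k, ins in enumerate(out)
                if ins[0] == "load" and ins[2] == array]
        # Loads issued after the last of them; these all touch other arrays.
        later = [k for k, ins in enumerate(out)
                 if same and k > same[-1] and ins[0] == "load"]
        barrier = later[0] if later else None
        # Moving is only safe if the index variable is not redefined on the way.
        safe = barrier is not None and all(ins[1] != var for ins in out[barrier:])
        if safe:
            tmp = next(fresh)
            # Place the load just before the first load of another array,
            # i.e. as late as possible while keeping the array's loads together.
            out.insert(barrier, ("load", tmp, array, var, offset))
            out.append(("copy", dest, tmp))
        else:
            out.append(instr)
    return out

block = [
    ("load", "x", "a", "i", 0),   # x = a[i]
    ("load", "y", "b", "i", 0),   # y = b[i]
    ("load", "w", "a", "i", 1),   # w = a[i+1]
]
for ins in aggregate_loads(block):
    print(ins)
```

For the three loads of the example this produces x = a[i], t0 = a[i+1], y = b[i], w = t0: the second access to a is issued before b is touched, and the temporary is defined no earlier than necessary.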
Application to the entire program
[Animation: a control-flow graph with the loads = a[i], = a[i+1], = b[i], and = a[i+1]; in the final frame = a[i+1] and = b[i] appear together.]
Experiment
We implemented our technique in the COINS compiler as an LIR converter.
Benchmark
SPEC2000
Measurement
1. Execution efficiency
2. The number of cache misses
Experiment (1/2) | Execution efficiency
Environment
SPARC64-V 2GHz, Solaris 10
Optimization
BASE: applies Dead Code Elimination (DCE).
GLIADCE: applies GLIA and DCE.
The improvement for art was about 10.5%.
Reason for the decrease (1): speculative code motion
[Animation: the loads = a[i], = a[j], and = b[i] start in separate places in the graph and appear together in the final frame.]
Reason for the decrease (2): register spill
[Chart: the number of spills]
Experiment (2/2) | Cache misses
System parameters of the x86 machine
Intel Core i5-2320 3.00GHz
Floating-point registers: 8
Integer registers: 8
L1D cache memory: 32KB
L2 cache memory: 256KB
L3 cache memory: 6144KB
Experiment (2/2) | Level 2 cache misses
The improvement for twolf was about 10.6%.
Experiment (2/2) | Level 3 cache misses
The improvement for art was about 93.7%.
Conclusion
We proposed a new cache optimization.
1. GLIA can be applied to any program.
2. GLIA improves cache efficiency.
3. GLIA takes register spills into account.
Thank you for your attention.