TRANSCRIPT
Introduction to High Performance Programming
Takahiro Katagiri, Professor, Information Technology Center, Nagoya University
Introduction to Parallel Programming for Multicore/Manycore Clusters
National Center for Theoretical Sciences, Mathematics Division: "High Performance Computing" Short Course
Agenda
1. Hierarchical Cache Memories
2. Instruction Pipeline
3. Loop Unrolling
4. Contiguous Accesses for Arrays
5. Blocking
6. Other Techniques to Establish Speedups
Hierarchical Cache Memories
Very high-speed memories come only in small capacities.
Memory Hierarchies on Recent Computers
Higher levels are faster; lower levels are larger:

  Registers:    O(1) ns      (bytes)
  Caches:       O(10) ns     (KB - MB)
  Main Memory:  O(100) ns    (MB - GB)
  Hard Disk:    O(10) ms     (GB - TB)

The cost of moving data from main memory to the registers is O(100) times the cost of an access within the registers.
More Intuitive Explanations…
[Figure: relative capacities of a register, a cache, and main memory.]
To achieve high-performance programming, we should modify programs so that they work within a very small part of their access area at a time. (This is called "spatial locality.")
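As a small illustration (a sketch of our own, not from the slides; CHUNK is an assumed cache-friendly size), two element-wise passes over a large array can be fused chunk by chunk, so that each chunk is still cache-resident when its second pass runs:

  #include <stddef.h>

  #define CHUNK 4096   /* assumed small enough to stay in cache */

  /* Equivalent to sweeping the whole array twice, but each chunk
     is reused while it is still in the cache. */
  void two_passes_chunked(double *x, size_t n) {
      for (size_t s = 0; s < n; s += CHUNK) {
          size_t e = (s + CHUNK < n) ? s + CHUNK : n;
          for (size_t i = s; i < e; i++)   /* pass 1: scale */
              x[i] *= 2.0;
          for (size_t i = s; i < e; i++)   /* pass 2: shift; data still cached */
              x[i] += 1.0;
      }
  }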
An Example of Cache Organization
[Figure: cache organization. Inside the CPU, the arithmetic unit sends operation requests to the registers and stores computation results back into them. Data is supplied from main memory through the cache memory to the registers, and stored back along the same path. Main memory is divided into banks (blocks); cache directories use the upper and lower parts of the physical address to map each bank onto a set of cache lines.]
A Memory Organization of the Fujitsu FX10 (A K-Computer Type Supercomputer)
Registers
Level 1 Cache (32 KB / core)
Level 2 Cache (12 MB / 16 cores)
Main Memory (32 GB / node)
(Higher levels are faster; lower levels are larger. Data can reside at each level.)
High-speed access is achieved if the data is in the L1 cache.
An Example of Memory Organization for Each Node on the FX10
This is a hierarchical memory organization.
[Figure: one FX10 node. Sixteen cores (Core 0 - Core 15), each with a private L1 cache, share a single L2 cache, which connects to the node's main memory.]
An Example of Memory Organization for the Whole System on the FX10
Memory is hierarchical.
[Figure: the whole FX10 system. Many nodes, each organized as above (16 cores with private L1 caches, one shared L2 cache, and main memory), are connected by the Tofu network (5 GB/s in each direction).]
Node Organization of the FX10
- One socket per node, with 16 cores (Core #0 - Core #15); each core has a 32 KB L1 data cache.
- L2 cache: 12 MB, shared by the 16 cores.
- Memory: DDR3 DIMM, 4 GB x 2 channels x 4 = 32 GB per node.
- Memory bandwidth: 85 GB/s (= 8 bytes x 1333 MHz x 8 channels).
- ICC (interconnect controller) to the Tofu network: 20 GB/s.
Specification of the FX10 CPU (SPARC64 IXfx)
  Architecture name:          HPC-ACE (extended SPARC-V9 instruction set)
  Frequency:                  1.848 GHz
  L1 cache:                   32 KB (instruction and data separated)
  L2 cache:                   12 MB
  Software-controlled cache:  sector cache
  Instruction issue:          2 integer operation units; 4 floating-point fused multiply-add (FMA) units
  SIMD instructions:          2 FMAs work per instruction; each executes 2 floating-point operations (add and multiply)
  Registers:                  256 floating-point registers
  Others:                     instructions for sin and cos, conditional execution, division, and sqrt
Node Organization of the FX100
- Two sockets per node, each a CMG (Core Memory Group): NUMA (Non-Uniform Memory Access).
- Each socket has 16 compute cores (Core #0 - #15 on socket 0, Core #16 - #31 on socket 1) plus 1 assistant core; every core has a 64 KB L1 data cache.
- L2 cache: 12 MB per socket, shared by its 17 cores.
- Memory: HMC, 16 GB per socket; total memory per node: 32 GB.
- Memory bandwidth: reading 240 GB/s + writing 240 GB/s = 480 GB/s in total.
- ICC (interconnect controller) to the Tofu2 network.
Architectural Comparison between the FX10 and the FX100
Source: https://www.ssken.gr.jp/MAINSITE/event/2015/20151028-sci/lecture-04/SSKEN_sci2015_miyoshi_presentation.pdf
Item                          FX10                            FX100
Computation capacity / node   236 GFLOPS (double / single)    Double: 1.011 TFLOPS; Single: 2.022 TFLOPS
Number of cores               16                              32
Assistant cores               None                            2
SIMD length                   128 bits                        256 bits
SIMD operations               Floating-point operations;      Additionally: integer operations;
                              contiguous load/store           strided and indirect load/store
L1D cache / core              32 KB, 2-way                    64 KB, 4-way
L2 cache / node               12 MB                           24 MB
Memory bandwidth              85 GB/s                         480 GB/s (HMC)
Specification of the FX100 CPU (SPARC64 XIfx; Nagoya U.)
  Architecture name:          HPC-ACE2 (extended SPARC-V9 instruction set)
  Frequency:                  2.2 GHz
  L1 cache:                   64 KB (instruction and data separated)
  L2 cache:                   24 MB
  Software-controlled cache:  sector cache
  Instruction issue:          2 integer operation units; 8 floating-point fused multiply-add (FMA) units
  SIMD instructions:          2 FMAs work per instruction; 4 floating-point operations (add and multiply) can be executed
  Registers:                  256 floating-point registers
Instruction Pipeline
Assembly-line work for computations.
Assembly-Line Work
- For example, assembling cars: each worker owns one procedure (5 workers).
- Suppose the whole sequence takes 2 months (each process needs 0.4 month).
- Without a pipeline: after 2 months the first car is finished; after 4 months the second car is finished.
- Efficiency: 2 months per car.
The five steps: making the car body -> putting on the front and rear windows -> interior finishing -> exterior finishing -> confirming functions.
[Figure: timeline. The five steps are executed serially for the first, second, and third cars; each worker is idle except during his own 0.4-month step.]
Each worker works for 0.4 month and then rests for 1.6 months.
-> Very bad efficiency.
Assembly-Line Work
- We can use 5 working places. As soon as a car finishes one process, it is moved to the next process. A belt-conveyor system.
[Figure: the five processes, each taking 0.4 month, arranged in a line: making the car body -> putting on the front and rear windows -> interior finishing -> exterior finishing -> confirming functions.]
Assembly-Line Work
- With this method, the n-th car is finished after 2.0 + 0.4 x (n - 1) months: the first car in 2.0 months, the second in 2.4 months, the third in 2.8 months, the fourth in 3.2 months, the fifth in 3.6 months, and the sixth in 4.0 months.
- Efficiency: 4.0 / 6 ≈ 0.67 month per car, approaching 0.4 month per car as production continues.
[Figure: pipelined timeline. The five steps proceed in overlapped fashion for the first through fifth cars.]
When the production run is long enough, each worker keeps working in every 0.4-month slot. -> Very high efficiency.
This scheme is called a <Pipeline Process>.
Pipeline Processing on Computers
1. Hardware pipelining: performed by the computer hardware. Typical uses:
   1. Pipelined computation.
   2. Pipelined transfer of data (opcodes, operands) from memory.
2. Software pipelining: performed by how the program is written (see the sketch after this list). Typical uses:
   1. Pipelining by the compiler (instruction preload, data prefetch, data post-store).
   2. Pipelining by hand-modification of the program (data preload, loop unrolling).
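As a small illustration of hand-assisted prefetching (a sketch of our own, not from the slides; __builtin_prefetch is a GCC/Clang built-in, and the distance 16 is an assumed tuning parameter):

  #include <stddef.h>

  /* Sum an array while requesting data a fixed distance ahead,
     so memory fetches overlap with the additions. */
  double sum_with_prefetch(const double *x, size_t n) {
      double s = 0.0;
      for (size_t i = 0; i < n; i++) {
          if (i + 16 < n)
              __builtin_prefetch(&x[i + 16], 0, 1);  /* read access, low reuse */
          s += x[i];
      }
      return s;
  }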
In the Case of an Arithmetic Unit (AU)
- Ex) Processing inside an AU. (Note: a real AU works differently from the following.)
- Ex) Matrix-vector multiplication:

  for (j=0; j<n; j++)
    for (i=0; i<n; i++) {
      y[j] += A[j][i] * x[i];
    }

- Without pipelining, the processing looks as follows.
[Figure: without pipelining, each operation runs serially through: fetch data A from memory; fetch data B from memory; issue the computation; store the result. Only then does the next operation begin.]
In the Case of an Arithmetic Unit (AU)
- Without pipelining, the AU computes during only one of every four time units, hence poor efficiency. -> Computation efficiency: 1/4 = 25%.
- If we pipeline as follows, a computation completes in every time unit once enough time has passed. -> Computation efficiency: 100%.
- "Enough time" means a sufficiently long loop: a large loop length N keeps the pipeline free of stalls and thus highly efficient. -> If N is small, efficiency is poor.
[Figure: with pipelining, the stages (fetch data A, fetch data B, issue computation, store result) of successive operations overlap, so one result is produced in every time unit.]
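A rough model of this behavior (our own formulation, not from the slides): with a pipeline of depth $d$ and loop length $N$, where each stage takes one time unit $\tau$,

\[
T(N) = (d + N - 1)\,\tau, \qquad
\mathrm{efficiency}(N) = \frac{N}{d + N - 1} \longrightarrow 1 \quad (N \to \infty).
\]

With $d = 4$ as in the figures, $N = 1$ gives $1/4 = 25\%$, while $N = 100$ already gives $100/103 \approx 97\%$.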
Summary of Computation Pipelining
- A concept for keeping the AU fully used -> high-performance computing.
- Fetching data from main memory is very slow. By scheduling load instructions in a pipelined way, the fetch time can be hidden. -> The AU then works in every time unit.
- In real execution this is not easy, because of the following issues:
  1. Delays due to the computer architecture, such as the limited number of registers and the limited data-supply bandwidth from memory to CPU and from CPU to memory. (The CPU of the FX100 is based on SPARC64.)
  2. Loop overhead: initializing and incrementing the loop induction variables (i, j) and evaluating the loop-exit condition.
  3. Computing the memory addresses needed to access array elements.
  4. Whether the compiler generates pipelined code or not.
Note for Actual CPUs
- Actual CPUs have independent pipelines for: 1. addition and subtraction, and 2. multiplication. Moreover, there are instructions that drive several pipelines simultaneously.
- The Intel Pentium 4 has a pipeline depth of 31! The time needed to fill such a pipeline is very large, and with many branches, instruction-prediction misses occur and efficiency drops.
- Recent multicore and manycore CPUs need only a small pipeline depth because of their low frequency: the Xeon Phi pipeline depth is 7.
Hardware Information for the FX10
- 8 floating-point operations per clock can be performed: each FMA executes 2 multiply-add pairs (4 floating-point operations), and 2 FMAs work per clock: 4 x 2 = 8 floating-point operations per clock.
- The clock is 1.848 GHz per core, hence the theoretical peak performance is: 1.848 GHz x 8 = 14.784 GFLOPS / core; 1 node (16 cores): 14.784 x 16 = 236.5 GFLOPS / node.
- Number of floating-point registers: 256 per core.
Loop Unrolling
Compilers do not always perform this automatically.
What is "Loop Unrolling"?
- A tuning technique, also used inside compilers, that rewrites code to enhance:
  1. register allocation for data;
  2. pipelining.
- The loop stride is changed from 1 to m: <depth-m unrolling>.
Loop Unrolling
- In compiler terminology, loop unrolling means unrolling the innermost loop: the narrow-sense definition.
- Computational scientists also say "unrolling" for unrolling multiple loops: the broad-sense definition. Compiler optimization classifies this as loop restructuring.
An Example of Loop Unrolling (Matrix-Matrix Multiplications, C Language)
- k-loop depth-2 unrolling, where n is divisible by 2:

  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      for (k=0; k<n; k+=2)
        C[i][j] += A[i][k] * B[k][j] + A[i][k+1] * B[k+1][j];

- The number of loop-counter checks in the k-loop is halved.
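When n is not divisible by the unroll depth, a remainder (cleanup) step is needed; a minimal sketch of our own, not from the slides:

  /* Depth-2 unrolling of the k-loop with a cleanup step for odd n. */
  for (i=0; i<n; i++)
    for (j=0; j<n; j++) {
      int k;
      for (k=0; k+1<n; k+=2)               /* main unrolled body */
        C[i][j] += A[i][k] * B[k][j] + A[i][k+1] * B[k+1][j];
      if (k < n)                           /* leftover iteration */
        C[i][j] += A[i][k] * B[k][j];
    }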
An Example of Loop Unrolling (Matrix-Matrix Multiplications, C Language)
- j-loop depth-2 unrolling, where n is divisible by 2:

  for (i=0; i<n; i++)
    for (j=0; j<n; j+=2)
      for (k=0; k<n; k++) {
        C[i][j  ] += A[i][k] * B[k][j  ];
        C[i][j+1] += A[i][k] * B[k][j+1];
      }

- By allocating A[i][k] to a register, it can be accessed at high speed.
An Example of Loop Unrolling (Matrix-Matrix Multiplications, C Language)
- i-loop depth-2 unrolling, where n is divisible by 2:

  for (i=0; i<n; i+=2)
    for (j=0; j<n; j++)
      for (k=0; k<n; k++) {
        C[i  ][j] += A[i  ][k] * B[k][j];
        C[i+1][j] += A[i+1][k] * B[k][j];
      }

- By allocating B[k][j] to a register, it can be accessed at high speed.
An Example of Loop Unrolling (Matrix-Matrix Multiplications, C Language)
- i-loop and j-loop depth-2 unrolling, where n is divisible by 2:

  for (i=0; i<n; i+=2)
    for (j=0; j<n; j+=2)
      for (k=0; k<n; k++) {
        C[i  ][j  ] += A[i  ][k] * B[k][j  ];
        C[i  ][j+1] += A[i  ][k] * B[k][j+1];
        C[i+1][j  ] += A[i+1][k] * B[k][j  ];
        C[i+1][j+1] += A[i+1][k] * B[k][j+1];
      }

- By allocating A[i][k], A[i+1][k], B[k][j], and B[k][j+1] to registers, they can be accessed at high speed.
An Example of Loop Unrolling (Matrix-Matrix Multiplications, C Language)
- To make the register reuse obvious to compilers, the following code is better in some situations:

  for (i=0; i<n; i+=2)
    for (j=0; j<n; j+=2) {
      dc00 = C[i  ][j  ]; dc01 = C[i  ][j+1];
      dc10 = C[i+1][j  ]; dc11 = C[i+1][j+1];
      for (k=0; k<n; k++) {
        da0 = A[i  ][k];  da1 = A[i+1][k];
        db0 = B[k][j  ];  db1 = B[k][j+1];
        dc00 += da0 * db0;  dc01 += da0 * db1;
        dc10 += da1 * db0;  dc11 += da1 * db1;
      }
      C[i  ][j  ] = dc00; C[i  ][j+1] = dc01;
      C[i+1][j  ] = dc10; C[i+1][j+1] = dc11;
    }
An Example of Loop Unrolling (Matrix-Matrix Multiplications, Fortran Language)
- k-loop depth-2 unrolling, where n is divisible by 2:

  do i=1, n
    do j=1, n
      do k=1, n, 2
        C(i, j) = C(i, j) + A(i, k) * B(k, j) + A(i, k+1) * B(k+1, j)
      enddo
    enddo
  enddo

- The number of loop-counter checks in the k-loop is halved.
An Example of Loop Unrolling (Matrix-Matrix Multiplications, Fortran Language)
- j-loop depth-2 unrolling, where n is divisible by 2:

  do i=1, n
    do j=1, n, 2
      do k=1, n
        C(i, j  ) = C(i, j  ) + A(i, k) * B(k, j  )
        C(i, j+1) = C(i, j+1) + A(i, k) * B(k, j+1)
      enddo
    enddo
  enddo

- By allocating A(i, k) to a register, it can be accessed at high speed.
An Example of Loop Unrolling (Matrix-Matrix Multiplications, Fortran Language)
- i-loop depth-2 unrolling, where n is divisible by 2:

  do i=1, n, 2
    do j=1, n
      do k=1, n
        C(i  , j) = C(i  , j) + A(i  , k) * B(k, j)
        C(i+1, j) = C(i+1, j) + A(i+1, k) * B(k, j)
      enddo
    enddo
  enddo

- By allocating B(k, j) to a register, it can be accessed at high speed.
An Example of Loop Unrolling (Matrix-Matrix Multiplications, Fortran Language)
- i-loop and j-loop depth-2 unrolling, where n is divisible by 2:

  do i=1, n, 2
    do j=1, n, 2
      do k=1, n
        C(i  , j  ) = C(i  , j  ) + A(i  , k) * B(k, j  )
        C(i  , j+1) = C(i  , j+1) + A(i  , k) * B(k, j+1)
        C(i+1, j  ) = C(i+1, j  ) + A(i+1, k) * B(k, j  )
        C(i+1, j+1) = C(i+1, j+1) + A(i+1, k) * B(k, j+1)
      enddo
    enddo
  enddo

- By allocating A(i, k), A(i+1, k), B(k, j), and B(k, j+1) to registers, they can be accessed at high speed.
An Example of Loop Unrolling (Matrix-Matrix Multiplications, Fortran Language)
- To make the register reuse obvious to compilers, the following code is better in some situations:

  do i=1, n, 2
    do j=1, n, 2
      dc00 = C(i  , j  ); dc01 = C(i  , j+1)
      dc10 = C(i+1, j  ); dc11 = C(i+1, j+1)
      do k=1, n
        da0 = A(i  , k); da1 = A(i+1, k)
        db0 = B(k, j  ); db1 = B(k, j+1)
        dc00 = dc00 + da0 * db0; dc01 = dc01 + da0 * db1
        dc10 = dc10 + da1 * db0; dc11 = dc11 + da1 * db1
      enddo
      C(i  , j  ) = dc00; C(i  , j+1) = dc01
      C(i+1, j  ) = dc10; C(i+1, j+1) = dc11
    enddo
  enddo
Contiguous Accesses for Arrays
Strided access gives poor performance.
Storage Formats for Arrays
[Figure: a 4x4 array laid out in memory. In C (A[i][j]), elements are stored row by row, so contiguous access runs along the rightmost index j. In Fortran (A(i, j)), elements are stored column by column, so contiguous access runs along the leftmost index i.]
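To illustrate (a sketch of our own; the size N is an assumption, not from the slides), the two C loops below compute the same sum, but only the first walks memory contiguously:

  #define N 1024
  static double A[N][N];

  /* Contiguous: the inner loop runs over the rightmost index j,
     so successive accesses touch adjacent memory locations. */
  double sum_rowwise(void) {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += A[i][j];
      return s;
  }

  /* Strided: the inner loop runs over the leftmost index i,
     so successive accesses jump N * sizeof(double) bytes. */
  double sum_columnwise(void) {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += A[i][j];
      return s;
  }

In Fortran the situation is reversed: the leftmost index should be the innermost loop.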
Cache and Cache Lines
- Data-mapping methods between main memory and the cache.
- From main memory to cache:
  - Direct mapping: each unit (memory bank) maps directly onto a fixed cache line.
  - Set associative: indirect mapping through a hash function.
- From cache to main memory:
  - Store through (write-through): the content is written to main memory as soon as the data is written into the cache.
  - Store in (write-back): the content is written to main memory when the cache line is replaced.
[Figure: memory banks in main memory are mapped by a mapping function onto the cache lines (Line 0 - Line 5).]
Cache Line Conflict
- Consider direct mapping, which maps main-memory addresses directly onto cache lines; the mapping stride is 4 in this example.
- Data in main memory at stride 4 therefore maps onto the same cache line.
- In this example, the access runs opposite to the direction of contiguous storage. -> In C, this corresponds to accessing along the i direction.
[Figure: elements 1-16 in main memory map cyclically onto cache lines 0-3; the access direction is perpendicular to the contiguous storage direction.]
Cache Line Conflict
1. In this situation, element 5 is accessed as soon as element 1 has been stored into cache line 0, so element 1 must be evicted from line 0 immediately.
2. Likewise, element 9 is accessed as soon as element 5 has been stored into cache line 0, so element 5 must be evicted immediately.
[Figure: elements 1, 5, and 9 all map to cache line 0; each access evicts the previous element before it can be reused from the registers.]
Cache Line Conflict
- Steps 1 and 2 above happen continuously in this example.
- The transfer path is saturated by data movement from main memory to cache, so no additional data can be moved.
- This is the same as accessing main memory sequentially: it is as if there were no cache at all.
- Data does not reach the arithmetic unit when it is needed, so computation stalls and performance drops.
- This phenomenon is called <Cache Thrashing> or <Cache Line Conflict>.
Memory Interleaving
- Consider direct physical access to an array. When data is accessed, accesses to banks near the currently accessed bank are very likely.
- Exploiting this, hardware can move data from nearby banks onto cache lines in advance: while data on line 0 is being accessed, data from a nearby bank can be transferred to line 1 in parallel. This is called <Memory Interleaving>.
- From the viewpoint of the arithmetic unit (AU), data-access time is shortened, so the AU's idle time shrinks. -> This yields high efficiency.
- Make your program's loops access arrays in storage order!
Conditions for Cache Line Conflict
- Memory banks are usually assigned to cache lines in units that are powers of 2, for example 32, 64, and 128.
- If performance drops by a factor of 1/2 to 1/3, sometimes even 1/10, at particular array sizes such as 1024, a cache line conflict may be happening.
- Real programs behave in very complex ways, so it is difficult to pin down the exact conditions for cache line conflicts. But: allocating arrays whose sizes are powers of 2 should be avoided.
How to Avoid Cache Line Conflicts
To avoid cache line conflicts, we can apply the following:
1. Padding: allocate extra array elements so that the size is not a power of 2, and use only part of the allocated array. (Compilers also offer options for padding.)
2. Data compression: allocate a new array, move the data into it, and compute with the new array.
3. Conflict prediction: build a routine that predicts cache line conflicts into the program; whenever an array is allocated, the routine is called to check for conflicts.
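A minimal padding sketch (our own illustration, not from the slides; the sizes are assumptions): avoid a power-of-2 leading dimension by allocating one extra column and leaving it unused:

  #define N    1024       /* problem size: a power of 2 */
  #define LDA  (N + 1)    /* padded leading dimension: not a power of 2 */

  static double A[N][LDA];  /* only columns 0..N-1 are used */

  void scale(double alpha) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)   /* the padding column is never touched */
              A[i][j] *= alpha;
  }

Successive rows now start 1025 doubles apart, so column-wise accesses no longer collide on the same cache line.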
Blocking
Reusing data within a small access area.
Access Localization by Blocking
- The cache has a finite size. Even with contiguous access, data is evicted from the cache once the accessed area exceeds the cache size.
- If data is evicted frequently, the situation is the same as fetching everything from main memory: the speed advantage of the cache is lost.
- Hence, for high performance we need to:
  1. fill the cache lines with data;
  2. reuse the data on those cache lines many times.
An Example of Cache Misses, With and Without Blocking
- Matrix-matrix multiplication; matrix size 8 x 8: double A[8][8];
- Number of cache lines: 4. Four array elements fit on one cache line: line size = 4 x 8 bytes (double) = 32 bytes.
- Row-wise access of the arrays is contiguous (C language).
- Cache replacement algorithm: Least Recently Used (LRU).
Relationship between the Organization of Arrays and Cache Lines
- In this setting, the relationship between the arrays and the cache lines is as follows (cache line conflicts are not considered).
- In C, for the arrays A[i][j], B[i][j], C[i][j]:
[Figure: the 8 x 8 array numbered 1-64 in storage order; each group of 4 consecutive elements (1 x 4) occupies one cache line.]
- Which line is used is determined by <the access pattern of the arrays> and <the cache replacement algorithm>.
Matrix-Matrix Multiplication (No Blocking)
[Figure: C = A * B without blocking, with 4 cache lines and LRU replacement (the least recently accessed line is evicted). Streaming a row of A and a column of B through the cache evicts lines before they can be reused; computing 2 elements of C incurs 22 cache misses.]
Matrix-Matrix Multiplication (Blocking: 2 Elements)
[Figure: C = A * B computed in 2 x 2 blocking units, with 4 cache lines and LRU replacement. The lines holding the blocks of A, B, and C are reused before being evicted; computing the same 2 elements of C incurs only 10 cache misses, versus 22 without blocking.]
Matrix-Matrix Multiplication Code (C Language): Without Blocking
Code example:

  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      for (k=0; k<n; k++)
        C[i][j] += A[i][k] * B[k][j];
[Figure: index directions: C is indexed by (i, j), A by (i, k), and B by (k, j).]
Matrix-Matrix Multiplication Code (C Language): With Blocking
When n is divisible by the blocking size (ibl = 16), blocking yields the following six-nested loop:

  ibl = 16;
  for (ib=0; ib<n; ib+=ibl) {
    for (jb=0; jb<n; jb+=ibl) {
      for (kb=0; kb<n; kb+=ibl) {
        for (i=ib; i<ib+ibl; i++) {
          for (j=jb; j<jb+ibl; j++) {
            for (k=kb; k<kb+ibl; k++) {
              C[i][j] += A[i][k] * B[k][j];
  } } } } } }
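When n is not divisible by ibl, the block bounds can simply be clipped to n (a sketch of our own, not from the slides):

  ibl = 16;
  for (ib=0; ib<n; ib+=ibl) {
    int imax = (ib+ibl < n) ? ib+ibl : n;   /* clip each block edge to n */
    for (jb=0; jb<n; jb+=ibl) {
      int jmax = (jb+ibl < n) ? jb+ibl : n;
      for (kb=0; kb<n; kb+=ibl) {
        int kmax = (kb+ibl < n) ? kb+ibl : n;
        for (i=ib; i<imax; i++)
          for (j=jb; j<jmax; j++)
            for (k=kb; k<kmax; k++)
              C[i][j] += A[i][k] * B[k][j];
  } } }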
Matrix-Matrix Multiplication Code (Fortran Language): With Blocking
When n is divisible by the blocking size (ibl = 16), blocking yields the following six-nested loop:

  ibl = 16
  do ib=1, n, ibl
    do jb=1, n, ibl
      do kb=1, n, ibl
        do i=ib, ib+ibl-1
          do j=jb, jb+ibl-1
            do k=kb, kb+ibl-1
              C(i, j) = C(i, j) + A(i, k) * B(k, j)
  enddo; enddo; enddo; enddo; enddo; enddo
Data Access Pattern with Cache Blocking
[Figure: C = A x B is performed as a sequence of multiplications of small ibl x ibl submatrices: an ibl x ibl block of C is updated by products of ibl x ibl blocks of A and B as the block index along k advances.]
Unrolling a Blocked Matrix-Matrix Multiplication (C Language)
- Unrolling can be applied to the six-nested loop of the blocked matrix-matrix multiplication.
- Depth-2 unrolling of the i-loop and j-loop looks as follows, where the blocking size ibl is divisible by 2:

  ibl = 16;
  for (ib=0; ib<n; ib+=ibl) {
    for (jb=0; jb<n; jb+=ibl) {
      for (kb=0; kb<n; kb+=ibl) {
        for (i=ib; i<ib+ibl; i+=2) {
          for (j=jb; j<jb+ibl; j+=2) {
            for (k=kb; k<kb+ibl; k++) {
              C[i  ][j  ] += A[i  ][k] * B[k][j  ];
              C[i+1][j  ] += A[i+1][k] * B[k][j  ];
              C[i  ][j+1] += A[i  ][k] * B[k][j+1];
              C[i+1][j+1] += A[i+1][k] * B[k][j+1];
  } } } } } }
Unrolling a Blocked Matrix-Matrix Multiplication (Fortran Language)
- Unrolling can be applied to the six-nested loop of the blocked matrix-matrix multiplication.
- Depth-2 unrolling of the i-loop and j-loop looks as follows, where the blocking size ibl is divisible by 2:

  ibl = 16
  do ib=1, n, ibl
    do jb=1, n, ibl
      do kb=1, n, ibl
        do i=ib, ib+ibl-1, 2
          do j=jb, jb+ibl-1, 2
            do k=kb, kb+ibl-1
              C(i  , j  ) = C(i  , j  ) + A(i  , k) * B(k, j  )
              C(i+1, j  ) = C(i+1, j  ) + A(i+1, k) * B(k, j  )
              C(i  , j+1) = C(i  , j+1) + A(i  , k) * B(k, j+1)
              C(i+1, j+1) = C(i+1, j+1) + A(i+1, k) * B(k, j+1)
  enddo; enddo; enddo; enddo; enddo; enddo
Other Techniques to Establish Speedups
Delete Common Subexpressions (1)
There is redundant computation in the following program:

  d = a + b + c;
  f = d + a + b;

Some compilers optimize this, but the following is better:

  temp = a + b;
  d = temp + c;
  f = d + temp;
Delete Common Subexpressions (2)
It is better to avoid redundant array accesses:

  for (i=0; i<n; i++) {
    xold[i] = x[i];
    x[i] = x[i] + y[i];
  }

The following is one way to rewrite it:

  for (i=0; i<n; i++) {
    dtemp = x[i];
    xold[i] = dtemp;
    x[i] = dtemp + y[i];
  }
Code Motion
Division takes a long time; do not write it inside a loop if possible.

  for (i=0; i<n; i++) {
    a[i] = a[i] / sqrt(dnorm);
  }

In a case like the above, hoist the division out and multiply by the reciprocal:

  dtemp = 1.0 / sqrt(dnorm);
  for (i=0; i<n; i++) {
    a[i] = a[i] * dtemp;
  }
IF Statements Inside Loops
Avoid writing if statements inside loops as much as possible.

  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      if ( i != j ) A[i][j] = B[i][j];
      else          A[i][j] = 1.0;
    }
  }

In this case, the following is better:

  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      A[i][j] = B[i][j];
    }
  }
  for (i=0; i<n; i++) A[i][i] = 1.0;
Strengthen Software Pipelining
Original code (depth-2 unrolling):

  for (i=0; i<n; i+=2) {
    dtmpb0 = b[i];
    dtmpc0 = c[i];
    dtmpa0 = dtmpb0 + dtmpc0;
    a[i] = dtmpa0;
    dtmpb1 = b[i+1];
    dtmpc1 = c[i+1];
    dtmpa1 = dtmpb1 + dtmpc1;
    a[i+1] = dtmpa1;
  }

The distance between each definition and its use is short. -> Software can do nothing more here.

Code that strengthens software pipelining (depth-2 unrolling):

  for (i=0; i<n; i+=2) {
    dtmpb0 = b[i];
    dtmpb1 = b[i+1];
    dtmpc0 = c[i];
    dtmpc1 = c[i+1];
    dtmpa0 = dtmpb0 + dtmpc0;
    dtmpa1 = dtmpb1 + dtmpc1;
    a[i] = dtmpa0;
    a[i+1] = dtmpa1;
  }

The distance between each definition and its use is large. -> This creates many opportunities for software pipelining.