Parallel and Distributed Computing on Low Latency Clusters

DESCRIPTION

Slides from the thesis defence in Chicago by Vittorio Giovara.

TRANSCRIPT
![Page 1: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/1.jpg)
Parallel and Distributed Computing on Low Latency Clusters

Vittorio Giovara
M.S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009
![Page 2: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/2.jpg)
Contents

• Motivation
• Strategy
• Technologies
• OpenMP
• MPI
• Infiniband
• Application
• Compiler Optimizations
• OpenMP and MPI over Infiniband
• Results
• Conclusions
![Page 3: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/3.jpg)
Motivation
![Page 4: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/4.jpg)
Motivation
• The scaling trend of CMOS technology must eventually stop:
✓ Direct-tunneling limit in SiO2 ~3 nm
✓ Distance between Si atoms ~0.3 nm
✓ Variability
• Fundamental economic reason: rising fabrication costs
![Page 5: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/5.jpg)
Motivation
• Easy to build multi-core processors
• Requires human effort to modify and adapt concurrent software
• New classification needed for computer architectures
![Page 6: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/6.jpg)
Classification (Flynn's taxonomy)

• SISD: single instruction stream, single data stream (one CPU)
• SIMD: single instruction stream, multiple data streams (multiple CPUs)
• MISD: multiple instruction streams, single data stream
• MIMD: multiple instruction streams, multiple data streams
![Page 7: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/7.jpg)
[Diagram: abstraction levels of parallelization, ordered by ease of parallelizing: algorithm, loop level, process management]
![Page 8: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/8.jpg)
Levels

[Diagram: the parallelization levels (algorithm, loop level, process management) annotated with their concerns: data dependency, branching overhead, control flow, recursion, memory management, profiling, SMP, multiprogramming, multithreading and scheduling]
![Page 9: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/9.jpg)
Backfire

• Difficult to fully exploit the parallelism offered
• Automatic tools are required to adapt software to parallelism
• Compiler support is needed for manual or semi-automatic enhancement
![Page 10: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/10.jpg)
Applications
• OpenMP and MPI are two popular tools that simplify parallelizing both new and legacy software
• Mathematics and Physics
• Computer Science
• Biomedicine
![Page 11: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/11.jpg)
Specific Problem and Background
• Sally3D is a micromagnetics program suite for field analysis and modeling, developed at Politecnico di Torino (Department of Electrical Engineering)
• Computationally intensive (runs can take days of CPU time); a speedup is required
• Previous work did not fully address the problem (no Infiniband or combined OpenMP+MPI solutions)
![Page 12: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/12.jpg)
Strategy
![Page 13: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/13.jpg)
Strategy
• Install a Linux kernel configured ad hoc for scientific computation
• Compile an OpenMP-enabled GCC (OpenMP is supported from 4.3.1 onwards)
• Add the Infiniband link between cluster nodes, with the proper drivers in kernel and user space
• Select an MPI implementation library
![Page 14: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/14.jpg)
Strategy
• Verify the Infiniband network with MPI test examples
• Install the target software
• Insert OpenMP and MPI directives into the code
• Run test cases
![Page 15: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/15.jpg)
OpenMP
• standard
• supported by most modern compilers
• requires little knowledge of the target software
• very simple constructs
![Page 16: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/16.jpg)
OpenMP - example
![Page 17: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/17.jpg)
[Diagram: the code divided into Parallel Task 1, Parallel Task 2, Parallel Task 3, and Parallel Task 4]
OpenMP - example
![Page 18: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/18.jpg)
[Diagram: fork-join model; the master thread forks into Thread A and Thread B, which execute Parallel Tasks 1 through 4 before the join]
![Page 19: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/19.jpg)
OpenMP Scheduler

• Which scheduler is best suited to the hardware?
- Static
- Dynamic
- Guided
![Page 20: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/20.jpg)
OpenMP Scheduler
[Chart: OpenMP static scheduler; execution time in microseconds (0 to 80000) vs. number of threads (1 to 16), for chunk sizes 1, 10, 100, 1000, and 10000]
![Page 21: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/21.jpg)
OpenMP Scheduler
[Chart: OpenMP dynamic scheduler; execution time in microseconds (0 to 117000) vs. number of threads (1 to 16), for chunk sizes 1, 10, 100, 1000, and 10000]
![Page 22: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/22.jpg)
OpenMP Scheduler
[Chart: OpenMP guided scheduler; execution time in microseconds (0 to 80000) vs. number of threads (1 to 16), for chunk sizes 1, 10, 100, 1000, and 10000]
![Page 23: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/23.jpg)
OpenMP Scheduler
![Page 24: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/24.jpg)
OpenMP Scheduler
[Charts comparing the static, dynamic, and guided schedulers side by side]
![Page 25: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/25.jpg)
MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
- OpenMPI
- MVAPICH
![Page 26: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/26.jpg)
Infiniband
• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
![Page 27: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/27.jpg)
[Chart: MPI latency in µs (log scale, 1.0 µs to 10000000.0 µs) vs. message size (1 kB to 16 GB), comparing OpenMPI and Mvapich2]
MPI over Infiniband
![Page 28: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/28.jpg)
MPI over Infiniband
[Chart: MPI latency in µs (log scale, 1.00 µs to 10000000.00 µs) vs. message size (1 kB to 8 MB), comparing OpenMPI and Mvapich2]
![Page 29: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/29.jpg)
Optimizations
• Active at compile time
• Available only after porting the software to standard FORTRAN
• Extensive documentation available
• Unexpectedly positive results
![Page 30: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/30.jpg)
Optimizations
• -march=native
• -O3
• -ffast-math
• -Wl,-O1
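Taken together, the flags above amount to an invocation along these lines (a sketch: the source file name solver.f90 is hypothetical, and -fopenmp is added here only because the build also uses OpenMP):

```shell
# -march=native : generate code tuned to the build machine's CPU
# -O3           : enable aggressive optimizations
# -ffast-math   : relax strict IEEE floating-point semantics for speed
# -Wl,-O1       : pass an optimization flag through to the linker
gfortran -O3 -march=native -ffast-math -Wl,-O1 -fopenmp solver.f90 -o solver
```

Note that -ffast-math trades numerical strictness for speed, so results should be validated against an unoptimized build.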
![Page 31: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/31.jpg)
Target Software
![Page 32: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/32.jpg)
Target Software
• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• program uses linear formulation of mathematical models
![Page 33: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/33.jpg)
[Diagram: the sequential loop of the standard programming model becomes a loop distributed across hosts via MPI, with each host (Host 1, Host 2) running a parallel loop on its own OpenMP threads]
Implementation Scheme
![Page 34: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/34.jpg)
Implementation Scheme
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays; synchronization objects required
➡ send() and recv() mechanism
➡ critical regions using OpenMP directives
➡ function merging
➡ matrix conversion
![Page 35: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/35.jpg)
Results
![Page 36: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/36.jpg)
Results

| OMP | MPI | OPT | seconds |
|-----|-----|-----|---------|
| *   | *   | *   | 133     |
| *   | *   | -   | 400     |
| *   | -   | *   | 186     |
| *   | -   | -   | 487     |
| -   | *   | *   | 200     |
| -   | *   | -   | 792     |
| -   | -   | *   | 246     |
| -   | -   | -   | 1062    |
Total Speed Increase: 87.52%
![Page 37: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/37.jpg)
Actual Results

| OMP | MPI | seconds |
|-----|-----|---------|
| *   | *   | 59      |
| *   | -   | 129     |
| -   | *   | 174     |
| -   | -   | 249     |

| Function name   | Normal | OpenMP | MPI    | OpenMP+MPI |
|-----------------|--------|--------|--------|------------|
| calc_intmudua   | 24.5 s | 4.7 s  | 14.4 s | 2.8 s      |
| calc_hdmg_tet   | 16.9 s | 3.0 s  | 10.8 s | 1.7 s      |
| calc_mudua      | 12.1 s | 1.9 s  | 7.0 s  | 1.1 s      |
| campo_effettivo | 17.7 s | 4.5 s  | 9.9 s  | 2.3 s      |
![Page 38: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/38.jpg)
Actual Results
Total Raw Speed Increment: 76%
• OpenMP: 6x to 8x
• MPI: 2x
• OpenMP + MPI: 14x to 16x
![Page 39: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/39.jpg)
Conclusions
![Page 40: Parallel and Distributed Computing on Low Latency Clusters](https://reader035.vdocuments.pub/reader035/viewer/2022081414/54b6c9004a7959e5268b47ab/html5/thumbnails/40.jpg)
Conclusions and Future Work

• Computational time has been significantly decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG '09
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase cluster size