IMA PPT Tutorial transcript, 7/31/2019

A Parallel Computing Tutorial

Ulrike Meier Yang
Lawrence Livermore National Laboratory
LLNL-PRES-463131
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344
References
Introduction to Parallel Computing, Blaise Barney, https://computing.llnl.gov/tutorials/parallel_comp
A Parallel Multigrid Tutorial, Jim Jones, https://computing.llnl.gov/casc/linear_solvers/present.html
Introduction to Parallel Computing: Design and Analysis of Algorithms, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis
Overview
Basics of Parallel Computing (see Barney)
- Concepts and Terminology
- Computer Architectures
- Programming Models
- Designing Parallel Programs

Parallel algorithms and their implementation
- basic kernels
- Krylov methods
- multigrid
Serial computation
Traditionally, software has been written for serial computation:
- run on a single computer
- instructions are executed one after another
- only one instruction is executed at a time
Von Neumann Architecture
John von Neumann first authored the general requirements for an electronic computer in 1945.

- Data and instructions are stored in memory.
- The control unit fetches instructions and data from memory, decodes the instructions, and then sequentially coordinates operations to accomplish the programmed task.
- The arithmetic unit performs basic arithmetic operations.
- Input/Output is the interface to the human operator.
Parallel Computing
Simultaneous use of multiple compute resources to solve a single problem
Why use parallel computing?
- Save time and/or money
- Enables the solution of larger problems
- Provides concurrency
- Allows use of non-local resources
- Overcomes the limits of serial computing
Concepts and Terminology
Flynn's Taxonomy (1966)

SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data
SISD
- A serial (non-parallel) computer
- Deterministic execution
- Examples: older generation mainframes, workstations, PCs
SIMD
- A type of parallel computer
- All processing units execute the same instruction at any given clock cycle
- Each processing unit can operate on a different data element
- Two varieties: processor arrays and vector pipelines
- Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.
MISD
- A single data stream is fed into multiple processing units.
- Each processing unit operates on the data independently via independent instruction streams.
- Few actual examples: the Carnegie-Mellon C.mmp computer (1971).
MIMD
- Currently the most common type of parallel computer
- Every processor may be executing a different instruction stream
- Every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.

Note: many MIMD architectures also include SIMD execution sub-components.
Parallel Computer Architectures
Shared memory: all processors can access the same memory.

Uniform memory access (UMA):
- identical processors
- equal access and access times to memory
Non-uniform memory access (NUMA):
- not all processors have equal access to all memories
- memory access across a link is slower

Advantages:
- user-friendly programming perspective to memory
- fast and uniform data sharing due to the proximity of memory to CPUs

Disadvantages:
- lack of scalability between memory and CPUs
- programmer is responsible for ensuring "correct" access of global memory
- expense
Distributed memory
Distributed memory systems require a communication network to connect inter-processor memory.

Advantages:
- memory is scalable with the number of processors
- no memory interference or overhead for trying to keep cache coherency
- cost effective

Disadvantages:
- programmer is responsible for data communication between processors
- difficult to map existing data structures to this memory organization
Hybrid Distributed-Shared Memory

Generally used for the currently largest and fastest computers.

Has a mixture of the previously mentioned advantages and disadvantages.
Parallel programming models
Shared memory
Threads
Message Passing
Data Parallel
Hybrid
All of these can be implemented on any architecture.
Shared memory
Tasks share a common address space, which they read and write asynchronously.

Various mechanisms such as locks / semaphores may be used to control access to the shared memory.

Advantage: no need to explicitly communicate data between tasks, which simplifies programming.

Disadvantages:
- need to take care when managing memory and avoid synchronization conflicts
- harder to control data locality
Threads
A thread can be considered as a subroutine in the main program.

Threads communicate with each other through the global memory.

Commonly associated with shared memory architectures and operating systems.

Implementations: POSIX Threads (pthreads), OpenMP
Message Passing
A set of tasks that use their own local memory during computation.

Data is exchanged by sending and receiving messages.

Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

MPI (released in 1994)
MPI-2 (released in 1996)
Data Parallel
The data parallel model demonstrates the following characteristics:
- Most of the parallel work performs operations on a data set, organized into a common structure such as an array
- A set of tasks works collectively on the same data structure, with each task working on a different partition
- Tasks perform the same operation on their partition

On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task.
Other Models
Hybrid
- combines various models, e.g. MPI/OpenMP
Single Program Multiple Data (SPMD)
- A single program is executed by all tasks simultaneously

Multiple Program Multiple Data (MPMD)
- An MPMD application has multiple executables. Each task can execute the same or a different program as other tasks.
Designing Parallel Programs
Examine the problem:
- Can the problem be parallelized?
- Are there data dependencies?
- Where is most of the work done?
- Identify bottlenecks (e.g. I/O)

Partitioning:
- How should the data be decomposed?
Various partitionings
How should the algorithm be distributed?
Communications
Types of communication:
- point-to-point
- collective
Synchronization types
Barrier
- Each task performs its work until it reaches the barrier. It then stops, or "blocks".
- When the last task reaches the barrier, all tasks are synchronized.

Lock / semaphore
- The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected data or code.
- Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.
- Can be blocking or non-blocking
Load balancing
Keep all tasks busy all of the time. Minimize idle time. The slowest task will determine the overall performance.
Granularity
In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.

Fine-grain parallelism:
- Low computation to communication ratio
- Facilitates load balancing
- Implies high communication overhead and less opportunity for performance enhancement

Coarse-grain parallelism:
- High computation to communication ratio
- Implies more opportunity for performance increase
- Harder to load balance efficiently
Parallel Computing Performance Metrics
Let T(n,p) be the time to solve a problem of size n using p processors.

- Speedup: S(n,p) = T(n,1)/T(n,p)
- Efficiency: E(n,p) = S(n,p)/p
- Scaled efficiency: SE(n,p) = T(n,1)/T(pn,p)

An algorithm is scalable if there exists C > 0 with SE(n,p) ≥ C for all p.
Strong Scalability
Increasing the number of processors for a fixed problem size.

(Figures, for 1-8 processors: speedup of the method vs. perfect speedup; run times in seconds; efficiency.)
Weak scalability
Increasing the number of processors with a fixed problem size per processor.

(Figures, for 1-8 processors: run times in seconds; scaled efficiency.)
Amdahl's Law

Maximal speedup = 1/(1-P), where P is the parallel portion of the code.
Amdahl's Law

Speedup = 1/((P/N)+S), where N is the number of processors and S the serial portion of the code.

Speedup:

        N    P=0.50   P=0.90   P=0.99
       10      1.82     5.26     9.17
      100      1.98     9.17    50.25
     1000      1.99     9.91    90.99
    10000      1.99     9.91    99.02
   100000      1.99     9.99    99.90
Performance Model for distributed memory

Time for n floating point operations: tcomp = fn, where 1/f is the Mflop rate.

Time for communicating n words: tcomm = α + βn, where α is the latency and 1/β the bandwidth.

Total time: tcomp + tcomm
Parallel Algorithms and Implementations
- basic kernels
- Krylov methods
- multigrid
Vector Operations
Adding two vectors: z = x + y
Axpy: y = y + αx
Vector products: s = xᵀy

Vector addition, partitioned across processors (no communication needed):
  Proc 0: z0 = x0 + y0
  Proc 1: z1 = x1 + y1
  Proc 2: z2 = x2 + y2

Inner product s = xᵀy:
  Proc 0: s0 = x0ᵀy0
  Proc 1: s1 = x1ᵀy1
  Proc 2: s2 = x2ᵀy2
followed by a global reduction: s = s0 + s1 + s2
Dense n x n Matrix Vector Multiplication
Matrix size n x n, p processors; the matrix is partitioned by rows across Proc 0, ..., Proc p-1, and y = Ax is computed:

T = 2(n^2/p)f + (p-1)(α + βn/p)

(2n^2/p flops per processor; each processor receives the remote pieces of x as p-1 messages of n/p words.)
Sparse Matrix Vector Multiplication
Matrix of size n^2 x n^2, p^2 processors

Consider the graph of the matrix.

The partitioning influences the run time:
T1 ≈ 10(n^2/p^2)f + 2(α + βn)   vs.   T2 ≈ 10(n^2/p^2)f + 4(α + βn/p)
Effect of Partitioning on Performance
(Figure: speedup vs. number of processors (1-15): perfect speedup, MPI with an optimized partitioning, and MPI with a non-optimized partitioning.)
Coordinate Format (COO)
real Data(nnz), integer Row(nnz), integer Col(nnz)

Matrix-Vector multiplication:

for (k=0; k<nnz; k++)
   y[Row[k]] += Data[k]*x[Col[k]];
Sparse Matrix Vector Multiply with CSR Data Structure
Compressed sparse row (CSR): real Data(nnz), integer Rowptr(n+1), integer Col(nnz)

Matrix-Vector multiplication:

for (i=0; i<n; i++)
   for (k=Rowptr[i]; k<Rowptr[i+1]; k++)
      y[i] += Data[k]*x[Col[k]];
Sparse Matrix Vector Multiply with CSC Data Structure
Compressed sparse column (CSC): real Data(nnz), integer Colptr(n+1), integer Row(nnz)

Matrix-Vector multiplication:

for (j=0; j<n; j++)
   for (k=Colptr[j]; k<Colptr[j+1]; k++)
      y[Row[k]] += Data[k]*x[j];
Parallel Matvec, CSR vs. CSC

CSR (partition by rows): each processor owns a block of rows A_i and computes its piece of the result directly,
   y0 = A0 x,  y1 = A1 x,  y2 = A2 x,
after gathering the remote pieces of x that its rows need.

CSC (partition by columns): each processor owns a block of columns A_i and computes a full-length partial result A_i x_i from its local x_i alone; the partial results are then summed:
   y = A0 x0 + A1 x1 + A2 x2
Sparse Matrix Vector Multiply with Diagonal Storage
Diagonal storage: real Data(n,m), integer offset(m)

Matrix-Vector multiplication:

for (i=0; i<n; i++)
   for (k=0; k<m; k++)
   {
      j = i + offset[k];
      if (j >= 0 && j < n)
         y[i] += Data[i][k]*x[j];
   }
Other Data Structures
Block data structures (BCSR, BCSC): block size b x b, n divisible by b
real Data(nnb*b*b), integer Ptr(n/b+1), integer Index(nnb)

Based on space filling curves (e.g. Hilbert curves, Z-order):
integer IJ(2*nnz), real Data(nnz)

Matrix-Vector multiplication (one possible form, with the nonzeros stored as (i,j) pairs in curve order):

for (k=0; k<nnz; k++)
   y[IJ[2*k]] += Data[k]*x[IJ[2*k+1]];
Data Structures for Distributed Memory

Concept for MATAIJ (PETSc) and ParCSR (hypre): the matrix is distributed in contiguous blocks of rows across Proc 0, Proc 1, Proc 2, ...

Each processor stores:
- a local CSR matrix (the columns owned by the processor),
- a nonlocal CSR matrix (the remaining columns),
- a mapping from local to global indices (e.g. nonlocal columns 0, 2, 3, 8, 10 out of the global indices 0-10).
Communication Information
- Neighborhood information (number of neighbors)
- Amount of data to be sent and/or received

Global partitioning for p processors: every processor stores all row ranges r0, r1, r2, ..., rp.
Not good on very large numbers of processors!
Assumed partition (AP) algorithm
Answering global distribution questions previously required O(P) storage and computations.

On BG/L, O(P) storage may not be possible.

The new algorithm requires O(1) storage and O(log P) computations.

Enables scaling to 100K+ processors.

The AP idea has general applicability beyond hypre.

Assumed partition function: global index i is assumed to belong to processor p = (iP)/N, for P processors and global problem size N. The actual partition info is sent to the assumed-partition processors, forming a distributed directory. (Example: data owned by processor 3 may lie in processor 4's assumed partition.)
The AP algorithm is more challenging for structured AMR grids

- AMR can produce grids with gaps
- Our AP function accounts for these gaps for scalability
- Demonstrated on 32K procs of BG/L
- A simple, naive AP function leaves processors with empty partitions
Effect of AP on one of hypre's multigrid solvers

(Figure: run times in seconds on up to 64K procs of BG/P, comparing global partitioning vs. assumed partitioning.)
Machine specification: Hera (LLNL)

- Multi-core / multi-socket cluster
- 864 diskless nodes interconnected by Quad Data Rate (QDR) InfiniBand
- AMD Opteron quad-core (2.3 GHz)
- Fat-tree network
Multicore cluster details (Hera):
- Individual 512 KB L2 cache for each core
- 2 MB L3 cache shared by 4 cores
- 4 sockets per node, 16 cores sharing the 32 GB main memory
- NUMA memory access

MxV performance, 2-16 cores, OpenMP vs. MPI
7-pt stencil:

(Figures: Mflops vs. matrix size (up to 400000) for OpenMP and for MPI, using 1, 2, 4, 8, 12, and 16 cores.)
MxV speedup for various matrix sizes, 7-pt stencil

(Figures: speedup vs. number of MPI tasks and vs. number of OpenMP threads (1-16), for matrix sizes 512, 8000, 21952, 32768, 64000, 110592, and 512000, compared with perfect speedup.)
Iterative linear solvers
Krylov solvers:
- Conjugate gradient
- GMRES
- BiCGSTAB

They generally consist of basic linear algebra kernels.
Conjugate Gradient for solving Ax=b
k = 0;  x_0 = 0;  r_0 = b;  γ_0 = r_0ᵀ r_0;
while (γ_k > tolerance)
    Solve M z_k = r_k;
    δ_k = r_kᵀ z_k;
    if (k = 0)  p_1 = z_0  else  p_{k+1} = z_k + (δ_k / δ_{k-1}) p_k;
    k = k + 1;
    α_k = δ_{k-1} / (p_kᵀ A p_k);
    x_k = x_{k-1} + α_k p_k;
    r_k = r_{k-1} - α_k A p_k;
    γ_k = r_kᵀ r_k;
end
x = x_k;
Weak scalability of diagonally scaled CG

(Figures: total times for 100 iterations in seconds, and number of iterations, vs. number of processors, up to 120 processors.)
Achieve algorithmic scalability through multigrid
(Figure: smooth on the finest grid, restrict to the first coarse grid, then interpolate back.)
Multigrid parallelized by domain partitioning
- Partition the fine grid
- Interpolation is fully parallel
- Smoothing
Algebraic Multigrid

Setup Phase:
- Select coarse grids
- Define interpolation P(m), m = 1, 2, ...
- Define restriction and coarse-grid operators:
  R(m) = (P(m))^T,  A(m+1) = R(m) A(m) P(m)

Solve Phase:
- Relax A(m) u_m ≈ f_m
- Compute r_m = f_m - A(m) u_m
- Restrict r_{m+1} = R(m) r_m
- Solve A(m+1) e_{m+1} = r_{m+1}
- Interpolate e_m = P(m) e_{m+1}
- Correct u_m = u_m + e_m
- Relax A(m) u_m ≈ f_m
How about scalability of the multigrid V-cycle?

Time for relaxation on the finest grid, for n x n points per proc:
T0 ≈ 4α + 4βn + 5n^2 f

Time for relaxation on level k:
Tk ≈ 4α + 4β(n/2^k) + 5(n/2^k)^2 f

Total time for the V-cycle:
TV ≈ 2 Σ_k (4α + 4β(n/2^k) + 5(n/2^k)^2 f)
   ≈ 8(1+1+1+...)α + 8(1 + 1/2 + 1/4 + ...)βn + 10(1 + 1/4 + 1/16 + ...)n^2 f
   ≈ 8Lα + 16βn + (40/3)n^2 f,   with L ≈ log2(pn) the number of levels

lim p→∞ TV(n,1)/TV(pn,p) = O(1/log2(p))
How about scalability of the multigrid W-cycle?

Total time for the W-cycle:
TW ≈ 2 Σ_k 2^k (4α + 4β(n/2^k) + 5(n/2^k)^2 f)
   ≈ 8(1+2+4+...)α + 8(1 + 1 + 1 + ...)βn + 10(1 + 1/2 + 1/4 + ...)n^2 f
   ≈ 8(2^(L+1) - 1)α + 16Lβn + 20n^2 f
   ≈ 8(2pn - 1)α + 16 log2(pn) βn + 20n^2 f,   with L ≈ log2(pn)
Smoothers
Smoothers significantly affect multigrid convergence and runtime.

Solving: Ax = b, A SPD

Smoothing: x_{n+1} = x_n + M^{-1}(b - A x_n)

Smoothers:
- Jacobi: M = D, easy to parallelize
    x_{n+1}(i) = A(i,i)^{-1} ( b(i) - Σ_{j≠i} A(i,j) x_n(j) ),   i = 0, ..., n-1
- Gauss-Seidel: M = D + L, highly sequential!!
    x_{n+1}(i) = A(i,i)^{-1} ( b(i) - Σ_{j<i} A(i,j) x_{n+1}(j) - Σ_{j>i} A(i,j) x_n(j) ),   i = 0, ..., n-1
Parallel Smoothers
- Red-Black Gauss-Seidel
- Multi-color Gauss-Seidel
Hybrid smoothers
M = diag(M_1, ..., M_p): a block-diagonal smoother where processor (or thread) k applies a local smoother M_k to its diagonal block, with D_k and L_k the diagonal and strictly lower triangular parts of that block.

- Hybrid Gauss-Seidel: M_k = D_k + L_k
- Hybrid SOR: M_k = (1/ω)(D_k + ω L_k)
- Hybrid symmetric Gauss-Seidel: M_k = (D_k + L_k) D_k^{-1} (D_k + L_k^T)

Can be used for distributed as well as shared-memory parallel computing.
Needs to be used with care; might require a smoothing (relaxation) parameter.

Corresponding block schemes: Schwarz solvers, ILU, etc.
How is the smoother threaded?
- Matrices are distributed across p processors in contiguous blocks of rows: A = (A_1; A_2; ...; A_p)
- The smoother is threaded such that each thread works on a contiguous subset of the rows: A_p = (t_1; t_2; t_3; t_4)
- Hybrid: Gauss-Seidel within each thread, Jacobi on thread boundaries
FEM application: 2D Square
- The application partitions the grid element-wise into nice subdomains for each processor (16 procs; colors in the figure indicate element numbering)
- AMG determines thread subdomains by splitting nodes (not elements)
- The application's numbering of the elements/nodes becomes relevant (i.e., the threaded domains depend on the ordering)
We want to use OpenMP on multicore machines...
Hybrid-GS is not guaranteed to converge, but in practice it is effective for diffusion problems when subdomains are well-defined.

Some options:
1. The application orders the unknowns in each processor subdomain nicely
   - may be more difficult due to refinement
   - may impose a non-trivial burden on the application
2. AMG reorders the unknowns (rows): non-trivial, costly
3. Use smoothers that don't depend on the parallelism
Coarsening algorithms
- Original coarsening algorithm
- Parallel coarsening algorithms
Choosing the Coarse Grid (Classical AMG)

- Add measures (the number of strong influences of each point)
- Pick a point with maximal measure and make it a C-point
- Make its neighbors F-points
- Increase the measures of neighbors of new F-points by the number of new F-neighbors
- Repeat until finished
Parallel Coarsening Algorithms
Perform the sequential algorithm on each processor, then deal with the processor boundaries.
PMIS: start

- select C-pts with maximal measure locally
- select neighbors as F-pts

Measures:
3 5 5 5 5 5 3
5 8 8 8 8 8 5
5 8 8 8 8 8 5
5 8 8 8 8 8 5
5 8 8 8 8 8 5
5 8 8 8 8 8 5
3 5 5 5 5 5 3
PMIS: add random numbers

- select C-pts with maximal measure locally
- select neighbors as F-pts

3.7 5.3 5.0 5.9 5.4 5.3 3.4
5.2 8.0 8.5 8.2 8.6 8.2 5.1
5.9 8.1 8.8 8.9 8.4 8.9 5.9
5.7 8.6 8.3 8.8 8.3 8.1 5.0
5.3 8.7 8.3 8.4 8.3 8.8 5.9
5.0 8.8 8.5 8.6 8.7 8.9 5.3
3.2 5.6 5.8 5.6 5.9 5.9 3.0
PMIS: exchange neighbor information
PMIS: select

- select C-pts with maximal measure locally
- make neighbors F-pts

(Figure: the local maxima, the 8.8 and 8.9 entries, become C-points; their neighbors become F-points.)
PMIS: update 1

- select C-pts with maximal measure locally
- make neighbors F-pts (requires neighbor info)

(Figure: the grid after the first selection; only the not-yet-decided points keep their measures.)
PMIS: select

- select C-pts with maximal measure locally
- make neighbors F-pts

(Figure: among the remaining points, local maxima such as 8.6 become C-points; their neighbors become F-points.)
PMIS: update 2

- select C-pts with maximal measure locally
- make neighbors F-pts

(Figure: only a few undecided points remain.)
PMIS: final grid
- select C-pts with maximal measure locally
- make neighbors F-pts
- remove neighbor edges
-
7/31/2019 IMA PPtTutorial
90/104
90Option:UCRL# Option:Additional Information
Lawrence Livermore National Laboratory
Interpolation
Nearest-neighbor interpolation is straightforward to parallelize
Long-range interpolation is harder to parallelize
[Figure: interpolation stencil with points labeled i, j, k, l]
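For the nearest-neighbor case, one common variant is direct interpolation, where an F-point interpolates only from its C-point neighbors, so in parallel only one ghost layer is needed. A minimal sketch, assuming a dict-based sparse row; the function name and data layout are illustrative, not the slides' implementation:

```python
def direct_interp_weights(A_row, i, C_nbrs):
    """Sketch of direct (distance-1) interpolation weights for one F-point i.
    A_row  : dict j -> a_ij, the sparse matrix row of i (including diagonal)
    C_nbrs : the C-point neighbors of i"""
    a_ii = A_row[i]
    nbr_sum = sum(a for j, a in A_row.items() if j != i)  # all neighbors
    c_sum = sum(A_row[j] for j in C_nbrs)                 # C-neighbors only
    alpha = nbr_sum / c_sum   # rescale so the full row strength is preserved
    return {j: -alpha * A_row[j] / a_ii for j in C_nbrs}
```

For a 1D Laplace row (a_ii = 2, neighbors -1, both neighbors C-points) this yields the expected weights 1/2 and 1/2. Long-range (e.g. distance-2) interpolation needs rows and C/F information from two layers away, hence the larger communication patterns shown on the following slides.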
Communication patterns 128 procs Level 0
[Figure: communication patterns on 128 procs, level 0 — setup phase (left) vs. solve phase (right)]
Communication patterns, 128 procs, setup: distance 1 vs. distance 2 interpolation
[Figure: distance 1, level 4 (left) vs. distance 2, level 4 (right)]
Communication patterns, 128 procs, solve: distance 1 vs. distance 2 interpolation
[Figure: distance 1, level 4 (left) vs. distance 2, level 4 (right)]
AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)
[Figure: setup times (left) and solve times (right) in seconds (0-20) vs. number of cores (0-3000), for Hera-1, Hera-2, BGL-1, BGL-2]
Distance 1 vs. distance 2 interpolation on different computer architectures
AMG-GMRES(10), 7pt 3D Laplace problem on a unit cube, using 50x50x25 points per processor (100x100x100 per node on Hera)
[Figure: total times in seconds (0-30) vs. number of cores (0-3000), for Hera-1, Hera-2, BGL-1, BGL-2]
AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)
[Figure: setup times (left, 0-35 s) and solve cycle times (right, 0-0.4 s) vs. number of cores (0-10000), for MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, MCSUP 1x16]
AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)
[Figure: total times in seconds (0-50) vs. number of cores (0-12000), for MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, MCSUP 1x16]
Blue Gene/P Solution Intrepid, Argonne National Laboratory
40 racks with 1024 compute nodes each
Quad-core 850 MHz PowerPC 450 processor
163,840 cores
2 GB main memory per node, shared by all cores
3D torus network; isolated dedicated partitions
[Image: http://upload.wikimedia.org/wikipedia/commons/d/d3/IBM_Blue_Gene_P_supercomputer.jpg]
AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)
[Figure: setup times (left, 0-6 s) and cycle times (right, 0-0.3 s) vs. number of cores (0-100000), for smp, dual, vn, vn-omp and H1x4, H2x2, MPI, H4x1]
AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)
[Figure: total times in seconds (0-16) vs. number of procs (0-100000), for H1x4, H2x2, MPI, H4x1]
No. of iterations:
No. of cores   H1x4   H2x2   MPI
128             17     17     18
1024            20     20     22
8192            25     25     25
27648           29     27     28
65536           30     33     29
128000          37     36     36
Cray XT-5 Jaguar-PF, Oak Ridge National Laboratory
18,688 nodes in 200 cabinets
Two sockets with AMD Opteron hex-core processor per node
16 GB main memory per node
Network: 3D torus/mesh with wrap-around links (torus-like) in two dimensions and without such links in the remaining dimension
Compute Node Linux: limited set of services, but reduced noise ratio
PGI's C compiler (v. 9.0-4) with and without OpenMP; OpenMP settings: -mp=numa (preemptive thread pinning, localized memory allocations)
AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)
[Figure: setup times (left, 0-10 s) and cycle times (right, 0-0.25 s) vs. number of cores (0-200000), for MPI, H12x1, H1x12, H4x3, H2x6]
AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)
[Figure: total times in seconds (0-16) vs. number of cores (0-200000), for MPI, H1x12, H4x3, H2x6]
Thank You!
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.