Parallel Textbook Multigrid Efficiency - Uli Rüde
Parallel Textbook Multigrid Efficiency
Lehrstuhl für Simulation FAU Erlangen-Nürnberg
U. Rüde (FAU Erlangen, [email protected]) joint work with B. Gmeiner, M. Huber, H. Stengel, H. Köstler (FAU)
C. Waluga, M. Huber, L. John, B. Wohlmuth (TUM) M. Mohr, S. Bauer, J. Weismüller, P. Bunge (LMU)
TERRA NEO
TERRA
Fourteenth Copper Mountain Conference on Iterative Methods
March 20 - March 25, 2016
Two questions:
What is the minimal cost of solving a PDE? (such as Poisson's or Stokes' equation in 3D)
Asymptotic results of the form Cost ≤ C·N^p (Flops) with an unspecified constant C are inadequate to predict the real performance; Flops as a metric become questionable.
How do we quantify true cost, i.e. the resources needed?
Number of flops? Memory consumption? Memory bandwidth (aggregated)? Communication bandwidth? Communication latency? Power consumption?
Two Multi-PetaFlops Supercomputers

JUQUEEN:
Blue Gene/Q architecture, 458,752 PowerPC A2 cores, 16 cores (1.6 GHz) per node, 16 GiB RAM per node, 5D torus interconnect, 5.8 PFlops peak, 448 TByte memory, TOP500: #11

SuperMUC:
Intel Xeon architecture, 147,456 cores, 16 cores (2.7 GHz) per node, 32 GiB RAM per node, pruned tree interconnect, 3.2 PFlops peak, TOP500: #23
How big a PDE problem can we solve?
400 TByte main memory = 4·10^14 Bytes = 5 vectors, each with 10^13 elements (8 Bytes = one double-precision value)
Even with a sparse matrix format, storing a matrix of dimension 10^13 is not possible on JUQUEEN
⇒ matrix-free implementation necessary
Which algorithm? Multigrid: asymptotically optimal complexity, Cost = C·N with C „moderate"
Does it parallelize well? What overhead?
Annotation: equality Cost = C·N, not merely an upper bound „≦"
Energy

computer generation | gigascale: 10^9 FLOPS | terascale: 10^12 FLOPS | petascale: 10^15 FLOPS | exascale: 10^18 FLOPS
desired problem size N (DoF) | 10^6 | 10^9 | 10^12 | 10^15
energy estimate, 1 nJoule × N^2 (all-to-all communication) | 0.278 Wh (10 min of LED light) | 278 kWh (2 weeks of blow-drying hair) | 278 GWh (1 month of electricity for Denver) | 278 PWh (100 years of world electricity production)
TerraNeo prototype (kWh) | 0.13 Wh | 0.03 kWh | 27 kWh | ?
Only optimal algorithms qualify
At extreme scale: optimal complexity is a must!
Hierarchical Hybrid Grids (HHG)
Parallelize „plain vanilla" multigrid for tetrahedral finite elements:
partition the domain, parallelize all operations on all grids, use clever data structures, matrix-free implementation
Do not worry (so much) about coarse grids: idle processors? short messages? sequential dependency in the grid hierarchy?
Elliptic problems always require global communication. This cannot be accomplished by local relaxation or Krylov space acceleration or domain decomposition without a coarse grid.
Bey's tetrahedral refinement
B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: „Is 1.7·10^10 unknowns the largest finite element system that can be solved today?", SuperComputing 2005 (ISC Award 2006).
HHG: A modern architecture for FE computations
Geometrical Hierarchy: Volume, Face, Edge, Vertex
Copying to update ghost points: Vertex, Edge, Interior
[Figure: grid points of a vertex, an edge, and a triangle interior in their linearized memory representation; an MPI message updates the ghost points.]
Towards a holistic performance engineering methodology: Parallel Textbook Multigrid Efficiency
Brandt, A. (1998). Barriers to achieving textbook multigrid efficiency (TME) in CFD.
Thomas, J. L., Diskin, B., & Brandt, A. (2003). Textbook multigrid efficiency for fluid simulations. Annual Review of Fluid Mechanics, 35(1), 317-340.
Gmeiner, B., Rüde, U., Stengel, H., Waluga, C., & Wohlmuth, B. (2015). Towards textbook efficiency for parallel multigrid. Numerical Mathematics: Theory, Methods and Applications.
Textbook Multigrid Efficiency (TME)
„Textbook multigrid efficiency means solving a discrete PDE problem with a computational effort that is only a small (less than 10) multiple of the operation count associated with the discretized equations itself." [Brandt, 1998]
work unit (WU) = single elementary relaxation
classical algorithmic TME factor: ops for solution / ops for one work unit
new parallel TME factor to quantify algorithmic efficiency combined with implementation scalability
Textbook Multigrid Efficiency (TME)
Full multigrid method (FMG) with linear tetrahedral elements: one V(2,2)-cycle with SOR smoother on each new level reduces the algebraic error to 70% of the discretization error. Textbook efficiency for the 3D Poisson problem: 6.5 WU.
For simplicity, the computational domain in all our computations is chosen as the unit cube Ω = (0,1)^3 if not mentioned otherwise. For scaling and performance studies on more complex domains, we refer to our work [18], where spherical shells and complex networks of channels were considered using the same version of the HHG software framework. For such geometries the performance was observed to deteriorate by some ten percent compared to the cube domain.

Our first experiment investigates the performance of different multigrid settings, varying the number of smoothing steps, the type of cycle, and the use of over-relaxation in the smoother. We first consider the scalar case (2.1) with a constant coefficient, i.e., k = 1, and an exact solution given as

  u(x) = sin(2x_1) sin(4x_2) sin(16x_3).   (CC)
The boundary data is chosen according to the exact solution, and the source term f on the right-hand side is constructed by evaluating the strong form of the operator on the left-hand side of (2.1). In Figure 4, we display errors and residuals as functions of computational cost measured in work units.

[Figure 4: Error (left, scaled Euclidean norm) and residual (right, Euclidean norm) vs. work units for (CC) on a resolution of 8.6·10^9 grid points; curves for V(1,1)-cycle, V(3,3)-cycle, V(3,3)-cycle (SOR), and W(3,3)-cycle (SOR).]

We note that in the pre-asymptotic stage, the W(3,3) cycle with an over-relaxation factor of ω = 1.3 produces the most efficient error reduction, but interestingly this is not the most efficient method when considering the convergence of the residual. In this respect, the V(3,3) cycles with a mesh-independent over-relaxation factor of again ω = 1.3 in (2.3) are found to perform best. All methods require about 30-50 operator evaluations (WU) to converge to the truncation error of approximately 10^-6. Clearly, we have not yet reached the 10 WU required for TME, but since we solve systems as large as 8.6·10^9 unknowns, we believe that this is already in an acceptable range.
To achieve full TME, we next consider an FMG algorithm. To evaluate different variants of FMG, Table 1 lists the ratio

  δ_l = ‖u − ũ_l‖ / ‖u − u_l‖,   (3.1)

between the total error ‖u − ũ_l‖ obtained by the FMG and the discretization error ‖u − u_l‖, where ũ_l denotes the result obtained with FMG. For the first row of Table 1 there is only one
Residual reduction for multigrid V- and W-cycles
vs. work units (WU)
Yavneh, I. (1996). On red-black SOR smoothing in multigrid. SIAM Journal on Scientific Computing, 17(1), 180-192.
Gmeiner, B., Huber, M., John, L., Rüde, U., & Wohlmuth, B. (2015). A quantitative performance analysis for Stokes solvers at the extreme scale. arXiv:1511.02134, submitted.
Extended TME paradigm for parallel performance analysis
Analyse the cost of an elementary relaxation: micro-kernel benchmarks for the smoother
• optimize node-level performance
• justify/remove any performance degradation
The aggregate performance of the whole parallel system defines a WU.
Measure the parallel solver performance and compute the TME factor as an overall measure of efficiency:
• analyse discrepancies
• identify possible improvements
Optimize the TME factor.
E_ParTME(N, U) = T(N, U) / T_WU(N, U) = T(N, U) · U·μ_sm / N,  where  T_WU(N, U) = N / (U·μ_sm)
Parallel TME
μ_sm = number of elementary relaxation steps per second on a single core
U = number of cores
U·μ_sm = aggregate peak relaxation performance
T(N, U) = time to solution for N unknowns on U cores
T_WU(N, U) = idealized time for a work unit
The parallel textbook efficiency factor E_ParTME combines algorithmic and implementation efficiency.
TME Efficiency Analysis: RB-GS Smoother
for (int i = 1; i < (tsize-j-k-1); i = i+2) {        // red-black: stride-2 over the row
  u[mp_mr+i] = c[0] * ( - c[1] *u[mp_mr+i+1]         // middle plane, middle row
                        - c[2] *u[mp_tr+i-1] - c[3] *u[mp_tr+i]    // middle plane, top row
                        - c[4] *u[tp_br+i]   - c[5] *u[tp_br+i+1]  // top plane, bottom row
                        - c[6] *u[tp_mr+i-1] - c[7] *u[tp_mr+i]    // top plane, middle row
                        - c[8] *u[bp_mr+i]   - c[9] *u[bp_mr+i+1]  // bottom plane, middle row
                        - c[10]*u[bp_tr+i-1] - c[11]*u[bp_tr+i]    // bottom plane, top row
                        - c[12]*u[mp_br+i]   - c[13]*u[mp_br+i+1]  // middle plane, bottom row
                        - c[14]*u[mp_mr+i-1]
                        + f[mp_mr+i] );              // 15-point stencil update
}
This loop should execute on a single SuperMUC core at 720 M updates/sec (in theory: peak performance), but achieves 176 M updates/sec (in practice: memory access bottleneck; the red-black ordering prohibits vector loads).
Thus the whole SuperMUC should perform 147,456 × 176 M ≈ 26 T updates/sec.
[Figure: row pointers tp_br, tp_mr, mp_tr, mp_mr, mp_br, bp_tr, bp_mr mark the rows of the top, middle, and bottom planes accessed by the stencil.]
Execution-Cache-Memory Model (ECM)
ECM model for the 15-point stencil on an SNB (Sandy Bridge) core. An arrow indicates a 64-Byte cache line transfer; run-times represent 8 elementary updates.
Hager et al. Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience, 2013.
[Figure 5: ECM model for the 15-point stencil with variable coefficients (left) and constant coefficients (right) on an SNB core. An arrow indicates a 64-Byte cache line transfer; run-times represent 8 elementary updates.]
Since the performance predictions obtained as above by the roofline model are unsatisfactory, we must proceed to develop a more accurate analysis. To this end, a more careful inspection of the code reveals the following: the vectorization report of the Intel compiler confirms that the innermost loop of the stencil code is optimally executed. However, the roofline model in its simplest setting does not account for possible non-optimal instruction code scheduling. Also, the run-time contributions of transfers inside the cache hierarchy are not taken into account. To include these effects, we employ the Execution-Cache-Memory (ECM) model of [22, 33], and we refer to [18] for a thorough description of the modeling process for this kernel.

An overview of the input parameters to the model is displayed in Figure 5. Different assumptions about which operations can be overlapped lead to a corridor of performance predictions of 159-189 MLups/s for a single core. The obtained measurements are within this range, indicating that the micro-architecture is capable of overlapping at least some of the operations (cf. Figure 6). We also note that the (CC) kernel cannot saturate the maximal memory bandwidth of 42 GB/s with 8 cores. A comparison of the different run-time contributions shows that the single-core performance is in fact dominated by code execution and that, in contrast to the initial analysis, instruction throughput and not memory bandwidth is the critical resource. The poor predictive results obtained with the simple model are mainly caused by an over-optimistic assumption for the peak performance value that was based on FLOPS throughput alone. For the (CC) smoother, additional instruction overhead has to be considered. We remark that this is caused by the red-black pattern of updates, which requires a high number of logistic machine instructions before vectorized operations can be used. In such cases, a thorough understanding of the performance depends on a careful code analysis. As demonstrated, this leads to a realistic prediction interval for the single-core performance.

We can additionally conclude that code optimization strategies that try to speed up the execution of a WU by reducing the memory bandwidth requirements, in particular cache
TME and Parallel TME results
Setting/Measure | E_TME | E_SerTME | E_NodeTME | E_ParTME1 | E_ParTME2
Grid points | - | 2·10^6 | 3·10^7 | 9·10^9 | 2·10^11
Processor cores U | - | 1 | 16 | 4,096 | 16,384
(CC) - FMG(2,2) | 6.5 | 15 | 22 | 26 | 22
(VC) - FMG(2,2) | 6.5 | 11 | 13 | 15 | 13
(SF) - FMG(2,1) | 31 | 64 | 100 | 118 | -

Table 4: TME factors for different problem sizes and discretizations.
Table 4 lists the efficiencies for the (CC), (VC), as well as the (SF) case. In the latter case, we employ the FMG-4CG-1V(2,1) pressure correction cycle. All cycles are chosen such that they are as cheap as possible, but still keep the ratio δ_l ≈ 2. We observe a larger gap between the E_TME and the new E_ParTME for (CC) and (SF) compared to the previous cases. We also observe the effect that traversing the coarser grids of the multigrid hierarchy is less efficient than processing the large grids on the finest level. This is particularly true because of communication overhead and the deteriorating surface-to-volume ratio on coarser grids, but also because of loop startup overhead. When there are fewer operations, the loop startup time is less well amortized.

The case (VC) exhibits a better relative performance, since time is here spent in the more complex smoothing and residual computations. However, for the quality of an implementation it is crucial to show that the WU is implemented as time-efficiently as possible for a given problem. We remark that a good parallel efficiency by itself has limited value when the serial performance of the algorithm is slow. In this sense, multigrid may exhibit a worse parallel efficiency than other solvers while being superior in terms of absolute execution time for a given problem and a given computer architecture.

E_SerTME and E_NodeTME are measured using only one core (U = 1) and one node (U = 16), respectively. We observe that the transition from one core to one node causes a similar overhead as the transition from one node to 256 nodes (E_ParTME1). For the (CC) case, roughly a factor of two lies between the TME and the serial implementation, and a further factor of two between the serial and the parallel execution. To interpret these results, recall that TME and ParTME would only coincide if there were no inter-grid transfers, if the loops on coarser grids were executed as efficiently as on the finest mesh, if there were no communication overhead, and if no additional copy operations or other program overhead occurred.

For the last setup, E_ParTME2, we increase the number of grid points per core to the maximal capacity of the main memory. This also provides a better (lower) surface-to-volume ratio and consequently decreases the E_ParTME even for more compute cores.

The HHG multigrid implementation is already designed to minimize these overheads so that we can solve very large finite element systems. The largest Stokes system presented in Table 4 is solved on 4,096 cores and has 3.6·10^10 degrees of freedom. Note that though this compares favorably with some of the largest FE computations to date, it is already achieved on a quite small partition of the SuperMUC cluster, which has in total 155,656 cores. For scaling results to even larger computations with more than 10^12 degrees of freedom we refer to [18, 19].
Three model problems: scalar constant (CC) and scalar variable (VC) coefficients; Stokes solved via Schur complement (SF)
Full multigrid with the number of iterations chosen such that asymptotic optimality is maintained
TME = 6.5 (algorithmically, for the scalar cases); ParTME around 20 for the scalar PDEs, and ≥ 100 for Stokes
Application to Earth Mantle Convection Models
Ghattas, O., et al.: An Extreme-Scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth's Mantle. Gordon Bell Prize 2015.
Weismüller, J., Gmeiner, B., Ghelichkhan, S., Huber, M., John, L., Wohlmuth, B., UR, Bunge, H. P. (2015). Fast asthenosphere motion in high-resolution global mantle flow models. Geophysical Research Letters, 42(18), 7429-7435.
HHG Solver for the Stokes System, motivated by the Earth mantle convection problem
Scale up to ~10^12 nodes/DoFs ⇒ resolve the whole Earth mantle globally at 1 km resolution
Solution of the Stokes equations
Boussinesq model for mantle convection problems, derived from the equations for balance of forces and conservation of mass and energy:

  −∇·(2η ε(u)) + ∇p = ρ(T) g
  ∇·u = 0
  ∂T/∂t + u·∇T − ∇·(κ ∇T) = γ

u velocity, p dynamic pressure, T temperature, η viscosity of the material, ε(u) = ½(∇u + (∇u)^T) strain rate tensor, ρ density, κ thermal conductivity, γ heat sources, g gravity vector.
Gmeiner, Waluga, Stengel, Wohlmuth, UR: Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems, SISC, 2015.
Coupled Flow-Transport Problem
• 6.5×10^9 DoF
• 10,000 time steps
• run time: 7 days
• mid-size cluster: 288 compute cores in 9 nodes of LSS at FAU
Mantle convection model (Lehrstuhl für Numerische Mathematik, TUM)
A simplified model
For the sake of presentation, let us simplify the model by neglecting compressibility, nonlinearities and inhomogeneities in the material. In dimensionless form, we obtain:

  −Δu + ∇p = −Ra T r̂
  div u = 0
  ∂_t T + u·∇T = Pe^-1 ΔT

Notation: u velocity, T temperature, p pressure, r̂ radial unit vector.
Dimensionless numbers:
• Rayleigh number (Ra): determines the presence and strength of convection
• Peclet number (Pe): ratio of convective to diffusive transport
The mantle convection problem is defined by the model equations in the interior of the spherical shell together with suitable initial/boundary conditions.
Comparison of Stokes Solvers
Three solvers:
• Schur complement CG (pressure correction) with MG
• MINRES with MG preconditioner for the vector Laplacian
• MG for the saddle point problem with an Uzawa-type smoother
Time to solution without the coarse grid solver, for a residual reduction by 10^-8; with the coarse grid solver the difference shrinks.
L | DoFs | SCG | MINRES | Uzawa
4 | 1.4·10^4 | 1.30 | 1.08 | 1.78
5 | 1.2·10^5 | 1.40 | 1.52 | 1.81
6 | 1.0·10^6 | 1.62 | 2.00 | 1.95
7 | 8.2·10^6 | 1.67 | 2.08 | 2.10
8 | 6.6·10^7 | 1.56 | 2.17 | 2.18
9 | 5.3·10^8 | - | - | 2.25

Table 4. Factor of the time to solution for the Laplace and D operator.
evaluations in case of the Uzawa method stay exactly the same. This behaviour is also reflected by the time to solution, cf. Tab. 3.

For a comparison of the Laplace and D operator we also include Tab. 4, for the serial test cases. Here, we present the ratios of the times to solution for the Laplace and D operator. While this factor is approximately 1.5 for the Schur complement CG, it is around 2 for the MINRES and Uzawa-type multigrid methods.
5.3. Parallel performance. In the following we present numerical results for the parallel case, illustrating excellent scalability. Besides the comparison of the different solvers, focus is also laid on the different formulations of the Stokes equations. Scaling results are presented for up to 786,432 threads and 10^13 degrees of freedom. We consider 32 threads per node and ?? tetrahedra of the initial mesh per thread. Let us also mention that the current computational level is obtained by a seven times refined initial mesh. Unless specified otherwise, the settings for the solvers remain as described above.
Nodes | Threads | DoFs | SCG iter | SCG time | MINRES iter | MINRES time | Uzawa iter | Uzawa time
1 | 24 | 6.6·10^7 | 28 (3) | 137.26 s | 65 (3) | 116.35 s | 7 (1) | 65.71 s
6 | 192 | 5.3·10^8 | 26 (5) | 135.85 s | 64 (7) | 133.12 s | 7 (3) | 80.42 s
48 | 1,536 | 4.3·10^9 | 26 (15) | 139.64 s | 62 (15) | 133.29 s | 7 (5) | 85.28 s
384 | 12,288 | 3.4·10^10 | 26 (30) | 142.92 s | 62 (35) | 138.26 s | 7 (10) | 93.19 s
3,072 | 98,304 | 2.7·10^11 | 26 (60) | 154.73 s | 62 (71) | 156.86 s | 7 (20) | 114.36 s
24,576 | 786,432 | 2.2·10^12 | 28 (120) | 187.42 s | 64 (144) | 184.22 s | 8 (40) | 177.69 s

Table 5. Weak scaling results for the Laplace operator (coarse grid iterations in brackets).
Nodes | Threads | DoFs | SCG iter | SCG time | MINRES iter | MINRES time | Uzawa iter | Uzawa time
1 | 24 | 6.6·10^7 | 28 | 136.30 s | 65 | 115.05 s | 7 | 58.80 s
6 | 192 | 5.3·10^8 | 26 | 134.26 s | 64 | 130.56 s | 7 | 64.40 s
48 | 1,536 | 4.3·10^9 | 26 | 135.06 s | 62 | 128.34 s | 7 | 65.03 s
384 | 12,288 | 3.4·10^10 | 26 | 135.41 s | 62 | 128.34 s | 7 | 64.96 s
3,072 | 98,304 | 2.7·10^11 | 26 | 139.55 s | 62 | 133.30 s | 7 | 66.08 s
24,576 | 786,432 | 2.2·10^12 | 28 | 154.06 s | 64 | 139.52 s | 8 | 78.24 s

Table 6. Weak scaling results for the Laplace operator without coarse grid time.
In Tab. 5 we present weak scaling results for the individual solvers in the case of the Laplace operator. Additionally, we also present the coarse grid iteration numbers in brackets, which in case of the Uzawa solver represent the iterations of the CG used as a preconditioner, cf. Sec. 4.2. We observe excellent scalability of all three solvers, where the Schur complement CG and MINRES have
[Figure 1.2: Operator counts for the individual blocks (A, B, C, M) for the SCG and UMG methods in percent, where 100% corresponds to the total number of operator counts for the SCG.]
requires only half of the total number of operator evaluations in comparison to the SCG algorithm. This is also reflected in the time-to-solution of the UMG.
1.4.2 Scalability

In the following, we present numerical examples illustrating the robustness and scalability of the Uzawa-type multigrid solver. For simplicity we consider the homogeneous Dirichlet boundary value problem. As a computational domain, we consider the spherical shell

  Ω = {x ∈ R^3 : 0.55 < ‖x‖_2 < 1}.

We present weak scaling results, where the initial mesh T_-2 consists of 240 tetrahedra for the case of 5 nodes and 80 threads on 80 compute cores, cf. Tab. 1.3. This mesh is constructed using an icosahedron to provide an initial approximation of the sphere, see e.g. [19, 41], where we subdivide each resulting spherical prismatoid into three tetrahedra. The same approach is used for all other experiments below that employ a spherical shell. We now consider 16 compute cores per node (one thread per core) and assign 3 tetrahedra of the initial mesh per compute core. The coarse grid T_0 is a twice refined initial grid, where in one uniform refinement step each tetrahedron is decomposed into 8 new ones. Thus the number of degrees of freedom grows approximately from 10^3 to 10^6 when we perform a weak scaling experiment starting on the coarse grid T_0.

The right-hand side f = 0 and a constant viscosity ν = 1 are considered. Due to the constant viscosity it is also possible to use the simplification of the Stokes equations using Laplace operators. In the following numerical experiments, we shall compare this formulation with the generalized form that employs the ε-operator. For
Exploring the Limits …
typically appear in simulations for molecules, quantum mechanics, or geophysics. The initial mesh T_-2 consists of 240 tetrahedra for the case of 5 nodes and 80 threads. The number of degrees of freedom on the coarse grid T_0 grows from 9.0·10^3 to 4.1·10^7 in the weak scaling. We consider the Stokes system with the Laplace-operator formulation. The relative accuracies for the coarse grid solvers (PMINRES and CG algorithm) are set to 10^-3 and 10^-4, respectively. All other parameters for the solver remain as previously described.
nodes | threads | DoFs | iter | time | time w.c.g. | time c.g. in %
5 | 80 | 2.7·10^9 | 10 | 685.88 | 678.77 | 1.04
40 | 640 | 2.1·10^10 | 10 | 703.69 | 686.24 | 2.48
320 | 5,120 | 1.2·10^11 | 10 | 741.86 | 709.88 | 4.31
2,560 | 40,960 | 1.7·10^12 | 9 | 720.24 | 671.63 | 6.75
20,480 | 327,680 | 1.1·10^13 | 9 | 776.09 | 681.91 | 12.14

Table 10: Weak scaling results with and without coarse grid for the spherical shell geometry.
Numerical results with up to 10^13 degrees of freedom are presented in Tab. 10, where we observe robustness with respect to the problem size and excellent scalability. Besides the time-to-solution (time), we also present the time excluding the time necessary for the coarse grid (time w.c.g.) and the total fraction in % needed to solve the coarse grid. For this particular setup, this fraction does not exceed 12%. Due to 8 refinement levels, instead of 7 previously, and the reduction of threads per node from 32 to 16, longer computation times (time-to-solution) are expected, compared to the results in Sec. 4.3. In order to evaluate the performance, we compute the factor t·n_c·n^-1, where t denotes the time-to-solution (including the coarse grid), n_c the number of threads used, and n the degrees of freedom. This factor is a measure of the compute time per degree of freedom, weighted with the number of threads, under the assumption of perfect scalability. For 1.1·10^13 DoFs, this factor takes the value of approx. 2.3·10^-5, and for the case of 2.2·10^12 DoFs on the unit cube (Tab. 5) approx. 6.0·10^-5, which is of the same order. Thus, in both scaling experiments the time-to-solution for one DoF is comparable. The reason why the ratio is even smaller for the extreme case of 1.1·10^13 DoFs is the deeper multilevel hierarchy. Recall also that the computational domain is different in the two cases.
The computation of 10^13 degrees of freedom is close to the limits given by the shared memory of each node. By (8), we obtain a theoretical total memory consumption of 274.22 TB, and on one node of 14.72 GB. Though 16 GB of shared memory per node is available, we employ one further optimization step and do not allocate the right-hand side on the finest grid level. The right-hand side vector is replaced by an on-the-fly assembly, i.e., the right-hand side values are evaluated and integrated locally when needed. By applying this on-the-fly assembly, the theoretical
Multigrid with Uzawa Smoother Optimized for Minimal Memory Consumption
10^13 unknowns correspond to 80 TByte for the solution vector alone; JUQUEEN has 450 TByte memory ⇒ a matrix-free implementation is essential.
Gmeiner, B., Huber, M., John, L., Rüde, U., & Wohlmuth, B.: A quantitative performance analysis for Stokes solvers at the extreme scale. arXiv:1511.02134, submitted.
Conclusions and Outlook
• Resilience with multigrid
• Multigrid scales to Peta and beyond
• HHG: lean and mean implementation, excellent time to solution
• Reaching 10^13 DoF