Scaling Algebraic Multigrid to over 287K processors



DESCRIPTION

We describe our parallel algebraic multigrid method, which scaled well to over 287K processors on the IBM Blue Gene/P.

TRANSCRIPT

Page 1: Scaling algebraic multigrid to over 287K processors

Title

Scaling Algebraic Multigrid to over 287K processors

Markus Blatt, [email protected]

Interdisziplinäres Zentrum für wissenschaftliches Rechnen, Universität Heidelberg

Copper Mountain Conference on Multigrid Methods, March 28, 2011


Page 2: Scaling algebraic multigrid to over 287K processors

Outline


1 Motivation

2 IBM Blue Gene/P

3 Algebraic Multigrid

4 Parallelization Approach

5 Scalability


Page 3: Scaling algebraic multigrid to over 287K processors

Motivation


Figure: Cut through the ground beneath an acre

• Highly discontinuous permeability of the ground.
• 3D simulations with high resolution.
• Efficient and robust parallel iterative solvers.

−∇ · (K(x)∇u) = f    in Ω
u = g_D    on Γ_D ⊂ ∂Ω
(K(x)∇u) · n = g_N    on Γ_N ⊂ ∂Ω    (1)

Page 4: Scaling algebraic multigrid to over 287K processors

IBM Blue Gene/P

IBM Blue Gene/P: JUGENE in Jülich

• 72 racks - 73728 nodes (294912 cores)

• Overall peak performance: 1 Petaflops

• Main Memory 144 TB

• Compute Card/Processor:
  • PowerPC 450, 32-bit, 850 MHz, 4 cores
  • Main memory: 2 GB

• Top500 list: ranked 9th (2nd in Europe)


Page 5: Scaling algebraic multigrid to over 287K processors

IBM Blue Gene/P

Challenges on IBM Blue Gene/P

Challenging Facts

• Only 500 MB of main memory per core in virtual node (VN) mode!

• 32-bit architecture, but 144 TB memory!

• Utilize many cores!

• Idle cores for coarse problems!

Crucial Points to be Addressed
• Be memory efficient, and possibly accept methods that converge more slowly.

• Be able to address 10^12 degrees of freedom (a quick memory estimate follows after this slide).

• Speed up matrix-vector products as much as possible!

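To see why memory efficiency dominates the design, here is a back-of-the-envelope estimate using only the numbers quoted in this talk (294,912 cores, roughly 500 MB per core, 10^12 unknowns). The program is purely illustrative and not part of the talk's software:

```cpp
#include <cstdio>

int main() {
    // Numbers from the slides: 294,912 cores, ~500 MB usable per core,
    // and the goal of addressing 10^12 degrees of freedom.
    const double cores      = 294912.0;
    const double memPerCore = 500.0 * 1024 * 1024;  // bytes
    const double unknowns   = 1.0e12;

    const double unknownsPerCore = unknowns / cores;             // ~3.4 million
    const double bytesPerUnknown = memPerCore / unknownsPerCore; // ~150 bytes

    std::printf("unknowns per core: %.3g\n", unknownsPerCore);
    std::printf("bytes per unknown: %.0f\n", bytesPerUnknown);
    return 0;
}
```

Roughly 150 bytes per unknown must hold the matrix row, all vectors, and the entire coarse-level hierarchy, which is why low operator complexity and a lean setup are essential.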


Page 7: Scaling algebraic multigrid to over 287K processors

Algebraic Multigrid

Algebraic Multigrid Methods

Interpolation AMG

• Split unknowns into C-nodes (coarse grid unknowns) and F-nodes (other unknowns).

• Construct operator-dependent interpolation operators.

• Optimal number of iterations for model problems.

• Non-optimal operator complexity.

• Convergence behavior not optimal for discontinuous coefficients.

Aggregation AMG

• Builds aggregates of strongly connected unknowns.

• Each aggregate is represented as a coarse-level unknown.

• (Tentative) piecewise-constant prolongators (sketched after this slide).

• For smoothed aggregation the prolongation operator is improved via smoothing.

• The non-smoothed version is more easily parallelized.
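
As an illustration of what a tentative piecewise-constant prolongator does, the sketch below (plain C++, not the code behind this talk; `agg` is a hypothetical array mapping each fine unknown to its aggregate) applies the prolongation and its transpose, the restriction:

```cpp
#include <vector>
#include <cstddef>

// agg[i] = coarse-level aggregate that fine unknown i belongs to.
// For the piecewise-constant prolongator P we have P(i, agg[i]) = 1, 0 elsewhere.

std::vector<double> prolongate(const std::vector<double>& coarse,
                               const std::vector<std::size_t>& agg) {
    std::vector<double> fine(agg.size());
    for (std::size_t i = 0; i < agg.size(); ++i)
        fine[i] = coarse[agg[i]];      // every fine unknown inherits its aggregate's value
    return fine;
}

std::vector<double> restrictToCoarse(const std::vector<double>& fine,
                                     const std::vector<std::size_t>& agg,
                                     std::size_t nCoarse) {
    std::vector<double> coarse(nCoarse, 0.0);
    for (std::size_t i = 0; i < fine.size(); ++i)
        coarse[agg[i]] += fine[i];     // restriction is the transpose: sum over each aggregate
    return coarse;
}
```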

Page 8: Scaling algebraic multigrid to over 287K processors

Algebraic Multigrid

Improved Algebraic Multigrid Methods

Improved Variants

• Interpolation-based AMG with aggressive coarsening (less fill-in).

• Adaptive AMG (better convergence, but more fill-in)

• AMG based on compatible relaxation (are there efficient implementations?).

• K-cycle Aggregation-AMG (good operator complexity, but more work on the levels).


Page 9: Scaling algebraic multigrid to over 287K processors

Algebraic Multigrid

Our AMG Method

• Non-smoothed AMG

• Heuristic and greedy aggregation algorithm.

• Fast setup time

• Very memory efficient.

• Fast and scalable V-cycle.

• Only used as a preconditioner for Krylov methods (BiCGSTAB in the examples); a V-cycle sketch follows after this slide.

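The following is a structural sketch of a V-cycle used as a preconditioner. The per-level operators are placeholders supplied from outside; this is not the DUNE-ISTL interface, and all names are illustrative:

```cpp
#include <vector>
#include <functional>
#include <cstddef>

using Vec = std::vector<double>;

// One level of the hierarchy; the operators are provided by the setup phase.
struct Level {
    std::function<void(Vec&, const Vec&)> presmooth;       // x <- smoothed approximation of A x = b
    std::function<void(Vec&, const Vec&)> postsmooth;
    std::function<Vec(const Vec&, const Vec&)> residual;    // r = b - A x
    std::function<Vec(const Vec&)> restrictToCoarse;        // r_c = R r
    std::function<Vec(const Vec&)> prolongate;              // e = P e_c
};

// One V-cycle starting on level l; levels.back() is the coarsest level,
// which is only treated approximately by smoothing here.
void vcycle(std::vector<Level>& levels, std::size_t l, Vec& x, const Vec& b) {
    Level& L = levels[l];
    L.presmooth(x, b);
    if (l + 1 == levels.size())
        return;                              // coarsest level: approximate solve only
    Vec r  = L.residual(x, b);
    Vec rc = L.restrictToCoarse(r);
    Vec ec(rc.size(), 0.0);
    vcycle(levels, l + 1, ec, rc);           // recurse with zero initial guess
    Vec e = L.prolongate(ec);
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += e[i];                        // coarse-grid correction
    L.postsmooth(x, b);
}
```

Inside BiCGSTAB, applying the preconditioner amounts to running one such cycle with a zero initial guess on the current residual.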

Page 10: Scaling algebraic multigrid to over 287K processors

Algebraic Multigrid

Sequential Aggregation

• Strong connection (j, i):  a⁻_ij a⁻_ji / (a_ii a_jj) > α γ_i,  with  γ_i = max_{j≠i} a⁻_ij a⁻_ji / (a_ii a_jj)  and  a⁻_ij = max(0, −a_ij)  (a sketch follows after this slide).

• Prescribe the size of the aggregates (e.g. min 8, max 12 for 3D).

• Minimize fill-in.

• Make aggregates round in the isotropic case.

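A direct transcription of this strength criterion into code could look as follows; this is a dense-matrix sketch with illustrative names, not the actual implementation:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Dense stand-in for the matrix; the real code of course works on a sparse matrix.
using Matrix = std::vector<std::vector<double>>;

inline double aMinus(const Matrix& A, std::size_t i, std::size_t j) {
    return std::max(0.0, -A[i][j]);                     // a^-_ij = max(0, -a_ij)
}

// True if unknown j is strongly connected to unknown i:
//   a^-_ij a^-_ji / (a_ii a_jj) > alpha * gamma_i,
// with gamma_i the maximum of that ratio over all k != i.
bool stronglyConnected(const Matrix& A, std::size_t i, std::size_t j, double alpha) {
    double gamma = 0.0;
    for (std::size_t k = 0; k < A.size(); ++k) {
        if (k == i) continue;
        gamma = std::max(gamma, aMinus(A, i, k) * aMinus(A, k, i) / (A[i][i] * A[k][k]));
    }
    return aMinus(A, i, j) * aMinus(A, j, i) / (A[i][i] * A[j][j]) > alpha * gamma;
}
```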

Page 11: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Parallelization Approach

Goals
• Reuse efficient sequential data structures and algorithms for aggregation and matrix-vector products.

• Agglomerate data onto fewer processes on coarser levels.

• The smoother is hybrid Gauss-Seidel (sketched after this slide).

Approach

• Separate decomposition and communication information from data structures.

• Use simple and portable index identification for data items.

• Data structures need to be augmented to contain ghost items.

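A minimal sketch of one hybrid Gauss-Seidel sweep, assuming a simple CSR layout in which the locally owned rows come first and the ghost values are stored behind them; layout and names are illustrative, not the talk's data structures:

```cpp
#include <vector>
#include <cstddef>

// Minimal CSR matrix: row i owns the entries [rowptr[i], rowptr[i+1]).
struct CSR {
    std::vector<std::size_t> rowptr, col;
    std::vector<double>      val;
};

// Gauss-Seidel on the owned rows [0, numOwned); ghost entries of x
// (indices >= numOwned) stay frozen at the values from the last
// communication step, i.e. Jacobi across process boundaries.
void hybridGaussSeidel(const CSR& A, std::vector<double>& x,
                       const std::vector<double>& b, std::size_t numOwned) {
    for (std::size_t i = 0; i < numOwned; ++i) {
        double sum = b[i], diag = 1.0;
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) {
            if (A.col[k] == i) diag = A.val[k];
            else               sum -= A.val[k] * x[A.col[k]];
        }
        x[i] = sum / diag;
    }
    // An owner-to-ghost communication step would follow before the next sweep.
}
```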


Page 13: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Index Sets

Index Set

• Distributed overlapping index set I = ⋃_{p=0}^{P−1} I_p.
• Process p manages the mapping I_p → [0, n_p).

• Might only store information about the mapping for shared indices.


Page 14: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Index Sets


Global Index

• Identifies a position (index) globally.

• Arbitrary and not consecutive (to support adaptivity).

• Persistent.

• On JUGENE this is not an int, to get rid of the 32-bit limit!


Page 15: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Index Sets


Local Index
• Addresses a position in the local container.

• Convertible to an integral type.

• Consecutive index starting from 0.

• Non-persistent.

• Provides an attribute to identify the ghost region (an index-set sketch follows after this slide).

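Putting the last three slides together, a minimal sketch of the index-set concept could look like the following; it is not the DUNE-ISTL ParallelIndexSet interface, and all names are illustrative:

```cpp
#include <cstdint>
#include <cstddef>
#include <map>

// The attribute distinguishes the owner region from the ghost region.
enum class Attribute { owner, ghost };

// Global index: arbitrary, persistent, not necessarily consecutive.
// A 64-bit type avoids the 32-bit limit mentioned for JUGENE.
using GlobalIndex = std::uint64_t;

// Local index: consecutive, starting from 0, addresses the local containers,
// and carries the attribute that identifies the ghost region.
struct LocalIndex {
    std::size_t local;
    Attribute   attribute;
};

// Per-process index set I_p: maps global to local indices. It may store the
// mapping only for shared indices; purely interior ones need no entry.
class IndexSet {
public:
    void add(GlobalIndex g, std::size_t local, Attribute a) { map_[g] = {local, a}; }
    const LocalIndex& operator[](GlobalIndex g) const { return map_.at(g); }
    bool contains(GlobalIndex g) const { return map_.count(g) != 0; }
private:
    std::map<GlobalIndex, LocalIndex> map_;
};
```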

Page 16: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Remote Information and Communication

Remote Index Information
• For each process q, process p knows all common global indices together with their attribute on q.

Communication
• The target and source partitions of the indices are chosen using attribute flags, e.g. from ghost to owner and ghost.
• If remote index information for q is available on p, then p sends all the data in one message (a buffer-assembly sketch follows after this slide).

• Communication takes place collectively at the same time.

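A sketch of how the remote index information could drive such a communication step: for one particular choice of attribute flags, all values destined for a neighbour q are packed into a single buffer. The types are illustrative and no actual MPI calls are shown:

```cpp
#include <cstddef>
#include <map>
#include <vector>

enum class Attribute { owner, ghost };

// What process p stores about a neighbour q: for every shared global index,
// the local position on p and the attribute that index has on q.
struct RemoteIndex {
    std::size_t localOnP;      // position of the value in p's containers
    Attribute   attributeOnQ;  // owner or ghost on the remote side
};
using RemoteIndexList = std::vector<RemoteIndex>;
using RemoteIndices   = std::map<int /* rank of q */, RemoteIndexList>;

// Build one send buffer per neighbour: here we send every value that q holds
// as a ghost (assuming p owns the listed indices). One message per neighbour
// would then be exchanged, for all neighbours at the same time.
std::map<int, std::vector<double>>
buildSendBuffers(const RemoteIndices& remote, const std::vector<double>& x) {
    std::map<int, std::vector<double>> buffers;
    for (const auto& [rank, list] : remote) {
        std::vector<double>& buf = buffers[rank];
        for (const RemoteIndex& ri : list)
            if (ri.attributeOnQ == Attribute::ghost)
                buf.push_back(x[ri.localOnP]);
    }
    return buffers;
}
```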

Page 17: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Parallel Matrix Representation

• Let {I_i} be a nonoverlapping decomposition of our index set I.
• Let Ī_i be the augmented index set such that for all j ∈ I_i with |a_jk| + |a_kj| ≠ 0, also k ∈ Ī_i holds.

• Then the locally stored matrix has the block structure (rows and columns ordered with I_i first, then Ī_i \ I_i):

      ( A_ii   *  )
      (   0    I  )

• Therefore Av can be computed locally for the entries associated with I_i if v is known on Ī_i.

• A communication step ensures consistent ghost values.

• The matrix can be used for hybrid preconditioners (a matrix-vector product sketch follows after this slide).

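A sketch of the local matrix-vector product implied by this storage scheme, again assuming a CSR layout with the owner rows stored first; names are illustrative:

```cpp
#include <vector>
#include <cstddef>

struct CSR {
    std::vector<std::size_t> rowptr, col;
    std::vector<double>      val;
};

// Rows [0, numOwned) carry the full stencil A_ii plus the coupling '*' into
// ghost columns; ghost rows are stored as identity rows. If v carries
// consistent ghost values from the last communication step, y = A v is
// correct for all owner entries without further communication.
// y must be sized to the number of local indices.
void localMatVec(const CSR& A, const std::vector<double>& v,
                 std::vector<double>& y, std::size_t numOwned) {
    for (std::size_t i = 0; i < numOwned; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            sum += A.val[k] * v[A.col[k]];
        y[i] = sum;
    }
    // One owner-to-ghost communication step afterwards makes the ghost
    // entries of y consistent again.
}
```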

Page 18: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Parallel Setup Phase

• Every process builds aggregates in its owner region.

• One communication with the next neighbors to update the aggregate information in the ghost region.

• Coarse level index sets are a subset of the fine level.

• Remote index information can be deduced locally.

• The Galerkin product can be calculated locally (sketched after this slide).

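For a piecewise-constant prolongator P, the Galerkin product A_c = PᵀAP simply accumulates every fine-level entry a_ij into the coarse entry (agg(i), agg(j)), which each process can do for the rows it owns. A sketch with illustrative names, collecting the coarse matrix as triplets in a map:

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct CSR {
    std::vector<std::size_t> rowptr, col;
    std::vector<double>      val;
};

// Galerkin product A_c = P^T A P for a piecewise-constant prolongator:
// coarse entry (agg[i], agg[j]) accumulates every fine entry a_ij.
std::map<std::pair<std::size_t, std::size_t>, double>
galerkinProduct(const CSR& A, const std::vector<std::size_t>& agg,
                std::size_t numOwnedRows) {
    std::map<std::pair<std::size_t, std::size_t>, double> Ac;
    for (std::size_t i = 0; i < numOwnedRows; ++i)
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            Ac[{agg[i], agg[A.col[k]]}] += A.val[k];
    return Ac;   // would be converted to a sparse matrix format afterwards
}
```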

Page 19: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Data Agglomeration on Coarse Levels

Figure: number of vertices per processor versus the number of non-idle processors over the level hierarchy L, L−1, …, L−7; data agglomeration produces the primed levels (L−2)', (L−4)' and (L−6)', and the coarsen target is marked.

• Repartition the data onto fewer processes.

• Use METIS on the graph of the communication pattern. (ParMETIS cannot handle the full machine!) A sketch of the repartitioning decision follows after this slide.

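The exact agglomeration criterion is only indicated by the figure; the following is an illustrative guess, assuming a prescribed coarsen target of vertices per process:

```cpp
#include <algorithm>
#include <cstddef>

// Decide how many processes stay active (non-idle) on a coarse level:
// keep just enough processes so that each of them holds at least
// 'coarsenTarget' vertices, and always keep at least one.
std::size_t activeProcesses(std::size_t globalVertices,
                            std::size_t currentProcs,
                            std::size_t coarsenTarget) {
    std::size_t wanted = globalVertices / coarsenTarget;
    return std::max<std::size_t>(1, std::min(currentProcs, wanted));
}
```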

Page 20: Scaling algebraic multigrid to over 287K processors

Scalability

Weak Scalability Results

Figure: coefficient field of the test problem, with values 1, 0.01 and 1000 in different subregions.

• Test problem from Griebel et al.

• ∇ · (K∇u) = 0

• Dirichlet boundary condition

• 32³ unknowns per process (8.7 × 10^9 unknowns for the biggest problem)


Page 21: Scaling algebraic multigrid to over 287K processors

Scalability

Weak Scalability Results

Figure: computation time [s] (0 to 140 s) versus the number of processors (logarithmic axis), showing build time, solve time, and total time.


Page 22: Scaling algebraic multigrid to over 287K processors

Scalability

Weak Scalability Results II

• Richards equation.

• I.e. water flow in partially saturated media.

• One large random heterogeneity field.

• 64 × 64 × 128 unknowns per process.

• 1.25E11 unknowns on the full JUGENE.

• One time step in simulation.

• Model and discretization due to Olaf Ippisch

procs     T (s)
1         1133.44
294849    3543.34


Page 23: Scaling algebraic multigrid to over 287K processors

Scalability

Summary

• Optimal convergence is not everything for parallel iterative solvers.

• Current architectures demand low memory footprints to tackle big problems.

• Still, we were able to achieve reasonable scalability for large problems with jumping coefficients.

• The software is open source and part of the ISTL module of the Distributed and Unified Numerics Environment (DUNE).
