Scaling Algebraic Multigrid to over 287K processors



DESCRIPTION

We describe our parallel algebraic multigrid method, which scaled well to over 287K processors on the IBM Blue Gene/P.

TRANSCRIPT

Page 1: Scaling algebraic multigrid to over 287K processors

Title

Scaling Algebraic Multigrid to over 287K processors

Markus Blatt, [email protected]

Interdisziplinäres Zentrum für wissenschaftliches Rechnen, Universität Heidelberg

Copper Mountain Conference on Multigrid Methods, March 28, 2011


Page 2: Scaling algebraic multigrid to over 287K processors

Outline


1 Motivation

2 IBM Blue Gene/P

3 Algebraic Multigrid

4 Parallelization Approach

5 Scalability


Page 3: Scaling algebraic multigrid to over 287K processors

Motivation


Figure: Cut through the ground beneath an acre

• Highly discontinuous permeability of the ground.
• 3D simulations with high resolution.
• Efficient and robust parallel iterative solvers.

−∇ · (K(x)∇u) = f    in Ω
u = g_D    on Γ_D ⊂ ∂Ω
(K(x)∇u) · n = g_N    on Γ_N ⊂ ∂Ω    (1)

Page 4: Scaling algebraic multigrid to over 287K processors

IBM Blue Gene/P

IBM Blue Gene/P: JUGENE in Jülich

• 72 racks - 73728 nodes (294912 cores)

• Overall peak performance: 1 Petaflops

• Main Memory 144 TB

• Compute Card/Processor:
  • PowerPC 450, 32-bit, 850 MHz, 4 cores
  • Main memory: 2 GB

• Top500 list: ranked 9th (2nd in Europe)


Page 5: Scaling algebraic multigrid to over 287K processors

IBM Blue Gene/P

Challenges on IBM Blue Gene/P

Challenging Facts

• Only 500 MB of main memory per core in virtual node (VN) mode!

• 32-bit architecture, but 144 TB memory!

• Utilize many cores!

• Idle cores for coarse problems!

Crucial Points to be Addressed
• Be memory efficient, and possibly accept methods that converge more slowly.

• Be able to address 10^12 degrees of freedom (a quick memory estimate follows after this slide).

• Speed up matrix-vector products as much as possible!

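To see why memory efficiency dominates the design, here is a back-of-the-envelope estimate using only the numbers quoted in this talk (294,912 cores, roughly 500 MB per core, 10^12 unknowns). The program is purely illustrative and not part of the talk's software:

```cpp
#include <cstdio>

int main() {
    // Numbers from the slides: 294,912 cores, ~500 MB usable per core,
    // and the goal of addressing 10^12 degrees of freedom.
    const double cores      = 294912.0;
    const double memPerCore = 500.0 * 1024 * 1024;  // bytes
    const double unknowns   = 1.0e12;

    const double unknownsPerCore = unknowns / cores;             // ~3.4 million
    const double bytesPerUnknown = memPerCore / unknownsPerCore; // ~150 bytes

    std::printf("unknowns per core: %.3g\n", unknownsPerCore);
    std::printf("bytes per unknown: %.0f\n", bytesPerUnknown);
    return 0;
}
```

Roughly 150 bytes per unknown must hold the matrix row, all vectors, and the entire coarse-level hierarchy, which is why low operator complexity and a lean setup are essential.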


Page 7: Scaling algebraic multigrid to over 287K processors

Algebraic Multigrid

Algebraic Multigrid Methods

Interpolation AMG

• Split unknowns into C-nodes (coarse grid unknowns) and F-nodes (other unknowns).

• Construct operator-dependent interpolation operators.

• Optimal number of iterations for model problems.

• Non-optimal operator complexity.

• Convergence behavior not optimal for discontinuous coefficients.

Aggregation AMG

• Builds aggregates of strongly connected unknowns.

• Each aggregate is represented as a coarse-level unknown.

• (Tentative) piecewise-constant prolongators (sketched after this slide).

• For smoothed aggregation the prolongation operator is improved via smoothing.

• The non-smoothed version is more easily parallelized.
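
As an illustration of what a tentative piecewise-constant prolongator does, the sketch below (plain C++, not the code behind this talk; `agg` is a hypothetical array mapping each fine unknown to its aggregate) applies the prolongation and its transpose, the restriction:

```cpp
#include <vector>
#include <cstddef>

// agg[i] = coarse-level aggregate that fine unknown i belongs to.
// For the piecewise-constant prolongator P we have P(i, agg[i]) = 1, 0 elsewhere.

std::vector<double> prolongate(const std::vector<double>& coarse,
                               const std::vector<std::size_t>& agg) {
    std::vector<double> fine(agg.size());
    for (std::size_t i = 0; i < agg.size(); ++i)
        fine[i] = coarse[agg[i]];      // every fine unknown inherits its aggregate's value
    return fine;
}

std::vector<double> restrictToCoarse(const std::vector<double>& fine,
                                     const std::vector<std::size_t>& agg,
                                     std::size_t nCoarse) {
    std::vector<double> coarse(nCoarse, 0.0);
    for (std::size_t i = 0; i < fine.size(); ++i)
        coarse[agg[i]] += fine[i];     // restriction is the transpose: sum over each aggregate
    return coarse;
}
```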

Page 8: Scaling algebraic multigrid to over 287K processors

Algebraic Multigrid

Improved Algebraic Multigrid Methods

Improved Variants

• Interpolation-based AMG with aggressive coarsening (less fill-in).

• Adaptive AMG (better convergence, but more fill-in)

• AMG based on compatible relaxation (are there efficient implementations?).

• K-cycle Aggregation-AMG (good operator complexity, but more work on the levels).


Page 9: Scaling algebraic multigrid to over 287K processors

Algebraic Multigrid

Our AMG Method

• Non-smoothed AMG

• Heuristic and greedy aggregation algorithm.

• Fast setup time

• Very memory efficient.

• Fast and scalable V-cycle.

• Only used as a preconditioner for Krylov methods (BiCGSTAB in the examples); a V-cycle sketch follows after this slide.

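The following is a structural sketch of a V-cycle used as a preconditioner. The per-level operators are placeholders supplied from outside; this is not the DUNE-ISTL interface, and all names are illustrative:

```cpp
#include <vector>
#include <functional>
#include <cstddef>

using Vec = std::vector<double>;

// One level of the hierarchy; the operators are provided by the setup phase.
struct Level {
    std::function<void(Vec&, const Vec&)> presmooth;       // x <- smoothed approximation of A x = b
    std::function<void(Vec&, const Vec&)> postsmooth;
    std::function<Vec(const Vec&, const Vec&)> residual;    // r = b - A x
    std::function<Vec(const Vec&)> restrictToCoarse;        // r_c = R r
    std::function<Vec(const Vec&)> prolongate;              // e = P e_c
};

// One V-cycle starting on level l; levels.back() is the coarsest level,
// which is only treated approximately by smoothing here.
void vcycle(std::vector<Level>& levels, std::size_t l, Vec& x, const Vec& b) {
    Level& L = levels[l];
    L.presmooth(x, b);
    if (l + 1 == levels.size())
        return;                              // coarsest level: approximate solve only
    Vec r  = L.residual(x, b);
    Vec rc = L.restrictToCoarse(r);
    Vec ec(rc.size(), 0.0);
    vcycle(levels, l + 1, ec, rc);           // recurse with zero initial guess
    Vec e = L.prolongate(ec);
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += e[i];                        // coarse-grid correction
    L.postsmooth(x, b);
}
```

Inside BiCGSTAB, applying the preconditioner amounts to running one such cycle with a zero initial guess on the current residual.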

Page 10: Scaling algebraic multigrid to over 287K processors

Algebraic Multigrid

Sequential Aggregation

• Strong connection (j, i):  a⁻_ij a⁻_ji / (a_ii a_jj) > α γ_i,  with  γ_i = max_{j≠i} a⁻_ij a⁻_ji / (a_ii a_jj)  and  a⁻_ij = max(0, −a_ij)  (a sketch follows after this slide).

• Prescribe the size of the aggregates (e.g. min 8, max 12 for 3D).

• Minimize fill-in.

• Make aggregates round in the isotropic case.

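A direct transcription of this strength criterion into code could look as follows; this is a dense-matrix sketch with illustrative names, not the actual implementation:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Dense stand-in for the matrix; the real code of course works on a sparse matrix.
using Matrix = std::vector<std::vector<double>>;

inline double aMinus(const Matrix& A, std::size_t i, std::size_t j) {
    return std::max(0.0, -A[i][j]);                     // a^-_ij = max(0, -a_ij)
}

// True if unknown j is strongly connected to unknown i:
//   a^-_ij a^-_ji / (a_ii a_jj) > alpha * gamma_i,
// with gamma_i the maximum of that ratio over all k != i.
bool stronglyConnected(const Matrix& A, std::size_t i, std::size_t j, double alpha) {
    double gamma = 0.0;
    for (std::size_t k = 0; k < A.size(); ++k) {
        if (k == i) continue;
        gamma = std::max(gamma, aMinus(A, i, k) * aMinus(A, k, i) / (A[i][i] * A[k][k]));
    }
    return aMinus(A, i, j) * aMinus(A, j, i) / (A[i][i] * A[j][j]) > alpha * gamma;
}
```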

Page 11: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Parallelization Approach

Goals
• Reuse efficient sequential data structures and algorithms for aggregation and matrix-vector products.

• Agglomerate data onto fewer processes on coarser levels.

• The smoother is hybrid Gauss-Seidel (sketched after this slide).

Approach

• Separate decomposition and communication information from data structures.

• Use simple and portable index identification for data items.

• Data structures need to be augmented to contain ghost items.

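A minimal sketch of one hybrid Gauss-Seidel sweep, assuming a simple CSR layout in which the locally owned rows come first and the ghost values are stored behind them; layout and names are illustrative, not the talk's data structures:

```cpp
#include <vector>
#include <cstddef>

// Minimal CSR matrix: row i owns the entries [rowptr[i], rowptr[i+1]).
struct CSR {
    std::vector<std::size_t> rowptr, col;
    std::vector<double>      val;
};

// Gauss-Seidel on the owned rows [0, numOwned); ghost entries of x
// (indices >= numOwned) stay frozen at the values from the last
// communication step, i.e. Jacobi across process boundaries.
void hybridGaussSeidel(const CSR& A, std::vector<double>& x,
                       const std::vector<double>& b, std::size_t numOwned) {
    for (std::size_t i = 0; i < numOwned; ++i) {
        double sum = b[i], diag = 1.0;
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) {
            if (A.col[k] == i) diag = A.val[k];
            else               sum -= A.val[k] * x[A.col[k]];
        }
        x[i] = sum / diag;
    }
    // An owner-to-ghost communication step would follow before the next sweep.
}
```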


Page 13: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Index Sets

Index Set

• Distributed overlapping index set I = ⋃_{p=0}^{P−1} I_p.
• Process p manages the mapping I_p → [0, n_p).

• Might only store information about the mapping for shared indices.


Page 14: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Index Sets


Global Index

• Identifies a position (index) globally.

• Arbitrary and not consecutive (to support adaptivity).

• Persistent.

• On JUGENE this is not an int, to get rid of the 32-bit limit!


Page 15: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Index Sets


Local Index
• Addresses a position in the local container.

• Convertible to an integral type.

• Consecutive index starting from 0.

• Non-persistent.

• Provides an attribute to identify the ghost region (an index-set sketch follows after this slide).

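Putting the last three slides together, a minimal sketch of the index-set concept could look like the following; it is not the DUNE-ISTL ParallelIndexSet interface, and all names are illustrative:

```cpp
#include <cstdint>
#include <cstddef>
#include <map>

// The attribute distinguishes the owner region from the ghost region.
enum class Attribute { owner, ghost };

// Global index: arbitrary, persistent, not necessarily consecutive.
// A 64-bit type avoids the 32-bit limit mentioned for JUGENE.
using GlobalIndex = std::uint64_t;

// Local index: consecutive, starting from 0, addresses the local containers,
// and carries the attribute that identifies the ghost region.
struct LocalIndex {
    std::size_t local;
    Attribute   attribute;
};

// Per-process index set I_p: maps global to local indices. It may store the
// mapping only for shared indices; purely interior ones need no entry.
class IndexSet {
public:
    void add(GlobalIndex g, std::size_t local, Attribute a) { map_[g] = {local, a}; }
    const LocalIndex& operator[](GlobalIndex g) const { return map_.at(g); }
    bool contains(GlobalIndex g) const { return map_.count(g) != 0; }
private:
    std::map<GlobalIndex, LocalIndex> map_;
};
```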

Page 16: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Remote Information and Communication

Remote Index Information
• For each process q, process p knows all common global indices together with their attribute on q.

Communication
• The target and source partitions of the indices are chosen using attribute flags, e.g. from ghost to owner and ghost.
• If remote index information for q is available on p, then p sends all the data in one message (a buffer-assembly sketch follows after this slide).

• Communication takes place collectively at the same time.

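A sketch of how the remote index information could drive such a communication step: for one particular choice of attribute flags, all values destined for a neighbour q are packed into a single buffer. The types are illustrative and no actual MPI calls are shown:

```cpp
#include <cstddef>
#include <map>
#include <vector>

enum class Attribute { owner, ghost };

// What process p stores about a neighbour q: for every shared global index,
// the local position on p and the attribute that index has on q.
struct RemoteIndex {
    std::size_t localOnP;      // position of the value in p's containers
    Attribute   attributeOnQ;  // owner or ghost on the remote side
};
using RemoteIndexList = std::vector<RemoteIndex>;
using RemoteIndices   = std::map<int /* rank of q */, RemoteIndexList>;

// Build one send buffer per neighbour: here we send every value that q holds
// as a ghost (assuming p owns the listed indices). One message per neighbour
// would then be exchanged, for all neighbours at the same time.
std::map<int, std::vector<double>>
buildSendBuffers(const RemoteIndices& remote, const std::vector<double>& x) {
    std::map<int, std::vector<double>> buffers;
    for (const auto& [rank, list] : remote) {
        std::vector<double>& buf = buffers[rank];
        for (const RemoteIndex& ri : list)
            if (ri.attributeOnQ == Attribute::ghost)
                buf.push_back(x[ri.localOnP]);
    }
    return buffers;
}
```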

Page 17: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Parallel Matrix Representation

• Let {I_i} be a nonoverlapping decomposition of our index set I.
• Let Ī_i be the augmented index set such that for all j ∈ I_i with |a_jk| + |a_kj| ≠ 0, also k ∈ Ī_i holds.

• Then the locally stored matrix has the block structure (rows and columns ordered with I_i first, then Ī_i \ I_i):

      ( A_ii   *  )
      (   0    I  )

• Therefore Av can be computed locally for the entries associated with I_i if v is known on Ī_i.

• A communication step ensures consistent ghost values.

• The matrix can be used for hybrid preconditioners (a matrix-vector product sketch follows after this slide).

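A sketch of the local matrix-vector product implied by this storage scheme, again assuming a CSR layout with the owner rows stored first; names are illustrative:

```cpp
#include <vector>
#include <cstddef>

struct CSR {
    std::vector<std::size_t> rowptr, col;
    std::vector<double>      val;
};

// Rows [0, numOwned) carry the full stencil A_ii plus the coupling '*' into
// ghost columns; ghost rows are stored as identity rows. If v carries
// consistent ghost values from the last communication step, y = A v is
// correct for all owner entries without further communication.
// y must be sized to the number of local indices.
void localMatVec(const CSR& A, const std::vector<double>& v,
                 std::vector<double>& y, std::size_t numOwned) {
    for (std::size_t i = 0; i < numOwned; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            sum += A.val[k] * v[A.col[k]];
        y[i] = sum;
    }
    // One owner-to-ghost communication step afterwards makes the ghost
    // entries of y consistent again.
}
```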

Page 18: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Parallel Setup Phase

• Every process builds aggregates in its owner region.

• One communication with the next neighbors to update the aggregate information in the ghost region.

• Coarse level index sets are a subset of the fine level.

• Remote index information can be deduced locally.

• The Galerkin product can be calculated locally (sketched after this slide).

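For a piecewise-constant prolongator P, the Galerkin product A_c = PᵀAP simply accumulates every fine-level entry a_ij into the coarse entry (agg(i), agg(j)), which each process can do for the rows it owns. A sketch with illustrative names, collecting the coarse matrix as triplets in a map:

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct CSR {
    std::vector<std::size_t> rowptr, col;
    std::vector<double>      val;
};

// Galerkin product A_c = P^T A P for a piecewise-constant prolongator:
// coarse entry (agg[i], agg[j]) accumulates every fine entry a_ij.
std::map<std::pair<std::size_t, std::size_t>, double>
galerkinProduct(const CSR& A, const std::vector<std::size_t>& agg,
                std::size_t numOwnedRows) {
    std::map<std::pair<std::size_t, std::size_t>, double> Ac;
    for (std::size_t i = 0; i < numOwnedRows; ++i)
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            Ac[{agg[i], agg[A.col[k]]}] += A.val[k];
    return Ac;   // would be converted to a sparse matrix format afterwards
}
```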

Page 19: Scaling algebraic multigrid to over 287K processors

Parallelization Approach

Data Agglomeration on Coarse Levels

Figure: number of vertices per processor versus the number of non-idle processors over the level hierarchy L, L−1, …, L−7; data agglomeration produces the primed levels (L−2)', (L−4)' and (L−6)', and the coarsen target is marked.

• Repartition the data onto fewer processes.

• Use METIS on the graph of the communication pattern. (ParMETIS cannot handle the full machine!) A sketch of the repartitioning decision follows after this slide.

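The exact agglomeration criterion is only indicated by the figure; the following is an illustrative guess, assuming a prescribed coarsen target of vertices per process:

```cpp
#include <algorithm>
#include <cstddef>

// Decide how many processes stay active (non-idle) on a coarse level:
// keep just enough processes so that each of them holds at least
// 'coarsenTarget' vertices, and always keep at least one.
std::size_t activeProcesses(std::size_t globalVertices,
                            std::size_t currentProcs,
                            std::size_t coarsenTarget) {
    std::size_t wanted = globalVertices / coarsenTarget;
    return std::max<std::size_t>(1, std::min(currentProcs, wanted));
}
```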

Page 20: Scaling algebraic multigrid to over 287K processors

Scalability

Weak Scalability Results

Figure: coefficient field of the test problem, with values 1, 0.01 and 1000 in different subregions.

• Test problem from Griebel et al.

• ∇ · (K∇u) = 0

• Dirichlet boundary condition

• 32³ unknowns per process (8.7 × 10^9 unknowns for the biggest problem)


Page 21: Scaling algebraic multigrid to over 287K processors

Scalability

Weak Scalability Results

Figure: computation time [s] (0 to 140 s) versus the number of processors (logarithmic axis), showing build time, solve time, and total time.


Page 22: Scaling algebraic multigrid to over 287K processors

Scalability

Weak Scalability Results II

• Richards equation.

• I.e. water flow in partially saturated media.

• One large random heterogeneity field.

• 64 × 64 × 128 unknowns per process.

• 1.25E11 unknowns on the full JUGENE.

• One time step in simulation.

• Model and discretization due to Olaf Ippisch

procs     T (s)
1         1133.44
294849    3543.34


Page 23: Scaling algebraic multigrid to over 287K processors

Scalability

Summary

• Optimal convergence is not everything for parallel iterative solvers.

• Current architectures demand low memory footprints to tackle big problems.

• Still, we were able to achieve reasonable scalability for large problems with jumping coefficients.

• The software is open source and part of the ISTL module of the Distributed and Unified Numerics Environment (DUNE).
