
IT 08 007

Degree project, 30 credits, February 2008

A patch-based partitioner for Structured Adaptive Mesh Refinement: Implementation and evaluation

Abbas Vakili

Institutionen för informationsteknologi / Department of Information Technology

Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH-enheten

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address: Box 536, 751 21 Uppsala

Telephone: 018 – 471 30 03

Fax: 018 – 471 30 00

Website: http://www.teknat.uu.se/student

Abstract

A patch-based partitioner for Structured Adaptive Mesh Refinement

Abbas Vakili

To increase the speed of computer simulations, we solve partial differential equations (PDEs) using structured adaptive mesh refinement (SAMR). During the execution of an SAMR application, finer grids are superimposed dynamically on coarser grids where a more accurate solution is needed in the computational domain. To further decrease the computation time, we use parallel computers and divide the computational work between the processors. This gives rise to a challenging load balancing problem. The arithmetic workload, the communication volume and the data migration should simultaneously be kept as low as possible.

In this work we implement a load balancing algorithm that specifically targets the arithmetic workload. Efforts are also made to restrict the amount of communication using space filling curves. The algorithm is evaluated using a number of SAMR applications. The load imbalance is very low for all applications. The communication volumes are generally large but not prohibitively large. Overall, the results are very satisfactory compared to other algorithms. The implemented algorithm can significantly reduce the run-time for parallel SAMR applications.

Printed by: Reprocentralen ITC

IT 08 007

Examiner: Anders Jansson

Subject reviewer: Jarmo Rantakokko

Supervisor: Henrik Johansson

Contents

1 Introduction

2 Background
2.1 Differential Equations
2.1.1 Ordinary Differential Equations
2.1.2 Partial Differential Equations (PDE)
2.2 The Finite Difference Method (FDM)
2.3 Structured Adaptive Mesh Refinement
2.3.1 Grid Description
2.3.2 Integration Algorithm
2.3.3 Grid Generation and Clustering

3 Load balancing algorithms
3.1 Patch-based approach
3.2 The domain-based approach
3.3 The hybrid approach
3.4 Ordering grids to increase locality

4 Implementation of the partitioner

5 Applications
5.1 Ramp - Mach reflection at a wedge
5.2 ShockTurb - Planar Richtmyer-Meshkov instability
5.3 ConvShock - Converging/diverging Richtmyer-Meshkov instability
5.4 Spheres - Cylinders in hypersonic flow

6 Performance metrics
6.1 Number of blocks
6.2 Load imbalance
6.3 Volume of communications

7 Performance results
7.1 Number of blocks
7.2 Load imbalance
7.3 Intra-level communication
7.4 Inter-level communication
7.5 Total communication
7.6 Impact of SFC
7.7 Scalability of the implemented patch-based partitioner

8 Summary and conclusion


1 Introduction

A computer simulation imitates real phenomena that can be hard to describe analytically. Computer simulations have become a useful tool for modelling many systems in astrophysics [5, 11], chemistry [14] and biology [1]. A concrete example is car construction, where simulations are used to calculate the robustness and safety of a car in collision situations [6].

Many phenomena can be approximated by computer simulations. The phenomena are often described mathematically by partial differential equations (PDEs). A PDE is a differential equation: a relation involving an unknown function of several independent variables and its partial derivatives with respect to those variables.

Most methods used for solving PDEs employ a discrete computational grid, and the solution is then computed on this grid. The grid can either be structured, where it is described with Cartesian coordinates, or unstructured, where it is usually built with triangles or tetrahedra. The resolution of the grid has great importance for the accuracy of the solution. A higher resolution generally yields a more accurate solution but is computationally more expensive. Many PDEs need, at a given time, high accuracy only on limited parts of the grid. In this case, it is a waste of computational resources to compute a solution with a high and uniform accuracy on the whole grid. Instead, the grid can be refined in the parts that need greater accuracy. In these parts, overlapping grids with higher resolution are added. The remaining parts of the grid are left with a lower resolution. This method is called Structured Adaptive Mesh Refinement (SAMR) [3]. SAMR can under favorable conditions result in a runtime reduction in the vicinity of a factor 1000 [15].

However, even with SAMR, a single computer is often not capable of computing the solution of a PDE within a realistic time interval. To decrease the simulation time further, we divide the computations between a number of processors. We thus parallelize the computations. The parallel computational time for each time step is determined by the processor with the largest amount of work. A badly balanced workload results in poor utilization of the computational resources and a longer execution time. We also need to consider that the solution can depend on data that resides on other processors. In such cases, the involved processors must communicate partial solutions to each other before the computations can proceed. High communication volumes also result in longer execution times.

To divide the workload between the processors, three common types of algorithms are used:

• Domain-based algorithms try to preserve the locality of the different partitions to minimize the communication between processors [12, 16, 21]. Overlapping grids are always assigned to the same processor. These algorithms often result in low communication but a high load imbalance between the processors.

• Patch-based algorithms divide the workload residing on the same level of refinement between processors without considering the workload on other refinement levels [10, 23]. These algorithms can result in low load imbalance but high communication.

• Hybrid algorithms combine ideas from both patch-based and domain-based algorithms to balance the impact of load imbalance and communication [18].

In this thesis we implement a patch-based partitioner that aims to achieve the lowest possible load imbalance, while we try to limit the amount of communication.

2 Background

2.1 Differential Equations

A differential equation is an equation where the derivatives of a function appear as variables. Differential equations express relationships involving rates of change. The solution of a differential equation is a function whose derivatives satisfy the equation.

The mathematical theory of differential equations has developed together with the sciences where the equations originate and where the results find application. Diverse scientific problems often give rise to similar differential equations. In such cases, the mathematical theory can unify distinct scientific fields.

The order of a differential equation is the order of the highest derivative that it contains. For instance, a first order differential equation contains only first order derivatives. Most differential equations contain derivatives with respect to space, but derivatives with respect to time can also occur. Differential equations are divided into two types, based on the number of independent variables they depend on: Ordinary Differential Equations (ODE) and Partial Differential Equations (PDE).

2.1.1 Ordinary Differential Equations

An ordinary differential equation contains a function of only one independent variable, and one or more of its derivatives with respect to that variable. A simple example is Newton's second law of motion, which leads to the differential equation

$$m \frac{d^2 x}{dt^2} = f(x)$$

for the motion of a particle of mass m. Here, the force f depends only on the position x of the particle. Thus, the unknown function x appears on both sides of the differential equation, as indicated by the notation f(x).

2.1.2 Partial Differential Equations (PDE)

A partial differential equation (PDE) is a relation involving an unknown function of several independent variables and its partial derivatives. Partial differential equations are used to formulate and solve complex problems such as the propagation of sound or heat, electromagnetics, electrodynamics, fluid flow, and elasticity. Completely distinct physical problems may have similar mathematical formulations.

An example partial differential equation with two independent variables is

$$\frac{\partial u}{\partial x} + \frac{\partial u}{\partial y} = f(x, y)$$

A solution of a partial differential equation is generally not unique. Additional conditions must generally be specified on the boundary of the region where the solution is defined. Initial values might also be needed.

2.2 The Finite Difference Method (FDM)

Solving PDEs analytically is generally impossible. They are therefore solved approximately, using numerical methods on computers. The numerical solution of partial differential equations involves a discrete domain where approximations of the PDEs are computed. One standard method is to use a Cartesian grid and estimate the values of the unknowns at the grid points. The resolution of the grid influences the local error and hence the accuracy of the solution. The resolution also influences the number of computations to be made. The properties of the grids are thus important for the behaviour of a given method.

Figure 1: A structured grid, described with Cartesian coordinates. Note that ∆x and ∆y can have different sizes.

To solve PDEs numerically, the finite difference method (FDM) is often used. Each term in the partial differential equation is replaced by a difference operator. A difference operator approximates a certain derivative at a grid point. The operator generally includes the values of the function both at the current grid point and at neighbouring grid points. By substituting the difference operators into the PDE, an approximation of the solution can be obtained. For time-dependent PDEs, the FDM can also be used in the time dimension.

A forward difference operator, denoted by D+, is an expression of the form

$$D_+ f(x) = \frac{f(x+h) - f(x)}{h}$$

where h is the step size. This operator approximates the first derivative.

A backward difference operator, denoted by D−, is an expression of the form

$$D_- f(x) = \frac{f(x) - f(x-h)}{h}$$

where h is again the step size. This operator also approximates the first derivative.

A second order derivative can be approximated by the following operator:

$$D_+ D_- f(x) = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$$
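The three operators can be tried out numerically. The following is a minimal sketch in Python; the function names are our own and the test function sin(x) is chosen only for illustration.

```python
import math

def forward_diff(f, x, h):
    """D+ f(x): forward difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h

def backward_diff(f, x, h):
    """D- f(x): backward difference approximation of f'(x)."""
    return (f(x) - f(x - h)) / h

def second_diff(f, x, h):
    """D+D- f(x): centered approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

if __name__ == "__main__":
    x, h = 1.0, 1e-3
    # Compare with the exact derivatives of sin: cos(x) and -sin(x).
    print(forward_diff(math.sin, x, h) - math.cos(x))
    print(backward_diff(math.sin, x, h) - math.cos(x))
    print(second_diff(math.sin, x, h) + math.sin(x))
```

For a smooth function, the errors of the forward and backward differences shrink roughly in proportion to h, while the error of the second difference shrinks roughly in proportion to h squared, until rounding errors start to dominate.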

For the approximate solution to converge to the correct solution of the PDE, two conditions must be fulfilled:

• Consistency: the local truncation error should go to zero as the step sizes go to zero. This means that the discrete problem approximates the continuous problem.

• Stability: the approximate solution should remain bounded.

When solving a PDE using FDM on a parallel computer, the processors need to exchange data at every processor boundary. An FDM operator always involves neighboring grid points. If those neighboring points reside on different processors, grid data is exchanged between the processors. Figure 2 illustrates this situation.

As an example of FDM, consider the heat equation

$$u_t = c\,u_{xx}, \quad 0 \le x \le 1$$

$$u(0, x) = f(x), \quad u(t, 0) = \alpha, \quad u(t, 1) = \beta$$

We let $u_j^k$ denote the approximate solution at $x_j = j\Delta x$ and $t_k = k\Delta t$, where $\Delta x$ and $\Delta t$ are the step sizes, $j$ is the index in the x-dimension, $k$ is the index on the time axis and $n$ is the number of interior grid points. If we replace $u_t$ by a forward difference in time and $u_{xx}$ by a centered difference in space, we get the scheme

$$\frac{u_j^{k+1} - u_j^k}{\Delta t} = c\,\frac{u_{j+1}^k - 2u_j^k + u_{j-1}^k}{(\Delta x)^2}, \quad j = 1, \ldots, n$$

or

$$u_j^{k+1} = u_j^k + c\,\frac{\Delta t}{(\Delta x)^2}\left(u_{j+1}^k - 2u_j^k + u_{j-1}^k\right), \quad j = 1, \ldots, n \qquad (1)$$

The boundary conditions give us $u_0^k = \alpha$ and $u_{n+1}^k = \beta$, and the initial conditions provide the starting values $u_j^0 = f(x_j)$ for all $j$, so that we can march the numerical solution forward in time using the difference scheme.

Figure 2: Two grid patches are assigned to processors P0 and P1, respectively. To compute the value at the point (i, j) using the D+ operator, P0 needs the data at the point (i, j + 1). Thus, P0 must communicate with P1 to receive the values stored at the point (i, j + 1).
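Scheme (1) translates almost directly into code. The sketch below is a minimal serial Python version, not taken from any SAMR framework. The function names and the choice of initial data are our own assumptions, and we assume Δx = 1/(n+1) so that the boundary points fall on x = 0 and x = 1.

```python
import math

def heat_step(u, r):
    """Apply scheme (1) once. u[0] and u[-1] hold the boundary values."""
    new = u[:]
    for j in range(1, len(u) - 1):
        new[j] = u[j] + r * (u[j + 1] - 2.0 * u[j] + u[j - 1])
    return new

def solve_heat(f, alpha, beta, c, n, dt, steps):
    """March the solution `steps` time steps forward with n interior points."""
    dx = 1.0 / (n + 1)
    r = c * dt / dx**2  # the factor c * dt / dx^2 in scheme (1)
    u = [alpha] + [f(j * dx) for j in range(1, n + 1)] + [beta]
    for _ in range(steps):
        u = heat_step(u, r)
    return u

if __name__ == "__main__":
    # Initial data sin(pi x) with zero boundary values; the exact solution
    # is exp(-pi^2 t) sin(pi x) for c = 1.
    u = solve_heat(lambda x: math.sin(math.pi * x), 0.0, 0.0, 1.0, 19, 0.001, 100)
    print(u[10], math.exp(-math.pi**2 * 0.1))
```

For this explicit scheme the factor c Δt/(Δx)² must not exceed 1/2, otherwise the computed solution grows without bound. This illustrates the stability condition mentioned above.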

2.3 Structured Adaptive Mesh Refinement

For many problems, a grid with uniform resolution gives satisfactory results. However, there are classes of problems where the solution is more difficult to estimate in some regions due to e.g. discontinuities, steep gradients, and shocks. One could use a uniform grid that has a spacing fine enough so that the local errors in these regions are acceptable. This approach is computationally very costly, and for most problems it is difficult to predict a grid resolution that will give good results while keeping the amount of computations low.

A method to decrease the computational time is structured adaptive mesh refinement (SAMR) [3]. In the AMR method, we start with a coarse base grid. As the solution proceeds, we identify regions requiring more resolution. We superimpose finer subgrids in these regions. Finer and finer subgrids are added recursively until either a given maximum number of refinement levels is reached or the error has dropped below the desired level. Thus, in adaptive mesh refinement, grid resolution is only fixed for the base grid and is determined locally for the subgrids according to the properties of the problem. Adaptive mesh refinement can under favorable conditions result in a runtime reduction in the vicinity of a factor 1000 [15].

On parallel computers, AMR gives rise to a challenging load balancing problem. The level of refinement can increase or decrease at any given time. The area occupied by the refinements can also grow larger or smaller. In short, the grids are constantly changing. As a consequence, a partitioning of the grid that yields a good load balance at a certain time step almost certainly results in a bad load balance later on.

2.3.1 Grid Description

The basis for SAMR is a large uniform and structured grid that encompasses the whole computational domain. The resolution of the grid conforms to the lowest acceptable accuracy of the solution. Every refined grid is uniform and rectangular in shape. A refined grid can overlap several coarser, lower level grids, but it must always be inside the domain of these grids.

In the adaptive mesh refinement technique, we start with a coarse grid. As the solution proceeds, we identify the regions requiring more resolution by some parameter characterizing the solution, say the local truncation error. We superimpose finer subgrids only on these regions. Finer and finer subgrids are added recursively until either a given maximum level of refinement is reached or the local truncation error has dropped below the desired level. Thus, in an adaptive mesh refinement computation, grid resolution is fixed for the base grid only and is determined locally for the subgrids according to the requirements of the problem. See Figure 3 for an illustration of the grids.

2.3.2 Integration Algorithm

The SAMR integration algorithm consists of two main components:

• The actual time integration step, which is performed with a finite difference scheme (FDM).

• Error estimation.

Time integration

Each grid can be integrated in time individually, with the exception of the determination of boundary conditions. A consequence of this is that the grids must be integrated in order: the computation must always proceed from coarser to gradually finer grids. By using the same ratio of the time step to the space step on all grid levels, this process is greatly simplified. Most of the finer grids will be contained in the interior of the computational domain. This means that they cannot use the boundary conditions supplied by the PDE. Thus, information must be exchanged with grids on the next lower level in order to obtain the boundary conditions. Also, when a finer grid is integrated, the result should be projected down to the coarser grids before the computations are resumed on the coarser levels. This gives the most accurate solution, since a finer grid uses a smaller time step. Taking both of these factors into account, the order of integration is described in Algorithm 1.

Figure 3: Refinement levels on a computational grid. (a) Illustration of the basic concept of SAMR: subgrids lying above have higher resolution than those lying below. (b) Illustration of a real problem, a shock wave hitting a wedge. Each rectangle is a patch. Smaller rectangles generally have a higher resolution. Note how the refined patches follow the area in need of high accuracy.

Algorithm 1 Integration algorithm
Begin advance(l, k):
    take one step of size k on level l
    if l = L then
        return
    else
        interpolate from level l to l+1
        for i = 1 to 2 do
            advance(l+1, k/2)
        end for
        project from level l+1 to l
    end if
End advance

The order of integration can be illustrated with the following example. Assume that we have two levels of refined grids and want to advance the solution from time t to t+k. We begin by taking a step of size k on the base grid (level 0). We then provide boundary data from the base grid to the grids on the first level of refinement. The grids on level 1 are then integrated once, with a time step of size k/2. Next, the appropriate data of level 1 is used as boundary data for refinement level 2, and two time steps of size k/4 are taken on level 2. We have now reached time t+k on the base grid and time t+k/2 on the two finer levels. The solution on level 2 is now projected down to level 1, and the solution is then advanced to time t+k on this level. We then turn to level 2 again and advance the solution to time t+k with two steps of size k/4. The solution on level 2 is finally passed down to level 1, and the new solution on this level is in turn passed down to the base grid. If we did not update the coarser levels as described above, the solution might become both dispersed and dissipated. As a consequence, the apparent need for refinement could gradually vanish, leading to a grid that lacks refinement. Also, the update operations prevent bad values in the coarse approximation of the boundary conditions.
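The order of operations in the two-level example can be reproduced by tracing the recursion of the integration algorithm. The sketch below is a minimal Python version that only records the operations instead of performing them. The event names and the `log` argument are our own illustrative choices.

```python
def advance(level, k, finest, log=None):
    """Trace the recursive SAMR integration order for step size k."""
    if log is None:
        log = []
    log.append(("step", level, k))
    if level < finest:
        # Boundary data for the finer level comes from the current level.
        log.append(("interpolate", level, level + 1))
        for _ in range(2):
            advance(level + 1, k / 2, finest, log)
        # The finer solution updates the coarser one.
        log.append(("project", level + 1, level))
    return log

if __name__ == "__main__":
    for event in advance(0, 1.0, finest=2):
        print(event)
```

With two refinement levels and k = 1, the recorded steps are taken on levels 0, 1, 2, 2, 1, 2, 2 in that order, with step sizes 1, 1/2, 1/4, 1/4, 1/2, 1/4, 1/4. This matches the example above.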

Error Estimation

Error estimation is performed at intervals of several time steps, and the grid is then adjusted according to the resulting estimate. A common method is to use Richardson extrapolation. In order to increase the interval between the error estimations, each grid is made larger than needed. Because of this, a moving phenomenon does not necessarily mean that we must re-grid at each time step.

2.3.3 Grid Generation and Clustering

The generation of the refined grids is an important part of the SAMR algorithm. The generated grids must cover all of the features in the solution that need a higher resolution. At the same time, the resulting grids should be as small as possible in order to minimize unnecessary computations. Also, the algorithm must be fast, as it is going to be invoked many times during the execution of an SAMR application. The quality of a grid is often measured by its efficiency: the fraction of a grid that consists of flagged points (points in need of higher accuracy). A high efficiency means that the grid only contains a small number of unflagged points. When the efficiency is low, we perform a lot of unnecessary computations, since a large number of grid points are not in need of a higher accuracy. When the efficiency is very high, we normally get a large number of small grids. This is also bad, since all grid neighbors must exchange boundary data before each integration step. The more grids we have, the more communication is needed. The optimal efficiency depends on the application, but it is often in the range of 75-90%. Also, if the solution has a moving phenomenon, a very high efficiency will lead to more frequent re-gridding operations, as there is a higher probability of the phenomenon moving outside the grid.
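The efficiency measure can be made concrete with a small example. The sketch below assumes, for illustration only, that flagged points are given as a set of integer coordinates and that a grid is an inclusive bounding box. Neither representation is taken from a particular SAMR framework.

```python
def grid_efficiency(flagged, lbx, lby, ubx, uby):
    """Fraction of the points in the inclusive box that are flagged."""
    total = (ubx - lbx + 1) * (uby - lby + 1)
    inside = sum(1 for (x, y) in flagged
                 if lbx <= x <= ubx and lby <= y <= uby)
    return inside / total

if __name__ == "__main__":
    flagged = {(0, 0), (1, 0), (2, 0), (0, 1)}
    # One grid covering all flagged points: 4 of 6 points are flagged.
    print(grid_efficiency(flagged, 0, 0, 2, 1))
    # Two smaller grids covering the same points: both fully flagged.
    print(grid_efficiency(flagged, 0, 0, 2, 0))
    print(grid_efficiency(flagged, 0, 1, 0, 1))
```

The example shows the trade-off described above: covering the flagged points with a single grid gives a lower efficiency, while reaching full efficiency requires two grids, one of which is very small.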

3 Load balancing algorithms

Solving large PDEs numerically on a computer with a single processor can take too long to be feasible, even when SAMR is used. Instead, we divide the computation between a number of processors. Most existing SAMR frameworks support parallel execution [7, 13, 23]. Efficient use of parallel SAMR typically requires that the dynamic grid hierarchy is repeatedly partitioned and distributed over the participating processors. Several performance issues arise when the workload is partitioned. As information flows in the grid hierarchy, processors need to exchange data. Intra-level communication appears as grid patches are split between processors and data are exchanged along the borders. Inter-level communication can occur for overlaid patches when the solution is projected down to the coarser level and when a finer patch lacks boundary data. Both types of communication can result in longer execution times.

A synchronization delay may occur when a processor is busy computing while holding data needed by other processors. Until the processor has finished its computations, other processors might be unable to proceed as they lack data. Synchronization delays can be severe: the time spent waiting for data can be of the same magnitude as the actual computational time [19]. The number of delays often grows as the number of processors is increased.

To get optimal performance, the partitioner needs to simultaneously minimize data migration, load imbalance, communication volumes and synchronization delays. Typically, this is impossible. Instead, the partitioner needs to trade off the metrics in accordance with the characteristics of the application and the computer. Ultimately, partition quality is determined by the resulting application execution time.

The common algorithms for partitioning SAMR hierarchies can be categorized as patch-based, domain-based or hybrid.

3.1 Patch-based approach

For patch-based partitioners [10, 23], distribution decisions are made independently for each newly created grid. The most straightforward approach is to divide each patch or refinement level into p blocks, where p is the number of processors, and distribute one block to each processor. The partitioner can work either patch by patch or level by level. The patch-based approach is illustrated in Figure 4.

In theory, the patch-based approach results in perfect load balance. In practice, some load imbalance can be expected due to sub-optimal patch aspect ratios, integer divisions and constraints on the patch size. Partitioning can be performed incrementally, as only patches created or altered since the previous time step need to be considered for repartitioning. However, patch-based algorithms often result in high communication volumes and communication bottlenecks. The communication volume generally increases when a patch is subdivided into blocks to lower the load imbalance.

3.2 The domain-based approach

For domain-based algorithms, only the base grid is partitioned [12, 16, 21]. Initially, the workload of the refined patches is projected down onto the base grid, reducing the problem to partitioning a single grid with heterogeneous workload. Thus, the domain is partitioned along with all contained grids on all refinement levels. The minimum block size is determined by the size of the computational stencil on the base grid. As the base grid stencil corresponds to many grid points on highly refined patches, the workload of a minimum sized block can be large. The domain-based approach is illustrated in Figure 4.

As overlaid grid blocks reside on the same processor for domain-based algorithms, inter-level communication is eliminated. A complete repartitioning might be necessary when the grid hierarchy is modified. The load imbalance is often high for deep grid hierarchies, because it is hard to divide patches from many levels at once. A problem with domain-based algorithms is "bad cuts": many small blocks with bad aspect ratios. These blocks occur when patches are cut in bad places, assigning only a tiny fraction of a patch to one processor while the majority of it resides on another processor.

Domain-based algorithms often result in low communication volumes and highload imbalance.

3.3 The hybrid approach

For some application states, neither the patch-based nor the domain-based approach can create well-performing partitions. For these states, a combination of the patch-based and domain-based algorithms can be used. This type of algorithm is called hybrid [18]. Hybrid algorithms try to inherit the good properties of both the domain-based and the patch-based techniques while avoiding their main drawbacks. We do not describe the details of any hybrid algorithm in this thesis.

3.4 Ordering grids to increase locality

The solution of a PDE on a particular part of a grid usually depends on the solution on other grid parts. Processors computing on different parts of a grid thus have to communicate partial solutions to each other (see Figure 2). This communication takes time and increases the execution time. Processors can also stall while they wait for data. Normally, grid parts that depend on each other lie close to each other. To decrease the amount of communication, patches that are close to each other should be assigned to the same processor. Thus, the locality of the subgrids, both on the same level of refinement and on different refinement levels, has to be preserved. To increase locality, the grids should be ordered before they are split and assigned to the processors. A common method to order subgrids is to use Space Filling Curves (SFC) [4].

Figure 4: Dividing a grid hierarchy between four processors with a patch-based partitioning algorithm (a) and a domain-based partitioning algorithm (b). (a) In a patch-based partitioning algorithm, the grid hierarchy is divided between the participating processors on each level of refinement separately. (b) In a domain-based partitioning algorithm, the grid hierarchy is divided between the participating processors on all levels of refinement at the same time.


The German mathematician David Hilbert (1862-1943) was the first to give the algebraically described space-filling curves a geometric interpretation [4]. He divided the unit square into four subsquares. This splitting is recursively repeated for the emerging subsquares, leading to the curves in Figure 5. An important property of SFCs is called full inclusion: whenever we further split a subinterval, the mapping to the corresponding finer subsquares stays in the former interval.

Figure 5: Hilbert curves, first and second order. The numbers in the figure are the SFC-indices.

SFCs can be used to order the patches in a grid hierarchy. They are computationally efficient, locality-preserving recursive mappings from d-dimensional space to 1-dimensional space, i.e. $R^d \to R^1$, such that each point in $R^d$ is mapped to a unique point or index in $R^1$. Two properties of space filling curves make them particularly suitable for partitioning SAMR grid hierarchies. Digital causality implies that points close together in d-dimensional space will be mapped to points close together in 1-dimensional space, i.e. the mapping preserves locality. Self-similarity implies that, as a d-dimensional region is refined, the refined sub-regions can be recursively filled by curves having the same structure as the curves filling the original region, but possibly a different orientation. Figure 6 illustrates this property for a 2-dimensional region. In the case of SAMR grid hierarchies, SFC mapping is exploited to maintain locality on the same refinement level for the patch-based approach and on both the same and different refinement levels for the domain-based approach.
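To make the mapping from 2-dimensional cells to a 1-dimensional index concrete, the sketch below computes Hilbert indices with a widely used iterative formulation. It is given only as an illustration: the function name is our own, and the orientation of the resulting curve may differ from the one shown in Figure 5.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or reflect the quadrant so that the sub-curve gets the
        # standard orientation before the next, finer level is processed.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

if __name__ == "__main__":
    n = 4
    cells = [(x, y) for x in range(n) for y in range(n)]
    # Cells listed in curve order; consecutive cells are always neighbours.
    print(sorted(cells, key=lambda c: hilbert_index(n, *c)))
```

Sorting the cells by their index visits every cell exactly once, and two consecutive cells always share an edge. This is the locality property that the partitioner relies on.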

4 Implementation of the partitioner

In this section, we describe the implementation of our patch-based partitioning algorithm. The partitioner divides a SAMR grid hierarchy on which the solution of a PDE is computed. The partitioner should be invoked by the SAMR framework when repartitioning is needed. The current grid hierarchy is given to the partitioner as input and it is assumed to be two-dimensional (the partitioner is easily extended to also work for grids in 3D). The output is a partitioned grid hierarchy where each refinement level is divided into p parts, where p is the number of processors. Each part is assigned to a specific processor.

Figure 6: An illustration of the self-similarity of Space Filling Curves: (a) level 1, (b) level 2, (c) level 3.

The partitioner initially receives the grid hierarchy from a SAMR framework. Each grid patch in the hierarchy is transformed to an object of type BBoxInfo. BBoxInfo is a custom-made data structure that stores all properties of a grid patch. Table 1 presents a specification of the BBoxInfo class. The use of BBoxInfo makes the partitioner independent of any data structure used by the SAMR framework.

The first time the partitioner is invoked during a simulation, it reads the size of the base grid of the hierarchy. To order the patches, the partitioner then creates an empty square grid that is used to assign SFC-indices to the patches.

The size of this SFC-grid is determined by computing the smallest power of two that is equal to or larger than the longest side of the base grid, up to a maximum size of 512 × 512 grid points. For example, if the longest side is 240 grid points, the SFC-grid will be of size 256 × 256 grid points ($256 = 2^8$). Each cell in the SFC-grid is then indexed according to the Hilbert space filling curve.
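The size rule can be written down in a few lines. The following is a minimal sketch; the function and parameter names are hypothetical and not taken from the implementation.

```python
def sfc_grid_size(longest_side, max_size=512):
    """Smallest power of two >= longest_side, capped at max_size."""
    size = 1
    while size < longest_side and size < max_size:
        size *= 2
    return size

if __name__ == "__main__":
    print(sfc_grid_size(240))   # 256, as in the example above
    print(sfc_grid_size(2000))  # 512, the cap applies
```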

For each grid patch, the partitioner maps its coordinates onto the SFC-grid and finds the corresponding cell. It then assigns the SFC-index of this cell to the SFC-index field in the corresponding BBoxInfo instance. The SFC-grid is created once and is reused at every subsequent repartition. Next, the partitioner orders the BBoxInfo objects by their SFC-indices. For the ordering, we use the quicksort algorithm.

We limited the resolution of the SFC-grid to 512 × 512 because the time complexity for creating an SFC-grid grows exponentially. The resolution 512 × 512 was shown to produce good results for all our test applications.
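The mapping and ordering step could look roughly as follows. This is only a sketch under several assumptions that are not stated in the text: a patch is mapped through its lower bound corner, coordinates are scaled linearly from the domain extent to the SFC-grid, and the cell indexing is passed in as a function. The class below mirrors only some of the fields of BBoxInfo, and a row-major stand-in replaces the Hilbert indexing to keep the example short. Python's built-in `sorted` is used instead of quicksort.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BBoxInfo:
    lbx: int          # lower bound coordinates
    lby: int
    ubx: int          # upper bound coordinates
    uby: int
    level: int        # refinement level
    index: int = -1   # SFC-index, filled in by the partitioner

def assign_and_sort(patches: List[BBoxInfo], domain_nx: int, domain_ny: int,
                    sfc_size: int,
                    cell_index: Callable[[int, int], int]) -> List[BBoxInfo]:
    """Give each patch the index of its SFC-grid cell and order the patches."""
    for p in patches:
        # Assumption: scale the lower bound corner onto the SFC-grid.
        cx = min(p.lbx * sfc_size // domain_nx, sfc_size - 1)
        cy = min(p.lby * sfc_size // domain_ny, sfc_size - 1)
        p.index = cell_index(cx, cy)
    return sorted(patches, key=lambda p: p.index)

if __name__ == "__main__":
    size = 4
    row_major = lambda x, y: y * size + x  # stand-in for the Hilbert index
    patches = [BBoxInfo(300, 0, 399, 99, 1), BBoxInfo(0, 0, 99, 99, 1),
               BBoxInfo(0, 300, 99, 399, 1)]
    for p in assign_and_sort(patches, 400, 400, size, row_major):
        print(p)
```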

It is possible that two grid patches that lie close to each other are assigned the same SFC-index when the SFC-grid is smaller than the base grid. In practice, this has no negative effect on the locality, as the patches will be positioned next to each other when they are ordered (see Section 7).

| Type | Name | Description |
|---|---|---|
| int | nx, ny | Number of grid points in the x and y dimensions |
| int | lbx, lby | Lower bound coordinates |
| int | ubx, uby | Upper bound coordinates |
| int | stepx, stepy | Number of steps between two grid points in the x and y dimensions, computed in the coordinate system of the highest refinement level |
| int | level | Refinement level |
| int | index | SFC-index |

Table 1: Instance variables in class BBoxInfo. All variables except the last one (index) are used to transform a grid patch into a BBoxInfo object. index is used to order an array of BBoxInfo objects.

Next, the partitioner starts to assign grid patches to the processors. If the optimal processor workload is exceeded when we assign a grid patch to a processor, the partitioner divides the offending grid patch into two parts. The patch is cut perpendicular to its longest side. One of the new parts has the size that the current processor needs to reach at least the optimal processor workload. This part is assigned to the current processor and the other part is saved for the next processor.

If dividing a grid patch results in a patch with a size smaller than a specified threshold, the whole patch is assigned to the current processor. The threshold is determined by the size of the FDM-operators. By restricting the minimum patch size, we avoid the creation of many small patches that can potentially cause a lot of unnecessary communication. If the current processor is the last processor, the partitioner assigns the whole grid patch (and any remaining grid patches) to it.

Due to the integer divisions, the actual workload assigned to a processor can be slightly higher or lower than the optimal processor workload. Through experiments, it was discovered that the assigned workload usually is lower than the optimal processor workload. If many processors are slightly underloaded, excess work will begin to accumulate. Since the last processor is assigned all remaining work, this excess workload can cause a serious overload on the last processor. Because the total computational time is determined by the processor with the longest computational time, this is a serious issue. As a remedy, we try to scatter this excess workload between all processors, rather than letting it accumulate at the last processor. To achieve this, we aim to make the processor workload around 5% larger than the optimal processor workload. The factor 1.05 was found through experiments to consistently produce good results. Algorithm 3 illustrates the mechanism of dividing the grid hierarchy between the processors.
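The accumulation effect can be reproduced with a deliberately simplified model. The sketch below is an assumption-laden illustration, not the thesis code: patches are (rows, cols) strips that may only be cut between whole rows, the minimum patch size check is left out, and the function name `assign_strip_patches` is hypothetical.

```python
def assign_strip_patches(patches, nprocs, slack=1.05):
    """Greedy assignment of (rows, cols) patches; returns the load per processor."""
    total = sum(rows * cols for rows, cols in patches)
    opt = total / nprocs                  # optimal processor workload
    loads = [0] * nprocs
    queue = list(patches)
    p = 0
    while queue:
        rows, cols = queue.pop(0)
        work = rows * cols
        last = p == nprocs - 1
        if last or loads[p] + work <= slack * opt:
            loads[p] += work
            if not last and loads[p] >= opt:
                p += 1                    # current processor is full
        else:
            # Cut along whole rows; the integer division rounds down.
            k = int((slack * opt - loads[p]) // cols)
            loads[p] += k * cols
            queue.insert(0, (rows - k, cols))
            p += 1
    return loads


def imbalance(loads):
    return 100 * max(loads) / (sum(loads) / len(loads)) - 100


strip = [(1000, 100)]                     # one patch of 1000 rows, 100 points each
for slack in (1.0, 1.05):
    loads = assign_strip_patches(strip, 16, slack)
    print(slack, loads[-1], round(imbalance(loads), 1))
```

In this example every processor is rounded down to 62 rows when no slack is used, so the last processor ends up with 70 rows and the imbalance is 12%. With the factor 1.05 each processor takes 65 rows, the last one is underloaded instead, and the imbalance drops to 4%.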

Algorithm 2 Read grid patches
  Read the size of the base grid in the input grid hierarchy
  Create SFC-grid
  while patches left in hierarchy do
    read next patch
    create BBoxInfo object
    store object in array
    if no more patches on this level then
      compute the optimal procWorkLoad on this level
      sort the BBoxInfo array based on SFC
    end if
  end while
  goto algorithm Assign grids

Algorithm 3 Assign grids
  for i = 0 to maxLevel do
    while patches left on this level do
      // Add patch to proc
      if procWorkLoad + workLoadPatch <= 1.05*optProcWorkLoad OR lastProcessor then
        assign current patch to processor
        procWorkLoad += workLoadPatch
        // current processor is full
        if procWorkLoad >= optProcWorkLoad AND !lastProcessor then
          procWorkLoad = 0;
          processorID++;
        end if
      // Need to divide this patch
      else
        goto Algorithm Divide
      end if
    end while
  end for


Algorithm 4 Divide
  Determine direction of cut
  if Xsize > Ysize then
    direction of cut = y
    compute x-coord for cut
  else
    direction of cut = x
    compute y-coord for cut
  end if
  if resulting patches < minSize then
    assign the whole patch to the current processor
    procWorkLoad = 0;
    processorID++;
    goto algorithm Assign grids
  else
    assign the first (the "left") part to the current processor
    replace the original grid by the second (the "right") part
    procWorkLoad = 0;
    processorID++;
    goto algorithm Assign grids
  end if
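The cut computation in the Divide step can be sketched as follows. This is an illustration under assumptions: the function name `divide`, the parameter `needed` (the work the current processor still lacks) and the default `min_size=4` are hypothetical, since the text only states that the threshold follows from the size of the FDM-operators.

```python
def divide(nx, ny, needed, min_size=4):
    """Split an nx x ny patch perpendicular to its longest side.

    The first part holds about `needed` grid points (rounded down to whole
    grid lines). Returns (first, second) as (nx, ny) tuples, or None when a
    part would be thinner than min_size and the patch should stay whole.
    """
    if nx > ny:
        cut = needed // ny            # grid lines in x given to the first part
        first, second = (cut, ny), (nx - cut, ny)
        widths = (cut, nx - cut)
    else:
        cut = needed // nx            # grid lines in y given to the first part
        first, second = (nx, cut), (nx, ny - cut)
        widths = (cut, ny - cut)
    if min(widths) < min_size:
        return None
    return first, second


print(divide(100, 40, 1850))  # cut along x: 46 and 54 columns
print(divide(100, 40, 100))   # first part would be 2 columns wide: no cut
```

The rounding in `needed // ny` is the integer division that leaves most processors slightly below their target workload.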

5 Applications

To evaluate the performance of the implemented partitioner, we use four applications from the Virtual Test Facility (VTF) [8, 20]. VTF, developed at the California Institute of Technology, is a software environment for coupling solvers for compressible computational fluid dynamics (CFD) with solvers for computational solid dynamics (CSD) [8]. The purpose of VTF is to simulate highly coupled fluid-structure interaction problems. The selected applications are restricted to the CFD domain of VTF, as the CSD solver is implemented with unstructured grids and the finite element method.

For the CFD problems in VTF, the framework AMROC [2, 9] is used. AMROC is an object-oriented SAMR framework that conforms to the algorithm formulated by Berger and Colella [3]. AMROC is based on DAGH (Distributed Adaptive Grid Hierarchies), a data storage package for parallel grid hierarchies [13]. In AMROC, separate grid levels are allowed to use different degrees of refinement.

Below we present the applications used. A more comprehensive description can be found in the AMROC wiki [2]. The term "refinement factor" describes the degree of refinement with respect to the next lower refinement level. As an example, assume a 2D application with two levels of refinement and the refinement factors (2,4). The resolution on the first refinement level is twice as high as on the base grid in each dimension. The second refinement level has a resolution four times higher than the first refinement level. Thus, for the example above, the maximum patch resolution is 64 times higher than for the base grid ((2 × 2) × (4 × 4)). Because we generally also refine in time, several iterations are performed on a refined patch during one time step on the coarsest level. Using our example and assuming equal refinement in both space and time, we will perform two iterations for each patch on the first refinement level and eight iterations for the patches on the second level. Thus, for each computation on a grid point on the base grid we will perform 64 × 2 × 4 computations on the highest refinement level. The refinement factors are set by the user before the application is executed. The workload metric measures the aggregate number of grid point updates performed during one time step on the coarsest level. Because of the refinement in time, the workload of a refined grid is larger than its number of grid points.
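The arithmetic of this example can be checked with a small sketch. The function name `work_per_base_point` is hypothetical; it assumes a 2D problem, equal refinement in space and time, and a base-grid point that is refined on every level.

```python
def work_per_base_point(factors):
    """Work on each level, per base-grid point, during one coarse time step."""
    work = [1]            # base grid: one point, one iteration
    points, steps = 1, 1
    for r in factors:
        points *= r * r   # spatial refinement in two dimensions
        steps *= r        # refinement in time
        work.append(points * steps)
    return work


print(work_per_base_point((2, 4)))  # [1, 8, 512]; 512 = 64 * 2 * 4
```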

5.1 Ramp — Mach reflection at a wedge

Ramp simulates the reflection of a planar Mach 10 shock wave striking a 30 degree wedge. A complicated shock reflection occurs when the shock wave hits the sloping wall.

Both the workload and the number of grid patches grow almost linearly during execution, due to the growing reflection pattern behind the shock wave. Both the incident and the reflected shock wave have a sharp and unscattered refinement pattern.

The initial grid size is 480x120 grid points and the application uses three levels of refinement with refinement factors 2, 2, 4. The maximum number of grid points is 722,924. A density plot for time t=0.2 is shown in Figure 7.

Figure 7: Density plot for Ramp at t=0.2.

5.2 ShockTurb — Planar Richtmyer-Meshkov instability

ShockTurb treats the interaction of two contacting gases with different densities. When the gases are subject to a shock wave, the interface between them becomes unstable and the result is called a Richtmyer-Meshkov instability. The Richtmyer-Meshkov instability finds applications in stellar evolution and supernova collapse, pressure wave interaction with flame fronts, supersonic and hypersonic combustion, and inertial confinement fusion.

In the simulation, an incident Mach 10 shock wave causes vortices along a sinusoidally perturbed gas interface (five symmetric perturbations). The geometry is rectangular and closed, except at the left-most end. The gases are air and SF6 (sulfur hexafluoride). The simulation is motivated by physical experiments [22].

The initial grid size is 240x120 grid points and the application uses three levels of refinement with a constant refinement factor of two. The maximum number of grid points is 787,076. A density plot for time t=0.5 is shown in Figure 8.

Figure 8: Density plot for ShockTurb at t=0.5

5.3 ConvShock — Converging/diverging Richtmyer-Meshkov instability

ConvShock simulates a Richtmyer-Meshkov instability in a spherical setting. The gaseous interface is spherical and sinusoidal in shape. The interface is disturbed by a Mach 5 spherical and converging shock wave. The shock wave is reflected at the origin and drives a Richtmyer-Meshkov instability with reshock from the apex.

The initial grid size is 200x200 grid points and the application uses four levels of refinement with refinement factors 2, 2, 4, 2. The maximum number of grid points is 695,244. A density plot for time t=0.5 is shown in Figure 9.


Figure 9: Density plot for ConvShock at t = 0.6.

5.4 Spheres — Cylinders in hypersonic flow

In the Spheres application, a constant Mach 10 flow passes over two spheres placed inside the computational domain. The flow results in steady bow shocks over the cylinders. This is a realistic flow problem with complex boundaries.

The initial grid size is 200x160 grid points and the application uses three levels of refinement with a constant refinement factor of two. The maximum number of grid points is 689,688. A density plot for time t=3 is shown in Figure 10.

6 Performance metrics

In this section we define all of the metrics that we use to evaluate the performance of the partitioner.


Figure 10: Density plot for Spheres at t=3.

Application   Initial grid size   Levels of refinement   Refinement factors   Max grid size   Total workload
Ramp          480x240             3                      {2,2,4}              722,924         1.78 × 10^9
ShockTurb     240x120             3                      {2,2,2}              787,076         1.21 × 10^9
ConvShock     200x200             4                      {2,2,4,2}            695,244         1.42 × 10^9
Spheres       200x160             3                      {2,2,2}              689,688         0.60 × 10^9

Table 2: Application data. The maximum grid size denotes the number of grid points at the time step when the grid was at its largest. The total workload also considers refinement in time.

6.1 Number of blocks

To achieve low load imbalances, grid patches are often divided into smaller blocks. Splitting patches into many parts generally results in larger faces between the blocks, inducing more communication. Having many blocks also results in other types of overhead, e.g. larger start-up costs, more "book-keeping" and often higher cache miss rates. For the number of blocks metric, we compute the maximum number of blocks assigned to a processor.
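Given a list that records the owning processor of every block, the metric reduces to a single count. The representation of a partition as a flat list of owners is an assumption made for this sketch.

```python
from collections import Counter


def max_blocks_per_processor(block_owners):
    """block_owners[i] is the processor that owns block i."""
    return max(Counter(block_owners).values())


owners = [0, 0, 1, 2, 2, 2, 3]
print(max_blocks_per_processor(owners))  # processor 2 owns three blocks
```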

6.2 Load imbalance

Figure 11: Workload and the number of grid patches for the four test applications. Panels: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres. Each panel plots the workload and the number of patches against the time step.

Arithmetic load imbalance is a common metric for judging the quality of a partition. We define load imbalance as follows:

$$\text{Load imbalance (\%)} = 100 \cdot \frac{\max\{\text{processor workload}\}}{\text{average workload}} - 100$$

Since we generally also refine in time, the workload is larger than the total grid size. We use the workload of the most loaded processor, as all processors must finish their computations before the solution can be advanced to the next time step.
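Expressed in code, the definition is a one-liner. The function name `load_imbalance` and the example workloads are chosen for illustration only.

```python
def load_imbalance(loads):
    """Load imbalance in percent: 100 * max / average - 100."""
    average = sum(loads) / len(loads)
    return 100.0 * max(loads) / average - 100.0


print(load_imbalance([5, 5, 5, 5]))                    # perfectly balanced
print(round(load_imbalance([100, 100, 100, 120]), 2))  # one overloaded processor
```

A perfectly balanced partition gives 0%, and a single overloaded processor raises the value even when all other processors are at or below the average.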

6.3 Volume of communications

We divide the computational workload of solving the involved PDE on a grid hierarchy between the participating processors. When computing the value of a grid point, the result often depends on the values of other grid points that reside on other processors. Therefore, processors have to communicate the values of these grid points before they can compute the values of their own grid points. The communication between processors can be time consuming and can slow down the execution, especially if the interconnect is slow. The communication metric is divided into three groups:

• Intra-level communication
  Exchange of data residing on the same level of refinement.

• Inter-level communication
  Exchange of data residing on different levels of refinement.

• Total communication
  The sum of the intra- and inter-level communication.

7 Performance results

We have evaluated the implemented patch-based partitioner using four real-world SAMR applications. For each SAMR application (see Section 5), we use a trace file that contains all information about the resulting grid hierarchies. The trace files were partitioned by both the patch-based partitioner and a domain-based partitioner from the SAMR framework AMROC [2, 9]. The resulting partitions were used as input to a SAMR simulator [17]. The SAMR simulator mimics the execution of the common Berger-Colella SAMR algorithm [3]. For each time step, the simulator calculates metrics like the number of blocks, arithmetic load imbalance and communication volumes. The metrics (see Section 6) are independent of computer characteristics. All results were obtained using 16-processor configurations.

To evaluate the scaling characteristics of the patch-based algorithm, we also present results from representative 32-processor configurations.

7.1 Number of blocks

It is important to limit the number of blocks because a smaller number of blocks potentially results in smaller communication volumes. When fewer blocks are assigned to each processor, it is less likely that a processor will need data from grid points assigned to other processors. The results are presented in Figure 12.

The patch-based partitioning algorithm produced partitions with a smaller number of blocks than the domain-based partitioning algorithm for the ConvShock and Ramp applications. For the ShockTurb and Spheres applications, the number of blocks is comparable for both partitioning algorithms. We expected that the patch-based partitioner would produce more blocks than the domain-based partitioner because the patch-based partitioner subdivides the grid hierarchy separately on each level of refinement. The domain-based partitioner divides the grid hierarchy on all refinement levels at the same time using a smaller number of cuts. This can potentially result in a smaller number of blocks. However, since the patch-based partitioner is implemented to only divide the patches that actually cause the load on a processor to overflow, the number of blocks is kept low.

Figure 12: Number of blocks, p=16, patch-based vs domain-based. Panels: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres. Each panel plots the number of blocks against the time step.

7.2 Load imbalance

For all applications, the patch-based partitioning algorithm consistently produced a much lower and more stable load imbalance than the domain-based algorithm (see Figure 13). The load imbalance produced by the patch-based algorithm is at most five percent, while the domain-based partitioning algorithm resulted in a much higher, more varying and highly unpredictable load imbalance. The results indicate that we can assume that the patch-based partitioner will not produce a load imbalance larger than five percent. As we described in Section 4, we allow a load imbalance of five percent to avoid excess workload being accumulated on the last processor. This mechanism appears to work as intended.

The domain-based partitioning algorithm produced high load imbalances because of restrictions on where to place the cuts that divide the hierarchy. The cuts can only be placed at grid points on the base grid. Thus, large load imbalances can arise because the larger workloads on higher levels of refinement cannot be evenly divided.
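The effect of this restriction can be seen in a one-dimensional toy model. This is not AMROC's partitioning algorithm, only a hypothetical illustration: each entry of `column_work` is the total workload (including all refinement levels) above one base-grid column, and a two-way cut may only be placed between columns.

```python
def best_two_way_split(column_work):
    """Smallest achievable maximum load when cutting between base-grid columns."""
    total = sum(column_work)
    best = total
    left = 0
    for w in column_work[:-1]:
        left += w
        best = min(best, max(left, total - left))
    return best


# One heavily refined base column among lightly loaded ones.
work = [1, 1, 500, 1, 1]
max_load = best_two_way_split(work)
average = sum(work) / 2
print(max_load, round(100 * max_load / average - 100, 1))
```

Because the refined column cannot be divided, one processor receives almost all of the work and the imbalance is close to 100%, whereas a partitioner that may cut inside the refined region could split the 500 units almost evenly.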

Figure 13: Load imbalance, p=16, patch-based vs domain-based. Note that the y-axis is cut to allow for easier comparisons. Panels: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres. Each panel plots the load imbalance (%) against the time step.

7.3 Intra-level communication

For the intra-level communication, the patch-based partitioning algorithm gives better results for ShockTurb (see Figure 14). The patch-based and domain-based partitioning algorithms give comparable results for Spheres, while the domain-based partitioning algorithm results in lower communication for ConvShock and Ramp. Thus, the patch-based partitioning algorithm is generally comparable to the domain-based partitioning algorithm with regard to the intra-level communication.

Looking at the number of blocks, we expected that the patch-based partitioning algorithm would perform best for the ConvShock and the Ramp applications. For these two applications, the patch-based algorithm produced a significantly lower number of blocks than the domain-based algorithm. However, the best performance by the patch-based algorithm was instead achieved for the Spheres and ShockTurb applications, where both algorithms produced a similar number of blocks. From these results, we draw the conclusion that the behavior of the intra-level communication depends on many properties other than the number of blocks.

We expected the domain-based partitioning algorithm to produce substantially lower volumes of intra-level communication than the patch-based partitioning algorithm. For the patch-based algorithm, the patches on each level of refinement are ordered by mapping them onto a small SFC-grid (see Section 4). Because of the limited size of the SFC-grid, it is possible that different patches are mapped to the same location on the SFC-grid, and thus assigned the same SFC-index. The data structure used by the domain-based partitioning algorithm allows each patch to be assigned a unique SFC-index, resulting in a more accurate mapping. Although the domain-based partitioning algorithm in theory has better locality-preserving properties, the two algorithms resulted in similar amounts of intra-level communication.

7.4 Inter-level communication

We only present inter-level communication for the patch-based partitioning algorithm in Figure 15. For the domain-based partitioning algorithm, all overlapping areas of a grid hierarchy are assigned to the same processor. Hence, there is no inter-level communication between the processors when the domain-based partitioning algorithm is used.

The volume of inter-level communication is large for the patch-based algorithm, substantially larger than for the intra-level communication. For the intra-level communication, only data at processor boundaries need to be communicated. In the inter-level communication case, large areas of the grid are often sent between the processors. Thus, the actual number of performed communications might be similar, but the communication volume is several times larger for the inter-level communication compared to the intra-level communication. The amount of inter-level communication follows the general behavior of the application workload. ConvShock and Ramp have larger amounts of communication because of the larger workloads that result from their higher degrees of refinement.

Figure 14: Intra-level communication, p=16, patch-based vs domain-based. Panels: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres. Each panel plots the number of grid points communicated against the time step.

To reduce inter-level communication, overlapping areas of the grid hierarchy should be assigned to the same processor. This is a difficult task that requires advanced algorithms and large amounts of post-processing after the initial assignment of blocks to the processors. To our knowledge, this issue has not been seriously addressed in the context of patch-based partitioners and it is also outside the scope of this thesis.

7.5 Total communication

Total communication is the sum of the intra- and the inter-level communication. For the patch-based partitioning algorithm, the total communication is similar to the volume of inter-level communication since the amount of intra-level communication is much lower than the inter-level communication (see Figure 16). The domain-based algorithm results in a very low amount of total communication since the algorithm does not have any inter-level communication. Thus, to reduce the amount of communication for the patch-based algorithm, the effort should be directed at decreasing the inter-level communication.

Figure 15: Inter-level communication, patch-based vs domain-based. Panels: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres. Each panel plots the number of grid points communicated against the time step.

7.6 Impact of SFC

To increase the locality and hence reduce the communication volumes, we order the grid patches on each level according to a space-filling curve. The details of this process are described in Section 3.4. In this section we examine the effects of this SFC-mapping on the Ramp application.

Figure 16: Total communication, patch-based vs domain-based. Panels: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres. Each panel plots the number of grid points communicated against the time step.

Using SFC to increase the locality significantly reduces the intra-level communication volumes, by up to 75% (see Figure 17). The SFC-ordering assigns blocks that lie close to each other to the same processor. We also hoped that the amount of inter-level communication would be reduced since the SFC orders the blocks in a similar pattern for each refinement level. This can potentially result in overlapping patches being assigned to the same processor. However, the reduction in inter-level communication is very small and cannot be regarded as significant. Turning to load imbalance and the number of blocks, we see that the use of SFC has a minor impact on these metrics.

Our implementation of the SFCs is computationally efficient and only adds a very small overhead to the partitioning time. Since high communication volumes generally are the most performance-inhibiting factor for patch-based partitioners, the better locality will help reduce the computational times.

7.7 Scalability of the implemented patch-based partitioner

In this section we present the results from 32-processor configurations to examine the scalability characteristics of the patch-based partitioner. We limit ourselves to the Ramp application, since the other applications have a similar behavior. For each metric we also present the corresponding result for 16 processors. The results are presented in Figure 18.

The amount of load imbalance is limited to 5% for both processor configurations. The partitioner can thus generally be assumed to produce a load imbalance lower than five percent.

The number of blocks grows when the number of processors is increased. This was expected, since we subdivide the patches that cause a processor to overload. The more processors we have, the more patches will be subdivided. However, the increase in the number of blocks is less than the increase in the number of processors.

The intra-level communication is marginally affected by the increase from 16 to 32 processors, even though the number of blocks is almost doubled. This means that the implemented SFC-mechanism scales very well with regard to intra-level communication.

For inter-level communication, we observe a growth in the communication volume. The increase is proportionally less than the increase in the number of processors. Since the inter-level communication often is the most performance-inhibiting factor, this is a good result. We believe that the SFC-ordering slightly decreases the communication volumes.

Because of the good properties of the intra-level communication, the total communication does not grow as much as one could expect. However, since the inter-level communication is much larger than the intra-level communication, the amount of communication is still high.

Summarizing the scalability properties, we see that the main advantage of the partitioner, the low load imbalance, is kept at the same low level for both processor configurations. The amount of communication is increased, but less than the increase in the number of processors. We believe that the patch-based algorithm can continue to scale well over a wide range of configurations. It is inevitable that the amount of communication will eventually saturate the interconnect when the number of processors grows too large. However, we believe that this point occurs after high load imbalances make the use of a domain-based partitioner infeasible.

Figure 17: Ramp, p=16, with SFC vs without SFC. Panels: (a) number of blocks, (b) inter-level communication, (c) intra-level communication, (d) load imbalance (%), (e) total communication. Each panel is plotted against the time step.

Figure 18: Ramp, p=16 vs p=32. Panels: (a) number of blocks, (b) inter-level communication, (c) intra-level communication, (d) load imbalance (%), (e) total communication. Each panel is plotted against the time step.

8 Summary and conclusion

To maintain good performance for parallel SAMR applications, the resulting dynamic grid hierarchies are repeatedly partitioned and distributed over the participating processors.

In this thesis, we implemented a patch-based partitioning algorithm for SAMR applications. We specifically targeted the load imbalance while using space-filling curves to limit the amount of communication.

The algorithm was evaluated using four real-world SAMR applications. The results are compared to results from a common domain-based algorithm. From the evaluation we draw the following conclusions:

• The patch-based algorithm produces a very low and stable load imbalance compared to the domain-based algorithm.

• The partitioner produces a number of blocks comparable to that of the domain-based partitioner.

• For intra-level communication, the patch-based partitioning algorithm produces communication volumes comparable to those of the domain-based partitioning algorithm.

• For the patch-based algorithm, the volume of inter-level communication is substantially larger than the amount of intra-level communication. The domain-based algorithm does not produce any inter-level communication since all overlapping areas are assigned to the same processor.

• The total communication volume is much larger for the patch-based partitioning algorithm than for the domain-based algorithm. This was expected since the domain-based algorithm only produces intra-level communication.

Experiments with different numbers of processors show that the implemented patch-based algorithm scales well when the number of processors is increased. For large numbers of processors, communication might become a limiting factor, despite the good scaling properties.

Finally, using the implemented patch-based algorithm for computationally intensive SAMR applications, the overall run time can be substantially reduced compared to domain-based algorithms.


References

[1] Adrian H. Elcock, Razif R. Gabdoulline, Rebecca C. Wade, and J. Andrew McCammon. Computer simulation of protein-protein association kinetics: acetylcholinesterase-fasciculin. Journal of Molecular Biology, 291:149–162, 1999.

[2] AMROC - Blockstructured adaptive mesh refinement in object-oriented C++.http://amroc.sourceforge.net/index.htm, Oct. 2006.

[3] M.J. Berger and P. Colella. Local adaptive mesh refinement for shock hydrody-namics. Journal of Computational Physics, 82:64–84, May 1989.

[4] Greg Breinholt and Christoph Schierz. Algorithm 781: Generating hilbert’sspace filling curve by recursion. Swiss Federal Institute of Technology, 24:184–189, June 1998.

[5] Greg L. Bryan. Fluids in the universe: Adaptive mesh refinement in cosmology.Computing in Science and Engineering, pages 46–53, Mar-Apr 1999.

[6] Niklas Bylund. Simulation driven product development applied to car body design. Doctoral thesis, Lund University, 1:172, Aug. 2004.

[7] Chombo - Infrastructure for adaptive mesh refinement. http://seesar.lbl.gov/ANAG/chombo/, Dec. 2006.

[8] R. Deiterding, R. Radovitzky, L. Noels, S. Mauch, J.C. Cummings, and D.I. Meiron. A virtual test facility for the efficient simulation of solid material response under strong shock and detonation wave loading. To appear in Engineering with Computers, 2006.

[9] Ralf Deiterding. Detonation simulation with the AMROC framework. In Forschung und wissenschaftliches Rechnen: Beiträge zum Heinz-Billing-Preis 2003, pages 63–77. Gesellschaft für Wiss. Datenverarbeitung, 2004.

[10] Zhiling Lan, Valerie E. Taylor, and Greg Bryan. Dynamic load balancing of SAMR applications on distributed systems. In Proceedings of the 30th International Conference on Parallel Processing, 2001.

[11] M. Norman and G. Bryan. Cosmological adaptive mesh refinement. Numerical Astrophysics, 1999.


[12] Manish Parashar and James C. Browne. On partitioning dynamic adaptive grid hierarchies. In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, 1996.

[13] Manish Parashar, James C. Browne, Carter Edwards, and Kenneth Klimkowski. A common data management infrastructure for adaptive algorithms for PDE solutions. In Proceedings of Supercomputing, 1997.

[14] A. Petersson, H. Karlsson, and S. Holmgren. Predissociation of the Ar-12 van der Waals molecule, a 3D study performed using parallel computers. Technical report, Department of Quantum Chemistry, Uppsala University, Sweden, 2001.

[15] James J. Quirk. A parallel adaptive grid algorithm for computational shock hydrodynamics. Applied Numerical Mathematics, 20:427–453, 1996.

[16] Jarmo Rantakokko. Data Partitioning Methods and Parallel Block-Oriented PDE Solvers. PhD thesis, Uppsala University, 1998.

[17] Mausumi Shee, Samip Bhavsar, and Manish Parashar. Characterizing the performance of dynamic distribution and load-balancing techniques for adaptive grid hierarchies. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, 1999.

[18] Johan Steensland. Efficient Partitioning of Dynamic Structured Grid Hierarchies. PhD thesis, Department of Scientific Computing, Information Technology, Uppsala University, Oct. 2002.

[19] Johan Steensland. Irregular buffer-zone partitioning reducing synchronization cost in SAMR. International Journal of Computational Science and Engineering (IJCSE), 2006. Special issue, to appear.

[20] The Virtual Test Facility. http://www.cacr.caltech.edu/asc/wiki, Oct. 2006.

[21] M. Thuné. Partitioning strategies for composite grids. Parallel Algorithms and Applications, 11:325–348, 1997.

[22] M. Vetter and B. Sturtevant. Experiments on the Richtmyer-Meshkov instability of an air/SF6 interface. Shock Waves, 4(5):247–252, 1995.

[23] Andrew M. Wissink, Richard D. Hornung, Scott R. Kohn, Steve S. Smith, and Noah Elliott. Large scale parallel structured AMR calculations using the SAMRAI framework. In Proceedings of Supercomputing, 2001.

