Parallel Algorithms
Research Computing
UNC - Chapel Hill
Instructor: Mark Reed
Email: [email protected]
Overview
Parallel Algorithms
Parallel Random Numbers
Application Scaling
MPI Bandwidth
Domain Decomposition
Partition data across processors
Most widely used
“Owner” computes
credit: George Karypis – Principles of Parallel Algorithm Design
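A minimal sketch of the owner-computes idea for a 1-D block decomposition, assuming MPI (the library used later in these slides); the problem size N and the printout are illustrative only:

/* Minimal sketch: 1-D block decomposition in MPI.  Each rank computes the
 * index range it "owns" and updates only those entries (owner computes). */
#include <stdio.h>
#include <mpi.h>

#define N 1000  /* global problem size (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Block decomposition: the first (N % nprocs) ranks get one extra element. */
    int base = N / nprocs, rem = N % nprocs;
    int mycount = base + (rank < rem ? 1 : 0);
    int mystart = rank * base + (rank < rem ? rank : rem);

    printf("rank %d owns indices [%d, %d)\n", rank, mystart, mystart + mycount);

    /* ... allocate and update only the owned block here ... */

    MPI_Finalize();
    return 0;
}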
Dense Matrix Multiply
Data sharing for MM with different partitioning
The shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C.
credit: George Karypis – Principles of Parallel Algorithm Design
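A hedged sketch of the simplest (1-D, row-block) partitioning from the figure, assuming MPI: each process owns a block of rows of A and C and needs all of B, which is broadcast; the dimension N and the data are placeholders.

/* Row-block matrix multiply: owner computes its block of rows of C. */
#include <stdlib.h>
#include <mpi.h>

#define N 512   /* global matrix dimension; assume it divides evenly by nprocs */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;                      /* rows of A and C owned here */
    double *A = malloc(rows * N * sizeof *A);   /* local row block of A       */
    double *B = malloc(N * N * sizeof *B);      /* every rank needs all of B  */
    double *C = calloc(rows * N, sizeof *C);    /* local row block of C       */

    /* Illustrative data; a real code would read and scatter the matrices. */
    for (int i = 0; i < rows * N; i++) A[i] = 1.0;
    if (rank == 0)
        for (int i = 0; i < N * N; i++) B[i] = 1.0;
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Owner computes: each rank forms only the rows of C it owns. */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}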
Parallel Sum
Sum for Nprocs=8
Complete after log(Nprocs) steps
credit: Designing and Building Parallel Programs – Ian Foster
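A sketch of the tree sum written out by hand so the log(Nprocs) steps are visible; in practice MPI_Reduce or MPI_Allreduce provides the same operation, and the per-rank value here is only illustrative:

/* Binomial-tree sum: at step s, ranks whose s-th bit is set send their
 * partial sum to rank - 2^s and drop out; the others receive and accumulate. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = rank + 1.0;    /* each rank's partial value */

    for (int step = 1; step < nprocs; step <<= 1) {
        if (rank & step) {
            MPI_Send(&local, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;                 /* this rank is done */
        } else if (rank + step < nprocs) {
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += other;
        }
    }

    if (rank == 0)
        printf("tree sum = %g (expected %g)\n",
               local, nprocs * (nprocs + 1) / 2.0);

    MPI_Finalize();
    return 0;
}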
Master/Workers Model
Often embarrassingly parallel
Master:
• decomposes the problem into small tasks
• distributes to workers
• gathers partial results to produce the final result
Workers:
• work
• pass results back to master
• request more work (optional)
Mapping/Load Balancing
• Static
• Dynamic
(Diagram: one master coordinating four workers; a minimal MPI sketch follows.)
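A minimal MPI sketch of the dynamic master/worker pattern: each task is just an integer index, the "work" squares it, and workers implicitly request more work by returning a result; tags, task count, and payloads are placeholders.

/* Dynamic master/worker scheduling in MPI. */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_DONE 2

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                         /* master */
        int next = 0, active = 0;
        double total = 0.0, result;
        MPI_Status st;

        /* Seed each worker with one task. */
        for (int w = 1; w < nprocs && next < NTASKS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

        /* Tell any leftover workers (if NTASKS < nprocs-1) there is no work. */
        for (int w = active + 1; w < nprocs; w++)
            MPI_Send(&next, 0, MPI_INT, w, TAG_DONE, MPI_COMM_WORLD);

        /* Collect a result, then hand that worker more work (or stop it). */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                         MPI_COMM_WORLD);
                active--;
            }
        }
        printf("master: total = %g\n", total);
    } else {                                 /* worker */
        int task;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            double result = (double)task * task;   /* the "work" */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}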
Master/Workers Load Balance
Iterations may have different and unpredictable run times
• Systematic variance
• Algorithmic variance
Goal is to trade off load balance against scheduling overhead
Some schemes:
• Block decomposition, static chunking
• Round-robin decomposition
• Self scheduling
  • assign one iteration at a time
• Guided dynamic self-scheduling
  • assign 1/P of the remaining iterations (P = # procs); see the sketch below
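A small sketch (plain C, illustrative values) of the chunk sizes guided dynamic self-scheduling hands out: each request gets roughly 1/P of the iterations still unassigned, so chunks start large and shrink toward one iteration.

/* Print the chunk sizes a guided self-scheduler would hand out. */
#include <stdio.h>

int main(void)
{
    int niters = 1000, nprocs = 8;   /* illustrative values */
    int remaining = niters, chunk_no = 0;

    while (remaining > 0) {
        int chunk = (remaining + nprocs - 1) / nprocs;  /* ceil(remaining/P) */
        printf("chunk %2d: %3d iterations\n", chunk_no++, chunk);
        remaining -= chunk;
    }
    return 0;
}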
Functional Parallelism
map tasks onto sets of processors
further decompose each function over data domain
credit: Designing and Building Parallel Programs – Ian Foster
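A hedged sketch of this mapping with MPI: MPI_Comm_split places disjoint sets of ranks into separate sub-communicators, one per function, and each function's data is then decomposed within its own group; the two "functions" here are placeholders.

/* Split COMM_WORLD into groups, one per function; decompose within each group. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* First half of the ranks run function A, second half function B. */
    int color = (rank < nprocs / 2) ? 0 : 1;
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

    int subrank, subsize;
    MPI_Comm_rank(subcomm, &subrank);
    MPI_Comm_size(subcomm, &subsize);

    /* Within each group the function's domain is decomposed as usual,
       e.g. a block decomposition over subrank/subsize. */
    printf("world rank %d -> function %c, local rank %d of %d\n",
           rank, color ? 'B' : 'A', subrank, subsize);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}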
Recursive Bisection
Orthogonal Recursive Bisection (ORB)
• good for decomposing irregular grids with mostly local communication
• partition the domain into equal parts of work by successively subdividing along orthogonal coordinate directions
• the cutting direction is varied at each level of the recursion; ORB partitioning is restricted to p = 2^k processors
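An illustrative serial sketch of the ORB idea (not taken from the paper below): split at the median along one coordinate, alternate the cut direction, and recurse k times to obtain p = 2^k equal-work partitions; the point set and equal weights are placeholders.

/* Orthogonal recursive bisection of 2-D points with equal weights. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; } Point;

static int cut_dim;   /* 0 = cut along x, 1 = cut along y */

static int cmp(const void *a, const void *b)
{
    const Point *p = a, *q = b;
    double da = cut_dim ? p->y : p->x;
    double db = cut_dim ? q->y : q->x;
    return (da > db) - (da < db);
}

/* Assign points[lo..hi) to partitions [first, first + 2^levels). */
static void orb(Point *pts, int lo, int hi, int levels, int first,
                int *part, int dim)
{
    if (levels == 0) {
        for (int i = lo; i < hi; i++) part[i] = first;
        return;
    }
    cut_dim = dim;
    qsort(pts + lo, hi - lo, sizeof(Point), cmp);   /* median split via sort */
    int mid = lo + (hi - lo) / 2;
    orb(pts, lo, mid, levels - 1, first, part, 1 - dim);
    orb(pts, mid, hi, levels - 1, first + (1 << (levels - 1)), part, 1 - dim);
}

int main(void)
{
    enum { N = 16 };
    Point pts[N];
    int part[N];
    for (int i = 0; i < N; i++) {                 /* random test points */
        pts[i].x = rand() / (double)RAND_MAX;
        pts[i].y = rand() / (double)RAND_MAX;
    }
    orb(pts, 0, N, 2, 0, part, 0);                /* 2 levels -> 4 partitions */
    for (int i = 0; i < N; i++)
        printf("(%.2f, %.2f) -> partition %d\n", pts[i].x, pts[i].y, part[i]);
    return 0;
}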
ORB Example – Groundwater modeling at UNC-Ch
Geometry of the homogeneous sphere-packed medium (a) 3D isosurface view; and (b) 2D cross section view. Blue and white areas stand for solid and fluid spaces, respectively.
“A high-performance lattice Boltzmann implementation to model flow in porous media” by Chongxun Pan, Jan F. Prins, and Cass T. Miller
Two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; and (right) orthogonal recursive bisection (ORB) decomposition.
Parallel Random Numbers
Example: Parallel Monte Carlo
Additional Requirements:
• usable for an arbitrary (large) number of processors
• pseudo-random across processors – streams uncorrelated
• generated independently for efficiency
Rule of thumb
• max usable sample size is at most the square root of the period
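A hedged sketch of one way to give each rank a piece of a single stream: "leapfrog" a linear congruential generator so rank r draws every Nprocs-th number of the global sequence. The LCG constants are common textbook values chosen for illustration; a production Monte Carlo code would use a tested library such as SPRNG (next slide).

/* Leapfrog partitioning of one LCG stream across MPI ranks:
 * rank r draws x_r, x_{r+P}, x_{r+2P}, ...  Illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <mpi.h>

#define LCG_A 1103515245ULL   /* multiplier */
#define LCG_C 12345ULL        /* increment  */
#define LCG_M (1ULL << 31)    /* modulus    */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Jump constants for a stride of P ranks:
       A = a^P mod m,  C = c*(a^{P-1} + ... + a + 1) mod m. */
    uint64_t A = 1, C = 0;
    for (int i = 0; i < nprocs; i++) {
        C = (A * LCG_C + C) % LCG_M;
        A = (A * LCG_A) % LCG_M;
    }

    /* Advance a common seed to this rank's starting point. */
    uint64_t x = 42;                      /* global seed */
    for (int i = 0; i < rank; i++)
        x = (LCG_A * x + LCG_C) % LCG_M;

    /* Each call now strides by nprocs through the global sequence. */
    double sum = 0.0;
    for (int i = 0; i < 1000; i++) {
        x = (A * x + C) % LCG_M;
        sum += (double)x / (double)LCG_M;
    }
    printf("rank %d: mean of 1000 leapfrogged draws = %f\n", rank, sum / 1000.0);

    MPI_Finalize();
    return 0;
}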
Parallel Random Numbers
Scalable Parallel Random Number Generators Library (SPRNG)
• free and source available
• collects 5 RNGs together in one package
• http://sprng.cs.fsu.edu
QCD Application
MILC
• (MIMD Lattice Computation)
quarks and gluons formulated on a space-time lattice
mostly asynchronous PTP communication
• MPI_Send_init, MPI_Start, MPI_Startall
• MPI_Recv_init, MPI_Wait, MPI_Waitall
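A hedged sketch (not taken from MILC itself) of the persistent point-to-point pattern these calls imply: set the requests up once with MPI_Send_init/MPI_Recv_init, then just MPI_Startall and MPI_Waitall inside the iteration loop, shaped here as a 1-D periodic halo exchange.

/* Persistent point-to-point communication reused every iteration. */
#include <mpi.h>

#define NLOCAL 1024
#define NITER  100

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sendbuf[NLOCAL], recvbuf[NLOCAL];
    int right = (rank + 1) % nprocs;          /* periodic neighbors */
    int left  = (rank - 1 + nprocs) % nprocs;

    /* Set the communication up once ... */
    MPI_Request reqs[2];
    MPI_Recv_init(recvbuf, NLOCAL, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Send_init(sendbuf, NLOCAL, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int it = 0; it < NITER; it++) {
        /* ... fill sendbuf for this iteration ... */
        MPI_Startall(2, reqs);                /* ... then restart it each step */
        /* ... overlap local computation here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* ... use recvbuf ... */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}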
MILC – Strong Scaling
UNC Capability Computing - Topsail
Compute nodes: 520 dual socket, quad core Intel “Clovertown” processors
• 4M L2 cache per socket
• 2.66 GHz processors
• 4160 processors
12 GB memory/node
Shared Disk : 39TB IBRIX Parallel File System
Interconnect: Infiniband
64 bit OS
cluster photos: Scott Sawyer, Dell
MPI PTP on baobab
Need large messages to achieve high rates
Latency cost dominates small messages
MPI_Send crossover from buffered to synchronous
These are instructional only
• not a benchmark
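A minimal ping-pong sketch of the kind used to produce such curves, assuming MPI: time round trips between ranks 0 and 1 over a range of message sizes and report one-way latency and bandwidth; the sizes and repeat count are illustrative.

/* Ping-pong latency/bandwidth measurement between ranks 0 and 1. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NREPS 100

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    for (int nbytes = 1; nbytes <= (1 << 22); nbytes *= 4) {
        char *buf = malloc(nbytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * NREPS);  /* one-way time */
        if (rank == 0)
            printf("%8d bytes: %10.2f us, %8.2f MB/s\n",
                   nbytes, t * 1e6, nbytes / t / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}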
MPI PTP on Topsail
InfiniBand (IB) interconnect
Note higher bandwidth
lower latency
Two modes of standard send
Community Atmosphere Model (CAM)
global atmosphere model for weather and climate research communities (from NCAR)
atmospheric component of Community Climate System Model (CCSM)
hybrid MPI/OpenMP
• run here with MPI only
running Eulerian dynamical core with spectral truncation of 31 or 42
T31: 48x96x26 (lat x lon x nlev)
T42: 64x128x26
the spectral dynamical core is domain decomposed over latitude only
CAM Performance
(Performance plots for the T31 and T42 resolutions)