
Parallel Algorithms

Research Computing

UNC - Chapel Hill

Instructor: Mark Reed

Email: markreed@unc.edu


Overview

Parallel Algorithms

Parallel Random Numbers

Application Scaling

MPI Bandwidth


Domain Decomposition

Partition data across processors

Most widely used

“Owner computes” rule: the process that owns a piece of data performs the computations that update it

credit: George Karypis – Principles of Parallel Algorithm Design
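As an illustration of a 1D block decomposition with the owner-computes rule, here is a minimal MPI sketch (the global size N and the block formula are assumptions for the example, not from the slides): each rank works out the index range it owns and would update only those elements.

```c
/* Minimal sketch (illustrative sizes): 1D block decomposition with the
 * "owner computes" rule -- each rank updates only the slice it owns. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs, N = 1000;                  /* N = global problem size */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* block decomposition: the first (N % nprocs) ranks get one extra element */
    int base = N / nprocs, rem = N % nprocs;
    int nlocal = base + (rank < rem ? 1 : 0);
    int start  = rank * base + (rank < rem ? rank : rem);

    printf("rank %d owns global indices [%d, %d)\n", rank, start, start + nlocal);

    MPI_Finalize();
    return 0;
}
```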


Dense Matrix Multiply

Data sharing for matrix multiply with different partitionings

The shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C.

credit: George Karypis – Principles of Parallel Algorithm Design
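A minimal serial sketch of the data-sharing point for a 1D row-wise partitioning (the 8x8 size and the identity-matrix check are illustrative assumptions): the process that owns rows [r0, r1) of C needs only the same rows of A, but all of B.

```c
/* Sketch: compute one process's block of C = A * B under a 1D row partition.
 * Only rows [r0, r1) of A are needed, but every column of B is touched. */
#include <stdio.h>

#define N 8

void local_matmul(const double A[N][N], const double B[N][N],
                  double C[N][N], int r0, int r1)
{
    for (int i = r0; i < r1; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

int main(void)
{
    static double A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + j; B[i][j] = (i == j); }

    local_matmul(A, B, C, 2, 4);        /* the process owning rows 2-3 of C */
    printf("C[2][5] = %g\n", C[2][5]);  /* equals A[2][5] since B is identity */
    return 0;
}
```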


Parallel Sum

Sum for Nprocs = 8

Complete after log2(Nprocs) steps (see the sketch below)

credit: Designing and Building Parallel Programs – Ian Foster
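A minimal MPI sketch of the tree sum (illustrative only, not from the slides): partial sums are combined pairwise, so rank 0 holds the total after about log2(Nprocs) steps.

```c
/* Tree-based parallel sum: at step s, ranks with bit s set send their partial
 * sum to the partner with that bit cleared, then drop out of the reduction. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = rank + 1.0;          /* each process contributes one value */

    for (int step = 1; step < nprocs; step <<= 1) {
        if (rank & step) {
            MPI_Send(&local, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;                      /* this rank is done */
        } else if (rank + step < nprocs) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += partial;
        }
    }
    if (rank == 0)
        printf("sum = %g (expected %g)\n", local, nprocs * (nprocs + 1) / 2.0);

    MPI_Finalize();
    return 0;
}
```

In practice this pattern is exactly what MPI_Reduce with MPI_SUM provides.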


Master/Workers Model

Often embarrassingly parallel

Master:
• decomposes the problem into small tasks
• distributes them to workers
• gathers partial results to produce the final result

Workers:
• do the work
• pass results back to the master
• request more work (optional)

Mapping/Load Balancing:
• Static
• Dynamic

[Diagram: one master process connected to four worker processes]
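A hedged MPI sketch of the master/workers pattern with dynamic (self-scheduled) mapping; the "task" here, squaring an integer, is a stand-in for real work.

```c
/* Master hands out one task at a time and collects results; a zero-length
 * message with TAG_STOP tells a worker to quit. */
#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs, ntasks = 100;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                              /* master */
        int next = 0, active = 0, result;
        long sum = 0;
        MPI_Status st;
        /* prime each worker with one task; stop any workers left over */
        for (int w = 1; w < nprocs; w++) {
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(NULL, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* gather results and hand out the remaining tasks one at a time */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            sum += result;
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(NULL, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
        printf("sum of squares of 0..%d = %ld\n", ntasks - 1, sum);
    } else {                                      /* worker */
        int task, result;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = task * task;                 /* do the "work" */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```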


Master/Workers Load Balance

Iterations may have different and unpredictable run times
• Systematic variance
• Algorithmic variance

Goal is to balance the load without adding too much scheduling overhead

Some schemes:
• Block decomposition (static chunking)
• Round-robin decomposition
• Self-scheduling: assign one iteration at a time
• Guided dynamic self-scheduling: assign 1/P of the remaining iterations (P = # procs; see the sketch below)
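A small sketch of the guided dynamic self-scheduling rule above (loop size and processor count are illustrative): each request gets 1/P of the iterations that remain, so early chunks are large and later chunks shrink, which smooths out load imbalance near the end.

```c
/* Print the chunk sizes guided self-scheduling would hand out. */
#include <stdio.h>

int main(void)
{
    int remaining = 1000, P = 8;       /* illustrative loop size and proc count */
    while (remaining > 0) {
        int chunk = remaining / P;
        if (chunk < 1) chunk = 1;      /* never assign less than one iteration */
        printf("assign %3d iterations, %4d remain\n", chunk, remaining - chunk);
        remaining -= chunk;
    }
    return 0;
}
```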


Functional Parallelism

map tasks onto sets of processors

further decompose each function over data domain

credit: Designing and Building Parallel Programs – Ian Foster
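One common way to map task groups onto sets of processors in MPI is MPI_Comm_split; this hedged sketch splits the world into two components (the half-and-half assignment is an assumption for illustration), each of which could then decompose its own data domain inside its sub-communicator.

```c
/* Split MPI_COMM_WORLD into two task groups, one per functional component. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int color = (rank < nprocs / 2) ? 0 : 1;   /* 0 = first component, 1 = second */
    MPI_Comm task_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &task_comm);

    int trank, tsize;
    MPI_Comm_rank(task_comm, &trank);
    MPI_Comm_size(task_comm, &tsize);
    printf("world rank %d -> component %d, rank %d of %d\n",
           rank, color, trank, tsize);

    MPI_Comm_free(&task_comm);
    MPI_Finalize();
    return 0;
}
```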


Recursive Bisection

Orthogonal Recursive Bisection (ORB)

• good for decomposing irregular grids with mostly local communication

• partitions the domain into equal parts of work by successively subdividing along orthogonal coordinate directions

• the cutting direction is varied at each level of the recursion; ORB partitioning is restricted to p = 2^k processors
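A serial sketch of ORB on a small point set (unit work per point is an assumption): split at the work median, alternate the cut axis, and recurse until there is one part per processor.

```c
/* Assign 2D points to p = 2^k parts by recursive median bisection. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; int part; } Point;

static int cmp_x(const void *a, const void *b)
{
    double d = ((const Point *)a)->x - ((const Point *)b)->x;
    return (d > 0) - (d < 0);
}
static int cmp_y(const void *a, const void *b)
{
    double d = ((const Point *)a)->y - ((const Point *)b)->y;
    return (d > 0) - (d < 0);
}

/* Assign points [lo, hi) to partitions [first, first + nparts). */
static void orb(Point *pts, int lo, int hi, int first, int nparts, int axis)
{
    if (nparts == 1) {
        for (int i = lo; i < hi; i++) pts[i].part = first;
        return;
    }
    qsort(pts + lo, hi - lo, sizeof(Point), axis ? cmp_y : cmp_x);
    int mid = lo + (hi - lo) / 2;                 /* equal-work split */
    orb(pts, lo, mid, first, nparts / 2, !axis);  /* flip the cut direction */
    orb(pts, mid, hi, first + nparts / 2, nparts / 2, !axis);
}

int main(void)
{
    enum { N = 16, P = 4 };                       /* P must be a power of 2 */
    Point pts[N];
    for (int i = 0; i < N; i++) {
        pts[i].x = rand() % 100;
        pts[i].y = rand() % 100;
    }
    orb(pts, 0, N, 0, P, 0);
    for (int i = 0; i < N; i++)
        printf("(%4.0f, %4.0f) -> proc %d\n", pts[i].x, pts[i].y, pts[i].part);
    return 0;
}
```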


ORB Example – Groundwater modeling at UNC-Ch

Geometry of the homogeneous sphere-packed medium (a) 3D isosurface view; and (b) 2D cross section view. Blue and white areas stand for solid and fluid spaces, respectively.

“A high-performance lattice Boltzmann implementation to model flow in porous media” by Chongxun Pan, Jan F. Prins, and Cass T. Miller

Two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; and (right) orthogonal recursive bisection (ORB) decomposition.


Parallel Random Numbers

Example: Parallel Monte Carlo

Additional requirements:

• usable for an arbitrary (large) number of processors

• pseudo-random across processors – streams uncorrelated

• generated independently for efficiency

Rule of thumb:

• max usable sample size is at most the square root of the period
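For illustration only (this is not SPRNG, which follows on the next slide): a leapfrog split of a single 64-bit LCG gives process p the subsequence x_p, x_{p+P}, x_{p+2P}, ..., so each stream is generated independently with no communication. The constants are Knuth's MMIX LCG; this is a toy demo, not production-quality Monte Carlo.

```c
/* Leapfrog splitting of one LCG into P per-process streams (toy example).
 * Arithmetic is mod 2^64 via unsigned overflow. */
#include <stdio.h>
#include <stdint.h>

#define A 6364136223846793005ULL   /* Knuth MMIX multiplier */
#define C 1442695040888963407ULL   /* Knuth MMIX increment  */

int main(void)
{
    int P = 4;                     /* number of processes / streams */
    uint64_t seed = 12345;

    /* jump-ahead coefficients for taking P steps at once:
     * x_{n+P} = A^P * x_n + C * (A^{P-1} + ... + A + 1)   (mod 2^64) */
    uint64_t Ap = 1, Cp = 0;
    for (int i = 0; i < P; i++) { Cp = Cp * A + C; Ap *= A; }

    for (int p = 0; p < P; p++) {
        uint64_t x = seed;
        for (int i = 0; i < p; i++) x = A * x + C;   /* stream p's start value */
        printf("stream %d:", p);
        for (int n = 0; n < 3; n++) {
            x = Ap * x + Cp;                         /* next element, P steps on */
            printf(" %.6f", (x >> 11) * (1.0 / 9007199254740992.0)); /* -> [0,1) */
        }
        printf("\n");
    }
    return 0;
}
```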


Parallel Random Numbers

Scalable Parallel Random Number Generators Library (SPRNG)

• free and source available

• collects 5 RNGs together in one package

• http://sprng.cs.fsu.edu


QCD Application

MILC

• (MIMD Lattice Computation)

quarks and gluons formulated on a space-time lattice

mostly asynchronous point-to-point (PTP) communication, using persistent requests (see the sketch below)
• MPI_Send_init, MPI_Start, MPI_Startall
• MPI_Recv_init, MPI_Wait, MPI_Waitall
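A hedged sketch of the persistent point-to-point calls listed above; a simple ring exchange stands in for MILC's actual lattice halo exchange. The requests are built once, then started and waited on every iteration.

```c
/* Persistent point-to-point pattern: MPI_Send_init/MPI_Recv_init once,
 * MPI_Startall + MPI_Waitall each iteration. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs, left = (rank + nprocs - 1) % nprocs;
    double sendbuf[100], recvbuf[100];
    MPI_Request reqs[2];

    /* build the persistent requests once, outside the iteration loop */
    MPI_Send_init(sendbuf, 100, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, 100, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int iter = 0; iter < 10; iter++) {
        for (int i = 0; i < 100; i++) sendbuf[i] = rank + iter;  /* fill buffer */
        MPI_Startall(2, reqs);            /* kick off the send and the receive */
        /* ... overlap local computation here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}
```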


MILC – Strong Scaling

[Figures: MILC strong-scaling plots]

UNC Capability Computing - Topsail

Compute nodes: 520 dual-socket, quad-core Intel “Clovertown” nodes
• 4 MB L2 cache per socket
• 2.66 GHz processors
• 4160 processor cores in total
• 12 GB memory/node

Shared disk: 39 TB IBRIX parallel file system

Interconnect: InfiniBand

64-bit OS

cluster photos: Scott Sawyer, Dell


MPI PTP on baobab

Need large messages to achieve high transfer rates

Latency cost dominates for small messages

MPI_Send crosses over from buffered (eager) to synchronous (rendezvous) delivery at a threshold message size

These numbers are instructional only
• not a benchmark (see the ping-pong sketch below)
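A minimal ping-pong sketch, in the same instructional (not benchmark) spirit: timing round trips between ranks 0 and 1 at growing message sizes shows latency dominating small messages and bandwidth saturating only for large ones. Run with at least two ranks.

```c
/* Ping-pong timing between ranks 0 and 1 over a range of message sizes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int maxbytes = 1 << 20;
    char *buf = malloc(maxbytes);

    for (int nbytes = 8; nbytes <= maxbytes; nbytes *= 2) {
        int reps = 100;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes  %10.2f us  %8.1f MB/s\n",
                   nbytes, t * 1e6, nbytes / t / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```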


MPI PTP on Topsail

InfiniBand (IB) interconnect

Note the higher bandwidth and lower latency compared to baobab

Two modes of standard send are visible (eager vs. rendezvous)


Community Atmosphere Model (CAM)

global atmosphere model for weather and climate research communities (from NCAR)

atmospheric component of Community Climate System Model (CCSM)

hybrid MPI/OpenMP

• run here with MPI only

running the Eulerian dynamical core with spectral truncation T31 or T42

T31: 48x96x26 (lat x lon x nlev)

T42: 64x128x26

the spectral dynamical core is domain-decomposed over latitude only (see the sketch below)
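A small sketch of what latitude-only decomposition implies (the 16-rank count is an assumption for illustration): at most nlat ranks, 48 for T31 or 64 for T42, can hold a nonempty latitude block, which caps the useful MPI process count.

```c
/* Block decomposition of latitudes among MPI ranks for an assumed T42 run. */
#include <stdio.h>

int main(void)
{
    int nlat = 64, nprocs = 16;              /* T42 has 64 latitudes */
    int base = nlat / nprocs, rem = nlat % nprocs;
    for (int rank = 0; rank < nprocs; rank++) {
        int n = base + (rank < rem ? 1 : 0);
        int start = rank * base + (rank < rem ? rank : rem);
        printf("rank %2d: latitudes %2d..%2d\n", rank, start, start + n - 1);
    }
    return 0;
}
```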


CAM Performance

[Figures: CAM scaling results for T31 and T42]
