
Page 1

Parallel Algorithms

Research Computing

UNC - Chapel Hill

Instructor: Mark Reed

Email: markreed@unc.edu

Page 2

Overview

Parallel Algorithms

Parallel Random Numbers

Application Scaling

MPI Bandwidth

Page 3

Domain Decomposition

Partition data across processors

Most widely used

“Owner computes”: the process that owns a piece of data performs the computations that update it (see the sketch below)

credit: George Karypis – Principles of Parallel Algorithm Design
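Below is a minimal MPI sketch of this pattern, assuming a 1D block decomposition: each rank "owns" a contiguous slice of a global array and performs only the updates on that slice. The global size N and the update itself are illustrative, not from the slides.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000   /* illustrative global problem size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* block decomposition: this rank owns global indices [lo, hi) */
    int chunk = (N + nprocs - 1) / nprocs;
    int lo = rank * chunk;
    int hi = lo + chunk;
    if (lo > N) lo = N;
    if (hi > N) hi = N;

    /* owner computes: each rank updates only the data it owns */
    double *local = malloc((hi - lo) * sizeof(double));
    for (int i = lo; i < hi; i++)
        local[i - lo] = 2.0 * i;   /* stand-in for the real update */

    printf("rank %d owns [%d, %d)\n", rank, lo, hi);
    free(local);
    MPI_Finalize();
    return 0;
}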

Page 4

Dense Matrix Multiply

Data sharing for matrix multiply with different partitionings

The shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C.

credit: George Karypis – Principles of Parallel Algorithm Design
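To make the data-sharing point concrete, here is a sketch of the simplest case, a 1D row-block partitioning in MPI: each process computes one block of rows of C, for which it needs the matching rows of A but all of B, so B is broadcast to every rank. The matrix size, the identity-matrix test data, and the assumption that N divides evenly among the ranks are all illustrative.

#include <mpi.h>
#include <stdio.h>

#define N 8   /* illustrative dimension; assumes N % nprocs == 0 */

int main(int argc, char **argv)
{
    int rank, nprocs;
    static double A[N][N], B[N][N], C[N][N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;      /* rows of A and C owned by each rank */
    int r0 = rank * rows;

    if (rank == 0)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = (i == j);   /* identity, for easy checking */
            }

    /* each rank needs only its rows of A, but all of B */
    void *recvbuf = (rank == 0) ? MPI_IN_PLACE : (void *)A;
    MPI_Scatter(A, rows * N, MPI_DOUBLE, recvbuf, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* owner computes: fill in this rank's block of rows of C */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }

    printf("rank %d computed rows [%d, %d) of C\n", rank, r0, r0 + rows);
    MPI_Finalize();
    return 0;
}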

Page 5

Dense Matrix Multiply

Page 6

Parallel Sum

Sum for Nprocs = 8

Completes after log2(Nprocs) steps (see the sketch below)

credit: Designing and Building Parallel Programs – Ian Foster
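A sketch of the tree pattern in MPI, with made-up local values so the result is checkable: at each step, half of the remaining ranks send their partial sums one level down the tree and drop out, so rank 0 holds the total after log2(Nprocs) steps. In practice MPI_Reduce performs exactly this kind of reduction; the loop is written out only to expose the structure.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sum = rank + 1.0;   /* stand-in for each rank's local value */

    /* at step s, ranks with bit s set send to (rank - s) and retire */
    for (int step = 1; step < nprocs; step <<= 1) {
        if (rank & step) {
            MPI_Send(&sum, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (rank + step < nprocs) {
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += other;
        }
    }

    if (rank == 0)   /* 1 + 2 + ... + Nprocs */
        printf("total = %g (expected %g)\n",
               sum, nprocs * (nprocs + 1) / 2.0);
    MPI_Finalize();
    return 0;
}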

Page 7

Master/Workers Model

Often embarrassingly parallel

Master:
• decomposes the problem into small tasks
• distributes them to workers
• gathers partial results to produce the final result

Workers:
• do the work
• pass results back to the master
• request more work (optional)

Mapping/Load Balancing:
• static
• dynamic

[Slide figure: one master process connected to four worker processes; a minimal MPI sketch of this model follows]
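A minimal sketch of the dynamic variant in MPI; the task count and the work function are illustrative stand-ins. The master seeds every worker with one task, then hands the next task to whichever worker returns a result first, which is the "request more work" loop from the list above.

#include <mpi.h>
#include <stdio.h>

#define NTASKS   100   /* illustrative number of tasks */
#define TAG_WORK 1
#define TAG_STOP 2

static double do_task(int t) { return (double)t * t; }  /* stand-in work */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                        /* master */
        double total = 0.0, result;
        int next = 0, active = 0;
        MPI_Status st;

        /* seed each worker with one task */
        for (int w = 1; w < nprocs && next < NTASKS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

        /* gather results; reply with more work or a stop message */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
        printf("total = %g\n", total);
    } else {                                /* worker */
        int task;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double r = do_task(task);       /* work, then return result */
            MPI_Send(&r, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}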

Page 8

Master/Workers Load Balance

Iterations may have different and unpredictable run times:
• systematic variance
• algorithmic variance

The goal is to balance load balance against scheduling overhead.

Some schemes:
• block decomposition (static chunking)
• round-robin decomposition
• self-scheduling: assign one iteration at a time
• guided dynamic self-scheduling: assign 1/P of the remaining iterations (P = # procs), as in the sketch below
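A serial sketch of the guided schedule, just to show how the chunk sizes taper (the iteration count and P are illustrative): early chunks are large, keeping scheduling overhead low, while late chunks are small, so no process is left holding a big remainder.

#include <stdio.h>

int main(void)
{
    int remaining = 1000;   /* illustrative total iteration count */
    int P = 4;              /* number of processes */

    /* guided self-scheduling: each assignment takes 1/P of whatever
       iterations remain, but never less than one iteration */
    while (remaining > 0) {
        int chunk = remaining / P;
        if (chunk < 1) chunk = 1;
        printf("assign %4d iterations, %4d left\n", chunk, remaining - chunk);
        remaining -= chunk;
    }
    return 0;
}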

Page 9

Functional Parallelism

map tasks onto sets of processors

each function can be further decomposed over its data domain

credit: Designing and Building Parallel Programs – Ian Foster

Page 10

Recursive Bisection

Orthogonal Recursive Bisection (ORB)

• good for decomposing irregular grids with mostly local communication

• partitions the domain into equal parts of work by successively subdividing along orthogonal coordinate directions

• the cutting direction is varied at each level of the recursion; ORB partitioning is restricted to p = 2^k processors (see the sketch below)
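A serial sketch of ORB on a set of 2D points; the random point set and the equal-count median split are simplifying assumptions (real codes split by equal work and run the recursion in parallel). Note the two hallmarks from the list above: the cut axis alternates with recursion depth, and the number of partitions is a power of two.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; } Point;

static int axis;   /* current cutting direction: 0 = x, 1 = y */

static int cmp(const void *a, const void *b)
{
    const Point *p = a, *q = b;
    double d = axis ? p->y - q->y : p->x - q->x;
    return (d > 0) - (d < 0);
}

/* recursively bisect pts[0..n) into 2^depth parts, alternating the
   cutting direction at each level of the recursion */
static void orb(Point *pts, int n, int depth, int part)
{
    if (depth == 0) {
        printf("partition %2d gets %d points\n", part, n);
        return;
    }
    axis = depth % 2;              /* vary the cut direction per level */
    qsort(pts, n, sizeof(Point), cmp);
    int half = n / 2;              /* median cut: equal halves of work */
    orb(pts, half, depth - 1, 2 * part);
    orb(pts + half, n - half, depth - 1, 2 * part + 1);
}

int main(void)
{
    enum { N = 1000 };
    static Point pts[N];
    for (int i = 0; i < N; i++) {
        pts[i].x = rand() / (double)RAND_MAX;
        pts[i].y = rand() / (double)RAND_MAX;
    }
    orb(pts, N, 4, 0);   /* 2^4 = 16 partitions, as on the next slide */
    return 0;
}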

Page 11

ORB Example – Groundwater Modeling at UNC-CH

Geometry of the homogeneous sphere-packed medium: (a) 3D isosurface view; (b) 2D cross-section view. Blue and white areas represent solid and fluid spaces, respectively.

“A high-performance lattice Boltzmann implementation to model flow in porous media” by Chongxun Pan, Jan F. Prins, and Cass T. Miller

Two-dimensional examples of non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; (right) orthogonal recursive bisection (ORB) decomposition.

Page 12

Parallel Random Numbers

Example: Parallel Monte Carlo

Additional requirements:

• usable for an arbitrary (large) number of processors

• pseudo-random across processors – streams uncorrelated

• streams generated independently, for efficiency

Rule of thumb:

• the maximum usable sample size is at most the square root of the generator's period
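To see what these requirements are protecting against, here is a sketch of a parallel Monte Carlo estimate of pi in which each rank naively seeds the C library generator with its rank. It runs, but nothing guarantees the per-rank streams are uncorrelated or non-overlapping, which is exactly what the parallel generator library on the next slide provides. All constants are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* naive per-rank seeding: simple, but the streams may overlap or
       correlate -- the motivation for a parallel RNG library */
    srand(12345 + rank);

    long n = 1000000, local_hits = 0, hits = 0;   /* samples per rank */
    for (long i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) local_hits++;   /* inside quarter circle */
    }

    MPI_Reduce(&local_hits, &hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * hits / (double)(n * nprocs));
    MPI_Finalize();
    return 0;
}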

Page 13

Parallel Random Numbers

Scalable Parallel Random Number Generators Library (SPRNG)

• free, with source available

• collects 5 RNGs together in one package

• http://sprng.cs.fsu.edu

Page 14

QCD Application

MILC (MIMD Lattice Computation)

quarks and gluons formulated on a space-time lattice

mostly asynchronous PTP communication, using MPI's persistent requests (see the sketch below):

• MPI_Send_init, MPI_Start, MPI_Startall

• MPI_Recv_init, MPI_Wait, MPI_Waitall
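A minimal sketch of the persistent-request pattern on a ring of processes (buffer size and iteration count are illustrative, not MILC's actual communication): the send and receive "channels" are set up once, and each iteration then pays only for MPI_Startall and MPI_Waitall.

#include <mpi.h>

#define NDATA 100   /* illustrative message length */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;
    int left  = (rank + nprocs - 1) % nprocs;
    double sendbuf[NDATA], recvbuf[NDATA];
    MPI_Request reqs[2];

    /* set up the communication pattern once */
    MPI_Send_init(sendbuf, NDATA, MPI_DOUBLE, right, 0,
                  MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, NDATA, MPI_DOUBLE, left, 0,
                  MPI_COMM_WORLD, &reqs[1]);

    for (int iter = 0; iter < 10; iter++) {   /* illustrative time steps */
        for (int i = 0; i < NDATA; i++)
            sendbuf[i] = rank + iter;         /* refill before starting */
        MPI_Startall(2, reqs);                /* restart both requests */
        /* ... compute on data not involved in the exchange here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* recvbuf now holds the left neighbor's data for this step */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}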

Page 15

MILC – Strong Scaling

Page 16

MILC – Strong Scaling

Page 17

UNC Capability Computing – Topsail

Compute nodes: 520 dual-socket, quad-core Intel “Clovertown” nodes

• 4 MB L2 cache per socket

• 2.66 GHz processors

• 4160 processor cores total

12 GB memory/node

Shared disk: 39 TB IBRIX parallel file system

Interconnect: InfiniBand

64-bit OS

cluster photos: Scott Sawyer, Dell

Page 18

MPI PTP on baobab

Large messages are needed to achieve high transfer rates

Latency cost dominates for small messages

MPI_Send crosses over from buffered to synchronous mode

These curves are instructional only, not a benchmark (a sketch of the underlying measurement follows)
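For reference, a sketch of the kind of ping-pong timing behind such plots (message sizes and repetition count are illustrative): bandwidth is message size divided by one-way transit time, so the fixed latency term dominates until the messages get large.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with exactly 2 ranks */

    for (int bytes = 8; bytes <= (1 << 22); bytes <<= 1) {
        char *buf = malloc(bytes);
        int reps = 100;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {            /* ping */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* pong */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* half the round trip, averaged over reps = one-way time */
        double t = (MPI_Wtime() - t0) / (2.0 * reps);
        if (rank == 0)
            printf("%8d bytes  %10.2f MB/s\n", bytes, bytes / t / 1.0e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}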

Page 19

MPI PTP on Topsail

InfiniBand (IB) interconnect

Note the higher bandwidth and lower latency

Two modes of standard send are visible

Page 20

Community Atmosphere Model (CAM)

global atmosphere model for the weather and climate research communities (from NCAR)

atmospheric component of the Community Climate System Model (CCSM)

hybrid MPI/OpenMP

• run here with MPI only

running the Eulerian dynamical core with spectral truncation of 31 or 42

T31: 48 x 96 x 26 (lat x lon x nlev)

T42: 64 x 128 x 26

spectral dynamical cores are domain decomposed over latitude only (see the sketch below)
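A toy sketch of why latitude-only decomposition caps scalability; the grid sizes come from the slide, but the block arithmetic is a generic illustration, not CAM's actual code. With T31's 48 latitudes, there is simply no work left to hand out beyond 48 MPI ranks, and load imbalance appears well before that.

#include <stdio.h>

int main(void)
{
    int nlat = 48;   /* T31 latitude count; T42 has 64 */

    for (int P = 8; P <= 64; P *= 2) {
        /* block-distribute nlat latitudes over P ranks */
        int per_rank = (nlat + P - 1) / P;              /* ceiling */
        int busy = (nlat + per_rank - 1) / per_rank;    /* ranks with work */
        printf("P=%2d: up to %d latitude(s)/rank, %2d of %2d ranks busy\n",
               P, per_rank, busy, P);
    }
    return 0;
}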

Page 21

CAM Performance

[Slide figures: CAM performance results at the T31 and T42 resolutions]