Parallel Algorithms
Research Computing
UNC - Chapel Hill
Instructor: Mark Reed
Email: [email protected]
Overview
Parallel Algorithms
Parallel Random Numbers
Application Scaling
MPI Bandwidth
Domain Decomposition
Partition data across processors
Most widely used
“Owner” computes
credit: George Karypis – Principles of Parallel Algorithm Design
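A minimal sketch of the owner-computes idea for a 1-D block decomposition, assuming MPI (the library used later in these slides); the problem size N and the printout are illustrative only:

/* Minimal sketch: 1-D block decomposition in MPI.  Each rank computes the
 * index range it "owns" and updates only those entries (owner computes). */
#include <stdio.h>
#include <mpi.h>

#define N 1000  /* global problem size (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Block decomposition: the first (N % nprocs) ranks get one extra element. */
    int base = N / nprocs, rem = N % nprocs;
    int mycount = base + (rank < rem ? 1 : 0);
    int mystart = rank * base + (rank < rem ? rank : rem);

    printf("rank %d owns indices [%d, %d)\n", rank, mystart, mystart + mycount);

    /* ... allocate and update only the owned block here ... */

    MPI_Finalize();
    return 0;
}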
Dense Matrix Multiply
Data sharing for MM with different partitioning
The shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C.
credit: George Karypis – Principles of Parallel Algorithm Design
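A hedged sketch of the simplest (1-D, row-block) partitioning from the figure, assuming MPI: each process owns a block of rows of A and C and needs all of B, which is broadcast; the dimension N and the data are placeholders.

/* Row-block matrix multiply: owner computes its block of rows of C. */
#include <stdlib.h>
#include <mpi.h>

#define N 512   /* global matrix dimension; assume it divides evenly by nprocs */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;                      /* rows of A and C owned here */
    double *A = malloc(rows * N * sizeof *A);   /* local row block of A       */
    double *B = malloc(N * N * sizeof *B);      /* every rank needs all of B  */
    double *C = calloc(rows * N, sizeof *C);    /* local row block of C       */

    /* Illustrative data; a real code would read and scatter the matrices. */
    for (int i = 0; i < rows * N; i++) A[i] = 1.0;
    if (rank == 0)
        for (int i = 0; i < N * N; i++) B[i] = 1.0;
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Owner computes: each rank forms only the rows of C it owns. */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}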
Parallel Sum
Sum for Nprocs=8
Complete after log(Nprocs) steps
credit: Designing and Building Parallel Programs – Ian Foster
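A sketch of the tree sum written out by hand so the log(Nprocs) steps are visible; in practice MPI_Reduce or MPI_Allreduce provides the same operation, and the per-rank value here is only illustrative:

/* Binomial-tree sum: at step s, ranks whose s-th bit is set send their
 * partial sum to rank - 2^s and drop out; the others receive and accumulate. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = rank + 1.0;    /* each rank's partial value */

    for (int step = 1; step < nprocs; step <<= 1) {
        if (rank & step) {
            MPI_Send(&local, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;                 /* this rank is done */
        } else if (rank + step < nprocs) {
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += other;
        }
    }

    if (rank == 0)
        printf("tree sum = %g (expected %g)\n",
               local, nprocs * (nprocs + 1) / 2.0);

    MPI_Finalize();
    return 0;
}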
Master/Workers Model
Often embarrassingly parallel
Master:
• decomposes the problem into small tasks
• distributes to workers
• gathers partial results to produce the final result
Workers:
• work
• pass results back to master
• request more work (optional)
Mapping/Load Balancing
• Static
• Dynamic
(Diagram: one master coordinating four workers; a minimal MPI sketch follows.)
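A minimal MPI sketch of the dynamic master/worker pattern: each task is just an integer index, the "work" squares it, and workers implicitly request more work by returning a result; tags, task count, and payloads are placeholders.

/* Dynamic master/worker scheduling in MPI. */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_DONE 2

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                         /* master */
        int next = 0, active = 0;
        double total = 0.0, result;
        MPI_Status st;

        /* Seed each worker with one task. */
        for (int w = 1; w < nprocs && next < NTASKS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

        /* Tell any leftover workers (if NTASKS < nprocs-1) there is no work. */
        for (int w = active + 1; w < nprocs; w++)
            MPI_Send(&next, 0, MPI_INT, w, TAG_DONE, MPI_COMM_WORLD);

        /* Collect a result, then hand that worker more work (or stop it). */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                         MPI_COMM_WORLD);
                active--;
            }
        }
        printf("master: total = %g\n", total);
    } else {                                 /* worker */
        int task;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            double result = (double)task * task;   /* the "work" */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}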
Master/Workers Load Balance
Iterations may have different and unpredictable run times
• Systematic variance
• Algorithmic variance
Goal is to trade off load balance against scheduling overhead
Some schemes:
• Block decomposition, static chunking
• Round-robin decomposition
• Self scheduling
  • assign one iteration at a time
• Guided dynamic self-scheduling
  • assign 1/P of the remaining iterations (P = # procs); see the sketch below
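A small sketch (plain C, illustrative values) of the chunk sizes guided dynamic self-scheduling hands out: each request gets roughly 1/P of the iterations still unassigned, so chunks start large and shrink toward one iteration.

/* Print the chunk sizes a guided self-scheduler would hand out. */
#include <stdio.h>

int main(void)
{
    int niters = 1000, nprocs = 8;   /* illustrative values */
    int remaining = niters, chunk_no = 0;

    while (remaining > 0) {
        int chunk = (remaining + nprocs - 1) / nprocs;  /* ceil(remaining/P) */
        printf("chunk %2d: %3d iterations\n", chunk_no++, chunk);
        remaining -= chunk;
    }
    return 0;
}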
Functional Parallelism
map tasks onto sets of processors
further decompose each function over data domain
credit: Designing and Building Parallel Programs – Ian Foster
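A hedged sketch of this mapping with MPI: MPI_Comm_split places disjoint sets of ranks into separate sub-communicators, one per function, and each function's data is then decomposed within its own group; the two "functions" here are placeholders.

/* Split COMM_WORLD into groups, one per function; decompose within each group. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* First half of the ranks run function A, second half function B. */
    int color = (rank < nprocs / 2) ? 0 : 1;
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

    int subrank, subsize;
    MPI_Comm_rank(subcomm, &subrank);
    MPI_Comm_size(subcomm, &subsize);

    /* Within each group the function's domain is decomposed as usual,
       e.g. a block decomposition over subrank/subsize. */
    printf("world rank %d -> function %c, local rank %d of %d\n",
           rank, color ? 'B' : 'A', subrank, subsize);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}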
Recursive Bisection
Orthogonal Recursive Bisection (ORB)
• good for decomposing irregular grids with mostly local communication
• partition the domain into equal parts of work by successively subdividing along orthogonal coordinate directions
• the cutting direction is varied at each level of the recursion; ORB partitioning is restricted to p = 2^k processors
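An illustrative serial sketch of the ORB idea (not taken from the paper below): split at the median along one coordinate, alternate the cut direction, and recurse k times to obtain p = 2^k equal-work partitions; the point set and equal weights are placeholders.

/* Orthogonal recursive bisection of 2-D points with equal weights. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; } Point;

static int cut_dim;   /* 0 = cut along x, 1 = cut along y */

static int cmp(const void *a, const void *b)
{
    const Point *p = a, *q = b;
    double da = cut_dim ? p->y : p->x;
    double db = cut_dim ? q->y : q->x;
    return (da > db) - (da < db);
}

/* Assign points[lo..hi) to partitions [first, first + 2^levels). */
static void orb(Point *pts, int lo, int hi, int levels, int first,
                int *part, int dim)
{
    if (levels == 0) {
        for (int i = lo; i < hi; i++) part[i] = first;
        return;
    }
    cut_dim = dim;
    qsort(pts + lo, hi - lo, sizeof(Point), cmp);   /* median split via sort */
    int mid = lo + (hi - lo) / 2;
    orb(pts, lo, mid, levels - 1, first, part, 1 - dim);
    orb(pts, mid, hi, levels - 1, first + (1 << (levels - 1)), part, 1 - dim);
}

int main(void)
{
    enum { N = 16 };
    Point pts[N];
    int part[N];
    for (int i = 0; i < N; i++) {                 /* random test points */
        pts[i].x = rand() / (double)RAND_MAX;
        pts[i].y = rand() / (double)RAND_MAX;
    }
    orb(pts, 0, N, 2, 0, part, 0);                /* 2 levels -> 4 partitions */
    for (int i = 0; i < N; i++)
        printf("(%.2f, %.2f) -> partition %d\n", pts[i].x, pts[i].y, part[i]);
    return 0;
}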
ORB Example – Groundwater modeling at UNC-Ch
Geometry of the homogeneous sphere-packed medium (a) 3D isosurface view; and (b) 2D cross section view. Blue and white areas stand for solid and fluid spaces, respectively.
“A high-performance lattice Boltzmann implementation to model flow in porous media” by Chongxun Pan, Jan F. Prins, and Cass T. Miller
Two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; and (right) orthogonal recursive bisection (ORB) decomposition.
Parallel Random Numbers
Example: Parallel Monte Carlo
Additional Requirements:
• usable for an arbitrary (large) number of processors
• pseudo-random across processors – streams uncorrelated
• generated independently for efficiency
Rule of thumb
• max usable sample size is at most the square root of the period
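A hedged sketch of one way to give each rank a piece of a single stream: "leapfrog" a linear congruential generator so rank r draws every Nprocs-th number of the global sequence. The LCG constants are common textbook values chosen for illustration; a production Monte Carlo code would use a tested library such as SPRNG (next slide).

/* Leapfrog partitioning of one LCG stream across MPI ranks:
 * rank r draws x_r, x_{r+P}, x_{r+2P}, ...  Illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <mpi.h>

#define LCG_A 1103515245ULL   /* multiplier */
#define LCG_C 12345ULL        /* increment  */
#define LCG_M (1ULL << 31)    /* modulus    */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Jump constants for a stride of P ranks:
       A = a^P mod m,  C = c*(a^{P-1} + ... + a + 1) mod m. */
    uint64_t A = 1, C = 0;
    for (int i = 0; i < nprocs; i++) {
        C = (A * LCG_C + C) % LCG_M;
        A = (A * LCG_A) % LCG_M;
    }

    /* Advance a common seed to this rank's starting point. */
    uint64_t x = 42;                      /* global seed */
    for (int i = 0; i < rank; i++)
        x = (LCG_A * x + LCG_C) % LCG_M;

    /* Each call now strides by nprocs through the global sequence. */
    double sum = 0.0;
    for (int i = 0; i < 1000; i++) {
        x = (A * x + C) % LCG_M;
        sum += (double)x / (double)LCG_M;
    }
    printf("rank %d: mean of 1000 leapfrogged draws = %f\n", rank, sum / 1000.0);

    MPI_Finalize();
    return 0;
}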
Parallel Random Numbers
Scalable Parallel Random Number Generators Library (SPRNG)
• free and source available
• collects 5 RNGs together in one package
• http://sprng.cs.fsu.edu
QCD Application
MILC
• (MIMD Lattice Computation)
quarks and gluons formulated on a space-time lattice
mostly asynchronous PTP communication
• MPI_Send_init, MPI_Start, MPI_Startall
• MPI_Recv_init, MPI_Wait, MPI_Waitall
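A hedged sketch (not taken from MILC itself) of the persistent point-to-point pattern these calls imply: set the requests up once with MPI_Send_init/MPI_Recv_init, then just MPI_Startall and MPI_Waitall inside the iteration loop, shaped here as a 1-D periodic halo exchange.

/* Persistent point-to-point communication reused every iteration. */
#include <mpi.h>

#define NLOCAL 1024
#define NITER  100

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sendbuf[NLOCAL], recvbuf[NLOCAL];
    int right = (rank + 1) % nprocs;          /* periodic neighbors */
    int left  = (rank - 1 + nprocs) % nprocs;

    /* Set the communication up once ... */
    MPI_Request reqs[2];
    MPI_Recv_init(recvbuf, NLOCAL, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Send_init(sendbuf, NLOCAL, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int it = 0; it < NITER; it++) {
        /* ... fill sendbuf for this iteration ... */
        MPI_Startall(2, reqs);                /* ... then restart it each step */
        /* ... overlap local computation here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* ... use recvbuf ... */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}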
MILC – Strong Scaling
UNC Capability Computing - Topsail
Compute nodes: 520 dual socket, quad core Intel “Clovertown” processors
• 4M L2 cache per socket
• 2.66 GHz processors
• 4160 processors
12 GB memory/node
Shared Disk : 39TB IBRIX Parallel File System
Interconnect: Infiniband
64 bit OS
cluster photos: Scott Sawyer, Dell
MPI PTP on baobab
Need large messages to achieve high rates
Latency cost dominates small messages
MPI_Send crossover from buffered to synchronous
These are instructional only
• not a benchmark
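A minimal ping-pong sketch of the kind used to produce such curves, assuming MPI: time round trips between ranks 0 and 1 over a range of message sizes and report one-way latency and bandwidth; the sizes and repeat count are illustrative.

/* Ping-pong latency/bandwidth measurement between ranks 0 and 1. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NREPS 100

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    for (int nbytes = 1; nbytes <= (1 << 22); nbytes *= 4) {
        char *buf = malloc(nbytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * NREPS);  /* one-way time */
        if (rank == 0)
            printf("%8d bytes: %10.2f us, %8.2f MB/s\n",
                   nbytes, t * 1e6, nbytes / t / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}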
MPI PTP on Topsail
InfiniBand (IB) interconnect
Note higher bandwidth
lower latency
Two modes of standard send
Community Atmosphere Model (CAM)
global atmosphere model for weather and climate research communities (from NCAR)
atmospheric component of Community Climate System Model (CCSM)
hybrid MPI/OpenMP
• run here with MPI only
running Eulerian dynamical core with spectral truncation of 31 or 42
T31: 48x96x26 (lat x lon x nlev)
T42: 64x128x26
the spectral dynamical core is domain decomposed over latitude only
CAM Performance
(Performance plots for the T31 and T42 resolutions)