Advanced Algorithmics (4AP): Parallel algorithms
Jaak Vilo
2009 Spring
http://www.top500.org/
Slides based on materials from
• Joe Davey
• Matt Maul
• Ashok Srinivasan
• Edward Chrzanowski
• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
• Dr. Amitava Datta and Prof. Dr. Thomas Ottmann
• http://en.wikipedia.org/wiki/Parallel_computing
• … and many others
Textbooks
• Joseph JaJa: An Introduction to Parallel Algorithms, Addison‐Wesley, ISBN 0‐201‐54856‐9, 1992.
TÜ HPC
• 42 nodes, 2×4‐core = 336 cores
• 32 GB RAM / node
• Infiniband fast interconnect
• http://www.hpc.ut.ee/
• Job scheduling (Torque)
• MPI
| | Single (SDR) | Double (DDR) | Quad (QDR) |
| --- | --- | --- | --- |
| 1X | 2 Gbit/s | 4 Gbit/s | 8 Gbit/s |
| 4X | 8 Gbit/s | 16 Gbit/s | 32 Gbit/s |
| 12X | 24 Gbit/s | 48 Gbit/s | 96 Gbit/s |
Latency: the single data rate (SDR) switch chips have a latency of 200 nanoseconds, and DDR switch chips have a latency of 140 nanoseconds. The end‐to‐end MPI latency ranges from 1.07 microseconds to 2.6 microseconds, depending on the adapter.
Compare: a normal Ethernet ping/pong (request and response) takes roughly 350 us (microseconds), i.e. about 0.35 milliseconds or 0.00035 seconds.
EXCS ‐ krokodill
• 32‐core
• 256 GB RAM
Drivers for parallel computing
• Multi‐core processors (2, 4, …, 64, …)
• Computer clusters (and NUMA)
• Specialised computers (e.g. vector processors, massively parallel, …)
• Distributed computing: GRID, cloud, …
• Need to create computer systems from cheaper (and weaker) components, but many …
• One of the major challenges of modern IT
What is a Parallel Algorithm?
• Imagine you needed to find a lost child in the woods.
• Even in a small area, searching by yourself would be very time‐consuming.
• Now if you gathered some friends and family to help you, you could cover the woods in a much faster manner…
Parallel complexity
• Schedule on p processors
• Schedule depth – the longest path in the schedule
• N – the set of nodes of the DAG
• ti – the time assigned to each DAG node i
• Tp(n) = min over schedules { max over DAG paths of Σ ti }
• Length of the longest (critical) path!
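The critical-path bound can be checked with a small sketch. The DAG, task names, and unit times below are invented for illustration:

```python
# Sketch: the parallel time T_p is bounded below by the DAG's critical
# path, i.e. the longest chain of dependent tasks weighted by task times.

def critical_path(tasks, deps):
    """tasks: {node: time}; deps: {node: [predecessors]}.
    Returns the length of the longest weighted path (critical path)."""
    finish = {}                      # earliest finish time per node

    def earliest_finish(v):
        if v not in finish:
            start = max((earliest_finish(p) for p in deps.get(v, [])), default=0)
            finish[v] = start + tasks[v]
        return finish[v]

    return max(earliest_finish(v) for v in tasks)

# Example: a+b and c+d can run in parallel; the final add depends on both.
tasks = {"a+b": 1, "c+d": 1, "sum": 1}
deps = {"sum": ["a+b", "c+d"]}
print(critical_path(tasks, deps))    # 2: two additions on the critical path
```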
Bit‐level parallelism
• Adding 64‐bit numbers on an 8‐bit architecture…
• Sum, carry over, sum, carry over, …
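A minimal sketch of the byte-at-a-time addition described above, assuming a 64-bit operand split into eight 8-bit chunks:

```python
# Sketch: adding two 64-bit numbers on an 8-bit ALU, one byte at a time,
# propagating the carry (sum, carry over, sum, carry over, ...).

def add64_on_8bit(x, y):
    """Add two 64-bit values using only 8-bit adds with carry."""
    result, carry = 0, 0
    for i in range(8):                        # 8 bytes, least significant first
        xb = (x >> (8 * i)) & 0xFF
        yb = (y >> (8 * i)) & 0xFF
        s = xb + yb + carry                   # at most 9 bits
        result |= (s & 0xFF) << (8 * i)
        carry = s >> 8                        # carry over to the next byte
    return result & 0xFFFFFFFFFFFFFFFF        # wrap around like hardware

print(add64_on_8bit(2**40 + 255, 1) == 2**40 + 256)  # True
```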
Boolean circuits
• (Micro)processors
• Basic building blocks:
• bit‐level operations
• combine into word‐level operations
8-bit adder
Circuit complexity
• Longest path (= time)
• Number of elements (= size)
• Most of the available algorithms to compute Pi, on the other hand, cannot be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems.
Instruction‐level parallelism
A canonical five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back)
Instructions can be grouped together only if there is no data dependency between them.
A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle. It can have two instructions in each stage of the pipeline, for a total of up to 10 instructions (shown in green) being simultaneously executed.
• HPC: 2.5GHz, 4‐op‐parallel
Background
• Terminology
– Time complexity
– Speedup
– Efficiency
– Scalability
• Communication cost model
Time complexity
• Parallel computation
– A group of processors work together to solve a problem
– Time required for the computation is the period from when the first processor starts working until when the last processor stops
(Figure: task timelines – Sequential, Parallel ‐ bad, Parallel ‐ ideal, Parallel ‐ realistic)
Other terminology
Notation:
• P = number of processors
• T1 = time on one processor
• TP = time on P processors
• Speedup: S = T1 / TP
• Efficiency: E = S / P
• Work: W = P · TP
• Scalability
– How does TP decrease as we increase P to solve the same problem?
– How should the problem size increase with P, to keep E constant?
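These definitions translate directly into code; the timings below are hypothetical, purely to illustrate the formulas:

```python
# Sketch: the speedup/efficiency/work definitions from the slide as
# small helper functions. The timings are invented for illustration.

def speedup(t1, tp):
    return t1 / tp                    # S = T1 / TP

def efficiency(t1, tp, p):
    return speedup(t1, tp) / p        # E = S / P

def work(tp, p):
    return p * tp                     # W = P * TP

t1, tp, p = 100.0, 30.0, 4            # hypothetical timings
print(speedup(t1, tp))                # 3.33...: less than the ideal 4
print(efficiency(t1, tp, p))          # 0.83...: processors 83% utilized
print(work(tp, p))                    # 120.0: more total work than the serial 100
```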
• Sometimes a speedup of more than N when using N processors is observed in parallel computing; this is called super‐linear speedup. Super‐linear speedup rarely happens and often confuses beginners, who believe the theoretical maximum speedup should be N when N processors are used.
• One possible reason for super‐linear speedup is the cache effect.
• Super‐linear speedups can also occur when performing backtracking in parallel: one thread can prune a branch of the exhaustive search that another thread would otherwise have taken.
Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved.
• e.g. 5% non‐parallelizable => speedup cannot be more than 20x!
• http://en.wikipedia.org/wiki/Amdahl%27s_law
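A minimal sketch of Amdahl's law, reproducing the 5% → 20x bound (the processor counts below are arbitrary):

```python
# Sketch of Amdahl's law: with serial fraction s, the speedup on N
# processors is S(N) = 1 / (s + (1 - s) / N), capped at 1/s as N grows.

def amdahl_speedup(serial_fraction, n):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# The slide's example: 5% non-parallelizable => at most 20x.
print(amdahl_speedup(0.05, 10**9))  # approaches 1/0.05 = 20
print(amdahl_speedup(0.05, 100))    # ~16.8: far below 20 even at N = 100
```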
Sequential programme
Communication cost model
• Processes spend some time doing useful work, and some time communicating
• Model communication cost as
– TC = ts + L·tb
– L = message size
– Independent of location of processes
– Any process can communicate with any other process
– A process can simultaneously send and receive one message
• Simple model for analyzing algorithm performance:
tcomm = communication time
= tstartup + (#words × tdata)
= latency + (#words × 1/(words/second))
i.e. latency and bandwidth
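A sketch of the latency/bandwidth model above; the constants ts = 1 µs and tb = 1 ns per word are invented, not measurements of any particular network:

```python
# Sketch: the latency/bandwidth cost model from the slide,
# T_comm = t_startup + words * t_data. The default constants are invented.

def comm_time(words, t_startup=1e-6, t_data=1e-9):
    """t_startup: per-message latency (s); t_data: per-word transfer time (s)."""
    return t_startup + words * t_data

# Latency dominates small messages; bandwidth dominates large ones.
print(comm_time(1))        # ~1e-6 s: essentially pure latency
print(comm_time(10**7))    # ~1e-2 s: essentially pure bandwidth
```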
I/O model
• We will ignore I/O issues, for the most part
• We will assume that input and output are distributed across the processors in a manner of our choosing
• Example: Sorting
– Input: x1, x2, …, xn
  • Initially, xi is on processor i
– Output: xp1, xp2, …, xpn
  • xpi is on processor i
  • xpi < xpi+1
Important points
• Efficiency
– Increases with increase in problem size
– Decreases with increase in number of processors
• Aggregation of tasks to increase granularity
– Reduces communication overhead
• Data distribution
– 2‐dimensional may be more scalable than 1‐dimensional
– Has an effect on load balance too
• General techniques
– Divide and conquer
– Pipelining
Parallel Architectures
• Single Instruction Stream, Multiple Data Stream (SIMD)
– One global control unit connected to each processor
• Multiple Instruction Stream, Multiple Data Stream (MIMD)
– Each processor has a local control unit
Architecture (continued)
• Shared‐Address‐Space
– Each processor has access to main memory
– Processors may be given a small private memory for local variables
• Message‐Passing
– Each processor is given its own block of memory
– Processors communicate by passing messages directly instead of modifying memory locations
MPP ‐ Massively Parallel
• Each node is an independent system having its own:
– Physical memory
– Address space
– Local disk and network connections
– Operating system
SMP ‐ Symmetric multiprocessing
• Shared Memory
– All processes share the same address space
– Easy to program; also easy to program poorly
– Performance is hardware‐dependent; limited memory bandwidth can create contention for memory
• MIMD (multiple instruction multiple data)
– Each parallel computing unit has an instruction thread
– Each processor has local memory
– Processors share data by message passing
– Synchronization must be explicitly programmed into a code
• NUMA (non‐uniform memory access)
– Distributed memory in hardware, shared memory in software, with hardware assistance (for performance)
– Better scalability than SMP, but not as good as a full distributed‐memory architecture
MPP
• Short for massively parallel processing, a type of computing that uses many separate CPUs running in parallel to execute a single program. MPP is similar to symmetric processing (SMP), with the main difference being that in SMP systems all the CPUs share the same memory, whereas in MPP systems each CPU has its own memory. MPP systems are therefore more difficult to program, because the application must be divided in such a way that all the executing segments can communicate with each other. On the other hand, MPP systems don't suffer from the bottleneck problems inherent in SMP systems when all the CPUs attempt to access the same memory at once.
Interconnection Networks
• Static
– Each processor is hard‐wired to every other processor
(Figure: Completely Connected, Star‐Connected, Bounded‐Degree (Degree 4))
• Dynamic
– Processors are connected to a series of switches
http://en.wikipedia.org/wiki/Hypercube
(Figure: 64‐node hypercube)
128-way fat tree
Why Do Parallel Computing?
• Time: Reduce the turnaround time of applications
• Performance: Parallel computing is the only way to extend performance toward the TFLOP realm
• Cost/Performance: Traditional vector computers become too expensive as one pushes the performance barrier
• Memory: Applications often require memory that goes beyond that addressable by a single processor
Cont…
• Whole classes of important algorithms are ideal for parallel execution. Most algorithms can benefit from parallel processing, such as Laplace equation solvers, Monte Carlo methods, FFT (signal processing), and image processing
• Life itself is a set of concurrent processes
– Scientists use modelling, so why not model systems in a way closer to nature?
Some Misconceptions
• Requires new parallel languages?
– No. Uses standard languages and compilers (Fortran, C, C++, Java, Occam)
• However, there are some specific parallel languages such as Qlisp, Mul‐T and others – check out:
http://ceu.fi.udc.es/SAL/C/1/index.shtml
• Requires new code?
– No. Most existing code can be used. Many production installations use the same code base for serial and parallel code.
Cont…
• Requires confusing parallel extensions?
– No. They are not that bad. It depends on how complex you want to make it: from nothing at all (letting the compiler do the parallelism) to installing semaphores yourself
• Parallel computing is difficult?
– No. Just different and subtle. It can be akin to assembler language programming
Parallel Computing Architectures: Flynn's Taxonomy

| | Single Data Stream | Multiple Data Stream |
| --- | --- | --- |
| Single Instruction Stream | SISD (uniprocessors) | SIMD (processor arrays) |
| Multiple Instruction Stream | MISD (systolic arrays) | MIMD (multiprocessors, multicomputers) |
Parallel Computing Architectures: Memory Model

| | Shared Address Space | Individual Address Space |
| --- | --- | --- |
| Centralized memory | SMP (Symmetric Multiprocessor) | N/A |
| Distributed memory | NUMA (Non-Uniform Memory Access) | MPP (Massively Parallel Processors) |
NUMA architecture
The PRAM Model
• Parallel Random Access Machine
– Theoretical model for parallel machines
– p processors with uniform access to a large memory bank
– MIMD
– UMA (uniform memory access): equal memory access time for any processor to any address
Memory Protocols
• Exclusive‐Read Exclusive‐Write
• Exclusive‐Read Concurrent‐Write
• Concurrent‐Read Exclusive‐Write
• Concurrent‐Read Concurrent‐Write
• If concurrent write is allowed, we must decide which value to accept
Example: Finding the Largest key in an array
• In order to find the largest key in an array of size n, at least n−1 comparisons must be done.
• A parallel version of this algorithm will still perform the same number of comparisons, but by doing them in parallel it will finish sooner.
Example: Finding the Largest key in an array
• Assume that n is a power of 2 and we have n/2 processors executing the algorithm in parallel.
• Each processor reads two array elements into local variables called first and second.
• It then writes the larger value into the first of the array slots that it has to read.
• It takes lg n steps for the largest key to be placed in S[1].
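The rounds just described can be simulated sequentially (a sketch of the PRAM idea, not actual parallel code; the inner loop's iterations are the ones that would run in parallel):

```python
# Sketch: simulate the PRAM max-finding algorithm. In each of the lg n
# rounds, every (conceptual) processor compares two slots and writes the
# larger value into the first of the two.

def parallel_max(s):
    """Return the largest key, mimicking the slide's rounds on a copy of s."""
    s = list(s)
    gap = 1
    while gap < len(s):                       # lg n rounds for n a power of 2
        for i in range(0, len(s), 2 * gap):   # these run in parallel on a PRAM
            first, second = s[i], s[i + gap]
            s[i] = max(first, second)         # write the larger into slot i
        gap *= 2
    return s[0]                               # the largest key ends up in S[1]

print(parallel_max([3, 12, 7, 6, 8, 11, 19, 13]))  # 19, as in the example
```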
Example: Finding the Largest key in an array
3 12 7 6 8 11 19 13
P1 P2 P3 P4
Read (first, second): (3, 12) (7, 6) (8, 11) (19, 13)
Write the larger into the first slot:
12 12 7 6 11 11 19 13
Example: Finding the Largest key in an array
12 12 7 6 11 11 19 13
P1 and P3 read: (12, 7) and (11, 19)
Write:
12 12 7 6 19 11 19 13
Example: Finding the Largest key in an array
12 12 7 6 19 11 19 13
P1 reads (12, 19)
Write:
19 12 7 6 19 11 19 13
Example: Merge Sort
8 1 4 5 2 7 3 6
P1 P2 P3 P4: 1 8 | 4 5 | 2 7 | 3 6
P1 P2: 1 4 5 8 | 2 3 6 7
P1: 1 2 3 4 5 6 7 8
Merge Sort Analysis
• Number of compares
– 1 + 3 + … + (2^i − 1) + … + (n − 1)
– Σ_{i=1..lg n} (2^i − 1) = 2n − 2 − lg n = Θ(n)
• We have improved from n·lg n to n simply by applying the old algorithm in parallel; by altering the algorithm we can further improve merge sort to (lg n)².
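The sum above can be checked numerically with a quick sketch (n is assumed to be a power of 2):

```python
# Sketch: the critical path of parallel merge sort does
# sum_{i=1}^{lg n} (2^i - 1) = 2n - 2 - lg n comparisons in the worst
# case, i.e. Theta(n) instead of the sequential n lg n.

def critical_path_compares(n):
    lg = n.bit_length() - 1                  # lg n for n a power of 2
    return sum((2 ** i) - 1 for i in range(1, lg + 1))

for n in (8, 1024):
    lg = n.bit_length() - 1
    assert critical_path_compares(n) == 2 * n - 2 - lg
    print(n, critical_path_compares(n))      # 8 -> 11, 1024 -> 2036
```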
Parallel Design and Dynamic Programming
• Often in a dynamic programming algorithm a given row or diagonal can be computed simultaneously
• This makes many dynamic programming algorithms amenable to parallel architectures
Current Applications that Utilize Parallel Architectures
• Computer Graphics Processing
• Video Encoding
• Accurate weather forecasting
• Scientific computing, modelling
• …
Parallel addition features
• If n >> P
– Each processor adds n/P distinct numbers
– Perform parallel reduction on P numbers
– TP ~ n/P + (1 + ts + tb) log P
– Optimal P obtained by differentiating wrt P
• Popt ~ n/(1 + ts + tb)
• If communication cost is high, then fewer processors ought to be used
– E = [1 + (1 + ts + tb) P log P / n]^−1
• As problem size increases, efficiency increases
• As number of processors increases, efficiency decreases
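A sketch of the model above with invented constants ts = tb = 1, checking the two efficiency trends numerically:

```python
# Sketch: the timing model from the slide for adding n numbers on P
# processors, TP ~ n/P + (1 + ts + tb) log P, and the efficiency formula
# E = [1 + (1 + ts + tb) P log P / n]^-1. The constants are illustrative.

import math

def t_parallel(n, p, ts=1.0, tb=1.0):
    return n / p + (1 + ts + tb) * math.log2(p)

def reduction_efficiency(n, p, ts=1.0, tb=1.0):
    return 1.0 / (1.0 + (1 + ts + tb) * p * math.log2(p) / n)

n = 10**6
print(t_parallel(n, 64))                                 # 15643.0
# Efficiency falls with processor count, rises with problem size:
print(reduction_efficiency(n, 64) > reduction_efficiency(n, 256))       # True
print(reduction_efficiency(4 * n, 256) > reduction_efficiency(n, 256))  # True
```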
Some common collective operations
(Figure: collective operations among four processes. Broadcast: A is copied to all. Gather: A, B, C, D are collected onto one process. Scatter: A, B, C, D are distributed from one process. All‐Gather: every process ends up with A, B, C, D.)
Broadcast
(Figure: broadcast among 8 processors x1…x8 in log P steps)
• T ~ (ts + L·tb) log P
– L: length of data
Gather/Scatter
(Figure: gather on a tree of 8 processors; message sizes grow from L to 2L to 4L toward the root)
• Note: Σ_{i=0..log P−1} 2^i = (2^{log P} − 1)/(2 − 1) = P − 1 ~ P
• Gather: data moves towards the root
• Scatter: review question
• T ~ ts log P + P·L·tb
All gather
(Figure: all‐gather, step 1: neighbouring processors exchange messages of size L)
• Equivalent to each processor broadcasting to all the processors
(Figure: in subsequent all‐gather steps, pairs exchange messages of size 2L, then 4L, until every processor holds all of x1…x8)
• T ~ ts log P + P·L·tb
Matrix-vector multiplication
• c = A b
– Often performed repeatedly
  • bi = A bi−1
– We need the same data distribution for c and b
• One-dimensional decomposition
– Example: row-wise block striped for A
  • b and c replicated
– Each process computes its components of c independently
– Then all-gather the components of c
1-D matrix-vector multiplication
(Figure: c: replicated; A: row-wise; b: replicated)
• Each process computes its components of c independently
– Time = Θ(n²/P)
• Then all-gather the components of c
– Time = ts log P + tb n
• Note: P < n
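A sequential sketch of the row-wise scheme: the "processes" are simulated by a loop, the all-gather by list concatenation, and the tiny matrix is invented for illustration.

```python
# Sketch: row-wise block-striped matrix-vector product. Each "process"
# owns n/P consecutive rows of A, the vector b is replicated, and the
# all-gather that assembles c is modeled with plain concatenation.

def matvec_1d(a, b, p):
    """a: n x n matrix (list of rows), b: vector, p: process count (p divides n)."""
    n = len(a)
    chunk = n // p
    partial = []
    for rank in range(p):                     # each iteration = one process
        rows = a[rank * chunk:(rank + 1) * chunk]
        partial.append([sum(r[j] * b[j] for j in range(n)) for r in rows])
    c = [x for part in partial for x in part]  # "all-gather" of the pieces
    return c

a = [[1, 2], [3, 4]]
b = [1, 1]
print(matvec_1d(a, b, 2))   # [3, 7] == A b
```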
2-D matrix-vector multiplication
(Figure: a P^0.5 × P^0.5 process grid; process Pij holds block Aij of the blocked matrix A, with vector blocks Bi and result blocks Ci along the edges)
• Processes Pi0 send Bi to P0i
– Time: ts + tb·n/P^0.5
• Processes P0j broadcast Bj to all Pij
– Time = ts log P^0.5 + tb n log P^0.5 / P^0.5
• Processes Pij compute Cij = Aij·Bj
– Time = Θ(n²/P)
• Processes Pij reduce Cij onto Pi0, 0 ≤ i < P^0.5
– Time = ts log P^0.5 + tb n log P^0.5 / P^0.5
• Total time = Θ(n²/P + ts log P + tb n log P / P^0.5)
– P < n²
– More scalable than 1-dimensional decomposition
• 31.2 Strassen's algorithm for matrix multiplication
• This section presents Strassen's remarkable recursive algorithm for multiplying n × n matrices that runs in Θ(n^lg 7) = O(n^2.81) time. For sufficiently large n, therefore, it outperforms the naive Θ(n³) matrix‐multiplication algorithm MATRIX‐MULTIPLY from Section 26.1.
C = AB
• r = ae + bf
• s = ag + bh
• t = ce + df
• u = cg + dh
• T(n) = 8T(n/2) + Θ(n²) = O(n³)
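For contrast, Strassen's method computes the same 2×2 block product with seven multiplications instead of eight. The sketch below uses scalars in place of the n/2 × n/2 blocks and keeps the slide's naming, where B's columns are (e, f) and (g, h); the m1…m7 combinations are one standard formulation of Strassen's products.

```python
# Sketch: Strassen's seven products for a 2x2 block multiply. Applied
# recursively to n/2 x n/2 blocks this gives T(n) = 7T(n/2) + Theta(n^2)
# = Theta(n^lg 7), versus the 8-multiplication recurrence on the slide.

def strassen_2x2(a, b, c, d, e, f, g, h):
    """A = [[a, b], [c, d]]; B has columns (e, f) and (g, h);
    returns C = [[r, s], [t, u]] with r = ae + bf, etc."""
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (g - h)
    m4 = d * (f - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + g)
    m7 = (b - d) * (f + h)
    r = m1 + m4 - m5 + m7    # = ae + bf
    s = m3 + m5              # = ag + bh
    t = m2 + m4              # = ce + df
    u = m1 - m2 + m3 + m6    # = cg + dh
    return r, s, t, u

print(strassen_2x2(1, 2, 3, 4, 5, 6, 7, 8))  # (17, 23, 39, 53)
```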
Matrix operations are parallelizable
• Problems that are expressed in forms of matrix operations are often easy to automatically parallelise
• Fortran, etc. – programming languages can achieve that at no extra effort
![Page 78: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/78.jpg)
The Process
• A running executable of a (compiled and linked) program written in a standard sequential language (e.g. F77 or C), with library calls to implement the message passing
• A process executes on a processor
 – All processes are assigned to processors in a one-to-one mapping (simplest model of parallel programming)
 – Other processes may execute on other processors
• A process communicates and synchronizes with other processes via messages
• A process is uniquely identified by:
 – The node on which it is running
 – Its process id (PID)
• A process does not migrate from node to node (though it is possible for it to migrate from one processor to another within an SMP node)
![Page 79: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/79.jpg)
Processors vs. Nodes
• Once upon a time…
 – When distributed-memory systems were first developed, each computing element was referred to as a node
 – A node consisted of a processor, memory, (maybe I/O), and some mechanism by which it attached itself to the interprocessor communication facility (switch, mesh, torus, hypercube, etc.)
 – The terms processor and node were used interchangeably
• But lately, nodes have grown fatter…
 – Multi-processor nodes are common and getting larger
 – Nodes and processors are no longer the same thing as far as parallel computing is concerned
• Old habits die hard…
 – It is better to ignore the underlying hardware and refer to the elements of a parallel program as processes or (more formally) as MPI tasks
![Page 80: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/80.jpg)
Solving Problems in Parallel
It is true that the hardware defines the parallel computer. However, it is the software that makes it usable.
Parallel programmers have the same concerns as any other programmer:
- Algorithm design
- Efficiency
- Debugging ease
- Code reuse, and
- Lifecycle
![Page 81: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/81.jpg)
Cont.
However, they are also concerned with:
- Concurrency and communication
- Need for speed (née high performance), and
- Plethora and diversity of architecture
![Page 82: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/82.jpg)
Choose Wisely
• How do I select the right parallel computing model/language/libraries to use when I am writing a program?
• How do I make it efficient?
• How do I save time and reuse existing code?
![Page 83: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/83.jpg)
Foster's Four-step Process for Designing Parallel Algorithms
1. Partitioning – the process of dividing the computation and the data into many small pieces – decomposition
2. Communication – local and global (called overhead); minimizing parallel overhead is an important goal, and the following checklist should help the communication structure of the algorithm:
 1. The communication operations are balanced among tasks
 2. Each task communicates with only a small number of neighbours
 3. Tasks can perform their communications concurrently
 4. Tasks can perform their computations concurrently
![Page 84: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/84.jpg)
Cont…
3. Agglomeration is the process of grouping tasks into larger tasks in order to improve performance or simplify programming. Often when using MPI this is one task per processor.
4. Mapping is the process of assigning tasks to processors with the goal of maximizing processor utilization.
![Page 85: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/85.jpg)
Solving Problems in Parallel
• Decomposition determines:
 – Data structures
 – Communication topology
 – Communication protocols
• Must be looked at early in the process of application development
• Standard approaches
![Page 86: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/86.jpg)
Decomposition methods
• Perfectly parallel
• Domain
• Control
• Object-oriented
• Hybrid/layered (multiple uses of the above)
![Page 87: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/87.jpg)
For the program
• Choose a decomposition
 – Perfectly parallel, domain, control, etc.
• Map the decomposition to the processors
 – Ignore topology of the system interconnect
 – Use natural topology of the problem
• Define the inter-process communication protocol
 – Specify the different types of messages which need to be sent
 – See if standard libraries efficiently support the proposed message patterns
![Page 88: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/88.jpg)
Perfectly parallel
• Applications that require little or no inter-processor communication when running in parallel
• Easiest type of problem to decompose
• Results in nearly perfect speed-up
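As a toy illustration (not from the slides), the midpoint-rule estimate of π below splits the integral over [0, 1] into independent sub-intervals: the tasks share no data, so the only "communication" is the final sum. A thread pool merely stands in for separate processes to keep the sketch self-contained and runnable; on a cluster each chunk would become its own MPI rank (and CPython threads would not actually speed up this CPU-bound work).

```python
# Perfectly (embarrassingly) parallel example: estimate pi by
# integrating 4/(1+x^2) over [0,1], one independent chunk per worker.
from concurrent.futures import ThreadPoolExecutor

def pi_slice(args):
    """Midpoint-rule sum over one sub-interval; fully independent task."""
    start, steps, width = args
    return sum(4.0 / (1.0 + ((start + i + 0.5) * width) ** 2)
               for i in range(steps)) * width

def parallel_pi(n=100_000, workers=4):
    """Assumes workers divides n; each worker gets n//workers steps."""
    per = n // workers
    chunks = [(w * per, per, 1.0 / n) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(pi_slice, chunks))   # the only "communication"
```

Because no chunk depends on another, the speed-up on real processes is limited only by the final reduction, which is why such problems show nearly perfect scaling.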
![Page 89: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/89.jpg)
Domain decomposition
• In simulation and modelling this is the most common solution
 – The solution space (which often corresponds to the real space) is divided up among the processors. Each processor solves its own little piece
• Finite-difference methods and finite-element methods lend themselves well to this approach
 – The method of solution often leads naturally to a set of simultaneous equations that can be solved by parallel matrix solvers
• Sometimes the solution involves some kind of transformation of variables (e.g. Fourier transform). Here the domain is some kind of phase space. The solution and the various transformations involved can be parallelized
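A minimal sketch (not from the slides) of what domain decomposition looks like for a 1-D finite-difference (Jacobi-style) update: each "process" owns one contiguous block of the array plus halo values copied from its neighbours each step. Message passing is simulated by plain copies; in MPI those copies become sends and receives. A serial version is included to check that decomposing the domain changes nothing:

```python
# Toy 1-D domain decomposition with halo exchange, plus a serial
# reference. Boundary cells u[0] and u[-1] are held fixed.

def jacobi_serial(u, steps):
    u = list(u)
    for _ in range(steps):
        u = [u[0]] + [0.5 * (u[i - 1] + u[i + 1])
                      for i in range(1, len(u) - 1)] + [u[-1]]
    return u

def jacobi_decomposed(u, nprocs, steps):
    """Assumes nprocs divides len(u); each block is one 'process'."""
    per = len(u) // nprocs
    blocks = [list(u[p * per:(p + 1) * per]) for p in range(nprocs)]
    for _ in range(steps):
        # "Halo exchange": gather neighbours' old boundary values first.
        halos = [(blocks[p - 1][-1] if p > 0 else None,
                  blocks[p + 1][0] if p < nprocs - 1 else None)
                 for p in range(nprocs)]
        new_blocks = []
        for p, blk in enumerate(blocks):
            left, right = halos[p]
            ext = ([left if left is not None else blk[0]] + blk +
                   [right if right is not None else blk[-1]])
            new = [0.5 * (ext[i - 1] + ext[i + 1])
                   for i in range(1, len(ext) - 1)]
            if left is None:                 # global boundary: fixed value
                new[0] = blk[0]
            if right is None:
                new[-1] = blk[-1]
            new_blocks.append(new)
        blocks = new_blocks
    return [x for blk in blocks for x in blk]
```

The local update touches only the block and its two halo values, so per-step communication stays constant while per-step computation grows with the block size — the usual argument for domain decomposition.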
![Page 90: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/90.jpg)
Cont…
• Example: the client-server model
 – The server is an object that has data associated with it (e.g. a database) and a set of procedures that it performs (e.g. searches for requested data within the database)
 – The client is an object that has data associated with it (e.g. a subset of data that it has requested from the database) and a set of procedures it performs (e.g. some application that massages the data)
 – The server and client can run concurrently on different processors: an object-oriented decomposition of a parallel application
 – In the real world, this can be large scale when many clients (workstations running applications) access a large central database – kind of like a distributed supercomputer
![Page 91: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/91.jpg)
Output Data Decomposition: Example
Consider the problem of multiplying two n x n matrices A and B to yield matrix C. The output matrix C can be partitioned into four tasks as follows:
Task 1:
Task 2:
Task 3:
Task 4:
![Page 92: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/92.jpg)
Output Data Decomposition: Example
A partitioning of output data does not result in a unique decomposition into tasks. For example, for the same problem as on the previous foil, with identical output data distribution, we can derive the following two (other) decompositions:

Decomposition I
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2
Task 4: C1,2 = C1,2 + A1,2 B2,2
Task 5: C2,1 = A2,1 B1,1
Task 6: C2,1 = C2,1 + A2,2 B2,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2
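Both decompositions compute the same product; they differ only in which input blocks each task pairs up, and hence in the communication pattern. A quick sanity check (not from the slides) with scalar stand-ins for the blocks, using hypothetical helper names:

```python
# Each decomposition as a literal task list; A and B map block
# indices (i, j) to scalar stand-ins for the blocks.

def decomposition_1(A, B):
    C = {}
    C[1, 1] = A[1, 1] * B[1, 1]     # Task 1
    C[1, 1] += A[1, 2] * B[2, 1]    # Task 2
    C[1, 2] = A[1, 1] * B[1, 2]     # Task 3
    C[1, 2] += A[1, 2] * B[2, 2]    # Task 4
    C[2, 1] = A[2, 1] * B[1, 1]     # Task 5
    C[2, 1] += A[2, 2] * B[2, 1]    # Task 6
    C[2, 2] = A[2, 1] * B[1, 2]     # Task 7
    C[2, 2] += A[2, 2] * B[2, 2]    # Task 8
    return C

def decomposition_2(A, B):
    C = {}
    C[1, 1] = A[1, 1] * B[1, 1]     # Task 1
    C[1, 1] += A[1, 2] * B[2, 1]    # Task 2
    C[1, 2] = A[1, 2] * B[2, 2]     # Task 3 (different pairing order)
    C[1, 2] += A[1, 1] * B[1, 2]    # Task 4
    C[2, 1] = A[2, 2] * B[2, 1]     # Task 5 (different pairing order)
    C[2, 1] += A[2, 1] * B[1, 1]    # Task 6
    C[2, 2] = A[2, 1] * B[1, 2]     # Task 7
    C[2, 2] += A[2, 2] * B[2, 2]    # Task 8
    return C
```

Since each C block is the sum of the same two products in both versions, only the order (and therefore which A/B blocks a task must receive, and when) changes.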
![Page 93: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/93.jpg)
Control decomposition
• If you cannot find a good domain to decompose, your problem might lend itself to control decomposition
 – Good for:
  • Unpredictable workloads
  • Problems with no convenient static structures
• One type of control decomposition is functional decomposition
 – The problem is viewed as a set of operations; it is among operations where parallelization is done
 – Many examples in industrial engineering (e.g. modelling an assembly line, a chemical plant, etc.)
 – Many examples in data processing where a series of operations is performed on a continuous stream of data
![Page 94: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/94.jpg)
Cont…
• Control is distributed, usually with some distribution of data structures
 – Some processes may be dedicated to achieve better load balance
• Examples
 – Image processing: given a series of raw images, perform a series of transformations that yield a final enhanced image. Solve this in a functional decomposition (each process represents a different function in the problem) using data pipelining
 – Game playing: games feature an irregular search space. One possible move may lead to a rich set of possible subsequent moves to search.
  • Need an approach where work can be dynamically assigned to improve load balancing
  • May need to assign multiple processes to work on a particularly promising lead
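The dynamic assignment mentioned above can be sketched with a shared work queue from which workers pull tasks as they finish: an irregular workload then balances itself without a static schedule. This is an illustrative Python sketch, not code from the slides; in a message-passing setting the queue would typically live in a dedicated master process.

```python
# Control decomposition via a shared work queue: workers grab the
# next task whenever they become idle (dynamic load balancing).
import queue
import threading

def run_dynamic(tasks, workers=4):
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()      # pull next unit of work
            except queue.Empty:
                return                  # no work left: worker exits
            r = t()                     # task may take irregular time
            with lock:
                results.append(r)       # collect result (order not fixed)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Because completion order depends on task sizes and scheduling, results arrive unordered; a real application would tag each result with its task id.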
![Page 95: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/95.jpg)
Cont…
• Any problem that involves search (or computations) whose scope cannot be determined a priori is a candidate for control decomposition
 – Calculations involving multiple levels of recursion (e.g. genetic algorithms, simulated annealing, artificial intelligence)
 – Discrete phenomena in an otherwise regular medium (e.g. modelling localized storms within a weather model)
 – Design-rule checking in micro-electronic circuits
 – Simulation of complex systems
 – Game playing, music composing, etc.
![Page 96: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/96.jpg)
Object-oriented decomposition
• Object-oriented decomposition is really a combination of functional and domain decomposition
 – Rather than thinking about dividing data or functionality, we look at the objects in the problem
 – An object can be decomposed as a set of data structures plus the procedures that act on those data structures
 – The goal of object-oriented parallel programming is distributed objects
• Although conceptually clear, in practice it can be difficult to achieve good load balancing among the objects without a great deal of fine tuning
 – Works best for fine-grained problems and in environments where having functionality ready at-the-call is more important than worrying about under-worked processors (e.g. battlefield simulation)
 – Message passing is still explicit (no standard C++ compiler automatically parallelizes over objects)
![Page 97: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/97.jpg)
Partitioning the Graph of Lake Superior
Random Partitioning
Partitioning for minimum edge-cut.
![Page 98: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/98.jpg)
Decomposition summary
• A good decomposition strategy is
 – Key to potential application performance
 – Key to programmability of the solution
• There are many different ways of thinking about decomposition
 – Decomposition models (domain, control, object-oriented, etc.) provide standard templates for thinking about the decomposition of a problem
 – Decomposition should be natural to the problem rather than natural to the computer architecture
 – Communication does no useful work; keep it to a minimum
 – Always wise to see if a library solution already exists for your problem
 – Don't be afraid to use multiple decompositions in a problem if it seems to fit
![Page 99: AdvancedAlgorithmics(4AP) Parallel algorithms - ut](https://reader034.vdocuments.pub/reader034/viewer/2022042702/6265d301bc727c13ed298069/html5/thumbnails/99.jpg)
Summary of Software
• Compilers
 – Moderate O(4-10) parallelism
 – Not concerned with portability
 – Platform has a parallelizing compiler
• OpenMP
 – Moderate O(10) parallelism
 – Good quality implementation exists on the platform
 – Not scalable
• MPI
 – Scalability and portability are important
 – Needs some type of message passing platform
 – A substantive coding effort is required
• PVM
 – All MPI conditions plus fault tolerance
 – Still provides better functionality in some settings
• High Performance Fortran (HPF)
 – Like OpenMP, but new language constructs provide a data-parallel implicit programming model
 – Not recommended
• P-Threads
 – Difficult to correct and maintain programs
 – Not scalable to large number of processors
• High level libraries (POOMA and HPC++)
 – Library is available and it addresses a specific problem