Overview of Parallel Algorithms (并行算法概述)

Transcript
Page 1: 并行算法概述

Overview of Parallel Algorithms

Page 2: 并行算法概述

2

Content

• Parallel Computing Model
• Basic Techniques for Parallel Algorithm Design

Page 3: 并行算法概述

3

Von Neumann Model

[Diagram: the Von Neumann architecture: Memory (MAR, MDR); Control Unit (IR, PC); Processing Unit (ALU, TEMP); Input devices (keyboard, mouse, scanner, disk); Output devices (monitor, printer, LED, disk).]

Page 4: 并行算法概述

4

Instruction Processing

• Fetch instruction from memory
• Decode instruction
• Evaluate address
• Fetch operands from memory
• Execute operation
• Store result

Page 5: 并行算法概述

5

Parallel Computing Model

• A computational model
• Bridges software and hardware
• Provides an abstract architecture for algorithm design
• Ex) PRAM, BSP, LogP

Page 6: 并行算法概述

6

Parallel Programming Model

• What does the programmer use to write code?
• Determines how communication and synchronization are expressed
• The communication primitives exposed to the programmer implement the programming model
• Ex) Uniprocessor, Multiprogramming, Data parallel, Message-passing, Shared-address-space

Page 7: 并行算法概述

7

Aspects of Parallel Processing

[Diagram: aspects of parallel processing: the algorithm developer, application developer, system programmer, and architecture designer work at different levels, connected through the parallel computing model, the parallel programming model, the middleware, and the underlying multiprocessors (processors and memories joined by an interconnection network).]

Page 8: 并行算法概述

8

Parallel Computing Models – Parallel Random Access Machine (PRAM, 并行随机存取机)

Characteristics:

• p processors Pi (0 ≤ i ≤ p-1)
• Each processor has its own local memory
• A single global shared memory
• All processors can access the shared memory

Page 9: 并行算法概述

9

Illustration of PRAM

[Diagram: processors P1, P2, ..., Pp attached to a single shared memory and driven by a common clock (CLK).]

• p processors connected to a single shared memory
• Each processor has a unique index
• A single program is executed in MIMD mode

Page 10: 并行算法概述

10

Parallel Random Access Machine

Operation modes:

• Synchronous
  • Processors execute in lock step
  • At each step, a processor is either working or idle
  • Suitable for SIMD and MIMD architectures
• Asynchronous
  • Each processor has a local clock, which is used to synchronize processors
  • Suitable for MIMD architectures

Page 11: 并行算法概述

11

Problems with PRAM

• A simplified description of real-world parallel systems
• Does not account for many kinds of overhead
  • Latency, bandwidth, remote memory access, memory access conflicts, synchronization overhead, etc.
• An algorithm with good theoretical performance on the PRAM may perform poorly in practice

Page 12: 并行算法概述

12

Parallel Random Access Machine

Read / Write conflicts:

• EREW: Exclusive-Read, Exclusive-Write
  • No concurrent operations (read or write) on the same variable
• CREW: Concurrent-Read, Exclusive-Write
  • Concurrent reads of the same variable are allowed
  • Writes are exclusive
• ERCW: Exclusive-Read, Concurrent-Write
• CRCW: Concurrent-Read, Concurrent-Write

Page 13: 并行算法概述

15

Parallel Random Access Machine

Basic input/output operations

• Global memory
  • global read (X, x)
  • global write (Y, y)
• Local memory
  • read (X, x)
  • write (Y, y)

Page 14: 并行算法概述

16

Example: Sum on the PRAM model

Sum an array A of n = 2^k numbers on a PRAM machine with n processors.

Compute S = A(1) + A(2) + ... + A(n).

Build a binary tree to compute the sum.

Page 15: 并行算法概述

17

Example: Sum on the PRAM model

[Diagram: a binary summation tree over processors P1 to P8. At the leaves each Pi sets B(i) = A(i); at each higher level the surviving processors combine pairs of partial sums; the final result is S = B(1).]

Level 1: Pi sets B(i) = A(i)
Level h > 1: Pi computes B(i) = B(2i-1) + B(2i)

Page 16: 并行算法概述

18

Example: Sum on the PRAM model

Algorithm for processor Pi (i = 1, 2, ..., n)

Input: A, an array of n = 2^k elements in global memory
Output: S = A(1) + A(2) + ... + A(n)
Local variables: n, and i, the identity of processor Pi

Begin
1. global read (A(i), a)
2. global write (a, B(i))
3. for h = 1 to log n do
     if ( i ≤ n / 2^h ) then
     begin
       global read (B(2i-1), x)
       global read (B(2i), y)
       z := x + y
       global write (z, B(i))
     end
4. if i = 1 then global write (z, S)
End
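The PRAM algorithm above can be emulated on a shared-memory machine. The following is a minimal C++20 sketch (not from the slides): each array element gets its own thread standing in for a PRAM processor, and a barrier separates the read and write phases of every level so that the lock-step semantics are preserved. The input reuses the {4, 9, 1, 7, 8, 11, 2, 12} example; the thread-per-element setup is purely illustrative.

#include <barrier>
#include <iostream>
#include <thread>
#include <vector>

// Emulated PRAM tree sum: n = 2^k values, one logical processor per element.
// After level h, B(i) of an active processor holds the sum of 2^h consecutive inputs.
int main() {
    const int n = 8;                                      // assumed n = 2^k
    std::vector<int> B = {4, 9, 1, 7, 8, 11, 2, 12};      // B(i) = A(i) after the copy step
    std::barrier sync(n);                                 // global synchronization point

    std::vector<std::thread> procs;
    for (int i = 1; i <= n; ++i) {                        // processor indices 1..n as in the slides
        procs.emplace_back([&, i] {
            for (int h = 1; (1 << h) <= n; ++h) {
                bool active = (i <= (n >> h));            // only the first n / 2^h processors work
                int x = 0, y = 0;
                if (active) { x = B[2 * i - 2]; y = B[2 * i - 1]; }  // read phase
                sync.arrive_and_wait();                   // all reads finish before any write
                if (active) B[i - 1] = x + y;             // write phase: B(i) = B(2i-1) + B(2i)
                sync.arrive_and_wait();                   // lock-step end of the level
            }
        });
    }
    for (auto& t : procs) t.join();
    std::cout << "S = " << B[0] << "\n";                  // S = B(1) = 54
}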

Page 17: 并行算法概述

19

Other Distributed Models

• Distributed Memory Model
  • No global memory
  • Each processor has its own local memory
• Postal Model
  • A processor sends a request when it accesses non-local memory
  • The processor does not stall; it keeps working until the data arrives

Page 18: 并行算法概述

20

Network Models

• Focus on the influence of the communication network topology
• The main concern of early parallel computing
• A distributed-memory model
  • The cost of remote memory access depends on the topology and the access pattern
• Aims to provide efficient
  • data mapping
  • communication routing

Page 19: 并行算法概述

21

LogP

• Influenced by the design of parallel computers
• A model of distributed-memory multiprocessors
• Processors communicate through point-to-point messages
• Aims to analyze the performance bottlenecks of parallel computers
• Characterizes the performance of the communication network
  • Helps guide data placement
  • Shows the importance of balanced communication

Page 20: 并行算法概述

22

Model Parameters

• Latency (L)
  • The delay of sending a message from source to destination
  • Depends on the hop count and per-hop delay
• Communication overhead (o)
  • The time a processor spends sending or receiving one message
• Communication bandwidth (g)
  • The minimum time interval (gap) between consecutive messages
• Processor count (P)
  • The number of processors
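As an illustration of how these parameters are used, the following sketch (not part of the slides) plugs assumed values of L, o, g, and P into the usual LogP cost estimates for a single message and for a burst of k messages from one sender.

#include <algorithm>
#include <iostream>

// Illustrative LogP cost estimates; the parameter values are assumptions (microseconds).
struct LogP {
    double L;   // latency of one message through the network
    double o;   // per-message send/receive overhead on the processor
    double g;   // minimum gap between consecutive messages (inverse bandwidth)
    int    P;   // number of processors
};

// One point-to-point message: the sender pays o, the network L, the receiver o.
double point_to_point(const LogP& m) { return m.o + m.L + m.o; }

// k back-to-back messages from one sender: the gap g (or o, whichever is larger)
// limits injection, and the last message still needs L + o to arrive.
double k_messages(const LogP& m, int k) {
    return m.o + (k - 1) * std::max(m.g, m.o) + m.L + m.o;
}

int main() {
    LogP m{5.0, 1.0, 2.0, 16};
    std::cout << "1 message : " << point_to_point(m) << " us\n";
    std::cout << "8 messages: " << k_messages(m, 8) << " us\n";
}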

Page 21: 并行算法概述

23

LogP Model

[Diagram: timeline of one message in the LogP model: the sender pays overhead o, the message travels for latency L, consecutive sends are separated by the gap g, and the receiver pays overhead o.]

Page 22: 并行算法概述

24

• Bulk Synchronous Parallel (BSP)
  • p processors, each with local memory
  • A router
  • Periodic global synchronization
• Factors considered
  • Bandwidth limits
  • Latency
  • Synchronization overhead
• Factors not considered
  • Communication overhead
  • Processor topology

Page 23: 并行算法概述

25

BSP Computer

• A distributed-memory architecture
• Three kinds of components
  • Nodes
    • Processor
    • Local memory
  • Router (communication network)
    • Point-to-point, message passing, or shared variables
  • Barrier
    • Full or partial

Page 24: 并行算法概述

26

Illustration of BSP

[Diagram: nodes (processor + local memory, parameter w) connected by a communication network (parameter g) and synchronized by a barrier (parameter l).]

w parameter: the maximum computation time within each superstep; computation takes at most w clock cycles.

g parameter: the number of clock cycles needed to deliver one message unit when all processors are communicating, i.e., the network bandwidth. With h the maximum number of messages sent or received per superstep, communication takes gh clock cycles.

l parameter: barrier synchronization takes l clock cycles.
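A small sketch (not from the slides) of the resulting superstep cost T = w + g·h + l; all parameter values below are made-up examples.

#include <iostream>

// Cost of one BSP superstep: local computation (at most w cycles), communication
// of at most h messages per processor (g cycles each), and the barrier (l cycles).
long superstep_cost(long w, long g, long h, long l) {
    return w + g * h + l;
}

int main() {
    long w = 10000, g = 4, h = 200, l = 500;   // illustrative values
    std::cout << "one superstep: " << superstep_cost(w, g, h, l) << " cycles\n";
    // A program with S supersteps costs roughly the sum of its per-superstep terms.
}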

Page 25: 并行算法概述

27

BSP Program

• Each BSP computation consists of S supersteps
• A superstep consists of a sequence of computation steps followed by a barrier
• Superstep
  • Any remote memory access must wait for the barrier; this is loose synchronization

Page 26: 并行算法概述

28

BSP Program

[Diagram: processes P1 to P4 executing two supersteps; within each superstep every process first computes, then communicates, and all processes then meet at a barrier.]

Page 27: 并行算法概述

Example: Pregel

Pregel is a framework developed by Google (SIGMOD 2010):
• High scalability
• Fault tolerance
• Flexible implementation of graph algorithms

Page 28: 并行算法概述

Bulk Synchronous Parallel Model

30

[Diagram: BSP iterations: in each iteration the data items are distributed across CPU 1, CPU 2, and CPU 3, the CPUs compute and exchange data, and a barrier separates successive iterations.]

Page 29: 并行算法概述

31

Graph

Page 30: 并行算法概述

Entities and Supersteps

A computation consists of vertices, edges, and a sequence of iterations (supersteps). Each vertex has a value. Each edge carries its source vertex, an edge value, and its destination vertex.

In each superstep:
• A user-defined function F processes every vertex V
• F reads the messages sent to V in superstep S - 1 and sends messages to other vertices, which will be received in superstep S + 1
• F may modify the state of vertex V and of its outgoing edges
• F may change the topology of the graph

32

Page 31: 并行算法概述

Algorithm Termination

Termination is decided by the vertices voting:
• In superstep 0, every vertex is active
• All active vertices participate in the computation of any given superstep
• When a vertex votes to halt, it becomes inactive
• A vertex becomes active again if it receives an external message

The program terminates when all vertices are simultaneously inactive.

33

[Diagram, vertex state machine: an Active vertex becomes Inactive by voting to halt; an Inactive vertex becomes Active again when a message is received.]

Page 32: 并行算法概述

The Pregel API in C++

A Pregel program is written by subclassing the vertex class. The template parameters define the types for vertices, edges, and messages; the Compute() function is overridden to define the computation at each superstep.

template <typename VertexValue,
          typename EdgeValue,
          typename MessageValue>
class Vertex {
 public:
  // Override to define the computation at each superstep
  virtual void Compute(MessageIterator* msgs) = 0;

  const string& vertex_id() const;
  int64 superstep() const;

  const VertexValue& GetValue();        // get the value of the current vertex
  VertexValue* MutableValue();          // modify the value of the vertex
  OutEdgeIterator GetOutEdgeIterator();

  // Pass messages to other vertices
  void SendMessageTo(const string& dest_vertex,
                     const MessageValue& message);
  void VoteToHalt();
};

Page 33: 并行算法概述

Pregel Code for Finding the Max Value

class MaxFindVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    double currMax = GetValue();
    SendMessageToAllNeighbors(currMax);
    // Take the maximum over all messages received in the previous superstep
    for ( ; !msgs->Done(); msgs->Next()) {
      if (msgs->Value() > currMax)
        currMax = msgs->Value();
    }
    if (currMax > GetValue())
      *MutableValue() = currMax;
    else
      VoteToHalt();
  }
};

35

Page 34: 并行算法概述

Finding the Max Value in a Graph

[Diagram: four vertices with initial values 3, 6, 2, 1; over successive supersteps the maximum value 6 propagates until every vertex holds 6. The number inside each node is the vertex value, blue arrows are messages, and blue vertices have voted to halt.]

Page 35: 并行算法概述

37

Model Survey Summary

• No single model is acceptable!
• Across models, a subset of characteristics receives most of the attention:
  • Computational parallelism
  • Communication latency
  • Communication overhead
  • Communication bandwidth
  • Execution synchronization
  • Memory hierarchy
  • Network topology

Page 36: 并行算法概述

38

Computational Parallelism

• Number of physical processors
• Static versus dynamic parallelism
  • Should the number of processors be fixed?
  • Fault-recovery networks allow for node failure
  • Many parallel systems allow incremental upgrades by increasing the node count

Page 37: 并行算法概述

39

Latency

• Fixed or variable message length?
• Network topology?
• Communication overhead?
• Contention-based latency?
• Memory hierarchy?

Page 38: 并行算法概述

40

Bandwidth

• A limited resource
• With low latency, there is a tendency to abuse bandwidth by flooding the network

Page 39: 并行算法概述

41

Synchronization

• Solving a wide class of problems requires asynchronous parallelism

• Synchronization is achieved via message passing

• Synchronization is itself a communication cost

Page 40: 并行算法概述

42

Unified Model?

• Difficult
  • Parallel machines are complicated
  • Still evolving
  • Different users from diverse disciplines

• Requires a common set of characteristics derived from the needs of different users

• Again, a balance between descriptive and prescriptive power is needed

Page 41: 并行算法概述

43

Content

• Parallel Computing Model
• Basic Techniques of Parallel Algorithm Design
  • Concepts
  • Decomposition
  • Task
  • Mapping
  • Algorithm Model

Page 42: 并行算法概述

44

Decomposition, Tasks, and Dependency Graphs

• The first step in designing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently

• A decomposition can be represented by a task dependency graph, in which nodes represent tasks and edges represent dependencies between tasks

Page 43: 并行算法概述

45

Example: Multiplying a Dense Matrix with a Vector

Each element of the output vector y can be computed independently. Therefore, the matrix-vector product can be decomposed into n tasks.

Page 44: 并行算法概述

46

Example: Database Query Processing

Execute the following query on the database below:

MODEL = ``CIVIC'' AND YEAR = 2001 AND (COLOR = ``GREEN'' OR COLOR = ``WHITE'')

ID#   Model    Year  Color  Dealer  Price
4523  Civic    2002  Blue   MN      $18,000
3476  Corolla  1999  White  IL      $15,000
7623  Camry    2001  Green  NY      $21,000
9834  Prius    2001  Green  CA      $18,000
6734  Civic    2001  White  OR      $17,000
5342  Altima   2001  Green  FL      $19,000
3845  Maxima   2001  Blue   NY      $22,000
8354  Accord   2000  Green  VT      $18,000
4395  Civic    2001  Red    CA      $17,000
7352  Civic    2002  Red    WA      $18,000

Page 45: 并行算法概述

47

Example: Database Query Processing

Executing the query can be divided into tasks. Each task can be viewed as producing an intermediate result that satisfies a certain condition.

Edges indicate that the output of one task is the input of another.

Page 46: 并行算法概述

48

Example: Database Query Processing

The same problem can be decomposed in other ways. Different decompositions may differ significantly in performance.

Page 47: 并行算法概述

49

Task Granularity

• The more tasks a decomposition produces, the finer the granularity; otherwise the granularity is coarse.

Page 48: 并行算法概述

50

Degree of Concurrency

• The number of tasks that can execute in parallel is the degree of concurrency of a decomposition
  • maximum degree of concurrency
  • average degree of concurrency

• The finer the task granularity, the larger the degree of concurrency.

Page 49: 并行算法概述

51

Task Interaction Graphs

• Tasks usually need to exchange data with one another
• The graph that captures these exchanges among tasks is called a task interaction graph
• Task interaction graphs represent data dependencies; task dependency graphs represent control dependencies.

Page 50: 并行算法概述

52

Task Interaction Graphs: An Example

Multiply a sparse matrix A by a vector b.

Computing each element of the result vector can be viewed as an independent task. Because of memory optimization, b can be partitioned among the tasks, and the resulting task interaction graph turns out to be the same as the graph of the matrix A.

Page 51: 并行算法概述

53

Processes and Mapping

• The number of tasks usually exceeds the number of processing elements, so tasks must be mapped to processes

• An appropriate task mapping is critical to the performance of a parallel algorithm
• The mapping is determined by both the task dependency graph and the task interaction graph
  • The task dependency graph is used to ensure that work is evenly distributed across all processes at any point in time (minimum idling and optimal load balance).
  • The task interaction graph is used to ensure that each process interacts as little as possible with other processes (minimum communication).

Page 52: 并行算法概述

54

Processes and Mapping: Example

Mapping the database query tasks to processes. Since tasks within the same level have no dependencies among them, tasks at the same level can be assigned to different processes.

Page 53: 并行算法概述

55

Decomposition Techniques

• recursive decomposition
• data decomposition
• exploratory decomposition
• speculative decomposition

Page 54: 并行算法概述

56

Recursive Decomposition

• Suitable for problems that can be solved by divide and conquer
• The given problem is first decomposed into a set of subproblems
• These subproblems are decomposed recursively until the desired task granularity is reached

Page 55: 并行算法概述

57

Recursive Decomposition: Example

The classic example is quicksort.

In this example, once the list has been partitioned around the pivot, each sub-list can be processed concurrently (i.e., each sub-list represents an independent subtask). This can be repeated recursively.

Page 56: 并行算法概述

58

Recursive Decomposition: Example

We first start with a simple serial loop for computing the minimum entry in a given list:

1. procedure SERIAL_MIN (A, n)
2. begin
3.   min := A[0];
4.   for i := 1 to n - 1 do
5.     if (A[i] < min) min := A[i];
6.   endfor;
7.   return min;
8. end SERIAL_MIN

Page 57: 并行算法概述

59

Recursive Decomposition: Example

We can rewrite the loop as follows:

1.  procedure RECURSIVE_MIN (A, n)
2.  begin
3.    if ( n = 1 ) then
4.      min := A[0];
5.    else
6.      lmin := RECURSIVE_MIN ( A, n/2 );
7.      rmin := RECURSIVE_MIN ( &(A[n/2]), n - n/2 );
8.      if (lmin < rmin) then
9.        min := lmin;
10.     else
11.       min := rmin;
12.     endelse;
13.   endelse;
14.   return min;
15. end RECURSIVE_MIN
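A possible C++ realization of this recursive decomposition (a sketch, not the course's reference code): the left half is evaluated asynchronously while the right half runs in the current thread, and a serial cutoff keeps the task granularity from becoming too fine.

#include <algorithm>
#include <future>
#include <iostream>
#include <vector>

// Recursive decomposition of the minimum computation: the two halves are
// independent subtasks, so one of them is launched asynchronously.
// The cutoff value is an assumption to keep task granularity reasonable.
int recursive_min(const int* a, int n) {
    if (n == 1) return a[0];
    if (n < 4) return *std::min_element(a, a + n);       // small cases solved serially
    int half = n / 2;
    auto left = std::async(std::launch::async, recursive_min, a, half);
    int rmin  = recursive_min(a + half, n - half);        // right half in the current thread
    int lmin  = left.get();
    return std::min(lmin, rmin);
}

int main() {
    std::vector<int> a = {4, 9, 1, 7, 8, 11, 2, 12};      // the example used on the slides
    std::cout << recursive_min(a.data(), (int)a.size()) << "\n";   // prints 1
}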

Page 58: 并行算法概述

60

Recursive Decomposition: Example

The code above can be illustrated with the following minimum-finding example.

Find the minimum of {4, 9, 1, 7, 8, 11, 2, 12}. The task dependency graph is as follows:

Page 59: 并行算法概述

61

Data Decomposition

• Partition the data and assign the partitions to different tasks
  • Input data partitioning
  • Intermediate data partitioning
  • Output data partitioning
• Each element of the output data can be computed independently

Page 60: 并行算法概述

62

Output Data Decomposition: Example

An n x n matrix A is multiplied by a matrix B to produce a matrix C. With A, B, and C partitioned into 2 x 2 blocks, computing the output matrix C can be divided into the following four tasks, one per block of C:

Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1

Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2

Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1

Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2

Page 61: 并行算法概述

63

Output Data Decomposition: Example

Based on the preceding matrix multiplication example, the following two decompositions (into eight tasks each) can also be derived:

Decomposition I
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2
Task 4: C1,2 = C1,2 + A1,2 B2,2
Task 5: C2,1 = A2,1 B1,1
Task 6: C2,1 = C2,1 + A2,2 B2,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2
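The following sketch (illustrative only, not from the slides) runs the output-block tasks of a 2 x 2 blocking concurrently: each task computes one disjoint block of C, so no synchronization is needed beyond waiting for all tasks to finish. The matrix size and contents are assumptions.

#include <future>
#include <iostream>
#include <vector>

// Output-data decomposition of C = A * B into four independent tasks,
// one per block of C (2 x 2 blocking).
using Mat = std::vector<std::vector<double>>;

// Compute the block C[r0..r1) x [c0..c1) as sum over k of A[r][k] * B[k][c].
void block_task(const Mat& A, const Mat& B, Mat& C,
                int r0, int r1, int c0, int c1) {
    int n = (int)A.size();
    for (int r = r0; r < r1; ++r)
        for (int c = c0; c < c1; ++c) {
            double s = 0;
            for (int k = 0; k < n; ++k) s += A[r][k] * B[k][c];
            C[r][c] = s;                        // each task writes a disjoint block of C
        }
}

int main() {
    int n = 4, h = n / 2;                       // illustrative size
    Mat A(n, std::vector<double>(n, 1.0)), B = A, C(n, std::vector<double>(n, 0.0));
    std::vector<std::future<void>> tasks;
    // Tasks 1..4: the four blocks C11, C12, C21, C22 are computed concurrently.
    for (int bi = 0; bi < 2; ++bi)
        for (int bj = 0; bj < 2; ++bj)
            tasks.push_back(std::async(std::launch::async, block_task,
                                       std::cref(A), std::cref(B), std::ref(C),
                                       bi * h, (bi + 1) * h, bj * h, (bj + 1) * h));
    for (auto& t : tasks) t.get();
    std::cout << "C[0][0] = " << C[0][0] << "\n";   // 4 for the all-ones 4x4 inputs
}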

Page 62: 并行算法概述

64

Input Data Partitioning

• If the output is not known in advance, input partitioning can be used instead
• Each task processes a portion of the input data and produces a partial result; the partial results are combined into the final result

Page 63: 并行算法概述

65

Input Data Partitioning: Example

The example of counting itemset frequencies over a set of transactions can use input data partitioning, as sketched below.
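A minimal sketch of this input-partitioning pattern (the transaction data and the two-way split are invented for illustration): each task counts items in its own share of the transactions, and the partial counts are merged at the end.

#include <future>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Input-data partitioning for item counting: each task scans its own share of
// the transactions and builds local counts; the partial results are then merged.
using Counts = std::map<std::string, int>;

Counts count_part(const std::vector<std::string>& txns, size_t begin, size_t end) {
    Counts local;
    for (size_t i = begin; i < end; ++i) ++local[txns[i]];    // local result only
    return local;
}

int main() {
    std::vector<std::string> txns = {"bread", "milk", "bread", "beer", "milk", "bread"};
    size_t mid = txns.size() / 2;
    auto part1 = std::async(std::launch::async, count_part, std::cref(txns), size_t{0}, mid);
    Counts total = count_part(txns, mid, txns.size());         // second part locally
    for (const auto& [item, c] : part1.get()) total[item] += c;   // merge step
    for (const auto& [item, c] : total) std::cout << item << ": " << c << "\n";
}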

Page 64: 并行算法概述

66

Partitioning Input and Output Data

Input partitioning and output partitioning can also be combined to obtain a higher degree of concurrency. For the transaction-counting example, the transaction set (input) and the itemset counts (output) can be partitioned simultaneously as follows:

Page 65: 并行算法概述

67

Intermediate Data Partitioning

• A computation can often be viewed as a sequence of transformations from the input to the output.

• It is therefore also possible to decompose the intermediate results.

Page 66: 并行算法概述

68

Intermediate Data Partitioning: Example

Let us revisit the example of dense matrix multiplication.

Page 67: 并行算法概述

69

Intermediate Data Partitioning: Example

A decomposition of the intermediate data structure leads to the following decomposition into 8 + 4 tasks:

Stage I
Task 01: D1,1,1 = A1,1 B1,1
Task 02: D2,1,1 = A1,2 B2,1
Task 03: D1,1,2 = A1,1 B1,2
Task 04: D2,1,2 = A1,2 B2,2
Task 05: D1,2,1 = A2,1 B1,1
Task 06: D2,2,1 = A2,2 B2,1
Task 07: D1,2,2 = A2,1 B1,2
Task 08: D2,2,2 = A2,2 B2,2

Stage II
Task 09: C1,1 = D1,1,1 + D2,1,1
Task 10: C1,2 = D1,1,2 + D2,1,2
Task 11: C2,1 = D1,2,1 + D2,2,1
Task 12: C2,2 = D1,2,2 + D2,2,2

Page 68: 并行算法概述

70

Intermediate Data Partitioning: Example

The task dependency graph for the decomposition (shown in previous foil) into 12 tasks is as follows:

Page 69: 并行算法概述

71

Exploratory Decomposition

• In many cases, the decomposition unfolds as the execution progresses
• These applications typically involve searching a state space of solutions
• Suitable applications include combinatorial optimization, theorem proving, game playing, etc.

Page 70: 并行算法概述

72

Exploratory Decomposition: Example

15 puzzle (a tile puzzle).

Page 71: 并行算法概述

73

Exploratory Decomposition: Example

Generate the successor states of the current state, and treat the search from each state as an independent task.

Page 72: 并行算法概述

74

Speculative Decomposition

• In some applications, the dependencies between tasks are not known in advance
• Two approaches:
  • Conservative approaches identify independent tasks only when it is certain there are no dependencies
  • Optimistic approaches schedule tasks even when they may turn out to be wrong
• Conservative approaches may yield little concurrency; optimistic approaches may require roll-back

Page 73: 并行算法概述

75

Speculative Decomposition: Example

An example is the simulation of a network (e.g., an assembly line or a computer network). The task is to simulate the behavior of the network for different inputs and node parameters (such as delay).

Page 74: 并行算法概述

76

Hybrid Decompositions

In quicksort, recursive decomposition alone limits concurrency; data decomposition and recursive decomposition can then be combined.

Discrete event simulation can combine data decomposition and speculative decomposition.

For finding the minimum, data decomposition and recursive decomposition can be combined.

Page 75: 并行算法概述

77

Task Characteristics

• Task characteristics affect the choice of parallel algorithm and its performance
  • Task generation
  • Task size (granularity)
  • Size of the data associated with tasks

Page 76: 并行算法概述

78

Task Generation

• Static task generation
  • Examples: matrix operations, graph algorithms, image processing applications, and other structured problems.
  • Such problems are usually decomposed using data decomposition or recursive decomposition.

• Dynamic task generation
  • An example is the 15-puzzle: each board position is generated from the previous one.
  • Such applications are typically decomposed using exploratory or speculative techniques.

Page 77: 并行算法概述

79

Task Sizes

• Task sizes may be uniform or non-uniform
• For example, in combinatorial optimization problems it is difficult to estimate the size of the state space

Page 78: 并行算法概述

80

Size of Data Associated with Tasks

• The size of data associated with a task may be small or large when viewed in the context of the size of the task.

• A small context of a task implies that an algorithm can easily communicate this task to other processes dynamically (e.g., the 15 puzzle).

• A large context ties the task to a process, or alternately, an algorithm may attempt to reconstruct the context at another process as opposed to communicating the context of the task.

Page 79: 并行算法概述

81

Characteristics of Task Interactions

• Tasks may communicate with each other in various ways. The associated dichotomy is:
• Static interactions:
  • The tasks and their interactions are known a priori. These are relatively simpler to code into programs.
• Dynamic interactions:
  • The timing of interacting tasks cannot be determined a priori. These interactions are harder to code, especially, as we shall see, using message-passing APIs.

Page 80: 并行算法概述

82

Characteristics of Task Interactions

• Regular interactions:
  • There is a definite pattern (in the graph sense) to the interactions. These patterns can be exploited for efficient implementation.
• Irregular interactions:
  • Interactions lack well-defined topologies.

Page 81: 并行算法概述

83

Characteristics of Task Interactions: Example A simple example of a regular static interaction pattern is in image dithering. The underlying communication pattern is a structured (2-D mesh) one as shown here:

Page 82: 并行算法概述

84

Characteristics of Task Interactions: Example

The multiplication of a sparse matrix with a vector is a good example of a static irregular interaction pattern. Here is an example of a sparse matrix and its associated interaction pattern.

Page 83: 并行算法概述

85

Characteristics of Task Interactions

• Interactions may be read-only or read-write. • In read-only interactions, tasks just read data items associated with other tasks.

• In read-write interactions tasks read, as well as modify data items associated with other tasks.

Page 84: 并行算法概述

86

Mapping

• Mapping Techniques for Load Balancing
  • Static and Dynamic Mapping

• Methods for Minimizing Interaction Overheads
  • Maximizing data locality
  • Minimizing contention and hot spots
  • Overlapping communication and computation
  • Replication vs. communication
  • Group communication vs. point-to-point communication

• Parallel Algorithm Design Models
  • Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models

Page 85: 并行算法概述

87

Mapping Techniques

• Mappings must minimize overheads.
• The primary overheads are communication and idling.
• Minimizing these overheads often represents contradicting objectives.
• Assigning all work to one processor trivially minimizes communication, at the expense of significant idling.

Page 86: 并行算法概述

88

Mapping Techniques for Minimum Idling

Mapping must simultaneously minimize idling and balance the load. Merely balancing the load does not minimize idling.

Page 87: 并行算法概述

89

Mapping Techniques for Minimum Idling

Mapping techniques can be static or dynamic.

• Static Mapping
  • Tasks are mapped to processes a priori.
  • For this to work, we must have a good estimate of the size of each task. Even in these cases, the problem may be NP-complete.

• Dynamic Mapping
  • Tasks are mapped to processes at runtime.
  • This may be because the tasks are generated at runtime, or because their sizes are not known.

Page 88: 并行算法概述

90

Schemes for Static Mapping

• Mappings based on data partitioning
• Mappings based on task-graph partitioning
• Hybrid mappings

Page 89: 并行算法概述

91

Mappings Based on Data Partitioning

The simplest data decomposition schemes for dense matrices are 1-D block distribution schemes, as sketched below.
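A small helper (illustrative, not from the slides) that computes which contiguous band of rows each process owns under a 1-D block distribution, including the case where the number of rows is not divisible by the number of processes.

#include <algorithm>
#include <iostream>

// 1-D block distribution: process k owns a contiguous band of rows.
// The first (n % p) processes get one extra row when p does not divide n.
struct Block { int first, count; };

Block block_of(int n, int p, int k) {
    int base = n / p, extra = n % p;
    int first = k * base + std::min(k, extra);
    return {first, base + (k < extra ? 1 : 0)};
}

int main() {
    int n = 10, p = 4;                  // illustrative sizes
    for (int k = 0; k < p; ++k) {
        Block b = block_of(n, p, k);
        std::cout << "process " << k << ": rows " << b.first
                  << ".." << b.first + b.count - 1 << "\n";
    }
}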

Page 90: 并行算法概述

92

Block Array Distribution Schemes

Block distribution schemes can be generalized to higher dimensions as well.

Page 91: 并行算法概述

93

Block Array Distribution Schemes: Examples

• For multiplying two dense matrices A and B, we can partition the output matrix C using a block decomposition.

• For load balance, we give each task the same number of elements of C. (Note that each element of C corresponds to a single dot product.)

Page 92: 并行算法概述

94

Cyclic and Block Cyclic Distributions

• If the amount of computation associated with data items varies, a block decomposition may lead to significant load imbalances.

• A simple example of this is in LU decomposition (or Gaussian Elimination) of dense matrices.

Page 93: 并行算法概述

95

LU Factorization of a Dense Matrix


A decomposition of LU factorization into 14 tasks - notice the significant load imbalance.

Page 94: 并行算法概述

96

Block Cyclic Distributions

Variation of the block distribution scheme that can be used to alleviate the load-imbalance and idling problems.

Partition an array into many more blocks than the number of available processes.

Blocks are assigned to processes in a round-robin manner so that each process gets several non-adjacent blocks.
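A sketch of the resulting assignment (illustrative sizes only): with many more blocks than processes, block b simply goes to process b mod p, so each process ends up with several non-adjacent blocks.

#include <iostream>

// Block-cyclic distribution: the array is split into many more blocks than
// processes, and block b is assigned to process b mod p in round-robin order.
int owner_of_block(int b, int p) { return b % p; }

int main() {
    int n = 16, block_size = 2, p = 4;      // illustrative sizes
    for (int i = 0; i < n; ++i) {
        int b = i / block_size;             // which block element i falls in
        std::cout << "element " << i << " -> block " << b
                  << " -> process " << owner_of_block(b, p) << "\n";
    }
}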

Page 95: 并行算法概述

99

Mappings Based on Task Partitioning

• Partitioning a given task-dependency graph across processes.

• Determining an optimal mapping for a general task-dependency graph is an NP-complete problem.

Page 96: 并行算法概述

100

Task Partitioning: Mapping a Binary Tree Dependency Graph

The example illustrates the dependency graph of one view of quicksort and how it can be assigned to processes in a cube.

Page 97: 并行算法概述

101

Hierarchical Mappings

• Sometimes a single mapping technique is inadequate.

• For example, the task mapping of the binary tree (quicksort) cannot use a large number of processors.

• For this reason, task mapping can be used at the top level and data partitioning within each level.

Page 98: 并行算法概述

102

An example of task partitioning at top level with data partitioning at the lower level.

Page 99: 并行算法概述

103

Schemes for Dynamic Mapping

• Dynamic mapping is sometimes also referred to as dynamic load balancing, since load balancing is the primary motivation for dynamic mapping.

• Dynamic mapping schemes can be centralized or distributed.

Page 100: 并行算法概述

104

Centralized Dynamic Mapping

• Processes are designated as masters or slaves.
• When a process runs out of work, it requests the master for more work.
• When the number of processes increases, the master may become the bottleneck.
• To alleviate this, a process may pick up a number of tasks (a chunk) at one time. This is called chunk scheduling.
• Selecting large chunk sizes may lead to significant load imbalances as well.
• A number of schemes have been used to gradually decrease the chunk size as the computation progresses.
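A minimal shared-memory sketch of chunk scheduling (illustrative; a real master-slave scheme would use messages rather than a shared counter): workers repeatedly grab a fixed-size chunk of task indices from a central counter until the work runs out.

#include <algorithm>
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int total_tasks = 100, chunk = 8, workers = 4;   // illustrative sizes
    std::atomic<int> next{0};            // plays the role of the master's work queue
    std::vector<int> done(workers, 0);

    auto worker = [&](int id) {
        while (true) {
            int start = next.fetch_add(chunk);              // request a chunk of work
            if (start >= total_tasks) break;                // no work left
            int end = std::min(start + chunk, total_tasks);
            for (int t = start; t < end; ++t) ++done[id];   // "process" tasks start..end-1
        }
    };

    std::vector<std::thread> pool;
    for (int id = 0; id < workers; ++id) pool.emplace_back(worker, id);
    for (auto& t : pool) t.join();
    for (int id = 0; id < workers; ++id)
        std::cout << "worker " << id << " processed " << done[id] << " tasks\n";
}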

Page 101: 并行算法概述

105

Distributed Dynamic Mapping

• Each process can send or receive work from other processes.

• This alleviates the bottleneck in centralized schemes. • There are four critical questions:

• how are sensing and receiving processes paired together• who initiates work transfer• how much work is transferred• when is a transfer triggered?

• Answers to these questions are generally application specific.

Page 102: 并行算法概述

106

Minimizing Interaction Overheads

• Maximize data locality
  • Where possible, reuse intermediate data. Restructure the computation so that data can be reused in smaller time windows.
• Minimize the volume of data exchanged
  • There is a cost associated with each word that is communicated.
• Minimize the frequency of interactions
  • There is a startup cost associated with each interaction.
• Minimize contention and hot spots
  • Use decentralized techniques; replicate data where necessary.

Page 103: 并行算法概述

107

Minimizing Interaction Overheads (cont.)

• Overlap computations with interactions
  • Use non-blocking communications, multithreading, and prefetching to hide latencies.
• Replicate data or computations.
• Use group communications instead of point-to-point primitives.
• Overlap interactions with other interactions.

Page 104: 并行算法概述

108

Parallel Algorithm Models

An algorithm model is primarily a way of choosing a decomposition and a mapping technique so as to minimize interactions between tasks.

• Data-Parallel Model
  • Tasks are statically mapped to processes, and each task performs similar operations on different data.

• Task Graph Model
  • Starting from the task dependency graph, the interactions among tasks are used to promote locality or to reduce interaction costs.

Page 105: 并行算法概述

109

Parallel Algorithm Models (cont.)

• Master-Slave Model
  • One or more processes generate tasks and allocate them, statically or dynamically, to worker processes.

• Pipeline / Producer-Consumer Model
  • A stream of data passes through a succession of processes, each of which performs some task on it.

• Hybrid Models
  • Multiple models are combined horizontally, vertically, or sequentially to solve the application problem.

