Introduction
Because a large amount of computation is needed.
Why parallel?
Example
Real-time 3D imaging for surgery requires about 10^15 computations per second.
An ordinary computer performs about 10^7 computations per second.
10^15 / 10^7 = 10^8 seconds ≒ 10^3 days ≒ 3 years
Other fields that require massive computation:
Aircraft Testing
New Drug Development
Oil Exploration
Modeling Fusion Reactors
Economic Planning
Cryptanalysis
Managing Large Databases
Astronomy
Biomedical Analysis
…
The Speed of Computers
Speed of light: 3×10^8 m/s
Components placed too close together interfere with one another.
Speed has doubled every 3 to 4 years.
Unfortunately
It is evident that this trend will soon come to an end.
A simple law of physics
Parallelism
Attack the problem with sheer numbers of CPUs (a "human-wave" tactic, with CPUs as the manpower).
Many computers today already contain more than one CPU.
Models of Computation
Flynn's classification:
1. SISD (Single Instruction, Single Data)
2. MISD (Multiple Instruction, Single Data)
3. SIMD (Single Instruction, Multiple Data)
4. MIMD (Multiple Instruction, Multiple Data)
SISD Computers
Sequential (serial) computation
Example
summation of n numbers:

Sum = 0
For I = 1 to n
  Sum = Sum + I
Next

n−1 additions are needed → O(n) time
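The BASIC-style loop above can be sketched in Python (the function name is illustrative, not from the slides); it performs the additions one after another, which is exactly the O(n)-time SISD behaviour being described:

```python
# Sequential (SISD) summation: one processor performs the additions
# one at a time, so n numbers take O(n) time.
def sequential_sum(numbers):
    total = 0
    for x in numbers:
        total += x          # one addition per loop iteration
    return total
```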
MISD Computers
Example
test whether a positive integer n is prime or not.
n is prime: it has no divisors except 1 and n.
n is composite: otherwise.
Each processor tests one candidate divisor between 1 and n.
Each processor needs O(1) operations → O(1) time
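A minimal sketch of this MISD idea, with the "processors" simulated sequentially (function name is an assumption; on a real MISD machine the divisibility tests would all run at the same time, giving the O(1) time claimed above):

```python
# MISD primality sketch: conceptually one processor per candidate
# divisor d, each doing a single O(1) divisibility test on the same n.
def is_prime_misd(n):
    if n < 2:
        return False
    # Simulated processors: processor d tests whether d divides n.
    return not any(n % d == 0 for d in range(2, n))
```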
SIMD Computers
PRAM (Parallel Random Access Machine)
Shared-Memory (SM) SIMD computers
How does processor i pass a number to another processor j?
O(1) time
1. EREW (Exclusive Read, Exclusive Write)
2. CREW (Concurrent Read, Exclusive Write)
3. ERCW (Exclusive Read, Concurrent Write)
4. CRCW (Concurrent Read, Concurrent Write)
How to resolve conflicts in CRCW?
(a) the smallest-numbered processor is allowed to write, and access is denied to all other processors;
(b) writing is allowed only if the values to be written by all processors are equal, otherwise access is denied to all processors;
(c) the values to be written are summed.
Simulating multiple accesses on an EREW computer:
(i) N multiple (read) accesses: broadcasting

procedure:
1. P1 → P2
2. P1, P2 → P3, P4
3. P1, P2, P3, P4 → P5, P6, P7, P8
   ⋮

O(log N) steps
(Figure: the value 5 held by P1 is broadcast to processors P1–P8 in three doubling steps, via shared memory.)
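The doubling procedure above can be sketched as follows (a sequential simulation; names are illustrative). In each step every processor that already holds the value writes it into a distinct new processor's cell, so no two processors touch the same cell and the EREW rules are respected:

```python
# EREW broadcast by doubling: the number of processors holding the
# value doubles each step, so N processors need O(log N) steps.
def erew_broadcast(value, N):
    cells = [None] * N          # one local cell per processor
    cells[0] = value            # P1 starts with the value
    have = 1                    # how many processors hold it so far
    steps = 0
    while have < N:
        for i in range(have):
            j = have + i        # Pi copies to P(have+i): disjoint cells
            if j < N:
                cells[j] = cells[i]
        have = min(2 * have, N)
        steps += 1
    return cells, steps
```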
(ii) m out of N multiple (read) accesses, m ≦ N: each memory location is associated with a binary tree.
(Figure: processors P1, P2, P4 and P7 obtain the datum d from one memory location through its associated binary tree, in O(log N) steps.)
Dividing a shared memory into blocks
Interconnection-network SIMD computers
The memory is distributed among the N processors.
Every pair of processors can communicate with each other if they are connected by a line(link).
Instantaneous communication between several pairs of processors is allowed.
(1) fully connected networks
There are N(N−1)/2 links.
(2) linear array (1-dimensional array, 1-D array)
Pi is connected to Pi−1 and Pi+1, for 2 ≦ i ≦ N−1.
(3) two-dimensional array (2-D mesh-connected)
P(j,k) has links connecting to P(j+1,k), P(j-1,k), P(j,k+1) and P(j,k-1).
(4) tree connection (tree machines)
The sons of Pi: P2i and P2i+1
The parent of Pi: P⌊i/2⌋
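With this 1-indexed numbering the links can be computed directly from the processor index (helper names are illustrative):

```python
# Tree-machine links, 1-indexed: the sons of Pi are P(2i) and P(2i+1),
# and the parent of Pi is P(floor(i/2)).
def tree_sons(i):
    return 2 * i, 2 * i + 1

def tree_parent(i):
    return i // 2   # integer division = floor(i/2)
```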
(5) shuffle-exchange connection
shuffle links: Pi → Pj, where
  j = 2i          if 0 ≦ i ≦ N/2 − 1
  j = 2i − (N−1)  if N/2 ≦ i ≦ N−1
exchange links: Pi ↔ Pi+1 if i is even, 0 ≦ i ≦ N−1
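The two link types above can be written down directly (function names are illustrative); for N = 2^k the shuffle is a left rotation of the index's k-bit binary representation, and the exchange flips the lowest bit:

```python
# Shuffle and exchange neighbours for N = 2^k processors, 0-indexed.
def shuffle(i, N):
    # j = 2i for the first half, j = 2i - (N-1) for the second half.
    return 2 * i if i < N // 2 else 2 * i - (N - 1)

def exchange(i):
    # Even i is linked to i+1 and vice versa (flip the lowest bit).
    return i + 1 if i % 2 == 0 else i - 1
```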
(6) cube connection (n-cube,hypercube)
3-cube:
4-cube:
An n-cube connected network consists of 2^n nodes. (N = 2^n)
binary representation of node i: in−1 in−2 … ij … i1 i0
Node i connects to every node in−1 in−2 … īj … i1 i0 whose representation differs in exactly one bit position j.
Each processor has n links.
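Since a neighbour differs in exactly one bit, the n links of a node can be enumerated by XOR-ing with each power of two (function name is illustrative):

```python
# Hypercube links: node i connects to the n nodes obtained by
# complementing one bit j of i's n-bit binary representation.
def hypercube_neighbours(i, n):
    return [i ^ (1 << j) for j in range(n)]
```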
adding 8 numbers:
This concept can be implemented on a tree machine.
This concept can also be implemented on other models.
time: O(log n), where n = # of inputs
m sets, each of n numbers: with pipelining, log n + (m − 1) steps.
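The tree-machine addition above can be simulated round by round (a sketch; names are illustrative, and the input length is assumed to be a power of two). Each round adds adjacent pairs "in parallel", so 8 numbers need 3 rounds:

```python
# Pairwise (tree) addition: each round halves the number of values,
# so n numbers are summed in O(log n) parallel rounds.
def tree_sum(values):
    level = list(values)        # assumes len(values) is a power of 2
    rounds = 0
    while len(level) > 1:
        level = [level[i] + level[i + 1]
                 for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds
```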
MIMD Computers
MIMD computers sharing a common memory: multiprocessors (tightly coupled machines)
MIMD computers with an interconnection network: multicomputers (loosely coupled machines, distributed systems)
Analyzing Algorithms
● running time
● speedup
● cost
● efficiency
An analogy with an engineering project:
● the time until the project is completed
● comparison with the time a single person would take
● person-years (or person-months)
● person-years of one person / person-years of n persons
Running Time
The time from when the first processor begins computing until the last processor finishes computing.
Running Time
Sum = 0             1 time
For I = 1 to n      N+1 times
  Sum = Sum + I     N times
Next                N+1 times

Total: 3N + 3 operations = O(N)
g(n) = O(f(n)): g grows at most like f(n);
∃ c, n0 such that g(n) ≦ c·f(n) for all n ≧ n0.
Example: g(n) = 3n + 3, f(n) = n:
3n + 3 ≦ 5n when n ≧ 2 (c = 5, n0 = 2).
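The chosen constants can be checked numerically over a finite range (a sanity check only, not a proof; the helper names and the cutoff are assumptions):

```python
# Check g(n) <= c*f(n) (big-O witness) and g(n) >= c*f(n) (big-Omega
# witness) for all n in [n0, upto).
def big_o_holds(g, f, c, n0, upto=1000):
    return all(g(n) <= c * f(n) for n in range(n0, upto))

def big_omega_holds(g, f, c, n0, upto=1000):
    return all(g(n) >= c * f(n) for n in range(n0, upto))
```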
Upper Bound
g(n) = O(f(n)): at most f(n);
∃ c, n0 such that g(n) ≦ c·f(n) for n ≧ n0.
(Figure: g(n) stays below c·f(n) for all n ≧ n0.)
g(n) = Ω(f(n)): g grows at least like f(n);
∃ c, n0 such that g(n) ≧ c·f(n) for all n ≧ n0.
Example: g(n) = 3n + 3, f(n) = n:
3n + 3 ≧ 2n when n ≧ 1 (c = 2, n0 = 1).
Lower Bound
g(n) = Ω(f(n)): at least f(n);
∃ c, n0 such that g(n) ≧ c·f(n) for n ≧ n0.
(Figure: g(n) stays above c·f(n) for all n ≧ n0.)
adding 8 numbers:
time: O(log n)
The time complexity of heapsort is O(n log n).
The lower bound of (comparison-based) sorting is Ω(n log n).
(Why? A comparison sort must distinguish all n! possible orderings, so it needs at least log₂(n!) = Ω(n log n) comparisons.)
Thus, heapsort is optimal.
The upper bound of sorting is O(n log n).
For matrix multiplication, the lower bound is Ω(n2) and the upper bound is O(n2.5). (There exists such an algorithm.)
speedup = (time for the fastest sequential algorithm) / (time for the parallel algorithm)
i.e., the time one CPU takes / the time taken in parallel.
Adding n numbers on a tree machine
Time: O(log n)
Processors: n − 1
speedup = O(n / log n)
Bitonic sorting
Time: O(log² n)
Processors: O(n)
speedup = O(n log n / log² n) = O(n / log n)
cost = parallel time × # of processors = t(n) · p(n)
s(n): time for the fastest sequential algo.
If s(n)=t(n)*p(n), the algorithm is cost optimal (achieves optimal speedup).
adding n numbers: processors O(n / log n), time O(log n) → cost optimal.
Each processor first adds log n numbers sequentially.
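A sequential sketch of this cost-optimal scheme (names are illustrative, and the block size is an assumption for small inputs): each of the O(n / log n) processors first sums a block of about log n numbers on its own, then the partial sums are combined pairwise as on a tree machine:

```python
import math

# Cost-optimal addition: p = n/log n processors each add log n numbers
# sequentially (O(log n) time), then the p partial sums are combined
# pairwise in O(log p) = O(log n) rounds. Total cost: O(n).
def cost_optimal_sum(values):
    n = len(values)
    block = max(1, int(math.log2(n)))   # ~log n numbers per processor
    partial = [sum(values[i:i + block]) for i in range(0, n, block)]
    while len(partial) > 1:             # pairwise combining tree
        if len(partial) % 2:
            partial.append(0)           # pad so pairs line up
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]
```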
efficiency = (time for the fastest sequential algorithm) / (cost of the parallel algorithm)
e.g. bitonic sorting: efficiency = O(n log n) / O(n · log² n) = 1 / log n
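Putting the three measures together (a sketch with illustrative names; the bitonic-sort figures plugged in below use n = 1024, so log n = 10):

```python
# speedup    = sequential time / parallel time
# cost       = parallel time * number of processors
# efficiency = sequential time / cost
def speedup_cost_efficiency(t_seq, t_par, p):
    cost = t_par * p
    return t_seq / t_par, cost, t_seq / cost
```

For bitonic sorting with n = 1024: t_seq ≈ n log n = 10240, t_par ≈ log² n = 100, p = n = 1024, giving efficiency 1/log n = 0.1, matching the formula above.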