CSE-700 Parallel Programming
Introduction
POSTECH
Sep 6, 2007
박성우 (Sungwoo Park)
2
Common Features?
3
... runs faster on
4
Multi-core CPUs
• IBM Power4, dual-core, 2000
• Intel reaches thermal wall, 2004 ⇒ no more free lunch!
• Intel Xeon, quad-core, 2006
• Sony PlayStation 3 Cell, eight cores enabled, 2006
• Intel, 80 cores, 2011 (prototype finished)
source: Herb Sutter - "Software and the concurrency revolution"
5
Parallel Programming Models
• POSIX threads (API)
• OpenMP (API)
• HPF (High Performance Fortran)
• Cray's Chapel
• NESL
• Sun's Fortress
• IBM's X10
• ... and a lot more.
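As a concrete point of reference (a minimal sketch, not from the original slides; the worker function and thread count are made up for illustration), the POSIX threads API in C looks like this:

/* Spawn two threads that each run an independent computation,
   then wait for both to finish.  Compile with: gcc -pthread hello.c */
#include <pthread.h>
#include <stdio.h>

static void *work(void *arg) {
    long id = (long) arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, (void *) 1L);
    pthread_create(&t2, NULL, work, (void *) 2L);
    pthread_join(t1, NULL);    /* wait for each thread to terminate */
    pthread_join(t2, NULL);
    return 0;
}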
6
Parallelism
• Data parallelism
  – ability to apply a function in parallel to each element of a collection of data
• Thread parallelism
  – ability to run multiple threads concurrently
  – Each thread uses its own local state.
• Shared memory parallelism
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
8
Data Parallelism = Data Separation
[Figure: an array a_1 ... a_n, a_{n+1} ... a_{n+m}, a_{n+m+1} ... a_{n+m+l} is split into three chunks, handled by hardware threads #1, #2, and #3 respectively]
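In code, the picture above corresponds to a parallel loop. A minimal OpenMP sketch (not from the slides; f and apply_all are hypothetical names):

/* Apply a pure function f to every element of a[0..n-1]; OpenMP
   splits the iteration space among the available hardware threads.
   Compile with: gcc -fopenmp -std=c99 example.c */
#include <omp.h>

static double f(double x) { return 2.0 * x; }   /* any pure function */

void apply_all(double *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = f(a[i]);        /* each thread works on its own chunk */
}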
9
Data Parallelism in Hardware
• GeForce 8800
  – 128 stream processors @ 1.3 GHz, 500+ GFLOPS
10
Data Parallelism in Programming Languages
• Fortress
  – Parallelism is the default.
for i ← 1:m, j ← 1:n do   // 1:n is a generator
  a[i, j] := b[i] c[j]
end
• NESL (1990s)
  – supports nested data parallelism
    • The function being applied itself can be parallel.
{sum(a) : a in [[2, 3], [8, 3, 9], [7]]};
  – Each inner sum runs in parallel; the result is [5, 20, 7].
11
Data Parallel Haskell (DAMP '07)
• Haskell + nested data parallelism
  – flattening (vectorization)
    • transforms a nested parallel program such that it manipulates only flat arrays
  – fusion
    • eliminates many intermediate arrays
• Example: 10,000 x 10,000 sparse matrix multiplication with 1 million elements
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
13
Thread Parallelism
[Figure: hardware threads #1 and #2, each with its own local state, exchange messages via synchronous communication]
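For a concrete (and simplified) rendering of this picture in C with pthreads (a sketch, not from the slides; channel, chan_send, and chan_recv are made-up names, and a one-slot buffer only approximates truly synchronous rendezvous):

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t nonempty, empty;
    int value;                 /* the message in transit */
    int full;                  /* 1 if a message is waiting */
} channel;

/* static initializer for a channel */
#define CHANNEL_INIT { PTHREAD_MUTEX_INITIALIZER, \
                       PTHREAD_COND_INITIALIZER,  \
                       PTHREAD_COND_INITIALIZER, 0, 0 }

void chan_send(channel *c, int v) {
    pthread_mutex_lock(&c->lock);
    while (c->full)                       /* wait until the slot is free */
        pthread_cond_wait(&c->empty, &c->lock);
    c->value = v;
    c->full = 1;
    pthread_cond_signal(&c->nonempty);    /* wake a waiting receiver */
    pthread_mutex_unlock(&c->lock);
}

int chan_recv(channel *c) {
    pthread_mutex_lock(&c->lock);
    while (!c->full)                      /* wait until a message arrives */
        pthread_cond_wait(&c->nonempty, &c->lock);
    int v = c->value;
    c->full = 0;
    pthread_cond_signal(&c->empty);       /* wake a waiting sender */
    pthread_mutex_unlock(&c->lock);
    return v;
}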
14
Pure Functional Threads
• Purely functional threads can run concurrently.
  – Effect-free computations can be executed in parallel with any other effect-free computations.
• Example: collision detection
[Figure: objects A and B move to new positions A' and B']
15
Manticore (DAMP '07)
• Three layers
  – sequential base language
    • functional language drawn from SML
    • no mutable references and arrays!
  – data-parallel programming
    • implicit:
      – The compiler and runtime system manage thread creation.
    • e.g., parallel arrays of parallel arrays:
[: 2 * n | n in nums where n > 0 :]
fun mapP f xs = [: f x | x in xs :]
– concurrent programming
16
Concurrent Programming in Manticore (DAMP '07)
• Based on Concurrent ML
  – threads and synchronous message passing
  – Threads do not share mutable state.
    • actually no mutable references and arrays
  – explicit:
    • The programmer manages thread creation.
Data Parallelism
Thread Parallelism
Shared Memory Parallelism (Shared State Concurrency)
18
Shared Memory Parallelism
[Figure: hardware threads #1, #2, and #3 all access a single shared memory]
19
World War II
20
Company of Heroes
• Interaction of a LOT of objects:
  – thousands of objects
  – Each object has its own mutable state.
  – Each object update affects several other objects.
  – All objects are updated 30+ times per second.
• Problem:
  – How do we handle simultaneous updates to the same memory location?
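A minimal illustration of the problem (a sketch, not from the slides): two threads incrementing the same counter without synchronization lose updates.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared mutable state */

static void *bump(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("expected 2000000, got %ld\n", counter);   /* usually less */
    return 0;
}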
21
Manual Lock-based Synchronization
pthread_mutex_lock(&mutex);
mutate_variable();
pthread_mutex_unlock(&mutex);
• Locks and condition variables
  ⇒ fundamentally flawed!
22
Bank Accounts (from "Beautiful Concurrency", Peyton Jones, 2007)
[Figure: threads #1, #2, ..., #n issue transfer requests against accounts A and B, which live in shared memory]
• Invariant: atomicity
  – No thread observes a state in which the money has left one account but has not arrived in the other.
23
Bank Accounts using Locks
• In an object-oriented language:
class Account {
  Int balance;
  synchronized void deposit (Int n) {
    balance = balance + n;
  }
}
• Code for transfer:
void transfer (Account from, Account to, Int amount) {
  from.withdraw (amount);
  // an intermediate state is observable here!
  to.deposit (amount);
}
24
A Quick Fix: Explicit Locking
void transfer (Account from, Account to, Int amount) {
  from.lock(); to.lock();
  from.withdraw (amount);
  to.deposit (amount);
  from.unlock(); to.unlock();
}
• Now, the program is prone to deadlock: if one thread runs transfer(a, b, ...) while another runs transfer(b, a, ...), each can take its first lock and wait forever for the second.
25
Locks are Bad
• Taking too few locks ⇒ simultaneous update
• Taking too many locks ⇒ no concurrency, or deadlock
• Taking the wrong locks ⇒ error-prone programming
• Taking locks in the wrong order ⇒ error-prone programming
• ...
• Fundamental problem: no modular programming
  – Correct implementations of withdraw and deposit do not give a correct implementation of transfer.
26
Transactional Memory
• An alternative to lock-based synchronization
  – eliminates many problems associated with lock-based synchronization
    • no deadlock
    • read sharing
    • safe modular programming
• Hot research area
  – hardware transactional memory
  – software transactional memory
    • C, Java, functional languages, ...
27
Transactions in Haskell
transfer :: Account -> Account -> Int -> IO ()
-- transfer 'amount' from account 'from' to account 'to'
transfer from to amount =
  atomically (do { deposit to amount
                 ; withdraw from amount })
• atomically act
  – atomicity:
    • The effects of act become visible to other threads all at once.
  – isolation:
    • The action act does not see any effects from other threads.
Conclusion: We need parallelism!
29
Tim Sweeney's POPL '06 Invited Talk - Last Slide
CSE-700 Parallel Programming
Fall 2007
31
CSE-700 in a Nutshell
• Scope
  – Parallel computing from the viewpoint of programmers and language designers
  – We will not talk about hardware for parallel computing.
• Audience
  – Anyone interested in learning parallel programming
• Prerequisites
  – C programming
  – Desire to learn new programming languages
32
Material
• Books
  – Introduction to Parallel Computing (2nd ed.), Ananth Grama et al.
  – Parallel Programming with MPI, Peter Pacheco
  – Parallel Programming in OpenMP, Rohit Chandra et al.
  • Any textbook on MPI and OpenMP is fine.
• Papers
33
Teaching Staff
• Instructors
  – Gla
  – Myson
  – ...
  – and YOU!
• We will lead this course TOGETHER.
34
Resources
• Plquad
  – quad-core Linux machine
  – OpenMP and MPI already installed
• Ask for an account if you need one.
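To check the installation, a minimal MPI program (a standard sketch, not from the slides):

/* Compile with mpicc hello.c and run with, e.g., mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */
    printf("hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}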
35
Basic Plan - First Half
• Goal
  – learn the basics of parallel programming through 5+ assignments on OpenMP and MPI
• Each lecture consists of:
  – discussion on the previous assignment
    • Each of you is expected to give a presentation.
  – presentation on OpenMP and MPI by the instructors
  – discussion on the next assignment
36
Basic Plan - Second Half
• Recent parallel languages
  – learn a recent parallel language
  – write a cool program in your parallel language
  – give a presentation on your experience
• Topics in parallel language research
  – choose a topic
  – give a presentation on it
37
What Matters Most?
• Spirit of adventure
• Proactivity
• Desire to provoke Happy Chaos
– I want you to develop this course into a total, complete, yet happy chaos.
– A truly inspirational course borders almost on chaos.
Impact of Memory and Cache on Performance
39
Impact of Memory Bandwidth [1]
Consider the following code fragment:
for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}
The code fragment sums the columns of the matrix b into the vector column_sum.
40
Impact of Memory Bandwidth [2]
• The vector column_sum is small and easily fits into the cache.
• The matrix b is accessed in column order.
• The strided access results in very poor performance.
[Figure: multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector]
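In C (row-major layout), the two strategies in the figure look as follows; this is an illustrative sketch, not from the slides, and the function names are made up:

void matvec_by_columns(int n, double A[n][n], double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int j = 0; j < n; j++)          /* (a) column by column */
        for (int i = 0; i < n; i++)
            y[i] += A[i][j] * x[j];      /* stride-n access to A */
}

void matvec_by_rows(int n, double A[n][n], double *x, double *y) {
    for (int i = 0; i < n; i++) {        /* (b) dot products */
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i][j] * x[j];       /* unit-stride access to A */
        y[i] = sum;
    }
}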
41
Impact of Memory Bandwidth [3]
We can fix the above code as follows:
for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order, and performance can be expected to be significantly better.
42
Lesson
• Choosing memory layouts and organizing computation appropriately can have a significant impact on spatial and temporal locality.
Assignment 1
Cache & Matrix Multiplication
44
Typical Sequential Implementation
• A : n x n
• B : n x n
• C = A * B : n x n
for i = 1 to n
    for j = 1 to n
        C[i, j] = 0;
        for k = 1 to n
            C[i, j] += A[i, k] * B[k, j];
45
Using Submatrices
• Improves data locality significantly (see the sketch below).
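A sketch of the submatrix idea in C (not from the slides; BS and matmul_blocked are made-up names, and n is assumed to be a multiple of BS):

/* Blocked (tiled) matrix multiplication: working on BS x BS
   submatrices keeps each tile of A, B, and C in cache while
   it is being reused. */
#define BS 64   /* block size; tune to the cache of the machine */

void matmul_blocked(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i][j] = 0.0;
    for (int ii = 0; ii < n; ii += BS)         /* loop over tiles */
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS; i++)   /* within a tile */
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}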
46
Experimental Results
47
Assignment 1
• Machine
  – the older, the better.
  – Myson offers his ancient notebook for you:
    • Pentium II 600 MHz
    • no L1 cache
    • 64KB L2 cache
    • running Linux
• Prepare a presentation on your experimental results.