CSE-700 Parallel Programming
Introduction
POSTECH
Sep 6, 2007
박성우 (Sungwoo Park)
2
Common Features?
3
... runs faster on
4
Multi-core CPUs
• IBM Power4, dual-core, 2000
• Intel reaches thermal wall, 2004 ⇒ no more free lunch!
• Intel Xeon, quad-core, 2006
• Sony PlayStation 3 Cell, eight cores enabled, 2006
• Intel, 80 cores, 2011 (prototype finished)
source: Herb Sutter - "Software and the concurrency revolution"
5
Parallel Programming Models
• POSIX threads (API)
• OpenMP (API)
• HPF (High Performance Fortran)
• Cray's Chapel
• NESL
• Sun's Fortress
• IBM's X10
• ... and a lot more.
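As a concrete point of reference (a minimal sketch, not from the original slides; the worker function and thread count are made up for illustration), the POSIX threads API in C looks like this:

/* Spawn two threads that each run an independent computation,
   then wait for both to finish.  Compile with: gcc -pthread hello.c */
#include <pthread.h>
#include <stdio.h>

static void *work(void *arg) {
    long id = (long) arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, (void *) 1L);
    pthread_create(&t2, NULL, work, (void *) 2L);
    pthread_join(t1, NULL);    /* wait for each thread to terminate */
    pthread_join(t2, NULL);
    return 0;
}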
6
Parallelism
• Data parallelism
  – ability to apply a function in parallel to each element of a collection of data
• Thread parallelism
  – ability to run multiple threads concurrently
  – Each thread uses its own local state.
• Shared memory parallelism
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
8
Data Parallelism = Data Separation
[Figure: an array a_1 ... a_n, a_{n+1} ... a_{n+m}, a_{n+m+1} ... a_{n+m+l} is split into three chunks, handled by hardware threads #1, #2, and #3 respectively]
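In code, the picture above corresponds to a parallel loop. A minimal OpenMP sketch (not from the slides; f and apply_all are hypothetical names):

/* Apply a pure function f to every element of a[0..n-1]; OpenMP
   splits the iteration space among the available hardware threads.
   Compile with: gcc -fopenmp -std=c99 example.c */
#include <omp.h>

static double f(double x) { return 2.0 * x; }   /* any pure function */

void apply_all(double *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = f(a[i]);        /* each thread works on its own chunk */
}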
9
Data Parallelism in Hardware
• GeForce 8800
  – 128 stream processors @ 1.3 GHz, 500+ GFLOPS
10
Data Parallelism in Programming Languages
• Fortress
  – Parallelism is the default.
for i ← 1:m, j ← 1:n do   // 1:n is a generator
  a[i, j] := b[i] c[j]
end
• NESL (1990s)
  – supports nested data parallelism
    • The function being applied itself can be parallel.
{sum(a) : a in [[2, 3], [8, 3, 9], [7]]};
  – Each inner sum runs in parallel; the result is [5, 20, 7].
11
Data Parallel Haskell (DAMP '07)
• Haskell + nested data parallelism
  – flattening (vectorization)
    • transforms a nested parallel program such that it manipulates only flat arrays
  – fusion
    • eliminates many intermediate arrays
• Example: 10,000 x 10,000 sparse matrix multiplication with 1 million elements
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
13
Thread Parallelism
[Figure: hardware threads #1 and #2, each with its own local state, exchange messages via synchronous communication]
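For a concrete (and simplified) rendering of this picture in C with pthreads (a sketch, not from the slides; channel, chan_send, and chan_recv are made-up names, and a one-slot buffer only approximates truly synchronous rendezvous):

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t nonempty, empty;
    int value;                 /* the message in transit */
    int full;                  /* 1 if a message is waiting */
} channel;

/* static initializer for a channel */
#define CHANNEL_INIT { PTHREAD_MUTEX_INITIALIZER, \
                       PTHREAD_COND_INITIALIZER,  \
                       PTHREAD_COND_INITIALIZER, 0, 0 }

void chan_send(channel *c, int v) {
    pthread_mutex_lock(&c->lock);
    while (c->full)                       /* wait until the slot is free */
        pthread_cond_wait(&c->empty, &c->lock);
    c->value = v;
    c->full = 1;
    pthread_cond_signal(&c->nonempty);    /* wake a waiting receiver */
    pthread_mutex_unlock(&c->lock);
}

int chan_recv(channel *c) {
    pthread_mutex_lock(&c->lock);
    while (!c->full)                      /* wait until a message arrives */
        pthread_cond_wait(&c->nonempty, &c->lock);
    int v = c->value;
    c->full = 0;
    pthread_cond_signal(&c->empty);       /* wake a waiting sender */
    pthread_mutex_unlock(&c->lock);
    return v;
}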
14
Pure Functional Threads
• Purely functional threads can run concurrently.
  – Effect-free computations can be executed in parallel with any other effect-free computations.
• Example: collision detection
[Figure: objects A and B move to new positions A' and B']
15
Manticore (DAMP '07)
• Three layers
  – sequential base language
    • functional language drawn from SML
    • no mutable references and arrays!
  – data-parallel programming
    • implicit:
      – The compiler and runtime system manage thread creation.
    • e.g., parallel arrays of parallel arrays:
[: 2 * n | n in nums where n > 0 :]
fun mapP f xs = [: f x | x in xs :]
– concurrent programming
16
Concurrent Programming in Manticore (DAMP '07)
• Based on Concurrent ML
  – threads and synchronous message passing
  – Threads do not share mutable state.
    • actually no mutable references and arrays
  – explicit:
    • The programmer manages thread creation.
Data Parallelism
Thread Parallelism
Shared Memory Parallelism (Shared State Concurrency)
18
Shared Memory Parallelism
[Figure: hardware threads #1, #2, and #3 all access a single shared memory]
19
World War II
20
Company of Heroes
• Interaction of a LOT of objects:
  – thousands of objects
  – Each object has its own mutable state.
  – Each object update affects several other objects.
  – All objects are updated 30+ times per second.
• Problem:
  – How do we handle simultaneous updates to the same memory location?
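A minimal illustration of the problem (a sketch, not from the slides): two threads incrementing the same counter without synchronization lose updates.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared mutable state */

static void *bump(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("expected 2000000, got %ld\n", counter);   /* usually less */
    return 0;
}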
21
Manual Lock-based Synchronization
pthread_mutex_lock(&mutex);
mutate_variable();
pthread_mutex_unlock(&mutex);
• Locks and condition variables
  ⇒ fundamentally flawed!
22
Bank Accounts (from "Beautiful Concurrency", Peyton Jones, 2007)
[Figure: threads #1, #2, ..., #n issue transfer requests against accounts A and B, which live in shared memory]
• Invariant: atomicity
  – No thread observes a state in which the money has left one account but has not arrived in the other.
23
Bank Accounts using Locks
• In an object-oriented language:
class Account {
  Int balance;
  synchronized void deposit (Int n) {
    balance = balance + n;
  }
}
• Code for transfer:
void transfer (Account from, Account to, Int amount) {
  from.withdraw (amount);
  // an intermediate state is observable here!
  to.deposit (amount);
}
24
A Quick Fix: Explicit Locking
void transfer (Account from, Account to, Int amount) {
  from.lock(); to.lock();
  from.withdraw (amount);
  to.deposit (amount);
  from.unlock(); to.unlock();
}
• Now, the program is prone to deadlock: if one thread runs transfer(a, b, ...) while another runs transfer(b, a, ...), each can take its first lock and wait forever for the second.
25
Locks are Bad
• Taking too few locks ⇒ simultaneous update
• Taking too many locks ⇒ no concurrency, or deadlock
• Taking the wrong locks ⇒ error-prone programming
• Taking locks in the wrong order ⇒ error-prone programming
• ...
• Fundamental problem: no modular programming
  – Correct implementations of withdraw and deposit do not give a correct implementation of transfer.
26
Transactional Memory
• An alternative to lock-based synchronization
  – eliminates many problems associated with lock-based synchronization
    • no deadlock
    • read sharing
    • safe modular programming
• Hot research area
  – hardware transactional memory
  – software transactional memory
    • C, Java, functional languages, ...
27
Transactions in Haskell
transfer :: Account -> Account -> Int -> IO ()
-- transfer 'amount' from account 'from' to account 'to'
transfer from to amount =
  atomically (do { deposit to amount
                 ; withdraw from amount })
• atomically act
  – atomicity:
    • The effects of act become visible to other threads all at once.
  – isolation:
    • The action act does not see any effects from other threads.
Conclusion: We need parallelism!
29
Tim Sweeney's POPL '06 Invited Talk - Last Slide
CSE-700 Parallel Programming
Fall 2007
31
CSE-700 in a Nutshell
• Scope
  – Parallel computing from the viewpoint of programmers and language designers
  – We will not talk about hardware for parallel computing.
• Audience
  – Anyone interested in learning parallel programming
• Prerequisites
  – C programming
  – Desire to learn new programming languages
32
Material
• Books
  – Introduction to Parallel Computing (2nd ed.), Ananth Grama et al.
  – Parallel Programming with MPI, Peter Pacheco
  – Parallel Programming in OpenMP, Rohit Chandra et al.
  • Any textbook on MPI and OpenMP is fine.
• Papers
33
Teaching Staff
• Instructors
  – Gla
  – Myson
  – ...
  – and YOU!
• We will lead this course TOGETHER.
34
Resources
• Plquad
  – quad-core Linux machine
  – OpenMP and MPI already installed
• Ask for an account if you need one.
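To check the installation, a minimal MPI program (a standard sketch, not from the slides):

/* Compile with mpicc hello.c and run with, e.g., mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */
    printf("hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}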
35
Basic Plan - First Half
• Goal
  – learn the basics of parallel programming through 5+ assignments on OpenMP and MPI
• Each lecture consists of:
  – discussion on the previous assignment
    • Each of you is expected to give a presentation.
  – presentation on OpenMP and MPI by the instructors
  – discussion on the next assignment
36
Basic Plan - Second Half
• Recent parallel languages
  – learn a recent parallel language
  – write a cool program in your parallel language
  – give a presentation on your experience
• Topics in parallel language research
  – choose a topic
  – give a presentation on it
37
What Matters Most?
• Spirit of adventure
• Proactivity
• Desire to provoke Happy Chaos
– I want you to develop this course into a total, complete, yet happy chaos.
– A truly inspirational course borders almost on chaos.
Impact of Memory and Cache on Performance
39
Impact of Memory Bandwidth [1]
Consider the following code fragment:
for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}
The code fragment sums the columns of the matrix b into the vector column_sum.
40
Impact of Memory Bandwidth [2]
• The vector column_sum is small and easily fits into the cache.
• The matrix b is accessed in column order.
• The strided access results in very poor performance.
[Figure: multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector]
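In C (row-major layout), the two strategies in the figure look as follows; this is an illustrative sketch, not from the slides, and the function names are made up:

void matvec_by_columns(int n, double A[n][n], double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int j = 0; j < n; j++)          /* (a) column by column */
        for (int i = 0; i < n; i++)
            y[i] += A[i][j] * x[j];      /* stride-n access to A */
}

void matvec_by_rows(int n, double A[n][n], double *x, double *y) {
    for (int i = 0; i < n; i++) {        /* (b) dot products */
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i][j] * x[j];       /* unit-stride access to A */
        y[i] = sum;
    }
}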
41
Impact of Memory Bandwidth [3]
We can fix the above code as follows:
for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order, and performance can be expected to be significantly better.
42
Lesson
• Choosing memory layouts and organizing computation appropriately can have a significant impact on spatial and temporal locality.
Assignment 1
Cache & Matrix Multiplication
44
Typical Sequential Implementation
• A : n x n
• B : n x n
• C = A * B : n x n
for i = 1 to n
    for j = 1 to n
        C[i, j] = 0;
        for k = 1 to n
            C[i, j] += A[i, k] * B[k, j];
45
Using Submatrices
• Improves data locality significantly (see the sketch below).
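A sketch of the submatrix idea in C (not from the slides; BS and matmul_blocked are made-up names, and n is assumed to be a multiple of BS):

/* Blocked (tiled) matrix multiplication: working on BS x BS
   submatrices keeps each tile of A, B, and C in cache while
   it is being reused. */
#define BS 64   /* block size; tune to the cache of the machine */

void matmul_blocked(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i][j] = 0.0;
    for (int ii = 0; ii < n; ii += BS)         /* loop over tiles */
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS; i++)   /* within a tile */
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}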
46
Experimental Results
47
Assignment 1
• Machine
  – the older, the better.
  – Myson offers his ancient notebook for you:
    • Pentium II 600 MHz
    • no L1 cache
    • 64KB L2 cache
    • running Linux
• Prepare a presentation on your experimental results.