
Page 1: CSE-700 Parallel Programming Introduction

CSE-700 Parallel Programming

Introduction

POSTECH

Sep 6, 2007

박성우

Page 2: CSE-700 Parallel Programming Introduction

2

Common Features?

Page 3: CSE-700 Parallel Programming Introduction

3

... runs faster on

Page 4: CSE-700 Parallel Programming Introduction

4

Multi-core CPUs
• IBM Power4, dual-core, 2000
• Intel reaches thermal wall, 2004 ⇒ no more free lunch!
• Intel Xeon, quad-core, 2006
• Sony PlayStation 3 Cell, eight cores enabled, 2006
• Intel, 80 cores, 2011 (prototype finished)

source: Herb Sutter - "Software and the concurrency revolution"

Page 5: CSE-700 Parallel Programming Introduction

5

Parallel Programming Models
• POSIX threads (API)
• OpenMP (API)
• HPF (High Performance Fortran)
• Cray's Chapel
• Nesl
• Sun's Fortress
• IBM's X10
• ... and a lot more.

Page 6: CSE-700 Parallel Programming Introduction

6

Parallelism
• Data parallelism
  – ability to apply a function in parallel to each element of a collection of data (see the sketch after this list)
• Thread parallelism
  – ability to run multiple threads concurrently
  – Each thread uses its own local state.
• Shared memory parallelism
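As a concrete illustration of data parallelism, here is a minimal C/OpenMP sketch (not from the slides; the array names, the size N, and the element-wise function f are made up): the same function is applied to every element of an array, and because the iterations are independent the runtime may split them across cores.

/* Hypothetical sketch: data parallelism with OpenMP.
 * Compile with: gcc -fopenmp map.c
 */
#include <stdio.h>
#include <omp.h>

static double f(double x) { return 2.0 * x + 1.0; }   /* example element-wise function */

int main(void) {
    enum { N = 1000000 };
    static double a[N], b[N];

    for (int i = 0; i < N; i++) a[i] = (double)i;

    #pragma omp parallel for          /* data parallelism: one independent iteration per element */
    for (int i = 0; i < N; i++)
        b[i] = f(a[i]);

    printf("b[10] = %f\n", b[10]);    /* prints 21.0 */
    return 0;
}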

Page 7: CSE-700 Parallel Programming Introduction

Data Parallelism
Thread Parallelism
Shared Memory Parallelism

Page 8: CSE-700 Parallel Programming Introduction

8

Data Parallelism = Data Separation

a1 a2 ... an          -> hardware thread #1
an+1 an+2 ... an+m    -> hardware thread #2
an+m+1 ... an+m+l     -> hardware thread #3
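A rough C/OpenMP sketch of this picture (my own illustration, not slide code; sizes and names are made up): each thread computes its own contiguous chunk boundaries from its thread id and touches only that chunk, writing its result into a private slot.

/* Data separation: each thread owns one contiguous chunk of the array.
 * Assumes at most 64 threads; compile with: gcc -fopenmp chunks.c
 */
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000 };
    double a[N], partial[64] = {0.0}, total = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;

    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int nt = omp_get_num_threads();
        int lo = t * N / nt;                 /* this thread's chunk: [lo, hi) */
        int hi = (t + 1) * N / nt;
        double s = 0.0;
        for (int i = lo; i < hi; i++) s += a[i];
        partial[t] = s;                      /* no sharing: one slot per thread */
    }

    for (int t = 0; t < 64; t++) total += partial[t];
    printf("total = %f\n", total);           /* 1000.0 */
    return 0;
}

In idiomatic OpenMP one would simply write a reduction clause; the explicit index arithmetic is shown only to mirror the chunk-per-thread picture above.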

Page 9: CSE-700 Parallel Programming Introduction

9

Data Parallelism in Hardware

• GeForce 8800
  – 128 stream processors @ 1.3 GHz, 500+ GFLOPS

Page 10: CSE-700 Parallel Programming Introduction

10

Data Parallelism in Programming Languages

• Fortress
  – parallelism is the default.

    for i ← 1:m, j ← 1:n do   // 1:n is a generator
      a[i, j] := b[i] c[j]
    end

• Nesl (1990s)
  – supports nested data parallelism
    • the function being applied can itself be parallel.

    {sum(a) : a in [[2, 3], [8, 3, 9], [7]]};
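The Nesl expression sums each subsequence in parallel, and each sum may itself run in parallel. A rough C/OpenMP analogue (my sketch; the segment names are made up, and the inner level only actually runs in parallel if nested parallelism is enabled, e.g. OMP_NESTED=true):

/* Rough analogue of {sum(a) : a in [[2, 3], [8, 3, 9], [7]]}. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int seg0[] = {2, 3}, seg1[] = {8, 3, 9}, seg2[] = {7};
    int *segs[] = {seg0, seg1, seg2};
    int  lens[] = {2, 3, 1};
    int  sums[3];

    #pragma omp parallel for                      /* outer data parallelism: one subsequence per iteration */
    for (int i = 0; i < 3; i++) {
        int s = 0;
        #pragma omp parallel for reduction(+:s)   /* inner (nested) parallelism: the sum itself */
        for (int j = 0; j < lens[i]; j++)
            s += segs[i][j];
        sums[i] = s;
    }

    printf("%d %d %d\n", sums[0], sums[1], sums[2]);   /* 5 20 7 */
    return 0;
}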

Page 11: CSE-700 Parallel Programming Introduction

11

Data Parallel Haskell (DAMP '07)

• Haskell + nested data parallelism
  – flattening (vectorization)
    • transforms a nested parallel program so that it manipulates only flat arrays.
  – fusion
    • eliminates many intermediate arrays.

• Example: 10,000 x 10,000 sparse matrix multiplication with 1 million elements

Page 12: CSE-700 Parallel Programming Introduction

Data Parallelism
Thread Parallelism
Shared Memory Parallelism

Page 13: CSE-700 Parallel Programming Introduction

13

Thread Parallelism

hardware thread #1 (local state)  <--- messages --->  hardware thread #2 (local state)

(synchronous communication)
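The course later uses MPI for exactly this style of programming; as a hedged illustration, here is a minimal C/MPI sketch of two processes, each holding only local state, exchanging one message with blocking send/receive. (MPI_Send may complete by buffering; MPI_Ssend is the strictly synchronous variant.)

/* Two processes, local state only, one message.
 * Build and run: mpicc msg.c && mpirun -np 2 ./a.out
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                           /* local state of process 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* message to process 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* blocks until the message arrives */
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}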

Page 14: CSE-700 Parallel Programming Introduction

14

Pure Functional Threads
• Purely functional threads can run concurrently.
  – Effect-free computations can be executed in parallel with any other effect-free computations.
• Example: collision detection

(diagram: objects A and B moving to new positions A' and B')

Page 15: CSE-700 Parallel Programming Introduction

15

Manticore (DAMP '07)

• Three layers
  – sequential base language
    • functional language drawn from SML
    • no mutable references and arrays!
  – data-parallel programming
    • implicit: the compiler and runtime system manage thread creation.
    • e.g., parallel arrays of parallel arrays

      [: 2 * n | n in nums where n > 0 :]
      fun mapP f xs = [: f x | x in xs :]

  – concurrent programming

Page 16: CSE-700 Parallel Programming Introduction

16

Concurrent Programming in Manticore (DAMP '07)

• Based on Concurrent ML
  – threads and synchronous message passing
  – Threads do not share mutable state.
    • actually, no mutable references and arrays
  – explicit: the programmer manages thread creation.

Page 17: CSE-700 Parallel Programming Introduction

Data Parallelism
Thread Parallelism
Shared Memory Parallelism (Shared State Concurrency)

Page 18: CSE-700 Parallel Programming Introduction

18

Shared Memory Parallelism

(diagram: hardware threads #1, #2, and #3 all access a single shared memory)

Page 19: CSE-700 Parallel Programming Introduction

19

World War II

Page 20: CSE-700 Parallel Programming Introduction

20

Company of Heroes
• Interaction of a LOT of objects:
  – thousands of objects
  – Each object has its own mutable state.
  – Each object update affects several other objects.
  – All objects are updated 30+ times per second.
• Problem:
  – How do we handle simultaneous updates to the same memory location? (A race sketch follows this list.)
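A minimal C/pthreads sketch of the simultaneous-update problem (my illustration, not from the slides; the counter and iteration counts are made up): two threads increment one shared variable with no synchronization, so increments are lost and the final value is usually well below the expected total.

#include <stdio.h>
#include <pthread.h>

static long counter = 0;                 /* shared mutable state */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter = counter + 1;           /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}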

Page 21: CSE-700 Parallel Programming Introduction

21

Manual Lock-based Synchronization

pthread_mutex_lock(mutex);

mutate_variable();

pthread_mutex_unlock(mutex);

• Locks and condition variables ⇒ fundamentally flawed!
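A runnable expansion of the fragment above (my sketch, not the original slide code): the racy counter from the previous sketch, now protected by a pthread mutex so the updates are serialized.

#include <stdio.h>
#include <pthread.h>

static long counter = 0;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&mutex);      /* enter critical section */
        counter = counter + 1;           /* the "mutate_variable()" step */
        pthread_mutex_unlock(&mutex);    /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* now reliably 2000000 */
    return 0;
}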

Page 22: CSE-700 Parallel Programming Introduction

22

Bank Accounts (Beautiful Concurrency, Peyton Jones, 2007)

(diagram: threads #1 ... #n send transfer requests to accounts A and B held in shared memory)

• Invariant: atomicity
  – No thread observes a state in which the money has left one account but has not yet arrived in the other.

Page 23: CSE-700 Parallel Programming Introduction

23

Bank Accounts using Locks
• In an object-oriented language:

  class Account {
    Int balance;
    synchronized void deposit (Int n) {
      balance = balance + n;
    }
  }

• Code for transfer:

  void transfer (Account from, Account to, Int amount) {
    from.withdraw (amount);
    // an intermediate state: the money has left 'from' but not yet reached 'to'!
    to.deposit (amount);
  }

Page 24: CSE-700 Parallel Programming Introduction

24

A Quick Fix: Explicit Locking

  void transfer (Account from, Account to, Int amount) {
    from.lock(); to.lock();
    from.withdraw (amount);
    to.deposit (amount);
    from.unlock(); to.unlock();
  }

• Now the program is prone to deadlock (see the sketch below).
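A C/pthreads analogue of that deadlock (my sketch; the lock and function names are made up): one thread locks A then B, the other locks B then A. If each acquires its first lock before the other's second, both block forever; a common remedy is to agree on a global lock order.

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;   /* "account A" */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;   /* "account B" */

static void *transfer_a_to_b(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_a);
    sleep(1);                          /* widen the window so deadlock is likely */
    pthread_mutex_lock(&lock_b);       /* blocks if the other thread holds B */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *transfer_b_to_a(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_b);
    sleep(1);
    pthread_mutex_lock(&lock_a);       /* blocks if the other thread holds A */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, transfer_a_to_b, NULL);
    pthread_create(&t2, NULL, transfer_b_to_a, NULL);
    pthread_join(t1, NULL);            /* with the sleeps above, this likely never returns */
    pthread_join(t2, NULL);
    printf("no deadlock this time\n");
    return 0;
}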

Page 25: CSE-700 Parallel Programming Introduction

25

Locks are Bad
• Taking too few locks ⇒ simultaneous updates
• Taking too many locks ⇒ no concurrency, or deadlock
• Taking the wrong locks ⇒ error-prone programming
• Taking locks in the wrong order ⇒ error-prone programming
• ...

• Fundamental problem: no modular programming
  – Correct implementations of withdraw and deposit do not give a correct implementation of transfer.

Page 26: CSE-700 Parallel Programming Introduction

26

Transactional Memory
• An alternative to lock-based synchronization
  – eliminates many problems associated with lock-based synchronization
    • no deadlock
    • read sharing
    • safe modular programming
• Hot research area
  – hardware transactional memory
  – software transactional memory
    • C, Java, functional languages, ...

Page 27: CSE-700 Parallel Programming Introduction

27

Transactions in Haskell

  transfer :: Account -> Account -> Int -> IO ()
  -- transfer 'amount' from account 'from' to account 'to'
  transfer from to amount =
    atomically (do { deposit to amount
                   ; withdraw from amount })

• atomically act
  – atomicity: the effects of act become visible to other threads all at once.
  – isolation: the action act does not see any effects from other threads.

Page 28: CSE-700 Parallel Programming Introduction

Conclusion: We need parallelism!

Page 29: CSE-700 Parallel Programming Introduction

29

Tim Sweeney's POPL '06 Invited Talk - Last Slide

Page 30: CSE-700 Parallel Programming Introduction

CSE-700 Parallel Programming

Fall 2007

Page 31: CSE-700 Parallel Programming Introduction

31

CSE-700 in a Nutshell
• Scope
  – Parallel computing from the viewpoint of programmers and language designers
  – We will not talk about hardware for parallel computing.
• Audience
  – Anyone interested in learning parallel programming
• Prerequisites
  – C programming
  – Desire to learn new programming languages

Page 32: CSE-700 Parallel Programming Introduction

32

Material
• Books
  – Introduction to Parallel Computing (2nd ed.). Ananth Grama et al.
  – Parallel Programming with MPI. Peter Pacheco.
  – Parallel Programming in OpenMP. Rohit Chandra et al.
    • Any textbook on MPI and OpenMP is fine.
• Papers

Page 33: CSE-700 Parallel Programming Introduction

33

Teaching Staff
• Instructors
  – Gla
  – Myson
  – ...
  – and YOU!
• We will lead this course TOGETHER.

Page 34: CSE-700 Parallel Programming Introduction

34

Resources
• Plquad
  – quad-core Linux
  – OpenMP and MPI already installed
• Ask for an account if you need one.

Page 35: CSE-700 Parallel Programming Introduction

35

Basic Plan - First Half
• Goal
  – learn the basics of parallel programming through 5+ assignments on OpenMP and MPI
• Each lecture consists of:
  – discussion on the previous assignment
    • Each of you is expected to give a presentation.
  – presentation on OpenMP and MPI by the instructors
  – discussion on the next assignment

Page 36: CSE-700 Parallel Programming Introduction

36

Basic Plan - Second Half
• Recent parallel languages
  – learn a recent parallel language
  – write a cool program in your parallel language
  – give a presentation on your experience
• Topics in parallel language research
  – choose a topic
  – give a presentation on it

Page 37: CSE-700 Parallel Programming Introduction

37

What Matters Most?
• Spirit of adventure
• Proactivity
• Desire to provoke Happy Chaos
  – I want you to develop this course into a total, complete, yet happy chaos.
  – A truly inspirational course borders almost on chaos.

Page 38: CSE-700 Parallel Programming Introduction

Impact of Memory and Cache on Performance

Page 39: CSE-700 Parallel Programming Introduction

39

Impact of Memory Bandwidth [1]

Consider the following code fragment:

  for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
      column_sum[i] += b[j][i];
  }

The code fragment sums the columns of the matrix b into the vector column_sum.

Page 40: CSE-700 Parallel Programming Introduction

40

Impact of Memory Bandwidth [2]
• The vector column_sum is small and easily fits into the cache.
• The matrix b is accessed in column order.
• The strided access results in very poor performance.

Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector. (A sketch of both traversals follows.)
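A hedged C sketch of the two traversals described in the caption (illustrative sizes; A is stored row-major, as C arrays are): in (a) the matrix is read down columns with a stride of N doubles, while in (b) each row is read with unit stride, giving much better spatial locality.

#define N 1000
double A[N][N], x[N], y[N];

/* (a) column-by-column, keeping a running sum in y:
 *     A[i][j] is read down a column (stride N) -> poor locality. */
void matvec_by_columns(void) {
    for (int i = 0; i < N; i++) y[i] = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            y[i] += A[i][j] * x[j];
}

/* (b) each y[i] as the dot product of row i with x:
 *     A[i][j] is read along a row (unit stride) -> good locality. */
void matvec_by_rows(void) {
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }
}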

Page 41: CSE-700 Parallel Programming Introduction

41

Impact of Memory Bandwidth [3]

We can fix the above code as follows:

  for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;

  for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
      column_sum[i] += b[j][i];

In this case, the matrix is traversed in row order, and performance can be expected to be significantly better.

Page 42: CSE-700 Parallel Programming Introduction

42

Lesson
• Memory layout and organizing the computation appropriately can have a significant impact on spatial and temporal locality.

Page 43: CSE-700 Parallel Programming Introduction

Assignment 1: Cache & Matrix Multiplication

Page 44: CSE-700 Parallel Programming Introduction

44

Typical Sequential Implementation
• A : n x n
• B : n x n
• C = A * B : n x n

  for i = 1 to n
    for j = 1 to n
      C[i, j] = 0;
      for k = 1 to n
        C[i, j] += A[i, k] * B[k, j];

Page 45: CSE-700 Parallel Programming Introduction

45

Using Submatrices
• Improves data locality significantly (a blocked sketch follows).
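The slide only states the idea; here is a hedged C sketch of what "using submatrices" (blocking, or tiling) typically looks like. The block size BS is made up and should be tuned to the cache; the sketch assumes BS divides n and that the matrices are stored row-major in flat arrays.

#define BS 64   /* block size: tune so roughly 3 * BS * BS doubles fit in cache */

void matmul_blocked(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n * n; i++) C[i] = 0.0;

    for (int ii = 0; ii < n; ii += BS)              /* loops over BS x BS blocks */
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int i = ii; i < ii + BS; i++)  /* loops within one block */
                    for (int j = jj; j < jj + BS; j++) {
                        double sum = C[i * n + j];
                        for (int k = kk; k < kk + BS; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}

Compared with the loop on the previous slide, each block of A, B, and C is reused many times while it is still in cache, which is where the locality gain comes from.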

Page 46: CSE-700 Parallel Programming Introduction

46

Experimental Results

Page 47: CSE-700 Parallel Programming Introduction

47

Assignment 1
• Machine
  – the older, the better.
  – Myson offers his ancient notebook for you:
    • Pentium II, 600 MHz
    • no L1 cache
    • 64 KB L2 cache
    • running Linux
• Prepare a presentation on your experimental results.