Selections from the CSE332 Data Abstractions course at the University of Washington
1. Introduction to Multithreading & Fork-Join Parallelism – hashtable, vector sum
2. Analysis of Fork-Join Parallel Programs – maps, vector addition, linked lists versus trees for parallelism, work and span
3. Parallel Prefix, Pack, and Sorting
4. Shared-Memory Concurrency & Mutual Exclusion – concurrent bank account, OpenMP nested locks, OpenMP critical sections
5. Programming with Locks and Critical Sections – simple-minded concurrent stack, hashtable revisited
6. Data Races and Memory Reordering, Deadlock, Reader/Writer Locks, Condition Variables
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 1: Introduction to Multithreading & Fork-Join Parallelism
Dan Grossman
Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/
Slide 3: Sophomoric Parallelism and Concurrency, Lecture 1

What to do with multiple processors?

• Next computer you buy will likely have 4 processors
  – Wait a few years and it will be 8, 16, 32, …
  – The chip companies have decided to do this (not a “law”)
• What can you do with them?
  – Run multiple totally different programs at the same time
    • Already do that? Yes, but with time-slicing
  – Do multiple things at once in one program
    • Our focus – more difficult
    • Requires rethinking everything from asymptotic complexity to how to implement data-structure operations
Slide 4: Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism vs. Concurrency

Note: these terms are not yet standard, but the perspective is essential
– Many programmers confuse these concepts

There is some connection:
– Common to use threads for both
– If parallel computations need access to shared resources, then the concurrency needs to be managed

First 3ish lectures on parallelism, then 3ish lectures on concurrency

Parallelism: use extra resources to solve a problem faster
Concurrency: correctly and efficiently manage access to shared resources

[Figure: parallelism drawn as one body of work fanned out across several resources; concurrency drawn as several requests contending for one shared resource]
Slide 5: Sophomoric Parallelism and Concurrency, Lecture 1

An analogy

CS1 idea: A program is like a recipe for a cook
– One cook who does one thing at a time! (Sequential)

Parallelism:
– Have lots of potatoes to slice?
– Hire helpers, hand out potatoes and knives
– But with too many chefs, you spend all your time coordinating

Concurrency:
– Lots of cooks making different things, but only 4 stove burners
– Want to allow access to all 4 burners, but not cause spills or incorrect burner settings
Slide 6: Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism Example

Parallelism: Use extra computational resources to solve a problem faster (increasing throughput via simultaneous execution)

Pseudocode for array sum
– Bad style for reasons we’ll see, but may get roughly 4x speedup

int sum(int arr[], int len) {
  int res[4];
  FORALL(i=0; i < 4; i++) {  // parallel iterations
    res[i] = sumRange(arr, i*len/4, (i+1)*len/4);
  }
  return res[0] + res[1] + res[2] + res[3];
}

int sumRange(int arr[], int lo, int hi) {
  int result = 0;
  for (int j = lo; j < hi; j++)
    result += arr[j];
  return result;
}
Slide 7: Sophomoric Parallelism and Concurrency, Lecture 1

Concurrency Example

Concurrency: Correctly and efficiently manage access to shared resources (from multiple possibly-simultaneous clients)

Pseudocode for a shared chaining hashtable
– Prevent bad interleavings (correctness)
– But allow some concurrent access (performance)

class Hashtable<K,V> {
  …
  void insert(K key, V value) {
    int bucket = …;
    prevent-other-inserts/lookups in table[bucket]
    do the insertion
    re-enable access to table[bucket]
  }
  V lookup(K key) {
    (like insert, but can allow concurrent lookups to the same bucket)
  }
}
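As a preview of how such per-bucket exclusion can be realized, here is a minimal C++/OpenMP sketch with one lock per bucket (the Node layout and all names are illustrative, not the course's code):

#include <omp.h>
#include <cstddef>
#include <functional>
#include <vector>

template <typename K, typename V>
class Hashtable {
  struct Node { K key; V value; Node* next; };
  std::vector<Node*> table;
  std::vector<omp_lock_t> locks;  // one lock per bucket
  std::size_t bucketOf(const K& key) const {
    return std::hash<K>{}(key) % table.size();
  }
public:
  explicit Hashtable(std::size_t nBuckets)
      : table(nBuckets, nullptr), locks(nBuckets) {
    for (auto& lk : locks) omp_init_lock(&lk);
  }
  ~Hashtable() { for (auto& lk : locks) omp_destroy_lock(&lk); }

  void insert(const K& key, const V& value) {
    std::size_t b = bucketOf(key);
    omp_set_lock(&locks[b]);    // prevent other inserts/lookups in table[b]
    table[b] = new Node{key, value, table[b]};  // do the insertion
    omp_unset_lock(&locks[b]);  // re-enable access to table[b]
  }
};

Inserts to different buckets run concurrently; only same-bucket operations exclude each other.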
Slide 8: Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory

The model we will assume is shared memory with explicit threads

Old story: A running program has
– One call stack (with each stack frame holding local variables)
– One program counter (current statement executing)
– Static fields
– Objects (created by new) in the heap (nothing to do with the heap data structure)

New story:
– A set of threads, each with its own call stack & program counter
  • No access to another thread’s local variables
– Threads can (implicitly) share static fields / objects
  • To communicate, write somewhere another thread reads
Slide 9: Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory

[Figure: several threads, each with its own call stack and program counter (“pc=…”), labeled “Unshared: locals and control”; beneath them a common pool labeled “Shared: objects and static fields”]

Threads each have their own unshared call stack and current statement
– (pc for “program counter”)
– local variables are numbers, null, or heap references

Any objects can be shared, but most are not
Slide 10: Sophomoric Parallelism and Concurrency, Lecture 1

First attempt, Sum class (serial)

class Sum {
  int ans;
  int len;
  int *arr;
public:
  Sum(int a[], int num) {  // constructor
    arr = a; ans = 0; len = num;
    for (int i = 0; i < num; ++i) {
      arr[i] = i + 1;  // initialize array
    }
  }
  int sum(int lo, int hi) {
    int ans = 0;
    for (int i = lo; i < hi; ++i) {
      ans += arr[i];
    }
    return ans;
  }
};
Slide 11: Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism idea

• Example: Sum elements of a large array
• Idea: Have 4 threads simultaneously sum 1/4 of the array
  – Warning: inferior first approach – explicitly using a fixed number of threads
  – Create 4 threads using OpenMP parallel
  – Determine thread ID, assign work based on thread ID
  – Accumulate partial sums for each thread
  – Add together their 4 answers for the final result

[Figure: the array split into four slices producing ans0, ans1, ans2, ans3, which are added together to give ans]
Slide 12: Sophomoric Parallelism and Concurrency, Lecture 1

OpenMP basics

First learn some basics of OpenMP:
1. Pragma-based approach to parallelism
2. Create a pool of threads using #pragma omp parallel
3. Use work-sharing directives to allow each thread to do work
4. To get started, we will explore these work-sharing directives:
   – parallel for
   – reduction
   – single
   – task
   – taskwait

Most of these constructs have an analog in Intel® Cilk™ Plus as well
Slide 13: Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk-through of the parallel sum code – SPMD version

Discussion points:
• #pragma omp parallel
• num_threads clause
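A minimal sketch of what the SPMD-style sum might look like (the demo sources are not part of this transcript; the fixed thread count, array contents, and names are illustrative):

#include <omp.h>
#include <cstdio>

#define NUM 4  // fixed thread count: the "inferior first approach"

int main() {
  const int len = 1000;
  int arr[len], res[NUM];
  for (int i = 0; i < len; ++i) arr[i] = i + 1;

  // SPMD: every thread runs the same block and picks its slice by thread ID
  #pragma omp parallel num_threads(NUM)
  {
    int id = omp_get_thread_num();
    int lo = id * len / NUM, hi = (id + 1) * len / NUM;
    int partial = 0;
    for (int i = lo; i < hi; ++i) partial += arr[i];
    res[id] = partial;  // each thread writes only its own slot
  }

  int ans = 0;
  for (int i = 0; i < NUM; ++i) ans += res[i];  // combine on the main thread
  printf("sum = %d\n", ans);  // 500500
}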
Slide 14: Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk-through of the OpenMP parallel for with reduction code

Discussion points:
• #pragma omp parallel for
• reduction(+:ans) clause
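A minimal sketch of the reduction version under the same assumptions (illustrative, not the actual demo sources):

#include <omp.h>
#include <cstdio>

int main() {
  const int len = 1000;
  int arr[len];
  for (int i = 0; i < len; ++i) arr[i] = i + 1;

  int ans = 0;
  // Each thread accumulates into a private copy of ans;
  // OpenMP combines the copies with + when the loop ends.
  #pragma omp parallel for reduction(+:ans)
  for (int i = 0; i < len; ++i)
    ans += arr[i];

  printf("sum = %d\n", ans);  // 500500
}

Compared with the SPMD version, the runtime picks the number of threads and does the partial-sum bookkeeping for us.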
Slide 15: Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory?

• Fork-join programs (thankfully) don’t require much focus on sharing memory among threads
• But in languages like C++, there is memory being shared. In our example:
  – lo, hi, arr fields written by the “main” thread, read by the helper thread
  – ans field written by the helper thread, read by the “main” thread
• When using shared memory, you must avoid race conditions
  – With concurrency, we’ll learn other ways to synchronize later, such as atomics, locks, and critical sections
Slide 16: Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

Want to use (only) processors “available to you now”
– Not used by other programs or threads in your program
  • Maybe the caller is also using parallelism
  • Available cores can change even while your threads run
– If you have 3 processors available and using 3 threads would take time X, then creating 4 threads would take time 1.5X

// numThreads == numProcessors is bad
// if some are needed for other things
int sum(int arr[], int numThreads) {
  …
}
Slide 17: Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

Though unlikely for sum, in general sub-problems may take significantly different amounts of time
– Example: Apply method f to every array element, but maybe f is much slower for some data items
  • Example: Is a large integer prime?
– If we create 4 threads and all the slow data is processed by 1 of them, we won’t get nearly a 4x speedup
  • Example of a load imbalance
Slide 18: Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

The counterintuitive (?) solution to all these problems is to use lots of threads, far more than the number of processors
– But this will require changing our algorithm

[Figure: the array split into many small slices producing ans0, ans1, …, ansN, combined into ans]

1. Forward-portable: Lots of helpers each doing a small piece
2. Processors available: Hand out “work chunks” as you go
   • If 3 processors are available and you have 100 threads, then ignoring constant-factor overheads, the extra time is < 3%
3. Load imbalance: No problem if a slow thread is scheduled early enough
   • Variation is probably small anyway if pieces of work are small
Slide 19: Sophomoric Parallelism and Concurrency, Lecture 1

Divide-and-Conquer idea

This is straightforward to implement using divide-and-conquer
– Parallelism for the recursive calls

[Figure: a binary tree of + operations combining pairs of partial sums level by level into one result]
Slide 20: Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk-through of the divide-and-conquer code

Discussion points:
• #pragma omp task
• omp shared
• omp firstprivate
• omp taskwait
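A minimal sketch of a task-based divide-and-conquer sum matching these discussion points (the cutoff value and names are illustrative, not the demo sources):

#include <omp.h>
#include <cstdio>

const int SEQUENTIAL_CUTOFF = 1000;

int sum(const int* arr, int lo, int hi) {
  if (hi - lo <= SEQUENTIAL_CUTOFF) {  // small piece: just loop
    int ans = 0;
    for (int i = lo; i < hi; ++i) ans += arr[i];
    return ans;
  }
  int left, right;
  int mid = lo + (hi - lo) / 2;
  // Each half becomes a task; firstprivate copies the bounds into the
  // task, shared lets the task write its result back to our locals.
  #pragma omp task shared(left) firstprivate(arr, lo, mid)
  left = sum(arr, lo, mid);
  #pragma omp task shared(right) firstprivate(arr, mid, hi)
  right = sum(arr, mid, hi);
  #pragma omp taskwait  // wait for both halves before combining
  return left + right;
}

int main() {
  const int len = 100000;
  static int arr[len];
  for (int i = 0; i < len; ++i) arr[i] = 1;
  int ans = 0;
  #pragma omp parallel  // create the thread pool...
  #pragma omp single    // ...but let a single thread start the recursion
  ans = sum(arr, 0, len);
  printf("sum = %d\n", ans);  // 100000
}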
Slide 21: Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk-through of the load-balanced OpenMP parallel for

Discussion points:
• omp schedule(dynamic)
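A minimal sketch of schedule(dynamic) (expensive() is a stand-in for per-element work of widely varying cost; it is illustrative, not the demo code):

#include <omp.h>
#include <cstdio>

long expensive(int i) {  // placeholder: cost varies a lot with i
  long r = 0;
  for (int j = 0; j < i % 10000; ++j) r += j;
  return r;
}

int main() {
  const int len = 100000;
  static long out[len];
  // schedule(dynamic) hands out iterations in chunks as threads finish,
  // so a run of slow iterations doesn't stall one unlucky thread.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < len; ++i)
    out[i] = expensive(i);
  printf("out[42] = %ld\n", out[42]);
}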
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 2: Analysis of Fork-Join Parallel Programs
Dan Grossman
Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/
Slide 23: Sophomoric Parallelism and Concurrency, Lecture 2

Even easier: Maps (Data Parallelism)

• A map operates on each element of a collection independently to create a new collection of the same size
  – No combining results
  – For arrays, this is so trivial some hardware has direct support
• Canonical example: Vector addition

int vector_add(int lo, int hi) {
  FORALL(i=lo; i < hi; i++) {
    result[i] = arr1[i] + arr2[i];
  }
  return 1;
}
Slide 24: Sophomoric Parallelism and Concurrency, Lecture 2

Maps in ForkJoin Framework

• Even though there is no result-combining, creating many small tasks still helps with load balancing
  – Maybe not for vector-add, but for more compute-intensive maps
  – The forking is O(log n), whereas other approaches to vector-add are theoretically O(1)

class VecAdd {
  int *arr1, *arr2, *res, len;
public:
  VecAdd(int _arr1[], int _arr2[], int _res[], int _num) {
    arr1 = _arr1; arr2 = _arr2; res = _res; len = _num;
  }
  int vector_add(int lo, int hi) {
    #pragma omp parallel for
    for (int i = lo; i < hi; ++i) {
      res[i] = arr1[i] + arr2[i];
    }
    return 1;
  }
};  // end class definition
Slide 25: Sophomoric Parallelism and Concurrency, Lecture 2

Maps and reductions

Maps and reductions: the “workhorses” of parallel programming
– By far the two most important and common patterns
  • Two more-advanced patterns in the next lecture
– Learn to recognize when an algorithm can be written in terms of maps and reductions
– Use maps and reductions to describe (parallel) algorithms
– Programming them becomes “trivial” with a little practice
  • Exactly like sequential for-loops seem second-nature
Slide 26: Sophomoric Parallelism and Concurrency, Lecture 2

Trees

• Maps and reductions work just fine on balanced trees
  – Divide-and-conquer on each child rather than on array subranges
  – Correct for unbalanced trees, but won’t get much speed-up
• Example: minimum element in an unsorted but balanced binary tree in O(log n) time, given enough processors
• How to do the sequential cut-off?
  – Store the number of descendants at each node (easy to maintain)
  – Or approximate it with, e.g., AVL-tree height
Slide 27: Sophomoric Parallelism and Concurrency, Lecture 2

Linked lists

• Can you parallelize maps or reduces over linked lists?
  – Example: Increment all elements of a linked list
  – Example: Sum all elements of a linked list

[Figure: a linked list of nodes b, c, d, e, f with front and back pointers]

• Once again, data structures matter!
• For parallelism, balanced trees are generally better than lists, because we can get to all the data exponentially faster: O(log n) vs. O(n)
  – Trees have the same flexibility as lists compared to arrays
Slide 28: Sophomoric Parallelism and Concurrency, Lecture 2

Work and Span

Let T_P be the running time if there are P processors available

Two key measures of run-time:
• Work: How long it would take 1 processor = T_1
  – Just “sequentialize” the recursive forking
• Span: How long it would take infinitely many processors = T_∞
  – The longest dependence chain
  – Example: O(log n) for summing an array, since more than n/2 processors gives no additional help
  – Also called “critical path length” or “computational depth”
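Two standard bounds follow from these definitions (a textbook addition, not stated on the slide): with P processors,

  T_P >= max(T_1 / P, T_∞)   – no schedule can beat either limit
  T_P <= T_1 / P + T_∞       – a greedy scheduler achieves this

Example: summing an array has work T_1 = O(n) and span T_∞ = O(log n), so the best possible speedup (the “parallelism”) is T_1 / T_∞ = O(n / log n).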
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 4: Shared-Memory Concurrency & Mutual Exclusion
Dan Grossman
Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/
Slide 30: Sophomoric Parallelism & Concurrency, Lecture 4

Canonical Bank Account Example

Correct code in a single-threaded world:

class BankAccount {
  private int balance = 0;
  int getBalance() { return balance; }
  void setBalance(int x) { balance = x; }
  void withdraw(int amount) {
    int b = getBalance();
    if (amount > b)
      throw WithdrawTooLargeException();
    setBalance(b - amount);
  }
  … // other operations like deposit, etc.
}
Slide 31: Sophomoric Parallelism & Concurrency, Lecture 4

A bad interleaving

Interleaved withdraw(100) calls on the same account
– Assume initial balance == 150

(time flows downward)
Thread 1                                Thread 2
int b = getBalance();
                                        int b = getBalance();
                                        if (amount > b)
                                          throw WithdrawTooLargeException();
                                        setBalance(b - amount);
if (amount > b)
  throw WithdrawTooLargeException();
setBalance(b - amount);

“Lost withdraw” – unhappy bank
Slide 32: Sophomoric Parallelism & Concurrency, Lecture 4

What we need: an abstract data type for mutual exclusion

• There are many ways out of this conundrum, but we need help from the language
• One basic solution: locks, or critical sections
  – OpenMP implements locks as well as critical sections. They do similar things, but there are subtle differences between them.
  • #pragma omp critical
    { // allows only one thread at a time in the code block
      // the code block must have one entrance and one exit
    }
  • omp_lock_t myLock;
    – OpenMP locks have to be initialized before use
    – A nested lock is acquired with omp_set_nest_lock and released with omp_unset_nest_lock
    – Only one thread may hold the lock
    – Allows for exception handling and non-structured jumps
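A minimal sketch of the nested-lock lifecycle (the balance variable is illustrative; the omp_*_nest_lock calls are the standard OpenMP API):

#include <omp.h>

omp_nest_lock_t lk;
int balance = 0;

void setup()    { omp_init_nest_lock(&lk); }     // must initialize before use
void teardown() { omp_destroy_nest_lock(&lk); }

void deposit(int amount) {
  omp_set_nest_lock(&lk);    // blocks until no other thread holds lk
  balance += amount;
  omp_unset_nest_lock(&lk);  // forget this and other threads block forever
}

Unlike a plain omp_lock_t, a nested lock may be re-acquired by the thread that already holds it, which matters for the withdraw-calls-setBalance pattern discussed below.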
Slide 33: Sophomoric Parallelism & Concurrency, Lecture 4

Almost-correct pseudocode

class BankAccount {
  private int balance = 0;
  private Lock lk = new Lock();
  …
  void withdraw(int amount) {
    lk.acquire();  // may block
    int b = getBalance();
    if (amount > b)
      throw WithdrawTooLargeException();
    setBalance(b - amount);
    lk.release();
  }
  // deposit would also acquire/release lk
}
Slide 34: Sophomoric Parallelism & Concurrency, Lecture 4

Some mistakes

• Locks & critical sections are very primitive mechanisms
  – It is still up to you to use them correctly
• Incorrect: Use different locks, or differently named critical sections, for withdraw and deposit
  – Mutual exclusion works only when using the same lock
• Poor performance: Use the same lock for every bank account
  – No simultaneous operations on different accounts
• Incorrect: Forget to release a lock (blocks other threads forever!)
  – The previous slide is wrong because of the exception possibility!

if (amount > b) {
  lk.release();  // hard to remember!
  throw WithdrawTooLargeException();
}
Slide 35: Sophomoric Parallelism & Concurrency, Lecture 4

Other operations

• If withdraw and deposit use the same lock, then simultaneous calls to these methods are properly synchronized
• But what about getBalance and setBalance?
  – Assume they’re public, which may be reasonable
• If they don’t acquire the same lock, then a race between setBalance and withdraw could produce a wrong result
• If they do acquire the same lock, then withdraw would block forever because it tries to acquire a lock it already has
Slide 36: Sophomoric Parallelism & Concurrency, Lecture 4

Re-acquiring locks?

• Can’t let the outside world call setBalance1: it is not protected by the lock
• Can’t have withdraw call setBalance2: the locks would be nested
• We could use an intricate re-entrant locking scheme, or better yet restructure the code. Nested locking is not recommended.

void setBalance1(int x) {
  balance = x;
}
void setBalance2(int x) {
  lk.acquire();
  balance = x;
  lk.release();
}
void withdraw(int amount) {
  lk.acquire();
  …
  setBalanceX(b - amount);  // X = 1 or 2
  lk.release();
}
Slide 37: Sophomoric Parallelism & Concurrency, Lecture 4

This code is easier to lock

• You may provide a setBalance() method for external use, but do NOT call it from the withdraw() member function
• Instead, protect the direct modification of the data as shown here, avoiding nested locks

void setBalance(int x) {
  lk.acquire();
  balance = x;
  lk.release();
}
void withdraw(int amount) {
  lk.acquire();
  …
  balance = b - amount;
  lk.release();
}
Slide 38: Sophomoric Parallelism & Concurrency, Lecture 4

Demo

Walk-through of the load-balanced BankAccount code

Discussion points:
• #pragma omp critical
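A minimal sketch of what the critical-section version of withdraw might look like (names follow the pseudocode above, not the actual demo sources; an error flag replaces the exception because a critical block must have a single entrance and exit):

#include <omp.h>

class BankAccount {
  int balance = 0;
public:
  bool error = false;  // set instead of throwing: no jumps out of a critical block
  void withdraw(int amount) {
    // One named critical section shared by withdraw/deposit/setBalance;
    // note this serializes all accounts, as criticized on slide 34.
    #pragma omp critical(account)
    {
      int b = balance;
      if (amount > b) error = true;
      else balance = b - amount;
    }
  }
};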
Slide 39: Sophomoric Parallelism & Concurrency, Lecture 4

OpenMP critical section method

• Works great & is simple to implement
• Can’t be used in nested function calls where each function enters the same critical section. For example, foo() contains a critical section and calls bar(), while bar() also contains a critical section: the OpenMP runtime will not allow this!
• Generally better to avoid nested calls such as the foo/bar example above, and as also shown in the Canonical Bank Account Example where withdraw() calls setBalance()
• Can’t be used with try/catch exception handling (due to multiple exits from the code block)
• If try/catch is required, consider using scoped locking – example to follow
Slide 40: Sophomoric Parallelism & Concurrency, Lecture 4

Scoped locking method

• Works great & is fairly simple to implement, though more complicated than the critical section method
• Can be used in nested function calls where each function acquires the lock
  – Generally it is still better to avoid nested calls such as the foo/bar example on the previous foil, and as also shown in the Canonical Bank Account Example where withdraw() calls setBalance()
• Can be used with try/catch exception handling
• The lock is released when the guard object goes out of scope
Slide 41: Sophomoric Parallelism & Concurrency, Lecture 4

Demo

Walk-through of the load-balanced BankAccount code

Discussion points:
• scoped locking using OpenMP nested locks
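A minimal sketch of scoped locking over an OpenMP nested lock (OpenMP itself provides no such class; this RAII wrapper and its name are illustrative):

#include <omp.h>

class ScopedNestLock {
  omp_nest_lock_t& lk_;
public:
  explicit ScopedNestLock(omp_nest_lock_t& lk) : lk_(lk) { omp_set_nest_lock(&lk_); }
  ~ScopedNestLock() { omp_unset_nest_lock(&lk_); }  // released on every exit path
  ScopedNestLock(const ScopedNestLock&) = delete;
  ScopedNestLock& operator=(const ScopedNestLock&) = delete;
};

// Usage sketch: withdraw may call setBalance, because the nested lock
// can be re-acquired by the thread that already holds it, and the
// destructor releases it even if WithdrawTooLargeException is thrown.
//
//   void withdraw(int amount) {
//     ScopedNestLock guard(lk);
//     int b = getBalance();
//     if (amount > b) throw WithdrawTooLargeException();
//     setBalance(b - amount);
//   }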
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 5: Programming with Locks and Critical Sections
Dan Grossman
Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/
Slide 43: Sophomoric Parallelism & Concurrency, Lecture 5

Demo

Walk-through of the load-balanced stack code

Discussion points:
• using OpenMP nested locks
• using OpenMP critical sections
Slide 44: Sophomoric Parallelism & Concurrency, Lecture 5

Example, using critical sections

class Stack<E> {
  private E[] array = (E[]) new Object[SIZE];
  int index = -1;
  bool isEmpty() {  // unsynchronized: wrong?!
    return index == -1;
  }
  void push(E val) {
    #pragma omp critical
    { array[++index] = val; }
  }
  E pop() {
    E temp;
    #pragma omp critical
    { temp = array[index--]; }
    return temp;
  }
  E peek() {  // unsynchronized: wrong!
    return array[index];
  }
}
Slide 45: Sophomoric Parallelism & Concurrency, Lecture 5

Why wrong?

• It looks like isEmpty and peek can “get away with this” since push and pop adjust the state “in one tiny step”
• But this code is still wrong, and depends on language-implementation details you cannot assume
  – Even “tiny steps” may require multiple steps in the implementation: array[++index] = val probably takes at least two steps
  – The code has a data race, allowing very strange behavior
    • Important discussion in next lecture
• Moral: Don’t introduce a data race, even if every interleaving you can think of is correct
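One way to remove the race, sketched as replacement bodies for the Stack above (this assumes push and pop are also changed to use the same named critical section; the name “stack” is illustrative):

bool isEmpty() {
  bool empty;
  #pragma omp critical(stack)
  empty = (index == -1);
  return empty;
}

E peek() {
  E temp;
  #pragma omp critical(stack)
  temp = array[index];  // still fails on an empty stack, as before
  return temp;
}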
Slide 46: Sophomoric Parallelism & Concurrency, Lecture 5

3 choices

For every memory location (e.g., object field) in your program, you must obey at least one of the following:
1. Thread-local: Don’t use the location in more than 1 thread
2. Immutable: Don’t write to the memory location
3. Synchronized: Use synchronization to control access to the location

[Figure: all memory, with thread-local memory and immutable memory carved out; whatever remains needs synchronization]
Slide 47: Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
– Assume a critical section guards the whole table

#pragma omp critical
{
  v1 = table.lookup(k);
  v2 = expensive(v1);
  table.remove(k);
  table.insert(k, v2);
}

Papa Bear’s critical section was too long
(table locked during the expensive call)
Slide 48: Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
– Assume a critical section guards the whole table

#pragma omp critical
{
  v1 = table.lookup(k);
}
v2 = expensive(v1);
#pragma omp critical
{
  table.remove(k);
  table.insert(k, v2);
}

Mama Bear’s critical section was too short
(if another thread updated the entry, we will lose an update)
Slide 49: Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
– Assume a critical section guards the whole table

done = false;
while (!done) {
  #pragma omp critical
  {
    v1 = table.lookup(k);
  }
  v2 = expensive(v1);
  #pragma omp critical
  {
    if (table.lookup(k) == v1) {
      done = true;
      table.remove(k);
      table.insert(k, v2);
    }
  }
}

Baby Bear’s critical section was just right
(if another update occurred, try our update again)
Slide 50: Sophomoric Parallelism & Concurrency, Lecture 5

Don’t roll your own

• It is rare that you should write your own data structure
  – They are provided in standard libraries
  – The point of these lectures is to understand the key trade-offs and abstractions
• Especially true for concurrent data structures
  – Far too difficult to provide fine-grained synchronization without race conditions
  – Standard thread-safe libraries, like Intel TBB’s concurrent_hash_map or Java’s ConcurrentHashMap, are written by world experts

Guideline #5: Use built-in libraries whenever they meet your needs