Executing Parallel Programs with Potential Bottlenecks Efficiently
Yoshihiro Oyama
Kenjiro Taura
Akinori Yonezawa
{oyama, tau, yonezawa}@is.s.u-tokyo.ac.jp
University of Tokyo
Bottlenecks
bottleneck object (e.g., a shared counter)
The execution time here is very large.
Research context: Implementing a concurrent OO language on SMP or DSM machines
concurrent invocations of an exclusive method (e.g., synchronized methods in Java)
The methods are serialized: the invocations update the object one after another.
[Figure: several concurrent invocations of an exclusive method, each updating the bottleneck object in turn]
Speedup Curves for Programs with Bottlenecks
[Figure: time vs. processors. The ideal curve keeps decreasing; in reality, time grows again beyond some number of processors. Good compilers should give the ideal curve!]
We may execute a program on too many processors (because it is not always easy to predict dynamic behavior).
Goal
Making the whole execution time on multiprocessors close to the time to sequentially execute bottlenecks only.
[Figure: bar charts of execution time on 1 PE and 50 PEs, each split into bottleneck parts and other parts, comparing a naive implementation with an ideal implementation]
Experiment using Counter Program in C
[Figure: time (msec, 0–2000) vs. # of PEs (0–70); curves: spin, block, block (detach), get one, detach, reg. + pref.]
• Solaris threads & Ultra Enterprise 10000
• Each processor increments a shared counter in parallel
Implementation with Spinlocks
Each processor (non-owners included) executes methods by itself.
[Figure: every processor's method reads the bottleneck object's data directly]
Advantage: no need to move "computation" among processors
Disadvantage: frequent cache misses in reading a bottleneck object (because of cache invalidation by other processors)
Implementation with Simple Blocking Locks
[Figure: non-owners enqueue "contexts" onto a queue attached to the bottleneck object; the owner dequeues contexts one by one, with mutex operations]
Advantage: few cache misses in reading a bottleneck object
Disadvantage: overheads to move "computation"
Overview of Our Scheme
Improvement of simple blocking locks
– Overheads in simple blocking locks:
  • Mutex operations for a queue of contexts
  • Waiting time imposed on an owner for mutex
  • Cache misses in reading contexts
– Solution:
  • Detaching a whole list of contexts from an object
  • Giving higher priority to an owner
  • Prefetching context data
Our Scheme (Inserting a Context)
When a non-owner invokes a method, a context for the invocation is inserted at the head of a list of contexts attached to the bottleneck object.
[Figure: before/after diagrams; non-owners X, Y, Z insert contexts (A–D) onto the bottleneck object's list while the owner runs]
Our Scheme (Detaching Contexts)
When an owner executes methods, it detaches the whole list of contexts from the object at once. The detached contexts are executed in turn without mutex operations for the list, so many mutex operations by the owner are eliminated. Meanwhile, new contexts are inserted onto the object's (now empty) list.
[Figure: before/after diagrams; the list A–D is detached from the bottleneck object while X, Z insert new contexts]
Our Scheme (Low-Level Implementation)
The list is reached from a one-word area in the bottleneck object. The owner (with high priority) updates the area with swap; non-owners (with low priority) update it with compare-and-swap.
Detachment: always succeeds in constant time. Insertion: may fail many times.
The owner no longer has the overhead of waiting time for mutex.
Why one word? Why a list, not a queue? To make our algorithm lock-free and non-blocking.
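The swap/compare-and-swap protocol on the one-word area can be sketched with C11 atomics (hypothetical names; the talk's actual runtime lives inside the Schematic implementation):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names: a context reifies one method invocation. */
typedef struct context {
    int arg;
    struct context *next;
} context;

/* The "one word area" in the bottleneck object: the head of the
   list of pending contexts (newest first). */
static _Atomic(context *) head = NULL;

/* Non-owner: insertion by compare-and-swap; it may fail many times
   under contention and must retry. */
static void insert_context(context *c) {
    context *old = atomic_load(&head);
    do {
        c->next = old;                       /* link to current list */
    } while (!atomic_compare_exchange_weak(&head, &old, c));
}

/* Owner: detachment by an unconditional swap; it always succeeds in
   constant time, leaving an empty list behind. */
static context *detach_all(void) {
    return atomic_exchange(&head, NULL);
}
```

Because the whole shared state fits in one word and insertion only pushes at the head, a stalled thread can never block the others; this is why a list (not a FIFO queue) keeps the algorithm lock-free and non-blocking.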
Compile-time Optimizations
• Prefetching context data: while one detached context is executed, the next context is prefetched. The number of cache misses in reading contexts is reduced.
• Assigning object data to registers: object data is passed on registers from method to method.
This processing is realized implicitly by the compiler and runtime of the concurrent OO language Schematic.
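The prefetching idea can be sketched as follows (hypothetical names; Schematic emits this implicitly, and the GCC/Clang builtin stands in for the generated prefetch instruction):

```c
#include <stddef.h>

/* Hypothetical names: contexts detached from a bottleneck object. */
typedef struct context {
    long arg;
    struct context *next;
} context;

/* Execute detached contexts in turn.  The object's data lives in the
   local variable `data` (standing in for "assigning object data to
   registers"); while one context runs, the next one is prefetched. */
static long run_detached(context *list, long data) {
    for (context *c = list; c != NULL; c = c->next) {
        if (c->next != NULL)
            __builtin_prefetch(c->next, /*rw=*/0, /*locality=*/3);
        data += c->arg;            /* the "method body" on the object */
    }
    return data;
}
```

The prefetch overlaps the memory latency of reading the next context with the execution of the current method, which is where the reduction in cache misses comes from.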
Experimental Results (1)
RNA problem solver (with statistics) in Schematic on Ultra Enterprise 10000
[Figure: time (msec, 0–7000) vs. # of PEs (0–70); curves: spin, block, our scheme]
Experimental Results (2)
RNA problem solver (with statistics) in Schematic on Origin 2000
[Figure: time (msec, 0–16000) vs. # of PEs (0–110); curves: spin, block, our scheme]
(This is the end of the main explanation.)
Other Interesting Facts
• Waiting time for mutex is very large: 70 % of the owner's execution time.
• Our scheme also gives good performance on a uniprocessor (execution time of a simple counter program):
  – spinlock: 641 msec
  – simple blocking lock: 1025 msec
  – our scheme: 810 msec
Examples of Bottlenecks
• MT-unsafe libraries
  – Many libraries assume single-threaded use
• I/O calls
  – printf, etc.
• Stub objects in distributed systems
  – One representative object is responsible for all communication in a site
• Shared global variables
  – e.g., counters to collect statistics information
Limitations
• Our scheme may use large memory
  – Non-owners create many contexts
• Our scheme does not guarantee FIFO scheduling of methods in an object
  – A simple solution is reversing a detached list
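The FIFO fix mentioned above can be sketched as follows (hypothetical context struct): the detached list is newest-first, so an in-place reversal restores insertion order before the owner runs the contexts.

```c
#include <stddef.h>

/* Hypothetical names: a context reifies one method invocation. */
typedef struct context {
    int id;
    struct context *next;
} context;

/* In-place reversal of a detached list: O(n) time, no extra memory. */
static context *reverse(context *list) {
    context *prev = NULL;
    while (list != NULL) {
        context *next = list->next;
        list->next = prev;         /* point the node at its predecessor */
        prev = list;
        list = next;
    }
    return prev;                   /* the old tail becomes the new head */
}
```

Only the owner touches the detached list, so no atomic operations are needed during the reversal.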
Future Work
Solving a potential problem in memory use
– Problem: huge memory may be required for contexts
– Simple solution: switch dynamically to local-based execution when the memory for contexts exceeds some threshold
  • Owner-based execution: more efficient in bottlenecks, uses more memory
  • Local-based execution: less efficient in bottlenecks, uses less memory
Achieving the Same Effect in Low-level Languages (e.g., in C)
Typical behavior of programmers:
– Local-based execution in non-bottlenecks
– Owner-based execution in bottlenecks
Disadvantages:
• Some bottlenecks emerge dynamically (under the effect of the number of processors and runtime parameters)
• It is tedious to implement owner-based execution (because the context data structure varies according to objects and methods)