Executing Parallel Programs with Potential Bottlenecks Efficiently
Yoshihiro Oyama
Kenjiro Taura
Akinori Yonezawa
{oyama, tau, yonezawa}@is.s.u-tokyo.ac.jp
University of Tokyo
Bottlenecks
bottleneck object (e.g., a shared counter)
The execution time here is very large.
Research context: Implementing a concurrent OO language on SMP or DSM machines
concurrent invocations of an exclusive method (e.g., synchronized methods in Java)
The methods are serialized: the invocations update the object one after another.
[Figure: several concurrent invocations of an exclusive method, each updating the bottleneck object in turn]
Speedup Curves for Programs with Bottlenecks
[Figure: time vs. processors. The ideal curve keeps decreasing; in reality, time grows again beyond some number of processors. Good compilers should give the ideal curve!]
We may execute a program on too many processors (because it is not always easy to predict dynamic behavior).
Goal
Making the whole execution time on multiprocessors close to the time to sequentially execute bottlenecks only.
[Figure: bar charts of execution time on 1 PE and 50 PEs, each split into bottleneck parts and other parts, comparing a naive implementation with an ideal implementation]
Experiment using Counter Program in C
[Figure: time (msec, 0–2000) vs. # of PEs (0–70); curves: spin, block, block (detach), get one, detach, reg. + pref.]
• Solaris threads & Ultra Enterprise 10000
• Each processor increments a shared counter in parallel
Implementation with Spinlocks
Each processor (non-owners included) executes methods by itself.
[Figure: every processor's method reads the bottleneck object's data directly]
Advantage: no need to move "computation" among processors
Disadvantage: frequent cache misses in reading a bottleneck object (because of cache invalidation by other processors)
Implementation with Simple Blocking Locks
[Figure: non-owners enqueue "contexts" onto a queue attached to the bottleneck object; the owner dequeues contexts one by one, with mutex operations]
Advantage: few cache misses in reading a bottleneck object
Disadvantage: overheads to move "computation"
Overview of Our Scheme
Improvement of simple blocking locks
– Overheads in simple blocking locks:
  • Mutex operations for a queue of contexts
  • Waiting time imposed on an owner for mutex
  • Cache misses in reading contexts
– Solution:
  • Detaching a whole list of contexts from an object
  • Giving higher priority to an owner
  • Prefetching context data
Our Scheme (Inserting a Context)
When a non-owner invokes a method, a context for the invocation is inserted at the head of a list of contexts attached to the bottleneck object.
[Figure: before/after diagrams; non-owners X, Y, Z insert contexts (A–D) onto the bottleneck object's list while the owner runs]
Our Scheme (Detaching Contexts)
When an owner executes methods, it detaches the whole list of contexts from the object at once. The detached contexts are executed in turn without mutex operations for the list, so many mutex operations by the owner are eliminated. Meanwhile, new contexts are inserted onto the object's (now empty) list.
[Figure: before/after diagrams; the list A–D is detached from the bottleneck object while X, Z insert new contexts]
Our Scheme (Low-Level Implementation)
The list is reached from a one-word area in the bottleneck object. The owner (with high priority) updates the area with swap; non-owners (with low priority) update it with compare-and-swap.
Detachment: always succeeds in constant time. Insertion: may fail many times.
The owner no longer has the overhead of waiting time for mutex.
Why one word? Why a list, not a queue? To make our algorithm lock-free and non-blocking.
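The swap/compare-and-swap protocol on the one-word area can be sketched with C11 atomics (hypothetical names; the talk's actual runtime lives inside the Schematic implementation):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names: a context reifies one method invocation. */
typedef struct context {
    int arg;
    struct context *next;
} context;

/* The "one word area" in the bottleneck object: the head of the
   list of pending contexts (newest first). */
static _Atomic(context *) head = NULL;

/* Non-owner: insertion by compare-and-swap; it may fail many times
   under contention and must retry. */
static void insert_context(context *c) {
    context *old = atomic_load(&head);
    do {
        c->next = old;                       /* link to current list */
    } while (!atomic_compare_exchange_weak(&head, &old, c));
}

/* Owner: detachment by an unconditional swap; it always succeeds in
   constant time, leaving an empty list behind. */
static context *detach_all(void) {
    return atomic_exchange(&head, NULL);
}
```

Because the whole shared state fits in one word and insertion only pushes at the head, a stalled thread can never block the others; this is why a list (not a FIFO queue) keeps the algorithm lock-free and non-blocking.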
Compile-time Optimizations
• Prefetching context data: while one detached context is executed, the next context is prefetched. The number of cache misses in reading contexts is reduced.
• Assigning object data to registers: object data is passed on registers from method to method.
This processing is realized implicitly by the compiler and runtime of the concurrent OO language Schematic.
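The prefetching idea can be sketched as follows (hypothetical names; Schematic emits this implicitly, and the GCC/Clang builtin stands in for the generated prefetch instruction):

```c
#include <stddef.h>

/* Hypothetical names: contexts detached from a bottleneck object. */
typedef struct context {
    long arg;
    struct context *next;
} context;

/* Execute detached contexts in turn.  The object's data lives in the
   local variable `data` (standing in for "assigning object data to
   registers"); while one context runs, the next one is prefetched. */
static long run_detached(context *list, long data) {
    for (context *c = list; c != NULL; c = c->next) {
        if (c->next != NULL)
            __builtin_prefetch(c->next, /*rw=*/0, /*locality=*/3);
        data += c->arg;            /* the "method body" on the object */
    }
    return data;
}
```

The prefetch overlaps the memory latency of reading the next context with the execution of the current method, which is where the reduction in cache misses comes from.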
Experimental Results (1)
RNA problem solver (with statistics) in Schematic on Ultra Enterprise 10000
[Figure: time (msec, 0–7000) vs. # of PEs (0–70); curves: spin, block, our scheme]
Experimental Results (2)
RNA problem solver (with statistics) in Schematic on Origin 2000
[Figure: time (msec, 0–16000) vs. # of PEs (0–110); curves: spin, block, our scheme]
(This is the end of the main explanation.)
Other Interesting Facts
• Waiting time for mutex is very large: 70 % of the owner's execution time.
• Our scheme also gives good performance on a uniprocessor (execution time of a simple counter program):
  – spinlock: 641 msec
  – simple blocking lock: 1025 msec
  – our scheme: 810 msec
Examples of Bottlenecks
• MT-unsafe libraries
  – Many libraries assume single-threaded use
• I/O calls
  – printf, etc.
• Stub objects in distributed systems
  – One representative object is responsible for all communication in a site
• Shared global variables
  – e.g., counters to collect statistics information
Limitations
• Our scheme may use large memory
  – Non-owners create many contexts
• Our scheme does not guarantee FIFO scheduling of methods in an object
  – A simple solution is reversing a detached list
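The FIFO fix mentioned above can be sketched as follows (hypothetical context struct): the detached list is newest-first, so an in-place reversal restores insertion order before the owner runs the contexts.

```c
#include <stddef.h>

/* Hypothetical names: a context reifies one method invocation. */
typedef struct context {
    int id;
    struct context *next;
} context;

/* In-place reversal of a detached list: O(n) time, no extra memory. */
static context *reverse(context *list) {
    context *prev = NULL;
    while (list != NULL) {
        context *next = list->next;
        list->next = prev;         /* point the node at its predecessor */
        prev = list;
        list = next;
    }
    return prev;                   /* the old tail becomes the new head */
}
```

Only the owner touches the detached list, so no atomic operations are needed during the reversal.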
Future Work
Solving a potential problem in memory use
– Problem: huge memory may be required for contexts
– Simple solution: switch dynamically to local-based execution when the memory for contexts exceeds some threshold
  • Owner-based execution: more efficient in bottlenecks, uses more memory
  • Local-based execution: less efficient in bottlenecks, uses less memory
Achieving the Same Effect in Low-level Languages (e.g., in C)
Typical behavior of programmers:
– Local-based execution in non-bottlenecks
– Owner-based execution in bottlenecks
Disadvantages:
• Some bottlenecks emerge dynamically (under the effect of the number of processors and runtime parameters)
• It is tedious to implement owner-based execution (because the context data structure varies according to objects and methods)