Executing Parallel Programs with Potential Bottlenecks Efficiently Yoshihiro Oyama Kenjiro Taura Akinori Yonezawa {oyama, tau, yonezawa}@is.s.u-tokyo.ac.jp University of Tokyo


Page 1:

Executing Parallel Programs with Potential Bottlenecks Efficiently

Yoshihiro Oyama
Kenjiro Taura
Akinori Yonezawa
{oyama, tau, yonezawa}@is.s.u-tokyo.ac.jp

University of Tokyo

Page 2:

Bottlenecks

Research context: Implementing a concurrent OO language on SMP or DSM machines

[Figure: many concurrent invocations ("Update!") of an exclusive method on a bottleneck object (e.g., a shared counter)]

• Exclusive methods: e.g., synchronized methods in Java
• The methods are serialized
• The execution time here is very large

Page 3:

Speedup Curves for Programs with Bottlenecks

[Figure: execution time vs. number of processors, with two curves labeled "ideal" and "reality". Annotation: "Good compilers should give this curve!!!"]

We may execute a program on too many processors (because it is not always easy to predict dynamic behavior).

Page 4:

Goal

Making the whole execution time on multiprocessors close to the time to sequentially execute bottlenecks only

[Figure: execution-time bars, each split into "bottleneck parts" and "other parts", for 1 PE and for 50 PEs under the naïve implementation and the ideal implementation]

Page 5:

Experiment using Counter Program in C

• Solaris threads & Ultra Enterprise 10000
• Each processor increments a shared counter in parallel

[Figure: time (msec, 0 to 2000) vs. # of PEs (0 to 70). Legend as extracted: spin, block, block (detach), getone, detach, reg. + pref.]

Page 6:

Implementation with Spinlocks

Each processor executes methods by itself

[Figure: several processors, each executing methods directly on the bottleneck object and its object data]

Advantage: No need to move “computation” among processors

Disadvantage: Frequent cache misses in reading a bottleneck object (because of cache invalidation by other processors)
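The spinlock approach above can be sketched in C as follows. This is a minimal illustration using C11 atomics, not the code measured in the talk; `spin_counter` and its functions are hypothetical names introduced here.

```c
#include <stdatomic.h>

/* Hypothetical bottleneck object: a shared counter guarded by a spinlock. */
typedef struct {
    atomic_flag lock;   /* clear = free, set = held */
    long value;         /* object data */
} spin_counter;

static void spin_counter_init(spin_counter *c) {
    atomic_flag_clear(&c->lock);
    c->value = 0;
}

/* Exclusive method: the calling processor spins, then runs the body itself. */
static long spin_counter_increment(spin_counter *c) {
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire)) {
        /* busy-wait until the current holder releases the lock */
    }
    long result = ++c->value;   /* reads and writes the object's cache line */
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
    return result;
}
```

Every caller touches `value` itself, so the cache line holding the object moves between processors on each invocation, which is the disadvantage named on the slide.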

Page 7:

Implementation with Simple Blocking Locks

[Figure: non-owners enqueue contexts into a queue of “contexts” attached to the bottleneck object; the owner dequeues them and accesses the object data]

Owner dequeues contexts one by one with mutex operations

Advantage: Few cache misses in reading a bottleneck object

Disadvantage: Overheads to move “computation”
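A rough C sketch of a simple blocking lock with a mutex-protected queue of contexts is shown below. The types and names (`context`, `blocking_object`, `blocking_invoke`, `add_method`) are assumptions made for illustration, and the sketch omits how a non-owner later resumes after its context has been executed.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical context: a deferred method invocation on the object. */
typedef struct context {
    void (*method)(long *object_data, long arg);
    long arg;
    struct context *next;
} context;

typedef struct {
    pthread_mutex_t mutex;   /* protects owned, head, tail */
    int owned;               /* nonzero while some processor is the owner */
    context *head, *tail;    /* FIFO queue of pending contexts */
    long data;               /* object data, touched only by the owner */
} blocking_object;

static void blocking_object_init(blocking_object *o) {
    pthread_mutex_init(&o->mutex, NULL);
    o->owned = 0;
    o->head = o->tail = NULL;
    o->data = 0;
}

/* Invoke ctx->method on the object. ctx must stay valid until it has run. */
static void blocking_invoke(blocking_object *o, context *ctx) {
    ctx->next = NULL;
    pthread_mutex_lock(&o->mutex);
    if (o->owned) {                       /* non-owner: enqueue and leave */
        if (o->tail) o->tail->next = ctx; else o->head = ctx;
        o->tail = ctx;
        pthread_mutex_unlock(&o->mutex);
        return;
    }
    o->owned = 1;                         /* become the owner */
    pthread_mutex_unlock(&o->mutex);

    while (ctx != NULL) {
        ctx->method(&o->data, ctx->arg);  /* run without holding the mutex */
        pthread_mutex_lock(&o->mutex);    /* one lock/unlock per context */
        ctx = o->head;
        if (ctx != NULL) {
            o->head = ctx->next;
            if (o->head == NULL) o->tail = NULL;
        } else {
            o->owned = 0;                 /* queue empty: give up ownership */
        }
        pthread_mutex_unlock(&o->mutex);
    }
}

/* Example method: add arg to the counter. */
static void add_method(long *object_data, long arg) { *object_data += arg; }
```

Only the owner reads `data`, so the object stays in one cache, but the owner pays a mutex lock and unlock for every dequeued context.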

Page 8:

Overview of Our Scheme

Improvement of simple blocking locks

– Overheads in simple blocking locks
  • Mutex operations for a queue of contexts
  • Waiting time imposed on an owner for mutex
  • Cache misses in reading contexts

– Solution
  • Detaching a whole list of contexts from an object
  • Giving higher priority to an owner
  • Prefetching context data

Page 9:

Our Scheme (Inserting a Context)

When a non-owner invokes a method

[Figure: before and after views of the bottleneck object, its owner, the non-owners, and a list of contexts (elements labeled A, B, C, D, X, Y, Z); in the after view a context has been inserted into the list]

Page 10:

Our Scheme (Detaching Contexts)

When an owner executes methods

[Figure: the whole list of contexts is detached from the bottleneck object ("list detached!!!"); new contexts are inserted into the object while the detached contexts are executed in turn without mutex ops for the list]

Many mutex operations by owner are eliminated

Page 11:

Our Scheme (Low-Level Implementation)

[Figure: a one-word area in the bottleneck object; the owner (with high priority) updates the area with swap, and non-owners (with low priority) update the area with compare-and-swap]

• Detachment: always succeeds in constant time
• Insertion: may fail many times

Owner no longer has the overhead of waiting time for mutex

Why one word? Why list, not queue?
To make our algorithm lock-free and non-blocking
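The one-word area can be illustrated with C11 atomics as below. This is a sketch under assumptions, not the authors' implementation: the slides do not show how ownership itself is encoded, so only the two list operations are covered, and `bottleneck_object`, `insert_context`, and `detach_contexts` are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct context {
    struct context *next;
    long arg;              /* hypothetical payload */
} context;

/* The "one word area": a single atomic pointer to the newest context. */
typedef struct {
    _Atomic(context *) contexts;
} bottleneck_object;

static void bottleneck_object_init(bottleneck_object *o) {
    atomic_init(&o->contexts, (context *)NULL);
}

/* Non-owner: insert at the head with compare-and-swap; may retry many times. */
static void insert_context(bottleneck_object *o, context *ctx) {
    context *old = atomic_load_explicit(&o->contexts, memory_order_relaxed);
    do {
        ctx->next = old;   /* after a failed CAS, old holds the current head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &o->contexts, &old, ctx,
                 memory_order_release, memory_order_relaxed));
}

/* Owner: detach the whole list with one swap; never retries. */
static context *detach_contexts(bottleneck_object *o) {
    return atomic_exchange_explicit(&o->contexts, (context *)NULL,
                                    memory_order_acquire);
}
```

Because insertion pushes at the head, the detached list comes out newest-first; a singly linked list needs only one word to publish, whereas a FIFO queue would need both a head and a tail.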

Page 12:

Compile-time Optimizations

• Prefetching context data
  – While one detached context is executed, another context is prefetched
  – The number of cache misses in reading contexts is reduced
• Assigning object data to registers
  – Passing object data on registers

This processing is realized implicitly by the compiler and runtime of a concurrent OO language, Schematic

[Figure: a list of detached contexts]
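In hand-written C, the two optimizations might look like the loop below. It is only an approximation of what the Schematic compiler and runtime do implicitly: `run_detached` and the `arg` payload are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin rather than standard C.

```c
#include <stddef.h>

typedef struct context {
    struct context *next;
    long arg;              /* hypothetical payload: amount to add */
} context;

/* Owner-side loop over a detached list; returns the updated object data. */
static long run_detached(context *list, long object_data) {
    long data = object_data;   /* local copy the compiler can keep in a register */
    for (context *ctx = list; ctx != NULL; ) {
        context *next = ctx->next;
        if (next != NULL)
            __builtin_prefetch(next, 0, 3);   /* read access, high temporal locality */
        data += ctx->arg;      /* "method body" of the current context */
        ctx = next;
    }
    return data;               /* caller writes this back to the object once */
}
```

The prefetch for the next context is issued before the current method body runs, so the memory latency overlaps with useful work.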

Page 13:

Experimental Results (1)

RNA problem solver (with statistics) in Schematic on Ultra Enterprise 10000

[Figure: time (msec, 0 to 7000) vs. # of PEs (0 to 70). Legend: spin, block, our scheme]

Page 14:

Experimental Results (2)

RNA problem solver (with statistics) in Schematic on Origin 2000

[Figure: time (msec, 0 to 16000) vs. # of PEs (0 to 110). Legend: spin, block, our scheme]

Page 15:

The main explanation ends here

Page 16:

The Other Interesting Facts

• Waiting time for mutex is very large
  – 70 % of owner’s execution time

• Our scheme gives good performance also on uniprocessor (the execution time of a simple counter program)
  – spinlock: 641 msec
  – simple blocking lock: 1025 msec
  – our scheme: 810 msec

Page 17:

Examples of Bottlenecks

• MT-unsafe libraries
  – Many libraries assume single-threaded use
• I/O calls
  – printf, etc.
• Stub objects in distributed systems
  – One representative object is responsible for all communication in a site
• Shared global variables
  – e.g., counters to collect statistics information

Page 18:

Limitations

• Our scheme may use large memory
  – Non-owners create many contexts
• Our scheme does not guarantee FIFO scheduling of methods in an object
  – Simple solution is reversing a detached list
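Reversing a detached list is a standard in-place operation on a singly linked list; a minimal C sketch, with a hypothetical `context` type, is:

```c
#include <stddef.h>

typedef struct context {
    struct context *next;
    long arg;              /* hypothetical payload */
} context;

/* Reverse a detached, newest-first list so the oldest context comes first. */
static context *reverse_contexts(context *list) {
    context *reversed = NULL;
    while (list != NULL) {
        context *next = list->next;
        list->next = reversed;   /* relink the node onto the reversed list */
        reversed = list;
        list = next;
    }
    return reversed;
}
```

The reversal is linear in the length of the batch and runs on the owner only, after the list has been detached, so it needs no atomic operations.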

Page 19:

Future Work

• Solving a potential problem in memory use
  – Problem: Huge memory may be required for contexts
  – Simple solution: switch to local-based execution when memory for contexts exceeds some threshold

Switch dynamically between:

Owner-based execution
• More efficient in bottlenecks
• Using more memory

Local-based execution
• Less efficient in bottlenecks
• Using less memory
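One possible shape for the threshold-based switch is sketched below. The slides present this only as future work, so everything here is an assumption: the `context_budget` accounting, the function names, and the reading of local-based execution as the invoking processor running the method itself.

```c
#include <stdatomic.h>
#include <stddef.h>

enum exec_mode { OWNER_BASED, LOCAL_BASED };

/* Hypothetical accounting of memory currently held by pending contexts. */
typedef struct {
    atomic_size_t context_bytes;
    size_t threshold_bytes;      /* assumed tuning parameter */
} context_budget;

static void context_budget_init(context_budget *b, size_t threshold_bytes) {
    atomic_init(&b->context_bytes, 0);
    b->threshold_bytes = threshold_bytes;
}

/* Decide how to run an invocation whose context would need `size` bytes. */
static enum exec_mode choose_mode(context_budget *b, size_t size) {
    size_t before = atomic_fetch_add(&b->context_bytes, size);
    if (before + size > b->threshold_bytes) {
        atomic_fetch_sub(&b->context_bytes, size);   /* undo the reservation */
        return LOCAL_BASED;
    }
    return OWNER_BASED;
}

/* Call once the owner has finished executing a context of `size` bytes. */
static void release_context(context_budget *b, size_t size) {
    atomic_fetch_sub(&b->context_bytes, size);
}
```

A non-owner would call `choose_mode` before allocating a context and fall back to executing the method locally when the budget is exhausted.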


Page 20:

Achieving the Same Effect in Low-level Languages (e.g., in C)

Typical behavior of programmers
– Local-based execution in non-bottlenecks
– Owner-based execution in bottlenecks

Disadvantages
• Some bottlenecks emerge dynamically (under the effect of the number of processors and runtime parameters)
• It is tedious to implement owner-based execution (because context data structure varies according to objects and methods)