A Survey on Scheduling Methods of Task-Parallel Processing
Chikayama and Taura Lab, M1 48-096415, Jun Nakashima


Page 1: A Survey on Scheduling Methods of Task-Parallel Processing

A Survey on Scheduling Methods of Task-Parallel Processing

Chikayama and Taura Lab, M1 48-096415, Jun Nakashima

Page 2

Agenda

• Introduction
• Basic Scheduling Methods
• Challenges and solutions
• Consideration
• Summary

Page 3

Motivation

• Threads and tasks have much in common
  – Both are units of execution
  – Multiple threads/tasks may be executed simultaneously
• Scheduling methods for tasks can therefore be useful for threads as well

Page 4

Background

• Growing demand for exploiting dynamic and irregular parallelism
• Simple parallelization (pthreads, OpenMP, …) is not efficient
  – Few threads: load balancing is difficult
  – Many threads: good load balance, but the overhead is unbearable
• Examples:
  – N-Queens puzzle
  – Strassen’s algorithm (matrix-matrix multiplication)
  – LU factorization of a sparse matrix

Page 5

Task-Parallel Processing

• Decompose the entire computation into tasks and execute them in parallel
  – Task: a unit of execution much lighter than a thread
  – Fairness among tasks is not considered
    • Tasks may be deferred or suspended
• Representation of dependences
  – A task can create child tasks
  – A task can wait for its child tasks
• Programming environments with task support:
  – Cilk, X10, Intel TBB, OpenMP (≥ 3.0), etc.

Page 6

Task-Parallel Processing (2)

A simple example:

task task_fib(n) {
    if (n <= 1) return 1;
    t1 = create_task(task_fib(n - 2));  // create tasks
    t2 = create_task(task_fib(n - 1));
    ret1 = task_wait(t1);               // wait for children
    ret2 = task_wait(t2);
    return ret1 + ret2;
}

[Figure: task graph of fib(n); tasks of the same color can be executed in parallel]
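The pseudocode above can be sketched as runnable C++, using std::async as a stand-in for create_task/task_wait. This mapping is purely illustrative: std::async may spawn a whole thread per call, whereas a real task-parallel runtime (Cilk, TBB, …) creates far lighter-weight tasks.

```cpp
#include <future>

// Runnable sketch of the slide's task_fib. Each "create task" becomes an
// async launch, and each "wait for children" becomes a future::get().
long task_fib(int n) {
    if (n <= 1) return 1;                                       // base case, as on the slide
    auto t1 = std::async(std::launch::async, task_fib, n - 2);  // "create task"
    auto t2 = std::async(std::launch::async, task_fib, n - 1);
    return t1.get() + t2.get();                                 // "wait for children"
}
```

With the slide's convention fib(0) = fib(1) = 1, task_fib(8) returns 34.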

Page 7

Basic execution model

• Fork as many threads as there are CPU cores
  – Each thread has a queue of tasks
  – Each task is executed by one thread

[Figure: the fib task graph distributed over Thread 1 and Thread 2, each with its own task queue]

Page 8

Agenda

• Introduction
• Basic Scheduling Methods
• Challenges and solutions
• Consideration
• Summary

Page 9


Basic scheduling strategies: Breadth-First and Work-First

Breadth-First
• At task creation:
  – Enqueue the new task
  – Execute a child only when the parent task suspends

Work-First
• At task creation:
  – The parent task always suspends and the child runs immediately
  – The parent continues when the child task finishes

[Figure: thread states under each strategy; Breadth-First keeps the parent running with children ready in the queue, while Work-First runs the child and leaves the parent waiting]

Page 10

Work stealing

• A load-balancing technique for work-first schedulers
  – Idle threads steal runnable tasks from other threads
• Basic strategy: FIFO
  – Steal the oldest task in the victim’s task queue
  – The victim thread should be chosen at random
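The strategy above, stealing the oldest task from a randomly chosen victim, can be sketched as follows. Worker, push_task, and steal are illustrative names, and a mutex per queue is a simplification of the lock-free deques real runtimes use.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

// Each worker thread owns one task queue; tasks are plain ints for brevity.
struct Worker {
    std::deque<int> queue;  // front = oldest task, back = newest
    std::mutex lock;        // the queue may be touched by several threads
};

// The owner enqueues newly created tasks at the back (newest end).
void push_task(Worker& w, int task) {
    std::lock_guard<std::mutex> g(w.lock);
    w.queue.push_back(task);
}

// An idle thread picks a random victim, then takes the OLDEST task (FIFO
// end): old tasks tend to spawn the most work in the future.
std::optional<int> steal(std::vector<Worker>& workers, std::mt19937& rng,
                         std::size_t self) {
    std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 1);
    std::size_t start = pick(rng);                      // random starting victim
    for (std::size_t i = 0; i < workers.size(); ++i) {  // scan so we don't spin forever
        std::size_t v = (start + i) % workers.size();
        if (v == self) continue;                        // never steal from yourself
        std::lock_guard<std::mutex> g(workers[v].lock);
        if (!workers[v].queue.empty()) {
            int t = workers[v].queue.front();           // oldest task
            workers[v].queue.pop_front();
            return t;
        }
    }
    return std::nullopt;                                // every victim looked empty
}
```

If worker 0 holds tasks 1, 2, 3 (1 being the oldest), a steal by worker 1 returns task 1 first.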

[Figure: an idle thread sends a steal request and takes the oldest ready task from the victim thread's queue]

Page 11

Effect of Work Stealing

• An old task tends to create many tasks in the future
  – Especially under recursive parallelism

[Figure: the task graph of the previous page, partitioned into Thread 1's tasks and Thread 2's tasks]

Page 12

Lazy Task Creation

• Save the continuation of the parent task instead of creating a child task
  – A continuation is lighter than a task
• On a work-stealing request, create a task from the continuation and steal that

[Figure: on a steal request, the victim's saved continuation of fib(n) (a continuation, not a task) is turned into a task and stolen]

Page 13

Cut-off

• Execute the child sequentially instead of creating a task
  – Avoids too fine-grained tasks
• Basic cut-off strategies:
  – Number of tasks
  – Recursion depth
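A minimal sketch of the recursion-depth strategy, again with std::async standing in for a task runtime; CUTOFF_DEPTH = 3 is an arbitrary illustrative threshold, not a recommended value.

```cpp
#include <future>

// Below CUTOFF_DEPTH we keep creating tasks; past it we fall back to plain
// serial recursion, so the tasks never get too fine-grained.
const int CUTOFF_DEPTH = 3;

long fib_serial(int n) {                 // ordinary sequential version
    return n <= 1 ? 1 : fib_serial(n - 1) + fib_serial(n - 2);
}

long fib_task(int n, int depth) {
    if (n <= 1) return 1;
    if (depth >= CUTOFF_DEPTH)
        return fib_serial(n);            // cut-off: execute this subtree serially
    auto t1 = std::async(std::launch::async, fib_task, n - 2, depth + 1);
    auto t2 = std::async(std::launch::async, fib_task, n - 1, depth + 1);
    return t1.get() + t2.get();
}
```

The cut-off changes only how the work is executed, not the result: fib_task(15, 0) equals fib_serial(15).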

[Figure: the lower levels of the fib task graph are executed serially instead of being spawned as tasks]

Page 14

Agenda

• Introduction
• Basic Scheduling Methods
• Challenges and solutions
• Consideration
• Summary

Page 15

Challenges

• Architecture-aware scheduling

• Scalable implementation

• Determination of cut-off threshold

Page 16

Architecture-aware scheduling

• The basic methods do not take the architecture into account
• On some architectures this degrades performance
• Example: NUMA architectures

[Figure: four CPU cores connected to two memories through an interconnect]

Page 17


NUMA Architecture

• NUMA = Non-Uniform Memory Access
• Memory access cost depends on the CPU core and the address
• Considering locality is very important!

[Figure: on NUMA, local memory access is fast; remote memory access across the interconnect is slow]

Page 18

A bad case on NUMA

• When a thread steals a task from a remote CPU, remote memory accesses increase

[Figure: a stolen task's data stays in the original memory, so the stealing core performs remote instead of local memory accesses]

Page 19

Affinity Bubble-Scheduler

• “Scheduling Dynamic OpenMP Applications over Multicore Architectures” (Broquedis et al.)
• A locality-aware thread scheduler
• Based on BubbleSched:
  – A framework for implementing schedulers on hierarchical architectures
  – Threads are grouped into bubbles
  – The scheduler uses bubbles as hints

Page 20

What is a bubble?

• A group of tasks and (nested) bubbles
  – Describes the affinities of tasks
• Created by calling a library function
• Grouped tasks use shared data

[Figure: tasks grouped into nested bubbles]

Page 21

Initial task distribution

• Explode bubbles hierarchically

[Figure: the root bubble is exploded, divided to balance load, and an inner bubble is exploded again to distribute its tasks to 2 CPU cores]

Page 22


NUMA-aware Work Stealing

• Idle threads steal tasks from as local a thread as possible

[Figure: an idle core steals from the most local task queue first]

Page 23

Challenges

• Architecture-aware scheduling
  – Affinity Bubble-Scheduler

• Scalable implementation

• Determination of cut-off threshold

Page 24

Scalable implementation

• To operate on a task queue, a thread has to acquire a lock
  – Task queues may be accessed by multiple threads
• Task queue operations occur on every task creation and completion
• Locks can become a serious bottleneck!

[Figure: both the owner thread and a stealing thread must lock the entire queue]

Page 25


A simple way to decrease locks

• Two task queues per thread
  – One local queue and one public queue
• Tasks are stolen only from the public queue
• The local queue is lock-free
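A sketch of the double-queue idea; names and int tasks are illustrative. Only the public queue takes a lock, and publishing a task requires moving it between two containers, which is exactly the copy cost pointed out on the next slide.

```cpp
#include <deque>
#include <mutex>
#include <optional>

struct DoubleQueue {
    std::deque<int> local_q;   // touched only by the owner thread: lock-free
    std::deque<int> public_q;  // shared with stealing threads
    std::mutex public_lock;

    void push_local(int task) { local_q.push_back(task); }  // no lock needed

    // The owner periodically publishes its oldest local task so that idle
    // threads have something to steal. The task is moved between two
    // containers here: the copy the split-queue design later avoids.
    void publish() {
        if (local_q.empty()) return;
        std::lock_guard<std::mutex> g(public_lock);
        public_q.push_back(local_q.front());
        local_q.pop_front();
    }

    // Thief: lock only the public queue and take its oldest task.
    std::optional<int> steal() {
        std::lock_guard<std::mutex> g(public_lock);
        if (public_q.empty()) return std::nullopt;
        int t = public_q.front();
        public_q.pop_front();
        return t;
    }
};
```

After publishing once from a local queue holding tasks 1 and 2, a thief's steal() returns task 1 while task 2 stays in the lock-free local queue.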

[Figure: thieves lock only the public queue; the owner's local queue needs no lock]

Page 26

Problem of double task queue

• When a task is moved between the two queues, a memory copy is required

[Figure: moving a task from the local queue to the public queue requires copying it]

Page 27

Split Task Queues

• “Scalable Work Stealing” (Dinan et al.)
• Split one task queue with a “split pointer”
  – From the head to the split pointer: local portion
  – From the split pointer to the tail: public portion

[Figure: one queue divided by the split pointer; the owner's local portion is lock-free, and thieves touch only the public portion]

Page 28

Split Task Queues

• Move the split pointer toward the head if the public portion gets empty
  – This operation is lock-free
• Move the split pointer toward the tail if the local portion gets empty
• No task copy is required
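A minimal single-owner sketch of the split-pointer idea, not Dinan et al.'s actual implementation. The orientation is flipped relative to the slide's labels (here the public portion sits at the head end, so thieves take the oldest tasks), and releasing half the local tasks is an illustrative policy. The key property is the same: release and reacquire move an index instead of copying tasks, and only operations touching the public portion take the lock.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <optional>
#include <vector>

struct SplitQueue {
    std::vector<int> buf;            // tasks are never copied out of this buffer
    std::size_t head = 0;            // next task a thief would take
    std::atomic<std::size_t> split{0};  // [head, split) public, [split, tail) local
    std::mutex public_lock;          // taken by thieves and by reacquire

    std::size_t tail() const { return buf.size(); }

    void push_local(int task) { buf.push_back(task); }   // owner only, no lock

    // Owner: if the public portion ran dry, release about half of the local
    // tasks just by advancing the split pointer: lock-free, no copy.
    void release() {
        std::size_t s = split.load();
        if (s == head && s < tail())
            split.store(s + (tail() - s + 1) / 2);
    }

    // Owner: pop the newest local task; if the local portion is empty,
    // reacquire the public portion by moving the split pointer back.
    // Reacquiring needs the lock because thieves scan the public portion.
    std::optional<int> pop_local() {
        if (split.load() == tail()) {                    // local portion empty
            std::lock_guard<std::mutex> g(public_lock);
            if (head == tail()) return std::nullopt;     // queue truly empty
            split.store(head);                           // take everything back
        }
        int t = buf.back();
        buf.pop_back();
        return t;
    }

    // Thief: take the oldest public task by advancing head under the lock.
    std::optional<int> steal() {
        std::lock_guard<std::mutex> g(public_lock);
        if (head >= split.load()) return std::nullopt;   // nothing public
        return buf[head++];
    }
};
```

In a short run: after release(), a thief's steal() returns the oldest task, while the owner's pop_local() keeps returning the newest local tasks and eventually reacquires the public leftovers without any task ever being copied between queues.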

[Figure: moving the split pointer republishes or reacquires tasks in place, without copying]

Page 29

And more…

• Also in “Scalable Work Stealing” (Dinan et al.):
• Efficient task creation
  – Initialize the task queue entry directly
• A better amount of work to steal
  – Half of the public queue

Page 30

Challenges

• Architecture-aware scheduling
  – Affinity Bubble-Scheduler
• Scalable implementation
  – Split Task Queues
• Determination of cut-off threshold

Page 31

Determination of cut-off threshold

• An appropriate cut-off threshold cannot be determined simply
  – It depends on the algorithm, the scheduling method, and the input data
• Too large: tasks become too coarse-grained
  – Leads to load imbalance
• Too small: tasks become too fine-grained
  – Large overhead

Page 32

Profile-based cut-off determination

• “An Adaptive Cut-off for Task Parallelism” (Duran et al.)
• Uses two profiling modes:
  – Full Mode
  – Minimal Mode
• Estimates task execution time and decides whether to apply the cut-off

Page 33

Full Mode

• Measures every task’s execution time
• Heavy overhead
• Complete information

[Figure: execution times are collected for every task; the per-depth table (depth 1: XXX ms, depth 2: YYY ms, depth 3: ZZZ ms) is fully populated]

Page 34

Minimal Mode

• Measures the execution time of “real” tasks only
• Small overhead
• Incomplete information
  – Cut-off (serialized) tasks are not measured

[Figure: only the tasks actually created are timed; serialized subtrees leave "???" entries in the per-depth table]

Page 35

Adaptive Profiling

• Collects execution times for each depth of recursion
• Uses Full Mode until enough information has been collected
• After that, switches to Minimal Mode

[Figure: early tasks (execution order 1-4) are profiled in Full Mode; later, deeper tasks may go unprofiled in Minimal Mode]

Page 36

[Table: collected profile; depth 1: XXX ms, depth 2: YYY ms, depth 3: ZZZ ms]

Cut-off strategy

• Estimates a task’s execution time from the collected information
  – The average of previous executions at that depth
• If the estimated execution time is smaller than the threshold, apply the cut-off
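The estimate-and-threshold rule above can be sketched as a small per-depth profile. The data structure and the 1 ms threshold in the usage example are illustrative choices, not from the paper.

```cpp
#include <unordered_map>

// Keep a running average of measured execution times per recursion depth
// (fed by Full/Minimal Mode profiling), estimate a task's cost from its
// depth, and serialize it when the estimate falls below a threshold.
struct DepthProfile {
    struct Entry { double total_ms = 0.0; long count = 0; };
    std::unordered_map<int, Entry> by_depth;

    void record(int depth, double ms) {          // one measured execution
        by_depth[depth].total_ms += ms;
        by_depth[depth].count += 1;
    }

    // Estimate = average of previous executions at this depth. With no
    // data yet we return a huge value: when in doubt, keep parallelism.
    double estimate_ms(int depth) const {
        auto it = by_depth.find(depth);
        if (it == by_depth.end()) return 1e30;
        return it->second.total_ms / it->second.count;
    }

    // Apply the cut-off (execute serially) only for provably cheap tasks.
    bool should_cutoff(int depth, double threshold_ms) const {
        return estimate_ms(depth) < threshold_ms;
    }
};
```

For example, with a 1 ms threshold: a depth averaging 80 ms keeps spawning tasks, a depth averaging 0.5 ms is cut off, and an unmeasured depth is not cut off.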

[Figure: using the per-depth averages to ask "how long will this task take?"; if the estimate is larger than the threshold, create a new task and execute it in parallel; if smaller, execute it serially]

Page 37

Agenda

• Introduction
• Basic Scheduling Methods
• Challenges and solutions
• Consideration
• Summary

Page 38

Consideration

• When adopting task-scheduling methods for threads, side effects must be considered
• The main difference between tasks and threads is fairness
• Fairness: runnable threads receive equal CPU time (weighted by priority)
  – No thread keeps the CPU forever

Page 39

Consideration of fairness

• Affinity Bubble-Scheduler
  – Originally designed for threads
• Split task queues
  – A data structure that reduces locking and improves scalability
  – The basic idea does not impede fairness
• Profile-based cut-off
  – The cut-off can be applied only to short-lived threads
  – This makes the cut-off easier to apply safely

Page 40

Summary

• Basic scheduling methods
• Challenges and solutions
  – Architecture-aware scheduling
    • Affinity Bubble-Scheduler
  – Scalable implementation
    • Split Task Queues
  – Determination of cut-off threshold
    • Profile-based cut-off
• Consideration
  – These solutions are not especially harmful to fairness

Page 41

Thanks for your attention!
