Upload: arnon

Post on 12-Feb-2016


Page 1: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks

Page 2: Intel ®  Threading Building Blocks

Software and Services Group

Agenda
• Overview
• Intel® Threading Building Blocks
  − Parallel Algorithms
  − Task Scheduler
  − Concurrent Containers
  − Sync Primitives
  − Memory Allocator
• Summary

Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries

Page 3: Intel ®  Threading Building Blocks

Multi-Core is Mainstream

• Gaining performance from multi-core requires parallel programming
• Multi-threading is used to:
  − Reduce or hide latency
  − Increase throughput

Page 4: Intel ®  Threading Building Blocks

Going Parallel

Moving from a typical serial C++ program to an ideal parallel C++ program raises issues in four areas:

• Algorithms: the ideal program uses parallel algorithms. These require many code changes when developed from scratch; it often takes a threading expert to get them right.
• Data structures: the ideal program uses thread-safe, scalable data structures. Serial data structures usually require global locks to make operations thread-safe.
• Dependencies: the ideal program has a minimum of dependencies and uses synchronization primitives efficiently. Too many dependencies lead to expensive synchronization and poor parallel performance.
• Memory management: the ideal program uses a scalable memory manager. The standard memory allocator is often inefficient in a multi-threaded app.

Page 5: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks

• Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
• Task Scheduler: the engine that empowers the parallel algorithms; it employs task stealing to maximize concurrency
• Concurrent Containers: common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
• Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Memory Allocation: a per-thread scalable memory manager and false-sharing-free allocators
• Threads: OS API wrappers
• Thread Local Storage: a scalable implementation of thread-local data that supports an unlimited number of TLS slots
• TBB Flow Graph: new!
• Miscellaneous: thread-safe timers

Page 6: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks: Extend C++ for parallelism

• Portable C++ runtime library that does thread management, letting developers focus on proven parallel patterns
• Scalable
• Composable
• Flexible
• Portable

Both GPL and commercial licenses are available.
http://threadingbuildingblocks.org

*Other names and brands may be claimed as the property of others

Page 7: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks: Parallel Algorithms

Page 8: Intel ®  Threading Building Blocks

Generic Parallel Algorithms

• Loop parallelization: parallel_for, parallel_reduce, parallel_scan
  > Load-balanced parallel execution of a fixed number of independent loop iterations
• Parallel algorithms for streams: parallel_do, parallel_for_each, pipeline / parallel_pipeline
  > Use for an unstructured stream or pile of work
• Parallel function invocation: parallel_invoke
  > Parallel execution of a number of user-specified functions
• Parallel sort: parallel_sort
  > Comparison sort with an average time complexity of O(N log N)

Page 9: Intel ®  Threading Building Blocks

Parallel Algorithm Usage Example

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

class ChangeArray {
    int* array;
public:
    ChangeArray(int* a) : array(a) {}
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); i++) {
            Foo(array[i]);
        }
    }
};

void ChangeArrayParallel(int* a, int n) {
    parallel_for(blocked_range<int>(0, n), ChangeArray(a));
}

int main() {
    int A[N]; // initialize array here…
    ChangeArrayParallel(A, N);
    return 0;
}

• The ChangeArray class defines a for-loop body for parallel_for
• blocked_range is a TBB template representing a 1D iteration space
• As usual with C++ function objects, the main work is done inside operator()
• The call is to the template function parallel_for<Range, Body> with Range = blocked_range and Body = ChangeArray

Page 10: Intel ®  Threading Building Blocks

parallel_for(Range(Data), Body(), Partitioner());

parallel_for recursively splits the range into halves until the pieces reach GrainSize:

[Data, Data+N)
[Data, Data+N/2) and [Data+N/2, Data+N)
…
[Data, Data+GrainSize)

While a thread works on one subrange, the remaining subranges are tasks available to thieves (idle worker threads).

Page 11: Intel ®  Threading Building Blocks

Two Execution Orders

• Depth first (stack): small space, excellent cache locality, no parallelism
• Breadth first (queue): large space, poor cache locality, maximum parallelism

Page 12: Intel ®  Threading Building Blocks

Work Depth First; Steal Breadth First

Each thread executes its own tasks depth first, so the data it touches stays hot in its L1/L2 caches. Thieves steal breadth first, from the shallow end of a victim thread's task pool. The shallowest task is the best choice for theft:
• it is a big piece of work
• its data is far from the victim's hot data
The next-shallowest task is the second-best choice.

Page 13: Intel ®  Threading Building Blocks

C++0x Lambda Expression Support

The parallel_for example transforms into:

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

void ChangeArrayParallel(int* a, int n) {
    parallel_for(0, n, 1, [=](int i) {
        Foo(a[i]);
    });
}

int main() {
    int A[N]; // initialize array here…
    ChangeArrayParallel(A, N);
    return 0;
}

• Capture variables by value from the surrounding scope to completely mimic the non-lambda implementation. Note that [&] could be used to capture variables by reference.
• Using a lambda expression implements MyBody::operator() right inside the call to parallel_for().
• parallel_for has an overload that takes start, stop, and step arguments and constructs the blocked_range internally.

Page 14: Intel ®  Threading Building Blocks

Functional parallelism has never been easier

int main(int argc, char* argv[]) {
    spin_mutex m;
    int a = 1, b = 2;

    parallel_invoke(
        foo,
        // void function calling bar(int, int, mutex),
        // implemented using a lambda expression
        [a, b, &m]() { bar(a, b, m); },
        // serial thread-safe job, wrapped in a lambda expression,
        // executed in parallel with the three other functions
        [&m]() {
            for (int i = 0; i < K; ++i) {
                spin_mutex::scoped_lock l(m);
                cout << i << endl;
            }
        },
        // parallel job, which is also executed in parallel
        // with the other functions
        [&m]() {
            parallel_for(0, N, 1, [&m](int i) {
                spin_mutex::scoped_lock l(m);
                cout << i << " ";
            });
        });

    return 0;
}

Now imagine writing all this code with just plain threads…

Page 15: Intel ®  Threading Building Blocks

Strongly-typed parallel_pipeline

float RootMeanSquare(float* first, float* last) {
    float sum = 0;
    parallel_pipeline(/*max_number_of_tokens=*/16,
        make_filter<void, float*>(
            filter::serial,
            [&](flow_control& fc) -> float* {
                if (first < last) {
                    return first++;
                } else {
                    fc.stop(); // stop processing
                    return NULL;
                }
            }) &
        make_filter<float*, float>(
            filter::parallel,
            [](float* p) { return (*p) * (*p); }) &
        make_filter<float, void>(
            filter::serial,
            [&sum](float x) { sum += x; }));
    // sum = (*first)² + (*(first+1))² + … + (*(last-1))², computed in parallel
    return sqrt(sum);
}

• Call the function tbb::parallel_pipeline to run the pipeline stages (filters)
• Create a pipeline stage object with tbb::make_filter<InputDataType, OutputDataType>(mode, body)
• The pipeline stage mode can be serial, parallel, serial_in_order, or serial_out_of_order
• The three stages here are: void → float* (get a new float), float* → float (square it), and float → void (sum += the square)

Page 16: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks: Task Scheduler

Page 17: Intel ®  Threading Building Blocks

Task Scheduler

• The task scheduler is the engine driving Intel® Threading Building Blocks
• It manages the thread pool, hiding the complexity of native thread management
• It maps logical tasks to threads
• The parallel algorithms are based on the task scheduler interface
• The task scheduler is designed to address common performance issues of parallel programming with native threads:

Problem          Intel® TBB Approach
Oversubscription One scheduler thread per hardware thread
Fair scheduling  Non-preemptive unfair scheduling
High overhead    Programmer specifies tasks, not threads
Load imbalance   Work stealing balances load

Page 18: Intel ®  Threading Building Blocks

A logical task is just a C++ class:
• Derive from the tbb::task class
• Implement the execute() member function
• Create and spawn a root task and your tasks
• Wait for the tasks to finish

#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"

using namespace tbb;

class ThisIsATask : public task {
public:
    task* execute() {
        WORK();
        return NULL;
    }
};

Page 19: Intel ®  Threading Building Blocks

Task Tree Example

Yellow arrows show the creation sequence; black arrows show task dependencies. Over time, the root task spawns child1 and child2 one depth level below it, the work is spread across Thread 1 and Thread 2, and the root waits for the children in wait_for_all().

Intel® TBB wait calls don't block the calling thread; they block only the task. An Intel TBB worker thread keeps stealing tasks while it waits.

Page 20: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks: Concurrent Containers

Page 21: Intel ®  Threading Building Blocks

Concurrent Containers

• Intel® TBB provides highly concurrent containers
  − STL containers are not concurrency-friendly: an attempt to modify them concurrently can corrupt the container
  − Wrapping a lock around an STL container turns it into a serial bottleneck and still does not always guarantee thread safety
    > STL containers are inherently not thread-safe
• Intel TBB provides fine-grained locking or lockless implementations
  − Worse single-thread performance, but better scalability
  − Can be used with the library, OpenMP*, or native threads

*Other names and brands may be claimed as the property of others

Page 22: Intel ®  Threading Building Blocks

Concurrent Containers Key Features

− concurrent_hash_map<Key,T,Hasher,Allocator>
  > Models a hash table of std::pair<const Key, T> elements
− concurrent_unordered_map<Key,T,Hasher,Equality,Allocator>
  > Permits concurrent traversal and insertion (no concurrent erasure)
  > Requires no visible locking; looks similar to the STL interfaces
− concurrent_vector<T,Allocator>
  > Dynamically growable array of T: grow_by and grow_to_at_least
− concurrent_queue<T,Allocator>
  > For a single-threaded run, concurrent_queue supports regular "first-in, first-out" ordering: if one thread pushes two values and another thread pops those two values, they come out in the order they were pushed
− concurrent_bounded_queue<T,Allocator>
  > Similar to concurrent_queue, with the difference that it allows specifying a capacity. Once the capacity is reached, push waits until other elements are popped before it can continue.
− concurrent_priority_queue<T,Compare,Allocator>
  > Similar to std::priority_queue, with scalable pop and push operations

Page 23: Intel ®  Threading Building Blocks

Hash-map Examples

Concurrent Ops   TBB cumap   TBB chmap   STL map
Traversal        Yes         No          No
Insertion        Yes         Yes         No
Erasure          No          Yes         No
Search           Yes         Yes         No

// STL map with a global lock:
#include <map>
typedef std::map<std::string, int> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    tbb::spin_mutex::scoped_lock lock(global_lock);
    table[*p] += 1;
}

// concurrent_hash_map (chmap):
#include "tbb/concurrent_hash_map.h"
typedef concurrent_hash_map<std::string, int> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    StringTable::accessor a; // local lock
    table.insert(a, *p);
    a->second += 1;
}

// concurrent_unordered_map (cumap):
#include "tbb/concurrent_unordered_map.h"
typedef concurrent_unordered_map<std::string, atomic<int>> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    table[*p] += 1; // similar to STL, but the value is tbb::atomic<int>
}

Page 24: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks: Sync Primitives

Page 25: Intel ®  Threading Building Blocks

Synchronization Primitives Features

• Atomic operations
  − High-level abstractions
• Exception-safe locks
  − spin_mutex is VERY FAST in lightly contended situations; use it if you need to protect very few instructions
  − Use queuing_rw_mutex when scalability and fairness are important
  − Use recursive_mutex when your threading model requires that one thread can re-acquire a lock. All locks must be released by one thread before another can acquire the lock.
  − Use a reader-writer mutex to allow non-blocking reads for multiple threads
• Portable condition variables

Page 26: Intel ®  Threading Building Blocks

Example: spin_rw_mutex

• If an exception occurs within the protected code block, the destructor automatically releases the lock if it is acquired, avoiding a deadlock
• Any reader lock may be upgraded to a writer lock; upgrade_to_writer indicates whether the lock had to be released before it was upgraded

#include "tbb/spin_rw_mutex.h"
using namespace tbb;

spin_rw_mutex MyMutex;

int foo() {
    // Construction of 'lock' acquires 'MyMutex'
    spin_rw_mutex::scoped_lock lock(MyMutex, /*is_writer*/ false);
    …
    if (!lock.upgrade_to_writer()) {
        …
    } else {
        …
    }
    return 0; // Destructor of 'lock' releases 'MyMutex'
}

Page 27: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks: Scalable Memory Allocator

Page 28: Intel ®  Threading Building Blocks

Scalable Memory Allocation

• Problem
  − Memory allocation is a bottleneck in a concurrent environment: threads acquire a global lock to allocate and deallocate memory from the global heap
• Solution
  − Intel® Threading Building Blocks provides a tested, tuned, and scalable memory allocator optimized for all object sizes:
    > Manual and automatic replacement of memory management calls
    > A C++ interface for use as an underlying allocator for C++ objects (e.g. STL containers)
    > Scalable memory pools

Page 29: Intel ®  Threading Building Blocks

Memory API Calls Replacement

• Manual
  − Change your code to call Intel® TBB scalable_malloc/scalable_free instead of malloc and free
  − Use the scalable_* API to implement operators new and delete
  − Use tbb::scalable_allocator<T> as an underlying allocator for C++ objects (e.g. STL containers)
• Automatic (Windows* and Linux*)
  − Requires no code changes; just re-link your binaries using the proxy libraries
    Linux*: libtbbmalloc_proxy.so.2 or libtbbmalloc_proxy_debug.so.2
    Windows*: tbbmalloc_proxy.dll or tbbmalloc_debug_proxy.dll

Page 30: Intel ®  Threading Building Blocks

C++ Allocator Template

• Use tbb::scalable_allocator<T> as an underlying allocator for C++ objects
• Example:

// STL container used with the Intel® TBB scalable allocator
std::vector<int, tbb::scalable_allocator<int> > v;

Page 31: Intel ®  Threading Building Blocks

Scalable Memory Pools

Allocate memory from a pool:

#include "tbb/memory_pool.h"
...
tbb::memory_pool<std::allocator<char> > my_pool;
void* my_ptr = my_pool.malloc(10);
void* my_ptr_2 = my_pool.malloc(20);
…
my_pool.recycle(); // the destructor also frees everything

Allocate and free from a fixed-size buffer:

#include "tbb/memory_pool.h"
...
char buf[1024*1024];
tbb::fixed_pool my_pool(buf, 1024*1024);
void* my_ptr = my_pool.malloc(10);
my_pool.free(my_ptr);

Page 32: Intel ®  Threading Building Blocks

Scalable Memory Allocator Structure

• scalable_malloc interface
• pool_malloc layer
  − small object support, incl. per-core caches
  − large object support, incl. cache
• backend: free space acquisition through the system malloc, mmap/VirtualAlloc, or pool callbacks

Page 33: Intel ®  Threading Building Blocks

Intel® TBB Memory Allocator Internals

• Small blocks
  − Per-thread memory pools
• Large blocks
  − Treat memory as "objects" of fixed size, not as ranges of address space. Typically several dozen (or fewer) object sizes are in active use.
  − Keep released memory objects in a pool and reuse them when an object of that size is requested
  − Pooled objects "age" over time; the cleanup threshold varies for different object sizes
  − Low fragmentation is achieved using segregated free lists

The Intel TBB scalable memory allocator is designed for multi-threaded apps and optimized for multi-core.

Page 34: Intel ®  Threading Building Blocks

Intel® Threading Building Blocks

• Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
• Task Scheduler: the engine that empowers the parallel algorithms; it employs task stealing to maximize concurrency
• Concurrent Containers: common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
• Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Memory Allocation: a per-thread scalable memory manager and false-sharing-free allocators
• Threads: OS API wrappers
• Thread Local Storage: a scalable implementation of thread-local data that supports an unlimited number of TLS slots
• TBB Flow Graph
• Miscellaneous: thread-safe timers

Page 35: Intel ®  Threading Building Blocks

Supplementary Links

• Commercial Product Web Page
  www.intel.com/software/products/tbb
• Open Source Web Portal
  www.threadingbuildingblocks.org
• Knowledge Base, Blogs and User Forums
  http://software.intel.com/en-us/articles/intel-threading-building-blocks/all/1
  http://software.intel.com/en-us/blogs/category/osstbb/
  http://software.intel.com/en-us/forums/intel-threading-building-blocks
• Technical Articles
  − "Demystify Scalable Parallelism with Intel Threading Building Blocks' Generic Parallel Algorithms"
    http://www.devx.com/cplus/Article/32935
  − "Enable Safe, Scalable Parallelism with Intel Threading Building Blocks' Concurrent Containers"
    http://www.devx.com/cplus/Article/33334
• Industry Articles
  − Product Review: Intel Threading Building Blocks
    http://www.devx.com/go-parallel/Article/33270
  − "The Concurrency Revolution", Herb Sutter, Dr. Dobb's, 1/19/2005
    http://www.ddj.com/dept/cpp/184401916
