Intel® Threading Building Blocks
Software and Services Group
Agenda
• Overview
• Intel® Threading Building Blocks
− Parallel Algorithms
− Task Scheduler
− Concurrent Containers
− Sync Primitives
− Memory Allocator
• Summary
Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries
Multi-Core is Mainstream
• Gaining performance from multi-core requires parallel programming
• Multi-threading is used to:
− Reduce or hide latency
− Increase throughput
Going Parallel
From a typical serial C++ program to an ideal parallel C++ program, and the issues in between:
• Algorithms → Parallel algorithms. When developed from scratch they require many code changes; it often takes a threading expert to get them right.
• Data structures → Thread-safe and scalable data structures. Serial data structures usually require global locks to make operations thread-safe.
• Dependencies → A minimum of dependencies and efficient use of synchronization primitives. Too many dependencies mean expensive synchronization and poor parallel performance.
• Memory management → A scalable memory manager. The standard memory allocator is often inefficient in a multi-threaded app.
Intel® Threading Building Blocks
• Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
• Task Scheduler: the engine that empowers the parallel algorithms; employs task stealing to maximize concurrency
• TBB Flow Graph (new!)
• Concurrent Containers: common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
• Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Memory Allocation: per-thread scalable memory manager and false-sharing-free allocators
• Threads: OS API wrappers
• Thread Local Storage: scalable implementation of thread-local data that supports an unlimited amount of thread-local storage
• Miscellaneous: thread-safe timers
Intel® Threading Building Blocks: Extend C++ for Parallelism
• Portable C++ runtime library that handles thread management, letting developers focus on proven parallel patterns
• Scalable
• Composable
• Flexible
• Portable
Both GPL and commercial licenses are available: http://threadingbuildingblocks.org
*Other names and brands may be claimed as the property of others
Intel® Threading Building Blocks: Parallel Algorithms
Generic Parallel Algorithms
• Loop parallelization: parallel_for, parallel_reduce, parallel_scan
− Load-balanced parallel execution of a fixed number of independent loop iterations
• Parallel algorithms for streams: parallel_do, parallel_for_each, pipeline / parallel_pipeline
− Use for an unstructured stream or pile of work
• Parallel function invocation: parallel_invoke
− Parallel execution of a number of user-specified functions
• Parallel sort: parallel_sort
− Comparison sort with an average time complexity of O(N log N)
Parallel Algorithm Usage Example

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

class ChangeArray {
    int* array;
public:
    ChangeArray(int* a) : array(a) {}
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); i++) {
            Foo(array[i]);
        }
    }
};

void ChangeArrayParallel(int* a, int n) {
    parallel_for(blocked_range<int>(0, n), ChangeArray(a));
}

int main() {
    int A[N]; // initialize array here…
    ChangeArrayParallel(A, N);
    return 0;
}
• The ChangeArray class defines a for-loop body for parallel_for
• blocked_range is a TBB template representing a 1D iteration space
• As usual with C++ function objects, the main work is done inside operator()
• The call to the template function parallel_for<Range, Body> uses Range = blocked_range and Body = ChangeArray
[Figure: parallel_for(Range(Data), Body(), Partitioner()) recursively splits the range [Data, Data+N) into [Data, Data+N/2) and [Data+N/2, Data+N), and so on through [Data, Data+N/k) down to grain-sized pieces [Data, Data+GrainSize); the pending subranges are tasks available to thieves]
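The recursive splitting in the figure can be mimicked in plain C++ without TBB. In the sketch below, `Range`, `recurse`, and `sum_squares` are hypothetical, simplified stand-ins for `tbb::blocked_range` and the recursion inside `parallel_for`; it runs serially here, whereas in real TBB each split-off half becomes a task that an idle thread can steal.

```cpp
#include <cstddef>
#include <functional>

// Simplified stand-in for tbb::blocked_range<int> (illustration only).
struct Range {
    int first, last, grain;
    Range(int f, int l, int g) : first(f), last(l), grain(g) {}
    bool is_divisible() const { return last - first > grain; }
    // Split off the right half, keep the left (mimics TBB's splitting constructor).
    Range split() {
        int mid = first + (last - first) / 2;
        Range right(mid, last, grain);
        last = mid;
        return right;
    }
};

// Serial sketch of parallel_for's recursion: each divisible range is halved;
// in TBB the split-off half becomes stealable work instead of a recursive call.
inline void recurse(Range r, const std::function<void(const Range&)>& body) {
    if (r.is_divisible()) {
        Range right = r.split();
        recurse(r, body);      // left half: executed "locally"
        recurse(right, body);  // right half: would be a stealable task
    } else {
        body(r);               // leaf: run the user body on a grain-sized piece
    }
}

// Example use: sum of squares over [0, n) with grain size 4.
inline long sum_squares(int n) {
    long total = 0;
    recurse(Range(0, n, 4), [&](const Range& leaf) {
        for (int i = leaf.first; i < leaf.last; ++i) total += (long)i * i;
    });
    return total;
}
```

The grain size plays the same role as in `blocked_range`: it bounds how small a leaf piece gets, trading scheduling overhead against available parallelism.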
Two Execution Orders
• Depth first (stack): small space, excellent cache locality, no parallelism
• Breadth first (queue): large space, poor cache locality, maximum parallelism
Work Depth First; Steal Breadth First
[Figure: the victim thread works depth-first near its hot L1/L2 cache data; the best choice for theft is the oldest task, because it is a big piece of work and its data is far from the victim's hot data; the next-oldest task is the second-best choice]
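The policy on this slide can be sketched with a `std::deque`: the owner pushes and pops at the back (depth-first, cache-hot, small pieces), while a thief takes from the front (breadth-first, the biggest and coldest piece). `TaskDeque` and the integer task ids below are illustrative only; TBB's real deques are concurrent and lock-free, which this single-threaded sketch omits.

```cpp
#include <deque>

// Sketch of the per-thread task deque policy (not TBB's actual implementation).
struct TaskDeque {
    std::deque<int> d;              // task ids; oldest task sits at the front
    void push(int task) { d.push_back(task); }
    int pop_local() {               // owner: take the newest (depth-first) task
        int t = d.back();
        d.pop_back();
        return t;
    }
    int steal() {                   // thief: take the oldest (breadth-first) task
        int t = d.front();
        d.pop_front();
        return t;
    }
    bool empty() const { return d.empty(); }
};
```

With tasks pushed in order 1, 2, 3, a thief's `steal()` returns 1 (the big, cold piece) while the owner's `pop_local()` returns 3 (the small, hot piece).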
C++0x Lambda Expression Support
The parallel_for example transforms into:
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

void ChangeArrayParallel(int* a, int n) {
    parallel_for(0, n, 1, [=](int i) {
        Foo(a[i]);
    });
}

int main() {
    int A[N]; // initialize array here…
    ChangeArrayParallel(A, N);
    return 0;
}
• Variables are captured by value from the surrounding scope to mimic the non-lambda implementation exactly; note that [&] could be used to capture variables by reference
• With a lambda expression, the equivalent of MyBody::operator() is implemented right inside the call to parallel_for()
• parallel_for has an overload that takes start, stop, and step arguments and constructs the blocked_range internally
Functional parallelism has never been easier

int main(int argc, char* argv[]) {
    spin_mutex m;
    int a = 1, b = 2;

    parallel_invoke(
        foo,
        [a, b, &m]() { bar(a, b, m); },
        [&m]() {
            for (int i = 0; i < K; ++i) {
                spin_mutex::scoped_lock l(m);
                cout << i << endl;
            }
        },
        [&m]() {
            parallel_for(0, N, 1, [&m](int i) {
                spin_mutex::scoped_lock l(m);
                cout << i << " ";
            });
        });
    return 0;
}
• foo is a plain void function handle; the call to bar(int, int, mutex) is wrapped in a lambda expression
• A serial thread-safe job, wrapped in a lambda expression, executes in parallel with the three other functions
• A parallel job, which is also executed in parallel with the other functions
• Now imagine writing all this code with just plain threads
Strongly-typed parallel_pipeline

float RootMeanSquare(float* first, float* last) {
    float sum = 0;
    parallel_pipeline(
        /*max_number_of_tokens=*/16,
        make_filter<void, float*>(
            filter::serial,
            [&](flow_control& fc) -> float* {
                if (first < last) {
                    return first++;
                } else {
                    fc.stop(); // stop processing
                    return NULL;
                }
            })
        & make_filter<float*, float>(
            filter::parallel,
            [](float* p) { return (*p) * (*p); })
        & make_filter<float, void>(
            filter::serial,
            [&sum](float x) { sum += x; }));
    // sum = (*first)^2 + (*(first+1))^2 + … + (*(last-1))^2, computed in parallel
    return sqrt(sum);
}
• Call tbb::parallel_pipeline to run the pipeline stages (filters)
• Create each pipeline stage with tbb::make_filter<InputDataType, OutputDataType>(mode, body)
• A pipeline stage mode can be serial, parallel, serial_in_order, or serial_out_of_order
[Figure: the three stages as boxes: "get new float*" (input void, output float*), "square the float" (input float*, output float), and "sum += float²" (input float, output void)]
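For reference, a serial plain-C++ equivalent of what the three stages compute. The name `root_mean_square_serial` is hypothetical; note that, like the slide's code, it returns the square root of the sum of squares (not of their mean).

```cpp
#include <cmath>

// Serial equivalent of the three pipeline stages above (illustration only):
// stage 1 produces each pointer, stage 2 squares the value (the stage TBB
// runs in parallel), stage 3 accumulates serially.
inline float root_mean_square_serial(const float* first, const float* last) {
    float sum = 0;
    for (const float* p = first; p < last; ++p) { // stage 1: next item
        float sq = (*p) * (*p);                   // stage 2: square
        sum += sq;                                // stage 3: serial accumulation
    }
    return std::sqrt(sum);
}
```

The `max_number_of_tokens` argument in the TBB version bounds how many items are in flight at once; the serial sketch corresponds to a token limit of one.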
Intel® Threading Building Blocks: Task Scheduler
Task Scheduler
• The task scheduler is the engine driving Intel® Threading Building Blocks
• It manages the thread pool, hiding the complexity of native thread management
• It maps logical tasks to threads
• The parallel algorithms are built on the task scheduler interface
• The task scheduler is designed to address common performance issues of parallel programming with native threads
Problem            Intel® TBB Approach
Oversubscription   One scheduler thread per hardware thread
Fair scheduling    Non-preemptive unfair scheduling
High overhead      Programmer specifies tasks, not threads
Load imbalance     Work stealing balances the load
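The "one worker per hardware thread, tasks not threads" idea can be sketched with `std::thread`. `run_tasks` below is a hypothetical helper, not TBB's scheduler: it sizes a fixed pool to the hardware and lets workers pull tasks from a shared counter, so the OS is never oversubscribed no matter how many tasks the programmer creates (real TBB adds work stealing on top of this).

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sketch: run N tasks on a pool of one worker per hardware thread.
inline void run_tasks(const std::vector<std::function<void()>>& tasks) {
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 2;          // fallback when the count is unknown
    std::atomic<std::size_t> next{0};       // index of the next unclaimed task
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            // Each worker claims tasks until none remain (simple load balancing).
            for (std::size_t i; (i = next.fetch_add(1)) < tasks.size(); )
                tasks[i]();
        });
    for (auto& t : pool) t.join();
}
```

Creating 100 tasks here spawns only `hardware_concurrency()` threads; creating 100 raw threads instead would oversubscribe the machine and pay preemption costs.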
A logical task is just a C++ class
• Derive from the tbb::task class
• Implement the execute() member function
• Create and spawn the root task and your tasks
• Wait for the tasks to finish

#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"

using namespace tbb;

class ThisIsATask : public task {
public:
    task* execute() {
        WORK();
        return NULL;
    }
};
Task Tree Example
[Figure: a task tree unfolding over time and depth level: a root task with children child1 and child2 runs across Thread 1 and Thread 2; yellow arrows show the creation sequence, black arrows show task dependencies, and waitforall() waits for the subtree]
Intel® TBB wait calls don't block the calling thread; they only block the task. An Intel TBB worker thread keeps stealing tasks while it waits.
Intel® Threading Building Blocks: Concurrent Containers
Concurrent Containers
• Intel® TBB provides highly concurrent containers
− STL containers are not concurrency-friendly: attempts to modify them concurrently can corrupt the container
− Wrapping a lock around an STL container turns it into a serial bottleneck, and still does not always guarantee thread safety: STL containers are inherently not thread-safe
• Intel TBB provides fine-grained locking or lockless implementations
− Worse single-thread performance, but better scalability
− Can be used with the library, OpenMP*, or native threads
*Other names and brands may be claimed as the property of others
Concurrent Containers: Key Features
− concurrent_hash_map<Key, T, Hasher, Allocator>
> Models a hash table of std::pair<const Key, T> elements
− concurrent_unordered_map<Key, T, Hasher, Equality, Allocator>
> Permits concurrent traversal and insertion (no concurrent erasure)
> Requires no visible locking; looks similar to the STL interfaces
− concurrent_vector<T, Allocator>
> Dynamically growable array of T: grow_by and grow_to_at_least
− concurrent_queue<T, Allocator>
> For a single-threaded run, concurrent_queue supports regular first-in-first-out ordering
> If one thread pushes two values and another thread pops those two values, they come out in the order they were pushed
− concurrent_bounded_queue<T, Allocator>
> Similar to concurrent_queue, but allows specifying a capacity; once the capacity is reached, push waits until elements are popped before it can continue
− concurrent_priority_queue<T, Compare, Allocator>
> Similar to std::priority_queue, with scalable push and pop operations
Hash-map Examples

Concurrent ops   TBB cumap   TBB chmap   STL map
Traversal        Yes         No          No
Insertion        Yes         Yes         No
Erasure          No          Yes         No
Search           Yes         Yes         No
#include <map>
typedef std::map<std::string, int> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    tbb::spin_mutex::scoped_lock lock(global_lock);
    table[*p] += 1;
}

#include "tbb/concurrent_hash_map.h"
typedef concurrent_hash_map<std::string, int> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    StringTable::accessor a; // local lock
    table.insert(a, *p);
    a->second += 1;
}

#include "tbb/concurrent_unordered_map.h"
typedef concurrent_unordered_map<std::string, atomic<int>> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    table[*p] += 1; // similar to STL, but the value is tbb::atomic<int>
}
Intel® Threading Building Blocks: Sync Primitives
Synchronization Primitives: Features
• Atomic operations
− High-level abstractions
• Exception-safe locks
− spin_mutex is very fast in lightly contended situations; use it if you need to protect very few instructions
− Use queuing_rw_mutex when scalability and fairness are important
− Use recursive_mutex when your threading model requires that one thread can re-acquire a lock it already holds; the lock must be fully released before another thread can acquire it
− Use a reader-writer mutex to allow non-blocking reads by multiple threads
• Portable condition variables
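tbb::atomic predates std::atomic and offers the same core idea; the sketch below uses std::atomic to show lock-free read-modify-write operations staying correct under contention (`parallel_count` is a hypothetical helper, not a TBB API).

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several threads increment one counter with an atomic read-modify-write;
// no mutex is needed and no increments are lost.
inline int parallel_count(int threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed); // atomic RMW
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```

With a plain `int` counter the same code would lose updates; the atomic fetch_add is what makes `counter++` safe without a lock.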
Example: spin_rw_mutex
• If an exception occurs within the protected code block, the destructor automatically releases the lock if it is held, avoiding deadlock
• Any reader lock may be upgraded to a writer lock; upgrade_to_writer indicates whether the lock had to be released before it was upgraded
#include "tbb/spin_rw_mutex.h"
using namespace tbb;

spin_rw_mutex MyMutex;

int foo() {
    // Construction of 'lock' acquires 'MyMutex'
    spin_rw_mutex::scoped_lock lock(MyMutex, /*is_writer=*/false);
    …
    if (!lock.upgrade_to_writer()) {
        …
    } else {
        …
    }
    return 0; // Destructor of 'lock' releases 'MyMutex'
}
Intel® Threading Building Blocks: Scalable Memory Allocator
Scalable Memory Allocation
• Problem
− Memory allocation is a bottleneck in a concurrent environment: threads acquire a global lock to allocate and deallocate memory from the global heap
• Solution
− Intel® Threading Building Blocks provides a tested, tuned, and scalable memory allocator optimized for all object sizes:
> Manual and automatic replacement of memory management calls
> A C++ interface for use as the underlying allocator for C++ objects (e.g., STL containers)
> Scalable memory pools
Memory API Call Replacement
• Manual
− Change your code to call Intel® TBB scalable_malloc/scalable_free instead of malloc and free
− Use the scalable_* API to implement operators new and delete
− Use tbb::scalable_allocator<T> as the underlying allocator for C++ objects (e.g., STL containers)
• Automatic (Windows* and Linux*)
− Requires no code changes; just re-link your binaries against the proxy libraries
> Linux*: libtbbmalloc_proxy.so.2 or libtbbmalloc_proxy_debug.so.2
> Windows*: tbbmalloc_proxy.dll or tbbmalloc_debug_proxy.dll
C++ Allocator Template
• Use tbb::scalable_allocator<T> as the underlying allocator for C++ objects
• Example:

// STL container used with the Intel® TBB scalable allocator
std::vector<int, tbb::scalable_allocator<int>> v;
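Any class meeting the standard Allocator requirements can be plugged in the same way. As a sketch of that plumbing, the hypothetical `CountingAllocator` below stands in for tbb::scalable_allocator; a real TBB build would forward `allocate` to scalable_malloc instead of std::malloc.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal stand-in for tbb::scalable_allocator<T> (illustration only).
template <class T>
struct CountingAllocator {
    using value_type = T;
    static inline std::size_t allocations = 0;   // count calls, for demonstration
    CountingAllocator() = default;
    template <class U> CountingAllocator(const CountingAllocator<U>&) {}
    T* allocate(std::size_t n) {
        ++allocations;                           // scalable_malloc would go here
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); } // scalable_free here
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// The container uses the allocator transparently, exactly as with scalable_allocator.
inline std::size_t fill_vector(int n) {
    CountingAllocator<int>::allocations = 0;
    std::vector<int, CountingAllocator<int>> v;
    for (int i = 0; i < n; ++i) v.push_back(i);
    return CountingAllocator<int>::allocations;  // growth triggered allocations
}
```

The container code is unchanged either way; only the allocator template argument decides which heap the elements live in.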
Scalable Memory Pools

#include "tbb/memory_pool.h"
…
// Allocate memory from the pool
tbb::memory_pool<std::allocator<char>> my_pool;
void* my_ptr = my_pool.malloc(10);
void* my_ptr_2 = my_pool.malloc(20);
…
my_pool.recycle(); // the destructor also frees everything

#include "tbb/memory_pool.h"
…
// Allocate and free from a fixed-size buffer
char buf[1024*1024];
tbb::fixed_pool my_pool(buf, 1024*1024);
void* my_ptr = my_pool.malloc(10);
my_pool.free(my_ptr);
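A toy bump-allocator version of the fixed-buffer idea can illustrate what such a pool does. `FixedPool` below is hypothetical (not tbb::fixed_pool): it hands out chunks of a caller-supplied buffer and recycles them all at once, with no per-allocation bookkeeping.

```cpp
#include <cstddef>

// Toy fixed-buffer pool (illustration only; tbb::fixed_pool is more capable).
class FixedPool {
    char*       buf;
    std::size_t cap, used;
public:
    FixedPool(char* buffer, std::size_t size) : buf(buffer), cap(size), used(0) {}
    void* malloc(std::size_t n) {
        if (used + n > cap) return nullptr;  // fixed buffer: may run out
        void* p = buf + used;                // bump the high-water mark
        used += n;
        return p;
    }
    void recycle() { used = 0; }             // free everything at once
    std::size_t bytes_used() const { return used; }
};
```

Pools like this make allocation a pointer bump and deallocation free, at the cost of only being able to release everything together.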
Scalable Memory Allocator Structure
[Diagram: the scalable_malloc interface and the pool_malloc layer sit on top of small-object support (including per-core caches) and large-object support (including a cache); beneath them a backend acquires free space from the system malloc, mmap/VirtualAlloc, or pool callbacks]
Intel® TBB Memory Allocator Internals
• Small blocks
− Per-thread memory pools
• Large blocks
− Treat memory as "objects" of fixed size, not as ranges of address space; typically several dozen (or fewer) object sizes are in active use
− Keep released memory objects in a pool and reuse them when an object of that size is requested
− Pooled objects "age" over time; the cleanup threshold varies for different object sizes
− Low fragmentation is achieved using segregated free lists

The Intel TBB scalable memory allocator is designed for multi-threaded apps and optimized for multi-core.
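The pool-and-reuse scheme for large blocks can be sketched with a per-size free list. `SegregatedPool` below is illustrative only; TBB's real allocator is per-thread and lock-free, and adds the aging and cleanup thresholds this sketch omits.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Sketch of segregated free lists: freed blocks are cached per size and handed
// back on the next request of that size, so only a few dozen distinct sizes
// ever touch the underlying heap.
class SegregatedPool {
    std::map<std::size_t, std::vector<void*>> free_lists; // size -> cached blocks
public:
    std::size_t heap_allocations = 0;                     // for demonstration
    void* allocate(std::size_t size) {
        auto& list = free_lists[size];
        if (!list.empty()) {                              // reuse a pooled block
            void* p = list.back();
            list.pop_back();
            return p;
        }
        ++heap_allocations;                               // only on a cold miss
        return std::malloc(size);
    }
    void release(void* p, std::size_t size) {
        free_lists[size].push_back(p);                    // pool it for reuse
    }
};
```

Because blocks of one size are never split or merged, fragmentation stays low and the common allocate/release path never touches the global heap.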
Summary: Intel® Threading Building Blocks
• Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
• Task Scheduler: the engine that empowers the parallel algorithms; employs task stealing to maximize concurrency
• TBB Flow Graph
• Concurrent Containers: common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
• Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Memory Allocation: per-thread scalable memory manager and false-sharing-free allocators
• Threads: OS API wrappers
• Thread Local Storage: scalable implementation of thread-local data that supports an unlimited amount of thread-local storage
• Miscellaneous: thread-safe timers
Supplementary Links
• Commercial product web page: www.intel.com/software/products/tbb
• Open-source web portal: www.threadingbuildingblocks.org
• Knowledge base, blogs, and user forums:
http://software.intel.com/en-us/articles/intel-threading-building-blocks/all/1
http://software.intel.com/en-us/blogs/category/osstbb/
http://software.intel.com/en-us/forums/intel-threading-building-blocks
• Technical articles:
− "Demystify Scalable Parallelism with Intel Threading Building Blocks' Generic Parallel Algorithms": http://www.devx.com/cplus/Article/32935
− "Enable Safe, Scalable Parallelism with Intel Threading Building Blocks' Concurrent Containers": http://www.devx.com/cplus/Article/33334
• Industry articles:
− Product review: Intel Threading Building Blocks: http://www.devx.com/go-parallel/Article/33270
− "The Concurrency Revolution", Herb Sutter, Dr. Dobb's, 1/19/2005: http://www.ddj.com/dept/cpp/184401916