Intel® Threading Building Blocks
Software and Services Group
Agenda
• Overview
• Intel® Threading Building Blocks
− Parallel Algorithms
− Task Scheduler
− Concurrent Containers
− Sync Primitives
− Memory Allocator
• Summary
Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries
Multi-Core is Mainstream
• Gaining performance from multi-core requires parallel programming
• Multi-threading is used to:
− Reduce or hide latency
− Increase throughput
Going Parallel
From a typical serial C++ program to an ideal parallel C++ program, and the issues in between:
• Algorithms → Parallel algorithms. When developed from scratch they require many code changes; it often takes a threading expert to get them right.
• Data structures → Thread-safe and scalable data structures. Serial data structures usually require global locks to make operations thread-safe.
• Dependencies → A minimum of dependencies and efficient use of synchronization primitives. Too many dependencies mean expensive synchronization and poor parallel performance.
• Memory management → A scalable memory manager. The standard memory allocator is often inefficient in a multi-threaded app.
Intel® Threading Building Blocks
• Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
• Task Scheduler: the engine that empowers the parallel algorithms; employs task stealing to maximize concurrency
• TBB Flow Graph (new!)
• Concurrent Containers: common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
• Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Memory Allocation: per-thread scalable memory manager and false-sharing-free allocators
• Threads: OS API wrappers
• Thread Local Storage: scalable implementation of thread-local data that supports an unlimited amount of thread-local storage
• Miscellaneous: thread-safe timers
Intel® Threading Building Blocks: Extend C++ for Parallelism
• Portable C++ runtime library that handles thread management, letting developers focus on proven parallel patterns
• Scalable
• Composable
• Flexible
• Portable
Both GPL and commercial licenses are available: http://threadingbuildingblocks.org
*Other names and brands may be claimed as the property of others
Intel® Threading Building Blocks: Parallel Algorithms
Generic Parallel Algorithms
• Loop parallelization: parallel_for, parallel_reduce, parallel_scan
− Load-balanced parallel execution of a fixed number of independent loop iterations
• Parallel algorithms for streams: parallel_do, parallel_for_each, pipeline / parallel_pipeline
− Use for an unstructured stream or pile of work
• Parallel function invocation: parallel_invoke
− Parallel execution of a number of user-specified functions
• Parallel sort: parallel_sort
− Comparison sort with an average time complexity of O(N log N)
Parallel Algorithm Usage Example

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

class ChangeArray {
    int* array;
public:
    ChangeArray(int* a) : array(a) {}
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); i++) {
            Foo(array[i]);
        }
    }
};

void ChangeArrayParallel(int* a, int n) {
    parallel_for(blocked_range<int>(0, n), ChangeArray(a));
}

int main() {
    int A[N]; // initialize array here…
    ChangeArrayParallel(A, N);
    return 0;
}
• The ChangeArray class defines a for-loop body for parallel_for
• blocked_range is a TBB template representing a 1D iteration space
• As usual with C++ function objects, the main work is done inside operator()
• The call to the template function parallel_for<Range, Body> uses Range = blocked_range and Body = ChangeArray
[Figure: parallel_for(Range(Data), Body(), Partitioner()) recursively splits the range [Data, Data+N) into [Data, Data+N/2) and [Data+N/2, Data+N), and so on through [Data, Data+N/k) down to grain-sized pieces [Data, Data+GrainSize); the pending subranges are tasks available to thieves]
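The recursive splitting in the figure can be mimicked in plain C++ without TBB. In the sketch below, `Range`, `recurse`, and `sum_squares` are hypothetical, simplified stand-ins for `tbb::blocked_range` and the recursion inside `parallel_for`; it runs serially here, whereas in real TBB each split-off half becomes a task that an idle thread can steal.

```cpp
#include <cstddef>
#include <functional>

// Simplified stand-in for tbb::blocked_range<int> (illustration only).
struct Range {
    int first, last, grain;
    Range(int f, int l, int g) : first(f), last(l), grain(g) {}
    bool is_divisible() const { return last - first > grain; }
    // Split off the right half, keep the left (mimics TBB's splitting constructor).
    Range split() {
        int mid = first + (last - first) / 2;
        Range right(mid, last, grain);
        last = mid;
        return right;
    }
};

// Serial sketch of parallel_for's recursion: each divisible range is halved;
// in TBB the split-off half becomes stealable work instead of a recursive call.
inline void recurse(Range r, const std::function<void(const Range&)>& body) {
    if (r.is_divisible()) {
        Range right = r.split();
        recurse(r, body);      // left half: executed "locally"
        recurse(right, body);  // right half: would be a stealable task
    } else {
        body(r);               // leaf: run the user body on a grain-sized piece
    }
}

// Example use: sum of squares over [0, n) with grain size 4.
inline long sum_squares(int n) {
    long total = 0;
    recurse(Range(0, n, 4), [&](const Range& leaf) {
        for (int i = leaf.first; i < leaf.last; ++i) total += (long)i * i;
    });
    return total;
}
```

The grain size plays the same role as in `blocked_range`: it bounds how small a leaf piece gets, trading scheduling overhead against available parallelism.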
Two Execution Orders
• Depth first (stack): small space, excellent cache locality, no parallelism
• Breadth first (queue): large space, poor cache locality, maximum parallelism
Work Depth First; Steal Breadth First
[Figure: the victim thread works depth-first near its hot L1/L2 cache data; the best choice for theft is the oldest task, because it is a big piece of work and its data is far from the victim's hot data; the next-oldest task is the second-best choice]
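The policy on this slide can be sketched with a `std::deque`: the owner pushes and pops at the back (depth-first, cache-hot, small pieces), while a thief takes from the front (breadth-first, the biggest and coldest piece). `TaskDeque` and the integer task ids below are illustrative only; TBB's real deques are concurrent and lock-free, which this single-threaded sketch omits.

```cpp
#include <deque>

// Sketch of the per-thread task deque policy (not TBB's actual implementation).
struct TaskDeque {
    std::deque<int> d;              // task ids; oldest task sits at the front
    void push(int task) { d.push_back(task); }
    int pop_local() {               // owner: take the newest (depth-first) task
        int t = d.back();
        d.pop_back();
        return t;
    }
    int steal() {                   // thief: take the oldest (breadth-first) task
        int t = d.front();
        d.pop_front();
        return t;
    }
    bool empty() const { return d.empty(); }
};
```

With tasks pushed in order 1, 2, 3, a thief's `steal()` returns 1 (the big, cold piece) while the owner's `pop_local()` returns 3 (the small, hot piece).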
C++0x Lambda Expression Support
The parallel_for example transforms into:
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

void ChangeArrayParallel(int* a, int n) {
    parallel_for(0, n, 1, [=](int i) {
        Foo(a[i]);
    });
}

int main() {
    int A[N]; // initialize array here…
    ChangeArrayParallel(A, N);
    return 0;
}
• Variables are captured by value from the surrounding scope to mimic the non-lambda implementation exactly; note that [&] could be used to capture variables by reference
• With a lambda expression, the equivalent of MyBody::operator() is implemented right inside the call to parallel_for()
• parallel_for has an overload that takes start, stop, and step arguments and constructs the blocked_range internally
Functional parallelism has never been easier

int main(int argc, char* argv[]) {
    spin_mutex m;
    int a = 1, b = 2;

    parallel_invoke(
        foo,
        [a, b, &m]() { bar(a, b, m); },
        [&m]() {
            for (int i = 0; i < K; ++i) {
                spin_mutex::scoped_lock l(m);
                cout << i << endl;
            }
        },
        [&m]() {
            parallel_for(0, N, 1, [&m](int i) {
                spin_mutex::scoped_lock l(m);
                cout << i << " ";
            });
        });
    return 0;
}
• foo is a plain void function handle; the call to bar(int, int, mutex) is wrapped in a lambda expression
• A serial thread-safe job, wrapped in a lambda expression, executes in parallel with the three other functions
• A parallel job, which is also executed in parallel with the other functions
• Now imagine writing all this code with just plain threads
Strongly-typed parallel_pipeline

float RootMeanSquare(float* first, float* last) {
    float sum = 0;
    parallel_pipeline(
        /*max_number_of_tokens=*/16,
        make_filter<void, float*>(
            filter::serial,
            [&](flow_control& fc) -> float* {
                if (first < last) {
                    return first++;
                } else {
                    fc.stop(); // stop processing
                    return NULL;
                }
            })
        & make_filter<float*, float>(
            filter::parallel,
            [](float* p) { return (*p) * (*p); })
        & make_filter<float, void>(
            filter::serial,
            [&sum](float x) { sum += x; }));
    // sum = (*first)^2 + (*(first+1))^2 + … + (*(last-1))^2, computed in parallel
    return sqrt(sum);
}
• Call tbb::parallel_pipeline to run the pipeline stages (filters)
• Create each pipeline stage with tbb::make_filter<InputDataType, OutputDataType>(mode, body)
• A pipeline stage mode can be serial, parallel, serial_in_order, or serial_out_of_order
[Figure: the three stages as boxes: "get new float*" (input void, output float*), "square the float" (input float*, output float), and "sum += float²" (input float, output void)]
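For reference, a serial plain-C++ equivalent of what the three stages compute. The name `root_mean_square_serial` is hypothetical; note that, like the slide's code, it returns the square root of the sum of squares (not of their mean).

```cpp
#include <cmath>

// Serial equivalent of the three pipeline stages above (illustration only):
// stage 1 produces each pointer, stage 2 squares the value (the stage TBB
// runs in parallel), stage 3 accumulates serially.
inline float root_mean_square_serial(const float* first, const float* last) {
    float sum = 0;
    for (const float* p = first; p < last; ++p) { // stage 1: next item
        float sq = (*p) * (*p);                   // stage 2: square
        sum += sq;                                // stage 3: serial accumulation
    }
    return std::sqrt(sum);
}
```

The `max_number_of_tokens` argument in the TBB version bounds how many items are in flight at once; the serial sketch corresponds to a token limit of one.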
Intel® Threading Building Blocks: Task Scheduler
Task Scheduler
• The task scheduler is the engine driving Intel® Threading Building Blocks
• It manages the thread pool, hiding the complexity of native thread management
• It maps logical tasks to threads
• The parallel algorithms are built on the task scheduler interface
• The task scheduler is designed to address common performance issues of parallel programming with native threads
Problem            Intel® TBB Approach
Oversubscription   One scheduler thread per hardware thread
Fair scheduling    Non-preemptive unfair scheduling
High overhead      Programmer specifies tasks, not threads
Load imbalance     Work stealing balances the load
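The "one worker per hardware thread, tasks not threads" idea can be sketched with `std::thread`. `run_tasks` below is a hypothetical helper, not TBB's scheduler: it sizes a fixed pool to the hardware and lets workers pull tasks from a shared counter, so the OS is never oversubscribed no matter how many tasks the programmer creates (real TBB adds work stealing on top of this).

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sketch: run N tasks on a pool of one worker per hardware thread.
inline void run_tasks(const std::vector<std::function<void()>>& tasks) {
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 2;          // fallback when the count is unknown
    std::atomic<std::size_t> next{0};       // index of the next unclaimed task
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            // Each worker claims tasks until none remain (simple load balancing).
            for (std::size_t i; (i = next.fetch_add(1)) < tasks.size(); )
                tasks[i]();
        });
    for (auto& t : pool) t.join();
}
```

Creating 100 tasks here spawns only `hardware_concurrency()` threads; creating 100 raw threads instead would oversubscribe the machine and pay preemption costs.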
A logical task is just a C++ class
• Derive from the tbb::task class
• Implement the execute() member function
• Create and spawn the root task and your tasks
• Wait for the tasks to finish

#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"

using namespace tbb;

class ThisIsATask : public task {
public:
    task* execute() {
        WORK();
        return NULL;
    }
};
Task Tree Example
[Figure: a task tree unfolding over time and depth level: a root task with children child1 and child2 runs across Thread 1 and Thread 2; yellow arrows show the creation sequence, black arrows show task dependencies, and waitforall() waits for the subtree]
Intel® TBB wait calls don't block the calling thread; they only block the task. An Intel TBB worker thread keeps stealing tasks while it waits.
Intel® Threading Building Blocks: Concurrent Containers
Concurrent Containers
• Intel® TBB provides highly concurrent containers
− STL containers are not concurrency-friendly: attempts to modify them concurrently can corrupt the container
− Wrapping a lock around an STL container turns it into a serial bottleneck, and still does not always guarantee thread safety: STL containers are inherently not thread-safe
• Intel TBB provides fine-grained locking or lockless implementations
− Worse single-thread performance, but better scalability
− Can be used with the library, OpenMP*, or native threads
*Other names and brands may be claimed as the property of others
Concurrent Containers: Key Features
− concurrent_hash_map<Key, T, Hasher, Allocator>
> Models a hash table of std::pair<const Key, T> elements
− concurrent_unordered_map<Key, T, Hasher, Equality, Allocator>
> Permits concurrent traversal and insertion (no concurrent erasure)
> Requires no visible locking; looks similar to the STL interfaces
− concurrent_vector<T, Allocator>
> Dynamically growable array of T: grow_by and grow_to_at_least
− concurrent_queue<T, Allocator>
> For a single-threaded run, concurrent_queue supports regular first-in-first-out ordering
> If one thread pushes two values and another thread pops those two values, they come out in the order they were pushed
− concurrent_bounded_queue<T, Allocator>
> Similar to concurrent_queue, but allows specifying a capacity; once the capacity is reached, push waits until elements are popped before it can continue
− concurrent_priority_queue<T, Compare, Allocator>
> Similar to std::priority_queue, with scalable push and pop operations
Hash-map Examples

Concurrent ops   TBB cumap   TBB chmap   STL map
Traversal        Yes         No          No
Insertion        Yes         Yes         No
Erasure          No          Yes         No
Search           Yes         Yes         No
#include <map>
typedef std::map<std::string, int> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    tbb::spin_mutex::scoped_lock lock(global_lock);
    table[*p] += 1;
}

#include "tbb/concurrent_hash_map.h"
typedef concurrent_hash_map<std::string, int> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    StringTable::accessor a; // local lock
    table.insert(a, *p);
    a->second += 1;
}

#include "tbb/concurrent_unordered_map.h"
typedef concurrent_unordered_map<std::string, atomic<int>> StringTable;

for (std::string* p = range.begin(); p != range.end(); ++p) {
    table[*p] += 1; // similar to STL, but the value is tbb::atomic<int>
}
Intel® Threading Building Blocks: Sync Primitives
Synchronization Primitives: Features
• Atomic operations
− High-level abstractions
• Exception-safe locks
− spin_mutex is very fast in lightly contended situations; use it if you need to protect very few instructions
− Use queuing_rw_mutex when scalability and fairness are important
− Use recursive_mutex when your threading model requires that one thread can re-acquire a lock it already holds; the lock must be fully released before another thread can acquire it
− Use a reader-writer mutex to allow non-blocking reads by multiple threads
• Portable condition variables
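tbb::atomic predates std::atomic and offers the same core idea; the sketch below uses std::atomic to show lock-free read-modify-write operations staying correct under contention (`parallel_count` is a hypothetical helper, not a TBB API).

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several threads increment one counter with an atomic read-modify-write;
// no mutex is needed and no increments are lost.
inline int parallel_count(int threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed); // atomic RMW
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```

With a plain `int` counter the same code would lose updates; the atomic fetch_add is what makes `counter++` safe without a lock.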
Example: spin_rw_mutex
• If an exception occurs within the protected code block, the destructor automatically releases the lock if it is held, avoiding deadlock
• Any reader lock may be upgraded to a writer lock; upgrade_to_writer indicates whether the lock had to be released before it was upgraded
#include "tbb/spin_rw_mutex.h"
using namespace tbb;

spin_rw_mutex MyMutex;

int foo() {
    // Construction of 'lock' acquires 'MyMutex'
    spin_rw_mutex::scoped_lock lock(MyMutex, /*is_writer=*/false);
    …
    if (!lock.upgrade_to_writer()) {
        …
    } else {
        …
    }
    return 0; // Destructor of 'lock' releases 'MyMutex'
}
Intel® Threading Building Blocks: Scalable Memory Allocator
Scalable Memory Allocation
• Problem
− Memory allocation is a bottleneck in a concurrent environment: threads acquire a global lock to allocate and deallocate memory from the global heap
• Solution
− Intel® Threading Building Blocks provides a tested, tuned, and scalable memory allocator optimized for all object sizes:
> Manual and automatic replacement of memory management calls
> A C++ interface for use as the underlying allocator for C++ objects (e.g., STL containers)
> Scalable memory pools
Memory API Call Replacement
• Manual
− Change your code to call Intel® TBB scalable_malloc/scalable_free instead of malloc and free
− Use the scalable_* API to implement operators new and delete
− Use tbb::scalable_allocator<T> as the underlying allocator for C++ objects (e.g., STL containers)
• Automatic (Windows* and Linux*)
− Requires no code changes; just re-link your binaries against the proxy libraries
> Linux*: libtbbmalloc_proxy.so.2 or libtbbmalloc_proxy_debug.so.2
> Windows*: tbbmalloc_proxy.dll or tbbmalloc_debug_proxy.dll
C++ Allocator Template
• Use tbb::scalable_allocator<T> as the underlying allocator for C++ objects
• Example:

// STL container used with the Intel® TBB scalable allocator
std::vector<int, tbb::scalable_allocator<int>> v;
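Any class meeting the standard Allocator requirements can be plugged in the same way. As a sketch of that plumbing, the hypothetical `CountingAllocator` below stands in for tbb::scalable_allocator; a real TBB build would forward `allocate` to scalable_malloc instead of std::malloc.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal stand-in for tbb::scalable_allocator<T> (illustration only).
template <class T>
struct CountingAllocator {
    using value_type = T;
    static inline std::size_t allocations = 0;   // count calls, for demonstration
    CountingAllocator() = default;
    template <class U> CountingAllocator(const CountingAllocator<U>&) {}
    T* allocate(std::size_t n) {
        ++allocations;                           // scalable_malloc would go here
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); } // scalable_free here
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// The container uses the allocator transparently, exactly as with scalable_allocator.
inline std::size_t fill_vector(int n) {
    CountingAllocator<int>::allocations = 0;
    std::vector<int, CountingAllocator<int>> v;
    for (int i = 0; i < n; ++i) v.push_back(i);
    return CountingAllocator<int>::allocations;  // growth triggered allocations
}
```

The container code is unchanged either way; only the allocator template argument decides which heap the elements live in.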
Scalable Memory Pools

#include "tbb/memory_pool.h"
…
// Allocate memory from the pool
tbb::memory_pool<std::allocator<char>> my_pool;
void* my_ptr = my_pool.malloc(10);
void* my_ptr_2 = my_pool.malloc(20);
…
my_pool.recycle(); // the destructor also frees everything

#include "tbb/memory_pool.h"
…
// Allocate and free from a fixed-size buffer
char buf[1024*1024];
tbb::fixed_pool my_pool(buf, 1024*1024);
void* my_ptr = my_pool.malloc(10);
my_pool.free(my_ptr);
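A toy bump-allocator version of the fixed-buffer idea can illustrate what such a pool does. `FixedPool` below is hypothetical (not tbb::fixed_pool): it hands out chunks of a caller-supplied buffer and recycles them all at once, with no per-allocation bookkeeping.

```cpp
#include <cstddef>

// Toy fixed-buffer pool (illustration only; tbb::fixed_pool is more capable).
class FixedPool {
    char*       buf;
    std::size_t cap, used;
public:
    FixedPool(char* buffer, std::size_t size) : buf(buffer), cap(size), used(0) {}
    void* malloc(std::size_t n) {
        if (used + n > cap) return nullptr;  // fixed buffer: may run out
        void* p = buf + used;                // bump the high-water mark
        used += n;
        return p;
    }
    void recycle() { used = 0; }             // free everything at once
    std::size_t bytes_used() const { return used; }
};
```

Pools like this make allocation a pointer bump and deallocation free, at the cost of only being able to release everything together.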
Scalable Memory Allocator Structure
[Diagram: the scalable_malloc interface and the pool_malloc layer sit on top of small-object support (including per-core caches) and large-object support (including a cache); beneath them a backend acquires free space from the system malloc, mmap/VirtualAlloc, or pool callbacks]
Intel® TBB Memory Allocator Internals
• Small blocks
− Per-thread memory pools
• Large blocks
− Treat memory as "objects" of fixed size, not as ranges of address space; typically several dozen (or fewer) object sizes are in active use
− Keep released memory objects in a pool and reuse them when an object of that size is requested
− Pooled objects "age" over time; the cleanup threshold varies for different object sizes
− Low fragmentation is achieved using segregated free lists

The Intel TBB scalable memory allocator is designed for multi-threaded apps and optimized for multi-core.
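The pool-and-reuse scheme for large blocks can be sketched with a per-size free list. `SegregatedPool` below is illustrative only; TBB's real allocator is per-thread and lock-free, and adds the aging and cleanup thresholds this sketch omits.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Sketch of segregated free lists: freed blocks are cached per size and handed
// back on the next request of that size, so only a few dozen distinct sizes
// ever touch the underlying heap.
class SegregatedPool {
    std::map<std::size_t, std::vector<void*>> free_lists; // size -> cached blocks
public:
    std::size_t heap_allocations = 0;                     // for demonstration
    void* allocate(std::size_t size) {
        auto& list = free_lists[size];
        if (!list.empty()) {                              // reuse a pooled block
            void* p = list.back();
            list.pop_back();
            return p;
        }
        ++heap_allocations;                               // only on a cold miss
        return std::malloc(size);
    }
    void release(void* p, std::size_t size) {
        free_lists[size].push_back(p);                    // pool it for reuse
    }
};
```

Because blocks of one size are never split or merged, fragmentation stays low and the common allocate/release path never touches the global heap.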
Summary: Intel® Threading Building Blocks
• Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
• Task Scheduler: the engine that empowers the parallel algorithms; employs task stealing to maximize concurrency
• TBB Flow Graph
• Concurrent Containers: common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
• Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Memory Allocation: per-thread scalable memory manager and false-sharing-free allocators
• Threads: OS API wrappers
• Thread Local Storage: scalable implementation of thread-local data that supports an unlimited amount of thread-local storage
• Miscellaneous: thread-safe timers
Supplementary Links
• Commercial product web page: www.intel.com/software/products/tbb
• Open-source web portal: www.threadingbuildingblocks.org
• Knowledge base, blogs, and user forums:
http://software.intel.com/en-us/articles/intel-threading-building-blocks/all/1
http://software.intel.com/en-us/blogs/category/osstbb/
http://software.intel.com/en-us/forums/intel-threading-building-blocks
• Technical articles:
− "Demystify Scalable Parallelism with Intel Threading Building Blocks' Generic Parallel Algorithms": http://www.devx.com/cplus/Article/32935
− "Enable Safe, Scalable Parallelism with Intel Threading Building Blocks' Concurrent Containers": http://www.devx.com/cplus/Article/33334
• Industry articles:
− Product review: Intel Threading Building Blocks: http://www.devx.com/go-parallel/Article/33270
− "The Concurrency Revolution", Herb Sutter, Dr. Dobb's, 1/19/2005: http://www.ddj.com/dept/cpp/184401916