© 2013 The MathWorks, Inc.
Parallel Computing with MATLAB
Ho-hyun Sung (Deputy General Manager)
Senior Application Engineer, MathWorks Korea
Questions to Consider
Have you already optimized your serial code?
Serial MATLAB Code: Best Practices
Techniques for addressing performance
– Vectorization
– Preallocation
Consider readability and maintainability
– Looping vs. matrix operations
– Subscripted vs. linear vs. logical indexing
– etc.
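These two practices can be sketched in a few lines of MATLAB (the array size is illustrative):

```matlab
% Growing an array inside a loop forces repeated reallocation
x = [];
for k = 1:1e5
    x(k) = sin(k);          % slow: x is resized on every iteration
end

% Preallocate, then fill
x = zeros(1, 1e5);
for k = 1:1e5
    x(k) = sin(k);          % faster: memory is allocated once
end

% Vectorize: replace the loop with a single array operation
x = sin(1:1e5);
```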
Addressing Bottlenecks
Identify bottlenecks using the Profiler
Focus on top bottlenecks
– Total number of function calls
– Time per function call
Classes of bottlenecks
– File I/O
– Displaying output
– Computationally intensive code
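A minimal Profiler session looks like this (the profiled statements are just an example):

```matlab
profile on
A = rand(2000);
B = A * A';          % code under investigation
profile viewer       % report: call counts and time per function call
```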
Questions to Consider
Have you already optimized your serial code?
Do you need to reduce your run time?
Do you need to solve larger problems?
If so, do you have…
– A multi-core or multi-processor computer?
– A graphics processing unit (GPU)?
– Access to a computer cluster?
Agenda
Introduction to parallel computing tools
Using multicore/multi-processor computers
Using graphics processing units (GPUs)
Scaling up to a cluster
Utilizing Additional Processing Power
Built-in multithreading
– Automatically enabled in MATLAB since R2008a
– Multiple threads in a single MATLAB computation engine
Parallel computing using MATLAB workers
– Parallel Computing Toolbox, MATLAB Distributed Computing Server
– Multiple computation engines with inter-process communication
GPU use directly from MATLAB
– Parallel Computing Toolbox
– Perform MATLAB computations on GPUs
www.mathworks.com/discovery/multicore-matlab.html
Going Beyond Serial MATLAB Applications
[Diagram: MATLAB client (toolboxes and blocksets) connected to a pool of MATLAB workers]
Parallel Computing Toolbox for the Desktop
Speed up parallel applications
Take advantage of GPUs
Prototype code for your cluster
[Diagram: desktop computer running Parallel Computing Toolbox]
Scale Up to Clusters, Grids, and Clouds
[Diagram: desktop computer (Parallel Computing Toolbox) connected through a scheduler to a computer cluster running MATLAB Distributed Computing Server]
Agenda
Introduction to parallel computing tools
Using multicore/multi-processor computers
Using graphics processing units (GPUs)
Scaling up to a cluster
Programming Parallel Applications (CPU)
[Axis: Ease of Use ↔ Greater Control]
Programming Parallel Applications (CPU)
Built-in support with toolboxes
[Axis: Ease of Use ↔ Greater Control]
Example: Optimizing Cell Tower Position (Built-in parallel support)
With Parallel Computing Toolbox, use built-in parallel algorithms in Optimization Toolbox
Run optimization in parallel
Use pool of MATLAB workers
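A hedged sketch of enabling a solver's built-in parallel support (myObjective, x0, lb, and ub are hypothetical; matlabpool is the pool command of this era, and 'UseParallel' parallelizes the solver's gradient estimation):

```matlab
matlabpool open                                  % start a pool of workers
opts = optimset('fmincon');
opts = optimset(opts, 'UseParallel', 'always');  % enable parallel evaluation
x0 = [0 0]; lb = [-5 -5]; ub = [5 5];
[x, fval] = fmincon(@myObjective, x0, [], [], [], [], lb, ub, [], opts);
matlabpool close
```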
Tools Providing Parallel Computing Support
Optimization Toolbox, Global Optimization Toolbox
Statistics Toolbox
Signal Processing Toolbox
Neural Network Toolbox
Image Processing Toolbox
…
Directly leverage functions in Parallel Computing Toolbox
www.mathworks.com/products/parallel-computing/builtin-parallel-support.html
Programming Parallel Applications (CPU)
Built-in support with Toolboxes
Simple programming constructs:parfor, batch, distributed
Ease
of U
seG
reater Control
17
Independent Tasks or Iterations
Ideal problem for parallel computing
No dependencies or communications between tasks
Examples: parameter sweeps, Monte Carlo simulations
[Diagram: independent tasks executing concurrently over time]
blogs.mathworks.com/loren/2009/10/02/using-parfor-loops-getting-up-and-running/
Example: Parameter Sweep of ODEs (Parallel for-loops)
Parameter sweep of ODE system
– Damped spring oscillator
– Sweep through different values of damping and stiffness
– Record peak value for each simulation
Convert for to parfor
Use pool of MATLAB workers
m·x″ + b·x′ + k·x = 0,  with m = 5,  b = 1, 2, …,  k = 1, 2, …
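A minimal sketch of such a sweep (parameter ranges, time span, and initial conditions are illustrative; matlabpool is the pool command of this era):

```matlab
m = 5;
bVals = 1:0.2:5;                 % damping values to sweep
kVals = 1:0.2:5;                 % stiffness values to sweep
peakVals = zeros(numel(bVals), numel(kVals));

matlabpool open
parfor i = 1:numel(bVals)
    b = bVals(i);
    row = zeros(1, numel(kVals));
    for j = 1:numel(kVals)
        k = kVals(j);
        % m*x'' + b*x' + k*x = 0 as a first-order system
        odefun = @(t, y) [y(2); -(b*y(2) + k*y(1))/m];
        [~, y] = ode45(odefun, [0 25], [0; 1]);   % x(0)=0, x'(0)=1
        row(j) = max(y(:, 1));                    % peak displacement
    end
    peakVals(i, :) = row;        % sliced output variable
end
matlabpool close
```

Because every (b, k) pair is independent, changing `for` to `parfor` is the only structural change needed.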
Programming Parallel Applications (CPU)
Built-in support with toolboxes
Simple programming constructs: parfor, batch, distributed
Advanced programming constructs: createJob, labSend, spmd
[Axis: Ease of Use ↔ Greater Control]
Example: MPI-based Functions
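As a sketch of the MPI-style functions, the classic deadlock-free pattern passes data around a ring of workers (labSendReceive pairs a send and a receive in one call):

```matlab
spmd
    next = mod(labindex, numlabs) + 1;        % right neighbor in the ring
    prev = mod(labindex - 2, numlabs) + 1;    % left neighbor in the ring
    % Send my index to the right while receiving from the left
    received = labSendReceive(next, prev, labindex);
end
```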
Parallel Profiler
Profiles the execution time for a function
– Similar to the MATLAB profiler
– Includes information about the communication between labs
Time spent in communication
Amount of data passed between labs
Benefits
– Identify the bottlenecks in your parallel algorithm
– Understand which operations require communication
Parallel Computing Toolbox: Developments
R2012a
– New programming interface
– Distributed arrays: more MATLAB functions enabled
R2012b
– Diary output available during a running task
– Distributed arrays: more MATLAB functions enabled
R2013a
– Auto-detection and transfer of files in batch and interactive workflows
– Distributed arrays: more MATLAB functions enabled
www.mathworks.com/help/distcomp/using-matlab-functions-on-codistributed-arrays.html
Agenda
Introduction to parallel computing tools
Using multicore/multi-processor computers
Using graphics processing units (GPUs)
Scaling up to a cluster
What Is a Graphics Processing Unit (GPU)?
Originally for graphics acceleration, now also used for scientific calculations
Massively parallel array of integer and floating-point processors
– Typically hundreds of processors per card
– GPU cores complement CPU cores
Dedicated high-speed memory
* Parallel Computing Toolbox requires NVIDIA GPUs with Compute Capability 1.3 or greater, including NVIDIA Tesla 10-series and 20-series products. See www.nvidia.com/object/cuda_gpus.html for a complete listing.
Performance Gain with More Hardware
Using more cores (CPUs)
Using GPUs
[Diagram: quad-core CPU with shared cache alongside a GPU with dedicated device memory]
Programming Parallel Applications (GPU)
Built-in support with toolboxes
[Axis: Ease of Use ↔ Greater Control]
Programming Parallel Applications (GPU)
Built-in support with toolboxes
Simple programming constructs: gpuArray, gather
[Axis: Ease of Use ↔ Greater Control]
Example: Solving a 2D Wave Equation (GPU computing)
Solve the 2nd-order wave equation using spectral methods.
Run both on CPU and GPU
Using gpuArray and overloaded functions
www.mathworks.com/help/distcomp/using-gpuarray.html#bsloua3-1
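The key idea is that overloaded functions let the same code run on either device; a minimal sketch (the grid size and the FFT round trip are illustrative, not the full spectral solver):

```matlab
N = 2048;
x = rand(N);                 % ordinary CPU array
g = gpuArray(x);             % transfer the grid to GPU memory

% fft2/ifft2 are overloaded for gpuArray, so this runs on the GPU
y = real(ifft2(fft2(g)));

yCpu = gather(y);            % bring the result back to the CPU
```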
Benchmark: Solving the 2D Wave Equation (CPU vs. GPU)
Intel Xeon Processor X5650; NVIDIA Tesla C2050 GPU

Grid Size      CPU (s)    GPU (s)   Speedup
64 x 64        0.1004     0.3553    0.28
128 x 128      0.1931     0.3368    0.57
256 x 256      0.5888     0.4217    1.4
512 x 512      2.8163     0.8243    3.4
1024 x 1024    13.4797    2.4979    5.4
2048 x 2048    74.9904    9.9567    7.5
Programming Parallel Applications (GPU)
Built-in support with toolboxes
Simple programming constructs: gpuArray, gather
Advanced programming constructs: arrayfun, bsxfun, spmd
Interface for experts: CUDAKernel
[Axis: Ease of Use ↔ Greater Control]
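As a sketch of the advanced constructs, arrayfun on gpuArray inputs compiles an element-wise function into a single GPU kernel (the formula is illustrative):

```matlab
a = gpuArray(rand(1000));
b = gpuArray(rand(1000));

% The anonymous function is applied once per element, fused into one kernel
c = arrayfun(@(x, y) exp(-x) * sin(y) + x * y, a, b);

result = gather(c);          % transfer the result back to the CPU
```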
GPU Performance: Not All Cards Are Equal
Tesla-based cards will provide the best performance
Realistically, expect 4x to 10x speedup (Tesla) vs. CPU
See GPUBench on MATLAB Central for examples
Laptop GPU: GeForce
Desktop GPU: GeForce / Quadro
High-performance computing GPU: Tesla / Quadro
www.mathworks.com/matlabcentral/fileexchange/34080-gpubench
Criteria for Good Problems to Run on a GPU
Massively parallel:
– Calculations can be broken into hundreds or thousands of independent units of work
– Problem size takes advantage of many GPU cores
Computationally intensive:
– Computation time significantly exceeds CPU/GPU data transfer time
Algorithm consists of supported functions:
– Growing list of toolboxes with built-in support
  www.mathworks.com/products/parallel-computing/builtin-parallel-support.html
– Subset of core MATLAB for gpuArray, arrayfun, bsxfun
  www.mathworks.com/help/distcomp/using-gpuarray.html#bsloua3-1
  www.mathworks.com/help/distcomp/execute-matlab-code-elementwise-on-a-gpu.html#bsnx7h8-1
GPU Support: Developments
R2012a
– New supported functions and function enhancements
– Asynchronous GPU calculations
– Ability to reset and deselect a GPU device
R2012b
– New support for toolboxes: Neural Network, Signal Processing
– New supported functions and function enhancements
– Auto-detection and selection for multi-GPU systems
R2013a
– New support for toolboxes: Image Processing, Phased Array System
– New supported functions and function enhancements
– Ability to use GPU arrays from MEX functions
Agenda
Introduction to parallel computing tools
Using multicore/multi-processor computers
Using graphics processing units (GPUs)
Scaling up to a cluster
Use MATLAB Distributed Computing Server
1. Prototype code
[Diagram: your MATLAB code on a desktop computer running Parallel Computing Toolbox with a Local profile]
Use MATLAB Distributed Computing Server
1. Prototype code
2. Get access to an enabled cluster
[Diagram: computer cluster running MATLAB Distributed Computing Server behind a scheduler, with a Cluster profile]
Use MATLAB Distributed Computing Server
1. Prototype code
2. Get access to an enabled cluster
3. Switch cluster profile
[Diagram: desktop computer (Parallel Computing Toolbox, Local and Cluster profiles) submitting through a scheduler to a cluster running MATLAB Distributed Computing Server]
Example: Migrate from Desktop to Cluster (Scale to a computer cluster)
Desktop interface
– Set defaults
– Discover clusters
– Manage profiles
– Monitor jobs
Command-line API
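The command-line side of this workflow might look like the following sketch ('MyCluster' and 'mySweepScript' are hypothetical names):

```matlab
parallel.defaultClusterProfile('local');   % set the default profile
c = parcluster('MyCluster');               % cluster object from a saved profile
job = batch(c, 'mySweepScript');           % submit the same code to the cluster
```

Switching from the Local profile to a Cluster profile is the only change; the prototype code itself stays untouched.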
Take Advantage of Cluster Hardware
Offload computation:
– Free up your desktop
– Access better computers
Scale speed-up:
– Use more cores
– Go from hours to minutes
Scale memory:
– Utilize distributed arrays
– Solve larger problems without re-coding
Offload Computations with batch
[Diagram: batch sends work from the MATLAB client (toolboxes and blocksets) to a worker; the result is returned to the client]
Example: Parameter Sweep of ODEs (Scale scheduled processing)
Offload processing to workers
– batch
– matlabpool
Monitor progress of the scheduled job
– Job Monitor
Retrieve results from the job
– fetchOutputs
m·x″ + b·x′ + k·x = 0,  with m = 5,  b = 1, 2, …,  k = 1, 2, …
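A sketch of the offload workflow ('odeParamSweep' and the variable name peakVals are hypothetical; 'matlabpool' is the batch pool parameter of this era):

```matlab
% Submit a script plus a pool of 3 workers, freeing the desktop
job = batch('odeParamSweep', 'matlabpool', 3);

% ... monitor via the Job Monitor, or block until done:
wait(job);

load(job, 'peakVals');     % pull a variable computed by the script
delete(job);               % clean up the job's data

% Function-based alternative, where fetchOutputs applies:
% job = batch(@myFcn, 1, {args});  out = fetchOutputs(job);
```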
Benchmark: Parameter Sweep of ODEs (Scaling case study with a cluster)
Processor: Intel Xeon E5-2670, 16 cores per node
Workers    Job (min)    Main (min)    Job − Main
1          –            149           –
16         10.3         10.2          0.1
32         5.1          5.0           0.1
64         2.8          2.4           0.4
96         2.0          1.7           0.3
128        1.7          1.3           0.6
160        1.4          1.0           0.4
192        1.4          0.9           0.5
224        1.3          0.8           0.5
256        1.4          0.7           0.7
[Chart: relative speed-up vs. number of workers for 10^6 simulated combinations, comparing Job, Main (core computation), and a linear reference]
Distributing Large Data
Distributed array lives on the workers
Remotely manipulate the array from the client MATLAB
[Diagram: one large array's columns split across multiple workers, accessed from the MATLAB client (toolboxes and blocksets)]
Distributed Arrays and SPMD
Distributed arrays
– Hold data remotely on workers running on a cluster
– Manipulate directly from the client MATLAB (desktop)
– Use MATLAB functions directly on distributed arrays
www.mathworks.com/help/distcomp/using-matlab-functions-on-codistributed-arrays.html
spmd
– Execute blocks of code on workers
– Explicitly communicate between workers with message passing
– Mix parallel and serial code in the same program
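Both constructs can be sketched briefly (matrix sizes are illustrative; matlabpool is the pool command of this era):

```matlab
matlabpool open

% Distributed array: data lives on the workers, code stays serial-looking
A = distributed.rand(8000);     % 8000x8000 matrix spread across workers
b = distributed.rand(8000, 1);
x = A \ b;                      % overloaded backslash runs in parallel
xLocal = gather(x);             % collect the solution on the client

% spmd: explicit per-worker code with message passing available
spmd
    local = getLocalPart(codistributed.rand(1, numlabs * 100));
    partial = sum(local);       % each worker sums its own piece
    total = gplus(partial);     % global add across all workers
end

matlabpool close
```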
MATLAB Distributed Computing Server
Extension of Parallel Computing Toolbox
Complete pre-built solution
– Framework and infrastructure
– Communication between computers
Cost-effective
– License for the number of cores you will use
– Simplified maintenance
Dynamic Licensing Model
Users have access to their licensed products
The server does not check out any licenses on the client
Users can exit MATLAB once the job is queued
Job Schedulers
• Direct support for existing schedulers: MDCS is simply another application
• Open API to support other schedulers (generic integration scripts)
• MathWorks Job Scheduler: turn-key solution for MATLAB-only clusters
www.mathworks.com/products/distriben/supported
Best Practices for Scalability and Portability
Prototype for portability on the desktop
Use functions (avoid scripts)
Use MATLAB workflows to access more cores
– batch
– createJob, createTask
Avoid large data transfers through the desktop interface
MATLAB Distributed Computing Server: Developments
R2012a
– New Cluster Profile Manager
– See Parallel Computing Toolbox developments
R2012b
– Detection of available enabled clusters through the Profile Manager
– See Parallel Computing Toolbox developments
R2013a
– See Parallel Computing Toolbox developments
Summary: Parallel Computing Toolbox and MATLAB Distributed Computing Server
Speed up your MATLAB execution with more hardware
Easily learn parallel MATLAB without being a parallel programming expert
For more information, visit www.mathworks.com/parallel-computing
Example: Filtering an Image (Built-in parallel support)
With Parallel Computing Toolbox, use built-in parallel algorithms in Image Processing Toolbox
Run median filtering in parallel
Use pool of MATLAB workers
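One way to run median filtering across the pool is blockproc's 'UseParallel' option; a hedged sketch (the file name, block size, and filter window are illustrative):

```matlab
I = imread('noisyImage.png');                % hypothetical input image
matlabpool open                              % pool of workers

fun = @(block) medfilt2(block.data, [5 5]);  % 2-D median filter per block
J = blockproc(I, [512 512], fun, 'UseParallel', true);

matlabpool close
```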
From http://hirise.lpl.arizona.edu/ (NASA/JPL/University of Arizona)
[Images: noisy image and its median-filtered result]
2D Median Filter
Example: Parameter Sweep of ODEs (Parallel for-loops, Simulink)
Parameter sweep of ODE system
– Damped spring oscillator in Simulink
– Sweep through different values of damping and stiffness
– Record peak value for each simulation
Convert for to parfor
Use pool of MATLAB workers
m·x″ + b·x′ + k·x = 0,  with m = 5,  b = 1, 2, …,  k = 1, 2, …
Scaling Up to Run on Multiple GPUs
Running on Multiple GPUs
Single GPU:

    N = 1000;                        % number of iterations
    A = gpuArray(A);                 % transfer data to the GPU
    for ix = 1:N
        X = myGPUFunction(ix, A);    % do the GPU-based calculation
        Xtotal(ix,:) = gather(X);    % gather data back to the CPU
    end

Multiple GPUs:

    N = 1000;                        % number of iterations
    spmd
        gpuDevice(labindex);         % assign each worker a different GPU
        A = gpuArray(A);             % transfer data to that worker's GPU
    end
    parfor ix = 1:N
        X = myGPUFunction(ix, A);    % do the GPU-based calculation
        Xtotal(ix,:) = gather(X);    % gather data back to the CPU
    end