Autotuning at Illinois
María Jesús Garzarán
University of Illinois
Outline
1. Why Autotuning?
2. What is Autotuning?
3. Research Problems
Why autotuning?
• In the era of parallelism…
• Applications and software must maintain high efficiency as machines evolve.
  – Otherwise, there is no reason for new machines.
• Problem: high efficiency requires laborious tuning.
  – Costs increase.
  – Performance is low when not enough resources are devoted to tuning.
• Would like to automate tuning.
Compilers
• One way is compilers, but compilers have limitations:
  – Lack of semantic information → fewer choices
  – Must target all applications
  – Must be reasonably fast
Compiler vs. Manual Tuning: Discrete Fourier Transform
Compiler vs. Manual Tuning: Matrix-Matrix Multiplication
[Chart: MFLOPS vs. matrix size for Intel MKL, icc -O3 -xT, and icc -O3; the hand-tuned MKL is roughly 20x faster than the compiled code.]
Compiler vs. Manual Tuning: Matrix-Matrix Multiplication
  loop 1: c[i*N+j] += a[i*N+k]*b[k*N+j]
  loop 2: c[i][j] += a[i][k]*b[k][j]
  loop 3: C += a[i][k]*b[k][j]   (accumulate into a scalar)
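Wrapped in full loop nests, the variants above look like the following sketch; the flat-index form and the scalar-accumulator form give the compiler different optimization opportunities (c is assumed zero-initialized for the first variant):

```c
#include <stddef.h>

/* Naive matrix-matrix multiply over flat row-major arrays,
 * corresponding to "loop 1" above: c[i*N+j] += a[i*N+k]*b[k*N+j].
 * The caller must zero-initialize c. */
void mmm_flat(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* "Loop 3" above: accumulate into a scalar so the compiler can keep
 * the running sum in a register instead of reloading c[i][j]. */
void mmm_scalar(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
}
```

Both compute the same product; which one the native compiler turns into fast code varies by platform, which is exactly the gap autotuning targets.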
Compilers …
• Can and should improve
• But we will need other strategies (at least in the short term)
Outline
1. Why Autotuning?
2. What is Autotuning?
3. Research Problems
What is Autotuning?
• An emerging strategy: empirical search
  – Goal: automatically generate highly efficient code for each target machine (and input set).
  – Programmers develop metaprograms (programs that generate programs) that search the space of possible algorithms/implementations.
Autotuning with empirical search
[Diagram: a metaprogram (a description of the space of versions) drives a generator of versions; each high-level code version passes through a source-to-source optimizer and the native compiler; the object code is executed on training input data, the measured performance feeds back into the generator, and the best-performing version is selected.]
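The feedback loop in the diagram can be sketched as a driver that times each candidate version and keeps the fastest. The function-pointer interface and the toy versions below are illustrative, not part of any of the systems discussed; real tuners repeat runs and control for timing noise.

```c
#include <stddef.h>
#include <time.h>

typedef void (*version_fn)(const double *in, double *out, size_t n);

/* Minimal empirical-search driver: time every candidate version on a
 * training input and return the index of the fastest one. */
int pick_fastest(version_fn *versions, int nversions,
                 const double *in, double *out, size_t n)
{
    int best = -1;
    double best_t = 0.0;
    for (int v = 0; v < nversions; v++) {
        clock_t t0 = clock();
        versions[v](in, out, n);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best < 0 || t < best_t) { best = v; best_t = t; }
    }
    return best;
}

/* Two toy "versions": both copy in to out, but the second wastes work,
 * so the driver should prefer the first. */
void copy_fast(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = in[i];
}

void copy_slow(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        volatile double x = in[i];            /* force redundant work */
        for (int r = 0; r < 500; r++) x += 0.0;
        out[i] = x;
    }
}
```

In a real system the "versions" come from the generator in the diagram rather than being written by hand.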
Autotuning
• More laborious than conventional programming, but:
  – Longer lifetime → cost reduction
  – Can accumulate experience → better results
  – Can afford to search more extensively → better results
Examples of Existing Autotuning Systems
• ATLAS: Whaley, Petitet, Dongarra (Tennessee)
• BeBop: Demmel, Yelick, Im, Vuduc (Berkeley)
• Datamining: Jian, Garzarán, Snir (Illinois)
• FFTW: Frigo (MIT)
• Illinois Sorting: Li, Garzarán, Padua (Illinois)
• Matrix-matrix multiplication for GPU: Jiang, Snir (Illinois)
• PHiPAC: Bilmes, Asanovic, Vuduc, Iyer, Demmel, Chin, Lan (Berkeley)
• Space Pruning for GPU: Ryoo, Rodrigues, Stone, Baghsorkhi, Ueng, Stratton, Hwu (Illinois)
• SPIRAL: Moura, Pueschel (CMU), Johnson (Drexel), Garzarán, Padua (Illinois)
• SPIKETune: Wong, Kuck (Intel), Sameh (Purdue), Padua (Illinois)
Outline
1. Why Autotuning?
2. What is Autotuning?
3. Research Problems
Autotuning with empirical search
[Diagram: the same empirical-search pipeline, annotated with the open questions it raises: How to specify the search space (the metaprogram)? How to drive the search? What is "performance" (execution time, power)? What to do when performance depends on the input?]
Research Issues
1. What to do when performance depends on input
2. Modeling/Search
3. Description of the space
4. What to tune
5. What to tune for
Very promising, but much to learn
Issue 1: Performance depends on input
• When performance depends on the input, we must generate dynamically adapting routines.
  – Illustrated with the generation of sorting routines.
[CGO04] Li, Garzarán, Padua. A Dynamically Tuned Sorting Library. In Proc. of the Int. Symp. on Code Generation and Optimization, 2004.
[CGO05] Li, Garzarán, Padua. Optimizing Sorting with Genetic Algorithms. In Proc. of the Int. Symp. on Code Generation and Optimization, 2005.
Issue 1: Sorting
• Different algorithms to perform sorting:
  – Radix sort
  – Quicksort
  – Merge sort
• No single algorithm is the best for all inputs and platforms
Our Contribution
• Design of hybrid algorithms and use of genetic search to find sorting routines that automatically adapt to the target machine and the input characteristics.
• Result:
  – Generation of the fastest sorting routines for sequential and parallel execution.
Sorting
[Charts: performance (keys per cycle) vs. standard deviation of the input on Intel Xeon and AMD Athlon MP for CC-Radix, Merge Sort, and Quicksort. The same input can show different relative performance on each machine, so no single algorithm wins everywhere.]
Sorting Genome
[Figure: hybrid sorting tree for dynamic adaptation. An entropy-based selector chooses, per partition, among "divide with pivot", "divide by digit", and "divide into block", guarded by a threshold test (< theta vs. ≥ theta).]
Example of hybrid sorting
[Animation: the input is first divided with a pivot into Bucket 1 and Bucket 2. Within each bucket, operations are selected based on entropy: "divide by digit" when below theta, "divide into block" otherwise. The process repeats until every partition is sorted.]
Learning: Algorithm Selection
[Diagram: on the target machine, a learning mechanism processes training inputs to build a mapping input data → best algorithm, which is used at runtime.]
Results: Sequential Sorting
[Chart: on IBM Power3, ClassifierSort outperforms IBM ESSL and C++ STL by up to 26%.]
Results: Parallel Sorting
[Chart: results on an Intel Quad Core.]
Research Issues
1. Performance depends on input
2. Modeling/Search
3. Description of the space
4. What to tune
5. What to tune for
Issue 2: Modeling/Search
• When the search space is too big, we must use models or better search mechanisms. Illustrated with:
  1. An analytical model and a hybrid approach for ATLAS
     [PLDI03] Yotov, Li, Ren, Cibulskis, DeJong, Garzarán, Padua, Pingali, Stodghill, and Wu. A Comparison of Empirical and Model-driven Optimization. In PLDI, 2003.
     [Proc. of IEEE05] Yotov, Li, Ren, Garzarán, Padua, Pingali, and Stodghill. Is Search Really Necessary to Generate High-Performance BLAS? In Proc. of the IEEE, 2005.
     [LCPC05] Epshteyn, Garzarán, DeJong, Padua, Ren, Li, Yotov and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005.
  2. Genetic search for sorting [CGO04, CGO05]
ATLAS Modeling
• ATLAS = Automatically Tuned Linear Algebra Software, developed by R. Clint Whaley, Antoine Petitet and Jack Dongarra at the University of Tennessee.
• ATLAS uses empirical search to automatically generate highly tuned Basic Linear Algebra Subprograms (BLAS).
  – Uses search to adapt to the target machine.
Our Contribution
• Development of methods to speed up the search process:
  – Analytical models that replace the search
  – Hybrid models that combine models with empirical search
    [LCPC05] Epshteyn, Garzarán, DeJong, Padua, Ren, Li, Yotov and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005.
• The result:
  – Same performance
  – Faster generation
ATLAS Infrastructure
[Diagram: Detect Hardware Parameters (NR, MulAdd, Latency, L1Size) feed the ATLAS Search Engine (MMSearch); it passes candidate parameters (NB, MU, NU, KU, xFetch, MulAdd, Latency) to the ATLAS MM Code Generator (MMCase); the generated MiniMMM source is compiled, executed, and measured, and the MFLOPS result feeds back into the search engine.]
Modeling for Optimization Parameters
• Our modeling engine replaces the search engine: the detected hardware parameters (L1Size, L1 I-cache size, NR, MulAdd, Latency) feed a model that directly computes NB, MU, NU, KU, and xFetch for the code generator.
• Optimization parameters:
  – NB: hierarchy of models (next slide)
  – MU, NU: maximize MU*NU subject to the register-file constraint MU*NU + MU + NU + Latency ≤ NR
  – KU: maximize subject to the L1 instruction cache
  – Latency, MulAdd: from hardware parameters
  – xFetch: set to 2
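The register-tile constraint can be solved by direct enumeration, since the space is tiny. A sketch (the constraint form follows the model stated above; the function name is illustrative):

```c
/* Pick the register tile (MU, NU) by brute force: maximize MU*NU
 * subject to MU*NU + MU + NU + latency <= NR, i.e. registers for the
 * C micro-tile, a column of A, a row of B, and in-flight multiplies. */
void pick_register_tile(int nr, int latency, int *mu_out, int *nu_out)
{
    int best_mu = 1, best_nu = 1;
    for (int mu = 1; mu <= nr; mu++)
        for (int nu = 1; nu <= nr; nu++)
            if (mu * nu + mu + nu + latency <= nr &&
                mu * nu > best_mu * best_nu) {
                best_mu = mu;
                best_nu = nu;
            }
    *mu_out = best_mu;
    *nu_out = best_nu;
}
```

This is the sense in which a model "reduces search time to zero": the answer is computed, not measured.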
Modeling for Tile Size (NB)
• Models of increasing complexity:
  – 3·NB² ≤ C
    • Whole working set fits in L1
  – NB² + NB + 1 ≤ C
    • Fully associative cache
    • Optimal replacement
    • Line size: 1 word
  – ⌈NB²/B⌉ + ⌈NB/B⌉ + 1 ≤ C/B
    • Line size B > 1 word
  – ⌈NB²/B⌉ + 3·⌈NB/B⌉ + 1 ≤ C/B
    • LRU replacement
[Diagram: MiniMMM tiles; C (NB×NB) is computed from an NB×K tile of the M×K matrix A and a K×NB tile of the K×N matrix B.]
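The line-size-aware inequality can likewise be solved numerically for the largest feasible NB. A sketch, assuming the cache capacity C and line size B are given in array elements:

```c
/* Largest tile size NB satisfying the line-size-aware L1 model above:
 * ceil(NB*NB/B) + ceil(NB/B) + 1 <= C/B, where C is the L1 capacity
 * and B the line size, both in elements. */
int pick_nb(long c_elems, long b_elems)
{
    long lines = c_elems / b_elems;          /* cache capacity in lines */
    int nb = 1;
    for (int t = 1; (long)t * t <= c_elems; t++) {
        long tile = ((long)t * t + b_elems - 1) / b_elems; /* ceil(t^2/B) */
        long col  = ((long)t + b_elems - 1) / b_elems;     /* ceil(t/B)   */
        if (tile + col + 1 <= lines)
            nb = t;                          /* t is still feasible */
    }
    return nb;
}
```

For a 32 KB L1 holding 4096 doubles with 8-double lines, this picks NB = 63: one full 63×63 tile plus a column and an element just fit in 512 lines.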
MMM Performance
[Charts: MFLOPS vs. matrix size (up to 5000) on SGI R12000, Sun UltraSparc III, and Intel Pentium III, comparing vendor BLAS, the native COMPILER, ATLAS (empirical search), and MODEL (analytical).]
Models/Search
• Models reduce search time to essentially zero.
• However, search is still necessary when a model does not exist.
Genetic search for sorting
[Figure: the sorting-genome tree from before.]
Genetic operators are used to derive new offspring:
  – Mutation (add or remove subtrees, change parameters)
  – Crossover (exchange subtrees between genomes)
Issue 2: Modeling/Search
We need tools to guide models and search:
P-Ray: Characterization of hardware
[LCPC08] Duchateau, Sidelnik, Garzarán, Padua. P-Ray: A Suite of Microbenchmarks for Multi-core Architectures. In LCPC, 2008.
Characterize Hardware
• P-Ray: development of benchmarks to measure hardware characteristics of multicore platforms.
[Diagram: P-Ray supplies the detected hardware parameters (L1Size, NR, MulAdd, Latency) consumed by the ATLAS search engine and code generator.]
Our Contribution
• P-Ray: a tool to measure:
  – Block size
  – Cache mapping
  – Processor mapping
  – Effective bandwidth
• The result:
  – Correct results on 3 different platforms (Intel Xeon Harpertown, Sun UltraSparc T1 Niagara, Intel Core 2 Quad Kentsfield)
P-Ray: Processor Mapping
[Diagram: 8-core Intel Harpertown with two chips; chip 1 holds cores 1, 3, 5, 7 and chip 2 holds cores 2, 4, 6, 8, with each pair of cores sharing an L2 cache.]
Research Issues
1. Performance depends on input
2. Modeling/Search
3. Description of the space
4. What to tune
5. What to tune for
Issue 3: Description of the Space
• ATLAS generator is written in C
• We need more effective notations to implement a generator (describe the search space)
• Two possibilities:– Domain Specific Languages
– General Purpose Languages
Issue 3: Description of the Space
Illustrated with:
1. SPIRAL (domain-specific language)
   [Proc. of IEEE05] Püschel, Moura, Johnson, Padua, Veloso, Singer, Xiong, Franchetti, Gacic, Voronenko, Chen, Johnson, and Rizzolo. SPIRAL: Code Generation for DSP Transforms. Proc. of the IEEE, 2005. http://www.spiral.net
2. Metalanguage (general-purpose language)
   [LCPC05] Donadio, Brodman, Roeder, Yotov, Barthou, Cohen, Garzarán, Padua and Pingali. A Language for the Compact Representation of Multiple Program Versions. In LCPC, 2005.
SPIRAL
• SPIRAL is a generator of signal processing algorithms (DFT, DCT, WHT, filters, …).
• SPIRAL uses empirical search to generate routines that adapt to the target machine:
  – Sequential, parallel, SIMD, …
SPIRAL Contribution
• A declarative domain-specific language and rewriting rules to specify the search space.
• The result:
  – Generation of routines that run faster than IPP (Intel's manually tuned library)
  – Intel has started to use SPIRAL to generate parts of the IPP library
SPIRAL
• Search based on breakdown and rewriting rules, expressed in SPL, SPIRAL's metalanguage.
SPIRAL Program Generation
• Transform: DFT_n, a parameterized matrix.
• Rule: a breakdown strategy (Cooley-Tukey) rewrites the transform as a product of sparse matrices:
    DFT_mn = (DFT_m ⊗ I_n) · D · (I_m ⊗ DFT_n) · P
  where D is a diagonal (twiddle) matrix and P a stride permutation.
• Rule tree: applying the Cooley-Tukey rule (CT) repeatedly gives different rule trees for the same transform, e.g.
    (a) DFT_8 → DFT_2, DFT_4 → (DFT_2, DFT_2)
    (b) DFT_8 → DFT_4 → (DFT_2, DFT_2), DFT_2
• SPL formula: fully expanding a rule tree yields a formula such as
    DFT_8 = (F_2 ⊗ I_4) · D · (I_2 ⊗ F_2 ⊗ I_2) · D' · (I_4 ⊗ F_2) · P
SPIRAL
• Why is search important?
  – Different formulas (algorithms) have different execution times:
    • They differ in their memory access patterns
    • They have different ILP
SPIRAL Performance Results
Metaprogramming
• General-purpose programming of autotuned libraries and applications.
• A metaprogram contains a compact description of the space of program versions and how to proceed with the search.
Metaprogram example

  %try s in {2,4,8}
  for j = 1 to 128 by %s
    %for k = j to j+s-1
      a(%k) = …

The search strategy tries each value of s; each value yields one program shape:

  for j = 1 to 128 by 2
    a(j) = …  a(j+1) = …

  for j = 1 to 128 by 4
    a(j) = …  a(j+1) = …  a(j+2) = …  a(j+3) = …

  for j = 1 to 128 by 8
    a(j) = …  a(j+1) = …  a(j+2) = …  a(j+3) = …  a(j+4) = …  a(j+5) = …  a(j+6) = …  a(j+7) = …
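The expansion above can be sketched as a tiny metaprogram: a program that prints program text. The sketch below emits one unrolled shape per value of s (writing a(j+0) … a(j+s-1) rather than the slide's a(j) …); the function name and buffer interface are illustrative.

```c
#include <stdio.h>

/* A tiny metaprogram in the spirit of the %try/%for example above:
 * for a given unroll factor s it prints one program shape into buf.
 * Iterating s over {2, 4, 8} enumerates the whole version space. */
int emit_unrolled(char *buf, size_t cap, int s)
{
    int off = snprintf(buf, cap, "for j = 1 to 128 by %d\n", s);
    for (int k = 0; k < s; k++)
        off += snprintf(buf + off, cap - off, "  a(j+%d) = ...\n", k);
    return off;   /* number of characters emitted */
}
```

Each emitted shape would then be compiled and timed by the search driver, exactly as in the empirical-search diagram.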
Research Issues
1. Performance depends on input
2. Modeling/Search
3. Description of the space
4. What to tune
5. What to tune for
Issue 4: What to tune
1. Kernels (MMM, FFT, sorting, …)
2. Codelets
3. Primitives
Codelets
• A class of (short) code sequences that appear often in an application domain
• The set of codelets should cover much of the execution time in that domain
• Applications are decomposed into codelets
• Codelets are autotuned
Codelets
• Need a database of codelets
  – Each codelet in the database carries a set of compiler optimizations
• The application is decomposed into codelets that are matched against the codelets in the database
  – Application codelets are optimized using the optimizations of the matched database codelet
• Collaboration with David Kuck and David Wong, Intel
Primitive Operations
• Same as codelets, but not identified automatically by the compiler
• The user is expected to write the application using primitives
• The primitive operations are tuned for each target platform
Example of Primitive Operations
• HTA: Hierarchically Tiled Arrays
[PPoPP06] Bikshandi, Guo, Hoeflinger, Almasi, Fraguela, Garzarán, Padua, and von Praun. Programming for Parallelism and Locality with Hierarchically Tiled Arrays. In PPoPP, 2006.
[PPoPP08] Guo, Bikshandi, Fraguela, Garzarán, and Padua. Programming with Tiles. In PPoPP, 2008.
Hierarchically Tiled Arrays (HTAs)
• HTA is a data type where tiles are explicit
• HTAs are manipulated with data-parallel primitives
  – HTA programs look like sequential programs where parallelism is encapsulated in the data-parallel primitives
• Result:
  – Programs that run as fast as MPI (tested with the NAS benchmarks)
  – Fewer lines of code
  – Portable codes
FFT using HTA parallel primitives
[Code figure: the data-parallel primitive calls in the FFT are the pieces that can be autotuned.]
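An HTA-style primitive can be sketched as a tile-wise map. The type and function names below (hta1d, hmap) are illustrative, not the HTA library's actual API; the point is that in a real implementation the tile loop is exactly where parallelism, and tuning, would be encapsulated.

```c
#include <stddef.h>

/* A one-dimensional array with explicit tiles (illustrative). */
typedef struct {
    double *data;      /* flat storage */
    size_t  ntiles;    /* number of tiles */
    size_t  tile_len;  /* elements per tile */
} hta1d;

/* Data-parallel primitive: apply op to every tile.  In a real HTA
 * library this loop would run in parallel and could be autotuned;
 * the calling program still reads like sequential code. */
void hmap(hta1d *h, void (*op)(double *tile, size_t len))
{
    for (size_t t = 0; t < h->ntiles; t++)
        op(h->data + t * h->tile_len, h->tile_len);
}

/* Example per-tile operation: scale every element by 2. */
void scale2(double *tile, size_t len)
{
    for (size_t i = 0; i < len; i++) tile[i] *= 2.0;
}
```

A program composed of such primitives exposes its tiles and its parallel operations explicitly, which is what makes per-platform tuning of the primitives possible.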
Data Parallel Primitives
• Challenge:
Can we extend data parallel primitive operations to other complex data types, such as sets, trees, graphs?
Research Issues
1. Performance depends on input
2. Modeling/Search
3. Description of the space
4. What to tune
5. What to tune for
Issue 5: What to tune for
1. Execution Time (All the previous systems)
2. Power (Preliminary data in next slides)
3. Space
4. Reliability
Power in SPIRAL
• Processors allow software control of operating frequency and voltage
• e.g., the Intel Pentium M 770 has 6 settings:
  – 2.13 GHz at 1.340 V (max performance)
  – 800 MHz at 0.988 V (min power/energy)
Experimental Setup
• Intel Pentium M model 770
  – <2133 MHz, 1.34 V>, <1866 MHz, 1.27 V>, <1600 MHz, 1.2 V>, <1333 MHz, 1.13 V>, <1067 MHz, 1.06 V>, <800 MHz, 0.99 V>
• Measurements
  – HW: Agilent 34134A current probe and Agilent 34401A DMM
  – SW: SPIRAL-controlled automatic runtime and energy measurement routine
• Optimization space
  – Voltage-frequency scaling
Dynamic voltage-frequency scaling
• Use of voltage scaling instructions:
  – CPU-bound region → run at high frequency
  – Memory-bound region → run at low frequency
• Minimal impact on execution time and significant reduction in energy consumption
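The phase-based policy above reduces to a small decision function. A sketch (the miss-ratio threshold is an assumed illustrative value, not a measured one; the two frequencies match the Pentium M settings listed earlier):

```c
/* Phase-based DVFS policy sketch: run memory-bound phases, detected
 * by a high cache miss ratio, at the low frequency and CPU-bound
 * phases at the high frequency.  Returns the chosen frequency in MHz. */
int choose_frequency_mhz(double cache_miss_ratio)
{
    const double memory_bound_threshold = 0.10;  /* assumed value */
    return cache_miss_ratio > memory_bound_threshold ? 800 : 2133;
}
```

A real controller would sample hardware counters periodically and issue the voltage-scaling instruction only when the phase changes, to amortize the transition cost.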
Dynamic voltage-frequency scaling: memory profile
[Chart: cache miss ratio vs. time for WHT-2^19 (out-of-cache); each point shows the cache miss ratio every 100 seconds. A zoomed region follows.]
Dynamic voltage-frequency scaling: memory profile
[Chart (zoom): the cache miss ratio alternates between phases; the low frequency is used in the high-miss-ratio (memory-bound) phases and the high frequency in the low-miss-ratio phases.]
Dynamic voltage-frequency scaling: results
[Chart: energy (Joules) vs. execution time (seconds) for WHT-2^19. Relative to the fixed voltage-frequency settings, dynamic voltage scaling achieves either the same execution time with 10% less energy, or the same energy with less execution time.]
Compiler Optimizations (Future work)
[Charts: cache miss ratio vs. iterations, before and after reordering. Applying dependence analysis to group together iterations with similar cache miss ratios increases the benefit of dynamic voltage scaling.]
Research Agenda
1. Performance depends on input
2. Modeling/Search
3. Description of the space
4. What to tune
5. What to tune for