Accelerating GMRES via Mixed Precision
Neil Lindquist, Piotr Luszczek, Jack Dongarra
12th JLESC Workshop
February 25th, 2020
GMRES
• General-purpose sparse linear solver
  • Iterative Krylov solver
• Memory-bound performance
  • Mix single and double precision
GMRES Algorithm
GMRES(A, x_0, b, M^{-1})
Computing Ax = b. A^{-1} ≈ M^{-1}
for k = 0, 1, 2, …                          (restarts)
    r_k ← b − A x_k
    z_k ← M^{-1} r_k
    β ← ‖z_k‖_2
    V_{:,0} ← z_k / β
    s ← [β, 0, 0, …, 0]^T
    for j = 0, 1, 2, …                      (iteration count)
        w ← M^{-1} A V_{:,j}
        w, H_{:,j} ← orthogonalize(w, V_{:,j})
        H_{j+1,j} ← ‖w‖_2
        V_{:,j+1} ← w / ‖w‖_2
        H_{:,j} ← G_0 G_1 … G_{j−1} H_{:,j}
        G_j ← rotation_matrix(H_{:,j})
        H_{:,j} ← G_j H_{:,j}
        s ← G_j s
    u_k ← V H^{-1} s
    x_{k+1} ← x_k + u_k
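The rotation_matrix step in the inner loop is usually a Givens rotation that zeroes the subdiagonal entry of the newest Hessenberg column. The slides do not define it, so the following is only a minimal sketch under that assumption; the helper names `givens` and `apply_givens` are hypothetical. It also shows why the last entry of s can serve as a residual estimate.

```python
import math

def givens(a, b):
    """Return (c, sn) so that [[c, sn], [-sn, c]] maps (a, b) to (hypot(a, b), 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0              # nothing to eliminate
    return a / r, b / r

def apply_givens(c, sn, a, b):
    """Apply the rotation [[c, sn], [-sn, c]] to the pair (a, b)."""
    return c * a + sn * b, -sn * a + c * b

# One step on a single column: H[j, j] = 3, H[j+1, j] = 4 (illustrative values).
h_jj, h_j1j = 3.0, 4.0
c, sn = givens(h_jj, h_j1j)
h_jj, h_j1j = apply_givens(c, sn, h_jj, h_j1j)   # subdiagonal entry becomes ~0

# The same rotation is applied to the right-hand side s = [beta, 0] with beta = 1.
s_j, s_j1 = apply_givens(c, sn, 1.0, 0.0)
residual_estimate = abs(s_j1)                    # magnitude of the trailing entry
```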
GMRES Algorithm
GMRES(A, x_0, b, M^{-1})
for k = 0, 1, 2, …                          (restarts)
    Double: r_k ← b − A x_k
    Single: z_k ← M^{-1} r_k, then the inner loop (iteration count), through u_k ← V H^{-1} s
    Double: x_{k+1} ← x_k + u_k
GMRES Simplified Algorithm
GMRES(A, x_0, b, M^{-1})
for k = 0, 1, 2, …
    Double: r_k ← b − A x_k
    Single: u_k ← GMRES_inner(A, 0, r_k, M^{-1})
    Double: x_{k+1} ← x_k + u_k
GMRES Simplified Algorithm
GMRES(A, x_0, b, M^{-1})
for k = 0, 1, 2, …
    Double: r_k ← b − A x_k
    Single: u_k ← A^{-1} r_k
    Double: x_{k+1} ← x_k + u_k
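The three-line loop above has the shape of iterative refinement: residual and update in double, the solve in single. The sketch below illustrates that shape only. The inner solver is a hypothetical stand-in (a few Jacobi sweeps rounded to single precision), not the single-precision GMRES from the talk, and `to_single`, `solve_single`, and `refine` are names invented for this example.

```python
import struct

def to_single(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def matvec(A, x):
    """Dense matrix-vector product in double precision."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def solve_single(A, r, sweeps=20):
    """Stand-in inner solver: Jacobi sweeps, rounding to single after each operation."""
    n = len(r)
    A32 = [[to_single(a) for a in row] for row in A]
    r32 = [to_single(v) for v in r]
    u = [0.0] * n
    for _ in range(sweeps):
        new = []
        for i in range(n):
            acc = r32[i]
            for j in range(n):
                if j != i:
                    acc = to_single(acc - to_single(A32[i][j] * u[j]))
            new.append(to_single(acc / A32[i][i]))
        u = new
    return u

def refine(A, b, outer=10):
    """Outer loop: double residual, single solve, double update."""
    x = [0.0] * len(b)
    for _ in range(outer):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # Double
        u = solve_single(A, r)                             # Single
        x = [xi + ui for xi, ui in zip(x, u)]              # Double
    return x

# Small diagonally dominant system with known solution [1, 2, 3].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]
x = refine(A, b)
```

Each single-precision solve only needs to reduce the error by a few digits; the double-precision residual is what lets the outer loop reach double-precision accuracy.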
Performance
• Target accuracy: 10^{-10} = ‖b − Ax‖_2 / (‖A‖_F ‖x‖_2 + ‖b‖_2)
• Restart strategies:
  I. 100 inner iterations
  II. 100 inner iterations or residual estimate of 10^{-10}
  III. First: 100 inner iterations or residual estimate of 10^{-6}
       Then: same number of inner iterations
• 20-core Haswell node with NVIDIA V100 GPU
  • cuSPARSE, cuBLAS, Kokkos
• CSR matrix format
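The accuracy metric and the three restart strategies can be written down compactly. This is a sketch using dense Python lists for readability; `backward_error` and `should_restart` are hypothetical names, and the reading of strategy III (the first cycle fixes the length of all later cycles) is an interpretation of the slide.

```python
import math

def norm2(v):
    return math.sqrt(sum(x * x for x in v))

def frobenius(A):
    return math.sqrt(sum(a * a for row in A for a in row))

def backward_error(A, x, b):
    """||b - A x||_2 / (||A||_F ||x||_2 + ||b||_2) for a dense list-of-lists A."""
    r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
    return norm2(r) / (frobenius(A) * norm2(x) + norm2(b))

def should_restart(strategy, j, estimate, first_cycle_len=None):
    """Decide whether to end the current inner cycle.

    j: inner iterations completed in this cycle.
    estimate: current residual estimate.
    first_cycle_len: length of the first cycle (strategy III only; None during it).
    """
    if strategy == "I":
        return j >= 100
    if strategy == "II":
        return j >= 100 or estimate <= 1e-10
    if strategy == "III":
        if first_cycle_len is None:                 # first cycle
            return j >= 100 or estimate <= 1e-6
        return j >= first_cycle_len                 # later cycles reuse that length
    raise ValueError("unknown strategy: " + repr(strategy))
```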
Performance – Scalar Jacobi
• Speedups
  • Median time of 3 runs
  • Error bars: minimums and maximums
• Geometric mean of speedup
  • MGS: 14%
  • CGSR: 54%
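The summary statistics on this slide (median of 3 runs per matrix, geometric mean across matrices) can be computed as follows. The timings are made-up placeholders, not measurements from the talk, and treating a reported "14%" as a geometric-mean ratio of 1.14 is an assumption.

```python
import math
import statistics

def speedup(double_runs, mixed_runs):
    """Ratio of median double-precision time to median mixed-precision time."""
    return statistics.median(double_runs) / statistics.median(mixed_runs)

def geometric_mean(ratios):
    """Geometric mean, the usual way to average ratios such as speedups."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical timings in seconds: (double-precision runs, mixed-precision runs).
timings = {
    "matrix_a": ([2.0, 2.2, 2.1], [1.0, 1.1, 1.05]),   # mixed is 2x faster
    "matrix_b": ([4.0, 3.9, 4.1], [8.0, 8.2, 7.8]),    # mixed is 2x slower
}
ratios = [speedup(d, m) for d, m in timings.values()]
gm = geometric_mean(ratios)
percent = (gm - 1.0) * 100.0    # a 2x gain and a 2x loss cancel out
```

The geometric mean is used because a 2x speedup and a 2x slowdown should average to no change, which an arithmetic mean of ratios would not give.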
Performance – ILU(0)
• Speedups
  • Median time of 3 runs
  • Error bars: minimums and maximums
• Geometric mean of speedup
  • MGS: -7%
  • CGSR: -4%
Performance – ILU(0) with Jacobi Solves
• ILU(0) w/ 5 Jacobi iterations for each triangular solve
• Speedups
  • Median time of 3 runs
  • Error bars: minimums and maximums
• Geometric mean of speedup
  • MGS: 8%
  • CGSR: 14%
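Replacing a triangular solve with a fixed number of Jacobi iterations can be sketched as below. The dense storage and the function name `jacobi_triangular_solve` are assumptions made for illustration; the talk's implementation is not shown on the slide. Each sweep has the structure of a matrix-vector product, so it has no sequential dependency between rows.

```python
def jacobi_triangular_solve(T, b, sweeps=5):
    """Approximate the solution of T y = b for a triangular T.

    Each sweep computes y <- D^{-1} (b - (T - D) y), where D = diag(T).
    """
    n = len(b)
    y = [0.0] * n
    for _ in range(sweeps):
        # Every row uses only the previous iterate, so rows are independent.
        y = [(b[i] - sum(T[i][j] * y[j] for j in range(n) if j != i)) / T[i][i]
             for i in range(n)]
    return y

# Lower-triangular example whose exact solution is [1, 1, 1].
L = [[2.0, 0.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 1.0, 2.0]]
b = [2.0, 3.0, 3.0]
y5 = jacobi_triangular_solve(L, b, sweeps=5)
y2 = jacobi_triangular_solve(L, b, sweeps=2)
```

For an n-by-n triangular matrix the iteration is exact after at most n sweeps, so 5 sweeps solve this 3-by-3 example exactly; on large matrices 5 sweeps give only an approximation.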
Future Directions
• Choice of low precision
  • Half, Bfloat16
  • Compression
• Distributed systems
• Other Krylov methods
• Applications
Conclusions
• When restarted, mixed-precision GMRES often outperforms double-precision GMRES
Extra Slides
Test Configuration Details
• CUDA 10.2.199, Kokkos 3.1.01, GCC 7.3.0
• https://bitbucket.org/icl/mixed-precision-gmres
  • tag TPDS-perf
Publications
• N. Lindquist, P. Luszczek, and J. Dongarra, "Improving the performance of the GMRES method using mixed-precision techniques," in Driving Scientific and Engineering Discoveries through the Convergence of HPC, Big Data and AI. DOI: 10.1007/978-3-030-63393-6_4
• [Submitted] N. Lindquist, P. Luszczek, and J. Dongarra, "Accelerating restarted GMRES with mixed precision arithmetic," in Transactions on Parallel and Distributed Systems.
Effect on Convergence: Configuration
• ILU(0) preconditioner (M^{-1})
• CSR matrix format
• Custom mixed-precision kernels w/ Kokkos
• 20-core Haswell node
  • 2x Intel® Xeon® E5-2650 v3 processors
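For readers unfamiliar with the CSR (compressed sparse row) format named above, here is a minimal sketch of a CSR matrix-vector product. It is plain Python for illustration only and says nothing about how the custom Kokkos kernels are written; `csr_matvec` is a hypothetical name.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A x where A is stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] is the slice of col_idx/values holding row i.
    """
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y

# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]] stored row by row, nonzeros only.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = csr_matvec(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

The kernel reads every stored value and index exactly once, which is why sparse matrix-vector products are typically memory bound and why storing `values` in a narrower type can reduce run time.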
Effect on Convergence: Configuration
• airfoil_2d from SuiteSparse collection
  • n = 14,214
  • nnz = 259,688
  • κ_2 = 1.8 · 10^6
• Error if GMRES stopped: ‖b − Ax‖_2 / (‖A‖_F ‖x‖_2 + ‖b‖_2)
Accuracy results
Modified Gram-Schmidt Orthogonalization (MGS)
Classical Gram-Schmidt with Reorthogonalization (CGSR)
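The two orthogonalization schemes compared in these plots can be sketched as follows, matching the call shape w, H_{:,j} ← orthogonalize(w, V) from the algorithm slide. The basis V is a list of orthonormal vectors; running the second classical pass unconditionally is an assumption of this sketch, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mgs(w, V):
    """Modified Gram-Schmidt: remove one basis vector at a time from the updated w."""
    h = []
    for v in V:
        c = dot(v, w)                                   # uses the current w
        w = [wi - c * vi for wi, vi in zip(w, v)]
        h.append(c)
    return w, h

def cgsr(w, V):
    """Classical Gram-Schmidt run twice; coefficients of both passes are summed."""
    h = [0.0] * len(V)
    for _ in range(2):
        c = [dot(v, w) for v in V]                      # all dots from the same w
        for ck, v in zip(c, V):
            w = [wi - ck * vi for wi, vi in zip(w, v)]
        h = [hk + ck for hk, ck in zip(h, c)]
    return w, h

V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_mgs, h_mgs = mgs([3.0, 4.0, 5.0], V)
w_cgsr, h_cgsr = cgsr([3.0, 4.0, 5.0], V)
```

In MGS each dot product depends on the previous update, while in CGSR the dot products of a pass are independent and can be batched into one matrix-vector product, at the cost of doing the work twice.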