Accelerating GMRES via Mixed Precision

Neil Lindquist, Piotr Luszczek, Jack Dongarra. 12th JLESC Workshop, February 25th, 2020.


Page 1: Accelerating GMRES via Mixed Precision

Accelerating GMRES via Mixed Precision

Neil Lindquist, Piotr Luszczek, Jack Dongarra

12th JLESC Workshop

February 25th, 2020

Page 2: Accelerating GMRES via Mixed Precision

GMRES

• General purpose, sparse linear solver
    • Iterative, Krylov solver
• Memory bound performance
    • Mix single and double precision

Page 3: Accelerating GMRES via Mixed Precision

GMRES Algorithm

Computing $Ax = b$, with $A^{-1} \approx M^{-1}$.

$\mathrm{GMRES}_{res}(A, x_0, b, M^{-1})$
for $k = 0, 1, 2, \ldots$ (restarts)
    $r_k \leftarrow b - A x_k$
    $z_k \leftarrow M^{-1} r_k$
    $\beta \leftarrow \|z_k\|_2$
    $V_{:,0} \leftarrow z_k / \beta$
    $s \leftarrow [\beta, 0, 0, \ldots, 0]^T$
    for $j = 0, 1, 2, \ldots$ (iteration count)
        $w \leftarrow M^{-1} A V_{:,j}$
        $w, H_{:,j} \leftarrow \mathrm{orthogonalize}(w, V_{:,j})$
        $H_{j+1,j} \leftarrow \|w\|_2$
        $V_{:,j+1} \leftarrow w / \|w\|_2$
        $H_{:,j} \leftarrow G_0 G_1 \cdots G_{j-1} H_{:,j}$
        $G_j \leftarrow \mathrm{rotation\_matrix}(H_{:,j})$
        $H_{:,j} \leftarrow G_j H_{:,j}$
        $s \leftarrow G_j s$
    $u_k \leftarrow V H^{-1} s$
    $x_{k+1} \leftarrow x_k + u_k$
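The algorithm on this slide can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes dense arrays, modified Gram-Schmidt for the orthogonalization step, and an optional preconditioner passed as a callable. The names `gmres_restarted`, `M_inv`, `inner`, `restarts` and `tol` are introduced here for illustration only.

```python
import numpy as np

def gmres_restarted(A, b, x0, M_inv=None, inner=30, restarts=50, tol=1e-10):
    """Restarted, left-preconditioned GMRES on dense NumPy arrays (sketch)."""
    if M_inv is None:
        M_inv = lambda v: v                      # no preconditioner (assumption)
    n = b.shape[0]
    b_norm = np.linalg.norm(b)
    x = np.array(x0, dtype=np.float64)
    for _ in range(restarts):                    # outer loop: restarts
        r = b - A @ x
        if np.linalg.norm(r) <= tol * b_norm:
            break
        z = M_inv(r)
        beta = np.linalg.norm(z)
        V = np.zeros((n, inner + 1))
        H = np.zeros((inner + 1, inner))
        cs, sn = np.zeros(inner), np.zeros(inner)
        s = np.zeros(inner + 1)
        V[:, 0], s[0] = z / beta, beta
        m = 0
        for j in range(inner):                   # inner loop: Arnoldi steps
            w = M_inv(A @ V[:, j])
            for i in range(j + 1):               # modified Gram-Schmidt
                H[i, j] = V[:, i] @ w
                w = w - H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] > 0.0:
                V[:, j + 1] = w / H[j + 1, j]
            for i in range(j):                   # apply the earlier rotations
                hi, hi1 = H[i, j], H[i + 1, j]
                H[i, j] = cs[i] * hi + sn[i] * hi1
                H[i + 1, j] = -sn[i] * hi + cs[i] * hi1
            d = np.hypot(H[j, j], H[j + 1, j])   # build G_j to zero H[j+1, j]
            cs[j], sn[j] = H[j, j] / d, H[j + 1, j] / d
            H[j, j], H[j + 1, j] = d, 0.0
            s[j + 1] = -sn[j] * s[j]             # s <- G_j s
            s[j] = cs[j] * s[j]
            m = j + 1
            if abs(s[j + 1]) <= tol * b_norm:    # residual estimate
                break
        y = np.linalg.solve(H[:m, :m], s[:m])    # H is upper triangular here
        x += V[:, :m] @ y
    return x
```

The last entry of `s` after each rotation serves as the residual estimate, which is what lets the inner loop stop without forming the residual explicitly.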

Page 4: Accelerating GMRES via Mixed Precision

GMRES Algorithm

(Same algorithm as on Page 3, with its steps labeled by precision: Double, Single, Double.)

Page 5: Accelerating GMRES via Mixed Precision

GMRES Simplified Algorithm

$\mathrm{GMRES}_{res}(A, x_0, b, M^{-1})$
for $k = 0, 1, 2, \ldots$
    Double: $r_k \leftarrow b - A x_k$
    Single: $u_k \leftarrow \mathrm{GMRES}_{no\ res}(A, 0, r_k, M^{-1})$
    Double: $x_{k+1} \leftarrow x_k + u_k$
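The three-line structure above can be sketched in NumPy as a double-precision outer loop wrapped around a single-precision inner solve. This is a hedged illustration under several assumptions: dense arrays, no preconditioner, and an inner solver that finds the least-squares coefficients with `np.linalg.lstsq` rather than with rotations. The function names and parameters are hypothetical, not taken from the authors' code.

```python
import numpy as np

def gmres_no_restart(A, r, inner):
    """`inner` GMRES steps from a zero initial guess, in the dtype of A (sketch)."""
    dt = A.dtype
    n = r.shape[0]
    beta = np.linalg.norm(r)
    V = np.zeros((n, inner + 1), dtype=dt)
    H = np.zeros((inner + 1, inner), dtype=dt)
    V[:, 0] = r / beta
    k = inner
    for j in range(inner):
        w = A @ V[:, j]
        for i in range(j + 1):                   # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0:                     # exact breakdown: subspace is complete
            k = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(k + 1, dtype=dt)
    e1[0] = beta
    y = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)[0]
    return V[:, :k] @ y

def mixed_precision_gmres(A, b, x0, inner=20, restarts=50, tol=1e-10):
    """Double-precision residual and update around a single-precision inner solve."""
    A32 = A.astype(np.float32)                   # single-precision copy of the matrix
    x = np.array(x0, dtype=np.float64)
    for _ in range(restarts):
        r = b - A @ x                            # double
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        u = gmres_no_restart(A32, r.astype(np.float32), inner)   # single
        x += u.astype(np.float64)                # double
    return x
```

Because the residual is recomputed in double precision at every restart, the error left by the single-precision inner solve is corrected by later cycles instead of accumulating.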

Page 6: Accelerating GMRES via Mixed Precision

GMRES Simplified Algorithm

$\mathrm{GMRES}_{res}(A, x_0, b, M^{-1})$
for $k = 0, 1, 2, \ldots$
    Double: $r_k \leftarrow b - A x_k$
    Single: $u_k \leftarrow A^{-1} r_k$
    Double: $x_{k+1} \leftarrow x_k + u_k$
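One way to read this slide is as iterative refinement: the inner solve only has to approximate $A^{-1} r_k$. As a hedged illustration, suppose the residual and the update are computed exactly and the inner solve returns $u_k$ with relative error at most $\varepsilon < 1$. The parameter $\varepsilon$ and the error vector $\delta_k$ are assumptions introduced here, not quantities from the slides.

```latex
\begin{aligned}
e_k &:= A^{-1} b - x_k, \qquad r_k = b - A x_k = A e_k, \\
u_k &= A^{-1} r_k + \delta_k = e_k + \delta_k, \qquad \|\delta_k\|_2 \le \varepsilon \|e_k\|_2, \\
e_{k+1} &= e_k - u_k = -\delta_k
\quad\Longrightarrow\quad \|e_{k+1}\|_2 \le \varepsilon \|e_k\|_2 .
\end{aligned}
```

Under that assumption each restart shrinks the error by a factor of $\varepsilon$, so an inner solve that is only accurate to single precision can still reach a double-precision target after a few restarts.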

Page 7: Accelerating GMRES via Mixed Precision

Performance

• Target accuracy: $10^{-10} = \frac{\|b - Ax\|_2}{\|A\|_F \|x\|_2 + \|b\|_2}$
• Restart strategies:
    I. 100 inner iterations
    II. 100 inner iterations or residual estimate of $10^{-10}$
    III. First: 100 inner iterations or residual estimate of $10^{-6}$
         Then: same number of inner iterations
• 20-core Haswell node with NVIDIA V100 GPU
    • cuSparse, cuBLAS, Kokkos
• CSR matrix format
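The accuracy metric and the restart strategies on this slide can be sketched in plain Python on a CSR matrix. This is an illustration only. The CSR arrays are named `indptr`, `indices` and `data` by assumption, the Frobenius norm is taken from the stored values (which assumes no duplicate entries), and `inner_done` encodes one reading of strategies I to III rather than the authors' exact logic.

```python
import math

def csr_matvec(indptr, indices, data, x):
    """y = A x for A stored in CSR (row pointers, column indices, values)."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for p in range(indptr[row], indptr[row + 1]):
            acc += data[p] * x[indices[p]]
        y.append(acc)
    return y

def norm2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in v))

def backward_error(indptr, indices, data, x, b):
    """||b - A x||_2 / (||A||_F ||x||_2 + ||b||_2)."""
    Ax = csr_matvec(indptr, indices, data, x)
    r = [bi - yi for bi, yi in zip(b, Ax)]
    norm_A_fro = math.sqrt(sum(a * a for a in data))   # assumes no duplicate entries
    return norm2(r) / (norm_A_fro * norm2(x) + norm2(b))

def inner_done(strategy, j, est, first_len=None, max_inner=100):
    """True when the inner loop should restart after j iterations.

    est is the current residual estimate; first_len is the length of the
    first cycle, used only by strategy III (None during the first cycle).
    """
    if strategy == "I":
        return j >= max_inner
    if strategy == "II":
        return j >= max_inner or est <= 1e-10
    if strategy == "III":
        if first_len is None:                          # first cycle
            return j >= max_inner or est <= 1e-6
        return j >= first_len                          # later cycles reuse its length
    raise ValueError(strategy)
```

For example, with the 2x2 matrix `[[2, 0], [1, 3]]` stored as `indptr=[0, 1, 3]`, `indices=[0, 0, 1]`, `data=[2.0, 1.0, 3.0]` and `b=[2.0, 4.0]`, the exact solution `[1.0, 1.0]` gives a backward error of zero.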

Page 8: Accelerating GMRES via Mixed Precision

Performance – Scalar Jacobi

• Speedups
• Median time of 3 runs
• 3 runs
• Error bars: mins and maxes
• Geometric mean of speedup
    • MGS: 14%
    • CGSR: 54%
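The summary statistics named on this slide can be computed as in the sketch below. It assumes that speedup is the ratio of median run times (double over mixed) and that the reported percentages mean (geometric mean of speedup − 1) × 100. Both conventions are assumptions, and the function names are hypothetical.

```python
import math
import statistics

def speedup(times_double, times_mixed):
    """Speedup of mixed over double precision, from the median of the run times."""
    return statistics.median(times_double) / statistics.median(times_mixed)

def geometric_mean_percent(speedups):
    """Geometric mean of per-matrix speedups, as a percentage improvement."""
    log_mean = sum(math.log(s) for s in speedups) / len(speedups)
    return (math.exp(log_mean) - 1.0) * 100.0
```

The geometric mean is the natural average for ratios: a 2x speedup on one matrix and a 2x slowdown on another average to no change, which an arithmetic mean would not show.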

Page 9: Accelerating GMRES via Mixed Precision

Performance – ILU(0)

• Speedups
• Median time of 3 runs
• 3 runs
• Error bars: mins and maxes
• Geometric mean of speedup
    • MGS: -7%
    • CGSR: -4%

Page 10: Accelerating GMRES via Mixed Precision

Performance – ILU(0) with Jacobi Solves

• ILU(0) w/ 5 Jacobi iterations for each triangular solve
• Speedups
• Median time of 3 runs
• 3 runs
• Error bars: mins and maxes
• Geometric mean of speedup
    • MGS: 8%
    • CGSR: 14%
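Replacing a triangular solve with a fixed number of Jacobi iterations can be sketched as below. This is a dense NumPy illustration under stated assumptions: the starting guess $D^{-1} r$ and the function name `jacobi_triangular_solve` are choices made here, and the slides do not say how the authors initialize the sweeps.

```python
import numpy as np

def jacobi_triangular_solve(T, r, iters=5):
    """Approximate T^{-1} r for a triangular T with a fixed number of Jacobi sweeps."""
    d = np.diag(T)                 # diagonal of T
    N = T - np.diag(d)             # strictly triangular part
    y = r / d                      # starting guess (an assumption of this sketch)
    for _ in range(iters):
        y = (r - N @ y) / d        # one Jacobi sweep: a matrix-vector product
    return y
```

Each sweep is a matrix-vector product instead of a sequential substitution. Because the strictly triangular part is nilpotent, the sweeps reproduce the exact solve after at most n − 1 iterations for an n × n matrix; with only 5 sweeps on a large matrix the result is an approximation.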

Page 11: Accelerating GMRES via Mixed Precision

Future Directions

• Choice of low-precision
    • Half, Bfloat16
    • Compression
• Distributed systems
• Other Krylov methods
• Applications

Page 12: Accelerating GMRES via Mixed Precision

Conclusions

• When restarted, mixed-precision GMRES often outperforms double-precision GMRES

Page 13: Accelerating GMRES via Mixed Precision

Extra Slides


Page 14: Accelerating GMRES via Mixed Precision

Test Configuration Details

• CUDA 10.2.199, Kokkos 3.1.01, GCC 7.3.0
• https://bitbucket.org/icl/mixed-precision-gmres
    • tag TPDS-perf

Page 15: Accelerating GMRES via Mixed Precision

Publications

• N. Lindquist, P. Luszczek, and J. Dongarra, “Improving the performance of the GMRES method using mixed-precision techniques,” in Driving Scientific and Engineering Discoveries through the Convergence of HPC, Big Data and AI. DOI: 10.1007/978-3-030-63393-6_4
• [Submitted] N. Lindquist, P. Luszczek, and J. Dongarra, “Accelerating restarted GMRES with mixed precision arithmetic,” in Transactions on Parallel and Distributed Systems.

Page 16: Accelerating GMRES via Mixed Precision

Effect on Convergence: Configuration

• ILU(0) preconditioner ($M^{-1}$)
• CSR matrix format
• Custom, mixed precision kernels w/ Kokkos
• 20-core Haswell node
    • 2x Intel® Xeon® E5-2650 v3 processors

Page 17: Accelerating GMRES via Mixed Precision

Effect on Convergence: Configuration

• airfoil_2d from SuiteSparse collection
    • $n$ = 14,214
    • $nnz$ = 259,688
    • $\kappa_2 = 1.8 \cdot 10^6$
• Error if GMRES stopped: $\frac{\|b - Ax\|_2}{\|A\|_F \|x\|_2 + \|b\|_2}$

Page 18: Accelerating GMRES via Mixed Precision

Accuracy results

Modified Gram-Schmidt Orthogonalization (MGS)

Classical Gram-Schmidt with Reorthogonalization (CGSR)
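The two orthogonalization schemes compared here can be sketched in NumPy as follows. This is a minimal illustration, not the authors' kernels: `V` is assumed to hold orthonormal columns, and this CGSR variant always performs one reorthogonalization pass, whereas an implementation might apply it conditionally.

```python
import numpy as np

def mgs(w, V):
    """Modified Gram-Schmidt: orthogonalize w against the columns of V one at a time."""
    w = w.copy()
    h = np.zeros(V.shape[1], dtype=w.dtype)
    for i in range(V.shape[1]):
        h[i] = V[:, i] @ w         # projection onto the partially updated w
        w -= h[i] * V[:, i]
    return w, h

def cgsr(w, V):
    """Classical Gram-Schmidt followed by one reorthogonalization pass."""
    h = V.T @ w                    # all projections at once
    w = w - V @ h
    c = V.T @ w                    # second pass removes what the first one left
    w = w - V @ c
    return w, h + c
```

MGS issues one dot product per basis vector in sequence, while CGSR expresses each pass as two matrix-vector products, which tends to suit GPUs better at the cost of doing the work twice.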