TRANSCRIPT
iWAPT 2020 Workshop:
Acceleration of Structural Analysis Simulations using CNN-based Auto-Tuning of Solver Tolerance
Amir Haderbache (1), Koichi Shirahata (1), Takuji Yamamoto (1), Hiroshi Okuda (2) and Yasumoto Tomita (1)
(1) Fujitsu Laboratories Ltd. (2) The University of Tokyo
Copyright 2020 Fujitsu Laboratories Ltd.
Outline
1. Introduction
2. Theoretical Background
3. Performance Analysis
4. Proposal
5. Experimental Evaluation
6. Conclusion
Introduction
AI for HPC

There is a surge of interest in applying machine learning to traditional HPC workloads for performance reasons.

For example, using simulation data and deep neural networks, it was possible to:
1. Estimate steady flow
2. Accelerate particle physics
3. Infer rainfall
at very low latency compared to traditional simulations.
(1) Guo, Li, "Convolutional Neural Networks for Steady Flow Approximation"
(2) Paganini, de Oliveira, "Accelerating science with generative adversarial networks: an application to 3D particle showers in multilayer calorimeters"
(3) S. Kim et al., "Deep-hurricane-tracker: Tracking and forecasting extreme climate events"
[Figure: physics simulations produce simulation data used to train a neural network]
Main trends of "AI for HPC"

AI models (neural networks) are used in several ways to accelerate HPC simulations.
Existing methods and qualitative description:
- Pre- and post-processing of simulation data. Example: infer a fine structure parameter from a coarse one.
- Surrogate models: predict simulation results without first principles at very low latency.

[Figure: a corner fillet inferred by a neural network; a mesh-to-displacement surrogate neural network]
Limitations of existing methods
Pre- and post-processing of simulation data:
1. Does not directly accelerate the HPC simulation.
2. Is limited to specific data representations.

Surrogate models:
1. Suffer from significant accuracy loss (typical error rates: 1.98% ~ 2.69%).
2. Can only estimate physics under very limited conditions.

[Figure: ground truth (simulation), surrogate-model prediction, and their difference. Results from AI Solver (3D model, experimental results) made by Fujitsu Laboratories of Europe, Ahmed Al-Jarro]
Our approach: AI inside HPC simulation
We incorporate a CNN model inside the simulation runtime to accelerate the solver computation.
The CNN is trained using internal simulation data.
The CNN auto-tunes the solver convergence criterion for speed-up.
The CNN does not interfere with the first-principles computation, thus accuracy is guaranteed.

[Figure: input model → simulation runtime enhanced with AI inference (iterative solver exchanging internal data and tuning control with a CNN) → simulation results]
Theoretical Background: nonlinear analysis & iterative solver
Structural Analysis (SA) Simulations
SA simulations compute the effects of loads applied on physical structures. Designers test several loads (forces, heat, pressure) in order to check how the system would respond.

[Figure: input simulation model (mesh, applied force) → simulation → output displacement field]
Geometric Nonlinear Analysis
When applied loads create large deformations of the structure:
- The relation between applied loads and displacement is nonlinear.
- The nonlinear response is computed by an incremental, step-by-step analysis where the load is applied gradually, in increments.

This process is expressed by

    [K] Δr = ΔR    (1)
    r_{n+1} = r_n + Δr    (2)

Equation (1) is solved iteratively because the stiffness matrix K changes during a given load step.

[Figure: applied load (Newton) vs. displacement (m) for a nonlinear analysis with 3 load steps, showing load increments (ΔR_E)_0, (ΔR_E)_1, (ΔR_E)_2 and displacements r_0, r_1, r_2, r_3]
Newton-Raphson in nonlinear analysis
In a given load step n, the Newton-Raphson algorithm solves equation (1) iteratively by a recurrence relation:

    [K] Δr_n^k = (ΔR_E)_n - (ΔR_I)_{k-1}
    r_{n+1}^k = r_{n+1}^{k-1} + Δr_n^k

[Figure: Newton-Raphson method in a given load step n]

Newton-Raphson iterations k proceed until the norm of (ΔR_E)_n - (ΔR_I)_{k-1} becomes smaller than a specified tolerance value.
Linear solver in Newton-Raphson method
For each NR iteration k, a sparse system of linear equations of the type K x = b must be solved.

The NR method relies on iterative solvers (such as CG) to compute an approximate solution of such a system. This is the main computational part of nonlinear analysis.

During the solving process, the solution vector x is iteratively corrected until convergence is obtained, as

    x_{s+1} = x_s + α_s p_s

where the search direction p_s is built from the preconditioned residual P^{-1} r_s. P is the preconditioner matrix and r_s, defined as r_s = b - K x_s, is the residual error at iteration s. When the residual gets smaller than the specified solver tolerance, convergence occurs.
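As an illustration, a minimal preconditioned CG loop can be sketched as follows. This is not FrontISTR's implementation: a Jacobi (diagonal) preconditioner stands in for AMG, and all names are illustrative.

```python
import numpy as np

def pcg(K, b, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradient sketch for an SPD matrix K.
    A Jacobi (diagonal) preconditioner stands in for AMG."""
    P_inv = 1.0 / np.diag(K)          # P^-1 for a diagonal preconditioner
    x = np.zeros_like(b)
    r = b - K @ x                     # residual r_s = b - K x_s
    z = P_inv * r                     # preconditioned residual
    p = z.copy()                      # search direction
    rz = r @ z
    for s in range(max_iter):
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break                     # convergence: residual below solver tolerance
        Kp = K @ p
        alpha = rz / (p @ Kp)
        x += alpha * p                # correct the solution vector x
        r -= alpha * Kp
        z = P_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p     # new direction from preconditioned residual
        rz = rz_new
    return x
```

Raising `tol` stops the loop earlier at the cost of a less accurate x, which is exactly the time-error trade-off exploited later by the auto-tuner.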
Summary of nonlinear analysis
The simulation computes the displacement values through several load steps.
Each load step relies on the Newton-Raphson method to solve iteratively the equation [K] Δr = ΔR.
Each NR iteration relies on an iterative solver to compute an approximation of the solution.
The outer loop (NR) and inner loop (solver) convergences are controlled by their corresponding tolerance parameters.

Nonlinear analysis is a nested loop.
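The nested structure above can be sketched with a toy one-degree-of-freedom hardening spring; all values and names are illustrative, not FrontISTR's.

```python
import numpy as np

# Toy 1-DOF hardening spring: internal force N(r) = k0*r + c*r**3,
# tangent stiffness K(r) = dN/dr = k0 + 3*c*r**2.
k0, c = 10.0, 2.0
N = lambda r: k0 * r + c * r**3
K = lambda r: k0 + 3 * c * r**2

def nonlinear_analysis(R_total, n_load_steps=3, nr_tol=1e-10):
    r = 0.0
    for n in range(1, n_load_steps + 1):      # outer loop: load steps
        R = R_total * n / n_load_steps        # load applied in increments
        while abs(R - N(r)) >= nr_tol:        # middle loop: Newton-Raphson
            # Inner solve K*dr = R - N(r). For 1 DOF this is a division;
            # in FrontISTR it is an iterative solver (CG) with its own
            # solver tolerance -- the innermost loop.
            dr = (R - N(r)) / K(r)
            r = r + dr                        # r_{n+1} = r_n + dr
    return r

r = nonlinear_analysis(5.0)
```

The inner linear solve dominates the cost in real 3D models, which is why its tolerance is the tuning target.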
Performance analysis
We conducted a performance analysis of SA simulations based on the nonlinear algorithm.

The goal is to clarify the impact of the solver, the preconditioner and the tolerance value on simulation time and accuracy.

Our experimental environment:
- Intel Xeon machine (72 CPU cores)
- Main memory (RAM): 128 GB
- Operating system: CentOS 7.2
- Simulation software: FrontISTR v5

We use OpenMP multithreading for running the simulation with either a single thread or multiple threads.
Performance analysis: Application models
We used 3 different FrontISTR application models: Plastic Can (a), Hyperelastic Spring (b) and Ball Grid Array (c).

[Figure: the three application models (a), (b), (c)]

[Table: default simulation parameters for each application model]
Performance analysis: best solver/preconditioner
First, we evaluate the performance of the simulation using different combinations of solver method and preconditioner.

[Figure: execution time (sec) of solvers CG, BiCGSTAB, GMRES and GPBiCG combined with preconditioners SSOR_1, SSOR_2, Diag_Scal, AMG, ILU_0, ILU_1 and ILU_2. Results for the Plastic Can model using 72 threads; CG with AMG is the fastest.]
Performance analysis: best solver/preconditioner
[Figure: execution time (sec) of solvers CG, BiCGSTAB, GMRES and GPBiCG combined with preconditioners SSOR_1, SSOR_2, Diag_Scal, AMG, ILU_0, ILU_1 and ILU_2. Results for the Elastic model using 72 threads; some simulations fail to converge; CG with AMG is the fastest.]
Performance analysis: best solver/preconditioner

[Figure: execution time (sec) of solvers CG, BiCGSTAB, GMRES and GPBiCG combined with preconditioners SSOR_1, SSOR_2, Diag_Scal, AMG, ILU_0, ILU_1 and ILU_2. Results for the BGA model using 72 threads; CG with AMG is the fastest.]

CG-AMG provides the fastest convergence for our 3 models. Thus, we keep this combination fixed for the remainder of our study.
Performance analysis: different solver tolerance
Then, we evaluate the performance of the simulation using different solver tolerance values, with a fixed model and CG-AMG.

[Figure: execution time (sec) and Mean Absolute Error vs. solver tolerance, from 1.00E-08 to 0.75. Results for the Plastic Can model using CG-AMG and 72 threads; an intermediate tolerance gives the best time-error trade-off.]
Performance analysis: different solver tolerance

The Mean Absolute Error (MAE) is computed from the exact solution provided by a direct solver.

[Figure: execution time (sec) and Mean Absolute Error vs. solver tolerance, from 1.00E-08 to 0.75. Results for the Elastic model using CG-AMG and 72 threads; an intermediate tolerance gives the best time-error trade-off.]
Performance analysis: different solver tolerance
[Figure: execution time (sec), broken down into iterative solver (CG), Newton-Raphson pre-processing, Newton-Raphson post-processing and others, and Mean Absolute Error vs. solver tolerance, from 1.00E-08 to 0.75. Results for the BGA model using CG-AMG and 1 thread; an intermediate tolerance gives the best time-error trade-off.]

Performance analysis shows there is an optimal tolerance value considering the simulation time and error trade-off.
Performance analysis: residual evolution
The shape of the residual error curves changes with respect to the current solver tolerance. Each slope is an NR iteration, while each point is a single solver iteration.

[Figure: residual error (1E-08 to 1, log scale) vs. solver iteration (0 to 200) for solver tolerances 1.00E-08, 1.00E-04 and 0.025]
Performance analysis: the take away
1. The iterative solver in the NR method is the main computational part of nonlinear analysis. → Important to accelerate
2. For our three models, CG-AMG is the fastest combination.
3. Low tolerance value → long execution time, small error.
4. Big tolerance value → small execution time, big error.
5. There is an optimal tolerance value which optimizes solver convergence time and accuracy of results. → Need auto-tuning
6. The residual error is directly correlated to the solver tolerance and is continuously generated by the solver. → Data is available
Proposal
Proposal overview
Based on the performance analysis, we developed an auto-tuning of the solver tolerance inside the NR runtime to optimize simulation performance.

Our proposal is based on 3 core components:
1. A CNN model aware of the time-error trade-off.
2. A quantification of the tolerance update based on the Softmax probability output.
3. An in-memory data transfer for minimizing the overhead of AI inference.

An application of the proposal is structural design optimization:

[Figure: design loop: input (spatial representation, physical properties, condition of experiment) → computer simulation → results → manufacturing standard met? yes → prototyping; no → iterate. The computer simulation is the bottleneck of design iteration.]
Training data for time-error trade-off

Training data are generated while running simulations using different solver tolerances (~70 samples per simulation).
Residual values are accumulated and converted into a binary image in a periodical manner (e.g., every 2 NR iterations).
According to the tolerance value used during the simulation, we assign a label to each image: "increase" ([1,0]) or "decrease" ([0,1]).
At inference, the CNN predicts whether the current tolerance value should be increased or decreased for performance improvement.
The choice of label depends on a specific policy: speed or accuracy concern.

[Table: labelling of the tolerance values 1e-08 to 0.75 under the speed and accuracy policies. Example of labelling from the BGA simulation results of page 21; fastest: 0.025, best trade-off: 1e-04.]
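A possible sketch of this data-generation step is shown below; the rasterization scheme and the thresholding rule are assumptions made for illustration, not the paper's exact preprocessing.

```python
import numpy as np

def residuals_to_image(residuals, size=256, lo=1e-8, hi=1.0):
    """Rasterize a residual-error history into a binary image.
    Column = solver iteration (resampled to `size` columns), row =
    log-scaled residual magnitude; pixels at or below the curve are
    set to 1. The exact rasterization scheme is an assumption."""
    img = np.zeros((size, size), dtype=np.uint8)
    idx = np.linspace(0, len(residuals) - 1, size).astype(int)
    logr = np.log10(np.clip(np.asarray(residuals)[idx], lo, hi))
    rows = ((logr - np.log10(lo)) / (np.log10(hi) - np.log10(lo))
            * (size - 1)).astype(int)
    for col, row in enumerate(rows):
        img[:row + 1, col] = 1
    return img

def label(tolerance, policy_target):
    """Label the image: [1, 0] = "increase" if the tolerance used is
    below the policy's target (e.g. 0.025 for the speed concern,
    1e-04 for the accuracy concern), else [0, 1] = "decrease"."""
    return [1, 0] if tolerance < policy_target else [0, 1]
```

With this scheme, each (image, label) pair teaches the CNN which direction of tolerance change moves the simulation toward the policy's target.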
CNN architecture and hyperparameters
We trained a CNN using Keras 2.2.4 (TensorFlow backend) and two NVIDIA Tesla GPUs.
- AlexNet-like architecture (Conv-Pool-ReLU): Conv1, Conv2, Conv3, FC1, FC2
- Input layer: single-channel 256x256 image
- Training algorithm: Adam optimizer
- Loss function: binary cross-entropy
- Final activation layer: Softmax with two classes: increase (p) and decrease (1-p)
- Samples: 6706; epochs: 100; training accuracy: 98%; validation accuracy: 97%
System: Auto-Tuning of Solver Tolerance
We modified the conventional Newton-Raphson algorithm and incorporated an AI-based auto-tuning of the solver tolerance. Our proposal does not interfere with the nonlinear-level (NR) convergence, thus consistent and correct simulation results are guaranteed.

Modified simulation algorithm (simulation on CPU, CNN inference on GPU):

    while ( |ΔR_n| ≥ NR_tolerance ) do           // NR loop
        if ( period is true ) then
            send residual data to the CNN
            [p, 1-p] ← AI_inference()            // p: probability of increase,
                                                 // 1-p: probability of decrease
            adjust(solver_tolerance, [p, 1-p])
        while ( r_s ≥ solver_tolerance ) do      // solver loop
            ...
Control of tolerance tuning
After inference, the returned probability values are used to update the current tolerance value between NR iterations. This is implemented by our custom 'adjust' function (see page 28). The modification (increase or decrease) is proportional to the level of confidence expressed by the probability value.

Very low probability → strong decrease
Low probability → slight decrease
High probability → slight increase
Very high probability → strong increase
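The slides give only this qualitative behaviour of 'adjust'; one way to realize a confidence-proportional multiplicative update is sketched below. The exponential rule, the gain and the bounds are assumptions for illustration, not the paper's exact formula.

```python
def adjust(solver_tolerance, probs, gain=4.0, lo=1e-8, hi=0.75):
    """Update the solver tolerance from the Softmax output
    [p, 1-p], where p is the probability of "increase".
    p = 0.5 leaves the tolerance unchanged; the further p is from
    0.5, the stronger the multiplicative change. The exponential
    rule and gain are illustrative assumptions; the bounds lo/hi
    match the tolerance range explored in the performance analysis."""
    p = probs[0]
    factor = 10.0 ** (gain * (p - 0.5))   # confidence-proportional step
    return min(hi, max(lo, solver_tolerance * factor))
```

For example, adjust(1e-8, [0.9, 0.1]) multiplies the tolerance by roughly 40, while adjust(1e-8, [0.55, 0.45]) nudges it up by less than a factor of 2.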
Incorporate AI inference at minimal overhead
AI inference is very fast thanks to efficient accelerators. However, we must minimize the overhead of data exchanges between the simulation and neural network processes.

To transfer data at low latency, we developed an in-memory data path relying on shared memory using a memory-mapped file (mmap). This approach is faster than file I/O (read/write).
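A minimal sketch of such an in-memory data path using Python's mmap module; the file layout, path and sizes are illustrative, not the system's actual protocol.

```python
import mmap, os, struct, tempfile

# Illustrative layout: an 8-byte count followed by a fixed-size array
# of float64 residual values, shared between the solver process
# (writer) and the inference process (reader) via a memory-mapped file.
N_RESIDUALS = 256
SIZE = 8 + 8 * N_RESIDUALS

path = os.path.join(tempfile.mkdtemp(), "residuals.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * SIZE)          # pre-size the backing file

# Writer side (simulation): store residuals via memory assignment,
# avoiding per-exchange read/write system calls.
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE)
    residuals = [1e-2, 3e-4, 5e-6]
    mm[0:8] = struct.pack("q", len(residuals))
    mm[8:8 + 8 * len(residuals)] = struct.pack(
        f"{len(residuals)}d", *residuals)
    mm.flush()
    mm.close()

# Reader side (inference process): map the same file and read directly.
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE)
    (count,) = struct.unpack("q", mm[0:8])
    shared = list(struct.unpack(f"{count}d", mm[8:8 + 8 * count]))
    mm.close()
```

Both processes see the same physical pages, so the exchange costs a memory copy rather than two file I/O round trips.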
Experimental Evaluation
Test model for evaluation
We evaluate our proposal using the BGA model, a real application model used in chip-carrier design.

[Figure: the BGA test model]

The simulation conditions used at the test phase are different from those used in the simulations that generated the training data.

We evaluate the capability of the model to extrapolate the tolerance tuning, as the values are outside the range used at training.
Performance results on real application
[Figure: simulation time (sec) and Mean Absolute Error for [A] Baseline: simulation without AI, [B] Proposal: accuracy concern (x1.20 speed-up, 0.00193% error) and [C] Proposal: speed concern (x1.58 speed-up, 0.279% error)]

The baseline simulation uses a default, fixed tolerance of 1e-08.
The auto-tuned simulations (B, C) use AI to update the tolerance every 2 NR iterations, starting from the default value of 1e-08.
The difference between cases B and C is the tuning policy: the two AI models were trained with different labels.
We achieve a 1.58x speed-up with an error rate of 0.279% (case C).
Solver tolerance auto-tuning
Auto-tuned simulations (B,C) use AI to update the tolerance every 2 NR iterations, starting from default value at 1e-08.
The tolerance evolution during the auto-tuned simulation: a fast increase at the beginning, then a smooth convergence towards the optimal tolerance value around 1e-04.

[Figure: solver tolerance (1E-08 to 0.01, log scale) vs. NR iterations. Fast & smooth: adjustment with the Softmax probability.]
Conclusion
We proposed an AI method to accelerate HPC simulations based on the Newton-Raphson algorithm, a numerical method used in several engineering fields.

We incorporated AI inference inside simulations and showed how AI models can replace a part of parameter tuning.

On a real application, we achieved:
- 1.58x faster simulation with a 0.279% error rate (speed concern)
- 1.20x faster simulation with a 0.00193% error rate (accuracy concern)

HPC simulations generate large amounts of internal data, so AI algorithms represent an opportunity to accelerate HPC applications such as design space exploration.
Thank you for listening !
For any questions, please send email to: [email protected]