Recent advances on the solution phase of direct solvers with multiple sparse right-hand sides
P. Amestoy (1), J.-Y. L’Excellent (2), G. Moreau (2)
(1) Université de Toulouse, INPT and IRIT; (2) Université de Lyon, Inria and LIP-ENS Lyon
Mumps User’s Days, Montbonnot, June 1-2 2017
SIAM CSE 17, Atlanta, Georgia
Introduction
Linear systems of equations: Ax = b, where A is sparse.
The solve phase (Ly = b, Ux = y) may be critical.
[Figure: 3D model with axes Depth (km), Dip (km) and Cross (km); color scale from 3000 to 6000 m/s]
Applications coming from Helmholtz or Maxwell equations:
name     n (million)   nrhs   nnz/nrhs   Tfacto   Tsolve
sei70m   2.9           2302   587        1258     1267
sei50m   7.1           2302   486        6289     2985
E1       0.33          8000   9.8        55.2     291
E3       2.8           8000   7.5        1951     5610
Table: Characteristics of matrices and right-hand sides.
Introduction
Objectives:
• focus on the forward solution phase Ly = b;
• exploit sparsity of right-hand-sides;
• limit the number of operations (∆);
Overview
• Exploitation of sparse right-hand sides
  ◦ Context of study
  ◦ Tree pruning
  ◦ Exploitation of subintervals of columns at each node
• Minimizing the number of operations
  ◦ Permutation of columns
  ◦ Adapted blocking technique
• Conclusion
Context: analysis phase
Ordering: reorder the variables of the matrix A to reduce fill-in and build the elimination tree:
• Nested Dissection ⇒ build tree of separators.
[Figure: 3D physical domain (cube); separator tree; L factor]
Forward solution process
[Figure: factor blocks at nodes u7, u14 and u15]
Block operations:
• y_1 ← L_{11}^{-1} b_1
• b_2 ← b_2 − L_{21} y_1
#flops for node u is given by:
F_u = 2 × (#entries in L_{11} + #entries in L_{21})
Total #flops: ∆ = ∑_{u ∈ T} F_u
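The two block operations and the flop count can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the solver's implementation: the blocks are dense and stored as lists of rows, L_{11} is counted as a lower triangle including its diagonal, and the names forward_step and node_flops are hypothetical.

```python
def forward_step(L11, L21, b1, b2):
    """Process one node: y1 = L11^{-1} b1, then b2 <- b2 - L21 y1.

    L11 is a dense lower-triangular block, L21 the dense block below it
    (both given as lists of rows); b1 and b2 are the matching pieces of b.
    """
    n1 = len(L11)
    y1 = [0.0] * n1
    for i in range(n1):  # forward substitution with the triangular block
        partial = sum(L11[i][j] * y1[j] for j in range(i))
        y1[i] = (b1[i] - partial) / L11[i][i]
    # update of the part of b that is passed on to the ancestors
    b2_updated = [b2[i] - sum(L21[i][j] * y1[j] for j in range(n1))
                  for i in range(len(L21))]
    return y1, b2_updated


def node_flops(n1, n2):
    """F_u = 2 * (#entries in L11 + #entries in L21) for dense blocks.

    Assumption: L11 is n1 x n1 and counted as a lower triangle with its
    diagonal; L21 is n2 x n1 and counted in full.
    """
    return 2 * (n1 * (n1 + 1) // 2 + n2 * n1)


# Toy node with a 2 x 2 triangular block and a 1 x 2 off-diagonal block.
y1, b2 = forward_step([[2.0, 0.0], [1.0, 1.0]], [[1.0, 1.0]], [2.0, 3.0], [5.0])
```

With several right-hand sides, b1 and b2 become matrices and the two operations turn into a triangular solve and a matrix product, which is where BLAS 3 kernels come in.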
Forward solution process
Forward solve phase processes the tree from bottom to top:
[Figure: nonzero structure of the right-hand side over nodes u1 to u15 (× : initial non-zeros, f : filled non-zeros) and the corresponding elimination tree]
Computation follows paths in the tree T [Gilbert, 1994].
↪→ Tree pruning (T → T_p(b)) to reduce computation:
∆ = ∑_{u ∈ T_p(b)} F_u
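Tree pruning can be sketched as follows: starting from every node that holds a nonzero of b, climb to the root and keep the visited nodes. The parent map, the 7-node tree and the per-node costs below are made-up illustration data, and pruned_tree and pruned_flops are hypothetical names.

```python
def pruned_tree(parent, nonzero_nodes):
    """Return T_p(b): all nodes on a path from a node holding a nonzero of b to the root."""
    kept = set()
    for u in nonzero_nodes:
        # climb towards the root, stopping early on an already visited path
        while u is not None and u not in kept:
            kept.add(u)
            u = parent[u]
    return kept


def pruned_flops(parent, flops, nonzero_nodes):
    """Delta = sum of F_u over the nodes u of the pruned tree."""
    return sum(flops[u] for u in pruned_tree(parent, nonzero_nodes))


# Hypothetical 7-node tree: leaves 1, 2 under node 3; leaves 4, 5 under node 6; root 7.
parent = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: None}
flops = {u: 10 for u in parent}  # made-up per-node costs F_u
```

On this toy tree, a right-hand side whose only nonzeros sit in node 1 visits three of the seven nodes, so the pruned cost is 30 instead of 70.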
Exposition of padded zeros
When B is a matrix with multiple columns:
• use of BLAS 3 operations for efficiency;
• T_p(B) = ⋃_i T_p(B_i), where B_i is column i of B;
[Figure: nonzero structure of B with nrhs = 6 columns over nodes u1 to u15 (× : initial non-zeros, f : filled non-zeros) and the elimination tree]
But still, extra computations are done:
∆ = nrhs × ∑_{u ∈ T_p(B)} F_u
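A small sketch makes the extra work visible by comparing the blocked cost nrhs × ∑ F_u over the union tree with the cost of treating each column on its own pruned tree. The tree, the unit costs and the function names are hypothetical illustration choices.

```python
def pruned_tree(parent, nonzero_nodes):
    """Nodes on the paths from the given nodes up to the root."""
    kept = set()
    for u in nonzero_nodes:
        while u is not None and u not in kept:
            kept.add(u)
            u = parent[u]
    return kept


def delta_blocked(parent, flops, columns):
    """Delta = nrhs * sum of F_u over T_p(B), the union of the T_p(B_i)."""
    union = set()
    for nonzero_nodes in columns:
        union |= pruned_tree(parent, nonzero_nodes)
    return len(columns) * sum(flops[u] for u in union)


def delta_min(parent, flops, columns):
    """Cost when each column B_i is processed alone on its own T_p(B_i)."""
    return sum(flops[u]
               for nonzero_nodes in columns
               for u in pruned_tree(parent, nonzero_nodes))


# Hypothetical 7-node tree with unit costs; two columns in different subtrees.
parent = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: None}
flops = {u: 1 for u in parent}
columns = [{1}, {4}]  # nodes holding the nonzeros of each column
```

Here the union tree has five nodes, so the blocked cost is 2 × 5 = 10, whereas the two columns processed separately cost 3 + 3 = 6: the difference is work on padded zeros.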
Solutions
What are the possible alternatives?
• Indirections: rebuilding data structures;
• Sequential: solution phase on each column ⇒ optimal (∆ = ∆min) but not efficient;
• Regular blocking: how to build blocks?
  ◦ minimal access to factors (out of core) [Amestoy et al., SISC, 2012];
  ◦ minimal number of operations (in core) [Yamazaki et al., 2013];
• Exploitation of subintervals of columns at each node [Amestoy et al., SISC, 2015].
Work on node subintervals
Let u ∈ T.
Active columns at node u:
Z_u = {i ∈ {1, ..., m} | u ∈ T_p(B_i)}
The size of the subinterval is given by:
θ_u = max(Z_u) − min(Z_u) + 1
Example: θ_{u1} = 1, θ_{u10} = 6
∆ = ∑_{u ∈ T_p(B)} F_u × θ_u
[Figure: nonzero structure of B over nodes u1 to u15 for two column orderings (× : initial non-zeros, f : filled non-zeros)]
Column order 1 2 3 4 5 6:
θ1 = 1, θ2 = 1, θ3 = 4, θ4 = 1, θ5 = 1, θ6 = 4, θ7 = 5, θ8 = 6, θ9 = 0, θ10 = 6, θ11 = 1, θ12 = 1, θ13 = 3, θ14 = 6, θ15 = 6
Column order 1 4 2 5 6 3:
θ1 = 1, θ2 = 1, θ3 = 2, θ4 = 1, θ5 = 1, θ6 = 2, θ7 = 4, θ8 = 5, θ9 = 0, θ10 = 5, θ11 = 1, θ12 = 1, θ13 = 3, θ14 = 6, θ15 = 6
↪→ ∆ is extremely dependent on the column permutation.
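The definitions of Z_u and θ_u translate directly into code, which also shows how a permutation of the columns changes ∆. The sketch below uses a hypothetical 7-node tree with unit costs, not the 15-node example of the slides; active_columns and delta_intervals are illustrative names.

```python
def active_columns(parent, columns):
    """Z_u: 1-based positions of the columns whose pruned tree contains node u."""
    Z = {}
    for pos, nonzero_nodes in enumerate(columns, start=1):
        seen = set()
        for u in nonzero_nodes:
            # climb to the root, recording each node once per column
            while u is not None and u not in seen:
                seen.add(u)
                Z.setdefault(u, []).append(pos)
                u = parent[u]
    return Z


def delta_intervals(parent, flops, columns):
    """Delta = sum over T_p(B) of F_u * theta_u, theta_u = max(Z_u) - min(Z_u) + 1."""
    Z = active_columns(parent, columns)
    return sum(flops[u] * (max(z) - min(z) + 1) for u, z in Z.items())


# Hypothetical tree: leaves 1, 2 under node 3; leaves 4, 5 under node 6; root 7.
parent = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: None}
flops = {u: 1 for u in parent}

natural = [{1}, {4}, {2}]    # columns 1 and 3 share node 3 but are not adjacent
permuted = [{1}, {2}, {4}]   # same columns, neighbours in the tree made adjacent
```

In the natural order node 3 is active for columns 1 and 3 only, yet θ = 3 because column 2 lies in between; after the permutation θ = 2 at node 3 and ∆ drops from 10 to 9, which is the minimum for this toy case.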
Problem statement & algorithms
Goal is to minimize or decrease ∆ = ∑_{u ∈ T_p(B)} F_u × θ_u:
• find a permutation σ of the columns to decrease θ_u, ∀u ∈ T_p(B);
• in case of blocking, minimize the number of blocks.
Proposed heuristics:
• based on geometrical properties (Nested Dissection);
• generalization possible thanks to pruned tree Tp(B).
Flat Tree Algorithm
Intuition based on a simple 2D example:
[Figure: 2D physical domain split by the root separator u0 into subdomains u1 and u2; structure of B with the right-hand sides grouped into sets (a, b, c, ...); elimination tree with root u0, children u1 (with u11, u12) and u2 (with u21, u22)]
• Nested Dissection ⇒ partition right-hand sides into 3 sets (a, b, c);
• θ_{u1} = a + c + b before permutation ⇒ θ_{u1} = a + b after permutation (a, b, c denote here the sizes of the sets);
↪→ Top-down approach + local optimisation for the nodes at the current layer in the tree.
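The intuition for one layer can be sketched as follows: columns that touch only u1 (set a) go first, columns that touch both subdomains (set b) go in the middle, and columns that touch only u2 (set c) go last. This is a simplified reading of the slide for the root layer only, not the full Flat Tree algorithm; the function names and the input encoding are assumptions.

```python
def flat_order_at_root(touches):
    """Order columns as a (only u1), b (both), c (only u2), then the rest.

    touches[i] is the set of subdomains among {"u1", "u2"} in which
    column i has nonzeros; an empty set means the separator only.
    """
    a = [i for i, t in enumerate(touches) if t == {"u1"}]
    b = [i for i, t in enumerate(touches) if t == {"u1", "u2"}]
    c = [i for i, t in enumerate(touches) if t == {"u2"}]
    rest = [i for i, t in enumerate(touches) if not t]
    return a + b + c + rest


def theta(order, touches, child):
    """Width of the interval of positions whose column is active in `child`."""
    pos = [p for p, i in enumerate(order) if child in touches[i]]
    return max(pos) - min(pos) + 1 if pos else 0


# Toy input: 2 columns in set a, 1 in set b, 2 in set c, interleaved.
touches = [{"u1"}, {"u2"}, {"u1", "u2"}, {"u1"}, {"u2"}]
natural = list(range(len(touches)))
flat = flat_order_at_root(touches)
```

With this input the interval at u1 shrinks from 4 columns to a + b = 3, and the interval at u2 from 4 to b + c = 3.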
Results on the Flat Tree
flops: normalized with the dense case; Ordering: Nested Dissection;
[Figure: bar chart of normalized flops (scale 0.0 to 0.6) for matrices E1, E3, sei70m and sei50m, with strategies TP, INT, PO, FT and LB]
Strategies:
• TP = tree pruning only;
• INT = tree pruning + node interval + natural order;
• PO = tree pruning + node interval + Postorder;
• FT = tree pruning + node interval + Flat Tree;
• LB = Lower Bound (∆min).
↪→ Still 28% above the lower bound on one case.
Adapted blocking technique
Objective: decrease ∆ with the creation of a minimum number of groups.
[Figure: structure of B over nodes u1 to u15 with columns permuted as 4 2 1 5 6 3, shown as a single block, as one group per column, and as two groups (4 2 6 3) and (1 5)]
Computations on explicit zeros still exist.
∆min may be obtained by creating nrhs groups:
• however, this does not perform well (loss of BLAS 3 operations);
• need to find some property to group right-hand sides together without introducing extra operations.
Adapted blocking technique
Principle (1): group sets of right-hand sides that belong to different subdomains (starting with the root separator).
[Figure: 2D domain split into regions a, b, c; structure of B]
• the non-zero structures of a and c are disjoint;
• whereas b may have the non-zero structure of both a and c;
• thus, we extract the right-hand sides of b.
Principle (2): extract the set of right-hand sides that belong to both subdomains (starting with the root separator).
Comparison with a regular blocking strategy
Our blocking algorithm (BLK):
• greedy algorithm to choose the next group;
• stop condition: ∆ < ∆tol, where ∆tol = 1.01 ∆min.
Regular blocking algorithm (REG):
• split into chunks of regular size;
• stop condition: ∆ < ∆tol, where ∆tol = 1.01 ∆min.
nb groups   5Hz   7Hz   E1    E3
REG         328   255   363   258
BLK           3     3     4     4
Table: Number of groups created by each strategy, with a tolerance such that ∆ < ∆tol = 1.01 × ∆min.
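The regular strategy can be sketched as: try chunk sizes from the largest to the smallest and stop as soon as the summed cost of the chunks falls below ∆tol. This is a guess at the mechanics of REG under stated assumptions (the cost of a group is ∑ F_u × θ_u restricted to that group, and the chunk size decreases by one at each trial); the tree, costs and function names are hypothetical.

```python
def pruned_tree(parent, nonzero_nodes):
    """Nodes on the paths from the given nodes up to the root."""
    kept = set()
    for u in nonzero_nodes:
        while u is not None and u not in kept:
            kept.add(u)
            u = parent[u]
    return kept


def delta_min(parent, flops, columns):
    """Cost when every column is processed alone."""
    return sum(flops[u] for nz in columns for u in pruned_tree(parent, nz))


def delta_intervals(parent, flops, columns):
    """Sum of F_u * theta_u for one group of columns, in the given order."""
    first, last = {}, {}
    for pos, nz in enumerate(columns):
        for u in pruned_tree(parent, nz):
            first.setdefault(u, pos)
            last[u] = pos
    return sum(flops[u] * (last[u] - first[u] + 1) for u in first)


def regular_blocking(parent, flops, columns, tol=1.01):
    """Largest regular chunk size whose total cost is below tol * delta_min."""
    target = tol * delta_min(parent, flops, columns)
    groups = [[c] for c in columns]           # fallback: one column per group
    delta = delta_min(parent, flops, columns)
    for size in range(len(columns), 0, -1):
        trial = [columns[k:k + size] for k in range(0, len(columns), size)]
        cost = sum(delta_intervals(parent, flops, g) for g in trial)
        if cost < target:                     # stop condition: Delta < Delta_tol
            return trial, cost
    return groups, delta


parent = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: None}
flops = {u: 1 for u in parent}
groups, cost = regular_blocking(parent, flops, [{1}, {4}, {2}])
```

On this toy input a single block costs 10, which is above 1.01 × ∆min = 9.09, so the sketch settles on two chunks with a total cost of 9.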
Conclusion
Achievements:
• implementation of two heuristics (permutation, blocking);
• 90% decrease in flops by exploiting sparsity;
• up to 40% decrease in time for the forward solve w.r.t. the INT strategy and Nested Dissection ordering (sequential).
Perspectives:
• adapt the Flat Tree algorithm to unbalanced trees;
• parallelism and sparsity aspects of Flat Tree permutation;
• extend to more general test cases.
Acknowledgements
• LIP laboratory for access to the machines;
• EMGS and SEISCOPE for providing test cases;
• This work was performed within the frameworks of both the MUMPS consortium and the LABEX MILYON (ANR-10-LABX-0070) of Université de Lyon, within the program "Investissements d’Avenir" (ANR-11-IDEX-0007) operated by the French National Research Agency (ANR).
Thanks! Questions?