Solving Finite Element Systems with 17 Billion Unknowns
at Sustained Teraflops Performance
Lehrstuhl für Informatik 10 (Systemsimulation)
Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Erlangen, July 21, 2006
B. Bergen (LSS Erlangen/Los Alamos)
T. Gradl (LSS Erlangen)
F. Hülsemann (LSS Erlangen/EDF)
U. Rüde (LSS Erlangen, [email protected])
G. Wellein (RRZE Erlangen)
Overview

Motivation

Applications of Large-Scale FE Simulations
- Direct and inverse bio-electric field problems
- 3D image registration / image processing / analysis
- Flow-induced noise and acoustics
- Multiphysics problems in nano technology
- Simulation of high-temperature furnaces

Scalable Finite Element Solvers
- Scalable algorithms: multigrid
- Scalable architecture: Hierarchical Hybrid Grids (HHG)
- Results

Outlook: Towards Peta-Scale FE Solvers
Part I: Motivation
A little quiz ...

1 Kflops = 10^3, 1 Mflops = 10^6, 1 Gflops = 10^9, 1 Tflops = 10^12, 1 Pflops = 10^15 floating point operations per second

- What is the speed of your PC?
- What is the speed of the fastest computer currently available (and where is it located)?
- What was the speed of the fastest computer in 1995? 2000? 2005?
What is the speed of your PC? Probably between 1 and 6.5 Gflops.
What is the speed of the fastest computer currently available? 367 Tflops: a Blue Gene/L in Livermore, California, with more than 130,000 processors.
What was the speed of the fastest computer in 1995? 0.236 Tflops. In 2000? 12.3 Tflops. In 2005? 367 Tflops.

... and how much has the speed of cars/airplanes/... improved in the same time?

Additional question: when do you expect computers to exceed 1 Pflops?
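The extrapolation behind that last question takes only a few lines. This is a trend-line estimate using just the Top500 numbers quoted above; the growth rate and crossover year are derived, not measured:

```python
import math

# Top500 #1 grew from 0.236 Tflops (1995) to 367 Tflops (2005).
growth_per_year = (367.0 / 0.236) ** (1.0 / 10.0)   # annual growth factor, ~2.1x

# Years after 2005 until the trend line crosses 1 Pflops (= 1000 Tflops):
years_to_pflops = math.log(1000.0 / 367.0) / math.log(growth_per_year)
year_exceeds_1_pflops = 2005.0 + years_to_pflops
```

On this trend line the crossover falls around 2006/2007; the actual date of course depends on which machines actually get built, which is exactly the point of the question.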
Architecture Example: Our Pet Dinosaur (now in retirement)

HLRB-I: Hitachi SR 8000 at the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften (No. 5 at time of installation in 2000)
- 8 processors and 8 GB per node
- 1344 CPUs (168 × 8)
- 2016 Gflops total
- Very sensitive to data structures

Currently being replaced by HLRB-II: 4096-processor SGI Altix, upgrade to >60 Tflops in 2007

KONWIHR = Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen
- Project Gridlib (2001-2005)
- Development of Hierarchical Hybrid Grids (HHG)
HHG Motivation I: Structured vs. Unstructured Grids (on Hitachi SR 8000)

gridlib/HHG Mflops rates for matrix-vector multiplication on one node of the Hitachi, compared with highly tuned JDS results for sparse matrices.

[Bar chart: Mflops (0-8000) for JDS vs. stencil kernels at 729; 35,937; and 2,146,689 unknowns]
HHG Motivation II: the DiMe Project
Data-Local Iterative Methods for the Efficient Solution of Partial Differential Equations

- Cache optimizations for sparse matrix codes
- High-performance multigrid solvers
- Efficient LBM codes for CFD

Collaboration project, funded 1996 - 20xy by DFG:
- LRR München (Prof. Bode)
- LSS Erlangen (Prof. Rüde)

www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/
HHG Motivation III

Memory for the discretization is often the limiting factor on the size of problem that can be solved.

Common practice: add resolution by applying regular refinement to an unstructured input grid. The refinement does not add new information about the domain ⇒ CRS (or any other sparse matrix structure) is overkill.

Missed opportunity: the regularity of the structured patches is not exploited.

HHG: develop new data structures that exploit regularity for enhanced performance; employ stencil-based discretization techniques on structured patches to reduce memory usage.
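A rough estimate shows why sparse matrix storage is the limiting factor at this scale. The nonzeros-per-row figure below is an assumption for linear tetrahedral elements, not a number from the slides:

```python
n = 1.7e10          # unknowns, as in the title of the talk
nnz_per_row = 15    # assumed average nonzeros per row for linear tetrahedra

# CRS: 8-byte value + 4-byte column index per nonzero, plus 4-byte row pointers
crs_bytes = n * nnz_per_row * (8 + 4) + (n + 1) * 4
vector_bytes = 8 * n                      # one solution vector, for comparison

crs_terabytes = crs_bytes / 1e12          # roughly 3 TB for the matrix alone
vector_gigabytes = vector_bytes / 1e9     # 136 GB per vector
```

A constant stencil per structured patch replaces the per-row matrix storage entirely, which is what makes problems of this size fit in memory at all.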
Part II: Applications of Large-Scale Finite Element Simulations
Example: Flow-Induced Noise

Flow around a small square obstacle: relatively simple geometries, but fine resolution for resolving the physics.

Images by M. Escobar (Dept. of Sensor Technology + LSTM)
Part III-a
Towards Scalable FE Software: Multigrid Algorithms
What is Multigrid?

Has nothing to do with "grid computing". It is a general methodology:
- multi-scale (actually it is the "original" multi-scale method)
- many different applications
- developed in the 1970s - ...

Useful e.g. for solving elliptic PDEs:
- large sparse systems of equations
- iterative
- convergence rate independent of problem size
- asymptotically optimal complexity -> algorithmic scalability!
- can solve e.g. a 2D Poisson problem in ~30 operations per grid point
- efficient parallelization - if one knows how to do it
- best (maybe the only?) basis for fully scalable FE solvers
Multigrid: V-Cycle

Goal: solve A_h u_h = f_h using a hierarchy of grids.

[V-cycle diagram: relax on the fine grid, compute the residual, restrict it to the coarser grid, solve there (by recursion), interpolate the correction back, and correct; repeat over the grid hierarchy.]
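The V-cycle can be written down compactly. The following is a minimal 1D model (weighted Jacobi smoother, full-weighting restriction, linear interpolation); the function names and the model problem are illustrative, not taken from the HHG code:

```python
import math

def relax(u, f, h, sweeps=2, omega=2.0 / 3.0):
    # Weighted Jacobi smoothing for -u'' = f with Dirichlet boundaries.
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = u[i] + omega * (0.5 * (u[i-1] + u[i+1] + h*h*f[i]) - u[i])
        u[:] = new
    return u

def residual(u, f, h):
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0*u[i] - u[i-1] - u[i+1]) / (h*h)
    return r

def restrict_fw(r):
    # Full weighting onto the next coarser grid (2^k+1 -> 2^(k-1)+1 points).
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for i in range(1, nc - 1):
        rc[i] = 0.25*r[2*i-1] + 0.5*r[2*i] + 0.25*r[2*i+1]
    return rc

def prolong(ec):
    # Linear interpolation back to the finer grid.
    ef = [0.0] * (2*(len(ec) - 1) + 1)
    for i in range(len(ec)):
        ef[2*i] = ec[i]
    for i in range(len(ec) - 1):
        ef[2*i+1] = 0.5*(ec[i] + ec[i+1])
    return ef

def v_cycle(u, f, h):
    if len(u) == 3:                     # coarsest grid: one unknown, solve directly
        u[1] = 0.5*h*h*f[1]
        return u
    relax(u, f, h)                                  # pre-smoothing
    rc = restrict_fw(residual(u, f, h))             # restrict the residual
    ec = v_cycle([0.0]*len(rc), rc, 2.0*h)          # coarse-grid correction (recursion)
    ef = prolong(ec)                                # interpolate ...
    for i in range(len(u)):
        u[i] += ef[i]                               # ... and correct
    relax(u, f, h)                                  # post-smoothing
    return u

# Model problem: -u'' = pi^2 sin(pi x) on [0,1], exact solution sin(pi x).
N = 64
h = 1.0 / N
x = [i*h for i in range(N + 1)]
f = [math.pi**2 * math.sin(math.pi*xi) for xi in x]
u = [0.0] * (N + 1)
for _ in range(10):
    v_cycle(u, f, h)

err = max(abs(u[i] - math.sin(math.pi*x[i])) for i in range(N + 1))
res_max = max(abs(v) for v in residual(u, f, h))
```

Each V-cycle reduces the algebraic error by a roughly constant factor independent of N - that is the algorithmic scalability claimed above.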
Parallel High-Performance FE Multigrid

Parallelize "plain vanilla" multigrid:
- partition the domain
- parallelize all operations on all grids
- use clever data structures

Do not worry (so much) about coarse grids:
- idle processors?
- short messages?
- sequential dependency in the grid hierarchy?

Why we do not use domain decomposition:
- DD without a coarse grid does not scale (algorithmically) and is inefficient for large problems / many processors
- DD with coarse grids is still less efficient than multigrid and is just as difficult to parallelize
Part III-b
Towards Scalable FE Software: Scalable Architecture - Hierarchical Hybrid Grids
Hierarchical Hybrid Grids (HHG)

- Unstructured input grid: resolves the geometry of the problem domain
- Patch-wise regular refinement: generates nested grid hierarchies naturally suitable for geometric multigrid algorithms
- New: modify storage formats and operations on the grid to exploit the regular substructures

Does an unstructured grid with 100,000,000,000 elements make sense?

HHG - ultimate parallel FE performance!
HHG Refinement Example: Input Grid
HHG Refinement Example: Refinement Level One
HHG Refinement Example: Refinement Level Two
HHG Refinement Example: Structured Interior
HHG Refinement Example: Edge Interior
Common HHG Misconceptions

Hierarchical hybrid grids (HHG):
- are not just another block-structured grid: HHG are more flexible (unstructured, hybrid input grids)
- are not just another unstructured geometric multigrid package: HHG achieve better performance - unstructured treatment of regular regions does not improve performance
Parallel HHG - Framework Design Goals

To realize good parallel scalability:
- Minimize latency by reducing the number of messages that must be sent
- Optimize for high-bandwidth interconnects ⇒ large messages
- Avoid local copying into MPI buffers
HHG for Parallelization
Use the regular HHG patches for partitioning the domain.
HHG Parallel Update Algorithm

for each vertex do
    apply operation to vertex
end for
for each edge do
    copy from vertex interior
    apply operation to edge
    copy to vertex halo
end for
for each element do
    copy from edge/vertex interiors
    apply operation to element
    copy to edge/vertex halos
end for

Communication: update vertex primary dependencies, update edge primary dependencies, update secondary dependencies.
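A serial toy version of this dependency-ordered update (two 1D "patches" sharing one vertex; all names are illustrative, and in the real distributed-memory setting the halo copies would be message buffers rather than ordinary assignments) shows that the staged update reproduces exactly one ordered sweep over the whole mesh:

```python
u0 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0]   # points 0..6; 0 and 6 fixed boundary
VERTEX = 3                                  # the vertex shared by the two patches
PATCHES = [(1, 2), (4, 5)]                  # interior point ranges of each patch

def smooth(left, right):
    return 0.5 * (left + right)             # stand-in for "apply operation"

def hhg_style_sweep(u):
    u = u[:]
    # 1) vertices first: they have no lower-dimensional dependencies
    u[VERTEX] = smooth(u[VERTEX - 1], u[VERTEX + 1])
    for lo, hi in PATCHES:
        halo = {VERTEX: u[VERTEX]}          # copy from vertex interior
        # 2) apply the operation to the edge interior, reading halo values
        for i in range(lo, hi + 1):
            left = halo.get(i - 1, u[i - 1])
            right = halo.get(i + 1, u[i + 1])
            u[i] = smooth(left, right)
        # 3) patch boundary values would now be copied back to the vertex halo
    return u

def reference_sweep(u):
    # The same update written as one ordered in-place sweep over the whole mesh.
    u = u[:]
    for i in [VERTEX, 1, 2, 4, 5]:
        u[i] = smooth(u[i - 1], u[i + 1])
    return u

staged = hhg_style_sweep(u0)
ordered = reference_sweep(u0)
```

Because the vertex is updated before any edge reads it, the explicit halo copies carry exactly the values the ordered sweep would see - and the per-primitive loops can then run on large contiguous arrays.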
Part III-c
Towards Scalable FE Software: Performance Results
Node Performance is Difficult! (B. Gropp)
DiMe project: cache-aware multigrid (1996 - ...)

Performance of a 3D multigrid smoother for the 7-point stencil, in Mflops, on Itanium 1.4 GHz:

Grid size      513^3   257^3   129^3   65^3    33^3    17^3
standard         579     490     677    715    1344    1072
no blocking      819     849    1065    995    1417    2445
2x blocking     1282    1284    1319   1312    1913    2400
3x blocking     2049    2134    2140   2167    2389    2420

- Array padding
- Temporal blocking - in EPIC assembly language
- Software pipelining in the extreme (M. Stürmer, J. Treibig)

Node performance is possible!
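The "temporal blocking" in the table fuses several smoother sweeps over one cache-sized block before moving on. A minimal 1D sketch of the idea (pure Python with assumed weights; the real kernels are 3D and hand-tuned in EPIC assembly):

```python
def jacobi_sweep(u):
    # One damped-Jacobi-like sweep; boundary values stay fixed.
    return [u[0]] + [0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
                     for i in range(1, len(u) - 1)] + [u[-1]]

def two_sweeps_naive(u):
    # Streams the whole array through memory twice.
    return jacobi_sweep(jacobi_sweep(u))

def two_sweeps_blocked(u, block=16):
    # Per block: recompute the first sweep on a slightly extended range, then
    # apply the second sweep - each block is swept twice while "in cache".
    n = len(u)
    out = u[:]
    for s in range(1, n - 1, block):
        e = min(s + block, n - 1)
        lo, hi = s - 1, min(e + 1, n)
        tmp = {}
        for i in range(lo, hi):          # first sweep, only where needed
            if 0 < i < n - 1:
                tmp[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
            else:
                tmp[i] = u[i]
        for i in range(s, e):            # second sweep, reading only tmp
            out[i] = 0.25 * tmp[i - 1] + 0.5 * tmp[i] + 0.25 * tmp[i + 1]
    return out

u = [float((3 * i) % 17) for i in range(101)]
naive = two_sweeps_naive(u)
blocked = two_sweeps_blocked(u, block=16)
```

The blocked version performs the same arithmetic (plus a little redundant work at block edges) with far better temporal locality, which is the effect exploited in the table above.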
Single-Processor HHG Performance on Itanium for Relaxation of a Tetrahedral Finite Element Mesh
Parallel HHG Performance on Altix for Multigrid with a Tetrahedral Finite Element Mesh
HHG: Parallel Scalability

Parallel scalability of a Poisson problem discretized by tetrahedral finite elements - SGI Altix (Itanium-2, 1.6 GHz):

#Procs   #DOFs x 10^6   #Els x 10^6   #Input Els   GFLOP/s     Time [s]
64       2,144          12,884        6,144        100/75      68
128      4,288          25,769        12,288       200/147     69
256      8,577          51,539        24,576       409/270     76
512      17,167         103,079       49,152       762/545     75
1024     17,167         103,079       49,152       1,456/964   43

B. Bergen, F. Hülsemann, U. Rüde: Is 1.7 × 10^10 unknowns the largest finite element system that can be solved today? SuperComputing, Nov. 2005

2006 ISC Award for Application Scalability
Why Multigrid?

A direct banded elimination solver has complexity O(N^2.3) for a 3D problem. For our problem size this becomes 338,869,529,114,764,631,553,758 ≈ 3.4 × 10^23 operations.

At one Tflops this would mean a runtime of about 10,000 years, which could be reduced to about 10 years on a Pflops system.
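The estimate can be checked directly (N and the exponent are from the slide; the year length is the only added constant):

```python
N = 1.7e10                     # unknowns
ops = N ** 2.3                 # banded elimination, O(N^2.3): ~3.4e23 operations

seconds_per_year = 3600.0 * 24.0 * 365.0
years_at_1_tflops = ops / 1e12 / seconds_per_year   # ~1e4 years
years_at_1_pflops = ops / 1e15 / seconds_per_year   # ~10 years
```

By contrast, a multigrid solver at the ~30 operations per grid point quoted earlier needs only about 5 × 10^11 operations for the same N.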
Part IV: Outlook
System Configuration: HLRB-II (Phase I)

- 4096 Intel Madison9M (1.6 GHz) cores (single core = 4096 cores)
- 6 MByte L3 cache
- FSB 533 (8.5 GByte/s for one core): 1.33 Byte/Flop = 0.17 Words/Flop
- 17 TByte memory
- 26.2 TF peak -> 24.5 TF Linpack
- ~7.4 TF aggregated weighted application performance (LRZ benchmarks)
- ~3.5 TF expected as every-day mean performance
- 256-core single system image
System Configuration: HLRB-II (Phase II)

- Target: 3328+ Intel Montvale sockets, dual core = 6656+ cores
- 9 MByte L3 cache
- FSB 667 (10.7 GB/s shared by two cores): ~0.1 Words/Flop
- 40+ TByte memory
- >60 TF peak
- 13 TF on the LRZ benchmarks, ~6 TF expected sustained
- 512 cores per node
- Installation: 2007
Conclusions: Supercomputer Performance is Easy!

- If parallel efficiency is bad, choose a slower serial algorithm: it is probably easier to parallelize and will make your speedups look much more impressive.
- Introduce the "CrunchMe" variable for getting high Flops rates. Advanced method: disguise CrunchMe by using an inefficient (but compute-intensive) algorithm from the start.
- Introduce the "HitMe" variable to get good cache hit rates. Advanced version: disguise HitMe within "clever data structures" that introduce a lot of overhead.
- Never cite "time-to-solution": who cares whether you solve a real-life problem anyway; it is the MachoFlops that interest the people who pay for your research.
- Never waste your time trying to use a complicated algorithm in parallel (such as multigrid): the more primitive the algorithm, the easier it is to maximize your MachoFlops.
Acknowledgements

Collaborators:
- In Erlangen: WTM, LSE, LSTM, LGDV, RRZE, Neurozentrum, Radiologie, etc.
- International: Utah, Technion, Constanta, Ghent, Boulder, ...

Dissertation projects:
- B. Bergen (HHG development)
- T. Gradl (HHG application / optimization)
- J. Treibig (cache optimizations)
- M. Stürmer (architecture-aware programming)
- U. Fabricius (AMG methods and SW engineering for parallelization)
- C. Freundl (parallel expression templates for PDE solvers)
- J. Härtlein (expression templates for FE applications)
- N. Thürey (LBM, free surfaces)
- T. Pohl (parallel LBM)
- ... and 6 more

20 Diplom/Master theses, Studien/Bachelor theses:
- especially for performance analysis / optimization for LBM: J. Wilke, K. Iglberger, S. Donath
- ... and 23 more

Funding: KONWIHR, DFG, NATO, BMBF

Graduate education: Elitenetzwerk Bayern
- Bavarian Graduate School in Computational Engineering (with TUM, since 2004)
- Special international PhD program: Identifikation, Optimierung und Steuerung für technische Anwendungen (with Bayreuth and Würzburg), since Jan. 2006
Talk is Over
Please wake up!
FE Applications (in our group)

- Solid-state lasers: heat conduction, elasticity, wave equation
- Bio-electric fields: calculation of electrostatic or electromagnetic potentials in heterogeneous domains
- Computational nano technology: Lattice-Boltzmann simulation of charged particles in colloids; calculation of the electrostatic potential in very large domains
- Acoustics: room acoustics, flow-induced noise
- Industrial high-temperature furnaces: heat transport, radiation, ...
Large-Scale Acoustic Simulations

- A concert hall may have >10,000 m³ of volume
- A resolution of <1 cm is needed to resolve the audible spectrum of wavelengths, i.e. >10^6 cells per m³
- The concert hall therefore requires solving a system with >10^10 unknowns per (implicit) time step

Flow-induced noise (KONWIHR):
- generated by turbulence and boundary effects
- high-frequency noise requires a very fine resolution of the acoustic field
- far-field computations => large domains

HHG will make unprecedented acoustic simulations possible!
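The unknown count above follows directly from the resolution argument:

```python
cells_per_metre = 100                       # 1 cm resolution
cells_per_m3 = cells_per_metre ** 3         # 1e6 cells per cubic metre
hall_volume_m3 = 10_000                     # concert hall volume
unknowns = hall_volume_m3 * cells_per_m3    # 1e10 unknowns per time step
```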
HHG Motivation I

Standard sparse matrix data structures (like CRS) generally achieve poor performance!

- Cache effects: difficult to find an ordering of the unknowns that maximizes data locality
- Indirect indexing: precludes aggressive compiler optimizations that exploit instruction-level parallelism (ILP)
- Continuously variable coefficients: overkill for a certain class of problems
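The contrast shows up in the inner loops of the two matrix-vector kernels. In this 1D toy version (the HHG kernels are 3D; all names here are illustrative), the CRS loop goes through `col_idx` indirection on every access, while the stencil loop is direct, unit-stride, and uses three constant coefficients:

```python
def matvec_crs(row_ptr, col_idx, vals, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]   # indirect access through col_idx
        y[i] = s
    return y

def matvec_stencil(x):
    n = len(x)
    y = [0.0] * n
    for i in range(1, n - 1):
        y[i] = -x[i - 1] + 2.0 * x[i] - x[i + 1]   # direct, unit-stride access
    y[0], y[n - 1] = x[0], x[n - 1]    # identity boundary rows, to keep it simple
    return y

def laplacian_crs(n):
    # The same operator in CRS form: identity boundary rows, (-1, 2, -1) inside.
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        if i == 0 or i == n - 1:
            col_idx += [i]
            vals += [1.0]
        else:
            col_idx += [i - 1, i, i + 1]
            vals += [-1.0, 2.0, -1.0]
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals

n = 12
x = [float(i * i) for i in range(n)]
row_ptr, col_idx, vals = laplacian_crs(n)
y_crs = matvec_crs(row_ptr, col_idx, vals, x)
y_sten = matvec_stencil(x)
```

Both kernels compute the same product; only the stencil version lets the compiler vectorize and software-pipeline the loop, and it stores three coefficients instead of ~3n matrix entries.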
![Page 50: Solving Finite Element Systems with 17 Billion …...1 Solving Finite Element Systems with 17 Billion Unknowns at Sustained Teraflops Performance Lehrstuhl für Informatik 10 (Systemsimulation)](https://reader036.vdocuments.pub/reader036/viewer/2022070908/5f8440d5e2627d120f429cf0/html5/thumbnails/50.jpg)
50
User-Friendliness: ParExPDE (Parallel Expression Templates for Partial Differential Equations)
Library for the rapid development of numerical PDE solvers on parallel architectures
High level intuitive programming interface
Use of expression templates ensures good efficiency
Funded by KONWIHR (2001-2004)
User-friendly FE application development with excellent parallel efficiency
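As an illustration of the expression-template technique (a generic sketch, not the ParExPDE interface): vector arithmetic builds lightweight expression objects, and a single fused loop runs only at assignment, so no temporary vectors are created.

```cpp
#include <cstddef>
#include <vector>

// Expression node representing the lazy sum of two operands.
template <class L, class R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }
    // Assignment from any expression: one fused loop, no temporaries.
    template <class E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Building "a + b" only constructs an Add node; nothing is computed yet.
template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

With this, `r = a + b + c;` compiles to a single loop computing `a[i] + b[i] + c[i]`, which is why expression templates can approach hand-written loop performance.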
51
Conclusions
- High performance simulation still requires "heroic programming"
- Parallel programming: single-node performance is difficult
- Which architecture? Which data structures? Where are we going?
  - the end of Moore's law
  - petaflops require >100,000 processors, and we can hardly handle 1,000!
  - the memory wall
    • latency
    • bandwidth
    • locality!
We are open for collaborations!
52
Conclusions (1)
- High performance simulation still requires "heroic programming" ... but we are on the way to making supercomputers more generally usable
- Parallel programming is easy, node performance is difficult (B. Gropp)
- Which architecture?
  - ASCI-type: custom CPU, massively parallel cluster of SMPs
    • nobody has been able to show that these machines scale efficiently, except on a few very special applications and with enormous human effort
  - Earth-Simulator-type: vector CPU, as many CPUs as affordable
    • impressive performance on vectorizable code, but needs checking with more demanding data and algorithm structures
  - Hitachi class: modified custom CPU, cluster of SMPs
    • excellent performance on some codes, but unexpected slowdowns on others; too exotic to have a sufficiently large software base
  - Others: BlueGene, Cray X1, multithreading, PIM, reconfigurable ..., quantum computing, ...
53
Conclusions (2)
- Which data structures?
  - structured (inflexible)
  - unstructured (slow)
  - HHG (high development effort; even the prototype is 50K lines of code)
  - meshless ... (useful in niches)
- Where are we going?
  - the end of Moore's law
  - nobody builds CPUs with HPC-specific requirements high on the list of priorities
  - petaflops require >100,000 processors, and we can hardly handle 1,000
  - it's the locality, stupid!
  - the memory wall
    • latency
    • bandwidth
- Distinguish between algorithms where the control flow is
  • data independent: latency-hiding techniques (pipelining, prefetching, etc.) can help
  • data dependent
54
Talk is Over
Please wake up!
55
In the Future?
What's beyond Moore's Law?
58
[Figure: Moore's Law in Semiconductor Technology (F. Hossfeld). Transistors per die vs. year, 1970-2005, for Intel microprocessors (4004, 80386, 80486, Pentium, Pentium Pro, Merced) and DRAM generations (1K up to 4G); the two curves grow at 42% and 52% per year.]
59
[Figure: Information Density & Energy Dissipation (adapted by F. Hossfeld from C. P. Williams et al., 1998). Atoms per bit and energy per logic operation (picojoules) vs. year, 1950-2020; the extrapolated semiconductor technology trend reaches the thermodynamic kT limit around 2017.]
60
Acknowledgements
Collaborators
- In Erlangen: WTM, LSE, LSTM, LGDV, RRZE, etc.
- Especially for foams: C. Körner (WTM)
- International: Utah, Technion, Constanta, Ghent, Boulder, ...

Dissertation projects
- U. Fabricius (AMG methods and software engineering for parallelization)
- C. Freundl (parallel expression templates for PDE solvers)
- J. Härtlein (expression templates for FE applications)
- N. Thürey (LBM, free surfaces)
- T. Pohl (parallel LBM)
- ... and 6 more

16 Diplom/Master theses, Studien/Bachelor theses
- Especially for performance analysis/optimization for LBM: J. Wilke, K. Iglberger, S. Donath
- ... and 21 more

Funding: KONWIHR, DFG, NATO, BMBF; Elitenetzwerk Bayern
Bavarian Graduate School in Computational Engineering (with TUM, since 2004)
Special international PhD program: Identifikation, Optimierung und Steuerung für technische Anwendungen (Identification, Optimization, and Control for Technical Applications; with Bayreuth and Würzburg), to start Jan. 2006.
61
ParExPDE - Performance Results
Speedup of ParExPDE on the LSS Cluster

| # Proc. | Number of Unknowns | Wallclock Time (s) |
|--------:|-------------------:|-------------------:|
|       1 |           8.51·10⁶ |                291 |
|       2 |           8.51·10⁶ |                138 |
|       4 |           8.51·10⁶ |                 71 |
|       8 |           8.51·10⁶ |                 39 |
|      16 |           8.51·10⁶ |                 24 |
|      32 |           8.51·10⁶ |                 14 |

• Multigrid solver (V(2,2) cycles) for a Poisson problem with Dirichlet boundary
• LSS Cluster: AMD Opteron, 2.2 GHz
62
ParExPDE - Performance Results
Scaleup of ParExPDE on the LSS Cluster

| # Proc. | Number of Unknowns | Wallclock Time (s) |
|--------:|-------------------:|-------------------:|
|       1 |           2.00·10⁶ |                 67 |
|       2 |           4.02·10⁶ |                 68 |
|       4 |           8.05·10⁶ |                 68 |
|       8 |           1.61·10⁷ |                 72 |
|      16 |           3.22·10⁷ |                 72 |
|      32 |           6.45·10⁷ |                 76 |

• Multigrid solver (V(2,2) cycles) for a Poisson problem with Dirichlet boundary
• LSS Cluster: AMD Opteron 2.2 GHz
63
Source Localization
- 3D MRI data: 512 x 512 x 256 voxels
- Segmentation: 4 compartments
- Localized epileptic focus
- Dipole localization: search algorithm = optimization

Collaborators: Univ. of Utah (Chris Johnson), Ovidius Univ. Constanta (C. Popa), Bart Vanrumste (Ghent, Univ. of Canterbury, New Zealand), G. Greiner, F. Fahlbusch, C. Wolters (Münster)
64
Source Localization
- Localized epileptic focus
- Dipole localization: search algorithm = optimization

Collaborators: Univ. of Utah (Chris Johnson), Ovidius Univ. Constanta (C. Popa), Bart Vanrumste (Ghent, Univ. of Canterbury, New Zealand), G. Greiner, F. Fahlbusch, C. Wolters (Münster)
65
Data Requirements
- 3D MRI data: 512 x 512 x 256 voxels
- Segmentation: 4 compartments
- Data types to describe isotropic and anisotropic conductivity
66
HHG Parallel Update Algorithm

Sequential update (no communication):

    for each vertex do
        apply operation to vertex
    end for
    for each edge do
        copy from vertex interior
        apply operation to edge
        copy to vertex halo
    end for
    for each element do
        copy from edge/vertex interiors
        apply operation to element
        copy to edge/vertex halos
    end for

Parallel update (with interleaved communication):

    for each vertex do
        apply operation to vertex
    end for
    update vertex primary dependencies
    for each edge do
        copy from vertex interior
        apply operation to edge
        copy to vertex halo
    end for
    update edge primary dependencies
    for each element do
        copy from edge/vertex interiors
        apply operation to element
        copy to edge/vertex halos
    end for
    update secondary dependencies
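The "copy from interior / copy to halo" pattern in the listing can be sketched in one dimension. The `Patch` layout and the Jacobi-style update below are illustrative assumptions, not the actual HHG data structures: each patch stores its interior plus one ghost cell per side, neighbouring interiors are copied into the halo, and the update then reads only local memory.

```cpp
#include <vector>

// A patch with one ghost (halo) cell on each side: u[0] and u.back()
// are halo cells, u[1] .. u[size-2] are interior.
struct Patch {
    std::vector<double> u;
    explicit Patch(int interior, double v) : u(interior + 2, v) {}
};

// Copy the neighbouring interiors into the halos (the "copy to halo" step).
void exchange_halos(Patch& left, Patch& right) {
    left.u.back() = right.u[1];                    // right neighbour's first interior cell
    right.u[0]    = left.u[left.u.size() - 2];     // left neighbour's last interior cell
}

// One Jacobi-style averaging sweep over the interior; the halo cells
// supply the coupling to the neighbouring patches.
void smooth(Patch& p) {
    std::vector<double> un = p.u;
    for (std::size_t i = 1; i + 1 < p.u.size(); ++i)
        un[i] = 0.5 * (p.u[i - 1] + p.u[i + 1]);
    p.u = un;
}
```

In HHG the same idea is applied hierarchically: vertices, edges, and elements each own their interior unknowns and receive halo copies from the lower-dimensional primitives before being updated.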
67
Numerical Models for Source Localization
- Computationally expensive treatment of singularities
  • Dipole models (blurred, subtraction, mathematical/Zenger correction)
  • Resolution and complexity of the mesh is important
68
ParExPDE - Performance Results
Speedup of ParExPDE

| # Proc. | Number of Unknowns | Wallclock Time (s) |
|--------:|-------------------:|-------------------:|
|       1 |           8.51·10⁶ |                291 |
|       2 |           8.51·10⁶ |                138 |
|       4 |           8.51·10⁶ |                 71 |
|       8 |           8.51·10⁶ |                 39 |
|      16 |           8.51·10⁶ |                 24 |
|      32 |           8.51·10⁶ |                 14 |

• Multigrid solver, V(2,2) cycles for a Poisson problem, Dirichlet boundary
• LSS Cluster compute nodes (8x4 CPUs), CPU: AMD Opteron 848
69
ParExPDE - Performance Results
Scaleup of ParExPDE on the LSS Cluster

| # Proc. | Number of Unknowns | Wallclock Time (s) |
|--------:|-------------------:|-------------------:|
|       1 |           2.00·10⁶ |                 67 |
|       2 |           4.02·10⁶ |                 68 |
|       4 |           8.05·10⁶ |                 68 |
|       8 |           1.61·10⁷ |                 72 |
|      16 |           3.22·10⁷ |                 72 |
|      32 |           6.45·10⁷ |                 76 |

• Multigrid solver
• V(2,2) cycles
• Poisson problem
• Dirichlet boundary
• LSS Cluster: AMD Opteron 2.2 GHz
70
ParExPDE - Applications
Simulation of solid-state lasers (Prof. Pflaum)
- Calculation of heat conduction
- Calculation of elasticity
- Solution of the Helmholtz equation
71
Bio-electric Field ComputationsBio-electric Field Computations
Reconstruction of electromagnetic fieldsReconstruction of electromagnetic fields from EEG-Measurements:from EEG-Measurements:
Source LocalizationSource Localization
NeurosurgeryNeurosurgeryKopfklinikum ErlangenKopfklinikum Erlangen
View throughView throughoperation microscopeoperation microscope
72
Why simulate and not experiment?
- Open brain: EEG measurements for localizing functional brain regions
- Simulation-based: virtual operation planning
73
Problem of Inverse EEG/MEG
Direct problem:
• Known: sources (strength, position, orientation)
• Wanted: potentials on the head surface
Inverse problem:
• Known: potentials on the head surface
• Wanted: sources (strength, position, orientation)
74
Computational Requirements for Biomedical FE Computations
- Resolution: 1024³ ≈ 10⁹ voxels
- Complex source models, anisotropic conductivity, inverse problem
- Bio-electro-chemical model for signal propagation
- Coupling to bio-mechanical effects, e.g.
  • contraction of the heart muscle
  • blood flow
- Therapy planning, e.g. cardiac resynchronization pacemaker
  • optimal placement of electrodes
  • optimal control of resynchronization
- ParExPDE as a scalable parallel solver!
75
Computational Nano Technology
- Extension of the LBM particle simulation by an electrostatic potential
- Long-range particle-particle interactions
- Simulating charged nano-particles in a flow field
  • generates ions in the fluid (three species of fluid: +, -, o)
  • generates a charge distribution
- ParExPDE for computing the (static) potential distribution in each time step
  • 10⁹ - 10¹⁵ unknowns
  • compute forces on particles and ionized fluid from the potential
- Diplomarbeit (diploma thesis): C. Feichtinger
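The last step, forces from the potential, amounts to evaluating F = -q ∇φ on the grid. A one-axis central-difference sketch (function name and discretization are illustrative assumptions, not the project's code):

```cpp
// Force component on a charge q from a discrete potential along one axis,
// F_x = -q * dphi/dx, approximated by a central difference with spacing h.
double force_from_potential(double q, double phi_left, double phi_right,
                            double h) {
    return -q * (phi_right - phi_left) / (2.0 * h);
}
```

For a linearly increasing potential phi = x (so dphi/dx = 1), a charge q feels the force -q along that axis.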
76
3. ORCAN
Component-based application framework to ease the development of parallel simulations
Central issues:
• Flexibility: dynamically switch modules
• Maintainability: enable long-term usage of a code
• Extensibility: enable reuse of existing codes and coupling to other solvers (multiphysics)
• Enable a well-organized software development process
Modern software infrastructure for technical high performance computing