Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi
Tohoku University
14 November, 2018
SC18
Outline
• Background
• Overview of SX-Aurora TSUBASA
• Performance evaluation
  • Benchmark performance
  • Application performance
• Conclusions
Background
• Supercomputers have become important infrastructure
  • Widely used for scientific research as well as in various industries
  • The top-ranked system, Summit, reaches 143.5 Pflop/s
• Big gap between theoretical performance and sustained performance
  ◎ Compute-intensive applications stand to benefit from high peak performance
  ✖ Memory-intensive applications are limited by lower memory performance
Memory performance has gained more and more attention
A new vector supercomputer SX-Aurora TSUBASA
• Two important concepts of its design
  • High usability
  • High sustained performance
• New memory integration technology
  • Realizes the world's highest memory bandwidth
• New architecture
  • Vector host (VH) is attached to vector engines (VEs)
  • The VE is responsible for executing an entire application
  • The VH is used for processing system calls invoked by the applications
[Figure: a VH (x86 Linux) attached to VEs (vector processors)]
New execution model
• Conventional model
• New execution model
[Figure: timelines of the two models. Conventional model (Host, GPU): start processing, kernel execution, kernel execution, finish processing. New execution model (VH, VE): exe module load, start processing, system call (I/O, etc.) transparently offloaded to an OS function, finish processing.]
Highlights of the execution model
• Two advantages over the conventional execution model
  • Avoids frequent data transfers between VE and VH
    • Applications are entirely executed on the VE
    • Only the data necessary for system calls needs to be transferred
    → High sustained performance
  • No special programming
    • Explicit specification of computation kernels is not necessary
    • System calls are transparently offloaded to the VH
    • Programmers do not need to care about system calls
    → High usability
Specification of SX-Aurora TSUBASA
• Memory bandwidth
  • 1.22 TB/s, the world's highest memory bandwidth
    • Integration of six HBM2 memory modules
  • 3.0 TB/s LLC bandwidth
    • LLC is connected to cores via a 2D mesh network
• Computational capability
  • 2.15 Tflop/s @ 1.4 GHz
    • 8 powerful vector cores
  • 16 nm FinFET process technology
  • 4.8 billion transistors
  • 14.96 mm x 33.00 mm
[Figure: Block diagram of a vector processor]
Experimental environments
• SX-Aurora TSUBASA A300-2
  • 2x VEs Type 10B
  • 1x VH

VE: Type 10B
Frequency                1.4 GHz
Peak FP / core           268.8 Gflop/s
# cores                  8
Peak DP Flops / socket   2.15 Tflop/s
Memory BW                1.2 TB/s
Memory capacity          48 GB
Memory config            HBM2

VH: Intel Xeon Gold 6126
Frequency                2.60 GHz / 3.70 GHz (Turbo)
Peak FP / core           83.2 Gflop/s
# cores                  12
Peak DP Flops            998.4 Gflop/s
Mem BW                   128 GB/s
Mem capacity             96 GB
Mem config               DDR4-2666 DIMM 16 GB x 6
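The socket peak figures in the two tables are consistent with multiplying the per-core peak by the core count. A minimal sketch of that arithmetic follows; the function name is illustrative and not taken from the slides.

```python
def socket_peak_gflops(peak_per_core_gflops: float, cores: int) -> float:
    """Socket peak = per-core peak x number of cores (assumes identical cores)."""
    return peak_per_core_gflops * cores


# Per-core peaks and core counts are the values listed in the tables above.
ve_peak = socket_peak_gflops(268.8, 8)    # VE Type 10B
vh_peak = socket_peak_gflops(83.2, 12)    # Xeon Gold 6126

print(f"VE: {ve_peak:.1f} Gflop/s, VH: {vh_peak:.1f} Gflop/s")
```

The results, 2150.4 Gflop/s and 998.4 Gflop/s, match the 2.15 Tflop/s and 998.4 Gflop/s entries listed for the VE and the VH.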
Experimental environments (cont.)
Processor             | SX-Aurora Type 10B | Xeon Gold 6126        | SX-ACE       | Tesla V100    | Xeon Phi KNL 7290
Frequency             | 1.4 GHz            | 2.6 GHz               | 1.0 GHz      | 1.245 GHz     | 1.5 GHz
# of cores            | 8                  | 12                    | 4            | 5120          | 72
DP flop/s (SP flop/s) | 2.15 TF (4.30 TF)  | 998.4 GF (1996.8 GF)  | 256 GF       | 7 TF (14 TF)  | 3.456 TF (6.912 TF)
Memory subsystem      | HBM2 x6            | DDR4 x6ch             | DDR3 x16ch   | HBM2 x4       | MCDRAM / DDR4
Memory BW             | 1.22 TB/s          | 128 GB/s              | 256 GB/s     | 900 GB/s      | 450+ GB/s / 115.2 GB/s
Memory capacity       | 48 GB              | 96 GB                 | 64 GB        | 16 GB         | 16 GB / 96 GB
LLC BW                | 2.66 TB/s          | N/A                   | 1.0 TB/s     | N/A           | N/A
LLC capacity          | 16 MB shared       | 19.25 MB shared       | 1 MB private | 6 MB shared   | 1 MB shared by 2 cores
Applications used for evaluation
• SGEMM/DGEMM
  • Matrix-matrix multiplications to evaluate the peak flop/s
• Stream benchmark
  • Simple kernels (copy, scale, add, triad) to measure sustained memory performance
• Himeno benchmark
  • Jacobi kernels with a 19-point stencil as memory-intensive kernels
• Application kernels
  • Kernels of practical applications of Tohoku Univ. in earthquake, CFD, and electromagnetics
• Microbenchmark to evaluate the execution model
  • Mixture of vector-friendly Jacobi kernels and I/O kernels
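For readers unfamiliar with the four Stream kernels named above, here is a minimal sketch of what each one computes, written in plain Python for readability rather than speed. The array length, the scalar `q`, and the helper names are illustrative assumptions. The byte counts assume 8-byte doubles and count one read per source array and one write per destination array.

```python
def stream_kernels(n: int = 1000, q: float = 3.0):
    """Run the four STREAM-style kernels once on plain Python lists."""
    a = [1.0] * n
    b = [2.0] * n
    c = [0.0] * n

    c = [a[i] for i in range(n)]               # copy:  c = a
    b = [q * c[i] for i in range(n)]           # scale: b = q * c
    c = [a[i] + b[i] for i in range(n)]        # add:   c = a + b
    a = [b[i] + q * c[i] for i in range(n)]    # triad: a = b + q * c
    return a, b, c


# Bytes moved per array element for each kernel (assumption: 8-byte doubles).
BYTES_PER_ELEMENT = {"copy": 16, "scale": 16, "add": 24, "triad": 24}


def bandwidth_gb_per_s(kernel: str, n: int, seconds: float) -> float:
    """Sustained bandwidth implied by one kernel pass over n elements."""
    return BYTES_PER_ELEMENT[kernel] * n / seconds / 1e9


a, b, c = stream_kernels()
```

Each kernel performs at most two floating-point operations per element while moving 16 or 24 bytes, which is why the measured rate reflects memory bandwidth rather than arithmetic throughput.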
Overview of application kernels
Kernels        | Fields          | Methods       | Memory access | Mesh size       | Code B/F | Actual B/F
Land mine      | Electromagnetic | FDTD          | Sequential    | 100x750x750     | 6.22     | 5.15
Earthquake     | Seismology      | Friction Law  | Sequential    | 2047x2047x256   | 4.00     | 4.00
Turbulent Flow | CFD             | Navier-Stokes | Sequential    | 512x16384x512   | 1.91     | 0.35
Antenna        | Electromagnetic | FDTD          | Sequential    | 252755x9x97336  | 1.73     | 0.98
Plasma         | Physics         | Lax-Wendroff  | Indirect      | 20,048,000      | 1.12     | 0.075
Turbine        | CFD             | LU-SGS        | Indirect      | 480x80x80x10    | 0.96     | 0.0084
SGEMM/DGEMM Performance
• High scalability up to 8 threads
  • High vectorization ratio of 99.36%, good vector length of 253.8
• High efficiency
  • Efficiency of 97.8 to 99.2%
[Figure: GEMM performance (Gflop/s) and efficiency (%) versus number of threads (1 to 8), for DGEMM and SGEMM]
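As a rough illustration of how a GEMM run is turned into a flop/s figure and an efficiency, the sketch below times a naive matrix product and applies the conventional count of 2n^3 floating-point operations for an n x n x n multiplication. The pure-Python `matmul` is only a stand-in for an optimized BLAS routine, and all function names are illustrative assumptions.

```python
import time


def matmul(a, b):
    """Naive dense product of two square matrices stored as lists of lists."""
    bt = list(zip(*b))  # transpose b so each entry of bt is a column of b
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def gemm_gflops(n: int, seconds: float) -> float:
    """An n x n x n GEMM is conventionally counted as 2*n^3 operations."""
    return 2.0 * n**3 / seconds / 1e9


def efficiency(sustained_gflops: float, peak_gflops: float) -> float:
    """Fraction of the theoretical peak that was actually sustained."""
    return sustained_gflops / peak_gflops


n = 64
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]

start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start

print(f"{gemm_gflops(n, elapsed):.4f} Gflop/s in pure Python")
```

Dividing a measured rate by the matching peak (2150.4 Gflop/s for double precision on one VE) gives the kind of efficiency figure quoted on the slide.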
Memory performance (Stream Triad)
• High sustained memory bandwidth of SX-Aurora TSUBASA
  • Efficiency: Aurora 79%, ACE 83%, Skylake 66%, V100 81%
• Scalability
  • Saturated even when the number of threads is less than half
[Figure: Stream bandwidth (GB/s) of SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL; ratio labels x 4.72, x 11.7, x 1.37, x 2.23]
[Figure: Stream bandwidth (GB/s) versus number of threads for SX-Aurora TSUBASA, SX-ACE, and Skylake]
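The efficiency percentages can be turned back into approximate sustained bandwidths. The sketch below assumes that efficiency means sustained Triad bandwidth divided by the theoretical memory bandwidth from the processor comparison table; the slides do not spell out that definition, so treat the results as estimates.

```python
# Theoretical memory bandwidth in GB/s, from the processor comparison table.
PEAK_BW_GBS = {
    "SX-Aurora TSUBASA": 1220.0,
    "SX-ACE": 256.0,
    "Skylake": 128.0,
    "Tesla V100": 900.0,
}

# Stream Triad efficiencies quoted on the slide.
EFFICIENCY = {
    "SX-Aurora TSUBASA": 0.79,
    "SX-ACE": 0.83,
    "Skylake": 0.66,
    "Tesla V100": 0.81,
}

# Assumption: sustained bandwidth = efficiency x theoretical bandwidth.
sustained = {name: PEAK_BW_GBS[name] * EFFICIENCY[name] for name in PEAK_BW_GBS}

for name, bw in sustained.items():
    print(f"{name}: about {bw:.0f} GB/s")
```

Because the quoted efficiencies are rounded to whole percentages, ratios computed from these estimates can differ slightly from the ratio labels in the figure.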
Himeno (Jacobi) performance
• Higher performance than the other systems, except the GPU
  • Vector reduction becomes a bottleneck due to copies among vector pipes
• Nice thread scalability
  • 6.9x speedup with 8 threads => 86% parallel efficiency
[Figure: Himeno performance (Gflop/s) of SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL; ratio labels x 3.4, x 7.96, x 0.93, x 2.1]
[Figure: Himeno performance (Gflop/s) versus number of threads for SX-Aurora TSUBASA, SX-ACE, and Skylake, with a x 6.9 label]
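The 86% figure follows from the usual definition of parallel efficiency as speedup divided by the number of threads. A minimal sketch, with an illustrative function name:

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Parallel efficiency = speedup over one worker / number of workers."""
    return speedup / workers


eff = parallel_efficiency(6.9, 8)
print(f"{eff:.0%}")  # prints "86%"
```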
Application kernel performance
• SX-Aurora TSUBASA could achieve high performance
  • Plasma, Turbine => indirect access, memory latency-bound
  • Antenna => computation-bound to memory BW-bound
  • Land mine, Earthquake, Turbulent flow => memory or LLC BW-bound
[Figure: Execution time (sec) of Land mine, Earthquake, Turbulent flow, Antenna, Plasma, and Turbine on SX-Aurora TSUBASA, SX-ACE, and Skylake; ratio labels x 4.3, x 2.9, x 3.4, x 9.9, x 2.6, x 3.4]
[Figure: Attainable performance (Gflop/s) versus arithmetic intensity (Flops/Byte) for SX-Aurora TSUBASA and SX-ACE, with points for Land mine, Earthquake, Antenna, Turbulent flow, Plasma, and Turbine]
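A plot of attainable performance against arithmetic intensity is commonly built from a roofline-style bound: performance is capped either by the peak flop/s or by arithmetic intensity times memory bandwidth, whichever is lower. The sketch below assumes that model with the theoretical peaks and memory bandwidths from the processor comparison table; the ceilings actually drawn on the slide may differ.

```python
def attainable_gflops(intensity: float, peak_gflops: float, bw_gbs: float) -> float:
    """Roofline-style bound: min(peak, arithmetic intensity x bandwidth).

    intensity is in flops per byte, i.e. the reciprocal of a B/F ratio.
    """
    return min(peak_gflops, intensity * bw_gbs)


# (peak DP Gflop/s, memory bandwidth GB/s) from the comparison table.
SYSTEMS = {
    "SX-Aurora TSUBASA": (2150.4, 1220.0),
    "SX-ACE": (256.0, 256.0),
}

for name, (peak, bw) in SYSTEMS.items():
    for ai in (0.125, 0.5, 2.0, 8.0):
        print(f"{name}: AI={ai} -> {attainable_gflops(ai, peak, bw):.1f} Gflop/s")
```

Under these assumptions a kernel with a low arithmetic intensity, such as one with a B/F ratio above 1, sits on the bandwidth-limited slope of the curve on both systems.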
Memory bound? or LLC bound?
• Further analysis using 4 types of Bytes/Flop ratio
  • Memory B/F = (memory BW) / (peak performance)
  • LLC B/F = (LLC BW) / (peak performance)
  • Code B/F = (necessary data in bytes) / (# FP operations)
  • Actual B/F = (# block memory accesses) * (block size) / (# FP operations)
• Code B/F > Actual B/F * LLC BW / Memory BW => LLC bound
• Code B/F < Actual B/F * LLC BW / Memory BW => memory bound

B/F ratio  | Actual < Memory   | Actual > Memory
Code < LLC | Computation-bound | Memory BW-bound
Code > LLC | LLC BW-bound      | Memory or LLC bound *
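The decision table and the two tie-break inequalities can be written as a small function. This is a literal sketch of the rules as stated, assuming the second table column means that Actual B/F exceeds Memory B/F. The machine parameters and kernel ratios in the example are hypothetical values chosen for illustration, not measurements from the slides, and the case of exact equality is not covered by the slide, so it falls to the "not exceeded" branch here.

```python
def classify(code_bf: float, actual_bf: float,
             memory_bf: float, llc_bf: float,
             llc_bw: float, memory_bw: float) -> str:
    """Apply the B/F decision table, then the tie-break rule if both exceed."""
    over_llc = code_bf > llc_bf            # kernel asks more of the LLC than it offers
    over_memory = actual_bf > memory_bf    # kernel asks more of memory than it offers

    if not over_llc and not over_memory:
        return "computation-bound"
    if not over_llc and over_memory:
        return "memory BW-bound"
    if over_llc and not over_memory:
        return "LLC BW-bound"
    # Both limits exceeded: compare Code B/F with Actual B/F * LLC BW / Memory BW.
    if code_bf > actual_bf * llc_bw / memory_bw:
        return "LLC bound"
    return "memory bound"


# Hypothetical machine: 2000 Gflop/s peak, 1000 GB/s memory, 2000 GB/s LLC.
MACHINE = dict(memory_bf=0.5, llc_bf=1.0, llc_bw=2000.0, memory_bw=1000.0)

for code_bf, actual_bf in [(0.8, 0.3), (0.8, 0.7), (1.5, 0.3), (3.0, 1.0), (1.5, 1.0)]:
    print(code_bf, actual_bf, classify(code_bf, actual_bf, **MACHINE))
```

The tie-break compares the time spent moving Code B/F bytes through the LLC with the time spent moving Actual B/F bytes through memory, which is why the LLC-to-memory bandwidth ratio appears in it.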
Application kernel performance
• Memory B/F = 1.22 TB/s / 2.15 TF = 0.57
• LLC B/F = 2.66 TB/s / 2.15 TF = 1.24
• Land mine (Code 6.22, Actual 5.79) => LLC bound
• Earthquake (Code 6.00, Actual 2.00) => LLC bound
• Turbulent flow (Code 1.91, Actual 0.35) => memory BW bound
• Antenna (Code 1.73, Actual 0.98) => memory BW bound
[Figure: Execution time (sec) of Land mine, Earthquake, Turbulent flow, Antenna, Plasma, and Turbine on SX-Aurora TSUBASA, SX-ACE, and Skylake; ratio labels x 4.3, x 2.9, x 3.4, x 9.9, x 2.6, x 3.4]

B/F ratio  | Actual < Memory   | Actual > Memory
Code < LLC | Computation-bound | Memory BW-bound
Code > LLC | LLC BW-bound      | Memory or LLC bound
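The two machine-side ratios on this slide are plain divisions of bandwidth by peak performance. A minimal sketch reproducing them, along with the LLC-to-memory bandwidth ratio that enters the tie-break rule; the function name is illustrative.

```python
def bytes_per_flop(bandwidth_tb_per_s: float, peak_tflops: float) -> float:
    """Machine balance in bytes per flop: bandwidth / peak performance."""
    return bandwidth_tb_per_s / peak_tflops


memory_bf = bytes_per_flop(1.22, 2.15)   # memory BW / peak
llc_bf = bytes_per_flop(2.66, 2.15)      # LLC BW / peak
llc_to_memory = 2.66 / 1.22              # factor used in the tie-break rule

print(f"Memory B/F = {memory_bf:.2f}, LLC B/F = {llc_bf:.2f}")
print(f"LLC BW / Memory BW = {llc_to_memory:.2f}")
```

The printed values are 0.57 and 1.24, matching the slide, and the bandwidth ratio is about 2.18.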
Evaluation of the execution model
• (Transparent/Explicit) offload from VE to VH
• Offload from VH to VE
[Figure: Execution time (sec) of VE execution, VH offload, and VE offload, broken down into data transfer, Jacobi 1, I/O, and Jacobi 2]
[Figure: VH/VE diagrams of the three cases: transparent VE to VH, explicit VE to VH, and VH to VE]
Multi-VE performance on A300-8
• Stream VE-level scalability
  • Almost ideal scalability up to 8 VEs
• Himeno VE-level scalability
  • Good scalability up to 4 VEs
  • Insufficient vector length with more than 5 VEs
    • Problem size is too small
[Figure: Stream bandwidth (GB/s) versus number of VEs (1 to 8), Aurora versus ideal]
[Figure: Himeno performance (Gflop/s) versus number of VEs (1 to 8), Aurora versus ideal]
Application evaluation
• Applications
  • Tsunami simulation code
    • Mainly consists of ground fluctuation and tsunami inundation prediction
    • Used for a real-time tsunami inundation system
    • A coastal region of Japan (1244 x 826 km) with an 810 m resolution
  • Direct numerical simulation code of turbulent flows
    • Incompressible Navier-Stokes equations and a continuity equation are solved by 3D FFT
    • MPI parallelization to 4 VEs (1 to 32 processes)
    • 128^3, 256^3, 512^3 grid points
Performance of the Tsunami code
• Core performance is x 11.4, x 4.1, and x 2.4 relative to KNL, Skylake, and ACE, respectively
• Socket performance is x 2.5, x 1.9, and x 3.5, respectively
Performance of the DNS code
• About 8.14x, 6.73x, and 5.48x speedup with 32 cores over 1 core for the 128^3, 256^3, and 512^3 grids
• LLC hit ratios of 256^3 and 512^3 are lower than that of 128^3
  • 10% is bank conflicts
Conclusions
• A vector supercomputer SX-Aurora TSUBASA
  • The highest bandwidth of 1.22 TB/s by integrating six HBM2 modules
  • New execution model to achieve high usability and high sustained performance
• Performance evaluation
  • Benchmark programs
    → High sustained memory performance
    → Effectiveness of the new execution model
  • Application programs
    → High sustained performance
• Future work
  • Further optimizations for SX-Aurora TSUBASA
Acknowledgements
• People
  • Hiroyuki Takizawa
  • Ryusuke Egawa
  • Souya Fujimoto
  • Yasuhisa Masaoka
• Projects
  • The closed beta program for early access to Aurora
  • High performance computing division (jointly organized with NEC), Cyberscience Center, Tohoku University