Search for Optimal Network Topologies for Supercomputers
GUO, Meng 郭猛
Shandong Computer Science Center (National Supercomputer Center in Jinan)
2014/11/5 · Guangzhou
Acknowledgements
Y.F. Deng of Stony Brook, USA & National Supercomputer Center in Jinan, China
M. Michalewicz and L. Orlowski of A*CRC, Singapore and Stony Brook
T. Mayer, Z. Ye, and L. Zhang of Stony Brook, USA
C. C. Hwang, Y. T. Chen, C. H. Liang, and S. W. Liou of NCKU, Taiwan
Joint work with Prof. Deng's students, postdocs, and other colleagues
Early work was done on the Sunway BlueLight at the National Supercomputer Center in Jinan, China
01 Motivations
02 Reviews of Network Topologies
03 Search for Optimal Topologies
04 Summaries
Development of calculator / computer
Human: ~0.01 Flops
Mechanical calculator: ~0.1 Flops
ENIAC (1946): ~300 Flops
Tianhe-2 (2013): ~50 PFlops
TOP1: Tianhe-2
System: TH-IVB FEP Cluster, Xeon E5-2692 12C 2.2 GHz with Phi 31S1P
Cores: 3,120,000
Nodes: 16,000
Flops/node: 3.431 Tflops
RAM: 1,024,000 GB
Rmax: 33.9 Pflop/s
Rpeak: 54.9 Pflop/s
Power: 17.8 MW
Network: TH Express-2
OS: Kylin Linux
Communication Costly in Time and Energy
Local calculation (FMAdd) costs: 11 pJ
Moving a double across the die costs: 96 pJ
Communication vs. computation: 96/11 = 8.73
DARPA report of P. Kogge (ND) et al. and T. Schulthess (ETH), and David Keyes’ PPT
Operation | Approximate energy cost
DP FMADD flop | 100 pJ
DP DRAM read-to-register | 4800 pJ
DP word transmit-to-neighbor | 7500 pJ
DP word transmit-across-system | 9000 pJ
[Power costs per operation, today]
[Power costs per operation, 2019]
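The energy figures above are easier to compare when expressed as multiples of one floating-point operation. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative, and the numbers are the ones quoted on the slides.

```python
# Energy per operation in picojoules, as quoted in the table on the slide.
ENERGY_PJ = {
    "DP FMADD flop": 100,
    "DP DRAM read-to-register": 4800,
    "DP word transmit-to-neighbor": 7500,
    "DP word transmit-across-system": 9000,
}


def cost_in_flops(energy_pj, flop_pj=ENERGY_PJ["DP FMADD flop"]):
    """Express an energy cost as a multiple of the cost of one FMADD flop."""
    return energy_pj / flop_pj


ratios = {op: cost_in_flops(pj) for op, pj in ENERGY_PJ.items()}
for op, ratio in ratios.items():
    print(f"{op}: {ratio:.0f}x one flop")

# Second pair of figures on the slides: 96 pJ to move a double across the die
# versus 11 pJ for a local FMAdd.
cross_die_ratio = 96 / 11
print(f"cross-die move / local FMAdd: {cross_die_ratio:.2f}")
```

With the table values, sending one word across the system costs as much energy as 90 flops, which is the motivation for shortening network paths.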
Interconnection Network: Topologies & Technologies
[Topologies]
Bus, ring, grid/mesh, torus, hypercube,
tree, fat tree, omega, crossbar, etc.
[Technologies]
Device: Ethernet, Myrinet, InfiniBand, etc.
Protocol: TCP/IP, UDP, VMMC, U-Net, BIP, etc.
NETWORK:
Latency
Bandwidth
SYSTEM:
Performance
Cost
02 Reviews of Network Topologies
Diagram of different BASIC network topologies
Source: http://en.wikipedia.org/wiki/Network_topology
An Ocean of Networks
Source: B. Parhami
The Top Supercomputer in TOP500 Lists in the Last 20 Years
System | Site | Topology | Date
TMC CM-5 | Los Alamos National Lab | Fat Tree | 6/93~11/93
Fujitsu Numerical Wind Tunnel | National Aerospace Laboratory of Japan | Crossbar | 11/93~6/96
Intel XP/S 140 Paragon | Sandia National Labs | 2D Mesh | 6/94~11/94
Hitachi SR2201 | University of Tokyo | 3D Crossbar | 6/96~11/96
Hitachi CP-PACS | University of Tsukuba | 3D Hyper-crossbar | 11/96~6/97
Intel ASCI Red | Sandia National Laboratory | Mesh | 6/97~11/00
IBM ASCI White | Lawrence Livermore National Laboratory | Omega | 11/00~6/02
NEC The Earth Simulator | Earth Simulator Center | Crossbar | 6/02~11/04
IBM BlueGene/L | Lawrence Livermore National Laboratory | 3D Torus | 11/04~6/08
IBM Roadrunner | Los Alamos National Laboratory | Fat-Tree hierarchy of crossbars | 6/08~11/09
Cray Jaguar | Oak Ridge National Laboratory | 3D Torus | 11/09~11/10
NUDT Tianhe-1A | National Supercomputing Center in Tianjin | Fat Tree | 11/10~6/11
Fujitsu K Computer | RIKEN Advanced Institute for Computational Science | Tofu: 6D Mesh / Torus | 6/11~6/12
IBM Sequoia Blue Gene/Q | Lawrence Livermore National Laboratory | 5D Torus | 6/12~11/12
Cray Titan | Oak Ridge National Laboratory | 3D Torus | 11/12~6/13
NUDT Tianhe-2 | National Super Computer Center in Guangzhou | Fat Tree | 6/13~
Source: http://www.top500.org
Interconnect Family System Share of TOP 500 (June 2014)
Source: http://www.top500.org
Popular Networks:
5D Torus (IBM); Tofu: 6D Mesh/Torus (K); 3D Torus (Cray Gemini)
Butterfly (Monsoon); Dragonfly (Cray XC30); Hypercube (SGI Origin)
Popular Network: Fat Tree
The networks for Tianhe-2 (GZ), Sunway (JN), Dawning Nebulae (SZ), …
Returning to Square One
1960s: Mesh-based (ILLIAC IV)
1970s: Butterfly, other MINs
1980s: Hypercube, bus-based
1990s: Fat tree, LAN-based
2000s
Drivers labeled on the transitions between decades: direct to indirect, shared memory; lower diameter, message passing; scalability, local wires; greater bandwidth
So, only a small portion of the possible networks has been explored in practical parallel computers
Comparison of Common Topologies
Topology | Node degree | Diameter | Bisection bandwidth
Fully Connected | N - 1 | 1 | N^2/4
Ring | 2 | N/2 | 2
2D Torus | 4 | sqrt(N) - 1 | 2 sqrt(N)
Tree | - | 2 log_(d-1) N | 1
Fat Tree | - | 2 log_2 N | N/2
Hypercube | log_2 N | log_2 N | N/2
Butterfly | 4 | 2l | N/(l + 1)
de Bruijn | d | log_d N | 2dN/log_d N
DCell | k + 1 | < log_n N - 1 | N/(4 log_n N)
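[Figure: topologies placed by diameter; axes: diameter vs. degree]
Diameter 1: Fully Connected
Diameter log N: Binary tree, Hypercube
Diameter sqrt N: Torus
Diameter N / 2: Ring
Diameter N - 1: Linear Array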
Supercomputer Interconnects?!
New York
Data Traffic in a Computer Is Similar to This
Search for Optimal Network Topologies for Supercomputers
[Our Goals]
Design the state-of-the-art interconnection networks.
[Challenges]
The entire ecosystem of network design is too big.
[Our Focus]
On discovering the optimal network topologies.
03 Search for Optimal Topologies
Interconnection Networks (adapted from B. Parhami)
Number of nodes: N
Node degree: k
Diameter: D
Bisection bandwidth: B
Longest wire
Heterogeneous nodes
Other attributes: regularity, scalability, packageability, robustness
Strategy to Search for Optimal Topologies
Add bypass links on known topologies
Add links on a Hamiltonian cycle
Remove links from a fully connected network
Successive construction
Exhaustive search and embedding
Add Bypass Links on Torus
[Figure: iBT (8×8; b=<4>) and iBT (9^3; b=<3>); legend: torus link, X-axis bypass link, Y-axis bypass link]
Source: P. Zhang, R. Powell, and Y. Deng, IEEE Trans. Parallel and Distributed Systems Vol. 22 Issue 2 (2011) pp. 287-295
3D iBT vs. 4D Torus
[Figure: two panels plotting diameter and mean path length against network size (# of nodes, × 1,000, from 0 to 1,000); series label: 3D iBT (8); annotated values: 32.0, 10.9, 64.0, 16.0]
Add Links on a Hamiltonian Cycle
Tests at 10/14 at NCKU
Topology | Mean path length | Diameter | Number of cables | HPL with CPU stability | HPL with CPU Turbo Boost | HPL with CPU Turbo & HT
N8k4 | 1.43 | 2 | 16 | Done | Pending | Pending
N16k6 | 1.6 | 2 | 48 | Done | Pending | Pending
N32k6 | 2.13 | 3 | 96 | Done | Pending | Pending
N64k6 | 2.38 | 4 | 192 | Pending | Pending | Pending
[Figure labels: N8k4, N16k6, N16k6, N64k6]
Add Links on a Hamiltonian Cycle
Tests at 10/14 at NCKU
Nodes | HPL best efficiency (64G RAM) | HPL best efficiency (96G RAM)
1 | 91.44% | 91.31%
8 | 85.83% | 87.49%
16 | 83.52% | 84.44%
32 | 80.58% | 81.94%
64 | not listed | not listed
Nodes | Parallel efficiency (64G RAM) | Parallel efficiency (96G RAM)
1 | 100% | 100%
8 | 93.86% | 95.82%
16 | 91.34% | 92.48%
32 | 88.12% | 89.74%
64 | not listed | not listed
[Charts: HPL efficiency and parallel efficiency vs. number of nodes (1, 8, 16, 32), one series each for 64G RAM and 96G RAM]
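The parallel-efficiency figures on this slide are consistent with dividing each HPL efficiency by the single-node HPL efficiency of the same memory configuration. The sketch below reproduces that calculation; the definition is inferred from the numbers rather than stated on the slide, and the variable names are illustrative.

```python
# HPL best efficiency in percent, keyed by node count, as reported on the slide.
hpl_efficiency = {
    "64G RAM": {1: 91.44, 8: 85.83, 16: 83.52, 32: 80.58},
    "96G RAM": {1: 91.31, 8: 87.49, 16: 84.44, 32: 81.94},
}


def parallel_efficiency(series):
    """Normalize each efficiency by the single-node value (assumed definition)."""
    single_node = series[1]
    return {nodes: round(100 * eff / single_node, 2) for nodes, eff in series.items()}


parallel = {ram: parallel_efficiency(series) for ram, series in hpl_efficiency.items()}
for ram, series in parallel.items():
    print(ram, series)
```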
Remove Links from a Fully Connected Network
M. Michalewicz, L. Orlowski and Y.F. Deng, Constructing graphs by algorithmic edge removal (in preparation)
Successive Construction
M. Michalewicz, L. Orlowski and Y.F. Deng, Constructing graphs by algorithmic edge removal (in preparation)
What Is the Best Network Topology?
[Figure: topologies placed by diameter]
Diameter 1: Fully Connected
Diameter log N: Binary tree, Hypercube
Diameter sqrt N: Torus
Diameter N / 2: Ring
Diameter N - 1: Linear Array
Open question: wires vs. diameter
The Degree/Diameter Graph Problem
Suppose you have an unlimited supply of degree-d nodes. How many can be connected into a network of diameter D?
Petersen graph: d = 3, D = 2, N = 10
Hoffman-Singleton graph: d = 7, D = 2, N = 50
Source: http://en.wikipedia.org/wiki/Petersen_graph; http://en.wikipedia.org/wiki/Hoffman–Singleton_graph
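The degree/diameter question has a classical upper limit, the Moore bound: a node can reach at most d neighbors in one hop and at most (d - 1) new nodes through each of those at every further hop. The sketch below evaluates that bound; both example graphs on this slide attain it. The function names are illustrative, and the second helper only gives a lower bound on the diameter, not a guarantee that a graph with that diameter exists.

```python
def moore_bound(d, D):
    """Upper bound on the node count of a graph with maximum degree d and diameter D."""
    if d < 2 or D < 0:
        raise ValueError("expected d >= 2 and D >= 0")
    # One root, then d * (d - 1)^i nodes at hop i + 1 for i = 0 .. D - 1.
    return 1 + d * sum((d - 1) ** i for i in range(D))


def diameter_lower_bound(n, d):
    """Smallest D whose Moore bound admits n nodes of degree d."""
    D = 0
    while moore_bound(d, D) < n:
        D += 1
    return D


print(moore_bound(3, 2))            # Petersen graph: d = 3, D = 2
print(moore_bound(7, 2))            # Hoffman-Singleton graph: d = 7, D = 2
print(diameter_lower_bound(64, 6))  # no 64-node degree-6 graph can beat this diameter
```

For N = 64 and k = 6 the bound rules out diameter 2 (at most 37 nodes), so a diameter of 3 is the best any such graph could achieve.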
E8 picture (E8 Lie group: 240 points in 8 dimensions). Source: http://www.math.lsa.umich.edu/~jrs/coxplane.html
Graph Theory Topology Network
Problem Statement: (N, k)
Given N vertices and a fixed vertex degree k (the number of edges incident to each vertex), find a graph whose diameter is minimal. The diameter is the longest of the geodesic (shortest-path) distances over all pairs of nodes.
The mean path length, the average of those distances over all pairs, should also be minimal.
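The two objectives in the problem statement can be computed directly from an edge list. The following is a minimal sketch using the Floyd-Warshall all-pairs shortest-path algorithm, applied to the Petersen graph as a small test case. The function name and the vertex numbering of the Petersen graph are choices made for this illustration, and the code assumes a connected, undirected graph.

```python
from itertools import product


def degree_diameter_mean(n, edges):
    """Return (max degree, diameter, mean path length) of a connected undirected graph."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    degree = [0] * n
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
        degree[u] += 1
        degree[v] += 1
    # Floyd-Warshall: product() varies its first index slowest, so k is the outer loop.
    for k, i, j in product(range(n), repeat=3):
        if dist[i][k] + dist[k][j] < dist[i][j]:
            dist[i][j] = dist[i][k] + dist[k][j]
    diameter = max(max(row) for row in dist)
    mean_path = sum(sum(row) for row in dist) / (n * (n - 1))
    return max(degree), diameter, mean_path


# Petersen graph: outer 5-cycle (0..4), spokes, and inner pentagram (5..9).
outer = [(i, (i + 1) % 5) for i in range(5)]
spokes = [(i, i + 5) for i in range(5)]
inner = [(5 + i, 5 + (i + 2) % 5) for i in range(5)]
petersen = outer + spokes + inner

k, D, A = degree_diameter_mean(10, petersen)
print(f"k = {k}, D = {D}, mean path length = {A:.4f}")
```

Floyd-Warshall costs O(N^3) time, which is fine for the small graphs discussed here; for large sparse networks a breadth-first search from every node is the more economical choice.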
Why Do Exhaustive Search?
[Figure panels: 3D, 2D]
Original: diameter 3, mean path length 1.714
Rearrange a couple of links: diameter 2, mean path length 1.571
Rearrange a couple of links: diameter 2, mean path length 1.571
Rearranging a couple of links reduces the diameter by 33.3% and the mean path length by 8.3%!
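The numbers on this slide can be reproduced with two 8-node, degree-3 graphs that use the same 12 links. The sketch below compares the 3D cube with the Wagner graph (an 8-cycle plus its four long diagonals). The Wagner graph is chosen here only because it matches the quoted diameter and mean path length; the slide does not name the rearranged graph, so treat that choice as an assumption.

```python
from collections import deque


def hop_metrics(n, edges):
    """Return (diameter, mean path length) via BFS from every node; assumes a connected graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    diameter, total = 0, 0
    for src in range(n):
        dist = [-1] * n
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diameter = max(diameter, max(dist))
        total += sum(dist)
    return diameter, total / (n * (n - 1))


# 3D cube: nodes are 3-bit labels, links join labels that differ in one bit.
cube = [(u, u ^ (1 << b)) for u in range(8) for b in range(3) if u < u ^ (1 << b)]
# Wagner graph: ring 0..7 plus a chord from each node to the opposite node.
wagner = [(i, (i + 1) % 8) for i in range(8)] + [(i, i + 4) for i in range(4)]

cube_d, cube_a = hop_metrics(8, cube)
wagner_d, wagner_a = hop_metrics(8, wagner)
print(f"cube:   D = {cube_d}, A = {cube_a:.3f}, links = {len(cube)}")
print(f"wagner: D = {wagner_d}, A = {wagner_a:.3f}, links = {len(wagner)}")
print(f"diameter reduced by {100 * (cube_d - wagner_d) / cube_d:.1f}%")
print(f"mean path length reduced by {100 * (cube_a - wagner_a) / cube_a:.1f}%")
```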
Comparison of Topologies for N=16
[Figure: drawings of the five networks]
Metric | Hypercube | 4x4 Mesh | 4x4 Torus | Optimal N16k3 | Optimal N16k4
Degree | 4 | 4 | 4 | 3 | 4
Diameter | 4 | 6 | 4 | 3 | 3
Mean path length | 2.1333 | 2.6667 | 2.1333 | 2.2000 | 1.7500
Number of edges | 32 | 24 | 32 | 24 | 32
Compared with the hypercube and the 4x4 torus, Optimal N16k3 uses 25% fewer wires while keeping a similar mean path length and a 25% smaller diameter. Optimal N16k4 uses the same number of wires and achieves a 25% smaller diameter and an 18% shorter mean path length.
Comparison of Topologies for N=32
[Figure: drawings of the five networks]
Metric | Hypercube | 4x8 Torus | Optimal N32k3 | Optimal N32k4 | Optimal N32k5
Degree | 5 | 4 | 3 | 4 | 5
Diameter | 5 | 6 | 5 | 4 | 3
Mean path length | 2.581 | 3.0968 | 2.98185 | 2.42540 | 2.16734
Number of edges | 80 | 64 | 48 | 64 | 80
Graph for N=64
N64k6: D=3; A=2.33; L=192
Exhaustive searches for top topologies are possible for N64k6, i.e., N=64 and k=6.
The search space for N256k8 is ~10^1760.
For comparison, there are about 10^23 stars in the universe, so exhaustive search is probably impossible for larger graphs.
Therefore, we must invent techniques to search for top (quasi-optimal) topologies.
How to find topologies for N=1,024 or even 3,000,000???
McKay, B. D., & Wormald, N. C. (1990). Asymptotic enumeration by degree sequence of graphs of high degree. European Journal of Combinatorics, 11, 565-580. Retrieved from http://cs.anu.edu.au/~bdm/papers/highdeg.pdf
Deng, Y. et al. (2014, in preparation). The first-principle discovery of k-degree optimal graphs and engineering validations of optimality.
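The size of the search space for N=256, k=8 can be checked with a short estimate. The sketch below uses the standard configuration-model asymptotic for the number of labeled k-regular graphs on n vertices, exp(-(k^2 - 1)/4) * (nk)! / ((nk/2)! * 2^(nk/2) * (k!)^n). This is an assumption made for illustration and is not necessarily the formula used by the authors or in the cited paper; it lands on the same order of magnitude as the figure quoted on the slide.

```python
from math import lgamma, log


def log10_regular_graph_count(n, k):
    """log10 of the asymptotic number of labeled k-regular graphs on n vertices."""
    endpoints = n * k  # every edge contributes two endpoints
    if endpoints % 2:
        raise ValueError("n * k must be even for a k-regular graph to exist")
    half = endpoints // 2
    ln_count = (
        lgamma(endpoints + 1)   # ln (nk)!
        - lgamma(half + 1)      # ln (nk/2)!
        - half * log(2)         # ln 2^(nk/2)
        - n * lgamma(k + 1)     # ln (k!)^n
        - (k * k - 1) / 4       # correction for loops and multi-edges
    )
    return ln_count / log(10)


exponent = log10_regular_graph_count(256, 8)
print(f"N256k8 search space is roughly 10^{exponent:.0f}")
```

Working with log-gamma avoids building integers with thousands of digits and keeps the estimate usable for much larger N.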
Possible to Generate Massive Graphs?
Method 2: Graph Embedding: (N8k3) x (N8k3(a)) (M=64)
Best Way to Connect M=32 Nodes
[Figure: drawings of the four networks]
Metric | Hypercube | 4x8 Torus | N4k2 x N8k3 (a) | N4k2 x N8k3 (b)
Degree | 5 | 4 | 3 | 5
Diameter | 5 | 6 | 6 | 7
Mean path length | 2.581 | 3.0968 | 3.887 | 3.935
Number of edges | 80 | 64 | 44 | 44
Graph Embedding N8k3xN8k3(a) (M=64)
For hypercube 2^6
M=64, k=6
A = 3.048
D = 6
L = 192 = (64x6/2)
For 2D torus 8x8
M=64, k=4
A = 4.0635
D = 8
L = 128 = (64x4/2)
N8k3 x N8k3(a):
M=64, k=3 or 4
A = ???
D = ???
L = 108 = (12x8+8+4)
Graph Embedding N8k3 x N16k3 (M=128)
N8k3 x N16k3:
M=128, k=3 or 4
A = 6.330
D = 13
L = 216
For 16x8 Torus
M=128, k=4
A = 6.0472
D = 12
L = 256
For hypercube 2^7
M=128, k=7
A = 3.5276
D = 7
L = 448
Hop Distributions
Graph Embedding (N16k3)^2 (M=256)
For hypercube 2^8
M=256, k=8
A = 4.0157
D = 8
L = 1024
For 16x16 Torus
M=256, k=4
A = 8.0314
D = 16
L = 512
N16k3 x N16k3:
M=256, k=3 or 4
A = 9.23
D = 15
L = 408
Hop Distributions
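The link counts L quoted for the embedded graphs follow a simple pattern: every node of the outer graph is replaced by a full copy of the inner graph, and every outer edge becomes one link between two copies. The sketch below applies that counting rule. How the inter-copy links attach to individual nodes is not specified on the slides, so the code tracks only node and link counts, and the helper names are illustrative.

```python
from functools import reduce


def embed(inner, outer):
    """Combine two graphs given as (nodes, edges): one copy of `inner` per node of `outer`."""
    n_in, e_in = inner
    n_out, e_out = outer
    # n_out copies of the inner edges, plus one link for each outer edge.
    return n_in * n_out, n_out * e_in + e_out


def power(graph, m):
    """Embed a graph into itself repeatedly, e.g. power(N8k3, 3) for (N8k3)^3."""
    return reduce(embed, [graph] * m)


# (nodes, edges) of the building blocks; a degree-k graph on N nodes has N * k / 2 edges.
N4k2, N8k3, N16k3 = (4, 4), (8, 12), (16, 24)

results = {
    "N4k2 x N8k3": embed(N4k2, N8k3),
    "N8k3 x N16k3": embed(N8k3, N16k3),
    "(N16k3)^2": power(N16k3, 2),
    "(N8k3)^3": power(N8k3, 3),
    "(N16k3)^3": power(N16k3, 3),
    "(N8k3)^4": power(N8k3, 4),
}
for name, (nodes, links) in results.items():
    print(f"{name}: M = {nodes}, L = {links}")
```

Because each node gains at most one inter-copy link per embedding level in these counts, the wire budget grows almost linearly with M, which is why the embedded graphs use fewer links than a torus of the same size.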
Graph Embedding (N8k3)^3 (M=512)
For hypercube 2^9
M=512, k=9
A = 4.5088
D = 9
L = 2304
For 32x16 Torus
M=512, k=4
A = 12.0235
D = 24
L = 1024
N8k3xN8k3xN8k3:
M=512, k=3 or 4
A = 11.47
D = 20
L = 876
Hop Distributions (number of node pairs at each hop count, vertical axis 0 to 30,000)
Hops: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Pairs: 1752, 3376, 4040, 7152, 9704, 12064, 12104, 11888, 15072, 17872, 21648, 25120, 25248, 25008, 25040, 22080, 15536, 6000, 880, 48
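The mean path length A and diameter D of (N8k3)^3 can be recovered from the hop-distribution bar values on this slide. The sketch below assumes the twenty bar labels are listed in order of hop count from 1 to 20; under that reading they sum to 512 x 511, the number of ordered node pairs, and reproduce the quoted A and D.

```python
# Bar labels of the (N8k3)^3 hop distribution, assumed to be ordered by hop count 1..20.
pair_counts = [
    1752, 3376, 4040, 7152, 9704, 12064, 12104, 11888, 15072, 17872,
    21648, 25120, 25248, 25008, 25040, 22080, 15536, 6000, 880, 48,
]

total_pairs = sum(pair_counts)
# Mean path length: hop counts weighted by how many pairs are that far apart.
mean_path = sum(h * c for h, c in enumerate(pair_counts, start=1)) / total_pairs
# Diameter: the largest hop count that still has at least one pair.
diameter = max(h for h, c in enumerate(pair_counts, start=1) if c > 0)

print(f"ordered pairs = {total_pairs}")
print(f"A = {mean_path:.2f}, D = {diameter}")
```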
Graph Embedding (M=4096): (N16k3)^3 & (N8k3)^4
For hypercube 2^12
M=4096, k=12
A = 6.0015, D = 12
L = 24,576
For 64x64 Torus
M=4096, k=4
A = 32.0078, D = 64
L = 8,192
(N16k3)^3:
M=4096, k=3 or 4
A = 34.72, D = 60
L = 6,552
(N8k3)^4:
M=4096, k=3 or 4
A = 30.851, D = 55
L = 7,020
Hop Distributions for M=4096
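The hypercube and torus reference values (A, D, L) quoted alongside the embedded graphs have closed forms, so they can be checked without building the graphs. The sketch below uses standard formulas for binary hypercubes and 2D tori; the function names are illustrative.

```python
def hypercube(k):
    """Return (M, degree, A, D, L) for a k-dimensional binary hypercube."""
    m = 2 ** k
    links = k * m // 2
    # A node has C(k, h) peers at h hops, so its hop total is k * 2^(k - 1).
    mean_path = k * 2 ** (k - 1) / (m - 1)
    return m, k, mean_path, k, links


def ring_hop_total(n):
    """Sum of hop counts from one node to every other node on an n-node ring."""
    return (n * n) // 4


def torus_2d(a, b):
    """Return (M, degree, A, D, L) for an a x b torus with a, b >= 3."""
    m = a * b
    # Hops add per axis: each x-offset occurs with b y-positions and vice versa.
    hop_total = b * ring_hop_total(a) + a * ring_hop_total(b)
    return m, 4, hop_total / (m - 1), a // 2 + b // 2, 2 * m


for k in (6, 7, 8, 9, 12):
    m, deg, a_mean, d, links = hypercube(k)
    print(f"hypercube 2^{k}: M={m}, k={deg}, A={a_mean:.4f}, D={d}, L={links}")

for a, b in ((8, 8), (16, 8), (16, 16), (32, 16), (64, 64)):
    m, deg, a_mean, d, links = torus_2d(a, b)
    print(f"torus {a}x{b}: M={m}, k={deg}, A={a_mean:.4f}, D={d}, L={links}")
```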
Prototype 1: iBT(8^2,b=2) vs. T(8^2) vs. T(4^3)
NAMD
NAS Parallel Benchmarks
HPC Challenge Benchmarks
LINPACK Benchmarks
Prototype 2: Optimal N8k3 at NCKU
One Prototype with N=1024 at NCKU
CK-1024: 2 Pflops, 1.1 MW
5,120 fiber links
04 Summaries
Search for Optimal Network Topologies for Supercomputers
[Summaries]
Next-generation supercomputers need better interconnection networks:
Technologies and topologies
Optimal topologies show better performance:
Diameter, mean path length, utilization of wires, etc.
There is still a long way to go to find and use optimal topologies:
Other optimization metrics: bandwidth, applications, etc.
Massive networks: optimization algorithms, embedding, packaging, etc.
Routing, mapping; scalability, robustness, etc.
Engineering