Modeling a Million-Node Dragonfly Network using Massively Parallel
Discrete-Event Simulation
Misbah Mubarak, Christopher D. Carothers (Rensselaer Polytechnic Institute)
Robert Ross, Philip Carns (Argonne National Laboratory)
Outline
- Dragonfly network topology
- Validation of the dragonfly model
- Performance comparison with booksim
- Scaling the dragonfly model on BG/P and BG/Q
- Conclusion & future work
The Dragonfly Network Topology
- A two-level directly connected topology
- Uses high-radix routers: a large number of ports per router, each port with moderate bandwidth
- "p": number of compute nodes connected to a router
- "a": number of routers in a group
- "h": number of global channels per router
- Router radix: k = a + p + h - 1
- Recommended configuration: a = 2p = 2h
Simulating Interconnect Networks
- Expected size of exascale systems: millions of compute cores, up to 1 million compute nodes
- Critical to have a low-latency, small-diameter, low-cost interconnect network
- Exascale HPC systems cannot be effectively simulated with small-scale prototypes
- We use the Rensselaer Optimistic Simulation System (ROSS) to simulate a dragonfly model with millions of nodes
- Our dragonfly model attains a peak event rate of 1.33 billion events/sec; total committed events: 872 billion
Dragonfly Model Configuration
- Traffic patterns: Uniform Random traffic (UR) and Nearest Neighbor traffic (worst-case traffic, WC)
- Virtual channels to avoid deadlocks
- Credit-based flow control: upstream nodes/routers keep track of downstream buffer slots
- An input-queued virtual channel router; each router port supports up to 'v' virtual channels
[Figure: packet flow across the sending node, source router, intermediate router(s), destination router, and destination node. At each hop the packet is generated/sent and arrives after a channel delay; before each send, the sender checks whether the downstream buffer is full and, if so, waits for a credit; credits are returned upstream as buffer slots free up.]
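The credit mechanism in the packet-flow diagram can be sketched as a toy model (an illustrative sketch, not the ROSS implementation): the upstream side holds one credit per free downstream buffer slot, sends only while credits remain, and regains a credit when the downstream router drains a slot.

```python
class CreditLink:
    """Toy model of credit-based flow control on one link."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # one credit per downstream buffer slot

    def try_send(self):
        """Send a packet if a downstream slot is free; else caller must wait."""
        if self.credits == 0:
            return False              # downstream buffer full: wait for credit
        self.credits -= 1             # packet now occupies a downstream slot
        return True

    def credit_return(self):
        """Downstream router freed a buffer slot; credit flows back upstream."""
        self.credits += 1
```

With a two-slot buffer, two sends succeed, the third blocks until a credit returns.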
Dragonfly Model Routing Algorithms
- Minimal routing (MIN)
  - Uniform random traffic: high throughput, low latency
  - Nearest neighbor traffic: causes congestion, high latency, low throughput
- Non-minimal routing (VAL)
  - Half the throughput of MIN under UR traffic
  - Nearest neighbor traffic: optimal performance (about 50% throughput)
- Global adaptive routing
  - Chooses between MIN and VAL by sensing the traffic conditions on the global channels
Dragonfly Model Minimal Routing
(i) Packet arrives at R0; destination router = R7
(ii) Packet traverses to R1 over a local channel
(iii) Packet traverses from R1 to R4 over the global channel
(iv) Packet traverses to R7 over a local channel
[Figure: animation frames showing packet P moving through routers R0-R7 via global channels G0 and G1.]
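The MIN path in the example (at most local -> global -> local) can be sketched as follows. The helper `gateway(g_from, g_to)` is hypothetical: it stands in for the fixed dragonfly wiring and returns the router in group `g_from` that owns the global channel to group `g_to`.

```python
def minimal_route(src, dst, gateway):
    """Sketch of minimal (MIN) routing in a dragonfly.

    src, dst: (group, router) pairs.
    gateway(g_from, g_to): assumed helper giving the router in g_from
    that holds the global channel to g_to.
    Returns the list of hops as (channel_type, next_router).
    """
    (g_src, r_src), (g_dst, r_dst) = src, dst
    hops, cur = [], r_src
    if g_src != g_dst:
        gw_out = gateway(g_src, g_dst)      # router with the global channel
        if cur != gw_out:
            hops.append(("local", gw_out))  # e.g. R0 -> R1
        cur = gateway(g_dst, g_src)         # entry router in the dest group
        hops.append(("global", cur))        # e.g. R1 -> R4
    if cur != r_dst:
        hops.append(("local", r_dst))       # e.g. R4 -> R7
    return hops

# Wiring for the slide's example: R1 holds group 0's channel to group 1,
# R4 holds group 1's channel back to group 0.
gateway = lambda g_from, g_to: {(0, 1): "R1", (1, 0): "R4"}[(g_from, g_to)]
print(minimal_route((0, "R0"), (1, "R7"), gateway))
# [('local', 'R1'), ('global', 'R4'), ('local', 'R7')]
```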
Dragonfly Model Validation
- Dragonfly network topologies in design: the PERCS network topology, machines from the Echelon project
- Booksim: a cycle-accurate simulator with a dragonfly model
  - Used by Dally et al. to validate the dragonfly topology proposal
  - Runs in serial mode only
  - Supports minimal and global adaptive routing
  - Performance results shown on 1,024 nodes and 264 routers
- We validated our ROSS dragonfly model against booksim
Global Adaptive Routing: Threshold Selection (ROSS vs. booksim)

    if min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold then
        route minimally
    else
        route non-minimally
    end if
Global Adaptive Routing
• Booksim uses an adaptive threshold to bias the UGAL algorithm towards minimal or non-minimal routing
• We incorporated a similar threshold in ROSS
• We ran experiments to find the threshold value that biases traffic towards non-minimal routing
• The value that yields the maximum number of non-minimal packets is -180
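The threshold rule above can be written directly as code. A minimal sketch of the decision (the queue sizes are inputs the router would measure; -180 is the experimentally chosen bias from the slide):

```python
ADAPTIVE_THRESHOLD = -180   # value found experimentally in the talk

def ugal_choice(min_queue_size, nonmin_queue_size,
                adaptive_threshold=ADAPTIVE_THRESHOLD):
    """Biased UGAL decision: compare the minimal-path queue against
    twice the non-minimal-path queue plus an additive threshold."""
    if min_queue_size < 2 * nonmin_queue_size + adaptive_threshold:
        return "minimal"
    return "non-minimal"
```

A negative threshold makes the minimal path harder to pick, biasing traffic towards non-minimal routes; for example, with a non-minimal queue of 100 the minimal queue must be below 20 for MIN to win.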
ROSS vs. booksim: Uniform Random Traffic
With minimal routing, ROSS has an average of 4.2% and a maximum of 7% difference from booksim results
With global adaptive routing, ROSS has an average of 3% and a maximum of 7.8% difference from booksim results
ROSS vs. booksim: Nearest Neighbor Traffic
- Nearest neighbor traffic yields very high latency and low throughput with minimal routing
- This traffic pattern can be load-balanced by either non-minimal or adaptive routing
- Non-minimal routing gives slightly under 50% throughput with nearest neighbor traffic
Dragonfly Performance: ROSS vs. booksim
- ROSS attains a minimum of 5x and a maximum of 11x speedup over booksim with MIN routing
- ROSS attains a minimum of 5.3x and a maximum of 12.38x speedup with global adaptive routing
ROSS Dragonfly Model on BG/P and BG/Q
- We evaluated the strong scaling characteristics of the dragonfly model on:
  - the Argonne Leadership Computing Facility (ALCF) IBM Blue Gene/P system (Intrepid)
  - the Computational Center for Nanotechnology Innovations (CCNI) IBM Blue Gene/Q
- We scheduled 64 MPI tasks per node on BG/Q and 4 MPI tasks per node on BG/P
- Performance was evaluated through the following metrics: committed event rate, percentage of remote events, ROSS event efficiency, and simulation run time
ROSS Parameters
- ROSS employs the Time Warp optimistic synchronization protocol
- To reduce state-saving overheads, ROSS employs an event rollback mechanism
- ROSS event efficiency determines the amount of useful work performed by the simulation
- Global Virtual Time (GVT) imposes a lower bound on the simulation time
- GVT is controlled by the batch and gvt-interval parameters; on average, batch * gvt-interval events are processed between each GVT epoch
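These two quantities can be made concrete with a short sketch. The GVT-epoch formula is taken from the slide; the efficiency formula is an assumption based on how ROSS is usually described (useful work = committed events not undone by rollbacks), so treat it as illustrative rather than the exact ROSS definition:

```python
def events_per_gvt_epoch(batch, gvt_interval):
    """Average events processed between GVT epochs (per the slide)."""
    return batch * gvt_interval

def event_efficiency(committed, rolled_back):
    """Assumed form of ROSS event efficiency, in percent: the share of
    committed work not undone by rollbacks. Can go negative when
    rollbacks exceed committed events."""
    return 100.0 * (1.0 - rolled_back / committed)

print(events_per_gvt_epoch(8, 512))       # 4096
print(event_efficiency(1000, 100))        # 90.0
```

This illustrates the scaling results that follow: as rollbacks grow relative to committed events, efficiency falls, which is the drop observed on BG/P past 16K MPI tasks.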
ROSS Dragonfly Performance Results on BG/P vs. BG/Q
- Event efficiency drops and total rollbacks increase on BG/P after 16K MPI tasks
- There is less off-node communication on BG/Q than on BG/P
- Each MPI task has more processing power on BG/P, so the simulation advances quickly
ROSS Dragonfly Performance Results on BG/P vs. BG/Q
- The event efficiency stays high on both BG/P and BG/Q, as each MPI task has a substantial workload
- The computation performed at each MPI task dominates the number of rolled-back events
Conclusion & Future Work
Conclusion
- We presented a parallel discrete-event simulation of a dragonfly network topology
- We validated our simulator against the cycle-accurate simulator booksim
- We demonstrated the ability of our simulator to scale to very large models with up to 50M nodes
Future work
- Introduce an improved queue congestion sensing policy for global adaptive routing
- Experiment with other variations of nearest neighbor traffic in the dragonfly
- Compare the dragonfly network model with other candidate topology models for exascale computing