Modeling a Million-Node Dragonfly Network using Massively Parallel
Discrete-Event Simulation
Misbah Mubarak, Christopher D. Carothers (Rensselaer Polytechnic Institute)
Robert Ross, Philip Carns (Argonne National Laboratory)
Outline
- Dragonfly network topology
- Validation of the dragonfly model
- Performance comparison with booksim
- Scaling the dragonfly model on BG/P and BG/Q
- Conclusion & future work
The Dragonfly Network Topology
- A two-level directly connected topology
- Uses high-radix routers: a large number of ports per router, each port with moderate bandwidth
- "p": number of compute nodes connected to a router
- "a": number of routers in a group
- "h": number of global channels per router
- Router radix: k = a + p + h - 1
- Recommended configuration: a = 2p = 2h
Simulating Interconnect Networks
- Expected size of exascale systems: millions of compute cores, up to 1 million compute nodes
- Critical to have a low-latency, small-diameter, low-cost interconnect network
- Exascale HPC systems cannot be effectively simulated with small-scale prototypes
- We use the Rensselaer Optimistic Simulation System (ROSS) to simulate a dragonfly model with millions of nodes
- Our dragonfly model attains a peak event rate of 1.33 billion events/sec; total committed events: 872 billion
Dragonfly Model Configuration
- Traffic patterns: Uniform Random traffic (UR) and Nearest Neighbor traffic (worst-case traffic, WC)
- Virtual channels to avoid deadlocks
- Credit-based flow control: upstream nodes/routers keep track of downstream buffer slots
- An input-queued virtual channel router; each router port supports up to 'v' virtual channels
[Figure: packet flow across the sending node, source router, intermediate router(s), destination router, and destination node. At each hop the packet is generated/sent and arrives after a channel delay; before each send, the sender checks whether the downstream buffer is full and, if so, waits for a credit; credits are returned upstream as buffer slots free up.]
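The credit mechanism in the packet-flow diagram can be sketched as a toy model (an illustrative sketch, not the ROSS implementation): the upstream side holds one credit per free downstream buffer slot, sends only while credits remain, and regains a credit when the downstream router drains a slot.

```python
class CreditLink:
    """Toy model of credit-based flow control on one link."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # one credit per downstream buffer slot

    def try_send(self):
        """Send a packet if a downstream slot is free; else caller must wait."""
        if self.credits == 0:
            return False              # downstream buffer full: wait for credit
        self.credits -= 1             # packet now occupies a downstream slot
        return True

    def credit_return(self):
        """Downstream router freed a buffer slot; credit flows back upstream."""
        self.credits += 1
```

With a two-slot buffer, two sends succeed, the third blocks until a credit returns.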
Dragonfly Model Routing Algorithms
- Minimal routing (MIN)
  - Uniform random traffic: high throughput, low latency
  - Nearest neighbor traffic: causes congestion, high latency, low throughput
- Non-minimal routing (VAL)
  - Half the throughput of MIN under UR traffic
  - Nearest neighbor traffic: optimal performance (about 50% throughput)
- Global adaptive routing
  - Chooses between MIN and VAL by sensing the traffic conditions on the global channels
Dragonfly Model Minimal Routing
(i) Packet arrives at R0; destination router = R7
(ii) Packet traverses to R1 over a local channel
(iii) Packet traverses from R1 to R4 over the global channel
(iv) Packet traverses to R7 over a local channel
[Figure: animation frames showing packet P moving through routers R0-R7 via global channels G0 and G1.]
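The MIN path in the example (at most local -> global -> local) can be sketched as follows. The helper `gateway(g_from, g_to)` is hypothetical: it stands in for the fixed dragonfly wiring and returns the router in group `g_from` that owns the global channel to group `g_to`.

```python
def minimal_route(src, dst, gateway):
    """Sketch of minimal (MIN) routing in a dragonfly.

    src, dst: (group, router) pairs.
    gateway(g_from, g_to): assumed helper giving the router in g_from
    that holds the global channel to g_to.
    Returns the list of hops as (channel_type, next_router).
    """
    (g_src, r_src), (g_dst, r_dst) = src, dst
    hops, cur = [], r_src
    if g_src != g_dst:
        gw_out = gateway(g_src, g_dst)      # router with the global channel
        if cur != gw_out:
            hops.append(("local", gw_out))  # e.g. R0 -> R1
        cur = gateway(g_dst, g_src)         # entry router in the dest group
        hops.append(("global", cur))        # e.g. R1 -> R4
    if cur != r_dst:
        hops.append(("local", r_dst))       # e.g. R4 -> R7
    return hops

# Wiring for the slide's example: R1 holds group 0's channel to group 1,
# R4 holds group 1's channel back to group 0.
gateway = lambda g_from, g_to: {(0, 1): "R1", (1, 0): "R4"}[(g_from, g_to)]
print(minimal_route((0, "R0"), (1, "R7"), gateway))
# [('local', 'R1'), ('global', 'R4'), ('local', 'R7')]
```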
Dragonfly Model Validation
- Dragonfly network topologies in design: the PERCS network topology, machines from the Echelon project
- Booksim: a cycle-accurate simulator with a dragonfly model
  - Used by Dally et al. to validate the dragonfly topology proposal
  - Runs in serial mode only
  - Supports minimal and global adaptive routing
  - Performance results shown on 1,024 nodes and 264 routers
- We validated our ROSS dragonfly model against booksim
Global Adaptive Routing: Threshold Selection (ROSS vs. booksim)

    if min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold then
        route minimally
    else
        route non-minimally
    end if
Global Adaptive Routing
• Booksim uses an adaptive threshold to bias the UGAL algorithm towards minimal or non-minimal routing
• We incorporated a similar threshold in ROSS
• We ran experiments to find the threshold value that biases traffic towards non-minimal routing
• The value that yields the maximum number of non-minimal packets is -180
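The threshold rule above can be written directly as code. A minimal sketch of the decision (the queue sizes are inputs the router would measure; -180 is the experimentally chosen bias from the slide):

```python
ADAPTIVE_THRESHOLD = -180   # value found experimentally in the talk

def ugal_choice(min_queue_size, nonmin_queue_size,
                adaptive_threshold=ADAPTIVE_THRESHOLD):
    """Biased UGAL decision: compare the minimal-path queue against
    twice the non-minimal-path queue plus an additive threshold."""
    if min_queue_size < 2 * nonmin_queue_size + adaptive_threshold:
        return "minimal"
    return "non-minimal"
```

A negative threshold makes the minimal path harder to pick, biasing traffic towards non-minimal routes; for example, with a non-minimal queue of 100 the minimal queue must be below 20 for MIN to win.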
ROSS vs. booksim: Uniform Random Traffic
With minimal routing, ROSS has an average of 4.2% and a maximum of 7% difference from booksim results
With global adaptive routing, ROSS has an average of 3% and a maximum of 7.8% difference from booksim results
ROSS vs. booksim: Nearest Neighbor Traffic
- Nearest neighbor traffic yields very high latency and low throughput with minimal routing
- This traffic pattern can be load-balanced by either non-minimal or adaptive routing
- Non-minimal routing gives slightly under 50% throughput with nearest neighbor traffic
Dragonfly Performance: ROSS vs. booksim
- ROSS attains a minimum of 5x and a maximum of 11x speedup over booksim with MIN routing
- ROSS attains a minimum of 5.3x and a maximum of 12.38x speedup with global adaptive routing
ROSS Dragonfly Model on BG/P and BG/Q
- We evaluated the strong scaling characteristics of the dragonfly model on:
  - the Argonne Leadership Computing Facility (ALCF) IBM Blue Gene/P system (Intrepid)
  - the Computational Center for Nanotechnology Innovations (CCNI) IBM Blue Gene/Q
- We scheduled 64 MPI tasks per node on BG/Q and 4 MPI tasks per node on BG/P
- Performance was evaluated through the following metrics: committed event rate, percentage of remote events, ROSS event efficiency, and simulation run time
ROSS Parameters
- ROSS employs the Time Warp optimistic synchronization protocol
- To reduce state-saving overheads, ROSS employs an event rollback mechanism
- ROSS event efficiency determines the amount of useful work performed by the simulation
- Global Virtual Time (GVT) imposes a lower bound on the simulation time
- GVT is controlled by the batch and gvt-interval parameters; on average, batch * gvt-interval events are processed between each GVT epoch
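These two quantities can be made concrete with a short sketch. The GVT-epoch formula is taken from the slide; the efficiency formula is an assumption based on how ROSS is usually described (useful work = committed events not undone by rollbacks), so treat it as illustrative rather than the exact ROSS definition:

```python
def events_per_gvt_epoch(batch, gvt_interval):
    """Average events processed between GVT epochs (per the slide)."""
    return batch * gvt_interval

def event_efficiency(committed, rolled_back):
    """Assumed form of ROSS event efficiency, in percent: the share of
    committed work not undone by rollbacks. Can go negative when
    rollbacks exceed committed events."""
    return 100.0 * (1.0 - rolled_back / committed)

print(events_per_gvt_epoch(8, 512))       # 4096
print(event_efficiency(1000, 100))        # 90.0
```

This illustrates the scaling results that follow: as rollbacks grow relative to committed events, efficiency falls, which is the drop observed on BG/P past 16K MPI tasks.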
ROSS Dragonfly Performance Results on BG/P vs. BG/Q
- Event efficiency drops and total rollbacks increase on BG/P after 16K MPI tasks
- There is less off-node communication on BG/Q than on BG/P
- Each MPI task has more processing power on BG/P, so the simulation advances quickly
ROSS Dragonfly Performance Results on BG/P vs. BG/Q
- The event efficiency stays high on both BG/P and BG/Q, as each MPI task has a substantial workload
- The computation performed at each MPI task dominates the number of rolled-back events
Conclusion & Future Work
Conclusion
- We presented a parallel discrete-event simulation of a dragonfly network topology
- We validated our simulator against the cycle-accurate simulator booksim
- We demonstrated the ability of our simulator to scale to very large models with up to 50M nodes
Future work
- Introduce an improved queue congestion sensing policy for global adaptive routing
- Experiment with other variations of nearest neighbor traffic in the dragonfly
- Compare the dragonfly network model with other candidate topology models for exascale computing