


An Efficient All-to-all Communication Algorithm for Mesh/Torus Networks

Syunji Yazaki∗, Haruyuki Takaue†, Yuichiro Ajima‡, Toshiyuki Shimizu‡ and Hiroaki Ishihata§

∗Information Technology Center, The University of Electro-Communications, Tokyo, Japan 182–8585
Email: [email protected]

†COMSYS JOHO SYSTEM Corporation, Tokyo, Japan 108–8610
Email: [email protected]

‡Fujitsu Ltd., Kawasaki, Japan 211-8588
Email: {aji, t.shimizu}@jp.fujitsu.com

§School of Computer Science, Tokyo University of Technology, Tokyo, Japan 192–0982
Email: [email protected]

Abstract: An efficient all-to-all communication algorithm for torus and mesh networks, A2AT, is proposed. A2AT schedules the message sending sequence so that all links are fully used by exploiting the node's capability for concurrent message transfer. With A2AT, the hop count of messages equals the maximum number of messages sharing a link on their routes for all message transfers. A2AT can therefore maintain synchronization without the need for a phasing operation such as an MPI barrier. When a VOQ, which is an ideal configuration for A2AT, was used, the communication times obtained by A2AT on mesh and torus networks were, on average, roughly 1.20 and 1.09 times the ideal times, respectively. When the networks had the minimum number of virtual channels and a small buffer, assuming a practical network, A2AT reduced communication times by 12.5% and 36.0% compared with the conventional algorithm. When two controllers were used, A2AT reduced communication time by 28.2% and 55.7% relative to A2AND on 15×15×15 (= 3,375 nodes) mesh and torus networks, respectively (18.6% and 44.8% on average). When six controllers were used, A2AT reduced communication time by 15.1% and 41.9% relative to A2AND on the same mesh and torus networks, respectively (14.4% and 37.5% on average).

I. INTRODUCTION

In distributed parallel computers, the performance of internode communication greatly influences the overall performance of some applications. Torus and mesh networks are widely used for massively parallel computers. In networks that have a small bisection bandwidth, such as mesh or torus networks, contention can easily occur and degrade communication performance. Cray XT5 [1] uses a three-dimensional (3D) torus network, and Red Storm [2] uses an asymmetric 3D mesh network with torus connections in only one dimension. For such networks, it is important to develop a communication algorithm that can fully exploit the network bandwidth.

All-to-all communication is a communication pattern in which each node transmits different messages to all the other nodes. This pattern is widely used for matrix transposition and FFT (Fast Fourier Transform).

Several sophisticated all-to-all communication algorithms have been proposed previously. For example, Scott [3] proposed an algorithm for hypercube and mesh topologies, and Horie [4] and Yu-Chee [5] proposed algorithms for an n-dimensional torus topology. The performance of these algorithms reached the theoretical lower bound. However, all of these algorithms assume a phasing operation, such as an MPI barrier, for message communication. All nodes repeatedly perform scheduled message transmissions and pause between transmission steps. This ensures that all the links in the network are used with 100% efficiency. The problem is that a phasing operation is difficult to use when the algorithm is implemented in a practical system. Since all the nodes need to transmit messages and pause synchronously, global synchronization is required before starting each phase. This adds extra overhead to all-to-all communication.

In the BlueGene/L system [6], a special algorithm is employed in which randomization of the destination is combined with an adaptive routing technique. Almási has reported that this algorithm achieved almost 100% efficiency on an 8×8×8 torus network [7]. However, the algorithm cannot be applied to systems that use static routing. Kumar [8] proposed an algorithm that uses static routing and barrier synchronization to improve the efficiency of networks with an asymmetric torus configuration. Doi [9] proposed a method of overlapping all-to-all communication for FFT on an asymmetric 3D torus network such as 8×8×16. This method overlaps local FFT calculations with communication to improve FFT performance.

Recently developed parallel computers contain several communication controllers in one node. This type of hardware can execute multiple message communications concurrently. Bruck [10] and Tipparaju [11] studied the performance of systems in which multiple communication controllers are used. Ajima et al. [12] developed "Tofu", a six-dimensional torus network used in the K computer [13]. In this network, each node has four controllers. The efficient use of this type of hardware for specific communication patterns needs to be explored further.

2012 10th IEEE International Symposium on Parallel and Distributed Processing with Applications
978-0-7695-4701-5/12 $26.00 © 2012 IEEE
DOI 10.1109/ISPA.2012.44

Fig. 1. Example of a torus network.

In this paper, we propose an efficient all-to-all communication algorithm, A2AT (All-to-all communication for torus networks), that schedules the sending sequence so that all links are fully used. We implemented the proposed algorithm on mesh/torus networks that are capable of transferring multiple messages concurrently. When implemented on a two-dimensional (2D) mesh/torus network in which each node has four controllers, the proposed algorithm can attain the theoretical lower bound.

We performed simulations to compare the performance of our algorithm with that of another algorithm that is currently in use. The simulation results showed that the proposed algorithm outperformed the conventional one.

The rest of this paper is organized as follows. In Section II, we introduce the conventional all-to-all algorithm implemented on a mesh/torus network and describe the theoretical lower bound of all-to-all communication. Section III describes the proposed algorithm. In Section IV, we analyze the communication time using a simple communication model. In Section V, we present the simulation and evaluation results and discuss them in detail. Finally, this paper is concluded in Section VI.

II. PRELIMINARIES

A. Mesh/Torus Network

In a mesh/torus network, the routers are connected in a multi-dimensional mesh. An example of a 2D torus network is shown in Fig. 1. The configuration of the nodes and routers is explained in detail in Section II-B.

In a torus network, the communication links in the upward direction wrap around to those in the downward direction; similarly, the links to the left and right of each node are also connected to each other. In a 2D mesh network, which has no such wraparound links, nodes located at the edges of the network have only three links, and nodes located at the corners of the network have only two links.

Fig. 2. Configuration of nodes and routers in a mesh/torus network.

for i = 1 to N − 1 do
    send((myid + i) mod N)
end for

Fig. 3. Sending order of A2ASS.

The routers are connected via full-duplex links that have the same bandwidth, and the messages are routed along a predetermined route (static routing). In this study, we focus on dimension-order routing, in which a message is first routed along the x direction, then along the y direction, and so on. Several virtual channels (VCs) are used per link.
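As an illustration of dimension-order routing, the following minimal sketch lists the nodes a message visits on a k-ary mesh or torus. The function name `dor_route`, the tuple coordinates, and the rule that a tie on an even-sized torus goes in the positive direction are assumptions made here for illustration; they are not taken from the paper or from any simulator.

```python
def dor_route(src, dst, k, torus=True):
    """Nodes visited from src to dst, correcting x first, then y, and so on.

    src and dst are coordinate tuples with entries in 0..k-1.
    On a torus the shorter way around each ring is taken; a tie
    (distance k/2 on an even-sized ring) goes in the + direction,
    which is an arbitrary choice made for this sketch.
    """
    path = [tuple(src)]
    cur = list(src)
    for dim in range(len(src)):
        delta = dst[dim] - cur[dim]
        if torus:
            delta %= k
            if delta > k // 2:
                delta -= k  # going the other way around is shorter
        step = 1 if delta > 0 else -1
        for _ in range(abs(delta)):
            if torus:
                cur[dim] = (cur[dim] + step) % k
            else:
                cur[dim] = cur[dim] + step
            path.append(tuple(cur))
    return path


if __name__ == "__main__":
    print(dor_route((0, 0), (2, 1), 5))               # x hops first, then y
    print(dor_route((0, 0), (4, 0), 5))               # wraparound on a torus
    print(dor_route((0, 0), (4, 0), 5, torus=False))  # no wraparound on a mesh
```

The hop count of a message is the length of the returned path minus one, which is the quantity used in the analysis of Section IV.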

B. Configuration of Nodes and Router

Figure 2 shows the configuration of nodes and routers. The communication links between a node and its router are also full-duplex links and have sufficiently high bandwidth. Each node has several communication controllers: in this case, four. All controllers equally share the bandwidth of the link between the node and the router. The receiver ends of the controllers also operate concurrently. The communication library in each node issues communication requests to a free communication controller in first-in-first-out order. We assume that the switching scheme of the router is wormhole or virtual cut-through.

C. Existing Algorithm

To perform all-to-all communication smoothly, all nodes must have exactly the same function; that is, they send a message to the destination node and receive a message from the source node at the same time. The sequence of the destination nodes is the key point in communication algorithms.

A direct-type all-to-all algorithm, which does not store and combine messages in intermediate nodes, is shown in Fig. 3. This algorithm is called the simple spread algorithm and is used in MPICH [14]. In this paper we refer to it as A2ASS. In all-to-all communication, every node sends $N - 1$ messages, one to each of the other nodes, where $N$ represents the total number of nodes, myid denotes the rank of the node, and send(t) means that a message is sent to the node with rank t.

Another algorithm, for n-dimensional networks, has also been used; we call it A2AND. Its sending order is shown in Fig. 4. For A2AND, the rank of the sending node myid is represented as a coordinate (myid_x, myid_y), and $k_x$ and $k_y$ represent the network size in the x and y directions, respectively.

for y = 0 to k_y − 1 do
    for x = 0 to k_x − 1 do
        if x ≠ 0 or y ≠ 0 then
            send((myid_x + x) mod k_x, (myid_y + y) mod k_y);
        end if
    end for
end for

Fig. 4. Sending order of A2AND.
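The two conventional sending orders can be sketched as plain destination lists. This is an illustrative sketch rather than the MPICH implementation: the function names are hypothetical, and the A2AND guard is read here as "skip only the sender itself", which is an assumption about the intent of Fig. 4.

```python
def a2ass_order(myid, n):
    """Destination ranks of node `myid` in the simple spread order (Fig. 3)."""
    return [(myid + i) % n for i in range(1, n)]


def a2and_order(myid_x, myid_y, kx, ky):
    """Destination coordinates of node (myid_x, myid_y) in the A2AND order (Fig. 4).

    Assumption: the guard in Fig. 4 is meant to skip only the offset (0, 0),
    i.e. the sender itself, so that all kx * ky - 1 other nodes are visited.
    """
    order = []
    for y in range(ky):
        for x in range(kx):
            if (x, y) != (0, 0):
                order.append(((myid_x + x) % kx, (myid_y + y) % ky))
    return order


if __name__ == "__main__":
    print(a2ass_order(2, 5))
    print(a2and_order(0, 0, 2, 2))
```

In both orders every node applies the same offset at the same position in its sequence, so at any moment all nodes tend to send in the same direction; this is the behavior that Section III sets out to avoid.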

D. Lower Bound of all-to-all Communication Time

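The lower bound of the all-to-all communication time on a $k_x \times k_y$ mesh network, $L_{MS}$, is

$$L_{MS} = \left\lfloor \frac{k_x}{2} \right\rfloor \left\lceil \frac{k_x}{2} \right\rceil k_y. \tag{1}$$

The bisection bandwidth of a torus network is twice that of a mesh network, so the lower bound of the all-to-all communication time on a $k_x \times k_y$ torus network, $L_{TR}$, is

$$L_{TR} = \frac{1}{2} L_{MS} = \frac{1}{2} \left\lfloor \frac{k_x}{2} \right\rfloor \left\lceil \frac{k_x}{2} \right\rceil k_y. \tag{2}$$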
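Equations (1) and (2) are easy to evaluate numerically. The sketch below is a direct transcription with hypothetical function names; times are in the normalized units used in this paper (unit message size, unit link bandwidth).

```python
def lower_bound_mesh(kx, ky):
    """L_MS of Equation (1): floor(kx/2) * ceil(kx/2) * ky."""
    return (kx // 2) * ((kx + 1) // 2) * ky


def lower_bound_torus(kx, ky):
    """L_TR of Equation (2): half of the mesh lower bound."""
    return lower_bound_mesh(kx, ky) / 2


if __name__ == "__main__":
    for k in (5, 6, 15):
        print(k, lower_bound_mesh(k, k), lower_bound_torus(k, k))
```

For a square network with odd $k$, `lower_bound_mesh(k, k)` equals $k(k+1)(k-1)/4$, and for even $k$ it equals $k^3/4$, which are the closed forms that reappear in Section IV.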

III. PROPOSED ALGORITHM (A2AT)

In a mesh/torus network, if all nodes transmit messages to their left neighbors simultaneously, only the left links are used and the other links remain idle. This lowers the link bandwidth usage.

To resolve this problem, several messages should be sent concurrently via idle links by operating multiple communication controllers in parallel. The node and router configuration shown in Fig. 2 enables this parallel operation. When all messages are sent via different links, link utilization increases. However, in the conventional algorithms described in the previous section, even if the network hardware can send more than one message concurrently, collisions occur because the nodes concurrently send messages via the same link.

A2AT is designed to avoid collisions caused by messages sent from the same node: it controls the selection of destination nodes so that all links are fully utilized when several messages in a group are being transferred concurrently. With A2AT, the communication time meets the theoretical lower bound without the need for a phasing operation.

In the following sections, we describe the implementation of our algorithm on an odd-sized 2D mesh and then its extension to an even-sized mesh and to a 2D torus.

Fig. 5. Algorithm of A2AT for 5-ary 2-mesh (step 1, step 2).

for i = 1 to (k − 1)/2 do
    send(i, 0); send(0, i);      {g1, g5}
    send(−i, 0); send(0, −i);    {g2, g6}
    send(i, i); send(−i, −i);    {g3, g7}
    send(i, −i); send(−i, i);    {g4, g8}
end for

Fig. 6. Order of sending an A2AT message in step 1.

A. Odd-sized 2D Mesh

The size of a mesh/torus network is generally represented as k-ary n-mesh/cube. Let us consider a $k \times k$ 2D mesh network (k-ary 2-mesh) in which $k$ is an odd number. All nodes have two-dimensional coordinates. To express the destination node, we use a relative two-dimensional distance notation $(x, y)$ with $-\frac{k-1}{2} \le x, y \le \frac{k-1}{2}$, where the sign represents the direction.

Figure 5 shows the sending order of A2AT for a 5 × 5 mesh network. The figure shows the sequence in which the node at the center of the network sends its messages. All nodes in the network send messages at the same time as the center node. For an odd-sized 2D mesh, A2AT divides all messages of the all-to-all communication into 12 groups ($g_1, \cdots, g_{12}$), as shown in the figure, and sends them in order of group number. For mesh networks, A2AT sends two messages concurrently by using two communication controllers. In this paper, the number of concurrent message transfers is denoted $N_{CT}$; here, messages are sent with $N_{CT} = 2$.

Eight groups ($g_1, \cdots, g_8$) are included in step 1. In this step, all nodes send messages to the nodes that lie on the x or y axis and along the diagonal lines, as shown in Fig. 6. In this figure, send(i, j) represents "send((myid_x + i) mod k_x, (myid_y + j) mod k_y)".

The sequence of sending messages in step 2 is shown in Fig. 7. Four groups ($g_9, \cdots, g_{12}$) are included in this step.

for i = 1 to (k − 1)/2 do
    for j = 1 to i do
        send(i, j); send(−j, −i);    {g9}
        send(j, i); send(−i, −j);    {g10}
        send(i, −j); send(−j, i);    {g11}
        send(j, −i); send(−i, j);    {g12}
    end for
end for

Fig. 7. Order of sending an A2AT message in step 2.

Fig. 8. Algorithm of A2AT for 5-ary 2-mesh (step 3).
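The sending order of steps 1 and 2 for an odd-sized network can be sketched as a list of relative offsets. The function name is hypothetical, and one assumption is made loudly here: the inner loop of step 2 is taken to stop at $j = i - 1$, so that the diagonal destinations $(\pm i, \pm i)$ already covered in step 1 are not enumerated a second time (Fig. 7 as printed gives the upper bound as $i$). Under that assumption the list contains each of the $k^2 - 1$ destinations exactly once.

```python
def a2at_odd_offsets(k):
    """Relative offsets (dx, dy) in A2AT sending order for odd k (steps 1 and 2).

    Consecutive pairs in the list are the two messages sent concurrently
    when N_CT = 2.
    """
    if k % 2 != 1:
        raise ValueError("k must be odd")
    s = (k - 1) // 2
    offsets = []
    # Step 1 (Fig. 6): destinations on the axes and diagonals, groups g1..g8.
    for i in range(1, s + 1):
        offsets += [(i, 0), (0, i), (-i, 0), (0, -i),
                    (i, i), (-i, -i), (i, -i), (-i, i)]
    # Step 2 (Fig. 7): all remaining destinations, groups g9..g12.
    # ASSUMPTION: j runs to i - 1 so that diagonals are not repeated.
    for i in range(1, s + 1):
        for j in range(1, i):
            offsets += [(i, j), (-j, -i), (j, i), (-i, -j),
                        (i, -j), (-j, i), (j, -i), (-i, j)]
    return offsets


if __name__ == "__main__":
    order = a2at_odd_offsets(5)
    print(len(order), len(set(order)))
    print(order[:4])
```

Note that the two offsets of each concurrent pair point in different directions, which is what lets two controllers drive different links at the same time.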

B. Even-sized 2D Mesh

When the network size $k$ is even, the messages destined for nodes within the sub-network of size $k - 1$ are sent first by the algorithm described above. The remaining messages are then sent to the remaining nodes in an additional step 3, shown in Fig. 8. Six groups ($g_{13}, \cdots, g_{18}$) are included in step 3. In this step, messages are sent in the manner shown in Fig. 9.

C. 2D Torus

A2AT can be adapted to a 2D torus network with only minimal changes. For the torus network, messages are sent with $N_{CT} = 4$; that is, all communication controllers are used. Using all the controllers means that all four links in all four directions (+x, +y, −x, −y) are used.

In a torus network in which $k$ is an even number, there are two possible routes to a destination that is separated by a distance of $k/2$. If one route is repeatedly selected, an imbalance occurs in the link usage. Therefore, in this case, different directions must be used to balance the link usage, as shown in Fig. 10.

D. Asymmetrical and higher dimension

A2AT can be applied to asymmetrical 2D mesh/torus networks such as rectangular networks. A $p \times q$ mesh/torus network ($p \ge q$) consists of one or more $q \times q$ sections and a remaining section. In this case, messages are first sent to the nodes in the $q \times q$ sections, and the remaining messages are then sent to the nodes in the other section. A2AT can also be extended to higher-dimensional networks such as 3D torus/mesh networks, in which it efficiently uses all the links of the node.

for i = 1 to k/2 − 1 do
    send(k/2, i); send(−i, k/2);     {g13, g15}
    send(k/2, −i); send(i, k/2);     {g14, g16}
end for
send(k/2, 0); send(0, k/2);          {g17}
send(−k/2, k/2);                     {g18}

Fig. 9. Order of sending an A2AT message in step 3 when k is an even number.

for i = 1 to k/2 − 1 do
    send(k/2, i); send(−i, −k/2);    {g13, g15}
    send(−k/2, −i); send(i, k/2);    {g14, g16}
end for
send(k/2, 0); send(0, k/2);          {g17}
send(−k/2, k/2);                     {g18}

Fig. 10. Order of sending an A2AT message in step 3 when k is an even number for torus network.
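The step 3 orders of Figs. 9 and 10 can be compared with a short sketch. The function name and the `torus` flag are hypothetical. Because send(i, j) is defined modulo $k$, the offsets $+k/2$ and $-k/2$ name the same destination, so both variants reach the same $2k - 1$ remaining nodes; the torus variant differs only in the sign, and hence the direction, of some offsets.

```python
def a2at_step3_offsets(k, torus=False):
    """Relative offsets (dx, dy) sent in step 3 for even k.

    torus=False follows Fig. 9 (mesh); torus=True follows Fig. 10.
    """
    if k % 2 != 0:
        raise ValueError("k must be even")
    h = k // 2
    offsets = []
    for i in range(1, h):
        if torus:
            offsets += [(h, i), (-i, -h), (-h, -i), (i, h)]  # g13..g16, Fig. 10
        else:
            offsets += [(h, i), (-i, h), (h, -i), (i, h)]    # g13..g16, Fig. 9
    offsets += [(h, 0), (0, h)]  # g17
    offsets += [(-h, h)]         # g18
    return offsets


def destinations(offsets, k):
    """Destinations reached from node (0, 0), taken modulo k."""
    return {(dx % k, dy % k) for dx, dy in offsets}


if __name__ == "__main__":
    k = 6
    mesh = a2at_step3_offsets(k)
    tor = a2at_step3_offsets(k, torus=True)
    print(len(mesh), destinations(mesh, k) == destinations(tor, k))
```

Every destination produced this way has an x or y coordinate equal to $k/2$, that is, it lies in the extra row or column that the size $k - 1$ sub-network does not cover.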

IV. ANALYTICAL COMMUNICATION TIME OF A2AT

In all-to-all communication, $K(K - 1)$ different messages of the same size are transported, where $K$ represents the number of nodes in the network. When the message size is large, the start-up time of communication is negligible. Moreover, all links in a network have the same bandwidth. For the remainder of this paper we normalize the bandwidth of the links to 1.

One of A2AT's features is that the hop count of messages equals the maximum number of messages sharing a link on a route, for all message transfers in the same group. For example, consider a case in which the hop count of all messages in $g_5$ is 2. The maximum number of messages sharing a link on a route is also 2, so messages of the same size are all transferred with the same bandwidth, 1/2. Hence, with the link bandwidth normalized to 1, the communication time of the messages in this group is 2. A2AT can therefore maintain synchronization without the need for a phasing operation such as an MPI barrier. This feature allows the all-to-all communication time of A2AT to be calculated as the sum of the communication times of all groups.
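The relation between hop count and link sharing can be illustrated on the simplest case, a single ring (one dimension of a torus). The sketch below is a simplified model and not the paper's simulator: it assumes every node sends one message to the node `d` hops away along the shortest direction, and the function name is hypothetical. It counts how many messages cross each directed link.

```python
from collections import Counter


def ring_link_loads(k, d):
    """Messages per directed link on a k-node ring.

    Every node sends one message to the node |d| hops away, in the
    + direction if d > 0 and in the - direction if d < 0.
    Assumes 0 < |d| < k.
    """
    loads = Counter()
    step = 1 if d > 0 else -1
    for src in range(k):
        node = src
        for _ in range(abs(d)):
            nxt = (node + step) % k
            loads[(node, nxt)] += 1  # this message occupies link node -> nxt
            node = nxt
    return loads


if __name__ == "__main__":
    loads = ring_link_loads(5, 2)
    print(sorted(loads.items()))
```

In this model every used link carries exactly |d| messages, the same number as the hop count. With the link bandwidth normalized to 1, each message then receives a bandwidth of 1/|d| and all messages of the group finish together after time |d|, which is the self-synchronizing behavior described above.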

A. Odd-sized 2D Mesh/torus

For a mesh network with $N_{CT} = 2$, the messages in $g_1$, $g_5$ and those in $g_2$, $g_6$ are sent simultaneously according to Fig. 6. Considering the hop count of the messages, the communication time for sending messages to the nodes that lie on the x or y axis is estimated by Equation (3):

$$T_1 = \sum_{i=1}^{S} i \times 2, \tag{3}$$


where $S = \lfloor (k - 1)/2 \rfloor$. The other communication times are estimated by Equation (4):

$$T_2 = \sum_{i=1}^{S} \sum_{j=1}^{S} (i + j) \times 2. \tag{4}$$

The overall communication time is the sum of Equations (3) and (4). Hence, the communication time for an odd-sized 2D mesh network is calculated by Equation (5):

$$T_{odd} = T_1 + T_2 = S(S + 1)(2S + 1) = \frac{k(k + 1)(k - 1)}{4}. \tag{5}$$

This equals the lower bound of the communication time shown in Equation (1).

For an odd-sized 2D torus network, the communication time is half that of a mesh network, i.e., $k(k + 1)(k - 1)/8$. This also equals the lower bound of the communication time shown in Equation (2).
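The step from Equations (3) and (4) to the closed form in Equation (5) can be written out as follows. This is only an expansion of the sums already given, using $\sum_{i=1}^{S} i = S(S+1)/2$ and $S = (k-1)/2$ for odd $k$.

```latex
\begin{align*}
T_1 &= \sum_{i=1}^{S} 2i = S(S+1), \\
T_2 &= \sum_{i=1}^{S}\sum_{j=1}^{S} 2(i+j)
     = 2\left( S\sum_{i=1}^{S} i + S\sum_{j=1}^{S} j \right)
     = 2S^2(S+1), \\
T_{odd} &= T_1 + T_2 = S(S+1)(2S+1). \\
\intertext{Substituting $S = (k-1)/2$, so that $S + 1 = (k+1)/2$ and $2S + 1 = k$:}
T_{odd} &= \frac{k-1}{2}\cdot\frac{k+1}{2}\cdot k = \frac{k(k+1)(k-1)}{4}.
\end{align*}
```

For example, $k = 5$ gives $S = 2$, $T_1 = 6$, $T_2 = 24$, and $T_{odd} = 30$, which matches $\lfloor 5/2 \rfloor \lceil 5/2 \rceil \cdot 5 = 30$ from Equation (1).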

B. Even-sized 2D Mesh/torus

For an even-sized 2D mesh/torus network, step 3 shown in Fig. 9 is required. The maximum number of messages sharing one link is $(k/2 + i)$ when the messages in groups $g_{13}, \cdots, g_{16}$ are sent. For sending messages to nodes that lie on the x or y axis ($g_{17}$) and along the diagonal line ($g_{18}$), it is $k/2$. The sum of these is calculated as

$$T_3 = \sum_{i=1}^{S} \left( \frac{k}{2} + i \right) \times 2 + \frac{k}{2} + \frac{k}{2}, \tag{6}$$

where $S = (k/2) - 1$. By adding Equation (5), evaluated for the odd size $k - 1$, and Equation (6), the communication time of an even-sized 2D mesh is estimated as

$$T_{even} = \frac{k^3}{4}. \tag{7}$$

Equations (7) and (1) are equivalent.

For an even-sized 2D torus, the communication time is estimated as $k^3/8$, half of Equation (7), because the communication time of step 3 becomes $T_3/2$.
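The even-size result for the mesh can be checked numerically by evaluating the sums directly. The sketch below uses hypothetical function names and assumes, following Section III-B, that the even-size time is the odd-size time of the size $k - 1$ sub-network plus the step 3 time.

```python
def t_odd(k):
    """Sum of Equations (3) and (4) for an odd-sized k x k mesh."""
    s = (k - 1) // 2
    t1 = sum(2 * i for i in range(1, s + 1))
    t2 = sum(2 * (i + j) for i in range(1, s + 1) for j in range(1, s + 1))
    return t1 + t2


def t3(k):
    """Step 3 time of Equation (6) for an even-sized k x k mesh."""
    s = k // 2 - 1
    h = k // 2
    return sum(2 * (h + i) for i in range(1, s + 1)) + h + h


def t_even(k):
    """Steps 1 and 2 on the (k - 1)-sized sub-network, followed by step 3."""
    return t_odd(k - 1) + t3(k)


if __name__ == "__main__":
    for k in (4, 6, 8, 16):
        print(k, t_even(k), k ** 3 / 4)
```

For $k = 6$ this gives $30 + 24 = 54 = 6^3/4$, in agreement with both the closed form for the even-sized mesh and the lower bound of Equation (1).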

The overall communication time of the mesh/torus network is estimated as $k(k + 1)(k - 1)/3$ when $N_{CT} = 1$. We denote this communication time as $T_{NCT1}$. When $N_{CT} = 2$, the communication time of the torus, $T_{NCT2}$, is half of $T_{NCT1}$. Therefore, $T_{NCT2}$ is estimated as $k(k + 1)(k - 1)/6$. When $N_{CT} = 4$, the communication time reaches the lower bound $L_{TR}$.

Table I lists the analytical communication time of A2AT with $N_{CT} = 1, 2, 4$ on the mesh/torus network. These are the ideal communication times for each case. Note that A2AT uses links efficiently when $N_{CT}$ is high. This is because messages that are sent consecutively travel in different directions.

TABLE I
LIST OF IDEAL A2AT COMMUNICATION TIME.

N_CT   Mesh                        Torus
1      T_NCT1 = k(k+1)(k−1)/3      T_NCT1 = k(k+1)(k−1)/3
2      L_MS = k(k+1)(k−1)/4        T_NCT2 = k(k+1)(k−1)/6
4      -                           L_TR = k(k+1)(k−1)/8
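The entries of Table I can be wrapped in a small lookup, which is convenient when normalizing simulated times as in Section V. The function name and argument names are hypothetical; the formulas are the closed forms of the table, which are stated for a k-ary 2-mesh/cube.

```python
# Divisor d such that the ideal time is k(k+1)(k-1)/d, keyed by (topology, N_CT).
_IDEAL_DIVISOR = {
    ("mesh", 1): 3,
    ("torus", 1): 3,
    ("mesh", 2): 4,
    ("torus", 2): 6,
    ("torus", 4): 8,
}


def ideal_time(k, n_ct, topology):
    """Ideal A2AT communication time from Table I (normalized units)."""
    try:
        divisor = _IDEAL_DIVISOR[(topology, n_ct)]
    except KeyError:
        raise ValueError("no Table I entry for this combination") from None
    return k * (k + 1) * (k - 1) / divisor


if __name__ == "__main__":
    k = 5
    for key in sorted(_IDEAL_DIVISOR):
        topology, n_ct = key
        print(topology, n_ct, ideal_time(k, n_ct, topology))
```

A simulated communication time divided by `ideal_time(k, 1, topology)` gives a ratio of the kind plotted on the y axis of the figures in Section V.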

TABLE II
SIMULATION CONDITIONS.

Networks                     k-ary 2-mesh/cube (k = 6 · · · 17)
Communication algorithms     A2AND, A2AT
Flits per packet (FPP)       100
Packets per message (PPM)    1
Switching                    Wormhole routing, Virtual cut-through
Buffering                    Every flit
Arbitration                  Each packet by round-robin
Routing                      Dimension-order

V. EVALUATION

A. Simulator and Condition

For the evaluation, we used a cycle-based flit-level network simulator called Booksim [15]. In Booksim, the communication route is determined for each message and flow control is performed for every flit. It takes one cycle to issue a flit from a node to the network and one cycle to fetch a flit from the network to a node. We implemented the following features in Booksim to evaluate A2AT.

• A2A, A2AND, and A2AT communication algorithms
• Local synchronization (LS), which simulates MPI_Sendrecv.
• Varied start time (VST), which randomly varies the all-to-all communication start time for each node.
• Extra ports for nodes that increment $N_{CT}$.

Conditions of simulation are shown in Table II.

First, we made sure that A2AT could perform sufficiently well, assuming an ideal network configuration. Then, we performed three experiments assuming practical networks.

B. Performance of A2AT on an Ideal Network

In this experiment, we evaluated the communication times of A2AT and A2AND on an ideal mesh/torus network and compared them. In the Booksim simulation, port arbitration was performed in a round-robin manner when several communication flows were present in a router. Because port arbitration is performed locally in each router, global communication fairness cannot be guaranteed. While this is normal for a practical network, A2AT works unstably under this condition.

For the experiment, we used a virtual output queue (VOQ) at the network level to increase the global communication fairness for all messages. VOQ is a configuration in which each port in the routers has many send queues [16]. We used the same number of virtual channels ($VC$) as the number of nodes because each virtual channel has a send queue in the Booksim simulation.

Fig. 11. All-to-all communication time of A2AT and A2AND on mesh/torus network ($VC = k^2 - 1$, $Buf = 2$, $N_{CT} = 1$).

To simulate wormhole switching, we set the buffer size ($Buf$) of each virtual channel to 2, which is the minimum. This buffer can contain up to two flits. We used wormhole switching in order to reduce the buffering effect. $N_{CT}$ was set to 1. A2AT assumes static routing. However, when the size $k$ of the torus network is even, Booksim randomly chooses one of the two possible routes. This may cause a small difference in the communication time of A2AT in this case.

The results are shown in Fig. 11. The x and y axes in the figure represent the network size $k$ and the ratio of the communication time measured by the simulator to the ideal communication time $T_{NCT1}$, respectively. Solid and dashed lines denote A2AT and A2AND, respectively, and circles and squares represent mesh and torus networks, respectively.

The experimental time of A2AT on the mesh and torus networks was, on average, about 1.20 and 1.09 times the ideal time, respectively. This result demonstrates that the communication time obtained by A2AT is closer to the ideal communication time than that obtained by A2AND.

The difference between the experimental and ideal communication times is probably caused by router processing. Routers modeled in Booksim require one cycle to pass a flit through the router, and these cycles appear as communication latency. The ideal time does not include this latency. The difference in hop count also has an effect. When several messages are sent from nodes located near the edge or corner of a network, the hop count of the messages varies from node to node.

C. Performance of A2AT on Practical Networks

1) Case of Minimum VC and Small Buffer Size: We compared the communication times of A2AT and A2AND when the networks had the minimum number of virtual channels and a small buffer, assuming a conventional network. We set $VC$ to 2 for the torus network, as two virtual channels are required to avoid communication deadlock, and also set $VC = 2$ for the mesh network to ensure a common condition. We set $Buf$ to 20. In this experiment, each message consisted of 100 flits, and therefore the routers performed wormhole switching.

Fig. 12. All-to-all communication time under conditions assuming a conventional network ($VC = 2$, $Buf = 20$, $N_{CT} = 1$).

The results are shown in Fig. 12. The x and y axes in the figure represent the network size $k$ and the ratio of the communication time estimated by the simulator to the ideal communication time $T_{NCT1}$. The lines and marks have the same meaning as in the previous figure.

The results show that, compared with A2AND, A2AT does not work effectively. It seems that the disordering of communication had a negative effect on the A2AT performance. In this experiment, global communication fairness was not guaranteed because $VC$ was set to 2, and the transmission time of each message therefore differed. These differences are quite significant and can have a devastating impact on collective communication. The variations in the transmission times also contributed to the disordering of the communication.

To probe this issue more thoroughly, we performed an evaluation with LS, which is an effective mode for reducing variations in transmission time. The results of this evaluation are shown in Fig. 13. A2AT reduced the communication time relative to A2AND by 12.5% on the mesh network and by 36.0% on the torus network. A2AT works effectively when the differences in transmission time are small. This indicates that LS should be used for practical networks. From now on, we use LS for all evaluations.

2) Influence of Varied Start Time: In a practical network, the communication start times of the nodes vary due to, for example, load imbalance among the nodes. Shibamura et al. pointed out that A2AT communication is significantly influenced by such time variations [17]. As stated above, the effect of start time variation can be reduced by using the LS mode. To confirm this, we compared the A2AT and A2AND communication times by using VST on mesh and torus networks. VST varied the communication start time of each node. The results are shown in Figs. 14 and 15. The x and y axes in the figures represent the network size $k$ and the ratio of the averaged time gap of communications to the overall communication time.

Fig. 13. Communication time of A2AT on mesh/torus networks with LS ($VC = 2$, $Buf = 20$, $N_{CT} = 1$).

Fig. 14. Communication time of A2AT on mesh network when start time is varied ($VC = 2$, $Buf = 20$, $N_{CT} = 1$, LS mode).

For both the mesh and torus networks, the graphs of A2AT

and A2AND with the VST were similar to those without VST.

This indicates that variations to communication start time do

not greatly affect the communication efficiency for A2AT in

LS mode.
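The way VST draws each node's start time is not restated in this section, so the following is only a minimal sketch of how such a variation could be generated and summarized. The uniform distribution, the fixed seed, and the function names `varied_start_times` and `average_gap_ratio` are assumptions made for illustration and are not taken from the simulator.

```python
import random


def varied_start_times(num_nodes, max_gap, seed=0):
    """Return one start offset per node, drawn uniformly from [0, max_gap].

    The uniform distribution and the fixed seed are assumptions made for
    this illustration only.
    """
    rng = random.Random(seed)
    return [rng.uniform(0.0, max_gap) for _ in range(num_nodes)]


def average_gap_ratio(start_times, overall_time):
    """Ratio of the averaged start-time gap to the overall communication time.

    The gap of a node is measured from the earliest-starting node.
    """
    earliest = min(start_times)
    gaps = [t - earliest for t in start_times]
    return (sum(gaps) / len(gaps)) / overall_time


if __name__ == "__main__":
    k = 3  # hypothetical network size, giving k**3 nodes in a 3D network
    starts = varied_start_times(k ** 3, max_gap=100.0)
    print(average_gap_ratio(starts, overall_time=1000.0))
```

In this sketch, a larger `max_gap` relative to `overall_time` corresponds to a stronger start-time variation.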

3) Case of Increasing NCT: We performed an experiment and compared the communication times of A2AT and A2AND on 3D mesh/torus networks when NCT is greater than 1. For the experiment, we set VC = 1 for the mesh network and VC = 2 for the torus network. The extra VC for the torus is used only to avoid communication deadlock. We set fpp and Buf to 5 and 10, respectively. By using a smaller fpp than that used in the previous experiments, we reduced the computational resources needed for the simulations. This allows us to simulate all-to-all communication on large-scale networks. NCT was set to 2 and to 6 for both the mesh and torus networks.

Fig. 15. Communication time of A2AT on torus network when start time varied (VC = 2, buf = 20, NCT = 1, LS mode).

Fig. 16. Comparison of communication time of A2AT and A2AND on 3D mesh/torus network when NCT = 2 (VC = 1 for mesh, VC = 2 for torus, fpp = 5, Buf = 10, LS).

The evaluation results for the mesh and torus networks are

shown in Fig. 16 and 17. The x and y axes in the figures

represent the size of networks k and overall cycle time for

all-to-all communication.

As shown in Fig. 16, when two controllers are used (NCT = 2), A2AT reduced communication time by 28.2% and 55.7% compared with A2AND on k = 15 (3,375 nodes) mesh and torus networks, respectively (18.6% and 44.8% on average). From the results in Fig. 17, A2AT also reduced communication time by 15.1% and 41.9% compared with A2AND on the mesh and torus networks when six controllers are used (14.4% and 37.5% on average). From these results, we found that A2AT is more effective than A2AND when NCT is greater than 1.
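To make the reported reductions easier to compare, the following small sketch converts a reduction in communication time into the equivalent speedup factor. It assumes that a reduction of r% means T_A2AT = (1 - r/100) × T_A2AND; the function name `speedup_from_reduction` is hypothetical, and only the percentages reported above are used as inputs.

```python
def speedup_from_reduction(reduction_percent):
    """Convert a reduction in communication time (in percent) to a speedup.

    Assumes T_new = (1 - r/100) * T_old, so speedup = T_old / T_new.
    """
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


# Reductions reported in the text for k = 15, keyed by (network, NCT).
reported = {
    ("mesh", 2): 28.2,
    ("torus", 2): 55.7,
    ("mesh", 6): 15.1,
    ("torus", 6): 41.9,
}

for (network, nct), reduction in reported.items():
    factor = speedup_from_reduction(reduction)
    print(f"{network:5s} NCT={nct}: {reduction:.1f}% less time, {factor:.2f}x speedup")
```

Under this assumption, the 55.7% reduction on the torus network with NCT = 2 corresponds to A2AT finishing roughly 2.26 times faster than A2AND.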


Fig. 17. Comparison of communication time of A2AT and A2AND on 3D mesh/torus network when NCT = 6 (VC = 1 for mesh, VC = 2 for torus, fpp = 5, Buf = 10, LS).

VI. CONCLUSION

In this paper, we proposed an efficient all-to-all communication algorithm for mesh/torus networks, A2AT. A2AT schedules the message sending sequence so that all links are fully used, by exploiting the capability of a node to transfer messages concurrently. A2AT groups communications so that, for all message transfers in the same group, the hop count of a message equals the maximum number of messages sharing a link on its route.
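The grouping property can be illustrated on a one-dimensional toy case. The sketch below is not the A2AT scheduling itself. It only assumes a unidirectional ring of k nodes in which every node i sends one message to node (i + d) mod k, and it counts how many messages cross each link. The function name `ring_link_loads` is hypothetical.

```python
from collections import Counter


def ring_link_loads(k, d):
    """Count messages per directed link on a k-node ring.

    Every node i sends one message to (i + d) mod k, always moving in the
    +1 direction, so each message travels exactly d hops.
    """
    loads = Counter()
    for src in range(k):
        node = src
        for _ in range(d):
            nxt = (node + 1) % k
            loads[(node, nxt)] += 1  # this message occupies link node -> nxt
            node = nxt
    return loads


if __name__ == "__main__":
    k = 8
    for d in range(1, k // 2 + 1):
        loads = ring_link_loads(k, d)
        # Hop count d equals the maximum number of messages sharing a link.
        print(d, max(loads.values()))
```

In this toy pattern every link carries exactly d messages, so the hop count of each message matches the link sharing, which is the balance that the grouping aims for.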

On mesh/torus networks that used VOQ, which is an ideal

configuration for A2AT, the obtained communication times

were roughly 1.20 and 1.09 times higher, on average, than

those of the ideal times that were analytically estimated. The

result also demonstrated that communication time obtained

with A2AT was closer to the ideal time than that with A2AND,

the conventional algorithm.

In a simulation assuming practical networks with two virtual

channels and small buffers, A2AT reduced communication

time by 12.5% (mesh) and 36.0% (torus) compared with

A2AND. It seems this superior performance was due to using

local synchronization to perform MPI sendrecv.

We evaluated the influence of variation in the communication start time by comparing the A2AT and A2AND communication times when the start time was randomly varied. The results showed that this variation does not have any significant effect on communication efficiency for A2AT in LS mode.

We also compared the A2AT and A2AND performances when NCT was greater than 1. When two controllers are used (NCT = 2), A2AT reduced communication time by 28.2% and 55.7% compared with A2AND on k = 15 (3,375 nodes) mesh and torus networks, respectively (18.6% and 44.8% on average from k = 5 to 15). When six controllers are used (NCT = 6), A2AT reduced communication time by 15.1% and 41.9% compared with A2AND on the mesh and torus networks, respectively (14.4% and 37.5% on average from k = 5 to 15).

ACKNOWLEDGMENT

The authors would like to thank Mr. Oinaga of Fujitsu

Limited for his helpful discussions. A part of this work was

supported by KAKENHI (22500052). The simulation was

mainly carried out using the computer facilities at Research

Institute for Information Technology, Kyushu University and

Information Technology Center at the University of Electro-

Communications.

REFERENCES

[1] P. H. Worley, R. F. Barrett, and J. A. Kuehn, “Early evaluation of the Cray XT5,” in Proc. of the 51st Cray User Group Conference, May 2009, pp. 1–12.

[2] A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin, “A performance comparison through benchmarking and modeling of three leading supercomputers: Blue Gene/L, Red Storm, and Purple,” in SC ’06: Proc. of the 2006 ACM/IEEE conference on Supercomputing. IEEE Computer Society, 2006, p. 3.

[3] D. S. Scott, “Efficient all-to-all communication patterns in hypercube and mesh topologies,” in 6th Distributed Memory Computing Conference, 1991, pp. 398–403.

[4] T. Horie and K. Hayashi, “All-to-all personalized communication on a wraparound mesh,” Journal of Information Processing Society of Japan (in Japanese), vol. 34, no. 4, pp. 628–637, 1993.

[5] Y.-C. Tseng and S. K. S. Gupta, “All-to-all personalized communication in a wormhole-routed torus,” IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 5, pp. 498–505, May 1996.

[6] N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. Giampapa, P. Heidelberger, S. Singh, B. D. Steinmacher-Burow, T. Takken, M. Tsao, and P. Vranas, “Blue Gene/L torus interconnection network,” IBM Journal of Research and Development, vol. 49, no. 2/3, pp. 265–276, 2005.

[7] G. Almasi, P. Heidelberger, C. J. Archer, X. Martorell, C. C. Erway, J. E. Moreira, B. Steinmacher-Burow, and Y. Zheng, “Optimization of MPI collective communication on BlueGene/L systems,” in ICS ’05: Proceedings of the 19th annual international conference on Supercomputing. New York, NY, USA: ACM, 2005, pp. 253–262.

[8] S. Kumar, Y. Sabharwal, R. Garg, and P. Heidelberger, “Optimization of All-to-All Communication on the Blue Gene/L Supercomputer,” in 37th International Conference on Parallel Processing, Sept. 2008, pp. 320–329.

[9] J. Doi and Y. Negishi, “Overlapping methods of all-to-all communication and FFT algorithms for torus-connected massively parallel supercomputers,” in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–9. [Online]. Available: http://dx.doi.org/10.1109/SC.2010.38

[10] J. Bruck, C.-T. Ho, S. Kipnis, and D. Weathersby, “Efficient algorithms for all-to-all communications in multi-port message-passing systems,” in SPAA ’94: Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures. New York, NY, USA: ACM, 1994, pp. 298–309.

[11] V. Tipparaju and J. Nieplocha, “Optimizing all-to-all collective communication by exploiting concurrency in modern networks,” in SC ’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2005, p. 46.

[12] Y. Ajima, S. Sumimoto, and T. Shimizu, “Tofu: A 6D mesh/torus interconnect for exascale computers,” Computer, pp. 36–40, 2009.

[13] “K computer,” http://www.aics.riken.jp/en/kcomputer/.

[14] “MPICH,” http://www-unix.mcs.anl.gov/mpi/.

[15] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003.

[16] P. Garcia, F. Quiles, J. Flich, J. Duato, I. Johnson, and F. Naven, “Efficient, scalable congestion management for interconnection networks,” Micro, IEEE, vol. 26, no. 5, pp. 52–66, Sept.-Oct. 2006.

[17] H. Shibamura, H. Miwa, R. Susukita, T. Hirao, Y. Ajima, I. Miyoshi, T. Shimizu, H. Ishihata, and K. Inoue, “Simulation evaluation of an optimal all-to-all communication algorithm using packet pacing,” IPSJ SIG Notes (in Japanese), vol. 2010-HPC-126, no. 14, pp. 1–9, 20100727.
