An Efficient All-to-all Communication Algorithm for Mesh/Torus Networks
Syunji Yazaki∗, Haruyuki Takaue†, Yuichiro Ajima‡, Toshiyuki Shimizu‡ and Hiroaki Ishihata§
∗Information Technology Center, The University of Electro-Communications, Tokyo, Japan 182–8585
Email: [email protected]
†COMSYS JOHO SYSTEM Corporation, Tokyo, Japan 108–8610
Email: [email protected]
‡Fujitsu Ltd., Kawasaki, Japan 211-8588
Email: {aji, t.shimizu}@jp.fujitsu.com
§School of Computer Science, Tokyo University of Technology, Tokyo, Japan 192–0982
Email: [email protected]
Abstract—An efficient all-to-all communication algorithm for torus and mesh networks, A2AT, is proposed. A2AT schedules the message sending sequence so that all links are fully used, exploiting the ability of a node to transfer several messages concurrently. With A2AT, the hop count of each message equals the maximum number of messages sharing a link on its route, for all message transfers. A2AT can therefore maintain synchronization without the need for a phasing operation such as an MPI barrier. When a VOQ, which is an ideal configuration for A2AT, was used, the communication times obtained by A2AT on mesh and torus networks were, on average, roughly 1.20 and 1.09 times the ideal times, respectively. When the networks had the minimum number of virtual channels and a small buffer, assuming a practical network, A2AT reduced the communication times by 12.5% and 36.0% compared with the conventional algorithm. When two controllers are used, A2AT reduced the communication time by 28.2% and 55.7% relative to A2AND on 15×15×15 (= 3,375 nodes) mesh and torus networks, respectively (18.6% and 44.8% on average). When six controllers are used, A2AT reduced the communication time by 15.1% and 41.9% relative to A2AND on the same mesh and torus networks, respectively (14.4% and 37.5% on average).
I. INTRODUCTION
In distributed parallel computers, the performance of internode
communication greatly influences the overall performance
of some applications. Torus and mesh networks are widely
used for massively parallel computers. In networks that have
a small bisection bandwidth, such as mesh or torus networks,
contention can easily occur and deteriorate the communication
performance. Cray XT5 [1] uses a three-dimensional (3D)
torus network, and an asymmetric 3D mesh network is used
in Red Storm [2], which uses torus networks for only one
dimension. For such networks, it is important to develop a
communication algorithm that can fully exploit the network
bandwidth.
All-to-all communication is a communication pattern in
which each node transmits different messages to all the other
nodes. This pattern is widely used for matrix transposition and
FFT (Fast Fourier Transform).
Previously, several sophisticated all-to-all communication
algorithms have been proposed. For example, Scott [3] pro-
posed an algorithm for hypercube and mesh topology, and
Horie [4] and Yu-Chee [5] proposed one for an n-dimensional
torus topology. The performance of these algorithms reached
the theoretical lower bound. However, all of these algorithms assume a phasing operation, such as an MPI barrier, for message communication. All nodes repeatedly perform scheduled message transmissions and pause between transmission steps. This ensures that all the links in the network are used with 100% efficiency. The problem is that a phasing operation is difficult to use when the algorithm is implemented in a practical system. Since all the nodes need to transmit messages and pause synchronously, global synchronization is required before starting each phase. This adds extra overhead to all-to-all communication.
In the BlueGene/L system [6], a special algorithm is em-
ployed in which randomization of the destination is combined
with an adaptive routing technique. Almasi has reported that
this algorithm achieved almost 100% efficiency on an 8×8×8
torus network [7]. However, the algorithm cannot be applied to systems that use static routing. Kumar [8] proposed an algorithm that uses static routing and barrier synchronization to improve the efficiency of networks with an asymmetric torus configuration. Doi [9] proposed a method of overlapping all-to-all communication for FFT on an asymmetric 3D torus network such as 8×8×16. This method overlaps local FFT calculations with communication to improve FFT performance.
Recently developed parallel computers contain several com-
munication controllers in one node. This type of hardware
can execute multiple message communications concurrently.
Bruck [10] and Tipparaju [11] studied the performance of a
system in which multiple communication controllers are used.
Ajima et al. [12] developed “Tofu”, a six-dimensional torus network used in the K computer [13]. In this network, each node has four controllers. The efficient use of this type of hardware for specific communication patterns needs to be explored further.
2012 10th IEEE International Symposium on Parallel and Distributed Processing with Applications
978-0-7695-4701-5/12 $26.00 © 2012 IEEE
DOI 10.1109/ISPA.2012.44
Fig. 1. Example of a torus network.
In this paper, we propose an efficient all-to-all communication algorithm, A2AT (All-to-all communication for torus networks), that schedules the sending sequence so that all links are fully used. We implemented the proposed algorithm on mesh/torus networks that are capable of transferring multiple messages concurrently. When implemented on a two-dimensional (2D) mesh/torus network in which each node has four controllers, the proposed algorithm can attain the theoretical lower bound.
We performed simulations to compare the performance of our algorithm with that of a conventional algorithm that is currently in use. The simulation results showed that the proposed algorithm outperformed the conventional one.
The rest of this paper is organized as follows. In Section
II, we introduce the conventional all-to-all algorithm imple-
mented on a mesh/torus network and describe the theoretical
lower bound of the all-to-all communication. Section III
describes the proposed algorithm. In Section IV, we analyze the communication time using a simple communication model. In Section V, we present the simulation and evaluation results and discuss them in detail. Finally, Section VI concludes this paper.
II. PRELIMINARIES
A. Mesh/Torus Network
In a mesh/torus network, the routers are connected in a
multi-dimensional mesh. An example of a 2D torus network
is shown in Fig. 1. We explain the configuration of the
nodes and routers in detail in Section II-B.
In a torus network such as that in Fig. 1, the communication links in the upward direction are connected to those in the downward direction; similarly, the links to the left and right of each node are also connected to each other. In a mesh network, nodes located at the edges of the network have only three links, and nodes located at the corners of the network have only two links.
Fig. 2. Configuration of nodes and routers in a mesh/torus network.
for i = 1 to N − 1 do
  send((myid + i) mod N)
end for
Fig. 3. Sending order of A2ASS.
The routers are connected via full-duplex links that have
the same bandwidth, and the messages are routed by a pre-
determined route (static routing). In this study, we focus on
dimension-order routing, in which a message is first routed
along the x direction and then along the y direction and so
on. Several virtual channels (VCs) are used per link.
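As an illustration of dimension-order routing, the following Python sketch lists the nodes a message visits on a k × k network. The function name, the coordinate convention, and the tie-breaking rule on an even-sized torus are assumptions made for this example; they are not taken from the simulator used in this paper.

```python
def dimension_order_route(src, dst, k, torus=False):
    """Return the nodes visited after src when routing along x first, then y.

    src and dst are (x, y) tuples on a k x k network. On a torus the shorter
    way around each ring is taken; a tie (distance k/2 for even k) goes in
    the positive direction, which is an arbitrary choice for this sketch.
    """
    path = []
    pos = list(src)
    for dim in (0, 1):  # dimension 0 is x, dimension 1 is y
        delta = dst[dim] - pos[dim]
        if torus:
            delta %= k              # steps needed in the positive direction
            if delta > k // 2:
                delta -= k          # the negative direction is shorter
        step = 1 if delta > 0 else -1
        for _ in range(abs(delta)):
            pos[dim] += step
            if torus:
                pos[dim] %= k       # wrap around the ring
            path.append(tuple(pos))
    return path


# Mesh: two hops along x, then one hop along y.
print(dimension_order_route((0, 0), (2, 1), 5))
# Torus: the wraparound link makes (4, 0) a single hop away from (0, 0).
print(dimension_order_route((0, 0), (4, 0), 5, torus=True))
```

The length of the returned list is the hop count of the message, which is the quantity used in the analysis of Section IV.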
B. Configuration of Nodes and Router
Figure 2 shows the configuration of nodes and routers. The communication links between a node and its router are also full-duplex links and have sufficiently high bandwidth. Each node has several communication controllers: in this case, four. All controllers equally share the bandwidth of the link between the node and the router. The receiver ends of the controllers also operate concurrently. The communication library in each node assigns communication requests to free communication controllers in a first-in-first-out manner. We assume that the switching scheme of the router is wormhole or virtual cut-through.
C. Existing Algorithm
To perform all-to-all communication smoothly, all nodes
must have the exact same function; that is, they send a message
to the destination node and receive a message from the source
node at the same time. The sequence of the destination nodes
is the key point in communication algorithms.
A direct-type all-to-all algorithm, which does not store and combine messages in intermediate nodes, is shown in Fig. 3. This algorithm is called the simple spread algorithm and is used in MPICH [14]. In this paper we refer to it as A2ASS. In all-to-all communication, every node sends N − 1 messages, one to each of the other nodes, where N represents the total number of nodes, myid denotes the rank of the node, and send(t) means that a message is sent to the node with rank t.
Another algorithm, which we call A2AND, has been used for n-dimensional networks. Its sending order is shown in Fig. 4. For A2AND, the rank of the sending node myid is represented as a coordinate (myidx, myidy), and kx and ky represent the network sizes in the x and y directions, respectively.
for y = 0 to ky − 1 do
  for x = 0 to kx − 1 do
    if x ≠ 0 or y ≠ 0 then
      send((myidx + x) mod kx, (myidy + y) mod ky);
    end if
  end for
end for
Fig. 4. Sending order of A2AND.
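The two conventional sending orders can be written out as a short Python sketch. The function names are hypothetical, and the sketch assumes that the guard in Fig. 4 is meant to skip only the sender itself, so that every other node receives exactly one message.

```python
def a2ass_destinations(myid, n):
    """Destination ranks in the order of Fig. 3 (simple spread, A2ASS)."""
    return [(myid + i) % n for i in range(1, n)]


def a2and_destinations(myid, kx, ky):
    """Destination coordinates in the order of Fig. 4 (A2AND).

    myid is the (x, y) coordinate of the sender. The offset (0, 0), which is
    the sender itself, is skipped (assumption stated in the text above).
    """
    myx, myy = myid
    dests = []
    for y in range(ky):
        for x in range(kx):
            if (x, y) != (0, 0):
                dests.append(((myx + x) % kx, (myy + y) % ky))
    return dests


print(a2ass_destinations(2, 5))
print(a2and_destinations((1, 1), 3, 3))
```

In both orders every node uses the same offset at the same loop iteration, so all nodes tend to load links in the same direction at the same time.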
D. Lower Bound of all-to-all Communication Time
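The lower bound of the all-to-all communication time (LMS) on a kx × ky mesh network is given by
$$L_{MS} = \left\lceil \frac{k_x}{2} \right\rceil \left\lfloor \frac{k_x}{2} \right\rfloor k_y. \tag{1}$$
The bisection bandwidth of a torus network is twice that of a mesh network, so the lower bound of the all-to-all communication time (LTR) on a kx × ky torus network is given by
$$L_{TR} = \frac{1}{2} L_{MS} = \frac{1}{2} \left\lceil \frac{k_x}{2} \right\rceil \left\lfloor \frac{k_x}{2} \right\rfloor k_y. \tag{2}$$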
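The two bounds are easy to evaluate numerically. The following Python sketch reads Equation (1) as the product of the ceiling and the floor of kx/2 with ky, with the link bandwidth normalized to 1; the function names are hypothetical.

```python
from fractions import Fraction


def lower_bound_mesh(kx, ky):
    """Equation (1): ceil(kx / 2) * floor(kx / 2) * ky."""
    return ((kx + 1) // 2) * (kx // 2) * ky


def lower_bound_torus(kx, ky):
    """Equation (2): half of the mesh bound, kept exact as a fraction."""
    return Fraction(lower_bound_mesh(kx, ky), 2)


for k in (5, 6):
    print(k, lower_bound_mesh(k, k), lower_bound_torus(k, k))
```

For a square network with odd k the mesh bound equals k(k + 1)(k − 1)/4, and for even k it equals k^3/4, which are the forms used in Section IV.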
III. PROPOSED ALGORITHM (A2AT)
In a mesh/torus network, if all nodes transmit messages to their left nodes simultaneously, only the left links are used and the other links remain idle. This lowers the link bandwidth usage.
To resolve this problem, several messages should be sent concurrently via idle links by operating multiple communication controllers in parallel. The node and router configuration shown in Fig. 2 enables this parallel operation. When all messages are sent via different links, link utilization increases. However, in the conventional algorithms described in the previous section, even if the network hardware can send more than one message concurrently, collisions occur because the nodes concurrently send messages via the same link.
A2AT is designed to avoid collisions caused by messages sent from the same node: it controls the selection of destination nodes so that all links are fully utilized when several messages in a group are being transferred concurrently. With A2AT, the communication time meets the theoretical lower bound without the need for a phasing operation.
In the following subsections, we describe our algorithm for an odd-sized 2D mesh and then its extension to an even-sized mesh and to a 2D torus.
Fig. 5. Algorithm of A2AT for 5-ary 2-mesh. (step 1, step 2)
for i = 1 to (k − 1)/2 do
  send(i, 0); send(0, i); {g1, g5}
  send(−i, 0); send(0, −i); {g2, g6}
  send(i, i); send(−i, −i); {g3, g7}
  send(i, −i); send(−i, i); {g4, g8}
end for
Fig. 6. Order of sending an A2AT message in step 1.
A. Odd-sized 2D Mesh
The size of a mesh/torus network is generally represented as k-ary n-mesh/cube. Let us consider a k × k 2D mesh network (k-ary 2-mesh) in which k is an odd number. All nodes have two-dimensional coordinates. To express the destination node, we use a relative two-dimensional distance notation (x, y) with −(k − 1)/2 ≤ x, y ≤ (k − 1)/2, where the sign represents the direction.
Figure 5 shows the sending order of A2AT for a 5 × 5 mesh network. The figure shows the sequence in which messages are sent from the node at the center of the network. All nodes in the network send messages at the same time as the center node. For an odd-sized 2D mesh, A2AT divides all messages of the all-to-all communication into 12 groups (g1, · · · , g12), as shown in the figure, and sends them in order of group number. For mesh networks, A2AT sends two messages concurrently by using two communication controllers. In this paper, NCT denotes the number of concurrent message transfers; here messages are sent with NCT = 2.
Eight groups (g1, · · · , g8) are included in step 1. In this step, all nodes send messages to the nodes that lie on the x or y axis and along the diagonal lines, as shown in Fig. 6. In this figure, send(i, j) represents “send((myidx + i) mod kx, (myidy + j) mod ky)”.
The sequence of sending messages in step 2 is shown in Fig. 7. Four groups (g9, · · · , g12) are included in this step.
for i = 1 to (k − 1)/2 do
  for j = 1 to i do
    send(i, j); send(−j, −i); {g9}
    send(j, i); send(−i, −j); {g10}
    send(i, −j); send(−j, i); {g11}
    send(j, −i); send(−i, j); {g12}
  end for
end for
Fig. 7. Order of sending an A2AT message in step 2.
Fig. 8. Algorithm of A2AT for 5-ary 2-mesh (step 3).
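The sending orders of steps 1 and 2 can be enumerated to check that every node other than the sender is reached exactly once when k is odd. The Python sketch below is illustrative only and its function names are hypothetical. One assumption differs from Fig. 7 as printed: the inner loop stops at i − 1, because the diagonal offsets (j = i) are already covered by g3, g4, g7, and g8 in step 1.

```python
def a2at_step1(k):
    """Offset pairs of groups g1..g8 (Fig. 6); each pair is sent concurrently."""
    pairs = []
    for i in range(1, (k - 1) // 2 + 1):
        pairs.append(((i, 0), (0, i)))      # g1, g5
        pairs.append(((-i, 0), (0, -i)))    # g2, g6
        pairs.append(((i, i), (-i, -i)))    # g3, g7
        pairs.append(((i, -i), (-i, i)))    # g4, g8
    return pairs


def a2at_step2(k):
    """Offset pairs of groups g9..g12 (Fig. 7).

    Assumption of this sketch: j runs from 1 to i - 1, so that the diagonal
    offsets already sent in step 1 are not repeated.
    """
    pairs = []
    for i in range(1, (k - 1) // 2 + 1):
        for j in range(1, i):
            pairs.append(((i, j), (-j, -i)))    # g9
            pairs.append(((j, i), (-i, -j)))    # g10
            pairs.append(((i, -j), (-j, i)))    # g11
            pairs.append(((j, -i), (-i, j)))    # g12
    return pairs


def all_offsets(k):
    """Flatten both steps into the list of relative offsets one node sends to."""
    return [off for pair in a2at_step1(k) + a2at_step2(k) for off in pair]


offsets = all_offsets(5)
print(len(offsets), len(set(offsets)))  # both equal k * k - 1 when coverage is exact
```

Each pair contains two offsets that leave the node through different links, which is what allows two controllers (NCT = 2) to work in parallel.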
B. Even-sized 2D Mesh
When the network size k is even, the messages destined for the nodes within a sub-network of size k − 1 are sent first by the above-mentioned algorithm. Then, the remaining messages are sent to the remaining nodes in an additional step 3, shown in Fig. 8. Six groups (g13, · · · , g18) are included in step 3. In this step, messages are sent in the manner shown in Fig. 9.
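Step 3 has to reach the k^2 − (k − 1)^2 = 2k − 1 destinations that steps 1 and 2 leave out. The following Python sketch lists the step-3 offsets of Fig. 9 and reduces them modulo k, as in the definition of send(i, j); the function names are hypothetical and the check is only a sanity test of the sending order, not part of the original evaluation.

```python
def a2at_step3(k):
    """Relative offsets of groups g13..g18 (Fig. 9) for an even k."""
    if k % 2 != 0:
        raise ValueError("step 3 is only needed when k is even")
    h = k // 2
    offsets = []
    for i in range(1, h):
        offsets += [(h, i), (-i, h)]    # g13, g15
        offsets += [(h, -i), (i, h)]    # g14, g16
    offsets += [(h, 0), (0, h)]         # g17
    offsets += [(-h, h)]                # g18
    return offsets


def step3_destinations(k):
    """Step-3 destinations as seen from node (0, 0), reduced modulo k."""
    return {(x % k, y % k) for x, y in a2at_step3(k)}


print(sorted(step3_destinations(4)))
```

The destinations are exactly the nodes whose x or y distance from the sender is k/2, which are the ones outside the (k − 1)-sized sub-network handled by steps 1 and 2.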
C. 2D Torus
A2AT can be adapted to a 2D torus network with only
minimal changes. For the torus network, messages are sent
with NCT = 4; that is, all communication controllers are
used. Using all the controllers means that all four links for all
four directions (+x, +y, −x, −y) are used.
In a torus network in which k is an even number, there are two possible routes to a destination at distance k/2. If one route is repeatedly selected, an imbalance occurs in the link usage. Therefore, in this case, different directions must be used to balance the link usage, as shown in Fig. 10.
D. Asymmetrical and higher dimension
A2AT can be applied to asymmetrical 2D mesh/torus networks such as rectangular networks. A p × q mesh/torus network (p ≥ q) consists of one or more q × q sections and a remaining section. In this case, messages are first sent to the nodes in the q × q sections. Then, the remaining messages are sent to the nodes in the other section. A2AT can also be extended to higher-dimensional networks such as 3D torus/mesh networks, in which it efficiently uses all the links of each node.
for i = 1 to k/2 − 1 do
  send(k/2, i); send(−i, k/2); {g13, g15}
  send(k/2, −i); send(i, k/2); {g14, g16}
end for
send(k/2, 0); send(0, k/2); {g17}
send(−k/2, k/2); {g18}
Fig. 9. Order of sending an A2AT message in step 3 when k is an even number.
for i = 1 to k/2 − 1 do
  send(k/2, i); send(−i, −k/2); {g13, g15}
  send(−k/2, −i); send(i, k/2); {g14, g16}
end for
send(k/2, 0); send(0, k/2); {g17}
send(−k/2, k/2); {g18}
Fig. 10. Order of sending an A2AT message in step 3 when k is an even number for torus network.
IV. ANALYTICAL COMMUNICATION TIME OF A2AT
In all-to-all communication, K(K − 1) different messages of the same size are transported, where K represents the number of nodes in the network. When the message size is large, the start-up time of communication is negligible. Moreover, all links in the network have the same bandwidth. For the remainder of this paper, we normalize the bandwidth of the links to 1.
One of A2AT’s features is that the hop count of a message equals the maximum number of messages sharing a link on its route, for all message transfers in the same group. For example, consider a case in which the hop count of all messages in g5 is 2. The maximum number of messages sharing a link on a route is also 2, which means that messages of the same size are each transferred with bandwidth 1/2. Hence, with the link bandwidth normalized to 1, the communication time of the messages in this group is 2. A2AT can therefore maintain synchronization without the need for a phasing operation, such as an MPI barrier. This feature allows the all-to-all communication time of A2AT to be calculated as the sum of the communication times of all groups.
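The relation between hop count and link sharing can be checked for an axis-aligned group with a small Python sketch. It models a single ring of k nodes, that is, one row of a torus, in which every node sends one message i hops in the +x direction at the same time. This simplified model and the function name are assumptions of the example; mesh edges and diagonal groups are not covered.

```python
def ring_link_loads(k, i):
    """Messages per +x link when every node x on a k-node ring sends to (x + i) mod k.

    Link n is the directed link from node n to node (n + 1) mod k. All
    messages travel in the +x direction, the shortest way for i <= (k - 1) / 2.
    """
    loads = [0] * k
    for src in range(k):
        for hop in range(i):
            loads[(src + hop) % k] += 1   # the message crosses this link
    return loads


# With hop count 2, every link is shared by exactly 2 messages.
print(ring_link_loads(5, 2))
```

Because each link is shared by i messages, each message receives bandwidth 1/i, and a unit-size message takes time i, which is the reasoning behind the group time of 2 in the g5 example above.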
A. Odd-sized 2D Mesh/torus
For a mesh network with NCT = 2, the messages of g1 and g5, and likewise those of g2 and g6, are sent simultaneously according to Fig. 6. Considering the hop count of the messages, the communication time for sending messages to the nodes that lie on the x or y axis is estimated by Equation (3):
$$T_1 = \sum_{i=1}^{S} i \times 2, \tag{3}$$
where S = ⌊(k − 1)/2⌋. The other communication times are estimated by Equation (4):
$$T_2 = \sum_{i=1}^{S} \sum_{j=1}^{S} (i + j) \times 2. \tag{4}$$
The overall communication time is the sum of Equations (3) and (4). Hence, the communication time for an odd-sized 2D mesh network is calculated by Equation (5):
$$T_{odd} = T_1 + T_2 = S(S + 1)(2S + 1) = \frac{k(k + 1)(k - 1)}{4}. \tag{5}$$
This equals the lower bound of the communication time shown in Equation (1).
For an odd-sized 2D torus network, the communication time is half that of the mesh network, i.e., k(k + 1)(k − 1)/8. This also equals the lower bound of the communication time shown in Equation (2).
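The closed form in Equation (5) can be confirmed by evaluating the sums of Equations (3) and (4) directly. The following Python sketch does this for small odd k; the function names are hypothetical.

```python
def t1(k):
    """Equation (3): time for destinations on the x or y axis."""
    s = (k - 1) // 2
    return sum(i * 2 for i in range(1, s + 1))


def t2(k):
    """Equation (4): time for all remaining destinations."""
    s = (k - 1) // 2
    return sum((i + j) * 2 for i in range(1, s + 1) for j in range(1, s + 1))


def t_odd(k):
    """Equation (5): total time on an odd-sized 2D mesh."""
    return t1(k) + t2(k)


for k in (3, 5, 7):
    s = (k - 1) // 2
    print(k, t_odd(k), s * (s + 1) * (2 * s + 1), k * (k + 1) * (k - 1) // 4)
```

For k = 5 the sums give T1 = 6 and T2 = 24, so the total of 30 matches both S(S + 1)(2S + 1) with S = 2 and k(k + 1)(k − 1)/4.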
B. Even-sized 2D Mesh/torus
For an even-sized 2D mesh/torus network, step 3 shown in Fig. 9 is required. The maximum number of messages sharing one link is (k/2 + i) when the messages in groups g13, · · · , g16 are sent. For sending messages to nodes that lie on the x or y axis (g17) and along the diagonal line (g18), it is k/2. The sum of these is calculated as
$$T_3 = \sum_{i=1}^{S} \left( \frac{k}{2} + i \right) \times 2 + \frac{k}{2} + \frac{k}{2}, \tag{6}$$
where S = (k/2) − 1. By adding Equation (5), evaluated for the sub-network of size k − 1, and Equation (6), the communication time of an even-sized 2D mesh is estimated as
$$T_{even} = \frac{k^3}{4}. \tag{7}$$
Equations (7) and (1) are equivalent.
For an even-sized 2D torus, the communication time is estimated as k^3/8 because the communication time of step 3 becomes T3/2.
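The even-sized result can be checked in the same way. The Python sketch below evaluates Equation (6) with S = k/2 − 1, where k is the side length of the network, adds the closed form of Equation (5) for the (k − 1)-sized sub-network, and compares the total with k^3/4. The function names are hypothetical.

```python
def t_odd_closed(k_odd):
    """Equation (5) in closed form, valid for an odd network size."""
    s = (k_odd - 1) // 2
    return s * (s + 1) * (2 * s + 1)


def t3(k):
    """Equation (6): time of step 3 for even k, with S = k/2 - 1."""
    s = k // 2 - 1
    return sum((k // 2 + i) * 2 for i in range(1, s + 1)) + k // 2 + k // 2


def t_even(k):
    """Steps 1 and 2 on the (k - 1)-sized sub-network plus step 3."""
    return t_odd_closed(k - 1) + t3(k)


for k in (4, 6, 8):
    print(k, t_even(k), k ** 3 // 4)
```

For k = 6, steps 1 and 2 contribute 30 and step 3 contributes 24, giving 54 = 6^3/4.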
The overall communication time of the mesh/torus network is estimated as k(k + 1)(k − 1)/3 when NCT = 1. We denote this communication time as TNCT1. When NCT = 2, the communication time of the torus, TNCT2, is half of TNCT1. Therefore, TNCT2 is estimated as k(k + 1)(k − 1)/6. When NCT = 4, the communication time reaches the lower bound LTR.
Table I lists the analytical communication times of A2AT with NCT = 1, 2, 4 on the mesh/torus network. They are the ideal communication times for each case. Note that A2AT uses the links more efficiently when NCT is higher, because messages that are sent consecutively travel in different directions.
TABLE I
LIST OF IDEAL A2AT COMMUNICATION TIME.
NCT   Mesh                           Torus
1     TNCT1 = k(k + 1)(k − 1)/3      TNCT1 = k(k + 1)(k − 1)/3
2     LMS = k(k + 1)(k − 1)/4        TNCT2 = k(k + 1)(k − 1)/6
4     -                              LTR = k(k + 1)(k − 1)/8
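The entries of Table I share the form k(k + 1)(k − 1)/d and differ only in the denominator d. The following Python sketch tabulates them for a given k; the dictionary, the function name, and the use of exact fractions are choices made for this example, and the values are in units of one message time with the link bandwidth normalized to 1.

```python
from fractions import Fraction

# Denominators d of k(k + 1)(k - 1) / d, copied from Table I.
IDEAL_DENOMINATOR = {
    ("mesh", 1): 3,    # T_NCT1
    ("torus", 1): 3,   # T_NCT1
    ("mesh", 2): 4,    # L_MS
    ("torus", 2): 6,   # T_NCT2
    ("torus", 4): 8,   # L_TR
}


def ideal_time(k, topology, nct):
    """Ideal A2AT communication time from Table I."""
    d = IDEAL_DENOMINATOR[(topology, nct)]
    return Fraction(k * (k + 1) * (k - 1), d)


for key in IDEAL_DENOMINATOR:
    print(key, ideal_time(7, *key))
```

A lookup that is not in the table, such as a mesh with NCT = 4, raises a KeyError, mirroring the empty cell in Table I.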
TABLE II
SIMULATION CONDITIONS.
Networks                    k-ary 2-mesh/cube (k = 6 · · · 17)
Communication algorithms    A2AND, A2AT
Flits per packet (FPP)      100
Packets per message (PPM)   1
Switching                   Wormhole routing, Virtual cut-through
Buffering                   Every flit
Arbitration                 Each packet by round-robin
Routing                     Dimension-order
V. EVALUATION
A. Simulator and Condition
For the evaluation, we used a cycle-based flit-level network simulator called Booksim [15]. In Booksim, the communication route is determined for each message and flow control is performed for every flit. It takes one cycle to issue a flit from a node to the network and to fetch a flit from the network to a node. We implemented the following features in Booksim to evaluate A2AT.
• A2A, A2AND, and A2AT communication algorithms
• Local synchronization (LS), which simulates MPI_Sendrecv.
• Varied start time (VST), which randomly varies the all-to-all communication start time for each node.
• Extra ports for nodes, which increase NCT.
The simulation conditions are shown in Table II.
First, we confirmed that A2AT performs well under an ideal network configuration. Then, we performed three experiments assuming practical networks.
B. Performance of A2AT on an Ideal Network
In this experiment, we evaluated the communication times of A2AT and A2AND on an ideal mesh/torus network and compared them. In the Booksim simulation, port arbitration is performed in a round-robin manner when several communication flows are present in a router. Because port arbitration is performed independently in each router, global communication fairness cannot be guaranteed. While this is normal for a practical network, A2AT works unstably under this condition.
For the experiment, we used a virtual output queue (VOQ) at the network level to increase the global communication fairness for all messages. VOQ is a configuration in which each port in the routers has many send queues [16]. We set the number of virtual channels (VC) equal to the number of nodes, because each virtual channel has a send queue in the Booksim simulation.
Fig. 11. All-to-all communication time of A2AT and A2AND on mesh/torus network (VC = k^2 − 1, Buf = 2, NCT = 1).
To simulate wormhole switching, we set the buffer size (Buf) of each virtual channel to 2, which is the minimum. This buffer can contain up to two flits. We used wormhole switching in order to reduce the buffering effect. NCT was set to 1. A2AT assumes static routing. However, when the size k of the torus network is even, Booksim randomly chooses one of the two possible routes. This may cause a small difference in the communication time of A2AT in this case.
The results are shown in Fig. 11. The x and y axes in the figure represent the network size k and the ratio of the communication time measured by the simulator to the ideal communication time TNCT1, respectively. Solid and dashed lines denote A2AT and A2AND, respectively, and circles and squares represent mesh and torus networks, respectively.
The measured time of A2AT on the mesh and torus networks was, on average, about 1.20 and 1.09 times the ideal time, respectively. This result demonstrates that the communication time obtained by A2AT is closer to the ideal communication time than that obtained by A2AND.
The difference between the measured and ideal communication times is probably caused by router processing. Routers modeled in Booksim require one cycle to pass a flit through the router, and these cycles appear as communication latency. The ideal time does not include this latency. The difference in hop count also has an effect: when several messages are sent from nodes located near the edge or corner of a network, the hop count of the messages varies from node to node.
C. Performance of A2AT on Practical Networks
1) Case of Minimum VC and Small Buffer Size: We compared the communication times of A2AT and A2AND when the networks had the minimum number of virtual channels and a small buffer, assuming a conventional network. We set VC to 2 for the torus network, as two virtual channels are required to avoid communication deadlock, and also set VC = 2 for the mesh network to ensure a common condition. We set Buf to 20. In this experiment, each message consisted of 100 flits, and therefore the routers performed wormhole switching.
Fig. 12. All-to-all communication time under condition assuming conventional network (VC = 2, Buf = 20, NCT = 1).
The results are shown in Fig. 12. The x and y axes in
the figure represent the size of networks k and the ratio of
the communication time estimated by the simulator to ideal
communication time TNCT1. The lines and marks have the
same meaning as in the previous figure.
The results show that, compared with A2AND, A2AT does
not work effectively. It seems that the disordering of com-
munication had a negative effect on the A2AT performance.
In this experiment, global communication fairness was not
guaranteed because V C was set to 2 and the transmission time
of each message therefore differed. These differences are quite
significant and can have a devastating impact on collective
communication. The variations of the transmission times also
contributed to the disordering of the communication.
To examine this issue more thoroughly, we performed an evaluation with LS, which is an effective mode for reducing variations in transmission time. The results of this evaluation are shown in Fig. 13. On both networks, A2AT reduced the communication time compared with A2AND, by 12.5% (mesh) and 36.0% (torus). A2AT works effectively when the differences in transmission time are small. This indicates that LS should be used on practical networks. From now on, we use LS for all evaluations.
2) Influence of Varied Start Time: In a practical network, the communication start times of the nodes vary because of, for example, load imbalance among the nodes. Shibamura et al. pointed out that A2AT communication is significantly influenced by such time variations [17]. As stated above, the effect of start time variation can be reduced by using the LS mode. To confirm this, we compared the A2AT and A2AND communication times by using VST on mesh and torus networks. VST varies the communication start time of each node. The results are shown in Fig. 14 and 15. The x and y axes in the figures represent the network size k and the ratio of the averaged time gap of communications to the overall communication time, respectively.
For both the mesh and torus networks, the graphs of A2AT and A2AND with VST were similar to those without VST. This indicates that variations in the communication start time do not greatly affect the communication efficiency of A2AT in LS mode.
Fig. 13. Communication time of A2AT on mesh/torus networks with LS (VC = 2, Buf = 20, NCT = 1).
Fig. 14. Communication time of A2AT on mesh network when start time varied (VC = 2, Buf = 20, NCT = 1, LS mode).
3) Case of Increasing NCT: We performed an experiment and compared the communication times of A2AT and A2AND on 3D mesh/torus networks when NCT is greater than 1.
For the experiment, we set VC = 1 for the mesh network and VC = 2 for the torus network. The extra VC for the torus is used only to avoid communication deadlock. We set fpp and Buf to 5 and 10, respectively. By using a smaller fpp than that used in the previous experiments, we reduced the computational resources required for the simulations. This allows all-to-all communication to be simulated on large-scale networks. For the mesh and torus networks, NCT is set to 2 and 6, respectively.
The evaluation results for the mesh and torus networks are shown in Fig. 16 and 17. The x and y axes in the figures represent the network size k and the overall cycle time for all-to-all communication, respectively.
As shown in Fig. 16, when two controllers are used (NCT = 2), A2AT reduced the communication time by 28.2% and 55.7% relative to A2AND on k = 15 (3,375 nodes) mesh and torus networks, respectively (18.6% and 44.8% on average). As shown in Fig. 17, when six controllers are used, A2AT also reduced the communication time by 15.1% and 41.9% relative to A2AND on the mesh and torus networks, respectively (14.4% and 37.5% on average). From these results, we found that A2AT is more effective than A2AND when NCT is greater than 1.
Fig. 15. Communication time of A2AT on torus network when start time varied (VC = 2, Buf = 20, NCT = 1, LS mode).
Fig. 16. Comparison of communication time of A2AT and A2AND on 3D mesh/torus network when NCT = 2 (VC = 1 for mesh, VC = 2 for torus, fpp = 5, Buf = 10, LS).
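The percentage reductions reported above can also be read as speedups. The following Python sketch shows the conversion; the function names are hypothetical, and the only inputs used are percentages quoted in the text, not additional measurements.

```python
def reduction(t_baseline, t_new):
    """Fractional reduction in communication time relative to the baseline."""
    return (t_baseline - t_new) / t_baseline


def speedup_from_reduction(r):
    """Speedup t_baseline / t_new implied by a fractional reduction r."""
    return 1.0 / (1.0 - r)


# Reductions reported for k = 15 with NCT = 2 (mesh, torus).
for r in (0.282, 0.557):
    print(f"{r:.1%} reduction -> {speedup_from_reduction(r):.2f}x speedup")
```

A reduction of 55.7% therefore means that A2AT needs a little less than half the time of A2AND, that is, a speedup of about 2.26.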
Fig. 17. Comparison of communication time of A2AT and A2AND on 3D mesh/torus network when NCT = 6 (VC = 1 for mesh, VC = 2 for torus, fpp = 5, Buf = 10, LS).
VI. CONCLUSION
In this paper, we proposed A2AT, an efficient all-to-all communication algorithm for mesh/torus networks. A2AT schedules the sending sequence so that all links are fully used, exploiting the ability of a node to transfer several messages concurrently. A2AT groups communications so that the hop count of a message equals the maximum number of messages sharing a link on its route, for all message transfers in the same group.
On mesh/torus networks that used VOQ, which is an ideal configuration for A2AT, the obtained communication times were, on average, roughly 1.20 and 1.09 times the analytically estimated ideal times. The results also demonstrated that the communication time obtained with A2AT was closer to the ideal time than that obtained with A2AND, the conventional algorithm.
In a simulation assuming practical networks with two virtual channels and small buffers, A2AT reduced the communication time by 12.5% (mesh) and 36.0% (torus) compared with A2AND. This improvement appears to be due to the use of local synchronization, which simulates MPI_Sendrecv.
We evaluated the influence of variation in the communication start time by comparing the A2AT and A2AND communication times when the start time was randomly varied. The results showed that the variation does not have any significant effect on the communication efficiency of A2AT.
We compared the performance of A2AT and A2AND when NCT was greater than 1. When two controllers are used (NCT = 2), A2AT reduced the communication time by 28.2% and 55.7% relative to A2AND on k = 15 (3,375 nodes) mesh and torus networks, respectively (18.6% and 44.8% on average from k = 5 to 15). When six controllers are used, A2AT reduced the communication time by 15.1% and 41.9% relative to A2AND on the mesh and torus networks, respectively (14.4% and 37.5% on average from k = 5 to 15).
ACKNOWLEDGMENT
The authors would like to thank Mr. Oinaga of Fujitsu
Limited for his helpful discussions. A part of this work was
supported by KAKENHI (22500052). The simulation was
mainly carried out using the computer facilities at Research
Institute for Information Technology, Kyushu University and
Information Technology Center at the University of Electro-
Communications.
REFERENCES
[1] P. H. Worley, R. F. Barrett, and J. A. Kuehn, “Early evaluation of the cray xt5,” in Proc. of the 51st Cray User Group Conference, May 2009, pp. 1–12.
[2] A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin, “A performance comparison through benchmarking and modeling of three leading supercomputers: Blue gene/l, red storm, and purple,” in SC ’06: Proc. of the 2006 ACM/IEEE conference on Supercomputing. IEEE Computer Society, 2006, p. 3.
[3] D. S. Scott, “Efficient all-to-all communication patterns in hypercube and mesh topologies,” in 6th Distributed Memory Computing Conference, 1991, pp. 398–403.
[4] T. Horie and K. Hayashi, “All-to-all personalized communication on a wraparound mesh,” Journal of Information processing society of Japan (in Japanese), vol. 34, no. 4, pp. 628–637, 1993.
[5] Y.-C. Tseng and S. K. S. Gupta, “All-to-all personalized communication in a wormhole-routed torus,” IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 5, pp. 498–505, May 1996.
[6] N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. Giampapa, P. Heidelberger, S. Singh, B. D. Steinmacher-Burow, T. Takken, M. Tsao, and P. Vranas, “Blue Gene/L torus interconnection network,” IBM Journal of Research and Development, vol. 49, no. 2/3, pp. 265–276, 2005.
[7] G. Almasi, P. Heidelberger, C. J. Archer, X. Martorell, C. C. Erway, J. E. Moreira, B. Steinmacher-Burow, and Y. Zheng, “Optimization of MPI collective communication on BlueGene/L systems,” in ICS ’05: Proceedings of the 19th annual international conference on Supercomputing. New York, NY, USA: ACM, 2005, pp. 253–262.
[8] S. Kumar, Y. Sabharwal, R. Garg, and P. Heidelberger, “Optimization of All-to-All Communication on the Blue Gene/L Supercomputer,” in 37th International Conference on Parallel Processing, Sept 2008, pp. 320–329.
[9] J. Doi and Y. Negishi, “Overlapping methods of all-to-all communication and fft algorithms for torus-connected massively parallel supercomputers,” in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–9. [Online]. Available: http://dx.doi.org/10.1109/SC.2010.38
[10] J. Bruck, C.-T. Ho, S. Kipnis, and D. Weathersby, “Efficient algorithms for all-to-all communications in multi-port message-passing systems,” in SPAA ’94: Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures. New York, NY, USA: ACM, 1994, pp. 298–309.
[11] V. Tipparaju and J. Nieplocha, “Optimizing all-to-all collective communication by exploiting concurrency in modern networks,” in SC ’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2005, p. 46.
[12] Y. Ajima, S. Sumimoto, and T. Shimizu, “Tofu: A 6d mesh/torus interconnect for exascale computers,” Computer, pp. 36–40, 2009.
[13] “K computer,” http://www.aics.riken.jp/en/kcomputer/.
[14] “Mpich,” http://www-unix.mcs.anl.gov/mpi/.
[15] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003.
[16] P. Garcia, F. Quiles, J. Flich, J. Duato, I. Johnson, and F. Naven, “Efficient, scalable congestion management for interconnection networks,” Micro, IEEE, vol. 26, no. 5, pp. 52–66, Sept.-Oct. 2006.
[17] H. Shibamura, H. Miwa, R. Susukita, T. Hirao, Y. Ajima, I. Miyoshi, T. Shimizu, H. Ishihata, and K. Inoue, “Simulation evaluation of an optimal all-to-all communication algorithm using packet pacing,” IPSJ SIG Notes (in Japanese), vol. 2010-HPC-126, no. 14, pp. 1–9, 20100727.