article in press - ایران عرضهiranarze.ir/wp-content/uploads/2016/11/e682.pdf · article in...
TRANSCRIPT
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
Computer Communications xxx (2016) xxx–xxx
Contents lists available at ScienceDirect
Computer Communications
journal homepage: www.elsevier.com/locate/comcom
Towards cost-effective and low latency data center network
architecture
Ting Wang
a , 1 , ∗, Zhiyang Su
b , 2 , Yu Xia
c , 3 , Bo Qin
b , 4 , Mounir Hamdi d , 5 Q1
a Hong Kong University of Science and Technology, Hong Kong Q2
Q3
b Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong c College of Computer Science, Sichuan Normal University, China d College of Science, Engineering and Technology in Hamad bin Khalifa University, Qatar
a r t i c l e i n f o
Article history:
Received 17 September 2015
Revised 25 February 2016
Accepted 27 February 2016
Available online xxx
Keywords:
Data center networks
Architecture
Torus topology
Probabilistic weighted routing
Deadlock-free
a b s t r a c t
This paper presents the design, analysis, and implementation of a novel data center network architecture,
named NovaCube . Based on regular Torus topology, NovaCube is constructed by adding a number of most
beneficial jump-over links, which offers many distinct advantages and practical benefits. Moreover, in or-
der to enable NovaCube to achieve its maximum theoretical performance, a probabilistic oblivious routing
algorithm PORA is carefully designed. PORA is a both deadlock and livelock free routing algorithm, which
achieves near-optimal performance in terms of average routing path length with better load balancing
thus leading to higher throughput. Theoretical derivation and mathematical analysis together with exten-
sive simulations further prove the good performance of NovaCube and PORA.
© 2016 Elsevier B.V. All rights reserved.
1. Introduction 1
The data center network (DCN) 6 architecture is regarded as one 2
of the most important determinants of network performance in 3
data centers, and plays a significant role in meeting the require- 4
ments of could-based services as well as the agility and dynamic 5
reconfigurability of the infrastructure for changing application de- 6
mands. As a result, many novel proposals, such as Fat-Tree [1] , 7
VL2 [2] , DCell [3] , BCube [4] , c-Through [5] , Helios [6] , SprintNet 8
[7,8] , CamCube [9] , Small-World [10] , NovaCube [11] , CLOT [12] , 9
and so on, have been proposed aiming to efficiently interconnect 10
the servers inside a data center to deliver peak performance to 11
users. 12
Generally, DCN topologies can be classified into four categories: 13
multi-rooted tree-based topology (e.g. Fat-Tree), server-centric 14
∗ Corresponding author. Tel.: +86 13681755836. Q4
E-mail addresses: [email protected] (T. Wang), [email protected] (Z. Su),
[email protected] (Y. Xia), [email protected] (B. Qin), [email protected] (M. Hamdi). 1 Student Member, IEEE 2 Student Member, IEEE 3 Member, IEEE 4 Student Member, IEEE 5 Fellow, IEEE 6 The term “data center network” and “DCN” are used interchangeably in this
paper.
topology (e.g. DCell, BCube, SprintNet), hybrid network (e.g. c- 15
Through, Helios) and direct network (e.g. CamCube, Small-World) 16
[13] . Each of these has their advantages and disadvantages. 17
Tree-based topologies, like FatTree and Clos, can provide full 18
bisection bandwidth, thus the any-to-any performance is good. 19
However, their building cost and complexity is relatively high. 20
The recursive-defined server-centric topology usually concentrates 21
on the scalability and incremental extensibility with a lower 22
building cost; however, the full bisection bandwidth may not be 23
achieved and their performance guarantee is only limited to a 24
small scope. The hybrid network is a hybrid packet and circuit 25
switched network architecture. Compared with packet switching, 26
optical circuit switching can provide higher bandwidth and lower 27
latency in transmission with lower energy consumption. However, 28
optical circuit switching cannot achieve full bisection bandwidth at 29
packet granularity. Furthermore, the optics also suffers from slow 30
switching speed which can take as high as tens of milliseconds. 31
The direct network topology, which directly connects servers 32
to other servers, is a switchless network interconnection without 33
any switches, or routers. It is usually constructed in a regular pat- 34
tern, such as Torus (as show in Fig. 1 ). Besides being widely used 35
in high-performance computing systems, Torus is also an attrac- 36
tive network architecture candidate for data centers. However, this 37
design suffers consistently from poor routing efficiency compared 38
with other designs due to its relatively long network diameter (the 39
maximum shortest path between any node pairs), which is known 40
http://dx.doi.org/10.1016/j.comcom.2016.02.016
0140-3664/© 2016 Elsevier B.V. All rights reserved.
Please cite this article as: T. Wang et al., Towards cost-effective and low latency data center network architecture, Computer Communi-
cations (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
2 T. Wang et al. / Computer Communications xxx (2016) xxx–xxx
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
1-D Torus (4-ary 1-cube) 2D Torus (4-ary 2-cube) 3D Torus (3-ary 3-cube)
Fig. 1. Examples of 1D, 2D, 3D Torus topologies.
as ⌊
k 2
⌋n for a n -D Torus 7 with radix k . Besides, a long network di- 41
ameter may also lead to high communication delay. Furthermore, 42
its performance largely depends on the routing algorithms. 43
In order to deal with these imperfections, in this paper we pro- 44
pose a novel container level high-performance Torus-based DCN 45
architecture named NovaCube . The key design principle of NovaC- 46
ube is to connect the farthest node pairs by adding additional 47
jump-over links. In this way, NovaCube can halve the network 48
diameter and receive higher bisection bandwidth and through- 49
put. Moreover, we design a new weighted probabilistic oblivi- 50
ous deadlock-free routing algorithm PORA for NovaCube , which 51
achieves low average routing path length and good load-balancing 52
by exploiting the path diversities. 53
The primary contributions of this paper can be summarized as 54
follows: 55
(1) We propose a novel Torus-based DCN architecture NovaCube , 56
which exhibits good performance in network latency, bisec- 57
tion bandwidth, throughput, path diversity and average path 58
length. 59
(2) We carefully design a weighted probabilistic oblivious rout- 60
ing algorithm PORA, which is both deadlock-free and 61
62
63
(64
65
(66
67
68
(69
70
71
72
73
rev74
dem75
is 76
gor77
cre78
ogr79
sys80
clu81
2. 82
2.1.83
84
ity 85
7
cha
creases the communication efficiency. Besides being widely used 86
in supercomputing, Torus network has also been introduced to 87
the data center networks. Three typical representatives are namely 88
CamCube [9] , Small-World [10] and CLOT [12] . 89
CamCube was proposed by Abu-Libdeh et al., and the servers 90
in CamCube are interconnected in a 3D Torus topology. The Cam- 91
Cube is designed target to shipping container-sized data centers, 92
and is a server-only switchless network design. With the benefit of 93
Torus architecture and the flexibility offered by a low-level link ori- 94
entated CamCube API, CamCube allows applications to implement 95
their own routing protocols so as to achieve better application- 96
level performance. However, as aforementioned this design based 97
on the regular Torus suffers long average routing path – O ( N
1/3 ) 98
hops, with N servers, which results in poor routing efficiency. 99
In order to overcome this limitation, Shin Ji-Yong, et al. pro- 100
posed Small-World, which provides an unorthodox random data 101
center network topology. It is constructed based on some regular 102
topologies (such as ring, Torus or cube) with the addition of a large 103
number of random links which can help reduce the network di- 104
ameter and achieve higher routing efficiency. The degree of each 105
node in Small-World is limited to six, taking realistic deployment 106
and 07
me 08
geo 09
me 10
poo 111
12
top 13
to 14
per 15
Tor 16
end 117
diff 18
con 19
CLO 20
len 21
Pl
ca
livelock-free, and helps NovaCube achieve good load balanc-
ing.
3) We introduce a credit-based lossless flow control mecha-
nism in NovaCube network.
4) We design a practical geographical address assignment
mechanism, which also can be applied to the traditional
Torus network.
5) We implement NovaCube architecture, PORA routing algo-
rithm and the flow control mechanism in NS3. Extensive
simulations are conducted to demonstrate the good perfor-
mance of NovaCube .
The rest of the paper is organized as follows. First, we briefly
iew the related research literature in Section 2 . Then Section 3
onstrates the motivation. In Section 4 , NovaCube architecture
introduced and analyzed in detail. Afterwards, the routing al-
ithm PORA is designed in Section 5 . Section 6 introduces the
dit-based flow control mechanism. Section 7 demonstrates a ge-
aphical address assignment mechanism. Section 8 presents the
tem evaluation and simulation results. Finally, Section 9 con-
des this paper.
Related work
Network interconnection
The Torus-based topology well implements the network local-
forming the servers in close proximity of each other, which in-
n -D Torus with radix k is also called k -ary n -cube, which may be used inter-
ngeably in this paper.
sw 22
2.2 23
24
tot 25
of 26
era 27
as 28
( 29
(
ease cite this article as: T. Wang et al., Towards cost-effective and low l
tions (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
low cost into consideration. In addition to traditional routing 1
thods, Small-World also provides content routing coupled with 1
graphical address assignment, which in turn efficiently imple- 1
nts key-value stores. However, its shortest path routing suffers 1
r worst-case throughput and poor load balancing.
CLOT was also a DCN architecture designed based on Torus 1
ology. CLOT shares the same goal with Small-World, which aims 1
reduce the routing path length and improve its overall network 1
formance while retaining the Torus merits. Based on regular 1
us topology, CLOT uses a number of most beneficial small low- 1
switches to connect each node and its most distant nodes in
erent dimensions. In this way, for a n-D CLOT, each switch will 1
nect to 2 n nodes. By employing additional low-end switches, 1
T largely shortens the network diameter and the average path 1
gth. However, it also induces an extra expenditure on these 1
itches. 1
. Power savings in data centers 1
The energy cost of a data center accounts for a large portion of 1
al budget [14–17] . There have emerged a considerable number 1
research and investigation to achieve a green data center. Gen- 1
lly, the existing proposals can be classified into four categories 1
below. 1
1) Network level : This scheme usually resorts to energy- 1
aware routing, VM migration/placement, flow scheduling 130
and workload management mechanism to consolidate traf- 131
fic and turn off idle switches/servers [17–20] . 132
2) Hardware level : This scheme aims to design energy- 133
efficient hardware (e.g. server, switch) by using certain 134
atency data center network architecture, Computer Communi-
T. Wang et al. / Computer Communications xxx (2016) xxx–xxx 3
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
energy-efficient techniques like DVFS, VOVO, PCPG,and so on 135
[21–24] . 136
(3) Architectural level : This scheme designs energy-efficient net- 137
work architecture to achieve power savings, examples like 138
flattened butterfly topology [25] , Torus-based topology [9–139
11] and content-centric networking (CCN) based architec- 140
tures which can reduce the content distribution energy costs 141
[26,27] . 142
(4) Green energy resources : This scheme makes use of green 143
sources to reduce the energy budget such as wind, water, 144
solar energy, heat pumps, and so on [28,29] . 145
NovaCube can be considered as an architectural level approach, 146
which avoids using energy hungry switches. Moreover, NovaCube 147
n- 148
cy 149
150
151
152
on 153
or 154
ch 155
156
ss 157
n, 158
ng 159
160
c- 161
ny 162
ce 163
le 164
ks. 165
s- 166
to 167
al- 168
169
es 170
o- 171
e- 172
he 173
o- 174
ks, 175
rt 176
177
la- 178
is 179
/IP 180
ng 181
rk 182
s- 183
of 184
so 185
st 186
ce 187
us 188
nt 189
190
en 191
192
193
194
195
3.2. Routing issues in Torus 196
Any well qualified routing algorithm design in Torus network 197
must take all important metrics (such as throughput, latency, aver- 198
age path length, load balancing, deadlock free) into consideration. 199
However, the current existing routing algorithms in Torus are far 200
from perfect, as when they improve some certain performance it 201
is usually at the sacrifice of others. Generally the routings in Torus 202
can be divided into two classes: deterministic routing and adaptive 203
routing. A common example of deterministic routing is dimension- 204
ordered routing (DOR) [34] , where the message routes dimension 205
by dimension and the routing is directly determined by the source 206
address and destination address without considering the network 207
state. DOR achieves a minimal routing path, but also eliminates 208
in 209
o- 210
ti- 211
e, 212
6] 213
al 214
ke 215
ng 216
ed 217
on 218
re 219
e- 220
le 221
k. 222
k- 223
t- 224
by 225
al 226
227
n- 228
r- 229
rt- 230
nt 231
to 232
- 233
on 234
es 235
236
237
al- 238
on 239
a 240
241
en 242
m 243
244
de 245
246
1)
ue 247
248
e 249
would also save the cooling cost induced by cooling the heat ge
erated by switches. More discussions about the energy efficien
performance of NovaCube are given in Section 4.2.7 .
3. Motivation
3.1. Why Torus-based clusters
The Torus (or precisely k -ary n -cube) based intserconnecti
has been regarded as an attractive DCN architecture scheme f
data centers because of its own unique advantages, some of whi
are listed below.
Firstly, it incurs lower infrastructure cost since it is a switchle
architecture without needing any expensive switches. In additio
the power consumed by the switches and its associated cooli
power can also be saved.
Secondly, it achieves better fault-tolerance. Traditional archite
ture is usually constructed with a large number of switches, a
failure of which could greatly impact on the network performan
and system reliability. For example, if a ToR switch fails, the who
rack of servers will lose connection with the servers in other rac
Comparatively, the rich interconnectivity and in-degree of Toru
based switchless architecture makes the network far less likely
be partitioned. The path diversity can also provide good load b
ance even on permutation traffic.
Thirdly, the architectural symmetry of Torus topology optimiz
the scalability and granularity of Clusters. It allows systems to ec
nomically scale to tens of thousands of servers, which is well b
yond the capacity of Fat-Tree switches. For an n -ary k -cube, t
network can support up to k n nodes, and scales at a high exp
nential speed which outperforms traditional switched networ
such as Fat-Tree’s O ( p 3 ), and BCube’s O ( p 2 ), where p denotes p -po
switch.
Fourthly, Torus is also highlighted by its low cross-cluster
tency. In traditional switched DCNs, an inevitably severe problem
that the switching latency (several μs in each hop) and the TCP
processing latency (tens of μs) are very high, which leads to a lo
RTT. Comparatively, TCP/IP stack is not needed in Torus netwo
which saves the long TCP/IP processing time, and the NIC proces
ing delay is also lower than switches (e.g., the processing delay
a real VirtualShare NIC engine is only 0.45 μ s). Besides, Torus al
avoids the network oversubscription and provides many equal co
routing paths to avoid network congestion, which can help redu
the queuing delay due to network congestion. Consequently, Tor
achieves a much lower end-to-end delay, which is very importa
in the data center environment.
Fifthly, its high network performance has already been prov
in high-performance systems and supercomputers, such as Cray
Inc.’s Cray SeaStar (2D Torus) [30] , Cray Gemini (3D Torus) [31] ,
IBM’s Blue Gene/L (3D Torus) [32] and Blue Gene/Q (5D Torus)
[33] .
250
2)
Please cite this article as: T. Wang et al., Towards cost-effective and
cations (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
any path diversity provided by Torus topology, which results
poor load balancing and low throughput. As an improved tw
phase DOR algorithm, Valiant routing (VAL) [35] can achieve op
mal worst-case throughput by adding a random intermediate nod
but it destroys locality and suffers longer routing path. ROMM [3
and RLB [37] implements good locality, but cannot achieve optim
worst-case throughput. Comparatively, the adaptive routing (li
MIN AD [38] ) uses local network state information to make routi
decisions, which achieves better load balancing and can be coupl
with a flow control mechanism. However, using local informati
can lead to non-optimal choices while global information is mo
costly to obtain, and the network state may change rapidly. B
sides, adaptive routing is not deadlock free, where a resource cyc
can occur without routing restrictions which leads to a deadloc
Thus, adaptive routings have to apply some dedicated deadloc
avoiding techniques, such as Turn Model Routing (by elimina
ing certain turns in some dimensions) and Virtual Channels (
decomposing each unidirectional physical link into several logic
channels with private buffer resources), to prevent deadlock.
To summarize, the good features of Torus conclusively demo
strate its superiority in constructing a cost-effective and high pe
formance data center network. However, it also suffers some sho
comings, such as the relatively long routing path, and inefficie
routing algorithm with low worst-case throughput. In response
these issues, in this paper we propose some practical and effi
cient solutions from the perspectives of physical interconnecti
and routing while inheriting and keeping the intrinsic advantag
of Torus topology.
4. NovaCube network design
This section presents the network design and theoretical an
ysis of NovaCube . Before introducing the physical interconnecti
structure, we firstly provide a theorem with proof, which offers
theoretical basis of NovaCube design.
Theorem 4.1. For any node A(a 1 , a 2 , … , a n ) in a k-ary n-cube (wh
k is even) if B(b 1 , b 2 , … , b n ) is assumed to be the farthest node fro
A, then B is unique and B’s unique farthest node is exactly A.
Proof. In a k -ary n -cube, if B( b 1 , b 2 , … , b n ) is the farthest no
from A( a 1 , a 2 , … , a n ), where a i ∈ [0, k ), b i ∈ [0, k ), then there is:
b 1 =
(a 1 +
k
2
)mod k, . . . , b n =
(a n +
k
2
)mod k (
Since the result of (a i +
k 2 ) mod k is unique, thus ∀ b i is uniq
and b i ∈ [0, k ). Hence, A’s farthest node B is unique.
Next, assume B’s farthest node is A
′ ( a ′ 1 , a ′ 2 , . . . , a ′ n ), similarly w
have:
a ′ 1 =
(b 1 +
k )
mod k, . . . , a ′ n =
(b n +
k )
mod k (
2 2low latency data center network architecture, Computer Communi-
4 T. Wang et al. / Computer Communications xxx (2016) xxx–xxx
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
ube (f
251
a ′ i
(1)252
a ′ i
(2)253
a ′ i
As 254
A’(255
fro256
4.1.257
258
ogy259
� k 2 260
ing261
the262
jum263
net264
can265
(a n266
bri267
nod268
two269
4.1. 70
71
the 72
pai 73
far 74
link 75
ori 76
Nov 77
4.1. 78
79
to 80
sin 81
tive 82
onl 83
dim 84
is 85
k − 86
pai 87
nod 88
a t 89
4.2 90
91
wo 92
wid 93
4.2 94
95
jum 96
am 97
to 98
D N
Pro 99
is D 00
can 01
Pl
ca
Fig. 2. A 2D 6 × 6-node and 3D 4 × 4 × 4-node NovaC
By combining (1) and (2), we can get:
=
(b i +
k
2
)mod k =
{[(a i +
k
2
)mod k
]+
k
2
}mod k
∵ a i ∈ [0 , k ) , ∴ a i +
k
2
∈
[ k
2
, k +
k
2
) For the case of a i +
k 2 ∈ [ k 2 , k ) , we have
=
{[(a i +
k
2
)mod k
]+
k
2
}mod k
=
(a i +
k
2
+
k
2
)mod k = (a i + k ) mod k = a i
For the case of a i +
k 2 ∈ [ k, k +
k 2 ) , we have
=
{[(a i +
k
2
)mod k
]+
k
2
}mod k
=
(a i +
k
2
− k +
k
2
)mod k = a i mod k = a i
a consequence of the above, a ′ i = a i for ∀ i ∈ [1, n ]. Therefore,
a ′ 1 , a ′ 2 , . . . , a ′ n ) = A( a 1 , a 2 , … , a n ), which means the farthest node
m B is exactly A. This ends the proof. �
NovaCube physical structure
As aforementioned, one critical drawback of k -ary n -cube topol-
is its relatively long network diameter, which is as high as
� n . In order to decrease the network diameter and make rout-
packets to far away destinations more efficiently, based on
regular k -ary n -cube, NovaCube is constructed by adding some
p-over links connecting the farthest node pairs throughout the
work. In a n -D Torus the most distant node of ( a 1 , a 2 , … , a n )
be computed as ( (a 1 + � k 2 � ) mod k, (a 2 + � k 2 � ) mod k, . . . ,
+ � k 2 � ) mod k ), which guides the construction of NovaCube . In
ef, the key principle of NovaCube is to connect the most distant
e pairs by adding one jump-over link. More precisely, there are
construction cases with tiny differences.
ease cite this article as: T. Wang et al., Towards cost-effective and low l
tions (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
or simplicity, not all jump-over links are shown).
1. Case #1: k is even 2
When k is even, then according to Theorem 3.1 any node’s far- 2
st node is unique to each other, and there are k n
2 farthest node 2
rs, where k n is the total number of nodes. In this case, all the k n
2 2
thest node pairs are connected to each other by one jump-over 2
. As a result, the degree of each node will be increased from 2
ginal 2 n to 2 n + 1 . Fig. 2 presents two examples of 2D and 3D 2
aCube . 2
2. Case #2: k is odd 2
When k is odd, one node’s farthest node cannot be guaranteed 2
be unique, nor is the number of node pairs k n
2 an integer either 2
ce k n is odd. In consideration of this fact, we have no alterna- 2
but to settle for the second-best choice. The eclectic way is to 2
y construct ( k -1)-ary n - NovaCube , and keep the k th node in each 2
ension unchanged. Noticing that k ( k ≥ 1) is odd, then k − 1 2
even. Therefore, the construction of n -D NovaCube with radix 2
1 is the same as Case 1. Consequently, there are (k −1) n
2 node 2
rs with node degree of 2 n + 1 that are connected, and n k − n k −1 2
es with node degree of 2 n remain unchanged. This way makes 2
rade-off, however a small one. 2
. Properties of NovaCube 2
As with any network, the performance of the NovaCube net- 2
rk is characterized by its network diameter, bisection band- 2
th, throughput, path diversity and physical cost. 2
.1. Network diameter 2
After connecting the most distant node pairs by additional 2
p-over links, the NovaCube network architecture halves the di- 2
eter, where the diameter is reduced from original D Torus =
⌊k 2
⌋n 2
be current 2
ov aCube =
⌈ ⌊k 2
⌋n
2
⌉
(3)
of. The network diameter of a regular k -ary n -cube ( n -D Torus) 2
Torus =
⌊k 2
⌋n, which means that any node inside the network 3
reach all the other nodes within
⌊k 2
⌋n hops. For any node A 3
atency data center network architecture, Computer Communi-
T. Wang et al. / Computer Communications xxx (2016) xxx–xxx 5
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
Fig. 3. A k -ary n - NovaCube .
in Torus, we assume that node B is the farthest node from node 302
A. Next, we assume set S i to denote the nodes that can be reached 303
from node A at the i -th hop in Torus, where i ∈ [0 , ⌊
k 2
⌋n ] . Then 304
the universal set of all nodes in the network can be expressed as 305
S =
∑ � k 2 � n
i =0 S i . 306
After linking all the most distant node pairs (e.g. A and B) 307
in NovaCube , if we define S ′ i
as the set of nodes that are i hops 308
from node A in NovaCube , then: (1) for the case of � k 2 � n is even, 309
we have S ′ 0
= S 0 , S ′ 1
= S 1 + S � k 2 � n , S
′ 2
= S 2 + S � k 2 � n −1
, . . . , S ′ � k 2 � n/ 2
= 310
S � k 2 � n/ 2
+ S � k 2 � n/ 2+1
; (2) for the case of � k 2 � n is odd, we have 311
S ′ 0
= S 0 , S ′ 1
= S 1 + S � k 2 � n , S
′ 2
= S 2 + S � k 2 � n −1
, . . . , S ′ � k 2 � n/ 2 −1
= 312
S � k 2 � n/ 2 −1
+ S � k 2 � n/ 2 +1
, S ′ � k 2 � n/ 2 = S � k
2 � n/ 2 . This demonstrates 313
that in NovaCube any node A can reach all nodes of the entire net- 314
rk 315
316
317
he 318
rk 319
t- 320
ar- 321
of 322
er 323
r- 324
bi- 325
be 326
k 327
c- 328
nd 329
c- 330
331
4)
a 332
re, 333
st 334
es 335
336
337
to 338
on 339
nd 340
al 341
nd 342
in 343
re 344
d. 345
ad 346
= 347
ck 348
channel is saturated and equal to the channel bandwidth b c . Thus, 349
the ideal throughput �ideal of a topology is 350
�ideal =
b c
ω max (5)
Under uniform traffic pattern, the maximum channel load ω max 351
at the bisection channel has a lower bound, which in turn gives 352
an upper bound on throughput. For a uniform traffic pattern, on 353
average k n
2 packets must go through the B T bisection channels. If 354
the routing and flow control are optimal, then the packets will be 355
distributed evenly among all bisection channels which results in 356
the best throughput. Thus, the load on each bisection channel load 357
is at least 358
ω max ≥k n
2
B T
=
k n
2
k n + 4 k n −1 =
k
2 k + 8
(6)
i- 359
360
7)
an 361
to 362
0] . 363
t- 364
o- 365
h- 366
%. 367
r- 368
369
370
ity 371
c- 372
al- 373
ch 374
he 375
376
ge 377
- 378
er 379
an 380
B 381
of 382
383
8)
se 384
n- 385
i- 386
n- 387
er 388
er 389
work within � k 2 � n/ 2 or � k 2 � n/ 2 hops. Consequently, the netwo
diameter of NovaCube is � k 2 � n/ 2 . This ends the proof. �
4.2.2. Bisection bandwidth
The bisection bandwidth can be calculated by summing up t
link capacities between two equally-sized parts which the netwo
is partitioned into. It can be used to evaluate the worst-case ne
work capacity [39] . Assume the NovaCube network T ( N 1 , N 2 ) is p
titioned into two equal disjoint sets N 1 and N 2 , each element
T ( N 1 , N 2 ) is a bidirectional channel with a node in N 1 and anoth
node in N 2 . Then the number of bidirectional channels in the pa
tition is | T ( N 1 , N 2 ) | , or 2 | T ( N 1 , N 2 ) | channels in total, thus the
section bandwidth is B T = 2 | T (N 1 , N 2 ) | . For a k -ary n - NovaCu
as shown in Fig. 3 , when k is even, there is even number of
k -ary ( n -1)-cube, which can be divided by the minimum bise
tion into two equal sets with 2 k n −1 regular bidirectional links a
k ∗ k n −1
2 jump-over bidirectional links. Therefore, the channel bise
tion bandwidth of k -ary n - NovaCube is computed as:
B T = 2 ∗(
2 k n −1 + k ∗ k n −1
2
)= k n + 4 k n −1 (
According to the result in [40] , the bisection bandwidth of
regular n -dimensional Torus with radix k is B C = 4 k n −1 . Therefo
NovaCube effectively increases the bisection bandwidth by at leaB T −B C
B C =
k n +4 k n −1 −4 k n −1
4 k n −1 =
k 4 ≥ 25% (k ≥ 1) and the ratio increas
accordingly as k increases.
4.2.3. Throughput
Throughput is a key indicator of the network capacity
measure a topology. It not only largely depends on the bisecti
bandwidth, but is also determined by the routing algorithm a
flow control mechanism. However, we can evaluate the ide
throughput of a topology under the assumed perfect routing a
flow control. The maximum throughput means some channel
the network is saturated and the network cannot carry mo
traffic. Thus, the throughput is closely related to the channel loa
We assume the bandwidth of each channel is b c and the worklo
on a channel c is ω c . Then the maximum channel load ω max
Max { ω c , c ∈ C }. The ideal throughput occurs when the bottlene
Please cite this article as: T. Wang et al., Towards cost-effective and
cations (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
Consequently, the upper bound on an ideal throughput under un
form traffic can be derived from Eqs. (5) and (6) :
�ideal =
b c
ω max ≤ 2 k + 8
k b c (
This exhibits that NovaCube achieves better performance th
regular Torus topology in the network capacity with respect
throughput, where the ideal throughput of Torus is only 8 b c / k [4
Here we normalize the worst-case throughput ̂ �nw
to the ne
work capacity: ̂ �nw
=
ω max
ω nw ( ̂ R ) , where ̂ R indicate a routing alg
rithm. Valiant routing (VAL) [35] is a known worst-case throug
put optimal routing algorithm in Torus which obtains ̂ �nw
= 50
Thus, an optimal routing algorithm in NovaCube can achieve no
malized worst-case throughput of at least ̂ �nw
= 62.5%.
4.2.4. Path diversity
Inherited from Torus topology, NovaCube provides a divers
of paths, which can be exploited in routing algorithm by sele
tively distributing traffic over these paths to achieve load b
ancing. Besides, the network reliability also greatly benefits mu
from the path diversity, where the traffic can route around t
faulty nodes/links by taking alternative paths.
The number of distinct paths existing in NovaCube is too hu
to be calculated exactly, for simplicity, we first compute the num
ber of shortest paths in a regular Torus without any jump-ov
links. Assume two nodes A( a 1 , a 2 , . . . , a n ) and B( b 1 , b 2 , . . . , b n ) in
n -dimensional Torus, and the coordinate distance between A and
in the i th dimension is �i = | a i − b i | . Then the total number
shortest paths P ab between A and B is:
P ab =
n ∏
i =1
( n ∑
j= i � j
�i
)=
( ∑ n
i =1 �i )! ∏ n i =1 �i !
(
where the term
(∑ n j= i � j
�i
)computes the number of ways to choo
where to take the �i hops in dimension i out of all the remai
ing hops. It can be seen that a longer distance and a higher d
mension result in a larger number P ab of shortest paths. For i
stance, if given �x = 3, �y = 4, �z = 5 in a 3D Torus, the numb
of shortest paths is as high as 27720. If we further add a larg
low latency data center network architecture, Computer Communi-
6 T. Wang et al. / Computer Communications xxx (2016) xxx–xxx
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
number of additional jump-over links in NovaCube , the number of 390
paths will be larger. If taking the non-minimal paths into consider- 391
ation as designed in some routing algorithms, the number of feasi- 392
ble paths is nearly unlimited. The great path diversity of NovaCube 393
offers many benefits as aforementioned, but it is also confronted 394
with great challenges in designing an efficient, deadlock free and 395
load balanced routing algorithm. 396
4.2.5. Average path length 397
In this subsection, we derive the average path length (APL) for 398
n -D399
= 2400
Nov401
s to402
nei403
404
wh405
d m406
�(
wh407
( ∑
408
E [ �409
hav410
E[�
411
wh412
cre413
the414
4.2415
416
k is417
is g418
Nov419
the420
pai421
1 2 ) k422
link423
The424
N lin
425
wh426
tive427
of 428
link429
the430
Fat431
for432
has433
Mo434
the435
coo436
can437
4.2.7. Power savings 438
The power savings of data center is related to many factors, 439
which can be divided into several categories including hardware 440
level (server/switch using energy-efficient techniques like DVFS, 441
VOVO, PCPG, etc.), network level (energy-aware routing and flow 442
scheduling, job placement, energy-aware VM migration, etc.), 443
architectural level (e.g. switchless architecture), cooling system 4 4 4
design (cooling techniques) and the energy resources (e.g. re- 445
newable or green resources like wind, water, solar energy, heat 446
pumps) [17,41,42] . NovaCube can be considered as an architectural 447
lev 48
cor 49
con 50
bud 51
por 52
ori 53
wil 54
reg 55
oth 56
ene 57
har 58
ach 59
5. 60
61
nam 62
its 63
we 64
loc 65
5.1. 66
No 67
B( b 68
b i || 69
70
The 71
pro 72
qua 73
vaC 74
5.1. 75
76
sou 77
cid 78
S 1 , 79
the 80
PO 81
pac 82
rec 83
S ’s 84
nex 85
me 86
ma 87
p i
wh 88
Fig 89
and 90
= 6 91
Pl
ca
NovaCube with an even radix k , and calculate its value for n
. Due to the symmetry, every node is the same in the k -ary n -
aCube . Hence, we can only consider the APL from a fixed source
any possible destination d . Here we denote m as the jump-over
ghbour of s .
Denote source s = (0, … , 0) and destination d = ( x 1 , … , x n ),
ere x i ∈ [0 , k − 1] . Then we have m = ( k 2 , . . . , k 2 ). Thus, the s -
inimal distance in k -ary n - NovaCube is given as:
s, d) def = min {|| s − d|| 1 , || m − d|| 1 + 1 } (9)
ere || · || p is p -norm of the vector meaning that || x || p =
n i =1 | x i | p )
1 p for the n -dimensional vector x . Then, the APL is
( s , d )]. Without loss of generality, for the case n = 2 , we can
e
(s, d)] =
1 ∗ 5 +
∑ k/ 2 −1 i =2
(8 i − 4) ∗ i + (4 k − 6) ∗ k 2
k 2 − 1
=
k 3
3 +
k 2
2 − 4 k
3 + 1
k 2 − 1
(10)
Thus, the APL of NovaCube approaches to k 3 when k is large,
ich is superior to 2D Torus’s k 2 [40] , and as the dimension in-
ase NovaCube reduces more. In the similar way, we can compute
APL for the k -ary n -D NovaCube .
.6. Cost- effectiveness
The total number of nodes of a n -dimensional Torus with radix
k n and the degree of each node is 2 n , thus the number of links
iven as 2 nk n
2 = nk n . Therefore, the total number of links N links in
aCube can be calculated by summing up nk n regular links and
number of jump-over links. When k is even, there are k n
2 node
rs connected with jump-over links, so there are nk n +
k n
2 = (n +
n links in total. Likewise, when k is odd, there are nk n +
(k −1) n
2
s altogether, of which
(k −1) n
2 is the number of jump-over links.
calculation can be summarized as below:
ks =
⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩
(n +
1
2
)∗ k n : kise v en
nk n +
(k − 1) n
2
: kisodd (11)
The number of links per server in NovaCube is ˜ N links ≤ n +
1 2 ,
ere it is n +
1 2 for even k and n +
(k −1) n
2 k n for odd k . Compara-
ly, the number of links in FatTree [1] is relative to the number
ports p on switches, which is N links = 3 p 3 /4, and the number of
s per server in FatTree is 3. Thus, when the dimension n ≤ 3,
cost-effectiveness of NovaCube ( n +
1 2 ) is almost the same as
Tree (3) or even better than FatTree for n ≤ 2. For example,
a 4096-node topology, FatTree uses 12288 links while NovaCube
10240 links for 2D topology and 14336 links for 3D topology.
reover, NovaCube is a switchless architecture, which can save
high expenditure of expensive switches and racks with related
ling costs. Therefore, drawn from the above analysis, NovaCube
be regarded as a cost-effective architecture for data centers.
ease cite this article as: T. Wang et al., Towards cost-effective and low l
tions (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
el approach, which avoids using energy hungry switches. Ac- 4
ding to the findings in previous studies [19,43,44] , the power 4
sumed by switches accounts for around 15% of total power 4
get. As a switchless architecture, NovaCube will save this 4
tion of power consumption. Besides, intuitively the cooling cost 4
ginally induced by cooling the heat emitted by the switches 4
l be saved as well. From this perspective, NovaCube can be 4
arded as an energy-efficient architecture. Moreover, if some 4
er levels of power saving techniques (e.g. power-aware routings, 4
rgy-efficient work placement and VM migration, energy-saving 4
dware) are employed to NovaCube, more power savings can be 4
ieved. 4
Routing scheme 4
This section presents the specially designed routing algorithms 4
ed PORA for NovaCube , which aims to help NovaCube achieve 4
maximum theoretical performance. PORA is a probabilistic 4
ighted oblivious routing algorithm. Besides, PORA is also live- 4
k and deadlock free. 4
PORA routing algorithm 4
tation 1. The distance between node A( a 1 , a 2 , . . . , a n ) and node 4
1 , b 2 , . . . , b n ) in the i -th dimension is denoted as �i = || a i − 4
1 . The distance between A and B is given as �AB =
∑ n i =1 �i . 4
Generally, the PORA procedure can be divided into two steps. 4
first step is to choose routing direction according to the given 4
bability, while the second step is to route within the designated 4
drant. Without loss of generality, for simplicity we use 2D No- 4
ube to illustrate PORA. 4
1. Direction determination 4
As shown in Fig. 4 , assume a packet needs to route from the 4
rce node S to the destination node D , then firstly it needs to de- 4
e the direction of its first hop. Since S has five neighbour nodes 4
S 2 , S 3 , S 4 , M , where M is its jump-over neighbour (although in 4
case of odd k some special nodes may have no jump-over links, 4
RA still works correctly), thus it has five directions to route the 4
ket. In order to choose the most beneficial next-hop, each di- 4
tion is assigned a probability based on the distance between 4
next-hop node and destination node. Then PORA chooses the 4
t-hop according to their probabilities, where the probabilistic 4
chanism can help PORA achieve a good load balancing. The nor- 4
lized probability function is given as below: 4
=
1 �2
i ∑ ψ
i =1 1
�2 i
(12)
ere ψ is the number of neighbour nodes of the source. Take 4
. 4 as an example, the distances between S’s neighbour nodes 4
destination node D are �S 1 D = 4, �S 2 D
= 4, �S 3 D = 6, �S 4 D
4
, �MD = 3, thus the probability of choosing S 1 as the next-hop 4
atency data center network architecture, Computer Communi-
T. Wang et al. / Computer Communications xxx (2016) xxx–xxx 7
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
+
,-
,-
nd
%, 492
to 493
ur 494
d- 495
x- 496
nd 497
by 498
499
500
fi- 501
er 502
ns 503
d- 504
er 505
a 506
ep 507
is 508
u- 509
ch 510
ult 511
512
et 513
m 514
he 515
as 516
ng 517
518
e 519
520
a- 521
h- 522
523
ke 524
R 525
he 526
nk 527
he 528
as 529
til 530
531
532
a- 533
ay 534
se 535
is 536
re 537
In 538
p, 539
he 540
d- 541
to 542
nt 543
A 544
545
546
ck 547
es 548
y- 549
he 550
o- 551
re 552
s- 553
d- 554
to 555
he 556
n- 557
is 558
ve 559
560
re 561
n- 562
st 563
ds 564
n- 565
er 566
he 567
he 568
n- 569
ly 570
ill 571
of 572
573
574
575
to 576
es 577
d. 578
ta 579
it- 580
et 581
o- 582
nt 583
by 584
as 585
ull 586
en 587
w 588
to 589
p- 590
plies an acknowledg ment mechanism to keep track of lost packets, 591
x
y
Quadrant I +,+
Quadrant II -,
Quadrant III +
Quadrant IV -
jump-over link
S1
S4
S3
S2
M
D
S
Fig. 4. PORA in an 8 × 8 NovaCube (for simplicity, not all wraparound links a
jump-over links are displayed).
is p S 1 =
1
�2 S 1 D ∑ 1
�2 i
= 21.43%, and likewise p S 2 = 21.43%, p S 3 = 9.524
p S 4 = 9.524%, p M
= 38.10%, respectively. Clearly, PORA prefers
choose the shorter path with a higher probability. Each neighbo
node (except the jump-over neighbour) corresponds to one qua
rant in the Cartesian coordinate system as shown in Fig. 4 . For e
ample, in Fig. 4 S 1 , S 2 , S 3 , S 4 correspond to Quadrant I, II, III, a
IV, respectively. The division of quadrants is only determined
the source node and destination node.
5.1.2. Routing within quadrant
There are two cases for this step. For the first, if Step 1
nally selects the regular neighbour node other than the jump-ov
neighbour as the next hop, then all the following routing decisio
towards the destination must be restricted within the correspon
ing quadrant. For the second, in case Step 1 chooses the jump-ov
neighbour node M (e.g. if it has a smaller distance thus with
higher probability to be chosen) as the next hop, then repeat St
1 to determine the quadrant by taking M as the source node. Th
time, in Step 1 PORA will only compute the probability of its reg
lar neighbours without considering its jump-over neighbour, whi
can avoid jumping back to the original source node that may res
in livelock issue.
Once the quadrant is determined, then PORA routes the pack
only within the chosen quadrant, where the routing mechanis
applied is also probabilistic. At each hop, PORA firstly check if t
jump-over link can be used. The jump-over hop can be taken
the candidate next-hop route if and only if it satisfies the followi
two requirements:
• The jump-over neighbour node is also located within the sam
quadrant.
• The distance between jump-over neighbour node and destin
tion node is smaller than the distance between regular neig
bour node and destination node.
If the requirements cannot be satisfied, then PORA will ta
the regular neighbour node as its next-hop using traditional DO
(Dimension-Ordered Routing [34] ) algorithm, which routes t
packet dimension by dimension. Otherwise, if the jump-over li
meets the requirements, then the next-hop is selected from t
jump-over node and DOR node according to the probability
computed in Eq. (12) . This process is repeated at each hop un
it reaches the destination.
Please cite this article as: T. Wang et al., Towards cost-effective and
cations (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
5.2. Livelock prevention
Livelock occurs when a packet is denied routing to its destin
tion forever even though it is never blocked permanently. It m
be travelling around its destination node, never reaching it becau
the channels it requires are always occupied by other packets. Th
can occur if non-greedy adaptive routing is allowed (packets a
misrouted, but are not able to get closer to the destination).
PORA, once the routing direction is determined at the first ste
each of the following hops of PORA will be restricted within t
selected quadrant. Moreover, the routing method within the qua
rant enables the packet to find its next hop, whose distance
the destination node is always smaller than that from the curre
node, which guarantees packet delivery. Thus, we claim that POR
is a livelock-free routing algorithm.
5.3. Deadlock-free implementation
As an another notorious problem in Torus networks, deadlo
is the situation where packets are allowed to hold some resourc
while requesting others, so that the dependency chain forms a c
cle. Then all these packets must wait forever without reaching t
destination, and the throughput will also collapse. The DOR alg
rithm is proven to be deadlock-free in a mesh network, since the
will be no message loop in the network. However, the Torus me
sage loops by itself, thus simply using DOR cannot prevent dea
lock. Virtual channels are proposed as a very effective means
prevent deadlock from happening. Virtual channels are used in t
loops in a network to cut the loop into two different logical cha
nels, so no cyclic dependency will be formed. Virtual channel
easy to implement by using multiple logical queues and effecti
in solving deadlock.
To prevent deadlock in our architecture, we first make su
that jump-over links cannot form loops in the routing. We e
sure that any jump-over links we choose in the quadrant mu
be nearer than the previous hop and regular Torus links towar
the destination; otherwise, regular Torus links are used. This e
sures that the packet will never jump back through the jump-ov
links. Then, we use the DOR routing to prevent the deadlock in t
mesh sub-network. Finally, if the packets have to pass through t
wraparound links in the Torus network, we use two virtual cha
nels to cut a Torus loop into two different logical paths. Thus, on
two virtual channels are needed in each direction, which is st
cost-effective, since the hardware cost increases as the number
virtual channels increases.
6. Flow control
6.1. Credit-based flow control mechanism
Flow control, or known as congestion control, is designed
manage the rate of data transmission between devices or nod
in a network to prevent the network buffers being overwhelme
Too much data arrives exceeding the device capacity results in da
overflow, meaning the data is either lost or must be retransm
ted. Thus, the main objective of flow control is to limit pack
delay and avoid buffer overflow. In traditional Internet, the pr
tocols with flow control functionality like TCP usually impleme
the speed matching between fast transmitter and slow receiver
packet discarding and retransmission. More specifically, a node h
no alternative but to drop the packet when its buffer space is f
(in some special scenarios, packets may be discarded even wh
the buffer is still available if the packet priority is low or the flo
has occupied more than their fair share of resources regarding
QoS restrictions). After packet loss occurs, the network then a
low latency data center network architecture, Computer Communi-
8 T. Wang et al. / Computer Communications xxx (2016) xxx–xxx
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
OQ
Cred
its
OQ
CreditsOQ
Credits
OQ
Credits
Data
Credits
Intermediate Node
Data
Credits
End Node
Data
Credits
Intermediate Nodes
Fig. 5. The credit based flow co
and the sender carries out fast retransmission after receiving three 592
duplicate ACKs or go back to slow start phase after a reasonable 593
timeout RTO (around 200 ms). However, this kind of packet dis- 594
carding based flow control is unsuitable in the Torus based latency 595
sensitive data center network because of its relatively long rout- 596
ing path with high number of hops. For example, if the congestion 597
point is far from the sender (e.g. the TCP incast usually happens 598
at last hop), then its previous long time transmission will be of 599
waste and the retransmission (or even timeout to slow start) not 600
only brings additional latency but also increases network overhead 601
wh602
603
ver604
bas605
pac606
in 607
sta608
nel609
sen610
res611
612
fro613
spo614
of 615
pac616
of 617
one618
to 619
nod620
one621
sio622
a b623
ma624
saf625
am626
exc627
of-628
tra629
diff630
in-631
hav632
is a633
the634
the635
fee636
6.2637
638
ber639
sw640
ach641
stru642
con643
in 44
abi 45
cre 46
firs 47
cur 48
imp 49
50
por 51
stre 52
stre 53
num 54
for 55
the 56
Bes 57
put 58
usi 59
out 60
ing 61
pac 62
the 63
one 64
out 65
ets 66
int 67
acc 68
for 69
7. G 70
7.1. 71
72
vaC 73
app 74
phy 75
the 76
wh 77
wit 78
rou 79
lay 80
dre 81
Nov 82
con 83
Pl
ca
ich may result in a more congested network.
Ideally, a lossless transport protocol is desired. However, it is
y difficult to guarantee zero packet loss using sliding window
ed flow control. Based on this observation, similar to [45] a
ket lossless credit based flow control mechanism is adopted
NovaCube . The key principle is that each node maintains buffer
te of its direct downlink neighbour node for each virtual chan-
, and only if its downstream node has available space, the
der could get some certain number of credits and transfer cor-
ponding amount of packets.
As illustrated in Fig. 5 , a prerequisite for one node to send data
m its output queue (OQ) to its next hop node is that its corre-
nding output port must have enough credits. The default value
credits maintained on one output port equals to the number of
kets that can be accepted by its downstream node. The value
credits will be decreased by one whenever its port sends out
packet. Likewise, once the downstream node has new room
accept a new packet, it will send one credit to its upstream
e whose relevant port correspondingly increases its credits by
. Usually, there is a very small delay between packet transmis-
n and credit feedback, thus the downstream node should have
it larger buffer than expected to avoid overflow and achieve
ximum throughput, or the downstream node can reserve some
ety space when granting credits to its upstream node, for ex-
ple, to set a threshold (e.g. 80% of total space) that cannot be
eeded. The credit can be transmitted either in-band or out-
band, where in-band means packets and credits feedback are
nsmitted over the same channel while out-of-band uses two
erent channels to transmit packets and credits separately. The
band feedback is more complicated in implementation but be-
es more cost-effective. Comparatively, the out-of-band fashion
n expensive way but easier to be implemented. Nevertheless,
se two possible methods can achieve the same effect. Thus, for
sake of simplicity, in our simulations we implement the credit
dback using out-of-band signal.
. Internal structure of nodes
Compared to the traditional network, in NovaCube the num-
of ports on each node is relatively small, thus output queue
itching mechanism can be applied, which not only can help
ieve better performance but also can simplify the internal
cture of nodes. However, it is difficult to implement the flow
trol adopting output queue switching mechanism. As illustrated
ease cite this article as: T. Wang et al., Towards cost-effective and low l
tions (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
Intermediate Node
OQ
Cred
its
OQ
CreditsOQ
Credits
OQ
Credits
Data
Credits
End Node
Data
Credits
Intermediate Nodes
ntrol in NovaCube .
Fig. 6. Internal structure of NovaCube node.
Fig. 5 , the credits of a upstream node indicate the queue avail- 6
lity of its downstream node. In order to determine the value of 6
dits that downstream node can accept, the upstream node must 6
t determine the output port of the downstream node that the 6
rent packet will go through, which increases the difficulty in 6
lementation. 6
In response to this issue, an input queue is introduced at each 6
t as buffering space, as illustrated Fig. 6 . The credits of the up- 6
am node denote the available space of input queue in down- 6
am node. Each output queue assigns each input queue a certain 6
ber of credits, named as internal credits. Each input queue can 6
ward its packets to the corresponding output queue as long as 6
input queue has enough assigned credits for this output queue. 6
ides, each output queue can receive packets from multiple in- 6
queues. In fact, if the input/output queues are implemented 6
ng centralized shared memory, then the division of input and 6
put queues is merely a logical thing, and the packet schedul- 6
from input queue to output queue is just an action of moving 6
ket pointers. As for the issue of Head of Line (HoL) bocking, all 6
input queues can be organized as a shared virtual queue, and 6
virtual input queue can forward certain number of packets to 6
put queue if it has corresponding number of credits. Once pack- 6
of an output queue are transmitted to the next hop node, the 6
ernal credits of its corresponding input queue will be increased 6
ordingly so that the packets buffered in the input queue can be 6
warded to this output queue properly. 6
eographical address assignment mechanism 6
Network layering 6
Similar to the traditional internet, the protocol stack of No- 6
ube is also divided into five abstraction layers which are 6
lication layer, transport layer, network layer, link layer and 6
sical layer. The only difference lies in the network layer, where 6
traditional internet uses IP address to locate different hosts 6
ile NovaCube uses coordinates to direct the data transmission 6
h the benefit of topology’s symmetry, which can improve the 6
ting efficiency greatly. Except for network layer, the other 6
ers are kept the same without any changes. However, IP ad- 6
sses must be provided when creating TCP/UDP sockets, yet 6
aCube only has coordinates. Thus, an adaptation layer, which 6
verts coordinates to IP address format, is need at the network 6
atency data center network architecture, Computer Communi-
T. Wang et al. / Computer Communications xxx (2016) xxx–xxx 9
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
ApplicationsApplica�ons
UDP/TCPUDP/TCP
Coordinate AddressingProtocol (CAP)
Coordinate Addressing Protocol (CAP)
Ethernet...
ApplicationsApplica�ons
UDP/TCPUDP/TCP
IP...IP...
Ethernet...Ethernet...
unmodified
unmodified
outingg. POWou�ng . POW IP routing protocolsIP rou�ng protocols
Applica�on
Transport
Network
unmodified
)PI/PCT(tenretnIebuCavoN
Physical mediumPhysical mediumunmodified
be network abstraction layers.
Dimension flagxyz
nated
ep 684
th 685
es 686
687
688
nt 689
es. 690
rk 691
ss 692
ve 693
on 694
is 695
n- 696
p- 697
to 698
is 699
w- 700
to 701
is- 702
1”703
ns 704
be 705
of 706
e- 707
ce 708
ed 709
710
711
nd 712
ng 713
nd 714
lt 715
he 716
at 717
ull 718
e, 719
on 720
er 721
722
723
all 724
ks, 725
its 726
D 727
D 728
PL 729
- 730
o- 731
in 732
se 733
of 734
735
736
ch 737
lly 738
se 739
he 740
- Q5 741
lar 742
Ethernet...
Coordinate rprotocols, e.Coordinate rprotocols, e.g
Link
Physical mediumPhysical mediumPhysical
Fig. 7. The NovaCu
abc
Fig. 8. Coordi
layer. In this way, the transport layer and its upper layers ke
unmodified, so that NovaCube network can be compatible wi
legacy TCP/IP protocols and the TCP/IP based application servic
can run without any changes ( Fig. 7 ) .
7.2. Coordinate based geographical address assignment
An address translation mechanism is designed to impleme
the convention between IPv4 address and NovaCube coordinat
As illustrated in Fig. 8 , in order to support a NovaCube netwo
with maximum 6 dimensions, the traditional 32-bit IPv4 addre
is divided into seven segments consisting of six pieces with fi
bits and one piece with two bits. The coordinate of each dimensi
is denoted by the five-bit slice, and the remaining two-bit slice
a dimension flag which is used to indicate the number of dime
sions of the network. In this way, a 32-bit IPv4 address can su
port up to six dimensions, where a 6-D NovaCube can hold up
2 30 = 1,073,741,824 (1 billion) servers, thus this kind of division
reasonable and adequate even for a large scale data center. Ho
ever, normally the two-bit dimension flag only can support up
2 2 = 4 dimensions other than six dimensions. In response to this
sue, here we define that only the dimension flag with binary “1
indicates a 6D network address. When the number of dimensio
is less than six, the address space of last dimension will not
used. Therefore, when the dimension flag is “10”, we make use
the first three bits of the last dimension’s address space to repr
sent the specific dimension. The rule of dimension corresponden
is illustrated in Table 1 , and other values are currently consider
illegal.
8. Evaluation
In this section, we evaluate the performance of NovaCube a
PORA routing algorithm under various network conditions by usi
network simulator 3 (NS-3) [46] . The link bandwidth is 1 Gbps a
each link is capable of bidirectional communications. The defau
maximum transmission unit (MTU) of a link is 1500 bytes. T
propagation delay of a link and the processing time for a packet
a node are set to be 4 μ s and 1.5 μs , respectively. Besides, Weib
Please cite this article as: T. Wang et al., Towards cost-effective and
cations (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
address assignment.
Table 1
The representation of different dimensions.
Dimension flag
(binary)
First 3 bits of c
(binary)
Dimension number
(decimal)
11 xxx 6
10 101 5
10 100 4
10 011 3
10 010 2
10 001 1
Distribution is adopted to determine the packet inter-arrival tim
and a random permutation traffic matrix is used in our simulati
where each node sends/receives traffic to/from exactly one oth
server.
8.1. Average path length
One of the biggest advantages of NovaCube resides in its sm
average path length (APL). With the benefit of jump-over lin
the APL of routing in Torus is significantly reduced. Fig. 9 exhib
the simulation results of average path length using PORA in 2
NovaCube and DOR (known to be a shortest path routing) in 2
Torus. The result reveals that PORA indeed achieves a smaller A
in NovaCube than DOR in regular Torus, where a smaller APL im
plies a lower network latency. Even the network diameter of N
vaCube is also slightly smaller than the achieved APL by DOR
Torus. Moreover, the APL achieved by PORA is already very clo
to the theoretical analysis, which demonstrates the optimality
PORA.
8.2. Network latency
Generally, the network latency consists of queuing delay at ea
hop, transmission delay and propagation delay. In order to actua
evaluate the overall packet delivery delay in the network, we u
the global packet lifetime (the time from packet’s generation to t
arrival at its destination) to measure the network latency. We sim
ulated the network latency in different sized NovaCube and regu
low latency data center network architecture, Computer Communi-
10 T. Wang et al. / Computer Communications xxx (2016) xxx–xxx
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
4 6 8 10 12 14 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16PORA
Diameter of NovaCube
DOR
Diameter of Torus
Radix k
AveragePathLength
Fig. 9. The average path length using PORA and DOR.
30
35
40
45
3D NovaCube
3D Torus
econds)
Tor743
ser744
res745
late746
8.3747
748
ity 749
wid750
the751
Tor752
cre753
thr754
8.4755
756
the757
cau758
pat759
pat760
( k =761
Tor762
tha763
rat764
4 5 6 7 8 9 10
200
300
400
500
600
700
800
900
3D NovaCube
3D Torus
Radix k
AverageThroughput(MBps)
Fig. 11. The throughput of different sized NovaCube and Torus.
0% 10% 20% 30% 40% 50% 60% 70%
4
6
8
10
12
14
NovaCube - Link Failure Only
Torus - Link Failure Only
NovaCube - Node Failure Only
Torus - Node Failure Only
Failure Ratio
AveragePathLength
Fig. 12. The average path length under different failures.
Tab
sCo
N
Av
La
Th
dem 65
Tor 66
imp 67
8.5 68
69
tag 70
rou 71
if t 72
hig 73
for 74
up 75
net 76
res 77
and 78
thr 79
Pl
ca
4 5 6 7 8 9 10
10
15
20
25
Radix k
Latency(micros
Fig. 10. The network latency of different sized NovaCube and Torus.
us networks, varying from k = 4 (64 servers) to k = 10 (10 0 0
vers), as shown in Fig. 10 . It can be seen from the simulation
ults that compared to the regular Torus network the network
ncy of NovaCube network is reduced by around 40%.
. Throughput
The throughput can be used to measure the network capac-
of an architecture, and it is usually limited by bisection band-
th and also impacted by the routing algorithm. Fig. 11 shows
achieved average throughput in different sized NovaCube and
us network. The result reveals that the average throughput de-
ases with the increase of network size. NovaCube improves the
oughput of regular Torus network by up to 90%.
. Fault tolerance
The rich connectivity of Torus-based topologies guarantees
network with high reliability. The node/link failures unlikely
se network disconnections, only may lead to a higher routing
h length. Fig. 12 exhibits the simulation results of the average
h lengths under different kinds of failure ratios in 512-server
8) NovaCube network using PORA routing algorithm and
us network using DOR routing algorithm. The results show
t the average path length increases as the link/node failure
io increases. NovaCube network with a richer connectivity
ease cite this article as: T. Wang et al., Towards cost-effective and low l
tions (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
le 2
mparison between 2D NovaCube and 3D NovaCube.
ovaCube 64-server 729-server
2D 3D 2D 3D
erage path length 3.06 1.82 9.46 4.17
tency (m s) 16.12 10.03 49.87 22.92
roughput (MBps) 616.28 823.42 355.31 447.92
onstrates a better performance in fault tolerance than regular 7
us. Another finding is that the node failure has a slightly higher 7
act on the average routing path length. 7
. Comparison between 2D and 3D NovaCube 7
NovaCubes with different dimensions have different advan- 7
es. A lower dimensional NovaCube has lower wiring and 7
ting complexity, and can be easier to be constructed. However, 7
he network size is large, it is better to be constructed with 7
her dimensions, and the average path length will also be lower 7
higher dimensional NovaCube. The network can easily scale 7
with a higher dimension, thus NovaCube can well support 7
work’s future expansion. Table 2 gives some simple simulation 7
ults about the performance comparison between 2D NovaCube 7
3D Novacube with respect to average path length, latency and 7
oughput. As it can be seen, for the same sized network, 3D 7
atency data center network architecture, Computer Communi-
T. Wang et al. / Computer Communications xxx (2016) xxx–xxx 11
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
NovaCube achieves lower average path length, lower latency and 780
higher throughput than 2D NovaCube. Therefore, if not consider 781
the wiring and routing complexity, a higher dimensional NovaCube 782
is a better choice. 783
9. Conclusion 784
In this paper, we proposed a novel data center architecture 785
named NovaCube , and presented its design and key properties. As a 786
switchless architecture, NovaCube ’s cost-effectiveness is highlighted 787
with regard to its energy consumption and infrastructure cost. As 788
proved, NovaCube is also superior to other candidate architectures 789
in terms of network diameter, throughput, average path length, 790
bisection bandwidth, path diversity and fault tolerance. Further- 791
more, the specially designed probabilistic weighted oblivious rout- 792
ing algorithm PORA helps NovaCube achieve near-optimal average 793
path length with better load balancing which can result in a better 794
throughput. Moreover, PORA is also free of livelock and deadlock. 795
The simulation results further prove the good performance of No- 796
vaCube . 797
Acknowledgment 798
This research has been supported by a Grant from Huawei Tech- 799
nologies Co., Ltd. The authors also would like to express their 800
thanks and gratitudes to the anonymous reviewers whose con- 801
structive comments helped improve the manuscript. 802
References 803
[1] M. Al-Fares, A. Loukissas, A. Vahdat, A scalable, commodity data center net- 804 work architecture, in: Proceedings of the ACM SIGCOMM Computer Communi- 805 cation Review, vol. 38, ACM, 2008, pp. 63–74. 806
[2] A. Greenberg, J.R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D.A. Maltz, 807 P. Patel, S. Sengupta, VL2: a scalable and flexible data center network, SIG- 808 COMM Comput. Commun. Rev. 39 (4) (2009) 51–62. 809
[3] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, S. Lu, DCell: A scalable and fault-tolerant 810 network structure for data centers, ACM SIGCOMM Comput. Commun. Rev. 38 811
812 be: 813 en- 814
815 ch, 816 of 817 10, 818
819 in- 820 ec- 821 ter 822
823 or- 824 of 825 14, 826
827 er- 828 79 829
830 ing 831 10) 832
833 the 834
835 sed 836 bal 837
838 aid 839 15 840 9–841
842 hi- 843 1–844
845
[14] Z. Guo, Z. Duan, Y. Xu, H.J. Chao, Cutting the electricity cost of distributed 846 datacenters through smart workload dispatching, IEEE Commun. Lett. 17 (12) 847 (2013) 2384–2387. 848
[15] Z. Guo, Z. Duan, Y. Xu, H.J. Chao, Jet: electricity cost-aware dynamic workload 849 management in geographically distributed datacenters, Comput. Commun. 50 850 (2014) 162–174. 851
[16] L. Rao, X. Liu, L. Xie, W. Liu, Minimizing electricity cost: optimization of dis- 852 tributed internet data centers in a multi-electricity-market environment, in: 853 Proceedings of the INFOCOM, 2010, IEEE, 2010, pp. 1–9. 854
[17] T. Wang, Y. Xia, J. Muppala, M. Hamdi, S. Foufou, A general framework for 855 performance guaranteed green data center networking, in: Proceedings of 856 the 2014 IEEE Global Communications Conference (GLOBECOM), IEEE, 2014, 857 pp. 2510–2515. 858
[18] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, 859 N. McKeown, Elastictree: saving energy in data center networks., in: Proceed- 860 ings of the NSDI, vol. 3, 2010, pp. 19–21. 861
[19] T. Wang, B. Qin, Z. Su, Y. Xia, M. Hamdi, S. Foufou, R. Hamila, Towards band- 862 width guaranteed energy efficient data center networking, J. Cloud Comput. 4 863 (1) (2015) 1–15. 864
[20] V. Mann, A. Kumar, P. Dutta, S. Kalyanaraman, Vmflow: leveraging vm mo- 865 bility to reduce network power costs in data centers, in: Proceedings of the 866 NETWORKING 2011, Springer, 2011, pp. 198–211. 867
[21] D. Meisner, B.T. Gold, T.F. Wenisch, Powernap: eliminating server idle power, 868 in: Proceedings of the ACM Sigplan Notices, vol. 44, ACM, 2009, pp. 205– 869 216. 870
[22] J. Leverich, M. Monchiero, V. Talwar, P. Ranganathan, C. Kozyrakis, Power man- 871 agement of datacenter workloads using per-core power gating, Comput. Archit. 872 Lett. 8 (2) (2009) 48–51. 873
[23] H. David, C. Fallin, E. Gorbatov, U.R. Hanebutte, O. Mutlu, Memory power man- 874 agement via dynamic voltage/frequency scaling, in: Proceedings of the 8th 875 ACM international conference on Autonomic computing, ACM, 2011, pp. 31–40. 876
[24] Y. Xia, T. Wang, Z. Su, M. Hamdi, Fine-grain power control for combined input- 877 crosspoint queued switches, in: Proceedings of the Globecom, IEEE, 2014. 878
[25] D. Abts, M.R. Marty, P.M. Wells, P. Klausler, H. Liu, Energy proportional data- 879 center networks, in: Proceedings of the ACM SIGARCH Computer Architecture 880 News, vol. 38, ACM, 2010, pp. 338–347. 881
[26] U. Lee, I. Rimac, V. Hilt, Greening the internet with content-centric networking, 882 in: Proceedings of the 1st International Conference on Energy-Efficient Com- 883 puting and Networking, ACM, 2010, pp. 179–182. 884
[27] K. Guan, G. Atkinson, D.C. Kilper, E. Gulsen, On the energy efficiency of con- 885 tent delivery architectures, in: Proceedings of the 2011 IEEE International Con- 886 ference on Communications Workshops (ICC), IEEE, 2011, pp. 1–6. 887
[28] Í. Goiri, K. Le, T.D. Nguyen, J. Guitart, J. Torres, R. Bianchini, Greenhadoop: 888 leveraging green energy in data-processing frameworks, in: Proceedings of 889 the 7th ACM European Conference on Computer Systems, ACM, 2012, pp. 57– 890 70. 891
[29] M. Arlitt, C. Bash, S. Blagodurov, Y. Chen, T. Christian, D. Gmach, C. Hyser, 892 N. Kumari, Z. Liu, M. Marwah, et al., Towards the design and operation of net- 893 zero energy data centers, in: Proceedings of the 2012 13th IEEE Intersociety 894 Conference on Thermal and Thermomechanical Phenomena in Electronic Sys- 895 tems (ITherm), IEEE, 2012, pp. 552–561. 896
[30] R. Brightwell, K. Pedretti, K.D. Underwood, Initial performance evaluation of 897 the cray seastar interconnect, in: Proceedings of the 13th Symposium on High 898 Performance Interconnects, 20 05, IEEE, 20 05, pp. 51–57. 899
[31] R. Alverson, D. Roweth, L. Kaplan, The gemini system interconnect, in: Pro- 900 ceedings of the 2010 IEEE 18th Annual Symposium on High Performance In- 901 terconnects (HOTI), IEEE, 2010, pp. 83–87. 902
[32] N.R. Adiga, M.A. Blumrich, D. Chen, P. Coteus, A. Gara, M.E. Giampapa, P. Hei- 903 delberger, S. Singh, B.D. Steinmacher-Burow, T. Takken, et al., Blue gene/l torus 904 interconnection network, IBM J. Res. Develop. 49 (2.3) (2005) 265–276. 905
[33] D. Chen, N.A. Eisley, P. Heidelberger, R.M. Senger, Y. Sugawara, S. Kumar, 906 V. Salapura, D. Satterfield, B. Steinmacher-Burow, J. Parker, The IBM blue 907 gene/q interconnection fabric, Micro, IEEE 32 (1) (2012) 32–43. 908
[34] W.J. Dally, H. Aoki, Deadlock-free adaptive routing in multicomputer networks 909 5. 910 in: 911 ut- 912
913 ro- 914 nd 915
916 ed 917
ual 918 9– 919
920 ck- 921 is- 922
923 age 924 rk, 925
926 or- 927
928 et- 929 u- 930
931
(4) (2008) 75–86.
[4] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, S. Lu, Bcua high performance, server-centric network architecture for modular data c
ters, ACM SIGCOMM Comput. Commun. Rev. 39 (4) (2009) 63–74. [5] G. Wang, D. Andersen, M. Kaminsky, K. Papagiannaki, T. Ng, M. Kozu
M. Ryan, c-Through: Part-time optics in data centers, in: P roceedings
the ACM SIGCOMM Computer Communication Review, vol. 40, ACM, 20pp. 327–338.
[6] N. Farrington, G. Porter, S. Radhakrishnan, H. Bazzaz, V. Subramanya, Y. Faman, G. Papen, A. Vahdat, Helios: a hybrid electrical/optical switch archit
ture for modular data centers, in: Proceedings of the ACM SIGCOMM CompuCommunication Review, vol. 40, ACM, 2010, pp. 339–350.
[7] T. Wang, Z. Su, Y. Xia, Y. Liu, J. Muppala, M. Hamdi, Sprintnet: a high perf
mance server-centric network architecture for data centers, in: Proceedingsthe 2014 IEEE International Conference on Communications (ICC), IEEE, 20
pp. 4005–4010. [8] T. Wang, Z. Su, Y. Xia, J. Muppala, M. Hamdi, Designing efficient high p
formance server-centric data center network architecture, Comput. Netw. (2015) 283–296.
[9] H. Abu-Libdeh, P. Costa, A. Rowstron, G. O’Shea, A. Donnelly, Symbiotic rout
in future data centers, ACM SIGCOMM Comput. Commun. Rev. 40 (4) (2051–62.
[10] J.-Y. Shin, B. Wong, E.G. Sirer, Small-world datacenters, in: Proceedings of 2nd ACM Symposium on Cloud Computing, ACM, 2011, p. 2.
[11] T. Wang, Z. Su, Y. Xia, B. Qin, M. Hamdi, Novacube: a low latency torus-banetwork architecture for data centers, in: Proceedings of the 2014 IEEE Glo
Communications Conference (GLOBECOM), IEEE, 2014, pp. 2252–2257. [12] T. Wang, Z. Su, Y. Xia, M. Hamdi, Clot: a cost-effective low-latency overl
torus-based network architecture for data centers, in: Proceedings of the 20
IEEE International Conference on Communications (ICC), IEEE, 2015, pp. 5475484.
[13] T. Wang, Z. Su, Y. Xia, M. Hamdi, Rethinking the data center networking: arctecture, network protocols, and resource sharing, Access, IEEE 2 (2014) 148
1496.
Please cite this article as: T. Wang et al., Towards cost-effective and
cations (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
using virtual channels, IEEE Trans. Parallel Distrib. Syst. 4 (4) (1993) 466–47[35] L.G. Valiant, G.J. Brebner, Universal schemes for parallel communication,
Proceedings of the Thirteenth Annual ACM Symposium on Theory of Comp
ing, ACM, 1981, pp. 263–277. [36] T. Nesson, S.L. Johnsson, Romm routing on mesh and torus networks, in: P
ceedings of the Seventh Annual ACM Symposium on Parallel Algorithms aArchitectures, ACM, 1995, pp. 275–287.
[37] A. Singh, W.J. Dally, B. Towles, A.K. Gupta, Locality-preserving randomizoblivious routing on torus networks, in: Proceedings of the Fourteenth Ann
ACM Symposium on Parallel Algorithms and Architectures, ACM, 2002, pp.
13. [38] L. Gravano, G.D. Pifarre, P.E. Berman, J.L. Sanz, Adaptive deadlock-and livelo
free routing with all minimal paths in torus networks, IEEE Trans. Parallel Dtrib. Syst. 5 (12) (1994) 1233–1251.
[39] N. Farrington, E. Rubow, A. Vahdat, Data center switch architecture in the of merchant silicon, in: Proceedings of the IEEE Hot Interconnects, New Yo
2009.
[40] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, Mgan Kaufmann, 2004.
[41] Y. Zhang, N. Ansari, Hero: hierarchical energy optimization for data center nworks, in: Proceedings of the 2012 IEEE International Conference on Comm
nications (ICC), IEEE, 2012, pp. 2924–2928.
low latency data center network architecture, Computer Communi-
12 T. Wang et al. / Computer Communications xxx (2016) xxx–xxx
ARTICLE IN PRESS
JID: COMCOM [m5G; March 12, 2016;9:50 ]
[42]932 933 934
[43]935 936 937
[44] 38 39 40
[45] 41 42
[46] 43
Pl
ca
Y. Zhang, N. Ansari, On architecture design, congestion notification, TCP in- cast and power consumption in data centers, Commun. Surv. Tutor. IEEE 15
(1) (2013) 39–64. A. Greenberg, J. Hamilton, D. Maltz, P. Patel, The cost of a cloud: research prob-
lems in data center networks, ACM SIGCOMM Comput. Commun. Rev. 39 (1) (2008) 68–73.
ease cite this article as: T. Wang et al., Towards cost-effective and low l
tions (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016
S. Pelley, D. Meisner, T.F. Wenisch, J.W. VanGilder, Understanding and abstract- 9ing total data center power, in: Proceedings of the Workshop on Energy- 9Efficient Design, 2009. 9
H. Kung, R. Morris, Credit-based flow control for ATM networks, IEEE Netw. 9 9(2) (1995) 40–48. 9
http://www.nsnam.org . 9
atency data center network architecture, Computer Communi-