article in press - ایران عرضهiranarze.ir/wp-content/uploads/2016/11/e682.pdf · article in...

ARTICLE IN PRESS

JID: COMCOM [m5G; March 12, 2016;9:50 ]

Computer Communications xxx (2016) xxx–xxx

Contents lists available at ScienceDirect

Computer Communications

journal homepage: www.elsevier.com/locate/comcom

Towards cost-effective and low latency data center network

architecture

Ting Wang

a , 1 , ∗, Zhiyang Su

b , 2 , Yu Xia

c , 3 , Bo Qin

b , 4 , Mounir Hamdi d , 5 Q1

a Hong Kong University of Science and Technology, Hong Kong Q2

Q3

b Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong c College of Computer Science, Sichuan Normal University, China d College of Science, Engineering and Technology in Hamad bin Khalifa University, Qatar

a r t i c l e i n f o

Article history:

Received 17 September 2015

Revised 25 February 2016

Accepted 27 February 2016

Available online xxx

Keywords:

Data center networks

Architecture

Torus topology

Probabilistic weighted routing

Deadlock-free

a b s t r a c t

This paper presents the design, analysis, and implementation of a novel data center network architecture,

named NovaCube . Based on regular Torus topology, NovaCube is constructed by adding a number of most

beneficial jump-over links, which offers many distinct advantages and practical benefits. Moreover, in or-

der to enable NovaCube to achieve its maximum theoretical performance, a probabilistic oblivious routing

algorithm PORA is carefully designed. PORA is a both deadlock and livelock free routing algorithm, which

achieves near-optimal performance in terms of average routing path length with better load balancing

thus leading to higher throughput. Theoretical derivation and mathematical analysis together with exten-

sive simulations further prove the good performance of NovaCube and PORA.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction 1

The data center network (DCN) 6 architecture is regarded as one 2

of the most important determinants of network performance in 3

data centers, and plays a significant role in meeting the require- 4

ments of could-based services as well as the agility and dynamic 5

reconfigurability of the infrastructure for changing application de- 6

mands. As a result, many novel proposals, such as Fat-Tree [1] , 7

VL2 [2] , DCell [3] , BCube [4] , c-Through [5] , Helios [6] , SprintNet 8

[7,8] , CamCube [9] , Small-World [10] , NovaCube [11] , CLOT [12] , 9

and so on, have been proposed aiming to efficiently interconnect 10

the servers inside a data center to deliver peak performance to 11

users. 12

Generally, DCN topologies can be classified into four categories: 13

multi-rooted tree-based topology (e.g. Fat-Tree), server-centric 14

∗ Corresponding author. Tel.: +86 13681755836. Q4

E-mail addresses: [email protected] (T. Wang), [email protected] (Z. Su),

[email protected] (Y. Xia), [email protected] (B. Qin), [email protected] (M. Hamdi). 1 Student Member, IEEE 2 Student Member, IEEE 3 Member, IEEE 4 Student Member, IEEE 5 Fellow, IEEE 6 The term “data center network” and “DCN” are used interchangeably in this

paper.

topology (e.g. DCell, BCube, SprintNet), hybrid network (e.g. c- 15

Through, Helios) and direct network (e.g. CamCube, Small-World) 16

[13] . Each of these has their advantages and disadvantages. 17

Tree-based topologies, like FatTree and Clos, can provide full 18

bisection bandwidth, thus the any-to-any performance is good. 19

However, their building cost and complexity is relatively high. 20

The recursive-defined server-centric topology usually concentrates 21

on the scalability and incremental extensibility with a lower 22

building cost; however, the full bisection bandwidth may not be 23

achieved and their performance guarantee is only limited to a 24

small scope. The hybrid network is a hybrid packet and circuit 25

switched network architecture. Compared with packet switching, 26

optical circuit switching can provide higher bandwidth and lower 27

latency in transmission with lower energy consumption. However, 28

optical circuit switching cannot achieve full bisection bandwidth at 29

packet granularity. Furthermore, the optics also suffers from slow 30

switching speed which can take as high as tens of milliseconds. 31

The direct network topology, which directly connects servers 32

to other servers, is a switchless network interconnection without 33

any switches, or routers. It is usually constructed in a regular pat- 34

tern, such as Torus (as show in Fig. 1 ). Besides being widely used 35

in high-performance computing systems, Torus is also an attrac- 36

tive network architecture candidate for data centers. However, this 37

design suffers consistently from poor routing efficiency compared 38

with other designs due to its relatively long network diameter (the 39

maximum shortest path between any node pairs), which is known 40

http://dx.doi.org/10.1016/j.comcom.2016.02.016

0140-3664/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: T. Wang et al., Towards cost-effective and low latency data center network architecture, Computer Communi-

cations (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016


http://www.ScienceDirect.com

http://www.elsevier.com/locate/comcom

mailto:[email protected]







2 T. Wang et al. / Computer Communications xxx (2016) xxx–xxx

ARTICLE IN PRESS


1-D Torus (4-ary 1-cube) 2D Torus (4-ary 2-cube) 3D Torus (3-ary 3-cube)

Fig. 1. Examples of 1D, 2D, 3D Torus topologies.

as ⌊

k 2

⌋n for a n -D Torus 7 with radix k . Besides, a long network di- 41

ameter may also lead to high communication delay. Furthermore, 42

its performance largely depends on the routing algorithms. 43

In order to deal with these imperfections, in this paper we pro- 44

pose a novel container level high-performance Torus-based DCN 45

architecture named NovaCube . The key design principle of NovaC- 46

ube is to connect the farthest node pairs by adding additional 47

jump-over links. In this way, NovaCube can halve the network 48

diameter and receive higher bisection bandwidth and through- 49

put. Moreover, we design a new weighted probabilistic oblivi- 50

ous deadlock-free routing algorithm PORA for NovaCube , which 51

achieves low average routing path length and good load-balancing 52

by exploiting the path diversities. 53

The primary contributions of this paper can be summarized as 54

follows: 55

(1) We propose a novel Torus-based DCN architecture NovaCube , 56

which exhibits good performance in network latency, bisec- 57

tion bandwidth, throughput, path diversity and average path 58

length. 59

(2) We carefully design a weighted probabilistic oblivious rout- 60

ing algorithm PORA, which is both deadlock-free and 61

62

63

(64

65

(66

67

68

(69

70

71

72

73

rev74

dem75

is 76

gor77

cre78

ogr79

sys80

clu81

2. 82

2.1.83

84

ity 85

7

cha

creases the communication efficiency. Besides being widely used 86

in supercomputing, Torus network has also been introduced to 87

the data center networks. Three typical representatives are namely 88

CamCube [9] , Small-World [10] and CLOT [12] . 89

CamCube was proposed by Abu-Libdeh et al., and the servers 90

in CamCube are interconnected in a 3D Torus topology. The Cam- 91

Cube is designed target to shipping container-sized data centers, 92

and is a server-only switchless network design. With the benefit of 93

Torus architecture and the flexibility offered by a low-level link ori- 94

entated CamCube API, CamCube allows applications to implement 95

their own routing protocols so as to achieve better application- 96

level performance. However, as aforementioned this design based 97

on the regular Torus suffers long average routing path – O ( N

1/3 ) 98

hops, with N servers, which results in poor routing efficiency. 99

In order to overcome this limitation, Shin Ji-Yong, et al. pro- 100

posed Small-World, which provides an unorthodox random data 101

center network topology. It is constructed based on some regular 102

topologies (such as ring, Torus or cube) with the addition of a large 103

number of random links which can help reduce the network di- 104

ameter and achieve higher routing efficiency. The degree of each 105

node in Small-World is limited to six, taking realistic deployment 106

and 07

me 08

geo 09

me 10

poo 111

12

top 13

to 14

per 15

Tor 16

end 117

diff 18

con 19

CLO 20

len 21

Pl

ca

livelock-free, and helps NovaCube achieve good load balanc-

ing.

3) We introduce a credit-based lossless flow control mecha-

nism in NovaCube network.

4) We design a practical geographical address assignment

mechanism, which also can be applied to the traditional

Torus network.

5) We implement NovaCube architecture, PORA routing algo-

rithm and the flow control mechanism in NS3. Extensive

simulations are conducted to demonstrate the good perfor-

mance of NovaCube .

The rest of the paper is organized as follows. First, we briefly

iew the related research literature in Section 2 . Then Section 3

onstrates the motivation. In Section 4 , NovaCube architecture

introduced and analyzed in detail. Afterwards, the routing al-

ithm PORA is designed in Section 5 . Section 6 introduces the

dit-based flow control mechanism. Section 7 demonstrates a ge-

aphical address assignment mechanism. Section 8 presents the

tem evaluation and simulation results. Finally, Section 9 con-

des this paper.

Related work

Network interconnection

The Torus-based topology well implements the network local-

forming the servers in close proximity of each other, which in-

n -D Torus with radix k is also called k -ary n -cube, which may be used inter-

ngeably in this paper.

sw 22

2.2 23

24

tot 25

of 26

era 27

as 28

( 29

(

ease cite this article as: T. Wang et al., Towards cost-effective and low l

tions (2016), http://dx.doi.org/10.1016/j.comcom.2016.02.016

low cost into consideration. In addition to traditional routing 1

thods, Small-World also provides content routing coupled with 1

graphical address assignment, which in turn efficiently imple- 1

nts key-value stores. However, its shortest path routing suffers 1

r worst-case throughput and poor load balancing.

CLOT was also a DCN architecture designed based on Torus 1

ology. CLOT shares the same goal with Small-World, which aims 1

reduce the routing path length and improve its overall network 1

formance while retaining the Torus merits. Based on regular 1

us topology, CLOT uses a number of most beneficial small low- 1

switches to connect each node and its most distant nodes in

erent dimensions. In this way, for a n-D CLOT, each switch will 1

nect to 2 n nodes. By employing additional low-end switches, 1

T largely shortens the network diameter and the average path 1

gth. However, it also induces an extra expenditure on these 1

itches. 1

. Power savings in data centers 1

The energy cost of a data center accounts for a large portion of 1

al budget [14–17] . There have emerged a considerable number 1

research and investigation to achieve a green data center. Gen- 1

lly, the existing proposals can be classified into four categories 1

below. 1

1) Network level : This scheme usually resorts to energy- 1

aware routing, VM migration/placement, flow scheduling 130

and workload management mechanism to consolidate traf- 131

fic and turn off idle switches/servers [17–20] . 132

2) Hardware level : This scheme aims to design energy- 133

efficient hardware (e.g. server, switch) by using certain 134

atency data center network architecture, Computer Communi-


T. Wang et al. / Computer Communications xxx (2016) xxx–xxx 3

ARTICLE IN PRESS


energy-efficient techniques like DVFS, VOVO, PCPG,and so on 135

[21–24] . 136

(3) Architectural level : This scheme designs energy-efficient net- 137

work architecture to achieve power savings, examples like 138

flattened butterfly topology [25] , Torus-based topology [9–139

11] and content-centric networking (CCN) based architec- 140

tures which can reduce the content distribution energy costs 141

[26,27] . 142

(4) Green energy resources : This scheme makes use of green 143

sources to reduce the energy budget such as wind, water, 144

solar energy, heat pumps, and so on [28,29] . 145

NovaCube can be considered as an architectural level approach, 146

which avoids using energy hungry switches. Moreover, NovaCube 147

n- 148

cy 149

150

151

152

on 153

or 154

ch 155

156

ss 157

n, 158

ng 159

160

c- 161

ny 162

ce 163

le 164

ks. 165

s- 166

to 167

al- 168

169

es 170

o- 171

e- 172

he 173

o- 174

ks, 175

rt 176

177

la- 178

is 179

/IP 180

ng 181

rk 182

s- 183

of 184

so 185

st 186

ce 187

us 188

nt 189

190

en 191

192

193

194

195

3.2. Routing issues in Torus 196

Any well qualified routing algorithm design in Torus network 197

must take all important metrics (such as throughput, latency, aver- 198

age path length, load balancing, deadlock free) into consideration. 199

However, the current existing routing algorithms in Torus are far 200

from perfect, as when they improve some certain performance it 201

is usually at the sacrifice of others. Generally the routings in Torus 202

can be divided into two classes: deterministic routing and adaptive 203

routing. A common example of deterministic routing is dimension- 204

ordered routing (DOR) [34] , where the message routes dimension 205

by dimension and the routing is directly determined by the source 206

address and destination address without considering the network 207

state. DOR achieves a minimal routing path, but also eliminates 208

in 209

o- 210

ti- 211

e, 212

6] 213

al 214

ke 215

ng 216

ed 217

on 218

re 219

e- 220

le 221

k. 222

k- 223

t- 224

by 225

al 226

227

n- 228

r- 229

rt- 230

nt 231

to 232

- 233

on 234

es 235

236

237

al- 238

on 239

a 240

241

en 242

m 243

244

de 245

246

1)

ue 247

248

e 249

would also save the cooling cost induced by cooling the heat ge

erated by switches. More discussions about the energy efficien

performance of NovaCube are given in Section 4.2.7 .

3. Motivation

3.1. Why Torus-based clusters

The Torus (or precisely k -ary n -cube) based intserconnecti

has been regarded as an attractive DCN architecture scheme f

data centers because of its own unique advantages, some of whi

are listed below.

Firstly, it incurs lower infrastructure cost since it is a switchle

architecture without needing any expensive switches. In additio

the power consumed by the switches and its associated cooli

power can also be saved.

Secondly, it achieves better fault-tolerance. Traditional archite

ture is usually constructed with a large number of switches, a

failure of which could greatly impact on the network performan

and system reliability. For example, if a ToR switch fails, the who

rack of servers will lose connection with the servers in other rac

Comparatively, the rich interconnectivity and in-degree of Toru

based switchless architecture makes the network far less likely

be partitioned. The path diversity can also provide good load b

ance even on permutation traffic.

Thirdly, the architectural symmetry of Torus topology optimiz

the scalability and granularity of Clusters. It allows systems to ec

nomically scale to tens of thousands of servers, which is well b

yond the capacity of Fat-Tree switches. For an n -ary k -cube, t

network can support up to k n nodes, and scales at a high exp

nential speed which outperforms traditional switched networ

such as Fat-Tree’s O ( p 3 ), and BCube’s O ( p 2 ), where p denotes p -po

switch.

Fourthly, Torus is also highlighted by its low cross-cluster

tency. In traditional switched DCNs, an inevitably severe problem

that the switching latency (several μs in each hop) and the TCP

processing latency (tens of μs) are very high, which leads to a lo

RTT. Comparatively, TCP/IP stack is not needed in Torus netwo

which saves the long TCP/IP processing time, and the NIC proces

ing delay is also lower than switches (e.g., the processing delay

a real VirtualShare NIC engine is only 0.45 μ s). Besides, Torus al

avoids the network oversubscription and provides many equal co

routing paths to avoid network congestion, which can help redu

the queuing delay due to network congestion. Consequently, Tor

achieves a much lower end-to-end delay, which is very importa

in the data center environment.

Fifthly, its high network performance has already been prov

in high-performance systems and supercomputers, such as Cray

Inc.’s Cray SeaStar (2D Torus) [30] , Cray Gemini (3D Torus) [31] ,

IBM’s Blue Gene/L (3D Torus) [32] and Blue Gene/Q (5D Torus)

[33] .

250

2)

Please cite this article as: T. Wang et al., Towards cost-effective and


any path diversity provided by Torus topology, which results

poor load balancing and low throughput. As an improved tw

phase DOR algorithm, Valiant routing (VAL) [35] can achieve op

mal worst-case throughput by adding a random intermediate nod

but it destroys locality and suffers longer routing path. ROMM [3

and RLB [37] implements good locality, but cannot achieve optim

worst-case throughput. Comparatively, the adaptive routing (li

MIN AD [38] ) uses local network state information to make routi

decisions, which achieves better load balancing and can be coupl

with a flow control mechanism. However, using local informati

can lead to non-optimal choices while global information is mo

costly to obtain, and the network state may change rapidly. B

sides, adaptive routing is not deadlock free, where a resource cyc

can occur without routing restrictions which leads to a deadloc

Thus, adaptive routings have to apply some dedicated deadloc

avoiding techniques, such as Turn Model Routing (by elimina

ing certain turns in some dimensions) and Virtual Channels (

decomposing each unidirectional physical link into several logic

channels with private buffer resources), to prevent deadlock.

To summarize, the good features of Torus conclusively demo

strate its superiority in constructing a cost-effective and high pe

formance data center network. However, it also suffers some sho

comings, such as the relatively long routing path, and inefficie

routing algorithm with low worst-case throughput. In response

these issues, in this paper we propose some practical and effi

cient solutions from the perspectives of physical interconnecti

and routing while inheriting and keeping the intrinsic advantag

of Torus topology.

4. NovaCube network design

This section presents the network design and theoretical an

ysis of NovaCube . Before introducing the physical interconnecti

structure, we firstly provide a theorem with proof, which offers

theoretical basis of NovaCube design.

Theorem 4.1. For any node A(a 1 , a 2 , … , a n ) in a k-ary n-cube (wh

k is even) if B(b 1 , b 2 , … , b n ) is assumed to be the farthest node fro

A, then B is unique and B’s unique farthest node is exactly A.

Proof. In a k -ary n -cube, if B( b 1 , b 2 , … , b n ) is the farthest no

from A( a 1 , a 2 , … , a n ), where a i ∈ [0, k ), b i ∈ [0, k ), then there is:

b 1 =

(a 1 +

k

2

)mod k, . . . , b n =

(a n +

k

2

)mod k (

Since the result of (a i +

k 2 ) mod k is unique, thus ∀ b i is uniq

and b i ∈ [0, k ). Hence, A’s farthest node B is unique.

Next, assume B’s farthest node is A

′ ( a ′ 1 , a ′ 2 , . . . , a ′ n ), similarly w

have:

a ′ 1 =

(b 1 +

k )

mod k, . . . , a ′ n =

(b n +

k )

mod k (
2 2
low latency data center network architecture, Computer Communi-



ARTICLE IN PRESS


ube (f

251

a ′ i

(1)252

a ′ i

(2)253

a ′ i

As 254

A’(255

fro256

4.1.257

258

ogy259

� k 2 260

ing261

the262

jum263

net264

can265

(a n266

bri267

nod268

two269

4.1. 70

71

the 72

pai 73

far 74

link 75

ori 76

Nov 77

4.1. 78

79

to 80

sin 81

tive 82

onl 83

dim 84

is 85

k − 86

pai 87

nod 88

a t 89

4.2 90

91

wo 92

wid 93

4.2 94

95

jum 96

am 97

to 98

D N

Pro 99

is D 00

can 01

Pl

ca

Fig. 2. A 2D 6 × 6-node and 3D 4 × 4 × 4-node NovaC

By combining (1) and (2), we can get:

=

(b i +

k

2

)mod k =

{[(a i +

k

2

)mod k

]+

k

2

}mod k

∵ a i ∈ [0 , k ) , ∴ a i +

k

2

∈

[ k

2

, k +

k

2

) For the case of a i +

k 2 ∈ [ k 2 , k ) , we have

=

{[(a i +

k

2

)mod k

]+

k

2

}mod k

=

(a i +

k

2

+

k

2

)mod k = (a i + k ) mod k = a i

For the case of a i +

k 2 ∈ [ k, k +

k 2 ) , we have

=

{[(a i +

k

2

)mod k

]+

k

2

}mod k

=

(a i +

k

2

− k +

k

2

)mod k = a i mod k = a i

a consequence of the above, a ′ i = a i for ∀ i ∈ [1, n ]. Therefore,

a ′ 1 , a ′ 2 , . . . , a ′ n ) = A( a 1 , a 2 , … , a n ), which means the farthest node

m B is exactly A. This ends the proof. �

NovaCube physical structure

As aforementioned, one critical drawback of k -ary n -cube topol-

is its relatively long network diameter, which is as high as

� n . In order to decrease the network diameter and make rout-

packets to far away destinations more efficiently, based on

regular k -ary n -cube, NovaCube is constructed by adding some

p-over links connecting the farthest node pairs throughout the

work. In a n -D Torus the most distant node of ( a 1 , a 2 , … , a n )

be computed as ( (a 1 + � k 2 � ) mod k, (a 2 + � k 2 � ) mod k, . . . ,

+ � k 2 � ) mod k ), which guides the construction of NovaCube . In

ef, the key principle of NovaCube is to connect the most distant

e pairs by adding one jump-over link. More precisely, there are

construction cases with tiny differences.



or simplicity, not all jump-over links are shown).

1. Case #1: k is even 2

When k is even, then according to Theorem 3.1 any node’s far- 2

st node is unique to each other, and there are k n

2 farthest node 2

rs, where k n is the total number of nodes. In this case, all the k n

2 2

thest node pairs are connected to each other by one jump-over 2

. As a result, the degree of each node will be increased from 2

ginal 2 n to 2 n + 1 . Fig. 2 presents two examples of 2D and 3D 2

aCube . 2

2. Case #2: k is odd 2

When k is odd, one node’s farthest node cannot be guaranteed 2

be unique, nor is the number of node pairs k n

2 an integer either 2

ce k n is odd. In consideration of this fact, we have no alterna- 2

but to settle for the second-best choice. The eclectic way is to 2

y construct ( k -1)-ary n - NovaCube , and keep the k th node in each 2

ension unchanged. Noticing that k ( k ≥ 1) is odd, then k − 1 2

even. Therefore, the construction of n -D NovaCube with radix 2

1 is the same as Case 1. Consequently, there are (k −1) n

2 node 2

rs with node degree of 2 n + 1 that are connected, and n k − n k −1 2

es with node degree of 2 n remain unchanged. This way makes 2

rade-off, however a small one. 2

. Properties of NovaCube 2

As with any network, the performance of the NovaCube net- 2

rk is characterized by its network diameter, bisection band- 2

th, throughput, path diversity and physical cost. 2

.1. Network diameter 2

After connecting the most distant node pairs by additional 2

p-over links, the NovaCube network architecture halves the di- 2

eter, where the diameter is reduced from original D Torus =

⌊k 2

⌋n 2

be current 2

ov aCube =

⌈ ⌊k 2

⌋n

2

⌉

(3)

of. The network diameter of a regular k -ary n -cube ( n -D Torus) 2

Torus =

⌊k 2

⌋n, which means that any node inside the network 3

reach all the other nodes within

⌊k 2

⌋n hops. For any node A 3




ARTICLE IN PRESS


Fig. 3. A k -ary n - NovaCube .

in Torus, we assume that node B is the farthest node from node 302

A. Next, we assume set S i to denote the nodes that can be reached 303

from node A at the i -th hop in Torus, where i ∈ [0 , ⌊

k 2

⌋n ] . Then 304

the universal set of all nodes in the network can be expressed as 305

S =

∑ � k 2 � n

i =0 S i . 306

After linking all the most distant node pairs (e.g. A and B) 307

in NovaCube , if we define S ′ i

as the set of nodes that are i hops 308

from node A in NovaCube , then: (1) for the case of � k 2 � n is even, 309

we have S ′ 0

= S 0 , S ′ 1

= S 1 + S � k 2 � n , S

′ 2

= S 2 + S � k 2 � n −1

, . . . , S ′ � k 2 � n/ 2

= 310

S � k 2 � n/ 2

+ S � k 2 � n/ 2+1

; (2) for the case of � k 2 � n is odd, we have 311

S ′ 0

= S 0 , S ′ 1

= S 1 + S � k 2 � n , S

′ 2

= S 2 + S � k 2 � n −1

, . . . , S ′ � k 2 � n/ 2 −1

= 312

S � k 2 � n/ 2 −1

+ S � k 2 � n/ 2 +1

, S ′ � k 2 � n/ 2 = S � k

2 � n/ 2 . This demonstrates 313

that in NovaCube any node A can reach all nodes of the entire net- 314

rk 315

316

317

he 318

rk 319

t- 320

ar- 321

of 322

er 323

r- 324

bi- 325

be 326

k 327

c- 328

nd 329

c- 330

331

4)

a 332

re, 333

st 334

es 335

336

337

to 338

on 339

nd 340

al 341

nd 342

in 343

re 344

d. 345

ad 346

= 347

ck 348

channel is saturated and equal to the channel bandwidth b c . Thus, 349

the ideal throughput �ideal of a topology is 350

�ideal =

b c

ω max (5)

Under uniform traffic pattern, the maximum channel load ω max 351

at the bisection channel has a lower bound, which in turn gives 352

an upper bound on throughput. For a uniform traffic pattern, on 353

average k n

2 packets must go through the B T bisection channels. If 354

the routing and flow control are optimal, then the packets will be 355

distributed evenly among all bisection channels which results in 356

the best throughput. Thus, the load on each bisection channel load 357

is at least 358

ω max ≥k n

2

B T

=

k n

2

k n + 4 k n −1 =

k

2 k + 8

(6)

i- 359

360

7)

an 361

to 362

0] . 363

t- 364

o- 365

h- 366

%. 367

r- 368

369

370

ity 371

c- 372

al- 373

ch 374

he 375

376

ge 377

- 378

er 379

an 380

B 381

of 382

383

8)

se 384

n- 385

i- 386

n- 387

er 388

er 389

work within � k 2 � n/ 2 or � k 2 � n/ 2 hops. Consequently, the netwo

diameter of NovaCube is � k 2 � n/ 2 . This ends the proof. �

4.2.2. Bisection bandwidth

The bisection bandwidth can be calculated by summing up t

link capacities between two equally-sized parts which the netwo

is partitioned into. It can be used to evaluate the worst-case ne

work capacity [39] . Assume the NovaCube network T ( N 1 , N 2 ) is p

titioned into two equal disjoint sets N 1 and N 2 , each element

T ( N 1 , N 2 ) is a bidirectional channel with a node in N 1 and anoth

node in N 2 . Then the number of bidirectional channels in the pa

tition is | T ( N 1 , N 2 ) | , or 2 | T ( N 1 , N 2 ) | channels in total, thus the

section bandwidth is B T = 2 | T (N 1 , N 2 ) | . For a k -ary n - NovaCu

as shown in Fig. 3 , when k is even, there is even number of

k -ary ( n -1)-cube, which can be divided by the minimum bise

tion into two equal sets with 2 k n −1 regular bidirectional links a

k ∗ k n −1

2 jump-over bidirectional links. Therefore, the channel bise

tion bandwidth of k -ary n - NovaCube is computed as:

B T = 2 ∗(

2 k n −1 + k ∗ k n −1

2

)= k n + 4 k n −1 (

According to the result in [40] , the bisection bandwidth of

regular n -dimensional Torus with radix k is B C = 4 k n −1 . Therefo

NovaCube effectively increases the bisection bandwidth by at leaB T −B C

B C =

k n +4 k n −1 −4 k n −1

4 k n −1 =

k 4 ≥ 25% (k ≥ 1) and the ratio increas

accordingly as k increases.

4.2.3. Throughput

Throughput is a key indicator of the network capacity

measure a topology. It not only largely depends on the bisecti

bandwidth, but is also determined by the routing algorithm a

flow control mechanism. However, we can evaluate the ide

throughput of a topology under the assumed perfect routing a

flow control. The maximum throughput means some channel

the network is saturated and the network cannot carry mo

traffic. Thus, the throughput is closely related to the channel loa

We assume the bandwidth of each channel is b c and the worklo

on a channel c is ω c . Then the maximum channel load ω max

Max { ω c , c ∈ C }. The ideal throughput occurs when the bottlene



Consequently, the upper bound on an ideal throughput under un

form traffic can be derived from Eqs. (5) and (6) :

�ideal =

b c

ω max ≤ 2 k + 8

k b c (

This exhibits that NovaCube achieves better performance th

regular Torus topology in the network capacity with respect

throughput, where the ideal throughput of Torus is only 8 b c / k [4

Here we normalize the worst-case throughput ̂ �nw

to the ne

work capacity: ̂ �nw

=

ω max

ω nw ( ̂ R ) , where ̂ R indicate a routing alg

rithm. Valiant routing (VAL) [35] is a known worst-case throug

put optimal routing algorithm in Torus which obtains ̂ �nw

= 50

Thus, an optimal routing algorithm in NovaCube can achieve no

malized worst-case throughput of at least ̂ �nw

= 62.5%.

4.2.4. Path diversity

Inherited from Torus topology, NovaCube provides a divers

of paths, which can be exploited in routing algorithm by sele

tively distributing traffic over these paths to achieve load b

ancing. Besides, the network reliability also greatly benefits mu

from the path diversity, where the traffic can route around t

faulty nodes/links by taking alternative paths.

The number of distinct paths existing in NovaCube is too hu

to be calculated exactly, for simplicity, we first compute the num

ber of shortest paths in a regular Torus without any jump-ov

links. Assume two nodes A( a 1 , a 2 , . . . , a n ) and B( b 1 , b 2 , . . . , b n ) in

n -dimensional Torus, and the coordinate distance between A and

in the i th dimension is �i = | a i − b i | . Then the total number

shortest paths P ab between A and B is:

P ab =

n ∏

i =1

( n ∑

j= i � j

�i

)=

( ∑ n

i =1 �i )! ∏ n i =1 �i !

(

where the term

(∑ n j= i � j

�i

)computes the number of ways to choo

where to take the �i hops in dimension i out of all the remai

ing hops. It can be seen that a longer distance and a higher d

mension result in a larger number P ab of shortest paths. For i

stance, if given �x = 3, �y = 4, �z = 5 in a 3D Torus, the numb

of shortest paths is as high as 27720. If we further add a larg




ARTICLE IN PRESS


number of additional jump-over links in NovaCube , the number of 390

paths will be larger. If taking the non-minimal paths into consider- 391

ation as designed in some routing algorithms, the number of feasi- 392

ble paths is nearly unlimited. The great path diversity of NovaCube 393

offers many benefits as aforementioned, but it is also confronted 394

with great challenges in designing an efficient, deadlock free and 395

load balanced routing algorithm. 396

4.2.5. Average path length 397

In this subsection, we derive the average path length (APL) for 398

n -D399

= 2400

Nov401

s to402

nei403

404

wh405

d m406

�(

wh407

( ∑

408

E [ �409

hav410

E[�

411

wh412

cre413

the414

4.2415

416

k is417

is g418

Nov419

the420

pai421

1 2 ) k422

link423

The424

N lin

425

wh426

tive427

of 428

link429

the430

Fat431

for432

has433

Mo434

the435

coo436

can437

4.2.7. Power savings 438

The power savings of data center is related to many factors, 439

which can be divided into several categories including hardware 440

level (server/switch using energy-efficient techniques like DVFS, 441

VOVO, PCPG, etc.), network level (energy-aware routing and flow 442

scheduling, job placement, energy-aware VM migration, etc.), 443

architectural level (e.g. switchless architecture), cooling system 4 4 4

design (cooling techniques) and the energy resources (e.g. re- 445

newable or green resources like wind, water, solar energy, heat 446

pumps) [17,41,42] . NovaCube can be considered as an architectural 447

lev 48

cor 49

con 50

bud 51

por 52

ori 53

wil 54

reg 55

oth 56

ene 57

har 58

ach 59

5. 60

61

nam 62

its 63

we 64

loc 65

5.1. 66

No 67

B( b 68

b i || 69

70

The 71

pro 72

qua 73

vaC 74

5.1. 75

76

sou 77

cid 78

S 1 , 79

the 80

PO 81

pac 82

rec 83

S ’s 84

nex 85

me 86

ma 87

p i

wh 88

Fig 89

and 90

= 6 91

Pl

ca

NovaCube with an even radix k , and calculate its value for n

. Due to the symmetry, every node is the same in the k -ary n -

aCube . Hence, we can only consider the APL from a fixed source

any possible destination d . Here we denote m as the jump-over

ghbour of s .

Denote source s = (0, … , 0) and destination d = ( x 1 , … , x n ),

ere x i ∈ [0 , k − 1] . Then we have m = ( k 2 , . . . , k 2 ). Thus, the s -

inimal distance in k -ary n - NovaCube is given as:

s, d) def = min {|| s − d|| 1 , || m − d|| 1 + 1 } (9)

ere || · || p is p -norm of the vector meaning that || x || p =

n i =1 | x i | p )

1 p for the n -dimensional vector x . Then, the APL is

( s , d )]. Without loss of generality, for the case n = 2 , we can

e

(s, d)] =

1 ∗ 5 +

∑ k/ 2 −1 i =2

(8 i − 4) ∗ i + (4 k − 6) ∗ k 2

k 2 − 1

=

k 3

3 +

k 2

2 − 4 k

3 + 1

k 2 − 1

(10)

Thus, the APL of NovaCube approaches to k 3 when k is large,

ich is superior to 2D Torus’s k 2 [40] , and as the dimension in-

ase NovaCube reduces more. In the similar way, we can compute

APL for the k -ary n -D NovaCube .

.6. Cost- effectiveness

The total number of nodes of a n -dimensional Torus with radix

k n and the degree of each node is 2 n , thus the number of links

iven as 2 nk n

2 = nk n . Therefore, the total number of links N links in

aCube can be calculated by summing up nk n regular links and

number of jump-over links. When k is even, there are k n

2 node

rs connected with jump-over links, so there are nk n +

k n

2 = (n +

n links in total. Likewise, when k is odd, there are nk n +

(k −1) n

2

s altogether, of which

(k −1) n

2 is the number of jump-over links.

calculation can be summarized as below:

ks =

⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩

(n +

1

2

)∗ k n : kise v en

nk n +

(k − 1) n

2

: kisodd (11)

The number of links per server in NovaCube is ˜ N links ≤ n +

1 2 ,

ere it is n +

1 2 for even k and n +

(k −1) n

2 k n for odd k . Compara-

ly, the number of links in FatTree [1] is relative to the number

ports p on switches, which is N links = 3 p 3 /4, and the number of

s per server in FatTree is 3. Thus, when the dimension n ≤ 3,

cost-effectiveness of NovaCube ( n +

1 2 ) is almost the same as

Tree (3) or even better than FatTree for n ≤ 2. For example,

a 4096-node topology, FatTree uses 12288 links while NovaCube

10240 links for 2D topology and 14336 links for 3D topology.

reover, NovaCube is a switchless architecture, which can save

high expenditure of expensive switches and racks with related

ling costs. Therefore, drawn from the above analysis, NovaCube

be regarded as a cost-effective architecture for data centers.



el approach, which avoids using energy hungry switches. Ac- 4

ding to the findings in previous studies [19,43,44] , the power 4

sumed by switches accounts for around 15% of total power 4

get. As a switchless architecture, NovaCube will save this 4

tion of power consumption. Besides, intuitively the cooling cost 4

ginally induced by cooling the heat emitted by the switches 4

l be saved as well. From this perspective, NovaCube can be 4

arded as an energy-efficient architecture. Moreover, if some 4

er levels of power saving techniques (e.g. power-aware routings, 4

rgy-efficient work placement and VM migration, energy-saving 4

dware) are employed to NovaCube, more power savings can be 4

ieved. 4

Routing scheme 4

This section presents the specially designed routing algorithms 4

ed PORA for NovaCube , which aims to help NovaCube achieve 4

maximum theoretical performance. PORA is a probabilistic 4

ighted oblivious routing algorithm. Besides, PORA is also live- 4

k and deadlock free. 4

PORA routing algorithm 4

tation 1. The distance between node A( a 1 , a 2 , . . . , a n ) and node 4

1 , b 2 , . . . , b n ) in the i -th dimension is denoted as �i = || a i − 4

1 . The distance between A and B is given as �AB =

∑ n i =1 �i . 4

Generally, the PORA procedure can be divided into two steps. 4

first step is to choose routing direction according to the given 4

bability, while the second step is to route within the designated 4

drant. Without loss of generality, for simplicity we use 2D No- 4

ube to illustrate PORA. 4

1. Direction determination 4

As shown in Fig. 4 , assume a packet needs to route from the 4

rce node S to the destination node D , then firstly it needs to de- 4

e the direction of its first hop. Since S has five neighbour nodes 4

S 2 , S 3 , S 4 , M , where M is its jump-over neighbour (although in 4

case of odd k some special nodes may have no jump-over links, 4

RA still works correctly), thus it has five directions to route the 4

ket. In order to choose the most beneficial next-hop, each di- 4

tion is assigned a probability based on the distance between 4

next-hop node and destination node. Then PORA chooses the 4

t-hop according to their probabilities, where the probabilistic 4

chanism can help PORA achieve a good load balancing. The nor- 4

lized probability function is given as below: 4

=

1 �2

i ∑ ψ

i =1 1

�2 i

(12)

ere ψ is the number of neighbour nodes of the source. Take 4

. 4 as an example, the distances between S’s neighbour nodes 4

destination node D are �S 1 D = 4, �S 2 D

= 4, �S 3 D = 6, �S 4 D

4

, �MD = 3, thus the probability of choosing S 1 as the next-hop 4




ARTICLE IN PRESS


+

,-

,-

nd

%, 492

to 493

ur 494

d- 495

x- 496

nd 497

by 498

499

500

fi- 501

er 502

ns 503

d- 504

er 505

a 506

ep 507

is 508

u- 509

ch 510

ult 511

512

et 513

m 514

he 515

as 516

ng 517

518

e 519

520

a- 521

h- 522

523

ke 524

R 525

he 526

nk 527

he 528

as 529

til 530

531

532

a- 533

ay 534

se 535

is 536

re 537

In 538

p, 539

he 540

d- 541

to 542

nt 543

A 544

545

546

ck 547

es 548

y- 549

he 550

o- 551

re 552

s- 553

d- 554

to 555

he 556

n- 557

is 558

ve 559

560

re 561

n- 562

st 563

ds 564

n- 565

er 566

he 567

he 568

n- 569

ly 570

ill 571

of 572

573

574

575

to 576

es 577

d. 578

ta 579

it- 580

et 581

o- 582

nt 583

by 584

as 585

ull 586

en 587

w 588

to 589

p- 590

plies an acknowledg ment mechanism to keep track of lost packets, 591

x

y

Quadrant I +,+

Quadrant II -,

Quadrant III +

Quadrant IV -

jump-over link

S1

S4

S3

S2

M

D

S

Fig. 4. PORA in an 8 × 8 NovaCube (for simplicity, not all wraparound links a

jump-over links are displayed).

is p S 1 =

1

�2 S 1 D ∑ 1

�2 i

= 21.43%, and likewise p S 2 = 21.43%, p S 3 = 9.524

p S 4 = 9.524%, p M

= 38.10%, respectively. Clearly, PORA prefers

choose the shorter path with a higher probability. Each neighbo

node (except the jump-over neighbour) corresponds to one qua

rant in the Cartesian coordinate system as shown in Fig. 4 . For e

ample, in Fig. 4 S 1 , S 2 , S 3 , S 4 correspond to Quadrant I, II, III, a

IV, respectively. The division of quadrants is only determined

the source node and destination node.

5.1.2. Routing within quadrant

There are two cases for this step. For the first, if Step 1

nally selects the regular neighbour node other than the jump-ov

neighbour as the next hop, then all the following routing decisio

towards the destination must be restricted within the correspon

ing quadrant. For the second, in case Step 1 chooses the jump-ov

neighbour node M (e.g. if it has a smaller distance thus with

higher probability to be chosen) as the next hop, then repeat St

1 to determine the quadrant by taking M as the source node. Th

time, in Step 1 PORA will only compute the probability of its reg

lar neighbours without considering its jump-over neighbour, whi

can avoid jumping back to the original source node that may res

in livelock issue.

Once the quadrant is determined, then PORA routes the pack

only within the chosen quadrant, where the routing mechanis

applied is also probabilistic. At each hop, PORA firstly check if t

jump-over link can be used. The jump-over hop can be taken

the candidate next-hop route if and only if it satisfies the followi

two requirements:

• The jump-over neighbour node is also located within the sam

quadrant.

• The distance between jump-over neighbour node and destin

tion node is smaller than the distance between regular neig

bour node and destination node.

If the requirements cannot be satisfied, then PORA will ta

the regular neighbour node as its next-hop using traditional DO

(Dimension-Ordered Routing [34] ) algorithm, which routes t

packet dimension by dimension. Otherwise, if the jump-over li

meets the requirements, then the next-hop is selected from t

jump-over node and DOR node according to the probability

computed in Eq. (12) . This process is repeated at each hop un

it reaches the destination.



5.2. Livelock prevention

Livelock occurs when a packet is denied routing to its destin

tion forever even though it is never blocked permanently. It m

be travelling around its destination node, never reaching it becau

the channels it requires are always occupied by other packets. Th

can occur if non-greedy adaptive routing is allowed (packets a

misrouted, but are not able to get closer to the destination).

PORA, once the routing direction is determined at the first ste

each of the following hops of PORA will be restricted within t

selected quadrant. Moreover, the routing method within the qua

rant enables the packet to find its next hop, whose distance

the destination node is always smaller than that from the curre

node, which guarantees packet delivery. Thus, we claim that POR

is a livelock-free routing algorithm.

5.3. Deadlock-free implementation

As an another notorious problem in Torus networks, deadlo

is the situation where packets are allowed to hold some resourc

while requesting others, so that the dependency chain forms a c

cle. Then all these packets must wait forever without reaching t

destination, and the throughput will also collapse. The DOR alg

rithm is proven to be deadlock-free in a mesh network, since the

will be no message loop in the network. However, the Torus me

sage loops by itself, thus simply using DOR cannot prevent dea

lock. Virtual channels are proposed as a very effective means

prevent deadlock from happening. Virtual channels are used in t

loops in a network to cut the loop into two different logical cha

nels, so no cyclic dependency will be formed. Virtual channel

easy to implement by using multiple logical queues and effecti

in solving deadlock.

To prevent deadlock in our architecture, we first make su

that jump-over links cannot form loops in the routing. We e

sure that any jump-over links we choose in the quadrant mu

be nearer than the previous hop and regular Torus links towar

the destination; otherwise, regular Torus links are used. This e

sures that the packet will never jump back through the jump-ov

links. Then, we use the DOR routing to prevent the deadlock in t

mesh sub-network. Finally, if the packets have to pass through t

wraparound links in the Torus network, we use two virtual cha

nels to cut a Torus loop into two different logical paths. Thus, on

two virtual channels are needed in each direction, which is st

cost-effective, since the hardware cost increases as the number

virtual channels increases.

6. Flow control

6.1. Credit-based flow control mechanism

Flow control, or known as congestion control, is designed

manage the rate of data transmission between devices or nod

in a network to prevent the network buffers being overwhelme

Too much data arrives exceeding the device capacity results in da

overflow, meaning the data is either lost or must be retransm

ted. Thus, the main objective of flow control is to limit pack

delay and avoid buffer overflow. In traditional Internet, the pr

tocols with flow control functionality like TCP usually impleme

the speed matching between fast transmitter and slow receiver

packet discarding and retransmission. More specifically, a node h

no alternative but to drop the packet when its buffer space is f

(in some special scenarios, packets may be discarded even wh

the buffer is still available if the packet priority is low or the flo

has occupied more than their fair share of resources regarding

QoS restrictions). After packet loss occurs, the network then a




ARTICLE IN PRESS


OQ

Cred

its

OQ

CreditsOQ

Credits

OQ

Credits

Data

Credits

Intermediate Node

Data

Credits

End Node

Data

Credits

Intermediate Nodes

Fig. 5. The credit based flow co

and the sender carries out fast retransmission after receiving three 592

duplicate ACKs or go back to slow start phase after a reasonable 593

timeout RTO (around 200 ms). However, this kind of packet dis- 594

carding based flow control is unsuitable in the Torus based latency 595

sensitive data center network because of its relatively long rout- 596

ing path with high number of hops. For example, if the congestion 597

point is far from the sender (e.g. the TCP incast usually happens 598

at last hop), then its previous long time transmission will be of 599

waste and the retransmission (or even timeout to slow start) not 600

only brings additional latency but also increases network overhead 601

wh602

603

ver604

bas605

pac606

in 607

sta608

nel609

sen610

res611

612

fro613

spo614

of 615

pac616

of 617

one618

to 619

nod620

one621

sio622

a b623

ma624

saf625

am626

exc627

of-628

tra629

diff630

in-631

hav632

is a633

the634

the635

fee636

6.2637

638

ber639

sw640

ach641

stru642

con643

in 44

abi 45

cre 46

firs 47

cur 48

imp 49

50

por 51

stre 52

stre 53

num 54

for 55

the 56

Bes 57

put 58

usi 59

out 60

ing 61

pac 62

the 63

one 64

out 65

ets 66

int 67

acc 68

for 69

7. G 70

7.1. 71

72

vaC 73

app 74

phy 75

the 76

wh 77

wit 78

rou 79

lay 80

dre 81

Nov 82

con 83

Pl

ca

ich may result in a more congested network.

Ideally, a lossless transport protocol is desired. However, it is

y difficult to guarantee zero packet loss using sliding window

ed flow control. Based on this observation, similar to [45] a

ket lossless credit based flow control mechanism is adopted

NovaCube . The key principle is that each node maintains buffer

te of its direct downlink neighbour node for each virtual chan-

, and only if its downstream node has available space, the

der could get some certain number of credits and transfer cor-

ponding amount of packets.

As illustrated in Fig. 5 , a prerequisite for one node to send data

m its output queue (OQ) to its next hop node is that its corre-

nding output port must have enough credits. The default value

credits maintained on one output port equals to the number of

kets that can be accepted by its downstream node. The value

credits will be decreased by one whenever its port sends out

packet. Likewise, once the downstream node has new room

accept a new packet, it will send one credit to its upstream

e whose relevant port correspondingly increases its credits by

. Usually, there is a very small delay between packet transmis-

n and credit feedback, thus the downstream node should have

it larger buffer than expected to avoid overflow and achieve

ximum throughput, or the downstream node can reserve some

ety space when granting credits to its upstream node, for ex-

ple, to set a threshold (e.g. 80% of total space) that cannot be

eeded. The credit can be transmitted either in-band or out-

band, where in-band means packets and credits feedback are

nsmitted over the same channel while out-of-band uses two

erent channels to transmit packets and credits separately. The

band feedback is more complicated in implementation but be-

es more cost-effective. Comparatively, the out-of-band fashion

n expensive way but easier to be implemented. Nevertheless,

se two possible methods can achieve the same effect. Thus, for

sake of simplicity, in our simulations we implement the credit

dback using out-of-band signal.

. Internal structure of nodes

Compared to the traditional network, in NovaCube the num-

of ports on each node is relatively small, thus output queue

itching mechanism can be applied, which not only can help

ieve better performance but also can simplify the internal

cture of nodes. However, it is difficult to implement the flow

trol adopting output queue switching mechanism. As illustrated



Intermediate Node

OQ

Cred

its

OQ

CreditsOQ

Credits

OQ

Credits

Data

Credits

End Node

Data

Credits

Intermediate Nodes

ntrol in NovaCube .

Fig. 6. Internal structure of NovaCube node.

Fig. 5 , the credits of a upstream node indicate the queue avail- 6

lity of its downstream node. In order to determine the value of 6

dits that downstream node can accept, the upstream node must 6

t determine the output port of the downstream node that the 6

rent packet will go through, which increases the difficulty in 6

lementation. 6

In response to this issue, an input queue is introduced at each 6

t as buffering space, as illustrated Fig. 6 . The credits of the up- 6

am node denote the available space of input queue in down- 6

am node. Each output queue assigns each input queue a certain 6

ber of credits, named as internal credits. Each input queue can 6

ward its packets to the corresponding output queue as long as 6

input queue has enough assigned credits for this output queue. 6

ides, each output queue can receive packets from multiple in- 6

queues. In fact, if the input/output queues are implemented 6

ng centralized shared memory, then the division of input and 6

put queues is merely a logical thing, and the packet schedul- 6

from input queue to output queue is just an action of moving 6

ket pointers. As for the issue of Head of Line (HoL) bocking, all 6

input queues can be organized as a shared virtual queue, and 6

virtual input queue can forward certain number of packets to 6

put queue if it has corresponding number of credits. Once pack- 6

of an output queue are transmitted to the next hop node, the 6

ernal credits of its corresponding input queue will be increased 6

ordingly so that the packets buffered in the input queue can be 6

warded to this output queue properly. 6

eographical address assignment mechanism 6

Network layering 6

Similar to the traditional internet, the protocol stack of No- 6

ube is also divided into five abstraction layers which are 6

lication layer, transport layer, network layer, link layer and 6

sical layer. The only difference lies in the network layer, where 6

traditional internet uses IP address to locate different hosts 6

ile NovaCube uses coordinates to direct the data transmission 6

h the benefit of topology’s symmetry, which can improve the 6

ting efficiency greatly. Except for network layer, the other 6

ers are kept the same without any changes. However, IP ad- 6

sses must be provided when creating TCP/UDP sockets, yet 6

aCube only has coordinates. Thus, an adaptation layer, which 6

verts coordinates to IP address format, is need at the network 6




ARTICLE IN PRESS


ApplicationsApplica�ons

UDP/TCPUDP/TCP

Coordinate AddressingProtocol (CAP)

Coordinate Addressing Protocol (CAP)

Ethernet...

ApplicationsApplica�ons

UDP/TCPUDP/TCP

IP...IP...

Ethernet...Ethernet...

unmodified

unmodified

outingg. POWou�ng . POW IP routing protocolsIP rou�ng protocols

Applica�on

Transport

Network

unmodified

)PI/PCT(tenretnIebuCavoN

Physical mediumPhysical mediumunmodified

be network abstraction layers.

Dimension flagxyz

nated

ep 684

th 685

es 686

687

688

nt 689

es. 690

rk 691

ss 692

ve 693

on 694

is 695

n- 696

p- 697

to 698

is 699

w- 700

to 701

is- 702

1”703

ns 704

be 705

of 706

e- 707

ce 708

ed 709

710

711

nd 712

ng 713

nd 714

lt 715

he 716

at 717

ull 718

e, 719

on 720

er 721

722

723

all 724

ks, 725

its 726

D 727

D 728

PL 729

- 730

o- 731

in 732

se 733

of 734

735

736

ch 737

lly 738

se 739

he 740

- Q5 741

lar 742

Ethernet...

Coordinate rprotocols, e.Coordinate rprotocols, e.g

Link

Physical mediumPhysical mediumPhysical

Fig. 7. The NovaCu

abc

Fig. 8. Coordi

layer. In this way, the transport layer and its upper layers ke

unmodified, so that NovaCube network can be compatible wi

legacy TCP/IP protocols and the TCP/IP based application servic

can run without any changes ( Fig. 7 ) .

7.2. Coordinate based geographical address assignment

An address translation mechanism is designed to impleme

the convention between IPv4 address and NovaCube coordinat

As illustrated in Fig. 8 , in order to support a NovaCube netwo

with maximum 6 dimensions, the traditional 32-bit IPv4 addre

is divided into seven segments consisting of six pieces with fi

bits and one piece with two bits. The coordinate of each dimensi

is denoted by the five-bit slice, and the remaining two-bit slice

a dimension flag which is used to indicate the number of dime

sions of the network. In this way, a 32-bit IPv4 address can su

port up to six dimensions, where a 6-D NovaCube can hold up

2 30 = 1,073,741,824 (1 billion) servers, thus this kind of division

reasonable and adequate even for a large scale data center. Ho

ever, normally the two-bit dimension flag only can support up

2 2 = 4 dimensions other than six dimensions. In response to this

sue, here we define that only the dimension flag with binary “1

indicates a 6D network address. When the number of dimensio

is less than six, the address space of last dimension will not

used. Therefore, when the dimension flag is “10”, we make use

the first three bits of the last dimension’s address space to repr

sent the specific dimension. The rule of dimension corresponden

is illustrated in Table 1 , and other values are currently consider

illegal.

8. Evaluation

In this section, we evaluate the performance of NovaCube a

PORA routing algorithm under various network conditions by usi

network simulator 3 (NS-3) [46] . The link bandwidth is 1 Gbps a

each link is capable of bidirectional communications. The defau

maximum transmission unit (MTU) of a link is 1500 bytes. T

propagation delay of a link and the processing time for a packet

a node are set to be 4 μ s and 1.5 μs , respectively. Besides, Weib



address assignment.

Table 1

The representation of different dimensions.

Dimension flag

(binary)

First 3 bits of c

(binary)

Dimension number

(decimal)

11 xxx 6

10 101 5

10 100 4

10 011 3

10 010 2

10 001 1

Distribution is adopted to determine the packet inter-arrival tim

and a random permutation traffic matrix is used in our simulati

where each node sends/receives traffic to/from exactly one oth

server.

8.1. Average path length

One of the biggest advantages of NovaCube resides in its sm

average path length (APL). With the benefit of jump-over lin

the APL of routing in Torus is significantly reduced. Fig. 9 exhib

the simulation results of average path length using PORA in 2

NovaCube and DOR (known to be a shortest path routing) in 2

Torus. The result reveals that PORA indeed achieves a smaller A

in NovaCube than DOR in regular Torus, where a smaller APL im

plies a lower network latency. Even the network diameter of N

vaCube is also slightly smaller than the achieved APL by DOR

Torus. Moreover, the APL achieved by PORA is already very clo

to the theoretical analysis, which demonstrates the optimality

PORA.

8.2. Network latency

Generally, the network latency consists of queuing delay at ea

hop, transmission delay and propagation delay. In order to actua

evaluate the overall packet delivery delay in the network, we u

the global packet lifetime (the time from packet’s generation to t

arrival at its destination) to measure the network latency. We sim

ulated the network latency in different sized NovaCube and regu




ARTICLE IN PRESS


4 6 8 10 12 14 16

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16PORA

Diameter of NovaCube

DOR

Diameter of Torus

Radix k

AveragePathLength

Fig. 9. The average path length using PORA and DOR.

30

35

40

45

3D NovaCube

3D Torus

econds)

Tor743

ser744

res745

late746

8.3747

748

ity 749

wid750

the751

Tor752

cre753

thr754

8.4755

756

the757

cau758

pat759

pat760

( k =761

Tor762

tha763

rat764

4 5 6 7 8 9 10

200

300

400

500

600

700

800

900

3D NovaCube

3D Torus

Radix k

AverageThroughput(MBps)

Fig. 11. The throughput of different sized NovaCube and Torus.

0% 10% 20% 30% 40% 50% 60% 70%

4

6

8

10

12

14

NovaCube - Link Failure Only

Torus - Link Failure Only

NovaCube - Node Failure Only

Torus - Node Failure Only

Failure Ratio

AveragePathLength

Fig. 12. The average path length under different failures.

Tab

sCo

N

Av

La

Th

dem 65

Tor 66

imp 67

8.5 68

69

tag 70

rou 71

if t 72

hig 73

for 74

up 75

net 76

res 77

and 78

thr 79

Pl

ca

4 5 6 7 8 9 10

10

15

20

25

Radix k

Latency(micros

Fig. 10. The network latency of different sized NovaCube and Torus.

us networks, varying from k = 4 (64 servers) to k = 10 (10 0 0

vers), as shown in Fig. 10 . It can be seen from the simulation

ults that compared to the regular Torus network the network

ncy of NovaCube network is reduced by around 40%.

. Throughput

The throughput can be used to measure the network capac-

of an architecture, and it is usually limited by bisection band-

th and also impacted by the routing algorithm. Fig. 11 shows

achieved average throughput in different sized NovaCube and

us network. The result reveals that the average throughput de-

ases with the increase of network size. NovaCube improves the

oughput of regular Torus network by up to 90%.

. Fault tolerance

The rich connectivity of Torus-based topologies guarantees

network with high reliability. The node/link failures unlikely

se network disconnections, only may lead to a higher routing

h length. Fig. 12 exhibits the simulation results of the average

h lengths under different kinds of failure ratios in 512-server

8) NovaCube network using PORA routing algorithm and

us network using DOR routing algorithm. The results show

t the average path length increases as the link/node failure

io increases. NovaCube network with a richer connectivity



le 2

mparison between 2D NovaCube and 3D NovaCube.

ovaCube 64-server 729-server

2D 3D 2D 3D

erage path length 3.06 1.82 9.46 4.17

tency (m s) 16.12 10.03 49.87 22.92

roughput (MBps) 616.28 823.42 355.31 447.92

onstrates a better performance in fault tolerance than regular 7

us. Another finding is that the node failure has a slightly higher 7

act on the average routing path length. 7

. Comparison between 2D and 3D NovaCube 7

NovaCubes with different dimensions have different advan- 7

es. A lower dimensional NovaCube has lower wiring and 7

ting complexity, and can be easier to be constructed. However, 7

he network size is large, it is better to be constructed with 7

her dimensions, and the average path length will also be lower 7

higher dimensional NovaCube. The network can easily scale 7

with a higher dimension, thus NovaCube can well support 7

work’s future expansion. Table 2 gives some simple simulation 7

ults about the performance comparison between 2D NovaCube 7

3D Novacube with respect to average path length, latency and 7

oughput. As it can be seen, for the same sized network, 3D 7




ARTICLE IN PRESS


NovaCube achieves lower average path length, lower latency and 780

higher throughput than 2D NovaCube. Therefore, if not consider 781

the wiring and routing complexity, a higher dimensional NovaCube 782

is a better choice. 783

9. Conclusion 784

In this paper, we proposed a novel data center architecture 785

named NovaCube , and presented its design and key properties. As a 786

switchless architecture, NovaCube ’s cost-effectiveness is highlighted 787

with regard to its energy consumption and infrastructure cost. As 788

proved, NovaCube is also superior to other candidate architectures 789

in terms of network diameter, throughput, average path length, 790

bisection bandwidth, path diversity and fault tolerance. Further- 791

more, the specially designed probabilistic weighted oblivious rout- 792

ing algorithm PORA helps NovaCube achieve near-optimal average 793

path length with better load balancing which can result in a better 794

throughput. Moreover, PORA is also free of livelock and deadlock. 795

The simulation results further prove the good performance of No- 796

vaCube . 797

Acknowledgment 798

This research has been supported by a Grant from Huawei Tech- 799

nologies Co., Ltd. The authors also would like to express their 800

thanks and gratitudes to the anonymous reviewers whose con- 801

structive comments helped improve the manuscript. 802

References 803

[1] M. Al-Fares, A. Loukissas, A. Vahdat, A scalable, commodity data center net- 804 work architecture, in: Proceedings of the ACM SIGCOMM Computer Communi- 805 cation Review, vol. 38, ACM, 2008, pp. 63–74. 806

[2] A. Greenberg, J.R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D.A. Maltz, 807 P. Patel, S. Sengupta, VL2: a scalable and flexible data center network, SIG- 808 COMM Comput. Commun. Rev. 39 (4) (2009) 51–62. 809

[3] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, S. Lu, DCell: A scalable and fault-tolerant 810 network structure for data centers, ACM SIGCOMM Comput. Commun. Rev. 38 811

812 be: 813 en- 814

815 ch, 816 of 817 10, 818

819 in- 820 ec- 821 ter 822

823 or- 824 of 825 14, 826

827 er- 828 79 829

830 ing 831 10) 832

833 the 834

835 sed 836 bal 837

838 aid 839 15 840 9–841

842 hi- 843 1–844

845

[14] Z. Guo, Z. Duan, Y. Xu, H.J. Chao, Cutting the electricity cost of distributed 846 datacenters through smart workload dispatching, IEEE Commun. Lett. 17 (12) 847 (2013) 2384–2387. 848

[15] Z. Guo, Z. Duan, Y. Xu, H.J. Chao, Jet: electricity cost-aware dynamic workload 849 management in geographically distributed datacenters, Comput. Commun. 50 850 (2014) 162–174. 851

[16] L. Rao, X. Liu, L. Xie, W. Liu, Minimizing electricity cost: optimization of dis- 852 tributed internet data centers in a multi-electricity-market environment, in: 853 Proceedings of the INFOCOM, 2010, IEEE, 2010, pp. 1–9. 854

[17] T. Wang, Y. Xia, J. Muppala, M. Hamdi, S. Foufou, A general framework for 855 performance guaranteed green data center networking, in: Proceedings of 856 the 2014 IEEE Global Communications Conference (GLOBECOM), IEEE, 2014, 857 pp. 2510–2515. 858

[18] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, 859 N. McKeown, Elastictree: saving energy in data center networks., in: Proceed- 860 ings of the NSDI, vol. 3, 2010, pp. 19–21. 861

[19] T. Wang, B. Qin, Z. Su, Y. Xia, M. Hamdi, S. Foufou, R. Hamila, Towards band- 862 width guaranteed energy efficient data center networking, J. Cloud Comput. 4 863 (1) (2015) 1–15. 864

[20] V. Mann, A. Kumar, P. Dutta, S. Kalyanaraman, Vmflow: leveraging vm mo- 865 bility to reduce network power costs in data centers, in: Proceedings of the 866 NETWORKING 2011, Springer, 2011, pp. 198–211. 867

[21] D. Meisner, B.T. Gold, T.F. Wenisch, Powernap: eliminating server idle power, 868 in: Proceedings of the ACM Sigplan Notices, vol. 44, ACM, 2009, pp. 205– 869 216. 870

[22] J. Leverich, M. Monchiero, V. Talwar, P. Ranganathan, C. Kozyrakis, Power man- 871 agement of datacenter workloads using per-core power gating, Comput. Archit. 872 Lett. 8 (2) (2009) 48–51. 873

[23] H. David, C. Fallin, E. Gorbatov, U.R. Hanebutte, O. Mutlu, Memory power man- 874 agement via dynamic voltage/frequency scaling, in: Proceedings of the 8th 875 ACM international conference on Autonomic computing, ACM, 2011, pp. 31–40. 876

[24] Y. Xia, T. Wang, Z. Su, M. Hamdi, Fine-grain power control for combined input- 877 crosspoint queued switches, in: Proceedings of the Globecom, IEEE, 2014. 878

[25] D. Abts, M.R. Marty, P.M. Wells, P. Klausler, H. Liu, Energy proportional data- 879 center networks, in: Proceedings of the ACM SIGARCH Computer Architecture 880 News, vol. 38, ACM, 2010, pp. 338–347. 881

[26] U. Lee, I. Rimac, V. Hilt, Greening the internet with content-centric networking, 882 in: Proceedings of the 1st International Conference on Energy-Efficient Com- 883 puting and Networking, ACM, 2010, pp. 179–182. 884

[27] K. Guan, G. Atkinson, D.C. Kilper, E. Gulsen, On the energy efficiency of con- 885 tent delivery architectures, in: Proceedings of the 2011 IEEE International Con- 886 ference on Communications Workshops (ICC), IEEE, 2011, pp. 1–6. 887

[28] Í. Goiri, K. Le, T.D. Nguyen, J. Guitart, J. Torres, R. Bianchini, Greenhadoop: 888 leveraging green energy in data-processing frameworks, in: Proceedings of 889 the 7th ACM European Conference on Computer Systems, ACM, 2012, pp. 57– 890 70. 891

[29] M. Arlitt, C. Bash, S. Blagodurov, Y. Chen, T. Christian, D. Gmach, C. Hyser, 892 N. Kumari, Z. Liu, M. Marwah, et al., Towards the design and operation of net- 893 zero energy data centers, in: Proceedings of the 2012 13th IEEE Intersociety 894 Conference on Thermal and Thermomechanical Phenomena in Electronic Sys- 895 tems (ITherm), IEEE, 2012, pp. 552–561. 896

[30] R. Brightwell, K. Pedretti, K.D. Underwood, Initial performance evaluation of 897 the cray seastar interconnect, in: Proceedings of the 13th Symposium on High 898 Performance Interconnects, 20 05, IEEE, 20 05, pp. 51–57. 899

[31] R. Alverson, D. Roweth, L. Kaplan, The gemini system interconnect, in: Pro- 900 ceedings of the 2010 IEEE 18th Annual Symposium on High Performance In- 901 terconnects (HOTI), IEEE, 2010, pp. 83–87. 902

[32] N.R. Adiga, M.A. Blumrich, D. Chen, P. Coteus, A. Gara, M.E. Giampapa, P. Hei- 903 delberger, S. Singh, B.D. Steinmacher-Burow, T. Takken, et al., Blue gene/l torus 904 interconnection network, IBM J. Res. Develop. 49 (2.3) (2005) 265–276. 905

[33] D. Chen, N.A. Eisley, P. Heidelberger, R.M. Senger, Y. Sugawara, S. Kumar, 906 V. Salapura, D. Satterfield, B. Steinmacher-Burow, J. Parker, The IBM blue 907 gene/q interconnection fabric, Micro, IEEE 32 (1) (2012) 32–43. 908

[34] W.J. Dally, H. Aoki, Deadlock-free adaptive routing in multicomputer networks 909 5. 910 in: 911 ut- 912

913 ro- 914 nd 915

916 ed 917

ual 918 9– 919

920 ck- 921 is- 922

923 age 924 rk, 925

926 or- 927

928 et- 929 u- 930

931

(4) (2008) 75–86.

[4] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, S. Lu, Bcua high performance, server-centric network architecture for modular data c

ters, ACM SIGCOMM Comput. Commun. Rev. 39 (4) (2009) 63–74. [5] G. Wang, D. Andersen, M. Kaminsky, K. Papagiannaki, T. Ng, M. Kozu

M. Ryan, c-Through: Part-time optics in data centers, in: P roceedings

the ACM SIGCOMM Computer Communication Review, vol. 40, ACM, 20pp. 327–338.

[6] N. Farrington, G. Porter, S. Radhakrishnan, H. Bazzaz, V. Subramanya, Y. Faman, G. Papen, A. Vahdat, Helios: a hybrid electrical/optical switch archit

ture for modular data centers, in: Proceedings of the ACM SIGCOMM CompuCommunication Review, vol. 40, ACM, 2010, pp. 339–350.

[7] T. Wang, Z. Su, Y. Xia, Y. Liu, J. Muppala, M. Hamdi, Sprintnet: a high perf

mance server-centric network architecture for data centers, in: Proceedingsthe 2014 IEEE International Conference on Communications (ICC), IEEE, 20

pp. 4005–4010. [8] T. Wang, Z. Su, Y. Xia, J. Muppala, M. Hamdi, Designing efficient high p

formance server-centric data center network architecture, Comput. Netw. (2015) 283–296.

[9] H. Abu-Libdeh, P. Costa, A. Rowstron, G. O’Shea, A. Donnelly, Symbiotic rout

in future data centers, ACM SIGCOMM Comput. Commun. Rev. 40 (4) (2051–62.

[10] J.-Y. Shin, B. Wong, E.G. Sirer, Small-world datacenters, in: Proceedings of 2nd ACM Symposium on Cloud Computing, ACM, 2011, p. 2.

[11] T. Wang, Z. Su, Y. Xia, B. Qin, M. Hamdi, Novacube: a low latency torus-banetwork architecture for data centers, in: Proceedings of the 2014 IEEE Glo

Communications Conference (GLOBECOM), IEEE, 2014, pp. 2252–2257. [12] T. Wang, Z. Su, Y. Xia, M. Hamdi, Clot: a cost-effective low-latency overl

torus-based network architecture for data centers, in: Proceedings of the 20

IEEE International Conference on Communications (ICC), IEEE, 2015, pp. 5475484.

[13] T. Wang, Z. Su, Y. Xia, M. Hamdi, Rethinking the data center networking: arctecture, network protocols, and resource sharing, Access, IEEE 2 (2014) 148

1496.



using virtual channels, IEEE Trans. Parallel Distrib. Syst. 4 (4) (1993) 466–47[35] L.G. Valiant, G.J. Brebner, Universal schemes for parallel communication,

Proceedings of the Thirteenth Annual ACM Symposium on Theory of Comp

ing, ACM, 1981, pp. 263–277. [36] T. Nesson, S.L. Johnsson, Romm routing on mesh and torus networks, in: P

ceedings of the Seventh Annual ACM Symposium on Parallel Algorithms aArchitectures, ACM, 1995, pp. 275–287.

[37] A. Singh, W.J. Dally, B. Towles, A.K. Gupta, Locality-preserving randomizoblivious routing on torus networks, in: Proceedings of the Fourteenth Ann

ACM Symposium on Parallel Algorithms and Architectures, ACM, 2002, pp.

13. [38] L. Gravano, G.D. Pifarre, P.E. Berman, J.L. Sanz, Adaptive deadlock-and livelo

free routing with all minimal paths in torus networks, IEEE Trans. Parallel Dtrib. Syst. 5 (12) (1994) 1233–1251.

[39] N. Farrington, E. Rubow, A. Vahdat, Data center switch architecture in the of merchant silicon, in: Proceedings of the IEEE Hot Interconnects, New Yo

2009.

[40] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, Mgan Kaufmann, 2004.

[41] Y. Zhang, N. Ansari, Hero: hierarchical energy optimization for data center nworks, in: Proceedings of the 2012 IEEE International Conference on Comm

nications (ICC), IEEE, 2012, pp. 2924–2928.




ARTICLE IN PRESS


[42]932 933 934

[43]935 936 937

[44] 38 39 40

[45] 41 42

[46] 43

Pl

ca

Y. Zhang, N. Ansari, On architecture design, congestion notification, TCP incast and power consumption in data centers, Commun. Surv. Tutor. IEEE 15

(1) (2013) 39–64. A. Greenberg, J. Hamilton, D. Maltz, P. Patel, The cost of a cloud: research prob-

lems in data center networks, ACM SIGCOMM Comput. Commun. Rev. 39 (1) (2008) 68–73.



S. Pelley, D. Meisner, T.F. Wenisch, J.W. VanGilder, Understanding and abstract- 9ing total data center power, in: Proceedings of the Workshop on Energy- 9Efficient Design, 2009. 9

H. Kung, R. Morris, Credit-based flow control for ATM networks, IEEE Netw. 9 9(2) (1995) 40–48. 9

http://www.nsnam.org . 9


http://www.nsnam.org


article in press - ایران عرضهiranarze.ir/wp-content/uploads/2016/11/e682.pdf · article in...

Documents