BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers
B96611024 謝宗廷   B96b02016 張煥基

Outline
- Introduction
- BCube structure
- BCube source routing
- Other design issues
- Graceful degradation
- Implementation architecture
- Conclusion

Introduction
Organizations now use modular data centers (MDCs): shorter deployment time, higher system and power density, and lower cooling and manufacturing costs. BCube is a high-performance and robust network architecture for MDCs. BCube is designed to support all of the common traffic patterns well: one-to-one, one-to-several, one-to-all, and all-to-all.

Bandwidth-intensive application support
- One-to-one: one server moves data to another server (disk backup).
- One-to-several: one server transfers the same copy of data to several receivers (distributed file systems).
- One-to-all: a server transfers the same copy of data to all the other servers in the cluster (broadcast).
- All-to-all: every server transmits data to all the other servers (MapReduce).

BCUBE STRUCTURE

BCube construction (BCube_k, n)

1. Servers connect only to switches, and switches connect only to servers; there are no server-to-server or switch-to-switch links.
2. k denotes the level and n the number of ports per switch.
3. A BCube_k is built recursively from n BCube_{k-1}s; the base case BCube_0 is simply n servers connected to one n-port switch.
4. Each server in a BCube_k has k+1 ports, and every switch has n ports.

BCube_1

1. With n = 4, a BCube_0 has 4 servers. A BCube_1 is built from 4 BCube_0s plus 4-port level-1 switches; each level-1 switch connects one server from each BCube_0.
2. A switch never connects to two servers in the same BCube_0.

BCube_2 (n = 4)

1. A BCube_2 is built from 4 BCube_1s, each with 16 servers, for 64 servers in total.
2. The BCube_2 adds a level-2 layer of 16 switches (n^2 = 16); it has 4^3 = 64 servers. (A construction sketch follows.)
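The recursion can also be stated flat, which makes the counts above easy to check. A minimal Python sketch (the function name and the switch keying are ours, not from the paper):

```python
from itertools import product

def bcube_links(n, k):
    """Return the (server, switch) links of a BCube(n, k).

    A server is a (k+1)-digit address tuple; its level-i port attaches
    to the level-i switch shared by exactly the n servers that agree
    with it on every digit except digit i. We key a switch by
    (level, the k remaining digits).
    """
    links = []
    for server in product(range(n), repeat=k + 1):
        for i in range(k + 1):                       # one link per port/level
            switch = (i, server[:i] + server[i + 1:])
            links.append((server, switch))
    return links

# BCube_1 with n = 4 (the slide's example): 16 servers, 8 switches, 32 links.
links = bcube_links(4, 1)
servers = {s for s, _ in links}
switches = {w for _, w in links}
assert (len(servers), len(switches), len(links)) == (16, 8, 32)
```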

Each server in a BCube_k has k+1 ports, which are numbered from level-0 to level-k. A BCube_k has N = n^{k+1} servers and k+1 levels of switches, with each level having n^k n-port switches. A server in a BCube_k is identified by an address array a_k a_{k-1} ... a_0 (a_i in [0, n-1]).

Recursive view: a BCube_k is built from n BCube_{k-1}s plus n^k level-k switches, with each server's level-k port attached to a level-k switch; each BCube_{k-1} is in turn built from n BCube_{k-2}s.

Single-path Routing in BCube
- We use h(A, B) to denote the Hamming distance of two servers A and B, i.e., the number of digits at which their address arrays differ.
- Two servers are neighbors if they connect to the same switch; the Hamming distance of two neighboring servers is one. More specifically, two neighboring servers that connect to the same level-i switch differ only at the i-th digit in their address arrays.
- Routing from A to B therefore corrects the differing digits one at a time: a hop through a level-i switch fixes the i-th digit.

Diameter
The diameter of a BCube_k, i.e., the longest shortest path among all server pairs, is k+1: a path corrects at most the k+1 digits from digit 0 to digit k, one per switch hop. Since k is a small integer, typically at most 3, BCube is a low-diameter network.

Multi-paths for One-to-one Traffic
- Two parallel paths between a source server and a destination server exist if they are node-disjoint, i.e., the intermediate servers and switches on one path do not appear on the other.
- The number of parallel paths between two servers is upper-bounded by k+1, since each server has only k+1 links.
- In fact, there are exactly k+1 parallel paths between any two servers in a BCube_k: h(A, B) paths in the first category (paths that permute the order in which the differing digits are corrected) and k+1-h(A, B) paths in the second category (paths that detour through a digit at which A and B agree). The maximum path length of the paths constructed by BuildPathSet is k+2. (A sketch of both constructions follows.)
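A minimal Python sketch of both ideas, following the construction just described. Digit i of an address is tuple index i; the function names and the particular detour value c are our choices (the paper's BuildPathSet fixes its permutations more carefully):

```python
def single_path(src, dst, order):
    """BCubeRouting-style path: correct the digits of src toward dst
    in the given order; returns the list of server hops."""
    path, cur = [src], list(src)
    for i in order:
        if cur[i] != dst[i]:
            cur[i] = dst[i]                  # one hop via a level-i switch
            path.append(tuple(cur))
    return path

def build_path_set(src, dst, n, k):
    """Sketch of the k+1 node-disjoint paths between src and dst."""
    diff = [i for i in range(k + 1) if src[i] != dst[i]]
    same = [i for i in range(k + 1) if src[i] == dst[i]]
    paths = []
    # Category 1: h(A,B) paths; rotating the order in which the
    # differing digits are corrected keeps them node-disjoint.
    for r in range(len(diff)):
        paths.append(single_path(src, dst, diff[r:] + diff[:r]))
    # Category 2: k+1-h(A,B) paths; each detours through servers whose
    # digit i (where src and dst agree) is first altered, then restored.
    for i in same:
        c = (src[i] + 1) % n                 # any value != src[i] works
        start = list(src); start[i] = c      # first hop: change digit i
        end = list(dst); end[i] = c
        mid = single_path(tuple(start), tuple(end), diff)
        paths.append([src] + mid + [dst])    # last hop: restore digit i
    return paths

# Example in a BCube_3 (n = 8): h(A,B) = 2, so 2 + 2 = 4 parallel paths.
A, B = (0, 0, 0, 0), (1, 2, 0, 0)
assert len(build_path_set(A, B, n=8, k=3)) == 4
```

Note the category-2 paths have k+2 hops, matching the maximum path length stated above.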

It is easy to see that BCube also supports several-to-one and all-to-one traffic patterns well.

Speedup for One-to-several Traffic
- The source and its chosen neighbors form edge-disjoint complete graphs (see the example below); these complete graphs can speed up data replication in distributed file systems.
- src has n-1 choices for each d_i, so src can build (n-1)^{k+1} such complete graphs.
- When a client writes a chunk to r chunk servers, it sends 1/r of the chunk to each chunk server. This is r times faster than the pipeline model.

Example (5-digit addresses):
Source: 00000. We want to build a complete graph over 00001, 00010, 00100, 01000, 10000 (the source's neighbors, one per level).
The edges are realized as two-hop paths; e.g., the edges from 01000 to 00001, 00010, and 00100 are:
01000 → 01001 → 00001
01000 → 01010 → 00010
01000 → 01100 → 00100
(A sketch of this edge construction follows.)
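A sketch of how each edge of the complete graph arises: the path between neighbors d_i and d_j goes through the server that differs from the source at both digits. The names `replace` and `edge` are ours:

```python
def replace(addr, i, v):
    """Return addr with digit i set to v."""
    a = list(addr)
    a[i] = v
    return tuple(a)

def edge(src, i, vi, j, vj):
    """Two-hop path between neighbors d_i (src with digit i -> vi) and
    d_j (src with digit j -> vj), via the server that differs from
    src at both digits."""
    d_i, d_j = replace(src, i, vi), replace(src, j, vj)
    w = replace(d_i, j, vj)            # differs from src at digits i and j
    return [d_i, w, d_j]

# Digits written left to right as on the slide: 01000 -> 01001 -> 00001.
src = (0, 0, 0, 0, 0)
print(edge(src, 1, 1, 4, 1))   # [(0,1,0,0,0), (0,1,0,0,1), (0,0,0,0,1)]
```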

Speedup for One-to-all Traffic
- In one-to-all, a source server delivers a file to all the other servers.
- It is easy to see that under tree and fat-tree, the time for all the receivers to receive a file of size L is at least L (at unit link speed): the source's single link must carry the whole file.
- A source can deliver a file of size L to all the other servers in L/(k+1) time in a BCube_k, by constructing k+1 edge-disjoint server spanning trees from the k+1 neighbors of the source.
- When a source distributes a file to all the other servers, it splits the file into k+1 parts and simultaneously delivers all the parts via the different spanning trees. (The timing is summarized below.)
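Restating the timing claim as a formula (link speed normalized to 1, as the slide's use of L suggests): every part streams down its own edge-disjoint spanning tree in parallel, so

$$T_{\text{tree, fat-tree}} \;\ge\; L, \qquad T_{\text{BCube}_k} \;=\; \frac{L}{k+1}.$$

With k = 3 this is a 4x improvement.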

Aggregate Bottleneck Throughput for All-to-all Traffic
- The flows that receive the smallest throughput are called the bottleneck flows.
- The aggregate bottleneck throughput (ABT) is defined as the number of flows times the throughput of the bottleneck flow.
- For BCube, ABT = n/(n-1) × (N-1), where n is the switch port number and N is the number of servers. For example, a full BCube_3 with n = 8 (N = 4096) and 1 Gb/s links gives ABT = 8/7 × 4095 ≈ 4680 Gb/s.

BCUBE SOURCE ROUTING

The source probes the k+1 parallel paths to the destination with probe packets; intermediate servers and the destination process the probes as described below.

Source:
- obtains the k+1 parallel paths and then probes these paths;
- if one path is found not available, the source uses the Breadth-First Search (BFS) algorithm to find another parallel path: it removes the existing parallel paths and the failed links from the BCube_k graph, and then uses BFS to search for a path;
- after such a failure, the number of parallel paths must be smaller than k+1.

Intermediate servers:
- Case 1: if the next hop is not available, the server returns a path failure message (which includes the failed link) to the source.
- Case 2: otherwise, it updates the available bandwidth field of the probe packet if its available bandwidth is smaller than the existing value.

Destination:
When a destination server receives a probe packet, it first updates the available bandwidth field of the probe packet if the available bandwidth of the incoming link is smaller than the value carried in the probe packet. It then sends the value back to the source in a probe response message.

(A sketch of the source-side path maintenance follows.)
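A minimal Python sketch of the source-side path maintenance, under assumptions of ours: the topology is an adjacency dict over server and switch nodes, and `find_replacement` implements the BFS fallback just described:

```python
from collections import deque

def bfs_path(graph, src, dst, banned_nodes, banned_links):
    """Breadth-first search for a path from src to dst that avoids
    the given nodes and links; returns the path or None."""
    parent = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph[u]:
            if v in parent or v in banned_nodes:
                continue
            if (u, v) in banned_links or (v, u) in banned_links:
                continue
            parent[v] = u
            q.append(v)
    return None

def find_replacement(graph, src, dst, live_paths, failed_links):
    """On a path failure, search for a new path that is node-disjoint
    from the still-live parallel paths and avoids the failed links.
    After a failure, fewer than k+1 parallel paths may remain."""
    used = {node for p in live_paths for node in p[1:-1]}  # intermediates
    return bfs_path(graph, src, dst, used, set(failed_links))
```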

5. OTHER DESIGN ISSUES
5.1 Partial BCube

Why a partial BCube? In some cases, it may be difficult or unnecessary to build a complete BCube structure. For example, when n = 8 and k = 3, a BCube_3 has 8^4 = 4096 servers. However, due to space constraints, we may only be able to pack 2048 servers.

To build a partial BCube_k:
(1) build the BCube_{k-1}s;

(2) use partial layer-k switches to interconnect the BCube_{k-1}s.

Example [figures omitted]

Solution

When building a partial BCube_k, we first build the needed BCube_{k-1}s, and then connect them using a full layer of layer-k switches.

Pros and cons of full layer-k switches:
- BCubeRouting performs just as in a complete BCube, and BSR works as before.
- The switches in layer k are not fully utilized.

5. OTHER DESIGN ISSUES
5.2 Packaging and Wiring

Condition
We show how packaging and wiring can be addressed for a container with 2048 servers and 1280 8-port switches (a partial BCube_3 with n = 8 and k = 3).
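As a sanity check on the 1280 figure (our accounting, not spelled out on the slides): 2048 servers populate half of a full BCube_3, so layers 0 to 2 each need half of their full complement of 8^3 = 512 switches, while layer 3 is built full, as in the solution above:

$$\underbrace{3 \times 256}_{\text{layers 0--2, half-populated}} \;+\; \underbrace{512}_{\text{full layer 3}} \;=\; 1280 \text{ switches.}$$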

40-feet container [figure omitted]

One rack:

16 layer-1, 8 layer-2, 16 layer-3 switches

One rack = one BCube_1:

64 servers, 16 8-port switches.

One super-rack = one BCube_2.

The level-2 wires are within a super-rack, and the level-3 wires are between super-racks.

5. OTHER DESIGN ISSUES
5.3 Routing to External Networks

We assume that both internal and external computers use TCP/IP.

We propose aggregators and gateways for external communication. We can use a 48×1G + 1×10G aggregator to replace several mini-switches and use the 10G link to connect to the external network.

The servers that connect to the aggregator become gateways. When an internal server sends a packet to an external IP address:
(1) it chooses one of the gateways;

(2) the packet is routed to the gateway using BSR (BCube Source Routing);
(3) after the gateway receives the packet, it strips the BCube protocol header and forwards the packet to the external network via the 10G uplink.

(A sketch of the gateway choice follows.)
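A minimal sketch of step (1), with one assumption of ours: gateways are chosen by hashing the flow, so that all packets of a flow use the same gateway (the slides only say "choose one of the gateways"):

```python
import hashlib
from typing import Sequence, Tuple

Flow = Tuple[str, int, str, int, str]   # src ip, src port, dst ip, dst port, proto

def pick_gateway(gateways: Sequence[str], flow: Flow) -> str:
    """Choose a gateway for an external-bound flow. Hashing the flow
    keeps its packets on one gateway (this policy is our assumption)."""
    h = int(hashlib.md5(repr(flow).encode()).hexdigest(), 16)
    return gateways[h % len(gateways)]

# Example: three gateway servers under one 48x1G + 1x10G aggregator.
gateways = ["0.0.0.1", "0.0.0.2", "0.0.0.3"]
flow = ("10.0.1.5", 3550, "93.184.216.34", 80, "tcp")
gw = pick_gateway(gateways, flow)
# The packet is then BSR-routed to gw; the gateway strips the BCube
# header and forwards the packet via the 10G uplink.
```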

6. GRACEFUL DEGRADATION

Aggregate bottleneck throughput (ABT) reflects the all-to-all network capacity.

ABT = (throughput of the bottleneck flow) × (total number of flows in the all-to-all traffic model)

Graceful degradation means that as server or switch failures increase, ABT decreases slowly and there are no dramatic performance drops.


In this section, we use simulations to compare the aggregate bottleneck throughput (ABT) of BCube, fat-tree [1], and DCell [9], under random server and switch failures.

The fat-tree [figure omitted]

DCell [figure omitted]

Figure 1: A DCell_1 network when n = 4. It is composed of 5 DCell_0 networks. When we consider each DCell_0 as a virtual node, these virtual nodes form a complete graph.

Assumptions: all the links are 1 Gb/s and there are 2048 servers.

Switches: we use 8-port switches to construct the network structures.

BCube: the network we use is a partial BCube_3 with n = 8 that consists of 4 full BCube_2s.

Fat-tree: five layers of switches, with layers 0 to 3 having 512 switches per layer and layer 4 having 256 switches.

DCell: a partial DCell_2 which contains 28 full DCell_1s and one partial DCell_1 with 32 servers.

Results with no failures:
- Only BCube provides both high ABT and graceful degradation.
- When there is no failure, both BCube and fat-tree provide high ABT values: 2006 Gb/s for BCube and 1895 Gb/s for fat-tree.
- DCell's ABT is only 298 Gb/s. First, the traffic is imbalanced at different levels of links in DCell. Second, the partial DCell makes the traffic imbalanced even for links at the same level (even with load-balancing).

ABT under server failure [figure omitted]

ABT under switch failure [figure omitted]

BCube performs well under both server and switch failures; the degradation is graceful. When the switch failure ratio reaches 20%:
- fat-tree ABT: 267 Gb/s
- BCube ABT: 765 Gb/s

7. IMPLEMENTATION AND EVALUATION

Implementation Architecture

BCube stack: we have prototyped the BCube architecture by designing and implementing a BCube protocol stack.

The BCube stack components:
- BSR protocol: routing.
- Neighbor maintenance protocol: maintains a neighbor status table.
- Packet sending/receiving part: interacts with the TCP/IP stack.
- Packet forwarding engine: relays packets for other servers.

BCube packet

BCube header

The BCube header carries:
- source and destination BCube addresses
- packet id
- protocol type
- payload length
- header checksum

BCube stores the complete path and a next hop index (NHI) in the header of every BCube packet. (An illustrative header layout follows.)
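An illustrative rendering of the header in Python. The dataclass, the field types, and the encoding of the path as an array of NHA values are our assumptions; the slides list the fields but not the exact layout:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Addr = Tuple[int, ...]   # BCube address array (a_k, ..., a_0)

@dataclass
class BCubeHeader:
    src: Addr                      # source BCube address
    dst: Addr                      # destination BCube address
    packet_id: int
    protocol_type: int
    payload_length: int
    header_checksum: int
    path: List[int] = field(default_factory=list)  # complete path as NHA values
    nhi: int = 0                   # next hop index into `path`

    def next_hop(self) -> int:
        """Return the next hop array (NHA) entry and advance the index."""
        nha = self.path[self.nhi]
        self.nhi += 1
        return nha
```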

NHA (next hop address) [figure omitted]

Packet Forwarding Engine
- Relays packets for other servers.
- We have designed an efficient packet forwarding engine which decides the next hop of a packet by only one table lookup.

Neighbor status table entries:
(1) NeighborMAC: the MAC address of the neighbor;
(2) OutPort: the port that connects to the neighbor;
(3) StatusFlag: whether the neighbor is available.

Packet forwarding procedure: sending packets to the next hop.

The engine then extracts the status and the MAC address of the next hop, using the NHA value as the index. (A sketch follows.)
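A minimal sketch of the one-lookup forwarding step, reusing the illustrative header above. Our rendering: the neighbor status table is a flat array indexed directly by the NHA value, with the three fields just listed:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NeighborEntry:
    neighbor_mac: str    # (1) NeighborMAC
    out_port: int        # (2) OutPort
    status_ok: bool      # (3) StatusFlag

def forward(header, table: List[NeighborEntry]) -> Optional[NeighborEntry]:
    """Decide the next hop with a single table lookup: the NHA value
    taken from the header's path indexes straight into the table."""
    nha = header.path[header.nhi]
    entry = table[nha]               # the one table lookup
    if not entry.status_ok:
        return None                  # caller sends a path failure message
    header.nhi += 1                  # advance the next hop index
    # Rewrite the destination MAC to entry.neighbor_mac and send the
    # packet out of entry.out_port.
    return entry
```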

CONCLUSION

- BCube is a novel network architecture for shipping-container-based modular data centers (MDCs).

- It accelerates one-to-x traffic patterns and provides high network capacity for all-to-all traffic.
- Future work: how to scale the server-centric design from a single container to multiple containers.