1
Data Center Network Multipathing
Peregrine: An All-Layer-2 Container Computer Network
Tzi-cker Chiueh*§, Cheng-Chun Tu*§, Yu-Cheng Wang§, Pai-Wei Wang§, Kai-Wen Li§, Yu-Ming Huang§
*Industrial Technology Research Institute, Taiwan
§Computer Science Department, Stony Brook University
IEEE Cloud 2012
Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute
Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao
Department of Electrical and Computer Engineering, Polytechnic Institute of New York University
2010 18th IEEE Symposium on High Performance Interconnects
Presenter: Jason, Tsung-Cheng, HOU
Advisor: Wanjiun Liao
May 17th, 2012
2
Motivation
• Summarize features of the popular multi-root Clos / fat-tree data center topology
– Take ITRI’s prototype as an example
• Survey solutions to multipathing
• Recap Jin-Jia Chang’s presentation on QCN
• Present another solution to multipathing
• Compare several multipathing methods
3
Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
Peregrine: An All-Layer-2 Container Computer Network
Tzi-cker Chiueh*§, Cheng-Chun Tu*§, Yu-Cheng Wang§, Pai-Wei Wang§, Kai-Wen Li§, Yu-Ming Huang§
*Industrial Technology Research Institute, Taiwan
§Computer Science Department, Stony Brook University
IEEE Cloud 2012
4
Multi-Root Clos / Fat-Tree
• Adopted by various publications
– VL2, PortLand, BCube, ElasticTree, Peregrine
• Scale-out, cheap commodity switches
• Every path traverses a fixed maximum number of switches / hops
– If packets never bounce back up, there is no routing loop
• Nearly full bisection bandwidth, multipathing, symmetric
• Potentially very large routing tables
• Up and down paths are handled differently
• High rate but limited capability: buffer, CPU, ...
5
High rate but limited capability
• All-L2 Ethernet switches
• Up to 1 GE or 10 GE links, dozens of ports
• Limited buffer: hundreds of KB
• Limited CPU capability: a processing bottleneck
• Limited flow table entries: a few dozen thousand at most
• Optimized for fast table lookups
• Take Peregrine as an example
– ITRI’s industrial prototype, built from commodity products
– Others are mostly experimental or high-end
6
Topology: Folded Clos
[Figure: folded Clos topology, showing a rack, a container of 12 racks, and cross-container links]
7
Within One Rack
• 48 servers × 2 CPUs per server = 96 CPUs
• 48 servers × 4 × 1 GE NICs = 192 ports
• 4 ToR switches × 48 × 1 GE = 192 GE
12 server racks in one container
⇒ 576 servers, 1152 CPUs, 2304 GE, 2304 ports
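The per-rack and per-container totals on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the constant and function names are illustrative and do not come from the Peregrine paper, and the figure of 4 NICs per server is read from the 192-port total.

```python
# Per-rack figures as read from the slide (names are illustrative).
SERVERS_PER_RACK = 48
CPUS_PER_SERVER = 2
NICS_PER_SERVER = 4      # 1 GE each
TOR_PER_RACK = 4
PORTS_PER_TOR = 48       # 1 GE each
RACKS_PER_CONTAINER = 12


def rack_totals():
    """Return CPU and port counts for a single rack."""
    return {
        "cpus": SERVERS_PER_RACK * CPUS_PER_SERVER,
        "server_ports": SERVERS_PER_RACK * NICS_PER_SERVER,
        "tor_ports": TOR_PER_RACK * PORTS_PER_TOR,
    }


def container_totals():
    """Scale the per-rack figures to the 12 server racks of one container."""
    rack = rack_totals()
    return {
        "servers": SERVERS_PER_RACK * RACKS_PER_CONTAINER,
        "cpus": rack["cpus"] * RACKS_PER_CONTAINER,
        "server_ge": rack["server_ports"] * RACKS_PER_CONTAINER,
    }
```

Note that the server-side NIC ports and the ToR ports match exactly (192 each), so every ToR port in a rack faces a server.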
8
Within One Container
12 6 6
5-to-5 per rack
But only 4 ports
• 5 Agg. switches × 48 × 10 GE
• 12 storage servers × 40 disks
⇒ 2400 GE between Agg-ToR
⇒ 2304 GE between ToR-Server
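The two aggregate figures on this slide can be compared directly. The sketch below recomputes the Agg-ToR capacity and the ratio of server-facing to aggregation-facing bandwidth; the names are illustrative, and treating that ratio as an oversubscription measure is an interpretation, not something the slide states.

```python
# Figures as read from the slide (names are illustrative).
AGG_SWITCHES = 5
PORTS_PER_AGG = 48
AGG_PORT_GBPS = 10
SERVER_FACING_GBPS = 2304   # ToR-Server total from the slide


def agg_capacity_gbps():
    """Total bandwidth between the aggregation and ToR layers."""
    return AGG_SWITCHES * PORTS_PER_AGG * AGG_PORT_GBPS


def oversubscription(server_gbps, agg_gbps):
    """Ratio of server-facing to aggregation-facing bandwidth.

    A value at or below 1.0 means the upper layer is not oversubscribed.
    """
    return server_gbps / agg_gbps
```

With the slide's numbers the ratio is 2304 / 2400 = 0.96, which is consistent with the "nearly full bisection" property claimed for this topology.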
9
DS and RAS
• Directory Server
– Address association, management, and reuse
– Performs IP-MAC lookups and mappings
– Pushes mapping updates to end hosts
• Route Algorithm Server
– Collects entries of the traffic matrix (TM)
– Runs load-balancing algorithms based on the TM
– Distributes routing entries to switches and updates the DS
• Works within one container; cross-container operation is unclear
• Scalability and VM mobility are unclear
(The paper only refers to something like Mobile IP)
10
Routing, Balancing, and Tolerance
• Hosts apply to the DS for addresses
• A kernel agent redirects ARP to the DS
• Each MAC forms a spanning tree
– Two STs may overlap, but node-pair paths cannot
• Four MACs per host: MAC-in-MAC encapsulation
– (Direct, Indirect) × (Primary, Backup)
– ToR or vSwitch as an intermediary
– Dual-mode, two-stage
• Switch → RAS → DS → Host: altering the dst-MAC alters the route
– Routes change on failover or load balancing
11
Logical Architecture
12
Dual-Mode Forwarding
13
Switching to Backup
14
ITRI Container Computer Prototype
• 6.096 m shipping container
• 12 server racks, 12 storage racks
• All-L2 network, commodity switches
• “Folded” Clos topology
• Directory Server, Route Algorithm Server
• Unclear: load-balancing algorithm, VM mobility, DS-RAS scalability, cross-container operation
• In the future: OpenFlow, OpenStack
(Currently not using OpenFlow to connect switches; how the switches are controlled is unclear)
15
Discussions
• Spanning tree for multipathing and load balancing: simple, but limited flexibility
• How to plug and play? Is it scalable?
– A new switch leads to reconfiguration
– Does VM migration affect the TM and direct routes?
• DS-RAS: a simple version of a controller
– But its mechanism and performance are unclear
• Seems to be trying to combine various advantages: address mapping, ST multipathing, converged network, folded Clos
16
Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
17
Multipathing
• VLB (Valiant Load Balancing):
– Traffic is split across intermediate points
– Automatically balances load
– Great in the ideal case, but subject to packet reordering
• ECMP hashing:
– Different hash functions make a big difference
– A flow always sticks to one path for its whole lifetime
• Hedera:
– Flow-to-core mapping, flow scheduling
– Requires global information, higher complexity
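The difference between ECMP hashing and VLB-style spreading can be sketched as follows. This is an illustrative example, not code from any of the surveyed systems: the function names, the 5-tuple layout, and the use of SHA-256 as the hash are all assumptions (real switches use simpler hardware hashes).

```python
import hashlib
import random


def ecmp_path(flow, n_paths):
    """Pick a path by hashing the flow's header fields.

    `flow` is a hypothetical 5-tuple (src_ip, dst_ip, src_port, dst_port,
    proto). The same flow always maps to the same path, so packets are
    not reordered, but two large flows may collide on one path.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def vlb_path(n_paths, rng):
    """Pick a random intermediate for each packet (VLB-style spreading).

    Load is balanced on average, but consecutive packets of one flow may
    take different paths and arrive out of order.
    """
    return rng.randrange(n_paths)


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.1.7", 40000, 80, "tcp")
    print("ECMP path:", ecmp_path(flow, 4))
    rng = random.Random(0)
    print("VLB paths:", [vlb_path(4, rng) for _ in range(5)])
```

Hedera sits on top of this kind of hashing: it keeps the hash as the default and uses global information to move large flows onto less loaded core switches.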
18
Multipathing
• Spanning Tree / VLAN (SPAIN):
– Near-static, pre-computation required, but simple
– Re-computes when the topology changes
– Segmentation of resources, limited flexibility
• Multipath TCP:
– One flow, many parallel paths
– VLAN-based routing in the publication (like SPAIN)
– Shifts traffic to less congested paths
– A new transport mechanism, adaptive
– Still with segmentation of resources
19
Multipathing References
• M. Kodialam, T. V. Lakshman, and S. Sengupta, “Efficient and Robust Routing of Highly Variable Traffic”, HotNets, 2004.
• R. Zhang-Shen and N. McKeown, “Designing a Predictable Internet Backbone Network”, Third Workshop on Hot Topics in Networks (HotNets-III), November 2004.
• A. Greenberg et al., “VL2: A Scalable and Flexible Data Center Network”, ACM SIGCOMM, 2009.
• R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, “PortLand: A Scalable, Fault-Tolerant Layer 2 Data Center Network Fabric”, ACM SIGCOMM, 2009.
• M. Al-Fares et al., “Hedera: Dynamic Flow Scheduling for Data Center Networks”, USENIX NSDI, 2010.
• J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul, “SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary Topologies”, USENIX NSDI, April 2010.
• C. Raiciu, C. Pluntke, S. Barre, A. Greenhalgh, D. Wischik, and M. Handley, “Data center networking with multipath TCP”, HotNets, 2010.
20
Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
Data center transport mechanisms: Congestion control theory and IEEE standardization
M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman
Communication, Control, and Computing, 2008 46th Annual Allerton Conference on
AF-QCN: Approximate fairness with quantized congestion notification for multitenanted data centers
A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar
High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on
21
Data Center Bridging Task Group
• Converged network
– LAN: no priority control; addressed by 802.1Qbb, Priority-based Flow Control
– FCoE (SAN): no congestion control; addressed by 802.1Qau, Quantized Congestion Notification
• Need to survey more on converged networks
– Respective features and requirements
– Could be a very important trend
22
QCN
• CP: Congestion Point
– A switch that monitors its queue: Q, Qeq, Qold
– Samples packets and sends Fb messages to the RP
– Fb is a combination of queue excess and rate excess
– Targets zero packet loss
• RP: Reaction Point
– A host with a Rate Limiter (RL), a Counter, and a Timer
– Probes for more bandwidth, similar to AIMD
– Decreases its rate according to the Fb message
– The Counter and the Timer both control the RL
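The CP and RP roles above can be sketched in a few lines. The feedback form Fb = -(Qoff + w·Qdelta), with Qoff = Q - Qeq and Qdelta = Q - Qold, follows the QCN description in the Alizadeh et al. paper cited for this section; the default values for the weight `w` and the gain `gd`, the cap on the rate cut, and all class and function names are assumptions made for illustration. Quantization of Fb is omitted.

```python
from dataclasses import dataclass


def cp_feedback(q, q_old, q_eq, w=2.0):
    """Congestion Point: combine queue excess and rate excess into Fb.

    q_off (q - q_eq) measures how far the queue is above its target;
    q_delta (q - q_old) measures how fast it is growing, i.e. rate excess.
    A negative Fb signals congestion.
    """
    q_off = q - q_eq
    q_delta = q - q_old
    return -(q_off + w * q_delta)


@dataclass
class ReactionPoint:
    """Host-side rate limiter reacting to Fb messages (simplified)."""
    current_rate: float
    target_rate: float
    gd: float = 1 / 128   # assumed gain; the rate cut is capped at 50%

    def on_feedback(self, fb):
        """Multiplicative decrease proportional to |Fb|."""
        if fb >= 0:
            return  # only negative feedback indicates congestion
        self.target_rate = self.current_rate
        cut = min(self.gd * abs(fb), 0.5)
        self.current_rate *= 1 - cut

    def recover(self):
        """One recovery step, triggered by the byte Counter or the Timer:
        move halfway back toward the rate held before the last decrease."""
        self.current_rate = (self.current_rate + self.target_rate) / 2
```

For example, a queue of 30 with Qold = 20 and Qeq = 22 gives Fb = -(8 + 2·10) = -28, and a sender at rate 10 that receives Fb = -32 drops to 7.5, then recovers to 8.75 after one step.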
23
QCN
24
QCN
25
AF-QCN
26
Modify Fb Msg to Imply More
27
Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute
Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao
Department of Electrical and Computer Engineering, Polytechnic Institute of New York University
28
Exploit Multipath Property
• Use QCN to further leverage redundancy
– Per-flow CN adjusts bandwidth: Spectral
– Relocating flows among paths: Spatial
– Both mitigate congestion
• Multiroot Clos / fat-tree topology
– Upward: could be randomized or rerouted
– Downward: destination based, deterministic
• Hashed ECMP: distributes the flow population
• Flow reroute: balances congested links
29
Reactive Reroute
• Edge switches count the QCN messages received per port
– Only edge switches reroute; this is considered sufficient
– Only for upward packets, not for downward
• Reroutes flows that are both elephant and congested, detected by counting QCNs in a short period
• Three reroute methods:
– Uniform random
– Minimum probability of congestion (conditional probability)
– Weighted combination of the two above
• Freezes a rerouted flow to avoid flapping
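The steps on this slide can be sketched as a small edge-switch model. This is not the authors' pseudo code: it implements only the uniform-random method, counts QCNs per flow rather than per port for brevity, and the class name, the threshold, the window length, and the freeze time are all assumed values chosen for illustration.

```python
import random


class EdgeSwitch:
    """Simplified reactive reroute at an edge switch (upward traffic only)."""

    def __init__(self, uplinks, threshold=3, window=0.001, freeze=0.01, rng=None):
        self.uplinks = list(uplinks)
        self.threshold = threshold   # QCNs needed to trigger a reroute
        self.window = window         # "short period", in seconds
        self.freeze = freeze         # hold time after a reroute, in seconds
        self.rng = rng or random.Random()
        self.flow_port = {}          # flow -> current uplink
        self.qcn_times = {}          # flow -> timestamps of recent QCNs
        self.frozen_until = {}       # flow -> time before which it stays put

    def assign(self, flow, port):
        """Record the initial (e.g. ECMP-hashed) uplink of a flow."""
        self.flow_port[flow] = port

    def on_qcn(self, flow, now):
        """Handle one QCN message for `flow`; return its uplink afterwards."""
        port = self.flow_port[flow]
        recent = [t for t in self.qcn_times.get(flow, []) if now - t <= self.window]
        recent.append(now)
        self.qcn_times[flow] = recent
        # Many QCNs in a short period mark the flow as large and congested.
        if len(recent) < self.threshold:
            return port
        # A recently rerouted flow is frozen to avoid flapping.
        if now < self.frozen_until.get(flow, float("-inf")):
            return port
        candidates = [p for p in self.uplinks if p != port]
        if not candidates:
            return port
        new_port = self.rng.choice(candidates)   # uniform random method
        self.flow_port[flow] = new_port
        self.frozen_until[flow] = now + self.freeze
        self.qcn_times[flow] = []
        return new_port
```

The other two methods from the slide would replace only the `rng.choice` line, picking the candidate with the lowest estimated congestion probability or a weighted mix of both choices.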
30
Algorithm Pseudo Code
Only when within a short period
31
NS-3 Simulation
• Simulation for 1 second
• Also a TCP simulation
32
Throughput and Latency
33
Outlier Latency
• Very large flows are throttled by L2 congestion control and therefore see large latency
• 60% are within 1 ms, but on average it takes 15 ms!
34
Discussion
• Why is Min. reroute always worse?
– Some flows’ paths overlap in the beginning
– Edge switches have no global information
– They receive QCNs from the same (port, agg)
⇒ Synchronized reroute
• Operate a centralized controller?
– Authors argue that the gain is very small
– But they do not present more on the “outliers”
– The flows with the longest latencies are the larger ones
– The larger flows could be vital connections
35
Discussion
• L2 congestion control protects TCP over UDP
• No packet loss, almost no incast problem
• The out-of-order problem is more severe for UDP
• However, because the switch buffer is tightly monitored, the number of out-of-order packets is bounded by 5nr/s (n: buffer size, r: sending rate, s: link rate)
• Freezing a rerouted flow also limits reordering
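The reordering bound 5nr/s quoted on this slide can be evaluated directly. The sketch below only plugs numbers into that expression; the function name is illustrative, and taking the buffer size n in packets is an assumption, since the slide does not state the unit.

```python
def max_out_of_order(buffer_size, sending_rate, link_rate):
    """Upper bound 5*n*r/s on out-of-order packets after a reroute.

    buffer_size  -- n, switch buffer size (assumed to be in packets)
    sending_rate -- r, the flow's sending rate
    link_rate    -- s, the link rate, in the same unit as r
    """
    if link_rate <= 0:
        raise ValueError("link_rate must be positive")
    return 5 * buffer_size * sending_rate / link_rate


if __name__ == "__main__":
    # A flow sending at 1 Gb/s over a 10 Gb/s link, 100-packet buffer.
    print(max_out_of_order(100, 1, 10))
```

The bound scales with r/s, so a flow that QCN has already throttled to a small share of the link can reorder only a few packets when it is moved.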
36
Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
Comparative Evaluation of CEE-based Switch Adaptive Routing
Daniel Crisan, Mitch Gusat, Cyriel Minkenberg
2nd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC CAVES), 2010
37
Multipathing Methods
• Deterministic, static, or preconfigured
– Single fixed path
– VLAN-based, multiple fixed paths, ST-per-VLAN
• Oblivious, randomized
– Hashed by headers
– Split to intermediaries
• Reactive, switch adaptive routing
• Controller-enabled centralized scheduling
38
Comparison
• Deterministic, static, or preconfigured
– Simple, no reordering
• Oblivious, randomized: good when...
– Single priority, symmetric traffic
• Reactive, switch adaptive routing: realistic...
– Multiple priorities, asymmetric traffic
• Controller-enabled centralized scheduling
– Large input set, higher complexity
– Controller is hard to implement; high cost, low gain?
• Convergence and virtualization are trends
39
Discussion
• Data center traffic patterns are evolving and in many cases unknown a priori
• This justifies multiple routing / balancing schemes
– Currently there is no single killer solution
• Should be able to switch between modes
– Reactive-Adaptive and Randomized
• The role of the controller is still to be optimized
– Could be useful for critical flows / situations
– Detects and reacts in a slower manner
– Not ideal for dynamic, fast reaction
40
Reference
• Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huang, “Peregrine: An All-Layer-2 Container Computer Network”, IEEE Cloud 2012.
• M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman, “Data center transport mechanisms: Congestion control theory and IEEE standardization”, Communication, Control, and Computing, 2008 46th Annual Allerton Conference on.
• A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar, “AF-QCN: Approximate fairness with quantized congestion notification for multitenanted data centers”, High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on.
• Adrian S.-W. Tam, Kang Xi, and H. Jonathan Chao, “Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute”, 2010 18th IEEE Symposium on High Performance Interconnects.
• Daniel Crisan, Mitch Gusat, and Cyriel Minkenberg, “Comparative Evaluation of CEE-based Switch Adaptive Routing”, 2nd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC CAVES), 2010.