
1

Scalable Management of Enterprise and Data Center Networks

Minlan Yu (minlanyu@cs.princeton.edu)

Princeton University

2

Edge Networks

Data centers (cloud)

Internet

Enterprise networks (corporate and campus)

Home networks

3

Redesign Networks for Management

• Management is important, yet underexplored
  – Taking 80% of IT budget
  – Responsible for 62% of outages

• Making management easier
  – The network should be truly transparent

Redesign the networks to make them easier and cheaper to manage

4

Main Challenges

Simple Switches (cost, energy)

Flexible Policies (routing, security, measurement)

Large Networks (hosts, switches, apps)

5

Large Enterprise Networks

Hosts (10K - 100K)

Switches (1K - 5K)

Applications (100 - 1K)

6

Large Data Center Networks

Switches (1K - 10K)

Servers and Virtual Machines (100K - 1M)

Applications (100 - 1K)

7

Flexible Policies

Customized Routing

Access Control

Measurement / Diagnosis

Considerations:
– Performance
– Security
– Mobility
– Energy saving
– Cost reduction
– Debugging
– Maintenance
– …

8

Switch Constraints

Small, on-chip memory (expensive, power-hungry)

Increasing link speed (10 Gbps and more)

Storing lots of state
• Forwarding rules for many hosts/switches
• Access control and QoS for many apps/users
• Monitoring counters for specific flows

Edge Network Management

9

Management system: specify policies, configure devices, collect measurements

On switches:
BUFFALO [CONEXT’09]: Scaling packet forwarding
DIFANE [SIGCOMM’10]: Scaling flexible policy

On hosts:
SNAP [NSDI’11]: Scaling diagnosis

Research Approach

10

New algorithms & data structures → Systems prototyping → Evaluation & deployment

BUFFALO: Effective use of switch memory → Prototype on Click → Evaluation on real topology/trace
DIFANE: Effective use of switch memory → Prototype on OpenFlow → Evaluation on AT&T data
SNAP: Efficient data collection/analysis → Prototype on Win/Linux OS → Deployment in Microsoft

11

BUFFALO [CONEXT’09] Scaling Packet Forwarding on Switches

Packet Forwarding in Edge Networks

• Hash table in SRAM to store forwarding table
  – Map MAC addresses to next hop
  – Hash collisions

• Overprovision to avoid running out of memory
  – Perform poorly when out of memory
  – Difficult and expensive to upgrade memory

12


Bloom Filters

• Bloom filters in SRAM
  – A compact data structure for a set of elements
  – Calculate s hash functions to store element x
  – Easy to check membership
  – Reduce memory at the expense of false positives

[Figure: bit array V0 … Vm-1 with the positions h1(x), h2(x), …, hs(x) set to 1 for element x]

• One Bloom filter (BF) per next hop
  – Store all addresses forwarded to that next hop (a sketch follows below)
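To make the data structure concrete, here is a minimal Bloom filter sketch in Python; the salted-hash scheme and parameters are illustrative assumptions, not BUFFALO's switch implementation.

```python
# Minimal Bloom filter sketch (illustration only, not BUFFALO's switch code).
class BloomFilter:
    def __init__(self, m, s):
        self.m = m                 # number of bits in the array V0 .. Vm-1
        self.s = s                 # number of hash functions h1 .. hs
        self.bits = [0] * m

    def _positions(self, x):
        # Derive s positions from one key by salting Python's hash (assumed scheme).
        return [hash((i, x)) % self.m for i in range(self.s)]

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def __contains__(self, x):
        # May report True for an address that was never added (false positive),
        # but never reports False for one that was added.
        return all(self.bits[pos] for pos in self._positions(x))

# One Bloom filter per next hop: insert every address forwarded that way.
bf_port1 = BloomFilter(m=1 << 16, s=4)
bf_port1.add("00:11:22:33:44:55")
print("00:11:22:33:44:55" in bf_port1)    # True
```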

14

BUFFALO: Bloom Filter Forwarding

[Figure: the packet destination is queried against one Bloom filter per next hop (Nexthop 1 … Nexthop T); a hit selects the outgoing next hop]

Comparing with Hash Table

15

• Save 65% memory with 0.1% false positives

[Plot: fast memory size (MB) vs. number of forwarding table entries (K), comparing a hash table against Bloom filters with false-positive rates of 0.01%, 0.1%, and 1%]

• More benefits over hash table
  – Performance degrades gracefully as tables grow
  – Handle worst-case workloads well
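For a rough sense of where the savings come from, the standard Bloom-filter sizing formula gives the bits needed per stored address; this is a back-of-the-envelope sketch, not the paper's exact memory accounting.

```latex
% Standard Bloom filter sizing (back-of-the-envelope, not the paper's exact accounting).
% For n stored addresses and target false-positive rate p:
\[
  m \;=\; -\frac{n \ln p}{(\ln 2)^2}
  \quad\Longrightarrow\quad
  \frac{m}{n} \;\approx\; 1.44\,\log_2\!\frac{1}{p} \;\approx\; 14.4
  \text{ bits per address for } p = 0.1\%,
\]
\[
  s \;=\; \frac{m}{n}\,\ln 2 \;\approx\; 10 \text{ hash functions,}
\]
% versus a hash-table entry that stores the full 48-bit MAC address plus a
% next-hop index and collision-handling overhead.
```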

False Positive Detection

• Multiple matches in the Bloom filters
  – One of the matches is correct
  – The others are caused by false positives

16

[Figure: the packet destination is queried against the per-next-hop Bloom filters and gets multiple hits]

Handle False Positives

• Design goals
  – Should not modify the packet
  – Never go to slow memory
  – Ensure timely packet delivery

• When a packet has multiple matches (a sketch follows below)
  – Exclude the incoming interface
    • Avoid loops in the “one false positive” case
  – Random selection from matching next hops
    • Guarantee reachability with multiple false positives
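A minimal sketch of this selection logic, reusing the per-port Bloom filters from the earlier example; the function name and structure are assumptions for illustration, not the actual switch data path.

```python
import random

def select_next_hop(bf_by_port, dst_mac, in_port):
    """Pick an output port for dst_mac while tolerating Bloom filter false positives."""
    matches = [port for port, bf in bf_by_port.items() if dst_mac in bf]
    if not matches:
        return None                                  # unknown destination
    # Exclude the incoming interface: avoids loops in the one-false-positive case.
    candidates = [p for p in matches if p != in_port] or matches
    # Random choice among the remaining matches keeps the packet moving and
    # guarantees reachability even with multiple false positives.
    return random.choice(candidates)
```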

17

One False Positive

• Most common case: one false positive
  – When there are multiple matching next hops
  – Avoid sending to the incoming interface

• Provably at most a two-hop loop
  – Stretch <= Latency(A→B) + Latency(B→A)

18

[Figure: a false positive at switch A detours the packet to neighbor B before it continues along the shortest path to dst]

Stretch Bound

• Provable expected stretch bound
  – With k false positives, the expected stretch is provably bounded
  – Proved using random walk theory

• However, the stretch is not bad in practice
  – False positives are independent
  – The probability of k false positives drops exponentially in k

• Tighter bounds in special topologies
  – For trees, a tighter expected stretch bound holds (k > 1)

19

BUFFALO Switch Architecture

20

Prototype Evaluation

• Environment
  – Prototype implemented in kernel-level Click
  – 3.0 GHz 64-bit Intel Xeon
  – 2 MB L2 data cache, used as the SRAM of size M

• Forwarding table
  – 10 next hops, 200K entries

• Peak forwarding rate
  – 365 Kpps, 1.9 μs per packet
  – 10% faster than hash-based EtherSwitch

21

BUFFALO Conclusion

• Indirection for scalability
  – Send false-positive packets to a random port
  – Gracefully increase stretch with the growth of the forwarding table

• Bloom filter forwarding architecture
  – Small, bounded memory requirement
  – One Bloom filter per next hop
  – Optimization of Bloom filter sizes
  – Dynamic updates using counting Bloom filters (a sketch follows below)
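A hedged sketch of the counting-Bloom-filter idea behind dynamic updates: counters (which could live off the fast path) support insertions and deletions, and the compact bit array used for forwarding is just their projection. The layout is an assumption for illustration, not BUFFALO's implementation.

```python
# Counting Bloom filter sketch for dynamic forwarding-table updates (illustrative).
class CountingBloomFilter:
    def __init__(self, m, s):
        self.m, self.s = m, s
        self.counts = [0] * m          # counters instead of single bits

    def _positions(self, x):
        return [hash((i, x)) % self.m for i in range(self.s)]

    def add(self, x):
        for pos in self._positions(x):
            self.counts[pos] += 1

    def remove(self, x):
        # Deletion works because the counters remember how many elements set each slot.
        for pos in self._positions(x):
            self.counts[pos] -= 1

    def to_bit_array(self):
        # The compact bit array that would be installed in fast memory for forwarding.
        return [1 if c > 0 else 0 for c in self.counts]
```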

22

DIFANE [SIGCOMM’10] Scaling Flexible Policies on Switches

23

Do It Fast ANd Easy

24

Traditional Network

Data plane: limited policies

Control plane: hard to manage

Management plane: offline, sometimes manual

New trends: Flow-based switches & logically centralized control

Data plane: Flow-based Switches

• Perform simple actions based on rules
  – Rules: match on bits in the packet header
  – Actions: drop, forward, count
  – Store rules in high-speed memory (TCAM)

25

Example rules over the flow space (source X, destination Y):
  1. X:*  Y:1  drop
  2. X:5  Y:3  drop
  3. X:1  Y:*  count
  4. X:*  Y:*  forward

Rules are stored in a TCAM (Ternary Content Addressable Memory). (A sketch of this matching model follows below.)
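A software model of this match-action behavior, assuming single-character field values and "*" wildcards; it mimics TCAM priority matching in Python and is only an illustration, not switch hardware.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    priority: int      # lower number = higher priority, as in the list above
    src: str           # e.g. "5" or "*" (wildcard)
    dst: str
    action: str

def matches(pattern, value):
    return pattern == "*" or pattern == value

def apply_rules(rules, src, dst):
    """Return the action of the highest-priority rule matching (src, dst)."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule.src, src) and matches(rule.dst, dst):
            return rule.action
    return "drop"      # assumed default when nothing matches

rules = [
    Rule(1, "*", "1", "drop"),
    Rule(2, "5", "3", "drop"),
    Rule(3, "1", "*", "count"),
    Rule(4, "*", "*", "forward"),
]
print(apply_rules(rules, src="1", dst="2"))   # count
print(apply_rules(rules, src="7", dst="1"))   # drop
```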

26

Control Plane: Logically Centralized

RCP [NSDI’05], 4D [CCR’05], Ethane [SIGCOMM’07], NOX [CCR’08], Onix [OSDI’10]
Software-defined networking

DIFANE: a scalable way to apply fine-grained policies

Pre-install Rules in Switches

27

[Figure: the controller pre-installs rules in the switches; packets hit the rules and are forwarded]

• Problems: limited TCAM space in switches
  – No host mobility support
  – Switches do not have enough memory

Install Rules on Demand (Ethane)

28

[Figure: the first packet misses the rules; the switch buffers the packet and sends its header to the controller; the controller installs rules and the packet is forwarded]

• Problems: limited resources in the controller
  – Delay of going through the controller
  – Switch complexity
  – Misbehaving hosts

29

Design Goals of DIFANE

• Scale with network growth
  – Limited TCAM at switches
  – Limited resources at the controller

• Improve per-packet performance
  – Always keep packets in the data plane

• Minimal modifications in switches
  – No changes to data-plane hardware

Combine proactive and reactive approaches for better scalability

DIFANE: Doing It Fast and Easy (two stages)

30

Stage 1

31

The controller proactively generates the rules and distributes them to authority switches.

Partition and Distribute the Flow Rules

32

[Figure: the controller partitions the flow space (accept/reject rules) and distributes the partition information to the ingress and egress switches; authority switches A, B, and C each hold one partition of the rules]

Stage 2

33

The authority switches keep packets always in the data plane and reactively cache rules.

Packet Redirection and Rule Caching

34

[Figure: the first packet is redirected from the ingress switch to the authority switch, which forwards it toward the egress switch and sends feedback to cache rules at the ingress switch; following packets hit the cached rules and are forwarded directly]

A slightly longer path in the data plane is faster than going through the control plane

Locate Authority Switches

• Partition information in ingress switches
  – Using a small set of coarse-grained wildcard rules
  – … to locate the authority switch for each packet

• A distributed directory service of rules
  – Hashing does not work for wildcards

35

Example partition of the flow space:
  X:0-1  Y:0-3  → Authority Switch A
  X:2-5  Y:0-1  → Authority Switch B
  X:2-5  Y:2-3  → Authority Switch C

Packet Redirection and Rule Caching

36

[Figure: the same redirection and caching flow as above, annotated with the three rule sets involved: cache rules, authority rules, and partition rules]

Three Sets of Rules in TCAM

Type             Priority  Field 1  Field 2  Action                           Timeout
Cache rules      1         00**     111*     Forward to Switch B              10 sec
                 2         1110     11**     Drop                             10 sec
                 …         …        …        …                                …
Authority rules  14        00**     001*     Forward, trigger cache manager   Infinity
                 15        0001     0***     Drop, trigger cache manager
                 …         …        …        …                                …
Partition rules  109       0***     000*     Redirect to auth. switch
                 110       …        …        …                                …

37

Cache rules:     in ingress switches, reactively installed by authority switches
Authority rules: in authority switches, proactively installed by the controller
Partition rules: in every switch, proactively installed by the controller

(A sketch of how a packet traverses these rule sets follows below.)
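A minimal sketch of how a packet might move through these rule sets, with the flow-space partition simplified to a lookup function and an exact-match cache; the names and structure are assumptions for illustration, not the DIFANE prototype's API.

```python
def handle_at_ingress(flow, cache_rules, region_of, authority_of_region):
    """
    flow: a hashable flow key, e.g. (src, dst)
    cache_rules: {flow: action}, reactively installed by authority switches
    region_of: maps a flow to its region of the flow space (partition rules)
    authority_of_region: {region: authority switch}, proactively installed
    """
    if flow in cache_rules:
        return ("apply", cache_rules[flow])           # cache hit: act locally
    # Cache miss: the partition rules redirect the packet, in the data plane,
    # to the authority switch that owns this region of the flow space.
    return ("redirect", authority_of_region[region_of(flow)])

# Example with a toy partition on the low bit of the source field (hypothetical).
def low_bit(flow):
    return flow[0] & 1

cache = {}
authority = {0: "Authority Switch A", 1: "Authority Switch B"}
print(handle_at_ingress((3, 7), cache, low_bit, authority))   # redirect to B
cache[(3, 7)] = "forward via link 1"                          # feedback caches the rule
print(handle_at_ingress((3, 7), cache, low_bit, authority))   # apply cached action
```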

DIFANE Switch Prototype

Built with an OpenFlow switch

38

[Figure: switch architecture with a data plane holding cache rules, authority rules, and partition rules, and a control-plane cache manager, present only in authority switches, that receives notifications and sends/receives cache updates]

Just software modification for authority switches

Caching Wildcard Rules

• Overlapping wildcard rules
  – Cannot simply cache matching rules

39

[Figure: overlapping rules R1–R4 in the (src, dst) flow space, with priority R1 > R2 > R3 > R4]

Caching Wildcard Rules

• Multiple authority switches
  – Contain independent sets of rules
  – Avoid cache conflicts in the ingress switch

40

[Figure: the flow space split between authority switch 1 and authority switch 2]

Partition Wildcard Rules

• Partition rules
  – Minimize the TCAM entries in switches
  – Decision-tree based rule partition algorithm (a toy illustration follows below)

41

[Figure: two candidate cuts of the flow space; Cut B is better than Cut A]
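A toy illustration of why one cut can beat another: a rule whose range straddles the cut must be installed on both sides, so a good cut minimizes that duplication. This only sketches the intuition behind the decision-tree partitioning, using made-up ranges, and is not the actual algorithm.

```python
def cost_of_cut(rules, dim, value):
    """rules: list of ((src_lo, src_hi), (dst_lo, dst_hi)); cut dimension dim at value."""
    total = 0
    for rule in rules:
        lo, hi = rule[dim]
        if hi < value or lo >= value:
            total += 1          # rule falls entirely on one side of the cut
        else:
            total += 2          # rule straddles the cut and is duplicated
    return total                # total TCAM entries across the two partitions

rules = [((0, 1), (0, 3)), ((2, 5), (0, 1)), ((2, 5), (2, 3))]
print(cost_of_cut(rules, dim=0, value=2))   # cut on src: 3 entries, no duplication
print(cost_of_cut(rules, dim=1, value=2))   # cut on dst: 4 entries, one rule duplicated
```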

42

Testbed for Throughput Comparison

[Figure: testbed comparing DIFANE (traffic generators → ingress switches → authority switch, with a controller) against Ethane (traffic generators → ingress switches → controller)]

• Testbed with around 40 computers

Peak Throughput

43

[Plot: throughput (flows/sec) vs. sending rate (flows/sec), both on log scales from 1K to 1,000K, for DIFANE and NOX with 1–4 ingress switches; DIFANE reaches roughly 800K flows/sec, while the controller bottlenecks around 50K and a single ingress switch around 20K]

DIFANE is self-scaling: higher throughput with more authority switches.

• Setup (DIFANE vs. Ethane): one authority switch; first packet of each flow

44

Scaling with Many Rules

• Analyze rules from campus and AT&T networks
  – Collect configuration data on switches
  – Retrieve network-wide rules
  – E.g., 5M rules, 3K switches in an IPTV network

• Distribute rules among authority switches
  – Only need 0.3% - 3% authority switches
  – Depending on network size, TCAM size, and the number of rules

Summary: DIFANE in the Sweet Spot

45

[Figure: a spectrum from logically centralized to distributed control]
  – Traditional network: hard to manage
  – OpenFlow/Ethane: not scalable
  – DIFANE: scalable management; the controller is still in charge, and the switches host a distributed directory of the rules

SNAP [NSDI’11]: Scaling Performance Diagnosis for Data Centers

46

Scalable Net-App Profiler

47

Applications inside Data Centers

[Figure: a front-end server fans out requests to aggregators, which fan out to many workers]

48

Challenges of Datacenter Diagnosis

• Large complex applications
  – Hundreds of application components
  – Tens of thousands of servers

• New performance problems
  – Update code to add features or fix bugs
  – Change components while the app is still in operation

• Old performance problems (human factors)
  – Developers may not understand the network well
  – Nagle’s algorithm, delayed ACK, etc.

49

Diagnosis in Today’s Data Center

[Figure: a host running an app on an OS, with a packet sniffer attached]

App logs (#reqs/sec, response time, e.g. 1% of requests see >200 ms delay): application-specific
Switch logs (#bytes/pkts per minute): too coarse-grained
Packet traces (filtered for long-delay requests): too expensive

SNAP diagnoses net-app interactions: generic, fine-grained, and lightweight

50

SNAP: A Scalable Net-App Profiler

that runs everywhere, all the time

51

SNAP Architecture

At each host, for every connection: collect data → performance classifier → cross-connection correlation at the management system

• Collect data: adaptively poll per-socket statistics in the OS
  – Snapshots (e.g., #bytes in the send buffer)
  – Cumulative counters (e.g., #FastRetrans)

• Performance classifier: classify based on the stages of data transfer
  – Sender app → send buffer → network → receiver

• Cross-connection correlation: combine with topology, routing, and connection-to-process/app mappings
  – Output: the offending app, host, link, or switch

Online, lightweight processing & diagnosis at the hosts; offline, cross-connection diagnosis. (A classifier sketch follows below.)
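A hedged sketch of classifying one polled snapshot of per-socket statistics into the stages above; the field names and thresholds are illustrative assumptions, not SNAP's actual classifier.

```python
def classify_connection(stats):
    """
    stats: one polled snapshot of per-socket statistics, e.g.
      send_buf_bytes, send_buf_limit, fast_retrans, timeouts,
      zero_rwnd, delayed_ack_suspected
    Returns the stage of data transfer that limits this connection.
    """
    if stats["send_buf_bytes"] == 0:
        return "sender app limited"        # application not writing fast enough
    if stats["send_buf_bytes"] >= stats["send_buf_limit"]:
        return "send buffer limited"       # send buffer not large enough
    if stats["fast_retrans"] > 0 or stats["timeouts"] > 0:
        return "network limited"           # fast retransmissions or timeouts
    if stats["zero_rwnd"] or stats["delayed_ack_suspected"]:
        return "receiver limited"          # not reading or not ACKing fast enough
    return "not limited"

print(classify_connection({"send_buf_bytes": 4096, "send_buf_limit": 65536,
                           "fast_retrans": 0, "timeouts": 0,
                           "zero_rwnd": False, "delayed_ack_suspected": True}))
# -> receiver limited
```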

52

SNAP in the Real World

• Deployed in a production data center
  – 8K machines, 700 applications
  – Ran SNAP for a week, collected terabytes of data

• Diagnosis results
  – Identified 15 major performance problems
  – 21% of applications have network performance problems

53

Characterizing Perf. Limitations

#Apps that are limited for > 50% of the time:

Send buffer: 1 app
  – Send buffer not large enough
Network: 6 apps
  – Fast retransmission
  – Timeout
Receiver: 8 apps
  – Not reading fast enough (CPU, disk, etc.)
Receiver: 144 apps
  – Not ACKing fast enough (delayed ACK)

Delayed ACK Problem

• Delayed ACK affected many delay-sensitive apps
  – Even #pkts per record: 1,000 records/sec; odd #pkts per record: 5 records/sec
  – Delayed ACK was used to reduce bandwidth usage and server interrupts

54

[Figure: host A sends data to host B; B holds back its ACK for up to 200 ms before responding]

• Proposed solutions (a hedged illustration follows below)
  – Delayed ACK should be disabled in data centers
  – ACK every other packet
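As a hedged illustration of the socket-level knobs involved (not the fix actually deployed in the data center): on Linux, TCP_NODELAY disables Nagle's algorithm, and TCP_QUICKACK, where available, temporarily disables delayed ACK on the receiver.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm so small writes are not held back waiting for an ACK.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# TCP_QUICKACK (Linux) temporarily suppresses delayed ACK; it is not sticky,
# so receivers typically re-set it after each recv().
TCP_QUICKACK = getattr(socket, "TCP_QUICKACK", None)
if TCP_QUICKACK is not None:
    sock.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)
```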

55

Diagnosing Delayed ACK with SNAP

• Monitor at the right place– Scalable, lightweight data collection at all hosts

• Algorithms to identify performance problems– Identify delayed ACK with OS information

• Correlate problems across connections– Identify the apps with significant delayed ACK issues

• Fix the problem with operators and developers– Disable delayed ACK in data centers

Edge Network Management

56

Management system: specify policies, configure devices, collect measurements

On switches:
BUFFALO [CONEXT’09]: Scaling packet forwarding
DIFANE [SIGCOMM’10]: Scaling flexible policy

On hosts:
SNAP [NSDI’11]: Scaling diagnosis

Thanks!

57
