Hedera: Dynamic Flow Scheduling for Data Center Networks
Mohammad Al-Fares • Sivasankar Radhakrishnan • Barath Raghavan • Nelson Huang • Amin Vahdat
Presented by 馬嘉伶
• An easy-to-understand problem:
 - MapReduce-style data center applications need bandwidth
 - Data center networks have many equal-cost (ECMP) paths between servers
 - Flow-hash-based load balancing is insufficient
ECMP Paths
• Many equal-cost paths going up to the core switches
• Only one path down from each core switch
• Paths are randomly allocated to flows using a hash of the flow
• Agnostic to available resources
• Long-lasting collisions between long (elephant) flows
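The hash-based path choice described above can be sketched as follows. This is a minimal illustration, not a real switch's hash: the md5 digest and the 5-tuple encoding are assumptions made for the example.

```python
import hashlib

def ecmp_path(flow_5tuple, num_paths):
    """Pick one of `num_paths` equal-cost uplinks by hashing the flow's
    5-tuple, so every packet of a flow follows the same path."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Two distinct elephant flows can hash to the same uplink (a collision),
# degrading both even while other equal-cost paths sit idle.
a = ("10.0.0.1", "10.0.1.1", 5001, 80, "tcp")
b = ("10.0.0.2", "10.0.1.2", 5002, 80, "tcp")
```

Because the choice depends only on the flow's header fields, it is deterministic per flow but blind to flow sizes and current link load.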
Collisions of elephant flows
• Collisions are possible in two different places:
 - On the upward path
 - On the downward path
[Figure: fat-tree with sources S1–S4 sending to destinations D1–D4; elephant flows collide on shared links]
Collisions of elephant flows
• Average of 61% of bisection bandwidth wasted on a network of 27K servers
Hedera Scheduler
• Hedera: Dynamic Flow Scheduling
- Optimizes achievable bisection bandwidth by assigning flows to non-conflicting paths
- Uses flow demand estimation + placement heuristics to find good flow-to-core mappings

[Control loop: Detect Large Flows → Estimate Flow Demands → Place Flows]
Hedera Scheduler
• Detect Large Flows
 - Flows that need bandwidth but are network-limited
• Estimate Flow Demands
 - Use max-min fairness to allocate bandwidth among flows between src-dst pairs
• Place Flows
 - Use the estimated demands to heuristically find a better placement of large flows on the ECMP paths
Elephant Detection
• Scheduler continually polls edge switches for flow byte-counts
• Flows exceeding a B/s threshold are "large"
 - > 10% of the host's link capacity (i.e. > 100 Mbps on GigE)
• What if there are only "small" flows?
 - Default ECMP load-balancing is efficient
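The polling-and-threshold step can be sketched like this. The dictionary shapes and flow names are illustrative assumptions; the slides specify only the 10%-of-link (100 Mbps on GigE) threshold.

```python
LINK_CAPACITY_BPS = 1_000_000_000          # GigE host links
THRESHOLD_BPS = 0.10 * LINK_CAPACITY_BPS   # flows above 10% are "large"

def detect_elephants(prev_bytes, curr_bytes, interval_s):
    """Compare two successive byte-count polls of the edge switches and
    return the flows whose rate exceeds the 10%-of-link threshold."""
    elephants = []
    for flow, curr in curr_bytes.items():
        # Convert the byte-count delta over the poll interval to bits/s.
        rate_bps = 8 * (curr - prev_bytes.get(flow, 0)) / interval_s
        if rate_bps > THRESHOLD_BPS:
            elephants.append(flow)
    return elephants

prev = {"f1": 0, "f2": 0}
curr = {"f1": 125_000_000, "f2": 1_000_000}   # bytes after a 1 s interval
# f1 ran at 1 Gbps (elephant); f2 at 8 Mbps (mouse, left to ECMP).
```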
Elephant Detection
• Hedera complements ECMP!
- Default forwarding uses ECMP
- Hedera schedules large flows that cause bisection bandwidth problems.
Demand Estimation
• Flows can be constrained in two ways
 - Host-limited (at the source or at the destination)
 - Network-limited
• The measured flow rate is misleading
• Need to find a flow's "natural" bandwidth requirement when it is not limited by the network
• Forget the network for a moment; just allocate capacity among flows using max-min fairness
Demand Estimation
• Given a traffic matrix of large flows, iteratively modify each flow's size at its source and destination…
• Senders equally distribute bandwidth among outgoing flows that are not receiver-limited
• Oversubscribed receivers decrease the exceeded capacity equally among incoming flows
• Repeat until all flows converge
• Guaranteed to converge in O(|F|) time
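The sender/receiver iteration above can be sketched in a few dozen lines. This is a simplified sketch over unit-capacity host links, not the paper's exact pseudocode; it reproduces the A/B/C → X/Y example worked through on the next slides.

```python
from fractions import Fraction

def estimate_demands(flows):
    """Iterative max-min demand estimation (Hedera-style sketch).
    `flows` is a list of (src, dst) pairs; every host link has unit
    capacity. Returns each flow's "natural" demand as a Fraction."""
    demand = {f: Fraction(0) for f in flows}
    converged = {f: False for f in flows}

    for _ in range(len(flows) + 1):        # converges in O(|F|) rounds
        # Sender pass: split each source's spare capacity equally
        # among its not-yet-converged outgoing flows.
        for s in {f[0] for f in flows}:
            out = [f for f in flows if f[0] == s]
            unconv = [f for f in out if not converged[f]]
            if unconv:
                spare = Fraction(1) - sum(demand[f] for f in out
                                          if converged[f])
                for f in unconv:
                    demand[f] = spare / len(unconv)

        # Receiver pass: at an oversubscribed receiver, flows already
        # below the fair share keep their demand; the rest split the
        # remaining capacity equally and become receiver-limited.
        for d in {f[1] for f in flows}:
            inc = sorted((f for f in flows if f[1] == d),
                         key=lambda f: demand[f])
            if sum(demand[f] for f in inc) <= 1:
                continue
            cap, n = Fraction(1), len(inc)
            for f in inc:
                if demand[f] <= cap / n:   # small flow keeps its demand
                    cap -= demand[f]
                    n -= 1
                else:                      # receiver-limited: cap it
                    demand[f] = cap / n
                    converged[f] = True
    return demand
```

Running it on the slides' example (flows A→X, A→Y, B→Y, C→Y) yields the demands the worked example arrives at: AX = 2/3 and AY = BY = CY = 1/3.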
Demand Estimation
• Example: flows A→X, A→Y, B→Y, C→Y (unit-capacity host links)

Round 1, senders (split capacity equally among unconverged outgoing flows):

 Sender | Available BW | Unconv. flows | Share
 A      | 1            | 2             | 1/2
 B      | 1            | 1             | 1
 C      | 1            | 1             | 1

 Estimates: AX = 1/2, AY = 1/2, BY = 1, CY = 1

Round 1, receivers (oversubscribed receivers cap incoming flows):

 Recv | Rate-limited? | Non-sender-limited flows | Share
 X    | No            | -                        | -
 Y    | Yes           | 3                        | 1/3

 Estimates: AX = 1/2, AY = 1/3 (conv.), BY = 1/3 (conv.), CY = 1/3 (conv.)

Round 2, senders:

 Sender | Available BW | Unconv. flows | Share
 A      | 2/3          | 1             | 2/3
 B      | 0            | 0             | 0
 C      | 0            | 0             | 0

 Estimates: AX = 2/3 (conv.)

Round 2, receivers: neither X nor Y is oversubscribed, so all flows have converged.

 Final demands: AX = 2/3, AY = 1/3, BY = 1/3, CY = 1/3
Flow placement
• Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized
 - That is, maximum utilization of the theoretically available bandwidth
• Two approaches
 - Global First Fit: greedily choose the first path that has sufficient unreserved bandwidth
 - Simulated Annealing: iteratively search for a globally better mapping of paths to flows
Global First-Fit
• When a new large flow is detected, linearly search all possible paths from S→D
• Place the flow on the first path whose component links can all fit it

[Figure: scheduler scanning paths through core switches 0–3 for flows A, B, C]
Global First-Fit
• Flows placed upon detection are not moved
• Once a flow ends, its entries + reservations time out
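Global First-Fit can be sketched as a linear scan with link reservations. The path and link representations here are assumptions made for illustration; the slides specify only "first path whose component links can fit the flow".

```python
def global_first_fit(flow_demand, paths, reserved, capacity=1.0):
    """Sketch of Global First-Fit: scan the candidate S->D paths in
    order and reserve the first one on which every link still has room
    for the flow's estimated demand.

    `paths`    -- list of candidate paths, each a list of link ids
    `reserved` -- dict mapping link id -> bandwidth already reserved
    Returns the chosen path, or None (flow stays on default ECMP)."""
    for path in paths:
        if all(reserved.get(link, 0.0) + flow_demand <= capacity
               for link in path):
            for link in path:              # reserve the demand end-to-end
                reserved[link] = reserved.get(link, 0.0) + flow_demand
            return path
    return None

reserved = {}
paths = [["up1", "down1"], ["up2", "down2"]]
```

Because placements are greedy and never revisited, a flow keeps its path until it ends and its reservation times out, matching the slide above.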
Simulated Annealing
• Probabilistic search for good flow-to-core mappings
 - Goal: maximize achievable bisection bandwidth
• The current flow-to-core mapping generates a neighbor state
 - Calculate the total exceeded bandwidth capacity
 - Accept the move to the neighbor state if it gains bisection BW
• A few thousand iterations per scheduling round
 - Avoids local minima: non-zero probability of moving to a worse state
Simulated Annealing
• Example run: 3 flows, 3 iterations
[Figure: three iterations remapping flows A, B, C across core switches 0–3]
Simulated Annealing
• Final state is published to the switches and used as the initial state for next round
[Figure: final flow-to-core mapping for flows A, B, C published to the switches]
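The annealing loop above can be sketched as follows. This is a toy version under stated simplifications: one unit-capacity "core link" per core switch, an exponential acceptance rule, and an assumed geometric cooling schedule; the paper's actual neighbor and cost functions over fat-tree links are more involved.

```python
import math
import random

def anneal(flows, demands, num_cores, iters=500, seed=0):
    """Simulated-annealing search over flow -> core mappings, scoring a
    mapping by its total exceeded capacity (bandwidth over the limit)."""
    rng = random.Random(seed)

    def cost(mapping):
        load = [0.0] * num_cores
        for f, core in mapping.items():
            load[core] += demands[f]
        return sum(max(0.0, l - 1.0) for l in load)   # exceeded capacity

    state = {f: rng.randrange(num_cores) for f in flows}
    best, best_cost = dict(state), cost(state)
    temp = 1.0
    for _ in range(iters):
        # Neighbor state: reassign one random flow to a random core.
        neighbor = dict(state)
        neighbor[rng.choice(flows)] = rng.randrange(num_cores)
        delta = cost(neighbor) - cost(state)
        # Accept improvements; accept worse states with probability
        # exp(-delta/temp) to escape local minima.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            state = neighbor
        if cost(state) < best_cost:
            best, best_cost = dict(state), cost(state)
        temp = max(temp * 0.995, 1e-3)    # assumed cooling schedule
    return best, best_cost
```

Seeding the next round with the previous final state, as the slide describes, keeps the search warm across scheduling rounds.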
Fault-Tolerance
• Link / switch failure
 - Use PortLand's fault notification protocol
 - Hedera routes around failed components
Fault-Tolerance
• Scheduler failure
• Soft-state, not required for correctness (connectivity)
• Switches fall back to ECMP
Evaluation
• 16-host testbed
 - k=4 fat-tree data plane
 - 20 machines; 4-port NetFPGAs / OpenFlow
 - Parallel 48-port non-blocking Quanta switch (debug & control)
• 1 scheduler machine
 - Dynamic traffic monitoring
 - OpenFlow routing control
[Figure: bisection bandwidth (Gbps, 0–16) per communication pattern for ECMP, Global First-Fit, Simulated Annealing, and a non-blocking switch]
Evaluation
[Figure: bisection bandwidth (Gbps, 0–16) for additional communication patterns, same four schemes]
Evaluation
• 16-hosts: 120 GB all-to-all in-memory shuffle
• Hedera achieves 39% better bisection BW than ECMP, and 88% of an ideal non-blocking switch
Data Shuffle

                           ECMP    GFF     SA      Control
 Total Shuffle Time (s)    438.4   335.5   336.0   306.4
 Avg. Completion Time (s)  358.1   258.7   262.0   226.6
 Avg. Bisection BW (Gbps)  2.81    3.89    3.84    4.44
 Avg. Host Goodput (MB/s)  20.9    29.0    28.6    33.1
Conclusions
• Simulated Annealing delivers significant bisection BW gains over standard ECMP
• Hedera complements ECMP
• If you are running MapReduce/Hadoop jobs on your network, you stand to benefit greatly from Hedera, for a tiny investment!