hedera : dynamic flow scheduling for data center networks

28
Hedera: Dynamic Flow Scheduling for Data Center Networks Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat Presented by 馬馬馬

Upload: york

Post on 24-Feb-2016

135 views

Category:

Documents


0 download

DESCRIPTION

Hedera : Dynamic Flow Scheduling for Data Center Networks. Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat. Presented by 馬嘉伶. Easy to understand problem MapReduce style DC applications need bandwidth DC networks have many ECMP paths between servers - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Hedera : Dynamic Flow Scheduling for Data Center Networks

Hedera: Dynamic Flow Scheduling for Data Center Networks

•Mohammad Al-Fares•Sivasankar Radhakrishnan•Barath Raghavan•Nelson Huang•Amin Vahdat

Presented by 馬嘉伶

Page 2: Hedera : Dynamic Flow Scheduling for Data Center Networks

• Easy to understand problem MapReduce style DC applications need bandwidth DC networks have many ECMP paths between servers Flow-hash-based load balancing insufficient

Page 3: Hedera : Dynamic Flow Scheduling for Data Center Networks

ECMP Paths• Many equal cost paths going up to the core switches• Only one path down from each core switch• Randomly allocate paths to flows using hash of the flow

• Agnostic to available resources• Long lasting collisions between long (elephant) flows

DS

Page 4: Hedera : Dynamic Flow Scheduling for Data Center Networks

Collisions of elephant flows• Collisions possible in two different ways

• Upward path• Downward path

D1

S1

D2

S2

D3

S3

D4

S4

Page 5: Hedera : Dynamic Flow Scheduling for Data Center Networks

Collisions of elephant flows• Average of 61% of bisection bandwidth wasted on a network of 27K servers

D1

S1

D2

S2

D3

S3

D4

S4

Page 6: Hedera : Dynamic Flow Scheduling for Data Center Networks

Hedera Scheduler

• Hedera: Dynamic Flow Scheduling - Optimize achievable bisection bandwidth by assigning flow non-conflicting paths - Uses flow demand estimation + placement heuristics to find good flow-to-core mappings

EstimateFlow Demands

PlaceFlows

DetectLarge Flows

Page 7: Hedera : Dynamic Flow Scheduling for Data Center Networks

Hedera Scheduler

• Detect Large Flows• Flows that need bandwidth but are network-limited

• Estimate Flow Demands• Use max-min fairness to allocate flows between src-dst pairs

• Place Flows• Use estimated demands to heuristically find better placement of

large flows on the ECMP paths

EstimateFlow Demands

PlaceFlows

DetectLarge Flows

Page 8: Hedera : Dynamic Flow Scheduling for Data Center Networks

Elephant Detection• Scheduler continually polls edge switches for

flow byte-counts• Flows exceeding B/s threshold are “large”

• > %10 of hosts’ link capacity (i.e. > 100Mbps)

• What if only “small” flows?• Default ECMP load-balancing efficient.

GigE

Page 9: Hedera : Dynamic Flow Scheduling for Data Center Networks

Elephant Detection• Hydera complements ECMP! - Default forwarding uses ECMP - Hedera schedules large flows that cause bisection bandwidth problems.

Page 10: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation• Flows can be constrained in two ways

• Host-limited (at source, or at destination)• Network-limited

• Measured flow rate is misleading

• Need to find a flow’s “natural” bandwidth requirement when not limited by the network

• Forget network, just allocate capacity between flows using max-min fairness

Page 11: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation• Given traffic matrix of large flows, modify each flow’s size at it source and destination iteratively…

• Sender equally distributes bandwidth among outgoing flows that are not receiver-limited

• Network-limited receivers decrease exceeded capacity equally between incoming flows

• Repeat until all flows converge

•Guaranteed to converge in O(|F|) time

Page 12: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

A

B

C

X

Y

Flow Estimate

Conv. ?

AXAYBYCY

Sender

Available Unconv. BW Flows Share

A 1 2 1/2B 1 1 1C 1 1 1

Senders

Page 13: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

Recv RL? Non-SLFlows Share

X No - -Y Yes 3 1/3

ReceiversFlow Estimat

eConv.

?AX 1/2AY 1/2BY 1CY 1

A

B

C

X

Y

Page 14: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

Flow Estimate

Conv. ?

AX 1/2AY 1/3 YesBY 1/3 YesCY 1/3 Yes

Sender

Available Unconv. BW Flows Share

A 2/3 1 2/3B 0 0 0C 0 0 0

Senders

A

B

C

X

Y

Page 15: Hedera : Dynamic Flow Scheduling for Data Center Networks

Demand Estimation

Flow Estimate

Conv. ?

AX 2/3 YesAY 1/3 YesBY 1/3 YesCY 1/3 Yes

Recv RL? Non-SLFlows Share

X No - -Y No - -

Receivers

A

B

C

X

Y

Page 16: Hedera : Dynamic Flow Scheduling for Data Center Networks

Flow placement• Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized• That is maximum utilization of theoretically available b/w

• Two approaches• Global First Fit:

Greedily choose path that has sufficient unreserved b/w• Simulated Annealing:

Iteratively find a globally better mapping of paths to flows

Page 17: Hedera : Dynamic Flow Scheduling for Data Center Networks

Global First-Fit

• New flow detected, linearly search all possible paths from SD• Place flow on first path whose component links can fit that flow

?Flow AFlow BFlow C

? ?0 1 2 3

Scheduler

Page 18: Hedera : Dynamic Flow Scheduling for Data Center Networks

Global First-Fit

• Flows placed upon detection, are not moved• Once flow ends, entries + reservations time out

Flow AFlow BFlow C

0 1 2 3

Scheduler

Page 19: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing• Probabilistic search for good flow-to-core mappings - Goal: Maximize achievable bisection bandwidth• Current flow-to-core mapping generates neighbor state - Calculate total exceeded bandwidth capacity - Accept move to neighbor state if bisection BW gain• Few thousand iterations for each scheduling round - Avoid local-minima; non-zero prob. to worse state

Page 20: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing

Page 21: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing

• Example run: 3 flows, 3 iterations

0 1 2 3Flow AFlow BFlow C

Core210

? ? ?202

?203

Scheduler

Page 22: Hedera : Dynamic Flow Scheduling for Data Center Networks

Simulated Annealing

• Final state is published to the switches and used as the initial state for next round

0 1 3Flow AFlow BFlow C

Core ? ? ? ?203

2

Scheduler

Page 23: Hedera : Dynamic Flow Scheduling for Data Center Networks

Fault-Tolerance

• Link / Switch failure• Use PortLand’s fault notification protocol• Hedera routes around failed components

0 1 3Flow AFlow BFlow C

2

Scheduler

Page 24: Hedera : Dynamic Flow Scheduling for Data Center Networks

Fault-Tolerance

• Scheduler failure• Soft-state, not required for correctness (connectivity)• Switches fall back to ECMP

0 1 3Flow AFlow BFlow C

2

Scheduler

Page 25: Hedera : Dynamic Flow Scheduling for Data Center Networks

Evaluation• 16-host testbed - k=4 fat-tree data-plane - 20 machines; 4-port NetFGPAs / OpenFlow - Parallel 48-port non-blocking Quanta switch(debug & control)• 1 Scheduler machine - Dynamic traffic monitoring - OpenFlow routing control

Page 26: Hedera : Dynamic Flow Scheduling for Data Center Networks

02468

10121416

ECMP Global First-Fit

Simulated Annealing Non-blocking

Communication Pattern

Bis

ectio

n B

andw

idth

(Gbp

s)

Evaluation

0

2

4

6

8

10

12

14

16ECMP Global First-Fit

Simulated Annealing Non-blocking

Communication PatternB

isec

tion

Ban

dwid

th (G

bps)

Page 27: Hedera : Dynamic Flow Scheduling for Data Center Networks

Evaluation

• 16-hosts: 120 GB all-to-all in-memory shuffle

• Hedera achieves 39% better bisection BW over ECMP, 88% of ideal non-blocking switch

ECMP GFF SA ControlTotal Shuffle Time (s) 438.4 335.5 336.0 306.4

Avg. Completion Time (s) 358.1 258.7 262.0 226.6

Avg. Bisection BW (Gbps) 2.81 3.89 3.84 4.44

Avg. host goodput (MB/s) 20.9 29.0 28.6 33.1

Data Shuffle

Page 28: Hedera : Dynamic Flow Scheduling for Data Center Networks

Conclusions• Simulated Annealing delivers significant bisection BW

gains over standard ECMP• Hedera complements ECMP • If you are running MapReduce/Hadoop jobs on your

network, you stand to benefit greatly from Hedera; tiny investment!