Data Center TCP (DCTCP)
张樱凡 (Zhang Yingfan)
December 19, 2016
Institute of Computer Science and Technology, Peking University
Outline
- Introduction
- Communications in Data Centers
- DCTCP Algorithm
- Experiment
- Conclusion
Outline
- Introduction
  - Background
  - Data Center Network
  - Queue Management
- Communications in Data Centers
- DCTCP Algorithm
- Experiment
- Conclusion
Background
- Data Center
  - Big data
  - For efficient processing and storage
[Figure: Global IP Traffic, 1990-2015, in PB per month]
Background
- Data Center Definition
  - A data center is a facility used to house computer systems and associated components, such as telecommunications and storage systems
  - Includes redundant or backup power supplies, redundant data communications connections, environmental controls, and various security devices
- Data Center Design
  - Highly available and performant
  - Low cost, built from commodity components
  - Flexible and extendable
Data Center Network
- Basic Requirements
  - Low latency for short flows
  - High burst tolerance
  - High utilization for long flows
- Additional Requirements
  - Switch buffer occupancies need to stay persistently low while maintaining high throughput for the long flows
  - Must be implementable with mechanisms available in existing hardware
Data Center Network
- Data Center Environment
  - Low round-trip times (less than 250 μs)
  - Little statistical multiplexing
  - Network is homogeneous
  - A single administrative control
  - Separate from external traffic
Queue Management
- Delay-based
  - Measure RTT
  - Treat an RTT increase as a sign of growing queueing delay
  - Susceptible to noise when latency is low (see the sketch below)
- Active Queue Management (AQM)
  - Explicit feedback from switches
  - ECN
  - DCTCP
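A quick numeric sketch of that noise problem, under assumed values (1500-byte packets and ~20 μs of software timestamping jitter; neither number is from the slides): at data center latencies, the delay added by a queue of a few packets is of the same order as the measurement noise itself.

```python
# Illustrative sketch: queueing delay vs. measurement noise at data center RTTs.
# Assumed values (not from the slides): 1500-byte packets, ~20 us of
# kernel/NIC timestamping jitter on commodity servers.
LINK_GBPS = 1.0
PKT_BYTES = 1500
BASE_RTT_US = 250          # data center RTT from the slides
TIMESTAMP_JITTER_US = 20   # assumed software measurement noise

def queueing_delay_us(queue_pkts: int) -> float:
    """Delay added by a standing queue of queue_pkts packets."""
    return queue_pkts * PKT_BYTES * 8 / (LINK_GBPS * 1e3)  # bits per us at 1 Gbps

for q in (1, 5, 20):
    d = queueing_delay_us(q)
    print(f"{q:3d}-packet queue adds {d:6.1f} us "
          f"(~{d / BASE_RTT_US:4.1%} of base RTT; jitter is {TIMESTAMP_JITTER_US} us)")
```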
Outline
- Introduction
- Communications in Data Centers
  - Partition/Aggregate Pattern
  - Workload Characterization
  - Performance Impairments
- DCTCP Algorithm
- Experiment
- Conclusion
Partition/Aggregate Pattern
- Structure
  - Multi-layer
  - Requests from higher layers are broken into pieces and farmed out to lower layers
  - Latency is the key metric
Partition/Aggregate Pattern
- Characteristics
  - Permissible latency (all-up SLA): 230-300 ms
  - Lag in one layer delays the layers above it
  - Missing the deadline leads to that response being dropped
  - High percentiles of worker latency matter
  - Important in application design
Workload Characterization
- Testbed
  - 6000 servers in total, in over 150 racks
  - 44 servers per rack
  - Each server connects to a top-of-rack switch via a 1 Gbps link
  - Switches are shallow-buffered, with 4 MB of buffer shared among 48 1 Gbps ports and two 10 Gbps ports
- Flow types
  - Soft real-time query traffic
  - Short control messages
  - Background traffic
Workload Characterization
- Query Traffic
  - Follows the Partition/Aggregate pattern
  - One high-level aggregator
  - 43 servers act as mid-level aggregators and workers
  - Time between arrivals of queries at MLAs
Workload Characterization
- Background Traffic
  - Large flows (1 MB-50 MB) copy data to workers
  - Small flows (50 KB-1 MB) are short messages
Workload Characterization
- Concurrency
  - Number of flows one MLA or worker participates in concurrently
Performance Impairments
- Incast
  - Arises from the Partition/Aggregate pattern
  - Responses from workers are synchronized
  - Causes missed deadlines
  - How to avoid (see the burst arithmetic below):
    - Reduce the RTO
    - Reduce response size
    - Jittering
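A back-of-the-envelope sketch of why synchronized responses overflow a shallow-buffered port. The 100-packet per-port buffer and the 1 MB total response come from the incast microbenchmark later in the deck; the MTU and the 4-packet initial window are assumptions for illustration.

```python
# Sketch: synchronized incast burst vs. a shallow per-port buffer.
# Buffer size and response total are from the microbenchmark slides;
# MTU and the initial window are assumed values.
BUFFER_PKTS = 100           # per-port packet buffer (Triumph)
TOTAL_RESPONSE = 1_000_000  # 1 MB, split evenly across n servers
MTU = 1500                  # assumed bytes per packet
INIT_WINDOW = 4             # assumed packets each server sends in the first RTT

for n in (1, 10, 40):
    per_server_pkts = -(-TOTAL_RESPONSE // n // MTU)  # ceiling division
    burst = min(INIT_WINDOW, per_server_pkts) * n     # first-RTT arrivals
    verdict = "overflows" if burst > BUFFER_PKTS else "fits in"
    print(f"n={n:2d}: first-RTT burst ≈ {burst:3d} pkts "
          f"({verdict} the {BUFFER_PKTS}-pkt buffer)")
```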
Performance Impairments
- Queue buildup
  - Long-lived, greedy TCP flows cause the queue length to grow
  - Causes small flows to be dropped
  - Even when no packet is dropped, small flows experience added latency (a quick arithmetic sketch follows)
  - Experiment: intra-rack RTT 100 μs, inter-rack RTT 250 μs
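To make "experience latency" concrete, here is the extra delay a built-up queue adds at 1 Gbps, compared with the RTTs above. The queue depths are illustrative assumptions:

```python
# Sketch: latency a standing queue adds to short flows at 1 Gbps.
# Link speed and RTTs are from the slides; queue depths are assumed.
LINK_BPS = 1e9
PKT_BYTES = 1500
INTER_RACK_RTT_US = 250

for queue_pkts in (10, 50, 200):
    delay_us = queue_pkts * PKT_BYTES * 8 / LINK_BPS * 1e6
    print(f"{queue_pkts:3d}-pkt queue adds {delay_us:6.0f} us "
          f"≈ {delay_us / INTER_RACK_RTT_US:4.1f}x the inter-rack RTT")
```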
Performance Impairments
- Buffer pressure
  - Individual flows pass through different ports
  - Long-lived flows consume the shared buffer
  - This reduces the buffer available to absorb bursts from the Partition/Aggregate pattern
Outline
- Introduction
- Communications in Data Centers
- DCTCP Algorithm
  - Algorithm
  - Benefits
  - Analysis
  - Parameters
  - Discussion
- Experiment
- Conclusion
Algorithm
- Explicit Congestion Notification
  - RFC 3168 (2001)
  - End-to-end notification of network congestion without dropping packets
- Algorithm design
  - Derive multi-bit feedback from the information present in the single-bit sequence of marks
  - Estimate the extent of congestion
  - React in proportion to the extent of congestion
Algorithm
- Marking at the switch
  - Threshold K
  - Mark the CE codepoint if queue occupancy is greater than K
- ECN-Echo at the receiver
  - Convey the exact sequence of marked packets back to the sender
  - Implemented as a state machine (sketched below)
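A minimal sketch of the two pieces, assuming the behavior described in the DCTCP paper: the switch marks CE when instantaneous queue occupancy exceeds K, and the receiver, which normally delays ACKs, sends an immediate ACK whenever the CE state of arriving packets changes, so the ECN-Echo bits reproduce the exact marking sequence. Names and structure here are ours; real implementations live in switch hardware and the TCP stack.

```python
# Sketch of DCTCP marking and ECN-Echo, per the paper's description.

K = 20  # marking threshold in packets (value used on 1 Gbps links here)

def switch_mark(queue_occupancy: int) -> bool:
    """Switch side: set CE iff the instantaneous queue exceeds K packets."""
    return queue_occupancy > K

class DctcpReceiver:
    """Receiver side: delayed ACKs, but an immediate ACK on every CE
    transition, so the sender sees the exact run lengths of marks."""
    DELAYED_ACK_EVERY = 2  # ACK every m packets; m=2 is a common default

    def __init__(self):
        self.ce_state = False  # CE value of the last packet seen
        self.unacked = 0

    def on_packet(self, ce: bool) -> list[str]:
        acks = []
        if ce != self.ce_state:
            # CE changed: immediately ACK the packets received in the old
            # state (echoing the old CE value), then switch state.
            if self.unacked:
                acks.append(f"ACK(ECE={int(self.ce_state)}) covering {self.unacked} pkts")
            self.ce_state, self.unacked = ce, 0
        self.unacked += 1
        if self.unacked >= self.DELAYED_ACK_EVERY:
            acks.append(f"ACK(ECE={int(ce)}) covering {self.unacked} pkts")
            self.unacked = 0
        return acks  # an empty list means the ACK is being delayed

rx = DctcpReceiver()
for ce in (False, False, True, True, True, False):
    print(rx.on_packet(ce))
```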
Algorithm
- Controller at the sender
  - Maintain 𝛼, an estimate of the fraction of marked packets
  - Update once per window of data: 𝛼 ← (1 − g)×𝛼 + g×F, where F is the fraction of packets marked in the most recent window
  - 𝛼 estimates the probability that the queue size is greater than K
  - React: cwnd ← cwnd×(1 − 𝛼/2) (a runnable sketch follows)
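A minimal sketch of the sender-side controller, assuming one update per window of data as in the paper; variable names are ours:

```python
# Sketch of the DCTCP sender controller: EWMA of the marked fraction,
# and a congestion-window cut proportional to it.

class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1 / 16):
        self.cwnd = cwnd   # congestion window, in packets
        self.g = g         # estimation gain (1/16 in the validation slide)
        self.alpha = 0.0   # running estimate of the marked fraction

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Called once per window of data, with counts from the ECN-Echoes."""
        F = marked / acked                      # fraction marked this window
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked:
            # React in proportion to the extent of congestion:
            # alpha=1 degenerates to TCP's halving; small alpha cuts gently.
            self.cwnd *= 1 - self.alpha / 2
        else:
            self.cwnd += 1                      # standard additive increase

s = DctcpSender()
for _ in range(5):
    s.on_window_acked(acked=10, marked=2)       # 20% of packets marked
    print(f"alpha={s.alpha:.3f}  cwnd={s.cwnd:.2f}")
```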
Benefits
- Incast
  - Cannot help when the number of flows is so high that even the first packets of each flow overflow the buffer
  - Works when bursts build up over several RTTs
- Queue buildup
  - DCTCP starts reacting as soon as queue occupancy exceeds K
  - Limits the queue length
- Buffer pressure
  - The queue length doesn't grow exceedingly large
  - A few congested ports won't harm other ports
Analysis
- Assumptions
  - N synchronized flows
  - Link capacity C
  - Round-trip time RTT
- Deduction
  - Q(t) = N·W(t) − C×RTT
  - The queue size process is a sawtooth
Analysis
- Assumptions and deduction: as on the previous slide
- Target
  - Maximum queue size Q_max
  - Amplitude of queue oscillations A
  - Period T_C (in RTTs)
Analysis
- Single sender
  - The critical window size at which the queue length reaches K (numeric example below):

    W* = (C×RTT + K)/N
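For intuition, plugging in the settings used in the validation later in the deck (10 Gbps, K = 40 packets) plus an assumed 100 μs RTT and 1500-byte packets:

```python
# Sketch: critical window W* = (C*RTT + K) / N, in packets.
# 10 Gbps and K=40 come from the validation slide; RTT and packet
# size are assumptions.
C_BPS, RTT_S, K, PKT_BYTES = 10e9, 100e-6, 40, 1500

bdp_pkts = C_BPS * RTT_S / (8 * PKT_BYTES)   # bandwidth-delay product
for n in (1, 2, 10):
    w_star = (bdp_pkts + K) / n
    print(f"N={n:2d}: W* ≈ {w_star:6.1f} packets (BDP ≈ {bdp_pkts:.0f} pkts)")
```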
Analysis
- Single sender (continued)
  - Let S(W₁, W₂) denote the number of packets sent while the window ramps from W₁ to W₂. The fraction of marked packets is then

    𝛼 = S(W*, W* + 1) / S((W* + 1)(1 − 𝛼/2), W* + 1)

    ⇒ 𝛼²(1 − 𝛼/4) = (2W* + 1)/(W* + 1)² ≈ 2/W*

    ⇒ 𝛼 ≈ √(2/W*)
Analysis
- Target quantities (evaluated numerically after this slide)
  - Per-flow window reduction after a marking event:

    D = (W* + 1) − (W* + 1)(1 − 𝛼/2) = (W* + 1)·𝛼/2

  - Amplitude of queue oscillations:

    A = N·D = N(W* + 1)·𝛼/2 ≈ (N/2)·√(2W*) = (1/2)·√(2N(C×RTT + K))

  - Period: the window regains D packets at one packet per RTT, so (in RTTs)

    T_C = D = (1/2)·√(2(C×RTT + K)/N)

  - Maximum queue size:

    Q_max = N(W* + 1) − C×RTT = K + N
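Evaluating these expressions with the validation settings (10 Gbps, K = 40) and the same assumed RTT and packet size as before:

```python
# Sketch: sawtooth amplitude, period, and queue maximum from the analysis.
# 10 Gbps and K=40 are from the validation slide; RTT/packet size assumed.
from math import sqrt

C_BPS, RTT_S, K, PKT_BYTES = 10e9, 100e-6, 40, 1500
pipe = C_BPS * RTT_S / (8 * PKT_BYTES) + K   # C*RTT + K, in packets

for n in (2, 10, 40):
    A = 0.5 * sqrt(2 * n * pipe)             # oscillation amplitude (pkts)
    T_C = 0.5 * sqrt(2 * pipe / n)           # period, in RTTs
    q_max = K + n                            # maximum queue size (pkts)
    print(f"N={n:2d}: A≈{A:5.1f} pkts, T_C≈{T_C:4.1f} RTTs, Q_max={q_max} pkts")
```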
Analysis
- Validation
  - 10 Gbps
  - K = 40
  - g = 1/16
  - Evaluate queue length against the model
Parameters
- Marking Threshold K
  - Chosen so the queue doesn't underflow (which would sacrifice throughput):

    Q_min = Q_max − A = K + N − (1/2)·√(2N(C×RTT + K)) > 0

    ⇒ K > (C×RTT)/7

- Estimation Gain g
  - Chosen so the moving average "spans" at least one congestion event:

    (1 − g)^T_C > 1/2

    ⇒ g < 1.386/√(2(C×RTT + K))

(the sketch below evaluates both bounds)
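Evaluating both bounds for a 10 Gbps link, with an assumed 100 μs RTT and 1500-byte packets:

```python
# Sketch: the K and g bounds from the parameter analysis.
# Link speed and K=65 are from the slides; RTT and packet size are assumed.
from math import sqrt

C_BPS, RTT_S, PKT_BYTES = 10e9, 100e-6, 1500
bdp_pkts = C_BPS * RTT_S / (8 * PKT_BYTES)    # C*RTT, in packets

k_min = bdp_pkts / 7                          # K > C*RTT/7
K = 65                                        # value used on 10 Gbps links here
g_max = 1.386 / sqrt(2 * (bdp_pkts + K))      # g < 1.386/sqrt(2(C*RTT+K))

print(f"C*RTT ≈ {bdp_pkts:.0f} pkts -> need K > {k_min:.1f} pkts")
print(f"with K = {K}: need g < {g_max:.3f} (g = 1/16 = {1/16:.4f} satisfies this)")
```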
Discussion
- AQM is not enough
  - AQM schemes like RED and PI don't work well here
- Convergence and Synchronization
  - DCTCP trades off convergence time:
    - RTTs in the data center are a few hundred μs
    - Most microbursts are too small to converge anyway
    - Big flows can tolerate this delay
  - "On-off" style marking can cause synchronization, but the reaction to congestion is not severe
- Practical considerations
  - Implementation and system details can cause bursts
  - K and g are selected by experiment
Outline
- Introduction
- Communications in Data Centers
- DCTCP Algorithm
- Experiment
  - Environment
  - DCTCP Performance
  - Impairment Microbenchmarks
  - Benchmark Traffic
- Conclusion
Environment
- Machines
  - 94 machines in 3 racks
  - 80 have 1 Gbps NICs
  - 14 have 10 Gbps NICs
  - CPU and memory are never a bottleneck
- Switches

  Switch    Ports                     Buffer  ECN
  Triumph   48 x 1 Gbps, 4 x 10 Gbps  4 MB    Yes
  Scorpion  24 x 10 Gbps              4 MB    Yes
  CAT4948   48 x 1 Gbps, 2 x 10 Gbps  16 MB   No
DCTCP Performance
- Throughput & Queue Size
  - Long-lived flows
  - Triumph switch with 1 Gbps links
  - K = 20
  - TCP and DCTCP both achieve a throughput of 0.95 Gbps
- Other values of K
  - 10 Gbps link
DCTCP Performance
- Comparison to RED
  - 10 Gbps links
  - K = 65
DCTCP Performance
- Fairness and Convergence
  - 6 hosts connected via 1 Gbps links to the Triumph switch
  - K = 20
DCTCP Performance
- Multi-hop networks
  - Throughput: S1 46 Mbps, S3 54 Mbps, S2 475 Mbps
  - TCP suffers timeouts for some connections
Impairment Microbenchmarks
- Basic Incast
  - 41 machines connected to the Triumph switch with 1 Gbps links
  - Each port has a 100-packet buffer
  - One client requests 1MB/n bytes from each of n servers
  - Metric: completion time
Impairment Microbenchmarks
- Incast with dynamic buffering
  - RTO_min = 10 ms
Impairment Microbenchmarks
- All-to-all incast
  - Each machine requests 25 KB from each of the remaining machines
Impairment Microbenchmarks
- Queue buildup
  - 4 machines connected to a Triumph switch with 1 Gbps links
  - 1 receiver and 3 senders
  - 2 senders send big TCP flows to the receiver; 1 sends 20 KB chunks of data over a long-lived connection
Impairment Microbenchmarks
- Buffer pressure
  - 44 machines connected to a Triumph switch with 1 Gbps links
  - 11 hosts run a 10-to-1 incast pattern: 1 client requests a total of 1 MB from 10 servers
  - 33 hosts start 66 big flows as background traffic
Benchmark Traffic
- Testbed
  - 45 servers connected to a Triumph top-of-rack switch by 1 Gbps links
  - 1 additional server connected to a 10 Gbps port stands in for the rest of the data center
  - Three types of traffic: query, short-message, and background
  - Query traffic follows the Partition/Aggregate structure of the real application; each query requests 2 KB of data
  - RTO_min = 10 ms
  - K = 20 for 1 Gbps links and 65 for 10 Gbps links
Benchmark Traffic
- Background traffic completion time
- Query traffic completion time
Benchmark Traffic
- Scaled traffic
  - Increase the size of update flows larger than 1 MB by a factor of 10
  - Also try deep buffering and RED
Outline
- Introduction
- Communications in Data Centers
- DCTCP Algorithm
- Experiment
- Conclusion
Conclusion
- Proposed a new variant of TCP, called Data Center TCP (DCTCP)
- Measured a 6000-server data center cluster and observed several performance impairments
- Designed DCTCP around multi-bit feedback derived from the series of ECN marks
- Experiments at 1 and 10 Gbps show that DCTCP meets its design goals
Thanks