TRANSCRIPT
R2P2: Making RPCs first-class datacenter citizens
Marios Kogias <[email protected]>
Datacenter Communication
• Infrastructure:
  • Clos topology
  • 10/40/100G links
  • Few μs RTTs
  • Kernel-bypass
  • In-network programmability
• Applications:
  • Data-stores, search, etc…
  • Complex fan-in/fan-out patterns
  • Tight tail-latency SLOs
  • Service-time variability
  • μs-scale Remote Procedure Calls
[Diagram: Clos topology — load balancer, root switches, leaf switches]
Q: What is an RPC?
Q: What is a typical RPC stack?
Q: Identify the layers involved
Q: What is a latency SLO?
[Stack diagram, bottom-up: Transport — RPC — Application]
Paradigm Mismatch
Multiplexing independent RPCs over a reliable byte-stream, e.g. TCP
[Diagram: requests P1–P3 and replies R1–R4 multiplexed over one byte-stream]
Problems:
1. Ordering and head-of-line blocking
  • TCP imposes ordering of requests
  • RPCs are independent
  • Lost packets can affect several requests
2. RPC-agnostic network
  • TCP hides RPC semantics
  • Software middleboxes:
    • Deep packet inspection
    • Connection termination, e.g. L7 LB
Outline
• R2P2, a transport protocol for RPCs that exposes the RPC abstraction to the network and enables in-network policy enforcement
• Use case: in-network RPC load balancing over R2P2
• Identify reusable system design principles
  • Suggested reading: "Hints for Computer System Design"
R2P2: Request Response Pair Protocol
• Independent RR pairs
  • Not connections
  • Not messages
• No protocol-enforced ordering
  • No fate sharing
  • Lost packets only affect the equivalent RR pair
• Per-RPC decisions:
  • Timeout
  • At-least-once / at-most-once semantics
[Diagram: client–server exchange; X marks a lost packet that the client must recover from]
Hint: Leave it to the client
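The "leave it to the client" hint can be made concrete with a small sketch: the client owns the timeout and the retry policy for each RR pair, so neither the server nor the network needs per-connection state. This is illustrative Python, not the R2P2 API; the transport interface and names are assumptions.

```python
import itertools

class RRPair:
    """One request-response pair: the client, not the transport,
    decides the timeout and the delivery semantics."""
    _req_ids = itertools.count()

    def __init__(self, transport, timeout_us=200, max_retries=3):
        # transport: anything with send(req_id, payload, timeout_us) -> reply or None
        self.transport = transport
        self.timeout_us = timeout_us
        self.max_retries = max_retries

    def call(self, payload):
        req_id = next(self._req_ids)            # identifies this RR pair
        for _attempt in range(self.max_retries + 1):
            reply = self.transport.send(req_id, payload, self.timeout_us)
            if reply is not None:
                return reply                    # at-least-once: retries may duplicate
        raise TimeoutError(f"RPC {req_id} failed after {self.max_retries + 1} attempts")
```

Because each RR pair is independent, a lost packet costs one retried request, not a stalled stream of unrelated ones.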
Client-Server Decoupling
• RR pair identified by:
  • Source IP
  • Source port
  • Request ID
• Break point-to-point RPC semantics
  • Request destination != reply source
  • Per-request policy enforcement
[Diagram: client sends the request to a middlebox; the reply comes directly from S1 or S2]
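Because the RR pair is identified by (source IP, source port, request ID) rather than by a connection, the reply may legitimately arrive from a different address than the one the request was sent to. A minimal sketch of that matching logic, with illustrative field names:

```python
class RPCClient:
    def __init__(self, ip, port):
        self.ip, self.port = ip, port
        self.next_id = 0
        self.pending = {}                        # (ip, port, req_id) -> payload

    def send(self, dest, payload):
        req_id = self.next_id
        self.next_id += 1
        key = (self.ip, self.port, req_id)       # identifies the RR pair
        self.pending[key] = payload
        # the key travels in the header, so any server can answer
        return {"client_ip": self.ip, "client_port": self.port,
                "req_id": req_id, "dest": dest, "payload": payload}

    def on_reply(self, header, reply_src):
        # reply_src plays no role in matching: only the header key does
        key = (header["client_ip"], header["client_port"], header["req_id"])
        return self.pending.pop(key)
```

This is exactly what breaks the point-to-point constraint: the middlebox can hand the request to any server, and the client still pairs the reply with its request.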
RPC Policy Enforcement over R2P2
[Message flow: clients → middlebox: REQ0; middlebox → chosen server (S1..SN): REQ0; (REQREADY) and (REQN) exchanged directly between client and server; REPLY goes directly to the client; (FEEDBACK) returns to the middlebox]
• RPC-aware middlebox
• Only policy enforcement
• No IO bottleneck
Hint: Separate normal and worst case
RPC Policies Explored
• Network-based RPC load balancing [ATC 2019]
• Target selection based on request type [ATC 2019]
• And more to discuss later…
Request-level Load Balancing over TCP
[Diagram: clients connect to a middlebox, which proxies requests to servers S1..SN]
Vanilla Request-Level Load Balancing
• L7 load balancing, e.g. NGINX reverse proxy
• Terminates client connections
• Opens separate connections to the servers
• 4 servers x 16 threads
• HTTP-based RPC
• S = 25μs, exponential distribution
• Max throughput 2.56 MRPS
• NGINX with Join-Shortest-Queue
[Plot: 99th-percentile latency (µs, 0–500) vs. load (0–2.5 MRPS); curve: NGINX-JSQ]
L7 load balancers suffer from the paradigm mismatch and become IO bottlenecks.
In-network Request-level Load Balancing
• Software DPDK R2P2 router
  • 5μs latency overhead
  • IOPS-bottlenecked with 2 cores
• P4 dataplane on Barefoot Tofino
  • 1μs latency overhead
• 4 servers x 16 threads
• S = 25μs, exponential distribution
• Max throughput 2.56 MRPS
[Plot: 99th-percentile latency (µs, 0–500) vs. load (0–2.5 MRPS); curves: NGINX-JSQ, RANDOM]
Can we do better...?
In-network Request-level Load Balancing
[Same setup and plot, adding a third curve: NGINX-JSQ, RANDOM, SW-JBSQ(3)]
Yes, we can!
Q: Why do the 3 curves perform differently?
Request-level Load Balancing
[Diagram: left — dispatcher with one queue per server (N x M/G/1); right — dispatcher with a single shared queue (M/G/N)]
• N x M/G/1
  • Transient load imbalance
  • Scalable throughput
  • Equivalent to L4 load balancing
• M/G/N
  • Better tail latency
  • Communication overhead
  • Could be implemented as L7 load balancing
[Plot: tail latency vs. load — M/G/N vs. N x M/G/1, theory vs. practice]
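The gap between the two queueing models shows up in a few lines of simulation. This is an illustrative sketch, not the talk's experiment: exponential service times, 16 servers, load 0.7, comparing 99th-percentile latency under random per-server dispatch (N x M/G/1) against a single shared FCFS queue (M/G/N).

```python
import heapq
import random

def simulate(num_servers, shared_queue, n_jobs=20000, load=0.7, seed=1):
    """Return the 99th-percentile sojourn time for one of the two models."""
    rng = random.Random(seed)
    mean_service = 1.0
    arrival_rate = load * num_servers / mean_service
    t, arrivals = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)
        arrivals.append((t, rng.expovariate(1.0 / mean_service)))
    lat = []
    if shared_queue:
        free = [0.0] * num_servers          # next-free time per worker, one FCFS queue
        heapq.heapify(free)
        for arr, svc in arrivals:
            start = max(arr, heapq.heappop(free))
            heapq.heappush(free, start + svc)
            lat.append(start + svc - arr)
    else:
        free = [0.0] * num_servers          # one FIFO queue per server
        for arr, svc in arrivals:
            k = rng.randrange(num_servers)  # random (L4-style) dispatch
            start = max(arr, free[k])
            free[k] = start + svc
            lat.append(start + svc - arr)
    lat.sort()
    return lat[int(0.99 * len(lat))]

p99_split = simulate(16, shared_queue=False)
p99_shared = simulate(16, shared_queue=True)
```

With identical arrivals and service demands, the shared queue's tail latency is several times lower: a request never waits behind a long job in one queue while another server sits idle.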
Challenge
How can we implement RPC loadbalancing with single-queue performance across multiple servers while achieving high throughput and low latency?
Join-Bounded-Shortest-Queue JBSQ(n)
• Split-queue model
  • One central "unbounded" queue
  • Several distributed bounded queues (depth n)
• Delays the scheduling decision for better placement
• Trade-off:
  • Throughput: a small n exposes the communication overhead
  • Tail latency: a high n can lead to bad placement
[Diagram: unbounded central queue feeding per-server queues bounded at n = 2]
Always think about the trade-offs.
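The JBSQ(n) policy above can be sketched in a few lines. This is an illustrative dispatcher, not the R2P2 implementation: requests wait in one central queue, a server holds at most n outstanding requests, and each feedback message frees a slot.

```python
from collections import deque

class JBSQDispatcher:
    def __init__(self, servers, n):
        self.n = n
        self.outstanding = {s: 0 for s in servers}  # bounded queue depth per server
        self.central = deque()                       # the "unbounded" central queue
        self.assignments = []                        # (request, server) log

    def _try_dispatch(self):
        while self.central:
            # join the bounded shortest queue, if any has a free slot
            server = min(self.outstanding, key=self.outstanding.get)
            if self.outstanding[server] >= self.n:
                return                               # all bounded queues full: hold
            self.outstanding[server] += 1
            self.assignments.append((self.central.popleft(), server))

    def on_request(self, req):
        self.central.append(req)
        self._try_dispatch()

    def on_feedback(self, server):
        self.outstanding[server] -= 1                # a reply completed: slot freed
        self._try_dispatch()
```

Holding requests centrally until a slot opens is the delayed scheduling decision: with n small, a request is only committed to a server that is nearly idle, at the cost of an extra feedback round-trip per RPC.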
JBSQ RPC Load Balancing on R2P2
• Central queue of REQ0s in the middlebox
• Middlebox maintains #outstanding RPCs per server
• Feedback messages for each completed RPC
• Middlebox implemented in software (DPDK) or hardware (P4)
[Message flow: clients → middlebox: REQ0; middlebox → server: REQ0; (REQREADY), (REQN); REPLY directly to the client; (FEEDBACK) back to the middlebox]
JBSQ Evaluation
• 4 servers (DPDK) x 16 threads
• S = 10μs, exponential distribution
• 4-byte packets over R2P2
[Plot: 99th-percentile latency (µs, 0–150) vs. load (0–6 MRPS); curves: RANDOM, M/G/64]
JBSQ Evaluation
[Same plot, adding SW-JBSQ(1): curves RANDOM, SW-JBSQ(1), M/G/64]
Q: How will the SW-JBSQ(1) curve look, and why?
n = 1 is not enough to saturate throughput.
Q: What can we do to get more throughput?
JBSQ Evaluation
[Same plot, adding SW-JBSQ(5): curves RANDOM, SW-JBSQ(1), SW-JBSQ(5), M/G/64; SW-JBSQ(5) reaches the maximum throughput under the SLO]
JBSQ Evaluation
[Same plot, adding P4-JBSQ(3): curves RANDOM, SW-JBSQ(1), SW-JBSQ(5), P4-JBSQ(3), M/G/64]
The more efficient HW implementation allows a smaller n, and a smaller n is better for tail latency.
Alternative Policies
[R2P2 header layout (16-bit rows): Magic | Header Size; MessageType | Policy | F/L flags | Reserved; ReqId; PacketId / Packet Count]
• Policy field in the R2P2 header
• Existing policies:
  • ROUTE_ANY
  • ROUTE_FIXED
• Alternative policies:
  • STICKY
  • HASH
  • etc.
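A middlebox only needs to read fixed-offset header fields to enforce a policy, which is what makes a P4 implementation feasible. The sketch below packs and inspects an R2P2-style header; the exact field widths, order, and the type/policy nibble split are assumptions for illustration, not the wire format from the paper.

```python
import struct

# assumed layout: magic, header size, type/policy byte, flags, req id, pkt id/count
HEADER = struct.Struct("!BBBBHH")

POLICIES = {0: "ROUTE_ANY", 1: "ROUTE_FIXED", 2: "STICKY", 3: "HASH"}

def pack_header(magic, hdr_size, msg_type, policy, flags, req_id, pkt):
    # message type in the high nibble, policy in the low nibble (assumed split)
    return HEADER.pack(magic, hdr_size, (msg_type << 4) | policy, flags, req_id, pkt)

def policy_of(raw):
    """A middlebox decodes only the policy bits to pick a routing behavior."""
    _, _, type_policy, _, _, _ = HEADER.unpack(raw)
    return POLICIES[type_policy & 0x0F]
```

Keeping the policy in a fixed header byte means new policies (STICKY, HASH, ...) only change the middlebox's dispatch table, not the endpoints.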
Redis
• KV-store with master-slave replication
  • SETs only go to the master
  • GETs are load balanced
• 3+1 DPDK servers
• USR Facebook workload
• Baseline: Linux TCP
[Two plots: 99th-percentile latency (µs, 0–300) vs. load (MRPS); curves: TCP-DIRECT, RANDOM, SW-JBSQ(20); speedups over the baseline of 5.3x, 4.09x, 4.8x, and 5.6x]
Observations:
• R2P2 and DPDK increase throughput
• Scheduling benefits are more significant as service-time variability increases
R2P2 and JBSQ vs ZygOS
• 1 server x 16 threads
• S = 10μs, exponential distribution
• 64-byte packets
[Plot: 99th-percentile latency (µs, 0–150) vs. load (0–1.6 MRPS); curves: M/M/16, ZygOS, P4-JBSQ(3)]
Lessons Learnt from R2P2
1. Pushing functionality into the network is a viable option.
2. Programmable switches can undertake some of this functionality.
3. Adding network hops for better scheduling can improve performance.

Design Points to Remember
1. Place functionality in the right layer.
  • Can you think of alternative RPC policies or functionality that could be implemented with this new abstraction?
2. Separate the normal and the worst case.
  • Mention other use cases of this hint.
3. Leave it to the client.
  • Mention other use cases of this hint.
Conclusion
• R2P2, a transport protocol for RPCs
  • Exposes the RPC abstraction to the network
  • Enables in-network policy enforcement
• In-network RPC load balancing
  • Software/hardware middlebox
  • JBSQ scheduling policy
  • Extensible in-network policies
https://github.com/epfl-dcsl/r2p2
Thank you!