TRANSCRIPT
R2P2: Making RPCs first-class datacenter citizens
Marios Kogias <[email protected]>
Datacenter Communication
• Infrastructure:
  • Clos topology
  • 10/40/100G links
  • Few μs RTTs
  • Kernel-bypass
  • In-network programmability
• Applications:
  • Data-stores, search, etc…
  • Complex fan-in/fan-out patterns
  • Tight tail-latency SLOs
  • Service-time variability
  • μs-scale Remote Procedure Calls
[Diagram: Clos topology — load balancer, root switches, leaf switches]
Q: What is an RPC?
Q: What is a typical RPC stack?
Q: Identify the layers involved
Q: What is a latency SLO?
[Stack diagram, bottom-up: Transport — RPC — Application]
Paradigm Mismatch
Multiplexing independent RPCs over a reliable byte-stream, e.g. TCP
[Diagram: requests P1–P3 and replies R1–R4 multiplexed over one byte-stream]
Problems:
1. Ordering and head-of-line blocking
  • TCP imposes ordering of requests
  • RPCs are independent
  • Lost packets can affect several requests
2. RPC-agnostic network
  • TCP hides RPC semantics
  • Software middleboxes:
    • Deep packet inspection
    • Connection termination, e.g. L7 LB
Outline
• R2P2, a transport protocol for RPCs that exposes the RPC abstraction to the network and enables in-network policy enforcement
• Use case: in-network RPC load balancing over R2P2
• Identify reusable system design principles
  • Suggested reading: "Hints for Computer System Design"
R2P2: Request Response Pair Protocol
• Independent RR pairs
  • Not connections
  • Not messages
• No protocol-enforced ordering
  • No fate sharing
  • Lost packets only affect the equivalent RR pair
• Per-RPC decisions:
  • Timeout
  • At-least-once / at-most-once semantics
[Diagram: client–server exchange; X marks a lost packet that the client must recover from]
Hint: Leave it to the client
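The "leave it to the client" hint can be made concrete with a small sketch: the client owns the timeout and the retry policy for each RR pair, so neither the server nor the network needs per-connection state. This is illustrative Python, not the R2P2 API; the transport interface and names are assumptions.

```python
import itertools

class RRPair:
    """One request-response pair: the client, not the transport,
    decides the timeout and the delivery semantics."""
    _req_ids = itertools.count()

    def __init__(self, transport, timeout_us=200, max_retries=3):
        # transport: anything with send(req_id, payload, timeout_us) -> reply or None
        self.transport = transport
        self.timeout_us = timeout_us
        self.max_retries = max_retries

    def call(self, payload):
        req_id = next(self._req_ids)            # identifies this RR pair
        for _attempt in range(self.max_retries + 1):
            reply = self.transport.send(req_id, payload, self.timeout_us)
            if reply is not None:
                return reply                    # at-least-once: retries may duplicate
        raise TimeoutError(f"RPC {req_id} failed after {self.max_retries + 1} attempts")
```

Because each RR pair is independent, a lost packet costs one retried request, not a stalled stream of unrelated ones.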
Client-Server Decoupling
• RR pair identified by:
  • Source IP
  • Source port
  • Request ID
• Break point-to-point RPC semantics
  • Request destination != reply source
  • Per-request policy enforcement
[Diagram: client sends the request to a middlebox; the reply comes directly from S1 or S2]
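Because the RR pair is identified by (source IP, source port, request ID) rather than by a connection, the reply may legitimately arrive from a different address than the one the request was sent to. A minimal sketch of that matching logic, with illustrative field names:

```python
class RPCClient:
    def __init__(self, ip, port):
        self.ip, self.port = ip, port
        self.next_id = 0
        self.pending = {}                        # (ip, port, req_id) -> payload

    def send(self, dest, payload):
        req_id = self.next_id
        self.next_id += 1
        key = (self.ip, self.port, req_id)       # identifies the RR pair
        self.pending[key] = payload
        # the key travels in the header, so any server can answer
        return {"client_ip": self.ip, "client_port": self.port,
                "req_id": req_id, "dest": dest, "payload": payload}

    def on_reply(self, header, reply_src):
        # reply_src plays no role in matching: only the header key does
        key = (header["client_ip"], header["client_port"], header["req_id"])
        return self.pending.pop(key)
```

This is exactly what breaks the point-to-point constraint: the middlebox can hand the request to any server, and the client still pairs the reply with its request.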
RPC Policy Enforcement over R2P2
[Message flow: clients → middlebox: REQ0; middlebox → chosen server (S1..SN): REQ0; (REQREADY) and (REQN) exchanged directly between client and server; REPLY goes directly to the client; (FEEDBACK) returns to the middlebox]
• RPC-aware middlebox
• Only policy enforcement
• No IO bottleneck
Hint: Separate normal and worst case
RPC Policies Explored
• Network-based RPC load balancing [ATC 2019]
• Target selection based on request type [ATC 2019]
• And more to discuss later…
Request-level Load Balancing over TCP
[Diagram: clients connect to a middlebox, which proxies requests to servers S1..SN]
Vanilla Request-Level Load Balancing
• L7 load balancing, e.g. NGINX reverse proxy
• Terminates client connections
• Opens separate connections to the servers
• 4 servers x 16 threads
• HTTP-based RPC
• S = 25μs, exponential distribution
• Max throughput 2.56 MRPS
• NGINX with Join-Shortest-Queue
[Plot: 99th-percentile latency (µs, 0–500) vs. load (0–2.5 MRPS); curve: NGINX-JSQ]
L7 load balancers suffer from the paradigm mismatch and become IO bottlenecks.
In-network Request-level Load Balancing
• Software DPDK R2P2 router
  • 5μs latency overhead
  • IOPS-bottlenecked with 2 cores
• P4 dataplane on Barefoot Tofino
  • 1μs latency overhead
• 4 servers x 16 threads
• S = 25μs, exponential distribution
• Max throughput 2.56 MRPS
[Plot: 99th-percentile latency (µs, 0–500) vs. load (0–2.5 MRPS); curves: NGINX-JSQ, RANDOM]
Can we do better...?
In-network Request-level Load Balancing
[Same setup and plot, adding a third curve: NGINX-JSQ, RANDOM, SW-JBSQ(3)]
Yes, we can!
Q: Why do the 3 curves perform differently?
Request-level Load Balancing
[Diagram: left — dispatcher with one queue per server (N x M/G/1); right — dispatcher with a single shared queue (M/G/N)]
• N x M/G/1
  • Transient load imbalance
  • Scalable throughput
  • Equivalent to L4 load balancing
• M/G/N
  • Better tail latency
  • Communication overhead
  • Could be implemented as L7 load balancing
[Plot: tail latency vs. load — M/G/N vs. N x M/G/1, theory vs. practice]
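The gap between the two queueing models shows up in a few lines of simulation. This is an illustrative sketch, not the talk's experiment: exponential service times, 16 servers, load 0.7, comparing 99th-percentile latency under random per-server dispatch (N x M/G/1) against a single shared FCFS queue (M/G/N).

```python
import heapq
import random

def simulate(num_servers, shared_queue, n_jobs=20000, load=0.7, seed=1):
    """Return the 99th-percentile sojourn time for one of the two models."""
    rng = random.Random(seed)
    mean_service = 1.0
    arrival_rate = load * num_servers / mean_service
    t, arrivals = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)
        arrivals.append((t, rng.expovariate(1.0 / mean_service)))
    lat = []
    if shared_queue:
        free = [0.0] * num_servers          # next-free time per worker, one FCFS queue
        heapq.heapify(free)
        for arr, svc in arrivals:
            start = max(arr, heapq.heappop(free))
            heapq.heappush(free, start + svc)
            lat.append(start + svc - arr)
    else:
        free = [0.0] * num_servers          # one FIFO queue per server
        for arr, svc in arrivals:
            k = rng.randrange(num_servers)  # random (L4-style) dispatch
            start = max(arr, free[k])
            free[k] = start + svc
            lat.append(start + svc - arr)
    lat.sort()
    return lat[int(0.99 * len(lat))]

p99_split = simulate(16, shared_queue=False)
p99_shared = simulate(16, shared_queue=True)
```

With identical arrivals and service demands, the shared queue's tail latency is several times lower: a request never waits behind a long job in one queue while another server sits idle.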
Challenge
How can we implement RPC loadbalancing with single-queue performance across multiple servers while achieving high throughput and low latency?
Join-Bounded-Shortest-Queue JBSQ(n)
• Split-queue model
  • One central "unbounded" queue
  • Several distributed bounded queues (depth n)
• Delays the scheduling decision for better placement
• Trade-off:
  • Throughput: a small n exposes the communication overhead
  • Tail latency: a high n can lead to bad placement
[Diagram: unbounded central queue feeding per-server queues bounded at n = 2]
Always think about the trade-offs.
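The JBSQ(n) policy above can be sketched in a few lines. This is an illustrative dispatcher, not the R2P2 implementation: requests wait in one central queue, a server holds at most n outstanding requests, and each feedback message frees a slot.

```python
from collections import deque

class JBSQDispatcher:
    def __init__(self, servers, n):
        self.n = n
        self.outstanding = {s: 0 for s in servers}  # bounded queue depth per server
        self.central = deque()                       # the "unbounded" central queue
        self.assignments = []                        # (request, server) log

    def _try_dispatch(self):
        while self.central:
            # join the bounded shortest queue, if any has a free slot
            server = min(self.outstanding, key=self.outstanding.get)
            if self.outstanding[server] >= self.n:
                return                               # all bounded queues full: hold
            self.outstanding[server] += 1
            self.assignments.append((self.central.popleft(), server))

    def on_request(self, req):
        self.central.append(req)
        self._try_dispatch()

    def on_feedback(self, server):
        self.outstanding[server] -= 1                # a reply completed: slot freed
        self._try_dispatch()
```

Holding requests centrally until a slot opens is the delayed scheduling decision: with n small, a request is only committed to a server that is nearly idle, at the cost of an extra feedback round-trip per RPC.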
JBSQ RPC Load Balancing on R2P2
• Central queue of REQ0s in the middlebox
• Middlebox maintains #outstanding RPCs per server
• Feedback messages for each completed RPC
• Middlebox implemented in software (DPDK) or hardware (P4)
[Message flow: clients → middlebox: REQ0; middlebox → server: REQ0; (REQREADY), (REQN); REPLY directly to the client; (FEEDBACK) back to the middlebox]
JBSQ Evaluation
• 4 servers (DPDK) x 16 threads
• S = 10μs, exponential distribution
• 4-byte packets over R2P2
[Plot: 99th-percentile latency (µs, 0–150) vs. load (0–6 MRPS); curves: RANDOM, M/G/64]
JBSQ Evaluation
[Same plot, adding SW-JBSQ(1): curves RANDOM, SW-JBSQ(1), M/G/64]
Q: How will the SW-JBSQ(1) curve look, and why?
n = 1 is not enough to saturate throughput.
Q: What can we do to get more throughput?
JBSQ Evaluation
[Same plot, adding SW-JBSQ(5): curves RANDOM, SW-JBSQ(1), SW-JBSQ(5), M/G/64; SW-JBSQ(5) reaches the maximum throughput under the SLO]
JBSQ Evaluation
[Same plot, adding P4-JBSQ(3): curves RANDOM, SW-JBSQ(1), SW-JBSQ(5), P4-JBSQ(3), M/G/64]
The more efficient HW implementation allows a smaller n, and a smaller n is better for tail latency.
Alternative Policies
[R2P2 header layout (16-bit rows): Magic | Header Size; MessageType | Policy | F/L flags | Reserved; ReqId; PacketId / Packet Count]
• Policy field in the R2P2 header
• Existing policies:
  • ROUTE_ANY
  • ROUTE_FIXED
• Alternative policies:
  • STICKY
  • HASH
  • etc.
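A middlebox only needs to read fixed-offset header fields to enforce a policy, which is what makes a P4 implementation feasible. The sketch below packs and inspects an R2P2-style header; the exact field widths, order, and the type/policy nibble split are assumptions for illustration, not the wire format from the paper.

```python
import struct

# assumed layout: magic, header size, type/policy byte, flags, req id, pkt id/count
HEADER = struct.Struct("!BBBBHH")

POLICIES = {0: "ROUTE_ANY", 1: "ROUTE_FIXED", 2: "STICKY", 3: "HASH"}

def pack_header(magic, hdr_size, msg_type, policy, flags, req_id, pkt):
    # message type in the high nibble, policy in the low nibble (assumed split)
    return HEADER.pack(magic, hdr_size, (msg_type << 4) | policy, flags, req_id, pkt)

def policy_of(raw):
    """A middlebox decodes only the policy bits to pick a routing behavior."""
    _, _, type_policy, _, _, _ = HEADER.unpack(raw)
    return POLICIES[type_policy & 0x0F]
```

Keeping the policy in a fixed header byte means new policies (STICKY, HASH, ...) only change the middlebox's dispatch table, not the endpoints.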
Redis
• KV-store with master-slave replication
  • SETs only go to the master
  • GETs are load balanced
• 3+1 DPDK servers
• USR Facebook workload
• Baseline: Linux TCP
[Two plots: 99th-percentile latency (µs, 0–300) vs. load (MRPS); curves: TCP-DIRECT, RANDOM, SW-JBSQ(20); speedups over the baseline of 5.3x, 4.09x, 4.8x, and 5.6x]
Observations:
• R2P2 and DPDK increase throughput
• Scheduling benefits are more significant as service-time variability increases
R2P2 and JBSQ vs ZygOS
• 1 server x 16 threads
• S = 10μs, exponential distribution
• 64-byte packets
[Plot: 99th-percentile latency (µs, 0–150) vs. load (0–1.6 MRPS); curves: M/M/16, ZygOS, P4-JBSQ(3)]
Lessons Learnt from R2P2
1. Pushing functionality into the network is a viable option.
2. Programmable switches can undertake some of this functionality.
3. Adding network hops for better scheduling can improve performance.

Design Points to Remember
1. Place functionality in the right layer.
  • Can you think of alternative RPC policies or functionality that could be implemented with this new abstraction?
2. Separate the normal and the worst case.
  • Mention other use cases of this hint.
3. Leave it to the client.
  • Mention other use cases of this hint.
Conclusion
• R2P2, a transport protocol for RPCs
  • Exposes the RPC abstraction to the network
  • Enables in-network policy enforcement
• In-network RPC load balancing
  • Software/hardware middlebox
  • JBSQ scheduling policy
  • Extensible in-network policies
https://github.com/epfl-dcsl/r2p2
Thank you!