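Building a Packet-Forwarding OS from Scratch (Internet Week 2014)
TRANSCRIPT
Building a High-Speed Packet-Forwarding OS from Scratch
Graduate School of Information Science and Technology, The University of Tokyo
Project Assistant Professor Hirochika Asai <[email protected]>
November 18, 2014
Internet Week 2014
S6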
• SDN: Software Defined Network
  - Forwarding Plane, Control Plane
    - Control Plane
• NFV: Network Function Virtualization
    - CPU
• CPU = OS
  - Inexpensive
  - Flexible
  - Extensible
X as a Service
Network Function Virtualization
Service Function Chaining
"Networked" Operating System manages "computing" resources
"Networking" Operating System manages "computing" and "network" resources
• Not good for networking
  - VM
    - Tick
    - TLB etc.
    - I/O
• NIC
  - CPU?
  - Memory?
  - PCIe bus?
  - or something else?
• Ethernet
  - 64-byte frames
    - 1GbE: 1.488 Mpps = 672 ns/packet
    - 10GbE: 14.88 Mpps = 67.2 ns/packet
    - 40GbE: 59.52 Mpps = 16.8 ns/packet
    - 100GbE: 148.8 Mpps = 6.72 ns/packet
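The per-packet figures above follow from the 64-byte minimum frame plus the 20 bytes that every frame additionally occupies on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap). A minimal sketch of that arithmetic in C; the function names are hypothetical and chosen only for illustration:

```c
#include <assert.h>

/* Bytes on the wire per frame besides the frame itself:
 * 7-byte preamble + 1-byte SFD + 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20

/* Time one frame occupies the link, in nanoseconds.
 * bits / (Gbit/s) conveniently yields nanoseconds. */
double ns_per_packet(double link_gbps, unsigned frame_bytes)
{
    assert(link_gbps > 0.0);
    double bits = 8.0 * (frame_bytes + ETH_WIRE_OVERHEAD_BYTES);
    return bits / link_gbps;
}

/* Maximum packet rate at line rate, in Mpps. */
double max_mpps(double link_gbps, unsigned frame_bytes)
{
    return 1000.0 / ns_per_packet(link_gbps, frame_bytes);
}
```

Calling ns_per_packet(10.0, 64) gives 67.2 ns and max_mpps(10.0, 64) gives about 14.88 Mpps, matching the 10GbE row; the other rows scale linearly with the link speed.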
• CPU
    - netmap [Rizzo, USENIX ATC 2012]
    - Intel® DPDK
  - Linux NAPI
    - Intel® DPDK
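[Figure: system block diagram with two CPUs, each with an Integrated Memory Controller and Memory, an I/O Hub with PCIe, and an I/O Controller Hub with an on-board NIC attached via the Direct Media Interface; labels (a), (b), (c) mark the components discussed below.]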
(a) 3.3 GHz clock CPU
  • 0.3 ns per cycle (220 cycles / packet)
(b) CPU-Memory bus (N.B., 64 bit wide access)
  • DDR3-1333 Dual Channel: 21.333 GB/s (170.667 Gbps)
  • DDR3-1600 Dual Channel: 25.600 GB/s (204.800 Gbps)
  • DDR3-1866 Dual Channel: 29.867 GB/s (238.933 Gbps)
(c) PCIe bus
  • Gen2: 500 MB/s (x1) = 4 Gbps
    • usually x8 for a two-port 10GbE NIC
    • x16 is not enough for a two-port 40GbE NIC
  • Gen3: 985 MB/s (x1) = 7.88 Gbps
(d) DMI bus
  • v1.0: 2 GB/s (1 GB/s per direction = 8 Gbps)
  • v2.0: 4 GB/s (2 GB/s per direction = 16 Gbps)
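The bandwidth and cycle-budget numbers in (a) to (c) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, the DDR3 transfer rate is passed in MT/s (for example 1333.333 for DDR3-1333), and the PCIe per-lane rates are the ones quoted on the slide.

```c
#include <assert.h>

/* (a) CPU cycles available per packet at a given clock and packet time. */
double cycles_per_packet(double clock_ghz, double ns_per_pkt)
{
    return clock_ghz * ns_per_pkt;
}

/* (b) Peak DDR bandwidth in GB/s:
 * mega-transfers/s x 8 bytes (64-bit wide access) x number of channels. */
double ddr_gbytes_per_sec(double mega_transfers, unsigned channels)
{
    assert(channels > 0);
    return mega_transfers * 8.0 * channels / 1000.0;
}

/* (c) PCIe bandwidth in Gbps per direction:
 * per-lane MB/s x 8 bits x number of lanes. */
double pcie_gbps(double mbytes_per_lane, unsigned lanes)
{
    assert(lanes > 0);
    return mbytes_per_lane * 8.0 * lanes / 1000.0;
}
```

With these, pcie_gbps(500.0, 8) is 32 Gbps, enough for two 10GbE ports (20 Gbps), while pcie_gbps(500.0, 16) is 64 Gbps, short of the 80 Gbps that two 40GbE ports would need. cycles_per_packet(3.3, 67.2) comes to roughly 222, which the slide rounds to 220.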
Bottleneck?
• Data access latency (*)
  - L1 cache: 4-5 cycles ~ 1.2-1.5 ns
  - L2 cache: 12 cycles ~ 3.6 ns
  - L3 cache: 27.85 cycles ~ 8.4 ns
  - RAM: 28 cycles + 49-56 ns ~ 65 ns
(*) http://www.7-cpu.com/cpu/SandyBridge.html
• PCIe = Memory Mapped I/O (MMIO)
  - 1529.17 cycles / read
    - 392.1 ns / read
  - 282.621 cycles / write
    - 72.47 ns / write
1M, CPU Performance Monitoring Counter (PMC)
CPU: Intel Core i7 4770K
Memory: Corsair DDR3-1866 8GB x4
NIC: Intel X520-DA2
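The cycle counts and nanosecond figures for MMIO reads and writes are consistent with a clock of about 3.9 GHz, the frequency this deck lists for the Core i7 4770K in the evaluation setup. A small sketch of the conversion; cycles_to_ns is a hypothetical helper, and 3.9 GHz is an assumption about the clock the counter was running at:

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds for a given clock in GHz.
 * One cycle at f GHz lasts 1/f ns. */
double cycles_to_ns(double cycles, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz;
}
```

Under that assumption, cycles_to_ns(1529.17, 3.9) is about 392.1 ns and cycles_to_ns(282.621, 3.9) is about 72.47 ns.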
Generic NIC architecture
[Figure: ring buffer of descriptors pointing to packet buffers, with head and tail pointers]

Packet reception
1. NIC receives a packet
2. NIC transfers the packet data to a buffer in RAM via DMA
3. NIC advances the head pointer
4. Software processes the packet
5. Software advances the tail pointer to release the packet

Packet transmission
1. Software writes a packet to a buffer in RAM
2. Software advances the tail pointer to commit the packet
3. NIC transfers the packet data from the buffer in RAM via DMA
4. NIC transmits the packet
5. NIC advances the head pointer to notify that the packet has been transmitted
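The reception and transmission steps above can be sketched as a small software model of the two rings. This is an illustration under stated assumptions, not the driver from the talk: packets are plain ints, the ring has 8 entries, one descriptor is always left unused so that a full ring and an empty ring can be told apart, and all type and function names are hypothetical.

```c
#include <assert.h>

#define RING_SIZE 8u

struct ring {
    unsigned head;      /* advanced by the NIC */
    unsigned tail;      /* advanced by software */
    int buf[RING_SIZE]; /* stand-in for packet buffers in RAM */
};

/* Rx ring: software hands all but one descriptor to the NIC. */
void rx_init(struct ring *r) { r->head = 0; r->tail = RING_SIZE - 1; }

/* Rx steps 1-3 (NIC side). Returns 0, or -1 if the packet is dropped. */
int nic_receive(struct ring *r, int pkt)
{
    if (r->head == r->tail)
        return -1;                        /* no descriptor owned by the NIC */
    r->buf[r->head] = pkt;                /* step 2: "DMA" into the buffer */
    r->head = (r->head + 1) % RING_SIZE;  /* step 3: advance head */
    return 0;
}

/* Rx steps 4-5 (software side). Returns 1 if a packet was taken. */
int sw_receive(struct ring *r, int *pkt)
{
    unsigned next = (r->tail + 1) % RING_SIZE;
    if (next == r->head)
        return 0;                         /* nothing received yet */
    *pkt = r->buf[next];                  /* step 4: process the packet */
    r->tail = next;                       /* step 5: release the descriptor */
    return 1;
}

/* Tx ring: starts empty with head == tail. */
void tx_init(struct ring *r) { r->head = 0; r->tail = 0; }

/* Tx steps 1-2 (software side). Returns 0, or -1 if the ring is full. */
int sw_transmit(struct ring *r, int pkt)
{
    unsigned next = (r->tail + 1) % RING_SIZE;
    if (next == r->head)
        return -1;
    r->buf[r->tail] = pkt;                /* step 1: write the packet */
    r->tail = next;                       /* step 2: commit by advancing tail */
    return 0;
}

/* Tx steps 3-5 (NIC side). Returns 1 if a packet was sent. */
int nic_transmit(struct ring *r, int *pkt)
{
    if (r->head == r->tail)
        return 0;                         /* nothing committed */
    *pkt = r->buf[r->head];               /* steps 3-4: fetch and send */
    r->head = (r->head + 1) % RING_SIZE;  /* step 5: advance head */
    return 1;
}
```

In both directions the NIC only ever moves head and software only ever moves tail, so each side can tell how far the other has progressed by comparing the two indices. On real hardware, reading head and writing tail are MMIO accesses to NIC registers.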
  - UDP, CPU
  - Tx, n
    - Descriptor
    - n, Tx tail

    txq_tail = 0;
    for ( ;; ) {
        txq_head = read_txq_head();    /* MMIO read: ~392.1 ns */
        /* Available Tx queue length */
        txq_len = txq_sz - (txq_sz - txq_head + txq_tail) % txq_sz;
        /* Check the available Tx queue length */
        if ( txq_len < n ) continue;
        for ( i = 0; i < n; i++ ) {
            /* Set packet to the ring buffer at txq_tail */
            txq_ring[txq_tail].pkt = pkt_to_transmit;
            txq_tail = (txq_tail + 1) % txq_sz;
        }
        /* Commit */
        write_txq_tail(txq_tail);      /* MMIO write: ~72.47 ns */
    }
[Figure: packet rate [Mpps] (0 to 16) vs. bulk transfer size n [packets] (1 to 8) for frame sizes 64B, 96B, 128B, 192B, 256B, 384B, 512B, 768B, 1024B, and 1536B, with the 14.88 Mpps line marked. Annotations: ~500 ns/packet, ~250 ns/packet, ~125 ns/packet.]
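The measured MMIO costs (~392.1 ns per head read, ~72.47 ns per tail write) suggest a simple amortization model for the transmit loop: one read and one write are paid per batch of n packets. The sketch below assumes that model and ignores every other per-packet cost, so it is a lower bound rather than a prediction; the function names are hypothetical.

```c
#include <assert.h>

#define MMIO_READ_NS  392.1  /* one read of the Tx head register */
#define MMIO_WRITE_NS 72.47  /* one write of the Tx tail register */

/* MMIO cost per packet when one head read and one tail write
 * are shared by a batch of n packets. */
double mmio_ns_per_packet(unsigned n)
{
    assert(n > 0);
    return (MMIO_READ_NS + MMIO_WRITE_NS) / n;
}

/* Smallest batch size whose amortized MMIO cost fits the budget. */
unsigned min_bulk_size(double budget_ns)
{
    unsigned n = 1;
    assert(budget_ns > 0.0);
    while (mmio_ns_per_packet(n) > budget_ns)
        n++;
    return n;
}
```

For n = 1, 2, and 4 the model gives about 465, 232, and 116 ns per packet, the same range as the ~500, ~250, and ~125 ns/packet annotations. Under this model alone, min_bulk_size(67.2) returns 7, meaning at least seven 64-byte packets per batch are needed before MMIO overhead drops below the 10GbE per-packet budget.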
[Figure: RX queue ring and TX queue ring along a time axis, with hw and sw regions]
Strategy
• PCIe
    rxq_tail = txq_tail = 0;
    blkcnt = 0;
    /* # of packets to be routed in bulk transfer */
    nr_blk = 256; /* can be another value */
    for ( ;; ) {
        /* Rx queue head */
        rx_desc = GET_RX_DESC_HEAD(netdev);
        if ( DMA_COMPLETED(rx_desc) ) {
            /* Lookup routing table and copy from Rx to Tx */
            /* Rewrite destination MAC address, TTL--, */
            /* and calculate checksum */
            blkcnt++;
            if ( blkcnt >= nr_blk ) {
                blkcnt = 0;
                write_rxq_tail(rxq_tail);
                write_txq_tail(txq_tail);
            }
        } else {
            blkcnt = 0;
            write_rxq_tail(rxq_tail);
            write_txq_tail(txq_tail);
        }
    }
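To see how the bulk commit in the loop above reduces the number of tail-register writes, the following sketch replays the same control flow against a fixed burst of ready packets and counts commits. It is a simplified model with hypothetical names: the "packets ready" counter stands in for DMA_COMPLETED(rx_desc), each commit stands for one write_rxq_tail plus one write_txq_tail, and, unlike the loop above, it skips the commit when nothing is pending and stops once the burst is drained.

```c
#include <assert.h>

struct fwd_stats {
    unsigned forwarded; /* packets moved from Rx to Tx */
    unsigned commits;   /* times both tail registers were written */
};

/* Forward `ready` packets, committing tails every nr_blk packets
 * and once more when the Rx queue runs dry with packets pending. */
struct fwd_stats forward_burst(unsigned ready, unsigned nr_blk)
{
    struct fwd_stats st = { 0, 0 };
    unsigned blkcnt = 0;

    assert(nr_blk > 0);
    for (;;) {
        if (st.forwarded < ready) {   /* a received packet is available */
            st.forwarded++;           /* lookup, rewrite, copy Rx -> Tx */
            blkcnt++;
            if (blkcnt >= nr_blk) {   /* batch full: commit */
                blkcnt = 0;
                st.commits++;
            }
        } else {
            if (blkcnt > 0)           /* flush the partial batch */
                st.commits++;
            break;                    /* burst drained; stop the model */
        }
    }
    return st;
}
```

A burst of 1000 packets with nr_blk = 256 needs 4 commits in this model (three full batches and one flush of the remaining 232), compared with 1000 commits when nr_blk = 1. The flush in the else branch is what keeps latency bounded when traffic is light.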
[Figure: evaluation setup with a Transmitter and a Router (RX, TX) connected over untagged ports]
CPU: Intel(R) Core(TM) i7 4770K (3.90GHz, quad core)
Memory: 32GiB, DDR3-1866
NIC: Intel(R) X520-DA2 (2 ports)
[Figure: throughput [Gbps] (0 to 10) vs. frame size [byte] (0 to 1600) for "My implementation", "Linux", and "Line rate"]
1 TTL CPU
  - Spirent Communications Spirent TestCenter
    - Interop Tokyo 2014
    - SPT-N4U-110
    - CV-10G-S8
  - PC
    - CPU: Intel® Core i7 4770K
    - Memory: DDR3-1866 (8GB x4)
    - NIC: Intel® X520-DA2 (1)
[Figure: latency [us] (log scale, 1 to 1000) vs. test traffic (64-byte frame) [Gbps] (1 to 10), showing avg, min, and max. Annotations: 90%, ~10us, 0.001Mpps.]
• Networking Operating System
    - I/O etc.
  - OS
• 40GbE NIC
• Bottleneck
  - Not CPU
  - Not memory
  - PCIe MMIO
• OS
    - 10GbE
    - 10GbE x4 Tx