7/30/2019 A TRANSPORT PROTOCOL
A TRANSPORT PROTOCOL
FOR
DEDICATED END-TO-END CIRCUITS
A Thesis
Presented to
the faculty of the School of Engineering and Applied Science
University of Virginia
In Partial Fulfillment
of the requirements for the Degree
Master of Science
Computer Engineering
by
Anant P. Mudambi
January 2006
APPROVAL SHEET
This thesis is submitted in partial fulfillment of the requirements for the degree of
Master of Science
Computer Engineering
Anant P. Mudambi
This thesis has been read and approved by the examining committee:
Malathi Veeraraghavan (Advisor)
Marty A. Humphrey (Chair)
Stephen G. Wilson
Accepted for the School of Engineering and Applied Science:
Dean, School of Engineering and Applied Science
January 2006
Abstract
E-science projects involving geographically distributed data sources, computing resources and sci-
entists have special networking requirements, such as steady throughput and deterministic behav-
ior. The connectionless Internet model is not well-suited to meet such requirements. Connection-
oriented networks that offer guaranteed-rate, dedicated circuits have been proposed to meet the
high-end networking needs of distributed scientific research. In this work we describe the design
and implementation of a transport protocol for such dedicated circuits.
We present an initial user-space, UDP-based implementation called Fixed Rate Transport Proto-
col (FRTP). The constraints imposed by a user-space implementation led us to implement a lower-
overhead kernel-space solution that we call Circuit-TCP (C-TCP). The key feature of C-TCP is to
maintain a fixed sending rate, closely matched to the circuit rate, with the aim of achieving high
circuit utilization. We implemented C-TCP by modifying the Linux TCP/IP stack. Experimental
results on a wide-area circuit-switched testbed show that C-TCP is able to quickly utilize circuit
bandwidth and sustain a high data transfer rate.
Acknowledgments
I would like to thank Prof. Malathi Veeraraghavan, for her advice and for keeping me on the right
track. I thank the members of the CHEETAH research group, Xuan, Xiangfei, Zhanxiang and
Xiuduan, for all their help.
Anil and Kavita, thank you for keeping me motivated. Finally, the biggest thank you to my
parents, for their incredible support and love.
Contents
1 INTRODUCTION 1
2 BACKGROUND 3
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 TCP Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 UDP-based Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Novel Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 End-host Factors that Affect Data Transfer Performance . . . . . . . . . . . . . . . 6
2.2.1 Memory and I/O bus usage . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1.1 Zero-copy Networking . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Protocol Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Disk Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Process scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 CHEETAH Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Components of CHEETAH . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Features of a CHEETAH Network . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 The CHEETAH Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.4 End-host Software Support for CHEETAH . . . . . . . . . . . . . . . . . 14
3 UDP-BASED TRANSPORT PROTOCOL 16
3.1 SABUL Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 SABUL Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Modifications to SABUL: FRTP . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Problems with the FRTP Implementation . . . . . . . . . . . . . . . . . . 22
3.2.2 Possible Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 TCP-BASED SOLUTION 27
4.1 Transmission Control Protocol - A Primer . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Error Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.3 Congestion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.4 Self Clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Reasons for Selecting TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Circuit-TCP Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.1 Connection Establishment . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.2 Congestion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.3 Multiplexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.4 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.5 Error Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 C-TCP Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Web100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1 Utility of Disabling Slow Start . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5.2 Sustained Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.2.1 Reno-TCP Performance . . . . . . . . . . . . . . . . . . . . . . 46
4.5.2.2 BIC-TCP Performance . . . . . . . . . . . . . . . . . . . . . . 46
4.5.2.3 C-TCP Performance . . . . . . . . . . . . . . . . . . . . . . . . 47
5 CONTROL-PLANE FUNCTIONS 49
5.1 Selecting the Circuit Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Setting up the Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 CONCLUSIONS 56
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.1.1 Transport Protocol Design for Dedicated Circuits . . . . . . . . . . . . . . 56
6.1.2 Transport Protocol Implementation . . . . . . . . . . . . . . . . . . . . . 57
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A DISK WRITE RATE ESTIMATION 59
A.1 How Linux Handles Disk Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.2 Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Bibliography 66
List of Figures
2.1 Memory I/O bus usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 CHEETAH experimental testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Architecture of CHEETAH end-host software . . . . . . . . . . . . . . . . . . . . 15
3.1 Architecture of a generic UDP-based protocol . . . . . . . . . . . . . . . . . . . . 17
3.2 Need for receiver flow control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1 TCP self clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Congestion control in the control plane . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Structure of the Web100 stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Maximum buffer space required for a C-TCP burst . . . . . . . . . . . . . . . . . 41
4.5 Testbed configuration for C-TCP tests . . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 TCP and C-TCP comparison for different transfer sizes . . . . . . . . . . . . . . . 43
4.7 Start-up behavior of TCP and C-TCP . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.8 Throughput and RTT using Reno-TCP . . . . . . . . . . . . . . . . . . . . . . . . 46
4.9 Throughput and RTT using BIC-TCP . . . . . . . . . . . . . . . . . . . . . . . . 47
4.10 Throughput and RTT using C-TCP . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Throughput variability of disk-to-disk transfers . . . . . . . . . . . . . . . . . . . 50
5.2 Trade-off between circuit utilization and delay . . . . . . . . . . . . . . . . . . . . 51
List of Tables
5.1 xdd benchmark results on zelda4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Disk write rate (Mbps) for individual runs using 32 KB request sizes . . . . . . . . 52
A.1 End host configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.2 Disk write rate results using xdd . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Abbreviations
ACK Acknowledgement
AIMD Additive Increase Multiplicative Decrease
API Application Programming Interface
AQM Active Queue Management
BDP Bandwidth Delay Product
BIC-TCP Binary Increase Congestion control TCP
CHEETAH Circuit-switched High-speed End-to-End Transport ArcHitecture
COW Copy On Write
C-TCP Circuit TCP
cwnd congestion window
DMA Direct Memory Access
DNS Domain Name Service
DRAGON Dynamic Resource Allocation via GMPLS Optical Networks
FAST Fast AQM Scalable TCP
FRTP Fixed Rate Transport Protocol
GbE Gigabit Ethernet
Gbps Gigabits per second
GB Gigabyte
GMPLS Generalized Multiprotocol Label Switching
HS-TCP HighSpeed TCP
I/O Input/Output
IP Internet Protocol
KB Kilobyte
LAN Local Area Network
LMP Link Management Protocol
Mbps Megabits per second
MB Megabyte
MSPP Multi-Service Provisioning Platform
MTU Maximum Transmission Unit
NAK Negative ACK
NETBLT Network Block Transfer
NIC Network Interface Card
OC Optical Carrier
OCS Optical Connectivity Service
OS Operating System
OSPF Open Shortest Path First
RBUDP Reliable Blast UDP
RED Random Early Detection
RSVP-TE Resource Reservation Protocol with Traffic Engineering extensions
RTO Retransmission Time-out
RTT Round Trip Time
rwnd receiver advertised window
SABUL Simple Available Bandwidth Utilization Library
SACK Selective ACK
SONET Synchronous Optical Network
ssthresh slow start threshold
TCP Transmission Control Protocol
TDM Time Division Multiplexing
TSI Terascale Supernova Initiative
UDP User Datagram Protocol
UDT UDP-based Data Transfer protocol
XCP eXplicit Control Protocol
Chapter 1
INTRODUCTION
Many fields of research require significant computing resources to conduct simulations and/or to
analyze large amounts of data. Large data sets collected by remote instruments may need to be
processed. The SETI@home project [2], which uses data collected by the National Astronomy
and Ionosphere Center's radio telescope in Arecibo, Puerto Rico, is one such example. The telescope
generates about 35 GB of data per day that is stored on removable tapes and physically transported
to the server in Berkeley, California. In some cases, computations generate massive amounts of
output that has to be distributed to scientists who are physically at a distance from the computation
resource. For instance, the Terascale Supernova Initiative (TSI) project involves simulations run on
supercomputers at the Oak Ridge National Laboratory (ORNL), the results of which are used by
physicists at remote sites like the North Carolina State University (NCSU).
Networks connecting the data generation point, the computation resource and the scientists'
workplace make collaborative e-science much more practical. The large amounts of data involved
and, in some cases (e.g., real-time visualization), stringent delay/jitter requirements make it nec-
essary to use networks with large bandwidths and deterministic behavior. E-science applications
require high, constantly available bandwidth for their data transfer needs. It is difficult to provide
such rate-guaranteed services in packet-switched, connectionless networks, such as the present-day
Internet. This is because of the possibility of a large number of simultaneous flows competing for
the available network capacity. Therefore, the use of connection-oriented, dedicated circuits has
been proposed as a solution. Many research groups are implementing testbeds and the supporting
software to show the feasibility of such a solution.
The problem addressed in this thesis is the design of a transport protocol for dedicated circuits.
Many of the assumptions on which traditional transport protocols for packet-switched networks
are based need to be examined. For instance, the possibility of losses due to network buffer over-
flows makes congestion control an important function on connectionless networks. On connection-
oriented networks, because network resources are reserved for each data transfer, the end points of
the transfer have more control over whether or not network buffers will overflow. By maintaining
a data transfer rate that is matched to the reserved circuit's rate, the need for congestion control
can be eliminated. On the other hand, a transport layer function such as flow control is needed on
both connectionless and connection-oriented networks because it addresses a problem that network
resource reservation does not solve.
Our approach is to design the transport protocol under the assumption that resources are
reserved for a data transfer's exclusive use. The transport protocol should not have any features
that leave the reserved circuit unutilized. We implemented the transport protocol and tested it on a
wide-area, connection-oriented network testbed. This protocol is called Circuit-TCP (C-TCP).
The rest of this thesis is organized as follows. Chapter 2 provides background information on
previous work in this area as well as issues that affect the design and performance of our transport
protocol. In Chapter 3, we describe the Fixed Rate Transport Protocol (FRTP) that was implemented
in the user space over UDP. The shortcomings of a user-space implementation are pointed out.
Chapter 4 describes the design and implementation of C-TCP, our kernel space transport protocol
based on TCP. Experimental results over a testbed are used to compare C-TCP with TCP over
dedicated circuits. In Chapter 5 the control plane issues of determining the circuit rate and then
setting up the circuit are considered. The conclusions of this work are presented in Chapter 6.
Chapter 2
BACKGROUND
In this chapter we first look at other work that has been done in the development of transport pro-
tocols for high-performance networks. Next we point out some of the factors that play a significant
role in achieving high throughput on dedicated circuits. Many of these are end-host issues that we
discovered while implementing our transport protocol. This work has been conducted as a part
of the Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH) project. An
overview of CHEETAH is presented at the end of this chapter.
2.1 Related Work
There has been significant activity in developing transport protocols suitable for high-bandwidth
and/or high-delay networks. Even though very little of it is focused explicitly on dedicated
circuits, there is enough of an overlap in the problems to justify a closer examination. High-
performance protocols can be classified as TCP enhancements, UDP-based and novel protocols.
Ease of deployment and familiarity with the sockets API to the TCP and UDP protocol stacks are
reasons for the popularity of TCP and UDP-based solutions.
2.1.1 TCP Enhancements
TCP is the most widely used reliable transport protocol on connectionless, packet-switched net-
works. We describe basic TCP operation in Chapter 4. It is designed to work under a wide range
of conditions and this makes a few of its design decisions non-optimal for high-speed networks.
In recent years a number of protocol extensions to TCP have been proposed and implemented to
address this issue. Selective acknowledgements (SACKs) [27,16] have been proposed to deal more
efficiently with multiple losses in a round trip time (RTT) [13]. TCP uses cumulative acknowl-
edgements (ACKs) which means a data byte is not ACKed unless all data earlier in the sequence
space has been received successfully. SACKs inform the sender about out-of-sequence data already
received and help prevent unnecessary retransmissions. Two protocol extensions, the timestamps op-
tion and window scaling, were proposed in [22]. The timestamps option field in a data packet's
TCP header is filled in by a sender and echoed back in the corresponding ACK. It serves two pur-
poses. First, the timestamp can be used to estimate the round trip time more accurately and more
often. This gives the sender a better value for retransmission timeout (RTO) computation. Second,
the timestamp in a received packet can be used to prevent sequence number wraparound. The TCP
header has a 16-bit field for the window size, which limits the window size to 64 KB. This is insuf-
ficient for high-bandwidth, high-delay networks. The window scaling option allows a scaling factor
to be chosen during connection establishment. Subsequent window advertisements are right shifted
by the selected scaling factor. Scaling factors of up to 14 are allowed; thus, by using this option, a
window size of up to 1 GB can be advertised.
Standard TCP (also called Reno TCP) has been found wanting in high-bandwidth, high-delay
environments, mainly due to its congestion control algorithm. TCP's Additive Increase Multi-
plicative Decrease (AIMD) algorithm is considered too slow in utilizing available capacity and too
drastic in cutting back when network congestion is inferred. Modifications to the TCP conges-
tion control algorithm have led to the development of HighSpeed TCP [14], Scalable TCP [25],
FAST [23], and BIC-TCP [39], among others. Standard TCP requires unrealistically low loss rates
to achieve high throughputs. HighSpeed TCP is a proposed change to the TCP AIMD parameters
that allows a TCP connection to achieve high sending rates under more realistic loss conditions.
Scalable TCP also proposes modified AIMD parameters that speed up TCP's recovery from loss.
FAST infers network congestion and adjusts its window size based on queueing delays rather than
loss. BIC-TCP (BIC stands for Binary Increase Congestion control) is a new congestion control
algorithm that scales well to high bandwidth (i.e., it can achieve a high throughput at reasonable
packet loss rates) and is TCP-friendly (i.e., when the loss rate is high its performance is the same
as standard TCP's). In addition, unlike HighSpeed or Scalable TCP, BIC-TCP's congestion control
is designed such that two flows with different RTTs share the available bandwidth in a reasonably
fair manner.
2.1.2 UDP-based Protocols
To overcome the shortcomings of TCP, many researchers have implemented protocols over UDP by
adding required functionality, such as reliability, in the user space. The most common model is to
use UDP for the data transfer and a separate TCP or UDP channel for control traffic. SABUL [18],
Tsunami, Hurricane [38], and RBUDP [20] use a TCP control channel and UDT [19] uses UDP
for both data and control channels. The advantage of these solutions is that their user-space imple-
mentation makes deployment easy. At the same time, there are some limitations that arise because
these protocols are implemented in the user-space. In Chapter 3, we describe SABUL. Our attempt
at modifying SABUL to implement a transport protocol for dedicated circuits and the shortcomings
of a user-space transport protocol implementation are also pointed out.
2.1.3 Novel Protocols
Some novel protocols designed exclusively for high-performance data transfer have also been pro-
posed. The eXplicit Control Protocol (XCP) [24] was proposed to solve TCP's stability and effi-
ciency problems. By separating link utilization control from fairness control, XCP is able to make
more efficient use of network resources in a fair manner. XCP's requirement of multi-bit congestion
signals from the network makes it harder to deploy since routers in the network need to be modified.
NETBLT [10] was proposed for high-speed bulk data transfer. It provides reliable data transfer by
sending blocks of data in a lock-step manner. This degrades bandwidth utilization while the sender
awaits an acknowledgement (ACK) for each block.
2.2 End-host Factors that Affect Data Transfer Performance
Setting up a dedicated circuit involves resource reservation in the network. Depending on the
network composition, the resources reserved could be wavelengths, ports on a switch or time slots.
Ideally, we would like to fully use the reserved resources for exactly the time required to complete
the transfer. During the implementation of our transport protocol, we found that there are many
factors that make it hard to achieve this ideal. In this section we list a few of these factors that
impact the performance of transport protocol implementations.
2.2.1 Memory and I/O bus usage
First, consider an application that uses the transport protocol to carry out a file transfer. At the
sending end, the application has to
1. Read data from the disk, e.g. by invoking a read system call.
2. Send the data out on the network, e.g. by invoking a send system call.
There are two types of overhead in carrying out these operations. The system calls involve the over-
head of saving the process registers on the stack before the system call handler is invoked. Second,
the two steps shown above could involve multiple passes over the memory and I/O bus. This is
illustrated in Figure 2.1(a). The figure shows the bus operations involved in moving data from the
disk to user space buffers (step 1 above), and from the user space buffer to kernel network buffers
(part of step 2). To avoid having to access the disk each time, for multiple accesses to a chunk of
data, the operating system caches recently accessed disk data in memory. This cache is called the
page cache, and direct memory access (DMA) is used for transfers between the page cache and the
disk (operation I in Figure 2.1(a)). Two passes over the memory bus are needed to transfer the data
from the page cache to the user space buffer (operation II). To send data out to the network, it is
again copied from the user space buffer to kernel network buffers (operation III). We do not show
the transfer from the kernel network buffer to the NIC, which is the final step in getting data out
into the network. For data transfers using TCP sockets on Linux, the sendfile system call can be
[Figure shows the path of data between the hard disk, the page cache, user-space memory, kernel-space network buffers, and the NIC: panel (a), "Using read and send", with bus operations I, II, and III; panel (b), "Using sendfile".]
Figure 2.1: Memory I/O bus usage
used to cut down the number of passes over the memory bus to three. As shown in Figure 2.1(b),
sendfile copies data directly from the page cache to the kernel network buffers, thus avoiding the
copy to user space and back. In addition, sendfile needs to be invoked just once for a single file, so
the overhead of making a system call is paid only once per file.
2.2.1.1 Zero-copy Networking
Other methods for avoiding the copy from user-space memory to kernel-space memory have been
proposed. Such methods are known by the common term zero-copy networking. For a classification
of zero-copy schemes see [7]. The zero in zero-copy networking indicates that there is no memory-
to-memory copy involved in the transfer of data between a user space buffer and the network. So,
in Figure 2.1(a), a zero-copy scheme would eliminate memory-to-memory copies after operation
II. How the data got into the user- or kernel-space buffer in the first place, and whether that required
a copy, is not considered. Zero-copy schemes can be supported if an application interacts directly
with the NIC without passing through the kernel, or if the buffers are shared between user and
kernel space, rather than being copied. For an application to directly read from and write to the NIC
buffer, protocol processing has to be done on the NIC. At the sender, buffers can be shared between
the application and the kernel if the application can guarantee that a buffer that has not yet been
transmitted will not be overwritten. One way to ensure this would be if the system call invoked to
send some data returns only after all of that data has been successfully transmitted. Since a reliable
transport protocol can consider a buffer to have been successfully transmitted only when all of the
data in that buffer has successfully reached the intended receiver, the application may need to wait
a while before it can reuse a buffer. An interesting alternative is to mark the buffer as copy-on-write
(COW), so that the contents of the buffer are copied to a separate buffer if and when the application
tries to overwrite it. Implementations of send-side zero-copy schemes for different operating systems
are described in [28].
Now consider the steps at a receiver. A receiver performs the steps shown in Figure 2.1(a) in
reverse order (there is no sendfile equivalent for the receiver). One way to implement zero-copy
on the receiver is to change the page table of the application process when it issues a recv system
call. This is called page flipping in [28]. Page flipping works only if the NIC separates the packet
payload and header, if the packet payload is an exact multiple of the page size and if the buffer
provided by the application is aligned to page boundaries. Because of these requirements there has
been little effort to implement such a scheme.
Several factors that influence communication overhead are presented in [33]. The memory and
I/O bus usage for schemes with different kernel and interface hardware support are compared. For
instance, the author shows how, by using DMA, NIC buffering and checksum offload, the number
of passes over the bus can be reduced from six to one.
2.2.2 Protocol Overhead
Apart from the memory and I/O bus, the other main end host resource that could become a bottle-
neck is processor cycles. TCP/IP, being the most widely used protocol stack, has received attention
in this regard. In [9] the processing overhead of TCP/IP is estimated, and the authors' conclusion
is that with a proper implementation, TCP/IP can sustain high throughputs efficiently. More recent
work presented in [17] takes into consideration the OS and hardware support that a TCP implemen-
tation will require.
The overhead of a transport layer protocol can be divided into two categories: per-packet costs
and per-byte costs [9, 28, 6]. Per-packet costs include protocol processing (e.g., processing the
sequence numbers on each packet in TCP) and interrupt processing. Per-byte costs are incurred
when data is copied or during checksum calculation.
Per-packet overhead can be reduced by reducing the number of packets handled during the
transfer. For a given transfer size, the number of packets can be reduced by using larger packets.
The maximum transmission unit (MTU) of the network constrains the packet size that an end host
can use. For instance, Ethernet imposes a 1500-byte limit on the IP datagram size. The concept
of jumbo frames was introduced by Alteon Networks to allow Ethernet frames of up to 9000 bytes,
and many gigabit Ethernet NICs now support larger frame sizes. Larger packet sizes can decrease
protocol processing overhead as well as the overhead of interrupt processing. NICs interrupt the
processor on frame transmission and reception. An interrupt is costly for the processor because
the state of the currently running process has to be saved and an interrupt handler invoked to deal
with the interrupt. As interface rates increase to 1 Gbps and higher, interrupt overhead can become
significant. Many high-speed NICs support interrupt coalescing so that the processor is interrupted
for a group of transmitted or received packets, rather than for each individual packet.
Schemes to reduce per-byte costs involved in copying data over the memory I/O bus were
described in Section 2.2.1. Checksum calculation can be combined with a copy operation and
carried out efficiently in software. For instance, the sender could calculate the checksum when data
is being copied from the user-space buffer to the kernel-space buffer. Another way to reduce the
processor's checksum calculation burden is to offload it to the interface card.
2.2.3 Disk Access
All the factors considered so far affect data transfer throughput. In designing a transport protocol
for dedicated circuits, not only does a high throughput have to be maintained, but the circuit utilization
should also be high. Thus end host factors that cause variability in the throughput also need to
be considered. For disk-to-disk data transfers, disk access can limit throughput as well as cause
variability. The file system used can have an effect on disk access performance. The time to
physically move the disk read/write head to the area on the hard disk where the desired data resides,
called seek time, is a major component of the disk access latency. File accesses tend to be sequential,
so a file system that tries to keep all parts of a file clustered together on the hard disk would perform
better than one in which a file is broken up into small pieces spread all over the hard disk.
At the sender, data needs to be read from the disk to memory. System calls to do this are
synchronous. When the system call returns successfully, the requested data is available in memory
for immediate use. Operating systems try to improve the efficiency of disk reads by reading in
more than the requested amount, so that some of the subsequent requests can be satisfied without
involving the disk hardware. At the data receiver, the system call to write to disk is asynchronous
by default. This means that when the system call returns it is not guaranteed that the data has been
written to disk; instead it could just be buffered in memory. Asynchronous writes are tailored to
make the common case of small, random writes efficient, since they allow the operating system
to schedule disk writes in an efficient manner. The operating system could reorder the writes to
minimize seeks. In Linux, for instance, data written to disk is actually copied to memory buffers
in the page cache and these buffers are marked dirty. Two kernel threads, bdflush and kupdate, are
responsible for flushing dirty buffers to disk. The bdflush kernel thread is activated when the number
of dirty buffers exceeds a threshold, and kupdate is activated whenever a buffer has remained dirty
too long. As a consequence of the kernel caching and delayed synchronization between memory
buffers and the disk, there can be significant variability in the conditions under which a disk write
system call operates.
2.2.4 Process scheduling
The final factor we consider is the effect of the process scheduler. All modern operating sys-
tems are multitasking. Processes run on the processor for short intervals of time and then either
relinquish the CPU voluntarily (e.g. if they block waiting for I/O) or are forcibly evicted by the
operating system when their time slice runs out. This gives users the impression that multiple processes
are running simultaneously. Multitasking, like packet-switched networking, tries to fairly divide up
a resource (processor cycles for multitasking; bandwidth for packet-switched networking) among
all contenders (multiple processes; multiple flows) for the resource. This behavior is at odds with
resource reservation in a connection-oriented network such as CHEETAH. If the degree of multitasking at an end host is high then a data transfer application may not get the processor cycles
required to fully use the reserved circuit. Even if the required number of free cycles are available,
the process scheduler might not be able to schedule the data transfer application in the monotonic
fashion required to send and receive data at the fixed circuit rate.
2.3 CHEETAH Network
CHEETAH, which stands for Circuit-switched High-speed End-to-End Transport ArcHitecture, is a
network architecture that has been proposed [37] to provide high-speed, end-to-end connectivity on
a call-by-call basis. Since the transport protocol proposed in this thesis is to be used over a dedicated circuit through a CHEETAH network, in this section we provide a description of CHEETAH.
2.3.1 Components of CHEETAH
Many applications in the scientific computing domain require high throughput transfers with deter-
ministic behavior. A circuit-switched path through the network can meet such requirements better
than a packet-switched path. CHEETAH aims to bring the benefits of a dedicated circuit to an end-
user. In order to allow wide implementation, CHEETAH has been designed to build on existing
network infrastructure instead of requiring radical changes. Ethernet and SONET (Synchronous
Optical Network) are the most widely used technologies in local area networks (LANs) and wide
area networks (WANs) respectively. To take advantage of this, a CHEETAH end-to-end path con-
sists of Ethernet links at the edges and Ethernet-over-SONET links in the core. Multi-Service
Provisioning Platforms (MSPPs) are hardware devices that make such end-to-end paths possible.
MSPPs are capable of mapping between the packet-switched Ethernet domain and the time divi-
sion multiplexed (TDM) SONET domain. MSPPs are an important component of the CHEETAH
architecture for three reasons.
1. The end hosts can use common Ethernet NICs and do not need, for instance, SONET line
cards.
2. Many enterprises already have MSPPs deployed to connect to their ISP's backbone network.
3. Standard signaling protocols, as defined for Generalized Multiprotocol Label Switching
(GMPLS) networks, are (being) implemented in MSPPs. This is essential to support dynamic
call-by-call sharing in a CHEETAH network.
2.3.2 Features of a CHEETAH Network
One of the salient features of CHEETAH is that it is an add-on service to the existing packet-
switched service through the Internet. This means, firstly, that applications requiring CHEETAH
service can co-exist with applications for which a path through the packet-switched Internet is good
enough. Secondly, because network resources are finite, it is possible that an application's request for a dedicated circuit is rejected; in such cases, the Internet path provides an alternative so that the application's data transfer does not fail. To realize this feature, end hosts are equipped with an
additional NIC that is used exclusively for data transfer over a CHEETAH circuit.
To make the CHEETAH architecture scalable, the network resource reservation necessary to
set up an end-to-end circuit should be done in a distributed and dynamic manner. Standardized
signaling protocols that operate in a distributed manner, such as the hop-by-hop signaling in GM-
PLS protocols, are key to achieving scalability. CHEETAH uses RSVP-TE 1 signaling in the control
plane. Dynamic circuit set up and tear down means that these operations are performed when (and
only when) required, as opposed to statically provisioning a circuit for a long period of time. Dy-
namic operation is essential for scalability because it allows the resources to be better utilized, thus
driving down costs. End-host applications that want to use a CHEETAH circuit are best-placed
to decide when the circuit should be set up or torn down. Therefore an end host connected to the
CHEETAH network runs signaling software that can be used by applications to attempt circuit set
up on a call-by-call basis.
With end-host signaling in place, applications that want to use a CHEETAH circuit can do so
in a dynamic manner. This leads to the question of whether, just because it can be done, a circuit
set up should be attempted for a given data transfer. In [37], analytical arguments are used to show
1 Resource Reservation Protocol-Traffic Engineering. This is the signaling component of the GMPLS protocols, the
other components being Link Management Protocol (LMP) and Open Shortest Path First-TE (OSPF-TE).
that, for data transfers above a threshold size, transfer delay can be reduced by using a CHEETAH
circuit rather than an Internet path. It is also worth noting that there are situations in which the
overhead of circuit set up makes it advantageous to use a path through the Internet, although for
wide-area bulk data transfer a dedicated circuit invariably trumps an Internet path.
2.3.3 The CHEETAH Testbed
To study the feasibility of the CHEETAH concept, an experimental testbed has been set up. This
testbed extends between North Carolina State University (NCSU), Raleigh, NC, and Oak Ridge Na-
tional Laboratory (ORNL), Oak Ridge, TN and passes through the MCNC point-of-presence (PoP)
in Research Triangle Park, NC and the Southern Crossroads/Southern LambdaRail (SOX/SLR) PoP
in Atlanta, GA. The testbed layout is shown in Figure 2.2. In this testbed, the Sycamore SN16000
Intelligent Optical Switch is used as the MSPP. In the figure we show end hosts connected directly
or through Ethernet switches to the gigabit Ethernet card on the SN16000. The cross connect card
is configured through the control card to set up a circuit. The SN16000 has an implementation of
the GMPLS signaling protocol that follows the standard and has been tested for interoperability.
"
"
' ( 0 3
( 5 7 9
' A
B
' 0 5 F G G G
3 H
( P 0 Q 3 0
"
"
' ( 0 3
( 5 7 9
' A
B
' 0 5 F G G G
3 H
`
`
"
"
' ( 0 3
( 5 7 9
' A
B
' 0 5 F G G G
3 H
a 0 0
d
H
S
d
H
3 H
0 ' i 0
Figure 2.2: CHEETAH experimental testbed
The testbed has been designed to support the networking needs of the TSI project mentioned
at the beginning of this chapter. We present results of experiments conducted over this testbed in
Chapter 4.
2.3.4 End-host Software Support for CHEETAH
To allow applications to start using CHEETAH circuits, software support is required to make the
end hosts CHEETAH enabled. The architecture of the end-host software is shown in Figure 2.3.
The relevant components of the CHEETAH end-host software are shown inside a dotted box to
signify that the application could either interact with each component individually or make higher-
level calls that hide the details of the components being invoked. To be able to use a CHEETAH
circuit between two end hosts, both should support CHEETAH.
The Optical Connectivity Service (OCS) client allows applications to query whether a re-
mote host is on the CHEETAH network. OCS uses the Internet's Domain Name Service (DNS) to provide additional information such as the IP address of the remote end's secondary NIC. As
mentioned earlier, depending on the situation, either a CHEETAH circuit or a path through the In-
ternet may be better for a particular transfer. The routing decision module takes measurements of
relevant network parameters (e.g., available bandwidth and average loss rate) and uses these along
with the parameters of a particular transfer (e.g., the file size and requested circuit rate) to decide
whether or not a CHEETAH circuit set up should be attempted. To achieve CHEETAH's goal of
distributed circuit set up, an RSVP-TE signaling module runs on each end host. The RSVP-TE
module exchanges control messages with the enterprise MSPP to set up and tear down circuits.
These control messages are routed through the primary NIC over the Internet. The final software
component is the transport protocol module. Depending on whether a circuit or an Internet path
is being used, the transport protocol used will be C-TCP or TCP. In this thesis the focus will be on
the design, implementation and evaluation of C-TCP.
To end this chapter we mention some of the other projects focused on connection-oriented
networking for e-science projects. UltraScience Net [36] is a Department of Energy sponsored
research testbed connecting Atlanta, Chicago, Seattle and Sunnyvale. This network uses a centralized scheme for the control-plane functions.

Figure 2.3: Architecture of CHEETAH end-host software

Another effort is the Dynamic Resource Allocation via
GMPLS Optical Networks (DRAGON) project [12]. DRAGON uses GMPLS protocols to support
dynamic bandwidth provisioning.
Chapter 3
UDP-BASED TRANSPORT PROTOCOL
In Chapter 2 we mentioned a few protocols that are based on UDP. There are good reasons for
taking this approach:
UDP provides the minimal functionality of a transport protocol. It transfers datagrams between two processes but makes no guarantees about their delivery. UDP's minimalism leaves
no scope for anything to be taken out of its implementation. Thus a new protocol built over
UDP has to add extra (and only the required) functionality. The significance of this is that
these additions can be done in the user space, without requiring changes to the operating
system code. This makes UDP-based solutions as easy to use and portable as an application
program.
The sockets API to the UDP and TCP kernel code is widely deployed and used. This makes
implementation easier and faster.
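As a minimal sketch of this approach (an illustrative Python example, with a hypothetical 4-byte sequence-number header rather than SABUL's actual packet format), a user-space protocol needs nothing more than the datagram socket API:

```python
import socket

# Data channel of a generic UDP-based protocol: plain datagrams, with a
# protocol-defined sequence number prepended by the user-space code.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(5.0)
port = recv.getsockname()[1]

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
seq = 0
# Hypothetical wire format: 4-byte big-endian sequence number + payload.
send.sendto(seq.to_bytes(4, "big") + b"data", ("127.0.0.1", port))

pkt, _ = recv.recvfrom(2048)
got_seq, payload = int.from_bytes(pkt[:4], "big"), pkt[4:]
```

Everything beyond this (reliability, rate control, flow control) is layered on by the protocol implementation in user space, typically over a separate control channel.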
The basic design of all UDP-based protocols is similar and is shown in Figure 3.1. Data packets
are transferred using UDP sockets. A separate TCP or UDP channel is used to carry control packets. Control packets serve to add features to the data transfer not provided by UDP's best-effort
service. We used the Simple Available Bandwidth Utilization Library (SABUL), a UDP-based data
transfer application, to implement the Fixed Rate Transport Protocol (FRTP). In this chapter we first
present an overview of the SABUL protocol and implementation. Then we describe the changes
Figure 3.1: Architecture of a generic UDP-based protocol
that we made to SABUL to implement FRTP. The advantages and shortcomings of this approach
are discussed.
3.1 SABUL Overview
SABUL is designed for bulk data transfers over high-bandwidth networks. SABUL's architecture
is the same as that shown in Figure 3.1. TCP is used for control packet transmission from the data
receiver to the data sender. SABUL adds reliability, rate-based congestion control and flow control
to UDPs basic data transfer service.
Providing end-to-end reliability (guaranteeing that all the data sent is received in the same order and without duplicates) is a function of the transport layer. SABUL implements the following
error control scheme for reliable transfer. It adds a sequence number to each UDP data packet.
The receiver detects packet loss using the sequence numbers of the received packets. On inferring
loss, the receiver immediately sends a negative-acknowledgement (NAK) control packet to convey
this information to the sender. The sender then recovers from the error by retransmitting the lost
packet(s). The receiver maintains an ERR timer to periodically send NAKs if there are missing
packets. This is to provide protection against lost retransmissions. For file transfers, reading data
from the disk for each retransmission is very expensive in time. Therefore, the sender keeps the
transmitted data in memory until it is acknowledged. A SABUL receiver periodically sends an ac-
knowledgement (ACK) control packet, acknowledging all packets received in-order. On receiving
an ACK, the sender can free the buffer space occupied by data that is confirmed to have been re-
ceived. In addition the SABUL sender has a timer that is reset each time a control packet is received.
If this timer (called the EXP timer) expires because no control information has been received, the
sender assumes that all unacknowledged packets have been lost and retransmits them.
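The receiver's loss-inference step can be sketched as follows (an illustrative Python function, not SABUL's actual code): any gap between the next expected sequence number and the highest sequence number seen so far identifies packets to report in a NAK.

```python
def missing_seqs(received, next_expected):
    # Infer lost packets from the sequence numbers of received packets:
    # everything from next_expected up to (but not including) the highest
    # sequence number seen, that has not itself arrived, goes on the loss list.
    seen = set(received)
    top = max(seen, default=next_expected)
    return [s for s in range(next_expected, top) if s not in seen]
```

For example, if packets 1, 2, 4 and 6 have arrived and 1 was the next expected packet, the function reports 3 and 5 as missing; packet 7 and beyond cannot yet be declared lost because no later packet has arrived to reveal the gap.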
SABUL uses a rate-based congestion control scheme. The sender modifies the sending rate
depending on the degree of congestion in the network. The SABUL receiver sends a periodic syn-
chronization (SYN) control packet containing the number of data packets received in the previous
SYN period. The sender uses this information to estimate the amount of loss and hence the con-
gestion in the network. Depending on whether the loss is above or below a threshold, the sending
rate is reduced or increased, respectively. The sending rate is modified by changing the inter-packet
gap.
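A sketch of this style of rate adjustment (an illustrative Python function; the threshold and adjustment factor are made-up values, not SABUL's actual constants):

```python
def adjust_gap(gap_us, loss_ratio, threshold=0.001, factor=1.125):
    # SABUL-style rate control, sketched: if the loss reported for the last
    # SYN period exceeds the threshold, widen the inter-packet gap (lower the
    # sending rate); otherwise narrow it (raise the rate).
    return gap_us * factor if loss_ratio > threshold else gap_us / factor
```

Because the sending rate is the packet size divided by the inter-packet gap, multiplying the gap by a factor divides the rate by the same factor.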
SABUL is a user-space implementation, which means a SABUL receiver cannot distinguish between loss due to network congestion and loss due to its receive buffer (the kernel UDP buffer) overflowing. The information in SYN packets represents both types of loss, and therefore, SABUL's
rate-based congestion control also serves as a reactive flow control strategy. In addition, a fixed
window is used to limit the amount of unacknowledged data in the network.
3.1.1 SABUL Implementation
The SABUL implementation is described next. It is important to separate the SABUL transport
protocol from an application that uses it. In the description below we refer to an application using
SABUL as the sending application or receiving application. The sending application generates
the data that is to be transferred using SABUL, for example by reading it from a file on the hard
disk. The receiving application, likewise, consumes the data transferred using SABUL. SABUL
is implemented in C++. The sending application invokes a SABUL method to put data into the
protocol buffer. SABUL manages the protocol buffer and transmits or retransmits data packets
from it. Two threads are used. One handles the interface with the sending application, mainly the
filling of the protocol buffer. The other thread is responsible for sending out data packets. The
sequence numbers of packets that need to be retransmitted are recorded in a loss list. Pseudocode
for the sender side functionality is shown below:
INITIALIZATION:
Create TCP socket on well-known port number
Listen for a connection
Accept connection from client
Get the UDP port number on which the receiver is expecting data
Calculate the inter-packet gap required to maintain the desired sending rate
Fork a new thread to handle the data transmission
DATA TRANSMISSION:
WHILE data transfer is not over
WHILE protocol buffer is empty AND data transfer is not over
Wait for data from the sending application
ENDWHILE
Poll control channel for control packets
IF control packet received THEN
Process control packet /* See below */
ENDIF
IF loss list is not empty THEN
Remove first packet from the loss list
ELSE
Form a new packet
ENDIF
Send the data packet by invoking the send() system call on the UDP socket
Wait till it is time to send the next packet
ENDWHILE
CONTROL PACKET PROCESSING:
IF ACK packet THEN
Release buffer space held by the acknowledged packet(s)
Update loss list
Inform sending application of availability of buffer space
ELSE IF NAK packet THEN
Update loss list
ELSE IF SYN packet THEN
Adjust sending rate
ENDIF
Two threads are used at the receiver too. One thread (call it the network thread) is responsible
for receiving data packets, writing the data into the protocol buffer and sending control packets.
The other thread (main thread) handles the interface with the receiving application, transferring
data from the protocol buffer to the application buffer. SABUL uses an optimization when the
receiving application asks to read more data than the protocol buffer has. The main thread sets a
flag indicating such a situation. On seeing this flag, the network thread copies all available data
into the application buffer and resets the flag. As the rest of the data requested by the receiving
application arrives, it is copied directly into the application buffer saving a memory copy. The
receiver side pseudocode follows.
INITIALIZATION:
Create TCP and UDP sockets
Connect to the sender
Inform the sender of the UDP port number
Fork a new thread to receive data
RECEIVING DATA:
WHILE all the data has not been received
IF receiving application is waiting for data THEN
Copy all ACKed data from protocol buffer to application buffer
ENDIF
IF ACK timer expired THEN
Send ACK packet
ENDIF
IF ERR timer expired THEN
Send NAK packet with sequence numbers of missing packets
ENDIF
IF SYN timer expired THEN
Send SYN packet with number of packets received in previous SYN interval
ENDIF
Get the address into which to receive the next expected data packet
Receive a data packet on the UDP socket
IF missing packets THEN
Add missing packets' sequence numbers to loss list
Send an immediate NAK packet
ENDIF
Update state variables like next expected sequence number, ACK sequence number
Update loss list
ENDWHILE
3.2 Modifications to SABUL: FRTP
Our initial idea for a transport protocol that can be used over dedicated circuits was that, since
bandwidth is reserved, the data should be just streamed across at the circuit rate. Transmitting at a
rate lower than the reserved circuit rate would leave bandwidth unutilized. Transmitting at a higher
rate would eventually lead to a buffer filling up and overflowing. Therefore we wanted a transport
protocol that could monotonically send data packets at a fixed rate. SABUL seemed like a perfect
match for doing this since it can maintain a fixed sending rate if its rate-based congestion control
was disabled. FRTP, our transport protocol for dedicated circuits, could be implemented just like
SABUL, except that the rate altering congestion control would be stripped out.
The first modification to SABUL code was to remove the rate-based congestion control that
modified the sending rate. Second, we added support for using separate NICs for the data and
control channels. This was in line with the CHEETAH concept of having two NICs on CHEETAH-
enabled hosts. SABUL (and hence, FRTP) has many parameters that can be tweaked to improve
its performance. The application, protocol and UDP buffer sizes can be changed. The values of
the different timers that SABUL uses are also available for adjustment. We ran experiments in a
laboratory setting [40] to determine the effect of some of these parameters on FRTP performance,
and possibly determine the optimal values. Although we failed to determine a set of optimal values
for the parameters, these experiments did reveal some of the flawed assumptions we were making.
3.2.1 Problems with the FRTP Implementation
We observed that even though FRTP was set up to send at a fixed rate, the throughput achieved
(amount of data transferred / transfer time) was lower than the sending rate. This difference grew as
the sending rate was increased. We found that the reasons for this discrepancy were two-fold. First,
the FRTP implementation was not able to maintain a monotonic sending rate. Second, even if the
sender was able to maintain a constant sending rate, the receiving application could not empty the
buffers at the same (or higher) rate. This led to receiver buffer overflow and retransmissions, which
reduced the throughput.
FRTP implements a fixed sending rate by maintaining a fixed inter-packet gap. For instance,
if 1500 byte packets are being used, a 1 Gbps sending rate can be maintained by ensuring that the
gap between successive transmitted packets is 12 μs (= 1500 bytes × 8 / 1 Gbps). Commodity operating
systems do not provide straightforward methods (if at all) to measure such small intervals of time
and certainly do not provide a method to reliably schedule a periodic action at such a fine granularity. For instance, the timer tick granularity available to user-space processes in Linux is 10 ms. To
overcome this, FRTP uses busy waiting to bide away the time between packet transmissions. If the
next packet needs to be sent at time t, FRTP does the following:
WHILE ((current time) < t)
NOP
ENDWHILE
The rdtsc (read time stamp counter) instruction, provided by Pentium processors, is used to get
an accurate value for the current time. The rdtsc instruction reads the time stamp counter that is
incremented at every processor tick. NOP is a no operation instruction. The busy waiting solution is
wasteful since the NOPs use up processor cycles that could have been used to accomplish something
more useful. It also does nothing to make the periodic invocation of an event reliable. If the sending
process were the only one running on the processor, then the busy-waiting scheme would reliably perform a periodic action. If a different process is running on the processor at time t, the FRTP sending
process will miss its deadline. In fact, since FRTP itself uses two threads at the sender, the thread responsible for filling the protocol buffer could interfere with the data-sending thread's busy-waiting-induced periodicity.
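The busy-waiting loop can be sketched as follows (an illustrative Python version, with time.perf_counter_ns() standing in for the rdtsc instruction; the 1 ms gap used below is chosen so the example runs quickly, and is far coarser than the 12 μs gap a 1 Gbps circuit would require):

```python
import time

def send_paced(packets, gap_ns, transmit):
    # Busy-wait pacing in the style of the FRTP sender: spin (the NOP loop)
    # until the deadline for the next packet, transmit it, then advance the
    # deadline by the fixed inter-packet gap.
    deadline = time.perf_counter_ns()
    for p in packets:
        while time.perf_counter_ns() < deadline:
            pass                      # burn cycles until it is time to send
        transmit(p)
        deadline += gap_ns

# Pace 5 dummy "packets" at a 1 ms gap, recording the transmit instants.
stamps = []
send_paced(range(5), 1_000_000, lambda p: stamps.append(time.perf_counter_ns()))
```

As the thesis notes, the spinning wastes cycles and provides no guarantee: if the process is descheduled at a deadline, that packet is simply sent late.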
SABUL's rate adjustment scheme has been removed from FRTP. Therefore FRTP does not have
even the reactive flow control of SABUL. This is acceptable if we can be sure that flow control is
not required. The FRTP receiver architecture for a transfer to disk can be represented as shown in
Figure 3.2. Using the notation introduced in Section 3.1, the network thread handles the transfer
marked I and the main thread and the receiving application handle II and III respectively. The
process scheduler has to put the threads on the processor for the transfers to take place. Transfer III
additionally depends on how long the write to disk takes. These factors introduce variability into
the receiving rate. Buffers can hide this variability so that even a constant sending rate does not
cause buffer overflow. For a sending rate S(t) held at a constant value S, a receiving rate R(t) and a
receive buffer of size B, for no loss to occur:
S·τ − ∫₀^τ R(t) dt ≤ B,  for all τ ∈ [0, T]   (3.1)
[Figure: receiver data path, from the UDP buffer (kernel space) to the protocol buffer to the application buffer (user space) to the disk, with the three transfers labeled I, II and III]
Figure 3.2: Need for receiver flow control
where [0,T] is the transfer interval. The (false) assumption behind our initial belief that it is enough
to just stream the data at the reserved circuit rate was that equation (3.1) holds throughout the
transfer. From our experiments we realized that not only is R(t) varying, we do not even know a
closed form expression for it, making the choice of S and B to satisfy equation (3.1) difficult. A
pragmatic approach is to assign sensible values to S and B, so that (3.1) is satisfied most of the time.
When it is not satisfied, there are losses and the error control algorithm will recover from the loss.
This is what we were seeing in our laboratory experiments (but with S(t) also varying with time).
A flow control protocol would attempt to ensure that the above equation is satisfied all the time, by
varying S(t). Unfortunately this implementation of FRTP has no flow control.
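A discrete version of condition (3.1) makes the interaction of S, R(t) and B concrete (an illustrative sketch; rates are in packets per time step):

```python
def max_backlog(S, rates, dt=1.0):
    # Discrete check of condition (3.1): with a constant sending rate S and
    # sampled receiving rates R(t), the backlog S*t - integral(R) must stay
    # at or below the receive buffer size B, or packets are dropped.
    backlog = worst = 0.0
    for r in rates:
        backlog = max(0.0, backlog + (S - r) * dt)
        worst = max(worst, backlog)
    return worst
```

A buffer of size B avoids loss for a given trace exactly when max_backlog(S, trace) ≤ B; since R(t) is unknown in advance, choosing S and B this way is guesswork, which is the point made above.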
3.2.2 Possible Solutions
Our attempts to solve the two problems we identified with the FRTP implementation (the use of busy waiting to ensure a steady rate, and the lack of flow control) are described next. The ideal solution
for maintaining a fixed inter-packet gap would involve transmitting a packet, giving up the processor
and reclaiming it when it is time to send the next packet. Linux offers a system call to relinquish
the processor. To see why it is not possible to reclaim the processor at a deterministic future time, it is essential to understand how the Linux scheduler schedules processes to run. The scheduler maintains several queues, of which only two are important for our purposes: one of processes that are ready to run (the RUNNABLE queue) and one of processes that are waiting for some condition that
will make them ready to run (the INTERRUPTIBLE queue). For instance, if a process executes
instructions to write to disk, it is put in the INTERRUPTIBLE queue. When the write to disk
completes and the hard drive interrupts the processor the process is put back in the RUNNABLE
queue. So what happens when, after transmitting a packet, the FRTP sending process gives up the
CPU? Usually, the system call used to relinquish the processor allows the process to specify a time
after which it is to be made runnable again. The process is put in the INTERRUPTIBLE queue and
when the operating system determines that the time for which the process had asked to sleep has
passed, it is put back in the RUNNABLE queue. The problem arises because the operating system
uses the timer interrupts (which have a 10 ms period in Linux) to check whether the sleep time has
passed. Therefore if a process asked to sleep for 1 second, it is guaranteed to become runnable
after a time between 1.0 and 1.01 seconds, but if it asks to sleep for 100 s it will become runnable
after some time between 100 s and 10100 s. Note that if we give this process the highest priority
then its becoming runnable implies that it runs on the processor, so we ignore the scheduling delay
between a process becoming ready to run and actually running. Thus on Linux (and other operating
systems that dont support real-time processes) it is not possible for a user space process to send
packets monotonically at a high rate.
An alternate approach would be to maintain the sending rate, not on a packet-by-packet basis,
but in a longer time frame. This can be done by ensuring that N packets are sent every T units
of time such that (N/T) is the desired sending rate. This would cause a burst of N packets in the network, so we would like to keep T as small as possible. In the limit N becomes 1 and we get what
SABUL attempts to implement. The sending process should get a periodic impulse every T units
of time and in response send out the N packets. Linux offers user-space processes the ability to
receive such periodic impulses in the form of signals. A process can use the setitimer() system call
to activate a timer. This timer causes a signal to be sent periodically to the process. We modified the
FRTP code to use periodic signals to maintain the sending rate. This reduced the CPU utilization at
the sender compared to the earlier busy waiting scheme. But the lack of real-time support on Linux
meant that even if the signals were being sent like clockwork the user-space process was not always
able to start sending the next burst of packets immediately. We observed that occasionally some
signals would be missed because an earlier one was still pending.
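The signal-driven scheme can be sketched in a few lines (an illustrative Python example of the same setitimer() mechanism, on a Unix system; the 10 ms period is chosen for the example):

```python
import signal
import time

# Periodic impulses via setitimer(): each SIGALRM marks the instant at which
# the sender would transmit its next burst of N packets.
ticks = []
signal.signal(signal.SIGALRM, lambda signum, frame: ticks.append(time.monotonic()))
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)   # first fire in 10 ms, then every 10 ms

deadline = time.monotonic() + 1.0
while len(ticks) < 3 and time.monotonic() < deadline:
    time.sleep(0.005)                               # sleeps are cut short by the signal

signal.setitimer(signal.ITIMER_REAL, 0.0)           # disarm the timer
```

The sketch also hints at the failure mode described above: the handler only runs when the process is actually scheduled, so under load an impulse can be delivered late or coalesced with a pending one.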
We now consider the problem of adding flow control to FRTP. Since flow control is supposed to
avoid receiver buffer overflow, the data receiver is best placed to provide the information based on
which the sender can control the flow of data. SABULs sending rate adjustment in response to lost
packets is a form of flow control that does not use explicit information from the receiver. SABULs
flow control scheme was not very effective since we observed substantial loss and retransmission.
To be able to send back buffer status information, the receiver has to have timely access to this in-
formation. Although the FRTP receiver can accurately determine how much free space is available
in the protocol and application buffers (see Figure 3.2), it does not have access to the current status
of the UDP buffer in kernel memory. The kernel does not make any effort to avoid UDP buffer
overflows. The filling and emptying of a user space buffer are fully in the control of a user space
process. So if a user space buffer is short on free space, the process can choose not to read in more
data. With the UDP buffer the kernel has no control over the filling of the buffer since packets arrive
asynchronously over the network. That is why flow control is necessary to prevent the UDP buffer
from overflowing. Therefore, any flow control scheme which requires explicit buffer status infor-
mation from the receiver would need support from the kernel. By choosing to implement FRTP in
the user space over UDP, we lose the opportunity to implement such a flow control scheme.
Chapter 4
TCP-BASED SOLUTION
In the previous chapter we pointed out the shortcomings of a UDP-based transport protocol that
were uncovered while implementing FRTP using SABUL. We realized that more support from
the operating system would be required to better match the behavior of the end hosts with that of
the network in which resources were reserved. This chapter describes our efforts to implement a
transport protocol for dedicated circuits that is more closely tied in with the operating system than
the user-space FRTP. Our protocol is based on the TCP implementation in Linux. To reflect this
fact, we call this protocol Circuit-TCP (C-TCP).
In this chapter, first an overview of TCP is presented. Then we look at the advantages of using
TCP to implement a transport protocol for dedicated circuits. Next, we present the implementation
of C-TCP. C-TCP has been tested on the CHEETAH testbed. Results from these experiments and a
discussion of their significance concludes this chapter.
4.1 Transmission Control Protocol - A Primer
TCP is the transport protocol of the TCP/IP suite of protocols. It is a connection-oriented protocol
that provides reliability, distributed congestion control and end-to-end flow control. Note that the
meaning of TCP being a connection-oriented protocol is different from the use of the phrase in
connection-oriented network. In order to provide its end-to-end services, TCP maintains state
for each data stream. Thus, TCP creates a connection between two end points wishing to communicate reliably (the end points can be processes on end hosts), maintains state information about
the connection and disconnects the two end points when they no longer need TCP's service. In
a connection-oriented network, a connection refers to physical network resources that have been
reserved, and that taken together form an end-to-end path.
Applications wishing to use TCP's service use the sockets interface that the TCP/IP stack in the
operating system provides. Two processes that want to use TCP to communicate create sockets and
then one of the processes connects its socket to the remote socket. A connection is established if
the connection request is accepted by the remote end. TCP uses a 3-way handshake to establish a
connection. Connection establishment also initializes all of the state information that TCP requires
to provide its service. This state is stored in the data structures associated with the sockets on each
end of a connection. We now present brief descriptions of four of TCP's functions. For a more
detailed description, please see [29], [8] and [1].
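The connection life cycle described above can be illustrated at the application level with a minimal sketch. Note that this uses Python's socket API purely for illustration; the thesis's implementation lives in the Linux kernel's C stack. Here connect() initiates TCP's 3-way handshake, accept() yields the established connection, and close() releases it.

```python
import socket
import threading

def run_server(ready):
    # Create a listening socket; accept() completes the 3-way handshake
    # initiated by the remote connect().
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    srv.listen(1)
    ready["port"] = srv.getsockname()[1]
    ready["event"].set()
    conn, _ = srv.accept()              # per-connection state now exists on both ends
    data = conn.recv(1024)
    conn.sendall(data)                  # echo the data back
    conn.close()                        # connection release
    srv.close()

ready = {"event": threading.Event()}
t = threading.Thread(target=run_server, args=(ready,))
t.start()
ready["event"].wait()

# Client side: connect() triggers the 3-way handshake (SYN, SYN-ACK, ACK).
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", ready["port"]))
cli.sendall(b"hello")
echoed = cli.recv(1024)
cli.close()
t.join()
print(echoed.decode())   # prints: hello
```

All of the state TCP needs (sequence numbers, windows, timers) is initialized behind these calls and stored in the socket's kernel data structures.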
4.1.1 Error Control
Each data byte transferred by TCP is assigned a unique sequence number. During connection
establishment the two ends of a connection exchange starting sequence numbers. The TCP at the
receiving end maintains information about sequence numbers that have been successfully received,
the next expected sequence number and so on. The receiver can make use of the sequence numbers
of received data to infer data reordering with certainty, but not data loss. In fact, neither the TCP
at the sender nor the one at the receiver can reliably detect packet loss since a packet presumed lost
could just be delayed in the network. TCP uses acknowledgements (ACKs) of successfully received
data and a sender-based retransmission time-out (RTO) mechanism to infer data loss. The time-out
value is calculated carefully using estimates of RTT and RTT variance, to reduce the possibility of
falsely detecting loss or waiting too long to retransmit lost data. An optimization that was proposed
and has been widely implemented is the use of triple duplicate ACKs to infer loss early rather than
wait for the RTO to expire. A TCP receiver sends back a duplicate ACK whenever an out-of-order
packet arrives. For instance, suppose packets Pn, Pn+1, Pn+2, Pn+3 and Pn+4 contain data that is
contiguous in the sequence number space. If Pn+1 goes missing, then the receiving TCP sends back
duplicate ACKs acknowledging the successful receipt of Pn when Pn+2, Pn+3 and Pn+4 arrive. On
getting 3 duplicate ACKs, a TCP sender assumes that the data packet immediately following the
(multiply) ACKed data was lost. The sender retransmits this packet immediately. This is called fast
retransmit. As was pointed out in Chapter 2, many enhancements to TCP have been proposed and
implemented, such as the use of SACKs, that improve TCP's loss recovery, among other things.
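The fast-retransmit rule above can be sketched in a few lines. This is an illustrative model, not the thesis's kernel code; the class name and bookkeeping are invented for the example, and only cumulative ACKs are modeled.

```python
# Sender-side duplicate-ACK counting: three duplicates trigger fast retransmit.
DUP_ACK_THRESHOLD = 3

class FastRetransmitSender:
    def __init__(self):
        self.last_ack = None      # highest cumulative ACK seen so far
        self.dup_count = 0        # duplicates of that ACK
        self.retransmitted = []   # sequence numbers retransmitted early

    def on_ack(self, ack_seq):
        """Process a cumulative ACK; return a seq to fast-retransmit, if any."""
        if ack_seq == self.last_ack:
            self.dup_count += 1
            if self.dup_count == DUP_ACK_THRESHOLD:
                # Assume the packet immediately following the ACKed data was lost.
                self.retransmitted.append(ack_seq)
                return ack_seq
        else:
            self.last_ack = ack_seq
            self.dup_count = 0
        return None

sender = FastRetransmitSender()
# Receiver got Pn, then Pn+2, Pn+3, Pn+4 (Pn+1 missing): it repeats the same
# cumulative ACK once for Pn and once per out-of-order arrival.
acks = [101, 101, 101, 101]
events = [sender.on_ack(a) for a in acks]
print(events)   # [None, None, None, 101]: retransmit fires on the third duplicate
```

In real TCP the retransmission timer (RTO) remains the fallback; fast retransmit merely avoids waiting for it when enough duplicate ACKs arrive.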
4.1.2 Flow Control
Flow control allows a receiving TCP to control the amount of data sent by a sending TCP. With
each ACK, the receiving TCP returns the amount of free space available in its receive buffer. This
value is called the receiver advertised window (rwnd). The sending TCP accomplishes flow control
by ensuring that the amount of unacknowledged data (the demand for receiver buffer space) does
not exceed rwnd (the supply of buffer space on the receiver).
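The supply-and-demand check described above amounts to a single inequality, sketched below with invented variable names (snd_nxt and snd_una follow common TCP terminology for the next sequence number to send and the oldest unacknowledged one):

```python
# Window-based flow control: the sender may transmit only while
# unacknowledged data stays within the receiver-advertised window.
def sendable_bytes(snd_nxt, snd_una, rwnd):
    """Bytes the sender may still put on the wire.
    snd_nxt: next sequence number to send
    snd_una: oldest unacknowledged sequence number
    rwnd:    receiver-advertised window from the latest ACK
    """
    in_flight = snd_nxt - snd_una        # demand for receiver buffer space
    return max(0, rwnd - in_flight)      # remaining supply

print(sendable_bytes(5000, 3000, 4000))  # 2000: room for 2000 more bytes
print(sendable_bytes(7000, 3000, 4000))  # 0: window full, sender must wait for ACKs
```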
4.1.3 Congestion Control
The original specification of TCP [29] did not include congestion control. TCP's congestion control
algorithm was proposed in [21]. Just as flow control tries to match the supply and demand for the
receiver buffer space, congestion control matches the supply and demand for network resources
like bandwidth and switch/router buffer space. This is a much more complex problem because
TCP is designed to work on packet-switched networks in which multiple data flows share network
resources. TCP's congestion control algorithm is a distributed solution in which each data flow
performs congestion control using only its own state information, with no inter-flow information
exchange.
TCP congestion control is composed of three parts.
1. Estimate the current available supply of the network resources and match the flow's demand
to that value.
2. Detect when congestion occurs (i.e. demand exceeds supply).
3. On detecting congestion, take steps to reduce it.
TCP maintains a state variable, congestion window (cwnd), which is its estimate of how much
data can be sustained in the network. TCP ensures that the amount of unacknowledged data does
not exceed cwnd,¹ and thus uses cwnd to vary a flow's resource demand. Since a sending TCP
has no explicit, real-time information about the amount of resources available in the network, the
cwnd is altered in a controlled manner, in the hope of matching it to the available resources. The
cwnd is increased in two phases. The first phase, which is also the one in which TCP starts, is
called slow start. During slow start, cwnd is incremented by one packet for each returning ACK that
acknowledges new data. Thus, if cwnd at time t was C(t), all of the unacknowledged data at t would
get acknowledged by time (t + RTT), and C(t + RTT) would be C(t) + C(t) = 2C(t). Slow start
is used whenever the value of cwnd is below a threshold value called the slow start threshold (ssthresh).
When cwnd increases beyond ssthresh, TCP enters the congestion avoidance phase, in which the rate
of cwnd increase is reduced. During congestion avoidance, each returning ACK increments cwnd
from C to (C + 1/C). An approximation used by many implementations is to increment C to (C + 1)
at the end of an RTT (assuming the unit for cwnd is packets).
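The two growth phases can be sketched as a small simulation (illustrative Python, not the thesis's kernel code; grow_one_rtt is a hypothetical helper that applies the per-ACK rule once per ACK for one RTT's worth of ACKs):

```python
def grow_one_rtt(cwnd, ssthresh):
    """Apply the per-ACK cwnd rule for one RTT's worth of ACKs (units: packets)."""
    acks = int(cwnd)              # one ACK per outstanding packet
    for _ in range(acks):
        if cwnd < ssthresh:
            cwnd += 1.0           # slow start: +1 per ACK, doubling cwnd per RTT
        else:
            cwnd += 1.0 / cwnd    # congestion avoidance: roughly +1 per RTT
    return cwnd

cwnd, ssthresh = 1.0, 16.0
per_rtt = [cwnd]
for _ in range(6):
    cwnd = grow_one_rtt(cwnd, ssthresh)
    per_rtt.append(cwnd)
print(per_rtt)   # doubles each RTT (1, 2, 4, 8, 16), then grows by ~1 per RTT
```

The trace shows exponential growth up to ssthresh and the much slower, near-linear growth afterwards.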
The second component of congestion control is congestion detection. TCP uses packet loss as
an indicator of network congestion. Thus, each time a sending TCP infers loss, whether through an
RTO or triple duplicate ACKs, it assumes that the loss was caused by network congestion. Other
congestion indicators have been proposed. For instance, in Chapter 2 we mentioned that FAST
uses queueing delay to detect network congestion. Some researchers have proposed that a more
proactive approach should be adopted, and congestion should be anticipated and prevented, rather
than reacted to. Such a proactive approach would require congestion information from the network
nodes. See [5] for a discussion of the Active Queue Management (AQM) mechanisms that routers
need to implement, and [15] for a description of the Random Early Detect (RED) AQM scheme.
In [30], the modifications that need to be made to TCP in order to take advantage of the congestion
information provided by routers using AQM is presented.
The third component of congestion control is taking action to reduce congestion once it has been
detected. The fact that congestion occurred (and was detected) means that TCP's estimate of the
available network resource supply is too high. Thus, to deal with congestion, TCP reduces its
estimate by cutting down cwnd. On detecting loss, the sending TCP first reduces ssthresh to half of
the flight size, where flight size is the amount of data that has been sent but not yet acknowledged
(the amount in flight). The next step is to reduce cwnd. The amount of reduction varies depending
on whether the loss detection was by RTO or triple duplicate ACKs. If an RTO occurred then the
congestion in the network is probably severe, so TCP sets cwnd to 1 packet. The receipt of duplicate
ACKs means that packets are getting through to the receiver and hence congestion is not that severe.
Therefore, in this case cwnd is set to (ssthresh + 3) packets and incremented by 1 packet for each
additional duplicate ACK. This is called fast recovery.
¹Recall that flow control requires the amount of unacknowledged data to be less than rwnd. TCP implementations
use min(rwnd, cwnd) to bound the amount of unacknowledged data.
The linear increase of cwnd by one packet per RTT, during congestion avoidance, and its
decrease by a factor of two during recovery from loss is called Additive Increase Multiplicative
Decrease (AIMD). TCP uses an AI factor of one (cwnd ← cwnd + 1) and an MD factor of two
(cwnd ← cwnd/2).
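The loss reactions described above can be summarized in one small function (an illustrative sketch with invented names, not the kernel implementation; units are packets, and the lower bound of 2 on ssthresh follows common practice):

```python
# TCP's reaction to inferred loss: halve ssthresh, then cut cwnd according
# to how the loss was detected.
def on_loss(flight_size, detected_by):
    """Return (ssthresh, cwnd) after a loss event (units: packets)."""
    ssthresh = max(flight_size // 2, 2)    # half the flight size, floored at 2
    if detected_by == "rto":
        cwnd = 1                           # severe congestion: restart slow start
    elif detected_by == "dupacks":
        cwnd = ssthresh + 3                # fast recovery: account for the 3 dup ACKs
    else:
        raise ValueError(detected_by)
    return ssthresh, cwnd

print(on_loss(20, "rto"))       # (10, 1)
print(on_loss(20, "dupacks"))   # (10, 13)
```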
4.1.4 Self Clocking
Although TCP does not explicitly perform rate control, the use of ACK packets leads to a handy
rate maintenance property called self clocking [21]. Consider the situation shown in Figure 4.1.
The node marked SENDER is sending data to the RECEIVER, which is three hops away.² The links
LINK1, LINK2 and LINK3 are logically separated to show data flow in both directions. The width
of a link is indicative of its bandwidth, so LINK2 is the bottleneck in this network. The shaded
blocks are packets (data packets and ACKs), with packet size proportional to a block's area. The
figure shows the time instant when the sender has transmitted a window's worth of packets at the
rate of LINK1. Because all these packets have to pass through the bottleneck link, they reach the
receiver at LINK2's rate. This is shown by the separation between packets on LINK3. The receiver
generates an ACK for each successfully received data packet. If we assume that the processing time
for each received data packet is the same, then the ACKs returned by the receiver have the same
spacing as the received data packets. This ACK spacing is preserved on the return path. Each ACK
allows the sender to transmit new data packets. If a sender has cwnd worth of data outstanding in
the network, new data packets are transmitted only when ACKs arrive. Thus, the sending rate (in
data packets per unit time) is maintained at the rate of ACK arrival, which in turn is determined by
the bottleneck link rate. This property of returning ACKs clocking out data packets is called self
clocking.
²This figure is adapted from one in [21].
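A toy timeline makes the effect concrete. This is a deliberately simplified model (invented function and parameters, not from the thesis): the sender bursts a window at LINK1's rate, the bottleneck serializes the packets, and from then on each returning ACK clocks out one new packet.

```python
# Self clocking, modeled as send times: after the initial burst, new sends
# are triggered by ACK arrivals, which are spaced at the bottleneck rate.
def ack_clocked_send_times(window, first_link_gap, bottleneck_gap, rtt):
    # Initial burst: a full window back-to-back at LINK1's (fast) rate.
    sends = [i * first_link_gap for i in range(window)]
    # The bottleneck serializes packets, so ACKs return bottleneck_gap apart;
    # each ACK permits exactly one new data packet.
    for i in range(window, 3 * window):
        ack_arrival = rtt + (i - window) * bottleneck_gap
        sends.append(ack_arrival)
    return sends

sends = ack_clocked_send_times(window=4, first_link_gap=1, bottleneck_gap=5, rtt=40)
gaps = [b - a for a, b in zip(sends, sends[1:])]
print(gaps)   # initial gaps of 1, then a pause, then steady gaps of 5
```

The steady-state inter-send gap equals the bottleneck spacing, not LINK1's, which is exactly the property C-TCP later exploits to hold a fixed sending rate without busy waiting.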
[Figure: a SENDER and a RECEIVER connected through two NETWORK NODEs by LINK1, LINK2 and LINK3; data packets spread out at the bottleneck LINK2, and ACKs return with the same spacing.]
Figure 4.1: TCP self clocking
4.2 Reasons for Selecting TCP
In Chapter 3, two problems were identified in a user-space UDP-based implementation of FRTP.
1. Use of busy waiting to maintain a fixed inter-packet gap, and thus a fixed rate, does not work
very well. Even if it did work perfectly, it is wasteful of CPU cycles.
2. The difficulty of maintaining a fixed receiving rate makes flow control very attractive. A
proactive scheme, in which the receiver is able to prevent buffer overflow, requires kernel
support that a user-space FRTP cannot get. By removing SABUL's rate-based congestion
control, FRTP forgoes SABUL's reactive flow control too. Thus, FRTP has null flow control.
In this section, two issues are addressed: first, whether TCP is better at tackling the two problems
listed above; second, whether there are other issues unique to TCP that need to be considered.
The description of TCP's slow start and AIMD schemes in Section 4.1.3 shows that TCP does
not maintain a fixed sending rate. TCP is designed with the assumption that the available bandwidth
in the network (called supply in Section 4.1) changes over time, as other data flows start or end,
and that its instantaneous value is not known. TCP's congestion control algorithms attempt to match
a flow's sending rate to the available network bandwidth, in spite of this incomplete knowledge. But
such a rate-altering algorithm is not needed on dedicated circuits.
If we assume that TCP's congestion control can be disabled, how well can TCP maintain a fixed
sending rate and at what granularity? The self clocking property provides a low-overhead way to
maintain a steady sending rate. In steady state, each returning ACK clocks out a data packet so a
steady sending rate can be maintained at a granularity of packets. Moreover, packet transmission is
initiated as a result of an interrupt (the NIC raises an interrupt when an ACK is received), and so is
much less likely to be disturbed by the behavior of the process scheduler. This is a major advantage
of shifting the responsibility of maintaining a steady rate to the kernel domain.
The variability in the receiving rate is caused by the receiving application's interaction with the
process scheduler and the disk. This problem is not solved by using a different transport protocol.
But TCP's flow control is designed to minimize the impact of such variability on data transfer
performance. TCP uses a window-based flow control scheme (see Section 4.1.2) that prevents
receive buffer overflow, unlike SABUL, which reacts to packet loss caused by buffer overflow.
TCP appears to adequately deal with the two problems identified in implementing FRTP. In
addition, there are a few other reasons for choosing TCP, which we point out next. Once it had been
established that flow control required kernel support, our choice was essentially made. We did not
have the expertise to implement a kernel-space protocol starting from scratch. So, our protocol had
to be implemented by modifying an existing, stable kernel-space transport protocol. TCP and UDP
are so widely used and well understood that, unless some other protocol is clearly more suitable, it
makes sense to modify TCP or UDP. Another reason for choosing to use TCP is that error control
comes for free. In the next section, the protocol design for C-TCP is presented and it should be clear
that for the majority of transport protocol functions, what TCP implements works regardless of
whether the underlying network is connectionless or connection-oriented.
So is TCP the answer to all our problems? Well, no. Without any modifications, TCP's congestion
control algorithm is not suitable for use over a dedicated circuit. One of the main differences
between TCP and C-TCP is the congestion control algorithm used. We describe C-TCP in more
detail in the next two sections. A practical issue with any kernel-space modification is that it is
much less convenient to deploy than a user-space application, which can simply be downloaded,
built and installed; installing a modified kernel requires the host to be rebooted.
4.3 Circuit-TCP Design
In this section the design of C-TCP is described. Five functions of a transport protocol are
considered, namely connection establishment, congestion control, multiplexing, flow control and error
control. For each of these functions, we consider whether it is required on a dedicated circuit and,
if so, how to provide the function.
4.3.1 Connection Establishment
It is useful in the design of a transport protocol to think in terms of control and data planes. Control
plane functions support the data plane. For instance, TCP's three-way handshake for connection
establishment is used to agree upon an initial sequence number to be used in the data transfer that
follows. Like TCP, C-TCP requires state to be maintained for each data flow. The connection-
establishment and release schemes are used unaltered from TCP.
4.3.2 Congestion Control
Network congestion occurs when the