7/30/2019 A TRANSPORT PROTOCOL
A TRANSPORT PROTOCOL
FOR
DEDICATED END-TO-END CIRCUITS
A Thesis
Presented to
the faculty of the School of Engineering and Applied Science
University of Virginia
In Partial Fulfillment
of the requirements for the Degree
Master of Science
Computer Engineering
by
Anant P. Mudambi
January 2006
APPROVAL SHEET
This thesis is submitted in partial fulfillment of the requirements for the degree of
Master of Science
Computer Engineering
Anant P. Mudambi
This thesis has been read and approved by the examining committee:
Malathi Veeraraghavan (Advisor)
Marty A. Humphrey (Chair)
Stephen G. Wilson
Accepted for the School of Engineering and Applied Science:
Dean, School of Engineering and Applied Science
January 2006
Abstract
E-science projects involving geographically distributed data sources, computing resources and sci-
entists have special networking requirements, such as steady throughput and deterministic behav-
ior. The connectionless Internet model is not well-suited to meet such requirements. Connection-
oriented networks that offer guaranteed-rate, dedicated circuits have been proposed to meet the
high-end networking needs of distributed scientific research. In this work we describe the design
and implementation of a transport protocol for such dedicated circuits.
We present an initial user-space, UDP-based implementation called Fixed Rate Transport Proto-
col (FRTP). The constraints imposed by a user-space implementation led us to implement a lower-
overhead kernel-space solution that we call Circuit-TCP (C-TCP). The key feature of C-TCP is to
maintain a fixed sending rate, closely matched to the circuit rate, with the aim of achieving high
circuit utilization. We implemented C-TCP by modifying the Linux TCP/IP stack. Experimental
results on a wide-area circuit-switched testbed show that C-TCP is able to quickly utilize circuit
bandwidth and sustain a high data transfer rate.
Acknowledgments
I would like to thank Prof. Malathi Veeraraghavan, for her advice and for keeping me on the right
track. I thank the members of the CHEETAH research group, Xuan, Xiangfei, Zhanxiang and
Xiuduan, for all their help.
Anil and Kavita, thank you for keeping me motivated. Finally, the biggest thank you to my
parents, for their incredible support and love.
Contents
1 INTRODUCTION 1
2 BACKGROUND 3
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 TCP Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 UDP-based Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Novel Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 End-host Factors that Affect Data Transfer Performance . . . . . . . . . . . . . . . 6
2.2.1 Memory and I/O bus usage . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1.1 Zero-copy Networking . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Protocol Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Disk Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Process scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 CHEETAH Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Components of CHEETAH . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Features of a CHEETAH Network . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 The CHEETAH Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.4 End-host Software Support for CHEETAH . . . . . . . . . . . . . . . . . 14
3 UDP-BASED TRANSPORT PROTOCOL 16
3.1 SABUL Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 SABUL Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Modifications to SABUL: FRTP . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Problems with the FRTP Implementation . . . . . . . . . . . . . . . . . . 22
3.2.2 Possible Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 TCP-BASED SOLUTION 27
4.1 Transmission Control Protocol - A Primer . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Error Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.3 Congestion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.4 Self Clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Reasons for Selecting TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Circuit-TCP Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.1 Connection Establishment . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.2 Congestion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.3 Multiplexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.4 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.5 Error Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 C-TCP Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Web100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1 Utility of Disabling Slow Start . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5.2 Sustained Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.2.1 Reno-TCP Performance . . . . . . . . . . . . . . . . . . . . . . 46
4.5.2.2 BIC-TCP Performance . . . . . . . . . . . . . . . . . . . . . . 46
4.5.2.3 C-TCP Performance . . . . . . . . . . . . . . . . . . . . . . . . 47
5 CONTROL-PLANE FUNCTIONS 49
5.1 Selecting the Circuit Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Setting up the Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 CONCLUSIONS 56
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.1.1 Transport Protocol Design for Dedicated Circuits . . . . . . . . . . . . . . 56
6.1.2 Transport Protocol Implementation . . . . . . . . . . . . . . . . . . . . . 57
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A DISK WRITE RATE ESTIMATION 59
A.1 How Linux Handles Disk Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.2 Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Bibliography 66
List of Figures
2.1 Memory I/O bus usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 CHEETAH experimental testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Architecture of CHEETAH end-host software . . . . . . . . . . . . . . . . . . . . 15
3.1 Architecture of a generic UDP-based protocol . . . . . . . . . . . . . . . . . . . . 17
3.2 Need for receiver flow control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1 TCP self clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Congestion control in the control plane . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Structure of the Web100 stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Maximum buffer space required for a C-TCP burst . . . . . . . . . . . . . . . . . 41
4.5 Testbed configuration for C-TCP tests . . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 TCP and C-TCP comparison for different transfer sizes . . . . . . . . . . . . . . . 43
4.7 Start-up behavior of TCP and C-TCP . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.8 Throughput and RTT using Reno-TCP . . . . . . . . . . . . . . . . . . . . . . . . 46
4.9 Throughput and RTT using BIC-TCP . . . . . . . . . . . . . . . . . . . . . . . . 47
4.10 Throughput and RTT using C-TCP . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Throughput variability of disk-to-disk transfers . . . . . . . . . . . . . . . . . . . 50
5.2 Trade-off between circuit utilization and delay . . . . . . . . . . . . . . . . . . . . 51
List of Tables
5.1 xdd benchmark results on zelda4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Disk write rate (Mbps) for individual runs using 32 KB request sizes . . . . . . . . 52
A.1 End host configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.2 Disk write rate results using xdd . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Abbreviations
ACK Acknowledgement
AIMD Additive Increase Multiplicative Decrease
API Application Programming Interface
AQM Active Queue Management
BDP Bandwidth Delay Product
BIC-TCP Binary Increase Congestion control TCP
CHEETAH Circuit-switched High-speed End-to-End Transport ArcHitecture
COW Copy On Write
C-TCP Circuit TCP
cwnd congestion window
DMA Direct Memory Access
DNS Domain Name Service
DRAGON Dynamic Resource Allocation via GMPLS Optical Networks
FAST Fast AQM Scalable TCP
FRTP Fixed Rate Transport Protocol
GbE Gigabit Ethernet
Gbps Gigabits per second
GB Gigabyte
GMPLS Generalized Multiprotocol Label Switching
HS-TCP HighSpeed TCP
I/O Input/Output
IP Internet Protocol
KB Kilobyte
LAN Local Area Network
LMP Link Management Protocol
Mbps Megabits per second
MB Megabyte
MSPP Multi-Service Provisioning Platform
MTU Maximum Transmission Unit
NAK Negative ACK
NETBLT Network Block Transfer
NIC Network Interface Card
OC Optical Carrier
OCS Optical Connectivity Service
OS Operating System
OSPF Open Shortest Path First
RBUDP Reliable Blast UDP
RED Random Early Detection
RSVP-TE Resource Reservation Protocol with Traffic Engineering extensions
RTO Retransmission Time-out
RTT Round Trip Time
rwnd receiver advertised window
SABUL Simple Available Bandwidth Utilization Library
SACK Selective ACK
SONET Synchronous Optical Network
ssthresh slow start threshold
TCP Transmission Control Protocol
TDM Time Division Multiplexing
TSI Terascale Supernova Initiative
UDP User Datagram Protocol
UDT UDP-based Data Transfer protocol
XCP eXplicit Control Protocol
Chapter 1
INTRODUCTION
Many fields of research require significant computing resources to conduct simulations and/or to
analyze large amounts of data. Large data sets collected by remote instruments may need to be
processed. The SETI@home project [2], which uses data collected by the National Astronomy
and Ionosphere Center's radio telescope in Arecibo, Puerto Rico, is one such example. The telescope
generates about 35 GB of data per day that is stored on removable tapes and physically transported
to the server in Berkeley, California. In some cases, computations generate massive amounts of
output that has to be distributed to scientists who are physically at a distance from the computation
resource. For instance, the Terascale Supernova Initiative (TSI) project involves simulations run on
supercomputers at the Oak Ridge National Laboratory (ORNL), the results of which are used by
physicists at remote sites like the North Carolina State University (NCSU).
Networks connecting the data generation point, the computation resource and the scientists'
workplace make collaborative e-science much more practical. The large amounts of data involved
and, in some cases (e.g., real-time visualization), stringent delay/jitter requirements make it nec-
essary to use networks with large bandwidths and deterministic behavior. E-science applications
require high, constantly available bandwidth for their data transfer needs. It is difficult to provide
such rate-guaranteed services in packet-switched, connectionless networks, such as the present-day
Internet. This is because of the possibility of a large number of simultaneous flows competing for
the available network capacity. Therefore, the use of connection-oriented, dedicated circuits has
been proposed as a solution. Many research groups are implementing testbeds and the supporting
software to show the feasibility of such a solution.
The problem addressed in this thesis is the design of a transport protocol for dedicated circuits.
Many of the assumptions on which traditional transport protocols for packet-switched networks
are based need to be examined. For instance, the possibility of losses due to network buffer over-
flows makes congestion control an important function on connectionless networks. On connection-
oriented networks, because network resources are reserved for each data transfer, the end points of
the transfer have more control over whether or not network buffers will overflow. By maintaining
a data transfer rate that is matched to the reserved circuit's rate, the need for congestion control
can be eliminated. On the other hand, a transport layer function such as flow control is needed on
both connectionless and connection-oriented networks because it addresses a problem that network
resource reservation does not solve.
Our approach is to design the transport protocol under the assumption that resources are
reserved for a data transfer's exclusive use. The transport protocol should not have any features
that leave the reserved circuit unutilized. We implemented the transport protocol and tested it on a
wide-area, connection-oriented network testbed. This protocol is called Circuit-TCP (C-TCP).
The rest of this thesis is organized as follows. Chapter 2 provides background information on
previous work in this area as well as issues that affect the design and performance of our transport
protocol. In Chapter 3, we describe the Fixed Rate Transport Protocol (FRTP) that was implemented
in the user space over UDP. The shortcomings of a user-space implementation are pointed out.
Chapter 4 describes the design and implementation of C-TCP, our kernel space transport protocol
based on TCP. Experimental results over a testbed are used to compare C-TCP with TCP over
dedicated circuits. In Chapter 5 the control plane issues of determining the circuit rate and then
setting up the circuit are considered. The conclusions of this work are presented in Chapter 6.
Chapter 2
BACKGROUND
In this chapter we first look at other work that has been done in the development of transport pro-
tocols for high-performance networks. Next we point out some of the factors that play a significant
role in achieving high throughput on dedicated circuits. Many of these are end-host issues that we
discovered while implementing our transport protocol. This work has been conducted as a part
of the Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH) project. An
overview of CHEETAH is presented at the end of this chapter.
2.1 Related Work
There has been significant activity in developing transport protocols suitable for high-bandwidth
and/or high-delay networks. Even though very little of it is focused explicitly on dedicated
circuits, there is enough of an overlap in the problems to justify a closer examination. High-
performance protocols can be classified as TCP enhancements, UDP-based and novel protocols.
Ease of deployment and familiarity with the sockets API to the TCP and UDP protocol stacks are
reasons for the popularity of TCP and UDP-based solutions.
2.1.1 TCP Enhancements
TCP is the most widely used reliable transport protocol on connectionless, packet-switched net-
works. We describe basic TCP operation in Chapter 4. It is designed to work under a wide range
of conditions and this makes a few of its design decisions non-optimal for high-speed networks.
In recent years a number of protocol extensions to TCP have been proposed and implemented to
address this issue. Selective acknowledgements (SACKs) [27,16] have been proposed to deal more
efficiently with multiple losses in a round trip time (RTT) [13]. TCP uses cumulative acknowl-
edgements (ACKs) which means a data byte is not ACKed unless all data earlier in the sequence
space has been received successfully. SACKs inform the sender about out-of-sequence data already
received and help prevent unnecessary retransmissions. Two protocol extensions, the timestamps op-
tion and window scaling, were proposed in [22]. The timestamps option field in a data packet's
TCP header is filled in by a sender and echoed back in the corresponding ACK. It serves two pur-
poses. First, the timestamp can be used to estimate the round trip time more accurately and more
often. This gives the sender a better value for retransmission timeout (RTO) computation. Second,
the timestamp in a received packet can be used to prevent sequence number wraparound. The TCP
header has a 16-bit field for the window size, which limits the window size to 64 KB. This is insuf-
ficient for high-bandwidth, high-delay networks. The window scaling option allows a scaling factor
to be chosen during connection establishment. Subsequent window advertisements are right shifted
by the selected scaling factor. Scaling factors of up to 14 are allowed; thus, by using this option, a
window size of up to 1 GB can be advertised.
Standard TCP (also called Reno TCP) has been found wanting in high-bandwidth, high-delay
environments, mainly due to its congestion control algorithm. TCP's Additive Increase Multi-
plicative Decrease (AIMD) algorithm is considered too slow in utilizing available capacity and too
drastic in cutting back when network congestion is inferred. Modifications to the TCP conges-
tion control algorithm have led to the development of HighSpeed TCP [14], Scalable TCP [25],
FAST [23], and BIC-TCP [39], among others. Standard TCP requires unrealistically low loss rates
to achieve high throughputs. HighSpeed TCP is a proposed change to the TCP AIMD parameters
that allows a TCP connection to achieve high sending rates under more realistic loss conditions.
Scalable TCP also proposes modified AIMD parameters that speed up TCP's recovery from loss.
FAST infers network congestion and adjusts its window size based on queueing delays rather than
loss. BIC-TCP (BIC stands for Binary Increase Congestion control) is a new congestion control
algorithm that scales well to high bandwidth (i.e., it can achieve a high throughput at reasonable
packet loss rates) and is TCP-friendly (i.e., when the loss rate is high its performance is the same
as standard TCP's). In addition, unlike HighSpeed or Scalable TCP, BIC-TCP's congestion control
is designed such that two flows with different RTTs share the available bandwidth in a reasonably
fair manner.
2.1.2 UDP-based Protocols
To overcome the shortcomings of TCP, many researchers have implemented protocols over UDP by
adding required functionality, such as reliability, in the user space. The most common model is to
use UDP for the data transfer and a separate TCP or UDP channel for control traffic. SABUL [18],
Tsunami, Hurricane [38], and RBUDP [20] use a TCP control channel and UDT [19] uses UDP
for both data and control channels. The advantage of these solutions is that their user-space imple-
mentation makes deployment easy. At the same time, there are some limitations that arise because
these protocols are implemented in the user-space. In Chapter 3, we describe SABUL. Our attempt
at modifying SABUL to implement a transport protocol for dedicated circuits and the shortcomings
of a user-space transport protocol implementation are also pointed out.
2.1.3 Novel Protocols
Some novel protocols designed exclusively for high-performance data transfer have also been pro-
posed. The eXplicit Control Protocol (XCP) [24] was proposed to solve TCP's stability and effi-
ciency problems. By separating link utilization control from fairness control, XCP is able to make
more efficient use of network resources in a fair manner. XCP's requirement of multi-bit congestion
signals from the network makes it harder to deploy since routers in the network need to be modified.
NETBLT [10] was proposed for high-speed bulk data transfer. It provides reliable data transfer by
sending blocks of data in a lock-step manner. This degrades bandwidth utilization while the sender
awaits an acknowledgement (ACK) for each block.
2.2 End-host Factors that Affect Data Transfer Performance
Setting up a dedicated circuit involves resource reservation in the network. Depending on the
network composition, the resources reserved could be wavelengths, ports on a switch or time slots.
Ideally, we would like to fully use the reserved resources for exactly the time required to complete
the transfer. During the implementation of our transport protocol, we found that there are many
factors that make it hard to achieve this ideal. In this section we list a few of these factors that
impact the performance of transport protocol implementations.
2.2.1 Memory and I/O bus usage
First, consider an application that uses the transport protocol to carry out a file transfer. At the
sending end, the application has to
1. Read data from the disk, e.g. by invoking a read system call.
2. Send the data out on the network, e.g. by invoking a send system call.
There are two types of overhead in carrying out these operations. The system calls involve the over-
head of saving the process registers on the stack before the system call handler is invoked. Second,
the two steps shown above could involve multiple passes over the memory and I/O bus. This is
illustrated in Figure 2.1(a). The figure shows the bus operations involved in moving data from the
disk to user space buffers (step 1 above), and from the user space buffer to kernel network buffers
(part of step 2). To avoid having to access the disk each time, for multiple accesses to a chunk of
data, the operating system caches recently accessed disk data in memory. This cache is called the
page cache, and direct memory access (DMA) is used for transfers between the page cache and the
disk (operation I in Figure 2.1(a)). Two passes over the memory bus are needed to transfer the data
from the page cache to the user space buffer (operation II). To send data out to the network, it is
again copied from the user space buffer to kernel network buffers (operation III). We do not show
the transfer from the kernel network buffer to the NIC, which is the final step in getting data out
into the network. For data transfers using TCP sockets on Linux, the sendfile system call can be
[Figure shows the path of data between the hard disk, the page cache, user-space memory, kernel-space network buffers, and the NIC: panel (a), "Using read and send", with bus operations I, II, and III; panel (b), "Using sendfile".]
Figure 2.1: Memory I/O bus usage
used to cut down the number of passes over the memory bus to three. As shown in Figure 2.1(b),
sendfile copies data directly from the page cache to the kernel network buffers, thus avoiding the
copy to user space and back. In addition, sendfile needs to be invoked just once for a single file, so
the overhead of making a system call is paid only once per file.
2.2.1.1 Zero-copy Networking
Other methods for avoiding the copy from user-space memory to kernel-space memory have been
proposed. Such methods are known by the common term zero-copy networking. For a classification
of zero-copy schemes see [7]. The zero in zero-copy networking indicates that there is no memory-
to-memory copy involved in the transfer of data between a user space buffer and the network. So,
in Figure 2.1(a), a zero-copy scheme would eliminate memory-to-memory copies after operation
II. How the data got into the user- or kernel-space buffer in the first place, and whether that required
a copy, is not considered. Zero-copy schemes can be supported if an application interacts directly
with the NIC without passing through the kernel, or if the buffers are shared between user and
kernel space, rather than being copied. For an application to directly read from and write to the NIC
buffer, protocol processing has to be done on the NIC. At the sender, buffers can be shared between
the application and the kernel if the application can guarantee that a buffer that has not yet been
transmitted will not be overwritten. One way to ensure this would be if the system call invoked to
send some data returns only after all of that data has been successfully transmitted. Since a reliable
transport protocol can consider a buffer to have been successfully transmitted only when all of the
data in that buffer has successfully reached the intended receiver, the application may need to wait
a while before it can reuse a buffer. An interesting alternative is to mark the buffer as copy-on-write
(COW), so that the contents of the buffer are copied to a separate buffer if and when the application
tries to overwrite it. Implementations of send-side zero-copy schemes for different operating systems
are described in [28].
Now consider the steps at a receiver. A receiver performs the steps shown in Figure 2.1(a) in
reverse order (there is no sendfile equivalent for the receiver). One way to implement zero-copy
on the receiver is to change the page table of the application process when it issues a recv system
call. This is called page flipping in [28]. Page flipping works only if the NIC separates the packet
payload and header, if the packet payload is an exact multiple of the page size and if the buffer
provided by the application is aligned to page boundaries. Because of these requirements there has
been little effort to implement such a scheme.
Several factors that influence communication overhead are presented in [33]. The memory and
I/O bus usage for schemes with different kernel and interface hardware support are compared. For
instance, the author shows how, by using DMA, NIC buffering and checksum offload, the number
of passes over the bus can be reduced from six to one.
2.2.2 Protocol Overhead
Apart from the memory and I/O bus, the other main end host resource that could become a bottle-
neck is processor cycles. TCP/IP, being the most widely used protocol stack, has received attention
in this regard. In [9] the processing overhead of TCP/IP is estimated, and the authors' conclusion
is that with a proper implementation, TCP/IP can sustain high throughputs efficiently. More recent
work presented in [17] takes into consideration the OS and hardware support that a TCP implemen-
tation will require.
The overhead of a transport layer protocol can be divided into two categories: per-packet costs
and per-byte costs [9, 28, 6]. Per-packet costs include protocol processing (e.g., processing the
sequence numbers on each packet in TCP) and interrupt processing. Per-byte costs are incurred
when data is copied or during checksum calculation.
Per-packet overhead can be reduced by reducing the number of packets handled during the
transfer. For a given transfer size, the number of packets can be reduced by using larger packets.
The maximum transmission unit (MTU) of the network constrains the packet size that an end host
can use. For instance, Ethernet imposes a 1500-byte limit on the IP datagram size. The concept
of jumbo frames was introduced by Alteon Networks to allow Ethernet frames of up to 9000 bytes,
and many gigabit Ethernet NICs now support larger frame sizes. Larger packet sizes can decrease
protocol processing overhead as well as the overhead of interrupt processing. NICs interrupt the
processor on frame transmission and reception. An interrupt is costly for the processor because
the state of the currently running process has to be saved and an interrupt handler invoked to deal
with the interrupt. As interface rates increase to 1 Gbps and higher, interrupt overhead can become
significant. Many high-speed NICs support interrupt coalescing so that the processor is interrupted
for a group of transmitted or received packets, rather than for each individual packet.
Schemes to reduce per-byte costs involved in copying data over the memory I/O bus were
described in Section 2.2.1. Checksum calculation can be combined with a copy operation and
carried out efficiently in software. For instance, the sender could calculate the checksum when data
is being copied from the user-space buffer to the kernel-space buffer. Another way to reduce the
processor's checksum calculation burden is to offload it to the interface card.
2.2.3 Disk Access
All the factors considered so far affect data transfer throughput. In designing a transport protocol
for dedicated circuits, not only does a high throughput have to be maintained, but the circuit utilization
should also be high. Thus end host factors that cause variability in the throughput also need to
be considered. For disk-to-disk data transfers, disk access can limit throughput as well as cause
variability. The file system used can have an effect on disk access performance. The time to
physically move the disk read/write head to the area on the hard disk where the desired data resides,
called seek time, is a major component of the disk access latency. File accesses tend to be sequential,
so a file system that tries to keep all parts of a file clustered together on the hard disk would perform
better than one in which a file is broken up into small pieces spread all over the hard disk.
At the sender, data needs to be read from the disk to memory. System calls to do this are
synchronous. When the system call returns successfully, the requested data is available in memory
for immediate use. Operating systems try to improve the efficiency of disk reads by reading in
more than the requested amount, so that some of the subsequent requests can be satisfied without
involving the disk hardware. At the data receiver, the system call to write to disk is asynchronous
by default. This means that when the system call returns it is not guaranteed that the data has been
written to disk; instead it could just be buffered in memory. Asynchronous writes are tailored to
make the common case of small, random writes efficient, since they allow the operating system
to schedule disk writes in an efficient manner. The operating system could reorder the writes to
minimize seeks. In Linux, for instance, data written to disk is actually copied to memory buffers
in the page cache and these buffers are marked dirty. Two kernel threads, bdflush and kupdate, are
responsible for flushing dirty buffers to disk. The bdflush kernel thread is activated when the number
of dirty buffers exceeds a threshold, and kupdate is activated whenever a buffer has remained dirty
too long. As a consequence of the kernel caching and delayed synchronization between memory
buffers and the disk, there can be significant variability in the conditions under which a disk write
system call operates.
2.2.4 Process scheduling
The final factor we consider is the effect of the process scheduler. All modern operating sys-
tems are multitasking. Processes run on the processor for short intervals of time and then either
relinquish the CPU voluntarily (e.g. if they block waiting for I/O) or are forcibly evicted by the
operating system when their time slice runs out. This gives users the impression that multiple processes
are running simultaneously. Multitasking, like packet-switched networking, tries to fairly divide up
a resource (processor cycles for multitasking; bandwidth for packet-switched networking) among
all contenders (multiple processes; multiple flows) for the resource. This behavior is at odds with
resource reservation in a connection-oriented network such as CHEETAH. If the degree of multitasking at an end host is high then a data transfer application may not get the processor cycles
required to fully use the reserved circuit. Even if the required number of free cycles are available,
the process scheduler might not be able to schedule the data transfer application in the monotonic
fashion required to send and receive data at the fixed circuit rate.
2.3 CHEETAH Network
CHEETAH, which stands for Circuit-switched High-speed End-to-End Transport ArcHitecture, is a
network architecture that has been proposed [37] to provide high-speed, end-to-end connectivity on
a call-by-call basis. Since the transport protocol proposed in this thesis is to be used over a dedicated circuit through a CHEETAH network, in this section we provide a description of CHEETAH.
2.3.1 Components of CHEETAH
Many applications in the scientific computing domain require high throughput transfers with deter-
ministic behavior. A circuit-switched path through the network can meet such requirements better
than a packet-switched path. CHEETAH aims to bring the benefits of a dedicated circuit to an end-
user. In order to allow wide implementation, CHEETAH has been designed to build on existing
network infrastructure instead of requiring radical changes. Ethernet and SONET (Synchronous
Optical Network) are the most widely used technologies in local area networks (LANs) and wide
area networks (WANs) respectively. To take advantage of this, a CHEETAH end-to-end path con-
sists of Ethernet links at the edges and Ethernet-over-SONET links in the core. Multi-Service
Provisioning Platforms (MSPPs) are hardware devices that make such end-to-end paths possible.
MSPPs are capable of mapping between the packet-switched Ethernet domain and the time divi-
sion multiplexed (TDM) SONET domain. MSPPs are an important component of the CHEETAH
architecture for three reasons.
1. The end hosts can use common Ethernet NICs and do not need, for instance, SONET line
cards.
2. Many enterprises already have MSPPs deployed to connect to their ISP's backbone network.
3. Standard signaling protocols, as defined for Generalized Multiprotocol Label Switching
(GMPLS) networks, are (being) implemented in MSPPs. This is essential to support dynamic
call-by-call sharing in a CHEETAH network.
2.3.2 Features of a CHEETAH Network
One of the salient features of CHEETAH is that it is an add-on service to the existing packet-
switched service through the Internet. This means, firstly, that applications requiring CHEETAH
service can co-exist with applications for which a path through the packet-switched Internet is good
enough. Secondly, because network resources are finite, it is possible that an application's request for a dedicated circuit is rejected; in such cases, the Internet path provides an alternative so that the application's data transfer does not fail. To realize this feature, end hosts are equipped with an
additional NIC that is used exclusively for data transfer over a CHEETAH circuit.
To make the CHEETAH architecture scalable, the network resource reservation necessary to
set up an end-to-end circuit should be done in a distributed and dynamic manner. Standardized
signaling protocols that operate in a distributed manner, such as the hop-by-hop signaling in GM-
PLS protocols, are key to achieving scalability. CHEETAH uses RSVP-TE 1 signaling in the control
plane. Dynamic circuit set up and tear down means that these operations are performed when (and
only when) required, as opposed to statically provisioning a circuit for a long period of time. Dy-
namic operation is essential for scalability because it allows the resources to be better utilized, thus
driving down costs. End-host applications that want to use a CHEETAH circuit are best-placed
to decide when the circuit should be set up or torn down. Therefore an end host connected to the
CHEETAH network runs signaling software that can be used by applications to attempt circuit set
up on a call-by-call basis.
With end-host signaling in place, applications that want to use a CHEETAH circuit can do so
in a dynamic manner. This leads to the question of whether, just because it can be done, a circuit
set up should be attempted for a given data transfer. In [37], analytical arguments are used to show
1 Resource Reservation Protocol-Traffic Engineering. This is the signaling component of the GMPLS protocols, the
other components being Link Management Protocol (LMP) and Open Shortest Path First-TE (OSPF-TE).
that, for data transfers above a threshold size, transfer delay can be reduced by using a CHEETAH
circuit rather than an Internet path. It is also worth noting that there are situations in which the
overhead of circuit set up makes it advantageous to use a path through the Internet, although for
wide-area bulk data transfer a dedicated circuit invariably trumps an Internet path.
2.3.3 The CHEETAH Testbed
To study the feasibility of the CHEETAH concept, an experimental testbed has been set up. This
testbed extends between North Carolina State University (NCSU), Raleigh, NC, and Oak Ridge Na-
tional Laboratory (ORNL), Oak Ridge, TN and passes through the MCNC point-of-presence (PoP)
in Research Triangle Park, NC and the Southern Crossroads/Southern LambdaRail (SOX/SLR) PoP
in Atlanta, GA. The testbed layout is shown in Figure 2.2. In this testbed, the Sycamore SN16000
Intelligent Optical Switch is used as the MSPP. In the figure we show end hosts connected directly
or through Ethernet switches to the gigabit Ethernet card on the SN16000. The cross connect card
is configured through the control card to set up a circuit. The SN16000 has an implementation of
the GMPLS signaling protocol that follows the standard and has been tested for interoperability.
"
"
' ( 0 3
( 5 7 9
' A
B
' 0 5 F G G G
3 H
( P 0 Q 3 0
"
"
' ( 0 3
( 5 7 9
' A
B
' 0 5 F G G G
3 H
`
`
"
"
' ( 0 3
( 5 7 9
' A
B
' 0 5 F G G G
3 H
a 0 0
d
H
S
d
H
3 H
0 ' i 0
Figure 2.2: CHEETAH experimental testbed
The testbed has been designed to support the networking needs of the TSI project mentioned
at the beginning of this chapter. We present results of experiments conducted over this testbed in
Chapter 4.
2.3.4 End-host Software Support for CHEETAH
To allow applications to start using CHEETAH circuits, software support is required to make the
end hosts CHEETAH enabled. The architecture of the end-host software is shown in Figure 2.3.
The relevant components of the CHEETAH end-host software are shown inside a dotted box to
signify that the application could either interact with each component individually or make higher-
level calls that hide the details of the components being invoked. To be able to use a CHEETAH
circuit between two end hosts, both should support CHEETAH.
The Optical Connectivity Service (OCS) client allows applications to query whether a re-
mote host is on the CHEETAH network. OCS uses the Internet's Domain Name Service (DNS) to provide additional information such as the IP address of the remote end's secondary NIC. As
mentioned earlier, depending on the situation, either a CHEETAH circuit or a path through the In-
ternet may be better for a particular transfer. The routing decision module takes measurements of
relevant network parameters (e.g., available bandwidth and average loss rate) and uses these along
with the parameters of a particular transfer (e.g., the file size and requested circuit rate) to decide
whether or not a CHEETAH circuit set up should be attempted. To achieve CHEETAH's goal of
distributed circuit set up, an RSVP-TE signaling module runs on each end host. The RSVP-TE
module exchanges control messages with the enterprise MSPP to set up and tear down circuits.
These control messages are routed through the primary NIC over the Internet. The final software
component is the transport protocol module. Depending on whether a circuit or an Internet path
is being used, the transport protocol used will be C-TCP or TCP. In this thesis the focus will be on
the design, implementation and evaluation of C-TCP.
To end this chapter we mention some of the other projects focused on connection-oriented
networking for e-science projects. UltraScience Net [36] is a Department of Energy sponsored
research testbed connecting Atlanta, Chicago, Seattle and Sunnyvale. This network uses a centralized scheme for the control-plane functions.

Figure 2.3: Architecture of CHEETAH end-host software

Another effort is the Dynamic Resource Allocation via
GMPLS Optical Networks (DRAGON) project [12]. DRAGON uses GMPLS protocols to support
dynamic bandwidth provisioning.
Chapter 3
UDP-BASED TRANSPORT PROTOCOL
In Chapter 2 we mentioned a few protocols that are based on UDP. There are good reasons for
taking this approach:
UDP provides the minimal functionality of a transport protocol. It transfers datagrams between two processes but makes no guarantees about their delivery. UDP's minimalism leaves
no scope for anything to be taken out of its implementation. Thus a new protocol built over
UDP has to add extra (and only the required) functionality. The significance of this is that
these additions can be done in the user space, without requiring changes to the operating
system code. This makes UDP-based solutions as easy to use and portable as an application
program.
The sockets API to the UDP and TCP kernel code is widely deployed and used. This makes
implementation easier and faster.
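As a minimal sketch of this approach (an illustrative Python example, with a hypothetical 4-byte sequence-number header rather than SABUL's actual packet format), a user-space protocol needs nothing more than the datagram socket API:

```python
import socket

# Data channel of a generic UDP-based protocol: plain datagrams, with a
# protocol-defined sequence number prepended by the user-space code.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(5.0)
port = recv.getsockname()[1]

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
seq = 0
# Hypothetical wire format: 4-byte big-endian sequence number + payload.
send.sendto(seq.to_bytes(4, "big") + b"data", ("127.0.0.1", port))

pkt, _ = recv.recvfrom(2048)
got_seq, payload = int.from_bytes(pkt[:4], "big"), pkt[4:]
```

Everything beyond this (reliability, rate control, flow control) is layered on by the protocol implementation in user space, typically over a separate control channel.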
The basic design of all UDP-based protocols is similar and is shown in Figure 3.1. Data packets
are transferred using UDP sockets. A separate TCP or UDP channel is used to carry control packets. Control packets serve to add features to the data transfer not provided by UDP's best-effort
service. We used the Simple Available Bandwidth Utilization Library (SABUL), a UDP-based data
transfer application, to implement the Fixed Rate Transport Protocol (FRTP). In this chapter we first
present an overview of the SABUL protocol and implementation. Then we describe the changes
Figure 3.1: Architecture of a generic UDP-based protocol
that we made to SABUL to implement FRTP. The advantages and shortcomings of this approach
are discussed.
3.1 SABUL Overview
SABUL is designed for bulk data transfers over high-bandwidth networks. SABUL's architecture
is the same as that shown in Figure 3.1. TCP is used for control packet transmission from the data
receiver to the data sender. SABUL adds reliability, rate-based congestion control and flow control
to UDPs basic data transfer service.
Providing end-to-end reliability (guaranteeing that all the data sent is received in the same order and without duplicates) is a function of the transport layer. SABUL implements the following
error control scheme for reliable transfer. It adds a sequence number to each UDP data packet.
The receiver detects packet loss using the sequence numbers of the received packets. On inferring
loss, the receiver immediately sends a negative-acknowledgement (NAK) control packet to convey
this information to the sender. The sender then recovers from the error by retransmitting the lost
packet(s). The receiver maintains an ERR timer to periodically send NAKs if there are missing
packets. This is to provide protection against lost retransmissions. For file transfers, reading data
from the disk for each retransmission is very expensive in time. Therefore, the sender keeps the
transmitted data in memory until it is acknowledged. A SABUL receiver periodically sends an ac-
knowledgement (ACK) control packet, acknowledging all packets received in-order. On receiving
an ACK, the sender can free the buffer space occupied by data that is confirmed to have been re-
ceived. In addition the SABUL sender has a timer that is reset each time a control packet is received.
If this timer (called the EXP timer) expires because no control information has been received, the
sender assumes that all unacknowledged packets have been lost and retransmits them.
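The receiver's loss-inference step can be sketched as follows (an illustrative Python function, not SABUL's actual code): any gap between the next expected sequence number and the highest sequence number seen so far identifies packets to report in a NAK.

```python
def missing_seqs(received, next_expected):
    # Infer lost packets from the sequence numbers of received packets:
    # everything from next_expected up to (but not including) the highest
    # sequence number seen, that has not itself arrived, goes on the loss list.
    seen = set(received)
    top = max(seen, default=next_expected)
    return [s for s in range(next_expected, top) if s not in seen]
```

For example, if packets 1, 2, 4 and 6 have arrived and 1 was the next expected packet, the function reports 3 and 5 as missing; packet 7 and beyond cannot yet be declared lost because no later packet has arrived to reveal the gap.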
SABUL uses a rate-based congestion control scheme. The sender modifies the sending rate
depending on the degree of congestion in the network. The SABUL receiver sends a periodic syn-
chronization (SYN) control packet containing the number of data packets received in the previous
SYN period. The sender uses this information to estimate the amount of loss and hence the con-
gestion in the network. Depending on whether the loss is above or below a threshold, the sending
rate is reduced or increased, respectively. The sending rate is modified by changing the inter-packet
gap.
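A sketch of this style of rate adjustment (an illustrative Python function; the threshold and adjustment factor are made-up values, not SABUL's actual constants):

```python
def adjust_gap(gap_us, loss_ratio, threshold=0.001, factor=1.125):
    # SABUL-style rate control, sketched: if the loss reported for the last
    # SYN period exceeds the threshold, widen the inter-packet gap (lower the
    # sending rate); otherwise narrow it (raise the rate).
    return gap_us * factor if loss_ratio > threshold else gap_us / factor
```

Because the sending rate is the packet size divided by the inter-packet gap, multiplying the gap by a factor divides the rate by the same factor.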
SABUL is a user-space implementation, which means a SABUL receiver cannot distinguish between loss due to network congestion and loss due to its receive buffer (the kernel UDP buffer) overflowing. The information in SYN packets represents both types of loss, and therefore, SABUL's
rate-based congestion control also serves as a reactive flow control strategy. In addition, a fixed
window is used to limit the amount of unacknowledged data in the network.
3.1.1 SABUL Implementation
The SABUL implementation is described next. It is important to separate the SABUL transport
protocol from an application that uses it. In the description below we refer to an application using
SABUL as the sending application or receiving application. The sending application generates
the data that is to be transferred using SABUL, for example by reading it from a file on the hard
disk. The receiving application, likewise, consumes the data transferred using SABUL. SABUL
is implemented in C++. The sending application invokes a SABUL method to put data into the
protocol buffer. SABUL manages the protocol buffer and transmits or retransmits data packets
from it. Two threads are used. One handles the interface with the sending application, mainly the
filling of the protocol buffer. The other thread is responsible for sending out data packets. The
sequence numbers of packets that need to be retransmitted are recorded in a loss list. Pseudocode
for the sender side functionality is shown below:
INITIALIZATION:
Create TCP socket on well-known port number
Listen for a connection
Accept connection from client
Get the UDP port number on which the receiver is expecting data
Calculate the inter-packet gap required to maintain the desired sending rate
Fork a new thread to handle the data transmission
DATA TRANSMISSION:
WHILE data transfer is not over
WHILE protocol buffer is empty AND data transfer is not over
Wait for data from the sending application
ENDWHILE
Poll control channel for control packets
IF control packet received THEN
Process control packet /* See below */
ENDIF
IF loss list is not empty THEN
Remove first packet from the loss list
ELSE
Form a new packet
ENDIF
Send the data packet by invoking the send() system call on the UDP socket
Wait till it is time to send the next packet
ENDWHILE
CONTROL PACKET PROCESSING:
IF ACK packet THEN
Release buffer space held by the acknowledged packet(s)
Update loss list
Inform sending application of availability of buffer space
ELSE IF NAK packet THEN
Update loss list
ELSE IF SYN packet THEN
Adjust sending rate
ENDIF
Two threads are used at the receiver too. One thread (call it the network thread) is responsible
for receiving data packets, writing the data into the protocol buffer and sending control packets.
The other thread (main thread) handles the interface with the receiving application, transferring
data from the protocol buffer to the application buffer. SABUL uses an optimization when the
receiving application asks to read more data than the protocol buffer has. The main thread sets a
flag indicating such a situation. On seeing this flag, the network thread copies all available data
into the application buffer and resets the flag. As the rest of the data requested by the receiving
application arrives, it is copied directly into the application buffer saving a memory copy. The
receiver side pseudocode follows.
INITIALIZATION:
Create TCP and UDP sockets
Connect to the sender
Inform the sender of the UDP port number
Fork a new thread to receive data
RECEIVING DATA:
WHILE all the data has not been received
IF receiving application is waiting for data THEN
Copy all ACKed data from protocol buffer to application buffer
ENDIF
IF ACK timer expired THEN
Send ACK packet
ENDIF
IF ERR timer expired THEN
Send NAK packet with sequence numbers of missing packets
ENDIF
IF SYN timer expired THEN
Send SYN packet with number of packets received in previous SYN interval
ENDIF
Get the address into which to receive the next expected data packet
Receive a data packet on the UDP socket
IF missing packets THEN
Add missing packets' sequence numbers to loss list
Send an immediate NAK packet
ENDIF
Update state variables like next expected sequence number, ACK sequence number
Update loss list
ENDWHILE
3.2 Modifications to SABUL: FRTP
Our initial idea for a transport protocol that can be used over dedicated circuits was that, since
bandwidth is reserved, the data should be just streamed across at the circuit rate. Transmitting at a
rate lower than the reserved circuit rate would leave bandwidth unutilized. Transmitting at a higher
rate would eventually lead to a buffer filling up and overflowing. Therefore we wanted a transport
protocol that could monotonically send data packets at a fixed rate. SABUL seemed like a perfect
match for doing this since it can maintain a fixed sending rate if its rate-based congestion control
was disabled. FRTP, our transport protocol for dedicated circuits, could be implemented just like
SABUL, except that the rate altering congestion control would be stripped out.
The first modification to SABUL code was to remove the rate-based congestion control that
modified the sending rate. Second, we added support for using separate NICs for the data and
control channels. This was in line with the CHEETAH concept of having two NICs on CHEETAH-
enabled hosts. SABUL (and hence, FRTP) has many parameters that can be tweaked to improve
its performance. The application, protocol and UDP buffer sizes can be changed. The values of
the different timers that SABUL uses are also available for adjustment. We ran experiments in a
laboratory setting [40] to determine the effect of some of these parameters on FRTP performance,
and possibly determine the optimal values. Although we failed to determine a set of optimal values
for the parameters, these experiments did reveal some of the flawed assumptions we were making.
3.2.1 Problems with the FRTP Implementation
We observed that even though FRTP was set up to send at a fixed rate, the throughput achieved
(amount of data transferred / transfer time) was lower than the sending rate. This difference grew as
the sending rate was increased. We found that the reasons for this discrepancy were two-fold. First,
the FRTP implementation was not able to maintain a monotonic sending rate. Second, even if the
sender was able to maintain a constant sending rate, the receiving application could not empty the
buffers at the same (or higher) rate. This led to receiver buffer overflow and retransmissions, which
reduced the throughput.
FRTP implements a fixed sending rate by maintaining a fixed inter-packet gap. For instance,
if 1500 byte packets are being used, a 1 Gbps sending rate can be maintained by ensuring that the
gap between successive transmitted packets is 12 μs (= 1500 bytes × 8 / 1 Gbps). Commodity operating
systems do not provide straightforward methods (if at all) to measure such small intervals of time
and certainly do not provide a method to reliably schedule a periodic action at such a fine granularity. For instance, the timer tick granularity available to user-space processes in Linux is 10 ms. To
overcome this, FRTP uses busy waiting to bide away the time between packet transmissions. If the
next packet needs to be sent at time t, FRTP does the following:
WHILE ((current time) < t)
NOP
ENDWHILE
The rdtsc (read time stamp counter) instruction, provided by Pentium processors, is used to get
an accurate value for the current time. The rdtsc instruction reads the time stamp counter that is
incremented at every processor tick. NOP is a no operation instruction. The busy waiting solution is
wasteful since the NOPs use up processor cycles that could have been used to accomplish something
more useful. It also does nothing to make the periodic invocation of an event reliable. If the sending
process were the only one running on the processor, then the busy-waiting scheme would reliably perform a periodic action. If a different process is running on the processor at time t, the FRTP sending
process will miss its deadline. In fact, since FRTP itself uses two threads at the sender, the thread responsible for filling the protocol buffer could interfere with the data-sending thread's busy-waiting-induced periodicity.
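The busy-waiting loop can be sketched as follows (an illustrative Python version, with time.perf_counter_ns() standing in for the rdtsc instruction; the 1 ms gap used below is chosen so the example runs quickly, and is far coarser than the 12 μs gap a 1 Gbps circuit would require):

```python
import time

def send_paced(packets, gap_ns, transmit):
    # Busy-wait pacing in the style of the FRTP sender: spin (the NOP loop)
    # until the deadline for the next packet, transmit it, then advance the
    # deadline by the fixed inter-packet gap.
    deadline = time.perf_counter_ns()
    for p in packets:
        while time.perf_counter_ns() < deadline:
            pass                      # burn cycles until it is time to send
        transmit(p)
        deadline += gap_ns

# Pace 5 dummy "packets" at a 1 ms gap, recording the transmit instants.
stamps = []
send_paced(range(5), 1_000_000, lambda p: stamps.append(time.perf_counter_ns()))
```

As the thesis notes, the spinning wastes cycles and provides no guarantee: if the process is descheduled at a deadline, that packet is simply sent late.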
SABUL's rate adjustment scheme has been removed from FRTP. Therefore FRTP does not have
even the reactive flow control of SABUL. This is acceptable if we can be sure that flow control is
not required. The FRTP receiver architecture for a transfer to disk can be represented as shown in
Figure 3.2. Using the notation introduced in Section 3.1, the network thread handles the transfer
marked I and the main thread and the receiving application handle II and III respectively. The
process scheduler has to put the threads on the processor for the transfers to take place. Transfer III
additionally depends on how long the write to disk takes. These factors introduce variability into
the receiving rate. Buffers can hide this variability so that even a constant sending rate does not
cause buffer overflow. For a sending rate S(t) held at a constant value S, a receiving rate R(t) and a
receive buffer of size B, for no loss to occur:
S·τ − ∫₀^τ R(t) dt ≤ B,  for all τ ∈ [0, T]   (3.1)
[Figure: receiver data path, from the UDP buffer (kernel space) to the protocol buffer to the application buffer (user space) to the disk, with the three transfers labeled I, II and III]
Figure 3.2: Need for receiver flow control
where [0,T] is the transfer interval. The (false) assumption behind our initial belief that it is enough
to just stream the data at the reserved circuit rate was that equation (3.1) holds throughout the
transfer. From our experiments we realized that not only is R(t) varying, we do not even know a
closed form expression for it, making the choice of S and B to satisfy equation (3.1) difficult. A
pragmatic approach is to assign sensible values to S and B, so that (3.1) is satisfied most of the time.
When it is not satisfied, there are losses and the error control algorithm will recover from the loss.
This is what we were seeing in our laboratory experiments (but with S(t) also varying with time).
A flow control protocol would attempt to ensure that the above equation is satisfied all the time, by
varying S(t). Unfortunately this implementation of FRTP has no flow control.
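A discrete version of condition (3.1) makes the interaction of S, R(t) and B concrete (an illustrative sketch; rates are in packets per time step):

```python
def max_backlog(S, rates, dt=1.0):
    # Discrete check of condition (3.1): with a constant sending rate S and
    # sampled receiving rates R(t), the backlog S*t - integral(R) must stay
    # at or below the receive buffer size B, or packets are dropped.
    backlog = worst = 0.0
    for r in rates:
        backlog = max(0.0, backlog + (S - r) * dt)
        worst = max(worst, backlog)
    return worst
```

A buffer of size B avoids loss for a given trace exactly when max_backlog(S, trace) ≤ B; since R(t) is unknown in advance, choosing S and B this way is guesswork, which is the point made above.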
3.2.2 Possible Solutions
Our attempts to solve the two problems we identified with the FRTP implementation (the use of busy waiting to ensure a steady rate, and the lack of flow control) are described next. The ideal solution
for maintaining a fixed inter-packet gap would involve transmitting a packet, giving up the processor
and reclaiming it when it is time to send the next packet. Linux offers a system call to relinquish
the processor. To see why it is not possible to reclaim the processor at a deterministic future time, it is essential to understand how the Linux scheduler schedules processes to run. The scheduler maintains several queues, of which only two are important for our purposes: one of processes that are ready to run (the RUNNABLE queue) and one of processes that are waiting for some condition that
will make them ready to run (the INTERRUPTIBLE queue). For instance, if a process executes
instructions to write to disk, it is put in the INTERRUPTIBLE queue. When the write to disk
completes and the hard drive interrupts the processor the process is put back in the RUNNABLE
queue. So what happens when, after transmitting a packet, the FRTP sending process gives up the
CPU? Usually, the system call used to relinquish the processor allows the process to specify a time
after which it is to be made runnable again. The process is put in the INTERRUPTIBLE queue and
when the operating system determines that the time for which the process had asked to sleep has
passed, it is put back in the RUNNABLE queue. The problem arises because the operating system
uses the timer interrupts (which have a 10 ms period in Linux) to check whether the sleep time has
passed. Therefore if a process asked to sleep for 1 second, it is guaranteed to become runnable
after a time between 1.0 and 1.01 seconds, but if it asks to sleep for 100 s it will become runnable
after some time between 100 s and 10100 s. Note that if we give this process the highest priority
then its becoming runnable implies that it runs on the processor, so we ignore the scheduling delay
between a process becoming ready to run and actually running. Thus on Linux (and other operating
systems that dont support real-time processes) it is not possible for a user space process to send
packets monotonically at a high rate.
An alternate approach would be to maintain the sending rate, not on a packet-by-packet basis,
but in a longer time frame. This can be done by ensuring that N packets are sent every T units
of time such that (N/T) is the desired sending rate. This would cause a burst of N packets in the network, so we would like to keep T as small as possible. In the limit N becomes 1 and we get what
SABUL attempts to implement. The sending process should get a periodic impulse every T units
of time and in response send out the N packets. Linux offers user-space processes the ability to
receive such periodic impulses in the form of signals. A process can use the setitimer() system call
to activate a timer. This timer causes a signal to be sent periodically to the process. We modified the
FRTP code to use periodic signals to maintain the sending rate. This reduced the CPU utilization at
the sender compared to the earlier busy waiting scheme. But the lack of real-time support on Linux
meant that even if the signals were being sent like clockwork the user-space process was not always
able to start sending the next burst of packets immediately. We observed that occasionally some
signals would be missed because an earlier one was still pending.
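The signal-driven scheme can be sketched in a few lines (an illustrative Python example of the same setitimer() mechanism, on a Unix system; the 10 ms period is chosen for the example):

```python
import signal
import time

# Periodic impulses via setitimer(): each SIGALRM marks the instant at which
# the sender would transmit its next burst of N packets.
ticks = []
signal.signal(signal.SIGALRM, lambda signum, frame: ticks.append(time.monotonic()))
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)   # first fire in 10 ms, then every 10 ms

deadline = time.monotonic() + 1.0
while len(ticks) < 3 and time.monotonic() < deadline:
    time.sleep(0.005)                               # sleeps are cut short by the signal

signal.setitimer(signal.ITIMER_REAL, 0.0)           # disarm the timer
```

The sketch also hints at the failure mode described above: the handler only runs when the process is actually scheduled, so under load an impulse can be delivered late or coalesced with a pending one.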
We now consider the problem of adding flow control to FRTP. Since flow control is supposed to
avoid receiver buffer overflow, the data receiver is best placed to provide the information based on
which the sender can control the flow of data. SABULs sending rate adjustment in response to lost
packets is a form of flow control that does not use explicit information from the receiver. SABULs
flow control scheme was not very effective since we observed substantial loss and retransmission.
To be able to send back buffer status information, the receiver has to have timely access to this in-
formation. Although the FRTP receiver can accurately determine how much free space is available
in the protocol and application buffers (see Figure 3.2), it does not have access to the current status
of the UDP buffer in kernel memory. The kernel does not make any effort to avoid UDP buffer
overflows. The filling and emptying of a user space buffer are fully in the control of a user space
process. So if a user space buffer is short on free space, the process can choose not to read in more
data. With the UDP buffer the kernel has no control over the filling of the buffer since packets arrive
asynchronously over the network. That is why flow control is necessary to prevent the UDP buffer
from overflowing. Therefore, any flow control scheme which requires explicit buffer status infor-
mation from the receiver would need support from the kernel. By choosing to implement FRTP in
the user space over UDP, we lose the opportunity to implement such a flow control scheme.
Chapter 4
TCP-BASED SOLUTION
In the previous chapter we pointed out the shortcomings of a UDP-based transport protocol that
were uncovered while implementing FRTP using SABUL. We realized that more support from
the operating system would be required to better match the behavior of the end hosts with that of
the network in which resources were reserved. This chapter describes our efforts to implement a
transport protocol for dedicated circuits that is more closely tied in with the operating system than
the user-space FRTP. Our protocol is based on the TCP implementation in Linux. To reflect this
fact, we call this protocol Circuit-TCP (C-TCP).
In this chapter, first an overview of TCP is presented. Then we look at the advantages of using
TCP to implement a transport protocol for dedicated circuits. Next, we present the implementation
of C-TCP. C-TCP has been tested on the CHEETAH testbed. Results from these experiments and a
discussion of their significance concludes this chapter.
4.1 Transmission Control Protocol - A Primer
TCP is the transport protocol of the TCP/IP suite of protocols. It is a connection-oriented protocol
that provides reliability, distributed congestion control and end-to-end flow control. Note that the
meaning of TCP being a connection-oriented protocol is different from the use of the phrase in
connection-oriented network. In order to provide its end-to-end services, TCP maintains state
for each data stream. Thus, TCP creates a connection between two end points wishing to communicate reliably (the end points can be processes on end hosts), maintains state information about
the connection and disconnects the two end points when they no longer need TCP's service. In
a connection-oriented network, a connection refers to physical network resources that have been
reserved, and that taken together form an end-to-end path.
Applications wishing to use TCP's service use the sockets interface that the TCP/IP stack in the
operating system provides. Two processes that want to use TCP to communicate create sockets and
then one of the processes connects its socket to the remote socket. A connection is established if
the connection request is accepted by the remote end. TCP uses a 3-way handshake to establish a
connection. Connection establishment also initializes all of the state information that TCP requires
to provide its service. This state is stored in the data structures associated with the sockets on each
end of a connection. We now present brief descriptions of four of TCP's functions. For a more
detailed description, please see [29], [8] and [1].
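The connection life cycle described above can be illustrated at the application level with a minimal sketch. Note that this uses Python's socket API purely for illustration; the thesis's implementation lives in the Linux kernel's C stack. Here connect() initiates TCP's 3-way handshake, accept() yields the established connection, and close() releases it.

```python
import socket
import threading

def run_server(ready):
    # Create a listening socket; accept() completes the 3-way handshake
    # initiated by the remote connect().
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    srv.listen(1)
    ready["port"] = srv.getsockname()[1]
    ready["event"].set()
    conn, _ = srv.accept()              # per-connection state now exists on both ends
    data = conn.recv(1024)
    conn.sendall(data)                  # echo the data back
    conn.close()                        # connection release
    srv.close()

ready = {"event": threading.Event()}
t = threading.Thread(target=run_server, args=(ready,))
t.start()
ready["event"].wait()

# Client side: connect() triggers the 3-way handshake (SYN, SYN-ACK, ACK).
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", ready["port"]))
cli.sendall(b"hello")
echoed = cli.recv(1024)
cli.close()
t.join()
print(echoed.decode())   # prints: hello
```

All of the state TCP needs (sequence numbers, windows, timers) is initialized behind these calls and stored in the socket's kernel data structures.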
4.1.1 Error Control
Each data byte transferred by TCP is assigned a unique sequence number. During connection
establishment the two ends of a connection exchange starting sequence numbers. The TCP at the
receiving end maintains information about sequence numbers that have been successfully received,
the next expected sequence number and so on. The receiver can make use of the sequence numbers
of received data to infer data reordering with certainty, but not data loss. In fact, neither the TCP
at the sender nor the one at the receiver can reliably detect packet loss since a packet presumed lost
could just be delayed in the network. TCP uses acknowledgements (ACKs) of successfully received
data and a sender-based retransmission time-out (RTO) mechanism to infer data loss. The time-out
value is calculated carefully using estimates of RTT and RTT variance, to reduce the possibility of
falsely detecting loss or waiting too long to retransmit lost data. An optimization that was proposed
and has been widely implemented is the use of triple duplicate ACKs to infer loss early rather than
wait for the RTO to expire. A TCP receiver sends back a duplicate ACK whenever an out-of-order
packet arrives. For instance, suppose packets Pn, Pn+1, Pn+2, Pn+3 and Pn+4 contain data that is
contiguous in the sequence number space. If Pn+1 goes missing, then the receiving TCP sends back
duplicate ACKs acknowledging the successful receipt of Pn when Pn+2, Pn+3 and Pn+4 arrive. On
getting 3 duplicate ACKs, a TCP sender assumes that the data packet immediately following the
(multiply) ACKed data was lost. The sender retransmits this packet immediately. This is called fast
retransmit. As was pointed out in Chapter 2, many enhancements to TCP have been proposed and
implemented, such as the use of SACKs, that improve TCP's loss recovery, among other things.
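The fast-retransmit rule above can be sketched in a few lines. This is an illustrative model, not the thesis's kernel code; the class name and bookkeeping are invented for the example, and only cumulative ACKs are modeled.

```python
# Sender-side duplicate-ACK counting: three duplicates trigger fast retransmit.
DUP_ACK_THRESHOLD = 3

class FastRetransmitSender:
    def __init__(self):
        self.last_ack = None      # highest cumulative ACK seen so far
        self.dup_count = 0        # duplicates of that ACK
        self.retransmitted = []   # sequence numbers retransmitted early

    def on_ack(self, ack_seq):
        """Process a cumulative ACK; return a seq to fast-retransmit, if any."""
        if ack_seq == self.last_ack:
            self.dup_count += 1
            if self.dup_count == DUP_ACK_THRESHOLD:
                # Assume the packet immediately following the ACKed data was lost.
                self.retransmitted.append(ack_seq)
                return ack_seq
        else:
            self.last_ack = ack_seq
            self.dup_count = 0
        return None

sender = FastRetransmitSender()
# Receiver got Pn, then Pn+2, Pn+3, Pn+4 (Pn+1 missing): it repeats the same
# cumulative ACK once for Pn and once per out-of-order arrival.
acks = [101, 101, 101, 101]
events = [sender.on_ack(a) for a in acks]
print(events)   # [None, None, None, 101]: retransmit fires on the third duplicate
```

In real TCP the retransmission timer (RTO) remains the fallback; fast retransmit merely avoids waiting for it when enough duplicate ACKs arrive.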
4.1.2 Flow Control
Flow control allows a receiving TCP to control the amount of data sent by a sending TCP. With
each ACK, the receiving TCP returns the amount of free space available in its receive buffer. This
value is called the receiver advertised window (rwnd). The sending TCP accomplishes flow control
by ensuring that the amount of unacknowledged data (the demand for receiver buffer space) does
not exceed rwnd (the supply of buffer space on the receiver).
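The supply-and-demand check described above amounts to a single inequality, sketched below with invented variable names (snd_nxt and snd_una follow common TCP terminology for the next sequence number to send and the oldest unacknowledged one):

```python
# Window-based flow control: the sender may transmit only while
# unacknowledged data stays within the receiver-advertised window.
def sendable_bytes(snd_nxt, snd_una, rwnd):
    """Bytes the sender may still put on the wire.
    snd_nxt: next sequence number to send
    snd_una: oldest unacknowledged sequence number
    rwnd:    receiver-advertised window from the latest ACK
    """
    in_flight = snd_nxt - snd_una        # demand for receiver buffer space
    return max(0, rwnd - in_flight)      # remaining supply

print(sendable_bytes(5000, 3000, 4000))  # 2000: room for 2000 more bytes
print(sendable_bytes(7000, 3000, 4000))  # 0: window full, sender must wait for ACKs
```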
4.1.3 Congestion Control
The original specification of TCP [29] did not include congestion control. TCP's congestion control
algorithm was proposed in [21]. Just as flow control tries to match the supply and demand for the
receiver buffer space, congestion control matches the supply and demand for network resources
like bandwidth and switch/router buffer space. This is a much more complex problem because
TCP is designed to work on packet-switched networks in which multiple data flows share network
resources. TCP's congestion control algorithm is a distributed solution in which each data flow
performs congestion control using only its own state information, with no inter-flow information
exchange.
TCP congestion control is composed of three parts.
1. Estimate the current available supply of the network resources and match the flow's demand
to that value.
2. Detect when congestion occurs (i.e. demand exceeds supply).
3. On detecting congestion, take steps to reduce it.
TCP maintains a state variable, congestion window (cwnd), which is its estimate of how much
data can be sustained in the network. TCP ensures that the amount of unacknowledged data does
not exceed cwnd,¹ and thus uses cwnd to vary a flow's resource demand. Since a sending TCP
has no explicit, real-time information about the amount of resources available in the network, the
cwnd is altered in a controlled manner, in the hope of matching it to the available resources. The
cwnd is increased in two phases. The first phase, which is also the one in which TCP starts, is
called slow start. During slow start, cwnd is incremented by one packet for each returning ACK that
acknowledges new data. Thus, if cwnd at time t was C(t), all of the unacknowledged data at t would
get acknowledged by time (t + RTT), and C(t + RTT) would be C(t) + C(t) = 2C(t). Slow start
is used whenever the value of cwnd is below a threshold value called the slow start threshold (ssthresh).
When cwnd increases beyond ssthresh, TCP enters the congestion avoidance phase, in which the rate
of cwnd increase is reduced. During congestion avoidance, each returning ACK increments cwnd
from C to (C + 1/C). An approximation used by many implementations is to increment C to (C + 1)
at the end of an RTT (assuming the unit for cwnd is packets).
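The two growth phases can be sketched as a small simulation (illustrative Python, not the thesis's kernel code; grow_one_rtt is a hypothetical helper that applies the per-ACK rule once per ACK for one RTT's worth of ACKs):

```python
def grow_one_rtt(cwnd, ssthresh):
    """Apply the per-ACK cwnd rule for one RTT's worth of ACKs (units: packets)."""
    acks = int(cwnd)              # one ACK per outstanding packet
    for _ in range(acks):
        if cwnd < ssthresh:
            cwnd += 1.0           # slow start: +1 per ACK, doubling cwnd per RTT
        else:
            cwnd += 1.0 / cwnd    # congestion avoidance: roughly +1 per RTT
    return cwnd

cwnd, ssthresh = 1.0, 16.0
per_rtt = [cwnd]
for _ in range(6):
    cwnd = grow_one_rtt(cwnd, ssthresh)
    per_rtt.append(cwnd)
print(per_rtt)   # doubles each RTT (1, 2, 4, 8, 16), then grows by ~1 per RTT
```

The trace shows exponential growth up to ssthresh and the much slower, near-linear growth afterwards.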
The second component of congestion control is congestion detection. TCP uses packet loss as
an indicator of network congestion. Thus, each time a sending TCP infers loss, whether through an
RTO or triple duplicate ACKs, it assumes that the loss was caused by network congestion. Other
congestion indicators have been proposed. For instance, in Chapter 2 we mentioned that FAST
uses queueing delay to detect network congestion. Some researchers have proposed that a more
proactive approach should be adopted, and congestion should be anticipated and prevented, rather
than reacted to. Such a proactive approach would require congestion information from the network
nodes. See [5] for a discussion of the Active Queue Management (AQM) mechanisms that routers
need to implement, and [15] for a description of the Random Early Detect (RED) AQM scheme.
In [30], the modifications that need to be made to TCP in order to take advantage of the congestion
information provided by routers using AQM is presented.
The third component of congestion control is taking action to reduce congestion once it has been
detected. The fact that congestion occurred (and was detected) means that TCP's estimate of the
available network resource supply is too high. Thus, to deal with congestion, TCP reduces its
estimate by cutting down cwnd. On detecting loss, the sending TCP first reduces ssthresh to half of
the flight size, where flight size is the amount of data that has been sent but not yet acknowledged
(the amount in flight). The next step is to reduce cwnd. The amount of reduction varies depending
on whether the loss detection was by RTO or triple duplicate ACKs. If an RTO occurred then the
congestion in the network is probably severe, so TCP sets cwnd to 1 packet. The receipt of duplicate
ACKs means that packets are getting through to the receiver and hence congestion is not that severe.
Therefore, in this case cwnd is set to (ssthresh + 3) packets and incremented by 1 packet for each
additional duplicate ACK. This is called fast recovery.
¹Recall that flow control requires the amount of unacknowledged data to be less than rwnd. TCP implementations
use min(rwnd, cwnd) to bound the amount of unacknowledged data.
The linear increase of cwnd by one packet per RTT, during congestion avoidance, and its
decrease by a factor of two during recovery from loss is called Additive Increase Multiplicative
Decrease (AIMD). TCP uses an AI factor of one (cwnd ← cwnd + 1) and an MD factor of two
(cwnd ← cwnd/2).
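The loss reactions described above can be summarized in one small function (an illustrative sketch with invented names, not the kernel implementation; units are packets, and the lower bound of 2 on ssthresh follows common practice):

```python
# TCP's reaction to inferred loss: halve ssthresh, then cut cwnd according
# to how the loss was detected.
def on_loss(flight_size, detected_by):
    """Return (ssthresh, cwnd) after a loss event (units: packets)."""
    ssthresh = max(flight_size // 2, 2)    # half the flight size, floored at 2
    if detected_by == "rto":
        cwnd = 1                           # severe congestion: restart slow start
    elif detected_by == "dupacks":
        cwnd = ssthresh + 3                # fast recovery: account for the 3 dup ACKs
    else:
        raise ValueError(detected_by)
    return ssthresh, cwnd

print(on_loss(20, "rto"))       # (10, 1)
print(on_loss(20, "dupacks"))   # (10, 13)
```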
4.1.4 Self Clocking
Although TCP does not explicitly perform rate control, the use of ACK packets leads to a handy
rate maintenance property called self clocking [21]. Consider the situation shown in Figure 4.1.
The node marked SENDER is sending data to the RECEIVER, which is three hops away.² The links
LINK1, LINK2 and LINK3 are logically separated to show data flow in both directions. The width
of a link is indicative of its bandwidth, so LINK2 is the bottleneck in this network. The shaded
blocks are packets (data packets and ACKs), with packet size proportional to a block's area. The
figure shows the time instant when the sender has transmitted a window's worth of packets at the
rate of LINK1. Because all these packets have to pass through the bottleneck link, they reach the
receiver at LINK2's rate. This is shown by the separation between packets on LINK3. The receiver
generates an ACK for each successfully received data packet. If we assume that the processing time
for each received data packet is the same, then the ACKs returned by the receiver have the same
spacing as the received data packets. This ACK spacing is preserved on the return path. Each ACK
allows the sender to transmit new data packets. If a sender has cwnd worth of data outstanding in
the network, new data packets are transmitted only when ACKs arrive. Thus, the sending rate (in
data packets per unit time) is maintained at the rate of ACK arrival, which in turn is determined by
the bottleneck link rate. This property of returning ACKs clocking out data packets is called self
clocking.
²This figure is adapted from one in [21].
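A toy timeline makes the effect concrete. This is a deliberately simplified model (invented function and parameters, not from the thesis): the sender bursts a window at LINK1's rate, the bottleneck serializes the packets, and from then on each returning ACK clocks out one new packet.

```python
# Self clocking, modeled as send times: after the initial burst, new sends
# are triggered by ACK arrivals, which are spaced at the bottleneck rate.
def ack_clocked_send_times(window, first_link_gap, bottleneck_gap, rtt):
    # Initial burst: a full window back-to-back at LINK1's (fast) rate.
    sends = [i * first_link_gap for i in range(window)]
    # The bottleneck serializes packets, so ACKs return bottleneck_gap apart;
    # each ACK permits exactly one new data packet.
    for i in range(window, 3 * window):
        ack_arrival = rtt + (i - window) * bottleneck_gap
        sends.append(ack_arrival)
    return sends

sends = ack_clocked_send_times(window=4, first_link_gap=1, bottleneck_gap=5, rtt=40)
gaps = [b - a for a, b in zip(sends, sends[1:])]
print(gaps)   # initial gaps of 1, then a pause, then steady gaps of 5
```

The steady-state inter-send gap equals the bottleneck spacing, not LINK1's, which is exactly the property C-TCP later exploits to hold a fixed sending rate without busy waiting.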
[Figure: a SENDER and a RECEIVER connected through two NETWORK NODEs by LINK1, LINK2 and LINK3; data packets spread out at the bottleneck LINK2, and ACKs return with the same spacing.]
Figure 4.1: TCP self clocking
4.2 Reasons for Selecting TCP
In Chapter 3, two problems were identified in a user-space UDP-based implementation of FRTP.
1. Use of busy waiting to maintain a fixed inter-packet gap, and thus a fixed rate, does not work
very well. Even if it did work perfectly, it is wasteful of CPU cycles.
2. The difficulty of maintaining a fixed receiving rate makes flow control very attractive. A
proactive scheme, in which the receiver is able to prevent buffer overflow, requires kernel
support that a user-space FRTP cannot get. By removing SABUL's rate-based congestion
control, FRTP forgoes SABUL's reactive flow control too. Thus, FRTP has null flow control.
In this section, two issues are addressed: first, whether TCP is better at tackling the two problems
listed above; second, whether there are other issues unique to TCP that need to be considered.
The description of TCP's slow start and AIMD schemes in Section 4.1.3 shows that TCP does
not maintain a fixed sending rate. TCP is designed with the assumption that the available bandwidth
in the network (called supply in Section 4.1) changes over time, as other data flows start or end,
and that its instantaneous value is not known. TCP's congestion control algorithms attempt to match
a flow's sending rate to the available network bandwidth, in spite of this incomplete knowledge. But
such a rate-altering algorithm is not needed on dedicated circuits.
If we assume that TCP's congestion control can be disabled, how well can TCP maintain a fixed
sending rate and at what granularity? The self clocking property provides a low-overhead way to
maintain a steady sending rate. In steady state, each returning ACK clocks out a data packet so a
steady sending rate can be maintained at a granularity of packets. Moreover, packet transmission is
initiated as a result of an interrupt (the NIC raises an interrupt when an ACK is received), and so is
much less likely to be disturbed by the behavior of the process scheduler. This is a major advantage
of shifting the responsibility of maintaining a steady rate to the kernel domain.
The variability in the receiving rate is caused by the receiving application's interaction with the
process scheduler and the disk. This problem is not solved by using a different transport protocol.
But TCP's flow control is designed to minimize the impact of such variability on data transfer
performance. TCP uses a window-based flow control scheme (see Section 4.1.2) that prevents
receive buffer overflow, unlike SABUL, which reacts to packet loss caused by buffer overflow.
TCP appears to adequately deal with the two problems identified in implementing FRTP. In
addition, there are a few other reasons for choosing TCP, which we point out next. Once it had been
established that flow control required kernel support, our choice was essentially made. We did not
have the expertise to implement a kernel-space protocol starting from scratch. So, our protocol had
to be implemented by modifying an existing, stable kernel-space transport protocol. TCP and UDP
are so widely used and well understood that, unless some other protocol is clearly more suitable, it
makes sense to modify TCP or UDP. Another reason for choosing to use TCP is that error control
comes for free. In the next section, the protocol design for C-TCP is presented and it should be clear
that for the majority of transport protocol functions, what TCP implements works regardless of
whether the underlying network is connectionless or connection-oriented.
So is TCP the answer to all our problems? Well, no. Without any modifications, TCP's congestion
control algorithm is not suitable for use over a dedicated circuit. One of the main differences
between TCP and C-TCP is the congestion control algorithm used. We describe C-TCP in more
detail in the next two sections. A practical issue with any kernel-space modification is that it is
much less convenient to deploy than a user-space application, which can simply be downloaded,
built and installed; installing a modified kernel requires the host to be rebooted.
4.3 Circuit-TCP Design
In this section the design of C-TCP is described. Five functions of a transport protocol are
considered, namely connection establishment, congestion control, multiplexing, flow control and error
control. For each of these functions, we consider whether it is required on a dedicated circuit and,
if so, how to provide the function.
4.3.1 Connection Establishment
It is useful in the design of a transport protocol to think in terms of control and data planes. Control
plane functions support the data plane. For instance, TCP's three-way handshake for connection
establishment is used to agree upon an initial sequence number to be used in the data transfer that
follows. Like TCP, C-TCP requires state to be maintained for each data flow. The connection-
establishment and release schemes are used unaltered from TCP.
4.3.2 Congestion Control
Network congestion occurs when the