

    A TRANSPORT PROTOCOL

    FOR

    DEDICATED END-TO-END CIRCUITS

    A Thesis

    Presented to

    the faculty of the School of Engineering and Applied Science

    University of Virginia

    In Partial Fulfillment

    of the requirements for the Degree

    Master of Science

    Computer Engineering

    by

    Anant P. Mudambi

    January 2006


    APPROVAL SHEET

    This thesis is submitted in partial fulfillment of the requirements for the degree of

    Master of Science

    Computer Engineering

    Anant P. Mudambi

    This thesis has been read and approved by the examining committee:

    Malathi Veeraraghavan (Advisor)

    Marty A. Humphrey (Chair)

    Stephen G. Wilson

    Accepted for the School of Engineering and Applied Science:

    Dean, School of Engineering and Applied Science

    January 2006


    Abstract

E-science projects involving geographically distributed data sources, computing resources, and scientists have special networking requirements, such as steady throughput and deterministic behavior. The connectionless Internet model is not well suited to meeting such requirements. Connection-oriented networks that offer guaranteed-rate, dedicated circuits have been proposed to meet the high-end networking needs of distributed scientific research. In this work we describe the design and implementation of a transport protocol for such dedicated circuits.

We present an initial user-space, UDP-based implementation called the Fixed Rate Transport Protocol (FRTP). The constraints imposed by a user-space implementation led us to implement a lower-overhead kernel-space solution that we call Circuit-TCP (C-TCP). The key feature of C-TCP is that it maintains a fixed sending rate, closely matched to the circuit rate, with the aim of achieving high circuit utilization. We implemented C-TCP by modifying the Linux TCP/IP stack. Experimental results on a wide-area circuit-switched testbed show that C-TCP is able to quickly utilize circuit bandwidth and sustain a high data transfer rate.


    Acknowledgments

I would like to thank Prof. Malathi Veeraraghavan for her advice and for keeping me on the right track. I thank the members of the CHEETAH research group, Xuan, Xiangfei, Zhanxiang and Xiuduan, for all their help.

Anil and Kavita, thank you for keeping me motivated. Finally, the biggest thank you to my parents, for their incredible support and love.


Contents

1 INTRODUCTION

2 BACKGROUND
  2.1 Related Work
    2.1.1 TCP Enhancements
    2.1.2 UDP-based Protocols
    2.1.3 Novel Protocols
  2.2 End-host Factors that Affect Data Transfer Performance
    2.2.1 Memory and I/O bus usage
      2.2.1.1 Zero-copy Networking
    2.2.2 Protocol Overhead
    2.2.3 Disk Access
    2.2.4 Process scheduling
  2.3 CHEETAH Network
    2.3.1 Components of CHEETAH
    2.3.2 Features of a CHEETAH Network
    2.3.3 The CHEETAH Testbed
    2.3.4 End-host Software Support for CHEETAH

3 UDP-BASED TRANSPORT PROTOCOL
  3.1 SABUL Overview
    3.1.1 SABUL Implementation
  3.2 Modifications to SABUL: FRTP
    3.2.1 Problems with the FRTP Implementation
    3.2.2 Possible Solutions

4 TCP-BASED SOLUTION
  4.1 Transmission Control Protocol - A Primer
    4.1.1 Error Control
    4.1.2 Flow Control
    4.1.3 Congestion Control
    4.1.4 Self Clocking
  4.2 Reasons for Selecting TCP
  4.3 Circuit-TCP Design
    4.3.1 Connection Establishment
    4.3.2 Congestion Control
    4.3.3 Multiplexing
    4.3.4 Flow Control
    4.3.5 Error Control
  4.4 C-TCP Implementation
    4.4.1 Web100
    4.4.2 Implementation Details
  4.5 Experimental Results
    4.5.1 Utility of Disabling Slow Start
    4.5.2 Sustained Data Transfer
      4.5.2.1 Reno-TCP Performance
      4.5.2.2 BIC-TCP Performance
      4.5.2.3 C-TCP Performance

5 CONTROL-PLANE FUNCTIONS
  5.1 Selecting the Circuit Rate
  5.2 Setting up the Circuit

6 CONCLUSIONS
  6.1 Conclusions
    6.1.1 Transport Protocol Design for Dedicated Circuits
    6.1.2 Transport Protocol Implementation
  6.2 Future Work

A DISK WRITE RATE ESTIMATION
  A.1 How Linux Handles Disk Writes
  A.2 Benchmark Results

Bibliography


List of Figures

2.1 Memory I/O bus usage
2.2 CHEETAH experimental testbed
2.3 Architecture of CHEETAH end-host software
3.1 Architecture of a generic UDP-based protocol
3.2 Need for receiver flow control
4.1 TCP self clocking
4.2 Congestion control in the control plane
4.3 Structure of the Web100 stack
4.4 Maximum buffer space required for a C-TCP burst
4.5 Testbed configuration for C-TCP tests
4.6 TCP and C-TCP comparison for different transfer sizes
4.7 Start-up behavior of TCP and C-TCP
4.8 Throughput and RTT using Reno-TCP
4.9 Throughput and RTT using BIC-TCP
4.10 Throughput and RTT using C-TCP
5.1 Throughput variability of disk-to-disk transfers
5.2 Trade-off between circuit utilization and delay


List of Tables

5.1 xdd benchmark results on zelda4
5.2 Disk write rate (Mbps) for individual runs using 32 KB request sizes
A.1 End host configurations
A.2 Disk write rate results using xdd


List of Abbreviations

ACK Acknowledgement
AIMD Additive Increase Multiplicative Decrease
API Application Programming Interface
AQM Active Queue Management
BDP Bandwidth Delay Product
BIC-TCP Binary Increase Congestion control TCP
CHEETAH Circuit-switched High-speed End-to-End Transport ArcHitecture
COW Copy On Write
C-TCP Circuit TCP
cwnd congestion window
DMA Direct Memory Access
DNS Domain Name Service
DRAGON Dynamic Resource Allocation via GMPLS Optical Networks
FAST Fast AQM Scalable TCP
FRTP Fixed Rate Transport Protocol
GbE Gigabit Ethernet
Gbps Gigabits per second
GB Gigabyte
GMPLS Generalized Multiprotocol Label Switching
HS-TCP HighSpeed TCP
I/O Input/Output
IP Internet Protocol
KB Kilobyte
LAN Local Area Network
LMP Link Management Protocol
Mbps Megabits per second
MB Megabyte
MSPP Multi-Service Provisioning Platform
MTU Maximum Transmission Unit
NAK Negative ACK
NETBLT Network Block Transfer
NIC Network Interface Card
OC Optical Carrier
OCS Optical Connectivity Service
OS Operating System
OSPF Open Shortest Path First
RBUDP Reliable Blast UDP
RED Random Early Detection
RSVP-TE Resource Reservation Protocol-Traffic Engineering
RTO Retransmission Time-out
RTT Round Trip Time
rwnd receiver advertised window
SABUL Simple Available Bandwidth Utilization Library
SACK Selective ACK
SONET Synchronous Optical Network
ssthresh slow start threshold
TCP Transmission Control Protocol
TDM Time Division Multiplexing
TSI Terascale Supernova Initiative
UDP User Datagram Protocol
UDT UDP-based Data Transfer protocol
XCP eXplicit Control Protocol


    Chapter 1

    INTRODUCTION

Many fields of research require significant computing resources to conduct simulations and/or to analyze large amounts of data. Large data sets collected by remote instruments may need to be processed. The SETI@home project [2], which uses data collected by the National Astronomy and Ionosphere Center's radio telescope in Arecibo, Puerto Rico, is one such example. The telescope generates about 35 GB of data per day, which is stored on removable tapes and physically transported to the server in Berkeley, California. In some cases, computations generate massive amounts of output that must be distributed to scientists who are physically distant from the computation resource. For instance, the Terascale Supernova Initiative (TSI) project involves simulations run on supercomputers at the Oak Ridge National Laboratory (ORNL), the results of which are used by physicists at remote sites such as North Carolina State University (NCSU).

Networks connecting the data generation point, the computation resource, and the scientists' workplaces make collaborative e-science much more practical. The large amounts of data involved and, in some cases (e.g., real-time visualization), stringent delay/jitter requirements make it necessary to use networks with large bandwidths and deterministic behavior. E-science applications require high, constantly available bandwidth for their data transfer needs. It is difficult to provide such rate-guaranteed services in packet-switched, connectionless networks such as the present-day Internet, because a large number of simultaneous flows may compete for the available network capacity. Therefore, the use of connection-oriented, dedicated circuits has been proposed as a solution. Many research groups are implementing testbeds and the supporting software to show the feasibility of such a solution.

The problem addressed in this thesis is the design of a transport protocol for dedicated circuits. Many of the assumptions on which traditional transport protocols for packet-switched networks are based need to be reexamined. For instance, the possibility of losses due to network buffer overflows makes congestion control an important function on connectionless networks. On connection-oriented networks, because network resources are reserved for each data transfer, the end points of the transfer have more control over whether or not network buffers will overflow. By maintaining a data transfer rate that is matched to the reserved circuit's rate, the need for congestion control can be eliminated. On the other hand, a transport layer function such as flow control is needed on both connectionless and connection-oriented networks, because it addresses a problem that network resource reservation does not solve.

Our approach is to design the transport protocol under the assumption that resources are reserved for a data transfer's exclusive use. The transport protocol should not have any features that leave the reserved circuit idle. We implemented the transport protocol and tested it on a wide-area, connection-oriented network testbed. This protocol is called Circuit-TCP (C-TCP).

The rest of this thesis is organized as follows. Chapter 2 provides background information on previous work in this area, as well as issues that affect the design and performance of our transport protocol. In Chapter 3, we describe the Fixed Rate Transport Protocol (FRTP), which was implemented in user space over UDP. The shortcomings of a user-space implementation are pointed out. Chapter 4 describes the design and implementation of C-TCP, our kernel-space transport protocol based on TCP. Experimental results over a testbed are used to compare C-TCP with TCP over dedicated circuits. In Chapter 5, the control-plane issues of determining the circuit rate and then setting up the circuit are considered. The conclusions of this work are presented in Chapter 6.


    Chapter 2

    BACKGROUND

In this chapter we first look at other work that has been done in the development of transport protocols for high-performance networks. Next we point out some of the factors that play a significant role in achieving high throughput on dedicated circuits. Many of these are end-host issues that we discovered while implementing our transport protocol. This work has been conducted as a part of the Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH) project. An overview of CHEETAH is presented at the end of this chapter.

    2.1 Related Work

There has been significant activity in developing transport protocols suitable for high-bandwidth and/or high-delay networks. Even though very little of this work is focused explicitly on dedicated circuits, there is enough overlap in the problems to justify a closer examination. High-performance protocols can be classified as TCP enhancements, UDP-based protocols, and novel protocols. Ease of deployment and familiarity with the sockets API to the TCP and UDP protocol stacks are reasons for the popularity of TCP- and UDP-based solutions.

    2.1.1 TCP Enhancements

TCP is the most widely used reliable transport protocol on connectionless, packet-switched networks. We describe basic TCP operation in Chapter 4. It is designed to work under a wide range of conditions, and this makes a few of its design decisions non-optimal for high-speed networks. In recent years a number of protocol extensions to TCP have been proposed and implemented to address this issue. Selective acknowledgements (SACKs) [27, 16] have been proposed to deal more efficiently with multiple losses in a round trip time (RTT) [13]. TCP uses cumulative acknowledgements (ACKs), which means a data byte is not ACKed unless all data earlier in the sequence space has been received successfully. SACKs inform the sender about out-of-sequence data already received and help prevent unnecessary retransmissions. Two protocol extensions, the timestamps option and window scaling, were proposed in [22]. The timestamps option field in a data packet's TCP header is filled in by the sender and echoed back in the corresponding ACK. It serves two purposes. First, the timestamp can be used to estimate the round trip time more accurately and more often. This gives the sender a better value for retransmission timeout (RTO) computation. Second, the timestamp in a received packet can be used to prevent sequence number wraparound. The TCP header has a 16-bit field for the window size, which limits the window size to 64 KB. This is insufficient for high-bandwidth, high-delay networks. The window scaling option allows a scaling factor to be chosen during connection establishment. Subsequent window advertisements are right-shifted by the selected scaling factor. Scaling factors of up to 14 are allowed; thus, by using this option, a window size of up to 1 GB can be advertised.
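An application does not choose the scaling factor directly. On Linux, for example, it simply requests large socket buffers before connecting, and the kernel negotiates an appropriate scale during the handshake. The following sketch is our illustration (not from the thesis); the 4 MB figure is an assumed bandwidth-delay product:

#include <netinet/in.h>
#include <sys/socket.h>

/* Request large socket buffers; the kernel then advertises a TCP window
 * larger than 64 KB and negotiates window scaling on connect()/accept(),
 * assuming the extension is enabled (net.ipv4.tcp_window_scaling). */
int make_bulk_socket(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int bufsize = 4 * 1024 * 1024;  /* assumed: sized to the path's BDP */
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
    return sock;
}

The buffers must be requested before the connection is established, because the scale factor is fixed during the SYN exchange.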

Standard TCP (also called Reno TCP) has been found wanting in high-bandwidth, high-delay environments, mainly due to its congestion control algorithm. TCP's Additive Increase Multiplicative Decrease (AIMD) algorithm is considered too slow in utilizing available capacity and too drastic in cutting back when network congestion is inferred. Modifications to the TCP congestion control algorithm have led to the development of HighSpeed TCP [14], Scalable TCP [25], FAST [23], and BIC-TCP [39], among others. Standard TCP requires unrealistically low loss rates to achieve high throughputs. HighSpeed TCP is a proposed change to the TCP AIMD parameters that allows a TCP connection to achieve high sending rates under more realistic loss conditions. Scalable TCP also proposes modified AIMD parameters that speed up TCP's recovery from loss. FAST infers network congestion and adjusts its window size based on queueing delays rather than loss. BIC-TCP (BIC stands for Binary Increase Congestion control) is a new congestion control algorithm that scales well to high bandwidth (i.e., it can achieve a high throughput at reasonable packet loss rates) and is TCP-friendly (i.e., when the loss rate is high its performance is the same as standard TCP's). In addition, unlike HighSpeed or Scalable TCP, BIC-TCP's congestion control is designed such that two flows with different RTTs share the available bandwidth in a reasonably fair manner.

    2.1.2 UDP-based Protocols

To overcome the shortcomings of TCP, many researchers have implemented protocols over UDP by adding required functionality, such as reliability, in user space. The most common model is to use UDP for the data transfer and a separate TCP or UDP channel for control traffic. SABUL [18], Tsunami, Hurricane [38], and RBUDP [20] use a TCP control channel, while UDT [19] uses UDP for both data and control channels. The advantage of these solutions is that their user-space implementation makes deployment easy. At the same time, some limitations arise because these protocols are implemented in user space. In Chapter 3, we describe SABUL. Our attempt at modifying SABUL to implement a transport protocol for dedicated circuits and the shortcomings of a user-space transport protocol implementation are also pointed out.

    2.1.3 Novel Protocols

Some novel protocols designed exclusively for high-performance data transfer have also been proposed. The eXplicit Control Protocol (XCP) [24] was proposed to solve TCP's stability and efficiency problems. By separating link utilization control from fairness control, XCP is able to make more efficient use of network resources in a fair manner. XCP's requirement of multi-bit congestion signals from the network makes it harder to deploy, since routers in the network need to be modified. NETBLT [10] was proposed for high-speed bulk data transfer. It provides reliable data transfer by sending blocks of data in a lock-step manner. This degrades bandwidth utilization while the sender awaits an acknowledgement (ACK) for each block.


    2.2 End-host Factors that Affect Data Transfer Performance

Setting up a dedicated circuit involves resource reservation in the network. Depending on the network composition, the resources reserved could be wavelengths, ports on a switch, or time slots. Ideally, we would like to fully use the reserved resources for exactly the time required to complete the transfer. During the implementation of our transport protocol, we found that many factors make it hard to achieve this ideal. In this section we list a few of these factors that impact the performance of transport protocol implementations.

    2.2.1 Memory and I/O bus usage

First, consider an application that uses the transport protocol to carry out a file transfer. At the sending end, the application has to

1. Read data from the disk, e.g., by invoking a read system call.

2. Send the data out on the network, e.g., by invoking a send system call.

There are two types of overhead in carrying out these operations. First, the system calls involve the overhead of saving the process registers on the stack before the system call handler is invoked. Second, the two steps shown above could involve multiple passes over the memory and I/O bus. This is illustrated in Figure 2.1(a). The figure shows the bus operations involved in moving data from the disk to user-space buffers (step 1 above), and from the user-space buffer to kernel network buffers (part of step 2). To avoid having to access the disk each time, for multiple accesses to a chunk of data, the operating system caches recently accessed disk data in memory. This cache is called the page cache, and direct memory access (DMA) is used for transfers between the page cache and the disk (operation I in Figure 2.1(a)). Two passes over the memory bus are needed to transfer the data from the page cache to the user-space buffer (operation II). To send data out to the network, it is again copied from the user-space buffer to kernel network buffers (operation III). We do not show the transfer from the kernel network buffer to the NIC, which is the final step in getting data out into the network.


Figure 2.1: Memory I/O bus usage. (a) Using read and send; (b) using sendfile.

For data transfers using TCP sockets on Linux, the sendfile system call can be used to cut down the number of passes over the memory bus to three. As shown in Figure 2.1(b), sendfile copies data directly from the page cache to the kernel network buffers, thus avoiding the copy to user space and back. In addition, sendfile needs to be invoked just once for a single file, so the overhead of making a system call is paid only once per file.
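As a concrete sketch of the two paths (ours, not the thesis's code; error handling omitted), where in_fd is an open file and sock a connected TCP socket:

#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Path (a): read() + send(): each chunk crosses the memory bus twice
 * more than necessary (page cache -> user buffer -> kernel buffer). */
void send_with_read(int in_fd, int sock)
{
    char buf[65536];
    ssize_t n;
    while ((n = read(in_fd, buf, sizeof(buf))) > 0)
        write(sock, buf, (size_t)n);
}

/* Path (b): sendfile(): one system call per file; the copy to user
 * space and back is skipped entirely. */
void send_with_sendfile(int in_fd, int sock, size_t file_size)
{
    off_t offset = 0;
    sendfile(sock, in_fd, &offset, file_size);
}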

    2.2.1.1 Zero-copy Networking

Other methods for avoiding the copy from user-space memory to kernel-space memory have been proposed. Such methods are known by the common term zero-copy networking. For a classification of zero-copy schemes see [7]. The zero in zero-copy networking indicates that there is no memory-to-memory copy involved in the transfer of data between a user-space buffer and the network. So, in Figure 2.1(a), a zero-copy scheme would eliminate memory-to-memory copies after operation II. How the data got into the user- or kernel-space buffer in the first place, and whether that required a copy, is not considered. Zero-copy schemes can be supported if an application interacts directly with the NIC without passing through the kernel, or if buffers are shared between user and kernel space rather than being copied. For an application to directly read from and write to the NIC buffer, protocol processing has to be done on the NIC. At the sender, buffers can be shared between the application and the kernel if the application can guarantee that a buffer that has not yet been transmitted will not be overwritten. One way to ensure this would be for the system call invoked to send some data to return only after all of that data has been successfully transmitted. Since a reliable transport protocol can consider a buffer to have been successfully transmitted only when all of the data in that buffer has successfully reached the intended receiver, the application may need to wait a while before it can reuse a buffer. An interesting alternative is to mark the buffer as copy-on-write (COW), so that the contents of the buffer are copied to a separate buffer if and when the application tries to overwrite it. Implementations of send-side zero-copy schemes for different operating systems are described in [28].

Now consider the steps at a receiver. A receiver performs the steps shown in Figure 2.1(a) in reverse order (there is no sendfile equivalent for the receiver). One way to implement zero-copy on the receiver is to change the page table of the application process when it issues a recv system call. This is called page flipping in [28]. Page flipping works only if the NIC separates the packet payload and header, if the packet payload is an exact multiple of the page size, and if the buffer provided by the application is aligned to page boundaries. Because of these requirements there has been little effort to implement such a scheme.

Several factors that influence communication overhead are presented in [33]. The memory and I/O bus usage for schemes with different kernel and interface hardware support are compared. For instance, the author shows how, by using DMA, NIC buffering and checksum offload, the number of passes over the bus can be reduced from six to one.

    2.2.2 Protocol Overhead

Apart from the memory and I/O bus, the other main end-host resource that could become a bottleneck is processor cycles. TCP/IP, being the most widely used protocol stack, has received attention in this regard. In [9] the processing overhead of TCP/IP is estimated, and the authors' conclusion is that, with a proper implementation, TCP/IP can sustain high throughputs efficiently. More recent work presented in [17] takes into consideration the OS and hardware support that a TCP implementation requires.

The overhead of a transport layer protocol can be divided into two categories: per-packet costs and per-byte costs [9, 28, 6]. Per-packet costs include protocol processing (e.g., processing the sequence numbers on each packet in TCP) and interrupt processing. Per-byte costs are incurred when data is copied or during checksum calculation.

Per-packet overhead can be reduced by reducing the number of packets handled during the transfer. For a given transfer size, the number of packets can be reduced by using larger packets. The maximum transmission unit (MTU) of the network constrains the packet size that an end host can use. For instance, Ethernet imposes a 1500-byte limit on the IP datagram size. The concept of jumbo frames was introduced by Alteon Networks to allow Ethernet frames of up to 9000 bytes, and many gigabit Ethernet NICs now support such larger frame sizes. Larger packet sizes can decrease protocol processing overhead as well as the overhead of interrupt processing. NICs interrupt the processor on frame transmission and reception. An interrupt is costly for the processor because the state of the currently running process has to be saved and an interrupt handler invoked to deal with the interrupt. As interface rates increase to 1 Gbps and higher, interrupt overhead can become significant. Many high-speed NICs support interrupt coalescing, so that the processor is interrupted for a group of transmitted or received packets rather than for each individual packet.
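For illustration (our sketch, not from the thesis), a privileged program can request a jumbo-frame MTU on a capable interface through the standard SIOCSIFMTU ioctl; the interface name in the usage comment is a placeholder:

#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set the interface MTU (requires root, plus NIC/driver/switch support
 * for jumbo frames end to end). Returns 0 on success, -1 on failure. */
int set_mtu(const char *ifname, int mtu)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket fd works for this ioctl */
    if (fd < 0)
        return -1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;                        /* e.g., set_mtu("eth1", 9000) */
    int rc = ioctl(fd, SIOCSIFMTU, &ifr);
    close(fd);
    return rc;
}

A 9000-byte frame only helps if every device on the path accepts it; otherwise packets are dropped or fragmented at the IP layer.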

Schemes to reduce the per-byte costs involved in copying data over the memory I/O bus were described in Section 2.2.1. Checksum calculation can be combined with a copy operation and carried out efficiently in software. For instance, the sender could calculate the checksum while data is being copied from the user-space buffer to the kernel-space buffer. Another way to reduce the processor's checksum calculation burden is to offload it to the interface card.
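The combined copy-and-checksum can be sketched in a few lines. This is a generic illustration of the technique (not the kernel's actual routine), restricted to even-length buffers for brevity:

#include <stddef.h>
#include <stdint.h>

/* Copy len bytes (len assumed even) while accumulating the Internet
 * checksum, so the data crosses the memory bus once instead of twice. */
uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len / 2; i++) {
        dst[i] = src[i];   /* the copy */
        sum += src[i];     /* the checksum, in the same pass */
    }
    while (sum >> 16)      /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum; /* one's-complement result */
}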

    2.2.3 Disk Access

All the factors considered so far affect data transfer throughput. In designing a transport protocol for dedicated circuits, not only does a high throughput have to be maintained, but the circuit utilization should also be high. Thus end-host factors that cause variability in the throughput also need to be considered. For disk-to-disk data transfers, disk access can limit throughput as well as cause variability. The file system used can have an effect on disk access performance. The time to physically move the disk read/write head to the area on the hard disk where the desired data resides, called seek time, is a major component of the disk access latency. File accesses tend to be sequential, so a file system that tries to keep all parts of a file clustered together on the hard disk will perform better than one in which a file is broken up into small pieces spread all over the hard disk.

At the sender, data needs to be read from the disk into memory. System calls to do this are synchronous: when the system call returns successfully, the requested data is available in memory for immediate use. Operating systems try to improve the efficiency of disk reads by reading in more than the requested amount, so that some subsequent requests can be satisfied without involving the disk hardware. At the data receiver, the system call to write to disk is asynchronous by default. This means that when the system call returns, it is not guaranteed that the data has been written to disk; it may just be buffered in memory. Asynchronous writes are tailored to make the common case of small, random writes efficient, since they allow the operating system to schedule disk writes in an efficient manner; for example, it can reorder the writes to minimize seeks. In Linux, for instance, data written to disk is actually copied to memory buffers in the page cache, and these buffers are marked dirty. Two kernel threads, bdflush and kupdate, are responsible for flushing dirty buffers to disk. The bdflush kernel thread is activated when the number of dirty buffers exceeds a threshold, and kupdate is activated whenever a buffer has remained dirty for too long. As a consequence of this kernel caching and delayed synchronization between memory buffers and the disk, there can be significant variability in the conditions under which a disk write system call operates.
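For measurement or pacing purposes, the delayed-write machinery can be bypassed explicitly with standard POSIX calls; a minimal sketch (ours, not from the thesis):

#include <fcntl.h>
#include <unistd.h>

/* Variant 1: write through the page cache, then block until the data
 * is actually on the disk. */
ssize_t write_and_flush(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);  /* returns once data is in the page cache */
    if (n > 0)
        fsync(fd);                    /* blocks until dirty buffers are flushed */
    return n;
}

/* Variant 2: open with O_SYNC so every write() completes synchronously. */
int open_sync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
}

Synchronous writes make the receiver's true disk rate observable, at the cost of exposing every seek and rotational delay to the application.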

    2.2.4 Process scheduling

The final factor we consider is the effect of the process scheduler. All modern operating systems are multitasking. Processes run on the processor for short intervals of time and then either relinquish the CPU voluntarily (e.g., if they block waiting for I/O) or are forcibly evicted by the operating system if their time slot runs out. This gives users the impression that multiple processes are running simultaneously. Multitasking, like packet-switched networking, tries to fairly divide up a resource (processor cycles for multitasking; bandwidth for packet-switched networking) among all contenders (multiple processes; multiple flows) for the resource. This behavior is at odds with resource reservation in a connection-oriented network such as CHEETAH. If the degree of multitasking at an end host is high, then a data transfer application may not get the processor cycles required to fully use the reserved circuit. Even if the required number of free cycles is available, the process scheduler might not be able to schedule the data transfer application in the monotonic fashion required to send and receive data at the fixed circuit rate.
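One partial mitigation, which the thesis does not discuss, is to run the transfer process under a real-time scheduling policy so that the scheduler runs it ahead of ordinary time-shared processes whenever it is runnable; a sketch using the POSIX API on Linux (priority 50 is an arbitrary choice, and root privileges are required):

#include <sched.h>
#include <stdio.h>

/* Switch the calling process to the SCHED_FIFO real-time class. */
int make_realtime(void)
{
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}

This reduces, but does not eliminate, scheduling-induced gaps: interrupt handling and other real-time tasks can still delay the sender.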

    2.3 CHEETAH Network

CHEETAH, which stands for Circuit-switched High-speed End-to-End Transport ArcHitecture, is a network architecture that has been proposed [37] to provide high-speed, end-to-end connectivity on a call-by-call basis. Since the transport protocol proposed in this thesis is to be used over a dedicated circuit through a CHEETAH network, in this section we provide a description of CHEETAH.

    2.3.1 Components of CHEETAH

Many applications in the scientific computing domain require high-throughput transfers with deterministic behavior. A circuit-switched path through the network can meet such requirements better than a packet-switched path. CHEETAH aims to bring the benefits of a dedicated circuit to an end user. In order to allow wide deployment, CHEETAH has been designed to build on existing network infrastructure instead of requiring radical changes. Ethernet and SONET (Synchronous Optical Network) are the most widely used technologies in local area networks (LANs) and wide area networks (WANs), respectively. To take advantage of this, a CHEETAH end-to-end path consists of Ethernet links at the edges and Ethernet-over-SONET links in the core. Multi-Service Provisioning Platforms (MSPPs) are hardware devices that make such end-to-end paths possible. MSPPs are capable of mapping between the packet-switched Ethernet domain and the time division multiplexed (TDM) SONET domain. MSPPs are an important component of the CHEETAH architecture for three reasons.

1. The end hosts can use common Ethernet NICs and do not need, for instance, SONET line cards.

2. Many enterprises already have MSPPs deployed to connect to their ISP's backbone network.

3. Standard signaling protocols, as defined for Generalized Multiprotocol Label Switching (GMPLS) networks, are (being) implemented in MSPPs. This is essential to support dynamic call-by-call sharing in a CHEETAH network.

    2.3.2 Features of a CHEETAH Network

One of the salient features of CHEETAH is that it is an add-on service to the existing packet-switched service through the Internet. This means, firstly, that applications requiring the CHEETAH service can co-exist with applications for which a path through the packet-switched Internet is good enough. Secondly, because network resources are finite, it is possible that an application's request for a dedicated circuit is rejected; in such cases, the Internet path provides an alternative so that the application's data transfer does not fail. To realize this feature, end hosts are equipped with an additional NIC that is used exclusively for data transfer over a CHEETAH circuit.

To make the CHEETAH architecture scalable, the network resource reservation necessary to set up an end-to-end circuit should be done in a distributed and dynamic manner. Standardized signaling protocols that operate in a distributed manner, such as the hop-by-hop signaling in GMPLS protocols, are key to achieving scalability. CHEETAH uses RSVP-TE¹ signaling in the control plane. Dynamic circuit set-up and tear-down means that these operations are performed when (and only when) required, as opposed to statically provisioning a circuit for a long period of time. Dynamic operation is essential for scalability because it allows the resources to be better utilized, thus driving down costs. End-host applications that want to use a CHEETAH circuit are best placed to decide when the circuit should be set up or torn down. Therefore an end host connected to the CHEETAH network runs signaling software that applications can use to attempt circuit set-up on a call-by-call basis.

With end-host signaling in place, applications that want to use a CHEETAH circuit can do so in a dynamic manner. This leads to the question of whether, just because it can be done, a circuit set-up should be attempted for a given data transfer. In [37], analytical arguments are used to show that, for data transfers above a threshold size, transfer delay can be reduced by using a CHEETAH circuit rather than an Internet path. It is also worth noting that there are situations in which the overhead of circuit set-up makes it advantageous to use a path through the Internet, although for wide-area bulk data transfer a dedicated circuit invariably trumps an Internet path.

¹ Resource Reservation Protocol-Traffic Engineering. This is the signaling component of the GMPLS protocols, the other components being the Link Management Protocol (LMP) and Open Shortest Path First-TE (OSPF-TE).

    2.3.3 The CHEETAH Testbed

To study the feasibility of the CHEETAH concept, an experimental testbed has been set up. This testbed extends between North Carolina State University (NCSU) in Raleigh, NC, and Oak Ridge National Laboratory (ORNL) in Oak Ridge, TN, and passes through the MCNC point-of-presence (PoP) in Research Triangle Park, NC, and the Southern Crossroads/Southern LambdaRail (SOX/SLR) PoP in Atlanta, GA. The testbed layout is shown in Figure 2.2. In this testbed, the Sycamore SN16000 Intelligent Optical Switch is used as the MSPP. In the figure we show end hosts connected directly, or through Ethernet switches, to the gigabit Ethernet card on the SN16000. The cross-connect card is configured through the control card to set up a circuit. The SN16000 has an implementation of the GMPLS signaling protocol that follows the standard and has been tested for interoperability.

    "

    "

    ' ( 0 3

    ( 5 7 9

    ' A

    B

    ' 0 5 F G G G

    3 H

    ( P 0 Q 3 0

    "

    "

    ' ( 0 3

    ( 5 7 9

    ' A

    B

    ' 0 5 F G G G

    3 H

    `

    `

    "

    "

    ' ( 0 3

    ( 5 7 9

    ' A

    B

    ' 0 5 F G G G

    3 H

    a 0 0

    d

    H

    S

    d

    H

    3 H

    0 ' i 0

    Figure 2.2: CHEETAH experimental testbed


The testbed has been designed to support the networking needs of the TSI project mentioned at the beginning of this chapter. We present results of experiments conducted over this testbed in Chapter 4.

    2.3.4 End-host Software Support for CHEETAH

To allow applications to start using CHEETAH circuits, software support is required to make the end hosts CHEETAH-enabled. The architecture of the end-host software is shown in Figure 2.3. The relevant components of the CHEETAH end-host software are shown inside a dotted box to signify that the application could either interact with each component individually or make higher-level calls that hide the details of the components being invoked. To be able to use a CHEETAH circuit between two end hosts, both must support CHEETAH.

The Optical Connectivity Service (OCS) client allows applications to query whether a remote host is on the CHEETAH network. OCS uses the Internet's Domain Name Service (DNS) to provide additional information, such as the IP address of the remote end's secondary NIC. As mentioned earlier, depending on the situation, either a CHEETAH circuit or a path through the Internet may be better for a particular transfer. The routing decision module takes measurements of relevant network parameters (e.g., available bandwidth and average loss rate) and uses these, along with the parameters of a particular transfer (e.g., the file size and requested circuit rate), to decide whether or not a CHEETAH circuit set-up should be attempted. To achieve CHEETAH's goal of distributed circuit set-up, an RSVP-TE signaling module runs on each end host. The RSVP-TE module exchanges control messages with the enterprise MSPP to set up and tear down circuits. These control messages are routed through the primary NIC over the Internet. The final software component is the transport protocol module. Depending on whether a circuit or an Internet path is being used, the transport protocol used will be C-TCP or TCP. In this thesis the focus will be on the design, implementation and evaluation of C-TCP.

Figure 2.3: Architecture of CHEETAH end-host software

To end this chapter, we mention some of the other projects focused on connection-oriented networking for e-science. UltraScience Net [36] is a Department of Energy sponsored research testbed connecting Atlanta, Chicago, Seattle and Sunnyvale. This network uses a centralized scheme for the control-plane functions. Another effort is the Dynamic Resource Allocation via GMPLS Optical Networks (DRAGON) project [12]. DRAGON uses GMPLS protocols to support dynamic bandwidth provisioning.


    Chapter 3

    UDP-BASED TRANSPORT PROTOCOL

In Chapter 2 we mentioned a few protocols that are based on UDP. There are good reasons for taking this approach:

• UDP provides the minimal functionality of a transport protocol. It transfers datagrams between two processes but makes no guarantees about their delivery. UDP's minimalism leaves no scope for anything to be taken out of its implementation. Thus a new protocol built over UDP has to add extra (and only the required) functionality. The significance of this is that these additions can be done in user space, without requiring changes to the operating system code. This makes UDP-based solutions as easy to use and portable as an application program.

• The sockets API to the UDP and TCP kernel code is widely deployed and used. This makes implementation easier and faster.

The basic design of all UDP-based protocols is similar and is shown in Figure 3.1. Data packets are transferred using UDP sockets. A separate TCP or UDP channel is used to carry control packets. Control packets serve to add features to the data transfer that are not provided by UDP's best-effort service. We used the Simple Available Bandwidth Utilization Library (SABUL), a UDP-based data transfer application, to implement the Fixed Rate Transport Protocol (FRTP). In this chapter we first present an overview of the SABUL protocol and implementation. Then we describe the changes that we made to SABUL to implement FRTP, as well as the advantages and shortcomings of this approach.

Figure 3.1: Architecture of a generic UDP-based protocol
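As an illustrative sketch of this split-channel design (ours, not SABUL's actual code), a receiver might bind a UDP socket for data and connect a TCP socket to the sender for control; the address and ports are placeholders:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int data_sock = socket(AF_INET, SOCK_DGRAM, 0);   /* data channel */
    int ctrl_sock = socket(AF_INET, SOCK_STREAM, 0);  /* control channel */

    struct sockaddr_in data_addr = {0}, sender = {0};
    data_addr.sin_family = AF_INET;
    data_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    data_addr.sin_port = htons(9001);                 /* UDP data port (placeholder) */
    bind(data_sock, (struct sockaddr *)&data_addr, sizeof(data_addr));

    sender.sin_family = AF_INET;
    sender.sin_port = htons(9000);                    /* sender's control port (placeholder) */
    inet_pton(AF_INET, "192.0.2.1", &sender.sin_addr);
    connect(ctrl_sock, (struct sockaddr *)&sender, sizeof(sender));

    /* From here: tell the sender the UDP port over ctrl_sock, then loop,
     * receiving data on data_sock and sending ACK/NAK/SYN control
     * packets on ctrl_sock. */
    return 0;
}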

    3.1 SABUL Overview

SABUL is designed for bulk data transfers over high-bandwidth networks. SABUL's architecture is the same as that shown in Figure 3.1. TCP is used for control packet transmission from the data receiver to the data sender. SABUL adds reliability, rate-based congestion control and flow control to UDP's basic data transfer service.

Providing end-to-end reliability, i.e., guaranteeing that all the data sent is received in the same order and without duplicates, is a function of the transport layer. SABUL implements the following error control scheme for reliable transfer. It adds a sequence number to each UDP data packet. The receiver detects packet loss using the sequence numbers of the received packets. On inferring loss, the receiver immediately sends a negative-acknowledgement (NAK) control packet to convey this information to the sender. The sender then recovers from the error by retransmitting the lost packet(s). The receiver maintains an ERR timer to periodically send NAKs if there are missing packets; this provides protection against lost retransmissions. For file transfers, reading data from the disk for each retransmission is very expensive in time. Therefore, the sender keeps the transmitted data in memory until it is acknowledged. A SABUL receiver periodically sends an acknowledgement (ACK) control packet, acknowledging all packets received in order. On receiving an ACK, the sender can free the buffer space occupied by data that is confirmed to have been received. In addition, the SABUL sender has a timer that is reset each time a control packet is received. If this timer (called the EXP timer) expires because no control information has been received, the sender assumes that all unacknowledged packets have been lost and retransmits them.

SABUL uses a rate-based congestion control scheme: the sender modifies the sending rate depending on the degree of congestion in the network. The SABUL receiver sends a periodic synchronization (SYN) control packet containing the number of data packets received in the previous SYN period. The sender uses this information to estimate the amount of loss, and hence the congestion, in the network. Depending on whether the loss is above or below a threshold, the sending rate is reduced or increased, respectively. The sending rate is modified by changing the inter-packet gap.

SABUL is a user-space implementation, which means a SABUL receiver cannot distinguish between loss due to network congestion and loss due to its receive buffer (the kernel UDP buffer) overflowing. The information in SYN packets represents both types of loss; therefore, SABUL's rate-based congestion control also serves as a reactive flow control strategy. In addition, a fixed window is used to limit the amount of unacknowledged data in the network.

    3.1.1 SABUL Implementation

The SABUL implementation is described next. It is important to separate the SABUL transport protocol from an application that uses it. In the description below we refer to an application using SABUL as the sending application or receiving application. The sending application generates the data that is to be transferred using SABUL, for example by reading it from a file on the hard disk. The receiving application, likewise, consumes the data transferred using SABUL. SABUL is implemented in C++. The sending application invokes a SABUL method to put data into the protocol buffer. SABUL manages the protocol buffer and transmits or retransmits data packets from it. Two threads are used. One handles the interface with the sending application, mainly the filling of the protocol buffer. The other thread is responsible for sending out data packets. The sequence numbers of packets that need to be retransmitted are recorded in a loss list. Pseudocode for the sender-side functionality is shown below:

INITIALIZATION:
    Create TCP socket on well-known port number
    Listen for a connection
    Accept connection from client
    Get the UDP port number on which the receiver is expecting data
    Calculate the inter-packet gap required to maintain the desired sending rate
    Fork a new thread to handle the data transmission

DATA TRANSMISSION:
    WHILE data transfer is not over
        WHILE protocol buffer is empty AND data transfer is not over
            Wait for data from the sending application
        ENDWHILE
        Poll control channel for control packets
        IF control packet received THEN
            Process control packet  /* See below */
        ENDIF
        IF loss list is not empty THEN
            Remove first packet from the loss list
        ELSE
            Form a new packet
        ENDIF
        Send the data packet by invoking the send() system call on the UDP socket
        Wait till it is time to send the next packet
    ENDWHILE

CONTROL PACKET PROCESSING:
    IF ACK packet THEN
        Release buffer space held by the acknowledged packet(s)
        Update loss list
        Inform sending application of availability of buffer space
    ELSE IF NAK packet THEN
        Update loss list
    ELSE IF SYN packet THEN
        Adjust sending rate
    ENDIF

Two threads are used at the receiver too. One thread (call it the network thread) is responsible for receiving data packets, writing the data into the protocol buffer and sending control packets. The other thread (the main thread) handles the interface with the receiving application, transferring data from the protocol buffer to the application buffer. SABUL uses an optimization when the receiving application asks to read more data than the protocol buffer holds. The main thread sets a flag indicating such a situation. On seeing this flag, the network thread copies all available data into the application buffer and resets the flag. As the rest of the data requested by the receiving application arrives, it is copied directly into the application buffer, saving a memory copy. The receiver-side pseudocode follows.

INITIALIZATION:
    Create TCP and UDP sockets
    Connect to the sender
    Inform the sender of the UDP port number
    Fork a new thread to receive data

RECEIVING DATA:
    WHILE all the data has not been received
        IF receiving application is waiting for data THEN
            Copy all ACKed data from protocol buffer to application buffer
        ENDIF
        IF ACK timer expired THEN
            Send ACK packet
        ENDIF
        IF ERR timer expired THEN
            Send NAK packet with sequence numbers of missing packets
        ENDIF
        IF SYN timer expired THEN
            Send SYN packet with number of packets received in previous SYN interval
        ENDIF
        Get the address into which to receive the next expected data packet
        Receive a data packet on the UDP socket
        IF missing packets THEN
            Add missing packets' sequence numbers to loss list
            Send an immediate NAK packet
        ENDIF
        Update state variables like next expected sequence number and ACK sequence number
        Update loss list
    ENDWHILE

3.2 Modifications to SABUL: FRTP

Our initial idea for a transport protocol that can be used over dedicated circuits was that, since bandwidth is reserved, the data should simply be streamed across at the circuit rate. Transmitting at a rate lower than the reserved circuit rate would leave bandwidth unutilized. Transmitting at a higher rate would eventually lead to a buffer filling up and overflowing. Therefore we wanted a transport protocol that could monotonically send data packets at a fixed rate. SABUL seemed like a perfect match, since it can maintain a fixed sending rate if its rate-based congestion control is disabled. FRTP, our transport protocol for dedicated circuits, could be implemented just like SABUL, except that the rate-altering congestion control would be stripped out.

The first modification to the SABUL code was to remove the rate-based congestion control that modified the sending rate. Second, we added support for using separate NICs for the data and control channels. This was in line with the CHEETAH concept of having two NICs on CHEETAH-enabled hosts. SABUL (and hence FRTP) has many parameters that can be tweaked to improve its performance. The application, protocol and UDP buffer sizes can be changed. The values of the different timers that SABUL uses are also available for adjustment. We ran experiments in a laboratory setting [40] to determine the effect of some of these parameters on FRTP performance, and possibly to determine their optimal values. Although we failed to determine a set of optimal values for the parameters, these experiments did reveal some of the flawed assumptions we were making.

    3.2.1 Problems with the FRTP Implementation

We observed that even though FRTP was set up to send at a fixed rate, the throughput achieved (amount of data transferred divided by transfer time) was lower than the sending rate. This difference grew as the sending rate was increased. We found that the reasons for this discrepancy were twofold. First, the FRTP implementation was not able to maintain a monotonic sending rate. Second, even when the sender was able to maintain a constant sending rate, the receiving application could not empty the buffers at the same (or a higher) rate. This led to receiver buffer overflow and retransmissions, which reduced the throughput.

FRTP implements a fixed sending rate by maintaining a fixed inter-packet gap. For instance,

if 1500-byte packets are being used, a 1 Gbps sending rate can be maintained by ensuring that the

gap between successive transmitted packets is 12 µs (= 1500 bytes / 1 Gbps). Commodity operating

systems do not provide straightforward methods (if at all) to measure such small intervals of time,

and certainly do not provide a method to reliably schedule a periodic action at such a fine granularity. For instance, the timer tick granularity available to user-space processes in Linux is 10 ms. To

overcome this, FRTP uses busy waiting to pass the time between packet transmissions. If the

    next packet needs to be sent at time t, FRTP does the following:

WHILE ((current time) < t)
    NOP
ENDWHILE

The rdtsc (read time stamp counter) instruction, provided by Pentium processors, is used to get

an accurate value for the current time. The rdtsc instruction reads the time stamp counter, which is

incremented at every processor tick. NOP is a no-operation instruction. The busy-waiting solution is

wasteful, since the NOPs use up processor cycles that could have been used to accomplish something

more useful. It also does nothing to make the periodic invocation of an event reliable. If the sending

process were the only one running on the processor, the busy-waiting scheme would reliably

perform a periodic action. If a different process is running on the processor at time t, the FRTP sending

process will miss its deadline. In fact, since FRTP itself uses two threads at the sender, the thread

responsible for filling the protocol buffer could interfere with the data-sending thread's busy-waiting-induced periodicity.
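In C, the busy-wait pacing loop looks roughly like the following sketch (our reconstruction for illustration, not the actual SABUL/FRTP source; a real implementation must first calibrate how many TSC ticks correspond to one microsecond):

    #include <stdint.h>

    /* Read the Pentium time stamp counter (TSC). */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Spin until the TSC reaches 'deadline' (in TSC ticks). Every
     * iteration burns cycles other threads could have used, and a
     * context switch inside the loop makes us miss the deadline. */
    static void busy_wait_until(uint64_t deadline)
    {
        while (rdtsc() < deadline)
            ;  /* NOP */
    }

At 1 Gbps with 1500-byte packets, deadline would be advanced by 12 µs worth of TSC ticks after every transmission.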

SABUL's rate adjustment scheme has been removed from FRTP. Therefore FRTP does not have

    even the reactive flow control of SABUL. This is acceptable if we can be sure that flow control is

    not required. The FRTP receiver architecture for a transfer to disk can be represented as shown in

    Figure 3.2. Using the notation introduced in Section 3.1, the network thread handles the transfer

    marked I and the main thread and the receiving application handle II and III respectively. The

    process scheduler has to put the threads on the processor for the transfers to take place. Transfer III

    additionally depends on how long the write to disk takes. These factors introduce variability into

    the receiving rate. Buffers can hide this variability so that even a constant sending rate does not

    cause buffer overflow. For a sending rate S(t) held at a constant value S, a receiving rate R(t) and a

    receive buffer of size B, for no loss to occur:

$$ S\,\tau \;-\; \int_0^{\tau} R(t)\,dt \;\leq\; B \qquad \forall\, \tau \in [0,T] \tag{3.1} $$

where [0, T] is the transfer interval.

[Figure 3.2: Need for receiver flow control. Data flows from the kernel-space UDP buffer to the user-space protocol buffer (transfer I), from the protocol buffer to the application buffer (transfer II), and from the application buffer to disk (transfer III).]

The (false) assumption behind our initial belief that it is enough

    to just stream the data at the reserved circuit rate was that equation (3.1) holds throughout the

    transfer. From our experiments we realized that not only is R(t) varying, we do not even know a

closed-form expression for it, making the choice of S and B to satisfy equation (3.1) difficult. A

pragmatic approach is to assign sensible values to S and B, so that (3.1) is satisfied most of the time.

    When it is not satisfied, there are losses and the error control algorithm will recover from the loss.

    This is what we were seeing in our laboratory experiments (but with S(t) also varying with time).

    A flow control protocol would attempt to ensure that the above equation is satisfied all the time, by

    varying S(t). Unfortunately this implementation of FRTP has no flow control.
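To see why choosing S and B is hard, consider a hypothetical stall (illustrative numbers of our own): suppose S = 1 Gbps and the receiving application is descheduled, so that R(t) = 0, for an interval of Δ = 50 ms. Over that interval the buffer must absorb

$$ S \cdot \Delta \;=\; 10^9~\text{bits/s} \times 0.05~\text{s} \;=\; 5 \times 10^7~\text{bits} \;=\; 6.25~\text{MB}. $$

Since neither the duration nor the frequency of such stalls is known in advance, no fixed B can be proven sufficient.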

    3.2.2 Possible Solutions

Our attempts to solve the two problems we identified with the FRTP implementation (the use of busy

waiting to ensure a steady rate, and the lack of flow control) are described next. The ideal solution

    for maintaining a fixed inter-packet gap would involve transmitting a packet, giving up the processor

    and reclaiming it when it is time to send the next packet. Linux offers a system call to relinquish


the processor. To see why it is not possible to reclaim the processor at a deterministic future time,

it is essential to understand how the Linux scheduler schedules processes to run. Two queues (for

    our purposes only two of the queues are important) are maintained, one of processes that are ready

    to run (the RUNNABLE queue) and the other of processes that are waiting for some condition that

    will make them ready to run (the INTERRUPTIBLE queue). For instance, if a process executes

    instructions to write to disk, it is put in the INTERRUPTIBLE queue. When the write to disk

    completes and the hard drive interrupts the processor the process is put back in the RUNNABLE

    queue. So what happens when, after transmitting a packet, the FRTP sending process gives up the

    CPU? Usually, the system call used to relinquish the processor allows the process to specify a time

    after which it is to be made runnable again. The process is put in the INTERRUPTIBLE queue and

    when the operating system determines that the time for which the process had asked to sleep has

    passed, it is put back in the RUNNABLE queue. The problem arises because the operating system

    uses the timer interrupts (which have a 10 ms period in Linux) to check whether the sleep time has

passed. Therefore, if a process asks to sleep for 1 second, it is guaranteed to become runnable

after a time between 1.0 and 1.01 seconds; but if it asks to sleep for 100 µs, it will become runnable

after some time between 100 µs and 10100 µs. Note that if we give this process the highest priority

then its becoming runnable implies that it runs on the processor, so we ignore the scheduling delay

between a process becoming ready to run and actually running. Thus, on Linux (and other operating

systems that don't support real-time processes) it is not possible for a user-space process to send

packets steadily at a high rate.

    An alternate approach would be to maintain the sending rate, not on a packet-by-packet basis,

    but in a longer time frame. This can be done by ensuring that N packets are sent every T units

of time such that (N/T) is the desired sending rate. This would cause a burst of N packets in the

    network so we would like to keep T as small as possible. In the limit N becomes 1 and we get what

    SABUL attempts to implement. The sending process should get a periodic impulse every T units

    of time and in response send out the N packets. Linux offers user-space processes the ability to

receive such periodic impulses in the form of signals. A process can use the setitimer() system call

    to activate a timer. This timer causes a signal to be sent periodically to the process. We modified the


FRTP code to use periodic signals to maintain the sending rate. This reduced the CPU utilization at

the sender compared to the earlier busy-waiting scheme. But the lack of real-time support on Linux

meant that even if the signals were being sent like clockwork, the user-space process was not always

    able to start sending the next burst of packets immediately. We observed that occasionally some

    signals would be missed because an earlier one was still pending.
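A minimal sketch of such a signal-paced sender follows (our reconstruction, not the actual FRTP code; send_burst() is a hypothetical helper). Given Linux's 10 ms tick, T cannot usefully be set below 10 ms, so sustaining 1 Gbps with 1500-byte packets requires bursts of roughly N = 833 packets:

    #include <signal.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define N 833                       /* packets per 10 ms burst at ~1 Gbps */

    static volatile sig_atomic_t fire;  /* set by the SIGALRM handler */

    static void on_alarm(int sig) { (void)sig; fire = 1; }

    int main(void)
    {
        struct itimerval tv = {
            .it_interval = { 0, 10000 },   /* T = 10 ms, the tick period */
            .it_value    = { 0, 10000 },
        };
        signal(SIGALRM, on_alarm);
        setitimer(ITIMER_REAL, &tv, NULL);

        for (;;) {
            pause();           /* sleep until the next signal arrives */
            if (fire) {
                fire = 0;
                /* send_burst(N): transmit N UDP packets back-to-back.
                 * If this takes longer than T, the next SIGALRM finds
                 * one still pending and is lost; this is the
                 * missed-signal behavior we observed. */
            }
        }
    }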

    We now consider the problem of adding flow control to FRTP. Since flow control is supposed to

    avoid receiver buffer overflow, the data receiver is best placed to provide the information based on

    which the sender can control the flow of data. SABULs sending rate adjustment in response to lost

    packets is a form of flow control that does not use explicit information from the receiver. SABULs

    flow control scheme was not very effective since we observed substantial loss and retransmission.

    To be able to send back buffer status information, the receiver has to have timely access to this in-

formation. Although the FRTP receiver can accurately figure out how much free space is available

    in the protocol and application buffers (see Figure 3.2), it does not have access to the current status

    of the UDP buffer in kernel memory. The kernel does not make any effort to avoid UDP buffer

    overflows. The filling and emptying of a user space buffer are fully in the control of a user space

    process. So if a user space buffer is short on free space, the process can choose not to read in more

    data. With the UDP buffer the kernel has no control over the filling of the buffer since packets arrive

    asynchronously over the network. That is why flow control is necessary to prevent the UDP buffer

    from overflowing. Therefore, any flow control scheme which requires explicit buffer status infor-

    mation from the receiver would need support from the kernel. By choosing to implement FRTP in

    the user space over UDP, we lose the opportunity to implement such a flow control scheme.
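The one knob a user-space receiver does have is the size of that kernel buffer. A sketch using the standard sockets API: enlarging the UDP receive buffer makes overflow less likely, but it is a mitigation rather than flow control, because the kernel still drops datagrams silently once the buffer fills.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Request a larger kernel-space UDP receive buffer. The kernel may
     * clamp the request to a configured maximum (net.core.rmem_max on
     * Linux); a getsockopt() afterwards shows what was actually granted. */
    int set_udp_rcvbuf(int sock, int bytes)
    {
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                       &bytes, sizeof(bytes)) < 0) {
            perror("setsockopt(SO_RCVBUF)");
            return -1;
        }
        return 0;
    }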


    Chapter 4

    TCP-BASED SOLUTION

    In the previous chapter we pointed out the shortcomings of a UDP-based transport protocol that

    were uncovered while implementing FRTP using SABUL. We realized that more support from

    the operating system would be required to better match the behavior of the end hosts with that of

    the network in which resources were reserved. This chapter describes our efforts to implement a

    transport protocol for dedicated circuits that is more closely tied in with the operating system than

the user-space FRTP. Our protocol is based on the TCP implementation in Linux. To reflect this

fact, we call this protocol Circuit-TCP (C-TCP).

    In this chapter, first an overview of TCP is presented. Then we look at the advantages of using

    TCP to implement a transport protocol for dedicated circuits. Next, we present the implementation

    of C-TCP. C-TCP has been tested on the CHEETAH testbed. Results from these experiments and a

discussion of their significance conclude this chapter.

    4.1 Transmission Control Protocol - A Primer

    TCP is the transport protocol of the TCP/IP suite of protocols. It is a connection-oriented protocol

    that provides reliability, distributed congestion control and end-to-end flow control. Note that the

meaning of TCP being a connection-oriented protocol is different from the use of the phrase in a

"connection-oriented network". In order to provide its end-to-end services, TCP maintains state

for each data stream. Thus, TCP creates a connection between two end points wishing to communicate reliably (the end points can be processes on end hosts), maintains state information about

the connection, and disconnects the two end points when they no longer need TCP's service. In

    a connection-oriented network, a connection refers to physical network resources that have been

    reserved, and that taken together form an end-to-end path.

Applications wishing to use TCP's service use the sockets interface that the TCP/IP stack in the

    operating system provides. Two processes that want to use TCP to communicate create sockets and

    then one of the processes connects its socket to the remote socket. A connection is established if

    the connection request is accepted by the remote end. TCP uses a 3-way handshake to establish a

    connection. Connection establishment also initializes all of the state information that TCP requires

    to provide its service. This state is stored in the data structures associated with the sockets on each

end of a connection. We now present brief descriptions of four of TCP's functions. For a more

detailed description, please see [29], [8] and [1].
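As a reminder of this interface, the active side of a connection does something like the following (a minimal sketch; the address 192.0.2.1 and port 5000 are placeholders):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* connect() triggers TCP's three-way handshake; when it returns 0,
     * both ends have initialized their per-connection state (sequence
     * numbers, windows, timers) and data can flow via send()/recv(). */
    int tcp_connect(void)
    {
        struct sockaddr_in peer = { 0 };
        int sock = socket(AF_INET, SOCK_STREAM, 0); /* SOCK_STREAM = TCP */

        peer.sin_family = AF_INET;
        peer.sin_port   = htons(5000);
        inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);

        if (connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0)
            return -1;   /* handshake failed */
        return sock;     /* connection established */
    }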

    4.1.1 Error Control

Each data byte transferred by TCP is assigned a unique sequence number. During connection

    establishment the two ends of a connection exchange starting sequence numbers. The TCP at the

    receiving end maintains information about sequence numbers that have been successfully received,

    the next expected sequence number and so on. The receiver can make use of the sequence numbers

    of received data to infer data reordering with certainty, but not data loss. In fact, neither the TCP

    at the sender nor the one at the receiver can reliably detect packet loss since a packet presumed lost

    could just be delayed in the network. TCP uses acknowledgements (ACKs) of successfully received

    data and a sender-based retransmission time-out (RTO) mechanism to infer data loss. The time-out

    value is calculated carefully using estimates of RTT and RTT variance, to reduce the possibility of

    falsely detecting loss or waiting too long to retransmit lost data. An optimization that was proposed

    and has been widely implemented is the use of triple duplicate ACKs to infer loss early rather than

    wait for the RTO to expire. A TCP receiver sends back a duplicate ACK whenever an out-of-order

    packet arrives. For instance, suppose packets Pn, Pn+1, Pn+2, Pn+3 and Pn+4 contain data that is

    contiguous in the sequence number space. IfPn+1 goes missing, then the receiving TCP sends back

  • 7/30/2019 A TRANSPORT PROTOCOL

    41/81

    Chapter 4. TCP-BASED SOLUTION 29

    duplicate ACKs acknowledging the sucessful receipt of Pn when Pn+2, Pn+3 and Pn+4 arrive. On

    getting 3 duplicate ACKs, a TCP sender assumes that the data packet immediately following the

    (multiply) ACKed data was lost. The sender retransmits this packet immediately. This is called fast

    retransmit. As was pointed out in Chapter 2, many enhancements to TCP have been proposed and

implemented, such as the use of SACKs, that improve TCP's loss recovery, among other things.
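For reference, the widely implemented time-out calculation (Jacobson's estimator [21], shown here in its standard textbook form) keeps smoothed estimates of the RTT and its variation:

$$\begin{aligned} \mathrm{SRTT} &\leftarrow (1-\alpha)\,\mathrm{SRTT} + \alpha R' \\ \mathrm{RTTVAR} &\leftarrow (1-\beta)\,\mathrm{RTTVAR} + \beta\,\lvert \mathrm{SRTT} - R' \rvert \\ \mathrm{RTO} &= \mathrm{SRTT} + 4\,\mathrm{RTTVAR} \end{aligned}$$

where R' is the latest RTT sample, with typical gains α = 1/8 and β = 1/4.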

    4.1.2 Flow Control

    Flow control allows a receiving TCP to control the amount of data sent by a sending TCP. With

    each ACK, the receiving TCP returns the amount of free space available in its receive buffer. This

    value is called the receiver advertised window (rwnd). The sending TCP accomplishes flow control

    by ensuring that the amount of unacknowledged data (the demand for receiver buffer space) does

    not exceed rwnd (the supply of buffer space on the receiver).
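In sketch form (illustrative names of our own, not kernel code), the sender-side check is simply:

    /* Window-based flow control: a packet may be sent only if doing so
     * keeps the unacknowledged data within rwnd. snd_una is the oldest
     * unacknowledged sequence number; snd_nxt is the next to be sent. */
    static int can_send(unsigned int snd_una, unsigned int snd_nxt,
                        unsigned int rwnd, unsigned int pktlen)
    {
        return (snd_nxt - snd_una) + pktlen <= rwnd;
    }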

    4.1.3 Congestion Control

The original specification of TCP [29] did not have congestion control. TCP's congestion control

    algorithm was proposed in [21]. Just as flow control tries to match the supply and demand for the

    receiver buffer space, congestion control matches the supply and demand for network resources

    like bandwidth and switch/router buffer space. This is a much more complex problem because

    TCP is designed to work on packet-switched networks in which multiple data flows share network

resources. TCP's congestion control algorithm is a distributed solution in which each data flow

    performs congestion control using only its own state information, with no inter-flow information

    exchange.

    TCP congestion control is composed of three parts.

1. Estimate the current available supply of the network resources and match the flow's demand

    to that value.

    2. Detect when congestion occurs (i.e. demand exceeds supply).

    3. On detecting congestion, take steps to reduce it.


    TCP maintains a state variable, congestion window (cwnd), which is its estimate of how much

    data can be sustained in the network. TCP ensures that the amount of unacknowledged data does

not exceed cwnd (implementations actually bound the unacknowledged data by min(rwnd, cwnd), since flow control requires it to stay below rwnd as well), and thus uses cwnd to vary a flow's resource demand. Since a sending TCP

    has no explicit, real-time information about the amount of resources available in the network, the

    cwnd is altered in a controlled manner, in the hope of matching it to the available resources. The

    cwnd is increased in two phases. The first phase, which is also the one in which TCP starts, is

called slow start. During slow start, cwnd is incremented by one packet for each returning ACK that

acknowledges new data. Thus, if cwnd at time t was C(t), all of the unacknowledged data at t would

get acknowledged by time (t + RTT), and C(t + RTT) would be C(t) + C(t) = 2C(t). Slow start

is used whenever the value of cwnd is below a threshold value called the slow start threshold (ssthresh).

When cwnd increases beyond ssthresh, TCP enters the congestion avoidance phase, in which the rate

of cwnd increase is reduced. During congestion avoidance, each returning ACK increments cwnd

from C to (C + 1/C). An approximation used by many implementations is to increment C to (C + 1)

at the end of an RTT (assuming the unit for cwnd is packets).
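The two growth phases can be summarized in a few lines of C (a textbook-level sketch in units of packets, not the Linux implementation):

    static unsigned int cwnd = 1;       /* congestion window, in packets */
    static unsigned int ssthresh = 64;  /* slow start threshold          */
    static unsigned int acks_this_rtt;  /* counts ACKs toward +1 per RTT */

    /* Called for every ACK that acknowledges new data. */
    static void cwnd_on_ack(void)
    {
        if (cwnd < ssthresh) {
            cwnd += 1;         /* slow start: +1 per ACK, doubling per RTT */
        } else if (++acks_this_rtt >= cwnd) {
            acks_this_rtt = 0;
            cwnd += 1;         /* congestion avoidance: +1 per RTT         */
        }
    }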

    The second component of congestion control is congestion detection. TCP uses packet loss as

    an indicator of network congestion. Thus, each time a sending TCP infers loss, either through RTO

    or triple duplicate ACKs, it is assumed that the loss was because of network congestion. Other

    congestion indicators have been proposed. For instance, in Chapter 2 we mentioned that FAST

    uses queueing delay to detect network congestion. Some researchers have proposed that a more

    proactive approach should be adopted, and congestion should be anticipated and prevented, rather

    than reacted to. Such a proactive approach would require congestion information from the network

    nodes. See [5] for a discussion of the Active Queue Management (AQM) mechanisms that routers

    need to implement, and [15] for a description of the Random Early Detect (RED) AQM scheme.

In [30], the modifications that need to be made to TCP in order to take advantage of the congestion

information provided by routers using AQM are presented.

The third component of congestion control is taking action to reduce congestion once it has been

detected. The fact that congestion occurred (and was detected) means that TCP's estimate of the

available network resource supply is too high. Thus, to deal with congestion, TCP reduces its

    estimate by cutting down cwnd. On detecting loss, the sending TCP first reduces ssthresh to half of

    the flight size, where flight size is the amount of data that has been sent but not yet acknowledged

    (the amount in flight). The next step is to reduce cwnd. The amount of reduction varies depending

on whether the loss detection was by RTO or triple duplicate ACKs. If an RTO occurred, then the

congestion in the network is probably severe, so TCP sets cwnd to 1 packet. The receipt of duplicate

    ACKs means that packets are getting through to the receiver and hence congestion is not that severe.

    Therefore, in this case cwnd is set to (ssthresh + 3) packets and incremented by 1 packet for each

    additional duplicate ACK. This is called fast recovery.

The linear increase of cwnd by one packet per RTT during congestion avoidance, and its

decrease by a factor of two during recovery from loss, is called Additive Increase Multiplicative

Decrease (AIMD). TCP uses an AI factor of one (cwnd ← cwnd + 1) and an MD factor of two

(cwnd ← cwnd/2).
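Continuing the sketch above in units of packets, the two loss reactions look roughly like:

    /* 'flight' is the amount of sent-but-unacknowledged data (packets). */
    static void on_rto(unsigned int *cwnd, unsigned int *ssthresh,
                       unsigned int flight)
    {
        *ssthresh = flight / 2;  /* halve the resource-supply estimate    */
        *cwnd = 1;               /* severe congestion: back to slow start */
    }

    static void on_triple_dupack(unsigned int *cwnd, unsigned int *ssthresh,
                                 unsigned int flight)
    {
        *ssthresh = flight / 2;
        *cwnd = *ssthresh + 3;   /* fast recovery; each further duplicate
                                    ACK inflates cwnd by one more packet  */
    }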

    4.1.4 Self Clocking

    Although TCP does not explicitly perform rate control, the use of ACK packets leads to a handy

    rate maintenance property called self clocking [21]. Consider the situation shown in Figure 4.1.

The node marked SENDER is sending data to the RECEIVER that is three hops away (this figure

is adapted from one in [21]). The links LINK1, LINK2 and LINK3 are logically separated to show data flow in both directions. The width

    of a link is indicative of its bandwidth, so LINK2 is the bottleneck in this network. The shaded

blocks are packets (data packets and ACKs), with packet size proportional to a block's area. The

figure shows the time instant when the sender has transmitted a window's worth of packets at the

rate of LINK1. Because all these packets have to pass through the bottleneck link, they reach the

receiver at LINK2's rate. This is shown by the separation between packets on LINK3. The receiver

    generates an ACK for each successfully received data packet. If we assume that the processing time

    for each received data packet is the same, then the ACKs returned by the receiver have the same

spacing as the received data packets. This ACK spacing is preserved on the return path. Each ACK

allows the sender to transmit new data packets. If a sender has cwnd worth of data outstanding in

    the network, new data packets are transmitted only when ACKs arrive. Thus, the sending rate (in

    data packets per unit time) is maintained at the rate of ACK arrival, which in turn is determined by

    the bottleneck link rate. This property of returning ACKs clocking out data packets is called self

    clocking.
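In steady state, self clocking therefore pins the sending rate to one window per round-trip time. As a back-of-the-envelope check (our own numbers):

$$ \text{rate} \;\approx\; \frac{cwnd}{RTT}\,, \qquad \text{e.g.,}\quad \frac{1.25~\text{MB}}{10~\text{ms}} \;=\; 125~\text{MB/s} \;=\; 1~\text{Gbps}. $$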

[Figure 4.1: TCP self clocking. A SENDER transmits through two network nodes to a RECEIVER over LINK1, LINK2 and LINK3; LINK2 is the bottleneck, and the data packets and returning ACKs are shown spaced out at the bottleneck rate.]

    4.2 Reasons for Selecting TCP

    In Chapter 3, two problems were identified in a user-space UDP-based implementation of FRTP.

    1. Use of busy waiting to maintain a fixed inter-packet gap, and thus a fixed rate, does not work

    very well. Even if it did work perfectly, it is wasteful of CPU cycles.

    2. The difficulty of maintaining a fixed receiving rate makes flow control very attractive. A

proactive scheme, in which the receiver is able to prevent buffer overflow, requires kernel support

that a user-space FRTP cannot get. By removing SABUL's rate-based congestion control, FRTP

forgoes SABUL's reactive flow control too. Thus, FRTP has no flow control at all.


In this section, two issues are addressed: first, whether TCP is better at tackling the two problems

listed above; and second, whether there are other issues unique to TCP that need to be considered.

The description of TCP's slow start and AIMD schemes in Section 4.1.3 shows that TCP does

    not maintain a fixed sending rate. TCP is designed with the assumption that the available bandwidth

    in the network (called supply in Section 4.1) is changing over time, as other data flows start or end,

and that its instantaneous value is not known. TCP's congestion control algorithms attempt to match

a flow's sending rate to the available network bandwidth, in spite of this incomplete knowledge. But

such a rate-altering algorithm is not needed on dedicated circuits.

If we assume that TCP's congestion control can be disabled, how well can TCP maintain a fixed

    sending rate and at what granularity? The self clocking property provides a low-overhead way to

    maintain a steady sending rate. In steady state, each returning ACK clocks out a data packet so a

    steady sending rate can be maintained at a granularity of packets. Moreover, packet transmission is

    initiated as a result of an interrupt (the NIC raises an interrupt when an ACK is received), and so is

    much less likely to be disturbed by the behavior of the process scheduler. This is a major advantage

    of shifting the responsibility of maintaining a steady rate to the kernel domain.

The variability in the receiving rate is due to the receiving application's interaction with the

    process scheduler and the disk. This problem is not solved by using a different transport protocol.

But TCP's flow control is designed to minimize the impact of such variability on data transfer

    performance. TCP uses a window-based flow control scheme (see Section 4.1.2) that prevents

    receive buffer overflow, unlike SABUL, which reacts to packet loss caused by buffer overflow.

    TCP appears to adequately deal with the two problems identified in implementing FRTP. In

    addition there are a few other reasons for choosing TCP which we point out next. Once it had been

    established that flow control required kernel support, our choice was essentially made. We did not

    have the expertise to implement a kernel-space protocol starting from scratch. So, our protocol had

    to be implemented by modifying an existing, stable kernel-space transport protocol. TCP and UDP

    are so widely used and well understood that, unless some other protocol is clearly more suitable, it

    makes sense to modify TCP or UDP. Another reason for choosing to use TCP is that error control

    comes for free. In the next section, the protocol design for C-TCP is presented and it should be clear


that for the majority of transport protocol functions, what TCP implements works, regardless of

    whether the underlying network is connectionless or connection-oriented.

So is TCP the answer to all our problems? Well, no. Without any modifications, TCP's congestion

control algorithm is not suitable for use over a dedicated circuit. One of the main differences

between TCP and C-TCP is the congestion control algorithm used. We describe C-TCP in more

detail in the next two sections. A practical issue with any kernel-space modification is that it is

much less convenient to deploy than a user-space application, which can simply be downloaded,

built and installed; installing a modified kernel requires the host to be rebooted.

    4.3 Circuit-TCP Design

    In this section the design of C-TCP is described. Five functions of a transport protocol are con-

    sidered, namely connection establishment, congestion control, multiplexing, flow control and error

    control. For each of these functions, we consider whether it is required on a dedicated circuit and if

    so, how to provide the function.

    4.3.1 Connection Establishment

    It is useful in the design of a transport protocol to think in terms of control and data planes. Control

plane functions support the data plane. For instance, TCP's three-way handshake for connection

establishment is used to agree upon an initial sequence number to be used in the data transfer that

follows. Like TCP, C-TCP requires per-flow state to be maintained. The connection-

    establishment and release schemes are used unaltered from TCP.

    4.3.2 Congestion Control

    Network congestion occurs when the