TCP Performance Enhancement in Wireless Environments: Prototyping in Linux


  • Georg-August-Universität Göttingen, Zentrum für Informatik

    ISSN 1612-6793, Number GAUG-ZFI-BSC-2008-05

    Bachelor's thesis in the study program "Angewandte Informatik" (Applied Computer Science)

    TCP Performance Enhancement in Wireless Environments:

    Prototyping in Linux

    Swen Weiland

    Arbeitsgruppe für Computernetzwerke (Computer Networks Group)

    Bachelor's and Master's theses of the Zentrum für Informatik

    at the Georg-August-Universität Göttingen, 13 May 2008

  • Georg-August-Universität Göttingen, Zentrum für Informatik

    Lotzestraße 16-18, 37083 Göttingen, Germany

    Tel. +49 (5 51) 39-1 44 14, Fax +49 (5 51) 39-1 44 15, Email: [email protected], WWW: www.informatik.uni-goettingen.de

  • I hereby declare that I have written this thesis independently and have used no sources or aids other than those stated.

    Göttingen, 13 May 2008

  • Bachelor's Thesis

    TCP Performance Enhancement in Wireless Environments: Prototyping in Linux

    Swen Weiland

    13 May 2008

    Supervised by Prof. Dr. Xiaoming Fu, Arbeitsgruppe für Computernetzwerke, Georg-August-Universität Göttingen

  • Contents

    1 Introduction
      1.1 Motivation to Optimize Wireless Networks
      1.2 Contribution of This Thesis
      1.3 Thesis Organization

    2 Background and Related Work
      2.1 TCP Basics
      2.2 Existing Works on TCP Improvements in Wireless Networks
      2.3 TCP Snoop
        2.3.1 Overview
        2.3.2 Basic Idea and how Enhancements are achieved

    3 Proxy Implementation Design
      3.1 Overview and a Brief Function Description
      3.2 Interface to the Kernel for Capturing
      3.3 Capturing Packets - a Comparison of Methods for Capturing
        3.3.1 libPCap
        3.3.2 RAW Sockets
        3.3.3 Netfilter Queue
        3.3.4 Kernel Ethernet Bridge
          3.3.4.1 Netfilter
        3.3.5 Conclusion of the Comparison
      3.4 Module Design of the Proxy Implementation
        3.4.1 Operating System Requirements
        3.4.2 Module: Buffer Manager
        3.4.3 Module: Netfilter Queue Interface
        3.4.4 Module: TCP Connection Tracker
          3.4.4.1 Prime Numbers for the Secondary Hash Function
          3.4.4.2 Identify a TCP Flow
          3.4.4.3 Retransmission Buffer
        3.4.5 Module: Timer Manager
        3.4.6 Module: Connection Manager
          3.4.6.1 Stateful Tracking
          3.4.6.2 Round Trip Time Calculation

    4 Evaluation
      4.1 Testing
        4.1.1 TCP Connection Tracker
        4.1.2 Connection Manager with implemented TCP Snoop behavior
      4.2 Performance Evaluation

    5 Conclusions
      5.1 Summarization of Results
      5.2 Future Work and Outlook

    Bibliography

  • List of Figures

    2.1 TCP Header
    3.1 Overview: Implemented modules and their communication
    3.2 Dataflow with libPCap
    3.3 Dataflow with RAW Sockets
    3.4 Ethernet Type II Frame format
    3.5 Dataflow with Netfilter Queue
    3.6 Logical structure: Linked list used as a FIFO
    3.7 Implementation: Linked list at initial state
    3.8 Implementation: Linked list after retrieval of a chunk
    3.9 Implementation: Linked list after returning the chunk
    3.10 TCP Handshake
    3.11 Simplified TCP State Machine Diagram from the Connection Tracker
    4.1 Testbed for initial TCP connection tracking test
    4.2 Testbed for TCP connection tracking test
    4.3 Testbed for TCP Snoop

  • Abstract

    In recent years, wireless communication has become more and more popular. Future wireless standards will reach throughputs much higher than 100 Mbit/sec on the link layer. However, wireless channels, as compared to wired lines, exhibit different characteristics due to fading, interference, and so on. For the Transmission Control Protocol (TCP), the misinterpretation of packet loss caused by these wireless channel characteristics as network congestion results in suboptimal performance. There are many different approaches to enhance TCP over wireless networks, especially for slow and lossy links such as satellite connections. This thesis evaluates TCP Snoop as one of these approaches for high transfer rates. Finding, using and implementing effective capturing, buffering and tracking of TCP communication were the objectives to solve. A general and transparent TCP proxy with TCP Snoop behavior was implemented during the work for this thesis. The TCP proxy runs on an intermediate Linux host which connects wired and wireless networks, as a prototype user space application with a modular design.

    Different traffic capture methods are compared in portability and performance. A full TCP connection tracking is described and implemented. Design patterns and methods that proved their benefit in practice were applied and sometimes partially modified to fit the needs of the transparent TCP proxy. The modular design makes it possible to exchange a low level module such as the data traffic capture module. Porting the implementation to another operating system or platform, such as the embedded systems used as wireless LAN routers, or changing the TCP enhancement method, is also eased by the modular design.

    The results show that a transparent TCP proxy or any other traffic modifying implementation should not reside in user space, for performance reasons. A kernel space implementation, or better yet dedicated hardware like a network processor platform, should be used for such implementations.

  • 1 Introduction

    Nowadays, the Internet is more and more based on wireless technologies, especially on the last few meters or even few miles. For example, many DSL Internet connectivity products in Germany are bundled with subsidized wireless access points. From my personal experience, people like to sit on the sofa or in the garden outside the house while still using the Internet for surfing, chatting and downloading. In such scenarios, wireless is becoming more and more desirable.

    According to the traffic analysis study by Simon Leinen at Columbia, the majority of Internet traffic, about 74% [Leinen], is TCP. An extract of the results from his traffic analysis is presented below as Table 1.1. As TCP was originally designed for wired communications, there are some drawbacks in wireless scenarios. If these issues could be solved or at least optimized, this would also optimize the majority of Internet traffic.

    This thesis focuses on TCP performance enhancement issues over wireless environments. More specifically, I implemented a prototype of a transparent TCP proxy as a user space application for optimizing end-to-end TCP performance. The implementation is meant to run on or very near the last hop to a mobile node. A user space application benefits from a straight design and is independent from any restrictions that apply to implementing a kernel module. The lowered privilege level of a user space application protects the operating system and leads to good system stability. Moreover, debugging is easier in user space, which simplifies rapid prototyping.

    Protocol   Flows          Flows (%)   Packets         Packets (%)   Bytes              Bytes (%)
    GRE        383            0.00 %      17235           0.00 %        3602115            0.00 %
    ICMP       101931237      1.75 %      305793711       0.45 %        37918420164        0.11 %
    IGMP       34662          0.00 %      901212          0.00 %        58578780           0.00 %
    IP         1406788        0.02 %      15474668        0.02 %        3528224304         0.01 %
    IPINIP     1297           0.00 %      1297            0.00 %        583650             0.00 %
    TCP        4361852662     74.91 %     63919315234     93.39 %       32455859980970     96.82 %
    UDP        1357265629     23.31 %     4201556174      6.14 %        1025284993546      3.06 %

    Table 1.1: Analysis taken by Simon Leinen on an access router at a random university [Leinen].


    1.1 Motivation to Optimize Wireless Networks

    In TCP, packet losses are interpreted by the TCP stack as congestion by default. For wired network hardware today, packet loss is not a real problem anymore because of the very low bit error rate. However, wireless and mobile networks are often characterized by sporadic high bit error rates, intermittent connectivity and the effects of interference [Caceres]. This results in higher bit error rates than in wired networks. Additionally, a sender lowers its packet sending rate due to the misinterpretation of these packet losses as congestion.

    To avoid this, researchers have proposed various TCP enhancements [Bakre], [Balakrishnan1995], [Chen] which try to reduce or eliminate these impacts. To apply such an enhancement to a network, in most cases the infrastructure has to be changed or the nodes have to be reconfigured. If a transparent TCP proxy is used, nothing in the infrastructure has to be changed; only the proxy function is added.

    Such a proxy should be positioned as near as possible to the wireless part, in order to react more quickly to changes in the wireless part of the network. Directly in a base station of a wireless network is the nearest and therefore best position. I assume it has sufficient RAM and processing power to perform the necessary proxy function, which is plausible given today's manufacturing technologies; I will come back to this issue in the evaluation in the later parts of this thesis.

    1.2 Contribution of This Thesis

    The contribution of this thesis can be summarized as follows:

    - Representative approaches for TCP enhancements over wireless environments are identified and classified in three groups.

    - A transparent proxy approach, TCP Snoop [Balakrishnan1995], [Balakrishnan1996], [Balakrishnan1999], is selected for primary study due to its nice tradeoff between functionality and complexity. A software design over Linux is presented and implemented.

    - A key technique used in transparent proxy approaches, namely data capturing, is identified. A data capturing solution is chosen out of several alternative solutions, according to their ability and performance to capture and modify the through-passing traffic.

    - Finally, a performance analysis of the implementation is given and the TCP Snoop approach as a user space application is evaluated systematically.


    1.3 Thesis Organization

    The thesis starts with a general survey of related work, including an introduction to TCP Snoop as the background for the software design and implementation. I then discuss the design of the software framework: firstly, requirements and dependencies on other software are introduced, followed by a bottom-up overview of the general design of the software with a short functional description. After deciding on the most efficient capture method for this implementation, the thesis describes how each specific module was designed and implemented, which design patterns were applied and why the implemented algorithms were chosen. Note that source code is not discussed directly; only necessary function descriptions or data structures are given in order to provide a closer view of the software. The implementation is then evaluated with some test cases and analyzed in terms of performance.

    For the convenience of the readers, below I give a short summary of the notation I use in this thesis:

    - Bibliographic sources are given as references in brackets.

    - Shell commands and source code are shown in framed boxes.

    - Bold formatted words represent references to labels in figures and/or names of software modules. If a word is formatted italic, it represents an important expression or the project name of a software implementation which was used or compared against.

    - References to the Internet are directly inserted into the text as footnotes.


  • 2 Background and Related Work

    2.1 TCP Basics

    Basic knowledge about the Internet Protocol (IP) and the Transmission Control Protocol (TCP) is assumed. In this thesis the focus is on TCP, and in this section some facts about TCP are recapitulated which are referred to later or explained later in more detail. If you have good knowledge of TCP, the rest of this section can be skipped and you can continue reading at section 2.2.

    TCP is a reliable stream delivery service that guarantees to deliver a stream of data sent from one node to another without duplicating or losing data. It also provides flow control and congestion control. A flow or connection is identified by a source and destination IP address and especially by the source and destination TCP port numbers. Reliability is achieved with acknowledgment messages, which are sent along with data packets or standalone with an empty data packet. These acknowledgment messages are represented as special fields in the TCP protocol header. To be precise, the acknowledgment number (ACK number) and the acknowledgment flag (ACK flag) are meant.

    The structure of the TCP protocol header is shown in Figure 2.1 below; the positions of the ACK number and the ACK flag can be looked up in the figure. By comparing the sequence numbers of already sent packets with the acknowledgment number of a currently received packet, the TCP stack can decide which packets have reached their destination, are still on the way, or are lost. All sent packets in the outgoing buffer with a sequence number lower than the acknowledgment number of a received packet have reached their destination. Every time a packet or a bunch of packets is sent out, a retransmission timer is started. A loss event is a timeout of this timer or 3 duplicate ACKs.

    The fast retransmit algorithm uses the arrival of 3 duplicate ACKs (4 identical ACKs without the arrival of any other intervening packets) as an indication that a segment has been lost. [RFC2581, paragraph 3.2 on page 6]
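    To make the loss-event rule concrete, the following minimal C sketch shows how a tracker might count duplicate ACKs and check a retransmission timeout. The structure and names (flow_t, now_ms, DUP_ACK_THRESHOLD) are illustrative assumptions, not code from the thesis.

    #include <stdint.h>
    #include <stdbool.h>

    #define DUP_ACK_THRESHOLD 3   /* 3 duplicate ACKs signal a loss (fast retransmit) */

    typedef struct {
        uint32_t last_ack;        /* highest ACK number seen so far               */
        int      dup_ack_count;   /* identical ACKs seen after the first one      */
        uint64_t rto_deadline;    /* absolute time the retransmission timer fires */
    } flow_t;

    /* Feed every received ACK into the tracker; returns true on a loss event. */
    static bool on_ack(flow_t *f, uint32_t ack, uint64_t now_ms)
    {
        if (ack == f->last_ack) {
            /* Same ACK again: the receiver is still missing the same segment. */
            if (++f->dup_ack_count >= DUP_ACK_THRESHOLD)
                return true;                 /* loss event: trigger retransmission */
        } else {
            f->last_ack = ack;               /* new data acknowledged */
            f->dup_ack_count = 0;
        }
        return now_ms >= f->rto_deadline;    /* loss event: retransmission timeout */
    }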


     0                   15 16                  31
    +----------------------+----------------------+
    |     source port      |   destination port   |
    +----------------------+----------------------+
    |              sequence number                |
    +---------------------------------------------+
    |            acknowledgment number            |
    +------+--------+-+-+-+-+-+-+-----------------+
    | data |reserved|U|A|P|R|S|F|                 |
    |offset|        |R|C|S|S|Y|I|     window      |
    |      |        |G|K|H|T|N|N|                 |
    +------+--------+-+-+-+-+-+-+-----------------+
    |       checksum       |    urgent pointer    |
    +----------------------+----------------------+
    |      options (0 or more 32-bit words)       |
    +---------------------------------------------+
    |                    data                     |
    +---------------------------------------------+

    Figure 2.1: TCP Header (each row is 32 bits wide)

    Loss events are the most important information for a TCP proxy, because reacting to and resolving these loss events is one of its main tasks. Later, in 3.4.4, more parts of the TCP header are addressed and Figure 2.1 can be used as a reference.

    2.2 Existing Works on TCP Improvements in Wireless Networks

    This section gives a brief overview of related work on mechanisms for improving TCP performance over wireless links. According to the work of Balakrishnan et al. [Balakrishnan1996] and Xiang Chen et al. [Chen], these mechanisms can be grouped into:

    End-To-End

    End-To-End schemes apply enhancements directly into the TCP stack or extend it via TCP options. This implies a modification of the TCP stack, which is mandatory for both communication partners if they want to benefit from this type of enhancement. Examples are TCP NewReno or Explicit Loss Notification (ELN) [Balakrishnan1998]. The behavior of the TCP stack for loss events is optimized, or additional information is sent out and processed to realize a better differentiation between congestion and packet loss caused by link layer errors.


    Link-Layer

    Link-Layer approaches try to make the link layer aware of higher layer protocols like TCP. No TCP stack modification on the communicating nodes is required, but an intermediate node is added to the network infrastructure. An example of this group is TCP Snoop. Link layer errors are corrected by enhanced retransmission techniques, and because of the transport layer awareness these errors can be hidden. Misinterpretations by the transport layer that congestion occurred instead of link layer errors are suppressed.

    Split-Connection

    As the name suggests, a flow of a connection oriented and reliable transport layer protocol, e.g. TCP, is split at an intermediate node into two separate flows. The intermediate node therefore needs to maintain two separate TCP stacks, but data copying between these stacks is avoided by passing only pointers and having a shared buffer. Examples of split connection schemes are I-TCP and SPLIT.

    All these mechanisms are very similar, because they fight the same issues with slightly different methods and effort. Link layer transmission errors are detected and misinterpretations that congestion occurred are avoided. Asymmetric network links are handled more efficiently, and additionally some of these mechanisms do caching and local retransmission to recover more efficiently from packet losses.

    Modifying the TCP stack as in the End-To-End approaches is very effective, because wrong or ineffective behaviors are suppressed directly at their source. The additional processing caused by this type of modification is often very little. On the other side, these approaches are hard to deploy because in most cases both end-nodes need to implement the modifications to benefit from them, and in some cases middle-nodes are involved as well. Link layer feedback is limited, and issues are mainly fixed at the transport layer. For Link-Layer or Split-Connection approaches only an intermediate node or module is added to the network infrastructure and no end-node has to be modified. This means they are easier to deploy, but adding an additional node also raises the processing power needed for communication. An intermediate node can only guess the state of a TCP stack and only influence it indirectly. Split-Connection approaches try to solve the indirect influence issue by maintaining two separate TCP stacks for each flow on the intermediate node, but this also nearly doubles the processing overhead. Link-Layer approaches are a good tradeoff, because they are easier to deploy and have reduced complexity compared with Split-Connection approaches. Furthermore, Link-Layer approaches do not break the end-to-end communication like Split-Connection approaches do, which makes roaming in wireless networks possible without any trouble.


    2.3 TCP Snoop

    2.3.1 Overview

    As defined by Balakrishnan et al. [Balakrishnan1995], TCP Snoop seeks to improve the performance of the TCP protocol for wireless environments without changing the existing TCP implementations, neither in the wired network nor in the wireless network. It is designed to be used on or near a base station of a wireless network, to enhance TCP end-to-end connections. Every through-passing TCP packet is buffered in local memory for doing a fast and local retransmission if a packet loss occurs on the wireless part of the network. TCP Snoop behaves as a transparent proxy and maintains a cache of TCP packets for each TCP flow. Lost packets are detected and locally retransmitted. Therefore each flow is tracked with an additional but simplified TCP stack in the proxy implementation. TCP Snoop does not break end-to-end connectivity like Split-Connection approaches and stays completely transparent. This makes roaming from one wireless part to another wireless part of the same network possible.

    The reasons for treating wireless networks differently for TCP enhancement are the following. Wired network links are very reliable and generally have a higher bandwidth compared to wireless links. Reasons for this are the physical medium access method Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA) and the medium itself. The wireless medium is more vulnerable to physical interference and it is only a half duplex medium. In contrast, wired communication is mainly used as a full duplex medium and can easily be shielded as protection against physical interference.

    Triggered by lost packets or duplicated ACKs, which are also used for signaling lost packets, a TCP stack may suspect a congested link and lower its sending rate. This leads to suboptimal usage of bandwidth in wireless networks, because congestion in standard TCP is only detected through losses, but in wireless networks there are many other reasons for losses. Wireless links can recover to a higher bandwidth very quickly once an interference has stopped or weakened, but the TCP stack detects this very slowly compared with detecting congestion. There is simply no signaling for bandwidth changes in the standard TCP stack. To avoid this, duplicate ACKs are suppressed by the TCP Snoop proxy and a local retransmission is triggered for every lost packet. For wired links the congestion assumption is normally the right choice, because of the lack of temporarily high bit error rates in the medium. With high probability, wired links have reached their current maximum bandwidth if packet losses occur, which means congestion.


    2.3.2 Basic Idea and how Enhancements are achieved

    TCP Snoop is implemented as a transparent TCP proxy near to or on a base station of a wireless network. A TCP proxy has two Ethernet devices and forwards all traffic from one device to the other. On the way from one device to the other, a packet can be modified by the proxy if necessary, and this modification is also transparent for the network. Non-TCP traffic is ignored and just forwarded, but TCP traffic is processed by the proxy before it is forwarded as well. Processing is defined as identifying each TCP flow and tracking it. If a packet loss is detected during the tracking, a local retransmission is done. Losses are detected by a certain amount of duplicate ACKs and by timeouts of a locally installed retransmission timer, which is part of a simplified TCP stack in the proxy. The simplified TCP stack is utilized for tracking TCP and tracks SEQ numbers, ACK numbers and other dynamic values of a previously identified TCP flow.

    The proxy should be placed as near as possible to the wireless base station in the network topology to reduce response times. All unacknowledged packets from the fixed host (FH) to the mobile host (MH) are cached in the buffer of the proxy. This should be a buffer in fast memory like DRAM or SRAM. Unnecessary congestion control mechanism invocations for a TCP flow are avoided by hiding duplicate ACKs and doing the local retransmission.

    The authors of the original paper that defined TCP Snoop later updated it [Balakrishnan1996], [Balakrishnan1999] to improve the performance for packet losses on the way from the MH to the wireless base station. The update is implemented by sending selective retransmission requests to the MH and can be described as follows. Packets from the MH to the FH are processed and cached as normal, but if a gap in the ascending sequence numbers is detected, a Negative Acknowledgment (NACK1) is sent from the proxy to the MH, which triggers a retransmission. A new and, for the TCP proxy, cacheable copy of the lost packet should be on the way if the FH realizes that it was lost.

    Native TCP supports only positive acknowledgment of packets. Selective Acknowledgment (SACK) is a TCP option which allows sending selective acknowledgments (ACKs) or selective negative acknowledgments (NACKs) for specific packets. A TCP option defines an extension to TCP which is sent in an optional part of the TCP header.

    To verify that using this TCP SACK option and the proposed updates are applicable, I investigated the deployment of SACK. From December 1998 to February 2000, the fraction of hosts sampled with SACK-capable TCP increased from 8% to 40% [Allman]. Today it should be 90% and above, because SACK is supported by every major operating system.

    1 Part of the Selective Acknowledgment (SACK) TCP option. Standardized by RFC 2018.


    To give some names, SACK is supported by Windows (since Windows 98), Linux (since 2.2), Solaris, IRIX, OpenBSD and AIX [Floyd].

    Host support for the SACK extension is detected by the proxy during the three-way handshake which establishes a TCP flow. If SACK is not supported by one of the two hosts, it cannot be used for this flow at all. This is defined by RFC 2018, which specifies the TCP SACK option. In this case the proxy skips the enhancement for the traffic from the MH to the FH and just tries to enhance the traffic from the FH to the MH.


  • 3 Proxy Implementation Design

    3.1 Overview and a Brief Function Description

    The TCP proxy prototype implements a transparent TCP enhancer with a modular design, for easier extension and the possibility to implement other TCP enhancements in future work. For enhancing the TCP traffic the proxy must have the ability to drop or modify through-passing TCP packets. To achieve this, the TCP proxy breaks the physical medium and puts itself in between (see Figure 4.3) as an intermediate node. This gives the TCP proxy total control over which packets are forwarded, modified or dropped, because it has to forward each packet from one interface to the other.

    Figure 3.1 below gives an overview of the general software design of the proxy and its core modules. Details of each module and all applied design patterns and used algorithms are described later in this chapter, in 3.4.

    For the implementation the C programming language was used. Multi-threading, thread synchronization and mutual exclusions (mutexes1) were implemented using POSIX Threads. POSIX Threads is a POSIX standard for threads and defines a standard API for creating and manipulating threads. Mutexes are needed to secure and order asynchronous write operations. The POSIX thread implementation is platform independent, well standardized, well documented, and usable as a simple library or integrated in the operating system.

    The main target platform is Linux with kernel version 2.6.14. This is conditioned by the need to use libnetfilter_queue2 as the interface to the kernel, which is used for capturing, filtering and modifying TCP packets. This extension of the Netfilter packet filter demands this version of the Linux kernel. Netfilter is the standard packet filter of the Linux operating system. All functionality needed to use this interface is implemented in the Netfilter Queue Interface module of the proxy (see Figure 3.1).

    To support older kernel versions, which are often used on embedded systems like Wireless LAN (WLAN) routers, only this Netfilter Queue Interface module has to be replaced with some other module which does the capturing and filtering of the TCP traffic.

    1 http://en.wikipedia.org/wiki/Mutual_exclusion
    2 Netfilter Queue - see 3.3.3 for further description


    [Figure 3.1: Overview: Implemented modules and their communication. The Kernel Ethernet Bridge device, aggregating Ethernet Device #0 and Ethernet Device #1, feeds only TCP traffic to the Netfilter Queue Interface, which passes it on to the TCP Connection Tracker. The Connection Manager, which implements the TCP Snoop behavior, acknowledges packets for forwarding, uses the Buffer Manager (with its buffer and free chunks) and the Timer Manager, and lets the Packet Generator/Manipulator create new TCP packets. Retransmitted TCP traffic leaves via a RAW Socket.]

    Such a module could use a RAW Socket3 for capturing and a generic protocol classifier to detect the TCP traffic. A back-port to earlier versions of the Netfilter framework would also be possible with some, but minimal, effort. Back-porting to some embedded router could be done in future work if needed. The modular design makes this possible and more easily achievable.

    The Netfilter interface was chosen over the other capturing methods for performance reasons, as described and analyzed later in 3.3. A proxy instance consists of three threads. The main thread, which is created by the operating system, is the first thread. After some basic initialization the main thread installs an operating system callback function which triggers the Timer Manager module. This callback function is called at a fixed interval. It is counted as a separate thread because actions can be triggered or data can be manipulated asynchronously to the main thread.

    3 RAW Socket - see 3.3.2 for further description


    The third thread is also created by the main thread and is used for capturing. The capturing thread resides in the Netfilter Queue Interface module.

    Retransmission is implemented by utilizing RAW Sockets. These allow creating and sending custom TCP packets; every field in the IP and TCP header can be set to a custom value. In the case of the TCP proxy, mostly a previously buffered packet is simply retransmitted.

    To do the retransmission, the proxy must be aware of each TCP flow and its state. Only with this information does the proxy know the point in time and which TCP packet has to be retransmitted. After some TCP traffic has been captured by the Netfilter Queue Interface module, it is handed over to the TCP Connection Tracker module. The TCP Connection Tracker module gathers the following information:

    - Source and destination IP address

    - Source and destination TCP port

    - Packet is stored in a queue (one for each direction of a connection)

    - Pointer to the corresponding management structure

    After the TCP flow is identified by the TCP Connection Tracker and the corresponding management structure is known or created, all this information is passed to the Connection Manager module, which adds the following information to the management structure (a sketch of such a structure in C follows the list):

    - Connection Status

    - State (TCP State-Machine)

    - Acknowledge Number (ACK)

    - Sequence Number (SEQ)

    - Present TCP options

    - Round Trip Time (RTT)

    - Timer (for retransmission)
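    As a minimal illustration only (the thesis does not print its source code, and all field names and types here are assumptions), a per-flow management structure combining the tracker and Connection Manager information could look like this in C:

    #include <stdint.h>
    #include <netinet/in.h>

    struct chunk;   /* buffered packet, managed by the Buffer Manager (see 3.4.2) */
    struct timer;   /* timer handle, managed by the Timer Manager (see 3.4.5)     */

    /* Hypothetical per-flow management structure; illustrative, not thesis code. */
    struct flow_entry {
        /* Gathered by the TCP Connection Tracker */
        struct in_addr src_ip, dst_ip;      /* source and destination IP address */
        uint16_t       src_port, dst_port;  /* source and destination TCP port   */
        struct chunk  *queue[2];            /* packet queue, one per direction   */

        /* Added by the Connection Manager */
        int            status;              /* connection status                 */
        int            state;               /* TCP state machine state           */
        uint32_t       ack_num;             /* last seen acknowledge number      */
        uint32_t       seq_num;             /* last seen sequence number         */
        uint32_t       tcp_options;         /* present TCP options, e.g. SACK    */
        uint32_t       rtt_ms;              /* estimated round trip time         */
        struct timer  *retransmit_timer;    /* retransmission timer              */
    };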


    The Connection Manager is the main module; it makes all important decisions and can be seen as a concentrator for all the information. TCP packets can be forwarded, dropped, modified or retransmitted, and new packets can be created. As shown in Figure 3.1, the TCP Snoop behavior is implemented in this module.

    There are also three helper modules: Buffer Manager, Timer Manager and Packet Gener-ator/Manipulator.

    - The Buffer Manager offers management functions for the buffer memory of the proxy. A big memory block is reserved at the initialization of this module and divided into small chunks. Some other module, mainly the Netfilter Queue Interface for capturing and storing new data, can retrieve an unused chunk from the Buffer Manager. If a chunk is not needed any more, it is returned to the Buffer Manager. Different chunk managements are implemented. They are used during capturing, as described before, and for the queue management per TCP flow. The TCP packets of each tracked TCP flow are buffered in a special queue; there is one queue per communication direction of the flow.

    - The Timer Manager is used by the Connection Manager to install the retransmission timers and special timeout timers for some TCP states. It offers other modules the ability to trigger a specific action after a specified time period. This Timer Manager module is necessary because an application thread can only install one timer callback with one fixed interval. One callback would not be enough if the Connection Manager wants to install at least one retransmission timer for each TCP flow. Therefore the Timer Manager handles and manages this one callback for every module that wants to install a timer, or as many timers as it wants; repeating and one-time timers are possible (a minimal sketch follows this list). The callback is realized and handled with the signal handling of the operating system. Another way to implement it would be busy waiting in a separate thread, but this is definitely a bad choice; busy waiting is in most cases a bad choice for software.

    - The Packet Generator/Manipulator is only used by the Connection Manager to create new packets or modify packets, for example to hide duplicate ACKs or to send a NACK for a specific packet to the original sender.
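    To illustrate the idea of multiplexing many logical timers over the single periodic callback, here is a heavily simplified sketch. The tick period, data structures and names are assumptions for illustration, not the thesis code; unlinking of expired one-time timers and signal safety are omitted.

    #include <signal.h>
    #include <stddef.h>
    #include <sys/time.h>

    #define TICK_MS 10              /* period of the single OS callback (assumed) */

    struct timer {
        long          remaining_ms; /* time left until the timer fires  */
        long          interval_ms;  /* reload value for repeating use   */
        void        (*action)(void *ctx);
        void         *ctx;
        struct timer *next;
    };

    static struct timer *timers;    /* linked list of installed timers */

    /* Signal handler: one fixed-interval callback drives all logical timers. */
    static void tick(int signo)
    {
        (void)signo;
        for (struct timer *t = timers; t != NULL; t = t->next) {
            t->remaining_ms -= TICK_MS;
            if (t->remaining_ms <= 0) {
                t->action(t->ctx);                /* fire the timer           */
                t->remaining_ms = t->interval_ms; /* repeating timers reload; */
            }                                     /* one-time timers would be */
        }                                         /* unlinked here            */
    }

    /* Install the periodic SIGALRM callback via setitimer(2). */
    static void timer_manager_start(void)
    {
        struct itimerval iv = {
            .it_interval = { 0, TICK_MS * 1000 },  /* microseconds */
            .it_value    = { 0, TICK_MS * 1000 },
        };
        signal(SIGALRM, tick);
        setitimer(ITIMER_REAL, &iv, NULL);
    }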


    3.2 Interface to the Kernel for Capturing

    The Linux operating system segregates its memory into kernel space and user space. Kernel space is privileged and strictly reserved for the kernel and device drivers. A normal application runs in user space and has no direct access to kernel space memory areas.

    If a TCP packet arrives, it is pushed as an Ethernet frame from the device driver of the Ethernet card to the Ethernet stack in kernel space. From there it is passed to the TCP/IP stack in the kernel if it was identified as IP traffic; we assume this as our example. To hand this packet to an application running in user space, it has to be copied to a memory area in user space. This has to be done because kernel space memory is not directly accessible from user space. In the case of passing data from the kernel to a user space application, nothing more happens than duplicating a memory area into user space. Even if the packet is not needed any more after duplicating, the data still has to be copied into an address space that is assigned to user space. This effort is necessary for security reasons; simply remapping the kernel space memory block into user space is not possible. Copying memory and throwing one of the copies away is an expensive operation. How the number of copy operations can be reduced is shown in the next section, 3.3. Via the TCP socket, a kind of interface library, the application passes to the kernel a pointer to a memory area in user space for the incoming packet. The kernel duplicates the buffer with the incoming packet into the user space memory addressed by the pointer.

    Normally the Ethernet card passes only Ethernet frames which are addressed to its Media Access Control address (MAC address) to the device driver. But the card can be set into a special mode, called promiscuous mode, which makes the Ethernet card pass all Ethernet frames on the wire to the device driver. In this mode the kernel has to process all the traffic on the wire: the traffic destined to its own IP address and that to any other host in the Ethernet subnet. The additional traffic to other hosts caused by the promiscuous mode is dropped by the Ethernet stack or later in the IP stack of the kernel. Only traffic to the own host, addressed to a corresponding application on the host, is passed to user space.

    To get a copy of all traffic into user space, a special kernel interface is needed, especially for the traffic which is destined to other hosts. With such an interface it is possible to grab a copy of the traffic before it reaches the Ethernet stack or before it reaches the IP stack. This means Ethernet frames or IP packets can be grabbed and pulled into user space. Such an interface can be a RAW Socket (described in 3.3.2) or a special kernel module like Netfilter (described in 3.3.3). Using such an interface is usually known as sniffing or capturing.


    As we now know, passing traffic to user space is expensive. But passing all traffic from the wire to user space is very expensive! The promiscuous mode also leads to much more work for the kernel, because the traffic which is not destined to the own host is processed as well. On slow machines this can easily lead to performance problems.

    3.3 Capturing Packets - a Comparison of Methods for Capturing

    In this section different methods for capturing are shown and compared with a focus on performance and usability for the TCP proxy. The selected methods were chosen because of their popularity and because they have been present and/or usable for at least 5 years. This should assure that a prototype based on such an interface or library will remain usable with newer versions of the operating system and/or these interfaces. Other proprietary kernel modules, which work only for a few Linux kernel versions, were ignored.

    3.3.1 libPCap

    The Packet Capture library (libPCap4) provides a multi-platform and high level interface for packet capturing on the transport layer. It puts the Ethernet device into promiscuous mode and supports many common operating systems like different Linux, Windows, MacOS and BSD versions. The interface to the library is well documented, and packet capturing can be implemented within a few lines of source code.

    Early versions of my TCP proxy used libPCap for capturing, because of the easy handling and the support for many platforms. The library itself uses RAW Sockets (3.3.2) for capturing Ethernet frames and only abstracts the usage of RAW Sockets on the different platforms. Additionally it adds buffer management and filter management. The application does not have to care about the handling of an incoming buffer, and buffer overflows are also handled by the library. Just the buffer size and a callback function for handling incoming frames need to be defined during the initialization.

    4 http://www.tcpdump.org/
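    For reference, a minimal capture loop with libPCap looks roughly like the following sketch; the device name "eth0", snapshot length and timeout are arbitrary example values, and this is not the thesis code.

    #include <pcap.h>
    #include <stdio.h>

    /* Called by libPCap for every captured frame; the frame memory is
     * only valid until this callback returns. */
    static void handle_frame(u_char *user, const struct pcap_pkthdr *h,
                             const u_char *bytes)
    {
        (void)user; (void)bytes;
        printf("captured frame, %u bytes\n", h->caplen);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];

        /* Open "eth0" in promiscuous mode, 65535-byte snapshot, 10 ms timeout. */
        pcap_t *pc = pcap_open_live("eth0", 65535, 1, 10, errbuf);
        if (pc == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }
        pcap_loop(pc, -1, handle_frame, NULL);  /* run until interrupted */
        pcap_close(pc);
        return 0;
    }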


    All the traffic from the wire is stored in the buffer and then filtered. If no predefined filter applies, the library calls the previously defined callback function, in which the Ethernet frame has to be processed by the application. After the callback function has returned control to the library, the memory with the Ethernet frame is freed and reused for capturing. Freeing the memory after the callback function is the main disadvantage for implementing the proxy. If the proxy wants to retransmit a packet, the frame with the packet has to be copied to another buffer during the callback (see (5) in Figure 3.2). If not, the packet would be lost for the proxy after the callback and could not be retransmitted when needed.

    [Figure 3.2: Dataflow with libPCap. All traffic is copied with a RAW Socket from the buffer in the kernel's Ethernet stack into the libPCap buffer in user space and on to the callback function of the TCP proxy (1)-(3). All frames are sent back via a RAW Socket to the other Ethernet device (4); only TCP frames are additionally copied into the proxy's retransmission buffer (5).]

    The second disadvantage is that the filtering of libPCap (which would happen before (3)) is done in user space. Using the internal filtering of libPCap would reduce the amount of data that needs to be processed by the proxy and would therefore be an optimization. On the other hand, the proxy needs to bridge the two Ethernet devices to be transparent, and to achieve this, all Ethernet frames from one device have to be sent out via RAW Sockets on the other device. The filter management of libPCap is useless here, because all frames are needed for bridging the devices. In order not to break an ongoing communication, none of them may be filtered out on the way from one Ethernet device to the other.

    To buffer the TCP traffic, the proxy has to filter the bridged traffic for TCP packets, so filter functionality has to be implemented in the proxy.


    As shown in Figure 3.2, all traffic takes the way (1), (2), (3) from kernel to user space into the libPCap buffer and to the callback of the proxy. Then the traffic goes back via (4) into kernel space through a RAW Socket to the other Ethernet device. Only the TCP packets are duplicated into the retransmission buffer of the proxy, which is symbolized by arrow (5).

    3.3.2 RAW Sockets

    A RAW socket is a socket that can be seen as a direct interface to the transport or network layer. It passes a copy of a frame or packet directly to user space, before it is processed by the Ethernet stack or IP stack. It is also possible to send data like a TCP packet directly to the wire without being processed, or only partially processed, by the IP stack. Partially means, for example, calculating the checksums in the headers. This is a very good possibility for implementing retransmission for the proxy, but let us focus back on capturing.

    The current RAW socket interface has been supported by the Linux kernel since version 2.2.x. There is also a very similar interface in version 2.0.x, but it is obsolete and deprecated now. Using RAW sockets for capturing was tested with kernel version 2.4.x on a Linksys WRT54G5, an embedded Linux router with a wireless interface, and on a PC with Linux kernel version 2.6.x. Normally a RAW Socket gets only the traffic that is destined to the MAC addresses of the Ethernet devices owned by the proxy host. To implement the proxy functionality, all traffic on the wire needs to be processed by the proxy; therefore the Ethernet devices have to be set into promiscuous mode as described in 3.2. Basically the data flow is very similar to an implementation with libPCap (3.3.1), but there are no restrictions imposed by a predefined library, like limited control over the buffer management. On the other side, everything such as protocol classification and buffering has to be implemented by the proxy, which causes more processing in the proxy itself and therefore more workload during development.
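    Opening such a capture socket on Linux might look like the following sketch (interface name and buffer size are illustrative; root privileges are required, and setting promiscuous mode is omitted here):

    #include <arpa/inet.h>
    #include <linux/if_ether.h>    /* ETH_P_ALL */
    #include <linux/if_packet.h>   /* struct sockaddr_ll */
    #include <net/if.h>            /* if_nametoindex */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* Packet socket delivering every protocol on the link layer. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        /* Bind the socket to one device, e.g. "eth0" (example name). */
        struct sockaddr_ll sll;
        memset(&sll, 0, sizeof(sll));
        sll.sll_family   = AF_PACKET;
        sll.sll_protocol = htons(ETH_P_ALL);
        sll.sll_ifindex  = if_nametoindex("eth0");
        if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
            perror("bind"); return 1;
        }

        /* Each recv() copies one full Ethernet frame into user space. */
        unsigned char frame[2048];
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        if (n > 0)
            printf("captured frame, %zd bytes\n", n);

        close(fd);
        return 0;
    }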

    A proper design would be multi-threaded and provide at least two threads, one capture thread for each Ethernet device. This prevents polling each device alternately; polling is a type of busy waiting, which should not be used in software. With multiple threads, access to shared information like the state or the presence of a TCP flow has to be managed and protected, therefore mutexes are used. Multi-threading and mutexes have been implemented since the early prototypes with the very handy POSIX Threads library libpthread.

    5 See products on http://www.linksys.com


    [Figure 3.3: Dataflow with RAW Sockets. All traffic is copied via a RAW Socket from one Ethernet device directly into the buffer of the TCP proxy in user space (1), (2) and sent out via a RAW Socket to the other Ethernet device (3). For TCP packets, only a pointer into the buffer is handed to the retransmission buffer (4); a retransmission, if needed, is sent via a RAW Socket (5).]

    In Figure 3.3 you can see that all traffic takes the way (1), (2) from kernel space to user space directly into the buffer of the proxy. TCP and non-TCP traffic is directly sent out on way (3) via a RAW Socket to the other Ethernet device, to provide bridging and therefore transparency of the proxy. If the proxy classifies the content of an Ethernet frame as a TCP packet, a pointer to the current part of the buffer is forwarded on way (4) to the retransmission buffer of the proxy, and further processing like TCP connection tracking is triggered. For every other type of data traffic, the part of the buffer, later (in 3.4.2) defined as a chunk, is freed again. If a retransmission of a buffered TCP packet is needed later, the proxy sends the packet from the retransmission buffer on way (5) via a RAW socket to the Ethernet device.

    The main difference from the libPCap design is the optimized buffer management. Instead of copying all traffic to user space and copying TCP packets a second time from the libPCap buffer to the retransmission buffer, only a pointer is passed to the other modules of the proxy for further processing and perhaps a retransmission. The buffer chunk is reused for non-TCP packets and kept for TCP packets, which is also a small performance enhancement. Compared to the libPCap design, exactly one copy operation in user space is saved, but the problem remains: all the traffic has to be copied to user space and filtered, classified or modified there.


    3.3.3 Netfilter Queue

    Before the advantages of the Netfilter Queue design can be shown, some information on the Linux Kernel Ethernet Bridge has to be given. This is important because the previous designs had to implement its functionality by forwarding the traffic from one device to another with RAW sockets.

    3.3.4 Kernel Ethernet Bridge

    Ethernet bridging aggregates two or more Ethernet devices into one logical device. Bridged devices are automatically put into promiscuous mode by the kernel. As we already know, this mode of operation tells the Ethernet device to forward all traffic to the device driver, not only the traffic destined to its own MAC address. All traffic which is not destined to the host of the proxy is forwarded to the device with the corresponding part of the Ethernet subnet. The bridge learns where to forward an Ethernet frame from the Source MAC Address field (see Figure 3.4) in the Ethernet header.

    Ethernet Type II Frame (64 to 1518 bytes)

    | Destination MAC Address | Source MAC Address | EtherType | Payload           | CRC Checksum |
    | 12 A1 B2 C4 D5 E6       | 12 F1 12 23 34 56  | 08 00     | IP, ARP, etc.     | 00 20 20 3A  |
    | (6 bytes)               | (6 bytes)          | (2 bytes) | (46 - 1500 bytes) | (4 bytes)    |
    |<--------------- MAC Header (14 bytes) ------------------>|<----- Data ------>|

    Figure 3.4: Ethernet Type II Frame format

    All unknown MAC addresses from the Source MAC Address field are stored in a hash table, together with some additional information like a timestamp and the source device. This builds a database of MAC addresses linked with the Ethernet device through which to reach each host. The Ethernet bridge needs to process two cases:

    - If a MAC address is unknown, the Ethernet frame is forwarded to all other bridged devices. This will normally produce a response from the destination host. With such a response, the bridge is able to detect the corresponding Ethernet device for the previously unknown destination host: in the response, source and destination are exchanged compared to the first Ethernet frame, so the previous destination MAC address can now be learned from the source MAC address of the response. Using this logic, broadcasting to all other devices is needed only once.


    - If a MAC address is already stored in the hash table, the bridge knows where to forward the Ethernet frame. A small sketch of this learning logic follows.
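    The learning part can be summarized in a few lines of C. This is a toy illustration of the principle only, not kernel code: it ignores timestamps, aging and hash collisions, which the real bridge handles.

    #include <stdint.h>
    #include <string.h>

    #define TABLE_SIZE 256
    #define NO_DEVICE  (-1)

    /* One learned entry: a MAC address and the device it was seen on. */
    struct mac_entry { uint8_t mac[6]; int dev; int used; };
    static struct mac_entry table[TABLE_SIZE];

    static unsigned hash_mac(const uint8_t mac[6])
    {
        unsigned h = 0;
        for (int i = 0; i < 6; i++) h = h * 31 + mac[i];
        return h % TABLE_SIZE;
    }

    /* Learn: remember on which device the sender MAC was seen. */
    static void mac_table_store(const uint8_t mac[6], int dev)
    {
        struct mac_entry *e = &table[hash_mac(mac)];
        memcpy(e->mac, mac, 6);
        e->dev = dev;
        e->used = 1;
    }

    /* Look up the device for a destination MAC, or NO_DEVICE if unknown;
     * an unknown destination makes the caller flood all other devices once. */
    static int mac_table_lookup(const uint8_t mac[6])
    {
        struct mac_entry *e = &table[hash_mac(mac)];
        if (e->used && memcmp(e->mac, mac, 6) == 0)
            return e->dev;
        return NO_DEVICE;
    }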

    3.3.4.1 Netfilter

    The Netfilter Queue interface6 is part of the Netfilter framework. It is a user space library providing an API to packets that have been queued by the kernel packet filter. On the transport or network layer it can issue verdicts and/or reinject altered packets back into kernel space.

    Using the fairly well known iptables7 tool from the Netfilter framework, special filter rules can be set. These filter rules decide which packets are passed to user space. The possible ruleset of Netfilter covers every standard type of IPv4 or IPv6 traffic. For the proxy we only need to install one rule, which matches all TCP traffic; TCP is the only relevant data for the proxy. Protocol classification is applied already in kernel space by the Netfilter framework. Traffic other than TCP stays in kernel space and is not transferred to the proxy in user space, so the memory reservation and the copy operation to user space for non-TCP traffic are saved. The task of pulling network traffic from one Ethernet device to another, which results in being transparent, does not have to be handled by the proxy in this design. Due to the fact that filtering and protocol classification can be applied in kernel space by the Netfilter framework, but controlled from user space, it is possible to use the built-in Kernel Ethernet Bridge which was described at the beginning of this section.

    Netfilter makes no difference between a physical network device and a logical bridge device; rules can be set and applied to both of them. The two needed Ethernet devices are aggregated by the Kernel Ethernet Bridge into one bridge device. For this reason it is also enough to have only one thread for capturing, because there is only one device to capture from. This makes most of the thread synchronization in the whole implementation a lot easier and gains a bit more performance, because some overhead is saved. Information about which is the incoming and which is the outgoing physical device of a packet is also provided by Netfilter Queue. This information is needed for retransmission of the packet on the appropriate device.

    6 http://www.netfilter.org/projects/libnetfilter_queue/index.html
    7 http://www.netfilter.org/projects/iptables/index.html
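    A minimal consumer of this interface looks roughly as follows; it accepts every queued packet unchanged. Queue number 0 and the shortened error handling are simplifications, and this is a sketch of the library's usage, not the thesis implementation.

    #include <arpa/inet.h>
    #include <linux/netfilter.h>            /* NF_ACCEPT */
    #include <libnetfilter_queue/libnetfilter_queue.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>

    /* Called once per queued packet; issue a verdict to release it. */
    static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                  struct nfq_data *nfa, void *data)
    {
        (void)nfmsg; (void)data;
        struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
        uint32_t id = ph ? ntohl(ph->packet_id) : 0;

        unsigned char *payload;
        int len = nfq_get_payload(nfa, &payload);
        if (len >= 0)
            printf("packet %u, %d bytes\n", id, len);

        /* NF_ACCEPT forwards the packet; a modified packet could be
         * passed back via the last two arguments instead of 0/NULL. */
        return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
    }

    int main(void)
    {
        struct nfq_handle *h = nfq_open();
        nfq_bind_pf(h, AF_INET);
        struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);
        nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);  /* copy full packets */

        char buf[4096];
        int fd = nfq_fd(h);
        for (;;) {                                    /* capture loop */
            int n = recv(fd, buf, sizeof(buf), 0);
            if (n <= 0) break;
            nfq_handle_packet(h, buf, n);
        }
        nfq_destroy_queue(qh);
        nfq_close(h);
        return 0;
    }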


    [Figure 3.5: Dataflow with Netfilter Queue. A frame enters the Kernel Ethernet Bridge (1) and is passed to the Netfilter framework (2), which issues verdicts based on the installed filter rules and grants forwarding to the other device (3). Only TCP traffic is copied via Netfilter Queue to the proxy's retransmission buffer in user space (4); a retransmission, if needed, is sent via a RAW Socket to the according Ethernet device (5).]

    For all traffic, the need for forwarding to another Ethernet device is decided by the bridge. If the decision is positive, the frame takes way (1) in Figure 3.5 and is handed over to the logical Ethernet device which represents the bridge. From here the frame is passed (2) to the Netfilter framework. Netfilter issues verdicts based on the installed filter rules and grants (3) the forwarding to another device. In the case of TCP traffic, the packet is copied (4) to the proxy in user space. Netfilter asks the user space application, the proxy, whether the packet should be forwarded or modified; normally the answer is yes, for forwarding. Retransmission, if needed, is done with a RAW Socket (5) on the according Ethernet device. This design saves a few more copy operations between kernel and user space than the other designs.

    3.3.5 Conclusion of the Comparison

    During development of the proxy, all three methods for capturing were implemented, in exactly the order they were presented here. It was a bit like an evolutionary design with the goals of fixing issues and shortening/enhancing the way from the wire into the proxy.

    The capturing methods libPCap (3.3.1) and RAW Socket (3.3.2) are very similar except for the more effective buffer management of the RAW Sockets design.


    Technically, both designs use RAW Sockets for capturing, but libPCap has an abstraction layer above them to support different platforms and to make the handling easier. The libPCap design lags behind in needing a second buffer in user space for the TCP traffic. Copying the incoming TCP traffic to the second buffer is saved in the plain RAW Socket design, because traffic is captured directly into the retransmission buffer of the proxy. Both designs do the protocol classification of the traffic completely in user space, therefore all the traffic has to go the way down from kernel to user space and back to kernel space. Back to kernel space, because the traffic is forwarded to the other Ethernet device to implement bridging. Only a copy of the TCP traffic stays in user space for retransmission; all other traffic is dropped after it has been forwarded.

    The Netfilter design improves on this by doing the classification with the help of the kernel in kernel space. Only TCP traffic has to go the way down to user space and is buffered there, and only in the case of retransmission or modification does it have to go the way back up to kernel space. It also removes some complexity, because there is only one capture thread left for exactly one Ethernet device, the logical bridge device. Overhead for thread security and synchronization is also saved.

    The following Table 3.1 is a resume of the needed copy and memory operations. The k stands for kernel space and u for user space; buffer[u] describes the retransmission buffer of the proxy in user space.

    Design           Description                                 Copy operations   Memory allocations
    libPCap          k -> u -> k  or                             2 or 3            3 or 4
                     k -> u -> buffer[u] -> k
    RAW Sockets      k -> buffer[u] -> k                         2                 3
    Netfilter Queue  k -> pointer -> k  or                       0 or 1            1 or 2
                     (k -> pointer -> k) and (k -> buffer[u])

    Table 3.1: Comparison by counting copy and memory operations

    Finally, it can be decided that the Netfilter Queue design is the best choice with regard to performance. It saves many data transfers to user space, and this is very important for higher bit rates. The decision is based on reading source code of the Linux kernel, the Netfilter framework and libPCap, and on implementing each design during the development of the proxy.


    3.4 Module Design of the Proxy Implementation

    3.4.1 Operating System Requirements

    To actually set up the proxy, the Linux kernel needs to have 802.1d Ethernet Bridging support enabled, and the Netfilter framework for IPv4 must be enabled. The Kernel Ethernet Bridge has to be set up and the filter rule which applies to TCP traffic installed before any packet can be captured.

    To control the 802.1d Ethernet Bridging extension in the kernel, the bridge utilities8 are needed. With the following shell commands the needed bridge (br0) starts to forward Ethernet frames from eth0 to eth1 and vice versa.

    brctl addbr br0         # create bridge
    brctl addif br0 eth0    # add eth0 to bridge
    brctl addif br0 eth1    # add eth1 to bridge
    ifconfig br0 up         # start the device

    At this point the traffic can pass the proxy host fully transparently, but without any filtering or changing of the traffic. The user space proxy application must be started with the following command.

    # ./tcpproxy br0 eth0 eth1
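    The thesis does not print the Netfilter rule itself. Assuming the standard iptables NFQUEUE target and queue number 0 (both assumptions, not taken from the thesis), a rule that redirects all TCP traffic forwarded across the bridge into the proxy's queue could look like this:

    # send every forwarded TCP packet to user space queue 0 (assumed queue number)
    iptables -A FORWARD -p tcp -j NFQUEUE --queue-num 0

    Note that bridged IPv4 traffic only traverses iptables if the kernel's bridge-netfilter support (bridge-nf-call-iptables) is active.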

    Any further initialization is done during the startup of the application, and is described in the following sections for each module separately.

    3.4.2 Module: Buffer Manager

    The proxy implementation has to buffer every TCP packet until it can be assumed that it has reached its destination host. Therefore memory in user space has to be allocated. Normally this is done with a void *malloc(size_t size) call, which returns a pointer to the allocated memory. After a packet has reached its destination, the memory could be deallocated with a void free(void *s) call.

    8 http://www.linux-foundation.org/en/Net:Bridge


    But each system call hands control back to the operating system during the runtime of the application until the system call returns. This includes a context switch from the user space application to the kernel. Each context switch takes time, because registers and states are saved. If allocation and deallocation were done for each and every TCP packet separately, this could lead to a performance problem because of the high number of context switches; there would be two context switches for each buffered packet.

The Buffer Manager avoids this issue by allocating one large memory block at initialization time. The large block is divided into small pieces, each of which is later used to store a TCP packet. Additionally, a data structure specially designed to be exchanged between the different modules of the implementation is used; such a data structure is called a chunk from now on. A chunk carries all important information about a packet: the packet itself as a pointer to the right piece in the large buffer memory block, plus some additional management information. Normally a memory block is retrieved from the operating system and returned once the data stored in it is no longer needed. But if all packets are stored in one large memory block, this block cannot be returned to the operating system during runtime: all the data would be lost, not only the data of the one packet that is no longer needed. Therefore the memory block is only returned when the application terminates.

The Buffer Manager has to keep track of which chunks are currently in use. A linked list which stores pointers to all currently unused chunks is maintained and accessed like a FIFO (First In, First Out) list.

Figure 3.6: Logical structure: Linked list used as a FIFO (chunks are RETRIEVED by a module at the front and RETURNED by a module at the back)

During initialization of the Buffer Manager, all chunks are added to the FIFO list. If another module needs a free chunk to store data, it asks the Buffer Manager. As shown in Figure 3.6, the Buffer Manager always removes the first element from the list and hands the pointer to that chunk to the demanding module. A returned chunk is appended at the end of the list.


For retrieving a free chunk from the list, the implementation simply follows and backs up the FIRST pointer (see Figure 3.7 as reference for the pointer names) to Chunk 00. At the end of the retrieve operation, the backed-up pointer to Chunk 00 is passed to the demanding module. Chunk 01 becomes the new first element of the list; therefore the value of the NEXT pointer from Chunk 00 to Chunk 01 is copied into the FIRST pointer.

Figure 3.7: Implementation: Linked list at initial state (FIRST points to Chunk 00, the NEXT pointers chain the chunks up to Chunk n, LAST points to Chunk n)

After the retrieval of Chunk 00 the linked list looks like Figure 3.8, and a pointer to Chunk 00 can be passed to the demanding module.

Figure 3.8: Implementation: Linked list after retrieval of a chunk (FIRST now points to Chunk 01; Chunk 00 is detached from the list)

For returning Chunk 00 back to the list, the implementation just follows the LAST pointer to Chunk n. The NEXT pointer of Chunk n is adjusted to Chunk 00. Finally, after adjusting the LAST pointer to Chunk 00, the return operation is finished. Figure 3.9 shows the order of list elements and pointer adjustments after retrieving and returning Chunk 00.

As a small summary: simply by using only the FIRST pointer for retrieving an element and only the LAST pointer for returning an element, no mutex is needed to protect this list against asynchronous manipulation. This avoids blocking one thread while another retrieves or returns a chunk at the same time. Having a LAST pointer also speeds up returning a chunk considerably: without it, the whole list would have to be iterated to find the last element before a new element could be appended, while with the LAST pointer it can be appended directly.


Figure 3.9: Implementation: Linked list after returning the chunk (Chunk 00 is appended behind Chunk n and LAST points to it)

Simultaneous retrieving and returning is possible without mutexes for thread synchronization, as long as there is more than one element in the list. This is an improvement whenever processing of chunks happens in parallel to the capturing of packets, which is the case for this proxy implementation.
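The following C fragment is a minimal sketch of this free list under the stated assumption of one retrieving and one returning thread; all names are hypothetical, not the thesis' actual identifiers.

#include <stddef.h>

struct chunk {
    struct chunk *next;    /* NEXT pointer of the linked list */
    char *buffer;          /* piece of the large memory block */
};

struct chunk_fifo {
    struct chunk *first;   /* FIRST: written only by the retrieving thread */
    struct chunk *last;    /* LAST:  written only by the returning thread  */
};

/* Retrieve: follow and back up FIRST (compare Figures 3.7 and 3.8). */
struct chunk *fifo_get(struct chunk_fifo *f)
{
    struct chunk *c = f->first;
    if (c == NULL || c->next == NULL)
        return NULL;       /* always keep at least one element in the list */
    f->first = c->next;    /* Chunk 01 becomes the new first element */
    c->next = NULL;
    return c;
}

/* Return: only LAST and the old tail's NEXT are touched (Figure 3.9). */
void fifo_put(struct chunk_fifo *f, struct chunk *c)
{
    c->next = NULL;
    f->last->next = c;     /* append behind the current tail */
    f->last = c;           /* adjust LAST to the returned chunk */
}

The guard keeping one element in the list is what makes retrieving and returning touch disjoint pointers, so no mutex is required.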

    3.4.3 Module: Netfilter Queue Interface

Capturing the TCP traffic is done in this module. For this purpose a callback is registered with the libnetfilter_queue library. The Linux kernel keeps a linked list of callbacks, and with the libnetfilter_queue library it is possible to create an entry in this list. Whenever a packet matches a Netfilter filter rule whose action is QUEUE, these callbacks are called one after another in registration order.

The TCP traffic filter rule is installed during the initialization of the Netfilter Queue Interface module. This happens after the registration of the callback. If the proxy application terminates for any reason, the rule is automatically deleted again. This is very important, because if a filter rule with QUEUE as action is installed and matches a packet, the Netfilter Framework asks a waiting application in user space whether this packet should be dropped. If no application is present to tell the framework that a packet has to be accepted, the packet is dropped. The following box shows the rule which is installed by the proxy.

# command to install the TCP filter rule
iptables -A FORWARD -p tcp -j QUEUE

With this rule, all forwarded (-A FORWARD) TCP traffic (-p tcp) is handled by the Netfilter Queue (-j QUEUE). The packets of interest are in the FORWARD chain of Netfilter because the Kernel Ethernet Bridge is used.


The libnetfilter_queue library supports three copy modes to transport data from kernel space to user space.

    NFQNL_COPY_NONE - Do not copy any data

    NFQNL_COPY_META - Copy only packet meta data

    NFQNL_COPY_PACKET - Copy entire packet

NFQNL_COPY_META would be enough to realize connection tracking in user space and would greatly reduce the amount of data that has to be copied: only the IP and TCP headers are transferred in this mode. This suffices to identify and track a TCP flow, but for the proxy implementation only the NFQNL_COPY_PACKET mode is the right choice. In this mode the whole packet is transferred to user space and can be buffered there by the proxy for retransmission.
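As an illustration, a minimal sketch of such a queue setup with libnetfilter_queue follows. Queue number 0 and the callback body are assumptions, not the proxy's actual code, and the exact nfq_get_payload signature differs slightly between library versions.

#include <arpa/inet.h>       /* ntohl()   */
#include <sys/socket.h>      /* recv()    */
#include <linux/netfilter.h> /* NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Called once per queued packet, in registration order. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    unsigned char *payload;
    int len = nfq_get_payload(nfa, &payload); /* whole packet (COPY_PACKET) */

    /* ... here the proxy would hand payload/len over to its modules ... */
    (void)len; (void)nfmsg; (void)data;
    return nfq_set_verdict(qh, ntohl(ph->packet_id), NF_ACCEPT, 0, NULL);
}

int main(void)
{
    struct nfq_handle *h = nfq_open();
    nfq_unbind_pf(h, AF_INET);                   /* needed on older kernels */
    nfq_bind_pf(h, AF_INET);
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff); /* copy entire packet */

    char buf[4096];
    int fd = nfq_fd(h), rv;
    while ((rv = recv(fd, buf, sizeof(buf), 0)) >= 0)
        nfq_handle_packet(h, buf, rv);           /* triggers cb() */

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}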

To capture a single packet, a blocking function of the libnetfilter_queue library must be called. The process of capturing is defined in eight steps:

    1. Fetch a free buffer chunk from the Buffer Manager

    2. Request the next packet with libnetfilter_queue (blocking)

    3. Registered callback is triggered

    4. Packet is passed to the TCP Connection Tracker

    5. Packet is passed to the Connection Manager

    6. Return to callback

    7. Return from Request the next packet

    8. Go to step 1.

Step three does not only retrieve the TCP packet from kernel space and store it in a buffer; the packet length and other information needed for further processing are gathered and stored as well. Pointers to the IP and TCP headers are calculated, set and stored in the chunk data structure as ip_header and tcp_header, which is shown in the following box Chunk data structure.


    Chunk data structure

struct buf_man_cap_buf_chunc
{
    struct list_head list;          /* linked list management */

    char *buffer;                   /* pointer to byte array with IP packet */
    unsigned int length;            /* length of whole packet */

    struct device_info out_device;  /* output device */
    struct ip *ip_header;           /* pointer to IP header */
    struct tcphdr *tcp_header;      /* pointer to TCP header */

    unsigned int tcp_seq;           /* TCP sequence number */
    int nfq_id;                     /* Netfilter Queue ID */

    enum chunc_state state;

    /* (tv_sec = seconds, tv_usec = microseconds) */
    /* set at receive AND sent */
    struct timeval timestampt;      /* general timestamp */
};

This is done in preparation especially for the TCP Connection Tracker module, but also for all other modules that need direct access to these headers. The Netfilter Queue Interface module is the first module in the processing chain of the proxy implementation, therefore it makes sense to set these pointers here. Additionally, the timestamp is set to the current time, which is the capture time with a resolution of microseconds. This time is needed to calculate the Round Trip Time (RTT) of a packet; the RTT is calculated later in the Connection Manager.

To make retransmission possible, the outgoing physical device of a packet must be known. The outgoing physical device is retrieved from Netfilter and stored in out_device of the chunk structure. Referencing a packet through the Netfilter Queue interface is done with a Netfilter ID. Later, in the TCP Connection Tracker module or in the Connection Manager module, it is decided whether a packet should be dropped, forwarded or modified, and for this the Netfilter ID is needed to reference the specific packet. This is achieved by storing the Netfilter ID in nfq_id and by providing three functions for a better handling of packets. Each function takes a pointer to a chunk, which carries the packet, as parameter.


The Netfilter ID is automatically taken from the chunk data structure by these functions.

    They are called:

int netfilter_signal_accept(struct buf_man_cap_buf_chunc *chunc);

int netfilter_signal_accept_but_modified(
        struct buf_man_cap_buf_chunc *chunc);

int netfilter_signal_drop(struct buf_man_cap_buf_chunc *chunc);

The names are largely self-explanatory, but a few words about them: netfilter_signal_accept and netfilter_signal_drop just signal Netfilter to forward or drop the packet, respectively. netfilter_signal_accept_but_modified expects the modified packet in the chunk and passes it to kernel space, where it is then forwarded.

When a packet arrives, the values of source port, destination port, SEQ number and ACK number are stored in Network Byte Order. The sequence number is looked up by several modules of the implementation, but for comparing it to another value it has to be in Host Byte Order. On x86 hosts the Host Byte Order differs from Network Byte Order: the byte order is reversed. Reversing the byte order each time the value is compared is inefficient if this happens more than once. To improve this a bit, the value of the sequence number is reversed once by the Netfilter Queue Interface module and stored in tcp_seq of the chunk data structure. After all important information is gathered and stored in the chunk data structure, the chunk is handed over to the TCP Connection Tracker module.
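A sketch of this one-time conversion, assuming the BSD-flavoured struct tcphdr with the field name th_seq (the Linux default structure names the field seq):

#include <arpa/inet.h>   /* ntohl() */
#include <netinet/tcp.h> /* struct tcphdr */

/* Reverse the byte order once at capture time, so later comparisons
 * can use the Host Byte Order value in tcp_seq directly. */
static void chunk_store_seq(struct buf_man_cap_buf_chunc *chunc)
{
    chunc->tcp_seq = ntohl(chunc->tcp_header->th_seq);
}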

    3.4.4 Module: TCP Connection Tracker

The TCP Connection Tracker module identifies a TCP flow and looks up the corresponding management data structure. If the flow is unknown to the proxy, a new management data structure is created and initialized. Tracking of TCP flows is stateful: a state per flow is maintained and only transitions according to the TCP standard are allowed. Packets that violate a stateful transition are ignored. New flows are only picked up for tracking during their connection establishment. Picking up an already established flow would be possible with the implemented state machine, but is not yet implemented by the rest of the proxy.


After identification of a flow, the newly arrived packet is added to a per-flow cache, and the packet and the corresponding management data structure are handed over to the Connection Manager module. Hashing is used for fast lookup and identification of a flow. The implementation of the primary hash function was taken from the Linux kernel. It is called jhash^9 and was developed by Bob Jenkins. jhash is fast and mixes its input well; it was also published in the well-known Dr. Dobb's^10 computer magazine.

During initialization of the module, an array which is used as hash table is created. The elements of the hash table are called buckets; each bucket is a linked list used for storing pointers to the management structures of TCP flows. A primary and a secondary hash function are used to distribute data in the hash table. The primary hash function is jhash; the secondary function is a modulo operation with the size of the hash table as divisor. A prime number is chosen as the size of the hash table. Using a prime number as table size together with a modulo operation as secondary hash function is a good idea, because it minimizes clustering in the hash table (see 3.4.4.1).

The TCP State Machine is imported from the Netfilter Framework and described further in section 3.4.6.1. Netfilter implements a TCP state machine to provide stateful packet inspection for the Linux kernel packet filter. The reason for taking or adapting source code from other implementations is that these implementations have proven their value and many eyes have looked at their source code. It is not wise to reinvent the wheel each time and probably repeat the mistakes other people already made in the past.

    3.4.4.1 Prime Numbers for the Secondary Hash Function

Normally we tend to use a value of 2^n as size for arrays, because we like and know these numbers: a programmer has a good feeling for the proportion of such values and can compare them with the sizes of memory or hard disks of a PC. Let us call the size of the table S and the result of the primary hash function H. The secondary hash function is then (H mod S). What makes (H mod S) a good distributing hash function?

Let the size S be divisible by 2; remember that 2^n matches this specification. Then any time H is divisible by 2, (H mod S) will be divisible by 2. Conversely, any time H is not divisible by 2, (H mod S) will not be either.

^9 http://burtleburtle.net/bob/hash/
^10 http://www.ddj.com/


This means that by applying the secondary hash function, even numbers hash to even indexes and odd numbers hash to odd indexes. If S were also divisible by 3, then multiples of 3 would hash to multiples of 3 and non-multiples of 3 would hash to non-multiples of 3. We would expect half the numbers to be even and the other half to be odd. Unfortunately, this is unlikely, because a sample set tends to be biased, especially a small one. The secondary hash then perpetuates this bias instead of reducing it by mixing the values. Therefore it is generally better to use a prime number as the size of a hash table, because it has no divisors other than 1 and itself.
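A small worked example of this effect: suppose all occurring hash values happen to be even. A table size of S = 8 then only ever uses even indexes, while the prime size S = 7 spreads the values over all indexes:

H:        2  4  6  8  10  12  14
H mod 8:  2  4  6  0   2   4   6   (only even indexes are hit)
H mod 7:  2  4  6  1   3   5   0   (all indexes are reachable)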

    3.4.4.2 Identify a TCP flow

A flow or connection is identified by a tuple of source IP address (srcIP) and destination IP address (dstIP), and especially by the source port (srcPort) and destination port (dstPort) numbers.

flow tuple = (srcIP, dstIP, srcPort, dstPort)

For IP version 4 (IPv4), srcIP and dstIP are 32 bit values; srcPort and dstPort are 16 bit values.

Sample for an IP address:

IP address (4 bytes)   binary as octets                      unsigned 32 bit integer
130.94.122.195         10000010 01011110 01111010 11000011   2187229891

Normal PCs mostly work in x86 mode, which implies binary compatibility with the 32-bit instruction set: each machine instruction can take a 32-bit operand. Working with 32-bit operands is the most effective use of this hardware, therefore the proxy implementation treats IP addresses and TCP ports as unsigned 32-bit values. This makes, for example, comparisons more efficient than comparing four bytes separately (four bytes as in the usual dotted notation).

The proxy sees packets of one flow in two directions: original sender to original receiver, and the responses. This means srcIP and dstIP are exchanged for the response direction, as are srcPort and dstPort.


This is a problem, because all packets of the original direction would produce a different hash value than those of the reply direction. Two hash values would mean searching two hash buckets for the corresponding management data structure. Maintaining two management structures, one for each direction of a flow, would also make it more difficult to keep track of which packets are already acknowledged and can be deleted from the proxy buffer.

Feeding the values into the jhash function in a special way solves this issue. The jhash function, which is used as primary hash function, is designed to take one, two or three unsigned 32 bit values as arguments. For the proxy implementation the three-argument version is chosen. srcIP and dstIP are the first two arguments, 32 bit each; whichever of srcIP and dstIP is larger as an unsigned 32 bit value is passed as the first argument to jhash. The third 32 bit value is calculated by adding the two 16 bit values srcPort and dstPort. Adding two 16 bit values yields at most a 17 bit value, which fits easily into the 32 bit third argument of jhash.

Applying this order to the arguments is the first half of the trick, because it always produces the same input arguments for jhash, and therefore the same output, for packets of both directions. The second half is summing the two TCP ports: addition is commutative, so adding the two 16 bit values srcPort and dstPort always produces the same result value, regardless of the packet's direction.

Taking only srcIP and dstIP as input for the primary hash function would be enough. Taking srcPort and dstPort into account as well gives a better distribution over the hash table, especially if one IP host has many simultaneous connections on different TCP ports, and a better distribution results in shorter lookup times. The following C code fragment shows how easily the primary and secondary hash function can be implemented. SIZE represents the hash table size and HASH_RND a constant, which jhash demands for better mixing.

port_val = srcPort + dstPort;

if ((unsigned int) srcIP > (unsigned int) dstIP)
{
    return jhash_3words(srcIP, dstIP, port_val, HASH_RND) % SIZE;
}
else
{
    return jhash_3words(dstIP, srcIP, port_val, HASH_RND) % SIZE;
}


The TCP Connection Tracker module mainly executes the following steps:

    1. Get chunk from Netfilter Queue Interface module

    2. Generate hash value from packet headers (IP and TCP like described)

    3. Search hash bucket identified by hash value

    4. Identify (or create new) management data structure in hash bucket

    5. Add chunk to the per-flow retransmission buffer

    6. Pass chunk and management data structure to Connection Manager

    7. Go to 1.

All steps up to four are covered by the previous parts of this section. Step four has two different behaviors.

Figure 3.10: TCP 3-way handshake between Client and Server (syn seq=x; syn ack=x+1 seq=y; ack=y+1 seq=x+1; followed by data)

For known TCP flows, the module proceeds directly to step five. Packets from unknown connections are only accepted if the packet is the first packet of a 3-way connection handshake (see Figure 3.10); new management data structures are only created and initialized for these packets. In the figure, this packet is marked by the arrow from Client to Server labeled syn seq=x. Ignoring other packets is not an implementation fault but a design decision: accepting them would mean picking up an established flow instead of only new ones, and then some information could only be guessed. An example is which TCP options are supported by the communicating hosts. This information is very important for TCP Snoop, which acts differently if the TCP SACK option is not supported by the hosts. The implementation and the implemented TCP state machine are prepared for picking up established connections. This functionality could be added in future work, but is not necessary for normal operation.

A management data structure carries one sub data structure for each direction of a flow; this makes two per flow.


    The sub data structure carries the following information:

    Count of retransmitted packets (for statistics)

    Count of duplicate ACKs (to detect lost packets)

    First sequence number seen for this direction (to detect wrap arounds)

    Last sequence number seen for this direction (to detect duplicate packets)

    Last acknowledge number seen for this direction (to detect new ACKs)

    Last end (Last sequence number + length of TCP data)

    Last TCP window advertisement seen for this direction

    RTT (for retransmission timers)

    TCP Window scale factor

    Supported TCP options

    These sub data structures must be initialized for each new connection.
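A hypothetical C sketch of such a per-direction sub data structure; the field names are illustrative assumptions, not the thesis' identifiers:

#include <sys/time.h>

struct flow_direction_info {
    unsigned int retransmitted;  /* count of retransmitted packets (statistics) */
    unsigned int dup_acks;       /* count of duplicate ACKs (lost packet detection) */
    unsigned int first_seq;      /* first sequence number seen (wrap-around detection) */
    unsigned int last_seq;       /* last sequence number seen (duplicate detection) */
    unsigned int last_ack;       /* last acknowledge number seen (new ACK detection) */
    unsigned int last_end;       /* last_seq + length of TCP data */
    unsigned int last_window;    /* last TCP window advertisement seen */
    struct timeval rtt;          /* measured RTT (for retransmission timers) */
    unsigned char window_scale;  /* TCP window scale factor */
    unsigned int options;        /* bitmask of supported TCP options (e.g. SACK) */
};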

After this initialization, or if the connection is already known, the TCP Connection Tracker continues with step five (see page 38), which is described in the next section 3.4.4.3. Step six forwards the chunk and management data structure to the Connection Manager. The Connection Manager completes the stateful tracking with the raw information gathered by the TCP Connection Tracker; its processing is described in 3.4.6.

    3.4.4.3 Retransmission Buffer

Most of the buffered packets are never retransmitted; only a few are. This calls for a quick lookup and release from the buffer once a packet has been acknowledged by the receiver. As we already know, packets are only exchanged between functions and modules of the implementation by passing pointers to chunks; the chunks encapsulate the pointer to the actual packet together with some additional information about it. An intuitive approach to look up and free all acknowledged packets would be to iterate a single global list of all currently buffered packets.


But iterating over packets of other connections is very inefficient, therefore the proxy implements a per-flow cache. It consists of two linked lists that store pointers to chunks, ordered ascending by the TCP sequence number of the packets, one list per communication direction. Adding a packet to the per-connection cache takes nearly the same effort as adding it to a global list of all currently buffered packets: the communication direction and corresponding management structure are already known at step six (see page 38), so it is just the effort of adding an element to a linked list, except for keeping the list ordered. The ordinary case, as defined in the TCP standard, is that packets arrive with rising sequence numbers. This means appending at the end of the list in the ordinary case, because the packets are already sorted by sequence number; only delayed packets have to be sorted in and cause some extra work. The ascending order makes freeing acknowledged packets easy: simply iterate the corresponding list from the beginning and free packets until one with a sequence number greater than or equal to the acknowledge number is reached. In summary, packets are grouped by flow, additionally by direction, and ordered by sequence number. This allows an efficient lookup, which leads to good performance.
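A minimal sketch of this release step; cache_first, cache_remove_first and buffer_return are hypothetical helpers, and the sequence comparison is written to be safe against wrap-around:

#include <stdint.h>

/* RFC 793 style comparison, robust against sequence number wrap-around */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Free buffered packets covered by 'ack' from the ascending
 * per-direction list. */
static void release_acked(struct flow_direction *dir, uint32_t ack)
{
    struct buf_man_cap_buf_chunc *c;

    while ((c = cache_first(dir)) != NULL && seq_before(c->tcp_seq, ack)) {
        cache_remove_first(dir);  /* list is ordered by tcp_seq */
        buffer_return(c);         /* chunk goes back to the Buffer Manager FIFO */
    }
}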

    3.4.5 Module: Timer Manager

The Timer Manager is mainly a helper module for the Connection Manager. During initialization, it installs an operating system callback: the operating system sends a timer signal at a given interval, which is caught by the signal handler of the proxy application and forwarded to the timer callback.

Intervals, types, count and organization of the timers are mainly based on ideas from the BSD TCP/IP stack [Stevens, chapter 25] and a TCP/IP stack for embedded systems. The BSD TCP/IP stack was used as a reference; it is very similar to the Linux TCP/IP stack and a good source of information about TCP timers. As in the BSD stack, there is a fast and a slow timer. The fast one is triggered every 200 milliseconds and the slow one every 400 milliseconds. This differs from the BSD stack in the value for the slow timer: BSD uses an interval of 500 milliseconds. The reason for using 400 milliseconds in this implementation is a more efficient way to implement the timer handling in user space. An application can install only one timer signal, which would mean having only one interval. To solve this, the proxy installs a 200 milliseconds timer and derives the 400 milliseconds timer by doubling it: an alternating function which produces the sequence 0, 1, 0, 1, 0, ..., 1, 0 is evaluated on every timer signal and its output is checked for being zero or non-zero.


    This function is very simple and defined as follows:

    value = value XOR 1

The check for zero and the XOR can be translated by a compiler into very few machine instructions, which makes this very efficient: the XOR instruction plus a conditional jump instruction for the value check. Efficiency matters here, because the timer handler runs every 200 milliseconds.
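A sketch of this doubling trick with a POSIX interval timer; the handler and setup names are assumptions for illustration:

#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

static void run_fast_timer(void) { /* ... 200 ms actions ... */ }
static void run_slow_timer(void) { /* ... 400 ms actions ... */ }

/* SIGALRM handler: fast timer on every tick, slow timer on every
 * second tick via the 0, 1, 0, 1, ... alternating value. */
static void timer_tick(int signo)
{
    static unsigned int value = 0;
    (void)signo;
    run_fast_timer();
    value ^= 1;               /* value = value XOR 1 */
    if (value == 0)
        run_slow_timer();     /* every second 200 ms tick = 400 ms */
}

int main(void)
{
    struct itimerval iv = { { 0, 200000 }, { 0, 200000 } }; /* 200 ms */
    signal(SIGALRM, timer_tick);
    setitimer(ITIMER_REAL, &iv, NULL);
    for (;;)
        pause();              /* wait for timer signals */
}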

The Timer Manager offers a simple API to the other modules for creating timer events. There are predefined timer actions from which the other modules can choose. Currently implemented actions are:

    Retransmission, which retransmits a TCP packet.

    Timewait, which is used during a special state of the implemented TCP state machine.

    Timeout, which is used to detect timed out TCP flows (no data sent any more).

By utilizing a general purpose pointer in the data structure that stores the information about a timer event, all actions can use the same data structure. This makes it simple to extend the Timer Manager with more actions and makes the management of the different timer event types more efficient. Casting from the general purpose pointer to the actual data type of the current action is only done when the action has to be triggered. This simplifies and shortens the loop which checks whether an event action should be triggered.

The timer events are stored in an ordered linked list, one list for each of the two timers. Events are ordered by their timestamp, which defines when the timer action should be triggered. Ordering the events raises the effort of creating a new entry, because the right place in the list has to be found. But under the assumption that mainly retransmission timer events are created, the position is in most cases at or very near the end of the list. The main reasons are the sequential processing of the TCP packets and the timeout calculation: a timeout is defined as the current time plus a multiple of the round trip time of a TCP packet, so the timeout values grow sequentially. Insertion into the list is therefore done by starting at the end and iterating backwards to the correct position, which is efficient under the previous assumption. Searching for events that have to be triggered is done by iterating forward from the start of the list until the timestamp of the currently checked event lies beyond the current time. This is also very efficient, because only the events that have to be triggered, plus one, are iterated; future events do not have to be checked.
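A sketch of the backwards insertion, using the timercmp macro from <sys/time.h>; the timer_list/timer_event types and the list_insert_after helper are hypothetical:

#include <sys/time.h>
#include <stddef.h>

/* Insert a timer event into the ascending list by scanning backwards
 * from the tail; retransmission timeouts usually belong at the end. */
static void timer_insert(struct timer_list *tl, struct timer_event *ev)
{
    struct timer_event *pos = tl->last;

    while (pos != NULL && timercmp(&pos->expires, &ev->expires, >))
        pos = pos->prev;            /* step towards the head */

    list_insert_after(tl, pos, ev); /* pos == NULL means new head */
}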


    3.4.6 Module: Connection Manager

After the TCP Connection Tracker module has looked up the corresponding management structure for a packet, both are passed to this module, the Connection Manager. Its functionality can be described as a lightweight TCP stack, or as the TCP connection tracking part of a stateful firewall. As a reference for a standard TCP stack and how it works, the BSD implementation was used; it is described in the book TCP/IP Illustrated, Volume 2: The Implementation [Stevens]. Ideas for connection tracking are also taken from the Netfilter implementation and a paper by Guido van Rooij [Rooij]. The following text describes connection tracking for TCP Snoop, not general purpose connection tracking.

The main tasks of this module are updating the state of a flow, calculating the RTT if possible, installing the retransmission timers, detecting gaps in the sequence numbers and detecting duplicate ACKs. During the development of the TCP proxy the question arose whether to keep one management structure per communication direction or only one for the whole TCP flow. From the point of view of the TCP Connection Tracker module, two management structures make sense, because they simplify the hashing: management structures for different communication directions are hashed to different hash buckets. Caching could also be realized with the simplified hashing, because for one direction only the management structure of this direction has to be looked up to find a packet for retransmission.

But ignoring the communication direction of a packet leads to serious trouble. It is not sufficient to simply store each and every TCP packet, because the buffer of the TCP proxy would eventually overflow; acknowledged packets have to be released from the buffer again. These ACKs are sent by the receiver of a packet, which means they arrive from the opposite direction. For the design with two management structures per flow, this means an additional lookup of the second management structure to check whether a previously received packet was acknowledged and can be released from the buffer: ACKs can only be found in the management structure of the opposite direction. In general, both management structures would have to be looked up for each new packet of a flow. Maintaining only one management structure with a slightly more costly hash function therefore makes sense, because one lookup should always be cheaper than two.

An additional argument for keeping only one management structure is that state transitions of the state machine for a specific flow can and must be triggered by traffic from both directions.


The flow itself can have only one state, therefore this implementation maintains only one state in one structure, which corresponds to the TCP standard. There is no need to store the state twice or to maintain two different states.

    The processing of a single packet can be described as follows:

    1. Calculate an index value from TCP flags for the state machine

    2. Determine new state of the flow with the index value and state machine

    3. Check SEQ number of the current packet

    4. Check if ACK is present

    Check if it is a duplicate ACK

    Calculate RTT if possible

    Release acknowledged packets from cache

    5. Update state and other values in the management structure

    6. Install retransmission timer

    7. Signal Netfilter to forward packet

    8. Return control to capture module
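As an illustration of the RTT calculation in step 4, a sketch that measures the time between a chunk's capture timestamp (the timestampt field set by the Netfilter Queue Interface) and the arrival of its ACK; the surrounding names are assumptions:

#include <sys/time.h>

/* One RTT sample: now minus the capture time of the acknowledged packet. */
static void sample_rtt(const struct buf_man_cap_buf_chunc *acked,
                       struct timeval *rtt)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    timersub(&now, &acked->timestampt, rtt);
}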

    3.4.6.1 Stateful tracking

Firstly, for realizing stateful connection tracking, a representation of the state of a TCP flow is needed. This is implemented as a numeric value which represents the current state of a finite state machine. The state machine used in this implementation is shown in figure 3.11 on page 47. It is a simplified version to give an easier overview: arrows for state transitions with the same attributes but opposite communication direction are combined into one arrow.

Transitions to other states are triggered by flags in the TCP header (see figure 2.1 as reference). Flags which can lead to transitions are Synchronize (SYN), Reset (RST), Acknowledge (ACK) and Finish (FIN). Every time a new packet arrives its