

Scheduling to Minimize the Worst-Case Loss Rate

Mahmoud Elhaddad† Hammad Iqbal‡ Taieb Znati†,‡ Rami Melhem†

†Department of Computer Science   ‡Department of Information Sciences and Telecommunications

University of Pittsburgh

Abstract

We study link scheduling in networks with small router buffers, with the goal of minimizing the guaranteed packet loss rate bound for each ingress–egress traffic aggregate (connection). Given a link scheduling algorithm (a service discipline and a packet drop policy), the guaranteed loss rate for a connection is the loss rate under worst-case routing and bandwidth allocations for competing traffic. Under simplifying assumptions, we show that a local min-max fairness property with respect to apportioning loss events among the connections sharing each link, and a condition on the correlation of scheduling decisions at different links, are two necessary and (together) sufficient conditions for optimality in the minimization problem. Based on these conditions, we introduce a randomized link-scheduling algorithm called Rolling Priority (RP), where packet scheduling at each link relies exclusively on local information. We show that RP satisfies both conditions and is therefore optimal.

1 Introduction

Due to increasing data rates, and the drive towards constructing photonic packet switches with integrated optical packet buffers, Internet routers are expected to have limited buffering capacity in the near future [3, 9, 11]. Unfortunately, when the buffer size at a link (router port) is limited to dozens of packets, the packet loss rate at that link can be as high as $10^{-2}$ under light load. Recent research [9] has shown that loss-sensitive TCP flows traversing a single work-conserving link having a small buffer are able to withstand high loss rates and achieve good link utilization, under assumptions that limit the contribution of each flow to the total link load.¹ However, several questions regarding the performance of networks with small router buffers remain open. This research is motivated by one question that is critical to the usability and dependability of such networks:

¹ The packet loss rate at a link is the number of lost packets as a fraction of the total number of packets, averaged over some interval of time. The packet loss rate of a flow or flow aggregate over the path it traverses is defined in a similar way.

What statistical guarantees on the packet loss rate for users (flows or aggregates thereof) can be supported by a network having small router buffers, without severely restricting the maximum allowable link utilization or the path length?

Given the load at the network links and the link buffer capacities, the loss rate along a network path is determined by three factors: (1) the packet arrival process, (2) the packet size distribution, and (3) the scheduling algorithm (service discipline and packet drop policy) used at the links. The effect of variability in the arrival process, and the benefit of limiting burstiness by regulating the arrival process, are well studied and understood [13]. Similar results are known for the distribution of packet sizes; constant packet sizes are desirable when the objective is to minimize the rate of buffer overflows at a link. In contrast, there are only a few known results concerning the performance of scheduling algorithms in networks with fixed-size buffers, where the objective is to minimize the loss rate [2].

Motivated by the above question, we study the problem of link scheduling to minimize the worst-case packet loss rate bound for each ingress–egress traffic aggregate (connection), given its path length and the load at the links. For practical relevance, we restrict our investigation to algorithms that are work-conserving², and local in the sense that scheduling decisions are based only on local information. The FCFS/DT algorithm, combining the First-Come-First-Served discipline and the Drop-Tail policy, is the most common example of a local work-conserving algorithm.

The networks under consideration have time-slotted links with fixed slot size (in bits) and possibly different link capacities. Incoming packets at the ingress routers are classified into ingress–egress traffic aggregates (connections) and, given the loss rate minimization objective, they are packed into time-slot-sized packets before being injected into the network. Links are output buffered, as is commonly assumed in the literature on scheduling in packet-routing networks. It should be noted that results for networks with

² A work-conserving algorithm is one that never leaves the link idle while there are packets in the buffer, and never drops packets when there is room in the buffer.


Output Queued routers carry over to networks with Combined Input-Output Queuing via the results in [7].

1.1 Results and Contributions

Under general assumptions on the packet arrivals due to each connection, we show that a local min-max fairness property with respect to apportioning loss events among the connections sharing each link, and a condition on the correlation of scheduling decisions at different links, are necessary and together sufficient for optimality in the minimization problem. Locally min-max fair algorithms are referred to simply as locally-fair. The correlation property refers to packets having consistent “priorities” at every hop, in such a way that the fraction of packets experiencing low loss rate throughout the path is maximized for each connection.

Based on the optimality conditions, we introduce and analyze a randomized link-scheduling algorithm called Rolling Priority (RP), where packet scheduling at each link relies exclusively on local information. We show that RP satisfies both conditions and is therefore optimal. Furthermore, we find that the algorithm combining FCFS with the Random Drop policy (FCFS/RD) is locally fair. Using simulation, we show that under heavy load the guaranteed loss rate under FCFS/RD deteriorates much faster as a function of path length than under the optimal algorithm. We provide simulation examples comparing the performance of RP and FCFS/RD to FCFS/DT. The results confirm that locally-fair scheduling algorithms impose significantly fewer restrictions on connection routing (hence network utilization) under light and moderate load.

1.2 Related Work

To the best of our knowledge, this work is the first to investigate local scheduling algorithms for providing per-session loss guarantees in packet networks with fixed-size buffers. Until recently, the performance of scheduling algorithms in packet networks has mostly been investigated in terms of packet delay and stability (i.e., boundedness of backlog) guarantees. These studies, for example [4, 5, 12], have led to valuable insights into the behavior of service disciplines such as FCFS and Processor Sharing. However, in investigating delay and stability, the packet network is modeled as a queuing network where communication links are represented by servers with infinite waiting room, which limits the value of the resulting algorithmic guarantees when applied to networks with small buffers.

Although delay and stability guarantees lead to bounds on buffer occupancy that can be leveraged in dimensioning buffer capacities at the links to prevent (or at least bound) packet loss, the occupancy bounds are often dependent on network parameters, such as the diameter of the network and link capacities, which are impractical to track in today's large decentralized networks. More importantly, by relying on such bounds for buffer dimensioning, one would be ignoring the technological constraints on buffer capacity which recently arose due to increasing link speeds [3], and the drive towards constructing photonic packet switches with integrated optical packet buffers [6, 9].

The paper is organized as follows. In the next section we present the formulation of the loss rate minimization problem and state our assumptions regarding the traffic arrival process. In Section 3, we define local fairness and show that it is a necessary condition for optimality in the minimization problem; we also discuss why it is not readily satisfied by every link scheduling algorithm. We then identify another necessary condition, on the statistical correlation of scheduling decisions, and establish that together the conditions are sufficient for optimality. In Section 4, we present the Rolling Priority algorithm. Analysis of Rolling Priority is presented in Section 5, where we establish its optimality. In Section 6 we present the simulation results, followed by concluding remarks in Section 7. Due to space constraints, we include only essential details. The interested reader is referred to [8] for omitted proofs and details, an analytical characterization of the performance of FCFS/RD, numerical results, and additional simulation results.

2 Preliminaries and Problem Statement

2.1 Preliminaries

Consider a link l shared by a nonempty set of connections and suppose packets arrive at l according to a known stochastic arrival process jointly defined for all connections. Let $A^l_c(t_1, t_2)$ be a random variable representing the number of packet arrivals due to a connection c during the interval $[t_1, t_2)$, and let $X^{l,G}_c(t_1, t_2) \le A^l_c(t_1, t_2)$ be the number of packet losses at l among the $A^l_c(t_1, t_2)$ arrivals if the link uses scheduling algorithm G. The loss rate for connection c over the same interval is given by

$$R^{l,G}_c(t_1, t_2) \triangleq \frac{X^{l,G}_c(t_1, t_2)}{A^l_c(t_1, t_2)}.$$

For a given packet injection sequence s (a sample path of the joint packet arrival process), the number of packets lost from connection c is denoted $X^{s,l,G}_c(t_1, t_2)$. The value of $X^{s,l,G}_c(t_1, t_2)$ is deterministic if G employs deterministic service and drop policies. Otherwise it is a random variable that reflects the random choices of the algorithm. The loss rate under packet injection sequence s is similarly denoted $R^{s,l,G}_c$ and is by definition a random variable (rv) whenever $X^{s,l,G}_c(t_1, t_2)$ is an rv. We extend the notation to multihop paths by removing the superscript specifying the link and using the time interval to indicate the time of injection into the network. For example, we denote by $R^G_c(t_1, t_2)$ the loss rate among connection c packets injected into the network


in the interval $[t_1, t_2)$. Throughout the paper we make the following assumptions:

Assumption 1. Consider an arbitrary connection c routed along a path π and let $\Delta^l_\pi$ be the propagation delay along π up to link l. For each sufficiently large interval $[t_1, t_2)$, the number of connection c's packets injected at the ingress in $[t_1, t_2)$ that arrive at any $l \in \pi$ after $t_2 + \Delta^l_\pi$ is negligible.

For Assumption 1 to hold, the length of the interval must be large compared to the buffer size, so that link busy periods rarely extend beyond the end of the interval. It rules out adversarial sources which, for any decomposition of the time axis into contiguous intervals, may choose to inject packets only at the end of some or all intervals.

Assumption 2. Let G and G′ be any two work-conserving algorithms. Then for any packet injection sequence, exchanging G and G′ at any network link does not increase the aggregate loss rate (over all connections) at any network link.

Assumption 2 is justified for work-conserving algorithms in a network with small buffers. When buffers are large, a change of scheduling algorithm may introduce burstiness that increases the loss rate at downstream links.

Suppose we fix a packet injection sequence (a sample path of the arrival process) and exchange the scheduling algorithm at every hop along the path of a connection with one that better favors it in service and drop decisions, thus improving the loss rate at every link. Given Assumption 2, the overall (path) loss rate of the connection would improve. For example, consider a connection c that has the shortest path among the connections sharing links in its path, and suppose that all links originally use the Furthest-To-Go algorithm, which favors packets from connections traversing long paths. Replacing Furthest-To-Go with Nearest-To-Go would improve the overall loss rate for connection c. This intuitive result can be established by induction along the path of the connection, observing that the total number of packets lost up to any link along the path is higher under the algorithm with the higher loss rate at every link. Since the result applies to any packet injection sequence, it also applies in expectation. This property is stated in the following lemma, whose proof is omitted:

Lemma 1. For any connection c and scheduling algorithms G and G′, suppose that for packets injected by c during interval I, $E[R^{l,G}_c(I_l)] \le E[R^{l,G'}_c(I_l)]$ at every link l along c's path, where $I_l$ is the interval obtained by shifting the boundaries of interval I by $\Delta^l_\pi$. Then $E[R^{G}_c(I)] \le E[R^{G'}_c(I)]$.

Now we formally state the loss rate minimization problem.

2.2 The minimization problem

For a given algorithm G, a fixed packet injection sequence s, and an interval $I = [t_1, t_2)$, define $M^{s,G}_c(I)$ as the loss rate of connection c when, at every link l along the path of the connection, $R^{s,l,G}_c(I)$ is the maximum loss rate at l among all connections sharing the link. Using an argument similar to that of Lemma 1, $M^{s,G}_c(I)$ is at least as large as $R^{s,G}_c(I)$. It is a tight bound in the sense that $M^{s,G}_c(I) = R^{s,G}_c(I)$ when G is an algorithm that always favors connection c less than the competing connections. The most obvious example is scheduling based on preassigned connection priorities, where all packets from a given connection have the same fixed priority. FCFS/DT (drop tail) can be considered an implicit priority algorithm when traffic arrivals at a link are partially synchronized, such that packets from one or more connections tend to arrive to a full buffer. We will revisit this issue of scheduling bias under FCFS/DT, and how common it is, in the next section.

In general form, the loss rate minimization problem is stated as follows: find an algorithm G such that, for every network connection c and interval I, $E[M^G_c(I)]$ is minimum. The expectation is defined over the distribution of the joint arrival process and the choices of algorithm G, if it is randomized. We limit our attention to the case where, for individual connections, the arrival process at every link is a generalization of the Poisson process. Specifically, we assume that, at every link, the packet arrival process due to every connection satisfies the following assumption:

Assumption 3. There exists $T_0 > 0$ such that, for each connection, the variables representing the numbers of packet arrivals over disjoint intervals of length $T_0$ at any link are independent and follow an identical probability distribution.
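As a concrete instance (an illustrative sketch, not from the paper; all parameters are invented), a slotted Bernoulli source satisfies Assumption 3: its arrival counts over disjoint length-$T_0$ windows are i.i.d. Binomial($T_0$, p).

```python
import numpy as np

rng = np.random.default_rng(0)

T0 = 100        # window length of Assumption 3, in slots (illustrative)
p = 0.3         # per-slot arrival probability for one connection
windows = 10    # number of disjoint length-T0 windows to sample

# One packet per slot with probability p, independently across slots;
# the counts over disjoint length-T0 windows are then i.i.d.
# Binomial(T0, p), which is exactly what Assumption 3 requires.
slots = rng.random(windows * T0) < p
counts = slots.reshape(windows, T0).sum(axis=1)
print(counts)
```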

Let $\mathcal{I}_n$, $n \ge 1$, denote the set of time intervals of length $nT_0$, and let $C$ be the set of connections in the network. We define the set ML(n) of (work-conserving, local-control) algorithms that minimize the loss rate bound $E[M^G_c]$ over all intervals in $\mathcal{I}_n$ for every connection as:

$$\mathrm{ML}(n) = \{G : G \text{ minimizes } E[M^G_c(I)]\ \forall I \in \mathcal{I}_n, \forall c \in C\},$$

where the expectation is defined over the random choices of the algorithm at the different links and the distribution of the joint arrival process. We refer to algorithms in ML(n) for a given n as optimal algorithms for the minimization problem. In the next section, we seek necessary and sufficient conditions for optimality under Assumptions 1–3.

3 Optimality Conditions

In this section, we identify necessary and sufficient conditions that must be satisfied by every algorithm in ML(n), n ≥ 1.

3.1 Local fairness

Here, we relate the minimization problem above to a problem of fairly apportioning losses among connections


sharing each link. We find that fair apportioning of losses at every link is necessary for minimizing the expected loss rate for all connections in a network with small buffers.

Although at any given link l the expected loss rate $E[R^{l,G}(I)]$ (aggregated over all connections) is the same under all work-conserving algorithms, connections may have different expected loss rates at the link. For example, under the Nearest-To-Go scheduling algorithm, connections with the fewest remaining hops to traverse will have a smaller loss rate at the expense of other connections.

We define the set LF(n) of “locally-fair” algorithms over intervals of length $nT_0$ as follows:

Definition 1. (Local fairness) Consider a link l shared by N connections 1, 2, . . . , N and employing a work-conserving scheduling algorithm G. Then $G \in \mathrm{LF}(n)$ if, for all $I \in \mathcal{I}_n$, G minimizes the maximum expected loss rate among all connections. That is:

$$\mathrm{LF}(n) = \{G : G \text{ minimizes } \max_{1 \le i \le N} E[R^{l,G}_i(I)]\},$$

where the expectation is defined over the joint arrival distribution and the decisions of the algorithm.

It is easy to see that if $G \in \mathrm{LF}(n)$ then $G \in \mathrm{LF}(n')$, where n′ is an integral multiple of n. In particular, if $G \in \mathrm{LF}(1)$ then $G \in \mathrm{LF}(n)$ for every n ≥ 1.

The Nearest-To-Go algorithm is clearly not in LF(n) for any n. Perhaps more subtly, neither is the FCFS/DT (drop-tail) algorithm. FCFS/DT suffers from a phenomenon referred to as “traffic-phase effects,” where some connections may persistently experience a much higher loss rate than other connections sharing the link. This phenomenon, in some instances called the buffer-lockout problem, is due to synchronization of traffic from different connections, which occurs when the link is multiplexing regulated traffic or TCP flows [10]. Thus FCFS/DT is generally not in LF(n) for any n. However, FCFS/DT ∈ LF(1) under Poisson packet arrivals. This is due to the PASTA property, which entails that the expected loss rate is identical for all connections, equal to the expectation of the aggregate loss rate at the link.³
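The buffer-lockout effect, and the way a random-victim policy (the Random Drop policy introduced below) evens it out, can be reproduced in a toy slotted simulation. The following is a minimal sketch under invented parameters, not the paper's setup: four connections inject in the same slot each period and persistently overflow the buffer, so Drop-Tail with a fixed tie-break order always sacrifices the same two, while Random Drop spreads the losses.

```python
import random

def one_link(policy, offsets, P, B, T=100_000, seed=1):
    """Slotted single-link simulation (illustrative sketch).

    Connection i injects one packet every P slots at offset offsets[i];
    the link serves one packet per slot from a buffer of capacity B.
    'DT' drops excess arrivals in a fixed order; 'RD' drops a uniformly
    random victim among the buffered packets and the arrival.
    Returns per-connection loss rates.
    """
    rng = random.Random(seed)
    q, lost, sent = [], [0] * len(offsets), [0] * len(offsets)
    for t in range(T):
        for i, o in enumerate(offsets):
            if t % P != o:
                continue
            sent[i] += 1
            if len(q) < B:
                q.append(i)
            elif policy == 'DT':
                lost[i] += 1                # the arrival itself is dropped
            else:                           # RD: each candidate w.p. 1/(B+1)
                j = rng.randrange(len(q) + 1)
                if j == len(q):
                    lost[i] += 1            # the new arrival is the victim
                else:
                    lost[q[j]] += 1         # a buffered packet is replaced
                    q[j] = i
        if q:
            q.pop(0)                        # serve one packet per slot
    return [l / s for l, s in zip(lost, sent)]

# Connections 0-3 are synchronized at offset 0; 4-5 arrive at offset 2.
offsets = [0, 0, 0, 0, 2, 2]
print('DT:', [round(r, 2) for r in one_link('DT', offsets, P=4, B=2)])
print('RD:', [round(r, 2) for r in one_link('RD', offsets, P=4, B=2)])
# DT: the same two synchronized connections lose every packet (rate 1.0);
# RD: the same aggregate loss is spread, so the maximum rate is far lower.
```

The aggregate loss rate is identical in both runs, as Assumption 2 suggests for work-conserving policies; only its apportionment among the connections differs.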

The following result states that an optimal scheduling algorithm for the minimization problem must be locally fair. It is obtained using an exchange argument and the application of Lemma 1.

Theorem 1. For every n ≥ 1, ML(n) ⊆ LF(n).

³ Possible correlation of traffic arrivals among subsets of connections prevents defining local fairness as the case where all connections have the same expected loss rate. Clearly, connections with synchronized arrivals would have a higher expected loss rate under any work-conserving algorithm compared to a connection whose arrivals are negatively correlated with, or independent of, the arrivals of other connections.

Two questions naturally arise: (1) whether there are work-conserving algorithms that are locally fair for arrival processes satisfying Assumption 3 (i.e., in LF(n) for some n ≥ 1), and (2) whether local fairness is a sufficient condition for optimality in the minimization problem. As we shall see shortly, it turns out that FCFS/RD (random drop) is in LF(n) for any n ≥ 1. In the next subsection we show that local fairness is not a sufficient condition for optimality, by identifying an additional necessary condition that is not satisfied by FCFS/RD. Together, the conditions are shown to be sufficient for optimality.

Under FCFS/RD, the Random Drop policy is as follows. Suppose the buffer size is B ≥ 0. If a packet arrives at time slot τ to a full buffer (just a busy link in case B = 0), a “victim” packet is chosen at random from the set of packets available at the link but not in service, including the arriving packet; this packet is then dropped. That is, each packet, including the new arrival, is dropped with probability 1/(B + 1). The following result (proof omitted) is obtained by showing that FCFS/RD ∈ LF(1):

Theorem 2. FCFS/RD ∈ LF(n) for every n ≥ 1.
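For concreteness, here is a minimal sketch of the Random Drop step just described (the names and data structures are illustrative, not from the paper):

```python
import random

def random_drop(buffer, arrival, B):
    """One Random Drop decision: `buffer` holds the queued packets not in
    service (at most B of them) and `arrival` is the packet arriving this
    slot. If there is room, the arrival is queued and nothing is dropped;
    otherwise a victim is chosen uniformly among the B queued packets and
    the arrival, so each is lost with probability 1/(B + 1).
    Returns the dropped packet, or None."""
    if len(buffer) < B:
        buffer.append(arrival)
        return None
    victim = random.randrange(B + 1)
    if victim == B:
        return arrival                # the new arrival itself is dropped
    dropped = buffer.pop(victim)
    buffer.append(arrival)            # the arrival joins the FCFS tail
    return dropped
```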

3.2 Correlation of scheduling decisions

Now we show that to achieve the scheduling objective, an algorithm must satisfy a condition on the correlation of scheduling decisions at different links. We begin by introducing a generic model of local-control scheduling algorithms and use it to establish the necessary correlation condition, then show that together the local fairness property and the correlation condition are sufficient for optimality.

Consider a link l shared by N connections 1, 2, . . . , N. Any scheduling algorithm G at l can be viewed as assigning a loss probability $p^G_i(\tau)$ to connection i packets arriving at each time slot τ. This probability takes into account loss upon arrival and preemption from the queue, if the algorithm allows it. It is determined by the distribution of the joint arrival process and the distribution of upstream scheduling decisions. As a concrete example, consider the Nearest-To-Go algorithm, which at any link favors packets with the least remaining number of hops along their paths. The probability of dropping a packet from a given connection is the probability that a certain number of packets that have fewer links to traverse are already in the buffer when the packet arrives, or arrive before the packet is served. Note that if the scheduling algorithm is randomized, the probability also depends on the distribution of the algorithm's random choices.

Suppose, without loss of generality, that connections are numbered in increasing order of the loss probability at τ, that is, $p^G_1(\tau) \le p^G_2(\tau) \le \cdots \le p^G_N(\tau)$. Define the rank of a connection at τ as follows: the rank of connection 1 is $r^G_1(\tau) = 1$, and for all j, $r^G_{j+1}(\tau) = r^G_j(\tau)$ if $p^G_{j+1}(\tau) = p^G_j(\tau)$, and $r^G_{j+1}(\tau) = r^G_j(\tau) + 1$ otherwise. Under this model it is possible that all connections have equal rank during a time slot. For example, under FCFS/RD arrivals from all connections at any slot τ have rank 1, as they are treated identically by the algorithm.
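The rank definition is mechanical; the following sketch (illustrative, with invented names) computes ranks from a vector of per-connection loss probabilities $p^G_i(\tau)$:

```python
def ranks(loss_probs):
    """Ranks as defined above: connections are ordered by loss probability;
    equal probabilities share a rank, and the rank increases by one at each
    strictly larger probability. Returns ranks aligned with the input order."""
    order = sorted(range(len(loss_probs)), key=lambda i: loss_probs[i])
    rank, current, prev = {}, 0, None
    for i in order:
        if loss_probs[i] != prev:
            current, prev = current + 1, loss_probs[i]
        rank[i] = current
    return [rank[i] for i in range(len(loss_probs))]

print(ranks([0.1, 0.1, 0.1]))    # [1, 1, 1]: e.g., FCFS/RD treats all alike
print(ranks([0.05, 0.2, 0.05]))  # [1, 2, 1]
```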

Consider an arbitrary interval $I = [t, t + nT_0)$ on link l composed of n consecutive (sub)intervals of length $T_0$: $I_i = [t + (i-1)T_0, t + iT_0)$, $i = 1, \ldots, n$. Given Assumption 3, the expected loss rate of a connection is maximum if packets arrive only at those slots with the worst (highest) rank within each interval $I_i$. With a slight abuse of notation, let $r^G_c(i)$ be the maximum rank for connection c during interval $I_i$. The expected loss rate for connection c over interval I is given by:

$$E[R^{l,G}_c(I)] = \frac{1}{n} \sum_{i=1}^{n} E[R^{l,G}_c(I_i)], \qquad (1)$$

where

$$E[R^{l,G}_c(I_i)] = \sum_{k=1}^{N} E[R^{l,G}_c(I_i) \mid r^G_c(i) = k]\, \Pr[r^G_c(i) = k]. \qquad (2)$$

Note that we assume that an algorithm assigns ranks to connections based on some probability distribution. Algorithms that assign ranks deterministically are a special case.

To bound the loss rate of an arbitrary tagged connection c at link l, we consider the indicator random variables $(Z^{l,G}_c(I_1), \ldots, Z^{l,G}_c(I_n))$, defined as

$$Z^{l,G}_c(I_i) \triangleq \begin{cases} 1 & \text{if } c \text{ suffers packet loss in interval } I_i, \\ 0 & \text{otherwise.} \end{cases}$$

$Z^{l,G}_c(I_i)$ is a tight bound on the loss rate during subinterval i; that is, $R^{l,G}_c(I_i) \le Z^{l,G}_c(I_i)$. In fact, $R^{l,G}_c(I_i) = Z^{l,G}_c(I_i)$ when the tagged connection offers exactly one packet to link l in the ith subinterval. Otherwise, the definition of $Z^{l,G}_c(I_i)$ assumes that any loss event experienced by the connection in subinterval i results in the loss of all packets offered by c during the subinterval.

Suppose connection c is routed along path π. Let $Z^G_c(I_i) \triangleq \max_{l \in \pi} Z^{l,G}_c(I_i)$, for $i = 1, \ldots, n$. Under the assumption that G is a local-control algorithm, the distribution of $Z^G_c(I_i)$ has the following product form:⁴

$$1 - \Pr[Z^G_c(I_i) = 1] = \prod_{l \in \pi} \left(1 - \Pr[Z^{l,G}_c(I_i) = 1]\right). \qquad (3)$$

Let $Z^G_c(I) \triangleq \frac{1}{n} \sum_{i=1}^{n} Z^G_c(I_i)$. Then $E[Z^G_c(I)]$ is a tight bound on the expected loss rate for connection c over interval I; that is, $E[M^G_c(I)] = E[Z^G_c(I)]$.

⁴ The probabilities in the product form can be conditioned on any network events, for example upstream scheduling decisions. To be precise we may write the probabilities as $\Pr[Z^{l,G}_c(I_i) = 1 \mid H_i]$, where H is the history of events in the network. We refrain from doing so for the sake of clarity.

By linearity of expectation, we have

$$E[Z^G_c(I)] = \frac{1}{n} \sum_{i=1}^{n} E[Z^G_c(I_i)] = \frac{1}{n} \sum_{i=1}^{n} \Pr[Z^G_c(I_i) = 1]. \qquad (4)$$

Substituting from (3), we get

$$E[Z^G_c(I)] = 1 - \frac{1}{n} \sum_{i=1}^{n} \prod_{l \in \pi} \left(1 - \Pr[Z^{l,G}_c(I_i) = 1]\right). \qquad (5)$$

Equation (5) is minimum only when the following two-part condition is satisfied:

C1: Consider the sorted order of $\{I_1, \ldots, I_n\}$ in increasing rank at some link. Equation (5) is minimum only if this sorted order is the same at every link along the path. For example, if n = 2 and $E[Z^{l,G}_c(I_1)] \le E[Z^{l,G}_c(I_2)]$ at some link l along c's path π, then $E[Z^{l',G}_c(I_1)] \le E[Z^{l',G}_c(I_2)]$ at every $l' \in \pi$.

C2: The loss probability is concentrated in as few intervals as possible. That is, few intervals i have $E[Z^G_c(I_i)]$ much larger than $E[Z^G_c(I)]$, while the remaining ones have much smaller loss probability.

Together, C1 and C2 are sufficient for optimality given that $G \in \mathrm{LF}(n)$: because $E[Z^{l,G}_c(I)]$ is minimum at every link l whenever $G \in \mathrm{LF}(n)$, at any link l algorithms in LF(n) can only differ in how they assign ranks to the connections at different subintervals of I, hence in the values of $E[Z^{l,G}_c(I_i)]$ for $i = 1, \ldots, n$, but not in their sum, that is, not in $E[Z^{l,G}_c(I)]$.

The proof of this necessary condition proceeds by induction on the subintervals of I from 1 to n, given a two-hop path, to show that the optimal solution must have the loss rate concentrated in as few intervals as possible (subject to the constraint that the loss rate at any link cannot exceed 1). This result is then used as the base case for induction on the number of hops in the path. The inductive step in both cases is straightforward, as the product form expands into a simple inclusion-exclusion formula.
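A small numeric check of Eq. (5) illustrates both parts of the condition (an illustrative sketch; the probabilities are invented, with the per-link loss budget held fixed across the three cases):

```python
from math import prod

def expected_Z(per_link_q):
    """Evaluate Eq. (5): per_link_q[l][i] = Pr[Z_c^{l,G}(I_i) = 1].
    Returns the bound E[Z_c^G(I)] for a path, over links l and
    subintervals i = 1, ..., n."""
    n = len(per_link_q[0])
    survive = sum(prod(1.0 - link[i] for link in per_link_q)
                  for i in range(n))
    return 1.0 - survive / n

# Two links, n = 2 subintervals, loss budget 0.4 per link in every case.
print(expected_Z([[0.2, 0.2], [0.2, 0.2]]))  # spread out:           0.36
print(expected_Z([[0.4, 0.0], [0.4, 0.0]]))  # C1 and C2 hold:       0.32
print(expected_Z([[0.4, 0.0], [0.0, 0.4]]))  # concentrated but
                                             # misaligned (C1 fails): 0.40
```

Concentrating the loss probability lowers the bound (C2), but only if the heavy subintervals line up across the links (C1).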

4 The Rolling Priority Algorithm

4.1 Service and drop policies

Consider a link using the RP scheduling algorithm. From the perspective of a connection, time at the link is divided into disjoint epochs of $T_e = nT_0$ time slots. Each time slot is spanned by exactly one epoch from each connection. At every time slot, RP gives scheduling priority (service and drop priority) to the connections sharing the link in order of earliest-starting current epoch, where the current epoch of a connection is the epoch spanning the time slot.

27th International Conference on Distributed Computing Systems (ICDCS'07)0-7695-2837-3/07 $20.00 © 2007

Page 6: [IEEE 27th International Conference on Distributed Computing Systems (ICDCS '07) - Toronto, ON, Canada (2007.06.25-2007.06.27)] 27th International Conference on Distributed Computing

Figure 1. The priorities of three connections a, b and c at a link. A circular queue is used to enforce the cyclic priorities. The head pointer indicates the connection with highest priority at any given time.

Figure 2. The start of a connection epoch at two consecutive links (link interfaces) differs by the propagation delay of the upstream link (tprop).

The connection(s) with the earliest-starting current epoch have the highest priority. Figure 1 illustrates the assignment of priority at different time slots for a link shared by three connections a, b and c. At every time slot in the interval [t1, t2), the current epoch of connection b started earlier than the current epochs of connections a and c; as a result, the priority of connection b is highest within this interval. The highest-priority connection during [t2, t3) is c, and it is a during [t3, t4). The cycle repeats with the start of a new epoch of connection a. At the beginning of a time slot, if the number of packets available at the link (those already in the buffer and those offered by the router's input interfaces) exceeds the buffer capacity B, the excess packets are dropped. RP drops packets from the lowest-priority connections, so that the B packets with the highest connection priority remain. During the remainder of the time slot, RP serves a packet from the highest-priority connection with backlog, if any.
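A minimal sketch of one RP time slot, under the description above (the data structures and names are illustrative, not the paper's implementation):

```python
def rp_slot(epoch_start, offered, B):
    """One RP time slot at a link. `epoch_start` maps a connection id to
    the start time of its current epoch at this link; `offered` lists
    (connection_id, packet) pairs available this slot (buffered packets
    plus new arrivals); `B` is the buffer capacity.
    Earlier-starting current epoch means higher priority.
    Returns (buffer_after, dropped, served)."""
    offered.sort(key=lambda cp: epoch_start[cp[0]])
    kept, dropped = offered[:B], offered[B:]   # drop lowest-priority excess
    served = kept.pop(0) if kept else None     # serve one highest-priority pkt
    return kept, dropped, served
```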

4.2 Phase randomization

To ensure that high-priority packets are subject to a small loss probability at every link, RP uses randomization to avoid contention among a large number of high-priority connections at any link. Furthermore, RP loosely aligns the start of a connection's epochs across the links it traverses, so that a packet that is given high priority at one link is likely to have high priority at all links along the path. Both randomization and epoch alignment are part of connection initialization, which we now describe.

Each connection has an associated phase variable φ. Suppose the connection is initialized at time t0. The ingress router of the connection chooses the value of the phase uniformly at random from the interval [0, Te), so that the connection starts a new epoch at times t0 + φ + iTe, i = 0, 1, 2, . . .. The phase of the connection is communicated to the downstream links in the form of a one-time initialization packet, init, sent from the ingress at time t0 + φ. The reception time of the init packet at a given link specifies the connection's epoch start times at that link. For instance, if an init packet for a particular connection is received at the link at time t, then a new epoch for the connection at that link starts at times t + iTe, i ≥ 0. Because RP's service and drop policies rely on knowledge of connection epoch boundaries, the init packets are always given higher scheduling priority than all data packets, so that they are almost never dropped.⁵

Figure 2 shows a timing diagram with two links in tandem along the path of a connection. The connection's init packet does not experience any queuing delay. In this case, the start of a new connection epoch at the upstream link precedes the start of a new epoch at the downstream link by exactly the propagation delay of the upstream link.
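The initialization step is easy to sketch (illustrative names; this assumes, as in Figure 2, that the init packet sees no queuing delay):

```python
import random

def epoch_starts(t0, Te, prop_delays, horizon):
    """Per-link epoch start times for one connection under RP initialization.

    The ingress draws the phase uniformly from [0, Te) and sends the init
    packet at t0 + phase; link k receives it after the cumulative propagation
    delay of the upstream links and starts the connection's epochs at the
    reception time plus integer multiples of Te.
    Returns one list of epoch start times (up to `horizon`) per link."""
    phase = random.uniform(0.0, Te)
    starts, arrival = [], t0 + phase
    for delay in [0.0] + prop_delays:     # zero delay to reach the first link
        arrival += delay
        k = max(0, int((horizon - arrival) // Te) + 1)
        starts.append([arrival + i * Te for i in range(k)])
    return starts

# Two links in tandem with a propagation delay of 3 slots between them.
for link, s in enumerate(epoch_starts(t0=0.0, Te=1000.0,
                                      prop_delays=[3.0], horizon=5000.0)):
    print(f'link {link}:', [round(t, 1) for t in s])
```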

5 Analysis of Rolling Priority

In this section we show that RP satisfies the optimality conditions of Section 3.2. Let RP-n denote the algorithm RP with epoch duration $T_e = nT_0$ for some n ≥ 1. We view each epoch as composed of n disjoint subepochs of length $T_0$ slots (i = 1, . . . , n).

Theorem 3. RP-n ∈ LF(n) for any n ≥ 1.

Proof. Consider an interval I of length $nT_0$ at a link l shared by N connections and employing a local-control work-conserving scheduling algorithm G. Leveraging the notation of the general algorithmic model of Section 3.2, let $W_{c,i}(k)$ be an indicator variable of the event that an arbitrary connection c has rank k (k = 1, 2, . . . , N) during the ith subinterval $I_i$ (of length $T_0$ slots) within interval I, and define $W_c(k)$ as the number of subintervals within I where the connection has rank k, i.e., $W_c(k) = \sum_{i=1}^{n} W_{c,i}(k)$.

⁵ Initialization packets are dropped only when there are too many init packets at a given link, but such events are rare, since connections are traffic aggregates that are expected to persist for a long time (i.e., weeks or months). Recovery from the loss or corruption of init packets is not part of the scheduling algorithm but should be provided, for example by having link interfaces notify the ingress routers of packets belonging to uninitialized connections before dropping them.


Then the fraction of time for which connection c assumes rank k is:

$$E[W_c(k)] = \frac{1}{n} \sum_{i=1}^{n} E[W_{c,i}(k)] = \frac{1}{n} \sum_{i=1}^{n} \Pr[r^G_c(i) = k], \qquad (6)$$

where $r^G_c(i)$ is the rank of connection c during subinterval $I_i$. Suppose $G \notin \mathrm{LF}(n)$. Then there exist k and a connection c such that $E[W_c(k)]$ is higher than for at least one other connection. Otherwise, there would be no connection whose loss rate can be reduced without increasing the loss rate of a connection with equal or higher loss rate, which implies $G \in \mathrm{LF}(n)$. Conversely, if $E[W_c(k)]$ is the same for all connections then $G \in \mathrm{LF}(n)$.

Consider the algorithm RP-n, where interval I corresponds to the duration of an epoch and $\Pr[r^G_c(i) = k]$ corresponds to the probability distribution of priority during subepoch i of the epoch. Since this distribution was obtained for an arbitrary connection, Eq. (6) indicates that $E[W_c(k)]$ is the same for all connections traversing link l under RP-n, completing the proof.

We now proceed to show that RP-n is optimal. By observing that the priority of a connection during each subepoch i is dominated by a binomial random variable (due to phase randomization), we can show that RP-n has the following properties:

Lemma 2. Consider an epoch of a tagged connection c at some link l along its path. If the number of connections sharing the link is N, then: (i) with high probability (i.e., with probability $1 - o(1/N)$), for each $k = 1, 2, \ldots, n$ there exists a subepoch of the epoch being considered where the scheduling priority of the connection is in the range $[\frac{N}{n}(k-1), \frac{N}{n}k]$; and (ii) if there exists a subepoch $1 \le i \le n$ where the scheduling priority of the connection at link l is in the range $[\frac{N}{n}(k-1), \frac{N}{n}k]$ for some $1 \le k \le n$, then at every link l′ along the path of the connection where the number of connections is N′, the priority of the connection during subepoch i is in the range $[\frac{N'}{n}(k-1), \frac{N'}{n}k]$ with high probability.

Part (i) of the lemma can be explained as follows. If we consider the division of the range of priorities [1, N] into n priority classes of equal size, then with high probability, $1 - o(1/N)$, there is a subepoch where the priority of the connection is in the kth priority class, for each k. Part (ii) states that if the priority of the connection is in the kth priority class during a subepoch at a particular link, then, with high probability, it is in the kth priority class during the same subepoch at every link. This implies that if a packet has high (low) priority at a link, then with high probability it has high (low) priority at every other link along the path. Note that the high-probability statements follow from the application of Chernoff bounds to the binomial probability distributions.
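The binomial behavior behind Lemma 2 can be checked numerically (an illustrative Monte Carlo sketch, not from the paper): with the tagged connection's epoch starting at time 0, a competitor with phase φ has an earlier-starting current epoch at time t exactly when φ > t, so the tagged connection's priority at time t is 1 plus a Binomial(N − 1, 1 − t/Te) count, and it sweeps across the n priority classes within each epoch.

```python
import random

def mean_rank_per_subepoch(N, n, trials=20_000, seed=0):
    """Average priority (1 = highest) of a tagged connection at the midpoint
    of each of the n subepochs of its own epoch, with the other N - 1
    connections' phases drawn uniformly from [0, Te)."""
    rng = random.Random(seed)
    Te = 1.0
    mids = [(i + 0.5) * Te / n for i in range(n)]
    totals = [0] * n
    for _ in range(trials):
        phases = [rng.uniform(0.0, Te) for _ in range(N - 1)]
        for i, t in enumerate(mids):
            # competitors whose current epoch started before the tagged one's
            totals[i] += 1 + sum(1 for phi in phases if phi > t)
    return [tot / trials for tot in totals]

# N = 100, n = 5: the mean rank falls from ~90 to ~11 across the epoch,
# visiting each priority class of size N/n = 20 once, as Lemma 2(i) states.
print([round(r, 1) for r in mean_rank_per_subepoch(100, 5)])
```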

Finally, the optimality result follows by showing that any algorithm in LF(n) that does not satisfy the properties of Lemma 2 does not satisfy the necessary correlation condition of Section 3.2. This result is stated formally as:

Theorem 4. RP-n ∈ ML(n) for any n ≥ 1.

6 Simulation

In this section, we report simulation experiments comparing the observed loss rates under RP-n, FCFS/DT (drop-tail), and FCFS/RD. The experiments support the analytical results presented so far by giving examples of the observed performance under the different algorithms.

Figure 3. Simulation topology. (Legend: edge router, core router, connection path.)

Specifically, we compare the tradeoff, observed under RP, between load and path length for achieving a desired loss rate, against the tradeoff observed in a network using FCFS/DT.

To differentiate the performance of RP and FCFS/DT, we used a “parking-lot” topology (Figure 3), where a tagged (foreground) connection traverses a path of identical links. At each link, the tagged connection competes with a different set of background connections. These connections do not face any contention except at the link shared with the tagged connection. All network connections are periodic, to emulate ingress-shaped traffic, and have identical bandwidth allocations equal to 1/100 of the link bandwidth. All the links along the path have equal loads, and hence are shared by the same number of connections.

We conducted experiments using the ns-2 [1] simulator with different values of the epoch length (n), link load, and buffer size. Figure 4(c) is a contour plot of the observed tradeoff between load and path length for achieving a desired average loss rate. In these experiments, the buffer size was set to 5 packets at every link and the length of the RP epoch was set to n = 10. The plot was obtained by running a set of 100 experiments for each (load, path length) pair and averaging the loss rate of the tagged connection. Similar experiments are reported for FCFS/DT and FCFS/RD in parts (a) and (b).

Comparing Figures 4(a) and 4(b), we find that the average packet loss rates of FCFS/DT and FCFS/RD are similar under moderate load conditions when the maximum path


Figure 4. Tradeoff between load and path length for a constant loss rate (between 0.02 and 0.1) when B = 5. From left: (a) FCFS/DT, (b) FCFS/RD, and (c) RP (n = 10). Axes: max. path length (hops) versus max. load.

length (in terms of hops) is relatively short. As the path length increases beyond a few hops, the difference between the loss rates of the two algorithms increases, with FCFS/RD giving slightly better average performance. This decrease in loss rate for FCFS/RD can be attributed to the local fairness property discussed in Section 3. However, we find that both FCFS/DT and FCFS/RD are much more sensitive to path length than RP (Figure 4(c)). In particular, whereas a loss rate of 0.02 can be maintained for up to 20 hops at 60% load under the proposed scheme, it can be maintained for fewer than 5 hops using FCFS/DT.

7 Concluding Remarks

Motivated by the goal of efficiently providing loss guarantees in packet networks, we initiated the study of scheduling to minimize a bound on the expected loss rate of every network connection (aggregate flow). Under some general assumptions, we identified necessary and sufficient conditions for optimality. Specifically, we found that an optimal algorithm must satisfy a local fairness condition, whereby it ensures that the maximum expected loss rate among the connections sharing each link is minimized. In addition, the scheduling decisions at a link and along the path must be statistically correlated to ensure that packets receive consistent treatment at every hop.

We showed that the algorithm combining the FCFS service rule with the random drop policy (FCFS/RD) is locally fair but does not satisfy the correlation condition. We then introduced a novel work-conserving algorithm called Rolling Priority (RP), designed to satisfy the local fairness and correlation conditions, and established its optimality. RP relies exclusively on local information and does not employ explicit coordination of scheduling decisions at different hops.

One limitation of the RP algorithm is that it encourages losses to occur in bursts, which may be harmful to some applications. An open question, therefore, is whether there is an algorithm that does not suffer from this limitation and is also (nearly) optimal.

References

[1] The network simulator ns-2. http://www.isi.edu/nsnam/ns/.

[2] W. Aiello, R. Ostrovsky, E. Kushilevitz, and A. Rosén. Dynamic routing on networks with fixed-size buffers. In Symposium on Discrete Algorithms (SODA), 2003.

[3] G. Appenzeller, I. Keslassy, and N. McKeown. Sizing router buffers. In ACM SIGCOMM '04, August/September 2004.

[4] A. Borodin, J. Kleinberg, P. Raghavan, M. Sudan, and D. P. Williamson. Adversarial queueing theory. J. ACM, 48(1):13–38, 2001.

[5] C.-S. Chang. Performance Guarantees in Communication Networks. Springer-Verlag, London, UK, 2000.

[6] C.-S. Chang, Y.-T. Chen, and D.-S. Lee. Constructions of optical FIFO queues. IEEE/ACM Trans. Netw., 14(SI):2838–2843, 2006.

[7] S.-T. Chuang, A. Goel, N. McKeown, and B. Prabhakar. Matching output queueing with a combined input-output queued switch. In IEEE INFOCOM, 1999.

[8] M. Elhaddad, H. Iqbal, T. Znati, and R. Melhem. On minimizing the worst-case loss rate in packet-routing networks. Technical Report CS/TR-07-149, University of Pittsburgh.

[9] M. Enachescu, Y. Ganjali, A. Goel, N. McKeown, and T. Roughgarden. Routers with very small buffers. In IEEE INFOCOM, 2006.

[10] S. Floyd and V. Jacobson. Traffic phase effects in packet-switched gateways. Journal of Internetworking: Practice and Experience, 3(3):115–156, September 1992.

[11] N. McKeown and D. Wischik. Hot topic: Making router buffers much smaller. SIGCOMM Comput. Commun. Rev., 35(3):73–74, 2005.

[12] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the multiple node case. IEEE/ACM Trans. Netw., 2(2):137–150, 1994.

[13] J. W. Roberts and J. T. Virtamo. The superposition of periodic cell arrival streams in an ATM multiplexer. IEEE Trans. Commun., 39(2):298–303, Feb. 1991.
