
Computer Networks 31 (1999) 1951–1966
www.elsevier.com/locate/comnet

RRR: recursive round robin scheduler

Rahul Garg a,*, Xiaoqiang Chen b,1

a Department of Computer Science and Engineering, Indian Institute of Technology, Hauz Khas, New Delhi, India
b Bell Laboratories, Lucent Technologies, 101 Crawfords Corner Road, Holmdel, NJ 07733, USA

Abstract

Scheduling has been an interesting problem since its inception. In the context of real-time networks, a scheduling algorithm is concerned with dispatching streams of packets sharing the same bandwidth such that a certain guaranteed performance for each stream, like rate and delay bound, is provided. This function has a wide range of applications in network elements such as host adaptors, routers and switches. This paper proposes and describes a new scheduling algorithm named the recursive round robin (RRR) scheduler. It is based on the concept of the construction of a scheduling tree in which distinct cell streams are scheduled recursively. Special emphasis is placed on the design and analysis of the scheduler. A delay bound is analytically derived for the scheduler and verified using simulation. The scheduler can work in either a work-conserving mode or a non-work-conserving mode. It is shown that the work conserving scheduler is fair. Fairness indexes for the work conserving scheduler are analytically derived. The simple nature of this algorithm makes it possible to implement it at very high speeds, while considerably reducing the implementation cost. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Asynchronous transfer mode; Scheduling; Quality of service; Delay bound; Round robin; Fairness

1. Introduction

Scheduling plays an important role in providing QoS guarantees in any data network. The difficulties in the design of a scheduler are largely due to the fact that multiple constraints have to be met, including delay, bandwidth, jitter and loss ratio. In addition, there are a number of desirable properties that a scheduler has to consider. The first property is that a

* Corresponding author. Tel.: +91-11-686-7431, 685-7649; Fax: +91-11-686-8765; E-mail: [email protected]

1 Tel: +1-732-949-2928; Fax: +1-732-834-5379; E-mail: [email protected].

scheduler must isolate one stream from other streams so that it is able to maintain the performance guarantees for such a stream even in the presence of misbehaving streams in the system. The second property calls for a scheduler to divide the available link bandwidth among competing streams in a fair manner. This property is particularly important for the design of work-conserving schedulers. The third property is that a scheduler must perform reasonably well with both a large number of packet streams and a wide range of link capacities. Finally, a practical scheduler should be easy to implement. In the case of scheduling ATM cells, the time available for completing a scheduling decision is inversely proportional to the link capacity. This time is only 680 ns on an

1389-1286/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S1389-1286(99)00006-7


OC-12 link, and a hardware solution is desirable in this case.

Many of the scheduling algorithms proposed were first designed for scheduling packets of variable sizes and were complex [4,13,11]. Some of them were later adopted for scheduling ATM cells [10]. However, the inherent complexity remains, and typical operations needed to schedule a cell are addition, multiplication and division. In addition, they also need a multi-level priority queue. Among the schedulers optimized for scheduling ATM cells, some have poor delay and fairness properties [5], some are not scalable for fine rate granularity [6], and some may need to overallocate rate, resulting in poor utilization of link bandwidth [3].

In general, schedulers can be characterized as work-conserving or non-work-conserving. A scheduler is said to be work-conserving if the scheduler is never idle when at least one packet is buffered in the system. A non-work-conserving scheduler may remain idle even if there are available packets to transmit.

This paper proposes and describes a new scheduling algorithm called the recursive round robin scheduler (RRR), which is optimized for scheduling fixed size packets. The proposed scheduler is based on the concept of the construction of a scheduling tree in which distinct streams are scheduled recursively. The rate of each packet stream is represented as a binary fraction relative to the link capacity. It can support very large rate granularity for each packet stream. Most of the properties of the scheduler depend upon the count c of the number of ones in the binary representation of the normalized rate allocated to the stream. The scheduler provides an upper bound on the delay and delay-jitter for compliant packet streams which is independent of the link utilization. The delay bound is independent of other streams (isolation). The delay bound of a stream depends only on the traffic descriptor and the normalized rate allocated to the stream. For a compliant stream of ATM cells of normalized rate r (where r is represented in binary fixed point with g bits) and burst size s, the scheduler provides a delay and jitter bound of $(1/r)(s + c)$, where c is the number of ones in the binary representation of its normalized rate r.

It is shown that the work conserving version of the RRR algorithm is fair. The residual link capacity is evenly divided among active streams. The service fairness index of two streams i, j is $c_i/r_i + c_j/r_j$, where $r_i$ and $r_j$ are the normalized rates of the two streams, and $c_i$ and $c_j$ are the counts of ones in the binary representations of $r_i$ and $r_j$, respectively. The worst case fair index of work conserving RRR is bounded by $c/r$.

The algorithm has good scaling properties. Its implementation cost varies linearly with the number of streams and the rate granularity. The algorithm is optimized for fixed packet size and needs only bit manipulation operations to schedule a cell. As a result it can be implemented on high speed, low cost ATM host adaptors, where the scheduling overhead of the algorithm can be as little as two memory accesses to schedule an ATM cell. The algorithm is also suited for scheduling in high speed ATM switches, especially on the edge of an ATM network where isolation is required and policing is preferred.

Important concepts such as link sharing [1], class based queuing etc. can possibly be implemented using the RRR scheduler. In fact, more complex schedulers can be constructed in which simple and efficient RRR schedulers (along with other simple and efficient schedulers) are used as building blocks so as to meet specific scheduling requirements.

This paper is organized as follows. Section 2 presents a description of the RRR scheduling algorithm. Section 3 describes delay and fairness properties of the scheduler. Section 4 discusses variants derived from RRR. Section 5 describes simulation results. Section 6 discusses implementation issues and tradeoffs for hardware and software implementations of RRR. The paper concludes and describes further work in Section 7. All the proofs are given in the Appendix.

2. Scheduler description

We use the term stream to refer to a flow or session. The term cell is used to describe packets of fixed size. Subscript i is usually used to identify a stream. For each stream i, a rate is allocated at an output link. We refer to this rate as the allocated rate ($r_i$).

We divide all the rates by the link rate. As a result the link rate becomes equal to 1. We refer to the rate


Fig. 1. Scheduling tree (no stream).

so obtained as the normalized rate. Further, we will assume that the rate is represented as a fixed point binary fraction with g bits of granularity. The rate $r_i$ allocated by the scheduler is hence represented by an array of g bits $b_{i1}, b_{i2}, b_{i3}, \ldots, b_{ig}$, and $r_i = \sum_{k=1}^{g} b_{ik} 2^{-k}$. Another assumption is that the sum of the rates of the input streams is exactly equal to the output link capacity, which is 1. In order to achieve this, we add a null stream (or stream 0) with an appropriate rate such that the sum of the rates of all the streams (including the null stream) becomes equal to 1. In the output schedule, whenever the null stream occurs, either the link is left idle, or a cell from a non real-time class (for example, best-effort traffic) is scheduled, or a cell from the "next stream" is scheduled.
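The rate representation above can be made concrete with a small sketch. The helper names `rate_bits` and `null_stream_rate` are hypothetical and chosen only for illustration; exact fractions are used so that the g-bit representation is checked rather than rounded.

```python
from fractions import Fraction


def rate_bits(rate, g):
    """Return [b_1, ..., b_g] with rate == sum(b_k * 2**-k).

    Hypothetical helper: rejects rates that are not exact g-bit binary
    fractions in [0, 1) instead of rounding them.
    """
    scaled = Fraction(rate) * 2**g
    if scaled.denominator != 1 or not 0 <= scaled < 2**g:
        raise ValueError("rate is not a g-bit binary fraction in [0, 1)")
    n = int(scaled)
    # b_1 is the most significant bit (weight 1/2), b_g the least significant.
    return [(n >> (g - k)) & 1 for k in range(1, g + 1)]


def null_stream_rate(rates):
    """Rate of stream 0 that makes all normalized rates sum to exactly 1."""
    residual = 1 - sum(Fraction(r) for r in rates)
    if residual < 0:
        raise ValueError("allocated rates exceed the link capacity")
    return residual


bits = rate_bits(Fraction(5, 16), 4)  # 5/16 is 0.0101 in binary
r0 = null_stream_rate([Fraction(5, 16), Fraction(1, 4), Fraction(3, 16)])
print(bits, r0)
```

With these three example rates the null stream is left with a quarter of the link.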

2.1. Basic scheduler

We now describe the scheduler itself, initially by examples and finally by an algorithmic description. The scheduler is constructed in the form of a binary tree (also referred to as a scheduling tree). The method of constructing the scheduling tree is, for the time being, deferred to a later stage.

Let's begin with some simple examples to illustrate the basic idea of an RRR scheduler. The trivial case is when the scheduler has no incoming streams. The corresponding tree is shown in Fig. 1. The output schedule is simply 0, 0, 0, ... .

On the arrival of a new constant stream of rate 0.5, the root node of the tree splits into two nodes as shown in Fig. 2. The output schedule in this case is 0, 1, 0, 1, ... .

Fig. 2. Scheduling tree (1 stream).

Fig. 3. Scheduling tree (2 streams).

As a new stream (stream 2) of rate 0.25 arrives, the tree takes the shape shown in Fig. 3. In this case, the output schedule of the RRR would be 0, 1, 2, 1, 0, 1, 2, 1, ... .

A more general case is depicted in Fig. 4, where stream i has a rate of $2^{-l_i}$. In this case, a leaf of level $l_i$ of the scheduling tree is allocated to stream i. If no such leaf exists, then a node labeled 0 can be split into two leaves. This process proceeds recursively until a leaf at the desired level is created. The algorithm for constructing the tree (adding a stream) is isomorphic to the binary subtraction algorithm. Equally, the procedure of deleting a stream is isomorphic to the binary addition algorithm.

In the most general case, as shown in Fig. 5, a node of the scheduling tree at level $l_i$ is allocated to a stream iff the $l_i$-th bit in the stream's rate representation is 1. In other words, if the rate of a stream is equal to $\sum_{k=1}^{g} b_k 2^{-k}$, then a leaf of level k will be allocated to the stream iff $b_k$ is equal to 1.

Fig. 4. Scheduling tree (general case 1).

Thus, given a set of rates, it is possible to construct a tree which satisfies the above described property. The tree has two kinds of nodes, internal nodes and leaf nodes. A stream number is associated with each leaf node (in other words, the leaf is allocated to the stream). There is a bit associated with each internal node which keeps track of its subtree visited last. This bit is initialized to 0. In order to schedule a cell, the scheduler starts from the root of the tree and descends down to a leaf, scheduling a cell from the stream number associated with the leaf. The scheduler descends into the left subtree of a node if the corresponding bit is zero. Otherwise it descends into the right subtree. In any case, the scheduler flips the bit associated with each internal node it visits. Thus the first output will be the number stored in the leftmost leaf, and the second output will be the number stored in the leftmost leaf of the right subtree of the root. Fig. 6 gives the algorithmic description of the scheduler.
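The traversal described above can be sketched in a few lines. This is a minimal illustration based only on the prose description, not a transcription of Fig. 6; the class and function names (`Leaf`, `Node`, `schedule_one`) are assumptions made for this sketch.

```python
class Leaf:
    """Leaf of the scheduling tree; holds the stream number it is allocated to."""

    def __init__(self, stream):
        self.stream = stream


class Node:
    """Internal node with one toggle bit: 0 means go left next, 1 means go right."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.bit = 0


def schedule_one(root):
    """Descend from the root to a leaf, flipping the bit of every internal node visited."""
    node = root
    while isinstance(node, Node):
        child = node.left if node.bit == 0 else node.right
        node.bit ^= 1
        node = child
    return node.stream


# Tree for the two-stream example: stream 1 has rate 0.5, stream 2 has rate 0.25,
# and the null stream (0) holds the remaining 0.25.
tree = Node(Node(Leaf(0), Leaf(2)), Leaf(1))
schedule = [schedule_one(tree) for _ in range(8)]
print(schedule)
```

Run on the two-stream tree, the sketch reproduces the schedule 0, 1, 2, 1, 0, 1, 2, 1 quoted for Fig. 3.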

It should be noted that the service rate of a node m at level $l_m$ is equal to $2^{-l_m}$. A node m is visited every $2^{l_m}$ time units. A cell from the stream associated with leaf m is periodically scheduled at intervals of $2^{l_m}$ time units.

Two terms, level and phase, associated with each node of the tree are now formally defined.

Definition 2.1 (Level). Level $l(m)$ of node m is defined as the distance of the node from the root of the tree.

The level of the root is zero. The level of a child is one more than that of its parent.

Definition 2.2 (Phase). Phase $f(m)$ of a node m is the first slot in which the node is serviced.

Fig. 5. Scheduling tree (general case 2).

Fig. 6. The RRR scheduling algorithm.

The phase of the root node is 0. The phase of the left child (m) of a node (n) is equal to that of the parent node. Mathematically, if node m is the left child of node n, then $f(m) = f(n)$. The phase of the right child (m) of node (n) is $2^{l(n)}$ more than that of the node. Formally, if node m is the right child of node n, then $f(m) = f(n) + 2^{l(n)}$.

A cell from the stream associated with a leaf m is scheduled in slots $f(m)$, $f(m) + 2^{l(m)}$, $f(m) + 2 \cdot 2^{l(m)}$, $f(m) + 3 \cdot 2^{l(m)}$, ... (see footnote 2).
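The level and phase rules can be checked on the two-stream example. The sketch below is illustrative only: it assumes a tree encoded as nested pairs with integer stream numbers at the leaves, and the function names are hypothetical.

```python
def leaf_phases(tree, level=0, phase=0):
    """Yield (stream, level, phase) for every leaf.

    tree is either an int (a leaf holding a stream number) or a pair
    (left, right). A left child inherits its parent's phase; a right child
    adds 2**level, where level is the level of the parent.
    """
    if isinstance(tree, tuple):
        left, right = tree
        yield from leaf_phases(left, level + 1, phase)
        yield from leaf_phases(right, level + 1, phase + 2**level)
    else:
        yield tree, level, phase


def slots(level, phase, horizon):
    """Slots below horizon in which a leaf with this level and phase is served."""
    return list(range(phase, horizon, 2**level))


tree = ((0, 2), 1)  # null stream and stream 2 at level 2, stream 1 at level 1
table = list(leaf_phases(tree))

# Rebuild one stretch of the schedule purely from levels and phases.
schedule = [None] * 8
for stream, level, phase in table:
    for slot in slots(level, phase, 8):
        schedule[slot] = stream
print(table, schedule)
```

Every slot is claimed by exactly one leaf, and the schedule rebuilt from the phases matches the one produced by walking the tree.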

2.2. Adding and deleting streams

As mentioned earlier, adding a stream with a given allocated rate is isomorphic to the subtraction algorithm for binary numbers. One can view the addition of a stream with rate r as taking away r units of rate from the null stream. Similarly, deleting a stream of rate r is isomorphic to binary addition, as r units of rate become free and get added to the rate of the null stream.

2 The slots mentioned here are not the only slots when cells from the stream are scheduled. If more than one node of the scheduling tree is allocated to a stream, then there are additional slots where cells from the same stream are scheduled.


The scheduling tree is modified dynamically as streams are added or removed. However, any modifications to the scheduling tree must be made carefully so that the delay bounds of the existing streams are not violated. To achieve this, we need to define the following for later use.

Definition 2.3 (Unfragmented rate allocation). A stream i is said to have unfragmented rate allocation iff for any level l of the scheduling tree, there is at most one node allocated to the stream i.

The basic idea is to ensure that all the streams (including the null stream) have unfragmented rate allocation each time a stream is added to or deleted from the scheduling tree. Unfragmented rate allocation bounds the number of leaves allocated to a particular stream. If the height of the scheduling tree is k and a stream has unfragmented rate allocation, then at most k nodes in the scheduling tree can be allocated to the stream. We will show that the number of leaves is directly related to delay and fairness properties. It also limits the size of the scheduling tree, and hence the amount of state and complexity of implementation.

To add a stream i with the requested rate $r_i$, the first step is to check if the requested rate is available. If the link doesn't have adequate rate, the request to add the stream to the scheduling tree is denied. If the link has adequate rate, i.e., $r_i \le r_0$, then we proceed further.

Assume

$$r_i = \sum_{k=1}^{g} b_{ik} 2^{-k}.$$

The procedure begins with the least significant bit (k = g down to 1) of the requested rate, from a node at the highest level. If $b_{ik}$ is zero, no action needs to be taken and k is simply decremented. If both $b_{ik}$ and $b_{0k}$ are equal to 1, then the node at level k previously allocated to the null stream is allocated to the current stream i.

If $b_{ik} = 1$ and $b_{0k} = 0$, however, a borrow operation is performed from level k of the scheduling tree. During this operation, we move up to level $k-1$ in the scheduling tree and check whether there is a leaf allocated to the null stream at level $k-1$. If the answer is no, we move up the scheduling tree again till such a leaf is found. It is always possible to find such a leaf if $r_i \le r_0$ and the null stream has unfragmented rate allocation. After a leaf allocated to the null stream at level $k'$ is found, it is split into two children at level $k'+1$. One of the children is allocated to the null stream while the other child is used for performing the split operation again. This process is continued till level k is reached. At level k, one of the two new children is allocated to stream i and the other to the null stream.

The above procedure is continued till the root of the scheduling tree is reached. $r_0$ is decremented by $r_i$ to update the residual bandwidth ($r_0 = r_0 - r_i$). It should be noted that at this stage, the scheduling tree has unfragmented rate allocation for all the streams.

Fig. 7. Adding stream 1 of rate 0.0101.

The above procedure of adding streams is illustrated by examples in Fig. 7, Fig. 8 and Fig. 9. In these examples, it is assumed that 4 bits are used to represent a rate.

The evolution of the scheduling tree when stream 1 of rate 0.0101 is added for scheduling on an idle link is shown in Fig. 7(a) to (f). Initially there is no stream, so all the bandwidth is allocated to the null stream. The scheduling tree is a single leaf as shown in (a). Since the 4th bit of the rate of stream 1 is 1, a node at level 4 must be allocated. Since there is no leaf allocated to the null stream at level 4, the borrow operation is performed. The first leaf allocated to the null stream is the root itself. Thus the root is split into two children at level 1. One of the children is allocated to the null stream and the other is marked with B, representing the borrow operation, as shown in (b). The node marked B is split again as shown in (c). This borrow operation is carried on as shown in (d) and (e). As the leaf node in (e) is split again, two leaves at level 4 associated with the null stream are created. One of the leaves is allocated to stream 1. Now there is a leaf at level 2 allocated to the null stream. This leaf is allocated to stream 1, resulting in the scheduling tree shown in (f).

Figs. 8 and 9 are self explanatory as the borrow operation is not needed in allocating the nodes while adding stream 2 and stream 3.
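The bookkeeping behind the add procedure can be sketched without building the tree. Because the null stream is kept unfragmented, it owns at most one leaf per level, so its allocation can be modelled as a set of levels. This is a simplified illustration under that assumption: `add_stream` and `null_levels` are hypothetical names, and leaf positions (phases) are not tracked.

```python
def add_stream(null_levels, bits):
    """Allocate leaves for a stream whose rate bits are [b_1, ..., b_g].

    null_levels is the set of levels at which the null stream owns a leaf;
    it is updated in place. Returns the levels allocated to the new stream.
    """
    g = len(bits)
    requested = sum(b << (g - k) for k, b in enumerate(bits, start=1))
    residual = sum(1 << (g - level) for level in null_levels)
    if requested > residual:
        raise ValueError("insufficient residual rate on the link")

    allocated = []
    for k in range(g, 0, -1):  # least significant bit first
        if bits[k - 1] == 0:
            continue
        if k in null_levels:
            # The null stream already has a leaf at level k: hand it over.
            null_levels.remove(k)
        else:
            # Borrow: split the nearest null leaf above level k. The splits
            # leave one null leaf at each of the levels k'+1 .. k and one
            # extra leaf at level k, which goes to the new stream.
            k_prime = max(level for level in null_levels if level < k)
            null_levels.remove(k_prime)
            null_levels.update(range(k_prime + 1, k + 1))
        allocated.append(k)
    return allocated


null_levels = {0}  # idle link: the root itself belongs to the null stream
stream1 = add_stream(null_levels, [0, 1, 0, 1])  # rate 0.0101
after_stream1 = set(null_levels)
stream2 = add_stream(null_levels, [0, 1, 0, 0])  # rate 0.0100
stream3 = add_stream(null_levels, [0, 0, 1, 1])  # rate 0.0011
print(stream1, after_stream1, stream2, stream3, null_levels)
```

After stream 1 is added the null stream holds leaves at levels 1, 3 and 4, as in the walkthrough of Fig. 7, and after all three streams it holds a single leaf at level 2.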

To delete stream i, all its allocated nodes are re-allocated to the null stream. At this stage, the null stream may have fragmented rate allocation. In order to unfragment the rate allocation of the null stream, the inverse of the split operation (join) and of the borrow operation (carry) has to be performed.

Fig. 8. Adding stream 2 of rate 0.0100.

Fig. 9. Adding stream 3 of rate 0.0011.

If there are two nodes at the same level assigned to the null stream, they should be joined together to form their parent, one level closer to the root. The join operation can be applied only to siblings (two nodes with a common parent). Assume that $n_1$ and $n_2$ are the nodes allocated to the null stream at the same level, say l. If $n_1$ and $n_2$ are siblings, then the join operation is directly applied and the two nodes are replaced by a single leaf node at level $l-1$. In case $n_1$ and $n_2$ are not siblings, let $n_3$ be the sibling of $n_1$. Now the nodes $n_2$ and $n_3$ are swapped. This operation is called phase exchange (see footnote 3). After the phase exchange, $n_1$ and $n_2$ become siblings and the join and carry operation is performed to unfragment the rate allocation of the null stream.

The above procedure is carried out from the highest level to the root of the scheduling tree, until the rate allocation of the null stream becomes unfragmented. Finally $r_0$ is updated to the value of the currently available bandwidth ($r_0 = r_0 + r_i$).

3 The phase exchange has to be carried out carefully such that it doesn't affect the delay bound and other properties of the stream associated with $n_2$. The phase exchange is carried out in two steps. Firstly, node $n_3$ allocated to the null stream is reallocated to the stream associated with $n_2$. At this stage the stream gets more share of the link rate. If the current time is t, then node $n_2$ is reallocated to the null stream at time $t + 2^l$. This deallocation can be done asynchronously as the second step.


Fig. 10. Deleting stream 1.

In continuation of the examples described in Figs. 7–9, we illustrate deleting stream 1 in Fig. 10. The initial scheduling tree is shown in (a). In order to delete stream 1, all the leaves allocated to it are allocated to the null stream as shown in (b). Now there are two leaves at level 2 allocated to the null stream. Thus the null stream is now fragmented. In order to unfragment the null stream, the phase exchange between two leaves (one allocated to stream 2 and the other to stream 0) at level 2 is carried out, resulting in the scheduling tree shown in (c). Finally the two siblings allocated to the null stream are merged together to move up, resulting in a leaf allocated to the null stream at level 1 as shown in (d). Now all the streams have unfragmented rate allocation.
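Viewed only in terms of levels, deletion is binary addition with carry. The sketch below is a simplification under that view: it counts null leaves per level and joins pairs upward, and it deliberately ignores which leaves are siblings, so the phase exchange step is not modelled. The name `delete_stream` is hypothetical.

```python
from collections import Counter


def delete_stream(null_levels, stream_levels):
    """Return the null stream's levels after the stream's leaves are released.

    null_levels and stream_levels list the levels of the leaves owned by the
    null stream and by the departing stream (at most one per level each).
    """
    counts = Counter(null_levels)
    counts.update(stream_levels)
    if not counts:
        return set()
    # Work from the deepest level toward the root. A level can hold up to
    # three null leaves (one original, one released, one carried), so a single
    # join per level is enough to leave at most one behind.
    for level in range(max(counts), 0, -1):
        if counts[level] >= 2:
            counts[level] -= 2      # join two leaves of this level ...
            counts[level - 1] += 1  # ... into their parent (the carry)
    return {level for level, n in counts.items() if n > 0}


# State after adding streams 1, 2 and 3 in the running example:
# the null stream holds one leaf at level 2, stream 1 holds levels 4 and 2.
result = delete_stream({2}, [4, 2])
print(result)
```

For the running example the null stream ends up with leaves at levels 1 and 4, which agrees with the level 1 leaf produced in Fig. 10(d).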

2.3. Allocation for floating point rates

In this section we will discuss the rate allocation for the scheduling tree when the rate is represented in floating point form. Let the rate $r \le 1$ be represented in simple binary floating point with g bits of mantissa and e bits of unsigned exponent. Let E be the value of the exponent. Now $r = 2^{-E} \sum_{k=1}^{g} b_k / 2^k$. The rate allocation for this rate is possible if the height of the scheduling tree is $E + g$. If $E_{\max}$ is the maximum value which the exponent can take, then the rate allocation as discussed earlier is possible provided the height of the scheduling tree is $E_{\max} + g$.

It is important to note that even if the height of the scheduling tree has been increased to a larger value, the number of nodes allocated for a stream is still bounded by g (the mantissa granularity of the rate representation). Keeping this fact in mind, it is possible to implement the scheduler in hardware as well as software with reasonable complexity. The implementation of the floating point rate representation will be very similar to that of the basic scheduler. The bounds and other properties of the scheduler change marginally if the floating point rate allocation is carried out.

In the rest of the paper, the fixed point rate representation is used to simplify descriptions and proofs.

2.4. Work conserving RRR

We refer to the scheduler discussed before as the basic scheduler or basic RRR. The obvious extension of the basic scheduler is to a work conserving scheduler.

In basic RRR scheduling, the time slots which a stream gets are fixed once the allocation of leaves for that stream has been completed. The slots are independent of traffic on other streams. It might happen that a scheduling slot of a stream arrives but there is no pending cell to be scheduled. In such a case no cell will be scheduled if the basic RRR is used. The work conserving RRR will, however, schedule a cell from the next backlogged stream. Fig. 11 describes the work conserving variant of RRR.

Fig. 11. Work conserving RRR.
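Since Fig. 11 is not reproduced here, the following is only one possible reading of "the next backlogged stream": keep stepping through the basic RRR schedule until a stream with a pending cell is found. The function name, the use of a precomputed schedule period, and the queue representation are all assumptions of this sketch.

```python
def work_conserving_next(period, pos, queues):
    """Pick the next stream to serve, skipping slots of idle streams.

    period: one full period of the basic RRR schedule (list of stream numbers)
    pos:    index into period of the current slot
    queues: dict mapping stream number to its number of pending cells
    Returns (stream or None, position for the following slot).
    """
    n = len(period)
    for step in range(n):
        stream = period[(pos + step) % n]
        if queues.get(stream, 0) > 0:
            queues[stream] -= 1
            return stream, (pos + step + 1) % n
    # No stream is backlogged: the link stays idle for this slot.
    return None, (pos + 1) % n


period = [0, 1, 2, 1]          # basic schedule of the two-stream example
queues = {1: 0, 2: 3}          # only stream 2 has cells waiting
pos, served = 0, []
for _ in range(4):
    stream, pos = work_conserving_next(period, pos, queues)
    served.append(stream)
print(served)
```

In this run stream 2 is served in three consecutive slots and the link idles only once every queue is empty, which is the defining property of a work conserving scheduler.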

2.5. Discussion

In this section the main idea behind the scheduler has been presented. The link bandwidth is partitioned into two equal halves by splitting the node at the root of the scheduling tree into two nodes at level 1. While scheduling, these two nodes are visited in round robin order. Both the nodes at level 1 can be further split into two nodes each at the next level, further partitioning the link bandwidth. Continuing in this manner, the link bandwidth can thus be partitioned hierarchically, giving rise to a scheduling tree. Traversal of the nodes of the tree becomes recursive round robin.

The concept of a scheduling tree is abstract. It is not always necessary to construct the scheduling tree for generating the schedule. The hardware implementation sketched in Section 6 doesn't explicitly store the scheduling tree data structure. Therefore one must not judge the time complexity of the scheduler by the time complexity of the algorithm previously described in Fig. 6. Several implementations of the algorithm are possible and the tradeoff is discussed in Section 6. We believe that an efficient hardware implementation of the scheduler can schedule one cell per clock cycle. The amount of hardware needed scales linearly with the rate granularity g.

3. Main properties of RRR scheduler

It will be shown shortly that all the properties of the RRR scheduler scale linearly with the granularity of rate allocation, g. To be more precise, for a stream i all the properties depend on $c_i$, which is the count of the number of ones in the binary representation of $r_i$. $c_i$ is bounded by the granularity of rate allocation, g. For simplicity we will only discuss the g-bit fixed point rate representation of a stream scheduled by the RRR scheduler. The floating point rate representation, however, can be dealt with in a similar manner. We first evaluate the bound for a single node in the network and later extend it to multiple nodes.

The notations used are summarized in Table 1. Let $S_i(t)$ be equal to the number of cells of stream i scheduled by the basic RRR scheduler in the interval [0, t]. Then

$$S_i(t) \le \sum_{k=1}^{g} \left( \left\lfloor \frac{t - f(b_{ik})}{2^k} \right\rfloor + 1 \right) b_{ik} \tag{1}$$

where $b_{ik}$ is the k-th bit in the normalized rate $r_i$ allocated to stream i, and $f(b_{ik})$ is the phase corresponding to the k-th bit. The right-hand side of Eq. (1) is involved in all the proofs. Intuitively, the contribution to $S_i(t)$ is only from terms in the summation where $b_{ik}$ is non zero. For each k such that $b_{ik}$ is equal to 1, cells are scheduled at time instants $f(b_{ik})$, $f(b_{ik}) + 2^k$, $f(b_{ik}) + 2 \cdot 2^k$, ... Summing this up gives Eq. (1).
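The counting argument behind Eq. (1) can be checked numerically. The sketch below assumes that each term of the sum counts the slots $f, f + 2^k, f + 2 \cdot 2^k, \ldots$ that fall in [0, t], which in closed form is a floored quotient plus one. The phases used are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def slots_in_interval(t, level, phase):
    """Number of slots phase, phase + 2**level, ... that lie in [0, t].

    Python's // floors toward minus infinity, so for 0 <= t < phase (with
    phase < 2**level) the quotient is -1 and the result is 0, as required.
    """
    return (t - phase) // 2**level + 1


def service_upper_bound(t, leaves):
    """Sum of the per-leaf slot counts for a stream holding the given leaves."""
    return sum(slots_in_interval(t, level, phase) for level, phase in leaves)


def brute_force(t, leaves):
    """Count the same slots by enumeration, as an independent check."""
    return sum(
        1
        for level, phase in leaves
        for slot in range(t + 1)
        if slot >= phase and (slot - phase) % 2**level == 0
    )


# A stream of rate 0.0101 holds one leaf at level 2 and one at level 4.
# The phases 1 and 3 are hypothetical; any phase below 2**level would do.
leaves = [(2, 1), (4, 3)]
checks = [service_upper_bound(t, leaves) == brute_force(t, leaves) for t in range(64)]
print(all(checks), service_upper_bound(15, leaves))
```

A stream that always has a cell waiting is served in every one of these slots, so for it the count is attained exactly under the basic scheduler.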

3.1. Delay and buffer bound

For a stream, say i, which is continuously backlogged in the interval [0, t], we have

$$S_i(t) \ge \sum_{k=1}^{g} \left( \left\lfloor \frac{t - f(b_{ik})}{2^k} \right\rfloor + 1 \right) b_{ik}. \tag{2}$$

Eq. (2) is true for both the basic RRR scheduler and the

Table 1
Notations

Symbol        Meaning
$A_i(t)$      Number of cells of stream i arriving in interval [0, t]
$S_i^j(t)$    Number of cells of stream i scheduled at node j in time [0, t]
$B^j(t)$      Backlog of stream i at time instant t
$b^j(t)$      Burstiness function of stream i
$r_i^j$       Normalized rate allocated to stream i at link j
$c_i^j$       Count of number of 1's in $r_i^j$


work conserving RRR. The equation is true even in the presence of the phase exchanges described in Section 2.2. This gives the following upper bound ($d_i$) for delay:

$$d_i \le \frac{1}{r_i} \left( s_{i\max} + c_i \right) \tag{3}$$

where $c_i$ is the count of the number of times 1 occurs in the binary representation of the normalized rate $r_i$, and $s_{i\max}$ is the burst size in the leaky bucket descriptor of the stream. Mathematically, $s_{i\max} = \max_t [\, b_i(t) - r_i t \,]$. The proof is given in Appendix A.1.

The above delay bound is close to that of "optimal fair schedulers" like WFQ if the burstiness $s_{i\max}$ is very large as compared to the parameter $c_i$. The delay bound for WFQ is $(1/r_i)(s_{i\max} + 1)$. For instance, assume a stream of MPEG video of rate 5 Mbps (r = 11.8 K ATM cells per second) and burst size of 10 KB (189 cells) scheduled by an RRR scheduler with 16 bits of rate allocation. The total delay bound would be 17.37 ms. The corresponding bound for a WFQ scheduler is 16.10 ms.
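The two figures in this example can be reproduced with a few lines of arithmetic. The sketch assumes the worst case count c = 16 (every one of the 16 rate bits set), which is an inference from the quoted result rather than something stated in the text, and it uses the rounded rate of 11.8 K cells per second as quoted.

```python
CELLS_PER_SECOND = 11_800  # 5 Mbps divided by 424 bits per 53-byte cell, as rounded in the text
BURST_CELLS = 189          # 10 KB burst expressed in ATM cells
ONES_IN_RATE = 16          # assumed worst case: all 16 rate bits are 1


def delay_bound_ms(burst, extra, rate):
    """Evaluate (burst + extra) / rate and convert seconds to milliseconds."""
    return (burst + extra) / rate * 1000.0


rrr_ms = delay_bound_ms(BURST_CELLS, ONES_IN_RATE, CELLS_PER_SECOND)
wfq_ms = delay_bound_ms(BURST_CELLS, 1, CELLS_PER_SECOND)
print(f"RRR: {rrr_ms:.2f} ms, WFQ: {wfq_ms:.2f} ms")
```

Under these assumptions the gap between the two bounds is just (c - 1)/r, about 1.27 ms here, and it shrinks in relative terms as the burst size grows.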

The delay bound for a network of basic RRR schedulers is given by the following equation

$$d \le \frac{1}{r} \left( s_{\max} + \sum_{k=1}^{N} c^k \right) \tag{4}$$

assuming that the stream goes through a network of N nodes, $r^j$ is the normalized rate allocation of the stream at node j and $c^j$ is the corresponding count of the number of ones. The proof is in Appendix A.2.

The output stream of the basic RRR scheduler is smooth. As a result, a bound on the number of buffers needed at an intermediate node to guarantee zero cell loss is obtained. At link j, $c^{j-1} + c^j$ cells of buffer are needed to guarantee zero cell loss. The proof of the above bound is given in Appendix A.3. Thus, if 20 bits of rate representation are used, then an average of 20 cells of buffer per stream would be needed at intermediate nodes.

3.2. Fairness properties

In case a link is under-utilized by the streams scheduled on it, the remaining link capacity should be distributed to active (or backlogged) streams fairly, i.e., in proportion to their allocated rate [7]. Ideally, if stream i and stream j are continuously backlogged in the interval $[t_1, t_2]$, then the amount of service in the interval should be related by

$$\frac{S_i(t_1, t_2)}{r_i} = \frac{S_j(t_1, t_2)}{r_j}. \tag{5}$$

In practice it is impossible to achieve ideal fairness because of cell boundaries. So a measure of fairness, called the "service fairness index" (SFI), is defined to quantify the fairness,

$$\mathrm{SFI} = \max_{t_1, t_2} \left| \frac{S_i(t_1, t_2)}{r_i} - \frac{S_j(t_1, t_2)}{r_j} \right|. \tag{6}$$

Small values of SFI correspond to more fairness. SFI = 0 corresponds to ideal fairness.

The SFI for work conserving RRR is given by

$$\frac{c_i}{r_i} + \frac{c_j}{r_j}. \tag{7}$$

The proof is given in Appendix A.4. The bound on SFI increases linearly with increasing granularity, as $c_i$ and $c_j$ are bounded by g.

Another measure of fairness is based on the worst case delay for clearing the backlog of a stream's queue. A scheduler has a worst case fair index (WFI) of $C_i/r_i$ for stream i if $C_i$ is the smallest number satisfying the equation:

$$S_i(t_1, t_2) \ge (t_2 - t_1) r_i - C_i. \tag{8}$$

The normalized worst case fair index would be $C_i$. This measure is important in hierarchical scheduling [12]. The delay bound of a hierarchical scheduler increases with increasing WFI. The WFI for the work conserving RRR scheduler is bounded by $c_i/r_i$. The proof is given in Appendix A.4.

4. Variants of RRR

Now we are ready to introduce and discuss variants of RRR.

4.1. k-ary RRR

Fig. 12. Simulation scenario.

In the k-ary RRR scheduler, the scheduling tree is not binary. The degree of the scheduling tree is k. In order to schedule one cell, the tree is traversed recursively starting from the root node. Every non-leaf node contains an index which takes a value from 0 to k − 1. The index is initialized to 0. For any internal node, the child pointed to by the index of the node is recursively visited. The index of the node is incremented by 1 (mod k) each time the node is visited.
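The k-ary traversal differs from the binary one only in replacing the toggle bit by a counter. A minimal sketch, assuming leaves are plain stream numbers and using a hypothetical `KaryNode` class:

```python
class KaryNode:
    """Internal node of a k-ary scheduling tree with a round robin index."""

    def __init__(self, children):
        self.children = children  # each child is a KaryNode or a stream number
        self.index = 0            # child to visit next, starts at 0


def schedule_one(node):
    """Follow the indexed child at each internal node, advancing the index mod k."""
    while isinstance(node, KaryNode):
        child = node.children[node.index]
        node.index = (node.index + 1) % len(node.children)
        node = child
    return node


# A 3-ary example: streams 0 and 4 get 1/3 of the link each,
# and streams 1, 2 and 3 get 1/9 each.
tree = KaryNode([0, KaryNode([1, 2, 3]), 4])
schedule = [schedule_one(tree) for _ in range(9)]
print(schedule)
```

Because the index is advanced modulo the number of children of the node being visited, the same loop would also walk a tree whose nodes have different degrees.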

4.2. Generalized RRR

In generalized RRR, the degree of nodes at different levels of the tree can be different. For each level l there is a degree parameter $d_l$. Each internal node at level l has $d_l$ children. The scheduling tree is traversed recursively as in the k-ary RRR scheduler, except that the index of a node at level l can take a value from 0 to $d_l - 1$. The index at level l is incremented modulo $d_l$.

4.3. Recursive tree based scheduling

This is the most general form of recursive scheduling. In this case the scheduling tree can be arbitrary. Each node in the tree contains the count of the number of its children. In order to schedule a cell, the tree is traversed recursively. While visiting an internal node, its index is incremented modulo its number of children. The tree need not be static. New internal nodes of different degrees can be created when needed. The nodes at various parts of the tree can be brought together and combined into a single node.

4.4. Hierarchical scheduling

The schedulers discussed so far schedule streams on an output link. It may be useful to schedule a set of input streams to get an output stream. There may be a number of such schedulers generating a number of output streams. These streams can in turn be scheduled by another scheduler on a link. This process can be repeated hierarchically over a number of levels. The motivation behind building such schedulers is to use simple schedulers like static priority, RRR etc. as basic building blocks so as to construct hierarchical schedulers which meet the desired requirements. This idea first appeared in the rate control static priority scheduler as described in [11] and was later refined and formalized for the fair queuing algorithms in [12].

Finally, it is noted that the variants of RRR discussed above in this section share most of the good properties of the basic RRR algorithm outlined in the paper, namely, simplicity, support for a wide rate range, ease of hardware and software implementation and the possibility of distributed implementation. The proofs of fairness and delay bound can be easily extended to these variants.

5. Simulation results

The simulation scenario is depicted in Fig. 12. Input stream rates are chosen from a uniform distribution between 1 and 6 Mbps (6 Mbps is the expected bit rate of MPEG-II streams). The streams are scheduled on a synchronous output link of 156 Mbps. Fifteen bits were used to represent a rate. Fifteen bits were chosen to ease the simulation on 32 bit computers.

Fig. 13. Delay of successive cells.


Fig. 14. Distribution of delay of cells.

Further, the purpose of simulation was to experimen-tally verify the correctness of the proof. Note that thebasic RRR algorithm provides perfect isolation. Thedelay of a cell of a stream is independent of arrivalpattern of cells of other streams. Therefore the delaydistribution of cells of a single stream will remainsame irrespective of the cross traffic.

While constructing the scheduling tree, the phase at each level was assigned randomly. The experiment was repeated a number of times to make sure that the results are consistent and the simulations are correct. Fig. 13 shows the delay of successive cells of a stream for a part of a simulated period.

Fig. 14 plots the statistical delay distribution of cells of the stream. The X-axis represents the delay of cells which varies from zero to the theoretical delay bound. The Y-axis plots the cumulative number of cells having delay in the corresponding range. It was found that with the random phase allocation, the delay bound was never reached. However with a specific phase allocation the tail of the delay distribution approached close to the theoretical delay bound.

6. Implementation considerations

It should be noted that the number of leaves in the tree is bounded by $g$ times the number of streams being scheduled ($N$). Therefore the space requirement for the scheduler to store the scheduling tree is bounded by $2Ng$.

In order to schedule a cell, one must descend down the scheduling tree to a leaf. The height (the maximum level of a node in the tree) of the tree is bounded by $g$. Thus, in the worst case, a naive implementation will take no more than $g$ operations to schedule a single cell.

However, clever implementations can improve performance both in terms of the space needed to store the scheduling tree and the time needed to schedule a cell.

For instance, given a binary RRR scheduling tree with any rate allocation for streams, it is straightforward to construct a corresponding 4-ary RRR scheduling tree such that the schedule generated by the binary and 4-ary schedulers on the corresponding tree is identical. The tree is constructed starting from the root node and collapsing two levels of the binary RRR tree into a single level of the 4-ary tree. In case there is a leaf node in the first level of the binary RRR tree, two copies of the node are made in the 4-ary tree. The four children of the root node in the 4-ary scheduling tree are positioned so that they are visited in the same order as in the binary RRR tree. This process is continued till all the nodes (including leaves) of the binary RRR tree are collapsed into the 4-ary scheduling tree.

The height of the 4-ary tree will be $g/2$. As a result, $g/2$ operations are needed in order to schedule one cell. This improvement in the number of operations is at the cost of additional space requirements. The number of leaves in the 4-ary scheduling tree is bounded by two times the number of leaves of the corresponding binary tree, i.e., $2Ng$. The number of internal nodes in the 4-ary tree will be $(4/3)\,2Ng$. There is a $4/3$ factor increase in storage space to reduce the number of operations in scheduling a cell by a factor of two. For any software implementation, there is a tradeoff between the time required to schedule a cell and the amount of memory available to the scheduler to maintain its data structures.
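The collapsing step can be illustrated with a small Python sketch. It is not the authors' code: the nested-tuple encoding of the binary tree, the names `build_binary`, `build_4ary`, `schedule_cell` and `height`, and the example tree are assumptions. The sketch assumes that a binary node alternates between its two children, starting with the first, so the four grandchildren are visited in the order (left, first), (right, first), (left, second), (right, second).

```python
class Node:
    """Internal node with a round robin index (hypothetical helper)."""

    def __init__(self, children):
        self.children = children
        self.index = 0


def build_binary(tree):
    """Build a binary scheduler from nested 2-tuples; non-tuples are leaves."""
    if not isinstance(tree, tuple):
        return tree
    return Node([build_binary(tree[0]), build_binary(tree[1])])


def build_4ary(tree):
    """Collapse two binary levels into one 4-ary level."""
    if not isinstance(tree, tuple):
        return tree

    def pair(sub):
        # A leaf in the first level is duplicated, as described in the text.
        return sub if isinstance(sub, tuple) else (sub, sub)

    (l0, l1), (r0, r1) = pair(tree[0]), pair(tree[1])
    # Children are placed in the order in which the binary tree visits them.
    return Node([build_4ary(l0), build_4ary(r0), build_4ary(l1), build_4ary(r1)])


def schedule_cell(node):
    """Descend to a leaf, advancing each visited node's index."""
    while isinstance(node, Node):
        child = node.children[node.index]
        node.index = (node.index + 1) % len(node.children)
        node = child
    return node


def height(node):
    return 1 + max(height(c) for c in node.children) if isinstance(node, Node) else 0


# Example binary tree; None marks an unallocated leaf.
spec = (("A", ("B", "C")), ("D", (("E", None), "F")))
binary_tree, quad_tree = build_binary(spec), build_4ary(spec)
heights = (height(binary_tree), height(quad_tree))
binary_out = [schedule_cell(binary_tree) for _ in range(64)]
quad_out = [schedule_cell(quad_tree) for _ in range(64)]
```

For this example the binary tree has height 4 and the collapsed tree has height 2, and both produce the same sequence of streams.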

For the fixed point rate representation, both the space requirements and the number of operations to schedule a cell are dependent on the maximum number of streams to be scheduled ($N$), and the rate granularity ($g$), which is clear from the above expressions. Similarly for the floating point rate representation, the rate range (the maximum value the exponent of rate can take) will also be a factor in the dependence.

There are similar tradeoffs and dependencies in hardware implementation as well. Consider a counter based hardware implementation of the RRR scheduler. For simplicity, we assume the fixed point rate representation. For each leaf node in a binary scheduling tree, there will be a counter and a register, which stores its level, phase and associated stream number. All the registers output their stream numbers which are fed to a hardware multiplexer. All the counters are clocked by the same clock and are incremented every clock cycle. When a counter reaches the phase stored in the corresponding register, the multiplexer is activated to select the stream number of its corresponding register. Every counter is reset to zero when its value becomes $2^{\text{level}}$.
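The behaviour of the counter based design can be mimicked in software. The sketch below is a behavioural model only, not a hardware description; the function name `simulate`, the tuple layout `(level, phase, stream)` and the three example leaves are assumptions chosen for illustration.

```python
def simulate(leaves, clocks):
    """Behavioural model of the counter based scheduler.
    leaves: list of (level, phase, stream) register contents, one per leaf."""
    counters = [0] * len(leaves)
    selected_per_clock = []
    for _ in range(clocks):
        selected = None
        for i, (level, phase, stream) in enumerate(leaves):
            if counters[i] == phase:
                # In a valid tree at most one leaf matches in a clock cycle.
                assert selected is None, "two leaves claim the same slot"
                selected = stream
            counters[i] += 1                  # every counter ticks each cycle
            if counters[i] == 2 ** level:     # reset at 2**level
                counters[i] = 0
        selected_per_clock.append(selected)
    return selected_per_clock


# Example: A holds a level-1 leaf (rate 1/2); B and C hold level-2 leaves (rate 1/4 each).
leaves = [(1, 0, "A"), (2, 1, "B"), (2, 3, "C")]
schedule = simulate(leaves, 8)
```

With these register contents the model selects A in every even cycle and alternates B and C in the odd cycles.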

Since there are $Ng$ leaves in a scheduling tree, there will be $Ng$ counters and the same number of associated registers. Each counter should be able to count up to $2^g - 1$. Each register will store the phase information of $g$ bits, the level information of $\log g$ bits and the stream number of $\log N$ bits. One ($\log N$) bit multiplexer is needed which should be able to multiplex and output a stream number from $Ng$ registers.

The amount of logic needed for this implementation is $Ng(g + \log g)$ for the counters, $Ng(g + \log N)$ for the registers and $2Ng \log N$ for the multiplexer. The speed of this hardware implementation depends upon the multiplexing delay and the clock skew. The multiplexing delay as well as the clock skew depends upon $Ng$. There are $Ng$ distinct counters driven by the same clock, which causes the clock skew. There are $Ng$ outputs of registers from which the multiplexer chooses one. Hence the multiplexing delay depends heavily on $Ng$.
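To get a feel for these expressions, they can be evaluated for concrete parameters. The sketch below assumes base-2 logarithms rounded up to whole bits, which the text does not state explicitly; the function name `logic_estimate` and the example values $N = 64$, $g = 15$ are likewise illustrative choices.

```python
import math


def logic_estimate(n_streams, g):
    """Evaluate the three logic expressions from the text.
    Assumption: logarithms are base 2 and rounded up to whole bits."""
    log_g = math.ceil(math.log2(g))
    log_n = math.ceil(math.log2(n_streams))
    leaves = n_streams * g                    # Ng leaves, counters and registers
    return {
        "counters": leaves * (g + log_g),     # Ng(g + log g)
        "registers": leaves * (g + log_n),    # Ng(g + log N)
        "multiplexer": 2 * leaves * log_n,    # 2Ng log N
    }


estimate = logic_estimate(n_streams=64, g=15)
```

All three terms grow linearly in $Ng$, which is the quantity the multiplexing delay and clock skew depend on.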

7. Conclusion and future work

A simple and new scheduling algorithm called recursive round robin (RRR) scheduler has been described in this paper. This scheduler can be used to schedule streams on a link with certain constraints. The scheduling algorithm guarantees a delay and jitter bound on compliant streams. Proofs of delay bound and jitter bound are provided for generic traffic descriptors. Several variants of the RRR scheduling algorithm are described. It has been shown that the work conserving version of the RRR scheduler is fair. Bounds on two different fairness indexes are analytically derived. It would be easy to generalize this result to other work conserving variants of RRR. Proof of worst case delay bound remains valid for the work conserving RRR. The procedure of dynamic addition and deletion of streams with the floating point rate representation has also been presented. Implementation tradeoffs for hardware and software implementation have been discussed.

Proof of delay bound has been supplemented by simulations. The delay bound proven for the RRR scheduling algorithm is tight but there is room for further improvement by finding efficient node allocation strategies. Another direction in which this work can be extended is a detailed performance study of the proposed variants of the basic RRR.

Further work is also needed to extend the basic RRR such that it will be able to schedule packets of variable length.

Because of its bounded delay, bounded buffer and bounded fairness properties, the RRR scheduling algorithm is suited for scheduling packets in real time networks. Simplicity, low implementation complexity, and the possibility of very high speed implementations make this algorithm particularly suitable for ATM networks. Software implementation may be deployed in network interface cards whereas hardware implementation is well suited for high speed ATM switches and high end network interface cards.

The applications of distributed implementations based on the RRR scheduling algorithm are in building real-time networks which provide QoS guarantees on a shared physical media. Broadcast and multicast can also be supported if the RRR is used. The shared communication medium may have asymmetric properties (e.g., cable modems) in uplink and downlink paths. An implementation of RRR on traditional Ethernet LANs can build real-time LANs which would provide QoS guarantees. Other examples of shared medium are Ethernet (wireless or wired), Token Ring, Wireless ATM etc. This work has a potential to make a significant impact in wireless communication technologies like cellular telephones, cellular data, wireless in local loop, digital TV broadcast, wireless ATM.


Acknowledgements

Special thanks go to Ritesh Ahuja, who showed interest in discussing RRR and helped this work by giving several useful comments. Authors are also indebted to Muddu Sudhakar, Venky and Sunder Rathnavelu for taking part in the initial discussions. We would also like to thank Dr. Huzur Saran of Computer Science and Engineering department of IIT-Delhi who was involved in discussions which generated some of the ideas presented in this paper.

Thanks to Dr. Vijay Kumar of Bell Laboratories, who initiated this work by giving the first author of this paper an opportunity to work in High Speed Networks Research Department. Thanks to the Computer Science Department of IIT-Delhi, where one of the authors spent most of his time working on RRR scheduling algorithm. Had these opportunities not been provided, it would have been impossible to bring this piece of work to its current form.

Appendix A

We use the notation shown in Table 1.

A.1. Delay bound for single node

Let $b(t)$ be a bounding burstiness function of the stream under consideration (see [2]). Let $r$ be the allocated rate of RRR scheduler for scheduling the compliant stream.⁴

⁴ The rate $r$ should be chosen carefully while performing admission control as it might affect the network utilization. Some guidelines to choose $r$ are described in [2]. For constant rate streams, $r$ could be chosen close to the rate of the stream. The rate of the compliant stream can take an arbitrary value in $[0, 1)$. Given the stream rate, it is desirable to pick an allocated rate greater than it such that the number of times 1 occurs in the $g$-bit binary representation of the allocated rate is minimized without adding significantly to the overhead. Such a selection can further minimize the jitter introduced in the stream by the scheduler.

Without loss of generality, assume that a backlog period of the stream starts at time $t = 0$. Let $t$ be any instant in this backlog period. Since the stream is continuously backlogged in the interval $[0, t]$, from Eq. (2) we have

$$S(t) \ge \sum_{k=1}^{g} \left( \left\lfloor \frac{t - \phi(b_k)}{2^k} \right\rfloor + 1 \right) b_k \tag{9}$$

$$\ge \sum_{k=1}^{g} \frac{t - \phi(b_k)}{2^k}\, b_k. \tag{10}$$

Since the maximum value of phase ($\phi(b_k)$) of a node is less than its period ($2^k$), we have

$$S(t) \ge \sum_{k=1}^{g} \frac{t}{2^k}\, b_k - c \tag{11}$$

$$= tr - c. \tag{12}$$

Let $A(t)$ be the number of cells of a compliant stream arrived in the interval $[0, t]$. Assume that the cell arrived at time $t'$ suffers the maximum scheduling delay in this backlog period. Let $d$ be the delay of the cell arriving at time $t'$. This implies that all the traffic arriving in the interval $[0, t']$ is serviced exactly by time $t' + d$. Therefore,

$$A(t') = S(t' + d). \tag{13}$$

Since $S(t) \ge tr - c$ from Eq. (12) we have

$$A(t') \ge r(t' + d) - c. \tag{14}$$

Since $A(t') \le b(t')$ we get

$$rd \le b(t') - rt' + c. \tag{15}$$

The burst size is defined as $\sigma_{\max} = \max_t \left( b(t) - rt \right)$. Substituting $\sigma_{\max}$ in the above equation we get

$$rd \le \sigma_{\max} + c \tag{16}$$

$$\Rightarrow\; d \le \frac{1}{r} \left[ \sigma_{\max} + c \right]. \tag{17}$$

Since change of origin only affects the values of phases ($\phi(b_k)$), it can be shifted to the beginning of any backlog period without changing anything in the above proof. After de-normalization, when the unit of rate is cells per second and the unit of $\sigma_{\max}$ is number of cells, the above equation gives the delay bound in seconds. The worst case delay (for compliant streams) is given by Eq. (17). It must be noted that Eq. (2) used in proving the delay bound is a lower bound on $S(t)$. Therefore the same proof applies to the work conserving RRR and the RRR with dynamic addition and removal of streams. (This inequality would hold true in the phase shift operation needed to un-fragment the rate of the null stream.)
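The bound in Eq. (12) and the resulting delay bound in Eq. (17) can be checked numerically for a small example. The sketch below is an illustration under stated assumptions, not part of the proof: it treats Eq. (9) as an equality for a continuously backlogged stream, takes $c$ to be the number of ones among $b_1, \ldots, b_g$, and uses made-up values for the bits, the phases and $\sigma_{\max}$. The function names are hypothetical.

```python
from fractions import Fraction


def service(t, bits, phases):
    """Cells served in slots 0..t for a continuously backlogged stream:
    sum over k of (floor((t - phi_k) / 2**k) + 1) * b_k."""
    total = 0
    for k, (b, phi) in enumerate(zip(bits, phases), start=1):
        if b:
            total += (t - phi) // 2 ** k + 1   # floor division also for t < phi
    return total


def delay_bound(sigma_max, bits):
    """Right-hand side of Eq. (17): (sigma_max + c) / r, in slots."""
    r = sum(Fraction(b, 2 ** k) for k, b in enumerate(bits, start=1))
    c = sum(bits)                              # number of ones in the rate
    return (sigma_max + c) / r


bits = [0, 1, 1]      # r = 1/4 + 1/8 = 3/8 with g = 3 (illustrative)
phases = [0, 3, 5]    # phase of the level-k leaf, each less than 2**k
r = Fraction(3, 8)
c = sum(bits)

holds = all(service(t, bits, phases) >= t * r - c for t in range(256))
bound = delay_bound(sigma_max=1, bits=bits)
```

For this example the lower bound $S(t) \ge tr - c$ holds at every slot checked, and the delay bound evaluates to 8 slots.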


A.2. Delay bound for multiple nodes

We prove the end-to-end delay bound for a network of RRR schedulers using the theory of latency rate servers [9]. Let a backlog period of a stream start at time $t$. Let $t'$ be another time instant in the same backlog period. According to the results presented in [8], if a server guarantees that the amount of stream traffic served in the interval $[t, t']$ is at least $r(t' - t - \theta)$ then the server is a latency rate server of latency $\theta$ and rate $r$. It was also shown in [8] that the end-to-end delay bound for a leaky bucket compliant stream of parameters $(\sigma, \rho)$, serviced by a network of $N$ latency rate servers is $\sigma / r_a + \theta_s$, where $r_a$ is the minimum rate of the $N$ servers ($r_a \ge \rho$) and $\theta_s$ is the sum of the latencies of the $N$ latency rate servers.

In order to prove the end-to-end delay bound, we only need to show that the RRR scheduler is a latency rate server. Eq. (12) may be rewritten as

$$S(t) \ge r \left( t - \frac{c}{r} \right) \tag{18}$$

where $t$ is any time instant in the backlog period of the stream starting at time 0. Since the origin may be shifted to the beginning of any backlog period, we conclude that the RRR scheduler is a latency rate server with rate $r$ and latency $c/r$. Therefore the end-to-end delay bound of a stream served by a network of $N$ RRR schedulers is given by

$$d \le \frac{1}{r_a} \left( \sigma_{\max} + \sum_{k=1}^{N} c^k \right) \tag{19}$$

where $r_a$ is the minimum of the rates allocated to the stream at $N$ nodes, and $c^k$ is the number of times one appears in the normalized rate allocated to the stream at the $k$th node. Again note that this bound is valid for basic RRR as well as work conserving RRR.
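As a worked illustration of Eq. (19), the sketch below evaluates the bound for a stream crossing three nodes. The per-node rate bits and the value of $\sigma_{\max}$ are made-up example inputs, and the function names are hypothetical; exact rational arithmetic is used only to avoid rounding.

```python
from fractions import Fraction


def rate_and_ones(bits):
    """Normalized rate sum(b_k / 2**k) and the number of ones among the bits."""
    rate = sum(Fraction(b, 2 ** k) for k, b in enumerate(bits, start=1))
    return rate, sum(bits)


def end_to_end_bound(sigma_max, per_node_bits):
    """Right-hand side of Eq. (19): (sigma_max + sum of c^k) / r_a, in slots."""
    pairs = [rate_and_ones(bits) for bits in per_node_bits]
    r_a = min(rate for rate, _ in pairs)       # minimum allocated rate on the path
    return (sigma_max + sum(ones for _, ones in pairs)) / r_a


# Three nodes allocating rates 3/8, 1/2 and 3/8 to the stream (illustrative).
path = [[0, 1, 1], [1, 0, 0], [0, 1, 1]]
bound = end_to_end_bound(sigma_max=1, per_node_bits=path)
```

Here $r_a = 3/8$ and the numbers of ones sum to 5, so the bound is $(1 + 5)/(3/8) = 16$ slots in normalized units.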

A.3. Buffers needed at intermediate nodes

Using the results from the theory of latency rate servers [8], the maximum amount of buffers needed at the $k$th node is $\sigma_{\max} + \sum_{i=1}^{k} c^i$.

However, if the non work conserving RRR is used at every node, and the rate ($r^k$) allocated at a node ($k$) is no less than the rate allocated ($r^{k-1}$) at its upstream node ($k-1$) (i.e. $r^k \ge r^{k-1}$), then better buffer bounds can be obtained.

At node $k$ the arrival $A^k(t)$ satisfies

$$A^k(t) \le r^{k-1} t + c^{k-1}.$$

The output $S^k(t)$ satisfies

$$S^k(t) \ge r^k t - c^k.$$

Therefore the maximum backlog $B^k(t)$ is bounded by $A^k(t) - S^k(t)$. Since $r^k$ is greater than or equal to $r^{k-1}$,

$$B^k(t) \le c^{k-1} + c^k,$$

$$B(t) \le 2g$$

at any intermediate node in the network.

A.4. Fairness of work conserving RRR

Now we show that the work conserving RRR is fair and we derive bounds for two fairness measures of the scheduler. If two streams $i$ and $j$ are continuously backlogged in the interval $(t_1, t_2]$, the rates allocated to them are $r_i$ and $r_j$, and $S_i(t_1, t_2)$ and $S_j(t_1, t_2)$ are the numbers of cells of stream $i$ and stream $j$ scheduled by the RRR scheduler in the interval $(t_1, t_2]$, then SFI is defined as

$$\mathrm{SFI} = \max_{t_1, t_2} \left( \frac{S_i(t_1, t_2)}{r_i} - \frac{S_j(t_1, t_2)}{r_j} \right).$$

Without loss of generality assume that SFI is maximized when $t_1 = 0$. Let $S_i(t)$ stand for $S_i(0, t)$. Let $v(t)$ be the number of leaves of the scheduling tree visited in time $(0, t]$. ($v(t)$ is similar to the notion of virtual time used in the context of fair schedulers like weighted fair queueing.) Note that $v(t) = t$ if all the streams are continuously backlogged in the interval $[0, t]$. Now

$$S_i(t) = \sum_{k=1}^{g} \left( \left\lfloor \frac{v(t) - \phi(b_{ik})}{2^k} \right\rfloor + 1 \right) b_{ik} \tag{20}$$

$$\le \sum_{k=1}^{g} \left( \frac{v(t) - \phi(b_{ik})}{2^k} + 1 \right) b_{ik} \tag{21}$$

$$\le v(t) \sum_{k=1}^{g} \frac{b_{ik}}{2^k} + g \tag{22}$$

$$= v(t)\, r_i + g \tag{23}$$

$$\Rightarrow\; \frac{S_i(t)}{r_i} \le v(t) + \frac{g}{r_i}, \tag{24}$$

$$S_j(t) = \sum_{k=1}^{g} \left( \left\lfloor \frac{v(t) - \phi(b_{jk})}{2^k} \right\rfloor + 1 \right) b_{jk} \tag{25}$$

$$\ge \sum_{k=1}^{g} \frac{v(t) - \phi(b_{jk})}{2^k}\, b_{jk} \tag{26}$$

$$\ge v(t) \sum_{k=1}^{g} \frac{b_{jk}}{2^k} - g \tag{27}$$

$$\ge v(t)\, r_j - g \tag{28}$$

$$\Rightarrow\; \frac{S_j(t)}{r_j} \ge v(t) - \frac{g}{r_j} \tag{29}$$

$$\Rightarrow\; \frac{S_i(t)}{r_i} - \frac{S_j(t)}{r_j} \le g \left( \frac{1}{r_i} + \frac{1}{r_j} \right) \tag{30}$$

$$\Rightarrow\; \mathrm{SFI} \le g \left( \frac{1}{r_i} + \frac{1}{r_j} \right). \tag{31}$$

Eq. (31) gives a bound on the service fairness index of work conserving RRR.

We now derive a bound for the worst case fair index (WFI) of the scheduler. If $C_i$ is the smallest number satisfying the equation

$$S_i(t_1, t_2) \ge (t_2 - t_1)\, r_i - C_i, \tag{32}$$

then $\mathrm{WFI} = C_i$.

$$S_i(t) = \sum_{k=1}^{g} \left( \left\lfloor \frac{v(t) - \phi(b_{ik})}{2^k} \right\rfloor + 1 \right) b_{ik} \tag{33}$$

$$\ge \sum_{k=1}^{g} \frac{v(t) - \phi(b_{ik})}{2^k}\, b_{ik} \tag{34}$$

$$\ge t \sum_{k=1}^{g} \frac{b_{ik}}{2^k} - g \tag{35}$$

$$\ge t r_i - g. \tag{36}$$

So WFI for stream $i$ is bounded by $g / r_i$. Normalized WFI is bounded by $g$, the granularity of rate allocation.

References

[1] S. Floyd, V. Jacobson, Link sharing and resource management models for packet networks, IEEE/ACM Transactions on Networking 3 (4) (1995).

[2] R. Garg, Characterization of video traffic, Technical Report TR-95-007, International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704-1198, USA, 1995. ftp://ftp.icsi.berkeley.edu/pub/techreports/1995/tr-95-007.ps.gz.

[3] C.R. Kalmanek, H. Kanakia, S. Keshav, Rate controlled servers for very high-speed networks, in: Proc. IEEE Global Telecommunications Conference, December 1990, pp. 300.3.1–300.3.9.

[4] S.E. Lindberger, K. Tidblom, Weighted fair queueing, a method to control integrated heterogeneous traffic streams, having different QoS demands, in: Twelfth Nordic Teletraffic Seminar NTS12 (VTT Symposium 154), Finland Technical Research Centre, Finland, August 1995, pp. 47–58.

[5] S. Sidropoulos, M. Katevenis, C. Courcoubeti, Weighted round robin cell multiplexing in a general purpose ATM switch chip, IEEE Journal on Selected Areas of Communications 9 (1991) 1265–1279.

[6] S.S. Panwar, T.K. Philips, M.S. Chen, Golden ratio scheduling for flow control with low buffer requirements, IEEE Transactions on Communications 40 (4) (1992) 765–772.

[7] A.K. Parekh, A generalized processor sharing approach to flow control in integrated services networks, Ph.D. dissertation, Massachusetts Institute of Technology, February 1992.

[8] D. Stiliadis, A. Varma, Latency-rate servers: a general model for analysis of traffic scheduling algorithms, in: Proc. INFOCOM, March 1996, pp. 111–119.

[9] D. Stiliadis, Traffic scheduling in packet-switched networks: analysis, design and implementation, Ph.D. dissertation, University of California Santa Cruz, June 1996.

[10] A. Varma, D. Stiliadis, Hardware implementation of fair queueing algorithms for asynchronous transfer mode networks, IEEE Communications Magazine (December 1997) 54–68.

[11] H. Zhang, Service disciplines for packet-switching integrated-services, Ph.D. dissertation, University of California at Berkeley, 1993.

[12] H. Zhang, J.C.R. Bennett, Hierarchical packet fair queueing algorithms, in: Proceedings of ACM SIGCOMM, ACM, September 1997, pp. 143–156.

[13] L. Zhang, Virtual clock: A new traffic control algorithm for packet switching networks, ACM Transactions on Computer Systems 9 (1991) 101–124.


Rahul Garg received B.Tech. in Computer Science and Engg. from the Indian Institute of Technology, Delhi, and a M.S. in Computer Science from the University of California at Berkeley. He is currently a Ph.D. candidate in the Computer Science Department at the Institute of Technology, Delhi. His research interests are resource management, packet scheduling algorithms, protocol design & implementation and digital video processing.

Xiaoqiang Chen received B.Eng and M.Eng degrees in Electrical Engineering from the Nanjing Institute of Communications Engineering in Nanjing, China, and a Ph.D. in computer science from the Cambridge University in the United Kingdom. He is currently a technical manager of Bell Laboratories at Lucent Technologies in Holmdel, New Jersey, USA, where he is conducting research and development in high speed router design, ATM switching systems, traffic management and congestion control, and multicast routing. In addition to his research, he has been a major contributor to the development of several commercial products.
