TRANSCRIPT
Decentralised load balancing in closed and open systems
A. J. Ganesh, University of Bristol
Joint work with S. Lilienthal, D. Manjunath, A. Proutiere
and F. Simatos
Model
Fixed set of m servers
Closed system: fixed set of n clients
Open system: clients arrive according to independent Poisson processes of rates λ_1,…,λ_m
Exponential job sizes, iid with unit mean; service rates are μ_1,…,μ_m
Processor sharing service discipline
Objective
Closed system: balance the server loads
Open system: maximise throughput, minimise delay
Seek decentralised algorithms: a client can sample an arbitrary server and decide whether to move based on the loads at its current and sampled servers
Motivation
Dynamic spectrum access in wireless: servers are channels
Multipath TCP or dynamic routing: servers are routes
Route choice in transport networks: servers are routes
All are examples of congestion games; the question is the time to reach a Nash equilibrium
Algorithm 1: Random local search (RLS)
Clients sample servers uniformly at random at the instants of independent unit-rate Poisson processes
A client moves only if the move would strictly improve its individual service rate (= rate of the server divided by its load)
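The RLS move rule above can be sketched in code (a minimal sketch; the function name `rls_step` and the data layout are illustrative, not from the talk):

```python
import random

def rls_step(loads, rates, client_server):
    """One RLS probe for a single client currently at `client_server`.

    loads[i] : number of clients at server i
    rates[i] : service rate mu_i of server i

    Under processor sharing, the client's individual service rate at
    server i is rates[i] / loads[i].  The client samples a server
    uniformly at random and moves only if doing so strictly improves
    its own rate (its presence adds 1 to the sampled server's load).
    Returns the server the client ends up at.
    """
    i = client_server
    j = random.randrange(len(loads))
    current_rate = rates[i] / loads[i]
    rate_if_moved = rates[j] / (loads[j] + 1)
    if rate_if_moved > current_rate:
        loads[i] -= 1
        loads[j] += 1
        return j
    return i
```

Note that sampling the client's own server never triggers a move, since rejoining it would only lower the client's rate.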
Algorithm 2: Random load-oblivious (RLO)
Clients are impatient and simply perform independent random walks over the servers until they leave
The random walk is described by a continuous-time Markov chain with rate matrix Q and invariant distribution π > 0
Moves are oblivious of server load
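As a concrete illustration of the random-walk description, the invariant distribution π of a rate matrix Q can be computed numerically (the 3-server Q below is a hypothetical example, not from the talk):

```python
import numpy as np

# Hypothetical 3-server rate matrix Q of the continuous-time random
# walk each RLO client performs (off-diagonal entries are jump rates,
# rows sum to zero).
Q = np.array([[-2.0,  1.0,  1.0],
              [ 1.0, -3.0,  2.0],
              [ 2.0,  1.0, -3.0]])

# The invariant distribution pi solves pi Q = 0 with pi summing to 1.
# Stack the normalisation constraint onto the balance equations and
# solve the (consistent) overdetermined system by least squares.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # strictly positive, since this chain is irreducible
```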
Related work: a synchronous model
Berenbrink et al. (2005)
At each time step, each client picks a server at random
If the load at its current server is A and at the new server is B < A, it moves with probability (A − B)/A
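The synchronous move rule can be written down directly (a sketch; the function name is illustrative):

```python
import random

def synchronous_move(load_here, load_there):
    """Berenbrink et al. (2005) rule: with current load A and sampled
    load B < A, the client moves with probability (A - B) / A;
    otherwise it stays put."""
    A, B = load_here, load_there
    if B >= A:
        return False
    return random.random() < (A - B) / A
```

With B = 0 the move probability is 1, and as B approaches A it falls to 0, so large imbalances are corrected aggressively while small ones cause few moves.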
Previous results: closed systems
Expected time to reach load balance in the asynchronous model is O(m^2): Goldberg (2004)
Expected time to reach balance in the synchronous model is O(log log(m) + n^4): Berenbrink et al. (2005); O(log(m) + n log(n)): Berenbrink et al. (2007)
Our results
Closed systems: time to reach perfect balance is O(m^2 log(m)/n + log^2(m)); time to reach ε-balance is O(log(m)/ε)
Open systems: both RLS and RLO are throughput maximising: the system is stable whenever Σ_i λ_i < Σ_i μ_i
Notation and definitions
N(t) = (N_1(t),…,N_m(t)): number of clients at servers 1,…,m at time t
N(t) is balanced if |N_i(t) − N_j(t)| ≤ 1 for all i and j
N(t) is ε-balanced if (1−ε)ρ ≤ N_i(t) ≤ (1+ε)ρ for all i, where ρ = n/m
τ = time to reach balance; τ_ε = time to reach ε-balance
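These definitions translate into simple predicates (a sketch; the function names are ours, not from the talk):

```python
def is_balanced(N):
    """Perfect balance: server occupancies differ by at most 1."""
    return max(N) - min(N) <= 1

def is_eps_balanced(N, eps):
    """eps-balance: every occupancy lies within (1 +/- eps) * rho,
    where rho = n / m is the mean number of clients per server."""
    rho = sum(N) / len(N)
    return all((1 - eps) * rho <= x <= (1 + eps) * rho for x in N)
```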
Notation and definitions
V(t) = max_j N_j(t)
C_v(t) = number of servers with exactly v clients at time t
B_v(t) = C_{v−1}(t)
A_v(t) = number of servers with v−2 or fewer clients at time t
Results for closed systems
E[τ] = O(m^2 log(m)/n + log^2(m))
E[τ_ε] = O(log(m)/ε)
E[τ] = Ω(m^2/n + log(m))
Typically interested in n >> m
Proof (perfect balance)
Previous work used quadratic Lyapunov functions; we use V(t) as the Lyapunov function
Say the RLS algorithm is in phase v at time t if V(t) = v
C_v(t) decreases monotonically during phase v
Phase v ends when C_v(t) hits 0
Proof (cont.)
C_v decreases by 1 when one of the v·C_v clients at a maximally loaded server samples one of the A_v servers with v−2 or fewer clients
This happens at rate v C_v A_v / m
Lower bound for A_v: no more than n/(v−1) servers can have v−1 or more clients
This implies an upper bound on the mean time for C_v to decrease by 1, and hence for V to decrease by 1
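The counting argument on this slide reduces to elementary arithmetic, sketched below (the helper name and signature are illustrative):

```python
def phase_rate_lower_bound(m, n, v, c_v):
    """Lower bound on the rate at which C_v decreases during phase v.

    A decrement occurs when one of the v * C_v clients at a maximally
    loaded server (v clients each) samples one of the A_v servers with
    at most v - 2 clients, which happens at rate v * C_v * A_v / m.
    Since at most n / (v - 1) servers can hold v - 1 or more clients,
    A_v >= m - n / (v - 1)."""
    a_v_lower = max(m - n / (v - 1), 0.0)
    return v * c_v * a_v_lower / m
```

Inverting this rate bounds the mean waiting time for each unit decrease of C_v, and summing over v yields the stated bound on E[τ].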
Proof (ε-balance)
Involves counting the number of ε-balanced, underloaded and overloaded servers, and the number of clients at overloaded servers, and using these to bound the expected time till one such client moves to an underloaded or ε-balanced server
Stability results for open systems
If Σ_i λ_i < Σ_i μ_i, then the system is stable under both the RLO and RLS policies
Proof of stability for RLO
The proof uses Foster's criterion, with the total number of clients in the system as the Lyapunov function
Denote by |x| the L1-norm of a vector x
|N(t)| is the total number of clients in the system at time t; |λ| is the total arrival rate; |μ| is the maximum total service rate (all servers busy)
Foster's criterion
Suppose there exist K, ε and t > 0 such that E_n[|N(t)|] − |n| < −ε for all n with |n| > K
Then N(t) is ergodic
Bounding the drift
E_n[|N(t)|] − |n| = |λ|t − Σ_i μ_i E[Y_i(t)], where Y_i(t) is the time up to time t that server i is non-idle (has at least 1 client)
If E[Y_i(t)] is very nearly equal to t, then Foster's criterion follows from the stability condition
We need a lower bound on Y_i(t) to get an upper bound on the drift
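The drift bound is simple arithmetic once lower bounds on the busy times are available; a hedged sketch (function and argument names are illustrative):

```python
def drift_upper_bound(lam_total, mu, t, busy_lower):
    """Upper bound on the drift E_n[|N(t)|] - |n|.

    lam_total  : |lambda|, the total arrival rate
    mu         : list of per-server service rates mu_i
    t          : length of the time window
    busy_lower : lower bounds on E[Y_i(t)], the expected time
                 server i is non-idle on [0, t]

    The drift equals lam_total * t - sum_i mu_i * E[Y_i(t)], so
    replacing each E[Y_i(t)] by its lower bound gives an upper bound.
    """
    return lam_total * t - sum(m_i * y_i for m_i, y_i in zip(mu, busy_lower))
```

For example, three unit-rate servers busy essentially the whole window against total arrival rate 2.5 give a strictly negative drift, which is exactly what Foster's criterion needs.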
Bounding the idle time
Clients perform independent random walks on the system, but don't leave
Attach independent rate-μ_i Poisson processes of 'virtual' services to the servers
If the number of clients at server i at time t is more than the total number of virtual services at all servers on [0,t], then queue i must be non-empty at time t
Bounding the idle time (cont.)
Suppose |n| is large
The Markov chain describing the random walks reaches equilibrium in constant time
The number of clients at each server is Θ(|n|) from this time on
The number of virtual services is O(1)
Proof of stability for RLS
Uses a slightly different Lyapunov function:
f(n) = |n| + ε·k(n) for a suitably small ε > 0, where k(n) is the number of empty servers in state n
Performance estimates in open systems
Consider large-m asymptotics
X_k^m(t): proportion of servers with exactly k clients at time t
X^m(t) evolves as a density-dependent Markov process
By Kurtz's theorem, its evolution converges to the solution of a deterministic differential equation over finite time horizons
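To see the density-dependent scaling in action, one can simulate the open RLO system and inspect the empirical proportions X_k^m. The sketch below assumes uniform arrival and service rates and a uniform jump kernel (one particular choice of Q); all names are illustrative:

```python
import random

def simulate_rlo(m, lam, mu, horizon, seed=0):
    """Gillespie simulation of the open RLO system (a sketch under
    simplifying assumptions: every server has arrival rate lam and
    service rate mu, and clients jump to a uniform server at rate 1).
    Returns the empirical proportions X_k = fraction of servers with
    exactly k clients at the end of the horizon."""
    rng = random.Random(seed)
    N = [0] * m          # N[i] = clients at server i
    t = 0.0
    while t < horizon:
        clients = sum(N)
        busy = sum(1 for x in N if x > 0)
        # total event rate: arrivals + departures + client jumps
        total = m * lam + busy * mu + clients
        t += rng.expovariate(total)
        u = rng.random() * total
        if u < m * lam:                        # arrival at a uniform server
            N[rng.randrange(m)] += 1
        elif u < m * lam + busy * mu:          # departure from a busy server
            i = rng.choice([j for j in range(m) if N[j] > 0])
            N[i] -= 1
        else:                                  # one client jumps uniformly
            i = rng.choices(range(m), weights=N)[0]
            N[i] -= 1
            N[rng.randrange(m)] += 1
    kmax = max(N)
    return [sum(1 for x in N if x == k) / m for k in range(kmax + 1)]
```

Running this for increasing m and comparing the resulting profiles is one way to check how quickly the stochastic system concentrates around the deterministic limit.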
Performance estimates in open systems (cont.)
Idea: look at equilibrium points of deterministic dynamics
If there is a unique stable equilibrium, we expect the stochastic dynamics to live in the vicinity of this equilibrium
Use the equilibrium to derive performance measures
In more detail …
Kurtz's theorem only applies over finite time horizons, so it doesn't tell us about long-term behaviour
We can get around this using propagation-of-chaos techniques developed by Sznitman
Numerical results
The asymptotic estimates are pretty accurate even for small m, say m = 10
RLO is only a little bit worse than RLS in terms of mean delay (about 20% worse in parameter range considered)
Conclusions
Random local search balances loads very quickly in closed systems: polylog in the number of servers
Impatience is a virtue: impatient customers help to balance load and achieve resource pooling, even if they migrate oblivious of load
Open problems
We have assumed all clients can use all servers, and also that they can move between any pair of servers
What if clients can only move from a server to its neighbours in some graph?
What if clients are of different types, and each type can only use a subset of the servers?
Open problems
Suppose clients can only migrate to neighbouring servers in a graph
Can the time to balance loads be related to mixing times of random walks on this graph?