TRANSCRIPT
Decentralised load balancing in closed and open systems
A. J. Ganesh, University of Bristol
Joint work with S. Lilienthal, D. Manjunath, A. Proutiere
and F. Simatos
Model
Fixed set of m servers
Closed system: fixed set of n clients
Open system: clients arrive according to independent Poisson processes of rates λ_1,…,λ_m
Exponential job sizes, iid with unit mean; service rates are μ_1,…,μ_m
Processor sharing service discipline
Objective
Closed system: balance the server loads
Open system: maximise throughput, minimise delay
Seek decentralised algorithms: a client can sample an arbitrary server and decide whether to move based on the loads at its current and sampled servers
Motivation
Dynamic spectrum access in wireless: servers are channels
Multipath TCP or dynamic routing: servers are routes
Route choice in transport networks: servers are routes
All are examples of congestion games; the question is the time to reach a Nash equilibrium
Algorithm 1: Random local search (RLS)
Clients sample servers uniformly at random at the instants of independent unit-rate Poisson processes
A client moves only if the move would strictly improve its individual service rate (= rate of the server divided by its load)
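The RLS move rule above can be sketched in code (a minimal sketch; the function name `rls_step` and the data layout are illustrative, not from the talk):

```python
import random

def rls_step(loads, rates, client_server):
    """One RLS probe for a single client currently at `client_server`.

    loads[i] : number of clients at server i
    rates[i] : service rate mu_i of server i

    Under processor sharing, the client's individual service rate at
    server i is rates[i] / loads[i].  The client samples a server
    uniformly at random and moves only if doing so strictly improves
    its own rate (its presence adds 1 to the sampled server's load).
    Returns the server the client ends up at.
    """
    i = client_server
    j = random.randrange(len(loads))
    current_rate = rates[i] / loads[i]
    rate_if_moved = rates[j] / (loads[j] + 1)
    if rate_if_moved > current_rate:
        loads[i] -= 1
        loads[j] += 1
        return j
    return i
```

Note that sampling the client's own server never triggers a move, since rejoining it would only lower the client's rate.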
Algorithm 2: Random load-oblivious (RLO)
Clients are impatient and simply perform independent random walks over the servers until they leave
The random walk is described by a continuous-time Markov chain with rate matrix Q and invariant distribution π > 0
Moves are oblivious of server load
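As a concrete illustration of the random-walk description, the invariant distribution π of a rate matrix Q can be computed numerically (the 3-server Q below is a hypothetical example, not from the talk):

```python
import numpy as np

# Hypothetical 3-server rate matrix Q of the continuous-time random
# walk each RLO client performs (off-diagonal entries are jump rates,
# rows sum to zero).
Q = np.array([[-2.0,  1.0,  1.0],
              [ 1.0, -3.0,  2.0],
              [ 2.0,  1.0, -3.0]])

# The invariant distribution pi solves pi Q = 0 with pi summing to 1.
# Stack the normalisation constraint onto the balance equations and
# solve the (consistent) overdetermined system by least squares.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # strictly positive, since this chain is irreducible
```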
Related work: a synchronous model
Berenbrink et al. (2005)
At each time step, each client picks a server at random
If the load at its current server is A and at the new server is B < A, it moves with probability (A − B)/A
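The synchronous move rule can be written down directly (a sketch; the function name is illustrative):

```python
import random

def synchronous_move(load_here, load_there):
    """Berenbrink et al. (2005) rule: with current load A and sampled
    load B < A, the client moves with probability (A - B) / A;
    otherwise it stays put."""
    A, B = load_here, load_there
    if B >= A:
        return False
    return random.random() < (A - B) / A
```

With B = 0 the move probability is 1, and as B approaches A it falls to 0, so large imbalances are corrected aggressively while small ones cause few moves.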
Previous results: closed systems
Expected time to reach load balance in the asynchronous model is O(m^2): Goldberg (2004)
Expected time to reach balance in the synchronous model is O(log log(m) + n^4): Berenbrink et al. (2005); O(log(m) + n log(n)): Berenbrink et al. (2007)
Our results
Closed systems: time to reach perfect balance is O(m^2 log(m)/n + log^2(m)); time to reach ε-balance is O(log(m)/ε)
Open systems: both RLS and RLO are throughput maximising: the system is stable whenever Σ_i λ_i < Σ_i μ_i
Notation and definitions
N(t) = (N_1(t),…,N_m(t)): number of clients at servers 1,…,m at time t
N(t) is balanced if |N_i(t) − N_j(t)| ≤ 1 for all i and j
N(t) is ε-balanced if (1−ε)ρ ≤ N_i(t) ≤ (1+ε)ρ for all i, where ρ = n/m
τ = time to reach balance; τ_ε = time to reach ε-balance
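These definitions translate into simple predicates (a sketch; the function names are ours, not from the talk):

```python
def is_balanced(N):
    """Perfect balance: server occupancies differ by at most 1."""
    return max(N) - min(N) <= 1

def is_eps_balanced(N, eps):
    """eps-balance: every occupancy lies within (1 +/- eps) * rho,
    where rho = n / m is the mean number of clients per server."""
    rho = sum(N) / len(N)
    return all((1 - eps) * rho <= x <= (1 + eps) * rho for x in N)
```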
Notation and definitions
V(t) = max_j N_j(t)
C_v(t) = number of servers with exactly v clients at time t
B_v(t) = C_{v−1}(t)
A_v(t) = number of servers with v−2 or fewer clients at time t
Results for closed systems
E[τ] = O(m^2 log(m)/n + log^2(m))
E[τ_ε] = O(log(m)/ε)
E[τ] = Ω(m^2/n + log(m))
Typically interested in n >> m
Proof (perfect balance)
Previous work used quadratic Lyapunov functions; we use V(t) as the Lyapunov function
Say the RLS algorithm is in phase v at time t if V(t) = v
C_v(t) decreases monotonically during phase v
Phase v ends when C_v(t) hits 0
Proof (cont.)
C_v decreases by 1 when one of the v·C_v clients at a maximally loaded server samples one of the A_v servers with v−2 or fewer clients
This happens at rate v C_v A_v / m
Lower bound for A_v: no more than n/(v−1) servers can have v−1 or more clients
This implies an upper bound on the mean time for C_v to decrease by 1, and hence for V to decrease by 1
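The counting argument on this slide reduces to elementary arithmetic, sketched below (the helper name and signature are illustrative):

```python
def phase_rate_lower_bound(m, n, v, c_v):
    """Lower bound on the rate at which C_v decreases during phase v.

    A decrement occurs when one of the v * C_v clients at a maximally
    loaded server (v clients each) samples one of the A_v servers with
    at most v - 2 clients, which happens at rate v * C_v * A_v / m.
    Since at most n / (v - 1) servers can hold v - 1 or more clients,
    A_v >= m - n / (v - 1)."""
    a_v_lower = max(m - n / (v - 1), 0.0)
    return v * c_v * a_v_lower / m
```

Inverting this rate bounds the mean waiting time for each unit decrease of C_v, and summing over v yields the stated bound on E[τ].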
Proof (ε-balance)
Involves counting the number of ε-balanced, underloaded and overloaded servers, and the number of clients at overloaded servers, and using these to bound the expected time till one such client moves to an underloaded or ε-balanced server
Stability results for open systems
If Σ_i λ_i < Σ_i μ_i, then the system is stable under both the RLO and RLS policies
Proof of stability for RLO
The proof uses Foster's criterion, with the total number of clients in the system as the Lyapunov function
Denote by |x| the L1-norm of a vector x
|N(t)| is the total number of clients in the system at time t; |λ| is the total arrival rate; |μ| is the maximum total service rate (all servers busy)
Foster's criterion
Suppose there exist K, ε and t > 0 such that E_n[|N(t)|] − |n| < −ε for all n with |n| > K
Then N(t) is ergodic
Bounding the drift
E_n[|N(t)|] − |n| = |λ|t − Σ_i μ_i E[Y_i(t)], where Y_i(t) is the time up to time t that server i is non-idle (has at least 1 client)
If E[Y_i(t)] is very nearly equal to t, then Foster's criterion follows from the stability condition
We need a lower bound on Y_i(t) to get an upper bound on the drift
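The drift bound is simple arithmetic once lower bounds on the busy times are available; a hedged sketch (function and argument names are illustrative):

```python
def drift_upper_bound(lam_total, mu, t, busy_lower):
    """Upper bound on the drift E_n[|N(t)|] - |n|.

    lam_total  : |lambda|, the total arrival rate
    mu         : list of per-server service rates mu_i
    t          : length of the time window
    busy_lower : lower bounds on E[Y_i(t)], the expected time
                 server i is non-idle on [0, t]

    The drift equals lam_total * t - sum_i mu_i * E[Y_i(t)], so
    replacing each E[Y_i(t)] by its lower bound gives an upper bound.
    """
    return lam_total * t - sum(m_i * y_i for m_i, y_i in zip(mu, busy_lower))
```

For example, three unit-rate servers busy essentially the whole window against total arrival rate 2.5 give a strictly negative drift, which is exactly what Foster's criterion needs.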
Bounding the idle time
Clients perform independent random walks on the system, but don't leave
Attach independent rate-μ_i Poisson processes of 'virtual' services to the servers
If the number of clients at server i at time t is more than the total number of virtual services at all servers on [0,t], then queue i must be non-empty at time t
Bounding the idle time (cont.)
Suppose |n| is large
The Markov chain describing the random walks reaches equilibrium in constant time
The number of clients at each server is Θ(|n|) from this time on
The number of virtual services is O(1)
Proof of stability for RLS
Uses a slightly different Lyapunov function:
f(n) = |n| + ε·k(n) for a suitably small ε > 0, where k(n) is the number of empty servers in state n
Performance estimates in open systems
Consider large-m asymptotics
X_k^m(t): proportion of servers with exactly k clients at time t
X^m(t) evolves as a density-dependent Markov process
By Kurtz's theorem, its evolution converges to the solution of a deterministic differential equation over finite time horizons
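To see the density-dependent scaling in action, one can simulate the open RLO system and inspect the empirical proportions X_k^m. The sketch below assumes uniform arrival and service rates and a uniform jump kernel (one particular choice of Q); all names are illustrative:

```python
import random

def simulate_rlo(m, lam, mu, horizon, seed=0):
    """Gillespie simulation of the open RLO system (a sketch under
    simplifying assumptions: every server has arrival rate lam and
    service rate mu, and clients jump to a uniform server at rate 1).
    Returns the empirical proportions X_k = fraction of servers with
    exactly k clients at the end of the horizon."""
    rng = random.Random(seed)
    N = [0] * m          # N[i] = clients at server i
    t = 0.0
    while t < horizon:
        clients = sum(N)
        busy = sum(1 for x in N if x > 0)
        # total event rate: arrivals + departures + client jumps
        total = m * lam + busy * mu + clients
        t += rng.expovariate(total)
        u = rng.random() * total
        if u < m * lam:                        # arrival at a uniform server
            N[rng.randrange(m)] += 1
        elif u < m * lam + busy * mu:          # departure from a busy server
            i = rng.choice([j for j in range(m) if N[j] > 0])
            N[i] -= 1
        else:                                  # one client jumps uniformly
            i = rng.choices(range(m), weights=N)[0]
            N[i] -= 1
            N[rng.randrange(m)] += 1
    kmax = max(N)
    return [sum(1 for x in N if x == k) / m for k in range(kmax + 1)]
```

Running this for increasing m and comparing the resulting profiles is one way to check how quickly the stochastic system concentrates around the deterministic limit.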
Performance estimates in open systems (cont.)
Idea: look at equilibrium points of deterministic dynamics
If there is a unique stable equilibrium, we expect the stochastic dynamics to live in the vicinity of this equilibrium
Use the equilibrium to derive performance measures
In more detail …
Kurtz's theorem only applies over finite time horizons, so it doesn't tell us about long-term behaviour
We can get around this using propagation-of-chaos techniques developed by Sznitman
Numerical results
The asymptotic estimates are pretty accurate even for small m, say m = 10
RLO is only a little bit worse than RLS in terms of mean delay (about 20% worse in parameter range considered)
Conclusions
Random local search balances loads very quickly in closed systems: polylog in the number of servers
Impatience is a virtue: impatient customers help to balance load and achieve resource pooling, even if they migrate oblivious of load
Open problems
We have assumed all clients can use all servers, and also that they can move between any pair of servers
What if clients can only move from a server to its neighbours in some graph?
What if clients are of different types, and each type can only use a subset of the servers?
Open problems
Suppose clients can only migrate to neighbouring servers in a graph
Can the time to balance loads be related to mixing times of random walks on this graph?