Restless Multi-Arm Bandits Problem (RMAB): An Empirical Study
Anthony Bonifonte and Qiushi Chen
ISYE 8813 Stochastic Processes and Algorithms, 4/18/2014
Restless Multi-arm Bandit Problem 2/31
Agenda
• Restless multi-arm bandits problem
• Algorithms and policies
• Numerical experiments
▫ Simulated problem instances
▫ Real application: the capacity management problem
Restless Multi-arm Bandits Problem
[Diagram: $N$ arms, each an independent Markov chain with state $s_i$. Active action (superscript 1): a selected arm $i$ transitions according to $P_i^1(\cdot\,|\,s_i)$ and earns $r_i^1(s_i)$. Passive action (superscript 2): an unselected arm $i$ transitions according to $P_i^2(\cdot\,|\,s_i)$ and earns $r_i^2(s_i)$. Unlike the classical bandit, passive arms also change state; hence "restless".]
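One transition of this process is easy to simulate. The sketch below uses a hypothetical 2-arm, 2-state instance (the matrices and rewards are illustrative, not from the slides); the key point is that every arm moves each period, whether or not it is in the active set.

```python
import numpy as np

# Hypothetical 2-arm, 2-state instance; superscript 1 = active, 2 = passive.
P1 = [np.array([[0.0, 1.0], [1.0, 0.0]]),      # arm 0, active transitions
      np.array([[0.9, 0.1], [0.2, 0.8]])]      # arm 1, active transitions
P2 = [np.array([[0.7, 0.3], [0.4, 0.6]]),      # arm 0, passive transitions
      np.array([[1.0, 0.0], [0.0, 1.0]])]      # arm 1, passive transitions
r1 = [np.array([5.0, 0.0]), np.array([3.0, 1.0])]   # active rewards
r2 = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # passive rewards

rng = np.random.default_rng(0)

def step(states, active):
    """One RMAB transition: every arm moves, selected or not ('restless')."""
    reward, nxt = 0.0, []
    for n, s in enumerate(states):
        P, r = (P1[n], r1[n]) if n in active else (P2[n], r2[n])
        reward += r[s]                          # collect this arm's reward
        nxt.append(int(rng.choice(2, p=P[s])))  # sample the next state
    return nxt, reward

states, reward = step([0, 1], active={0})       # arm 0 active, arm 1 passive
```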
• Objective
▫ Discounted rewards (finite, infinite horizon)
▫ Time average
• A general modeling framework
▫ N-choose-M problem
▫ Limited capacity (production capacity, service capacity)
• Connection with Multi-arm bandit problem
▫ Passive arm: $P_i^2 = I_{|S|\times|S|}$ (identity), $r_i^2 = \mathbf{0}_{|S|\times 1}$, $\forall i = 1,\cdots,N$: frozen passive arms with zero reward recover the classical multi-arm bandit as a special case.
Exact Optimal Solution: Dynamic Programming
• Markov decision process (MDP)
▫ State: $(s_1, s_2, \cdots, s_N)$, $s_i \in S$, $\forall i$; $|S|^N$ states in total
▫ Action: the active set $\mathcal{A}$, N-choose-M, i.e. $\binom{N}{M}$ choices
▫ Transition matrix: $P^a\big((s_1',\cdots,s_N')\,\big|\,(s_1,\cdots,s_N),\mathcal{A}\big)$, a map $|S|^N \times |S|^N \times \binom{N}{M} \mapsto [0,1]$
▫ Rewards: $r^a\big((s_1,\cdots,s_N),\mathcal{A}\big)$, a map $|S|^N \times \binom{N}{M} \mapsto \mathbb{R}_+$
• Algorithm:
▫ Finite horizon: backward induction
▫ Infinite horizon (discounted): value iteration, policy iteration
• Problem size: becomes a disaster quickly

| S | N | M | # of states | Space for transition matrix (MB) |
|---|---|---|-------------|----------------------------------|
| 3 | 5 | 2 | 243 | 4.5 |
| 4 | 5 | 2 | 1,024 | 80 |
| 4 | 6 | 2 | 4,096 | 1,920 (~2 GB) |
| 4 | 7 | 2 | 16,384 | 43,008 (~43 GB) |
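The table's growth is quick to check with a short script, assuming 8-byte doubles and one $|S|^N \times |S|^N$ matrix per active set (an assumption, but it matches the figures above):

```python
from math import comb

def rmab_size(S, N, M):
    """Joint state count and transition-tensor memory (MiB) for the exact MDP."""
    states = S ** N                               # |S|^N joint states
    # one |S|^N x |S|^N matrix per active set, stored as 8-byte doubles
    mib = states ** 2 * comb(N, M) * 8 / 2 ** 20
    return states, mib

for row in [(3, 5, 2), (4, 5, 2), (4, 6, 2), (4, 7, 2)]:
    states, mib = rmab_size(*row)
    print(*row, states, round(mib, 1))
```

The last row already needs about 43 GB just to store the transition data, before any computation.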
Lagrangian Relaxation: Upper Bound
• $m(t)$ = number of active arms at time $t$
• Original requirement: $m(t) = M$, $\forall t$
• Relaxed requirement, an "average" version:
$$\mathbb{E}\Big[\sum_t \beta^{t-1} m(t)\Big] = \frac{M}{1-\beta} = M\,(1+\beta+\beta^2+\cdots)$$
• Solve the upper bound
▫ Occupancy measures
▫ Using the dual LP formulation of the MDP
Index Policies
• Philosophy: Decomposition
▫ 1 huge problem with $|S|^N$ states $\Rightarrow$ $N$ small problems with $|S|$ states each
• Index policy
▫ Compute the index for each arm separately
▫ Rank the indices
▫ Choose the M arms with the smallest/largest indices
▫ Easy to compute/implement
▫ Intuitive structure
Whittle's Index Policy (Discounted Rewards)
• The $W$-subsidy problem: for a fixed arm, add a subsidy $W$ to the passive reward
$$V(s,W) = \max\Big\{\, r^1(s) + \beta \sum_{s'} P^1_{ss'}\, V(s',W),\;\; r^2(s) + W + \beta \sum_{s'} P^2_{ss'}\, V(s',W) \Big\}, \quad s \in S$$
• The Whittle index $W(s)$: the subsidy that makes the active and passive actions indifferent in state $s$
$$W(s) = \inf\{\, W : \text{the passive action is better than the active action in state } s \,\}$$
▫ If $W$ is too small, the active action is better; if $W$ is too large, the passive action is better
• Closed-form solutions depend on the specific model
Numerical Algorithm for Solving Whittle's Index
STEP 1: Find a plausible range for $W$
▫ Start with an initial $W$ and an initial step size $\Delta$
▫ Evaluate $\Delta V = V(\text{passive}) - V(\text{active})$ by value iteration
▫ If $\Delta V > 0$, reduce $W$; otherwise increase $W$
▫ Stop when $\Delta V$ reverses sign for the first time: range $[L, U]$ identified
STEP 2: Binary search for $W(s)$ within $[L, U]$
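A compact Python sketch of this two-step procedure for a single arm (illustrative, not the authors' implementation; `P1`/`P2` are the active/passive transition matrices and `r1`/`r2` the reward vectors):

```python
import numpy as np

def passive_minus_active(P1, P2, r1, r2, W, beta, tol=1e-10):
    """Value-iterate the W-subsidy problem; return V(passive) - V(active) per state."""
    V = np.zeros(len(r1))
    while True:
        Va = r1 + beta * P1 @ V            # value of the active action
        Vp = r2 + W + beta * P2 @ V        # value of the passive action, plus subsidy
        Vnew = np.maximum(Va, Vp)
        if np.max(np.abs(Vnew - V)) < tol:
            return Vp - Va
        V = Vnew

def whittle_index(P1, P2, r1, r2, s, beta, W=0.0, step=1.0, tol=1e-7):
    """STEP 1: walk W until the preference at state s flips; STEP 2: bisect."""
    d = passive_minus_active(P1, P2, r1, r2, W, beta)[s]
    direction = -1.0 if d > 0 else 1.0     # passive better -> reduce W
    while True:
        W2 = W + direction * step
        d2 = passive_minus_active(P1, P2, r1, r2, W2, beta)[s]
        if np.sign(d2) != np.sign(d):      # sign reversed: range [L, U] found
            lo, hi = min(W, W2), max(W, W2)
            break
        W, d = W2, d2
    while hi - lo > tol:                   # binary search within [lo, hi]
        mid = 0.5 * (lo + hi)
        if passive_minus_active(P1, P2, r1, r2, mid, beta)[s] > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

As a sanity check, when $P^1 = P^2$ the subsidy only offsets the immediate rewards, so the index reduces to $r^1(s) - r^2(s)$.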
The Primal-Dual Index Policy
• Solve the Lagrangian relaxation formulation
• Input (from the relaxed LP):
▫ Optimal primal solutions (occupancy measures): the total expected discounted time spent selecting arm $n$ in each state
▫ Optimal reduced costs: the rate of decrease in the objective value as the corresponding occupancy increases by 1 unit; they measure how harmful it is to switch an arm from passive to active, or from active to passive
• Policy: let $p$ = number of arms whose current state has a positive active occupancy measure
▫ (1) $p = M$: choose them!
▫ (2) $p < M$: add $(M-p)$ more arms; among the remaining arms, choose the $(M-p)$ with the smallest passive-to-active reduced costs
▫ (3) $p > M$: choose $M$ out of the $p$ arms; kick out the $(p-M)$ with the smallest active-to-passive reduced costs
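Given the LP outputs, the selection step can be sketched as follows (the names `active_cost` and `passive_cost` for the two families of reduced costs are illustrative, not from the slides):

```python
def select_arms(candidates, all_arms, active_cost, passive_cost, M):
    """Adjust the LP-suggested active set to exactly M arms (cases 1-3)."""
    p = len(candidates)
    if p == M:                        # (1) exactly M candidates: take them
        return set(candidates)
    if p < M:                         # (2) add the cheapest passive->active switches
        rest = sorted(set(all_arms) - set(candidates),
                      key=lambda n: active_cost[n])
        return set(candidates) | set(rest[:M - p])
    # (3) p > M: kick out the (p - M) cheapest active->passive switches
    kept = sorted(candidates, key=lambda n: passive_cost[n], reverse=True)
    return set(kept[:M])
```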
Heuristic Index Policies
• Absolute-greedy policy
▫ Choose the $M$ arms with the largest active rewards $r^1(s_n)$
• Relative-greedy policy
▫ Choose the $M$ arms with the largest marginal rewards $r^1(s_n) - r^2(s_n)$
• Rolling-horizon policy ($H$-period look-ahead)
▫ Choose the $M$ arms with the largest marginal value-to-go
$$\big[\,r(s_n,1) + \beta P^1(\cdot\,|\,s_n)\,V\,\big] - \big[\,r(s_n,2) + \beta P^2(\cdot\,|\,s_n)\,V\,\big]$$
where $V$ is the optimal value function over the following $H$ periods
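The two greedy rules amount to a one-line top-$M$ selection each; a minimal sketch (per-arm reward vectors and a current state per arm are assumed inputs):

```python
import numpy as np

def absolute_greedy(r1, states, M):
    """Select the M arms with the largest current active reward."""
    vals = np.array([r1[n][s] for n, s in enumerate(states)])
    return set(np.argsort(vals)[-M:].tolist())

def relative_greedy(r1, r2, states, M):
    """Select the M arms with the largest active-minus-passive reward."""
    vals = np.array([r1[n][s] - r2[n][s] for n, s in enumerate(states)])
    return set(np.argsort(vals)[-M:].tolist())
```

The rolling-horizon policy has the same shape, with the score replaced by the marginal value-to-go above.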
Experiment Settings
• Assume active rewards are larger than passive rewards
• Non-identical arms
• Structures in transition dynamics
▫ Uniformly sampled transition matrix
▫ IFR matrix with non-increasing rewards
▫ P1 is stochastically smaller than P2
▫ Less-connected chain
• Evaluation
▫ Small instances: exact optimal solution
▫ Large instances: upper bound & Monte-Carlo simulation
• Performance measure
▫ Average gap from optimality or from the upper bound
5 Questions of Interest
1. How do different policies compare under different problem
structures?
2. How do different policies compare under various problem sizes?
3. How do different policies compare under different discount factors?
4. How does a multi-period look ahead improve a myopic policy?
5. How do different policies compare under different time horizons?
Question 1: Does problem structure help?
• Uniformly sampled transition matrix and rewards
• Increasing failure rate matrix and non-increasing rewards
• Less-connected Markov chain
• P1 stochastically smaller than P2, non-increasing rewards
Question 2: Does problem size matter?
• Optimality gap: Fixed N and M , increasing S
Question 2: Does problem size matter?
• Optimality gap: fixed M and S, increasing N (gaps decrease)
Question 3: Does discount factor matter?
• Infinite horizon: discount factors
Question 4: Does look ahead help a myopic policy?
• Greedy policies vs. rolling-horizon policies with different look-ahead H
• Problem size: S=8, N=6, M=2
• Problem structure: Uniform vs. less-connected
[Chart, β = 0.4: % gap from optimality (0–16%) for Primal-Dual Index, Whittle's Index, Abs Greedy, Abs Greedy (1A), Rel Greedy, and RH-2/5/10/20/50 under the uniform structure]
[Chart, β = 0.7: same comparison as above]
[Chart, β = 0.9: same comparison as above]
[Chart, β = 0.98: same comparison as above]
Clinical Capacity Management Problem (Deo et al. 2013)
• School-based asthma care for children
• Setting: medical records of patients feed a scheduling policy that decides who to schedule (treat) under a limited van capacity
▫ Arm state $(h, n)$: $h$ = health state at the last appointment, $n$ = time since the last appointment
▫ Capacity $M$, population $N$; active set: choose $M$ out of $N$
• Objective: maximize the total benefit to the community
• Policies compared (improvement over baseline):
▫ Current guidelines (fixed-duration policy)
▫ Whittle's index policy
▫ Primal-dual index policy
▫ Greedy (myopic) policy
▫ Rolling-horizon policy
▫ H-N priority policy, N-H priority policy
▫ No-schedule [baseline]
How Large Is It?
• Horizon: 24 periods (2 years)
• Population size N ~ 50 patients
• State space:
▫ Each arm: 96 states
▫ In total: $96^{50} \approx 1.3 \times 10^{99}$ joint states
• Decision space:
▫ Choose 10 out of 50: $\binom{50}{10} \approx 1.0 \times 10^{10}$
▫ Choose 15 out of 50: $\binom{50}{15} \approx 2.3 \times 10^{12}$
• Actual computation time:
▫ Whittle's indices: 96 states/arm × 50 arms = 4,800 indices, 1.5–3 hours
▫ Presolving the LP relaxation for the primal-dual indices: 4–60 seconds
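These counts are easy to verify exactly (note $\binom{50}{10} = 10{,}272{,}278{,}170$, i.e. about $1.0\times10^{10}$):

```python
from math import comb, log10

# Joint state space: 96 states per arm, 50 arms -> 96^50 ~ 10^99.1 ~ 1.3e99
print(f"joint states ~ 10^{50 * log10(96):.1f}")
# Decision space: number of N-choose-M active sets
print(f"choose 10 of 50: {comb(50, 10):.2e}")
print(f"choose 15 of 50: {comb(50, 15):.2e}")
```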
Performance of Policies
[Chart: improvement of each policy over the no-schedule baseline, as a function of $\delta$]
Whittle's Index vs. Gittins' Index
• (S,N,M=1) vs. (S,N,M=2)
• Sample 20 instances for each problem size
• Whittle’s Index policy vs. DP exact solution
▫ Optimality tolerance = 0.002
| S | N | M=1 | M=2 |
|---|---|-----|-----|
| 3 | 5 | 0% | 25% |
| 3 | 6 | 0% | 25% |
| 5 | 5 | 0% | 40% |

Percentage of instances in which Whittle's index policy is NOT optimal
Summary
• Whittle's index and the primal-dual index work well and efficiently
• The relative-greedy policy can work well, depending on problem structure
• Policies perform worse on the less-connected Markov chain
• All policies tend to work better when capacity is tight
• Look-ahead policies have limited marginal benefit for small discount factors
Q&A
Question 5: Does decision horizon matter?
• Finite horizon: # of periods
[Chart, N=10, M=3, S=3: % gap from the Lagrangian upper bound (0–20%) for the Random, Abs Greedy, Rel Greedy, Whittle's, and Primal-Dual policies across the LC, Uniform, P1 LE P2, and IFR structures]