TRANSCRIPT
Reinforcement Learning and Optimal Control
ASU, CSE 691, Winter 2020
Dimitri P. Bertsekas ([email protected])
Lecture 5
Bertsekas Reinforcement Learning 1 / 22
Outline
1 Multiagent Rollout
2 Deterministic Problem Rollout with Constraints
3 Combinatorial Applications - Examples
Bertsekas Reinforcement Learning 2 / 22
The Pure Form of Rollout: For a Suboptimal Base Policy π, Use J̃_{k+ℓ}(x_{k+ℓ}) = J_{k+ℓ,π}(x_{k+ℓ})
min_{u_k, µ_{k+1}, . . . , µ_{k+ℓ−1}} E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, µ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }
[Figure: ℓ-step lookahead — the first ℓ steps are optimized explicitly and the “future” beyond them is summarized by J̃_{k+ℓ}; a selective-depth lookahead tree with feature extraction (material balance, mobility, safety, etc.), linear weighting of features, and cost approximation r′φ(x, v)]
Approximations:
Computation of J̃_{k+ℓ} (could be approximate): simple choices, parametric approximation, problem approximation, rollout.
DP minimization: replace E{·} with nominal values (certainty equivalent control); limited simulation (Monte Carlo tree search).
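The pure form of rollout on this slide (one-step lookahead, with the base policy supplying the tail cost) can be sketched generically. This is a minimal illustration, not from the slides: the system f, stage cost g, base policy, and disturbance sampler are all hypothetical placeholders.

```python
def rollout_control(x, k, N, controls, f, g, base_policy, disturbance, n_sims=100):
    """One-step rollout: pick argmin over u of the Monte Carlo estimate of
    E{ g_k(x, u, w) + J_{k+1,pi}(x_{k+1}) },
    where the tail cost J_{k+1,pi} is obtained by simulating the base policy pi
    from x_{k+1} to the end of the horizon."""
    def q_factor(u):
        total = 0.0
        for _ in range(n_sims):
            w = disturbance()
            cost, xs = g(x, u, w, k), f(x, u, w, k)
            for m in range(k + 1, N):          # run the base policy to the end
                w = disturbance()
                um = base_policy(xs, m)
                cost += g(xs, um, w, m)
                xs = f(xs, um, w, m)
            total += cost
        return total / n_sims

    return min(controls(x, k), key=q_factor)

# Toy check: scalar linear system, quadratic cost, "do nothing" base policy.
u_star = rollout_control(
    x=2.0, k=0, N=3,
    controls=lambda x, k: [-1.0, 0.0, 1.0],
    f=lambda x, u, w, k: x + u + w,
    g=lambda x, u, w, k: x**2 + u**2,
    base_policy=lambda x, k: 0.0,
    disturbance=lambda: 0.0,                   # deterministic for the demo
)
print(u_star)  # -1.0: rollout steers the state toward 0, improving on the base policy
```

With a genuinely random disturbance sampler the same code gives a Monte Carlo rollout; here the disturbance is fixed at zero so the result is reproducible.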
One-step lookahead:
min_{u_k} E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(x_{k+1}) }
ℓ-step lookahead:
min_{u_k, µ_{k+1}, . . . , µ_{k+ℓ−1}} E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, µ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }
Policy improvement property (when no truncation is used):
J_{k,π̃}(x_k) ≤ J_{k,π}(x_k), for all x_k and k,
where π̃ is the rollout policy and π is the base policy.
The key issue in the practical application of rollout: too much computation. If the problem is deterministic, the computation is greatly reduced (no Monte Carlo simulation is needed).
Another computational bottleneck: large control spaces, e.g., the multiagent case, where the control has many components,
u_k = (u_k^1, . . . , u_k^m)
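The policy improvement property can be checked numerically on a tiny deterministic problem. A minimal sketch (the linear system, quadratic cost, and "do nothing" base policy are illustrative assumptions, not from the lecture):

```python
N = 3
CONTROLS = [-2.0, -1.0, 0.0, 1.0, 2.0]
f = lambda x, u: x + u                 # deterministic scalar system
g = lambda x, u: x**2 + u**2           # stage cost
base = lambda x, k: 0.0                # base policy: do nothing

def policy_cost(x, k, policy):
    # total cost of following `policy` from state x at stage k to the horizon
    if k == N:
        return 0.0
    u = policy(x, k)
    return g(x, u) + policy_cost(f(x, u), k + 1, policy)

def rollout(x, k):
    # rollout policy: one-step lookahead with the base policy's tail cost
    return min(CONTROLS, key=lambda u: g(x, u) + policy_cost(f(x, u), k + 1, base))

x0 = 2.0
print(policy_cost(x0, 0, base))     # 12.0  (base policy cost)
print(policy_cost(x0, 0, rollout))  # 7.0   (rollout policy: never worse)
```

The rollout policy drives the state to 0 in two steps and attains cost 7.0, versus 12.0 for the base policy, consistent with J_{k,π̃} ≤ J_{k,π}.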
Bertsekas Reinforcement Learning 4 / 22
Trading off Control Space Complexity with State Space Complexity
[Figure: equivalent reformulation — at state x_k the control components u_k^1, . . . , u_k^m are chosen one at a time, passing through the intermediate states (x_k, u_k^1), (x_k, u_k^1, u_k^2), . . . , (x_k, u_k^1, . . . , u_k^{m−1}), followed by the random transition x_{k+1} = f_k(x_k, u_k, w_k) with random cost g_k(x_k, u_k, w_k)]
An equivalent reformulation — “unfolding” the control action: the control space is simplified at the expense of m − 1 additional layers of states, and correspondingly m − 1 additional cost-to-go functions
J_k^1(x_k, u_k^1), J_k^2(x_k, u_k^1, u_k^2), . . . , J_k^{m−1}(x_k, u_k^1, . . . , u_k^{m−1})
Multiagent or one-component-at-a-time rollout is just standard rollout for the reformulated problem.
The increase in size of the state space does not adversely affect rollout.
The cost improvement property is maintained.
Complexity reduction: the one-step lookahead branching factor is reduced from n^m to n·m, where n is the number of possible choices for each component u_k^i.
The base policy for u_k^i may depend only on x_k, or may also depend on u_k^{i−1}, u_k^{i−2}, . . .
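The reduction from n^m to n·m Q-factor evaluations can be illustrated directly. A sketch under simplifying assumptions (the joint Q-factor below is a given black-box function; in actual multiagent rollout it would be estimated by simulating the base policy):

```python
from itertools import product

def joint_lookahead(q, choices, m):
    """All-at-once minimization over the joint control: n^m Q-factor calls."""
    return min(product(choices, repeat=m), key=q)

def one_at_a_time_lookahead(q, choices, base):
    """Multiagent rollout ordering: component i is optimized with components
    1..i-1 already fixed at their chosen values and components i+1..m held at
    the base policy's values.  Only n*m Q-factor calls."""
    u = list(base)
    for i in range(len(u)):
        u[i] = min(choices, key=lambda c: q(tuple(u[:i] + [c] + u[i + 1:])))
    return tuple(u)

calls = {"n": 0}
def q(u):  # illustrative separable Q-factor; best joint control is (1, 2, 0)
    calls["n"] += 1
    return sum((ui - ti) ** 2 for ui, ti in zip(u, (1, 2, 0)))

choices, m = [0, 1, 2], 3
print(joint_lookahead(q, choices, m), calls["n"])                  # (1, 2, 0) 27
calls["n"] = 0
print(one_at_a_time_lookahead(q, choices, (0, 0, 0)), calls["n"])  # (1, 2, 0) 9
```

For this separable Q-factor both schemes find the same minimizer; in general the one-at-a-time scheme may return a different control, but (as the slide notes) the cost improvement property over the base policy is preserved.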
Bertsekas Reinforcement Learning 5 / 22
A Deterministic Example: Multi-Vehicle Routing
[Figure: two vehicles with capacity 1 moving on a graph with numbered nodes 1–12; the optimal solution is shown]
n vehicles move along the arcs of a given graph.
Some of the nodes of the graph contain a task to be performed by the vehicles. Each task is performed only once.
Find a route for each vehicle so that all the tasks are collectively performed by thevehicles in minimum time.
Cost function choice: for each stage there is a cost of one unit for each task that is pending at the end of the stage.
Total cost: the number of stages that a task is pending, summed over the tasks.
What is the state? What is the control? Why is this a good multiagent candidate?
Bertsekas Reinforcement Learning 6 / 22
Multi-Agent Rollout for Multi-Vehicle Routing
Base heuristic: each vehicle moves toward the pending task that is closest (in terms of the number of moves needed to reach it).
The vehicles make their selections independently, one at a time in the order 1, . . . , n, taking into account the tasks that have already been performed.
What is the solution produced by the base heuristic?
What is the solution produced by the one-vehicle-at-a-time rollout?
Do we get cost improvement? What is the intuition?
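The questions above can be explored in code. A minimal sketch on a line graph (the 1-D layout, unit moves, and task positions are illustrative assumptions): the base heuristic sends every vehicle toward its nearest pending task, so both vehicles initially chase the same task, while one-vehicle-at-a-time rollout lets vehicle 1 discover that it should head the other way.

```python
def base_move(p, tasks):
    # base heuristic: one step toward the nearest pending task (stay if none)
    if not tasks:
        return p
    t = min(tasks, key=lambda t: abs(t - p))
    return p + (1 if t > p else -1 if t < p else 0)

def base_stage(positions, tasks):
    return [base_move(p, tasks) for p in positions]

def simulate(positions, tasks, stage_policy, horizon=100):
    """Cost = number of still-pending tasks at the end of each stage."""
    tasks, cost = set(tasks), 0
    for _ in range(horizon):
        if not tasks:
            break
        positions = stage_policy(positions, tasks)
        tasks -= set(positions)        # a vehicle on a task node performs it
        cost += len(tasks)
    return cost

def rollout_stage(positions, tasks):
    # one-vehicle-at-a-time rollout: vehicle i tries each move; vehicles < i
    # keep their chosen moves, vehicles > i and all future stages use the base heuristic
    chosen = []
    for i, p in enumerate(positions):
        def q(mv):
            stage = chosen + [p + mv] + [base_move(pp, tasks) for pp in positions[i + 1:]]
            remaining = tasks - set(stage)
            return len(remaining) + simulate(stage, remaining, base_stage)
        chosen.append(p + min((-1, 0, 1), key=q))
    return chosen

start, tasks = [0, 1], {2, -3}
print(simulate(start, tasks, base_stage))     # 4: both vehicles first chase the task at 2
print(simulate(start, tasks, rollout_stage))  # 2: vehicle 1 heads for -3 immediately
```

Rollout halves the cost here, which illustrates the cost improvement intuition: the rollout minimization corrects the base heuristic's coordination failure one vehicle at a time.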
Bertsekas Reinforcement Learning 7 / 22
Constrained Deterministic Rollout
[Figure: deterministic rollout — from the current state x_k, each candidate control u_k, u′_k, u″_k leads to a next state x_{k+1}, x′_{k+1}, x″_{k+1}; the base heuristic is run from each next state to supply the “future” cost (Q-factors)]
[Figure: traveling salesman example — the current partial tour is extended to the next cities using the nearest neighbor heuristic]
For every pair (xk, uk), uk ∈ Uk(xk), we generate a Q-factor

Qk(xk, uk) = gk(xk, uk) + Hk+1(fk(xk, uk))

using the base heuristic [Hk+1(xk+1) is the heuristic cost starting from xk+1].
Select the uk with minimal Q-factor, move to the next state xk+1, and continue.
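The Q-factor minimization above can be sketched in code. This is a minimal sketch under an assumed problem interface: `controls`, `f`, `g`, and `base_heuristic` are hypothetical names standing in for Uk(xk), fk, gk, and Hk+1.

```python
# Hypothetical interface: controls(x) -> iterable of controls Uk(x),
# f(x, u) -> next state, g(x, u) -> stage cost,
# base_heuristic(x) -> heuristic cost-to-go H(x) from state x.

def rollout_step(x, controls, f, g, base_heuristic):
    """One step of deterministic rollout: return the control with
    minimal Q-factor Q(x, u) = g(x, u) + H(f(x, u))."""
    best_u, best_q = None, float("inf")
    for u in controls(x):
        q = g(x, u) + base_heuristic(f(x, u))
        if q < best_q:
            best_u, best_q = u, q
    return best_u, best_q
```

For instance, with scalar dynamics f(x, u) = x + u, cost g = u², and the heuristic H(x) = |x|, rollout from x = 3 compares Q-factors 3, 3, and 5 for u = −1, 0, 1 and picks the first minimizer.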
What if there are constraints, i.e., future control choices are constrained by past control choices?
The base heuristic and rollout should be modified (e.g., to avoid controls that compromise the feasibility of future controls).
Bertsekas Reinforcement Learning 9 / 22
Constrained Example: Multi-Vehicle Routing with Constraints
[Figure: a 12-node routing instance with two vehicles (tasks 1-9, nodes 10-12); panels show a Capacity=1 instance and its optimal solution.]
Examples of constraints:
Vehicle capacity constraints (limit on how many tasks some vehicles can perform).
Vehicle specializations (some tasks can be performed only by some of the vehicles).
Time windows (some tasks must be performed within specified time intervals).
Bertsekas Reinforcement Learning 10 / 22
How to Deal with Constraints Across Stages in Deterministic Problems
Consider a deterministic optimal control problem with system xk+1 = fk(xk, uk).
A complete trajectory is a sequence

T = (x0, u0, x1, u1, . . . , uN−1, xN)

We consider the (very general) problem

min_{T ∈ C} G(T)

where G is a given cost function and C is a given constraint set of trajectories.
We transform to an unconstrained problem in order to apply rollout ideas.
Redefine the state to be the partial trajectory

yk = (x0, u0, x1, . . . , uk−1, xk)

The partial trajectory evolves according to a redefined system equation:

yk+1 = (yk, uk, fk(xk, uk))

The problem becomes to minimize G(yN) subject to the constraint yN ∈ C.
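The redefined system equation can be made concrete by storing yk as a flat tuple. This is a sketch under the assumption that `f` is the original system function fk:

```python
# State augmentation for constrained deterministic rollout: the state
# is the partial trajectory y_k = (x0, u0, x1, ..., u_{k-1}, x_k),
# stored as a flat tuple whose last entry is the current state x_k.

def extend(y, u, f):
    """y_{k+1} = (y_k, u_k, f_k(x_k, u_k)): append the control and the
    resulting next state to the partial trajectory."""
    x = y[-1]            # current state is the last entry of y
    return y + (u, f(x, u))
```

Applying `extend` over the full horizon yields the complete trajectory yN = (x0, u0, x1, . . . , uN−1, xN), on which both the cost G(yN) and the constraint yN ∈ C can be evaluated.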
Bertsekas Reinforcement Learning 11 / 22
Rollout Algorithm - Partial Trajectory-Dependent Base Heuristic
[Figure: rollout over partial trajectories. The current partial trajectory yk is extended with each candidate control uk to yk+1 = (yk, uk, xk+1); the base heuristic completes it with R(yk+1), giving the complete trajectory Tk(yk, uk) = (yk, uk, R(yk+1)). The figure also contrasts a high-cost transition chosen by the heuristic with the rollout choice.]
Given yk = (x0, u0, x1, u1, . . . , uk−1, xk), consider all controls uk and the corresponding next states xk+1.
Extend yk to obtain the partial trajectories yk+1 = (yk, uk, xk+1).
Run the base heuristic from each yk+1 to obtain the partial trajectory R(yk+1).
Join the partial trajectories yk+1 and R(yk+1) to obtain the complete trajectories

Tk(yk, uk) = (yk, uk, R(yk+1))

Find the set of controls Uk(yk) for which Tk(yk, uk) is feasible, i.e., Tk(yk, uk) ∈ C.
Choose the control uk ∈ Uk(yk) according to the minimization

uk ∈ arg min_{uk ∈ Uk(yk)} G(Tk(yk, uk))

Bertsekas Reinforcement Learning 12 / 22
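The constrained rollout step can be sketched as follows. This is a minimal sketch under an assumed interface: `controls`, `f`, `run_heuristic`, `G`, and `is_feasible` are hypothetical names standing in for Uk(xk), fk, the base heuristic producing R(yk+1), the trajectory cost G, and membership in the constraint set C.

```python
# Hypothetical interface: y is the partial trajectory tuple with
# current state y[-1]; run_heuristic(y_next) returns the remaining
# trajectory R(y_{k+1}) as a tuple; is_feasible(T) tests T in C.

def constrained_rollout_step(y, controls, f, run_heuristic, G, is_feasible):
    """Among controls u whose complete trajectory
    T_k(y_k, u_k) = (y_k, u_k, R(y_{k+1})) is feasible, pick the one
    minimizing G(T)."""
    x = y[-1]
    best_u, best_T, best_cost = None, None, float("inf")
    for u in controls(x):
        y_next = y + (u, f(x, u))           # y_{k+1} = (y_k, u_k, x_{k+1})
        T = y_next + run_heuristic(y_next)  # append R(y_{k+1})
        if not is_feasible(T):              # discard T not in C
            continue
        cost = G(T)
        if cost < best_cost:
            best_u, best_T, best_cost = u, T, cost
    return best_u, best_T
```

Note the design choice: feasibility is checked on the completed trajectory Tk(yk, uk), not on uk alone, so a control is rejected whenever the base heuristic cannot extend it to a feasible complete trajectory.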
Rollout Algorithm Properties
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
3 5 2 4 6 2
10 5 7 8 3 9 6 1 2
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
3 5 2 4 6 2
10 5 7 8 3 9 6 1 2
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
3 5 2 4 6 2
10 5 7 8 3 9 6 1 2
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
3 5 2 4 6 2
10 5 7 8 3 9 6 1 2
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xk+2 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
3 5 2 4 6 2
10 5 7 8 3 9 6 1 2
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xk+2 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
3 5 2 4 6 2
10 5 7 8 3 9 6 1 2
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xk+2 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
3 5 2 4 6 2
10 5 7 8 3 9 6 1 2
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xk+2 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
yk yk+1 H(yk+1) xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xk+2 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
yk yk+1 H(yk+1) xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xk+2 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
yk yk+1 H(yk+1) xN�1 xN u0 u1 uk�1 uk uk+1 uN�1 Tk(yk, uk) =�yk, uk, H(yk+1)
�
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk
1
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
x0 x1 x2 xk�1 xk xk+1 xk+2 xN�1 xN u0 u1 uk�1 uk uk+1 uN�1
yk yk+1 H(yk+1) xN�1 xN u0 u1 uk�1 uk uk+1 uN�1 Tk(yk, uk) =�yk, uk, H(yk+1)
�
(u0) (u0, u1) (u0, u1, u2) u = (u0, . . . , uN�1) . . .
SC ` Stages Riccati Equation Iterates P P0 P1 P2 �2 � 1 �2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k � wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
ACB ACD CAB CAD CDA
SA SB CAB CAC CCA CCD CBC CCB CCD
CAB CAD CDA CCD CBD CDB CAB
Do not Repair Repair 1 2 n�1 n p11 p12 p1n p1(n�1) p2(n�1)
...
p22 p2n p2(n�1) p2(n�1) p(n�1)(n�1) p(n�1)n pnn
2nd Game / Timid Play 2nd Game / Bold Play
1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw
0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2
[Figure residue: diagrams for fortified rollout (permanent trajectory, tentative trajectory, the high-cost transition chosen by the base heuristic at x∗1 vs. the rollout choice), truncated rollout with a terminal cost approximation and approximate Q-factors Q̃(x, u), target-tube/reachability constraints {X0, X1, . . . , XN}, the multi-vehicle routing example (Capacity = 1, optimal solution), and the traveling salesman nearest neighbor heuristic.]
The notions of sequential consistency and sequential improvement apply. Part of their definition includes that the set of "feasible" controls Uk(yk) is nonempty for all k; see the notes.
There is a fortified version (it follows the best feasible trajectory found so far). It has the cost improvement property, assuming the base heuristic generates a feasible trajectory starting from the initial condition y0 = x0.
There is a multiagent version that uses one-control-component-at-a-time minimization.
Additional variants are possible; see the notes.
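For a deterministic problem, the fortified version above can be sketched in code. This is a minimal illustration, not the notes' exact algorithm; `controls`, `step`, and `base_heuristic` are hypothetical problem-supplied interfaces, and the base heuristic is assumed to return a feasible completing trajectory:

```python
def fortified_rollout(x0, N, controls, step, base_heuristic):
    """Fortified rollout for a deterministic N-stage problem (sketch).

    controls(x, k): feasible controls at state x, stage k
    step(x, u, k): (next_state, stage_cost) -- deterministic transition
    base_heuristic(x, k): (cost_to_go, control_list) completing the
        problem from state x at stage k (assumed feasible)
    """
    # Tentative best trajectory: the base heuristic's solution from x0.
    tent_cost, tent_seq = base_heuristic(x0, 0)
    x, accrued, applied = x0, 0.0, []
    for k in range(N):
        # One-step lookahead: complete each candidate control with the heuristic.
        best_cost, best_seq = float("inf"), None
        for u in controls(x, k):
            x1, g = step(x, u, k)
            h, seq = base_heuristic(x1, k + 1)
            if accrued + g + h < best_cost:
                best_cost, best_seq = accrued + g + h, [u] + seq
        # Fortification: adopt the lookahead trajectory only if its total cost
        # is at least as good as the tentative best; otherwise keep following
        # the tentative best trajectory.
        if best_seq is not None and best_cost <= tent_cost:
            tent_cost, tent_seq = best_cost, best_seq
        u = tent_seq.pop(0)
        x, g = step(x, u, k)
        accrued += g
        applied.append(u)
    return applied, accrued


# Toy instance (illustrative): x_{k+1} = x_k + u_k, stage cost x_k^2 + u_k^2,
# base heuristic applies u = 0 for all remaining stages.
N = 3
def controls(x, k): return (-1, 0, 1)
def step(x, u, k): return x + u, x * x + u * u
def base_heuristic(x, k):
    cost, seq = 0.0, []
    for j in range(k, N):
        x, g = step(x, 0, j)
        cost += g
        seq.append(0)
    return cost, seq

seq, cost = fortified_rollout(2, N, controls, step, base_heuristic)
```

On this toy instance the base heuristic's trajectory from x0 = 2 has cost 12, while the rollout sequence is (−1, −1, 0) with cost 7, consistent with the cost improvement property.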
Bertsekas Reinforcement Learning 13 / 22
Example: Multi-Vehicle Routing with Constraints
[Figure: task locations 1–12 on a grid, with the initial positions of Vehicle 1 and Vehicle 2, Capacity = 1; the optimal solution is shown.]
The base heuristic moves each vehicle (one at a time) to the closest pending task.
What is the first move of vehicle 1 chosen by the base heuristic and by rollout?
What is the solution found by the base heuristic?
What is the solution found by rollout?
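To make the one-vehicle-at-a-time idea concrete, here is a minimal sketch on a simplified one-dimensional analog of the example (two vehicles on the integer line, tasks at fixed positions, cost = number of stages until all tasks are served). The instance and all function names are illustrative assumptions, not the grid instance of the figure:

```python
def nearest_move(pos, pending):
    """Base heuristic move: one step toward the closest pending task (ties -> smaller)."""
    if not pending:
        return 0
    target = min(sorted(pending), key=lambda t: abs(t - pos))
    return (target > pos) - (target < pos)

def advance(state, moves):
    """Apply one joint move; a task is served when a vehicle lands on it."""
    p1, p2 = state[0] + moves[0], state[1] + moves[1]
    return (p1, p2, state[2] - {p1, p2})

def base_cost(state, cap=1000):
    """Stages to serve all tasks when both vehicles follow the base heuristic."""
    steps = 0
    while state[2] and steps < cap:
        m = (nearest_move(state[0], state[2]), nearest_move(state[1], state[2]))
        state = advance(state, m)
        steps += 1
    return steps

def multiagent_rollout_move(state):
    """One-component-at-a-time: pick u1 with vehicle 2 assumed to follow the base
    heuristic, then pick u2 with u1 fixed; each choice is evaluated by one step
    plus the base policy's cost-to-go."""
    p1, p2, pending = state
    u1 = min((-1, 0, 1),
             key=lambda u: 1 + base_cost(advance(state, (u, nearest_move(p2, pending)))))
    u2 = min((-1, 0, 1),
             key=lambda u: 1 + base_cost(advance(state, (u1, u))))
    return u1, u2

# Illustrative instance: vehicles at 0 and 1, tasks at -5 and 2.  Under the base
# heuristic both vehicles chase the nearby task at 2; rollout sends vehicle 1 left.
start = (0, 1, frozenset({-5, 2}))
state, rollout_cost = start, 0
while state[2]:
    state = advance(state, multiagent_rollout_move(state))
    rollout_cost += 1
```

On this instance the base policy needs 7 stages while the rollout policy needs 5, and rollout's first joint move is (−1, 1): vehicle 1 heads for the far task while vehicle 2 serves the near one.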
Bertsekas Reinforcement Learning 14 / 22
General Discrete Optimization Problem: Minimize G(u) subject to u ∈ C
[Figure residue, repeated across pages: the reformulation of the discrete optimization problem as a DP problem — an artificial start state, the sets of partial solutions (u1), (u1, u2), (u1, u2, u3), . . . , the set of complete solutions u = (u1, . . . , uN) with cost G(u), and an artificial end state — together with leftover labels and formulas from aggregation-based approximation slides (feature vector F(i), disaggregation sets Iℓ, aggregate transition probabilities and costs).]
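Rollout applies directly to this formulation: the solution is built one component at a time, and each candidate component is evaluated by letting a base heuristic complete the partial solution. A minimal generic sketch, where the interfaces `candidates`, `complete`, and the toy instance are illustrative assumptions:

```python
def rollout_discrete(N, candidates, complete, G):
    """Rollout for minimizing G(u) over u in C, with u = (u1, ..., uN) (sketch).

    candidates(partial): feasible values of the next component
    complete(partial): base heuristic extending partial to a full solution in C
    G(u): cost of a complete solution
    """
    partial = []
    for _ in range(N):
        best_v, best_c = None, float("inf")
        for v in candidates(partial):
            c = G(complete(partial + [v]))   # heuristic completion cost (Q-factor)
            if c < best_c:
                best_v, best_c = v, c
        partial.append(best_v)
    return partial, G(complete(partial))


# Toy instance: pick u_i in {0, 1} to make sum(a_i * u_i) close to a target T.
a, T = [3, 5, 2], 8
G = lambda u: abs(sum(ai * ui for ai, ui in zip(a, u)) - T)
candidates = lambda partial: (0, 1)
complete = lambda partial: partial + [0] * (len(a) - len(partial))  # base: all zeros

u, cost = rollout_discrete(len(a), candidates, complete, G)
```

Here the all-zeros base heuristic has cost 8, while rollout finds u = (1, 1, 0) with cost 0; by the cost improvement property, rollout never does worse than the base heuristic, though it need not find an optimal solution in general.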
[Figure: a three-dimensional assignment problem with 3 node layers (jobs j, machines ℓ, workers w); a 5-dimensional assignment problem has 6 node layers. Partial solutions (u0), (u0, u1), (u0, u1, u2), . . . , complete solution u = (u0, . . . , uN−1); trajectory x0, x1, . . . , xN under controls u0, . . . , uN−1, with constraint Tk(yk, uk) = (yk, uk, H(yk+1)) ∈ C.]
This is a special case of the constrained deterministic optimal control problem where each state xk can only take a single value.
A very broad range of problems, e.g., combinatorial, integer programming, etc.
Solution by constrained rollout applies. Provides an entry point to the use of neural nets in discrete optimization through approximation in policy space.
Other methods: local/random search, genetic algorithms, etc. Rollout is different.
Bertsekas Reinforcement Learning 16 / 22
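As a hedged illustration of how rollout differs from local/random search (my own sketch, not code from the lecture): at each stage, pure rollout compares the candidate controls by completing each partial solution with a base heuristic, and commits to the control whose completion costs least. The cost function and greedy base heuristic below are illustrative placeholders.

```python
# Pure rollout for a generic deterministic problem with binary decisions
# u = (u_0, ..., u_{N-1}) and terminal cost G(u) = cost(u).
# Both `cost` and the greedy base heuristic are hypothetical stand-ins.

def base_heuristic(partial, N, cost):
    """Complete a partial solution greedily: choose each remaining
    component to minimize the cost of the zero-padded candidate."""
    u = list(partial)
    while len(u) < N:
        u.append(min((0, 1), key=lambda b: cost(u + [b] + [0] * (N - len(u) - 1))))
    return u

def rollout(N, cost):
    """At each stage, try both controls, complete with the base heuristic,
    and commit to the control whose completed solution is cheaper."""
    u = []
    for k in range(N):
        u.append(min((0, 1), key=lambda b: cost(base_heuristic(u + [b], N, cost))))
    return u, cost(u)

# Toy cost: squared deviation from a hidden target pattern.
target = [1, 0, 1, 1, 0]
cost = lambda v: sum((a - b) ** 2 for a, b in zip(v, target))
print(rollout(5, cost))  # → ([1, 0, 1, 1, 0], 0)
```

By the cost improvement property of rollout, the result is no worse than running the base heuristic from scratch; here it recovers the target exactly.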
Facility Location: A Classical Integer Programming Problem
[Figure: clients i and facilities j, with flows xij and placement variables zj = 0 or 1.]
Place facilities at some locations to serve the needs of M “clients.”
Client i = 1, . . . , M has a demand di for services that may be satisfied at a location j = 1, . . . , N at a cost aij per unit.
A facility placed at location j has capacity cj and costs bj. Here uj ∈ {0, 1}, with uj = 1 if a facility is placed at j.
Problem: Minimize ∑_{i=1}^M ∑_{j=1}^N aij xij + ∑_{j=1}^N bj uj subject to total demand satisfaction constraints.
Note: If the placement variables uj are known, the remaining problem is easily solvable (it is a linear “transportation” problem).
Bertsekas Reinforcement Learning 17 / 22
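The note says that once the placements uj are fixed, what remains is a linear transportation problem. As a rough, hedged sketch (mine, not the lecture's), the following evaluates a fixed placement with a simple greedy allocation in place of an exact LP solve; all names (`placement_cost`, `demands`, `caps`) are illustrative.

```python
# Evaluate a fixed placement u for the facility location problem:
# sum of placement costs bj over open facilities, plus serving costs
# from a greedy (not exact) solution of the transportation subproblem.
# Returns infinity if demand cannot be met -- an infeasible placement.

def placement_cost(u, demands, caps, a, b):
    """u[j] in {0, 1}; a[i][j] is the per-unit cost of serving client i
    from location j; caps[j] and b[j] are capacity and placement cost."""
    remaining = [caps[j] if u[j] else 0 for j in range(len(u))]
    total = sum(bj for bj, uj in zip(b, u) if uj)   # placement costs
    for i, d in enumerate(demands):
        # serve client i from the cheapest open facilities first
        for j in sorted(range(len(u)), key=lambda j: a[i][j]):
            if d == 0:
                break
            x = min(d, remaining[j])
            total += a[i][j] * x
            remaining[j] -= x
            d -= x
        if d > 0:
            return float('inf')                     # unmet demand
    return total

# Two clients, two candidate locations
demands = [3, 2]
caps    = [5, 5]
a = [[1, 2],     # a[i][j]: per-unit serving cost
     [2, 1]]
b = [5, 5]       # placement costs
print(placement_cost([1, 1], demands, caps, a, b))  # → 15
print(placement_cost([1, 0], demands, caps, a, b))  # → 12
```

Greedy per-client allocation is not guaranteed optimal for the transportation subproblem; a real implementation would use an LP or network-flow solver, but a cheap evaluator like this is often enough to drive rollout.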
Facility Location Problem: Formulation for Constrained Rollout
Consider placements one location at a time.
Stage = placement decision at a single location (N stages). State at time k = the placement choices at locations 1, . . . , k. Control = choice between 1 (place) or 0 (don't place) for the next facility. What are the constraints? Is this multiagent?
Base heuristic: Having fixed u1, . . . , uk, place a facility at every remaining location.
Rollout: Having fixed u1, . . . , uk, compare two possibilities:
I Set uk+1 = 1 (place a facility at location k + 1), set uk+2 = · · · = uN = 1 (as per the base heuristic), and solve the remaining problem.
I Set uk+1 = 0 (don't place a facility at location k + 1), set uk+2 = · · · = uN = 1 (as per the base heuristic), and solve the remaining problem.
Select uk+1 = 1 or uk+1 = 0, depending on which yields feasibility and minimum cost.
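The comparison above can be sketched in a few lines of Python. Here `cost_of_plan` is a hypothetical stand-in for solving the remaining problem: it is assumed to return the total cost of a complete placement plan, or infinity if the plan is infeasible.

```python
def rollout_placements(N, cost_of_plan):
    """Fix u_1, ..., u_N one location at a time by rollout on the base heuristic.

    cost_of_plan(plan) is assumed to return the cost of a complete plan
    (a list of N zeros/ones), or float('inf') if the plan is infeasible.
    """
    fixed = []                        # decisions u_1, ..., u_k fixed so far
    for k in range(N):
        best_u, best_cost = None, float("inf")
        for u in (1, 0):              # place / don't place at location k + 1
            # Base heuristic: complete the plan by placing everywhere after k + 1
            plan = fixed + [u] + [1] * (N - k - 1)
            cost = cost_of_plan(plan)
            if cost < best_cost:
                best_u, best_cost = u, cost
        fixed.append(best_u)
    return fixed
```

With a toy cost (total number of placed facilities, infeasible if fewer than two are placed), the rollout drops facilities as long as feasibility is preserved.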
Linear transportation problems can be solved with the auction algorithm.
Bertsekas Reinforcement Learning 18 / 22
RNA Folding
[Figure: rollout for RNA folding — the partial folding software completes the current partial folding; the critic software compares complete foldings; controls at each nucleotide: Open, Close, Null]
Given a sequence of nucleotides, “fold” it in an “interesting” way (introduce pairings that result in an “interesting” structure).
Make a pairing decision at each nucleotide in sequence (open, close, do nothing).
Base heuristic: Given a partial folding, it generates a complete folding (this is the partial folding software).
Two complete foldings can be compared by the critic software.
Note: There is no explicit cost function here (it is internal to the critic software).
Bertsekas Reinforcement Learning 19 / 22
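A sketch of the resulting rollout loop. `complete_folding` and `critic_prefers` are hypothetical stand-ins for the partial folding software and the critic software; note that rollout only needs pairwise comparisons of complete foldings, never an explicit cost value.

```python
def rollout_fold(sequence, complete_folding, critic_prefers):
    """Choose open/close/do-nothing at each nucleotide by rollout.

    complete_folding(partial, sequence) extends a partial folding to a
    complete one (the partial folding software); critic_prefers(a, b) is
    True if the critic ranks complete folding a above complete folding b.
    """
    partial = []
    for _ in sequence:
        best = None
        for decision in ("open", "close", "nothing"):
            candidate = complete_folding(partial + [decision], sequence)
            # Keep the decision whose completed folding the critic prefers
            if best is None or critic_prefers(candidate, best[1]):
                best = (decision, candidate)
        partial.append(best[0])
    return partial
```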
Three-Dimensional Assignment
[Figure: 3-dimensional assignment problem — 3 node layers: Jobs j, Machines ℓ, Workers w (a 5-dimensional assignment problem would have 6 node layers)]
Performing a job j requires a single machine ℓ and a single worker w.
There is a given cost ajℓw corresponding to the triplet (j, ℓ, w).
Given a set of m jobs, a set of m machines, and a set of m workers, we want to find a collection of m job-machine-worker triplets that has minimum total cost.
A favorable case is when the cost has the separable form ajℓw = βjℓ + γℓw.
Enforced separation heuristic:
I First solve an artificial 2-dimensional machines-to-workers assignment problem with costs γℓw derived from ajℓw, e.g., γℓw = minj ajℓw (the “optimistic” assignment costs).
I Next solve the 2-dimensional jobs-to-machines assignment problem with costs βjℓ specified by the machines-to-workers assignment and ajℓw.
2-D assignment problems are easy (using the auction algorithm; see the notes).
Bertsekas Reinforcement Learning 20 / 22
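A sketch of the enforced separation heuristic, using brute-force 2-D assignments (adequate only for small m; in practice one would use the auction algorithm). The cost layout a[j][l][w] and the optimistic choice of machines-to-workers costs follow the description above.

```python
from itertools import permutations

def solve_2d(cost):
    """Brute-force 2-D assignment: returns p with column p[i] assigned to row i."""
    m = len(cost)
    best = min(permutations(range(m)),
               key=lambda p: sum(cost[i][p[i]] for i in range(m)))
    return list(best)

def enforced_separation(a):
    """a[j][l][w] = cost of the triplet (job j, machine l, worker w)."""
    m = len(a)
    # Step 1: optimistic machines-to-workers costs gamma[l][w] = min_j a[j][l][w]
    gamma = [[min(a[j][l][w] for j in range(m)) for w in range(m)] for l in range(m)]
    worker_of = solve_2d(gamma)                 # machine l -> worker
    # Step 2: jobs-to-machines costs induced by the fixed machine-worker pairing
    beta = [[a[j][l][worker_of[l]] for l in range(m)] for j in range(m)]
    machine_of = solve_2d(beta)                 # job j -> machine
    triplets = [(j, machine_of[j], worker_of[machine_of[j]]) for j in range(m)]
    total = sum(a[j][l][w] for (j, l, w) in triplets)
    return triplets, total
```

On a separable instance ajℓw = βjℓ + γℓw, the heuristic recovers an optimal collection of triplets.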
Three-Dimensional Assignment: Use of Rollout
View as a constrained multistage problem.
Two stages: the control at stage 1 is the jobs-to-machines assignment; the control at stage 2 is the machines-to-workers assignment.
We view the stage 1 assignment as "multiagent": assign one job at a time.
We view the stage 2 assignment as "single-agent": assign all machines at once, optimally (given the stage 1 assignment).
Base heuristic: Having fixed some stage 1 assignments, use the enforced separation heuristic for the remaining stage 1 assignments and for the stage 2 assignment.
Rollout: Fix each job assignment one at a time, using the base heuristic to compare all machine options.
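The two-stage rollout scheme above can be sketched in code. This is a minimal illustration on a small random instance, not the lecture's implementation: the helper names (`base_heuristic`, `best_stage2`) and the particular enforced-separation proxy cost (a greedy completion using min over workers) are assumptions made for the sketch, and stage 2 is solved by brute force, which is fine only for small n.

```python
# Multiagent rollout for the 3-dimensional assignment problem
# (jobs -> machines -> workers), viewed as a two-stage problem.
import itertools
import random

n = 4  # jobs, machines, workers (square instance for simplicity)
random.seed(0)
# c[j][m][w]: cost of assigning job j to machine m and machine m to worker w
c = [[[random.randint(1, 20) for _ in range(n)] for _ in range(n)] for _ in range(n)]

def best_stage2(job_of_machine):
    """Optimal machines-to-workers assignment (stage 2), given the full
    stage 1 assignment. Brute force over permutations; fine for small n."""
    return min(
        sum(c[job_of_machine[m]][m][perm[m]] for m in range(n))
        for perm in itertools.permutations(range(n))  # perm[m] = worker of machine m
    )

def base_heuristic(partial):
    """Enforced-separation base heuristic: complete the stage 1 assignment
    greedily using the separated proxy cost min_w c[j][m][w], then solve
    stage 2 optimally, and return the resulting total cost."""
    job_of_machine = dict(partial)  # machine -> job, for the jobs fixed so far
    free_machines = [m for m in range(n) if m not in job_of_machine]
    for j in range(n):
        if j in job_of_machine.values():
            continue
        m_star = min(free_machines, key=lambda m: min(c[j][m][w] for w in range(n)))
        free_machines.remove(m_star)
        job_of_machine[m_star] = j
    return best_stage2([job_of_machine[m] for m in range(n)])

# Rollout: fix the machine of each job one at a time (multiagent stage 1),
# scoring every candidate machine by completing with the base heuristic.
partial = []  # list of (machine, job) pairs fixed so far
for j in range(n):
    used = {m for m, _ in partial}
    m_best = min((m for m in range(n) if m not in used),
                 key=lambda m: base_heuristic(partial + [(m, j)]))
    partial.append((m_best, j))

rollout_cost = base_heuristic(partial)  # all jobs fixed; only stage 2 remains
print("base heuristic cost:", base_heuristic([]))
print("rollout cost:       ", rollout_cost)
```

Because the greedy completion is sequentially consistent (completing the partial assignment with the heuristic's own next choice reproduces the heuristic's trajectory), the rollout cost is guaranteed to be no worse than the base heuristic's cost, matching the cost-improvement property of rollout.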
Bertsekas Reinforcement Learning 21 / 22
About the Next Two Lectures
We will cover:
Parametric approximation architectures.
Neural networks and how we use them.
Approximation in value space and policy space using neural nets.
PLEASE READ AS MUCH OF CHAPTER 3 AS YOU CAN
PLEASE DOWNLOAD THE LATEST VERSION
Bertsekas Reinforcement Learning 22 / 22