Approximate Linear Programming for MDPs

Approximate Linear Programming (ALP) - CompSci 590.2, Ron Parr, Duke University (9/21/17)
Linear Programming MDP solution
Issue: Turn the non-linear max into a collection of linear constraints
Bellman optimality equation (the non-linear max):

V(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]

Replace it with one linear constraint per state-action pair:

∀ s,a:  V(s) ≥ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')

MINIMIZE:  Σ_s V(s)
Weakly polynomial; slower than PI in practice (though can be modified to behave like PI)
Optimal action has tight constraints
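As a concrete illustration of this formulation, here is a minimal sketch for a small tabular MDP using scipy.optimize.linprog; the array layout (R[s, a] for rewards, P[a][s, s'] for transitions) and the function name are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(R, P, gamma):
    """Exact LP for a tabular MDP.
    R[s, a]: reward, P[a][s, s']: transition probabilities, gamma: discount."""
    n_states, n_actions = R.shape
    # Objective: minimize sum_s V(s)
    c = np.ones(n_states)
    # One constraint per (s, a): V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s'),
    # rewritten in linprog's A_ub @ V <= b_ub form as (gamma*P_a - I) V <= -R_a.
    A_ub = np.vstack([gamma * P[a] - np.eye(n_states) for a in range(n_actions)])
    b_ub = np.concatenate([-R[:, a] for a in range(n_actions)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
    return res.x  # V*; the optimal action's constraints are the tight ones
```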
Linear programming with samples
• Suppose we don't have the model, but have samples
• For a sample (s,a,r,s'), the constraint looks like:  V(s) ≥ r + γ V(s')
• What goes wrong?
Problem: Noise
• Suppose s goes to s1 w.p. 0.5 and s2 w.p. 0.5, and
• V(s1) = 100, V(s2) = 0 → V(s) = r + 50γ
• Samples: (s,a,r,s1), (s,a,r,s2)
• Constraints:
V(s) ≥ r + γ V(s1)
V(s) ≥ r + γ V(s2)
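To see what goes wrong, here is a tiny numeric check of this example; the concrete values γ = 0.9 and r = 0 are illustrative, the slide leaves them symbolic.

```python
gamma, r = 0.9, 0.0      # illustrative values; the slide leaves them symbolic
V_s1, V_s2 = 100.0, 0.0

# A model-based constraint would use the expected backup over next states:
expected = r + gamma * (0.5 * V_s1 + 0.5 * V_s2)   # r + 50*gamma = 45.0

# The two sample-based constraints each use a single observed next state:
c1 = r + gamma * V_s1                              # r + 100*gamma = 90.0
c2 = r + gamma * V_s2                              # r + 0         =  0.0

# The LP must satisfy both constraints, so it forces V(s) >= max(c1, c2) = 90,
# overshooting the expected backup of 45: noise biases the LP value upward.
print(expected, max(c1, c2))
```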
Noise solution
• There is no (ideal) noise solution!
• Problem never really goes away
• LP methods most effective in low noise scenarios
• Can do local averaging by explicitly adding an average over “nearby” states to the LP constraints
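One way the local-averaging idea might look in a sampled LP (an assumed sketch, not the slides' exact construction): replace a single-sample constraint with one that averages the backed-up values of the samples originating near the state in question.

```python
import numpy as np

def averaged_constraint(s_idx, neighbor_sample_ids, samples, gamma, n_states):
    """Build one LP row encoding
        V(s_idx) >= mean over nearby samples (i, a, r, j) of [ r + gamma * V(j) ],
    returned as (row, rhs) with the convention row @ V <= rhs."""
    k = len(neighbor_sample_ids)
    row = np.zeros(n_states)
    row[s_idx] -= 1.0
    rhs = 0.0
    for idx in neighbor_sample_ids:
        _, _, r, j = samples[idx]     # sample format (i, a, r, j) is assumed
        row[j] += gamma / k
        rhs -= r / k
    return row, rhs
```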
Approximate Linear Program (ALP) with model
minimize  Σ_s Σ_k w_k φ_k(s)

subject to, ∀ s,a:  Σ_k w_k φ_k(s) ≥ R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_k w_k φ_k(s')
Notes:
• No sampling yet
• Same number of constraints, just k variables
• Assumes we have access to a model
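A minimal sketch of this ALP, assuming a feature matrix Phi with one row per state and k columns, plus the same illustrative model arrays as in the exact-LP sketch above; only the k weights are decision variables.

```python
import numpy as np
from scipy.optimize import linprog

def solve_alp(Phi, R, P, gamma):
    """Phi[s, :]: features of state s, R[s, a]: reward, P[a][s, s']: transitions."""
    n_states, k = Phi.shape
    n_actions = R.shape[1]
    # Objective: minimize sum_s (Phi @ w)(s) = (1^T Phi) @ w
    c = Phi.sum(axis=0)
    # Constraints for all s, a:
    #   (Phi w)(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) (Phi w)(s')
    # i.e. (gamma * P_a @ Phi - Phi) @ w <= -R[:, a]
    A_ub = np.vstack([gamma * P[a] @ Phi - Phi for a in range(n_actions)])
    b_ub = np.concatenate([-R[:, a] for a in range(n_actions)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * k)
    return res.x  # weight vector w; the approximate values are Phi @ w
```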
• Normally, we minimize:  Σ_s V(s)
• We could do:  Σ_s c(s) V(s)
• For c = some probability distribution
• c_i = importance/relevance of state i
• c_i doesn't matter for the exact case, but
• Can change the answer in the approximate case
The resulting bound (de Farias and Van Roy):

‖V* - Φw̃‖_{1,c} ≤ (2 / (1 - γ)) · min_w ‖V* - Φw‖_∞

where Φw̃ is the ALP solution and ‖·‖_{1,c} is the c-weighted 1-norm.
Improving the bound
• We can improve the bound if we pick weights c according to the stationary distribution of the optimal policy
• But how do we do that?
• Just run the optimal policy and then…
• Wait a sec…
• In practice, iterative weighting schemes may help:
  • Start with arbitrary weights
  • Generate policy by solving ALP
  • Reweight based upon resulting policy
  • Repeat
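A sketch of that loop is below; solve_alp_weighted, greedy_policy, and estimate_state_distribution are hypothetical helpers standing in for the pieces discussed above.

```python
import numpy as np

def iterative_reweighting(solve_alp_weighted, greedy_policy,
                          estimate_state_distribution, n_states, n_rounds=5):
    """Alternate between solving the weighted ALP and re-estimating the
    state-relevance weights c from the resulting policy."""
    c = np.ones(n_states) / n_states            # arbitrary initial weights
    w = pi = None
    for _ in range(n_rounds):
        w = solve_alp_weighted(c)               # solve ALP with current weights c
        pi = greedy_policy(w)                   # generate a policy from the ALP solution
        c = estimate_state_distribution(pi)     # reweight based on that policy
    return w, pi
```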
Problem: Missing constraints
• We're doing approximation because n is large or infinite
• Can't write down the entire LP!
• General (for any LP) constraint sampling approach:
  • Observe that for an n×k system, the optimal solution will have k tight constraints
  • Most constraints are loose/unnecessary
  • Sample some set of constraints
  • Repeat until no constraints are violated:
    • Solve the LP
    • Find violated constraints, then add them back to the LP
• Works well if you have an efficient way to find the most violated constraints
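A generic sketch of that loop for an explicit (but large) constraint system A x ≤ b; the brute-force scan for violated rows stands in for whatever efficient violation oracle the problem provides, and the initial subset is assumed to keep the relaxed LP bounded.

```python
import numpy as np
from scipy.optimize import linprog

def constraint_generation(c, A, b, init_rows, tol=1e-8, max_iters=100):
    """Solve min c@x s.t. A@x <= b by repeatedly solving a relaxed LP and
    adding back any constraints the current solution violates."""
    active = set(init_rows)
    x = None
    for _ in range(max_iters):
        idx = sorted(active)
        res = linprog(c, A_ub=A[idx], b_ub=b[idx],
                      bounds=[(None, None)] * len(c))
        x = res.x
        # Find violated constraints (here by brute force; an efficient
        # violation oracle is what makes this practical for huge LPs).
        violated = [i for i in np.where(A @ x > b + tol)[0] if i not in active]
        if not violated:
            return x                 # no constraints violated: done
        active.update(violated)      # add them back to the LP
    return x
```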
Missing constraints in ALP
• Often no obviously efficient way to find the most violated constraint
• Idea:
  • Sample constraints (s,a,r,s') by an initial policy
  • Repeat until ?:
    • Solve LP, produce policy
    • Execute policy to sample new constraints (s,a,r,s')
• Challenges: Missing constraints can produce unbounded LP solutions
What if we’re missing constraints for some state?
[Figure: a chain of states 1, 2, 3, 4, where the only constraint shown is V(s4) ≥ R(s4) + γ V(s4). Thanks to Gavin Taylor]
Avoiding constraint sampling problems
• If we can sample constraints from the stationary distribution of the optimal policy (de Farias and Van Roy), then we should be OK
• Wait a sec…
Making additional assumptions
• Regularized ALP (RALP)
  • Assumes that features are Lipschitz continuous (Petrik et al.)
  • Adds bound on 1-norm of weights as a constraint
• Lipschitz constant K for function f, with distance metric d:

|f(x) - f(y)| ≤ K · d(x, y)
RALP error bound, informally: the error of the RALP solution scales as 1/(1 - γ) times the sum of the best max-norm approximation error (under the 1-norm weight bound) and a sampling term e_p, where e_p is bounded as a function of:
• constraint density and
• Lipschitz constants of features
• L1 regularization tends to produce sparser solutions
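A minimal sketch of how a 1-norm bound on the weights can be added to the ALP sketched earlier, by splitting w into nonnegative parts w = wp - wm; the budget psi and the variable layout are illustrative choices, not the exact RALP formulation of Petrik et al.

```python
import numpy as np
from scipy.optimize import linprog

def solve_ralp(Phi, R, P, gamma, psi):
    """ALP with an added L1 constraint ||w||_1 <= psi on the feature weights."""
    n_states, k = Phi.shape
    n_actions = R.shape[1]
    # Variables: [wp (k), wm (k)], both nonnegative; the weights are w = wp - wm.
    c = np.concatenate([Phi.sum(axis=0), -Phi.sum(axis=0)])
    # Bellman constraints, as in the ALP, rewritten in terms of wp and wm.
    blocks = [gamma * P[a] @ Phi - Phi for a in range(n_actions)]
    A_bell = np.vstack([np.hstack([B, -B]) for B in blocks])
    b_bell = np.concatenate([-R[:, a] for a in range(n_actions)])
    # L1 regularization constraint: sum(wp) + sum(wm) <= psi.
    A_l1 = np.ones((1, 2 * k))
    b_l1 = np.array([psi])
    res = linprog(c, A_ub=np.vstack([A_bell, A_l1]),
                  b_ub=np.concatenate([b_bell, b_l1]),
                  bounds=[(0, None)] * (2 * k))
    wp, wm = res.x[:k], res.x[k:]
    return wp - wm  # tends to be sparse when psi is small
```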
Bicycle Domain
• Ride a bicycle to a destination
• S:
• A: Shift weight, turn handlebars
• R: Based on distance to goal and angle from upright
• Φ: Polynomials and cross terms of state dimensions (160 candidate features)

[Randløv and Alstrøm, 98] [http://english.people.com.cn]
[Plot: bicycle-domain results; horizontal axis 0 to 5000, vertical axis 0 to 100; 100 features total]
• No model used (incorrect deterministic action assumption)
• RALP had |A| constraints/sample
Problem: Picking actions
• How do we use the ALP solution to produce a policy?
• We get an approximate value function, not Q-functions
• Must use model to pick actions:  π(s) = argmax_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]
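A minimal sketch of that one-step lookahead, assuming the same illustrative model arrays as earlier and letting V_hat stand for the approximate value function (Φw) returned by the ALP.

```python
import numpy as np

def greedy_action(s, V_hat, R, P, gamma):
    """One-step lookahead: argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V_hat(s') ]."""
    n_actions = R.shape[1]
    q = [R[s, a] + gamma * P[a][s] @ V_hat for a in range(n_actions)]
    return int(np.argmax(q))
```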
• Continuous action space
• Continuous state space
• Lipschitz continuous value function
• Can we exploit this directly?
• Non-parametric approximate linear programming (NP-ALP)
• Produces a piecewise-linear value function
• Each segment has an associated optimal action
• Actions selected from previously tried actions, so greater sampling of state-action space leads to finer grained decisions
LP for NP-ALP
minimize  Σ_s V(s)

subject to:
∀ s,a:  V(s) ≥ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')
∀ sampled states s,t:  V(s) ≤ V(t) + L_v · d(s,t)
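A sketch of this LP over a set of sampled states; it uses one Bellman constraint per sampled transition (the sampled form discussed below) together with pairwise Lipschitz constraints, and the variable names and constraint layout are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def solve_np_alp(transitions, n_states, dist, gamma, Lv):
    """transitions: list of (i, r, j) tuples over sampled-state indices;
    dist[i, j]: distance between sampled states i and j; Lv: Lipschitz bound."""
    c = np.ones(n_states)                 # minimize sum of sampled-state values
    A_ub, b_ub = [], []
    # Bellman constraints on sampled transitions: V(i) >= r + gamma * V(j)
    for (i, r, j) in transitions:
        row = np.zeros(n_states)
        row[i] -= 1.0
        row[j] += gamma
        A_ub.append(row)
        b_ub.append(-r)
    # Lipschitz constraints for every ordered pair: V(i) - V(j) <= Lv * d(i, j)
    # (the source of the quadratic LP size; each row has only two nonzeros)
    for i in range(n_states):
        for j in range(n_states):
            if i != j:
                row = np.zeros(n_states)
                row[i], row[j] = 1.0, -1.0
                A_ub.append(row)
                b_ub.append(Lv * dist[i, j])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)
    return res.x                          # values at the sampled states
```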
• Use with sampled transitions, k-nn averaging
• Size of LP quadratic in number of (sampled) states, but:
  • Very sparse
  • Very amenable to constraint generation
• Solution is sparse (unless Lv is large)
Still More NP-ALP Details
• Every value traceable to tight Bellman constraint
• For action selection for query state s:
  • Find state t which bounds value of s
  • Action taken at t is optimal for s
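A sketch of that action lookup under the assumptions above: the Lipschitz constraints bound the query state's value through some sampled state t, and we return the action recorded at t; the exact tie to the tight Bellman constraint is simplified here.

```python
import numpy as np

def np_alp_action(query, sampled_states, V, actions, Lv, dist_fn):
    """V[i]: LP value of sampled state i; actions[i]: action tied to that
    state's tight Bellman constraint; dist_fn: the distance metric d."""
    # The Lipschitz constraints imply V(query) <= V[t] + Lv * d(query, t) for
    # every sampled t; the binding (smallest) bound identifies t.
    bounds = [V[i] + Lv * dist_fn(query, t) for i, t in enumerate(sampled_states)]
    t = int(np.argmin(bounds))
    return actions[t]
```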
• Error bounds: scale with gaps in sampling and the Lipschitz constant
Note that this is a bound on the value of the resulting policy, not the quality of the resulting value function!
RALP: Discrete actions, model, RBF features
NP-ALP: Continuous actions, no model, no features
NP-ALP: Bicycle Balancing