Reinforcement Learning
Eligibility Traces
Lecturer: 虞台文
Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University
Contents
- n-step TD prediction
- Forward view of TD(λ)
- Backward view of TD(λ)
- Equivalence of the forward and backward views
- Sarsa(λ)
- Q(λ)
- Eligibility traces for actor-critic methods
- Replacing traces
- Implementation issues
n-Step TD Prediction
Elementary Methods
- Dynamic Programming
- Monte Carlo Methods
- TD(0)

Monte Carlo vs. TD(0)
- Monte Carlo: observes the rewards for all steps in an episode.
- TD(0): observes one step only.
Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

TD(0): $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
n-Step TD Prediction

Backup diagram: the spectrum runs from the 1-step TD backup, through the 2-step, 3-step, and general n-step backups (with targets $R_t^{(1)}, R_t^{(2)}, R_t^{(3)}, \ldots, R_t^{(n)}$), to the full Monte Carlo backup (target $R_t$).
n-Step TD Prediction

Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

1-step: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$

2-step: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$

3-step: $R_t^{(3)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3})$

n-step: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$

$R_t^{(n)}$ is called the corrected n-step truncated return.
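For concreteness, here is a minimal Python sketch of computing $R_t^{(n)}$ from a recorded episode. The list layout (`rewards[k]` holding $r_{k+1}$, `values[k]` holding $V(s_k)$) is an assumption made for this illustration, not something prescribed by the slides.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Corrected n-step truncated return R_t^(n).

    rewards[k] holds r_{k+1} (the reward on the transition out of step k),
    values[k] holds the current estimate V(s_k); the episode terminates
    after T = len(rewards) steps.
    """
    T = len(rewards)
    steps = min(n, T - t)                    # truncate at the end of the episode
    ret = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:                            # correct with the estimated value of s_{t+n}
        ret += gamma**n * values[t + n]
    return ret
```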
Backups

Monte Carlo: $V_{t+1}(s_t) = V_t(s_t) + \alpha\,[R_t - V_t(s_t)]$

TD(0): $V_{t+1}(s_t) = V_t(s_t) + \alpha\,[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)] = V_t(s_t) + \alpha\,[R_t^{(1)} - V_t(s_t)]$

n-step TD: $V_{t+1}(s_t) = V_t(s_t) + \alpha\,[R_t^{(n)} - V_t(s_t)]$

Increment form:
$$\Delta V_t(s) = \begin{cases} \alpha\,[R_t^{(n)} - V_t(s_t)] & s = s_t \\ 0 & s \neq s_t \end{cases}$$
n-Step TD Backup

Online: $V_{t+1}(s) = V_t(s) + \Delta V_t(s)$

Offline: $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$

When updating offline, the new V(s) does not take effect until the next episode.

In both cases,
$$\Delta V_t(s) = \begin{cases} \alpha\,[R_t^{(n)} - V_t(s_t)] & s = s_t \\ 0 & s \neq s_t \end{cases}$$
Error Reduction Property

$$\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \;\le\; \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$$

The left-hand side is the maximum error using the n-step return; the right-hand side is $\gamma^n$ times the maximum error using the current value estimate V. The property holds for both the online and offline updates above.
Example (Random Walk)

States A, B, C, D, E; episodes start in the middle (C). All rewards are 0 except a reward of 1 on terminating off the right end.

True values: V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6.

Consider 2-step TD, 3-step TD, and so on. Which n is optimal?
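Below is a small, hypothetical Python sketch of n-step TD prediction on this random walk. The encoding (states 0 to 4 for A to E, +1 only off the right end, γ = 1) follows the slide; the step size and number of episodes are arbitrary choices.

```python
import random

def random_walk_episode(n_states=5):
    """One episode of the A-E random walk: start in the middle and step
    left or right with equal probability until falling off either end."""
    s, visited = n_states // 2, []
    while 0 <= s < n_states:
        visited.append(s)
        s += random.choice((-1, 1))
    return visited, (1.0 if s == n_states else 0.0)   # +1 only off the right end

def n_step_td(n=2, alpha=0.1, episodes=2000, n_states=5):
    """n-step TD prediction (gamma = 1, all rewards 0 except the terminal +1);
    updates are applied in a single sweep after each episode for simplicity."""
    V = [0.5] * n_states                      # arbitrary initial values
    for _ in range(episodes):
        visited, terminal_reward = random_walk_episode(n_states)
        T = len(visited)                      # termination happens at time T
        for t in range(T):
            if t + n < T:
                target = V[visited[t + n]]    # n-step return bootstraps on V
            else:
                target = terminal_reward      # episode ends within n steps
            V[visited[t]] += alpha * (target - V[visited[t]])
    return V

print(n_step_td(n=2))   # true values are 1/6, 2/6, 3/6, 4/6, 5/6
```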
Example (19-State Random Walk)

The same task with 19 states, a reward of -1 off the left end and +1 off the right end.

Plot: average RMSE over the first 10 trials, for online and offline n-step TD with various n and α.
Exercise (Random Walk)

A grid world with terminal rewards of +1 and -1; standard moves.

1. Evaluate the value function for the random policy.
2. Approximate the value function using n-step TD (try different n's and α's), and compare their performance.
3. Find the optimal policy.
The Forward View of TD(λ)
Averaging n-step Returns

We are not limited to using a single n-step return as the target. For example, we could average n-step returns:

$$R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$$

This is still a single backup; any set of weights that sums to 1 is allowed.
TD(λ): The λ-Return

TD(λ) is a method for averaging all n-step backups, weighting the n-step return $R_t^{(n)}$ by $(1-\lambda)\lambda^{n-1}$ (decaying with the time since visitation). The resulting target is called the λ-return:

$$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

Backup using the λ-return:

$$\Delta V_t(s_t) = \alpha\,[R_t^{\lambda} - V_t(s_t)]$$
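A minimal sketch of computing the λ-return directly from its definition, assuming a finite episode so that all weight beyond the episode's end collapses onto the complete return. The helper is self-contained and hypothetical, not taken from the slides.

```python
def lambda_return(rewards, values, t, lam, gamma):
    """R_t^lambda: a (1 - lam)-weighted average of the n-step returns R_t^(n),
    with the tail weight lam^(T-t-1) falling on the complete return."""
    T = len(rewards)

    def n_step(n):                       # corrected n-step truncated return
        steps = min(n, T - t)
        g = sum(gamma**k * rewards[t + k] for k in range(steps))
        if t + n < T:
            g += gamma**n * values[t + n]
        return g

    ret = sum((1 - lam) * lam**(n - 1) * n_step(n) for n in range(1, T - t))
    return ret + lam**(T - t - 1) * n_step(T - t)
```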
Forward View of TD(λ)

A theoretical view: each update looks forward in time to future rewards and states to form the λ-return.

TD(λ) on the Random Walk
The Backward View of TD(λ)
Why a Backward View?

- The forward view is acausal and therefore not directly implementable.
- The backward view is causal and implementable, and in the offline case it achieves exactly the same result as the forward view.
Eligibility Traces

Each state is associated with an additional memory variable, its eligibility trace, defined by:

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$
Eligibility Traces Record Recency of Visits

At any time, the traces record which states have recently been visited, where "recently" is defined in terms of $\gamma\lambda$.

The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur.

The reinforcing event is the moment-by-moment 1-step TD error:

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$
Reinforcing Event

The moment-by-moment 1-step TD error:

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

Value updates: $\Delta V_t(s) = \alpha\,\delta_t\, e_t(s)$ for all s.
TD(λ)

Eligibility traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$

Reinforcing events:
$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

Value updates:
$$\Delta V_t(s) = \alpha\,\delta_t\, e_t(s)$$
Online TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        δ ← r + γV(s') − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + α δ e(s)
            e(s) ← γλ e(s)
        s ← s'
    Until s is terminal
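A tabular Python sketch of the algorithm above. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the `policy` callable are assumptions made for illustration, not part of the slides.

```python
from collections import defaultdict

def td_lambda(env, policy, episodes=100, alpha=0.1, gamma=1.0, lam=0.9):
    """Online TD(lambda) policy evaluation with accumulating traces."""
    V = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)                # eligibility traces, reset per episode
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] * (not done) - V[s]
            e[s] += 1.0                       # accumulating trace
            for x in list(e):                 # update every eligible state
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam           # decay the trace
            s = s_next
    return V
```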
Backward View of TD(λ)

Backward View vs. MC and TD(0)

- Setting λ to 0, we get TD(0).
- Setting λ to 1, we get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and online (instead of waiting until the end of the episode).
- How about 0 < λ < 1?
Equivalence of the Forward and Backward Views
Offline TD(λ)'s

Offline forward TD(λ) (λ-return):

$$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

$$\Delta V_t^f(s) = \begin{cases} \alpha\,[R_t^{\lambda} - V_t(s_t)] & s = s_t \\ 0 & s \neq s_t \end{cases}$$

Offline backward TD(λ):

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \qquad \Delta V_t^b(s) = \alpha\,\delta_t\, e_t(s)$$
Forward View = Backward View

$$\sum_{t=0}^{T-1} \Delta V_t^b(s) \;=\; \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{s s_t}$$

(backward updates on the left, forward updates on the right), where

$$I_{s s_t} = \begin{cases} 1 & s = s_t \\ 0 & s \neq s_t \end{cases}$$

See the proof at the end of these notes.
TD(λ) on the Random Walk

Plot: average RMSE over the first 10 trials, comparing the offline λ-return algorithm (forward view) with online TD(λ) (backward view).
Sarsa(λ)
Sarsa(λ)

TD(λ) uses eligibility traces for policy evaluation. How can eligibility traces be used for control? Learn $Q_t(s, a)$ rather than $V_t(s)$.
Sarsa(λ)

Eligibility traces:
$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$

Reinforcing events:
$$\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$

Updates:
$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
Sarsa(λ)

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s',a') − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            e(s,a) ← γλ e(s,a)
        s ← s'; a ← a'
    Until s is terminal
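A corresponding tabular Python sketch of Sarsa(λ) with accumulating traces; the environment interface and the ε-greedy helper are assumptions, not part of the slides.

```python
from collections import defaultdict
import random

def sarsa_lambda(env, actions, episodes=500, alpha=0.1, gamma=1.0,
                 lam=0.9, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating traces and an eps-greedy policy."""
    Q = defaultdict(float)                    # keyed by (state, action)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        e = defaultdict(float)                # traces reset at episode start
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            delta = r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)]
            e[(s, a)] += 1.0
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            s, a = s2, a2
    return Q
```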
Sarsa(λ) Traces in Grid World

With a single trial, the agent acquires much more information about how to reach the goal (though not necessarily the best way). Traces can considerably accelerate learning.
Q(λ)
Q-Learning

Q-learning is an off-policy method: it breaks from the estimation policy from time to time to take exploratory actions, so a simple time-decayed trace cannot be used directly.

How do we combine eligibility traces with Q-learning? Three methods:
- Watkins's Q(λ)
- Peng's Q(λ)
- Naïve Q(λ)
Watkins's Q(λ)

Backup diagram: the behavior policy (e.g., ε-greedy) generates the path actually followed, while the estimation policy (e.g., greedy) defines the greedy path. Traces are carried only up to the first non-greedy action.

Backups in Watkins's Q(λ). Two cases:
1. Both the behavior and estimation policies take the greedy path.
2. The behavior path takes a non-greedy action before the episode ends.

How should the eligibility traces be defined to handle both cases?
Watkins's Q(λ)

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t,\ a = a_t,\ \text{and } Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } (s,a) \neq (s_t,a_t) \ \text{and } Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{otherwise} \end{cases}$$

$$\delta_t = r_{t+1} + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$$

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
Watkins's Q(λ)

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s',b)   (if a' ties for the max, then a* ← a')
        δ ← r + γQ(s',a*) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            If a' = a*, then e(s,a) ← γλ e(s,a)
            else e(s,a) ← 0
        s ← s'; a ← a'
    Until s is terminal
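A tabular Python sketch of Watkins's Q(λ); the environment interface is again assumed. The key difference from Sarsa(λ) is that all traces are cut to zero whenever the action actually taken is non-greedy.

```python
from collections import defaultdict
import random

def watkins_q_lambda(env, actions, episodes=500, alpha=0.1, gamma=0.9,
                     lam=0.9, epsilon=0.05):
    """Tabular Watkins's Q(lambda): traces are cut after exploratory actions."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        e = defaultdict(float)
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            a_star = max(actions, key=lambda b: Q[(s2, b)])
            if Q[(s2, a2)] == Q[(s2, a_star)]:    # ties for the max count as greedy
                a_star = a2
            delta = r + gamma * Q[(s2, a_star)] * (not done) - Q[(s, a)]
            e[(s, a)] += 1.0
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                # decay traces only if the followed action was greedy, else cut them
                e[key] = gamma * lam * e[key] if a2 == a_star else 0.0
            s, a = s2, a2
    return Q
```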
Peng's Q(λ)

Cutting off traces loses much of the advantage of using eligibility traces. If exploratory actions are frequent, as they often are early in learning, then only rarely will backups of more than one or two steps be done, and learning may be little faster than 1-step Q-learning.

Peng's Q(λ) is an alternate version of Q(λ) meant to remedy this.

Backups in Peng's Q(λ):
- Never cut traces.
- Back up the max action, except at the end.
- The book says it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ).
- Disadvantage: difficult to implement.

See Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3), for the notation.
Naïve Q(λ)

Idea: is it really a problem to back up exploratory actions?
- Never zero the traces.
- Always back up the max at the current action (unlike Peng's or Watkins's).

Is this truly naïve? It works well in preliminary empirical studies.
Naïve Q(λ)

The same as Watkins's Q(λ), except the traces are never cut:

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$

$$\delta_t = r_{t+1} + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$$

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
Comparisons

McGovern, Amy and Sutton, Richard S. (1997). Towards a Better Q(λ). Presented at the Fall 1997 Reinforcement Learning Workshop.

Deterministic gridworld with obstacles:
- 10x10 gridworld
- 25 randomly generated obstacles
- 30 runs
- α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05
- accumulating traces
Comparisons (results)

Convergence of the Q(λ)'s

- None of the methods is proven to converge (much extra credit if you can prove any of them).
- Watkins's is thought to converge to Q*.
- Peng's is thought to converge to a mixture of Q^π and Q*.
- Naïve: Q*?
Eligibility Traces for Actor-Critic Methods
Actor-Critic Methods

Architecture diagram: the Actor (policy) selects actions; the environment returns a state and reward; the Critic (value function) computes a TD error, which drives learning in both the critic and the actor.

Critic: on-policy learning of V. Use TD(λ) as described before.
Actor: needs eligibility traces for each state-action pair.
Policy Parameter Updates

Method 1:

$$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \beta\,\delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$

Generalized with eligibility traces:

$$p_{t+1}(s,a) = p_t(s,a) + \beta\,\delta_t\, e_t(s,a)$$
Policy Parameter Updates

Method 2:

$$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \beta\,\delta_t\,[1 - \pi_t(s_t,a_t)] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$

Generalized with eligibility traces:

$$p_{t+1}(s,a) = p_t(s,a) + \beta\,\delta_t\, e_t(s,a)$$

where the traces are

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
Replacing Traces
Accumulating vs. Replacing Traces

Accumulating traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$

Replacing traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ 1 & s = s_t \end{cases}$$
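A small sketch of the two trace-update rules side by side; the dictionary-based representation is an implementation choice, not something from the slides.

```python
def update_trace(e, s, gamma, lam, replacing=False):
    """Decay all traces, then bump the trace of the visited state s."""
    for x in e:
        e[x] *= gamma * lam
    if replacing:
        e[s] = 1.0             # replacing trace: reset to 1 on a revisit
    else:
        e[s] += 1.0            # accumulating trace: can exceed 1 on revisits
    return e

# Example: a state revisited while its old trace has not fully decayed
print(update_trace({"A": 0.5, "B": 0.2}, "A", 1.0, 0.9))                  # accumulating
print(update_trace({"A": 0.5, "B": 0.2}, "A", 1.0, 0.9, replacing=True))  # replacing
```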
Why Replacing Traces?

- With accumulating traces, frequently visited states can have eligibilities greater than 1, which can be a problem for convergence.
- Replacing traces can significantly speed learning.
- They can make the system perform well over a broader range of parameters.
- Accumulating traces can do poorly on certain types of tasks.

Example (19-State Random Walk)
Extension to Action Values

When you revisit a state, what should you do with the traces for the other actions?

Singh and Sutton (1996) suggest setting the traces of all the other actions from the revisited state to 0:

$$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \neq a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } s \neq s_t \end{cases}$$
Implementation Issues
Implementation Issues

- For practical use we cannot afford to update every trace on every step.
- Dropping traces with very small values is recommended.
- If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrix operations).
- Using traces with neural networks and backpropagation generally only doubles the needed computation.
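A sketch of the recommended trick of dropping near-zero traces, so each step only touches recently visited states; the cutoff value is an arbitrary choice for illustration.

```python
def decay_traces(e, gamma, lam, cutoff=1e-4):
    """Decay all eligibility traces and drop those below a small cutoff,
    so only recently visited states are updated on each step."""
    for s in list(e):
        e[s] *= gamma * lam
        if e[s] < cutoff:
            del e[s]           # stop updating states whose trace has faded away
    return e
```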
Variable λ

The method can be generalized to a variable λ, where λ_t is a function of time, e.g. $\lambda_t = \lambda(s_t)$:

$$e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
Proof

Claim: $\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{s s_t}$.

An accumulating eligibility trace can be written explicitly (non-recursively) as

$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, I_{s s_k}$$

Therefore the sum of the backward updates is

$$\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \alpha\,\delta_t \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, I_{s s_k} = \alpha \sum_{k=0}^{T-1} I_{s s_k} \sum_{t=k}^{T-1} (\gamma\lambda)^{t-k}\,\delta_t = \alpha \sum_{t=0}^{T-1} I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$

where the middle step swaps the order of summation over the region $0 \le k \le t \le T-1$, and the last step simply renames the indices.
Proof (continued)

On the forward side,

$$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = R_t^{\lambda} - V_t(s_t) = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} R_t^{(n)} - V_t(s_t)$$

Expanding the λ-return,

$$R_t^{\lambda} = (1-\lambda)\lambda^{0}\,[r_{t+1} + \gamma V_t(s_{t+1})] + (1-\lambda)\lambda^{1}\,[r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})] + (1-\lambda)\lambda^{2}\,[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3})] + \cdots$$
Proof (continued)

Collecting terms, the forward update can be rewritten as a discounted sum of 1-step TD errors:

$$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = (\gamma\lambda)^{0}\,[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)] + (\gamma\lambda)^{1}\,[r_{t+2} + \gamma V_t(s_{t+2}) - V_t(s_{t+1})] + (\gamma\lambda)^{2}\,[r_{t+3} + \gamma V_t(s_{t+3}) - V_t(s_{t+2})] + \cdots$$

$$= (\gamma\lambda)^{0}\,\delta_t + (\gamma\lambda)^{1}\,\delta_{t+1} + (\gamma\lambda)^{2}\,\delta_{t+2} + \cdots$$

This holds exactly because the value function is held fixed within the episode in the offline case.
Proof (concluded)

Hence

$$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$

and, summing the forward updates for state s,

$$\sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{s s_t} = \alpha \sum_{t=0}^{T-1} I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$

(summation over the region $0 \le t \le k \le T-1$), which is exactly the expression obtained above for the summed backward updates. Therefore

$$\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{s s_t} \qquad\blacksquare$$
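The offline equivalence can also be checked numerically. The sketch below builds a small random episode, holds V fixed (the offline assumption), and compares the summed forward (λ-return) updates against the summed backward (trace-based) updates per state; all names are hypothetical.

```python
import random

def forward_backward_check(T=6, gamma=0.9, lam=0.7, alpha=0.1, n_states=3):
    states  = [random.randrange(n_states) for _ in range(T)]   # s_0..s_{T-1}
    rewards = [random.random() for _ in range(T)]               # r_1..r_T
    V = [random.random() for _ in range(n_states)]              # held fixed (offline)

    def n_step(t, n):                        # corrected n-step return R_t^(n)
        steps = min(n, T - t)
        g = sum(gamma**k * rewards[t + k] for k in range(steps))
        if t + n < T:
            g += gamma**n * V[states[t + n]]
        return g

    def lam_return(t):                       # lambda-return for a finite episode
        g = sum((1 - lam) * lam**(n - 1) * n_step(t, n) for n in range(1, T - t))
        return g + lam**(T - t - 1) * n_step(t, T - t)

    # forward (lambda-return) updates, accumulated per state
    fwd = [0.0] * n_states
    for t in range(T):
        fwd[states[t]] += alpha * (lam_return(t) - V[states[t]])

    # backward (eligibility-trace) updates, accumulated per state
    bwd, e = [0.0] * n_states, [0.0] * n_states
    for t in range(T):
        v_next = V[states[t + 1]] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * v_next - V[states[t]]
        e[states[t]] += 1.0
        for s in range(n_states):
            bwd[s] += alpha * delta * e[s]
            e[s] *= gamma * lam
    return fwd, bwd

f, b = forward_backward_check()
print(max(abs(x - y) for x, y in zip(f, b)))   # ~0 up to floating-point error
```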