Reinforcement Learning
Eligibility Traces
Lecturer: Tai-Wen Yu
Intelligent Multimedia Lab, Department of Computer Science and Engineering, Tatung University
Content
- n-step TD Prediction
- Forward View of TD(λ)
- Backward View of TD(λ)
- Equivalence of the Forward and Backward Views
- Sarsa(λ)
- Q(λ)
- Eligibility Traces for Actor-Critic Methods
- Replacing Traces
- Implementation Issues
Reinforcement Learning
Eligibility Traces
n-Step TD Prediction
Elementary Methods
Dynamic Programming
Monte Carlo Methods
TD(0)
Monte Carlo vs. TD(0)
Monte Carlo: observe the rewards for all the steps in an episode.

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$

TD(0): observe one step only.

$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
n-Step TD Prediction
[Backup diagrams: 1-step TD ($R_t^{(1)}$), 2-step ($R_t^{(2)}$), 3-step ($R_t^{(3)}$), …, n-step ($R_t^{(n)}$), …, Monte Carlo ($R_t$).]
n-Step TD Prediction

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
$$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$
$$R_t^{(3)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3})$$
$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$

$R_t^{(n)}$ is called the corrected n-step truncated return.
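As a quick illustration (not from the slides), the corrected n-step truncated return can be computed directly from a recorded episode. The function name and the indexing convention (`rewards[k]` holds $r_{k+1}$, `states[k]` holds $s_k$) are our own assumptions:

```python
# Sketch: the corrected n-step truncated return R_t^(n) for a recorded episode.
def n_step_return(rewards, values, states, t, n, gamma):
    """R_t^(n) = r_{t+1} + g*r_{t+2} + ... + g^{n-1}*r_{t+n} + g^n*V(s_{t+n}).

    If t+n reaches the end of the episode, the return is simply the
    (truncated) Monte Carlo return and the bootstrap term vanishes.
    """
    T = len(rewards)
    horizon = min(n, T - t)          # how many real rewards we can use
    ret = sum(gamma ** i * rewards[t + i] for i in range(horizon))
    if t + n < T:                    # correct the truncation with V(s_{t+n})
        ret += gamma ** n * values[states[t + n]]
    return ret
```

With `n` at least the episode length, this reduces to the Monte Carlo return; with `n = 1` it is the TD(0) target.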
Backups
Monte Carlo: $V_t(s_t) \leftarrow V_t(s_t) + \alpha [R_t - V_t(s_t)]$

TD(0): $V_t(s_t) \leftarrow V_t(s_t) + \alpha [r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)] = V_t(s_t) + \alpha [R_t^{(1)} - V_t(s_t)]$

n-step TD: $V_t(s_t) \leftarrow V_t(s_t) + \alpha [R_t^{(n)} - V_t(s_t)]$, i.e.,

$$\Delta V_t(s) = \begin{cases} \alpha [R_t^{(n)} - V_t(s_t)] & s = s_t \\ 0 & s \neq s_t \end{cases}$$
n-Step TD Backup
Online: $V_{t+1}(s) = V_t(s) + \Delta V_t(s)$

Offline: $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$

When offline, the new V(s) takes effect only in the next episode.
Error Reduction Property

$$\max_s \left| E_\pi\{R_t^{(n)} \mid s_t = s\} - V^\pi(s) \right| \;\le\; \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$$

The worst-case error of the expected n-step return (left side) is at most $\gamma^n$ times the worst-case error of the current value estimate V (right side).
Example (Random Walk)
A B C D E (start at C)
All rewards are 0, except a reward of 1 for stepping off the right end.
True values: V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6
Consider 2-step TD, 3-step TD, …
Which n is optimal?
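The true values 1/6 … 5/6 can be verified with a few lines of iterative policy evaluation for the fixed random policy; this check is ours, not part of the slides. We treat the right terminal as having value 1 (a stand-in for the +1 reward earned on that final transition):

```python
# Check that the true values of A..E in the 5-state random walk are 1/6..5/6:
# sweep the Bellman expectation equation V(s) = 0.5*V(left) + 0.5*V(right).
V = [0.0] * 7            # V[0] and V[6] are the terminal states
V[6] = 1.0               # stepping off the right end yields reward +1
for _ in range(10_000):  # in-place (Gauss-Seidel) sweeps until convergence
    for s in range(1, 6):
        V[s] = 0.5 * V[s - 1] + 0.5 * V[s + 1]
print([round(v, 3) for v in V[1:6]])   # → [0.167, 0.333, 0.5, 0.667, 0.833]
```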
Example (19-State Random Walk)
As before but with 19 states, a reward of −1 off the left end and +1 off the right end.
[Figure: average RMSE over the first 10 trials, online vs. offline n-step TD.]
Exercise (Random Walk)
Rewards: +1 and −1.
Standard moves.
Exercise (Random Walk)
Rewards: +1 and −1. Standard moves.
1. Evaluate the value function for the random policy.
2. Approximate the value function using n-step TD (try different n's and α's), and compare their performance.
3. Find the optimal policy.
Reinforcement Learning
Eligibility Traces
The Forward View of TD(λ)
Averaging n-step Returns

We are not limited to using a single n-step TD return; we can average several of them. For example:

$$R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$$

This is still one backup, and the weights must sum to 1.
TD(λ): the λ-Return

TD(λ) is a method for averaging all n-step backups, weighting the n-step return by $(1-\lambda)\lambda^{n-1}$ (decaying with the time since visitation); for an episode ending at time T, the full return $R_t$ receives the remaining weight $\lambda^{T-t-1}$. The result is called the λ-return:

$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

Backup using the λ-return:

$$\Delta V_t(s_t) = \alpha [R_t^\lambda - V_t(s_t)]$$
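A minimal sketch (ours, not from the slides) of computing the λ-return for a finite episode, assuming the n-step returns are already available and that every n-step return with $t+n \ge T$ equals the full return, so their weights lump into the final $\lambda^{T-t-1}$ term:

```python
# Sketch: lambda-return as a weighted average of the n-step returns.
def lambda_return(n_step_returns, lam):
    """n_step_returns[i] is R_t^(i+1); the last entry is the full return R_t."""
    horizon = len(n_step_returns)
    ret = sum((1 - lam) * lam ** n * n_step_returns[n]   # weight (1-lam)*lam^(n-1)
              for n in range(horizon - 1))
    ret += lam ** (horizon - 1) * n_step_returns[-1]     # lumped weight of R_t
    return ret
```

Setting `lam = 0` recovers the 1-step TD target, `lam = 1` recovers the Monte Carlo return, and the weights always sum to 1.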
Forward View of TD(λ)
A theoretical view
TD(λ) on the Random Walk
Reinforcement Learning
Eligibility Traces
The Backward View of TD(λ)
Why Backward View?
- The forward view is acausal, and hence not implementable.
- The backward view is causal and implementable, and in the offline case it achieves the same result as the forward view.
Eligibility Traces

Each state is associated with an additional memory variable, its eligibility trace, defined by:

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$
Eligibility: Recency of Visiting

At any time, the traces record which states have recently been visited, where "recently" is defined in terms of $\gamma\lambda$.
The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur.
The reinforcing events are the moment-by-moment 1-step TD errors:

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$
Reinforcing Event

The moment-by-moment 1-step TD errors:

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

drive the value updates:

$$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$$
TD(λ)

Eligibility traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \\ \gamma\lambda\, e_{t-1}(s) & s \neq s_t \end{cases}$$

Reinforcing events:
$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

Value updates:
$$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$$
Online TD(λ)

```
Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    Until s is terminal
```
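The boxed algorithm can be sketched in tabular form as follows. The helper name `td_lambda_episode`, the `(s, r, s′)` transition encoding, and the tiny two-state chain used to exercise it are our own assumptions, not from the slides:

```python
# Minimal tabular online TD(lambda) with accumulating traces.
from collections import defaultdict

def td_lambda_episode(V, episode, alpha, gamma, lam, terminal):
    """`episode` is a list of (s, r, s_next); terminal states have value 0."""
    e = defaultdict(float)                 # eligibility traces start at 0
    for s, r, s_next in episode:
        v_next = 0.0 if s_next in terminal else V[s_next]
        delta = r + gamma * v_next - V[s]  # reinforcing event (TD error)
        e[s] += 1.0                        # accumulating trace for s_t
        for state in list(e):              # back up every eligible state
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam        # decay all traces
    return V

# Deterministic chain 0 -> 1 -> terminal(2); reward 1 on the final step.
V = defaultdict(float)
for _ in range(200):
    td_lambda_episode(V, [(0, 0.0, 1), (1, 1.0, 2)], 0.1, 1.0, 0.9, {2})
```

On this chain both values converge toward 1, the return from every state under γ = 1.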
Backward View of TD(λ)
Backward View vs. MC & TD(0)
- Set λ to 0 and we get TD(0).
- Set λ to 1 and we get MC, but in a better way:
  - TD(1) can be applied to continuing tasks.
  - It works incrementally and online, instead of waiting until the end of the episode.
- How about 0 < λ < 1?
Reinforcement Learning
Eligibility Traces
Equivalence of the Forward and Backward Views
Offline TD(λ)'s

Offline forward TD(λ) (λ-return):
$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$
$$\Delta V_t^f(s) = \begin{cases} \alpha [R_t^\lambda - V_t(s_t)] & s = s_t \\ 0 & s \neq s_t \end{cases}$$

Offline backward TD(λ):
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$
$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$
$$\Delta V_t^b(s) = \alpha\, \delta_t\, e_t(s)$$
Forward View = Backward View

$$\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{ss_t}$$

with backward updates on the left and forward updates on the right, where

$$I_{ss_t} = \begin{cases} 1 & s = s_t \\ 0 & s \neq s_t \end{cases}$$

See the proof.
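The equivalence can also be checked numerically: apply the offline forward (λ-return) updates and the offline backward (trace) updates to the same fixed V and the same recorded episode, and compare the per-state totals. This check and all names in it are ours, not from the slides:

```python
# Offline forward (lambda-return) totals per state, V held fixed.
def forward_updates(V, states, rewards, alpha, gamma, lam):
    T = len(rewards)                      # rewards[t] holds r_{t+1}
    dV = {}
    for t in range(T):
        full = sum(gamma ** i * rewards[t + i] for i in range(T - t))
        lam_ret = lam ** (T - t - 1) * full          # lumped weight of R_t
        for n in range(1, T - t):                    # truncated n-step returns
            Rn = sum(gamma ** i * rewards[t + i] for i in range(n))
            Rn += gamma ** n * V[states[t + n]]
            lam_ret += (1 - lam) * lam ** (n - 1) * Rn
        s = states[t]
        dV[s] = dV.get(s, 0.0) + alpha * (lam_ret - V[s])
    return dV

# Offline backward (eligibility-trace) totals per state, V held fixed.
def backward_updates(V, states, rewards, alpha, gamma, lam):
    T = len(rewards)
    e, dV = {}, {}
    for t in range(T):
        v_next = V[states[t + 1]] if t + 1 < T else 0.0   # terminal V = 0
        delta = rewards[t] + gamma * v_next - V[states[t]]
        e[states[t]] = e.get(states[t], 0.0) + 1.0
        for s in e:
            dV[s] = dV.get(s, 0.0) + alpha * delta * e[s]
            e[s] *= gamma * lam
    return dV
```

Running both on an arbitrary episode (even one that revisits states) gives matching totals, as the proof at the end of these notes shows.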
TD(λ) on the Random Walk
[Figure: average RMSE over the first 10 trials for the offline λ-return (forward) and online TD(λ) (backward) algorithms.]
Reinforcement Learning
Eligibility Traces
Sarsa(λ)
Sarsa(λ)

TD(λ) uses eligibility traces for policy evaluation. How can eligibility traces be used for control? Learn $Q_t(s, a)$ rather than $V_t(s)$.
Sarsa(λ)

Eligibility traces:
$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$

Reinforcing events:
$$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$

Updates:
$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$$
Sarsa(λ)

```
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′,a′) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s′; a ← a′
    Until s is terminal
```
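A tabular sketch of the boxed Sarsa(λ) algorithm with accumulating traces. The trajectory encoding, the function name, and the toy two-state chain are our own assumptions:

```python
# One episode of tabular Sarsa(lambda) with accumulating traces.
from collections import defaultdict

def sarsa_lambda_episode(Q, trajectory, alpha, gamma, lam, terminal):
    """`trajectory` is a list of (s, a, r, s_next, a_next); Q of terminal
    states is fixed at 0 (a_next is unused on the final transition)."""
    e = defaultdict(float)
    for s, a, r, s_next, a_next in trajectory:
        q_next = 0.0 if s_next in terminal else Q[(s_next, a_next)]
        delta = r + gamma * q_next - Q[(s, a)]   # reinforcing event
        e[(s, a)] += 1.0                         # accumulating trace
        for sa in list(e):                       # back up every eligible pair
            Q[sa] += alpha * delta * e[sa]
            e[sa] *= gamma * lam
    return Q

# Deterministic chain 0 -> 1 -> terminal(2); reward 1 on the final step.
Q = defaultdict(float)
for _ in range(300):
    sarsa_lambda_episode(Q, [(0, 'R', 0.0, 1, 'R'), (1, 'R', 1.0, 2, None)],
                         0.1, 1.0, 0.8, {2})
```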
Sarsa(λ) Traces in Grid World
- With one trial, the agent has much more information about how to get to the goal (not necessarily the best way).
- Traces can considerably accelerate learning.
Reinforcement Learning
Eligibility Traces
Q(λ)
Q-Learning

Q-learning is an off-policy method: it breaks from the estimation policy from time to time to take exploratory actions, so a simple time trace cannot be easily implemented.
How can eligibility traces be combined with Q-learning? Three methods:
- Watkins's Q(λ)
- Peng's Q(λ)
- Naïve Q(λ)
Watkins's Q(λ)
[Diagram: the behavior policy (e.g., ε-greedy) and the estimation policy (e.g., greedy) follow the same greedy path until the first non-greedy action, after which the paths diverge.]
Backups of Watkins's Q(λ)
Two cases:
1. Both the behavior and estimation policies take the greedy path.
2. The behavior path takes a non-greedy action before the episode ends.
How should the eligibility traces be defined?
Watkins's Q(λ)

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & s = s_t,\ a = a_t,\ \text{and } Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda\, e_{t-1}(s,a) & (s,a) \neq (s_t,a_t) \text{ and } Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{otherwise} \end{cases}$$

$$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$$

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$$
Watkins's Q(λ)

```
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s′,b)   (if a′ ties for the max, then a* ← a′)
        δ ← r + γQ(s′,a*) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a′ = a*, then e(s,a) ← γλe(s,a)
            else e(s,a) ← 0
        s ← s′; a ← a′
    Until s is terminal
```
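A sketch of the boxed Watkins's Q(λ) algorithm: traces decay only while the behavior action remains greedy and are cut to zero otherwise. The function name, the step encoding, and the toy chain are our own assumptions:

```python
# Watkins's Q(lambda) over one recorded step sequence.
from collections import defaultdict

def watkins_q_lambda_episode(Q, steps, actions, alpha, gamma, lam, terminal):
    """`steps` is a list of (s, a, r, s_next, a_next) tuples."""
    e = defaultdict(float)
    for s, a, r, s_next, a_next in steps:
        if s_next in terminal:
            q_star, greedy_next = 0.0, True       # episode ends here anyway
        else:
            q_star = max(Q[(s_next, b)] for b in actions)
            greedy_next = Q[(s_next, a_next)] == q_star  # a' ties for the max?
        delta = r + gamma * q_star - Q[(s, a)]    # Q-learning TD error
        e[(s, a)] += 1.0
        for sa in list(e):
            Q[sa] += alpha * delta * e[sa]
            # decay traces while greedy; cut them after an exploratory action
            e[sa] = gamma * lam * e[sa] if greedy_next else 0.0
    return Q

# Deterministic chain 0 -> 1 -> terminal(2) with a single (greedy) action.
Q = defaultdict(float)
for _ in range(300):
    watkins_q_lambda_episode(Q, [(0, 'R', 0.0, 1, 'R'), (1, 'R', 1.0, 2, None)],
                             ['R'], 0.1, 1.0, 0.8, {2})
```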
Peng's Q(λ)
Cutting off traces loses much of the advantage of using eligibility traces. If exploratory actions are frequent, as they often are early in learning, then only rarely will backups of more than one or two steps be done, and learning may be little faster than 1-step Q-learning.
Peng's Q(λ) is an alternate version of Q(λ) meant to remedy this.
Backups of Peng's Q(λ)
Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3).
- Never cut traces.
- Back up the max action, except at the end.
- The book says it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ).
- Disadvantage: difficult to implement.
Peng's Q(λ): see Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3), for the notation.
Naïve Q(λ)
Idea: is it really a problem to back up exploratory actions?
- Never zero the traces.
- Always back up the max at the current action (unlike Peng's or Watkins's).
Is this truly naïve? It works well in preliminary empirical studies.
Naïve Q(λ)

The same as Watkins's Q(λ), except that the traces are never set to zero:

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$

$$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$$

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$$
Comparisons
McGovern, Amy and Sutton, Richard S. (1997). Towards a Better Q(λ). Presented at the Fall 1997 Reinforcement Learning Workshop.
Deterministic gridworld with obstacles: 10×10 gridworld, 25 randomly generated obstacles, 30 runs; α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05; accumulating traces.
Comparisons
Convergence of the Q(λ)'s
None of the methods has been proven to converge (much extra credit if you can prove any of them).
- Watkins's is thought to converge to Q*.
- Peng's is thought to converge to a mixture of Q^π and Q*.
- Naïve: to Q*?
Reinforcement Learning
Eligibility Traces
Eligibility Traces for Actor-Critic Methods
Actor-Critic Methods
[Diagram: the actor (policy) selects actions in the environment; the critic (value function) observes the state and reward and sends the TD error to the actor.]
- Critic: on-policy learning of V; use TD(λ) as described before.
- Actor: needs eligibility traces for each state-action pair.
Policy Parameter Updates

Method 1:
$$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\, \delta_t & a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$

With eligibility traces:
$$p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$$
Policy Parameter Updates

Method 2:
$$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\, \delta_t\, [1 - \pi_t(s_t, a_t)] & a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$

With eligibility traces:
$$p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$$
where
$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t, a_t) & s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
Reinforcement Learning
Eligibility Traces
Replacing Traces
Accumulating/Replacing Traces

Accumulating traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$

Replacing traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ 1 & s = s_t \end{cases}$$
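The difference between the two trace types shows up already for a single state visited on every step; this small demo (ours, not from the slides) pins the replacing trace at 1 while the accumulating trace grows past it:

```python
# Trace of one state over a sequence of steps, under either update rule.
def trace_sequence(gamma, lam, visits, replacing):
    """`visits[i]` is True if the state is visited on step i."""
    e, out = 0.0, []
    for visited in visits:
        e *= gamma * lam                      # decay first
        if visited:
            e = 1.0 if replacing else e + 1.0 # pin to 1, or accumulate
        out.append(e)
    return out
```

For γλ = 0.9 and three consecutive visits, the accumulating trace reaches 1 + 0.9 + 0.81 = 2.71, while the replacing trace stays at 1.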
Why Replacing Traces?
- With accumulating traces, frequently visited states can have eligibilities greater than 1, which can be a problem for convergence.
- Replacing traces can significantly speed learning.
- They can make the system perform well over a broader range of parameters.
- Accumulating traces can do poorly on certain types of tasks.
Example (19-State Random Walk)
Extension to Action Values

When you revisit a state, what should you do with the traces for the other actions? Singh and Sutton (1996) suggested setting the traces of all the other actions from the revisited state to 0:

$$e_t(s,a) = \begin{cases} 0 & s = s_t \text{ and } a \neq a_t \\ 1 & s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & s \neq s_t \end{cases}$$
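A sketch of this Singh-and-Sutton replacing-trace update for state-action pairs; the helper name and the dictionary representation are our own:

```python
# Replacing-trace update for state-action pairs: on a visit to s_t, the
# taken action's trace is set to 1 and the other actions' traces to 0.
def update_traces(e, s_t, a_t, actions, gamma, lam):
    for sa in list(e):
        e[sa] *= gamma * lam        # decay traces of unvisited states
    for a in actions:
        e[(s_t, a)] = 0.0           # clear every action of s_t ...
    e[(s_t, a_t)] = 1.0             # ... then set the taken action's trace to 1
    return e
```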
Reinforcement Learning
Eligibility Traces
Implementation Issues
Implementation Issues
- For practical use, we cannot afford to compute every trace down to the last one.
- Dropping very small trace values is recommended and encouraged.
- In Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).
- Used with neural networks and backpropagation, traces generally only double the needed computation.
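The "drop very small traces" advice can be sketched as a decay step that deletes traces below a cutoff, so each backup touches only the few states with non-negligible eligibility rather than the whole state space. The function name and cutoff value are our own:

```python
# Sparse trace decay: keep only traces above a cutoff, since a trace
# decayed by gamma*lambda per step soon becomes numerically irrelevant.
def decay_traces(e, gamma, lam, cutoff=1e-4):
    """Decay all traces in dict `e` and drop entries below `cutoff`."""
    for s in list(e):
        e[s] *= gamma * lam
        if e[s] < cutoff:
            del e[s]
    return e
```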
Variable λ

TD(λ) can be generalized to a variable λ, where $\lambda_t$ is a function of time, e.g. $\lambda_t = \lambda(s_t)$:

$$e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$
Proof of $\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{ss_t}$

An accumulating eligibility trace can be written explicitly (non-recursively) as

$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} I_{ss_k}.$$

The total backward update is therefore

$$\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \alpha\, \delta_t \sum_{k=0}^{t} (\gamma\lambda)^{t-k} I_{ss_k} = \alpha \sum_{k=0}^{T-1} I_{ss_k} \sum_{t=k}^{T-1} (\gamma\lambda)^{t-k} \delta_t = \alpha \sum_{t=0}^{T-1} I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k,$$

where the second equality exchanges the order of summation over the region $0 \le k \le t \le T-1$, and the third merely swaps the names of the indices t and k.

For the forward side, $\Delta V_t^f(s_t) = \alpha [R_t^\lambda - V_t(s_t)]$, and expanding the λ-return,

$$R_t^\lambda - V_t(s_t) = -V_t(s_t) + (1-\lambda)\lambda^0 \left[ r_{t+1} + \gamma V_t(s_{t+1}) \right] + (1-\lambda)\lambda^1 \left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2}) \right] + (1-\lambda)\lambda^2 \left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3}) \right] + \cdots$$

Regrouping the terms by time step turns this into a telescoping sum of 1-step TD errors:

$$R_t^\lambda - V_t(s_t) = (\gamma\lambda)^0 \left[ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \right] + (\gamma\lambda)^1 \left[ r_{t+2} + \gamma V_t(s_{t+2}) - V_t(s_{t+1}) \right] + (\gamma\lambda)^2 \left[ r_{t+3} + \gamma V_t(s_{t+3}) - V_t(s_{t+2}) \right] + \cdots = \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k.$$

Hence

$$\sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{ss_t} = \alpha \sum_{t=0}^{T-1} I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k,$$

which equals the backward total above. ∎