Reinforcement Learning
Eligibility Traces
Lecturer: 虞台文
Intelligent Multimedia Lab, Graduate Institute of Computer Science and Engineering, Tatung University
Contents
- n-step TD prediction
- Forward view of TD(λ)
- Backward view of TD(λ)
- Equivalence of the forward and backward views
- Sarsa(λ)
- Q(λ)
- Eligibility traces for actor-critic methods
- Replacing traces
- Implementation issues
n-Step TD Prediction
Elementary Methods
- Dynamic Programming
- Monte Carlo Methods
- TD(0)

Monte Carlo vs. TD(0)
- Monte Carlo: observes the rewards for all steps in an episode.
- TD(0): observes one step only.
Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

TD(0): $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
n-Step TD Prediction

Backup diagram: the spectrum runs from the 1-step TD backup, through the 2-step, 3-step, and general n-step backups (with targets $R_t^{(1)}, R_t^{(2)}, R_t^{(3)}, \ldots, R_t^{(n)}$), to the full Monte Carlo backup (target $R_t$).
n-Step TD Prediction

Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

1-step: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$

2-step: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$

3-step: $R_t^{(3)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3})$

n-step: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$

$R_t^{(n)}$ is called the corrected n-step truncated return.
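For concreteness, here is a minimal Python sketch of computing $R_t^{(n)}$ from a recorded episode. The list layout (`rewards[k]` holding $r_{k+1}$, `values[k]` holding $V(s_k)$) is an assumption made for this illustration, not something prescribed by the slides.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Corrected n-step truncated return R_t^(n).

    rewards[k] holds r_{k+1} (the reward on the transition out of step k),
    values[k] holds the current estimate V(s_k); the episode terminates
    after T = len(rewards) steps.
    """
    T = len(rewards)
    steps = min(n, T - t)                    # truncate at the end of the episode
    ret = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:                            # correct with the estimated value of s_{t+n}
        ret += gamma**n * values[t + n]
    return ret
```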
Backups

Monte Carlo: $V_{t+1}(s_t) = V_t(s_t) + \alpha\,[R_t - V_t(s_t)]$

TD(0): $V_{t+1}(s_t) = V_t(s_t) + \alpha\,[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)] = V_t(s_t) + \alpha\,[R_t^{(1)} - V_t(s_t)]$

n-step TD: $V_{t+1}(s_t) = V_t(s_t) + \alpha\,[R_t^{(n)} - V_t(s_t)]$

Increment form:
$$\Delta V_t(s) = \begin{cases} \alpha\,[R_t^{(n)} - V_t(s_t)] & s = s_t \\ 0 & s \neq s_t \end{cases}$$
n-Step TD Backup

Online: $V_{t+1}(s) = V_t(s) + \Delta V_t(s)$

Offline: $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$

When updating offline, the new V(s) does not take effect until the next episode.

In both cases,
$$\Delta V_t(s) = \begin{cases} \alpha\,[R_t^{(n)} - V_t(s_t)] & s = s_t \\ 0 & s \neq s_t \end{cases}$$
Error Reduction Property

$$\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \;\le\; \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$$

The left-hand side is the maximum error using the n-step return; the right-hand side is $\gamma^n$ times the maximum error using the current value estimate V. The property holds for both the online and offline updates above.
Example (Random Walk)

States A, B, C, D, E; episodes start in the middle (C). All rewards are 0 except a reward of 1 on terminating off the right end.

True values: V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6.

Consider 2-step TD, 3-step TD, and so on. Which n is optimal?
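Below is a small, hypothetical Python sketch of n-step TD prediction on this random walk. The encoding (states 0 to 4 for A to E, +1 only off the right end, γ = 1) follows the slide; the step size and number of episodes are arbitrary choices.

```python
import random

def random_walk_episode(n_states=5):
    """One episode of the A-E random walk: start in the middle and step
    left or right with equal probability until falling off either end."""
    s, visited = n_states // 2, []
    while 0 <= s < n_states:
        visited.append(s)
        s += random.choice((-1, 1))
    return visited, (1.0 if s == n_states else 0.0)   # +1 only off the right end

def n_step_td(n=2, alpha=0.1, episodes=2000, n_states=5):
    """n-step TD prediction (gamma = 1, all rewards 0 except the terminal +1);
    updates are applied in a single sweep after each episode for simplicity."""
    V = [0.5] * n_states                      # arbitrary initial values
    for _ in range(episodes):
        visited, terminal_reward = random_walk_episode(n_states)
        T = len(visited)                      # termination happens at time T
        for t in range(T):
            if t + n < T:
                target = V[visited[t + n]]    # n-step return bootstraps on V
            else:
                target = terminal_reward      # episode ends within n steps
            V[visited[t]] += alpha * (target - V[visited[t]])
    return V

print(n_step_td(n=2))   # true values are 1/6, 2/6, 3/6, 4/6, 5/6
```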
Example (19-State Random Walk)

The same task with 19 states, a reward of -1 off the left end and +1 off the right end.

Plot: average RMSE over the first 10 trials, for online and offline n-step TD with various n and α.
Exercise (Random Walk)

A grid world with terminal rewards of +1 and -1; standard moves.

1. Evaluate the value function for the random policy.
2. Approximate the value function using n-step TD (try different n's and α's), and compare their performance.
3. Find the optimal policy.
The Forward View of TD(λ)
Averaging n-step Returns

We are not limited to using a single n-step return as the target. For example, we could average n-step returns:

$$R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$$

This is still a single backup; any set of weights that sums to 1 is allowed.
TD(λ): The λ-Return

TD(λ) is a method for averaging all n-step backups, weighting the n-step return $R_t^{(n)}$ by $(1-\lambda)\lambda^{n-1}$ (decaying with the time since visitation). The resulting target is called the λ-return:

$$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

Backup using the λ-return:

$$\Delta V_t(s_t) = \alpha\,[R_t^{\lambda} - V_t(s_t)]$$
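A minimal sketch of computing the λ-return directly from its definition, assuming a finite episode so that all weight beyond the episode's end collapses onto the complete return. The helper is self-contained and hypothetical, not taken from the slides.

```python
def lambda_return(rewards, values, t, lam, gamma):
    """R_t^lambda: a (1 - lam)-weighted average of the n-step returns R_t^(n),
    with the tail weight lam^(T-t-1) falling on the complete return."""
    T = len(rewards)

    def n_step(n):                       # corrected n-step truncated return
        steps = min(n, T - t)
        g = sum(gamma**k * rewards[t + k] for k in range(steps))
        if t + n < T:
            g += gamma**n * values[t + n]
        return g

    ret = sum((1 - lam) * lam**(n - 1) * n_step(n) for n in range(1, T - t))
    return ret + lam**(T - t - 1) * n_step(T - t)
```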
Forward View of TD(λ)

A theoretical view: each update looks forward in time to future rewards and states to form the λ-return.

TD(λ) on the Random Walk
The Backward View of TD(λ)
Why a Backward View?

- The forward view is acausal and therefore not directly implementable.
- The backward view is causal and implementable, and in the offline case it achieves exactly the same result as the forward view.
Eligibility Traces

Each state is associated with an additional memory variable, its eligibility trace, defined by:

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$
Eligibility Traces Record Recency of Visits

At any time, the traces record which states have recently been visited, where "recently" is defined in terms of $\gamma\lambda$.

The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur.

The reinforcing event is the moment-by-moment 1-step TD error:

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$
Reinforcing Event

The moment-by-moment 1-step TD error:

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

Value updates: $\Delta V_t(s) = \alpha\,\delta_t\, e_t(s)$ for all s.
TD(λ)

Eligibility traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$

Reinforcing events:
$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

Value updates:
$$\Delta V_t(s) = \alpha\,\delta_t\, e_t(s)$$
Online TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        δ ← r + γV(s') − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + α δ e(s)
            e(s) ← γλ e(s)
        s ← s'
    Until s is terminal
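A tabular Python sketch of the algorithm above. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the `policy` callable are assumptions made for illustration, not part of the slides.

```python
from collections import defaultdict

def td_lambda(env, policy, episodes=100, alpha=0.1, gamma=1.0, lam=0.9):
    """Online TD(lambda) policy evaluation with accumulating traces."""
    V = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)                # eligibility traces, reset per episode
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] * (not done) - V[s]
            e[s] += 1.0                       # accumulating trace
            for x in list(e):                 # update every eligible state
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam           # decay the trace
            s = s_next
    return V
```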
Backward View of TD(λ)

Backward View vs. MC and TD(0)

- Setting λ to 0, we get TD(0).
- Setting λ to 1, we get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and online (instead of waiting until the end of the episode).
- How about 0 < λ < 1?
Equivalence of the Forward and Backward Views
Offline TD(λ)'s

Offline forward TD(λ) (λ-return):

$$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

$$\Delta V_t^f(s) = \begin{cases} \alpha\,[R_t^{\lambda} - V_t(s_t)] & s = s_t \\ 0 & s \neq s_t \end{cases}$$

Offline backward TD(λ):

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \qquad \Delta V_t^b(s) = \alpha\,\delta_t\, e_t(s)$$
Forward View = Backward View

$$\sum_{t=0}^{T-1} \Delta V_t^b(s) \;=\; \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{s s_t}$$

(backward updates on the left, forward updates on the right), where

$$I_{s s_t} = \begin{cases} 1 & s = s_t \\ 0 & s \neq s_t \end{cases}$$

See the proof at the end of these notes.
TD(λ) on the Random Walk

Plot: average RMSE over the first 10 trials, comparing the offline λ-return algorithm (forward view) with online TD(λ) (backward view).
Sarsa(λ)
Sarsa(λ)

TD(λ) uses eligibility traces for policy evaluation. How can eligibility traces be used for control? Learn $Q_t(s, a)$ rather than $V_t(s)$.
Sarsa(λ)

Eligibility traces:
$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$

Reinforcing events:
$$\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$

Updates:
$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
Sarsa(λ)

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s',a') − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            e(s,a) ← γλ e(s,a)
        s ← s'; a ← a'
    Until s is terminal
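A corresponding tabular Python sketch of Sarsa(λ) with accumulating traces; the environment interface and the ε-greedy helper are assumptions, not part of the slides.

```python
from collections import defaultdict
import random

def sarsa_lambda(env, actions, episodes=500, alpha=0.1, gamma=1.0,
                 lam=0.9, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating traces and an eps-greedy policy."""
    Q = defaultdict(float)                    # keyed by (state, action)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        e = defaultdict(float)                # traces reset at episode start
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            delta = r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)]
            e[(s, a)] += 1.0
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            s, a = s2, a2
    return Q
```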
Sarsa(λ) Traces in Grid World

With a single trial, the agent acquires much more information about how to reach the goal (though not necessarily the best way). Traces can considerably accelerate learning.
Q(λ)
Q-Learning

Q-learning is an off-policy method: it breaks from the estimation policy from time to time to take exploratory actions, so a simple time-decayed trace cannot be used directly.

How do we combine eligibility traces with Q-learning? Three methods:
- Watkins's Q(λ)
- Peng's Q(λ)
- Naïve Q(λ)
Watkins's Q(λ)

Backup diagram: the behavior policy (e.g., ε-greedy) generates the path actually followed, while the estimation policy (e.g., greedy) defines the greedy path. Traces are carried only up to the first non-greedy action.

Backups in Watkins's Q(λ). Two cases:
1. Both the behavior and estimation policies take the greedy path.
2. The behavior path takes a non-greedy action before the episode ends.

How should the eligibility traces be defined to handle both cases?
Watkins's Q(λ)

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t,\ a = a_t,\ \text{and } Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } (s,a) \neq (s_t,a_t) \ \text{and } Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{otherwise} \end{cases}$$

$$\delta_t = r_{t+1} + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$$

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
Watkins's Q(λ)

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s',b)   (if a' ties for the max, then a* ← a')
        δ ← r + γQ(s',a*) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            If a' = a*, then e(s,a) ← γλ e(s,a)
            else e(s,a) ← 0
        s ← s'; a ← a'
    Until s is terminal
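A tabular Python sketch of Watkins's Q(λ); the environment interface is again assumed. The key difference from Sarsa(λ) is that all traces are cut to zero whenever the action actually taken is non-greedy.

```python
from collections import defaultdict
import random

def watkins_q_lambda(env, actions, episodes=500, alpha=0.1, gamma=0.9,
                     lam=0.9, epsilon=0.05):
    """Tabular Watkins's Q(lambda): traces are cut after exploratory actions."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        e = defaultdict(float)
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            a_star = max(actions, key=lambda b: Q[(s2, b)])
            if Q[(s2, a2)] == Q[(s2, a_star)]:    # ties for the max count as greedy
                a_star = a2
            delta = r + gamma * Q[(s2, a_star)] * (not done) - Q[(s, a)]
            e[(s, a)] += 1.0
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                # decay traces only if the followed action was greedy, else cut them
                e[key] = gamma * lam * e[key] if a2 == a_star else 0.0
            s, a = s2, a2
    return Q
```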
Peng's Q(λ)

Cutting off traces loses much of the advantage of using eligibility traces. If exploratory actions are frequent, as they often are early in learning, then only rarely will backups of more than one or two steps be done, and learning may be little faster than 1-step Q-learning.

Peng's Q(λ) is an alternate version of Q(λ) meant to remedy this.

Backups in Peng's Q(λ):
- Never cut traces.
- Back up the max action, except at the end.
- The book says it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ).
- Disadvantage: difficult to implement.

See Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3), for the notation.
Naïve Q(λ)

Idea: is it really a problem to back up exploratory actions?
- Never zero the traces.
- Always back up the max at the current action (unlike Peng's or Watkins's).

Is this truly naïve? It works well in preliminary empirical studies.
Naïve Q(λ)

The same as Watkins's Q(λ), except the traces are never cut:

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$

$$\delta_t = r_{t+1} + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$$

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
Comparisons

McGovern, Amy and Sutton, Richard S. (1997). Towards a Better Q(λ). Presented at the Fall 1997 Reinforcement Learning Workshop.

Deterministic gridworld with obstacles:
- 10x10 gridworld
- 25 randomly generated obstacles
- 30 runs
- α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05
- accumulating traces
Comparisons (results)

Convergence of the Q(λ)'s

- None of the methods is proven to converge (much extra credit if you can prove any of them).
- Watkins's is thought to converge to Q*.
- Peng's is thought to converge to a mixture of Q^π and Q*.
- Naïve: Q*?
Eligibility Traces for Actor-Critic Methods
Actor-Critic Methods

Architecture diagram: the Actor (policy) selects actions; the environment returns a state and reward; the Critic (value function) computes a TD error, which drives learning in both the critic and the actor.

Critic: on-policy learning of V. Use TD(λ) as described before.
Actor: needs eligibility traces for each state-action pair.
Policy Parameter Updates

Method 1:

$$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \beta\,\delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$

Generalized with eligibility traces:

$$p_{t+1}(s,a) = p_t(s,a) + \beta\,\delta_t\, e_t(s,a)$$
Policy Parameter Updates

Method 2:

$$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \beta\,\delta_t\,[1 - \pi_t(s_t,a_t)] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$

Generalized with eligibility traces:

$$p_{t+1}(s,a) = p_t(s,a) + \beta\,\delta_t\, e_t(s,a)$$

where the traces are

$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
Replacing Traces
Accumulating vs. Replacing Traces

Accumulating traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$$

Replacing traces:
$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ 1 & s = s_t \end{cases}$$
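A small sketch of the two trace-update rules side by side; the dictionary-based representation is an implementation choice, not something from the slides.

```python
def update_trace(e, s, gamma, lam, replacing=False):
    """Decay all traces, then bump the trace of the visited state s."""
    for x in e:
        e[x] *= gamma * lam
    if replacing:
        e[s] = 1.0             # replacing trace: reset to 1 on a revisit
    else:
        e[s] += 1.0            # accumulating trace: can exceed 1 on revisits
    return e

# Example: a state revisited while its old trace has not fully decayed
print(update_trace({"A": 0.5, "B": 0.2}, "A", 1.0, 0.9))                  # accumulating
print(update_trace({"A": 0.5, "B": 0.2}, "A", 1.0, 0.9, replacing=True))  # replacing
```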
Why Replacing Traces?

- With accumulating traces, frequently visited states can have eligibilities greater than 1, which can be a problem for convergence.
- Replacing traces can significantly speed learning.
- They can make the system perform well over a broader range of parameters.
- Accumulating traces can do poorly on certain types of tasks.

Example (19-State Random Walk)
Extension to Action Values

When you revisit a state, what should you do with the traces for the other actions?

Singh and Sutton (1996) suggest setting the traces of all the other actions from the revisited state to 0:

$$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \neq a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } s \neq s_t \end{cases}$$
Implementation Issues
Implementation Issues

- For practical use we cannot afford to update every trace on every step.
- Dropping traces with very small values is recommended.
- If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrix operations).
- Using traces with neural networks and backpropagation generally only doubles the needed computation.
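A sketch of the recommended trick of dropping near-zero traces, so each step only touches recently visited states; the cutoff value is an arbitrary choice for illustration.

```python
def decay_traces(e, gamma, lam, cutoff=1e-4):
    """Decay all eligibility traces and drop those below a small cutoff,
    so only recently visited states are updated on each step."""
    for s in list(e):
        e[s] *= gamma * lam
        if e[s] < cutoff:
            del e[s]           # stop updating states whose trace has faded away
    return e
```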
Variable λ

The method can be generalized to a variable λ, where λ_t is a function of time, e.g. $\lambda_t = \lambda(s_t)$:

$$e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
Proof

Claim: $\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{s s_t}$.

An accumulating eligibility trace can be written explicitly (non-recursively) as

$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, I_{s s_k}$$

Therefore the sum of the backward updates is

$$\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \alpha\,\delta_t \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, I_{s s_k} = \alpha \sum_{k=0}^{T-1} I_{s s_k} \sum_{t=k}^{T-1} (\gamma\lambda)^{t-k}\,\delta_t = \alpha \sum_{t=0}^{T-1} I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$

where the middle step swaps the order of summation over the region $0 \le k \le t \le T-1$, and the last step simply renames the indices.
Proof (continued)

On the forward side,

$$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = R_t^{\lambda} - V_t(s_t) = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} R_t^{(n)} - V_t(s_t)$$

Expanding the λ-return,

$$R_t^{\lambda} = (1-\lambda)\lambda^{0}\,[r_{t+1} + \gamma V_t(s_{t+1})] + (1-\lambda)\lambda^{1}\,[r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})] + (1-\lambda)\lambda^{2}\,[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3})] + \cdots$$
Proof (continued)

Collecting terms, the forward update can be rewritten as a discounted sum of 1-step TD errors:

$$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = (\gamma\lambda)^{0}\,[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)] + (\gamma\lambda)^{1}\,[r_{t+2} + \gamma V_t(s_{t+2}) - V_t(s_{t+1})] + (\gamma\lambda)^{2}\,[r_{t+3} + \gamma V_t(s_{t+3}) - V_t(s_{t+2})] + \cdots$$

$$= (\gamma\lambda)^{0}\,\delta_t + (\gamma\lambda)^{1}\,\delta_{t+1} + (\gamma\lambda)^{2}\,\delta_{t+2} + \cdots$$

This holds exactly because the value function is held fixed within the episode in the offline case.
Proof (concluded)

Hence

$$\frac{1}{\alpha}\,\Delta V_t^f(s_t) = \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$

and, summing the forward updates for state s,

$$\sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{s s_t} = \alpha \sum_{t=0}^{T-1} I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$

(summation over the region $0 \le t \le k \le T-1$), which is exactly the expression obtained above for the summed backward updates. Therefore

$$\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{s s_t} \qquad\blacksquare$$
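The offline equivalence can also be checked numerically. The sketch below builds a small random episode, holds V fixed (the offline assumption), and compares the summed forward (λ-return) updates against the summed backward (trace-based) updates per state; all names are hypothetical.

```python
import random

def forward_backward_check(T=6, gamma=0.9, lam=0.7, alpha=0.1, n_states=3):
    states  = [random.randrange(n_states) for _ in range(T)]   # s_0..s_{T-1}
    rewards = [random.random() for _ in range(T)]               # r_1..r_T
    V = [random.random() for _ in range(n_states)]              # held fixed (offline)

    def n_step(t, n):                        # corrected n-step return R_t^(n)
        steps = min(n, T - t)
        g = sum(gamma**k * rewards[t + k] for k in range(steps))
        if t + n < T:
            g += gamma**n * V[states[t + n]]
        return g

    def lam_return(t):                       # lambda-return for a finite episode
        g = sum((1 - lam) * lam**(n - 1) * n_step(t, n) for n in range(1, T - t))
        return g + lam**(T - t - 1) * n_step(t, T - t)

    # forward (lambda-return) updates, accumulated per state
    fwd = [0.0] * n_states
    for t in range(T):
        fwd[states[t]] += alpha * (lam_return(t) - V[states[t]])

    # backward (eligibility-trace) updates, accumulated per state
    bwd, e = [0.0] * n_states, [0.0] * n_states
    for t in range(T):
        v_next = V[states[t + 1]] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * v_next - V[states[t]]
        e[states[t]] += 1.0
        for s in range(n_states):
            bwd[s] += alpha * delta * e[s]
            e[s] *= gamma * lam
    return fwd, bwd

f, b = forward_backward_check()
print(max(abs(x - y) for x, y in zip(f, b)))   # ~0 up to floating-point error
```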