Actor-Critic - 國立臺灣大學

Actor-Critic Hung-yi Lee


Page 1: Actor-Critic - 國立臺灣大學

Actor-Critic
Hung-yi Lee

Page 2: Actor-Critic - 國立臺灣大學

Asynchronous Advantage Actor-Critic (A3C)

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning", ICML, 2016

Page 3: Actor-Critic - 國立臺灣大學

Review – Policy Gradient

$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^{n} - b \right) \nabla \log p_\theta(a_t^n \mid s_t^n)$

Here $b$ is the baseline, and $G_t^n = \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^{n}$ is obtained via interaction with the environment, so it is very unstable.

[Figure: for the same state s and action a, sampled returns vary widely, e.g. G = 1, G = -10, G = 2, G = 3, G = 100.]

With sufficient samples we could approximate the expectation of G. Can we estimate the expected value of G directly instead?
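As a concrete aside (not from the slides), this is one way the sampled return $G_t^n = \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^{n}$ could be computed from one recorded episode; the function name and reward values below are purely illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the sampled return G_t for every time step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative reward sequence for a single sampled trajectory.
print(discounted_returns([0.0, 0.0, 1.0, 0.0, 10.0]))
```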

Page 4: Actor-Critic - 國立臺灣大學

Review – Q-Learning

• State value function 𝑉𝜋 𝑠

• The expected cumulated reward obtained after visiting state s, when using actor 𝜋

• State-action value function 𝑄𝜋 𝑠, 𝑎

• The expected cumulated reward obtained after taking action a at state s, when using actor 𝜋

[Figure: two critic architectures. A $V^\pi$ network takes state s and outputs a scalar $V^\pi(s)$. A $Q^\pi$ network takes state s and outputs $Q^\pi(s, a)$ for every action (left, right, fire); this form works for discrete actions only.]

Estimated by TD or MC
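For illustration only, a sketch of the two regression targets a critic could be trained toward, assuming a PyTorch value network `v_net` (a hypothetical name, not from the course): the Monte-Carlo target uses the full sampled return, while the TD target bootstraps from $V(s_{t+1})$.

```python
import torch
import torch.nn.functional as F

def mc_target(returns):
    # Monte-Carlo: regress V(s_t) toward the full sampled return G_t.
    return torch.as_tensor(returns, dtype=torch.float32)

def td_target(v_net, rewards, next_states, gamma=0.99):
    # Temporal-difference: regress V(s_t) toward r_t + gamma * V(s_{t+1}).
    with torch.no_grad():
        next_v = v_net(next_states).squeeze(-1)
    return torch.as_tensor(rewards, dtype=torch.float32) + gamma * next_v

def critic_loss(v_net, states, target):
    # Plain regression of the value network toward either target.
    return F.mse_loss(v_net(states).squeeze(-1), target)
```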

Page 5: Actor-Critic - 國立臺灣大學

Actor-Critic

$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( G_t^n - b \right) \nabla \log p_\theta(a_t^n \mid s_t^n)$

$G_t^n$ is obtained via interaction, and its expected value is the state-action value: $E[G_t^n] = Q^{\pi_\theta}(s_t^n, a_t^n)$. Using $V^{\pi_\theta}(s_t^n)$ as the baseline $b$, the weighting term becomes $Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n)$.

Page 6: Actor-Critic - 國立臺灣大學

Advantage Actor-Critic

Estimating $Q^\pi(s_t^n, a_t^n) - V^\pi(s_t^n)$ directly would require two networks; we would rather estimate only one. Since

$Q^\pi(s_t^n, a_t^n) = E\left[ r_t^n + V^\pi(s_{t+1}^n) \right]$

we drop the expectation and use the single-sample estimate

$Q^\pi(s_t^n, a_t^n) \approx r_t^n + V^\pi(s_{t+1}^n)$

so the weighting term becomes $r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n)$. Now only the state value function needs to be estimated, at the cost of a little variance from the sampled $r_t^n$.

Page 7: Actor-Critic - 國立臺灣大學

Advantage Actor-Critic

The actor 𝜋 interacts with the environment → learn $V^\pi(s)$ by TD or MC → update the actor from 𝜋 to 𝜋′ based on $V^\pi(s)$ → set 𝜋 = 𝜋′ and repeat.

$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n) \right) \nabla \log p_\theta(a_t^n \mid s_t^n)$
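A minimal sketch of how this gradient could be turned into a loss, assuming PyTorch and hypothetical `policy` / `v_net` networks (names not from the slides); a discount factor is included for generality even though the slide writes the undiscounted form.

```python
import torch
import torch.nn.functional as F

def a2c_losses(policy, v_net, states, actions, rewards, next_states, gamma=0.99):
    """states/next_states: float tensors; actions: LongTensor of indices; rewards: float tensor."""
    values = v_net(states).squeeze(-1)                 # V(s_t)
    with torch.no_grad():
        next_values = v_net(next_states).squeeze(-1)   # V(s_{t+1})
    td_target = rewards + gamma * next_values
    advantage = td_target - values                     # r_t + V(s_{t+1}) - V(s_t)

    log_probs = F.log_softmax(policy(states), dim=-1)  # log p_theta(a | s)
    log_prob_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Minimising -advantage * log p is gradient ascent on the slide's objective;
    # the advantage is detached so the critic is trained only by its own loss.
    actor_loss = -(advantage.detach() * log_prob_a).mean()
    critic_loss = F.mse_loss(values, td_target)        # TD regression for V
    return actor_loss, critic_loss
```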

Page 8: Actor-Critic - 國立臺灣大學

Advantage Actor-Critic

• Tips

• The parameters of the actor 𝜋(s) and the critic $V^\pi(s)$ can be shared (both tips are sketched in code below)

• Use output entropy as regularization for 𝜋 𝑠

• Larger entropy is preferred → exploration

[Figure: the state s feeds a shared network; one head outputs the action distribution (left, right, fire) for the actor, and another head outputs $V^\pi(s)$ for the critic.]
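A sketch of both tips, assuming PyTorch; the class name, layer sizes, and the idea of weighting the entropy bonus are illustrative rather than the course's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedActorCritic(nn.Module):
    """Actor and critic sharing their lower layers."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # logits for left / right / fire
        self.value_head = nn.Linear(hidden, 1)           # V^pi(s)

    def forward(self, s):
        h = self.shared(s)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def entropy_bonus(logits):
    # Larger entropy -> more exploration; subtract this term (times a small
    # weight) from the actor loss so that entropy is encouraged.
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(probs * log_probs).sum(dim=-1).mean()
```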

Page 9: Actor-Critic - 國立臺灣大學

Asynchronous Advantage Actor-Critic (A3C)

The idea is from 李思叡

Page 10: Actor-Critic - 國立臺灣大學

Source of image: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9

Asynchronous

Each worker:
1. Copies the global parameters (e.g. $\theta^1$)
2. Samples some data
3. Computes gradients $\Delta\theta$
4. Updates the global model: $\theta^1 \leftarrow \theta^1 + \eta\Delta\theta$ (other workers also update the model, so the global parameters may already have moved on, e.g. to $\theta^2$)
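A toy sketch of these four steps using Python threads and a plain parameter list; a real A3C setup would run many processes sharing a neural network, and every name below is illustrative.

```python
import threading

global_theta = [0.0, 0.0]   # global model parameters
lock = threading.Lock()
eta = 0.1                   # learning rate

def fake_gradient(theta):
    # Placeholder for "sample some data, compute gradients" on the worker's copy.
    return [1.0 - t for t in theta]

def worker(n_updates=100):
    for _ in range(n_updates):
        with lock:
            theta_copy = list(global_theta)   # 1. copy global parameters
        delta = fake_gradient(theta_copy)     # 2.-3. sample data, compute gradients
        with lock:                            # 4. update the global model
            for i, d in enumerate(delta):     #    (other workers update it too,
                global_theta[i] += eta * d    #     possibly from stale copies)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_theta)
```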

Page 11: Actor-Critic - 國立臺灣大學

Pathwise Derivative Policy Gradient

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, “Deterministic Policy Gradient Algorithms”, ICML, 2014

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, "Continuous Control with Deep Reinforcement Learning", ICLR, 2016

Page 12: Actor-Critic - 國立臺灣大學

Another Way to Use the Critic

[Figure: $Q^\pi(s, a)$ shown as a function of the action, with bars over discrete actions $a_1$, $a_2$ and a curve over a continuous action with arrows indicating how to increase or decrease a.]

Original actor-critic: the critic only evaluates the action that was actually taken. Pathwise derivative policy gradient: we know the parameters of the Q function, so the Q function itself tells us that taking a′ at state s is better than a.

Page 13: Actor-Critic - 國立臺灣大學

http://www.cartomad.com/comic/109000081104011.html

[Comic analogy contrasting original actor-critic with pathwise derivative policy gradient.]

In Q-learning the action is chosen by $a = \arg\max_a Q(s, a)$, which is hard to solve when the action a is a continuous vector. In pathwise derivative policy gradient, the actor 𝜋 : s → a is trained to act as the solver of this optimization problem.

Page 14: Actor-Critic - 國立臺灣大學

Pathwise Derivative Policy Gradient

The critic takes s and a and outputs $Q^\pi(s, a)$; the actor takes s and outputs a. The updated actor should satisfy

$\pi'(s) = \arg\max_a Q^\pi(s, a)$

To update 𝜋 → 𝜋′, connect the actor to the fixed Q network so that a is the output of the actor; together they form one large network. Then run gradient ascent on the actor's parameters:

$\theta_{\pi'} \leftarrow \theta_\pi + \eta \nabla_{\theta_\pi} Q^\pi(s, a)$
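A hedged sketch of this gradient-ascent step, assuming PyTorch and hypothetical `actor` / `q_net` modules (names not from the slides): the actor's action is fed into the critic, and only the actor's parameters are updated, so the critic stays fixed.

```python
def actor_update(actor, q_net, states, actor_optimizer):
    actions = actor(states)            # a = pi(s), a continuous vector
    q_values = q_net(states, actions)  # Q^pi(s, a); the critic's weights are not stepped
    loss = -q_values.mean()            # minimising -Q == gradient ascent on Q
    actor_optimizer.zero_grad()
    loss.backward()                    # gradients flow through a into the actor
    actor_optimizer.step()             # theta_pi <- theta_pi + eta * grad_theta_pi Q
    # Note: any gradients accumulated on q_net here are simply discarded before
    # the critic's own update, so the Q network is effectively fixed in this step.
    return loss.item()
```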

Page 15: Actor-Critic - 國立臺灣大學

The actor 𝜋 interacts with the environment → learn $Q^\pi(s, a)$ by TD or MC → find a new actor 𝜋′ "better" than 𝜋, using the gradient-ascent update above, $\theta_{\pi'} \leftarrow \theta_\pi + \eta \nabla_{\theta_\pi} Q^\pi(s, a)$ → set 𝜋 = 𝜋′ and repeat. A replay buffer and exploration are used in this loop.

Page 16: Actor-Critic - 國立臺灣大學

Q-Learning Algorithm

• Initialize Q-function $Q$, target Q-function $\hat{Q} = Q$
• In each episode
  • For each time step t
    • Given state $s_t$, take action $a_t$ based on Q (exploration)
    • Obtain reward $r_t$, and reach new state $s_{t+1}$
    • Store $(s_t, a_t, r_t, s_{t+1})$ into buffer
    • Sample $(s_i, a_i, r_i, s_{i+1})$ from buffer (usually a batch)
    • Target $y = r_i + \max_a \hat{Q}(s_{i+1}, a)$
    • Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$ (regression)
    • Every C steps reset $\hat{Q} = Q$

Page 17: Actor-Critic - 國立臺灣大學

Q-Learning Algorithm → Pathwise Derivative Policy Gradient

The same algorithm with four changes (marked 1–4 on the slide):

• Initialize Q-function $Q$, target Q-function $\hat{Q} = Q$, actor $\pi$, target actor $\hat{\pi} = \pi$
• In each episode
  • For each time step t
    • Given state $s_t$, take action $a_t$ based on $\pi$ (exploration)  [change 1]
    • Obtain reward $r_t$, and reach new state $s_{t+1}$
    • Store $(s_t, a_t, r_t, s_{t+1})$ into buffer
    • Sample $(s_i, a_i, r_i, s_{i+1})$ from buffer (usually a batch)
    • Target $y = r_i + \hat{Q}(s_{i+1}, \hat{\pi}(s_{i+1}))$  [change 2]
    • Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$ (regression)
    • Update the parameters of $\pi$ to maximize $Q(s_i, \pi(s_i))$  [change 3]
    • Every C steps reset $\hat{Q} = Q$
    • Every C steps reset $\hat{\pi} = \pi$  [change 4]
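Putting the modified steps together, a sketch of one training iteration under the same assumptions as before (PyTorch; `buffer`, the network names, and the optimizers are illustrative); the target networks play the roles of $\hat{Q}$ and $\hat{\pi}$ above, and a discount factor is added even though the slide writes the undiscounted target.

```python
import copy
import torch
import torch.nn.functional as F

def make_targets(q_net, actor):
    # Initialise Q-hat = Q and pi-hat = pi as copies of the current networks.
    return copy.deepcopy(q_net), copy.deepcopy(actor)

def train_step(q_net, q_target, actor, actor_target, buffer,
               q_optimizer, actor_optimizer, gamma=0.99):
    s, a, r, s_next = buffer.sample_batch()   # sampled transitions, as tensors

    # Target uses the target actor instead of an argmax over actions.
    with torch.no_grad():
        y = r + gamma * q_target(s_next, actor_target(s_next)).squeeze(-1)

    # Regress Q(s_i, a_i) toward y.
    q_loss = F.mse_loss(q_net(s, a).squeeze(-1), y)
    q_optimizer.zero_grad()
    q_loss.backward()
    q_optimizer.step()

    # Update the actor to maximize Q(s_i, pi(s_i)); only the actor is stepped,
    # so the critic's weights are unchanged by this loss.
    actor_loss = -q_net(s, actor(s)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

def sync_targets(q_net, q_target, actor, actor_target):
    # Every C steps, reset Q-hat = Q and pi-hat = pi.
    q_target.load_state_dict(q_net.state_dict())
    actor_target.load_state_dict(actor.state_dict())
```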

Page 18: Actor-Critic - 國立臺灣大學

Connection with GAN

David Pfau, Oriol Vinyals, “Connecting Generative Adversarial Networks and Actor-Critic Methods”, arXiv preprint, 2016