# Increasing the Action Gap - New Operators for Reinforcement Learning


Increasing the Action Gap: New Operators for Reinforcement Learning

Presenter: Ryo Iwaki

2017/03/18 @ NIPS+ reading group, Kansai

3

1st Kansai NIPS reading group

Safe and Efficient Off-Policy Reinforcement Learning

Remi Munos, [email protected]
Google DeepMind

Anna Harutyunyan, [email protected]
Vrije Universiteit Brussel

Marc G. Bellemare, [email protected]
Google DeepMind

Abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) low variance; (2) safety, as it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) efficiency, as it makes the best use of samples collected from near on-policy behaviour policies. We analyse the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. To our knowledge, this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was still an open problem. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.

One fundamental trade-off in reinforcement learning lies in the definition of the update target: should one estimate Monte Carlo returns or bootstrap from an existing Q-function? Return-based methods (where return refers to the sum of discounted rewards ∑_t γ^t r_t) offer some advantages over value bootstrap methods: they are better behaved when combined with function approximation, and quickly propagate the fruits of exploration (Sutton, 1996). On the other hand, value bootstrap methods are more readily applied to off-policy data, a common use case. In this paper we show that learning from returns need not be at cross-purposes with off-policy learning.

We start from the recent work of Harutyunyan et al. (2016), who show that naive off-policy policy evaluation, without correcting for the "off-policyness" of a trajectory, still converges to the desired Q^π value function provided the behavior μ and target π policies are not too far apart (the maximum allowed distance depends on the λ parameter). Their Q^π(λ) algorithm learns from trajectories generated by μ simply by summing discounted off-policy corrected rewards at each time step. Unfortunately, the assumption that μ and π are close is restrictive, as well as difficult to uphold in the control case, where the target policy is always greedy with respect to the current Q-function. In that sense this algorithm is not safe: it does not handle the case of arbitrary "off-policyness".

Alternatively, the Tree-backup (TB(λ)) algorithm (Precup et al., 2000) tolerates arbitrary target/behavior discrepancies by scaling information (here called traces) from future temporal differences by the product of target policy probabilities. TB(λ) is not efficient in the "near on-policy" case (similar μ and π), though, as traces may be cut prematurely, blocking learning from full returns.

In this work, we express several off-policy, return-based algorithms in a common form. From this we derive an improved algorithm, Retrace(λ), which is both safe and efficient, enjoying convergence guarantees for off-policy policy evaluation and – more importantly – for the control setting.

Retrace(λ) can learn from full returns retrieved from past policy data, as in the context of experience replay (Lin, 1993), which has returned to favour with advances in deep reinforcement learning (Mnih et al., 2015).
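Concretely, Retrace(λ) scales each future temporal difference by the truncated importance weight c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)), which is at most 1 (safety) and stays near λ when μ ≈ π (efficiency). A minimal tabular sketch of one such backup; the function name and data layout are illustrative choices, not from the paper:

```python
import numpy as np

def retrace_update(q, trajectory, pi, mu, gamma=0.99, lam=1.0):
    """One Retrace(lambda) backup for the first state-action of a trajectory.

    trajectory: list of (state, action, reward, next_state)
    pi, mu: dicts mapping state -> array of action probabilities
    q: dict mapping state -> array of Q-values
    Returns the corrected return estimate for (x0, a0).
    """
    x0, a0 = trajectory[0][0], trajectory[0][1]
    g = q[x0][a0]
    discount, trace = 1.0, 1.0
    for t, (x, a, r, x_next) in enumerate(trajectory):
        if t > 0:
            # Retrace trace coefficient: c_t = lambda * min(1, pi/mu)
            trace *= lam * min(1.0, pi[x][a] / mu[x][a])
            discount *= gamma
        # TD error bootstrapping on the target policy's expected value
        td = r + gamma * np.dot(pi[x_next], q[x_next]) - q[x][a]
        g += discount * trace * td
    return g
```

On-policy (π = μ) with λ = 1, the truncation is inactive and the update reduces to a full Monte Carlo-style return; far off-policy, the product of clipped ratios cuts the trace safely.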

arXiv:1606.02647v1 [cs.LG] 8 Jun 2016

NIPS 2016

• Available on SlideShare!

4

Reinforcement Learning

Agent / Environment

State

Action

Reward function

Reward

State transition rule

Policy

5

What reinforcement learning can do

difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout—taking high-dimensional data (210×160 colour video at 60 Hz) as input—to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner—illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available12,15. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games;

[Figure 1 panels: Convolution, Convolution, Fully connected, Fully connected]

Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map w, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).

[Figure 2 panels a–d: average score per episode and average predicted action value (Q) plotted against training epochs, for Space Invaders and Seaquest]

Figure 2 | Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with ε-greedy policy (ε = 0.05) for 520k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.

Nature, Vol. 518, 26 February 2015

Game AI [Mnih+15]

IEEE ROBOTICS & AUTOMATION MAGAZINE, MARCH 2016

regression [2]. With a function approximator, the sampled data from the approximated model can be generated by inappropriate interpolation or extrapolation that improperly updates the policy parameters. In addition, if we aggressively derive the analytical gradient of the approximated model to update the policy, the approximated gradient might be far from the true gradient of the objective function due to the model approximation error. If we consider using these function approximation methods for high-dimensional systems like humanoid robots, this problem becomes more serious due to the difficulty of approximating high-dimensional dynamics models with a limited amount of data sampled from real systems. On the other hand, if the environment is extremely stochastic, a limited amount of previously acquired data might not be able to capture the real environment's property and could lead to inappropriate policy updates. However, rigid dynamics models, such as a humanoid robot model, do not usually include large stochasticity. Therefore, our approach is suitable for real robot learning with high-dimensional systems like humanoid robots.

Moreover, applying RL to actual robot control is difficult, since it usually requires many learning trials that cannot be executed in real environments, and the real system's durability is limited. Previous studies used prior knowledge or properly designed initial trajectories to apply RL to a real robot and improved the robot controller's parameters [1], [4], [10], [19], [32].

We applied our proposed learning method to our humanoid robot [7] (Figure 13) and show that it can accomplish two different movement-learning tasks without any prior knowledge for the cart-pole swing-up task or with a very simple nominal trajectory for the basketball-shooting task.

The proposed recursive use of previously sampled data to improve policies for real robots would also be useful for other policy search algorithms, such as reward-weighted regression [11] or information-theoretic approaches [12], and it might be interesting to investigate how these combinations work as a future study.

Conclusions

In this article, we proposed reusing the previous experiences of a humanoid robot to efficiently improve its task performance. We proposed recursively using the off-policy PGPE method to improve the policies and applied our approach to cart-pole swing-up and basketball-shooting tasks. In the former, we introduced a real-virtual hybrid task environment composed of a motion controller and virtually simulated cart-pole dynamics. By using the hybrid environment, we can potentially design a wide variety of different task environments. Note that complicated arm movements of the humanoid robot need to be learned for the cart-pole swing-up. Furthermore, by using our proposed method, the challenging basketball-shooting task was successfully accomplished.

Future work will develop a method based on a transfer learning [28] approach to efficiently reuse the previous experiences acquired in different target tasks.

Acknowledgment

This work was supported by MEXT KAKENHI Grant 23120004, MIC-SCOPE, "Development of BMI Technologies for Clinical Application" carried out under SRPBS by AMED, and NEDO. Part of this study was supported by JSPS KAKENHI Grant 26730141. This work was also supported by NSFC 61502339.

References

[1] A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann, "Data-efficient contextual policy search for robot movement skills," in Proc. National Conf. Artificial Intelligence, 2013.
[2] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
[3] C. G. Atkeson and S. Schaal, "Robot learning from demonstration," in Proc. 14th Int. Conf. Machine Learning, 1997, pp. 12–20.
[4] C. G. Atkeson and J. Morimoto, "Nonparametric representation of policies and value functions: A trajectory-based approach," in Proc. Neural Information Processing Systems, 2002, pp. 1643–1650.

"Efficiently reusing previous experiences is crucial to improve its behavioral policies without actually interacting with real environments."

Figure 13. The humanoid robot CB-i [7]. (Photo courtesy of ATR.)

Robot control [Sugimoto+16]

IBM Research / Center for Business Optimization

[Figure 2 diagram: a Modeling and Optimization Engine (Modeler, Optimizer, State Generator, Rule Processor, Action Handler, Event Listener, Scheduler) connected to external systems via event notifications, and to taxpayer state tables (current and history), a case inventory, allocation rules, resource constraints, business rules, and a recommended-actions table]

Figure 2: Overall collections system architecture.

see the full picture. This also means that, after a recommendation is generated for each taxpayer, conflicts may need to be consolidated with these relationships in mind. For example, it may not be desirable to send related taxpayers to different organizations. This adjustment is part of what is performed by the module named 'Rule Processor' in the figure. Additionally, there is a mechanism in place that can be used to deploy two models in an incumbent/challenger paradigm, until such time as there is sufficient statistical significance to call a winner. The loser will be replaced by the winner or a new model when available.

There are three types of users at NYS DTF. The first type is the members of the IT department within DTF, who are responsible for operating and maintaining the collections optimization system. The second type is the project team, of size ranging from 5 to 15, that represents the users of the collections optimization system, who perform validation of the rules and output allocations, and oversee the process of setting resource constraints and other inputs to the system. The third is the actual collections agents who are directly affected by the system's recommended actions. Of these, 30 to 40 call center agents receive the recommended actions via an application screen, whereas the hundreds of agents in the specialized organizations get them indirectly via the engine's allocations to the respective organizations.

The engine itself is a result of multiple years of research and development, and with all of its components, the relevant efforts sum to about 5 million dollars in research investment, although some of the components (modeling and optimization engines) are generic. Additionally, NYS DTF invested approximately 4 million dollars for this engagement. The software maintenance planning is underway by IBM Global Business Services, as part of the assetization and commercialization of the engine.

5.9 Lift Evaluation

A challenge in evaluating the performance of a data-driven business optimization methodology is that we are typically required to do so using only historical data, which was collected using a different (sampling) policy. Here we employ an evaluation method based on bias correction that allows us to do so, essentially following [1].

A useful quantity to estimate is the expected cumulative reward for a new policy π′, when sampled with respect to the old (or empirical) policy π and state distribution μ, written R_{π,μ}(π′) and defined as follows.

R_{π,μ}(π′) = E_{s∼π,μ}[ E_{a∼π′(a|s)}[ R_π(s, a) ] ]

We can estimate the above quantity using the sampling policy, with appropriate bias correction (c.f. [19]), in which the observed reward is multiplied by the ratio between the probabilities assigned to the observed action by the respective policies:

R_{π,μ}(π′) = E_{s,a∼π,μ}[ (π′(a|s) / π(a|s)) R_π(s, a) ]

Note, in the above, that π′(a|s) is known since it is the stochastic policy generated by the constrained reinforcement learning procedure, but π(a|s) needs to be estimated from the data, since we do not know the sampling policy explicitly. In our evaluation, we estimate π using Laplace estimates in each of the segments that were output by the segmented linear regression algorithm used for function approximation in the learning procedure.
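The two steps described here (a Laplace estimate of the unknown sampling policy, then an importance-weighted reward average) can be sketched as follows; this is a minimal single-segment illustration with hypothetical names, not the paper's implementation:

```python
import numpy as np

def laplace_policy(actions, n_actions):
    """Laplace (add-one) estimate of the unknown sampling policy pi(a)
    within one segment, from the observed action indices."""
    counts = np.bincount(actions, minlength=n_actions).astype(float)
    return (counts + 1.0) / (counts.sum() + n_actions)

def is_estimate(actions, rewards, pi_new, pi_hat):
    """Importance-sampling estimate of E[R] under pi_new,
    using data collected by the (estimated) policy pi_hat."""
    w = pi_new[actions] / pi_hat[actions]
    return float(np.mean(w * rewards))
```

When the target policy equals the estimated sampling policy, all weights are 1 and the estimate reduces to the plain average of observed rewards, a useful sanity check.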

Figure 3 plots the expected rewards of the collections policies output by our engine in a typical modeling run, as compared to the sampling or DTF's historical policy, as a function of the number of learning iterations. The advantage over the sampling policy is significant, and the models obtained over the learning iterations exhibit a steady improvement.

6. EVALUATION ON OTHER DATA SETS

In order to assess the robustness of our proposed methodology, we applied our proposed method and evaluated its performance on two other real-world data sets.

6.1 The Data Sets

The first data set we use is based on the well-known donation data set from KDD Cup 1998, which is available from the UCI KDD repository.1 This data set contains information from direct mail promotions soliciting donations, and

1 http://kdd.ics.uci.edu/

Optimizing debt collection [Abe+10]

The game of Go [Silver+16]

Increasing the Action Gap: New Operators for Reinforcement Learning

✘ Greedy policies obtained from the Bellman equation are sensitive to estimation errors in the action values

• Widening the value difference between the optimal and second-best actions (the action gap) → robustness to value estimation errors
• Overwhelming performance demonstrated in experiments including Atari 2600

Increasing the Action Gap:New Operators for Reinforcement Learning

Marc G. Bellemare and Georg Ostrovski and Arthur Guez
Philip S. Thomas* and Remi Munos

Google DeepMind
{bellemare,ostrovski,aguez,munos}@google.com; [email protected]

Abstract

This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.

Value-based reinforcement learning is an attractive solution to planning problems in environments with unknown, unstructured dynamics. In its canonical form, value-based reinforcement learning produces successive refinements of an initial value function through repeated application of a convergent operator. In particular, value iteration (Bellman 1957) directly computes the value function through the iterated evaluation of Bellman's equation, either exactly or from samples (e.g. Q-Learning, Watkins 1989).

In its simplest form, value iteration begins with an initial value function V_0 and successively computes V_{k+1} := T V_k, where T is the Bellman operator. When the environment dynamics are unknown, V_k is typically replaced by Q_k, the state-action value function, and T is approximated by an empirical Bellman operator. The fixed point of the Bellman operator, Q*, is the optimal state-action value function or optimal Q-function, from which an optimal policy π* can be recovered.

In this paper we argue that the optimal Q-function is inconsistent, in the sense that for any action a which is suboptimal in state x, Bellman's equation for Q*(x, a) describes the value of a nonstationary policy: upon returning to x, this policy selects π*(x) rather than a. While preserving global consistency appears impractical, we propose a simple modification to the Bellman operator which provides us with a first-order solution to the inconsistency problem. Accordingly, we call our new operator the consistent Bellman operator.

*Now at Carnegie Mellon University.
Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

We show that the consistent Bellman operator generally devalues suboptimal actions but preserves the set of optimal policies. As a result, the action gap – the value difference between optimal and second best actions – increases. Increasing the action gap is advantageous in the presence of approximation or estimation error (Farahmand 2011), and may be crucial for systems operating at a fine time scale such as video games (Togelius et al. 2009; Bellemare et al. 2013), real-time markets (Jiang and Powell 2015), and robotic platforms (Riedmiller et al. 2009; Hoburg and Tedrake 2009; Deisenroth and Rasmussen 2011; Sutton et al. 2011). In fact, the idea of devaluating suboptimal actions underpins Baird's advantage learning (Baird 1999), designed for continuous time control, and occurs naturally when considering the discretized solution of continuous time and space MDPs (e.g. Munos and Moore 1998; 2002), whose limit is the Hamilton-Jacobi-Bellman equation (Kushner and Dupuis 2001). Our empirical results on the bicycle domain (Randlov and Alstrom 1998) show a marked increase in performance from using the consistent Bellman operator.

In the second half of this paper we derive novel sufficient conditions for an operator to preserve optimality. The relative weakness of these new conditions reveals that it is possible to deviate significantly from the Bellman operator without sacrificing optimality: an optimality-preserving operator need not be contractive, nor even guarantee convergence of the Q-values for suboptimal actions. While numerous alternatives to the Bellman operator have been put forward (e.g. recently Azar et al. 2011; Bertsekas and Yu 2012), we believe our work to be the first to propose such a major departure from the canonical fixed-point condition required from an optimality-preserving operator. As proof of the richness of this new operator family we describe a few practical instantiations with unique properties.

We use our operators to obtain state-of-the-art empirical

arXiv:1512.04860v1 [cs.AI] 15 Dec 2015

AAAI 2016

7

• State: x ∈ X
• Action: a ∈ A
• State transition probability: P : X × A → Pr(X)
• Reward function: R : X × A → [−R_MAX, R_MAX]
• Deterministic policy: π : X → A, π ∈ Π
• Q-function: Q : X × A → ℝ, Q ∈ Q
• Action value function
  – the expected return obtained by taking action a in state x and following policy π thereafter:

  Q^π(x, a) ≜ E[ ∑_{t=0}^∞ γ^t R(x_t, a_t) | x_0 = x, a_0 = a ],  γ ∈ [0, 1)

• Learning objective: obtain a policy that maximizes the value

Reinforcement Learning

Agent / Environment / State / Action / Reward

8

• Bellman equation for a deterministic policy:

  Q^π(x, a) = R(x, a) + γ E_P Q^π(x′, π(x′)),  where E_P ≜ E_{x′∼P(·|x,a)}

• Bellman operator (corresponding to the Bellman optimality equation):

  T Q(x, a) ≜ R(x, a) + γ E_P max_{b∈A} Q(x′, b)

  – T is a contraction mapping [Bertsekas & Tsitsiklis 96] whose unique fixed point is the optimal action value:

  Q*(x, a) = R(x, a) + γ E_P max_{b∈A} Q*(x′, b)

  – the resulting greedy policy is a stationary optimal policy:

  π*(x) ≜ argmax_{a∈A} Q*(x, a)  ∀x ∈ X

Bellman equation, optimal value function, and optimal policy

9

• Define the value of state x2
• Then, from the Bellman equation:

Instability of the greedy policy

results on the Arcade Learning Environment (Bellemare et al. 2013). We consider the Deep Q-Network (DQN) architecture of Mnih et al. (2015), replacing only its learning rule with one of our operators. Remarkably, this one-line change produces agents that significantly outperform the original DQN. Our work, we believe, demonstrates the potential impact of rethinking the core components of value-based reinforcement learning.

Background

We consider a Markov decision process M := (X, A, P, R, γ) where X is the state space, A is the finite action space, P is the transition probability kernel, R is the reward function mapping state-action pairs to a bounded subset of ℝ, and γ ∈ [0, 1) is the discount factor. We denote by Q := Q^{X,A} and V := V^X the space of bounded real-valued functions over X × A and X, respectively. For Q ∈ Q we write V(x) := max_a Q(x, a), and follow this convention for related quantities (Ṽ for Q̃, V′ for Q′, etc.) whenever convenient and unambiguous. In the context of a specific (x, a) ∈ X × A we further write E_P := E_{x′∼P(·|x,a)} to mean the expectation with respect to P(·|x, a), with the convention that x′ always denotes the next state random variable.

A deterministic policy π : X → A induces a Q-function Q^π ∈ Q whose Bellman equation is

Q^π(x, a) := R(x, a) + γ E_P Q^π(x′, π(x′)).

The state-conditional expected return V^π(x) := Q^π(x, π(x)) is the expected discounted total reward received from starting in x and following π.

The Bellman operator T : Q → Q is defined pointwise as

T Q(x, a) := R(x, a) + γ E_P max_{b∈A} Q(x′, b). (1)

T is a contraction mapping in supremum norm (Bertsekas and Tsitsiklis 1996) whose unique fixed point is the optimal Q-function

Q*(x, a) = R(x, a) + γ E_P max_{b∈A} Q*(x′, b),

which induces the optimal policy π*:

π*(x) := argmax_{a∈A} Q*(x, a)  ∀x ∈ X.

A Q-function Q ∈ Q induces a greedy policy π(x) := argmax_a Q(x, a), with the property that Q^π = Q if and only if Q = Q*. For x ∈ X we call π(x) the greedy action with respect to Q and a ≠ π(x) a nongreedy action; for π* these are the usual optimal and suboptimal actions, respectively.

We emphasize that while we focus on the Bellman operator, our results easily extend to its variations such as SARSA (Rummery and Niranjan 1994), policy evaluation (Sutton 1988), and fitted Q-iteration (Ernst, Geurts, and Wehenkel 2005). In particular, our new operators all have a sample-based form, i.e., an analogue to the Q-Learning rule of Watkins (1989).
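Equation (1) is easy to exercise numerically. A minimal tabular sketch (random MDP, all names hypothetical) that runs value iteration and checks the γ-contraction property in supremum norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # transition kernel P(x'|x,a)
R = rng.random((n_states, n_actions))      # bounded rewards

def bellman(Q):
    """T Q(x,a) = R(x,a) + gamma * E_P max_b Q(x',b)."""
    return R + gamma * P @ Q.max(axis=1)

# Value iteration: repeated application converges to the unique fixed point Q*.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    Q = bellman(Q)
```

After enough iterations, `bellman(Q)` and `Q` agree to numerical precision, which is exactly the fixed-point condition for Q*.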

[Figure 1 diagram: state x1 with actions a1 ("cake": r = 1, transition to x2 with p = 1/2) and a2 ("no cake": p = 1, r = 0); x2 is the "Bad State" with V = −(2/γ)(1 + ε)]

Figure 1: A two-state MDP illustrating the non-stationary aspect of the Bellman operator. Here, p and r indicate transition probabilities and rewards, respectively. In state x1 the agent may either eat cake to receive a reward of 1 and transition to x2 with probability 1/2, or abstain for no reward. State x2 is a low-value absorbing state with ε > 0.

The Consistent Bellman Operator

It is well known (and implicit in our notation) that the optimal policy π* for M is stationary (i.e., time-independent) and deterministic. In looking for π*, we may therefore restrict our search to the space Π of stationary deterministic policies. Interestingly, as we now show, the Bellman operator on Q is not, in a sense, restricted to Π.

To begin, consider the two-state MDP depicted in Figure 1. This MDP abstracts a Faustian situation in which an agent repeatedly chooses between an immediately rewarding but ultimately harmful option (a1), or an unrewarding alternative (a2). For concreteness, we imagine the agent as faced with an endless supply of delicious cake (with γ > 0) and call these the "cake" and "no cake" actions.

Eating cake can cause a transition to x2, the "bad state", whose value is independent of the agent's policy:

V^π(x2) := −2(1 + ε)/γ  ∀π ∈ Π.

In state x1, however, the Q-values depend on the agent's future behaviour. For a policy π ∈ Π, the value of a1 is

Q^π(x1, a1) = 1 + γ[ ½ V^π(x1) + ½ V^π(x2) ]  (2)
            = 1 + (γ/2) V^π(x1) − (1 + ε) = (γ/2) V^π(x1) − ε.

By contrast, the value of a2 is

Q^π(x1, a2) = 0 + γ V^π(x1),

which is greater than Q^π(x1, a1) for all π. It follows that not eating cake is optimal, and thus V*(x1) = Q*(x1, a2) = 0. Furthermore, (2) tells us that the value difference between optimal and second best action, or action gap, is

Q*(x1, a2) − Q*(x1, a1) = ε.

Notice that Q*(x1, a1) = −ε does not describe the value of any stationary policy. That is, the policy π with π(x1) = a1 has value

V^π(x1) = −ε + (γ/2) V^π(x1) = −ε / (1 − γ/2),  (3)

and in particular this value is lower than Q*(x1, a1). Instead, Q*(x1, a1) describes the value of a nonstationary policy which eats cake once, but then subsequently abstains.
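Plugging illustrative numbers into the derivation above (γ = 0.9, ε = 0.1, chosen arbitrarily) makes the gap and the nonstationarity concrete:

```python
gamma, eps = 0.9, 0.1

V_x2 = -2 * (1 + eps) / gamma                  # value of the absorbing "bad state"
Q_a2 = 0.0                                     # Q*(x1, a2): abstaining is optimal, V*(x1) = 0
Q_a1 = 1 + gamma * (0.5 * 0.0 + 0.5 * V_x2)    # equation (2) evaluated at V*(x1) = 0 -> -eps
action_gap = Q_a2 - Q_a1                       # = eps

# Value of the stationary always-eat-cake policy, equation (3):
V_eat = -eps / (1 - gamma / 2)
```

`V_eat` comes out strictly below `Q_a1`, confirming that Q*(x1, a1) matches no stationary policy's value.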

Q^π(x1, a1) = (γ/2) V^π(x1) − ε

V^π(x2) ≜ −2(1 + ε)/γ,  ∀π ∈ Π

Q^π(x1, a2) = γ V^π(x1)

Q^π(x1, a1) < Q^π(x1, a2),  ∀π ∈ Π

• "No cake" is optimal: V*(x1) = Q*(x1, a2) = 0

• Therefore the action gap is Q*(x1, a2) − Q*(x1, a1) = ε

10

• The "tolerable value estimation error" = the action gap

• No problem if the MDP is solved exactly

✘ In practice, a small estimation error in the action values can cause a wrong greedy policy

Instability of the greedy policy

11

• The deterministic policy obtained via the Bellman operator is stationary
  → correct the operator so that the same action is kept when the same state repeats

✘ The indicator function makes combination with function approximation difficult

✘ P(x|x, a) is usually 0 or close to 0

Consistent Bellman operator

T_C Q(x, a) ≜ R(x, a) + γ E_P [ I[x′ ≠ x] max_{b∈A} Q(x′, b) + I[x′ = x] Q(x, a) ]
            = T Q(x, a) − γ P(x|x, a) [ max_{b∈A} Q(x, b) − Q(x, a) ]

12

• Advantage Learning operator
  – action gap = advantage [Baird 99] (depending on the definition of advantage)

  T_AL Q(x, a) ≜ T Q(x, a) − α [ max_{b∈A} Q(x, b) − Q(x, a) ]

• Persistent Advantage Learning operator
  – in tasks with high temporal resolution (cf. Atari 2600, 60 Hz), persisting with the same action is probably advantageous

  T_PAL Q(x, a) ≜ max{ T_AL Q(x, a), R(x, a) + γ E_P Q(x′, a) }

(Persistent) Advantage learning operator

13

• Suppose the operator T′ satisfies, for all Q ∈ Q, x ∈ X, and a ∈ A,

  T′Q(x, a) ≤ T Q(x, a), and
  T′Q(x, a) ≥ T Q(x, a) − α [ max_{b∈A} Q(x, b) − Q(x, a) ],  α ∈ [0, 1)

• Then T′ is optimality preserving:
  – let Q_{k+1} ≜ T′Q_k with Q_0 ∈ Q; for all x ∈ X,

    lim_{k→∞} max_{b∈A} Q_k(x, b) = V*(x),

    and for all a ∈ A,

    Q*(x, a) < V*(x)  ⟹  limsup_{k→∞} Q_k(x, a) < V*(x)

• and gap-increasing:
  – for all Q_0 ∈ Q, x ∈ X, a ∈ A,

    liminf_{k→∞} [ V_k(x) − Q_k(x, a) ] ≥ V*(x) − Q*(x, a)

Theorem: convergence of the operators

14

• If T′ is designed to shrink the non-maximal action values by no more than the action gap relative to the estimated maximal value, then:

• optimality preserving:
  – the maximal action value obtained by T′ converges to the same optimal value V* as under the original Bellman operator T, and suboptimal actions remain suboptimal.

• and gap-increasing:
  – the action gap of the action values obtained by T′ is at least as large as the action gap under the original Bellman operator T.

✓ T′ need not be a contraction mapping, and no fixed point in Q is assumed
  – the conditions are weak

Theorem, informally
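The two conditions are cheap to verify numerically for the consistent Bellman operator, whose correction term is γ P(x|x,a)[max_b Q(x,b) − Q(x,a)] with γ P(x|x,a) ≤ γ < 1. A quick check on a random MDP (hypothetical setup):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
p_self = np.stack([P[x, :, x] for x in range(S)])   # self-transition P(x|x,a)

def T(Q):
    return R + gamma * P @ Q.max(axis=1)

def T_C(Q):
    return T(Q) - gamma * p_self * (Q.max(axis=1, keepdims=True) - Q)

Q = rng.normal(size=(S, A))
gap = Q.max(axis=1, keepdims=True) - Q   # nonnegative by construction
```

Since 0 ≤ P(x|x,a) ≤ 1, we get T Q − γ·gap ≤ T_C Q ≤ T Q pointwise, i.e., the theorem applies with α = γ.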

15

• State: each dimension of φ ≜ (ω, ω̇, θ, θ̇, ψ, r) is discretized, with stochastic multilinear interpolation

• Action: choose one of 9 possibilities
  – handlebar torque τ ∈ {negative, 0, positive}
  – displacement of the center of mass ∈ {negative, 0, positive}

• Reward function
  – penalty for falling
  – reward for reaching the goal
  – small positive reward for facing the goal

• Solved here by approximate value iteration
  – evaluates the quality of the operators themselves

Experiment 1: balancing and riding a bicycle

2 Randlov’s Model

Using Randlov and Alstrom's bicycle model and environment, we trained a learner using the SARSA(λ) algorithm. We implemented Randlov's balance and go-to tasks with PyBrain, a machine learning library for Python [9]. The episodic learning framework is illustrated in Figure 1. To evaluate the execution of the task during learning, we will additionally define a rehearsal sequence. In one rehearsal sequence, the agent learns by executing 10 episodes, after which, the agent "performs" one episode greedily without learning. We use the discounted sum of rewards achieved over the performance episode as a success metric.

We used Randlov's reward function for both tasks. For the balance task, a reward of −1 was given when the absolute tilt angle exceeded 30° and 0 otherwise. For the go-to task, the following reward was given:

    if abs(ω) > π/12:
        r = −1
    else:
        r = (4 − ψ²) × 0.0004

Figure 2 shows the results from Randlov's balance task. Our learner was able to balance indefinitely after approximately 1500 trials. We observed cases in which the agent would travel in stable, circular trajectories. This was also observed by Randlov. Figure 3 shows the resulting implementation of Randlov's go-to task. The agent first learns to balance, and it first reaches the goal after approximately 4500 trials. Both of the tasks are very sensitive to learning rate annealing. Randlov does not mention this, despite the fact that it seems to be well-known in the field [10]. Without annealing, neither task will result in a converging policy because the current learning continually discards all the learning that had already been done. The go-to task required a much slower annealing rate than did the balance task because the agent first needs to gain experience balancing before attempting to navigate toward the goal.

Figure 2: (a) Learning curve for the balance task. (b) The trajectories of the bicycle's wheel contact points in each of several thousand episodes. Simulation was stopped after 1000 seconds, so trajectories may appear to terminate. The agent also found several stable circular trajectories, as the one highlighted above.

Figure 3: (a) Learning curve for the "go-to" task. (b) The trajectories of the bicycle's wheel contact points after several thousand episodes. Here the goal was placed at a position (x = 10 m, y = 50 m) relative to the bicycle starting location. After about 4000 episodes (400 rehearsals), the bicycle reaches the goal with increasing frequency.

Now, building on Randlov's implementation, we explore our topics of interest: human ability to control a virtual bicycle, and shaping of the reinforcement reward function.

Figure 1: Learning methodology, in PyBrain’s terminology.

(Slide figure: bicycle state variables ω, θ, d; from [Randlov & Alstrom 98].)

Experiment :: 1 :: Bicycle :: Results

rewards driving away from the goal. Instead we use the following related reward function

R(x, a) := { −c if fallen;  1.0 if goal reached;  (π²/4 − ψ² − 1) × 0.001 otherwise }

with c := (3π²/4 + 1) × 0.001 the largest negative reward achievable by the agent. Empirically, we found this reward function easier to work with, while our results remained qualitatively similar for similar reward functions. We further use a discount factor of γ = 0.99.
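The reward cases above can be sketched in Python (a sketch: the function and argument names are ours; ψ, the angle to the goal, is assumed to lie in [−π, π], so that −c is indeed the worst non-terminal reward):

```python
import math

def bicycle_reward(fallen, at_goal, psi):
    """Reward for the bicycle go-to task, following the cases above.

    A sketch: names are ours; psi is assumed to lie in [-pi, pi],
    so -c below is the largest negative reward achievable.
    """
    c = (3 * math.pi ** 2 / 4 + 1) * 0.001
    if fallen:
        return -c
    if at_goal:
        return 1.0
    return (math.pi ** 2 / 4 - psi ** 2 - 1) * 0.001
```

Note that at |ψ| = π the non-terminal reward bottoms out exactly at −c, which is why c is "the largest negative reward achievable".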

We consider two sample-based operators on Q_{Z,A}, the space of Q-functions over representative states. The sample-based Q-value interpolation Bellman operator is defined as

T_QVI Q(z, a) := R(z, a) + (γ/k) Σ_{i=1}^{k} max_{b∈A} Q(x′_i, b),

with k ∈ ℕ and x′_i ∼ P(· | z, a). The sample-based consistent Q-value interpolation Bellman operator T_CQVI is similarly defined by sampling x′_i from P:

T′_QVI Q(z, a) := R(z, a) + (γ/k) Σ_{i=1}^{k} max_{b∈A} [ Q(x′_i, b) − A(z | x′_i) (Q(z, b) − Q(z, a)) ],

T_CQVI Q(z, a) := min{ T_QVI Q(z, a), T′_QVI Q(z, a) }.

In both cases, we use Q-value interpolation to define a Q-function over X:

Q(x, a) := E_{z∼A(· | x)} Q(z, a).

For each operator T′, we computed a sequence of Q-functions Q_k ∈ Q_{Z,A} using an averaging form of value iteration:

Q_{k+1}(z, a) = (1 − η) Q_k(z, a) + η T′ Q_k(z, a),

applied simultaneously to all z ∈ Z and a ∈ A. We chose this averaging version because it led to faster convergence, and it lets us take k = 1 in the definition of both operators. From a parameter sweep we found η = 0.1 to be a suitable step-size.
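The averaging update can be sanity-checked on a toy MDP standing in for the bicycle grid (all names, sizes and values below are ours; for a deterministic demo we use the expected Bellman operator in place of the paper's k-sample estimate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the bicycle grid: n_z representative states,
# n_a actions, a known kernel P and reward R (all values are ours).
n_z, n_a, gamma, eta = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(n_z), size=(n_z, n_a))  # P[z, a, :] sums to 1
R = rng.standard_normal((n_z, n_a))

def bellman(Q):
    """Expected Bellman operator T Q = R + gamma * E[max_b Q(z', b)].
    (The paper's sample-based variant instead averages max_b Q over
    k sampled next states, with k = 1 in their experiments.)"""
    return R + gamma * P @ Q.max(axis=1)

# Averaging value iteration: Q_{t+1} = (1 - eta) Q_t + eta * T Q_t.
Q = np.zeros((n_z, n_a))
for _ in range(3000):
    Q = (1 - eta) * Q + eta * bellman(Q)
```

Because the fixed points of T and of the averaged map coincide, the iterates converge to the usual Bellman fixed point, just with a contraction modulus of (1 − η) + ηγ per step.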

Our multilinear grid was defined over the six state variables. As done elsewhere in the literature, we defined our grid over the following bounded variables:

ω ∈ [−4π/9, 4π/9],  ω̇ ∈ [−2, 2],  θ ∈ [−π/15, π/15],  θ̇ ∈ [−0.5, 0.5],  ψ ∈ [−π, π],  d ∈ [10, 1200].

Values outside of these ranges were accordingly set to the range's minimum or maximum.
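The clamping step can be written directly with NumPy (a sketch; the variable ordering (ω, ω̇, θ, θ̇, ψ, d) and the names are assumptions of ours):

```python
import numpy as np

# Bounds from the grid definition above, in the assumed ordering
# (omega, omega_dot, theta, theta_dot, psi, d).
LOW = np.array([-4 * np.pi / 9, -2.0, -np.pi / 15, -0.5, -np.pi, 10.0])
HIGH = np.array([4 * np.pi / 9, 2.0, np.pi / 15, 0.5, np.pi, 1200.0])

def clip_state(x):
    """Set values outside the ranges to the range's minimum or maximum."""
    return np.clip(x, LOW, HIGH)
```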

Figure 6: Top. Falling and goal-reaching frequency for greedy policies derived from value iteration on an 8 × ··· × 8 grid. Bottom. The same, for a 10 × ··· × 10 grid. (Curves: Bellman operator, consistent operator, A.L., persistent A.L.)

For completeness, Figure 6 compares the performance of the Bellman and consistent Bellman operators, as well as advantage learning and persistent advantage learning (with α = 0.1), on 8 × ··· × 8 and 10 × ··· × 10 grids. Here, the usual Bellman operator is unable to find a solution to the goal, while the consistent Bellman operator successfully does so. The two other operators also achieve superior performance compared to the Bellman operator, although they appear slightly more unstable in the smaller grid setting.

Experimental Details: ALE

We omit details of the DQN architecture, which are provided in Mnih et al. (2015). A frame is a single emulation step within the ALE, while a time step consists of four consecutive frames which are treated atomically by the agent.

Our first Atari 2600 experiment (Stochastic Minimal setting) used stochastic controls, which operate as follows: at each frame (not time step), the environment accepts the agent's action with probability 1 − p, or rejects it with probability p. If an action is rejected, the previous frame's action is repeated. In our setting, the agent selects a new action every four frames; in this situation, the stochastic controls approximate a form of reaction delay. This particular setting is part of the latest Arcade Learning Environment. For our experiments we use the ALE 0.5 standard value of p = 0.25, and trained agents for 100 million frames.
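The stochastic-controls rule can be sketched as a small wrapper (the function and the `env_step` callback are our own illustrative names, not the ALE API):

```python
import random

def stochastic_controls(env_step, actions, p=0.25, seed=0):
    """Sketch of the Stochastic Minimal controls: at every frame the
    chosen action is rejected with probability p, in which case the
    previous frame's action is repeated. Names here are ours, not the
    ALE API; 0 stands in for the initial no-op action.
    """
    rng = random.Random(seed)
    prev = 0
    for a in actions:
        if rng.random() < p:
            a = prev  # rejected: repeat the previous frame's action
        prev = a
        env_step(a)
```

With p = 0 the wrapper is transparent; with p = 1 the initial no-op persists forever, which makes the reaction-delay interpretation easy to see.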

Our second Atari 2600 experiment (Original DQN setting) was averaged over three different trials, ran for 200 million frames (instead of 100 million), defined a lost life as a termination signal, and did not use stochastic controls. This matches the experimental setting of Mnih et al. (2015). A full table of our results is provided in Table 1.

Our last experiment took place in the Original DQN setting. We generated a trajectory from a trained DQN agent playing an ε-greedy policy with ε = 0.05. The full trajectory (up to the end of the episode) was recorded in this way.


• The consistent operator yields goal-directed policies from the earliest iterations

Experiment :: 1 :: Bicycle :: Results

Figure 3: Top. Falling and goal-reaching frequency for greedy policies derived from value iteration. Bottom. Sample bicycle trajectories after 100, 200, ..., 1000 iterations. In this coarse-resolution regime, the Bellman operator initially yields policies which circle the goal forever, while the consistent operator quickly yields successful trajectories.

A Family of Convergent Operators

One may ask whether it is possible to extend the consistent Bellman operator to Q-value approximation schemes which lack a probabilistic interpretation, such as linear approximation (Sutton 1996), locally weighted regression (Atkeson 1991), neural networks (Tesauro 1995), or even information-theoretic methods (Veness et al. 2015). In this section we answer in the affirmative.

The family of operators which we describe here are applicable to arbitrary Q-value approximation schemes. While these operators are in general no longer contractions, they are gap-increasing, and optimality-preserving when the Q-function is represented exactly. Theorem 1 is our main result; one corollary is a convergence proof for Baird's advantage learning (Baird 1999). Incidentally, our taking the minimum in (10) was in fact no accident, but rather a simple application of this theorem.

Theorem 1. Let T be the Bellman operator defined by (1). Let T′ be an operator with the property that there exists an α ∈ [0, 1) such that for all Q ∈ Q, x ∈ X, a ∈ A, and letting V(x) := max_b Q(x, b),

1. T′Q(x, a) ≤ T Q(x, a), and
2. T′Q(x, a) ≥ T Q(x, a) − α [V(x) − Q(x, a)].

Then T′ is both optimality-preserving and gap-increasing.

Thus any operator which satisfies the conditions of Theorem 1 will eventually yield an optimal greedy policy, assuming an exact representation of the Q-function. Condition 2, in particular, states that we may subtract up to (but not including) max_b Q_k(x, b) − Q_k(x, a) from Q_k(x, a) at each iteration. This is exactly the action gap at (x, a), but for Q_k, rather than the optimal Q*. For a particular x, this implies we may initially devalue the optimal action a* := π*(x) in favour of the greedy action. But our theorem shows that a* cannot be undervalued infinitely often, and in fact Q_k(x, a*) must ultimately reach V*(x).³ The proof of this perhaps surprising result may be found in the appendix.

To the best of our knowledge, Theorem 1 is the first result to show the convergence of iterates of dynamic programming-like operators without resorting to a contraction argument. Indeed, the conditions of Theorem 1 are particularly weak: we do not require T′ to be a contraction, nor do we assume the existence of a fixed point (in the Q-function space Q) of T′. In fact, the conditions laid out in Theorem 1 characterize the set of optimality-preserving operators on Q, in the following sense:

Remark 1. There exists a single-state MDP M and an operator T′ with either

1. T′Q(x, a) > T Q(x, a), or

2. T′Q(x, a) < T Q(x, a) − [V(x) − Q(x, a)],

and in both cases there exists a Q₀ ∈ Q for which lim_{k→∞} max_a (T′)^k Q₀(x, a) ≠ V*(x).

We note that the above remark does not cover the case where condition (2) is an equality (i.e., α = 1). We leave as an open problem the existence of a divergent example for α = 1.

Corollary 1. The consistent Bellman operator T_C (8) and consistent Q-value interpolation Bellman operator T_CQVI (10) are optimality-preserving.

In fact, it is not hard to show that the consistent Bellman operator (7) is a contraction, and thus enjoys even stronger convergence guarantees than those provided by Theorem 1. Informally, whenever Condition 2 of the theorem is strengthened to a strict inequality, we may also expect our operators to be gap-increasing; this is in fact the case for both of our consistent operators.

To conclude this section, we describe a few operators which satisfy the conditions of Theorem 1, and are thus optimality-preserving and gap-increasing. Critically, none of these operators are contractions; one of them, the "lazy" operator, also possesses multiple fixed points.

Baird’s Advantage LearningThe method of advantage learning was proposed by Baird(1999) as a means of increasing the gap between the optimaland suboptimal actions in the context of residual algorithmsapplied to continuous time problems.4 The correspondingoperator is

T 0Q(x, a) = K

�1⇥

R(x, a) +

�

�t EP

V (x

0) + (K � 1)V (x)

⇤

,

³ When two or more actions are optimal, we are only guaranteed that one of them will ultimately be correctly valued. The "1-lazy" operator described below exemplifies this possibility.

⁴ Advantage updating, also by Baird, is a popular but different idea where an agent maintains both V and A := Q − V.
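Conditions 1 and 2 of Theorem 1 are easy to check numerically for advantage learning in its α form, T′Q(x, a) = T Q(x, a) − α[V(x) − Q(x, a)] (the form the slides use later as δ_AL). The sketch below builds a random finite MDP; all names and sizes are our own choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random finite MDP; sizes and names are ours, for illustration only.
n_x, n_a, gamma, alpha = 4, 3, 0.9, 0.5
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_a))  # transition kernel
R = rng.standard_normal((n_x, n_a))               # reward function

def T(Q):
    """Bellman operator: T Q = R + gamma * E[max_b Q(x', b)]."""
    return R + gamma * P @ Q.max(axis=1)

def T_al(Q):
    """Advantage learning in its alpha form:
    T' Q(x, a) = T Q(x, a) - alpha * [V(x) - Q(x, a)]."""
    gap = Q.max(axis=1, keepdims=True) - Q  # V(x) - Q(x, a) >= 0
    return T(Q) - alpha * gap

# Theorem 1's two conditions, checked numerically on a random Q:
Q = rng.standard_normal((n_x, n_a))
gap = Q.max(axis=1, keepdims=True) - Q
assert np.all(T_al(Q) <= T(Q) + 1e-12)        # condition 1
assert np.all(T_al(Q) >= T(Q) - gap - 1e-12)  # condition 2, since alpha < 1
```

Both conditions hold pointwise for any Q because the subtracted term α·gap is non-negative and strictly smaller than the full action gap when α < 1.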


Trajectories in the state space (10-grid)

• Deep Q-Network – Q-learning + CNN – observes the game screen – approximates Q for each action

• Fitted Q

Experiment :: 2 :: Atari 2600

difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available (refs 12, 15). In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games;

Figure 1: Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).

Figure 2: Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with an ε-greedy policy (ε = 0.05) for 520 k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.

Nature, vol. 518, 26 February 2015

[Mnih+15]

θ ← θ + α_t δ_Q(x, a; θ) ∂Q(x, a; θ)/∂θ

δ_Q(x, a) = R(x, a) + γ max_{b∈A} Q(x′, b; θ⁻) − Q(x, a; θ)

δ_AL(x, a) = δ_Q(x, a) − α [ max_{b∈A} Q(x, b; θ⁻) − Q(x, a; θ⁻) ]

δ_PAL(x, a) = max{ δ_AL(x, a), δ_Q(x, a) − α [ max_{b∈A} Q(x′, b; θ⁻) − Q(x′, a; θ⁻) ] }
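The three update targets can be sketched for a single transition, with plain vectors standing in for the network outputs (a sketch; the function and argument names are ours):

```python
import numpy as np

def deltas(r, gamma, alpha, a, q_x, qt_x, qt_x_next):
    """TD errors for Q-learning, advantage learning (AL) and persistent
    advantage learning (PAL), following the three update rules above.

    q_x holds online-network values Q(x, .; theta); qt_x and qt_x_next
    hold target-network values Q(., .; theta^-). In DQN these vectors
    would come from the networks' forward passes; names are ours.
    """
    d_q = r + gamma * qt_x_next.max() - q_x[a]
    d_al = d_q - alpha * (qt_x.max() - qt_x[a])
    d_pal = max(d_al, d_q - alpha * (qt_x_next.max() - qt_x_next[a]))
    return d_q, d_al, d_pal
```

With α = 0 all three coincide with the DQN error, which matches the slide's note that α = 0 recovers DQN.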


• 60 games • Mean score relative to DQN

– AL: 27% – PAL: 35%

✘ Q-learning overestimates values

✓ AL and PAL estimate values lower

Experiment :: 2 :: Atari 2600 :: Results

Figure 4: Learning curves for two Atari 2600 games in the Original DQN setting. (Panels: ASTERIX and SPACE INVADERS; curves: DQN, advantage learning, persistent A.L.; axes: average score vs. training frames in millions.)

Figure 5: Action gaps (left) and value functions (right) for a single episode of SPACE INVADERS (Original DQN setting). Our operators yield markedly increased action gaps and lower values. (Curves: DQN, advantage learning, persistent A.L.; axes: action gap / estimated state value vs. episode steps.)

stems from the instability of the Q-functions learned from Bellman updates, and provided conclusive empirical evidence to this effect. In the spirit of their work, we compared our learned Q-functions on a single trajectory generated by a trained DQN agent playing SPACE INVADERS in the Original DQN setting. For each Q-function and each state x along the trajectory, we computed V(x) as well as the action gap at x.

The value functions and action gaps resulting from this experiment⁵ are depicted in Figure 5. As expected, the action gaps are significantly greater for both of our operators, in comparison to the action gaps produced by DQN. Furthermore, the value estimates are themselves lower, and correspond to more realistic estimates of the true value function. In their experiments, van Hasselt et al. observed a similar effect on the value estimates when replacing the Bellman updates with Double Q-Learning updates, one of many solutions recently proposed to mitigate the negative impact of statistical bias in value function estimation (van Hasselt 2010; Azar et al. 2011; Lee, Defourny, and Powell 2013). This bias is positive and is a consequence of the max term in the Bellman operator. We hypothesize that the lower value estimates observed in Figure 5 are also a consequence of bias reduction. Specifically, increased action gaps are consistent with a bias reduction: it is easily shown that the value estimation bias is strongest when Q-values are close to each other. If our hypothesis holds true, the third benefit of increasing the action gap is thus to mitigate the statistical bias of Q-value estimates.

⁵ Videos: https://youtu.be/wDfUnMY3vF8

Open Questions

Weaker Conditions for Optimality. At the core of our results lies the redefinition of Q-values in order to facilitate approximate value estimation. Theorem 1 and our empirical results indicate that there are many practical operators which do not preserve suboptimal Q-values. Naturally, preserving the optimal value function V* is itself unnecessary, as long as the iterates converge to a Q-function Q̃ for which argmax_a Q̃(x, a) = π*(x). It may well be that even weaker conditions for optimality exist than those required by Theorem 1. At present, however, our proof technique does not appear to extend to this case.

Statistical Efficiency of New Operators. Advantage learning (as given by our redefinition) may be viewed as a generalization of the consistent Bellman operator when P(· | x, a) is unknown or irrelevant. In this light, we ask: is there a probabilistic interpretation to advantage learning? We further wonder about the statistical efficiency of the consistent Bellman operator: is it ever less efficient than the usual Bellman operator, when considering the probability of misclassifying the optimal action? Both of these answers might shed some light on the differences in performance observed in our experiments.

Maximally Efficient Operator. Having revealed the existence of a broad family of optimality-preserving operators, we may now wonder which of these operators, if any, should be preferred to the Bellman operator. Clearly, there are trivial MDPs on which any optimality-preserving operator performs equally well. However, we may ask whether there is, for a given MDP, a "maximally efficient" optimality-preserving operator; and whether a learning agent can benefit from simultaneously searching for this operator while estimating a value function.

Concluding Remarks

We presented in this paper a family of optimality-preserving operators, of which the consistent Bellman operator is a distinguished member. At the center of our pursuits lay the desire to increase the action gap; we showed through experiments that this gap plays a central role in the performance of greedy policies over approximate value functions, and how significantly increased performance could be obtained by a simple modification of the Bellman operator. We believe our work highlights the inadequacy of the classical Q-function at producing reliable policies in practice, calls into question the traditional policy-value relationship in value-based reinforcement learning, and illustrates how revisiting the concept of value itself can be fruitful.

Acknowledgments

The authors thank Michael Bowling, Csaba Szepesvari, Craig Boutilier, Dale Schuurmans, Marty Zinkevich, Lihong Li, Thomas Degris, and Joseph Modayil for useful discussions, as well as the anonymous reviewers for their excellent feedback.


Results on SPACE INVADERS

[vanHasselt+10]

Experiment :: 2 :: Performance as a function of α, relative to DQN

Figure 7: Performance of trained agents as a function of the α parameter (panels: ASTERIX, BEAM RIDER, PONG, SEAQUEST, SPACE INVADERS; curves: DQN, A.L., persistent A.L.). Note that α = 1.0 does not satisfy our theorem's conditions. We attribute the odd performance of Seaquest agents using Persistent Advantage Learning with α = 0.9 to a statistical issue.

α ∈ {0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}; when α = 0 the operators reduce to DQN.

• In [Mnih+15] an action is selected exactly once every 4 frames → is the agent strong only because there are no control errors??

• A selected action is accepted with probability 1 − p and rejected with probability p – p = 0.25

• Scores over 60 games × 10 trials – DQN < AL: 37 – DQN < PAL: 35 – AL < PAL: 16 – AL > PAL: 6

Experiment :: 3 :: Atari 2600 :: Stochastic Minimal

✘ Greedy policies derived from the Bellman equation are sensitive to estimation errors in the action values

→ widen the gap between the values of the optimal and suboptimal actions (the action gap)

✓ Robust to errors in value estimation

✓ The required change to the learning rule is very simple

✓ The value of the greedy action is guaranteed to converge to the optimal value

• Experiments including Atari 2600 demonstrated overwhelming performance

✻ Shouldn't the notion of "value" itself be revisited?

Summary

References (cited in these slides)

• [Abe+10] Optimizing debt collections using constrained reinforcement learning.
• [Baird99] Reinforcement learning through gradient descent.
• [Bertsekas & Tsitsiklis 96] Neuro-Dynamic Programming.
• [Mnih+15] Human-level control through deep reinforcement learning.
• [Randlov & Alstrom 98] Learning to drive a bicycle using reinforcement learning and shaping.
• [Silver+16] Mastering the game of Go with deep neural networks and tree search.
• [Sugimoto+16] Trial and error: using previous experiences as simulation models in humanoid motor learning.