
Neurocomputing 71 (2008) 1180–1190

www.elsevier.com/locate/neucom

Natural Actor-Critic

Jan Peters a,b,*, Stefan Schaal b,c

a Max-Planck-Institute for Biological Cybernetics, Tuebingen, Germany
b University of Southern California, Los Angeles, CA 90089, USA
c ATR Computational Neuroscience Laboratories, Kyoto 619-0288, Japan

* Corresponding author at: Max-Planck-Institute for Biological Cybernetics, Department of Empirical Inference, Spemannstr. 38, 72076 Tuebingen, Germany. E-mail address: [email protected] (J. Peters).

Available online 1 February 2008

Abstract

In this paper, we suggest a novel reinforcement learning architecture, the Natural Actor-Critic. The actor updates are achieved using stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.

© 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2007.11.026

Keywords: Policy-gradient methods; Compatible function approximation; Natural gradients; Actor-Critic methods; Reinforcement learning; Robot learning

1. Introduction

Reinforcement learning algorithms based on value function approximation have been highly successful with discrete lookup table parameterizations. However, when applied with continuous function approximation, many of these algorithms failed to generalize, and few convergence guarantees could be obtained [24]. The reason for this problem can largely be traced back to the greedy or ε-greedy policy updates of most techniques, as these do not ensure a policy improvement when applied with an approximate value function [8]. During a greedy update, small errors in the value function can cause large changes in the policy, which in return can cause large changes in the value function. This process, when applied repeatedly, can result in oscillations or divergence of the algorithms.


Even in simple toy systems, such unfortunate behavior can be found in many well-known greedy reinforcement learning algorithms [6,8].

As an alternative to greedy reinforcement learning, policy-gradient methods have been suggested. Policy gradients have rather strong convergence guarantees, even when used in conjunction with approximate value functions, and recent results created a theoretically solid framework for policy-gradient estimation from sampled data [25,15]. However, even when applied to simple examples with rather few states, policy-gradient methods often turn out to be quite inefficient [14], partially caused by the large plateaus in the expected return landscape where the gradients are small and often do not point directly towards the optimal solution. A simple example that demonstrates this behavior is given in Fig. 1.

As in supervised learning, the steepest ascent with respect to the Fisher information metric [3], called the 'natural' policy gradient, turns out to be significantly more efficient than normal gradients. Such an approach was first suggested for reinforcement learning as the 'average natural policy gradient' in [14], and subsequently shown in preliminary work to be the true natural policy gradient [21,4].


Fig. 1. When plotting the expected return landscape for a simple problem such as 1d linear-quadratic regulation, the differences between (a) 'vanilla' and (b) natural policy gradients become apparent [21].


In this paper, we take this line of reasoning one step further in Section 2.2 by introducing the 'Natural Actor-Critic' (NAC), which inherits the convergence guarantees from gradient methods. Furthermore, in Section 3, we show that several successful previous reinforcement learning methods can be seen as special cases of this more general architecture. The paper concludes with empirical evaluations that demonstrate the effectiveness of the suggested methods in Section 4.

2. Natural Actor-Critic

2.1. Markov decision process notation and assumptions

For this paper, we assume that the underlying control problem is a Markov decision process (MDP) in discrete time with a continuous state set X = R^n and a continuous action set U = R^m [8]. The assumption of an MDP comes with the limitation that very good state information and a Markovian environment are assumed. However, similar as in [1], the results presented in this paper might extend to problems with partial state information.

The system is at an initial state x_0 ∈ X at time t = 0, drawn from the start-state distribution p(x_0). At any state x_t ∈ X at time t, the actor chooses an action u_t ∈ U by drawing it from a stochastic, parameterized policy π(u_t|x_t) = π(u_t|x_t; θ) with parameters θ ∈ R^N, and the system transfers to a new state x_{t+1} drawn from the state transfer distribution p(x_{t+1}|x_t, u_t). The system yields a scalar reward r_t = r(x_t, u_t) ∈ R after each action. We assume that the policy π_θ is continuously differentiable with respect to its parameters θ, and that for each considered policy π_θ, a state-value function V^π(x) and a state-action value function Q^π(x,u) exist and are given by

$$V^\pi(x) = E_\tau\Big\{ \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x \Big\},\qquad
Q^\pi(x,u) = E_\tau\Big\{ \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x,\, u_0 = u \Big\},$$

where γ ∈ (0,1) denotes the discount factor and τ a trajectory. It is assumed that some basis functions φ(x) are given so that the state-value function can be approximated with linear function approximation V^π(x) = φ(x)^T v. The general goal is to optimize the normalized expected return

$$J(\theta) = E_\tau\Big\{ (1-\gamma) \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, \theta \Big\}
= \int_X d^\pi(x) \int_U \pi(u|x)\, r(x,u)\, du\, dx,$$

where

$$d^\pi(x) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t p(x_t = x)$$

is the discounted state distribution.
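As a concrete illustration of these definitions (our own sketch, not from the paper; the 1D linear system and linear-Gaussian policy are assumptions made purely for illustration), the normalized expected return J(θ) can be estimated by averaging the discounted return (1 − γ) Σ_t γ^t r_t over sampled trajectories:

```python
import numpy as np

def sample_return(theta, gamma=0.95, horizon=200, rng=None):
    """Roll out a linear-Gaussian policy u ~ N(theta*x, 0.1^2) on a toy
    1D system x' = x + u + noise with quadratic reward, and return the
    normalized discounted return (1 - gamma) * sum_t gamma^t r_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, 1.0)             # start-state distribution p(x0)
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        u = theta * x + 0.1 * rng.normal()      # stochastic policy
        r = -(x ** 2 + 0.1 * u ** 2)            # reward r(x, u)
        ret += discount * r
        discount *= gamma
        x = x + u + 0.01 * rng.normal()         # state transition
    return (1.0 - gamma) * ret

rng = np.random.default_rng(0)
J_hat = np.mean([sample_return(-0.5, rng=rng) for _ in range(500)])
print("Monte-Carlo estimate of J(theta):", J_hat)
```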

2.2. Actor improvement with natural policy gradients

Actor-Critic and many other policy iteration architectures consist of two steps, a policy evaluation step and a policy improvement step. The main requirement for the policy evaluation step is that it makes efficient use of experienced data. The policy improvement step is required to improve the policy on every step until convergence while being efficient.

The requirements on the policy improvement step rule out greedy methods as, at the current state of knowledge, a policy improvement for approximated value functions cannot be guaranteed, even on average. 'Vanilla' policy gradient improvements (see e.g. [25,15]), which follow the gradient ∇_θ J(θ) of the expected return function J(θ) (where ∇_θ f = [∂f/∂θ_1, ..., ∂f/∂θ_N]^T denotes the derivative of a function f with respect to the parameter vector θ), often get stuck in plateaus as demonstrated in [14]. Natural gradients ∇̃_θ J(θ) avoid this pitfall, as demonstrated for supervised learning problems [3] and suggested for reinforcement learning in [14]. These methods do not follow the steepest direction in parameter space but the steepest direction with respect to the Fisher metric, given by

$$\tilde{\nabla}_\theta J(\theta) = G^{-1}(\theta)\, \nabla_\theta J(\theta), \qquad (1)$$

where G(θ) denotes the Fisher information matrix. It is guaranteed that the angle between the natural and the ordinary gradient is never larger than 90°, i.e., convergence to the next local optimum can be assured. The 'vanilla' gradient is given by the policy-gradient theorem (see e.g. [25,15])

$$\nabla_\theta J(\theta) = \int_X d^\pi(x) \int_U \nabla_\theta \pi(u|x)\, \big( Q^\pi(x,u) - b^\pi(x) \big)\, du\, dx, \qquad (2)$$

where b^π(x) denotes a baseline. Refs. [25,15] demonstrated that in Eq. (2) the term Q^π(x,u) − b^π(x) can be replaced by a compatible function approximation

$$f_w^\pi(x,u) = \big( \nabla_\theta \log \pi(u|x) \big)^T w \approx Q^\pi(x,u) - b^\pi(x), \qquad (3)$$

parameterized by the vector w, without affecting the unbiasedness of the gradient estimate and irrespective of the choice of the baseline b^π(x).


However, as mentioned in [25], the baseline may still be useful in order to reduce the variance of the gradient estimate when Eq. (2) is approximated from samples. Based on Eqs. (2) and (3), we derive an estimate of the policy gradient as

$$\nabla_\theta J(\theta) = \int_X d^\pi(x) \int_U \pi(u|x)\, \nabla_\theta \log \pi(u|x)\, \nabla_\theta \log \pi(u|x)^T du\, dx\; w = F_\theta w, \qquad (4)$$

as ∇_θ π(u|x) = π(u|x) ∇_θ log π(u|x). Since π(u|x) is chosen by the user, even in sampled data, the integral

$$F(\theta, x) = \int_U \pi(u|x)\, \nabla_\theta \log \pi(u|x)\, \nabla_\theta \log \pi(u|x)^T du \qquad (5)$$

can be evaluated analytically or empirically without actually executing all actions. It is also noteworthy that the baseline does not appear in Eq. (4) as it integrates out, thus eliminating the need to find an optimal selection of this open parameter. Nevertheless, the estimation of F_θ = ∫_X d^π(x) F(θ,x) dx is still expensive since d^π(x) is not known. However, Eq. (4) has more surprising implications for policy gradients when examining the meaning of the matrix F_θ in Eq. (4). Kakade [14] argued that F(θ,x) is the point Fisher information matrix for state x, and that F(θ) = ∫_X d^π(x) F(θ,x) dx therefore denotes a weighted 'average Fisher information matrix' [14]. However, going one step further, we demonstrate in Appendix A that F_θ is indeed the true Fisher information matrix and does not have to be interpreted as the 'average' of the point Fisher information matrices. Eqs. (4) and (1) combined imply that the natural gradient can be computed as

$$\tilde{\nabla}_\theta J(\theta) = G^{-1}(\theta)\, F_\theta\, w = w, \qquad (6)$$

since F_θ = G(θ) (cf. Appendix A). Therefore, we only need to estimate w and not G(θ). The resulting policy improvement step is thus θ_{i+1} = θ_i + α w, where α denotes a learning rate. Several properties of the natural policy gradient are worthwhile highlighting:

• Convergence to a local minimum is guaranteed, as for 'vanilla' gradients [3].
• By choosing a more direct path to the optimal solution in parameter space, the natural gradient has, from empirical observations, faster convergence and avoids premature convergence of 'vanilla' gradients (cf. Fig. 1).
• The natural policy gradient can be shown to be covariant, i.e., independent of the coordinate frame chosen for expressing the policy parameters (cf. Section 3.1).
• As the natural gradient analytically averages out the influence of the stochastic policy (including the baseline of the function approximator), it requires fewer data points for a good gradient estimate than 'vanilla' gradients.

Fig. 2. The state-action value function of any stable linear-quadratic Gaussian regulation problem can be shown to be a bowl (a). The advantage function is always a saddle, as shown in (b); it is straightforward to show that the compatible function approximation can exactly represent the advantage function, but projecting the value function onto the advantage function is non-trivial for continuous problems. This figure shows the value function and advantage function of the system described in the caption of Fig. 1.
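To make Eqs. (5) and (6) concrete, the following sketch (our own illustration, not from the paper) takes a linear-Gaussian policy π(u|x) = N(u | θ^T x, σ²) with fixed σ, for which ∇_θ log π(u|x) = (u − θ^T x) x / σ², so that F(θ,x) = x x^T / σ² can be written down without executing any actions. A Monte-Carlo estimate over sampled actions converges to the same matrix, and given a 'vanilla' gradient of the form g = F_θ w, solving F_θ Δθ = g recovers the natural gradient w, as stated by Eq. (6).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
theta = np.array([0.3, -0.2])

# A handful of states standing in for samples from d^pi(x)
states = rng.normal(size=(50, 2))

# Analytic all-action matrix: F_theta = mean_x  x x^T / sigma^2   (Eqs. (4), (5))
F_analytic = (states.T @ states) / (len(states) * sigma**2)

# Empirical estimate: average outer products of grad_theta log pi over sampled actions
grads = []
for x in states:
    u = rng.normal(theta @ x, sigma, size=2000)
    grads.append(((u - theta @ x) / sigma**2)[:, None] * x[None, :])
grads = np.vstack(grads)
F_mc = grads.T @ grads / len(grads)

print(np.round(F_analytic, 3))
print(np.round(F_mc, 3))               # close to the analytic matrix

# Eq. (6): if the 'vanilla' gradient is g = F_theta w, the natural gradient is w itself.
w = np.array([0.1, -0.4])
g = F_analytic @ w
print(np.linalg.solve(F_analytic, g))  # recovers w
```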

2.3. Critic estimation with compatible policy evaluation

The critic evaluates the current policy π in order to provide the basis for an actor improvement, i.e., the change Δθ of the policy parameters. As we are interested in natural policy gradient updates Δθ = αw, we wish to employ the compatible function approximation f_w^π(x,u) from Eq. (3) in this context. At this point, a most important observation is that the compatible function approximation f_w^π(x,u) is mean-zero w.r.t. the action distribution, i.e.,

$$\int_U \pi(u|x)\, f_w^\pi(x,u)\, du = w^T \int_U \nabla_\theta \pi(u|x)\, du = 0, \qquad (7)$$

since from ∫_U π(u|x) du = 1, differentiation w.r.t. θ results in ∫_U ∇_θ π(u|x) du = 0. Thus, f_w^π(x,u) in general represents an advantage function A^π(x,u) = Q^π(x,u) − V^π(x).
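Eq. (7) can be checked numerically in a few lines (our own sketch; the scalar Gaussian policy and all numbers are illustrative assumptions): for an arbitrary weight vector w, the compatible approximation f_w(x,u) = ∇_θ log π(u|x)^T w averages to zero over actions drawn from the policy.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
theta = np.array([0.3, -0.2])   # mean parameters of pi(u|x) = N(theta^T x, sigma^2)
w = np.array([1.7, -0.9])       # arbitrary compatible-function weights
x = np.array([1.0, 2.0])

u = rng.normal(theta @ x, sigma, size=1_000_000)
grad_log_pi = ((u - theta @ x) / sigma**2)[:, None] * x[None, :]
f_w = grad_log_pi @ w

print(f_w.mean())   # ~0: the mean-zero property of Eq. (7)
```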

The essential difference between the advantage function and the state-action value function is demonstrated in Fig. 2. The advantage function cannot be learned with TD-like bootstrapping without knowledge of the value function, as the essence of TD is to compare the values V^π(x) of two adjacent states; but this value has been subtracted out in A^π(x,u). Hence, TD-like bootstrapping using exclusively the compatible function approximator is impossible.

As an alternative, [25,15] suggested to approximate f_w^π(x,u) from unbiased estimates Q̂^π(x,u) of the action value function, e.g., obtained from rollouts, using least-squares minimization between f_w and Q̂^π. While possible in theory, one needs to realize that this approach implies a function approximation problem where the parameterization of the function approximator only spans a much smaller subspace than that of the training data; e.g., imagine approximating a quadratic function with a line. In practice, the result of such an approximation depends crucially on the training data distribution and thus has unacceptably high variance; e.g., fit a line to data from only the right branch of a parabola, the left branch, or data from both branches.

Furthermore, in continuous state-spaces a state (except for single start-states) will hardly occur twice; therefore, we can only obtain unbiased estimates Q̂^π(x,u) of Q^π(x,u). This means that the state-action value estimates Q̂^π(x,u) have to be projected onto the advantage function A^π(x,u).


Table 1
Natural Actor-Critic algorithm with LSTD-Q(λ)

Input: Parameterized policy π(u|x) = π(u|x; θ) with initial parameters θ = θ_0, its derivative ∇_θ log π(u|x), and basis functions φ(x) for the value function V^π(x).

1: Draw initial state x_0 ~ p(x_0), and select parameters A_{t+1} = 0, b_{t+1} = z_{t+1} = 0.
2: For t = 0, 1, 2, ... do
3:   Execute: Draw action u_t ~ π(u_t|x_t), observe next state x_{t+1} ~ p(x_{t+1}|x_t, u_t), and reward r_t = r(x_t, u_t).
4:   Critic Evaluation (LSTD-Q(λ)): Update
4.1:   basis functions: φ̃_t = [φ(x_{t+1})^T, 0^T]^T, φ̂_t = [φ(x_t)^T, ∇_θ log π(u_t|x_t)^T]^T,
4.2:   statistics: z_{t+1} = λ z_t + φ̂_t; A_{t+1} = A_t + z_{t+1} (φ̂_t − γ φ̃_t)^T; b_{t+1} = b_t + z_{t+1} r_t,
4.3:   critic parameters: [v_{t+1}^T, w_{t+1}^T]^T = A_{t+1}^{-1} b_{t+1}.
5:   Actor: If the gradient estimate is accurate, ∠(w_t, w_{t−1}) ≤ ε, update
5.1:   policy parameters: θ_{t+1} = θ_t + α w_{t+1},
5.2:   forget statistics: z_{t+1} ← β z_{t+1}, A_{t+1} ← β A_{t+1}, b_{t+1} ← β b_{t+1}.
6: end.


This projection would have to average out the state-value offset V^π(x). For example, for linear-quadratic regulation it is straightforward to show that the advantage function is a saddle while the state-action value function is a bowl; we would therefore be projecting a bowl onto a saddle (both are illustrated in Fig. 2). In this case, the distribution of the data has a drastic impact on the projection.

To remedy this situation, we observe that we can write the Bellman equations (e.g., see [5]) in terms of the advantage function and the state-value function

$$Q^\pi(x,u) = A^\pi(x,u) + V^\pi(x) = r(x,u) + \gamma \int_X p(x'|x,u)\, V^\pi(x')\, dx'. \qquad (8)$$

Inserting A^π(x,u) = f_w^π(x,u) and an appropriate basis function representation of the value function as V^π(x) = φ(x)^T v, we can rewrite the Bellman equation, Eq. (8), as a set of linear equations

$$\nabla_\theta \log \pi(u_t|x_t)^T w + \phi(x_t)^T v = r(x_t,u_t) + \gamma\, \phi(x_{t+1})^T v + \epsilon(x_t,u_t,x_{t+1}), \qquad (9)$$

where ε(x_t,u_t,x_{t+1}) denotes an error term which is mean-zero, as can be observed from Eq. (8). These equations enable us to formulate some novel algorithms in the next sections.

The linear appearance of w and v hints at a least-squares solution. Thus, we now need to address algorithms that estimate the gradient efficiently using the sampled equations (such as Eq. (9)), and how to determine the additional basis functions φ(x) for which convergence of these algorithms is guaranteed.

2.3.1. Critic evaluation with LSTD-Q(λ)

Using Eq. (9), a solution to Eq. (8) can be obtained by adapting the LSTD(λ) policy evaluation algorithm [9]. For this purpose, we define

$$\hat{\phi}_t = [\phi(x_t)^T, \nabla_\theta \log \pi(u_t|x_t)^T]^T,\qquad
\tilde{\phi}_t = [\phi(x_{t+1})^T, 0^T]^T, \qquad (10)$$

as new basis functions, where 0 is the zero vector. This definition of basis functions reduces the bias and variance of the learning process in comparison to SARSA and previous LSTD(λ) algorithms for state-action value functions [9], as the basis functions φ̃_t do not depend on stochastic future actions u_{t+1}, i.e., the input variables of the LSTD regression are not noisy due to u_{t+1} (e.g., as in [10]); such input noise would violate the standard regression model, which only takes noise in the regression targets into account. Alternatively, Bradtke et al. [10] assume V^π(x) = Q^π(x, ū) where ū is the average future action, and choose their basis functions accordingly; however, this only holds for deterministic policies, i.e., policies without exploration, and is not applicable in our framework. LSTD(λ) with the basis functions in Eq. (10), called LSTD-Q(λ) from now on, is thus currently the theoretically cleanest way of applying LSTD to state-action value function estimation. It is exact for deterministic or weakly noisy state transitions and arbitrary stochastic policies. As with all previous LSTD suggestions, it loses accuracy with increasing noise in the state transitions since φ̃_t becomes a random variable. The complete LSTD-Q(λ) algorithm is given in the Critic Evaluation (lines 4.1–4.3) of Table 1.

Once LSTD-Q(λ) converges to an approximation of A^π(x_t,u_t) + V^π(x_t), we obtain two results: the value function parameters v, and the natural gradient w. The natural gradient w serves in updating the policy parameters Δθ_t = αw_t. After this update, the critic has to forget at least parts of its accumulated sufficient statistics using a forgetting factor β ∈ [0,1] (cf. Table 1). For β = 0, i.e., complete resetting, and appropriate basis functions φ(x), convergence to the true natural gradient can be guaranteed. The complete NAC algorithm is shown in Table 1.

However, it becomes fairly obvious that the basis functions can have an influence on our gradient estimate. When using the counterexample in [7] with a typical Gibbs policy, we realize that the gradient is affected for λ < 1; for λ = 0 the gradient is flipped and would always worsen the policy. However, unlike in [7], we at least could guarantee that we are not affected for λ = 1.
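For concreteness, here is a minimal sketch of the NAC recursions of Table 1 (lines 4.1–4.3 and 5). The 1D linear system, quadratic reward, basis functions φ(x) = [x², x, 1]^T and all constants are our own illustrative assumptions, not the paper's benchmarks, and the episode-reset heuristic is a simplification.

```python
import numpy as np

# Minimal sketch of the Natural Actor-Critic with LSTD-Q(lambda) from Table 1.
rng = np.random.default_rng(0)
gamma, lam, beta, alpha, eps = 0.95, 0.9, 0.0, 0.05, np.pi / 180
sigma = 0.3
theta = np.array([0.0, 0.0])                     # policy mean: k * x + c

def phi(x):                                      # value-function basis functions
    return np.array([x * x, x, 1.0])

def grad_log_pi(x, u):                           # d/dtheta log N(u | k*x + c, sigma^2)
    return (u - theta @ np.array([x, 1.0])) / sigma**2 * np.array([x, 1.0])

n_v, n_w = 3, 2
z = np.zeros(n_v + n_w)
A = np.zeros((n_v + n_w, n_v + n_w))
b = np.zeros(n_v + n_w)
w_old = None

x = rng.normal()                                 # draw x0 ~ p(x0)
for t in range(20000):
    u = theta @ np.array([x, 1.0]) + sigma * rng.normal()      # draw action
    r = -(x * x + 0.1 * u * u)                                  # reward
    x_next = 0.9 * x + u + 0.01 * rng.normal()                  # next state

    # Critic evaluation (LSTD-Q(lambda)), lines 4.1-4.3 of Table 1
    phi_hat = np.concatenate([phi(x), grad_log_pi(x, u)])
    phi_tilde = np.concatenate([phi(x_next), np.zeros(n_w)])
    z = lam * z + phi_hat
    A += np.outer(z, phi_hat - gamma * phi_tilde)
    b += z * r
    v_w = np.linalg.pinv(A) @ b
    w = v_w[n_v:]

    # Actor update, line 5: step along the natural gradient once w has settled
    if w_old is not None and np.linalg.norm(w) > 0 and np.linalg.norm(w_old) > 0:
        cos = w @ w_old / (np.linalg.norm(w) * np.linalg.norm(w_old))
        if t > 500 and np.arccos(np.clip(cos, -1, 1)) <= eps:
            theta = theta + alpha * w
            z, A, b = beta * z, beta * A, beta * b              # forget statistics
    w_old = w
    x = x_next
    if abs(x) > 10.0:                                           # crude episode reset
        x = rng.normal()

print("learned policy parameters:", theta)
```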

2.3.2. Episodic NAC

Given the problem that the additional basis functions φ(x) determine the quality of the gradient estimate, we need methods which guarantee the unbiasedness of the natural gradient estimate. Such a method can be obtained by summing up Eq. (9) along a sample path, which yields

$$\sum_{t=0}^{N-1} \gamma^t A^\pi(x_t,u_t) + V^\pi(x_0) = \sum_{t=0}^{N-1} \gamma^t r(x_t,u_t) + \gamma^N V^\pi(x_N). \qquad (11)$$


Table 2
Episodic Natural Actor-Critic algorithm (eNAC)

Input: Parameterized policy π(u|x) = π(u|x; θ) with initial parameters θ = θ_0, and its derivative ∇_θ log π(u|x).

1: For u = 1, 2, 3, ... do
2:   For e = 1, 2, 3, ... do
3:     Execute Rollout: Draw initial state x_0 ~ p(x_0).
         For t = 1, 2, 3, ..., N do
           Draw action u_t ~ π(u_t|x_t), observe next state x_{t+1} ~ p(x_{t+1}|x_t, u_t), and reward r_t = r(x_t, u_t).
         end.
       end.
4:   Critic Evaluation (Episodic): Determine the value function J = V^π(x_0) and the compatible function approximation f_w^π(x_t, u_t).
     Update: Determine basis functions φ = [Σ_{t=0}^{N} γ^t ∇_θ log π(u_t|x_t)^T, 1]^T and reward statistics R = Σ_{t=0}^{N} γ^t r_t.
5:   Actor-Update: When the natural gradient is converged, ∠(w_{t+1}, w_t) ≤ ε, update the policy parameters: θ_{t+1} = θ_t + α w_{t+1}.
6: end.


It is fairly obvious that the last term disappears for N → ∞ or for episodic tasks (where r(x_{N−1}, u_{N−1}) is the final reward); therefore each rollout yields one equation. If we furthermore assume a single start-state, an additional scalar value function of φ(x) = 1 suffices. We therefore get a straightforward regression problem:

$$\sum_{t=0}^{N-1} \gamma^t \nabla_\theta \log \pi(u_t|x_t)^T w + J = \sum_{t=0}^{N-1} \gamma^t r(x_t,u_t) \qquad (12)$$

with exactly dim θ + 1 unknowns. This means that for non-stochastic tasks we can obtain a gradient estimate after dim θ + 1 rollouts. The complete algorithm is shown in Table 2.
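A minimal sketch of this regression (our own code; the short-horizon 1D toy task, the single start-state x_0 = 1, and all constants are assumptions for illustration only): per rollout, the feature row is [Σ_t γ^t ∇_θ log π^T, 1] and the target is Σ_t γ^t r_t; least squares over a batch of rollouts yields [w; J], and the actor steps along w.

```python
import numpy as np

# Minimal sketch of the episodic Natural Actor-Critic regression of Eq. (12), Table 2.
rng = np.random.default_rng(0)
gamma, sigma, alpha, N = 0.95, 0.3, 0.1, 20
theta = np.array([0.0, 0.0])                       # policy mean: k * x + c

def rollout(theta):
    """Run one episode; return (sum_t gamma^t grad log pi, sum_t gamma^t r)."""
    x, g_sum, r_sum, disc = 1.0, np.zeros(2), 0.0, 1.0   # single start-state x0 = 1
    for _ in range(N):
        feats = np.array([x, 1.0])
        u = theta @ feats + sigma * rng.normal()
        g_sum += disc * (u - theta @ feats) / sigma**2 * feats   # grad_theta log pi
        r_sum += disc * -(x * x + 0.1 * u * u)
        disc *= gamma
        x = 0.9 * x + u + 0.01 * rng.normal()
    return g_sum, r_sum

for update in range(200):
    # Gather a few more rollouts than the dim(theta) + 1 unknowns of Eq. (12)
    rows, targets = [], []
    for _ in range(10):
        g_sum, r_sum = rollout(theta)
        rows.append(np.concatenate([g_sum, [1.0]]))   # [sum gamma^t grad log pi, 1]
        targets.append(r_sum)                          # sum gamma^t r
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    w, J = sol[:2], sol[2]                             # natural gradient and J = V(x0)
    theta = theta + alpha * w                          # actor update

print("learned policy parameters:", theta, " estimated J:", J)
```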

3. Properties of NAC

In this section, we will emphasize certain properties of the NAC. In particular, we want to give a simple proof of the covariance of the natural policy gradient, and discuss the observation in [14] that, in Kakade's experimental settings, the natural policy gradient was non-covariant. Furthermore, we will discuss another surprising aspect of the NAC, namely its relation to previous algorithms. We briefly demonstrate that established algorithms like the classic Actor-Critic [24] and Bradtke's Q-Learning [10] can be seen as special cases of NAC.

3.1. On the covariance of natural policy gradients

When [14] originally suggested natural policy gradients, he came to the disappointing conclusion that they were not covariant. As a counterexample, he suggested that for two different linear Gaussian policies (one in the normal form, and the other in the information form) the probability distributions represented by the natural policy gradient update would be affected differently, i.e., the natural policy gradient would be non-covariant. We intend to give a proof at this point showing that the natural policy gradient is in fact covariant under certain conditions, and clarify why [14] experienced these difficulties.

Theorem 1. Natural policy gradient updates are covariant for two policies π_θ parameterized by θ and π_h parameterized by h if (i) for all parameters θ_i there exists a function θ_i = f_i(h_1, ..., h_k), and (ii) the derivative ∇_h θ and its inverse ∇_h θ^{-1} exist.

For the proof see Appendix B. Practical experiments show that the problems that occurred for Gaussian policies in [14] are in fact due to the selection of the step-size α, which determines the length of Δθ. As the linearization Δθ = ∇_h θ^T Δh does not hold for large Δh, this can cause divergence between the algorithms even for analytically determined natural policy gradients, which can partially explain the difficulties encountered by Kakade [14].

3.2. NAC’s relation to previous algorithms

Original Actor-Critic. Surprisingly, the original Actor-Critic algorithm [24] is a form of the NAC. Consider a Gibbs policy π(u_t|x_t) = exp(θ_xu) / Σ_b exp(θ_xb), with all parameters θ_xu lumped into the vector θ (denoted as θ = [θ_xu]) in a discrete setup with tabular representations of transition probabilities and rewards, and a linear function approximation V^π(x) = φ(x)^T v with v = [v_x] and unit basis functions φ(x) = u_x. The online update rule of the original Actor-Critic [24] is given by

$$\theta_{xu}^{t+1} = \theta_{xu}^{t} + \alpha_1 \big( r(x,u) + \gamma v_{x'} - v_x \big),\qquad
v_{x}^{t+1} = v_{x}^{t} + \alpha_2 \big( r(x,u) + \gamma v_{x'} - v_x \big),$$

where α_1, α_2 denote learning rates. The update of the critic parameters v_x^t equals that of the NAC in expectation, as TD(0) critics converge to the same values as LSTD(0) and LSTD-Q(0) for discrete problems [9]. Since for the Gibbs policy we have ∂ log π(b|a)/∂θ_xu = 1 − π(b|a) if a = x and b = u, ∂ log π(b|a)/∂θ_xu = −π(b|a) if a = x and b ≠ u, and ∂ log π(b|a)/∂θ_xu = 0 otherwise, and as Σ_b π(b|x) A(x,b) = 0, we can evaluate the advantage function and derive

$$A(x,u) = A(x,u) - \sum_b \pi(b|x) A(x,b) = \sum_b \frac{\partial \log \pi(b|x)}{\partial \theta_{xu}} A(x,b).$$

Since the compatible function approximation represents the advantage function, i.e., f_w^π(x,u) = A(x,u), we realize that the advantages equal the natural gradient, i.e., w = [A(x,u)]. Furthermore, the TD(0) error of a state-action pair (x,u) equals the advantage function in expectation, and therefore the natural gradient update w_xu = A(x,u) = E_{x'}{r(x,u) + γV(x') − V(x) | x,u} corresponds to the average online update of the Actor-Critic. As both update rules of the Actor-Critic correspond to the ones of NAC, we can see both algorithms as equivalent.
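The claim that w = [A(x,u)] can be checked numerically (our own sketch; the small random MDP is an assumption made purely for illustration): computing V^π, Q^π and A^π exactly for a tabular Gibbs policy confirms that Σ_b π(b|x) A(x,b) = 0 and that the compatible function approximation with w = [A(x,u)] reproduces the advantage function exactly.

```python
import numpy as np

# Numerical check that for a tabular Gibbs policy the compatible function
# approximation with w = [A(x, u)] reproduces the advantage function exactly.
rng = np.random.default_rng(3)
nX, nU, gamma = 4, 3, 0.9

P = rng.random((nX, nU, nX)); P /= P.sum(axis=2, keepdims=True)   # p(x'|x,u)
R = rng.random((nX, nU))                                           # r(x,u)
theta = rng.normal(size=(nX, nU))                                  # Gibbs parameters
pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)      # pi(u|x)

# Exact policy evaluation: V = (I - gamma * P_pi)^-1 r_pi, Q = r + gamma * P V
P_pi = np.einsum('xu,xuy->xy', pi, P)
r_pi = (pi * R).sum(axis=1)
V = np.linalg.solve(np.eye(nX) - gamma * P_pi, r_pi)
Q = R + gamma * P @ V
A = Q - V[:, None]

print(np.abs((pi * A).sum(axis=1)).max())   # ~0: sum_b pi(b|x) A(x,b) = 0

# grad_{theta_{x'u'}} log pi(u|x) = delta_{x=x'} (delta_{u=u'} - pi(u'|x))
def grad_log_pi(x, u):
    g = np.zeros((nX, nU))
    g[x] = -pi[x]
    g[x, u] += 1.0
    return g.ravel()

w = A.ravel()                                 # claim of Section 3.2: w = [A(x,u)]
f_w = np.array([[grad_log_pi(x, u) @ w for u in range(nU)] for x in range(nX)])
print(np.abs(f_w - A).max())                  # ~1e-15: f_w(x,u) = A(x,u) exactly
```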


SARSA. SARSA with a tabular, discrete state-action value function Q^π(x,u) and an ε-soft policy improvement

$$\pi(u_t|x_t) = \exp\!\big(Q^\pi(x,u)/\epsilon\big) \Big/ \sum_{\tilde{u}} \exp\!\big(Q^\pi(x,\tilde{u})/\epsilon\big)$$

can also be seen as an approximation of the NAC. When treating the table entries as parameters of a policy, θ_xu = Q^π(x,u), we realize that the TD update of these parameters corresponds approximately to the natural gradient update, since w_xu = εA(x,u) ≈ εE_{x'}{r(x,u) + γQ(x',u') − Q(x,u) | x,u}. However, the SARSA TD error equals the advantage function only for policies where a single action u has a much better action value Q(x,u) than all other actions; for such special cases, ε-soft SARSA can be seen as an approximation of the NAC. This also corresponds to Kakade's [14] observation that a greedy update step (such as the ε-soft greedy update) approximates the natural policy gradient.

Bradtke's Q-Learning. Bradtke et al. [10] proposed an algorithm with policy π(u_t|x_t) = N(u_t | k_i^T x_t, σ_i²) and parameters θ_i = [k_i^T, σ_i]^T (where σ_i denotes the exploration and i the policy update time step) in a linear control task with linear state transitions x_{t+1} = A x_t + b u_t and quadratic rewards r(x_t,u_t) = x_t^T H x_t + R u_t². They evaluated Q^π(x_t,u_t) with LSTD(0) using a quadratic polynomial expansion as basis functions, and applied greedy updates:

$$k_{i+1}^{\text{Bradtke}} = \arg\max_{k_{i+1}} Q^\pi(x_t, u_t = k_{i+1}^T x_t) = -(R + \gamma b^T P_i b)^{-1} \gamma b^T P_i A,$$

where P_i denotes the policy-specific value function parameters related to the gain k_i; no update of the exploration σ_i was included. Similarly, we can obtain the natural policy gradient w = [w_k^T, w_σ]^T, as yielded by LSTD-Q(λ), analytically using the compatible function approximation and the same quadratic basis functions. As discussed in detail in [21], this gives us

$$w_k = \big( \gamma A^T P_i b + (R + \gamma b^T P_i b) k_i \big) \sigma_i^2,\qquad
w_\sigma = 0.5\, (R + \gamma b^T P_i b)\, \sigma_i^3.$$

Similarly, it can be derived that the expected return is J(θ_i) = −(R + γ b^T P_i b) σ_i² for this type of problem, see [21]. For a learning rate α_i = 1/||J(θ_i)||, we see that

$$k_{i+1} = k_i + \alpha_i w_k = k_i - \big( k_i + (R + \gamma b^T P_i b)^{-1} \gamma A^T P_i b \big) = k_{i+1}^{\text{Bradtke}},$$

which demonstrates that Bradtke's actor update is a special case of the NAC. The NAC extends Bradtke's result as it gives an update rule for the exploration, which was not possible in Bradtke's greedy framework.

4. Evaluations and applications

In this section, we present several evaluations comparing the NAC architectures with previous algorithms. We compare them in optimization tasks such as Cart-Pole Balancing, while the simple motor primitive evaluations are compared only with the episodic NAC. Furthermore, we apply the combination of the episodic NAC and the motor primitive framework to a robotic task on a real robot, i.e., 'hitting a T-ball with a baseball bat'.

4.1. Cart-Pole Balancing

Cart-Pole Balancing is a well-known benchmark for reinforcement learning. We assume the cart-pole system shown in Fig. 3(a) can be described by

$$m l \ddot{x} \cos\theta + m l^2 \ddot{\theta} - m g l \sin\theta = 0,$$
$$(m + m_c) \ddot{x} + m l \ddot{\theta} \cos\theta - m l \dot{\theta}^2 \sin\theta = F,$$

with l = 0.75 m, m = 0.15 kg, g = 9.81 m/s² and m_c = 1.0 kg. The resulting state is given by x = [x, ẋ, θ, θ̇]^T, and the action is u = F. The system is treated as if it was sampled at a rate of h = 60 Hz, and the reward is given by r(x,u) = x^T Q x + u^T R u with Q = diag(1.25, 1, 12, 0.25) and R = 0.01. The policy is specified as π(u|x) = N(Kx, σ²). In order to ensure that the learning algorithm cannot exceed an acceptable parameter range, the variance of the policy is defined as σ = 0.1 + 1/(1 + exp(η)). Thus, the policy parameter vector becomes θ = [K^T, η]^T and has the analytically computable optimal solution K ≈ [5.71, 11.3, −82.1, −21.6]^T and σ = 0.1, corresponding to η → ∞. As η → ∞ is hard to visualize, we show σ in Fig. 3(b) despite the fact that the update takes place over the parameter η.
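The equations of motion above can be integrated directly. The following sketch (our own code, not the paper's implementation; it uses simple Euler integration, an arbitrary start-state distribution, and the gains quoted in the text, whose sign conventions may differ from this simulation) solves for ẍ and θ̈ at each 60 Hz step and samples actions from the Gaussian policy π(u|x) = N(Kx, σ²):

```python
import numpy as np

# Sketch of the Cart-Pole setup of Section 4.1 (illustrative assumptions only).
l, m, g, m_c, dt = 0.75, 0.15, 9.81, 1.0, 1.0 / 60.0
Q = np.diag([1.25, 1.0, 12.0, 0.25])
R = 0.01

def step(state, F):
    """One 60 Hz Euler step of the cart-pole equations of motion."""
    x, x_dot, th, th_dot = state
    M = np.array([[m * l * np.cos(th), m * l**2],
                  [m + m_c,            m * l * np.cos(th)]])
    rhs = np.array([m * g * l * np.sin(th),
                    F + m * l * th_dot**2 * np.sin(th)])
    x_ddot, th_ddot = np.linalg.solve(M, rhs)
    return np.array([x + dt * x_dot, x_dot + dt * x_ddot,
                     th + dt * th_dot, th_dot + dt * th_ddot])

def policy(state, K, eta, rng):
    sigma = 0.1 + 1.0 / (1.0 + np.exp(eta))      # bounded exploration noise
    return K @ state + sigma * rng.normal()

rng = np.random.default_rng(0)
K, eta = np.array([5.71, 11.3, -82.1, -21.6]), 5.0   # gains quoted in the text
state = rng.normal(0.0, 0.05, size=4)                # assumed start-state distribution
ret = 0.0
for t in range(600):                                 # 10 s of simulated time
    u = policy(state, K, eta, rng)
    ret += state @ Q @ state + R * u * u             # quadratic (cost-style) reward
    state = step(state, u)
    if abs(state[2]) > np.pi / 6 or abs(state[0]) > 1.5:
        state = rng.normal(0.0, 0.05, size=4)        # reset as described in the text
print("accumulated quadratic cost over 10 s:", ret)
```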

For each initial policy, samples (x_t, u_t, r_{t+1}, x_{t+1}) are generated using the start-state distribution, the transition probabilities, the rewards and the policy. The samples arrive at a sampling rate of 60 Hz and are immediately sent to the NAC module. The policy is updated when ∠(w_{t+1}, w_t) ≤ ε = π/180. At the time of an update, the true 'vanilla' policy gradient, which can be computed analytically, is used to update a separate policy; these true 'vanilla' policy gradients serve as a baseline for the comparison. (The true natural policy gradient can also be computed analytically; it is not shown as its difference in performance to the Natural Actor-Critic gradient estimate is negligible.) If the pole leaves the acceptable region of −π/6 ≤ φ ≤ π/6 and −1.5 m ≤ x ≤ +1.5 m, it is reset to a new starting position drawn from the start-state distribution.

Results are illustrated in Fig. 3. In Fig. 3(b), a sample run is shown: the NAC algorithm estimates the optimal solution within less than 10 min of simulated robot trial time. The analytically obtained policy gradient, for comparison, takes over 2 h of robot experience to get to the true solution. In a real-world application, a significant amount of time would be added for the vanilla policy gradient as it is more unstable and leaves the admissible area more often. The policy gradient is clearly outperformed by the NAC algorithm.


Fig. 3. This figure shows the performance of the Natural Actor-Critic in the Cart-Pole Balancing framework. In (a), the general setup of the pole mounted on the cart is shown. In (b), a sample learning run of both the Natural Actor-Critic and the true policy gradient is given; the dashed line denotes the Natural Actor-Critic performance while the solid line shows the policy gradient's performance. In (c), the expected return of the policy is shown, averaged over 100 randomly picked policies as described in Section 4.1.


The performance difference between the true natural gradient and the NAC algorithm is negligible and, therefore, not shown separately. By the time of the conference, we hope to have this example implemented on a real anthropomorphic robot. In Fig. 3(c), the expected return over updates is shown, averaged over all hundred initial policies.

In this experiment, we demonstrated that the NAC is comparable with the ideal natural gradient and outperforms the 'vanilla' policy gradient significantly. Greedy policy improvement methods do not compare easily. Discretized greedy methods cannot compete due to the fact that the amount of data required would be significantly increased. The only suitable greedy improvement method, to our knowledge, is Bradtke's Adaptive Policy Iteration [10]. However, this method is problematic in real-world applications due to the fact that the policy in Bradtke's method is deterministic: the estimation of the action-value function is an ill-conditioned regression problem with redundant parameters and no explorative noise. Therefore, it can only work in simulated environments with an absence of noise in the state estimates and rewards.

4.2. Motor primitive learning for baseball

This section turns towards optimizing nonlinear dynamic motor primitives for robotics. In [13], a novel form of representing movement plans (q_d, q̇_d) for the degrees of freedom (DOFs) of robot systems was suggested in terms of the time evolution of the nonlinear dynamical systems

$$\dot{q}_{d,k} = h(q_{d,k}, z_k, g_k, \tau, \theta_k), \qquad (13)$$

where (q_{d,k}, q̇_{d,k}) denote the desired position and velocity of a joint, z_k the internal state of the dynamic system, g_k the goal (or point attractor) state of each DOF, τ the movement duration shared by all DOFs, and θ_k the open parameters of the function h. The original work in [13] demonstrated how the parameters θ_k can be learned to match a template trajectory by means of supervised learning; this scenario is, for instance, useful as the first step of an imitation learning system.


Here we add the ability of self-improvement of the movement primitives in Eq. (13) by means of reinforcement learning, which is the crucial second step in imitation learning. The system in Eq. (13) describes a point-to-point movement, i.e., this task is rather well suited for the episodic NAC.

In Fig. 4, we show a comparison with GPOMDP for a simple, single-DOF task with a reward of

$$r_k(x_{0:N}, u_{0:N}) = \sum_{i=0}^{N} c_1 \dot{q}_{d,k,i}^2 + c_2 (q_{d,k,N} - g_k)^2,$$

where c_1 = 1, c_2 = 1000, and g_k is chosen appropriately. In Fig. 4(a), we show how the expected cost decreases for both GPOMDP and the episodic NAC. The positions of the motor primitives are shown in Fig. 4(b), and in Fig. 4(c) the accelerations are given. In Fig. 4(b,c), the dashed line shows the initial configuration, which is accomplished by zero parameters for the motor primitives. The solid line shows the analytically optimal solution, which is unachievable for the motor primitives but nicely approximated by their best solution, presented by the dark dot-dashed line. This best solution is reached by both learning methods. However, for GPOMDP this requires approximately 10^6 learning steps, while the NAC takes less than 10^3 to converge to the optimal solution.

Fig. 4. This figure illustrates the task accomplished in the toy example. In (a), we show how the expected cost decreases for both GPOMDP and the episodic Natural Actor-Critic. The positions of the motor primitives are shown in (b) and in (c) the accelerations are given. In (b,c), the dashed line shows the initial configuration, which is accomplished by zero parameters for the motor primitives. The solid line shows the analytically optimal solution, which is unachievable for the motor primitives but nicely approximated by their best solution, presented by the dark dot-dashed line. This best solution is reached by both learning methods. However, for GPOMDP, this requires approximately 10^6 learning steps while the NAC takes less than 10^3 to converge to the optimal solution.

Fig. 5. The figure shows (a) the performance of a baseball swing task when using the motor primitives for learning. In (b), the learning system is initialized by imitation learning, in (c) it is initially failing at reproducing the motor behavior, and (d) after several hundred episodes exhibiting a nicely learned batting.

We also evaluated the same setup in a challenging robot task, i.e., the planning of these motor primitives for a seven-DOF robot task. The task of the robot is to hit the ball properly so that it flies as far as possible. Initially, it is taught in by supervised learning as can be seen in Fig. 5(b); however, it fails to reproduce the behavior as shown in Fig. 5(c); subsequently, we improve the performance using the episodic NAC, which yields the performance shown in Fig. 5(a) and the behavior in Fig. 5(d).



5. Conclusion

In this paper, we have summarized novel developments in policy-gradient reinforcement learning and, based on these, we have designed a novel reinforcement learning architecture, the NAC algorithm. This algorithm comes in (at least) two forms, i.e., the LSTD-Q(λ) form, which depends on sufficiently rich basis functions, and the episodic form, which only requires a constant as additional basis function. We compare both algorithms and apply the latter on several evaluative benchmarks as well as on a baseball swing robot example.

Recently, our NAC architecture [19,21] has gained a lot of traction in the reinforcement learning community. According to Aberdeen, the NAC is the 'current method of choice' [2]. In addition to our work presented at ESANN 2007 in [19] and its earlier, preliminary versions (see e.g. [22,21,18,20]), the algorithm has found a variety of applications in largely unmodified form in the last year. The current range of additional applications includes the optimization of constrained reaching movements of humanoid robots [12], traffic-light system optimization [23], multi-agent system optimization [11,28], conditional random fields [27] and gait optimization in robot locomotion [26,17]. All these new developments indicate that the NAC is about to become a standard architecture in the area of reinforcement learning, as it is among the few approaches which have scaled towards interesting applications.

Appendix A. Fisher information property

In Section 2.2, we explained that the all-action matrix F_θ equals in general the Fisher information matrix G(θ). In [16], we can find the well-known lemma that by differentiating ∫_{R^n} p(x) dx = 1 twice with respect to the parameters θ, we obtain

$$\int_{\mathbb{R}^n} p(x)\, \nabla_\theta^2 \log p(x)\, dx = - \int_{\mathbb{R}^n} p(x)\, \nabla_\theta \log p(x)\, \nabla_\theta \log p(x)^T dx \qquad (A.1)$$

for any probability density function p(x). Furthermore, we can rewrite the probability p(τ_{0:n}) of a rollout or trajectory τ_{0:n} = [x_0, u_0, r_0, x_1, u_1, r_1, ..., x_n, u_n, r_n, x_{n+1}]^T as

$$p(\tau_{0:n}) = p(x_0) \prod_{t=0}^{n} p(x_{t+1}|x_t, u_t)\, \pi(u_t|x_t),$$

which implies that

$$\nabla_\theta^2 \log p(\tau_{0:n}) = \sum_{t=0}^{n} \nabla_\theta^2 \log \pi(u_t|x_t).$$

Using Eq. (A.1) and the definition of the Fisher information matrix [3], we can determine the Fisher information matrix for the average reward case as

$$\begin{aligned}
G(\theta) &= \lim_{n\to\infty} n^{-1} E_\tau\{ \nabla_\theta \log p(\tau_{0:n})\, \nabla_\theta \log p(\tau_{0:n})^T \} \\
&= -\lim_{n\to\infty} n^{-1} E_\tau\{ \nabla_\theta^2 \log p(\tau_{0:n}) \} \qquad (A.2) \\
&= -\lim_{n\to\infty} n^{-1} E_\tau\Big\{ \sum_{t=0}^{n} \nabla_\theta^2 \log \pi(u_t|x_t) \Big\} \\
&= -\int_X d^\pi(x) \int_U \pi(u|x)\, \nabla_\theta^2 \log \pi(u|x)\, du\, dx \qquad (A.3) \\
&= \int_X d^\pi(x) \int_U \pi(u|x)\, \nabla_\theta \log \pi(u|x)\, \nabla_\theta \log \pi(u|x)^T du\, dx = F_\theta. \qquad (A.4)
\end{aligned}$$

This proves that the all-action matrix is indeed the Fisher information matrix for the average reward case. For the discounted case with a discount factor γ, we realize that we can rewrite the problem such that the probability of a rollout is given by

$$p_\gamma(\tau_{0:n}) = p(\tau_{0:n}) \Big( \sum_{i=0}^{n} \gamma^i I_{x_i, u_i} \Big),$$

and derive that the all-action matrix equals the Fisher information matrix by the same kind of reasoning as in Eq. (A.4). Therefore, we can conclude that, in general, G(θ) = F_θ.
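Eq. (A.1) can be verified numerically for a simple density (our own sketch): for p(x) = N(x; θ, σ²) with parameter θ, we have ∇_θ log p(x) = (x − θ)/σ² and ∇²_θ log p(x) = −1/σ², so the expectation of the squared score equals the negated expected second derivative.

```python
import numpy as np

# Monte-Carlo check of Eq. (A.1) for p(x) = N(x; theta, sigma^2), parameter theta.
rng = np.random.default_rng(4)
theta, sigma = 1.3, 0.7
x = rng.normal(theta, sigma, size=1_000_000)

grad_log_p = (x - theta) / sigma**2           # d/dtheta log p(x)
hess_log_p = -1.0 / sigma**2                  # d^2/dtheta^2 log p(x), constant here

print(np.mean(grad_log_p**2))                 # ~ 1 / sigma^2  (right-hand side)
print(-hess_log_p)                            #   1 / sigma^2  (left-hand side, negated)
```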

Appendix B. Proof of the covariance theorem

For small parameter changes Δθ and Δh, we have Δθ = ∇_h θ^T Δh. If the natural policy gradient is a covariant update rule, a change Δh along the gradient ∇_h J(h) should result in the same change Δθ along the gradient ∇_θ J(θ) for the same scalar step-size α. By differentiation, we can obtain ∇_h J(h) = ∇_h θ ∇_θ J(θ). It is straightforward to show that the Fisher information matrix includes the Jacobian ∇_h θ twice as a factor:

$$\begin{aligned}
F(h) &= \int_X d^\pi(x) \int_U \pi(u|x)\, \nabla_h \log \pi(u|x)\, \nabla_h \log \pi(u|x)^T du\, dx \\
&= \nabla_h \theta \int_X d^\pi(x) \int_U \pi(u|x)\, \nabla_\theta \log \pi(u|x)\, \nabla_\theta \log \pi(u|x)^T du\, dx\; \nabla_h \theta^T \\
&= \nabla_h \theta\, F(\theta)\, \nabla_h \theta^T.
\end{aligned}$$

This shows that the natural gradient in the h parameterization is given by

$$\tilde{\nabla}_h J(h) = F^{-1}(h)\, \nabla_h J(h) = \big( \nabla_h \theta\, F(\theta)\, \nabla_h \theta^T \big)^{-1} \nabla_h \theta\, \nabla_\theta J(\theta).$$


This has a surprising implication, as it makes it straightforward to see that the natural policy gradient is covariant since

$$\begin{aligned}
\Delta\theta &= \alpha \nabla_h \theta^T \Delta h = \alpha \nabla_h \theta^T \tilde{\nabla}_h J(h) \\
&= \alpha \nabla_h \theta^T \big( \nabla_h \theta\, F(\theta)\, \nabla_h \theta^T \big)^{-1} \nabla_h \theta\, \nabla_\theta J(\theta) \\
&= \alpha F^{-1}(\theta)\, \nabla_\theta J(\theta) = \alpha \tilde{\nabla}_\theta J(\theta),
\end{aligned}$$

assuming that ∇_h θ is invertible. This shows that the natural policy gradient is in fact a covariant gradient update rule.

The assumptions underlying this proof require that the learning rate is very small in order to ensure a covariant gradient descent process. However, single update steps will always be covariant and, thus, this requirement is only formally necessary but barely matters in practice. As in other gradient descent problems, learning rates can be chosen to optimize the performance without affecting the covariance of the direction of a single update step.

References

[1] D. Aberdeen, Policy-gradient algorithms for partially observable Markov decision processes, Ph.D. Thesis, Australian National University, 2003.
[2] D. Aberdeen, POMDPs and policy gradients, in: Proceedings of the Machine Learning Summer School (MLSS), Canberra, Australia, 2006.
[3] S. Amari, Natural gradient works efficiently in learning, Neural Comput. 10 (1998) 251–276.
[4] J. Bagnell, J. Schneider, Covariant policy search, in: International Joint Conference on Artificial Intelligence, 2003.
[5] L.C. Baird, Advantage updating, Technical Report WL-TR-93-1146, Wright Lab., 1993.
[6] L.C. Baird, A.W. Moore, Gradient descent for general reinforcement learning, in: Advances in Neural Information Processing Systems, vol. 11, 1999.
[7] P. Bartlett, An introduction to reinforcement learning theory: value function methods, in: Machine Learning Summer School, 2002, pp. 184–202.
[8] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[9] J. Boyan, Least-squares temporal difference learning, in: Machine Learning: Proceedings of the Sixteenth International Conference, 1999, pp. 49–56.
[10] S. Bradtke, E. Ydstie, A.G. Barto, Adaptive Linear Quadratic Control Using Policy Iteration, University of Massachusetts, Amherst, MA, 1994.
[11] O. Buffet, A. Dutech, F. Charpillet, Shaping multi-agent systems with gradient reinforcement learning, Autonomous Agents Multi-Agent Syst. 15 (2) (October 2007) 1387–2532.
[12] F. Guenter, M. Hersch, S. Calinon, A. Billard, Reinforcement learning for imitating constrained reaching movements, RSJ Adv. Robotics 21 (13) (2007) 1521–1544.
[13] A. Ijspeert, J. Nakanishi, S. Schaal, Learning rhythmic movements by demonstration using nonlinear oscillators, in: IEEE International Conference on Intelligent Robots and Systems (IROS 2002), 2002, pp. 958–963.
[14] S.A. Kakade, Natural policy gradient, in: Advances in Neural Information Processing Systems, vol. 14, 2002.
[15] V. Konda, J. Tsitsiklis, Actor-critic algorithms, in: Advances in Neural Information Processing Systems, vol. 12, 2000.
[16] T. Moon, W. Stirling, Mathematical Methods and Algorithms for Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 2000.
[17] J. Park, J. Kim, D. Kang, An RLS-based Natural Actor-Critic algorithm for locomotion of a two-linked robot arm, in: Proceedings of Computational Intelligence and Security: International Conference (CIS 2005), Xi'an, China, December 2005, pp. 15–19.
[18] J. Peters, S. Schaal, Policy gradient methods for robotics, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Beijing, China, 2006.
[19] J. Peters, S. Schaal, Applying the episodic natural actor-critic architecture to motor primitive learning, in: Proceedings of the 2007 European Symposium on Artificial Neural Networks (ESANN), 2007.
[20] J. Peters, S. Vijayakumar, S. Schaal, Scaling reinforcement learning paradigms for motor learning, in: Proceedings of the 10th Joint Symposium on Neural Computation (JSNC), Irvine, CA, May 2003.
[21] J. Peters, S. Vijayakumar, S. Schaal, Reinforcement learning for humanoid robotics, in: IEEE International Conference on Humanoid Robots, 2003.
[22] J. Peters, S. Vijayakumar, S. Schaal, Natural Actor-Critic, in: Proceedings of the European Machine Learning Conference (ECML), Porto, Portugal, 2005.
[23] S. Richter, D. Aberdeen, J. Yu, Natural Actor-Critic for road traffic optimisation, in: Advances in Neural Information Processing Systems, 2007.
[24] R.S. Sutton, A.G. Barto, Reinforcement Learning, MIT Press, Cambridge, MA, 1998.
[25] R.S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: Advances in Neural Information Processing Systems, vol. 12, 2000.
[26] T. Ueno, Y. Nakamura, T. Shibata, K. Hosoda, S. Ishii, Fast and stable learning of quasi-passive dynamic walking by an unstable biped robot based on off-policy Natural Actor-Critic, in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006.
[27] S.V.N. Vishwanathan, X. Zhang, D. Aberdeen, Conditional random fields for reinforcement learning, in: Y. Bengio, Y. LeCun (Eds.), Proceedings of the 2007 Snowbird Learning Workshop, San Juan, Puerto Rico, March 2007.
[28] X. Zhang, D. Aberdeen, S.V.N. Vishwanathan, Conditional random fields for multi-agent reinforcement learning, in: Proceedings of the 24th International Conference on Machine Learning (ICML 2007), ACM International Conference Proceeding Series, Corvalis, Oregon, 2007, pp. 1143–1150.

Jan Peters heads the Robot Learning Lab (RoLL) at the Max-Planck Institute for Biological Cybernetics (MPI) while being an invited researcher at the Computational Learning and Motor Control Lab at the University of Southern California (USC). Before joining MPI, he graduated from the University of Southern California with a Ph.D. in Computer Science in March 2007. Jan Peters studied Electrical Engineering, Computer Science and Mechanical Engineering. He holds two German M.Sc. degrees in Informatics and in Electrical Engineering (Dipl-Informatiker from Hagen University and Diplom-Ingenieur from Munich University of Technology/TUM) and two M.Sc. degrees in Computer Science and Mechanical Engineering from the University of Southern California (USC). During his graduate studies, Jan Peters has been a visiting researcher at the Department of Robotics at the German Aerospace Research Center (DLR) in Oberpfaffenhofen, Germany, at Siemens Advanced Engineering (SAE) in Singapore, at the National University of Singapore (NUS), and at the Department of Humanoid Robotics and Computational Neuroscience at the Advanced Telecommunication Research (ATR) Center in Kyoto, Japan. His research interests include robotics, nonlinear control, machine learning, and motor skill learning.


Stefan Schaal is an Associate Professor at the Department of Computer Science and the Neuroscience Program at the University of Southern California, and an Invited Researcher at the ATR Human Information Sciences Laboratory in Japan, where he held an appointment as Head of the Computational Learning Group during an international ERATO project, the Kawato Dynamic Brain Project (ERATO/JST). Before joining USC, Dr. Schaal was a postdoctoral fellow at the Department of Brain and Cognitive Sciences and the Artificial Intelligence Laboratory at MIT, an Invited Researcher at the ATR Human Information Processing Research Laboratories in Japan, and an Adjunct Assistant Professor at the Georgia Institute of Technology and at the Department of Kinesiology of the Pennsylvania State University. Dr. Schaal's research interests include topics of statistical and machine learning, neural networks, computational neuroscience, functional brain imaging, nonlinear dynamics, nonlinear control theory, and biomimetic robotics. He applies his research to problems of artificial and biological motor control and motor learning, focusing on both theoretical investigations and experiments with human subjects and anthropomorphic robot equipment.