ARTICLE IN PRESS
0925-2312/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2007.11.026
*Corresponding author at: Max-Planck-Institute for Biological Cybernetics, Department of Empirical Inference, Spemannstr. 38, 72076 Tuebingen, Germany.
E-mail address: [email protected] (J. Peters).
Neurocomputing 71 (2008) 1180–1190
www.elsevier.com/locate/neucom
Natural Actor-Critic
Jan Peters a,b,*, Stefan Schaal b,c
a Max-Planck-Institute for Biological Cybernetics, Tuebingen, Germany
b University of Southern California, Los Angeles, CA 90089, USA
c ATR Computational Neuroscience Laboratories, Kyoto 619-0288, Japan
Available online 1 February 2008
Abstract
In this paper, we suggest a novel reinforcement learning architecture, the Natural Actor-Critic. The actor updates are achieved using
stochastic policy gradients employing Amari’s natural gradient approach, while the critic obtains both the natural policy gradient and
additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy
gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be
estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by
the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the
original Actor-Critic and Bradtke’s Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations
illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning
control on an anthropomorphic robot arm.
© 2008 Elsevier B.V. All rights reserved.
Keywords: Policy-gradient methods; Compatible function approximation; Natural gradients; Actor-Critic methods; Reinforcement learning; Robot learning
1. Introduction
Reinforcement learning algorithms based on value function approximation have been highly successful with discrete lookup table parameterizations. However, when applied with continuous function approximation, many of these algorithms failed to generalize, and few convergence guarantees could be obtained [24]. The reason for this problem can largely be traced back to the greedy or ε-greedy policy updates of most techniques, as these do not ensure a policy improvement when applied with an approximate value function [8]. During a greedy update, small errors in the value function can cause large changes in the policy, which in return can cause large changes in the value function. This process, when applied repeatedly, can result in oscillations or divergence of the algorithms. Even in simple toy systems,
such unfortunate behavior can be found in many well-known greedy reinforcement learning algorithms [6,8].

As an alternative to greedy reinforcement learning, policy-gradient methods have been suggested. Policy gradients have rather strong convergence guarantees, even when used in conjunction with approximate value functions, and recent results have created a theoretically solid framework for policy-gradient estimation from sampled data [25,15]. However, even when applied to simple examples with rather few states, policy-gradient methods often turn out to be quite inefficient [14], partially caused by the large plateaus in the expected return landscape where the gradients are small and often do not point directly towards the optimal solution. A simple example that demonstrates this behavior is given in Fig. 1.

As in supervised learning, the steepest ascent with respect to the Fisher information metric [3], called the 'natural' policy gradient, turns out to be significantly more efficient than normal gradients. Such an approach was first suggested for reinforcement learning as the 'average natural policy gradient' in [14], and subsequently shown in preliminary work to be the true natural policy gradient [21,4]. In this paper, we take this line of reasoning one step
Fig. 1. When plotting the expected return landscape for a simple problem such as 1d linear-quadratic regulation, the difference between (a) 'vanilla' and (b) natural policy gradients becomes apparent [21].
further in Section 2.2 by introducing the 'Natural Actor-Critic (NAC)', which inherits the convergence guarantees from gradient methods. Furthermore, in Section 3, we show that several successful previous reinforcement learning methods can be seen as special cases of this more general architecture. The paper concludes with empirical evaluations in Section 4 that demonstrate the effectiveness of the suggested methods.
2. Natural Actor-Critic
2.1. Markov decision process notation and assumptions
For this paper, we assume that the underlying control problem is a Markov decision process (MDP) in discrete time with continuous state set $X = \mathbb{R}^n$ and continuous action set $U = \mathbb{R}^m$ [8]. The MDP assumption comes with the limitation that very good state information and a Markovian environment are required. However, similar to [1], the results presented in this paper might extend to problems with partial state information.
The system is at an initial state $x_0 \in X$ at time $t = 0$, drawn from the start-state distribution $p(x_0)$. At any state $x_t \in X$ at time $t$, the actor chooses an action $u_t \in U$ by drawing it from a stochastic, parameterized policy $\pi(u_t \mid x_t) = \pi(u_t \mid x_t, \theta)$ with parameters $\theta \in \mathbb{R}^N$, and the system transfers to a new state $x_{t+1}$ drawn from the state transfer distribution $p(x_{t+1} \mid x_t, u_t)$. The system yields a scalar reward $r_t = r(x_t, u_t) \in \mathbb{R}$ after each action. We assume that the policy $\pi_\theta$ is continuously differentiable with respect to its parameters $\theta$, and that for each considered policy $\pi_\theta$ a state-value function $V^\pi(x)$ and a state-action value function $Q^\pi(x,u)$ exist, given by
$$V^\pi(x) = E_\tau\Big\{\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x\Big\},$$
$$Q^\pi(x,u) = E_\tau\Big\{\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x,\, u_0 = u\Big\},$$
where $\gamma \in (0,1)$ denotes the discount factor and $\tau$ a trajectory. It is assumed that some basis functions $\phi(x)$ are given so that the state-value function can be approximated with linear function approximation, $V^\pi(x) = \phi(x)^T v$. The general goal is to optimize the normalized expected return
$$J(\theta) = E_\tau\Big\{(1-\gamma)\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, \theta\Big\} = \int_X d^\pi(x) \int_U \pi(u \mid x)\, r(x,u)\, du\, dx,$$
where
$$d^\pi(x) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t p(x_t = x)$$
is the discounted state distribution.
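Concretely, a single truncated rollout yields one sample of the normalized expected return. The short sketch below is our own illustration (the function name is hypothetical, not part of the paper); it computes $(1-\gamma)\sum_t \gamma^t r_t$ for a recorded reward sequence:

```python
import numpy as np

def normalized_discounted_return(rewards, gamma):
    """One sample of J(theta): (1 - gamma) * sum_t gamma^t * r_t."""
    t = np.arange(len(rewards))
    return (1.0 - gamma) * np.sum(gamma ** t * np.asarray(rewards, dtype=float))
```

With a constant reward of 1, the truncated sum approaches 1 as the horizon grows, which is exactly what the $(1-\gamma)$ normalization is for.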
2.2. Actor improvement with natural policy gradients
Actor-Critic and many other policy iteration architectures consist of two steps, a policy evaluation step and a policy improvement step. The main requirement for the policy evaluation step is that it makes efficient use of experienced data. The policy improvement step is required to improve the policy on every step until convergence while remaining efficient.

These requirements rule out greedy methods as, at the current state of knowledge, a policy improvement for approximated value functions cannot be guaranteed, even on average. 'Vanilla' policy gradient improvements (see e.g. [25,15]), which follow the gradient $\nabla_\theta J(\theta)$ of the expected return function $J(\theta)$ (where $\nabla_\theta f = [\partial f/\partial \theta_1, \ldots, \partial f/\partial \theta_N]^T$ denotes the derivative of a function $f$ with respect to the parameter vector $\theta$), often get stuck in plateaus as demonstrated in [14]. Natural gradients $\widetilde{\nabla}_\theta J(\theta)$ avoid this pitfall, as demonstrated for supervised learning problems [3] and suggested for reinforcement learning in [14]. These methods do not follow the steepest direction in parameter space but the steepest direction with respect to the Fisher metric, given by
$$\widetilde{\nabla}_\theta J(\theta) = G^{-1}(\theta)\, \nabla_\theta J(\theta), \qquad (1)$$
where $G(\theta)$ denotes the Fisher information matrix. It is guaranteed that the angle between the natural and the ordinary gradient is never larger than $90^\circ$, i.e., convergence to the next local optimum can be assured. The 'vanilla' gradient is given by the policy-gradient theorem (see e.g. [25,15])
$$\nabla_\theta J(\theta) = \int_X d^\pi(x) \int_U \nabla_\theta \pi(u \mid x)\, \big(Q^\pi(x,u) - b^\pi(x)\big)\, du\, dx, \qquad (2)$$
where $b^\pi(x)$ denotes a baseline. Refs. [25,15] demonstrated that in Eq. (2) the term $Q^\pi(x,u) - b^\pi(x)$ can be replaced by a compatible function approximation
$$f^\pi_w(x,u) = \big(\nabla_\theta \log \pi(u \mid x)\big)^T w \approx Q^\pi(x,u) - b^\pi(x), \qquad (3)$$
parameterized by the vector $w$, without affecting the unbiasedness of the gradient estimate and irrespective of the choice of the baseline $b^\pi(x)$. However, as mentioned in
[25], the baseline may still be useful in order to reduce the variance of the gradient estimate when Eq. (2) is approximated from samples. Based on Eqs. (2) and (3), we derive an estimate of the policy gradient as
$$\nabla_\theta J(\theta) = \int_X d^\pi(x) \int_U \pi(u \mid x)\, \nabla_\theta \log \pi(u \mid x)\, \nabla_\theta \log \pi(u \mid x)^T\, du\, dx\; w = F_\theta\, w, \qquad (4)$$
as $\nabla_\theta \pi(u \mid x) = \pi(u \mid x)\, \nabla_\theta \log \pi(u \mid x)$. Since $\pi(u \mid x)$ is chosen by the user, even in sampled data, the integral
$$F(\theta, x) = \int_U \pi(u \mid x)\, \nabla_\theta \log \pi(u \mid x)\, \nabla_\theta \log \pi(u \mid x)^T\, du \qquad (5)$$
can be evaluated analytically or empirically without actually executing all actions. It is also noteworthy that the baseline does not appear in Eq. (4) as it integrates out, thus eliminating the need to find an optimal selection of this open parameter. Nevertheless, the estimation of $F_\theta = \int_X d^\pi(x)\, F(\theta,x)\, dx$ is still expensive since $d^\pi(x)$ is not known. However, Eq. (4) has more surprising implications for policy gradients when examining the meaning of the matrix $F_\theta$. Kakade [14] argued that $F(\theta,x)$ is the point Fisher information matrix for state $x$, and that $F(\theta) = \int_X d^\pi(x)\, F(\theta,x)\, dx$ therefore denotes a weighted 'average Fisher information matrix' [14]. Going one step further, we demonstrate in Appendix A that $F_\theta$ is indeed the true Fisher information matrix and does not have to be interpreted as the 'average' of the point Fisher information matrices. Eqs. (4) and (1) combined imply that the natural gradient can be computed as
$$\widetilde{\nabla}_\theta J(\theta) = G^{-1}(\theta)\, F_\theta\, w = w, \qquad (6)$$
since $F_\theta = G(\theta)$ (cf. Appendix A). Therefore we only need to estimate $w$ and not $G(\theta)$. The resulting policy improvement step is thus $\theta_{i+1} = \theta_i + \alpha w$, where $\alpha$ denotes a learning rate. Several properties of the natural policy gradient are worth highlighting:
- Convergence to a local minimum is guaranteed, as for 'vanilla' gradients [3].
- By choosing a more direct path to the optimal solution in parameter space, the natural gradient has, from empirical observations, faster convergence and avoids the premature convergence of 'vanilla' gradients (cf. Fig. 1).
- The natural policy gradient can be shown to be covariant, i.e., independent of the coordinate frame chosen for expressing the policy parameters (cf. Section 3.1).
- As the natural gradient analytically averages out the influence of the stochastic policy (including the baseline of the function approximator), it requires fewer data points for a good gradient estimate than 'vanilla' gradients.

Fig. 2. The state-action value function of any stable linear-quadratic Gaussian regulation problem can be shown to be a bowl (a). The advantage function is always a saddle, as shown in (b); it is straightforward to show that the compatible function approximation can exactly represent the advantage function, but projecting the value function onto the advantage function is non-trivial for continuous problems. This figure shows the value function and advantage function of the system described in the caption of Fig. 1.
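To make Eq. (5) concrete: for a Gaussian policy $\pi(u \mid x) = \mathcal{N}(\theta^T x, \sigma^2)$ with fixed $\sigma$, the score is $\nabla_\theta \log \pi(u \mid x) = (u - \theta^T x)\, x / \sigma^2$, and Eq. (5) evaluates analytically to $F(\theta, x) = x x^T / \sigma^2$. The following sketch is our own Monte Carlo check of this identity (all names are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = np.array([0.5, -0.3]), 0.7
x = np.array([1.0, 2.0])

def score(u):
    # grad_theta log pi(u|x) for pi = N(theta^T x, sigma^2)
    return (u - theta @ x) * x / sigma**2

# Empirical F(theta, x): expectation of the score outer product under pi
us = theta @ x + sigma * rng.standard_normal(20000)
F_emp = np.mean([np.outer(score(u), score(u)) for u in us], axis=0)
F_exact = np.outer(x, x) / sigma**2  # analytic value of Eq. (5) here
```

Note that the empirical estimate never needs the reward or the environment; as the text says, the integral depends only on the user-chosen policy.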
2.3. Critic estimation with compatible policy evaluation
The critic evaluates the current policy $\pi$ in order to provide the basis for an actor improvement, i.e., the change $\Delta\theta$ of the policy parameters. As we are interested in natural policy gradient updates $\Delta\theta = \alpha w$, we wish to employ the compatible function approximation $f^\pi_w(x,u)$ from Eq. (3) in this context. At this point, a most important observation is that the compatible function approximation $f^\pi_w(x,u)$ is mean-zero w.r.t. the action distribution, i.e.,
$$\int_U \pi(u \mid x)\, f^\pi_w(x,u)\, du = w^T \int_U \nabla_\theta \pi(u \mid x)\, du = 0, \qquad (7)$$
since from $\int_U \pi(u \mid x)\, du = 1$, differentiation w.r.t. $\theta$ results in $\int_U \nabla_\theta \pi(u \mid x)\, du = 0$. Thus, $f^\pi_w(x,u)$ in general represents an advantage function $A^\pi(x,u) = Q^\pi(x,u) - V^\pi(x)$. The essential difference between the advantage function and the state-action value function is illustrated in Fig. 2. The advantage function cannot be learned with TD-like bootstrapping without knowledge of the value function, as the essence of TD is to compare the values $V^\pi(x)$ of two adjacent states; but this value has been subtracted out in $A^\pi(x,u)$. Hence, TD-like bootstrapping using exclusively the compatible function approximator is impossible.

As an alternative, [25,15] suggested approximating $f^\pi_w(x,u)$ from unbiased estimates $\hat{Q}^\pi(x,u)$ of the action value function, e.g., obtained from rollouts, using least-squares minimization between $f_w$ and $\hat{Q}^\pi$. While possible in theory, one needs to realize that this approach implies a function approximation problem where the parameterization of the function approximator spans only a much smaller subspace than the training data; e.g., imagine approximating a quadratic function with a line. In practice, the result of such an approximation depends crucially on the training data distribution and thus has unacceptably high variance; e.g., consider fitting a line to data only from the right branch of a parabola, only from the left branch, or from both branches.

Furthermore, in continuous state-spaces a state (except for single start-states) will hardly occur twice; therefore, we can only obtain unbiased estimates $\hat{Q}^\pi(x,u)$ of $Q^\pi(x,u)$. This means the state-action value estimates $\hat{Q}^\pi(x,u)$ have to
Table 1
Natural Actor-Critic Algorithm with LSTD-Q($\lambda$)

Input: Parameterized policy $\pi(u \mid x) = \pi(u \mid x, \theta)$ with initial parameters $\theta = \theta_0$, its derivative $\nabla_\theta \log \pi(u \mid x)$, and basis functions $\phi(x)$ for the value function $V^\pi(x)$.

1: Draw the initial state $x_0 \sim p(x_0)$, and initialize $z_0 = 0$, $A_0 = 0$, $b_0 = 0$.
2: For $t = 0, 1, 2, \ldots$ do
3: Execute: Draw action $u_t \sim \pi(u_t \mid x_t)$, observe next state $x_{t+1} \sim p(x_{t+1} \mid x_t, u_t)$ and reward $r_t = r(x_t, u_t)$.
4: Critic Evaluation (LSTD-Q($\lambda$)): Update
4.1: basis functions: $\tilde{\phi}_t = [\phi(x_{t+1})^T, 0^T]^T$, $\hat{\phi}_t = [\phi(x_t)^T, \nabla_\theta \log \pi(u_t \mid x_t)^T]^T$,
4.2: statistics: $z_{t+1} = \lambda z_t + \hat{\phi}_t$; $A_{t+1} = A_t + z_{t+1}(\hat{\phi}_t - \gamma \tilde{\phi}_t)^T$; $b_{t+1} = b_t + z_{t+1} r_t$;
4.3: critic parameters: $[v_{t+1}^T, w_{t+1}^T]^T = A_{t+1}^{-1} b_{t+1}$.
5: Actor: If the gradient estimate is accurate, $\angle(w_t, w_{t-1}) \le \epsilon$, update
5.1: policy parameters: $\theta_{t+1} = \theta_t + \alpha w_{t+1}$,
5.2: forget statistics: $z_{t+1} \leftarrow \beta z_{t+1}$, $A_{t+1} \leftarrow \beta A_{t+1}$, $b_{t+1} \leftarrow \beta b_{t+1}$.
6: end.
be projected onto the advantage function $A^\pi(x,u)$. This projection would have to average out the state-value offset $V^\pi(x)$. For example, for linear-quadratic regulation it is straightforward to show that the advantage function is a saddle while the state-action value function is a bowl; we would therefore be projecting a bowl onto a saddle (both are illustrated in Fig. 2). In this case, the distribution of the data has a drastic impact on the projection.
To remedy this situation, we observe that we can write the Bellman equations (e.g., see [5]) in terms of the advantage function and the state-value function:
$$Q^\pi(x,u) = A^\pi(x,u) + V^\pi(x) = r(x,u) + \gamma \int_X p(x' \mid x, u)\, V^\pi(x')\, dx'. \qquad (8)$$
Inserting $A^\pi(x,u) = f^\pi_w(x,u)$ and an appropriate basis function representation of the value function, $V^\pi(x) = \phi(x)^T v$, we can rewrite the Bellman equation, Eq. (8), as a set of linear equations
$$\nabla_\theta \log \pi(u_t \mid x_t)^T w + \phi(x_t)^T v = r(x_t, u_t) + \gamma \phi(x_{t+1})^T v + \epsilon(x_t, u_t, x_{t+1}), \qquad (9)$$
where $\epsilon(x_t, u_t, x_{t+1})$ denotes a mean-zero error term, as can be observed from Eq. (8). These equations enable us to formulate some novel algorithms in the next sections.
The linear appearance of $w$ and $v$ hints at a least-squares solution. Thus, we now need to address algorithms that estimate the gradient efficiently using sampled equations such as Eq. (9), and how to determine the additional basis functions $\phi(x)$ for which convergence of these algorithms is guaranteed.
2.3.1. Critic evaluation with LSTD-Q($\lambda$)
Using Eq. (9), a solution to Eq. (8) can be obtained by adapting the LSTD($\lambda$) policy evaluation algorithm [9]. For this purpose, we define
$$\hat{\phi}_t = [\phi(x_t)^T,\; \nabla_\theta \log \pi(u_t \mid x_t)^T]^T, \qquad \tilde{\phi}_t = [\phi(x_{t+1})^T,\; 0^T]^T \qquad (10)$$
as new basis functions, where $0$ is the zero vector. This definition of basis functions reduces the bias and variance of the learning process in comparison to SARSA and previous LSTD($\lambda$) algorithms for state-action value functions [9], as the basis functions $\tilde{\phi}_t$ do not depend on stochastic future actions $u_{t+1}$, i.e., the input variables of the LSTD regression are not noisy due to $u_{t+1}$ (e.g., as in [10]); such input noise would violate the standard regression model, which only takes noise in the regression targets into account. Alternatively, Bradtke et al. [10] assume $V^\pi(x) = Q^\pi(x, \bar{u})$, where $\bar{u}$ is the average future action, and choose their basis functions accordingly; however, this only holds for deterministic policies, i.e., policies without exploration, and is not applicable in our framework. LSTD($\lambda$) with the basis functions in Eq. (10), called LSTD-Q($\lambda$) from now on, is thus currently the theoretically cleanest way of applying LSTD to state-value function estimation. It is exact for deterministic or weakly noisy state transitions and arbitrary stochastic policies. As with all previous LSTD suggestions, it loses accuracy with increasing noise in the state transitions since $\tilde{\phi}_t$ becomes a random variable. The complete LSTD-Q($\lambda$) algorithm is given in the Critic Evaluation step (lines 4.1–4.3) of Table 1.

Once LSTD-Q($\lambda$) converges to an approximation of $A^\pi(x_t, u_t) + V^\pi(x_t)$, we obtain two results: the value function parameters $v$ and the natural gradient $w$. The natural gradient $w$ serves in updating the policy parameters, $\Delta\theta_t = \alpha w_t$. After this update, the critic has to forget at least part of its accumulated sufficient statistics using a forgetting factor $\beta \in [0,1]$ (cf. Table 1). For $\beta = 0$, i.e., complete resetting, and appropriate basis functions $\phi(x)$, convergence to the true natural gradient can be guaranteed. The complete NAC algorithm is shown in Table 1.

However, it is fairly obvious that the basis functions can have an influence on our gradient estimate. When using the counterexample in [7] with a typical Gibbs policy, we realize that the gradient is affected for $\lambda < 1$; for $\lambda = 0$ the gradient is flipped and would always worsen the policy. However, unlike in [7], we can at least guarantee that we are not affected for $\lambda = 1$.
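The statistics recursion in lines 4.1–4.3 of Table 1 amounts to a few lines of linear algebra. The sketch below is our own illustration (function names are hypothetical); as a sanity check, with $\gamma = \lambda = 0$ the accumulated system reduces to ordinary least squares of the rewards on the combined features $\hat{\phi}_t$:

```python
import numpy as np

def lstdq_step(z, A, b, phi_hat, phi_tilde, r, gamma, lam):
    """One update of the LSTD-Q(lambda) sufficient statistics (Table 1, 4.1-4.2)."""
    z = lam * z + phi_hat                             # eligibility trace
    A = A + np.outer(z, phi_hat - gamma * phi_tilde)  # regression matrix
    b = b + r * z                                     # reward statistics
    return z, A, b

def critic_solve(A, b, n_v):
    """Line 4.3: solve A [v; w] = b; v are value weights, w is the natural gradient."""
    sol = np.linalg.solve(A, b)
    return sol[:n_v], sol[n_v:]
```

In practice one would also add a small ridge term to $A$ before solving, since early in learning the matrix can be ill-conditioned; that detail is an implementation choice, not part of Table 1.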
2.3.2. Episodic NAC
Given the problem that the additional basis functions $\phi(x)$ determine the quality of the gradient, we need methods which guarantee the unbiasedness of the natural gradient estimate. Such a method can be obtained by summing up Eq. (9) along a sample path, which yields
$$\sum_{t=0}^{N-1} \gamma^t A^\pi(x_t, u_t) + V^\pi(x_0) = \sum_{t=0}^{N-1} \gamma^t r(x_t, u_t) + \gamma^N V^\pi(x_N). \qquad (11)$$
Table 2
Episodic Natural Actor-Critic Algorithm (eNAC)

Input: Parameterized policy $\pi(u \mid x) = \pi(u \mid x, \theta)$ with initial parameters $\theta = \theta_0$ and derivative $\nabla_\theta \log \pi(u \mid x)$.

For $u = 1, 2, 3, \ldots$ do
  For $e = 1, 2, 3, \ldots$ do
    Execute Rollout: Draw initial state $x_0 \sim p(x_0)$.
    For $t = 1, 2, 3, \ldots, N$ do
      Draw action $u_t \sim \pi(u_t \mid x_t)$, observe next state $x_{t+1} \sim p(x_{t+1} \mid x_t, u_t)$ and reward $r_t = r(x_t, u_t)$.
    end.
  end.
  Critic Evaluation (Episodic): Determine the value function $J = V^\pi(x_0)$ and the compatible function approximation $f^\pi_w(x_t, u_t)$.
  Update: Determine the basis functions $\phi_e = [\sum_{t=0}^{N} \gamma^t \nabla_\theta \log \pi(u_t \mid x_t)^T, 1]^T$ and the reward statistics $R_e = \sum_{t=0}^{N} \gamma^t r_t$ for each rollout.
  Actor Update: When the natural gradient has converged, $\angle(w_{t+1}, w_{t-\tau}) \le \epsilon$, update the policy parameters: $\theta_{t+1} = \theta_t + \alpha w_{t+1}$.
end.
It is fairly obvious that the last term vanishes for $N \to \infty$ or for episodic tasks (where $r(x_{N-1}, u_{N-1})$ is the final reward); therefore, each rollout yields one equation. If we furthermore assume a single start-state, an additional scalar value function of $\phi(x) = 1$ suffices. We therefore obtain a straightforward regression problem:
$$\sum_{t=0}^{N-1} \gamma^t \nabla_\theta \log \pi(u_t \mid x_t)^T w + J = \sum_{t=0}^{N-1} \gamma^t r(x_t, u_t) \qquad (12)$$
with exactly $\dim \theta + 1$ unknowns. This means that for non-stochastic tasks we can obtain a gradient after $\dim \theta + 1$ rollouts. The complete algorithm is shown in Table 2.
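Eq. (12) turns natural gradient estimation into ordinary linear regression over rollouts: each rollout contributes one row $[\sum_t \gamma^t \nabla_\theta \log \pi(u_t \mid x_t)^T,\; 1]$ and one scalar target $\sum_t \gamma^t r_t$. A minimal sketch of the solve step (our own illustration; names are hypothetical):

```python
import numpy as np

def enac_estimate(score_sums, returns):
    """Solve Eq. (12) in the least-squares sense.

    score_sums: (E, N) array; row e = sum_t gamma^t grad log pi for rollout e.
    returns:    (E,)   array; entry e = sum_t gamma^t r_t for rollout e.
    Returns the natural gradient w and the start-state value estimate J."""
    X = np.hstack([score_sums, np.ones((len(returns), 1))])  # unknowns [w; J]
    sol, *_ = np.linalg.lstsq(X, returns, rcond=None)
    return sol[:-1], sol[-1]
```

With $\dim \theta + 1$ informative rollouts the system is exactly determined, which matches the rollout count quoted above for non-stochastic tasks; with more rollouts the least-squares fit averages out noise.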
3. Properties of NAC
In this section, we emphasize certain properties of the NAC. In particular, we give a simple proof of the covariance of the natural policy gradient, and discuss Kakade's [14] observation that in his experimental settings the natural policy gradient was non-covariant. Furthermore, we discuss another surprising aspect of the NAC, namely its relation to previous algorithms. We briefly demonstrate that established algorithms like the classic Actor-Critic [24] and Bradtke's Q-Learning [10] can be seen as special cases of NAC.
3.1. On the covariance of natural policy gradients
When [14] originally suggested natural policy gradients, he came to the disappointing conclusion that they were not covariant. As a counterexample, he suggested that for two different linear Gaussian policies (one in the normal form and the other in the information form) the probability distributions represented by the natural policy gradient would be affected differently, i.e., the natural policy gradient would be non-covariant. We give a proof at this point showing that the natural policy gradient is in fact covariant under certain conditions, and clarify why [14] experienced these difficulties.
Theorem 1. Natural policy-gradient updates are covariant for two policies $\pi_\theta$ parameterized by $\theta$ and $\pi_h$ parameterized by $h$ if (i) for all parameters $\theta_i$ there exists a function $\theta_i = f_i(h_1, \ldots, h_k)$, and (ii) the derivative $\nabla_h \theta$ and its inverse $\nabla_h \theta^{-1}$ exist.
For the proof see Appendix B. Practical experiments show that the problems observed for Gaussian policies in [14] are in fact due to the selection of the step-size $\alpha$, which determines the length of $\Delta\theta$. As the linearization $\Delta\theta = \nabla_h \theta^T \Delta h$ does not hold for large $\Delta h$, this can cause divergence between the algorithms even for analytically determined natural policy gradients, which can partially explain the difficulties encountered by Kakade [14].
3.2. NAC’s relation to previous algorithms
Original Actor-Critic. Surprisingly, the original Actor-Critic algorithm [24] is a form of the NAC. Consider a Gibbs policy $\pi(u_t \mid x_t) = \exp(\theta_{xu}) / \sum_b \exp(\theta_{xb})$, with all parameters $\theta_{xu}$ lumped into the vector $\theta$ (denoted $\theta = [\theta_{xu}]$), in a discrete setup with tabular representations of transition probabilities and rewards. A linear function approximation $V^\pi(x) = \phi(x)^T v$ with $v = [v_x]$ and unit basis functions $\phi(x) = u_x$ was employed. Sutton et al.'s online update rule is given by
$$\theta^{t+1}_{xu} = \theta^t_{xu} + \alpha_1 \big(r(x,u) + \gamma v_{x'} - v_x\big),$$
$$v^{t+1}_x = v^t_x + \alpha_2 \big(r(x,u) + \gamma v_{x'} - v_x\big),$$
where $\alpha_1$, $\alpha_2$ denote learning rates. The update of the critic parameters $v^t_x$ equals that of the NAC in expectation, as TD(0) critics converge to the same values as LSTD(0) and LSTD-Q(0) for discrete problems [9]. Since for the Gibbs policy we have $\partial \log \pi(b \mid a)/\partial \theta_{xu} = 1 - \pi(b \mid a)$ if $a = x$ and $b = u$, $\partial \log \pi(b \mid a)/\partial \theta_{xu} = -\pi(b \mid a)$ if $a = x$ and $b \neq u$, and $\partial \log \pi(b \mid a)/\partial \theta_{xu} = 0$ otherwise, and as $\sum_b \pi(b \mid x) A(x,b) = 0$, we can evaluate the advantage function and derive
$$A(x,u) = A(x,u) - \sum_b \pi(b \mid x) A(x,b) = \sum_b \frac{\partial \log \pi(b \mid x)}{\partial \theta_{xu}} A(x,b).$$
Since the compatible function approximation represents the advantage function, i.e., $f^\pi_w(x,u) = A(x,u)$, we realize that the advantages equal the natural gradient, i.e., $w = [A(x,u)]$. Furthermore, the TD(0) error of a state-action pair $(x,u)$ equals the advantage function in expectation, and therefore the natural gradient update $w_{xu} = A(x,u) = E_{x'}\{r(x,u) + \gamma V(x') - V(x) \mid x, u\}$ corresponds to the average online update of the Actor-Critic. As both update rules of the Actor-Critic correspond to those of the NAC, we can see both algorithms as equivalent.
1 The true natural policy gradient can also be computed analytically. However, it is not shown, as its difference in performance to the Natural Actor-Critic gradient estimate is negligible.
SARSA. SARSA with a tabular, discrete state-action value function $Q^\pi(x,u)$ and an $\epsilon$-soft policy improvement
$$\pi(u_t \mid x_t) = \exp\big(Q^\pi(x,u)/\epsilon\big) \Big/ \sum_{\tilde{u}} \exp\big(Q^\pi(x,\tilde{u})/\epsilon\big)$$
can also be seen as an approximation of NAC. When treating the table entries as parameters of a policy, $\theta_{xu} = Q^\pi(x,u)$, we realize that the TD update of these parameters corresponds approximately to the natural gradient update, since $w_{xu} = \epsilon A(x,u) \approx \epsilon E_{x'}\{r(x,u) + \gamma Q(x',u') - Q(x,u) \mid x, u\}$. However, the SARSA TD error equals the advantage function only for policies where a single action $u$ has much better action values $Q(x,u)$ than all other actions; for such special cases, $\epsilon$-soft SARSA can be seen as an approximation of NAC. This also corresponds to Kakade's [14] observation that a greedy update step (such as the $\epsilon$-soft greedy update) approximates the natural policy gradient.
Bradtke's Q-Learning. Bradtke [10] proposed an algorithm with policy $\pi(u_t \mid x_t) = \mathcal{N}(u_t \mid k_i^T x_t, \sigma_i^2)$ and parameters $\theta_i = [k_i^T, \sigma_i]^T$ (where $\sigma_i$ denotes the exploration and $i$ the policy update time step) in a linear control task with linear state transitions $x_{t+1} = A x_t + b u_t$ and quadratic rewards $r(x_t, u_t) = x_t^T H x_t + R u_t^2$. They evaluated $Q^\pi(x_t, u_t)$ with
LSTD(0) using a quadratic polynomial expansion as basis functions, and applied greedy updates:
$$k^{\mathrm{Bradtke}}_{i+1} = \arg\max_{k_{i+1}} Q^\pi(x_t, u_t = k_{i+1}^T x_t) = -(R + \gamma b^T P_i b)^{-1} \gamma b^T P_i A,$$
where $P_i$ denotes the policy-specific value function parameters related to the gain $k_i$; no update of the exploration $\sigma_i$ was included. Similarly, we can obtain the natural policy gradient $w = [w_k^T, w_\sigma]^T$, as yielded by LSTD-Q($\lambda$), analytically using the compatible function approximation and the same quadratic basis functions. As discussed in detail in [21], this gives us
$$w_k = \big(\gamma A^T P_i b + (R + \gamma b^T P_i b) k\big)^T \sigma_i^2,$$
$$w_\sigma = 0.5\,(R + \gamma b^T P_i b)\,\sigma_i^3.$$
Similarly, it can be derived that the expected return is $J(\theta_i) = -(R + \gamma b^T P_i b)\sigma_i^2$ for this type of problem, see [21]. For a learning rate $\alpha_i = 1/\|J(\theta_i)\|$, we see
$$k_{i+1} = k_i + \alpha_i w_k = k_i - \big(k_i + (R + \gamma b^T P_i b)^{-1} \gamma A^T P_i b\big) = k^{\mathrm{Bradtke}}_{i+1},$$
which demonstrates that Bradtke's actor update is a special case of the NAC. The NAC extends Bradtke's result, as it gives an update rule for the exploration, which was not possible in Bradtke's greedy framework.
4. Evaluations and applications
In this section, we present several evaluations comparing the episodic NAC architecture with previous algorithms. We compare them in optimization tasks such as Cart-Pole Balancing and in simple motor primitive evaluations. Furthermore, we apply the combination of episodic NAC and the motor primitive framework to a robotic task on a real robot, i.e., 'hitting a T-ball with a baseball bat'.
4.1. Cart-Pole Balancing
Cart-Pole Balancing is a well-known benchmark for reinforcement learning. We assume the cart shown in Fig. 3(a) can be described by
$$m l \ddot{x} \cos\varphi + m l^2 \ddot{\varphi} - m g l \sin\varphi = 0,$$
$$(m + m_c)\ddot{x} + m l \ddot{\varphi}\cos\varphi - m l \dot{\varphi}^2 \sin\varphi = F,$$
with $l = 0.75\,\mathrm{m}$, $m = 0.15\,\mathrm{kg}$, $g = 9.81\,\mathrm{m/s^2}$ and $m_c = 1.0\,\mathrm{kg}$. The resulting state is given by $x = [x, \dot{x}, \varphi, \dot{\varphi}]^T$, and the action is $u = F$. The system is treated as if it were sampled at a rate of $h = 60\,\mathrm{Hz}$, and the reward is given by $r(x,u) = x^T Q x + u^T R u$ with $Q = \mathrm{diag}(1.25, 1, 12, 0.25)$, $R = 0.01$. The policy is specified as $\pi(u \mid x) = \mathcal{N}(Kx, \sigma^2)$. In order to ensure that the learning algorithm cannot exceed an acceptable parameter range, the standard deviation of the policy is defined as $\sigma = 0.1 + 1/(1 + \exp(\eta))$. Thus, the policy parameter vector becomes $\theta = [K^T, \eta]^T$ and has the analytically computable optimal solution $K \approx [5.71, 11.3, -82.1, -21.6]^T$ and $\sigma = 0.1$, corresponding to $\eta \to \infty$. As $\eta \to \infty$ is hard to visualize, we show $\sigma$ in Fig. 3(b), despite the fact that the update takes place over the parameter $\eta$.
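The reparameterization $\sigma = 0.1 + 1/(1 + \exp(\eta))$ keeps the exploration bounded in $(0.1, 1.1)$ no matter how far the unconstrained update moves $\eta$. A quick sketch of the mapping (our own illustration; the function name is hypothetical):

```python
import numpy as np

def sigma_of_eta(eta):
    """Exploration magnitude; bounded in (0.1, 1.1) for any real eta."""
    return 0.1 + 1.0 / (1.0 + np.exp(eta))
```

As $\eta \to \infty$ the exploration decays toward its lower bound 0.1, which is the optimum reported for this task; as $\eta \to -\infty$ it saturates at 1.1, so no gradient step can produce a degenerate or excessive variance.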
being generated using the start-state distributions, transi-tion probabilities, the rewards and the policy. The samplesarrive at a sampling rate of 60Hz, and are immediately sentto the NAC module. The policy is updated when,ðwtþ1;wtÞp� ¼ p=180. At the time of update, the true‘vanilla’ policy gradient, which can be computed analyti-cally,1 is used to update a separate policy. The true ‘vanilla’policy gradients these serve as a baseline for thecomparison. If the pole leaves the acceptable regionof �p=6pfpp=6, and �1:5mpxpþ 1:5m, it is resetto a new starting position drawn from the start-statedistribution.Results are illustrated in Fig. 3. In Fig. 3(b), a sample
run is shown: the NAC algorithms estimates the optimalsolution within less than 10min of simulated robot trialtime. The analytically obtained policy gradient forcomparison takes over 2 h of robot experience to get tothe true solution. In a real world application, a significantamount of time would be added for the vanilla policygradient as it is more unstable and leaves the admissiblearea more often. The policy gradient is clearly outperformed
Fig. 3. This figure shows the performance of the Natural Actor-Critic in the Cart-Pole Balancing framework. In (a), the general setup of the pole mounted on the cart is shown. In (b), a sample learning run of both the Natural Actor-Critic and the true policy gradient is given; the dashed line denotes the Natural Actor-Critic performance while the solid line shows the policy gradient's performance. In (c), the expected return of the policy is shown, averaged over 100 randomly picked policies as described in Section 4.1.
by the NAC algorithm. The performance difference between the true natural gradient and the NAC algorithm is negligible and therefore not shown separately. By the time of the conference, we hope to have this example implemented on a real anthropomorphic robot. In Fig. 3(c), the expected return over updates is shown, averaged over all hundred initial policies.
In this experiment, we demonstrated that the NAC is comparable with the ideal natural gradient and significantly outperforms the 'vanilla' policy gradient. Greedy policy improvement methods do not compare easily. Discretized greedy methods cannot compete, as the amount of data they require would be significantly larger. The only suitable greedy improvement method, to our knowledge, is Bradtke's Adaptive Policy Iteration [10]. However, this method is problematic in real-world applications because the policy in Bradtke's method is deterministic: the estimation of the action-value function becomes an ill-conditioned regression problem with redundant parameters and no explorative noise. Therefore, it can only work in simulated environments with an absence of noise in the state estimates and rewards.
4.2. Motor primitive learning for baseball
This section turns towards optimizing nonlinear dynamic motor primitives for robotics. In [13], a novel form of representing movement plans $(q_d, \dot{q}_d)$ for the degrees of freedom (DOF) of robot systems was suggested in terms of the time evolution of the nonlinear dynamical systems
$$\dot{q}_{d,k} = h(q_{d,k}, z_k, g_k, \tau, \theta_k), \qquad (13)$$
where $(q_{d,k}, \dot{q}_{d,k})$ denote the desired position and velocity of a joint, $z_k$ the internal state of the dynamic system, $g_k$ the goal (or point attractor) state of each DOF, $\tau$ the movement duration shared by all DOFs, and $\theta_k$ the open parameters of the function $h$. The original work in [13] demonstrated how the parameters $\theta_k$ can be learned to match a template trajectory by means of supervised learning; this scenario is, for instance, useful as the first step of an imitation learning system. Here we add the ability of self-improvement of the movement primitives in
Eq. (13) by means of reinforcement learning, which is the crucial second step in imitation learning. The system in Eq. (13) describes a point-to-point movement, i.e., this task is rather well suited for the episodic NAC.
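The function $h$ is only characterized abstractly above. As a rough illustrative sketch (not the exact dynamical system of [13]; the gains, basis functions, and forcing term below are our own assumptions), a point-attractor movement primitive with open parameters $\theta_k$ can be simulated as follows:

```python
import numpy as np

# Illustrative sketch of a point-attractor movement primitive in the spirit
# of Eq. (13). The gains, basis functions, and forcing term are assumptions
# for illustration, not the exact system h of [13].
def rollout_primitive(theta, q0=0.0, g=1.0, tau=1.0, dt=0.002):
    alpha_z, beta_z, alpha_s = 25.0, 6.25, 4.0            # assumed gains
    centers = np.exp(-alpha_s * np.linspace(0.0, 1.0, len(theta)))
    widths = 1.0 / np.diff(centers, append=1e-3) ** 2
    q, z, s = q0, 0.0, 1.0        # position, internal (velocity) state, phase
    traj = [q]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (s - centers) ** 2)        # Gaussian basis functions
        f = s * (g - q0) * (psi @ theta) / (psi.sum() + 1e-10)  # forcing term f(s; theta)
        z += dt / tau * (alpha_z * (beta_z * (g - q) - z) + f)  # transformation system
        q += dt / tau * z
        s += dt / tau * (-alpha_s * s)                    # canonical (phase) system
        traj.append(q)
    return np.array(traj)

# With theta = 0 the forcing term vanishes and the primitive reduces to a
# critically damped point attractor that converges from q0 to the goal g.
traj = rollout_primitive(theta=np.zeros(10))
```

Since the phase $s$ decays to zero, the forcing term vanishes over time and the movement always terminates near the goal; reinforcement learning then only has to adjust the open parameters $\theta$ that shape the trajectory in between.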
In Fig. 4, we show a comparison with GPOMDP for a simple, single-DOF task with a reward of

$$r_k(x_{0:N}, u_{0:N}) = \sum_{i=0}^{N} c_1 \dot{q}_{d,k,i}^2 + c_2 (q_{d,k,N} - g_k)^2,$$
where $c_1 = 1$, $c_2 = 1000$, and $g_k$ is chosen appropriately. In Fig. 4(a), we show how the expected cost decreases for both GPOMDP and the episodic NAC. The positions of the motor primitives are shown in Fig. 4(b) and in Fig. 4(c) the accelerations are given. In Fig. 4(b,c), the dashed line shows the initial configuration, which is accomplished by zero parameters for the motor primitives. The solid line
Fig. 4. This figure illustrates the task accomplished in the toy example. In (a), we show how the expected cost $J(\theta)$ decreases over the number of episodes for both GPOMDP and the episodic Natural Actor-Critic. The positions of the motor primitives are shown in (b) and in (c) the accelerations are given. In (b,c), the dashed line shows the initial configuration, which is accomplished by zero parameters for the motor primitives. The solid line shows the analytically optimal solution, which is unachievable for the motor primitives, but nicely approximated by their best solution, presented by the dark dot-dashed line. This best solution is reached by both learning methods. However, for GPOMDP, this requires approximately $10^6$ learning steps while the NAC takes less than $10^3$ to converge to the optimal solution.

Fig. 5. The figure shows (a) the performance $J(\theta)$ of a baseball swing task when using the motor primitives for learning. In (b), the learning system is initialized by imitation learning, in (c) it is initially failing at reproducing the motor behavior, and (d) after several hundred episodes exhibiting a nicely learned batting.
shows the analytically optimal solution, which is unachievable for the motor primitives, but nicely approximated by their best solution, presented by the dark dot-dashed line. This best solution is reached by both learning methods. However, for GPOMDP, this requires approximately $10^6$ learning steps while the NAC takes less than $10^3$ to converge to the optimal solution.

We also evaluated the same setup in a challenging robot task, i.e., the planning of these motor primitives for a seven-DOF robot. The task of the robot is to hit the ball properly so that it flies as far as possible. Initially, the movement is taught in by supervised learning, as can be seen in Fig. 5(b); however, the robot fails to reproduce the behavior, as shown in Fig. 5(c); subsequently, we improve the performance using the episodic NAC, which yields the performance shown in Fig. 5(a) and the behavior in Fig. 5(d).
5. Conclusion
In this paper, we have summarized novel developments in policy-gradient reinforcement learning, and based on these, we have designed a novel reinforcement learning architecture, the NAC algorithm. This algorithm comes in (at least) two forms, i.e., the LSTD-Q($\lambda$) form, which depends on sufficiently rich basis functions, and the episodic form, which only requires a constant as an additional basis function. We compare both algorithms and apply the latter to several evaluative benchmarks as well as to a baseball swing robot example.
Recently, our NAC architecture [19,21] has gained a lot of traction in the reinforcement learning community. According to Aberdeen, the NAC is the 'current method of choice' [2]. In addition to our work presented at ESANN 2007 in [19] and its earlier, preliminary versions (see, e.g., [22,21,18,20]), the algorithm has found a variety of applications in largely unmodified form in the last year. The current range of additional applications includes optimization of constrained reaching movements of humanoid robots [12], traffic-light system optimization [23], multi-agent system optimization [11,28], conditional random fields [27], and gait optimization in robot locomotion [26,17]. All these new developments indicate that the NAC is about to become a standard architecture in the area of reinforcement learning, as it is among the few approaches which have scaled towards interesting applications.
Appendix A. Fisher information property
Earlier in this paper, we explained that the all-action matrix $F_\theta$ equals in general the Fisher information matrix $G(\theta)$. In [16], we can find the well-known lemma that by differentiating $\int_{\mathbb{R}^n} p(x)\,dx = 1$ twice with respect to the parameters $\theta$, we can obtain

$$\int_{\mathbb{R}^n} p(x)\, \nabla_\theta^2 \log p(x)\, dx = -\int_{\mathbb{R}^n} p(x)\, \nabla_\theta \log p(x)\, \nabla_\theta \log p(x)^T\, dx \tag{A.1}$$

for any probability density function $p(x)$. Furthermore, we can rewrite the probability $p(\tau_{0:n})$ of a rollout or trajectory $\tau_{0:n} = [x_0, u_0, r_0, x_1, u_1, r_1, \ldots, x_n, u_n, r_n, x_{n+1}]^T$ as

$$p(\tau_{0:n}) = p(x_0) \prod_{t=0}^{n} p(x_{t+1} \mid x_t, u_t)\, p(u_t \mid x_t),$$

which implies that

$$\nabla_\theta^2 \log p(\tau_{0:n}) = \sum_{t=0}^{n} \nabla_\theta^2 \log p(u_t \mid x_t).$$
Using Eq. (A.1) and the definition of the Fisher information matrix [3], we can determine the Fisher information matrix for the average reward case by

$$G(\theta) = \lim_{n\to\infty} n^{-1} E_\tau\{\nabla_\theta \log p(\tau_{0:n})\, \nabla_\theta \log p(\tau_{0:n})^T\} = -\lim_{n\to\infty} n^{-1} E_\tau\{\nabla_\theta^2 \log p(\tau_{0:n})\}, \tag{A.2}$$

$$= -\lim_{n\to\infty} n^{-1} E_\tau\left\{\sum_{t=0}^{n} \nabla_\theta^2 \log p(u_t \mid x_t)\right\} = -\int_X d^\pi(x) \int_U p(u \mid x)\, \nabla_\theta^2 \log p(u \mid x)\, du\, dx \tag{A.3}$$

$$= \int_X d^\pi(x) \int_U p(u \mid x)\, \nabla_\theta \log p(u \mid x)\, \nabla_\theta \log p(u \mid x)^T\, du\, dx = F_\theta. \tag{A.4}$$
This proves that the all-action matrix is indeed the Fisher information matrix for the average reward case. For the discounted case with a discount factor $\gamma$, we realize that we can rewrite the problem such that the probability of a rollout is given by

$$p_\gamma(\tau_{0:n}) = p(\tau_{0:n}) \left( \sum_{i=0}^{n} \gamma^i I_{x_i, u_i} \right)$$

and derive that the all-action matrix equals the Fisher information matrix by the same kind of reasoning as in Eq. (A.4). Therefore, we can conclude that in general $G(\theta) = F_\theta$.
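The lemma in Eq. (A.1) is easy to verify numerically. The following sketch (our own illustration, not part of the original derivation) checks the identity by Monte Carlo sampling for a one-dimensional Gaussian with mean parameter $\mu$, where $\nabla_\mu \log p(x) = (x-\mu)/\sigma^2$ and $\nabla_\mu^2 \log p(x) = -1/\sigma^2$:

```python
import numpy as np

# Monte Carlo check of Eq. (A.1) for a 1-D Gaussian p(x; mu) with fixed
# sigma. The identity states
#   E[grad^2 log p] = -E[(grad log p)^2]   (i.e., minus the Fisher information).
rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=200_000)       # samples from p(x; mu)

score = (x - mu) / sigma**2                   # grad_mu log p(x) at each sample
lhs = -1.0 / sigma**2                         # E[grad_mu^2 log p] (constant here)
rhs = -np.mean(score**2)                      # -E[(grad_mu log p)^2]

# Both sides should agree, up to Monte Carlo error, at -1/sigma^2 = -0.25.
```

The same sampling argument extends to trajectories: summing the per-step scores of $p(u_t \mid x_t)$ reproduces the average reward construction in Eqs. (A.2)-(A.4).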
Appendix B. Proof of the covariance theorem
For small parameter changes $\Delta\theta$ and $\Delta\vartheta$ in two parameterizations $\theta$ and $\vartheta$ of the same policy, we have $\Delta\theta = \nabla_\vartheta\theta^T \Delta\vartheta$. If the natural policy gradient is a covariant update rule, a change $\Delta\vartheta$ along the gradient $\nabla_\vartheta J(\vartheta)$ would result in the same change $\Delta\theta$ along the gradient $\nabla_\theta J(\theta)$ for the same scalar step-size $\alpha$. By differentiation, we can obtain $\nabla_\vartheta J(\vartheta) = \nabla_\vartheta\theta\, \nabla_\theta J(\theta)$. It is straightforward to show that the Fisher information matrix includes the Jacobian $\nabla_\vartheta\theta$ twice as a factor:

$$F(\vartheta) = \int_X d^\pi(x) \int_U p(u \mid x)\, \nabla_\vartheta \log p(u \mid x)\, \nabla_\vartheta \log p(u \mid x)^T\, du\, dx$$
$$= \nabla_\vartheta\theta \left( \int_X d^\pi(x) \int_U p(u \mid x)\, \nabla_\theta \log p(u \mid x)\, \nabla_\theta \log p(u \mid x)^T\, du\, dx \right) \nabla_\vartheta\theta^T$$
$$= \nabla_\vartheta\theta\, F(\theta)\, \nabla_\vartheta\theta^T.$$

This shows that the natural gradient in the $\vartheta$ parameterization is given by

$$\widetilde{\nabla}_\vartheta J(\vartheta) = F^{-1}(\vartheta)\, \nabla_\vartheta J(\vartheta) = (\nabla_\vartheta\theta\, F(\theta)\, \nabla_\vartheta\theta^T)^{-1}\, \nabla_\vartheta\theta\, \nabla_\theta J(\theta).$$
This has a surprising implication, as it makes it straightforward to see that the natural policy gradient is covariant since

$$\Delta\theta = \alpha\, \nabla_\vartheta\theta^T\, \Delta\vartheta = \alpha\, \nabla_\vartheta\theta^T\, \widetilde{\nabla}_\vartheta J(\vartheta)$$
$$= \alpha\, \nabla_\vartheta\theta^T (\nabla_\vartheta\theta\, F(\theta)\, \nabla_\vartheta\theta^T)^{-1}\, \nabla_\vartheta\theta\, \nabla_\theta J(\theta)$$
$$= \alpha F^{-1}(\theta)\, \nabla_\theta J(\theta) = \alpha\, \widetilde{\nabla}_\theta J(\theta),$$

assuming that $\nabla_\vartheta\theta$ is invertible. This shows that the natural policy gradient is in fact a covariant gradient update rule.
The assumptions underlying this proof require that the learning rate is very small in order to ensure a covariant gradient descent process. However, single update steps will always be covariant and, thus, this requirement is only formally necessary but barely matters in practice. As in other gradient descent problems, learning rates can be chosen to optimize performance without changing the fact that the direction of a single update step remains covariant.
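The covariance property can also be checked numerically for the simple case of a linear reparameterization $\theta = A\vartheta$, whose Jacobian is the constant matrix $A$. The sketch below uses an arbitrary symmetric positive-definite matrix in place of a true Fisher matrix and a random gradient; it illustrates only the algebra above, not any particular policy:

```python
import numpy as np

# Numerical illustration of the covariance argument for theta = A @ vartheta.
# F_theta is an arbitrary SPD stand-in for a Fisher matrix and grad_theta an
# arbitrary gradient; both are illustrative assumptions.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))                    # invertible Jacobian (generic)
M = rng.normal(size=(3, 3))
F_theta = M @ M.T + 3.0 * np.eye(3)            # SPD stand-in for F(theta)
grad_theta = rng.normal(size=3)                # stand-in for grad_theta J

# Chain rule: gradients and Fisher matrices transform with the Jacobian.
grad_vartheta = A.T @ grad_theta               # grad_vartheta J
F_vartheta = A.T @ F_theta @ A                 # F(vartheta)

# Natural gradient step computed in each coordinate system.
nat_theta = np.linalg.solve(F_theta, grad_theta)
nat_vartheta = np.linalg.solve(F_vartheta, grad_vartheta)

# Mapping the vartheta-step back (Delta theta = A @ Delta vartheta) recovers
# exactly the natural gradient step computed directly in theta coordinates.
assert np.allclose(A @ nat_vartheta, nat_theta)
```

The same back-mapping applied to the vanilla gradient ($A\, A^T \nabla_\theta J$) generally does not recover $\nabla_\theta J$, which is precisely the lack of covariance of ordinary gradient ascent.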
References
[1] D. Aberdeen, Policy-gradient algorithms for partially observable
Markov decision processes, Ph.D. Thesis, Australian National
University, 2003.
[2] D. Aberdeen, POMDPs and policy gradients, in: Proceedings of the
Machine Learning Summer School (MLSS), Canberra, Australia,
2006.
[3] S. Amari, Natural gradient works efficiently in learning, Neural
Comput. 10 (1998) 251–276.
[4] J. Bagnell, J. Schneider, Covariant policy search, in: International
Joint Conference on Artificial Intelligence, 2003.
[5] L.C. Baird, Advantage updating, Technical Report WL-TR-93-1146,
Wright Lab., 1993.
[6] L.C. Baird, A.W. Moore, Gradient descent for general reinforcement
learning, in: Advances in Neural Information Processing Systems,
vol. 11, 1999.
[7] P. Bartlett, An introduction to reinforcement learning theory: value
function methods, in: Machine Learning Summer School, 2002,
pp. 184–202.
[8] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming,
Athena Scientific, Belmont, MA, 1996.
[9] J. Boyan, Least-squares temporal difference learning, in: Machine
Learning: Proceedings of the Sixteenth International Conference,
1999, pp. 49–56.
[10] S. Bradtke, E. Ydstie, A.G. Barto, Adaptive Linear Quadratic
Control Using Policy Iteration, University of Massachusetts,
Amherst, MA, 1994.
[11] O. Buffet, A. Dutech, F. Charpillet, Shaping multi-agent systems with
gradient reinforcement learning, Autonomous Agents Multi-Agent
Syst. 15 (2) (October 2007) 1387–2532.
[12] F. Guenter, M. Hersch, S. Calinon, A. Billard, Reinforcement
learning for imitating constrained reaching movements, RSJ Adv.
Robotics 21 (13) (2007) 1521–1544.
[13] A. Ijspeert, J. Nakanishi, S. Schaal, Learning rhythmic movements by
demonstration using nonlinear oscillators, in: IEEE International
Conference on Intelligent Robots and Systems (IROS 2002), 2002,
pp. 958–963.
[14] S.A. Kakade, Natural policy gradient, in: Advances in Neural
Information Processing Systems, vol. 14, 2002.
[15] V. Konda, J. Tsitsiklis, Actor-critic algorithms, in: Advances in
Neural Information Processing Systems, vol. 12, 2000.
[16] T. Moon, W. Stirling, Mathematical Methods and Algorithms for
Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 2000.
[17] J. Park, J. Kim, D. Kang, An RLS-based Natural Actor-Critic
algorithm for locomotion of a two-linked robot arm, in: Proceedings
of Computational Intelligence and Security: International Conference
(CIS 2005), Xi’an, China, December 2005, pp. 15–19.
[18] J. Peters, S. Schaal, Policy gradient methods for robotics, in:
Proceedings of the IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), Beijing, China, 2006.
[19] J. Peters, S. Schaal, Applying the episodic natural actor-critic
architecture to motor primitive learning, in: Proceedings of the 2007
European Symposium on Artificial Neural Networks (ESANN), 2007.
[20] J. Peters, S. Vijayakumar, S. Schaal, Scaling reinforcement learning
paradigms for motor learning, in: Proceedings of the 10th Joint
Symposium on Neural Computation (JSNC), Irvine, CA, May 2003.
[21] J. Peters, S. Vijayakumar, S. Schaal, Reinforcement learning for
humanoid robotics, in: IEEE International Conference on Humanoid
Robots, 2003.
[22] J. Peters, S. Vijayakumar, S. Schaal, Natural Actor-Critic, in:
Proceedings of the European Machine Learning Conference
(ECML), Porto, Portugal, 2005.
[23] S. Richter, D. Aberdeen, J. Yu, Natural Actor-Critic for road traffic
optimisation, in: Advances in Neural Information Processing
Systems, 2007.
[24] R.S. Sutton, A.G. Barto, Reinforcement Learning, MIT Press,
Cambridge, MA, 1998.
[25] R.S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient
methods for reinforcement learning with function approximation, in:
Advances in Neural Information Processing Systems, vol. 12, 2000.
[26] T. Ueno, Y. Nakamura, T. Shibata, K. Hosoda, S. Ishii, Fast and
stable learning of quasi-passive dynamic walking by an unstable
biped robot based on off-policy Natural Actor-Critic, in: IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS),
2006.
[27] S.V.N. Vishwanathan, X. Zhang, D. Aberdeen, Conditional random
fields for reinforcement learning, in: Y. Bengio, Y. LeCun (Eds.),
Proceedings of the 2007 Snowbird Learning Workshop, San Juan,
Puerto Rico, March 2007.
[28] X. Zhang, D. Aberdeen, S.V.N. Vishwanathan, Conditional random
fields for multi-agent reinforcement learning, in: Proceedings of the
24th International Conference on Machine Learning (ICML 2007),
ACM International Conference Proceeding Series, Corvalis, Oregon,
2007, pp. 1143–1150.
Jan Peters heads the Robot Learning Lab (RoLL) at the Max-Planck Institute for Biological Cybernetics (MPI) while being an invited researcher at the Computational Learning and Motor Control Lab at the University of Southern California (USC). Before joining MPI, he graduated from the University of Southern California with a Ph.D. in Computer Science in March 2007. Jan Peters studied Electrical Engineering, Computer Science and Mechanical Engineering. He holds two German M.Sc. degrees in Informatics and in Electrical Engineering (Diplom-Informatiker from Hagen University and Diplom-Ingenieur from Munich University of Technology/TUM) and two M.Sc. degrees in Computer Science and Mechanical Engineering from the University of Southern California (USC). During his graduate studies, Jan Peters has been a visiting researcher at the Department of Robotics at the German Aerospace Research Center (DLR) in Oberpfaffenhofen, Germany, at Siemens Advanced Engineering (SAE) in Singapore, at the National University of Singapore (NUS), and at the Department of Humanoid Robotics and Computational Neuroscience at the Advanced Telecommunication Research (ATR) Center in Kyoto, Japan. His research interests include robotics, nonlinear control, machine learning, and motor skill learning.
Stefan Schaal is an Associate Professor at the Department of Computer Science and the Neuroscience Program at the University of Southern California, and an Invited Researcher at the ATR Human Information Sciences Laboratory in Japan, where he held an appointment as Head of the Computational Learning Group during an international ERATO project, the Kawato Dynamic Brain Project (ERATO/JST). Before joining USC, Dr. Schaal was a postdoctoral fellow at the Department of Brain and Cognitive Sciences and the Artificial Intelligence Laboratory at MIT, an Invited Researcher at the ATR Human Information Processing Research Laboratories in Japan, and an Adjunct Assistant Professor at the Georgia Institute of Technology and at the Department of Kinesiology of the Pennsylvania State University. Dr. Schaal's research interests include topics of statistical and machine learning, neural networks, computational neuroscience, functional brain imaging, nonlinear dynamics, nonlinear control theory, and biomimetic robotics. He applies his research to problems of artificial and biological motor control and motor learning, focusing on both theoretical investigations and experiments with human subjects and anthropomorphic robot equipment.