Imitation Learning for Autonomous Driving in TORCS
Final Report
Yasunori Kudo, Mitsuru Kusumoto, Yasuhiro Fujita
SP Team
Imitation Learning
Imitation learning is an approach to the sequential prediction problem in which expert demonstrations of good behavior are used to learn a controller.
In standard reinforcement learning, agents need to explore the environment many times to obtain a good policy. However, sample efficiency is crucial in real environments, and expert demonstrations can help with this issue.
Examples:
• Legged locomotion [Ratliff 2006]
• Outdoor navigation [Silver 2008]
• Car driving [Pomerleau 1989]
• Helicopter flight [Abbeel 2007]
Where we’ll go
DAgger: Dataset Aggregation (DAGGER)
[Figure (Ross et al.): depiction of the DAGGER procedure for imitation learning in a driving scenario: execute the current policy and query the expert for steering labels on the visited states, aggregate the new data with all previous data, and train a new policy by supervised learning.]
[Figure (Ross et al.): diagram of the DAGGER algorithm with a general no-regret online learner (e.g., gradient descent): collect data by executing the learned policy π̂_i, query the expert, feed the data to the online learner, and return the best learned policy when done.]
Initial policies, trained with relatively few data points, may make many more mistakes and visit states that become irrelevant as the policy improves, so the algorithm can use the expert more often in early iterations. We typically use β₁ = 1 so that we do not have to specify an initial policy π̂₁ before getting data from the expert's behavior. We can then choose β_i = p^(i−1) so that the probability of using the expert decays exponentially, as in SMILe and SEARN. The only requirement is that {β_i} be a sequence such that β̄_N = (1/N) Σ_{i=1}^N β_i → 0 as N → ∞. The simple, parameter-free version of the algorithm uses the special case β_i = I(i = 1), with I the indicator function, which often performs best in practice.
Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
DAgger: Dataset Aggregation (the DAGGER algorithm)
Initialize D ← ∅.
Initialize π̂₁ to any policy in Π.
for i = 1 to N do
    Let π_i = β_i π* + (1 − β_i) π̂_i.
    Sample T-step trajectories using π_i.
    Get dataset D_i = {(s, π*(s))} of states visited by π_i with actions given by the expert.
    Aggregate datasets: D ← D ∪ D_i.
    Train classifier π̂_{i+1} on D (or use an online learner to get π̂_{i+1} from the new data D_i).
end for
Return the best π̂_i on validation.
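As a concrete sketch of this loop (not the report's actual code), the snippet below implements the DAGGER procedure in Python; `env`, `expert_action`, `train_classifier`, and `evaluate` are assumed placeholder interfaces, and the β schedule follows the choices discussed above.

```python
import numpy as np

def dagger(env, expert_action, train_classifier, evaluate, N=10, T=1000, p=0.5):
    """Minimal DAgger loop. All helpers (env, expert_action, train_classifier,
    evaluate) are assumed interfaces, not the report's actual code."""
    dataset = []                 # aggregated dataset D
    policy = None                # pi_hat_1 is never executed because beta_1 = 1
    best_policy, best_score = None, -np.inf

    for i in range(1, N + 1):
        # Mixing coefficient: beta_1 = 1, then exponential decay
        # (the parameter-free choice beta_i = I(i = 1) also works)
        beta = 1.0 if i == 1 else p ** (i - 1)
        s = env.reset()
        for _ in range(T):
            # Execute a mixture of the expert and the current learned policy
            if policy is None or np.random.rand() < beta:
                a = expert_action(s)
            else:
                a = policy(s)
            # Label every visited state with the expert's action
            dataset.append((s, expert_action(s)))
            s, _, done, _ = env.step(a)
            if done:
                s = env.reset()
        # Train pi_hat_{i+1} on the whole aggregate dataset D
        policy = train_classifier(dataset)
        # Keep the best policy according to held-out validation
        score = evaluate(env, policy)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy
```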
Analysis
We now provide a complete analysis of the DAGGER procedure and show how the strong no-regret property of online learning procedures can be leveraged, in this interactive learning setting, to obtain good performance guarantees. As with previously analyzed methods such as SMILe and SEARN, we seek to answer the following question: if we can find policies that are good at mimicking the expert on the aggregate dataset collected during training, how well will the learned policy perform the task?
The theoretical analysis of DAGGER relies primarily on viewing the iterative learning in this algorithm as an online learning problem, and on the no-regret property of the underlying Follow-The-Leader algorithm on strongly convex losses (Kakade and Tewari, 2009), which picks the sequence of policies π̂_{1:N}. Hence, the results also hold more generally if any other no-regret online learning algorithm is applied to this imitation learning setting. In particular, the results can be viewed as a reduction of imitation learning to no-regret online learning, where mini-batches of trajectories under a single policy are treated as a single online-learning example. The data aggregation procedure also works more generally whenever the supervised learner applied to the aggregate dataset has sufficient stability properties. We refer the reader to the cited work of Ross et al. for a review of the online learning and no-regret concepts used in this analysis, and for the proofs of the results.
DAgger: Dataset Aggregation (iteration 1)
• Collect new trajectories with the current policy S₁ (S_i: learned steering policy, S*: expert steering).
• Build a new dataset D₁' = {(s, S*(s))} with steering labels from the expert.
• Aggregate datasets: D₁ = D₀ ∪ D₁'.
• Train S₂ on D₁.
DAgger: Dataset Aggregation (iteration 2)
• Collect new trajectories with S₂.
• Build a new dataset D₂' = {(s, S*(s))} with steering labels from the expert.
• Aggregate datasets: D₂ = D₁ ∪ D₂'.
• Train S₃ on D₂.
Because the executed policy mixes the expert policy and the predicted policy, we avoid collecting only states visited under the expert's state distribution.
Experiments
• Pendulum and Pong in OpenAI Gym.
• We compared the performance of DAgger with a standard RL algorithm, REINFORCE.
Toy Problem: Pendulum Swing-Up
● Classical RL benchmark task (nonlinear control)
  ○ Action: torque
  ○ State: (θ, θ̇)
  ○ Reward: cos θ
From "Reinforcement Learning in Continuous Time and Space", Kenji Doya, 2000.

Toy Problem: Pong
● State: 80×80 binary image
● Reward: win +1, lose −1
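As a small illustration of the Pendulum setup, here is a sketch using OpenAI Gym (old 4-tuple step API). Replacing the environment's built-in reward with cos θ, as in Doya (2000), is an assumption about the report's setup, and `Pendulum-v0` with a random placeholder policy is only for demonstration.

```python
import gym

# Minimal sketch (assumption): use Gym's Pendulum observation (cos(theta), sin(theta), theta_dot)
# and replace the built-in reward with cos(theta), matching the slide's definition.
env = gym.make("Pendulum-v0")
obs = env.reset()                       # obs = (cos(theta), sin(theta), theta_dot)
for _ in range(200):
    torque = env.action_space.sample()  # placeholder policy: random torque
    obs, _, done, _ = env.step(torque)
    reward = obs[0]                     # cos(theta): +1 upright, -1 hanging down
    if done:
        break
```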
Experiments - REINFORCE
"REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility" (Williams)
[Slide also repeats the DAgger update D₁ = D₀ ∪ D₁' alongside the REINFORCE update for comparison.]
• Estimate the policy gradient:
  ∇_θ J(θ) = (1/M) Σ_{m=1}^{M} Σ_{t=1}^{T} (R_{m,t} − b) ∇_θ log π(a_{m,t} | s_{m,t}; θ)
• Update the model parameters (α: learning rate):
  θ ← θ + α ∇_θ J(θ)
Notation: θ: model parameters; M: number of episodes; T: number of steps per episode; γ: reward decay (discount used in the return R); R_{m,t}: (discounted) return at step t of episode m; b: baseline; π: policy; a: action; s: state.
Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
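As a concrete illustration of this update, here is a minimal NumPy sketch of one REINFORCE step. The two-action logistic policy, the discounted return-to-go, and the constant baseline are illustrative assumptions rather than the exact setup used in the report.

```python
import numpy as np

def reinforce_update(theta, episodes, alpha=1e-3, gamma=0.99):
    """One REINFORCE step: theta <- theta + alpha * grad J(theta).

    episodes: list of trajectories, each a list of (state, action, reward);
    a logistic policy over two actions (a in {0, 1}) is assumed for illustration.
    """
    grad = np.zeros_like(theta)
    # Simple constant baseline b: mean reward over all collected steps (assumption)
    baseline = np.mean([r for ep in episodes for (_, _, r) in ep])
    for ep in episodes:
        # Discounted return-to-go R_{m,t} for every step of the episode
        G, returns = 0.0, []
        for (_, _, r) in reversed(ep):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, a, _), R in zip(ep, returns):
            p_one = 1.0 / (1.0 + np.exp(-np.dot(theta, s)))  # pi(a=1 | s; theta)
            grad_log_pi = (a - p_one) * s                     # grad of log pi for a logistic policy
            grad += (R - baseline) * grad_log_pi
    grad /= len(episodes)                                     # average over M episodes
    return theta + alpha * grad
```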
Experiments - Multi Agent
[Diagram: three agents at http://192.168.0.1/8080, http://192.168.0.2/8080, and http://192.168.0.3/8080, each with its own environment; every agent collects experience and computes a gradient, and the shared model parameters are updated and sent back to all agents.]
• With 3 agents, training speed is about 3 times faster than with a single agent.
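A minimal sketch of the synchronous pattern suggested by the diagram: each worker runs its own environment, computes a REINFORCE gradient on its local experience, and the averaged gradient updates the shared parameters. The function names and the `collect_gradient` helper are assumptions, not the report's actual implementation.

```python
import numpy as np

def parallel_reinforce_step(theta, workers, collect_gradient, alpha=1e-3):
    """One synchronous update with several workers (sketch, not the report's code).

    workers          -- handles to remote agents, each with its own environment
    collect_gradient -- assumed helper: runs episodes on one worker with the current
                        parameters and returns its REINFORCE gradient
    """
    grads = [collect_gradient(w, theta) for w in workers]  # one gradient per agent
    mean_grad = np.mean(grads, axis=0)                     # average gradients from all agents
    return theta + alpha * mean_grad                       # updated parameters broadcast back
```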
Results - Pendulum
• REINFORCE vs. DAgger.
• Policy network: 3-layer perceptron, 3 → 200 → 2; input: (cos θ, sin θ, θ̇); output: one of two torque actions.
• DAgger needs fewer episodes until convergence!
Results - Pong
• REINFORCE vs. DAgger.
• Policy network: 3-layer perceptron, 6400 → 200 → 2; input: 6400-dimensional (80×80) vector; output: Up or Down.
• The network input is the difference of consecutive frames: s = S_{t+1} − S_t.
• Validation accuracy: 97.04%.
• DAgger needs fewer episodes until convergence!
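For concreteness, here is a sketch of the Pong input pipeline and network described above: frames reduced to 80×80 binary vectors, the frame difference S_{t+1} − S_t as network input, and a 3-layer perceptron (6400 → 200 → 2) giving probabilities for Up/Down. The cropping constants and background-color values are assumptions borrowed from common Pong preprocessing, not taken from the report.

```python
import numpy as np

def preprocess(frame):
    """Reduce a 210x160x3 Atari frame to an 80x80 binary vector (assumed preprocessing)."""
    frame = frame[35:195]                       # crop the playing field
    frame = frame[::2, ::2, 0]                  # downsample to 80x80, keep one channel
    binary = (frame != 144) & (frame != 109)    # erase background colors (assumed values)
    return binary.astype(np.float32).ravel()    # 6400-dim vector

def policy_forward(x_diff, W1, W2):
    """3-layer perceptron 6400 -> 200 -> 2 returning probabilities for (Up, Down)."""
    h = np.maximum(0.0, W1 @ x_diff)            # hidden layer (200 units), ReLU
    logits = W2 @ h                             # 2 outputs
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                      # softmax over {Up, Down}

# Network input is the difference of consecutive preprocessed frames:
# x_diff = preprocess(frame_t1) - preprocess(frame_t0)
```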
Application to TORCS
• TORCS is a car driving simulator game.
• Goal: improve on Yoshida-san's project.
• Train the policy only from the vision sensor.
• 7 training tracks (e.g., track4, track7, track18, …).
• 3 test tracks: track8, track12, track16.
• Approach: imitation learning (expert: a hand-crafted AI), followed by reinforcement learning.
Result: reasonable behavior.
Input: two consecutive frames x_{t-1}, x_t (3×64×64 each)
Discrete actions: steering wheel ∈ {-1, 0, 1}; whether to brake ∈ {0, 1}
or
Continuous actions: steering wheel ∈ [-1, 1]; accel ∈ [0, 1]
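To illustrate the two action parameterizations, here is a sketch of how discrete and continuous action heads could sit on top of a shared vision feature vector. The weight matrices and the feature extractor are placeholders; only the action ranges follow the slide.

```python
import numpy as np

def discrete_action_heads(features, W_steer, W_brake):
    """Discrete heads: steering in {-1, 0, 1} and brake in {0, 1} (softmax over each)."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    p_steer = softmax(W_steer @ features)            # 3-way distribution over {-1, 0, +1}
    p_brake = softmax(W_brake @ features)            # 2-way distribution over {no brake, brake}
    steer = [-1, 0, 1][int(np.argmax(p_steer))]
    brake = int(np.argmax(p_brake))
    return steer, brake

def continuous_action_heads(features, w_steer, w_accel):
    """Continuous heads: steering in [-1, 1] (tanh), accel in [0, 1] (sigmoid)."""
    steer = np.tanh(np.dot(w_steer, features))               # [-1, 1]
    accel = 1.0 / (1.0 + np.exp(-np.dot(w_accel, features)))  # [0, 1]
    return float(steer), float(accel)
```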
Transfer Learning
Results - DAgger in TORCS (discrete vs. continuous actions)
• DAgger works well across different environments (no overfitting!).
• The agent cannot surpass the performance of the expert: most places where the agent fails are places where the expert also fails.
• The expert itself cannot reach the goal on every test track.
• The agent with continuous actions gradually becomes worse.
[Plot label: "Expert can reach".]
Experiments - Transfer Learning
• Experiment 1 (single-play): RL for faster and safer driving.
  Environments: track 0 and track 16.
  Rewards: going off the track ⇒ -1; every 400 (track 0) or 200 (track 8) steps ⇒ mean speed.
• Experiment 2 (self-play): RL for a racing battle.
  Environment: track 0.
  Rewards: going off the track ⇒ -1; being overtaken by the opponent ⇒ -1; overtaking the opponent ⇒ mean speed.
[Network diagram labels: Input, 32, 64, 64; the mean-speed reward is roughly in the range 0 to 2.2.]
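Here is a sketch of the two reward definitions above written as plain functions. The function names, signatures, and the `mean_speed`/`interval` arguments are assumed for illustration; the reward values themselves follow the slide text.

```python
def single_play_reward(off_track, step, mean_speed, interval):
    """Experiment 1 (assumed helper): -1 when off the track, mean speed every
    `interval` steps (400 on track 0, 200 on track 8 per the slide), else 0."""
    if off_track:
        return -1.0
    if step % interval == 0:
        return mean_speed            # roughly 0 to 2.2 in these experiments
    return 0.0

def self_play_reward(off_track, overtaken, overtook, mean_speed):
    """Experiment 2 (assumed helper): -1 when off the track or overtaken,
    mean speed when overtaking the opponent, else 0."""
    if off_track or overtaken:
        return -1.0
    if overtook:
        return mean_speed
    return 0.0
```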
Results - Experiment 1 (single-play)
Track 0 (goal: 400 steps), Track 16 (goal: 1600 steps)
• Transfer learning works well with the REINFORCE algorithm.
• The agent drives better than the expert in terms of both speed and safety.
• A well-trained agent appears to control its speed through steering alone (no braking).
[Plot legend: Expert, Moving Average.]
Results - Experiment 2 (self-play)
• Stage 1 (vs. expert opponent): RL not to be overtaken.
• Stage 2 (self-play 1): RL to overtake.
• Stage 3 (self-play 2): RL not to be overtaken.
[Plot legend: Moving Average.]
Conclusion and Future Work
Conclusion
• DAgger works well in various environments, including TORCS.
• DAgger is very effective as pre-training before RL.
Future Work
• Does imitation learning as pre-training cause the agent to get stuck in local minima?
• Could multi-task learning (e.g., simultaneously predicting whether another car is to the left or right) help train autonomous driving?
Appendix
• Comparison of baselines: with vs. without a baseline.
• Comparison of pre-training by DAgger: with vs. without pre-training.