
Page 1: Deep parking

Deep parking: an implementation of automatic parking with deep reinforcement learning

Shintaro Shiba, Feb. 2016 - Dec. 2016
Engineer internship at Preferred Networks
Mentors: Abe-san, Fujita-san

Page 2: Deep parking

About me

Shintaro Shiba
• Graduate student at the University of Tokyo
  – Major in neuroscience and animal behavior
• Part-time engineer (internship) at Preferred Networks, Inc.
  – Blog post: https://research.preferred.jp/2017/03/deep-parking/

Page 3: Deep parking

Contents
• Original idea
• Background: DQN and Double DQN
• Task definition
  – Environment: car simulator
  – Agents
    1. Coordinate
    2. Bird's-eye view
    3. Subjective view
• Discussion
• Summary

Page 4: Deep parking

Achievement

[Figure: trajectory of the car agent, and the subjective views (input for DQN) from cameras at 0 deg, -120 deg, and +120 deg.]

Page 5: Deep parking

Original idea: DQN for parking

https://research.preferred.jp/2016/01/ces2016/
https://research.preferred.jp/2015/06/distributed-deep-reinforcement-learning/

Earlier PFN work succeeded in driving smoothly with DQN.
Input: 32 virtual sensors and the 3 previous actions, plus current speed and steering
Output: 9 actions

Is it possible for a car agent to learn to park itself, with camera images as input?

Page 6: Deep parking

Reinforcement learning

[Diagram: the agent sends actions to the environment; the environment returns a state and a reward; a learning algorithm updates the agent.]
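To make the loop in the diagram concrete, here is a minimal, self-contained sketch; the 1-D toy environment and random agent are illustrative stand-ins, not the project's simulator.

```python
# Minimal agent-environment loop. ToyEnv and RandomAgent are
# illustrative stand-ins for the car simulator and DQN agent.
import random

class ToyEnv:
    def reset(self):
        self.pos, self.t = 0.0, 0
        return self.pos
    def step(self, action):            # action: -1 (left), 0 (stay), +1 (right)
        self.pos += action
        self.t += 1
        reached = abs(self.pos - 5.0) < 0.5
        reward = 1.0 if reached else -0.01
        done = reached or self.t >= 100
        return self.pos, reward, done

class RandomAgent:
    def act(self, state):
        return random.choice([-1, 0, 1])
    def observe(self, state, reward, done):
        pass                           # a learning algorithm would update here

env, agent = ToyEnv(), RandomAgent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)                  # agent -> environment
    state, reward, done = env.step(action)     # environment -> state, reward
    agent.observe(state, reward, done)         # learning algorithm updates agent
```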

Page 7: Deep parking

DQN: Deep Q-Network

Volodymyr Mnih et al. 2015

[Algorithm outline: for each episode, for each action, update the Q function.]
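To pin down what "update Q function" means here, a hedged numpy sketch of the DQN learning target from Mnih et al. 2015; the arrays stand in for network outputs over a replay minibatch.

```python
# Hedged sketch of the DQN target: y = r if the episode ends,
# else y = r + gamma * max_a Q'(s', a), with Q' the target network.
import numpy as np

def dqn_targets(q_next, rewards, dones, gamma=0.97):
    # q_next: (batch, n_actions) Q-values of the target network at s'
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

# tiny usage example with made-up numbers
q_next = np.array([[0.2, 0.5, 0.1], [0.0, 0.3, 0.4]])
print(dqn_targets(q_next, np.array([0.0, 1.0]), np.array([0.0, 1.0])))
# -> [0.485 1.   ]
```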

Page 8: Deep parking


Double DQN

Preventing overestimation of Q values

Hado van Hasselt et al. 2015
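A matching sketch of the Double DQN target from van Hasselt et al. 2015: the online network selects the next action and the target network evaluates it, which is what curbs the overestimation that plain DQN's max introduces.

```python
# Hedged sketch of the Double DQN target.
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.97):
    best = q_online_next.argmax(axis=1)                 # select with the online net
    chosen = q_target_next[np.arange(len(best)), best]  # evaluate with the target net
    return rewards + gamma * (1.0 - dones) * chosen
```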

Page 9: Deep parking

Reinforcement learning in this project

[Diagram: the same loop, with the car simulator as the environment, the state given by sensor input, and each agent defined by a different sensor + a different neural network.]

Page 10: Deep parking

Environment: car simulator

Forces modeled:
• Traction
• Air resistance
• Rolling resistance
• Centrifugal force
• Brake
• Cornering force

F = F_traction + F_aero + F_rr + F_c + F_brake + F_cf
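A minimal 1-D (longitudinal) sketch of such a force model, under standard simplifying assumptions (point mass, quadratic drag, linear rolling resistance); all constants are illustrative, not taken from the project's simulator, and the lateral terms (centrifugal, cornering) are omitted.

```python
# Illustrative longitudinal force model; constants are placeholders.
def total_force(v, throttle, braking,
                c_drag=0.4257, c_rr=12.8, f_engine=8000.0, f_brake=10000.0):
    sign = 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)
    f_traction = throttle * f_engine       # driving force from the engine
    f_aero = -c_drag * v * abs(v)          # air resistance, grows with v^2
    f_rolling = -c_rr * v                  # rolling resistance, linear in v
    f_braking = -f_brake * braking * sign  # brake opposes the motion
    return f_traction + f_aero + f_rolling + f_braking

print(total_force(v=10.0, throttle=0.5, braking=0.0))  # net forward force
```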

Page 11: Deep parking

Common specifications: state, action, reward

Input (states)
– Features specific to each agent + car speed, car steering

Output (actions)
– 9 actions: accelerate, decelerate, steer right, steer left, throw (do nothing), accelerate + steer right, accelerate + steer left, decelerate + steer right, decelerate + steer left

Reward
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise (changed afterward)

Goal
– Car inside the goal region; no other conditions such as car direction

Termination
– Time up: 500 actions (changed to 450 afterward)
– Field out: the car leaves the field
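A Python sketch of these rules, assuming hypothetical geometry helpers (goal.contains, field.contains, goal.distance_to) rather than the project's actual API.

```python
# Sketch of the reward and termination rules listed above.
def reward(car, field, goal):
    if goal.contains(car.position):
        return 1.0                      # car is in the goal
    if not field.contains(car.position):
        return -1.0                     # car is out of the field
    return 0.01 - 0.01 * goal.distance_to(car.position)  # shaping term

def terminated(car, field, step, max_steps=500):  # later changed to 450
    return step >= max_steps or not field.contains(car.position)
```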

Page 12: Deep parking

Common specifications: hyperparameters

Maximum episodes: 50,000
Gamma: 0.97
Optimizer: RMSpropGraves
– lr=0.00015, alpha=0.95, momentum=0.95, eps=0.01
– changed afterward to: lr=0.00015, alpha=0.95, momentum=0, eps=0.01
Batch size: 50 or 64
Epsilon: linearly decreased from 1.0 at first to 0.1 at last
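The same hyperparameters as code, assuming Chainer, whose RMSpropGraves optimizer takes exactly these argument names; the epsilon decay horizon is not stated on the slides, so the value below is a placeholder.

```python
import chainer

# RMSpropGraves with the listed settings (momentum=0 in the later runs)
optimizer = chainer.optimizers.RMSpropGraves(
    lr=0.00015, alpha=0.95, momentum=0.95, eps=0.01)
# optimizer.setup(q_network)  # q_network: the agent's model

def epsilon(step, final=0.1, decay_steps=10**6):
    """Linear decay from 1.0 to `final`; decay_steps is a placeholder."""
    return max(final, 1.0 - (1.0 - final) * step / decay_steps)

gamma = 0.97
batchsize = 64          # 50 in some runs
max_episodes = 50000
```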

Page 13: Deep parking

Agents
1. Coordinate
2. Bird's-eye view
3. Subjective view
   – Three cameras
   – Four cameras

Page 14: Deep parking

Coordinate agent: input features
– Relative coordinate value from the car to the goal, e.g. (80, 300)
– Input shape: (2,), normalized

[Figure: car and goal positions on the field.]

Page 15: Deep parking

Coordinate agent: neural network
– Only fully connected layers (3)

[Diagram: coordinates (2) and car parameters (2) as input, two hidden layers of 64 units each, output over the n actions (9).]
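Reading the layer sizes off this diagram, a Chainer sketch of the coordinate agent's network; treat it as one plausible wiring, not the exact implementation.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class CoordinateQNet(chainer.Chain):
    """Input: goal coordinates (2) + car parameters (2); output: 9 Q-values."""
    def __init__(self, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.fc1 = L.Linear(4, 64)
            self.fc2 = L.Linear(64, 64)
            self.fc3 = L.Linear(64, n_actions)

    def __call__(self, x):              # x: (batch, 4)
        h = F.relu(self.fc1(x))
        h = F.relu(self.fc2(h))
        return self.fc3(h)              # one Q-value per action
```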

Page 16: Deep parking

Coordinate agent: result

Page 17: Deep parking

Bird's-eye view agent: input features
– Bird's-eye image of the whole field
– Input size: 80 x 80, normalized

Page 18: Deep parking

Bird's-eye view agent: neural network

[Diagram: 80 x 80 input image, convolutional layers (128, 192 channels), fully connected layers (400, 64 units), with the car parameters (2) merged in, output over the n actions.]
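One plausible reading of this diagram as Chainer code: kernel sizes, strides, and the point where the car parameters are merged in are assumptions, since the exact wiring is not recoverable from the transcript.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class BirdsEyeQNet(chainer.Chain):
    def __init__(self, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(3, 128, ksize=8, stride=4)    # 80x80 -> 19x19
            self.conv2 = L.Convolution2D(128, 192, ksize=4, stride=2)  # 19x19 -> 8x8
            self.fc_img = L.Linear(None, 400)   # None: input size inferred on first call
            self.fc_car = L.Linear(2, 64)       # car parameters (speed, steering)
            self.fc_out = L.Linear(400 + 64, n_actions)

    def __call__(self, image, car_params):
        h = F.relu(self.conv1(image))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc_img(h))
        c = F.relu(self.fc_car(car_params))
        return self.fc_out(F.concat((h, c), axis=1))
```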


Page 20: Deep parking

Bird's-eye view agent: result at 18k episodes

Page 21: Deep parking

Bird's-eye view agent: result after 18k episodes?

By then we had already spent about six months on this agent, so we moved on to the next one…

Page 22: Deep parking

Subjective view agent: input features
– N_of_camera images of subjective view from the car
– Number of cameras: three or four
– FoV = 120 deg

[Figure: example input images for the four-camera agent: front +0 deg, right +90 deg, back +180 deg, left +270 deg.]

Page 23: Deep parking

Subjective view agent: neural network

[Diagram: each 80 x 80 camera image passes through convolutional layers to a 200-unit vector (200 x 3 for three cameras), then fully connected layers of 400 and 256 units, with the car parameters (2) merged in via 64 units, and an output over the n actions.]
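A hedged Chainer sketch of the multi-camera network: a shared conv stack maps each 80 x 80 image to a 200-unit vector ("200 x 3" in the diagram), the vectors are concatenated and passed through 400- and 256-unit layers, and the car parameters enter via a 64-unit layer. The wiring is my reading of the diagram, not a confirmed architecture.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class SubjectiveQNet(chainer.Chain):
    def __init__(self, n_cameras=3, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(3, 32, ksize=8, stride=4)
            self.conv2 = L.Convolution2D(32, 64, ksize=4, stride=2)
            self.fc_cam = L.Linear(None, 200)   # one 200-dim vector per camera
            self.fc1 = L.Linear(200 * n_cameras, 400)
            self.fc2 = L.Linear(400, 256)
            self.fc_car = L.Linear(2, 64)
            self.fc_out = L.Linear(256 + 64, n_actions)

    def __call__(self, images, car_params):    # images: list of (batch, 3, 80, 80)
        feats = [F.relu(self.fc_cam(F.relu(self.conv2(F.relu(self.conv1(im))))))
                 for im in images]              # shared conv stack across cameras
        h = F.relu(self.fc1(F.concat(feats, axis=1)))
        h = F.relu(self.fc2(h))
        c = F.relu(self.fc_car(car_params))
        return self.fc_out(F.concat((h, c), axis=1))
```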


Page 25: Deep parking

Subjective view agent: problems

– Calculation time (GeForce GTX TITAN X)
  • At first: 3 min/ep x 50k ep = 100 days
  • After review by Abe-san: 1.6 min/ep x 50k ep = 55 days
  – The bottleneck was copying and synchronization between GPU and CPU
  – Learning was interrupted as soon as the DNN output diverged
  – (Fortunately) the agent "learned" to reach the goal within ~10k episodes in some trials
– Memory usage
  • DQN needs to store 1M previous inputs
    – 1M x (80 x 80 x 3 ch x 4 cameras)
  • Images were saved to disk and accessed every time
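A rough sizing check, assuming uint8 pixels: one stored observation is 80 x 80 x 3 channels x 4 cameras = 76,800 bytes, so a 1M-entry replay buffer needs roughly 77 GB, well beyond main memory; hence the disk-backed storage.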

Page 26: Deep parking

Subjective view agent: result with three cameras, 6k episodes

[Figure: trajectory of the car agent and the subjective views (input for DQN) at 0 deg, -120 deg, and +120 deg.]

Page 27: Deep parking

Subjective view agent: result with three cameras, 50k episodes

The learned policy seems to be "move anyway"? >> revisit the reward setting.
The agent does not seem able to reach the goal every time; only "easy" goals are achieved >> make the task difficulty variable (curriculum).

[Figure: goals occur frequently in one region of the field.]

Page 28: Deep parking

Subjective view agent: four cameras at 30k episodes

Page 29: Deep parking

Modified reward

Previous
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise

New
– +1 - speed when the car is in the goal
  • in order to stop the car
– -1 when the car is out of the field
– -0.005 otherwise
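The new reward as a sketch, with the same hypothetical helpers as before; subtracting the speed from the goal bonus is what pushes the agent to stop inside the goal rather than drive through it.

```python
def reward_v2(car, field, goal):
    if goal.contains(car.position):
        return 1.0 - car.speed   # reward stopping in the goal, not passing through
    if not field.contains(car.position):
        return -1.0
    return -0.005                # constant step penalty replaces distance shaping
```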

Page 30: Deep parking

Modified difficulty

Difficulty = initial car direction and position
– Constraint
  • Car always starts near the middle of the field
  • Car always starts facing toward the center
– Curriculum
  • The range of initial car directions widens with n, where n = curriculum level (the exact formula was an image lost in transcription)
  • Criterion for advancing: mean reward of 0.6 over 100 episodes

[Figure: goal region with example start directions for n = 1 and n = 2.]
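A sketch of this curriculum bookkeeping; because the direction formula is lost, the sampling rule below (a range widening linearly with n, in 30-degree steps) is an assumption.

```python
from collections import deque
import random

class Curriculum:
    def __init__(self, step_deg=30.0, threshold=0.6, window=100):
        self.n = 1                           # current curriculum level
        self.step_deg = step_deg             # assumed widening per level
        self.threshold = threshold
        self.rewards = deque(maxlen=window)  # last `window` episode rewards

    def initial_direction(self):
        # face toward the field center, +/- a range that grows with n
        return random.uniform(-self.n * self.step_deg, self.n * self.step_deg)

    def report(self, episode_reward):
        self.rewards.append(episode_reward)
        if (len(self.rewards) == self.rewards.maxlen
                and sum(self.rewards) / len(self.rewards) > self.threshold):
            self.n += 1                      # advance to the next level
            self.rewards.clear()
```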

Page 31: Deep parking

Subjective view agent: modifications

N cameras  Reward    Difficulty  Learning result
3          default   default     o at about 6k; x at 50k
3          modified  default     o at about 16k
3          modified  constraint  ? (still learning)
3          modified  curriculum  o (though only at curriculum 1 yet)
4          default   default     x
4          modified  curriculum  △ (not bad, but not yet successful at 6k)

Page 32: Deep parking

Subjective view agent: modifications

Curriculum + three cameras, at curriculum level 1. The advancement criterion needs to be modified.

[Plots: mean reward (0.0 to 1.0) and reward sum (0 to 500) versus episode number (0 to 20k).]

Page 33: Deep parking

Discussion

1. The initial settings included situations where the car could not reach the goal
   – e.g. starting toward the edge of the field
   – This made learning unstable
2. Why was the coordinate agent nevertheless successful, despite facing the same situations?

Page 34: Deep parking

Discussion

3. Comparison between three and four cameras
   – Considering success rate and execution time, three cameras are better
   – Why was the four-camera agent unsuccessful? It may simply need several more trials.
4. DQN often diverged
   – subjectively, in roughly one run out of three, and slightly more often with four cameras
   – This underlines the importance of the dataset for learning (memory size, batch size)

Page 35: Deep parking

Discussion

5. Curriculum
   – Ideally it would be better to quantify the "difficulty of the task"
   – In this case it may be roughly represented by the "bias of the distribution" of the selected actions (accelerate, decelerate, throw (do nothing), steer right, steer left, and the four accelerate/decelerate + steer combinations):
     • equal counts for every action >> go straight
     • a biased distribution of selected actions >> go right or left
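One way to make that "bias of the distribution" concrete, offered as a suggestion rather than something measured in the project: the entropy of the empirical action histogram, which is maximal for a uniform spread over the 9 actions and lower for a policy that favors turning one way.

```python
import math
from collections import Counter

def action_entropy(actions):
    """Entropy (bits) of the empirical distribution of selected actions."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(action_entropy(range(9)))                     # uniform: log2(9) ~ 3.17
print(action_entropy([2, 2, 2, 2, 6, 2, 2, 2, 2]))  # biased: ~0.50
```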

Page 36: Deep parking

Summary

• The car agent can park itself using subjective camera views, though learning is not always stable.
• There is a trade-off between reward design and learning difficulty:
  – Simple reward: difficult to learn
    • Try other algorithms such as A3C
  – Complex reward: difficult to design
    • Try other settings for distance_to_goal