
Page 1: Deep parking

Deep parking: an implementation of automatic parking with deep reinforcement learning

Shintaro Shiba, Feb. 2016 - Dec. 2016
Engineer internship at Preferred Networks
Mentors: Abe-san, Fujita-san

Page 2: Deep parking

About me

Shintaro Shiba
• Graduate student at the University of Tokyo
  – Major in neuroscience and animal behavior
• Part-time engineer (internship) at Preferred Networks, Inc.
  – Blog post: https://research.preferred.jp/2017/03/deep-parking/

Page 3: Deep parking

Contents
• Original idea
• Background: DQN and Double DQN
• Task definition
  – Environment: car simulator
  – Agents
    1. Coordinate
    2. Bird's-eye view
    3. Subjective view
• Discussion
• Summary

Page 4: Deep parking

Achievement

[Figure: trajectory of the car agent, and the subjective views (input for DQN) from cameras at 0 deg, -120 deg, and +120 deg.]

Page 5: Deep parking

Original idea: DQN for parking

https://research.preferred.jp/2016/01/ces2016/
https://research.preferred.jp/2015/06/distributed-deep-reinforcement-learning/

Earlier PFN work succeeded in driving smoothly with DQN.
Input: 32 virtual sensors and the 3 previous actions, plus current speed and steering
Output: 9 actions

Is it possible for a car agent to learn to park itself, with camera images as input?

Page 6: Deep parking

Reinforcement learning

[Diagram: the agent sends actions to the environment; the environment returns a state and a reward; a learning algorithm updates the agent.]
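To make the loop in the diagram concrete, here is a minimal, self-contained sketch; the 1-D toy environment and random agent are illustrative stand-ins, not the project's simulator.

```python
# Minimal agent-environment loop. ToyEnv and RandomAgent are
# illustrative stand-ins for the car simulator and DQN agent.
import random

class ToyEnv:
    def reset(self):
        self.pos, self.t = 0.0, 0
        return self.pos
    def step(self, action):            # action: -1 (left), 0 (stay), +1 (right)
        self.pos += action
        self.t += 1
        reached = abs(self.pos - 5.0) < 0.5
        reward = 1.0 if reached else -0.01
        done = reached or self.t >= 100
        return self.pos, reward, done

class RandomAgent:
    def act(self, state):
        return random.choice([-1, 0, 1])
    def observe(self, state, reward, done):
        pass                           # a learning algorithm would update here

env, agent = ToyEnv(), RandomAgent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)                  # agent -> environment
    state, reward, done = env.step(action)     # environment -> state, reward
    agent.observe(state, reward, done)         # learning algorithm updates agent
```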

Page 7: Deep parking

DQN: Deep Q-Network

Volodymyr Mnih et al. 2015

[Algorithm outline: for each episode, for each action, update the Q function.]
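To pin down what "update Q function" means here, a hedged numpy sketch of the DQN learning target from Mnih et al. 2015; the arrays stand in for network outputs over a replay minibatch.

```python
# Hedged sketch of the DQN target: y = r if the episode ends,
# else y = r + gamma * max_a Q'(s', a), with Q' the target network.
import numpy as np

def dqn_targets(q_next, rewards, dones, gamma=0.97):
    # q_next: (batch, n_actions) Q-values of the target network at s'
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

# tiny usage example with made-up numbers
q_next = np.array([[0.2, 0.5, 0.1], [0.0, 0.3, 0.4]])
print(dqn_targets(q_next, np.array([0.0, 1.0]), np.array([0.0, 1.0])))
# -> [0.485 1.   ]
```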

Page 8: Deep parking


Double DQN

Preventing overestimation of Q values

Hado van Hasselt et al. 2015
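A matching sketch of the Double DQN target from van Hasselt et al. 2015: the online network selects the next action and the target network evaluates it, which is what curbs the overestimation that plain DQN's max introduces.

```python
# Hedged sketch of the Double DQN target.
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.97):
    best = q_online_next.argmax(axis=1)                 # select with the online net
    chosen = q_target_next[np.arange(len(best)), best]  # evaluate with the target net
    return rewards + gamma * (1.0 - dones) * chosen
```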

Page 9: Deep parking

Reinforcement learning in this project

[Diagram: the same loop, with the car simulator as the environment, the state given by sensor input, and each agent defined by a different sensor + a different neural network.]

Page 10: Deep parking

Environment: car simulator

Forces modeled:
• Traction
• Air resistance
• Rolling resistance
• Centrifugal force
• Brake
• Cornering force

F = F_traction + F_aero + F_rr + F_c + F_brake + F_cf
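A minimal 1-D (longitudinal) sketch of such a force model, under standard simplifying assumptions (point mass, quadratic drag, linear rolling resistance); all constants are illustrative, not taken from the project's simulator, and the lateral terms (centrifugal, cornering) are omitted.

```python
# Illustrative longitudinal force model; constants are placeholders.
def total_force(v, throttle, braking,
                c_drag=0.4257, c_rr=12.8, f_engine=8000.0, f_brake=10000.0):
    sign = 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)
    f_traction = throttle * f_engine       # driving force from the engine
    f_aero = -c_drag * v * abs(v)          # air resistance, grows with v^2
    f_rolling = -c_rr * v                  # rolling resistance, linear in v
    f_braking = -f_brake * braking * sign  # brake opposes the motion
    return f_traction + f_aero + f_rolling + f_braking

print(total_force(v=10.0, throttle=0.5, braking=0.0))  # net forward force
```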

Page 11: Deep parking

Common specifications: state, action, reward

Input (states)
– Features specific to each agent + car speed, car steering

Output (actions)
– 9 actions: accelerate, decelerate, steer right, steer left, throw (do nothing), accelerate + steer right, accelerate + steer left, decelerate + steer right, decelerate + steer left

Reward
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise (changed afterward)

Goal
– Car inside the goal region; no other conditions such as car direction

Termination
– Time up: 500 actions (changed to 450 afterward)
– Field out: the car leaves the field
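A Python sketch of these rules, assuming hypothetical geometry helpers (goal.contains, field.contains, goal.distance_to) rather than the project's actual API.

```python
# Sketch of the reward and termination rules listed above.
def reward(car, field, goal):
    if goal.contains(car.position):
        return 1.0                      # car is in the goal
    if not field.contains(car.position):
        return -1.0                     # car is out of the field
    return 0.01 - 0.01 * goal.distance_to(car.position)  # shaping term

def terminated(car, field, step, max_steps=500):  # later changed to 450
    return step >= max_steps or not field.contains(car.position)
```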

Page 12: Deep parking

Common specifications: hyperparameters

Maximum episodes: 50,000
Gamma: 0.97
Optimizer: RMSpropGraves
– lr=0.00015, alpha=0.95, momentum=0.95, eps=0.01
– changed afterward to: lr=0.00015, alpha=0.95, momentum=0, eps=0.01
Batch size: 50 or 64
Epsilon: linearly decreased from 1.0 at first to 0.1 at last
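The same hyperparameters as code, assuming Chainer, whose RMSpropGraves optimizer takes exactly these argument names; the epsilon decay horizon is not stated on the slides, so the value below is a placeholder.

```python
import chainer

# RMSpropGraves with the listed settings (momentum=0 in the later runs)
optimizer = chainer.optimizers.RMSpropGraves(
    lr=0.00015, alpha=0.95, momentum=0.95, eps=0.01)
# optimizer.setup(q_network)  # q_network: the agent's model

def epsilon(step, final=0.1, decay_steps=10**6):
    """Linear decay from 1.0 to `final`; decay_steps is a placeholder."""
    return max(final, 1.0 - (1.0 - final) * step / decay_steps)

gamma = 0.97
batchsize = 64          # 50 in some runs
max_episodes = 50000
```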

Page 13: Deep parking

Agents
1. Coordinate
2. Bird's-eye view
3. Subjective view
   – Three cameras
   – Four cameras

Page 14: Deep parking

Coordinate agent: input features
– Relative coordinate value from the car to the goal, e.g. (80, 300)
– Input shape: (2,), normalized

[Figure: car and goal positions on the field.]

Page 15: Deep parking

Coordinate agent: neural network
– Only fully connected layers (3)

[Diagram: coordinates (2) and car parameters (2) as input, two hidden layers of 64 units each, output over the n actions (9).]
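Reading the layer sizes off this diagram, a Chainer sketch of the coordinate agent's network; treat it as one plausible wiring, not the exact implementation.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class CoordinateQNet(chainer.Chain):
    """Input: goal coordinates (2) + car parameters (2); output: 9 Q-values."""
    def __init__(self, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.fc1 = L.Linear(4, 64)
            self.fc2 = L.Linear(64, 64)
            self.fc3 = L.Linear(64, n_actions)

    def __call__(self, x):              # x: (batch, 4)
        h = F.relu(self.fc1(x))
        h = F.relu(self.fc2(h))
        return self.fc3(h)              # one Q-value per action
```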

Page 16: Deep parking

Coordinate agent: result

Page 17: Deep parking

Bird's-eye view agent: input features
– Bird's-eye image of the whole field
– Input size: 80 x 80, normalized

Page 18: Deep parking

Bird's-eye view agent: neural network

[Diagram: 80 x 80 input image, convolutional layers (128, 192 channels), fully connected layers (400, 64 units), with the car parameters (2) merged in, output over the n actions.]
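One plausible reading of this diagram as Chainer code: kernel sizes, strides, and the point where the car parameters are merged in are assumptions, since the exact wiring is not recoverable from the transcript.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class BirdsEyeQNet(chainer.Chain):
    def __init__(self, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(3, 128, ksize=8, stride=4)    # 80x80 -> 19x19
            self.conv2 = L.Convolution2D(128, 192, ksize=4, stride=2)  # 19x19 -> 8x8
            self.fc_img = L.Linear(None, 400)   # None: input size inferred on first call
            self.fc_car = L.Linear(2, 64)       # car parameters (speed, steering)
            self.fc_out = L.Linear(400 + 64, n_actions)

    def __call__(self, image, car_params):
        h = F.relu(self.conv1(image))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc_img(h))
        c = F.relu(self.fc_car(car_params))
        return self.fc_out(F.concat((h, c), axis=1))
```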


Page 20: Deep parking

Bird's-eye view agent: result at 18k episodes

Page 21: Deep parking

Bird's-eye view agent: result after 18k episodes?

By then we had already spent about six months on this agent, so we moved on to the next one…

Page 22: Deep parking

Subjective view agent: input features
– N_of_camera images of subjective view from the car
– Number of cameras: three or four
– FoV = 120 deg

[Figure: example input images for the four-camera agent: front +0 deg, right +90 deg, back +180 deg, left +270 deg.]

Page 23: Deep parking

Subjective view agent: neural network

[Diagram: each 80 x 80 camera image passes through convolutional layers to a 200-unit vector (200 x 3 for three cameras), then fully connected layers of 400 and 256 units, with the car parameters (2) merged in via 64 units, and an output over the n actions.]
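A hedged Chainer sketch of the multi-camera network: a shared conv stack maps each 80 x 80 image to a 200-unit vector ("200 x 3" in the diagram), the vectors are concatenated and passed through 400- and 256-unit layers, and the car parameters enter via a 64-unit layer. The wiring is my reading of the diagram, not a confirmed architecture.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class SubjectiveQNet(chainer.Chain):
    def __init__(self, n_cameras=3, n_actions=9):
        super().__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(3, 32, ksize=8, stride=4)
            self.conv2 = L.Convolution2D(32, 64, ksize=4, stride=2)
            self.fc_cam = L.Linear(None, 200)   # one 200-dim vector per camera
            self.fc1 = L.Linear(200 * n_cameras, 400)
            self.fc2 = L.Linear(400, 256)
            self.fc_car = L.Linear(2, 64)
            self.fc_out = L.Linear(256 + 64, n_actions)

    def __call__(self, images, car_params):    # images: list of (batch, 3, 80, 80)
        feats = [F.relu(self.fc_cam(F.relu(self.conv2(F.relu(self.conv1(im))))))
                 for im in images]              # shared conv stack across cameras
        h = F.relu(self.fc1(F.concat(feats, axis=1)))
        h = F.relu(self.fc2(h))
        c = F.relu(self.fc_car(car_params))
        return self.fc_out(F.concat((h, c), axis=1))
```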


Page 25: Deep parking

Subjective view agent: problems

– Calculation time (GeForce GTX TITAN X)
  • At first: 3 min/ep x 50k ep = 100 days
  • After review by Abe-san: 1.6 min/ep x 50k ep = 55 days
  – The bottleneck was copying and synchronization between GPU and CPU
  – Learning was interrupted as soon as the DNN output diverged
  – (Fortunately) the agent "learned" to reach the goal within ~10k episodes in some trials
– Memory usage
  • DQN needs to store 1M previous inputs
    – 1M x (80 x 80 x 3 ch x 4 cameras)
  • Images were saved to disk and accessed every time
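A rough sizing check, assuming uint8 pixels: one stored observation is 80 x 80 x 3 channels x 4 cameras = 76,800 bytes, so a 1M-entry replay buffer needs roughly 77 GB, well beyond main memory; hence the disk-backed storage.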

Page 26: Deep parking

Subjective view agent: result with three cameras, 6k episodes

[Figure: trajectory of the car agent and the subjective views (input for DQN) at 0 deg, -120 deg, and +120 deg.]

Page 27: Deep parking

Subjective view agent: result with three cameras, 50k episodes

The learned policy seems to be "move anyway"? >> revisit the reward setting.
The agent does not seem able to reach the goal every time; only "easy" goals are achieved >> make the task difficulty variable (curriculum).

[Figure: goals occur frequently in one region of the field.]

Page 28: Deep parking

Subjective view agent: four cameras at 30k episodes

Page 29: Deep parking

Modified reward

Previous
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise

New
– +1 - speed when the car is in the goal
  • in order to stop the car
– -1 when the car is out of the field
– -0.005 otherwise
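The new reward as a sketch, with the same hypothetical helpers as before; subtracting the speed from the goal bonus is what pushes the agent to stop inside the goal rather than drive through it.

```python
def reward_v2(car, field, goal):
    if goal.contains(car.position):
        return 1.0 - car.speed   # reward stopping in the goal, not passing through
    if not field.contains(car.position):
        return -1.0
    return -0.005                # constant step penalty replaces distance shaping
```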

Page 30: Deep parking

Modified difficulty

Difficulty = initial car direction and position
– Constraint
  • Car always starts near the middle of the field
  • Car always starts facing toward the center
– Curriculum
  • The range of initial car directions widens with n, where n = curriculum level (the exact formula was an image lost in transcription)
  • Criterion for advancing: mean reward of 0.6 over 100 episodes

[Figure: goal region with example start directions for n = 1 and n = 2.]
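A sketch of this curriculum bookkeeping; because the direction formula is lost, the sampling rule below (a range widening linearly with n, in 30-degree steps) is an assumption.

```python
from collections import deque
import random

class Curriculum:
    def __init__(self, step_deg=30.0, threshold=0.6, window=100):
        self.n = 1                           # current curriculum level
        self.step_deg = step_deg             # assumed widening per level
        self.threshold = threshold
        self.rewards = deque(maxlen=window)  # last `window` episode rewards

    def initial_direction(self):
        # face toward the field center, +/- a range that grows with n
        return random.uniform(-self.n * self.step_deg, self.n * self.step_deg)

    def report(self, episode_reward):
        self.rewards.append(episode_reward)
        if (len(self.rewards) == self.rewards.maxlen
                and sum(self.rewards) / len(self.rewards) > self.threshold):
            self.n += 1                      # advance to the next level
            self.rewards.clear()
```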

Page 31: Deep parking

Subjective view agent: modifications

N cameras  Reward    Difficulty  Learning result
3          default   default     o at about 6k; x at 50k
3          modified  default     o at about 16k
3          modified  constraint  ? (still learning)
3          modified  curriculum  o (though only at curriculum 1 yet)
4          default   default     x
4          modified  curriculum  △ (not bad, but not yet successful at 6k)

Page 32: Deep parking

Subjective view agent: modifications

Curriculum + three cameras, at curriculum level 1. The advancement criterion needs to be modified.

[Plots: mean reward (0.0 to 1.0) and reward sum (0 to 500) versus episode number (0 to 20k).]

Page 33: Deep parking

Discussion

1. The initial settings included situations where the car could not reach the goal
   – e.g. starting toward the edge of the field
   – This made learning unstable
2. Why was the coordinate agent nevertheless successful, despite facing the same situations?

Page 34: Deep parking

Discussion

3. Comparison between three and four cameras
   – Considering success rate and execution time, three cameras are better
   – Why was the four-camera agent unsuccessful? It may simply need several more trials.
4. DQN often diverged
   – subjectively, in roughly one run out of three, and slightly more often with four cameras
   – This underlines the importance of the dataset for learning (memory size, batch size)

Page 35: Deep parking

Discussion

5. Curriculum
   – Ideally it would be better to quantify the "difficulty of the task"
   – In this case it may be roughly represented by the "bias of the distribution" of the selected actions (accelerate, decelerate, throw (do nothing), steer right, steer left, and the four accelerate/decelerate + steer combinations):
     • equal counts for every action >> go straight
     • a biased distribution of selected actions >> go right or left
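One way to make that "bias of the distribution" concrete, offered as a suggestion rather than something measured in the project: the entropy of the empirical action histogram, which is maximal for a uniform spread over the 9 actions and lower for a policy that favors turning one way.

```python
import math
from collections import Counter

def action_entropy(actions):
    """Entropy (bits) of the empirical distribution of selected actions."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(action_entropy(range(9)))                     # uniform: log2(9) ~ 3.17
print(action_entropy([2, 2, 2, 2, 6, 2, 2, 2, 2]))  # biased: ~0.50
```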

Page 36: Deep parking

Summary

• The car agent can park itself using subjective camera views, though learning is not always stable.
• There is a trade-off between reward design and learning difficulty:
  – Simple reward: difficult to learn
    • Try other algorithms such as A3C
  – Complex reward: difficult to design
    • Try other settings for distance_to_goal