Deep Learning applied to robotics
Presenter: Sungjoon Choi ([email protected])

Contents

For controlling real manipulators:
- Learning Contact-Rich Manipulation Skills with Guided Policy Search
- Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours
- Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection

For playing Atari:
- Playing Atari with Deep Reinforcement Learning
- Human-Level Control through Deep Reinforcement Learning
- Deep Reinforcement Learning with Double Q-Learning

Learning Contact-Rich Manipulation Skills with Guided Policy Search

Sergey Levine, Nolan Wagener, and Pieter Abbeel, ICRA 2015

Video

Introduction
This paper won the ICRA 2015 Best Manipulation Paper Award. But why? What is so great about this paper? Personally, I think the main contribution is a direct policy learning method that can actually train a real-world robot to perform manipulation tasks. That's it? I guess so! By the way, actually training a real-world robot is harder than you might imagine. You will see how brilliant this paper is!

Brief review of MDP and RL

[Figure: the agent-environment loop; the agent chooses an action and receives an observation and a reward]

Brief review of MDP and RL

Remember! The goal of MDP and RL is to find an optimal policy! It is like saying "I will find a function which best satisfies the given conditions." However, learning a function is not an easy problem (in fact, it is impossible unless we use some prior knowledge). So, instead of learning the function itself, most works try to find the parameters of a function by restricting the solution space to a space of certain parametric functions, such as linear functions.
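As a concrete illustration of this restriction (generic textbook notation, not symbols taken from the slides), searching over a family of linear-in-features policies means searching over a parameter vector θ rather than over all possible functions:

```latex
% A linear (in features) parametric policy: phi(s) is a hand-designed feature map.
\[
\pi_\theta(s) = \theta^{\top}\phi(s),
\qquad
\theta^{\star} = \arg\max_{\theta}\;
\mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t}\, r\big(s_t, \pi_\theta(s_t)\big)\Big].
\]
```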

Brief review of MDP and RL
What are the typical impediments in reinforcement learning? In other words, why is it so HARD to find an optimal policy?
1. We are living in a continuous world, not a discrete grid world.
   - In this continuous world, a standard (tabular) MDP cannot be established, so we usually use function approximation to handle this issue.
2. However, linear functions do not work well in practice.
   - And, of course, nonlinear functions are hard to optimize.
3. The (dynamic) model, which is often required, is HARD to obtain.
   - Recall that the value is defined as the expected sum of rewards, and computing that expectation requires the model.
Today's paper tackles all three problems listed above!
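For reference, the "expected sum of rewards" mentioned above is the usual definition of the value function; a minimal statement in standard notation (the discount factor γ is my addition, it is not on the slide):

```latex
% Value of a policy pi: the expected (discounted) sum of rewards starting from state s.
\[
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],
\qquad
\pi^{\star} = \arg\max_{\pi} V^{\pi}(s)\ \text{ for all } s.
\]
```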

Big Picture (which might be wrong)
RL: Reinforcement Learning, IRL: Inverse Reinforcement Learning (= IOC), LfD: Learning from Demonstration, DPL: Direct Policy Learning

[Diagram: how Guided Policy Search relates to RL, DPL, IRL (= IOC), and LfD]

RL
- Objective: find the optimal policy
- Given: reward, dynamic model
- Not given: policy
- Algorithms: policy iteration, value iteration, TD learning, Q-learning

IRL (= IOC)
- Objective: find the underlying reward, then the optimal policy
- Given: expert demonstrations (often the dynamic model)
- Not given: reward, policy
- Algorithms: MaxEnt IRL, MaxMargin planning, apprenticeship learning

DPL
- Objective: find the optimal policy
- Given: expert demonstrations
- Not given: reward, dynamic model (not always)
- Algorithms: guided policy search

LfD
- Objective: find the underlying reward, then the optimal policy
- Given: expert demonstrations + others
- Not given: dynamic model (not always)
- Algorithms: GP motion controller

IOC: Inverse Optimal Control

[Flow of prior work leading up to this paper]
- MDP is powerful, but it requires heavy computation to find the value function → LMDP [1]
- Let's use the LMDP in the inverse optimal control problem [2]
- How can we measure the probability of an expert's state-action sequence? [3]
- Can we learn a nonlinear reward function? [4]
- Can we do that with locally optimal examples? [5]
- Given the reward (note that the reward is given!), how can we effectively learn the optimal policy? [6]
- Re-formalize the guided policy search [7]
- Let's learn both the dynamic model and the policy [8]
- Image-based control with a CNN [9]
- Applied to a real-world robot, the PR2 [10]
- How can we effectively search for the optimal policy? [11] (latest)
The beginning of a new era: RL + deep learning!

[1] Emanuel Todorov. "Linearly-solvable Markov decision problems." NIPS 2006.
[2] Krishnamurthy Dvijotham and Emanuel Todorov. "Inverse optimal control with linearly-solvable MDPs." ICML 2010.
[3] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum Entropy Inverse Reinforcement Learning." AAAI 2008.
[4] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." NIPS 2011.
[5] Sergey Levine and Vladlen Koltun. "Continuous inverse optimal control with locally optimal examples." ICML 2012.
[6] Sergey Levine and Vladlen Koltun. "Guided policy search." ICML 2013.
[7] Sergey Levine and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014.
[8] Sergey Levine and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
[9] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-End Training of Deep Visuomotor Policies." ICRA 2015.
[10] Sergey Levine, Nolan Wagener, and Pieter Abbeel. "Learning Contact-Rich Manipulation Skills with Guided Policy Search." ICRA 2015.
[11] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. "Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models." arXiv 2015.

Learning Contact-Rich Manipulation Skills with Guided Policy Search
The main building block is Guided Policy Search (GPS). GPS is a two-stage algorithm consisting of a trajectory optimization stage and a policy learning stage (Levine & Koltun, ICML 2013; ICML 2014). GPS is a direct policy search algorithm that can effectively scale to high-dimensional systems (Levine & Abbeel, NIPS 2014).
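To make the two-stage structure concrete, here is a minimal, self-contained Python sketch of the same data flow on a toy 1-D linear system. This is only an illustration under strong simplifications (known linear dynamics, a least-squares "policy" instead of a neural network, and no importance sampling); none of the names come from the authors' code.

```python
import numpy as np

# Toy sketch of the two-stage idea behind Guided Policy Search on a 1-D linear
# system. This is NOT the authors' algorithm: unknown dynamics, neural-network
# policies, and importance-sampled policy search are all omitted.

# Assumed toy model: x_{t+1} = A x_t + B u_t, reward r = -(Q x^2 + R u^2).
A, B, Q, R, T = 1.0, 1.0, 1.0, 0.01, 20

def lqr_gains(A, B, Q, R, T):
    """Stage 1 helper: finite-horizon LQR backward pass (scalar case)."""
    P, gains = Q, []
    for _ in range(T):
        K = (B * P * A) / (R + B * P * B)   # feedback gain, u = -K x
        P = Q + A * P * A - A * P * B * K   # Riccati recursion
        gains.append(K)
    return list(reversed(gains))            # gains[0] is the gain at t = 0

def optimize_trajectory(x0, gains):
    """Stage 1: roll out the locally optimal controller, collecting (state, action) pairs."""
    samples, x = [], x0
    for K in gains:
        u = -K * x
        samples.append((x, u))
        x = A * x + B * u
    return samples

# Stage 1: trajectory optimization from several start states.
gains = lqr_gains(A, B, Q, R, T)
data = [s for x0 in (-2.0, -1.0, 1.0, 2.0) for s in optimize_trajectory(x0, gains)]

# Stage 2: "policy learning" -- fit a simple parametric policy u = theta * x
# to the guiding samples by least squares (a stand-in for the neural-net stage).
X = np.array([x for x, _ in data])
U = np.array([u for _, u in data])
theta = float(X @ U / (X @ X))
print("learned policy gain:", theta, "| LQR gain at t = 0:", gains[0])
```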

Guided Policy Search
Stage 1) Trajectory optimization (iterative LQR)
Given a reward function and a dynamic model, optimize a set of trajectories; each trajectory consists of (state, action) pairs. (Levine & Koltun, ICML 2013)

Iterative LQR
The iterative linear quadratic regulator (iLQR) optimizes a trajectory by repeatedly solving for the optimal policy under linear-quadratic assumptions: linear dynamics and a quadratic reward. (Levine & Koltun, ICML 2013)
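A minimal statement of the local model that iLQR assumes around the current (nominal) trajectory, written in standard iLQR notation rather than the slides' own symbols; δx_t and δu_t are deviations from the nominal states and actions:

```latex
% Local linear-quadratic approximation around a nominal trajectory (\hat{x}_t, \hat{u}_t).
\[
x_{t+1} \approx f(\hat{x}_t, \hat{u}_t) + f_{x,t}\,\delta x_t + f_{u,t}\,\delta u_t,
\]
\[
r(x_t, u_t) \approx r(\hat{x}_t, \hat{u}_t)
 + r_{x,t}^{\top}\delta x_t + r_{u,t}^{\top}\delta u_t
 + \tfrac{1}{2}\,\delta x_t^{\top} r_{xx,t}\,\delta x_t
 + \delta x_t^{\top} r_{xu,t}\,\delta u_t
 + \tfrac{1}{2}\,\delta u_t^{\top} r_{uu,t}\,\delta u_t .
\]
```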

Iterative LQR
Iteratively compute a trajectory, find a deterministic policy based on that trajectory, and recompute the trajectory until convergence. But this only yields a deterministic policy, and we need something stochastic! By exploiting the concepts of the linearly solvable MDP and maximum entropy control, one can derive the following stochastic policy. (Levine & Koltun, ICML 2013)
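The stochastic policy referred to here is, in the GPS paper, a linear-Gaussian controller whose mean is the deterministic iLQR feedback law and whose noise comes from the curvature of the action-value function; my paraphrase in standard notation (sign conventions flip if Q is a cost rather than a reward):

```latex
% Maximum-entropy (linearly-solvable-MDP) version of the iLQR controller.
\[
\pi_G(u_t \mid x_t) \;\propto\; \exp\big(Q_t(x_t, u_t)\big)
\quad\Longrightarrow\quad
\pi_G(u_t \mid x_t) = \mathcal{N}\!\big(\hat{u}_t + K_t(x_t - \hat{x}_t),\; -Q_{uu,t}^{-1}\big).
\]
```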

Guided Policy Search
Stage 2) Policy learning
From the collected (state, action) pairs, train neural-network controllers using Importance Sampled Policy Search. (Levine & Koltun, ICML 2013)
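In outline, importance sampled policy search estimates the expected return of the neural-network policy π_θ using the samples drawn from the guiding distributions q, and maximizes that estimate over θ. A generic form of such an estimator is sketched below; the actual GPS objective adds normalization and regularization terms that are omitted here:

```latex
% Importance-sampling estimate of the return of pi_theta from guiding samples tau_i ~ q.
\[
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[r(\tau)\big]
\;\approx\; \frac{1}{Z(\theta)} \sum_{i=1}^{m} \frac{\pi_\theta(\tau_i)}{q(\tau_i)}\, r(\tau_i),
\qquad
\theta^{\star} = \arg\max_{\theta} J(\theta).
\]
```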

Experiments
(a) stacking large lego blocks on a fixed base, (b) onto a free-standing block, and (c) onto a block held in both grippers; (d) threading wooden rings onto a tight-fitting peg; (e) assembling a toy airplane by inserting the wheels into a slot; (f) inserting a shoetree into a shoe; (g, h) screwing caps onto pill bottles and (i) onto a water bottle.

Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours
Lerrel Pinto and Abhinav Gupta, ICRA 2016

Video

Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection
Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen, ISER 2016

Video
