Apprenticeship Learning for Robotic Control
Pieter Abbeel, Stanford University
Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley
Motivation for apprenticeship learning
Outline
- Preliminary: reinforcement learning.
- Apprenticeship learning algorithms.
- Experimental results on various robotic platforms.
Reinforcement learning (RL)
[Diagram: starting from state s_0, the system dynamics P_sa map each state s_t and action a_t to the next state s_{t+1}, producing the sequence s_0, s_1, ..., s_T and accumulating reward R(s_0) + R(s_1) + ... + R(s_{T-1}) + R(s_T).]
Example reward function: R(s) = -||s - s*||.
Goal: pick actions over time so as to maximize the expected score E[R(s_0) + R(s_1) + ... + R(s_T)].
Solution: a policy, which specifies an action for each possible state, for all times t = 0, 1, ..., T.
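To make the objective concrete, here is a minimal sketch (my illustration, not from the talk) that estimates the expected sum of rewards of a policy by Monte Carlo rollouts; `policy`, `dynamics`, and `reward` are hypothetical stand-ins for π, a sampler from P_sa, and R.

```python
import numpy as np

def expected_return(policy, dynamics, reward, s0, T, n_rollouts=100):
    """Monte Carlo estimate of E[R(s_0) + R(s_1) + ... + R(s_T)] under a policy.

    policy(s)      -> action a          (the policy pi)
    dynamics(s, a) -> next state        (a sample from P_sa)
    reward(s)      -> scalar R(s)
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, reward(s0)
        for t in range(T):
            a = policy(s)
            s = dynamics(s, a)
            ret += reward(s)
        total += ret
    return total / n_rollouts

# Example reward from the slide: R(s) = -||s - s*||, here with a 2-D state.
s_star = np.zeros(2)
reward = lambda s: -np.linalg.norm(s - s_star)
```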
Model-based reinforcement learning
Run the RL algorithm in a simulator to obtain a control policy.
Reinforcement learning (RL)
[Diagram: dynamics model P_sa and reward function R feed a reinforcement learning algorithm, which outputs a control policy π.]
Apprenticeship learning algorithms use a demonstration to help us find
- a good dynamics model,
- a good reward function,
- a good control policy.
Apprenticeship learning for the dynamics model
[Diagram: dynamics model P_sa and reward function R → reinforcement learning → control policy π; here the focus is on the dynamics model.]
Motivating example
Collect flight data and learn a model from the data (vs. a textbook model or a specification) to obtain an accurate dynamics model P_sa.
Questions: How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?
Learning the dynamical model
- State-of-the-art: the E3 algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
- [Diagram: "Have a good model of the dynamics?" YES → "Exploit"; NO → "Explore".]
Learning the dynamical model
- Exploration policies are impractical: they do not even try to perform well.
- Can we avoid explicit exploration and just exploit?
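For intuition about what E3-style exploration looks like (and why an exploration policy "does not even try to perform well"), here is a minimal sketch of the explore-or-exploit rule; the names `visit_counts` and `KNOWN_THRESHOLD` are my own assumptions, not the talk's code.

```python
from collections import defaultdict

KNOWN_THRESHOLD = 50            # visits before a state counts as "known" (assumed value)
visit_counts = defaultdict(int)

def e3_style_action(state, exploit_policy, explore_policy):
    """E3-style rule: exploit where the learned model is trusted, explore elsewhere."""
    if visit_counts[state] >= KNOWN_THRESHOLD:
        return exploit_policy(state)    # act greedily w.r.t. the learned model
    return explore_policy(state)        # deliberately visit poorly-modeled states/actions

def record_visit(state):
    visit_counts[state] += 1
```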
Apprenticeship learning of the model [ICML 2005]
[Diagram: the teacher (a human pilot) flies the helicopter, producing data (a_1, s_1, a_2, s_2, a_3, s_3, ...); learn P_sa from the data; the dynamics model P_sa and reward function R feed reinforcement learning, which outputs a control policy π; autonomous flight with this policy yields more data, from which P_sa is learned again.]
No explicit exploration: always try to fly as well as possible.
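A schematic sketch of that loop under stated assumptions: `learn_dynamics`, `run_rl`, and `fly_and_record` are hypothetical helpers standing in for the model-fitting, model-based RL, and flight steps; this is a paraphrase, not the authors' implementation.

```python
def apprenticeship_learn_model(teacher_data, reward_fn, n_iters=5):
    """Alternate between fitting a dynamics model and flying the resulting policy.

    teacher_data: (state, action, next_state) triples from the human pilot's flight.
    """
    data = list(teacher_data)
    policy, model = None, None
    for _ in range(n_iters):
        model = learn_dynamics(data)          # fit P_sa to all data collected so far
        policy = run_rl(model, reward_fn)     # RL in the learned model
        data += fly_and_record(policy)        # autonomous flight; no explicit exploration
    return policy, model
```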
Theorem.
- Assuming a polynomial number of teacher demonstrations, then after a polynomial number of trials, with probability at least 1 - δ,
  E[sum of rewards | policy returned by algorithm] ≥ E[sum of rewards | teacher's policy] - ε.
- Here, "polynomial" is with respect to
  - 1/ε and 1/δ,
  - the horizon T,
  - the maximum reward R,
  - the size of the state space.
Learning the dynamics model
Details of the algorithm for learning the dynamics model:
- Exploiting structure from physics.
- Lagged learning criterion. [NIPS 2005, 2006]
Helicopter flight results
- First high-speed autonomous funnels.
- Speed: 5 m/s. Nominal pitch angle: 30 degrees.
Autonomous nose-in funnel
Accuracy
Autonomous tail-in funnel
Key points
- Unlike exploration methods, our algorithm concentrates on the task of interest.
- Bootstrapping off an initial teacher demonstration is sufficient to perform the task as well as the teacher.
Apprenticeship learning: reward
[Diagram: dynamics model P_sa and reward function R → reinforcement learning → control policy π; here the focus is on the reward function.]
Example task: driving
Related work
- Previous work:
  - Learn to predict the teacher's actions as a function of states. E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; ...
  - Assumes "policy simplicity."
- Our approach:
  - Assumes "reward simplicity" and is based on inverse reinforcement learning (Ng & Russell, 2000).
  - Similar work since: Ratliff et al., 2006, 2007.
Inverse reinforcement learning
- Find R such that R is consistent with the teacher's policy π* being optimal.
- With a linear parameterization R_w(s) = w^T φ(s), find w such that, for every policy π,
  E[Σ_t R_w(s_t) | π*] ≥ E[Σ_t R_w(s_t) | π].
- Linear constraints in w, quadratic objective → QP. But a very large number of constraints (one per policy).
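One way to write this step as a small convex program is the max-margin formulation from the inverse-RL literature (closely related to the SVM-style QP the slide alludes to). The sketch below is an illustration under my own assumptions, not the talk's code; it uses cvxpy and feature expectations μ(π) = E[Σ_t φ(s_t) | π].

```python
import cvxpy as cp
import numpy as np

def irl_step(mu_teacher, mu_policies):
    """Max-margin inverse-RL step: find weights w (||w||_2 <= 1) under which the
    teacher's feature expectations beat every candidate policy by the largest margin.

    mu_teacher : (k,) feature expectations of the teacher's policy pi*.
    mu_policies: list of (k,) feature expectations of policies found so far.
    Returns (w, margin).
    """
    k = mu_teacher.shape[0]
    w, t = cp.Variable(k), cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ (mu_teacher - mu) >= t for mu in mu_policies]
    cp.Problem(cp.Maximize(t), constraints).solve()
    return w.value, float(t.value)
```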
Algorithm
For i = 1, 2, ...
- Inverse RL step: estimate the reward weights w (the optimization on the previous slide).
- RL step (= constraint generation): compute the optimal policy π_i for the estimated reward R_w.
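Putting the two steps together, a hedged sketch of the outer loop (reusing `irl_step` from the previous sketch; `rl_solver` and `estimate_mu` are assumed user-supplied callables, e.g. a model-based RL solver and a rollout-based feature-expectation estimator):

```python
import numpy as np

def apprenticeship_irl(mu_teacher, rl_solver, estimate_mu, k, n_iters=30, eps=1e-2):
    """Alternate the inverse-RL step with the RL (constraint-generation) step.

    rl_solver(w)    -> policy (near-)optimal for the reward R_w(s) = w . phi(s)
    estimate_mu(pi) -> feature expectations of policy pi
    k               -> number of reward features
    """
    pi = rl_solver(np.random.randn(k))                 # arbitrary initial policy
    mu_policies = [estimate_mu(pi)]
    for i in range(n_iters):
        w, margin = irl_step(mu_teacher, mu_policies)  # inverse-RL step
        if margin <= eps:                              # teacher matched to within eps
            break
        pi = rl_solver(w)                              # RL step = new constraint
        mu_policies.append(estimate_mu(pi))
    return w, pi
```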
Theoretical guarantees [ICML 2004]
- Theorem. After at most nT²/ε² iterations, our algorithm returns a policy that performs as well as the teacher according to the teacher's unknown reward function R*, i.e.,
  E[Σ_t R*(s_t) | returned policy] ≥ E[Σ_t R*(s_t) | teacher's policy] - ε.
- Note: our algorithm does not necessarily recover the teacher's reward function R* (which is impossible to recover).
Performance guarantee intuition
Intuition by example:
- Let R_w(s) = w_1 φ_1(s) + w_2 φ_2(s).
- If the returned policy π satisfies E[Σ_t φ_1(s_t) | π] = E[Σ_t φ_1(s_t) | π*] and E[Σ_t φ_2(s_t) | π] = E[Σ_t φ_2(s_t) | π*],
- then, no matter what the values of w_1 and w_2 are, the policy performs as well as the teacher's policy π*.
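Making the step explicit (my own expansion of the slide's argument, assuming the linear-reward setup above, with μ_i(π) := E[Σ_t φ_i(s_t) | π]):

```latex
E\!\left[\textstyle\sum_t R_w(s_t) \,\middle|\, \pi\right]
  = w_1\,\mu_1(\pi) + w_2\,\mu_2(\pi)
  = w_1\,\mu_1(\pi^*) + w_2\,\mu_2(\pi^*)
  = E\!\left[\textstyle\sum_t R_w(s_t) \,\middle|\, \pi^*\right].
```

So matching the teacher's feature expectations guarantees matching the teacher's expected reward for every choice of weights.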
Case study: Highway driving
Input: driving demonstration. Output: learned behavior.
The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
More driving examples
In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.
Helicopter [NIPS 2007]
[Diagram: reward function R (25 features) and dynamics model P_sa → reinforcement learning → control policy π.]
RL solver: differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989].
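Differential dynamic programming iterates between linearizing the dynamics and taking a quadratic approximation of the cost around the current trajectory; its inner backward pass is essentially time-varying LQR. A minimal sketch of that backward pass (my illustration, with assumed linearizations A_t, B_t and cost matrices Q, R, Q_f; not the talk's implementation):

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Qf):
    """Finite-horizon LQR backward pass: returns gains K_t for a_t = -K_t s_t.

    A, B : lists of linearized dynamics matrices, one pair per time step.
    Q, R : state and action cost matrices; Qf is the terminal state cost.
    DDP re-runs this around the updated trajectory until convergence.
    """
    T = len(A)
    P = Qf
    gains = [None] * T
    for t in reversed(range(T)):
        # Riccati recursion
        K = np.linalg.solve(R + B[t].T @ P @ B[t], B[t].T @ P @ A[t])
        P = Q + A[t].T @ P @ (A[t] - B[t] @ K)
        gains[t] = K
    return gains
```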
Autonomous aerobatics
[Show helicopter movie in Media Player.]
Quadruped
Quadruped
Reward function trades off:
- Height differential of terrain.
- Gradient of terrain around each foot.
- Height differential between feet.
- ... (25 features total for our setup)
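As in the other tasks, the reward is a linear combination of hand-designed features. A hedged sketch of that representation follows; the feature names and the `terrain` interface are placeholders I made up, not the actual 25 features:

```python
import numpy as np

def footstep_features(terrain, foot_xy, other_feet_xy):
    """Hypothetical feature vector for a candidate foot placement."""
    return np.array([
        terrain.height_differential(foot_xy),        # height differential of terrain
        terrain.gradient_magnitude(foot_xy),         # terrain gradient around the foot
        max(abs(terrain.height(foot_xy) - terrain.height(f)) for f in other_feet_xy),
        # ... (25 features total in the actual setup)
    ])

def footstep_reward(w, terrain, foot_xy, other_feet_xy):
    """Linear reward R_w = w . phi, with w learned by hierarchical inverse RL."""
    return w @ footstep_features(terrain, foot_xy, other_feet_xy)
```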
Teacher demonstration for quadruped
- Full teacher demonstration = sequence of footsteps.
- Much simpler to "teach hierarchically":
  - Specify a body path.
  - Specify the best footstep in a small area.
Hierarchical inverse RL
- Quadratic programming problem (QP): quadratic objective, linear constraints.
- Constraint generation for the path constraints.
Experimental setup
- Training:
  - Have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements.
  - Around each foot placement, label the best foot placement (about 20 labels).
  - Label the best body path for the training board.
  - Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
- Test on hold-out terrains:
  - Plan a path across the test board.
Quadruped on test-board
[Show movie in Media Player.]
Apprenticeship learning: RL algorithm
[Diagram: dynamics model P_sa and reward function R → reinforcement learning → control policy π; here the focus is on the RL step.]
- (Sloppy) demonstration.
- (Crude) model.
- Small number of real-life trials.
Experiments
Two systems:
- RC car (control actions: throttle and steering).
- Fixed-wing flight simulator.
RC Car: Circle
RC Car: Figure-8 Maneuver
Conclusion
- Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations.
- Our current work exploits teacher demonstrations to find
  - a good dynamics model,
  - a good reward function,
  - a good control policy.
Acknowledgments
- Adam Coates, Morgan Quigley, Andrew Y. Ng
- J. Zico Kolter, Andrew Y. Ng
- Andrew Y. Ng
- Morgan Quigley, Andrew Y. Ng