An Application of Reinforcement Learning to Autonomous Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng
Stanford University

Overview
• Autonomous helicopter flight is widely accepted to be a highly challenging control/reinforcement learning (RL) problem.
• Human expert pilots significantly outperform autonomous helicopters.
• Apprenticeship learning algorithms use expert demonstrations to obtain good controllers.
• Our experimental results significantly extend the state of the art in autonomous helicopter aerobatics.

Apprenticeship learning and RL
• Hard to specify the reward function for complex tasks such as helicopter aerobatics.
• Unknown dynamics: flight data is required to obtain an accurate model.
• Apprenticeship learning uses an expert demonstration to help select both the model and the reward function.
• [Diagram: Reward Function R and Dynamics Model P_sa feed into Reinforcement Learning, which outputs a control policy π. A toy sketch of this pipeline follows below.]
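
A toy, hedged illustration of the pipeline above: a dynamics model P_sa and a reward function R go into a generic RL solver (here, plain value iteration on a tiny made-up discrete MDP), which returns a control policy π. Everything in this sketch (sizes, rewards, the solver choice) is illustrative and not from the paper; the real problem is continuous and far harder.

import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# Dynamics model P_sa: P[s, a, s'] = probability of reaching s' after action a in state s.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# Reward function R(s): here we simply prefer being in the last state.
R = np.array([0.0, 0.0, 0.0, 1.0])

def reinforcement_learning(P, R, iters=200):
    """Value iteration: return a greedy policy pi(s) for the given model and reward."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R[:, None] + gamma * np.einsum("sat,t->sa", P, V)  # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                                    # pi: state -> action

pi = reinforcement_learning(P, R)
print("control policy:", pi)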

Learning the dynamical model
• [Decision diagram: Do we have a good model of the dynamics? YES: “Exploit”. NO: “Explore”.]
• State of the art: the E3 algorithm, Kearns and Singh (2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002).
• Exploration policies are impractical: they do not even try to perform well.
• Can we avoid explicit exploration and just exploit?

Aggressive manual exploration

Apprenticeship learning of the model
• [Diagram: Expert human pilot flight and autonomous flight produce state-action data (a1, s1, a2, s2, a3, s3, …); from this data the dynamics model P_sa is learned; together with the reward function R, reinforcement learning then produces a control policy π, which is flown autonomously to collect further data for re-learning P_sa. A toy sketch of this loop follows below.]
• Theorem [Abbeel & Ng, 2005]. The described procedure returns a policy as good as the expert’s policy in a polynomial number of iterations.
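
Below is a hedged, toy sketch of the iterate on this slide, on a 1-D linear system: fit a dynamics model to all flight data gathered so far, optimize a controller against the fitted model (a scalar LQR stands in for the RL step), fly the new controller, append the data, and repeat. The system, the "stick jitter", and all constants are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
A_TRUE, B_TRUE, NOISE = 0.9, 0.5, 0.01        # the unknown "true" dynamics x' = a x + b u + w

def fly(gain, n_steps=200):
    """Fly the linear policy u = -gain * x on the true system; log (x, u, x') tuples."""
    x, data = 1.0, []
    for _ in range(n_steps):
        u = -gain * x + 0.05 * rng.standard_normal()   # small stick jitter keeps the data informative
        x_next = A_TRUE * x + B_TRUE * u + NOISE * rng.standard_normal()
        data.append((x, u, x_next))
        x = x_next
    return data

def fit_model(data):
    """Least-squares fit of x' ~ a x + b u from all logged flight data."""
    X = np.array([[x, u] for x, u, _ in data])
    y = np.array([xn for _, _, xn in data])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return a, b

def lqr_gain(a, b, q=1.0, r=0.1, iters=500):
    """Scalar infinite-horizon LQR gain for the fitted model (stand-in for the RL step)."""
    p = q
    for _ in range(iters):                        # iterate the scalar Riccati recursion
        k = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * k
    return k

data = fly(gain=1.0)                              # 1. start from the (hand-flown) demonstration
for it in range(5):
    a_hat, b_hat = fit_model(data)                # 2. learn the dynamics from all data so far
    k = lqr_gain(a_hat, b_hat)                    # 3. optimize a policy against the learned model
    data += fly(k)                                # 4. fly it autonomously, add the data, repeat
    print(f"iter {it}: a_hat={a_hat:.3f}, b_hat={b_hat:.3f}, gain={k:.3f}")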

Learning the dynamics model
• Details of the algorithm for learning the dynamics model:
  • Gravity subtraction [Abbeel, Ganapathi & Ng, 2005] (a brief sketch follows below)
  • Lagged criterion [Abbeel & Ng, 2004]
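
A brief, hedged sketch of what gravity subtraction means here, as I read the cited reference: gravity's contribution to the body-frame acceleration is known exactly once the orientation is known, so it is removed before fitting the unknown part of the dynamics (and added back when simulating). The rotation, data, and z-up frame convention below are assumptions for illustration only.

import numpy as np

G_WORLD = np.array([0.0, 0.0, -9.81])            # gravity in the world frame (m/s^2), z up (assumed)

def subtract_gravity(acc_body, R_body_to_world):
    """Remove gravity, expressed in the body frame, from an observed body-frame acceleration."""
    g_body = R_body_to_world.T @ G_WORLD         # rotate gravity from the world into the body frame
    return acc_body - g_body

# Near-hover with a level attitude: the observed net acceleration is ~0, so the residual
# (~ +9.81 in z) is the part the learned thrust/drag model must explain.
residual = subtract_gravity(np.array([0.0, 0.0, -0.1]), np.eye(3))
print(residual)
# The residuals would then be regressed on state/control features, e.g. with np.linalg.lstsq.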

Autonomous nose-in funnel

Autonomous tail-in funnel

Apprenticeship learning: reward
• Hard to specify the reward function for complex tasks such as helicopter aerobatics.
• [Diagram: Reward Function R and Dynamics Model P_sa feed into Reinforcement Learning, which outputs a control policy π.]

Example task: flip
• Ideal flip: rotate 360 degrees around the horizontal axis going right to left through the helicopter.
• [Figure: eight snapshots (1-8) of the flip, annotated with the main-rotor thrust T and gravity g at each phase of the rotation.]

Example task: flip (2)
• Specify the flip task as:
  • an idealized trajectory, plus
  • a reward function that penalizes deviation from it (a minimal sketch follows below).
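
A minimal sketch of a reward of this kind: the negative weighted squared deviation from the idealized target state at each time step. The state layout and weights below are illustrative, not the paper's actual choices.

import numpy as np

def tracking_reward(state, target, weights):
    """Negative weighted squared error between the current state and the idealized state."""
    err = state - target
    return -float(err @ (weights * err))

# Example layout: state = [position(3), velocity(3), orientation(3), angular rate(3)].
weights = np.concatenate([np.full(3, 1.0),    # position error
                          np.full(3, 0.1),    # velocity error
                          np.full(3, 5.0),    # orientation error
                          np.full(3, 0.5)])   # angular-rate error
r = tracking_reward(np.zeros(12), np.full(12, 0.1), weights)
print(r)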

Example of a bad reward function

Apprenticeship learning for the reward function
• Our approach:
  • Observe the expert’s demonstration of the task.
  • Infer the reward function from the demonstration. [see also Ng & Russell, 2000]
• Algorithm: iterate for t = 1, 2, … (a toy sketch follows below)
  • Inverse RL step: estimate the expert’s reward function R(s) = w^T φ(s) such that under R(s) the expert outperforms all previously found policies {π_i}.
  • RL step: compute the optimal policy π_t for the estimated reward function.
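
A hedged, toy sketch of the iterate above, in the spirit of the projection algorithm of Abbeel & Ng (2004). To stay short, the RL step here just searches a handful of candidate policies whose feature expectations are given; in the real system it is a full policy-optimization step. All numbers are made up.

import numpy as np

mu_expert = np.array([0.8, 0.1, 0.6])            # expert's (discounted) feature expectations
candidates = {                                   # feature expectations of a few candidate policies
    "hover":  np.array([0.2, 0.9, 0.1]),
    "sloppy": np.array([0.5, 0.4, 0.3]),
    "decent": np.array([0.7, 0.2, 0.5]),
}

def rl_step(w):
    """Stand-in RL step: the candidate policy maximizing w^T mu under reward R(s) = w^T phi(s)."""
    return max(candidates, key=lambda name: w @ candidates[name])

mu_bar = candidates["hover"]                     # feature expectations achieved so far
for t in range(10):
    w = mu_expert - mu_bar                       # inverse RL step: weights separating expert from {pi_i}
    if np.linalg.norm(w) < 1e-3:                 # expert is (nearly) matched: stop
        break
    pi_t = rl_step(w)                            # RL step under the estimated reward
    d = candidates[pi_t] - mu_bar
    mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d   # project mu_expert onto the new line
    print(f"t={t}: policy={pi_t}, gap={np.linalg.norm(mu_expert - mu_bar):.3f}")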

Theoretical Results: Convergence
• Theorem [Abbeel & Ng, 2004]. After a number of iterations polynomial in the number of features and the horizon, the algorithm outputs a policy that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s). (The key step is sketched below.)
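
The key step behind this guarantee, as in Abbeel & Ng (2004), is that with a linear reward the expected return is linear in a policy's discounted feature expectations, so matching the expert's feature expectations bounds the performance gap for every reward in the class. A short sketch, assuming the standard normalization ||w*||_2 <= 1:

% With R*(s) = w*^T phi(s), the return of a policy is linear in its feature expectations
% mu(pi) = E[ sum_t gamma^t phi(s_t) | pi ]:
\[
  V(\pi) \;=\; E\Big[\sum_{t} \gamma^{t} R^{*}(s_t) \,\Big|\, \pi\Big] \;=\; w^{*\top}\mu(\pi).
\]
% So if the algorithm returns pi with ||mu(pi) - mu_E||_2 <= eps, then by Cauchy-Schwarz,
% for any ||w*||_2 <= 1,
\[
  \big|V(\pi) - V(\pi_E)\big|
  \;=\; \big|w^{*\top}\big(\mu(\pi)-\mu_E\big)\big|
  \;\le\; \|w^{*}\|_2\,\|\mu(\pi)-\mu_E\|_2 \;\le\; \epsilon.
\]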

Overview
• [Diagram: Dynamics Model P_sa and Reward Function R feed into Reinforcement Learning, which outputs a control policy π.]

Optimal control algorithm
• Differential dynamic programming (DDP) [Jacobson & Mayne, 1970; Anderson & Moore, 1989]: an efficient algorithm to (locally) optimize a policy for continuous state/action spaces. (A toy sketch of the backward/forward passes follows below.)
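
To make the backward/forward structure concrete, here is a hedged toy sketch of an iLQR-style pass (DDP with only a first-order expansion of the dynamics), tracking a reference on a double integrator. It illustrates the idea only; the paper's controller, cost terms, and model are far more involved, and all numbers below are made up.

import numpy as np

dt, T = 0.1, 50
A = np.array([[1.0, dt], [0.0, 1.0]])             # double-integrator dynamics x_{t+1} = A x_t + B u_t
B = np.array([[0.0], [dt]])
Q, R, Qf = np.diag([10.0, 1.0]), np.array([[0.1]]), np.diag([100.0, 10.0])
ts = np.arange(T + 1)
x_ref = np.stack([np.sin(0.1 * ts), 0.1 * np.cos(0.1 * ts)], axis=1)   # made-up target [pos, vel]

def rollout(x0, us):
    """Simulate the nominal trajectory for a given control sequence."""
    xs = [x0]
    for u in us:
        xs.append(A @ xs[-1] + B @ u)
    return np.array(xs)

def backward_pass(xs, us):
    """Riccati-style sweep: open-loop corrections k[t] and feedback gains K[t]."""
    Vx, Vxx = Qf @ (xs[T] - x_ref[T]), Qf.copy()
    k, K = np.zeros((T, 1)), np.zeros((T, 1, 2))
    for t in reversed(range(T)):
        Qx  = Q @ (xs[t] - x_ref[t]) + A.T @ Vx
        Qu  = R @ us[t] + B.T @ Vx
        Qxx = Q + A.T @ Vxx @ A
        Quu = R + B.T @ Vxx @ B
        Qux = B.T @ Vxx @ A
        k[t] = -np.linalg.solve(Quu, Qu)
        K[t] = -np.linalg.solve(Quu, Qux)
        Vx  = Qx + K[t].T @ Quu @ k[t] + K[t].T @ Qu + Qux.T @ k[t]
        Vxx = Qxx + K[t].T @ Quu @ K[t] + K[t].T @ Qux + Qux.T @ K[t]
    return k, K

def forward_pass(xs, us, k, K):
    """Re-simulate with the updated controls u + k + K (x_new - x_nominal)."""
    x, xs_new, us_new = xs[0], [xs[0]], []
    for t in range(T):
        u = us[t] + k[t] + K[t] @ (x - xs[t])
        x = A @ x + B @ u
        us_new.append(u)
        xs_new.append(x)
    return np.array(xs_new), np.array(us_new)

us = np.zeros((T, 1))                              # start from an all-zero control sequence
xs = rollout(np.zeros(2), us)
for _ in range(3):                                 # linear-quadratic toy: converges immediately
    k, K = backward_pass(xs, us)
    xs, us = forward_pass(xs, us, k, K)
print("final tracking error:", np.linalg.norm(xs - x_ref))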

DDP design choices and lessons learned
• Simplest reward function: penalize deviation from the target state at each time step. Insufficient: the resulting controllers perform very poorly.
• Penalizing high-frequency control inputs significantly improves the controllers (a sketch of such a cost follows below).
• To allow aggressive maneuvering, we use a two-step procedure:
  • Make a plan off-line.
  • Penalize high-frequency deviations from the planned inputs.
• Penalize integrated orientation error. [See paper for details.]
• Process noise has little influence on the controllers’ performance.
• Observation noise and delay in observations greatly affect the controllers’ performance.
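
A minimal sketch of the kind of cost term described above: tracking error plus a penalty on high-frequency content in the controls, implemented as squared first differences of the deviation from a planned input sequence. Names and weights are illustrative only.

import numpy as np

def trajectory_cost(xs, us, x_ref, u_plan, w_track=1.0, w_u=0.1, w_smooth=5.0):
    """Tracking cost + control-effort cost + penalty on rapid changes of (u - u_plan)."""
    track  = w_track * np.sum((xs - x_ref) ** 2)
    effort = w_u * np.sum((us - u_plan) ** 2)
    dev    = us - u_plan
    smooth = w_smooth * np.sum(np.diff(dev, axis=0) ** 2)   # high-frequency (first-difference) penalty
    return track + effort + smooth

# Example with made-up arrays (T = 50 steps, 2 state dims, 1 control dim):
T = 50
rng = np.random.default_rng(0)
print(trajectory_cost(np.zeros((T + 1, 2)), rng.standard_normal((T, 1)),
                      np.zeros((T + 1, 2)), np.zeros((T, 1))))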

Autonomous stationary flips

Autonomous stationary rolls

Related work
• Bagnell & Schneider, 2001; La Civita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
• The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.

Conclusion
• Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments.
• A procedure based on inverse RL for the reward function gives performance similar to human pilots.
• Our results significantly extend the state of the art in autonomous helicopter flight: first autonomous completion of stationary flips and rolls, tail-in funnels and nose-in funnels.

Acknowledgments
• Ben Tse, Garett Oku, Antonio Genova.
• Mark Woodward, Tim Worley.

Continuous flips