Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel, Stanford University
[Joint work with Andrew Y. Ng.]
Overview
• Reinforcement Learning (RL)
• Motivation for Apprenticeship Learning
• Proposed algorithm
• Theoretical results
• Experimental results
• Conclusion
Example of Reinforcement Learning Problem
Highway driving.
RL formalism
[Diagram: the system dynamics generate a state sequence s_0, s_1, …, s_T, accumulating reward R(s_0) + R(s_1) + … + R(s_T).]
• Assume that at each time step, our system is in some state s_t.
• Upon taking an action a_t, our system randomly transitions to some new state s_{t+1}.
• We are also given a reward function R.
• The goal: pick actions over time so as to maximize the expected sum of rewards E[R(s_0) + R(s_1) + … + R(s_T)].
RL formalism
• Markov Decision Process (S, A, P, s_0, R).
• W.l.o.g. we assume the reward is linear in known features: R(s) = w^T φ(s), with ||w||_2 ≤ 1.
• Policy π: a mapping from states to actions.
• Utility of a policy π for reward R = w^T φ: U_w(π) = E[Σ_t R(s_t) | π] = w^T μ(π), where μ(π) = E[Σ_t φ(s_t) | π] are the feature expectations of π.
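The identity U_w(π) = w^T μ(π) can be checked on a toy example. The three-state chain, one-hot features, and weight vector below are made up for illustration (not from the talk):

```python
import numpy as np

# Toy deterministic chain s0 -> s1 -> s2, horizon T = 2, under a fixed policy.
# Features are one-hot state indicators, so mu(pi) simply counts state visits,
# and U_w(pi) = E[sum_t R(s_t)] = w^T mu(pi) when R(s) = w^T phi(s).
phi = np.eye(3)                       # phi(s): one-hot feature vector of state s
trajectory = [0, 1, 2]                # states s_0, s_1, s_2 visited by the policy

mu = sum(phi[s] for s in trajectory)  # feature expectations mu(pi) (here exact)
w = np.array([0.0, 0.6, 0.8])         # reward weights with ||w||_2 <= 1

reward_sum = sum(w @ phi[s] for s in trajectory)  # sum_t R(s_t)
utility = w @ mu                                  # w^T mu(pi)
assert np.isclose(reward_sum, utility)
print(round(utility, 6))              # 1.4
```

Because the features are indicators, the feature expectations here are just visit counts; the linearity of the utility in w is what the inverse RL step exploits.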
Motivation for Apprenticeship Learning
Reinforcement learning (RL) gives powerful tools for solving MDPs. However, it can be difficult to specify the reward function.
Example: highway driving.
Apprenticeship Learning
• Learning from observing an expert.
• Previous work:
– Learn to predict the expert's actions as a function of states.
– Usually lacks strong performance guarantees.
– (E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …)
• Our approach:
– Based on inverse reinforcement learning (Ng & Russell, 2000).
– Returns a policy with performance as good as the expert's, as measured according to the expert's unknown reward function.
Algorithm
For i = 1, 2, …
Inverse RL step: Estimate the expert's reward function R(s) = w^T φ(s) such that under R the expert performs better than all previously found policies {π_j}.
RL step: Compute the optimal policy π_i for the estimated reward w.
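The loop can be sketched end to end on a toy problem. The sketch below uses the projection variant of the inverse RL step (described on a later slide) together with a brute-force RL step; the three-state chain MDP, horizon, expert behavior, and initial policy are all made-up illustrations, not the experimental setup from the talk:

```python
import itertools
import numpy as np

# Hypothetical chain MDP: states {0,1,2}, start state 1, horizon T = 3.
# Action 0 moves left, action 1 moves right (deterministic, clipped at the ends).
# Features are one-hot state indicators; the expert always drives left.
T = 3
phi = np.eye(3)

def rollout_mu(policy):
    """Feature expectations mu(pi) of a deterministic stationary policy."""
    s, mu = 1, phi[1].copy()
    for _ in range(T):
        s = max(0, min(2, s + (1 if policy[s] == 1 else -1)))
        mu += phi[s]
    return mu

policies = list(itertools.product([0, 1], repeat=3))  # all 8 deterministic policies

def rl_step(w):
    """RL step: brute-force the optimal policy for reward R(s) = w^T phi(s)."""
    return max(policies, key=lambda p: w @ rollout_mu(p))

mu_E = rollout_mu((0, 0, 0))          # expert demonstration: always go left

mu_bar = rollout_mu((1, 1, 1))        # mu of an arbitrary initial policy
for _ in range(10):
    w = mu_E - mu_bar                 # inverse RL step: reward estimate
    if np.linalg.norm(w) < 1e-6:
        break
    mu_i = rollout_mu(rl_step(w))     # RL step under the estimated reward
    d = mu_i - mu_bar                 # project mu_E onto line through mu_bar, mu_i
    mu_bar = mu_bar + (d @ (mu_E - mu_bar)) / (d @ d) * d

print(np.linalg.norm(mu_E - mu_bar))  # ~0: matched the expert's feature expectations
```

On a problem this small the loop terminates after one projection, because the expert's own policy lies in the brute-forced policy class.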
Algorithm: Inverse RL step
max_{γ, w : ||w||_2 ≤ 1} γ
s.t. w^T μ(π_E) ≥ w^T μ(π_j) + γ, for j = 0, …, i − 1
(Find a reward under which the expert outperforms every previously found policy by the largest possible margin γ.)
Algorithm: Inverse RL step
This is a quadratic programming problem (the same QP as for an SVM).
Algorithm
[Figure: the algorithm in feature-expectation space (axes μ_1, μ_2). Successive points μ(π_0), μ(π_1), μ(π_2) approach the expert's μ(π_E); each inverse RL step yields a new weight vector w^(1), w^(2), w^(3) separating μ(π_E) from the policies found so far.]
Feature Expectation Closeness and Performance
If we can find a policy π such that ||μ(π_E) − μ(π)||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_2 ≤ 1, we have
|U_{w*}(π_E) − U_{w*}(π)| = |w*^T μ(π_E) − w*^T μ(π)| ≤ ||w*||_2 ||μ(π_E) − μ(π)||_2 ≤ ε.
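The inequality is just Cauchy–Schwarz; a quick numerical check, with arbitrary made-up vectors standing in for the feature expectations and unit-norm reward weights:

```python
import numpy as np

# |U_w*(pi_E) - U_w*(pi)| = |w*^T (mu_E - mu)| <= ||w*||_2 ||mu_E - mu||_2:
# whenever the feature expectations are epsilon-close and ||w*||_2 <= 1, the
# utility gap is at most epsilon, regardless of which w* is the true one.
rng = np.random.default_rng(0)
mu_E = rng.normal(size=5)
mu = mu_E + 0.01 * rng.normal(size=5)           # feature expectations eps-close
eps = np.linalg.norm(mu_E - mu)

for _ in range(100):                             # try many unit-norm rewards
    w = rng.normal(size=5)
    w /= np.linalg.norm(w)
    gap = abs(w @ mu_E - w @ mu)
    assert gap <= eps + 1e-12                    # Cauchy-Schwarz bound holds
```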
Theoretical Results: Convergence
Theorem. Let an MDP (without reward function), a k-dimensional feature vector φ, and the expert's feature expectations μ(π_E) be given. Then after at most kT²/ε² iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s), i.e., U_{w*}(π) ≥ U_{w*}(π_E) − ε.
Theoretical Results: Sampling
In practice, we have to use sampling to estimate the feature expectations of the expert. We still have ε-optimal performance with high probability if the number of observed samples is at least poly(k, 1/ε).
Note: the bound has no dependence on the “complexity” of the policy.
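The sampling step replaces μ(π_E) by an empirical average over m observed trajectories, μ̂ = (1/m) Σ_i Σ_t φ(s_t^(i)). A sketch with a hypothetical two-state Markov chain standing in for the expert-induced dynamics, so the estimate can be compared against the analytic value:

```python
import numpy as np

# Estimate the expert's feature expectations from m sampled trajectories of a
# two-state chain with known transition matrix, and compare with the exact
# value obtained by propagating the state distribution for T + 1 steps.
rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],     # expert-induced Markov chain on states {0, 1}
              [0.2, 0.8]])
phi = np.eye(2)               # one-hot state features
T, m = 10, 5000

mu_hat = np.zeros(2)
for _ in range(m):
    s = 0
    for t in range(T + 1):
        mu_hat += phi[s]
        s = rng.choice(2, p=P[s])
mu_hat /= m                   # empirical feature expectations

d = np.array([1.0, 0.0])      # exact: sum of state distributions over t = 0..T
mu_true = np.zeros(2)
for t in range(T + 1):
    mu_true += d
    d = d @ P
print(mu_hat, mu_true)        # the estimate is close to the analytic value
```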
Gridworld Experiments
• Reward function is piecewise constant over small regions.
• Features φ for IRL are indicators of these small regions.
• 128×128 grid, small regions of size 16×16.
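A feature map of this kind can be sketched as an indicator of which region a grid cell falls in. Only the 128×128 grid and 16×16 region sizes come from the slide; the function name and index layout below are illustrative assumptions:

```python
import numpy as np

def macrocell_feature(x, y, grid=128, cell=16):
    """One-hot indicator of the 16x16 region containing grid cell (x, y).

    With a 128x128 grid this yields k = (128/16)^2 = 64 features, so any
    reward of the form w^T phi(s) is piecewise constant over the 64 regions.
    """
    n = grid // cell                   # regions per side (8)
    phi = np.zeros(n * n)
    phi[(y // cell) * n + (x // cell)] = 1.0
    return phi

assert macrocell_feature(0, 0).argmax() == 0      # top-left region
assert macrocell_feature(127, 127).argmax() == 63  # bottom-right region
assert macrocell_feature(0, 0).size == 64
```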
Case study: Highway driving
Input: driving demonstration. Output: learned behavior.
The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
More driving examples
In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.
Car driving results
(Feature distributions of the expert and the learned policy, and the learned reward weights w, for three driving styles.)

Style         Collision  Off-road L  Left lane  Middle lane  Right lane  Off-road R
1 (expert)     0          0           0.13       0.20         0.60        0.07
1 (learned)    0          0           0.09       0.23         0.60        0.08
w (learned)   -0.08      -0.04        0.01       0.01         0.03       -0.00
2 (expert)     0.12       0           0.06       0.47         0.47        0
2 (learned)    0.13       0           0.10       0.32         0.58        0
w (learned)    0.23      -0.11        0.01       0.05         0.06       -0.01
3 (expert)     0          0           0          0.00         0.71        0.29
3 (learned)    0          0           0          0            0.74        0.26
w (learned)   -0.11      -0.01       -0.06      -0.04         0.09        0.01
Different Formulation
LP formulation for the RL problem:
max_ρ Σ_{s,a} ρ(s,a) R(s)
s.t. ∀s: Σ_a ρ(s,a) = Σ_{s',a} P(s | s',a) ρ(s',a)   (flow conservation on the occupancy measure ρ)
QP formulation for Apprenticeship Learning:
min_ρ Σ_i (μ_{E,i} − μ_i)²
s.t. ∀s: Σ_a ρ(s,a) = Σ_{s',a} P(s | s',a) ρ(s',a)
     ∀i: μ_i = Σ_{s,a} φ_i(s) ρ(s,a)
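A discounted variant of the occupancy-measure LP can be handed to an off-the-shelf solver. The two-state, two-action MDP, its reward, and the discount factor below are made-up illustrations; the discount keeps the occupancy measure finite, and the slide's version is the undiscounted analogue:

```python
import numpy as np
from scipy.optimize import linprog

# Solve a tiny MDP through the occupancy-measure LP:
#   max_rho  sum_{s,a} rho(s,a) R(s)
#   s.t.     sum_a rho(s,a) - gamma * sum_{s',a} P(s|s',a) rho(s',a) = c(s),
#            rho >= 0
gamma = 0.9
c_init = np.array([1.0, 0.0])         # start-state distribution c(s)
R = np.array([0.0, 1.0])              # state reward: state 1 is good
P = np.zeros((2, 2, 2))               # P[s, a, s'] = P(s' | s, a)
P[0, 0, 0] = P[1, 0, 1] = 1.0         # action 0: stay
P[0, 1, 1] = P[1, 1, 0] = 1.0         # action 1: switch

# Variables rho(s, a), flattened as [(0,0), (0,1), (1,0), (1,1)].
A_eq = np.zeros((2, 4))
for s in range(2):
    for sp in range(2):
        for a in range(2):
            col = sp * 2 + a
            A_eq[s, col] += (1.0 if sp == s else 0.0) - gamma * P[sp, a, s]
obj = -np.repeat(R, 2)                # linprog minimizes, so negate the reward

res = linprog(obj, A_eq=A_eq, b_eq=c_init, bounds=(0, None))
rho = res.x.reshape(2, 2)
policy = rho.argmax(axis=1)           # read the policy off the occupancy measure
print(policy, -res.fun)               # switch at state 0, stay at state 1; return 9
```

The optimal occupancy concentrates on (state 0, switch) and (state 1, stay), recovering the obvious policy; the apprenticeship-learning QP replaces the linear objective with the squared feature-matching objective over the same feasible set.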
Different Formulation (ctd.)
Our algorithm is equivalent to iteratively linearizing the QP at the current point (inverse RL step) and solving the resulting LP (RL step).
Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality).
[Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.]
Conclusions
• Our algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function.
• The algorithm is guaranteed to converge in poly(k, 1/ε) iterations.
• The sample complexity is poly(k, 1/ε).
• The algorithm exploits reward “simplicity” (vs. policy “simplicity” in previous approaches).
Proof (sketch)
[Figure: in feature-expectation space (axes μ_1, μ_2), the distance d_i = ||μ(π_E) − μ̄^(i)|| from the expert's feature expectations shrinks with each iteration: d_1 < d_0.]
Additional slides for poster
(The slides to come are additional material, not included in the talk; in particular: the projection (vs. QP) version of the inverse RL step, and another formulation of the apprenticeship learning problem and its relation to our algorithm.)
Simplification of Inverse RL step: Euclidean projection instead of QP
• In the inverse RL step:
– set μ̄^(i−1) = the orthogonal projection of μ(π_E) onto the line through μ̄^(i−2) and μ(π^(i−1));
– set w^(i) = μ(π_E) − μ̄^(i−1).
• Note: the theoretical results on convergence and sample complexity hold unchanged for this simpler algorithm.
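The projection update itself is two lines of linear algebra; the vectors below are arbitrary two-dimensional illustrations:

```python
import numpy as np

def project_step(mu_E, mu_bar, mu_i):
    """Orthogonal projection of mu_E onto the line through mu_bar and mu_i.

    Returns the new mu_bar; the next reward estimate is then w = mu_E - mu_bar.
    """
    d = mu_i - mu_bar
    lam = (d @ (mu_E - mu_bar)) / (d @ d)  # projection coefficient along d
    return mu_bar + lam * d

mu_E = np.array([1.0, 1.0])
mu_bar = np.array([0.0, 0.0])
mu_i = np.array([2.0, 0.0])
new_bar = project_step(mu_E, mu_bar, mu_i)
# The new point is at least as close to mu_E as the old one was.
assert np.linalg.norm(mu_E - new_bar) <= np.linalg.norm(mu_E - mu_bar)
print(new_bar)    # [1. 0.]
```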
Algorithm (projection version)
[Figure: in feature-expectation space (axes μ_1, μ_2), starting from μ(π_0), each iteration projects μ(π_E) onto the line through the previous point and the newest μ(π_i), producing successive weight vectors w^(1), w^(2), w^(3) that point toward μ(π_E).]
Appendix: Different View
Bellman LP for solving MDPs:
min_V c^T V
s.t. ∀s, a: V(s) ≥ R(s,a) + Σ_{s'} P(s,a,s') V(s')
Dual LP:
max_ρ Σ_{s,a} ρ(s,a) R(s,a)
s.t. ∀s: c(s) − Σ_a ρ(s,a) + Σ_{s',a} P(s',a,s) ρ(s',a) = 0
Apprenticeship Learning as QP:
min_ρ Σ_i (μ_{E,i} − Σ_{s,a} ρ(s,a) φ_i(s))²
s.t. ∀s: c(s) − Σ_a ρ(s,a) + Σ_{s',a} P(s',a,s) ρ(s',a) = 0
Different View (ctd.)
Our algorithm is equivalent to iteratively linearizing the QP at the current point (inverse RL step) and solving the resulting LP (RL step).
Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality).
[Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.]
Slides that are different for poster
(The slides to come are slightly different for the poster, but already “appeared” earlier.)
Algorithm (QP version)
[Figure: in feature-expectation space (axes μ_1, μ_2), with U_w(π) = w^T μ(π): successive points μ(π_0), μ(π_1), μ(π_2) approach μ(π_E), and each inverse RL step yields a new weight vector w^(1), w^(2), w^(3).]
Case study: Highway driving
Input: Driving demonstration. Output: Learned behavior.
(Videos available.)
More driving examples
(Videos available.)
Car driving results (more detail)

                              Collision  Off-road left  Left lane  Middle lane  Right lane  Off-road right
1  Feature distr. (expert)     0          0              0.1325     0.2033       0.5983      0.0658
1  Feature distr. (learned)    5.00E-05   0.0004         0.0904     0.2286       0.6040      0.0764
1  Weights (learned)          -0.0767    -0.0439         0.0077     0.0078       0.0318     -0.0035
2  Feature distr. (expert)     0.1167     0              0.0633     0.4667       0.4700      0
2  Feature distr. (learned)    0.1332     0              0.1045     0.3196       0.5759      0
2  Weights (learned)           0.2340    -0.1098         0.0092     0.0487       0.0576     -0.0056
3  Feature distr. (expert)     0          0              0          0.0033       0.7058      0.2908
3  Feature distr. (learned)    0          0              0          0            0.7447      0.2554
3  Weights (learned)          -0.1056    -0.0051        -0.0573    -0.0386       0.0929      0.0081
4  Feature distr. (expert)     0.0600     0              0          0.0033       0.2908      0.7058
4  Feature distr. (learned)    0.0569     0              0          0            0.2666      0.7334
4  Weights (learned)           0.1079    -0.0001        -0.0487    -0.0666       0.0590      0.0564
5  Feature distr. (expert)     0.0600     0              0          1            0           0
5  Feature distr. (learned)    0.0542     0              0          1            0           0
5  Weights (learned)           0.0094    -0.0108        -0.2765     0.8126      -0.5100     -0.0153