Extraction and Transfer of Knowledge in Reinforcement Learning


Extraction and Transfer of Knowledge in Reinforcement Learning. A. LAZARIC, Inria. "30 minutes de Science" Seminars, SequeL, Inria Lille – Nord Europe, December 10th, 2014.

Background: Master @PoliMi+UIC (2005); Ph.D. @PoliMi (2008); post-doc @SequeL (2010); CR @SequeL since Dec. 2010.
Tools: stochastic approximation, online optimization, optimal control theory, dynamic programming, statistics.
SequeL, Sequential Learning Problems: multi-arm bandit, reinforcement learning, sequence prediction.
Online learning results: theory (learnability, sample complexity, regret), algorithms (online/batch RL, bandit with structure), applications (finance, recommendation systems, computer games).


Good transfer [plot: learning performance under positive transfer, no transfer, and negative transfer].

Can we design algorithms able to learn from experience and transfer knowledge across different problems to improve their learning performance?

Outline
§ Transfer in Reinforcement Learning
§ Improving the Exploration Strategy
§ Improving the Accuracy of Approximation
§ Conclusions


Reinforcement Learning [diagram: an agent interacts with an environment and is evaluated by a critic, with delayed feedback]. Example: learning to ride a bicycle. The agent observes the state <position, speed>, acts through <handlebar, pedals>, and observes <new position, new speed> together with a reward measuring its advancement. It learns a control policy and a value function.

Markov Decision Process (MDP). A Markov decision process consists of:
• a set of states $\mathcal{S}$
• a set of actions $\mathcal{A}$
• dynamics $p(s' \mid s, a)$ (probability of transition)
• a reward function $r(s, a)$
A policy $\pi$ maps states to actions. Objective: maximize the value function, e.g. $V^\pi(s) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r(s_t, \pi(s_t)) \mid s_0 = s\right]$.
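For concreteness (an illustration, not from the talk): for a fixed policy in a finite MDP, the value function satisfies the Bellman equation $V^\pi = r^\pi + \gamma P^\pi V^\pi$, which a few lines of Python can solve directly:

```python
import numpy as np

# Toy 2-state MDP under a fixed policy (hypothetical numbers):
# P[s, s'] = transition probabilities, r[s] = expected reward in state s.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([0.0, 1.0])
gamma = 0.95

# Bellman equation in matrix form: (I - gamma * P) V = r
V = np.linalg.solve(np.eye(2) - gamma * P, r)
print(V)  # value of each state under this policy
```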

Reinforcement Learning Algorithms. Over time:
• observe the state
• take an action (facing the exploration/exploitation dilemma)
• observe the next state and reward
• update the policy and value function (using approximation)
RL algorithms often require many samples and careful design and hand-tuning. A minimal instantiation of this loop is sketched below.
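A minimal sketch of the loop as tabular Q-learning with epsilon-greedy exploration (one standard instantiation; the talk does not commit to a specific algorithm, and the `env` reset/step interface is an assumption):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False  # assumed interface: reset() -> state
        while not done:
            # Explore with probability eps, otherwise exploit current estimates.
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r, done = env.step(a)  # observe next state and reward
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])  # update value estimates
            s = s2
    return Q  # greedy policy: pi(s) = argmax_a Q[s, a]
```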

Reinforcement Learning [same agent-critic-environment diagram]: learning each task from scratch this way is very inefficient!

Transfer in Reinforcement Learning [diagram: the agent-critic-environment loop augmented with a transfer-of-knowledge module].



Multi-arm Bandit: a "Simple" RL Problem. The multi-armed bandit problem:
• set of states: no state
• set of actions (e.g., movies, lessons)
• dynamics: no dynamics
• reward (e.g., rating, grade)
Objective: find a policy maximizing the reward over time. In short: online optimization of an unknown stochastic function under computational constraints… A toy instance is sketched below.
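As a concrete illustration (not from the talk), a minimal Bernoulli bandit: each action pays off with an unknown probability, and the learner only observes the reward of the action it picks:

```python
import numpy as np

class BernoulliBandit:
    """K actions (e.g., movies to recommend); action a pays 1 w.p. means[a]."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        return float(self.rng.random() < self.means[a])

bandit = BernoulliBandit([0.2, 0.5, 0.8])  # hypothetical means, hidden in practice
print(bandit.pull(2))                      # stochastic 0/1 reward
```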

Sequential Transfer in Bandit. Users arrive sequentially: past users, the current user, future users. For each user the algorithm must explore and exploit.

Idea: although the type of the current user is unknown, we may collect knowledge about past users and exploit their similarity to identify the type and speed up the learning process.

Sanity check: first develop an algorithm that, given the information about possible users as prior knowledge, can outperform a non-transfer approach.

The model-Upper Confidence Bound Algorithm. Over time, select the action with the highest upper confidence bound, which combines exploitation (the higher the estimated reward, the higher the chance to select the action) with exploration (the higher the theoretical uncertainty, the higher the chance to select the action).
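The classic UCB1 index (Auer et al.) makes this concrete; a minimal sketch (the talk's model-based variant refines it with prior knowledge, see below):

```python
import numpy as np

def ucb1_action(counts, reward_sums, t):
    """counts[a]: pulls of action a so far; reward_sums[a]: total reward from a;
    t: total number of pulls so far (t >= 1)."""
    means = reward_sums / np.maximum(counts, 1)              # exploitation term
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))   # exploration term
    bonus[counts == 0] = np.inf                              # try each action once
    return int(np.argmax(means + bonus))
```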

The model-Upper Confidence Bound Algorithm, with "transfer": when selecting actions, combine the current estimates with prior knowledge about the possible users in Θ.
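A hedged sketch of one way such prior knowledge can be used (my reading of the slide, assuming Θ is a finite set of candidate mean-reward vectors; details differ from the actual algorithm and its analysis): keep only the models consistent with the current estimates and act optimistically over those.

```python
import numpy as np

def mucb_action(Theta, means_hat, counts, t):
    """Theta: (M, K) candidate mean-reward vectors, one row per known user type;
    means_hat, counts: (K,) empirical means and pull counts for the current user."""
    width = np.sqrt(np.log(t) / np.maximum(counts, 1))   # confidence widths
    # Keep models that agree with the empirical means on all pulled actions.
    ok = np.all((np.abs(Theta - means_hat) <= width) | (counts == 0), axis=1)
    plausible = Theta[ok] if ok.any() else Theta         # fall back to all models
    return int(np.argmax(plausible.max(axis=0)))         # optimism over plausible models
```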

Sequential transfer then alternates two operations over the sequence of users: collect knowledge from past users, and transfer that knowledge to speed up learning with the current user, repeating as future users arrive.

The transfer-Upper Confidence Bound Algorithm. Over time, select actions as above while collecting and transferring knowledge, using a method-of-moments approach to solve the underlying latent variable model problem.
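Very roughly (my gloss, not the talk's derivation): past users are draws from a mixture over the unknown set of types, and low-order empirical moments constrain that set. A toy sketch of the moment-collection step only; the actual recovery relies on spectral/tensor decompositions of such moments:

```python
import numpy as np

def empirical_moments(user_means):
    """user_means: (N, K) array, the estimated mean reward of each of N past
    users over K actions."""
    M1 = user_means.mean(axis=0)                      # first moment, shape (K,)
    M2 = user_means.T @ user_means / len(user_means)  # second moment, shape (K, K)
    return M1, M2  # their eigen-structure constrains the latent user types
```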

Empirical Results (NIPS 2013, with E. Brunskill, CMU, and M. Azar, Northwestern Univ.). Synthetic data [plot: performance of the transfer vs. non-transfer algorithms, annotated from BAD to GOOD]. Currently testing on a "movie recommendation" dataset.


Sparse Multi-task Reinforcement Learning: learning to play poker.
• States: cards, chips, …
• Actions: stay, call, fold
• Dynamics: deck, opponent
• Reward: money
Use RL to solve it!

This is a multi-task RL problem!

Let's use as much information as possible to solve the problem! But not all the "features" are equally useful!

The linear Fitted Q-Iteration Algorithm: collect samples from the environment; create a regression dataset; solve a linear regression problem over the features; return the greedy policy.
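A minimal sketch of these steps (an illustration, not the talk's exact formulation: `phi` is an assumed feature map, and terminal states are ignored):

```python
import numpy as np

def linear_fqi(samples, phi, n_actions, gamma=0.99, n_iters=50):
    """samples: list of (s, a, r, s_next); phi(s, a): feature vector of (s, a)."""
    X = np.array([phi(s, a) for s, a, _, _ in samples])  # regression inputs
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Bootstrapped targets: r + gamma * max_a' Q(s', a') under current w.
        y = np.array([r + gamma * max(phi(s2, b) @ w for b in range(n_actions))
                      for _, _, r, s2 in samples])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)        # linear regression step
    return w  # greedy policy: pi(s) = argmax_a phi(s, a) @ w
```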

Sparse Linear Fitted Q-Iteration: collect samples from the environment; create a regression dataset; solve a sparse linear regression problem; return the greedy policy. The regression step is the LASSO: $L_1$-regularized least-squares.
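In code, this amounts to swapping the least-squares solve in the sketch above for an $L_1$-regularized one, for instance with scikit-learn (the value of alpha is a hypothetical choice to be tuned):

```python
from sklearn.linear_model import Lasso

# Inside the fitted Q-iteration loop, replace the least-squares step:
reg = Lasso(alpha=0.01).fit(X, y)  # X: (s, a) features, y: bootstrapped targets
w = reg.coef_                      # many coefficients are exactly zero
```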

The Multi-task Joint Sparsity Assumption [diagram: a features-by-tasks weight matrix in which only a few rows are non-zero]: the same small subset of features is relevant across all tasks.

Multi-task Sparse Linear Fitted Q-Iteration: collect samples from each task; create T regression datasets; solve a multi-task sparse linear regression problem; return the greedy policies. The regression step is the Group LASSO: $L_{1,2}$-regularized least-squares.
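scikit-learn's MultiTaskLasso implements this kind of penalty (features are kept or discarded jointly across tasks). A toy sketch on synthetic data, assuming for simplicity that all T tasks share one design matrix, as that API requires:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))           # shared inputs, d = 50 features
W = np.zeros((50, 4))                        # T = 4 tasks, jointly sparse weights
W[:5] = rng.standard_normal((5, 4))          # only 5 features matter, in all tasks
Y = X @ W + 0.1 * rng.standard_normal((200, 4))

reg = MultiTaskLasso(alpha=0.1).fit(X, Y)    # L_{1,2} penalty across tasks
selected = np.abs(reg.coef_).max(axis=0) > 1e-8  # coef_ has shape (T, d)
print(selected.sum(), "features kept jointly across tasks")
```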

Learning a sparse representation: learn a transformation of the features under which the tasks become jointly sparse (aka dictionary learning).

Multi-task Feature Learning Linear Fitted Q-Iteration: collect samples from each task; create T regression datasets; learn a sparse representation; solve a multi-task sparse linear regression problem; return the greedy policies. This is the multi-task feature learning approach.
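A heavily hedged sketch of the pipeline (my stand-in, not the algorithm from the talk: here the basis comes from an SVD of per-task least-squares weights, one simple heuristic for the representation-learning step):

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

def mt_feature_learning(X, Y, alpha=0.1):
    """X: (n, d) shared inputs; Y: (n, T) one regression target per task."""
    W_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)  # (d, T) per-task weights
    U, _, _ = np.linalg.svd(W_ls)                 # (d, d) orthogonal basis
    # Ask for joint sparsity in the transformed feature space X @ U.
    W = MultiTaskLasso(alpha=alpha).fit(X @ U, Y).coef_.T  # (d, T), row-sparse
    return U, W  # weights for task t in the original space: U @ W[:, t]
```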

Theoretical Results. Number of samples (per task) needed for an accurate approximation using d features:
• Standard approach: linearly proportional to d… too many samples!
• Lasso: only log(d)! But no advantage from multiple tasks…
• Group Lasso: decreasing in T! But joint sparsity may be poor…
• Representation learning: scales with the smallest number of important features! But learning the representation may be expensive…
Indicative rates are sketched below.
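For intuition only, sample-complexity bounds of this kind typically have the following flavor, with $s \ll d$ relevant features and $T$ tasks (my paraphrase from the sparse-regression literature, up to constants and problem-dependent terms; not the exact statements from the talk):

$$n_{\mathrm{std}} \;\approx\; d, \qquad n_{\mathrm{Lasso}} \;\approx\; s \log d, \qquad n_{\mathrm{G\text{-}Lasso}} \;\approx\; s\left(1 + \frac{\log d}{T}\right)$$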

Empirical Results: Blackjack (NIPS 2014, with D. Calandriello and M. Restelli, PoliMi). Under study: application to other computer games.


Conclusions [figure: learning performance without transfer vs. with transfer].

Thanks!! Inria Lille – Nord Europe, www.inria.fr