Bayesian Reinforcement Learning
Machine Learning RCC, 16th June 2011
Outline • Introduction to Reinforcement Learning • Overview of the field • Model-based BRL • Model-free RL
References • ICML-07 Tutorial – P. Poupart, M. Ghavamzadeh, Y. Engel • Reinforcement Learning: An Introduction – Richard S. Sutton and Andrew G. Barto
Machine Learning • Supervised Learning • Unsupervised Learning • Reinforcement Learning
Definitions • State • Action • Reward • Policy • Reward function
Markov Decision Process [diagram: trajectory of states x0, x1, actions a0, a1 and rewards r0, r1, labelled with transition probability, policy and reward function]
Value Function
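The equations on this slide did not survive extraction; the standard definitions it presumably showed are, in the deck's notation (x for states, a for actions), a sketch:

```latex
V^{\pi}(x) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, x_0 = x\right],
\qquad
Q^{\pi}(x,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, x_0 = x,\ a_0 = a\right]
```

with the Bellman equation relating a state's value to its successors: V^π(x) = Σ_a π(a|x) [R(x, a) + γ Σ_{x'} P(x'|x, a) V^π(x')].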
Optimal Policy • Assume one optimal action per state • Value Iteration [equations not recovered]
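Value iteration repeatedly applies the Bellman optimality backup until the values stop changing. A minimal sketch on a made-up two-state, two-action MDP (all numbers are illustrative, not from the slides):

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[a, x, y] = probability of moving x -> y under action a (rows sum to 1).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
# R[x, a] = expected immediate reward for taking a in x.
R = np.array([[1.0, 0.0], [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup: Q(x, a) = R(x, a) + gamma * sum_y P(y | x, a) V(y)
    Q = R + gamma * np.einsum('axy,y->xa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # converged
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy w.r.t. the converged values
```

At convergence V is a fixed point of the backup, which is what "optimal value function" means on the slide.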
Reinforcement Learning • RL Problem: Solve MDP when reward/transition models are unknown • Basic Idea: Use samples obtained from agent’s interaction with environment
Model-Based vs Model-Free RL • Model-Based: Learn a model of the reward/transition dynamics and derive value function/policy • Model-Free: Directly learn value function/policy
RL Solutions
RL Solutions • Value Function Algorithms – Define a form for the value function – Sample state-action-reward sequence – Update value function – Extract optimal policy • SARSA, Q-learning
RL Solutions • Actor-Critic – Define a policy structure (actor) – Define a value function (critic) – Sample state-action-reward – Update both actor & critic
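The actor-critic loop above can be sketched as a one-step tabular update with a softmax actor and a TD-error critic; the sizes, step-sizes and function names here are illustrative assumptions, not from the slides:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
alpha_v, alpha_p = 0.1, 0.1
V = np.zeros(n_states)                    # critic: state-value estimates
theta = np.zeros((n_states, n_actions))   # actor: policy logits

def policy(s):
    # Softmax over the actor's logits for state s.
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def update(s, a, r, s2, done):
    # Critic: TD error = how much better/worse the outcome was than expected.
    delta = r + (0.0 if done else gamma * V[s2]) - V[s]
    V[s] += alpha_v * delta
    # Actor: move log pi(a|s) in the direction of the TD error.
    p = policy(s)
    grad = -p
    grad[a] += 1.0                        # gradient of log pi(a|s) w.r.t. theta[s]
    theta[s] += alpha_p * delta * grad
```

A positive TD error raises the probability of the sampled action; a negative one lowers it, so actor and critic are updated from the same sample as the slide describes.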
RL Solutions • Policy Search Algorithm – Define a form for the policy – Sample state-action-reward sequence – Update policy • PEGASUS – (Policy Evaluation-of-Goodness And Search Using Scenarios)
Online vs Offline • Offline – Use a simulator – Policy fixed for each ‘episode’ – Updates made at the end of the episode • Online – Directly interact with the environment – Learning happens step-by-step
Model-Free Solutions 1. Prediction: Estimate V(x) or Q(x, a) 2. Control: Extract policy – On-Policy – Off-Policy
Monte-Carlo Predictions
State:    Leave car park   Get out of city   Motorway   Enter Cambridge
Value:         -90              -83             -55          -11
Reward:        -13              -15             -61          -11
Updated:      -100              -87             -72          -11
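In the slide's driving example the "Updated" row is each state's observed return, i.e. the sum of all rewards from that state to the end of the episode (a step-size of 1 is assumed, since the updated values match the returns exactly). A sketch reproducing it:

```python
# One episode of the driving example from the slide.
states  = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
V       = {"Leave car park": -90.0, "Get out of city": -83.0,
           "Motorway": -55.0, "Enter Cambridge": -11.0}
rewards = [-13.0, -15.0, -61.0, -11.0]   # reward incurred from each state

# Monte-Carlo target: accumulate the return backwards from the terminal state.
G = 0.0
for s, r in zip(reversed(states), reversed(rewards)):
    G += r
    V[s] = G   # replace the old value with the observed return (alpha = 1)
```

Walking backwards, the returns come out as -11, -72, -87 and -100, matching the slide.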
Temporal Difference Predictions
State:    Leave car park   Get out of city   Motorway   Enter Cambridge
Value:         -90              -83             -55          -11
Reward:        -13              -15             -61          -11
Updated:       -96              -70             -72          -11
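Here each state is instead updated towards the TD(0) target: the immediate reward plus the previous estimate of the next state's value. A sketch reproducing the slide's numbers (alpha = gamma = 1 is assumed, since the updated values match that target exactly):

```python
states  = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
V       = {"Leave car park": -90.0, "Get out of city": -83.0,
           "Motorway": -55.0, "Enter Cambridge": -11.0}
rewards = [-13.0, -15.0, -61.0, -11.0]

V_old = dict(V)   # bootstrap from the values held before this episode
for i, s in enumerate(states):
    # TD(0) target: reward plus the old value of the successor (0 past the end).
    next_v = V_old[states[i + 1]] if i + 1 < len(states) else 0.0
    V[s] = rewards[i] + next_v
```

For instance "Leave car park" becomes -13 + (-83) = -96, without waiting for the episode to finish as Monte-Carlo must.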
Advantages of TD • Don’t need a model of reward/transitions • Online, fully incremental • Proved to converge given conditions on step-size • “Usually” faster than MC methods
From TD to TD(λ) [figures: sequence of states and rewards ending in a terminal state]
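TD(λ) spreads each TD error back over recently visited states via eligibility traces. A tabular sketch with accumulating traces (the function name and default parameters are illustrative; with lam = 0 it reduces to TD(0), with lam = 1 it approaches Monte-Carlo):

```python
def td_lambda_episode(episode, V, alpha=0.1, gamma=1.0, lam=0.9):
    # episode: list of (state, reward, next_state); next_state is None at the end.
    e = {s: 0.0 for s in V}                  # eligibility trace per state
    for s, r, s_next in episode:
        bootstrap = gamma * V[s_next] if s_next is not None else 0.0
        delta = r + bootstrap - V[s]         # one-step TD error
        e[s] += 1.0                          # accumulating trace
        for x in V:                          # credit all eligible states
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam              # traces decay by gamma * lambda
    return V
```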
SARSA & Q-learning (TD-Learning)
• On-Policy: SARSA estimates the value function of the current policy
• Off-Policy: Q-Learning estimates the value function of the optimal policy
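The two update rules side by side, as a tabular sketch (Q is a dict keyed by (state, action); function names and defaults are illustrative):

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap with the action a2 the current policy actually took.
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap with the greedy action, whatever the behaviour did.
    Q[s, a] += alpha * (r + gamma * max(Q[s2, b] for b in actions) - Q[s, a])
```

The only difference is the bootstrap term, which is exactly the on-policy/off-policy split on the slide.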
GP Temporal Difference [equations and figures not recovered]
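The equations on these slides were lost in extraction. For reference, the GPTD model of Engel et al. (Y. Engel appears in this deck's references) places a GP prior on the value function and treats each observed reward as a noisy temporal difference of values; a sketch of its core equation:

```latex
r(x_t) = V(x_t) - \gamma\, V(x_{t+1}) + N(x_t, x_{t+1}),
\qquad
V(\cdot) \sim \mathcal{GP}\big(0,\ k(x, x')\big)
```

Conditioning this linear-Gaussian model on an observed trajectory then gives a posterior mean and variance for V at every state, i.e. a Bayesian model-free value estimate.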