Reinforcement Learning: Mitchell, Ch. 13 (see also the Sutton & Barto book, available online)
Rationale
• Learning from experience
• Adaptive control
• Examples are not explicitly labeled; feedback is delayed
• Problem of credit assignment: which action(s) led to the payoff?
• Trade off short-term thinking (immediate reward) against long-term consequences
Agent Model
• Transition function T: S × A → S (the environment)
• Reward function R: S × A → real (the payoff)
• Both may be stochastic, but are assumed Markov (they depend only on the current state and action)
• Policy = decision function π: S → A
• "Rationality": maximize long-term expected reward
  – Discounted long-term reward (a convergent series)
  – Alternatives: finite time horizon, uniform weights
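A sketch of the usual definition of the discounted long-term reward of following policy π from state s_t (standard notation, not spelled out on the slide):

V^{\pi}(s_t) \;=\; r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \cdots \;=\; \sum_{i=0}^{\infty} \gamma^{i}\, r_{t+i}, \qquad 0 \le \gamma < 1

The geometric discount γ < 1 is what makes this series convergent, as noted above.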
[Figure: agent-environment interaction loop; the environment supplies the reward function R and the transition function T]
Markov Decision Processes (MDPs)
• If R and T (= P) are known, solve for the value function V^π(s): policy evaluation
• Bellman equations + dynamic programming: |S| equations in |S| unknowns
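A minimal sketch of exact policy evaluation for a known, finite MDP; the array layout (P[s, a, s'] transition probabilities, R[s, a] rewards, pi[s] a deterministic policy) is an assumption for illustration:

import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Solve the Bellman equations V = R_pi + gamma * P_pi @ V exactly."""
    n_states = P.shape[0]
    R_pi = np.array([R[s, pi[s]] for s in range(n_states)])   # reward under pi
    P_pi = np.array([P[s, pi[s]] for s in range(n_states)])   # transitions under pi
    # |S| linear equations in |S| unknowns: (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)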
MDPs: finding optimal policies
• Value iteration: update V(s) iteratively until the greedy policy π(s) = argmax_a [R(s, a) + γ Σ_s' P(s' | s, a) V(s')] stops changing
• Policy iteration: alternate between choosing π greedily and re-evaluating V over all states
• Monte Carlo sampling: run random scenarios using π and take the average rewards as V(s)
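A minimal value-iteration sketch under the same assumed array layout (P[s, a, s'], R[s, a]):

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # value function and greedy policy
        V = V_new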
Q-learning: model-free
• Q-function: reformulate the value function over S × A, so that acting optimally no longer requires knowing R or T (= δ)
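A sketch of the standard deterministic-world definitions, using Mitchell's δ for the transition function:

Q(s, a) = r(s, a) + \gamma\, V^{*}\!\big(\delta(s, a)\big), \qquad
V^{*}(s) = \max_{a'} Q(s, a'), \qquad
\pi^{*}(s) = \arg\max_{a} Q(s, a)

The agent can choose actions from Q alone, which is what makes the method model-free.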
Q-learning algorithm
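A minimal tabular Q-learning sketch for the deterministic case described in Mitchell Ch. 13; the env object with reset() and step(a) returning (next_state, reward, done) is a hypothetical interface:

import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.9, epsilon=0.1, n_episodes=1000):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration (see the on-policy discussion below)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # deterministic-world update: Q(s, a) <- r + gamma * max_a' Q(s', a')
            Q[s, a] = r + (0.0 if done else gamma * Q[s_next].max())
            s = s_next
    return Q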
Convergence
• Theorem: Q̂ converges to the true Q*, provided every state-action pair is visited infinitely often and rewards are bounded (|r| < ∞)
• Proof sketch: on each full iteration (in which every pair in S × A is visited), the magnitude of the largest error in the Q̂ table shrinks by at least a factor of γ
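The key step of the contraction argument, sketched along the lines of Mitchell's proof (Δ_n is the largest error in the table after n full sweeps):

\big|\hat{Q}_{n+1}(s,a) - Q(s,a)\big|
  = \gamma\,\big|\max_{a'} \hat{Q}_{n}(s',a') - \max_{a'} Q(s',a')\big|
  \le \gamma \max_{a'} \big|\hat{Q}_{n}(s',a') - Q(s',a')\big|
  \le \gamma\,\Delta_n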
• "On-policy" training
  – exploitation vs. exploration
  – will the relevant parts of the space be explored if we stick to the current (sub-optimal) policy?
  – ε-greedy policies: choose the action with the max Q value most of the time, or a random action with probability ε
• "Off-policy" training: learn from simulations or stored traces
• SARSA: training example database of tuples <s, a, r, s', a'>
• Actor-critic methods
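A minimal sketch of the SARSA-style update on one such tuple (the table Q and step size alpha are assumed):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # Move Q(s, a) toward the reward plus the value of the action actually taken next
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    return Q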
Non-deterministic case
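In the non-deterministic case, the update is typically softened with a decaying learning rate α_n; a sketch of the rule as given in Mitchell Ch. 13:

\hat{Q}_{n}(s,a) \leftarrow (1-\alpha_n)\,\hat{Q}_{n-1}(s,a)
  + \alpha_n \Big[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \Big],
\qquad \alpha_n = \frac{1}{1 + \mathit{visits}_n(s,a)}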
Temporal Difference Learning
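A sketch of Mitchell's TD(λ) formulation, which blends the one-step Q-learning target with longer lookahead (λ = 0 recovers ordinary Q-learning):

Q^{\lambda}(s_t, a_t) = r_t + \gamma \Big[ (1-\lambda)\, \max_{a} \hat{Q}(s_{t+1}, a)
  + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1}) \Big]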
• Convergence is not the problem
• Representing a large Q table is the problem (domains with many states or continuous actions)
• How to represent large Q tables?
  – neural networks
  – function approximation
  – basis functions
  – hierarchical decomposition of the state space
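A minimal sketch of one option: replace the table with a linear function approximator over hand-chosen basis functions phi(s, a) (all names here are illustrative):

import numpy as np

def q_hat(w, phi, s, a):
    # Approximate Q(s, a) as a weighted sum of basis-function values
    return w @ phi(s, a)

def approx_q_update(w, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    # Gradient-style Q-learning step on the weights instead of a table entry
    target = r + gamma * max(q_hat(w, phi, s_next, a2) for a2 in actions)
    w = w + alpha * (target - q_hat(w, phi, s, a)) * phi(s, a)
    return w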