
Reinforcement Learning
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

One Example: TD-Gammon (Tesauro, 1995)
Learn to play Backgammon.
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself; now approximately equal to the best human player.

Reinforcement Learning Problem
[Diagram: agent/environment interaction loop. At each step the agent observes state s_t, takes action a_t, and receives reward r_t, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]
Goal: learn to choose actions that maximize
    r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1
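To make the discounted objective concrete, here is a minimal sketch of computing a discounted return for a finite reward sequence; the function name and the example rewards are hypothetical, not from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over a finite reward sequence r_0, r_1, ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: no reward for two steps, then 100 on the third step.
print(discounted_return([0, 0, 100], gamma=0.9))   # 0 + 0.9*0 + 0.81*100 ≈ 81
```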

Markov Decision Process
Assume:
• finite set of states S
• set of actions A
• at each discrete time step, agent observes state s_t ∈ S and chooses action a_t ∈ A
• then receives immediate reward r_t
• and state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – functions δ and r may be nondeterministic
  – functions δ and r are not necessarily known to the agent
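To ground the notation, the following sketch represents a tiny deterministic MDP with δ and r stored as Python dictionaries keyed by (state, action); the states, actions, and reward values are invented for illustration and are not the grid world from the slides.

```python
# Hypothetical deterministic MDP: delta[(s, a)] -> next state, r[(s, a)] -> reward.
states = ["s1", "s2", "G"]          # "G" is an absorbing goal state
actions = ["left", "right"]

delta = {
    ("s1", "left"): "s1", ("s1", "right"): "s2",
    ("s2", "left"): "s1", ("s2", "right"): "G",
}
r = {
    ("s1", "left"): 0, ("s1", "right"): 0,
    ("s2", "left"): 0, ("s2", "right"): 100,    # reward for entering the goal
}

# One interaction step: in state s1, take action "right".
s, a = "s1", "right"
print(delta[(s, a)], r[(s, a)])     # -> s2 0
```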

Agent's Learning Task
Execute actions in the environment, observe results, and
• learn action policy π : S → A that maximizes
    E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards
Note something new:
• target function is π : S → A
• but we have no training examples of form <s, a>
• training examples are of form <<s, a>, r>

Value Function
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states:
    V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ... ≡ Σ_{i=0..∞} γ^i r_{t+i}
where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:
    π* ≡ argmax_π V^π(s), (∀s)

[Figure: grid world with absorbing goal state G. Panels show the r(s, a) (immediate reward) values (100 for actions entering G, 0 otherwise), the resulting Q(s, a) values (e.g. 63, 72, 81, 90, 100), the V*(s) values (81, 90, 100), and one optimal policy.]

What to Learn
We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
We could then do a lookahead search to choose the best action from any state s, because
    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
A problem:
• This works well if the agent knows δ : S × A → S, and r : S × A → ℜ
• But when it doesn't, it can't choose actions this way
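Assuming value, transition, and reward tables like the hypothetical dictionaries sketched earlier (V_star, delta, and r are illustrative names), the one-step lookahead choice described above could be written as:

```python
GAMMA = 0.9   # assumed discount factor

def greedy_action_from_V(s, V_star, delta, r, actions):
    """One-step lookahead: argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ].
    Note this requires knowing both delta and r."""
    return max(actions, key=lambda a: r[(s, a)] + GAMMA * V_star[delta[(s, a)]])
```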

Q Function
Define a new evaluation function very similar to V*:
    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal action even without knowing δ!
    π*(s) = argmax_a Q(s, a)
Q is the evaluation function the agent will learn.
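By contrast with the V*-based lookahead, action selection from a learned Q table needs neither δ nor r; a minimal sketch (Q_table is a hypothetical dictionary keyed by (state, action)):

```python
def greedy_action_from_Q(s, Q_table, actions):
    """argmax_a Q(s, a): no model of delta or r is needed."""
    return max(actions, key=lambda a: Q_table[(s, a)])
```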

Training Rule to Learn Q
Note Q and V* are closely related:
    V*(s) = max_{a'} Q(s, a')
which allows us to write Q recursively as
    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
Let Q̂ denote the learner's current approximation to Q. Consider the training rule
    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
where s' is the state resulting from applying action a in state s.

Q Learning for Deterministic Worlds
For each s, a initialize table entry Q̂(s, a) ← 0
Observe current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s'
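The sketch below assembles the loop above for a deterministic world, reusing the hypothetical delta/r dictionaries from earlier and purely random action selection; it is a minimal illustration under those assumptions, not the course's reference code.

```python
import random
from collections import defaultdict

GAMMA = 0.9

def q_learning_deterministic(actions, delta, r, start, goal, episodes=1000):
    """Tabular Q-learning for a deterministic MDP with random exploration."""
    Q = defaultdict(float)                        # Q̂(s, a), initialized to 0
    for _ in range(episodes):
        s = start
        while s != goal:                          # episode ends at the absorbing goal
            a = random.choice(actions)            # explore: pick any action
            s_next, reward = delta[(s, a)], r[(s, a)]
            # Deterministic case: overwrite the old estimate with the new target.
            Q[(s, a)] = reward + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q

# Example with the hypothetical MDP sketched earlier:
# Q = q_learning_deterministic(actions, delta, r, start="s1", goal="G")
```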

Updating Q̂
[Figure: one move to the right in the grid world, from initial state s1 to next state s2; current Q̂ values around the states include 63, 72, 81, and 100.]
    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 max{63, 81, 100}
                    ← 90
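A quick arithmetic check of the update above, assuming γ = 0.9 as in the figure's values:

```python
GAMMA = 0.9
new_q = 0 + GAMMA * max(63, 81, 100)   # r + gamma * best next-state estimate
print(new_q)                           # 90.0
```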

Convergence
Theorem: Q̂ converges to Q. Consider the case of a deterministic world where each <s, a> is visited infinitely often.
Proof sketch: define a full interval to be an interval during which each <s, a> is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is
    Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

Convergence (cont)
For any table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is
    |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                              = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                              ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                              ≤ γ Δ_n
so the maximum error shrinks by a factor of γ once every entry has been updated.

Nondeterministic Case
What if reward and next state are non-deterministic?
We redefine V and Q by taking expected values:
    V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] ≡ E[Σ_{i=0..∞} γ^i r_{t+i}]
    Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case (cont)
Q learning generalizes to nondeterministic worlds. Alter the training rule to
    Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]
where
    α_n = 1 / (1 + visits_n(s, a))
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
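A minimal sketch of this averaged update, with visit counts kept per (s, a) pair; the names and the dictionary-based tables are illustrative assumptions rather than the slides' notation.

```python
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)        # Q̂(s, a)
visits = defaultdict(int)     # number of updates applied to each (s, a)

def nondeterministic_update(s, a, reward, s_next, actions):
    """Move Q̂(s, a) partway toward the sampled target, with a decaying step size."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```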

Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
    Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
Why not two steps?
    Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
Or n?
    Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)
Blend all of these:
    Q^λ(s_t, a_t) ≡ (1 − λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ...]
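The sketch below computes the n-step estimates Q^(n) from a recorded finite trajectory and blends them with the (1 − λ)λ^(n−1) weights above; placing the leftover weight on the last available estimate (so the weights sum to one) is an assumption made here for the finite-trajectory case.

```python
GAMMA, LAM = 0.9, 0.5    # assumed discount factor and lambda

def n_step_estimate(n, t, rewards, states, Q, actions):
    """Q^(n): n actual rewards, then bootstrap with max_a Q̂(s_{t+n}, a)."""
    ret = sum(GAMMA ** i * rewards[t + i] for i in range(n))
    return ret + GAMMA ** n * max(Q[(states[t + n], a)] for a in actions)

def lambda_return(t, rewards, states, Q, actions):
    """Blend Q^(1), Q^(2), ... with weights (1 - lam) * lam**(n - 1)."""
    N = len(rewards) - t                        # n-step estimates available from time t
    blend = sum((1 - LAM) * LAM ** (n - 1) *
                n_step_estimate(n, t, rewards, states, Q, actions)
                for n in range(1, N))
    # remaining weight goes to the longest (truncated) estimate
    return blend + LAM ** (N - 1) * n_step_estimate(N, t, rewards, states, Q, actions)
```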

Temporal Difference Learning
Equivalent recursive expression:
    Q^λ(s_t, a_t) = r_t + γ [(1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]
TD(λ) algorithm uses the above training rule
• Sometimes converges faster than Q learning
• Converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm

Subtleties and Ongoing Research
• Replace Q̂ table with neural network or other generalizer
• Handle case where state is only partially observable
• Design optimal exploration strategies
• Extend to continuous actions and states
• Learn and use δ̂ : S × A → S, an approximation to δ
• Relationship to dynamic programming

RL Summary
• Reinforcement learning (RL)
  – control learning
  – delayed reward
  – possible that the state is only partially observable
  – possible that the relationship between states/actions is unknown
• Temporal Difference Learning
  – learn discrepancies between successive estimates
  – used in TD-Gammon
• V(s) - state value function
  – needs known reward and state-transition functions

RL Summary
• Q(s, a) - state/action value function
  – related to V
  – does not need reward/state-transition functions
  – training rule related to dynamic programming
  – measure actual reward received for action and future value using current Q function
  – deterministic - replace existing estimate
  – nondeterministic - move table estimate toward measured estimate
  – convergence - can be shown in both cases