
REINFORCEMENT LEARNING


AGENDA
• Online learning
• Reinforcement learning
  • Model-free vs. model-based
  • Passive vs. active learning
  • Exploration-exploitation tradeoff


REINFORCEMENT LEARNING
• RL problem: given only observations of actions, states, and rewards, learn a (near-)optimal policy
• No prior knowledge of the transition or reward models
• We consider a fully observable, episodic environment with a finite state space and uncertainty in action outcomes (an MDP)


WHAT TO LEARN?
(Spectrum from model-free, less online deliberation, to model-based, more online deliberation)
• Learn: policy π | action-utility function Q(s, a) | utility function U | model of R and T
• Online: π(s) | arg max_a Q(s, a) | arg max_a Σ_s' P(s'|s, a) U(s') | solve the MDP
• Method: learning from demonstration | Q-learning, SARSA | direct utility estimation, TD learning | adaptive dynamic programming
• Tradeoff: simpler execution ↔ fewer examples needed to learn?


FIRST STEPS: PASSIVE RL
• Observe execution trials of an agent that acts according to some unobserved policy π
• Problem: estimate the utility function U^π
• [Recall U^π(s) = E[ Σ_t γ^t R(S_t) ], where S_t is the random variable denoting the state at time t]


WHAT TO LEARN?
(Spectrum from model-free, less online deliberation, to model-based, more online deliberation)
• Learn: policy π | action-utility function Q(s, a) | utility function U | model of R and T
• Online: π(s) | arg max_a Q(s, a) | arg max_a Σ_s' P(s'|s, a) U(s') | solve the MDP
• Method: learning from demonstration | Q-learning, SARSA | direct utility estimation, TD learning | adaptive dynamic programming
• Tradeoff: simpler execution ↔ fewer examples needed to learn?


DIRECT UTILITY ESTIMATION
[4×3 grid world example with terminal rewards +1 and −1; the numbers shown are estimated utilities, e.g., 0.81, 0.87, 0.92 in the top row and 0.66, 0.71, 0.66, 0.61, 0.39 in the bottom row]
1. Observe trials τ(i) = (s_0(i), a_1(i), s_1(i), r_1(i), …, a_ti(i), s_ti(i), r_ti(i)) for i = 1, …, n
2. For each state s ∈ S:
   • Find all trials τ(i) that pass through s
   • Compute the subsequent utility U_τ(i)(s) = Σ_{t=k…ti} γ^(t−k) r_t(i), where s is reached at step k
   • Set U^π(s) to the average observed utility
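As a concrete illustration of the procedure above, here is a minimal Python sketch of direct utility estimation. It assumes each trial is given as a list of (state, reward) pairs with the reward received on entering the state; the function name and the discount value are illustrative, not from the slides.

```python
from collections import defaultdict

def estimate_utilities(trials, gamma=0.95):
    """Direct utility estimation: average the observed discounted
    return-to-go over every visit to each state.
    `trials` is a list of trials; each trial is a list of (state, reward) pairs."""
    totals = defaultdict(float)   # sum of observed returns-to-go from each state
    counts = defaultdict(int)     # number of visits to each state
    for trial in trials:
        # Return-to-go for position k: sum_{t >= k} gamma^(t-k) * r_t,
        # computed with one backward sweep over the trial.
        ret = 0.0
        returns = [0.0] * len(trial)
        for k in range(len(trial) - 1, -1, -1):
            _, r = trial[k]
            ret = r + gamma * ret
            returns[k] = ret
        for (s, _), u in zip(trial, returns):
            totals[s] += u
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```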


INCREMENTAL (“ONLINE”) FUNCTION LEARNING
• Data is streaming into the learner: x_1, y_1, …, x_n, y_n with y_i = f(x_i)
• The learner observes x_{n+1} and must make a prediction for the next time step, y_{n+1}
• “Batch” approach:
  • Store all data at step n
  • Use your learner of choice on all data up to time n, predict for time n+1
• Can we do this using less memory?


EXAMPLE: MEAN ESTIMATION
• y_i = θ + error term (no x's)
• Current estimate: θ_n = (1/n) Σ_{i=1…n} y_i
• θ_{n+1} = 1/(n+1) Σ_{i=1…n+1} y_i
          = 1/(n+1) (y_{n+1} + Σ_{i=1…n} y_i)
          = 1/(n+1) (y_{n+1} + n θ_n)
          = θ_n + 1/(n+1) (y_{n+1} − θ_n)
[Figure: number line marking the current estimate θ_5]


EXAMPLE: MEAN ESTIMATION (continued)
[Figure: a new observation y_6 arrives alongside the current estimate θ_5]
• Same derivation as above: θ_{n+1} = θ_n + 1/(n+1) (y_{n+1} − θ_n)


EXAMPLE: MEAN ESTIMATION (continued)
[Figure: the updated estimate θ_6 = (5/6) θ_5 + (1/6) y_6 lies between θ_5 and y_6]
• Same derivation as above: θ_{n+1} = θ_n + 1/(n+1) (y_{n+1} − θ_n)


EXAMPLE: MEAN ESTIMATION
• θ_{n+1} = θ_n + 1/(n+1) (y_{n+1} − θ_n)
• Only need to store n and θ_n
[Figure: θ_6 = (5/6) θ_5 + (1/6) y_6]


LEARNING RATES
• In fact, θ_{n+1} = θ_n + α_n (y_{n+1} − θ_n) converges to the mean for any α_n such that:
  • α_n → 0 as n → ∞
  • Σ_n α_n² ≤ C < ∞
  • α_n = O(1/n) does the trick
• If α_n is close to 1, then the estimate shifts strongly toward recent data; close to 0, and the old estimate is preserved
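A minimal sketch of the incremental mean estimate with the schedule α_n = 1/n, illustrating that only the count and the current estimate need to be stored; the closure-based interface is just one convenient way to package it.

```python
def running_mean():
    """Incremental mean: theta <- theta + alpha_n * (y - theta), alpha_n = 1/n.
    Only n and the current estimate are stored."""
    n, theta = 0, 0.0
    def update(y):
        nonlocal n, theta
        n += 1
        alpha = 1.0 / n            # decaying learning rate
        theta += alpha * (y - theta)
        return theta
    return update

# Usage: feeding the stream 2, 4, 9 yields 2.0, 3.0, 5.0 (the running means).
update = running_mean()
for y in (2, 4, 9):
    print(update(y))
```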


ONLINE IMPLEMENTATION
[4×3 grid world example, as before, with estimated utilities such as 0.81, 0.87, 0.92 in the top row]
1. Store counts N[s] and estimated utilities U^π(s)
2. After a trial τ, for each state s in the trial:
   • Set N[s] ← N[s] + 1
   • Adjust utility: U^π(s) ← U^π(s) + α(N[s]) (U_τ(s) − U^π(s))
• Simply supervised learning on trials
• Slow learning, because the Bellman equation is not used to pass knowledge between adjacent states
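A sketch of the online update above, assuming the per-state returns U_τ(s) for the completed trial have already been computed (e.g., by a backward sweep as in the earlier sketch); the dictionary-based representation and the default 1/N learning-rate schedule are assumptions.

```python
def update_after_trial(U, N, trial_states, trial_returns, alpha=lambda n: 1.0 / n):
    """Online direct utility estimation: after a completed trial, nudge U(s)
    toward the observed return-to-go for each visited state.
    `trial_states` and `trial_returns` are parallel lists for one trial;
    `alpha` maps the visit count N[s] to a learning rate."""
    for s, u_obs in zip(trial_states, trial_returns):
        N[s] = N.get(s, 0) + 1
        u = U.get(s, 0.0)
        U[s] = u + alpha(N[s]) * (u_obs - u)
```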


TEMPORAL DIFFERENCE LEARNING
[4×3 grid world example with terminal rewards +1 and −1; utilities initialized to 0]
1. Store counts N[s] and estimated utilities U^π(s)
2. For each observed transition (s, r, a, s'):
   • Set N[s] ← N[s] + 1
   • Adjust utility: U^π(s) ← U^π(s) + α(N[s]) (r + γ U^π(s') − U^π(s))
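A minimal sketch of a single TD update, corresponding to step 2 above; the dictionaries U and N, the default discount, and the 1/N learning-rate schedule are illustrative choices.

```python
def td_update(U, N, s, r, s_next, gamma=0.95, alpha=lambda n: 1.0 / n):
    """One temporal-difference update for passive policy evaluation:
    U(s) <- U(s) + alpha(N[s]) * (r + gamma * U(s') - U(s))."""
    N[s] = N.get(s, 0) + 1
    u, u_next = U.get(s, 0.0), U.get(s_next, 0.0)
    U[s] = u + alpha(N[s]) * (r + gamma * u_next - u)
```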


TEMPORAL DIFFERENCE LEARNING (worked example)
• With learning rate α = 0.5, the same update rule is applied after each observed transition.
• [Sequence of 4×3 grid world snapshots: freshly visited states first move to about −0.02, and positive utility (e.g., 0.48, then 0.72, 0.84, 0.62) gradually propagates back from the +1 terminal state, with a value near 0.19 appearing next to the −1 terminal.]


TEMPORAL DIFFERENCE LEARNING
[Final snapshot of the worked example (α = 0.5), e.g., 0.23, 0.62, 0.69 in the top row]
1. Store counts N[s] and estimated utilities U^π(s)
2. For each observed transition (s, r, a, s'):
   • Set N[s] ← N[s] + 1
   • Adjust utility: U^π(s) ← U^π(s) + α(N[s]) (r + γ U^π(s') − U^π(s))
• For any s, the distribution of s' approaches P(s'|s, π(s))
• Uses relationships between adjacent states to adjust utilities toward equilibrium
• Unlike direct estimation, learns before the trial is terminated


“OFFLINE” INTERPRETATION OF TD LEARNING
[4×3 grid world figure, as before]
1. Observe trials τ(i) = (s_0(i), a_1(i), s_1(i), r_1(i), …, a_ti(i), s_ti(i), r_ti(i)) for i = 1, …, n
2. For each state s ∈ S:
   • Find all trials τ(i) that pass through s
   • Extract the local history (s, r(i), a(i), s'(i)) from each trial
   • Set up the constraint U^π(s) = r(i) + γ U^π(s'(i))
   • Solve all constraints in least-squares fashion using stochastic gradient descent
[Recall the linear system in policy iteration: u = r + γ T^π u]
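A sketch of this least-squares reading of TD: sweep repeatedly over the stored local histories and take a stochastic-gradient step on each constraint while holding U^π(s') fixed, which reproduces the TD update rule. The fixed learning rate and sweep count are assumptions.

```python
def solve_td_constraints(transitions, gamma=0.95, alpha=0.1, sweeps=100):
    """Approximately solve the constraints U(s) = r + gamma * U(s') in a
    least-squares sense by stochastic gradient descent.
    `transitions` is a list of stored local histories (s, r, s')."""
    U = {}
    for _ in range(sweeps):
        for s, r, s_next in transitions:
            u, u_next = U.get(s, 0.0), U.get(s_next, 0.0)
            # Gradient step on (U(s) - (r + gamma * U(s')))^2, treating U(s') as fixed.
            U[s] = u + alpha * (r + gamma * u_next - u)
    return U
```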


WHAT TO LEARN?
(Spectrum from model-free, less online deliberation, to model-based, more online deliberation)
• Learn: policy π | action-utility function Q(s, a) | utility function U | model of R and T
• Online: π(s) | arg max_a Q(s, a) | arg max_a Σ_s' P(s'|s, a) U(s') | solve the MDP
• Method: learning from demonstration | Q-learning, SARSA | direct utility estimation, TD learning | adaptive dynamic programming
• Tradeoff: simpler execution ↔ fewer examples needed to learn?


ADAPTIVE DYNAMIC PROGRAMMING
[4×3 grid world figure: observed rewards R(s) = −0.04 at non-terminal states, +1 and −1 at the terminals; transition model P(s'|s, a) to be estimated]
1. Store counts N[s], N[s, a], N[s, a, s'], estimated rewards R(s), and transition model P(s'|s, a)
2. For each observed transition (s, r, a, s'):
   • Set N[s] ← N[s] + 1, N[s, a] ← N[s, a] + 1, N[s, a, s'] ← N[s, a, s'] + 1
   • Adjust reward: R(s) ← R(s) + α(N[s]) (r − R(s))
   • Set P(s'|s, a) = N[s, a, s'] / N[s, a]
   • Solve policy evaluation using P, R, π
• Faster learning than TD, because the Bellman equation is exploited across all states
• Modified policy evaluation algorithms make updates faster than solving the linear system (O(n³))
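A partial sketch of the bookkeeping an ADP learner maintains under the update rules above; the policy-evaluation step is assumed to be provided elsewhere and is not shown, and the class and method names are illustrative.

```python
from collections import defaultdict

class ADPLearner:
    """Adaptive dynamic programming bookkeeping for a fixed policy pi:
    counts, an estimated reward function, and a maximum-likelihood
    transition model. After each observation, one would re-run
    policy evaluation with the current P and R (not shown here)."""
    def __init__(self, alpha=lambda n: 1.0 / n):
        self.N_s = defaultdict(int)      # N[s]
        self.N_sa = defaultdict(int)     # N[s, a]
        self.N_sas = defaultdict(int)    # N[s, a, s']
        self.R = defaultdict(float)      # estimated rewards
        self.alpha = alpha

    def observe(self, s, r, a, s_next):
        self.N_s[s] += 1
        self.N_sa[(s, a)] += 1
        self.N_sas[(s, a, s_next)] += 1
        # Running average of the observed reward at s.
        self.R[s] += self.alpha(self.N_s[s]) * (r - self.R[s])

    def P(self, s, a, s_next):
        """Maximum-likelihood transition estimate N[s,a,s'] / N[s,a]."""
        n_sa = self.N_sa[(s, a)]
        return self.N_sas[(s, a, s_next)] / n_sa if n_sa else 0.0
```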


ACTIVE RL
• Rather than assume a policy is given, can we use the learned utilities to pick good actions?
• At each state s, the agent must learn outcomes for all actions, not just the action π(s)


GREEDY RL
• Maintain current estimates U^π(s)
• Idea: at state s, take the action a that maximizes Σ_s' P(s'|s, a) U^π(s')
• Very seldom works well! Why?


EXPLORATION VS. EXPLOITATION
• The greedy strategy purely exploits current knowledge
  • The quality of this knowledge improves only for those states that the agent observes often
• A good learner must perform exploration in order to improve its knowledge about states that are not observed often
  • But pure exploration is useless (and costly) if it is never exploited


RESTAURANT PROBLEM


OPTIMISTIC EXPLORATION STRATEGY
• Behave initially as if wonderful rewards R+ were scattered all over the place
• Define a modified, optimistic Bellman update:
  U+(s) ← R(s) + γ max_a f( Σ_s' P(s'|s, a) U+(s') , N[s, a] )
• Truncated exploration function:
  f(u, n) = R+ if n < N_e, and u otherwise
• [Here the agent will try each action in each state at least N_e times.]
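A one-line sketch of the truncated exploration function f above; the particular values of R+ and N_e are illustrative.

```python
def exploration_bonus(u, n, R_plus=1.0, N_e=5):
    """Truncated exploration function f(u, n): be optimistic (return R+)
    until the state-action pair has been tried at least N_e times,
    then fall back to the current utility estimate u."""
    return R_plus if n < N_e else u
```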


COMPLEXITY
• Truncated: at least N_e·|S|·|A| steps are needed in order to explore every action in every state
  • Some costly explorations might not be necessary, or the reward from far-off explorations may be highly discounted
• Convergence to the optimal policy is guaranteed only if each action is tried in each state an infinite number of times!
• This works with ADP… but how do we perform action selection in TD?
  • We must also learn the transition model P(s'|s, a)


WHAT TO LEARN?
(Spectrum from model-free, less online deliberation, to model-based, more online deliberation)
• Learn: policy π | action-utility function Q(s, a) | utility function U | model of R and T
• Online: π(s) | arg max_a Q(s, a) | arg max_a Σ_s' P(s'|s, a) U(s') | solve the MDP
• Method: learning from demonstration | Q-learning, SARSA | direct utility estimation, TD learning | adaptive dynamic programming
• Tradeoff: simpler execution ↔ fewer examples needed to learn?


Q-VALUES
• Learning U is not enough for action selection, because a transition model is needed
• Solution: learn Q-values: Q(s, a) is the utility of choosing action a in state s
• “Shift” the Bellman equation:
  • U(s) = max_a Q(s, a)
  • Q(s, a) = R(s) + γ Σ_s' P(s'|s, a) max_a' Q(s', a')
• So far, everything is the same… but what about the learning rule?


Q-LEARNING UPDATE
• Recall TD:
  • Update: U(s) ← U(s) + α(N[s]) (r + γ U(s') − U(s))
  • Select action: a ← arg max_a f( Σ_s' P(s'|s, a) U(s') , N[s, a] )
• Q-learning:
  • Update: Q(s, a) ← Q(s, a) + α(N[s, a]) (r + γ max_a' Q(s', a') − Q(s, a))
  • Select action: a ← arg max_a f( Q(s, a) , N[s, a] )
• Key difference: the average over P(s'|s, a) is “baked in” to the Q function
• Q-learning is therefore a model-free active learner
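A minimal sketch of the Q-learning update, mirroring the rule above; the flat dictionary keyed by (state, action), the default discount, and the 1/N[s, a] learning-rate schedule are assumptions.

```python
def q_learning_update(Q, N_sa, s, a, r, s_next, actions,
                      gamma=0.95, alpha=lambda n: 1.0 / n):
    """One Q-learning update:
    Q(s,a) <- Q(s,a) + alpha(N[s,a]) * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    No transition model is needed; `actions` lists the actions available in s'."""
    N_sa[(s, a)] = N_sa.get((s, a), 0) + 1
    q = Q.get((s, a), 0.0)
    q_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    Q[(s, a)] = q + alpha(N_sa[(s, a)]) * (r + gamma * q_next - q)
```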


MORE ISSUES IN RL
• Model-free vs. model-based
  • Model-based techniques are typically better at incorporating prior knowledge
• Generalization
  • Value function approximation
  • Policy search methods


LARGE-SCALE APPLICATIONS
• Game playing
  • TD-Gammon: neural network representation of Q-functions, trained via self-play
• Robot control


RECAP
• Online learning: learn incrementally with low memory overhead
• Key differences between RL methods: what to learn?
  • Temporal differencing: learn U through incremental updates. Cheap, but somewhat slow learning.
  • Adaptive DP: learn P and R, derive U through policy evaluation. Fast learning, but computationally expensive.
  • Q-learning: learn the state-action function Q(s, a), which allows model-free action selection
• Action selection requires trading off exploration vs. exploitation
  • Infinite exploration is needed to guarantee that the optimal policy is found!


INCREMENTAL LEAST SQUARES
• Recall the least-squares estimate θ = (AᵀA)⁻¹ Aᵀ b
• where A is the N×M matrix of the x(i)'s (laid out in rows) and b is the N×1 vector of the y(i)'s


DELTA RULE FOR LINEAR LEAST SQUARES
• Delta rule (Widrow-Hoff rule): stochastic gradient descent
  θ(t+1) = θ(t) + α x (y − θ(t)ᵀ x)
• O(n) time and space
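A one-step sketch of the delta rule using NumPy; the learning rate value is illustrative.

```python
import numpy as np

def delta_rule_step(theta, x, y, alpha=0.01):
    """Widrow-Hoff / delta rule: one stochastic-gradient step on the
    squared error (y - theta^T x)^2; O(M) time and space for M features."""
    return theta + alpha * x * (y - theta @ x)
```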


INCREMENTAL LEAST SQUARES
• Let A(t), b(t) be the A matrix and b vector up to time t
• θ(t) = (A(t)ᵀ A(t))⁻¹ A(t)ᵀ b(t)
• A(t+1) = [A(t); x(t+1)] is (t+1)×M; b(t+1) = [b(t); y(t+1)] is (t+1)×1


INCREMENTAL LEAST SQUARES
• Let A(t), b(t) be the A matrix and b vector up to time t
• θ(t+1) = (A(t+1)ᵀ A(t+1))⁻¹ A(t+1)ᵀ b(t+1)
• A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
• A(t+1) = [A(t); x(t+1)] is (t+1)×M; b(t+1) = [b(t); y(t+1)] is (t+1)×1


INCREMENTAL LEAST SQUARES
• Let A(t), b(t) be the A matrix and b vector up to time t
• θ(t+1) = (A(t+1)ᵀ A(t+1))⁻¹ A(t+1)ᵀ b(t+1)
• A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
• A(t+1)ᵀ A(t+1) = A(t)ᵀ A(t) + x(t+1) x(t+1)ᵀ
• A(t+1) = [A(t); x(t+1)] is (t+1)×M; b(t+1) = [b(t); y(t+1)] is (t+1)×1



INCREMENTAL LEAST SQUARES
• Let A(t), b(t) be the A matrix and b vector up to time t
• θ(t+1) = (A(t+1)ᵀ A(t+1))⁻¹ A(t+1)ᵀ b(t+1)
• A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
• A(t+1)ᵀ A(t+1) = A(t)ᵀ A(t) + x(t+1) x(t+1)ᵀ
• Sherman-Morrison update:
  (Y + x xᵀ)⁻¹ = Y⁻¹ − Y⁻¹ x xᵀ Y⁻¹ / (1 + xᵀ Y⁻¹ x)


INCREMENTAL LEAST SQUARES
• Putting it all together
• Store p(t) = A(t)ᵀ b(t) and Q(t) = (A(t)ᵀ A(t))⁻¹
• Update:
  p(t+1) = p(t) + y x
  Q(t+1) = Q(t) − Q(t) x xᵀ Q(t) / (1 + xᵀ Q(t) x)
  θ(t+1) = Q(t+1) p(t+1)
• O(M²) time and space, instead of O(M³ + MN) time and O(MN) space for ordinary least squares
• True least-squares estimate for any t (the delta rule works well only for large t)
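A sketch of the full recursive update using NumPy. Note that Q is initialized here to (1/ε)·I, a standard regularized starting point; that initialization is an assumption, since the slides start from the exact inverse once it exists.

```python
import numpy as np

class IncrementalLeastSquares:
    """Recursive least squares: maintain p(t) = A^T b and Q(t) = (A^T A)^{-1},
    updating Q with the Sherman-Morrison formula so each step costs O(M^2)
    instead of refitting from scratch."""
    def __init__(self, m, eps=1e-6):
        self.p = np.zeros(m)
        self.Q = np.eye(m) / eps   # large initial Q ~ weak ridge regularization

    def update(self, x, y):
        self.p += y * x                                  # p(t+1) = p(t) + y x
        Qx = self.Q @ x
        self.Q -= np.outer(Qx, Qx) / (1.0 + x @ Qx)      # Sherman-Morrison step
        return self.Q @ self.p                           # theta(t+1) = Q(t+1) p(t+1)
```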