Reinforcement Learning
Yishay Mansour, Tel-Aviv University
Outline
• Goal of Reinforcement Learning
• Mathematical Model (MDP)
• Planning
Goal of Reinforcement Learning
• Goal-oriented learning through interaction.
• Control of large-scale stochastic environments with partial knowledge.
• By contrast, supervised / unsupervised learning: learn from labeled / unlabeled examples.
Reinforcement Learning - origins
• Artificial Intelligence
• Control Theory
• Operations Research
• Cognitive Science & Psychology
Solid foundations; well-established research.
Typical Applications
• Robotics
– Elevator control [CB].
– Robo-soccer [SV].
• Board games
– Backgammon [T].
– Checkers [S].
– Chess [B].
• Scheduling
– Dynamic channel allocation [SB].
– Inventory problems.
Contrast with Supervised Learning
• The system has a “state”.
• The algorithm influences the state distribution.
• Inherent tradeoff: exploration versus exploitation.
Mathematical Model - Motivation
• Model of uncertainty: environment, actions, our knowledge.
• Focus on decision making.
• Maximize long-term reward.
• Model: Markov Decision Process (MDP).
Mathematical Model - MDP
Markov decision processes:
• S - set of states
• A - set of actions
• δ - transition probability
• R - reward function
Similar to a DFA!
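MDP model - states and actions
• Environment = states
• Actions = transitions
[Figure: taking action a in a state leads to one successor state with probability 0.7 and to another with probability 0.3.]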
MDP model - rewards
R(s, a) = reward at state s for doing action a (a random variable).
Example: R(s, a) =
• -1 with probability 0.5
• +10 with probability 0.35
• +20 with probability 0.15
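To make the "reward is a random variable" point concrete, here is a minimal sketch that samples from the example distribution above and compares the empirical mean with the exact expectation. The names `REWARD_DIST`, `expected_reward`, and `sample_reward` are illustrative and not from the slides.

```python
import random

# Reward distribution from the slide: (value, probability) pairs.
REWARD_DIST = [(-1, 0.5), (10, 0.35), (20, 0.15)]

def expected_reward(dist):
    """Exact expectation of a finite reward distribution."""
    return sum(value * prob for value, prob in dist)

def sample_reward(dist, rng):
    """Draw one reward according to the given probabilities."""
    values = [value for value, _ in dist]
    probs = [prob for _, prob in dist]
    return rng.choices(values, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs agree
samples = [sample_reward(REWARD_DIST, rng) for _ in range(10000)]
print("exact expectation:", expected_reward(REWARD_DIST))
print("empirical mean:   ", sum(samples) / len(samples))
```

Planning algorithms typically only need the expectation E[R(s, a)], while a learning agent sees individual samples like the ones drawn here.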
MDP model - trajectories
Trajectory: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...
MDP - Return function
Combines all the immediate rewards into a single value.
Modeling issues:
• Are early rewards more valuable than later rewards?
• Is the system “terminating” or continuous?
Usually the return is linear in the immediate rewards.
MDP model - return functions
• Finite horizon - parameter H
• Infinite horizon
– discounted - parameter γ < 1
– undiscounted
• Terminating MDP
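The formulas for these return functions did not survive in the slide text, so the sketch below uses the common textbook forms as an assumption: a plain sum of the first H rewards for the finite horizon, a γ-weighted sum for the discounted case, and the average reward as one usual reading of the undiscounted infinite-horizon return. All function names are illustrative.

```python
def finite_horizon_return(rewards, horizon):
    """Sum of the first `horizon` immediate rewards."""
    return sum(rewards[:horizon])

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, evaluated backwards (Horner's rule)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def average_return(rewards):
    """Average reward per step over the observed prefix (assumed
    stand-in for the undiscounted infinite-horizon return)."""
    return sum(rewards) / len(rewards)

rewards = [1, 0, 2, 3]
print(finite_horizon_return(rewards, 2))  # 1
print(discounted_return(rewards, 0.5))    # 1.875
print(average_return(rewards))            # 1.5
```

Each of these is linear in the immediate rewards, which matches the remark that the return is usually a linear function of them.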
MDP model - action selection
• Aim: maximize the expected return.
• Fully observable: the agent can “see” the “entire” state.
• Policy: a mapping from states to actions.
• Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
Contrast with Supervised Learning
• Supervised learning: fixed distribution on examples.
• Reinforcement learning: the state distribution is policy dependent!
• A small local change in the policy can make a huge global change in the return.
MDP model - summary
• S - set of states, |S| = n.
• A - set of k actions, |A| = k.
• δ - transition function.
• R(s, a) - immediate reward function.
• π - policy.
• Σ_{i≥0} γ^i r_i - discounted cumulative return.
Simple example: N-armed bandit
• Single state s with actions a_1, a_2, a_3.
• Goal: maximize the sum of immediate rewards.
• Given the model: take the greedy action.
• Difficulty: the model is unknown.
N-Armed Bandit: Highlights
• Algorithms (near greedy):
– Exponential weights
   G_i = sum of rewards of action a_i
   w_i = e^{G_i}
– Follow the (perturbed) leader
• Results:
– For any sequence of T rewards:
– E[online] > max_i {G_i} - sqrt(T log N)
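A minimal sketch of the exponential-weights idea, under two assumptions that are not in the slides: the learner sees the rewards of all actions each round (the full-information variant, not the bandit-feedback variant), and the weights use a learning rate `eta`, so w_i = e^{eta * G_i}, which reduces to the slide's w_i = e^{G_i} when `eta = 1`. Function names are illustrative.

```python
import math

def exp_weights_distribution(cum_rewards, eta=1.0):
    """Probabilities proportional to exp(eta * G_i).

    Subtracting the maximum before exponentiating avoids overflow
    and does not change the normalized probabilities.
    """
    top = max(cum_rewards)
    weights = [math.exp(eta * (g - top)) for g in cum_rewards]
    total = sum(weights)
    return [w / total for w in weights]

def run_exp_weights(reward_rows, eta=1.0):
    """Play one round per row; return (expected gain, final G_i)."""
    n_actions = len(reward_rows[0])
    cum = [0.0] * n_actions
    expected_gain = 0.0
    for rewards in reward_rows:
        probs = exp_weights_distribution(cum, eta)
        expected_gain += sum(p * r for p, r in zip(probs, rewards))
        cum = [g + r for g, r in zip(cum, rewards)]
    return expected_gain, cum

# Toy sequence (hypothetical): action 0 always pays 1, the others pay 0.
gain, totals = run_exp_weights([[1, 0, 0]] * 20)
print("expected online gain:", gain)
print("best action in hindsight:", max(totals))
```

The algorithm is "near greedy" in the sense that most of the probability mass quickly concentrates on the action with the largest cumulative reward, while every action keeps a nonzero probability.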
Planning - Basic Problems
Given a complete MDP model:
• Policy evaluation: given a policy π, estimate its return.
• Optimal control: find an optimal policy π* (one that maximizes the return from any start state).
Planning - Value Functions
• V^π(s): the expected return starting at state s and following π.
• Q^π(s, a): the expected return starting at state s with action a and then following π.
• V*(s) and Q*(s, a) are defined using an optimal policy π*.
• V*(s) = max_π V^π(s)
Planning - Policy Evaluation
Discounted infinite horizon (Bellman equation):
V^π(s) = E_{s' ~ π(s)} [ R(s, π(s)) + γ V^π(s') ]
Rewriting the expectation as a sum over next states gives a linear system of equations in the unknowns V^π(s), one equation per state.
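Algorithms - Policy Evaluation Example
• A = {+1, -1}, γ = 1/2
• δ(s_i, a) = s_{i+a} (indices wrap around the four-state cycle)
• π random
• ∀a: R(s_i, a) = i
[Figure: four states s_0, s_1, s_2, s_3 arranged in a cycle, with rewards 0, 1, 2, 3.]
V^π(s_0) = 0 + γ [ π(s_0, +1) V^π(s_1) + π(s_0, -1) V^π(s_3) ]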
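Algorithms - Policy Evaluation Example
• A = {+1, -1}, γ = 1/2
• δ(s_i, a) = s_{i+a} (indices wrap around the four-state cycle)
• π random
• ∀a: R(s_i, a) = i
[Figure: four states s_0, s_1, s_2, s_3 arranged in a cycle, with rewards 0, 1, 2, 3.]
V^π(s_0) = 5/3, V^π(s_1) = 7/3, V^π(s_2) = 11/3, V^π(s_3) = 13/3
For example: V^π(s_0) = 0 + (V^π(s_1) + V^π(s_3)) / 4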
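The values in this example can be checked by solving the linear Bellman system exactly. The sketch below builds (I - γP) V = R for the uniformly random policy and solves it with Gauss-Jordan elimination over exact fractions. It assumes the transition δ(s_i, a) = s_{i+a} wraps around modulo 4, as the equation for V^π(s_0) suggests; the helper names are illustrative.

```python
from fractions import Fraction

GAMMA = Fraction(1, 2)
N_STATES = 4
ACTIONS = (+1, -1)

def step(i, a):
    """Deterministic transition on the four-state cycle (assumed mod 4)."""
    return (i + a) % N_STATES

def evaluate_random_policy():
    """Solve (I - gamma * P) V = R exactly for the uniform random policy."""
    prob = Fraction(1, len(ACTIONS))
    rows = []
    for i in range(N_STATES):
        row = [Fraction(0)] * (N_STATES + 1)  # last column is the right-hand side
        row[i] += 1
        for a in ACTIONS:
            row[step(i, a)] -= GAMMA * prob
            row[N_STATES] += prob * i         # R(s_i, a) = i for every action
        rows.append(row)
    # Gauss-Jordan elimination with exact arithmetic.
    for col in range(N_STATES):
        pivot = next(r for r in range(col, N_STATES) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        for r in range(N_STATES):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][N_STATES] for i in range(N_STATES)]

values = evaluate_random_policy()
print([str(v) for v in values])  # ['5/3', '7/3', '11/3', '13/3']
```

Exact fractions are used here only so the output can be compared digit for digit with the slide; for larger MDPs a floating-point linear solver or iterative evaluation would be the usual choice.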
Algorithms - optimal control
State-action value function:
Q^π(s, a) = E[R(s, a)] + γ E_{s' ~ (s, a)} [ V^π(s') ]
Note: for a deterministic policy π, V^π(s) = Q^π(s, π(s)).
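Algorithms - Optimal control Example
• A = {+1, -1}, γ = 1/2
• δ(s_i, a) = s_{i+a} (indices wrap around the four-state cycle)
• π random
• R(s_i, a) = i
[Figure: four states s_0, s_1, s_2, s_3 arranged in a cycle, with rewards 0, 1, 2, 3.]
Q^π(s_0, +1) = 0 + γ V^π(s_1) = 7/6
Q^π(s_0, -1) = 0 + γ V^π(s_3) = 13/6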
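A short sketch of how the state-action values follow from the state values of the policy-evaluation example, using Q^π(s_i, a) = R(s_i, a) + γ V^π(δ(s_i, a)). The wrap-around transition (mod 4) is assumed, and `q_value` is an illustrative name.

```python
from fractions import Fraction

GAMMA = Fraction(1, 2)
# V^pi for the random policy, taken from the policy-evaluation example.
V = [Fraction(5, 3), Fraction(7, 3), Fraction(11, 3), Fraction(13, 3)]

def q_value(i, a):
    """Q^pi(s_i, a) = R(s_i, a) + gamma * V^pi(s_{i+a}), with R(s_i, a) = i."""
    return Fraction(i) + GAMMA * V[(i + a) % 4]

for i in range(4):
    print(f"s_{i}: Q(+1) = {q_value(i, +1)}, Q(-1) = {q_value(i, -1)}")
```

Averaging the two Q-values at s_0 under the random policy recovers V^π(s_0) = 5/3, which is a quick consistency check on the numbers.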
Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a {Q^π(s, a)} (Bellman equation)
PROOF: Assume there is a state s and an action a such that V^π(s) < Q^π(s, a). Then the strategy of performing a at state s (the first time) is better than π. This is true each time we visit s, so the policy that performs action a at state s is better than π.
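Algorithms - optimal control Example
• A = {+1, -1}, γ = 1/2
• δ(s_i, a) = s_{i+a} (indices wrap around the four-state cycle)
• π random
• R(s_i, a) = i
[Figure: four states s_0, s_1, s_2, s_3 arranged in a cycle, with rewards 0, 1, 2, 3.]
Changing the policy using the state-action value function.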
Algorithms - optimal control
• The greedy policy with respect to Q^π(s, a) is π(s) = argmax_a {Q^π(s, a)}.
• The ε-greedy policy with respect to Q^π(s, a) is π(s) = argmax_a {Q^π(s, a)} with probability 1 - ε, and π(s) = a random action with probability ε.
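The two action-selection rules can be sketched as follows. The Q-table is assumed to be a dictionary keyed by (state, action) pairs, which is an illustrative choice and not something the slides prescribe.

```python
import random

def greedy_action(q, state, actions):
    """Return argmax_a Q(state, a)."""
    return max(actions, key=lambda a: q[(state, a)])

def epsilon_greedy_action(q, state, actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_action(q, state, actions)

# Q-values at s_0 from the example: Q(s_0, +1) = 7/6, Q(s_0, -1) = 13/6.
q_table = {(0, +1): 7 / 6, (0, -1): 13 / 6}
actions = [+1, -1]
rng = random.Random(0)
print(greedy_action(q_table, 0, actions))                    # -1
print(epsilon_greedy_action(q_table, 0, actions, 0.1, rng))
```

Note that the random branch may also return the greedy action, so the greedy action is chosen with probability slightly above 1 - ε.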
MDP - computing optimal policy
1. Linear Programming
2. Value Iteration method
3. Policy Iteration method
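As an illustration of the second method, here is a minimal value-iteration sketch on the four-state cycle example (γ = 1/2, R(s_i, a) = i, transitions assumed to wrap around mod 4). The stopping tolerance and function names are assumptions made for this sketch.

```python
GAMMA = 0.5
N_STATES = 4
ACTIONS = (+1, -1)

def q_from_v(v, i, a):
    """One-step lookahead: R(s_i, a) + gamma * V(s_{i+a})."""
    return i + GAMMA * v[(i + a) % N_STATES]

def value_iteration(tol=1e-10):
    """Repeat V(s) <- max_a Q(s, a) until the largest change is below tol."""
    v = [0.0] * N_STATES
    while True:
        new_v = [max(q_from_v(v, i, a) for a in ACTIONS) for i in range(N_STATES)]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v

def greedy_policy(v):
    """Extract the greedy action in every state from a value function."""
    return [max(ACTIONS, key=lambda a: q_from_v(v, i, a)) for i in range(N_STATES)]

v_star = value_iteration()
print(v_star)                 # approximately [8/3, 10/3, 14/3, 16/3]
print(greedy_policy(v_star))  # [-1, 1, 1, -1]
```

In this example the optimal policy moves toward the high-reward states and then alternates between s_2 and s_3.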
Convergence
• Value Iteration
– Drop in distance from optimal each iteration: max_s {V*(s) - V_t(s)}
• Policy Iteration
– Policy can only improve: ∀s: V_{t+1}(s) ≥ V_t(s)
– Fewer iterations than Value Iteration, but more expensive iterations.
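For comparison, a minimal policy-iteration sketch on the same four-state cycle example (γ = 1/2, R(s_i, a) = i, transitions assumed to wrap around mod 4). Each iteration evaluates the current deterministic policy and then switches to the greedy policy with respect to that evaluation. The fixed number of evaluation sweeps and the starting policy are assumptions of this sketch.

```python
GAMMA = 0.5
N_STATES = 4
ACTIONS = (+1, -1)

def evaluate(policy, sweeps=200):
    """Iterative evaluation of a deterministic policy.

    With gamma = 1/2 the error shrinks by half per sweep, so 200
    sweeps are far more than needed for double precision.
    """
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [i + GAMMA * v[(i + policy[i]) % N_STATES] for i in range(N_STATES)]
    return v

def improve(v):
    """Greedy policy with respect to the one-step lookahead on v."""
    return [max(ACTIONS, key=lambda a: i + GAMMA * v[(i + a) % N_STATES])
            for i in range(N_STATES)]

def policy_iteration():
    """Alternate evaluation and improvement until the policy is stable."""
    policy = [+1] * N_STATES  # arbitrary starting policy
    iterations = 0
    while True:
        v = evaluate(policy)
        new_policy = improve(v)
        iterations += 1
        if new_policy == policy:
            return policy, v, iterations
        policy = new_policy

policy, v, iterations = policy_iteration()
print(policy, v, iterations)
```

Each pass here costs a full policy evaluation, which is the "more expensive iterations" side of the trade-off, while the number of passes needed is small.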
Relations to Board Games
• state = current board
• action = what we can play
• opponent action = part of the environment
• value function = probability of winning
• Q-function = modified policy
Hidden assumption: the game is Markovian.
Planning versus Learning
• Tightly coupled in Reinforcement Learning.
• Goal: maximize return while learning.
Example - Elevator Control
• Learning (alone): learn the arrival model well.
• Planning (alone): given the arrival model, build a schedule.
• Real objective: construct a schedule while updating the model.
Partially Observable MDP
Rather than observing the state, we observe some function of the state.
Ob - observation function: a random variable for each state.
Examples: (1) Ob(s) = s + noise. (2) Ob(s) = first bit of s.
Problem: different states may “look” similar.
The optimal strategy is history dependent!
POMDP - Belief State Algorithm
Given a history of actions and observed values, we compute a posterior distribution over the state we are in (the belief state).
The belief-state MDP:
• States: distributions over S (the states of the POMDP).
• Actions: as in the POMDP.
• Transition: the posterior distribution (given the observation).
We can perform planning and learning on the belief-state MDP.
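The belief-state transition can be sketched as a Bayes-filter step: predict the next-state distribution under the chosen action, then condition on the observation and renormalize. The interface below (`transition` and `obs_prob` as callables, beliefs as dictionaries) and the toy POMDP are hypothetical and chosen only for illustration; the toy reuses a four-state cycle and the "first bit of s" observation from the earlier example, read here as the high bit of a two-bit state index.

```python
def belief_update(belief, action, observation, transition, obs_prob):
    """One belief-state transition.

    belief:      dict mapping state -> probability
    transition:  transition(s, a) -> dict mapping next state -> probability
    obs_prob:    obs_prob(s_next, o) -> probability of observing o in s_next
    """
    # Prediction step: push the belief through the transition model.
    predicted = {}
    for s, b in belief.items():
        for s_next, p in transition(s, action).items():
            predicted[s_next] = predicted.get(s_next, 0.0) + b * p
    # Correction step: weight by the observation likelihood and normalize.
    unnormalized = {s: p * obs_prob(s, observation) for s, p in predicted.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical toy POMDP: deterministic moves on a four-state cycle.
def cycle_transition(s, a):
    return {(s + a) % 4: 1.0}

# Observation = first (high) bit of the two-bit state index, without noise.
def first_bit_obs(s, o):
    return 1.0 if (s >> 1) == o else 0.0

uniform = {s: 0.25 for s in range(4)}
posterior = belief_update(uniform, +1, 1, cycle_transition, first_bit_obs)
print(posterior)  # states 2 and 3 each get probability 0.5
```

States 2 and 3 "look" the same under this observation, so the belief cannot separate them after one step; only the history of actions and observations can.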
POMDP
Hard computational problems:
• Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT, L].
• Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT, L].
• Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].
Resources
• Reinforcement Learning: An Introduction [Sutton & Barto]
• Markov Decision Processes [Puterman]
• Dynamic Programming and Optimal Control [Bertsekas]
• Neuro-Dynamic Programming [Bertsekas & Tsitsiklis]
• Ph.D. thesis - Michael Littman