Hierarchical Reinforcement Learning
[A Survey and Comparison of HRL Techniques]
Mausam
The Outline of the Talk
➤ MDPs and Bellman's curse of dimensionality
- RL: simultaneous learning and planning
- Explore avenues to speed up RL
- Illustrate prominent HRL methods
- Compare prominent HRL methods
- Discuss future research
- Summarise
Decision Making
[Diagram: agent-environment loop; percepts flow in, actions flow out. "What action next?"]
Slide courtesy Dan Weld
Personal Printerbot
- States (S): {loc, has-robot-printout, user-loc, has-user-printout}, map
- Actions (A): {move-n, move-s, move-e, move-w, extend-arm, grab-page, release-pages}
- Reward (R): +20 if has-user-printout, else -1
- Goal (G): all states with has-user-printout true
- Start state: a state with has-user-printout false
Episodic Markov Decision Process
Episodic MDP ≡ ⟨S, A, P, R, G, s₀⟩: an MDP with absorbing goals.
- S: set of environment states
- A: set of available actions
- P: probabilistic transition model, P(s'|s, a)*
- R: reward model, R(s)*
- G: absorbing goal states
- s₀: start state
- γ: discount factor**
* Markovian assumption.
** γ bounds R for an infinite horizon.
Goal of an Episodic MDP
Find a policy π: S → A that:
- maximises expected discounted reward
- for a fully observable* episodic MDP
- when the agent is allowed to execute for an indefinite horizon
* Non-noisy, complete-information perceptors.
Solution of an Episodic MDP
- Define V*(s): the optimal expected reward starting in state s.
- Value Iteration: start with an estimate of V*(s) and successively re-estimate it until it converges to a fixed point.
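Value iteration, as described above, can be sketched in a few lines. The tiny two-state chain below is an illustrative stand-in with made-up numbers, not the Printerbot domain from the talk; the reward structure (+20 at the goal, -1 per step) mirrors the Printerbot slide.

```python
# Tabular value iteration for an episodic MDP: a minimal sketch.
GAMMA = 0.9

# P[s][a] = list of (next_state, probability); R[s] = reward; GOALS = absorbing states.
P = {
    "s0": {"right": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"right": [("goal", 1.0)]},
}
R = {"s0": -1.0, "s1": -1.0, "goal": +20.0}
GOALS = {"goal"}

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in R}
    while True:
        delta = 0.0
        for s in R:
            if s in GOALS:
                new_v = R[s]          # absorbing goal: value is its reward
            else:
                # Bellman backup: R(s) + gamma * max_a E[V(s')]
                new_v = R[s] + GAMMA * max(
                    sum(p * V[s2] for s2, p in P[s][a]) for a in P[s]
                )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            return V
```

Each sweep is polynomial in |S| (one backup per state), which is exactly where the curse of dimensionality on the next slide bites: |S| itself is exponential in the number of state features.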
Complexity of Value Iteration
- Each iteration: polynomial in |S|
- Number of iterations: polynomial in |S|
- Overall: polynomial in |S|
- But |S| is exponential in the number of features in the domain*
* Bellman's curse of dimensionality.
The Outline of the Talk
✓ MDPs and Bellman's curse of dimensionality
➤ RL: simultaneous learning and planning
- Explore avenues to speed up RL
- Illustrate prominent HRL methods
- Compare prominent HRL methods
- Discuss future research
- Summarise
Learning
[Diagram: data flows from the environment to the agent]
- Gain knowledge
- Gain understanding
- Gain skills
- Modification of behavioural tendency
Decision Making while Learning*
[Diagram: agent-environment loop; percepts/data flow in, actions flow out. "What action next?"]
- Gain knowledge
- Gain understanding
- Gain skills
- Modification of behavioural tendency
* Known as Reinforcement Learning.
Reinforcement Learning
- Unknown transition model P and reward model R.
- Learning component: estimate P and R from data observed in the environment.
- Planning component: decide which actions to take to maximise reward.
- Exploration vs. exploitation
  - GLIE (Greedy in the Limit with Infinite Exploration)
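One common GLIE schedule is ε-greedy action selection with ε decaying as 1/t: every action is tried infinitely often, yet the policy becomes greedy in the limit. The sketch below is one such schedule, not the only one; the function name and the dict-based Q representation are our own conventions.

```python
import random

def glie_epsilon_greedy(Q, s, actions, t):
    """Pick an action epsilon-greedily with epsilon = 1/t (a GLIE schedule).
    Q is a dict mapping (state, action) -> estimated value."""
    eps = 1.0 / max(t, 1)
    if random.random() < eps:
        return random.choice(actions)       # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit
```

At t = 1 the agent explores with probability 1; as t grows it almost always takes the greedy action, which is what the convergence results for Q-learning below require.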
Learning
- Model-based learning
  - Learn the model, then plan with it
  - Requires less data, more computation
- Model-free learning
  - Plan without learning an explicit model
  - Requires a lot of data, less computation
Q-Learning
- Instead of learning P and R, learn Q* directly.
- Q*(s, a): optimal expected reward starting in s if the first action is a and the optimal policy is followed thereafter.
- Q* directly defines the optimal policy: in each state, take the action with maximum Q* value.
Q-Learning
- Given an experience tuple ⟨s, a, s', r⟩, update:
  Q(s, a) ← (1 − α) Q(s, a) + α [r + γ max_a' Q(s', a')]
  (old estimate of the Q value blended with the new estimate).
- Under suitable assumptions on α and GLIE exploration, Q-Learning converges to the optimal Q values.
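The update rule above is one line of code. A minimal sketch, with illustrative constants and state names of our own choosing:

```python
# One tabular Q-learning update on an experience tuple (s, a, s', r).
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.5

def q_update(Q, s, a, s2, r, actions):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

Q = defaultdict(float)                     # all estimates start at 0
q_update(Q, "hall", "move-e", "hall2", -1.0, ["move-e", "move-w"])
```

Note that the update needs neither P nor R: the environment's sampled transition and reward stand in for the model, which is why Q-learning is model-free.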
Semi-MDP: when actions take time
- The Semi-MDP Bellman equation discounts by γ^N, where N is the (random) duration of the action:
  Q*(s, a) = E[r + γ^N max_a' Q*(s', a')]
- The Semi-MDP Q-Learning update, for an experience tuple ⟨s, a, s', r, N⟩:
  Q(s, a) ← (1 − α) Q(s, a) + α [r + γ^N max_a' Q(s', a')]
  where r is the accumulated discounted reward while action a was executing.
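The SMDP update differs from the one-step rule only in discounting the bootstrap term by γ^N. A minimal sketch (constants and names are illustrative):

```python
# Semi-MDP Q-learning update: like the one-step rule, but the bootstrap
# term is discounted by gamma**N for an action that ran N time steps.
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.5

def smdp_q_update(Q, s, a, s2, r, n_steps, options):
    """r must already be the discounted sum of rewards accumulated
    while temporally extended action a was executing."""
    best_next = max(Q[(s2, o)] for o in options)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA**n_steps * best_next)

Q = defaultdict(float)
Q[("end", "fetch")] = 10.0
smdp_q_update(Q, "hall", "walk", "end", -2.0, 5, ["fetch"])
```

This single rule is the workhorse of all three HRL frameworks covered later: each of them reduces learning over temporally extended actions to SMDP Q-learning.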
Printerbot
- The Paul G. Allen Center has ~85,000 sq ft of space.
- Each floor: ~85,000 / 7 ≈ 12,000 sq ft.
- Discretise location on a floor: 12,000 parts.
- State space (without the map): 2 × 2 × 12,000 states. Very large!
- How do humans do this decision making?
The Outline of the Talk
✓ MDPs and Bellman's curse of dimensionality
✓ RL: simultaneous learning and planning
➤ Explore avenues to speed up RL
- Illustrate prominent HRL methods
- Compare prominent HRL methods
- Discuss future research
- Summarise
1. The Mathematical Perspective: A Structure Paradigm
- S: Relational MDP
- A: Concurrent MDP
- P: Dynamic Bayes Nets
- R: Continuous-state MDP
- G: Conjunction of state variables
- V: Algebraic Decision Diagrams
- π: Decision List (RMDP)
2. Modular Decision Making
- Go out of the room
- Walk in the hallway
- Go into the room
2. Modular Decision Making
- Humans plan modularly, at different granularities of understanding.
- Going out of one room is similar to going out of another room.
- Navigation steps do not depend on whether we have the printout or not.
3. Background Knowledge
- Classical planners using additional control knowledge can scale up to larger problems (e.g., HTN planning, TLPlan).
- What forms of control knowledge can we provide to our Printerbot?
  - First pick up the printouts, then deliver them.
  - Navigation: consider rooms and hallways separately, etc.
A mechanism that exploits all three avenues: Hierarchies
1. A way to add a special (hierarchical) structure on different parameters of an MDP.
2. Draws from the intuition and reasoning in human decision making.
3. A way to provide additional control knowledge to the system.
The Outline of the Talk
✓ MDPs and Bellman's curse of dimensionality
✓ RL: simultaneous learning and planning
✓ Explore avenues to speed up RL
➤ Illustrate prominent HRL methods
- Compare prominent HRL methods
- Discuss future research
- Summarise
Hierarchy
- A hierarchy of: behaviours, skills, modules, subtasks, macro-actions, etc.
  - picking up the pages
  - collision avoidance
  - fetch-pages phase
  - walking in the hallway
- HRL ≡ RL with temporally extended actions
Hierarchical Algorithms ≡ Gating Mechanism*
[Diagram: a gate g selects among behaviours b_i]
Hierarchical learning:
- Learning the gating function
- Learning the individual behaviours
- Learning both
* g is a gate, b_i is a behaviour; this can be a multilevel hierarchy.
Option: move-e until end of hallway
- Start: any state in the hallway.
- Execute: the policy as shown.
- Terminate: when s is the end of the hallway.
Options [Sutton, Precup, Singh '99]
- An option is a well-defined behaviour: o = ⟨I_o, π_o, β_o⟩
- I_o: set of states (I_o ⊆ S) in which o can be initiated.
- π_o(s): policy (S → A*) followed while o is executing.
- β_o(s): probability that o terminates in s.
* π_o can be a policy over lower-level options.
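The option triple maps directly onto a small data structure. A sketch, where the field names and the hallway states are our own illustrative choices, not notation from the original paper:

```python
# An option o = <I_o, pi_o, beta_o>: initiation set, policy, termination.
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    init_set: Set[str]                   # I_o: states where o may start
    policy: Callable[[str], str]         # pi_o: state -> (primitive) action
    term_prob: Callable[[str], float]    # beta_o: state -> P(o terminates)

# "move-e until end of hallway": always move east, stop at the end.
hallway = {"h0", "h1", "h2"}
move_east = Option(
    init_set=hallway,
    policy=lambda s: "move-e",
    term_prob=lambda s: 1.0 if s == "end-of-hallway" else 0.0,
)
```

Because π_o may itself choose among lower-level options, the same structure nests to arbitrary depth.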
Learning
- An option is a temporally extended action with a well-defined policy.
- The set of options O replaces the set of actions A.
- Learning occurs outside the options.
- Learning over options ≡ Semi-MDP Q-Learning.
Machine: move-e + collision avoidance
[Diagram: a finite-state machine. The main loop executes move-e; on encountering an obstacle, a choose node calls sub-machine M1 or M2 (avoidance manoeuvres built from move-w/move-s and move-n), each of which returns; at the end of the hallway the machine returns.]
Hierarchies of Abstract Machines [Parr, Russell '97]
- A machine is a partial policy represented by a Finite State Automaton.
- Node types:
  - Action: execute a ground action.
  - Call: call another machine as a subroutine.
  - Choice: choose the next node.
  - Return: return to the calling machine.
Learning
- Learning occurs within machines, since machines are only partially defined.
- Flatten all the machines and consider states [s, m], where s is a world state and m a machine node ≡ an MDP.
- reduce(S ∘ M): consider only the states where the machine node is a choice node ≡ a Semi-MDP.
- Learning ≈ Semi-MDP Q-Learning.
Task Hierarchy: MAXQ Decomposition [Dietterich '00]
[Diagram: Root → {Fetch, Deliver}; Fetch → {Navigate(loc), Take}; Deliver → {Navigate(loc), Give}; Take → {Extend-arm, Grab}; Give → {Extend-arm, Release}; Navigate(loc) → {move-n, move-s, move-e, move-w}. Children of a task are unordered.]
MAXQ Decomposition
- Augment the state s with the subtask i: [s, i].
- Define C([s, i], j) as the reward received within i after subtask j finishes.
- Q([s, Fetch], Navigate(pr-room)) = V([s, Navigate(pr-room)]) + C([s, Fetch], Navigate(pr-room))*
  (the reward received while navigating, plus the reward received in the Fetch context after navigation).
- Express V in terms of C; learn C instead of learning Q.
* Note the context-free nature of the V term in the Q-value.
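The two-part decomposition Q(i, s, j) = V(j, s) + C(i, s, j) can be sketched as a mutual recursion. The task names echo the Printerbot hierarchy, but the C and reward tables below are made-up numbers, and the sketch simplifies by evaluating every level at the same state s:

```python
# MAXQ two-part value decomposition: Q(i,s,j) = V(j,s) + C(i,s,j),
# where V recurses down to primitive rewards.
PRIMITIVES = {"move-e"}
primitive_reward = {("move-e", "hall"): -1.0}
C = {
    ("Navigate", "hall", "move-e"): -2.0,  # completion cost inside Navigate
    ("Fetch", "hall", "Navigate"): 3.0,    # completion reward inside Fetch
}

def V(task, s, policy):
    """Value of running `task` from s: a primitive's expected reward,
    or Q of the child that the task's policy picks."""
    if task in PRIMITIVES:
        return primitive_reward[(task, s)]
    return Q(task, s, policy[task](s), policy)

def Q(task, s, child, policy):
    """V(child) is context-free; C carries the parent-task context."""
    return V(child, s, policy) + C.get((task, s, child), 0.0)

policy = {"Fetch": lambda s: "Navigate", "Navigate": lambda s: "move-e"}
```

Because V(child, s) does not depend on the parent task, one learned V per subtask is shared across every context in which that subtask appears; only the (much smaller) C tables are context-specific.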
The Outline of the Talk
✓ MDPs and Bellman's curse of dimensionality
✓ RL: simultaneous learning and planning
✓ Explore avenues to speed up RL
✓ Illustrate prominent HRL methods
➤ Compare prominent HRL methods
- Discuss future research
- Summarise
1. State Abstraction
- Abstract state: a state with fewer state variables; several world states map to the same abstract state.
- If we can drop some state variables, learning time is reduced considerably!
- We may use different abstract states for different macro-actions.
State Abstraction in MAXQ
- Relevance: only some variables are relevant to a task.
  - Fetch: user-loc is irrelevant.
  - Navigate(printer-room): h-r-po, h-u-po, user-loc are irrelevant.
  - Fewer parameters for V at the lower levels.
- Funnelling: a subtask maps many states to a smaller set of states.
  - Fetch: all states map to h-r-po = true, loc = printer-room.
  - Fewer parameters for C at the higher levels.
State Abstraction in Options, HAM
- Options: learning is required only in states that are terminal states of some option.
- HAM: the original work has no abstraction.
  - Extension: three-way value decomposition*:
    Q([s, m], n) = V([s, n]) + C([s, m], n) + Cex([s, m])
  - Similar abstractions are then employed.
* [Andre, Russell '02]
2. Optimality
Hierarchical Optimality vs. Recursive Optimality
Optimality
- Options: hierarchically optimal
  - Use (A ∪ O): globally optimal**
  - Interrupt options
- HAM: hierarchically optimal*
- MAXQ: recursively optimal*
  - Interrupt subtasks
  - Use pseudo-rewards
  - Iterate!
* Equations can be defined for both optimalities.
** The advantage of using macro-actions may be lost.
3. Language Expressiveness
- Options
  - Can only input a complete policy.
- HAM
  - Can input a complete policy.
  - Can input a task hierarchy.
  - Can represent "amount of effort".
  - Later extended to partial programs.
- MAXQ
  - Cannot input a policy (full or partial).
4. Knowledge Requirements
- Options
  - Require a complete specification of each policy.
  - One could learn option policies, given the subtasks.
- HAM
  - Medium requirements.
- MAXQ
  - Minimal requirements.
5. Models Advanced
- Options: concurrency.
- HAM: richer representations, concurrency.
- MAXQ: continuous time, states, and actions; multi-agent settings; average reward.
- In general, more researchers have followed MAXQ:
  - less input knowledge
  - value decomposition.
6. Structure Paradigm
- S: Options, MAXQ
- A: All
- P: None
- R: MAXQ
- G: All
- V: MAXQ
- π: All
The Outline of the Talk
✓ MDPs and Bellman's curse of dimensionality
✓ RL: simultaneous learning and planning
✓ Explore avenues to speed up RL
✓ Illustrate prominent HRL methods
✓ Compare prominent HRL methods
➤ Discuss future research
- Summarise
Directions for Future Research
- Bidirectional state abstractions
- Hierarchies over other RL research
  - Model-based methods
  - Function approximators
- Probabilistic planning
  - Hierarchical P and hierarchical R
- Imitation learning
Directions for Future Research
- Theory
  - Bounds (goodness of a hierarchy)
  - Non-asymptotic analysis
- Automated discovery
  - Discovery of hierarchies
  - Discovery of state abstractions
- Apply…
Applications
- Toy robot
- Flight simulator
- AGV scheduling
  [Diagram: a warehouse with parts P1-P4 and assembly stations D1-D4]
- Keepaway soccer
Images courtesy various sources.
Thinking Big… ". . . consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*. ” -- David Andre v Use planners, theorem provers, etc. as components in big hierarchical solver.
The Outline of the Talk
✓ MDPs and Bellman's curse of dimensionality
✓ RL: simultaneous learning and planning
✓ Explore avenues to speed up RL
✓ Illustrate prominent HRL methods
✓ Compare prominent HRL methods
✓ Discuss future research
➤ Summarise
How to Choose an Appropriate Hierarchy
- Look at the available domain knowledge:
  - If some behaviours are completely specified: Options.
  - If some behaviours are partially specified: HAM.
  - If little domain knowledge is available: MAXQ.
- All three can be used in tandem to specify different behaviours.
Main Ideas in the HRL Community
- Hierarchies speed up learning.
- Value function decomposition.
- State abstraction.
- Greedy non-hierarchical execution.
- Context-free learning and pseudo-rewards.
- Policy improvement by re-estimation and re-learning.