Hierarchical POMDP Solutions Georgios Theocharous 1

Sequential Decision Making Under Uncertainty. Figure: the AGENT receives OBSERVATIONS & REWARDS (e.g. symptoms) from an ENVIRONMENT whose HIDDEN STATES it cannot see, and chooses ACTIONS (treatments, tests). What is the optimal policy? 2

Manufacturing Processes (Mahadevan, Theocharous FLAIRS 98). A machine-and-buffer system. States: parts in buffers, machine internal state. Actions: produce, maintenance. Observations: throughput. Reward: reward for consuming, penalize for filling buffers, penalize for machine breakdown. What is the optimal policy? 3

Foveated Active Vision (Minut). States: objects. Observations: local features. Actions: where to saccade next, what features to use. Reward: reward for finding the object. What is the optimal policy? 4

Many More Partially Observable Problems § Assistive technologies § Web searching, preference elicitation § Sophisticated Computing § Distributed file access, Network trouble-shooting § Industrial § Machine maintenance, manufacturing processes § Social § Education, medical diagnosis, health care policymaking § Corporate § Marketing, corporate policy § …. 5

Overview § Learning models of partially observable problems is far from a solved problem § Computing policies for partially observable domains is intractable § We propose hierarchical solutions § Learn models using less space and time § Compute robust policies that cannot be computed by previous approaches 6

How? Spatial and Time Abstractions Reduce Uncertainty Spatial abstraction MIT Temporal abstraction 7

Outline § Sequential decision-making under uncertainty § A Hierarchical POMDP model for robot navigation § Heuristic macro-action selection in H-POMDPs § Near Optimal macro-action selection for arbitrary POMDPs § Representing H-POMDPs as DBNs § Current and Future directions 8

A Real System: Robot Navigation. Transition matrix for the Go-Forward action, row for S1 (states S1 … S15): S1 0.1, S5 0.8, S9 0.1, all other states 0.0. Observation model for S1: OOOO 0.0, … 0.15, OWOW 0.7, … 0.15, WWWW 0.0. 9
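
Below is a minimal NumPy sketch of how these slide-9 quantities could be stored; the total number of states and the two elided observation classes are assumptions, not from the talk.

```python
import numpy as np

# Illustrative layout only (not the authors' code).
N_STATES = 15                                   # S1 ... S15, as suggested by the figure
T_forward = np.zeros((N_STATES, N_STATES))      # T_forward[i, j] = P(S'=j | S=i, go-forward)
T_forward[0, [0, 4, 8]] = [0.1, 0.8, 0.1]       # row for S1: stay in S1, reach S5, or reach S9

OBS = ["OOOO", "?", "OWOW", "?", "WWWW"]        # two observation classes are elided on the slide
O_s1 = np.array([0.0, 0.15, 0.7, 0.15, 0.0])    # P(observation | S1), matching the slide
```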

Belief States (Probability Distributions over states). Figures: the true state versus the belief state, over successive steps. 10-12
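
These slides are figure-only; as a point of reference, here is a hedged sketch of the standard Bayes-filter belief update that produces such distributions (the array layouts are illustrative).

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Update belief b after taking action a and observing z.
    T[a] is the |S|x|S| transition matrix for action a and
    O[a] is the |S|x|Z| observation matrix (illustrative layout)."""
    predicted = b @ T[a]               # propagate the belief through the dynamics
    updated = predicted * O[a][:, z]   # weight by the likelihood of the observation
    return updated / updated.sum()     # normalize so it remains a distribution
```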

Learning POMDPs § Given the A's (actions) and Z's (observations), compute the T's and O's § Estimate the probability distribution over hidden states § Count the number of times each state was visited § Update T and O and repeat § It is an Expectation Maximization algorithm: an iterative procedure for doing maximum-likelihood parameter estimation over hidden state variables § Converges to local maxima. Figure: a DBN over A1, A2, S1-S3, Z1-Z3 with parameters T(S1=i, A1=a, S2=j) and O(O2=z, S2=i, A1=a). 13
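
A compact sketch of one such EM (Baum-Welch) iteration for a flat POMDP follows; for brevity the observation model here ignores the action dependence shown in the slide's figure, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def em_step(pi, T, O, actions, obs):
    """One Baum-Welch iteration for a flat POMDP.
    pi: (S,) initial belief, T: (A, S, S) transitions, O: (S, Z) observations,
    actions: a_1..a_{n-1}, obs: z_1..z_n."""
    S, n = len(pi), len(obs)
    # E-step: forward-backward posteriors over the hidden states
    alpha, beta = np.zeros((n, S)), np.ones((n, S))
    alpha[0] = pi * O[:, obs[0]]; alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ T[actions[t - 1]]) * O[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    for t in range(n - 2, -1, -1):
        beta[t] = T[actions[t]] @ (O[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)          # P(S_t = i | data)
    # M-step: "count" expected visits and transitions, then renormalize
    T_new, O_new = np.full_like(T, 1e-9), np.full_like(O, 1e-9)
    for t in range(n - 1):
        xi = alpha[t][:, None] * T[actions[t]] * (O[:, obs[t + 1]] * beta[t + 1])[None, :]
        T_new[actions[t]] += xi / xi.sum()
        O_new[:, obs[t]] += gamma[t]
    O_new[:, obs[-1]] += gamma[-1]
    T_new /= T_new.sum(axis=2, keepdims=True)
    O_new /= O_new.sum(axis=1, keepdims=True)
    return T_new, O_new, gamma
```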

Planning in POMDPs § Belief states constitute a sufficient statistic for making decisions (the Markov property holds: Astrom 1965) § Bellman equation over beliefs (see below) § Since the belief space is infinite, the problem becomes computationally intractable: PSPACE-hard for finite horizon, undecidable for infinite horizon. Figure: the AGENT's STATE ESTIMATOR turns each OBSERVATION (z) from the ENVIRONMENT into a BELIEF STATE (b), which the POLICY (π) maps to an ACTION (a). 14
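
The Bellman equation itself is an image in the original slide; its standard belief-space form, reconstructed here with transition model T, observation model O, reward R, and belief update τ, is:

```latex
% Reconstruction of the standard belief-space Bellman equation (not copied from the slide)
V^*(b) \;=\; \max_{a \in A} \Big[\, \sum_{s} b(s)\, R(s,a)
      \;+\; \gamma \sum_{z} P(z \mid b, a)\, V^*\big(\tau(b,a,z)\big) \Big],
\qquad
\tau(b,a,z)(s') \;=\; \frac{O(s',a,z) \sum_{s} T(s,a,s')\, b(s)}{P(z \mid b, a)} .
```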

Our Solution: Spatial and Temporal Abstraction § Learning § A hierarchical Baum-Welch algorithm, which is derived from the Baum-Welch algorithm for training HHMMs (with Rohanimanesh and Mahadevan, ICRA 2001) § Structure learning from weak priors (with Mahadevan IROS 2002) § Inference can be done in linear time by representing H-POMDPs as Dynamic Bayesian Networks (DBNs) (with Murphy and Kaelbling, ICRA 2004) § Planning § Heuristic macro-action selection (with Mahadevan, ICRA 2002) § Near optimal macro-action selection (with Kaelbling, NIPS 2003) § Structure Learning and Planning combined § Dynamic POMDP abstractions (with Mannor and Kaelbling) 15

Outline § Sequential decision-making under uncertainty § A Hierarchical POMDP model for robot navigation § Heuristic macro-action selection in H-POMDPs § Near Optimal macro-action selection for arbitrary POMDPs § Representing H-POMDPs as DBNs § Current and Future directions 16

Hierarchical POMDPs WEST EAST 17

Hierarchical POMDPs - ABSTRACT STATES + ACTIONS (Fine, Singer, Tishby, MLJ 98) 18
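
For concreteness, here is an illustrative sketch of the data an H-POMDP of this kind carries: abstract states that contain concrete sub-states with designated entry and exit states, plus macro-actions between abstract states. Class and field names are our assumptions, not the authors' implementation.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AbstractState:
    name: str
    sub_states: List[str]              # concrete states inside this abstract state
    entry_states: List[str]            # states a macro-action can enter through
    exit_states: List[str]             # states that hand control back to the top level
    sub_T: Dict[str, np.ndarray] = field(default_factory=dict)  # per primitive action
    sub_O: Dict[str, np.ndarray] = field(default_factory=dict)  # observation model per sub-state

@dataclass
class HPOMDP:
    abstract_states: Dict[str, AbstractState]
    macro_actions: List[str]           # e.g. "go-west", "go-east"
    abstract_T: Dict[str, np.ndarray] = field(default_factory=dict)  # transitions between abstract states
```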

Experimental Environments 600 states 1200 states 19

The Robot Navigation Domain § The robot Pavlov in the real MSU environment § The Nomad 200 simulator 20

Learning Feature Detectors (Mahadevan, Theocharous, Khaleeli: MLJ 98) § 736 hand-labeled grids § 8-fold cross-validation § Classification error (m = 7.33, s = 3.7) 21

Learning and Planning in H-POMDPs for Robot Navigation. Pipeline: the ENVIRONMENT is hand-coded into a TOPOLOGICAL MAP, which is compiled into an INITIAL H-POMDP; EM training (LEARNING) yields a TRAINED H-POMDP, from which PLANNING produces the NAVIGATION SYSTEM used for EXECUTION. 22

Outline § Sequential decision-making under uncertainty § A Hierarchical POMDP model for robot navigation § Heuristic macro-action selection in H-POMDPs § Near Optimal macro-action selection for arbitrary POMDPs § Representing H-POMDPs as DBNs § Current and Future directions 23

Planning in H-POMDPs (Theocharous, Mahadevan: ICRA 2002) § Hierarchical MDP solutions (using the options framework [Sutton, Precup, Singh, AIJ]), with abstract actions and primitive actions § Heuristic POMDP solutions, e.g. MLS (a sketch of the heuristics follows). Figure: a belief b(s) over states (0.35, 0.3, 0.2, 0.1, 0.05) with per-state values v(go-west) and v(go-east); the heuristic selects π(b) = go-west. 24
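
A hedged sketch of the two heuristics referred to here and in the results slides, applied to macro-actions; Q[s, a] is assumed to come from the underlying hierarchical MDP solution, and the function names are ours.

```python
import numpy as np

def mls_action(b, Q):
    """Most Likely State: act as if the single most probable state were the true one."""
    return int(np.argmax(Q[np.argmax(b)]))

def qmdp_action(b, Q):
    """QMDP: weight each state's action values by the belief, ignoring future uncertainty."""
    return int(np.argmax(b @ Q))
```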

Plan Execution (figure sequence) 25-28

Intuition § The probability distribution at the higher level evolves more slowly § The agent does not have to decide which macro-action is best at every time step § Long-term actions result in robot localization 29

F-MLS Demo 30

H-MLS Demo 31

Hierarchical is More Successful. Chart: success rate (%) per environment with unknown initial position, comparing the MLS and QMDP variants of the algorithms. 32

Hierarchical Takes Less Time to Reach Goal. Chart: average steps to goal per environment with unknown initial position, comparing the MLS and QMDP variants of the algorithms. 33

Hierarchical Plans are Computed Faster. Chart: planning time per environment for Goal 1 and Goal 2, by algorithm. 34

Outline § Sequential decision-making under uncertainty § A Hierarchical POMDP model for robot navigation § Heuristic macro-action selection in H-POMDPs § Near Optimal macro-action selection for arbitrary POMDPs § Representing H-POMDPs as DBNs § Current and Future directions 35

Near Optimal Macro-action Selection (Theocharous, Kaelbling NIPS 2003) § Usually agents don't require the entire belief space § Macro-actions can reduce the reachable belief space even more § Tested in large-scale robot navigation § Only a small part of the belief space is required § Learns approximate POMDP policies fast § High success rate § Better policies § Does information gathering 36

Dynamic Grids. Given a resolution, points are sampled dynamically from regular discretizations by simulating trajectories. 37
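
A minimal sketch of this idea, assuming a helper that simulates trajectories through belief space; the rounding-based bucketing below is only a stand-in for the regular-grid interpolation scheme used in the paper, and all names are ours.

```python
import numpy as np

def grid_cell(b, r):
    """Integer coordinates of the resolution-r cell containing belief b (used as a key)."""
    return tuple(np.round(np.asarray(b) * r).astype(int))

def collect_grid(simulate_trajectory, n_trajectories, r):
    """Sample grid points dynamically: keep only the cells reached by simulated runs."""
    grid = set()
    for _ in range(n_trajectories):
        for b in simulate_trajectory():      # yields the belief states along one simulated run
            grid.add(grid_cell(b, r))
    return grid
```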

The Algorithm. Figure: the true belief state b is mapped to its nearest grid point g; simulation trajectories of each macro-action A from g estimate the value at g, with the value of each resulting belief b'' interpolated from its neighbors; the chosen macro-action then yields the resulting next true belief state on the true trajectory. 38
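
A hedged sketch of the value estimate this picture describes: each macro-action is rolled out from the grid point a few times in simulation, and the value of the belief it lands in is read off the grid by interpolation. simulate_macro, interpolate_value, and the value table V are assumed helpers, not the authors' code.

```python
def macro_value(g, macro, V, simulate_macro, interpolate_value, n_rollouts=20, gamma=0.95):
    """Monte Carlo estimate of the value of one macro-action started at grid point g."""
    total = 0.0
    for _ in range(n_rollouts):
        reward, steps, b_next = simulate_macro(g, macro)   # roll the macro out from belief g
        total += reward + (gamma ** steps) * interpolate_value(V, b_next)
    return total / n_rollouts

def best_macro(g, macros, V, simulate_macro, interpolate_value):
    """Greedy macro-action choice at grid point g from the Monte Carlo estimates."""
    return max(macros, key=lambda m: macro_value(g, m, V, simulate_macro, interpolate_value))
```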

Experimental Setup 39

Fewer Number of States 40

Fewer Steps to Goal 41

More Successful 42

Information Gathering 43

Information Gathering (scaling up) 44

Dynamic POMDP Abstractions (Theocharous, Mannor, Kaelbling). Figure labels: start, goal, localization macros, entropy thresholds. 45

Fewer Steps to Goal 46

Outline § Sequential decision-making under uncertainty § A Hierarchical POMDP model for robot navigation § Heuristic macro-action selection in H-POMDPs § Near Optimal macro-action selection for arbitrary POMDPs § Representing H-POMDPs as DBNs § Current and Future directions 47

Dynamic Bayesian Networks. Figure: a flat state POMDP versus a factored DBN POMDP, comparing the number of parameters each needs. 48
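
The slide's own factorization and counts are not recoverable from the extraction; as a generic illustration with our own numbers, factoring the state can shrink the transition model dramatically:

```latex
% Generic illustration (our numbers): flat vs. factored transition parameters, one action
\underbrace{|S|\,(|S|-1)}_{\text{flat},\ |S|=1200} \approx 1.4\times 10^{6}
\qquad\text{vs.}\qquad
\underbrace{|X||Y|\,(|X|-1) \;+\; |Y|\,(|Y|-1)}_{\text{factored},\ |X|=40,\ |Y|=30,\ T(X'\mid X,Y)\,T(Y'\mid Y)} \approx 4.8\times 10^{4}
```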

DBN Inference 49

Representing H-POMDPs as Dynamic Bayesian Networks (Theocharous, Murphy, Kaelbling: ICRA 2004). Figure sequence: the state-based H-POMDP with WEST/EAST macro-actions and its factored DBN representation, built up step by step. 50-54
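
As a rough guide to what that factored representation contains, here is a hedged sketch of a two-slice DBN structure in the spirit of the HHMM-as-DBN construction of Murphy and Paskin: one state variable per level, a binary exit indicator, and action and observation nodes. The exact nodes and arcs of the ICRA 2004 model may differ; everything below is our assumption.

```python
# Illustrative two-slice DBN structure for a two-level H-POMDP (assumed, not verbatim).
slice_nodes = ["A", "S_abstract", "S_concrete", "F_exit", "Z"]

intra_slice_arcs = [
    ("S_abstract", "S_concrete"),   # the abstract state constrains which sub-states are active
    ("S_concrete", "F_exit"),       # exiting the sub-process depends on the concrete state
    ("S_concrete", "Z"),            # the observation depends on the concrete state
]

# Arcs from slice t to slice t+1 (primes denote next-slice variables).
inter_slice_arcs = [
    ("A", "S_abstract'"), ("A", "S_concrete'"),   # actions drive both levels
    ("S_abstract", "S_abstract'"),                # abstract state persists ...
    ("F_exit", "S_abstract'"),                    # ... unless the sub-process has exited
    ("S_concrete", "S_concrete'"),
    ("F_exit", "S_concrete'"),                    # on exit, the concrete state is re-entered
]
```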

Complexity of Inference. Chart: inference cost compared for the factored DBN H-POMDP, the state H-POMDP, the DBN H-POMDP, and the flat state POMDP. 55

Hierarchical Localizes Better. Chart: localization performance of the factored DBN tied H-POMDP, the factored DBN H-POMDP, and the flat state POMDP, with the original and before-training models shown for comparison. 56

Hierarchical Fits Data Better. Chart: model fit for the factored DBN tied H-POMDP, the factored DBN H-POMDP, and the flat state POMDP, with the original and before-training models shown for comparison. 57

Directions for Future Research § In the future we will explore structure learning § Bayesian model selection approaches § Methods for learning compositional hierarchies (recurrent nets, hierarchical sparse n-grams) § Natural language acquisition methods § Identifying isomorphic processes § On-line learning § Interactive learning § Application to real-world problems 58

Major Contributions § The H-POMDP model § Requires less training data § Provides better state estimation § Fast planning § Macro-actions in POMDPs reduce uncertainty § Information gathering § Application of the algorithms to large-scale robot navigation § Map learning § Planning and execution 59