Optimal Fixed-Size Controllers for Decentralized POMDPs
Christopher Amato, Daniel S. Bernstein, Shlomo Zilberstein
University of Massachusetts Amherst
May 9, 2006

Overview
• DEC-POMDPs and their solutions
• Fixing memory with controllers
• Previous approaches
• Representing the optimal controller
• Some experimental results

DEC-POMDPs
• Decentralized partially observable Markov decision process (DEC-POMDP)
• Multiagent sequential decision making under uncertainty
• At each stage, each agent receives:
  • A local observation rather than the actual state
  • A joint immediate reward
[Diagram: two agents choose actions a1 and a2; the environment returns local observations o1 and o2 and a joint reward r]

DEC-POMDP definition
• A two-agent DEC-POMDP can be defined with the tuple M = ⟨S, A1, A2, P, R, Ω1, Ω2, O⟩:
  • S, a finite set of states with designated initial state distribution b0
  • A1 and A2, each agent's finite set of actions
  • P, the state transition model: P(s′ | s, a1, a2)
  • R, the reward model: R(s, a1, a2)
  • Ω1 and Ω2, each agent's finite set of observations
  • O, the observation model: O(o1, o2 | s′, a1, a2)
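For concreteness, here is a minimal sketch of this tuple encoded as NumPy arrays. The sizes and the random dynamics are illustrative assumptions, not any benchmark from the talk; a concrete problem would fill these arrays with its own model.

```python
# A minimal encoding of the two-agent DEC-POMDP tuple as NumPy arrays.
import numpy as np

nS, nA1, nA2, nO1, nO2 = 4, 2, 2, 2, 2     # |S|, |A_i|, |Omega_i| (assumed sizes)
rng = np.random.default_rng(0)

b0 = rng.dirichlet(np.ones(nS))                       # initial state distribution b0
P = rng.dirichlet(np.ones(nS), size=(nS, nA1, nA2))   # P[s, a1, a2, s'] = P(s' | s, a1, a2)
R = rng.standard_normal((nS, nA1, nA2))               # R[s, a1, a2]   = R(s, a1, a2)
Obs = rng.dirichlet(np.ones(nO1 * nO2),
                    size=(nS, nA1, nA2)).reshape(nS, nA1, nA2, nO1, nO2)
# Obs[s', a1, a2, o1, o2] = O(o1, o2 | s', a1, a2)

# Sanity checks: transition and observation models are distributions
assert np.allclose(P.sum(-1), 1) and np.allclose(Obs.sum((-2, -1)), 1)
```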

DEC-POMDP solutions
• A policy for each agent is a mapping from its observation sequences to actions, Ω* → A, allowing distributed execution
• A joint policy is a policy for each agent
• Goal is to maximize expected discounted reward over an infinite horizon
• A discount factor, γ, is used to calculate this

Example: Grid World
• States: grid cell pairs
• Actions: move up, down, left, or right, or stay in place
• Transitions: noisy
• Observations: red lines (shown in the slide's figure)
• Goal: share the same square

Previous work
• Optimal algorithms
  • Very large space and time requirements
  • Can only solve small problems
• Approximation algorithms
  • Provide weak optimality guarantees, if any

Policies as controllers
• Finite-state controller for each agent i
  • Fixed memory
  • Randomness used to offset memory limitations
  • Action selection, ψ : Qi → Ai
  • Transitions, η : Qi × Ai × Ωi → Qi
• The value of a node pair at a state is given by the Bellman equation:

V(q_1, q_2, s) = \sum_{a_1, a_2} P(a_1|q_1)\,P(a_2|q_2)\Big[ R(s, a_1, a_2) + \gamma \sum_{s'} P(s'|s, a_1, a_2) \sum_{o_1, o_2} O(o_1, o_2|s', a_1, a_2) \sum_{q_1', q_2'} P(q_1'|q_1, a_1, o_1)\,P(q_2'|q_2, a_2, o_2)\,V(q_1', q_2', s') \Big]

where the subscript denotes the agent and lowercase values are elements of the uppercase sets above
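Once the controller parameters are fixed, this Bellman equation is linear in V, so the joint value can be found with a single linear solve over (q1, q2, s) triples. Below is a minimal sketch of that evaluation, assuming the model arrays from the earlier sketch; the function and array layouts are ours, not the talk's code.

```python
# Evaluate fixed stochastic controllers by solving (I - gamma*T) V = r.
import numpy as np

def evaluate_controllers(P, R, Obs, act, trans, gamma):
    """act[i][q, a] = P(a | q); trans[i][q, a, o, q'] = P(q' | q, a, o)."""
    nQ, nS = act[0].shape[0], P.shape[0]
    n = nQ * nQ * nS
    # Expected immediate reward for each (q1, q2, s)
    r = np.einsum('qa,rb,sab->qrs', act[0], act[1], R)
    # Joint transition kernel over (q1, q2, s) -> (q1', q2', s')
    T = np.einsum('qa,rb,sabt,tabxy,qaxu,rbyv->qrsuvt',
                  act[0], act[1], P, Obs, trans[0], trans[1])
    # The Bellman equation above, rearranged as one linear system
    V = np.linalg.solve(np.eye(n) - gamma * T.reshape(n, n), r.ravel())
    return V.reshape(nQ, nQ, nS)

# Usage with the model arrays from the earlier sketch and uniform
# 2-node controllers for both agents (an arbitrary starting point):
# act_u, tr_u = np.full((2, 2), 0.5), np.full((2, 2, 2, 2), 0.5)
# V = evaluate_controllers(P, R, Obs, [act_u, act_u], [tr_u, tr_u], 0.9)
```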

Controller example
• Stochastic controller for a single agent
  • 2 nodes, 2 actions, 2 observations
• Parameters
  • P(a|q)
  • P(q′|q, a, o)
[Figure: a two-node stochastic controller; nodes are labeled with the actions they select (a1, a2) and edges with observations (o1, o2) and transition probabilities such as 0.5, 0.25, 0.75, and 1.0]
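As an illustration, such a controller can be stored as two arrays with exactly the shapes of the parameters above. The slide's diagram is not fully recoverable from the text, so the specific probabilities below are assumptions chosen only to show the encoding.

```python
# An illustrative encoding of a 2-node stochastic controller as arrays.
import numpy as np

nQ, nA, nO = 2, 2, 2
# act[q, a] = P(a | q): node 0 mixes the two actions, node 1 always picks a2.
act = np.array([[0.5, 0.5],
                [0.0, 1.0]])
# trans[q, a, o, q'] = P(q' | q, a, o): each trailing row is a distribution.
trans = np.full((nQ, nA, nO, nQ), 0.5)
trans[0, 0, 0] = [0.25, 0.75]   # after a1, o1 in node 0, favor node 1
trans[1, 1, 1] = [1.0, 0.0]     # after a2, o2 in node 1, return to node 0

# Both parameter families must be valid probability distributions
assert np.allclose(act.sum(-1), 1) and np.allclose(trans.sum(-1), 1)
```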

Optimal controllers
• How do we set the parameters of the controllers?
  • Deterministic controllers: traditional methods such as best-first search (Szer and Charpillet 2005)
  • Stochastic controllers: continuous optimization

Decentralized BPI
• Decentralized Bounded Policy Iteration (DEC-BPI) (Bernstein, Hansen and Zilberstein 2005)
• Alternates between improvement and evaluation until convergence
  • Improvement: for each node of each agent's controller, find a probability distribution over one-step lookahead values that is greater than the current node's value for all states and all controllers of the other agents (see the LP sketch below)
  • Evaluation: finds the values of all nodes in all states
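As a schematic rendering of the improvement step for a node q1 of agent 1, it can be posed as a linear program of the following form. The variable names x and y are ours: x(a1) stands for the new P(a1|q1), and y(a1, o1, q1') for the product P(a1|q1) P(q1'|q1, a1, o1), a standard bounded-policy-iteration device that keeps the program linear.

```latex
\begin{align*}
\max_{\epsilon,\, x,\, y} \quad & \epsilon \\
\text{s.t.} \quad
& V(q_1, q_2, s) + \epsilon \;\le\;
  \sum_{a_1} x(a_1) \sum_{a_2} P(a_2|q_2)\, R(s, a_1, a_2) \\
& \qquad + \gamma \sum_{a_1, o_1, q_1'} y(a_1, o_1, q_1')
    \sum_{a_2, s', o_2, q_2'} P(a_2|q_2)\, P(s'|s, a_1, a_2)\,
    O(o_1, o_2|s', a_1, a_2)\, P(q_2'|q_2, a_2, o_2)\, V(q_1', q_2', s')
    \quad \forall\, s, q_2 \\
& \sum_{a_1} x(a_1) = 1, \qquad
  \sum_{q_1'} y(a_1, o_1, q_1') = x(a_1) \;\;\forall\, a_1, o_1, \qquad
  x, y \ge 0
\end{align*}
```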

Problems with DEC-BPI
• Difficult to improve value for all states and other agents' controllers
• May require more nodes for a given start state
• Linear program (one-step lookahead) results in local optimality
• A correlation device can somewhat improve performance

Optimal controllers
• Use nonlinear programming (NLP)
• Consider node value as a variable
• Improvement and evaluation all in one step
• Add constraints to maintain valid values

NLP intuition
• The value variable allows improvement and evaluation at the same time (infinite lookahead)
• While the iterative process of DEC-BPI can "get stuck" in local optima, the NLP defines the globally optimal solution

NLP representation
• Variables: the action probabilities P(a_i|q_i), the node transition probabilities P(q_i′|q_i, a_i, o_i), and a value variable V(q_1, q_2, s) for each node pair and state
• Objective: maximize the expected value of the initial nodes under the start distribution, Σ_s b_0(s) V(q_1^0, q_2^0, s)
• Value constraints: for all s ∈ S and all node pairs in Q_1 × Q_2, the Bellman equation must hold
• Linear constraints are needed to ensure the controllers are independent
• Also, all probabilities must sum to 1 and be nonnegative
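The sketch below writes this NLP out end to end on a tiny random two-agent model and hands it to SciPy's SLSQP, a sequential quadratic programming method in the same family as the snopt and filter solvers used later in the talk. The model, the problem sizes, and the choice of node 0 as each agent's initial node are all illustrative assumptions; a serious implementation would exploit sparsity and use a dedicated NLP solver.

```python
# The fixed-size controller NLP on a tiny random DEC-POMDP, via SLSQP.
import numpy as np
from scipy.optimize import minimize

nS, nA, nO, nQ, gamma = 2, 2, 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA, nA))             # P(s'|s,a1,a2)
R = rng.standard_normal((nS, nA, nA))                         # R(s,a1,a2)
Obs = rng.dirichlet(np.ones(nO * nO),
                    size=(nS, nA, nA)).reshape(nS, nA, nA, nO, nO)
b0 = np.array([1.0, 0.0])                                     # start distribution

n_act, n_tr, n_val = 2 * nQ * nA, 2 * nQ * nA * nO * nQ, nQ * nQ * nS

def unpack(x):
    act = x[:n_act].reshape(2, nQ, nA)                        # P(a_i | q_i)
    tr = x[n_act:n_act + n_tr].reshape(2, nQ, nA, nO, nQ)     # P(q_i'| q_i,a_i,o_i)
    V = x[n_act + n_tr:].reshape(nQ, nQ, nS)                  # value variables
    return act, tr, V

def bellman(x):
    # Value constraints: V must equal its one-step Bellman backup
    act, tr, V = unpack(x)
    fut = np.einsum('sabt,tabxy,qaxu,rbyv,uvt->qrsab',
                    P, Obs, tr[0], tr[1], V)
    backup = np.einsum('qa,rb,qrsab->qrs', act[0], act[1],
                       R[None, None] + gamma * fut)
    return (backup - V).ravel()

def simplex(x):
    # Every action and node-transition distribution must sum to 1
    act, tr, _ = unpack(x)
    return np.concatenate([(act.sum(-1) - 1).ravel(),
                           (tr.sum(-1) - 1).ravel()])

def neg_value(x):
    # Maximize expected value of the (assumed) initial nodes q1=q2=0
    _, _, V = unpack(x)
    return -b0 @ V[0, 0]

x0 = np.concatenate([np.full(n_act, 1 / nA), np.full(n_tr, 1 / nQ),
                     np.zeros(n_val)])
bounds = [(0, 1)] * (n_act + n_tr) + [(None, None)] * n_val
res = minimize(neg_value, x0, method='SLSQP', bounds=bounds,
               constraints=[{'type': 'eq', 'fun': bellman},
                            {'type': 'eq', 'fun': simplex}])
print('value from start distribution:', -res.fun)
```

Like snopt and filter, SLSQP only guarantees a local optimum of this nonconvex program, which is why the experiments below use multiple random restarts.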

Optimality
Theorem: An optimal solution of the NLP results in optimal stochastic controllers for the given size and initial state distribution.

Pros and cons of the NLP
• Pros
  • Retains fixed memory and an efficient policy representation
  • Represents the optimal policy for the given size
  • Takes advantage of a known start state
• Cons
  • Difficult to solve optimally

Experiments
• Nonlinear programming solvers (snopt and filter) based on sequential quadratic programming (SQP)
  • Guarantee locally optimal solutions
  • Run on the NEOS server
• 10 random initial controllers for a range of sizes (see the sketch below)
• Compared the NLP with DEC-BPI
  • With and without a small correlation device
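One plausible way to draw the random initial controllers is to sample every action and node-transition distribution uniformly from the simplex. The Dirichlet(1, ..., 1) sampler, the seed, and the controller sizes below are our assumptions for illustration.

```python
# Drawing random initial controllers for multi-start optimization.
import numpy as np

def random_controller(nQ, nA, nO, rng):
    act = rng.dirichlet(np.ones(nA), size=nQ)               # P(a | q)
    trans = rng.dirichlet(np.ones(nQ), size=(nQ, nA, nO))   # P(q' | q, a, o)
    return act, trans

rng = np.random.default_rng(42)
# 10 joint restarts, one controller per agent (3 nodes, 5 actions, 2 obs)
starts = [tuple(random_controller(nQ=3, nA=5, nO=2, rng=rng) for _ in range(2))
          for _ in range(10)]
```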

Results: Broadcast Channel
• Two agents share a broadcast channel (4 states, 5 observations, 2 actions)
• A very simple near-optimal policy exists
[Plot: mean quality of the NLP and DEC-BPI implementations]

Results: Recycling Robots
[Plot: mean quality of the NLP and DEC-BPI implementations on the recycling robot domain (4 states, 2 observations, 3 actions)]

Results: Grid World
[Plot: mean quality of the NLP and DEC-BPI implementations on meeting in a grid (16 states, 2 observations, 5 actions)]

Results: Running time
• Running time is mostly comparable to DEC-BPI with a correlation device
• The increase in running time as controller size grows is offset by better performance
[Plots: running times on the Broadcast, Recycle, and Grid domains]

Conclusion
• Defined the optimal fixed-size stochastic controller using NLP
• Showed consistent improvement over DEC-BPI with locally optimal solvers
• In general, the NLP may allow small optimal controllers to be found
• It may also provide concise near-optimal approximations of large controllers

Future Work
• Explore more efficient NLP formulations
• Investigate more specialized solution techniques for the NLP formulation
• Greater experimentation and comparison with other methods