Optimal Fixed-Size Controllers for Decentralized POMDPs
Christopher Amato, Daniel S. Bernstein, Shlomo Zilberstein
University of Massachusetts Amherst, May 9, 2006
UNIVERSITY OF MASSACHUSETTS, AMHERST • Department of Computer Science
Overview
- DEC-POMDPs and their solutions
- Fixing memory with controllers
- Previous approaches
- Representing the optimal controller
- Some experimental results
DEC-POMDPs
- Decentralized partially observable Markov decision process (DEC-POMDP)
- Multiagent sequential decision making under uncertainty
- At each stage, each agent receives:
  - A local observation rather than the actual state
  - A joint immediate reward
- [Figure: agents 1 and 2 send actions a1 and a2 to the environment and receive local observations o1 and o2 plus a joint reward r]
DEC-POMDP definition
A two-agent DEC-POMDP can be defined by the tuple M = ⟨S, A1, A2, P, R, Ω1, Ω2, O⟩:
- S, a finite set of states with designated initial state distribution b0
- A1 and A2, each agent's finite set of actions
- P, the state transition model: P(s' | s, a1, a2)
- R, the reward model: R(s, a1, a2)
- Ω1 and Ω2, each agent's finite set of observations
- O, the observation model: O(o1, o2 | s', a1, a2)
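The tuple above can be held in a small container. A minimal sketch in Python; the class and field names are illustrative assumptions, not code from the paper:

```python
from dataclasses import dataclass

# A minimal sketch of the two-agent DEC-POMDP tuple defined above.
# Models are dictionaries keyed by the arguments shown on the slide.
@dataclass
class DecPOMDP:
    S: list    # finite state set
    b0: dict   # initial state distribution b0(s)
    A1: list   # agent 1's actions
    A2: list   # agent 2's actions
    P: dict    # P[s, a1, a2][s'] = P(s' | s, a1, a2)
    R: dict    # R[s, a1, a2] = joint immediate reward
    O1: list   # agent 1's observations (Omega_1)
    O2: list   # agent 2's observations (Omega_2)
    O: dict    # O[s2, a1, a2][o1, o2] = O(o1, o2 | s', a1, a2)

# A trivial one-state instance for illustration.
m = DecPOMDP(
    S=['s'], b0={'s': 1.0},
    A1=['a'], A2=['a'],
    P={('s', 'a', 'a'): {'s': 1.0}},
    R={('s', 'a', 'a'): 1.0},
    O1=['o'], O2=['o'],
    O={('s', 'a', 'a'): {('o', 'o'): 1.0}},
)

# Sanity check: the transition model rows are probability distributions.
assert abs(sum(m.P['s', 'a', 'a'].values()) - 1.0) < 1e-9
```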
DEC-POMDP solutions
- A policy for each agent is a mapping from its observation sequences to actions, Ω* → A, allowing distributed execution
- A joint policy is a policy for each agent
- Goal is to maximize expected discounted reward over an infinite horizon
- A discount factor, γ, is used to calculate this
Example: Grid World
- States: grid cell pairs
- Actions: move in one of the four directions, or stay
- Transitions: noisy
- Observations: the red lines in the figure (walls adjacent to the agent's cell)
- Goal: share the same square
Previous work
- Optimal algorithms
  - Very large space and time requirements
  - Can only solve small problems
- Approximation algorithms
  - Provide weak optimality guarantees, if any
Policies as controllers
- Finite-state controller for each agent i
  - Fixed memory
  - Randomness used to offset memory limitations
  - Action selection, ψ : Qi → Ai
  - Transitions, η : Qi × Ai × Oi → Qi
- The value of a node pair (q1, q2) at state s is given by the Bellman equation:

  V(q1, q2, s) = Σ_{a1,a2} P(a1|q1) P(a2|q2) [ R(s, a1, a2)
      + γ Σ_{s'} P(s'|s, a1, a2) Σ_{o1,o2} O(o1, o2|s', a1, a2)
        Σ_{q1',q2'} P(q1'|q1, a1, o1) P(q2'|q2, a2, o2) V(q1', q2', s') ]

  where the subscript denotes the agent and lowercase values are elements of the uppercase sets above
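The Bellman equation above can be evaluated by fixed-point iteration over all node pairs and states. A minimal sketch assuming dictionary-based models; all names are illustrative, not from the paper:

```python
def evaluate_joint_controller(S, A1, A2, O1, O2, Q1, Q2,
                              P, R, Obs, psi1, psi2, eta1, eta2,
                              gamma, iters=500):
    """Fixed-point iteration of the joint Bellman equation.

    psi_i[q][a]        = P(a | q)         action selection for agent i
    eta_i[q, a, o][q'] = P(q' | q, a, o)  controller transition
    Returns V with V[q1, q2, s] = expected discounted value.
    """
    V = {(q1, q2, s): 0.0 for q1 in Q1 for q2 in Q2 for s in S}
    for _ in range(iters):
        newV = {}
        for (q1, q2, s) in V:
            total = 0.0
            for a1 in A1:
                for a2 in A2:
                    pa = psi1[q1].get(a1, 0.0) * psi2[q2].get(a2, 0.0)
                    if pa == 0.0:
                        continue
                    val = R[s, a1, a2]
                    for s2 in S:
                        for o1 in O1:
                            for o2 in O2:
                                w = (P[s, a1, a2].get(s2, 0.0) *
                                     Obs[s2, a1, a2].get((o1, o2), 0.0))
                                if w == 0.0:
                                    continue
                                nxt = sum(eta1[q1, a1, o1].get(n1, 0.0) *
                                          eta2[q2, a2, o2].get(n2, 0.0) *
                                          V[n1, n2, s2]
                                          for n1 in Q1 for n2 in Q2)
                                val += gamma * w * nxt
                    total += pa * val
            newV[q1, q2, s] = total
        V = newV
    return V

# Tiny check: one state, one action, reward 1, gamma 0.9 -> value 1/(1-0.9)
V = evaluate_joint_controller(
    S=['s'], A1=['a'], A2=['a'], O1=['o'], O2=['o'], Q1=['q'], Q2=['q'],
    P={('s', 'a', 'a'): {'s': 1.0}},
    R={('s', 'a', 'a'): 1.0},
    Obs={('s', 'a', 'a'): {('o', 'o'): 1.0}},
    psi1={'q': {'a': 1.0}}, psi2={'q': {'a': 1.0}},
    eta1={('q', 'a', 'o'): {'q': 1.0}}, eta2={('q', 'a', 'o'): {'q': 1.0}},
    gamma=0.9)
print(V['q', 'q', 's'])  # close to 10.0
```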
Controller example
- Stochastic controller for a single agent
  - 2 nodes, 2 actions, 2 observations
- Parameters
  - Action selection: P(a|q)
  - Transitions: P(q'|q, a, o)
- [Figure: a two-node controller with action-selection and transition probabilities (0.5, 0.25, 0.75, 1.0) labeled on edges for observations o1 and o2]
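Executing such a controller is just two weighted draws per step: sample an action from the current node, then sample the next node. A sketch with illustrative parameters (the exact probabilities on the slide are partly garbled, so these numbers are assumptions):

```python
import random

# Illustrative 2-node, 2-action, 2-observation stochastic controller.
psi = {                      # action selection P(a | q)
    'q1': {'a1': 0.5, 'a2': 0.5},
    'q2': {'a1': 0.25, 'a2': 0.75},
}
eta = {                      # node transition P(q' | q, a, o)
    ('q1', 'a1', 'o1'): {'q1': 1.0},
    ('q1', 'a1', 'o2'): {'q2': 1.0},
    ('q1', 'a2', 'o1'): {'q2': 1.0},
    ('q1', 'a2', 'o2'): {'q1': 0.5, 'q2': 0.5},
    ('q2', 'a1', 'o1'): {'q2': 1.0},
    ('q2', 'a1', 'o2'): {'q1': 1.0},
    ('q2', 'a2', 'o1'): {'q1': 1.0},
    ('q2', 'a2', 'o2'): {'q2': 1.0},
}

def step(node, obs, rng):
    """Sample an action from the current node, then the next node."""
    action = rng.choices(list(psi[node]), weights=list(psi[node].values()))[0]
    dist = eta[node, action, obs]
    return action, rng.choices(list(dist), weights=list(dist.values()))[0]

rng = random.Random(0)
node = 'q1'
for obs in ['o1', 'o2', 'o1']:   # a made-up observation sequence
    action, node = step(node, obs, rng)
```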
Optimal controllers
- How do we set the parameters of the controllers?
- Deterministic controllers: traditional methods such as best-first search (Szer and Charpillet 05)
- Stochastic controllers: continuous optimization
Decentralized BPI
- Decentralized Bounded Policy Iteration (DEC-BPI) (Bernstein, Hansen and Zilberstein 05)
- Alternates between improvement and evaluation until convergence
- Improvement: for each node of each agent's controller, find a probability distribution over one-step lookahead values that is greater than the current node's value for all states and all controllers for the other agents
- Evaluation: finds values of all nodes in all states
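The improvement step for a node $q_i$ of agent $i$ can be written as a linear program. A sketch reconstructed from the description above, with $V$ held fixed at its current values so the program is linear; the variable names are assumptions:

```latex
% DEC-BPI improvement LP for node q_i of agent i (sketch).
% Joint variables x(a_i) and x(a_i, o_i, q_i') keep the program linear.
\begin{align*}
\max_{\epsilon,\, x} \quad & \epsilon \\
\text{s.t.}\quad
& V(q_i, q_{-i}, s) + \epsilon \;\le\;
  \sum_{a_1, a_2} P(a_{-i} \mid q_{-i})
  \Big[\, x(a_i)\, R(s, a_1, a_2) \\
&\qquad + \gamma \sum_{s'} P(s' \mid s, a_1, a_2)
  \sum_{o_1, o_2} O(o_1, o_2 \mid s', a_1, a_2) \\
&\qquad\quad \sum_{q_1', q_2'} x(a_i, o_i, q_i')\,
  P(q_{-i}' \mid q_{-i}, a_{-i}, o_{-i})\,
  V(q_1', q_2', s') \,\Big]
  \qquad \forall\, s,\ q_{-i} \\
& \sum_{a_i} x(a_i) = 1, \qquad
  \sum_{q_i'} x(a_i, o_i, q_i') = x(a_i) \quad \forall\, a_i, o_i, \qquad
  x \ge 0
\end{align*}
```

If a solution with $\epsilon > 0$ exists, the node's parameters are replaced and its value improves for every state and every configuration of the other agents' nodes.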
Problems with DEC-BPI
- Difficult to improve value for all states and other agents' controllers
- May require more nodes for a given start state
- Linear program (one-step lookahead) results in local optimality
- A correlation device can somewhat improve performance
Optimal controllers
- Use nonlinear programming (NLP)
- Consider node value as a variable
- Improvement and evaluation all in one step
- Add constraints to maintain valid values
NLP intuition
- The value variable allows improvement and evaluation at the same time (infinite lookahead)
- While the iterative process of DEC-BPI can "get stuck", the NLP does define the globally optimal solution
NLP representation
- Variables: the action-selection and transition parameters of each controller, plus a value variable for each node pair and state
- Objective: maximize the value of the initial nodes under the initial state distribution
- Constraints: the Bellman equation must hold for every s ∈ S and every node pair in Q
- Linear constraints are needed to ensure the controllers are independent
- Also, all probabilities must sum to 1 and be nonnegative
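One common way to write the full program, reconstructed from the description above; the variable names are assumptions:

```latex
% NLP over controller parameters and node-pair values (sketch).
% Variables: z(q_1, q_2, s) (values), x(q_i, a_i) = P(a_i | q_i),
% and joint variables x(q_i, a_i, o_i, q_i').
\begin{align*}
\max \quad & \sum_{s} b_0(s)\, z(q_1^0, q_2^0, s) \\
\text{s.t.}\quad
& z(q_1, q_2, s) =
  \sum_{a_1, a_2} x(q_1, a_1)\, x(q_2, a_2)\, R(s, a_1, a_2) \\
&\quad + \gamma \sum_{a_1, a_2} \sum_{s'} P(s' \mid s, a_1, a_2)
  \sum_{o_1, o_2} O(o_1, o_2 \mid s', a_1, a_2) \\
&\qquad \sum_{q_1', q_2'} x(q_1, a_1, o_1, q_1')\, x(q_2, a_2, o_2, q_2')\,
  z(q_1', q_2', s')
  \qquad \forall\, q_1, q_2, s \\
& \sum_{a_i} x(q_i, a_i) = 1 \qquad \forall\, q_i \\
& \sum_{q_i'} x(q_i, a_i, o_i, q_i') = x(q_i, a_i)
  \qquad \forall\, q_i, a_i, o_i \quad \text{(independence)} \\
& x \ge 0
\end{align*}
```

The marginalization constraint is what keeps the controllers independent: each agent's action distribution cannot depend on an observation it has not yet received.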
Optimality
Theorem: An optimal solution of the NLP results in optimal stochastic controllers for the given size and initial state distribution.
Pros and cons of the NLP
- Pros
  - Retains fixed memory and efficient policy representation
  - Represents the optimal policy for a given size
  - Takes advantage of a known start state
- Cons
  - Difficult to solve optimally
Experiments
- Nonlinear programming solvers (snopt and filter): sequential quadratic programming (SQP)
  - Guarantees a locally optimal solution
  - Run via the NEOS server
- 10 random initial controllers for a range of sizes
- Compared the NLP with DEC-BPI, each with and without a small correlation device
Results: Broadcast Channel
- Two agents share a broadcast channel (4 states, 5 observations, 2 actions)
- Very simple near-optimal policy
- [Figure: mean solution quality of the NLP and DEC-BPI implementations]
Results: Recycling Robots
- [Figure: mean solution quality of the NLP and DEC-BPI implementations on the recycling robot domain (4 states, 2 observations, 3 actions)]
Results: Grid World
- [Figure: mean solution quality of the NLP and DEC-BPI implementations on the meeting-in-a-grid domain (16 states, 2 observations, 5 actions)]
Results: Running time
- Running time mostly comparable to DEC-BPI with a correlation device
- The increase as controller size grows is offset by better performance
- [Table: running times on the Broadcast, Recycling, and Grid domains]
Conclusion
- Defined the optimal fixed-size stochastic controller using an NLP
- Showed consistent improvement over DEC-BPI with locally optimal solvers
- In general, the NLP may allow small optimal controllers to be found
- It may also provide concise, near-optimal approximations of large controllers
Future Work
- Explore more efficient NLP formulations
- Investigate more specialized solution techniques for the NLP formulation
- Greater experimentation and comparison with other methods