Dynamic Programming Chris Atkeson 2012

Dynamic Programming
• x: continuous state
• u: continuous control or action
• Discrete time dynamics: x_{k+1} = f(x_k, u_k)
• Cost function: L(x, u)
• Value function: V(x) = Σ_k L(x_k, u_k)
• Q function: Q(x, u) = L(x, u) + V(f(x, u))
• Bellman equation: V(x) = min_u Q(x, u)
• Policy/control law: u(x) = argmin_u Q(x, u)
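
To make these definitions concrete, here is a minimal Python sketch of the Q function and the greedy policy for a hypothetical 1D problem; the dynamics f, cost L, and stand-in V below are illustrative placeholders, not the systems used later in these slides.

```python
# Minimal sketch of the definitions above for a hypothetical 1D problem.
# f, L, and the stand-in V are illustrative placeholders.
import numpy as np

def f(x, u):                 # discrete-time dynamics x_{k+1} = f(x_k, u_k)
    return 0.9 * x + u

def L(x, u):                 # one-step cost L(x, u)
    return x**2 + 0.1 * u**2

def V(x):                    # stand-in value function (would come from DP)
    return 2.0 * x**2

def Q(x, u):                 # Q(x, u) = L(x, u) + V(f(x, u))
    return L(x, u) + V(f(x, u))

def policy(x, candidate_actions):
    # u(x) = argmin_u Q(x, u), here over a sampled/discretized action set
    return min(candidate_actions, key=lambda u: Q(x, u))

actions = np.linspace(-1.0, 1.0, 21)
print(policy(1.0, actions))  # greedy action at x = 1
```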

Value Function

Policy

Why Dynamic Programming?
• Trajectory optimization (including receding horizon control) can produce high quality answers.
• However, trajectory optimization often gets stuck in local minima that are not useful.
• We combine dynamic programming and trajectory optimization to get high quality answers with fewer problems from local minima. DP produces an initial trajectory, which is refined by optimization.
• DP produces a policy u(x).

Why Dynamic Programming? • Policy optimization can produce useful answers. • However, need to choose good features/basis functions/parameterization.

The Curse of Dimensionality
• Continuous states: storage cost ~ resolution^dx, state computational cost ~ resolution^dx
• Continuous actions: action computational cost ~ resolution^du
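
As a quick illustration of these exponents, a back-of-the-envelope check in Python; the resolution and dimensions are assumed numbers, not from the slides.

```python
# Back-of-the-envelope check of the exponents above (numbers are assumed).
resolution = 100   # grid points per dimension
dx, du = 4, 2      # state and action dimensions (e.g. a two-link arm)
print(f"value table entries:     {resolution**dx:.2e}")                   # resolution^dx
print(f"Q evaluations per sweep: {resolution**dx * resolution**du:.2e}")  # resolution^(dx+du)
```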

How to do Dynamic Programming (specified end time T)
• Dynamics: x_{k+1} = f(x_k, u_k)
• Cost: C = c_T(x_T) + Σ_k c(x_k, u_k)
• Value function V_k(x) is represented by a table; V_T(x) = c_T(x)
• For each x, V_k(x) = min_u (c(x, u) + V_{k+1}(f(x, u))). This is the Bellman equation.
• This version of DP is value iteration.
• Can also tabulate the policy: u = π_k(x)
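
A minimal sketch of this finite-horizon value iteration on a toy tabular problem; the state/action counts, next-state table nxt, stage cost c, and terminal cost cT are random placeholders, not the problems discussed later.

```python
# Sketch of finite-horizon value iteration: back up the tabulated value
# function from the end time T. Problem data are random placeholders.
import numpy as np

nS, nA, T = 5, 3, 10
rng = np.random.default_rng(0)
nxt = rng.integers(0, nS, size=(nS, nA))       # deterministic x_{k+1} = f(x_k, u_k)
c = rng.random((nS, nA))                       # c(x, u)
cT = rng.random(nS)                            # terminal cost c_T(x)

V = np.zeros((T + 1, nS))                      # V_k(x) stored as a table
pi = np.zeros((T, nS), dtype=int)              # tabulated policy u = pi_k(x)
V[T] = cT                                      # V_T(x) = c_T(x)
for k in range(T - 1, -1, -1):                 # sweep backward in time
    Q = c + V[k + 1][nxt]                      # Q_k(x, u) = c(x, u) + V_{k+1}(f(x, u))
    V[k] = Q.min(axis=1)                       # Bellman equation
    pi[k] = Q.argmin(axis=1)
print(V[0])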

How to do Dynamic Programming (no specified end time)
• Cost: C = Σ_k c(x_k, u_k)
• V_N(x) = a guess, or all zeros.
• Apply the Bellman equation. V(x) is given by V_k(x) when V stops changing.
• Goal needs to have zero cost, or need to discount so V() does not grow to infinity:
  V_k(x) = min_u (c(x, u) + γ V_{k+1}(f(x, u))), γ < 1
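
A corresponding sketch for the no-end-time case: sweep the discounted Bellman backup until the value table stops changing. The same kind of toy placeholder problem data is used.

```python
# Sketch of value iteration with no specified end time.
import numpy as np

nS, nA, gamma = 5, 3, 0.95
rng = np.random.default_rng(1)
nxt = rng.integers(0, nS, size=(nS, nA))       # x_{k+1} = f(x_k, u_k)
c = rng.random((nS, nA))                       # c(x, u)

V = np.zeros(nS)                               # initial guess (all zeros)
for sweep in range(10_000):
    Q = c + gamma * V[nxt]                     # V(x) = min_u [c + gamma V(f(x, u))]
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-9:       # V has (numerically) stopped changing
        V = V_new
        break
    V = V_new
print(V, Q.argmin(axis=1))                     # converged values and greedy policy
```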

Policy Iteration
• u = π(x): general policy (a table in the discrete state case).
• *) Compute V^π(x): V^π_k(x) = c(x, π(x)) + V^π_{k+1}(f(x, π(x)))
• Update policy: π(x) = argmin_u (c(x, u) + V^π(f(x, u)))
• Goto *)
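
A sketch of tabular policy iteration on the same kind of toy placeholder data: evaluate the current policy by iterating its Bellman equation, then improve it greedily. A discount factor is added here (an assumption, following the no-end-time slide) so V stays bounded.

```python
# Sketch of tabular policy iteration on toy placeholder data.
import numpy as np

nS, nA, gamma = 5, 3, 0.95
rng = np.random.default_rng(2)
nxt = rng.integers(0, nS, size=(nS, nA))       # deterministic next-state table
c = rng.random((nS, nA))                       # one-step costs

pi = np.zeros(nS, dtype=int)                   # policy as a table
states = np.arange(nS)
while True:
    V = np.zeros(nS)
    for _ in range(1000):                      # evaluation: V = c(x, pi(x)) + gamma V(f(x, pi(x)))
        V = c[states, pi] + gamma * V[nxt[states, pi]]
    new_pi = (c + gamma * V[nxt]).argmin(axis=1)   # greedy policy improvement
    if np.array_equal(new_pi, pi):             # stop when the policy no longer changes
        break
    pi = new_pi
print(pi, V)
```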

Stochastic Dynamic Programming
• Cost: C = E(Σ_k c(x_k, u_k))
• The Bellman equation now involves expectations:
  V_k(x) = min_u E[c(x, u) + V_{k+1}(f(x, u))] = min_u (c(x, u) + Σ_{x_{k+1}} p(x_{k+1}) V_{k+1}(x_{k+1}))
• The modified Bellman equation applies to value and policy iteration.
• May need to add a discount factor.
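
A sketch of the stochastic backup with an explicit transition-probability table P[s, a, s']; P and c are random placeholders, and the discount factor is the optional one mentioned above.

```python
# Sketch of the stochastic Bellman backup:
# V(x) = min_u [c(x, u) + gamma * sum_x' p(x') V(x')].
import numpy as np

nS, nA, gamma = 5, 3, 0.95
rng = np.random.default_rng(3)
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)              # normalize rows into probabilities
c = rng.random((nS, nA))

V = np.zeros(nS)
for _ in range(2000):
    Q = c + gamma * P @ V                      # expectation over next states
    V = Q.min(axis=1)
print(V)
```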

Continuous State DP • Time is still discrete. • How do we discretize the states?

How to handle continuous states
• Discretize states on a grid.
• At each point x_0, generate a trajectory segment of length N by minimizing C(u) = Σ_k c(x_k, u_k) + V(x_N)
• V(x_N): interpolate using the surrounding V() values.
• Typically multilinear interpolation is used.
• N is typically determined by when V(x_N) is independent of V(x_0).
• Use your favorite continuous function optimizer to search for the best u when minimizing C(u).
• Update V() at that cell.
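
A sketch of a single continuous-state backup on a toy 1D double integrator: V lives on a grid, V at the successor state is obtained by multilinear interpolation (SciPy's RegularGridInterpolator), and u is found with a generic bounded optimizer. The dynamics, cost, grid, and action bounds are all assumptions for illustration, and a real implementation would roll out a length-N trajectory segment rather than a single step.

```python
# Sketch of one continuous-state backup on a toy 1D double integrator.
import numpy as np
from scipy.interpolate import RegularGridInterpolator
from scipy.optimize import minimize_scalar

dt = 0.1
pos = np.linspace(-1, 1, 21)
vel = np.linspace(-1, 1, 21)
V_table = np.zeros((pos.size, vel.size))       # current value table
V_interp = RegularGridInterpolator((pos, vel), V_table,
                                   bounds_error=False, fill_value=None)

def f(x, u):                                   # x = (position, velocity)
    return np.array([x[0] + dt * x[1], x[1] + dt * u])

def c(x, u):
    return dt * (x[0]**2 + 0.1 * u**2)

def backup(x):
    # V(x) <- min_u [ c(x, u) + V_interpolated(f(x, u)) ]
    obj = lambda u: c(x, u) + float(V_interp(f(x, u)))
    res = minimize_scalar(obj, bounds=(-2.0, 2.0), method="bounded")
    return res.fun, res.x                      # new value and greedy action

print(backup(np.array([0.5, 0.0])))
```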

State Increment Dynamic Programming (Larson) (figure: V(x_0), cells c_0, c_1, c_2, V(x_N))

State Increment Dynamic Programming

Discretizations • Grid, multilinear interpolation

Adaptive Grids
• Ideal grid: the local optimizer gets the right answer in each cell.
  – Split at policy discontinuities.
  – Split to avoid local optima.
• Challenge: can you figure out a good tessellation without fully solving the problem?

Munos and Moore, Variable Resolution Discretization in Optimal Control, Machine Learning, 49(2/3), 291-323, 2002

Kuhn Triangulation, kd-trie

Kuhn Triangulation in 3D

Trajectory Segments

Mountain Car

Value Function

Value Function

Discretizations

Where Split?
• Violations of a (constant, linear, quadratic, …) model of V(x) over the cell/region
• Violations of a (constant, linear, …) model of u(x) over the cell/region
• Some criteria applied to predecessors (how much change in V is needed to affect the trajectory?)

Policy Iteration: Continuous State
• Discretize states.
• Represent the policy at the discretized states: u(x). Each cell in the table has a constant u, or u gives knot points for a linear or higher order spline.
• *) The same kind of trajectory segments are used to compute V^π_k(x) = Σ c(x, π(x)) + V^π_{k+1}(x_N)
• Optimize the policy: π(x) = argmin_u (c(x, u) + V^π(f(x, u))), using your favorite continuous function optimizer.
• Goto *)
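
A sketch of the policy-evaluation piece: a tabulated, piecewise-constant policy is looked up by nearest cell, a short trajectory segment is rolled out under it, and the interpolated V at the segment end closes the backup. The double-integrator dynamics, cost, grids, and segment length N are the same kind of illustrative assumptions as in the earlier sketch.

```python
# Sketch of policy evaluation with a tabulated policy and trajectory segments.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

dt = 0.1
pos = np.linspace(-1, 1, 21)
vel = np.linspace(-1, 1, 21)
V_table = np.zeros((pos.size, vel.size))                 # current value table
V_interp = RegularGridInterpolator((pos, vel), V_table,
                                   bounds_error=False, fill_value=None)
pi_table = np.zeros((pos.size, vel.size))                # one action per cell

def f(x, u):                                             # x = (position, velocity)
    return np.array([x[0] + dt * x[1], x[1] + dt * u])

def c(x, u):
    return dt * (x[0]**2 + 0.1 * u**2)

def pi_lookup(x):                                        # nearest-cell policy lookup
    i = np.abs(pos - x[0]).argmin()
    j = np.abs(vel - x[1]).argmin()
    return pi_table[i, j]

def evaluate_policy_at(x0, N=5):
    x, total = np.asarray(x0, dtype=float), 0.0
    for _ in range(N):                                   # accumulate cost along the segment
        u = pi_lookup(x)
        total += c(x, u)
        x = f(x, u)
    return total + float(V_interp(x))                    # bootstrap with interpolated V(x_N)

print(evaluate_policy_at([0.5, 0.0]))
```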

Stochastic DP: Continuous State
• Cost: C = E(Σ_k c(x_k, u_k))
• Do Monte Carlo sampling of process noise for each trajectory segment (many trajectory segments), or
• Propagate an analytic distribution (see Kalman filter).
• The Bellman equation involves expectations:
  V_k(x) = min_u E[c(x, u) + V_{k+1}(f(x, u))]
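
A sketch of the Monte Carlo version of the expectation: for a candidate action, sample the process noise, push each sample through the dynamics, and average the bootstrapped values. The dynamics, cost, noise level, and stand-in V are all assumed for illustration.

```python
# Sketch of the Monte Carlo expectation in the stochastic backup.
import numpy as np

rng = np.random.default_rng(4)

def f(x, u):
    return 0.9 * x + u

def c(x, u):
    return x**2 + 0.1 * u**2

def V(x):                                      # stand-in for an interpolated value table
    return 2.0 * x**2

def expected_Q(x, u, n_samples=1000, noise_std=0.05):
    w = rng.normal(0.0, noise_std, size=n_samples)       # sampled process noise
    return c(x, u) + np.mean(V(f(x, u) + w))             # E[c(x, u) + V(f(x, u) + w)]

actions = np.linspace(-1.0, 1.0, 21)
q = [expected_Q(1.0, u) for u in actions]
print(actions[int(np.argmin(q))])              # approximate argmin_u of the expected Q
```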

Insight
• Try to develop planning methods that scale computational cost according to the complexity of the problem.
• Simple problems are easy to solve (LQR).
• Complicated problems are expensive to solve, or aren't solvable with current methods.

Ideas
• Randomly sample actions
• Randomly sample states
• Use local models
• Propagate local models along trajectories
• Locally optimize
• Coordinate local optimizations to globally optimize

What about continuous actions?

Randomized Dynamic Planning? • Probabilistic Roadmaps (aka PRM) • Rapidly Exploring Random Trees (RRT)

Random Sampling DP: Related Work
• Semi-uniform strategies in multi-armed bandit problems: Epsilon-X strategies.
• Evolutionary (random) action search.
• Nemirovsky and Yudin: can't beat the curse of dimensionality in continuous action search.
• Rust 97: random sampling of states in stochastic DP. E() smooths the problem. Beat the curse of dimensionality for computing E().
• Thrun 00: POMDPs. Random sampling of belief states. Nearest neighbor interpolation of V(). Coverage vs. surprise test.

Random Action Search
• Time invariant deterministic problem: find the steady state solution.
• Continuous actions.
• Q function: Q(x, u) = L(x, u) + V(f(x, u))
• Bellman's equation: V(x) = min_u Q(x, u)
• Try the current policy action u(x): Q(x, u(x))
• Try 1 random action u_random: Q(x, u_random)
• Take the best one u_best: Q(x, u(x)) < Q(x, u_random)?
• Update: V(x) = Q(x, u_best), u(x) = u_best
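
A sketch of this update on a toy 1D problem: the value and policy live as tables on a grid over x, V is linearly interpolated, and each asynchronous update compares the current policy action against one random action. A discount factor is added here (an assumption, as the earlier slides suggest) so V stays bounded; all problem data are illustrative.

```python
# Sketch of the random action search update on a toy 1D problem.
import numpy as np

rng = np.random.default_rng(5)
gamma = 0.95                                   # discount (assumption, not on this slide)

def f(x, u):                                   # toy time-invariant dynamics
    return 0.9 * x + u

def L(x, u):                                   # one-step cost
    return x**2 + 0.1 * u**2

xs = np.linspace(-1, 1, 41)                    # state grid
V = np.zeros_like(xs)                          # value table
u_pol = np.zeros_like(xs)                      # policy table u(x)

def V_of(x):                                   # linear interpolation of the value table
    return np.interp(x, xs, V)

def Q(x, u):                                   # Q(x, u) = L(x, u) + gamma * V(f(x, u))
    return L(x, u) + gamma * V_of(f(x, u))

def update(i):                                 # asynchronous update of one grid point
    x = xs[i]
    u_rand = rng.uniform(-1.0, 1.0)            # try 1 random action
    u_best = u_pol[i] if Q(x, u_pol[i]) <= Q(x, u_rand) else u_rand
    V[i] = Q(x, u_best)                        # V(x) = Q(x, u_best)
    u_pol[i] = u_best                          # u(x) = u_best

for _ in range(20_000):                        # visit grid points in random order
    update(rng.integers(len(xs)))
print(u_pol[np.abs(xs - 0.5).argmin()])        # learned action near x = 0.5
```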

DP Asynchronous

Two Link Swing Up
• State:
• Action:
• Cost function:
• State resolution: 100 per dimension, 100 million states

Two Link Swing Up

Two Link Swing Up

Two Link Swing Up (figure axes: angle, angular velocity, torque)

Effects of Action Grid Size Two Link: 60 x 60 x 60

Repeatability (8 -1004, 1 -1484)

Search More Actions Per Update?

Random Sampling of Actions: Improvements
• Tune the sampling distribution to the problem.
• Do some local optimization (gradient descent): so far this only slows the search down.
• Smooth the output trajectory with trajectory optimization.
• Schedule updates adaptively.