Markov Decision Processes (MDPs): read Ch. 17.1

Markov Decision Processes (MDPs)
• read Ch 17.1-17.2
• utility-based agents
  – goals encoded in a utility function U(s), or U: S → ℝ
• effects of actions encoded in a state transition function: T: S × A → S
  – or T: S × A → pdf(S) for non-deterministic effects
• rewards/costs encoded in a reward function: R: S × A → ℝ
• Markov property: effects of actions depend only on the current state, not on previous history
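
This is not on the slides, but as a concrete sketch of how these components might be encoded, here is a tiny made-up two-state MDP in Python: T is given as a probability distribution over next states (the non-deterministic case) and R is keyed by (state, action). All names and numbers here are illustrative assumptions, and the same encoding is reused in the later sketches.

    # hypothetical two-state MDP, purely for illustration
    states = ["s0", "s1"]
    actions = ["stay", "move"]

    # T: S x A -> pdf(S), stored as {(s, a): {s_next: probability}}
    P = {
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "move"): {"s1": 0.8, "s0": 0.2},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.8, "s1": 0.2},
    }

    # R: S x A -> reals, the immediate reward for taking action a in state s
    R = {
        ("s0", "stay"): 0.0,
        ("s0", "move"): -1.0,
        ("s1", "stay"): 1.0,
        ("s1", "move"): -1.0,
    }

    gamma = 0.9  # discount factor for long-term reward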

• the goal: maximize reward over time
  – long-term discounted reward
  – handles infinite horizons; encourages quicker achievement of rewards
• "plans" are encoded in policies
  – mappings from states to actions: π: S → A
• how do we compute the optimal policy π* that maximizes long-term discounted reward?
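
A small sketch of these two ideas, using the illustrative encoding above: the long-term discounted reward of a reward sequence is a γ-weighted sum, and a policy π: S → A can be stored as a plain dict. The reward sequence passed to the function is hypothetical.

    def discounted_return(rewards, gamma=0.9):
        """Long-term discounted reward: sum of gamma**t * r_t over the sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # a policy pi: S -> A, encoded as a mapping from states to actions
    pi = {"s0": "move", "s1": "stay"}

    # e.g. one hypothetical sampled reward sequence from following pi starting in s0
    print(discounted_return([-1.0, 1.0, 1.0, 1.0]))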

• value function Vπ(s): expected long-term discounted reward from starting in state s and following policy π
• derive a policy from V(s):
  – π(s) = argmax_{a∈A} E[R(s,a) + γ·V(T(s,a))]
  –      = argmax_{a∈A} Σ_{s'} p(s'|s,a)·(R(s,a) + γ·V(s'))
• the optimal policy comes from the optimal value function:
  – π*(s) = argmax_{a∈A} Σ_{s'} p(s'|s,a)·(R(s,a) + γ·V*(s'))
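
A sketch of the policy-extraction formula above, assuming the illustrative P/R encoding from earlier; greedy_policy is a hypothetical helper name, not something defined in the slides.

    def greedy_policy(V, states, actions, P, R, gamma):
        """pi(s) = argmax_a sum_s' p(s'|s,a) * (R(s,a) + gamma * V(s'))."""
        pi = {}
        for s in states:
            def q(a):  # expected one-step reward plus discounted value of the successor
                return sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
            pi[s] = max(actions, key=q)
        return pi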

Calculating V*(s)
• Bellman's equations (eqn 17.5):
  – V*(s) = max_{a∈A} Σ_{s'} p(s'|s,a)·(R(s,a) + γ·V*(s'))
• method 1: linear programming
  – n coupled equations, one per state (the max can be expressed with linear constraints):
    v1 = max(v2, v3, v4, ...)
    v2 = max(v1, v3, v4, ...)
    v3 = max(v1, v2, v4, ...)
  – solve for {v1, v2, v3, ...} using the GNU Linear Programming Kit, etc.
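
One way method 1 could look in code, as a sketch rather than the course's actual setup: minimize Σ_s V(s) subject to V(s) ≥ R(s,a) + γ·Σ_{s'} p(s'|s,a)·V(s') for every state-action pair, solved here with SciPy's linprog instead of the GNU LP Kit named on the slide. The P/R encoding follows the earlier illustrative example.

    import numpy as np
    from scipy.optimize import linprog

    def solve_mdp_lp(states, actions, P, R, gamma):
        """min sum_s V(s) subject to V(s) >= R(s,a) + gamma*sum_s' p(s'|s,a)*V(s')."""
        n = len(states)
        idx = {s: i for i, s in enumerate(states)}
        A_ub, b_ub = [], []
        for s in states:
            for a in actions:
                # rewrite the constraint as gamma*sum_s' p(s'|s,a)*V(s') - V(s) <= -R(s,a)
                row = np.zeros(n)
                row[idx[s]] -= 1.0
                for s2, p in P[(s, a)].items():
                    row[idx[s2]] += gamma * p
                A_ub.append(row)
                b_ub.append(-R[(s, a)])
        res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=[(None, None)] * n)
        return {s: res.x[idx[s]] for s in states}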

• method 2: Value Iteration
  – initialize V(s) = 0 for all states
  – iteratively update the value of each state based on its neighbors
  – ... until convergence
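
A compact sketch of value iteration under the same illustrative encoding: start from V(s) = 0 and repeatedly apply the Bellman backup until the largest change falls below a tolerance. The tolerance and function name are assumptions.

    def value_iteration(states, actions, P, R, gamma, tol=1e-6):
        """Iterate V(s) <- max_a sum_s' p(s'|s,a)*(R(s,a) + gamma*V(s')) until convergence."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(
                    sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                    for a in actions
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

    # once V* is known, the optimal policy comes from greedy_policy(V, ...) above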