Learning Procedural Planning Knowledge in Complex Environments
Douglas Pearson (douglas.pearson@threepenny.net)
March 2004

Characterizing the Learner
[Diagram: learners placed along two axes, Method (implicit vs. deliberate) and KR (procedural vs. declarative): Reinforcement Learning is implicit/procedural, IMPROV is deliberate/procedural, Symbolic Learners are deliberate/declarative]
Simple environments -> simpler agents; weak, slower learning
Complex environments -> complex agents; strong, faster learning
• Actions: duration & conditional
• Sensing: limited, noisy, delayed
• Task: timely response
• Domain: changes over time, large state space

Why Limit Knowledge Access?
• Procedural – only accessible by executing it
• Declarative – can answer when it will execute and what it will do
Declarative problems:
• Availability – If (x^5 + 3x^3 – 5x^2 + 2) > 7 then Action; chains of rules A -> B -> C -> Action
• Efficiency – O(size of knowledge base) or worse; the agent slows down as it learns more
IMPROV representation:
• Sets of production rules for operator preconditions and actions
• Assume the learner can only execute rules
• But allow declarative knowledge to be added when it is efficient to do so
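A minimal sketch of the procedural-access assumption (the rule, its condition, and all names here are illustrative, not from the talk): a precondition is stored as executable code, so the only question the agent can ask is "does this rule fire on the current state?", not "when would it fire?".

    # Illustrative sketch: procedural access to an operator precondition.
    # The condition is hidden inside the function body (procedural),
    # rather than stored as inspectable data (declarative).

    def make_brake_precondition():
        def rule(state):
            return state.get("speed", 0) > 0 and state.get("obstacle_ahead", False)
        return rule

    brake_applicable = make_brake_precondition()

    state = {"speed": 30, "obstacle_ahead": True}
    # The only query available is "execute the rule against this state".
    print("Brake proposed?", brake_applicable(state))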

Focusing on Part of the Problem
[Figure: task performance from 0% to 100% as a function of the agent's knowledge: the representation, the initial rule base, and the domain knowledge; the domain knowledge is the part to be learned]

The Problem
• Cast the learning problem as:
  – Error detection (incomplete/incorrect knowledge)
  – Error correction (fixing or adding knowledge)
• But with only limited, procedural access to that knowledge
• Aim: support learning in complex, scalable agents/environments

Error Detection Problem
[Diagram: a plan S1 -> S2 -> S3 -> S4 whose steps set Speed-30, Speed-10, Speed-0, Speed-30, generated from existing (possibly incorrect) knowledge]
How can the plan be monitored during execution without direct knowledge access?

Error Detection Solution
• Direct monitoring – not possible
• Instead, detect lack of progress toward the goal: no rules matching, or conflicting rules
  [Diagram: S1 -> Speed-30 -> S2 -> Speed-10 -> S3 -> Speed-0 -> S4; the engine stalls and no operator is proposed]
• Not predicting the behavior of the world (useful in stochastic environments)
• But no implicit notion of solution quality
• Domain-specific error conditions can be added – but are not required
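A sketch of the "lack of progress" check under simple assumptions (the rule and state formats, and all names, are illustrative): an error is flagged when no operator is proposed for the current state, or when mutually exclusive operators are proposed.

    # Illustrative sketch: flag an execution error when the rules make no
    # proposal for the current state, or make conflicting proposals.

    def proposals(rules, state):
        """Execute every precondition rule and collect the proposed operators."""
        return {op for op, applies in rules.items() if applies(state)}

    def detect_error(rules, state, conflicts=()):
        proposed = proposals(rules, state)
        if not proposed:
            return "no-proposal"           # nothing matches: stuck, no progress
        for a, b in conflicts:
            if a in proposed and b in proposed:
                return "conflict"          # mutually exclusive operators proposed
        return None                        # no error detected

    # Toy rules: the engine has stalled, so neither speed operator applies.
    rules = {
        "Speed-30": lambda s: s["engine_running"] and s["speed"] < 30,
        "Speed-0":  lambda s: s["engine_running"] and s["speed"] > 0,
    }
    print(detect_error(rules, {"engine_running": False, "speed": 10}))  # no-proposal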

IMPROV’s Recovery Method
[Flowchart] Repeat until the goal is reached:
  1. Replan (search)
  2. Execute, recording [State, Op -> Result]
  3. On failure, identify the incorrect operator(s)
  4. Learning: train the inductive learner and change the domain knowledge
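The cycle might be sketched roughly as below. This is not IMPROV's implementation: the domain, the planner, and the correction hook are toy stand-ins, and every name is illustrative.

    # Rough sketch of the replan / execute / record / learn cycle above.

    def recover(state, goal_test, plan, apply_op, correct_knowledge):
        trace = []                                   # [State, Op -> Result] records
        while not goal_test(state):
            for op in plan(state):                   # replan (search)
                result = apply_op(state, op)
                trace.append((state, op, result))    # record execution
                if result is None:                   # failure detected
                    correct_knowledge(trace)         # identify bad operator, train the
                    break                            # learner, change knowledge, replan
                state = result
                if goal_test(state):
                    break
        return state, trace

    # Toy usage: reach speed 0; the assumed slowing step is too big until corrected.
    knowledge = {"step": 10}
    goal_test = lambda speed: speed == 0
    plan      = lambda speed: ["Slow"] * 5
    apply_op  = lambda speed, op: speed - knowledge["step"] if speed - knowledge["step"] >= 0 else None
    def correct_knowledge(trace):
        knowledge["step"] = max(1, knowledge["step"] // 2)   # "learn" a smaller step
    print(recover(12, goal_test, plan, apply_op, correct_knowledge))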

Finding the Incorrect Operator(s)
[Diagram: in a trace involving Speed-30, Speed-10, Change-Gear, and Speed-0, the failure admits two explanations: Change-Gear is over-specific, or Speed-0 is over-general]
By waiting, the learner can do better credit assignment

Learning to Correct the Operator
• Collect a set of training instances [State, Operator -> Result]
• Can identify differences between states, e.g.:
  – Speed = 40, Light = green, Self = car, Other = car
  – Speed = 40, Light = green, Self = car, Other = ambulance
• These differences are used as a default bias when training the inductive learner
• Learn preconditions as a classification problem (predict the operator from the state)
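A sketch of treating precondition learning as classification over recorded instances. The crude specific-to-general induction below, and all names in it, are illustrative stand-ins for the actual inductive learner.

    # Illustrative sketch: from [State, Operator -> Result] instances, keep the
    # feature values shared by every successful state that also rule out the
    # failed states (assumes at least one successful instance).

    def learn_preconditions(instances, operator):
        positives = [s for s, op, ok in instances if op == operator and ok]
        negatives = [s for s, op, ok in instances if op == operator and not ok]
        # Start from everything the successful states agree on...
        common = set(positives[0].items())
        for s in positives[1:]:
            common &= set(s.items())
        if not negatives:
            return dict(common)
        # ...then keep only the conditions that also exclude the failures.
        return {k: v for k, v in common
                if any(s.get(k) != v for s in negatives)}

    instances = [
        ({"speed": 40, "light": "green", "other": "car"},       "Go", True),
        ({"speed": 40, "light": "green", "other": "ambulance"}, "Go", False),
    ]
    print(learn_preconditions(instances, "Go"))   # {'other': 'car'}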

K-Incremental Learning
• Collect a set of k instances, then train the inductive learner
[Diagram: spectrum of instance-set sizes: 1 (reinforcement learners), k1 (until a correction – IMPROV), k2 (until a unique cause – EXPO), n (non-incremental learners)]
K-incremental learner:
  – k does not grow over time => incremental behavior
  – Better decisions about what to discard when generalizing
  – When doing “active learning”, bad early learning can really hurt
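A sketch of the k-incremental idea (the class, the retrain hook, and the buffer policy are illustrative assumptions): instances accumulate only up to a bound k and training happens when a correction is needed, so cost does not grow with the agent's lifetime.

    # Illustrative sketch: bounded instance buffer, retrain on demand.

    class KIncrementalLearner:
        def __init__(self, k, retrain):
            self.k = k                 # bound on the instance-set size
            self.buffer = []
            self.retrain = retrain     # callback: train the inductive learner

        def observe(self, instance, error_detected):
            self.buffer.append(instance)
            if len(self.buffer) > self.k:
                self.buffer.pop(0)     # discard the oldest; k never grows
            if error_detected:
                self.retrain(list(self.buffer))
                self.buffer.clear()    # start collecting toward the next correction

    learner = KIncrementalLearner(k=3, retrain=lambda batch: print("retrain on", len(batch)))
    for i, err in enumerate([False, False, False, False, True]):
        learner.observe(("state%d" % i, "op", "result"), err)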

Extending to Operator Actions
[Diagram: the change from Speed 30 to Speed 20 is decomposed into an operator hierarchy: Brake expands to Slow -5 and Slow -10; Release expands to Slow 0]
Decompose into an operator hierarchy; it terminates with operators that modify a single symbol
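A sketch of what such a decomposition might look like as a data structure (the hierarchy, the "slow" symbol, and the expansion function are assumptions for illustration): compound operators expand into primitives that each write exactly one symbol.

    # Illustrative sketch: compound actions bottom out in single-symbol changes.

    operator_hierarchy = {
        "Brake":   ["Slow-5", "Slow-10"],
        "Release": ["Slow-0"],
    }
    primitives = {                       # each primitive writes a single symbol
        "Slow-5":  ("slow", -5),
        "Slow-10": ("slow", -10),
        "Slow-0":  ("slow", 0),
    }

    def expand(op):
        """Flatten a compound operator into its sequence of primitive steps."""
        if op in primitives:
            return [op]
        return [p for sub in operator_hierarchy[op] for p in expand(sub)]

    state = {"slow": 0}
    for p in expand("Brake") + expand("Release"):
        symbol, value = primitives[p]    # terminates with a single-symbol change
        state[symbol] = value
        print(p, "->", dict(state))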

Correcting Actions
[Diagram: expected effects of braking (Slow -5, Slow -10) vs. observed effects of braking on ice (Slow -2, Slow -4, Slow -6 => failure)]
Use the correction method to change the preconditions of these sub-operators

Change Procedural Actions
Changing the effects of Brake:
• Specialize Slow -5: Braking & slow=0 & ice => reject Slow -5
• Generalize Slow -2: Braking & slow=0 & ice => propose Slow -2
Supports complex actions:
• Actions with durations (a sequence of operators)
• Conditional actions (branches in the sequence of operators)
• Multiple simultaneous effects
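The specialize/generalize step can be pictured as adding reject and propose rules keyed on the newly learned condition. The rule format below is an illustrative stand-in (not Soar or IMPROV syntax), and the condition/operator names simply mirror the slide.

    # Illustrative sketch: correct an action's effects by adding rules that
    # reject Slow-5 on ice (specialization) and propose Slow-2 instead
    # (generalization of Slow-2's preconditions).

    def matches(cond, state):
        return all(state.get(k) == v for k, v in cond.items())

    rules = [
        # (conditions, verb, operator)
        ({"braking": True, "slow": 0},              "propose", "Slow-5"),
        ({"braking": True, "slow": 0, "ice": True}, "reject",  "Slow-5"),
        ({"braking": True, "slow": 0, "ice": True}, "propose", "Slow-2"),
    ]

    def select_operators(state):
        proposed = {op for cond, verb, op in rules if verb == "propose" and matches(cond, state)}
        rejected = {op for cond, verb, op in rules if verb == "reject" and matches(cond, state)}
        return proposed - rejected

    print(select_operators({"braking": True, "slow": 0, "ice": False}))  # {'Slow-5'}
    print(select_operators({"braking": True, "slow": 0, "ice": True}))   # {'Slow-2'}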

IMPROV Summary
[Diagram: Method (implicit vs. deliberate) x KR (procedural/incremental vs. declarative/non-incremental): Reinforcement Learning is implicit/procedural, IMPROV is deliberate/procedural, Symbolic Learners are deliberate/declarative]
k-Incremental learning:
  – Improved credit assignment: which operator, which feature
IMPROV support for:
• Powerful agents: multiple goals; faster, deliberate learning
• Complex environments: noise, complex actions, dynamic environments
A general, weak, deliberate learner with only procedural access assumed:
  – General-purpose error detection
  – A general correction method applied to preconditions and actions
  – Nice re-use of the precondition learner to learn actions
  – Easy to add domain-specific knowledge to make the method stronger

Redux: Diagram-based Example-driven Knowledge Acquisition
Douglas Pearson (douglas.pearson@threepenny.net)
March 2004

1. User specifies desired behavior

2. User selects features – these define the rules (later we’ll use ML to guess this initial feature set)

3. Compare desired behavior with the rules
Desired: Move-through(door1), Turn-to-face(threat1), Shoot(threat1)
Actual:  Move-through(door1), Turn-to-face(neutral1), Shoot(neutral1)
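A sketch of this comparison step (the trace format and function name are illustrative): align the desired and actual operator sequences and report the first point where they diverge.

    # Illustrative sketch: find the first divergence between the desired trace
    # and the trace the current rules actually produce.

    desired = ["Move-through(door1)", "Turn-to-face(threat1)",  "Shoot(threat1)"]
    actual  = ["Move-through(door1)", "Turn-to-face(neutral1)", "Shoot(neutral1)"]

    def first_divergence(desired, actual):
        for i, (want, got) in enumerate(zip(desired, actual)):
            if want != got:
                return i, want, got
        if len(desired) != len(actual):           # one trace is a prefix of the other
            return min(len(desired), len(actual)), None, None
        return None                               # traces agree

    print(first_divergence(desired, actual))
    # (1, 'Turn-to-face(threat1)', 'Turn-to-face(neutral1)')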

4. Identify and correct problems
• Detect differences between the desired behavior and the rules:
  – Overgeneral preconditions
  – Conflicts within the scenario
  – Conflicts between scenarios
  – Choice points where there’s no guidance, etc.
• All of these errors are detected automatically when the rule is created

5. Fast rule creation by expert
[Workflow diagram: the expert defines behavior with diagram-based examples, building a library of validated behavior examples; analysis & generation tools detect inconsistency, generalize, and generate rules (e.g. A -> B, C -> D, E, J -> F, G, A, C -> H, E, G -> I, J, K -> L) as executable code; execution is simulated in a simulation environment; roles shown: expert, engineer]