Learning from Observations (Chapter 18)
Spring 2004, CS 471/598 by H. Liu
Learning agents
- Learning agents improve their behavior through diligent study of their own experiences: Acting -> Experience -> Better Acting
- We'll study how to make an agent learn, what is needed for learning, and some representative methods of learning from observations.
A general model
What are the components of a learning agent?
- Learning element: learns and improves (Fig 2.15)
- Performance element: the agent itself, which perceives and acts
- Problem generator: suggests exploratory actions
- Critic: provides feedback on how the agent is doing
The design of a learning agent is affected by four issues: prior information, feedback, representation, and performance.
What do we need?
- Components of the performance element: each component should be learnable given feedback
- Representation of the components: propositional logic, FOL, or others
- Available feedback: supervised, reinforcement, or unsupervised
- Prior knowledge: none, some (why not all?)
Putting it all together: learning amounts to learning some functions.
Inductive learning
- Data are described by examples; an example is a pair (x, f(x))
- Induction: given a collection of examples of f, return a function h that approximates f
  - Data in Fig 18.3
  - Concepts about learning (explained using Fig 18.1): hypothesis, bias
- Learning incrementally or in batch
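A minimal Python sketch of this setup, assuming a hypothesis is just any callable consistent with the examples (names such as memorize are illustrative, not from the text); the lookup-table h below is consistent with its training examples but generalizes poorly, which is why bias matters:

```python
# Minimal sketch of inductive learning: examples are (x, f(x)) pairs.
# Names here (memorize, h) are illustrative, not from the slides.

def f(x):                      # the "true" function (unknown to the learner)
    return x % 2 == 0

examples = [(x, f(x)) for x in range(8)]    # a collection of examples of f

def memorize(examples):
    """Batch learning: return a lookup-table hypothesis h."""
    table = dict(examples)
    def h(x):
        return table.get(x, False)           # default guess off the training set
    return h

h = memorize(examples)
print(h(4), h(100))   # True on a seen example; the default may be wrong elsewhere
```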
Some questions about inductive learning
- Are there many forms of inductive learning? (We'll study some.)
- Can we achieve both expressiveness and efficiency?
- How can one possibly know that one's learning algorithm has produced a theory that will correctly predict the future? If one does not, how can one say that the algorithm is any good?
Learning decision trees
- A decision tree takes as input an object described by a set of attributes and outputs a yes/no "decision".
- One of the simplest and yet most successful forms of learning.
- To make a "wait" or "not wait" decision, we need information such as … (page 654 lists the 10 attributes of the data set in Fig 18.3), for example:
  Patrons(Full) ^ WaitEstimate(0-10) ^ Hungry(N) => WillWait
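As an illustration only, here is how a path of attribute tests like the rule above looks as code; the attribute names follow the WillWait example, but the tree shape and the function name will_wait are made up for this sketch, not the tree induced in the chapter:

```python
# Illustrative sketch: one path of a decision tree as nested attribute tests.
# The tree shape is invented for illustration.

def will_wait(example):
    if example["Patrons"] == "None":
        return False
    if example["Patrons"] == "Some":
        return True
    # Patrons == "Full"
    if example["WaitEstimate"] == "0-10" and not example["Hungry"]:
        return True   # matches Patrons(Full) ^ WaitEstimate(0-10) ^ Hungry(N) => WillWait
    return False

print(will_wait({"Patrons": "Full", "WaitEstimate": "0-10", "Hungry": False}))  # True
```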
Let's make a decision
- Where to start?
Expressiveness of a DT
- Continuing from the "Learning decision trees" slide: a possible DT (e.g., Fig 18.2)
- The decision tree language is essentially propositional, with each attribute test being a proposition.
- Any Boolean function can be written as a decision tree (truth tables <-> DTs).
- DTs can represent many functions with much smaller trees, but not all Boolean functions (e.g., parity, majority).
- How many different functions are in the set of all Boolean functions on n attributes? (2^(2^n): a truth table over n attributes has 2^n rows, and each row can be labeled in two ways.)
- How do we find consistent hypotheses in the space of all possible ones?
- Which one is most likely the best?
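A quick numeric check of that count (throwaway Python, not from the slides):

```python
# Number of distinct Boolean functions on n attributes: 2 ** (2 ** n).
for n in range(1, 5):
    print(n, 2 ** (2 ** n))   # 1: 4, 2: 16, 3: 256, 4: 65536
```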
Inducing DTs from examples
- Extracting a pattern (a DT) means being able to describe a large number of cases in a concise way: a consistent and concise tree.
- Applying Occam's razor: the most likely hypothesis is the simplest one that is consistent with all observations.
- How to find the smallest DT?
  - Examine the most important attribute first (Fig 18.4)
  - Algorithm (Fig 18.5, page 658); a sketch in code follows
  - Another DT (Fig 18.6)
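A compact Python sketch in the spirit of the recursive algorithm of Fig 18.5; this is not the book's pseudocode, and the helper names (Node, plurality_value, dt_learn, importance) are mine. The importance measure is left as a parameter; the information-gain slides that follow show one way to fill it in.

```python
# Sketch of recursive decision-tree induction in the spirit of Fig 18.5.
# Examples are (attributes_dict, classification) pairs.
from collections import Counter

class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute, self.branches, self.label = attribute, branches or {}, label

def plurality_value(examples):
    """Most common classification among the examples."""
    return Counter(cls for _, cls in examples).most_common(1)[0][0]

def dt_learn(examples, attributes, parent_examples, importance):
    if not examples:
        return Node(label=plurality_value(parent_examples))
    classes = {cls for _, cls in examples}
    if len(classes) == 1:
        return Node(label=classes.pop())
    if not attributes:
        return Node(label=plurality_value(examples))
    # Choose the "most important" attribute, e.g. by information gain.
    a = max(attributes, key=lambda attr: importance(attr, examples))
    tree = Node(attribute=a)
    for v in {x[a] for x, _ in examples}:       # one branch per value seen for a
        exs = [(x, cls) for x, cls in examples if x[a] == v]
        rest = [b for b in attributes if b != a]
        tree.branches[v] = dt_learn(exs, rest, examples, importance)
    return tree
```

For brevity, branches are created only for attribute values that actually occur in the examples; the full algorithm also handles unseen values by falling back to the plurality value of the parent.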
Choosing the best attribute
- A computational method: information theory
  - Information: informally, the more surprise you have, the more information you get; mathematically,
    I(P(v1), …, P(vn)) = Σ -P(vi) log2 P(vi)
    - I(1/2, 1/2) = 1
    - I(0, 1) = I(1, 0) = 0
  - Information alone can't help much to answer "what is the correct classification?"
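The same quantity in a few lines of Python (the name entropy is mine):

```python
import math

def entropy(probs):
    """I(P(v1), ..., P(vn)) = sum of -P(vi) * log2(P(vi)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0  (a fair coin flip carries one bit)
print(entropy([0.0, 1.0]))   # 0.0  (no surprise, no information)
```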
Information gain: the difference between the original and the new information requirement:
- Remainder(A) = p1*I(B1) + … + pn*I(Bn), where p1 + … + pn = 1
  (pi is the fraction of examples sent down branch i of the test on A, and I(Bi) is the information still needed in that branch)
- Gain(A) = I(original example set) - Remainder(A)
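Putting the two formulas together in Python (self-contained; information_gain is an illustrative name). With importance=information_gain, this is one way to fill in the attribute-selection step of the dt_learn sketch above:

```python
import math
from collections import Counter

def information_gain(attribute, examples):
    """Gain(A) = I(whole set) - Remainder(A); examples are (x, cls) pairs."""
    def info(exs):
        counts = Counter(cls for _, cls in exs).values()
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts)
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:    # one branch B_i per value of A
        subset = [(x, cls) for x, cls in examples if x[attribute] == value]
        remainder += len(subset) / len(examples) * info(subset)   # p_i * I(B_i)
    return info(examples) - remainder                     # Gain(A)
```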
Which attribute?
- Revisit the "wait" or "not wait" example using your favorite two attributes.
Assessing the performance
- A fair assessment uses examples the learner has not seen; measure the errors on them.
- Training and test sets:
  - Divide the data into two sets
  - Learn on the training set
  - Test on the test set
  - If necessary, shuffle the data and repeat
- Learning curve, the "happy graph" (Fig 18.7)
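A minimal sketch of this protocol in Python (assess is an illustrative name; learn stands for any learner returning a hypothesis h, e.g. a wrapper around the dt_learn sketch above):

```python
import random

def assess(examples, learn, split=0.8, trials=5):
    """Shuffle, split into training and test sets, and average the test error."""
    data, errors = list(examples), []
    for _ in range(trials):
        random.shuffle(data)
        cut = int(split * len(data))
        train, test = data[:cut], data[cut:]
        h = learn(train)                                   # learn on the training set
        wrong = sum(1 for x, cls in test if h(x) != cls)   # test on the held-out set
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)
```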
Practical use of DT learning
- BP's use of GASOIL
- Learning to fly on a flight simulator
- An industrial-strength system: Quinlan's C4.5
- Who's the next hero?
Some issues in DT applications
- Missing values
- Multivalued attributes
- Continuous-valued attributes
Why does learning work?
- How can one possibly know that one's learning algorithm will correctly predict the future? How do we know that h is close enough to f without knowing f?
- Computational learning theory has provided some answers. The basic idea: because any seriously wrong h will make an incorrect prediction, it will be found out with high probability after a small number of examples. So if h is consistent with a sufficiently large number of examples, it is unlikely to be seriously wrong: it is probably approximately correct (PAC).
- Stationarity assumption: the training and test examples are drawn from the same probability distribution.
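One standard way (from computational learning theory generally, not stated on this slide) to quantify "a sufficient number of examples": for a finite hypothesis space H, any h consistent with m examples has error at most ε with probability at least 1 - δ provided

```latex
% Standard PAC sample-complexity bound for a finite hypothesis space H:
% with probability >= 1 - \delta, a hypothesis consistent with m examples
% has error <= \epsilon whenever
m \ge \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln\lvert H\rvert\right)
```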
Summary
- Learning is essential for intelligent agents:
  - dealing with the unknown
  - improving their capability over time
- All types of learning can be viewed as learning an accurate representation h of some function f.
- Inductive learning: inducing h from data about f
- Decision trees: deterministic Boolean functions