Machine Learning meets the Real World: Successes and new research directions
Andrea Pohoreckyj Danyluk
Department of Computer Science
Williams College, Williamstown, MA
October 11, 2002

Data, data everywhere...
• Scientific: data collection routinely produces gigabytes of data per day
• Telecommunications: AT&T produces 275 million call records
• Web: Google handles 70 million searches
• Retail: Wal-Mart records 20 million sales transactions

A wealth of information
• Scientific data
  – Detection of oil spills from satellite images
  – Prediction of molecular bioactivity for drug design
• Telecommunications
  – Fraud detection to distinguish between “bad” and normal usage of cell phones

A wealth of information
• Web mining
  – Characterize killer pages
• Retail
  – Determine better product placement
• Direct mail
  – Predict who is most likely to donate to a charity

Machine learning success (Machine learning is ubiquitous)
• Scientific discovery
  – Detection of oil spills from satellite images
• Telecommunications
  – Diagnosis of problems in the local loop
• Printing
  – Determine causes of banding (printing cylinder problems)
• Control
  – Self-steering vehicles

Why research in machine learning is so good today
Research in machine learning benefits from
• Abundant data
• Interest in fielding new applications
  – Even more data
  – Push on limits of our understanding, technology, etc.

Plan for this talk
Original
• Discuss success stories and failures
• Failures help identify new areas of research
New plan
• One success story in detail
• Lesson learned: can identify new areas of research even when we succeed

Induction of decision trees
• Not the only (or even the most “hot”) algorithms
• Have been used in many contexts
• Important for understanding our success story: local-loop network diagnosis

Inductive learning
Given a collection of observations of the form (<x>, f<x>)
Find g<x> that approximates f<x>

Sample data

Predictive model
I.e., g<x>

Learning objectives
• Learn a tree that is correct
• Learn a tree that is compact
• At every level in the tree, select a test that best differentiates examples of one class from another

TDIDT (top-down induction of decision trees)
• If all examples are from the same class
  – The tree is a leaf with that class name
• Else
  – Pick a test to make
  – Construct one edge for each possible test outcome
  – Partition the examples by test outcome
  – Build subtrees recursively
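
The recursion on the TDIDT slide can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used in the talk. It assumes symbolic attributes only, examples stored as (feature dict, label) pairs, and a majority-class fallback when no tests remain. The names `build_tree`, `classify`, `choose_attribute`, and the toy records are all hypothetical. The default test selection (take the first remaining attribute) is just a placeholder for a real selection criterion.

```python
from collections import Counter

def build_tree(examples, attributes, choose_attribute=None):
    """Top-down induction over (features_dict, label) pairs with symbolic attributes."""
    labels = [label for _, label in examples]
    # Base case: all examples share one class, so the tree is a leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption (not on the slide): with no tests left, use the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick a test to make; the default choice is only a placeholder.
    attr = choose_attribute(examples, attributes) if choose_attribute else attributes[0]
    # Partition the examples by test outcome.
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[attr], []).append((features, label))
    remaining = [a for a in attributes if a != attr]
    # One edge per observed outcome, each leading to a recursively built subtree.
    edges = {value: build_tree(subset, remaining, choose_attribute)
             for value, subset in partitions.items()}
    return {"test": attr, "edges": edges}

def classify(tree, features):
    """Follow edges until a leaf (a class name) is reached."""
    while isinstance(tree, dict):
        tree = tree["edges"][features[tree["test"]]]
    return tree

# Hypothetical toy records, for illustration only.
toy_examples = [
    ({"line_test": "fail", "weather": "dry"}, "dispatch"),
    ({"line_test": "fail", "weather": "wet"}, "dispatch"),
    ({"line_test": "pass", "weather": "dry"}, "no_dispatch"),
    ({"line_test": "pass", "weather": "wet"}, "dispatch"),
]
toy_tree = build_tree(toy_examples, ["line_test", "weather"])
```

Note that `classify` raises a `KeyError` for an attribute value never seen during training; a fielded system would need a policy for that case.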

Which is better?

The Gain Criterion
• Measure the information of the collection
• Measure the information of each possible split
• Choose the split with greatest information gain

Information (Entropy)
• Let $T$ be a set of examples
• Let $C_1, C_2, \ldots, C_n$ be class labels
• $\mathrm{freq}(C_i, T)$ = number of examples in $T$ that belong to class $C_i$
• $|T|$ = number of examples in $T$
• Select an example and announce its class: $\mathrm{info} = -\log_2\left(\mathrm{freq}(C_i, T)/|T|\right)$

Information (Entropy)
• Let $T$ be a set of examples
• $\mathrm{info}(T) = -\sum_{i=1}^{n} \frac{\mathrm{freq}(C_i, T)}{|T|} \log_2\left(\frac{\mathrm{freq}(C_i, T)}{|T|}\right)$
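
As a small worked sketch of the entropy definition, the function below computes info(T) from a list of class labels. The function name `info` follows the slide; representing T as a plain list of labels is an assumption made for illustration.

```python
import math
from collections import Counter

def info(labels):
    """Entropy, in bits, of a collection of class labels (the set T)."""
    total = len(labels)
    counts = Counter(labels)  # freq(C_i, T) for each class C_i present in T
    # Sum of -p * log2(p) over the classes, where p = freq(C_i, T) / |T|.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# An evenly split two-class collection carries one bit of information;
# a pure collection carries none.
even_split = info(["dispatch", "dispatch", "no_dispatch", "no_dispatch"])
pure = info(["dispatch"] * 4)
```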

Entropy after a split
• Let $X$ be an attribute with $n$ possible values.
• Let $T_j$ be the examples that have the value $j$ for attribute $X$.
Average entropy that results from making a split on $X$:
$\mathrm{info}_X(T) = \sum_{j=1}^{n} \frac{|T_j|}{|T|} \, \mathrm{info}(T_j)$

Information Gain
• Compute $\mathrm{info}_X(T)$ for every attribute $X$
• Select the attribute that maximizes $\mathrm{info}(T) - \mathrm{info}_X(T)$
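
The gain criterion can be sketched as follows, assuming symbolic attributes and examples stored as (feature dict, label) pairs. The helper names `info_x`, `gain`, and `best_attribute`, and the toy records, are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def info(labels):
    """Entropy, in bits, of a collection of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_x(examples, attr):
    """Average entropy after splitting the examples on attribute attr."""
    subsets = defaultdict(list)  # value j of attr -> labels of the examples in T_j
    for features, label in examples:
        subsets[features[attr]].append(label)
    total = len(examples)
    return sum(len(s) / total * info(s) for s in subsets.values())

def gain(examples, attr):
    """Information gain: info(T) - info_X(T)."""
    return info([label for _, label in examples]) - info_x(examples, attr)

def best_attribute(examples, attributes):
    """Select the attribute with the greatest information gain."""
    return max(attributes, key=lambda a: gain(examples, a))

# Hypothetical toy records: line_test separates the classes, weather does not.
toy_records = [
    ({"line_test": "fail", "weather": "dry"}, "dispatch"),
    ({"line_test": "fail", "weather": "wet"}, "dispatch"),
    ({"line_test": "pass", "weather": "dry"}, "no_dispatch"),
    ({"line_test": "pass", "weather": "wet"}, "no_dispatch"),
]
chosen = best_attribute(toy_records, ["line_test", "weather"])
```

Here splitting on `line_test` yields two pure subsets (gain of one bit), while splitting on `weather` leaves both subsets evenly mixed (gain of zero), so `line_test` is selected.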

Which is better?

Scrubber (the success story)
• Diagnoses problems in the local loop
• Problem may be due to trouble in:
  – Customer premise equipment
  – Facilities connecting customer to cable
  – Central office
• Millions of “troubles” reported annually

MAX, 1990
• Acts as Maintenance Administrator (MA)
• Sequence of actions:
  – Customer calls
  – Rep takes information; initiates tests
  – Trouble report sent to MA
  – MA puts trouble in dispatch queue for specific type of technician

Scrubber 2
• Performed a task at a later point in the pipeline
• Surveyed dispatch queues to determine whether dispatch was appropriate
  – Dispatch not immediate
  – Many problems resolved exogenously

Scrubber 3
• Scrubber 2 for new application platform
• Centralized knowledge server
• Covers a network twice as large

Implementation difficulties
• Original expert system shell no longer supported
• Knowledge base evolved into opacity
  – Many tweaks over a decade
  – Many knowledge engineers
  – Most not available to work on Scrubber 3

Requirements
• Level of performance at least as good as prior system
  – Overall accuracy
  – False positives and false negatives in range
• Comprehensible
  – For understanding and acceptance by experts
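
The performance requirements can be checked with a simple comparison of a model's decisions against reference decisions. The sketch below is illustrative only: the function name `agreement_report`, the choice of "dispatch" as the positive class, and the sample labels are assumptions, and the slides do not state what the acceptable ranges were.

```python
def agreement_report(reference, predicted, positive="dispatch"):
    """Accuracy plus false positive and false negative rates against reference labels."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    tn = sum(r != positive and p != positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)  # dispatched needlessly
    fn = sum(r == positive and p != positive for r, p in pairs)  # missed a needed dispatch
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }

# Hypothetical labels, for illustration only.
reference_labels = ["dispatch", "dispatch", "no_dispatch", "no_dispatch", "dispatch"]
model_labels = ["dispatch", "no_dispatch", "no_dispatch", "dispatch", "dispatch"]
report = agreement_report(reference_labels, model_labels)
```

Reporting the two error rates separately matters here because the two kinds of mistake have different costs: a needless dispatch wastes a technician visit, while a missed dispatch leaves a customer's trouble unresolved.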

Additional requirements (ours)
• Improved performance
• Improved extensibility

Phase I: Modeling Scrubber 2
• Applied a decision tree learning algorithm
• Input data:
  – Trouble reports
  – Scrubber 2 diagnoses

Data
• 26,000 trouble reports
• 40 attributes (1/2 continuous; 1/2 symbolic)
• Two classes
  – Dispatch
  – Don’t dispatch (i.e., call customer to verify OK)

Background knowledge
• C4.5 selected
• 17 of 40 attributes used

Phase I results
• Decision trees with predictive accuracy of 0.99, with as few as 10,000 examples
• Less than two days of work (easy!)

Phase II: Acceptance
• Comprehensibility Readability
  – Need to observe rationality in learned knowledge
  – Original trees on order of 1000 nodes
• The simpler the model, the better it can be understood
Comprehensibility = Readability + Simplicity + Fidelity

Trading off simplicity and correctness
• Pruning nodes sacrifices correctness
• Appropriate when comprehensibility is an issue
• Langley and Schwabacher, 2001
• Note: not pruning to avoid overfitting

Phase II results
• Used only the two most prominent attributes
• New decision trees created
• Still fell into acceptable zone

Phase III: Working toward extensibility
• Hoped to gain flexibility for
  – Local modifiability
  – Additional attribute values
• Moved toward probabilistic decision tree
  – Leaves labeled with probability estimates, not decisions
  – Stubby trees easy to represent in tabular form
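
One way to picture a stubby probabilistic tree in tabular form is a table keyed by the values of two attributes, with each cell holding a probability estimate and the number of examples behind it. The sketch below is an assumption-laden illustration, not the Scrubber 3 implementation: the attribute names `attr_1` and `attr_2`, the function `probability_table`, the use of raw frequencies as estimates, and the sample records are all hypothetical.

```python
from collections import defaultdict

def probability_table(examples, attr_a, attr_b, positive="dispatch"):
    """Map each (value_a, value_b) cell to an estimated P(positive) and its support."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [positive count, total count]
    for features, label in examples:
        cell = counts[(features[attr_a], features[attr_b])]
        cell[0] += label == positive
        cell[1] += 1
    # Raw relative frequency is used as the estimate; smoothing is left out.
    return {key: {"p": pos / total, "n": total} for key, (pos, total) in counts.items()}

# Hypothetical records, for illustration only.
records = (
    [({"attr_1": "x", "attr_2": "u"}, "dispatch")] * 3
    + [({"attr_1": "x", "attr_2": "u"}, "no_dispatch")]
    + [({"attr_1": "y", "attr_2": "u"}, "no_dispatch")] * 2
)
table = probability_table(records, "attr_1", "attr_2")
```

Keeping the support count `n` next to each estimate makes it easy to see which cells rest on few examples, and a single cell can be edited without touching the rest of the table, which is one reading of "local modifiability".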

Phase IIIb: More data
• Focus on two attributes gave us access to an extensive data set
  – Many more trouble reports
  – Abridged (two-attribute) form had not been considered useful earlier

Phase III results
• Simple diagnostic model
• Greater empirical confidence, important due to the small disjunct problem
  – “Big” general rules cover approximately 50% of the data
  – Remaining 50% covered by small disjuncts
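
To make the small disjunct problem concrete, the sketch below measures what share of a data set is covered by rules (or leaves) that each match only a few examples. The function name `small_disjunct_share`, the size threshold, and the coverage counts are hypothetical and are not taken from the Scrubber data.

```python
def small_disjunct_share(rule_coverage, max_size):
    """Fraction of all covered examples that fall under rules covering <= max_size examples."""
    total = sum(rule_coverage.values())
    small = sum(n for n in rule_coverage.values() if n <= max_size)
    return small / total

# Hypothetical coverage counts: number of training examples matched by each rule.
coverage = {"r1": 300, "r2": 300, "r3": 80, "r4": 60, "r5": 40, "r6": 20}
share = small_disjunct_share(coverage, max_size=100)
```

Each small rule rests on little evidence, so when such rules jointly cover a large share of the data, additional examples are especially valuable for confirming or correcting them.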

Summarizing the success story
• C4.5 applied to induce Scrubber 2 model
• Pruned model for comprehensibility/simplicity
• Converted new model into probabilistic one
• Used newly gained data for additional tuning and confidence
• Small(?), simple model in very short time

Lessons can be learned from success
Lesson 1: the importance of comprehensibility
  – Rationality
  – Readability
  – Simplicity

Lessons can be learned from success
Lesson 2: the need for algorithms to handle small data sets
  – Creative ways to engineer interesting features from few
  – Openness to alternative sources of data
  – Algorithms specifically tuned to handle small data sets
Langley has noted this to be an issue for scientific data, but it is true for industrial data as well

Lessons can be learned from success
Lesson 3: the need to think about systematic error
  – Locally systematic error only looks like noise with enough data
  – Clearly related to the problem of small data sets
  – How do our algorithms hold up?

Lessons can be learned from success
Lesson 4: the need to think about the future
  – Learning results put into practice will be modified and extended
  – Must new models be learned?
  – Can improvement be incremental?

Lessons can be learned from success
Lesson 5: creative uses of the technology
  – Learning for the purposes of re-engineering isn’t “standard”
  – New applications will serve to fuel new research

Further reading and acknowledgements
• Carla Brodley et al., American Scientist, Jan./Feb. ’99
• Pat Langley, various publications
• Thanks to Foster Provost and many others at Nynex / Bell Atlantic