Overview of Methods
• Data mining techniques
• What techniques do, examples
• Advantages & disadvantages

4-2 History
• Statistics
• AI:
  – genetic algorithms, neural networks
    • analogies with biology
  – memory-based reasoning
  – link analysis from graph theory
McGraw-Hill/Irwin © 2007 The McGraw-Hill Companies, Inc. All rights reserved

4-3 Techniques
• Statistical
  – Market-Basket Analysis - find groups of items
  – Memory-Based Reasoning - case based
  – Cluster Detection - undirected (quantitative MBA)
• Artificial Intelligence
  – Link Analysis - MCI's Friends & Family
  – Decision Trees, Rule Induction - production rule
  – Neural Networks - automatic pattern detection
  – Genetic Algorithms - keep best parameters
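
To make "find groups of items" concrete, here is a minimal sketch of market-basket analysis on a toy data set. The baskets, the item names, and the helper names `pair_support` and `confidence` are illustrative assumptions, not part of the slides; the sketch only counts item pairs and does not implement a full association-rule miner.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


# Hypothetical transactions, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

support = pair_support(baskets)
print(support[("bread", "milk")])            # share of baskets with both items
print(confidence(baskets, "bread", "milk"))  # share of bread baskets that also hold milk
```

Pairs with high support and high confidence are the candidate "groups of items" that the technique reports.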

4-4 Models
• Regression: Y = a + bX
• Classification: assign new record to class
• Predictive: assign value to new record
• Clustering: groups for data
• Time-series: assign future value
• Links: patterns in data
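
The regression model Y = a + bX can be fitted by ordinary least squares. The sketch below assumes that fitting method and uses made-up data points; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the model Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Invented data lying exactly on Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
prediction = a + b * 5.0  # predictive use: assign a value to a new record
print(a, b, prediction)
```

The last line illustrates the "Predictive" bullet: once a and b are estimated, a new record's X gives its predicted Y.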

4-5 Fitting
• Underfitting: not enough detail
  – leave out important variables
• Overfitting: too much detail
  – memorizes training set, but doesn't help with new data
    • data set too small
    • redundancy in data
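
The contrast between underfitting and overfitting can be sketched with three toy classifiers. Everything here is an assumption made for illustration: the data are contrived (the underlying rule is "x >= 5 means class 1", with two deliberately noisy training labels), and the nearest-neighbour lookup stands in for a model that memorizes its training set.

```python
def accuracy(model, data):
    """Fraction of (x, label) records the model classifies correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Contrived training set; the labels at x=3 and x=8 are noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
# New records that follow the underlying rule without noise.
test = [(2.9, 0), (3.2, 0), (7.8, 1), (8.3, 1), (5.5, 1), (1.5, 0)]


def constant_model(x):
    """Underfit: ignores the input variable entirely."""
    return 0


def threshold_model(x):
    """Reasonable fit: captures the underlying rule, not the noise."""
    return 1 if x >= 5 else 0


def memorizing_model(x):
    """Overfit: returns the label of the closest training record, noise included."""
    _, nearest_label = min(train, key=lambda rec: abs(rec[0] - x))
    return nearest_label


for name, model in [("constant", constant_model),
                    ("threshold", threshold_model),
                    ("memorizing", memorizing_model)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

On this toy data the memorizing model is perfect on the training set but does worst on the new records, while the constant model is mediocre on both.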

4-6 Comparison of Features
Columns: Rules, Neural Net, Case Base, Genetic
• Noisy data: Good, Very good
• Missing data: Good, Very good, Good
• Large sets: Very good, Poor, Good
• Different types: Good, Numerical, Very good, Transform
• Accuracy: High, Very high, High
• Explanation: Very good, Poor, Very good, Good
• Integration: Good, Very good
• Ease: Easy, Difficult

4-7 Data Mining Functions
• Classification
  – Identify categories in data
• Prediction
  – Formula to predict future observations
• Association
  – Rules using relationships among entities
• Detection
  – Anomalies & irregularities (fraud detection)

4-8 Financial Applications
Technique: Application (Problem Type)
• Neural net: Forecast stock price (Prediction)
• NN, Rule: Forecast bankruptcy (Prediction); Fraud detection (Detection)
• NN, Case: Forecast interest rate (Prediction)
• NN, visual: Late loan detection (Detection)
• Rule: Credit assessment (Prediction); Risk classification (Classification)
• Rule, Case: Corporate bond rate (Prediction)

4-9 Telecom Applications
Technique: Application (Problem Type)
• Neural net, Rule induct: Forecast network behav. (Prediction)
• Rule induct: Churn (Classification); Fraud detection (Detection)
• Case based: Call tracking (Classification)

4-10 Marketing Applications
Technique: Application (Problem Type)
• Rule induct: Market segment (Classification); Cross-selling (Association)
• Rule induct, visual: Lifestyle analysis (Classification); Performance analy. (Association)
• Rule induct, genetic, visual: Reaction to promotion (Prediction)
• Case based: Online sales support (Classification)

4-11 Web Applications
Technique: Application (Problem Type)
• Rule induct, Visualization: User browsing similarity analy. (Classification, Association)
• Rule-based heuristics: Web page content similarity (Association)

4-12 Other Applications
Technique: Application (Problem Type)
• Neural net: Software cost (Detection)
• Neural net, rule induct: Litigation assessment (Prediction)
• Rule induct: Insurance fraud, Healthcare except. (Detection)
• Case based: Insurance claim (Prediction); Software quality (Classification)
• Genetic algor.: Budget spending (Classification)

4-13 Data Sets
• Loan Applications – classification
• Job Applications – classification
• Insurance Fraud – detection
• Expenditure Data – prediction

4-14 Loan Data
• 650 observations
• OUTCOMES (binary):
  – On-time (cost of error: $300)
  – Late (default) (cost of error: $2,000)
• Variables
  – Age, Income, Assets, Debts, Want, Credit
    • Credit ordinal
  – Transform: Assets, Debts, & Want → Risk
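
Because the two kinds of error carry different costs, models for this data set are better compared on total error cost than on raw error counts. The sketch below assumes that each cost applies to a misclassified case of the corresponding actual outcome; the error counts for "model A" and "model B" are hypothetical and only serve the arithmetic.

```python
COST_ERROR_ON_TIME = 300   # from the slide: error on an On-time case
COST_ERROR_LATE = 2000     # from the slide: error on a Late (default) case


def total_error_cost(on_time_errors, late_errors):
    """Total cost of misclassifications, weighting each error type by its cost."""
    return on_time_errors * COST_ERROR_ON_TIME + late_errors * COST_ERROR_LATE


# Hypothetical error counts for two candidate models on the same 650 loans.
model_a = {"on_time_errors": 40, "late_errors": 10}   # 50 errors in total
model_b = {"on_time_errors": 20, "late_errors": 20}   # 40 errors in total

cost_a = total_error_cost(**model_a)
cost_b = total_error_cost(**model_b)
print(cost_a, cost_b)
```

In this invented comparison, model B makes fewer errors overall but costs more, because more of its errors fall on the expensive Late cases.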

4-15 Job Application Data
• 500 observations
• OUTCOMES (ordinal):
  – Unacceptable
  – Minimal
  – Acceptable
  – Excellent
• Variables
  – Age, State, Degree, Major, Experience
    • State nominal; degree & major ordinal
    • State is superfluous

4-16 Insurance Claim Data
• 5,000 observations
• OUTCOMES (binary):
  – OK (cost of error: $500)
  – Fraudulent (cost of error: $2,500)
• Variables
  – Age, Gender, Claim, Tickets, Prior claims, Attorney
    • Gender & attorney nominal, tickets & prior claims categorical
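
Unequal error costs also shift the point at which a claim should be flagged. The sketch below is one way to reason about it, not a method stated in the slides: it assumes a model supplies a fraud probability for each claim, that wrongly flagging an OK claim costs $500, and that passing a fraudulent claim costs $2,500. The function names are hypothetical.

```python
def break_even_probability(cost_ok_error=500, cost_fraud_error=2500):
    """Fraud probability at which flagging and passing have equal expected cost."""
    return cost_ok_error / (cost_ok_error + cost_fraud_error)


def should_flag(p_fraud, cost_ok_error=500, cost_fraud_error=2500):
    """Flag the claim when flagging has the lower expected error cost."""
    expected_cost_if_flag = (1 - p_fraud) * cost_ok_error   # claim was actually OK
    expected_cost_if_pass = p_fraud * cost_fraud_error      # claim was actually fraud
    return expected_cost_if_flag < expected_cost_if_pass


threshold = break_even_probability()
print(threshold)          # 500 / 3000, i.e. about one in six
print(should_flag(0.20))  # above the break-even point
print(should_flag(0.10))  # below the break-even point
```

Under these assumptions a claim is worth flagging well before fraud becomes more likely than not, because a missed fraud is five times as costly as a false alarm.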

4-17 Expenditure Data
• 10,000 observations
• OUTCOMES:
  – Could predict response in a number of categories
  – Others
• Variables:
  – Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
  – Churn, proportion of income spent on seven categories