Overview of Data Mining Methods

  • Slides: 20
Data mining techniques: what the techniques do, examples, advantages & disadvantages

Contents
  • Reviews data mining tools
  • Compares data mining perspectives
  • Discusses data mining functions
  • Presents four sets of data used to demonstrate tools in subsequent chapters
  • Shows the Enterprise Miner structure for data mining analysis in the appendix

Data mining applications
  • Automobile insurance company: fraud detection
  • Business applications: loan evaluation, customer segmentation, employee evaluation…
  • Data mining tools are categorized by the tasks of classification, estimation, prediction, clustering, and summarization.
  • Classification, estimation, and prediction are predictive, while clustering and summarization are descriptive.

History
  • Statistics
  • AI:
      - genetic algorithms, neural networks
          - analogies with biology
      - memory-based reasoning
      - link analysis from graph theory
See Table 4.1.

Data mining perspectives
Methods can be viewed from different perspectives. Data mining methods include:
  • Cluster analysis (Chapter 5)
  • Regression of various forms (best fit methods, Chapter 6)
  • Discriminant analysis (use of regression for classification, Chapter 6)
  • Line fitting through the operations research tool of multiple objective linear programming (Chapter 9)
AI:
  • ANN (Chapter 7)
  • Rule induction (decision trees, Chapter 8)
  • Genetic algorithms (supplement)
See page 55 for more descriptions.

Techniques
Statistical
  • Market-basket analysis: find groups of items
  • Memory-based reasoning: case based
  • Cluster detection: undirected (quantitative)
Artificial intelligence
  • Link analysis: MCI's Friends & Family
  • Decision trees, rule induction: production rules
  • Neural networks: automatic pattern detection
  • Genetic algorithms: keep best parameters
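
Market-basket analysis is described above only as finding groups of items. A minimal sketch of one common way to do that, counting how often item pairs share a basket (support) and how often one item accompanies another (confidence), might look like this. The transactions and function names are invented for illustration and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "bread", "chips"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` appears in baskets holding `antecedent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for b in with_antecedent if consequent in b)
    return hits / len(with_antecedent)


support = pair_support(transactions)
beer_to_chips = confidence(transactions, "beer", "chips")
bread_to_milk = confidence(transactions, "bread", "milk")
```

Pairs with high support and high confidence are candidates for the "groups of items" the slide refers to; real tools also prune rare items before counting.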

Models
  • Regression: Y = a + bX
  • Classification: assign new record to class
  • Predictive: assign value to new record
  • Clustering: groups for data
  • Time-series: assign future value
  • Links: patterns in data
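
The regression model Y = a + bX can be fitted by ordinary least squares. The sketch below is one plain-Python way to estimate a and b. The data points are hypothetical and were chosen to lie exactly on Y = 1 + 2X so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b is the covariance of X and Y divided by the variance of X.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical observations lying exactly on Y = 1 + 2X.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(xs, ys)
predicted = a + b * 4.0  # value assigned to a new record with X = 4
```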

Fitting
  • Underfitting: not enough detail
      - leave out important variables
  • Overfitting: too much detail
      - memorizes training set, but doesn't help with new data
          - data set too small
          - redundancy in data
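
The overfitting point ("memorizes training set, but doesn't help with new data") can be illustrated with a small sketch. It compares a model that replays the nearest training value against a simple straight-line fit, on a deliberately tiny data set. All numbers and names here are hypothetical, and the "memorizer" is only a stand-in for an over-detailed model, not a method from the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


def memorizer(train_x, train_y):
    """Return a predictor that replays the y of the nearest training x."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


def mean_abs_error(predict, xs, ys):
    """Average absolute difference between predictions and actual values."""
    return sum(abs(predict(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: a small training set and two unseen records.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
new_x, new_y = [5, 6], [10.1, 11.9]

a, b = fit_line(train_x, train_y)


def line(x):
    return a + b * x


memo = memorizer(train_x, train_y)

memo_train = mean_abs_error(memo, train_x, train_y)  # perfect on seen data
memo_new = mean_abs_error(memo, new_x, new_y)        # poor on new data
line_train = mean_abs_error(line, train_x, train_y)  # small but nonzero
line_new = mean_abs_error(line, new_x, new_y)        # stays small
```

The memorizer has zero error on the training set yet a much larger error on the new records than the line, which is the pattern the slide warns about.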

Comparison of Features
Methods compared (column order): Rules, Neural Net, Case Base, Genetic
  • Noisy data: Good, Very good
  • Missing data: Good, Very good, Poor, Good
  • Different types: Good, Numerical, Very good, Transform
  • Accuracy: High, Very high, High
  • Explanation: Very good, Poor, Very good, Good
  • Integration: Good, Very good
  • Ease: Easy, Difficult
  • Large sets

Data Mining Functions
  • Classification: identify categories in data
  • Prediction: formula to predict future observations
  • Association: rules using relationships among entities
  • Detection: anomalies (unusual) & irregularities (fraud detection)
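
As an illustration of the detection function, the sketch below flags values that sit unusually far from the mean. The z-score rule, the threshold of 2.0, and the claim amounts are all assumptions made for this example; the slides do not prescribe a particular detection method.

```python
from statistics import mean, pstdev


def flag_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical claim amounts with one obviously unusual entry.
claims = [510, 480, 495, 520, 505, 490, 500, 2400]
suspicious = flag_anomalies(claims)
```

A flagged value is only a candidate irregularity; in a fraud setting it would normally be passed on for review rather than rejected outright.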

Financial Applications
Technique | Application | Problem type
Neural net | Forecast stock price | Prediction
NN, Rule | Forecast bankruptcy; Fraud detection | Prediction; Detection
NN, Case | Forecast interest rate | Prediction
NN, visual | Late loan detection | Detection
Rule | Credit assessment; Risk classification | Prediction; Classification
Rule, Case | Corporate bond rate | Prediction

Telecom Applications
Technique | Application | Problem type
Neural net, Rule induction | Forecast network behavior | Prediction
Rule induction | Churn; Fraud detection | Classification; Detection
Case based | Call tracking | Classification

Marketing Applications
Technique: Rule induction, visual; Rule induction, genetic, visual; Case based
Application: Market segment; Cross-selling; Lifestyle analysis; Performance analysis; Reaction to promotion; Online sales support
Problem type: Classification; Association; Prediction; Classification

Web Applications
Technique | Application | Problem type
Rule induction, Visualization | User browsing similarity analysis | Classification, Association
Rule-based heuristics | Web page content similarity | Association

Other Applications
Technique | Application | Problem type
Neural net | Software cost | Detection
Neural net, rule induction | Litigation assessment | Prediction
Rule induction | Insurance fraud; Healthcare except. | Detection
Case based | Insurance claim; Software quality |
Genetic algorithm | Budget spending |
Problem types for the last two rows: Prediction, Classification

Data Sets
  • Loan Applications: classification
  • Job Applications: classification
  • Insurance Fraud: detection
  • Expenditure Data: prediction

Loan Data
650 observations
OUTCOMES (binary):
  • On-time (cost of error: $300)
  • Late (default) (cost of error: $2,000)
Variables
  • Age, Income, Assets, Debts, Want, Credit
      - Credit ordinal
  • Transform: Assets, Debts, & Want → Risk
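
Because the two kinds of error carry different costs, two models with the same number of mistakes can differ widely in total cost. The sketch below assumes that $300 is the cost of misclassifying an applicant who would have paid on time and $2,000 is the cost of misclassifying one who pays late; that pairing is an assumption read from the order of the slide. The labels and predictions are hypothetical.

```python
# Assumed pairing of the two costs on the slide with the two kinds of error.
COST_REJECT_ONTIME = 300   # applicant would pay on time, predicted late
COST_ACCEPT_LATE = 2000    # applicant pays late (defaults), predicted on-time


def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified loan application."""
    cost = 0
    for truth, guess in zip(actual, predicted):
        if truth == "on-time" and guess == "late":
            cost += COST_REJECT_ONTIME
        elif truth == "late" and guess == "on-time":
            cost += COST_ACCEPT_LATE
    return cost


# Hypothetical outcomes and two hypothetical models, each with two errors.
actual = ["on-time", "on-time", "late", "late", "on-time"]
model_a = ["on-time", "late", "late", "on-time", "on-time"]
model_b = ["late", "late", "late", "late", "on-time"]

cost_a = total_error_cost(actual, model_a)  # one error of each kind
cost_b = total_error_cost(actual, model_b)  # two cheap errors
```

Both models get three of five records right, but the one that lets a late payer through costs far more, so accuracy alone is not enough to compare them.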

Job Application Data
500 observations
OUTCOMES (ordinal):
  • Unacceptable
  • Minimal
  • Acceptable
  • Excellent
Variables
  • Age, State, Degree, Major, Experience
      - State nominal; degree & major ordinal
      - State is superfluous
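
The nominal/ordinal distinction matters when variables are turned into numbers for a model. As a rough sketch, an ordinal variable can be mapped to its rank, while a nominal one gets one indicator per category so that no order is implied. The outcome levels below come from the slide; the state codes and function names are hypothetical.

```python
# Outcome levels in the order given on the slide (worst to best).
OUTCOME_ORDER = ["Unacceptable", "Minimal", "Acceptable", "Excellent"]

# Hypothetical list of states; the slides do not say which states occur.
STATES = ["CA", "NE", "TX"]


def encode_ordinal(value, order):
    """Map an ordinal value to its rank, preserving the ordering."""
    return order.index(value)


def encode_nominal(value, categories):
    """Map a nominal value to one indicator per category (one-hot)."""
    return [1 if value == category else 0 for category in categories]


rank = encode_ordinal("Acceptable", OUTCOME_ORDER)
state_flags = encode_nominal("NE", STATES)
```

Ranks let a model use "Excellent is above Minimal", whereas one-hot flags avoid suggesting that one state is greater than another.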

Insurance Claim Data
5000 observations
OUTCOMES (binary):
  • OK (cost of error: $500)
  • Fraudulent (cost of error: $2,500)
Variables
  • Age, Gender, Claim, Tickets, Prior claims, Attorney
      - Gender & attorney nominal; tickets & prior claims categorical
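
Unequal error costs also change where a decision threshold should sit. Suppose a model outputs a fraud probability p, wrongly flagging an OK claim costs $500, and missing a fraudulent claim costs $2,500 (this pairing of the slide's two figures is an assumption). Flagging then has the lower expected cost whenever (1 - p) × 500 < p × 2,500, that is, when p > 500 / 3,000 = 1/6. A minimal sketch with hypothetical names:

```python
# Assumed pairing of the slide's two costs with the two kinds of error.
COST_FLAG_OK_CLAIM = 500    # OK claim wrongly treated as fraudulent
COST_MISS_FRAUD = 2500      # fraudulent claim wrongly treated as OK


def fraud_threshold(cost_false_alarm, cost_missed_fraud):
    """Probability above which flagging a claim has the lower expected cost."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_fraud)


def decide(p_fraud, threshold):
    """Choose an action for one claim given its estimated fraud probability."""
    return "investigate" if p_fraud > threshold else "pay"


threshold = fraud_threshold(COST_FLAG_OK_CLAIM, COST_MISS_FRAUD)
likely_case = decide(0.20, threshold)
unlikely_case = decide(0.10, threshold)
```

With these costs a claim is worth investigating well before fraud becomes more likely than not, because a missed fraud is five times as expensive as a false alarm.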

Expenditure Data
10,000 observations
OUTCOMES:
  • Could predict response in a number of categories
  • Others
Variables:
  • Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
  • Churn, proportion of income spent on seven categories