Everything you ever wanted to know about data mining but were afraid to ask.
David Duling, Data Mining R&D Director, SAS Institute
Copyright © 2006, SAS Institute Inc. All rights reserved.
Abstract
§ Data mining is the process of systematically sifting through often large databases to identify patterns and trends relevant to solving business problems, such as increasing sales and efficiency. Successful examples of data mining can be found in many areas of business and science, including customer relationship management, market basket analysis, human resources, bioinformatics and medicine, fraud detection, and web search. The rapid growth in data mining is largely due to the increased availability of large databases, advances in large-scale computing, and the development of data mining algorithms. This presentation begins with a brief history of data mining, covers current trends with case studies, and finishes with a look into the future of data mining from the SAS perspective.
Company confidential - for internal use only
Q: Where did data mining start?
§ 1841
  • Lewis Tappan formed the Mercantile Agency to provide creditworthiness reports (i.e., credit scores) to New York merchants.
  • He employed a number of ‘correspondents’ in western frontier towns to monitor the behavior of local traders, in addition to pooling merchant records. A huge database was accumulated.
  • Enormous success!
§ 1849
  • John Bradstreet starts a credit reporting company in Cincinnati, OH.
§ 1859
  • The Mercantile Agency was sold to Robert Graham Dun.
§ 1933
  • The Dun company merged with the Bradstreet company.
Q: How about something with computers?
§ Date: September 1963
  Authors: James Myers and Edward Forgy
  Title: The Development of Numerical Credit Evaluation Systems
  Publication: Journal of the American Statistical Association
§ Abstract
  Several discriminant and multiple regression analyses were performed on retail credit application data to develop a numerical scoring system for predicting credit risk in a finance company. Results showed that equal weights for all significantly predictive items were as effective as weights from the more sophisticated techniques of discriminant analysis and "stepwise multiple regression." However, a variation of the basic discriminant analysis produced a better separation of groups at the lower score levels, where more potential losses could be eliminated with a minimum cost of potentially good accounts.
Data Mining, circa 1963
IBM 7090
600 applicants
“Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”
Q: That sounds like statistics, so what’s the difference?
Statistics
§ Experimental
§ Prior Hypothesis
  • Idea before data acquisition
  • Data acquisition planned
§ Experimental Design
  • Sampling strategies
  • Factorial designs
  • Required confidence
  • Minimize model terms
§ Inference
  • Hypothesis testing
  • Prediction
Data mining
§ Commercial
§ Posterior Hypothesis
  • Idea after data acquisition
  • Data acquisition opportunistic
§ No Experimental Design
  • Explore data
  • Create hypothesis
  • Generate query
  • Create models
§ Prediction
  • Lift, Profit, Response
  • Inference
It’s all about the Data
              Experimental          Opportunistic
Purpose       Research              Operational
Value         Scientific            Commercial
Generation    Actively controlled   Passively observed
Size          Small                 Massive
Hygiene       Clean                 Dirty
State         Static                Dynamic
Where does mining data come from?
Data warehouses store detail data on transactions and states.
Very simple demo example
Data: majority of time spent on mining
Source: Intelligent Enterprise Magazine, http://www.intelligententerprise.com/030405/606feat2_1.jhtml
Each data type has its own mix of data prep and mining
• Demographics, Personal Information
• Market baskets, Item sets
• Market baskets with time order
• Web paths: unique sequences
• Time stamped transactions
• Text normalization
Use integrated data for mining
[Figure labels: Market baskets Web paths Demographic, Seasonal indices Text dimensions Financial Interactions ID columns]
Example
§ Quest: Maximize response to this year’s summer promotion
§ How: Find those customers most likely to respond
§ Use response to last year’s summer promotion as an indicator of response to this year’s promotion. This is the dependent variable.
§ Use all customer data available before last summer. These are the independent variables.
  • Demographics
  • Sales item history
  • Sales amount history
  • Web site history
  • Call center records
  • …
Q: Why don’t we just select last year’s customer lists?
Data is non-stationary (remember this point)
§ Move to new locations
§ Change jobs
§ Income goes up or down
§ Debt increases or decreases
§ Marital status
§ Parental status
§ …
The model is a function of the attributes, not the individuals.
Q: What functions are popular?
§ Associations          unsupervised
§ Clustering            unsupervised
§ PCA/SVD               unsupervised
§ Logistic Regression   supervised
§ Decision Tree         supervised
§ Neural Network        supervised
§ Ensembles             supervised
§ … and many other forms, variants, and names
How do I find patterns?
§ Try ASSOCIATIONS and SEQUENCES
§ Searches for frequent patterns
§ (Car Wreck -> Dr. X) (Diagnosis Code xxx -> MRI)
§ Confidence:
§ If (A) happens then (B) happens 80% of the time
§ C = count(A and B) / count(A)
§ Support:
§ (A) -> (B) happens in 10% of all itemsets
§ S = count(A and B) / N, where N is the total number of itemsets
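The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS implementation: the function name `support_and_confidence` and the toy itemsets are assumptions made up for the example.

```python
def support_and_confidence(itemsets, a, b):
    """Return (support, confidence) for the rule A -> B.

    support    = count(A and B) / N
    confidence = count(A and B) / count(A)
    """
    a, b = frozenset(a), frozenset(b)
    n = len(itemsets)
    # Count itemsets containing A, and itemsets containing both A and B.
    n_a = sum(1 for s in itemsets if a <= frozenset(s))
    n_ab = sum(1 for s in itemsets if (a | b) <= frozenset(s))
    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical itemsets: A occurs in 5 of 10, A together with B in 4 of 10.
itemsets = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"A", "B", "D"}, {"A", "C"},
    {"C"}, {"D"}, {"B"}, {"C", "D"}, {"B", "C"},
]
support, confidence = support_and_confidence(itemsets, {"A"}, {"B"})
```

With these toy itemsets the rule A -> B holds in 4 of the 5 itemsets that contain A, so confidence is 4/5 and support is 4/10.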
Rule Sets show the next most likely action
Associations and Sequences / Visualization
What is the most popular data mining function?
- Logistic Regression still rules!
- Linear combination of terms (z)
- Relatively easy to compute
- Converges to a solution
- Explainable
[Figure axes: Prob, Input, Input]
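As a minimal sketch of the "linear combination of terms (z)" idea: the standard logistic function maps z to a probability between 0 and 1. The function name, weights, and inputs below are hypothetical and chosen only for illustration.

```python
import math


def logistic_probability(inputs, weights, intercept):
    """Probability from a logistic regression: p = 1 / (1 + exp(-z))."""
    # z is the linear combination of terms.
    z = intercept + sum(w * x for w, x in zip(weights, inputs))
    # Numerically stable evaluation for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical fitted weights for two inputs.
weights = [1.5, -2.0]
intercept = 0.0
p_neutral = logistic_probability([0.0, 0.0], weights, intercept)
p_high = logistic_probability([2.0, 0.0], weights, intercept)
```

Because each weight multiplies one input, the sign and size of a weight show directly how that input moves the probability, which is what makes the model explainable.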
Q: What is CART and why do I need it?
§ A Decision Tree!
§ Classification and Regression
§ Strategy Development
§ Hunt 1966: Concept Learning System
§ Kass 1980: Chi-squared Automatic Interaction Detection
§ Breiman 1984: Classification and Regression Trees
§ Quinlan 1993: C4.5 rule sets
§ Numerous others…
  • Algorithms for efficiently building trees
  • Hypothesis tests for finding split points
    − Various measurement scales
Building a Decision Tree
Keep doing that until there are no more beneficial splits...
Recursive Partitioning
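Recursive partitioning can be sketched as follows. Note the assumptions: this toy version scores splits by reduction in Gini impurity (the criterion used in CART) rather than the hypothesis tests mentioned earlier, handles numeric inputs only, and uses made-up data and function names.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) for the best binary split, or None."""
    parent, n, best = gini(labels), len(rows), None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or parent - child > best[0]:
                best = (parent - child, j, threshold)
    return best


def build_tree(rows, labels, min_gain=1e-9):
    """Split recursively until no split is beneficial."""
    split = best_split(rows, labels)
    if split is None or split[0] <= min_gain:
        return {"predict": Counter(labels).most_common(1)[0][0]}
    _, j, threshold = split
    left = [i for i, r in enumerate(rows) if r[j] <= threshold]
    right = [i for i, r in enumerate(rows) if r[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], min_gain),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], min_gain),
    }


def predict(tree, row):
    """Follow the splits down to a leaf."""
    while "predict" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["predict"]


# Hypothetical (age, income) rows with a 0/1 response.
rows = [(25, 30), (30, 70), (45, 40), (50, 60), (55, 80), (35, 90), (60, 20), (48, 75)]
labels = [0, 0, 0, 1, 1, 0, 0, 1]
tree = build_tree(rows, labels)
```

The `min_gain` threshold is the stopping rule: a node becomes a leaf when no candidate split reduces impurity by more than that amount.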
Benefits of Trees
§ Interpretability
  • Tree structured presentation
§ Mixed Measurement Scales
  • Nominal, ordinal, interval
  • Regression trees
§ Robustness
§ Missing Values
…Benefits
§ Automatically
  • Detects interactions (AID)
  • Accommodates nonlinearity
  • Selects input variables
[Figure: Multivariate Step Function; axes: Prob, Input]
Drawbacks of Trees
§ Roughness
§ Linear, Main Effects
§ Instability
Q: Why do they call it a ‘Neural’ network?
[Figure labels: Neuron, Hidden Unit]
Feed Forward Neural Network
[Figure labels: Input Layer, Hidden Layers, Output Layer]
How does it work?
C = combination(Weights * Inputs)
A = Activation(C)
f(W, I) = A[C] + b -> output
y ~ f(W, (s, t))
s = f(S, (p, q, r))
t = f(T, (p, q, r))
p = f(P, X)
q = f(Q, X)
r = f(R, X)
y ~ f(W, (f(S, (f(P, X), f(Q, X), f(R, X))), f(T, (f(P, X), f(Q, X), f(R, X)))))
Err = E(Y, y) ~ (Y - y)^2
[Network diagram node labels: a … j, p, q, r, s, t, y]
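The nested function above can be written out as a small forward pass. This is a sketch under stated assumptions: tanh is assumed for the hidden activations and an identity activation for the output, the bias is added after the activation as the slide writes it (many references add it inside instead), and all weights are made up.

```python
import math


def identity(c):
    return c


def unit(weights, bias, inputs, activation=math.tanh):
    """f(W, I) = A[C] + b, with C the weighted combination of the inputs."""
    c = sum(w * i for w, i in zip(weights, inputs))  # C = combination(Weights * Inputs)
    return activation(c) + bias                      # A = Activation(C), then add b


def forward(x, params):
    """y ~ f(W, (s, t)), with s, t built on (p, q, r) and p, q, r built on X."""
    p = unit(*params["P"], x)
    q = unit(*params["Q"], x)
    r = unit(*params["R"], x)
    s = unit(*params["S"], (p, q, r))
    t = unit(*params["T"], (p, q, r))
    return unit(*params["W"], (s, t), activation=identity)


def squared_error(target, prediction):
    """Err = (Y - y)^2."""
    return (target - prediction) ** 2


# Hypothetical weights: each entry is (weights, bias).
params = {
    "P": ([0.5, -0.2], 0.1),
    "Q": ([-0.3, 0.8], 0.0),
    "R": ([0.1, 0.1], -0.1),
    "S": ([0.7, -0.5, 0.2], 0.0),
    "T": ([-0.4, 0.9, 0.3], 0.1),
    "W": ([1.2, -0.6], 0.05),
}
x = [1.0, 2.0]
y = forward(x, params)
err = squared_error(1.0, y)
```

Training then amounts to adjusting the numbers in `params` so that `err` shrinks across the training cases.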
Input Layer Activation Function
Training
§ Iterative Optimization Algorithm
§ Error Function
[Figure axes: Parameter 1, Parameter 2]
Training history for our example
Error measure goes down with every iteration.
Weights evolve at every iteration.
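A training history of this kind can be reproduced with a minimal sketch. It assumes plain gradient descent on mean squared error for a single linear unit with two parameters; the data, learning rate, and function name are hypothetical, and the optimizer used in the presentation's example is not stated.

```python
def train_linear_unit(xs, ys, learning_rate=0.05, iterations=200):
    """Fit y ~ w * x + b by gradient descent, recording the error at each iteration."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(iterations):
        residuals = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Error function: mean squared error at the current weights.
        history.append(sum(r * r for r in residuals) / n)
        # Gradient of the error with respect to each parameter.
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        # Step downhill; the weights evolve at every iteration.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, history = train_linear_unit(xs, ys)
```

The error falls at every iteration here because the step size is small relative to the curvature of this error surface; with too large a learning rate the error can rise instead.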
Neural Pros and Cons
§ Very Flexible functions
§ Implicit transformation and interactions
§ Good algorithms for controlling complexity
§ No inference
§ Complex function
§ Many possible networks – large search space
Q: How do I know the model will work on new data?
§ Make sure that you don’t have a perfect model!
  • Real data has multiple forms of the dependent variable effect
§ Limit exposure to data that changes over time
  • Examine distributions of data at several time points
  • Select stable data
  • Use standardizations
  • Use category=other
§ Backtest
  • Use a hold out sample from a later time period
§ Monitor Performance
  • Compare actual and expected results
  • Compare input term distributions
§ Don’t fit the noise.
Q: How do I model signal instead of noise?
§ Limit model complexity by using Validation Data.
§ Decision Tree: Pruning
§ Neural Network: Early Stopping
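Early stopping can be sketched as picking the iteration where the validation error was lowest and giving up once it has stopped improving. The function name, the `patience` parameter, and the error values are assumptions for illustration only.

```python
def early_stopping_iteration(validation_errors, patience=3):
    """Return (best_iteration, best_error), stopping after `patience` iterations without improvement."""
    best_iter, best_err = 0, float("inf")
    for i, err in enumerate(validation_errors):
        if err < best_err:
            # Validation error improved: remember this iteration's weights.
            best_iter, best_err = i, err
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop training.
            break
    return best_iter, best_err


# Hypothetical validation errors: improving at first, then rising as the model overfits.
validation_errors = [0.50, 0.40, 0.33, 0.30, 0.31, 0.33, 0.36, 0.40]
best_iter, best_err = early_stopping_iteration(validation_errors)
```

Tree pruning follows the same logic with subtrees in place of iterations: keep the subtree with the lowest validation error.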
Complexity -> Overfitting
[Figure panels: Training Set, Test Set]
Better Fitting … with a simpler model
[Figure panels: Training Set, Test Set]
How do I select the best model?
ROC for overall model performance: Decision Tree
Lift for targeted model performance: Neural Network
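Lift for a targeted selection can be sketched as the response rate among the top-scored cases divided by the overall response rate. The function name, scores, and responses below are hypothetical.

```python
def lift_at(scores, responses, fraction=0.1):
    """Response rate in the top `fraction` of cases by score, relative to the overall rate."""
    n = len(scores)
    base_rate = sum(responses) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no responders")
    k = max(1, int(round(n * fraction)))
    # Rank cases from highest to lowest model score.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(r for _, r in ranked[:k]) / k
    return top_rate / base_rate


# Hypothetical scores for ten customers and whether each one responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
responses = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift_top_20 = lift_at(scores, responses, fraction=0.2)
```

A lift above 1 at a given depth means that contacting only the top-scored cases finds responders at a higher rate than contacting cases at random.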
Computers keep getting so much faster, why does my neural network take so long to run?
The enemy: growth in data warehouses
In the aggregate, the 2001 survey pool reported 632 TB of storage. Just two years later, those surveyed were using almost 2 petabytes (2,000 TB) of storage. Based on the number of survey respondents, the average large database, whether used for decision support or transaction processing, increased its storage requirements three and one-half times in just two years.
Source: DM-REVIEW
Single disk size growth
Clock speed vs. disk sizes
Even worse, it’s all about complexity of data
1 M rows x 100 columns x 8 bytes = 800 MB
1000 rows x 1000 columns x 8 bytes = 8 MB
Which data is more complex?
Q: So how does this make me money?
§ Models get deployed to operational systems
  • New data is acquired
  • Each case is scored with the model function
  • Action taken on each case:
    − Send promotion or don’t send promotion
    − Select item for cross sell offer
    − Grant credit or don’t grant credit
    − Alert engineers that a manufacturing defect has been found.
§ Model-driven decisions are nearly always better than intuition
§ …iff… the data miner has accounted for enough sources of variation.
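The score-then-act loop can be sketched as follows. Everything here is an assumption for illustration: the expected-profit rule, the revenue and cost figures, and the stand-in `toy_model` are hypothetical, and any deployed model function could be passed in its place.

```python
def act_on_cases(cases, model, revenue_per_response, cost_per_contact):
    """Score each new case with the model function and choose an action."""
    actions = {}
    for case_id, inputs in cases.items():
        probability = model(inputs)  # each case is scored with the model function
        expected_profit = probability * revenue_per_response - cost_per_contact
        if expected_profit > 0:
            actions[case_id] = "send promotion"
        else:
            actions[case_id] = "do not send promotion"
    return actions


def toy_model(inputs):
    """Stand-in for a deployed model: returns a precomputed response probability."""
    return inputs["response_probability"]


# Hypothetical newly acquired cases.
cases = {
    "c1": {"response_probability": 0.30},
    "c2": {"response_probability": 0.01},
}
actions = act_on_cases(cases, toy_model, revenue_per_response=50.0, cost_per_contact=1.0)
```

Separating the model function from the decision rule lets the same scores drive different actions, such as a cross sell offer or a credit decision, by changing only the rule.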
Offline Applications
• ETL for model development and scoring
• Scores generated on a nightly basis
• ID and Score data pre-loaded into data store
• Score tables pushed to external applications
[Diagram labels: Scheduled Scoring ETL process, ETL engine, Model Development, Data Mining, Scoring Engine, BI Application, Campaign Planning, Operations, Campaign Execution, Information Technology, Data Store, Scores]
Online Applications
• Scores generated on a nightly basis
• ID and Score data pre-loaded into data store
• Individual score requests contain one or more IDs
• Decision server translates score to action
Example: Customer call center
[Diagram labels: Scheduled Scoring ETL process, ETL engine, Model Development, Scoring Engine, BI Application, Decision Server, Front Office Application, Data Store, Scores]
On-Demand Applications
• Model input data pre-loaded into data store
• New data provided by application
• Score engine pulls data by ID from data store, joins with new data
• Scores generated immediately
• Decision Server translates score to action
Examples: Fraud detection, Money laundering, Medical diagnostic
[Diagram labels: Scheduled Scoring ETL process, ETL engine, Model Development, Front Office Application, Decision Server, Automation Application, Scoring Engine]
Q: So what are the cool applications right now?
§ GOOGLE, YAHOO, ASK, etc…
  • Huge model training task: index and summarize the web
  • Techniques: text data processing; page rank
  • Real time scoring task: process your query
§ NETFLIX
  • $1 M challenge: beat their statisticians
  • Huge sparse matrix: fill in the blanks
  • Techniques:
    − SVD by numerical approximation, aka: Hebbian-learning Neural Net
    − Ensembles