Data Mining and Knowledge Discovery in Databases Outline

Data Mining and Knowledge Discovery in Databases

Outline • • • What is Data Mining and KDD? Characteristics Applications Methods Packages & Close Relatives

What is Data Mining & KDD? • “The process of identifying hidden patterns and relationships within data” or • “Data mining helps end users extract useful business information from large databases”

What’s the Appeal? • • • Hidden nuggets of valuable information buried deep within a mountain of otherwise unremarkable data Pervasive data Seek competitive advantage

The Challenge 51020188905212001539458199000014198812294488219960816210000001010000000 110000311111000001003130200000002020010000000000000043 4388884242433301220202220000101001000000044100001100000000000000000000000000019981027510201896060120 021269409680000001590199809033798119980917310010001000000011000032000 000100000001239900000020022220031310031200000000424388888424 34242332121212222000000101100000024410000010020000000000100000000000000000000019981230510201897020320001862692920000 0047091998021356971199802273100000100100000110110000002000010000021 01100000001000000100011000000011100338888222233113233433300000011101001020001000000001000000000000000000019981221510201899093020052008986730000019410199901127 59811999012631001010000000001111122010100000111230010010000001021 00022000000000000011133438888434242424342423300000011110000010110010 00024410000010020000000100101000000000000000000000019990525510201899122720093540515830000014484199705271797119970610310 000001011001000003111201200001001001012000111100100001101001200000 010000010101324388888224242433100000001002100001110010011230100000010 000020001000001100000000100000000010000000001998 1117510201899122720093540515830000014484199705271797219980616310000001011001000 0000110100311111121000002022100122202200202212222010000000001010011 003243434321324221424242330021000011110110000011223100110000001000000110000000010000000000000000001998122351020190001

Process: Knowledge Discovery In Databases database cleaning & integration data warehouse modify data selection collect and transform data mining modify methods, parameters data mining engines, models discovered patterns user interface and expert knowledge domain evaluation & presentation

Context • Where you stand on Data Mining depends on where you sit: • Business User • Researcher • Computer Scientist

Data Mining Might Mean… • • • Statistics Visualization Artificial intelligence Machine learning Database technology Neural networks Pattern recognition Knowledge-based systems Knowledge acquisition Information retrieval High performance computing And so on. . .

What’s needed? • • • Suitable data Computing power Data mining software Skilled operator who knows both the nature of the data and the software tools Reason, theory, or hunch

Typical Applications of Data Mining & KDD • Marketing • • • Market Basket Analysis Customer Relationship Management New Product Development

Typical Applications of Data Mining & KDD • Financial Services • • • Credit Approval Fraud Detection Marketing

Typical Applications of Data Mining & KDD • Health Care • • • Epidemiological Analysis - incidence and prevalence of disease in large populations and detection of the source and cause of epidemics of infectious disease Knowledge for funding Policy, programs

Two Basic Approaches • Supervised • • A dependent or target variable Unsupervised • • • “Pure Data Mining” Fewer assumptions Typically used for clustering techniques

Automation • • The ability to aim a tool at some data and push a button Some methods of KDD/Data mining are more suitable for automation than others

Seven Basic Methods: 1. 2. 3. 4. 5. 6. 7. Decision Trees (Artificial) Neural Networks Cluster/Nearest Neighbour Genetic Algorithms/Evolutionary Computing Bayesian Networks Statistics Hybrids

Decision Trees • • Graphical representations of relationships with data Excel at Classification & Prediction Models

Sample of a Decision Tree gender male female age good health? <65 married? >=65 yes no urban? yes no - + yes pet owner? + yes no - + pet owner? no yes no - - +

Decision Trees • Strengths • • Easily understood and interpreted Represent complexity in a compact form Handle non-linear data well Relatively well suited to automation. • Weaknesses • • Large trees with large numbers of variables become difficult to understand Missing data must be appropriately managed in construction and use of the models

Neural Networks • • Derived from Artificial Intelligence Research Modelled on the Human Neuron

Neural Networks Prediction Weights 0. 4 0. 8 Hidden Layer Weights Input Variables 0. 6 Age 0. 1 0. 3 0. 7 0. 5 Gender 0. 2 Income

Neural Networks • Strengths • • Accuracy of prediction Robust performance with a wide variety of data types • Weaknesses • • Prone to overfitting Poor clarity of model

Clustering/Nearest Neighbour • • • Aim to assign “like” records to a group Groups assigned according to some target variable or criteria Nearest neighbour used for prediction

Clustering/Nearest Neighbour • Applications: • • • Text processing: search engines Image processing: radiology/image processing Fraud detection: outliers

Clustering/ Nearest Neighbour • Strengths • • Easily understood and interpreted Easily implemented in basic situations • Weaknesses • complex data not well suited to automation (much preprocessing required)

Genetic Algorithms/ Evolutionary Computing • • Grounded in Darwin – applied using mathematics Require • • • a way to represent a solution to a problem a way to test the “fitness” of the solution Solutions are mathematically “mutated” Fittest solutions survive Convergence

Genetic Algorithms/ Evolutionary Computing • Strengths • • • Suited to novel problems that are poorly understood Suitable where data is dirty or missing May be useful where other methods cannot be applied • Weaknesses • • Not easily automated Require creativity in their application

Bayesian Networks • Based on Bayes’ rule: • • P(a|b) = P(b|a) * P(a) / P(b) Can construct networks of linked events, each with prior probabilities

Bayesian Network Example Bobby publicly threatened Suicide Bobby shot him J. R. Treated for Depressio n J. R. Shot Mistress shot him Big fight between wife, mistress Wife shot him Just a dream sequence Producer s desperate for ratings

Bayesian Networks • Strengths • • • Clarity of the resulting models Good precision in predicting Easily adapt to new probabilities • Weaknesses • • Time consuming to construct and maintain Poor at predicting rare events

Statistics • With an outcome or dependent variable: • • Correlations ANOVA Regression Used by themselves or to confirm findings of another method

Statistics • Strengths • “Gold Standard” – valid and trusted in scientific circles • Weaknesses • Limits findings to those techniques that are applied and their associated limitations (normality, linearity, and so on)

Hybrids • • Techniques used in combination Example: use of a genetic algorithm to identify target variables for inclusion in a neural network model

Recap • • Data Mining is the core activity or method within a process of Knowledge Discovery in Databases Done in order to find useful information in large amounts of data not possible using “conventional” approaches Variety of methods Knowledge of data domain, methods, as well as creativity

Data Mining Packages • • Major vendors of database/data management products (IBM, SPSS, Oracle People. Soft, SAS, and so on) Added as a component of turnkey packages May incorporate several methods (SAS Enterprise Miner) Single method (Tree. Age Software Inc. : a dedicated decision tree product)

How to implement? • • Do it yourself (you know the data domain) Put a team together (domain and method specialists) Hire a consultant (who knows both your domain and the tools) Vertical markets in data mining

Close Relatives of Data Mining • • On-Line Analytical Processing (OLAP) Pivot tables in spreadsheets General statistical packages Intelligent Data Analysis – comprises the use of data mining methods in the analysis of “small” datasets