Data Mining: Confluence of Multiple Disciplines
• Database Systems
• Machine Learning
• Algorithm
• Statistics
• Visualization
• Other Disciplines

Data Mining Outline
• Introduction
• Classification
• Clustering
• Association Rules


Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How? UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining Definition
• Finding hidden information in a database
• Fit data to a model: descriptive or predictive
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning

But it isn’t Magic
• You must know what you are looking for
• You must know how to look for it
Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Description, Behavior, Classification, Clustering, Associations, Link Analysis

Query Examples
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.
• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)

KDD Process © Prentice Hall
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.
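
The five KDD steps can be read as a chain of functions, each feeding the next. The sketch below is only an illustration under stated assumptions: the two record lists, the field names, and the choice of a simple item count as the "data mining" step are all hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [{"customer": "Ann", "item": "Milk "}, {"customer": "Bob", "item": None}]
store_b = [{"customer": "cy", "item": "bread"}, {"customer": "Ann", "item": "MILK"}]

def select(*sources):
    """Selection: gather records from the various sources."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Preprocessing: cleanse by dropping records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: convert every field to one common format."""
    return [{k: v.strip().lower() for k, v in r.items()} for r in records]

def mine(records):
    """Data mining (stand-in): count how often each item was bought."""
    return Counter(r["item"] for r in records)

def interpret(counts):
    """Interpretation/evaluation: present results in a readable form."""
    return [f"{item}: bought {n} time(s)" for item, n in counts.most_common()]

report = interpret(mine(transform(preprocess(select(store_a, store_b)))))
print("\n".join(report))
```

In a real project the mining step would be one of the techniques covered later (classification, clustering, association rules); the counting function is just a placeholder that keeps the pipeline runnable.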

Data Mining Outline
• Introduction
• Classification
  – Assign data to a predefined class
  – Decision Trees
  – Neural Networks
  – Distance Based
• Clustering
• Association Rules

The classification problem can now be expressed as: given a training database, predict the class label of a previously unseen instance.

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               -
3           0.9              4.7               -
4           1.1              3.1               -
5           5.4              8.5               -
6           2.9              1.9               -
7           6.1              6.6               -
8           0.5              1.0               -
9           8.3              6.6               -
10          8.1              4.7               -
11          5.1              7.0               ???????

Instance 11 is the previously unseen instance. Class labels for instances 2 to 10 are incomplete in the source; the surviving labels are Katydid, Grasshopper, Katydid.

Classification Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)
Example rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction
Inputs to the classifier: Testing Data, Unseen Data
Unseen instance: (Jeff, Professor, 4) → Tenured?
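
As a minimal sketch, the rule from the model construction slide can be written as a function and applied to the unseen tuple (Jeff, Professor, 4). The function name `tenured` and the "no" result when the rule does not fire are assumptions; the slide only states the "yes" case.

```python
def tenured(rank: str, years: int) -> str:
    """Apply the slide's rule: rank is professor OR more than 6 years."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    # Assumption: anything the rule does not cover is labeled "no".
    return "no"

# The unseen instance from the slide.
name, rank, years = ("Jeff", "Professor", 4)
print(name, "tenured?", tenured(rank, years))
```

Jeff has only 4 years, but the rule uses OR, so the professor rank alone is enough for the classifier to answer "yes".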

Training Dataset
This follows an example from Quinlan’s ID3.

Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
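
A decision tree is just a nested sequence of tests, so it can be sketched as plain conditionals. The version below assumes this reading of the slide's tree: for age <=30 the answer follows student? (no gives no, yes gives yes), for 30..40 the answer is yes, and for >40 the answer follows credit rating? (excellent gives no, fair gives yes). The function signature and the error raised for an unknown credit rating are illustrative choices, not part of the slides.

```python
def buys_computer(age: int, student: bool, credit_rating: str) -> str:
    """Walk the tree from the root (age?) down to a yes/no leaf."""
    if age <= 30:
        # Branch "<=30": decided by the student? test.
        return "yes" if student else "no"
    if age <= 40:
        # Branch "30..40": always a "yes" leaf.
        return "yes"
    # Branch ">40": decided by the credit rating? test.
    if credit_rating == "excellent":
        return "no"
    if credit_rating == "fair":
        return "yes"
    raise ValueError(f"unknown credit rating: {credit_rating!r}")

print(buys_computer(25, student=True, credit_rating="fair"))
print(buys_computer(35, student=False, credit_rating="excellent"))
print(buys_computer(50, student=False, credit_rating="excellent"))
```

Each call follows exactly one root-to-leaf path, which is why a tree classifies an instance with only a handful of comparisons.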

Neural Network Example
(Figure labels: Tuple Input, Output)

Data Mining Outline
• Introduction
• Classification
• Clustering
  – Place data into groups
  – Hierarchical
  – K-Means
  – Partitional
• Association Rules

Clustering Examples
• Segment a customer database based on similar buying patterns.
• Group houses in a town into neighborhoods based on similar features.
• Identify new plant species.
• Identify similar Web usage patterns.

Clustering vs. Classification
• No prior knowledge
  – Number of clusters
  – Meaning of clusters
• Unsupervised learning
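
To make "unsupervised" concrete, here is a small sketch of K-Means (named in the outline) on one-dimensional data. No class labels are supplied; the groups come only from the values themselves. The data, the function name `k_means`, the starting centers, and the choice of k = 2 are assumptions made for illustration.

```python
def k_means(points, centers, max_iterations=100):
    """Basic k-means on 1-D data; returns the final centers and clusters."""
    centers = list(centers)
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return centers, clusters

# Hypothetical values, e.g. purchases per customer.
values = [2, 3, 4, 10, 11, 12, 20, 25, 30]
centers, clusters = k_means(values, centers=[2, 4])
print(centers)
print(clusters)
```

Note that the number of clusters still had to be chosen up front here, and nothing in the output says what each group means; both points match the "no prior knowledge" bullets above.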

Data Mining Outline
• Introduction
• Classification
• Clustering
• Association Rules
  – Find relationships between data
  – Apriori

Association Rules Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
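
Support is the fraction of transactions that contain every item in an itemset. The transaction table for this example did not survive in the slide text, so the five transactions below are an assumption, chosen so that the support of {Bread, PeanutButter} comes out to the 60% stated above. The function name `support` is also illustrative.

```python
from fractions import Fraction

# Assumed transactions over I (not taken from the slides).
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

s = support({"Bread", "PeanutButter"}, transactions)
print(f"{float(s):.0%}")
```

Three of the five assumed transactions contain both Bread and PeanutButter, which gives 3/5. Algorithms such as Apriori compute this same quantity for many candidate itemsets and keep only those above a chosen threshold.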

Association Rules Ex (cont’d)

AR & Market Baskets
• Determine items often purchased together (market basket data)
• Determine optimal placement of data on store floor
• Determine items for sales and/or specials
• Increase sales of items
• www.amazon.com

Summary
• Data Mining is a fast-growing area with many applications.
• Data Mining algorithms are usually computationally expensive.
• Data Mining tools may be difficult to use effectively.