DATA MINING TECHNIQUES
Introductory and Advanced Topics
Eamonn Keogh (some slides adapted from Margaret Dunham)
Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
http://iubio.indiana.edu/treeapp/treeprint-sample1.html
© Prentice Hall

Data Mining Outline
– Introduction
– Related Concepts
– Data Mining Techniques

Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining issues

Introduction
• Data is growing at a phenomenal rate (read “How Much Information Is There In the World?” by Michael Lesk)
• Users expect more sophisticated information
• How? UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining Definition
• Finding hidden information in a database
• Data mining has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data”.
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning
  – Discovery Science
  – Knowledge Discovery

Database Processing vs. Data Mining Processing
• Database processing
  – Query: well defined, SQL
  – Output: subset of database
• Data mining processing
  – Query: poorly defined, no precise query language
  – Output: not a subset of database

Query Examples
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.
• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)
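
The contrast between the two kinds of question can be sketched in a few lines of Python. This is only an illustration: the customer records, the Jaccard similarity measure, and the 0.5 threshold are assumptions made for the example, not something the slides specify.

```python
# Hypothetical toy data: customer id -> (last name, set of purchased items).
customers = {
    "c1": ("Smith", {"milk", "bread", "eggs"}),
    "c2": ("Jones", {"milk", "bread", "butter"}),
    "c3": ("Smith", {"wine", "cheese"}),
    "c4": ("Lee", {"milk", "bread", "eggs", "jam"}),
}

def query_by_last_name(data, last_name):
    """Database-style query: a precise predicate whose result is a subset of the data."""
    return sorted(cid for cid, (name, _) in data.items() if name == last_name)

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_pairs(data, threshold):
    """Mining-style question: which customers have similar buying habits?
    The answer (pairs with scores) is derived from the data, not a subset of it."""
    ids = sorted(data)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = jaccard(data[first][1], data[second][1])
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs

smiths = query_by_last_name(customers, "Smith")
pairs = similar_pairs(customers, 0.5)
print(smiths)
print(pairs)
```

The first function returns rows that already exist in the data; the second produces new information (similarity scores) and depends on a loosely defined notion of "similar", which is the point the slide makes.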

Data Mining Models and Tasks

Basic Data Mining Tasks I
• Classification maps data into predefined groups or classes.
  – Supervised learning
  – Pattern recognition
  – Prediction
• Regression is used to map a data item to a real-valued prediction variable.
  – Example: H = 1.31 (Fem + Fib) + 63.05
• Clustering groups similar data together into clusters.
  – Unsupervised learning
  – Segmentation
  – Partitioning
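
A minimal Python sketch of the three tasks, under stated assumptions: the regression coefficients are the ones printed on the slide (the slide does not define H, Fem, or Fib, or give their units), while the credit-risk rule and the gap-based grouping are hypothetical stand-ins chosen only to show the shape of each task.

```python
def predict_h(fem, fib):
    """Regression: map a data item to a real-valued prediction.
    Uses the coefficients printed on the slide."""
    return 1.31 * (fem + fib) + 63.05

def classify_risk(income, debt):
    """Classification: map an item to one of two predefined classes.
    The rule is a hypothetical stand-in for a model learned from labeled data."""
    return "poor risk" if debt > 0.5 * income else "good risk"

def cluster_1d(values, gap):
    """Clustering: group sorted values, starting a new cluster whenever the
    distance to the previous value exceeds `gap`. No class labels are used."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

print(predict_h(40, 35))
print(classify_risk(1000, 600))
print(cluster_1d([1, 2, 10, 11, 3], gap=2))
```

Classification needs the classes up front, regression returns a number, and clustering discovers its groups from the data alone.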

Basic Data Mining Tasks II
• Summarization maps data into subsets with associated simple descriptions.
  – Characterization
  – Generalization
• Link Analysis uncovers relationships among data.
  – Affinity Analysis
  – Association Rules
  – Sequential Analysis determines sequential patterns.
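
For association rules, the two standard measures are support and confidence. The sketch below computes both for a rule such as {milk} → {bread}; the transactions are made-up illustration data.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "jam"},
    {"milk", "bread", "jam"},
]

def support(itemset, txns):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in txns if itemset <= t) / len(txns)

def confidence(antecedent, consequent, txns):
    """Support of the whole rule divided by support of the antecedent."""
    return support(antecedent | consequent, txns) / support(antecedent, txns)

rule_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
print(rule_support, rule_confidence)
```

Here three of the five baskets contain both items and four contain milk, so the rule holds in three out of four milk baskets.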

KDD Process
Modified from [FPSS 96 C]
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.
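
The five steps can be read as a chain of functions. The Python sketch below is one hypothetical arrangement: the temperature readings, the unit conversion, and the use of a mean as the "mining" step are all assumptions chosen to keep the example small.

```python
from statistics import mean

# Hypothetical sources: temperature readings in different units, some missing.
source_a = [("celsius", 20.0), ("celsius", None), ("celsius", 22.0)]
source_b = [("fahrenheit", 68.0), ("fahrenheit", 75.2)]

def select(*sources):
    """Selection: obtain data from various sources."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Preprocessing: cleanse data (here, drop missing values)."""
    return [(unit, value) for unit, value in records if value is not None]

def transform(records):
    """Transformation: convert to a common format (here, degrees Celsius)."""
    return [v if unit == "celsius" else (v - 32.0) * 5.0 / 9.0 for unit, v in records]

def mine(values):
    """Data mining: a stand-in step that summarizes the values."""
    return {"count": len(values), "mean": mean(values)}

def interpret(result):
    """Interpretation/Evaluation: present the result in readable form."""
    return f"{result['count']} readings, mean {result['mean']:.1f} C"

report = interpret(mine(transform(preprocess(select(source_a, source_b)))))
print(report)
```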

KDD Process Ex: Shuttle Data
• Selection:
  – Select data (which missions etc.) to use
• Preprocessing:
  – Remove spikes
• Transformation:
  – DFT, DWT, PAA etc.
• Data Mining:
  – Look for rules…
• Interpretation/Evaluation:
  – Show rules to domain experts
• Potential User Applications:
  – Prediction of failures
[Two time-series plots; x-axis 0 to 1000, y-axis 0 to 1]
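
The preprocessing and transformation steps can be sketched on a short series. The slide does not say how spikes are removed, so the sliding median below is an assumed choice; PAA (Piecewise Aggregate Approximation) is shown in its simplest form, which requires the series length to be divisible by the number of segments. The input values are made up.

```python
from statistics import median

def remove_spikes(series, window=3):
    """Preprocessing: replace each point with the median of its neighborhood.
    A sliding median is one common choice; the slide does not name a method."""
    half = window // 2
    cleaned = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        cleaned.append(median(series[lo:hi]))
    return cleaned

def paa(series, segments):
    """Transformation: Piecewise Aggregate Approximation.
    Splits the series into `segments` equal-length frames and keeps each mean."""
    if len(series) % segments != 0:
        raise ValueError("series length must be divisible by segments")
    width = len(series) // segments
    return [sum(series[i:i + width]) / width for i in range(0, len(series), width)]

raw = [0.1, 0.1, 0.9, 0.1, 0.3, 0.3, 0.3, 0.3]  # hypothetical signal with one spike
cleaned = remove_spikes(raw)
reduced = paa(cleaned, 4)
print(cleaned)
print(reduced)
```

The median filter suppresses the isolated 0.9 value, and PAA then reduces eight points to four frame means, which is the kind of compact representation later mining steps would search for rules.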

Data Mining Development
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms

KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality

KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data (streams)
• Integration
• Application

Social Implications of DM
• Privacy
• Profiling
• Unauthorized use

Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time Complexity
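
Two of these metrics have simple closed forms. The sketch below uses the usual definitions (accuracy as the fraction of correct predictions, ROI as net gain over cost); the slide itself gives no formulas, and the labels and amounts are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def roi(gain, cost):
    """Return on investment as (gain - cost) / cost."""
    return (gain - cost) / cost

predicted = ["good", "poor", "good", "good"]
actual = ["good", "poor", "poor", "good"]
print(accuracy(predicted, actual))
print(roi(gain=150.0, cost=100.0))
```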