
1. Data Mining (or KDD)

Let us find something interesting!

Definition := “Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Why Mine Data? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)
  – remote sensors on a satellite
  – telescopes scanning the skies
  – microarrays generating gene expression data
  – scientific simulations generating terabytes of data
  – GIS
• Traditional techniques infeasible for raw data
• Data mining may help scientists
  – in classifying and segmenting data
  – in hypothesis formation

2.1 Supervised Clustering

[Figure: three scatter plots over Attribute 1 and Attribute 2 showing class 1, class 2, and unclassified objects, with clusters labeled A to L. Panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]

Applications of Supervised Clustering include:
a. Learning Subclasses
b. Region Discovery in Spatial Datasets
c. Distance Function Learning
d. Data Set Compression (reduce size of dataset by using cluster representatives)
e. Adaptive Supervised Clustering

Ch. Eick: Data Mining

Example: Finding Subclasses

[Figure: scatter plot over Attribute 1 and Attribute 2 with Ford and GMC vehicles drawn with different symbols. Labeled clusters: Ford Trucks, GMC Trucks, GMC Van, Ford Vans, Ford SUV, GMC SUV]

SC Algorithms Investigated

1. Representative-based Clustering Algorithms
   1. Supervised Partitioning Around Medoids (SPAM)
   2. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
   3. Supervised Clustering using Evolutionary Computing (SCEC)
2. Agglomerative Hierarchical Supervised Clustering (AHSC)
3. Grid-Based Supervised Clustering (GRIDSC)
   1. Naïve approach
   2. Hierarchical Grid-based Clustering relying on data cubes
   3. Grid-based Clustering relying on density estimation techniques
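
The representative-based idea behind SRIDHCR can be illustrated with a small sketch. The slides name the algorithm but do not define its fitness function or implementation details, so everything below is an assumption made for illustration: the function names (`assign`, `fitness`, `sridhcr_sketch`), the impurity-plus-penalty form of the fitness, the `beta` parameter, and the toy data are all hypothetical. The search starts from a random set of representatives, tries every single insertion or deletion, moves to the best neighbour while it improves the fitness, and repeats this from several random starts.

```python
import math
import random

def assign(points, reps):
    """Return, for every point, the index of its closest representative."""
    ordered = sorted(reps)  # fixed order so distance ties are broken deterministically
    return [min(ordered, key=lambda r: math.dist(p, points[r])) for p in points]

def fitness(points, labels, reps, beta=0.1):
    """ASSUMED fitness, lower is better: class impurity plus a penalty on cluster count."""
    owner = assign(points, reps)
    minority = 0
    for r in reps:
        members = [labels[i] for i, o in enumerate(owner) if o == r]
        if members:
            # examples that do not belong to the majority class of their cluster
            minority += len(members) - max(members.count(c) for c in set(members))
    n, c, k = len(points), len(set(labels)), len(reps)
    penalty = math.sqrt((k - c) / n) if k > c else 0.0
    return minority / n + beta * penalty

def sridhcr_sketch(points, labels, restarts=5, beta=0.1, seed=0):
    """Steepest-descent hill climbing over representative sets with random restarts."""
    rng = random.Random(seed)
    n = len(points)
    best_reps, best_q = None, float("inf")
    for _ in range(restarts):
        reps = frozenset(rng.sample(range(n), min(n, len(set(labels)))))
        q = fitness(points, labels, reps, beta)
        while True:
            # neighbours: insert one non-representative or delete one representative
            neighbours = [reps | {i} for i in range(n) if i not in reps]
            if len(reps) > 1:
                neighbours += [reps - {r} for r in sorted(reps)]
            if not neighbours:
                break
            scored = [(fitness(points, labels, s, beta), sorted(s)) for s in neighbours]
            cand_q, cand = min(scored)
            if cand_q >= q:  # no neighbour improves: local optimum reached
                break
            reps, q = frozenset(cand), cand_q
        if q < best_q:
            best_reps, best_q = sorted(reps), q
    return best_reps, best_q

# Toy data: class "a" forms two separate groups on either side of class "b".
points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 0), (10, 1)]
labels = ["a", "a", "b", "b", "a", "a"]
reps, q = sridhcr_sketch(points, labels)
print(reps, q)
```

Because the search only accepts strict improvements, the returned set is a local optimum with respect to single insertions and deletions; the random restarts are what give it a chance to escape poor starting points.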

2.2 Spatial Data Mining (SPDM)

SPDM := the process of discovering interesting, useful, non-trivial patterns from (large) spatial datasets.

• Spatial patterns
  – Spatial outliers, discontinuities
    • bad traffic sensors on highways
  – Location prediction models
    • model to identify habitat of endangered species
  – Spatial clusters
    • crime hot-spots, poverty clusters
  – Co-location patterns
    • identify arsenic risk zones in Texas and determine if there is a correlation between the arsenic concentrations of the major Texas aquifers and cultural factors such as population, farm density, and the geology of the aquifers, etc.
• Idea: Reuse the supervised clustering algorithms that already exist by running them with a different fitness function that corresponds to a particular measure of interestingness.

Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets

2.3 Distance Function Learning

Example: How to Find Similar Patients?

Task: Construct a distance function that measures patient similarity

Motivation: Finding a “good” distance function is important for:
– Case-based reasoning
– Clustering
– Instance-based classification (e.g. nearest neighbor classifiers)

Our Approach: Learn distance functions based on training examples and user feedback

Motivating Example: How To Find Similar Patients?

The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age, …)

• Attribute Domains
  – ssn: 9 digits
  – weight: between 30 and 650; μ_weight=158, σ_weight=24.20
  – height: between 0.30 and 2.20 in meters; μ_height=1.52, σ_height=19.2
  – cancer-sev: 4=serious, 3=quite_serious, 2=medium, 1=minor
  – eye-color: {brown, blue, green, grey}
  – age: between 3 and 100; μ_age=45, σ_age=13.2

Task: Define Patient Similarity
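
One possible starting point for the patient similarity task is a weighted sum of per-attribute distances, each scaled to [0, 1]. The sketch below is an assumption, not the approach taken in the slides: it uses the attribute ranges listed above for scaling, treats cancer-sev as an ordinal value, treats eye-color as a nominal value (0 if equal, 1 otherwise), and ignores ssn because it is only an identifier. The names `Patient`, `RANGES`, and `patient_distance`, and the default of equal weights, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    weight: float
    height: float      # in meters
    cancer_sev: int    # 1=minor ... 4=serious
    eye_color: str
    age: float

# Attribute ranges taken from the domains listed in the relation.
RANGES = {
    "weight": (30, 650),
    "height": (0.30, 2.20),
    "cancer_sev": (1, 4),
    "age": (3, 100),
}

def patient_distance(a, b, weights=None):
    """Weighted average of range-normalized attribute distances (assumed form)."""
    if weights is None:
        weights = {name: 1.0 for name in list(RANGES) + ["eye_color"]}
    total = 0.0
    for name, (lo, hi) in RANGES.items():
        # numeric and ordinal attributes: absolute difference scaled by the range
        total += weights[name] * abs(getattr(a, name) - getattr(b, name)) / (hi - lo)
    # nominal attribute: simple mismatch indicator
    total += weights["eye_color"] * (0.0 if a.eye_color == b.eye_color else 1.0)
    return total / sum(weights.values())

p1 = Patient(weight=158, height=1.52, cancer_sev=2, eye_color="brown", age=45)
p2 = Patient(weight=180, height=1.80, cancer_sev=4, eye_color="blue", age=60)
print(patient_distance(p1, p2))
```

The weights are the interesting part: choosing them by hand is hard, which is the motivation for learning them from training examples and user feedback.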

Idea: Coevolving Clusters and Distance Functions

[Figure: loop connecting Distance Function Q, Clustering X, Clustering Evaluation q(X), Goodness of the Distance Function Q, and Weight Updating Scheme / Search Strategy. Below it, scatters of o and x examples under a “bad” distance function Q1 and a “good” distance function Q2]

Distance Function Learning Framework

Weight-Updating Scheme / Search Strategy: Inside/Outside Weight Updating, Randomized Hill Climbing, Adaptive Clustering, …
Distance Function Evaluation: K-Means, Supervised Clustering, NN-Classifier, …
Diagram labels: Other Research, Current Research, Work By Karypis; citations: [BECV 05], [ERBV 04], [CHEN 05]
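
One combination from this framework, randomized hill climbing as the search strategy with a nearest-neighbor classifier as the evaluation, could look roughly like the sketch below. It is an illustration under stated assumptions rather than the method used in the cited work: the weighted Manhattan distance, the leave-one-out 1-NN accuracy as the goodness measure, and the names `weighted_distance`, `loo_nn_accuracy`, `learn_weights`, `steps`, and `step_size` are all hypothetical.

```python
import random

def weighted_distance(x, y, w):
    """Weighted Manhattan distance between two numeric vectors."""
    return sum(wi * abs(xi - yi) for wi, xi, yi in zip(w, x, y))

def loo_nn_accuracy(X, y, w):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier under weights w."""
    correct = 0
    for i in range(len(X)):
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: weighted_distance(X[i], X[k], w))
        correct += (y[nearest] == y[i])
    return correct / len(X)

def learn_weights(X, y, steps=200, step_size=0.2, seed=0):
    """Randomized hill climbing: perturb the weights, keep changes that do not hurt."""
    rng = random.Random(seed)
    w = [1.0] * len(X[0])
    best = loo_nn_accuracy(X, y, w)
    for _ in range(steps):
        cand = [max(0.0, wi + rng.uniform(-step_size, step_size)) for wi in w]
        if sum(cand) == 0:
            continue  # skip the degenerate all-zero distance function
        score = loo_nn_accuracy(X, y, cand)
        if score >= best:  # accepting ties lets the search cross plateaus
            w, best = cand, score
    return w, best

# Toy data: attribute 0 separates the classes, attribute 1 is misleading.
X = [(0.0, 5.0), (0.1, 0.0), (1.0, 4.9), (1.1, 0.1)]
y = [0, 0, 1, 1]
weights, accuracy = learn_weights(X, y)
print(weights, accuracy)
```

Swapping `loo_nn_accuracy` for a clustering-based evaluation would give a different instance of the same framework while leaving the search loop unchanged.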

2.4 Adaptive Data Mining

[Diagram components: Inputs, Supervised Clustering Algorithm, Summary, Clustering, Changes, Adaptation System, Evaluation System, Feedback, Domain Expert, Past Experience, Quality q(X), …, Fitness Functions (Predefined)]

2.5 Signatures of Data Sets

Input: a set of classified examples

Output: Signatures in the dataset that characterize
1. how the examples of a class distribute (in relationship to the examples of the other classes) in the dataset
2. how many regions dominated by a single class exist in the dataset
3. which regions dominated by one class border regions dominated by another class
4. where the regions identified in steps 2 and 3 are located
5. what the density attractors (maxima of the density function) of the classes in the dataset are

Why are we creating those signatures?
– As a preprocessing step to develop smarter classifiers
– To understand why a particular data mining technique works well or does not work well for a particular dataset (meta learning)

Methods employed: density estimation techniques, supervised clustering, proximity graphs (e.g. Delaunay, Gabriel graphs), …
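
As an illustration of how a proximity graph might contribute to such a signature, the sketch below builds a Gabriel graph by brute force and keeps the edges whose endpoints belong to different classes; such edges hint at where regions dominated by different classes meet. Two points are Gabriel neighbors when no third point lies strictly inside the ball whose diameter is the segment between them. Using cross-class edges as a border indicator is an assumption made for this example, and the names `sq_dist`, `gabriel_edges`, and `border_edges` are hypothetical.

```python
from itertools import combinations

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def gabriel_edges(points):
    """Brute-force Gabriel graph, O(n^3): returns index pairs (i, j) with i < j."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # r lies strictly inside the ball with diameter ij
        # exactly when |ir|^2 + |jr|^2 < |ij|^2
        if all(sq_dist(points[i], points[r]) + sq_dist(points[j], points[r]) >= d_ij
               for r in range(len(points)) if r not in (i, j)):
            edges.append((i, j))
    return edges

def border_edges(points, labels):
    """Gabriel edges that connect examples of different classes (assumed indicator)."""
    return [(i, j) for i, j in gabriel_edges(points) if labels[i] != labels[j]]

points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)]
labels = ["a", "b", "a"]
print(gabriel_edges(points))
print(border_edges(points, labels))
```

The cubic running time is acceptable only for small datasets; for larger ones the Gabriel graph is usually derived from a Delaunay triangulation instead.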

Example: Signatures of Data Sets

[Figure: three scatter plots over Attribute 1 and Attribute 2 showing class 1, class 2, and unclassified objects, with clusters labeled A to L. Panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]

Applications of Creating Signatures: Class Decomposition (see also [VAE 03])

[Figure: scatter plot over Attribute 1 and Attribute 2]

2.6 Research Christoph F. Eick 2005-2007

[Diagram of research topics: Clustering for Classification, Creating Signatures For Datasets, Spatial Data Mining, Measures of Interestingness, File Prediction, Supervised Clustering, Editing / Data Set Compression, Adaptive Clustering, Distance Function Learning, Mining Data Streams, Online Data Mining, Sensor Data Mining, Semi-Structured Data, Web Annotation, Evolutionary Computing]

3. UH Data Mining and Machine Learning Group (UH-DMML)

Co-Directors: Christoph F. Eick and Ricardo Vilalta

Goal: Development of data analysis and data mining techniques and the application of these techniques to challenging problems in physics, geology, astronomy, environmental sciences, and medicine.

Topics investigated:
✓ Meta Learning
✓ Classification and Learning from Examples
✓ Clustering
✓ Distance Function Learning
✓ Using Reinforcement Learning for Data Mining
✓ Spatial Data Mining
✓ Knowledge Discovery