Data Preprocessing
CSE 572: Data Mining by H. Liu

Data preprocessing
- A necessary step for serious, effective, real-world data mining
- It's often omitted in "academic" DM, but can't be over-stressed in practical DM
- The need for pre-processing in DM
    - Data reduction: too much data
    - Data cleaning: noise
    - Data integration and transformation
3/3/2021 CSE 572: Data Mining by H. Liu

Data reduction
- Data cube aggregation
- Feature selection and dimensionality reduction
- Sampling
    - random sampling and others
- Instance selection (search based)
- Data compression
    - PCA, wavelet transformation
- Data discretization
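As a small illustration of the sampling item above, here is a minimal sketch of simple random sampling without replacement. The function name, the fixed seed, and the toy data set of 1000 row ids are assumptions made for the example, not part of the course material.

```python
import random


def simple_random_sample(rows, k, seed=0):
    """Draw k distinct rows uniformly at random (sampling without replacement)."""
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    return rng.sample(rows, k)


rows = list(range(1000))                 # hypothetical data set: 1000 row ids
sample = simple_random_sample(rows, 50)  # keep 5% of the rows
print(len(sample), "rows kept out of", len(rows))
```

Other schemes mentioned only as "and others" on the slide (for example, sampling within groups) would need a different selection rule.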

Feature selection
- The basic problem
    - Finding a subset of the original features that lets us learn the domain as well as, or better than, the full feature set
    - What are the advantages of doing so?
- Curse of dimensionality
    - From 1-d, 2-d, to 3-d: an illustration
    - Another example: the wonders of reducing the number of features, since the number of instances available for learning depends on the number of features
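One way to picture the 1-d, 2-d, 3-d illustration: split every feature into the same number of intervals and ask how many instances fall into each cell on average. The figures used here (1000 instances, 10 intervals per feature) are illustrative assumptions, not values from the slides.

```python
def instances_per_cell(n_instances, bins_per_feature, n_features):
    """Average number of instances per cell when each feature is cut into equal bins."""
    n_cells = bins_per_feature ** n_features  # cells grow exponentially with dimension
    return n_instances / n_cells


for d in (1, 2, 3):
    print(f"{d}-d: {instances_per_cell(1000, 10, d)} instances per cell")
```

With a fixed pool of instances, each added feature thins the data out by another factor of 10 in this setup, which is why dropping features can help learning.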

- The illustration of the difficulty of the problem
    - Search space (an example with 4 features)
    - Overfitting: are the features selected really good?
        - How do we know?
- A standard procedure of feature selection
    - Search
        - SFS, SBS, Beam Search, Branch & Bound
        - Optimality of a selected set of features
    - Evaluation measures on goodness of selected features
        - Accuracy, distance, inconsistency, ...
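With 4 features there are already 2^4 = 16 candidate subsets, so exhaustive search stops scaling quickly. The sketch below pairs one search strategy from the slide, SFS (sequential forward selection), with one evaluation measure, an inconsistency rate. The exact definition of the rate, the function names, and the toy data are assumptions for illustration; the slides do not spell them out.

```python
from collections import Counter, defaultdict
from itertools import product


def inconsistency_rate(rows, labels, subset):
    """Share of instances outside the majority class of their value pattern."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[i] for i in subset)
        groups[pattern][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(rows)


def sequential_forward_selection(rows, labels, n_features, k):
    """Greedy SFS: repeatedly add the feature that lowers the measure the most."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in range(n_features) if f not in selected]
        # Ties go to the lowest feature index because min() keeps the first minimum.
        best = min(remaining,
                   key=lambda f: inconsistency_rate(rows, labels, selected + [f]))
        selected.append(best)
    return selected


# Toy data: all 16 combinations of 4 binary features.
# The class depends only on features 0 and 2; features 1 and 3 are irrelevant.
rows = [list(bits) for bits in product([0, 1], repeat=4)]
labels = [f"{r[0]}{r[2]}" for r in rows]

selected = sequential_forward_selection(rows, labels, n_features=4, k=2)
print(selected, inconsistency_rate(rows, labels, selected))
```

SFS evaluates far fewer subsets than exhaustive search, but being greedy it gives no guarantee that the selected set is optimal, which is the trade-off the slide points to.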

Feature extraction
- The basic problem
    - creating new features that are combinations of original features
- A common approach: PCA
    - Dimensionality reduction via transformation
        - D' = DA, where D is mean-centered (N × n) and A is (n × m), so D' is (N × m)
    - Its variants are used widely in text mining and web mining
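The transformation D' = DA can be sketched without any libraries for the simplest case m = 1, where A is a single column holding the first principal direction. Power iteration is used here as one convenient way to find that direction; it is an assumption of this sketch, not necessarily the method used in the course, and the five 2-d points are made up.

```python
import math


def mean_center(data):
    """Subtract each column's mean so every feature has mean 0."""
    n_rows = len(data)
    means = [sum(col) / n_rows for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix (n x n) of mean-centered data."""
    n_rows, n_cols = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n_rows - 1)
             for j in range(n_cols)] for i in range(n_cols)]


def top_component(cov, iterations=200):
    """Power iteration; assumes one dominant eigenvalue and a non-orthogonal start."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


data = [[2.0, 1.9], [0.0, 0.1], [-1.0, -1.1], [1.0, 1.0], [-2.0, -1.9]]  # N=5, n=2
D = mean_center(data)
a = top_component(covariance(D))  # the single column of A (n x 1)
D_prime = [[sum(x * w for x, w in zip(row, a))] for row in D]  # D' = DA, N x 1
print(a)
print(D_prime)
```

For m > 1 the remaining columns of A would be further eigenvectors of the covariance matrix, which in practice is left to a linear algebra library.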

Discretization
- Motivation from Decision Tree Induction
- The concept of discretization
    - Sort the values of a feature
    - Group continuous values together
    - Reassign values to each group
- The methods
    - Equal-width
    - Equal-frequency
    - Entropy-based
        - A possible problem: still too many intervals
        - So, when to stop?
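The three methods can be sketched on a toy feature as follows. The function names, the ages, and the class labels are hypothetical; the entropy-based part finds only a single best cut point, which is the basic step that such methods repeat.

```python
import math
from collections import Counter


def equal_width_bins(values, k):
    """Bin index per value, using k intervals of equal width (needs max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Bin index per value, with (roughly) the same number of values per bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_entropy_cut(values, labels):
    """Cut point (midpoint of neighbours) with the lowest weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]


ages = [18, 22, 25, 31, 38, 45, 52, 67]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]

print(equal_width_bins(ages, 3))      # [0, 0, 0, 0, 1, 1, 2, 2]
print(equal_frequency_bins(ages, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
print(best_entropy_cut(ages, buys))   # 28.0
```

Applying the entropy-based cut recursively to each side produces more intervals, and that is where the "when to stop?" question on the slide comes in: the recursion needs an explicit stopping rule.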

Data cleaning
- Missing values
    - ignore it
    - fill in manually
    - use a global value/mean/most frequent
- Noise
    - smoothing (binning)
    - outlier removal
- Inconsistency
    - domain knowledge, domain constraints
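Two of the items above, filling in missing values and smoothing by binning, can be sketched as follows. Missing entries are represented as None, and the function names and sample values are assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def fill_with_mean(values):
    """Replace None with the mean of the known numeric values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]


def fill_with_most_frequent(values):
    """Replace None with the most frequent known value."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort, cut into bins of bin_size values, replace each value by its bin mean.

    The result is returned in sorted order, not in the original order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed


print(fill_with_mean([1.0, None, 3.0]))               # [1.0, 2.0, 3.0]
print(fill_with_most_frequent(["a", None, "b", "a"]))  # ['a', 'a', 'b', 'a']
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```

Filling with a global statistic is cheap but flattens the distribution of the feature, so it is worth checking how many values were imputed before mining.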

Data integration
- Combines data from multiple sources into a coherent data store
- Schema integration
    - entity identification problem
- Redundancy
    - an attribute may be derived from another table
    - correlation analysis
- Data value conflicts
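Correlation analysis for spotting redundancy can be sketched with the Pearson coefficient: a value near +1 or -1 suggests that one attribute is (linearly) derivable from the other. The attribute names and numbers below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


monthly_salary = [2000, 3500, 5000, 4200]            # from one source
annual_salary = [12 * m for m in monthly_salary]     # derived attribute in another
age = [41, 25, 33, 58]                               # unrelated attribute

print(pearson(monthly_salary, annual_salary))  # very close to 1: redundant
print(pearson(monthly_salary, age))
```

A high correlation is only a hint; whether to drop one of the attributes is still a judgment call that depends on the mining task.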

Data transformation
- Data is transformed or consolidated into forms appropriate for mining
- Methods include
    - smoothing
    - aggregation
    - generalization
    - normalization (min-max)
    - feature construction
        - using neural networks
- Traditional transformation methods
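Min-max normalization maps a value v linearly from [min, max] to a new range: v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch, with made-up income values and a default target range of [0, 1] chosen for the example:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to [new_min, new_max]; requires max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [12000, 73600, 98000]  # hypothetical attribute values
scaled = min_max_normalize(incomes)
print(scaled)
```

Because the mapping is anchored on the observed minimum and maximum, a single outlier can compress all other values into a narrow band, so outlier handling usually comes first.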

Summary
- Data preprocessing cannot be overstressed in real-world applications
- It is an important, difficult, and low-profile task
- There are different types of approaches for different preprocessing problems
- It should be considered together with the mining algorithms to improve data mining effectiveness