Data Preprocessing
CSE 591: Data Mining by H. Liu
Copyright, 1996 © Dale Carnegie & Associates, Inc.

Data preprocessing
- A necessary step for serious, effective, real-world data mining
- It's often omitted in "academic" DM, but can't be over-stressed in practical DM
- The need for pre-processing in DM
  - Data reduction: too much data
  - Data cleaning: noise
  - Data integration and transformation
1/1/2022 CSE 591: Data Mining by H. Liu

Data reduction
- Data cube aggregation
- Feature selection (dimensionality reduction)
- Sampling
  - random sampling and others
- Instance selection (search based)
- Data compression
  - PCA, Wavelet transformation
- Data discretization
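
As one illustration of the sampling bullet, simple random sampling without replacement can be sketched with Python's standard library. The function name `simple_random_sample` and the `seed` parameter are assumptions made for this example, not something the slides specify.

```python
import random


def simple_random_sample(rows, n, seed=None):
    """Return n rows drawn uniformly at random, without replacement.

    `seed` is a hypothetical convenience parameter so a run can be repeated.
    """
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.sample(rows, n)


# Toy data set: 1000 record ids reduced to a 1% sample.
records = list(range(1000))
sample = simple_random_sample(records, 10, seed=42)
print(len(sample))
```

Every record has the same chance of being kept, so the sample tends to preserve the overall distribution; rare classes may still be missed, which is one reason the slide mentions "others" such as stratified schemes.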

Feature selection
- The basic problem
  - Finding a subset of original features
- The illustration of the difficulty of the problem
- A standard procedure of feature selection
  - Search
  - Evaluation measures on goodness of selected features
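
The search-plus-evaluation procedure can be sketched as greedy forward selection. With n features there are 2^n candidate subsets, so exhaustive search quickly becomes infeasible; the greedy search below is one common compromise, not necessarily the one used in the course. The evaluation function `toy_score` and its weights are made up purely for illustration.

```python
def forward_selection(features, evaluate, max_features=None):
    """Greedy search: add the single best feature until the score stops improving.

    `evaluate` maps a list of feature names to a goodness score (higher is better).
    """
    selected = []
    best_score = evaluate(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no extension improves the subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical evaluation measure: usefulness minus a small per-feature penalty.
USEFULNESS = {"age": 0.5, "income": 0.3}


def toy_score(subset):
    return sum(USEFULNESS.get(f, 0.0) for f in subset) - 0.01 * len(subset)


chosen, score = forward_selection(["zip", "income", "age"], toy_score)
print(chosen)
```

In practice `evaluate` would be a filter measure (for example information gain) or the accuracy of a learner trained on the candidate subset.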

Feature extraction
- The basic problem
  - creating new features that are combinations of original features
- A common approach: PCA
  - Its variants are used widely in text mining and web mining
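
To make the idea of a new feature as a combination of original features concrete, here is a minimal sketch that finds only the first principal component by power iteration on the covariance matrix. It uses plain Python for self-containment; a real project would more likely call a linear algebra library. The function names are assumptions for this example, and the fixed all-ones start vector assumes it is not orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (means, direction) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # assumed start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # data has no variance along the current vector
        v = [x / norm for x in w]
    return means, v


def project(rows, means, v):
    """New feature: each row's coordinate along direction v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]


points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
means, direction = first_principal_component(points)
new_feature = project(points, means, direction)
print(direction)
```

Here the two original attributes are perfectly correlated, so a single extracted feature carries all of the variance.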

Discretization
- The concept
- The methods
  - Equal-width
  - Equal-frequency
  - Entropy-based
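
The first two methods can be sketched in a few lines; entropy-based discretization needs class labels and is left out here. The function names and the sample values are assumptions for illustration. Both functions return a bin index (0 to k-1) for each input value.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]  # all values identical
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bins so each holds roughly the same number of values.

    Ties are broken by position, so equal values can end up in different bins.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(data, 3))      # [0, 0, 1, 1, 1, 1, 2, 2, 2]
print(equal_frequency_bins(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Equal-width is sensitive to outliers, since one extreme value stretches every interval; equal-frequency avoids that but can separate values that are numerically close.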

Data cleaning
- Missing values
  - ignore it
  - fill in manually
  - use a global value/mean/most frequent
- Noise
  - smoothing (binning)
  - outlier removal
- Inconsistency
  - domain knowledge, domain constraints
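
Two of the bullets above, filling missing values with the mean or the most frequent value and smoothing by binning, can be sketched as follows. Missing values are represented as `None` here, which is an assumption of this example, as are the function names and the `strategy` parameter.

```python
from statistics import mean, mode


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or most frequent observed value."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "most_frequent":
        fill = mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, replace each by its bin mean.

    The result is returned in sorted order, not in the original order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed


print(fill_missing([1, None, 3]))
print(fill_missing(["a", None, "b", "a"], strategy="most_frequent"))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```

The mean only makes sense for numeric attributes, so the most frequent value is the usual fallback for nominal ones.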

Data integration
- Combines data from multiple sources into a coherent data store
- Schema integration
  - entity identification problem
- Redundancy
  - an attribute may be derived from another table
  - correlation analysis
- Data value conflicts
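
Correlation analysis for redundancy detection can be sketched with the Pearson coefficient: attribute pairs whose absolute correlation is close to 1 are candidates for removal. The 0.95 threshold, the function names, and the toy table are assumptions for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (sx * sy)


def redundant_pairs(table, threshold=0.95):
    """Return (a, b, r) for attribute pairs with |r| >= threshold."""
    names = list(table)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Toy integrated table: the same price stored in two units by two sources.
table = {
    "price_usd": [10, 20, 30, 40],
    "price_cents": [1000, 2000, 3000, 4000],
    "quantity": [3, 1, 4, 1],
}
print(redundant_pairs(table))
```

A high correlation only flags a linear relationship; whether one attribute is truly derived from the other still needs domain knowledge.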

Data transformation
- Data is transformed or consolidated into forms appropriate for mining
- Methods include
  - smoothing
  - aggregation
  - generalization
  - normalization (min-max)
  - feature construction
    - using neural networks
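
Min-max normalization maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch follows; the function name, the default target range [0, 1], and the handling of a constant attribute are assumptions for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min maps to new_min and max maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant attribute maps to new_min everywhere.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))
print(min_max_normalize(incomes, new_min=-1.0, new_max=1.0))
```

The transformation preserves the relative spacing of the original values, but a later value outside the original [min, max] falls outside the target range.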
- Slides: 9