Data Preprocessing
CSE 572: Data Mining by H. Liu
Data preprocessing
- A necessary step for serious, effective, real-world data mining
- Often omitted in "academic" DM, but its importance can't be over-stressed in practical DM
- The need for preprocessing in DM:
  - Data reduction - too much data
  - Data cleaning - noisy data
  - Data integration and transformation
Data reduction
- Data cube aggregation
- Feature selection and dimensionality reduction
- Sampling
  - random sampling and others (see the sketch below)
- Instance selection (search based)
- Data compression
  - PCA, wavelet transformation
- Data discretization
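A minimal sketch of simple random sampling without replacement; the dataset and sample size here are hypothetical, not from the slide:

    import numpy as np

    rng = np.random.default_rng(seed=0)   # fixed seed so the sample is reproducible
    data = np.arange(10_000)              # hypothetical dataset of 10,000 records
    # draw a simple random sample of 100 records without replacement
    sample = rng.choice(data, size=100, replace=False)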
Feature selection
- The basic problem:
  - Finding a subset of the original features with which the domain can be learned as well as, or better than, with the full feature set
  - What are the advantages of doing so?
- Curse of dimensionality:
  - From 1-d, 2-d, to 3-d: an illustration
  - Another example - the wonder of reducing the number of features, since the number of instances needed for learning grows with the number of features
Illustration of the difficulty of the problem
- Search space (an example with 4 features): with n features there are 2^n candidate subsets, so even 4 features already yield 16 subsets to compare
- [Figure: lattice of all feature subsets, from the empty set to the full set of 4 features]
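A tiny sketch of how the subset search space grows; the feature names are hypothetical:

    from itertools import combinations

    features = ["F1", "F2", "F3", "F4"]
    # enumerate every subset of every size, including the empty set
    subsets = [c for r in range(len(features) + 1)
               for c in combinations(features, r)]
    print(len(subsets))  # 2**4 = 16 candidate subsets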
Reduce the chance of data overfitting
- Examples
- Are the selected features really good? If they are, they may help mitigate overfitting
- How do we know?
A standard procedure of feature selection
- Search
  - SFS, SBS, Beam Search, Branch & Bound (an SFS sketch follows this slide)
  - Optimality of a selected set of features
- Evaluation measures for the goodness of selected features
  - Accuracy, distance, inconsistency, etc.
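A minimal sketch of sequential forward selection (SFS). The evaluate(subset) scoring function (e.g., cross-validated accuracy) and the fixed target size k are assumptions for illustration, not the lecture's specification:

    def sfs(features, evaluate, k):
        """Greedily grow a feature subset to size k, adding the best feature each round."""
        selected = []
        remaining = list(features)
        while remaining and len(selected) < k:
            # pick the feature whose addition scores best under the evaluation measure
            best = max(remaining, key=lambda f: evaluate(selected + [f]))
            selected.append(best)
            remaining.remove(best)
        return selected

SBS works the same way in reverse, starting from the full set and greedily removing the least useful feature.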
Quality (goodness) metrics
Some example metrics:
- Dependency: depending on classes
- Distance: separating classes
- Information: entropy - how?
- Consistency: 1 - #inconsistencies/N
  - Two instances are inconsistent if they agree on all selected features but have different class labels
  - Example: (F1, F2, F3) and (F1, F3) - both sets have a 2/6 inconsistency rate
  - [Table: six instances over features F1, F2, F3 with class label C; only fragments survived extraction]
- Accuracy (classifier based): 1 - errorRate
Which algorithm is better - comparisons:
- Time complexity, number of features, removing redundancy
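A small sketch of the inconsistency-rate computation. The six-instance dataset below is hypothetical, constructed so that both (F1, F2, F3) and (F1, F3) give the 2/6 rate the slide mentions; it is not the slide's original table:

    from collections import Counter

    # hypothetical instances: (F1, F2, F3, class)
    data = [(0, 0, 1, 1), (0, 0, 1, 0),
            (0, 1, 0, 1), (0, 1, 0, 1),
            (1, 0, 0, 0), (1, 0, 0, 1)]

    def inconsistency_rate(data, feature_idx):
        groups = {}  # projected feature values -> class counts
        for row in data:
            key = tuple(row[i] for i in feature_idx)
            groups.setdefault(key, Counter())[row[-1]] += 1
        # within each matching group, instances beyond the majority class are inconsistent
        inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
        return inconsistent / len(data)

    print(inconsistency_rate(data, (0, 1, 2)))  # (F1, F2, F3): 2/6 ~ 0.333
    print(inconsistency_rate(data, (0, 2)))     # (F1, F3):     2/6 ~ 0.333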
Normalization
- Decimal scaling:
  - v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
  - For the range between -991 and 99, k is 3 (divide by 1000): -991 -> -0.991
- Min-max normalization into the new max/min range:
  - v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  - v = 73600 in [12000, 98000] -> v' = 0.716 in [0, 1] (the new range)
- Zero-mean normalization:
  - v' = (v - mean_A) / std_dev_A
  - (1, 2, 3): mean and std_dev are 2 and 1 -> (-1, 0, 1)
  - If mean_income = 54000 and std_dev_income = 16000, then v = 73600 -> v' = 1.225
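A minimal sketch of the three normalizations, checked against the slide's own numbers (the function names are mine):

    def decimal_scaling(values):
        # find the smallest k such that all scaled magnitudes fall below 1
        k = 0
        while max(abs(v) for v in values) / 10 ** k >= 1:
            k += 1
        return [v / 10 ** k for v in values]

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    def z_score(v, mean_a, std_a):
        return (v - mean_a) / std_a

    print(decimal_scaling([-991, 99]))    # [-0.991, 0.099], i.e., k = 3
    print(min_max(73600, 12000, 98000))   # ~0.716
    print(z_score(73600, 54000, 16000))   # 1.225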
Discretization
- Motivation from decision tree induction
- The concept of discretization:
  - Sort the values of a feature
  - Group continuous values together
  - Reassign values to each group
- The methods:
  - Equal-width
  - Equal-frequency
  - Entropy-based (a one-split sketch follows below)
    - A possible problem: still too many intervals
    - So, when to stop?
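A sketch of one entropy-based step: choosing the single cut point that minimizes the weighted class entropy of the two resulting intervals. The data and the single-split scope are illustrative assumptions; a full method would split recursively and stop with, e.g., an MDL-style criterion:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_cut(values, labels):
        """Return the cut point minimizing the weighted entropy of the two sides."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = (float("inf"), None)
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # cannot cut between equal values
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = min(best, (score, cut))
        return best[1]

    # hypothetical values and class labels: the pure split lands at 6.5
    print(best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))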
Binning
- Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 18, 24, 26, 28
- Equi-width binning - for a bin width of, e.g., 10:
  - Bin 1: 0, 4            [-, 10) bin
  - Bin 2: 12, 16, 18      [10, 20) bin
  - Bin 3: 24, 26, 28      [20, +) bin
  - We use - to denote negative infinity, + for positive infinity
- Equi-frequency binning - for a bin density of, e.g., 3:
  - Bin 1: 0, 4, 12        [-, 14) bin
  - Bin 2: 16, 18          [14, 21) bin
  - Bin 3: 24, 26, 28      [21, +) bin
- Any problems with the above methods?
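A minimal sketch of both schemes on the slide's values. Note that the naive equi-frequency slicing below puts 24 in the middle bin, whereas the slide's grouping adjusts the interval boundaries (to 14 and 21) instead:

    values = [0, 4, 12, 16, 18, 24, 26, 28]

    def equi_width(values, width):
        bins = {}
        for v in sorted(values):
            bins.setdefault(v // width, []).append(v)  # bin index = floor(v / width)
        return list(bins.values())

    def equi_frequency(values, density):
        s = sorted(values)
        return [s[i:i + density] for i in range(0, len(s), density)]

    print(equi_width(values, 10))     # [[0, 4], [12, 16, 18], [24, 26, 28]]
    print(equi_frequency(values, 3))  # [[0, 4, 12], [16, 18, 24], [26, 28]]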
Data cleaning
- Missing values:
  - ignore the instance
  - fill in manually
  - use a global value / the mean / the most frequent value
- Noise:
  - smoothing (binning)
  - outlier removal
- Inconsistency:
  - domain knowledge, domain constraints
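A short sketch of mean and most-frequent filling with pandas; the column names and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40, 33],
                       "city": ["NY", "LA", None, "NY"]})

    df["age"] = df["age"].fillna(df["age"].mean())        # fill with the column mean
    df["city"] = df["city"].fillna(df["city"].mode()[0])  # fill with the most frequent value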
Data integration
- Combines data from multiple sources into a coherent data store
- Schema integration:
  - the entity identification problem
- Redundancy:
  - an attribute may be derived from another table
  - detect it via correlation analysis (see the sketch below)
- Data value conflicts
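A minimal sketch of spotting a redundant (derived) attribute via correlation; the data and the 0.95 threshold are illustrative assumptions:

    import numpy as np

    price = np.array([10.0, 20.0, 30.0, 40.0])
    price_with_tax = price * 1.08     # hypothetical attribute derived from price

    r = np.corrcoef(price, price_with_tax)[0, 1]
    if abs(r) > 0.95:                 # assumed redundancy threshold
        print(f"highly correlated (r = {r:.3f}); one attribute may be redundant")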
Data transformation
- Data is transformed or consolidated into forms appropriate for mining
- Methods include:
  - smoothing
  - aggregation
  - generalization
  - normalization (min-max)
  - feature construction
    - e.g., using neural networks
- Traditional transformation methods
Feature extraction
- The basic problem:
  - creating new features that are combinations of the original features
- A common approach - PCA:
  - http://en.wikipedia.org/wiki/Principal_components_analysis
  - http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
  - Dimensionality reduction via feature extraction (or transformation):
    - D' = DA, where D is mean-centered (N x n) and A is (n x m), so D' is (N x m)
  - Its variants (SVD, LSI) are widely used in text mining and web mining
Transformation: PCA
- D' = DA, where D is mean-centered (N x n)
- Calculate and rank the eigenvalues of the covariance matrix
- Select the largest eigenvalues such that r > threshold (e.g., 0.95), where

  r = (λ_1 + ... + λ_m) / (λ_1 + ... + λ_n)

- The corresponding eigenvectors form A (n x m)
- Example with the Iris data:

  Eigenvalues:
  #   E-value   Diff      Prop      Cumu
  1   2.91082   1.98960   0.72771   0.72770
  2   0.92122   0.77387   0.23031   0.95801
  3   0.14735   0.12675   0.03684   0.99485
  4   0.02061             0.00515   1.00000

  Eigenvectors:
         V1          V2          V3          V4
  F1     0.522372    0.372318   -0.721017   -0.261996
  F2    -0.263355    0.925556    0.242033    0.124135
  F3     0.581254    0.021095    0.140892    0.801154
  F4     0.565611    0.065416    0.633801   -0.523546
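A sketch of the whole procedure. The eigenvalues on the slide correspond to the correlation matrix of the Iris data (equivalently, the covariance matrix of standardized features); load_iris from scikit-learn is an assumed data source, and small numeric differences from the table can come from data versions and rounding:

    import numpy as np
    from sklearn.datasets import load_iris  # assumed source of the Iris data

    X = load_iris().data                     # N x n data matrix (150 x 4)
    C = np.corrcoef(X, rowvar=False)         # correlation matrix of the 4 features
    evals, evecs = np.linalg.eigh(C)         # eigendecomposition (ascending order)
    order = np.argsort(evals)[::-1]          # rank eigenvalues in descending order
    evals, evecs = evals[order], evecs[:, order]

    r = np.cumsum(evals) / evals.sum()       # cumulative proportion ("Cumu" column)
    m = int(np.searchsorted(r, 0.95)) + 1    # smallest m with r > 0.95 -> m = 2

    D = (X - X.mean(axis=0)) / X.std(axis=0) # mean-centered, standardized data
    A = evecs[:, :m]                         # n x m matrix of top eigenvectors
    D_prime = D @ A                          # reduced N x m data: D' = DA

    print(np.round(evals, 2))                # ~[2.91 0.92 0.15 0.02], as in the table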
Summary
- Data preprocessing cannot be over-stressed in real-world applications
- It is an important, difficult, and low-profile task
- There are different types of approaches for different preprocessing problems
- It should be considered together with the mining algorithms to improve data mining effectiveness