
2. Data Preparation and Preprocessing
Data and Its Forms; Preparation, Preprocessing, and Data Reduction
Guozhu Dong, Data Mining: Data Preprocessing, 9/03


Data Types and Forms
- Attribute-vector data: data types
  - numeric, categorical (see the hierarchy for their relationship)
  - static, dynamic (temporal)
- Other data forms
  - distributed data
  - text, Web, metadata
  - images, audio/video


Data Preparation
- An important and time-consuming task in KDD
- Raw data commonly presents:
  - high-dimensional data (20, 1000, ... attributes)
  - huge size (volume)
  - missing data
  - outliers
  - erroneous data (inconsistent, mis-recorded, distorted)


Data Preparation Methods
- Data annotation
- Data normalization
  - examples: image pixels, age
- Dealing with sequential or temporal data
  - transform to tabular form
- Removing outliers
  - different types of outliers


Normalization
- Decimal scaling:
  - v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
  - Example: for the range -991 to 99, 10^k is 1000, so -991 becomes -0.991
- Min-max normalization into a new max/min range:
  - v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  - Example: v = 73600 in [12000, 98000] becomes v' = 0.716 in the new range [0, 1]
- Zero-mean normalization:
  - v' = (v - mean_A) / std_dev_A
  - Example: for (1, 2, 3), mean and std_dev are 2 and 1, giving (-1, 0, 1)
  - If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 becomes v' = 1.225
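For concreteness, here is a minimal Python sketch of the three normalization schemes above; the function names and the NumPy dependency are my own choices for illustration, not part of the original slides.

```python
import numpy as np

def decimal_scaling(v):
    """Scale by the smallest power of 10 so that all values fall strictly inside (-1, 1)."""
    m = np.max(np.abs(v))
    k = int(np.floor(np.log10(m))) + 1 if m >= 1 else 0
    return v / (10.0 ** k)

def min_max(v, new_min=0.0, new_max=1.0):
    """Map values linearly from [min_A, max_A] to [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def zero_mean(v):
    """Z-score normalization: subtract the mean, divide by the (sample) standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

print(decimal_scaling(np.array([-991.0, 99.0])))              # [-0.991  0.099]
print(min_max(np.array([12000.0, 73600.0, 98000.0])))         # 73600 -> ~0.716
print(zero_mean(np.array([1.0, 2.0, 3.0])))                   # [-1.  0.  1.]
```

The sample standard deviation (ddof=1) is used so that (1, 2, 3) reproduces the slide's std_dev of 1.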


Temporal Data
- The goal is to forecast t(n+1) from previous values X = {t(1), t(2), ..., t(n)}
- An example with two features (A, B) and window size 3: the original series (first table) is transformed into windowed instances (second table)
- How do we determine the window size?

  Time  A   B
  1     7   215
  2     10  211
  3     6   214
  4     11  221
  5     12  210
  6     14  218

  Inst  A(n-2)  A(n-1)  A(n)  B(n-2)  B(n-1)  B(n)
  1     7       10      6     215     211     214
  2     10      6       11    211     214     221
  3     6       11      12    214     221     210
  4     11      12      14    221     210     218
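A small Python sketch of this windowing transformation; the function name and the row layout are assumptions made for illustration.

```python
def to_tabular(series, window=3):
    """Turn a list of (A, B) feature tuples into windowed instances.

    Each instance holds the last `window` values of every feature; the value at
    the next time step would serve as the forecasting target.
    """
    instances = []
    for i in range(len(series) - window + 1):
        chunk = series[i:i + window]
        row = []
        for f in range(len(series[0])):            # one group of columns per feature
            row.extend(chunk[t][f] for t in range(window))
        instances.append(row)
    return instances

series = [(7, 215), (10, 211), (6, 214), (11, 221), (12, 210), (14, 218)]
for inst, row in enumerate(to_tabular(series), start=1):
    print(inst, row)
# 1 [7, 10, 6, 215, 211, 214]
# 2 [10, 6, 11, 211, 214, 221]
# 3 [6, 11, 12, 214, 221, 210]
# 4 [11, 12, 14, 221, 210, 218]
```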


Outlier Removal
- Outlier: data points inconsistent with the majority of the data
- Different kinds of outliers
  - valid: a CEO's salary
  - noisy: a person's age = 200; widely deviated points
- Removal methods
  - clustering
  - curve fitting
  - hypothesis testing with a given model
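The slide does not spell these methods out; as one assumed illustration, a simple deviation-based check flags points that lie far from the mean.

```python
import numpy as np

def deviation_outliers(values, k=3.0):
    """Flag points more than k standard deviations away from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > k

ages = [23, 31, 27, 35, 29, 200]      # 200 is a mis-recorded age
print(deviation_outliers(ages, k=2))  # [False False False False False  True]
```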


Data Preprocessing
- Data cleaning
  - missing data
  - noisy data
  - inconsistent data
- Data reduction
  - dimensionality reduction
  - instance selection
  - value discretization


Missing Data
- Many types of missing data
  - not measured
  - not applicable
  - wrongly placed, and so on
- Some methods
  - leave as is
  - ignore/remove the instance with the missing value
  - manual fix (assign a value with an implicit meaning)
  - statistical methods (majority, most likely value, mean, nearest neighbor, ...)
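A hedged sketch of the statistical fixes (mean for a numeric attribute, majority value for a categorical one); the column names and the pandas dependency are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [23, None, 35, 29, None],
    "gender": ["F", "M", None, "M", "M"],
})

# Mean imputation for a numeric attribute.
df["age"] = df["age"].fillna(df["age"].mean())

# Majority (mode) imputation for a categorical attribute.
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df)
```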


Noisy Data
- Noise: random error or variance in a measured variable
  - measuring errors (source)
  - inconsistent values for features or classes (processing)
- Noise is normally a minority in the data set. Why?
- Removing noise
  - clustering/merging
  - smoothing (rounding, averaging within a window)
  - outlier detection (deviation-based or distance-based)


Inconsistent Data
- Inconsistent with our models or with common sense
- Examples
  - the same name recorded as different names within an application
  - different names appearing the same (Dennis vs. Denis)
  - inappropriate values (male and pregnant, negative age)
  - one bank's database shows that 5% of its customers were born on 11/11/11


Dimensionality Reduction
- Feature selection
  - select m of the original n features, m ≤ n
  - remove irrelevant and redundant features, with savings in search space
- Feature transformation (e.g., PCA)
  - form new features (a) in a new domain from the original features (f)
  - many uses, but it does not reduce the original dimensionality
  - often used for visualization of data


Feature Selection
- Problem illustration: the search space ranges from the empty set to the full feature set
- Search
  - exhaustive/complete (enumeration, branch & bound)
  - heuristic (sequential forward/backward selection)
  - stochastic (generate and evaluate)
  - generation/evaluation of individual features or of feature subsets


Feature Selection (2)
- Goodness metrics
  - dependency: dependence on classes
  - distance: separation of classes
  - information: entropy
  - consistency: 1 - #inconsistencies/N
    - example: on a small table of six instances over F1, F2, F3 and class C, both subsets (F1, F2, F3) and (F1, F3) have a 2/6 inconsistency rate
  - accuracy (classifier based): 1 - errorRate
- Their comparisons
  - time complexity, number of features, removing redundancy
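A minimal Python sketch of the consistency measure: for each distinct feature-value pattern, count the instances that do not belong to that pattern's majority class. The toy rows below are made up so that dropping F2 leaves the 2/6 rate unchanged; they are not the table from the slide.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, feature_idx):
    """1 - consistency: fraction of instances conflicting with the majority class of their pattern."""
    groups = defaultdict(Counter)
    for *features, label in rows:
        pattern = tuple(features[i] for i in feature_idx)
        groups[pattern][label] += 1
    conflicts = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return conflicts / len(rows)

# Six instances: (F1, F2, F3, class) -- illustrative values only.
data = [
    (0, 0, 1, 1),
    (0, 0, 1, 1),
    (0, 0, 1, 0),
    (0, 0, 1, 0),
    (1, 1, 0, 0),
    (1, 0, 0, 0),
]
print(inconsistency_rate(data, [0, 1, 2]))   # 2/6 using all three features
print(inconsistency_rate(data, [0, 2]))      # still 2/6 using F1 and F3 only
```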


Feature Selection (3)
- Filter vs. wrapper models
  - pros and cons:
    - time
    - generality
    - performance, such as accuracy
- Stopping criteria
  - thresholding (number of iterations, a target accuracy, ...)
  - anytime algorithms
    - provide approximate solutions
    - solutions improve over time


Feature Selection (Examples)
- SFS using consistency (cRate)
  - select 1 of the n features, then 1 of the remaining n-1, n-2, ... features
  - increase the number of selected features until the prespecified cRate is reached
- LVF using consistency (cRate), as in the sketch below:
  1. randomly generate a subset S from the full set
  2. if S satisfies the prespecified cRate, keep it if it has the minimum #S found so far
  3. go back to 1 until a stopping criterion is met
  - LVF is an anytime algorithm
- Many other algorithms: SBS, B&B, ...
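A minimal Python sketch of LVF (Las Vegas Filter); the parameter names and the stopping rule (a fixed number of random trials) are assumptions for illustration, and the toy data is the same made-up table used for the consistency metric above.

```python
import random
from collections import Counter, defaultdict

def crate(rows, subset):
    """Inconsistency rate of a feature subset (see the consistency metric above)."""
    groups = defaultdict(Counter)
    for *features, label in rows:
        groups[tuple(features[i] for i in subset)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values()) / len(rows)

def lvf(rows, n_features, max_crate, max_tries=1000, seed=0):
    """Las Vegas Filter: randomly sample subsets, keep the smallest one meeting max_crate."""
    rng = random.Random(seed)
    best = list(range(n_features))                       # start from the full feature set
    for _ in range(max_tries):
        size = rng.randint(1, len(best))                 # ignore subsets larger than the current best
        subset = sorted(rng.sample(range(n_features), size))
        if len(subset) < len(best) and crate(rows, subset) <= max_crate:
            best = subset
    return best

data = [(0, 0, 1, 1), (0, 0, 1, 1), (0, 0, 1, 0),
        (0, 0, 1, 0), (1, 1, 0, 0), (1, 0, 0, 0)]        # same toy rows as above
print(lvf(data, n_features=3, max_crate=2 / 6))          # smallest subset found with cRate <= 2/6
```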


Transformation: PCA
- D' = D A, where D is the mean-centered (N x n) data matrix and A is (n x m)
- Calculate and rank the eigenvalues of the covariance matrix
- r = (λ_1 + ... + λ_m) / (λ_1 + ... + λ_n)
- Select the m largest eigenvalues such that r > threshold (e.g., 0.95); the corresponding eigenvectors form A (n x m)
- Example with the Iris data:

  m  E-values  Diff      Prop      Cumu
  1  2.91082   1.98960   0.72771   0.72770
  2  0.92122   0.77387   0.23031   0.95801
  3  0.14735   0.12675   0.03684   0.99485
  4  0.02061             0.00515   1.00000

      V1         V2         V3         V4
  F1  0.522372   0.372318   -0.721017  -0.261996
  F2  -0.263355  0.925556   0.242033   0.124135
  F3  0.581254   0.021095   0.140892   0.801154
  F4  0.565611   0.065416   0.633801   -0.523546
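A compact NumPy sketch of this procedure (mean-center, eigendecompose the covariance matrix, keep enough components to pass the variance threshold). The Iris numbers above come from the slide; the small array below is made up purely for illustration.

```python
import numpy as np

def pca_transform(D, threshold=0.95):
    """Project mean-centered data onto the top eigenvectors of its covariance matrix."""
    D = D - D.mean(axis=0)                       # mean-center the (N x n) data matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(D, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # rank eigenvalues from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = np.cumsum(eigvals) / eigvals.sum()  # r for m = 1, 2, ..., n
    m = int(np.searchsorted(ratios, threshold) + 1)
    A = eigvecs[:, :m]                           # the (n x m) projection matrix
    return D @ A, eigvals, m

D = np.array([[2.5, 2.4, 0.5], [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.6], [1.9, 2.2, 0.4],
              [3.1, 3.0, 0.2]])
D_prime, eigvals, m = pca_transform(D, threshold=0.95)
print(m, eigvals.round(4))
print(D_prime.round(3))
```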


Instance Selection
- Sampling methods
  - random sampling
  - stratified sampling
- Search-based methods
  - representatives
  - prototypes
  - sufficient statistics (N, mean, stdDev)
  - support vectors
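As one assumed illustration of the sampling methods, here is a small stratified sampler that keeps the class proportions of the original data; the function name and parameters are mine.

```python
import random
from collections import defaultdict

def stratified_sample(rows, labels, fraction, seed=0):
    """Draw roughly `fraction` of the instances from each class (stratified sampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for label, members in by_class.items():
        k = max(1, round(fraction * len(members)))   # keep at least one instance per class
        sample.extend(rng.sample(members, k))
    return sample

rows = [[i] for i in range(100)]
labels = ["A"] * 80 + ["B"] * 20                     # an 80/20 class distribution
print(len(stratified_sample(rows, labels, fraction=0.1)))   # 8 + 2 = 10 instances
```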


Value Discretization
- Binning methods
  - equal-width
  - equal-frequency
  - class information is not used
- Entropy-based
- ChiMerge
  - Chi2


Binning
- Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 18, 24, 26, 28
- Equi-width binning, for a bin width of e.g. 10:
  - Bin 1: 0, 4            the [–, 10) bin
  - Bin 2: 12, 16, 18      the [10, 20) bin
  - Bin 3: 24, 26, 28      the [20, +) bin
- Equi-frequency binning, for a bin density of e.g. 3:
  - Bin 1: 0, 4, 12        the [–, 14) bin
  - Bin 2: 16, 18          the [14, 21) bin
  - Bin 3: 24, 26, 28      the [21, +) bin
- We use – to denote negative infinity and + for positive infinity
- Any problems with the above methods?
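A small Python sketch of both binning schemes on the age values above; the function names are mine, and the equi-frequency version simply chunks the sorted values front to back, so its last two bins come out slightly differently from the slide's hand-picked split.

```python
def equi_width_bins(values, width):
    """Group values into consecutive bins of fixed width starting at 0."""
    bins = {}
    for v in sorted(values):
        low = (v // width) * width
        bins.setdefault((low, low + width), []).append(v)
    return bins

def equi_frequency_bins(values, density):
    """Group sorted values into consecutive bins holding `density` values each."""
    values = sorted(values)
    return [values[i:i + density] for i in range(0, len(values), density)]

ages = [0, 4, 12, 16, 18, 24, 26, 28]
print(equi_width_bins(ages, 10))      # {(0, 10): [0, 4], (10, 20): [12, 16, 18], (20, 30): [24, 26, 28]}
print(equi_frequency_bins(ages, 3))   # [[0, 4, 12], [16, 18, 24], [26, 28]]
```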


Entropy-based
- Given attribute-value/class pairs: (0, P), (4, P), (12, P), (16, N), (18, P), (24, N), (26, N), (28, N)
- Entropy-based binning via binarization:
  - intuitively, find the best split so that the bins are as pure as possible
  - formally characterized by maximal information gain
- Let S denote the above 9 pairs, p = 4/9 the fraction of P pairs, and n = 5/9 the fraction of N pairs. Then Entropy(S) = -p log2 p - n log2 n.
  - smaller entropy: the set is relatively pure; the smallest value is 0
  - larger entropy: the set is mixed; the largest value is 1


Entropy-based (2)
- Let v be a possible split point. Then S is divided into two sets:
  - S1: value <= v and S2: value > v
- Information of the split:
  - I(S1, S2) = (|S1|/|S|) * Entropy(S1) + (|S2|/|S|) * Entropy(S2)
- Information gain of the split:
  - Gain(v, S) = Entropy(S) - I(S1, S2)
- Goal: the split with maximal information gain
  - possible splits: midpoints between any two consecutive values
  - for v = 14, I(S1, S2) = 0 + 6/9 * Entropy(S2) = 6/9 * 0.65 = 0.433, so Gain(14, S) = Entropy(S) - 0.433
  - maximum Gain means minimum I
- The best split is found after examining all possible split points
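A minimal Python sketch of this binarization step: compute the entropy, the split information, and the gain over all midpoint candidates. The pairs below are the eight listed on the previous slide (the running example counts nine, one of which is not shown), so the resulting gain differs slightly from the 9-pair arithmetic above.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels (0 for a pure set, 1 for an even two-class mix)."""
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_split(pairs):
    """Return (v, gain) for the midpoint split with maximal information gain."""
    pairs = sorted(pairs)
    values = [v for v, _ in pairs]
    labels = [c for _, c in pairs]
    base = entropy(labels)
    best = None
    for a, b in zip(values, values[1:]):
        if a == b:
            continue
        v = (a + b) / 2                               # candidate midpoint
        left = [c for x, c in pairs if x <= v]
        right = [c for x, c in pairs if x > v]
        info = (len(left) / len(pairs)) * entropy(left) + (len(right) / len(pairs)) * entropy(right)
        gain = base - info                            # Gain(v, S) = Entropy(S) - I(S1, S2)
        if best is None or gain > best[1]:
            best = (v, gain)
    return best

pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]
print(best_split(pairs))   # (14.0, ~0.549) on these eight pairs
```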


ChiMerge and Chi2
- Given attribute-value/class pairs
- Build a contingency table for every pair of adjacent intervals
- Chi-squared test (goodness of fit):
  - χ² = Σ_{i=1..2} Σ_{j=1..k} (A_ij - E_ij)² / E_ij
  - for two intervals I-1 and I-2 over classes C1, ..., Ck, A_ij is the observed count, E_ij the expected count, and R_i the row total:

          C1    C2
    I-1   A11   A12   R1
    I-2   A21   A22   R2

- Parameters: df = k - 1 and a p% level of significance
- The Chi2 algorithm provides an automatic way to adjust p
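A small Python sketch of the χ² statistic for one pair of adjacent intervals, as ChiMerge uses it to decide whether to merge them; the per-class counts below are made-up illustrative numbers.

```python
def chi_squared(interval1, interval2):
    """Chi-squared statistic for two adjacent intervals given their per-class counts."""
    rows = [interval1, interval2]
    col_totals = [sum(col) for col in zip(*rows)]
    grand = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        r_total = sum(row)
        for a_ij, c_total in zip(row, col_totals):
            e_ij = r_total * c_total / grand        # expected count under independence
            if e_ij > 0:
                chi2 += (a_ij - e_ij) ** 2 / e_ij
    return chi2

# Class counts (C1, C2) observed in two adjacent intervals -- illustrative numbers.
print(chi_squared([4, 1], [2, 3]))   # a small value -> similar class distributions -> candidates for merging
```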


Summary
- Data have many forms
  - attribute-vector data is the most common form
- Raw data need to be prepared and preprocessed for data mining
  - data miners have to work with the data provided
  - domain expertise is important in DPP
  - data preparation: normalization, transformation
  - data preprocessing: cleaning and reduction
- DPP is a critical and time-consuming task. Why?


Bibliography
- H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
- M. Kantardzic, 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press and Wiley-Interscience.
- H. Liu & H. Motoda (eds.), 2001. Instance Selection and Construction for Data Mining. Kluwer.
- H. Liu, F. Hussain, C. L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6: 393-423.