2. Data Preparation and Preprocessing: Data and Its Forms, Preparation, Preprocessing, and Data Reduction

Data Types and Forms
• Attribute-vector data
  – Data types: numeric, categorical (see the hierarchy for their relationship)
  – Static, dynamic (temporal)
• Other data forms
  – distributed data
  – text, Web, meta data
  – images, audio/video
• You have seen most of these in the invited talks.

Data Preparation
• An important and time-consuming task in KDD
• Issues in raw data:
  – high-dimensional data (20, 1000)
  – huge data size
  – missing data
  – outliers
  – erroneous data (inconsistent, misrecorded, distorted)

Data Preparation Methods
• Data annotation (as in driving data analysis; another example is image mining)
• Data normalization (different types)
• Dealing with sequential or temporal data (transform it to tabular form)
• Removing outliers

Normalization
• Decimal scaling
  – v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
  – For the range between -991 and 99, 10^k is 1000 (k = 3), so -991 becomes -0.991
• Min-max normalization into the new max/min range
  – v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  – v = 73600 in [12000, 98000] maps to v' = 0.716 in [0, 1] (the new range)
• Zero-mean normalization
  – v' = (v - mean_A) / std_dev_A
  – (1, 2, 3): mean and std_dev are 2 and 1, giving (-1, 0, 1)
  – If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 maps to 1.225
(A small sketch of all three schemes follows below.)
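
The following is a minimal sketch of the three normalization schemes above, reproducing the slide's numeric examples; the function names are illustrative, not from the slides.

```python
# Sketches of decimal scaling, min-max, and zero-mean normalization.

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |v'| below 1."""
    k = 0
    while max(abs(v) for v in values) / (10 ** k) >= 1:
        k += 1
    return [v / 10 ** k for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Map values linearly from [min, max] into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def zero_mean(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [(v - mean) / std for v in values]

print(decimal_scaling([-991, 99]))       # [-0.991, 0.099]
print(min_max([12000, 73600, 98000]))    # 73600 -> ~0.716
print(zero_mean([1, 2, 3]))              # [-1.0, 0.0, 1.0]
```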

Temporal Data
• The goal is to forecast t(n+1) from previous values
  – X = {t(1), t(2), …, t(n)}
• An example with two features and window size 3 (tables below; see the sketch after them)
• How to determine the window size?

Original series:

  Time   A    B
   1     7    215
   2     10   211
   3     6    214
   4     11   221
   5     12   210
   6     14   218

Windowed (tabular) form:

  Inst   A(n-2)  A(n-1)  A(n)   B(n-2)  B(n-1)  B(n)
   1       7       10      6     215     211    214
   2      10        6     11     211     214    221
   3       6       11     12     214     221    210
   4      11       12     14     221     210    218
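
A small sketch of the window transformation that produces the tabular form above; the function name is illustrative.

```python
# Turn a two-feature time series into windowed instances (window size 3).

def window_series(series, size):
    """Return one row of `size` consecutive values per window position."""
    return [series[i:i + size] for i in range(len(series) - size + 1)]

A = [7, 10, 6, 11, 12, 14]
B = [215, 211, 214, 221, 210, 218]

rows = [a + b for a, b in zip(window_series(A, 3), window_series(B, 3))]
for inst, row in enumerate(rows, start=1):
    print(inst, row)
# 1 [7, 10, 6, 215, 211, 214]
# 2 [10, 6, 11, 211, 214, 221]
# 3 [6, 11, 12, 214, 221, 210]
# 4 [11, 12, 14, 221, 210, 218]
```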

Outlier Removal
• Data points inconsistent with the majority of the data
• Different outliers
  – valid: a CEO's salary
  – noisy: someone's age = 200; widely deviated points
• Removal methods (a small deviation-based sketch follows below)
  – clustering
  – curve-fitting
  – hypothesis-testing with a given model
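
One simple deviation-based check, as a sketch of the removal idea above; the 3-standard-deviation threshold and the toy ages are assumptions, not from the slides.

```python
# Remove points that deviate from the mean by more than n_std standard deviations.

def remove_outliers(values, n_std=3.0):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) <= n_std * std]

ages = [23, 31, 27, 45, 38, 29, 35, 41, 26, 33, 200]   # 200 is the noisy age mentioned above
print(remove_outliers(ages))                            # 200 is dropped
```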

Data Preprocessing
• Data cleaning
  – missing data
  – noisy data
  – inconsistent data
• Data reduction
  – dimensionality reduction
  – instance selection
  – value discretization

Missing Data
• Many types of missing data
  – not measured
  – truly missed
  – wrongly placed, and so on
• Some methods (a small imputation sketch follows below)
  – leave as is
  – ignore/remove the instance with the missing value
  – manual fix (assign a value with implicit meaning)
  – statistical methods (majority, most likely, mean, nearest neighbor, …)
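
A brief sketch of the statistical fill-in methods listed above (mean for numeric values, majority for categorical); here None marks a missing entry, and the data are illustrative.

```python
from collections import Counter

def impute_mean(column):
    """Replace missing numeric values with the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

def impute_majority(column):
    """Replace missing categorical values with the most frequent known value."""
    known = [v for v in column if v is not None]
    majority = Counter(known).most_common(1)[0][0]
    return [majority if v is None else v for v in column]

print(impute_mean([3.0, None, 5.0, 4.0]))              # None -> 4.0
print(impute_majority(["red", None, "red", "blue"]))   # None -> "red"
```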

Noisy Data
• Random error or variance in a measured variable
  – inconsistent values for features or classes (process)
  – measuring errors (source)
• Noise is normally a minority in the data set
  – Why?
• Removing noise (a smoothing sketch follows below)
  – clustering/merging
  – smoothing (rounding, averaging within a window)
  – outlier detection (deviation-based or distance-based)
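
A tiny moving-average smoother, one way to do the window-based averaging mentioned above; the window size of 3 and the sample values are assumptions for illustration.

```python
def smooth(values, window=3):
    """Replace each value by the average over a window centered on it."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

print(smooth([7, 10, 6, 11, 12, 14]))
# [8.5, 7.67, 9.0, 9.67, 12.33, 13.0]
```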

Inconsistent Data
• Inconsistent with our models or common sense
• Examples
  – the same name occurs differently in an application
  – different names appear the same (Dennis vs. Denis)
  – inappropriate values (Male-Pregnant, negative age)
  – one bank's database shows that 5% of its customers were born on 11/11/11

Dimensionality Reduction
• Feature selection
  – select m from n features, m ≤ n
  – remove irrelevant, redundant features
  – the saving is in the search space
• Feature transformation (PCA)
  – forms new features (a) in a new domain from the original features (f)
  – many uses, but it does not reduce the original dimensionality
  – often used in visualization of data

Feature Selection
• Problem illustration
  – full set
  – empty set
  – enumeration
• Search
  – exhaustive/complete (enumeration, B&B)
  – heuristic (sequential forward/backward)
  – stochastic (generate/evaluate)
  – individual features or subsets: generation/evaluation

Feature Selection (2)
• Goodness metrics
  – dependency: depending on classes
  – distance: separating classes
  – information: entropy
  – consistency: 1 - #inconsistencies/N
    · example: (F1, F2, F3) and (F1, F3); both sets have a 2/6 inconsistency rate
      (the slide shows a small example table of F1, F2, F3, C values for this point; a sketch of the computation follows below)
  – accuracy (classifier-based): 1 - errorRate
• Their comparisons
  – time complexity, number of features, removing redundancy
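
A small sketch of the consistency metric above: for each pattern of the selected features, the inconsistency count is the group size minus the size of its majority class, and the rate is the total over N. The toy rows below are illustrative, not the slide's table.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, features):
    """Inconsistency rate of the data when only `features` (column indices) are kept."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        key = tuple(row[f] for f in features)
        groups[key].append(label)
    inconsistent = sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())
    return inconsistent / len(rows)

rows   = [(0, 0, 1), (0, 0, 1), (0, 1, 0), (0, 1, 0), (1, 0, 0), (1, 0, 0)]
labels = [1, 0, 1, 1, 0, 1]
print(inconsistency_rate(rows, labels, features=[0, 1, 2]))   # 2/6 ≈ 0.333
```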

Feature Selection (3)
• Filter vs. wrapper model
• Pros and cons
  – time
  – generality
  – performance (such as accuracy)
• Stopping criteria
  – thresholding (number of iterations, some accuracy, …)
  – anytime algorithms
    · provide approximate solutions
    · solutions improve over time

Feature Selection (Examples)
• SFS using consistency (cRate)
  – select 1 from n features, then 1 from the remaining n-1, n-2, … features
  – increase the number of selected features until the prespecified cRate is reached
• LVF using consistency (cRate); a sketch follows below
  1. randomly generate a subset S from the full set
  2. if S satisfies the prespecified cRate, keep S with the minimum #S
  3. go back to 1 until a stopping criterion is met
  – LVF is an anytime algorithm
• Many other algorithms: SBS, B&B, …
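
The following is a minimal sketch of LVF using the inconsistency rate as the consistency measure; the `max_tries` stopping criterion is an assumption, and `inconsistency_rate()` is the illustrative helper defined in the earlier consistency sketch.

```python
import random

def lvf(rows, labels, n_features, c_rate, max_tries=1000):
    """Las Vegas Filter: keep the smallest random subset meeting the cRate threshold."""
    best = list(range(n_features))                 # start from the full feature set
    for _ in range(max_tries):
        size = random.randint(1, len(best))        # only subsets no larger than the current best
        subset = random.sample(range(n_features), size)
        if size < len(best) and inconsistency_rate(rows, labels, subset) <= c_rate:
            best = subset                          # smaller satisfying subset found
    return best
```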

Transformation: PCA
• D' = D A, where D is the mean-centered data matrix (N × n)
• Calculate and rank the eigenvalues of the covariance matrix
• r = (λ_1 + … + λ_m) / (λ_1 + … + λ_n)
• Select the largest eigenvalues such that r > threshold (e.g., 0.95)
• The corresponding eigenvectors form A (n × m)
• Example: Iris data (a code sketch follows below)

  m   E-value   Diff      Prop      Cumu
  1   2.91082   1.98960   0.72771   0.72770
  2   0.92122   0.77387   0.23031   0.95801
  3   0.14735   0.12675   0.03684   0.99485
  4   0.02061             0.00515   1.00000

        V1          V2          V3          V4
  F1    0.522372    0.372318   -0.721017   -0.261996
  F2   -0.263355    0.925556    0.242033    0.124135
  F3    0.581254    0.021095    0.140892    0.801154
  F4    0.565611    0.065416    0.633801   -0.523546
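
A compact sketch of the procedure on this slide: mean-center D, eigendecompose the covariance matrix, keep the top m eigenvectors whose cumulative variance ratio r exceeds the threshold, and project D' = D A. It uses numpy; the variable names are illustrative.

```python
import numpy as np

def pca(D, threshold=0.95):
    """Project the (N x n) data D onto the top principal components."""
    D = D - D.mean(axis=0)                       # mean-center (N x n)
    eigvals, eigvecs = np.linalg.eigh(np.cov(D, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # rank eigenvalues, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    r = np.cumsum(eigvals) / eigvals.sum()       # cumulative proportion of variance
    m = int(np.searchsorted(r, threshold)) + 1   # smallest m with r above the threshold
    A = eigvecs[:, :m]                           # (n x m) projection matrix
    return D @ A, eigvals, A                     # D' is N x m

# On the Iris data of the table above, the first two components already
# exceed the 0.95 threshold (Cumu = 0.95801), so m = 2.
```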

Instance Selection
• Sampling methods (a small sketch follows below)
  – random sampling
  – stratified sampling
• Search-based methods
  – representatives
  – prototypes
  – sufficient statistics (N, mean, stdDev)
  – support vectors
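
A brief sketch of the two sampling methods above; the sampling fraction and per-class proportionality are assumptions for illustration.

```python
import random
from collections import defaultdict

def random_sample(instances, fraction):
    """Draw a simple random sample of the given fraction of instances."""
    k = max(1, int(len(instances) * fraction))
    return random.sample(instances, k)

def stratified_sample(instances, labels, fraction):
    """Sample each class separately so class proportions are preserved."""
    by_class = defaultdict(list)
    for inst, label in zip(instances, labels):
        by_class[label].append(inst)
    sample = []
    for group in by_class.values():
        k = max(1, int(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample
```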

Value Discretization
• Binning methods
  – equal-width
  – equal-frequency
  – class information is not used
• Entropy-based
• ChiMerge
  – Chi2

Binning
• Attribute values (for one attribute, e.g., age):
  – 0, 4, 12, 16, 18, 24, 26, 28
• Equi-width binning, for a bin width of e.g. 10:
  – Bin 1: 0, 4           [-∞, 10) bin
  – Bin 2: 12, 16, 18     [10, 20) bin
  – Bin 3: 24, 26, 28     [20, +∞) bin
  – We use -∞ to denote negative infinity, +∞ for positive infinity
• Equi-frequency binning, for a bin density of e.g. 3:
  – Bin 1: 0, 4, 12       [-∞, 14) bin
  – Bin 2: 16, 18         [14, 21) bin
  – Bin 3: 24, 26, 28     [21, +∞) bin
• Any problems with the above methods? (A small sketch of both schemes follows below.)
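
A short sketch of the two binning schemes on the ages above; bin width 10 and bin density 3 match the slide's example, and the naive equi-frequency chunking differs slightly from the slide's hand-chosen boundaries.

```python
def equal_width_bins(values, width):
    """Group values by the interval index floor(v / width)."""
    bins = {}
    for v in sorted(values):
        bins.setdefault(v // width, []).append(v)
    return list(bins.values())

def equal_frequency_bins(values, density):
    """Split the sorted values into consecutive chunks of `density` values."""
    ordered = sorted(values)
    return [ordered[i:i + density] for i in range(0, len(ordered), density)]

ages = [0, 4, 12, 16, 18, 24, 26, 28]
print(equal_width_bins(ages, 10))      # [[0, 4], [12, 16, 18], [24, 26, 28]]
print(equal_frequency_bins(ages, 3))   # [[0, 4, 12], [16, 18, 24], [26, 28]]
                                       # naive split; the slide puts 24 in the last bin
```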

Entropy-based
• Given attribute-value/class pairs:
  – (0, P), (4, P), (12, P), (16, N), (18, P), (24, N), (26, N), (28, N)
• Entropy-based binning via binarization:
  – intuitively, find the best split so that the bins are as pure as possible
  – formally characterized by maximal information gain
• Let S denote the above 9 pairs, p = 4/9 the fraction of P pairs, and n = 5/9 the fraction of N pairs. Entropy(S) = -p log p - n log n.
  – smaller entropy: the set is relatively pure; the smallest is 0
  – larger entropy: the set is mixed; the largest is 1

Entropy-based (2)
• Let v be a possible split. Then S is divided into two sets:
  – S1: value <= v and S2: value > v
• Information of the split:
  – I(S1, S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
• Information gain of the split:
  – Gain(v, S) = Entropy(S) - I(S1, S2)
• Goal: the split with maximal information gain
  – possible splits: midpoints between any two consecutive values
  – for v = 14, I(S1, S2) = 0 + (6/9) * Entropy(S2) = (6/9) * 0.65 = 0.433, so
    Gain(14, S) = Entropy(S) - 0.433
  – maximum Gain means minimum I
• The best split is found after examining all possible split points. (A sketch of the split search follows below.)
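
A small sketch of the binarization step: compute Entropy(S), try every midpoint between consecutive values, and return the split with maximal gain. It uses the 8 pairs listed two slides back (the slide itself counts 9 pairs), so the numbers differ slightly from the slide's worked example.

```python
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_split(pairs):
    """Return (split value, information gain) of the best binary split."""
    pairs = sorted(pairs)
    values = [v for v, _ in pairs]
    labels = [c for _, c in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, len(pairs)):
        v = (values[i - 1] + values[i]) / 2            # midpoint candidate
        left, right = labels[:i], labels[i:]
        info = (len(left) / len(labels)) * entropy(left) + \
               (len(right) / len(labels)) * entropy(right)
        gain = base - info
        if best is None or gain > best[1]:
            best = (v, gain)
    return best

pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]
print(best_split(pairs))   # the best cut point and its information gain
```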

ChiMerge and Chi2
• Given attribute-value/class pairs
• Build a contingency table for every pair of adjacent intervals (I)
• Chi-squared test (goodness of fit):

  χ² = Σ_{i=1..2} Σ_{j=1..k} (A_ij - E_ij)² / E_ij

          C1    C2    …    total
  I-1     A11   A12        R1
  I-2     A21   A22        R2
  total   C1    C2         N

  (the slide also shows a small example table of attribute values and P/N classes)
• Parameters: df = k - 1 and the p% level of significance
  – the Chi2 algorithm provides an automatic way to adjust p
(A sketch of the statistic follows below.)
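
A minimal sketch of the chi-squared statistic ChiMerge computes for a pair of adjacent intervals over k classes; the example counts are illustrative.

```python
def chi_squared(table):
    """table: 2 rows (adjacent intervals) x k columns (classes) of observed counts A_ij."""
    row_totals = [sum(row) for row in table]            # R_i
    col_totals = [sum(col) for col in zip(*table)]      # C_j
    n = sum(row_totals)                                 # N
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, a_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n    # expected count under independence
            if e_ij > 0:
                chi2 += (a_ij - e_ij) ** 2 / e_ij
    return chi2

# Two adjacent intervals with class counts (C1, C2); a low value (below the
# threshold at the chosen significance level) suggests the intervals can be merged.
print(chi_squared([[2, 1], [1, 2]]))   # ~0.667
```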

Summary
• Data have many forms
  – attribute-vector data is the most common form
• Raw data need to be prepared and preprocessed for data mining
  – data miners have to work on the data provided
  – domain expertise is important in DPP
  – data preparation: normalization, transformation
  – data preprocessing: cleaning and reduction
• DPP is a critical and time-consuming task
  – Why?

Bibliography
• H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
• M. Kantardzic, 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE and Wiley Inter-Science.
• H. Liu & H. Motoda, editors, 2001. Instance Selection and Construction for Data Mining. Kluwer.
• H. Liu, F. Hussain, C. L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6: 393-423.