Data Preprocessing Chapter 2 10312021 DWDM Data Preprocessing




































![-- Data Transformation: Normalization n Min-max normalization: to [new_min. A, new_max. A] n n -- Data Transformation: Normalization n Min-max normalization: to [new_min. A, new_max. A] n n](https://slidetodoc.com/presentation_image_h2/fde2017990251d6ab1bf66cd68060c12/image-37.jpg)


























- Slides: 63
Data Preprocessing Chapter 2 10/31/2021 DW/DM: Data Preprocessing 1
Chapter Objectives n n 10/31/2021 Realize the importance of data preprocessing for real world data before data mining or construction of data warehouses. Get an overview of some data preprocessing issues and techniques. DW/DM: Data Preprocessing 2
The course (4) DS DS OLAP (2) Data Preprocessing (3) DW DS DM Association (5) Classification (6) Clustering (7) DS = Data source DW = Data warehouse DM = Data Mining 10/31/2021 DW/DM: Data Preprocessing 3
… - The Chapter (2. 3) (2. 4) (2. 5) 10/31/2021 DW/DM: Data Preprocessing 4
- Chapter Outline n Introduction (2. 1, 2. 2) n Data Cleaning (2. 3) n Data Integration (2. 4) n Data Transformation (2. 4) n Data Reduction (2. 5) n Concept Hierarchy (2. 6) 10/31/2021 DW/DM: Data Preprocessing 5
- Introduction n 10/31/2021 Introduction n Why Preprocess the Data n Where Preprocess Data n Identifying Typical properties of Data DW/DM: Data Preprocessing (2. 1) (2. 2) 6
-- Why Preprocess the Data … n 10/31/2021 A well-accepted multi-dimensional measure of data quality: n Accuracy n Completeness n Consistency n Timeliness n Believability n Value added n Interpretability n Accessibility DW/DM: Data Preprocessing 7
-- … Why Preprocess the Data n n n 10/31/2021 Reason for data cleaning n Incomplete data (missing data) n Noisy data (contains errors) n Inconsistent data (containing discrepancies) Reasons for data integration n Data comes from multiple sources Reason for data transformation n Some data must be transformed to be used for mining Reasons for data reduction n Performance No quality data no quality mining results! DW/DM: Data Preprocessing 8
-- Where Preprocess Data DS DS OLAP SD DW Data preprocessing is done here, In the Staging Database DS Association Classification Clustering DS = Data source DW = Data warehouse DM = Data Mining SD = Staging Database 10/31/2021 DM DW/DM: Data Preprocessing 9
-- Identifying Typical Properties of Data n 10/31/2021 Descriptive Data Summarization techniques can be used to identify the typical properties of data and helps which data values should be treated as noise. For many data preprocessing tasks it is useful to know the following measures of the data n The central tendency n The dispersion DW/DM: Data Preprocessing 10
--- Measuring the Central Tendency … n n 10/31/2021 Central Tendency measures n Mean n Median n Mode n Midrange: (max() – min())/2 For data mining purposes, we need to know how to compute these measures efficiently in large databases. It is important to know whether the measure is: n distributive n Algebraic or n holistic DW/DM: Data Preprocessing 11
… --- Measuring the Central Tendency n Distributive measure: A measure that can be computed by partitioning the data, compute the measure for each partition, and the merge the results to arrive at the measure’s value for the entire data. n n Algebraic measure: is a measure that can be computed by applying an algebraic function to one or more distributed measures. n n eg. Sum(), count(), max(), min(). eg. Avg() which is sum()/count() Holistic measure. You need the entire data to compute the measure n 10/31/2021 eg. median DW/DM: Data Preprocessing 12
--- Measuring the Dispersion n n 10/31/2021 Dispersion or variance is the degree to which numerical data tends to spread. The most common measures are: n Standard deviation n Range: max() – min() n Quartiles n The five-number summary n Interquartile range (IQR) n Boxplot Analysis DW/DM: Data Preprocessing 13
---- Quartiles n n n The kth percentile of a data sorted in ascending order is the value x having the property the k percent of the data entries lie at or below x. The first quartile, Q 1, is the 25 th percentile, Q 2 and median the 50 th percentile, and Q 3 is the 75 th percentile. IQR is Q 3 – Q 1 and is a simple measure that gives the spread of the middle half. n n The 5 -number summary: The min, Q 1, median, Q 3, the max n 10/31/2021 A common rule of thump for identifying suspected outliers is to single out values 1. 5 * IQR above Q 3 or below Q 1. Box plots can be plotted based on the 5 -number summary and are useful tools for identifying outliers. DW/DM: Data Preprocessing 14
---- Boxplot Analysis n Boxplot Highest value n Data is represented with a box n The ends of the box are Q 1 and Q 3, n n n 10/31/2021 i. e. , the height of the box is IRQ The median is marked by a line within the box Whiskers: two lines outside the box extend to Minimum and Maximum DW/DM: Data Preprocessing Q 3 Median Q 1 Whisker Lowest value 15
- Chapter Outline n Introduction (2. 1, 2. 2) n Data Cleaning (2. 3) n Data Integration (2. 4) n Data Transformation (2. 4) n Data Reduction (2. 5) n Concept Hierarchy (2. 6) 10/31/2021 DW/DM: Data Preprocessing 16
- Data Cleaning n Importance n n 10/31/2021 “Data cleaning is the number one problem in data warehousing” In data cleaning, the following data problems are resolved: n Incomplete data (missing data) n Noisy data (contains errors) n Inconsistent data (containing discrepancies) DW/DM: Data Preprocessing 17
-- Missing Data n Data is not always available n n Missing data may be due to n equipment malfunction n inconsistent with other recorded data and thus deleted n data not entered due to misunderstanding n 10/31/2021 E. g. , many tuples have no recorded value for several attributes, such as customer income in sales data certain data may not be considered important at the time of entry DW/DM: Data Preprocessing 18
--- How to Handle Missing Data? n n Fill in missing value manually (often unfeasible) Fill in with a global constant. Unknown or n/a not recommended (data mining algorithm will see this as a normal value) n Fill in with attribute mean or median n Fill in with class mean or median (classes need to be known) n n n 10/31/2021 Fill in with most likely value (using regression, decision trees, most similar records, etc. ) Use other attributes to predict value (e. g. if a postcode is missing use suburb value) Ignore the record DW/DM: Data Preprocessing 19
-- Noisy Data n Noise: random error or variance in a measured variable n Incorrect attribute values may due to n faulty data collection n data entry problems n data transmission problems n data conversion errors n Data decay problems n technology limitations, e. g. buffer overflow or field size limits 10/31/2021 DW/DM: Data Preprocessing 20
--- How to Handle Noisy Data? n Binning n n Regression n n smooth by fitting the data into regression functions Clustering n n First sort data and partition into (equal-frequency) bins, then one can smooth by bin means, or by bin median, or by bin boundaries, etc. detect and remove outliers Combined computer and human inspection n 10/31/2021 detect suspicious values and check by human. DW/DM: Data Preprocessing 21
--- Binning Methods for Data Smoothing n Sorted data for price: 4, 8, 9, 15, 21, 24, 25, 26, 28, 29, 34 n Partition into equal-frequency (equi-depth) bins: n n 4, 8, 9, 15 21, 24, 25 26, 28, 29, 34 Smoothing by bin means: n n Bin 1: Bin 2: Bin 3: 9, 9, 9, 9 23, 23, 23 29, 29, 29 Smoothing by bin boundaries: n n n 10/31/2021 Bin 1: Bin 2: Bin 3: 4, 4, 4, 15 21, 25, 25 26, 26, 34 DW/DM: Data Preprocessing 22
--- Regression y Y 1 y=x+1 Y 1’ X 1 10/31/2021 DW/DM: Data Preprocessing x 23
--- Cluster Analysis 10/31/2021 DW/DM: Data Preprocessing 24
-- Inconsistent data n n Inconsistent data can be due to: n data entry errors n data integration errors (different formats, codes, etc. ) Handling inconsistent data n n 10/31/2021 Important to have data entry verification (check both format and values of data entered) Correct with help of external reference data DW/DM: Data Preprocessing 25
-- Data Cleaning as a Process n Data discrepancy detection n Use metadata (e. g. , domain, range, correlation, distribution, DDS) n Check field overloading n Inconsistent use of codes (e. g. 5/12/2004 and 12/5/2004) n Check uniqueness rule, consecutive rule, and null rule n Use commercial tools n n 10/31/2021 Data scrubbing: use simple domain knowledge (e. g. , postal code, spell-check) to detect errors and make corrections Data auditing: by analyzing data to discover rules and relationship to detect violators (e. g. , correlation and clustering to find outliers) DW/DM: Data Preprocessing 26
n The normal (distribution) curve --- Properties of Normal Distribution Curve n From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation) n From μ– 2σ to μ+2σ: contains about 95% of it n From μ– 3σ to μ+3σ: contains about 99. 7% of it 10/31/2021 DW/DM: Data Preprocessing 27
--- Correlation Negative correlation Positive correlation No correlation 10/31/2021 DW/DM: Data Preprocessing 28
- Chapter Outline n Introduction (2. 1, 2. 2) n Data Cleaning (2. 3) n Data Integration (2. 4) n Data Transformation (2. 4) n Data Reduction (2. 5) n Concept Hierarchy (2. 6) 10/31/2021 DW/DM: Data Preprocessing 29
- Data Integration n n Data integration: Combines data from multiple sources into a coherent data store Main problems: n Entity identification problem: n n n Identify real world entities from multiple data sources, e. g. , A. cust-id B. cust-# Redundancy problem: n An attribute is redundant if it can be derived from other attribute(s). n Inconsistencies in attribute naming can cause redundancy Solutions: n n 10/31/2021 Entity identification problems can be resolved using metadata Some redundancy problems can be also be resolved using metadata and some others can be resolved correlation analysis. DW/DM: Data Preprocessing 30
-- Correlation Negative correlation Positive correlation No correlation 10/31/2021 DW/DM: Data Preprocessing 31
--- Correlation Analysis (Numerical Data) n Correlation coefficient (also called Pearson’s product moment coefficient) where n is the number of tuples, and are the respective means of A and B, σA and σB are the respective standard deviation of A and B, and Σ(AB) is the sum of the AB cross-product. n If r. A, B > 0, A and B are positively correlated n r. A, B = 0: independent n r. A, B < 0: negatively correlated 10/31/2021 DW/DM: Data Preprocessing 32
--- Correlation Analysis (Categorical Data) n Χ 2 (chi-square) test n The larger the Χ 2 value, the more likely the variables are related n 10/31/2021 The cells that contribute the most to the Χ 2 value are those whose actual count is very different from the expected count DW/DM: Data Preprocessing 33
--- Chi-Square Calculation: An Example n n Play chess Not play chess Sum (row) Like science fiction 250(90) 200(360) 450 Not like science fiction 50(210) 1000(840) 1050 Sum(col. ) 300 1200 1500 Χ 2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) It shows that like_science_fiction and play_chess are correlated in the group 10/31/2021 DW/DM: Data Preprocessing 34
- Chapter Outline n Introduction (2. 1, 2. 2) n Data Cleaning (2. 3) n Data Integration (2. 4) n Data Transformation (2. 4) n Data Reduction (2. 5) n Concept Hierarchy (2. 6) 10/31/2021 DW/DM: Data Preprocessing 35
- Data Transformation n In data transformation, data is transformed or consolidated to forms appropriate for mining. Data transformation can involve: n n n Smoothing: remove noise from data using binning, regression, or clustering. Aggregation: E. g. sales data can be aggregated to monthly. Generalization: concept hierarchy climbing. E. g. cities can be generalized to countries. Ages can be generalized to youth, middleaged, and senior. Normalization: Attribute data scaled to fall within a small, specified range Attribute/feature construction: New attributes constructed from the given ones 10/31/2021 DW/DM: Data Preprocessing 36
-- Data Transformation: Normalization n Min-max normalization: to [new_min. A, new_max. A] n n Z-score normalization (μ: mean, σ: standard deviation): n n Ex. Let income range $12, 000 to $98, 000 normalized to [0. 0, 1. 0]. Then $73, 000 is mapped to Ex. Let μ = 54, 000, σ = 16, 000. Then Where scaling j is the smallest integer such that Max(|ν’|) < 1 Normalization by decimal 10/31/2021 DW/DM: Data Preprocessing 37
-- Attribute/feature construction n Sometimes it is helpful or necessary to construct new attributes or features n n 10/31/2021 Helpful for understanding and accuracy For example: Create attribute volume based on attributes height, depth and width Construction is based on mathematical or logical operations Attribute construction can help to discover missing information about the relationships between data attributes DW/DM: Data Preprocessing 38
- Chapter Outline n Introduction (2. 1, 2. 2) n Data Cleaning (2. 3) n Data Integration (2. 4) n Data Transformation (2. 4) n Data Reduction (2. 5) n Concept Hierarchy (2. 6) 10/31/2021 DW/DM: Data Preprocessing 39
- Data Reduction n n The data is often too large. Reducing the data can improve performance. Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) results. Data Reduction Includes: n Reducing the number of rows (objects) n Reducing the number of attributes (features) n Compression n Discretization (will be covered in the next section) 10/31/2021 DW/DM: Data Preprocessing 40
-- Reducing the number of Rows n Aggregation n n We can have multiple levels of aggregation. E. g. , Weekly, monthly, quarterly, yearly, and so on. For data reduction use the highest aggregation level which is enough Numerosity reduction n 10/31/2021 Aggregation of data in to a higher concept level. Data volume can be reduced by choosing alternative forms of data representation DW/DM: Data Preprocessing 41
--- Types of Numerosity reduction n Parametric: n Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) n n 10/31/2021 E. g. : Linear regression: Data are modeled to fit a straight line Non-parametric: n Histograms n Clustering n Sampling DW/DM: Data Preprocessing 42
---- Reduction with Histograms n n A popular data reduction technique. Divide data into buckets and store representation of buckets (sum, count, etc. ) Histogram Types: n n 10/31/2021 Equal-width: Divides the range into N intervals of equal size. Equal-depth: Divides the range into N intervals, each containing approximately same number of samples V-optimal: Considers all histogram types for a given number of buckets and chooses the one with the least variance. Max. Diff: After sorting the data to be approximated, the borders of the buckets are defined at points where the adjacent values have the maximum difference DW/DM: Data Preprocessing 43
---- Example: Histogram 10/31/2021 DW/DM: Data Preprocessing 44
---- Reduction with Clustering n Partition data into clusters based on “closeness” in space. Retain representatives of clusters (centroids) and outliers. Effectiveness depends upon the distribution of data. Hierarchical clustering is possible (multi-resolution). Outlier x x x 10/31/2021 DW/DM: Data Preprocessing Centroid 45
---- Reduction with Sampling n n n 10/31/2021 Allows a large data set to be represented by a much smaller random sample of the data (sub-set). Will the patterns in the sample represent the patterns in the data? How to select a random sample? n Simple random sample without replacement (SRSWOR) n Simple random sampling with replacement (SRSWR) n Cluster sample (SRSWOR or SRSWR from clusters) n Stratified sample (stratum = group based on attribute value) DW/DM: Data Preprocessing 46
----Sampling OR dom W S SR an ut r e l p (sim le witho p sam ement) c repla SRSW R Raw Data 10/31/2021 DW/DM: Data Preprocessing 47
Sampling Example Cluster/Stratified Sample Raw Data 10/31/2021 DW/DM: Data Preprocessing 48
-- Reduce the number of Attributes n Reduce the number of attributes or dimensions or features. n Select a minimum set of attributes (features) that is sufficient for the data mining or analytical task. Purpose: n n n 10/31/2021 Avoid curse of dimensionality which creates sparse data space and bad clusters. Reduce amount of time and memory required by data mining algorithms n Allow data to be more easily visualized n May help to eliminate irrelevant and duplicate features or reduce noise DW/DM: Data Preprocessing 49
--- Reduce the number of Attributes: techniques n Step-wise forward selection: n n E. g. : {} {A 1} {A 1, A 3, A 5} Step-wise backward elimination: n E. g. : {A 1, A 2, A 3, A 4, A 5} {A 1, A 3, A 5} n Combining forward selection and backward elimination n Decision-tree induction (This will be covered in Chapter 5). 10/31/2021 DW/DM: Data Preprocessing 50
-- Data Compression n 10/31/2021 Data compression reduces the size of data and can be used for all sorts of data. n saves storage space. n saves communication time. There is lossless compression and lossy compression. E. g. , ZIP, Discrete wavelet transform (DWT), and Principal Component Analysis (PCA). For data mining, data compression is beneficial if data mining algorithms can manipulate compressed data directly without uncompressing it. Examples: String compression (e. g. ZIP, only allow limited manipulation of data. ) DW/DM: Data Preprocessing 51
- Chapter Outline n Introduction (2. 1, 2. 2) n Data Cleaning (2. 3) n Data Integration (2. 4) n Data Transformation (2. 4) n Data Reduction (2. 5) n Data discritization & Concept Hierarchy (2. 6) 10/31/2021 DW/DM: Data Preprocessing 52
- Discretization n n 10/31/2021 Three types of attributes: n Nominal — values from an unordered set, e. g. , color, profession n Ordinal — values from an ordered set, e. g. , military or academic rank n Continuous — real numbers, e. g. , integer or real numbers Discretization: n Divide the range of a continuous attribute into intervals n Some classification algorithms only accept categorical attributes. n Reduce data size by discretization n Prepare for further analysis DW/DM: Data Preprocessing 53
-- Discretization and Concept Hierarchy n Discretization n Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals n n Interval labels can then be used to replace actual data values n Supervised vs. unsupervised n Split (top-down) vs. merge (bottom-up) n Discretization can be performed recursively on an attribute Concept hierarchy formation n Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as young, middle-aged, or senior) 10/31/2021 DW/DM: Data Preprocessing 54
-- Discretization and Concept Hierarchy Generation for Numeric Data n Typical methods: All the methods can be applied recursively n Binning (covered above) n n Histogram analysis (covered above) n n Top-down split, unsupervised, Top-down split, unsupervised Clustering analysis (covered above) n Either top-down split or bottom-up merge, unsupervised n Entropy-based discretization: supervised, top-down split n Interval merging by 2 Analysis: unsupervised, bottom-up merge n Segmentation by natural partitioning: top-down split, unsupervised 10/31/2021 DW/DM: Data Preprocessing 55
--- Entropy-Based Discretization n n Given a set of samples S, if S is partitioned into two intervals S 1 and S 2 using boundary T, the information gain after partitioning is Entropy is calculated based on class distribution of the samples in the set. Given m classes, the entropy of S 1 is where pi is the probability of class i in S 1 n n The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization The process is recursively applied to partitions obtained until some stopping criterion is met 10/31/2021 DW/DM: Data Preprocessing 56
--- Interval Merge by 2 Analysis n Merging-based (bottom-up) vs. splitting-based methods n Merge: Find the best neighboring intervals and merge them to form larger intervals recursively n Chi. Merge n Initially, each distinct value of a numerical attr. A is considered to be one interval n 2 tests are performed for every pair of adjacent intervals n Adjacent intervals with the least 2 values are merged together, since low 2 values for a pair indicate similar class distributions n This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc. ) 10/31/2021 DW/DM: Data Preprocessing 57
--- Segmentation by Natural Partitioning n A simply 3 -4 -5 rule can be used to segment numeric data into relatively uniform, “natural” intervals. n If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals n If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals n If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals 10/31/2021 DW/DM: Data Preprocessing 58
---- Example of 3 -4 -5 Rule count Step 1: Step 2: -$351 -$159 Min Low (i. e, 5%-tile) msd=1, 000 profit Low=-$1, 000 (-$1, 000 - 0) (-$400 - 0) (-$200 -$100) (-$100 0) 10/31/2021 Max High=$2, 000 ($1, 000 - $2, 000) (0 -$ 1, 000) (-$400 -$5, 000) Step 4: (-$300 -$200) High(i. e, 95%-0 tile) $4, 700 (-$1, 000 - $2, 000) Step 3: (-$400 -$300) $1, 838 ($1, 000 - $2, 000) (0 - $1, 000) (0 $200) ($1, 000 $1, 200) ($200 $400) ($1, 200 $1, 400) ($1, 400 $1, 600) ($400 $600) ($600 $800) ($800 $1, 000) ($1, 600 ($1, 800) $2, 000) DW/DM: Data Preprocessing ($2, 000 - $5, 000) ($2, 000 $3, 000) ($3, 000 $4, 000) ($4, 000 $5, 000) 59
-- Concept Hierarchy Generation for Categorical Data … n n Categorical data n Discrete, finite cardinality, unordered n E. g. Geographic location, job category, product Problem: how to compose an order to categorical data n E. g. Organize location into categories n 10/31/2021 Street < city < province < country DW/DM: Data Preprocessing 60
… -- Concept Hierarchy Generation for Categorical Data n n A partial order of attributes at the schema level n Street < city < state < country A portion of a hierarchy by explicit data grouping n n A set of attributes n n n A partial order generated by cardinality of attributes E. g. , street < city <state < country Only a partial set of attributes n n 10/31/2021 {Jeddah, Riyadh} < Saudi Arabia E. g. , only street < city, not others System automatically fills in others DW/DM: Data Preprocessing 61
--- Automatic Concept Hierarchy Generation n Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set n n The attribute with the most distinct values is placed at the lowest level of the hierarchy Exceptions, e. g. , weekday, month, quarter, year 15 distinct values country province_or_ state 365 distinct values city 3567 distinct values street 10/31/2021 674, 339 distinct values DW/DM: Data Preprocessing 62
End 10/31/2021 DW/DM: Data Preprocessing 63