Data Preprocessing 1

The need for data preprocessing
  • Problems with huge real-world databases:
      – Incomplete data: missing values
      – Noisy data
      – Inconsistent data
  • These problems affect the data mining process, especially the patterns mined

Techniques
  • Data cleaning
  • Data integration
  • Data transformation
  • Data reduction
  These techniques improve the quality of the patterns mined and/or reduce the time required for the actual mining.

Data Cleaning – Missing values
  Tuples have no recorded value for several attributes.
  • Ignore the tuple
  • Fill in the missing value:
      – Using a global constant
      – Using ‘measured’ values: the attribute mean or the most probable value
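The options on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function names `drop_incomplete` and `fill_missing`, the use of `None` to mark a missing value, and the sample data are all assumptions made for the example, not part of the original slides. The "most probable value" option is omitted because it needs a predictive model.

```python
from statistics import mean


def drop_incomplete(rows):
    """Ignore the tuple: keep only rows with no missing (None) value."""
    return [row for row in rows if None not in row]


def fill_missing(values, strategy="mean", constant=None):
    """Fill None entries with a global constant or the attribute mean."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        if not observed:
            raise ValueError("cannot compute a mean without observed values")
        fill = mean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical sample data: (age, income) tuples with gaps.
rows = [(20, 1500), (None, 1800), (40, None), (30, 2100)]
complete_rows = drop_incomplete(rows)

ages = [age for age, _ in rows]
ages_mean_filled = fill_missing(ages, strategy="mean")
ages_const_filled = fill_missing(ages, strategy="constant", constant=-1)
```

Ignoring tuples is simple but discards data, whereas filling with the mean keeps every tuple at the cost of flattening the attribute's distribution.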

Data Cleaning – Noisy data
  Noise: random error or variance in a measured variable.
  • Binning: smooths a sorted data value by consulting its ‘neighborhood’ (local smoothing)
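One common form of binning is smoothing by bin means: the sorted values are split into equal-size bins and each value is replaced by the mean of its bin. The sketch below assumes this variant; the function name `smooth_by_bin_means`, the bin size, and the sample values are illustrative choices, not taken from the slides.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of the bin it falls into."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        bin_mean = sum(current_bin) / len(current_bin)
        smoothed.extend([bin_mean] * len(current_bin))
    return smoothed


# Made-up sample values for illustration.
prices = [21, 4, 28, 15, 24, 8, 34, 21, 25]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
```

Each value is adjusted using only the values next to it in sorted order, which is why the technique counts as local smoothing.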

Data Cleaning – Noisy data (continued)
  • Clustering: detects outliers by grouping similar values
  • Regression: smooths data by fitting it to a function, for example with linear regression or multiple linear regression
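Regression-based smoothing can be illustrated with an ordinary least-squares line fit on one predictor. This is a sketch under stated assumptions: the helper names `fit_line` and `smooth_by_regression` and the sample data are hypothetical, and only the single-variable linear case is shown, not multiple linear regression or clustering.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Made-up noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
smoothed_ys = smooth_by_regression(xs, ys)
```

The smoothed values all lie on the fitted line, so the random variation around the trend is removed while the trend itself is kept.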

Data Integration
  • Combine data from multiple sources into a coherent data store
  • Schema integration: the entity identification problem
  • Redundancy: detected by correlation analysis
  • Detection and resolution of data value conflicts: semantic heterogeneity and different representations
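For numeric attributes, correlation analysis is often done with the Pearson correlation coefficient: a value close to +1 or -1 suggests that one attribute can largely be derived from the other. The sketch below assumes that approach; the function names, the 0.95 threshold, and the sample attributes are illustrative assumptions, not values from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(var_x * var_y)


def looks_redundant(xs, ys, threshold=0.95):
    """Flag a pair of attributes whose |r| reaches the assumed threshold."""
    return abs(pearson(xs, ys)) >= threshold


# Hypothetical attributes from two sources: the same height in two units.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]
shoe_size = [41, 37, 44, 39]

r_heights = pearson(height_cm, height_in)
```

Here `height_in` is a rescaled copy of `height_cm`, so the coefficient is 1 up to rounding and one of the two attributes could be dropped after integration.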

Data Transformation
  • Data are transformed or consolidated into forms appropriate for mining
  • Involves:
      – Smoothing
      – Aggregation
      – Generalisation
      – Normalisation
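The slide does not say which normalisation method is meant. Two widely used ones are min-max scaling to the range [0, 1] and z-score standardisation; the sketch below shows both as an illustration, with function names and sample data chosen for the example.

```python
from statistics import mean, pstdev


def min_max_normalise(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("min-max scaling is undefined for constant values")
    return [(v - low) / (high - low) for v in values]


def z_score_normalise(values):
    """Centre values on their mean and divide by the population std. deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for constant values")
    return [(v - mu) / sigma for v in values]


# Made-up income figures (in thousands) for illustration.
incomes = [10, 20, 30, 40, 50]
scaled = min_max_normalise(incomes)
standardised = z_score_normalise(incomes)
```

Min-max scaling preserves the relative spacing of the values inside a fixed range, while z-scores are more convenient when the minimum and maximum are unknown or distorted by outliers.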

Data Reduction
  • Obtain a reduced representation of the data set that is much smaller in volume while maintaining the integrity of the original data
  • Strategies:
      – Data cube aggregation
      – Dimension reduction
      – Data compression
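Data cube aggregation can be pictured as rolling detailed records up to a coarser level, for example summing quarterly figures into yearly totals. The sketch below shows only that one strategy, on a single dimension; the function name `aggregate_by_year` and the sales figures are made up for the example.

```python
from collections import defaultdict


def aggregate_by_year(rows):
    """Roll (year, quarter, amount) records up to one total per year."""
    totals = defaultdict(float)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


# Hypothetical quarterly sales records.
quarterly_sales = [
    (2021, "Q1", 200.0), (2021, "Q2", 300.0),
    (2021, "Q3", 250.0), (2021, "Q4", 250.0),
    (2022, "Q1", 300.0), (2022, "Q2", 300.0),
    (2022, "Q3", 300.0), (2022, "Q4", 300.0),
]
yearly_sales = aggregate_by_year(quarterly_sales)
```

Eight records shrink to two, yet any analysis that only needs yearly totals gets the same answer as it would from the original data.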