Lecture 1-2: Data and Data Preparation
Phayung Meesad, Ph.D.
King Mongkut’s University of Technology North Bangkok (KMUTNB), Bangkok, Thailand
6/7/2021, Data Mining

A Framework of Data Mining
[Figure: framework diagram]

What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
    • Examples: eye color of a person, temperature, etc.
    • An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
    • An object is also known as a record, point, case, sample, entity, or instance

Example of Data
[Figure: example data table with labels "Attributes" and "Objects"]

Attribute Data Types
• Numeric
    • Continuous
    • Discrete
• Nominal
• Ordinal
• String
• Date

Types of Data Sets
• Record
    • Data Matrix
    • Document Data
    • Transaction Data
• Graph
    • World Wide Web
    • Molecular Structures
• Ordered
    • Spatial Data
    • Temporal Data
    • Sequential Data
    • Genetic Sequence Data

Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
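
The m by n view can be sketched in a few lines. The three objects and their two numeric attributes below are made-up values for illustration only; treating each row as a point also makes the distance between two objects well defined.

```python
import math

# Hypothetical data: 3 objects (rows), each with the same 2 numeric attributes (columns).
data_matrix = [
    [1.0, 2.0],
    [4.0, 6.0],
    [7.0, 8.0],
]

m = len(data_matrix)     # number of objects (rows)
n = len(data_matrix[0])  # number of attributes (columns)

# Because every row is a point in n-dimensional space,
# Euclidean distance between two objects is meaningful.
d = math.dist(data_matrix[0], data_matrix[1])
```

Here `d` is the distance between the first two objects; `math.dist` requires Python 3.8 or later.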

Document Data
• Each document becomes a "term" vector:
    • Each term is a component (attribute) of the vector
    • The value of each component is the number of times the corresponding term occurs in the document
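
A minimal sketch of the term-vector idea follows. The two example documents and the whitespace tokenization are assumptions made for illustration; real text usually needs more careful tokenization.

```python
from collections import Counter

def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    # Assumption: terms are lower-cased, whitespace-separated tokens.
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors

docs = ["data mining finds patterns in data", "mining text data"]
vocabulary, vectors = term_vectors(docs)
```

Each position in a vector corresponds to one vocabulary term, so all documents end up with vectors of the same length.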

Transaction Data
• A special type of record data, where each record (transaction) involves a set of items

Graph Data
• Examples: generic graph and HTML links

Chemical Data
• Benzene molecule: C6H6

Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions, with labels "Items/Events" and "An element of the sequence"]

Ordered Data
• Genomic sequence data

Ordered Data
• Spatio-temporal data
[Figure: average monthly temperature of land and ocean]

Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
    • Noise and outliers
    • Missing values
    • Duplicate data

Noise
• Noise refers to modification of original values
    • Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen
[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]

Outliers
• Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set

Missing Values
• Reasons for missing values
    • Information is not collected (e.g., people decline to give their age and weight)
    • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
    • Eliminate data objects
    • Estimate missing values
    • Ignore the missing value during analysis
    • Replace with all possible values (weighted by their probabilities)
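
Two of the handling strategies above can be sketched as follows. The records, the use of `None` to mark a missing value, and the choice of the attribute mean as the estimate are all assumptions for illustration; the slides do not prescribe a particular estimator.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "weight": 60.0},
    {"age": None, "weight": 72.5},
    {"age": 35, "weight": None},
    {"age": 30, "weight": 80.0},
]

def drop_incomplete(rows):
    """Eliminate data objects that have any missing attribute."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, attribute):
    """Estimate missing values of one attribute by the mean of its observed values."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    # Build new dicts so the original records are left unchanged.
    return [{**r, attribute: fill if r[attribute] is None else r[attribute]} for r in rows]

complete_only = drop_incomplete(records)
age_filled = impute_mean(records, "age")
```

Dropping objects is simple but discards information; estimating keeps every object at the cost of introducing values that were never observed.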

Duplicate Data
• A data set may include data objects that are duplicates, or almost duplicates, of one another
    • Major issue when merging data from heterogeneous sources
• Examples:
    • Same person with multiple email addresses
• Data cleaning
    • Process of dealing with duplicate data issues

Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
    • Data reduction: reduce the number of attributes or objects
    • Change of scale: cities aggregated into regions, states, countries, etc.
    • More “stable” data: aggregated data tends to have less variability
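
The data-reduction and stability effects can be seen on a small made-up series. The twelve daily values and the block size of four are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def aggregate(values, group_size):
    """Replace each consecutive block of group_size objects by its mean."""
    return [mean(values[i:i + group_size]) for i in range(0, len(values), group_size)]

# Hypothetical daily measurements, aggregated into blocks of four days.
daily = [10, 14, 9, 15, 11, 13, 12, 8, 16, 10, 14, 12]
blocks = aggregate(daily, 4)

spread_before = pstdev(daily)   # variability of the original objects
spread_after = pstdev(blocks)   # variability after aggregation
```

Twelve objects become three, and the spread of the aggregated values is smaller than the spread of the daily values.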

Sampling
• Sampling is the main technique employed for data selection
    • It is often used for both the preliminary investigation of the data and the final data analysis
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming

Sampling (continued)
• The key principle for effective sampling is the following:
    • Using a sample will work almost as well as using the entire data set, if the sample is representative
    • A sample is representative if it has approximately the same property (of interest) as the original set of data

Types of Sampling
• Simple random sampling
    • There is an equal probability of selecting any particular item
• Sampling without replacement
    • As each item is selected, it is removed from the population
• Sampling with replacement
    • Objects are not removed from the population as they are selected
• Stratified sampling
    • Split the data into several partitions; then draw random samples from each partition
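
The sampling schemes above map onto Python's standard `random` module roughly as follows. The population, the stratum key (`x % 4`), and the fixed seed are assumptions for illustration; drawing the same number of objects from every stratum is one possible stratified design, not the only one.

```python
import random
from collections import defaultdict

def sample_without_replacement(items, k, rng):
    """Each selected item is removed from the pool, so no item repeats."""
    return rng.sample(items, k)

def sample_with_replacement(items, k, rng):
    """Items stay in the pool, so the same item may be drawn more than once."""
    return rng.choices(items, k=k)

def stratified_sample(items, key, k_per_stratum, rng):
    """Partition items by key, then draw a simple random sample from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample

rng = random.Random(0)            # seeded only to make the sketch reproducible
population = list(range(100))     # hypothetical population of 100 objects
without = sample_without_replacement(population, 10, rng)
with_repl = sample_with_replacement(population, 10, rng)
strat = stratified_sample(population, lambda x: x % 4, 3, rng)
```

With four strata and three draws per stratum, `strat` contains twelve objects, three from each partition.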

Sample Size
[Figure: the same data set shown with 8000 points, 2000 points, and 500 points]

Sample Size
• What sample size is necessary to get at least one object from each of 10 groups?
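
One way to approach this question is to compute the probability that a sample of a given size hits every group. The sketch below assumes the groups are equally likely on each draw and that draws are independent (sampling with replacement, or a population large enough that it makes no practical difference); the slides do not state these conditions. Under those assumptions, inclusion-exclusion gives P(all g groups seen in n draws) = sum over j of (-1)^j C(g, j) (1 - j/g)^n.

```python
from math import comb

def prob_all_groups(n, groups=10):
    """Probability that n independent draws hit every one of `groups` equally likely groups."""
    return sum(
        (-1) ** j * comb(groups, j) * (1 - j / groups) ** n
        for j in range(groups + 1)
    )

def min_sample_size(target, groups=10):
    """Smallest n whose coverage probability reaches `target` (0 < target < 1)."""
    n = groups  # fewer draws than groups can never cover all groups
    while prob_all_groups(n, groups) < target:
        n += 1
    return n

n_needed = min_sample_size(0.9)
```

With exactly 10 draws the probability is 10!/10^10, which is well below 0.1%, so a sample must be considerably larger than the number of groups before full coverage becomes likely.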

Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
[Figure: experiment]
    • Randomly generate 500 points
    • Compute the difference between the max and min distance between any pair of points
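
The experiment described above can be reproduced approximately as follows. Drawing points uniformly from the unit hypercube, dividing the max-min difference by the minimum distance (so that values are comparable across dimensions), and the particular dimensions tried are assumptions; the slide only specifies 500 random points and the max-min difference.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    points = [tuple(rng.random() for _ in range(n_dims)) for _ in range(n_points)]
    dists = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # seeded only to make the sketch reproducible
contrasts = {d: relative_contrast(500, d, rng) for d in (2, 10, 50)}
```

As the number of dimensions grows, the nearest and farthest pairs end up at more similar distances, so the relative contrast shrinks.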

Dimensionality Reduction
• Purpose:
    • Avoid the curse of dimensionality
    • Reduce the amount of time and memory required by data mining algorithms
    • Allow data to be more easily visualized
    • May help to eliminate irrelevant features or reduce noise
• Techniques
    • Principal Component Analysis
    • Singular Value Decomposition
    • Others: supervised and non-linear techniques

Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in the data
[Figure: data plotted on axes x1 and x2, with projection direction e]

Dimensionality Reduction: PCA
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
[Figure: data plotted on axes x1 and x2, with eigenvector e]
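
For two attributes, the leading eigenvector of the covariance matrix has a closed form, which allows a dependency-free sketch of the idea. This is restricted to the 2-D case as a simplifying assumption; higher-dimensional data would need a general eigen-decomposition routine. The sample points are made up.

```python
import math

def principal_axis_2d(points):
    """Return (mean, leading eigenvector e, largest eigenvalue) of the 2x2 sample covariance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form angle of the eigenvector with the largest eigenvalue (symmetric 2x2 case).
    theta = 0.5 * math.atan2(2 * b, a - c)
    e = (math.cos(theta), math.sin(theta))
    largest = (a + c) / 2 + math.hypot((a - c) / 2, b)
    return (mx, my), e, largest

def project(points, mean, e):
    """Coordinates of the centered points along direction e (the new 1-D space)."""
    return [(x - mean[0]) * e[0] + (y - mean[1]) * e[1] for x, y in points]

# Hypothetical points lying exactly on the line x2 = 2 * x1.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
center, e, largest = principal_axis_2d(points)
scores = project(points, center, e)
```

For these points `e` is parallel to the line the data lie on, and the variance of `scores` equals the largest eigenvalue, so the one-dimensional projection retains all of the variation.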

Feature Subset Selection
• Another way to reduce the dimensionality of data
• Redundant features
    • Duplicate much or all of the information contained in one or more other attributes
    • Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
    • Contain no information that is useful for the data mining task at hand
    • Example: students' ID is often irrelevant to the task of predicting students' GPA

Feature Subset Selection
• Techniques:
    • Brute-force approach: try all possible feature subsets as input to the data mining algorithm
    • Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
    • Filter approaches: features are selected before the data mining algorithm is run
    • Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes
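
The brute-force approach can be sketched as an exhaustive loop over subsets. The feature names, the `usefulness` table, and the scoring rule (usefulness minus a small cost per feature) are hypothetical stand-ins; in practice the score would come from running the data mining algorithm on each subset, as in the wrapper approach.

```python
from itertools import combinations

def brute_force_select(features, score):
    """Evaluate every non-empty feature subset and return the best one with its score."""
    best_subset, best_score = (), float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Hypothetical usefulness of each feature for predicting GPA.
usefulness = {"hours_studied": 3.0, "attendance": 2.0, "student_id": 0.0}

def toy_score(subset):
    # Stand-in for model quality: reward useful features, charge 0.5 per feature used.
    return sum(usefulness[f] for f in subset) - 0.5 * len(subset)

best_subset, best_score = brute_force_select(list(usefulness), toy_score)
```

With n features there are 2^n - 1 non-empty subsets to evaluate, which is why exhaustive search is only practical for small n.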

Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• Three general methodologies:
    • Feature extraction: domain-specific
    • Mapping data to a new space
    • Feature construction: combining features

Mapping Data to a New Space
• Fourier transform
• Wavelet transform
[Figure: three panels, "Two Sine Waves", "Two Sine Waves + Noise", and "Frequency"]
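
The idea behind the figure can be sketched with a plain discrete Fourier transform. The signal length (128), the two frequencies (7 and 17 cycles per window), and the noise level are assumptions for illustration; the direct O(n^2) transform is used only to keep the sketch dependency-free, where an FFT routine would normally be preferred.

```python
import cmath
import math
import random

def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform for frequencies 0 .. n/2 - 1."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2)
    ]

rng = random.Random(0)  # seeded only to make the sketch reproducible
n = 128
# Two sine waves (7 and 17 cycles per window) plus Gaussian noise.
signal = [
    math.sin(2 * math.pi * 7 * t / n) + math.sin(2 * math.pi * 17 * t / n) + rng.gauss(0, 0.3)
    for t in range(n)
]

mags = dft_magnitudes(signal)
# Indices of the two strongest frequency components.
top_two = sorted(sorted(range(len(mags)), key=mags.__getitem__)[-2:])
```

In the time domain the noisy signal looks irregular, but in the frequency domain the two underlying sine waves stand out as the two largest peaks.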

Attribute Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
    • Simple functions: x^k, log(x), e^x, |x|
    • Standardization and normalization
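
The slide names standardization and normalization without defining them. The sketch below assumes the common readings: z-score standardization (subtract the mean, divide by the standard deviation) and min-max normalization to the range [0, 1]. The sample values are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: shift to mean 0 and scale to (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]

def min_max_normalize(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical attribute values
z = standardize(values)
scaled = min_max_normalize(values)
```

Both transformations preserve the order of the values, so each old value can still be identified with its replacement.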

Conclusion
• Definition
• Data Mining Tasks
• Attribute Data Types
• Types of Data Sets
• Data Preprocessing