Lecture 1-2: Data and Data Preparation (Phayung Meesad)
- Slides: 38
Lecture 1-2: Data and Data Preparation
Phayung Meesad, Ph.D.
King Mongkut’s University of Technology North Bangkok (KMUTNB), Bangkok, Thailand
6/7/2021, Data Mining
A Framework of Data Mining
What is Data?
- A collection of data objects and their attributes
- An attribute is a property or characteristic of an object
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature
- A collection of attributes describes an object
  - An object is also known as a record, point, case, sample, entity, or instance
Example of Data
- Figure: a data table with the labels "Attributes" and "Objects"
Attribute Data Types
- Numeric
  - Continuous
  - Discrete
- Nominal
- Ordinal
- String
- Date
Types of Data Sets
- Record
  - Data Matrix
  - Document Data
  - Transaction Data
- Graph
  - World Wide Web
  - Molecular Structures
- Ordered
  - Spatial Data
  - Temporal Data
  - Sequential Data
  - Genetic Sequence Data
Record Data
- Data that consists of a collection of records, each of which consists of a fixed set of attributes
Data Matrix
- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
- Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
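As a minimal sketch of the m by n layout described on this slide, the rows below are objects and the columns are attributes. The attribute names and values are hypothetical and only serve as illustration.

```python
# Hypothetical data: 3 objects (rows) described by 2 numeric attributes (columns).
attributes = ["height_cm", "weight_kg"]  # illustrative attribute names, not from the slides
data_matrix = [
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
]

m = len(data_matrix)     # number of objects (rows)
n = len(data_matrix[0])  # number of attributes (columns)


def column(matrix, j):
    """Return attribute j as a list with one value per object."""
    return [row[j] for row in matrix]
```

Each row can be read as a point in an n-dimensional space; `column(data_matrix, 1)` collects the second attribute across all objects.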
Document Data
- Each document becomes a "term" vector
  - Each term is a component (attribute) of the vector
  - The value of each component is the number of times the corresponding term occurs in the document
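The term-vector idea can be sketched as follows, assuming a very simple tokenizer (lowercase, split on whitespace) and two made-up documents. Real text pipelines usually add stemming, stop-word removal, and similar steps that this sketch leaves out.

```python
from collections import Counter


def term_vectors(documents):
    """Map each document to a vector of term counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical documents used only for illustration.
docs = ["data mining finds patterns in data", "mining data is fun"]
vocabulary, vectors = term_vectors(docs)
# vocabulary: ['data', 'finds', 'fun', 'in', 'is', 'mining', 'patterns']
# vectors[0]: [2, 1, 0, 1, 0, 1, 1]
```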
Transaction Data
- A special type of record data, where each record (transaction) involves a set of items
Graph Data
- Examples: generic graph and HTML links
Chemical Data
- Benzene molecule: C6H6
Ordered Data
- Sequences of transactions
- Figure labels: "Items/Events", "An element of the sequence"
Ordered Data
- Genomic sequence data
Ordered Data
- Spatio-temporal data
- Figure: average monthly temperature of land and ocean
Data Quality
- What kinds of data quality problems?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems:
  - Noise and outliers
  - Missing values
  - Duplicate data
Noise
- Noise refers to modification of original values
  - Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen
- Figure panels: "Two Sine Waves", "Two Sine Waves + Noise"
Outliers
- Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
Missing Values
- Reasons for missing values
  - Information is not collected (e.g., people decline to give their age and weight)
  - Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
- Handling missing values
  - Eliminate data objects
  - Estimate missing values
  - Ignore the missing value during analysis
  - Replace with all possible values (weighted by their probabilities)
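Three of the handling strategies listed on this slide can be sketched as below. The records are hypothetical, `None` is assumed to mark a missing value, and mean imputation is only one of many possible ways to "estimate" a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value (an assumption of this sketch).
records = [
    {"age": 25, "weight": 60.0},
    {"age": None, "weight": 72.5},
    {"age": 41, "weight": None},
    {"age": 33, "weight": 80.0},
]


def eliminate_incomplete(rows):
    """Eliminate data objects: drop every object with at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, attribute):
    """Estimate missing values: fill gaps with the mean of the observed values."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]


def mean_ignoring_missing(rows, attribute):
    """Ignore missing values during analysis: compute a statistic on observed values only."""
    return mean(r[attribute] for r in rows if r[attribute] is not None)
```

Eliminating objects is the simplest option but discards the observed attributes of those objects too, which matters when many records are incomplete.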
Duplicate Data
- A data set may include data objects that are duplicates, or almost duplicates, of one another
  - Major issue when merging data from heterogeneous sources
- Examples:
  - Same person with multiple email addresses
- Data cleaning
  - Process of dealing with duplicate data issues
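The "same person with multiple email addresses" case might be handled roughly as follows. This sketch assumes that a normalized name plus birth date identifies a person, which is a hypothetical matching rule; the field names and records are made up for illustration.

```python
def normalize(name):
    """Collapse case and extra whitespace so near-duplicate names match."""
    return " ".join(name.lower().split())


def merge_duplicates(records):
    """Group records sharing a normalized (name, birth_date) key and collect their emails."""
    merged = {}
    for rec in records:
        key = (normalize(rec["name"]), rec["birth_date"])
        entry = merged.setdefault(
            key,
            {"name": rec["name"].strip(), "birth_date": rec["birth_date"], "emails": []},
        )
        if rec["email"] not in entry["emails"]:
            entry["emails"].append(rec["email"])
    return list(merged.values())


# Hypothetical records merged from two sources.
people = [
    {"name": "Ann Lee", "birth_date": "1990-01-01", "email": "ann@example.com"},
    {"name": " ann  lee", "birth_date": "1990-01-01", "email": "a.lee@example.org"},
    {"name": "Bob Ray", "birth_date": "1985-05-05", "email": "bob@example.com"},
]
cleaned = merge_duplicates(people)
```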
Data Preprocessing
Data Preprocessing
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Attribute transformation
Aggregation
- Combining two or more attributes (or objects) into a single attribute (or object)
- Purpose
  - Data reduction
    - Reduce the number of attributes or objects
  - Change of scale
    - Cities aggregated into regions, states, countries, etc.
  - More “stable” data
    - Aggregated data tends to have less variability
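The change-of-scale example (cities aggregated into regions) can be sketched as a group-and-sum. The city-to-region assignments and sales figures below are hypothetical.

```python
from collections import defaultdict

# Hypothetical city-level records: (city, region, sales).
city_sales = [
    ("Bangkok", "Central", 120.0),
    ("Nonthaburi", "Central", 45.0),
    ("Chiang Mai", "North", 60.0),
    ("Chiang Rai", "North", 25.0),
]


def aggregate_by_region(rows):
    """Combine city-level objects into one object per region by summing sales."""
    totals = defaultdict(float)
    for _city, region, sales in rows:
        totals[region] += sales
    return dict(totals)


region_sales = aggregate_by_region(city_sales)
# region_sales: {'Central': 165.0, 'North': 85.0}
```

Four objects become two, which illustrates both the data reduction and the change of scale mentioned above.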
Sampling
- Sampling is the main technique employed for data selection
  - It is often used for both the preliminary investigation of the data and the final data analysis
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
- Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming
Sampling …
- The key principle for effective sampling is the following:
  - Using a sample will work almost as well as using the entire data set, if the sample is representative
  - A sample is representative if it has approximately the same property (of interest) as the original set of data
Types of Sampling
- Simple random sampling
  - There is an equal probability of selecting any particular item
- Sampling without replacement
  - Each selected item is removed from the population
- Sampling with replacement
  - Objects are not removed from the population, so the same object can be selected more than once
- Stratified sampling
  - Split the data into several partitions; then draw random samples from each partition
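The sampling schemes on this slide map onto Python's standard `random` module roughly as sketched below. The helper names, the population, and the fixed seed are illustrative assumptions; the stratified version draws the same number of items from each partition, which is only one possible allocation rule.

```python
import random


def sample_without_replacement(population, k, rng):
    """Selected items leave the pool, so no item can appear twice."""
    return rng.sample(population, k)


def sample_with_replacement(population, k, rng):
    """Items stay in the pool, so the same item can be drawn more than once."""
    return rng.choices(population, k=k)


def stratified_sample(population, key, k_per_stratum, rng):
    """Partition by key(item), then draw a simple random sample from each partition."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    result = []
    for label in sorted(strata):
        members = strata[label]
        result.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return result


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
population = list(range(100))
```

For example, `stratified_sample(population, lambda x: x % 4, 3, rng)` treats the remainder modulo 4 as a hypothetical group label and returns three items from each of the four groups.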
Sample Size
- Figure panels: 8000 points, 2000 points, 500 points
Sample Size
- What sample size is necessary to get at least one object from each of 10 groups?
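One way to approach this question is to compute the probability that a sample of a given size covers all groups. The sketch below assumes equal-sized groups and independent draws that are uniform over groups (effectively sampling with replacement); neither assumption is stated on the slide. It uses the inclusion-exclusion formula for the probability that no group is missed.

```python
from math import comb


def prob_all_groups_covered(sample_size, groups=10):
    """P(at least one object from every group) under the stated assumptions."""
    return sum(
        (-1) ** k * comb(groups, k) * ((groups - k) / groups) ** sample_size
        for k in range(groups + 1)
    )


def smallest_sample_size(target_prob, groups=10):
    """Smallest sample size whose coverage probability reaches target_prob."""
    n = groups  # fewer draws than groups can never cover every group
    while prob_all_groups_covered(n, groups) < target_prob:
        n += 1
    return n
```

With exactly 10 draws the probability is 10!/10^10, which is well below 0.1 percent, so a sample several times larger than the number of groups is needed for reliable coverage.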
Curse of Dimensionality
- When dimensionality increases, data becomes increasingly sparse in the space that it occupies
- Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
- Experiment shown in the figure:
  - Randomly generate 500 points
  - Compute the difference between the max and min distance between any pair of points
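The experiment on this slide can be approximated with the sketch below. Two choices are assumptions rather than something the slide specifies: points are drawn uniformly from the unit hypercube, and the max-min difference is divided by the minimum distance so that results are comparable across dimensions. The run uses 100 points instead of 500 only to keep the pure-Python loop quick.

```python
import math
import random


def relative_distance_contrast(num_points, dims, rng):
    """Return (max_dist - min_dist) / min_dist over all pairs of random points."""
    points = [[rng.random() for _ in range(dims)] for _ in range(num_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(num_points)
        for j in range(i + 1, num_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
contrast_by_dim = {d: relative_distance_contrast(100, d, rng) for d in (2, 10, 50)}
```

The contrast shrinks as the number of dimensions grows: the nearest and farthest pairs end up at similar distances, which is why distance-based notions become less meaningful.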
Dimensionality Reduction
- Purpose:
  - Avoid the curse of dimensionality
  - Reduce the amount of time and memory required by data mining algorithms
  - Allow data to be more easily visualized
  - May help to eliminate irrelevant features or reduce noise
- Techniques
  - Principal Component Analysis
  - Singular Value Decomposition
  - Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
- Goal is to find a projection that captures the largest amount of variation in data
- Figure: axes x1 and x2 with projection direction e
Dimensionality Reduction: PCA
- Find the eigenvectors of the covariance matrix
- The eigenvectors define the new space
- Figure: axes x1 and x2 with projection direction e
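The two PCA steps above can be sketched without external libraries. This version finds only the leading eigenvector, using power iteration as a stand-in for a full eigendecomposition; the sample points are hypothetical, and the sketch assumes the covariance matrix is not all zeros.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (m - 1) for b in range(n)]
        for a in range(n)
    ]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, direction):
    """Coordinates of the mean-centered rows along the given unit direction."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    return [sum((r[j] - means[j]) * direction[j] for j in range(n)) for r in rows]


# Hypothetical 2-D points lying close to the line x2 = x1.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
e = leading_eigenvector(covariance_matrix(points))
scores = project(points, e)  # one coordinate per point in the new 1-D space
```

Here `e` plays the role of the direction drawn in the figure: it points roughly along the diagonal, and projecting onto it keeps most of the variation in a single coordinate.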
Feature Subset Selection
- Another way to reduce dimensionality of data
- Redundant features
  - Duplicate much or all of the information contained in one or more other attributes
  - Example: purchase price of a product and the amount of sales tax paid
- Irrelevant features
  - Contain no information that is useful for the data mining task at hand
  - Example: students' ID is often irrelevant to the task of predicting students' GPA
Feature Subset Selection
- Techniques:
  - Brute-force approach:
    - Try all possible feature subsets as input to the data mining algorithm
  - Embedded approaches:
    - Feature selection occurs naturally as part of the data mining algorithm
  - Filter approaches:
    - Features are selected before the data mining algorithm is run
  - Wrapper approaches:
    - Use the data mining algorithm as a black box to find the best subset of attributes
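The brute-force approach can be sketched as an exhaustive search over all non-empty subsets. The `score` function below is a made-up stand-in for "run the data mining algorithm on this subset and measure its quality"; the feature names, usefulness values, and size penalty are all hypothetical.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset of the features (2**n - 1 subsets in total)."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


def best_subset(features, score):
    """Brute force: evaluate every subset and keep the highest-scoring one."""
    return max(all_feature_subsets(features), key=score)


# Hypothetical usefulness of each feature for predicting GPA.
usefulness = {"student_id": 0.0, "hours_studied": 0.6, "attendance": 0.3}


def score(subset):
    """Toy stand-in for model quality: total usefulness minus a small size penalty."""
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)


features = ["student_id", "hours_studied", "attendance"]
best = best_subset(features, score)
```

Because the number of subsets doubles with every added feature, this search is only practical for a handful of features, which is the motivation for the other approaches on the slide.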
Feature Creation
- Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
- Three general methodologies:
  - Feature extraction
    - Domain-specific
  - Mapping data to a new space
  - Feature construction
    - Combining features
Mapping Data to a New Space
- Fourier transform
- Wavelet transform
- Figure panels: "Two Sine Waves", "Two Sine Waves + Noise", "Frequency"
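The idea behind the figure (two sine waves that are hard to see in noisy time-domain data but stand out in the frequency domain) can be sketched with a direct discrete Fourier transform. The frequencies of 7 and 17 cycles per window, the noise level, and the seed are hypothetical choices; a real implementation would normally use an FFT routine rather than this quadratic-time loop.

```python
import cmath
import math
import random


def dft_magnitudes(signal):
    """Magnitude of the discrete Fourier transform for frequencies 0..N/2."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


rng = random.Random(1)  # fixed seed only to make the sketch repeatable
n_samples = 128
# Two sine waves (7 and 17 cycles per window, both hypothetical) plus Gaussian noise.
signal = [
    math.sin(2 * math.pi * 7 * t / n_samples)
    + math.sin(2 * math.pi * 17 * t / n_samples)
    + rng.gauss(0.0, 0.5)
    for t in range(n_samples)
]
magnitudes = dft_magnitudes(signal)
# Indices of the two largest magnitudes, in increasing order.
dominant = sorted(sorted(range(len(magnitudes)), key=magnitudes.__getitem__)[-2:])
```

In the new (frequency) space, the two strongest components sit at the frequencies of the underlying sine waves, while the noise is spread thinly across all frequencies.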
Attribute Transformation
- A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
  - Simple functions: x^k, log(x), e^x, |x|
  - Standardization and normalization
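A few of these transformations are sketched below. Reading "standardization" as the z-score and "normalization" as min-max scaling to [0, 1] is an interpretation, since the slide does not define either term; the sample values are hypothetical, and the functions assume the values are not all identical (and are positive for the log).

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """z-score: subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Simple function log(x); requires strictly positive values."""
    return [math.log(v) for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical attribute values
z_scores = standardize(values)
scaled = min_max_normalize(values)
```

Both transformations keep the ordering of the values, so each old value can still be identified with its replacement.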
Conclusion
- Definition
- Data Mining Tasks
- Attribute Data Types
- Types of Data Sets
- Data Preprocessing