Data Types
1. What is data
2. Classification of Data Attributes
3. Characteristics of Data Sets
4. Data Formats and Tools for Data Preparation
5. Questions for next step

1. What is Data?
• An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc.
• An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object.
• An object is also known as a record, point, case, sample, entity, instance, or observation.

1. Basic Definitions
Data set (input): Collection of data objects and their attributes used as input for a machine learning scheme.
Data object (instance, record, case, sample, observation): An individual, independent data example of the concept to be learned, characterized by a number of attributes.
Attribute (feature): Property or characteristic of an object.
Model (concept): Pattern or description that is to be learned.
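To make these definitions concrete, here is a minimal sketch of a data set in Python. The attribute names and values are made up for illustration; they are not taken from the slides.

```python
# A tiny, hypothetical data set: each dict is one data object (instance),
# and each key is an attribute (feature) of that object.
data_set = [
    {"eye_color": "brown", "temperature_c": 36.6},
    {"eye_color": "blue", "temperature_c": 37.1},
    {"eye_color": "green", "temperature_c": 36.9},
]

# The attribute names shared by every object in this data set.
attributes = sorted(data_set[0].keys())

# One data object is simply one record of attribute values.
first_object = data_set[0]

print(len(data_set), "objects with attributes", attributes)
```

A list of dicts is only one possible representation; a table or a matrix holds the same information.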

2. Classification of Attribute Types
Attribute value: Measurement of the quantity of that particular attribute.
Two basic attribute types: qualitative and quantitative.
Qualitative (categorical): Attributes that lack the properties of numbers.
Quantitative (numeric): Attributes that are represented by numbers and have the properties of numbers.
You need to understand which types of changes (transformations) preserve the meaning of the data!

2. Attribute Types
Attribute types are further distinguished by the number of values they can take: discrete (e.g. integers) versus continuous (real numbers).
Discrete: A discrete attribute can take values only from a finite or countably infinite set. Examples: male/female, ages.
Continuous: A continuous attribute can take values from an uncountable set, such as the real numbers. Examples: temperature, weight, distance, time.

2. Attribute Types: Categorical or Numeric
Categorical (Qualitative) Data
Nominal attribute: Qualitative names that provide only enough information to distinguish one object from another. No order or distance measure is implied. Examples: male or female; academy/business/government; student ID.
Ordinal attribute: Qualitative names that provide enough information to rank objects in order (example: small, medium, large), but not enough to measure the distance between them.

2. Attribute Types: Numeric (Quantitative) Data
Interval attribute: Values are ordered, and differences between values are meaningful and measurable. Example: temperature in Fahrenheit or Celsius. The meaning-preserving transformation is linear: new_value = a * old_value + b.
Ratio attribute: Both differences and ratios are meaningful and measurable. Example: weight measured in kg or pounds. The meaning-preserving transformation is a rescaling: new_value = a * old_value.
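The difference between interval and ratio attributes can be checked numerically. The sketch below uses the standard Celsius-to-Fahrenheit conversion (a = 9/5, b = 32) and an approximate kg-to-pounds factor (a ≈ 2.20462); the sample values are made up for illustration.

```python
def celsius_to_fahrenheit(c):
    # Interval scale: new_value = a * old_value + b, with a = 9/5 and b = 32.
    return 9.0 / 5.0 * c + 32.0


def kg_to_pounds(kg):
    # Ratio scale: new_value = a * old_value, with a approximately 2.20462.
    return 2.20462 * kg


# Three hypothetical temperatures in Celsius and their Fahrenheit equivalents.
c1, c2, c3 = 10.0, 20.0, 40.0
f1, f2, f3 = (celsius_to_fahrenheit(c) for c in (c1, c2, c3))

# Comparisons of differences survive the linear transformation ...
diff_ratio_c = (c3 - c2) / (c2 - c1)
diff_ratio_f = (f3 - f2) / (f2 - f1)

# ... but ratios of the raw values do not: 20 C is not "twice as hot" as 10 C.
value_ratio_c = c2 / c1
value_ratio_f = f2 / f1

# Two hypothetical weights: the ratio is the same in kg and in pounds.
w1, w2 = 50.0, 100.0
weight_ratio_kg = w2 / w1
weight_ratio_lb = kg_to_pounds(w2) / kg_to_pounds(w1)

print(diff_ratio_c, diff_ratio_f)
print(value_ratio_c, value_ratio_f)
print(weight_ratio_kg, weight_ratio_lb)
```

The offset b is what breaks value ratios for interval attributes; a pure rescaling has no offset, so ratios carry over unchanged.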

Special Discrete Types of Attributes
Binary attributes: True/False, Positive/Negative, On/Off, Yes/No, Male/Female.
Asymmetric (binary) attributes: Only nonzero values are important.
Example: Consider a data set that records a list of cancer patients who took a particular type of chemotherapy.
Example: Consider the election record of a group of citizens, each of whom either voted or did not; only the population that voted matters.

3. Data Set Characteristics
Dimensionality: Number of attributes possessed by the data set instances.
Sparsity: Sparse data sets are those in which most attribute values of the objects are zero.
Quality
Resolution: The degree of discernible detail in an attribute value, i.e. how finely an attribute is measured.
Source of Data
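Dimensionality and sparsity can both be read off a small table of values. The following is a minimal sketch on a made-up matrix; measuring sparsity as the fraction of zero entries is one simple convention assumed here, not a definition taken from the slides.

```python
# Hypothetical data set: 3 objects (rows), each with 4 attributes (columns).
data_matrix = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

# Dimensionality: the number of attributes per object.
dimensionality = len(data_matrix[0])

# Sparsity, measured here as the fraction of attribute values equal to zero.
total_values = sum(len(row) for row in data_matrix)
zero_values = sum(1 for row in data_matrix for value in row if value == 0)
zero_fraction = zero_values / total_values

print("dimensionality:", dimensionality)
print("fraction of zero values:", round(zero_fraction, 3))
```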

3.1 Data Quality
Measurement errors
Noise
Artifacts
Equipment limitations
Data collection procedure errors
Human error
Precision/Resolution
Bias
Accuracy
Completeness

3.1 Data Quality: Handling Data
Outliers
Missing or incomplete values: Estimate? Ignore?
Inaccurate values
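The two options for missing values, ignore or estimate, can be sketched as follows. The ages are made up, `None` marks a missing value, and replacing it with the mean of the observed values is just one simple way to estimate, assumed here for illustration.

```python
from statistics import mean

# Hypothetical attribute values; None marks a missing measurement.
ages = [23, None, 31, 27, None, 35]

# Option 1, "ignore": drop the objects whose value is missing.
ignored = [age for age in ages if age is not None]

# Option 2, "estimate": fill each gap with the mean of the observed values.
observed_mean = mean(ignored)
estimated = [observed_mean if age is None else age for age in ages]

print("ignore:  ", ignored)
print("estimate:", estimated)
```

Ignoring shrinks the data set, while estimating keeps every object but introduces values that were never measured; which trade-off is acceptable depends on the data.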

3.1 Data Quality
Multiple data sources
Inconsistent data: how to handle?
Duplicate data
Age of data
Data relevance

3.2 Data Input Formats
Data records
Text
Graph-based
Data matrix
Ordered data
Spatial data
Visual inputs
Video inputs

3.2 Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes.

3.2 Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
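As a minimal sketch of the m by n layout, the made-up matrix below has m = 3 objects and n = 2 numeric attributes. Treating each row as a point makes it possible to measure the distance between two objects; Euclidean distance is used here only as one familiar example.

```python
import math

# Hypothetical data matrix: one row per object, one column per attribute.
data_matrix = [
    (3.0, 4.0),
    (0.0, 0.0),
    (6.0, 8.0),
]

m = len(data_matrix)      # number of objects (rows)
n = len(data_matrix[0])   # number of attributes (columns)

# A single attribute is one column of the matrix.
first_attribute = [row[0] for row in data_matrix]

# Each object is a point in n-dimensional space, so distances are defined.
distance_0_1 = math.dist(data_matrix[0], data_matrix[1])

print(m, "objects,", n, "attributes")
print("first attribute column:", first_attribute)
print("distance between object 0 and object 1:", distance_0_1)
```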

3.3 Sources of Data Sets
Databases
Web sites
Streaming data
Structured or unstructured data

3.3 Document Data
Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document. (Turned-in)
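The term-vector idea can be sketched with two made-up documents. Splitting on whitespace and lowercasing is a deliberately naive tokenization assumed for this example; real text preparation usually does more.

```python
from collections import Counter

# Two hypothetical documents.
documents = [
    "data mining finds patterns in data",
    "graph data and text data",
]

# Naive tokenization: lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in documents]

# The vocabulary fixes the order of the vector components (attributes).
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Each document becomes a vector of term counts over the vocabulary.
term_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    term_vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
for vector in term_vectors:
    print(vector)
```

Most components are zero as soon as the vocabulary grows, which is why document data is a typical example of a sparse data set.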

3.3 Transaction Data
A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
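The grocery example can be sketched as a list of item sets. The products below are made up, and the conversion to a 0/1 table is shown only as one common way to turn transactions into record data with a fixed set of attributes.

```python
# Hypothetical shopping trips: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "eggs"},
    {"milk", "diapers"},
]

# All distinct items, in a fixed order, become the attributes.
items = sorted(set().union(*transactions))

# One row per transaction: 1 if the item was purchased, 0 otherwise.
binary_rows = [
    [1 if item in transaction else 0 for item in items]
    for transaction in transactions
]

print(items)
for row in binary_rows:
    print(row)
```

In this representation only the 1 entries carry information about a purchase, so each item column behaves like an asymmetric binary attribute.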

3.3 Graph Data

3.3 Chemical Data
Benzene molecule: C6H6

3.3 Ordered Data
Genomic sequence data

Ordered Data: Spatio-Temporal Data
Surface Air Temperature over North America, January-February 2014
https://www.youtube.com/watch?v=VCCkyOTIS3o

4. Data Formats & Tools for Data Preparation
Popular tools and the tools that we use
Data preparation and exploration: MATLAB, R, Excel
Data mining: Weka, MATLAB / R
Different tools use different data formats, e.g. CSV, ARFF, matrix, vector, array, etc.
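To illustrate how the same data looks in two of these formats, the sketch below writes a made-up two-attribute data set as CSV text and as ARFF text (the format Weka reads). The relation name, attribute names, and values are hypothetical; the ARFF layout shown uses only the basic `@relation`, `@attribute`, and `@data` declarations.

```python
import csv
import io

# Hypothetical data set: one numeric attribute and one nominal attribute.
attribute_names = ["temperature", "size"]
rows = [[21.5, "small"], [30.0, "large"]]

# CSV: a header line with the attribute names, then one line per object.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer, lineterminator="\n")
writer.writerow(attribute_names)
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# ARFF: the attribute types are declared explicitly before the data section.
arff_lines = [
    "@relation example",
    "",
    "@attribute temperature numeric",
    "@attribute size {small,medium,large}",
    "",
    "@data",
]
arff_lines += [",".join(str(value) for value in row) for row in rows]
arff_text = "\n".join(arff_lines) + "\n"

print(csv_text)
print(arff_text)
```

The data lines are identical in both formats; the difference is that ARFF records each attribute's type, while CSV leaves the types for the tool to infer.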

5. Questions for Next Step: Data Preprocessing
Question 1: Tom is 10 and he is 5’1”, and John is 22 and he is 5’5”. Who is taller?
Question 2: Can you really compare apples and oranges?
Question 3: How do we change continuous data to discrete data to simplify computation?
Task: Research on Google: what is the curse of dimensionality in data mining?

5. Next Step