Data Mining: Concepts and Techniques, Chapter 2

Data Mining: Concepts and Techniques
Chapter 2
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
© 2011 Han, Kamber, and Pei. All rights reserved.

Chapter 2: Getting to Know Your Data
  • Data Objects and Attribute Types
  • Basic Statistical Descriptions of Data
  • Data Visualization
  • Measuring Data Similarity and Dissimilarity
  • Summary

Types of Data Sets
  • Record
    • Relational records
    • Data matrix, e.g., numerical matrix, crosstabs
    • Document data: text documents: term-frequency vector
    • Transaction data
  • Graph and network
    • World Wide Web
    • Social or information networks
    • Molecular structures
  • Ordered
    • Video data: sequence of images
    • Temporal data: time-series
    • Sequential data: transaction sequences
    • Genetic sequence data
  • Spatial, image and multimedia
    • Spatial data: maps
    • Image data
    • Video data

Important Characteristics of Structured Data
  • Dimensionality
    • Curse of dimensionality
  • Sparsity
    • Only presence counts
  • Resolution
    • Patterns depend on the scale
  • Distribution
    • Centrality and dispersion

Attributes
  • Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
    • E.g., customer_ID, name, address
  • Types:
    • Nominal
    • Binary
    • Numeric: quantitative
      • Interval-scaled
      • Ratio-scaled

Attribute Types
  • Nominal: categories, states, or “names of things”
    • Hair_color = {auburn, black, blond, brown, grey, red, white}
    • marital status, occupation, ID numbers, zip codes
  • Binary
    • Nominal attribute with only 2 states (0 and 1)
    • Symmetric binary: both outcomes equally important
      • e.g., gender
    • Asymmetric binary: outcomes not equally important
      • e.g., medical test (positive vs. negative)
      • Convention: assign 1 to most important outcome (e.g., HIV positive)
  • Ordinal
    • Values have a meaningful order (ranking) but magnitude between successive values is not known
    • Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
  • Quantity (integer or real-valued)
  • Interval
    • Measured on a scale of equal-sized units
    • Values have order
      • E.g., temperature in °C or °F, calendar dates
    • No true zero-point
  • Ratio
    • Inherent zero-point
    • We can speak of a value as being a multiple (or ratio) of another value (10 K is twice as high as 5 K)
      • e.g., temperature in Kelvin, length, counts, monetary quantities
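The interval versus ratio distinction above can be checked with a little arithmetic. The following is a minimal sketch in Python; the helper name `celsius_to_kelvin` is an assumption made for illustration and does not come from the slides. It shows that differences carry over between the two scales, while ratios only make sense on the scale with a true zero.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scaled Celsius reading to ratio-scaled Kelvin."""
    return celsius + 273.15


# Ratio scale: Kelvin has a true zero, so 10 K really is twice 5 K.
ratio_kelvin = 10.0 / 5.0

# Interval scale: 10 °C is not "twice as hot" as 5 °C. Expressed in
# Kelvin, the two readings differ by a factor of only about 1.018.
ratio_celsius_in_kelvin = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

# Differences are meaningful on both scales: a 5-degree gap stays 5 degrees.
diff_celsius = 10.0 - 5.0
diff_kelvin = celsius_to_kelvin(10.0) - celsius_to_kelvin(5.0)
```

Dividing two Celsius readings directly would give 2.0, but that number depends on an arbitrary zero point and changes if the same temperatures are written in Fahrenheit.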

Discrete vs. Continuous Attributes
  • Discrete Attribute
    • Has only a finite or countably infinite set of values
      • E.g., zip codes, profession, or the set of words in a collection of documents
    • Sometimes, represented as integer variables
    • Note: Binary attributes are a special case of discrete attributes
  • Continuous Attribute
    • Has real numbers as attribute values
      • E.g., temperature, height, or weight
    • Practically, real values can only be measured and represented using a finite number of digits
    • Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
  • Motivation
    • To better understand the data: central tendency, variation and spread
  • Data dispersion characteristics
    • median, max, min, quantiles, outliers, variance, etc.
  • Numerical dimensions correspond to sorted intervals
    • Data dispersion: analyzed with multiple granularities of precision
    • Boxplot or quantile analysis on sorted intervals
  • Dispersion analysis on computed measures
    • Folding measures into numerical dimensions
    • Boxplot or quantile analysis on the transformed cube

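Measuring the Central Tendency
  • Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population). Note: n is sample size and N is population size.
    • Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
    • Trimmed mean: chopping extreme values
  • Median:
    • Middle value if odd number of values, or average of the middle two values otherwise
    • Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below it, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width
  • Mode
    • Value that occurs most frequently in the data
    • Unimodal, bimodal, trimodal
    • Empirical formula (unimodal, moderately skewed data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$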
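The measures listed above can be sketched in a few lines of Python using the standard `statistics` module. The helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`) and the sample values are assumptions chosen for illustration, not part of the slides. The grouped-data median follows the usual interpolation: lower boundary of the median interval, plus the fraction of that interval needed to reach n/2 observations, times the interval width.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    data = sorted(values)
    k = int(len(data) * proportion)  # number of values cut per end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


def grouped_median(lower_bound, n, freq_below, freq_median, width):
    """Interpolated median for grouped (binned) data."""
    return lower_bound + (n / 2 - freq_below) / freq_median * width


# Illustrative values only (think of salaries in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_value = statistics.mean(salaries)        # arithmetic mean
median_value = statistics.median(salaries)    # average of the middle two
modes = statistics.multimode(salaries)        # two modes, so bimodal
trimmed = trimmed_mean(salaries, 0.1)         # drops 30 and 110
weighted = weighted_mean([1, 2, 3], [1, 1, 2])

# Hypothetical grouped data: 100 observations, median interval [20, 30),
# 40 observations below it and 25 inside it.
approx_median = grouped_median(20, 100, 40, 25, 10)
```

Here the single extreme value 110 pulls the mean (58) above the median (54), and trimming it away moves the mean back toward the median, which is the reason the trimmed mean is listed as an option.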

Symmetric vs. Skewed Data
  • Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves, labeled symmetric, positively skewed, and negatively skewed]
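How the three measures line up under skew can be seen on a small sample. The sketch below uses made-up values and a hypothetical helper `describe_center`. In the right-skewed (positively skewed) sample the long tail pulls the mean above the median, which in turn sits above the mode; mirroring the data reverses the order.

```python
import statistics


def describe_center(values):
    """Return (mean, median, mode) for a sample with a single mode."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


# Illustrative sample with a long right tail (positive skew).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the same sample: long left tail (negative skew).
left_skewed = [-x for x in right_skewed]

mean_r, median_r, mode_r = describe_center(right_skewed)
mean_l, median_l, mode_l = describe_center(left_skewed)
```

This ordering is typical for unimodal data but is not guaranteed for every skewed sample.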

Measuring the Dispersion of Data
  • Quartiles, outliers and boxplots
    • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
    • Inter-quartile range: IQR = Q3 − Q1
    • Five number summary: min, Q1, median, Q3, max
    • Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
    • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
  • Variance and standard deviation (sample: s, population: σ)
    • Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$ (sample), $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$ (population)
    • Standard deviation s (or σ) is the square root of variance s² (or σ²)
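A minimal sketch of these dispersion measures in Python follows. The function names and sample data are assumptions for illustration. Quartile conventions differ between tools; this sketch uses the "inclusive" method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3 for the same data. The variance helper uses the sums-based form (sum of squares and square of the sum), which is what makes the measure computable from partial aggregates.

```python
import math
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def sample_variance_from_sums(values):
    """Sample variance from sum(x) and sum(x^2) only."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total * total / n) / (n - 1)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # illustrative sample
summary = five_number_summary(data)
outliers = iqr_outliers(data)
s2 = sample_variance_from_sums(data)
s = math.sqrt(s2)
```

The sums-based form is convenient for distributed or incremental computation, but it can lose precision when the values are large relative to their spread, so a library routine such as `statistics.variance` is the safer choice when all the data is at hand.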

Boxplot Analysis
  • Five-number summary of a distribution
    • Minimum, Q1, Median, Q3, Maximum
  • Boxplot
    • Data is represented with a box
    • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
    • The median is marked by a line within the box
    • Whiskers: two lines outside the box extended to Minimum and Maximum
    • Outliers: points beyond a specified outlier threshold, plotted individually

Graphic Displays of Basic Statistical Descriptions
  • Boxplot: graphic display of five-number summary
  • Histogram: x-axis shows values, y-axis represents frequencies
  • Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
  • Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
  • Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Analysis
  • Histogram: graph display of tabulated frequencies, shown as bars
  • It shows what proportion of cases fall into each of several categories
  • Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
  • The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
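The area-versus-height point above matters as soon as bins have different widths. The sketch below is one way to compute bar heights so that each bar's area equals the proportion of cases in its bin; the function name `histogram_densities`, the bin edges, and the data are assumptions made for illustration.

```python
from bisect import bisect_right


def histogram_densities(values, edges):
    """Return (counts, heights) for adjacent bins [edges[i], edges[i+1]).

    Heights are scaled so that height * width is the proportion of
    counted values falling in the bin.
    """
    counts = [0] * (len(edges) - 1)
    for x in values:
        if edges[0] <= x < edges[-1]:  # ignore values outside the bins
            counts[bisect_right(edges, x) - 1] += 1
    n = sum(counts)
    heights = [
        c / (n * (hi - lo))
        for c, lo, hi in zip(counts, edges, edges[1:])
    ]
    return counts, heights


values = [1, 2, 3, 4, 6, 7, 12, 18]  # illustrative data
edges = [0, 5, 10, 20]               # last bin is twice as wide
counts, heights = histogram_densities(values, edges)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:])]
```

The second and third bins hold the same number of values, yet the wider third bin is drawn half as tall, so both bars cover the same area. Plotting raw counts as heights would make the wide bin look as dense as the narrow one.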

Histograms Often Tell More than Boxplots
  • The two histograms shown on the left may have the same boxplot representation
    • The same values for: min, Q1, median, Q3, max
  • But they have rather different data distributions

Quantile Plot
  • Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
  • Plots quantile information
    • For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
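The pairing of each sorted value with its fraction f_i can be sketched as follows. The slide does not give a formula for f_i; this sketch assumes the common convention f_i = (i − 0.5)/N, and the function name `quantile_plot_points` is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, i = 1..N."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]


points = quantile_plot_points([40, 10, 30, 20])
# Each tuple is (f_i, x_i); plotting f_i on the x-axis against x_i on
# the y-axis gives the quantile plot.
```

With four values the fractions step from 0.125 to 0.875 in increments of 0.25, so the smallest and largest observations never land exactly on 0 % or 100 %.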

Scatter plot
  • Provides a first look at bivariate data to see clusters of points, outliers, etc.
  • Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data
  • The left half fragment is positively correlated
  • The right half is negatively correlated
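What a scatter plot shows visually can also be summarized with a number. The slide does not name a measure; the sketch below assumes Pearson's correlation coefficient, one common choice, and uses made-up data and a hypothetical helper `pearson_r`. A positive value corresponds to points rising from left to right, a negative value to points falling, and a value near zero to no linear trend.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]    # tends to increase with x
falling = [6, 4, 5, 4, 2]   # tends to decrease with x
no_trend = [3, 1, 4, 1, 3]  # symmetric around the middle x

r_pos = pearson_r(xs, rising)
r_neg = pearson_r(xs, falling)
r_zero = pearson_r(xs, no_trend)
```

A coefficient near zero only rules out a linear relationship; the `no_trend` sample still has a clear pattern that a scatter plot would reveal.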

Uncorrelated Data