Data Lecture Notes for Chapter 2 © Tan, Steinbach, Kumar
What is Data?
• A data set is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
  – Examples: name, gender, age
  – An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object.
  – An object is also known as a record, point, case, sample, entity, or instance.
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute.
• Question: what is the first attribute of the students, and what is its value for the third student?
Types of Attributes
• There are different types of attributes:
  – Nominal. Examples: ID numbers, eye color, zip codes
  – Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
  – Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio. Examples: temperature in Kelvin, length, time, counts
Attribute types: description, examples, and operations
• Nominal
  – Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  – Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  – Operations: mode, entropy, contingency correlation, χ² test
• Ordinal
  – Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  – Examples: hardness of minerals, {good, better, best}, grades, street numbers
  – Operations: median, percentiles, rank correlation, run tests, sign tests
• Interval
  – Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  – Examples: calendar dates, temperature in Celsius or Fahrenheit
  – Operations: mean, standard deviation, Pearson's correlation, t and F tests
• Ratio
  – Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  – Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  – Operations: geometric mean, harmonic mean, percent variation
Types of Attributes
• Ordinal: order the objects according to grade level from A to E.
• Interval: from the birth year, the third student is 3 years older than the first student.
• Ratio: assume that the length of one ruler is 20 cm and the length of another is 40 cm. Length is a ratio attribute; specifically, the second ruler is twice as long as the first one.
Discrete and Continuous Attributes
• Discrete attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, student ID
  – Note: binary attributes (e.g., gender) are a special case of discrete attributes.
• Continuous attribute
  – Has real numbers as attribute values
  – Examples: age, length, weight
Discrete and Continuous Attributes
• The same attribute can be viewed as discrete in some cases and as continuous in others.
• For example, age is an attribute of persons. If age describes students in a university, its values usually fall into the range [10, 30]. Recorded in whole years, it takes 21 values, so it can be viewed as a discrete attribute.
• If we do not limit the value range of age, it can be viewed as a continuous attribute. For example, we can say that the age of a person is 23.3 (23 years and 3.6 months).
• Therefore, an interval attribute or a ratio attribute may be discrete or continuous, depending on the case.
Types of Data Sets
• Record
  – Data matrix
  – Document data
  – Transaction data
• Graph
  – World Wide Web
  – Molecular structures
• Ordered
  – Spatial data
  – Temporal data
  – Sequential data
  – Genetic sequence data
Record Data
• A record data set consists of a collection of records, each of which consists of a fixed set of attributes.
Transaction Data
• Transaction data is a special type of record data in which each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
Important Characteristics of Record Data
• Dimensionality
  – The dimensionality of a data set is the number of attributes that the objects have.
• Sparsity
  – When most attributes of an object have a value of 0, we say that the data set is sparse.
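The two characteristics above can be illustrated with a small sketch. The `records` matrix below is made-up data, and measuring sparsity as the fraction of zero entries over the whole data set is one reasonable reading of the definition, not the only one.

```python
# Hypothetical record data: each row is an object, each column an attribute.
records = [
    [0, 0, 3, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 2],
]


def dimensionality(data):
    """Number of attributes per object (assumes all rows have equal length)."""
    return len(data[0])


def sparsity(data):
    """Fraction of attribute values in the data set that are exactly 0."""
    total = sum(len(row) for row in data)
    zeros = sum(value == 0 for row in data for value in row)
    return zeros / total


print(dimensionality(records))  # 5
print(sparsity(records))        # 0.8
```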
Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• Examples of data quality problems:
  – Noise and outliers
  – Missing values
  – Duplicate data
Noise
• Noise refers to the modification of original values.
  – Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen
• Figure panels: Two Sine Waves; Two Sine Waves + Noise
• Noise reduces data quality and, in turn, the quality of data mining results, for example by lowering classification and clustering accuracy. Noise can even cause incorrect data mining results.
• Therefore, reducing noise is an important task in data preprocessing.
• Nonetheless, eliminating noise is frequently difficult, so we need to devise robust data mining algorithms that produce acceptable results even when noise is present.
Outliers
• Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set.
Outliers
• Note: outliers are different from noise.
• Outliers can be legitimate data objects or values.
• Unlike noise, outliers may sometimes be of interest.
• One task of data mining is to detect outliers in a large amount of data.
• Outlier detection algorithms will be presented in Chapter 10 (if time permits).
Missing Values
• It is common for one or more objects to be missing one or more attribute values.

  Student ID   Name   Gender   Age
  1203         Tom    Male     23
  1506         Lucy   Female
  2158                Male     22

• Why?
Missing Values
• Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight).
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).

  Table: Members of Tom's family
  Person   Name   Salary (dollars)   Age
  Father   Jack   5600               43
  Mother   Lucy   5200               42
           Tom                       8
Missing Values
• Generally speaking, most data mining algorithms cannot handle data sets with missing values.
• Handling missing values is one task of preprocessing:
  – Eliminate data objects
  – Estimate missing values
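The two strategies listed above can be sketched as follows. The records are illustrative, `None` is an assumed marker for a missing value, and filling in the mean of the known values is only one possible estimation method; the slides do not prescribe a particular one.

```python
# Hypothetical records; None marks a missing attribute value.
students = [
    {"id": 1203, "gender": "Male", "age": 23},
    {"id": 1506, "gender": "Female", "age": None},
    {"id": 2158, "gender": "Male", "age": 22},
]


def eliminate_objects(rows, attribute):
    """Strategy 1: drop every object whose value for `attribute` is missing."""
    return [row for row in rows if row[attribute] is not None]


def estimate_with_mean(rows, attribute):
    """Strategy 2 (assumed method): fill missing values with the mean of the known ones."""
    known = [row[attribute] for row in rows if row[attribute] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, attribute: mean if row[attribute] is None else row[attribute]}
        for row in rows
    ]


print(len(eliminate_objects(students, "age")))        # 2
print(estimate_with_mean(students, "age")[1]["age"])  # 22.5
```

Eliminating objects is simple but discards information; estimating keeps every object at the cost of introducing values that were never observed.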
Duplicate Data
• A data set may include data objects that are duplicates, or almost duplicates, of one another.
  – This issue often appears when merging data from heterogeneous sources.
• Examples:
  – The same person with multiple email addresses
  – Many people receive duplicate mailings.
Data Postprocessing: Visualization
• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  – Humans have a well-developed ability to analyze large amounts of information that is presented visually.
  – They can detect general patterns and trends.
  – They can detect outliers and unusual patterns.
Arrangement
• Example of visualization: arrangement in tabular form.
• A good arrangement helps us understand the data.
• Example:
  – Nine objects with six binary attributes.
  – At first glance, we cannot observe any clear relationship between objects and attributes.
Arrangement
• A good arrangement helps us understand the data.
• Example:
  – Permute the rows and columns.
  – There are only two types of objects: one that has all ones for the first three attributes and one that has all ones for the last three attributes.
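The effect of permuting rows and columns can be reproduced with a small sketch. The nine-by-six matrix below is made-up data with two hidden object types, and sorting rows and then columns is a simple trick that happens to work for this clean example; it is not a general reordering algorithm.

```python
# Hypothetical scrambled data: nine objects (rows), six binary attributes (columns).
type_a = [0, 1, 0, 1, 1, 0]
type_b = [1, 0, 1, 0, 0, 1]
scrambled = [type_a, type_b, type_b, type_a, type_a, type_b, type_a, type_b, type_a]


def arrange(matrix):
    """Permute rows, then columns, so that identical rows and columns become adjacent."""
    rows = sorted(matrix, reverse=True)           # reorder the objects
    columns = sorted(zip(*rows), reverse=True)    # reorder the attributes
    return [list(row) for row in zip(*columns)]   # transpose back to rows


for row in arrange(scrambled):
    print(row)
```

After the permutation, four objects have ones only in the first three columns and five objects have ones only in the last three, so the two object types are visible at a glance.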
Visualization Techniques: Scatter Plots
• Many visualization techniques have been developed, such as histograms, scatter plots, box plots, pie charts, and contour plots.
• If you are interested in data visualization, refer to specialized books on visualization techniques.
Similarity and Dissimilarity
• Similarity
  – Numerical measure of how alike two data objects are
  – Is higher when objects are more alike
• Dissimilarity
  – Numerical measure of how different two data objects are
  – Is lower when objects are more alike
  – The term distance is used as a synonym for dissimilarity.
• Proximity refers to either a similarity or a dissimilarity.
• Similarity and dissimilarity are important to data mining techniques. In many cases, the initial data set is not needed once the similarities and dissimilarities have been computed.
Examples
• In the classification task
  – We need to compute the dissimilarity (distance) between the test record and each training record so that we can find the most similar training objects.
Example
• In the clustering task
  – We need to compute the dissimilarity (distance) between each pair of points so that we can cluster the points into different groups.
Similarity and Dissimilarity
• Proximity Transformation 1
  – Generally, proximity measures, especially similarities, are transformed to have values in the interval [0, 1], so that a proximity value indicates the fraction of similarity or dissimilarity.
  – Example: if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can map them into the range [0, 1] by using the transformation
      $s' = \frac{s - 1}{9}$
    where s' and s are the new and original similarity values, respectively.
  – Exercise: if the original similarity between two objects is 6.4, what is the similarity when transformed to the range [0, 1]?
Similarity and Dissimilarity
• Proximity Transformation 2
  – More generally, the transformation of similarities to the interval [0, 1] is given by the expression
      $s' = \frac{s - min\_s}{max\_s - min\_s}$
    where min_s and max_s are the original minimum and maximum similarity values, respectively.
  – Exercise: if all original similarities between two objects fall within [10, 30] and the similarity between two particular objects is 14, what is the new similarity when transformed to the range [0, 1]?
Similarity and Dissimilarity
• Proximity Transformation 3
  – Likewise, dissimilarity measures with a finite range can be mapped to [0, 1] by using the formula
      $d' = \frac{d - min\_d}{max\_d - min\_d}$
    where min_d and max_d are the minimum and maximum distance values, respectively.
  – If the proximity measure originally takes values in the interval $[0, \infty)$, one transformation of the proximity measure to [0, 1] is
      $d' = \frac{d}{1 + d}$
Similarity and Dissimilarity
• Proximity Transformation 4
  – Transform similarities to dissimilarities.
  – If the similarity falls in [0, 1], we can use the transformation
      $d = 1 - s$
    where d and s are the dissimilarity and similarity values, respectively.
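A minimal sketch of the proximity transformations, under the usual textbook forms: a linear min-max map for a finite range, d / (1 + d) for a dissimilarity on an unbounded range, and d = 1 - s for turning a similarity in [0, 1] into a dissimilarity. The function names and the sample values are illustrative and deliberately differ from the exercise values.

```python
def min_max(value, low, high):
    """Map a proximity value from the finite range [low, high] linearly onto [0, 1]."""
    return (value - low) / (high - low)


def unbounded_to_unit(d):
    """Map a dissimilarity from [0, infinity) into [0, 1) via d / (1 + d)."""
    return d / (1 + d)


def similarity_to_dissimilarity(s):
    """Convert a similarity in [0, 1] to a dissimilarity via d = 1 - s."""
    return 1 - s


print(min_max(5.5, 1, 10))                # 0.5
print(unbounded_to_unit(3))               # 0.75
print(similarity_to_dissimilarity(0.25))  # 0.75
```

Note that d / (1 + d) is nonlinear: it compresses large distances toward 1, so differences between large values become less visible after the transformation.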
Proximity of Objects
• The proximity of objects with a number of attributes is typically defined by combining the proximities of the individual attributes.
• How can we compute the dissimilarity of the two objects in the following table?

  Student ID   Grade Level   Age
  1203         C             23
  2501         A             20

• Suppose that Grade Level = {A, B, C, D, E} and that the age of each student falls in [10, 30].
Proximity of Objects
• The proximity of objects with a number of attributes is typically defined by combining the proximities of the individual attributes.
• For students 1203 (Grade Level C, Age 23) and 2501 (Grade Level A, Age 20):
    d = distance(Student ID) + distance(Grade Level) + distance(Age)
• Student ID is a nominal attribute, Grade Level is an ordinal attribute, and Age is a ratio/interval attribute.
Proximity of Objects
• Distance (dissimilarity) of two nominal attribute values
• Let p and q be two nominal attribute values. We define the distance between them as
    $d(p, q) = \begin{cases} 0 & \text{if } p = q \\ 1 & \text{if } p \neq q \end{cases}$
• The student IDs 1203 and 2501 differ, so distance(Student ID) = 1 and
    d = distance(Student ID) + distance(Grade Level) + distance(Age) = 1 + distance(Grade Level) + distance(Age)
Proximity of Objects
• Distance (dissimilarity) of two ordinal attribute values
• Map each value of an ordinal attribute to an integer from 0 to n-1, where n is the number of values.
• Let p and q be two ordinal attribute values after this mapping. We define the distance between them as
    $d(p, q) = \frac{|p - q|}{n - 1}$
• Suppose that Grade Level = {A, B, C, D, E}, so n = 5 and the values map to A=4, B=3, C=2, D=1, E=0.
Proximity of Objects
• Distance (dissimilarity) of two ordinal attribute values
• Suppose that Grade Level = {A, B, C, D, E}, so map the values to A=4, B=3, C=2, D=1, E=0.
• distance(Grade Level) = |C - A| / (n - 1) = |2 - 4| / (5 - 1) = 0.5
• d = distance(Student ID) + distance(Grade Level) + distance(Age) = 1 + 0.5 + distance(Age)
Proximity of Objects
• Distance (dissimilarity) of two ratio attribute values
• Let p and q be two ratio attribute values. We define the distance between them as
    $d(p, q) = |p - q|$
  (Note: the distance between two interval attribute values can also be computed using this formula.)
• distance(Age) = |23 - 20| = 3
• Furthermore, we transform it to [0, 1].
• Suppose that the age of each student falls in [10, 30]. Then the distance between two ages falls in the interval [0, 20].
Proximity of Objects
• Distance (dissimilarity) of two ratio attribute values
• distance(Age) = |23 - 20| = 3
• Furthermore, we transform it to [0, 1]. Since the age of each student falls in [10, 30], the distance between two ages falls in [0, 20].
• Transformed distance(Age) = (3 - 0) / (20 - 0) = 0.15
• d = distance(Student ID) + distance(Grade Level) + distance(Age) = 1 + 0.5 + 0.15 = 1.65
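The worked example for students 1203 and 2501 can be checked with a short sketch. The function names and dictionary layout are illustrative; the per-attribute rules (0/1 for nominal, rank difference over n - 1 for ordinal, range-scaled absolute difference for ratio) follow the slides, and summing the three parts without weights mirrors the example.

```python
GRADE_RANK = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}  # ordinal values mapped to 0..n-1
AGE_RANGE = (10, 30)                                   # assumed range of student ages


def nominal_distance(p, q):
    """0 if the two nominal values are equal, 1 otherwise."""
    return 0 if p == q else 1


def ordinal_distance(p, q, rank):
    """Absolute rank difference divided by n - 1."""
    return abs(rank[p] - rank[q]) / (len(rank) - 1)


def ratio_distance(p, q, low, high):
    """Absolute difference, scaled to [0, 1] by the width of the value range."""
    return abs(p - q) / (high - low)


def student_distance(a, b):
    """Unweighted sum of the three per-attribute distances."""
    return (nominal_distance(a["id"], b["id"])
            + ordinal_distance(a["grade"], b["grade"], GRADE_RANK)
            + ratio_distance(a["age"], b["age"], *AGE_RANGE))


s1 = {"id": 1203, "grade": "C", "age": 23}
s2 = {"id": 2501, "grade": "A", "age": 20}
print(round(student_distance(s1, s2), 2))  # 1.65
```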
Proximity of Objects
• Exercise: what is the dissimilarity of the two objects in the following table?

  Student ID   Grade Level   Age
  1215         E             18
  2637         B             27

• Suppose that Grade Level = {A, B, C, D, E, F} and that the age of each student falls in [15, 30].
Proximity of Objects
• In many cases, each object has only ratio attributes.
• Example: two cuboids described by length, width, and height.

  Length (cm)   Width (cm)   Height (cm)
  25            15           20
  15            10           30

• How can we compute the distance between two objects with a number of ratio attributes?
Euclidean Distance
• Euclidean distance
    $dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
  where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the k-th attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
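As a quick illustration, the Euclidean distance can be computed for the two cuboids from the earlier slide, (25, 15, 20) and (15, 10, 30) in centimeters. The helper name `euclidean` is illustrative, and no standardization is applied because all three attributes share the same unit.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences over all attributes."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


cuboid_1 = (25, 15, 20)  # length, width, height in cm
cuboid_2 = (15, 10, 30)
print(euclidean(cuboid_1, cuboid_2))  # 15.0
```

The squared differences are 100, 25, and 100, which sum to 225, so the distance is 15 cm.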
Euclidean Distance Matrix
Minkowski Distance
• The Minkowski distance is a generalization of the Euclidean distance:
    $dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
  where r is a parameter, n is the number of dimensions (attributes), and $p_k$ and $q_k$ are, respectively, the k-th attributes (components) of data objects p and q.
Minkowski Distance: Examples
• r = 1: city block (Manhattan, taxicab, L1 norm) distance.
  – A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• r = 2: Euclidean distance (L2 norm).
• r → ∞: "supremum" (L_max norm, L_∞ norm) distance.
  – This is the maximum difference between any component of the vectors.
• Do not confuse r with n; all these distances are defined for any number of dimensions.
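The three special cases can be compared with one small function. This is a sketch: the points are made up, and the supremum case is handled explicitly with `max` because the general formula cannot be evaluated at r = ∞.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum distance."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


p, q = (0, 2), (3, 6)  # hypothetical two-dimensional points
print(minkowski(p, q, 1))         # 7.0  (city block)
print(minkowski(p, q, 2))         # 5.0  (Euclidean)
print(minkowski(p, q, math.inf))  # 4    (supremum)

# With r = 1 on binary vectors, the result is the Hamming distance.
print(minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1))  # 2.0
```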
Minkowski Distance Matrix
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
• A distance that satisfies these properties is a metric.
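The three properties can be checked numerically on a finite set of points. The sketch below is only a spot check on made-up points, not a proof; the helper `is_metric_on` and the squared Euclidean counterexample are illustrative additions.

```python
import itertools
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def is_metric_on(points, d, tol=1e-12):
    """Check positive definiteness, symmetry, and the triangle inequality on `points`."""
    for p, q in itertools.product(points, repeat=2):
        if d(p, q) < 0 or (d(p, q) == 0) != (p == q):
            return False  # violates positive definiteness
        if abs(d(p, q) - d(q, p)) > tol:
            return False  # violates symmetry
    for p, q, r in itertools.product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False  # violates the triangle inequality
    return True


points = [(0, 0), (1, 0), (2, 0), (0, 3)]
print(is_metric_on(points, euclidean))          # True
print(is_metric_on(points, squared_euclidean))  # False
```

Squared Euclidean distance fails because going from (0, 0) to (2, 0) costs 4, while the detour through (1, 0) costs only 1 + 1 = 2.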
Similarity Between Binary Vectors
• A common situation is that objects p and q have only binary attributes.

  Object   a1   a2   a3   a4   a5   a6   a7   a8   a9   a10
  p        1    0    0    0    0    0    0    0    0    0
  q        0    0    0    0    0    0    1    0    0    1
Similarity Between Binary Vectors
• p and q have only binary attributes.
• We define four quantities, M01, M10, M00, and M11, as follows:
  – M01 = the number of attributes where p was 0 and q was 1
  – M10 = the number of attributes where p was 1 and q was 0
  – M00 = the number of attributes where p was 0 and q was 0
  – M11 = the number of attributes where p was 1 and q was 1
  (M00 and M11 are called matches, and the number of attributes is M01 + M10 + M11 + M00.)
• Simple Matching and Jaccard Coefficients
  – SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
  – J = number of 11 matches / number of not-both-zero attribute values = (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
  p = 1 0 0 0 0 0 0 0 0 0
  q = 0 0 0 0 0 0 1 0 0 1
• M01 = 2 (the number of attributes where p was 0 and q was 1)
• M10 = 1 (the number of attributes where p was 1 and q was 0)
• M00 = 7 (the number of attributes where p was 0 and q was 0)
• M11 = 0 (the number of attributes where p was 1 and q was 1)
• SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
• J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
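The example can be reproduced in a few lines, taking p as a single 1 followed by nine 0s, which is the reading consistent with the counts M01 = 2, M10 = 1, M00 = 7, M11 = 0. The function names are illustrative, and returning 0.0 for Jaccard when both vectors are all zeros is an assumed convention, since the coefficient is undefined in that case.

```python
def match_counts(p, q):
    """Return (M01, M10, M00, M11) for two equal-length binary vectors."""
    pairs = list(zip(p, q))
    m01 = sum(a == 0 and b == 1 for a, b in pairs)
    m10 = sum(a == 1 and b == 0 for a, b in pairs)
    m00 = sum(a == 0 and b == 0 for a, b in pairs)
    m11 = sum(a == 1 and b == 1 for a, b in pairs)
    return m01, m10, m00, m11


def smc(p, q):
    """Simple matching coefficient: matches over all attributes."""
    m01, m10, m00, m11 = match_counts(p, q)
    return (m11 + m00) / (m01 + m10 + m11 + m00)


def jaccard(p, q):
    """Jaccard coefficient: 1-1 matches over attributes that are not both zero."""
    m01, m10, _, m11 = match_counts(p, q)
    denominator = m01 + m10 + m11
    return m11 / denominator if denominator else 0.0  # assumed convention for all-zero input


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(match_counts(p, q))  # (2, 1, 7, 0)
print(smc(p, q))           # 0.7
print(jaccard(p, q))       # 0.0
```

The gap between 0.7 and 0 shows why Jaccard is preferred for sparse binary data: SMC rewards the many shared zeros, while Jaccard ignores them.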
Cosine Similarity
• Suppose that d1 and d2 are two objects, and each object is denoted by a vector.

  Object   a1   a2   a3   a4   a5   a6   a7   a8   a9   a10
  d1       3    2    0    5    0    0    0    2    0    0
  d2       1    0    0    0    0    0    0    1    0    2
Cosine Similarity
• If d1 and d2 are two objects, each denoted by a vector, then
    cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
  where · indicates the vector dot product and ||d|| is the length of vector d.
• Example:
    d1 = 3 2 0 5 0 0 0 2 0 0
    d2 = 1 0 0 0 0 0 0 1 0 2
    d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
    ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
    ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
    cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
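The cosine computation can be verified with a short sketch. It takes d2 as the ten-component vector 1 0 0 0 0 0 0 1 0 2, which is the reading that yields the stated dot product of 5; the function name is illustrative, and zero vectors are not handled.

```python
import math


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    length_x = math.sqrt(sum(a * a for a in x))
    length_y = math.sqrt(sum(b * b for b in y))
    return dot / (length_x * length_y)  # undefined if either vector is all zeros


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(math.sqrt(42), 3))               # 6.481
print(round(math.sqrt(6), 3))                # 2.449
print(round(cosine_similarity(d1, d2), 4))   # 0.315
```

The length of d2 is the square root of 6, about 2.449, and 5 / (6.481 * 2.449) gives the stated result of 0.3150.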
Extended Jaccard Coefficient (Tanimoto)
• Variation of Jaccard for continuous or count attributes
  – Reduces to Jaccard for binary attributes
  – If p and q are two objects, each denoted by a vector, then
      $T(p, q) = \frac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$
    where $p \cdot q$ indicates the vector dot product between p and q, and $\|p\|^2$ is the square of the length of vector p.
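A minimal sketch of the extended Jaccard coefficient in its standard form, p · q divided by (||p||² + ||q||² - p · q). The function name and sample vectors are illustrative; the second call shows on a small binary pair that the value agrees with the ordinary Jaccard coefficient.

```python
def extended_jaccard(p, q):
    """Tanimoto coefficient: dot product over (squared lengths minus dot product)."""
    dot = sum(a * b for a, b in zip(p, q))
    p_squared = sum(a * a for a in p)
    q_squared = sum(b * b for b in q)
    return dot / (p_squared + q_squared - dot)  # undefined if both vectors are all zeros


# Count vectors (hypothetical document-term counts).
x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(extended_jaccard(x, y), 4))  # 0.1163

# Binary vectors: M11 = 1, M10 = 1, M01 = 1, so Jaccard = 1 / 3.
print(extended_jaccard([1, 1, 0], [1, 0, 1]))  # 0.3333333333333333
```

For binary vectors the dot product counts M11 and each squared length counts the ones in that vector, so the denominator becomes M01 + M10 + M11.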
Exercise
• p = 0 1 0 1, q = 1 0 1 0
• Calculate the cosine similarity, SMC, Jaccard coefficient, and extended Jaccard coefficient.