Data Lecture Notes for Chapter 2 Tan Steinbach

Data Lecture Notes for Chapter 2 © Tan, Steinbach, Kumar

What is Data?
• A data set is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
  – Examples: name, gender, age, etc.
  – An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object.
  – An object is also known as a record, point, case, sample, entity, or instance.

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute.
• Question: what is the first attribute of the students, and what is its value for the third student?

Types of Attributes
• There are different types of attributes:
  – Nominal. Examples: ID numbers, eye color, zip codes.
  – Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}.
  – Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
  – Ratio. Examples: temperature in Kelvin, length, time, counts.

Attribute Type: Description, Examples, Operations
• Nominal
  – Description: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  – Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  – Operations: mode, entropy, contingency correlation, χ² test
• Ordinal
  – Description: the values of an ordinal attribute provide enough information to order objects. (<, >)
  – Examples: hardness of minerals, {good, better, best}, grades, street numbers
  – Operations: median, percentiles, rank correlation, run tests, sign tests
• Interval
  – Description: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  – Examples: calendar dates, temperature in Celsius or Fahrenheit
  – Operations: mean, standard deviation, Pearson's correlation, t and F tests
• Ratio
  – Description: for ratio variables, both differences and ratios are meaningful. (*, /)
  – Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  – Operations: geometric mean, harmonic mean, percent variation

Types of Attributes
• Ordinal: order the objects according to grade level from A to E.
• Interval: from the birth years, the third student is 3 years older than the first student.
• Ratio: assume that one ruler is 20 cm long and another is 40 cm long. Length is a ratio attribute; specifically, the second ruler is twice as long as the first one.

Discrete and Continuous Attributes
• Discrete attribute
  – Has only a finite or countably infinite set of values.
  – Examples: zip codes, student ID.
  – Note: binary attributes (e.g., gender) are a special case of discrete attributes.
• Continuous attribute
  – Has real numbers as attribute values.
  – Examples: age, length, weight.

Discrete and Continuous Attributes
• The same attribute can be viewed as discrete in some cases and as continuous in others.
• For example, age is an attribute of persons. If age describes the students in a university and is recorded in whole years, its values usually fall into the range [10, 30]. In this case it takes only 21 values, so it can be viewed as a discrete attribute.
• If we do not limit the value range of age, it can be viewed as a continuous attribute. For example, we can say that the age of a person is 23.3 (23 years and about 3.6 months).
• Therefore, an interval attribute or a ratio attribute may be discrete or continuous, depending on the case.

Types of Data Sets
• Record
  – Data matrix
  – Document data
  – Transaction data
• Graph
  – World Wide Web
  – Molecular structures
• Ordered
  – Spatial data
  – Temporal data
  – Sequential data
  – Genetic sequence data

Record Data
• A record data set is a collection of records, each of which consists of a fixed set of attributes.

Transaction Data
• Transaction data is a special type of record data in which each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

Important Characteristics of Record Data
  – Dimensionality: the dimensionality of a data set is the number of attributes that the objects have.
  – Sparsity: when most attributes of an object have the value 0, we say that the data set is sparse.
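
The two characteristics above can be computed directly from a data matrix. The following is a minimal sketch, not part of the original slides; the function names and the small document-term matrix are hypothetical and chosen only for illustration.

```python
def dimensionality(rows):
    """Number of attributes per object (all rows are assumed equally long)."""
    return len(rows[0]) if rows else 0


def zero_fraction(rows):
    """Fraction of all attribute values in the data set that are exactly 0."""
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for v in r if v == 0)
    return zeros / total if total else 0.0


# Hypothetical document-term counts: 3 documents (objects), 5 terms (attributes).
docs = [
    [3, 0, 0, 0, 1],
    [0, 0, 2, 0, 0],
    [0, 1, 0, 0, 0],
]

print(dimensionality(docs))  # 5 attributes
print(zero_fraction(docs))   # 11 of the 15 values are zero
```

A zero fraction well above one half, as here, is what the slide calls a sparse data set.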

Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• Examples of data quality problems:
  – Noise and outliers
  – Missing values
  – Duplicate data

Noise
• Noise refers to the modification of original values.
  – Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen.
[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]

• Noise reduces data quality and, in turn, the quality of data mining results: for example, it lowers classification accuracy and clustering accuracy, and it can even lead to incorrect results.
• Therefore, reducing noise is an important task in data preprocessing.
• Nonetheless, eliminating noise is frequently difficult, so we also need to devise robust data mining algorithms that produce acceptable results when noise is present.

Outliers
• Outliers are data objects whose characteristics differ considerably from those of most other data objects in the data set.

Outliers
• Note that outliers are different from noise.
• Outliers can be legitimate data objects or values.
• Unlike noise, outliers may sometimes be of interest.
• One task of data mining is to detect outliers in a large amount of data.
• Outlier detection algorithms will be presented in Chapter 10 (if time permits).

Missing Values
• It is not unusual for one or more objects to be missing one or more attribute values.
  Student ID   Name        Gender   Age
  1203         Tom         Male     23
  1506         Lucy        Female   (missing)
  2158         (missing)   Male     22
• Why?

Missing Values
• Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight).
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
• Table: Members of Tom's family ("-" marks an empty cell)
  Person   Name   Salary (dollars)   Age
  Father   Jack   5600               43
  Mother   Lucy   5200               42
  -        Tom    -                  8

Missing Values
• Generally speaking, most data mining algorithms cannot handle data sets with missing values.
• Handling missing values is one task of preprocessing:
  – Eliminate data objects
  – Estimate missing values
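
The two strategies above can be sketched in a few lines. This is an illustrative sketch only: it uses the student table from the earlier Missing Values slide as best it can be read (Lucy's age and the third student's name are assumed to be the missing cells), the function names are hypothetical, and mean imputation is just one simple way to estimate a missing value.

```python
# Student table from the Missing Values slide; None marks a missing value.
students = [
    {"id": 1203, "name": "Tom", "gender": "Male", "age": 23},
    {"id": 1506, "name": "Lucy", "gender": "Female", "age": None},
    {"id": 2158, "name": None, "gender": "Male", "age": 22},
]


def drop_incomplete(rows):
    """Strategy 1: eliminate every object that has any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, attr):
    """Strategy 2: estimate a missing numeric value by the mean of the known ones."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    # Build new records so the input list is left unchanged.
    return [dict(r, **{attr: mean if r[attr] is None else r[attr]}) for r in rows]


print(len(drop_incomplete(students)))         # only one complete record remains
print(impute_mean(students, "age")[1]["age"])  # mean of 23 and 22
```

Eliminating objects is simple but here discards two of three records; estimation keeps them at the cost of introducing an assumed value.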

Duplicate Data
• A data set may include data objects that are duplicates, or almost duplicates, of one another.
  – This issue often appears when merging data from heterogeneous sources.
• Examples:
  – The same person with multiple email addresses.
  – Many people receive duplicate mailings.

Data Postprocessing: Visualization
• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  – Humans have a well-developed ability to analyze large amounts of information that is presented visually.
  – They can detect general patterns and trends.
  – They can detect outliers and unusual patterns.

Arrangement
• Example of visualization: tabular arrangement.
• A good arrangement helps us understand the data better.
• Example:
  – Nine objects with six binary attributes.
  – We cannot observe any clear relationship between objects and attributes at first glance.

Arrangement
• A good arrangement helps us understand the data better.
• Example:
  – Permute the rows and columns.
  – There are only two types of objects: one that has all ones for the first three attributes and one that has all ones for the last three attributes.

Visualization Techniques: Scatter Plots
• Many visualization techniques have been developed, such as histograms, scatter plots, box plots, pie charts, and contour plots.
• If you are interested in data visualization, you can refer to specialized books on visualization techniques.

Similarity and Dissimilarity
• Similarity
  – Numerical measure of how alike two data objects are.
  – Higher when objects are more alike.
• Dissimilarity
  – Numerical measure of how different two data objects are.
  – Lower when objects are more alike.
  – The term distance is used as a synonym for dissimilarity.
• Proximity refers to either a similarity or a dissimilarity.
• Similarity and dissimilarity are important to data mining techniques. In many cases, the initial data set is not needed once the similarities and dissimilarities have been computed.

Examples
• In the classification task
  – We need to compute the dissimilarity (distance) between the test record and each training record so that we can find the most similar training objects.

Example
• In the clustering task
  – We need to compute the dissimilarity (distance) between each pair of points so that we can cluster the points into different groups.

Similarity and Dissimilarity
• Proximity Transformation 1
  – Generally, proximity measures, especially similarities, are transformed to have values in the interval [0, 1], so that a proximity value indicates the fraction of similarity or dissimilarity.
  – Example: if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can map them into the range [0, 1] with the transformation $s' = (s - 1) / 9$, where $s'$ and $s$ are the new and original similarity values, respectively.
  – Exercise: if the original similarity between two objects is 6.4, what is the similarity when transformed to the range [0, 1]?

Similarity and Dissimilarity
• Proximity Transformation 2
  – More generally, the transformation of similarities to the interval [0, 1] is given by the expression $s' = (s - \mathrm{min}_s) / (\mathrm{max}_s - \mathrm{min}_s)$, where min_s and max_s are the original minimum and maximum similarity values, respectively.
  – Exercise: if all original similarities fall within [10, 30] and the similarity between two particular objects is 14, what is the new similarity when transformed to the range [0, 1]?

Similarity and Dissimilarity
• Proximity Transformation 3
  – Likewise, dissimilarity measures with a finite range can be mapped to [0, 1] by using the formula $d' = (d - \mathrm{min}_d) / (\mathrm{max}_d - \mathrm{min}_d)$, where min_d and max_d are the minimum and maximum distance values, respectively.
  – If the proximity measure originally takes values in the interval $[0, \infty)$, one transformation of the proximity measure to [0, 1] is $d' = d / (1 + d)$.

Similarity and Dissimilarity
• Proximity Transformation 4
  – Transform similarities to dissimilarities.
  – If the similarity falls in [0, 1], we can use the transformation $d = 1 - s$, where $d$ and $s$ are the dissimilarity and similarity values, respectively.
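
The proximity transformations on the last few slides can be sketched as small functions. This is an illustrative sketch rather than code from the original slides: the function names are hypothetical, and the d / (1 + d) mapping for unbounded distances is an assumed, commonly used choice.

```python
def min_max(value, lo, hi):
    """Linearly map a proximity value from the range [lo, hi] onto [0, 1]."""
    if hi == lo:
        raise ValueError("the original range must have nonzero width")
    return (value - lo) / (hi - lo)


def squash_unbounded(d):
    """Map a distance in [0, inf) into [0, 1) via d / (1 + d) (assumed choice)."""
    return d / (1.0 + d)


def similarity_to_dissimilarity(s):
    """For a similarity s already in [0, 1], use d = 1 - s."""
    return 1.0 - s


print(min_max(6.4, 1, 10))  # similarity 6.4 on a scale from 1 to 10
print(min_max(14, 10, 30))  # similarity 14 on a scale from 10 to 30
```

The linear map preserves relative differences between values, whereas the unbounded mapping is non-linear and compresses large distances toward 1.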

Proximity of Objects
• Proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes.
• How do we compute the dissimilarity of the two objects in the following table?
  Student ID   Grade Level   Age
  1203         C             23
  2501         A             20
• Suppose that Grade level = {A, B, C, D, E} and the age of each student falls in [10, 30].

Proximity of Objects
• Proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes.
  Student ID   Grade Level   Age
  1203         C             23
  2501         A             20
• d = distance(Student ID) + distance(Grade level) + distance(Age)
• Student ID is a nominal attribute, Grade level is an ordinal attribute, and Age is a ratio/interval attribute.

Proximity of Objects
• Distance (dissimilarity) of two nominal attribute values.
• Let p and q be two nominal attribute values. We define the distance between them as $d(p, q) = 0$ if $p = q$ and $d(p, q) = 1$ if $p \neq q$.
  Student ID   Grade Level   Age
  1203         C             23
  2501         A             20
• The two student IDs differ, so distance(Student ID) = 1:
  d = distance(Student ID) + distance(Grade level) + distance(Age) = 1 + distance(Grade level) + distance(Age)

Proximity of Objects
• Distance (dissimilarity) of two ordinal attribute values.
• Map each value of an ordinal attribute to an integer from 0 to n-1, where n is the number of values.
• Let p and q be two ordinal attribute values after this mapping. We define the distance between them as $d(p, q) = |p - q| / (n - 1)$.
• Suppose that Grade level = {A, B, C, D, E}, so the values map to A=4, B=3, C=2, D=1, E=0.
  Student ID   Grade Level   Age
  1203         C             23
  2501         A             20

Proximity of Objects
• Distance (dissimilarity) of two ordinal attribute values.
• Suppose that Grade level = {A, B, C, D, E}, so the values map to A=4, B=3, C=2, D=1, E=0.
• distance(Grade level) = |C - A| / (n - 1) = |2 - 4| / (5 - 1) = 0.5
• d = distance(Student ID) + distance(Grade level) + distance(Age) = 1 + 0.5 + distance(Age)
  Student ID   Grade Level   Age
  1203         C             23
  2501         A             20

Proximity of Objects
• Distance (dissimilarity) of two ratio attribute values.
• Let p and q be two ratio attribute values. We define the distance between them as $d(p, q) = |p - q|$. (Note: the distance between two interval attribute values can also be computed using this formula.)
• distance(Age) = |23 - 20| = 3
• Furthermore, we transform it to [0, 1].
• Suppose that the age of each student falls in [10, 30]. Then the age distance falls in the interval [0, 20].

Proximity of Objects
• Distance (dissimilarity) of two ratio attribute values.
• distance(Age) = |23 - 20| = 3
• Furthermore, we transform it to [0, 1].
• Suppose that the age of each student falls in [10, 30]. Then the age distance falls in the interval [0, 20].
• Transformed distance(Age) = (3 - 0) / (20 - 0) = 0.15
• d = distance(Student ID) + distance(Grade level) + distance(Age) = 1 + 0.5 + 0.15 = 1.65
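
The worked example above, combining a nominal, an ordinal, and a ratio attribute, can be reproduced with a short sketch. The function names and the tuple layout (student ID, grade, age) are hypothetical; the per-attribute definitions follow the slides, with the age distance rescaled by the width of the assumed age range [10, 30].

```python
def nominal_distance(p, q):
    """0 if the two values are equal, 1 otherwise."""
    return 0.0 if p == q else 1.0


def ordinal_distance(p, q, ordered_values):
    """|rank(p) - rank(q)| / (n - 1), where ranks are positions 0..n-1."""
    n = len(ordered_values)
    return abs(ordered_values.index(p) - ordered_values.index(q)) / (n - 1)


def ratio_distance(p, q, lo, hi):
    """|p - q| rescaled to [0, 1] by the largest possible difference hi - lo."""
    return abs(p - q) / (hi - lo)


GRADES = ["E", "D", "C", "B", "A"]  # positions give E=0, D=1, C=2, B=3, A=4


def student_distance(a, b):
    """Sum of per-attribute distances for (student_id, grade, age) tuples."""
    return (nominal_distance(a[0], b[0])
            + ordinal_distance(a[1], b[1], GRADES)
            + ratio_distance(a[2], b[2], 10, 30))


print(student_distance((1203, "C", 23), (2501, "A", 20)))  # 1 + 0.5 + 0.15
```

Because every component is scaled to [0, 1], each attribute contributes at most 1 to the total, so no single attribute dominates the sum.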

Proximity of Objects
• Exercise: what is the dissimilarity of the two objects in the following table?
  Student ID   Grade Level   Age
  1215         E             18
  2637         B             27
• Suppose that Grade level = {A, B, C, D, E, F} and the age of each student falls in [15, 30].

Proximity of Objects
• In many cases, each object is described only by a number of ratio attributes.
• Example: two cuboids with length, width, and height.
  Length (cm)   Width (cm)   Height (cm)
  25            15           20
  15            10           30
• How do we compute the distance between two objects with a number of ratio attributes?

Euclidean Distance
• Euclidean distance: $\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
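
As a quick illustration, the Euclidean distance can be applied to the two cuboids from the earlier Proximity of Objects slide. This is a minimal sketch; the function name `euclidean` is chosen here for illustration, and no standardization is applied because all three attributes are already in centimeters.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences over all attributes."""
    if len(p) != len(q):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


# (length, width, height) in cm, from the cuboid table.
cuboid_a = (25, 15, 20)
cuboid_b = (15, 10, 30)

print(euclidean(cuboid_a, cuboid_b))  # sqrt(100 + 25 + 100) = 15.0
```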

Euclidean Distance Matrix

Minkowski Distance
• Minkowski distance is a generalization of Euclidean distance: $\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$, where r is a parameter, n is the number of dimensions (attributes), and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.

Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  – A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.
• r = 2. Euclidean distance (L2 norm).
• r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
  – This is the maximum difference between any component of the vectors.
• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
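
The three special cases above can be sketched with one function. This is an illustrative sketch only: the function name `minkowski` and the convention of passing `math.inf` for the supremum distance are choices made here, and the example reuses the cuboid measurements from the earlier slide.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum distance."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


p, q = (25, 15, 20), (15, 10, 30)

print(minkowski(p, q, 1))         # city block: 10 + 5 + 10
print(minkowski(p, q, 2))         # Euclidean
print(minkowski(p, q, math.inf))  # supremum: largest single difference, 10

# With r = 1 on binary vectors, the result is the Hamming distance.
print(minkowski((1, 0, 0, 1), (0, 0, 1, 1), 1))  # two bits differ
```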

Minkowski Distance Matrix

Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
• A distance that satisfies these properties is a metric.
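
The three properties can be checked numerically on a finite set of points. The sketch below is illustrative: the helper `is_metric_on` and the sample points are hypothetical, and passing the check only means that no counterexample was found among the given points, not that the function is a metric in general.

```python
import itertools
import math


def is_metric_on(points, dist, tol=1e-12):
    """Check positive definiteness, symmetry, and the triangle inequality on points."""
    for p, q in itertools.product(points, repeat=2):
        d = dist(p, q)
        # Positive definiteness: d >= 0, and d == 0 exactly when p == q.
        if d < 0 or (d == 0) != (p == q):
            return False
        # Symmetry.
        if abs(d - dist(q, p)) > tol:
            return False
    for p, q, r in itertools.product(points, repeat=3):
        # Triangle inequality.
        if dist(p, r) > dist(p, q) + dist(q, r) + tol:
            return False
    return True


def squared_difference(a, b):
    """Squared Euclidean distance in one dimension (not a metric)."""
    return (a[0] - b[0]) ** 2


points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # hypothetical sample points
print(is_metric_on(points, math.dist))  # True: Euclidean distance passes

# Fails the triangle inequality: 4 > 1 + 1 for the points 0, 1, 2.
print(is_metric_on([(0,), (1,), (2,)], squared_difference))  # False
```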

Similarity Between Binary Vectors
• A common situation is that objects p and q have only binary attributes.
  Object   a1   a2   a3   a4   a5   a6   a7   a8   a9   a10
  p        1    0    0    0    0    0    0    0    0    0
  q        0    0    0    0    0    0    1    0    0    1

Similarity Between Binary Vectors
• p and q have only binary attributes.
• We define four quantities, M01, M10, M00, and M11, as follows:
  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1
  (M00 and M11 are called matches, and the number of attributes is M01 + M10 + M11 + M00.)
• Simple Matching and Jaccard Coefficients
  SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
  J = number of 11 matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11)

SMC versus Jaccard: Example
  p = 1 0 0 0 0 0 0 0 0 0
  q = 0 0 0 0 0 0 1 0 0 1
  M01 = 2 (the number of attributes where p was 0 and q was 1)
  M10 = 1 (the number of attributes where p was 1 and q was 0)
  M00 = 7 (the number of attributes where p was 0 and q was 0)
  M11 = 0 (the number of attributes where p was 1 and q was 1)
  SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
  J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
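
The counts and both coefficients in this example can be reproduced with a short sketch. The function names are hypothetical, the vectors are assumed to hold the integers 0 and 1, and returning 0 for the Jaccard coefficient of two all-zero vectors is a convention assumed here, not something the slides specify.

```python
def match_counts(p, q):
    """Count the attribute pairs of each kind: '01', '10', '00', '11'."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    counts = {"01": 0, "10": 0, "00": 0, "11": 0}
    for a, b in zip(p, q):
        counts[f"{a}{b}"] += 1
    return counts


def smc(p, q):
    """Simple matching coefficient: matches / number of attributes."""
    m = match_counts(p, q)
    return (m["11"] + m["00"]) / len(p)


def jaccard(p, q):
    """Jaccard coefficient: 11 matches / attributes that are not both zero."""
    m = match_counts(p, q)
    denom = m["01"] + m["10"] + m["11"]
    return m["11"] / denom if denom else 0.0  # assumed convention for all-zero input


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

print(match_counts(p, q))  # {'01': 2, '10': 1, '00': 7, '11': 0}
print(smc(p, q))           # 0.7
print(jaccard(p, q))       # 0.0
```

The gap between the two values shows why Jaccard is preferred for sparse binary data: SMC is dominated by the many 0-0 matches, which Jaccard ignores.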

Cosine Similarity
• Suppose that d1 and d2 are two objects, and each object is denoted by a vector.
  Object   a1   a2   a3   a4   a5   a6   a7   a8   a9   a10
  d1       3    2    0    5    0    0    0    2    0    0
  d2       1    0    0    0    0    0    0    1    0    2

Cosine Similarity
• If d1 and d2 are two objects, each denoted by a vector, then $\cos(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d.
• Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2
  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||) = 5 / (6.481 * 2.449) = 0.3150
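
The cosine computation can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name; d2 is written out in full as a ten-component vector with nonzero entries in positions 1, 8, and 10, which is what the dot product and length calculations in the example imply.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

print(round(cosine_similarity(d1, d2), 4))  # 0.315
```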

Extended Jaccard Coefficient (Tanimoto)
• Variation of Jaccard for continuous or count attributes.
  – Reduces to Jaccard for binary attributes.
  – If p and q are two objects, each denoted by a vector, then $T(p, q) = (p \cdot q) / (\|p\|^2 + \|q\|^2 - p \cdot q)$, where $p \cdot q$ indicates the vector dot product between p and q, and $\|p\|^2$ is the square of the length of vector p.
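
The claim that the extended coefficient reduces to Jaccard on binary data can be checked with a small sketch. The function name `tanimoto` and the example vectors are chosen here for illustration; the formula is the standard one, dot product divided by the sum of the squared lengths minus the dot product.

```python
def tanimoto(p, q):
    """Extended Jaccard: (p . q) / (|p|^2 + |q|^2 - p . q)."""
    dot = sum(a * b for a, b in zip(p, q))
    pp = sum(a * a for a in p)
    qq = sum(b * b for b in q)
    denom = pp + qq - dot
    if denom == 0:
        raise ValueError("undefined when both vectors are all zeros")
    return dot / denom


# Binary vectors: M11 = 1, M10 = 1, M01 = 1, so the Jaccard coefficient is 1/3.
print(tanimoto([1, 1, 0], [1, 0, 1]))

# Count vectors (the two documents from the cosine example): 5 / (42 + 6 - 5).
print(tanimoto([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]))
```

For binary vectors the dot product equals M11 and each squared length equals the number of ones in that vector, so the denominator becomes M01 + M10 + M11.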