MIS 451 Building Business Intelligence Systems Clustering 2

Problem
- Target Marketing
  - Diaper, Baby food, and Belgian Toys
  - Swiss cheese chocolate French Wine

Clustering
- Clustering is a data mining method for grouping objects such that objects within the same cluster are similar and objects in different clusters are dissimilar.
- Why clustering?
  - SQL-based OLAP is not suitable for clustering objects whose attributes have a large number of possible values.
  - SQL-based OLAP is not suitable for clustering objects with a large number of attributes.

Clustering
- Steps in clustering objects
  - Compute the similarity between objects.
  - Cluster the objects based on the similarity between them.

Similarity
- An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, and gender).
- When measuring similarity between objects, we measure similarity between the variables of those objects.
- In practice, instead of measuring similarity between variables directly, we use distance to measure the dissimilarity between them.

Dissimilarity
- Continuous variable
  - Manhattan distance
  - Euclidean distance

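Dissimilarity
- For two objects X and Y with continuous variables 1, 2, …, n, the Manhattan distance is defined as:

$$d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$$

  where x_i and y_i are the values of variable i for X and Y.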

Dissimilarity
- Example of Manhattan distance

Name  Age  Spending ($)
Sue   21   2300
Carl  27   2600
Tom   45   5400
Jack  52   6000
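As a minimal sketch of how the Manhattan distance could be computed for the customers in this example, the following Python snippet sums the absolute differences in age and spending. The function name `manhattan_distance` and the `customers` dictionary are illustrative assumptions, not part of the original slides.

```python
def manhattan_distance(x, y):
    """Sum of absolute differences between corresponding variables."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) for a, b in zip(x, y))


# (age, spending) pairs taken from the example table
customers = {
    "Sue": (21, 2300),
    "Carl": (27, 2600),
    "Tom": (45, 5400),
    "Jack": (52, 6000),
}

# |21 - 27| + |2300 - 2600| = 6 + 300 = 306
sue_carl = manhattan_distance(customers["Sue"], customers["Carl"])
# |21 - 52| + |2300 - 6000| = 31 + 3700 = 3731
sue_jack = manhattan_distance(customers["Sue"], customers["Jack"])

print("Sue-Carl:", sue_carl)
print("Sue-Jack:", sue_jack)
```

Under this measure Sue is far closer to Carl than to Jack, which matches the intuition that the two younger, lower-spending customers are similar.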

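Dissimilarity
- For two objects X and Y with continuous variables 1, 2, …, n, the Euclidean distance is defined as:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

  where x_i and y_i are the values of variable i for X and Y.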

Dissimilarity
- Example of Euclidean distance

Name  Age  Spending ($)
Sue   21   23200
Carl  27   23330
Tom   45   23260
Jack  52   23400
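The Euclidean distance for this example can be sketched in Python as follows. The names `euclidean_distance` and `customers` are assumptions introduced for illustration; the data are the age and spending values from the example.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared differences between variables."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# (age, spending) pairs taken from the example table
customers = {
    "Sue": (21, 23200),
    "Carl": (27, 23330),
    "Tom": (45, 23260),
    "Jack": (52, 23400),
}

# sqrt((21 - 27)^2 + (23200 - 23330)^2) = sqrt(36 + 16900)
sue_carl = euclidean_distance(customers["Sue"], customers["Carl"])
# sqrt((21 - 45)^2 + (23200 - 23260)^2) = sqrt(576 + 3600)
sue_tom = euclidean_distance(customers["Sue"], customers["Tom"])

print(f"Sue-Carl: {sue_carl:.2f}")
print(f"Sue-Tom:  {sue_tom:.2f}")
```

Note that Sue comes out closer to Tom than to Carl even though Tom is 24 years older, because the spending variable is on a much larger scale than age and dominates the result.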

Dissimilarity
- Standardize the values of a variable
  - Calculate the mean value.
  - Calculate the mean absolute deviation.
  - Standardize the values of the variable using the formula:
    new value = (old value - mean value) / mean absolute deviation
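The standardization steps above might be sketched as follows, assuming the divisor is the mean absolute deviation computed in the second step. The function name `standardize` is hypothetical, and the sample ages are borrowed from the distance examples.

```python
def standardize(values):
    """Standardize values using the mean and the mean absolute deviation."""
    mean = sum(values) / len(values)
    # Mean absolute deviation: average absolute distance from the mean
    mad = sum(abs(v - mean) for v in values) / len(values)
    if mad == 0:
        raise ValueError("all values are identical; cannot standardize")
    return [(v - mean) / mad for v in values]


ages = [21, 27, 45, 52]       # mean = 36.25, mean absolute deviation = 12.25
standardized_ages = standardize(ages)

for age, z in zip(ages, standardized_ages):
    print(f"{age} -> {z:.4f}")
```

After standardizing every variable this way, each one contributes on a comparable scale when distances are computed.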

Dissimilarity
- Binary variable
  distance = number of mismatched variables / total number of variables

Name  Married (Y/N)  Gender  Internet connection at home
Sue   Y              M       Y
Carl  Y              F       Y
Tom   N
Jack  N              F       N
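A minimal sketch of a distance for binary variables is shown below. It assumes the distance is the fraction of variables on which two objects disagree, so that identical objects have distance 0; this interpretation, the function name `binary_distance`, and the tuple layout (married, gender, internet) are assumptions made for illustration. Tom is left out because his row is incomplete in the table.

```python
def binary_distance(x, y):
    """Fraction of binary variables on which two objects differ."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


# (married, gender, internet connection at home)
sue = ("Y", "M", "Y")
carl = ("Y", "F", "Y")
jack = ("N", "F", "N")

print("Sue-Carl:", binary_distance(sue, carl))    # differ on gender only: 1/3
print("Carl-Jack:", binary_distance(carl, jack))  # differ on married and internet: 2/3
print("Sue-Jack:", binary_distance(sue, jack))    # differ on all three: 3/3
```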

Clustering based on dissimilarity
- After calculating the dissimilarity between every pair of objects, a dissimilarity matrix can be created, with the objects as row and column indexes and the dissimilarities between objects as elements.
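One way such a matrix could be built is sketched below, using Manhattan distance on the age and spending values from the earlier example. The helper names `manhattan_distance` and `dissimilarity_matrix` and the nested-dictionary representation are illustrative assumptions; any distance function could be passed in instead.

```python
def manhattan_distance(x, y):
    """Sum of absolute differences between corresponding variables."""
    return sum(abs(a - b) for a, b in zip(x, y))


def dissimilarity_matrix(objects, distance):
    """Return a nested dict: matrix[a][b] is the dissimilarity of a and b."""
    names = list(objects)
    return {
        a: {b: distance(objects[a], objects[b]) for b in names}
        for a in names
    }


# (age, spending) pairs from the Manhattan distance example
customers = {
    "Sue": (21, 2300),
    "Carl": (27, 2600),
    "Tom": (45, 5400),
    "Jack": (52, 6000),
}

matrix = dissimilarity_matrix(customers, manhattan_distance)

# Print the matrix with the objects as row and column indexes
print("      " + "".join(f"{name:>6}" for name in matrix))
for name, row in matrix.items():
    print(f"{name:<6}" + "".join(f"{value:>6}" for value in row.values()))
```

The matrix is symmetric with zeros on the diagonal, since each object has zero dissimilarity with itself.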

Clustering based on dissimilarity

      Sue  Tom  Carl  Jack  Mary
Sue   0    6    8     2     7
Tom   6    0    1     5     3
Carl  8    1    0     10    9
Jack  2    5    10    0     4
Mary  7    3    9     4     0

Clustering based on dissimilarity
Step 1: Initially, place each object in its own cluster.
Step 2: Calculate the dissimilarity between clusters. The dissimilarity between two clusters is the minimum dissimilarity between two objects, one from each cluster.
Step 3: Merge the two clusters with the least dissimilarity.
Step 4: Repeat steps 2-3 until all objects are in one cluster.
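The procedure above, commonly known as single-linkage agglomerative clustering, can be sketched in Python using the Sue/Tom/Carl/Jack/Mary dissimilarity matrix from the previous slide. The function names and data layout are assumptions for illustration; after the initial placement, the loop repeats the dissimilarity calculation and the merge until one cluster remains.

```python
from itertools import combinations

names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
rows = [
    [0, 6, 8, 2, 7],
    [6, 0, 1, 5, 3],
    [8, 1, 0, 10, 9],
    [2, 5, 10, 0, 4],
    [7, 3, 9, 4, 0],
]
# Lookup table: dissimilarity[(a, b)] for every ordered pair of objects
dissimilarity = {
    (a, b): rows[i][j]
    for i, a in enumerate(names)
    for j, b in enumerate(names)
}


def cluster_dissimilarity(c1, c2):
    """Minimum dissimilarity between two objects, one from each cluster."""
    return min(dissimilarity[(a, b)] for a in c1 for b in c2)


def single_linkage(objects):
    """Return the sequence of merges as (cluster, cluster, dissimilarity)."""
    # Step 1: each object starts in its own cluster
    clusters = [frozenset([obj]) for obj in objects]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with the least dissimilarity
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_dissimilarity(*pair),
        )
        merges.append((sorted(c1), sorted(c2), cluster_dissimilarity(c1, c2)))
        # Step 3: merge that pair into a single cluster
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


merges = single_linkage(names)
for left, right, d in merges:
    print(f"merge {left} + {right} at dissimilarity {d}")
```

With this matrix, Tom and Carl merge first (dissimilarity 1), then Sue and Jack (2), then Mary joins Tom and Carl (3), and finally the two remaining clusters merge (4).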
- Slides: 15