Cluster Analysis (Clustering)
© Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca
Chapter 7. Cluster Analysis: Outline
- Applications of Cluster Analysis
- What Is Good Cluster Analysis?
- Types of Data in Cluster Analysis
- Major Clustering Methods: Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based
- Summary
General Applications of Clustering (Exploratory Analysis)
- Business and economic science (especially marketing research)
- Biomedical informatics
- Pattern recognition
- Spatial data analysis: detect spatial clusters and explain them in terms of spatial data mining
- WWW: web document profiling; clustering log data to discover groups or profiles of similar access patterns
- Domain knowledge is very important to the applications of clustering!
Specific Examples of Clustering Applications
- Marketing: help marketers discover group profiles of their customers, then use this knowledge to develop targeted marketing programs
- Insurance: identify group profiles of motor insurance policy holders with a high average claim cost
- Biology: categorize genes with similar functionality, derive taxonomies, and gain insight into structures inherent in populations
What Is Good Clustering?
- A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity.
- The quality of a clustering result depends on both the clustering method and the similarity measure adopted.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- The quality of a similarity measure can be highly subjective.
Issues of Clustering in Data Mining
- Scalability (important for big data analysis)
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to deal with high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
Chapter 7. Cluster Analysis
- What Is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods: Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based
- Outlier Analysis
- Summary
Data Structures in Clustering
Two data structures are commonly used:
- Data matrix: n objects × p variables
- Dissimilarity matrix: n objects × n objects
Similarity Measurement between Samples
- Dissimilarity/similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
- The definitions of distance functions are usually very different for interval-scaled (numeric), boolean, categorical, ordinal, and ratio variables.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
- Sometimes weights should be associated with different variables, based on the application and data semantics.

person  weight  height
1       70      1.80
2       66      1.76
3       65      1.65
Types of Data in Cluster Analysis
- Interval-scaled variables: data values have order and equal intervals, measured on a linear scale; the intervals keep the same importance throughout the scale
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
Similarity and Dissimilarity Between Data Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects xi and xj (viewing data objects as points in the data space).
- A popular choice is the Minkowski distance:
  d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)
  where xi = <xi1, xi2, …, xip> and xj = <xj1, xj2, …, xjp> are two p-dimensional data points and q is a positive integer.
- If q = 1, d is the Manhattan distance.
Similarity and Dissimilarity Between Objects (cont.)
- If q = 2, d is the Euclidean distance:
  d(i, j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + … + |xip - xjp|^2)
- Properties of a metric:
  - d(i, j) ≥ 0
  - d(i, i) = 0
  - d(i, j) = d(j, i)
  - d(i, j) ≤ d(i, k) + d(k, j)
- One can also use a weighted distance, the parametric Pearson product-moment correlation (or simply correlation r), or other dissimilarity measures.
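To make the formulas concrete, here is a minimal Python sketch of the Minkowski family (the function name and example points are illustrative, not from the slides):

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points:
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x, y = (1.0, 2.0), (4.0, 6.0)
print(minkowski(x, y, q=1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(x, y, q=2))  # Euclidean: sqrt(9 + 16) = 5.0
```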
Data Normalization on Features (Interval-Valued Variables)
Data normalization (for each feature f):
1. Calculate the mean: mf = (x1f + x2f + … + xnf) / n
2. Calculate the mean absolute deviation: sf = (|x1f - mf| + |x2f - mf| + … + |xnf - mf|) / n
3. Calculate the standardized measurement (z-score, zero-mean normalization): zif = (xif - mf) / sf
- Using the mean absolute deviation is more robust to outliers than using the standard deviation for cluster analysis.
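A minimal sketch of this normalization in Python, applied to the weight column of the earlier person table (the helper name is illustrative):

```python
def standardize(values):
    """Zero-mean normalization using the mean absolute deviation,
    which is more robust to outliers than the standard deviation."""
    n = len(values)
    m = sum(values) / n                      # mean of feature f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation
    return [(x - m) / s for x in values]     # z-scores

weights = [70, 66, 65]        # feature "weight" from the person table
print(standardize(weights))   # -> [1.5, -0.5, -1.0]
```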
Distance Functions for Binary Variables
- A 2×2 contingency table for binary data counts, over all variables:
  - a: the number of variables that equal 1 for both objects i and j
  - b: the number of variables that equal 1 for object i but 0 for object j
  - c: the number of variables that equal 0 for object i but 1 for object j
  - d: the number of variables that equal 0 for both objects i and j
- Example: i: 01100110, j: 11001111 gives a = 3, b = 1, c = 3, d = 1.
Distance Functions for Binary Variables (cont.)
- Simple matching coefficient (if the binary variables are symmetric):
  d(i, j) = (b + c) / (a + b + c + d)
- Jaccard coefficient (if the binary variables are asymmetric):
  d(i, j) = (b + c) / (a + b + c)
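A small sketch computing both coefficients from the contingency counts (the function name is illustrative; the vectors reuse the example from the previous slide):

```python
def binary_dissimilarity(i, j, asymmetric=False):
    """Dissimilarity between two binary vectors via contingency counts.
    Symmetric variables -> simple matching: (b + c) / (a + b + c + d).
    Asymmetric variables -> Jaccard:        (b + c) / (a + b + c)."""
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(i, j) if u == 0 and v == 0)
    return (b + c) / (a + b + c) if asymmetric else (b + c) / (a + b + c + d)

i = [0, 1, 1, 0, 0, 1, 1, 0]
j = [1, 1, 0, 0, 1, 1, 1, 1]
print(binary_dissimilarity(i, j))                   # simple matching: 4/8 = 0.5
print(binary_dissimilarity(i, j, asymmetric=True))  # Jaccard: 4/7 ≈ 0.571
```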
Dissimilarity between Binary Variables: Example

name  gender  fever  cough  test-1  test-2  test-3  test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

- Gender is a symmetric attribute and is ignored here.
- The remaining attributes are asymmetric binary.
- Let the values Y and P be set to 1, and the value N be set to 0.
Dissimilarity between Binary Variables (Asymmetric Example)
Preprocessing: 1. ignore gender; 2. Y, P → 1; 3. N → 0. Then, using the Jaccard coefficient:
  d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
Conclusion: the clinical conditions of Jack and Mary are the most similar.
Nominal (Categorical) Variables
- A generalization of the binary variable in that it can take more than two states, e.g., color: red, yellow, blue, …
Dissimilarity between nominal variables:
- Method 1: simple matching
  d(i, j) = (t - m) / t, where m is the number of matches and t is the total number of variables.
- Method 2: use a large number of asymmetric binary variables: create a new binary variable for each of the M nominal states of a particular variable, e.g., <0, 0, 0, 1> for purple. Purple is then totally different from blue.
- Method 3: use a small number of additive binary variables, e.g., <1, 0, 1> for purple in a 3-bit RGB representation. Purple is then only partially different from blue, <0, 0, 1>.
Dissimilarity between Ordinal Variables
- An ordinal variable can be discrete or continuous.
- Its values are ordered in a meaningful sequence, e.g., job ranks.
- It can be treated like an interval-scaled variable (for each feature f, having Mf states):
  1. Replace xif (the value of the f-th variable of the i-th object) by its rank rif ∈ {1, …, Mf}.
  2. Map the range of feature f onto [0, 1] by replacing the rank with zif = (rif - 1) / (Mf - 1).
  3. Compute the dissimilarity using methods for interval-scaled variables.
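A minimal sketch of this rank-based mapping (the job-rank states are a hypothetical example):

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values onto [0, 1]: replace each value by its rank r
    (1..M) and compute z = (r - 1) / (M - 1)."""
    M = len(ordered_states)
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]

# Hypothetical job ranks, ordered from lowest to highest
states = ["assistant", "associate", "full"]
print(ordinal_to_interval(["assistant", "full", "associate"], states))
# -> [0.0, 1.0, 0.5]; now usable with interval-scaled distance functions
```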
Dissimilarity between Ratio-Scaled Variables
- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately exponential, such as Ae^(Bt) or Ae^(-Bt).
- Alternative methods:
  1. Treat them like interval-scaled variables (not a good choice! Why? The scale may be distorted).
  2. Apply a logarithmic transformation, yif = log(xif), and treat the transformed values as interval-scaled.
  3. Treat them as continuous ordinal data and treat their ranks as interval-scaled values.
Variables of Mixed Types
- A data set may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio-scaled.
- One may use a weighted formula to combine their effects (distance between objects i and j):
  d(i, j) = Σf δij(f) dij(f) / Σf δij(f)
- If feature f is binary or nominal: dij(f) = 0 if xif = xjf, and dij(f) = 1 otherwise.
- If f is interval-based: use the normalized distance.
- If f is ordinal or ratio-scaled: compute the ranks rif, map them to zif = (rif - 1) / (Mf - 1), and treat zif as interval-scaled.
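A simplified sketch of the weighted combination, assuming ordinal values are already ranks and treating missing values as δ = 0 (all names and the example objects are illustrative; the asymmetric-binary refinement is omitted for brevity):

```python
def mixed_dissimilarity(xi, xj, types, ranges=None, M=None):
    """Weighted combination d(i,j) = sum_f delta_f * d_f / sum_f delta_f.
    types[f] in {"binary", "nominal", "interval", "ordinal"}; ranges[f] is
    max_f - min_f for interval features; M[f] is the number of ordinal states."""
    num = den = 0.0
    for f, t in enumerate(types):
        a, b = xi[f], xj[f]
        if a is None or b is None:            # missing value: delta_f = 0
            continue
        if t in ("binary", "nominal"):
            d = 0.0 if a == b else 1.0
        elif t == "interval":
            d = abs(a - b) / ranges[f]        # normalized distance
        else:                                 # ordinal: ranks mapped to [0, 1]
            d = abs((a - 1) / (M[f] - 1) - (b - 1) / (M[f] - 1))
        num += d
        den += 1.0
    return num / den

# Two objects with (nominal, interval, ordinal-rank) features
print(mixed_dissimilarity(("red", 70, 2), ("blue", 65, 3),
                          types=["nominal", "interval", "ordinal"],
                          ranges=[None, 10, None], M=[None, None, 3]))
```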
Chapter 7. Cluster Analysis
- What Is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods: Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based
- Summary
Major Clustering Approaches
- Partitioning algorithms: construct a number of partitions of the data, then improve them iteratively according to some optimization criterion
- Hierarchical algorithms: create a hierarchical (clustering) structure for the data set using some optimization criterion
- Density-based: based on connectivity and/or density functions
- Grid-based: quantize the data space into a finite number of grid cells on which cluster analysis is performed
- Model-based: a cluster model is hypothesized for each cluster, and the best cluster models fitting the data are found
Chapter 7. Cluster Analysis
- What Is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods: Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based
- Summary
Partitioning Algorithms: Basic Concept (Cluster Representation)
- Partitioning method: given n data objects, a partitioning algorithm organizes the objects into k partitions, each representing a cluster.
- Given k, find the k clusters that optimize the chosen partitioning criterion (based on a similarity function).
- Globally optimal solution: exhaustively enumerate all combinations, which is infeasible in practice.
- Heuristic methods: the k-means and k-medoids algorithms.
  - k-means (MacQueen'67): each cluster is represented by the center of the cluster during the iterative clustering process
  - k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw'87): each cluster is represented by a representative object of the cluster during the iterative clustering process
Centroid-Based Clustering Technique (the k-Means Clustering Method)
Given k, the k-means algorithm is implemented in four steps (n objects, k clusters, t iterations):
1. Randomly select k of the data objects as seed points; each seed point represents a cluster.
2. Assign (or re-assign) each data object to the cluster with the nearest seed point.
3. Compute new seed points as the centroids of the clusters of the current epoch (the centroid is the center, i.e. mean, of a cluster).
4. Go back to step 2; stop when no assignment changes.
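A minimal sketch of these four steps (helper names are illustrative; production code would typically use a library implementation such as scikit-learn's KMeans):

```python
import random

def k_means(points, k, max_iter=100):
    """Minimal k-means sketch: points is a list of numeric tuples."""
    seeds = random.sample(points, k)                      # step 1: random seeds
    for _ in range(max_iter):                             # at most t iterations
        clusters = [[] for _ in range(k)]
        for p in points:                                  # step 2: assign each
            i = min(range(k), key=lambda c: dist2(p, seeds[c]))  # to nearest seed
            clusters[i].append(p)
        new_seeds = [centroid(c) if c else seeds[i]       # step 3: recompute
                     for i, c in enumerate(clusters)]     # cluster centroids
        if new_seeds == seeds:                            # step 4: stop when the
            break                                         # seeds no longer change
        seeds = new_seeds
    return seeds, clusters

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))
```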
The k-Means Clustering Method
Example (2-means): assign each object to the nearest seed, find the centroids of the newly formed clusters, re-assign all data to the new centroids, and repeat; stop when the assignment has stabilized.
Comments on the k-Means Method
- Strengths
  - Relatively efficient: O(nkt), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
  - Often terminates at a local optimum; the global optimum may be sought with techniques such as simulated annealing and genetic algorithms.
- Weaknesses
  - Applicable only when a mean is defined; it cannot handle categorical data such as colors (red, blue, green, …).
  - The number of clusters k must be specified in advance.
  - Sensitive to noisy data and outliers, which may distort the distribution of the data.
  - Not suitable for discovering clusters with non-convex shapes.
Variations of the k-Means Method
- Variants of k-means differ in:
  - selection of the k initial seed points
  - dissimilarity calculations
  - strategies for calculating cluster means
- Handling categorical data: k-modes (Huang'98)
  - uses new dissimilarity measures to deal with categorical objects
  - replaces the means of clusters with modes
  - uses a frequency-based method to update the modes of clusters
- A mixture of categorical and numerical data: the k-prototype method
Representative Object-Based Technique (the k-Medoids Clustering Method)
- k-means is sensitive to noisy data and outliers, since centroids are pulled by extreme values.
- Instead, find representative objects, called medoids: each cluster is denoted by one medoid, the most centrally located data object in the cluster.
- PAM (Partitioning Around Medoids, 1987)
  - works effectively for small data sets, but does not scale well to large data sets: O(k(n - k)^2 t)
- Improved algorithms:
  - CLARA (Kaufmann & Rousseeuw, 1990): uses sampling to find representatives of the data (reducing the input data size)
  - CLARANS (Ng & Han, 1994): uses a different sample set to search for representatives in each run, and returns the best representative set found in a user-defined number of runs
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987), O(k(n - k)^2 t): uses representative objects of the clusters as seed points (n objects, k clusters, t iterations).
1. Arbitrarily select k objects as representative objects (medoids).
2. Assign each non-medoid object to the most similar representative object.
3. Greedy search, O(k(n - k)^2): for each pair of a non-medoid object h and a medoid i, calculate the total swapping cost TCih = DTnew - DT (the change in total distance after cluster reassignment). Find the best (lowest) TCih; if it is negative, replace medoid i by h, otherwise stop.
4. Repeat from step 2 until there is no change.
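A compact sketch of the swap-based search, assuming a user-supplied distance function and a list of hashable points. For clarity it recomputes the total distance DT from scratch for every candidate swap, which is costlier than PAM's incremental TCih computation:

```python
import itertools

def pam(points, k, dist, max_iter=100):
    """Minimal PAM sketch: greedily swap a medoid with a non-medoid
    whenever the swap lowers the total distance DT of the clustering."""
    medoids = list(points[:k])                      # step 1: arbitrary medoids
    def total_cost(meds):                           # DT = sum of distances of
        return sum(min(dist(p, m) for m in meds)    # each object to its
                   for p in points)                 # nearest medoid (step 2)
    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i, h in itertools.product(range(k), points):  # step 3: try swaps
            if h in medoids:
                continue
            trial = medoids[:i] + [h] + medoids[i + 1:]
            cost = total_cost(trial)
            if cost < best:                         # TC_ih = cost - best < 0
                medoids, best, improved = trial, cost, True
        if not improved:                            # step 4: stop when no swap
            break                                   # improves DT
    return medoids
```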
CLARA (Clustering LARge Applications) (1990)
- CLARA (Kaufmann and Rousseeuw, 1990); built into statistical analysis packages such as S+.
- It draws a sample of the data, applies PAM to the sample, and returns the best clustering as the output.
- Strength: deals with larger data sets than PAM.
- Weaknesses:
  - efficiency depends on the sample size
  - if the sample is biased, a good clustering of the sample will not necessarily represent a good clustering of the whole data set
CLARANS ("Randomized" CLARA) (1994)
- CLARANS (Clustering Large Applications based on RANdomized Search, Ng & Han'94).
- The PAM clustering process can be viewed as searching a graph in which every node (a set of k medoids) is a potential solution: O(C(n, k)) nodes.
  - At each step, all neighbors of the current node are examined, and the current node is replaced by the neighbor with the best improvement. (Two nodes are neighbors if their medoid sets differ by only one object; every node has k(n - k) neighbors.)
- CLARA confines the search to a fixed sample of size m at each search stage.
- CLARANS dynamically draws a random sample of neighbors in each search step.
  - If a local optimum is found, CLARANS starts from a new sample of neighbors in search of a new local optimum.
  - Once a user-specified number of local minima has been found, CLARANS outputs the best one.
- It is more efficient and scalable than both PAM and CLARA.
Chapter 7. Cluster Analysis
- What Is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods: Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based
- Outlier Analysis
- Summary
Hierarchical Clustering
- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input.
- Agglomerative (AGNES), steps 0 → 4: a, b, c, d, e → {a,b}, c, {d,e} → {a,b}, {c,d,e} → {a,b,c,d,e}.
- Divisive (DIANA), steps 4 → 0: the inverse sequence of splits.
AGNES (AGglomerative NESting): Distance between Two Clusters
- Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S+.
- Uses the single-link method and the dissimilarity matrix.
- Merges the nodes (clusters) that have the least dissimilarity.
- Proceeds in a non-descending fashion (merge distances never decrease).
- Eventually all nodes belong to the same cluster.
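A minimal single-link agglomerative sketch that merges clusters until k remain (names are illustrative; a full AGNES implementation would record every merge to build the dendrogram shown on the next slide):

```python
def agnes_single_link(points, dist, k=1):
    """Minimal agglomerative sketch with the single-link criterion:
    repeatedly merge the two clusters with the least dissimilarity,
    where cluster distance = distance between their closest members."""
    clusters = [[p] for p in points]          # start: every object is a cluster
    while len(clusters) > k:                  # merge until k clusters remain
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the closest pair
    return clusters
```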
Dendrogram for Cluster Visualization (Shows How Clusters Merge Hierarchically)
Hierarchical clustering algorithms such as AGNES organize the data objects into a tree of hierarchical partitions (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Clustering Dendrogram in Gene Expression Analysis
- Clustering of expression profiles is used to find differentially regulated genes (figure: expression heat map with dendrogram).
DIANA (DIvisive ANAlysis)
- Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S+.
- Inverse order of AGNES: starts with all objects in one cluster and repeatedly splits.
- Eventually each node (data object) forms a cluster on its own.
More on Hierarchical Clustering Methods
- Weaknesses of agglomerative clustering methods:
  - they do not scale well: time complexity of at least O(n^2), where n is the number of data objects
  - they can never undo what was done at a previous stage
- Integration of hierarchical and distance-based clustering:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters (good for big data analysis)
  - CURE (1998): selects well-scattered points from each cluster, then shrinks them towards the center of the cluster by a specified fraction
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan & Livny (SIGMOD'96).
- Incrementally constructs a CF-tree (Clustering Feature tree), a hierarchical data structure for multiphase clustering:
  - Phase 1: scan the DB to build an initial in-memory CF-tree (a hierarchical compression of the data that tries to preserve its inherent clustering structure)
  - Phase 2: use an arbitrary (e.g., hierarchical) clustering algorithm to cluster the data in the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans (a good property for big data analysis).
- Weaknesses: handles only numeric data, and is sensitive to the input order of the data records.
BIRCH's Clustering Feature Vector
Clustering feature of a sub-cluster: CF = (N, LS, SS)
- N: number of data vectors in the sub-cluster
- LS: Σ_{i=1..N} Xi (linear sum)
- SS: Σ_{i=1..N} Xi^2 (square sum)
Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16, 30), (54, 190)).
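A small sketch of computing and merging CF vectors (function names are illustrative); the additivity shown in cf_merge is what lets BIRCH combine sub-clusters in O(1) without rescanning the data:

```python
def cf(points):
    """Clustering feature of a sub-cluster: CF = (N, LS, SS)."""
    N = len(points)
    LS = tuple(sum(c) for c in zip(*points))                 # linear sum per dim
    SS = tuple(sum(x * x for x in c) for c in zip(*points))  # square sum per dim
    return N, LS, SS

def cf_merge(cf1, cf2):
    """CF vectors are additive, so two sub-clusters can be merged
    without revisiting the raw data."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))  # -> (5, (16, 30), (54, 190)), as on the slide
```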
BIRCH's Parameters and Their Meanings
- A clustering feature holds the summary statistics of the data in a sub-cluster:
  - the zeroth moment: the number of data vectors in the sub-cluster
  - the first moment: the linear sum of the N data vectors
  - the second moment: the square sum of the N data vectors
- BIRCH uses two parameters to control node splits while constructing the CF-tree:
  - branching factor: constrains the maximum number of child nodes per non-leaf node (a leaf node must fit in a memory page of size P, which determines the branching factor of leaf nodes)
  - threshold: constrains the maximum average distance of data pairs of a sub-cluster in the leaf nodes
CF-Tree
- Construction is similar to that of a B+-tree (example: branching factor = 7, threshold = 6).
- The root and non-leaf nodes store entries [CFi, childi], where CFi summarizes the sub-cluster under child i.
- Leaf nodes store CF entries of sub-clusters and are chained by prev/next pointers.
- A new object is inserted into the closest leaf entry; nodes are split as needed.
- After an insertion, the CF information is updated along the path toward the root of the tree.
CURE (Clustering Using REpresentatives)
- CURE: proposed by Guha, Rastogi & Shim (1998); a type of hierarchical clustering.
- Most clustering algorithms either favor clusters of spherical shape and similar size, or are fragile in the presence of outliers; CURE overcomes these problems.
- Uses multiple representative points to represent a cluster. At each step of clustering, the two clusters with the closest pair of representative points are merged.
- Adjusts well to arbitrarily shaped clusters: the representative points of a cluster attempt to capture the data distribution and shape of the cluster.
- Stops growing the cluster hierarchy when only k clusters remain.
The Steps for Large Data Sets in CURE
(Figure: the steps involved in clustering a large data set using CURE; detailed on the next two slides.)
CURE: The Algorithm (1/2), for large data sets
1. Phase 1: pre-clustering
   a. Draw a random sample S of the original data (if the original data set is very large).
   b. Partition sample S into p partitions (partitioning for speed-up).
   c. Partially cluster each partition into |S|/(p×q) clusters, for some q > 1, using CURE's hierarchical clustering, which starts with each input point as a separate cluster.
   d. Outlier elimination:
      - most outliers are eliminated by the random sampling in step a
      - if a cluster grows too slowly, eliminate it
CURE: The Algorithm (2/2)
2. Phase 2: find the final clusters (by CURE)
   a. Cluster the partial clusters (there are at most |S|/q of them):
      - At each iteration, find well-scattered representative points for each new cluster. The representative points of a cluster are used to compute its distance from other clusters; the distance between two clusters is the distance between their closest pair of representative points.
      - Merge the closest pair of clusters. The representative points of each newly formed cluster are "shrunk", i.e. moved toward the cluster center, by a user-specified fraction; the representative points capture the shape of the cluster.
   b. Mark the data with the corresponding cluster labels.
Data Partitioning and Clustering: Example
- Sample size: s = 50
- Number of partitions: p = 2, so partition size s/p = 25
- Number of partial clusters per partition: s/(p×q) = 5 (q = 5)
(Figure: scatter plots of the sample, the two partitions, and their partial clusters.)
Representative Points Shrinking in CURE
- Shrink the multiple representative points towards the gravity center (cluster center) by a fraction α.
- Multiple representatives capture the shape of the cluster.
(Figure: old and new cluster centers, with the representatives moving inward.)
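A minimal sketch of the shrinking step, with alpha as the user-specified fraction (names and the example points are illustrative):

```python
def shrink_representatives(reps, alpha):
    """Move each of a cluster's representative points toward the cluster's
    gravity center by a fraction alpha: r' = r + alpha * (center - r)."""
    n, dim = len(reps), len(reps[0])
    center = tuple(sum(r[d] for r in reps) / n for d in range(dim))
    return [tuple(r[d] + alpha * (center[d] - r[d]) for d in range(dim))
            for r in reps]

reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]
print(shrink_representatives(reps, alpha=0.3))
# Each point moves 30% of the way toward the gravity center (2.0, 1.33)
```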
CHAMELEON
- CHAMELEON: a type of hierarchical clustering, by G. Karypis, E. H. Han and V. Kumar ('99).
- Measures similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.
- A two-phase algorithm:
  1. Use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters.
  2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.
Overall Framework of CHAMELEON
Data set → construct a k-nearest-neighbor graph (sparse graph) → partition the graph by cutting long edges → merge the partitions according to relative interconnectivity and relative closeness → final clusters.
Chapter 7. Cluster Analysis
- What Is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods: Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based
- Outlier Analysis
- Summary
Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion, such as density-connected points).
- Major features:
  - discovers clusters of arbitrary shape
  - handles noise
  - needs one scan
  - needs density parameters as a termination condition
- Several interesting studies:
  - DBSCAN: Ester et al. (KDD'96)
  - OPTICS: Ankerst et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & Keim (KDD'98)
Density-Based Clustering: Background (I)
- Two density parameters:
  - ε: maximum neighborhood radius of a core point
  - MinPts: minimum number of points in an ε-neighborhood of a core point
- Nε(p) = {q ∈ D | dist(p, q) ≤ ε}, where D is the data space; Nε(p) is the set of neighbors of point p.
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. ε, MinPts if
  1. p ∈ Nε(q), and
  2. |Nε(q)| ≥ MinPts (i.e., q is a core point).
(Example: MinPts = 5, ε = 1 cm.)
Density-Based Clustering: Background (II)
- Density-reachable: a point p is density-reachable from a point q w.r.t. ε, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi.
- Density-connected: a point p is density-connected to a point q w.r.t. ε, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. ε and MinPts.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
- If the ε-neighborhood of an object contains at least MinPts objects, the object is called a core object; a border point lies on the edge of a cluster, and points belonging to no cluster are outliers.
(Example: ε = 1 cm, MinPts = 5.)
DBSCAN: The Algorithm
1. Arbitrarily select a point p.
2. Retrieve all points density-reachable from p w.r.t. ε and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
5. Continue the process until all points have been processed.
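A minimal sketch of the algorithm, assuming points are hashable tuples and dist is a user-supplied function; a real implementation would use a spatial index rather than linear scans for the neighborhood queries:

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN sketch: grow a cluster from each unvisited core point
    by collecting everything density-reachable from it; points reachable
    from no core point are labeled noise (-1)."""
    labels = {p: None for p in points}
    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = [q for q in points if dist(p, q) <= eps]
        if len(neighbors) < min_pts:          # border point or outlier so far
            labels[p] = -1
            continue
        labels[p] = cluster_id                # p is a core point: new cluster
        seeds = [q for q in neighbors if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id        # noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = [r for r in points if dist(q, r) <= eps]
            if len(q_neighbors) >= min_pts:   # q is also a core point:
                seeds.extend(q_neighbors)     # expand the cluster through it
        cluster_id += 1
    return labels
```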
Chapter 7. Cluster Analysis
- What Is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods: Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based
- Outlier Analysis
- Summary
Grid-Based Clustering Methods
- Use a multi-resolution grid data structure.
- Several interesting methods:
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - WaveCluster by Sheikholeslami, Chatterjee and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
  - CLIQUE: Agrawal et al. (SIGMOD'98)
STING: A Statistical Information Grid Approach (1)
- Wang, Yang and Muntz (VLDB'97).
- The spatial area is divided into rectangular cells.
- There are several levels of cells corresponding to different levels of resolution (1 → 4 → 16 → 64 → …).
STING: A Statistical Information Grid Approach (2)
- Each cell at a higher level is partitioned into a number of smaller cells at the next lower level.
- Statistical information about each cell is calculated and stored beforehand and is used to answer queries.
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
  - count, mean, s (standard deviation), min, max
  - type of distribution: normal, uniform, etc.
- Use a top-down approach to answer spatial data queries:
  - start from a pre-selected layer, typically with a small number of cells
  - for each cell at the current level, compute the confidence interval
STING: A Statistical Information Grid Approach (3)
- Remove irrelevant cells from further consideration.
- When the current layer has been examined, proceed to the next lower level.
- Repeat this process until the bottom layer is reached.
- Advantages:
  - query-independent, easy to parallelize, allows incremental update
  - O(K), where K is the number of grid cells at the lowest level
- Disadvantages:
  - all cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
CLIQUE (Clustering In QUEst): Clustering for High-Dimensional Spaces
- Agrawal, Gehrke, Gunopulos & Raghavan (SIGMOD'98).
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
- CLIQUE can be considered both density-based and grid-based:
  - it partitions each dimension into the same number of equal-length units
  - a unit is dense if the fraction of the total data points it contains exceeds an input model parameter
  - a cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
- Identify dense units of increasing dimensionality (see the sketch below):
  - Partition the data space into units and count the points that lie inside each unit.
  - Determine the dense units in all 1-D subspaces of interest.
  - Identify dense units of higher dimensionality using the Apriori principle: if a k-D unit is dense, then its (k-1)-D projection units must be dense.
- Generate a minimal description for each cluster:
  - determine the maximal region that covers the set of connected dense units of the cluster
  - determine a minimal cover for each cluster
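A minimal sketch of the first step and of the Apriori-style candidate generation for 2-D units (equal-length binning and all names are illustrative; a full implementation would count candidate supports and iterate to higher dimensionalities):

```python
from collections import Counter
from itertools import combinations

def dense_units_1d(points, bins, tau):
    """Step 1: partition each dimension into equal-length units and keep
    the (dimension, unit) pairs whose point count exceeds the threshold tau."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    counts = Counter()
    for p in points:
        for d in range(dims):
            span = (hi[d] - lo[d]) or 1.0  # avoid division by zero
            u = min(int((p[d] - lo[d]) / span * bins), bins - 1)
            counts[(d, u)] += 1
    return {unit for unit, c in counts.items() if c > tau}

def candidate_units_2d(dense_1d):
    """Apriori principle: a 2-D unit can be dense only if both of its 1-D
    projections are dense, so candidates pair dense units of different dims."""
    return {(a, b) for a, b in combinations(sorted(dense_1d), 2) if a[0] != b[0]}
```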
(Figure: dense units found in the (age, salary) subspace, with salary in units of 10,000 and age from 20 to 60, and in the (age, vacation) subspace, with vacation from 0 to 7 weeks. Intersecting the maximal regions of the two planes identifies a candidate cluster in the 3-D (age, salary, vacation) subspace.)
Strengths and Weaknesses of CLIQUE
- Strengths:
  - it automatically finds the subspaces of highest dimensionality such that high-density clusters exist in those subspaces
  - it is insensitive to the order of the input records and does not presume any canonical data distribution
  - it scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weakness:
  - the accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Chapter 7. Cluster Analysis
- What Is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods: Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based
- Outlier Analysis
- Summary
Model-Based Clustering Methods
- Attempt to optimize the fit between the data and the cluster models.
- Conceptual clustering (a statistical and AI approach):
  - clustering is performed first, followed by characterization
  - produces a classification scheme for a set of unlabeled objects
  - finds a conceptual characteristic description for each class/cluster
- COBWEB (Fisher'87; good for big data analysis):
  - a popular and simple method of incremental conceptual clustering
  - creates a hierarchical clustering in the form of a classification tree
  - each node denotes a concept and contains a probabilistic description of that concept
COBWEB Clustering Method: A Classification Tree (Example)
- animal: P(C0) = 1.0, P(scales|C0) = 0.25, …
  - fish: P(C1) = 0.25, P(scales|C1) = 1.0, …
  - amphibian: P(C2) = 0.25, P(moist|C2) = 1.0, …
  - mammal/bird: P(C3) = 0.5, P(hair|C3) = 0.5, …
    - mammal: P(C4) = 0.25, P(hair|C4) = 1.0, …
    - bird: P(C5) = 0.25, P(feathers|C5) = 1.0, …
Other Model-Based Clustering Methods: Neural Network Approaches
- Represent each cluster as an exemplar, acting as a "prototype" of the cluster.
- In the clustering process, a training sample is assigned to the cluster whose exemplar is most similar to it, according to some distance measure.
- Training is based on competitive learning:
  - it involves an organized architecture of neuron units
  - the neuron units compete in a "winner-takes-all" fashion for the current input training sample
Self-Organizing Feature Maps (SOMs): A Neural Network Approach
- Clustering is performed by having output neurons compete for the current input sample:
  - each output neuron denotes a cluster
  - the output neuron whose connection-weight vector is closest to the current input training sample wins
  - the winner and its neighbors learn by having their connection weights adjusted
- SOMs are believed to resemble information processing that can occur in the human brain.
- Also useful for visualizing high-dimensional data mapped onto a 2-D or 3-D space.
An Example of SOM Architecture
- Output neurons arranged in a 4×4 grid, fully connected to the input neurons.
- 2 input neurons; input format X = <x1, x2>.
- 16 output neurons (clusters); connection-weight matrix of size 16×2; the weight vector of neuron i is <wi1, wi2>.
(Figure: the SOM network and an example of its result representation.)
Examples of SOM Feature Maps
(Figures: feature maps for a 6×6 architecture and a 20×15 architecture; result for the input feature "Age".)
Model Representation of an SOM Example (a 3-input, 25-output SOM)
- Input: X = <x1, x2, x3> (3 input neurons).
- Output: 25 neurons arranged as a 5×5 feature map, numbered 1 to 25.
- Connection weights and competitive layer: a(25×1) = compet(W(25×3) X(3×1)).
Rule of Learning in SOM
- Update the weight vectors in the neighborhood of the winning neuron i*.
- X(p): the training input vector at iteration p; Ni*: the neighborhood of the winning neuron i*.
- Neighborhood example: on the 5×5 map numbered 1 to 25, if neuron 13 (the center) is the winner, its neighborhood consists of the surrounding grid positions.
SOM Learning Algorithm (Overview)
- Step 1: Initialisation
- Step 2: Find the winning neuron
- Step 3: Learning (updating weights)
- Step 4: Iteration
SOM Learning Algorithm: Step 1, Initialisation
Set the initial connection weights to small random values, say in the interval [0, 1], and assign a small positive value to the learning-rate parameter α.
SOM Learning Algorithm: Step 2, Find the Winning Neuron (iteration p): Activation and Similarity Matching
- Activate the Kohonen network by applying an input sample X.
- Find the winning (best-matching) neuron i* using the criterion of minimal Euclidean distance:
  i* = argmin_i ||X - Wi(p)||, i = 1, …, m
  where m is the number of neurons in the Kohonen layer and n is the number of input neurons (each Wi is n-dimensional).
SOM Learning Algorithm: Step 3, Learning
- Update the connection weight between input neuron j and output neuron i:
  wij(p + 1) = wij(p) + Δwij(p), 1 ≤ i ≤ m, 1 ≤ j ≤ n
  where Δwij(p) is the weight correction at iteration p.
- The weight correction is determined by the competitive learning rule:
  Δwij(p) = α (xj - wij(p)) if neuron i ∈ Ni*(d, p), and 0 otherwise
  where α is the learning-rate parameter and Ni*(d, p) is the neighbourhood function centred on the winning neuron i* at distance d at iteration p.
SOM Learning Algorithm: Step 4, Iteration
Increase iteration p by one, go back to Step 2, and continue until the criterion of minimal Euclidean distance is satisfied or no noticeable changes occur in the feature map.
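Putting the four steps together, a minimal SOM training sketch on a square grid, with a linearly shrinking square neighborhood (the neighborhood schedule, fixed learning rate, and all names are illustrative choices, not the slides' exact formulation):

```python
import random

def train_som(samples, grid, n_inputs, alpha=0.1, epochs=100):
    """Minimal SOM sketch on a grid x grid map. Step 1: random weights;
    step 2: winner = neuron with minimal Euclidean distance to the input;
    step 3: move the winner's and its grid neighbors' weights toward the
    input by alpha * (x - w); step 4: iterate, shrinking the neighborhood."""
    m = grid * grid
    W = [[random.random() for _ in range(n_inputs)] for _ in range(m)]
    for p in range(epochs):
        radius = max(1, int(grid / 2 * (1 - p / epochs)))  # shrinking N_i*(d, p)
        for x in samples:
            win = min(range(m), key=lambda i: sum(
                (x[j] - W[i][j]) ** 2 for j in range(n_inputs)))
            wr, wc = divmod(win, grid)
            for i in range(m):
                r, c = divmod(i, grid)
                if abs(r - wr) <= radius and abs(c - wc) <= radius:
                    for j in range(n_inputs):   # delta_w = alpha * (x_j - w_ij)
                        W[i][j] += alpha * (x[j] - W[i][j])
    return W
```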
Learning in a Kohonen Network
- The overall effect of the SOM learning rule is to move the connection-weight vector Wi of the trained output neuron i towards the training input vector X:
  ΔWi = α (X - Wi), so the updated Wi + ΔWi lies between the old Wi and the input pattern X.
Convergence Process of SOM
(Figures: snapshots of the feature map after 100, 1000, 2000, and 5000 iterations.)
What Is Outlier Discovery?
- What are outliers? Objects that are considerably dissimilar from the remainder of the data (example from sports: Michael Jordan, …).
- Problem: find the top-n outlier points.
- Applications:
  - credit card fraud detection
  - telecom fraud detection
  - customer segmentation
  - medical analysis
Outlier Discovery: Statistical Approaches
- Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution).
- Use discordancy tests, which depend on:
  - the data distribution
  - the distribution parameters (e.g., mean, variance)
  - the number of expected outliers
- Drawbacks:
  - most tests are for a single attribute
  - in many cases, the data distribution may not be known
Outlier Discovery: Distance-Based Approach
- Introduced to counter the main limitations of statistical methods: we need multi-dimensional analysis without knowing the data distribution.
- Distance-based outlier: a DB(p, D)-outlier is an object O in a data set T such that at least a fraction p of the objects in T lie at a distance greater than D from O.
- Algorithms for mining distance-based outliers:
  - index-based algorithm
  - nested-loop algorithm
  - cell-based algorithm
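A minimal sketch of the nested-loop variant (names are illustrative; the index- and cell-based algorithms speed up the same test with spatial indexes or grid cells):

```python
def db_outliers(points, p, D, dist):
    """Nested-loop sketch of DB(p, D)-outliers: object o is an outlier if at
    least a fraction p of the data set lies farther than D from o, i.e. o has
    fewer than (1 - p) * n neighbors within distance D."""
    n = len(points)
    max_neighbors = (1 - p) * n
    outliers = []
    for o in points:
        neighbors = sum(1 for q in points if q is not o and dist(o, q) <= D)
        if neighbors < max_neighbors:
            outliers.append(o)
    return outliers
```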
Outlier Discovery: Deviation-Based Approach
- Identifies outliers by examining the main characteristics of the objects in a group; objects that "deviate" from this description are considered outliers.
- Sequential exception technique: simulates the way humans distinguish unusual objects from a series of similar objects.
- OLAP data cube technique: uses data cubes to identify regions of anomalies in a large multi-dimensional data set.
Summary
- Cluster analysis groups data objects based on their similarity and has wide applications.
- Measures of similarity can be computed for various types of data.
- Clustering algorithms can be categorized into partitioning, hierarchical, density-based, grid-based, and model-based methods.
- Outlier detection and analysis are very useful for fraud detection, and can be performed with statistical, distance-based, or deviation-based approaches.
- There are still many open research issues in cluster analysis, such as constraint-based clustering.