Cluster Analysis: Basic Concepts and Algorithms
Lecture Notes, Introduction to Clustering
by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized
Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets (e.g., clustering precipitation in Australia)
What is not Cluster Analysis?
• Supervised classification
  – Have class label information
• Simple segmentation
  – Dividing students into different registration groups alphabetically, by last name
• Results of a query
  – Groupings are a result of an external specification
• Graph partitioning
  – Some mutual relevance and synergy, but the areas are not identical
Notion of a Cluster can be Ambiguous
• How many clusters?
(Figure: the same set of points interpreted as two clusters, four clusters, or six clusters)
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional Clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
Partitional Clustering
(Figure: original points and a partitional clustering of them)
Hierarchical Clustering
(Figures: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram)
Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters
  – Can represent multiple classes or 'border' points
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
  – Probabilistic clustering has similar characteristics
• Partial versus complete
  – In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, and densities
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual clusters
• Clusters described by an objective function
Types of Clusters: Well-Separated
• Well-Separated Clusters:
  – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
(Figure: 3 well-separated clusters)
Types of Clusters: Center-Based
• Center-based
  – A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of a cluster than to the center of any other cluster
  – The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster
(Figure: 4 center-based clusters)
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or Transitive)
  – A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
(Figure: 8 contiguous clusters)
Types of Clusters: Density-Based
• Density-based
  – A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.
  – Used when the clusters are irregular or intertwined, and when noise and outliers are present.
(Figure: 6 density-based clusters)
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
  – Finds clusters that share some common property or represent a particular concept.
(Figure: 2 overlapping circles)
Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
  – Finds clusters that minimize or maximize an objective function.
  – Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function. (NP-hard)
  – Can have global or local objectives.
    · Hierarchical clustering algorithms typically have local objectives
    · Partitional algorithms typically have global objectives
  – A variation of the global objective function approach is to fit the data to a parameterized model.
    · Parameters for the model are determined from the data.
    · Mixture models assume that the data is a 'mixture' of a number of statistical distributions.
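The remark that exhaustive enumeration is NP-hard can be made concrete by counting the clusterings. The number of ways to partition n points into k non-empty clusters is the Stirling number of the second kind, S(n, k); summing over k gives the Bell number. A quick counting sketch (not from the slides):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to split n labeled points into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # A new point either starts its own cluster or joins one of k existing ones.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

def bell(n):
    """Total number of clusterings of n points, over all cluster counts."""
    return sum(stirling2(n, k) for k in range(n + 1))

print(stirling2(10, 3))  # 9330 clusterings of just 10 points into 3 clusters
print(bell(20))          # ~5.17e13 clusterings of 20 points overall
```

Even 20 points admit tens of trillions of clusterings, so practical algorithms search only a tiny fraction of this space.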
Types of Clusters: Objective Function …
• Map the clustering problem to a different domain and solve a related problem in that domain
  – The proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points
  – Clustering is equivalent to breaking the graph into connected components, one for each cluster
  – Want to minimize the edge weight between clusters and maximize the edge weight within clusters
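The graph view above can be sketched in a few lines: treat the proximity matrix as a weighted graph, keep only edges whose distance falls below a threshold, and read the clusters off as connected components. The 4-point distance matrix and the threshold below are hypothetical illustrations, not from the slides.

```python
def connected_components(n, edges):
    """Union-find over n nodes; edges is a list of (i, j) pairs."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

# Hypothetical distance matrix: points 0,1 are close; points 2,3 are close.
dist = [[0.0, 0.5, 9.0, 8.0],
        [0.5, 0.0, 7.0, 9.0],
        [9.0, 7.0, 0.0, 0.4],
        [8.0, 9.0, 0.4, 0.0]]
threshold = 1.0  # keep only "strong" edges
edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if dist[i][j] < threshold]
print(connected_components(4, edges))  # [[0, 1], [2, 3]]
```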
Characteristics of the Input Data Are Important
• Type of proximity or density measure
  – This is a derived measure, but central to clustering
• Sparseness
  – Dictates type of similarity
  – Adds to efficiency
• Attribute type
  – Dictates type of similarity
• Type of data
  – Dictates type of similarity
  – Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and outliers
• Type of distribution
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
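The basic algorithm can be sketched as below: assign each point to its closest centroid, recompute centroids as cluster means, and repeat until the centroids stop moving. This is a minimal pure-Python sketch for 2-D points with Euclidean distance; the sample points and K are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick K initial centroids from the data
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

With two well-separated groups of three points each, the loop settles on a 3/3 split regardless of which points are drawn as initial centroids.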
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    · Start with the points as individual clusters
    · At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    · Start with one, all-inclusive cluster
    · At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward:
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
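The steps above can be sketched in pure Python. This version uses single link (MIN) as the cluster-proximity definition and recomputes proximities on each pass rather than maintaining an explicit matrix (so step 5, the matrix update, happens implicitly); the sample points and the stopping point k are hypothetical.

```python
import math

def agglomerative(points, k=1):
    clusters = [[p] for p in points]         # step 2: each point is its own cluster
    while len(clusters) > k:                 # steps 3-6: merge until k clusters remain
        # Find the two closest clusters under single link (MIN).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # step 4: merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print([sorted(c) for c in agglomerative(points, k=3)])
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```

Swapping the `min` in the inner loop for `max` or an average yields complete link or group average instead.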
Starting Situation
• Start with clusters of individual points and a proximity matrix
(Figure: points p1 … p5 and their proximity matrix)
Intermediate Situation
• After some merging steps, we have some clusters
(Figure: clusters C1 … C5 and their proximity matrix)
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
(Figure: clusters C1 … C5 and their proximity matrix)
After Merging
• The question is "How do we update the proximity matrix?"
(Figure: proximity matrix after merging, with the proximities between C2 U C5 and each of C1, C3, C4 marked "?")
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error
(Figure: proximity matrix over points p1 … p5, with the chosen notion of similarity highlighted)
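The four similarity definitions listed above can be written as small functions over clusters of 2-D points. A sketch (not from the slides; the sample clusters are hypothetical):

```python
import math

def d_min(a, b):
    """MIN (single link): distance between the two closest points."""
    return min(math.dist(p, q) for p in a for q in b)

def d_max(a, b):
    """MAX (complete link): distance between the two farthest points."""
    return max(math.dist(p, q) for p in a for q in b)

def d_avg(a, b):
    """Group average: mean of all pairwise distances."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def d_centroid(a, b):
    """Distance between the cluster centroids."""
    ca = tuple(sum(x) / len(a) for x in zip(*a))
    cb = tuple(sum(x) / len(b) for x in zip(*b))
    return math.dist(ca, cb)

a = [(0, 0), (0, 2)]
b = [(4, 0), (6, 0)]
print(d_min(a, b))  # 4.0, the single closest pair (0,0)-(4,0)
```

The orderings d_min <= d_avg <= d_max always hold, which is one way to see why MIN and MAX behave so differently on noisy data.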
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two most similar (closest) points in the different clusters
  – Determined by one pair of points, i.e., by one link in the proximity graph
(Figure: proximity graph over points 1-5)
Hierarchical Clustering: MIN
(Figure: nested single-link clusters over points 1-6 and the corresponding dendrogram)
Strength of MIN
• Can handle non-elliptical shapes
(Figure: original points and the two clusters found)
Limitations of MIN
• Sensitive to noise and outliers
(Figure: original points and the two clusters found)
Cluster Similarity: MAX or Complete Linkage
• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
  – Determined by all pairs of points in the two clusters
(Figure: proximity graph over points 1-5)
Hierarchical Clustering: MAX
(Figure: nested complete-link clusters over points 1-6 and the corresponding dendrogram)
Strength of MAX
• Less susceptible to noise and outliers
(Figure: original points and the two clusters found)
Limitations of MAX
• Tends to break large clusters
• Biased towards globular clusters
(Figure: original points and the two clusters found)
Cluster Similarity: Group Average
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters:
  – proximity(Ci, Cj) = ( Σ proximity(p, q) over p in Ci, q in Cj ) / (|Ci| × |Cj|)
• Need to use average connectivity for scalability, since total proximity favors large clusters
(Figure: proximity graph over points 1-5)
Hierarchical Clustering: Group Average
(Figure: nested group-average clusters over points 1-6 and the corresponding dendrogram)
Hierarchical Clustering: Group Average
• Compromise between single and complete link
• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters
Cluster Similarity: Ward's Method
• Similarity of two clusters is based on the increase in squared error when the two clusters are merged
  – Similar to group average if the distance between points is the squared distance
• Less susceptible to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of K-means
  – Can be used to initialize K-means
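Ward's merge criterion, the increase in total squared error (SSE) caused by merging two clusters, can be sketched directly from the definitions; for clusters A, B with centroids cA, cB it reduces to the closed form |A||B| / (|A| + |B|) · ||cA - cB||². The sample clusters below are hypothetical.

```python
import math

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return sum(math.dist(p, c) ** 2 for p in cluster)

def ward_cost(a, b):
    """Increase in total SSE if clusters a and b are merged."""
    return sse(a + b) - (sse(a) + sse(b))

a = [(0, 0), (2, 0)]    # centroid (1, 0)
b = [(10, 0), (12, 0)]  # centroid (11, 0)
# Closed form: 2*2 / (2+2) * 10**2 = 100
print(ward_cost(a, b))  # 100.0
```

At each agglomerative step, Ward's method merges the pair with the smallest such cost, which is why it behaves like a hierarchical analogue of K-means.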
Hierarchical Clustering: Comparison
(Figure: nested clusters over points 1-6 produced by MIN, MAX, Group Average, and Ward's Method, shown side by side)
Hierarchical Clustering: Time and Space Requirements
• O(N²) space, since it uses the proximity matrix
  – N is the number of points
• O(N³) time in many cases
  – There are N steps, and at each step the proximity matrix, of size N², must be updated and searched
  – Complexity can be reduced to O(N² log(N)) time for some approaches
Hierarchical Clustering: Problems and Limitations
• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers
  – Difficulty handling different sized clusters and convex shapes
  – Breaking large clusters
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
  – Start with a tree that consists of any point
  – In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not
  – Add q to the tree and put an edge between p and q
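The MST construction above is Prim's algorithm grown one point at a time; the divisive step then cuts the k-1 largest MST edges, leaving k connected components as the clusters. A sketch (not from the slides; the sample points and k are hypothetical):

```python
import math

def build_mst(points):
    """Grow the tree from points[0]; returns MST edges as (dist, p, q)."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        # Closest pair with p in the tree and q outside it.
        d, p, q = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(len(points)) if j not in in_tree
        )
        in_tree.add(q)
        edges.append((d, p, q))
    return edges

def mst_clusters(points, k):
    """Cut the k-1 largest MST edges; components of the rest are the clusters."""
    edges = sorted(build_mst(points))
    if k > 1:
        edges = edges[:-(k - 1)]
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, p, q in edges:
        parent[find(p)] = find(q)
    comps = {}
    for i in range(len(points)):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(mst_clusters(points, k=2))  # two clusters: [0, 1] and [2, 3]
```

Cutting MST edges in decreasing order of length reproduces exactly the single-link hierarchy, read top-down instead of bottom-up.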