Cluster Analysis
Dr. Anil Maheshwari

Agenda
  • Cluster analysis
  • Clustering example
  • Exercises using SPSS Modeler
  • Project discussions

Clustering is a technique used for automatic identification of natural groupings of things.
  • Data instances that are similar to (or near) each other are categorized into one cluster.
  • Data instances that are very different (or far away) from each other are placed into different clusters.
  • It learns the clusters of things from past data, then assigns new instances to their cluster homes.
  • Part of the machine-learning family
  • Employs unsupervised learning: there is no output/dependent variable
  • Also known as segmentation

Operational Definition
Given a representation of n objects, find K groups based on a measure of similarity such that objects within the same group are alike but objects in different groups are not alike.
  • But what is the notion of similarity? What is the definition of a cluster?
  • The next chart shows that clusters can differ in terms of their shape, size, and density.
  • The presence of noise in the data makes the detection of the clusters even more difficult.
  • An ideal cluster can be defined as a set of points that is compact and isolated.
  • In reality, a cluster is a subjective entity whose significance and interpretation require domain knowledge.
Source: Data Clustering: 50 Years Beyond K-Means, by Anil K. Jain

Cluster Analysis
  • Used in almost every field where there is massive variety and transactions
  • Provides characterization, definition, and labels for populations
  • Identifies natural groupings of customers, products, patients, etc.
  • Decreases the size and complexity of problems

Clustering examples
  • Group people of similar sizes together to make small, medium, and large T-shirt sizes
      • To provide well-fitting clothes at mass-production rates
  • Segment customers according to their similarities
      • To do targeted marketing
  • Organize a given collection of text documents according to their content similarities
      • To produce a topic hierarchy

An illustration
This data set has three natural groups of data points, i.e., three natural clusters.

Cluster Analysis for Data Mining
  • How many clusters?
      • There is no “truly optimal” way to calculate it
      • Heuristics are often used, e.g., the “elbow method”
  • Most cluster analysis methods use a distance measure to calculate the closeness between pairs of items
      • Euclidean versus Manhattan (rectilinear) distance
  • Inter-cluster distance is maximized
  • Intra-cluster distance is minimized
  • The quality of a clustering result depends on the algorithm, the distance function, and the application.
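The two distance measures named above can be compared with a minimal sketch. The function names and the two sample points are our own choices for illustration and do not come from the slides.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Rectilinear (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical points, chosen so the arithmetic is easy to check by hand.
p, q = (2, 4), (5, 8)
print(euclidean(p, q))  # 5.0 (a 3-4-5 right triangle)
print(manhattan(p, q))  # 7   (3 + 4)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, so the choice of measure can change which points count as "near".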

Cluster Analysis for Data Mining
  • Analysis methods
      • Statistical methods such as k-means, k-modes
      • Neural networks
      • Fuzzy logic (e.g., fuzzy c-means algorithm)
      • Genetic algorithms
  • Hierarchical methods: divisive vs. agglomerative
      • Top-down vs. bottom-up algorithms

k-Means Clustering Algorithm
The k-means algorithm partitions the given data into k clusters.

Strengths and weaknesses of k-means
K-means is the most popular clustering algorithm.
  • Strengths:
      • Simple: easy to understand and to implement
      • Efficient: k-means is considered a linear algorithm, since for a fixed k and number of iterations its running time grows linearly with the number of data points
  • Weaknesses:
      • The user needs to specify k.
      • The process may converge to a local optimum that depends on the initial centroids.
      • Not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
  • There is no clear evidence that any other clustering algorithm performs better in general, though some may be more useful for specific purposes.
  • Comparing clustering algorithms is a difficult task. No one knows the correct clusters!
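Because the user has to supply k, one common heuristic is the elbow method mentioned earlier in the deck: run k-means for several values of k and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. The sketch below is only an illustration under stated assumptions: it uses the 10 data points from the clustering example, seeds the centroids with the first k points (our own simplification, not something the slides prescribe), and the name `kmeans_wcss` is hypothetical.

```python
import math

# The 10 data points from the clustering example in this deck.
POINTS = [(4, 7), (5, 7), (2, 6), (5, 6), (2, 4),
          (6, 6), (4, 4), (6, 3), (8, 3), (5, 2)]

def kmeans_wcss(points, k, max_iter=100):
    """Run a basic k-means and return the within-cluster sum of squared distances."""
    # Assumption: the first k points serve as starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        # Allocate every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return sum(math.dist(p, centroids[i]) ** 2
               for i, members in enumerate(clusters) for p in members)

# Print WCSS for a range of k; the "elbow" is where the decrease flattens out.
for k in range(1, 6):
    print(k, round(kmeans_wcss(POINTS, k), 2))
```

WCSS reaches zero once every point has its own cluster, so the smallest value is not the goal; the heuristic looks for the k beyond which adding clusters buys little. A different seeding strategy may give different numbers for the same k.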

Common ways to represent clusters
  • Centroids of the clusters
      • Compute the radius and standard deviation of the cluster to determine its spread in each dimension
      • Works well if the clusters are of hyper-spherical shape
  • Frequent values
      • Mainly for clustering of categorical data (e.g., k-modes clustering)
      • Main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster
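The centroid-based representation can be sketched as follows. The slides do not define "radius", so this example assumes it means the largest distance from the centroid to any member, and it uses the population standard deviation per dimension; the function name and the sample cluster are hypothetical.

```python
import math
import statistics

def summarize_cluster(members):
    """Return (centroid, radius, per-dimension standard deviation) for a cluster."""
    dims = list(zip(*members))  # one tuple of values per dimension
    centroid = tuple(statistics.fmean(d) for d in dims)
    # Assumption: radius = distance from the centroid to the farthest member.
    radius = max(math.dist(p, centroid) for p in members)
    # Population standard deviation describes the spread in each dimension.
    spread = tuple(statistics.pstdev(d) for d in dims)
    return centroid, radius, spread

# Hypothetical cluster of four points taken from the deck's example data.
centroid, radius, spread = summarize_cluster([(4, 7), (5, 7), (5, 6), (6, 6)])
print(centroid)  # (5.0, 6.5)
print(radius)
print(spread)
```

A cluster whose spread is similar in every dimension is close to hyper-spherical, which is the case where this summary describes the cluster well.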

Clusters of arbitrary shapes
  • Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroid together with spreads.
  • Irregular shape clusters are hard to represent. They may not be useful in some applications.
      • Using centroids is not suitable in general (upper figure)
      • K-means clusters may be more useful (lower figure), e.g., for making T-shirts in 2 sizes

Clustering Example: 10 data points
(4, 7), (5, 7), (2, 6), (5, 6), (2, 4), (6, 6), (4, 4), (6, 3), (8, 3), (5, 2)
[Figure: scatter plot of the 10 data points; axis ticks run from 0 to 9]

Two clusters
[Figure: the same 10 data points, axes from 0 to 9]

Three clusters
[Figure: the same 10 data points, axes from 0 to 9]

K-means: select arbitrary centroids
[Figure: the same 10 data points, axes from 0 to 9]

Allocate points to nearest centroid
[Figure: the same 10 data points, axes from 0 to 9]
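The allocation step can be sketched as below. The slides do not list the starting centroids, so the two used here, (2, 6) and (8, 3), are our own arbitrary picks; the helper name is likewise hypothetical.

```python
import math

# The 10 data points from the clustering example.
POINTS = [(4, 7), (5, 7), (2, 6), (5, 6), (2, 4),
          (6, 6), (4, 4), (6, 3), (8, 3), (5, 2)]

# Assumption: two arbitrary starting centroids, picked for illustration only.
centroids = [(2, 6), (8, 3)]

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# One cluster index per data point, in the same order as POINTS.
assignment = [nearest_centroid(p, centroids) for p in POINTS]
print(assignment)  # [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
```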

Recompute centroids
[Figure: the same 10 data points, axes from 0 to 9]
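Recomputing a centroid means taking the mean of the points currently allocated to that cluster. The sketch below assumes a hypothetical allocation of the 10 points to two clusters (the one that results from the arbitrary starting centroids (2, 6) and (8, 3), which are our choice, not the deck's).

```python
# The 10 data points from the clustering example.
POINTS = [(4, 7), (5, 7), (2, 6), (5, 6), (2, 4),
          (6, 6), (4, 4), (6, 3), (8, 3), (5, 2)]

# Assumption: one cluster index per point, as an allocation step might produce.
assignment = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

def recompute_centroids(points, assignment, k):
    """Return the mean of each cluster's members as that cluster's new centroid."""
    centroids = []
    for cluster in range(k):
        members = [p for p, a in zip(points, assignment) if a == cluster]
        # This sketch assumes every cluster has at least one member.
        centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centroids

centroids = recompute_centroids(POINTS, assignment, k=2)
print(centroids)  # second centroid is (6.25, 3.5)
```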

Assign points to new centroids
[Figure: the same 10 data points, axes from 0 to 9]

Continue until centroids stabilize
[Figure: the same 10 data points, axes from 0 to 9]
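Putting the steps together, the loop alternates allocation and centroid updates until the centroids no longer change. This is a minimal sketch rather than the procedure used to draw the slides: the starting centroids (2, 6) and (8, 3) are our own arbitrary choice, and a different start may settle on different clusters.

```python
import math

# The 10 data points from the clustering example.
POINTS = [(4, 7), (5, 7), (2, 6), (5, 6), (2, 4),
          (6, 6), (4, 4), (6, 3), (8, 3), (5, 2)]

def kmeans(points, centroids, max_iter=100):
    """Alternate allocation and centroid updates until the centroids stop moving."""
    k = len(centroids)
    for _ in range(max_iter):
        # Allocation step: index of the nearest centroid for every point.
        assignment = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                      for p in points]
        # Update step: each centroid becomes the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            new_centroids.append(
                tuple(sum(v) / len(members) for v in zip(*members)) if members
                else centroids[c])
        if new_centroids == centroids:  # stabilized: stop iterating
            break
        centroids = new_centroids
    return centroids, assignment

# Assumption: arbitrary starting centroids chosen for this illustration.
centroids, assignment = kmeans(POINTS, [(2, 6), (8, 3)])
print(assignment)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(centroids)
```

With this start the point (6, 6) changes cluster after the first centroid update, and the third pass leaves the centroids unchanged, which ends the loop.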

Conclusion
  • Clustering is grouping for convenience or other purpose
  • Classification and clustering?
      • Both are pattern recognition mechanisms
      • Classification is supervised learning
      • Clustering is unsupervised learning

Thank You!