
Clustering: k-Means, Hierarchical Clustering, Self-Organizing Maps

Outline
• k-means clustering
• Hierarchical clustering
• Self-Organizing Maps

Classification vs. Clustering
Classification: Supervised learning

Classification vs. Clustering
Clustering: Unsupervised learning (labels unknown)
No labels; find a “natural” grouping of instances

Many Clustering Applications
Basically, everywhere labels are unknown/uncertain/too expensive
• Marketing: find groups of similar customers
• Astronomy: find groups of similar stars, galaxies
• Earthquake studies: cluster earthquake epicenters along continental faults
• Genomics: find groups of genes with similar expressions

Clustering Methods: Terminology
Non-overlapping vs. Overlapping

Clustering Methods: Terminology
Top-down vs. Bottom-up (agglomerative)

Clustering Methods: Terminology
Hierarchical

Clustering Methods: Terminology
Deterministic vs. Probabilistic

k-Means Clustering

K-means clustering (k=3). Pick k random points: initial cluster centers

K-means clustering (k=3). Assign each point to the nearest cluster center

K-means clustering (k=3). Move cluster centers to the mean of each cluster

K-means clustering (k=3). Reassign points to the nearest cluster center

K-means clustering (k=3). Repeat steps 3-4 until the cluster centers converge (don’t/hardly move)

K-means
• Works with numeric data only
1) Pick k random points: initial cluster centers
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
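These four steps translate almost directly into code. Below is a minimal NumPy sketch of the algorithm as described on this slide; the function name, the fixed iteration cap, and the convergence test on unchanged assignments are illustrative choices, not part of the original slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means on an (n, d) array of numeric instances."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # 1) Pick k random points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2) Assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # 4) Stop once the cluster assignments no longer change
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # 3) Move each center to the mean of its assigned items
        for j in range(k):
            if np.any(assignments == j):
                centers[j] = X[assignments == j].mean(axis=0)
    return centers, assignments
```

For any numeric data array, `centers, labels = kmeans(data, k=3)` reproduces a k=3 run like the one illustrated on the previous slides.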

K-means clustering: another example
http://www.youtube.com/watch?feature=player_embedded&v=BVFG7fd1H30

Discussion
• Results can vary significantly depending on the initial choice of centers
• Can get trapped in a local minimum
  • Example (figure): instances and initial cluster centers
• To increase the chance of finding the global optimum: restart with different random seeds

Discussion: circular data
• Arbitrary results
• Prototypes not ‘on’ the data

K-means clustering summary
Advantages:
• Simple, understandable
• Instances automatically assigned to clusters
• Fast
Disadvantages:
• Must pick the number of clusters beforehand
• All instances forced into a single cluster
• Sensitive to outliers
• Random algorithm, random results
• Not always intuitive in higher dimensions

K-means variations
• k-medoids: instead of the mean, use the median of each cluster
  • Mean of 1, 3, 5, 7, 1009 is 205
  • Median of 1, 3, 5, 7, 1009 is 5
• For large databases, use sampling
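A quick check of the numbers above (NumPy used purely for illustration):

```python
import numpy as np

values = [1, 3, 5, 7, 1009]
print(np.mean(values))    # 205.0  (dragged far from the bulk of the data by the outlier 1009)
print(np.median(values))  # 5.0    (unaffected by the outlier)
```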

How to choose k?
• One important parameter: k, but how to choose it?
• Domain dependent: we simply want k clusters
• Alternative: repeat for several values of k and choose the best
• Example: cluster mammals by their properties
  • each value of k leads to a different clustering
  • use an MDL-based encoding for the data in the clusters
  • each additional cluster introduces a penalty
  • optimal for k = 6
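The “repeat for several values of k and choose the best” idea can be sketched as follows. The MDL-based encoding from the slide is not reproduced here; as a rough stand-in, each k is scored by the within-cluster sum of squared distances plus a fixed penalty per cluster, so that each additional cluster costs something. The `penalty` constant, the toy data, and the candidate range are arbitrary illustrative choices; `kmeans` is the sketch from the earlier slide.

```python
import numpy as np

# Toy data: three blobs in 2D, only to make the sketch runnable
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 5, 10)])

def score_k(X, k, penalty=50.0):
    # Fit quality (within-cluster squared error) plus a cost per cluster;
    # only a crude stand-in for the MDL-based criterion mentioned above.
    centers, labels = kmeans(X, k)              # kmeans() from the earlier sketch
    sse = ((X - centers[labels]) ** 2).sum()
    return sse + penalty * k

# Repeat for several values of k and keep the best-scoring one
best_k = min(range(2, 8), key=lambda k: score_k(X, k))
```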

Clustering Evaluation
• Manual inspection
• Benchmarking on existing labels
  • Classification through clustering
  • Is this fair?
• Cluster quality measures
  • distance measures
  • high similarity within a cluster, low across clusters
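One widely used distance-based quality measure of the “high similarity within, low across” kind is the silhouette score. It is not named on the slide, so the snippet below is only one possible concrete choice, using scikit-learn and reusing `X` and the `kmeans` sketch from the earlier slides.

```python
from sklearn.metrics import silhouette_score

# Close to 1: points are much nearer to their own cluster than to other
# clusters; near 0 or negative: clusters overlap or points are misassigned.
centers, labels = kmeans(X, k=3)
print(silhouette_score(X, labels))
```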

Hierarchical Clustering

Hierarchical clustering
• Hierarchical clustering is represented in a dendrogram
  • tree structure containing hierarchical clusters
  • individual clusters in the leaves, union of child clusters in the nodes

Bottom-up vs. top-down clustering
Bottom-up / Agglomerative:
• Start with single-instance clusters
• At each step, join the two “closest” clusters
Top-down:
• Start with one universal cluster
• Split into two clusters
• Proceed recursively on each subset

Distance Between Clusters
• Centroid: distance between centroids
  • Sometimes hard to compute (e.g. the mean of molecules?)
• Single Link: smallest distance between points
• Complete Link: largest distance between points
• Average Link: average distance between points, e.g. (d(A,C) + d(A,D) + d(B,C) + d(B,D)) / 4
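All four linkage criteria are available in SciPy’s agglomerative clustering routines, so a bottom-up run can be sketched as below; the toy data and the choice to cut the tree into 3 flat clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.random((20, 2))                        # any small numeric data set

# Agglomerative (bottom-up) clustering with the linkage criteria above
Z_single   = linkage(X, method="single")       # smallest distance between points
Z_complete = linkage(X, method="complete")     # largest distance between points
Z_average  = linkage(X, method="average")      # average distance between points
Z_centroid = linkage(X, method="centroid")     # distance between centroids

dendrogram(Z_average)                          # draws the tree of merges (matplotlib)
labels = fcluster(Z_average, t=3, criterion="maxclust")   # cut into 3 flat clusters
```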

Clustering dendrogram

How many clusters?

Self-Organizing Maps

Self-Organizing Map
• Groups similar data together
• Dimensionality reduction
• Data visualization technique
• Similar to neural networks
• Neurons try to mimic the input vectors
• The winning neuron (and its neighborhood) is updated
• Topology preserving, using a neighborhood function

Self-Organizing Map
• Input: high-dimensional input space
• Output: low-dimensional (typically 2 or 3) network topology
• Training: starting with a large learning rate and neighborhood size, both are gradually decreased to facilitate convergence
• After learning, neurons with similar weights tend to cluster on the map

Learning the SOM
• Determine the winner: the neuron whose weight vector has the smallest distance to the input vector
• Move the weight vector w of the winning neuron towards the input i
(figure: weight vector w and input i, before and after learning)

SOM Learning Algorithm
• Initialise the SOM (random, or such that dissimilar inputs are mapped far apart)
• for t from 0 to N:
  • Randomly select a training instance
  • Get the best matching neuron: calculate the distance, e.g. Euclidean
  • Scale the neighbors
    • Which neighbors? Hexagons, squares, Gaussian, …
    • Neighborhood size decreases over time
  • Update the neighbors towards the training instance
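A minimal sketch of this loop, assuming a square grid of neurons, a Gaussian neighborhood, and simple linear decay of both the learning rate and the neighborhood radius; the grid size, the decay schedules, and the neighborhood shape are illustrative choices (the slides also mention hexagonal and square neighborhoods).

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=1000, lr0=0.5, radius0=5.0, seed=0):
    """Minimal SOM sketch: square grid, Gaussian neighborhood, linear decay."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    d = data.shape[1]
    weights = rng.random((rows, cols, d))                 # random initialisation
    # grid coordinates of every neuron, used by the neighborhood function
    coords = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1.0 - frac)                           # learning rate decreases over time
        radius = radius0 * (1.0 - frac) + 1e-3            # neighborhood size shrinks too
        x = data[rng.integers(len(data))]                 # randomly select a training instance
        # best matching neuron: smallest Euclidean distance to the input
        dists = np.linalg.norm(weights - x, axis=2)
        winner = np.unravel_index(dists.argmin(), dists.shape)
        # Gaussian neighborhood around the winner on the map grid
        grid_dist2 = ((coords - np.array(winner)) ** 2).sum(axis=2)
        h = np.exp(-grid_dist2 / (2 * radius ** 2))
        # move the winner and its neighbors towards the training instance
        weights += lr * h[:, :, None] * (x - weights)
    return weights
```

For d-dimensional data in an (n, d) array, `weights = train_som(data)` returns the trained (rows, cols, d) weight grid; neurons with similar weights then tend to lie close together on the map.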

Self-Organizing Map
• Neighborhood function to preserve topological properties of the input space
• Neighbors share the prize (“postcode lottery” principle)

SOM of hand-written numerals

SOM of countries (poverty)

Clustering Summary
• Unsupervised
• Many approaches
  • k-means: simple, sometimes useful
    • k-medoids is less sensitive to outliers
  • Hierarchical clustering: works for symbolic attributes
  • Self-Organizing Maps
• Evaluation is a problem