CS 728 Clustering the Web Lecture 13
What is clustering?
• Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects
  – It is the most common form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to supervised learning, where the correct classification of examples is given
  – It is a common and important task that finds many applications in Web Science, IR, and other places
Why cluster web documents?
• Whole web navigation
  – Better user interface
• Improving recall in web search
  – Better search results
• For better navigation of search results
  – Effective "user recall" will be higher
• For speeding up retrieval
  – Faster search
Yahoo! Tree Hierarchy
[Figure: portion of the Yahoo! directory tree rooted at www.yahoo.com/Science, branching into agriculture, biology, physics, CS, space, ..., and further into subtopics such as dairy, crops, agronomy, forestry, botany, cell biology, evolution, magnetism, relativity, AI, HCI, courses, craft, missions]
CS Research Question: Given a set of related objects (e.g., webpages), find the best tree decomposition. Best: helps a user find/retrieve the object of interest.
Scatter/Gather: Method for Browsing a Large Collection (SIGIR '92)
Cutting, Karger, Pedersen, and Tukey
Users browse a document collection interactively by selecting subsets of documents that are re-clustered on-the-fly.
For improving search recall
• Cluster hypothesis: documents with similar text are related
• Therefore, to improve search recall:
  – Cluster docs in the corpus a priori
  – When a query matches a doc D, also return the other docs in the cluster containing D
• The hope: the query "car" will also return docs containing "automobile"
  – Because clustering grouped together docs containing "car" with those containing "automobile"
For better navigation of search results
• For grouping search results thematically
  – clusty.com / Vivisimo
For better navigation of search results
• And more visually: Kartoo.com
Defining What Is Good Clustering
• Internal criterion: a good clustering will produce high-quality clusters in which:
  – the intra-class (that is, intra-cluster) similarity is high
  – the inter-class similarity is low
  – the measured quality of a clustering depends on both the document representation and the similarity measure used
• External criterion: the quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes
  – Assessable with gold standard data
External Evaluation of Cluster Quality
• Assesses clustering with respect to ground truth
• Assume there are C gold standard classes, while our clustering algorithm produces k clusters π1, π2, …, πk, where cluster πi has ni members
• Simple measure: purity — the number of members of the dominant class in cluster πi divided by the cluster size: purity(πi) = (1/ni) · max over classes c of |πi ∩ c|
• Others are entropy of classes in clusters (or mutual information between classes and clusters)
Purity Example
[Figure: three example clusters, each containing documents from three classes]
• Cluster I: purity = (1/6) · max(5, 1, 0) = 5/6
• Cluster II: purity = (1/6) · max(1, 4, 1) = 4/6
• Cluster III: purity = (1/5) · max(2, 0, 3) = 3/5
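A minimal sketch of the purity computation above, assuming cluster assignments and gold-standard class labels are given as parallel lists (the label names are made up for illustration). This computes the overall, size-weighted purity rather than the per-cluster values shown on the slide:

```python
from collections import Counter

def purity(clusters, classes):
    """Fraction of documents that belong to the dominant gold class of their cluster."""
    correct = 0
    for c in set(clusters):
        # gold labels of the documents that landed in cluster c
        members = [cls for clu, cls in zip(clusters, classes) if clu == c]
        correct += Counter(members).most_common(1)[0][1]  # size of the dominant class
    return correct / len(clusters)

# The three-cluster example above: 6 + 6 + 5 documents, dominant classes of size 5, 4, 3
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = (['a'] * 5 + ['b'] * 1 +
           ['b'] * 4 + ['a'] * 1 + ['c'] * 1 +
           ['c'] * 3 + ['a'] * 2)
print(purity(clusters, classes))  # (5 + 4 + 3) / 17 ≈ 0.71
```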
Issues for clustering
• Representation for clustering
  – Document representation
    • Vector space? Normalization?
  – Need a notion of similarity/distance
• How many clusters?
  – Fixed a priori?
  – Completely data driven?
• Avoid "trivial" clusters - too large or small
  – In an application, if a cluster is too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much
What makes docs "related"?
• Ideal: semantic similarity
• Practical: statistical similarity
  – We will typically use cosine similarity (closely related to Pearson correlation)
  – Docs as vectors
  – For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs
  – We will describe algorithms in terms of cosine similarity
Recall: doc as vector
• Each doc j is a vector of tf-idf values, one component for each term
• Can normalize to unit length
• So we have a vector space
  – terms are axes - aka features
  – n docs live in this space
  – even with stemming, may have 20,000+ dimensions
  – do we really want to use all terms?
• Different from using vector space for search. Why?
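A minimal sketch of this representation: docs as unit-length tf-idf vectors compared with cosine similarity, assuming a tiny made-up corpus and whitespace tokenization:

```python
import math
from collections import Counter

docs = ["car repair shop", "automobile repair manual", "space mission launch"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

def tfidf_vector(tokens):
    tf = Counter(tokens)
    vec = []
    for term in vocab:
        df = sum(term in doc for doc in tokenized)      # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(tf[term] * idf)
    norm = math.sqrt(sum(w * w for w in vec)) or 1.0
    return [w / norm for w in vec]                      # normalize to unit length

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))             # dot product of unit vectors

vectors = [tfidf_vector(t) for t in tokenized]
print(cosine(vectors[0], vectors[1]))  # "car repair" vs. "automobile repair": > 0
print(cosine(vectors[0], vectors[2]))  # no shared terms: similarity 0
```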
Intuition
[Figure: documents D1–D4 plotted in a vector space with term axes t1, t2, t3]
Postulate: Documents that are "close together" in vector space talk about the same things.
Clustering Algorithms
• Partitioning "flat" algorithms
  – Usually start with a random (partial) partitioning
  – Refine it iteratively
    • k-means/medoids clustering
    • Model-based clustering
• Hierarchical algorithms
  – Bottom-up, agglomerative
  – Top-down, divisive
k-Clustering Algorithms
• Given: a set of documents and the number k
• Find: a partition into k clusters that optimizes the chosen partitioning criterion
  – Globally optimal: exhaustively enumerate all partitions
  – Effective heuristic methods:
    • Iterative k-means and k-medoids algorithms
    • Hierarchical methods - stop at the level with k parts
How hard is clustering?
• One idea is to consider all possible clusterings and pick the one with the best inter- and intra-cluster distance properties
• Suppose we are given n points and would like to cluster them into k clusters
  – How many possible clusterings? The number of partitions of n points into k non-empty clusters is the Stirling number of the second kind, S(n, k) ≈ k^n / k!, so exhaustive enumeration is hopeless
Clustering Criteria: Maximum Spacing
• Spacing between clusters is defined as the minimum distance between any pair of points in different clusters
• Clustering of maximum spacing: given an integer k, find a k-clustering of maximum spacing
[Figure: example point set with a maximum-spacing clustering for k = 4]
Greedy Clustering Algorithm
• Single-link k-clustering algorithm
  – Form a graph on the vertex set U, corresponding to n clusters
  – Find the closest pair of objects such that each object is in a different cluster, and add an edge between them
  – Repeat n-k times, until there are exactly k clusters
• Key observation: this procedure is precisely Kruskal's algorithm for Minimum-Cost Spanning Tree (except we stop when there are k connected components)
• Remark: equivalent to finding an MST and deleting the k-1 most expensive edges
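A minimal sketch of this greedy single-link k-clustering procedure, assuming points are given as coordinate tuples and using a union-find structure in place of an explicit graph (the names and example points are illustrative):

```python
from itertools import combinations
import math

def single_link_k_clustering(points, k):
    """Kruskal-style greedy clustering: repeatedly merge the closest pair of clusters."""
    parent = list(range(len(points)))

    def find(i):
        # union-find "find" with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, sorted ascending -- the edges Kruskal would consider
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    clusters = len(points)
    for _, i, j in edges:
        if clusters == k:
            break                      # stop with exactly k connected components
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj            # merge the two closest clusters
            clusters -= 1

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Five points in the plane, k = 3: expect clusters {0, 1}, {2, 3}, {4}
print(single_link_k_clustering([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], k=3))
```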
Greedy Clustering Analysis
• Theorem. Let C* denote the clustering C*1, …, C*k formed by deleting the k-1 most expensive edges of an MST. C* is a k-clustering of maximum spacing.
• Pf. Let C denote some other clustering C1, …, Ck.
  – The spacing of C* is the length d* of the (k-1)st most expensive edge.
  – Since C is not C*, there is a pair pi, pj in the same cluster in C*, say C*r, but in different clusters in C, say Cs and Ct.
  – Some edge (p, q) on the pi-pj path in C*r spans two different clusters in C.
  – All edges on the pi-pj path have length ≤ d*, since Kruskal chose them before the k-1 deleted edges.
  – So the spacing of C is ≤ d*, since p and q are in different clusters. ▪
Hierarchical Agglomerative Clustering (HAC)
• The greedy algorithm above is one example of HAC
• Assumes a similarity function for determining the similarity of two instances
• Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster
• The history of merging forms a binary tree or hierarchy
A Dendrogram: Hierarchical Clustering
• Dendrogram: decomposes data objects into several levels of nested partitioning (a tree of clusters)
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
HAC Algorithm
  Start with all instances in their own cluster.
  Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.
Hierarchical Clustering Algorithms
• Agglomerative (bottom-up):
  – Start with each document being a single cluster
  – Eventually all documents belong to the same cluster
• Divisive (top-down):
  – Start with all documents belonging to the same cluster
  – Eventually each node forms a cluster on its own
• Does not require the number of clusters k in advance
• Needs a termination/readout condition
  – The final state in both agglomerative and divisive clustering (one all-document cluster / all singletons) is of no use
Dendrogram: Document Example
• As clusters agglomerate, docs likely fall into a hierarchy of "topics" or concepts.
[Figure: dendrogram over documents d1–d5, merging d1 with d2, d4 with d5, then d3 with {d4, d5}]
Hierarchical Clustering
• Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
• Max spacing
  – Measure intercluster distances by distances of nearest pairs
• Euclidean spacing
  – Each cluster has a centroid = average of its points
  – Measure intercluster distances by distances of centroids
"Closest pair" of clusters
• Many variants for defining the closest pair of clusters
• Single-link, or max spacing
  – Similarity of the most similar pair (same as Kruskal's MST algorithm)
• "Center of gravity"
  – Clusters whose centroids (centers of gravity) are the most similar
• Average-link
  – Average similarity between pairs of elements
• Complete-link
  – Similarity of the "furthest" points, i.e., the least similar pair
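A minimal sketch of these "closest pair" variants, written as distances between clusters of points; the slides describe them in terms of similarity, while this sketch uses Euclidean distance, and the function names are illustrative:

```python
import math

def single_link(A, B):        # distance of the closest pair (max similarity)
    return min(math.dist(a, b) for a in A for b in B)

def complete_link(A, B):      # distance of the furthest pair (min similarity)
    return max(math.dist(a, b) for a in A for b in B)

def average_link(A, B):       # average over all cross-cluster pairs
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid_link(A, B):      # distance between centroids ("centers of gravity")
    cA = [sum(xs) / len(A) for xs in zip(*A)]
    cB = [sum(xs) / len(B) for xs in zip(*B)]
    return math.dist(cA, cB)

A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
```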
Impact of Choice of Similarity Measure
• Single-link clustering
  – Can result in "straggly" (long and thin) clusters due to the chaining effect
  – Appropriate in some domains, such as clustering islands: "Hawaii clusters"
• Uses the min-distance / max-similarity update
  – After merging ci and cj, the similarity of the resulting cluster to another cluster ck is
    sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
Single Link Example
Complete-Link Clustering
• Use the minimum similarity (maximum distance) over pairs:
  sim(ci, cj) = min over x ∈ ci, y ∈ cj of sim(x, y)
• Makes "tighter," more spherical clusters that are sometimes preferable
• After merging ci and cj, the similarity of the resulting cluster to another cluster ck is
  sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
Complete Link Example
Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²)
• In each of the subsequent n-2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters
  – Similarities between unchanged clusters can simply be stored and reused
• In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time
  – Otherwise O(n² log n), or O(n³) if done naively
Key notion: cluster representative
• We want a notion of a representative point in a cluster
• The representative should be some sort of "typical" or central point in the cluster, e.g.,
  – the point inducing the smallest radius over docs in the cluster
  – smallest squared distances, etc.
  – the point that is the "average" of all docs in the cluster
    • Centroid or center of gravity
Example: n = 6, k = 3, closest pair of centroids
[Figure: six documents d1–d6, showing the centroids after the first and second merge steps]
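A small sketch, assuming dense document vectors, of how a centroid can be maintained as clusters are merged: the merged centroid is the size-weighted average of the two centroids, so no pass over all member documents is needed.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def merge_centroids(c1, n1, c2, n2):
    """Centroid of the union, computed only from the two centroids and cluster sizes."""
    return [(n1 * a + n2 * b) / (n1 + n2) for a, b in zip(c1, c2)]

A = [(0.0, 0.0), (0.0, 2.0)]          # cluster A, centroid (0, 1)
B = [(4.0, 0.0)]                      # cluster B, centroid (4, 0)
print(merge_centroids(centroid(A), len(A), centroid(B), len(B)))  # [1.333..., 0.666...]
```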
Outliers in centroid computation
• Can ignore outliers when computing the centroid
• What is an outlier?
  – Lots of statistical definitions, e.g., moment of a point to the centroid > M × some cluster moment
[Figure: a cluster with its centroid and a distant outlier point]
Group-Average Clustering
• Uses the average similarity across all pairs within the merged cluster to measure the similarity of two clusters
• Compromise between single and complete link
• Two options:
  – Averaged across all ordered pairs in the merged cluster
  – Averaged over all pairs between the two original clusters
• Some previous work has used one of these options; some the other. No clear difference in efficacy.
Computing Group-Average Similarity
• Assume cosine similarity and vectors normalized to unit length
• Always maintain the sum of vectors in each cluster: s(c) = Σ over d ∈ c of d
• Then the group-average similarity of a (merged) cluster c, averaged over all ordered pairs of distinct members, can be computed in constant time:
  sim(c) = ( s(c) · s(c) − |c| ) / ( |c| · (|c| − 1) )
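A minimal sketch checking the constant-time formula against the naive pairwise average, assuming unit-length vectors (the toy vectors are made up):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

cluster = [normalize(v) for v in [(1, 0, 1), (0, 1, 1), (1, 1, 0)]]
m = len(cluster)

# Naive: average cosine similarity over all ordered pairs of distinct members
naive = sum(dot(a, b) for i, a in enumerate(cluster)
                      for j, b in enumerate(cluster) if i != j) / (m * (m - 1))

# Constant-time: from the maintained sum vector s(c)
s = [sum(xs) for xs in zip(*cluster)]
fast = (dot(s, s) - m) / (m * (m - 1))

print(naive, fast)   # the two values agree
```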
Medoid as Cluster Representative
• The centroid does not have to be a document
• Medoid: a cluster representative that is one of the documents
  – For example: the document closest to the centroid
• One reason this is useful
  – Consider the representative of a large cluster (>1000 documents)
  – The centroid of this cluster will be a dense vector
  – The medoid of this cluster will be a sparse vector
• Compare: mean/centroid vs. median/medoid
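A minimal sketch of picking a medoid as the document closest to the centroid; dense vectors are used here for brevity, whereas in practice the documents would be sparse tf-idf vectors:

```python
import math

def centroid(vectors):
    m = len(vectors)
    return [sum(xs) / m for xs in zip(*vectors)]

def medoid(vectors):
    """Return the member vector closest (in Euclidean distance) to the cluster centroid."""
    c = centroid(vectors)
    return min(vectors, key=lambda v: math.dist(v, c))

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(centroid(cluster))  # dense mean, not itself a document
print(medoid(cluster))    # an actual member of the cluster
```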
Homework Exercise
• Consider different agglomerative clustering methods for n points on a line. Explain how you could avoid n³ distance computations. How many distance computations will your scheme use?
Efficiency: using approximations
• In the standard algorithm, we must find the closest pair of centroids at each step
• Approximation: instead, find a nearly closest pair
  – Use some data structure that makes this approximation easier to maintain
  – Simplistic example: maintain the closest pair based on distances in a projection onto a random line
[Figure: points and their projections onto a random line]
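A minimal sketch of the simplistic random-line idea, assuming 2-D centroid coordinates: project each point onto a random direction and only compare pairs that are adjacent in the projected order. This is a heuristic, not the exact closest pair, and the names are illustrative:

```python
import math
import random

def approx_closest_pair(points):
    """Heuristic: closest pair among neighbors in the order of projections onto a random line."""
    theta = random.uniform(0, 2 * math.pi)
    direction = (math.cos(theta), math.sin(theta))
    # Sort points by their scalar projection onto the random direction
    order = sorted(points, key=lambda p: p[0] * direction[0] + p[1] * direction[1])
    # Only adjacent points in this order are candidate pairs
    return min(
        ((order[i], order[i + 1]) for i in range(len(order) - 1)),
        key=lambda pair: math.dist(*pair),
    )

pts = [(random.random(), random.random()) for _ in range(20)]
print(approx_closest_pair(pts))
```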
The dual space
• So far, we clustered docs based on their similarities in term space
• For some applications, e.g., topic analysis for inducing navigation structures, we can "dualize":
  – use docs as axes
  – represent users or terms as vectors
  – proximity based on co-occurrence of usage
  – now clustering users or terms, not docs
Next time • Iterative clustering using K-means • Spectral clustering using eigenvalues