Clustering CS 685 Special Topics in Data Mining

  • Slides: 25
Download presentation
Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu The UNIVERSITY

Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu The UNIVERSITY KENTUCKY CS 685 : Special Topics in Dataof Mining, UKY

Outline • • What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods

Outline • • What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods Model-based clustering methods Outlier analysis CS 685 : Special Topics in Data Mining, UKY

Hierarchical Clustering • Group data objects into a tree of clusters Step 0 a

Hierarchical Clustering • Group data objects into a tree of clusters Step 0 a b Step 1 Step 2 Step 3 Step 4 ab abcde c cde d de e Step 4 agglomerative (AGNES) Step 3 Step 2 Step 1 Step 0 divisive (DIANA) CS 685 : Special Topics in Data Mining, UKY

AGNES (Agglomerative Nesting) • Initially, each object is a cluster • Step-by-step cluster merging,

AGNES (Agglomerative Nesting) • Initially, each object is a cluster • Step-by-step cluster merging, until all objects form a cluster – Single-link approach – Each cluster is represented by all of the objects in the cluster – The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters CS 685 : Special Topics in Data Mining, UKY

Dendrogram • Show to merge clusters hierarchically • Decompose data objects into a multi-level

Dendrogram • Show to merge clusters hierarchically • Decompose data objects into a multi-level nested partitioning (a tree of clusters) • A clustering of the data objects: cutting the dendrogram at the desired level – Each connected component forms a cluster CS 685 : Special Topics in Data Mining, UKY

DIANA (DIvisive ANAlysis) • Initially, all objects are in one cluster • Step-by-step splitting

DIANA (DIvisive ANAlysis) • Initially, all objects are in one cluster • Step-by-step splitting clusters until each cluster contains only one object CS 685 : Special Topics in Data Mining, UKY

Distance Measures • • Minimum distance Maximum distance Mean distance Average distance m: mean

Distance Measures • • Minimum distance Maximum distance Mean distance Average distance m: mean for a cluster C: a cluster n: the number of objects in a cluster CS 685 : Special Topics in Data Mining, UKY

Challenges of Hierarchical Clustering Methods • Hard to choose merge/split points – Never undo

Challenges of Hierarchical Clustering Methods • Hard to choose merge/split points – Never undo merging/splitting – Merging/splitting decisions are critical • Do not scale well: O(n 2) • What is the bottleneck when the data can’t fit in memory? • Integrating hierarchical clustering with other techniques – BIRCH, CURE, CHAMELEON, ROCK CS 685 : Special Topics in Data Mining, UKY

BIRCH • Balanced Iterative Reducing and Clustering using Hierarchies • CF (Clustering Feature) tree:

BIRCH • Balanced Iterative Reducing and Clustering using Hierarchies • CF (Clustering Feature) tree: a hierarchical data structure summarizing object info – Clustering objects clustering leaf nodes of the CF tree CS 685 : Special Topics in Data Mining, UKY

Clustering Feature Vector Clustering Feature: CF = (N, LS, SS) N: Number of data

Clustering Feature Vector Clustering Feature: CF = (N, LS, SS) N: Number of data points LS: Ni=1=Xi SS: N 2 i=1=Xi CF = (5, (16, 30), (54, 190)) (3, 4) (2, 6) (4, 5) (4, 7) (3, 8) CS 685 : Special Topics in Data Mining, UKY

CF-tree in BIRCH • Clustering feature: – Summarize the statistics for a subcluster: the

CF-tree in BIRCH • Clustering feature: – Summarize the statistics for a subcluster: the 0 th, 1 st and 2 nd moments of the subcluster – Register crucial measurements for computing cluster and utilize storage efficiently • A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering – A nonleaf node in a tree has descendants or “children” – The nonleaf nodes store sums of the CFs of children CS 685 : Special Topics in Data Mining, UKY

CF Tree B = 7 CF 1 CF 2 CF 3 L = 6

CF Tree B = 7 CF 1 CF 2 CF 3 L = 6 child 1 child 2 child 3 CF 6 child 6 CF 1 Non-leaf node CF 2 CF 3 CF 5 child 1 child 2 child 3 child 5 Leaf node prev CF 1 CF 2 CF 6 next Root Leaf node prev CF 1 CF 2 CF 4 next CS 685 : Special Topics in Data Mining, UKY

Parameters of A CF-tree • Branching factor: the maximum number of children • Threshold:

Parameters of A CF-tree • Branching factor: the maximum number of children • Threshold: max diameter of sub-clusters stored at the leaf nodes CS 685 : Special Topics in Data Mining, UKY

BIRCH Clustering • Phase 1: scan DB to build an initial in-memory CF tree

BIRCH Clustering • Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data) • Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree CS 685 : Special Topics in Data Mining, UKY

Pros & Cons of BIRCH • Linear scalability – Good clustering with a single

Pros & Cons of BIRCH • Linear scalability – Good clustering with a single scan – Quality can be further improved by a few additional scans • Can handle only numeric data • Sensitive to the order of the data records CS 685 : Special Topics in Data Mining, UKY

Drawbacks of Square Error Based Methods • One representative per cluster – Good only

Drawbacks of Square Error Based Methods • One representative per cluster – Good only for convex shaped having similar size and density • A number of clusters parameter k – Good only if k can be reasonably estimated CS 685 : Special Topics in Data Mining, UKY

CURE: the Ideas • Each cluster has c representatives – Choose c well scattered

CURE: the Ideas • Each cluster has c representatives – Choose c well scattered points in the cluster – Shrink them towards the mean of the cluster by a fraction of – The representatives capture the physical shape and geometry of the cluster • Merge the closest two clusters – Distance of two clusters: the distance between the two closest representatives CS 685 : Special Topics in Data Mining, UKY

Drawback of Distance-based Methods • Hard to find clusters with irregular shapes • Hard

Drawback of Distance-based Methods • Hard to find clusters with irregular shapes • Hard to specify the number of clusters • Heuristic: a cluster must be dense CS 685 : Special Topics in Data Mining, UKY

Directly Density Reachable • Parameters Min. Pts = 3 Eps = 1 cm –

Directly Density Reachable • Parameters Min. Pts = 3 Eps = 1 cm – Eps: Maximum radius of the neighborhood – Min. Pts: Minimum number of points in an Epsneighborhood of that point – NEps(p): {q | dist(p, q) Eps} p q • Core object p: |Neps(p)| Min. Pts • Point q directly density-reachable from p iff q Neps(p) and p is a core object CS 685 : Special Topics in Data Mining, UKY

Density-Based Clustering: Background (II) • Density-reachable p p 1 q – Directly density reachable

Density-Based Clustering: Background (II) • Density-reachable p p 1 q – Directly density reachable p 1 p 2, p 2 p 3, …, pn 1 pn density-reachable from p 1 • Density-connected – Points p, q are density-reachable from o p and q are density-connected p q o CS 685 : Special Topics in Data Mining, UKY

DBSCAN • A cluster: a maximal set of density-connected points – Discover clusters of

DBSCAN • A cluster: a maximal set of density-connected points – Discover clusters of arbitrary shape in spatial databases with noise Outlier Border Core Eps = 1 cm Min. Pts = 5 CS 685 : Special Topics in Data Mining, UKY

DBSCAN: the Algorithm • Arbitrary select a point p • Retrieve all points density-reachable

DBSCAN: the Algorithm • Arbitrary select a point p • Retrieve all points density-reachable from p wrt Eps and Min. Pts • If p is a core point, a cluster is formed • If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database • Continue the process until all of the points have been processed CS 685 : Special Topics in Data Mining, UKY

Problems of DBSCAN • Different clusters may have very different densities • Clusters may

Problems of DBSCAN • Different clusters may have very different densities • Clusters may be in hierarchies CS 685 : Special Topics in Data Mining, UKY

OPTICS: A Cluster-ordering Method • OPTICS: ordering points to identify the clustering structure •

OPTICS: A Cluster-ordering Method • OPTICS: ordering points to identify the clustering structure • “Group” points by density connectivity – Hierarchies of clusters • Visualize clusters and the hierarchy CS 685 : Special Topics in Data Mining, UKY

DENCLUE: Using Density Functions • DENsity-based CLUst. Ering • Major features – Solid mathematical

DENCLUE: Using Density Functions • DENsity-based CLUst. Ering • Major features – Solid mathematical foundation – Good for data sets with large amounts of noise – Allow a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets – Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45) – But need a large number of parameters CS 685 : Special Topics in Data Mining, UKY