Clustering Large DB
- Most clustering algorithms assume a large data structure which is memory resident.
- Clustering may be performed first on a sample of the database, then applied to the entire database.
- Algorithms
  - BIRCH
  - DBSCAN
  - CURE
Part II - Clustering © Prentice Hall 1
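
The sample-then-apply idea can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which algorithm is run on the sample, so plain k-means (Lloyd's iteration) is used here as a stand-in, and the names `kmeans`, `nearest`, and `cluster_via_sample` are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a small, memory-resident sample (assumed model)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a group ends up empty
                centroids[i] = tuple(sum(c) / len(g) for c in zip(*g))
    return centroids

def cluster_via_sample(stream, sample, k):
    """Cluster the sample in memory, then label every tuple in one pass."""
    centroids = kmeans(sample, k)
    for p in stream:  # the full database is only streamed, never held in memory
        yield p, nearest(p, centroids)
```

Only the sample and the k centroids need to fit in main memory; the full database is read once, one tuple at a time.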

Desired Features for Large Databases
- One scan (or less) of DB
- Online
- Suspendable, stoppable, resumable
- Incremental
- Work with limited main memory
- Different techniques to scan (e.g. sampling)
- Process each tuple once

BIRCH
- Balanced Iterative Reducing and Clustering using Hierarchies
- Incremental, hierarchical, one scan
- Save clustering information in a tree
- Each entry in the tree contains information about one cluster
- New points are inserted into the closest entry in the tree

Clustering Feature
- (N, LS, SS)
  - N: Number of points in cluster
  - LS: Sum of points in the cluster
  - SS: Sum of squares of points in the cluster
- CF Tree
  - Balanced search tree
  - Node has CF triple for each child
  - Leaf node represents cluster and has CF value for each subcluster in it
  - Subcluster has maximum diameter
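
The CF triple is useful because it is additive: two subclusters can be merged by adding their triples, and the centroid and diameter can be computed from the triple alone, without revisiting the points. A minimal sketch, assuming SS is stored as a single scalar (the sum of squared norms) and that diameter means the root of the average squared pairwise distance; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class CF:
    n: int        # N: number of points
    ls: tuple     # LS: per-dimension sum of the points
    ss: float     # SS: sum of squared norms of the points (assumed scalar)

    @classmethod
    def of_point(cls, p):
        """CF triple of a single point."""
        return cls(1, tuple(p), float(sum(x * x for x in p)))

    def merge(self, other):
        """CF of the union of two disjoint subclusters (component-wise sum)."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def diameter(self):
        """sqrt of the average squared pairwise distance, from the triple only."""
        if self.n < 2:
            return 0.0
        ls2 = sum(x * x for x in self.ls)
        d2 = (2 * self.n * self.ss - 2 * ls2) / (self.n * (self.n - 1))
        return sqrt(max(d2, 0.0))  # guard against tiny negative rounding error

    def can_absorb(self, p, threshold):
        """Would adding p keep the subcluster within the diameter threshold?"""
        return self.merge(CF.of_point(p)).diameter() <= threshold
```

For example, merging the points (0, 0) and (2, 0) gives the triple (2, (2, 0), 4.0), with centroid (1, 0) and diameter 2.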

BIRCH Algorithm

Improve Clusters

DBSCAN
- Density Based Spatial Clustering of Applications with Noise
- Outliers will not affect creation of cluster.
- Input
  - MinPts: minimum number of points in cluster
  - Eps: for each point in cluster there must be another point in it less than this distance away.

DBSCAN Density Concepts
- Eps-neighborhood: Points within Eps distance of a point.
- Core point: A point whose Eps-neighborhood is dense enough (contains at least MinPts points).
- Directly density-reachable: A point p is directly density-reachable from a point q if the distance between them is small (within Eps) and q is a core point.
- Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.
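
The four definitions above translate almost directly into code. The sketch below is illustrative only and makes two assumptions that the slides leave open: a point counts as a member of its own Eps-neighborhood, and "within Eps" means distance `<= eps`. The function names are hypothetical, and the brute-force neighborhood scan is for clarity, not efficiency.

```python
from math import dist

def eps_neighborhood(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """Core point: the Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_reachable(points, p, q, eps, min_pts):
    """True if point p is directly density-reachable from point q."""
    return is_core(points, q, eps, min_pts) and dist(points[p], points[q]) <= eps

def density_reachable(points, p, q, eps, min_pts):
    """True if p can be reached from q through a chain of core points."""
    seen, frontier = {q}, [q]
    while frontier:
        cur = frontier.pop()
        if not is_core(points, cur, eps, min_pts):
            continue  # only core points can extend the chain
        for j in eps_neighborhood(points, cur, eps):
            if j == p:
                return True
            if j not in seen:
                seen.add(j)
                frontier.append(j)
    return False
```

Note that direct density-reachability is not symmetric: a border point is directly density-reachable from a neighboring core point, but not the other way around.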

Density Concepts

DBSCAN Algorithm
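
The algorithm listing from this slide is not reproduced here. As a rough stand-in, the following sketch shows DBSCAN as it is commonly described: grow a cluster outward from each unvisited core point, and mark points that no core point reaches as noise. It is not necessarily the slide's own formulation. It assumes a brute-force neighborhood search, distance `<= eps`, and a hypothetical `NOISE` label of -1.

```python
from math import dist

NOISE = -1  # hypothetical label for outliers

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force Eps-neighborhood; includes i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:  # only core points expand the cluster
                queue.extend(nj)
        cluster += 1
    return labels
```

Points that are first marked as noise can still join a cluster later as border points, which is why outliers do not affect how clusters are created.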

CURE
- Clustering Using Representatives
- Use many points to represent a cluster instead of only one
- Points will be well scattered
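
One way to obtain well-scattered representatives is farthest-point selection: start with the point farthest from the centroid, then repeatedly add the point farthest from the representatives already chosen. The sketch below illustrates this under assumptions not stated on the slide: the points are distinct, and the representatives are then shrunk toward the centroid by a fraction `alpha`, a step usually associated with CURE. All function names are hypothetical.

```python
from math import dist

def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def scattered_representatives(points, c):
    """Pick up to c well-scattered points by farthest-point selection."""
    center = centroid(points)
    # First representative: the point farthest from the centroid.
    reps = [max(points, key=lambda p: dist(p, center))]
    while len(reps) < min(c, len(points)):
        # Next: the point whose nearest chosen representative is farthest away.
        reps.append(max(points, key=lambda p: min(dist(p, r) for r in reps)))
    return reps

def shrink(reps, center, alpha):
    """Move each representative a fraction alpha of the way to the centroid."""
    return [tuple(x + alpha * (m - x) for x, m in zip(r, center)) for r in reps]

def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: closest pair of representatives."""
    return min(dist(a, b) for a in reps_a for b in reps_b)
```

Several scattered representatives let a cluster keep an elongated or irregular shape that a single centroid would not capture.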

CURE Approach

CURE Algorithm

CURE for Large Databases