Remaining Lectures in 2009
1. Advanced Clustering and Outlier Detection
2. Advanced Classification and Prediction
3. Top Ten Data Mining Algorithms (short)
4. Course Summary (short)
5. Assignment 5 Student Presentations
Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN

Clustering Part 2: Advanced Clustering and Outlier Detection
1. Hierarchical Clustering
2. More on Density-based Clustering: DENCLUE
3. [EM Top 10-DM-Alg]
4. Cluster Evaluation Measures
5. Outlier Detection

More on Clustering
1. Hierarchical Clustering: to be discussed on Nov. 11
2. DBSCAN: will be used in programming project

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
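The six steps can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `agglomerative`, the `linkage` parameter, and the sample points are all assumptions made for the example, and proximities are recomputed on demand instead of being kept in an explicit matrix.

```python
from itertools import combinations
from math import dist


def agglomerative(points, linkage=min):
    """Merge clusters bottom-up and return the list of merges performed.

    `linkage` reduces the pairwise point distances between two clusters to
    one number: `min` gives single link (MIN), `max` gives complete link (MAX).
    """
    # Step 2: every point starts as its own cluster (stored as index tuples).
    clusters = [(i,) for i in range(len(points))]

    def proximity(a, b):
        return linkage(dist(points[i], points[j]) for i in a for j in b)

    merges = []
    # Steps 3-6: repeat until only a single cluster remains.
    while len(clusters) > 1:
        # Step 4: find the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: proximity(*pair))
        merges.append((a, b, proximity(a, b)))
        # Step 5: recomputing proximities on demand stands in for the matrix update.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical sample data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
for a, b, d in agglomerative(points):
    print(f"merge {a} + {b} at distance {d:.2f}")
```

The returned merge list is exactly the information a dendrogram draws: which clusters were joined, and at what distance. Recomputing every proximity in each round is slow for large inputs; it is done here only to keep the sketch short.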

Starting Situation
• Start with clusters of individual points and a proximity matrix
[Figure: points p1, p2, p3, p4, p5, ... and their proximity matrix]

Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix]

Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix]

After Merging
• The question is “How do we update the proximity matrix?”
[Figure: proximity matrix over C1, C2 ∪ C5, C3, C4, with “?” in the entries that involve C2 ∪ C5]
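For MIN, MAX, and group average, the unknown entries can be filled in from the two old rows alone, without going back to the points. The sketch below shows that update under stated assumptions: the function name `merge_rows`, the dict-of-dicts layout, the cluster sizes, and every distance value are invented for illustration and do not come from the slides.

```python
def merge_rows(prox, sizes, a, b, rule="min"):
    """Return proximities from the merged cluster (a union b) to every other cluster.

    prox  : dict of dicts, prox[x][y] is the distance between clusters x and y
    sizes : dict mapping a cluster name to its number of points
    """
    merged = {}
    for k in prox:
        if k in (a, b):
            continue
        da, db = prox[a][k], prox[b][k]
        if rule == "min":        # single link: closest pair of points
            merged[k] = min(da, db)
        elif rule == "max":      # complete link: farthest pair of points
            merged[k] = max(da, db)
        elif rule == "average":  # group average: weight old rows by cluster size
            merged[k] = (sizes[a] * da + sizes[b] * db) / (sizes[a] + sizes[b])
        else:
            raise ValueError(f"unknown rule: {rule}")
    return merged


# Made-up symmetric distances; C2 and C5 are the closest pair.
names = ["C1", "C2", "C3", "C4", "C5"]
rows = [
    [0.0, 4.0, 6.0, 7.0, 5.0],
    [4.0, 0.0, 8.0, 9.0, 1.0],
    [6.0, 8.0, 0.0, 2.0, 3.0],
    [7.0, 9.0, 2.0, 0.0, 6.0],
    [5.0, 1.0, 3.0, 6.0, 0.0],
]
prox = {name: dict(zip(names, row)) for name, row in zip(names, rows)}
sizes = {"C1": 2, "C2": 3, "C3": 1, "C4": 2, "C5": 1}

for rule in ("min", "max", "average"):
    print(rule, merge_rows(prox, sizes, "C2", "C5", rule))
```

Only the row and column of the merged cluster change; all other entries of the matrix are carried over as they are.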

How to Define Inter-Cluster Similarity
[Figure: two clusters with the question “Similarity?” and a proximity matrix over points p1, p2, p3, p4, p5, ...]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward’s Method uses squared error
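The first four definitions can be written directly as functions over two lists of points. This is a sketch assuming Euclidean distance as the point-level proximity; the function names and the two sample clusters are hypothetical, and Ward's Method is left out.

```python
from math import dist


def pairwise(a, b):
    """All point-to-point distances between clusters a and b."""
    return [dist(p, q) for p in a for q in b]


def d_min(a, b):
    """MIN: distance of the closest pair."""
    return min(pairwise(a, b))


def d_max(a, b):
    """MAX: distance of the farthest pair."""
    return max(pairwise(a, b))


def d_group_average(a, b):
    """Group Average: mean over all pairs."""
    d = pairwise(a, b)
    return sum(d) / len(d)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def d_centroid(a, b):
    """Distance Between Centroids."""
    return dist(centroid(a), centroid(b))


# Hypothetical clusters: two vertical pairs, three units apart.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]
for f in (d_min, d_max, d_group_average, d_centroid):
    print(f.__name__, round(f(A, B), 4))
```

On the same pair of clusters the four measures already disagree, which is why the choice of measure changes the merge order and therefore the resulting hierarchy.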

Cluster Similarity: Group Average
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters.
• Need to use average connectivity for scalability since total proximity favors large clusters
[Figure: example with points 1, 2, 3, 4, 5]
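Written as a formula, the group average divides the total pairwise proximity by the number of pairs. The notation is an assumption made here: C_i and C_j are the two clusters and |C| is the number of points in a cluster.

```latex
\[
\mathrm{proximity}(C_i, C_j) =
  \frac{\sum_{p \in C_i} \sum_{q \in C_j} \mathrm{proximity}(p, q)}
       {|C_i| \cdot |C_j|}
\]
```

The numerator alone is the total proximity, which grows with the number of pairs; the denominator removes that dependence on cluster size.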

2009 Teaching of Clustering Part 1: Basics (September/October)
1. What is Clustering?
2. Partitioning/Representative-based Clustering
   • K-means
   • K-medoids
3. Density Based Clustering centering on DBSCAN
4. Region Discovery
5. Grid-based Clustering
6. Similarity Assessment
Clustering Part 2: Advanced Topics (November)

DBSCAN (http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf)
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – Input parameters: MinPts and Eps
  – A point is a core point if it has at least a specified number of points (MinPts) within Eps
      • These are points that are at the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point.
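The three point types can be computed directly from these definitions. The sketch below makes two assumptions that should be checked against the assignment's conventions: a point counts as a member of its own Eps-neighborhood, and a point is core when that neighborhood holds at least `min_pts` points. The function name and the sample points are hypothetical.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    # Eps-neighborhood of each point; the point itself is included (assumption).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}

    labels = []
    for i, nbrs in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical data: a plus-shaped group, one straggler, one far-away point.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (2, 0), (10, 10)]
print(classify_points(points, eps=1.0, min_pts=4))
```

In this data the straggler at (2, 0) lies within Eps of a border point but of no core point, so it is labeled noise.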

DBSCAN: Core, Border, and Noise Points
[Figure: core, border, and noise points]

DBSCAN Algorithm (simplified view for teaching)
1. Create a graph whose nodes are the points to be clustered
2. For each core point c, create an edge from c to every point p in the ε-neighborhood of c
3. Set N to the nodes of the graph
4. If N does not contain any core points, terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going forward;
   1. create a cluster containing X ∪ {c}
   2. N = N \ (X ∪ {c})
7. Continue with step 4
Remarks: points that are not assigned to any cluster are outliers; http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf gives a more efficient implementation by performing steps 2 and 6 in parallel
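The simplified algorithm translates almost step by step into code. This is a teaching sketch, not the efficient implementation from the paper. Its assumptions: a point belongs to its own neighborhood, a core point has at least `min_pts` points in that neighborhood, the core point picked in step 5 is the one with the smallest index, and a border point reachable from two clusters stays with the first one that claims it. The function name and sample data are hypothetical.

```python
from math import dist


def dbscan_graph(points, eps, min_pts):
    """Cluster `points` with the simplified graph-based steps.

    Returns (clusters, outliers); both hold point indices.
    """
    n = len(points)
    # Eps-neighborhoods; a point counts as its own neighbor (assumption).
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}

    # Steps 1-2: directed edges leave core points only.
    edges = {c: [p for p in nbrs[c] if p != c] for c in core}

    remaining = set(range(n))              # Step 3: the node set N
    clusters = []
    while True:
        candidates = sorted(remaining & core)
        if not candidates:                 # Step 4: no core point left in N
            break
        c = candidates[0]                  # Step 5: smallest index, for determinism
        # Step 6: nodes reachable from c by following edges forward.
        reached, stack = {c}, [c]
        while stack:
            node = stack.pop()
            for p in edges.get(node, []):
                if p in remaining and p not in reached:
                    reached.add(p)
                    stack.append(p)
        clusters.append(sorted(reached))   # 6.1: cluster X union {c}
        remaining -= reached               # 6.2: remove the cluster from N
    return clusters, sorted(remaining)     # points left in N are outliers


# Hypothetical data: a square with one border point, a lone point, a triangle.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1),
          (5, 5),
          (10, 10), (10, 11), (11, 10)]
clusters, outliers = dbscan_graph(points, eps=1.5, min_pts=3)
print(clusters, outliers)
```

Because border points have no outgoing edges, a cluster can grow only through core points, which is what keeps two dense regions separate even when they share a border point.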

DBSCAN: Core, Border and Noise Points
[Figure: panels “Original Points” and “Point types: core, border and noise”; Eps = 10, MinPts = 4]

When DBSCAN Works Well
[Figure: panels “Original Points” and “Clusters”]
• Resistant to Noise
• Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well
[Figure: panels “Original Points”, “(MinPts=4, Eps=9.75)”, and “(MinPts=4, Eps=9.12)”]
Problems with
• Varying densities
• High-dimensional data

Assignment 3 Dataset: Earthquake
[Figure: Earthquake dataset]

Assignment 3 Dataset: Complex9
http://www2.cs.uh.edu/~ml_kdd/Complex&Diamond/2DData.htm
Dataset: http://www2.cs.uh.edu/~ml_kdd/Complex&Diamond/Complex9.txt
[Figure: panels “K-Means in Weka” and “DBSCAN in Weka”]

DBSCAN: Determining Eps and MinPts
• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have the kth nearest neighbor at farther distance
• So, plot sorted distance of every point to its kth nearest neighbor
[Figure: sorted k-distance plot with regions “Core-points” and “Non-Core-points”; annotation: Run DBSCAN for MinPts=4 and ε=5]
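The values behind such a plot are easy to compute by brute force. The sketch below assumes Euclidean distance and does not count a point as its own neighbor; the function name and the sample grid are hypothetical, and the plotting step is left out.

```python
from math import dist


def sorted_k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


# Hypothetical data: a 3 x 3 grid (one dense cluster) plus one far-away point.
points = [(x, y) for x in range(3) for y in range(3)] + [(20, 20)]
k_dist = sorted_k_distances(points, k=4)
print([round(d, 2) for d in k_dist])
```

Plotting `k_dist` against its index gives the curve described above: the values stay low and flat for points inside the cluster and jump for the far-away point, and a distance just below that jump is a candidate for Eps, with MinPts set to k.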