Clustering Spatial Data Using Random Walks Author David

  • Slides: 18
Download presentation
Clustering Spatial Data Using Random Walks Author : David Harel Yehuda Koren Graduate :

Clustering Spatial Data Using Random Walks Author : David Harel Yehuda Koren Graduate : Chien-Ming Hsiao

 • • • Outline Motivation Objective Introduction Basic Notions Modeling The Data Clustering

• • • Outline Motivation Objective Introduction Basic Notions Modeling The Data Clustering Using Random Walks – Separators and separating operators – Clustering by separation – Clustering spatial points • Integration with Agglomerative Clustering • Examples • Conclusion • Opinion

Motivation • The characteristics of spatial data pose several difficulties for clustering algorithms •

Motivation • The characteristics of spatial data pose several difficulties for clustering algorithms • The clusters may have arbitrary shapes and nonuniform sizes – Different cluster may have different densities • The existence of noise may interfere the clustering process

Objective • Present a new approach to clustering spatial data • Seeking efficient clustering

Objective • Present a new approach to clustering spatial data • Seeking efficient clustering algorithms. • Overcoming noise and outliers

Introduction • The heart of the method is in what we shall be calling

Introduction • The heart of the method is in what we shall be calling separating operators. • Their effect is to sharpen the distinction between the weights of inter-cluster edges and intra-cluster edges – By decreasing the former and increasing the latter • It can be used on their own or can be embedded in a classical agglomerative clustering framework.

BASIC NOTIONS • graph-theoretic notions (A higher value means more similar)

BASIC NOTIONS • graph-theoretic notions (A higher value means more similar)

BASIC NOTIONS • The probability of a transition from node i to node j

BASIC NOTIONS • The probability of a transition from node i to node j • The probability that a random walk originating at s will reach t before returning to s

MODELINE THE DATA • Delaunay triangulation (DT) – Many O(n log n) time and

MODELINE THE DATA • Delaunay triangulation (DT) – Many O(n log n) time and O(n) space algorithms exist for computing the DT of a planar point set. • K-mutual neighborhood – The k-nearest neighbors of each point can be O(n log n) time O(n) space for any fixed arbitrary dimension. • The weight of the edge (a, b) is – d(a, b) is the Euclidean distance between a and b. – ave is the average Euclidean distance between two adjacent points.

CLUSTERING USING RANDOM WALKS • To identifying natural clusters in a graph is to

CLUSTERING USING RANDOM WALKS • To identifying natural clusters in a graph is to find ways to compute an intimacy relation between the nodes incident to each of the graph’s edges. • Identifying separators is to use an iterative process of separation. – This is a kind of sharpening pass

NS : Separation by neighborhood similarity Definition :

NS : Separation by neighborhood similarity Definition :

CE : Separation by circular escape Definition :

CE : Separation by circular escape Definition :

Clustering spatial points

Clustering spatial points

Integration with Agglomerative Clustering • The separation operators can be used as a preprocessing

Integration with Agglomerative Clustering • The separation operators can be used as a preprocessing before activating agglomerative clustering on the graph • Can effectively prevent bad local merging opposing the graph structure. • It is equivalent to a “single link” algorithm preceded by a separation operation

Examples

Examples

Conclusion • It is robust in the presence of noise and outliers, and is

Conclusion • It is robust in the presence of noise and outliers, and is flexible in handling data of different densities. • The CE operator yields better results than the NS operator • The time complexity of our algorithm applied to n data points is O(n log n)

Opinion • Since the algorithm does not rely on spatial knowledge, we can to

Opinion • Since the algorithm does not rely on spatial knowledge, we can to try it on other types of data.

END

END