Densitybased Place Clustering in GeoSocial Networks Jieming Shi

Density-based Place Clustering in Geo-Social Networks Jieming Shi, Dingming Wu, Nikos Mamoulis, David W. Cheung Department of Computer Science, The University of Hong Kong

Clustering Spatial clustering – grouping of spatial objects (geographic places in our case) into clusters Useful for marketing and urban planning Density based clustering divides a large collection of points into densely populated regions

DBSCAN algorithm DBSCAN is one of the most common data clustering algorithms – proposed in 1996 For each place p it finds all the places within the radius ε of p – eps-neighborhood. If the number of places in eps-neighborhood is no less than Min. Pts – p is called a core point -> it will form a cluster or will be a part of cluster Dense eps-neighborhoods are put into the same cluster if they contain the cores of each other

Example 1 Min. Pts = 4 2 3 ε ε … finish ε

DBSCAN result example

Use of geo-social network data Current spatial clustering models disregard information about the people who are related to the clustered places. Social Network with geographic checkins includes: Users Friendship Checkins connections

Motivation Urban planning: land managers are interested in identifying regions with uniform demographic statistics (for example, areas where elderly people prefer to visit or areas with people that have in common special transportation or living needs) Data cleaning: nearby Geo-Social Network locations collected by user check-ins could belong to the same physical place Marketing: if two or more places belong to the same geo-social cluster, the user who likes one place will probably be interested to visit the others

users friendship connections checkins places

Example 1 Example 2

Density-based Clustering Places in Geo-Social Networks (DCPGS)

Input

DCPGS - Geo-social ε-neighborhood definition

DCPGS algorithm idea

Distance functions

Social distance

Alternative ways to compute social distance – (1) Jaccard

Alternative ways to compute social distance – (2) Sim. Rank

Alternative ways to compute social distance – (3) Katz

Alternative ways to compute social distance – (4) Commute Time

Algorithms DCPGS-R and DCPGS-G

DCPGS-R: R-tree based The algorithm uses R-Tree to facilitate the search of geo-social ε-neighborhood for a given place For the sake of efficiency the social network is stored in a hash table – each pair of friends as an entry

Spatial query – uses R-tree The distance has already been computed Compute social and geo-social distance

DCPGS-G: Grid-based Individual R-tree based range queries find all the places within the radius max. D of the given geographic place in O(log n + < #of places within radius max. D >) which will be equal to O(log n) in most cases But when we have millions of places – we need to perform millions of such queries

DCPGS-G: Grid-based

Results

Visualization-based Analysys

Social Entropy based Evaluation

Social Entropy based Evaluation Commute. Time, and Katz have the lowest social entropy however, these methods produce small clusters and have too many outliers Jaccard also has low social entropy for the same reason DCPGS is better than Sim. Rank