Community Detection
Laks V. S. Lakshmanan
(based on Girvan & Newman. Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004); M. E. J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133 (2004).)

The Problem
• Can we partition the network into groups such that the inter-group edges are sparse while the intra-group edges are dense?
• Why is it interesting/useful?
◦ Understanding community structure is a means to understanding network structure.
◦ Graph partitioning is a similar problem: given a graph of processes whose edges represent communication, assign subgraphs to processors so as to minimize interprocessor communication and balance processor load. (NP-hard in general.)
◦ Differences from graph partitioning.

An Example with Three Communities

A Hierarchical Clustering Approach

Community detection via hierarchical clustering
• Compute all pairwise node similarities for every edge present.
• Repeatedly add edges with greatest similarity.
• This leads to a tree (called a dendrogram).
• A slice through the dendrogram represents a clustering, i.e., a community structure.
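The steps above can be sketched in Python. This is a minimal sketch under stated assumptions: the slides do not fix a similarity measure, so `similarity` (Jaccard overlap of closed neighbourhoods) is a hypothetical choice, and the names `hierarchical_merges` and `two_triangles` are illustrative only. Nodes are assumed to be sortable (here, integers).

```python
def hierarchical_merges(adj):
    """Return the merge sequence (a flat encoding of the dendrogram).

    adj maps each node to the set of its neighbours (undirected graph).
    """
    def similarity(u, v):
        # Hypothetical similarity: Jaccard overlap of closed neighbourhoods.
        a, b = adj[u] | {u}, adj[v] | {v}
        return len(a & b) / len(a | b)

    # One entry per undirected edge, ranked by decreasing similarity.
    edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    ranked = sorted(edges, key=lambda e: (-similarity(*e), e))

    # Union-find tracks which cluster each node currently belongs to.
    parent = {u: u for u in adj}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for u, v in ranked:
        ru, rv = find(u), find(v)
        if ru != rv:  # adding this edge joins two clusters
            parent[ru] = rv
            merges.append((u, v, similarity(u, v)))
    return merges


# Two triangles joined by the single edge (2, 3).
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
for u, v, s in hierarchical_merges(two_triangles):
    print(u, v, round(s, 3))
```

Cutting the merge list at a similarity threshold corresponds to taking a slice through the dendrogram: merges above the threshold define the clusters at that level.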

Dendrogram example

Limitations of HC approach
• “Misplaces” nodes, e.g., nodes in the periphery.
[Figure: a small graph on nodes 1–5.] Which community should node 5 belong to?
• Alternative approach based on “edge betweenness”.

Key Intuition
• An inter-community edge has a higher “betweenness” than an intra-community edge, i.e., more shortest paths (geodesics) between node pairs pass through it.
• Start with G.
• Repeatedly remove edges with highest betweenness until <some stopping criterion>.
• Communities = resulting components.

Basic Algorithm
• repeat {
◦ Calculate betweenness of all edges;
◦ Remove one with highest betweenness, breaking ties arbitrarily;
} until no edges left.
• Remarks:
◦ Which betweenness score?
◦ Calculate upfront and reuse, or recalculate?
◦ Can we incrementally recalculate after each edge removal?
◦ Related algorithms for node betweenness by Newman and Brandes.
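The loop above can be sketched in Python with geodesic (shortest-path) edge betweenness recalculated after every removal. This is a sketch under stated assumptions, not the authors' implementation: the stopping criterion "stop once there are k components" is an assumption (the slide runs until no edges are left), and the names `edge_betweenness`, `components`, and `remove_until_k_components` are illustrative. The graph is assumed undirected, unweighted, and free of self-loops.

```python
from collections import deque, defaultdict


def edge_betweenness(adj):
    """Geodesic edge betweenness; each node pair is counted from both ends."""
    total = defaultdict(float)
    for s in adj:
        # BFS from s: dist = distance from s, w = number of geodesics to s.
        dist, w, order = {s: 0}, {s: 1}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], w[v] = dist[u] + 1, w[u]
                    queue.append(v)
                elif dist[v] == dist[u] + 1:
                    w[v] += w[u]
        # Accumulate scores from the farthest nodes back toward s.
        below = defaultdict(float)
        for v in reversed(order):
            for u in adj[v]:
                if dist[u] == dist[v] - 1:
                    score = (1.0 + below[v]) * w[u] / w[v]
                    total[frozenset((u, v))] += score
                    below[u] += score
    return total


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def remove_until_k_components(adj, k):
    """Remove highest-betweenness edges until k components exist (assumed criterion)."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    while len(components(adj)) < k:
        eb = edge_betweenness(adj)  # recalculated after every removal
        if not eb:
            break  # no edges left
        u, v = max(eb, key=eb.get)  # ties broken arbitrarily
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the single edge (2, 3); that edge is removed first.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(remove_until_k_components(graph, 2))
```

Recalculating inside the loop is the design choice the remarks ask about: it costs one full betweenness computation per removed edge.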

A Real Example (Zachary’s Karate Club)
[Figures: result with recalculation of betweenness; result without recalculation of betweenness.]

Scalability Issues

Computing edge betweenness: An Example
[Figure: an example graph on nodes a–g, built up over several slides. Each node is labeled with d (its distance from g) and w (its number of geodesics to g); edges are then labeled with betweenness scores.]
• Compute the number of geodesics from every node to g. Breadth-first search is a means for doing many things.
• Start at g: d=0, w=1.
• Next level: nodes with d=1, w=1.
• Next level: nodes with d=2, w=2.
• Last level: a node with d=3, w=4. We now have all the information we need for edge betweenness.
• Note: a and f are like leaves: no geodesic to g from other nodes passes through them.
• Edge labels shown in the figure, working from the leaves back toward g: 2/4, 1/2, ½(1+2/4), and 1/1[1+½(1+2/4)+1/2].
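The worked example can be mirrored by a short Python sketch of the single-source computation: a breadth-first search assigns d and w, then edge scores are accumulated from the farthest nodes back to the source. This follows the procedure described in the Girvan and Newman paper as best understood here. The function name `single_source_edge_scores` and the five-node graph are hypothetical; the graph is not the one drawn on the slides, which could not be recovered from the extracted text.

```python
from collections import deque, defaultdict


def single_source_edge_scores(adj, source):
    """Return (d, w, scores) for geodesics between `source` and all other nodes.

    adj maps each node to the set of its neighbours (undirected, unweighted).
    """
    d = {source: 0}   # distance from the source
    w = {source: 1}   # number of geodesics to the source
    order = []        # nodes in the order BFS dequeues them
    queue = deque([source])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in d:               # first time v is reached
                d[v] = d[u] + 1
                w[v] = w[u]
                queue.append(v)
            elif d[v] == d[u] + 1:       # another geodesic into v
                w[v] += w[u]

    below = defaultdict(float)  # sum of scores on edges hanging below a node
    scores = {}
    for v in reversed(order):   # farthest nodes first
        for u in adj[v]:
            if d[u] == d[v] - 1:         # u is one step closer to the source
                score = (1.0 + below[v]) * w[u] / w[v]
                scores[frozenset((u, v))] = score
                below[u] += score
    return d, w, scores


# Hypothetical graph: a diamond s-p-t, s-q-t with a tail t-u.
example = {
    "s": {"p", "q"}, "p": {"s", "t"}, "q": {"s", "t"},
    "t": {"p", "q", "u"}, "u": {"t"},
}
d, w, scores = single_source_edge_scores(example, "s")
print(d["u"], w["u"])                    # u is 3 steps away via 2 geodesics
print(scores[frozenset(("s", "p"))])     # score on edge s-p
```

Summing these scores over every choice of source gives the betweenness of each edge. A leaf such as u contributes w of its parent divided by its own w, which is the same pattern as the 2/4 and 1/2 labels in the slides.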

EB Computation summary

EB Computation summary (contd.)

EB computation – complexity analysis

On scaling up CD algorithm: point to ponder!

Closing Remarks 1/2
• Newman also proposed other bases for defining edge betweenness.
◦ Electrical current flow through the edge, where every edge is viewed as a unit resistance and we consider all source-sink pairs.
◦ Based on random walks.
◦ Both are less effective and more expensive than geodesics (see paper for details).
• What about directed and weighted graphs?

Closing Remarks 2/2
• Goodness metric of a community division.
• Helpful when we don’t know the ground truth.
• $Q = \sum_i (e_{ii} - a_i^2)$, where $E_{k \times k}$ is the matrix of the community division: $e_{ij}$ = fraction of edges linking community $i$ to community $j$; $a_i = \sum_j e_{ij}$. Q measures the fraction of intra-community edges over what is expected by chance (assuming uniform distribution). See paper for details of experimental results.
• It turns out that the study of influence/information propagation can suggest new ways of detecting communities: we will revisit this issue after we study influence propagation.
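The quantity Q can be computed directly from an adjacency structure and a community assignment. The sketch below assumes the usual convention for an undirected graph, namely that an edge between two different communities i and j is split evenly between e_ij and e_ji, so that a_i is the fraction of edge ends attached to community i. The function name `modularity` and the example graph are illustrative, not from the source.

```python
from collections import defaultdict


def modularity(adj, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    adj maps each node to the set of its neighbours;
    community maps each node to its community label.
    """
    m = sum(len(vs) for vs in adj.values()) / 2  # number of edges
    e_ii = defaultdict(float)  # fraction of edges inside community i
    a = defaultdict(float)     # fraction of edge ends attached to community i
    for u in adj:
        for v in adj[u]:
            # Each undirected edge is visited twice (once per end),
            # and each visit is worth 1/(2m).
            a[community[u]] += 1 / (2 * m)
            if community[u] == community[v]:
                e_ii[community[u]] += 1 / (2 * m)
    return sum(e_ii[c] - a[c] ** 2 for c in a)


# Two triangles joined by the single edge (2, 3).
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(graph, split))  # 5/14, about 0.357
```

Placing every node in a single community gives Q = 0 under this definition, so a positive Q indicates more intra-community edges than the chance baseline.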

Recommended Reading
• J. Ruan and W. Zhang. An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks. ICDM 2007.
• M. E. J. Newman. Modularity and community structure in networks. physics/0602124; Proceedings of the National Academy of Sciences (USA) 103 (2006): 8577–8582.
• Jure Leskovec, Kevin J. Lang, and Michael W. Mahoney. Empirical Comparison of Algorithms for Network Community Detection. WWW 2010.
• M. E. J. Newman. Communities, modules and large-scale structure in networks. Nature Physics 8, 25–31 (2012). doi:10.1038/nphys2162.