Properties of network community structure
Jure Leskovec (CMU); Kevin Lang, Anirban Dasgupta, Michael Mahoney (Yahoo! Research)

Networks
• Big data
• Study emerging behaviors
• How are small networks different from large ones?

Network community structure
• Communities (groups, clusters, modules): sets of nodes with lots of connections inside and few to the outside (the rest of the network)

Example 1: Biology
• Nodes represent proteins
• Edges represent interactions/associations
• Proteins with the same function interact more
• Can use the network to discover functional groups
Figure: Yeast transcriptional regulatory modules [Bar-Joseph et al., 2003]

Example 2: Social networks
• Clusters correspond to social communities or organizational units (e.g., departments)
• Zachary’s karate club network:
  ▪ During the study the club split into two
  ▪ The split corresponds to the min-cut (● vs. ■)

Example 3: Web (blogs)
Figure: Democrat vs. Republican blogs [Adamic-Glance 2005]

Example 4: Science
• Citations
• Collaborations [Newman 2003]

Example 5: Hierarchies
• Nested communities: the modular structure of networks is hierarchically organized
Figure: example hierarchy with University at the top, then Arts and Science, then CS, Math, Drama, Music

Example 6: Hierarchies
• Recursive hierarchical network: (a) N=5, E=8; (b) N=25, E=56; (c) N=125, E=344

Discovering communities
• Intuition: find nodes that can be easily separated from the rest of the network
• Various objective functions: min-cut, normalized cut, centrality, modularity
• Various algorithms: spectral clustering (random walks), Girvan-Newman (centrality), Metis (contraction based)
• Girvan-Newman:
  1) Betweenness centrality: number of shortest paths passing through an edge
  2) Remove edges by decreasing centrality
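The two Girvan-Newman steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a simple, connected, undirected graph given as an edge list, uses a Brandes-style BFS to compute edge betweenness, recomputes it after every removal, and stops at the first split. The function names and the toy graph are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (unweighted, undirected)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate path counts back towards s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every node pair was counted from both endpoints.
    return {e: b / 2 for e, b in bet.items()}


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def girvan_newman_split(edges):
    """Remove the highest-betweenness edge until the graph falls apart."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while len(components(adj)) == 1:
        bet = edge_betweenness(adj)
        u, v = tuple(max(bet, key=bet.get))
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy graph: two triangles joined by one edge.
TOY_EDGES = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(girvan_newman_split(TOY_EDGES))
```

On the toy graph the joining edge carries every shortest path between the two triangles, so it is removed first.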

Our question: How well are communities expressed?

Statistical properties of community structure
• Instead of searching for communities, we measure how well communities are expressed
Questions:
• What is the community structure of real-world networks?
• How can we measure and quantify it?
• What does this tell us about network structure?
• What is a good model (intuition)?
• What are the consequences for clustering/partitioning algorithms?

Community score (quality)
• How community-like is a set of nodes S?
• Need a natural, intuitive measure
• Conductance (normalized cut): Φ(S) = # edges cut / # edges inside
• Small Φ(S) corresponds to more community-like sets of nodes

Community score (quality)
• What is the “best” community of 5 nodes?
• Score: Φ(S) = # edges cut / # edges inside

Community score (quality)
• What is the “best” community of 5 nodes?
• Bad community: Φ = 5/6 = 0.83
• Score: Φ(S) = # edges cut / # edges inside

Community score (quality)
• What is the “best” community of 5 nodes?
• Bad community: Φ = 5/7 = 0.7
• Better community: Φ = 2/5 = 0.4
• Score: Φ(S) = # edges cut / # edges inside

Community score (quality)
• What is the “best” community of 5 nodes?
• Bad community: Φ = 5/7 = 0.7
• Better community: Φ = 2/5 = 0.4
• Best community: Φ = 2/8 = 0.25
• Score: Φ(S) = # edges cut / # edges inside
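The score on these slides can be computed directly from an edge list. The sketch below implements the ratio exactly as the slides state it (edges cut over edges inside) and, for comparison, the volume-based form of conductance, cut(S) / min(vol(S), vol(V \ S)), which is the usual textbook definition. The function names and the toy graph are hypothetical.

```python
def community_score(edges, community):
    """Slide score: (# edges cut) / (# edges inside the community)."""
    members = set(community)
    cut = sum(1 for u, v in edges if (u in members) != (v in members))
    inside = sum(1 for u, v in edges if u in members and v in members)
    return cut / inside if inside else float("inf")


def conductance(edges, community):
    """Volume-based conductance: cut / min(vol(S), vol(rest))."""
    members = set(community)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        if (u in members) != (v in members):
            cut += 1
        for node in (u, v):  # each edge adds 1 to the degree of both endpoints
            if node in members:
                vol_in += 1
            else:
                vol_out += 1
    smaller = min(vol_in, vol_out)
    return cut / smaller if smaller else float("inf")


# Hypothetical toy graph: two triangles joined by one edge.
TOY_EDGES = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]

print(community_score(TOY_EDGES, {0, 1, 2}))  # 1 cut edge over 3 inside edges
print(conductance(TOY_EDGES, {0, 1, 2}))      # 1 cut edge over volume 7
```

Both scores rank the triangle {0, 1, 2} well because only one edge leaves it; they differ in the denominator, so absolute values are not comparable between the two.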

Network Community Profile Plot
• We define the network community profile (NCP) plot: plot the score of the best community of size k
• Search over all subsets of size k and find the best: Φ(k=5) = 0.25
• The NCP plot is intractable to compute

Network Community Profile Plot
• We define the network community profile (NCP) plot: plot the score of the best community of size k
Figure: NCP plot of community score, log Φ(k), against community size, log k; marked points: k=5, Φ(k)=0.25 and k=7, Φ(k)=0.18
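For a tiny graph the NCP definition can be evaluated literally, by scoring every subset of each size. The sketch below does exactly that with the cut/inside score from the earlier slides; it is only meant to make the definition concrete, since the number of subsets grows combinatorially, which is why the exact NCP is intractable for real networks. The function names and the toy graph are hypothetical.

```python
from itertools import combinations


def community_score(edges, members):
    """Slide score: (# edges cut) / (# edges inside)."""
    cut = sum(1 for u, v in edges if (u in members) != (v in members))
    inside = sum(1 for u, v in edges if u in members and v in members)
    return cut / inside if inside else float("inf")


def ncp_brute_force(edges, sizes):
    """Best (lowest) score over all node subsets of each size k."""
    nodes = sorted({x for e in edges for x in e})
    profile = {}
    for k in sizes:
        profile[k] = min(
            community_score(edges, set(subset))
            for subset in combinations(nodes, k)
        )
    return profile


# Hypothetical toy graph: two triangles joined by one edge.
TOY_EDGES = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]

for k, phi in ncp_brute_force(TOY_EDGES, range(2, 5)).items():
    print(k, round(phi, 3))
```

On this graph the profile dips at k=3, where a whole triangle can be cut off by a single edge, and rises again at k=4.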

NCP: Example
Figure: NCP plot; y-axis: community score, log Φ(k); x-axis: community size, log k

Scaling to large networks
• Local spectral clustering algorithm:
  ▪ Pick a seed node
  ▪ Slowly diffuse mass around it (via a PageRank-like random walk)
  ▪ Find the bottleneck
• Repeat many times
• Many seed nodes for very local walks
• Fewer seed nodes for more global (longer) walks
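The seed, diffuse, and find-the-bottleneck steps can be sketched with a personalized PageRank vector followed by a sweep cut. This is an illustration under stated assumptions rather than the algorithm used in the talk: the PageRank vector is computed by plain power iteration over the whole graph (so it is not actually local), the teleport probability alpha=0.15 is an arbitrary choice, the graph is assumed to have no isolated nodes or self-loops, and the sweep uses volume-based conductance. All names are hypothetical.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Random walk that jumps back to `seed` with probability alpha."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha
        for v, mass in p.items():
            share = (1 - alpha) * mass / len(adj[v])
            for w in adj[v]:
                nxt[w] += share
        p = nxt
    return p


def sweep_cut(adj, p):
    """Order nodes by p/degree and return the best-conductance prefix."""
    order = sorted(adj, key=lambda v: p[v] / len(adj[v]), reverse=True)
    total_vol = sum(len(adj[v]) for v in adj)
    members, vol, cut = set(), 0, 0
    best_phi, best_set = float("inf"), set()
    for v in order[:-1]:  # the full node set is not a valid cut
        members.add(v)
        vol += len(adj[v])
        links_in = sum(1 for w in adj[v] if w in members)
        cut += len(adj[v]) - 2 * links_in  # new cut edges minus healed ones
        phi = cut / min(vol, total_vol - vol)
        if phi < best_phi:
            best_phi, best_set = phi, set(members)
    return best_phi, best_set


# Hypothetical toy graph: two triangles joined by one edge.
TOY_EDGES = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
ADJ = {}
for u, v in TOY_EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

phi, community = sweep_cut(ADJ, personalized_pagerank(ADJ, seed=0))
print(sorted(community), round(phi, 3))
```

In this sketch a larger alpha keeps the walk closer to the seed (a more local view), while a smaller alpha lets the mass spread further before the sweep.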

What do NCP plots of real networks look like?

NCP plot: Small Social Network
• Dolphin social network: two communities of dolphins
Figure: the network and its NCP plot

NCP plot: Zachary’s karate club
• Zachary’s university karate club social network
• During the study the club split into two
• The split (squares vs. circles) corresponds to cut B
Figure: the network and its NCP plot

NCP plot: Network Science
• Collaborations between scientists in network science
Figure: the network and its NCP plot

NCP plot: Grids
Figure: the network and its NCP plot

NCP plot: Graphs with geometry
Figure: the network and its NCP plot

NCP plot: Manifold dataset
• Manifold learning dataset (Hands)
Figure: the network and its NCP plot

NCP plot: Power Grid
• Eastern US power grid

NCP plot: Hierarchical network
• Small social networks, geometric networks, and hierarchical networks have downward-sloping NCP plots
• What about large networks?
Figure: the network and its NCP plot

NCP of Large Networks

Our work: Large networks
• Previously, researchers examined the community structure of small networks (~100 nodes)
• We examined more than 70 different large networks
• Large real-world networks look very different!

Example of our findings
• Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)

NCP: LiveJournal (N=5M, E=42M)
Figure: NCP plot of community score against community size, annotated:
• Better and better communities
• Best community has 100 nodes
• Best communities get worse and worse

Explanation: Downward part
• Whiskers are responsible for the downward slope of the NCP plot
• A whisker is a set of nodes connected to the network by a single edge
Figure: largest whisker
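The whisker definition (a set of nodes attached to the rest of the network by a single edge) can be turned into a small brute-force search. The sketch below assumes a connected, simple, undirected graph, tests every edge by removing it and checking reachability, treats the smaller side of each such cut as a whisker candidate, and keeps only the maximal candidates. This is quadratic and only suitable for toy graphs; the function names and the example graph are hypothetical.

```python
def side_without_edge(adj, start, banned):
    """Nodes reachable from `start` when the edge `banned` is ignored."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if frozenset((v, w)) == banned:
                continue
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def find_whiskers(edges):
    """Maximal node sets that a single edge removal detaches from the graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    candidates = []
    for u, v in edges:
        side = side_without_edge(adj, u, frozenset((u, v)))
        if v in side:
            continue  # still connected, so (u, v) is not a cut edge
        other = set(adj) - side  # relies on the graph being connected
        candidates.append(side if len(side) <= len(other) else other)
    # Drop candidates nested inside a larger whisker.
    return [w for w in candidates if not any(w < o for o in candidates)]


# Hypothetical toy graph: a 4-clique core with a path and a triangle attached.
CORE = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
PATH_WHISKER = [(3, 4), (4, 5)]
TRIANGLE_WHISKER = [(0, 6), (6, 7), (7, 8), (6, 8)]
TOY_EDGES = CORE + PATH_WHISKER + TRIANGLE_WHISKER

print(sorted(sorted(w) for w in find_whiskers(TOY_EDGES)))
```

The triangle whisker in the toy graph is an example of a whisker that is not a tree.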

Explanation: Upward part
• Each new edge inside the community costs more
• Each node has twice as many children
• Φ = 1/3 = 0.33, Φ = 2/4 = 0.5, Φ = 8/6 = 1.3, Φ = 64/14 = 4.6
Figure: NCP plot

Comparison to rewired network
• Take a real network G
• Rewire edges for a long time
• We obtain a random graph with the same degree distribution as the real network G
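One common way to realize this rewiring is the degree-preserving double edge swap: pick two edges (a, b) and (c, d) and replace them with (a, d) and (c, b), rejecting swaps that would create a self-loop or a duplicate edge. The sketch below assumes a simple undirected graph with at least two edges; the slide does not say which procedure was used, so treat the swap rule, the function names, and the example graph as assumptions.

```python
import random


def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization by repeated double edge swaps."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # try both orientations of the second edge
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if a == d or c == b:
            continue  # would create a self-loop
        if new1 == new2 or new1 in present or new2 in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def degree_sequence(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


# Hypothetical toy graph: a 10-node ring with two chords.
RING = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
rewired = rewire(RING, n_swaps=50)
print(degree_sequence(rewired) == degree_sequence(RING))
```

Each accepted swap leaves every node's degree unchanged, so the rewired graph keeps the original degree distribution while local structure is scrambled.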

Comparison to a rewired network
• Rewired network: random network with the same degree distribution

LiveJournal whisker sizes
• Whiskers in real networks are larger than expected

Whisker shapes
• Whiskers in real networks are non-trivial (richer than trees)
Figure: whisker shapes, with the edge to cut marked

Caveat: Bag of whiskers
What if we allow cuts that give disconnected communities?
• Cut all whiskers
• Compose communities out of whiskers
• How good are the “communities” we get?

Communities made of whiskers
• We get better community scores when composing disconnected sets of whiskers
Figure: NCP plot (community score against community size) comparing connected communities with bag of whiskers

What if I remove whiskers?
• Nothing happens! Now we have 2-edge-connected whiskers to deal with.

Conclusion: Core+Whiskers
Figure: NCP plot comparing the rewired network, connected communities, and bag of whiskers

Conclusion: Network structure
• Denser and denser core of the network
• The core contains 60% of the nodes and 80% of the edges
• Network structure: core-periphery (jellyfish, octopus)
• Whiskers are responsible for good communities

What is a good model? Intuition?

(Sparse) Random graphs
• (Sparse) random graph: start with N nodes, pick pairs of nodes uniformly at random and connect them
• Theorem (works for any degree distribution)
• NCP plot: flat (long random connections)
• Sparsity does not explain our observation

Preferential attachment
• Preferential attachment [Price 1965, Albert & Barabasi 1999]: add a new node and create m out-links; the probability of linking to a node i is proportional to its degree k_i
• Based on Herbert Simon’s result: power laws arise from “rich get richer” (cumulative advantage)
• NCP plot: flat (connections to hubs, no locality)
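The attachment rule can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes an undirected graph, starts from a small clique on m+1 nodes (the slide does not specify a starting graph), and samples targets in proportion to degree by drawing from a list in which every node appears once per incident edge. The function name and parameters are hypothetical.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m existing nodes."""
    rng = random.Random(seed)
    # Assumed starting graph: a clique on the first m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # A node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks a node with probability proportional to degree.
    endpoints = [x for e in edges for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


edges = preferential_attachment(n=50, m=2)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(edges), max(degree.values()))
```

Early nodes accumulate links faster than late ones, which is the "rich get richer" effect; because new links go to hubs regardless of position, the model has no notion of locality.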

Small-World model
• Let’s exploit local connections
• NCP plot: down (locally the network looks like a mesh) and then flat (at large scale the network looks random)

Geometry + Preferential attachment
• Geometric preferential attachment: place nodes at random in 2D, pick a node, pick nodes within a radius, connect preferentially
• NCP plot: flat (locally the network is random) and then down (globally the network is a mesh, a union of local expanders)

Model that works: Forest Fire
• Forest Fire: connections spread like a fire
  ▪ New node joins the network
  ▪ Selects a seed node
  ▪ Connects to some of its neighbors
  ▪ Continue recursively
• As a community grows it blends into the core of the network
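The four steps can be sketched as a simplified, undirected growth process. The slide gives no parameters, so everything numeric here is an assumption: a single burning probability p, a geometric number of neighbors burned at each step, and a uniformly random seed node. The function name is hypothetical, and this is not presented as the exact model used in the talk.

```python
import random


def forest_fire(n, p=0.35, seed=0):
    """Grow an undirected graph where each new node 'burns' outward from a seed."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        start = rng.randrange(new)  # seed node chosen uniformly (assumption)
        burned = {start}
        frontier = [start]
        while frontier:
            v = frontier.pop()
            unburned = [w for w in adj[v] if w not in burned]
            rng.shuffle(unburned)
            k = 0
            while rng.random() < p:  # geometric number of neighbors to burn
                k += 1
            for w in unburned[:k]:
                burned.add(w)
                frontier.append(w)
        # The new node links to every node the fire reached.
        adj[new] = set()
        for v in burned:
            adj[new].add(v)
            adj[v].add(new)
    return adj


graph = forest_fire(200)
n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), n_edges)
```

With a small p most fires die out immediately and the graph stays tree-like; with a larger p fires reach further, so new nodes attach to many existing nodes and local groups merge into a dense core.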

Forest Fire
Figure: NCP plot of the Forest Fire model, compared with the rewired network and bag of whiskers

Conclusion and connections
• Whiskers:
  ▪ Largest whisker has ~100 nodes
  ▪ Independent of network size
  ▪ Dunbar number: a person can maintain social relationships with at most 150 people
• Core:
  ▪ Core has little structure (hard to cut)
  ▪ Still more structure than the random network

Connections to previous work
• Other researchers examined small networks, so they did not hit Dunbar’s limit
• Small evidence:
  ▪ 400k-node Amazon co-purchasing network [Clauset et al. 2004]: the largest community has 50% of all nodes; it was labeled “Miscellaneous”
  ▪ Karate club has no significant community structure [Newman et al. 2007]

Other explanations
• Bond vs. identity communities
• Multiple hierarchies that blur the community boundaries

Is there still hope?
• Ground truth
• Yes: use attributes, better link semantics

Conclusion and connections
• The NCP plot is a way to analyze network community structure
• Our results agree with previous work on small networks (which are commonly used for testing community-finding algorithms)
• But large networks are different
• Large networks: whiskers + core structure; small, well-isolated communities blend into the core of the network as they grow