
Community Structures

What is Community Structure
• Definition:
  - A community is a group of nodes in which:
    - There are more edges (interactions) between nodes within the group than to nodes outside of it
My T. Thai (mythai@cise.ufl.edu)

Why Community Structure (CS)?
• Many systems can be expressed as a network, in which nodes represent the objects and edges represent the relations between them:
  - Social networks: collaboration, online social networks
  - Technological networks: IP address networks, WWW, software dependency
  - Biological networks: protein interaction networks, metabolic networks, gene regulatory networks

Why CS? Yeast protein interaction networks

Why CS? IP address network

Why Community Structure?
• Nodes in a community have some common properties
• Communities represent some properties of a network
• Examples:
  - In social networks, communities represent social groupings based on interest or background
  - In citation networks, they represent related papers on one topic
  - In metabolic networks, they represent cycles and other functional groupings

An Overview of Recent Work
• Disjoint CS
• Overlapping CS
• Centralized Approach
  - Define the quantity of modularity and use greedy algorithms, IP, SDP, spectral methods, random walks, clique percolation
• Localized Approach
• Handle Dynamics and Evolution
• Incorporate other information

Graph Partitioning? It’s not
• Graph partitioning algorithms are typically based on minimum cut approaches or spectral partitioning

Graph Partitioning
• Minimum cut partitioning breaks down when we don’t know the sizes of the groups
  - Optimizing the cut size with the group sizes free puts all vertices in the same group
• Cut size is the wrong thing to optimize
  - A good division into communities is not just one where there is a small number of edges between groups
• There must be a smaller than expected number of edges between communities

Edge Betweenness
• Focus on the edges which are least central to communities, i.e., the edges which are most “between” communities
• Instead of adding edges to an empty graph G = (V, ∅), progressively remove edges from the original graph G = (V, E)

Edge Betweenness
• Definition:
  - For each edge (u, v), the edge betweenness of (u, v) is defined as the number of shortest paths between any pair of nodes in the network that run through (u, v)
  - betweenness(u, v) = |{ P_xy : x, y ∈ V, P_xy is a shortest path between x and y, and (u, v) ∈ P_xy }|

Why Edge Betweenness

Algorithm
• Initialize G = (V, E) representing a network
• While E is not empty:
  - Calculate the betweenness of all edges in G
  - Remove the edge e with the highest betweenness: G = (V, E \ {e})
• In fact, we only need to recalculate the betweenness of the edges affected by the removal
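The removal loop on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`edge_betweenness`, `components`, `girvan_newman_split`) are hypothetical, the graph is assumed to be unweighted, undirected, and free of self-loops, and betweenness is recomputed from scratch after every removal. The sketch also stops at the first split instead of running until E is empty, which would produce the full hierarchy.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness via one BFS per source node.
    When several shortest paths join a pair, each carries an equal share."""
    score = defaultdict(float)
    for s in adj:
        dist, sigma = {s: 0}, defaultdict(float)
        sigma[s] = 1.0
        preds, order, queue = defaultdict(list), [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):               # farthest nodes first
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # every unordered pair was counted once from each endpoint
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    """Connected components found by depth-first search."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman_split(edges):
    """Remove the highest-betweenness edge repeatedly until the graph first splits."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = len(components(adj))
    while any(adj.values()):
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > start:
            return comps
    return components(adj)

# Two triangles joined by the bridge (2, 3)
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
parts = girvan_newman_split(edges)
```

On this toy graph the bridge carries all nine shortest paths between the two triangles, so it is removed first and the two triangles come out as communities.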

Time Complexity
• Let |V| = n and |E| = m
• Calculating the betweenness of all edges: O(mn)
• Since we need to recalculate each time we remove an edge: O(m^2 n)

An Example

Disadvantages/Improvements
• Can we improve the time complexity?
• The communities are in hierarchical form; can we find disjoint communities?

Define the quantity (measurement) of modularity Q and find an approximation algorithm to maximize Q

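Finding community structure in very large networks
Authors: Aaron Clauset, M. E. J. Newman, Cristopher Moore (2004)
• Consider edges that fall within a community or between a community and the rest of the network
• Define modularity, where m is the number of edges and $k_v$ is the degree of vertex v:
  $$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w)$$
  - $A_{vw}$: adjacency matrix
  - $\frac{k_v k_w}{2m}$: the probability of an edge between two vertices is proportional to their degrees
  - $\delta(c_v, c_w)$: 1 if the vertices are in the same community, 0 otherwise
• For a random network, Q = 0
  - The number of edges within a community is no different from what you would expect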
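As a rough illustration of the modularity definition, the sketch below evaluates Q for a given assignment of nodes to communities. It assumes the standard Newman form Q = (1/2m) Σ_vw [A_vw − k_v k_w / 2m] δ(c_v, c_w) for an unweighted, undirected graph without self-loops; the function name `modularity` and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Q = (fraction of edges inside communities)
         - sum over communities of (total degree / 2m)^2,
    which is the same quantity as the double sum over vertex pairs."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    inside = sum(1 for u, v in edges if community[u] == community[v])
    degree_sum = {}
    for node, k in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + k
    return inside / m - sum((d / (2 * m)) ** 2 for d in degree_sum.values())

# Two triangles joined by one bridge edge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
whole = {n: "a" for n in range(6)}

q_split = modularity(edges, split)   # 6/7 - 1/2
q_whole = modularity(edges, whole)   # 0: one big community is no better than chance
```

Putting every node in a single community gives Q = 0, which matches the slide's remark that Q measures the excess of within-community edges over what a random network would give.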

Finding community structure in very large networks
Authors: Aaron Clauset, M. E. J. Newman, Cristopher Moore (2004)
• Algorithm
  - Start with all vertices as isolates
  - Follow a greedy strategy:
    - Successively join the clusters with the greatest increase ΔQ in modularity
    - Stop when the maximum possible ΔQ ≤ 0 from joining any two
  - Successfully used to find community structure in a graph with > 400,000 nodes and > 2 million edges
    - Amazon’s “people who bought this also bought that…”
  - Alternatives for achieving optimum ΔQ:
    - Simulated annealing rather than greedy search
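The greedy strategy can be sketched naively as follows. This is an assumption-laden illustration rather than the paper's method: it scans every connected pair of communities at each step instead of using the sparse-matrix and heap data structures that make the published algorithm fast, and it relies on the identity ΔQ = 2(e_ij − a_i a_j), where e_ij is the fraction of edge ends joining communities i and j and a_i is the fraction of edge ends attached to community i. The function name `greedy_modularity` is hypothetical.

```python
def greedy_modularity(edges):
    """Agglomerative modularity maximization: merge the pair with the
    largest positive gain until no merge increases Q."""
    m = len(edges)
    nodes = {x for edge in edges for x in edge}
    members = {n: {n} for n in nodes}      # community id -> member nodes
    a = {n: 0.0 for n in nodes}            # fraction of edge ends per community
    e = {}                                 # frozenset({i, j}) -> e_ij
    for u, v in edges:
        a[u] += 1 / (2 * m)
        a[v] += 1 / (2 * m)
        if u != v:
            key = frozenset((u, v))
            e[key] = e.get(key, 0.0) + 1 / (2 * m)
    while True:
        best, best_gain = None, 0.0
        for key, e_ij in e.items():
            i, j = key
            gain = 2 * (e_ij - a[i] * a[j])
            if gain > best_gain:           # only strictly positive gains qualify
                best, best_gain = (i, j), gain
        if best is None:                   # maximum gain <= 0: stop
            break
        i, j = best
        members[i] |= members.pop(j)       # merge community j into i
        a[i] += a.pop(j)
        merged = {}
        for key, value in e.items():
            if key == frozenset((i, j)):
                continue                   # this link is now internal to i
            new_key = frozenset(i if x == j else x for x in key)
            merged[new_key] = merged.get(new_key, 0.0) + value
        e = merged
    return list(members.values())

# Two triangles joined by one bridge edge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
found = greedy_modularity(edges)
```

On the toy graph the merges rebuild each triangle and then stop, because joining the two triangles across the single bridge would lower Q.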

Extensions to weighted networks
• Betweenness clustering?
  - Will not work: strong ties will have a disproportionate number of short paths, and those are the ones we want to keep
• Modularity (Analysis of weighted networks, M. E. J. Newman)
(Figure: weighted-edge network of keywords from Reuters news articles)

Structural Quality
• Coverage
• Modularity
• Conductance
• Inter-cluster conductance
• Average conductance
There is no single perfect quality function. [Almeida et al. 2011]

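Resolution Limit
$$Q = \sum_{s} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]$$
• $l_s$: # links inside module s
• $L$: # links in the network
• $d_s$: the total degree of the nodes in module s
• $\frac{d_s^2}{4L}$: expected # of links in module s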

The Limit of Modularity
• Modularity seems to have some intrinsic scale of order $\sqrt{L}$, which constrains the number and the size of the modules.
• For a given total number of nodes and links we could build many more than $\sqrt{L}$ modules, but the corresponding network would be less “modular”, namely with a value of the modularity lower than the maximum.
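A small numerical check can make the limit concrete. The sketch below assumes modularity written per module as Q = Σ_s [l_s/L − (d_s/2L)²], with l_s the links inside module s, d_s its total degree, and L the links in the network. The test network, a ring of n identical cliques where neighbouring cliques are joined by a single link, is a commonly used illustration and is an assumption here, as are the function names.

```python
def q_from_modules(modules, L):
    """Modularity from a list of (l_s, d_s) pairs, one per module."""
    return sum(l / L - (d / (2 * L)) ** 2 for l, d in modules)

def ring_of_cliques(n, m):
    """Return (Q with one module per clique, Q with adjacent cliques paired)
    for a ring of n cliques of m nodes each; n is assumed to be even."""
    inner = m * (m - 1) // 2               # links inside one clique
    L = n * inner + n                      # clique links plus n ring links
    single = [(inner, 2 * inner + 2)] * n
    paired = [(2 * inner + 1, 4 * inner + 4)] * (n // 2)
    return q_from_modules(single, L), q_from_modules(paired, L)

q_single_big, q_paired_big = ring_of_cliques(30, 5)    # many small cliques
q_single_small, q_paired_small = ring_of_cliques(4, 5) # few cliques
```

With 30 cliques of 5 nodes the paired partition scores higher (about 0.888 against 0.876), so maximizing Q merges cliques that are obviously separate; with only 4 cliques the natural partition wins.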

The Resolution Limit
Since $M_1$ and $M_2$ are constructed modules, we have

The Resolution Limit (cont)
Let’s consider the following case:
• $Q_A$: $M_1$ and $M_2$ are separate modules
• $Q_B$: $M_1$ and $M_2$ form a single module
Since both $M_1$ and $M_2$ are modules by construction, we need
That is,

The Resolution Limit (cont)
Now let’s see how it contradicts the constructed modules $M_1$ and $M_2$. We consider the following two scenarios:
• The two modules have a perfect balance between internal and external degree ($a_1 + b_1 = 2$, $a_2 + b_2 = 2$), so they are on the edge between being or not being communities, in the weak sense.
• The two modules have the smallest possible external degree, which means that there is a single link connecting them to the rest of the network and only one link connecting each other ($a_1 = a_2 = b_1 = b_2 = 1/l$).

Scenario 1 (cont)
When and , the right side of can reach the maximum value. In this case, may happen.

Scenario 2 (cont)
$a_1 = a_2 = b_1 = b_2 = 1/l$

Schematic Examples (cont)
For example, p = 5, m = 20. The maximal modularity of the network corresponds to the partition in which the two smaller cliques are merged.

Fix the resolution?
• Uncover communities of different sizes

Community Detection Algorithms
• Blondel (Louvain method) [Blondel et al. 2008]
  - Fast Modularity Optimization
  - Hierarchical clustering
• Infomap [Rosvall & Bergstrom 2008]
  - Maps of Random Walks
  - Flow-based and information theoretic
• InfoH (InfoHiermap) [Rosvall & Bergstrom 2011]
  - Multilevel Compression of Random Walks
  - Hierarchical version of Infomap

Community Detection Algorithms
• RN [Ronhovde & Nussinov 2009]
  - Potts Model Community Detection
  - Minimization of the Hamiltonian of a Potts model spin system
• MCL [Dongen 2000]
  - Markov Clustering
  - Random walks stay longer in dense clusters
• LC [Ahn et al. 2010]
  - Link Community Detection
  - A community is redefined as a set of closely interrelated edges
  - Overlapping and hierarchical clustering

Blondel et al.
• Two phases:
  - Phase 1:
    - Initially, we have n communities (each node is a community)
    - For each node i, consider each neighbor j of i and evaluate the modularity gain that would result from placing i in the community of j. Node i is placed in the community for which this gain is maximum (and positive)
    - Stop this process when no further improvement can be achieved
  - Phase 2:
    - Compress each community into a node, thus constructing a new graph representing the community structure after Phase 1
    - Re-apply Phase 1
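Phase 1 can be sketched as a local-moving loop. The code below is a simplified illustration for unweighted graphs without self-loops, not the reference implementation: Phase 2 (compressing communities into nodes and repeating) is omitted, and the function name `louvain_phase1` is hypothetical. It uses the fact that placing an isolated node i with degree k_i into community C changes Q by k_i,in / m − Σ_tot · k_i / (2m²), where k_i,in is the number of links from i into C and Σ_tot is the total degree of C; the comparison is done on this gain multiplied by 2m² so that only integers are involved.

```python
def louvain_phase1(edges):
    """Move nodes between neighbouring communities while modularity improves."""
    m = len(edges)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    community = {n: n for n in adj}        # each node starts as its own community
    total = dict(degree)                   # total degree of each community
    improved = True
    while improved:
        improved = False
        for i in adj:
            old = community[i]
            total[old] -= degree[i]        # take i out of its community
            links = {}                     # community -> number of links from i
            for j in adj[i]:
                links[community[j]] = links.get(community[j], 0) + 1
            # gain scaled by 2 * m^2; staying put is the baseline to beat
            best = old
            best_gain = 2 * m * links.get(old, 0) - total[old] * degree[i]
            for c, k_in in links.items():
                gain = 2 * m * k_in - total[c] * degree[i]
                if gain > best_gain:
                    best, best_gain = c, gain
            community[i] = best
            total[best] += degree[i]
            if best != old:
                improved = True
    return community

# Two triangles joined by one bridge edge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
labels = louvain_phase1(edges)
```

Every accepted move strictly increases Q, so the loop terminates; on the toy graph it settles on the two triangles.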


State-of-the-art methods
• Evaluated by Lancichinetti and Fortunato, Physical Review E 09
  - Infomap [Rosvall and Bergstrom, PNAS 07]
  - Blondel’s method [Blondel et al., J. of Statistical Mechanics: Theory and Experiment 08]
  - Ronhovde & Nussinov’s method (RN) [Phys. Rev. E 09]
• Many other recent heuristics
  - OSLOM, QCA…
No provable performance guarantee: need approximation algorithms

Power-Law Networks

PLNs Model P(α, β)

LDF Algorithm: The Basis

LDF Algorithm

An Example of LDF

Theorem: Sketch of the proof

LDF Undirected: Theorem

D-LDF: Directed Networks

D-LDF: Directed Networks (cont)

LDF-Directed Networks

Dynamic Community Structure
(Figure: network evolution over time points t, t+1, t+2; labels: merge, move, more edges)

Quantifying social group evolution (Palla et al., Nature 07)
• Developed an algorithm based on clique percolation, which makes it possible to investigate the time dependence of overlapping communities
  - Uncover basic relationships characterizing community evolution
  - Understand the development and self-optimization

Findings
• Fundamental differences between the dynamics of small and large groups
  - Large groups persist for longer and are capable of dynamically altering their membership
  - Small groups: their composition remains unchanged in order to be stable
• Knowledge of the time commitment of members to a given community can be used for estimating the community’s lifetime

Research Problems
• How to update the evolving community structure (CS) without re-computing it
• Why?
  - Prohibitive computational costs for re-computing
  - Re-computing may introduce incorrect evolution phenomena
• How to predict new relationships based on the evolution of the CS

An Adaptive Model
(Diagram: input network, basic CS, basic communities, network changes, updated communities)
• Need to handle
  - Node insertion
  - Edge insertion
  - Node removal
  - Edge removal

Related Work in Dynamic Networks
• GraphScope [J. Sun et al., KDD 2007]
• FacetNet [Y-R. Lin et al., WWW 2008]
• Bayesian inference approach [T. Yang et al., J. Machine Learning, 2010]
• QCA [N. P. Nguyen and M. T. Thai, INFOCOM 2011]
• OSLOM [A. Lancichinetti et al., PLoS ONE, 2011]
• AFOCS [Nguyen et al., MobiCom 2011]

An Adaptive Algorithm for Overlapping
Our solution: AFOCS, a 2-phase and limited input dependent framework
• Phase 1: Basic CS detection (input network to basic communities)
• Phase 2: Adaptive CS update (network changes to updated communities)
N. Nguyen and M. T. Thai, ACM MobiCom 2011

Phase 1: Basic Communities Detection
• Basic communities
  - Dense parts of the network
  - Can possibly overlap
  - Bases for adaptive CS update
• Duties
  - Locates basic communities
  - Merges them if they are highly overlapped

Phase 1: Basic Communities Detection
• Locating basic communities: when (C) = 0.9 (C) = 0.725
• Merging: when OS(Ci, Cj) = 1.027 = 0.75

Phase 1: Basic Communities Detection

Phase 2: Adaptive CS Update
• Update network communities when changes are introduced
  - Locally locate new local communities
  - Merge them if they highly overlap with current ones
• Need to handle
  - Adding a node/edge
  - Removing a node/edge
(Diagram: basic communities, network changes, updated communities)

Phase 2: Adding a New Node

Phase 2: Adding a New Edge

Phase 2: Removing a Node
• Identify the left-over structure(s) on C \ {u}
• Merge overlapping substructure(s)

Phase 2: Removing an Edge
• Identify the left-over structure(s) on C \ {u, v}
• Merge overlapping substructure(s)

AFOCS performance: Choosing β

AFOCS vs. Static Detection
• CFinder [G. Palla et al., Nature 2005]
• COPRA [S. Gregory, New J. of Physics, 2010]

AFOCS vs. Other Dynamic Methods
• iLCD [R. Cazabet et al., SOCIALCOM 2010]

Adaptive CS Detection in Dynamic Networks
• Running time is proportional to the amount of changes
• Can be locally employed
• More consistent community structure: critical for applications such as routing
(Flowchart: START, Initial Network, Changes in the Network, Compact Representation Graph (CRG), Refine CS, Output CS)

Adaptive CS Detection in Dynamic Networks
(Figure: example network on nodes a, b, t, x, y, z with edge weights)

A-LDF: Dynamic Network Algorithm
(Flowchart: START, Initial Network, Changes in the Network, Compact Representation Graph, Refine CS, Output CS)
• Both selected as the LDF algorithm (without the refining phase)
• Compact representation:
  - Label nodes that represent communities with leader
  - Unlabel all pulled-out nodes (nodes that are incident to changes)

A-LDF: Dynamic Network Algorithm (cont)

Experimental Results
• Datasets
  - Static data sets: Karate Club, Dolphin, Twitter, Flickr, etc.
  - Dynamic social networks:
    - Facebook (New Orleans): 63K nodes, 1.5M edges
    - arXiv citation network: 225K articles, ~40K new each year

Static Networks

#    Network               Vertices    Edges
1    Karate                34          78
2    Dolphin               62          159
3    Les Miserables        77          254
4    Political Books       105         441
5    Ame.Col.Fb.           115         613
6    Elec.Cir.S838         512         819
7    Erdos Scie.Collab.    6,100       9,939
8    Foursquare            44,832      1,664,402
9    Facebook              63,731      905,565
10   Twitter               88,484      2,364,322
11   Flickr                80,513      5,899,882

Performance Evaluation
(Figure: modularity of LDF, Blondel, and Optimal, and running time in seconds of LDF and Blondel, on the static networks)

Evaluation in Dynamic Networks (FB)
(Figure panels: Modularity, Running time, Number of communities, and NMI over time points for LDF, Oslom, QCA, and Blondel)

Evaluation in Dynamic Networks (arxiv)
(Figure panels: Modularity, Running time, Number of communities, and NMI over time points for LDF, Oslom, QCA, and Blondel)

Incorporate other information
• Social connections
  - Friendship (mutual) relation (Facebook, Google+)
  - Follower (unidirectional) relation (Twitter)

Incorporate other information
• The discussed topics
  - Topics that people in a group are mostly interested in

Incorporate other information
• Social interaction types
  - Wall posts, private or group messages (Facebook)
  - Tweets, retweets (Twitter)
  - Comments

In rich-content social networks
• Not only the topology matters, but also:
• User interests
  - A user may be interested in many communities
• Community interests
  - A community may be interested in different topics

In rich-content social networks
• Communities = “groups of users who are interconnected and communicate on shared topics”
  - Interconnected by social connection and interaction types
• Given a social network with
  - Network topology
  - Users and their social connections and interactions
  - Topics of interest
• How can we find meaningful communities as well as their topics of interest?

Approaches
• Use Bayesian models to extract latent communities
  - Topic User Community Model
    - Posts/tweets can be broadcast
  - Topic User Recipient Community Model
    - Posts/tweets are restricted to desired users only
  - Full Topic User Recipient Community Model
    - A post/tweet generated by a user can be based on multiple topics

Assumptions
• A user can belong to multiple communities
• A community can participate in multiple topics
• For TUCM and TURCM
  - Posts in general discuss one topic only
• Full TURCM
  - Posts can discuss multiple topics

Background
• Multinomial distribution: Mult(.)
  - n trials
  - k possible outcomes with probabilities p_1, p_2, …, p_k that sum to 1
  - X_1, X_2, …, X_k (X_i denotes the number of times outcome i appears in n trials)
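For reference, the probability that the multinomial distribution assigns to a particular vector of counts can be computed directly from the standard formula P(X_1 = x_1, …, X_k = x_k) = n! / (x_1! ⋯ x_k!) · p_1^{x_1} ⋯ p_k^{x_k}. The function name `multinomial_pmf` is a hypothetical helper for illustration, and `math.prod` requires Python 3.8 or later.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing outcome i exactly counts[i] times
    in n = sum(counts) independent trials."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)       # exact integer division at each step
    return coefficient * prod(p ** x for p, x in zip(probs, counts))

# Two fair-coin flips: one head and one tail
p_mixed = multinomial_pmf([1, 1], [0.5, 0.5])                 # 0.5
# Three trials over three outcomes with unequal probabilities
p_three = multinomial_pmf([2, 1, 0], [0.5, 0.25, 0.25])       # 0.1875
```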

Multinomial distribution

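Symmetric Dirichlet Distribution
• $\mathrm{Dir}_K(\alpha)$, where $\alpha = (\alpha_1, \dots, \alpha_K)$, on variables $x_1, x_2, \dots, x_K$ with $x_K = 1 - (x_1 + \dots + x_{K-1})$, has probability density
  $$f(x_1, \dots, x_K; \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$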
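The Dirichlet density is easy to evaluate numerically, which helps when checking a model's priors. The sketch below assumes the standard density Γ(Σ α_i) / Π Γ(α_i) · Π x_i^{α_i − 1} and works in log space to avoid overflow in the gamma functions; the function name `dirichlet_log_pdf` is hypothetical, and every x_i is assumed to be strictly positive.

```python
from math import exp, lgamma, log

def dirichlet_log_pdf(x, alpha):
    """Log density of Dir(alpha) at a point x on the probability simplex."""
    log_norm = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    return log_norm + sum((a - 1) * log(xi) for a, xi in zip(alpha, x))

# alpha = (1, 1, 1) is uniform on the simplex, with constant density Gamma(3) = 2
uniform_density = exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
# K = 2 reduces to a Beta distribution: Beta(2, 2) has density 6x(1 - x)
beta_density = exp(dirichlet_log_pdf([0.5, 0.5], [2.0, 2.0]))
```

In the symmetric case named in the slide title, all α_i share one value, so `alpha` would simply be a list of K equal numbers.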

Notations
• Observation variables
• Latent variables

Notations (cont’d)

Topic User Community Model
• Social Interaction Profile: SIP(u_i)
• The SIP of users is represented as random mixtures over latent community variables
  - Each community is in turn defined as a distribution over the interaction space

Topic User Community Model (steps 1 and 2)

Topic User Community Model (steps 3a and 3b)

TUCM
• Model presentation
• A Bayesian decomposition

TUCM: Parameter Estimation

Topic User Recipient Community
• This model
  - Does not allow mass messaging
  - The sender typically sends out messages to his/her acquaintances
  - The posts are on a topic that both sender and recipient are interested in
• In the same spirit as TUCM
  - Now we have user u_j for all u_j in R_i

TURC

Full TURC Model
• Previous models
  - Assume that each post generated by a user is based on a single topic
• Full TURC
  - Relaxes this requirement
  - Communities now have a higher relationship to authors

Full TURC Model (cont)

Experiments
• Data
  - 6 months of Twitter in 2009
    - 5405 nodes, 13214 edges, 23043 posts
  - Enron email
    - 150 nodes, ~300K emails in total
• Number of communities C = 10
• Number of topics = 20
• Competitor methods: CUT and CART

Results