Comparing Community Detection Algorithms Rutvik Manohar

Background
- Community: “a locally dense connected subgraph in a network”
- Connectedness: each member of a community must be reachable from all other members of the same community
- Density: a node has a higher probability of linking to members of its own community than to nodes outside it

Cliques/Community Properties
- Clique: a complete subgraph, i.e. one with maximum link density
- Internal degree ID(n): number of links from n to nodes within the same community as n
- External degree ED(n): number of links from n to nodes in the rest of the network
- Strong community: a subgraph Gi forms a strong community if ID(n) > ED(n) for each node n ∈ Gi
- Weak community: a subgraph Gi forms a weak community if its total internal degree exceeds its total external degree, i.e. the sum of ID(n) over all n ∈ Gi is greater than the sum of ED(n) over all n ∈ Gi
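
The strong and weak community conditions can be checked directly from an adjacency structure. The following is a minimal sketch, assuming an undirected, unweighted graph stored as a dict of neighbor sets; the function names and the toy graph (two triangles joined by one link) are illustrative and do not come from the slides.

```python
def internal_external_degree(adj, community, n):
    """Return (ID(n), ED(n)) for node n with respect to `community`."""
    internal = sum(1 for m in adj[n] if m in community)
    return internal, len(adj[n]) - internal


def is_strong_community(adj, community):
    # Strong: every single node has more links inside than outside.
    return all(
        i > e
        for i, e in (internal_external_degree(adj, community, n) for n in community)
    )


def is_weak_community(adj, community):
    # Weak: only the totals over the whole subgraph are compared.
    pairs = [internal_external_degree(adj, community, n) for n in community]
    return sum(i for i, _ in pairs) > sum(e for _, e in pairs)


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by link 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

print(is_strong_community(adj, {0, 1, 2}))
print(is_strong_community(adj, {0, 1, 2, 3}))
print(is_weak_community(adj, {0, 1, 2, 3}))
```

In this toy graph the set {0, 1, 2, 3} fails the strong test because node 3 has one internal link and two external ones, yet it still passes the weak test, which is the practical difference between the two definitions.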

Clique Percolation/CFinder Algorithm
- Defines a community as a set of overlapping cliques
- All cliques in the network are found and placed in an Nclique x Nclique overlap matrix O
- Oij: number of nodes shared by cliques i and j
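
To make the overlap matrix concrete, here is a small sketch that enumerates the maximal cliques of a toy graph and fills in Oij. The clique enumeration uses the basic Bron-Kerbosch recursion, which is a choice made for this example and is not named in the slides; the graph and function names are likewise hypothetical. It covers only the matrix construction, not the later percolation step.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored nodes.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques, key=sorted)


def clique_overlap_matrix(cliques):
    # O[i][j] is the number of nodes shared by cliques i and j;
    # the diagonal holds the clique sizes.
    return [[len(a & b) for b in cliques] for a in cliques]


# Hypothetical toy graph: triangles {0, 1, 2} and {1, 2, 3}, plus link 3-4.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}

cliques = maximal_cliques(adj)
overlap = clique_overlap_matrix(cliques)
for row in overlap:
    print(row)
```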

CFinder Algorithm(Output Example)

Hierarchical Clustering
- Agglomerative procedures
  - Ravasz algorithm: nodes with high similarity are merged into the same community
- Divisive procedures
  - Girvan-Newman algorithm: links that connect nodes in different communities are removed, leaving isolated communities

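Ravasz Algorithm
1) Similarity matrix
   - Node pairs within the same community should have a higher similarity than node pairs in different communities
2) Topological overlap matrix
   - Captures the expectation that nodes with a similar network context (shared neighbors) belong to the same community
   - $x_{ij}^{o} = \frac{J(i, j)}{\min(k_i, k_j) + 1 - \Theta(A_{ij})}$, where J(i, j) counts the common neighbors of i and j (plus one if i and j are directly linked), k_i is the degree of node i, and Θ is the step function
   - The average cluster similarity method is used to determine the similarity between communities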
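
The overlap formula can be evaluated node pair by node pair. The sketch below assumes an undirected, unweighted graph stored as a dict of neighbor sets, with J(i, j) counted as the number of common neighbors plus one for a direct link, and Θ(Aij) equal to 1 when i and j are linked and 0 otherwise. The function name and toy graph are illustrative only.

```python
def topological_overlap(adj, i, j):
    """x_ij^o = J(i, j) / (min(k_i, k_j) + 1 - Theta(A_ij))."""
    linked = 1 if j in adj[i] else 0          # Theta(A_ij)
    common = len(adj[i] & adj[j]) + linked    # J(i, j)
    return common / (min(len(adj[i]), len(adj[j])) + 1 - linked)


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by link 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

print(topological_overlap(adj, 0, 1))  # same triangle
print(topological_overlap(adj, 2, 3))  # bridge between the triangles
print(topological_overlap(adj, 0, 4))  # no link, no common neighbor
```

Pairs inside a triangle reach the maximum overlap of 1, the bridge pair scores 1/3, and unrelated pairs score 0, which is the ordering the clustering step relies on.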

Ravasz Algorithm (Clustering)
3) Hierarchical clustering
   - Each node is assigned to its own community and xij is evaluated for all node pairs
   - The two nodes (or communities) with the highest similarity are merged into one community
   - Similarities between the newly formed community and all other communities are computed
   - Repeat until all nodes constitute a single community
4) To extract individual communities/clusters, a dendrogram can be used
   - It indicates the order in which nodes were assigned to communities
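
The merge loop can be sketched in a few lines. This is an illustrative, unoptimized version, assuming an undirected, unweighted graph as a dict of neighbor sets, topological overlap as the node similarity, and average linkage between communities. It recomputes every similarity on each round instead of updating a matrix, so it does not reflect the efficiency of a real implementation; all names are hypothetical.

```python
from itertools import combinations


def topological_overlap(adj, i, j):
    linked = 1 if j in adj[i] else 0
    common = len(adj[i] & adj[j]) + linked
    return common / (min(len(adj[i]), len(adj[j])) + 1 - linked)


def average_similarity(adj, a, b):
    # Average cluster similarity: mean overlap over all cross-community pairs.
    total = sum(topological_overlap(adj, i, j) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative_merges(adj):
    """Return the merge order, which is the information a dendrogram encodes."""
    clusters = [frozenset([n]) for n in sorted(adj)]
    merges = []
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: average_similarity(adj, pair[0], pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b)))
    return merges


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by link 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

merges = agglomerative_merges(adj)
for a, b in merges:
    print(sorted(a), "+", sorted(b))
```

Cutting the merge sequence just before its last step leaves the two triangles as separate communities, which is how a cut through the dendrogram yields a partition.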

Ravasz Algorithm - Computational Complexity
- Computing the similarity matrix requires evaluating O(N^2) node pairs
- Computing the distance of each new cluster to all other clusters: O(N^2)
- Construction of the dendrogram: O(N log N)
- Overall complexity: O(N^2)

Girvan-Newman Algorithm
1) Centrality (xij)
   - Measure of whether nodes i and j belong to different communities
   - High if nodes i and j are from different communities and low if they are in the same community
   - Link betweenness: xij is defined as the number of shortest paths that include the link e = (i, j)
2) Hierarchical clustering
   - The centrality of each link e = (i, j) is computed
   - The link with the highest centrality is removed
   - Centralities are recomputed for all remaining links in the network
   - Repeat until all links are removed
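
The procedure can be illustrated with a compact sketch for an undirected, unweighted graph stored as a dict of neighbor sets. Link betweenness is computed here with a breadth-first search from every node and a backward accumulation pass (a Brandes-style scheme chosen for this example, not stated in the slides). When several shortest paths tie, each link receives a fractional share. Function names and the toy graph are hypothetical.

```python
from collections import deque


def link_betweenness(adj):
    """Shortest-path count through each link (fractional when paths tie)."""
    score = {}
    for s in adj:
        # Forward pass: BFS from s, counting shortest paths to each node.
        dist, sigma, order = {s: 0}, {s: 1.0}, []
        preds = {n: [] for n in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward pass: push path shares from the farthest nodes toward s.
        delta = {n: 0.0 for n in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = tuple(sorted((v, w)))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    # Every node pair was counted once from each endpoint.
    return {e: x / 2.0 for e, x in score.items()}


def removal_order(adj):
    """Remove the highest-betweenness link, recompute, repeat until no links remain."""
    adj = {n: set(nb) for n, nb in adj.items()}  # work on a copy
    removed = []
    while any(adj.values()):
        scores = link_betweenness(adj)
        u, v = max(sorted(scores), key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append((u, v))
    return removed


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by link 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

removed = removal_order(adj)
print(removed[0])
```

The bridge 2-3 carries all nine shortest paths between the two triangles, so it is removed first and the graph splits into the two expected communities.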

Girvan-Newman Algorithm - Computational Complexity
- Link betweenness: O(LN)
- Recomputing the centralities after each link removal: additional factor of L
- Overall complexity: O(L^2 N)

Facebook Dataset
- Circles (friend lists) from Facebook
- Anonymized data: circles, ego networks, node features
Statistic                               Value
Nodes                                   4039
Edges                                   88234
Nodes in largest WCC                    4039
Edges in largest WCC                    88234
Nodes in largest SCC                    4039
Edges in largest SCC                    88234
Average clustering coefficient          0.6055
Number of triangles                     1612010
Fraction of closed triangles            0.2647
Diameter                                8
90th-percentile effective diameter      4.7

Twitter Dataset
- Circles (lists) from Twitter
- Crawled from public resources
- Data types: node features, circles, ego networks
Statistic                               Value
Nodes                                   81306
Edges                                   1768149
Nodes in largest WCC                    81306
Edges in largest WCC                    1768149
Nodes in largest SCC                    68413
Edges in largest SCC                    1685163
Average clustering coefficient          0.5653
Number of triangles                     13082506
Fraction of closed triangles            0.06415
Diameter                                7
90th-percentile effective diameter      4.5

Bitcoin OTC Dataset
- User trust records from Bitcoin OTC
- Rankings of Bitcoin OTC members
- Flat-file data: ID of source, ID of target, rating of the target, time of the provided rating
Statistic                               Value
Nodes                                   5881
Edges                                   35592
Range of edge weight                    -10 to +10
Percentage of positive edges            89%
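
A flat file with these four fields can be loaded into weighted, signed edges with the standard library. The sketch below assumes comma-separated rows in the order source, target, rating, time, and treats the time as an opaque number. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import namedtuple

Rating = namedtuple("Rating", "source target rating time")


def read_ratings(handle):
    """Parse rows of the form: source,target,rating,time."""
    return [
        Rating(int(s), int(t), int(r), float(ts))
        for s, t, r, ts in csv.reader(handle)
    ]


def positive_fraction(ratings):
    # Share of edges whose trust rating is above zero.
    return sum(1 for r in ratings if r.rating > 0) / len(ratings)


# Invented sample rows, standing in for the real file.
sample = io.StringIO(
    "1,2,8,1300000000\n"
    "2,3,-10,1300000100\n"
    "3,1,1,1300000200\n"
    "1,4,5,1300000300\n"
)

ratings = read_ratings(sample)
print(len(ratings), positive_fraction(ratings))
```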

Ravasz Algorithm Output

Girvan Newman Algorithm Output

Facebook Dataset Execution Times
File          Agglomerative              Divisive
0.edges       1.006727933883667 s        6.437301635742188e-05 s
107.edges     >5 s                       0.0003058910369873047 s
1684.edges    >5 s                       0.006442546844482422 s
1912.edges    >5 s                       0.0024690628051757812 s
3437.edges    3.843674898147583 s        0.009980440139770508 s
348.edges     1.5739598274230957 s       0.0017979145050048828 s
3980.edges    0.011324167251586914 s     0.0014109611511230469 s
414.edges     0.41009020805358887 s      6.246566772460938e-05 s
686.edges     0.36986780166625977 s      0.0002701282501220703 s

Twitter Dataset Execution Times
File               Agglomerative              Divisive
100318079.edges    2.601630926132202 s        8.797645568847656e-05 s
10146102.edges     0.030692577362060547 s     0.0008788108825683594 s
101859065.edges    0.0016021728515625 s       0.00021338462829589844 s
101903164.edges    0.0014188289642333984 s    9.012222290039062e-05 s
102765423.edges    0.036995649337768555 s     0.00014328956604003906 s
102903198.edges    0.8340291976928711 s       0.0001735687255859375 s
103431502.edges    0.009302139282226562 s     0.0005862712860107422 s
103865085.edges    0.19801998138427734 s      9.012222290039062e-05 s
103991905.edges    0.0076751708984375 s       0.0003268718719482422 s

Bitcoin OTC Dataset Execution Times
Agglomerative    Divisive
>20 s            4.8160552978515625e-05 s

Modularity
- Measures the quality of each community partition
- Consider a network with L links and N nodes, partitioned into nc communities, where community Cc has Nc nodes and Lc links. If Lc exceeds the expected number of links between the Nc nodes, the nodes of the subgraph Cc can be part of a community
- The modularity of Cc is the difference between the network's wiring diagram (Aij) and the expected number of links between nodes i and j (pij), summed over all node pairs (i, j) in Cc:
  $M_c = \frac{1}{2L} \sum_{(i, j) \in C_c} (A_{ij} - p_{ij})$
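
A direct evaluation of Mc needs a concrete choice for pij. The sketch below assumes the common degree-preserving null model, pij = ki kj / (2L), which the slide does not state explicitly, and an undirected, unweighted graph stored as a dict of neighbor sets. The partition modularity is taken as the sum of Mc over communities. Names and the toy graph are illustrative.

```python
def community_modularity(adj, community):
    """M_c = 1/(2L) * sum over (i, j) in C_c of (A_ij - p_ij)."""
    two_l = sum(len(nb) for nb in adj.values())  # 2L: each link counted twice
    total = 0.0
    for i in community:
        for j in community:
            a_ij = 1 if j in adj[i] else 0
            p_ij = len(adj[i]) * len(adj[j]) / two_l  # assumed null model
            total += a_ij - p_ij
    return total / two_l


def modularity(adj, partition):
    # Partition quality: sum of the per-community contributions.
    return sum(community_modularity(adj, c) for c in partition)


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by link 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

print(modularity(adj, [{0, 1, 2}, {3, 4, 5}]))
print(modularity(adj, [set(adj)]))
```

Splitting the graph into its two triangles gives a modularity of 5/14, while placing every node in one community gives 0, so the partition that matches the visible structure scores higher.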

Louvain Algorithm
- Modularity optimization algorithm
1) For a weighted network of N nodes, place each node in its own community. For each node n, evaluate the increase in modularity obtained by placing n in the same community as each of its neighbors j. The placement with the highest positive modularity increase is selected.
2) A new network is constructed whose nodes are the communities found in step 1.
3) Repeat steps 1-2 until the maximum modularity is reached.
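
Step 1 can be sketched as a local-moving loop. This illustration makes several simplifying assumptions: the graph is unweighted, modularity uses the null model pij = ki kj / (2L), and every trial move recomputes the full modularity instead of using an incremental gain formula, so it shows the logic but not the efficiency of the real algorithm. Step 2 (building the community network) is not included. All names are hypothetical.

```python
def modularity(adj, label):
    """M = sum over communities of L_c / L - (k_c / 2L)^2."""
    two_l = sum(len(nb) for nb in adj.values())
    inside, degree = {}, {}
    for i in adj:
        c = label[i]
        degree[c] = degree.get(c, 0) + len(adj[i])
        # Internal links are counted from both endpoints, giving 2 * L_c.
        inside[c] = inside.get(c, 0) + sum(1 for j in adj[i] if label[j] == c)
    return sum(inside[c] / two_l - (degree[c] / two_l) ** 2 for c in degree)


def local_moving(adj):
    label = {n: n for n in adj}  # every node starts in its own community
    improved = True
    while improved:
        improved = False
        for n in sorted(adj):
            best_label, best_q = label[n], modularity(adj, label)
            for j in sorted(adj[n]):
                trial = dict(label)
                trial[n] = label[j]  # try joining neighbor j's community
                q = modularity(adj, trial)
                if q > best_q + 1e-12:  # accept only strict improvements
                    best_label, best_q = label[j], q
            if best_label != label[n]:
                label[n] = best_label
                improved = True
    return label


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by link 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

label = local_moving(adj)
print(label)
```

Because only strictly improving moves are accepted, the loop terminates once no single node can raise the modularity by changing community.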

Louvain Algorithm - Computational Complexity
- The number of computations is directly proportional to L for the most expensive pass
- Worst-case complexity: O(L)
- Can scale to networks with millions of nodes

Louvain Algorithm Output(Example)

Infomap Algorithm
- Data compression algorithm that can identify communities
- Optimizes a quality function known as the map equation
- Input: network N partitioned into nc communities
- Coding:
  1) One code is assigned to each community
  2) Codewords are assigned to the nodes within each community
  3) An exit code indicates when a random walker leaves the community
- Algorithm preprocessing:
  - Each node is assigned a code (Huffman encoding)
  - A random walk in the network is represented by a two-level encoding

Infomap Algorithm
1) Optimal code: the minimum of the map equation L
2) Map equation: $L = q H(Q) + \sum_{c=1}^{n_c} p_{\circlearrowright}^{c} H(P_c)$
   1) $q H(Q)$: average number of bits necessary to describe movement between communities
   2) $q$: probability that a random walker switches communities during a given step
   3) $\sum_{c=1}^{n_c} p_{\circlearrowright}^{c} H(P_c)$: average number of bits necessary to describe movement within communities
   4) $H(P_c)$: entropy of the random walk within community c
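
For a given partition, the two-term description length L = q H(Q) + Σc pc H(Pc) can be evaluated numerically. The sketch below assumes an undirected, unweighted graph, on which a random walker visits node n with rate kn / (2L) and leaves community c with rate (links leaving c) / (2L). The weight pc of each community is taken as its exit rate plus the visit rates of its nodes. These modeling choices and all names are assumptions made for this example.

```python
from math import log2


def entropy(rates):
    """Shannon entropy (bits) of the rates after normalizing them to sum to 1."""
    total = sum(rates)
    return -sum(r / total * log2(r / total) for r in rates if r > 0)


def map_equation(adj, partition):
    two_l = sum(len(nb) for nb in adj.values())
    exit_rates, within = [], 0.0
    for community in partition:
        visit = [len(adj[n]) / two_l for n in community]
        # Rate at which the walker steps out of this community.
        q_c = sum(1 for n in community for m in adj[n] if m not in community) / two_l
        exit_rates.append(q_c)
        p_c = q_c + sum(visit)
        within += p_c * entropy([q_c] + visit)  # p_c * H(P_c)
    q = sum(exit_rates)
    between = q * entropy(exit_rates) if q > 0 else 0.0  # q * H(Q)
    return between + within


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by link 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

one_module = map_equation(adj, [set(adj)])
two_modules = map_equation(adj, [{0, 1, 2}, {3, 4, 5}])
print(one_module, two_modules)
```

The two-triangle partition yields a shorter description than the single-community partition, which is the sense in which minimizing L identifies communities.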

Infomap Algorithm Steps
1) Each node is assigned to its own community
2) Neighboring nodes are joined into communities if the change decreases the map equation L
3) The communities resulting from a single pass are merged into supercommunities, which are then passed into the next pass of the algorithm
4) L is minimized over all partitions

Infomap Algorithm - Computational Complexity
- Determined by the minimization of the map equation L
- Dense graphs: O(L log L)
- Sparse graphs: O(N log N)

Infomap Algorithm

Infomap Output(Example)

Facebook Dataset Execution Times
File          Louvain                    Infomap                   CFinder
0.edges       0.4128847122192383 s       0.1195363998413086 s      0.984748363494873 s
107.edges     1.3428726196289062 s       0.2625303268432617 s      >5 s
1684.edges    1.7239031791687012 s       0.2573084831237793 s      >5 s
1912.edges    1.4437658786773682 s       0.3103940486907959 s      >5 s
3437.edges    0.6212575435638428 s       0.2569918632507324 s      4.366451025009155 s
348.edges     0.24696922302246094 s      0.09439373016357422 s     >5 s
3980.edges    0.039075374603271484 s     0.1972367763519287 s      0.01833057403564453 s
414.edges     0.12355947494506836 s      0.15025591850280762 s     1.8975183963775635 s
686.edges     0.17152166366577148 s      0.1234748363494873 s      0.47782254219055176 s

Twitter Dataset Execution Times
File               Louvain                    CFinder                    Infomap
100318079.edges    0.31006526947021484 s      >10 s                      >5 s
10146102.edges     0.08896231651306152 s      0.056291818618774414 s     >5 s
101859065.edges    0.011868715286254883 s     0.011034727096557617 s     >5 s
101903164.edges    0.004703998565673828 s     0.006857395172119141 s     >5 s
102765423.edges    0.09036827087402344 s      0.10010647773742676 s      >5 s
102903198.edges    0.25397682189941406 s      >10 s                      >5 s
103431502.edges    0.06458067893981934 s      0.04571080207824707 s      >5 s
103865085.edges    0.13692879676818848 s      >10 s                      >5 s
103991905.edges    0.05409431457519531 s      0.046225786209106445 s     >5 s

Bitcoin OTC Execution Times
Louvain                    Infomap                  CFinder
7.772445678710938e-05 s    0.39581751823425293 s    37.318554639816284 s

Further Applications
- WBSN (web-based social networks)
- TidalTrust algorithm
  - Each user has a set of priorities that indicate levels of trust in information sources
  - A modified BFS is run from the source node
  - Ratings of the sink and a weighted average of trust values are computed
  - The degree of trust between the source node and the sink node is computed
- Communities of trust: Louvain/Infomap algorithms