Community Detection: Modularity and Trawling (CS224W)

  • Slides: 43

Community Detection: Modularity and Trawling
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Network Communities
• Communities: sets of tightly connected nodes
• Define: Modularity Q
  § A measure of how well a network is partitioned into communities
  § Given a partitioning of the network into groups s ∈ S:
    Q ∝ ∑_{s ∈ S} [ (# edges within group s) – (expected # edges within group s) ]
• Need a null model!
12/5/2020 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Null Model: Configuration Model
(Slide equations and the figure showing nodes i and j were not recovered.)
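
The formulas on this slide did not survive extraction. As a hedged reconstruction, the standard configuration-model argument (not necessarily in the slide's exact notation) runs as follows, assuming an undirected graph with m edges and node degrees k_i:

```latex
% Configuration model: cut every edge into two "stubs" (2m stubs in total) and
% rewire them uniformly at random, so each node keeps its degree k_i.
\[
  \mathbb{E}\big[\#\text{ edges between } i \text{ and } j\big]
  \;=\; k_i \cdot \frac{k_j}{2m}
  \;=\; \frac{k_i k_j}{2m}
\]
% Sanity check: summing over all ordered pairs (i, j) gives 2m,
% the same total as \sum_{ij} A_{ij} in the real graph.
\[
  \sum_{i}\sum_{j} \frac{k_i k_j}{2m}
  \;=\; \frac{1}{2m}\Big(\sum_i k_i\Big)\Big(\sum_j k_j\Big)
  \;=\; \frac{(2m)(2m)}{2m}
  \;=\; 2m
\]
```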

Modularity
• (Modularity formula not recovered.)
• Normalizing constant: -1 < Q < 1
• Aij = 1 if i and j are connected, 0 else
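
A minimal sketch of the modularity computation, assuming the standard form Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j) for an undirected graph. The slide's own formula was not recovered, and the function name `modularity` and the example graph are illustrative only.

```python
import numpy as np

def modularity(A, labels):
    """Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) * [labels_i == labels_j]."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                            # node degrees
    two_m = k.sum()                              # 2m = sum of all degrees
    same = labels[:, None] == labels[None, :]    # delta(c_i, c_j)
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

q_good = modularity(A, [0, 0, 0, 1, 1, 1])   # split along the bridge
q_one = modularity(A, [0, 0, 0, 0, 0, 0])    # everything in a single group
```

Putting every node in one group gives Q = 0 under this null model, which is why a positive Q signals more within-group edges than expected.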

Modularity: Number of clusters
• Modularity is useful for selecting the number of clusters
(Figure: plot of modularity Q; not recovered.)
• Why not optimize modularity directly?

Method 2: Modularity Optimization
• (Slide equations not recovered.)
• Indicator term: 1 if si = sj, 0 else
Note: Rewrite the equation as two separate summations – one over A, one over ki*kj

Modularity Matrix
• (Slide equations not recovered.)
• Note: each row/column of B sums to 0
Presenter notes: Why is the sum of B_ij s_i s_j an s^T B s product? Explain. What do we mean by rewriting Q in terms of eigenvalues and eigenvectors? Give the basic definition of the eigendecomposition.
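
A hedged sketch of the modularity matrix, assuming the standard definition B_ij = A_ij − k_i k_j / 2m and a two-group vector s with s_i = ±1, so that Q = (1/4m) sᵀ B s. The slide's own equations were not recovered; the example graph and variable names are hypothetical.

```python
import numpy as np

def modularity_matrix(A):
    """B_ij = A_ij - k_i * k_j / (2m) for an undirected graph."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    return A - np.outer(k, k) / k.sum()

# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

B = modularity_matrix(A)
row_sums = B.sum(axis=1)                  # each row (and column) sums to 0

s = np.array([1, 1, 1, -1, -1, -1])       # group membership as +1 / -1
two_m = A.sum()                           # sum of A_ij equals 2m
q_quadratic = s @ B @ s / (2 * two_m)     # Q = s^T B s / 4m

# The same value from the pairwise definition: only same-group pairs count.
same = (np.outer(s, s) + 1) / 2           # 1 if s_i == s_j, else 0
q_direct = (B * same).sum() / two_m
```

The two values agree because s_i s_j = 2δ(s_i, s_j) − 1 and the entries of B sum to zero, which is what turns the double sum into a quadratic form.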

Modularity Optimization
• (Slide equations not recovered.)
Presenter note: Why is making s parallel to u_1 the right thing to do?

Finding Vector s
• (Slide equations not recovered.)
Presenter notes: Explain the approximation – we only consider the first term in the summation. Explain the intuition that we are making s approximately parallel to u_1.
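
The presenter notes ask why only the first term is kept and why s should be parallel to u_1. A hedged sketch of the standard argument follows. It assumes Q = (1/4m) sᵀ B s with s_i = ±1 and an eigendecomposition of B with orthonormal eigenvectors u_i and eigenvalues λ_1 ≥ λ_2 ≥ …; the slide's own equations were not recovered.

```latex
\[
  Q \;=\; \frac{1}{4m}\, s^{\top} B\, s
    \;=\; \frac{1}{4m}\, s^{\top}\Big(\sum_{i=1}^{n} \lambda_i\, u_i u_i^{\top}\Big) s
    \;=\; \frac{1}{4m} \sum_{i=1}^{n} \lambda_i \big(u_i^{\top} s\big)^{2}
\]
% Approximation: keep only the term with the largest eigenvalue, i = 1.
\[
  Q \;\approx\; \frac{\lambda_1}{4m}\,\big(u_1^{\top} s\big)^{2},
  \qquad
  u_1^{\top} s \;=\; \sum_{j} u_{1j}\, s_j \;\le\; \sum_{j} \lvert u_{1j} \rvert
\]
% With s_j restricted to +1 / -1, the bound is attained by matching signs:
\[
  s_j \;=\;
  \begin{cases}
    +1 & \text{if } u_{1j} \ge 0,\\
    -1 & \text{if } u_{1j} < 0.
  \end{cases}
\]
```

Without the ±1 constraint, choosing s proportional to u_1 would maximize the leading term exactly; the constraint only allows s to be as close to parallel to u_1 as possible.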

Summary: Modularity Optimization
• Fast Modularity Optimization Algorithm:
  § Find leading eigenvector u1 of modularity matrix B
  § Divide the nodes by the signs of the elements of u1
  § Repeat hierarchically until:
    § If a proposed split does not cause modularity to increase, declare community indivisible and do not split it
    § If all communities are indivisible, stop
• How to find u1? Power method!
  § Start with random v(1), repeat the update step (formula not recovered)
  § When converged (v(t) ≈ v(t+1)), set u1 = v(t)
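
The steps above can be sketched for a single bisection as follows. This is a minimal illustration, not the lecture's implementation: it assumes B_ij = A_ij − k_i k_j / 2m and the update v(t+1) = Mv(t) / ‖Mv(t)‖, and it adds a diagonal shift (my assumption, not from the slide) so that the power method converges to the most positive eigenvalue rather than the largest in magnitude. All function names are hypothetical.

```python
import numpy as np

def modularity_matrix(A):
    k = A.sum(axis=1)
    return A - np.outer(k, k) / k.sum()

def leading_eigenvector(B, iters=1000, tol=1e-10, seed=0):
    """Power method on B + shift*I (same eigenvectors, eigenvalues >= 0)."""
    rng = np.random.default_rng(seed)
    shift = np.abs(B).sum(axis=1).max()      # Gershgorin bound on |eigenvalues|
    M = B + shift * np.eye(len(B))
    v = rng.normal(size=len(B))              # random start v(1)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        w /= np.linalg.norm(w)
        converged = np.linalg.norm(w - v) < tol   # v(t) ~ v(t+1)
        v = w
        if converged:
            break
    return v

def spectral_bisection(A):
    """Split nodes by the signs of u1 and report the resulting modularity."""
    B = modularity_matrix(A)
    u1 = leading_eigenvector(B)
    s = np.where(u1 >= 0, 1, -1)
    q = float(s @ B @ s) / (2 * A.sum())     # Q = s^T B s / 4m, A.sum() = 2m
    return s, q

# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

s, q = spectral_bisection(A)
```

A full implementation would accept the split only if q > 0 and then recurse on each side, which requires a modularity matrix adjusted for the subgraph; that step is omitted here.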

Additional Heuristic Approaches
Note: Skip this slide!
• (1) Greedy post-processing:
  § Start with nodes in two groups, s
  § Repeat t = 1..n until all nodes have been moved:
    § For i = 1..n:
      § Consider moving node i, compute new Qt(si)
      § Do this for every not-yet-moved node, pick the state that maximizes Qt
    § Move node j that hasn't yet been moved and that maximizes Qt(sj)
    § Note that Qt can decrease with time t
  § Once iteration is complete, find intermediate state t with highest Qt
  § Start from this state and repeat until Q stops increasing
(Figure: example on nodes 1, 2, 3, 5, 6, 7. Start; move best not-yet-moved node (3), store Q1; move best not-yet-moved node (5), store Q2.)
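
A hedged sketch of the greedy post-processing pass described above, assuming two groups encoded as s_i = ±1 and Q = sᵀ B s / 4m. It recomputes Q from scratch for every trial move, which is simple but slow; the function names and the example graph are hypothetical.

```python
import numpy as np

def two_way_q(B, s, two_m):
    return float(s @ B @ s) / (2 * two_m)        # Q = s^T B s / 4m

def greedy_refine(A, s):
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    two_m = k.sum()
    B = A - np.outer(k, k) / two_m
    s = np.array(s, dtype=float)
    best_q = two_way_q(B, s, two_m)
    while True:
        cur = s.copy()
        unmoved = set(range(len(s)))
        states = []                              # (Q_t, state) after each move
        while unmoved:
            trial = {}
            for j in unmoved:                    # try flipping each unmoved node
                cur[j] = -cur[j]
                trial[j] = two_way_q(B, cur, two_m)
                cur[j] = -cur[j]
            j_best = max(sorted(trial), key=trial.get)
            cur[j_best] = -cur[j_best]           # move it even if Q decreases
            unmoved.remove(j_best)
            states.append((trial[j_best], cur.copy()))
        q_pass, s_pass = max(states, key=lambda st: st[0])
        if q_pass <= best_q + 1e-12:             # no intermediate state improved Q
            return s.astype(int), best_q
        s, best_q = s_pass, q_pass               # restart from the best state

# Hypothetical example: two triangles joined by edge 2-3, node 2 misplaced.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

s_refined, q_refined = greedy_refine(A, [1, 1, -1, -1, -1, -1])
```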

Additional Heuristic Approaches
• (Slide body not recovered.)
Note: Skip! Too many details and not enough time to explain the updates to the modularity matrix.

Modularity Optimization Methods
(Comparison table not recovered; one of its columns is labeled "Fast modularity".)
• GN = Betweenness centrality, O(n^3)
• CNM = Clauset-Newman-Moore, O(n log^2 n)
• DA = Extremal optimization, O(n^2 log^2 n)
Note: Cut out the CNM and DA columns.

Summary: Modularity
• Girvan-Newman (previous lecture):
  § Based on the "strength of weak ties"
  § Remove edge of highest betweenness
• Modularity:
  § Overall quality of the partitioning of a graph
  § Use to determine the number of communities
• Fast modularity optimization:
  § Transform the modularity optimization into an eigenvalue problem
• Clauset-Newman-Moore:
  § Agglomerative clustering based on modularity

Trawling for Web Communities


Method 3: Trawling [Kumar et al. '99]
• Searching for small communities in the Web graph
• What is the signature of a community / discussion in a Web graph?
  § Dense 2-layer graph
  § Intuition: Many people all talking about the same things
• Use this to define "topics": what the same people on the left talk about on the right
• Remember HITS!

Searching for Small Communities
• A more well-defined problem: Enumerate complete bipartite subgraphs Ks,t
  § Where Ks,t: s nodes on the "left" where each links to the same t other nodes on the "right"
(Figure: K3,4 with left side X, |X| = s = 3, and right side Y, |Y| = t = 4; fully connected.)

The Plan: (1), (2) and (3) [Kumar et al. '99]
• Two points:
  § (1) Dense bipartite graph: the signature of a community/discussion
  § (2) Complete bipartite subgraph Ks,t
    § Ks,t = graph on s nodes, each links to the same t other nodes
• Plan:
  § (A) From (2) get back to (1):
    § Via: Any dense enough graph contains a smaller Ks,t as a subgraph
  § (B) How do we solve (2) in a giant graph?
    § What similar problems were solved on big non-graph data?
    § (3) Frequent itemset enumeration [Agrawal-Srikant '99]

Frequent Itemset Enumeration [Agrawal-Srikant '99]
• Market basket analysis:
  § What items are bought together in a store?
• Setting:
  § Market: Universe U of n items (products sold in a store)
  § Baskets: m subsets of U: S1, S2, …, Sm ⊆ U (Si is a set of items one person bought)
  § Support: Frequency threshold f
• Goal:
  § Find all subsets T s.t. T ⊆ Si for at least f sets Si (items in T were bought together at least f times)

Frequent Itemsets: Example
• Given:
  § Universe of items: U = {1, 2, 3, 4, 5}
  § Market baskets: S1 = {1, 3, 5}, S2 = {2, 3, 4}, S3 = {2, 4, 5}, S4 = {3, 4, 5}, S5 = {1, 3, 4, 5}, S6 = {2, 3, 4, 5}
    § Support of T = {2, 3} is 2 (T appears in S2 and S6)
  § Minimum support: f = 3
  § Goal: Find all sets T that appear in at least f of the Si's
  § Call such itemsets T frequent itemsets (they have support ≥ f)
• Algorithm: Build the lists bottom-up
  § Insight: For a frequent set of size k, all its subsets are also frequent
  § If T = {3, 4, 5} is frequent, then {3, 4}, {3, 5}, {4, 5} must also be frequent!
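
A small sketch that checks the numbers in this example. The baskets and threshold come from the slide; the `support` helper is an illustrative name.

```python
from itertools import combinations

baskets = {
    "S1": {1, 3, 5}, "S2": {2, 3, 4}, "S3": {2, 4, 5},
    "S4": {3, 4, 5}, "S5": {1, 3, 4, 5}, "S6": {2, 3, 4, 5},
}

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets.values() if items <= basket)

support_23 = support({2, 3}, baskets)        # contained in S2 and S6 only
support_345 = support({3, 4, 5}, baskets)    # contained in S4, S5 and S6

# The bottom-up insight: every subset of a frequent set is at least as frequent.
pair_supports = {pair: support(pair, baskets) for pair in combinations((3, 4, 5), 2)}
```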

Example: the Apriori Algorithm [Agrawal-Srikant '99]
• Setting:
  § U = {1, 2, 3, 4, 5}, f = 3
  § S1 = {1, 3, 5}, S2 = {2, 3, 4}, S3 = {2, 4, 5}, S4 = {3, 4, 5}, S5 = {1, 3, 4, 5}, S6 = {2, 3, 4, 5}
• 2 steps: 1) Candidate generation 2) Pruning

Itemset size   Itemsets
1              {1} {2} {3} {4} {5}
2              {2, 3} {2, 4} {2, 5} {3, 4} {3, 5} {4, 5}
3              {2, 3, 4} {3, 4, 5}
4              {}

The Apriori Algorithm [Agrawal-Srikant '99]
• For i = 1, …, k
  § Generate all sets of size i by composing sets of size i-1 that differ in 1 element
  § Prune the sets of size i with support < f
• Open question:
  § Efficiently find only maximal frequent sets
• What's the connection between the itemsets and complete bipartite graphs?
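
The two steps (candidate generation, then pruning) can be sketched as below. This is a compact illustration under my own choices, not the original algorithm's exact candidate join: candidates are unions of two frequent sets that differ in one element, and they are pruned first by the "all subsets frequent" rule and then by support. The function name `apriori` is illustrative.

```python
from itertools import combinations

def apriori(baskets, f):
    """Return {frequent itemset: support} for minimum support f."""
    baskets = [frozenset(b) for b in baskets]

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b)

    items = sorted(set().union(*baskets))
    level = [frozenset([x]) for x in items if support(frozenset([x])) >= f]
    frequent = {}
    size = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        size += 1
        # Candidate generation: join frequent sets that differ in one element.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Pruning: drop candidates with an infrequent subset or low support.
        previous = set(level)
        level = [
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, size - 1))
            and support(c) >= f
        ]
    return frequent

baskets = [{1, 3, 5}, {2, 3, 4}, {2, 4, 5}, {3, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}]
frequent = apriori(baskets, f=3)
```

On the slide's example this keeps {2}, {3}, {4}, {5}, then {2, 4}, {3, 4}, {3, 5}, {4, 5}, and finally {3, 4, 5}; no set of size 4 survives.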

From Itemsets to Bipartite Ks,t [Kumar et al. '99]
• Itemset enumeration finds complete bipartite graphs
• How?
  § View each node i as a set Si of nodes i points to
  § Ks,t = a set Y of size t that occurs in s sets Si
  § Looking for Ks,t: set the frequency threshold to s and look at layer t (all frequent sets of size t)
  § s … minimum support (|X| = s)
  § t … itemset size
(Figure: node i points to a, b, c, d, so Si = {a, b, c, d}; nodes i, j, k on the left (X) all point to a, b, c, d on the right (Y).)
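
A minimal sketch of this reduction, assuming the graph is given as a dictionary from each node to the set of nodes it points to. It enumerates every size-t subset of each Si directly, which is only practical for small out-degrees; the function name `find_kst` and the example graph (including node l) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_kst(out_links, s, t):
    """Map each right-hand side Y (|Y| = t) to its left-hand nodes X, keeping |X| >= s."""
    supporters = defaultdict(set)
    for node, targets in out_links.items():
        # Node i "buys" the basket Si; it supports every size-t subset of Si.
        for right_side in combinations(sorted(targets), t):
            supporters[right_side].add(node)
    return {y: x for y, x in supporters.items() if len(x) >= s}

out_links = {
    "i": {"a", "b", "c", "d"},
    "j": {"a", "b", "c", "d"},
    "k": {"a", "b", "c", "d"},
    "l": {"a", "e"},            # extra node that is not part of the K3,4
}
found = find_kst(out_links, s=3, t=4)
```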

From Ks,t to Communities
(Slide body not recovered.)

*****END*****
Presenter notes:
• END OF TIME – had 5 min left to cover the competition results
• Itemsets caught people's attention
• Need to find a better visual way to explain how itemsets find bipartite graphs

Proof: Ks,t and Communities
(Figure: plot of a function f(x) against x; equations not recovered.)

Nodes and Buckets
• Consider node i of degree ki and neighbor set Si
• Put node i in buckets for all size t subsets of i's neighbors
  § Buckets are the potential right-hand sides of Ks,t (i.e., all size t subsets of Si)
  § As soon as s nodes appear in a bucket we have a Ks,t
(Figure: node i with neighbors a, b, c, d is placed in buckets (a, b), (a, c), (a, d), (b, c), ….)

Nodes and Buckets
• (Equations not recovered.)
• (ki choose t) = # of ways to select t elements out of ki
• ki … degree of node i
• By convexity (ki > t): (inequality not recovered)

Nodes and Buckets
• Plug in: (equations not recovered)

And We are Done!
• We have: Total height of all buckets: (formula not recovered)
• How many buckets are there? (formula not recovered)
• What is the average height of buckets? (formula not recovered)
  § So, avg. bucket height ≥ s
• By the pigeonhole principle, there must be at least one bucket with at least s nodes in it.
• We found a Ks,t
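
The formulas behind this argument were lost in extraction. The block below is a hedged reconstruction of the standard counting proof, not a copy of the slides. It assumes the statement being proved is: a bipartite graph G = (X, Y, E) with |X| = |Y| = n and average degree k̄ = s^{1/t} n^{1−1/t} + t contains a Ks,t. It also assumes every k_i ≥ t, as the "By convexity (ki > t)" remark suggests.

```latex
% "Height" of a bucket = number of nodes placed in it.
\begin{align*}
  \text{total height of all buckets}
    &= \sum_{i=1}^{n} \binom{k_i}{t}
     \;\ge\; n \binom{\bar{k}}{t}
     && \text{(convexity of } x \mapsto \tbinom{x}{t}\text{)} \\
    &= n\,\frac{\bar{k}(\bar{k}-1)\cdots(\bar{k}-t+1)}{t!}
     \;\ge\; n\,\frac{(\bar{k}-t)^{t}}{t!} \\
    &= n\,\frac{\big(s^{1/t}\, n^{1-1/t}\big)^{t}}{t!}
     \;=\; \frac{s\, n^{t}}{t!}
     && \text{(plug in } \bar{k}\text{)} \\[4pt]
  \text{number of buckets}
    &= \binom{n}{t} \;\le\; \frac{n^{t}}{t!} \\[4pt]
  \text{average bucket height}
    &\ge \frac{s\, n^{t} / t!}{n^{t} / t!} \;=\; s
\end{align*}
```

Since the average height is at least s, some bucket holds at least s nodes, and those s nodes together with the bucket's t right-hand nodes form a Ks,t.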

Method 3: Trawling – Summary [Kumar et al. '99]
• Analytical result:
  § Complete bipartite subgraphs Ks,t are embedded in larger dense enough graphs (i.e., the communities)
  § Bipartite subgraphs act as "signatures" of communities
• Algorithmic result:
  § Frequent itemset extraction and dynamic programming find graphs Ks,t
  § Method is super scalable

Spectral Graph Partitioning


Method 4: Graph Partitioning
• Undirected graph G(V, E)
(Figure: example graph on nodes 1 to 6.)
• Bi-partitioning task:
  § Divide vertices into two disjoint groups (A, B)
(Figure: the same six nodes divided into groups A and B.)
• Questions:
  § How can we define a "good" partition of G?
  § How can we efficiently identify such a partition?
11/8/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Graph Partitioning
• What makes a good partition?
  § Maximize the number of within-group connections
  § Minimize the number of between-group connections
(Figure: nodes 1 to 6 divided into groups A and B.)

Graph Cuts
• Express partitioning objectives as a function of the "edge cut" of the partition
• Cut: Set of edges with exactly one endpoint in each group
(Figure: A = {1, 2, 3}, B = {4, 5, 6}, cut(A, B) = 2.)
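
A minimal sketch of the cut computation for an unweighted graph. The edge list is an assumption chosen so that A = {1, 2, 3} and B = {4, 5, 6} give cut(A, B) = 2 as on the slide; the figure's actual edges were not recovered, and `cut_size` is an illustrative name.

```python
def cut_size(edges, group_a):
    """Count edges with exactly one endpoint in group_a."""
    a = set(group_a)
    return sum(1 for u, v in edges if (u in a) != (v in a))

# Assumed example graph: two triangles connected by the edges 1-5 and 3-4.
edges = [(1, 2), (1, 3), (2, 3), (1, 5), (3, 4), (4, 5), (4, 6), (5, 6)]

cut_ab = cut_size(edges, {1, 2, 3})     # edges 1-5 and 3-4 cross the partition
cut_single = cut_size(edges, {2})       # isolating node 2 cuts its two edges
```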

Graph Cut Criterion
• Criterion: Minimum-cut
  § Minimize weight of connections between groups: min_{A,B} cut(A, B)
• Degenerate case:
(Figure: the "optimal cut" compared with the minimum cut.)
• Problem:
  § Only considers external cluster connections
  § Does not consider internal cluster connectivity

Graph Cut Criteria [Shi-Malik]
(Slide equations not recovered.)
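
The slide's formulas were not recovered. As a hedged illustration, the sketch below assumes the criterion meant here is the normalized cut usually attributed to Shi and Malik, ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where vol is the sum of degrees in a group. The example graph, with a pendant node 7, is hypothetical and chosen to reproduce the degenerate minimum-cut case.

```python
def cut_and_ncut(edges, group_a):
    """Return (cut, normalized cut) for the partition (group_a, everything else)."""
    a = set(group_a)
    cut = sum(1 for u, v in edges if (u in a) != (v in a))
    vol_a = sum((u in a) + (v in a) for u, v in edges)   # sum of degrees in A
    vol_b = 2 * len(edges) - vol_a                       # sum of degrees in B
    return cut, cut / vol_a + cut / vol_b

# Two triangles joined by two edges, plus node 7 hanging off node 6.
edges = [(1, 2), (1, 3), (2, 3), (1, 5), (3, 4), (4, 5), (4, 6), (5, 6), (6, 7)]

degenerate = cut_and_ncut(edges, {7})        # smallest cut, but a useless split
balanced = cut_and_ncut(edges, {1, 2, 3})    # larger cut, far better normalized cut
```

Minimum cut prefers slicing off node 7 (cut of 1 versus 2), while the normalized criterion prefers the balanced split because dividing by the group volumes penalizes tiny groups.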

Competition Results: Graph Alignment

Wikipedia Graph Alignment
• Given the German and French Wikipedia graphs
• And a few example corresponding articles
• Goal: Find the remaining correspondences:
  § Link "Paris" in German to "Paris" in French
  § Intuition: Paris in both languages links to "similar" pages (pages that also link to each other)

Approach 1: Square Maximization (winning solution)
• Start from some pairing S
  § Start from random pairing
• Goodness of pairing S:
  § Number of "squares"
• Consider transforming (uF, uG), (vF, vG) to (vF, uG), (uF, vG)
• Accept the swap if the number of squares increases
• Improvements:
  § Bound on swap improvement:
    § No need to swap nodes that don't give good improvement
  § Computing swap change efficiently
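
A hedged sketch of the swap-based search. The slides do not define a "square" precisely, so this assumes one square is an edge (uF, vF) in the French graph whose paired nodes (uG, vG) are also linked in the German graph, with both graphs treated as undirected. It recounts all squares after every swap and omits the bounding and incremental-update improvements; all names and the toy graphs are hypothetical.

```python
import random

def count_squares(edges_f, edges_g, pairing):
    """Number of F-edges whose endpoints map onto a G-edge under `pairing`."""
    g = {frozenset(e) for e in edges_g}
    return sum(1 for u, v in edges_f if frozenset((pairing[u], pairing[v])) in g)

def improve_by_swaps(edges_f, edges_g, pairing, rounds=1000, seed=0):
    rng = random.Random(seed)
    pairing = dict(pairing)
    nodes = sorted(pairing)
    score = count_squares(edges_f, edges_g, pairing)
    for _ in range(rounds):
        u, v = rng.sample(nodes, 2)
        pairing[u], pairing[v] = pairing[v], pairing[u]      # tentative swap
        new_score = count_squares(edges_f, edges_g, pairing)
        if new_score > score:
            score = new_score                                # accept the swap
        else:
            pairing[u], pairing[v] = pairing[v], pairing[u]  # undo it
    return pairing, score

# Toy example: two 4-node paths, starting from a partly wrong pairing.
edges_f = [("a", "b"), ("b", "c"), ("c", "d")]
edges_g = [(1, 2), (2, 3), (3, 4)]
start = {"a": 2, "b": 1, "c": 3, "d": 4}

start_score = count_squares(edges_f, edges_g, start)
final_pairing, final_score = improve_by_swaps(edges_f, edges_g, start)
```

Because only improving swaps are accepted, the search can stop at a local optimum; the score never decreases but is not guaranteed to reach the maximum.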

Approach 1: Square Maximization
(Slide figure not recovered.)

Approach 2: Machine Learning
• For a pair of nodes (uF, uG) construct a feature vector
  § Matches from the training set (M.txt) are "positive" examples
  § Pairs not in M.txt are "negative" examples
• Use Random Forests to label pairs (AUC = 0.87)
  § Each pair gets a probability that they match
• Now greedily fill in the remaining pairings by considering correspondence probabilities

Results and Extra Credit

ID               # Correct   Fraction
krish (10%)      3,308       0.83
pmk (8%)         2,941       0.74
lussier1 (6%)    2,191       0.55
prgao (4%)       2,107       0.53
jieyang (4%)     1,706       0.43
carmenv          978         0.24
anmittal         861         0.22
adotey           828         0.21
billyue          805         0.20
gibbons4         507         0.13
leonlin          145         0.04
cktan            65          0.02