FAST COUNTING OF TRIANGLES IN LARGE NETWORKS: ALGORITHMS AND LAWS
Charalampos (Babis) Tsourakakis
School of Computer Science, Carnegie Mellon University
http://www.cs.cmu.edu/~ctsourak
RPI Theory Seminar, 24 November 2008

Slide 2: Counting Triangles
Given an undirected, simple graph G(V, E), a triangle is a set of 3 vertices such that any two of them are connected by an edge of the graph.
Related problems:
a) Decide if a graph is triangle-free.
b) Count the total number of triangles δ(G).
c) Count the number of triangles δ(v) that each vertex v participates in.
d) List the triangles that each vertex v participates in.
Our focus: problems (b) and (c).

Slide 3: Why is triangle counting important?*
• Social network analysis: “Friends of friends are friends” [WF 94]
• Web spam detection [BPCG 08]
• Hidden thematic structure of the Web [EM 02]
• Motif detection, e.g. in biological networks [YPSB 05]
* A few indicative reasons, from the graph mining perspective.

Slide 4: Why is triangle counting important?
Furthermore, two often-used metrics are:
• Clustering coefficient
• Transitivity ratio
[Figure: a triple at node v; a triangle]
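
A minimal sketch of how the two metrics depend on triangle counts, assuming the standard textbook definitions: the clustering coefficient as the mean over vertices of δ(v) divided by the number of triples centered at v, and the transitivity ratio as 3δ(G) divided by the total number of triples. The function names, the adjacency-dict representation and the convention that vertices of degree below 2 contribute 0 are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations


def triangles_per_vertex(adj):
    """Return {v: number of triangles containing v} for an undirected simple graph."""
    return {
        v: sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        for v, nbrs in adj.items()
    }


def clustering_and_transitivity(adj):
    tri = triangles_per_vertex(adj)
    # Triples centered at v: unordered pairs of neighbours of v.
    triples = {v: len(n) * (len(n) - 1) // 2 for v, n in adj.items()}
    # Assumed convention: a vertex with fewer than 2 neighbours contributes 0.
    local = [tri[v] / triples[v] if triples[v] else 0.0 for v in adj]
    clustering = sum(local) / len(adj)
    total_triangles = sum(tri.values()) // 3  # each triangle is seen from 3 vertices
    total_triples = sum(triples.values())
    transitivity = 3 * total_triangles / total_triples if total_triples else 0.0
    return clustering, transitivity


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
clustering, transitivity = clustering_and_transitivity(adj)
```

On this toy graph the two metrics differ (7/12 versus 3/5), which is one reason both are reported in practice.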

Slide 5: Outline
• Related Work
• Proposed Method
• Experiments
• Triangle-related Laws
• Triangles in Kronecker Graphs
• Future Work & Open Problems

Slide 6: Counting methods

Graph type      Goal        Time complexity                 Space complexity
Dense graphs    Fast        O(n^2.37)                       O(n^2)
Dense graphs    Low space   O(n^3)                          O(m)
Sparse graphs   Fast        O(m^0.7 n^1.2 + n^(2+o(1)))     Θ(n^2) (eventually)
Sparse graphs   Low space   e.g. O(n )                      Θ(m)

Slide 8: Outline of the Proposed Method
• EigenTriangle theorem
• EigenTriangleLocal theorem
• EigenTriangle algorithm
• EigenTriangleLocal algorithm
• Efficiency & complexity
  • Power law degree distributions
  • Gershgorin discs
  • Real world network spectra

Slide 9: Theorem [EigenTriangle]
The number of triangles δ(G) in an undirected, simple graph G(V, E) is given by
$$\delta(G) = \frac{1}{6} \sum_{i=1}^{n} \lambda_i^3,$$
where λ_1, ..., λ_n are the eigenvalues of the adjacency matrix of G.

Slide 10: Proof
Call A the adjacency matrix of the graph. Consider the i-th diagonal element α_ii of A^3. This element counts the closed walks of length 3 through vertex i, i.e. twice the number of triangles vertex i participates in, since each triangle is traversed both as i-j-k and as i-k-j. So the trace of A^3 is 6δ(G): each triangle is counted 6 times (once from each of its 3 vertices, in each of the 2 directions).
Furthermore, if Ax = λx, then λ^3 is an eigenvalue of A^3 (*), and vice versa: if λ is an eigenvalue of A^3, then its real cube root is an eigenvalue of A. Since the trace of A^3 equals the sum of its eigenvalues, 6δ(G) = Σ_i λ_i^3.
(*) A^3 x = AAAx = AAλx = λAAx = λAλx = λ^2 Ax = λ^3 x
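
The argument in the proof can be checked numerically. The sketch below, assuming NumPy is available, compares the trace of A^3 divided by 6 with the sum of cubed eigenvalues divided by 6 on a small graph; the function name `count_triangles_spectral` and the choice of test graph are illustrative.

```python
import numpy as np


def count_triangles_spectral(A):
    """Total number of triangles as (1/6) * sum of cubed eigenvalues of A."""
    eigenvalues = np.linalg.eigvalsh(A)  # A is symmetric, so eigenvalues are real
    return float(np.sum(eigenvalues ** 3) / 6.0)


# Complete graph K4: any 3 of its 4 vertices form a triangle, so 4 triangles.
A = np.ones((4, 4)) - np.eye(4)
spectral = count_triangles_spectral(A)
by_trace = float(np.trace(np.linalg.matrix_power(A, 3)) / 6.0)
```

K4 has eigenvalues 3, -1, -1, -1, so the sum of cubes is 27 - 3 = 24 and both quantities equal 4 up to floating-point error.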

Slide 11: Theorem [EigenTriangleLocal]
The number of triangles δ(i) that vertex i participates in is equal to
$$\delta(i) = \frac{1}{2} \sum_{j=1}^{n} \lambda_j^3 u_{i,j}^2,$$
where u_{i,j} is the i-th entry of the j-th eigenvector.
Proof [Sketch]: Follows from the previous theorem and the fact that A is symmetric, therefore diagonalizable as A = U Λ U^T with U orthogonal, and also A^3 = U Λ^3 U^T.
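
A small numerical sketch of the local theorem, assuming NumPy and reading the statement as δ(i) = (1/2) Σ_j λ_j^3 u_{i,j}^2 with u_{i,j} the i-th entry of the j-th eigenvector (this indexing is an assumption about the slide's notation). The function name is illustrative.

```python
import numpy as np


def local_triangles_spectral(A):
    """Per-vertex triangle counts from the eigendecomposition of a symmetric A."""
    eigenvalues, U = np.linalg.eigh(A)  # columns of U are orthonormal eigenvectors
    # (A^3)_ii = sum_j lambda_j^3 * U[i, j]^2, and delta(i) is half of that.
    return (U ** 2) @ (eigenvalues ** 3) / 2.0


# Triangle 0-1-2 plus a pendant vertex 3 attached to vertex 2.
A = np.array(
    [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
local = local_triangles_spectral(A)
```

Vertices 0, 1 and 2 each sit in one triangle and vertex 3 in none, which is what the spectral formula recovers up to rounding.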

Slide 12: EigenTriangle Algorithm
[Figure: EigenTriangle algorithm]

Slide 13: EigenTriangleLocal Algorithm
[Figure: EigenTriangleLocal algorithm]
Why are these two algorithms efficient?

Slide 14: Skewed Degree Distributions
• Skewed degree distributions are ubiquitous in nature.
• They have been termed “the signature of human activity” [FKP 02], but they appear in all other kinds of networks as well, e.g. biological ones.
• See [N 05] and [M 04] for generative models of power law distributions.
• They are typically referred to as power laws, even if this sometimes abuses the strict definition of a power law, i.e. $p(x) \propto x^{-\alpha}$.

Slide 15: Examples of power laws
Newman [N 05] demonstrated how often power laws appear, using many different types of data, ranging from word frequencies to the populations of cities.
[Figure: populations of cities. Many cities have a small population; few cities have a huge population.]

Slide 16: Gershgorin’s Discs
Theorem: Let B be an arbitrary n×n matrix. Then the eigenvalues λ of B are located in the union of the n discs
$$|\lambda - b_{ii}| \le \sum_{j \ne i} |b_{ij}|, \quad i = 1, \dots, n.$$
For a proof see Demmel [D 97], p. 82.
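
A brief sketch of what the theorem says for an adjacency matrix, assuming NumPy. Because the diagonal of an adjacency matrix is zero, every disc is centered at 0 and its radius is the degree of the corresponding vertex, so the bound reduces to |λ| ≤ maximum degree. The helper name `gershgorin_discs` and the star graph are illustrative choices.

```python
import numpy as np


def gershgorin_discs(B):
    """Return (centers, radii) of the Gershgorin discs of a square matrix B."""
    B = np.asarray(B, dtype=float)
    centers = np.diag(B)
    # Radius of disc i: sum of absolute off-diagonal entries in row i.
    radii = np.sum(np.abs(B), axis=1) - np.abs(centers)
    return centers, radii


# Star graph: hub 0 connected to leaves 1..4.
A = np.zeros((5, 5))
A[0, 1:] = 1
A[1:, 0] = 1

centers, radii = gershgorin_discs(A)
eigenvalues = np.linalg.eigvalsh(A)
# Every eigenvalue must fall inside at least one disc.
inside = [bool(np.any(np.abs(lam - centers) <= radii + 1e-12)) for lam in eigenvalues]
```

Here the largest disc has radius 4 while the largest eigenvalue is only 2, a small instance of how loose the bound can be.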

Slide 17: Gershgorin Discs
[Figure: Gershgorin bounds on the airports network. Observe how loose they are.]

Slide 18: Typical real world spectra
[Figure: spectra of the Political blogs and Airports networks]

Slide 19: Top Eigenvalues
Zooming in on the top eigenvalues and plotting rank vs. eigenvalue in log-log scale reveals that the top eigenvalues follow a power law [FFF 99]. Some years later, Mihail & Papadimitriou [MP 02] and Chung, Lu and Vu [CLV 03] proved this fact.

Slide 20: Our idea
• Simple & clear: use a low-rank approximation of A^3 to estimate the diagonal elements and the trace.
• It also suggests a way of thinking: take advantage of special properties (e.g. power laws) to reduce the complexity of certain computational tasks in real-world networks.
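
The idea can be sketched as follows, assuming NumPy: keep only the k eigenvalues of largest magnitude and sum their cubes. For brevity this sketch computes the full spectrum with `numpy.linalg.eigvalsh` and then truncates it; a practical implementation would compute only the top eigenvalues with a Lanczos-type solver (for example `scipy.sparse.linalg.eigsh`, which is an assumption about tooling, not something the slides specify). The function name and the test graph are illustrative.

```python
import numpy as np


def estimate_triangles(A, k):
    """Rank-k estimate of the triangle count from the k largest-magnitude eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(A)
    top = eigenvalues[np.argsort(-np.abs(eigenvalues))[:k]]
    return float(np.sum(top ** 3) / 6.0)


# Complete graph K6: eigenvalues are 5 (once) and -1 (five times), 20 triangles.
A = np.ones((6, 6)) - np.eye(6)
exact = estimate_triangles(A, 6)   # full rank, equals the true count
rank1 = estimate_triangles(A, 1)   # only the top eigenvalue: 125 / 6
```

Even the rank-1 estimate is within about 4% of the true count here, because the cube of the single large eigenvalue dominates the sum.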

Slide 21: Summing up: Why does it work?
• The first main reason is that the bulk of the spectrum, i.e. all eigenvalues except the top ones, is almost symmetric around 0, so the cubes of those eigenvalues nearly cancel in the sum.
• Cubing strongly amplifies this phenomenon!

Slide 22: Complexity Analysis
• The main computational bottleneck, which determines the complexity, is the Lanczos method.
• Lanczos runs in time linear in the number of non-zero entries of the matrix, i.e. the edges, assuming that we compute only a small, constant number of eigenvalues.
• Lanczos converges fast due to the eigenvalue power law (see Kaniel-Paige theory [GL 89]).

Slide 24: Datasets
[Table: datasets]

Slide 25: Competitor: Node Iterator
• The Node Iterator algorithm considers one node at a time, looks at its neighbors, and counts how many pairs of them are connected to each other.
• Complexity: O(n )
• We report the results as the speedup of the EigenTriangle algorithm over the running time of the Node Iterator.
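
A minimal sketch of the Node Iterator baseline as described above, in plain Python. The adjacency-dict representation and the function name are illustrative assumptions. Since each node inspects every pair of its neighbours, the work is proportional to the sum of squared degrees, which is at most n times the square of the maximum degree.

```python
from itertools import combinations


def node_iterator(adj):
    """Count triangles by checking every pair of neighbours of every node."""
    closed_pairs = 0
    for v, neighbours in adj.items():
        for a, b in combinations(neighbours, 2):
            if b in adj[a]:  # neighbours a and b of v are themselves connected
                closed_pairs += 1
    return closed_pairs // 3  # each triangle is found once from each of its 3 vertices


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
total = node_iterator(adj)

# Complete graph K4 as a second check.
k4 = {i: set(range(4)) - {i} for i in range(4)}
total_k4 = node_iterator(k4)
```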

Slide 26: Results: #Eigenvalues vs. Speedup
[Figure: scatterplot of #eigenvalues vs. speedup]

Slide 27: Results: #Edges vs. Speedup
[Figure: scatterplot of #edges vs. speedup]

Slide 28: Main points
Some interesting facts about the two scatterplots:
• The mean approximation rank required for at least 95% accuracy is 6.2.
• Speedups are between 33.7x and 1159x. The mean speedup is 250x.
• Notice the increasing speedup as the size of the network grows.

Slide 29: Zooming in
[Figure: zooming in on one point of the scatterplot]

Slide 30: Evaluating the Local Counting Method
• Metrics: Pearson’s correlation coefficient ρ and Relative Reconstruction Error (RRE).
• Political Blogs: RRE = 7×10^-4, ρ = 99.97%.

Slide 31: #Eigenvalues vs. ρ for three networks
[Figure: #eigenvalues vs. ρ for three networks]
Observe how a low rank gives almost optimal results. This holds for surprisingly many real world networks.

Slide 33: Triangle Participation Law
Plots the number of triangles δ (x-axis) vs. the count of vertices that participate in δ triangles.
[Figure panels: a) EPINIONS, who-trusts-whom network; b) ASN, social network; c) HEP_TH, collaboration network]

Slide 34: Degree Triangle Law
Plots the degree d_i (x-axis) vs. the mean number of triangles that nodes with degree d_i participate in.
[Figure panels: Epinions; ASN]
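
The data behind such a plot could be assembled as in the sketch below, in plain Python: group vertices by degree and average their triangle counts. The function name and the adjacency-dict representation are illustrative assumptions, and the toy graph is far too small to exhibit any law.

```python
from collections import defaultdict
from itertools import combinations


def mean_triangles_by_degree(adj):
    """Map each degree d to the mean number of triangles over vertices of degree d."""
    per_degree = defaultdict(list)
    for v, nbrs in adj.items():
        # Triangles at v: pairs of neighbours of v that are themselves connected.
        t = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        per_degree[len(nbrs)].append(t)
    return {d: sum(ts) / len(ts) for d, ts in sorted(per_degree.items())}


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
law = mean_triangles_by_degree(adj)
```

Plotting the keys against the values on log-log axes for a real network would give the kind of figure this slide describes.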

Slide 36: Kronecker Graphs
• This model was introduced in [LCKF 05]. It is based on the simple operation of the Kronecker product and generates graphs that mimic real world networks.
• Deterministic Kronecker graphs: Kronecker product of the adjacency matrix at the current step k with the initiator adjacency matrix (typically small).
• Stochastic Kronecker graphs: Kronecker product of the matrix at the current step k with the initiator matrix. The initiator matrix contains probabilities.

Slide 37: Triangles in Kronecker Graphs
Some notation first:
• A: the n×n initiator adjacency matrix of the undirected, simple graph G_A
• B = A^[k]: the k-th Kronecker product
• λ = (λ_1, ..., λ_n): the eigenvalues of A
• Δ(G_A), Δ(G_B): the number of triangles of G_A, G_B
Theorem [KroneckerTRC]

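Slide 38: Proof
We use induction on the number of recursion steps k. For k = 0 the theorem trivially holds. Assume now that KroneckerTRC holds for some r. Call C = A^[r], D = A^[r+1], and let μ_1, ..., μ_s be the eigenvalues of C, so that the induction hypothesis gives Δ(G_C). The eigenvalues of D are given by the Kronecker product: they are the products λ_i μ_j. By the EigenTriangle theorem,
$$\Delta(G_D) = \frac{1}{6} \sum_{i=1}^{n} \sum_{j=1}^{s} (\lambda_i \mu_j)^3 = \frac{1}{6} \Big(\sum_{i=1}^{n} \lambda_i^3\Big) \Big(\sum_{j=1}^{s} \mu_j^3\Big) = 6\, \Delta(G_A)\, \Delta(G_C).$$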

Slide 39: Proof
Therefore KroneckerTRC holds for all k. Q.E.D.
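
A numerical sanity check of the Kronecker argument, assuming NumPy. The eigenvalues of a Kronecker product are the pairwise products of the factors' eigenvalues, so by the EigenTriangle theorem the sum of cubed eigenvalues multiplies at each step. Under the assumption that A^[k] denotes the Kronecker product of k copies of A (so A^[1] = A; this indexing convention is an assumption made for this sketch), that yields Δ(G_B) = 6^(k-1) Δ(G_A)^k. The helper names are illustrative.

```python
import numpy as np


def triangles(A):
    """Exact triangle count via trace(A^3) / 6."""
    return int(round(np.trace(np.linalg.matrix_power(A, 3)) / 6.0))


def kronecker_power(A, k):
    """Kronecker product of k copies of A, so kronecker_power(A, 1) is A itself."""
    B = A
    for _ in range(k - 1):
        B = np.kron(B, A)
    return B


# Initiator: a single triangle (K3), so Delta(G_A) = 1.
A = np.ones((3, 3)) - np.eye(3)
counts = [triangles(kronecker_power(A, k)) for k in (1, 2, 3)]
predicted = [6 ** (k - 1) * triangles(A) ** k for k in (1, 2, 3)]
```

Because A has a zero diagonal, every Kronecker power also has a zero diagonal, so the products remain adjacency matrices of simple graphs and the trace formula applies.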

Slide 41: Theoretical Challenge I: Spectra of real world networks
• Can we prove things about the distribution of the eigenvalues, adopting a random graph model such as the expected degree model G(w) [CLV 03]?
• We would like an analog of Wigner’s semicircle law, which holds for random Erdős-Rényi graphs (see Füredi and Komlós [FK 81]).
[Figure: spectrum over 100000 iterations [S 07]]

Slide 42: Theoretical Challenge I: Spectra of real world networks
• Empirically, the rest of the spectrum follows a triangular-like distribution [FDBV 01].
• Can we prove something about this empirical observation?

Slide 43: Theoretical Challenge II: Eigenvectors of real world networks
• Things are even “worse” than in the case of spectra: very little is known about the eigenvectors.
• Related work: see [P 08] for random graphs.

Slide 44: Theoretical Challenge III: Degree Triangle Law
• Prove the pattern we saw using the expected degree random graph model G(w) (see [S 04]).
• Conjecture: the relationship we observed probably appears only for some values of the slope of the degree distribution. Further, recent experiments showed that for some graphs this pattern does not hold.

Slide 45: Experimental Challenge I: Compare with Streaming Methods
• Streaming or semi-streaming methods perform one or O(1) passes over the graph [YKS 02] [BFLSS 06] [BPCG 08].
• Common underlying idea: sophisticated sampling methods.
• Implement and compare.

Slide 46: Practical Challenge I: Triangles in Large Scale Graph Mining
• Many graphs are gigabytes or even petabytes in size. How do we handle these graphs? HADOOP.
• EigenTriangle algorithms are based just on simple matrix-vector multiplications, which are easy to parallelize on all sorts of architectures (distributed memory, shared memory). See [DHV 93] for the details.

Slide 47: PEGASUS: Peta-Graph Mining from the Triangle perspective
Soon… Stay tuned! Ongoing work with U Kang and Christos Faloutsos, in collaboration with Yahoo! Research. Among others:
• Implement EigenTriangle algorithms in HADOOP and compare to other methods.
• Find outliers with respect to triangles in graphs with many billions of edges.

Slide 48: Curious about:
[Figure]

Slide 49: Acknowledgements
Christos Faloutsos and Yiannis Koutis, for the helpful discussions.

Slide 50: Acknowledgements
Maria Tsiarli, for the PEGASUS logo.

Slides 52-55: References
[WF 94] Wasserman, Faust: “Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences)”
[EM 02] Eckmann, Moses: “Curvature of co-links uncovers hidden thematic layers in the World Wide Web”
[BPCG 08] Becchetti, Boldi, Castillo, Gionis: “Efficient Semi-Streaming Algorithms for Local Triangle Counting in Massive Graphs”
[FKP 02] Fabrikant, Koutsoupias, Papadimitriou: “Heuristically Optimized Trade-offs: A New Paradigm for Power Laws in the Internet”
[N 05] Newman: “Power laws, Pareto distributions and Zipf's law”
[M 04] Mitzenmacher: “A brief history of generative models for power law and lognormal distributions”
[FK 81] Füredi, Komlós: “Eigenvalues of random symmetric matrices”
[S 04] Danilo Sergi: “Random graph model with power-law distributed triangle subgraphs”
[D 97] Demmel: “Applied Numerical Linear Algebra”
[LCKF 05] Leskovec, Chakrabarti, Kleinberg, Faloutsos: “Realistic, Mathematically Tractable Graph Generation and Evolution using Kronecker Multiplication”
[LK 07] Leskovec, Faloutsos: “Scalable Modeling of Real Graphs using Kronecker Multiplication”
[FFF 99] Faloutsos, Faloutsos, Faloutsos: “On power-law relationships of the Internet topology”
[MP 02] Mihail, Papadimitriou: “On the Eigenvalue Power Law”
[CLV 03] Chung, Lu, Vu: “Spectra of Random Graphs with given expected degrees”
[YKS 02] Bar-Yossef, Kumar, Sivakumar: “Reductions in streaming algorithms, with an application to counting triangles in graphs”
[GL 89] Golub, Van Loan: “Matrix Computations”
[BFLSS 06] Buriol, Frahling, Leonardi, Spaccamela, Sohler: “Counting triangles in data streams”
[DHV 93] Demmel, Heath, Vorst: “Parallel Numerical Linear Algebra”
[YPSB 05] Ye, Peyser, Spencer, Bader: “Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast”
[P 08] Mitra Pradipta: “Entrywise Bounds for Eigenvectors of Random Graphs”
[FDBV 01] Farkas, Derenyi, Barabasi, Vicsek: “Spectra of "real-world" graphs: Beyond the semi-circle law”
[S 07] Spielman’s “Spectral Graph Theory and its Applications” class (Yale): http://www.cs.yale.edu/homes/spielman/eigs/
[F 08] Faloutsos’ “Multimedia Databases and Data Mining” class (CMU): http://www.cs.cmu.edu/~christos/courses/826.S08
For more references, see also the paper: http://www.cs.cmu.edu/~ctsourak/tsour.ICDM08.pdf