Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams
Kijung Shin (1), Mohammad Hammoud (1), Euiwoong Lee (1), Jinoh Oh (2), Christos Faloutsos (1)
(1) Carnegie Mellon University, (2) Adobe Systems

Triangles in a Graph
• Graphs are everywhere!
  ◦ social networks, the web, citation networks
• Triangles are a fundamental primitive
  ◦ 3 nodes connected to each other
• Counting triangles has many applications
  ◦ community detection, anomaly detection, query optimization

Application: Anomaly Detection
[Figure: plots of # incident triangles vs. degree, citing [LJK 18] and [KMF 11]; annotation: Telemarketer]

Remaining Challenges
• Counting triangles in real-world graphs, such as online social networks, the Web, citation networks, and call networks
• Real-world graphs are
  ◦ Large: not fitting in main memory
  ◦ Dynamic: growing with new nodes and edges

Previous Approaches
• Distributed algorithms [SV 11] [PC 13] [PPK 18]
  ◦ pros: utilize multiple machines
  ◦ cons: inapplicable to dynamic graphs
• Streaming algorithms [DERU 16] [Shi 17] [LJK 18]
  ◦ pros: applicable to dynamic graphs
  ◦ cons: limited to a single machine

Our Approach and Goal
• Can we have the best of both worlds?
  ◦ for dynamic graphs
  ◦ on multiple machines
• We design a distributed streaming algorithm
  ◦ Fast and Accurate: outperforming competitors
  ◦ Scalable: with linear data scalability
  ◦ Theoretically Sound: with unbiased estimates

Road Map
• Problem Definition
• Algorithm: Tri-Fly
• Theoretical Analyses
• Experiments
• Conclusion

Problem Definition

Problem Definition (cont.)
[Figure: example graph]
• Global triangles: all triangles in the graph
• Local triangles: the triangles incident to each node

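To make the two quantities concrete, here is a minimal Python sketch that computes both exactly on a small in-memory graph. The edge list and the name `count_triangles` are illustrative assumptions, not taken from the slides; Tri-Fly itself estimates these counts on a stream rather than computing them exactly.

```python
from collections import defaultdict
from itertools import combinations


def count_triangles(edges):
    """Return (global_count, local_counts) for an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    local = {node: 0 for node in adj}
    global_count = 0
    for u in adj:
        # sorted() guarantees v < w; requiring u < v as well means each
        # triangle is counted once, at its smallest node.
        for v, w in combinations(sorted(adj[u]), 2):
            if u < v and w in adj[v]:
                global_count += 1
                for node in (u, v, w):
                    local[node] += 1
    return global_count, local


# Hypothetical example graph: triangles {1, 2, 3} and {2, 3, 4}.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
total, per_node = count_triangles(edges)
print(total)     # 2
print(per_node)  # {1: 1, 2: 2, 3: 2, 4: 1}
```

Here the global count is 2, while nodes 2 and 3 each take part in both triangles and so have a local count of 2.
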
Road Map
• Problem Definition
• Algorithm: Tri-Fly <<
• Theoretical Analyses
• Experiments
• Conclusion

Overview of Tri-Fly
• Inputs: new edges streamed from source(s)
• Outputs: estimated counts of global and local triangles
• Processes each new edge when it arrives
• Updates estimated counts for each edge
[Figure: source(s), master(s), worker(s), aggregator(s)]

Overview of Tri-Fly (cont.)
• source(s): unicast each new edge to master(s)
• master(s): broadcast the edge to worker(s)
• worker(s): count new triangles using local memory
• aggregator(s): aggregate counts & update outputs

Challenge: Limited Memory
• How should we ‘count’ and ‘aggregate’ for accurate estimation when each machine has limited memory?
• Our solution adapts Triest-IMPR [DERU 16]
[Figure: source(s), master(s), worker(s) (count new triangles using local memory), aggregator(s) (aggregate counts & update outputs)]

Workers in Detail
• Each worker runs three steps for each received edge
  ◦ (a) Edge arrival
  ◦ (b) Discovering
  ◦ (c) Sampling
[Figure: a new edge and the worker's memory]

Workers in Detail (cont.)
• (a) Edge arrival step
  ◦ receives a new edge

Workers in Detail (cont.)
• (b) Discovering step
[Figure: the new edge is checked against memory; annotation: "discovered!!"]

Workers in Detail (cont.)
• (c) Sampling step
  ◦ stores or discards the new edge
  ◦ follows standard reservoir sampling

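The three worker steps can be sketched as follows. This is a single-process illustration that assumes the worker applies the published Triest-IMPR rule: count first, with weight max(1, (t-1)(t-2) / (M(M-1))) at the t-th edge for a memory budget of M edges, and then reservoir-sample the edge. The slides only say that Tri-Fly adapts Triest-IMPR, so the class `Worker`, its field names, and the exact weighting are assumptions. The sketch also assumes a simple graph with no duplicate edges or self-loops.

```python
import random
from collections import defaultdict


class Worker:
    """Hypothetical worker that keeps at most `budget` edges in memory."""

    def __init__(self, budget, seed=None):
        if budget < 2:
            raise ValueError("budget must be at least 2")
        self.budget = budget
        self.rng = random.Random(seed)
        self.sample = []                 # edges currently in memory
        self.adj = defaultdict(set)      # adjacency view of the sample
        self.seen = 0                    # number of edges received so far
        self.global_est = 0.0
        self.local_est = defaultdict(float)

    def process(self, u, v):
        # (a) Edge arrival: receive the new edge (u, v).
        self.seen += 1
        t, m = self.seen, self.budget

        # (b) Discovering: each common neighbor of u and v in memory
        # closes a triangle with the new edge. The weight stays 1 while
        # all earlier edges fit in memory (assumed Triest-IMPR rule).
        weight = max(1.0, (t - 1) * (t - 2) / (m * (m - 1)))
        common = self.adj.get(u, set()) & self.adj.get(v, set())
        for w in common:
            self.global_est += weight
            for node in (u, v, w):
                self.local_est[node] += weight

        # (c) Sampling: standard reservoir sampling over edges.
        if len(self.sample) < m:
            self.sample.append((u, v))
            self._link(u, v)
        elif self.rng.random() < m / t:
            idx = self.rng.randrange(m)
            old_u, old_v = self.sample[idx]
            self._unlink(old_u, old_v)
            self.sample[idx] = (u, v)
            self._link(u, v)

    def _link(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def _unlink(self, u, v):
        self.adj[u].discard(v)
        self.adj[v].discard(u)


# Hypothetical stream with two triangles: {1, 2, 3} and {2, 3, 4}.
stream = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
worker = Worker(budget=10, seed=0)
for edge in stream:
    worker.process(*edge)
```

With a budget of 10 and only five edges nothing is discarded, so the estimates match the exact counts (2 triangles overall). With a smaller budget the worker keeps a bounded sample and the estimates become random.
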
Aggregators in Detail

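One plausible way to aggregate is sketched below, assuming every worker sees the full stream and the aggregator simply averages the workers' increments, that is, it adds each reported increment divided by the number of workers. The class `Aggregator`, the method `receive`, and the averaging rule are assumptions for illustration, not details confirmed by the slides.

```python
from collections import defaultdict


class Aggregator:
    """Hypothetical aggregator that averages increments from workers."""

    def __init__(self, num_workers):
        if num_workers < 1:
            raise ValueError("num_workers must be at least 1")
        self.num_workers = num_workers
        self.global_est = 0.0
        self.local_est = defaultdict(float)

    def receive(self, global_delta, local_deltas):
        """Fold in one worker's increments, scaled by 1 / num_workers."""
        self.global_est += global_delta / self.num_workers
        for node, delta in local_deltas.items():
            self.local_est[node] += delta / self.num_workers


# Two hypothetical workers report different weights for the same triangle.
agg = Aggregator(num_workers=2)
agg.receive(1.0, {1: 1.0, 2: 1.0, 3: 1.0})
agg.receive(3.0, {1: 3.0, 2: 3.0, 3: 3.0})
print(agg.global_est)  # 2.0
```

Averaging keeps the combined estimate unbiased whenever each worker's estimate is unbiased, and it reduces variance when the workers sample independently.
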
Summary of Tri-Fly
• source(s): unicast each new edge to master(s)
• master(s): broadcast the edge to worker(s)
• worker(s): count new triangles in their local memory
• aggregator(s): aggregate counts & update outputs

Road Map
• Problem Definition
• Algorithm: Tri-Fly
• Theoretical Analyses <<
• Experiments
• Conclusion

THM 1: Unbiasedness
• Tri-Fly maintains unbiased estimates: the expected value of each estimate equals the true count
[Figure: frequency of estimates, with the true count marked]

THM 2: Linear Drop of Variance
• The variance of Tri-Fly's estimates drops linearly with the number of workers on log-log axes
[Figure: log(Variance) vs. log(# Workers)]

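A short worked derivation shows why a straight line on log-log axes is the expected shape. It assumes that the final estimate is the average of k worker estimates that are independent, each unbiased with the same variance. These assumptions and the symbols below are introduced here for illustration and are not the theorem's exact statement.

```latex
% Assumption: \hat{c}_1, \dots, \hat{c}_k are independent worker estimates,
% each with mean c (the true count) and variance \sigma^2.
\[
\bar{c} = \frac{1}{k} \sum_{i=1}^{k} \hat{c}_i
\quad\Longrightarrow\quad
\mathbb{E}[\bar{c}] = c,
\qquad
\operatorname{Var}[\bar{c}]
  = \frac{1}{k^2} \sum_{i=1}^{k} \operatorname{Var}[\hat{c}_i]
  = \frac{\sigma^2}{k}
\]
\[
\log \operatorname{Var}[\bar{c}] = \log \sigma^2 - \log k
\]
```

Under these assumptions, log(Variance) is a line with slope -1 in log(# Workers), so doubling the number of workers halves the variance.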
THM 3: Linear Scalability
• Running time scales linearly with the number of edges
[Figure: running time vs. # edges]

Properties of Tri-Fly
• Fast and accurate: outperforming competitors
• Scalable: with linear data scalability (THM 3)
• Theoretically sound: with unbiased estimates (THM 1)

Road Map
• Problem Definition
• Algorithm: Tri-Fly
• Theoretical Analyses
• Experiments <<
• Conclusion

Experimental Settings
• Competitors: MASCOT [LJK 18] & Triest-IMPR [DERU 16]
  ◦ state-of-the-art single-machine streaming algorithms
  ◦ for both global and local triangle counts
• Implementations:
  ◦ C++ & MPICH (asynchronous communication)
  ◦ 1 master & 1 aggregator & up to 40 workers
• Datasets:
  ◦ ER Synthetic (100B)
  ◦ Social (1.8B+)
  ◦ Social (22M+)
  ◦ Patent citation (16M+)
  ◦ Web (6M+)

EXP 1. Bias Analysis
“Does Tri-Fly give unbiased estimates?” (THM 1)
[Figure legend: True Count; Tri-Fly (10 workers); Tri-Fly (5 workers); Tri-Fly (1 worker)]

EXP 2. Variance Analysis
“How rapidly does the variance decrease w.r.t. the number of workers?” (THM 2)
[Figure legend: MASCOT; Triest-IMPR; Tri-Fly]

EXP 3. Speed and Accuracy
“Does Tri-Fly outperform single-machine baselines?”
[Figure: Root Mean Square Error; annotation: Tri-Fly]

EXP 4. Scalability
“Does Tri-Fly scale linearly with the size of the input stream?” (THM 3)
[Figure annotation: Tri-Fly; ER; 100B edges (800GB)]

Properties of Tri-Fly
• Fast and accurate: outperforming competitors (EXP 3)
• Scalable: with linear data scalability (EXP 4)
• Theoretically sound: with unbiased estimates (EXP 1)

Road Map
• Problem Definition
• Algorithm: Tri-Fly
• Theoretical Analyses
• Experiments
• Conclusion <<

Conclusion
• We propose Tri-Fly
  ◦ the first distributed streaming algorithm
  ◦ for counting global and local triangles
• Fast & Accurate, Scalable, Theoretically Sound
• Code and datasets:
  ◦ https://github.com/kijungs/trifly

References
• [SV 11] Siddharth Suri, Sergei Vassilvitskii, “Counting Triangles and the Curse of the Last Reducer”, WWW 2011
• [KMF 11] U Kang, Brendan Meeder, Christos Faloutsos, “Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation”, PAKDD 2011
• [PC 13] Ha-Myung Park, Chin-Wan Chung, “An Efficient MapReduce Algorithm for Counting Triangles in a Very Large Graph”, CIKM 2013
• [DERU 16] Lorenzo De Stefani et al., “TRIÈST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size”, KDD 2016
• [Shi 17] Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017
• [LJK 18] Yongsub Lim, Minsoo Jung, U Kang, “Memory-efficient and Accurate Sampling for Counting Local Triangles in Graph Streams: From Simple to Multigraphs”, TKDD 2018
• [PPK 18] Ha-Myung Park, Chiwan Park, U Kang, “PegasusN: A Scalable and Versatile Graph Mining System”, AAAI 2018