Query-Friendly Compression of Graph Streams
Arijit Khan, Nanyang Technological University, Singapore
Charu C. Aggarwal, IBM T. J. Watson Research Lab, NY, USA

Graph Streams
- Graph stream: a continuous stream of graph edges
- Examples: telephone networks, communication networks, social media data, IP traffic
- Massive volume and high speed
- Construct a summary to support future queries
[Figure: an edge stream (e1, e2, ...) next to the resulting graph structure with edge frequencies]

Challenges in Data Streams Querying
Trade-off among space, accuracy, and efficiency:
-- Increasing space increases accuracy, but reduces throughput
Other requirements:
-- Build the summary in one pass over the stream
-- Support incremental updates to the summary

Additional Challenges in Graph Streams Querying: Query Expressibility
- Example query: compute reachability over paths formed by heavy-hitter edges
- Need to preserve the connectivity information of the edges in the graph data
[Figure: an edge stream and the graph data, with heavy-hitter edges drawn in red and two query nodes V1 and V2]

Related Work
Graph Summarization:
- Query Preserving Graph Compression (SIGMOD 2012)
- Graph Summarization with Bounded Error (SIGMOD 2008)
- Representing Web Graphs (ICDE 2003)
- The Transitive Reduction of a Directed Graph (SICOMP 1972)
Data Stream Summarization:
- Sketches (SIGMOD 2002, VLDB 2002, SIGMOD 2004)
- Histograms (SIGMOD 1996, VLDB 1998)
- Wavelets (SIAM Rev. 1996)
- Space Saving (ICDT 2005)
Graph Streams Querying:
- gSketches (VLDB 2012)
- Analyzing Graph Structure via Linear Measurements (SODA 2012)
- Graph Sketches: Sparsification, Spanners, and Subgraphs (PODS 2012)
- TCM Sketch (SIGMOD 2016)

Related Work (same list, with annotations per category)
- Graph Summarization: not for the stream setting
- Data Stream Summarization: does not preserve graph structure-based information, so structure-based queries cannot be answered
- Graph Streams Querying: cannot answer queries defined by both frequency and structure, e.g., find all connected components formed by heavy-hitter edges

Related Work: TCM Sketch (SIGMOD 2016)
- Does not provide theoretical error bounds
- Difficult to answer reachability over heavy-hitter edges

Count-Min Sketch
[Figure: a w × h array of counters; an incoming item (e, f) adds f to cell Hk(e) in each of the w rows, for hash functions H1, ..., Hw]
- h is much smaller than the total number of edges
- Can estimate the frequency of an edge and find heavy-hitter edges
- Cannot answer structural queries: are these two nodes connected by only high-frequency edges?
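
A minimal Python sketch of the Count-Min structure described on this slide. The class name, the salted use of Python's built-in `hash`, and the `seed` parameter are illustrative assumptions, not details from the talk.

```python
import random


class CountMinSketch:
    """w rows of h counters; row k uses its own hash function H_k."""

    def __init__(self, h, w, seed=0):
        self.h, self.w = h, w
        rng = random.Random(seed)
        # Hypothetical per-row salts standing in for w independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(w)]
        self.rows = [[0] * h for _ in range(w)]

    def _bucket(self, k, edge):
        # edge is a tuple of ints, so hash() is deterministic across runs.
        return hash((self.salts[k], edge)) % self.h

    def update(self, edge, f=1):
        # Add f to one counter in each of the w rows.
        for k in range(self.w):
            self.rows[k][self._bucket(k, edge)] += f

    def estimate(self, edge):
        # Collisions only add mass, so the minimum never underestimates.
        return min(self.rows[k][self._bucket(k, edge)] for k in range(self.w))
```

Because an edge is hashed as an opaque key, the counters say nothing about which nodes share a bucket, which is why structural queries are out of reach here.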

Our Solution: GMatrix Synopsis
- Incoming edge: e = (i, j)
- The k-th hash function hashes the edge into cell (Hk(i), Hk(j)), e.g., (H1(i), H1(j)) for k = 1
- h is much smaller than the total number of nodes
[Figure: w stacked h × h matrices, one per hash function (H1(.) through H4(.) in the drawing)]
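
The update step can be sketched as follows. This is an illustration under assumptions: the class and method names are hypothetical, and the node hash is taken to be the common modular form ((a*x + b) mod P) mod h, which the transcript does not spell out at this point.

```python
import random


class GMatrix:
    """w layers, each an h x h matrix of counters indexed by hashed endpoints."""

    def __init__(self, h, w, prime, seed=0):
        rng = random.Random(seed)
        self.h, self.w, self.P = h, w, prime
        # One (a, b) pair per layer; assumed modular hash parameters.
        self.params = [(rng.randint(1, prime - 1), rng.randint(1, prime - 1))
                       for _ in range(w)]
        self.cells = [[[0] * h for _ in range(h)] for _ in range(w)]

    def node_hash(self, k, x):
        a, b = self.params[k]
        return ((a * x + b) % self.P) % self.h

    def update(self, i, j, f=1):
        # Edge (i, j) lands in cell (H_k(i), H_k(j)) of every layer k.
        for k in range(self.w):
            self.cells[k][self.node_hash(k, i)][self.node_hash(k, j)] += f
```

Hashing the two endpoints separately is the design choice that keeps row and column indices meaningful as super-nodes.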

GMatrix Compression
- Contract the nodes into a total of h super-nodes
- Different hash functions create different contractions ⇒ this holds the key to effective query processing
- A graph with 10^8 nodes and 10^10 edges ⇒ storage 40 GB
- GMatrix with h = 10^3 and w = 10 ⇒ storage 40 MB
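
The two storage figures are consistent with 4 bytes per stored item. The 4-byte size is an assumption used only to reproduce the arithmetic; the slide gives just the totals.

```latex
\begin{align*}
\text{graph: } & 10^{10}\ \text{edges} \times 4\ \text{B} = 4 \times 10^{10}\ \text{B} = 40\ \text{GB} \\
\text{GMatrix: } & w \cdot h^{2} \times 4\ \text{B} = 10 \times (10^{3})^{2} \times 4\ \text{B} = 4 \times 10^{7}\ \text{B} = 40\ \text{MB}
\end{align*}
```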

Choice of Hash Functions
- Pairwise independent, e.g., a modular hash function
- P is a prime number larger than any node id in (1, 2, ..., n)
- a and b are chosen uniformly from (1, P-1)
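
The hash formula itself did not survive in this transcript. A common modular family that matches the stated parameters is H(x) = ((a*x + b) mod P) mod h; the sketch below assumes that form, and the function name is hypothetical.

```python
import random


def make_modular_hash(P, h, rng):
    """Return a node hash H and its parameters (a, b), assuming
    H(x) = ((a*x + b) mod P) mod h with P prime and larger than any node id."""
    a = rng.randint(1, P - 1)  # uniform in 1..P-1, inclusive
    b = rng.randint(1, P - 1)

    def H(x):
        return ((a * x + b) % P) % h

    return H, (a, b)
```

Drawing a fresh (a, b) pair per layer gives each of the w layers a different contraction of the nodes.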

Reverse Hash Mapping
- Example: 7x mod 9 = 1 ⇒ x = 4, since 7*4 = 3*9 + 1
- The reverse hash mapping is small and can be computed efficiently
- Modular hash function: the reverse hash mapping has size ⌈P/h⌉
- Can be computed in time O(⌈P/h⌉ log P) using the extended Euclidean algorithm
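
A sketch of the reverse mapping, assuming the modular form H(x) = ((a*x + b) mod P) mod h. The function name is hypothetical. Python's three-argument `pow` with exponent -1 (Python 3.8+) returns the modular inverse, which is the quantity the extended Euclidean algorithm computes.

```python
def reverse_hash(v, a, b, P, h, n):
    """All node ids x in 1..n with ((a*x + b) % P) % h == v (requires n < P)."""
    a_inv = pow(a, -1, P)  # modular inverse of a modulo P
    nodes = []
    # Residues y in [0, P) with y % h == v: at most ceil(P/h) of them.
    for y in range(v, P, h):
        x = (a_inv * (y - b)) % P  # unique x with (a*x + b) % P == y
        if 1 <= x <= n:
            nodes.append(x)
    return sorted(nodes)


# The slide's example: solving 7x mod 9 = 1.
x = pow(7, -1, 9)
```

Each bucket value therefore expands to at most ⌈P/h⌉ candidate node ids, matching the size bound on the slide.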

Other Synopsis Options with Same Functionality as GMatrix
[Figure: a Count-Min sketch with w rows of h^2 counters; an incoming edge (ij, f) adds f to cell Hk(ij) in each row]
- There, reverse hash mapping computes w·n^2/h^2 intersections
- In GMatrix, reverse hash mapping computes 2·w·n/h intersections
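
To get a feel for the gap, plug in n = 10^8 nodes, h = 10^3 and w = 10. Using these particular values for the comparison is an assumption made here for illustration.

```latex
\begin{align*}
\text{edge-hashed sketch: } & \frac{w \, n^{2}}{h^{2}} = \frac{10 \times 10^{16}}{10^{6}} = 10^{11} \\
\text{GMatrix: } & \frac{2 \, w \, n}{h} = \frac{2 \times 10 \times 10^{8}}{10^{3}} = 2 \times 10^{6}
\end{align*}
```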

Queries supported by GMatrix (not a comprehensive list)
- Edge Frequency Query
- Heavy-hitter Edge Query
- Node Frequency Query
- Sub-graph Edge Frequency Query
- Heavy-hitter Node Query
- Reachability Query over High-frequency Edges

Queries supported by GMatrix (same list, with annotations)
- The last query combines graph structure with edge frequency
- Possible to define analogous frequency-based queries for graph mining algorithms, e.g., frequent subgraphs

Edge-Frequency Estimation Query
- For edge (i, j), look up the frequencies of w different cells: (Hk(i), Hk(j), k) for k = 1, ..., w
- The minimum of these values is returned as the estimated frequency
- Estimation is good for high-frequency edges: if the true frequency is a significant fraction of the total stream size, then the relative error is small
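
The lookup above can be sketched with sparse layers. The function names, the two hand-picked toy hashes, and the toy stream are all hypothetical.

```python
from collections import Counter


def build_layers(stream, hashes):
    """One sparse layer per node hash; cell (H(i), H(j)) accumulates frequency."""
    layers = [Counter() for _ in hashes]
    for (i, j), f in stream:
        for layer, H in zip(layers, hashes):
            layer[(H(i), H(j))] += f
    return layers


def estimate_edge(layers, hashes, i, j):
    # Minimum over the w cells that edge (i, j) maps to.
    return min(layer[(H(i), H(j))] for layer, H in zip(layers, hashes))


# Toy setup: two node hashes into h = 4 buckets.
hashes = [lambda x: ((3 * x + 1) % 11) % 4, lambda x: ((5 * x + 2) % 11) % 4]
stream = [((1, 2), 100), ((3, 4), 1), ((1, 2), 50), ((5, 6), 2)]
layers = build_layers(stream, hashes)
estimate = estimate_edge(layers, hashes, 1, 2)
```

Colliding edges can only inflate a cell, so the estimate never falls below the true frequency, and for a heavy edge the inflation is small relative to its own count.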

Heavy-Hitter Edge Query
- Find all edges with frequency greater than F
- No false negatives, but false positives are possible
- Find all hash-edges with frequency at least F
- Use reverse hash mapping to find the real edges
- Take the intersection of the edge sets obtained from the w hash functions
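
A sketch of these steps on toy data. For brevity the reverse hash mapping is done by brute force over a known list of node ids rather than with the extended Euclidean algorithm; that substitution, the function name, and the toy hashes are assumptions.

```python
from collections import Counter, defaultdict


def heavy_hitter_edges(layers, hashes, nodes, F):
    """Candidate edges whose cell count is at least F in every layer."""
    result = None
    for layer, H in zip(layers, hashes):
        # Brute-force reverse hash mapping: bucket -> node ids.
        bucket_nodes = defaultdict(set)
        for x in nodes:
            bucket_nodes[H(x)].add(x)
        candidates = set()
        for (r, c), count in layer.items():
            if count >= F:  # a heavy hash-edge
                candidates.update((i, j)
                                  for i in bucket_nodes[r]
                                  for j in bucket_nodes[c])
        # Intersect the candidate edge sets across layers.
        result = candidates if result is None else result & candidates
    return result or set()


# Toy data: two node hashes into h = 4 buckets.
hashes = [lambda x: ((3 * x + 1) % 11) % 4, lambda x: ((5 * x + 2) % 11) % 4]
stream = [((1, 2), 100), ((3, 4), 1), ((1, 2), 50), ((5, 6), 2)]
layers = [Counter() for _ in hashes]
for (i, j), f in stream:
    for layer, H in zip(layers, hashes):
        layer[(H(i), H(j))] += f

result = heavy_hitter_edges(layers, hashes, nodes=range(1, 7), F=100)
```

A truly heavy edge survives every layer, so it is never dropped; an edge that merely collides with a heavy cell has to collide in all w layers to appear as a false positive.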

Heavy-Hitter Edge Query: Optimization
First Optimization: if a node does not appear as the source node of some potential frequent edge in at least one of the w hash functions, that node and its outgoing edges can be safely eliminated.
Second Optimization

Heavy-Hitter Edge Query: Time Complexity

Reachability Query
- Find if two query nodes are connected by a path whose edges all have frequency at least F
- Determine all edges with frequency at least F using the heavy-hitter edge query
- Answer the reachability query over these edges
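
Once the heavy-hitter edge set is available, the last step is an ordinary graph search. The sketch below uses breadth-first search and treats edges as directed; both choices, and the function name, are assumptions for illustration.

```python
from collections import defaultdict, deque


def reachable(edges, source, target):
    """True if target can be reached from source using only the given edges."""
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


# Hypothetical heavy-hitter edge set returned by the previous step.
heavy = {(1, 2), (2, 3), (4, 5)}
```

False positives among the heavy-hitter edges can only add paths, so errors in this query are claims of reachability that do not hold in the true graph.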

Experimental Results
Friendster Stream (Zipf frequency distribution with varying skew)
- #Nodes: 66 M
- #Edges: 3612 M
- Agg. edge freq.: 10^10
- Max. edge freq.: 4.43 × 10^8 (skew 1.0), 1.81 × 10^9 (skew 1.2), 3.22 × 10^9 (skew 1.4)
- Flat stream size and compressed stream size, in the order listed on the slide: 16.47 GB, 80 GB, 2.37 GB, 250 MB
GMatrix
- Size: 40 MB (h = 1000, w = 10)
- Update time: 10^-6 sec
Experiments were performed on a single core of a 10 GB, 2.4 GHz Xeon server

Edge Frequency Estimation Query
Query over top-500 frequent edges
[Plot: observed error (0 to 7.0E-04) versus skew (Zipf), GMatrix compared with Count-Min Sketch]
[Plot: stream update time in microseconds (0 to 5) versus number of hash functions (w)]

Heavy Hitter Edge Query
[Plot: false positive rate (0 to 10) versus skew (Zipf), GMatrix compared with Count-Min Sketch; frequency threshold = 0.01% of total stream size]
Query Answering Time
Frequency Threshold (% of Total Stream Size)    GMatrix    Count-Min Sketch
1                                               28 sec     1 sec
0.1                                             149 sec    2 sec
0.01                                            771 sec    7 sec

Reachability Query
Frequency threshold = 0.01% of total stream size
Skew (Zipf)    Reachability Error
1.0            0.012
1.2            0.008
1.4            0.004
Each reachability query can be processed in 0.1 sec

Conclusions
- GMatrix synopsis for summarizing rapid graph streams
- Can be leveraged for a variety of frequency and structural queries
- Future work: improving accuracy by hashing high- and low-frequency edges separately?