GraphX: Unifying Table and Graph Analytics. Presented by Joseph Gonzalez. Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica. IPDPS 2014.
Graphs are Central to Analytics. [Figure: pipeline starting from Raw Wikipedia XML. Hyperlinks → PageRank → Top 20 Pages (Title, PR). Text Table (Title, Body) → Term-Doc Graph → Topic Model (LDA) → Word Topics (Word, Topic). Discussion Table (User, Disc.) → Editor Graph → Community Detection → User Community (User, Com.) → Community Topic (Topic, Com.).]
PageRank: Identifying Leaders. R[i] = 0.15 + Σ_j w_ji R[j], where R[i] is the rank of user i and the sum is the weighted sum of the neighbors' ranks. Update ranks in parallel. Iterate until convergence.
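The update on this slide can be sketched in plain Python. This is a minimal illustration, not the speakers' code: the function name `pagerank`, the example graph, the tolerance, and the choice of weights (0.85 divided by the out-degree of the neighbor, matching the GraphX PageRank slide later in the deck) are assumptions.

```python
def pagerank(out_edges, tol=1e-6, max_iters=100):
    """Iterate R[i] = 0.15 + 0.85 * sum_j R[j] / out_degree(j) until ranks settle."""
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iters):
        # Every vertex reads only the previous iteration's ranks, so all
        # updates within one iteration are independent of each other.
        incoming = {v: 0.0 for v in vertices}
        for src, targets in out_edges.items():
            for dst in targets:
                incoming[dst] += ranks[src] / len(targets)
        new_ranks = {v: 0.15 + 0.85 * incoming[v] for v in vertices}
        delta = max(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks

# Hypothetical 3-vertex graph: a -> b, a -> c, b -> c, c -> a
example_ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Because each new rank depends only on the ranks from the previous iteration, the inner loop is the part a graph-parallel system would distribute across machines.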
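Recommending Products. Low-Rank Matrix Factorization: the Netflix ratings matrix (Users × Movies) ≈ User Factors (U) × Movie Factors (M). [Figure: bipartite graph of factor vertices f(1) to f(5) connected by rating edges r13, r14, r24, r25.] Iterate: [Equation: update of f(i) from the neighboring factors f(j).]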
The Graph-Parallel Pattern. Model / Alg. State. Computation depends only on the neighbors.
Many Graph-Parallel Algorithms
MACHINE LEARNING
• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Classification
  – Neural Networks
SOCIAL NETWORK ANALYSIS
• Community Detection
  – Triangle-Counting
  – K-core Decomposition
  – K-Truss
GRAPH ALGORITHMS
• Graph Analytics
  – PageRank
  – Personalized PageRank
  – Shortest Path
  – Graph Coloring
Graph-Parallel Systems: Pregel (Google). Expose specialized APIs to simplify graph programming.
“Think like a Vertex.” - Pregel [SIGMOD’10]
The Pregel (Push) Abstraction
Vertex programs interact by sending messages.
    Pregel_PageRank(i, messages):
        // Receive all the messages
        total = 0
        foreach(msg in messages):
            total = total + msg
        // Update the rank of this vertex
        R[i] = 0.15 + total
        // Send new messages to neighbors
        foreach(j in out_neighbors[i]):
            Send msg(R[i]) to vertex j
Malewicz et al. [PODC’09, SIGMOD’10]
The GraphLab (Pull) Abstraction
Vertex programs directly access adjacent vertices and edges.
    GraphLab_PageRank(i):
        // Compute sum over neighbors
        total = 0
        foreach(j in neighbors(i)):
            total = total + R[j] * wji
        // Update the PageRank
        R[i] = 0.15 + total
[Figure: vertex 1 pulls R[2] * w21 + R[3] * w31 + R[4] * w41 from its neighbors.]
Data movement is managed by the system and not the user.
Iterative Bulk Synchronous Execution Compute Communicate Barrier
Graph-Parallel Systems: Pregel (Google). Expose specialized APIs to simplify graph programming. Exploit graph structure to achieve orders-of-magnitude performance gains over more general…
PageRank on the Live-Journal Graph. Runtime in seconds, PageRank for 10 iterations: Mahout/Hadoop 1340; Naïve Spark 354; GraphLab 22. Spark is 4x faster than Hadoop. GraphLab is 16x faster than Spark.
Triangle Counting on Twitter. 40M users, 1.4 billion links. Counted: 34.8 billion triangles. Hadoop [WWW’11]: 1536 machines, 423 minutes. GraphLab: 64 machines, 15 seconds. 1000x faster. S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11.
Graph Analytics Pipeline. [Figure: the Raw Wikipedia pipeline diagram: XML → tables and graphs → PageRank, Topic Model (LDA), Community Detection → result tables.]
Tables. [Figure: the Raw Wikipedia pipeline diagram, repeated under the heading "Tables".]
Graphs. [Figure: the Raw Wikipedia pipeline diagram, repeated under the heading "Graphs".]
Separate Systems to Support Each View. Table View: a table of rows producing a result. Graph View (e.g., Pregel): a dependency graph.
Having separate systems for each view is difficult to use and inefficient.
Difficult to Program and Use. Users must learn, deploy, and manage multiple systems. This leads to brittle and often complex interfaces.
Inefficient. Extensive data movement and duplication across the network and file system. [Figure: XML data passed between stages through HDFS.] Limited reuse of internal data structures across stages.
GraphX Solution: Tables and Graphs are views of the same physical data. [Figure: Table View and Graph View over the GraphX Unified Representation.] Each view has its own operators that exploit the semantics of the view to achieve efficient execution.
Graphs → Relational Algebra
1. Encode graphs as distributed tables
2. Express graph computation in relational algebra
3. Recast graph systems optimizations as:
   1. Distributed join optimization
   2. Incremental materialized view maintenance
Integrate graph and table data processing systems. Achieve performance parity with specialized systems.
Distributed Graphs as Distributed Tables. [Figure: a property graph on vertices A to F is split into two edge partitions (Part. 1, Part. 2) by a vertex-cut heuristic and stored as three tables: a Vertex Table, an Edge Table, and a Routing Table. Routing Table entries: A → 1, 2; B → 1; C → 1; D → 1, 2; E → 2; F → 2.]
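The table encoding on this slide can be sketched with ordinary Python collections. This is only an illustration: the vertex properties, the edge weights, the specific assignment of edges to the two partitions, and the helper name `build_routing_table` are assumptions, chosen so that the derived routing table agrees with the entries shown on the slide.

```python
# Vertex table: (id, property) rows. The property values are placeholders.
vertex_table = [(v, {"name": v}) for v in "ABCDEF"]

# Edge table: (src, dst, property) rows, split into two partitions.
# Which edge lands in which partition is assumed for illustration.
edge_partitions = {
    1: [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0), ("C", "D", 1.0)],
    2: [("A", "E", 1.0), ("A", "F", 1.0), ("E", "D", 1.0), ("E", "F", 1.0)],
}

def build_routing_table(partitions):
    """Map each vertex id to the edge partitions that reference it."""
    routing = {}
    for part_id, edges in partitions.items():
        for src, dst, _ in edges:
            routing.setdefault(src, set()).add(part_id)
            routing.setdefault(dst, set()).add(part_id)
    return {v: sorted(parts) for v, parts in routing.items()}

routing_table = build_routing_table(edge_partitions)
```

With a vertex cut, a vertex such as A or D that has edges in both partitions appears in the routing table under both, which tells the system where its property must be sent when triplets are built.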
Table Operators. Table operators are inherited from Spark: map, reduce, sample, filter, count, take, groupBy, fold, first, sort, reduceByKey, partitionBy, union, groupByKey, mapWith, join, cogroup, pipe, leftOuterJoin, cross, save, rightOuterJoin, zip, ...
Graph Operators
    class Graph[V, E] {
      def Graph(vertices: Table[(Id, V)],
                edges: Table[(Id, Id, E)])
      // Table Views -----------------
      def vertices: Table[(Id, V)]
      def edges: Table[(Id, Id, E)]
      def triplets: Table[((Id, V), (Id, V), E)]
      // Transformations -------------
      def reverse: Graph[V, E]
      def subgraph(pV: (Id, V) => Boolean,
                   pE: Edge[V, E] => Boolean): Graph[V, E]
      def mapV(m: (Id, V) => T): Graph[T, E]
      def mapE(m: Edge[V, E] => T): Graph[V, T]
      // Joins -----------------------
      def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
      def joinE(tbl: Table[(Id, T)]): Graph[V, (E, T)]
      // Computation -----------------
      def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                     reduceF: (T, T) => T): Graph[T, E]
    }
Triplets Join Vertices and Edges
The triplets operator joins vertices and edges:
    SELECT s.Id, d.Id, s.P, e.P, d.P
    FROM edges AS e
    JOIN vertices AS s, vertices AS d
    ON e.srcId = s.Id AND e.dstId = d.Id
[Figure: Vertices and Edges tables joined into a Triplets table.]
The mrTriplets operator sums adjacent triplets:
    SELECT t.dstId, reduce( map(t) ) AS sum
    FROM triplets AS t GROUP BY t.dstId
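The two queries on this slide can be mimicked over in-memory lists. The sketch below is an assumption-laden illustration: the function names `triplets` and `mr_triplets`, the tuple layout, and the sample data are hypothetical, and nothing here reflects how GraphX executes the join.

```python
def triplets(vertices, edges):
    """Join each edge with the properties of its source and destination vertex."""
    props = dict(vertices)
    return [(src, dst, props[src], e_prop, props[dst])
            for src, dst, e_prop in edges]

def mr_triplets(triplet_rows, map_f, reduce_f):
    """Apply map_f to every triplet, then reduce the emitted values per target id."""
    groups = {}
    for t in triplet_rows:
        for target, value in map_f(t):
            groups[target] = reduce_f(groups[target], value) if target in groups else value
    return groups

# Hypothetical data: vertex property is a number, edge property is a weight.
vertices = [("A", 1.0), ("B", 2.0), ("C", 3.0)]
edges = [("A", "C", 0.5), ("B", "C", 0.25), ("A", "B", 1.0)]

# Send (source property * edge weight) to the destination and sum per vertex.
weighted_in_sum = mr_triplets(
    triplets(vertices, edges),
    lambda t: [(t[1], t[2] * t[3])],
    lambda a, b: a + b,
)
```

Here `map_f` plays the role of `map(t)` in the second query and the per-key reduction plays the role of `GROUP BY t.dstId`.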
Example: Oldest Follower
Calculate the number of older followers for each user?
    val olderFollowerAge = graph.mrTriplets(
        e => // Map
          if (e.src.age < e.dst.age) { (e.srcId, 1) }
          else { Empty },
        (a, b) => a + b // Reduce
      ).vertices
[Figure: example graph with vertices A to F annotated with ages 23, 42, 30, 19, 16, 75.]
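The same map-and-reduce logic can be traced by hand in a few lines of Python. This sketch keeps the predicate exactly as written on the slide (emit to the source when the source is younger than the destination); the edge list and the assignment of ages to vertices are assumptions, since the slide figure does not survive in this transcript.

```python
# Ages come from the slide; which vertex has which age is assumed.
ages = {"A": 30, "B": 42, "C": 23, "D": 19, "E": 75, "F": 16}

# Hypothetical (src, dst) edges.
edges = [("A", "B"), ("C", "B"), ("D", "A"), ("E", "A"),
         ("F", "E"), ("D", "F"), ("D", "B")]

def older_neighbor_counts(ages, edges):
    """Count, per source vertex, the edges whose destination is older."""
    counts = {}
    for src, dst in edges:
        # Map: emit (src, 1) when the source is younger than the destination.
        if ages[src] < ages[dst]:
            # Reduce: sum the emitted ones per vertex id.
            counts[src] = counts.get(src, 0) + 1
    return counts

older_counts = older_neighbor_counts(ages, edges)
```

Vertices for which the map function emits nothing (E in this data) simply do not appear in the result.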
We express enhanced Pregel and GraphLab abstractions using the GraphX operators in less than 50 lines of code!
Enhanced Pregel in GraphX
Require message combiners. Remove message computation from the vertex program.
    pregelPR(i, messageSum):
        // Receive the combined message sum
        total = messageSum
        // Update the rank of this vertex
        R[i] = 0.15 + total
    combineMsg(a, b):
        // Compute sum of two messages
        return a + b
    sendMsg(i→j, R[i], R[j], E[i,j]):
        // Compute single message
        return msg(R[i]/E[i,j])
Malewicz et al. [PODC’09, SIGMOD’10]
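The three-function decomposition on this slide (vertex program, message sender, message combiner) can be sketched as a small bulk-synchronous loop. This is an illustrative single-machine version under assumptions: the function `pregel`, its parameter names, and the rule that only vertices with incoming messages run the vertex program are choices made here, and the PageRank functions follow the formula on the deck's GraphX PageRank slide.

```python
def pregel(vertex_state, edges, vprog, send_msg, combine_msg, initial_msg, iters):
    """Run `iters` bulk-synchronous supersteps over (src, dst, edge_attr) edges."""
    # Superstep 0: every vertex receives the initial message.
    state = {v: vprog(v, attr, initial_msg) for v, attr in vertex_state.items()}
    for _ in range(iters):
        inbox = {}
        for src, dst, e_attr in edges:
            msg = send_msg(state[src], state[dst], e_attr)
            # Combine as messages arrive instead of buffering a list per vertex.
            inbox[dst] = combine_msg(inbox[dst], msg) if dst in inbox else msg
        # Only vertices that received a message run the vertex program.
        state = {v: vprog(v, attr, inbox[v]) if v in inbox else attr
                 for v, attr in state.items()}
    return state

def pagerank_vprog(vertex_id, attr, msg_sum):
    rank, out_degree = attr
    return (0.15 + 0.85 * msg_sum, out_degree)

def pagerank_send(src_attr, dst_attr, edge_attr):
    rank, out_degree = src_attr
    return rank / out_degree

def pagerank_combine(a, b):
    return a + b

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a; state is (rank, out_degree).
pr_edges = [("a", "b", None), ("a", "c", None), ("b", "c", None), ("c", "a", None)]
pr_initial = {"a": (0.0, 2), "b": (0.0, 1), "c": (0.0, 1)}
pr_result = pregel(pr_initial, pr_edges, pagerank_vprog, pagerank_send,
                   pagerank_combine, initial_msg=0.0, iters=10)
```

Because the sender sees one edge at a time and the combiner is a plain binary function, the system rather than the vertex program decides where messages are computed and merged.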
PageRank in GraphX
    // Load and initialize the graph
    val graph = GraphBuilder.text("hdfs://web.txt")
    val prGraph = graph.joinVertices(graph.outDegrees)
    // Implement and run PageRank
    val pageRank =
      prGraph.pregel(initialMessage = 0.0, iter = 10)(
        (oldV, msgSum) => 0.15 + 0.85 * msgSum,
        triplet => triplet.src.pr / triplet.src.deg,
        (msgA, msgB) => msgA + msgB)
Join Elimination
Identify and bypass joins for unused triplet fields.
    sendMsg(i→j, R[i], R[j], E[i,j]):
        // Compute single message
        return msg(R[i]/E[i,j])
[Figure: PageRank on Twitter, communication (MB) per iteration over 20 iterations, comparing Three Way Join with Join Elimination.]
Factor of 2 reduction in communication.
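A rough way to see why skipping an unused triplet field saves communication is to count how many vertex attributes each edge partition needs. The sketch below is a toy model under assumptions: the partition layout and the function `vertex_attrs_shipped` are hypothetical, the set of fields used is declared by hand (how the system detects unused fields is not shown on the slide), and the counts are not meant to reproduce the measured factor of 2.

```python
def vertex_attrs_shipped(edge_partitions, fields_used):
    """Count vertex attributes sent to edge partitions to build triplets."""
    shipped = 0
    for edges in edge_partitions.values():
        needed = set()
        for src, dst in edges:
            if "src" in fields_used:
                needed.add(src)
            if "dst" in fields_used:
                needed.add(dst)
        # Each distinct vertex is shipped once per partition that needs it.
        shipped += len(needed)
    return shipped

# Hypothetical two-partition edge table of (src, dst) pairs.
partitions = {
    1: [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")],
    2: [("A", "E"), ("A", "F"), ("E", "D"), ("E", "F")],
}

# Full three-way join: both endpoint attributes are materialized.
three_way = vertex_attrs_shipped(partitions, {"src", "dst"})
# The PageRank sendMsg reads only R[i] and E[i,j], so the destination join is skipped.
src_only = vertex_attrs_shipped(partitions, {"src"})
```

In this toy layout the full join ships 8 vertex attributes and the source-only join ships 5.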
We express the Pregel and GraphLab-like abstractions using the GraphX operators in less than 50 lines of code! By composing these operators we can construct entire graph-analytics pipelines.
Example Analytics Pipeline
    // Load raw data tables
    val verts = sc.textFile("hdfs://users.txt").map(parserV)
    val edges = sc.textFile("hdfs://follow.txt").map(parserE)
    // Build the graph from tables and restrict to recent links
    val graph = new Graph(verts, edges)
    val recent = graph.subgraph(edge => edge.date > LAST_MONTH)
    // Run PageRank algorithm on the restricted graph
    val pr = recent.PageRank(tol = 1.0e-5)
    // Extract and print the top 25 users
    val topUsers = verts.join(pr).top(25).collect
    topUsers.foreach(u => println(u.name + "\t" + u.pr))
The GraphX Stack (Lines of Code). Algorithms: PageRank (5), Connected Comp. (10), Shortest Path (10), SVD (40), ALS (40), K-core (51), Triangle Count (45), LDA (120). Abstractions: Pregel (28) + GraphLab (50). Built on: GraphX (3575), which runs on Spark.
Performance Comparisons. Live-Journal: 69 million edges. Runtime in seconds, PageRank for 10 iterations: Mahout/Hadoop 1340; Naïve Spark 354; Giraph 207; GraphX 68; GraphLab 22. GraphX is roughly 3x slower than GraphLab.
GraphX scales to larger graphs. Twitter graph: 1.5 billion edges. Runtime in seconds, PageRank for 10 iterations: Giraph 749; GraphX 451; GraphLab 203. GraphX is roughly 2x slower than GraphLab: » Scala + Java overhead: lambdas, GC time, … » No shared-memory parallelism: 2x increase in comm.
PageRank is just one stage… What about a pipeline?
A Small Pipeline in GraphX. [Figure: Raw Wikipedia XML (HDFS) → Spark Preprocess → Hyperlinks → PageRank Compute → Spark Post. → Top 20 Pages (HDFS).] Total runtime in seconds, timed end-to-end: 1492 (bar label not recovered); Giraph + Spark 605; GraphX 342; GraphLab + Spark 375. GraphX is faster than…
Status: Part of Apache Spark. In production at several large technology companies.
GraphX: Unified Analytics. New API: blurs the distinction between tables and graphs. New system: combines data-parallel and graph-parallel systems. Enabling users to easily and efficiently express the entire graph analytics pipeline.
A Case for Algebra in Graphs
A standard algebra is essential for graph systems:
• e.g.: SQL → proliferation of relational systems
By embedding graphs in relational algebra:
• Integration with tables and preprocessing
• Leverage advances in relational systems
• Graph opt. recast to relational systems
Thanks! http://amplab.cs.berkeley.edu/projects/graphx/ ankurd@eecs.berkeley.edu crankshaw@eecs.berkeley.edu rxin@eecs.berkeley.edu jegonzal@eecs.berkeley.edu