Graph Analytics on Massive Collections of Small Graphs

- Slides: 25
Graph Analytics on Massive Collections of Small Graphs
Dritan Bleco (dritanbleco@aueb.gr), Yannis Kotidis (kotidis@aueb.gr)
Department of Informatics, Athens University of Economics and Business
EDBT 2014 - Athens

Outline
• Motivation
• Graph Records & Queries
• Storage of Graph Records and Indexing using a Column Store
• Graph View Materialization
• Selection of Graph Views
• Extensions
• Experiments
• Conclusions

Motivational Example
• Focus on small graphs that are generated continuously
  – Examples: data from CRM, WMS and SCM applications
• Difference between our targeted applications and other applications of graphs (e.g. social web, biology)
  – Not a single massive graph but a massive collection of smaller graphs
  – Nodes/edges are mapped to real-world entities
    • Thus, no need for isomorphism discovery

Framework Overview
• Our framework combines three techniques
  – A column-oriented relational backend that permits a flat description of the graph records
    • Avoids the recursion and costly joins that path calculations require in a straightforward relational implementation
  – A very efficient indexing mechanism using bitmap columns
    • Analogous to the bitmap indexes frequently used in data warehouses
    • This model is generic and can accommodate specialized graph indexes (for example the gIndex)
  – A framework that permits the creation and reuse of materialized graph views of different types
    • These views improve query times, especially for aggregation queries

[Figure: delivery network with nodes A to K grouped into Production Lines, Hubs (Region 1 and Region 2) and Customer Locations; edges are either Own Routes or Leased Routes]
QUERIES
• Delivery time for products shipped via the [A, D, E, G, I] path
• Delivery cost for products shipped using Leased Routes
• The longest delay for products shipped from Region 1 to Location I via Hubs of Region 2

Primitive Query Types
• Graph Queries
  – Find records that contain a given query graph Gq
  – The result is the record id together with the measures of each matching record
  – For example, return delivery times along all hops in [A, D, E, G, I]
• Aggregate Graph Queries
  – A Graph Query Gq with the addition of a user-defined aggregate function f
  – The result is the aggregation of the measures along all maximal paths (paths connecting source and terminal nodes in Gq)
  – E.g. total delivery time for all shipments via [A, D, E, G, I]

Graph Queries
[Figure: three example graph records over nodes A to G; each edge is labeled "edge id: measure"]
  Record 1: AB 1:3, AC 2:4, CE 3:2, AD 4:1, DE 5:2
  Record 2: AC 2:1, CE 3:2, AD 4:2, DE 5:3, EF 6:4, FG 7:1
  Record 3: AD 4:5, DE 5:4, EF 6:3, FG 7:1

  Edge  AB  AC  CE  AD  DE  EF  FG
  Id    1   2   3   4   5   6   7

Find records that follow path [ACEF]
Result: r2, AC: 1, CE: 2, EF: 4 (record id, related measures)

Graph Aggregate Queries
[Figure: three example graph records over nodes A to G; each edge is labeled "edge id: measure"]
  Record 1: AB 1:3, AC 2:4, CE 3:2, AD 4:1, DE 5:2
  Record 2: AC 2:1, CE 3:2, AD 4:2, DE 5:3, EF 6:4, FG 7:1
  Record 3: AD 4:5, DE 5:4, EF 6:3, FG 7:1

  Edge  AB  AC  CE  AD  DE  EF  FG
  Id    1   2   3   4   5   6   7

Find records and the total (sum) cost for path [ADEF]
Result: r2, ADEF: 9
        r3, ADEF: 12
(record id, aggregated measures)

Storage Model
[Figure: the three example graph records; each edge is labeled "edge id: measure"]

  Edge  AB  AC  CE  AD  DE  EF  FG
  Id    1   2   3   4   5   6   7

Each record becomes one row, with one measure column per edge id (Null when the edge is absent):

  recId  m1    m2    m3    m4  m5  m6    m7
  1      3     4     2     1   2   Null  Null
  2      Null  1     2     2   3   4     1
  3      Null  Null  Null  5   4   3     1

Bitmap Columns – a simple index
[Figure: the three example graph records; each edge is labeled "edge id: measure"]

  Edge  AB  AC  CE  AD  DE  EF  FG
  Id    1   2   3   4   5   6   7

  recId  m1    m2    m3    m4  m5  m6    m7    b1 b2 b3 b4 b5 b6 b7
  1      3     4     2     1   2   Null  Null  1  1  1  1  1  0  0
  2      Null  1     2     2   3   4     1     0  1  1  1  1  1  1
  3      Null  Null  Null  5   4   3     1     0  0  0  1  1  1  1
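
The flat layout with bitmap columns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flatten` and the dict-based row representation are hypothetical, and `None` stands in for Null. The edge ids and measures are the ones from the example records.

```python
# Edge ids as listed on the slide.
EDGE_IDS = {"AB": 1, "AC": 2, "CE": 3, "AD": 4, "DE": 5, "EF": 6, "FG": 7}

# The three example records as {edge id: measure}.
RECORDS = {
    1: {1: 3, 2: 4, 3: 2, 4: 1, 5: 2},
    2: {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 1},
    3: {4: 5, 5: 4, 6: 3, 7: 1},
}


def flatten(rec_id, edges, n_edges=len(EDGE_IDS)):
    """Return one flat row: recid, m1..mN (None if absent), b1..bN (0 or 1)."""
    row = {"recid": rec_id}
    for i in range(1, n_edges + 1):
        row[f"m{i}"] = edges.get(i)            # None plays the role of Null
        row[f"b{i}"] = 1 if i in edges else 0  # bitmap: is edge i present?
    return row


# One row per graph record (hypothetical in-memory stand-in for the relation).
TABLE = [flatten(rec_id, edges) for rec_id, edges in RECORDS.items()]
```

Each bitmap column is fully determined by the measures (bit i is set exactly when measure i is not Null), so it acts as an index rather than as extra data.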

Queries using Bitmap Columns
[Figure: query graph over nodes A to G]

  Edge  AB  AC  CE  AD  DE  EF  FG
  Id    1   2   3   4   5   6   7

Graph Query: get the cost (delay) of each edge on the [ACEF] path
  Select recid, m2, m3, m6 where b2=1 AND b3=1 AND b6=1

Graph Aggregate Query: get the total cost (delay) of the [ACEF] path
  Select recid, m2 + m3 + m6 where b2=1 AND b3=1 AND b6=1

  recId  m1    m2    m3    m4  m5  m6    m7    b1 b2 b3 b4 b5 b6 b7
  1      3     4     2     1   2   Null  Null  1  1  1  1  1  0  0
  2      Null  1     2     2   3   4     1     0  1  1  1  1  1  1
  3      Null  Null  Null  5   4   3     1     0  0  0  1  1  1  1
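
To try the two queries on the example data, the following sketch loads the three rows into an in-memory SQLite database. SQLite is a row store, so this only demonstrates the query logic, not the column-store behavior the talk relies on. The table name `graph_records` is an assumption, since the slide omits the FROM clause.

```python
import sqlite3

# recid, m1..m7, b1..b7 for the three example records (None = Null).
ROWS = [
    (1, 3, 4, 2, 1, 2, None, None, 1, 1, 1, 1, 1, 0, 0),
    (2, None, 1, 2, 2, 3, 4, 1, 0, 1, 1, 1, 1, 1, 1),
    (3, None, None, None, 5, 4, 3, 1, 0, 0, 0, 1, 1, 1, 1),
]

cols = ["recid"] + [f"m{i}" for i in range(1, 8)] + [f"b{i}" for i in range(1, 8)]
conn = sqlite3.connect(":memory:")
# "graph_records" is a hypothetical table name.
conn.execute(
    f"CREATE TABLE graph_records ({', '.join(c + ' INTEGER' for c in cols)})"
)
conn.executemany(
    f"INSERT INTO graph_records VALUES ({', '.join('?' * len(cols))})", ROWS
)

# Graph query: per-edge measures of records containing path [ACEF].
graph_query = conn.execute(
    "SELECT recid, m2, m3, m6 FROM graph_records "
    "WHERE b2 = 1 AND b3 = 1 AND b6 = 1"
).fetchall()

# Aggregate graph query: total measure along path [ACEF].
aggregate_query = conn.execute(
    "SELECT recid, m2 + m3 + m6 FROM graph_records "
    "WHERE b2 = 1 AND b3 = 1 AND b6 = 1"
).fetchall()

print(graph_query)      # [(2, 1, 2, 4)]
print(aggregate_query)  # [(2, 7)]
```

Only record 2 contains all of AC, CE and EF, which matches the result shown on the Graph Queries slide.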

Graph View Materialization
• Materialized Graph Views
  – Used for Graph Queries / Aggregate Graph Queries
  – Implemented as bitmaps that result from ANDing the edge bitmaps of a subgraph derived (by our techniques) from a set of graph queries
  – These bitmaps are added as new columns in the database
• Materialized Aggregate Graph Views
  – Used for Graph Queries / Aggregate Graph Queries
  – A bitmap (as in a Graph View) plus pre-computed aggregates
    • The bitmap is the corresponding materialized Graph View
    • The aggregates are derived from the measures stored in the graph records

Materialized Graph Views
[Figure: query graph over nodes A to G]

  Edge  AB  AC  CE  AD  DE  EF  FG
  Id    1   2   3   4   5   6   7

Q1 = Get the cost (delay) of each edge on the [ACEF] path
  Select recid, m2, m3, m6 where bq1=1
  (instead of: where b2=1 AND b3=1 AND b6=1)

Materialized View for Q1: bq1 = b2 AND b3 AND b6

  recId  m1    m2    m3    m4  m5  m6    m7    b1 b2 b3 b4 b5 b6 b7  bq1
  1      3     4     2     1   2   Null  Null  1  1  1  1  1  0  0   0
  2      Null  1     2     2   3   4     1     0  1  1  1  1  1  1   1
  3      Null  Null  Null  5   4   3     1     0  0  0  1  1  1  1   0

Materialized Aggregate Views
[Figure: query graph over nodes A to G]

  Edge  AB  AC  CE  AD  DE  EF  FG
  Id    1   2   3   4   5   6   7

Q1 = Get the total cost of the [ACEF] path
  Select recid, mq1 where bq1=1
  (instead of: Select recid, m2 + m3 + m6 where b2=1 AND b3=1 AND b6=1)

Path Aggregated Q1: bq1 = b2 AND b3 AND b6
                    mq1 = m2 + m3 + m6

  recId  m1    m2    m3    m4  m5  m6    m7    mq1   b1 b2 b3 b4 b5 b6 b7  bq1
  1      3     4     2     1   2   Null  Null  Null  1  1  1  1  1  0  0   0
  2      Null  1     2     2   3   4     1     7     0  1  1  1  1  1  1   1
  3      Null  Null  Null  5   4   3     1     Null  0  0  0  1  1  1  1   0

[Figure: query graph over nodes A to G]

  Edge  AB  AC  CE  AD  DE  EF  FG
  Id    1   2   3   4   5   6   7

Another query can use the materialization of Q1
Q2 = Get the total cost (delay) of the [ACEFG] path
  Select recid, mq1 + m7 where bq1=1 AND b7=1
  (instead of: Select recid, m2 + m3 + m6 + m7 where b2=1 AND b3=1 AND b6=1 AND b7=1)

Aggregated Q1: bq1 = b2 AND b3 AND b6
               mq1 = m2 + m3 + m6

  recId  m1    m2    m3    m4  m5  m6    m7    mq1   b1 b2 b3 b4 b5 b6 b7  bq1
  1      3     4     2     1   2   Null  Null  Null  1  1  1  1  1  0  0   0
  2      Null  1     2     2   3   4     1     7     0  1  1  1  1  1  1   1
  3      Null  Null  Null  5   4   3     1     Null  0  0  0  1  1  1  1   0
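
The reuse of Q1's materialization by Q2 can be checked with a small SQLite sketch. As before, SQLite is a row store and the table name `graph_records` is an assumption; only the columns that Q1 and Q2 touch are loaded, and the view columns `bq1` and `mq1` are added with plain ALTER TABLE and UPDATE statements rather than with any mechanism described in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding only the columns used by Q1 and Q2.
conn.execute(
    "CREATE TABLE graph_records (recid INTEGER, m2 INTEGER, m3 INTEGER, "
    "m6 INTEGER, m7 INTEGER, b2 INTEGER, b3 INTEGER, b6 INTEGER, b7 INTEGER)"
)
conn.executemany(
    "INSERT INTO graph_records VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 4, 2, None, None, 1, 1, 0, 0),
        (2, 1, 2, 4, 1, 1, 1, 1, 1),
        (3, None, None, 3, 1, 0, 0, 1, 1),
    ],
)

# Materialize the graph view bq1 = b2 AND b3 AND b6 as a new bitmap column.
conn.execute("ALTER TABLE graph_records ADD COLUMN bq1 INTEGER")
conn.execute("UPDATE graph_records SET bq1 = b2 & b3 & b6")

# Materialize the aggregate mq1 = m2 + m3 + m6 for matching records only.
conn.execute("ALTER TABLE graph_records ADD COLUMN mq1 INTEGER")
conn.execute("UPDATE graph_records SET mq1 = m2 + m3 + m6 WHERE bq1 = 1")

# Q2 over path [ACEFG], answered from the view plus one extra edge.
q2_with_view = conn.execute(
    "SELECT recid, mq1 + m7 FROM graph_records WHERE bq1 = 1 AND b7 = 1"
).fetchall()

# The same query answered from the base columns, for comparison.
q2_direct = conn.execute(
    "SELECT recid, m2 + m3 + m6 + m7 FROM graph_records "
    "WHERE b2 = 1 AND b3 = 1 AND b6 = 1 AND b7 = 1"
).fetchall()

print(q2_with_view)  # [(2, 8)]
print(q2_direct)     # [(2, 8)]
```

With the view, Q2 reads two bitmaps and two measure columns instead of four of each, which is the saving the slide points at.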

Re-use of materialized graph views
• See our past work "Business Intelligence on Complex Graph Data", BEWEB, Berlin, Germany, March 2012
  – How to formulate complex graph expressions using a set of intuitive operators we define
• How to best answer a user query using materialized (Aggregate or not) Graph Views?
  – A simple cost model based on the number of bitmaps required for answering a query
  – Mapped to a set cover problem
  – Solved via a greedy algorithm
  – Details are in the paper
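
A generic greedy set cover conveys the idea of minimizing the number of bitmaps per query. The sketch below is not the algorithm from the paper: the function `greedy_cover`, the view names, and the rule that only views fully contained in the query are usable are all assumptions made for illustration.

```python
def greedy_cover(query_edges, views):
    """Pick bitmaps (views or single edges) whose AND answers the query.

    query_edges: set of edge ids in the query graph.
    views: dict mapping a view name to the set of edge ids it ANDs together.
    Returns a list of bitmap names, e.g. ["bq1", "b7"].
    """
    # Assumption: a view is usable only if all of its edges are in the query;
    # a view with an extra edge would filter out records the query wants.
    usable = {name: edges for name, edges in views.items() if edges <= query_edges}
    uncovered = set(query_edges)
    chosen = []
    while uncovered:
        # Greedy step: the view covering the most still-uncovered edges.
        best = max(usable, key=lambda n: len(usable[n] & uncovered), default=None)
        # A view covering at most one new edge is no better than that edge's
        # own bitmap, so stop and fall back to single-edge bitmaps.
        if best is None or len(usable[best] & uncovered) <= 1:
            break
        chosen.append(best)
        uncovered -= usable[best]
    # Remaining edges are covered by their base bitmap columns.
    chosen.extend(f"b{e}" for e in sorted(uncovered))
    return chosen


# Query [ACEFG] uses edges 2, 3, 6, 7; bq1 covers edges 2, 3, 6.
views = {"bq1": {2, 3, 6}, "bq_other": {4, 5, 6}}
print(greedy_cover({2, 3, 6, 7}, views))  # ['bq1', 'b7']
print(greedy_cover({4, 5}, views))        # ['b4', 'b5']
```

Under this cost model the first query needs two bitmaps instead of four, and a query that no view fits falls back to its base bitmap columns.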

What to materialize?
• Aggressive materialization: materialize whole queries
  – Often not possible due to space limitations
• Our approach: Query Driven Graph View Selection
• First need to derive a set of candidate views
  – Naïve approach: consider all subsets of the edges in the union of all query graphs
    • Exponential number of candidates (thus not feasible)
    • Many redundant views
  – Intuition: prune candidates based on a monotonicity property

Candidate Generation
[Figure: union of the query graphs over nodes A, B, C, D, E, F, G, H, J]
Frequent Query Set: {[ACEFGHJ], [ADEFGHJ]}
Based on the monotonicity property, we only consider the following candidates:
1. Each query graph: + {[ACEFGHJ], [ADEFGHJ]}
2. All the subgraphs that are the intersection of two query graphs: + {[EFGHJ]}
3. All the subgraphs that are the intersection of two graphs of the previous step, repeated until no new views are created
The selection of views from the candidate set is mapped to a set cover problem
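
The three candidate generation steps amount to closing the query set under pairwise intersection. A minimal sketch, assuming each query graph is represented as a set of edges; the helper names `path_edges` and `candidate_views` are hypothetical, and the sketch does not check whether an intersection is a connected subgraph.

```python
from itertools import combinations


def path_edges(path):
    """Edges of a path given as a node string, e.g. "ACE" -> {"AC", "CE"}."""
    return frozenset(a + b for a, b in zip(path, path[1:]))


def candidate_views(query_graphs):
    """Close a set of query graphs (edge sets) under pairwise intersection."""
    candidates = set(query_graphs)
    while True:
        # Intersect every pair of current candidates and keep what is new.
        new = {g1 & g2 for g1, g2 in combinations(candidates, 2)} - candidates
        new.discard(frozenset())  # an empty intersection is not a view
        if not new:
            return candidates
        candidates |= new


queries = [path_edges("ACEFGHJ"), path_edges("ADEFGHJ")]
views = candidate_views(queries)
print(len(views))                   # 3
print(path_edges("EFGHJ") in views)  # True
```

For the frequent query set on the slide this yields the two query graphs plus their common subpath [EFGHJ], and the loop stops because no further intersection is new.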

Extensions
All data can be stored in a single relation:

  recId  m1    m2    m3    m4  m5  m6    m7    b1 b2 b3 b4 b5 b6 b7
  1      3     4     2     1   2   Null  Null  1  1  1  1  1  0  0
  2      Null  1     2     2   3   4     1     0  1  1  1  1  1  1
  3      Null  Null  Null  5   4   3     1     0  0  0  1  1  1  1

But the data can also be partitioned into more than one relation:

  recId  m1    m2    m3    b1 b2 b3
  1      3     4     2     1  1  1
  2      Null  1     2     0  1  1
  3      Null  Null  Null  0  0  0

  recId  m4  m5  m6    m7    b4 b5 b6 b7
  1      1   2   Null  Null  1  1  0  0
  2      2   3   4     1     1  1  1  1
  3      5   4   3     1     1  1  1  1

The framework can easily incorporate specialized graph indexes (for example the gIndex)

Experiments
• Graph records from two datasets
  1. NY*: depicts New York roads
  2. Gnutella**: describes connections among Gnutella hosts from August 2002
• Experimental evaluation among 4 systems
  – Commercial Row Store Relational DB
  – Column Store Relational DB
  – Neo4j
  – Commercial Native RDF DB
* http://www.dis.uniroma1.it/~challenge9/download.shtml
** http://snap.stanford.edu/data/p2p-Gnutella05.html

Comparison to alternative Systems (no views)
• Our system provides almost constant query times as the graph query size increases, because fewer records are retrieved (even though more bitmaps are used)
• The column store is not affected by increasing density (% of edges in a record)

Benefit of Using Graph Views
[Figures: runtime for 100 uniform Graph Queries; runtime for 100 uniform Aggregate Graph Queries]
• Graph views provide savings of up to 32% in query times
  – There is a mandatory cost for fetching the records that is not affected by materialization
• Thus, more savings are seen in aggregate queries
  – Using 100 aggregate graph views reduces the execution time by 89%
• Larger gains when queries exhibit skew (graphs in the paper)

Using Additional Indexes
[Figures: gIndex in 100 uniform Graph Queries; gIndex in 100 uniform Aggregate Graph Queries]
• gIndex (record driven): trained the index using records that are part of the query result set
  – It took about 24 hours to process about 100,000 records
• Graph views (query driven) result in query processing that is up to 6 times faster
  – It ran in less than one second

Conclusions
• Presented a framework where both data and queries are modeled as abstract graph structures
  – Abstracted two primitive query graphs
  – Introduced two types of Graph Views for expediting queries
  – Discussed an efficient mechanism for selecting a set of non-redundant views
  – Answered queries using Graph Views by solving an instance of a set cover problem
• Argued for a simple yet effective representation of graph records using a flat relational model implemented in a column store
  – Introduced bitmap indexes for efficient query processing
  – Graph Views are stored within the same relational schema
• Presented experimental results using datasets consisting of hundreds of millions of graph records
  – Experimental results show that our platform is orders of magnitude faster than
    • A straightforward relational implementation
    • Alternative systems that natively handle graph data

Thank you. Questions?