Graph Analytics on Massive Collections of Small Graphs

Dritan Bleco (dritanbleco@aueb.gr), Yannis Kotidis (kotidis@aueb.gr)
Department of Informatics, Athens University of Economics and Business
EDBT 2014 - Athens

Outline
• Motivation
• Graph Records & Queries
• Storage of Graph Records and Indexing using a Column Store
• Graph View Materialization
• Selection of Graph Views
• Extensions
• Experiments
• Conclusions

Motivational Example
• Focus on small graphs that are generated continuously
  – Examples: data from CRM, WMS and SCM applications
• Difference between our targeted applications and other applications of graphs (e.g. social web, biology)
  – Not a single massive graph but a massive collection of smaller graphs
  – Nodes/edges are mapped to real-world entities
    • Thus, no need for isomorphism discovery

Framework Overview
• Our framework puts together three different techniques
  – A column-oriented relational backend that permits a flat description of the graph records
    • Avoids the recursion and costly joins that path calculations require in a straightforward relational implementation
  – A very efficient indexing mechanism using bitmap columns
    • Analogous to the bitmap indexes frequently used in data warehouses
    • The model is generic and can accommodate specialized graph indexes (for example gIndex)
  – A framework that permits the creation and reuse of materialized graph views of different types
    • These views improve query times, especially for aggregation queries

[Figure: network of production lines, hubs and customer locations (nodes A to K) spread over Region 1 and Region 2 and connected by own routes and leased routes]
Queries
• Delivery time for products shipped via the [A, D, E, G, I] path
• Delivery cost for products shipped using leased routes
• The longest delay for products shipped from Region 1 to Location I via hubs of Region 2

Primitive Query Types
• Graph Queries
  – Find records that contain a given query graph Gq
  – The result is the record id together with the respective measures of each matching record
  – For example, return the delivery times along all hops in [A, D, E, G, I]
• Aggregate Graph Queries
  – A Graph Query Gq with the addition of a user-defined aggregate function f
  – The result is the aggregation of the measures along all maximal paths (paths connecting source and terminal nodes in Gq)
  – E.g. total delivery time for all shipments via [A, D, E, G, I]

Graph Queries
[Figure: three graph records over nodes A to G; each edge is labeled "edge id: measure"]
Edge ids: AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7
Record 1: AB 3, AC 4, CE 2, AD 1, DE 2
Record 2: AC 1, CE 2, AD 2, DE 3, EF 4, FG 1
Record 3: AD 5, DE 4, EF 3, FG 1
Find records that follow path [ACEF]
Result: r2, AC: 1, CE: 2, EF: 4 (record id, related measures)

Graph Aggregate Queries
[Figure: the same three graph records and edge ids as on the Graph Queries slide]
Find records and the total (sum) cost for path [ADEF]
Result (record id, aggregated measures):
  r2, ADEF: 9
  r3, ADEF: 12

Storage Model
[Figure: the three graph records, with edge ids AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7]
Each record becomes one row; column m<i> holds the measure of edge i, or Null when the record lacks that edge.

rec Id  m1    m2    m3    m4  m5  m6    m7
1       3     4     2     1   2   Null  Null
2       Null  1     2     2   3   4     1
3       Null  Null  Null  5   4   3     1
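The flat layout can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `EDGE_IDS`, `RECORDS`, `flatten` and `to_columns` are hypothetical, the data is copied from the slide, and Python's `None` stands in for Null.

```python
# Edge-id mapping from the slide: edge i is stored in measure column m<i>.
EDGE_IDS = {"AB": 1, "AC": 2, "CE": 3, "AD": 4, "DE": 5, "EF": 6, "FG": 7}

# The three graph records from the slide, as {edge: measure}.
RECORDS = {
    1: {"AB": 3, "AC": 4, "CE": 2, "AD": 1, "DE": 2},
    2: {"AC": 1, "CE": 2, "AD": 2, "DE": 3, "EF": 4, "FG": 1},
    3: {"AD": 5, "DE": 4, "EF": 3, "FG": 1},
}


def flatten(record):
    """Turn one graph record into the row [m1, ..., m7]; None means Null."""
    row = [None] * len(EDGE_IDS)
    for edge, measure in record.items():
        row[EDGE_IDS[edge] - 1] = measure
    return row


def to_columns(records):
    """Column-store view: one list per measure column, ordered by record id."""
    rec_ids = sorted(records)
    rows = [flatten(records[r]) for r in rec_ids]
    columns = {
        f"m{i}": [row[i - 1] for row in rows]
        for i in range(1, len(EDGE_IDS) + 1)
    }
    return rec_ids, columns


rec_ids, columns = to_columns(RECORDS)
print(flatten(RECORDS[1]))  # [3, 4, 2, 1, 2, None, None]
print(columns["m6"])        # [None, 4, 3]
```

Because every edge has a fixed column, a path lookup reads a handful of columns instead of joining an edge table with itself once per hop.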

Bitmap Columns – a simple index
[Figure: the three graph records, with edge ids AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7]
Bitmap column b<i> is 1 when the record contains edge i and 0 otherwise.

rec Id  m1    m2    m3    m4  m5  m6    m7    b1  b2  b3  b4  b5  b6  b7
1       3     4     2     1   2   Null  Null  1   1   1   1   1   0   0
2       Null  1     2     2   3   4     1     0   1   1   1   1   1   1
3       Null  Null  Null  5   4   3     1     0   0   0   1   1   1   1

Queries using Bitmap Columns
[Figure: query graph over nodes A to G; edge ids AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7]
Graph Query: get the delay cost of each edge on the [ACEF] path
  SELECT recid, m2, m3, m6 WHERE b2 = 1 AND b3 = 1 AND b6 = 1
Graph Aggregate Query: get the total delay cost of the [ACEF] path
  SELECT recid, m2 + m3 + m6 WHERE b2 = 1 AND b3 = 1 AND b6 = 1
[Table: the relation with measure columns m1 to m7 and bitmap columns b1 to b7 from the Bitmap Columns slide]
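The two query types can be mimicked in plain Python to check the results quoted on the earlier slides. This is a sketch under stated assumptions rather than the system's query engine: `TABLE`, `path_edge_ids`, `graph_query` and `aggregate_graph_query` are hypothetical names, node labels are assumed to be single characters, and the bitmap is derived on the fly instead of being stored as columns.

```python
EDGE_IDS = {"AB": 1, "AC": 2, "CE": 3, "AD": 4, "DE": 5, "EF": 6, "FG": 7}

# rec id -> [m1, ..., m7], with None standing in for Null (data from the slides).
TABLE = {
    1: [3, 4, 2, 1, 2, None, None],
    2: [None, 1, 2, 2, 3, 4, 1],
    3: [None, None, None, 5, 4, 3, 1],
}


def bitmap(row):
    """b<i> = 1 when edge i is present in the record, else 0."""
    return [0 if m is None else 1 for m in row]


def path_edge_ids(path):
    """Map a path such as 'ACEF' to edge ids [2, 3, 6] (single-letter nodes)."""
    return [EDGE_IDS[path[i:i + 2]] for i in range(len(path) - 1)]


def graph_query(path):
    """Record id and per-edge measures for every record containing the path."""
    ids = path_edge_ids(path)
    result = {}
    for rec_id, row in TABLE.items():
        bits = bitmap(row)
        if all(bits[i - 1] for i in ids):  # b2 = 1 AND b3 = 1 AND b6 = 1
            result[rec_id] = [row[i - 1] for i in ids]
    return result


def aggregate_graph_query(path, f=sum):
    """Apply the aggregate f to the measures of each matching record."""
    return {rec_id: f(ms) for rec_id, ms in graph_query(path).items()}


print(graph_query("ACEF"))            # {2: [1, 2, 4]}
print(aggregate_graph_query("ADEF"))  # {2: 9, 3: 12}
```

The printed values agree with the results on the Graph Queries and Graph Aggregate Queries slides (r2 for [ACEF]; r2 = 9 and r3 = 12 for [ADEF]).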

Graph View Materialization
• Materialized Graph Views
  – Used for Graph Queries / Aggregate Graph Queries
  – Implemented as bitmaps resulting from ANDing the edges of a subgraph derived (by our techniques) from a set of graph queries
  – These bitmaps are added as new columns in the database
• Materialized Aggregate Graph Views
  – Used for Graph Queries / Aggregate Graph Queries
  – A bitmap (as in a Graph View) plus pre-computed aggregates
    • The bitmap is the corresponding materialized Graph View
    • The aggregates are derived from the measures stored in the graph records

Materialized Graph Views
[Figure: query graph over nodes A to G; edge ids AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7]
Q1 = Get the delay cost of the [ACEF] path
  SELECT recid, m2, m3, m6 WHERE bq1 = 1   (equivalent to b2 = 1 AND b3 = 1 AND b6 = 1)
Materialized view for Q1: bq1 = b2 AND b3 AND b6

rec Id  m1    m2    m3    m4  m5  m6    m7    b1  b2  b3  b4  b5  b6  b7  bq1
1       3     4     2     1   2   Null  Null  1   1   1   1   1   0   0   0
2       Null  1     2     2   3   4     1     0   1   1   1   1   1   1   1
3       Null  Null  Null  5   4   3     1     0   0   0   1   1   1   1   0

Materialized Aggregate Views
[Figure: query graph over nodes A to G; edge ids AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7]
Q1 = Get the total cost of the [ACEF] path
  SELECT recid, mq1 WHERE bq1 = 1   (equivalent to SELECT recid, m2 + m3 + m6 WHERE b2 = 1 AND b3 = 1 AND b6 = 1)
Aggregated path view for Q1: bq1 = b2 AND b3 AND b6, mq1 = m2 + m3 + m6

rec Id  m1    m2    m3    m4  m5  m6    m7    mq1   b1  b2  b3  b4  b5  b6  b7  bq1
1       3     4     2     1   2   Null  Null  Null  1   1   1   1   1   0   0   0
2       Null  1     2     2   3   4     1     7     0   1   1   1   1   1   1   1
3       Null  Null  Null  5   4   3     1     Null  0   0   0   1   1   1   1   0

[Figure: query graph over nodes A to G; edge ids AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7]
Another query can use the materialization of Q1
Q2 = Get the total delay cost of the [ACEFG] path
  SELECT recid, mq1 + m7 WHERE bq1 = 1 AND b7 = 1
  (equivalent to SELECT recid, m2 + m3 + m6 + m7 WHERE b2 = 1 AND b3 = 1 AND b6 = 1 AND b7 = 1)
Aggregated view for Q1: bq1 = b2 AND b3 AND b6, mq1 = m2 + m3 + m6
[Table: the relation with the added mq1 and bq1 columns from the Materialized Aggregate Views slide]
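A small Python sketch makes the reuse concrete: the view columns bq1 and mq1 are computed once, and Q2 then touches two bitmaps and two measure columns instead of four of each. The helper names (`and_columns`, `sum_where`, `answer_q2`) are hypothetical, columns are plain lists ordered by record id, and `None` stands in for Null.

```python
REC_IDS = [1, 2, 3]

# Measure and bitmap columns needed for Q1 and Q2 (values from the slides).
M = {
    "m2": [4, 1, None],
    "m3": [2, 2, None],
    "m6": [None, 4, 3],
    "m7": [None, 1, 1],
}
B = {
    "b2": [1, 1, 0],
    "b3": [1, 1, 0],
    "b6": [0, 1, 1],
    "b7": [0, 1, 1],
}


def and_columns(*cols):
    """Row-wise AND of bitmap columns."""
    return [int(all(bits)) for bits in zip(*cols)]


def sum_where(bitmap, *cols):
    """Row-wise sum of measure columns where the bitmap is 1, else None."""
    return [sum(vals) if bit else None for bit, *vals in zip(bitmap, *cols)]


# Materialize the aggregate view for Q1 = [ACEF] once.
bq1 = and_columns(B["b2"], B["b3"], B["b6"])   # [0, 1, 0]
mq1 = sum_where(bq1, M["m2"], M["m3"], M["m6"])  # [None, 7, None]


def answer_q2():
    """Q2 = total cost of [ACEFG], answered from the view plus edge FG."""
    hits = and_columns(bq1, B["b7"])
    return {
        rec: mq + m7
        for rec, hit, mq, m7 in zip(REC_IDS, hits, mq1, M["m7"])
        if hit
    }


print(answer_q2())  # {2: 8}
```

Record 2 gives 7 + 1 = 8, the same value as evaluating m2 + m3 + m6 + m7 directly.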

Re-use of materialized graph views
• See our past work "Business Intelligence on Complex Graph Data", BEWEB, Berlin, Germany, March 2012
  – How to formulate complex graph expressions using a set of intuitive operators we define
• How to best answer a user query using materialized (Aggregate or not) Graph Views?
  – A simple cost model based on the number of bitmaps required for answering a query
  – Mapped to a set cover problem
  – Solved via a greedy algorithm
  – Details are in the paper
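The slide defers the details to the paper, so the following is only a generic greedy set-cover sketch under the stated cost model (fewer bitmaps is better), not the authors' algorithm. It assumes that a view is usable for a query only when all of the view's edges belong to the query, and that each base edge bitmap b<i> is always available. The function name `greedy_cover` and the view names are hypothetical.

```python
def greedy_cover(query_edges, views):
    """Pick bitmaps whose edge sets cover the query, largest gain first.

    query_edges: iterable of edge ids in the query graph.
    views: {view name: iterable of edge ids ANDed into that view's bitmap}.
    Returns the names of the chosen bitmaps in selection order.
    """
    query = frozenset(query_edges)
    # Assumption: a view whose bitmap tests an edge outside the query would
    # wrongly filter out records, so only views contained in the query count.
    candidates = {
        name: frozenset(edges)
        for name, edges in views.items()
        if frozenset(edges) <= query
    }
    # Base bitmaps cover one edge each and guarantee that a cover exists.
    for edge in query:
        candidates[f"b{edge}"] = frozenset([edge])

    uncovered = set(query)
    chosen = []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Query [ACEFG] uses edges 2, 3, 6, 7; bq1 materializes edges 2, 3, 6.
print(greedy_cover({2, 3, 6, 7}, {"bq1": {2, 3, 6}, "bq9": {4, 5, 6}}))
# ['bq1', 'b7']
```

With the view bq1 available, the [ACEFG] query needs two bitmaps (bq1 and b7) instead of four, which is the rewriting used for Q2 on the previous slide.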

What to materialize?
• Aggressive materialization: materialize whole queries
  – Often not possible due to space limitations
• Our approach: Query Driven Graph View Selection
• First need to derive a set of candidate views
  – Naïve approach: consider all subsets of the edges in the union of all query graphs
    • Exponential number of candidates (thus not feasible)
    • Many redundant views
  – Intuition: prune candidates based on a monotonicity property

Candidate Generation
[Figure: query graphs over nodes A, B, C, D, E, F, G, H, J]
Frequent query set: {[ACEFGHJ], [ADEFGHJ]}
Based on the monotonicity property, we only consider the following candidates:
1. Each query graph: adds {[ACEFGHJ], [ADEFGHJ]}
2. All subgraphs that are the intersection of two query graphs: adds {[EFGHJ]}
3. All subgraphs that are the intersection of two graphs from the previous step, repeated until no new views are created
View selection from the candidate set is mapped to a set cover problem
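The three steps amount to closing the query set under pairwise intersection. The sketch below shows that idea in Python under a few assumptions: graphs are represented as sets of edges, node labels are single characters, empty intersections are discarded, and every round intersects all pairs found so far rather than only those of the previous step. The names `edges` and `candidate_views` are hypothetical.

```python
from itertools import combinations


def edges(path):
    """Edge set of a path, e.g. 'EFG' -> {'EF', 'FG'} (single-letter nodes)."""
    return frozenset(path[i:i + 2] for i in range(len(path) - 1))


def candidate_views(query_graphs):
    """Query graphs plus all non-empty intersections, until a fixpoint."""
    candidates = set(query_graphs)
    while True:
        new = {a & b for a, b in combinations(candidates, 2)} - candidates
        new.discard(frozenset())  # an empty subgraph is not a useful view
        if not new:
            return candidates
        candidates |= new


queries = [edges("ACEFGHJ"), edges("ADEFGHJ")]
views = candidate_views(queries)
print(len(views))                 # 3
print(edges("EFGHJ") in views)    # True
```

For the frequent query set on the slide this yields the two query graphs and their shared suffix [EFGHJ], matching the candidates listed in steps 1 and 2.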

Extensions
All data are stored in a single relation
[Table: the single relation with columns rec Id, m1 to m7 and b1 to b7, as on the Bitmap Columns slide]
But the data can also be partitioned into more than one relation

rec Id  m1    m2    m3    b1  b2  b3
1       3     4     2     1   1   1
2       Null  1     2     0   1   1
3       Null  Null  Null  0   0   0

rec Id  m4  m5  m6    m7    b4  b5  b6  b7
1       1   2   Null  Null  1   1   0   0
2       2   3   4     1     1   1   1   1
3       5   4   3     1     1   1   1   1

Specialized graph indexes (for example gIndex) can easily be incorporated

Experiments
• Graph records from two datasets
  1. NY*: depicts New York roads
  2. Gnutella**: describes connections among Gnutella hosts from August 2002
• Experimental evaluation among 4 systems
  – Commercial row-store relational DB
  – Column-store relational DB
  – Neo4j
  – Commercial native RDF DB
* http://www.dis.uniroma1.it/~challenge9/download.shtml
** http://snap.stanford.edu/data/p2p-Gnutella05.html

Comparison to alternative systems (no views)
• Our system provides almost constant query times with increasing graph query size, as fewer records are retrieved (even though more bitmaps are being used)
• The column store is not affected by increasing density (% of edges in a record)

Benefit of Using Graph Views
[Figure: two charts, "Runtime for 100 uniform Graph Queries" and "Runtime for 100 uniform Aggregate Graph Queries"]
• Graph views provide savings of up to 32% in query times
  – There is a mandatory cost for fetching the records that is not affected by materialization
• Thus, more savings are seen in aggregate queries
  – Using 100 aggregate graph views reduces the execution time by 89%
• Larger gains when queries exhibit skew (graphs in the paper)

Using Additional Indexes
[Figure: two charts, "gIndex in 100 uniform Graph Queries" and "gIndex in 100 uniform Aggregate Graph Queries"]
• gIndex (record driven): trained the index using records that are part of the query result set
  – It took about 24 hours to process about 100,000 records
• Graph views (query driven) result in up to 6 times faster query processing times
  – It ran in less than one second

Conclusions
• Presented a framework where both data and queries are modeled as abstract graph structures
  – Abstracted two primitive query types
  – Introduced two types of Graph Views for expediting queries
  – Discussed an efficient mechanism for selecting a set of non-redundant views
  – Answered queries using Graph Views by solving an instance of a set cover problem
• Argued for a simple yet effective representation of graph records using a flat relational model implemented in a column store
  – Introduced bitmap indexes for efficient query processing
  – Graph Views are stored within the same relational schema
• Presented experimental results using datasets consisting of hundreds of millions of graph records
  – Experimental results show that our platform is orders of magnitude faster than
    • A straightforward relational implementation
    • Alternative systems that natively handle graph data

Thank you. Questions?