Querying Big Data @ Edinburgh
Wenfei Fan
School of Informatics, University of Edinburgh

Theory for querying big data

The good, the bad and the ugly
✓ Traditional computational complexity theory of 50 years:
  • The good: polynomial time computable (PTIME)
  • The bad: NP-hard (intractable)
  • The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
What happens when it comes to big data?
Using SSD of 6 GB/s, a linear scan of a data set D would take
  • 1.9 days when D is of 1 PB (10^15 B)
  • 5.28 years when D is of 1 EB (10^18 B)
O(n) time is already beyond reach on big data in practice!
Polynomial time queries become intractable on big data!
When is it feasible to query big data?
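The scan times can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the SSD rate means 6 × 10^9 bytes per second and a 365-day year; the function and constant names are illustrative.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day year


def scan_seconds(size_bytes, rate_bytes_per_s=6e9):
    """Time for one linear scan at a fixed sequential read rate."""
    return size_bytes / rate_bytes_per_s


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY     # 1 PB
eb_years = scan_seconds(1e18) / SECONDS_PER_YEAR   # 1 EB
print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Both figures grow linearly with |D|, which is why even an O(n) pass is out of reach at this scale.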

Facebook: Graph Search
✓ Find me restaurants in New York my friends have been to in 2014
  • friend(pid1, pid2)
  • person(pid, name, city)
  • dine(pid, rid, dd, mm, yy)
✓ SQL query (actually, a simple conjunctive query):
    select rid
    from friend f, person p, dine d
    where f.pid1 = p0 and f.pid2 = p.pid and f.pid2 = d.pid
      and p.city = NYC and d.yy = 2014
Facebook: more than 1.38 billion nodes, and over 140 billion links
Is it feasible on big data?
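To see the conjunctive query run end to end, here is a small sketch using Python's built-in sqlite3 module on an in-memory database. The rows are invented toy data, and `P0` stands for the constant p0 (the person asking); nothing here reflects Facebook's actual system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friend(pid1 INTEGER, pid2 INTEGER);
    CREATE TABLE person(pid INTEGER, name TEXT, city TEXT);
    CREATE TABLE dine(pid INTEGER, rid INTEGER, dd INTEGER, mm INTEGER, yy INTEGER);
""")

# Toy rows, invented for illustration only.
conn.executemany("INSERT INTO friend VALUES (?, ?)", [(0, 1), (0, 2), (3, 4)])
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [(1, "Ann", "NYC"), (2, "Bob", "EDI"), (4, "Cat", "NYC")])
conn.executemany("INSERT INTO dine VALUES (?, ?, ?, ?, ?)",
                 [(1, 100, 5, 6, 2014), (1, 101, 1, 1, 2013),
                  (2, 102, 2, 2, 2014), (4, 103, 3, 3, 2014)])

P0 = 0  # the person issuing the query ("me")
rows = conn.execute("""
    SELECT d.rid
    FROM friend f, person p, dine d
    WHERE f.pid1 = ? AND f.pid2 = p.pid AND f.pid2 = d.pid
      AND p.city = 'NYC' AND d.yy = 2014
""", (P0,)).fetchall()
rids = sorted(r[0] for r in rows)
print(rids)
```

On the toy data only restaurant 100 qualifies: it is the one 2014 visit by a friend of person 0 whose city is NYC.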

BD-tractability
[Figure: nested complexity classes, NP and beyond ⊇ P ⊇ BD-tractable, with the rest of P marked not BD-tractable. Labelled examples: relational queries (even SPC), subgraph isomorphism, graph simulation.]
W. Fan, F. Geerts, F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013.
How to query big data?

Parallel scalability
A parallel algorithm for answering a class Q of queries
✓ Input: G = (G_1, …, G_n), distributed to (S_1, …, S_n), and a query Q
✓ Output: Q(G), the answer to Q in G
Complexity
✓ t(|G|, |Q|): the time taken by a sequential algorithm with a single processor
✓ T(|G|, |Q|, n): the time taken by a parallel algorithm with n processors
✓ Parallel scalable: if T(|G|, |Q|, n) = O(t(|G|, |Q|)/f(n)) + O((n + |Q|)^k)
  • f(n): a sublinear function
  • O(t(|G|, |Q|)/f(n)) > O((n + |Q|)^k), |G| > n
When G is big, we can still query G by adding more processors if we can afford them
Is it always possible? NO
W. Fan, X. Wang, Y. Wu. Distributed graph simulation: impossibility and possibility. VLDB 2014.
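The shape of the parallel scalability bound can be explored numerically. The sketch below plugs in illustrative choices that are assumptions, not part of the definition: sequential cost t = |G| * |Q|, speed-up f(n) = sqrt(n), and k = 2.

```python
import math


def parallel_bound(g_size, q_size, n, k=2):
    """Evaluate t(|G|,|Q|)/f(n) + (n + |Q|)^k under illustrative t, f and k."""
    t = g_size * q_size   # assumed sequential cost t(|G|, |Q|)
    f = math.sqrt(n)      # assumed sublinear speed-up f(n)
    return t / f + (n + q_size) ** k


processors = (4, 16, 64)
big = [parallel_bound(10**9, 10, n) for n in processors]   # |G| much larger than n
small = [parallel_bound(100, 10, n) for n in processors]   # |G| comparable to n
print(big)
print(small)
```

With a big graph the first term dominates and the bound shrinks as processors are added; with a tiny graph the (n + |Q|)^k overhead takes over and more processors make things worse.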

Systems for querying big data

GRAPE (GRAPh Engine)
Divide and conquer
✓ partition G into fragments (G_1, …, G_n) of manageable sizes, distributed to various sites
✓ upon receiving a query Q, evaluate Q on the smaller G_i
  • evaluate Q(G_i) in parallel
  • collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
✓ A principled approach
✓ Ease of use of existing sequential algorithms: no need for recasting
GRAPE: provably better than MapReduce, Pregel and GraphLab
[Figure: Q(G) assembled from the partial answers Q(G_1), Q(G_2), …, Q(G_n)]
ERC project: querying big graphs with bounded resources
Parallel processing = Partial + incremental evaluation
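The partition, evaluate and assemble steps can be sketched on a deliberately simple query, node degrees, where partial answers combine by addition. This is not GRAPE's API: the function names, the round-robin partitioner and the use of a thread pool as stand-in "sites" are all assumptions, and the sketch omits incremental evaluation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(edges, n):
    """Split the edge list into n fragments (round-robin, for illustration)."""
    return [edges[i::n] for i in range(n)]


def partial_degrees(fragment):
    """Partial answer on one fragment: degrees counted over its edges only."""
    counts = Counter()
    for u, v in fragment:
        counts[u] += 1
        counts[v] += 1
    return counts


def assemble(partials):
    """Coordinator step: sum the partial degree counts."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total


def parallel_degrees(edges, n):
    fragments = partition(edges, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(partial_degrees, fragments))
    return assemble(partials)


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
result = parallel_degrees(edges, 3)
print(dict(result))
```

Note that `partial_degrees` is just the sequential algorithm applied to a fragment, which mirrors the point about reusing existing sequential code without recasting it.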

A query engine for big relations
✓ Input: A class Q of queries, a set A of access constraints
✓ Output: Using A, for any query Q ∈ Q and any (possibly big) dataset D, find a fraction D_Q of D such that
  • Q(D) = Q(D_Q), and
  • D_Q can be identified in time determined by Q and A
✓ 77% of conjunctive queries are bounded under 84 simple access constraints [VLDB 2014]
✓ 60% of SQL queries are bounded under 108 simple constraints [new results]
✓ Efficiency: 9 seconds vs. 14 hours of MySQL; at least 3 orders of magnitude
✓ 60% of graph pattern queries [ICDE 2015]
✓ 28587 times faster (subgraph isomorphism)
Querying big data by accessing small data
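A toy illustration of answering a query by touching only a small part of D: the sketch below evaluates the Graph Search query through index lookups and counts the tuples it fetches. The three access constraints, their indexes and all rows are hypothetical; this is not the algorithm of the cited papers.

```python
from collections import defaultdict

# Toy instance of friend(pid1, pid2), person(pid, name, city),
# dine(pid, rid, dd, mm, yy); all rows are invented.
friend = [(0, 1), (0, 2), (3, 4), (5, 6)]
person = [(1, "Ann", "NYC"), (2, "Bob", "EDI"), (4, "Cat", "NYC"), (6, "Dan", "NYC")]
dine = [(1, 100, 5, 6, 2014), (1, 101, 1, 1, 2013), (2, 102, 2, 2, 2014),
        (4, 103, 3, 3, 2014), (6, 104, 4, 4, 2014)]

# Hypothetical access constraints, each backed by an index:
#   friend: pid1 -> pid2        (bounded number of friends per person)
#   person: pid -> (name, city) (pid is a key)
#   dine:   (pid, yy) -> rid    (bounded number of visits per person per year)
friend_idx = defaultdict(list)
for pid1, pid2 in friend:
    friend_idx[pid1].append(pid2)
person_idx = {pid: (name, city) for pid, name, city in person}
dine_idx = defaultdict(list)
for pid, rid, dd, mm, yy in dine:
    dine_idx[(pid, yy)].append(rid)


def bounded_answer(p0, city, year):
    """Answer the query through index lookups only, counting fetched tuples."""
    accessed, rids = 0, set()
    for pid2 in friend_idx.get(p0, []):
        accessed += 1                       # one friend tuple
        info = person_idx.get(pid2)
        if info is None:
            continue
        accessed += 1                       # one person tuple
        if info[1] != city:
            continue
        fetched = dine_idx.get((pid2, year), [])
        accessed += len(fetched)            # dine tuples for this friend and year
        rids.update(fetched)
    return rids, accessed


def full_scan_answer(p0, city, year):
    """Reference answer computed by scanning every relation."""
    return {rid
            for (a, b) in friend if a == p0
            for (pid, _, c) in person if pid == b and c == city
            for (dpid, rid, _, _, yy) in dine if dpid == b and yy == year}


answer, accessed = bounded_answer(0, "NYC", 2014)
total = len(friend) + len(person) + len(dine)
print(answer, accessed, total)
```

The number of tuples fetched depends on how many friends person 0 has and how often they dined out, not on the size of the three relations.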

Data-driven: Resource bounded query answering
✓ Input: A class Q of queries, a resource ratio α ∈ [0, 1), and a performance ratio β ∈ (0, 1]
  (α is decided by our available resources)
✓ Question: Develop an algorithm that given any query Q ∈ Q and dataset D,
  • accesses a fraction D_Q of D such that |D_Q| ≤ α|D|
  • computes Q(D_Q) as approximate answers to Q(D), and
  • accuracy(Q, D, α) ≥ β
[Figure: dynamic reduction takes D and Q to D_Q; approximation relates Q(D_Q) to Q(D)]
Accessing α|D| amount of data in the entire process
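One way to picture a resource-bounded evaluation is a search that stops once it has inspected its share of the data. The sketch below is an assumption-laden toy, not the algorithm of the SIGMOD 2014 paper: the resource ratio is the parameter `alpha`, |D| is taken to be the number of edges, and accuracy is measured as the fraction of exact answers that were found.

```python
from collections import deque


def bounded_search(adj, attrs, start, predicate, alpha):
    """Breadth-first search from `start` inspecting at most alpha * |D| edges."""
    total_edges = sum(len(nbrs) for nbrs in adj.values())
    budget = int(alpha * total_edges)
    seen, queue, used, matches = {start}, deque([start]), 0, set()
    while queue and used < budget:
        node = queue.popleft()
        for nbr in adj.get(node, []):
            if used >= budget:
                break
            used += 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
                if predicate(attrs[nbr]):
                    matches.add(nbr)
    return matches, used


def accuracy(approx, exact):
    """Hypothetical accuracy measure: share of exact answers recovered."""
    return len(approx & exact) / len(exact) if exact else 1.0


# Toy graph and attributes, invented for illustration.
adj = {"me": ["a", "b", "c"], "a": ["d"], "b": ["e"], "c": ["f"]}
attrs = {
    "a": {"city": "Edinburgh", "likes": {"cycling"}},
    "b": {"city": "London", "likes": {"cycling"}},
    "c": {"city": "Edinburgh", "likes": {"chess"}},
    "d": {"city": "London", "likes": set()},
    "e": {"city": "Edinburgh", "likes": {"cycling"}},
    "f": {"city": "London", "likes": set()},
}


def wanted(a):
    return a["city"] == "Edinburgh" and "cycling" in a["likes"]


approx, used = bounded_search(adj, attrs, "me", wanted, alpha=0.5)
exact, _ = bounded_search(adj, attrs, "me", wanted, alpha=1.0)  # reference only
print(approx, used, accuracy(approx, exact))
```

Half of the edges suffice to find half of the matches here; the interesting question raised by the slide is how small the ratio can be made while keeping the accuracy above the required bound.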

Personalized social search queries
Graph Search, Facebook
✓ Find me all my friends who live in Edinburgh and like cycling
✓ Find me restaurants in London my friends have been to
✓ Find me photos of my friends in New York
We can do personalized social search with α = 0.0015%, with 100% accuracy!
✓ 1.5 * 10^-5 * 1 PB (10^15 B) = 15 * 10^9 B = 15 GB
✓ We are making big graphs of PB size as small as 15 GB!
Beyond graphs: relational queries
✓ SPC: 7.9 * 10^-4
✓ SQL: 1.9 * 10^-3
W. Fan, X. Wang, and Y. Wu. Querying big graphs within bounded resources. SIGMOD 2014.
We can make big data of PB size fit into our memory!
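The 15 GB figure follows from converting the 0.0015% ratio to a fraction before multiplying. A short check, assuming 1 PB = 10^15 bytes and 1 GB = 10^9 bytes:

```python
ALPHA_PERCENT = 0.0015        # resource ratio quoted in the slide, in percent
PB_BYTES = 10**15             # 1 PB

alpha = ALPHA_PERCENT / 100   # as a fraction: 1.5e-5
accessed_gb = alpha * PB_BYTES / 10**9
print(f"{accessed_gb:.0f} GB")
```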

Applications of big data analysis

Association rules revised for graphs
✓ Conventional association rules: X ⇒ Y
  • X and Y: itemsets (attributes of a relation)
  • milk, diaper ⇒ beer
✓ Social media marketing: a revolution to conventional marketing
  • 90% of customers trust their peer (friends, colleagues) recommendations vs. 14% who trust advertising
  • 60% of users said that Twitter plays an important role in their shopping
The need for association rules for graphs
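For the conventional rule milk, diaper ⇒ beer, the usual support and confidence measures can be computed directly from a list of transactions. The baskets below are invented toy data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Support of X and Y together, relative to the support of X alone."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


transactions = [
    {"milk", "diaper", "beer"},
    {"milk", "diaper", "beer", "bread"},
    {"milk", "diaper"},
    {"milk", "bread"},
    {"beer"},
]
rule_support = support(transactions, {"milk", "diaper", "beer"})
rule_confidence = confidence(transactions, {"milk", "diaper"}, {"beer"})
print(rule_support, rule_confidence)
```

Two of the five baskets contain all three items, and two of the three baskets with milk and diaper also contain beer.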

Graph pattern association rules
if
  a) x and x' are friends who live in the same city c,
  b) they both like at least 3 French restaurants in c, and
  c) x' likes a newly opened French restaurant y,
then x may also like y
We can advertise restaurant y to person x
[Figure: graph pattern over nodes x, x', c and y with edges friend, live-in and like]
Association rules defined with graph patterns
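The restaurant rule can be checked on a small in-memory graph. This sketch rests on several assumptions: condition (b) is read as "at least 3 French restaurants in c liked by both people", the graph is stored in plain dictionaries, and all names and rows are invented.

```python
# Toy social graph, invented for illustration.
friend = {"ann": {"bob"}, "bob": {"ann"}, "cat": {"ann"}}
lives_in = {"ann": "Paris", "bob": "Paris", "cat": "Lyon"}
restaurants = {
    "r1": {"city": "Paris", "cuisine": "French", "new": False},
    "r2": {"city": "Paris", "cuisine": "French", "new": False},
    "r3": {"city": "Paris", "cuisine": "French", "new": False},
    "r4": {"city": "Paris", "cuisine": "French", "new": True},
}
likes = {
    "ann": {"r1", "r2", "r3"},
    "bob": {"r1", "r2", "r3", "r4"},
    "cat": {"r4"},
}


def french_in(person, city):
    """French restaurants in `city` that `person` likes."""
    return {r for r in likes.get(person, set())
            if restaurants[r]["city"] == city
            and restaurants[r]["cuisine"] == "French"}


def recommendations(x):
    """Restaurants y that the rule suggests advertising to x."""
    city = lives_in[x]
    recs = set()
    for x2 in friend.get(x, set()):
        if lives_in.get(x2) != city:
            continue                                   # (a) same city
        if len(french_in(x, city) & french_in(x2, city)) < 3:
            continue                                   # (b) 3 shared favourites
        for y in likes.get(x2, set()):
            r = restaurants[y]
            if r["cuisine"] == "French" and r["new"] and y not in likes[x]:
                recs.add(y)                            # (c) new place x' likes
    return recs


print(recommendations("ann"))
```

Only ann gets a suggestion: bob shares her city and three favourites and likes the newly opened r4, which she has not yet liked.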

Advertising mobile phones
if
  a) more than 80% of friends of person x use Redmi 2A, and
  b) none of the friends of x gave Redmi 2A a bad rating,
then the chances are that x may like Redmi 2A
Advertise Redmi 2A to x
[Figure: graph pattern with x, friends x1 and x2, and product y; friend edges annotated "> 80%" (use) and "= 0" (bad-rating), and a recommend edge from x to y]
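The two quantified conditions, "more than 80%" and "none", translate into a ratio test and a universal check. A minimal sketch over dictionary-based toy data; the function name, the data layout and the people are all hypothetical.

```python
def should_advertise(x, friends, uses, bad_rating, product, threshold=0.8):
    """True if more than `threshold` of x's friends use the product
    and no friend of x gave it a bad rating."""
    fs = friends.get(x, set())
    if not fs:
        return False
    share = sum(1 for f in fs if product in uses.get(f, set())) / len(fs)
    no_bad = all(product not in bad_rating.get(f, set()) for f in fs)
    return share > threshold and no_bad


PRODUCT = "Redmi 2A"
friends = {"x": {"f1", "f2", "f3", "f4", "f5", "f6"},
           "z": {"f1", "f2", "f3", "f4", "f6"}}
uses = {f: {PRODUCT} for f in ("f1", "f2", "f3", "f4", "f5")}  # f6 does not use it

ok = should_advertise("x", friends, uses, {}, PRODUCT)                   # 5 of 6 friends
borderline = should_advertise("z", friends, uses, {}, PRODUCT)           # exactly 4 of 5
vetoed = should_advertise("x", friends, uses, {"f6": {PRODUCT}}, PRODUCT)
print(ok, borderline, vetoed)
```

The strict inequality matters: exactly 80% of friends does not satisfy "more than 80%", and a single bad rating blocks the rule regardless of the share.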

Other topics: data quality

Can we trust the answers to our queries?

    FN      LN      address      AC   city
    Mary    Smith   2 Small St   020  EDI
    Mary    Dupont  10 Elm St    610  PHI
    Mary    Dupont  6 Main St    212  NYC
    Bob     Luth    8 Cowan St   215  PHI
    Robert  Luth    6 Drum St    212  NYC

✓ Q1: how many distinct employees have first name Mary?
✓ 3 may not be the correct answer:
  • The first three tuples refer to the same person
  • The first tuple is inconsistent
  • The information may be incomplete
Real-life data is dirty
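The gap between the naive answer and the answer after entity resolution can be shown in a few lines. The mapping `entity_of` is a hypothetical resolution result that encodes only what the slide states (the first three tuples are one person); the remaining tuples are kept as separate entities.

```python
rows = [
    ("Mary", "Smith", "2 Small St", "020", "EDI"),
    ("Mary", "Dupont", "10 Elm St", "610", "PHI"),
    ("Mary", "Dupont", "6 Main St", "212", "NYC"),
    ("Bob", "Luth", "8 Cowan St", "215", "PHI"),
    ("Robert", "Luth", "6 Drum St", "212", "NYC"),
]

# Hypothetical entity-resolution output: row position -> entity id.
entity_of = {0: "e1", 1: "e1", 2: "e1", 3: "e2", 4: "e3"}

# Naive answer to Q1: count the tuples whose first name is Mary.
naive_count = sum(1 for r in rows if r[0] == "Mary")

# Answer after resolution: count distinct entities behind those tuples.
resolved_count = len({entity_of[i] for i, r in enumerate(rows) if r[0] == "Mary"})
print(naive_count, resolved_count)
```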

Data fusion

    FN      LN      address      salary  status
    Mary    Smith   2 Small St   50k     single
    Mary    Dupont  10 Elm St    50k     married
    Mary    Dupont  6 Main St    80k     married
    Bob     Luth    8 Cowan St   80k     married
    Robert  Luth    6 Drum St    55k     married

✓ Q2: what is Mary's current last name? Dupont
✓ In real life:
  • Marital status only changes from single → married → divorced
  • Tuples with the most current marital status also have the most current last name
Deduce the true values of an entity
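The two currency rules are enough to deduce the last name mechanically. A minimal sketch, assuming the three Mary tuples have already been identified as one entity; the ranking dictionary and function name are illustrative.

```python
# Marital status only moves forward along this order.
STATUS_ORDER = {"single": 0, "married": 1, "divorced": 2}

# (FN, LN, address, salary, status) for the tuples assumed to be one person.
mary = [
    ("Mary", "Smith", "2 Small St", "50k", "single"),
    ("Mary", "Dupont", "10 Elm St", "50k", "married"),
    ("Mary", "Dupont", "6 Main St", "80k", "married"),
]


def current_last_names(tuples):
    """Last names carried by the tuples with the most current marital status."""
    latest = max(STATUS_ORDER[t[4]] for t in tuples)
    return {t[1] for t in tuples if STATUS_ORDER[t[4]] == latest}


names = current_last_names(mary)
print(names)
```

A single name in the result means the rules determine the value; several names would signal that more currency information is needed.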

Challenges and opportunities
Big data = data quantity + data quality
✓ Theory:
  • Revising the classical computational complexity theory
  • A new theory of parallel scalability
✓ Effective techniques for making big data "small"
  • boundedly evaluable query answering
  • parallel algorithms with performance guarantees
  • approximation algorithms: dynamic and static
✓ Systems: relations and graphs
✓ Applications: social media marketing
✓ Data quality: knowledge base, big data
Ph.D., job, and $$$
Join the database group?

References
Theory for querying big data
✓ W. Fan, F. Geerts, and F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013.
✓ W. Fan, F. Geerts, and L. Libkin. On Scale Independence for Querying Big Data. PODS 2014.
✓ W. Fan, F. Geerts, Y. Cao, and T. Deng. Querying Big Data by Accessing Small Data. PODS 2015.
✓ W. Fan and J. Huai. Querying Big Data: Theory and Practice. JCST 2014.
✓ W. Fan. Graph Pattern Matching Revised for Social Network Analysis. ICDT 2012.

References
Techniques for querying big data:
✓ W. Fan, X. Wang, and Y. Wu. Querying Big Graphs within Bounded Resources. SIGMOD 2014.
✓ Y. Cao and W. Fan. Bounded Conjunctive Queries. VLDB 2014.
✓ Y. Cao, W. Fan, and R. Huang. Making Pattern Queries Bounded in Big Graphs. ICDE 2015.
✓ W. Fan, X. Wang, and Y. Wu. Distributed Graph Simulation: Impossibility and Possibility. VLDB 2014.
✓ W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern Matching. VLDB 2014.
✓ W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using Views. ICDE 2014.

References
Techniques for querying big data (continued):
✓ W. Fan, X. Wang, and Y. Wu. Incremental Graph Pattern Matching. TODS 38(3), 2013.
✓ W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression. SIGMOD 2012.
✓ S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Strong Simulation: Capturing Topology in Graph Pattern Matching. TODS 39(1), 2014.
✓ W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu. Graph Homomorphism Revisited for Graph Matching. VLDB 2010.
✓ W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph Pattern Matching: From Intractable to Polynomial Time. VLDB 2010.
✓ W. Fan, X. Wang, and Y. Wu. Performance Guarantees for Distributed Reachability Queries. VLDB 2012.

References
Applications of big data analysis
✓ W. Fan, X. Wang, Y. Wu, and J. Xu. Association Rules with Graph Patterns. VLDB 2015.
✓ W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries. ICDE 2011.
✓ W. Fan, X. Wang, S. Ma, Y. Wu, and J. Xu. Association Rules with Graph Patterns. VLDB 2015.
✓ W. Fan, Z. Fan, C. Tian, and L. Dong. Keys for Graphs. VLDB 2015.
✓ W. Fan, X. Wang, and Y. Wu. ExpFinder: Finding Experts by Graph Pattern Matching. ICDE 2013 (demo).