Wander Join Online Aggregation via Random Walks Feifei

Database Workloads n 2 Wander Join: Online Aggregation via Random Walks

Online Aggregation n Goal: Analytical queries do not always need 100% accuracy. Can we

Online Aggregation [Haas, Hellerstein, Wang SIGMOD’ 97] n Confidence Interval 4 Confidence Level Wander

Complex Analytical Queries (TPC-H) SELECT SUM(l_extendedprice * (1 - l_discount)) FROM customer, lineitem, orders,

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n n Store tuples in each table in

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n Estimator for Is this estimator unbiased? Yes,

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n How do we use this estimator to

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n How do we find n We need

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] Now can use procedure for calculating confidence interval

A Running Example Nation CID US 1 US China 11 2 What’s Buyer. ID

Wander Join n n Take a randomly sampled tuple from ONLY one table Conduct

Join as a Graph Conceptual only Never materialized 13 Wander Join: Online Aggregation via

Join as a Graph Conceptual only Never materialized 14 Wander Join: Online Aggregation via

Join as a Graph SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE

Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE

Sampling by Random Walks n Estimator of aggregate might be biased Idea: Penalize paths

Sampling by Random Walks 20 Wander Join: Online Aggregation via Random Walks

Confidence Interval n 21 More complicated than it looks! This is just the normal

Walk Plan Optimization n n Structure of the data graph Selection predicates Starting table:

Walk Plan Optimizer n n n 24 Enumerate all plans Conduct ~ 100 trial

Convergence Comparison 25 Wander Join: Online Aggregation via Random Walks

Wander Join in Postgre. SQL Logarithmic growth due to B-tree lookup to find random

Running on Insufficient Memory (4 GB) n n n Insufficient memory incurs a heavy,

Wander Join vs Ripple Join 28 Wander Join Ripple Join Sampling methodology Independent but

Online Aggregation vs Data Cube 29 Online Aggregation Data Cube Queries Online, ad hoc

Dealing with Selection Predicates n One predicate – n Little impact: Can start walk

Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD’ 90] 32 Nation CID Buyer. ID Order.

Sampling from a B-tree [Olken, ’ 93] 4 n n 2 3 Sampling from

Rejection Sampling [Olken, ’ 93] n n 34 Imagine each node has maximum fanout

Non-Uniform Sampling n 35 As long as we can compute the sampling probability, wander

Compare with Blink. DB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, ’ 13] 36 Wander

Accuracy Achieved in 1/10 Time of Full Join 37 Wander Join: Online Aggregation via

Slides: 37

Download presentation

Wander Join: Online Aggregation via Random Walks Feifei Li Bin Wu, Ke Yi Zhuoyue Zhao University of Utah Hong Kong University Shanghai Jiao Tong of Science and Technology University Slides from http: //datagroup. cs. utah. edu/seminars. php Adapted for Duke DB Group by Brett Walenz Added slides 3, 7 -10, 12, 19 -21

Database Workloads n 2 Wander Join: Online Aggregation via Random Walks

Online Aggregation n Goal: Analytical queries do not always need 100% accuracy. Can we return an approximate answer with improving ‘quality’ guarantee? n Concretely, how do we estimate an aggregate query that involves multiple joins? n Notion of quality: express in form of confidence intervals: that is, we’d like to be able to say that with high probability, the actual query answer is somewhere in a given interval (preferably small). 3 Wander Join: Online Aggregation via Random Walks

Online Aggregation [Haas, Hellerstein, Wang SIGMOD’ 97] n Confidence Interval 4 Confidence Level Wander Join: Online Aggregation via Random Walks

Complex Analytical Queries (TPC-H) SELECT SUM(l_extendedprice * (1 - l_discount)) FROM customer, lineitem, orders, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R' AND c_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' This query finds the total revenue loss due to returned orders in a given region. 5 Wander Join: Online Aggregation via Random Walks

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n n Store tuples in each table in random order In each step Reads the next tuple from a table in a round-robin fashion – Join with sampled tuples from other tables – Estimate the aggregation value from samples, calculate confidence interval from estimator – n 6 Works well for full Cartesian product – But most joins are sparse … Wander Join: Online Aggregation via Random Walks

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n Estimator for Is this estimator unbiased? Yes, since tuples pulled at random from EACH table. Is this estimator consistent? Yes, the final result is the actual query result. 7 Wander Join: Online Aggregation via Random Walks

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n How do we use this estimator to develop a confidence interval? Use the central limit theorem. 1. n 2. Shift to a standard normal: n 3. Find the area under this curve n 4. Let n 5. Then n 8 Wander Join: Online Aggregation via Random Walks

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n How do we find n We need another estimator: n Ripple join is NOT independent: for every sample r, there are multiple samples s that may join. Thus the variance estimator needs to take into account the proportion of EACH table it has seen so far. 9 Wander Join: Online Aggregation via Random Walks

Ripple Join [Haas, Hellerstein, SIGMOD’ 99] Now can use procedure for calculating confidence interval earlier. 10 Wander Join: Online Aggregation via Random Walks

A Running Example Nation CID US 1 US China 11 2 What’s Buyer. ID Order. ID 4 Order. ID Item. ID 1 4 301 the 3 total revenue of all 2 orders 304 2 in 3 China? 3 from customers 1 3 201 Price $2100 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks

Wander Join n n Take a randomly sampled tuple from ONLY one table Conduct a random walk from that tuple to the neighbors (join tuples) For queries with many join relations, there may be different walk paths – Can handle cyclical queries – Assumes indexes on other tables – n n n 12 Provide an unbiased estimator for each aggregator, calculate confidence intervals Does not provide consistent result: must run full join in conjunction with wander join Estimate and confidence interval converges faster than ripple join in experiments Wander Join: Online Aggregation via Random Walks

Join as a Graph Conceptual only Never materialized 13 Wander Join: Online Aggregation via Random Walks

Join as a Graph Conceptual only Never materialized 14 Wander Join: Online Aggregation via Random Walks

Join as a Graph SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 15 Nation CID Buyer. ID Order. ID US 1 4 301 $2100 US 2 3 2 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price

Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 16 Nation CID Buyer. ID Order. ID US 1 4 301 $2100 US 2 3 2 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price

Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 17 Nation CID Buyer. ID Order. ID US 1 4 301 $2100 US 2 3 2 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price

Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 18 Nation CID Buyer. ID Order. ID US 1 4 301 $2100 US 2 3 2 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price

Sampling by Random Walks n Estimator of aggregate might be biased Idea: Penalize paths that are sampled with higher probability proportionally. 19 Wander Join: Online Aggregation via Random Walks

Sampling by Random Walks 20 Wander Join: Online Aggregation via Random Walks

Confidence Interval n 21 More complicated than it looks! This is just the normal variance formula. Estimator is more straightforward than ripple join. Wander Join: Online Aggregation via Random Walks

Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 22 Nation CID US 1 4 301 $2100 US China 2 3 2 2 304 $100 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 UK Buyer. ID Order. ID Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price

Walk Plan Optimization n n Structure of the data graph Selection predicates Starting table: use index – Table in the middle: reject random walk – n Data distribution – Non-uniformity may not be a bad thing! 23 5 6 3 0 1 1 1 1 5 6 3 0 Wander Join: Online Aggregation via Random Walks

Walk Plan Optimizer n n n 24 Enumerate all plans Conduct ~ 100 trial random walks using each plan Measure the variance of each plan Select the best plan All trials runs are still useful Wander Join: Online Aggregation via Random Walks

Convergence Comparison 25 Wander Join: Online Aggregation via Random Walks

Wander Join in Postgre. SQL Logarithmic growth due to B-tree lookup to find random neighbours 26 Wander Join: Online Aggregation via Random Walks

Running on Insufficient Memory (4 GB) n n n Insufficient memory incurs a heavy, one-time penalty Growth is still logarithmic Fundamentally: Random sampling at odds with hard disks But does it matter? Spark, In-Memory DB, RAM cloud… – The algorithm is embarrassingly parallel – Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB’ 09] 27 Wander Join: Online Aggregation via Random Walks

Wander Join vs Ripple Join 28 Wander Join Ripple Join Sampling methodology Independent but non-uniform Uniform but non-independent Index needed? Yes Index or random storage Confidence interval computation Easy, time Complicated, time : # tables Convergence time (20 GB data, 3 tables) ~ 3 s ~ 50 s Scalability Logarithmic Slightly less than linear System implementation Postgre. SQL (finished) Oracle (in progress) Spark. SQL (in progress) Informix (internal project) DBO Wander Join: Online Aggregation via Random Walks

Online Aggregation vs Data Cube 29 Online Aggregation Data Cube Queries Online, ad hoc Offline, fixed Latency Seconds Hours, then milliseconds Query mode One at a time Batch Accuracy Small error No error Data schema Any (relational, graph) Multidimensional cube Work with OLTP Integrated Separate Target scenario Online, ad hoc, interactive data analytics Monthly report Wander Join: Online Aggregation via Random Walks

Thank you!

Dealing with Selection Predicates n One predicate – n Little impact: Can start walk from that table Multiple highly selective predicates More random walks will fail – Running full query becomes faster – Can simply switch to full query when selectivity <1% (say) – 31 Wander Join: Online Aggregation via Random Walks

Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD’ 90] 32 Nation CID Buyer. ID Order. ID Item. ID US 1 4 8 4 301 $2100 US 2 3 5 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 2 3 401 $230 US 6 5 3 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Price

Sampling from a B-tree [Olken, ’ 93] 4 n n 2 3 Sampling from an aggregate (ranked) B-tree is easy But incurs heavy cost for transactions – need to modify existing B-tree implementations – 33 Wander Join: Online Aggregation via Random Walks

Rejection Sampling [Olken, ’ 93] n n 34 Imagine each node has maximum fanout Reject as soon as it walks out of bound Wander Join: Online Aggregation via Random Walks

Non-Uniform Sampling n 35 As long as we can compute the sampling probability, wander join still works! Wander Join: Online Aggregation via Random Walks

Compare with Blink. DB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, ’ 13] 36 Wander Join Blink. DB Methodology Query Sampling Query Sampling method Random walks Stratified sampling Joins supported Any Big table joining a small table (no sampling on small table) Error Reduce over time Fixed Data schema Any (relational, graph) Star / snowflake Work with OLTP Integrated Separate Group-by support Unbalanced Balanced Wander Join: Online Aggregation via Random Walks

Accuracy Achieved in 1/10 Time of Full Join 37 Wander Join: Online Aggregation via Random Walks