Wander Join Online Aggregation via Random Walks Feifei
- Slides: 37
Wander Join: Online Aggregation via Random Walks Feifei Li Bin Wu, Ke Yi Zhuoyue Zhao University of Utah Hong Kong University Shanghai Jiao Tong of Science and Technology University Slides from http: //datagroup. cs. utah. edu/seminars. php Adapted for Duke DB Group by Brett Walenz Added slides 3, 7 -10, 12, 19 -21
Database Workloads n 2 Wander Join: Online Aggregation via Random Walks
Online Aggregation n Goal: Analytical queries do not always need 100% accuracy. Can we return an approximate answer with improving ‘quality’ guarantee? n Concretely, how do we estimate an aggregate query that involves multiple joins? n Notion of quality: express in form of confidence intervals: that is, we’d like to be able to say that with high probability, the actual query answer is somewhere in a given interval (preferably small). 3 Wander Join: Online Aggregation via Random Walks
Online Aggregation [Haas, Hellerstein, Wang SIGMOD’ 97] n Confidence Interval 4 Confidence Level Wander Join: Online Aggregation via Random Walks
Complex Analytical Queries (TPC-H) SELECT SUM(l_extendedprice * (1 - l_discount)) FROM customer, lineitem, orders, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R' AND c_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' This query finds the total revenue loss due to returned orders in a given region. 5 Wander Join: Online Aggregation via Random Walks
Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n n Store tuples in each table in random order In each step Reads the next tuple from a table in a round-robin fashion – Join with sampled tuples from other tables – Estimate the aggregation value from samples, calculate confidence interval from estimator – n 6 Works well for full Cartesian product – But most joins are sparse … Wander Join: Online Aggregation via Random Walks
Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n Estimator for Is this estimator unbiased? Yes, since tuples pulled at random from EACH table. Is this estimator consistent? Yes, the final result is the actual query result. 7 Wander Join: Online Aggregation via Random Walks
Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n How do we use this estimator to develop a confidence interval? Use the central limit theorem. 1. n 2. Shift to a standard normal: n 3. Find the area under this curve n 4. Let n 5. Then n 8 Wander Join: Online Aggregation via Random Walks
Ripple Join [Haas, Hellerstein, SIGMOD’ 99] n How do we find n We need another estimator: n Ripple join is NOT independent: for every sample r, there are multiple samples s that may join. Thus the variance estimator needs to take into account the proportion of EACH table it has seen so far. 9 Wander Join: Online Aggregation via Random Walks
Ripple Join [Haas, Hellerstein, SIGMOD’ 99] Now can use procedure for calculating confidence interval earlier. 10 Wander Join: Online Aggregation via Random Walks
A Running Example Nation CID US 1 US China 11 2 What’s Buyer. ID Order. ID 4 Order. ID Item. ID 1 4 301 the 3 total revenue of all 2 orders 304 2 in 3 China? 3 from customers 1 3 201 Price $2100 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks
Wander Join n n Take a randomly sampled tuple from ONLY one table Conduct a random walk from that tuple to the neighbors (join tuples) For queries with many join relations, there may be different walk paths – Can handle cyclical queries – Assumes indexes on other tables – n n n 12 Provide an unbiased estimator for each aggregator, calculate confidence intervals Does not provide consistent result: must run full join in conjunction with wander join Estimate and confidence interval converges faster than ripple join in experiments Wander Join: Online Aggregation via Random Walks
Join as a Graph Conceptual only Never materialized 13 Wander Join: Online Aggregation via Random Walks
Join as a Graph Conceptual only Never materialized 14 Wander Join: Online Aggregation via Random Walks
Join as a Graph SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 15 Nation CID Buyer. ID Order. ID US 1 4 301 $2100 US 2 3 2 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price
Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 16 Nation CID Buyer. ID Order. ID US 1 4 301 $2100 US 2 3 2 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price
Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 17 Nation CID Buyer. ID Order. ID US 1 4 301 $2100 US 2 3 2 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price
Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 18 Nation CID Buyer. ID Order. ID US 1 4 301 $2100 US 2 3 2 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price
Sampling by Random Walks n Estimator of aggregate might be biased Idea: Penalize paths that are sampled with higher probability proportionally. 19 Wander Join: Online Aggregation via Random Walks
Sampling by Random Walks 20 Wander Join: Online Aggregation via Random Walks
Confidence Interval n 21 More complicated than it looks! This is just the normal variance formula. Estimator is more straightforward than ripple join. Wander Join: Online Aggregation via Random Walks
Sampling by Random Walks SELECT SUM(Price) FROM Customers C, Orders O, Items I WHERE C. Nation = ‘China’ C. CID = O. Buyer. ID O. Order. ID = I. Order. ID 22 Nation CID US 1 4 301 $2100 US China 2 3 2 2 304 $100 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 5 3 401 $230 US 6 5 6 1 101 $800 China 7 3 7 2 201 $300 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 UK Buyer. ID Order. ID Wander Join: Online Aggregation via Random Walks Order. ID Item. ID Price
Walk Plan Optimization n n Structure of the data graph Selection predicates Starting table: use index – Table in the middle: reject random walk – n Data distribution – Non-uniformity may not be a bad thing! 23 5 6 3 0 1 1 1 1 5 6 3 0 Wander Join: Online Aggregation via Random Walks
Walk Plan Optimizer n n n 24 Enumerate all plans Conduct ~ 100 trial random walks using each plan Measure the variance of each plan Select the best plan All trials runs are still useful Wander Join: Online Aggregation via Random Walks
Convergence Comparison 25 Wander Join: Online Aggregation via Random Walks
Wander Join in Postgre. SQL Logarithmic growth due to B-tree lookup to find random neighbours 26 Wander Join: Online Aggregation via Random Walks
Running on Insufficient Memory (4 GB) n n n Insufficient memory incurs a heavy, one-time penalty Growth is still logarithmic Fundamentally: Random sampling at odds with hard disks But does it matter? Spark, In-Memory DB, RAM cloud… – The algorithm is embarrassingly parallel – Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB’ 09] 27 Wander Join: Online Aggregation via Random Walks
Wander Join vs Ripple Join 28 Wander Join Ripple Join Sampling methodology Independent but non-uniform Uniform but non-independent Index needed? Yes Index or random storage Confidence interval computation Easy, time Complicated, time : # tables Convergence time (20 GB data, 3 tables) ~ 3 s ~ 50 s Scalability Logarithmic Slightly less than linear System implementation Postgre. SQL (finished) Oracle (in progress) Spark. SQL (in progress) Informix (internal project) DBO Wander Join: Online Aggregation via Random Walks
Online Aggregation vs Data Cube 29 Online Aggregation Data Cube Queries Online, ad hoc Offline, fixed Latency Seconds Hours, then milliseconds Query mode One at a time Batch Accuracy Small error No error Data schema Any (relational, graph) Multidimensional cube Work with OLTP Integrated Separate Target scenario Online, ad hoc, interactive data analytics Monthly report Wander Join: Online Aggregation via Random Walks
Thank you!
Dealing with Selection Predicates n One predicate – n Little impact: Can start walk from that table Multiple highly selective predicates More random walks will fail – Running full query becomes faster – Can simply switch to full query when selectivity <1% (say) – 31 Wander Join: Online Aggregation via Random Walks
Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD’ 90] 32 Nation CID Buyer. ID Order. ID Item. ID US 1 4 8 4 301 $2100 US 2 3 5 2 304 $100 China 3 1 3 3 201 $300 UK 4 5 4 4 306 $500 China 5 5 2 3 401 $230 US 6 5 3 1 101 $800 China 7 3 7 2 201 $300 UK 8 5 101 $200 Japan 9 3 9 4 301 $100 UK 10 7 10 2 201 $600 Wander Join: Online Aggregation via Random Walks Price
Sampling from a B-tree [Olken, ’ 93] 4 n n 2 3 Sampling from an aggregate (ranked) B-tree is easy But incurs heavy cost for transactions – need to modify existing B-tree implementations – 33 Wander Join: Online Aggregation via Random Walks
Rejection Sampling [Olken, ’ 93] n n 34 Imagine each node has maximum fanout Reject as soon as it walks out of bound Wander Join: Online Aggregation via Random Walks
Non-Uniform Sampling n 35 As long as we can compute the sampling probability, wander join still works! Wander Join: Online Aggregation via Random Walks
Compare with Blink. DB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, ’ 13] 36 Wander Join Blink. DB Methodology Query Sampling Query Sampling method Random walks Stratified sampling Joins supported Any Big table joining a small table (no sampling on small table) Error Reduce over time Fixed Data schema Any (relational, graph) Star / snowflake Work with OLTP Integrated Separate Group-by support Unbalanced Balanced Wander Join: Online Aggregation via Random Walks
Accuracy Achieved in 1/10 Time of Full Join 37 Wander Join: Online Aggregation via Random Walks
- Wander join
- She who walks with integrity walks securely meaning
- Feifei jiang
- Linear algebra
- Random walks
- Wander and wonder
- Abe odd hull luck oak
- I wonder as i wander poem
- Random assignment vs random sampling
- Random assignment vs random selection
- Palavras convergentes
- Via positiva and via negativa
- 14 estaciones del vía lucis
- Motoneurona
- Via crucis via lucis
- Analysis class diagram
- Aggregation in system analysis and design
- Data preprocessing
- Secure aggregation
- Pods aggregation and silos in cloud computing
- Equivalent variation formula
- Route
- Liquidity aggregate
- Resource leveling
- Uml composition
- Data aggregation
- Ap human geography
- Delivery and forwarding of ip packets
- Xerogel
- Objek diagram
- Composition uml
- Temporal aggregation
- Composition vs inheritance vs aggregation
- Fish aggregation devices
- Resource and cost planning
- Rank aggregation methods for the web
- It is a special form of association where it represents
- Interest aggregation