Online Aggregation 2011 4 15 Liu Long Aggregation
- Slides: 38
Online Aggregation 2011 -4 -15 Liu Long
Aggregation • Operations related to aggregating data in DBMS – AVG – SUM – COUNT
An Example Select AVG(grade) from ENROLL; • A “fancy” interface: Query Results + AVG 3. 262574342
A Better Approach • Don’t process in batch! Online aggregation: represent the final result
Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation • Summary & Future Work
Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation • Summary & Future Work
Ideal Approach Select AVG(grade) from ENROLL GROUP BY major;
Goals & Requirements • Continuous output – Non-blocking query plans • Time/precision control • Fairness/partiality
Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation A Naïve Approach Modify the database engine • Summary & Future Work
A Naïve Approach SELECT running_avg(grade), running_confidence(grade), running_interval(grade), FROM ENROLL;
A Naïve Approach • Drawbacks: – No grouping – No guarantee of continuous output – No guarantee of fairness (or control over partiality)
1. Random Access to Data • Heap Scan – OK if clustering uncorrelated to agg & grouping attrs • Index Scan – can scan an index on attrs uncorrelated to agg or grouping • Sampling: – could introduce new sampling access methods (e. g. Olken’s work 93’)
User Interface
2. Group By & Distinct • Fair, Non-Blocking Group By/Distinct – Can’t sort! • sorting blocks • sorting is unfair – Must use hash-based techniques – Hybrid hashing(84)! • especially for duplicate elimination. – “Hybrid Cache” (96) even better. • Memorization • sorting
User Interface
3. Index Striding • For fair Group By: – want random tuple from Group 1, random tuple from Group 2, . . . – Sol’n: one access method opens many cursors in index, one per group. Fetch round-robin. – Can control speed by weighting the schedule
User Interface
4. Join Algorithms • Non-Blocking Joins Algorithms Acceptable Notice Sort-Merge join _ Merge join Chosen with care because of the sorted output Hybrid hash join(84) Acceptable if the inner relation is small Pipeline hash join(91) Non-blocking join but less efficient(completion time) Nested-loops join(96) Stable, but unacceptable for its unindexed inner relation
5. Query Optimization • Entirely avoid sorting in an online aggregation system • Extend “Interesting Orders”(79) to online aggregation • User control vs. performance? – Running multiple versions of a query(Rdb, 96)
6. Aggregate Functions • Add a 4 th aggregate function – SUM, COUNT, AVG • Use the formulae provided in the Appendix • Extended aggs need to return running confidence intervals
7. API • Current API uses built-in methods – Three basic functions: • speedup. Group, slow. Down. Group, stop. Group • e. g. select Stop. Group(val) • Very flexible. Easy to code – The fourth function: • set. Skip. Factor(val, int)
User Interface • Inter-tuple speed is critical!!
8. Statistical Issues • Confidence Intervals for SQL aggs – given an estimate, probability p that we’re within e of the right answer • 3 types of estimates Conservative (Hoeffding’s inequality) Large-Sample (Central Limit Theorems) Deterministic Previous work + new results from Peter Haas
Conservative confidence interval • Based on Hoeffding’s inequality • Valid for all
Large-Sample confidence interval • Based on central limit theorems(CLT) • Appropriate when n is small enough
Deterministic confidence interval • With probability 1 • Useful only when n is very large
Appendix: formulae
Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation • Summary & Future Work
Evaluation • Testbed – prototype in Postgres 95 – 96 MB main memoty – 1 GB disk • Dataset – University of Wisconsin – Single table – 1, 547, 606 rows, about 316. 6 MB
Pacing study Select AVG(grade), Interval(0. 99) From ENROLL; Traditional plan
Access Methods, Big Group Select AVG(grade), Interval(0. 95) From ENROLL GROUP BY college; College L: 925 K tuples
Access Methods, Small Group Select AVG(grade), Interval(0. 95) From ENROLL GROUP BY college; Surprise! College S: 15 K tuples
Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation • Summary & Future Work
Summary • A new approach to aggregation! – Implementation of a RDBMS engine – Continuous output – Hash-based group by – Duplicate elimination – Index striding – New APIs – User Interface
Future Work • Better UI – online data visualization • graphical aggregate • great for panning, zooming • Checkpointing/continuation – also continuous data streams • Extension of statistical results: – simultaneous confidence intervals
What makes it outstanding ? • Leading a new subject • Applicable to practical problems • Sufficient and broad analysis – Join algorithms – Various statistical estimation theories • Implementation with real system • A little bit luck
Online Aggregation With MR • • Progress Estimation Multi-job Map. Reduce Statistical model for Map. Reduce Join Algorithms
Thank you! 2011 -4 -15 Liu Long
- Alex liu cecilia liu
- Líu líu lo lo ta ca hát say sưa
- Tall + short h
- Once upon a time there lived a little country girl
- Not so long ago, people
- There lived
- Tinikling costume and props
- Once upon a time long long ago
- Long long int c
- Greek
- Sóc nhí câu đố
- Once upon a time a long long time ago begins the story
- để nguyên nước chấm cổ truyền
- Analysis class diagram example
- Uml private public
- Difference aggregation and composition
- Capacity aggregation supply chain
- Resource aggregation
- Rank aggregation methods for the web
- Wan aggregation
- Aggregation in oop
- Agglomeration vs aggregation
- Microservices data aggregation
- Advanced state
- Bgp route aggregation
- An aggregation relationship is usually represented as
- Remote sensing
- Er model arrows
- Rank aggregation methods for the web
- Mr stepek
- Aggregation database
- Address aggregation simplifies the forwarding process in
- Choropleth map example ap human geography
- Foo9822
- Pods aggregation and silos in cloud computing
- Resource and cost planning
- Platelet aggregation test
- In seconds
- Stepek ap human geo