Online Aggregation 2011 4 15 Liu Long Aggregation

Aggregation • Operations related to aggregating data in DBMS – AVG – SUM –

An Example Select AVG(grade) from ENROLL; • A “fancy” interface: Query Results + AVG

A Better Approach • Don’t process in batch! Online aggregation: represent the final result

Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation

Ideal Approach Select AVG(grade) from ENROLL GROUP BY major;

Goals & Requirements • Continuous output – Non-blocking query plans • Time/precision control •

A Naïve Approach SELECT running_avg(grade), running_confidence(grade), running_interval(grade), FROM ENROLL;

A Naïve Approach • Drawbacks: – No grouping – No guarantee of continuous output

1. Random Access to Data • Heap Scan – OK if clustering uncorrelated to

2. Group By & Distinct • Fair, Non-Blocking Group By/Distinct – Can’t sort! •

3. Index Striding • For fair Group By: – want random tuple from Group

4. Join Algorithms • Non-Blocking Joins Algorithms Acceptable Notice Sort-Merge join _ Merge join

5. Query Optimization • Entirely avoid sorting in an online aggregation system • Extend

6. Aggregate Functions • Add a 4 th aggregate function – SUM, COUNT, AVG

7. API • Current API uses built-in methods – Three basic functions: • speedup.

User Interface • Inter-tuple speed is critical!!

8. Statistical Issues • Confidence Intervals for SQL aggs – given an estimate, probability

Conservative confidence interval • Based on Hoeffding’s inequality • Valid for all

Large-Sample confidence interval • Based on central limit theorems(CLT) • Appropriate when n is

Deterministic confidence interval • With probability 1 • Useful only when n is very

Evaluation • Testbed – prototype in Postgres 95 – 96 MB main memoty –

Pacing study Select AVG(grade), Interval(0. 99) From ENROLL; Traditional plan

Access Methods, Big Group Select AVG(grade), Interval(0. 95) From ENROLL GROUP BY college; College

Access Methods, Small Group Select AVG(grade), Interval(0. 95) From ENROLL GROUP BY college; Surprise!

Summary • A new approach to aggregation! – Implementation of a RDBMS engine –

Future Work • Better UI – online data visualization • graphical aggregate • great

What makes it outstanding ? • Leading a new subject • Applicable to practical

Online Aggregation With MR • • Progress Estimation Multi-job Map. Reduce Statistical model for

Slides: 38

Download presentation

Online Aggregation 2011 -4 -15 Liu Long

Aggregation • Operations related to aggregating data in DBMS – AVG – SUM – COUNT

An Example Select AVG(grade) from ENROLL; • A “fancy” interface: Query Results + AVG 3. 262574342

A Better Approach • Don’t process in batch! Online aggregation: represent the final result

Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation • Summary & Future Work

Ideal Approach Select AVG(grade) from ENROLL GROUP BY major;

Goals & Requirements • Continuous output – Non-blocking query plans • Time/precision control • Fairness/partiality

Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation A Naïve Approach Modify the database engine • Summary & Future Work

A Naïve Approach SELECT running_avg(grade), running_confidence(grade), running_interval(grade), FROM ENROLL;

A Naïve Approach • Drawbacks: – No grouping – No guarantee of continuous output – No guarantee of fairness (or control over partiality)

1. Random Access to Data • Heap Scan – OK if clustering uncorrelated to agg & grouping attrs • Index Scan – can scan an index on attrs uncorrelated to agg or grouping • Sampling: – could introduce new sampling access methods (e. g. Olken’s work 93’)

User Interface

2. Group By & Distinct • Fair, Non-Blocking Group By/Distinct – Can’t sort! • sorting blocks • sorting is unfair – Must use hash-based techniques – Hybrid hashing(84)! • especially for duplicate elimination. – “Hybrid Cache” (96) even better. • Memorization • sorting

User Interface

3. Index Striding • For fair Group By: – want random tuple from Group 1, random tuple from Group 2, . . . – Sol’n: one access method opens many cursors in index, one per group. Fetch round-robin. – Can control speed by weighting the schedule

User Interface

4. Join Algorithms • Non-Blocking Joins Algorithms Acceptable Notice Sort-Merge join _ Merge join Chosen with care because of the sorted output Hybrid hash join(84) Acceptable if the inner relation is small Pipeline hash join(91) Non-blocking join but less efficient(completion time) Nested-loops join(96) Stable, but unacceptable for its unindexed inner relation

5. Query Optimization • Entirely avoid sorting in an online aggregation system • Extend “Interesting Orders”(79) to online aggregation • User control vs. performance? – Running multiple versions of a query(Rdb, 96)

6. Aggregate Functions • Add a 4 th aggregate function – SUM, COUNT, AVG • Use the formulae provided in the Appendix • Extended aggs need to return running confidence intervals

7. API • Current API uses built-in methods – Three basic functions: • speedup. Group, slow. Down. Group, stop. Group • e. g. select Stop. Group(val) • Very flexible. Easy to code – The fourth function: • set. Skip. Factor(val, int)

User Interface • Inter-tuple speed is critical!!

8. Statistical Issues • Confidence Intervals for SQL aggs – given an estimate, probability p that we’re within e of the right answer • 3 types of estimates Conservative (Hoeffding’s inequality) Large-Sample (Central Limit Theorems) Deterministic Previous work + new results from Peter Haas

Conservative confidence interval • Based on Hoeffding’s inequality • Valid for all

Large-Sample confidence interval • Based on central limit theorems(CLT) • Appropriate when n is small enough

Deterministic confidence interval • With probability 1 • Useful only when n is very large

Appendix: formulae

Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation • Summary & Future Work

Evaluation • Testbed – prototype in Postgres 95 – 96 MB main memoty – 1 GB disk • Dataset – University of Wisconsin – Single table – 1, 547, 606 rows, about 316. 6 MB

Pacing study Select AVG(grade), Interval(0. 99) From ENROLL; Traditional plan

Access Methods, Big Group Select AVG(grade), Interval(0. 95) From ENROLL GROUP BY college; College L: 925 K tuples

Access Methods, Small Group Select AVG(grade), Interval(0. 95) From ENROLL GROUP BY college; Surprise! College S: 15 K tuples

Content • Question & Motivation • The Approach – Goals – Implementation – Evaluation • Summary & Future Work

Summary • A new approach to aggregation! – Implementation of a RDBMS engine – Continuous output – Hash-based group by – Duplicate elimination – Index striding – New APIs – User Interface

Future Work • Better UI – online data visualization • graphical aggregate • great for panning, zooming • Checkpointing/continuation – also continuous data streams • Extension of statistical results: – simultaneous confidence intervals

What makes it outstanding ? • Leading a new subject • Applicable to practical problems • Sufficient and broad analysis – Join algorithms – Various statistical estimation theories • Implementation with real system • A little bit luck

Online Aggregation With MR • • Progress Estimation Multi-job Map. Reduce Statistical model for Map. Reduce Join Algorithms

Thank you! 2011 -4 -15 Liu Long