Scalable Approximate Query Processing Florin Rusu Data Explosion
Scalable Approximate Query Processing Florin Rusu
Data Explosion • Data storage advancements – Price / capacity ($70 / 1 TB) • Human generated – Web 2. 0 & social networking • User data – Communication • Network & web logs (e. Bay – 50 TB / day) • Call Detail Records (CDRs) • Scientific experiments – LHC (Large Hadron Collider) – SKA (Square Kilometer Array) – 1 EB (1018) / day – Sensor networks 04/19/2010 2
Large-Scale Data Analytics • Traditional DB (OLTP) – Multi-user transaction processing – Optimized for specific workloads (views, indexes, …) • Analytic processing (OLAP) – Data cubes • Aggregate at different hierarchical levels • Pre-defined aggregates, not flexible – Shared-nothing architectures (MPP) • • 04/19/2010 Startups: Netezza, Greenplum, Aster. Data, Vertica, … Parallel databases on clusters of computers Storage layer (row store, column store, hybrid) Compression 3
Interactive Data Analysis & Exploration • Ad-hoc queries • Compute statistical aggregates over all data • Example: web log analysis – Documents (URL, Content) – User. Visits (IP, URL, Date, Duration) – “How much time did users spend searching for cars during the period May – July 2009? ” SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 04/19/2010 4
Roadmap • Database query execution • System design & implementation – Data. Base. Online (DBO) • Approximation methods (theoretical analysis & practical implementation) – Sampling – Sketches over samples 04/19/2010 5
Query Execution IP URL Date URL Content 1 A 05 -30 -09 45 A car 1 B 06 -01 -09 60 B car 1 J 06 -01 -09 30 C car 1 D 05 -15 -09 90 D phone 1 I 04 -28 -09 35 E car 2 A 04 -30 -09 60 F car 2 F 06 -15 -09 15 G car 2 G 06 -13 -09 10 H PC 2 E 06 -01 -09 20 I car 2 E 07 -10 -09 35 J car 3 C 04 -28 -09 25 3 B 05 -23 -09 25 3 J 05 -29 -09 35 3 I 06 -13 -09 25 3 D 06 -09 -09 40 4 C 07 -30 -09 50 4 H 05 -14 -09 75 4 H 08 -02 -09 65 4 G 07 -23 -09 90 4 F 06 -16 -09 5 04/19/2010 Duration Σ ⋈ σ σ UV D • Selections push down • Sort-Merge Join • Aggregate SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 6
Selection IP URL Date 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 30 1 D 05 -15 -09 90 1 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 2 E 07 -10 -09 35 3 C 04 -28 -09 25 3 B 05 -23 -09 25 3 J 05 -29 -09 35 3 I 06 -13 -09 25 3 D 06 -09 -09 40 4 C 07 -30 -09 50 4 H 05 -14 -09 75 4 H 08 -02 -09 65 4 G 07 -23 -09 90 4 F 06 -16 -09 5 04/19/2010 Σ Duration ⋈ URL Content A car B car σ σ C car D phone UV D E car F car G car H PC I car J car • Storage manager • One thread for each table scan • Project unused columns SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 7
Selection URL Σ Duration A 45 B 60 J 30 D 90 F 15 G 10 E 20 E 35 B 25 J 35 I 25 D 40 C 50 H 75 G 90 F 5 04/19/2010 URL ⋈ A B σ σ C UV D F E G I • Tuples are pipelined into join J SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 8
Sort-Merge Join – Sort Phase URL Σ Duration URL A A B 45 25 JB 60 35 IJ 30 25 D 90 40 F C 15 50 G H 10 75 E G 20 90 I FE 35 5 J ⋈ σ UV B σ D C E F G • Sort tuples on join attribute • Write sorted runs to disk • Buffer space: UV(8) URL Duration A 45 B 25 B 60 C 50 D 90 D 40 E 20 F 5 E 35 G 90 F 15 H 75 G 10 I 25 J 30 J 35 Run 1 04/19/2010 SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Run 2 9
Sort-Merge Join – Merge Phase Σ Duration 45 URL Duration A URL A 45 URL ⋈ 25 B 60 URL Duration UV D URL D 90 C 50 B E 20 D 40 C E 35 F 5 E F 15 G 90 F G 10 H 75 G J 30 I 25 I J 35 J Run 1 04/19/2010 σ Duration B Duration σ Run 2 SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Run 10
Sort-Merge Join – Merge Phase Σ ⋈ URL D 40 URL D 90 E URL Duration 20 E 35 F 5 URL Duration URL F 15 G 90 F G 10 H 75 G J 30 I 25 I J 35 J Run 1 04/19/2010 σ UV D Duration E Duration σ Run 2 SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Run 11
Aggregation Σ ⋈ Duration 45 0 σ σ UV D • Update the sum as tuples are produced Duration 45 SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 04/19/2010 12
Final Result Duration Σ 445 ⋈ σ σ UV D Duration 45 25 60 50 20 35 15 5 10 90 25 SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 30 35 04/19/2010 13
Roadmap • Database query execution • System design & implementation – Data. Base. Online (DBO) • Approximation methods (theoretical analysis & practical implementation) – Sampling – Sketches over samples 04/19/2010 14
What is the problem? • TPC-H benchmark results (price / performance) – 10 TB scale • • 928 hard-disks (90 TB total storage capacity) 16 × quad-core processors 512 GB RAM $1. 5 million – Load time: 55 hours – Q 1: linear scan over one table with aggregates on top • 1 query: 19 minutes • 9 queries: 3 hours (linear scaling) 04/19/2010 15
Approximate Query Processing SELECT SUM f(r 1 • r 2 • … • rn) FROM R 1 as r 1, R 2 as r 2, …, Rn as rn Result estimate Query result Traditional query processing Confidence bounds 04/19/2010 Time 16
DBO System Architecture [Rusu et al. 2008] Query Result In-Memory Join 4 ⋈ Σ UV' Levelwise Step 5 1 σ σ UV DB Engine 04/19/2010 D Controller 3 2 ⋈ D' 7 Estimation Module Confidence bounds 6 Result Approximate answer 17
Roadmap • Database query execution • System design & implementation – Data. Base. Online (DBO) • Approximation methods (theoretical analysis & practical implementation) – Sampling – Sketches over samples 04/19/2010 18
Sampling [Dobra, Jermaine, Rusu & Xu 2009] Σ IP URL Date 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 30 1 D 05 -15 -09 90 1 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 2 E 07 -10 -09 35 URL Content 3 C 04 -28 -09 25 A car 3 B 05 -23 -09 25 B car 3 J 05 -29 -09 35 C car 3 I 06 -13 -09 25 D phone 3 D 06 -09 -09 40 E car 4 C 07 -30 -09 50 F car 4 H 05 -14 -09 75 G car 4 H 08 -02 -09 65 H PC 4 G 07 -23 -09 90 I car 4 F 06 -16 -09 5 J car 04/19/2010 Duration ⋈ σ σ UV D • Control, coordinate & schedule data flow between operators • Embed randomness in each operator SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 19
Sampling – Selection SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Σ In-Memory Join ⋈ URL J IP URL Date 1 A 05 -30 -09 45 70 1 B 06 -01 -09 60 140 1 J 06 -01 -09 30 185 1 D 05 -15 -09 90 252 1 I 04 -28 -09 35 X 2 A 04 -30 -09 60 X 2 F 06 -15 -09 15 358 2 G 06 -13 -09 10 409 2 E 06 -01 -09 20 476 2 E 07 -10 -09 35 495 3 C 04 -28 -09 25 X 3 B 05 -23 -09 σ σ UV D Duration 25 • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators URL 722 3 J 05 -29 -09 35 739 3 I 06 -13 -09 25 745 3 D 06 -09 -09 40 791 J Duration URL Content J car 68 F car 220 C car 312 D phone A car 389 B car 447 X 4 C 07 -30 -09 50 798 URL 4 H 05 -14 -09 75 837 A 45 70 FJ 220 68 G car 515 4 H 08 -02 -09 65 X B 60 140 F C 220 312 H PC X 4 G 07 -23 -09 90 953 J 30 185 C A 312 389 I car 695 4 F 06 -16 -09 5 973 D 90 252 H B 389 447 E car 799 URL
Sampling – Selection SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Σ In-Memory Join ⋈ URL J IP URL Date 1 A 05 -30 -09 45 70 1 B 06 -01 -09 60 140 1 J 06 -01 -09 30 185 1 D 05 -15 -09 90 252 1 I 04 -28 -09 35 X 2 A 04 -30 -09 60 X 2 F 06 -15 -09 15 358 2 G 06 -13 -09 10 409 2 E 06 -01 -09 20 476 2 E 07 -10 -09 35 495 3 C 04 -28 -09 25 X 3 B 05 -23 -09 σ σ UV D Duration 25 • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators URL 722 3 J 05 -29 -09 35 739 3 I 06 -13 -09 25 745 3 D 06 -09 -09 40 791 A Duration 45 URL Content J car 68 F car 220 C car 312 D phone A car 389 B car 447 X 4 C 07 -30 -09 50 798 URL 4 H 05 -14 -09 75 837 B 60 140 F 220 G car 515 4 H 08 -02 -09 65 X J 30 185 C 312 H PC X 4 G 07 -23 -09 90 953 D 90 252 A 389 I car 695 4 F 06 -16 -09 5 973 F 15 358 B 447 E car 799 URL
Sampling – Selection SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Σ In-Memory Join URL Duration J IP URL Date 1 A 05 -30 -09 45 70 1 B 06 -01 -09 60 140 1 J 06 -01 -09 30 185 1 D 05 -15 -09 90 252 1 I 04 -28 -09 35 X 2 A 04 -30 -09 60 X 2 F 06 -15 -09 15 358 2 G 06 -13 -09 10 409 2 E 06 -01 -09 20 476 2 E 07 -10 -09 35 495 3 C 04 -28 -09 25 X 3 B 05 -23 -09 ⋈ URL 30 J σ σ UV D Duration 25 • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators URL 722 3 J 05 -29 -09 35 739 3 I 06 -13 -09 25 745 3 D 06 -09 -09 40 791 J Duration 30 URL Content J car 68 F car 220 C car 312 D phone A car 389 B car 447 X 4 C 07 -30 -09 50 798 URL 4 H 05 -14 -09 75 837 D 90 252 F 220 G car 515 4 H 08 -02 -09 65 X F 15 358 C 312 H PC X 4 G 07 -23 -09 90 953 G 10 409 H 389 I car 695 4 F 06 -16 -09 5 973 E 20 476 B 447 E car 799 URL
Sampling – Selection SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Σ In-Memory Join URL IP URL Date Duration 1 A 05 -30 -09 45 70 1 B 06 -01 -09 60 140 1 J 06 -01 -09 30 185 1 D 05 -15 -09 90 252 1 I 04 -28 -09 35 X 2 A 04 -30 -09 60 X 2 F 06 -15 -09 15 358 2 G 06 -13 -09 10 409 2 E 06 -01 -09 20 476 2 E 07 -10 -09 35 495 3 C 04 -28 -09 25 3 B 05 -23 -09 3 J 3 3 Duration ⋈ URL J 30 J F 15 F C A B σ σ UV D • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators 50% input: 360; [-328, 1048] 95% probability URL Content X J car 68 25 722 F car 220 05 -29 -09 35 739 C car 312 I 06 -13 -09 25 745 D phone D 06 -09 -09 40 791 A car 389 B car 447 Duration X 4 C 07 -30 -09 50 798 URL 4 H 05 -14 -09 75 837 B 25 722 G 515 G car 515 4 H 08 -02 -09 65 X J 35 739 I 695 H PC X 4 G 07 -23 -09 90 953 I 25 745 E 799 I car 695 4 F 06 -16 -09 5 973 D 40 791 E car 799 URL
Sampling – Selection SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Σ In-Memory Join URL IP URL Date Duration 1 A 05 -30 -09 45 70 1 B 06 -01 -09 60 140 1 J 06 -01 -09 30 185 1 D 05 -15 -09 90 252 1 I 04 -28 -09 35 X 2 A 04 -30 -09 60 X 2 F 06 -15 -09 15 358 2 G 06 -13 -09 10 409 2 E 06 -01 -09 20 476 2 E 07 -10 -09 35 495 3 C 04 -28 -09 25 X 3 B 05 -23 -09 25 722 3 J 05 -29 -09 35 739 3 I 06 -13 -09 25 745 3 D 06 -09 -09 40 791 Duration ⋈ URL J 30 J F 15 F B 25 C J 35 A B G I Exceed In-Memory Join capacity (10 tuples)! Eliminate tuples such that variance is minimized. Duration σ σ UV D • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators URL Content J car 68 F car 220 C car 312 D phone A car 389 B car 447 G car 515 X 4 C 07 -30 -09 50 798 URL 4 H 05 -14 -09 75 837 I 25 745 4 H 08 -02 -09 65 X D 40 791 H PC X 4 G 07 -23 -09 90 953 C 50 798 I car 695 4 F 06 -16 -09 5 973 H 75 837 E car 799 URL E 799
Sampling – Selection SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Σ In-Memory Join URL IP URL Date Duration 1 A 05 -30 -09 45 70 1 B 06 -01 -09 60 140 1 J 06 -01 -09 30 185 1 D 05 -15 -09 90 252 1 I 04 -28 -09 35 X 2 A 04 -30 -09 60 X 2 F 06 -15 -09 15 358 2 G 06 -13 -09 10 409 2 E 06 -01 -09 20 476 2 E 07 -10 -09 35 495 3 C 04 -28 -09 25 3 B 05 -23 -09 3 J 3 3 Duration ⋈ URL J 30 J B 25 A J 35 B G σ σ UV D • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators 74% input: 258; [-293, 808] 95% probability URL Content X J car 68 25 722 F car 220 05 -29 -09 35 739 C car 312 I 06 -13 -09 25 745 D phone D 06 -09 -09 40 791 A car 389 B car 447 G car 515 Duration X 4 C 07 -30 -09 50 798 URL 4 H 05 -14 -09 75 837 I 25 745 4 H 08 -02 -09 65 X D 40 791 H PC X 4 G 07 -23 -09 90 953 C 50 798 I car 695 4 F 06 -16 -09 5 973 H 75 837 E car 799 URL E 799
Sampling – Selection SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Σ In-Memory Join URL IP URL Date Duration 1 A 05 -30 -09 45 70 1 B 06 -01 -09 60 140 1 J 06 -01 -09 30 185 1 D 05 -15 -09 90 252 1 I 04 -28 -09 35 X 2 A 04 -30 -09 60 X 2 F 06 -15 -09 15 358 2 G 06 -13 -09 10 409 2 E 06 -01 -09 20 476 2 E 07 -10 -09 35 495 3 C 04 -28 -09 25 3 B 05 -23 -09 3 J 3 3 Duration ⋈ URL J 30 J B 25 A J 35 B G 90 G E All input: 448; [3, 892] 95% probability σ σ UV D • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators URL Content X J car 68 25 722 F car 220 05 -29 -09 35 739 C car 312 I 06 -13 -09 25 745 D phone D 06 -09 -09 40 791 A car 389 B car 447 URL Duration URL X 4 C 07 -30 -09 50 798 4 H 05 -14 -09 75 837 G car 515 4 H 08 -02 -09 65 X H PC X 4 G 07 -23 -09 90 953 I car 695 4 F 06 -16 -09 5 973 E car 799
Sampling Estimation – Intermediate Levels • Query result estimator & variance estimator computed from result tuples found by In-Memory Join • Confidence bounds derived with Central Limit Theorem • Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join 04/19/2010 27
Sampling – Join (Sort) URL Σ Duration SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] A 45 227 B 60 987 J 30 888 D 90 43 F 15 67 G 10 51 E 20 739 E 35 739 B 25 987 J 35 888 I 25 342 D 40 43 C 50 489 H 75 150 G 90 51 URL F 5 67 D 90 43 D 40 43 G 10 51 G 90 51 F 15 67 F 5 67 A 45 227 H 75 150 E 20 739 I 25 342 E 35 739 C 50 489 J 30 888 J 35 888 B 60 987 B 25 987 04/19/2010 ⋈ σ σ UV D URL • Sort tuples on random function of join attribute Duration Run 1 URL J 888 F 67 C 489 A 227 B 987 G 51 I 342 E 739 Duration Run 2 URL F 67 G 51 A 227 I 342 C 489 E 739 J 888 B 987 Run 1 Run 2 28
Sampling – Join (Merge) Duration URL 51 90 51 0 100 URL 10 51 G 90 51 G Duration 51 Σ URL G 10 51 G 90 51 F 67 Duration URL ⋈ URL G 10 51 G 90 51 F 67 G 51 F 15 67 F 5 67 A 227 I 342 A 45 227 H 75 150 C 489 E 739 E 20 739 I 25 342 J 888 B 987 E 35 739 C 50 489 J 30 888 J 35 888 B 60 987 B 25 987 Run 1 04/19/2010 0 51 In-Memory Join Duration G URL 10 Run 2 Run 1 Run 2 σ σ UV D SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 29
Sampling – Join (Merge) Duration 50 URL 489 C 50% input: 468; [194, 741] 95% probability 489 Duration Σ C 50 489 URL E 20 739 C 489 E 35 739 E 739 Duration URL ⋈ URL E 20 739 C 50 489 C 489 E 739 E 35 739 J 35 888 J 888 B 987 J 30 888 B 25 987 B 60 987 Run 1 04/19/2010 489 In-Memory Join 489 Duration C URL 240 Run 2 Run 1 Run 2 σ σ UV D SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 30
Sampling – Join (Merge) Duration URL 987 B 60 987 04/19/2010 987 URL 987 In-Memory Join B 987 Σ URL 25 Run 1 60 445 Duration B 60 987 URL 25 Duration B URL Duration B ⋈ Duration B 25 Run 2 987 URL B 987 Run 1 σ σ UV D 987 Run 2 SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 31
Sampling Estimation – Upper Level • Bernoulli sampling with probability given by domain fraction seen so far • Consolidate tuples generated by same join key • Solve optimization problem to minimize variance across levels – Keep confidence bounds stable 04/19/2010 32
Contributions • Design & implement DBO, first online analytical processing engine – Provide estimates & confidence bounds throughout entire query execution – SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations • Design & analyze fastest convergent estimation method for online aggregation – Statistics & optimization techniques 04/19/2010 33
Roadmap • Database query execution • System design & implementation – Data. Base. Online (DBO) • Approximation methods (theoretical analysis & practical implementation) – Sampling – Sketches over samples 04/19/2010 34
Sketches Σ IP URL Date 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 30 1 D 05 -15 -09 90 1 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 2 E 07 -10 -09 35 URL Content 3 C 04 -28 -09 25 A car 3 B 05 -23 -09 25 B car 3 J 05 -29 -09 35 C car 3 I 06 -13 -09 25 D phone 3 D 06 -09 -09 40 E car 4 C 07 -30 -09 50 F car 4 H 05 -14 -09 75 G car 4 H 08 -02 -09 65 H PC 4 G 07 -23 -09 90 I car 4 F 06 -16 -09 5 J car 04/19/2010 Duration ⋈ σ σ UV D • Build sketches on join attribute while data is read from disk • Use attributes in aggregate SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 35
SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Sketches Duration 1 2 3 0 0 0 1 2 3 0 1 0 0 IP URL Date 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 30 1 D 05 -15 -09 90 1 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 2 E 07 -10 -09 35 URL Content 3 C 04 -28 -09 25 A car 3 B 05 -23 -09 25 B car 3 J 05 -29 -09 35 3 I 06 -13 -09 25 3 D 06 -09 -09 4 C 4 S 1 URL A S 1 + S 1 1 A B C D E F G H I J C car + - - + + + - - D phone 40 E car 07 -30 -09 50 F car H 05 -14 -09 75 G car 4 H 08 -02 -09 65 H PC 4 G 07 -23 -09 90 I car 4 F 06 -16 -09 5 J car 04/19/2010 S 1 A B C D E F G H I J 1 2 3 1 1 2 2 3 36
SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Sketches Duration 1 2 3 45 0 0 0 1 2 3 1 0 0 IP URL Date 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 30 1 D 05 -15 -09 90 1 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 2 E 07 -10 -09 35 URL Content 3 C 04 -28 -09 25 A car 3 B 05 -23 -09 25 B car 3 J 05 -29 -09 35 3 I 06 -13 -09 25 3 D 06 -09 -09 4 C 4 S 1 URL Duration A S 1 45 + S 1 1 A B C D E F G H I J C car + - - + + + - - D phone 40 E car 07 -30 -09 50 F car H 05 -14 -09 75 G car 4 H 08 -02 -09 65 H PC 4 G 07 -23 -09 90 I car 4 F 06 -16 -09 5 J car 04/19/2010 S 1 A B C D E F G H I J 1 2 3 1 1 2 2 3 37
SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Sketches IP URL Date Duration 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 30 1 D 05 -15 -09 90 1 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 2 E 07 -10 -09 35 URL Content 3 C 04 -28 -09 25 A car 3 B 05 -23 -09 25 B car 3 J 05 -29 -09 35 3 I 06 -13 -09 25 3 D 06 -09 -09 4 C 4 S 1 1 2 3 -140 35 -65 S 1 1 2 3 0 1 -3 230 A B C D E F G H I J C car + - - + + + - - D phone 40 E car 07 -30 -09 50 F car H 05 -14 -09 75 G car 4 H 08 -02 -09 65 H PC 4 G 07 -23 -09 90 I car 4 F 06 -16 -09 5 J car 04/19/2010 S 1 A B C D E F G H I J 1 2 3 1 1 2 2 3 38
SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Sketches IP URL Date Duration 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 1 D 1 1 2 3 S 1 -140 35 -65 30 S 2 -225 140 05 -15 -09 90 S 3 -20 90 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 2 E 07 -10 -09 2 3 S 1 0 1 -3 -15 S 2 -1 2 1 130 S 3 -3 0 1 230; [-416, 876] 95% probability S 1 230 35 S 2 490 URL Content S 3 190 A car B car 3 C 04 -28 -09 25 3 B 05 -23 -09 25 3 J 05 -29 -09 35 3 I 06 -13 -09 25 3 D 06 -09 -09 40 4 C 07 -30 -09 50 4 H 05 -14 -09 75 4 H 08 -02 -09 65 4 G 07 -23 -09 90 4 F 06 -16 -09 5 04/19/2010 1 A B C D E F G H I J C car S 1 + - - + + + - - D phone S 2 + - + - + - E car S 3 - - - + + - + F car G car H PC I car J car A B C D E F G H I J S 1 1 2 3 1 1 2 2 3 3 3 S 2 3 3 2 1 2 1 3 2 S 3 1 1 2 1 3 2 39
Sketches Estimation • Two random processes – Bucket selection – Sign • Sketch update • Estimator • Confidence bounds – Multiple independent sketches – Chebyshev & Chernoff inequalities (worst-case) – Median Central Limit Theorem, Student-t distribution (statistics) 04/19/2010 40
Pseudo-Random Number Generators [Rusu & Dobra 2006, 2007 b] • Detailed comparison of generating schemes – Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields) • Degree of independence as function of seed size • Fast range-summable – Empirical evaluation • Generating time is few processor cycles • Identify EH 3 as generator for sketches – Lowest possible degree of independence – 7. 3 ns to generate single number 04/19/2010 41
Statistical Analysis [Rusu & Dobra 2007 a, 2008] • Detailed comparison of sketch estimators – Same accuracy (worst-case analysis) – Statistical analysis • Distribution (probability density function) • Higher frequency moments (kurtosis) • Confidence bounds – Empirical evaluation • Data skew, correlation, memory usage, update time • Identify Fast-AGMS as most reliable scheme – Accurate over entire range of data – Small memory footprint, fast update time 04/19/2010 42
Roadmap • Database query execution • System design & implementation – Data. Base. Online (DBO) • Approximation methods (theoretical analysis & practical implementation) – Sampling – Sketches over samples 04/19/2010 43
Sketches over Samples [Rusu & Dobra 2009] IP URL Date 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 30 1 D 05 -15 -09 90 1 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 Σ Duration ⋈ 2 E 07 -10 -09 35 URL 3 C 04 -28 -09 25 J car 3 B 05 -23 -09 25 F car 3 J 05 -29 -09 35 C car 3 I 06 -13 -09 25 D phone 3 D 06 -09 -09 40 A car 4 C 07 -30 -09 50 B car 4 H 05 -14 -09 75 G car 4 H 08 -02 -09 65 H PC 4 G 07 -23 -09 90 I car 4 F 06 -16 -09 5 E car 04/19/2010 Content σ σ UV D • Data is random on disk • Build sketches on join attribute while data is read from disk • Use attributes in aggregate • Provide estimates at any point SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] 44
SELECT SUM(UV. Duration) FROM Documents D, User. Visits UV WHERE D. URL = UV. Doc. URL AND D. Content contains ‘car’ AND UV. Date between [05 -01 -09, 07 -31 -09] Sketches over Samples IP URL Date Duration 1 A 05 -30 -09 45 1 B 06 -01 -09 60 1 J 06 -01 -09 1 D 1 1 2 3 S 1 -100 -35 -30 30 S 2 -105 35 05 -15 -09 90 S 3 -30 30 I 04 -28 -09 35 2 A 04 -30 -09 60 2 F 06 -15 -09 15 2 G 06 -13 -09 10 2 E 06 -01 -09 20 2 E 07 -10 -09 2 3 S 1 1 1 -2 -15 S 2 -1 0 1 65 S 3 -2 0 0 50% input: 100; [-2382, 2582] 95% probability S 1 -300 35 S 2 360 URL Content S 3 240 J car F car 3 C 04 -28 -09 25 3 B 05 -23 -09 25 3 J 05 -29 -09 35 3 I 06 -13 -09 25 3 D 06 -09 -09 40 4 C 07 -30 -09 50 4 H 05 -14 -09 75 4 H 08 -02 -09 65 4 G 07 -23 -09 90 4 F 06 -16 -09 5 04/19/2010 1 A B C D E F G H I J C car S 1 + - - + + + - - D phone S 2 + - + - + - A car S 3 - - - + + - + B car G car H PC I car E car A B C D E F G H I J S 1 1 2 3 1 1 2 2 3 3 3 S 2 3 3 2 1 2 1 3 2 S 3 1 1 2 1 3 2 3 2 45
Sketches over Samples – Estimation • Define estimator over two completely different random processes & analyze statistically – Sampling – random partition, tuple domain – Sketches – random projection, frequency domain – Consider correlation between multiple sketches that share sample – Moment generating functions • Generic analysis independent of sampling process – Bernoulli sampling – Sampling without replacement – Sampling with replacement 04/19/2010 46
Sketches over Samples – Analysis Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction] 04/19/2010 47
Conclusions • Data explosion – Cheap, high-capacity storage – Current processing technology is too expensive for performance it provides • Framework for online analytical processing – DBO system architecture • Embed randomization into data processing • Provide estimates and bounds at any time – Approximation methods • Sampling – most flexible • Sketches – single pass • Sketches over samples – fastest 04/19/2010 48
Future Work • Short term – Define & design query optimization for DBO – Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …) – Generalize sketches to multiple relations – Find optimal amount of data to sketch – Fully integrate sketches into DBO system • Medium term – Develop data aggregation & approximation techniques for other types of architectures • Multicore processors, GPUs • Distributed processing (Map-Reduce, Hadoop, …) • Long term – Design & build scalable analytic processing system • Aggregation & approximation 04/19/2010 49
Publications • A. Dobra, C. Jermaine, F. Rusu, F. Xu – Turbo-Charging Estimate Convergence in DBO. In VLDB 2009. • F. Rusu and A. Dobra – Sketching Sampled Data Streams. In ICDE 2009. • F. Rusu et al. – The DBO Database System. In SIGMOD 2008 (demo). • F. Rusu and A. Dobra – Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3, 2008. • F. Rusu and A. Dobra – Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2, 2007. • F. Rusu and A. Dobra – Statistical Analysis of Sketch Estimators. In SIGMOD 2007. • F. Rusu and A. Dobra – Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD 2006. 04/19/2010 50
- Slides: 50