Compression Aware Physical Database Design Brown University Hideaki
Compression Aware Physical Database Design Brown University Hideaki Kimura* hkimura@cs. brown. edu (*) Graduates soon. On Job Market. Microsoft Research Vivek Narasayya Manoj Syamala {viveknar, manojsy}@microsoft. com
Background: Compression in DB �Every Major DBMS Supports �Saves Storage Consumption �Saves I/O Bandwidth DBMS A: 4 x! DBMS B: 10 x! DBMS C: 12 x! Query Process Engine Tables, Indexes Compressed Data Decompress SELECT Compress INSERT 2/28
Compression Schemes in DB Dictionary Encoding City Seattle 1 San Jose 2 Seattle 1 . . + NULL Suppression Dict. Price 1: Sea ttle 000321 @321 2: San Jose 000054 @54 000015 @15 . . ◦ Local dict. (Oracle, SQL Server) ◦ Global dict. (DB 2) Prefix Suppression, LZO, RLE… 3/28
Two Types of Compression in DB B X X Y Y IAB A @AA @AA @BBB = IBA A @AA @BBB ◦ NULL-Supp. ◦ Global dict. ◦ … Order Dependent IAB A 000 AA ↑ ↑ ↑ 00 BBB ↑ ↑ ↑ IBA page B X X Y Y IBA A 000 AA 00 BBB fragmented IAB A 000 AA 00 BBB Order Independent ≠ A 000 AA ↑ 00 BBB ↑ ◦ Run Length Enc. ◦ Local dict. ◦ … 4/28
Benefits and Overheads �Saves Storage Space, I/O �CPU Overhead to Compress & Decompress �Different Compression Scheme = Different Saving ↔ Overhead How Do We Use It? DBA 5/28
Issue 1: To Compress or not. . �Depends 10 GB -90% on Data 1 GB High Compression Ratio �Depends 10 GB -10% 9 GB Low Compression Ratio on Workload ◦ SELECTs/INSERTs Frequency ◦ CPU bottleneck? IO bottleneck? 6/28
Issue 2: What Index to Create Physical DB Design Tool Q 1 Q 2 I 4 I 3 I 2 I 1 I 3 I 1 I 5 Prune I 5 Select Candidate Configurations Configuration I 1 Syntactically Relevant Indexes I 5 Enumerate Best Configuration DBMS Hypothetical Indexes What-if Analysis Query Optimizer Estimate Runtime 7/28
Naïve Solution: Staged Design �Run Design Tool to Select Indexes �Compress them, then Repeat. Stage 1 MV Workload 100 MB Budget Stage 2 Compress! MV Idx 100 MB MV Idx 50 MB Idx 100 MB 8/28
Problem in tight space budget SELECT SUM(Price*Discount) FROM Sales WHERE State='CA' and Jul 01 < Shipdate < Sep 01 Sales Choice for 100 MB? Shipdate State Price Discount Feb 21 CA $123 10% Jan 9 RI $222 0% Jul 5 TX $213 5% I 1 (State, Shipdate): 95 MB → 50 MB I 2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB �Misses an index that makes sense only with compression 9/28
Example: Tight Space Budget Good design: TPC-H, 2 ndary Index Only 70 60 Improvement [%] 50 Good design 40 30 Staged 20 ? 10 0 0 200 400 600 800 Space Budget [MB] 1000 175 MB CREATE COMPRESSED INDEX (L_PARTKEY, L_ORDERKEY, L_ SUPPKEY) INCLUDE (L_QUANTITY, L_EXTENDEDP RICE, L_DISCOUNT) Staged: 155 MB CREATE INDEX (L_ORDERKEY) INCLUDE(L_SUPPKEY, L_COM MITDATE, L_RECEIPTDATE) 10/28
Problem in plenty space budget Choice for 200 MB? I 1 (State, Shipdate): 95 MB → 50 MB I 2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB INSERT INTO Sales … UPDATE Sales SET Price=. . CPU Overheads �Result in too high CPU overheads for compression/decompression. 11/28
Example: Plenty Space Budget 60 UPDATE Intensive TPC-H, 2 ndary Index Only 50 Improvement [%] Good design 40 Staged 30 20 10 0 0 1000 2000 Space Budget [MB] 3000 Worse with More Budget! 12/28
Integrated Solution Needed! �How to Estimate Index-size after compression? �How to Evaluate benefits/overheads of compression? �How Compression affects Candidate Selection/Enumeration? 13/28
Size Estimation �Essential Metric of Indexes ◦ To Fit Space Budget ◦ To Estimate I/O cost �Need Stats Compression Fraction Col-A Width=8 Col-B Width=4 Col-C Width=10 Clust. Key Width=4 Table #tuple=1 M Size (IABC) = (8 + 4 + 10 + 4) * 1 M = 26 MB Comp. Size (IABC) = 26 MB * CF (IABC) 14/28
Prior work �Sample . . p. 10 MB Im Sample ve CREATE COMPRESSED INDEX 20 15 10 5 0 aï 1 GB Sample. CF Overheads N Table Design Tool Runtime [min] �Sample. CF [Idreos et al. ICDE'10] Size: Cost ↔ Accuracy �Still Expensive for 1, 000 s of indexes 15/28
Solution Overview 16/28
Index Size Deduction Ia NULL supp. (ORD-IND) Ib Ia, b Sum-up Savings Sample. CF Ia Col-Ext Deduction Ib Ia, b Local dict. (ORD-DEP) IAB A 000 AA / / / 00 BBB / / / IBA A 000 AA / 00 BBB / Col-Set Deduction Ib, a Estimate From Run-Length More Details in paper 17/28
Optimize Accuracy-Cost Trade-off �Size-Estimation ◦ Sample Size? ◦ Deduction Path? ◦ Expected Errors? Strategy �Formulate as Graph Problem �Greedy algorithm to solve (details in the paper) 18/28
Issues in Design Tool �Query Cost model to consider (De)Compression CPU cost �Candidate Selection/Enumeration Key Challenge: Space-Performance Trade-off 19/28
Candidate Selection: Space-Performance Trade-off Q 1 Q 2 IB IA ID IC Compressed Versions IB IA ID IC Select Fastest IA IC �Add Compressed Indexes Most of them are Ignored! Compressed Indexes are often Slower-but-Smaller (exception: very high compression ratio) 20/28
Query Cost Skyline Candidate Selection Slow-small Fast-large Configuration Size �Construct �Pick Skyline of Configurations Both Fast-Indexes and Small-Indexes 21/28
Enumeration: Problem �Greedy picks un-compressed indexes too early IB 10 MB Comp. IC B 5 MB Seed IA IC IA IB 15 MB Room IC 10 MB IA IB IC Optimal Design IA IC B IC 22/28
Local Backtrack in Enumeartion �Recoversized configurations �Compress indexes in the config. … IA IB IC C IA IC B IC Recover If Oversized IA IB IC 23/28
Experimental Results �Implemented on SQL Server 2008 ◦ Modified Database Tuning Advisor (DTA) "DTAc" ◦ Modified Query Cost Model �TPC-H Scale-1 (more results in paper) ◦ SELECT-intensive/UPDATE-intensive ◦ Compared Estimated Runtime 24/28
Candidate Selection/Enumeration Select Intensive 80 70 70 Improvement [%] 60 50 50 40 40 Backtrack 30 30 20 20 DTAc (None) 10 10 50 300 700 Budget [MB] 1500 0 DTAc (Both) Skyline Improvement [%] 60 0 Update Intensive 50 300 700 Budget [MB] 1500 DTA Clustered/2 ndary Indexes �Both Skyline & Backtrack are required esp. for tight budget 25/28
DTAc vs. DTA Select Intensive Improvement [%] 80 Improvements[%] 60 60 DTAc 40 40 DTAc 20 0 Update Intensive 0 500 Budget [MB] DTA 1000 20 0 DTA 0 500 Budget [MB] 1000 Clustered/2 ndary/MV Indexes �Especially better in tight budget �Choose lightly compressed designs in UPDATE-intensive 26/28
Design Tool Runtime [min] Overhead in DTA 20 MV-Sample 15 Partial-Estimate 10 Partial-Sample 5 0 MV-Estimate Table-Sample DTAc w/o Optimization DTAc Other �Reduce Size Estimation Overheads for a factor of 3 �Mostly <10% Estimation Error 27/28
Conclusion �Opportunities and Challenges �Integrated Approach to exploit compression in physical design ◦ Space-Performance Tradeoff ◦ Size Estimation �Open Issues ◦ Column-Store 28/28
- Slides: 28