A Framework for Testing Query Transformation Rules Hicham


























- Slides: 26
A Framework for Testing Query Transformation Rules Hicham Elmongui Purdue University Vivek Narasayya, Ravi Ramamurthy Microsoft Research 12/15/2021 ACM SIGMOD 2009 1
Query Optimizer Database System Optimizer Responsible for producing a good execution plan for a given SQL query Crucial for decision support queries 2
Query Optimizer Components Search Strategy Rule Engine Apply rule Query Apply rule Execution Plan Cost Model Cardinality Estimation Query Optimizer 12/15/2021 ACM SIGMOD 2009 3
Query Transformation Rules Apply Join Associativity Rule R S R T S S Hash Join Apply Join To Hash Join Rule R Logical Rule Implementation Rule S Search space extensible by adding new rules Group By, De-correlation, Star Join, etc. Modern optimizers have large number of rules 12/15/2021 ACM SIGMOD 2009 4
Implementing Rule Engine is Non-Trivial SELECT D. Name FROM DEPT D WHERE D. BUDGET <= ( SELECT COUNT(E. eno)*10000 FROM E WHERE E. Dno = D. Dno) SELECT D. Name FROM DEPT D , EMP E WHERE D. no = E. Dno GROUP BY D. Name HAVING D. Budget <= COUNT(E. Eno)*10000 Count Bug in De-correlation Rewrite rules can be subtle Implementation errors can lead to incorrect results RAGS paper (VLDB’ 98) 4 DBMSs disagreed on query results 16% of the time! 12/15/2021 5
Testing Optimizer Rule Engine Coverage Is a given rule (or set of rules) exercised? Correctness Does exercising a rule (or set of rules) change the query results? Performance How does a rule (or set of rules) affect query performance? 12/15/2021 ACM SIGMOD 2009 6
Rule Coverage Transformation rules exercised Query Q 1 1 2 3 4 5 … n API to track which rules are exercised for a given query Q 2 … Qm Definitions of when a rule is exercised 1. Rule must generate at least one expression during optimization 2. At least one expression in the final plan must be generated by rule 12/15/2021 ACM SIGMOD 2009 7
Testing Rule Coverage Generate query such that each rule is exercised Hard to precisely characterize when a rule will be exercised Depends on rule semantics, optimizer heuristics etc. Extend for a set of rules (e. g. rule pairs) Large space of combinations Efficient query generation Time required to generate query that exercises rule should be as small as possible Need multiple queries per rule (or set of rules) Random query generation can be inefficient 12/15/2021 ACM SIGMOD 2009 8
Rule Correctness Query Q 1 2 3 4 5 … Results R Plan P Transformation rules exercised n Execute Optimize Disable rule r 2 Results R΄ Plan P΄ Query Q 1 2 R ≠R΄ 12/15/2021 3 4 5 … n Optimize Execute bug ACM SIGMOD 2009 9
Testing Rule Correctness Transformation rules exercised Query Q Disable rule r 2 Plan P 2 1 2 3 Disable rule r 3 4 5 … Plan P 2 Plan P n Plan Pn-1 Optimize Disable rule rn-1 For each rule, repeat for multiple such queries (k) Need to execute if P ≠ P΄ Queries are usually complex Equivalence of plan P and P΄cannot be inferred in most cases Time consuming 12/15/2021 ACM SIGMOD 2009 10
DBMS Testing Query Generation Data Generation • Quickly generating Billion-Record databases (SIGMOD’ 94) • Flexible Database Generators (VLDB’ 05) • Generating Thousand Benchmark Queries in Seconds (VLDB’ 04) • Genetic approach (VLDB’ 07) • Reverse Query Processing (ICDE’ 07) • MUDD: A Multi-dimensional data generator(WOSP’ 04) 12/15/2021 • RAGS (VLDB’ 98) • Unit testing query transformation rules (DBTest’ 08) • Generating queries with cardinality constraints (TKDE’o 6, SIGMOD’ 08) ACM SIGMOD 2009 11
Query Generation for Rule Testing RAGS (VLDB’ 98) Stochastic SQL statement generation Control SQL generated via configuration parameters #Joins, #columns in Group-By, max sub-query depth, … Genetic approach (VLDB’ 07) Queries are mutated, combined, etc. to generate new queries Feedback function applied on each query to determine “fitness” E. g. prefer queries with non-empty results 12/15/2021 ACM SIGMOD 2009 12
Our Contributions Query generation Exploit “rule patterns” to identify necessary condition for a rule to be exercised Significantly reduces number of trials compared to previous approaches Correctness validation Novel problem of test suite compression Significantly reduce time for correctness testing Shown to be NP-Hard Principled solution (factor 2 approximation) 12/15/2021 ACM SIGMOD 2009 13
QRel Framework QREL: (DBTest’ 08) Programming framework for generating queries Generate logical query tree from tree “pattern” Generate SQL from a given logical query tree 12/15/2021 ACM SIGMOD 2009 14
Architecture 12/15/2021 ACM SIGMOD 2009 15
Rule Patterns Rule (Rule Name, Rule Pattern, Substitution) Input expression e If e matches Rule Pattern Generate new expression by invoking Substitution function on e Apply rule R R Rule Pattern for Join Commutativity 12/15/2021 S T ACM SIGMOD 2009 S T 16
Exposing Rule Patterns Idea: Optimizer exposes a Rule Pattern for a given rule Returns (a subset of) necessary conditions for rule to be exercised Encoded using XML in our implementation DBMS Query Generation Tool 12/15/2021 “Join Commutativity” Query Optimizer ACM SIGMOD 2009 17
Rule Interactions Bugs in implementation of one rule may manifest when another rule is also applied “Get to Index Scan” rule R. a = S. b Get R Get S “Get to Index Scan” rule R. a = S. b Get R 12/15/2021 Index Scan I (a, d) Get S Index Scan I (d, a) “Join to Merge Join” rule Get S Index Scan I (a, d) “Join to Merge Join” rule Get S ACM SIGMOD 2009 Merge Join R. a = S. b Get S Merge Join R. a = S. b Index Scan I (d, a) Get S 18
Rule Composition Rule Pattern for Join Commutativity Group-By Rule Pattern for Pulling GB above Join Wildcard Combine rule patterns by replacing a wildcard node with the other rule pattern Other kinds of composition possible as well Group-By 12/15/2021 Group-By ACM SIGMOD 2009 19
Query Generation Algorithm For each rule pair (r 1, r 2) 1. 2. 3. 4. Select a composition of rule patterns T = Generate logical query tree for rule pattern S = Generate SQL statement for T // use QREL Repeat if r 1 and r 2 not exercised when S is optimized SELECT T 3. a, … FROM T 1, T 2, T 3 WHERE … GROUP BY T 3. a, … T 1 Group-By T 2 Group-By T 3 12/15/2021 ACM SIGMOD 2009 20
Experiments Number of Trials (Rule Pairs) 45 40 35 30 25 20 15 10 5 0 Random Pattern 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Rule Id Number of trials Number of Trials (Singleton rules) 16000 14000 12000 10000 8000 6000 4000 2000 0 Random Pattern 15 30 Number of rules Number of trials significantly fewer using Rule Patterns 12 x reduction in number of trials for rule pairs 12/15/2021 ACM SIGMOD 2009 21
Test Suite Compression 110 r 1 Q 1 100 130 r 2 160 500 r 3 400 Q 2 150 Q 3 300 Baseline Cost = 100 + 150 + 300 + 110 + 160 + 400 = 1220 Find sub-graph of bipartite graph such that Each rule is selected Degree of each rule node is equal to test suite size (k) Sum of the edge costs is minimized Problem is NP-Hard (reduction from Set Cover problem) 12/15/2021 ACM SIGMOD 2009 22
Set Cover Heuristic Benefit(Q) = Number of new rules exercised/ Cost(Q) Greedily add query with largest “Benefit” Add edges corresponding to Q 110 r 1 Q 1 100 130 r 2 r 3 160 500 400 Q 2 150 Q 3 300 Benefit(Q 1) = 3/100 Benefit(Q 2) = 1/150 Benefit(Q 3) = 1/200 Total Solution Cost = 100 + 110 + 130+ 500 = 840 Key drawback: ignores edge costs Turning off a rule can significantly plan cost 12/15/2021 ACM SIGMOD 2009 23
Top K Independent Algorithm For each rule r, add k edges with the lowest cost Factor 2 approximation of the optimal Ignores node cost 110 r 1 Q 1 100 130 r 2 r 3 160 500 400 Q 2 150 Q 3 300 Total solution cost = 100 + 150+ 110 + 130 + 160 = 650 In practice much better than alternatives 12/15/2021 ACM SIGMOD 2009 24
Experiments Quality vs. Test suite size (k) (Rule pairs, n=15) Test Suite Compression Quality (Singleton Rule, k=10) 1 E+09 1 E+08 Cost 100000 1 E+07 BASELINE 1 E+06 1 E+05 1000 100 BASELINE 1 E+04 SMC 1 E+03 TOPK 1 E+02 SMC TOPK 1 E+01 1 5 10 20 50 10 5 10 15 20 25 Test suite size per rule pair (k) Number of Rules Top K Independent is significantly better Even better for case of rule pairs Further optimizations, experiments in paper 12/15/2021 ACM SIGMOD 2009 25
Conclusion Testing query optimizer rule engine is important Query generation for rule testing Significant gains by exploiting rule patterns Correctness validation Dramatic reductions possible using test suite compression Many open problems in rule testing Other variants of “rule exercising” Other kinds of rule interactions Data generation to ensure other necessary conditions (e. g. star join optimization rule requires FK relationship) 12/15/2021 ACM SIGMOD 2009 26