Semantic Query Optimization Outline Semantic Query Optimization Soft

  • Slides: 49
Download presentation
Semantic Query Optimization

Semantic Query Optimization

Outline • • Semantic Query Optimization Soft Constraints Query Optimization via Soft Constraints Selectivity

Outline • • Semantic Query Optimization Soft Constraints Query Optimization via Soft Constraints Selectivity Estimation via Soft Constraints 2

Semantic Query Optimization Use integrity constraints associated with a database to rewrite a query

Semantic Query Optimization Use integrity constraints associated with a database to rewrite a query into a form that may be evaluated more efficiently Some Techniques: • Join Elimination • Predicate Elimination • Join Introduction • Predicate Introduction • Detecting an Empty Answer Set 3

Commercial implementations of SQO Few (if any!) Early Experiences: • Could not spend too

Commercial implementations of SQO Few (if any!) Early Experiences: • Could not spend too much time on optimization • Few integrity constraints are ever defined • Association with deductive databases 4

Join elimination: example select from where p_name, p_retailprice, s_name, s_address tpcd. lineitem, tpcd. partsupp,

Join elimination: example select from where p_name, p_retailprice, s_name, s_address tpcd. lineitem, tpcd. partsupp, tpcd. part, tpcd. supplier p_partkey = ps_partkey and s_suppkey = ps_suppkey and ps_partkey = l_partkey and ps_suppkey = l_suppkey; RI constraints: select from where part-partsupp (on partkey) supplier-partsupp (on partkey) partsupp-lineitem (on partkey and suppkey) p_name, p_retailprice, s_name, s_address tpcd. lineitem, tpcd. partsupp, tpcd. part, tpcd. supplier p_partkey = l_partkey and s_suppkey = l_suppkey; 5

Algorithm for join elimination 1. Derive column transitivity classes from the join predicates in

Algorithm for join elimination 1. Derive column transitivity classes from the join predicates in the query 2. Divide the relations in the query that are related through RI constraints into removable and non-removable 3. Eliminate all removable relations from the query 4. Add is not null predicate to foreign key columns of all tables whose RI parents were removed 6

Algorithm for join elimination: example S. S PS. S C. C O. C 7

Algorithm for join elimination: example S. S PS. S C. C O. C 7

Performance results for join elimination 9

Performance results for join elimination 9

Predicate Introduction: Example select from where sum(l_extendedprice * l_discount) as revenue tpcd. lineitem shipdate

Predicate Introduction: Example select from where sum(l_extendedprice * l_discount) as revenue tpcd. lineitem shipdate >date('1994 -01 -01'); Check constraint: receiptdate >= shipdate Clustered Index on receiptdate select from where sum(l_extendedprice * l_discount) as revenue tpcd. lineitem shipdate >date('1994 -01 -01') and receiptdate >= date('1994 -01 -01'); 10

Algorithm for Predicate Introduction N - set of predicates derivable from the query and

Algorithm for Predicate Introduction N - set of predicates derivable from the query and check constraints • If N is inconsistent, stop. • Else, for each predicate A op B in N, add it to the query if: • A or B is a join column • B is a major column of an index • no other index on B’s table can be used in the plan for the original query 11

Queries select from where 100. 00 * sum (case when p_type like 'PROMO%' then

Queries select from where 100. 00 * sum (case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else 0 end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue tpcd. lineitem, tpcd. part l_partkey = p_partkey and l_shipdate >= date('1998 -09 -01') and l_shipdate < date('1998 -09 -01') + 1 month; Given the check constraint l_receiptdate >= l_shipdate we may add a new predicate to the query: l_receiptdate >= date(‘ 1998 -09 -01’) 13

Performance Results for Index Introduction 14

Performance Results for Index Introduction 14

The Culprit New query plan uses an index, but the original table scan is

The Culprit New query plan uses an index, but the original table scan is still better! Why did this happen: • incorrect estimate of the filter factor • underestimation of the CPU cost of locking index pages 15

Soft Constraints 16

Soft Constraints 16

Soft Constraints Traditional (“hard”) integrity constraints are defined to prevent incorrect updates. A soft

Soft Constraints Traditional (“hard”) integrity constraints are defined to prevent incorrect updates. A soft constraint is a statement that is true about the current state of the database, but does not verify updates. In fact, a soft constraint can be invalidated by an update. 17

Soft Constraints (cont. ) • Absolute soft constraints – no violation in the current

Soft Constraints (cont. ) • Absolute soft constraints – no violation in the current state of the database FAbsolute soft constraints can be used for optimization in exactly the same way traditional constraints are. • Statistical soft constraints – can have some (small) degree of violation FStatistical soft constraints can be used for improved selectivity estimation 18

Implementation of Soft Constraints In Oracle the standard integrity constraints are marked with a

Implementation of Soft Constraints In Oracle the standard integrity constraints are marked with a rely option, so that they are not verified on updates. In DB 2 soft constraints are called informational constraints. 19

Informational Check Constraint Example 1: Create an employee table where a minimum salary of

Informational Check Constraint Example 1: Create an employee table where a minimum salary of $25, 000 is guaranteed by the application CREATE TABLE emp(empno INTEGER NOT NULL PRIMARY KEY, name VARCHAR(20), firstname VARCHAR(20), salary INTEGER CONSTRAINT minsalary CHECK (salary >= 25000) NOT ENFORCED ENABLE QUERY OPTIMIZATION); 20

Enforcing Validation Example 2: Alter the employee table to start enforcing the minimum wage

Enforcing Validation Example 2: Alter the employee table to start enforcing the minimum wage of $25, 000 using DB 2 will also verify existing data right away. ALTER TABLE emp ALTER CONSTRAINT minsalary ENFORCED 21

Informational RI Constraint Example 3: Create a department table where the application ensures the

Informational RI Constraint Example 3: Create a department table where the application ensures the existence of departments to which the employees belong. CREATE TABLE dept(deptno INTEGER NOT NULL PRIMARY KEY, dept. Name VARCHAR(20), budget INTEGER); ALTER TABLE emp ADD COLUMN dept INTEGER NOT NULL CONSTRAINT dept_exist REFERENCES dept NOT ENFORCED ENABLE QUERY OPTIMIZATION); 22

Query Optimization via Empty Joins 23

Query Optimization via Empty Joins 23

Example select Model from Tickets T, Registration R where T. Reg. Num = R.

Example select Model from Tickets T, Registration R where T. Reg. Num = R. Reg. Num and T. date > “ 1990 -01 -01” and R. Model LIKE “BMW Z 3%” First BMW Z 3 series cars were made in 1997. select Model from Tickets T, Registration R where T. Reg. Num = R. Reg. Num and T. date > “ 1997 -01 -01” and R. Model LIKE “BMW Z 3%” 24

Matrix representation of empty joins (R A, B S) 25

Matrix representation of empty joins (R A, B S) 25

Staircase data structure 26

Staircase data structure 26

Properties of the algorithm • Time Complexity O(nm) requires a single scan of the

Properties of the algorithm • Time Complexity O(nm) requires a single scan of the sorted data • Space Complexity O(min(n, m)) only two rows of the matrix need be kept in memory • Scalable with respect to: • number of tuples in the join result • number of discovered empty rectangles • size of the domain of one of the attributes 27

How many empty rectangles are there? Tests done on 4 pairs of attributes with

How many empty rectangles are there? Tests done on 4 pairs of attributes with numerical domain present in typical joins in a real-world workload of a health insurance company. 28

How big are the rectangles? 29

How big are the rectangles? 29

Query rewrite: simple case select … from R, S, . . . where R.

Query rewrite: simple case select … from R, S, . . . where R. C=S. C and 60<R. A<80 and 20<S. B<80 and. . . select … from R, S, . . . where R. C=S. C and 60<R. A<80 and 20<S. B<60 and. . . 30

Query rewrite: complex case select from where … R, S, . . . R.

Query rewrite: complex case select from where … R, S, . . . R. C=S. C and 60<R. A<80 and 20<S. B<80 and. . . select from where … R, S, . . . R. C=S. C and (… and …) or. . . 31

Experiment I: Size of the Overlap 32

Experiment I: Size of the Overlap 32

Experiment 2: Type of Overlap 33

Experiment 2: Type of Overlap 33

Experiment 3: Number of Empty Joins Used in Rewrite 34

Experiment 3: Number of Empty Joins Used in Rewrite 34

How much do the rectangles overlap with queries? 35

How much do the rectangles overlap with queries? 35

Query optimization experiments • real-world workload of 26 queries • 5 of the queries

Query optimization experiments • real-world workload of 26 queries • 5 of the queries “qualified” for the rewrite • only simple rewrites were considered • all rewrites led to improved performance 36

Query Cardinality Estimate via Empty Joins 37

Query Cardinality Estimate via Empty Joins 37

Query Cardinality Estimate via Empty Joins (SIEQE) • Cardinality estimates crucial for designing good

Query Cardinality Estimate via Empty Joins (SIEQE) • Cardinality estimates crucial for designing good query evaluation plans • Uniform data distribution (UDA): standard assumption in database systems • Histograms effective in single dimensions: too expensive to build and maintain otherwise 38

The Strategy • With UDA, the “density”: 1 tuple/sq unit • Empty joins cover

The Strategy • With UDA, the “density”: 1 tuple/sq unit • Empty joins cover 20% of the area • Adjusted density: 1. 25 tuples/sq unit Q 1 Cardinality UDA SIEQE Q 2 Q 1 100 62 Q 2 100 125 39

Experiments Number of queries for which the error is less than a given limit

Experiments Number of queries for which the error is less than a given limit 40

Discovery of Check Constraints and Their Application in DB 2 We discover two types

Discovery of Check Constraints and Their Application in DB 2 We discover two types of (rules) check constraints: • correlations between attributes over ordered domains • partitioning of attributes 42

Correlations between attributes over ordered domains Rules have the form: Y = b. X

Correlations between attributes over ordered domains Rules have the form: Y = b. X + a + [emin, emax] Algorithm for all tables in the database for all comparable variable pairs (X and Y) in the table apply OLS estimation to get the function of the form: Y = a + b. X calculate the max and min error (or residual) emax and emin endfor 43

Partitioning Rules have the form: If X = a, then Y [emin, emax] Algorithm

Partitioning Rules have the form: If X = a, then Y [emin, emax] Algorithm for all tables in the database for any qualifying variable pair (X and Y) in the table calculate partitions by using GROUP BY X statements find the max and min value of Y for each partition endfor 44

Experiments in TPC-H contains the following check constraint: L_RECEIPTDATE > L_SHIPDATE Our algorithm discovered

Experiments in TPC-H contains the following check constraint: L_RECEIPTDATE > L_SHIPDATE Our algorithm discovered the following rule: L_RECEIPTDATE = L_SHIPDATE + (1, 30), m = 0. 0114. Rules discovered through partitioning: If L_LINESTATUS=F, then L_SHIPDATE=(01/04/1992, 06/17/1995), m = 0. 50 If L_LINESTATUS=O, then L_SHIPDATE=(06/19/1995, 12/25/1998), m = 0. 50 45

Applications • DBA Wizard • Semantic Query Optimization • Improved Filter Factor Estimates 46

Applications • DBA Wizard • Semantic Query Optimization • Improved Filter Factor Estimates 46

Example Consider a query issued against a hotel database, that requests the number of

Example Consider a query issued against a hotel database, that requests the number of guests staying in the hotel on a given date. ARRIVAL DATE <= ‘ 1999 -06 -15’ AND DEPARTURE_DATE >= ‘ 1999 -06 -15’ The filter factor estimate for the query would be: ff = ff 1 * ff 2 If ‘ 1999 -06 -15’ was approximately midway in the date ranges, we would estimate a quarter of all the guests that came in over the number of years would be in the answer of the query! 47

Example (cont. ) Assume that the following check constraint was discovered: DEPARTURE_DATE >= ARRIVAL_DATE

Example (cont. ) Assume that the following check constraint was discovered: DEPARTURE_DATE >= ARRIVAL_DATE + (1 DAY, 5 DAYS) The original condition in the query predicate can then be changed to: ARRIVAL_DATE <= ‘ 1999 -06 -15’ AND ARRIVAL_DATE >= ‘ 1999 -06 -18’ or ARRIVAL_DATE BETWEEN ‘ 1999 -06 -15’ AND ‘ 1999 -06 -18’ The filter factor is now estimated to: ff = (ff 1 + ff 2 – 1) 48

Other Research on the Use of Soft Constraints in Query Optimization 49

Other Research on the Use of Soft Constraints in Query Optimization 49

Query-driven Approach • Built multidimensional histograms based on query results (Microsoft) • Improve cardinality

Query-driven Approach • Built multidimensional histograms based on query results (Microsoft) • Improve cardinality estimates by looking at the intermediate query results (IBM) FBoth techniques generate statistical soft constraints 50

Data-driven Approach • Lots of methods using Bayesian networks to infer statistical soft constraint

Data-driven Approach • Lots of methods using Bayesian networks to infer statistical soft constraint • Lots of methods to discover functional dependencies in data (absolute soft constraints) • Most recently, BHUNT and CORDS use sampling to discover soft constraints (IBM) 51

References • Q. Cheng, J. Gryz, F. Koo, T. Y. Cliff Leung, L. Liu,

References • Q. Cheng, J. Gryz, F. Koo, T. Y. Cliff Leung, L. Liu, X. Qian, B. Schiefer: Implementation of Two Semantic Query Optimization Techniques in DB 2 Universal Database. VLDB 1999. • J. Edmonds, J. Gryz, D. Liang, R. Miller: Mining for Empty Rectangles in Large Data Sets. ICDT 2001. • J. Gryz, B. Schiefer, J. Zheng, C. Zuzarte: Discovery and Application of Check Constraints in DB 2. ICDE 2001. • P. Godfrey, J. Gryz, C. Zuzarte: Exploiting Constraint-Like Data Characterizations in Query Optimization. SIGMOD 2001. • J. Gryz, D. Liang: Query Optimization via Empty Joins. DEXA 2002. • J. Gryz, D. Liang: Query Cardinality Estimation via Data Mining. IIS 2004. 52