Access Path Selection in a Relational Database Management

Access Path Selection in a Relational Database Management System Selinger et al.

Query Optimization • Declarative query language relieves programmer of the burden to choose an access plan. • Difference between good and bad access plans can be several orders of magnitude. • Problem: How do we find a good plan?

How to choose a good plan? Divide into three problems: • What is the set of plans we consider? • How do we compare (cost) plans? • How do we choose a “good” plan from this set?

Background: SQL SELECT {DISTINCT} <list of columns> FROM <list of tables> {WHERE <list of "Boolean Factors“ (predicates in CNF)>} {GROUP BY <list of columns> {HAVING <list of Boolean Factors>}} {ORDER BY <list of columns>};

Processing SQL Statements • Four steps: – Parsing – Optimization – Code generation – Execution

SQL Query Types • Single block queries only – optimize nested sub-queries separately • Correlated sub-queries are much harder and much more expensive than un-correlated sub-queries. – Rewrite to remove correlations where possible. (Tricky, more about this later. ) • SPJ queries only in this paper

Problem 1: Plan Space • Fixed set of individual access methods • Sequential & index (clustered/unclustered) scans • NL-join, (sort)-merge join, hash join • Sorting & hash-based grouping • Plan flows in a non-blocking fashion with get-next iterators

Plan Space (Contd. ) • Assumptions in System R: – Selections of “sargable” predicates are "pushed down" – Projections are "pushed down" – Single query blocks – Only left-deep plan trees (There are n! plans (not factoring in choice of join method)) – Avoid Cartesian products

Problem 2: How to Cost a Plan • Estimation of input and intermediate cardinalities based on simple statistical models (e. g. , uniform distribution assumption, attribute independence) • Estimation of costs for each operator based on statistics about input relations – Cost is weighted function between I/O and CPU (no distinction between random and sequential IO)

Cost Estimation (Selinger) • Maintenance of simple statistics: – # of tuples & pages – # of values per column (only for indexed columns) • Assumption of attribute independence • When no estimate available, use magic number • These estimations are done periodically

Cost Estimation (Today) • Sampling: so far only concrete results for base relations • Histograms: getting better. Common in industry, some interesting new research. • Controlling "error propagation"

Problem 3: Choosing a Plan • Exhaustive search • Dynamic Programming (prunes suboptimal parts of the search space): System R • Top-down, transformative version of DP: Volcano, Cascades (used in MS SQL Server? ) • Randomized search algorithms (e. g. Ioannidis & Kang) • Techniques from Operations Research • Etc.

System R Approach • Recall: Only left-deep plan trees (n! different plans) • Observation: Many of these plans share common prefixes, so do not recompute all of them: Dynamic Programming

Dynamic Programming Approach 1. Find all 1 -table plans for each base relation 2. Try all ways of joining i-table plans saved so far with 1 -table plans. Save cheapest unordered (i+1)-table plans, and cheapest (i+1)-table plans for each interesting order • Note: secondary join predicates are just like selections that can’t be pushed down 3. At the end, GROUP BY and ORDER BY – Use plan in interesting order, or add sort to cheapest unordered plan.

Evaluation • Complexity of dynamic programming: about n 2 n-1, intermediate storage: plans • “No-cartesian-products” rule can make a big difference for some queries. • DP only works up to 10 -15 joins • Adding parameters to the search space makes things worse (e. g. expensive predicates, distribution, parallelism, etc. )

Nested Queries • Subqueries optimized separately • Uncorrelated vs. correlated subqueries – Uncorrelated subqueries are basically constants to be computed once – Correlated subqueries are like function calls

Query Rewrite in IBM DB 2 Leung et al.

Why Query Rewrite? • Problem: – Very complex queries automatically generated through tools with many levels of subqueries – Correlated subqueries – Selinger approach only works one block at a time • Main idea: Transform the query into a simpler, equivalent query

Query Rewrite is Tricky SELECT P. pno FROM Parts P WHERE P. quantity = (SELECT COUNT(*) FROM Supply S WHERE S. pno = P. pno AND S. shipdate < 1 -1 -2000) INSERT INTO Temp (pnum, cnt) (SELECT P. pno, COUNT(*) FROM Supply S WHERE S. shipdate < 1 -12000 GROUP BY P. pno) SELECT P. pno FROM Parts P, Temp T WHERE P. quantity = T. cnt AND P. pno = T. pno