Query Optimization Chapter 15 Query Evaluation Query Parser

Query Evaluation Query Parser Parsed query Query Optmizer Plan Generator Plan Cost Estimator Evaluation

Query Optimization • • • It is responsible for identifying an efficient execution plan.

Query Evaluation Plan • A Query Evaluation Plan (or simply plan): Tree of extended

Query Evaluation Plans • Consider the following query: • The same query in R.

Pipelined vs Materialized Evaluation • When a query is composed of several operators, the

Motivating Example SELECT S. sname FROM Reserves R, Sailors S WHERE R. sid=S. sid

(On-the-fly) sname Alternative Plans 1 (No Indexes) • Main difference: push selects. • With

Alternative Plans 2 With Indexes • With clustered index on bid of Reserves, we

Alternative Plans Considered • Relational Algebra equivalences play a central role in identifying alternative

Left-Deep Plans • Consider a query of the form A B C D D

Equivalence Rules 1. Conjunctive selection operations can be deconstructed into a sequence of individual

Pictorial Depiction of Equivalence Rules

Equivalence Rules (Cont. ) 5. Theta-join operations (and natural joins) are commutative. E 1

Equivalence Rules (Cont. ) 7. The selection operation distributes over theta join operation under

Equivalence Rules (Cont. ) 8. The projections operation distributes over theta join operation as

Equivalence Rules (Cont. ) 9. The set operations union and intersection are commutative E

Transformation Example • Query: Find the names of sailors who reserved a red boat

Translating SQL Queries into R. A. • As an example consider the following SQL

Query Blocks: Units of Optimization • An SQL query is parsed into a collection

Estimating the cost of a plan • For each plan considered, must estimate cost:

Highlights of System R Optimizer • Impact: – Most widely used currently; works well

Estimating result sizes • Consider a query block: SELECT attribute list FROM relation list

Computing reduction factors • column = value: – if there is an index I

Reduction factors • column 1 = column 2: – if there are indexes I

Estimation of the Size of Joins • The Cartesian product r x s contains

Estimation of the Size of Joins (Cont. ) • If R S = {A}

Enumeration of Alternative Plans • There are two main cases: – – Single-relation plans

Cost Estimates for Single-Relation Plans • Index I on primary key matches selection: –

Example SELECT S. sid FROM Sailors S WHERE S. rating=8 • If we have

Queries Over Multiple Relations • Fundamental decision in System R: only left-deep join trees

Enumeration of Left-Deep Plans • Left-deep plans differ only in the order of relations,

Enumeration of Plans (Contd. ) • ORDER BY, GROUP BY, aggregates etc. handled as

Example 1 Sailors: B+ tree on rating Hash on sid Reserves: B+ tree on

Example 2 sid, count(*) as numres SELECT S. sid, COUNT(*) As numres FROM Boats

• Pass 1: Find best plans for each file. – Sailors: File scan

• Pass 3: For each plan retained in pass 2, taken as outer

Nested Queries SELECT S. sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves

Summary • Query optimization is an important task in a relational DBMS. • Must

Summary (Contd. ) • Single-relation queries: – – All access paths considered, cheapest is

Slides: 40

Download presentation

Query Optimization Chapter 15

Query Evaluation Query Parser Parsed query Query Optmizer Plan Generator Plan Cost Estimator Evaluation Plan Query Plan Evaluator Catalog Manager

Query Optimization • • • It is responsible for identifying an efficient execution plan. A query is essentially treated as a , , algebra expression, with the remaining operations. Optimizing such a relational algebra expression involves two basic steps: 1. Enumerating alternative plans (Typically a subset of all possible plans). • Needs equivalence rules 2. Estimating the cost of each enumerated plan and choosing the plan with the lowest estimated cost.

Query Evaluation Plan • A Query Evaluation Plan (or simply plan): Tree of extended R. A. ops, with choice of algorithm for each operation. • Two main issues: – For a given query, what plans are considered? • Algorithm to search plan space for cheapest (estimated) plan. – How is the cost of a plan estimated? • Ideally: Want to find best plan. Practically: Avoid worst plans!

Query Evaluation Plans • Consider the following query: • The same query in R. A. : SELECT S. sname FROM Reserves R, Sailors S WHERE R. sid=S. sid AND R. bid=100 AND S. rating>5 sname( bid =100 rating > 5 (Reserves R. A. tree: Plan: (On-the-fly) sname bid=100 Sailors)) bid=100 rating > 5 (On-the-fly) (Simple Nested Loops) sid=sid Reserves sid=sid Sailors file scan Reserves Sailors file scan

Pipelined vs Materialized Evaluation • When a query is composed of several operators, the result is sometimes pipelined to another operator without creating a temporary file. – It is a control strategy governing the rate at which different joins in the plan proceed. – Has lower ovehead cost. • If the output of an operator is saved in a temporary file for processing by the next operator we say that tuples are materialized. • Example: C Results of 1 st join pipelined into 2 nd A B

Motivating Example SELECT S. sname FROM Reserves R, Sailors S WHERE R. sid=S. sid AND R. bid=100 AND S. rating>5 Plan: (On-the-fly) sname • Cost: 500+500*1000 I/Os • By no means the worst plan! rating > 5 (On-the-fly) bid=100 • Misses several opportunities: selections could have been `pushed’ earlier, no use is made of any (Simple Nested Loops) available indexes, etc. sid=sid • Goal of optimization: To find more efficient plans that compute the Sailors Reserves same answer.

(On-the-fly) sname Alternative Plans 1 (No Indexes) • Main difference: push selects. • With 5 buffers, cost of plan: sid=sid (Scan; write to bid=100 temp T 1) Reserves (Sort-Merge Join) rating > 5 (Scan; write to temp T 2) Sailors – Scan Reserves (1000) + write temp T 1 (10 pages, if we have 100 boats, uniform distribution). – Scan Sailors (500) + write temp T 2 (250 pages, if we have 10 ratings). – Sort T 1 (2*2*10), sort T 2 (2*3*250), merge (10+250) – Total: 3560 page I/Os. • If we used BNL join, join cost = 10+4*250, total cost = 2770. • If we `push’ projections, T 1 has only sid, T 2 only sid and sname: – T 1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.

Alternative Plans 2 With Indexes • With clustered index on bid of Reserves, we get 100, 000/100 = 1000 tuples on 1000/100 = 10 pages. • INL with pipelining (outer is not materialized). sname (On-the-fly) rating > 5 (On-the-fly) sid=sid (Use hash index; do not write result to temp) bid=100 (Index Nested Loops, with pipelining ) Sailors Reserves –Projecting out unnecessary fields from outer doesn’t help. v Join column sid is a key for Sailors. –At most one matching tuple, unclustered index on sid OK. v v Decision not to push rating>5 before the join is based on availability of sid index on Sailors. Cost: Selection of Reserves tuples (10 I/Os); for each, must get matching Sailors tuple (1000*1. 2); total 1210 I/Os.

Alternative Plans Considered • Relational Algebra equivalences play a central role in identifying alternative plans. E. g: – Selections and cross products combined into joins. – Joins can be reordered. – Selections and projections can be “pushed”. • In particular, join ordering is important. – As the number of joins increases, the number of alternative plans increases rapidly. – Thus, it’s necessary to prune the search space. – Usually search is restricted to left-deep plans only, since they allow fully pipelined plans.

Left-Deep Plans • Consider a query of the form A B C D D D C A B linear tree, also left-deep tree C A A linear tree B C B bushy tree D

Equivalence Rules 1. Conjunctive selection operations can be deconstructed into a sequence of individual selections. 2. Selection operations are commutative. 3. Only the last in a sequence of projection operations is needed, the others can be omitted. 4. Selections can be combined with Cartesian products and theta joins. a. (E 1 X E 2) = E 1 E 2 b. 1(E 1 2 E 2) = E 1 1 2 E 2

Pictorial Depiction of Equivalence Rules

Equivalence Rules (Cont. ) 5. Theta-join operations (and natural joins) are commutative. E 1 E 2 = E 2 E 1 6. Natural join operations are associative: (E 1 E 2) E 3 = E 1 (E 2 E 3)

Equivalence Rules (Cont. ) 7. The selection operation distributes over theta join operation under the following two conditions: a) When all the attributes in 0 involve only the attributes of one of the expressions (E 1) being joined. 0 E 1 E 2) = ( 0(E 1)) E 2 b) When 1 involves only the attributes of E 1 and 2 involves only the attributes of E 2. 1 E 1 E 2) = ( 1(E 1)) ( (E 2))

Equivalence Rules (Cont. ) 8. The projections operation distributes over theta join operation as follows: (a) if involves only attributes from L 1 L 2: (b) Consider a join E 1 E 2. – Let L 1 and L 2 be sets of attributes from E 1 and E 2, respectively. – Let L 3 be attributes of E 1 that are involved in join condition , but are not in L 1 L 2, and – let L 4 be attributes of E 2 that are involved in join condition , but are not in L 1 L 2.

Equivalence Rules (Cont. ) 9. The set operations union and intersection are commutative E 1 E 2 = E 2 E 1 n (set difference is not commutative). 10. Set union and intersection are associative. (E 1 E 2) E 3 = E 1 (E 2 E 3) 11. The selection operation distributes over , and –. (E 1 – E 2) = (E 1) – (E 2) and similarly for and in place of – Also: (E 1 – E 2) = (E 1) – E 2 and similarly for in place of –, but not for 12. The projection operation distributes over union L(E 1 E 2) = ( L(E 1)) ( L(E 2))

Transformation Example • Query: Find the names of sailors who reserved a red boat after 19/3/2004. sname( color = “red” day > 19/3/2004 (boats (reserves sailors))) • Transformation using join associativity and comm. : sname(( color = “red” day > 19/3/2004 (reserves boats) sailors) • Apply the “perform selections early” rule, resulting in the subexpression color = “red” (boats) day > 19/3/2004 (reserves ) • Thus a sequence of transformations can be useful

Translating SQL Queries into R. A. • As an example consider the following SQL query: SELECT sname, age FROM Sailors WHERE rating > SELECT max(rating) FROM Sailors WHERE age = 30 • Decompose into 2 blocks: SELECT max(rating) FROM Sailors WHERE age = 30 Max( rating ( age = 30 (Sailors)) SELECT sname, age FROM Sailors WHERE rating > constant sname, age ( rating>constant (Sailors))

Query Blocks: Units of Optimization • An SQL query is parsed into a collection SELECT S. sname of query blocks, and these are optimized FROM Sailors S one block at a time. WHERE S. age IN • Nested blocks are usually treated as (SELECT MAX (S 2. age) calls to a subroutine, made once per FROM Sailors S 2 outer tuple. (This is an over. GROUP BY S 2. rating) simplification, but serves for now. ) Outer block Nested block v For each block, the plans considered are: – All available access methods, for each reln in FROM clause. – All left-deep join trees (i. e. , all ways to join the relations one-at-a -time, with the inner reln in the FROM clause, considering all reln permutations and join methods. )

Estimating the cost of a plan • For each plan considered, must estimate cost: – Must estimate cost of each operation in plan tree. • Depends on input cardinalities. • We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc. ) – Must estimate size of result for each operation in tree! • Use information about the input relations. • For selections and joins, assume independence of predicates. • We’ll discuss the System R cost estimation approach. – Very inexact, but works ok in practice. – More sophisticated techniques known now.

Highlights of System R Optimizer • Impact: – Most widely used currently; works well for < 10 joins. • Cost estimation: Approximate art at best. – Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes. – Considers combination of CPU and I/O costs. • Plan Space: Too large, must be pruned. – Only the space of left-deep plans is considered. • Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation. – Cartesian products avoided.

Estimating result sizes • Consider a query block: SELECT attribute list FROM relation list WHERE term 1 term 2 . . . termn • Every term in WHERE eliminates some of the result tuples. • A reduction factor is associated with each term. • The number of tuples in the result is estimated as the maximum size (i. e. product of cardinalities) times the product of reduction factors.

Computing reduction factors • column = value: – if there is an index I on column : 1/Nkeys(I); – if no index and no statistics about the distinct values: 1/10 (arbitrarily). • column > value: – If there is an index: High(I) - value High(I) – Low(I) – No index, not of arithmetic type: 1/2 (arbitrarily).

Reduction factors • column 1 = column 2: – if there are indexes I 1 and I 2 on column 1 and column 2 respectively: 1 max(Nkeys(I 1), NKeys(I 2)) – if only one of the columns have an index: 1 Nkeys(I) – if no index: 1/10 arbitrarily

Estimation of the Size of Joins • The Cartesian product r x s contains nr. ns tuples; each tuple occupies sr + ss bytes. • If R S = , then r s is the same as r x s. • If R S is a key for R, then a tuple of s will join with at most one tuple from r – therefore, the number of tuples in r tuples in s. s is no greater than the number of • If R S in S is a foreign key in S referencing R, then the number of tuples in r s is exactly the same as the number of tuples in s. • The case for R S being a foreign key referencing S is symmetric. • In the example query reserves sailors, sid in reserves is a foreign key of sailors – hence, the result has exactly nreserves tuples.

Estimation of the Size of Joins (Cont. ) • If R S = {A} is not a key for R or S. If we assume that every tuple t in R produces tuples in R S, the number of tuples in R S is estimated to be: If the reverse is true, the estimate obtained will be: The lower of these two estimates is probably the more accurate one.

Enumeration of Alternative Plans • There are two main cases: – – Single-relation plans Multiple-relation plans • For queries over a single relation, queries consist of a combination of selects, projects, and aggregate ops: – – Each available access path (file scan / index) is considered, and the one with the least estimated cost is chosen. The different operations are essentially carried out together (e. g. , if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation).

Cost Estimates for Single-Relation Plans • Index I on primary key matches selection: – Cost is Height(I)+1 for a B+ tree, about 1. 2 for hash index. • Clustered index I matching one or more selects: – (NPages(I)+NPages(R)) * product of RF’s of matching selects. • Non-clustered index I matching one or more selects: – (NPages(I)+NTuples(R)) * product of RF’s of matching selects. • Sequential scan of file: – NPages(R). +Note: Typically, no duplicate elimination on projections! (Exception: Done on answers if user says DISTINCT. )

Example SELECT S. sid FROM Sailors S WHERE S. rating=8 • If we have an index on rating: – – – (1/NKeys(I)) * NTuples(R) = (1/10) * 40000 tuples retrieved. Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(R)) = (1/10) * (50+500) pages are retrieved. (This is the cost. ) Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(R)) = (1/10) * (50+40000) pages are retrieved. • If we have an index on sid: – Would have to retrieve all tuples/pages. With a clustered index, the cost is 50+500, with unclustered index, 50+40000. • Doing a file scan: – We retrieve all file pages (500).

Queries Over Multiple Relations • Fundamental decision in System R: only left-deep join trees are considered. – – As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space. Left-deep trees allow us to generate all fully pipelined plans. • Intermediate results not written to temporary files. • Not all left-deep trees are fully pipelined (e. g. , SM join). D D C A B C D A B C A B

Enumeration of Left-Deep Plans • Left-deep plans differ only in the order of relations, the access method for each relation, and the join method for each join. • Enumerated using N passes (if N relations joined): – – – Pass 1: Find best 1 -relation plan for each relation. Pass 2: Find best way to join result of each 1 -relation plan (as outer) to another relation. (All 2 -relation plans. ) Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the N’th relation. (All N-relation plans. ) • For each subset of relations, retain only: – – Cheapest plan overall, plus Cheapest plan for each interesting order of the tuples.

Enumeration of Plans (Contd. ) • ORDER BY, GROUP BY, aggregates etc. handled as a final step, using either an `interestingly ordered’ plan or an additonal sorting operator. • An N-1 way plan is not combined with an additional relation unless there is a join condition between them, unless all predicates in WHERE have been used up. – i. e. , avoid Cartesian products if possible. • In spite of pruning plan space, this approach is still exponential in the # of tables.

Example 1 Sailors: B+ tree on rating Hash on sid Reserves: B+ tree on bid • Pass 1: – Sailors: B+ tree matches rating>5, and is probably cheapest. However, if this selection is expected to retrieve a lot of tuples, and index is unclustered, file scan may be cheaper. • Still, B+ tree plan kept (because tuples are in rating order). sname sid=sid bid=100 rating > 5 Reserves Sailors Reserves: B+ tree on bid matches bid=100; cheapest. v Pass 2: – We consider each plan retained from Pass 1 as the outer, and consider how to join it with the (only) other relation. – e. g. , Reserves as outer: Hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value. u

Example 2 sid, count(*) as numres SELECT S. sid, COUNT(*) As numres FROM Boats B, Reserves R, Sailors S WHERE R. sid = S. sid and B. bid = R. bid and B. color = ‘red GROUP BY S. sid GROUP BYsid sid=sid • Available Indexes : Sailors: B+ tree on sid Hash on sid Reserves: B+ tree on sid B+ tree on bid (clustered) Boats: B+ tree on color Hash on color Sailors bid=bid color = red Boats Reserves

• Pass 1: Find best plans for each file. – Sailors: File scan – Reserves: File scan – Boats: hash index on color, B+ tree index on color • Pass 2: Possible joins are considered. – – – – Reserves Boats Reserves Sailors Boats Sailors Reserves Boats (access B+ tree) Sailors Boats (access B+ tree) Reserves Boats (access hash) Sailors Boats (access hash) Reserves • For each such pair, consider every join method and for each join method consider every access path for the inner relation. • For each pair keep the cheapest of the plans & the cheapest plan that generates tuples in sorted order

• Pass 3: For each plan retained in pass 2, taken as outer relation, consider how to join remaining relation as inner one. § Example plan for this pass: Boats (via hash) Reserves (via B+tree) (sort-merge) Result is joined with Sailors (via B+ tree) (sort merge) • GROUP BY is considered after all joins. It requires sorting on sid.

Nested Queries SELECT S. sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R. bid=103 AND R. sid=S. sid) • Nested block is optimized independently, with the outer tuple considered as providing a selection condition. Nested block to optimize: • Outer block is optimized with the cost SELECT * of `calling’ nested block computation FROM Reserves R taken into account. WHERE R. bid=103 • Implicit ordering of these blocks means AND S. sid= outer value that some good strategies are not considered. The non-nested version of Equivalent non-nested query: the query is typically optimized better. SELECT S. sname FROM Sailors S, Reserves R WHERE S. sid=R. sid AND R. bid=103

Summary • Query optimization is an important task in a relational DBMS. • Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). • Two parts to optimizing a query: – Consider a set of alternative plans. • Must prune search space; typically, left-deep plans only. – Must estimate cost of each plan that is considered. • Must estimate size of result and cost for each plan node. • Key issues: Statistics, indexes, operator implementations.

Summary (Contd. ) • Single-relation queries: – – All access paths considered, cheapest is chosen. Issues: Selections that match index, whether index key has all needed fields and/or provides tuples in a desired order. • Multiple-relation queries: – – All single-relation plans are first enumerated. • Selections/projections considered as early as possible. Next, for each 1 -relation plan, all ways of joining another relation (as inner) are considered. Next, for each 2 -relation plan that is `retained’, all ways of joining another relation (as inner) are considered, etc. At each level, for each subset of relations, only best plan for each interesting order of tuples is `retained’.