CS 4433 Database Systems Query Optimization
Why Do We Learn This?
■ Query Optimization
  – At the heart of the database engine
■ Step 1: convert the SQL query to a logical plan
  – Query compiler
■ Step 2: find a better logical plan, then find an associated physical plan
  – There are multiple ways to answer a SQL query; which one is better (the best)?
Converting from SQL to Logical Plans

SELECT a1, …, an
FROM R1, …, Rk
WHERE C

→ π_{a1, …, an}(σ_C(R1 × R2 × … × Rk))

With GROUP BY b1, …, bm added:

→ π_{a1, …, an}(γ_{b1, …, bm, aggs}(σ_C(R1 × R2 × … × Rk)))
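For instance (an illustrative query over hypothetical relations Product(pname, price, maker) and Company(name, city), similar to the example used later in this deck):

SELECT pname
FROM Product, Company
WHERE maker = name AND city = 'Tally'

→ π_{pname}(σ_{maker=name AND city='Tally'}(Product × Company))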
Optimization: Logical Query Plan
■ Now we have one logical plan
  – Usually not optimal
■ The query optimizer does the following in the name of making query processing more efficient:
  – selecting the appropriate indexes for acquiring data
  – classifying predicates used in a query
  – performing simple data reductions
  – selecting access paths
  – determining the order of joins
  – performing predicate transformations
  – performing Boolean logic transformations
  – performing subquery transformations
Query Optimizer
■ Two approaches to optimization:
  – Rule-based (heuristics): apply laws that seem to result in cheaper plans
    ■ Oracle used this at one point. Presently, no system uses this.
  – Cost-based: estimate the size and cost of intermediate results, and search systematically for the best plan
    ■ Presently, most systems use this.
■ Three key components
  – Algebraic laws
  – An optimization algorithm
  – A cost estimator
Algebraic Laws
■ Commutative and Associative Laws
  – R ∪ S = S ∪ R,  R ∪ (S ∪ T) = (R ∪ S) ∪ T
  – R ∩ S = S ∩ R,  R ∩ (S ∩ T) = (R ∩ S) ∩ T
  – R ⋈ S = S ⋈ R,  R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
■ Distributive Laws
  – R ⋈ (S ∪ T) = (R ⋈ S) ∪ (R ⋈ T)

Q: How would you prove these laws? Do they make sense?
Algebraic Laws
■ Laws involving selection: splitting laws
  – σ_{C AND C'}(R) = σ_C(σ_{C'}(R)) = σ_C(R) ∩ σ_{C'}(R)
  – σ_{C OR C'}(R) = σ_C(R) ∪ σ_{C'}(R)
■ When C involves only attributes of R:
  – σ_C(R ⋈ S) = σ_C(R) ⋈ S
  – σ_C(R − S) = σ_C(R) − S
  – σ_C(R × S) = σ_C(R) × S
  – σ_C(R ∩ S) = σ_C(R) ∩ S

Q: What do these laws mean? Do they make sense?
Example
■ R(A, B), S(B, C)

σ_{(A=1 OR A=3) AND B<C}(R ⋈ S)
= σ_{A=1 OR A=3}(σ_{B<C}(R ⋈ S))
= σ_{A=1 OR A=3}(R ⋈ σ_{B<C}(S))
= σ_{A=1 OR A=3}(R) ⋈ σ_{B<C}(S)
Algebraic Laws
■ Laws involving projections
  – π_M(R ⋈ S) = π_N(π_P(R) ⋈ π_Q(S))
    ■ where N, P, Q are appropriate subsets of the attributes in M
    ■ Does it make sense to reduce I/O?
  – π_M(π_N(R)) = π_{M ∩ N}(R)
■ Example: R(A, B, C, D), S(E, F, G)
  – π_{A, B, G}(R ⋈_{D=E} S) = π_?(π_?(R) ⋈ π_?(S))
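For reference, one consistent way to fill in the question marks (keeping on each side only the attributes needed for the join and the final projection):

π_{A, B, G}(R ⋈_{D=E} S) = π_{A, B, G}(π_{A, B, D}(R) ⋈_{D=E} π_{E, G}(S))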
Rule (Heuristic) Based Optimization
■ Query rewriting based on algebraic laws
■ Results in better queries most of the time
■ Heuristic number 1:
  – Push selections down
■ Heuristic number 2:
  – Sometimes push selections up, then down
Predicate Pushdown

Before: π_{pname}(σ_{price>100 AND city='Tally'}(Product ⋈_{maker=name} Company))
After:  π_{pname}(σ_{price>100}(Product) ⋈_{maker=name} σ_{city='Tally'}(Company))

The earlier we process selections, the fewer tuples we need to manipulate higher up in the tree (but pushing a selection down may cause us to lose an important ordering of the tuples, e.g., one provided by an index).
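A minimal sketch of the "push selections down" rewrite on a toy plan tree. The node classes, attribute bookkeeping, and relation/attribute names are assumptions made up for illustration; no real optimizer's API is implied.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    name: str
    attrs: frozenset       # attributes produced by this relation

@dataclass
class Join:
    left: object
    right: object
    cond: str

@dataclass
class Select:
    child: object
    pred: str
    pred_attrs: frozenset  # attributes mentioned by the predicate

def attrs_of(node):
    """Attributes available in the output of a plan node."""
    if isinstance(node, Scan):
        return node.attrs
    if isinstance(node, Select):
        return attrs_of(node.child)
    return attrs_of(node.left) | attrs_of(node.right)

def push_down(node):
    """If a selection sits on a join and its predicate uses only one side,
    move it onto that side (law: sigma_C(R join S) = sigma_C(R) join S)."""
    if isinstance(node, Select) and isinstance(node.child, Join):
        j = node.child
        if node.pred_attrs <= attrs_of(j.left):
            return Join(push_down(Select(j.left, node.pred, node.pred_attrs)),
                        push_down(j.right), j.cond)
        if node.pred_attrs <= attrs_of(j.right):
            return Join(push_down(j.left),
                        push_down(Select(j.right, node.pred, node.pred_attrs)),
                        j.cond)
    if isinstance(node, Select):
        return Select(push_down(node.child), node.pred, node.pred_attrs)
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right), node.cond)
    return node

# sigma_{price>100}(Product join_{maker=name} Company)
#   becomes  sigma_{price>100}(Product) join_{maker=name} Company
product = Scan("Product", frozenset({"pname", "price", "maker"}))
company = Scan("Company", frozenset({"name", "city"}))
plan = Select(Join(product, company, "maker = name"), "price > 100", frozenset({"price"}))
print(push_down(plan))
```

The sketch handles a single pushed predicate; splitting a conjunction like price>100 AND city='Tally' first (using the splitting laws above) and then pushing each piece to its own side gives the rewritten plan shown on this slide.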
Behind the Scenes: Oracle RBO and CBO
Cost-Based Estimation
Cost-Based Optimizations
■ Main idea: apply algebraic laws until the estimated cost is minimal
■ Practically: start from partial plans, introduce operators one by one
■ Problem: there are too many ways to apply the laws, hence too many (partial) plans
■ Approaches:
  – Top-down: the partial plan is a top fragment of the logical plan, starting from the root
  – Bottom-up: the partial plan is a bottom fragment of the logical plan
    ■ First the subexpressions, then the root
■ How do we estimate the costs of plans accurately?
Size Estimation of Intermediate Relations
■ Need sizes in order to estimate costs
■ Example:
  – Cost of a partitioned hash-join E1 ⋈ E2 is 3 B(E1) + 3 B(E2)
  – B(E1) = T(E1) / block size, B(E2) = T(E2) / block size
  – So we need to estimate T(E1) and T(E2)
■ Estimating the size of a selection
  – S = σ_{A=c}(R)
    ■ Min: T(S) = 0
    ■ Max: T(S) = T(R) − V(R, A) + 1
    ■ Mean value: T(S) = T(R) / V(R, A)
  – S = σ_{A<c}(R)
    ■ Min: T(S) = 0
    ■ Max: T(S) = T(R)
    ■ Heuristic: T(S) = T(R) / 3
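A minimal sketch of these selection-size rules in Python (the function name and dict-based interface are made up for illustration, not any particular system's estimator): equality predicates use T(R)/V(R, A), range predicates use the T(R)/3 heuristic, and a conjunction multiplies the reduction factors.

```python
def estimate_selection(t_r: float, v: dict, predicates: list) -> float:
    """predicates is a list like [("=", "a"), ("<", "b")]; v maps attribute -> V(R, attr)."""
    size = t_r
    for op, attr in predicates:
        if op == "=":
            size /= v[attr]   # equality: T(R) / V(R, A)
        else:
            size /= 3         # range heuristic: T(R) / 3
    return size

# For example, with T(R) = 10000, V(R, a) = 50, and predicate a=10 AND b<20:
print(estimate_selection(10000, {"a": 50}, [("=", "a"), ("<", "b")]))  # ~66.7
```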
Example
1. R(a, b, c) and S = σ_{a=10 AND b<20}(R), with T(R) = 10000 and V(R, a) = 50
   – T(S) = ?
   – T(S) = T(R) / (50 × 3) ≈ 67
2. S = σ_{a=10 OR b<20}(R)
   – T(S) = ?
   – T(S) = T(R) / V(R, a) + T(R) / 3
Size Estimation of a Join
■ Assume V(R, A) ≤ V(S, A)
■ Then each tuple t in R joins some tuple(s) in S
  – How many? On average, T(S) / V(S, A)
  – So t contributes T(S) / V(S, A) tuples to R ⋈_A S
■ Hence T(R ⋈_A S) = T(R) · T(S) / V(S, A)
■ In general: T(R ⋈_A S) = T(R) · T(S) / max(V(R, A), V(S, A))
■ Example
  – T(R) = 10000, T(S) = 20000
  – V(R, A) = 100, V(S, A) = 200
  – How large is R ⋈_A S?
  – Answer: T(R ⋈_A S) = 10000 × 20000 / 200 = 1 M
Size Estimation
■ Joins on more than one attribute:
  – T(R ⋈_{A,B} S) = T(R) · T(S) / (max(V(R, A), V(S, A)) · max(V(R, B), V(S, B)))
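A minimal sketch of the join-size formula: divide the product of the input cardinalities by max(V(R, A), V(S, A)) for every join attribute A. The function name and dict-based interface are illustrative assumptions.

```python
def estimate_join(t_r: float, t_s: float, v_r: dict, v_s: dict, join_attrs: list) -> float:
    size = t_r * t_s
    for a in join_attrs:
        size /= max(v_r[a], v_s[a])  # one factor per join attribute
    return size

# The example above: T(R)=10000, T(S)=20000, V(R, A)=100, V(S, A)=200
print(estimate_join(10000, 20000, {"A": 100}, {"A": 200}, ["A"]))  # 1,000,000
```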
Histograms
■ Employee(ssn, name, salary, phone)
  – Maintain a histogram on salary:

  Salary: | 0..20k | 20k..40k | 40k..60k | 60k..80k | 80k..100k | >100k
  Tuples: | 200    | 800      | 5000     | 12000    | 6500      | 500

  – T(Employee) = 25000, but now we know the distribution
■ Ranks(rankName, salary)

  Salary: | 0..20k | 20k..40k | 40k..60k | 60k..80k | 80k..100k | >100k
  Tuples: | 8      | 20       | 40       | 80       | 100       | 2
Histograms
■ Assume:
  – V(Employee, salary) = 200
  – V(Ranks, salary) = 250
■ Then T(Employee ⋈_salary Ranks)
  = Σ_{i=1..6} T_i · T_i' / 250
  = (200×8 + 800×20 + 5000×40 + 12000×80 + 6500×100 + 500×2) / 250
  = 1,828,600 / 250 ≈ 7,314
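A minimal sketch of this per-bucket estimate: for each matching salary bucket, multiply the bucket counts and divide by max(V(Employee, salary), V(Ranks, salary)) = 250. The bucket values come from these slides; the function itself is an illustrative assumption.

```python
def histogram_join_size(hist1: list, hist2: list, v_max: float) -> float:
    """hist1 and hist2 hold tuple counts for the same bucket boundaries."""
    return sum(t1 * t2 for t1, t2 in zip(hist1, hist2)) / v_max

employee = [200, 800, 5000, 12000, 6500, 500]  # salary buckets 0..20k ... >100k
ranks    = [8, 20, 40, 80, 100, 2]
print(histogram_join_size(employee, ranks, 250))  # 7314.4
```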
Search Strategies
1. Branch-and-bound:
   – Start by using heuristics to find a good plan for the entire query
   – Remember the cheapest complete plan P seen so far and its cost C
   – Consider other plans for subqueries
     ■ Stop generating partial plans whose cost is > C
     ■ If a cheaper complete plan is found, replace P and C
2. Hill climbing:
   – Start by using heuristics to find a good plan for the entire query
   – Remember only the cheapest plan seen so far
   – Make small changes to the plan, e.g., replacing the method for executing an operator, or reordering joins using the associative and/or commutative laws, to find "nearby" plans that have lower cost
3. Dynamic programming: a variation of the general bottom-up strategy
   – Remember all the cheapest partial plans
   – For each subexpression, keep the plan of least cost
Choosing an Order for Joins - Join Trees
■ R1 ⋈ R2 ⋈ … ⋈ Rn
■ A plan = a join tree
■ A partial plan = a subtree of a join tree
■ [figure: an example join tree over R1, R2, R3, R4]
Join Trees
■ Left-deep: [figure: a left-deep join tree over R1–R5, where the right input of every join is a base relation]
Join Trees
■ Bushy: [figure: a bushy join tree over R1–R5, where both inputs of a join may themselves be joins]
Dynamic Programming to Select a Join Order
■ Given a query R1 ⋈ R2 ⋈ … ⋈ Rn
■ Assume we have a function cost() that gives us the cost of every join tree
■ Objective: find the best join tree for the query
■ Solution
  – Idea: for each subset (subquery) of {R1, …, Rn}, compute the best plan for that subset
  – In increasing order of set cardinality:
    ■ Step 1: for {R1}, {R2}, …, {Rn}
    ■ Step 2: for {R1, R2}, {R1, R3}, …, {Rn-1, Rn}
    ■ …
    ■ Step n: for {R1, …, Rn}
Dynamic Programming - Algorithm
■ For each subquery Q ⊆ {R1, …, Rn}, compute the following:
  – Size(Q)
    ■ We get the same size regardless of which way we compute the join
  – A best plan for Q: Plan(Q)
  – The cost of that plan: Cost(Q)
■ Step 1: For each {Ri} do:
  – Size({Ri}) = B(Ri)
  – Plan({Ri}) = Ri
  – Cost({Ri}) = (cost of scanning Ri)
Dynamic Programming - Algorithm
■ Step i: For each Q ⊆ {R1, …, Rn} of cardinality i do:
  – Compute Size(Q) (later…)
  – For every pair of subqueries Q', Q'' such that Q = Q' ∪ Q'', compute cost(Plan(Q') ⋈ Plan(Q''))
  – Cost(Q) = the smallest such cost
  – Plan(Q) = the corresponding plan
■ Finally, return Plan({R1, …, Rn})
Dynamic Programming - Cost
To illustrate, we will make the following simplifications:
■ Cost(P1 ⋈ P2) = Cost(P1) + Cost(P2) + size(intermediate results)
  – Intermediate results:
    ■ If P1 is a join, then the size of its intermediate result is size(P1); otherwise it is 0
    ■ Similarly for P2
  – Cost of a scan = 0
■ Example:
  – Cost(R1 ⋈ R2) = 0 (no intermediate results)
  – Cost((R1 ⋈ R2) ⋈ R3) = Cost(R1 ⋈ R2) + Cost(R3) + size(R1 ⋈ R2) = size(R1 ⋈ R2)
Dynamic Programming - Example
■ Relations: R, S, T, U
■ Number of tuples: 2000, 5000, 3000, 1000
■ Size estimation: T(A ⋈ B) = 0.01 · T(A) · T(B)
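A minimal Python sketch of the dynamic-programming algorithm from the preceding slides, under this example's simplified cost model (scans cost 0; an input that is itself a join contributes its size as an intermediate result) and the size estimate above. The data structures and plan-string formatting are illustrative choices, not part of any real optimizer.

```python
from itertools import combinations

card = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}  # tuple counts from the example
SELECTIVITY = 0.01                                   # toy size-estimation factor

# best[subset] = (size, cost, plan string); subsets are frozensets of relation names
best = {frozenset([r]): (t, 0, r) for r, t in card.items()}  # scans cost 0

names = list(card)
for k in range(2, len(names) + 1):
    for combo in combinations(names, k):
        q = frozenset(combo)
        size, best_cost, best_plan = None, None, None
        # try every way to split q into two non-empty subqueries q1, q2
        for i in range(1, k):
            for left in combinations(sorted(q), i):
                q1 = frozenset(left)
                q2 = q - q1
                s1, c1, p1 = best[q1]
                s2, c2, p2 = best[q2]
                size = SELECTIVITY * s1 * s2          # same value for every split
                # joined inputs contribute their size as an intermediate result;
                # base relations contribute nothing
                inter = (s1 if len(q1) > 1 else 0) + (s2 if len(q2) > 1 else 0)
                cost = c1 + c2 + inter
                if best_cost is None or cost < best_cost:
                    best_cost, best_plan = cost, f"({p1} {p2})"
        best[q] = (size, best_cost, best_plan)

for q, (size, cost, plan) in sorted(best.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    if len(q) > 1:
        print("".join(sorted(q)), int(size), int(cost), plan)
```

Running the sketch prints one row per subquery and agrees with the Size/Cost/Plan table on the next slides, up to the formatting of the plan strings.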
Dynamic Programming - Example

Subquery | Size | Cost | Plan
RS       |      |      |
RT       |      |      |
RU       |      |      |
ST       |      |      |
SU       |      |      |
TU       |      |      |
RST      |      |      |
RSU      |      |      |
RTU      |      |      |
STU      |      |      |
RSTU     |      |      |
Dynamic Programming - Example

Subquery | Size  | Cost            | Plan
RS       | 100k  | 0               | RS
RT       | 60k   | 0               | RT
RU       | 20k   | 0               | RU
ST       | 150k  | 0               | ST
SU       | 50k   | 0               | SU
TU       | 30k   | 0               | TU
RST      | 3M    | 60k             | (RT)S
RSU      | 1M    | 20k             | (RU)S
RTU      | 0.6M  | 20k             | (RU)T
STU      | 1.5M  | 30k             | (TU)S
RSTU     | 30M   | 60k+50k = 110k  | (RT)(SU)
Dynamic Programming - Summary
■ Compute optimal plans for subqueries:
  – Step 1: {R1}, {R2}, …, {Rn}
  – Step 2: {R1, R2}, {R1, R3}, …, {Rn-1, Rn}
  – …
  – Step n: {R1, …, Rn}
■ We used naïve size/cost estimations
■ In practice:
  – more realistic size/cost estimations
  – heuristics for reducing the search space
    ■ Restrict to left-deep trees
    ■ Restrict to trees "without Cartesian products": e.g., with R(A, B), S(B, C), T(C, D), the plan (R ⋈ T) ⋈ S contains a Cartesian product, because R and T share no attributes