Basic Steps in Query Processing 1 Parsing and

Basic Steps in Query Processing 1. Parsing and translation 2. Optimization 3. Evaluation Database System Concepts - 6 th Edition 12. 1 ©Silberschatz, Korth and Sudarshan

Basic Steps in Query Processing (Cont. ) n Parsing and translation l translate the query into its internal form. This is then translated into relational algebra. l Parser checks syntax, verifies relations n Evaluation l The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query. Database System Concepts - 6 th Edition 12. 2 ©Silberschatz, Korth and Sudarshan

Basic Steps in Query Processing : Optimization n A relational algebra expression may have many equivalent expressions l E. g. , salary 75000( salary(instructor)) is equivalent to salary( salary 75000(instructor)) n Each relational algebra operation can be evaluated using one of several different algorithms l Correspondingly, a relational-algebra expression can be evaluated in many ways. n Annotated expression specifying detailed evaluation strategy is called an evaluation-plan. l E. g. , can use an index on salary to find instructors with salary < 75000, l or can perform complete relation scan and discard instructors with salary 75000 Database System Concepts - 6 th Edition 12. 3 ©Silberschatz, Korth and Sudarshan

Basic Steps: Optimization (Cont. ) n Query Optimization: Amongst all equivalent evaluation plans choose the one with lowest cost. l Cost is estimated using statistical information from the database catalog 4 e. g. number of tuples in each relation, size of tuples, etc. n In this chapter we study l How to measure query costs l Algorithms for evaluating relational algebra operations l How to combine algorithms for individual operations in order to evaluate a complete expression l How to optimize queries, that is, how to find an evaluation plan with lowest estimated cost Database System Concepts - 6 th Edition 12. 4 ©Silberschatz, Korth and Sudarshan

Measures of Query Cost n Cost is generally measured as total elapsed time for answering query l Many factors contribute to time cost 4 disk accesses, CPU, or even network communication n Typically disk access is the predominant cost, and is also relatively easy to estimate. Measured by taking into account l Number of seeks * average-seek-cost l Number of blocks read * average-block-read-cost l Number of blocks written * average-block-write-cost 4 Cost to write a block is greater than cost to read a block – data is read back after being written to ensure that the write was successful Database System Concepts - 6 th Edition 12. 5 ©Silberschatz, Korth and Sudarshan

Measures of Query Cost (Cont. ) n For simplicity we just use the number of block transfers from disk and the number of seeks as the cost measures l t. T – time to transfer one block l t. S – time for one seek l Cost for b block transfers plus S seeks b * t. T + S * t S n We ignore CPU costs for simplicity l Real systems do take CPU cost into account n We do not include cost to writing output to disk in our cost formulae Database System Concepts - 6 th Edition 12. 6 ©Silberschatz, Korth and Sudarshan

Measures of Query Cost (Cont. ) n Several algorithms can reduce disk IO by using extra buffer space l Amount of real memory available to buffer depends on other concurrent queries and OS processes, known only during execution 4 We often use worst case estimates, assuming only the minimum amount of memory needed for the operation is available n Required data may be buffer resident already, avoiding disk I/O l But hard to take into account for cost estimation Database System Concepts - 6 th Edition 12. 7 ©Silberschatz, Korth and Sudarshan

Selection Operation n File scan n Algorithm A 1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition. l Cost estimate = br block transfers + 1 seek 4 br denotes number of blocks containing records from relation r If selection is on a key attribute, can stop on finding record 4 cost = (br /2) block transfers + 1 seek l Linear search can be applied regardless of l 4 selection condition or 4 ordering of records in the file, or 4 availability of indices n Note: binary search generally does not make sense since data is not stored consecutively except when there is an index available, l and binary search requires more seeks than index search l Database System Concepts - 6 th Edition 12. 8 ©Silberschatz, Korth and Sudarshan

Selections Using Indices n Index scan – search algorithms that use an index l selection condition must be on search-key of index. n A 2 (primary index, equality on key). Retrieve a single record that satisfies the corresponding equality condition l Cost = (hi + 1) * (t. T + t. S) n A 3 (primary index, equality on nonkey) Retrieve multiple records. l Records will be on consecutive blocks 4 Let l b = number of blocks containing matching records Cost = hi * (t. T + t. S) + t. S + t. T * b Database System Concepts - 6 th Edition 12. 9 ©Silberschatz, Korth and Sudarshan

Selections Using Indices n A 4 (secondary index, equality on nonkey). l Retrieve a single record if the search-key is a candidate key 4 Cost l = (hi + 1) * (t. T + t. S) Retrieve multiple records if search-key is not a candidate key 4 each of n matching records may be on a different block 4 Cost = (hi + n) * (t. T + t. S) – Can be very expensive! Database System Concepts - 6 th Edition 12. 10 ©Silberschatz, Korth and Sudarshan

Selections Involving Comparisons n Can implement selections of the form A V (r) or A V(r) by using a linear file scan, l or by using indices in the following ways: n A 5 (primary index, comparison). (Relation is sorted on A) 4 For A V(r) use index to find first tuple v and scan relation sequentially from there 4 For A V (r) just scan relation sequentially till first tuple > v; do not use index n A 6 (secondary index, comparison). 4 For A V(r) use index to find first index entry v and scan index sequentially from there, to find pointers to records. 4 For A V (r) just scan leaf pages of index finding pointers to records, till first entry > v 4 In either case, retrieve records that are pointed to – requires an I/O for each record l – Linear file scan may be cheaper Database System Concepts - 6 th Edition 12. 11 ©Silberschatz, Korth and Sudarshan

Join Operation n Several different algorithms to implement joins l Nested-loop join l Block nested-loop join l Indexed nested-loop join l Merge-join l Hash-join n Choice based on cost estimate n Examples use the following information l Number of records of student: 5, 000 takes: 10, 000 l Number of blocks of student: takes: Database System Concepts - 6 th Edition 12. 12 100 400 ©Silberschatz, Korth and Sudarshan

Nested-Loop Join n To compute theta join r s for each tuple tr in r do begin for each tuple ts in s do begin test pair (tr, ts) to see if they satisfy the join condition if they do, add tr • ts to the result. end n r is called the outer relation and s the inner relation of the join. n Requires no indices and can be used with any kind of join condition. n Expensive since it examines every pair of tuples in the two relations. Database System Concepts - 6 th Edition 12. 13 ©Silberschatz, Korth and Sudarshan

Nested-Loop Join (Cont. ) n In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is nr bs + br block transfers, plus nr + b r seeks n If the smaller relation fits entirely in memory, use that as the inner relation. l Reduces cost to br + bs block transfers and 2 seeks n Assuming worst case memory availability cost estimate is l l with student as outer relation: 4 5000 400 + 100 = 2, 000, 100 block transfers, 4 5000 + 100 = 5100 seeks with takes as the outer relation 4 10000 100 + 400 = 1, 000, 400 block transfers and 10, 400 seeks n If smaller relation (student) fits entirely in memory, the cost estimate will be 500 block transfers. n Block nested-loops algorithm (next slide) is preferable. Database System Concepts - 6 th Edition 12. 14 ©Silberschatz, Korth and Sudarshan

Block Nested-Loop Join n Variant of nested-loop join in which every block of inner relation is paired with every block of outer relation. for each block Br of r do begin for each block Bs of s do begin for each tuple tr in Br do begin for each tuple ts in Bs do begin Check if (tr, ts) satisfy the join condition if they do, add tr • ts to the result. end end Database System Concepts - 6 th Edition 12. 15 ©Silberschatz, Korth and Sudarshan

Block Nested-Loop Join (Cont. ) n Worst case estimate: br bs + br block transfers + 2 * br seeks Each block in the inner relation s is read once for each block in the outer relation n Best case: br + bs block transfers + 2 seeks. n Improvements to nested loop and block nested loop algorithms: l In block nested-loop, use M — 2 disk blocks as blocking unit for outer relations, where M = memory size in blocks; use remaining two blocks to buffer inner relation and output l 4 Cost = br / (M-2) bs + br block transfers + 2 br / (M-2) seeks If equi-join attribute forms a key or inner relation, stop inner loop on first match l Scan inner loop forward and backward alternately, to make use of the blocks remaining in buffer (with LRU replacement) l Use index on inner relation if available (next slide) l Database System Concepts - 6 th Edition 12. 16 ©Silberschatz, Korth and Sudarshan

Query optimization n Alternative ways of evaluating a given query l Equivalent expressions l Different algorithms for each operation Database System Concepts - 6 th Edition 12. 17 ©Silberschatz, Korth and Sudarshan

n An evaluation plan defines exactly what algorithm is used for each operation, and how the execution of the operations is coordinated. n Find out how to view query execution plans on your favorite database Database System Concepts - 6 th Edition 12. 18 ©Silberschatz, Korth and Sudarshan

n Cost difference between evaluation plans for a query can be enormous l E. g. seconds vs. days in some cases n Steps in cost-based query optimization 1. Generate logically equivalent expressions using equivalence rules 2. Annotate resultant expressions to get alternative query plans 3. Choose the cheapest plan based on estimated cost n Estimation of plan cost based on: l Statistical information about relations. Examples: 4 number of tuples, number of distinct values for an attribute l Statistics estimation for intermediate results 4 to compute cost of complex expressions l Cost formulae for algorithms, computed using statistics Database System Concepts - 6 th Edition 12. 19 ©Silberschatz, Korth and Sudarshan

Transformation of Relational Expressions n Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance l Note: order of tuples is irrelevant l we don’t care if they generate different results on databases that violate integrity constraints n In SQL, inputs and outputs are multisets of tuples l Two expressions in the multiset version of the relational algebra are said to be equivalent if the two expressions generate the same multiset of tuples on every legal database instance. n An equivalence rule says that expressions of two forms are equivalent l Can replace expression of first form by second, or vice versa Database System Concepts - 6 th Edition 12. 20 ©Silberschatz, Korth and Sudarshan

Equivalence Rules 1. Conjunctive selection operations can be deconstructed into a sequence of individual selections. 2. Selection operations are commutative. 3. Only the last in a sequence of projection operations is needed, the others can be omitted. 4. Selections can be combined with Cartesian products and theta joins. a. (E 1 X E 2) = E 1 b. 1(E 1 Database System Concepts - 6 th Edition 2 E 2) = E 1 1 2 E 2 12. 21 ©Silberschatz, Korth and Sudarshan

Equivalence Rules (Cont. ) 5. Theta-join operations (and natural joins) are commutative. E 1 E 2 = E 2 E 1 6. (a) Natural join operations are associative: (E 1 E 2 ) E 3 = E 1 (E 2 E 3) (b) Theta joins are associative in the following manner: (E 1 1 E 2) 2 3 E 3 = E 1 1 3 (E 2 2 E 3) where 2 involves attributes from only E 2 and E 3. Database System Concepts - 6 th Edition 12. 22 ©Silberschatz, Korth and Sudarshan

Pictorial Depiction of Equivalence Rules Database System Concepts - 6 th Edition 12. 23 ©Silberschatz, Korth and Sudarshan

Equivalence Rules (Cont. ) 7. The selection operation distributes over theta join operation under the following two conditions: (a) When all the attributes in 0 involve only the attributes of one of the expressions (E 1) being joined. 0 E 1 E 2) = ( 0(E 1)) E 2 (b) When 1 involves only the attributes of E 1 and 2 involves only the attributes of E 2. 1 E 1 Database System Concepts - 6 th Edition E 2) = ( 1(E 1)) 12. 24 ( (E 2)) ©Silberschatz, Korth and Sudarshan

Equivalence Rules (Cont. ) 8. The projection operation distributes over theta join operation as follows: (a) if involves only attributes from L 1 L 2: L 1 L 2 ( E 1 q E 2 ) (b) Consider a join E 1 = ( L 1 ( E 1 )) q ( L 2 ( E 2 )) E 2. l Let L 1 and L 2 be sets of attributes from E 1 and E 2, respectively. l Let L 3 be attributes of E 1 that are involved in join condition , but are not in L 1 L 2, and l let L 4 be attributes of E 2 that are involved in join condition , but are not in L 1 L 2. L L ( E 1 1 Database System Concepts - 6 th Edition 2 q E 2 ) = L L (( L L ( E 1 )) 1 2 12. 25 1 3 q ( L L ( E 2 ))) 2 4 ©Silberschatz, Korth and Sudarshan

Equivalence Rules (Cont. ) 9. The set operations union and intersection are commutative E 1 E 2 = E 2 E 1 n (set difference is not commutative). 10. Set union and intersection are associative. (E 1 E 2) E 3 = E 1 (E 2 E 3) 11. The selection operation distributes over , and –. (E 1 – E 2) = (E 1) – (E 2) and similarly for and in place of – Also: (E 1 – E 2) = (E 1) – E 2 and similarly for in place of –, but not for 12. The projection operation distributes over union L(E 1 E 2) = ( L(E 1)) ( L(E 2)) Database System Concepts - 6 th Edition 12. 26 ©Silberschatz, Korth and Sudarshan

Transformation Example: Pushing Selections n Query: Find the names of all instructors in the Music department, along with the titles of the courses that they teach l name, title( dept_name= “Music” (instructor (teaches course_id, title (course)))) n Transformation using rule 7 a. l name, title(( dept_name= “Music”(instructor)) (teaches course_id, title (course))) n Performing the selection as early as possible reduces the size of the relation to be joined. Database System Concepts - 6 th Edition 12. 27 ©Silberschatz, Korth and Sudarshan

Example with Multiple Transformations n Query: Find the names of all instructors in the Music department who have taught a course in 2009, along with the titles of the courses that they taught l name, title( dept_name= “Music” year = 2009 (instructor (teaches course_id, title (course)))) n Transformation using join associatively (Rule 6 a): l name, title( dept_name= “Music” gear = 2009 ((instructor teaches) course_id, title (course))) n Second form provides an opportunity to apply the “perform selections early” rule, resulting in the subexpression dept_name = “Music” (instructor) Database System Concepts - 6 th Edition 12. 28 year = 2009 (teaches) ©Silberschatz, Korth and Sudarshan

Multiple Transformations (Cont. ) Database System Concepts - 6 th Edition 12. 29 ©Silberschatz,

Transformation Example: Pushing Projections n Consider: name, title( dept_name= “Music” (instructor) teaches) course_id, title (course)))) n When we compute ( dept_name = “Music” (instructor teaches) we obtain a relation whose schema is: (ID, name, dept_name, salary, course_id, sec_id, semester, year) n Push projections using equivalence rules 8 a and 8 b; eliminate unneeded attributes from intermediate results to get: name, title( name, course_id ( dept_name= “Music” (instructor) teaches)) course_id, title (course)))) n Performing the projection as early as possible reduces the size of the relation to be joined. Database System Concepts - 6 th Edition 12. 30 ©Silberschatz, Korth and Sudarshan

Join Ordering Example n For all relations r 1, r 2, and r 3, (r 1 r 2 ) r 3 = r 1 (r 2 r 3 ) (Join Associativity) n If r 2 r 3 is quite large and r 1 (r 1 r 2 ) r 2 is small, we choose r 3 so that we compute and store a smaller temporary relation. Database System Concepts - 6 th Edition 12. 31 ©Silberschatz, Korth and Sudarshan

Join Ordering Example (Cont. ) n Consider the expression name, title( dept_name= “Music” (instructor) teaches) course_id, title (course)))) n Could compute teaches course_id, title (course) first, and join result with dept_name= “Music” (instructor) but the result of the first join is likely to be a large relation. n Only a small fraction of the university’s instructors are likely to be from the Music department l it is better to compute dept_name= “Music” (instructor) teaches first. Database System Concepts - 6 th Edition 12. 32 ©Silberschatz, Korth and Sudarshan

Enumeration of Equivalent Expressions n Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression n Can generate all equivalent expressions as follows: l Repeat 4 apply all applicable equivalence rules on every subexpression of every equivalent expression found so far 4 add newly generated expressions to the set of equivalent expressions Until no new equivalent expressions are generated above n The above approach is very expensive in space and time l Two approaches 4 Optimized plan generation based on transformation rules 4 Special case approach for queries with only selections, projections and joins Database System Concepts - 6 th Edition 12. 33 ©Silberschatz, Korth and Sudarshan

Cost Estimation n Cost of each operator computed as described in Chapter 12 l Need statistics of input relations 4 E. g. number of tuples, sizes of tuples n Inputs can be results of sub-expressions l Need to estimate statistics of expression results l To do so, we require additional statistics 4 E. g. number of distinct values for an attribute n More on cost estimation later Database System Concepts - 6 th Edition 12. 34 ©Silberschatz, Korth and Sudarshan

Choice of Evaluation Plans n Must consider the interaction of evaluation techniques when choosing evaluation plans l choosing the cheapest algorithm for each operation independently may not yield best overall algorithm. E. g. 4 merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer level aggregation. 4 nested-loop join may provide opportunity for pipelining n Practical query optimizers incorporate elements of the following two broad approaches: 1. Search all the plans and choose the best plan in a cost-based fashion. 2. Uses heuristics to choose a plan. Database System Concepts - 6 th Edition 12. 35 ©Silberschatz, Korth and Sudarshan