Optimizing Nested Queries with Parameter Sort Orders Appeared

  • Slides: 46
Download presentation
Optimizing Nested Queries with Parameter Sort Orders Appeared in the 31 st VLDB Conference

Optimizing Nested Queries with Parameter Sort Orders Appeared in the 31 st VLDB Conference 2005 Ravindra N. Guravannavar Ramanujam H. S. Sudarshan Indian Institute of Technology Bombay

Nested Queries are Important Commonly encountered in practice Queries having performance issues are often

Nested Queries are Important Commonly encountered in practice Queries having performance issues are often complex nested queries In WHERE clause, SELECT clause, SQL LATERAL clause Queries invoking User-Defined Functions (UDFs) 2

Nested Queries – Few Examples Example 1: SELECT order_id, order_date FROM ORDER O WHERE

Nested Queries – Few Examples Example 1: SELECT order_id, order_date FROM ORDER O WHERE default_ship_to NOT IN ( SELECT ship_to FROM ORDERITEM OI WHERE OI. order_id = O. order_id ); Example SELECT FROM WHERE 2: name, desgn EMP E 1. sal=(SELECT max(E 2. sal) FROM EMP E 2 WHERE E 2. dept=E 1. dept); 3

An Example: Query Invoking a UDF Find the turn-around time for high priority orders

An Example: Query Invoking a UDF Find the turn-around time for high priority orders SELECT orderid, Turnaround. Time(orderid, totalprice, orderdate) FROM ORDERS WHERE order_priority=’HIGH’; DEFINE Turnaround. Time(@orderid, @totalprice, @orderdate) // Compute the order category with some procedural logic. IF (@category = ‘A’) SELECT max(L. shipdate – @orderdate) FROM LINEITEM L WHERE L. orderid=@orderid; ELSE SELECT MAX(L. commitdate – @orderdate) FROM LINEITEM L WHERE L. orderid=@orderid; END; 4

Nested Iteration For each tuple t in the outer block Bind parameter values from

Nested Iteration For each tuple t in the outer block Bind parameter values from t Evaluate inner block – collect results in s Process t, s Advantages Simple to implement Easy to ensure correctness Applicable to all types of nested queries 5

Nested Iteration Drawbacks Performance can be very poor Repeated work Random I/O Cost =

Nested Iteration Drawbacks Performance can be very poor Repeated work Random I/O Cost = Cost(Outer. Block) + n*Cost(Inner. Block) Where n=# tuples in the result of outer block Improvements Proposed in System R Cache the inner subquery result for each distinct correlation binding Sort the outer tuples so as to be able to cache a single result at any given time 6

Decorrelation Techniques Rewrite nested query as an equivalent flat query Allows the choice of

Decorrelation Techniques Rewrite nested query as an equivalent flat query Allows the choice of set-oriented evaluation plans such as hash and merge-join A range of techniques proposed and refined over 2 decades 7

Decorrelation Example Original Query: SELECT O. order_id, O. order_date FROM ORDER O WHERE default_ship_to

Decorrelation Example Original Query: SELECT O. order_id, O. order_date FROM ORDER O WHERE default_ship_to IN ( SELECT ship_to FROM ORDERITEM OI WHERE OI. order_id = O. order_id); Decorrelated Query: SELECT O. order_id, O. order_date FROM ORDER O, ORDERITEM OI WHERE O. order_id=OI. order_id AND O. default_ship_to=OI. ship_to; * Queries are not equivalent when duplicates are present 8

Decorrelation Example Original Query: SELECT c_name FROM CUSTOMER C WHERE 10 = ( SELECT

Decorrelation Example Original Query: SELECT c_name FROM CUSTOMER C WHERE 10 = ( SELECT count(order_id) FROM ORDER O WHERE O. cust_id = C. cust_id); Decorrelation [Kim 82]: Temp = SELECT cust_id, count(order_id) as order_count FROM ORDER O GROUP BY cust_id; SELECT c_name FROM CUSTOMER C, Temp T WHERE C. cust_id=T. cust_id AND T. order_count=10; * Goes wrong if one tries to find customers with no orders! 9

Problems with Decorrelation Not always possible Many cases need duplicate elimination and an extra

Problems with Decorrelation Not always possible Many cases need duplicate elimination and an extra outer-join. E. g. , NOT IN predicate – requires anti-join Outer joins are not commutative and do not associate with joins Duplicate elimination expensive May not be applicable to UDFs unless their structure is very simple 10

Our Approach Optimize nested queries keeping their structure intact Exploit properties of parameters (such

Our Approach Optimize nested queries keeping their structure intact Exploit properties of parameters (such as sort order) to efficiently evaluate the inner sub-query More generic and can be applied to a wider class of queries (e. g. , Queries invoking complex UDFs) 11

Benefits of Sorting Outer Tuples Sorting allows caching of a single inner result (System

Benefits of Sorting Outer Tuples Sorting allows caching of a single inner result (System R) Advantageous buffer effects (Graefe) A clustered index scan in the inner block will access each block at most once irrespective of the buffer replacement policy Allows state-retaining operators Re-startable scan Incremental computation of aggregates 12

Restartable Table Scan Parameter bindings match sort order of inner relation Retain state across

Restartable Table Scan Parameter bindings match sort order of inner relation Retain state across function calls Similar to merge join – applicable for NI 13

Restartable Table Scan Parameters: match sort order of inner relation Retain state across function

Restartable Table Scan Parameters: match sort order of inner relation Retain state across function calls Parameter Bindings orderid, totalprice, orderdate SELECT Turnaround. Time(orderid, … ) FROM ORDERS WHERE … Turnaround. Time(@orderid, …) IF (…) SELECT … FROM LINEITEM WHERE L. orderid=@orderid; ELSE SELECT … FROM LINEITEM WHERE L. orderid=@orderid; Table LINEITEM orderid {100, 20. 5, 2005 -01 -02} 100 lineitemid 1 shipdate 2005 -01 -10 {140, 10. 2, 2005 -01 -04} 100 2 2005 -01 -12 {200, 30. 8, 2005 -02 -01} 140 1 2005 -01 -04 200 1 2005 -02 -02 2005 -02 -01 14

Incremental Computation of Aggregates SELECT day, sales FROM DAILYSALES DS 1 WHERE sales >

Incremental Computation of Aggregates SELECT day, sales FROM DAILYSALES DS 1 WHERE sales > (SELECT MAX(sales) FROM DAILYSALES DS 2 WHERE DS 2. day < DS 1. day); Param Sort Order No order DS 1. day Plan Cost O(n*B) block transfers + seeks 2 B block transfers + less seeks Applicable to: Aggregates SUM, COUNT, MAX, MIN, AVG and Predicates <, ≤, >, ≥ 15

Benefits of Sorting for a Clustered Index Case-1 400 Keys: 50, 500, 400, 80,

Benefits of Sorting for a Clustered Index Case-1 400 Keys: 50, 500, 400, 80, 600, 200 Potential data block fetches=6 50 80 Data Block-1 200 400 Data Block-2 500 600 * Assume a single data block can be held in memory Random I/O Data Block-3 Case-2 Keys: 50, 80, 200, 400, 500, 600 Data block fetches=3 Sequential I/O 16

Query Optimization with Nested Iteration A multi-level, multi-branch query Plan cost for a block:

Query Optimization with Nested Iteration A multi-level, multi-branch query Plan cost for a block: A function of the order guaranteed on the IN variables and order required on the OUT variables Not every possible sort order may be useful (only interesting orders) Not every interesting order may be feasible/valid Similar to interesting sort order of results but on parameters B 1 B 2 B 4 B 3 B 5 B 8 BIND variable set USE variables set B 6 B 7 B 9 17

Representing Nested Queries with Apply SELECT FROM WHERE NOT IN PO. order_id PURCHASEORDER PO

Representing Nested Queries with Apply SELECT FROM WHERE NOT IN PO. order_id PURCHASEORDER PO default_ship_to ( SELECT ship_to FROM ORDERITEM OI WHERE OI. order_id = PO. order_id ); A* Bind Expression B: $a, $b Use Expression U: $a, $b A – The Apply Operator [Galindo-Legaria et. al. SIGMOD 2001] * – Operation between the outer tuple and result of the inner block 18

A UDF Represented with Apply DEFINE fn(p 1, p 2, … pn) AS BEGIN

A UDF Represented with Apply DEFINE fn(p 1, p 2, … pn) AS BEGIN fn. Q 1 <p 1, p 2>; fn. Q 2 <p 1, p 2, p 3>; IF (condition) fn. Q 3<p 2>; ELSE fn. Q 4<p 3>; A Qi // Cursor loop binding v 1, v 2 OPEN CURSOR ON fn. Q 5<p 2, p 3>; LOOP fn. Q 6<p 1, p 2, v 1, v 2>; END LOOP fn. Q 1 fn. Q 2 fn. Q 3 fn. Q 4 fn. Q 5 A fn. Q 6 END 19

Optimizing with Parameter Sort Orders Top-Down Exhaustive Approach For each possible sort order of

Optimizing with Parameter Sort Orders Top-Down Exhaustive Approach For each possible sort order of the parameters, optimize the outer block and then the inner block. A query block b at level l using n parameters will get l optimized d(k) times where, d(k)=kp 0 +kp 1 + … kpk • • Assuming an average of k=n/l parameters are bound at each block above b. And kpi = k!/(k-i)! 20

Optimizing with Parameter Sort Orders Our proposal: Top-Down Multi-Pass Approach Traverse the inner block(s)

Optimizing with Parameter Sort Orders Our proposal: Top-Down Multi-Pass Approach Traverse the inner block(s) to find all valid, interesting orders. For each valid, interesting order ord Optimize the (outer) block with ord as the required output sort order (physical property). Then optimize the inner block(s) with ord as the guaranteed parameter sort order. Keep the combination, if it is cheaper than the cheapest plan found so far. 21

Feasible/Valid Parameter Sort Orders Parameter sort order (a 1, a 2, … an) is

Feasible/Valid Parameter Sort Orders Parameter sort order (a 1, a 2, … an) is valid iff level(ai) <= level(aj) for all i, j s. t. i < j B 1 Binds a : sorted B 1 Binds a, b : sorted B 2 Binds b : sorted B 2 Uses a, b. Binds c : sorted B 3 Is (a, b) valid/observable? B 3 Cannot get (a, c) by dup elimination 22

A Stricter Notion of Validity Parameter sort order o=(a 1, a 2, … an)

A Stricter Notion of Validity Parameter sort order o=(a 1, a 2, … an) is valid (observable) at block bx iff i. level(ai) <= level(aj) for all i, j s. t. i < j ii. AND For each block bk s. t. level(bx) - level(bk) > 1, corrattrs(bk, o) U bindattrs(bk, o) is a candidate key bk (key of schema of the expression in the FROM clause of bk) Notation: level(bi): Level of the block bi level(ai): Level of the block in which ai is bound bindattrs(bk, o): Attributes in o that are bound at block bk corrattrs(bk, o): Atttributes in bk that are correlated with attributes in o with an equality predicate. 23

A Stricter Notion of Validity (Example) B 1 Binds a Key: a B 2

A Stricter Notion of Validity (Example) B 1 Binds a Key: a B 2 Binds b, has pred c=a Key: b, c B 3 B 4 Now, (a, b) is valid at B 4 24

Weaker Notion of Sort Orders (b 11, b 12, …)(b 21, b 22…)… Sorted

Weaker Notion of Sort Orders (b 11, b 12, …)(b 21, b 22…)… Sorted on seg-0 For a given value of seg-i, seg-i+1 can have several sorted runs A parameter sort order p is said to weakly subsume a sort order o if o is a subsequence of p ignoring parantheses Operators need to have a method reset_state(segno) to reset the state for a specific segment Cost of a state-retaining plan must be multipled by the number of expected runs 25

Plan Generation A Interesting Parameter Sort Orders A Query Block-1 Binds $a, $b Query

Plan Generation A Interesting Parameter Sort Orders A Query Block-1 Binds $a, $b Query Block-2 Binds $c Uses $a, $b Query Block-3 Uses $a, $b, $c Required Result Sort Order • Traverse the use inputs and obtain valid interesting orders • Extract orders relevant to the bind input • Optimize the bind input making the order as a required output physical property • Optimize the use input making the order as a guaranteed parameter sort order 26

Plan Generation (Contd. ) At a non-Apply logical operation node Consider only those algorithms

Plan Generation (Contd. ) At a non-Apply logical operation node Consider only those algorithms that require parameter sort order weaker than or equal to the guaranteed parameter sort order E. g. , An algorithm requiring parameter sort order (a, b) is not applicable when no order is guaranteed on the parameters. 27

Sort Order Propagation for a Multi-Level Multi-Branch Expression σc 1=a ^ c 2=b (R

Sort Order Propagation for a Multi-Level Multi-Branch Expression σc 1=a ^ c 2=b (R 2) R 2 sorted on (c 1, c 2) 28

Experiments Evaluated the benefits of state retention plans with Postgre. SQL Scan and Aggregate

Experiments Evaluated the benefits of state retention plans with Postgre. SQL Scan and Aggregate operators were modified for state retention Plans were hard coded as the Optimizer extensions were not complete 29

Experiments (Contd. ) A Nested Aggregate Query with Non-Equality Corrl. Predicate SELECT day, sales

Experiments (Contd. ) A Nested Aggregate Query with Non-Equality Corrl. Predicate SELECT day, sales FROM DAILYSALES DS 1 WHERE sales > (SELECT MAX(sales) FROM DAILYSALES DS 2 WHERE DS 2. day < DS 1. day); NI – Nested Iteration MAG – Magic Decorrelation [SPL 96] NISR – NI with State Retention 30

Experiments (Contd. ) TPC-H MIN COST Supplier Query SELECT name, address … FROM PARTS,

Experiments (Contd. ) TPC-H MIN COST Supplier Query SELECT name, address … FROM PARTS, SUPPLIER, PARTSUPP WHERE nation=’FRANCE’ AND p_size=15 AND p_type=’BRASS’ AND <join_preds> AND ps_supplycost = ( SELECT min(PS 1. supplycost) FROM …); 31

Experiments (Contd. ) A query with UDF SELECT orderid, Turnaround. Time(orderid, totalprice, orderdate) FROM

Experiments (Contd. ) A query with UDF SELECT orderid, Turnaround. Time(orderid, totalprice, orderdate) FROM ORDERS WHERE order_priority=’H’; DEFINE Turnaround. Time(@orderid, @totalprice, @orderdate) … Compute the order category with some procedural logic … IF (@category = ‘A’) SELECT max(L. shipdate – @orderdate) FROM LINEITEM L WHERE L. orderid=@orderid; ELSE SELECT MAX(L. commitdate – @orderdate) FROM LINEITEM L WHERE L. orderid =@orderid; END; 32

Questions? 33

Questions? 33

Extra Slides 34

Extra Slides 34

Physical Plan Space Generation Phys. Eq. Node Phys. DAGGen(Log. EQNode e, Phy. Prop p,

Physical Plan Space Generation Phys. Eq. Node Phys. DAGGen(Log. EQNode e, Phy. Prop p, Param. Sort. Order s) If a physical equivalence node np exists for e, p, s return np Create an equivalence node np for e, p, s For each logical operation node o below e If(o is an instance of Apply. Op) Proc. Apply. Node(o, s, np) else Proc. Log. Op. Node(o, p, s, np) For each enforcer f that generates property p Create an enforcer node of under np Set the input of of = Phys. DAGGen(e, null, s) return np End 35

Processing a Non-Apply Node void Proc. Log. Op. Node(Log. Op. Node o, Phys. Prop

Processing a Non-Apply Node void Proc. Log. Op. Node(Log. Op. Node o, Phys. Prop p, Param. Sort. Order s, Phys. Eq. Node np) For each algorithm a for o that guarantees p and requires no stronger sort order than s Create an algorithm node oa under np For each input i of oa Let oi be the i th input of oa Let pi be the physical property required from input i by algorithm a Set input i of oa = Phys. DAGGen(oi, pi, s) End 36

Processing the Apply Node void Proc. Apply. Node(Log. Op. Node o, Param. Sort. Order

Processing the Apply Node void Proc. Apply. Node(Log. Op. Node o, Param. Sort. Order s, Phys. Eq. Node np) Initialize i_ords to be an empty set or sort orders For each use expression u under o u. Ords = Get. Interesting. Orders(u) i_ords = i_ords Union u. Ords l_ords = Get. Local. Orders(i ords, o. bind. Input) For each order ord in l_ords and empty order leq = Phys. DAGGen(lop. bind. Input, ord, s) Let new. Ord = concat(s, ord) apply. Op = create new apply. Phys. Op(o. TYPE) apply. Op. lchild = leq For each use expression u of o ueq = Phys. DAGGen(u, null, new. Ord) Add ueq as a child node of apply. Op np. add. Child(apply. Op) End 37

Generating Interesting Parameter Orders Set<Order> Get. Interesting. Orders(Log. Eq. Node e) if the set

Generating Interesting Parameter Orders Set<Order> Get. Interesting. Orders(Log. Eq. Node e) if the set of interesting orders i_ords for e is already found return i_ords Create an empty set result of sort orders for each logical operation node o under e for each algorithm a for o Let sa be the sort order of interest to a on the unbound parameters in e if sa is a valid order and sa is not in result Add sa to result for each input logical equivalence node ei of a child. Ord = Get. Interesting. Orders(ei) if (o is an Apply operator AND ei is a use input) child. Ord = Get. Ancestor. Orders(child. Ord, o. bind. Input) result = result Union child. Ord return result End 38

Extracting Ancestor Orders Set<Order> Get. Ancestor. Orders(Set<Order> i_ords, Log. Eq. Node e) Initialize a_ords

Extracting Ancestor Orders Set<Order> Get. Ancestor. Orders(Set<Order> i_ords, Log. Eq. Node e) Initialize a_ords to be an empty set of sort orders for each order ord in i_ords new. Ord = Empty vector; for (i = 1; i <=length(ord); i = i + 1) if ord[i] is NOT bound by e append(ord[i], new. Ord) else break; add new. Ord to a_ords return a_ords End 39

Extracting Local Orders Set<Order> Get. Local. Orders(Set<Order> i_ords, Log. Eq. Node e) Initialize l_ords

Extracting Local Orders Set<Order> Get. Local. Orders(Set<Order> i_ords, Log. Eq. Node e) Initialize l_ords to be an empty set or sort orders For each ord in i_ords new. Ord = Empty vector; For (i =length(ord); i > 0; i = i – 1 ) If ord[i] is bound by e prepend(ord[i], new. Ord) Else break; add new. Ord to l_ords return l_ords End 40

Extensions to the Volcano Optimizer Contract of the original algorithm for optimization: Plan Find.

Extensions to the Volcano Optimizer Contract of the original algorithm for optimization: Plan Find. Best. Plan(Expr e, Phys. Prop rpp, Cost cl) Contract of the modified algorithm for optimization: Plan Find. Best. Plan(Expr e, Phys. Prop rpp, Cost cl, Order pso, int call. Count) Plans generated and cached for <e, rpp, pso, call. Count> Not all possible orderings of the parameters are valid Parameter Sort Order (a 1, a 2, … an) is valid iff level(ai) <= level(aj) for all i, j s. t. i < j. Not all valid orders may be interesting (we consider only valid, interesting parameter sort orders) 41

A Typical Nested Iteration Plan For ti {t 1, t 2, t 3, …

A Typical Nested Iteration Plan For ti {t 1, t 2, t 3, … tn} do inner. Result = {Ø} For ui {u 1, u 2, u 3, … um} do if (pred(ti , ui)) Add ui to inner. Result; done; process(ti , inner. Result); done; 42

Benefits of Sorting for a Clustered Index Case-1 400 Keys: 50, 500, 400, 80,

Benefits of Sorting for a Clustered Index Case-1 400 Keys: 50, 500, 400, 80, 600, 200 Potential data block fetches=6 50 80 Data Block-1 200 400 Data Block-2 500 600 * Assume a single data block can be held in memory Random I/O Data Block-3 Case-2 Keys: 50, 80, 200, 400, 500, 600 Data block fetches=3 Sequential I/O * We provide cost estimation for clustered index scan taking the buffer effects into account (full length paper) 43

Difference from Join Optimization Block-1 B: {R 1. a, R 1. b} Sort on

Difference from Join Optimization Block-1 B: {R 1. a, R 1. b} Sort on R 1. a R 2 Block-2 B: {R 2. c} U: {R 1. a} Sort on R 3. b R 1 R 3 Not an option for Nested Iteration Block-3 U: {R 1. b, R 2. c} 44

Experiments (Contd. ) A simple IN query with no outer predicates SELECT o_orderkey FROM

Experiments (Contd. ) A simple IN query with no outer predicates SELECT o_orderkey FROM ORDERS WHERE o_orderdate IN (SELECT l_shipdate FROM LINEITEM WHERE l_orderkey = o_orderkey); NI – Nested Iteration MAG – Magic Decorrelation [SPL 96] NISR – NI with State Retention Note: MAG is just one form of decorrelation, and the comparison here is NOT with decorrelation techniques in general 45

Future Work Factoring execution probabilities of queries inside function body for appropriate costing Analyze

Future Work Factoring execution probabilities of queries inside function body for appropriate costing Analyze function body Exploit history of execution (when available) Parameter properties other than sort orders that would be interesting to nested queries and functions SQL/XML, XQuery 46