Rewriting Procedures for Batched Bindings Ravindra Guravannavar and
Rewriting Procedures for Batched Bindings Ravindra Guravannavar and S. Sudarshan Indian Institute of Technology, Bombay Appeared in VLDB 2008
Motivation Queries/updates are often invoked repeatedly. E. g. : n n Nested subqueries (with correlated evaluation) Queries embedded in user-defined functions (UDFs) that are invoked from other queries Queries/updates in stored procedures that are invoked repeatedly (e. g. batch jobs) Queries/updates invoked inside loops in application programs 2
Example: A Nested Query List the orders of high worth (> 10 K) SELECT O. orderid, O. custid FROM orders O WHERE 10000 < (SELECT sum(lineprice) FROM lineitem L WHERE L. orderid=O. orderid); Iterative Execution: For each record in the result of the outer query block - bind the parameters for the nested sub-query - evaluate the nested sub-query - process the results (check predicate, output) 3
Optimizing Repeated Invocations n Iterative execution of queries often performs poorly q q q n Redundant I/O Random I/O and poor buffer effects Network round-trip delays Decorrelation is widely used for optimizing nested queries 4
Query Decorrelation Original Query SELECT O. orderid, O. custid FROM orders O WHERE 10000 < (SELECT sum(lineprice) FROM lineitem L WHERE L. orderid=O. orderid); After Decorrelation (Most systems do this automatically) SELECT O. orderid, O. custid FROM orders O, lineitem L WHERE O. orderid=L. orderid GROUP BY O. orderid, O. custid HAVING sum(L. lineprice) > 10000; 5
Limitations of Decorrelation Tricky at times… q q COUNT aggregate Non-equality correlation predicates The solutions may not produce the best plan Decorrelation techniques are not applicable for: q q Nested invocation of procedures with complex logic and embedded query invocations Iterative invocation of queries/updates from application code 6
Example: UDF Invoked from a Query SELECT * FROM category WHERE count_items(category-id) > 50; // Count the items in a given category and its sub-categories int count_items(int category. Id) { … while(…) { … … SELECT count(item-id) INTO icount FROM item WHERE category-id = : curcat; Procedural logic with embedded queries … } } 7
Key Idea: Parameter Batching n Repeated invocation of an operation is replaced by a single invocation of its batched form q n Batched form: Works on a set of parameters Benefits q Choice of efficient set-oriented plans n n n q Repeated selection → Join Efficient integrity checks Efficient disk access (sort RIDs before fetch) Reduced network round-trip delay 8
Batched Forms of Basic Operations n Insert q q n Update q n insert into <table 1> select … from <table 2> … Bulk load (SQLServer: bcp, Oracle: sqlldr, DB 2: load) update <table 1> from <table 2> where … (equivalent to SQL: 2003 merge statement) Queries q Make use of join or outer-join (seen in decorrelation) 9
SQL Merge GRANTMASTER empid name grants S 101 Ramesh 8000 S 204 Gopal 4000 S 305 Veena 3500 S 602 Mayur 2000 GRANTLOAD empid grants S 204 5000 S 602 2600 merge into GRANTMASTER GM using GRANTLOAD GL on GM. empid=GL. empid when matched then update set GM. grants=GL. grants; Notation: Mc 1=c 1’, c 2=c 2’, … cn=cn’(r, s) 10
Effect of Batch Size on Inserts Bulk Load: 1. 3 min 11
Iterative and Set-Oriented Updates on the Server Side TPC-H PARTSUPP (800, 000 records), Clustering index on (partkey, suppkey) n Iterative update of all the records using T-SQL script (each update has an index lookup plan) n Single commit at the end of all updates Takes 1 minute n Same update processed as a merge (update … from …) Takes 15 seconds 12
The Challenge Given a procedure, how to obtain its batched form? Possible to manually rewrite, but time consuming and error-prone. 13
Our Work n Automatic generation of batched forms of UDFs/stored procedures using rewrite rules n Automatic rewrite of programs to replace looping with batched invocation 14
Batched Forms n Batched form qb of a pure function q q Returns results for a set of parameters q Result in the form {(parameter, result)} q For queries: standard techniques for creating batched forms (from work on decorrelation) Example: Original query: SELECT item-id FROM item WHERE category-id=? Batched form: SELECT pb. category-id, item-id FROM param-batch pb LEFT OUTER JOIN item ON pb. category-id = item. category-id; 15
Batch Safe Operations n n Batched forms – no guaranteed order of parameter processing Can be a problem for operations having side-effects Batch-Safe operations n All operations without side effects n Also a few operations with side effects q q E. g. : INSERT on a table with no constraints Operations inside unordered loops (e. g. , cursor loops with no order-by) 16
Generating Batched Forms of Procedures Step 1: Create trivial batched form. Transform: procedure p(r) { body of p} To procedure p_batched(pb) { for each record r in pb { < body of p> < collect the return value > } return the collected results paired with corrsp. params; } Step 2: Optimize query invocations in the trivial batched form 17
Rule 1 A: Rewriting a Simple Set Iteration Loop where q is any batch-safe operation with qb as its batched form Rule 1 B Handles return values 18
Rule 1 C: Batching Conditional Statements Condition for Invocation Return values Operation to Batch // Batched invocation Let s = // Merge the results where s 19
Rule 2: Splitting a Loop while (p) { ss 1; sq; ss 2; } Table(T) t; while(p) { Collect the ss 1 modified to save parameters local variables as a tuple in t } for each r in t { Can apply Rule 1 A-1 C sq modified to use and batch. attributes of r; } for each r in t { ss 2 modified to use attributes of r; } * Conditions Apply Process the results 20
Rule 2: Pre-conditions n n The conditions make use of the data dependence graph Data Dependence Graph q q n Types of Dependencies q q n Nodes: program statements Edges: dependencies between statements that read/write same location Flow (Write Read), Anti (Read Write) and Output (Write Write) Loop-carried flow/anti/output n Dependencies across iterations Pre-conditions for Rule-2 q q No loop-carried flow/output dependencies cross the points at which the loop is split No loop-carried dependencies through external data (e. g. , DB) 21
Need for Reordering Statements Data Dependencies while (category != null) { (s 2) item-count = q 1(category); (s 3) sum = sum + item-count; (s 4) category = get. Parent(category); } (s 1) Flow Dependence Anti Dependence Output Dependence Control Dependence Loop-Carried 22
Reordering Statements to Enable Rule 2 while (category != null) { int item-count = q 1(category); // Query to batch sum = sum + item-count; category = get. Parent(category); } Splitting made possible after reordering while (category != null) { int temp = category; category = get. Parent(category); int item-count = q 1(temp); sum = sum + item-count; } 23
Cycles of Flow Dependencies 24
Rule 4: Control Dependencies while (…) { item = …; qty = …; brcode = …; if (brcode == 58) { brcode = 1; q(item, qty, brcode); } } while (…) { item = …; qty = …; brcode = …; boolean cv = (brcode == 58); cv? brcode = 1; cv? q(item, qty, brcode); } Remember the branching decision in a boolean variable 25
Cascading of Rules After applying Rule 2 Table(…) t; while (…) { r. item = …; r. qty = …; r. brcode = …; r. cv = (r. brcode == 58); r. cv? r. brcode = 1; t. add. Record(r); } for each r in t { r. cv? q(r. item, r. qty, r. brcode); Rule 1 C } qb(Pitem, qty, brcode(scv=true(t)) 26
Batching Across Multiple Levels while(…) { …. while(…) {. . . q(v 1, v 2, … vn); … } Batch q w. r. t inner loop while(…) { …. Table t (…); while(…) {. . . } qb(t); … } 27
Parameter Batches as Nested Tables cust-id C 101 C 180 cust-class Gold Regular orders ord-id date 1011 10 -12 -08 1012 12 -01 -09 ord-id date 1801 10 -12 -08 1802 20 -12 -08 1803 08 -01 -09 28
Nest and Unnest Operations n μc(T) : Unnest T w. r. t. table-valued column c n v. S s(T) : Group T on columns other than S and nest the columns in S under the name s 29
Rule 6: Unnesting Nested Batches 30
Implementation and Evaluation n n Conceptually the techniques can be used with any language (PL/SQL, Java, C#-LINQ) We implemented for Java using the SOOT framework for program analysis Evaluation n No benchmarks for procedural SQL Scenarios from three real-world applications, which faced performance problems Data Sets: TPC-H and synthetic 31
Application 1: ESOP Management App Process records from a file in custom format. Repeatedly called a stored procedure with mix of queries, updates and business logic. - Validate inputs - Lookup existing record - Update or Insert Rewritten program used outer-join and merge. 32
Application 2: Category Traversal Find the maximum size of any part in a given category and its sub-categories. Clustered Index CATEGORY (category-id) Secondary Index PART (category-id) Original Program Repeatedly executed a query that performed selection followed by grouping. Rewritten Program Group-By followed by Join 33
Application 3: Value Range Expansion Expand records of the form: (start-num, end-num, issued-to, …) Performed repeated inserts. Rewritten program Pulled the insert stmt out of the loop and replaced it with batched insert. ~75% improvement ~10% overhead Log scale 34
Related Work n Query unnesting q q n n Graefe [BTW 03] highlights the importance of batching in nested iteration plans (a motivation for our work) Optimizing set iteration loops in database programming languages - Lieuwen and De. Witt [SIGMOD 92] q q q n E. g. Kim [TODS 82], Dayal [VLDB 87], Seshadri et al. [ICDE 96], Galindo Legaria et al. [SIGMOD 01] We extend the benefits of unnesting to procedural nested blocks Also perform program rewriting, but Do not address batching of queries/procedure calls within the loop Limited language constructs - No WHILE loops, IF-THEN-ELSE Parallelizing compilers Kennedy[90], Padua[95] q We borrow and extend the techniques 35
Conclusion n Automatic rewrite of programs for set-orientation is possible q n Combining query rewrite with program analysis is the key Our experiments on real-world scenarios show significant benefits due to batching Future Work n Cost-based selection of operations to batch n Handling exceptions n Automatically deciding whether an operation is batch-safe n Implementing rewriting for PL/SQL 36
Questions? 37
- Slides: 37