Analyzing Plan Diagrams of Database Query Optimizers Naveen
Analyzing Plan Diagrams of Database Query Optimizers Naveen Reddy Jayant Haritsa Database Systems Lab Indian Institute of Science Bangalore, INDIA August 2005 Picasso (VLDB)
Query Execution Plans l SQL, the standard database query language, is declarative in nature – does not specify how the query should be evaluated – Example: select Student. Name, Course. Name from STUDENT, COURSE, REGISTER where STUDENT. Roll. No = REGISTER. Roll. No and REGISTER. Course. No = COURSE. Course. No join order and techniques are left unspecified l DBMS query optimizer identifies efficient execution strategy: “query execution plan” August 2005 Picasso (VLDB) 2
Example Query and Plan select c_custkey, c_name, c_acctbal, c_address, c_phone, c_comment from customer, orders, nation where c_custkey = o_custkey and c_nationkey = n_nationkey and o_orderdate < date ('2001 -11 -09') and n_name = 'IRAQ' RETURN 201, 689 HSJOIN 201, 689 TBSCAN 175, 025 ORDERS HSJOIN 26, 571 TBSCAN 26, 512 CUSTOMER August 2005 Picasso (VLDB) TBSCAN 50 NATION 3
Query Plan Selection l Core technique Query (Q) Query Optimizer (dynamic programming) DB catalogs l Minimum Cost Plan P(Q) Cost Model Cost difference between best plan choice and a sub-optimal choice can be enormous (orders of magnitude) August 2005 Picasso (VLDB) 4
Plan and Cost Diagrams l Given a query, the optimizer’s plan choice is a function (among other factors) of the selectivities of the base relations participating in the query – selectivity is the estimated number of rows of a relation relevant to producing the final result l A plan diagram is a pictorial enumeration of the plan choices of a database query optimizer over the relational selectivity space l A cost diagram is a visualization of the associated (estimated) plan execution costs over the same relational selectivity space August 2005 Picasso (VLDB) 5
Example Query [Q 7 of TPC-H] select supp_nation, cust_nation, l_year, sum(volume) as revenue from (select n 1. n_name as supp_nation, n 2. n_name as cust_nation, extract(year from l_shipdate) as l_year, l_extendedprice * (1 - l_discount) as volume from supplier, lineitem, orders, customer, nation n 1, nation n 2 where s_suppkey = l_suppkey and o_orderkey = l_orderkey and c_custkey = o_custkey and s_nationkey = n 1. n_nationkey and c_nationkey = n 2. n_nationkey and ((n 1. n_name = 'FRANCE' and n 2. n_name = 'GERMANY') or (n 1. n_name = 'GERMANY' and n 2. n_name = 'FRANCE')) and l_shipdate between date '1995 -01 -01' and date '1996 -12 -31' and o_totalprice < C 1 and c_acctbal < C 2 ) as shipping group by supp_nation, cust_nation, l_year order by supp_nation, cust_nation, l_year August 2005 Picasso (VLDB) 6
Example Plan Diagram August 2005 Picasso (VLDB) 7
Specific Plan Choices August 2005 Picasso (VLDB) 8
Example Cost Diagram August 2005 Picasso (VLDB) 9
PICASSO August 2005 Picasso (VLDB)
Picasso l A Java tool that, given a query template, automatically generates plan and cost diagrams – Fires queries at user-specified granularity (default 100 x 100 grid) – Currently restricted to 2 -D plan diagrams and 3 -D cost diagrams l Using the tool, enumerated the plan/cost diagrams produced by industrial-strength query optimizers on TPC -H-based queries – IBM DB 2 v 8, Oracle 9 i and Microsoft SQL Server 2000 l Plan diagrams appear similar to cubist paintings [ Pablo Picasso founder of the cubist painting genre ] August 2005 Picasso (VLDB) 11
Picasso GUI August 2005 Picasso (VLDB) 12
Testbed Environment l Database – TPC-H database (1 GB scale) representing a manufacturing environment, featuring the following relations: l Relation Query Set – Queries based on TPC-H benchmark [Q 1 through Q 22] – Uniform 100 x 100 grid (10000 queries) [0. 5%, 0. 5%] to [99. 5%, 99. 5%] l Relational Engines – Default installations (with all optimization features on) – Stats on all columns, no extra indices l Cardinality REGION 5 NATION 25 SUPPLIER CUSTOMER 10000 150000 PART 200000 PARTSUPP ORDERS 800000 1500000 LINEITEM 6001215 Computational Platform – Pentium-IV 2. 4 GHz, 1 GB RAM, Windows XP Professional August 2005 Picasso (VLDB) 13
RESULTS Optimizers randomly identified as Opt A, Opt B, Opt C l NOT intended to make comparisons across optimizers l Black-box testing our conclusions are speculative l l Full result listing at http: //dsl. serc. iisc. ernet. in/projects/PICASSO/ August 2005 Picasso (VLDB)
Smooth Plan Diagram [Q 7, Opt. B] August 2005 Picasso (VLDB) 15
Complex Plan Diagram [Q 8, Opt. A*] Highly irregular plan boundaries Increases to Extremely fine 80 plans with grained coverage 300 x 300 grid ! (P 68 ~ 0. 02%) Intricate Complex Patterns August 2005 Picasso (VLDB) 16
Cost Diagram [Q 8, Opt. A*] August 2005 Picasso (VLDB) 17
Skew in Plan Space Coverage Opt A TPC-H Query Opt C Opt B Plan 80% Gini Cardinality Coverage Index 2 22 18% 0. 76 14 21% 0. 72 35 20% 0. 77 5 21 19% 0. 81 14 21% 0. 74 18 17% 0. 81 7 13 23% 0. 73 6 50% 0. 46 19 15% 0. 79 8 8 9 9 31 31 63 63 16% 9% 9% 0. 81 0. 88 25 25 44 44 25% 27% 0. 72 0. 70 38 38 41 41 18% 12% 0. 79 0. 83 10 24 16% 0. 78 9 22% 0. 69 8 25% 0. 75 18 18 21 5 5 27 60% 22% 0. 33 0. 74 13 13 6 38% 17% 0. 57 0. 80 5 5 22 20% 18% 0. 75 0. 81 Avg(dense) 28. 7 17% 0. 79 24. 5 23% 0. 72 28. 8 16% 0. 79 Dense Plan Cardinality 10 August 2005 80 -20 Rule Gini index > 0. 5 Picasso (VLDB) 18
Cost Domination Principle Cost of executing any “foreign” query point in the first quadrant of qs is an upper bound on the cost of executing the foreign plan at qs qs Cost of executing qs with Plans P 4 and P 1 is less than 91 and 90, respectively. Cost of Query point qs with plan P 2 is 88 August 2005 Picasso (VLDB) 19
Cost Domination Principle l Dominating Point Given a pair of distinct points q 1 (x 1, y 1) and q 2(x 2, y 2) in 2 -D selectivity space, we say that q 2 ≻ q 1, if and only if x 2 ≥ x 1, y 2 ≥ y 1 and result cardinality Rq 2 ≥ Rq 1 l Cost Domination Principle If points q 1 (x 1, y 1) and q 2(x 2, y 2) are associated with distinct plans P 1 and P 2 respectively, in the original space, the cost of executing query q 1 with plan P 2 is upper-bounded by the cost of executing q 2 with P 2, if and only if q 2 ≻ q 1 August 2005 Picasso (VLDB) 20
Plan Swallowing Algorithm 1. For each query point qs, look for replacements by “foreign” query points that are in the first quadrant relative to qs as the origin. 2. For the foreign points that are within λ (e. g. λ=10%) threshold, choose point with lowest cost as potential replacement. 3. An entire plan is “swallowed” only if all its query points can be swallowed by a single plan or group of plans. 4. Order the plans in ascending order of size; go up the list, checking for possibility of swallowing each plan. August 2005 Picasso (VLDB) 21
Complex Plan [Q 8, Opt. A*] Reduced Plan. Diagram (λ=10%) Reduced to 7 plans from 68 l Note: Plan Reduction ≠ Change in Optimization Levels August 2005 Picasso (VLDB) 22
Plan Cardinality Reduction by Swallowing Opt A TPC-H Query Opt C % Avg Max Card Cost Card Cost Decrease Increase Decrease Increase 2 59. 2 5 67. 3 7 46. 1 8 8 9 87. 6 84. 4 10 67. 6 18 40. 0 21 59. 8 Avg(dense) 67. 4 August 2005 Opt B 1. 0 2. 6 0. 1 0. 4 1. 6 0. 8 0. 1 0. 0 0. 9 4. 4 64. 2 0. 6 5. 9 77. 1 8. 1 42. 9 0. 1 0. 6 61. 1 9. 5 16. 6 0. 4 0. 7 54. 5 9. 4 8. 6 84. 0 36. 4 0. 9 1. 4 9. 1 8. 9 86. 8 80. 5 4. 4 44. 4 0. 5 6. 1 62. 5 0. 5 46. 2 3. 7 9. 6 00. 0 0. 2 66. 7 0. 9 2. 5 68. 2 6. 4 56. 9 0. 7 6. 1 71. 4 Picasso (VLDB) Average Cost Increase < 2% 3. 2 0. 2 1. 1 1. 2 2. 1 0. 4 0. 0 0. 7 0. 9 6. 4 8. 1 9. 5 8. 4 8. 3 2. 4 0. 0 6. 9 6. 4 23
Interesting Patterns Duplicates and Islands l Plan Switch Points l Footprint Pattern l Speckle Pattern l August 2005 Picasso (VLDB)
Duplicates and Islands [Q 10, Opt. A] Duplicate locations of P 3 P 18 is an island within P 6 Duplicate locations of P 10 August 2005 Picasso (VLDB) 25
Duplicates and Islands [Q 5, Opt. C] Three duplicates of P 7, which are islands within P 1 August 2005 Picasso (VLDB) 26
Duplicates and Islands Removal Databases l # Duplicates Original Reduced # Islands Original Reduced Opt A 130 13 38 3 Opt B 80 15 1 0 Opt C 55 7 8 3 With Plan Reduction by Swallowing, significant decrease in duplicates and islands August 2005 Picasso (VLDB) 27
Plan Switch Points [Q 9, Opt. A] Plan Switch Point: line parallel to axis with a plan shift for all plans bordering the line. Hash-Join sequence PARTSUPP►◄SUPPLIER►◄PAR T is altered to PARTSUPP►◄PART►◄SUPPLIE R August 2005 Picasso (VLDB) 28
Venetian Blinds [Q 9, Opt B] Six plans simultaneously change with rapid alternations to produce a “Venetian blinds” effect. Left-deep hash join across NATION, SUPPLIER and LINEITEM relations gets replaced by a right-deep hash join. August 2005 Picasso (VLDB) 29
Footprint Pattern [Q 7, Opt A] P 7 exhibits a thin broken curved pattern in the middle of P 2’s region. P 2 has sort-merge-join at the top of the plan tree, while P 7 hashjoin August 2005 Picasso (VLDB) 30
Speckle Pattern [Q 17, Opt A*] An additional sort operation is present on PART relation in P 2, whose cost is very low August 2005 Picasso (VLDB) 31
Non-Monotonic Cost Behavior Plan-Switch Non-Monotonic Costs l Intra-Plan Non-Monotonic Costs l August 2005 Picasso (VLDB)
Plan-Switch Non-Monotonic Costs. Presence [Q 2, of. Opt A] Rules? Parameterized changes in search space? 50% Selectivity 26% Selectivity Plan Diagram August 2005 26%: Cost decreases by a factor of 50 Cost Diagram 50%: Cost increases by a factor of 70 Picasso (VLDB) 33
Intra-Plan Non-Monotonic Costs [Q 21, Opt A] Plans P 1, P 3, P 4 and P 6 Nested loops join whose cost decreases with increasing input cardinalities Cost Diagram Plan Diagram August 2005 Picasso (VLDB) 34
Relationship to PQO August 2005 Picasso (VLDB)
PQO (Parametric Query Optimization) l l Active research area for last 15 years Identify the optimal set of plans for the entire relational selectivity space at compile time At run time, use actual selectivity values to identify the appropriate plan choice Assumptions – Plan Convexity: If a plan is optimal at two points, then it is optimal at all points on the straight line joining them – Plan Uniqueness: An optimal plan appears at only one contiguous region in the entire space – Plan Homogeneity: An optimal plan is optimal within the entire region enclosed by its plan boundaries August 2005 Picasso (VLDB) 36
Validity of PQO [Q 8, Opt A] Plan uniqueness is violated by P 4 Plan Homogeneity is Plan Convexity is violated by P 14 severely violated by regions covered by P 12 (dark green) and P 16 (light gray). August 2005 Picasso (VLDB) 37
Note: l PQO is a more viable proposition in the space of reduced plan diagrams due to the removal of most duplicates and islands August 2005 Picasso (VLDB) 38
Conclusions l Conceived and developed the Picasso tool for automatically generating plan and cost diagrams l Presented analyzed representative plan and cost diagrams on popular commercial query optimizers – Optimizers make fine grained choices – Complexity of plan diagrams can be drastically reduced without materially affecting the query processing quality – Plan optimality regions can have intricate patterns and complex boundaries – Non-Monotonic cost behavior exists where increasing result cardinalities decrease the estimated cost – Basic assumptions underlying research literature on PQO do not hold in practice; but hold approximately for reduced plan diagrams August 2005 Picasso (VLDB) 39
Future Work l l l Extend Picasso to generate higher-dimensional plan and cost diagrams Port Picasso to Sybase and Postgre. SQL Conduct a deeper investigation on the query features that result in high plan cardinalities Investigate whether it is possible to simplify the optimizer, so as to be able to directly produce reduced plan diagrams Public release of Picasso software (by end-2005) August 2005 Picasso (VLDB) 40
QUESTIONS ? August 2005 Picasso (VLDB)
END PICASSO August 2005 Picasso (VLDB)
- Slides: 42