An Efficient CostDriven Selection Tool for Microsoft SQL

Introduction ➢Indexes and materialized views are very important in physical design of a database.

Auto. Admin project ➢ Make database systems self-tuning and self-administering. Enabling databases to track

How do we choose the indexes ? Indexes can be chosen based on ➢

Terminologies Indexable column : Columns in a query that are potentially useful for indexing

Cost evaluation Basic method can is given M configurations and Q queries in the

Cost evaluation contd. . . For an insert/delete query, for a non-atomic configuration C

Cost evaluation contd. . . Cost evaluation module can identify atomic configuration heuristically. a)

Candidate Index Selection for a given workload The number of admissible indexes for a

1. For workload W that consists of n queries, generate 'n' workloads each consisting

Problem with this approach. . . Consider a query Q with indexable columns T

The candidate index selection does perform well. Why ? ➢ Indexes that are part

Enumeration Now we have 'n' candidate indexes and we want to pick k indexes.

Greedy(m, k) enumeration approach 1. Let S = the best m index configuration using

Why Greedy algorithm performs well ? ? Interaction between indexes. For example, merge join

Multi-column Index Generation ➢ Considering all the multi-column indexes is not possible. This paper

Variants of this strategy MC-LEAD : In (a, b), 'b' need not be an

BEST-CONF-1: The winning index BEST-CONF-2: The first two winning indexes.

Summary ➢Selecting ➢An candidate indexes algorithm for configuration enumeration ➢Multi-column index Can these ideas

Automated Selection of Materialized Views and Indexes for SQL Databases Sanjay Agrawal Surajit Chaudhuri

Indexes and Materialized views ➢Indexes ➢Views can be considered as special materialized views. have

Indexed Views in SQL server When a clustered index is created on the view,

Examples Example 1. Workload consisting of 1000 queries of the form: SELECT l_returnflag, l_linestatus,

Example 2: Workload of 100 queries whose total cost is 10, 000 units. Let

Features (tuning mode, performance benefit) SQL Server 7. 0 : Indexes Only, Thorough, 49%

Features ➢Index Usage Report ➢Query Cost Report ➢View-Table Relations Report ➢Workload Analysis Report ➢Maximum

Other References Auto. Admin: Self-Tuning and Self-Administering Databases http: //research. microsoft. com/dmx/autoadmin/default. asp ➢Index

Candidate materialized view selection 1) From the large space of all possible table-subsets for

Properties of Merged View 1. All queries answered using either of the parent views

Metrics used TS-Cost(T) = total cost of all queries in the workload (for the

Then, propose MV on these table sets. For a given query, take the view

Merging Views R= Materialized views returned by enumeration While( | R | > 1)

Slides: 39

Download presentation

An Efficient Cost-Driven Selection Tool for Microsoft SQL Server Surajit Chaudhuri Vivek Narasayya CS 632 Course seminar by Iyengar Suresh, CSE MTECH 1 Indian Institute of Technology Bombay

Introduction ➢Indexes and materialized views are very important in physical design of a database. ➢They can significantly improve performance if used in the right manner. ➢The tool has been implemented in Microsoft SQL Server 7. 0 and subsequent versions. ➢The technique developed was a part of the Auto. Admin project.

Auto. Admin project ➢ Make database systems self-tuning and self-administering. Enabling databases to track the usage of their systems and to gracefully adapt to application requirements. ➢ Databases actively auto-tunes itself to be responsive to application needs. ➢ Research work: Automating index and view selections, index merging, 'what-if' indexes, automating statistics management for Query Optimizers and many more. ➢

How do we choose the indexes ? Indexes can be chosen based on ➢ Table columns ➢ Workload ➢ In harmony with the optimizer

Terminologies Indexable column : Columns in a query that are potentially useful for indexing Admissible index : Index that is on one or more (in case of a multi-column index) indexable columns of the query. Admissible index : An index that is an admissible index for (for a workload) one or more queries in the workload.

Cost evaluation Basic method can is given M configurations and Q queries in the workload, optimize M*Q queries. Atomic configuration C: For some query in the workload there is a possible execution of the query by the query engine that uses all indexes in C. For select/update queries, for a non-atomic configuration C, there exists an atomic configuration Ci such that Cost (Q, C) = Min {(Cost(Q, Ci)}

Cost evaluation contd. . . For an insert/delete query, for a non-atomic configuration C The cost can be divided into: a) Cost of selection b) Cost of updating the table and the indexes used for selection c) Cost of updating indexes that do not affect the selection cost.

Cost evaluation contd. . . Cost evaluation module can identify atomic configuration heuristically. a) Restriction on the number of indexes per table b) Restriction on the number of indexes referenced per query. This can restrict the search space. The technique implemented by the authors includes two indexes per table and also two indexes referenced per query. This is called single join atomic configuration.

Candidate Index Selection for a given workload The number of admissible indexes for a given workload is very large. Determine the best configuration for each query independently and then use these configurations for the further step. Intuition: An index which is not a part of the best configuration for a single query is unlikely to be a part of best configuration for the entire workload. This method is called query-specific-best-configuration candidate index selection.

1. For workload W that consists of n queries, generate 'n' workloads each consisting of one query each, where Wi = { Qi} 2. For each workload Wi, we use the set of indexable columns Ii of the query in Wi as starting candidate indexes. 3. Let Ci be the configuration picked by index selection tool for Wi, i. e. , Ci = Enumerate (Ii, Wi). 4. The candidate index set for W is the union of all Ci's. Candidate index selection

Problem with this approach. . . Consider a query Q with indexable columns T 1. C 1 and T 2. C 2 such that the best configuration for Q is T 1. C 1 only. � Also, there is an insert query in the workload on T 1. C 1 with a high cost. � The enumeration phase would not consider T 1. C 2. � So, a general strategy can be consider even the second best configuration or the first 'k' best configurations. �

The candidate index selection does perform well. Why ? ➢ Indexes that are part of 'next' best configuration of a query may appear as the best configuration for another query in the workload. ➢ Also, most of the indexes in the second best configuration do find their way into the best configuration.

Enumeration Now we have 'n' candidate indexes and we want to pick k indexes. A naïve method is to enumerate all the subsets. However this is not practical as for realistic values of n and k. (e. g. n=40 and k=10. )

Greedy(m, k) enumeration approach 1. Let S = the best m index configuration using the naive enumeration algorithm. If m = k then exit. 2. Pick a new index I such that Cost (S U {I}, W) <= Cost(S U(I’}, W) for any choice of I’ != I 3. If Cost (S U {I}) >= Cost(S) then exit Else S = S U (I} 4. If size(S) = k then exit 5. Go to 2 A small value of 'm' leads to near optimal results.

Why Greedy algorithm performs well ? ? Interaction between indexes. For example, merge join with two clustered indexes. If one of the indexed was picked during the 'm' phase of the above algorithm, its likely that the other will be picked during the 'k' phase as it will reduce the overall cost.

Multi-column Index Generation ➢ Considering all the multi-column indexes is not possible. This paper introduces an iterative approach for taking into account multi-column indexes of increasing width. ➢ First step is to find admissible two column indexes. This set along with the winning one-column indexes becomes the input for the second iteration. ➢ The strategy followed is for a two-column index to be desirable, a single column index on its leading column is also desirable. ➢

Variants of this strategy MC-LEAD : In (a, b), 'b' need not be an winning single column index MC-ALL : Both a and b need to be winning single column indexes.

Experimental results

BEST-CONF-1: The winning index BEST-CONF-2: The first two winning indexes.

Summary ➢Selecting ➢An candidate indexes algorithm for configuration enumeration ➢Multi-column index Can these ideas be applicable to materialized views ? ?

Automated Selection of Materialized Views and Indexes for SQL Databases Sanjay Agrawal Surajit Chaudhuri Vivek Narasayya

Indexes and Materialized views ➢Indexes ➢Views can be considered as special materialized views. have rich structure. ➢Space of potentially interesting materialized views for a given workload is very large. ➢The technique proposed deals with automate selection of materialized views in presence of: a) Workload b) Indexes c) Indexes on materialized views d) In sync with the optimizer

Indexed Views in SQL server When a clustered index is created on the view, SQL Server immediately allocates storage space to store the results of the view. Indexed View definition cannot contain the following a) b) c) d) e) f) g) TOP DISTINCT MIN, MAX, COUNT, STDEV, VARIANCE, AVG Another view UNION Subqueries, outer joins, self joins Cannot include order by in view definition and some more. . .

joint enumeration

Examples Example 1. Workload consisting of 1000 queries of the form: SELECT l_returnflag, l_linestatus, SUM(l_quantity) FROM lineitem WHERE l_shipdate BETWEEN <Date 1> and <Date 2> GROUP BY l_returnflag, l_linestatus (Assume that each of the 1000 queries has different constants for <Date 1> and <Date 2> ). MV that can service all 1000 queries SELECT l_shipdate, l_returnflag, l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY l_shipdate, l_returnflag, l_linestatus

Example 2: Workload of 100 queries whose total cost is 10, 000 units. Let T be a table-subset that occurs in 25 queries whose combined cost is 50 units. Then even if we considered all syntactically relevant materialized views on T, the maximum possible benefit of those materialized views for the workload is 0. 5%. Also, the number of rows matters. For example, tables lineitem and orders may have 6 million and 1. 5 million rows respectively, but tables nation and region may be very small (25 and 5 rows respectively).

Features (tuning mode, performance benefit) SQL Server 7. 0 : Indexes Only, Thorough, 49% SQL Server 2000 : Indexes Only, Fast, 37% SQL Server 2000 : Indexes Only, Medium, 39% SQL Server 2000 : Indexes and Indexed Views, Thorough, 79% The Index Tuning Wizard in SQL Server 2000 supports the ability to randomly sample queries from a workload and restrict tuning to the sampled queries

Features ➢Index Usage Report ➢Query Cost Report ➢View-Table Relations Report ➢Workload Analysis Report ➢Maximum columns per index ➢Maximum space for the recommendation

Other References Auto. Admin: Self-Tuning and Self-Administering Databases http: //research. microsoft. com/dmx/autoadmin/default. asp ➢Index tuning wizard http: //www. microsoft. com/technet/prodtechnol/sql/2000/maintain/tun esql. mspx ➢Indexed Views in SQL server http: //www. sqlteam. com/item. asp? Item. ID=1015

Extra slides

Candidate materialized view selection 1) From the large space of all possible table-subsets for the workload, we arrive at a smaller set of interesting table-subsets. 2) Based on these, a) Propose a set of materialized views for each query in the workload b) From this set we select a configuration that is best for that query. 3) View merging

Properties of Merged View 1. All queries answered using either of the parent views should be answerable using the merged view. 2. Prevent the merged view from becoming too large as compared to its parents based upon some threshold.

Metrics used TS-Cost(T) = total cost of all queries in the workload (for the current database) where table-subset T occurs. If all queries in the workload referenced the tables A, B, C and D together, then using the TS-Cost(T) metric, the table-subsets T 1 = {A, B} would have the same importance as the table-subset T 2 = {C, D}. TS-Weight(T) = sum[ Cost(Qi)* (sum of sizes of tables in T)/ (sum of sizes of all tables referenced in Qi)) The summation is only over queries in the workload where T occurs.

Then, propose MV on these table sets. For a given query, take the view with minimum cost.

Merging Views R= Materialized views returned by enumeration While( | R | > 1) M' = Views returned by Merge. View on each pair of R If M'={} return (R-M) R= R U M' For each view in M' remove both its parents End While Return (R-M)