Recent Advances in Query Optimization Tutorial by S

Recent Advances in Query Optimization Tutorial by: S. Sudarshan IIT Bombay sudarsha@cse. iitb. ernet. in www. cse. iitb. ernet. in/~sudarsha 1

Talk Outline z System R, Volcano z Recent extensions (including OODBs, ORDBs) z OLAP z Materialized views: y maintenance, use and selection, continuous queries z Caching of Query Results z Data Warehouses and Virtual Warehouses S. Sudarshan: Recent Advances in Query Optimization 2

System R z Join order selection y. A 1 A 2 A 3. . y. Left deep join trees An Ak Ai y. Dynamic programming x. Best plan computed for each subset of relations • Best plan (A 1, . . , An) = min cost plan of( A 1 Best plan(A 2, . . , An) A 2 Best plan(A 1, A 3, . . , An) …. An Best plan(A 1, . . , An-1)) S. Sudarshan: Recent Advances in Query Optimization 3

System R (cont) z Selects and projects pushed down to lowest possible place z Sort order yjoin may be cheaper if inputs are sorted on join attr y=> Best plan(set-of-relations, sort-order) z Starburst (successor to System R) yretains single query block-at-a-time cost based optimization y+ heuristic Query Rewrite xincluding decorrelation of nested queries S. Sudarshan: Recent Advances in Query Optimization 4

Decorrelation z Idea: convert nested subqueries to joins z Consider select * from emp E where E. numchildren <> (select count(*) from person where person. parent = E. name z Can’t always express using basic rel. algebra z Long history: yspecial cases: Kim 88, Dayal 88, Muralikrishna 93 ygeneral case: P. Seshadri et al 95: use outerjoin S. Sudarshan: Recent Advances in Query Optimization 5

Decorrelation (cont) z Pushing semijoins into decorrelated query yuse selections on correlation variables xselect * from R, S where R. A = S. A and R. B = (select min(T. B) from T where T. A=R. A) xdon’t evaluate groupby/min on all of T: x GB T. A, min(T. B) (T SJ T. A=R. A (R R. A=S. A S) S. Sudarshan: Recent Advances in Query Optimization 6

Magic Rewriting z Recursive views are now part of SQL-3, supported by DB 2 and Oracle already z Magic rewriting pushes semijoins through recursive views ypath (X, Y) : - edge (X, Y) path (X, Y) : - edge (X, Z), path(Z, Y) Query: ? path(Pune, Y) z Long history, see survey by Ramakrishnan and Ullman S. Sudarshan: Recent Advances in Query Optimization 7

Predicate Movearound z Idea: pull R. A=5 up, infer S. A=5, and push S. A=5 down into subtree S z Generalizes to any constraints z History: s R. A=5 S R y Fold/unfold transformation in logic programs y. Aggregate constraints and relevance RS, VLDB 91 y. Fold/unfold and constraints RS, ILPS 92 yfor SQL LMSS, SIGMOD 93 z Aggregate constraints GB A, min(B) S R S. Sudarshan: Recent Advances in Query Optimization 8

Volcano Extensible Query Optimizer Generator z General purpose cost based query optimizer, based on equivalence rules on algebras yeg equivalences: join associativity, select push down, aggregate push down, etc yextensible: new operations and equivalences can be easily added ynotion of physical properties generalizes “interesting sort order” idea of System R y. Developed by Graefe and Mc. Kenna 1993 z Follow up to EXODUS, but much more efficient S. Sudarshan: Recent Advances in Query Optimization 9

Key Ideas in Volcano z DAG representation of query y. Equivalence node and operation nodes y. Compactly represents set of all evaluation plans xchoose one child of each equivalence node, and all children of operation nodes ABC AB AC BC A B C S. Sudarshan: Recent Advances in Query Optimization 10

Key Ideas of Volcano (Cont) z Hashing scheme used to efficiently detect duplicate expressions ygives ID to each equivalence node, hash function of operation nodes based on Ids of child equivalence nodes z Physical algebra also represented by DAG z Best plan found for each equivalence node yuse cheapest of child operation nodes ydynamic programming: cache best plans ybranch and bound pruning used when searching S. Sudarshan: Recent Advances in Query Optimization 11

Main Benefits of Volcano z Highly Extensible ycan handle arbitrary algebraic expressions ynew operators and equivalence rules easy to add xmust be careful of search space though z Yet (reasonably) efficient ygeneralizes the dynamic programming idea of System -R optimizer y. Optimizations of Pellenkroft et al. [VLDB 97] eliminate redundant derivations for joins z Ideas are used in MS SQL Server and Tandem S. Sudarshan: Recent Advances in Query Optimization 12

Parametrized Query Optimization z Some parameters to the query may not be available at optimization time yselection constants (e. g. in stored procedures) ymemory size z Idea: y come up with a set of plans optimal at different points in parameter space, yselect best when parameters are known at run time z Work in this area y. Ganguly [VLDB 1998], Ganguly and Krishnamurthy [COMAD 95], Ng et al [SIGMOD 92] S. Sudarshan: Recent Advances in Query Optimization 13

Parametric Query Opt (Cont) z Results of Ganguly [1998] y. Number of parametrically optimal queries is quite small, so idea is practical ynice algorithms for single parameter case yextended above to two parameter case, but general case is harder z Optimization for best expected case (P. Seshadri, PODS 99) S. Sudarshan: Recent Advances in Query Optimization 14

Sampling and Approximate Query Answering z In databases, sampling originally proposed for query size estimation (estimate need not be perfect) Li and Naughton [94], Olken [93] z Used today for generating quick and dirty (fast but approximate) results yespecially for aggregates on large tables z Online aggregates (Hellerstein. . ) z Generating histograms (Ioannidis. . ) S. Sudarshan: Recent Advances in Query Optimization 15

Optimization in OODB/ORDBs z Major issues y. Path expressions: xe. g. forall ( p in person) print (p->spouse->name) xcan convert pointer dereferences to joins xcan “assemble objects” in a clever sequence to minimize I/O (Graefe 93, Blakeley et al, Open OODB optimizer 95) y. Path indices xe. g. forall (p in person suchthat p->spouse->name = “Rabri”) … S. Sudarshan: Recent Advances in Query Optimization 16

Optimization in ORDBs y. Expensive predicates/functions in selects/projects xe. g. selects based on image manipulation xusual heuristic of “push select predicates to lowest possible level’’ does not work x. Hack to System R: treat predicates like joins • not an issue with Volcano • also heuristics to limit search space (Hellerstein and Naughton (93, 94), Chaudhuri et al (93) S. Sudarshan: Recent Advances in Query Optimization 17

Extended ADTs z ADTs are a simple way to add new types to a database. Used extensively in data blades/cartridges/… z Extended ADTs -- understand some semantics of ADT functions, and optimize ye. g. if Image. smooth(). clip(10, 10) is equivalent to Image. clip(10, 10). smooth choose the one that is cheaper to compute y. Predator ORDB supports such optimizations (P. Seshadri [1998]) S. Sudarshan: Recent Advances in Query Optimization 18

Multi Query Optimization z Idea: Given a set of queries to evaluate, exploit common subexpressions by materializing and sharing them z Problems: Many equivalent forms of a query y. Some have CSE, others dont. E. g. : x. R S T and R P S versus x. R S T and R S P z Exhaustive algos: Sellis [1988], and others ytry every combination of forms of every query. yproblem: cost is doubly exponential S. Sudarshan: Recent Advances in Query Optimization 19

Multi Query Optimization (Cont) z Heuristics y. Find best plans for each query, look for CSEs in best plans x. Subramaniam and Venkataraman [SIGMOD 98] x. Volcano SH [RSSB 99] y. When optimizing query i, treat subparts of plans for earlier queries as available cheaply x. Volcano RU [RSSB 99] S. Sudarshan: Recent Advances in Query Optimization 20

Greedy Heuristics for MQO z Greedy heuristic: y. Repeat xfind subexpression which if materialized and shared will give most benefit (cheapest plan) • subproblem: given some subexpressions are materialized, find best plans for given queries • also: update the best plans incrementally as new subexpressions are checked for materialization xmaterialize above subexpression y. Until no further benefits can be got S. Sudarshan: Recent Advances in Query Optimization 21

Greedy Heuristic (Cont) z Monotonicity addition to greedy heuristic: y. Benefit of materializing a subexpression cannot increase as other subexpressions are materialized y. Assume above, and keep heap of overestimates of benefits -- reduces number of benefit recomputations z Performance study shows greedy heuristic gives very significant benefits on TPCD queries at reasonable cost z Volcano-SH and Volcano-RU are very fast but give much less benefits than Greedy S. Sudarshan: Recent Advances in Query Optimization 22

OLAP - Data Cube z Idea: analysts need to group data in many different ways yeg. Sales(region, product, prodtype, prodstyle, date, saleamount) ysaleamount is a measure attribute, rest are dimension attributes ygroupby every subset of the other attributes xprecompute above to give online response y. Also: hierarchies on attributes: date -> weekday, date -> month -> quarter -> year S. Sudarshan: Recent Advances in Query Optimization 23

OLAP Issues z MOLAP: cube in memory, multi-dimensional array z ROLAP: cube in DB, represented as a relation z S. Sudarshan: Recent Advances in Query Optimization 24

Data Cube Lattice z Cube lattice y ABC AB A AC BC B C none z Can materialize some groupbys, compute others on demand z Question: which groupbys to materialze? z Question: what indices to create z Question: how to organize data (chunks, etc) S. Sudarshan: Recent Advances in Query Optimization 25

Cube: Selecting what to materialize z Basic cube: materializes everyting z Greedy Algo: max benefit per unit space ybenefit computation takes into account what is already materialized y. Harinarayanan et al [SIGMOD 96], Gupta [ICDE 97], Labio et al … z Smallest Algo y. Deshpande et al [SIGMOD 98] S. Sudarshan: Recent Advances in Query Optimization 26

Materialized Views z Can materialize (precompute and store) views to speed up queries y. Incremental maintenance xwhen database is updated, propagate updates to materialized view y. Deciding when to use materialized views xeven if query does not refer to materialized view, optimizer can figure out it can be used y. Deciding what to materialize xbased on workload, choose best set of views to materialize, subject to space constraints S. Sudarshan: Recent Advances in Query Optimization 27

Incremental View Maintenance z E. g. R S (R U ir) S=R S U ir S (R - dr) S=R S - dr S z similar techniques for selection, projection (must maintain multiplicity counters though) and aggregation z Blakeley et al. [SIGMOD 87], Gupta and Mumick survey [DE Bulletin 95]. S. Sudarshan: Recent Advances in Query Optimization 28

Continuous Querying z Idea: define a query, results get updated and shown to you dynamically, as base data changes z E. g. applications: ynetwork monitoring, stock monitoring yalerting systems (e. g. , new book arrived in library) xbetter than triggers for this application z Implementation techniques similar to materialized view maintenance z Maier et al, SIGMOD 98 demo session S. Sudarshan: Recent Advances in Query Optimization 29

When to Use Materialized Views z Let V = R S be materialized z Query may V, but may still be better to replace by view definition. Eg selection on V z Query may use R S, but may be better to replace by V z Job of query optimizer y. Chaudhuri et al [ICDE 95] y. Falls out as special case of multiquery optimization algos of RSSB 99 S. Sudarshan: Recent Advances in Query Optimization 30

Deciding What to Materialize z maintenance cost and query cost yworkload: xqueries and update transactions xweights for each component of workload z workload cost depends on what is materialized z Goal: find set of views that gives minimum cost if materialized, subject to space constraints z Note: materializing views can reduce even update costs yindices, and SQL assertions S. Sudarshan: Recent Advances in Query Optimization 31

Deciding What to Materialize z History y. Roussopolous [1982]: exhaustive A* algorithm y. Ross, Srivastava and Sudarshan [SIGMOD 96] suggest materializing views can reduce update costs, give heuristics y. Labio et al. [1997], Gupta [1997], Sellis et al [1997], Yang, Karlapalem and Li [1997] give various exhaustive/heuristic/greedy algorithms y. Chaudhuri and Narsayya [1998] considers only indices, being introduced in SQL server y. Exhaustive algos are all doubly exponential! S. Sudarshan: Recent Advances in Query Optimization 32

Caching of Query Results z Store results of earlier queries z Motivation yspeed up access to remote data xalso reduce monetary costs if charge for access yinteractive querying often results in related queries xresults of one query can speed up processing of another ycaching can be at client side, in middleware, and even in a database server itself S. Sudarshan: Recent Advances in Query Optimization 33

Query Caching (Cont) z Differences from page/object caching yresults that are cached are defined by a (possibly complex) query ycost of computing different results is different --- cost of fetching a page is same for all pages ysizes of different results is different --- page size is fixed z One heuristic: benefit = (recomp-cost * freq-access) / size y. Update frequence must also be taken into account S. Sudarshan: Recent Advances in Query Optimization 34

Query Caching (Cont) z Differences from selection of views to materialize ywhat to cache decided based on recent queries x=> set of cached results changes dynamically xadapts as users change their behaviour ycached data may not be maintained up-to-date x=> if base data has been updated, query optimizer must choose between recomputing cached results and incrementally computing changes S. Sudarshan: Recent Advances in Query Optimization 35

Query Caching (Cont) z Predicate caching (Wiederhold et al 1996) and Semantic caching (Dar et al, 1996) ynot tied to query optimizer z ADMS (Roussopolous, 1994) yhandles SPJ queries, with specific graph structure z WATCHMAN (Scheurmann et al, VLDB 96) ymakes caching decisions based on cost, frequency of usage and size yreuses cached results only if exactly same query repeats S. Sudarshan: Recent Advances in Query Optimization 36

Query Caching (Cont) z Dynamat (Roussopolous et al, SIGMOD 99) yconsiders caching of data cube queries ynot general purpose unlike ADMS, but handles update costs better z Web caching is somewhat similar ycached pages differ in size, and in access cost (e. g. , local pages can be accessed faster) S. Sudarshan: Recent Advances in Query Optimization 37

Data Warehouses z Characteristics: y. Very large ytypical schema: very large fact table, small dimension tables ytypical query: aggregate on join of fact table and dimension tables z Can exploit above characteristics for optimizing queries ye. g. , join dimension tables (even if cross product), build in memory index, scan fact table, probe index. Summarize if required and output S. Sudarshan: Recent Advances in Query Optimization 38

Data Warehouses (Cont) z Synchronized scans ymultiple queries can share a scan of fact table xslow some queries down so others catch up z Bit map indices yfor selections on low cardinality attributes ye. g. : M 10011100011001 F 01100011100110 yidea: and-ing of bit maps is very efficient, use on bitmaps to filter to relevant tuples, retrieve them y. Quass and O’Neill [Sigmod 1997], various DB products (DB 2, Informix, …) S. Sudarshan: Recent Advances in Query Optimization 39

Virtual Warehouses/Databases z Data sources are numerous and distributed ymay be accessible only via html x=> wrappers needed x. Stanform TSIMMIS project, Junglee, and others have built wrappers. ymay support only limited number of access types through forms interfaces ysite descriptions: describe what data is contained at a site Levy et al [1995]. x. Query sent only to relevant sites. S. Sudarshan: Recent Advances in Query Optimization 40

Virtual Warehouses and Databases (Cont) z Provide user with view of a single database, which can be queried z Underlying system must find best/good way of evaluating query S. Sudarshan: Recent Advances in Query Optimization 41

Parallel Databases z Search space is extremely large in general y. How to partition data y. How to partition operations z Two basic approaches y. Each operation is parallelized across all nodes y. Get best sequential plan, then parallelize xscheduling issues xpipelining issues S. Sudarshan: Recent Advances in Query Optimization 42

New Applications z Querying semistructured data y. XML y. Querying on the web x. Web. SQL, Web. OQL, . . (Mendelzon. . , Shmueli. . , Laks. . ) y. Formal query languages for semi-structureed data x. Buneman et al S. Sudarshan: Recent Advances in Query Optimization 43

Conclusions z Query optimization has come a long way in the last 5/6 years z Still an area of active research ylots of work on selection of materialized views, and caching late y. Driving forces: Object relational DBS, Web, increasingly complex DSS queries, Data mining yquery optimizers are still very expensive in space and time. Better approximation algorithms could help a lot. S. Sudarshan: Recent Advances in Query Optimization 44