Optimizing Query Execution Zachary G Ives University of

Query Execution: The Basic Techniques Three general data organization techniques: § Indexing Associative lookup

Speeding Operations over Data Three general data organization techniques: § Indexing § Sorting §

General External Merge Sort § To sort a file with N pages using B

Applicability of Sort Techniques § Aggregation § Duplicate removal as an instance of aggregation

Merge Join § Requires data sorted by join attributes Merge and join sorted files,

Hashing Several types of hashing: § Static hashing § Extendible hashing § Consistent hashing

Static Hashing § Fixed number of buckets (and pages); overflow when necessary § h(k)

Extendible Hashing If a bucket becomes full split in half § § Use directory

Insert h(r)=20 (Causes Doubling) LOCAL DEPTH 2 32*16* GLOBAL DEPTH 2 00 Bucket A

Relevance of Hashing Techniques § Hash indices use extensible hashing § Uses of static

Hash Join Read entire inner relation into hash table (join attributes as key) For

Running out of Memory § Prevention: First partition the data by value into memory

The Duality of Hash and Sort Different means of partitioning and merging data when

What If I Want to Distribute Query Processing? § Where do I put the

Pipelined Hash Join Useful for Joining Web Sources § Two hash tables § As

The Dependent Join § Take attributes from left and feed to the right source

Wrap-Up of Execution Query execution is all about engineering for efficiency § O(1) and

Query Optimization § Challenge: pick the query execution plan that has minimum cost §

The General Model of Optimization § Given an AST of a query: § Build

Which Operators Need Significant Optimization Decisions? § We typically make the following assumptions: §

The Basic Model: System-R § Breaks a query into its blocks, separately optimizes them

Next Time: Beyond System-R § Cross-query-block optimizations § e. g. , push a selection

Upcoming Readings For Tuesday: § Read Volcano and Starburst papers § Write one review

Slides: 24

Download presentation

Optimizing Query Execution Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems September 18, 2008 Content on hashing and sorting courtesy Ramakrishnan & Gehrke

Query Execution: The Basic Techniques Three general data organization techniques: § Indexing Associative lookup & synopses Both for selection and projection “Inner” loop of nested loops join … And anywhere sorted data is useful… § Sorting § Hashing 2

Speeding Operations over Data Three general data organization techniques: § Indexing § Sorting § Hashing 3

General External Merge Sort § To sort a file with N pages using B buffer pages: § Pass 0: use B buffer pages. Produce d. N / Be sorted runs of B pages each § Pass 2, …, etc. : merge B-1 runs § Number of passes: 1+dlog. B-1 d. N / Bee § Cost = 2 N * (# of passes) INPUT 1 . . . INPUT 2 . . . OUTPUT . . . INPUT B-1 Disk B Main memory buffers Disk

Applicability of Sort Techniques § Aggregation § Duplicate removal as an instance of aggregation § XML nesting as an instance of aggregation § Join, semi-join, and intersection 5

Merge Join § Requires data sorted by join attributes Merge and join sorted files, reading sequentially a block at a time § Maintain two file pointers While tuple at R < tuple at S, advance R (and vice versa) While tuples match, output all possible pairings § Maintain a “last in sequence” pointer § Preserves sorted order of “outer” relation § Cost: b(R) + b(S) plus sort costs, if necessary In practice, approximately linear, 3 (b(R) + b(S)) 6

Hashing Several types of hashing: § Static hashing § Extendible hashing § Consistent hashing (used in P 2 P; we’ll see later) 7

Static Hashing § Fixed number of buckets (and pages); overflow when necessary § h(k) mod N = bucket to which data entry with key k belongs § Downside: long overflow chains h(key) mod N key 0 2 h N-1 Primary bucket pages Overflow pages

Extendible Hashing If a bucket becomes full split in half § § Use directory of pointers to buckets, double the directory, splitting just the bucket that overflowed Directory much smaller than file, so doubling it is much cheaper § Only one page of data entries is split Trick lies in how hash function is adjusted!

Insert h(r)=20 (Causes Doubling) LOCAL DEPTH 2 32*16* GLOBAL DEPTH 2 00 Bucket A 2 1* 5* 21*13* Bucket B 10 2 11 10* Bucket C 3 000 2 1* 5* 21* 13* Bucket B 010 2 011 10* Bucket C 100 Bucket D 101 2 110 15* 7* 19* Bucket D 111 2 4* 12* 20* 32* 16* Bucket A 001 2 15* 7* 19* 3 GLOBAL DEPTH 01 DIRECTORY LOCAL DEPTH Bucket A 2 (`split image' of Bucket A) 3 DIRECTORY 4* 12* 20* Bucket A 2 (‘split image' of Bucket A)

Relevance of Hashing Techniques § Hash indices use extensible hashing § Uses of static hashing: § Aggregation § Intersection § Joins Why isn’t extendible hashing used in hash joins – only as a disk indexing technique? 11

Hash Join Read entire inner relation into hash table (join attributes as key) For each tuple from outer, look up in hash table & join O Not fully pipelined 12

Running out of Memory § Prevention: First partition the data by value into memory -sized groups Partition both relations in the same way, write to files Recursively join the partitions § Resolution: Similar, but do when hash tables full Split hash table into files along bucket boundaries Partition remaining data in same way Recursively join partitions with diff. hash fn! § Hybrid hash join: flush “lazily” a few buckets at a time § Cost: <= 3 * (b(R) + b(S)) 13

The Duality of Hash and Sort Different means of partitioning and merging data when comparisons are necessary: § Break on physical rule (mem size) in sorting Merge on logical step, the merge § Break on logical rule (hash val) in hashing Combine using physical step (concat) § When larger-than-memory sorting is necessary, multiple operators use the same key, we can make all operators work on the same in-memory portion of data at the same time § Can we do this with hashing? Hash teams (Graefe) 14

What If I Want to Distribute Query Processing? § Where do I put the data in the first place (or do I have a choice)? § How do we get data from point A -> point B? § What about delays? § What about “binding patterns”? § Looks kind of like an index join with a sargable predicate 15

Pipelined Hash Join Useful for Joining Web Sources § Two hash tables § As a tuple comes in, add to the appropriate side & join with opposite table PFully pipelined, adaptive to source data rates PCan handle overflow as with hash join O Needs more memory 16

The Dependent Join § Take attributes from left and feed to the right source as input/filter § Important in data integration § Simple method: for each tuple from left send to right source get data back, join § More complex: § Hash “cache” of attributes & mappings § Don’t send attribute already seen § Bloom joins (use bit-vectors to reduce traffic) Join. A. x = B. y A x B 17

Wrap-Up of Execution Query execution is all about engineering for efficiency § O(1) and O(lg n) algorithms wherever possible § Avoid looking at or copying data wherever possible § Note that larger-than-memory is of paramount importance Should that be so in today’s world? As we’ve seen it so far, it’s all about pipelining things through as fast as possible But may also need to consider other axes: § Adaptivity/flexibility – may sometimes need this § Information flow – to the optimizer, the runtime system 18

Query Optimization § Challenge: pick the query execution plan that has minimum cost § Sources of cost: § § § Interactions with other work Size of intermediate results Choices of algorithms, access methods Mismatch between I/O, CPU rates Data properties – skew, order, placement § Strategy: Estimate the cost of every query plan, find cheapest § Given: § Some notion of CPU, disk speeds § Cost model for every operator § Some information about tables and data 19

The General Model of Optimization § Given an AST of a query: § Build a logical query plan (Tree of query algebraic operations) § Transform into “better” logical plan § Convert into a physical query plan (Includes strategies for executing operations) 20

Which Operators Need Significant Optimization Decisions? § We typically make the following assumptions: § All predicates are evaluated as early as possible § All data is projected away as early as possible § As a general rule, those that produce intermediate state or are blocking: § Joins § Aggregation § Sorting § By choosing a join ordering, we’re automatically choosing where selections and projections are pushed – why is this so? 21

The Basic Model: System-R § Breaks a query into its blocks, separately optimizes them § Focuses strictly on joins (and only a few kinds) in dynamic programming enumeration § Principle of optimality: best k-way join includes best (k-1)-way join § Use simple table statistics when available, based on indices; “magic numbers” where unavailable § Heuristics § § Push “sargable” selects, projects as low as possible Cartesian products after joins Left-linear trees only: n 2 n-1 cost-est. operations Grouping last § Extra “interesting orders” dimension Grouping, ordering, join attributes 22

Next Time: Beyond System-R § Cross-query-block optimizations § e. g. , push a selection predicate from one block to another § Better statistics § More general kinds of optimizations § § Optimization of aggregation operations Different cost and data models, e. g. , OO, XML Additional joins, e. g. , “containment joins” Can we build an extensible architecture for this? Logical, physical, and logical-to-physical transformations Enforcers § Alternative search strategies § Left-deep plans aren’t always optimal § Perhaps we can prune more efficiently 23

Upcoming Readings For Tuesday: § Read Volcano and Starburst papers § Write one review contrasting the two on the major issues 24