Temple University CIS Dept CIS 331 Principles of

General Overview - rel. model n n n Relational model - SQL Functional Dependencies

Overview of a DBMS Naïve user casual user DML precomp. DML parser DBA DDL

Overview - detailed n n n Motivation - Why q-opt? Equivalence of expressions Cost

Why Q-opt? n n SQL: ~declarative good q-opt -> big difference n e. g.

Q-opt steps n n bring query in internal form (e. g. , parse tree)

Q-opt - example p select name from STUDENT, TAKES Canonical form s where c-id=‘CIS

Q-opt - example p Hash join; merge join; nested loops; STUDENT s Index; seq.

Overview - detailed n n n Why q-opt? Equivalence of expressions Cost estimation Cost

Equivalence of expressions n n Q: How to prove a transformation rule? A: use

Equivalence of expressions n Selections n perform them early break a complex predicate, and

Equivalence of expressions n Projections n perform them early (but carefully…) n n n

Equivalence of expressions n Joins n n Commutative , associative Q: n-way join -

Cost estimation n n E. g. , find ssn’s of students with an ‘A’

Cost estimation n Statistics: for each relation ‘r’ we keep n n Sr #1

Cost estimation n Statistics: for each relation ‘r’ we keep n n n Sr

Derivable statistics n n n fr: blocking factor = max# records/block (=? ? )

Derivable statistics n n fr: blocking factor = max# records/block (= B/Sr ; B:

Derivable statistics n SC(A, r) = selection cardinality = avg# of records with A=given

Additional quantities we need: n For index ‘i’: n n fi: average fanout -

Statistics n n Where do we store them? How often do we update them?

Q-opt steps n n n bring query in internal form (e. g. , parse

Cost estimation + plan generation n Selections – e. g. , Sr select *

Cost estimation + plan generation n Sr Plans? n n seq. scan binary search

Cost estimation + plan generation Sr seq. scan – cost? n br (worst case)

Cost estimation + plan generation Sr binary search – cost? if sorted and consecutive:

Cost estimation + plan generation Sr estimation of selection cardinalities SC(A, r): non-trivial #1

Cost estimation + plan generation Sr method#3: index – cost? n levels of index

Cost estimation + plan generation Sr method#3: index – cost? n n levels of

Cost estimation – arithmetic examples n n find accounts with branch-name = ‘Perryridge’ account(branch-name,

Arithm. examples – cont’d n n n-account = 10, 000 tuples f-account = 20

Arithm. examples n n n Q 1: cost of seq. scan? A 1: 500

Arithm. examples n A 2: HTi + SC(branch-name, account)/f-account n n HTi: 50 values,

Arithm. examples n n n A 2 final answer: 2+10= 12 block accesses (vs.

2 -way joins n n algorithm(s) for r JOIN s? nr, ns tuples each

2 -way joins n Algorithm #0: (naive) nested loop (SLOW!) for each tuple tr

2 -way joins n n Algorithm #0: why is it bad? how many disk

2 -way joins n Algorithm #1: Blocked nested-loop join n read in a block

2 -way joins n Arithmetic example: n n nr = 10, 000 tuples, br

2 -way joins n Observation 1: Algo#1: asymmetric: n n n cost: br +

2 -way joins n n Other algorithm(s) for r JOIN s? nr, ns tuples

2 -way joins - other algo’s n sort-merge n n sort ‘r’; sort ‘s’;

2 -way joins - other algo’s n hash join: n n n hash ‘r’

Structure of query optimizers: More heuristics by Oracle, Sybase and Starburst (-> DB 2)

Conclusions -- Q-opt steps n n n bring query in internal form (eg. ,

Slides: 53

Download presentation

Temple University – CIS Dept. CIS 331– Principles of Database Systems V. Megalooikonomou Query Processing (based on notes by C. Faloutsos at CMU)

General Overview - rel. model n n n Relational model - SQL Functional Dependencies & Normalization Physical Design; Indexing Query processing/optimization Transaction processing Advanced topics n n Distributed Databases OO- and OR-DBMSs

Overview of a DBMS Naïve user casual user DML precomp. DML parser DBA DDL parser trans. mgr buffer mgr Data-files catalog

Overview - detailed n n n Motivation - Why q-opt? Equivalence of expressions Cost estimation Cost of indices Join strategies

Why Q-opt? n n SQL: ~declarative good q-opt -> big difference n e. g. , seq. Scan vs B-tree index, on P=1, 000 pages

Q-opt steps n n bring query in internal form (e. g. , parse tree) … into ‘canonical form’ (syntactic q-opt) generate alternative plans estimate cost; pick best

Q-opt - example p select name from STUDENT, TAKES Canonical form s where c-id=‘CIS 331’ and s STUDENT. ssn=TAKES. ssn STUDENT p TAKES STUDENT TAKES

Q-opt - example p Hash join; merge join; nested loops; STUDENT s Index; seq. scan TAKES

Overview - detailed n n n Why q-opt? Equivalence of expressions Cost estimation Cost of indices Join strategies

Equivalence of expressions n n n … or syntactic q-opt In short: perform selections and projections early More details: see transformation rules in text

Equivalence of expressions n n Q: How to prove a transformation rule? A: use TRC, to show that LHS = RHS, e. g. :

Equivalence of expressions

Equivalence of expressions n Selections n perform them early break a complex predicate, and push n simplify a complex predicate n n (‘X=Y and Y=3’) -> ‘X=3 and Y=3’

Equivalence of expressions n Projections n perform them early (but carefully…) n n n Smaller tuples Fewer tuples (if duplicates are eliminated) project out all attributes except the ones requested or required (e. g. , joining attr. )

Equivalence of expressions n Joins n n Commutative , associative Q: n-way join - how many diff. orderings? … Exhaustive enumeration too slow…

Q-opt steps n n bring query in internal form (e. g. , parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans estimate cost; pick best

Cost estimation n n E. g. , find ssn’s of students with an ‘A’ in CIS 331 (using seq. scanning) How long will a query take? n n CPU (but: small cost; decreasing; tough to estimate) Disk (mainly, # block transfers) How many tuples will qualify? (what statistics do we need to keep? ) 18

Cost estimation n Statistics: for each relation ‘r’ we keep n n Sr #1 #2 #3 nr : # tuples; Sr : size of tuple in bytes … #nr

Cost estimation n Statistics: for each relation ‘r’ we keep n n n Sr #1 #2 #3 … V(A, r): number of distinct values of attr. ‘A’ (recently, histograms, too) … #nr

Derivable statistics n n n fr: blocking factor = max# records/block (=? ? ) br: # blocks (=? ? ) SC(A, r) = selection cardinality = avg# of records with A=given (=? ? ) Sr #1 fr #2 … #br

Derivable statistics n n fr: blocking factor = max# records/block (= B/Sr ; B: block size in bytes) br: # blocks (= nr / fr )

Derivable statistics n SC(A, r) = selection cardinality = avg# of records with A=given (= nr / V(A, r) ) (assumes uniformity. . . ) – eg: 30, 000 students, 10 colleges – how many students in CST?

Additional quantities we need: n For index ‘i’: n n fi: average fanout - degree (~50 -100) HTi: # levels of index ‘i’ (~2 -3) n n ~ log(#entries)/log(fi) LBi: # blocks at leaf level HTi

Statistics n n Where do we store them? How often do we update them?

Q-opt steps n n n bring query in internal form (e. g. , parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans n n n selections; sorting; projections joins estimate cost; pick best

Cost estimation + plan generation n Selections – e. g. , Sr select * fr from TAKES where grade = ‘A’ n #1 #2 Plans? … #br

Cost estimation + plan generation n Sr Plans? n n seq. scan binary search n n #2 (if sorted & consecutive) index search n #1 fr if an index exists … #br

Cost estimation + plan generation Sr seq. scan – cost? n br (worst case) n br/2 (average, if we search for primary key) #1 fr #2 … #br

Cost estimation + plan generation Sr binary search – cost? if sorted and consecutive: n ~log(br) + n SC(A, r)/fr (=#blocks spanned by qualified tuples) n -1 #1 fr #2 … #br

Cost estimation + plan generation Sr estimation of selection cardinalities SC(A, r): non-trivial #1 fr #2 … #br

Cost estimation + plan generation Sr method#3: index – cost? n levels of index + n blocks w/ qual. tuples #1 fr. . . #2 case#1: primary key case#2: sec. key – clustering index case#3: sec. key – nonclust. index … #br

Cost estimation + plan generation Sr method#3: index – cost? n levels of index + n blocks w/ qual. tuples #1 fr. . #2 case#1: primary key – cost: HTi + 1 … HTi #br

Cost estimation + plan generation Sr method#3: index - cost? n levels of index + n blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index … OR prim. index on non-key …retrieve multiple records HTi + SC(A, r)/fr HTi #br

Cost estimation + plan generation Sr method#3: index – cost? n n levels of index + blocks w/ qual. tuples #1 fr. . . #2 case#3: sec. key – nonclust. index HTi + SC(A, r) (actually, pessimistic. . . ) … #br

Cost estimation – arithmetic examples n n find accounts with branch-name = ‘Perryridge’ account(branch-name, balance, . . . )

Arithm. examples – cont’d n n n-account = 10, 000 tuples f-account = 20 tuples/block V(balance, account) = 500 distinct values V(branch-name, account) = 50 distinct values for branch-index: fanout fi = 20

Arithm. examples n n n Q 1: cost of seq. scan? A 1: 500 disk accesses Q 2: assume a clustering index on branch-name – cost?

Cost estimation + plan generation Sr method#3: index – cost? n n levels of index + blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index … HTi + SC(A, r)/fr HTi #br

Arithm. examples n A 2: HTi + SC(branch-name, account)/f-account n n HTi: 50 values, with index fanout 20 -> HT=2 levels (log(50)/log(20) = 1+) SC(. . )= # qualified records = n n nr/V(A, r) = 10, 000/50 = 200 tuples SC/f: spanning 200/20 blocks = 10 blocks

Arithm. examples n n n A 2 final answer: 2+10= 12 block accesses (vs. 500 block accesses of seq. scan) footnote: in all fairness n n seq. disk accesses: ~2 msec or less random disk accesses: ~10 msec

Overview - detailed n n n Motivation - Why q-opt? Equivalence of expressions Cost estimation Cost of indices Join strategies

2 -way joins n n algorithm(s) for r JOIN s? nr, ns tuples each r(A, . . . ) s(A, . . . ) nr ns

2 -way joins n Algorithm #0: (naive) nested loop (SLOW!) for each tuple tr of r for each tuple ts of s print, if they match r(A, . . . ) s(A, . . . ) nr ns

2 -way joins n n Algorithm #0: why is it bad? how many disk accesses (‘br’ and ‘bs’ are the number of blocks for ‘r’ and ‘s’)? nr*bs + br r(A, . . . ) s(A, . . . ) nr ns

2 -way joins n Algorithm #1: Blocked nested-loop join n read in a block of r n read in a block of s n print matching tuples cost: br + br * bs r(A, . . . ) s(A, . . . ) nr, br ns records, bs blocks

2 -way joins n Arithmetic example: n n nr = 10, 000 tuples, br = 1, 000 blocks ns = 1, 000 tuples, bs = 200 blocks alg#0: 2, 001, 000 d. a. r(A, . . . ) 10, 000 1, 000 s(A, . . . ) alg#1: 201, 000 d. a. 1, 000 records, 200 blocks

2 -way joins n Observation 1: Algo#1: asymmetric: n n n cost: br + br * bs - reverse roles: cost= bs + bs*br Best choice? smallest relation in outer loop r(A, . . . ) s(A, . . . ) nr, br ns records, bs blocks

2 -way joins n n Other algorithm(s) for r JOIN s? nr, ns tuples each r(A, . . . ) s(A, . . . ) nr ns

2 -way joins - other algo’s n sort-merge n n sort ‘r’; sort ‘s’; merge sorted versions (good, if one or both are already sorted) r(A, . . . ) s(A, . . . ) nr ns

2 -way joins - other algo’s n hash join: n n n hash ‘r’ into (0, 1, . . . , ‘max’) buckets hash ‘s’ into buckets (same hash function) join each pair of matching buckets r(A, . . . ) 0 1 max s(A, . . . )

Structure of query optimizers: More heuristics by Oracle, Sybase and Starburst (-> DB 2) : in book In general: q-opt is very important for large databases. (‘explain select <sql-statement>’ gives plan)

Conclusions -- Q-opt steps n n n bring query in internal form (eg. , parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans n n selections (simple; complex predicates) sorting; projections joins estimate cost; pick best