Temple University CIS Dept CIS 331 Principles of
- Slides: 53
Temple University – CIS Dept. CIS 331– Principles of Database Systems V. Megalooikonomou Query Processing (based on notes by C. Faloutsos at CMU)
General Overview - rel. model n n n Relational model - SQL Functional Dependencies & Normalization Physical Design; Indexing Query processing/optimization Transaction processing Advanced topics n n Distributed Databases OO- and OR-DBMSs
Overview of a DBMS Naïve user casual user DML precomp. DML parser DBA DDL parser trans. mgr buffer mgr Data-files catalog
Overview - detailed n n n Motivation - Why q-opt? Equivalence of expressions Cost estimation Cost of indices Join strategies
Why Q-opt? n n SQL: ~declarative good q-opt -> big difference n e. g. , seq. Scan vs B-tree index, on P=1, 000 pages
Q-opt steps n n bring query in internal form (e. g. , parse tree) … into ‘canonical form’ (syntactic q-opt) generate alternative plans estimate cost; pick best
Q-opt - example p select name from STUDENT, TAKES Canonical form s where c-id=‘CIS 331’ and s STUDENT. ssn=TAKES. ssn STUDENT p TAKES STUDENT TAKES
Q-opt - example p Hash join; merge join; nested loops; STUDENT s Index; seq. scan TAKES
Overview - detailed n n n Why q-opt? Equivalence of expressions Cost estimation Cost of indices Join strategies
Equivalence of expressions n n n … or syntactic q-opt In short: perform selections and projections early More details: see transformation rules in text
Equivalence of expressions n n Q: How to prove a transformation rule? A: use TRC, to show that LHS = RHS, e. g. :
Equivalence of expressions
Equivalence of expressions
Equivalence of expressions n Selections n perform them early break a complex predicate, and push n simplify a complex predicate n n (‘X=Y and Y=3’) -> ‘X=3 and Y=3’
Equivalence of expressions n Projections n perform them early (but carefully…) n n n Smaller tuples Fewer tuples (if duplicates are eliminated) project out all attributes except the ones requested or required (e. g. , joining attr. )
Equivalence of expressions n Joins n n Commutative , associative Q: n-way join - how many diff. orderings? … Exhaustive enumeration too slow…
Q-opt steps n n bring query in internal form (e. g. , parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans estimate cost; pick best
Cost estimation n n E. g. , find ssn’s of students with an ‘A’ in CIS 331 (using seq. scanning) How long will a query take? n n CPU (but: small cost; decreasing; tough to estimate) Disk (mainly, # block transfers) How many tuples will qualify? (what statistics do we need to keep? ) 18
Cost estimation n Statistics: for each relation ‘r’ we keep n n Sr #1 #2 #3 nr : # tuples; Sr : size of tuple in bytes … #nr
Cost estimation n Statistics: for each relation ‘r’ we keep n n n Sr #1 #2 #3 … V(A, r): number of distinct values of attr. ‘A’ (recently, histograms, too) … #nr
Derivable statistics n n n fr: blocking factor = max# records/block (=? ? ) br: # blocks (=? ? ) SC(A, r) = selection cardinality = avg# of records with A=given (=? ? ) Sr #1 fr #2 … #br
Derivable statistics n n fr: blocking factor = max# records/block (= B/Sr ; B: block size in bytes) br: # blocks (= nr / fr )
Derivable statistics n SC(A, r) = selection cardinality = avg# of records with A=given (= nr / V(A, r) ) (assumes uniformity. . . ) – eg: 30, 000 students, 10 colleges – how many students in CST?
Additional quantities we need: n For index ‘i’: n n fi: average fanout - degree (~50 -100) HTi: # levels of index ‘i’ (~2 -3) n n ~ log(#entries)/log(fi) LBi: # blocks at leaf level HTi
Statistics n n Where do we store them? How often do we update them?
Q-opt steps n n n bring query in internal form (e. g. , parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans n n n selections; sorting; projections joins estimate cost; pick best
Cost estimation + plan generation n Selections – e. g. , Sr select * fr from TAKES where grade = ‘A’ n #1 #2 Plans? … #br
Cost estimation + plan generation n Sr Plans? n n seq. scan binary search n n #2 (if sorted & consecutive) index search n #1 fr if an index exists … #br
Cost estimation + plan generation Sr seq. scan – cost? n br (worst case) n br/2 (average, if we search for primary key) #1 fr #2 … #br
Cost estimation + plan generation Sr binary search – cost? if sorted and consecutive: n ~log(br) + n SC(A, r)/fr (=#blocks spanned by qualified tuples) n -1 #1 fr #2 … #br
Cost estimation + plan generation Sr estimation of selection cardinalities SC(A, r): non-trivial #1 fr #2 … #br
Cost estimation + plan generation Sr method#3: index – cost? n levels of index + n blocks w/ qual. tuples #1 fr. . . #2 case#1: primary key case#2: sec. key – clustering index case#3: sec. key – nonclust. index … #br
Cost estimation + plan generation Sr method#3: index – cost? n levels of index + n blocks w/ qual. tuples #1 fr. . #2 case#1: primary key – cost: HTi + 1 … HTi #br
Cost estimation + plan generation Sr method#3: index - cost? n levels of index + n blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index … OR prim. index on non-key …retrieve multiple records HTi + SC(A, r)/fr HTi #br
Cost estimation + plan generation Sr method#3: index – cost? n n levels of index + blocks w/ qual. tuples #1 fr. . . #2 case#3: sec. key – nonclust. index HTi + SC(A, r) (actually, pessimistic. . . ) … #br
Cost estimation – arithmetic examples n n find accounts with branch-name = ‘Perryridge’ account(branch-name, balance, . . . )
Arithm. examples – cont’d n n n-account = 10, 000 tuples f-account = 20 tuples/block V(balance, account) = 500 distinct values V(branch-name, account) = 50 distinct values for branch-index: fanout fi = 20
Arithm. examples n n n Q 1: cost of seq. scan? A 1: 500 disk accesses Q 2: assume a clustering index on branch-name – cost?
Cost estimation + plan generation Sr method#3: index – cost? n n levels of index + blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index … HTi + SC(A, r)/fr HTi #br
Arithm. examples n A 2: HTi + SC(branch-name, account)/f-account n n HTi: 50 values, with index fanout 20 -> HT=2 levels (log(50)/log(20) = 1+) SC(. . )= # qualified records = n n nr/V(A, r) = 10, 000/50 = 200 tuples SC/f: spanning 200/20 blocks = 10 blocks
Arithm. examples n n n A 2 final answer: 2+10= 12 block accesses (vs. 500 block accesses of seq. scan) footnote: in all fairness n n seq. disk accesses: ~2 msec or less random disk accesses: ~10 msec
Overview - detailed n n n Motivation - Why q-opt? Equivalence of expressions Cost estimation Cost of indices Join strategies
2 -way joins n n algorithm(s) for r JOIN s? nr, ns tuples each r(A, . . . ) s(A, . . . ) nr ns
2 -way joins n Algorithm #0: (naive) nested loop (SLOW!) for each tuple tr of r for each tuple ts of s print, if they match r(A, . . . ) s(A, . . . ) nr ns
2 -way joins n n Algorithm #0: why is it bad? how many disk accesses (‘br’ and ‘bs’ are the number of blocks for ‘r’ and ‘s’)? nr*bs + br r(A, . . . ) s(A, . . . ) nr ns
2 -way joins n Algorithm #1: Blocked nested-loop join n read in a block of r n read in a block of s n print matching tuples cost: br + br * bs r(A, . . . ) s(A, . . . ) nr, br ns records, bs blocks
2 -way joins n Arithmetic example: n n nr = 10, 000 tuples, br = 1, 000 blocks ns = 1, 000 tuples, bs = 200 blocks alg#0: 2, 001, 000 d. a. r(A, . . . ) 10, 000 1, 000 s(A, . . . ) alg#1: 201, 000 d. a. 1, 000 records, 200 blocks
2 -way joins n Observation 1: Algo#1: asymmetric: n n n cost: br + br * bs - reverse roles: cost= bs + bs*br Best choice? smallest relation in outer loop r(A, . . . ) s(A, . . . ) nr, br ns records, bs blocks
2 -way joins n n Other algorithm(s) for r JOIN s? nr, ns tuples each r(A, . . . ) s(A, . . . ) nr ns
2 -way joins - other algo’s n sort-merge n n sort ‘r’; sort ‘s’; merge sorted versions (good, if one or both are already sorted) r(A, . . . ) s(A, . . . ) nr ns
2 -way joins - other algo’s n hash join: n n n hash ‘r’ into (0, 1, . . . , ‘max’) buckets hash ‘s’ into buckets (same hash function) join each pair of matching buckets r(A, . . . ) 0 1 max s(A, . . . )
Structure of query optimizers: More heuristics by Oracle, Sybase and Starburst (-> DB 2) : in book In general: q-opt is very important for large databases. (‘explain select <sql-statement>’ gives plan)
Conclusions -- Q-opt steps n n n bring query in internal form (eg. , parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans n n selections (simple; complex predicates) sorting; projections joins estimate cost; pick best
- Temple university cis
- Cis 331
- Cis temple
- Temple university entrepreneurship
- Engineering temple run
- Irb temple
- Temple university disability services
- Study abroad application temple
- Temple university admission requirements
- Temple university writing center
- Isss temple
- Temple university dba
- Map of consciousness
- Temple university change password
- Temple university undergraduate bulletin
- Ce 331
- Ssis 331
- Uw cse 331
- 14:332:331
- 14:332:331
- Affirmative easement
- Cse 331
- Cse 332 p3
- Ist 331
- Cmsc 331 umbc
- Cmsc 331
- Ce 331
- Ist 331
- Cmsc 331
- Cmsc 331
- Cow metu
- Cmsc 331
- Cmsc 331
- In 331 bce, alexander the great successfully invaded egypt.
- Ley 26 331
- Ley 26 331
- Ley 26 331
- Komax gamma 333 manual
- Pred-331
- 14:332:331
- Ist spring design
- 2011 plc (cs) 331
- Anth 331
- Mist-331
- Cmsc 331
- Umbc cmsc 341
- 14:332:331
- Dept nmr spectroscopy
- Florida department of agriculture and consumer services
- Finance department organizational chart
- Worcester public health department
- Dept. name of organization
- Mn dept of education
- Department of finance and administration