Bancos de Dados Avanados Processamento de Consultas DCC

Bancos de Dados Avançados Processamento de Consultas DCC 030 - TCC: Bancos de Dados Avançados (Ciência Computação) DCC 049 - TSI: Bancos de Dados Avançados (Sistemas Informação) DCC 842 - Bancos de Dados (Pós-Graduação) MIRELLA M. MORO mirella@dcc. ufmg. br http: //www. dcc. ufmg. br/~mirella

Introdução Árvore de consulta Consulta costuma ter N estratégias de execução possíveis: Otimização de consulta = escolher estratégia adequada Estratégia de execução Plano de consulta Código pode ser: (1) Executado diretamente (modo interpretado) (2) Armazenado e executado + tarde (modo compilado) Figura 15. 1 - Passos típicos durante a execução de uma consulta de alto nível. Elmasri/Navathe 4ª Ed 2

Sumário Processamento de Consultas 1. Tradução SQL Álgebra Relacional 3. Algoritmos para operações – Ordenação externa – SELEÇÃO, JUNÇÃO, PROJEÇÃO, AGREGAÇÃO, JUNÇÃO EXTERNA 2. Otimização de consultas – Com heurísticas – Com estimativas de custo (seletividade) MATERIAL BASEADO EM : Elmasri/Navathe 4 a ed cap 15 / 6 a ed cap 19 3

Seletividade e Estimativas de Custo Componentes de Custo Funções de Custo para SELECT Funções de Custo para JOIN

Using Selectivity and Cost Estimates in Query Optimization • Cost-based query optimization: Estimate and compare the costs of executing a query using different execution strategies and choose the strategy with the lowest cost estimate. (Compare to heuristic query optimization) • Issues – Cost function – Number of execution strategies to be considered Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 5

Cost Components for Query Execution 1. Access cost to secondary storage [próximo slide] 2. Storage cost – intermediate files 3. Computation cost – in-memory ops on buffers (sorting records, merging records for a join, etc) 4. Memory usage cost – # buffers needed 5. Communication cost – shipping query + results Focus on different components – Large DBs: main emphasis on (1) – Small DBs: main emphasis on (3) Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 6

Cost Components for Query Execution 1. Access cost to secondary storage Search for, reading, writing data blocks – Type of access structures: ordering, hashing, primary/secondary index – Block allocation: contiguously, scattered • For most DBMSs, this is the cost to be minimized Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 7

Catalog Information Used in Cost Functions • Information about the size of a file – – number of records (tuples) (r), record size (R), number of blocks (b) blocking factor (bfr) (number of records that fit in one block) • Information about indexes and indexing attributes of a file – – Number of levels (x) of each multilevel index Number of first-level index blocks (b. I 1) Number of distinct values (d) of an attribute Selectivity (sl) of an attribute (fraction of records satisfying an equality condition on the attribute) – Selection cardinality (s) of an attribute. (s = sl * r) (average number of records that satisfy an equality selection condition on that attribute) E. g. for a key attribute: d = r, sl = 1/r and s=1 Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 8

Examples of Cost Functions for SELECT • S 1. Linear search (brute force) approach CS 1 a = b; For an equality condition on a key, CS 1 a = (b/2)* if the record is found; otherwise CS 1 b = b. • S 2. Binary search: CS 2 = log 2 b + ┌(s/bfr) ┐– 1 For an equality condition on a unique (key) attribute CS 2 =log 2 b* • S 3. Using a primary index (S 3 a) or hash key (S 3 b) to retrieve a single record CS 3 a = x + 1; CS 3 b = 1 for static or linear hashing CS 3 b = 1 for extendible hashing Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed *Na média, um registro é encontrado depois de pesquisar metade das chaves da tabela *Se for busca pelo atributo chave, s=1 Number of records (tuples) (r), Record size (R), Number of blocks (b) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 9

Examples of Cost Functions for SELECT (cont. ) • S 4. Using an ordering index to retrieve multiple records with inequality condition (<, >, ≠, ≤, ≥): For the comparison condition on a key field with *Na média, metade dos registros satisfazem à condição an ordering index, CS 4 = x + (b/2)* • S 5. Using a clustering index to retrieve multiple *s registros em s/bfr blocos satisfazem à condição records on equality condition: CS 5=x+┌ (s/bfr)* ┐ Number of records (tuples) (r), • S 6. Using a secondary (B+-tree) index: Record size (R), Number of blocks (b) For equality in key attribute, CS 6 a = x + 1 Blocking factor (bfr) #levels (x) of each multilevel index For equality comparison, CS 6 b = x + 1+ s* #first-level index blocks (b. I 1) For inequality condition (<, >, ≠, ≤, ≥): #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) CS 6 c = x + (b. I 1/2) + (r/2)** *s registros em s blocos: pois é índice secundário no qual o arquivo não está ordenado pelo campo não-chave ** na média, metade dos registros satisfazem à condição: metade dos blocos do primeiro nível do índice são acessados, mais metade do número de registros são acessados pelo índice Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 10

Examples of Cost Functions for SELECT (cont. ) • • S 7. Conjunctive selection: Use either S 1 or one of the methods S 2 to S 6 to solve. For the latter case, use one condition to retrieve the records and then check in the memory buffer whether each retrieved record satisfies the remaining conditions in the conjunction. S 8. Conjunctive selection using a composite index: Same as S 3 a, S 5 or S 6 a, depending on the type of index. Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed Number of records (tuples) (r), Record size (R), Number of blocks (b) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 11

Example of Using Cost Functions • Schema EMPLOYEE (SSN, fname, minit, lname, bdate, address, sex, salary, super_ssn, dno) • Statistics r. E = 10000 records • Access paths – – b. E = 2000 disk blocks bfr. E = 5 records/block Salary: clustering index x. Salary = 3, average selection cardinality ssalary = 20 SSN: secondary index xssn =4 (sssn = 1) Dno: secondary index, xdno = 2, bi 1 dno = 4. ddno = 125 distinct values, sdno = 80 Sex: secondary index , xsex = 1. dsex = 2 distinct values, ssex = 5000 • Queries a. b. c. d. σ ssn=‘ 123456789’(EMPLOYEE) σ dno>5(EMPLOYEE) σ dno=5 AND salary>30000 AND sex=‘F’(EMPLOYEE) • COST? ! Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 12

σ ssn=‘ 123456789’(EMPLOYEE) • Statistics r. E = 10000 records • Access paths – – b. E = 2000 disk blocks bfr. E = 5 records/block Salary: clustering index x. Salary = 3, average selection cardinality ssalary = 20 SSN: secondary index xssn =4 (sssn = 1) Dno: secondary index, xdno = 2, bi 1 dno = 4. ddno = 125 distinct values, sdno = 80 Sex: secondary index , xsex = 1. dsex = 2 distinct values, ssex = 5000 S 1. Linear search (brute force) approach CS 1 b = b / 2 (for a key attribute) Custo = 1000 S 6. Using a secondary (B+-tree) index: For an equality comparison, CS 6 a = x + s; Custo = 4 + 1 = 5 Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 13

σ dno > 5(EMPLOYEE) • Statistics r. E = 10000 records • Access paths – – b. E = 2000 disk blocks bfr. E = 5 records/block Salary: clustering index x. Salary = 3, average selection cardinality ssalary = 20 SSN: secondary index xssn =4 (sssn = 1) Dno: secondary index, xdno = 2, bi 1 dno = 4. ddno = 125 distinct values, sdno = 80 Sex: secondary index , xsex = 1. dsex = 2 distinct values, ssex = 5000 S 1. Linear search (brute force) approach CS 1 a = b; Custo = 2000 S 6. Using a secondary (B+-tree) index: For an comparison condition such as >, <, >=, or <= : CS 6 a = x + (b. I 1/2) + (r/2) Custo = 2 + 4/2 + 10000/2 = 5004 Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 14

σ dno = 5(EMPLOYEE) • Statistics r. E = 10000 records • Access paths – – b. E = 2000 disk blocks bfr. E = 5 records/block Salary: clustering index x. Salary = 3, average selection cardinality ssalary = 20 SSN: secondary index xssn =4 (sssn = 1) Dno: secondary index, xdno = 2, bi 1 dno = 4. ddno = 125 distinct values, sdno = 80 Sex: secondary index , xsex = 1. dsex = 2 distinct values, ssex = 5000 S 1. Linear search (brute force) approach CS 1 a = b; Custo = 2000 S 6. Using a secondary (B+-tree) index: For an equality comparison, CS 6 a = x + s; Custo = 2 + 80 = 82 Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 15

σ dno=5 AND salary>30000 AND sex=‘F’(EMPLOYEE) • Statistics r. E = 10000 records • Access paths – – b. E = 2000 disk blocks bfr. E = 5 records/block Salary: clustering index x. Salary = 3, average selection cardinality ssalary = 20 SSN: secondary index xssn =4 (sssn = 1) Dno: secondary index, xdno = 2, bi 1 dno = 4. ddno = 125 distinct values, sdno = 80 Sex: secondary index , xsex = 1. dsex = 2 distinct values, ssex = 5000 S 1. Linear search (brute force) approach CS 1 a = b; Custo = 2000 Dno = 5 : : S 6. Using a secondary (B+-tree) index: For an equality comparison, CS 6 a = x + s; Custo = 2 + 80 = 82 salary> 30000 : : S 4. Using an ordering index to retrieve multiple records: CS 4 = x + (b/2); Custo = 3+ 2000/2 = 1003 Sex=‘F’: : S 6. Using a secondary (B+-tree) index: For an equality comparison, CS 6 a = x + s; Custo = 1 + 5000 = 5001 Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 16

σ dno=5 AND salary>30000 AND sex=‘F’(EMPLOYEE) S 1. Linear search (brute force) approach CS 1 a = b; Custo = 2000 Dno = 5 : : S 6. Using a secondary (B+-tree) index: For an equality comparison, CS 6 a = x + s; Custo = 2 + 80 = 82 salary> 30000 : : S 4. Using an ordering index to retrieve multiple records: CS 4 = x + (b/2); Custo = 3+ 2000/2 = 1003 Sex=‘F’: : S 6. Using a secondary (B+-tree) index: For an equality comparison, CS 6 a = x + s; Custo = 1 + 5000 = 5001 • Optimizer chooses S 6 a on the secondary index on Dno • (dno=5) retrieves the records • Others (salary>30000 and sex=‘F’) checked for each selected record after it is retrieved into main memory Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed 17

Examples of Cost Functions for JOIN • • Join selectivity (js) js = | (R C S) | / | R x S | = | (R C S) | / (|R| * |S |) If condition C does not exist, js = 1; If no tuples from the relations satisfy condition C, js = 0; Usually, 0 <= js <= 1; Size of the result file after join operation | (R C S) | = js * |R| * |S | Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed Number of records (tuples) (r), Record size (R), Number of blocks (b) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 18

Examples of Cost Functions for JOIN (cont. ) • J 1. Nested-loop join: CJ 1 = b. R + (b. R*b. S) + ((js* |R|* |S|)/bfr. RS) (Use R for outer loop) cost of writing resulting file to disk Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed Number of records (tuples) (r), Record size (R), Number of blocks (b) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 19

Examples of Cost Functions for JOIN (cont. ) • J 2. Single-loop join (using an access structure to retrieve the matching record(s)) If an index exists for the join attribute B of S with index levels x. B, we can retrieve each record s in R and then use the index to retrieve all the matching records t from S that satisfy t[B] = s[A]. The cost depends on the type of index. [NEXT SLIDE] Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed Number of records (tuples) (r), Record size (R), Number of blocks (b) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 20

Examples of Cost Functions for JOIN (cont. ) • J 2. Single-loop join: cost depending on type of index For a secondary index, CJ 2 a = b. R + (|R| * (x. B + s. B)) + ((js* |R|* |S|)/bfr. RS); For a clustering index, CJ 2 b = b. R + (|R| * (x. B + (s. B/bfr. B))) + ((js* |R|* |S|)/bfr. RS); For a primary index, CJ 2 c = b. R + (|R| * (x. B + 1)) + ((js* |R|* |S|)/bfr. RS); If a hash key exists for one of the two join attributes — B of S CJ 2 d = b. R + (|R| * h) + ((js* |R|* |S|)/bfr. RS); - s. B is the selection cardinality for join attribute B in S Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed cost of writing Number of records (tuples) (r), Record size (R), Number of blocks (b) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 21

Examples of Cost Functions for JOIN (cont. ) • J 3. Sort-merge join: CJ 3 a = CS + b. R + b. S + ((js* |R|* |S|)/bfr. RS); (CS: Cost for sorting files) Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed Number of records (tuples) (r), Record size (R), Number of blocks (b) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 22

Join and Buffers J 1: Nested-loop (for each R retrieve S and test join condition) • • Read as many blocks as possible at a time into memory from outer loop file number of buffers available = n. B-2 (needs 1 to read the other file + 1 to result) Read one block from INNER then probe OUTER in memory – EMPLOYEE � � DEPARTMENT dno=dnumber – COST: EMPLOYEE as outer #blocks accessed for outer = b. E #blocks accessed for inner = b. D #times (n. B-2) blocks of outer are loaded = b. E/ (n. B-2) COST = #block accessed = b. E + b. D * b. E/ (n. B-2) – COST: DEPARTMENT as outer #block accesses = b. D + b. E * b. D/ (n. B-2) Number of records (tuples) (r), Record size (R), Number of blocks (b) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 23

Join and Buffers • J 1: Nested-loop (for each R retrieve S and test join condition) EMPLOYEE � � DEPARTMENT dno=dnumber n. B = 7 blocks (buffers) EMPLOYEE: r. E=6000 records in b. E=2000 disk blocks DEPARTMENT: r. D=50 records in b. D=10 disk blocks COST: EMPLOYEE as outer #block accesses = b. E + b. D * b. E/ (n. B-2) = 2000 + 10 * 2000/5 = 6000 – COST: DEPARTMENT as outer #block accesses = b. D + b. E * b. D/ (n. B-2) = 10 + 2000 * 10/5 = 4010 – – – MUITO MELHOR COLOCAR A MENOR RELAÇÃO COMO OUTER 24

Join and Selection Factor J 2: Single loop join (for each R probe S for matching values) • Join selection factor (equi-join): percentage of records in a file that will be joined with records in the other file • DEPARTMENT �mgr_ssn=ssn � EMPLOYEE • If we have indexes on both sides, the cost depends: 1. From Employee to Department b. E + r. E * (xmgr_ssn + 1) 2. From Department to Employee Number of records (tuples) (r), Record size (R), Number of blocks (b) b. D + r. D * (xssn + 1) Blocking factor (bfr) #levels (x) of each multilevel index #first-level index blocks (b. I 1) #distinct values (d) of an attribute Selectivity (sl) of an attribute Selection cardinality (s) an att. (s = sl * r) 25

Join and Selection Factor J 2: Single loop join (for each R probe S for matching values) • DEPARTMENT � � EMPLOYEE mgr_ssn=ssn – DEPARTMENT: r. D=50 recordsin b. D=10 disk blocks JSDE = 50 every department has a manager on Employee – EMPLOYEE: r. E=6000 recordsin b. E=2000 disk blocks JSED = 50/6000 only 50 employees are managers on Department – INDEX on mgr_ssn on DEPARTMENT: xmrg_ssn=2 – INDEX on ssn on EMPLOYEE: xssn = 4 – Employee probe DEPARTMENT using index mgr_ssn b. E + r. E * (xmgr_ssn + 1) = 2000 + 6000 * 3 = 20. 000 – Department probe EMPLOYEE using index ssn b. D + r. D * (xssn + 1) = 10 + 50 * 5 = 260 26

Multiple Relation Queries and Join Ordering • A query joining n relations will have n-1 join operations, and hence can have a large number of different join orders when we apply the algebraic transformation rules. • Current query optimizers typically limit the structure of a (join) query tree to that of left-deep (or right-deep) trees. • Left-deep tree: a binary tree where the right child of each non -leaf node is always a base relation. – Amenable to pipelining – Could utilize any access paths on the base relation (the right child) when executing the join. R 4 R 3 Elmasri&Navathe – Fundamentos de Banco de Dados, 4 a ed R 1 R 2 27

Example of Using Cost Functions • Schema EMPLOYEE (SSN, fname, minit, lname, bdate, address, sex, salary, super_ssn, dno) DEPARTMENT (dnumber, dname, mgr_ssn, mgr_start_date) • Statistics r. E = 10000 records r. D = 125 records • Access paths – – b. E = 2000 disk blocks b. D = 13 disk blocks bfr. E = 5 recors/block SSN: secondary index xssn =4 (sssn = 1) Dno: secondary index, xdno = 2, bi 1 dno = 4. ddno = 125 distinct values, sdno = 80 Dnumber: primary index, xdnumber = 1 level Mgr_ssn: secondary index, average selection cardinality smgr_ssn = 1, levels xmgr_ssn = 2 • Queries 1. EMPLOYEE � � DEPARTMENT js = 1/125; bfr. ED = 4 records per block dno=dnumber PARE E PENSE NAS POSSÍVEIS RESPOSTAS 28

EMPLOYEE � � DEPARTMENT dno=dnumber • J 1: CJ 1 = b. R + (b. R*b. S) + ((js* |R|* |S|)/bfr. RS) • J 2: For a secondary index, CJ 2 a = b. R + (|R| * (x. B + s. B)) + ((js* |R|* |S|)/bfr. RS); For a clustering index, CJ 2 b = b. R + (|R| * (x. B + (s. B/bfr. B))) + ((js* |R|* |S|)/bfr. RS); For a primary index, CJ 2 c = b. R + (|R| * (x. B + 1)) + ((js* |R|* |S|)/bfr. RS); If a hash key exists for one of the two join attributes — B of S CJ 2 d = b. R + (|R| * h) + ((js* |R|* |S|)/bfr. RS); 29

EMPLOYEE � � DEPARTMENT dno=dnumber • J 1 (nested loop join) with EMPLOYEE as outer loop CJ 1 = b. E + (b. E*b. D) + ((js* |E|* |D|)/bfr. ED) = 2000 + 2000*13 + (1/125 * 10000 * 125)/4 = 30500 • J 1 with DEPARTMENT as outer loop CJ 1 = b. D + (b. D*b. E) + ((js* |D|* |E|)/bfr. DE) = 13 + 13*2000 + (1/125 * 10000 *125/4) = 28513 • J 2 (single loop join) with EMPLOYEE as outer loop (dnumber as primary idx) CJ 2 c = b. E + (|E| * (xdnumber + 1)) + ((js* |E|* |D|)/bfr. ED) = 2000 + (10000*2) + (1/125 * 10000 *125/4) = 24500 • J 2 with DEPARTMENT as outer loop (dno as secondary idx) CJ 2 a = b. D + (|D| * (xdno + sdno)) + ((js* |D|* |E|)/bfr. ED) = 13 + (125*(2+80)) + (1/125 * 10000 * 125/4) = 12763 • • – – – • Statistics: r. E = 10000 records r. D = 125 records b. E = 2000 disk blocks b. D = 13 disk blocks bfr. E = 5 recors/block Access paths SSN: secondary index xssn =4 (sssn = 1) Dno: secondary index, xdno = 2, bi 1 dno = 4. ddno = 125 distinct values, sdno = 80 Dnumber: primary index, xdnumber = 1 level Mgr_ssn: secondary index, average selection cardinality smgr_ssn = 1, levels xmgr_ssn = 2 Queries EMPLOYEE dno=dnumber �� DEPARTMENT js = 1/125; bfr ED = 4 records per block 30

EMPLOYEE � � DEPARTMENT dno=dnumber • J 1 (nested loop join) with EMPLOYEE as outer loop CJ 1 = b. E + (b. E*b. D) + ((js* |E|* |D|)/bfr. ED) = 30500 • J 1 with DEPARTMENT as outer loop CJ 1 = b. D + (b. D*b. E) + ((js* |D|* |E|)/bfr. DE) = 28513 • J 2 (single loop join) with EMPLOYEE as outer loop (dnumber as primary idx) CJ 2 c = b. E + (|E| * (xdnumber + 1)) + ((js* |E|* |D|)/bfr. ED) = 24500 • J 2 with DEPARTMENT as outer loop (dno as secondary idx) CJ 2 a = b. D + (|D| * (xdno + sdno)) + ((js* |D|* |E|)/bfr. ED) = 12763 NOTE: with 15 buffers or more • DEPARTMENT fits memory (13 blocks) • 1 buffer for result • 1 for EMPLOYEE • CJ 2 = b. E + b. D + ((js* r. E* r. D)/bfr. ED) = 4513 31

Sugestões de Exercícios DCCbda/pdfs/exercicios-otimizacao. pdf RAMAKRISHNAM 3 rd Ed • 14. 4 e 14. 6 : given two relations and their statistics, calculate cost of joining them • 15. 4 1) e 2): given a relation, a sql statement, and options for indexes, define the cost of the best plan 32