Optimization Recap and examples 1 Optimization introduction For

Optimization introduction • For every SQL expression, there are many possible ways of implementation.

Disk-Memory-CPU Delete from Sailors where sid=90 sailors Reserves Main Memory DISK CPU 3

Hardware Recap • The DB is kept on the Disk. • The Disk is

Hardware Recap • We assume a constant time for each Disk access, and that

SID Block 1 Block 2 Block 3 SNAME rating age RID 1923 Joe 8

SID SNAME 1923 Joe rating age 8 32 B blocks t tuples 3321 Phil

Indexes on files • An Index on a table is an additional file which

Tree index on sname of sailors ‘A’->’M’ B 1 Root block ‘N’->’Z’ B 2

Tree index • The tree is kept balanced • The tree entries are always

Tree index on sname of sailors ‘A’->’M’ B 1 ‘N’->’Z’ B 2 B 1

Hash index • Works in a similar way, but using a hash function instead

Natural Join • We want to compute Running example data SELECT * FROM Reserves

Natural Join • We want to compute SELECT * FROM Reserves R, Sailors S

Block Nested Loop Join • Suppose there are B available blocks in the memory,

Index Nested Loop • Suppose there is an index on sid of Sailors •

• Q: So when would we typically choose to use an index-nested loop

Sort-Merge Join • Sort both relations on the join column • Join them according

Run time of Sort-Merge • M, N: number of blocks of the relations •

Question Suppose: tuple size= 100 bytes number of tuples (employees)=3, 000 Page size=1000 bytes

Question 2 SELECT E. ename FROM Employees E, Departments D WHERE E. eid=D. eid

Slides: 22

Download presentation

Optimization Recap and examples 1

Optimization introduction • For every SQL expression, there are many possible ways of implementation. • The different alternatives could result in huge run-time differences. • Our aim is to introduce the basic hardware used, and optimization principles 2

Disk-Memory-CPU Delete from Sailors where sid=90 sailors Reserves Main Memory DISK CPU 3

Hardware Recap • The DB is kept on the Disk. • The Disk is divided into BLOCKS • Any processing of the information occurs in the Main Memory. • Therefore, a block which we want to access has to be brought from the Disk to the memory, and perhaps written back. • Blocks are read/written from/to the Disk as single units. • The time of reading/writing a block to/from the disk is an I/O operation, and takes a lot of time. 4

Hardware Recap • We assume a constant time for each Disk access, and that only disk access define the run time. • We do not consider writing to the disk • Every table in the DB is stored as a File (on the Disk), which is a ‘bunch of Blocks’. • We will deal with files that are ‘heap-sorted’, i. e. , there is no order in the file tuples • Every block contains many tuples, each of them has a Record ID (RID), which states its location: (number of block, number of tuple within the block) 5

SID Block 1 Block 2 Block 3 SNAME rating age RID 1923 Joe 8 32 (1, 1) 3321 Phil 9 41 (1, 2) 1332 Boe 7 33 (1, 3) 1226 Bill 6 23 (2, 1) 1444 Paul 1 21 (2, 2) 1112 Jim 3 33 (2, 3) 1445 Vicky 9 54 (3, 1) 6

SID SNAME 1923 Joe rating age 8 32 B blocks t tuples 3321 Phil 9 41 1332 Boe 7 33 1226 Bill 6 23 Select * from sailors where sname=‘Jim’ 1444 Paul 1 21 Select * from sailors 1112 Jim 3 33 1445 Vicky 9 54 Q: What would be the cost of the following queries? Select * from sailors where rating>4 Answer: B 7

Indexes on files • An Index on a table is an additional file which helps access the data fast. • An index holds ‘data entries’ to the table file • The index can have the structure of a B+ Tree, or a hash function. 8

Tree index on sname of sailors ‘A’->’M’ B 1 Root block ‘N’->’Z’ B 2 B 1 ‘A’->’G’ L 1 B 2 ‘N’->’T’ L 3 ‘H’->’M’ L 2 ‘U’->’Z’ L 4 Branch blocks L 1 ‘Bill’(2, 1) L 2 ‘Jim’(2, 3) L 3 ‘Paul’ (2, 2) L 4 ‘Vicky’ (3, 1) ‘Boe’(1, 3) ‘Joe’(1, 1) ‘Phil’ (1, 2) … SID SNAME rating age 1923 Joe 8 32 3321 Phil 9 41 1332 Boe 7 33 1226 Bill 6 23 1444 Paul 1 21 1112 Jim 3 33 1445 Vicky 9 54 Leaf blocks 9

Tree index • The tree is kept balanced • The tree entries are always ordered • The leaves point to the exact location of tuples • Getting to the leaf is typically 2 -3 I/O • Each leaf points to the next/previous leaf • A Clustered index means that the index and the table are ordered by the same attribute 10

Tree index on sname of sailors ‘A’->’M’ B 1 ‘N’->’Z’ B 2 B 1 ‘A’->’G’ L 1 B 2 ‘N’->’T’ L 3 ‘H’->’M’ L 2 ‘U’->’Z’ L 4 Notice: index is not clustered L 1 ‘Bill’(2, 1) L 2 ‘Jim’(2, 3) L 3 ‘Paul’ (2, 2) L 4 ‘Vicky’ (3, 1) ‘Boe’(1, 3) ‘Joe’(1, 1) ‘Phil’ (1, 2) … How would the following queries be processed? Select * from sailors where sname=‘Jim’ Select * from sailors where sname>’J’ SID SNAME rating age 1923 Joe 8 32 3321 Phil 9 41 1332 Boe 7 33 1226 Bill 6 23 1444 Paul 1 21 1112 Jim 3 33 1445 Vicky 9 54 11

Tree index on sname of sailors ‘A’->’M’ B 1 ‘N’->’Z’ B 2 B 1 ‘A’->’G’ L 1 B 2 ‘N’->’T’ L 3 ‘H’->’M’ L 2 ‘U’->’Z’ L 4 Notice: index is clustered L 1 ‘Bill’(1, 1) L 2 ‘Jim’(1, 3) L 3 ‘Paul’ (2, 2) L 4 ‘Vicky’ (3, 1) ‘Boe’(1, 2) ‘Joe’(2, 1) ‘Phil’ (2, 3) … How would the following queries be processed? Select * from sailors where sname=‘Jim’ Select * from sailors where sname>’J’ SID SNAME rating age 1226 Bill 6 23 1332 Boe 7 33 1112 Jim 3 33 1923 Joe 8 32 1444 Paul 1 21 3321 Phil 9 41 1445 Vicky 9 54 12

Hash index • Works in a similar way, but using a hash function instead of a tree • Works only for equality conditions • Average of 1. 2 I/O to get to the tuple location 13

Natural Join • We want to compute Running example data SELECT * FROM Reserves R, Sailors S WHERE R. sid = S. sid • t. R=5000 t. S=10, 000 50 tuples per block 12 buffer pages Naïve algorithm: Foreach tuple r in R Foreach tuple s in S if r. sid=s. sid add r, s to result Cost: BR+t. R*BS = 100+5000*200=1, 000, 100 14

Natural Join • We want to compute SELECT * FROM Reserves R, Sailors S WHERE R. sid = S. sid • We have 4 optional algorithms: 1. Block Nested Loops Join 2. Index Nested Loops Join 3. Sort Merge Join 4. Hash Join This is assuming there is not enough space in the memory for the smaller of the 2 relations+2 15

Block Nested Loop Join • Suppose there are B available blocks in the memory, BR blocks of relation R, and BS blocks of relations S, and BR<BS • Until all blocks of R have been read: – Read B-2 blocks of R – Read all blocks of S (one by one), and write the result • Run time: BR + BS * ceil(BR /(B-2)) = 100+200*100/10=2, 100 16

Index Nested Loop • Suppose there is an index on sid of Sailors • Until all blocks of R have been read: – Read a block of R – For each tuple in the block, use the index of S to locate the matching tuples in S. • We mark the time it takes to read the tuples in S that match a single tuple in R as X. • Run time: BR + t. R*X • If the index is clustered, X=2 -4 • If it is not clustered, we evaluate X. = 100+5000*3=15, 100 17

• Q: So when would we typically choose to use an index-nested loop over blocknested? • A: Look at the inequality… 18

Sort-Merge Join • Sort both relations on the join column • Join them according to the join algorithm: sid sname rating age sid bid day agent 22 dustin 7 45 28 103 12/4/96 Joe 28 yuppy 9 35 28 103 11/3/96 Frank 31 lubber 8 55 36 lubber 6 36 31 10/2/96 Joe 44 guppy 5 35 31 102 12/7/96 Sam 58 rusty 10 35 31 101 13/7/96 Sam 58 103 2/6/96 Frank 19

Run time of Sort-Merge • M, N: number of blocks of the relations • Sorting: Mlog. M+Nlog. N • Merging: N+M if no partition is scanned twice. • Total: Mlog. M+Nlog. N+N+M • Especially good if one or both of the relations are already sorted. = 100*7+200*8+100+200=2, 600 20

Question Suppose: tuple size= 100 bytes number of tuples (employees)=3, 000 Page size=1000 bytes You have an unclustered index on Hobby. You know that 50 employees collect stamps. Would you use the index? And for 1, 000 stamp-lovers? SELECT E. dno FROM Employees E WHERE E. hobby=‘stamps’ 21

Question 2 SELECT E. ename FROM Employees E, Departments D WHERE E. eid=D. eid • Length of tuples, Number of tuples – Emp: 20 bytes, 20, 000 tuples – Dept: 40 bytes, 5000 tuples • Pages contain 4000 bytes; 12 buffer pages • Which algorithm would you use if there is an unclustered tree index on E. eid? And clustered? 22