Introduction to Indexes Rui Zhang http www csse

  • Slides: 30
Download presentation
Introduction to Indexes Rui Zhang http: //www. csse. unimelb. edu. au/~rui The University of

Introduction to Indexes Rui Zhang http: //www. csse. unimelb. edu. au/~rui The University of Melbourne Aug 2006

Copyright l Many contents are from Database Management Systems, 3 rd edition. Raghu Ramakrishnan

Copyright l Many contents are from Database Management Systems, 3 rd edition. Raghu Ramakrishnan & Johannes Gehrke Mc. Graw-Hill, 2000. l Some slides are from Prof. Beng Chin Ooi’s lecture notes for CS 5226 in NUS l Some are from the internet 2

Outline l Cost Model and File Organization l l l Indexes : General Concepts

Outline l Cost Model and File Organization l l l Indexes : General Concepts and Properties l l l Queries, keys and search keys Simple index files Properties of indexes Indexed sequential access method (ISAM) B+-tree l l l The memory hierarchy Magnetic disk Cost model Heap files The concept of index Structure Search Insertion Deletion Other basic indexes l l Hash tables Bitmap 3

The Memory Hierarchy Intel PIII CPU: 450 MHz Memory: 512 MB CPU Die CPU

The Memory Hierarchy Intel PIII CPU: 450 MHz Memory: 512 MB CPU Die CPU Registers L 1 Cache 40 ns L 2 Cache 90 ns Main Memory 14 ms = 14000000 ns Cost is affected significantly by disk accesses ! Hard Disk 4

The Hard (Magnetic) disk l The time for a disk block access, or disk

The Hard (Magnetic) disk l The time for a disk block access, or disk page access or disk I/O access time = seek time + rotational delay + transfer time l IBM Deskstar 14 GPX: 14. 4 GB l l l Seek time: 9. 1 msec Rotational delay: 4. 17 msec Transfer rate: 13 MB/sec, that is, 0. 3 msec/4 KB 5

The Cost Model l Cost measure: number of page accesses l Objective l l

The Cost Model l Cost measure: number of page accesses l Objective l l Reason l l l A simple way to estimate the cost (in terms of execution time) of database operations Page access cost is usually the dominant cost of database operations An accurate model is too complex for analyzing algorithms Note l l This cost model is for disk based databases; NOT applicable to main memory databases Blocked access: sequential scan of the database 6

Basic File Organization : Heap Files l File : a logical collection of data,

Basic File Organization : Heap Files l File : a logical collection of data, physically stored as a set of pages. l Heap File (Unordered File) l l l Linked list of pages The DBMS maintains the header page: <heap_file_name, header_page_address> Operations l l l Insertion Deletion Search Advantage: Simple Disadvantage: Inefficient Data Pages with free space Data Page Full pages Header Page 7

The Concept of Index l Searching a Book… Where are books by Rui ?

The Concept of Index l Searching a Book… Where are books by Rui ? Index by Author Database by Rao Where is “Database” ? l Programming by Alistair Index by Title Catalogs Books Index l l Database by Rui A data structure that helps us find data quickly Note l l Can be a separate structure (we may call it the index file) or in the records themselves. Usually sorted on some attribute 8

Query, Key and Search Key l Queries l Exact match (point query) l l

Query, Key and Search Key l Queries l Exact match (point query) l l Range query l l Q 2: Find me the books published between year 2003 -2005 Searching methods l l Sequential scan - too expensive Through index – if records are sorted on some attribute, we may do a binary search l l l Q 1: Find me the book with the name “Database” If sorted on “book name”, then we can do binary search for Q 1 If sorted on “year published”, then we can do binary search for Q 2 Key vs. Search key l l Key: the indexed attribute Search key: the attribute queried on 9

Simple Index File (Clustered, Dense) Dense index 10 20 30 40 50 60 70

Simple Index File (Clustered, Dense) Dense index 10 20 30 40 50 60 70 80 90 100 110 120 Sorted records 10 20 30 40 50 60 70 80 90 10

Simple Index File (Clustered, Sparse) Sparse index 10 30 50 70 90 110 130

Simple Index File (Clustered, Sparse) Sparse index 10 30 50 70 90 110 130 150 170 190 210 230 Sorted records 10 20 30 40 50 60 70 80 90 100 11

Simple Index File (Clustered, Multi-level) Sparse 2 nd level 10 90 170 250 330

Simple Index File (Clustered, Multi-level) Sparse 2 nd level 10 90 170 250 330 410 490 570 10 30 50 70 90 110 130 150 170 190 210 230 Sorted records 10 20 30 40 50 60 70 80 90 100 12

Simple Index File (Unclustered, Dense) Dense index 10 50 90. . . sparse high

Simple Index File (Unclustered, Dense) Dense index 10 50 90. . . sparse high level Unsorted records 10 20 30 40 30 50 50 60 70. . . 80 40 20 70 10 90 60 13

Simple Index File (Unclustered, Sparse ? ) Unsorted records Sparse index 30 20 80

Simple Index File (Unclustered, Sparse ? ) Unsorted records Sparse index 30 20 80 100 30 50 20 70 90. . . 80 40 does not make sense! 100 10 90 60 14

Properties of Indexes l Alternatives for entries in an index l l l An

Properties of Indexes l Alternatives for entries in an index l l l An actual data record A pair (key, ptr) A pair (key, ptr-list) l Clustered vs. Unclustered indexes l Dense vs. Sparse indexes l l l Dense index on clustered or unclustered attributes Sparse index only on clustered attribute Primary vs. Secondary indexes l l Primary index on clustered attribute Secondary index on unclustered attribute 15

Indexes on Composite Keys l l l Q 3: age=20 & sal=10 Index on

Indexes on Composite Keys l l l Q 3: age=20 & sal=10 Index on two or more attributes: entries are sorted first on the first attribute, then on the second attribute, the third … Q 4: age=20 & sal>10 Examples of composite key indexes using lexicographic order. 11, 80 11 12, 10 12 12, 20 13, 75 <age, sal> 10, 12 l Q 5: sal=10 & age>20 20, 12 75, 13 l Note l Different indexes are useful for different queries 80, 11 <sal, age> name age sal bob 12 10 cal 11 80 joe 12 20 sue 13 75 Data records sorted by name 12 13 <age> 10 20 75 80 <sal> 16

Indexed sequential access method (ISAM) l l Tree structured index Support queries l l

Indexed sequential access method (ISAM) l l Tree structured index Support queries l l l Point queries Range queries Problems l Static: inefficient for insertions and deletions Data pages Overflow page 17

The B+-Tree: A Dynamic Index Structure l l Grows and shrinks dynamically Minimum 50%

The B+-Tree: A Dynamic Index Structure l l Grows and shrinks dynamically Minimum 50% occupancy (except for root). l l Height-balanced l l Each node contains d <= m <= 2 d entries. The parameter d is called the order of the tree. Insert/delete at logf N cost (f = fanout, N = No. leaf pages) Pointers to sibling pages l Non-leaf pages (internal pages) P 0 K 1 P 1 K 2 P 2 K 3 l … Km P m Leaf pages l l If directory page, same as non-leaf pages; pointers point to data page addresses If data page K 0 D 0 K 1 D 1 K 2 D 2 … Km D m 18

Searching in a B+-Tree l Search begins at root, and key comparisons direct it

Searching in a B+-Tree l Search begins at root, and key comparisons direct it to a leaf (as in ISAM) l Search for 5, 15, all data entries >= 24. . . l What about all entries <= 24 ? Root 13 2 3 5 14 16 17 19 20 24 22 30 24 27 29 33 34 38 39 19

Insertion in a B+-Tree l l Find correct leaf L Put data entry onto

Insertion in a B+-Tree l l Find correct leaf L Put data entry onto L l l l This can happen recursively l l If L has enough space, done! Else, must split L (into L and a new node L 2) l Redistribute entries evenly, copy up middle key. l Insert index entry pointing to L 2 into parent of L. To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits. ) Splits “grow” tree; root split increases height. l Tree growth: gets wider or one level taller at top. 20

Inserting 7 & 8 into the B+-Tree Root 13 2 l 3 5 7

Inserting 7 & 8 into the B+-Tree Root 13 2 l 3 5 7 14 16 17 19 20 24 22 30 24 27 29 33 34 38 39 (Note that 5 is copied up and continues to appear in the leaf. ) Observe how minimum occupancy is guaranteed in both leaf and index pg splits 13 5 2 3 5 7 17 24 30 8 21

Inserting 8 into the B+-Tree (continued) l (17 is pushed up and only appears

Inserting 8 into the B+-Tree (continued) l (17 is pushed up and only appears once in the index. Contrast this with a leaf split. ) Note the difference between copy up and push up 17 5 13 24 13 5 2 3 30 5 7 17 24 30 8 22

The B+-Tree After Inserting 8 Root 17 5 2 3 24 13 5 7

The B+-Tree After Inserting 8 Root 17 5 2 3 24 13 5 7 8 14 16 19 20 22 30 24 27 29 33 34 38 39 l Note that root was split, leading to increase in height l We can avoid splitting by re-distributing entries. However, this is usually not done in practice. Why? 23

Deletion in a B+-Tree l l Start at root, find leaf L where the

Deletion in a B+-Tree l l Start at root, find leaf L where the entry belongs. Remove the entry. l l If L is at least half-full, done! If L has only d-1 entries, Try to redistribute, borrowing from sibling (adjacent node with same parent as L). l If redistribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height. l l l 24

Deleting 19 & 20 from the B+-Tree Root 17 5 2 3 24 13

Deleting 19 & 20 from the B+-Tree Root 17 5 2 3 24 13 5 7 8 19 20 22 14 16 30 24 27 29 33 34 38 39 Root 17 5 2 3 l 27 13 5 7 8 14 16 22 24 30 27 29 33 34 38 39 Deleting 20 is done with re-distribution. Note how the middle key is copied up 25

Deleting 24 from the B+-Tree l l l Merge happens Observe ‘toss’ of index

Deleting 24 from the B+-Tree l l l Merge happens Observe ‘toss’ of index entry (on right), and ‘pull down’ of index entry (below). Note the decrease in height 30 22 27 29 33 34 27 29 38 39 Root 5 2 3 5 7 8 13 14 16 17 30 22 33 34 38 39 26

Summary of the B+-Tree l A dynamic structure – height balanced – robustness l

Summary of the B+-Tree l A dynamic structure – height balanced – robustness l Scalability l Typical order: 100. Typical fill-factor: 67%. l l Typical capacities (root at Level 1, and has 133 entries) l l l average fanout = 133 Level 5: 1334 = 312, 900, 700 records Level 4: 1333 = 2, 352, 637 records Can often hold top levels in buffer pool: l l l Level 1 = 1 page = 8 Kbytes Level 2 = 133 pages = 1 Mbyte Level 3 = 17, 689 pages = 133 Mbytes l Efficient point and range queries – performance l Concurrency Essential properties of a DBMS 27

Other Basic Indexes: Hash Tables l Static hashing: static table size, with overflow chains

Other Basic Indexes: Hash Tables l Static hashing: static table size, with overflow chains 0 . . . (1 + ) M-1 l l Extendible hashing: dynamic table size, with no overflows Efficient point query, but inefficient range query 28

Other Basic Indexes: Bitmap l Bitmap with bit position to indicate the presence of

Other Basic Indexes: Bitmap l Bitmap with bit position to indicate the presence of a value gender l l rating M F cusid name gender rating 12345 1 0 112 Joe M 3 00100 1 0 115 Ram M 5 00001 0 1 119 Sue F 5 00001 1 0 120 Woo M 4 00010 Advantages l Efficient bit-wise operations l Efficient for aggregation queries: counting bits with 1’s l More compact than trees-- amenable to the use of compression techniques Limitations l Only good for domain of small cardinality 29

B+-Tree For All and Forever? l Can the B+-tree being a single-dimensional index be

B+-Tree For All and Forever? l Can the B+-tree being a single-dimensional index be used for emerging applications such as: l l l l Spatial databases High-dimensional databases Temporal databases Main memory databases String databases Genomic/sequence databases. . . Coming next … Indexing Multidimensional Data 30