Carnegie Mellon Univ Dept of Computer Science 15

  • Slides: 57
Download presentation
Carnegie Mellon Univ. Dept. of Computer Science 15 -415 - Database Applications C. Faloutsos

Carnegie Mellon Univ. Dept. of Computer Science 15 -415 - Database Applications C. Faloutsos Indexing and Hashing – part II Carnegie Mellon 15 -415 - C. Faloutsos

General Overview - rel. model • Relational model - SQL – Formal & commercial

General Overview - rel. model • Relational model - SQL – Formal & commercial query languages • • Functional Dependencies Normalization Physical Design Indexing Carnegie Mellon 15 -415 - C. Faloutsos 2

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: – dynamic hashing – multi-attribute indexing Carnegie Mellon 15 -415 - C. Faloutsos 3

(Static) Hashing Problem: “find EMP record with ssn=123” What if disk space was free,

(Static) Hashing Problem: “find EMP record with ssn=123” What if disk space was free, and time was at premium? Carnegie Mellon 15 -415 - C. Faloutsos 4

Hashing A: Brilliant idea: key-to-address transformation: #0 page 123; Smith; Main str #123 page

Hashing A: Brilliant idea: key-to-address transformation: #0 page 123; Smith; Main str #123 page #999, 999 Carnegie Mellon 15 -415 - C. Faloutsos 5

Hashing Since space is NOT free: • use M, instead of 999, 999 slots

Hashing Since space is NOT free: • use M, instead of 999, 999 slots • hash function: h(key) = slot-id #0 page 123; Smith; Main str #123 page #999, 999 Carnegie Mellon 15 -415 - C. Faloutsos 6

Hashing Typically: each hash bucket is a page, holding many records: #0 page 123;

Hashing Typically: each hash bucket is a page, holding many records: #0 page 123; Smith; Main str #h(123) M Carnegie Mellon 15 -415 - C. Faloutsos 7

Hashing Notice: could have clustering, or non-clustering versions: #0 page 123; Smith; Main str.

Hashing Notice: could have clustering, or non-clustering versions: #0 page 123; Smith; Main str. #h(123) M Carnegie Mellon 15 -415 - C. Faloutsos 8

Hashing Notice: could have clustering, or non-clustering versions: EMP file. . . #0 page

Hashing Notice: could have clustering, or non-clustering versions: EMP file. . . #0 page . . . #h(123) 234; Johnson; Forbes ave 123; Smith; Main str. . M 345; Tompson; Fifth ave Carnegie Mellon 15 -415 - C. Faloutsos . . . 9

Indexing- overview • ISAM and B-trees • hashing – hashing functions – size of

Indexing- overview • ISAM and B-trees • hashing – hashing functions – size of hash table – collision resolution • Hashing vs B-trees • Indices in SQL • Advanced topics: Carnegie Mellon 15 -415 - C. Faloutsos 10

Design decisions 1) formula h() for hashing function 2) size of hash table M

Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method Carnegie Mellon 15 -415 - C. Faloutsos 11

Design decisions - functions • Goal: uniform spread of keys over hash buckets •

Design decisions - functions • Goal: uniform spread of keys over hash buckets • Popular choices: – Division hashing – Multiplication hashing Carnegie Mellon 15 -415 - C. Faloutsos 12

Division hashing h(x) = (a*x+b) mod M • eg. , h(ssn) = (ssn) mod

Division hashing h(x) = (a*x+b) mod M • eg. , h(ssn) = (ssn) mod 1, 000 – gives the last three digits of ssn • M: size of hash table - choose a prime number, defensively (why? ) Carnegie Mellon 15 -415 - C. Faloutsos 13

Division hashing • eg. , M=2; hash on driver-license number (dln), where last digit

Division hashing • eg. , M=2; hash on driver-license number (dln), where last digit is ‘gender’ (0/1 = M/F) • in an army unit with predominantly male soldiers • Thus: avoid cases where M and keys have common divisors - prime M guards against that! Carnegie Mellon 15 -415 - C. Faloutsos 14

Multiplication hashing h(x) = [ fractional-part-of ( x * φ ) ] * M

Multiplication hashing h(x) = [ fractional-part-of ( x * φ ) ] * M • φ: golden ratio ( 0. 618. . . = ( sqrt(5)-1)/2 ) • in general, we need an irrational number • advantage: M need not be a prime number • but φ must be irrational Carnegie Mellon 15 -415 - C. Faloutsos 15

Other hashing functions • quadratic hashing (bad) • . . . • conclusion: use

Other hashing functions • quadratic hashing (bad) • . . . • conclusion: use division hashing Carnegie Mellon 15 -415 - C. Faloutsos 16

Design decisions 1) formula h() for hashing function 2) size of hash table M

Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method Carnegie Mellon 15 -415 - C. Faloutsos 17

Size of hash table • eg. , 50, 000 employees, 10 employee-records / page

Size of hash table • eg. , 50, 000 employees, 10 employee-records / page • Q: M=? ? pages/buckets/slots Carnegie Mellon 15 -415 - C. Faloutsos 18

Size of hash table • eg. , 50, 000 employees, 10 employees/page • Q:

Size of hash table • eg. , 50, 000 employees, 10 employees/page • Q: M=? ? pages/buckets/slots • A: utilization ~ 90% and – M: prime number Eg. , in our case: M= closest prime to 50, 000/10 / 0. 9 = 5, 555 Carnegie Mellon 15 -415 - C. Faloutsos 19

Design decisions 1) formula h() for hashing function 2) size of hash table M

Design decisions 1) formula h() for hashing function 2) size of hash table M 3) collision resolution method Carnegie Mellon 15 -415 - C. Faloutsos 20

Collision resolution • Q: what is a ‘collision’? • A: ? ? Carnegie Mellon

Collision resolution • Q: what is a ‘collision’? • A: ? ? Carnegie Mellon 15 -415 - C. Faloutsos 21

Collision resolution #0 page #h(123) 123; Smith; Main str. M Carnegie Mellon 15 -415

Collision resolution #0 page #h(123) 123; Smith; Main str. M Carnegie Mellon 15 -415 - C. Faloutsos 22

Collision resolution • Q: what is a ‘collision’? • A: ? ? • Q:

Collision resolution • Q: what is a ‘collision’? • A: ? ? • Q: why worry about collisions/overflows? (recall that buckets are ~90% full) • A: ‘birthday paradox’ Carnegie Mellon 15 -415 - C. Faloutsos 23

Collision resolution • open addressing – linear probing (ie. , put to next slot/bucket)

Collision resolution • open addressing – linear probing (ie. , put to next slot/bucket) – re-hashing • separate chaining (ie. , put links to overflow pages) Carnegie Mellon 15 -415 - C. Faloutsos 24

Collision resolution linear probing: #0 page #h(123) 123; Smith; Main str. M Carnegie Mellon

Collision resolution linear probing: #0 page #h(123) 123; Smith; Main str. M Carnegie Mellon 15 -415 - C. Faloutsos 25

Collision resolution re-hashing #0 page h 1() #h(123) 123; Smith; Main str. h 2()

Collision resolution re-hashing #0 page h 1() #h(123) 123; Smith; Main str. h 2() M Carnegie Mellon 15 -415 - C. Faloutsos 26

Collision resolution separate chaining 123; Smith; Main str. Carnegie Mellon 15 -415 - C.

Collision resolution separate chaining 123; Smith; Main str. Carnegie Mellon 15 -415 - C. Faloutsos 27

Design decisions - conclusions • function: division hashing – h(x) = ( a*x+b )

Design decisions - conclusions • function: division hashing – h(x) = ( a*x+b ) mod M • size M: ~90% util. ; prime number. • collision resolution: separate chaining – easier to implement (deletions!); – no danger of becoming full Carnegie Mellon 15 -415 - C. Faloutsos 28

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: – dynamic hashing – multi-attribute indexing Carnegie Mellon 15 -415 - C. Faloutsos 29

Hashing vs B-trees: Hashing offers • speed ! ( O(1) avg. search time). .

Hashing vs B-trees: Hashing offers • speed ! ( O(1) avg. search time). . but: Carnegie Mellon 15 -415 - C. Faloutsos 30

Hashing vs B-trees: . . but B-trees give: • key ordering: – range queries

Hashing vs B-trees: . . but B-trees give: • key ordering: – range queries – proximity queries – sequential scan • O(log(N)) guarantees for search, ins. /del. • graceful growing/shrinking Carnegie Mellon 15 -415 - C. Faloutsos 31

Hashing vs B-trees: thus: • B-trees are implemented in most systems footnotes: • hashing

Hashing vs B-trees: thus: • B-trees are implemented in most systems footnotes: • hashing is not (why not? ) • ‘dbm’ and ‘ndbm’ of UNIX: offer one or both Carnegie Mellon 15 -415 - C. Faloutsos 32

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: – dynamic hashing – multi-attribute indexing Carnegie Mellon 15 -415 - C. Faloutsos 33

Indexing in SQL • create index <index-name> on <relationname> (<attribute-list>) • create unique index

Indexing in SQL • create index <index-name> on <relationname> (<attribute-list>) • create unique index <index-name> on <relation-name> (<attribute-list>) • drop index <index-name> Carnegie Mellon 15 -415 - C. Faloutsos 34

Indexing in SQL • eg. , create index ssn-index on STUDENT (ssn) • or

Indexing in SQL • eg. , create index ssn-index on STUDENT (ssn) • or (eg. , on TAKES(ssn, cid, grade) ): create index sc-index on TAKES (ssn, c-id) Carnegie Mellon 15 -415 - C. Faloutsos 35

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: (theoretical interest) – dynamic hashing – multi-attribute indexing Carnegie Mellon 15 -415 - C. Faloutsos 36

Problem with static hashing • problem: overflow? • problem: underflow? (underutilization) Carnegie Mellon 15

Problem with static hashing • problem: overflow? • problem: underflow? (underutilization) Carnegie Mellon 15 -415 - C. Faloutsos 37

Solution: Dynamic/extendible hashing • idea: shrink / expand hash table on demand. . •

Solution: Dynamic/extendible hashing • idea: shrink / expand hash table on demand. . • . . dynamic hashing Details: how to grow gracefully, on overflow? Many solutions - One of them: ‘extendible hashing’ Carnegie Mellon 15 -415 - C. Faloutsos 38

Extendible hashing #0 page #h(123) 123; Smith; Main str. M Carnegie Mellon 15 -415

Extendible hashing #0 page #h(123) 123; Smith; Main str. M Carnegie Mellon 15 -415 - C. Faloutsos 39

Extendible hashing #0 page solution: split the bucket in two #h(123) 123; Smith; Main

Extendible hashing #0 page solution: split the bucket in two #h(123) 123; Smith; Main str. M Carnegie Mellon 15 -415 - C. Faloutsos 40

Extendible hashing in detail: • keep a directory, with ptrs to hash-buckets • Q:

Extendible hashing in detail: • keep a directory, with ptrs to hash-buckets • Q: how to divide contents of bucket in two? • A: hash each key into a very long bit string; keep only as many bits as needed Eventually: Carnegie Mellon 15 -415 - C. Faloutsos 41

Extendible hashing directory 00. . . 01. . . 10. . . 11. .

Extendible hashing directory 00. . . 01. . . 10. . . 11. . . 0001. . . 0111. . . 10101. . . 10011. . . 10110. . . 1101. . . 101001. . . Carnegie Mellon 15 -415 - C. Faloutsos 42

Extendible hashing directory 00. . . 01. . . 10. . . 11. .

Extendible hashing directory 00. . . 01. . . 10. . . 11. . . 0001. . . 0111. . . 10101. . . 10011. . . 10110. . . 1101. . . 101001. . . Carnegie Mellon 15 -415 - C. Faloutsos 43

Extendible hashing directory 00. . . 01. . . 10. . . 11. .

Extendible hashing directory 00. . . 01. . . 10. . . 11. . . 0001. . . 0111. . . 10101. . . 10011. . . 10110. . . 101001. . . split on 3 -rd bit 1101. . . Carnegie Mellon 15 -415 - C. Faloutsos 44

Extendible hashing directory 00. . . 01. . . 10. . . 0001. .

Extendible hashing directory 00. . . 01. . . 10. . . 0001. . . 0111. . . new page / bucket 10011. . . 10101. . . 101001. . . 10110. . . 1101. . . Carnegie Mellon 15 -415 - C. Faloutsos 45

Extendible hashing directory (doubled) 000. . . 001. . . 010. . . 0001.

Extendible hashing directory (doubled) 000. . . 001. . . 010. . . 0001. . . 0111. . . new page / bucket 10011. . . 100. . . 101. . . 10101. . . 101001. . . 10110. . . 1101. . . 110. . . 111. . . Carnegie Mellon 15 -415 - C. Faloutsos 46

Extendible hashing BEFORE AFTER 0001. . . 0111. . . 00. . . 01.

Extendible hashing BEFORE AFTER 0001. . . 0111. . . 00. . . 01. . . 0001. . . 0111. . . 000. . . 001. . . 010. . . 10101. . . 10. . . 11. . . 10011. . . 10110. . . 101001. . . 10011. . . 1101. . . 10101. . . 101001. . . 10110. . . 011. . . 100. . . 101. . . 110. . . 111. . . Carnegie Mellon 15 -415 - C. Faloutsos 47

Extendible hashing • Summary: directory doubles on demand • or halves, on shrinking files

Extendible hashing • Summary: directory doubles on demand • or halves, on shrinking files • needs ‘local’ and ‘global’ depth (see book) • Mainly, of theoretical interest - same for – ‘linear hashing’ of Litwin – ‘order preserving’ – ‘perfect hashing’ (no collisions!) Carnegie Mellon 15 -415 - C. Faloutsos 48

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: – dynamic hashing – multi-attribute indexing Carnegie Mellon 15 -415 - C. Faloutsos 49

multiple-key access • how to support queries on multiple attributes, like – grade>=3 and

multiple-key access • how to support queries on multiple attributes, like – grade>=3 and course=‘ 415’ • major motivation: Geographic Information systems (GIS) Carnegie Mellon 15 -415 - C. Faloutsos 50

multiple-key access y Carnegie Mellon 15 -415 - C. Faloutsos x 51

multiple-key access y Carnegie Mellon 15 -415 - C. Faloutsos x 51

multiple-key access Typical query: • Find cities within x miles from Pittsburgh thus, we

multiple-key access Typical query: • Find cities within x miles from Pittsburgh thus, we want to store nearby cities on the same disk page: Carnegie Mellon 15 -415 - C. Faloutsos 52

multiple-key access y Carnegie Mellon 15 -415 - C. Faloutsos x 53

multiple-key access y Carnegie Mellon 15 -415 - C. Faloutsos x 53

multiple-key access y Carnegie Mellon 15 -415 - C. Faloutsos x 54

multiple-key access y Carnegie Mellon 15 -415 - C. Faloutsos x 54

multiple-key access - R-trees y Carnegie Mellon 15 -415 - C. Faloutsos x 55

multiple-key access - R-trees y Carnegie Mellon 15 -415 - C. Faloutsos x 55

multiple-key access - R-trees • R-trees: very successful for GIS • (along with ‘z-ordering’)

multiple-key access - R-trees • R-trees: very successful for GIS • (along with ‘z-ordering’) • more details: at ‘advanced topics’, later • even more details: in 15 -826 Carnegie Mellon 15 -415 - C. Faloutsos 56

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in

Indexing- overview • • • ISAM and B-trees hashing Hashing vs B-trees Indices in SQL Advanced topics: industry workhorse – dynamic hashing – multi-attribute indexing Carnegie Mellon 15 -415 - C. Faloutsos 57