External Memory Hashing

Hash Tables
• Hash function h: search key → [0…B-1].
• Buckets are blocks, numbered [0…B-1].
• Big idea: if a record with search key K exists, then it must be in bucket h(K).
  - Lookup costs one disk I/O if there is only one block per bucket.

Hash Table Lookup
• For record(s) with search key K, compute h(K); search that bucket.

Hash Table Insertion
• Put the record in bucket h(K) if it fits; otherwise create an overflow block.
  - Overflow block(s) are part of the bucket.
• Example: insert a record with search key g.
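The lookup and insertion rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: blocks are modeled as in-memory lists, each block read counts as one simulated disk I/O, and the names `StaticHashTable`, `block_capacity`, and `hash_fn` are hypothetical, not taken from the slides.

```python
class StaticHashTable:
    """B buckets; each bucket is a chain of blocks (primary block + overflow blocks)."""

    def __init__(self, num_buckets, block_capacity, hash_fn=hash):
        self.B = num_buckets
        self.block_capacity = block_capacity   # records per block (assumed fixed)
        self.hash_fn = hash_fn                 # stands in for h; reduced mod B below
        # Every bucket starts with one empty primary block.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[self.hash_fn(key) % self.B]

    def lookup(self, key):
        """Return (matching records, number of blocks read) for bucket h(key)."""
        found, blocks_read = [], 0
        for block in self._bucket(key):
            blocks_read += 1                   # one simulated disk I/O per block
            found.extend(rec for rec in block if rec[0] == key)
        return found, blocks_read

    def insert(self, key, value):
        chain = self._bucket(key)
        for block in chain:
            if len(block) < self.block_capacity:
                block.append((key, value))     # the record fits in an existing block
                return
        chain.append([(key, value)])           # all blocks full: create an overflow block


table = StaticHashTable(num_buckets=4, block_capacity=2, hash_fn=lambda k: k)
for key, value in [(1, "a"), (5, "b"), (9, "c")]:   # all three keys land in bucket 1
    table.insert(key, value)
records, ios = table.lookup(9)
```

In this run the third record does not fit in the primary block of bucket 1, so the bucket grows an overflow block and the lookup has to read two blocks instead of one.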

What if the File Grows too Large?
• Efficiency is highest if #records < #buckets × #(records/block).
• If the file grows, we need a dynamic hashing method to maintain the above relationship.
  - Extensible hashing: double the number of buckets when needed.
  - Linear hashing: add one more bucket as appropriate.

Dynamic Hashing Framework
• Hash function h produces a sequence of k bits.
• Only some of the bits are used at any time to determine placement of keys in buckets.

Extensible Hashing (Buckets may share blocks!)
• Keep parameter i = number of bits from the beginning of h(K) that determine the bucket.
• The bucket array now holds pointers to blocks.
  - A block can serve several buckets.
  - For each block, a parameter j ≤ i tells how many bits of h(K) determine membership in the block.
  - I.e., a block represents 2^(i-j) buckets that share the first j bits of their number.
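As a small illustration of the two parameters, the sketch below computes a bucket array index from the first i bits of a k-bit hash value, and the number of buckets that share a block. The function names `first_bits` and `buckets_per_block` are hypothetical, and hash values are assumed to be plain integers of at most k bits.

```python
def first_bits(h, i, k):
    """Return the first (high-order) i bits of the k-bit hash value h."""
    return h >> (k - i)


def buckets_per_block(i, j):
    """A block that uses j of the i bits is shared by 2^(i-j) buckets."""
    return 2 ** (i - j)


# With k = 4 and i = 2, the hash value 1010 selects bucket array entry 10.
entry = first_bits(0b1010, 2, 4)
```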

Example • An extensible hash table when i=1:

Extensible Hash Table Insert
• If the record with key K fits in the block B pointed to by h(K), put it there.
• If not, let this block B represent j bits.
  1. j = i:
     a. Set i := i+1.
     b. Double the bucket array, so it now has 2^i entries (with the new value of i).
     c. Let w be an old array entry. Both new entries, w0 and w1, point to the same block that w used to point to.
     d. Split B into two and distribute the records of B according to the (j+1)st bit:
        i. Set j := j+1 for both blocks.
        ii. Fix the pointers in the bucket array, so that entries that formerly pointed to B now point either to B or to the new block. How? Depending on the bit the split was made on, i.e. the (j+1)st bit for the value of j before the split.
  2. j < i:
     a. Do as in 1.d.
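The insertion procedure above can be sketched as follows. This is a simplified illustration, not a definitive implementation: records are represented only by their k-bit hash values, blocks are Python lists, and the names `Block`, `ExtensibleHashTable`, and `lookup_block` are hypothetical. The sketch also assumes that a split is repeated until the new record fits, and it gives up when a block would need more than k bits.

```python
class Block:
    def __init__(self, j):
        self.j = j           # number of leading hash bits that determine membership
        self.records = []    # hash values stored in this block


class ExtensibleHashTable:
    def __init__(self, k=4, block_capacity=2):
        self.k = k                       # bits produced by the hash function
        self.capacity = block_capacity   # records per block
        self.i = 0                       # leading bits used by the bucket array
        self.directory = [Block(0)]      # bucket array: 2^i pointers to blocks

    def lookup_block(self, h):
        return self.directory[h >> (self.k - self.i)]

    def insert(self, h):
        block = self.lookup_block(h)
        while len(block.records) >= self.capacity:
            if block.j == self.k:
                raise OverflowError("no hash bits left to split on")
            if block.j == self.i:
                # Case 1 (j = i): double the array; w0 and w1 both point where w did.
                self.i += 1
                self.directory = [b for b in self.directory for _ in (0, 1)]
            self._split(block)           # step 1.d, also used for case 2 (j < i)
            block = self.lookup_block(h)
        block.records.append(h)

    def _split(self, block):
        block.j += 1
        new_block = Block(block.j)
        shift = self.k - block.j         # position of the bit being split on
        old = block.records
        block.records = [r for r in old if not (r >> shift) & 1]
        new_block.records = [r for r in old if (r >> shift) & 1]
        # Entries that pointed to the old block and have that bit set move to the new block.
        for idx, b in enumerate(self.directory):
            if b is block and (idx >> (self.i - block.j)) & 1:
                self.directory[idx] = new_block


table = ExtensibleHashTable(k=4, block_capacity=2)
for h in (0b0001, 0b1001, 0b1100):       # assumed starting contents, for illustration only
    table.insert(h)
```

After these three insertions the sketch has i = 1, with one block for hash values starting with 0 and one for those starting with 1. The starting contents are an assumption made for this illustration.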

Example
• Insert record with h(K) = 1010.
(Figure: the table before and after the insertion.)

Example: Next
• Next: records with h(K) = 0000 and h(K) = 0111.
  - Bucket for 0... gets split,
  - but i stays at 2.
• Then: record with h(K) = 1000.
  - Overflows bucket for 10...
  - Raise i to 3.
(Figure: the table currently and after the insertions.)

Extensible Hash Tables
Advantages:
• Lookup never searches more than one data block.
  - We hope that the bucket array fits in main memory.
Defects:
• Doubling the bucket array could make it too large to fit in main memory.
• Problem with skewed key distributions.
  - E.g., let 1 block = 2 records, and suppose that three records have hash values that happen to agree in the first 20 bits.
  - In that case i would grow to at least 20, giving a bucket array of a million (2^20) or more entries, even though we have only 3 records!

Linear Hashing
• Use i bits from the right (low-order) end of h(K).
• Buckets are numbered [0…n-1], where 2^(i-1) < n ≤ 2^i.
• Let the last i bits of h(K) be m = a1 a2 … ai.
  1. If m < n, then the record belongs to bucket m.
  2. If n ≤ m < 2^i, then the record belongs to bucket m - 2^(i-1), that is, the bucket we would get if we changed a1 (which must be 1) to 0.
• The parameters are also part of the structure, e.g. i = 1, number of buckets n = 2, number of records r = 3.
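The two placement rules can be written as one small function. This is a minimal sketch that assumes hash values are non-negative integers; the name `bucket_number` is hypothetical.

```python
def bucket_number(h, i, n):
    """Bucket for hash value h, using its last i bits, when n buckets exist."""
    m = h & ((1 << i) - 1)        # m = last i bits of h
    if m < n:
        return m                  # rule 1: bucket m exists
    return m - (1 << (i - 1))     # rule 2: change the leading bit a1 from 1 to 0


# With i = 2 and n = 3, buckets 00, 01 and 10 exist but 11 does not.
in_existing_bucket = bucket_number(0b1010, 2, 3)
redirected = bucket_number(0b1011, 2, 3)
```

In the second call m = 11 is not yet a bucket number, so the record is sent to bucket 01 instead.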

Linear Hash Table Insert
• Pick an upper limit on capacity,
  - e.g., 85% (1.7 records/bucket in our example).
• If an insertion exceeds the capacity limit, set n := n + 1.
  - If the new n is 2^i + 1, set i := i + 1.
  - No change in bucket numbers is needed; just imagine a leading 0.
  - Need to split bucket n - 2^(i-1) (using the old n and the updated i), because there is now a bucket numbered (old) n.

Example
• Insert records with h(K) = 0000, 1010, 1111, 0101, 0001, 1100.
• Insert 0000:
  - Before: i=1, n=1, r=0; bucket 0: (empty).
  - After: i=1, n=1, r=1; bucket 0: 0000.

Example (continued)
• Insert 1010:
  - Before: i=1, n=1, r=1; bucket 0: 0000.
  - After: i=1, n=2, r=2; bucket 0: 0000, 1010; bucket 1: (empty).
  - Capacity limit exceeded; increment n.

Example (continued)
• Insert 1111:
  - Before: i=1, n=2, r=2; bucket 0: 0000, 1010; bucket 1: (empty).
  - After: i=1, n=2, r=3; bucket 0: 0000, 1010; bucket 1: 1111.

Example (continued)
• Insert 0101:
  - Before: i=1, n=2, r=3; bucket 0: 0000, 1010; bucket 1: 1111.
  - After: i=2, n=3, r=4; bucket 00: 0000; bucket 01: 1111, 0101; bucket 10: 1010.
  - Capacity limit exceeded; increment n, which causes incrementing i as well.

Example (continued)
• Insert 0001:
  - Before: i=2, n=3, r=4; bucket 00: 0000; bucket 01: 1111, 0101; bucket 10: 1010.
  - After: i=2, n=3, r=5; bucket 00: 0000; bucket 01: 1111, 0101, plus an overflow block holding 0001; bucket 10: 1010.
  - As long as capacity is not exceeded, we can add overflow blocks.

Example (continued)
• Insert 1100:
  - Before: i=2, n=3, r=5; bucket 00: 0000; bucket 01: 1111, 0101, plus an overflow block holding 0001; bucket 10: 1010.
  - After: i=2, n=4, r=6; bucket 00: 0000, 1100; bucket 01: 0001, 0101; bucket 10: 1010; bucket 11: 1111.
  - Capacity limit exceeded; increment n.
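The six insertions of this example can be replayed with a small simulation. It is a sketch under stated assumptions: records are represented only by their hash values, each bucket is one Python list (records beyond `block_capacity` stand for overflow blocks), and the names `LinearHashTable`, `block_capacity`, and `threshold` are hypothetical. The defaults of 2 records per block and an 85% limit follow the 1.7 records/bucket figure used in the slides.

```python
class LinearHashTable:
    def __init__(self, block_capacity=2, threshold=0.85):
        self.block_capacity = block_capacity   # records per block
        self.threshold = threshold             # upper limit on average occupancy
        self.i, self.n, self.r = 1, 1, 0       # bits used, buckets, records
        self.buckets = [[]]

    def bucket_number(self, h):
        m = h & ((1 << self.i) - 1)            # last i bits of h
        return m if m < self.n else m - (1 << (self.i - 1))

    def insert(self, h):
        self.buckets[self.bucket_number(h)].append(h)
        self.r += 1
        if self.r > self.threshold * self.block_capacity * self.n:
            self._add_bucket()                 # capacity limit exceeded

    def _add_bucket(self):
        self.n += 1
        self.buckets.append([])
        if self.n == (1 << self.i) + 1:
            self.i += 1
        new = self.n - 1                       # number of the bucket just added
        old = new - (1 << (self.i - 1))        # the bucket that must be split
        records, self.buckets[old] = self.buckets[old], []
        for h in records:                      # redistribute between old and new
            self.buckets[self.bucket_number(h)].append(h)


table = LinearHashTable()
for h in (0b0000, 0b1010, 0b1111, 0b0101, 0b0001, 0b1100):
    table.insert(h)
state = (table.i, table.n, table.r)
```

Checking the limit after the record has been placed mirrors the example, where the record is stored first and n is incremented afterwards.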

Lookup in Linear Hash Table
• For record(s) with search key K, compute h(K); search the corresponding bucket, chosen by the same rule as for insertion.
• If the record we wish to look up isn't there, it can't be anywhere else.
• E.g., with i=2, n=3, r=4, look up a key that hashes to 1010, and then a key that hashes to 1011.
  - 1010: m = 10 (binary), which is less than n, so search bucket 10.
  - 1011: m = 11 (binary), which is not less than n, so search bucket 01.

Exercise
• Suppose we want to insert keys with hash values 0000…1111 in a linear hash table with a 100% capacity threshold.
• Assume that a block can hold three records.