
Subject Name: File Structures Subject Code: 10 IS 63 Engineered for Tomorrow

Chapter 8 Extendable Hashing Prepared By: Swetha G Department: Information Science & Engg Date : 13 / 05 / 2015

Extendible Hashing
• Hashing, when there is no overflow, gives the best performance: a record can be accessed with a single seek.
• With dynamic files, however, the file must adjust itself to avoid overflow by increasing its address space.
• One solution is extendible hashing.
• Extendible hashing combines tries with hashing.
• A trie is a search tree that branches on successive digits of the key, so its branching factor equals the radix of the key (a radix 2 trie is a binary tree).
• A trie lets a record be located by examining only as many digits of the key as are needed to reach a leaf.

How Extendible Hashing Works
Primary key -> hashing function H(key) -> extract first d bits -> directory index -> table look-up -> file pointer

Trie
• Figure: trie for the keys “apple”, “ate”, “ant”, “bat”, “ball”, “box”, “cat”.

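Extendible Hashing Example
Directory with d = 3 and 4 buckets (d’ is the depth of each bucket):
• Cells 000, 001, 010, 011 -> bucket B0 (d’ = 1), H(key) = 0...
• Cell 100 -> bucket B100 (d’ = 3), H(key) = 100...
• Cell 101 -> bucket B101 (d’ = 3), H(key) = 101...
• Cells 110, 111 -> bucket B11 (d’ = 2), H(key) = 11...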

Directory
• A directory is created from a trie.
• With a radix 2 approach, search decisions are made bit by bit, and data is retrieved in buckets rather than as individual keys.
• Consider bucket A, holding keys whose hashed addresses begin with 0, bucket B, holding keys beginning with 10, and bucket C, holding keys beginning with 11.
• Figure: the trie (left) and the corresponding directory (right).

Turning a Trie into a Directory
• Using a trie for extendible hashing:
(1) Use a radix 2 trie:
    Keys in A: beginning with 0
    Keys in B: beginning with 10
    Keys in C: beginning with 11
(2) Retrieve from secondary storage the buckets containing the keys, instead of individual keys.
• Figure: the root branches on the first bit; 0 leads to A, and 1 leads to a node that branches on the second bit to B (0) and C (1).
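The step from trie to directory can be sketched in code. This is a minimal illustration, not the chapter's implementation: the helper name BuildDirectory and the representation of trie leaves as (bit prefix, bucket name) pairs are assumptions made for this example. Each leaf fills every directory cell whose index starts with the leaf's prefix.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: expand trie leaves, given as (bit prefix, bucket name),
// into a directory of 2^depth cells indexed by the first `depth` address bits.
std::vector<std::string> BuildDirectory(
    const std::vector<std::pair<std::string, std::string>>& leaves, int depth) {
    std::vector<std::string> dir(1u << depth);
    for (const auto& leaf : leaves) {
        const std::string& prefix = leaf.first;
        int len = static_cast<int>(prefix.size());
        assert(len <= depth);  // a leaf cannot be deeper than the directory
        // Numeric value of the prefix bits.
        unsigned base = 0;
        for (char c : prefix) base = (base << 1) | (c == '1' ? 1u : 0u);
        // A prefix of length len covers 2^(depth - len) consecutive cells.
        unsigned span = 1u << (depth - len);
        for (unsigned i = 0; i < span; ++i) dir[base * span + i] = leaf.second;
    }
    return dir;
}
```

With the leaves A = 0, B = 10, C = 11 and depth 2, the directory has four cells: 00 and 01 both refer to A, 10 refers to B, and 11 refers to C.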

Splitting to Handle Overflow
• To avoid overflow, split the bucket and increase (extend) the address space.
• In the previous example, if bucket A overflows, split A into A and D and redistribute the keys between addresses 00 and 01.
• This ability to grow the address space is the advantage of extendible hashing.

Implementation
1. Create the address space.
2. Design a class to handle operations, and create the directory and buckets to hold data.
3. Manage operations on the buckets and the directory.

Create Address Space
To create the address space, pass a key to the hash function, which returns an address. To keep the generated address within a fixed range, the running sum is reduced modulo a prime (19937 here). Extendible hashing itself provides an address space that can be extended and shrunk as required, so the only concern is that the resulting address fits in an integer variable.

    // requires <cstring> for strlen
    int Hash(char *key)
    {
        int sum = 0;
        int len = strlen(key);
        if (len % 2 == 1) len++;  // make len even; for an odd length the trailing '\0' is used as part of the key
        for (int j = 0; j < len; j += 2)
            sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
        return sum;
    }

The function MakeAddress() is used to generate addresses. Consider these keys and their hash values:
Bill 0000 0010 1100
Lee 0000 0101 1011
Alan 0000 0110 0101
If we use the leading bits of these values, all three keys go to the same bucket, which fills quickly. The low-order bits vary much more, so MakeAddress() builds the address from the low-order bits, taken in reverse order, which gives a better distribution. The low-order 4 bits are:
Bill 1100
Lee 1011
Alan 0101
Reversed, they give the addresses 0011, 1101, and 1010.

The MakeAddress() function reverses the address bits.

    int MakeAddress(char *Key, int Depth)
    {
        int retval = 0;
        int hashval = Hash(Key);
        // reverse the low-order Depth bits of hashval
        for (int j = 0; j < Depth; j++)
        {
            retval = retval << 1;      // make room for the next bit
            int lowbit = hashval & 1;  // take the lowest remaining bit
            retval = retval | lowbit;
            hashval = hashval >> 1;
        }
        return retval;
    }
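The bit reversal can be checked on its own, without the hash function. The sketch below applies the same loop to a hash value passed in directly; the name ReverseLowBits is an assumption for this example, and the three inputs are the hash values listed above for Bill, Lee, and Alan, written in hexadecimal.

```cpp
#include <cassert>

// Same loop as MakeAddress(), but applied to a hash value passed in directly.
// Hypothetical helper, used only to illustrate the bit reversal.
int ReverseLowBits(int hashval, int depth) {
    assert(depth >= 0 && depth < 31);
    int retval = 0;
    for (int j = 0; j < depth; j++) {
        retval = retval << 1;         // make room for the next bit
        retval = retval | (hashval & 1);  // append the lowest remaining bit
        hashval = hashval >> 1;
    }
    return retval;
}
```

For Bill (0000 0010 1100 = 0x02C) and depth 4, the low-order bits 1100 are read from the right, giving 0011 = 3. Lee (0x05B) gives 1101 = 13, and Alan (0x065) gives 1010 = 10, so the three keys fall into three different directory cells.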

Class to Handle Operations

    class Bucket : protected TextIndex
    {
    protected:
        Bucket(Directory &dir, int maxKeys = defaultMaxKeys);
        int Insert(char *key, int recAddr);
        int Remove(char *key);
        Bucket *Split();                              // split this bucket on overflow
        int NewRange(int &newStart, int &newEnd);     // directory range for the new bucket
        int Redistribute(Bucket &newBucket);          // move keys to the new bucket
        int FindBuddy();                              // locate the buddy bucket
        int TryCombine();                             // combine with the buddy if possible
        int Combine(Bucket *buddy, int buddyIndex);
        int Depth;                                    // number of address bits used by this bucket
        Directory &Dir;
        int BucketAddr;
        friend class Directory;
        friend class BucketBuffer;
    };

Retrieving a Record
• Steps in retrieving a record with a given key:
1. Find H(given key).
2. Extract the first d bits of H(given key).
3. Use this value as an index into the directory to find a pointer.
4. Use this pointer to read a bucket into primary memory.
5. Locate the desired record within the bucket (scan).
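The retrieval steps can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: hash values are 8 bits wide (kHashBits), the hash is computed by the caller, and the types Record, Bucket, and Directory here are hypothetical stand-ins, not the classes declared earlier.

```cpp
#include <cassert>
#include <string>
#include <vector>

const int kHashBits = 8;  // assumed width of a hash value, for illustration only

struct Record { std::string key; unsigned hash; };
struct Bucket { std::vector<Record> records; };

struct Directory {
    int depth;                    // d: number of leading hash bits used as index
    std::vector<int> cells;       // 2^d cells, each holding a bucket number
    std::vector<Bucket> buckets;  // stands in for buckets on secondary storage

    // Returns the matching record, or nullptr if the key is not present.
    const Record* Find(const std::string& key, unsigned hash) const {
        assert(cells.size() == (1u << depth));
        // Extract the first d bits of the hash value.
        unsigned index = (depth == 0) ? 0u : hash >> (kHashBits - depth);
        // Follow the directory pointer to the bucket.
        const Bucket& bucket = buckets[cells[index]];
        // Scan the bucket for the key.
        for (const Record& r : bucket.records)
            if (r.key == key) return &r;
        return nullptr;
    }
};
```

Only one bucket is read per lookup; the scan inside the bucket happens in memory and costs no additional seek.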

Split and Collapse
• A pair of adjacent buckets (buddies) can be combined if their average load is below 50%, so that all their records fit into one bucket.
• The directory can be compacted, and d decremented, whenever the two pointers in every pair of adjacent directory cells have the same value.

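Bucket B0 overflows, then splits into B00 and B01
Directory with d = 3 after the split:
• Cells 000, 001 -> bucket B00 (d’ = 2), H(key) = 00...
• Cells 010, 011 -> bucket B01 (d’ = 2), H(key) = 01...
• Cell 100 -> bucket B100 (d’ = 3), H(key) = 100...
• Cell 101 -> bucket B101 (d’ = 3), H(key) = 101...
• Cells 110, 111 -> bucket B11 (d’ = 2), H(key) = 11...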

Splitting to Handle Overflow
• E.g., overflow of bucket A: split A into A and D, using the additional unused bit.
• No need to expand the directory.
• Directory after the split: 00 -> A, 01 -> D, 10 -> B, 11 -> C

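Figure, three panels:
1. Result of overflow of bucket B: in the trie, A is reached by 0 and C by 11, while B splits into B (100) and D (101).
2. Complete binary tree: the trie extended so that every leaf has a 3-bit address.
3. Directory: 000 -> A, 001 -> A, 010 -> A, 011 -> A, 100 -> B, 101 -> D, 110 -> C, 111 -> C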
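Both cases of splitting, with and without expanding the directory, can be sketched in one small in-memory class. This is an illustration under stated assumptions, not the chapter's Bucket and Directory classes: the class name ToyExtendibleHash, the 8-bit hash width, the bucket capacity of 2, and the use of leading hash bits as the directory index are all choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

const int kHashBits = 8;           // assumed hash width
const std::size_t kBucketCap = 2;  // assumed bucket capacity

struct Bucket {
    int depth = 0;               // d': leading bits shared by every key in the bucket
    std::vector<unsigned> keys;  // hashed keys
};

class ToyExtendibleHash {
public:
    ToyExtendibleHash() : depth_(0), cells_(1, std::make_shared<Bucket>()) {}

    void Insert(unsigned h) {
        assert(h < (1u << kHashBits));
        for (;;) {
            std::shared_ptr<Bucket> b = cells_[Index(h)];
            if (b->keys.size() < kBucketCap) { b->keys.push_back(h); return; }
            assert(b->depth < kHashBits);  // no more bits left to split on
            Split(b);                      // then retry the insertion
        }
    }
    int depth() const { return depth_; }
    const Bucket& BucketFor(unsigned h) const { return *cells_[Index(h)]; }

private:
    std::size_t Index(unsigned h) const {
        return depth_ == 0 ? 0u : h >> (kHashBits - depth_);
    }
    void Split(const std::shared_ptr<Bucket>& old) {
        if (old->depth == depth_) {
            // The bucket already uses every directory bit: double the directory.
            std::vector<std::shared_ptr<Bucket>> bigger(cells_.size() * 2);
            for (std::size_t i = 0; i < bigger.size(); ++i) bigger[i] = cells_[i / 2];
            cells_ = bigger;
            ++depth_;
        }
        std::shared_ptr<Bucket> fresh = std::make_shared<Bucket>();
        old->depth++;
        fresh->depth = old->depth;
        // Redistribute keys on the newly used bit: 0 stays, 1 moves.
        int shift = kHashBits - old->depth;
        std::vector<unsigned> keep;
        for (unsigned k : old->keys) {
            if ((k >> shift) & 1u) fresh->keys.push_back(k);
            else keep.push_back(k);
        }
        old->keys = keep;
        // Repoint the directory cells whose newly used bit is 1.
        for (std::size_t i = 0; i < cells_.size(); ++i)
            if (cells_[i] == old && ((i >> (depth_ - old->depth)) & 1u))
                cells_[i] = fresh;
    }
    int depth_;                                   // d: directory depth
    std::vector<std::shared_ptr<Bucket>> cells_;  // 2^d directory cells
};
```

Inserting 0x00, 0x40, 0x80, 0x20 in that order doubles the directory twice (d becomes 2). Inserting 0xC0 and then 0xA0 afterwards splits a bucket whose depth is 1, which only repoints existing cells and leaves d unchanged.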

Buddy Buckets
• Given a bucket with address uvwxy, where u, v, w, x, and y are each 0 or 1, the buddy bucket, if it exists, has address uvwxz, where z = y XOR 1.
• If enough keys are deleted, the contents of buddy buckets can be combined into a single bucket, using a combination of functions: FindBuddy(), TryCombine(), Collapse().
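The buddy address is a single XOR. The sketch below is a free-standing version written for illustration; its signature is an assumption and differs from the FindBuddy() member declared in the Bucket class. It also assumes, as a simplification, that a bucket has a buddy only when it uses the full directory depth.

```cpp
#include <cassert>

// Returns the address of the buddy bucket, or -1 if there is none.
// Hypothetical free function; the parameters are assumptions for this sketch.
int FindBuddy(int bucketAddress, int bucketDepth, int directoryDepth) {
    assert(bucketDepth <= directoryDepth);
    if (directoryDepth == 0) return -1;           // only one bucket exists
    if (bucketDepth < directoryDepth) return -1;  // bucket is shared by several cells
    return bucketAddress ^ 1;                     // flip the last address bit
}
```

For example, the bucket at address 10110 has its buddy at 10111, and the reverse also holds.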

Making Deletions
• When to combine buckets
– Buddy buckets: the buckets are siblings and at the leaf level of the tree
– Examine the directory to see if we can make changes there
– Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory

Extendible Hashing Performance
• Time: O(1)
– If the directory can be kept in RAM: a single access
– Otherwise: two accesses are necessary
• Space utilization of the buckets
– r (number of records), b (block size, in records), N (number of blocks)
– Utilization = r / (b N)
– Average utilization ≈ 0.69 (approximately ln 2)
• Space utilization for the directory
– How large a directory should we expect to have, given an expected number of keys?
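As a worked instance of the utilization formula, with numbers chosen purely for illustration (they are not measurements from the source):

```latex
U = \frac{r}{b\,N}, \qquad
r = 690,\ b = 10,\ N = 100
\;\Rightarrow\;
U = \frac{690}{10 \cdot 100} = 0.69
```

That is, 690 records stored in 100 blocks of 10 records each leave the file 69% full.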

Other Approaches
• Dynamic hashing is similar to extendible hashing
– It uses a directory to track bucket addresses
– It extends the directory through the use of tries
• It starts with a hash function that covers an address space of a fixed size
• When overflow occurs
– The bucket splits, forming the leaves of a trie that grows down from the original address node

Dynamic Hashing
• Two kinds of nodes
– External node: references a data bucket
– Internal node: points to two child index nodes
• When a node's bucket splits, the node gains two children and changes from an external node to an internal node
• Two hash functions
– Apply the first hash function to find a node in the original address space
– If an external node is found, the search is complete
– If an internal node is found, apply the second hash function to continue the search below that node

Figure (dynamic hashing), panels (a) to (c): (a) the original address space with nodes 1 to 4; (b) node 4 split into 40 and 41; (c) further splits producing nodes 20, 21 and 410, 411.

Linear Hashing
• Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory.
• The actual address space is extended one bucket at a time as buckets overflow.
• Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands.
• No directory: this avoids the additional seek that results from the additional layer.
• It uses more bits of the hashed value: h_d(k) is the depth d hashing function (using the function make_address).
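Without a directory, the address has to be computed from the hash value and the current state of the expansion. The sketch below follows the usual formulation of linear hashing and rests on assumptions not stated on the slide: h_d(k) is taken to be the low-order d bits of the hash value, and splitPointer (a hypothetical name) counts the buckets already split in the current round.

```cpp
#include <cassert>

// h_d: the low-order d bits of the hash value (an assumption for this sketch).
unsigned HashD(unsigned hash, int d) { return hash & ((1u << d) - 1u); }

// Address of a key when buckets 0 .. splitPointer-1 have already been split
// and the address space is growing from 2^d towards 2^(d+1) buckets.
unsigned LinearAddress(unsigned hash, int d, unsigned splitPointer) {
    assert(d >= 0 && d < 31);
    assert(splitPointer < (1u << d));
    unsigned addr = HashD(hash, d);
    if (addr < splitPointer) addr = HashD(hash, d + 1);  // bucket already split: use one more bit
    return addr;
}
```

With d = 2 and one bucket split, a hash value ending in 100 maps to the new bucket 100, one ending in 000 stays in bucket 000, and a value ending in 01 still uses the 2-bit address 01.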

The growth of address space in linear hashing (1)
Figure, panels (a) to (d): buckets a, b, c, d at addresses 00, 01, 10, 11; new buckets A, B, C added one at a time at addresses 100, 101, 110; overflow buckets w, x, y.
