Programming Data Structures and Algorithms Hashing Anton Biasizzo

Programming, Data Structures and Algorithms (Hashing) Anton Biasizzo Programming, Data Structures and Algorithms (Hashing) Slide 1/25

Hash table ADT
- Search tree ADT
  - Supports various operations on a set of elements.
  - Find runs in O(log n) time (for a balanced tree).
  - Insert and Delete rely on the find procedure, so both also require O(log n) time.
- Hash table ADT
  - Supports only a subset of the search tree ADT operations (insert, delete, and find).
  - Very fast operations (close to constant time, O(1), on average).
  - Does not provide ordering information.
  - Implementations are referred to as hashing.

General idea
- A hash table is an array of fixed size.
- The array contains keys (e.g. a string with an associated value).
- The table size (TableSize) is part of the hash data structure.
- Each key is mapped to a number in the range [0, TableSize-1] and stored in the corresponding cell.
- The mapping is called the hash function.
- The hash function should be simple to implement and fast to compute.

General idea
- The returned values are called hash values, hash codes, hash sums, or hashes.
- Ideally, distinct keys should have distinct hash values.
- This is impossible in general: the number of cells (i.e. hash values) is finite, while the supply of keys is inexhaustible.
- The hash function should distribute keys evenly among the cells.
- When two or more keys map to the same hash value, a collision occurs.
- Hash table implementation:
  - Choose a hash function,
  - Manage collisions,
  - Determine the table size.

Hash function
- If the input keys are integers, the hash function is typically Key mod TableSize:
  - This works well unless the keys have some undesirable property (e.g. the table size is 10 and all keys end in zero).
  - Collisions can be reduced by choosing a prime table size.
  - When the keys are random integers, they are distributed evenly.
- Keys are usually strings:
  - The hash function has to be chosen carefully.
  - One option is to sum the ASCII values of the characters in the string.
  - A second option is to use only the first few characters of the key.
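The effect of the table size on integer keys can be checked with a small experiment. The sketch below is only an illustration: the helper names `hash_int` and `cells_used` and the 64-cell limit are assumptions made for this example, not part of the slides.

```c
#include <assert.h>

/* Hypothetical helper: map an integer key to a cell in [0, table_size - 1]. */
unsigned int hash_int(unsigned int key, unsigned int table_size)
{
    assert(table_size > 0);          /* modulo by zero is undefined */
    return key % table_size;
}

/* Count how many distinct cells a set of keys occupies.
   Assumption for this sketch: table_size is at most 64. */
unsigned int cells_used(const unsigned int *keys, unsigned int n,
                        unsigned int table_size)
{
    unsigned char used[64] = {0};
    unsigned int i, count = 0;

    assert(table_size <= 64);
    for (i = 0; i < n; i++) {
        unsigned int h = hash_int(keys[i], table_size);
        if (!used[h]) {              /* first key that lands in this cell */
            used[h] = 1;
            count++;
        }
    }
    return count;
}
```

With the keys 10, 20, 30, 40, 50, a table of size 10 places every key in cell 0, while a table of the prime size 11 spreads them over five different cells (10, 9, 8, 7, 6).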

Hash function
- Sum of ASCII codes:

    typedef unsigned int INDEX;

    /* Sum the character codes of the key, then reduce modulo the table size. */
    INDEX hash( const char *key, unsigned int H_SIZE )
    {
        unsigned int hash_val = 0;

        while ( *key != '\0' )
            hash_val += (unsigned char) *key++;
        return ( hash_val % H_SIZE );
    }

- It is a simple and fast hash function.
- If the table size is large, it does not distribute the keys well:
  - For keys with eight or fewer ASCII characters, the sum lies between 0 and 1016 (8 × 127), so most cells of a larger table are never used.
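Two weaknesses of the additive hash can be observed directly: the sum ignores character order, so anagrams always collide, and short keys produce small sums. The following is a minimal sketch; the name `hash_sum` and the table size 10007 used in the note are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Additive string hash: sum of character codes modulo the table size. */
unsigned int hash_sum(const char *key, unsigned int table_size)
{
    unsigned int hash_val = 0;

    assert(key != NULL && table_size > 0);
    while (*key != '\0')
        hash_val += (unsigned char) *key++;   /* order of characters is lost */
    return hash_val % table_size;
}
```

For a table of size 10007, "listen" and "silent" hash to the same cell, and the eight-character key "zzzzzzzz" hashes to 8 × 122 = 976, well inside the 0 to 1016 range.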

Hash function
- First three characters:

    typedef unsigned int INDEX;

    /* Treat the first three characters as digits of a base-27 number. */
    INDEX hash( const char *key, unsigned int H_SIZE )
    {
        return ( ( key[0] + 27*key[1] + 729*key[2] ) % H_SIZE );
    }

- Assumes that the key has at least three characters.
- 27 is the number of letters in the English alphabet plus the blank; 729 is 27 squared.
- This would be a good hash function if the first three characters were random, which is not the case for any natural language.

Hash function
- Use all characters in the key:

    typedef unsigned int INDEX;

    /* Multiply the running value by 32 (shift left by 5) and add the next character. */
    INDEX hash( const char *key, unsigned int H_SIZE )
    {
        unsigned int hash_val = 0;

        while ( *key != '\0' )
            hash_val = ( hash_val << 5 ) + (unsigned char) *key++;
        return ( hash_val % H_SIZE );
    }

- Multiplication by 32 is used instead of 27, because it can be done with a shift.
- A simple and fast hash function (if overflows are allowed).
- If keys are very long:
  - computing the hash might be too time consuming,
  - the first characters are shifted out of the hash value,
  - a remedy is to use only some of the characters (odd positions, characters from different fields, …).
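The remark that the first characters are shifted out can be made concrete. The sketch below fixes the word size with `uint32_t` (an assumption; the slides use plain `unsigned int`), and the name `hash_shift32` is hypothetical. In a key of length 8 the first character is multiplied by 32^7 = 2^35, which is 0 modulo 2^32, so it no longer influences the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift-and-add string hash on exactly 32 bits; overflow wraps modulo 2^32. */
uint32_t hash_shift32(const char *key, uint32_t table_size)
{
    uint32_t hash_val = 0;

    assert(key != NULL && table_size > 0);
    while (*key != '\0')
        hash_val = (hash_val << 5) + (unsigned char) *key++;
    return hash_val % table_size;
}
```

Under these assumptions, "Aaaaaaaa" and "Baaaaaaa" (eight characters, differing only in the first) always collide, whereas the seven-character keys "Aaaaaaa" and "Baaaaaa" still hash differently for a table of size 10007.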

Collision resolution
- Collision: when a new element is inserted, it hashes to the same value as an already inserted element.
- Strategies to resolve collisions:
  - Open hashing,
  - Closed hashing.

Open hashing
- Open hashing is also called separate chaining.
- Keep a list of all elements that hash to the same value.
- The ADT operations (find, insert, …) must be adapted accordingly.
- In the example, the lists have headers.
- The hash function in the example is hash(X) = X mod 10.
- Assume that the keys are the first 10 squares.
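The chain lengths for this example can be computed with a few lines of code. The sketch assumes that the first 10 squares are 0, 1, 4, …, 81 (taking 1, 4, …, 100 instead gives the same lengths, since 0 and 100 both hash to cell 0); the function name `square_chain_lengths` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 10

/* Fill chain_len[c] with the number of keys i*i (i = 0..9)
   that hash to cell c under hash(X) = X mod 10. */
void square_chain_lengths(unsigned int chain_len[TABLE_SIZE])
{
    unsigned int i;

    assert(chain_len != NULL);
    for (i = 0; i < TABLE_SIZE; i++)
        chain_len[i] = 0;
    for (i = 0; i < 10; i++)
        chain_len[(i * i) % TABLE_SIZE]++;
}
```

Cells 1, 4, 6, and 9 each receive two keys (1 and 81, 4 and 64, 16 and 36, 9 and 49), cells 0 and 5 receive one key each, and cells 2, 3, 7, and 8 stay empty.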

Open hashing type declaration
- Type declaration:

    typedef struct list_node *node_ptr;

    struct list_node
    {
        element_type element;
        node_ptr next;
    };

    typedef node_ptr LIST;
    typedef node_ptr position;

    /* The table is an array of lists; its size is stored with it. */
    struct hash_tbl
    {
        unsigned int table_size;
        LIST *the_lists;
    };

    typedef struct hash_tbl *HASH_TABLE;

Open hashing operations
- Initialization (error checking of malloc is omitted for brevity):

    HASH_TABLE initialize_table( unsigned int table_size )
    {
        HASH_TABLE H;
        unsigned int i;

        /* Allocate table and remember its size */
        H = (HASH_TABLE) malloc( sizeof (struct hash_tbl) );
        H->table_size = table_size;

        /* Allocate list pointers */
        H->the_lists = (LIST *) malloc( sizeof (LIST) * H->table_size );

        /* Allocate list headers */
        for( i = 0; i < H->table_size; i++ )
        {
            H->the_lists[i] = (LIST) malloc( sizeof (struct list_node) );
            H->the_lists[i]->next = NULL;
        }
        return H;
    }

Open hashing operations
- Find operation
  - If the keys are strings, an appropriate function (e.g. strcmp) must be used for the key comparison.

    position find( element_type key, HASH_TABLE H )
    {
        position p;
        LIST L;

        /* Select the list for this key and skip its header */
        L = H->the_lists[ hash( key, H->table_size ) ];
        p = L->next;
        while ( (p != NULL) && (p->element != key) )
            p = p->next;
        return p;    /* NULL if the key is not found */
    }

Open hashing operations
- Insert operation (no duplicates)

    void insert( element_type key, HASH_TABLE H )
    {
        position pos, new_cell;
        LIST L;

        pos = find( key, H );
        if ( pos == NULL )    /* insert only if the key is not present */
        {
            new_cell = (position) malloc( sizeof (struct list_node) );
            L = H->the_lists[ hash( key, H->table_size ) ];
            /* Link the new cell right after the list header */
            new_cell->next = L->next;
            new_cell->element = key;
            L->next = new_cell;
        }
    }

- This implementation computes the hash value twice (once in find and once in insert).
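The double hash computation can be avoided by searching the selected list inside the insert routine itself. The following is a minimal self-contained sketch, not the slides' implementation: it assumes integer keys, and the names `make_table`, `insert_once`, and `hash_int` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef int element_type;                 /* assumption: integer keys */
typedef struct list_node *node_ptr;
struct list_node { element_type element; node_ptr next; };
struct hash_tbl { unsigned int table_size; node_ptr *the_lists; };

static unsigned int hash_int(element_type key, unsigned int table_size)
{
    return (unsigned int) key % table_size;
}

/* Allocate a table whose cells each hold an empty list with a header node.
   Returns NULL if any allocation fails. */
struct hash_tbl *make_table(unsigned int table_size)
{
    struct hash_tbl *H;
    unsigned int i;

    assert(table_size > 0);
    H = malloc(sizeof *H);
    if (H == NULL)
        return NULL;
    H->table_size = table_size;
    H->the_lists = malloc(sizeof *H->the_lists * table_size);
    if (H->the_lists == NULL) {
        free(H);
        return NULL;
    }
    for (i = 0; i < table_size; i++) {
        H->the_lists[i] = malloc(sizeof(struct list_node));
        if (H->the_lists[i] == NULL) {    /* undo the partial allocation */
            while (i > 0)
                free(H->the_lists[--i]);
            free(H->the_lists);
            free(H);
            return NULL;
        }
        H->the_lists[i]->next = NULL;
    }
    return H;
}

/* Insert key unless it is already present; the hash is computed only once.
   Returns 1 if inserted, 0 if the key is a duplicate or allocation fails. */
int insert_once(element_type key, struct hash_tbl *H)
{
    node_ptr header, p, new_cell;

    assert(H != NULL);
    header = H->the_lists[hash_int(key, H->table_size)];
    for (p = header->next; p != NULL; p = p->next)
        if (p->element == key)
            return 0;
    new_cell = malloc(sizeof *new_cell);
    if (new_cell == NULL)
        return 0;
    new_cell->element = key;
    new_cell->next = header->next;        /* link right after the header */
    header->next = new_cell;
    return 1;
}
```

With a table of size 10, inserting 16 and then 36 puts both keys into the list of cell 6, with 36 in front because new cells are linked at the head; inserting 16 a second time is rejected.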

Open hashing
- Any scheme could be used instead of linked lists to resolve the collisions (trees, another hash table, …).
- We expect that if the table is large, the lists are short.
- The load factor λ is the ratio of the number of elements in the hash table to the table size.
- The average length of a list is λ.
- The effort to perform a search is the constant time to calculate the hash value plus the time to traverse the list.
- In an unsuccessful search, the number of links to traverse is λ on average.
- The general rule for open hashing is to make the table size about as large as the number of elements expected (λ ≈ 1).

Closed hashing
- Open hashing has the disadvantage of requiring lists or another data structure.
- Closed hashing, also called open addressing, is an alternative to resolving collisions with linked lists.
- If a collision occurs, alternate cells are tried until an empty cell is found.
- Formally: cells h_0(X), h_1(X), h_2(X), … are tried in succession, where h_i(X) = (Hash(X) + F(i)) mod TableSize.
- The function F is the collision resolution strategy (F(0) = 0).
- Closed hashing needs bigger tables than open hashing.
- In general, the load factor should be kept below λ = 0.5.

Linear probing
- The collision resolution function F is a linear function (typically F(i) = i).
- Cells are tried sequentially, with wraparound, in search of an empty cell.
- As long as the table is big enough, a free cell can always be found.
- The time to find an empty cell can get quite large.
- Even when the table is relatively empty, blocks of occupied cells start forming; this is called primary clustering.

Example of linear probing
- Example of inserting the keys {89, 18, 49, 58, 69}.
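The insertion sequence can be traced with a short routine. The sketch assumes a table of size 10 with Hash(X) = X mod 10, which the slide text does not state explicitly; `EMPTY` and `insert_linear` are hypothetical names, and duplicates are not checked.

```c
#include <assert.h>

#define TABLE_SIZE 10       /* assumption: table size used in the example */
#define EMPTY (-1)          /* assumption: keys are non-negative integers */

/* Insert key with linear probing, F(i) = i.
   Returns the cell used, or -1 if the table is full. */
int insert_linear(int table[TABLE_SIZE], int key)
{
    int i;

    assert(key >= 0);
    for (i = 0; i < TABLE_SIZE; i++) {
        int cell = (key % TABLE_SIZE + i) % TABLE_SIZE;   /* h_i(X) */
        if (table[cell] == EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Starting from an empty table, 89 goes to cell 9 and 18 to cell 8; 49 collides at 9 and wraps around to cell 0; 58 tries 8, 9, 0 and lands in cell 1; 69 tries 9, 0, 1 and lands in cell 2. The block of cells 8, 9, 0, 1, 2 is an instance of primary clustering.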

Quadratic probing
- The collision resolution function F is a quadratic function (typically F(i) = i^2).
- It eliminates primary clustering.
- Linear probing performs badly when the table gets almost full.
- In quadratic probing, at most half of the table can be used as alternate locations.
- For quadratic probing there is no guarantee of finding an empty cell once the table gets more than half full.
- If the table size is not prime, an empty cell might not be found even when the table is less than half full.

Example of quadratic probing
- Example of inserting the keys {89, 18, 49, 58, 69}.
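The same keys can be traced with quadratic probing. As before, this sketch assumes a table of size 10 with Hash(X) = X mod 10; `EMPTY` and `insert_quadratic` are hypothetical names, and the routine gives up after TABLE_SIZE probes.

```c
#include <assert.h>

#define TABLE_SIZE 10       /* assumption: table size used in the example */
#define EMPTY (-1)          /* assumption: keys are non-negative integers */

/* Insert key with quadratic probing, F(i) = i*i.
   Returns the cell used, or -1 if no empty cell was found. */
int insert_quadratic(int table[TABLE_SIZE], int key)
{
    int i;

    assert(key >= 0);
    for (i = 0; i < TABLE_SIZE; i++) {
        int cell = (key % TABLE_SIZE + i * i) % TABLE_SIZE;   /* h_i(X) */
        if (table[cell] == EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Here 89 and 18 go to cells 9 and 8; 49 collides at 9 and moves one cell away to cell 0; 58 collides at 8 and 9 and then lands four cells away in cell 2; 69 collides at 9 and 0 and lands in cell 3.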

Double hashing
- The collision resolution function F includes a second hash function: F(i) = i * hash_2(X).
- We probe at distances hash_2(X), 2 * hash_2(X), 3 * hash_2(X), …
- A good second hash function is essential.
- The second hash function must never evaluate to zero!
- The second hash function must be chosen such that all cells can be probed (prime table size).

Example of double hashing
- hash_2(X) = R - (X mod R), where R = 7.
- Example of inserting the keys {89, 18, 49, 58, 69}.
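The example can be traced in code as well. The sketch assumes a table of size 10 with Hash(X) = X mod 10 as the first hash function; `EMPTY`, `hash2`, and `insert_double` are hypothetical names.

```c
#include <assert.h>

#define TABLE_SIZE 10       /* assumption: table size used in the example */
#define R 7                 /* prime smaller than the table size */
#define EMPTY (-1)          /* assumption: keys are non-negative integers */

/* Second hash function; the result is in [1, R], so it is never zero. */
static int hash2(int key)
{
    return R - (key % R);
}

/* Insert key with double hashing, F(i) = i * hash2(key).
   Returns the cell used, or -1 if no empty cell was found. */
int insert_double(int table[TABLE_SIZE], int key)
{
    int i;

    assert(key >= 0);
    for (i = 0; i < TABLE_SIZE; i++) {
        int cell = (key % TABLE_SIZE + i * hash2(key)) % TABLE_SIZE;
        if (table[cell] == EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

The keys 89 and 18 go to cells 9 and 8. For 49, hash2 is 7, so it moves from 9 to cell 6; for 58, hash2 is 5, so it moves from 8 to cell 3; for 69, hash2 is 1, so it moves from 9 to cell 0. Because 10 is not prime, a later insertion of 23 fails in this sketch: hash2(23) = 5, so only cells 3 and 8 are ever probed, and both are occupied.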

Problems with closed hash table
- Standard deletion cannot be performed, because the cell might have caused a collision to go past it; emptying it would make later elements unreachable.
- Closed hash tables require lazy deletion: an additional field is added to each element, which tags it as deleted.
- If the table gets too full, the operations get slower and an insertion might even fail.
- This happens when many deletions are intermixed with insertions.
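Lazy deletion can be sketched with a three-valued state per cell. This is an illustrative sketch under several assumptions: linear probing, a table of size 11, non-negative integer keys, and hypothetical names (`cell_state`, `find_cell`, `insert_key`, `delete_key`).

```c
#include <assert.h>

#define TABLE_SIZE 11       /* assumption: small prime table */

enum cell_state { CELL_EMPTY, CELL_OCCUPIED, CELL_DELETED };

struct cell {
    int key;
    enum cell_state state;  /* the additional field used for lazy deletion */
};

/* Return the cell holding key, or -1. A deleted cell does not stop the
   search; only a truly empty cell does. */
int find_cell(const struct cell table[TABLE_SIZE], int key)
{
    int i;

    assert(key >= 0);
    for (i = 0; i < TABLE_SIZE; i++) {
        int c = (key % TABLE_SIZE + i) % TABLE_SIZE;
        if (table[c].state == CELL_EMPTY)
            return -1;
        if (table[c].state == CELL_OCCUPIED && table[c].key == key)
            return c;
    }
    return -1;
}

/* Insert key into the first empty or deleted cell on its probe sequence.
   Returns the cell used (also if already present), or -1 if none is free. */
int insert_key(struct cell table[TABLE_SIZE], int key)
{
    int i, c = find_cell(table, key);

    if (c >= 0)
        return c;
    for (i = 0; i < TABLE_SIZE; i++) {
        c = (key % TABLE_SIZE + i) % TABLE_SIZE;
        if (table[c].state != CELL_OCCUPIED) {
            table[c].key = key;
            table[c].state = CELL_OCCUPIED;
            return c;
        }
    }
    return -1;
}

/* Tag the cell as deleted instead of emptying it. Returns 1 on success. */
int delete_key(struct cell table[TABLE_SIZE], int key)
{
    int c = find_cell(table, key);

    if (c < 0)
        return 0;
    table[c].state = CELL_DELETED;
    return 1;
}
```

The keys 5, 16, and 27 all hash to cell 5 and end up in cells 5, 6, and 7. After 16 is deleted, 27 is still found because the search passes over the deleted cell 6, and a later insertion of 38 reuses that cell.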

Rehashing
- The solution is rehashing:
  - Build another table that is about twice as big, with a new hash function.
  - Scan the original hash table.
  - Insert all non-deleted elements into the new hash table.
- It is an expensive operation.
- It happens infrequently.
- Several strategies:
  - Rehash when the table is half full,
  - Rehash only when an insertion fails,
  - Rehash at a certain load factor.
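The rehashing steps can be sketched as follows. This is one possible arrangement, not a definitive implementation: it assumes linear probing, non-negative integer keys, no lazily deleted cells, no duplicate check, the first strategy (rehash before the table becomes more than half full), and a new size equal to the first prime at or above twice the old size. All names (`struct table`, `table_init`, `place`, `rehash`, `table_insert`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define EMPTY (-1)          /* assumption: keys are non-negative integers */

struct table {
    int *cells;
    unsigned int size;      /* number of cells */
    unsigned int count;     /* number of stored keys */
};

static int is_prime(unsigned int n)
{
    unsigned int d;

    if (n < 2)
        return 0;
    for (d = 2; d * d <= n; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

static unsigned int next_prime(unsigned int n)
{
    while (!is_prime(n))
        n++;
    return n;
}

/* Allocate an empty table. Returns 1 on success, 0 on allocation failure. */
int table_init(struct table *t, unsigned int size)
{
    unsigned int i;

    assert(t != NULL && size > 0);
    t->cells = malloc(sizeof *t->cells * size);
    if (t->cells == NULL)
        return 0;
    for (i = 0; i < size; i++)
        t->cells[i] = EMPTY;
    t->size = size;
    t->count = 0;
    return 1;
}

/* Store key by linear probing; assumes at least one cell is free. */
static void place(struct table *t, int key)
{
    unsigned int c = (unsigned int) key % t->size;

    while (t->cells[c] != EMPTY)
        c = (c + 1) % t->size;
    t->cells[c] = key;
    t->count++;
}

/* Build a table about twice as big and reinsert every stored key. */
int rehash(struct table *t)
{
    struct table bigger;
    unsigned int i;

    if (!table_init(&bigger, next_prime(2 * t->size)))
        return 0;
    for (i = 0; i < t->size; i++)        /* scan the original table */
        if (t->cells[i] != EMPTY)
            place(&bigger, t->cells[i]);
    free(t->cells);
    *t = bigger;
    return 1;
}

/* Insert key, rehashing first if the table would exceed half full. */
int table_insert(struct table *t, int key)
{
    assert(t != NULL && key >= 0);
    if (2 * (t->count + 1) > t->size && !rehash(t))
        return 0;
    place(t, key);
    return 1;
}
```

Starting with a table of size 7, the keys 6, 13, and 20 fit without rehashing (cells 6, 0, and 1). Inserting 27 would make the table more than half full, so the table first grows to size 17, where the keys end up in cells 6, 13, 3, and 10 because the hash function X mod 17 differs from X mod 7.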

Hash tables
- Hash tables are used to implement the Insert and Find operations in constant average time.
- Hash table usage:
  - In compilers, to keep track of declared variables (symbol table)
  - In graph theory, where nodes have names instead of numbers
  - In game playing, for recording positions (transposition table)
  - For dictionary implementation (spell checkers, search engines, …)
  - For database implementation