Hashing
It's not just for breakfast anymore!

Hashing: the facts
• An approach that covers both storing values and searching for them
• Behavior is linear in the worst case, but a strong competitor with binary searching in the average case
• Hashing makes it easy to add and delete elements, an advantage over binary search (since the latter requires a sorted array)

Dictionary ADT
• Previously, we have seen a dictionary ADT implemented as a binary search tree
• A hash table can be used to provide an array-based dictionary implementation
• Abstract properties of a dictionary:
  – every item has a key
  – to retrieve an item, specify the key and the retrieval process fetches the associated data

Possible structure for single dictionary item
template <class item>
struct RecordType
{
    int key;           // signed, so the negative "unused" markers used by the dictionary class can be stored here
    item datarecord;
};

Setting up the array
• One approach to an array-based dictionary would be to create consecutive keys, storing the records so that each key corresponds to its index; this is the method used in MS Access, for example
• An alternative would be to use an existing attribute of the data to be stored as the key value; this approach is more typical

Setting up the array
• Use of an existing key field presents challenges:
  – Value may be too large for indexing: e.g. social security number
  – No guarantee that individual values will be close enough together for effective indexing: e.g. last 4 digits of social security numbers of students in a class

Solution: hashing
• Instead of direct use of the data field, a function is applied to the original value to produce a valid index: this is called the hash function
• The hash function maps the key to an index that can be used to insert data into the array or to retrieve data based on a given key
• An array that uses hashing for indexing is called a hash table

Operations on a hash table
• Inserting an item
  – calculate hash value (index) from item key
  – check index to determine if space is open
    • if open, insert item
    • if not open, collision occurs; search through array for next open slot
  – requires some mechanism for recognizing an empty space; can't just start with an uninitialized array

Open-address hashing
• The insertion scheme just described uses open-address hashing
• In open addressing, collisions are resolved by placing a new item in the next open spot in the array
• The scheme requires that the key field of each array element be initialized to some known value; -1, for example

Inserting a New Record
• In order to insert a new record, the key must somehow be converted to an array index.
• The index is called the hash value of the key.
[Figure: a 701-slot array, indices [0] to [700]. Occupied slots: [1] Number 281942902, [2] Number 233667136, [4] Number 506643548, [700] Number 155778322. The new record, Number 580625685, is waiting to be inserted.]

Inserting a New Record
• Typical hash function: (Number mod 701)
  – 701 is the number of items in the array
  – Number is the original key value
• What is (580625685 mod 701)? The answer is 3.
[Figure: the same array; the new record, Number 580625685, has hash value 3.]
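The arithmetic on this slide can be checked directly. The sketch below assumes the table size of 701 from the slide; the names hash_value and TABLE_SIZE are hypothetical, chosen only for this illustration. The keys in the comment are the ones shown in the figure.

```cpp
#include <cassert>
#include <cstddef>

const int TABLE_SIZE = 701;   // number of array slots, taken from the slide

// hash_value is a hypothetical name for the slide's (Number mod 701).
std::size_t hash_value(int key) {
    assert(key >= 0);         // assumed precondition: keys are non-negative
    return static_cast<std::size_t>(key % TABLE_SIZE);
}

// Hash values for the keys that appear in the figure:
//   hash_value(281942902) == 1
//   hash_value(233667136) == 2
//   hash_value(580625685) == 3
//   hash_value(506643548) == 4
//   hash_value(155778322) == 700
```

Each record already in the figure sits exactly at its own hash value, which is why none of them needed to be moved forward.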

Inserting a New Record
• The hash value is used for the location of the new record.
[Figure: Number 580625685 is stored at [3]; [1], [2], [4] and [700] are occupied as before.]

Collisions
• Here is another new record to insert, with a hash value of 2.
[Figure: the new record, Number 701466868, says "My hash value is [2]." The array holds [1] Number 281942902, [2] Number 233667136, [3] Number 580625685, [4] Number 506643548, [700] Number 155778322.]

Collisions
• This is called a collision, because there is already another valid record at [2].
• When a collision occurs, move forward until you find an empty spot.
[Figure: Number 701466868 moves forward past the occupied slots [2], [3] and [4].]

Collisions
• The new record goes in the empty spot.
[Figure: Number 701466868 is stored at [5].]
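The collision in these figures can be replayed in code. This is a minimal sketch of the "move forward until you find an empty spot" rule, assuming -1 as the empty marker; probe_insert, figure_table and EMPTY are hypothetical names for illustration, not part of the dictionary class shown later.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int TABLE_SIZE = 701;   // from the slides
const int EMPTY = -1;         // assumed marker for an unused slot

// Inserts key with open addressing and returns the slot it landed in,
// or table.size() if every slot is occupied.
std::size_t probe_insert(std::vector<int>& table, int key) {
    assert(key >= 0);
    std::size_t index = static_cast<std::size_t>(key) % table.size();
    for (std::size_t probes = 0; probes < table.size(); ++probes) {
        if (table[index] == EMPTY) {
            table[index] = key;
            return index;
        }
        index = (index + 1) % table.size();   // move forward, wrapping at the end
    }
    return table.size();
}

// Rebuilds the table from the figures, before Number 701466868 arrives.
std::vector<int> figure_table() {
    std::vector<int> table(TABLE_SIZE, EMPTY);
    const int keys[] = {281942902, 233667136, 506643548, 155778322, 580625685};
    for (int key : keys) probe_insert(table, key);
    return table;
}
```

Calling probe_insert(table, 701466868) on the table returned by figure_table() finds slots [2], [3] and [4] occupied and returns 5, matching the last figure.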

Operations on a hash table
• Retrieving an item
  – calculate hash value based on desired key
  – search array, beginning at calculated index, for desired data
  – search is finished when:
    • item is found; successful search
    • an empty index is encountered; unsuccessful search

Searching for a Key
• The data that's attached to a key can be found fairly quickly.
[Figure: a search for key Number 701466868 in the array: [1] Number 281942902, [2] Number 233667136, [3] Number 580625685, [4] Number 506643548, [5] Number 701466868, [700] Number 155778322.]

Searching for a Key
• Calculate the hash value.
• Check that location of the array for the key.
[Figure: the key, Number 701466868, says "My hash value is [2]." The record at [2] answers "Not me."]

Searching for a Key
• Keep moving forward until you find the key, or you reach an empty spot.
[Figure: the records passed over answer "Not me."; the record at [5] answers "Yes!"]

Searching for a Key
• When the item is found, the information can be copied to the necessary location.
[Figure: the record found at [5], Number 701466868, is copied out.]
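The search walked through above can be sketched as follows. The table contents come from the figures; probe_find, figure_table and EMPTY are hypothetical names, and -1 is an assumed empty marker.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = -1;   // assumed marker for a slot that has never held a record

// The 701-slot table as it stands after both insertions in the figures.
std::vector<int> figure_table() {
    std::vector<int> table(701, EMPTY);
    table[1] = 281942902;
    table[2] = 233667136;
    table[3] = 580625685;
    table[4] = 506643548;
    table[5] = 701466868;
    table[700] = 155778322;
    return table;
}

// Returns the index holding key, or table.size() if the search is unsuccessful.
// probes receives the number of slots examined.
std::size_t probe_find(const std::vector<int>& table, int key, std::size_t& probes) {
    assert(key >= 0);
    std::size_t index = static_cast<std::size_t>(key) % table.size();
    for (probes = 1; probes <= table.size(); ++probes) {
        if (table[index] == key) return index;            // successful search
        if (table[index] == EMPTY) return table.size();   // empty spot: unsuccessful
        index = (index + 1) % table.size();
    }
    probes = table.size();   // every slot was examined without a match
    return table.size();
}
```

Searching for 701466868 examines slots [2], [3], [4] and [5] and succeeds on the fourth probe. Searching for a key that is absent but also hashes to 2 (the key 2 itself, for instance) examines [2] through [6] and stops at the empty slot [6].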

Operations on a hash table
• Deleting an item:
  – find index based on hashed key, as with insertion and retrieval
  – mark record at index to indicate the spot is open
    • can't use the ordinary "empty" designation; this could interfere with record retrieval
    • use an alternative "open" designation: it indicates the slot is open for insertion, but won't stop a search

Deleting a Record
• Records may also be deleted from a hash table.
• But the location must not be left as an ordinary "empty spot" since that could interfere with searches.
• The location must be marked in some special way so that a search can tell that the spot used to have something in it.
[Figure: one record in the array says "Please delete me."]
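To see why an ordinary "empty" mark interferes with searches, the sketch below deletes the record at [3] from the table in the figures (the choice of [3] is an assumption made for this example) and then searches for Number 701466868, which was pushed forward to [5] by a collision. The marker values -1 and -2 match the constants used in the dictionary class later in these slides; contains, figure_table and found_after_deleting_slot3 are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int NEVER_USED = -1;        // slot has never held a record
const int PREVIOUSLY_USED = -2;   // slot held a record that was deleted

// The 701-slot table from the figures, after both insertions.
std::vector<int> figure_table() {
    std::vector<int> table(701, NEVER_USED);
    table[1] = 281942902;
    table[2] = 233667136;
    table[3] = 580625685;
    table[4] = 506643548;
    table[5] = 701466868;
    table[700] = 155778322;
    return table;
}

// Linear-probing search that stops only at a never-used slot.
bool contains(const std::vector<int>& table, int key) {
    assert(key >= 0);
    std::size_t index = static_cast<std::size_t>(key) % table.size();
    for (std::size_t count = 0; count < table.size(); ++count) {
        if (table[index] == key) return true;
        if (table[index] == NEVER_USED) return false;
        index = (index + 1) % table.size();
    }
    return false;
}

// Deletes the record at [3] by writing marker there, then searches for
// Number 701466868, whose probe path runs through [2], [3], [4] to [5].
bool found_after_deleting_slot3(int marker) {
    std::vector<int> table = figure_table();
    table[3] = marker;
    return contains(table, 701466868);
}
```

With NEVER_USED as the marker the search stops at [3] and wrongly reports the key as missing; with PREVIOUSLY_USED it continues and finds the record at [5].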

A class specification for a hashing dictionary
• Public functions:
  – constructor: creates and initializes empty dictionary
  – insert: inserts a new item
  – is_present: returns true if specified item is found in dictionary, false if not
  – find: returns a copy of the desired item, if found
  – remove: removes specified record if it exists
  – size: returns total number of records in dictionary

Invariant for dictionary class
• Member variable used stores the number of records currently in the dictionary
• Member variable data is an array of CAPACITY entries; actual records are stored here
• Each valid record has a non-negative key value; an unused record has its key field set to the constant NEVER_USED or the constant PREVIOUSLY_USED

Code for dictionary class
template <class RecType>
class Dictionary
{
public:
    enum {CAPACITY = 811};
    Dictionary( );
    void insert (const RecType& entry);
    void remove (int key);
    bool is_present(int key) const;
    void find (int key, bool& found, RecType& result) const;
    size_t size( ) const {return used;}

Code for dictionary class
    …
private:
    static const int NEVER_USED = -1;        // static: one shared constant, usable from const member functions
    static const int PREVIOUSLY_USED = -2;
    RecType data[CAPACITY];
    size_t used;
    …

Helper functions in dictionary class
• hash: calculates hash value for given key
• next_index: steps through array, providing wraparound function at end of array
• find_index: finds array index of record with given key
• never_used: returns true if index has never been used
• is_vacant: returns true if index is not currently in use

Code for dictionary class
    …
    // helper functions:
    size_t hash (int key) const {return key % CAPACITY;}
    size_t next_index (size_t index) const {return (index+1) % CAPACITY;}
    void find_index (int key, bool& found, size_t& index) const;
    bool never_used (size_t index) const {return data[index].key == NEVER_USED;}
    bool is_vacant(size_t index) const {return data[index].key < 0;}
};
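The wraparound in next_index is what lets a search that starts near the end of the array continue at [0]. A small sketch, using the same formula and the same CAPACITY of 811 as the class; probe_sequence is a hypothetical helper added only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t CAPACITY = 811;   // same value as the class's CAPACITY

// Same formula as the class's next_index helper.
std::size_t next_index(std::size_t index) {
    assert(index < CAPACITY);
    return (index + 1) % CAPACITY;
}

// Hypothetical helper: the first n slots visited when probing from start.
std::vector<std::size_t> probe_sequence(std::size_t start, std::size_t n) {
    std::vector<std::size_t> visited;
    std::size_t index = start;
    for (std::size_t i = 0; i < n; ++i) {
        visited.push_back(index);
        index = next_index(index);
    }
    return visited;
}
```

Starting at 809, the first four slots visited are 809, 810, 0 and 1.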

Function implementations
// constructor
template <class RecType>
Dictionary<RecType>::Dictionary( )
{
    used = 0;
    for (int x=0; x<CAPACITY; x++)
        data[x].key = NEVER_USED;
}

Function implementations
// helper function find_index
template <class RecType>
void Dictionary<RecType>::find_index(int key, bool& found, size_t& index) const   // const, to match the declaration in the class
{
    size_t count=0;
    index = hash(key);
    // stop at the key, at a never-used slot, or after examining every slot
    while ((count < CAPACITY) && (!never_used(index)) && (data[index].key != key))
    {
        count++;
        index = next_index(index);
    }
    found = (data[index].key == key);
}

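Function implementations
template <class RecType>
void Dictionary<RecType>::insert (const RecType& entry)
{
    bool already_present;   // true if entry already in table
    size_t index;           // location of new entry
    assert (entry.key >= 0);   // must be valid key
    find_index(entry.key, already_present, index);
    if (!already_present)
    {
        assert (size( ) < CAPACITY);
        // find_index stops only at a never-used slot, so probe again for the
        // first vacant slot: this reuses deleted slots and cannot overwrite a live record
        index = hash(entry.key);
        while (!is_vacant(index))
            index = next_index(index);
        data[index] = entry;
        used++;
    }
}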

Function implementations
template <class RecType>
void Dictionary<RecType>::remove (int key)
{
    bool found;     // true if key occurs somewhere in table
    size_t index;   // index of key value
    assert (key >= 0);   // must be valid key
    find_index(key, found, index);
    if (found)
    {
        data[index].key = PREVIOUSLY_USED;
        used--;
    }
}

Function implementations
template <class RecType>
bool Dictionary<RecType>::is_present(int key) const   // const, to match the declaration in the class
{
    bool found;
    size_t index;
    assert (key >= 0);
    find_index (key, found, index);
    return found;
}

Function implementations
template <class RecType>
void Dictionary<RecType>::find(int key, bool& found, RecType& result) const
{
    size_t index;
    assert (key >= 0);
    find_index(key, found, index);
    if (found)
        result = data[index];
}
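To see the pieces working together, here is a condensed, self-contained restatement of the class with a short usage function. Several things are assumptions made for the example rather than part of the slides: the Student record type, the small CAPACITY of 11 (so that collisions are easy to provoke), the demo function, and an insert that places new records in the first vacant slot found with is_vacant.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical record type; the dictionary only requires an int member named key.
struct Student {
    int key;
    std::string name;
};

template <class RecType>
class Dictionary {
public:
    static const std::size_t CAPACITY = 11;   // small on purpose, for the demo
    Dictionary() : used(0) {
        for (std::size_t i = 0; i < CAPACITY; ++i) data[i].key = NEVER_USED;
    }
    void insert(const RecType& entry) {
        assert(entry.key >= 0);
        bool present;
        std::size_t index;
        find_index(entry.key, present, index);
        if (present) return;               // key already stored: nothing to do
        assert(used < CAPACITY);
        index = hash(entry.key);
        while (!is_vacant(index)) index = next_index(index);
        data[index] = entry;
        ++used;
    }
    void remove(int key) {
        assert(key >= 0);
        bool found;
        std::size_t index;
        find_index(key, found, index);
        if (found) { data[index].key = PREVIOUSLY_USED; --used; }
    }
    bool is_present(int key) const {
        assert(key >= 0);
        bool found;
        std::size_t index;
        find_index(key, found, index);
        return found;
    }
    void find(int key, bool& found, RecType& result) const {
        assert(key >= 0);
        std::size_t index;
        find_index(key, found, index);
        if (found) result = data[index];
    }
    std::size_t size() const { return used; }
private:
    static const int NEVER_USED = -1;
    static const int PREVIOUSLY_USED = -2;
    RecType data[CAPACITY];
    std::size_t used;
    std::size_t hash(int key) const { return static_cast<std::size_t>(key) % CAPACITY; }
    std::size_t next_index(std::size_t i) const { return (i + 1) % CAPACITY; }
    bool is_vacant(std::size_t i) const { return data[i].key < 0; }
    void find_index(int key, bool& found, std::size_t& index) const {
        std::size_t count = 0;
        index = hash(key);
        while (count < CAPACITY && data[index].key != NEVER_USED && data[index].key != key) {
            ++count;
            index = next_index(index);
        }
        found = (data[index].key == key);
    }
};

// Keys 2 and 13 both hash to slot 2 when CAPACITY is 11, so 13 lands in slot 3.
// Removing 2 leaves a PREVIOUSLY_USED mark, and 13 must still be reachable.
bool demo_lookup_survives_delete() {
    Dictionary<Student> roster;
    roster.insert(Student{2, "Ada"});
    roster.insert(Student{13, "Grace"});
    roster.remove(2);
    return roster.is_present(13) && !roster.is_present(2) && roster.size() == 1;
}
```

The demo function returns true: the search for 13 passes over the PREVIOUSLY_USED mark in slot 2 instead of stopping there.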