Hash Tables and Associative Containers CS212 Dick Steflik
Hash Tables and Associative Containers CS-212 Dick Steflik
Hash Tables • a hash table is an array of size Tsize – has index positions 0. . Tsize-1 • two types of hash tables – open hash table • array element type is a <key, value> pair • all items stored in the array – chained hash table • element type is a pointer to a linked list of nodes containing <key, value> pairs • items are stored in the linked list nodes • keys are used to generate an array index – home address (0. . Tsize-1)
Faster Searching • "balanced" search trees guarantee O(log 2 n) search path by controlling height of the search tree – AVL tree – 2 -3 -4 tree – red-black tree (used by STL associative container classes) • hash table allows for O(1) search performance – search time does not increase as n increases
Hash Table • a hash table is an array/vector (fixed size) – has index positions 0. . Tsize-1 • if we could use the keys as an index we would have O(1) retrieval – hash. Table[key] • keys are used to generate an array index – home address (0. . Tsize-1) – function to do this is called a hash function • hash(key) returns an int value • hash(key) % Tsize => 0. . Tsize - 1
Collisions • Collisions occur whenever two keys produce the same index (hash to the same location • Design goal: pick a hash function that produces no collisions • Away of life with hash tables • What do you do? – linear probing: check the next location, if its empty use it – quadratic probing: check next, then 2 away, then 4 away. . .
a Hash Table of size 7 some insertions: hash(K 1) % 7 => 3 hash(K 2) % 7 => 5 hash(K 3) % 7 => 2 hash(K 4) % 7 => 3 hash(K 5) % 7 => 2 hash(K 6) % 7 => 4 linear probe open addressing collision resolution strategy 0 1 2 3 4 5 6 T T T T key value empty
Search Performance 0 1 2 3 4 5 6 F K 6 info T F K 3 info F K 1 info F K 4 info F K 2 info F K 5 info average number of probes needed to retrieve the value with key K? K home address K 1 3 K 2 5 K 3 2 K 4 3 K 5 2 K 6 4 #probes 1 1 1 2 5 4 14/6 = 2. 33 (successful) unsuccessful search?
Chaining with Separate Lists hash(K 1) % 7 => 3 hash(K 2) % 7 => 5 hash(K 3) % 7 => 2 hash(K 4) % 7 => 3 hash(K 5) % 7 => 2 hash(K 6) % 7 => 4 linked lists of synonyms 0 1 2 3 4 5 6 K 3 info K 1 info K 6 info K 2 info K 5 info K 4 info
Search Performance 0 1 2 3 4 5 6 average number of probes needed to retrieve the value with key K? K 3 info K 1 info K 6 info K 2 info K 5 info K 4 info K home address K 1 3 K 2 5 K 3 2 K 4 3 K 5 2 K 6 4 #probes 1 1 1 2 2 1 8/6 = 1. 33 (successful) unsuccessful search?
Where are Hash Tables used? Databases Spelling checkers Java uses them all over the place (built into the language) most scripting languages (ASP, PERL, PHP) have associative arrays • Caching Schemes • • – software – browsers, http proxy servers, DNS servers – hardware – memory caching, instruction caching
Deletions? • search for item to be deleted • chained hash table – delete a node from a linked list • open hash table – just mark spot as "empty"? – must mark vacated spot as “deleted” – is different than “empty”
Hash Functions • a hash function is used to map a key to an array index (home address) – search starts from here • insert, retrieve, delete all start by applying the hash function to the key • goals for a hash function – fast to compute – even distribution over the entire collection of keys • all hash functions produce collisions – multiple keys hash to same home address
Some Hash Functions. . . • Division – works good in most cases as long as keys are relatively random – H(key) = key mod m – if key is an integer • identity function ( return key) – good if keys are random – not good if keys have similar characteristics • ex m = 25 • all keys divisible by 5 would map into positions 0, 5, 10, 15… causing clustering around those values
more Hash functions. . . • Mid-Squared – produces a nearly random distribution of indices – mid-square technique takes longer to compute but gives better distribution when keys may have some digits in common – convert key to an octal string • A-Z = 018 - 328 and 0 -9 = 338 - 448 – ex key = A 1 = 1348 • 1348 * 1348 = 204208 • using a table of 1024 elements use middle 10 bits as the index – index = 1000102 = 54610 – 001000100002 – note - most collisions will occur for short identifiers
more Hash functions. . . • Digit Folding – assume a 5 digit decimal string (digits 0 -9 only) – H(key) = d 1 + d 2 + d 3 + d 4 + d 5 (sum of digits) • this would yield 0 <= h <= 45 for all possible keys • if we were to fold the digits in pairs – H(key) = d 1 d 2 + d 3 d 4 + d 5 – 0 <= h <= 207 (99 + 9) • Double hashing – use two (or more) hash functions serially – helps overcome effects of a function that produces a poor distribution of keys
Clustering • Undesireable function of the hash function selected and the collision resolution strategy – too many keys has to the same location causing long string of keys that need to be searched • especially bad using a divide based function and using linear probing • insertion/deletion/search can approach O(n) • Pick a different hash function • Pick a different collision resolution strategy
Factors Affecting Search Performance • quality of hash function – how uniform? – depends on actual data • collision resolution strategy used • load factor of the Hash. Table – N/Tsize – the lower the load factor the better the search performance
Successful Search Performance open addressing chaining (linear probing) (double hashing) load factor 0. 5 0. 7 0. 9 1. 0 2. 0 1. 50 2. 17 5. 50 ------- 1. 39 1. 72 2. 56 ------- 1. 25 1. 35 1. 45 1. 50 2. 00
Summary of Hash tables • search speed depends on load factor and quality of hash function – should be less than. 75 for open addressing – can be more than 1 for chaining • items not kept sorted by key • very good for fast access to unordered data with known upper bound – to pick a good TSize
- Slides: 19