Hashtables

Picture of a hashtable
• You can think of this as a dictionary, with words and definitions.

KEY (e.g. student id)   VALUE (e.g. student name)
089                     JOHN
045                     DAVE
939                     STEVE

A basic problem
• We have to store some records and perform the following operations:
  – add a new record
  – delete a record
  – search for a record by key
• Find a way to do these efficiently!

Unsorted array
• Use an array to store the records, in unsorted order
  – add: append the record as the last entry. Fast, O(1)
  – delete a target: slow at finding the target, fast at filling the hole (just move the last entry into it). O(n) overall
  – search: sequential search. Slow, O(n)
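
The "fill the hole with the last entry" trick can be sketched as follows. This is a minimal illustration, assuming records are (key, value) pairs in a Python list; the function name delete_unsorted is hypothetical.

```python
def delete_unsorted(records, key):
    """Remove the record with the given key; return True if it was found."""
    for i, (k, _) in enumerate(records):   # sequential search: O(n)
        if k == key:
            records[i] = records[-1]       # fill the hole with the last entry
            records.pop()                  # dropping the last entry is O(1)
            return True
    return False

records = [(89, "JOHN"), (45, "DAVE"), (939, "STEVE")]
delete_unsorted(records, 89)
print(records)   # [(939, 'STEVE'), (45, 'DAVE')]
```

The order of the remaining records changes, which is acceptable only because the array is unsorted to begin with.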

Sorted array
• Use an array to store the records, keeping them in sorted order
  – add: insert the record in its proper position. Much record movement. Slow, O(n)
  – delete a target: the hole left by the deletion must be closed by shifting the later records. Much record movement. Slow, O(n)
  – search: binary search. Fast, O(log n)
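
A short sketch of the sorted-array trade-off, using Python's standard bisect module. Storing bare integer keys (rather than full records) and the helper name contains are simplifications made for this example.

```python
import bisect

keys = [45, 89, 939]                # kept in sorted order

def contains(sorted_keys, key):
    """Binary search: O(log n) comparisons."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

# insort finds the position by binary search, but the list still has to
# shift every later entry one place to the right, so the add is O(n).
bisect.insort(keys, 500)
print(keys)                 # [45, 89, 500, 939]
print(contains(keys, 89))   # True
```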

Linked list
• Store the records in a linked list (unsorted)
  – add: fast, since a node can be inserted anywhere. O(1)
  – delete a target: fast at disposing of the node, but slow at finding the target. O(n)
  – search: sequential search. Slow, O(n) (with only a linked list, we cannot use binary search even if the list is sorted)

More approaches
• These have better performance but are more complex
  – Hash table
  – Tree (BST, Heap, …)

What is a Hash Table?
• The simplest kind of hash table is an array of records.
• This example has room for 701 records, at indices [0] through [700].
• Each record has a special field, called its key.
• In this example, the key is a long integer field called Number. For instance, the record at [4] has Number 506643548.
• The number might be a person's identification number, and the rest of the record has information about the person.
• When a hash table is in use, some spots contain valid records, and other spots are "empty".
[Figure: an array of records. [1] Number 281942902, [2] Number 233667136, [4] Number 506643548, [700] Number 155778322; the other spots shown are empty.]

Inserting a New Record
• In order to insert a new record, the key must somehow be converted to an array index.
• The index is called the hash value of the key.
• Typical way to create a hash value: (Number mod 701)
• What is (580625685 mod 701)? It is 3.
• The hash value is used for the location of the new record, so the record with Number 580625685 goes in [3].
[Figure: the array after the insertion. [1] Number 281942902, [2] Number 233667136, [3] Number 580625685, [4] Number 506643548, [700] Number 155778322.]
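
The hash values in this example can be checked directly. The sketch below applies (Number mod 701) to the five keys shown in the figure; the names TABLE_SIZE and hash_value are chosen for this illustration.

```python
TABLE_SIZE = 701

def hash_value(number):
    """Convert a key to an array index in the range 0..700."""
    return number % TABLE_SIZE

for number in (281942902, 233667136, 506643548, 155778322, 580625685):
    print(number, "->", hash_value(number))
# 281942902 -> 1
# 233667136 -> 2
# 506643548 -> 4
# 155778322 -> 700
# 580625685 -> 3
```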

Collisions
• Here is another new record to insert, Number 701466868, with a hash value of 2.
• This is called a collision, because there is already another valid record at [2].
• When a collision occurs, move forward until you find an empty spot.
• The new record goes in the empty spot. Here [3] and [4] are also occupied, so it goes in [5].
[Figure: the array after the insertion. [1] Number 281942902, [2] Number 233667136, [3] Number 580625685, [4] Number 506643548, [5] Number 701466868, [700] Number 155778322.]
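
A minimal sketch of "move forward until you find an empty spot", assuming None marks an empty spot and the table is never completely full (otherwise the loop would not end). The function name insert is an assumption of this example.

```python
TABLE_SIZE = 701
table = [None] * TABLE_SIZE              # None marks an empty spot

def insert(table, number):
    """Store number at its hash value, or at the next empty spot after it."""
    index = number % len(table)
    while table[index] is not None:          # collision: spot already taken
        index = (index + 1) % len(table)     # move forward, wrapping at the end
    table[index] = number
    return index

for n in (281942902, 233667136, 506643548, 155778322, 580625685):
    insert(table, n)

print(insert(table, 701466868))   # 5: hashes to 2, but [2], [3], [4] are taken
```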

Searching for a Key
• The data that's attached to a key can be found fairly quickly.
• Calculate the hash value.
• Check that location of the array for the key.
• Keep moving forward until you find the key, or you reach an empty spot.
• When the item is found, the information can be copied to the necessary location.
[Figure: searching for Number 701466868. Its hash value is 2; the records at [2], [3] and [4] do not match ("Not me."), and the record at [5] does ("Yes!").]
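
The search steps can be sketched as below. The table is filled by direct assignment to match the figure, and the function name search is hypothetical; the loop is capped at one full pass so that it also ends on a full table.

```python
def search(table, number):
    """Return the index holding number, or None if it is not in the table."""
    index = number % len(table)
    for _ in range(len(table)):              # at most one full pass
        if table[index] is None:             # empty spot: the key is absent
            return None
        if table[index] == number:
            return index
        index = (index + 1) % len(table)     # keep moving forward
    return None

table = [None] * 701
table[1], table[2], table[3] = 281942902, 233667136, 580625685
table[4], table[5], table[700] = 506643548, 701466868, 155778322

print(search(table, 701466868))   # 5: probes [2], [3], [4], then finds it at [5]
print(search(table, 2105))        # None: probes [2]..[5], then hits empty [6]
```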

Deleting a Record
• Records may also be deleted from a hash table.
• But the location must not be left as an ordinary "empty spot" since that could interfere with searches.
• The location must be marked in some special way so that a search can tell that the spot used to have something in it.
[Figure: the record at [4], Number 506643548, asks "Please delete me." After the deletion the array holds [1] Number 281942902, [2] Number 233667136, [3] Number 580625685, [5] Number 701466868, [700] Number 155778322.]
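
One common way to mark the location "in some special way" is a sentinel value. The sketch below assumes a DELETED marker object distinct from the EMPTY value; both names, and the functions find and delete, are choices made for this illustration.

```python
EMPTY = None
DELETED = object()     # special marker: "this spot used to hold a record"

def find(table, number):
    """Return the index holding number, or None if it is absent."""
    index = number % len(table)
    for _ in range(len(table)):
        slot = table[index]
        if slot is EMPTY:                    # a truly empty spot ends the search
            return None
        if slot is not DELETED and slot == number:
            return index
        index = (index + 1) % len(table)     # walk past records and markers
    return None

def delete(table, number):
    """Replace the record with the DELETED marker; return True if it was found."""
    index = find(table, number)
    if index is None:
        return False
    table[index] = DELETED
    return True

table = [EMPTY] * 701
table[1], table[2], table[3] = 281942902, 233667136, 580625685
table[4], table[5], table[700] = 506643548, 701466868, 155778322

delete(table, 506643548)          # marks [4] as deleted
print(find(table, 701466868))     # 5: the search continues past the marker at [4]
```

Had [4] been reset to EMPTY instead, the search for 701466868 would stop at [4] and wrongly report that the key is absent.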

Array as table
• Consider this problem. We want to store 1,000 student records and search them by student id.

studid    name    score
0012345   andy    81.5
0033333   betty   90
0056789   david   56.8
...
9801010   peter   20
9802020   mary    100
...
9903030   tom     73
9908080   bill    49

Array as table
• One way is to store the records in a huge array (index 0..9999999). The index is used as the student id, i.e. the record of the student with studid 0012345 is stored at A[12345].
• Is this a good idea? If I have 70 friends and want to store their mobile phone numbers, I do not want an array 1000000 in size. I could use a table with about 140 slots in it.

index     name    score
0
:
12345     andy    81.5
:
33333     betty   90
:
56789     david   56.8
:
9908080   bill    49
:
9999999

Array as table
• It is also called a direct-address hash table.
• Each slot, or position, corresponds to a key in U, the universe of possible keys.
• If there's an element x with key k, then T[k] contains a pointer to x.
• Otherwise, T[k] is empty, represented by NIL.

Array as table
• Store the records in a huge array where the index corresponds to the key
  – add: very fast, O(1)
  – delete: very fast, O(1)
  – search: very fast, O(1)
• But it wastes a lot of memory! Not feasible.
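
A small sketch of the direct-address idea, to make the memory cost concrete. The class name DirectAddressTable and its method names are hypothetical, and None stands in for NIL.

```python
class DirectAddressTable:
    """Array indexed directly by key; every possible key gets its own slot."""

    def __init__(self, universe_size):
        self.slots = [None] * universe_size   # None plays the role of NIL

    def add(self, key, record):
        self.slots[key] = record              # O(1)

    def delete(self, key):
        self.slots[key] = None                # O(1)

    def search(self, key):
        return self.slots[key]                # O(1); None if absent

table = DirectAddressTable(10_000_000)        # one slot per 7-digit studid
table.add(12345, ("andy", 81.5))
table.add(9908080, ("bill", 49))

used = sum(slot is not None for slot in table.slots)
print(used, "of", len(table.slots), "slots in use")   # 2 of 10000000 slots in use
```

Even with all 1,000 student records stored, only 0.01% of the slots would be used.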

Hash function
function Hash(key: KeyType): integer;
• Imagine that we have such a magic function Hash. It maps the keys (studid) of the 1000 records into the integers 0..999, one to one. No two different keys map to the same number.

H('0012345') = 134
H('0033333') = 67
H('0056789') = 764
…
H('9908080') = 3

Hash Table
• To store a record, we compute Hash(stud_id) for the record and store it at the location Hash(stud_id) of the array.
• To search for a student, we only need to peek at the location Hash(target stud_id).

index   studid    name    score
0
:
3       9908080   bill    49
:
67      0033333   betty   90
:
134     0012345   andy    81.5
:
764     0056789   david   56.8
:
999

Hash Table with Perfect Hash
• Such a magic function is called a perfect hash
  – add: very fast, O(1)
  – delete: very fast, O(1)
  – search: very fast, O(1)
• But it is generally difficult to design a perfect hash (e.g. when the potential key space is large).

Hash function
function Hash(key: KeyType): integer;
• A hash function maps a key to an index within a range
• Desirable properties:
  – simple and quick to calculate
  – even distribution, avoiding collisions as much as possible

Division Method
• $h(k) = k \bmod m$
• Certain values of m may not be good:
  – When $m = 2^p$, $h(k)$ is just the p lowest-order bits of the key
  – Good values for m are prime numbers which are not close to exact powers of 2.
• For example, if you want to store 2000 elements, then m = 701 (m = hash table length) yields the hash function $h(k) = k \bmod 701$. Since 2000 elements cannot fit one per slot in 701 slots, this choice assumes that several elements may share a slot, about 3 per slot on average.
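
The effect of a power-of-two table length can be seen with a few keys that differ only in their high-order bits. The keys below are made up for this illustration.

```python
def h(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

p = 4
keys = [0b10110101, 0b01100101, 0b11110101]   # same 4 low-order bits, 0101

print([h(k, 2 ** p) for k in keys])   # [5, 5, 5]: only the low 4 bits matter
print([h(k, 701) for k in keys])      # [181, 101, 245]: all bits contribute
```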

Collision
• In most cases, we cannot avoid collisions
• Collision resolution: how to handle the case when two different keys map to the same index

H('0012345') = 134
H('0033333') = 67
H('0056789') = 764
…
H('9903030') = 3
H('9908080') = 3

Solutions to Collision
• The problem arises because we have two keys that hash to the same array entry, a collision. There are two ways to resolve collisions:
  – Hashing with Chaining: every hash table entry contains a pointer to a linked list of the keys that hash to that entry
  – Hashing with Open Addressing: every hash table entry contains only one key. If a new key hashes to a table entry which is filled, systematically examine other table entries until you find an empty entry in which to place the new key

Chained Hash Table
• One way to handle collisions is to store the collided records in a linked list. The array now stores pointers to such lists. If no key maps to a certain hash value, that array entry points to nil.
• Which index has the collisions?
[Figure: an array with entries 0, 1, 2, 3, 4, 5, ..., HASHMAX, each holding nil or a pointer to a linked list of records; one node shown is Key: 9903030, name: tom, score: 73.]

Chained Hash Table
• Put all elements that hash to the same slot into a linked list.
• Slot j contains a pointer to the head of the list of all stored elements that hash to j.
• If there are no such elements, slot j contains NIL.

Chained Hash table
• Hash table, where collided records are stored in a linked list
  – With a good hash function and an appropriate hash size:
    • Few collisions. Add, delete, search are very fast, O(1)
  – Otherwise:
    • some hash values have a long list of collided records
    • add: just insert at the head. Fast, O(1)
    • delete a target: delete from an unsorted linked list. Slow, O(n)
    • search: sequential search. Slow, O(n)
• Consider the two extremes.
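
A minimal sketch of a chained hash table with insertion at the head of each list. The class and method names are hypothetical, and the hash function (key mod 101) is an assumption of this example; the slides leave H unspecified. With 101 slots, the keys 9903030 and 9908080 happen to collide at index 81.

```python
class Node:
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node

class ChainedHashTable:
    def __init__(self, size):
        self.heads = [None] * size            # None plays the role of nil

    def add(self, key, value):
        j = key % len(self.heads)
        self.heads[j] = Node(key, value, self.heads[j])   # insert at head: O(1)

    def search(self, key):
        node = self.heads[key % len(self.heads)]
        while node is not None:               # sequential search of one chain
            if node.key == key:
                return node.value
            node = node.next
        return None

    def delete(self, key):
        j = key % len(self.heads)
        prev, node = None, self.heads[j]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[j] = node.next # unlink the head node
                else:
                    prev.next = node.next     # unlink an inner node
                return True
            prev, node = node, node.next
        return False

t = ChainedHashTable(101)
t.add(9903030, ("tom", 73))
t.add(9908080, ("bill", 49))                  # same slot (81) as 9903030
print(t.search(9903030))                      # ('tom', 73)
```

This add does not check whether the key is already present; a version that rejects duplicates would have to search the chain first.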

Open Addressing
• An alternative to chaining for handling collisions.
• Store all keys in the hash table itself.
• Each slot contains either a key or NIL.
• To search for key k:
  – Compute h(k) and examine slot h(k). Examining a slot is known as a probe.
  – If slot h(k) contains key k, the search is successful. If this slot contains NIL, the search is unsuccessful.
  – There's a third possibility: slot h(k) contains a key that is not k. We compute the index of some other slot, based on k and on which probe (counting from 0: 0th, 1st, 2nd, etc.) we're on.
  – Keep probing until we either find key k (successful search) or we find a slot holding NIL (unsuccessful search).

How to compute probe sequences
• Linear probing: Given an auxiliary hash function $h$, the probe sequence starts at slot $h(k)$ and continues sequentially through the table, wrapping after slot $m - 1$ to slot 0. Given key $k$ and probe number $i$ ($0 \le i < m$), $h(k, i) = (h(k) + i) \bmod m$.
• Quadratic probing: As in linear probing, the probe sequence starts at $h(k)$. Unlike linear probing, it moves away from the original probe point by a quadratic offset: $h(k, i) = (h(k) + c_1 i + c_2 i^2) \bmod m$. With $c_1 = 0$ and $c_2 = 1$, it examines the cells 1, 4, 9, and so on, away from the original probe point.
• Double hashing: Use two auxiliary hash functions, $h_1$ and $h_2$. $h_1$ gives the initial probe, and $h_2$ gives the remaining probes: $h(k, i) = (h_1(k) + i\,h_2(k)) \bmod m$.
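
The three formulas can be written as small functions that take the already-computed hash values as arguments. The function names, the table size m = 8, and the sample hash values below are assumptions made for illustration.

```python
def linear_probe(h, i, m):
    """h(k, i) = (h(k) + i) mod m"""
    return (h + i) % m

def quadratic_probe(h, i, m, c1=0, c2=1):
    """h(k, i) = (h(k) + c1*i + c2*i^2) mod m"""
    return (h + c1 * i + c2 * i * i) % m

def double_hash_probe(h1, h2, i, m):
    """h(k, i) = (h1(k) + i*h2(k)) mod m"""
    return (h1 + i * h2) % m

m = 8
print([linear_probe(0, i, m) for i in range(4)])          # [0, 1, 2, 3]
print([quadratic_probe(0, i, m) for i in range(4)])       # [0, 1, 4, 1]
print([double_hash_probe(0, 3, i, m) for i in range(4)])  # [0, 3, 6, 1]
```

The quadratic sequence revisits slot 1 on its fourth probe: with m = 8, c1 = 0 and c2 = 1, the offsets i² mod 8 only ever take the values 0, 1 and 4, so not every slot is reachable.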

Open Addressing Example
• Hash(89, 10) = 9
• Hash(18, 10) = 8
• Hash(49, 10) = 9
• Hash(58, 10) = 8
• Hash(9, 10) = 9
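
Inserting these five keys, in the order listed, into a table of size 10 with linear probing can be sketched as follows. The insertion order and the function name insert_linear are assumptions of this example.

```python
def insert_linear(table, k):
    """Insert k using h(k, i) = (k mod m + i) mod m; return the slot used."""
    m = len(table)
    for i in range(m):
        slot = (k % m + i) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("table is full")

table = [None] * 10
slots = [insert_linear(table, k) for k in (89, 18, 49, 58, 9)]
print(slots)   # [9, 8, 0, 1, 2]
print(table)   # [49, 58, 9, None, None, None, None, None, 18, 89]
```

The three keys that collide pile up in slots 0 to 2, right after the occupied slots 8 and 9.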

Linear Probing: $h(k, i) = (h(k) + i) \bmod m$
• In linear probing, collisions are resolved by sequentially scanning an array (with wraparound) until an empty cell is found.
• In the following example, the table size is m = 8, and the keys hash as follows:

h(k):   0        1        2        3        4        5        6           7
k:      A, P, Q  B, O, R  C, N, S  D, M, T  E, L, U  F, K, V  G, J, W, Z  H, I, X, Y

• Table contents after each action (± marks a deleted slot, . an empty slot):

Action     0  1  2  3  4  5  6  7   # probes
Store A    A  .  .  .  .  .  .  .   1
Store C    A  .  C  .  .  .  .  .   1
Store D    A  .  C  D  .  .  .  .   1
Store G    A  .  C  D  .  .  G  .   1
Store P    A  P  C  D  .  .  G  .   2
Store Q    A  P  C  D  Q  .  G  .   5
Delete P   A  ±  C  D  Q  .  G  .
Delete Q   A  ±  C  D  ±  .  G  .
Store B    A  B  C  D  ±  .  G  .   1
Store R    A  B  C  D  R  .  G  .   4
Store Q    A  B  C  D  R  Q  G  .   6

Choosing a Hash Function
• Notice that the insertion of Q required several probes (5). This was caused by A and P mapping to slot 0, which is beside the C and D keys.
• The performance of the hash table depends on having a hash function which evenly distributes the keys.
• Choosing a good hash function is a black art.

Clustering
• Even with a good hash function, linear probing has its problems:
  – The position of the initial mapping $i_0$ of key k is called the home position of k.
  – When several insertions map to the same home position, they end up placed contiguously in the table. This collection of keys with the same home position is called a cluster.
  – As clusters grow, the probability that a key will map to the middle of a cluster increases, increasing the rate of the cluster's growth. This tendency of linear probing to place items together is known as primary clustering.
  – As these clusters grow, they merge with other clusters, forming even bigger clusters which grow even faster.

Quadratic Probing Example
• Hash(89, 10) = 9
• Hash(18, 10) = 8
• Hash(49, 10) = 9
• Hash(58, 10) = 8
• Hash(9, 10) = 9
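
The same five keys can be inserted with quadratic probing, here assuming c1 = 0 and c2 = 1 so that the i-th probe is (k mod 10 + i²) mod 10. The insertion order and the function name insert_quadratic are assumptions of this sketch.

```python
def insert_quadratic(table, k):
    """Insert k using h(k, i) = (k mod m + i*i) mod m; return the slot used."""
    m = len(table)
    for i in range(m):
        slot = (k % m + i * i) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("no empty slot found on the probe sequence")

table = [None] * 10
slots = [insert_quadratic(table, k) for k in (89, 18, 49, 58, 9)]
print(slots)   # [9, 8, 0, 2, 3]
print(table)   # [49, None, 58, 9, None, None, None, None, 18, 89]
```

Key 58 jumps from slot 8 over slot 9 to slot 2, and key 9 jumps from slot 9 over slot 0 to slot 3, instead of filling the next free neighbor.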

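Quadratic Probing: $h(k, i) = (h(k) + c_1 i + c_2 i^2) \bmod m$
• Quadratic probing eliminates the primary clustering problem of linear probing by examining certain cells away from the original probe point.
• In the following example, the table size is m = 8, and $c_1 = 0$, $c_2 = 1$. The keys hash as follows:

h(k):   0        1        2        3        4        5        6           7
k:      A, P, Q  B, O, R  C, N, S  D, M, T  E, L, U  F, K, V  G, J, W, Z  H, I, X, Y

• Table contents after each action (± marks a deleted slot, . an empty slot; the probe count for linear probing is given in parentheses where it differs):

Action     0  1  2  3  4  5  6  7   # probes
Store A    A  .  .  .  .  .  .  .   1
Store C    A  .  C  .  .  .  .  .   1
Store D    A  .  C  D  .  .  .  .   1
Store G    A  .  C  D  .  .  G  .   1
Store P    A  P  C  D  .  .  G  .   2
Store Q    A  P  C  D  Q  .  G  .   3 (5)
Delete P   A  ±  C  D  Q  .  G  .
Delete Q   A  ±  C  D  ±  .  G  .
Store B    A  B  C  D  ±  .  G  .   1
Store R    A  B  C  D  ±  R  G  .   3 (4)
Store Q    A  B  C  D  Q  R  G  .   3 (6)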
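
The probe counts for the letter example can be reproduced with a short simulation that runs the same actions under linear and quadratic probing. This is a sketch under stated assumptions: deleted slots are marked with "±" and reused by later stores, and the names HOME, ACTIONS, store, delete and trace are invented for this illustration.

```python
M = 8
EMPTY, DELETED = None, "±"
HOME = {"A": 0, "P": 0, "Q": 0, "B": 1, "R": 1, "C": 2, "D": 3, "G": 6}
ACTIONS = [("store", "A"), ("store", "C"), ("store", "D"), ("store", "G"),
           ("store", "P"), ("store", "Q"), ("delete", "P"), ("delete", "Q"),
           ("store", "B"), ("store", "R"), ("store", "Q")]

def linear(h, i):
    return (h + i) % M

def quadratic(h, i):                     # c1 = 0, c2 = 1
    return (h + i * i) % M

def store(table, key, probe):
    """Put key in the first empty or deleted slot; return the probe count."""
    for i in range(M):
        slot = probe(HOME[key], i)
        if table[slot] in (EMPTY, DELETED):
            table[slot] = key
            return i + 1
    raise RuntimeError("no free slot found")

def delete(table, key, probe):
    """Replace key with the deleted marker, if it is present."""
    for i in range(M):
        slot = probe(HOME[key], i)
        if table[slot] is EMPTY:
            return
        if table[slot] == key:
            table[slot] = DELETED
            return

def trace(probe):
    table, counts = [EMPTY] * M, []
    for action, key in ACTIONS:
        if action == "store":
            counts.append(store(table, key, probe))
        else:
            delete(table, key, probe)
    return table, counts

print(trace(linear)[1])      # [1, 1, 1, 1, 2, 5, 1, 4, 6]
print(trace(quadratic)[1])   # [1, 1, 1, 1, 2, 3, 1, 3, 3]
```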

Double Hashing
• Double hashing: Use two auxiliary hash functions, $h_1$ and $h_2$. $h_1$ gives the initial probe, and $h_2$ gives the remaining probes: $h(k, i) = (h_1(k) + i\,h_2(k)) \bmod m$.
• Quadratic probing solves the primary clustering problem, but it has the secondary clustering problem, in which elements that hash to the same position probe the same alternative cells. Secondary clustering is a minor theoretical blemish.
• Double hashing is a hashing technique that does not suffer from secondary clustering. A second hash function is used to drive the collision resolution.
• Limits are left to ponder
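
A sketch of double hashing on the five keys from the earlier examples, with table size 10. The slides do not give a second hash function; the choice h2(k) = R - (k mod R) with R = 7 is an assumption made for this example, picked because it never evaluates to 0.

```python
M = 10          # table size, as in the earlier examples
R = 7           # assumed: a prime smaller than M

def h1(k):
    return k % M

def h2(k):
    return R - (k % R)       # assumed second hash; always in 1..R, never 0

def insert(table, k):
    """Insert k using h(k, i) = (h1(k) + i*h2(k)) mod M; return the slot used."""
    for i in range(M):
        slot = (h1(k) + i * h2(k)) % M
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("no empty slot found on the probe sequence")

table = [None] * M
slots = [insert(table, k) for k in (89, 18, 49, 58, 9)]
print(slots)   # [9, 8, 6, 3, 4]
```

Keys 49 and 9 both have home position 9, but their step sizes differ (7 and 5), so they follow different probe sequences. For every slot to be reachable, h2(k) should share no common factor with the table size, which is one reason a prime table size is usually preferred; the size 10 here is kept only to match the example.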