
Hashing
• Using balanced search trees (2-3, 2-3-4, red-black, and AVL trees), we can implement table operations (retrieval, insertion, and deletion) efficiently, in O(log N) time.
• Can we find a data structure that performs these table operations even faster (e.g., in O(1) time)?
  – HASH TABLES
• In a hash table, we have
  – An array (index range 0 … n – 1), each location of which is called a bucket
  – An address calculator (hash function), which maps a search key into an array index between 0 and n – 1
9/7/2021 CS 202 - Fundamental Structures of Computer Science II

Hash Function -- Address Calculator
[Figure: the hash function maps a search key to a location in the hash table]

Hashing
• A hash function tells us where to place an item in an array called a hash table.
  – This method is known as hashing.
• A hash function maps a search key into an integer between 0 and n – 1.
  – We can have different hash functions.
  – The hash function is designed for the search keys depending on their type (int, string, ...).
• E.g., h(x) = x mod n, where x is an integer

Collisions
• A perfect hash function maps each search key into a unique location of the hash table.
  – A perfect hash function is possible if we know all search keys in advance.
  – In practice, we do not know all search keys in advance, and thus a hash function can map more than one key into the same location.
• A collision occurs when a hash function maps more than one item into the same array location.
  – We have to resolve collisions using a certain mechanism.

Hash Functions
• We can design different hash functions.
• But a good hash function should
  – be easy and fast to compute
  – place items uniformly (evenly) throughout the hash table.
• We will consider only hash functions that operate on integers.
  – On a computer, everything is represented with bits, and bits can be converted into integers.
  – 100101001010000110… remember?

Everything is an Integer
• Even if our search keys are strings, we can think of them as integers and apply a hash function designed to operate on integers to this integer value.
• For example, strings can be encoded using the ASCII codes of their characters. Consider the string “NOTE”
  – The ASCII code of N is 4Eh (01001110), O is 4Fh (01001111), T is 54h (01010100), E is 45h (01000101)
  – Concatenate the four binary numbers to get a new binary number
      01001110010011110101010001000101 = 4E4F5445h = 1313821765
  – Apply 1313821765 mod tableSize
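
The string-to-integer encoding can be sketched as follows. This is a minimal illustration that assumes 8-bit ASCII characters only; the name string_to_int and the table size 101 are hypothetical choices, not part of the slides.

```python
def string_to_int(s):
    """Concatenate the 8-bit ASCII codes of s into a single integer."""
    value = 0
    for ch in s:
        code = ord(ch)
        if code > 0xFF:
            raise ValueError("this sketch assumes 8-bit character codes")
        value = (value << 8) | code  # shift left one byte, append the new code
    return value

TABLE_SIZE = 101  # hypothetical table size, chosen only for illustration

key = string_to_int("NOTE")
print(hex(key), key)        # 0x4e4f5445 1313821765
print(key % TABLE_SIZE)     # the resulting table index
```

For long strings the integer grows without bound in Python; a fixed-width language would have to reduce it modulo the table size as it goes.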

Hash Functions -- Selecting Digits
• If the search keys are big integers, we can select certain digits and combine them to create the address.
• For example, suppose that we have nine-digit numbers
  – Define a hash function that selects the 2nd and 5th most significant digits
      h(033475678) = 37
      h(023455678) = 25
  – Define the table size as 100
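
A small sketch of digit selection, assuming keys are passed as digit strings so that leading zeros survive (select_digits is a hypothetical name):

```python
def select_digits(key, positions=(2, 5)):
    """Build an address from the digits of `key` at the given 1-based
    positions, counted from the most significant digit.
    `key` is a string of digits so that leading zeros are preserved."""
    return int("".join(key[p - 1] for p in positions))

print(select_digits("033475678"))  # 37
print(select_digits("023455678"))  # 25
```

Two selected digits give addresses 0 to 99, which is why the table size is 100.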

Hash Functions -- Folding
• We can select all digits and add them.
• For example, take the previous nine-digit numbers
  – Define a hash function that selects all digits and adds them
      h(033475678) = 0 + 3 + 3 + 4 + 7 + 5 + 6 + 7 + 8 = 43
      h(023455678) = 0 + 2 + 3 + 4 + 5 + 5 + 6 + 7 + 8 = 40
  – Define the table size as 82 (the largest possible sum is 9 × 9 = 81)
• We can also select groups of digits and add the groups.
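
Folding over single digits can be sketched in one line. As in the slide, the key is treated as a string of nine digits; fold_digits is a hypothetical name.

```python
def fold_digits(key):
    """Add up all decimal digits of `key` (given as a digit string)."""
    return sum(int(d) for d in key)

print(fold_digits("033475678"))  # 43
print(fold_digits("023455678"))  # 40
print(fold_digits("999999999"))  # 81, the largest address for nine digits
```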

Hash Functions -- Modular Arithmetic
• Modular arithmetic provides a simple and effective hash function.
    h(x) = x mod tableSize
• The table size should be a prime number.
  – Why? Think about it.
• We will use modular arithmetic as our hash function in the rest of our discussions.

Why Primes?
• Assume you hash the following with x mod 8:
  – 64, 100, 128, 200, 300, 400, 500

  Index  Keys
  0      64, 128, 200, 400
  1
  2
  3
  4      100, 300, 500
  5
  6
  7

Why Primes?
• Now try it with x mod 7
  – 64, 100, 128, 200, 300, 400, 500

  Index  Keys
  0
  1      64, 400
  2      100, 128
  3      500
  4      200
  5
  6      300

Rationale
• If we are adding numbers a_1, a_2, a_3, …, a_n to a table of size m
  – All values will be hashed into multiples of gcd(a_1, a_2, a_3, …, a_n, m)
  – For example, if we are adding 64, 100, 128, 200, 300, 400, 500 to a table of size 8, all values will be hashed to 0 or 4
      gcd(64, 100, 128, 200, 300, 400, 500, 8) = 4
  – When m is a prime, gcd(a_1, a_2, a_3, …, a_n, m) = 1, so values can be hashed to any location
      gcd(64, 100, 128, 200, 300, 400, 500, 7) = 1
    unless every a_i is a multiple of m, which is rare.
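
The gcd argument can be checked numerically. The sketch below (used_slots is a hypothetical helper) computes the gcd of the keys together with the table size, and lists which indices the keys actually occupy under h(x) = x mod m.

```python
from functools import reduce
from math import gcd

def used_slots(keys, m):
    """Return the sorted indices that `keys` occupy under h(x) = x mod m."""
    return sorted({k % m for k in keys})

keys = [64, 100, 128, 200, 300, 400, 500]

for m in (8, 7):
    g = reduce(gcd, keys + [m])  # gcd of all keys and the table size
    print(f"m={m}: gcd={g}, used slots={used_slots(keys, m)}")
# m=8: gcd=4, used slots=[0, 4]
# m=7: gcd=1, used slots=[1, 2, 3, 4, 6]
```

With m = 8 only two of the eight locations are ever used; with m = 7 the same keys spread over five of the seven locations.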

Collision Resolution
• There are two general approaches to collision resolution in hash tables:
  1. Open Addressing: each entry holds one item
  2. Chaining: each entry can hold more than one item (buckets hold a certain number of items)
[Figure: example hash table; table size is 101]

Open Addressing
• A hash table is said to use open addressing if it probes for some other empty location when a collision occurs.
  – The sequence of locations that it examines is called the probe sequence.
• There are different open-addressing schemes:
  – Linear Probing
  – Quadratic Probing
  – Double Hashing

Open Addressing -- Linear Probing
• In linear probing, we search the hash table sequentially starting from the original hash location.
  – We check the next location if a location is occupied.
  – We wrap around from the last table location to the first table location if necessary.

Linear Probing -- Example
• Example:
  – Table size is 11 (0..10)
  – Hash function: h(x) = x mod 11
  – Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
      20 mod 11 = 9
      30 mod 11 = 8
      2 mod 11 = 2
      13 mod 11 = 2 → 2+1 = 3
      25 mod 11 = 3 → 3+1 = 4
      24 mod 11 = 2 → 2+1, 2+2, 2+3 = 5
      10 mod 11 = 10
      9 mod 11 = 9 → 9+1, (9+2) mod 11 = 0

  Index  Key
  0      9
  1
  2      2
  3      13
  4      25
  5      24
  6
  7
  8      30
  9      20
  10     10
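
The insertions in this example can be reproduced with a short sketch. It assumes integer keys, h(x) = x mod tableSize, and None as the marker for an empty location; linear_probe_insert is a hypothetical name.

```python
def linear_probe_insert(table, key):
    """Insert `key` using linear probing; return the index where it lands."""
    m = len(table)
    home = key % m                      # original hash location
    for step in range(m):
        idx = (home + step) % m         # next location, wrapping around
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("hash table is full")

table = [None] * 11
for k in [20, 30, 2, 13, 25, 24, 10, 9]:
    linear_probe_insert(table, k)

print(table)
# [9, None, 2, 13, 25, 24, None, None, 30, 20, 10]
```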

Linear Probing -- Clustering Problem
• One of the problems with linear probing is that table items tend to cluster together in the hash table.
  – This means that the table contains groups of consecutively occupied locations.
• This phenomenon is called primary clustering.
  – Clusters can get close to one another and merge into a larger cluster.
  – Thus, one part of the table might be quite dense, even though another part has relatively few items.
  – Primary clustering causes long probe sequences and therefore decreases the overall efficiency.

Open Addressing -- Quadratic Probing
• The primary clustering problem can almost be eliminated if we use a quadratic probing scheme.
• In quadratic probing,
  – We start from the original hash location i
  – If a location is occupied, we check the locations i+1², i+2², i+3², i+4², ...
  – We wrap around from the last table location to the first table location, if necessary.

Quadratic Probing -- Example
• Example:
  – Table size is 11 (0..10)
  – Hash function: h(x) = x mod 11
  – Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
      20 mod 11 = 9
      30 mod 11 = 8
      2 mod 11 = 2
      13 mod 11 = 2 → 2+1² = 3
      25 mod 11 = 3 → 3+1² = 4
      24 mod 11 = 2 → 2+1², 2+2² = 6
      10 mod 11 = 10
      9 mod 11 = 9 → 9+1², (9+2²) mod 11, (9+3²) mod 11 = 7

  Index  Key
  0
  1
  2      2
  3      13
  4      25
  5
  6      24
  7      9
  8      30
  9      20
  10     10
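
A sketch of the same insertions with quadratic probing, under the same assumptions as before (integer keys, h(x) = x mod tableSize, None for an empty location). The function name is hypothetical, and the loop simply gives up after tableSize probes.

```python
def quadratic_probe_insert(table, key):
    """Insert `key` by probing home, home+1^2, home+2^2, ... (mod table size)."""
    m = len(table)
    home = key % m
    for j in range(m):
        idx = (home + j * j) % m
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no free location found along the probe sequence")

table = [None] * 11
for k in [20, 30, 2, 13, 25, 24, 10, 9]:
    quadratic_probe_insert(table, k)

print(table)
# [None, None, 2, 13, 25, None, 24, 9, 30, 20, 10]
```

Unlike linear probing, a quadratic probe sequence does not necessarily visit every location, so an insertion can fail even when the table still has free locations.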

Open Addressing -- Double Hashing
• Double hashing also reduces clustering.
• In linear and quadratic probing, the increments used during probing are independent of the key.
• Double hashing selects increments using a second hash function h2. This second function should satisfy
    h2(key) ≠ 0
    h2 ≠ h1
• It probes the following locations until it finds an unoccupied place
    h1(key), h1(key) + h2(key), h1(key) + 2*h2(key), ...

Double Hashing -- Example
• Example:
  – Table size is 11 (0..10)
  – Hash functions: h1(x) = x mod 11, h2(x) = 7 – (x mod 7)
  – Insert keys: 58, 14, 91
      58 mod 11 = 3
      14 mod 11 = 3 → 3+7 = 10
      91 mod 11 = 3 → 3+7, (3+2*7) mod 11 = 6

  Index  Key
  0
  1
  2
  3      58
  4
  5
  6      91
  7
  8
  9
  10     14
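
The double hashing example can be sketched as below, using the slide's h1 and h2 with None as the empty marker. The function name double_hash_insert is hypothetical.

```python
def h1(x, m):
    return x % m

def h2(x):
    return 7 - (x % 7)   # always in 1..7, so the step is never 0

def double_hash_insert(table, key):
    """Insert `key`, probing h1, h1 + h2, h1 + 2*h2, ... (mod table size)."""
    m = len(table)
    start, step = h1(key, m), h2(key)
    for j in range(m):
        idx = (start + j * step) % m
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no free location found along the probe sequence")

table = [None] * 11
placed = [double_hash_insert(table, k) for k in [58, 14, 91]]
print(placed)  # [3, 10, 6]
```

Because the table size 11 is prime and the step is between 1 and 7, every step is relatively prime to the table size, so each probe sequence visits all 11 locations.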

Open Addressing -- Retrieval & Deletion
• To retrieve an item with a given key, we probe the same locations as in insertion until we find the desired item or reach an empty location.
• Deletions in open addressing cause complications
  – We CANNOT simply empty the location of a deleted item, because a later retrieval that probes through this location would stop prematurely there and incorrectly report a failure.
  – Solution: We have to have three kinds of locations in a hash table: Occupied, Empty, Deleted.
  – A deleted location will be treated as an occupied location during retrieval and insertion.
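
A minimal sketch of retrieval and deletion with the three location states, assuming linear probing and h(x) = x mod tableSize. The sentinels EMPTY and DELETED and the function names are hypothetical.

```python
EMPTY = object()     # never used
DELETED = object()   # used once, then deleted

def find_slot(table, key):
    """Return the index holding `key`, or None if it is absent.
    Probing stops only at an EMPTY location, never at a DELETED one."""
    m = len(table)
    home = key % m
    for step in range(m):
        idx = (home + step) % m
        if table[idx] is EMPTY:
            return None
        if table[idx] is not DELETED and table[idx] == key:
            return idx
    return None

def delete(table, key):
    """Mark the location of `key` as DELETED; return True if it was found."""
    idx = find_slot(table, key)
    if idx is None:
        return False
    table[idx] = DELETED
    return True

table = [EMPTY] * 11
table[2], table[3], table[4] = 2, 13, 25   # state after inserting 2, 13, 25
delete(table, 13)
print(find_slot(table, 25))  # 4: the search probes past the deleted location 3
```

If location 3 had been reset to EMPTY instead, the search for 25 (home location 3) would have stopped there and reported a failure.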

Separate Chaining
• Another way to resolve collisions is to change the structure of the hash table.
  – In open addressing, each location holds only one item.
  – We can define a hash table so that each location is itself an array called a bucket, and store the items that are hashed into this location in this array.
• Problem: What will be the size of the bucket?
  – A better approach is to design the hash table as an array of linked lists. This method is known as separate chaining.
  – In separate chaining, each entry of the hash table is a pointer to a linked list (the chain) of the items that the hash function has mapped into that location.
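
A sketch of separate chaining for integer keys. As a simplifying assumption, Python lists stand in for the linked lists, and the class name ChainedHashTable is hypothetical.

```python
class ChainedHashTable:
    """Hash table with separate chaining for integer keys."""

    def __init__(self, size=11):
        # One chain per table location; a Python list stands in for a linked list.
        self.chains = [[] for _ in range(size)]

    def _index(self, key):
        return key % len(self.chains)

    def insert(self, key):
        chain = self.chains[self._index(key)]
        if key not in chain:
            chain.append(key)

    def contains(self, key):
        return key in self.chains[self._index(key)]

    def remove(self, key):
        chain = self.chains[self._index(key)]
        if key in chain:
            chain.remove(key)
            return True
        return False

chained = ChainedHashTable(11)
for k in [58, 14, 91]:       # all three keys hash to location 3
    chained.insert(k)
print(chained.chains[3])     # [58, 14, 91]
chained.remove(14)
print(chained.contains(14), chained.contains(91))  # False True
```

Deletion here needs no special marker: removing a key from its chain does not affect searches for the other keys.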

[Figure: Separate Chaining]

Hashing -- Analysis
• An analysis of the average-case efficiency of hashing involves the load factor α
  – The load factor is the ratio of the current number of items in the table to the table size.
      α = (current number of items) / tableSize
  – The load factor measures how full a hash table is.
  – The hash table should not be too loaded if we want to get better performance from hashing.
• In average-case analyses, we assume that the hash function uniformly distributes keys in the hash table.
• Unsuccessful searches generally require more time than successful searches.

Separate Chaining -- Analysis
• The approximate average number of comparisons (probes) that a search requires is given as follows, where α is the load factor:
  – 1 + α/2 for a successful search
  – α for an unsuccessful search
• It is the most efficient collision resolution scheme.
• But it requires more storage (needs storage for pointers).
• It easily performs the deletion operation. Deletion is more difficult in open addressing.

Linear Probing -- Analysis
• The approximate average number of comparisons (probes) that a search requires is given as follows, where α is the load factor:
  – (1/2) [1 + 1/(1 - α)] for a successful search
  – (1/2) [1 + 1/(1 - α)²] for an unsuccessful search
• As the load factor increases, the number of collisions increases, causing increased search times.
• To maintain efficiency, it is important to prevent the hash table from filling up.

Linear Probing -- Analysis Example
• Find the average number of probes for a successful search and for an unsuccessful search in this hash table. Use the following hash function: h(x) = x mod 11

  Index  Key
  0      9
  1
  2      2
  3      13
  4      25
  5      24
  6
  7
  8      30
  9      20
  10     10

• Successful search: try 20, 30, 2, 13, 25, 24, 10, 9
    20: 9
    30: 8
    2: 2
    13: 2, 3
    25: 3, 4
    24: 2, 3, 4, 5
    10: 10
    9: 9, 10, 0
  Avg. no. of probes = (1+1+1+2+2+4+1+3)/8 = 15/8 ≈ 1.9
• Unsuccessful search: try one absent key for each home location: 0, 1, 35, 3, 4, 5, 6, 7, 8, 31, 32
    0: 0, 1
    1: 1
    35: 2, 3, 4, 5, 6
    3: 3, 4, 5, 6
    4: 4, 5, 6
    5: 5, 6
    6: 6
    7: 7
    8: 8, 9, 10, 0, 1
    31: 9, 10, 0, 1
    32: 10, 0, 1
  Avg. no. of probes = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11 ≈ 2.8
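
The two averages can be recomputed with a short sketch. It assumes linear probing with h(x) = x mod 11 and None for empty locations. The helper probes_needed is a hypothetical name, and the absent key 31 is an assumption made here so that the key with home location 9 is genuinely not in the table.

```python
def probes_needed(table, key):
    """Number of locations examined until `key` or an empty location is found."""
    m = len(table)
    home = key % m
    for step in range(m):
        slot = table[(home + step) % m]
        if slot is None or slot == key:
            return step + 1
    return m

table = [9, None, 2, 13, 25, 24, None, None, 30, 20, 10]
present = [20, 30, 2, 13, 25, 24, 10, 9]
absent = [0, 1, 35, 3, 4, 5, 6, 7, 8, 31, 32]   # one absent key per home location

avg_success = sum(probes_needed(table, k) for k in present) / len(present)
avg_failure = sum(probes_needed(table, k) for k in absent) / len(absent)
print(avg_success)            # 1.875
print(round(avg_failure, 3))  # 2.818
```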

Quadratic Probing & Double Hashing -- Analysis
• The approximate average number of comparisons (probes) that a search requires is given as follows, where α is the load factor:
  – [-ln(1 - α)] / α for a successful search
  – 1 / (1 - α) for an unsuccessful search
• On average, both methods require fewer comparisons than linear probing.

The relative efficiency of four collision-resolution methods
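
The comparison can be tabulated with a short sketch. The formulas below are the standard textbook approximations for these schemes, assumed here to be the ones behind the figure; the function names are hypothetical, and alpha denotes the load factor (number of items divided by table size).

```python
import math

def linear_probing(alpha):
    """(successful, unsuccessful) average probes for linear probing."""
    return 0.5 * (1 + 1 / (1 - alpha)), 0.5 * (1 + 1 / (1 - alpha) ** 2)

def quadratic_or_double(alpha):
    """(successful, unsuccessful) for quadratic probing and double hashing."""
    return -math.log(1 - alpha) / alpha, 1 / (1 - alpha)

def separate_chaining(alpha):
    """(successful, unsuccessful) for separate chaining."""
    return 1 + alpha / 2, alpha

for alpha in (0.25, 0.5, 0.75, 0.9):
    lin = linear_probing(alpha)
    quad = quadratic_or_double(alpha)
    chain = separate_chaining(alpha)
    print(f"alpha={alpha}: linear={lin[0]:.2f}/{lin[1]:.2f}  "
          f"quadratic,double={quad[0]:.2f}/{quad[1]:.2f}  "
          f"chaining={chain[0]:.2f}/{chain[1]:.2f}")
```

At α = 0.5, for instance, linear probing gives 1.5 probes for a successful search and 2.5 for an unsuccessful one; the open-addressing costs grow sharply as α approaches 1, while separate chaining grows only linearly.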

What Constitutes a Good Hash Function
• A hash function should be easy and fast to compute.
• A hash function should scatter the data evenly throughout the hash table.
  – How well does the hash function scatter random data?
  – How well does the hash function scatter non-random data?
• Two general principles:
  1. The hash function should use the entire key in the calculation.
  2. If a hash function uses modulo arithmetic, the table size should be prime.

Hash Table versus Search Trees
• For most operations, a hash table performs better than a search tree.
• However, traversing the data in a hash table in sorted order is very difficult.
  – A hash table is not a good choice for operations that depend on key order (e.g., finding all the items in a certain range).