Hashing Amihood Amir Bar Ilan University 2014
Direct Addressing In old days: LD 1, 1 LD 2, 2 AD 1, 2 ST 1, 3 Today: C <- A+B
Direct Addressing For C <- A+B, the compiler keeps track of: A in location 1, B in location 2, C in location 3. How? Tables? Linked lists? Skip lists? AVL trees? B-trees?
Encode? Consider ASCII code of alphabet: (American Standard Code for Information Interchange)
Encode In ASCII code: A to Z are 1000001 to 1011010
Encode In ASCII code: A to Z are 1000001 to 1011010. In decimal: 65 to 90. So if we subtract 64 from the ASCII code, we get locations 1 to 26.
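As a quick sanity check, this subtract-64 mapping can be written directly (a minimal sketch; Python's ord gives the ASCII code, and the function name is just for illustration):

```python
def alpha_slot(ch):
    """Map 'A'..'Z' to locations 1..26 by subtracting 64 from the ASCII code."""
    return ord(ch) - 64
```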
Map In general: Consider a function h: U → {0, …, m}, where U is the universe of keys and the table has slots 0 through m. When we want to access the record of key k∈U, just go to location h(k); it will point to the location of the record.
Problems: 1. What happens if keys are not consecutive? E.g. parameters: pointer, variable, ID number. 2. What happens if not all keys are used? E.g. ID numbers of students: 211101793, 31264992, 001684421. 3. What happens if h(k) is not bijective?
Hash Tables: If U is larger than m, h(k) cannot be bijective, thus we may encounter collisions: k_1 ≠ k_2 where h(k_1) = h(k_2). What do we do then? Solution 1: chaining.
Chaining puts elements that hash to the same slot in a linked list. [Diagram: universe U of keys; actual keys K = {k_1, …, k_8} hash into table T; e.g. k_1 and k_4 share one slot's list, k_5 and k_2 another, while k_3, k_6, k_7, k_8 each occupy their own slot.]
Hash operations: CHAINED-HASH-INSERT(T, x) Insert x at end of list T[h(key[x])] CHAINED-HASH-SEARCH(T, k) Search for an element with key k in list T[h(k)] CHAINED-HASH-DELETE(T, x) Delete x from list T[h(key[x])]
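The three chained-hash operations above can be sketched as follows (a toy illustration, not a production implementation; Python lists stand in for linked lists, and the table size m and hash function h are supplied by the caller):

```python
class ChainedHash:
    """Toy hash table with chaining; Python lists stand in for linked lists."""

    def __init__(self, m, h):
        self.m, self.h = m, h
        self.table = [[] for _ in range(m)]

    def insert(self, key, value):
        # CHAINED-HASH-INSERT: append at the end of the slot's list.
        self.table[self.h(key) % self.m].append((key, value))

    def search(self, key):
        # CHAINED-HASH-SEARCH: scan the slot's list for the key.
        for k, v in self.table[self.h(key) % self.m]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # CHAINED-HASH-DELETE: remove the first matching entry.
        slot = self.table[self.h(key) % self.m]
        for i, (k, _) in enumerate(slot):
            if k == key:
                del slot[i]
                return
```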
Time: Depends on chains. We assume: simple uniform hashing. h hashes any key into any slot with equal likelihood, independently of other keys.
Load Factor: We will analyze average search time. This depends on how large the hash table is relative to the number of keys. Let n = number of keys, m = hash table size. n/m is the load factor. Write n/m = α.
Average Complexity of Search Notation: length of list T[j] is n_j; Σ_j n_j = n. Average value of n_j: E(n_j) = n/m = α. Assumptions: h(k) computed in constant time; h(k) a simple uniform hash function.
Searching for new key Compute h(k) in time O(1). Get to list T[h(k)] (uniformly distributed) whose length is n_j. Need to search list T[h(k)]. Time for search: n_j. E(n_j) = α. Conclude: average time to search for a new key is Θ(1+α). What is the worst case time? (Θ(n): all keys may hash to one list.)
Searching for Existing Key Do we gain anything by searching for a key that is already in the list? Note: expected number of elements examined in a successful search for k = expected length of list T[h(k)] when k was inserted + 1 (since we insert elements at the end).
Searching for Existing Key Let k_1, …, k_n be the keys in order of their insertion. When k_i is inserted, the expected list length is (i−1)/m. So the expected length of a successful search is the average over all such lists:
Calculate: (1/n) Σ_{i=1}^{n} (1 + (i−1)/m) = 1 + (1/(nm)) Σ_{i=1}^{n} (i−1) = 1 + (n−1)/(2m) = 1 + α/2 − α/(2n)
Conclude: Average time for searching in a hash table with chaining is Θ(1+α). Note: generally m is chosen as Θ(n), so α = O(1). Thus the average operation on a hash table with chaining takes constant time.
Choosing a Hash Function Assume keys are natural numbers: {0, 1, 2, …}. A natural mapping of natural numbers to {0, 1, …, m} is dividing by m+1 and taking the remainder, i.e. h(k) = k mod (m+1).
The Division Method What are good m’s? In the ASCII example: A to Z are 1000001 to 1011010, in decimal 65 to 90. We subtracted 64 from the ASCII code and got locations 1 to 26. If we choose m=32 we achieve the same result.
What does it Mean? 32 = 2^5. Dividing by 32 and taking the remainder means taking the 5 LSBs of the binary representation. A is 1000001 → 00001 = 1; Z is 1011010 → 11010 = 26. Is this good? – Here, yes. In general?
Want Even Distribution We want even distribution of the bits. If m = 2^b then only the b least significant bits participate in the hashing. We would like all bits to participate. Solution: use for m a prime number not too close to a power of 2. Always a good idea: check the distribution of real application data.
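A minimal sketch of the division method under this recommendation (the concrete choice m = 701, a prime not close to a power of 2, is an assumption for illustration):

```python
def division_hash(k, m=701):
    """Division method: h(k) = k mod m.
    m = 701 is a prime not too close to a power of 2 (512, 1024),
    following the recommendation above; chosen here for illustration."""
    return k % m
```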
The Multiplication Method Choose 0 < σ < 1. h(k) = ⌊m · (kσ mod 1)⌋. Meaning: 1. multiply the key by σ and take the fractional part; 2. then multiply by m and take the floor.
Multiplication Method Example Choose σ = 0.618 (Knuth recommends (√5 − 1)/2 ≈ 0.618), m = 32, k = 2391. Then 2391 × 0.618 = 1477.638; 1477.638 − 1477 = 0.638; 0.638 × 32 = 20.416. So h(2391) = 20.
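The same computation can be sketched in code (the function name and default parameters are assumptions matching the example above):

```python
def mult_hash(k, m=32, sigma=0.618):
    """Multiplication method: multiply k by sigma, keep the fractional
    part, scale by m, and take the floor. sigma = 0.618 approximates
    Knuth's recommendation (sqrt(5) - 1) / 2."""
    frac = (k * sigma) % 1.0   # step 1: fractional part of k*sigma
    return int(m * frac)       # step 2: scale by m and floor
```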
Why does this involve all bits? Assume: k uses w bits, where w is a computer word, and word operations take constant time. Assume: m = 2^p. Easy implementation of h(k):
Implementing h(k) Easy implementation of h(k): multiply the w-bit key k by the integer ⌊σ·2^w⌋, giving a 2w-bit product. Discard the high word (getting rid of the integer part means dividing by 2^w, so the low word holds the fractional part). The top p bits of the low word are h(k).
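A sketch of this word-level trick, assuming w = 32 and p = 5 (so m = 32); Python's big integers simulate fixed-width word arithmetic via explicit masking:

```python
W = 32  # assumed word size in bits
P = 5   # m = 2**P = 32

def mult_hash_words(k, p=P, w=W, sigma=0.6180339887):
    """Word-level multiplication method: multiply k by floor(sigma * 2^w),
    keep only the low w bits of the product (the fractional part of
    k*sigma in fixed point), and return their top p bits."""
    s = int(sigma * (1 << w))          # sigma * 2^w as an integer < 2^w
    low = (k * s) & ((1 << w) - 1)     # low word = fixed-point fraction
    return low >> (w - p)              # top p bits of the low word
```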
Worst Case Problem: Malicious adversary can make sure all keys hash to same entry. Usual solution: randomize. But then how do we get to correct hash table entry upon search?
Universal Hashing We want to make sure that the keys k_1, k_2, …, k_n do not always hash to the same table entry. How can we make sure of that? Use a number of different hash functions and employ them at random.
Universal Hashing A collection H of hash functions from universe U to {0, …, m} is universal if for every pair of keys x, y∈U, where x≠y, the number of hash functions h∈H for which h(x)=h(y) is |H|/m. This means: if we choose a function h∈H at random, the chance of collision between x and y where x≠y is 1/m.
Constructing a Universal Collection of Hash Functions Choose m to be prime. Decompose a key k into r+1 pieces of ⌈log m⌉ bits each, so that each piece has value < m: k = [k_0, k_1, …, k_r].
Universal Hashing Construction Choose at random one of the m^{r+1} sequences of length r+1 of elements from {0, …, m−1}. Each sequence a = [a_0, …, a_r], a_i ∈ {0, …, m−1}, defines a hash function h_a(k) = (Σ_{i=0}^{r} a_i·k_i) mod m. The universal class is: H = ∪_a {h_a}.
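A sketch of this construction (the fixed RNG seed and function names are assumptions; the key must already be decomposed into r+1 pieces, each with value < m, and m must be prime):

```python
import random

def make_universal_hash(m, r, rng=random.Random(0)):
    """Draw a = [a_0, ..., a_r] uniformly from {0,...,m-1}^(r+1) and
    return h_a with h_a(k) = (sum of a_i * k_i) mod m, where the key
    is given as its decomposition [k_0, ..., k_r]. m must be prime."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(pieces):
        assert len(pieces) == r + 1 and all(0 <= p < m for p in pieces)
        return sum(ai * ki for ai, ki in zip(a, pieces)) % m

    return h
```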
This is a Universal Class Need to show that the probability that x≠y collide is 1/m. Because x≠y, there is some i for which x_i≠y_i. Wlog assume i=0. Since m is prime, for any fixed [a_1, …, a_r] there is exactly one value that a_0 can take that satisfies h(x)=h(y). Why?
Proof Continuation… h(x)−h(y)≡0 (mod m) means: a_0(x_0−y_0) ≡ −Σ_{i=1}^{r} a_i(x_i−y_i) (mod m). But x_0−y_0 is non-zero, and for prime m it has a unique multiplicative inverse modulo m. Thus a_0 ≡ −(x_0−y_0)^{−1} · Σ_{i=1}^{r} a_i(x_i−y_i) (mod m): only one value between 0 and m−1 that a_0 can take.
End of Proof Conclude: each pair x≠y may collide once for each [a_1, …, a_r]. But there are m^r possible values for [a_1, …, a_r]. However: since there are m^{r+1} different hash functions, the keys x and y collide with probability m^r / m^{r+1} = 1/m.
Open Addressing Solution 2 to the collision problem. We don’t want to use chaining (saves the space and complexity of pointers). Where should we put the colliding key? Inside the hash table, in an empty slot. Problem: which empty slot? -- We assume a probing hashing function.
Idea behind Probing Hashing function of the form: h : U × {0, …, m} → {0, …, m}. The initial hash value is h(k, 0); h(k, i) for i=1, …, m gives subsequent values. We try locations h(k, 0), h(k, 1), etc., until an open slot is found in the hash table. If none is found, we report an overflow error.
Average Complexity of Probing Assume uniform hashing, i.e. a probe sequence for key k is equally likely to be any permutation of {0, …, m}. What is the expected number of probes with load factor α? Note: since this is open address hashing, α = n/m < 1.
Probing Complexity – Case I : key not in hash table. Key not in table ⇒ every probe, except for the last, is to an occupied slot. Let p_i = Pr{exactly i probes access occupied slots}. Then the expected number of probes is 1 + Σ_{i=0}^{∞} i·p_i.
Probing Complexity – Case I Note: for i > n, p_i = 0. Claim: Σ_{i=0}^{∞} i·p_i = Σ_{i=1}^{∞} q_i, where q_i = Pr{at least i probes access occupied slots}. Why?
Probing Complexity – Case I An outcome in which exactly i probes access occupied slots is counted once in each of q_1, …, q_i; summing the q_i therefore counts that outcome i times, which is exactly Σ_i i·p_i.
Probing Complexity – Case I What is q_i? q_1 = n/m: there are n elements in the table, so the probability that the first slot accessed is occupied is n/m. q_2 = (n/m)·((n−1)/(m−1)). In general, q_i ≤ (n/m)^i = α^i.
Probing Complexity – Case I Conclude: the expected number of probes for inserting an element into the hash table is 1 + Σ_{i=1}^{∞} q_i ≤ 1 + α + α² + … = 1/(1−α) (since q_i ≤ α^i). Example: if the hash table is half full, the expected number of probes is 2. If it is 90% full, the expected number is 10.
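The closed form 1/(1−α) is easy to check numerically (a trivial helper, named here for illustration only):

```python
def expected_probes_unsuccessful(alpha):
    """Geometric bound 1 + alpha + alpha^2 + ... = 1/(1 - alpha) on the
    expected number of probes in an unsuccessful search, for 0 <= alpha < 1."""
    return 1.0 / (1.0 - alpha)
```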
Probing Complexity – Case II : key in hash table. As in the chaining case: expected search time for a key in the table = expected insertion time. We have shown: expected time to insert key i+1 (with i keys already in the table) is 1/(1 − i/m) = m/(m−i).
Probing Complexity – Case II Conclude: expected key insertion time: (1/n) Σ_{i=0}^{n−1} m/(m−i) = (m/n) Σ_{i=0}^{n−1} 1/(m−i) = (1/α) Σ_{k=m−n+1}^{m} 1/k. Need to compute this sum.
Probing Complexity – Case II In general: what is Σ_{k=a}^{b} f(k)?
Probing Complexity – Case II Consider any monotonic function f(x) > 0. Approximate Σ_{k=a}^{b} f(k) by the area under the function.
Probing Complexity – Case II For a monotonically decreasing f(x) > 0: ∫_{a}^{b+1} f(x)dx ≤ Σ_{k=a}^{b} f(k) ≤ ∫_{a−1}^{b} f(x)dx. [Diagram: the sum as unit-width rectangles bounded by the curve f(x) between a−1 and b+1.]
Probing Complexity – Case II Conclude: approximate Σ_{k=m−n+1}^{m} 1/k by ∫_{m−n}^{m} dx/x = ln(m) − ln(m−n) = ln(m/(m−n)) = ln(1/(1−α)).
Probing Complexity – Case II Conclude: expected key insertion time: (1/α) ln(1/(1−α)). Example: if the hash table is half full, the expected number of probes is 2·ln 2 ≈ 1.4. If it is 90% full, the expected number is ≈ 2.6.
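Likewise for the successful-search bound (the helper name is an assumption):

```python
import math

def expected_probes_successful(alpha):
    """(1/alpha) * ln(1/(1 - alpha)): expected probes in a successful
    search under uniform hashing, for load factor 0 < alpha < 1."""
    return (1.0 / alpha) * math.log(1.0 / (1.0 - alpha))
```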
Types of Probing What Functions can be used for probing? 1. Linear 2. Quadratic 3. Double Hashing
Linear Probing
j <- h(k)
while T[j] occupied:
    j <- (j+1) mod table size
T[j] <- k
Attention: 1. Wrap around when the end of the table is reached. 2. If the table is full, report an overflow error.
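The steps above can be sketched as follows, assuming the table is represented as a Python list with None marking empty slots (names are illustrative):

```python
def linear_probe_insert(table, k, h):
    """Insert key k into an open-addressing table (a list where None
    means empty) by linear probing; returns the slot used.
    h maps keys to initial slots; raises OverflowError if full."""
    m = len(table)
    j = h(k) % m
    for _ in range(m):
        if table[j] is None:
            table[j] = k
            return j
        j = (j + 1) % m          # wrap around at the end of the table
    raise OverflowError("hash table full")
```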
Linear Probing - Discussion Pro: easy to implement. Con: clustering. An empty slot immediately following a cluster of length i has probability (i+1)/m of being filled next, so long clusters tend to grow longer.
Idea behind Probing Hashing function of the form: h : U × {0, …, m} → {0, …, m}. The initial hash function h’(k) gives the first value, but if that slot is taken, we try h(k, i) for i = 0, …, m. In linear probing we have: h(k, i) = (h’(k)+i) mod (m+1).
For Uniformity: There are m! different paths of length m, we should be able to generate them all. Linear probing: Generates m paths, so clearly not uniform. The more different paths generated, the better.
Quadratic Probing h(k, i) = (h’(k) + c_1·i + c_2·i²) mod m. Works better, but: 1. Still only m different paths. 2. Note that if h’(x)=h’(y) then h(x, i)=h(y, i) for all i. This causes what is called secondary clustering.
Double Hashing h(k, i) = (h_1(k) + i·h_2(k)) mod m. To get a permutation, h_2(k) must be relatively prime to m. Possibility: h_1(k) = k mod m, h_2(k) = 1 + (k mod m’). Either m a power of 2 and m’ odd, or m and m’ both prime with distance 2 between them.
Double Hashing Example h(k, i) = (h_1(k) + i·h_2(k)) mod m, with h_1(k) = k mod 13, h_2(k) = 1 + (k mod 11). Probing seq. of 14: 14 mod 13 = 1; (1+4) mod 13 = 5; (1+8) mod 13 = 9; … Probing seq. of 27: 27 mod 13 = 1; (1+6) mod 13 = 7; (1+12) mod 13 = 0; …
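These probe sequences can be reproduced with a short sketch (parameter defaults follow the example above; the function name is an assumption):

```python
def double_hash_seq(k, m=13, mp=11, probes=3):
    """First few slots probed for key k under double hashing with
    h1(k) = k mod m, h2(k) = 1 + (k mod mp), h(k,i) = (h1 + i*h2) mod m."""
    h1, h2 = k % m, 1 + (k % mp)
    return [(h1 + i * h2) % m for i in range(probes)]
```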
Double Hashing Example Advantage: m² different probe sequences are generated (one per (h_1, h_2) value pair).
Deletions Problem: if an element is simply removed, a later search may stop at the emptied slot and wrongly conclude a key is not in the table! Solution: when an element is deleted, mark its slot with a flag ("deleted") that searches skip over. Meaning: it can cause long searches if there are many deletions. Therefore: in a very dynamic setting, use chaining.
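A sketch of searching with deletion flags (tombstones), assuming the same list-with-None table representation as in linear probing; the DELETED sentinel is a hypothetical name:

```python
DELETED = object()  # tombstone flag marking a deleted slot

def probe_search(table, k, h):
    """Linear-probe search that skips DELETED slots instead of stopping
    at them, so keys inserted past a deleted slot are still found.
    Returns the slot index of k, or None if k is absent."""
    m = len(table)
    j = h(k) % m
    for _ in range(m):
        if table[j] is None:
            return None          # truly empty slot: key not in table
        if table[j] is not DELETED and table[j] == k:
            return j
        j = (j + 1) % m
    return None
```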