Hash Tables

2 Exercise 1
/* Exercise 1 */
void mystery(int n) {
    int i, j, k;
    for (i = 1; i <= n - 1; i++) {
        for (j = i + 1; j <= n; j++) {
            for (k = 1; k <= j; k++) {
                /* Some statement taking O(1) time */
            }
        }
    }
}

3 Exercise 2
/* Exercise 2 */
void veryodd(int n) {
    int i, j, x, y;
    x = 0;
    y = 0;
    for (i = 1; i <= n; i++) {
        if (i % 2 == 1) {
            for (j = i; j <= n; j++) {
                x = x + 1;
            }
            for (j = 1; j <= i; j++) {
                y = y + 1;
            }
        }
    }
}

4 Consider www.google.com
• Efficient searches: look up “laptop” in all web pages
• How many web pages? How fast is the response?

5 Consider www.google.com
• 4 billion pages
• Consider data structures: linked list, sorted linked list, array, sorted array, BST

6 Unsorted Linked List of n elements
List *searchList(List *a, int key) {
    if (a == NULL) return NULL;      // not found
    if (a->data == key) return a;    // found: return the node
    return searchList(a->next, key); // keep looking in the rest of the list
}
Best, Average, Worst T(n)?

7 Sorted Linked List of n elements
List *searchList(List *a, int key) {
    if (a == NULL) return NULL;      // not found
    if (a->data == key) return a;    // found: return the node
    return searchList(a->next, key); // keep looking in the rest of the list
}
Best, Average, Worst T(n)?
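The search code on this slide is the same as for the unsorted list, so it does not yet exploit the ordering. A minimal sketch of a variant that does, assuming the list is sorted in ascending order (the `List` layout and the name `search_sorted_list` are illustrative assumptions, not taken from the slides):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed node layout: an int payload and a next pointer. */
typedef struct List {
    int data;
    struct List *next;
} List;

/* Search an ascending sorted list. The walk stops as soon as it
   reaches a value >= key, so an unsuccessful search can end early
   instead of always visiting all n nodes. */
List *search_sorted_list(List *a, int key) {
    while (a != NULL && a->data < key) {
        /* Debug check of the precondition: the list must be sorted. */
        assert(a->next == NULL || a->data <= a->next->data);
        a = a->next;
    }
    return (a != NULL && a->data == key) ? a : NULL;
}
```

The worst case is still O(n) (the key is larger than every element), so sorting a linked list only helps unsuccessful searches by a constant factor.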

8 Unsorted Array of n elements
int seq_search(int n, int *a, int key) {
    int i = 0;
    while (i < n && a[i] != key) {
        i++;
    }
    return i;   // index of key, or n if key is not present
}
Best, Average, Worst T(n)?

9 Sorted Array of n elements
int binary_search(int n, int *a, int key) {
    int lo = -1;
    int hi = n;
    while (hi - lo != 1) {
        int mid = (hi + lo) / 2;
        if (a[mid] <= key) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;   // index of the last element <= key, or -1 if there is none
}
Best, Average, Worst T(n)?
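Note that this loop does not report "found" or "not found" directly: it returns the index of the last element that is less than or equal to the key. A small sketch of how a caller might turn that into a membership test (the wrapper name `sorted_contains` is an assumption made for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Same loop as on the slide: returns the index of the last element
   that is <= key, or -1 if every element is greater than key. */
int binary_search(int n, const int *a, int key) {
    int lo = -1;
    int hi = n;
    assert(n >= 0);
    while (hi - lo != 1) {
        int mid = (hi + lo) / 2;
        if (a[mid] <= key) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/* Hypothetical helper: true only if key is actually stored in a. */
bool sorted_contains(int n, const int *a, int key) {
    int pos = binary_search(n, a, key);
    return pos >= 0 && a[pos] == key;
}
```

For the array {1, 3, 5, 7}, searching for 5 returns index 2, searching for 4 returns index 1 (the position of 3), and searching for 0 returns -1. The loop always runs about log2(n) iterations, since it never exits early on a match.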

10 How about BST?
• Best O(1)
• Average O(log n)
• Worst O(n): very imbalanced (tree degenerates to a list)

11 Answer: Hash Tables
• Search complexity is O(1) on average with a “good” hash function
• Hash Table: a generalization of an array that, under some assumptions, allows O(1) Insert/Delete/Search

12 Intuition
• How can you store all Student Numbers in an array?
  - Use an array with range 0-999,999
  - This will give you O(1) access time, but considering there are approx. 5,000 students, you waste lots of array entries!
• Problem: the range of key values is too large (0-999,999) compared to the number of keys (students)

13 Formal Definition
• Hash tables solve this problem by using a smaller array and mapping keys to array indices with a hash function.
• Given a set of keys K and an array of size m, a hash function h is a function from K to {0, …, m-1}, that is: $h: K \to \{0, \dots, m-1\}$

15 Example Hash Function
For example, if we hash the student number keys into a hash table with 8 entries, we could use h(key) = key mod 8.
[Figure: keys 888999222 and 123456789 mapped into a table with slots 0-7]
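As a quick check of where the two keys in the figure end up, here is a minimal sketch of the division-method hash (the function name `hash_mod` and its parameter types are assumptions for illustration):

```c
#include <assert.h>

/* Division-method hash: h(key) = key mod m, giving a slot in 0..m-1. */
unsigned hash_mod(unsigned long key, unsigned m) {
    assert(m > 0);   /* a table must have at least one slot */
    return (unsigned)(key % m);
}
```

With m = 8, `hash_mod(888999222UL, 8)` gives slot 6 and `hash_mod(123456789UL, 8)` gives slot 5. Because 8 is a power of two, only the low three bits of the key decide the slot.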

16 Problem? Collisions
Two keys hash into the same array entry: with h(key) = key % 8, h(88888) = h(00000) = 0
[Figure: keys 888999222 and 123456789 mapped into a table with slots 0-7]

17 Solution
• Hashing with Chaining (Open Hashing): every hash table entry contains a pointer to a linked list of the keys that hash to that entry
• Closed Hashing: every hash table entry contains only one key. If a new key hashes to a table entry that is already filled, systematically examine other table entries until you find an empty one in which to place the new key

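18 Hashing with Chaining (Open Hashing)
• h(54) = 54 % 5 = 4 = h(34): the collision is solved by chaining
[Figure: table with slots 0-4; each node holds a key and a next pointer]
  0: (empty)
  1: 21
  2: 2
  3: (empty)
  4: 54 -> 34 (chain)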

19 Hashing with Chaining
Insert 101: where does it hash to?
[Figure: table with slots 0-4; slot 1 holds 21, slot 2 holds 2, slot 4 holds the chain 54 -> 34]

20 Hashing with Chaining
• h(101) = 101 % 5 = 1
Insert 101
[Figure: the table before and after the insertion; afterwards:]
  0: (empty)
  1: chain containing 21 and 101
  2: 2
  3: (empty)
  4: 54 -> 34 (chain)

21 Complexity Analysis
• What is the running time to insert/search/delete?
  - Insert: it takes O(1) time to compute the hash function and insert at the head of the linked list
  - Search: proportional to the maximum linked list length
  - Delete: same as search
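The three operations can be sketched in C as follows. This is an illustration under stated assumptions, not the course's reference implementation: the type and function names are invented here, the table size is fixed at m = 5 to match the chaining example, keys are non-negative ints, and duplicates are not rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TABLE_SIZE 5   /* m = 5, as in the chaining example */

typedef struct Node {
    int key;
    struct Node *next;
} Node;

typedef struct {
    Node *slot[TABLE_SIZE];   /* head of each chain, NULL if empty */
} ChainTable;

static int h(int key) {
    assert(key >= 0);          /* assumption: keys are non-negative */
    return key % TABLE_SIZE;
}

void chain_init(ChainTable *t) {
    for (int i = 0; i < TABLE_SIZE; i++) t->slot[i] = NULL;
}

/* Insert at the head of the chain: O(1). Fails only if malloc fails. */
bool chain_insert(ChainTable *t, int key) {
    Node *node = malloc(sizeof *node);
    if (node == NULL) return false;
    node->key = key;
    node->next = t->slot[h(key)];
    t->slot[h(key)] = node;
    return true;
}

/* Search walks one chain: time proportional to that chain's length. */
Node *chain_search(const ChainTable *t, int key) {
    for (Node *p = t->slot[h(key)]; p != NULL; p = p->next)
        if (p->key == key) return p;
    return NULL;
}

/* Delete is a search followed by an O(1) unlink. */
bool chain_delete(ChainTable *t, int key) {
    Node **link = &t->slot[h(key)];
    while (*link != NULL) {
        if ((*link)->key == key) {
            Node *dead = *link;
            *link = dead->next;
            free(dead);
            return true;
        }
        link = &(*link)->next;
    }
    return false;
}
```

Inserting 21, 2, 34, 54 and then 101 leaves slot 4 holding the chain 54 -> 34 and slot 1 holding 101 -> 21, because each new key goes to the head of its chain.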

22 What is a “good” hash?
• Uniform hashing: each key is equally likely to hash into any of the m slots
  - Creating a “good” hash function is black magic!
• How about when keys are student names?
• Interpret characters as numbers:
  - (int)'a', (int)'b', (int)'c' are 97, 98, 99
  - Ex. hash for names: name “abc” hashes to ('a' + 'b' + 'c') % m

25 Closed Hashing
• The key is first mapped to a slot: index = h(k)
• If there is a collision, subsequent probes are performed
• Collision resolution is done as a linear search through the following slots, wrapping around at the end of the table: index = (index + 1) % m. This is known as linear probing.

26 Closed Hashing with Linear Probing
H(k) = k % 11
Insert(1100)?
[Figure: hash table with 11 slots]
  slot: 0     1     2     3  4  5  6  7     8     9     10
  key:  1001  9537  3016  -  -  -  -  9874  2009  9875  -

29 Closed Hashing with Linear Probing
H(k) = k % 11
[Table: slots 0-2 hold 1001, 9537, 3016; slots 7-9 hold 9874, 2009, 9875; slots 3-6 and 10 are empty]
Insert(1100): H(1100) = 0, but slots 0, 1, and 2 are occupied, so 1100 is placed in slot 3.
The same happens for keys that hash into 1 or 2, and a key that hashes into 3 goes there directly. So, with slot 3 still empty:
• Prob(insert_into_3) = 4/11 (home position 0, 1, 2, or 3)
• Prob(insert_into_4) = 1/11 (home position 4 only)
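The placement of 1100 can be reproduced with a small linear-probing insert. This is a sketch under assumptions: keys are non-negative ints, `EMPTY` is an invented sentinel for a free slot, the names `lp_init` and `lp_insert` are illustrative, and duplicates are not checked.

```c
#include <assert.h>

#define M 11          /* table size used in this example */
#define EMPTY (-1)    /* assumed sentinel marking a free slot */

void lp_init(int table[M]) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
}

/* Insert key with linear probing. Returns the slot used,
   or -1 if all M slots were probed and none was free. */
int lp_insert(int table[M], int key) {
    assert(key >= 0);
    int home = key % M;
    for (int i = 0; i < M; i++) {
        int index = (home + i) % M;   /* home, home+1, ... with wrap-around */
        if (table[index] == EMPTY) {
            table[index] = key;
            return index;
        }
    }
    return -1;
}
```

Inserting 1001, 9537, 3016, 9874, 2009 and 9875 in that order fills slots 0, 1, 2, 7, 8 and 9; inserting 1100 afterwards probes slots 0, 1 and 2 and settles in slot 3.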

32 Closed Hashing with Linear Probing
H(k) = k % 11
Assume Insert(1052): H(1052) = 7, but slots 7, 8, and 9 are occupied, so 1052 is placed in slot 10.
[Table: slots 0-2 hold 1001, 9537, 3016; slots 7-10 hold 9874, 2009, 9875, 1052; slots 3-6 are empty]
• Prob(insert_into_3) = 8/11 (home position 7, 8, 9, 10, 0, 1, 2, or 3)
• Prob(insert_into_4) = 1/11 (home position 4 only)
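These probabilities can be computed mechanically: for every home position, follow the probe sequence to the first free slot and count how many homes end at each slot. A sketch, assuming uniform hashing (each home position has probability 1/11); the function name `landing_counts` and the boolean occupancy array are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>

#define M 11   /* table size used in this example */

/* For each slot s, count[s] becomes the number of home positions whose
   linear-probe sequence ends at s. Under uniform hashing, the next
   insert lands in s with probability count[s] / M. */
void landing_counts(const bool occupied[M], int count[M]) {
    int free_slots = 0;
    for (int i = 0; i < M; i++) {
        count[i] = 0;
        if (!occupied[i]) free_slots++;
    }
    assert(free_slots > 0);   /* a full table has no landing slot */
    for (int home = 0; home < M; home++) {
        int index = home;
        while (occupied[index])
            index = (index + 1) % M;   /* linear probing with wrap-around */
        count[index]++;
    }
}
```

With slots 0-2 and 7-9 occupied, slot 3 gets count 4 and slot 4 gets count 1. Once slot 10 is also occupied, the cluster 7-10 wraps around into the cluster 0-2, and the count for slot 3 rises to 8 while slot 4 stays at 1.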

34 Problem: Clustering
• Even with a good hash function, linear probing has its problems:
  - The position of the initial mapping i_0 of key k is called the home position of k.
  - When several insertions map to the same home position, they end up placed contiguously in the table. This collection of keys with the same home position is called a cluster.
  - As clusters grow, the probability that a key will map to the middle of a cluster increases, increasing the rate of the cluster's growth.
  - As these clusters grow, they merge with other clusters, forming even bigger clusters that grow even faster.
  - This tendency of linear probing to place items together is known as primary clustering.

35 Complexity Analysis: Worst Case
• What is the running time to insert/search/delete?
  - Insert: same as search
  - Search: proportional to the maximum number of probes
  - Delete: same as search
  - Worst case: O(n)

36 Complexity Analysis
• When the hash table is empty, an insert takes 1 step (the record goes into its home position)
• As the table fills up, the probability that a record can be inserted in 1 step decreases: more and more records are likely to be inserted far from their home position

37 Complexity Analysis: Intuition
• The expected (average) cost of a hash operation (insert/search/delete) is a function of how full the table is

38 The Load Factor
$\alpha = n/m$
n is the number of entries in the hash table that are occupied
m is the size of the hash table
$\alpha = 1$ means the table is full, and $\alpha = 0$ means the table is empty.

39 Complexity Analysis: Average Case
• The load factor is $\alpha = n/m$, where n is the current number of records
• On average, the probability of finding the home position occupied is $n/m$
• The probability of finding both the home position and the next position occupied is $\frac{n}{m} \cdot \frac{n-1}{m-1}$
• The probability of i collisions is:
  - $\frac{n}{m} \cdot \frac{n-1}{m-1} \cdots \frac{n-i+1}{m-i+1} \approx (n/m)^i$
  - probes $= 1 + \sum_{i=1}^{n} (n/m)^i$
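The sum above has a simple closed form. As a sketch, write $\alpha = n/m$ and, for a large table with $\alpha < 1$, extend the sum to infinity (the extra terms are negligible); it then becomes a geometric series:

```latex
\[
\text{probes} \;\approx\; 1 + \sum_{i=1}^{\infty} \alpha^{i}
\;=\; \sum_{i=0}^{\infty} \alpha^{i}
\;=\; \frac{1}{1-\alpha}, \qquad 0 \le \alpha < 1 .
\]
```

For example, a half-full table ($\alpha = 0.5$) gives about 2 probes, while $\alpha = 0.9$ gives about 10. This estimate treats each probe as hitting an occupied slot independently, so it does not account for the clustering that linear probing produces.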

40 Complexity Analysis: Average Case
• It can be shown that the number of probes in a successful search, C, and the number of probes in an unsuccessful search, C', are given by:
Separate chaining:
  $C \approx 1 + \frac{\alpha}{2}$
  $C' \approx 1 + \alpha$
Linear probing:
  $C \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$
  $C' \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)$
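To get a feel for how quickly these costs grow, the approximations can be evaluated for a given load factor. The sketch below assumes the standard forms: for linear probing, C ≈ (1/2)(1 + 1/(1-α)) and C' ≈ (1/2)(1 + 1/(1-α)²); for separate chaining, C ≈ 1 + α/2 and C' ≈ 1 + α. The function names are invented for illustration.

```c
#include <assert.h>

/* Linear probing, successful search: (1/2)(1 + 1/(1 - a)). */
double lp_success(double a) {
    assert(a >= 0.0 && a < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - a));
}

/* Linear probing, unsuccessful search: (1/2)(1 + 1/(1 - a)^2). */
double lp_failure(double a) {
    assert(a >= 0.0 && a < 1.0);
    return 0.5 * (1.0 + 1.0 / ((1.0 - a) * (1.0 - a)));
}

/* Separate chaining, successful search: 1 + a/2. */
double chain_success(double a) {
    assert(a >= 0.0);
    return 1.0 + a / 2.0;
}

/* Separate chaining, unsuccessful search: 1 + a (assumed convention). */
double chain_failure(double a) {
    assert(a >= 0.0);
    return 1.0 + a;
}
```

At α = 0.5 linear probing needs about 1.5 probes for a successful search and 2.5 for an unsuccessful one. At α = 0.9 those figures rise to about 5.5 and 50.5, while separate chaining stays below 2 probes in both cases.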

41 Successful search
[Figure: average number of probes (0 to 20) versus load factor (0 to 1) for linear probing, double hashing, and separate chaining]

42 Unsuccessful search
[Figure: average number of probes (0 to 20) versus load factor (0 to 1) for linear probing, double hashing, and separate chaining]

43 Insert Implementation
bool HashTable::hashInsert(const Elem &e) {
    int home;
    int index = home = h(getkey(e));
    for (int i = 1; !is_empty(HT[index]); i++) {
        if (is_equal(e, HT[index])) return false;  // duplicate
        if (i == m) return false;                  // all m slots probed: table is full
        index = (home + i) % m;                    // follow probes
    }
    HT[index] = e;
    return true;
}

44 Search Implementation
bool HashTable::hashSearch(const Key &k, Elem &e) {
    int home;
    int index = home = h(k);
    // Stop at an empty slot, at a match, or after all m slots have been probed
    for (int i = 1; i <= m && !is_empty(HT[index]) && !is_equal(k, HT[index]); i++)
        index = (home + i) % m;   // follow probes
    if (!is_empty(HT[index]) && is_equal(k, HT[index])) {  // found it
        e = HT[index];
        return true;
    } else {
        return false;   // k is not in the table
    }
}