Hash Functions
Andy Wang
Data Structures, Algorithms, and Generic Programming

Introduction
• Hash function
  – Maps keys to integers (buckets)
  – Hash(Key) = Integer
  – Ideally in a random-like manner
    • Evenly distributed bucket values
    • Even if the input data is not evenly distributed

An Example
• ID number generation
  – Key = your name
  – Hash(Key) = a number
• Not a great hash function…
  – Two people with the same name will have the same number…

Simple Hash Functions
• Assumptions:
  – K: an unsigned 32-bit integer
  – M: the number of buckets (the number of entries in a hash table)
• Goal:
  – If a bit is changed in K, all bits are equally likely to change for Hash(K)

A Simple Hash Function…
• What if K = M?
• Hash(K) = K
• What is wrong?
• Your student ID = SSN
  – I can't use your SSN to post your grades…

Another Simple Function
• If K > M
• Hash(K) = K % M
• What is wrong?
• Suppose M = 4, K = 2, 4, 6, 8
• K % M = 2, 0, 2, 0

Yet Another Simple Function
• If K > P, P = prime number
• Hash(K) = K % P
• Suppose P = 3, K = 2, 4, 6, 8
• K % P = 2, 1, 0, 2
• More uniform distribution…but still problematic for other cases
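
To compare the two modulo examples, the sketch below counts how many distinct buckets a set of keys reaches for a given modulus. The helper names `hash_mod` and `buckets_used` are made up for illustration. With the keys 2, 4, 6, 8 it reports 2 buckets for M = 4 and 3 buckets for P = 3.

```c
#include <assert.h>
#include <stddef.h>

/* Hash(K) = K % m, where m is the table size M or a prime P. */
unsigned int hash_mod(unsigned int k, unsigned int m)
{
    assert(m != 0);
    return k % m;
}

/* Count how many of the m buckets receive at least one key. */
size_t buckets_used(const unsigned int *keys, size_t n, unsigned int m)
{
    assert(keys != NULL);
    size_t used = 0;
    for (unsigned int b = 0; b < m; b++) {
        for (size_t i = 0; i < n; i++) {
            if (hash_mod(keys[i], m) == b) {
                used++;       /* bucket b is occupied; move on to the next */
                break;
            }
        }
    }
    return used;
}
```

Even keys share the factor 2 with M = 4, so odd buckets are never reached; a prime modulus shares no such factor with the key pattern.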

More on Prime Numbers
• K > P1 > P2, P1 and P2 are prime numbers
• Hash(K) = (K % P1) % P2
• Suppose P1 = 5, P2 = 3, K = 2, 4, 6, 8, 10
• (K % 5) = 2, 4, 1, 3, 0
• (K % 5) % 3 = 2, 1, 1, 0, 0
• Still uniform distribution
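
The two-prime formula is a one-liner; the sketch below (function name `hash_two_primes` is an assumption) reproduces the slide's numbers for P1 = 5 and P2 = 3.

```c
#include <assert.h>

/* Hash(K) = (K % P1) % P2, with P1 > P2 both prime. */
unsigned int hash_two_primes(unsigned int k, unsigned int p1, unsigned int p2)
{
    assert(p1 != 0 && p2 != 0);
    return (k % p1) % p2;
}
```

For K = 2, 4, 6, 8, 10 this yields 2, 1, 1, 0, 0, matching the last row of the example.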

Polynomial Functions
• If K > P, P = prime number
• Hash(K) = K(K + 3) % P
• Slightly better than pure modulo functions
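
One practical detail the formula hides: for a 32-bit K the product K(K + 3) can exceed even 64 bits, so computing it directly in unsigned arithmetic would silently wrap. The sketch below, with the hypothetical name `hash_poly`, reduces both factors modulo P first and multiplies in 64 bits, which gives the same result as the exact polynomial. Whether the original slides intended exact or wrapping arithmetic is not stated.

```c
#include <assert.h>
#include <stdint.h>

/* Hash(K) = K(K + 3) % P, computed without overflow. */
uint32_t hash_poly(uint32_t k, uint32_t p)
{
    assert(p != 0);
    uint64_t a = k % p;              /* K mod P       */
    uint64_t b = (a + 3) % p;        /* (K + 3) mod P */
    return (uint32_t)((a * b) % p);  /* both factors < 2^32, product fits in 64 bits */
}
```

For example, with P = 7, K = 2 gives 10 % 7 = 3 and K = 10 gives 130 % 7 = 4.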

How About…
• Hash(K) = rand()
• What is wrong?
• Not repeatable

How About…
• K > P, P = prime number
• Hash(K) = rand(K) % P
• Better randomness
• Can be expensive to compute random numbers
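
The slide does not define rand(K). One plausible reading, assumed here, is "the first value of a random generator seeded with K". A minimal sketch of that reading with the C standard library follows; `hash_seeded_rand` is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/* Hash(K) = rand(K) % P, reading rand(K) as "seed with K, then draw once". */
unsigned int hash_seeded_rand(unsigned int k, unsigned int p)
{
    assert(p != 0);
    srand(k);                         /* same key, same seed, same first value */
    return (unsigned int)rand() % p;
}
```

Seeding makes the result repeatable within one C library, but every call resets the global generator state and the values differ between library implementations, which is part of why this approach is unattractive in practice.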

Pre-generated Randomness
• Two prime numbers: P1 and P2
• K > P1 and K > P2
• A table R[P1], with R[i] pre-initialized to rand(i) % P2
• Hash(K) = R[K % P1]
• Slight problem: possible duplicate mapping

To Avoid Duplicate Mapping…
• Two prime numbers: P1 and P2
• K > P1 and K > P2
• A table R[P1], with R[i] pre-initialized to unique random numbers
• Hash(K) = R[K % P1]
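
The slides do not say how the unique random numbers are produced. One simple possibility, sketched below under that assumption, is rejection sampling: draw values in 0..P2-1 and skip any value already in the table. The name `fill_unique_table` and the explicit seed parameter are illustrative, and the approach only suits small tables with P1 ≤ P2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill r[0..p1-1] with distinct pseudo-random values in the range 0..p2-1. */
void fill_unique_table(unsigned int *r, size_t p1, unsigned int p2, unsigned int seed)
{
    /* p1 <= p2 guarantees that enough distinct values exist. */
    assert(r != NULL && p1 <= p2 && p2 <= (unsigned int)RAND_MAX);
    srand(seed);
    for (size_t i = 0; i < p1; ) {
        unsigned int candidate = (unsigned int)rand() % p2;
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (r[j] == candidate) {
                seen = 1;             /* duplicate: draw again */
                break;
            }
        }
        if (!seen) {
            r[i++] = candidate;
        }
    }
}
```

Because each entry is distinct, two different values of K % P1 can never map to the same bucket.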

An Example
• K = 0…2^32, P1 = 3, P2 = 5
• R[3] = {0, 4, 1}
• Hash(K) = R[K % 3]

Hashing a Sequence of Keys
• K = {K1, K2, …, Kn}
• E.g., Hash("test") = 98157
• Design principles
  – Use the entire key
  – Use the ordering information
  – Use pre-generated randomness

Use the Entire Key
    unsigned int Hash(const char *Key)
    {
        unsigned int hash = 0;
        /* XOR every character of the NUL-terminated key into the hash */
        for (unsigned int j = 0; Key[j] != '\0'; j++) {
            hash = hash ^ (unsigned char)Key[j];
        }
        return hash;
    }
• Problem: Hash("ab") == Hash("ba")

Use the Ordering Information
    unsigned int Hash(const char *Key)
    {
        unsigned int hash = 0;
        /* walk the NUL-terminated key one character at a time */
        for (unsigned int j = 0; Key[j] != '\0'; j++) {
            hash = hash ^ (unsigned char)Key[j];
            hash = /* hash with some shiftings */;
        }
        return hash;
    }
• Problem: H(short keys) will not perturb all 32 bits (clustering)
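
The slide leaves the shifting step unspecified. Purely as an illustration, the sketch below assumes a 1-bit left shift after each XOR; the shift amount and the name `shift_xor_hash` are assumptions, not the author's choice. It is enough to show both effects named on the slide: order now matters, and short keys only reach the low bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR each character in, then shift left by one bit (assumed shift amount). */
uint32_t shift_xor_hash(const char *key)
{
    assert(key != NULL);
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++) {
        hash = hash ^ (unsigned char)key[j];
        hash = hash << 1;     /* makes the position of each character matter */
    }
    return hash;
}
```

With ASCII input, "ab" hashes to 0x140 and "ba" to 0x14A, so the two no longer collide, but both values fit in 9 bits and the upper 23 bits stay zero.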

Use Pre-generated Randomness
    unsigned int Hash(const char *Key)
    {
        unsigned int hash = 0;
        /* R is a table of pre-generated random numbers indexed by character */
        for (unsigned int j = 0; Key[j] != '\0'; j++) {
            hash = hash ^ R[(unsigned char)Key[j]];
            hash = /* hash with some shiftings */;
        }
        return hash;
    }

CRC Variant
• Do 5-bit circular shift of hash
• XOR hash and K[j]
    …
    for (…) {
        highorder = hash & 0xf8000000;
        hash = hash << 5;
        hash = hash ^ (highorder >> 27) ^ K[j];
    }
    …
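
Filling in the elided loop header for a NUL-terminated string gives a complete function. The sketch below follows the three statements on the slide; the name `crc_variant_hash`, the string-based signature, and the zero initial value are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-variant hash: rotate hash left by 5 bits, then XOR in the next character. */
uint32_t crc_variant_hash(const char *key)
{
    assert(key != NULL);
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++) {
        uint32_t highorder = hash & 0xf8000000u;  /* save the top 5 bits      */
        hash = hash << 5;                         /* shift them out           */
        hash = hash ^ (highorder >> 27);          /* wrap them to the bottom  */
        hash = hash ^ (unsigned char)key[j];      /* mix in the character     */
    }
    return hash;
}
```

With ASCII input, "ab" hashes to 0xC42 and "ba" to 0xC21: ordering is captured, but a two-character key still touches only the low 12 bits.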

CRC Variant
+ For long keys, all 32 bits are exercised
+ More randomness toward lower bits
- Not all bits are changed for short keys

BUZ Hash
• Set up an array R to store precomputed random numbers
    …
    for (…) {
        highorder = hash & 0x80000000;
        hash = hash << 1;
        hash = hash ^ (highorder >> 31) ^ R[K[j]];
    }
    …
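
A complete version needs a way to fill R and a loop over the key. The sketch below assumes a 256-entry table indexed by byte value and fills it with a xorshift32 sequence so that results are repeatable; the generator, the seed handling, and the names `buz_init` and `buz_hash` are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the table with a xorshift32 sequence (illustrative generator choice). */
void buz_init(uint32_t table[256], uint32_t seed)
{
    assert(table != NULL && seed != 0);   /* xorshift32 must not start at 0 */
    uint32_t x = seed;
    for (size_t i = 0; i < 256; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        table[i] = x;
    }
}

/* BUZ hash: rotate hash left by 1 bit, then XOR in the table entry for the byte. */
uint32_t buz_hash(const uint32_t table[256], const char *key)
{
    assert(table != NULL && key != NULL);
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++) {
        uint32_t highorder = hash & 0x80000000u;      /* save the top bit       */
        hash = hash << 1;                             /* shift it out           */
        hash = hash ^ (highorder >> 31);              /* wrap it to the bottom  */
        hash = hash ^ table[(unsigned char)key[j]];   /* mix in a random word   */
    }
    return hash;
}
```

Because each character contributes a full 32-bit random word, even a one-character key can affect every bit of the result, which addresses the short-key weakness of the CRC variant.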

References
• Aho, Sethi, and Ullman. Compilers: Principles, Techniques, and Tools, 1986.
• Cormen, Leiserson, and Rivest. Introduction to Algorithms, 1990.
• Knuth. The Art of Computer Programming, 1973.
• Kuenning. Hash Functions, 2003.