Bloom Filters
S. Sioutas, CEID@UPATRAS

Lookup questions: does item “x” exist in a set or multiset?
- The data set may be very big or expensive to access.
- Filter out lookup questions with negative results before accessing the data.
- Allow false positive errors, as they only cost us an extra data access.
- Don’t allow false negative errors, because they result in wrong answers.

Bloom Filter [B70]
- Encoding an attribute $a \in U$.
- Maintain a bit vector V of size m.
- Use k hash functions $h_1, \dots, h_k$, with $h_i : U \to [1..m]$.
- Encoding: for item x, “turn on” bits $V[h_1(x)], \dots, V[h_k(x)]$.
- Lookup: check bits $V[h_1(x)], \dots, V[h_k(x)]$. If all equal 1, return “Probably Yes”. Else “Definitely No”.

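The encoding and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the construction from [B70]: the class name `BloomFilter`, the method names, and the use of salted SHA-256 digests to stand in for the k hash functions are all assumptions made for the example, and positions run over 0..m-1 rather than 1..m.

```python
import hashlib


class BloomFilter:
    """Bit vector V of size m queried through k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        # One byte per bit keeps the sketch readable; a real filter would pack bits.
        self.bits = bytearray(m)

    def _positions(self, item: str):
        # Assumption: h_i(x) is simulated by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Encoding: turn on bits V[h_1(x)] .. V[h_k(x)].
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Lookup: True means "Probably Yes", False means "Definitely No".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for word in ("a", "b", "c", "d"):
    bf.add(word)
print(bf.might_contain("a"))  # True: an inserted item is never reported absent
```

A lookup for an item that was never added usually returns False, but it may return True if other items happened to set all of its bits.
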
Bloom Filter
[Figure: item x is hashed by $h_1(x), h_2(x), h_3(x), \dots, h_k(x)$ to positions in the bit vector $V_0 \dots V_{m-1}$, and the bits at those positions are set to 1.]

Bloom Errors
[Figure: items a, b, c and d have been inserted into the bit vector $V_0 \dots V_{m-1}$; the positions $h_1(x), \dots, h_k(x)$ of a new item x all already hold 1.]
x didn’t appear, yet its bits are already set.

Error Estimation
- Assumption: hash functions are perfectly random.
- Probability of a bit being 0 after hashing all n elements: $(1 - 1/m)^{kn} \approx e^{-kn/m}$
- Let $p = e^{-kn/m}$. The probability of a false positive is: $(1 - (1 - 1/m)^{kn})^k \approx (1 - e^{-kn/m})^k = (1 - p)^k$
- Assuming we are given m and n, the optimal k is: $k = (m/n) \ln 2$

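One way to see where the optimal k comes from is the short derivation below. It is a sketch under the slide's own approximation $p = e^{-kn/m}$; the symbol $f$ for the false positive probability is introduced here only for the derivation.

```latex
\begin{align*}
f(k) &= \left(1 - e^{-kn/m}\right)^{k} = (1 - p)^{k},
  \qquad p = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\ln p, \\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\,\ln p \,\ln(1 - p).
\end{align*}
% ln p * ln(1 - p) is symmetric in p and 1 - p and is largest at p = 1/2,
% so ln f (and therefore f) is smallest there.
\begin{align*}
p = \tfrac{1}{2}
  \;\Rightarrow\; k_{\mathrm{opt}} = \frac{m}{n}\ln 2,
  \qquad
f(k_{\mathrm{opt}}) = \left(\tfrac{1}{2}\right)^{k_{\mathrm{opt}}}
  = 2^{-(m/n)\ln 2} \approx 0.6185^{\,m/n}.
\end{align*}
```

In practice k must be an integer, so the real-valued optimum is rounded to a neighbouring integer.
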
Bloom Filter Tradeoffs
- Three factors: m, k and n.
- Normally, n and m are given, and we select k.
- Small k:
  - Fewer computations.
  - The actual number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too.
  - However, fewer bits need to be stepped over to generate an error.
- For big k, the exact opposite holds.
- Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is exactly 0.5.

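The tradeoff can be explored numerically. The sketch below assumes the approximation $p = e^{-kn/m}$ from the error estimation; the function names and the sizes m = 8000, n = 1000 are hypothetical values chosen only for illustration.

```python
import math


def false_positive_rate(m: int, n: int, k: float) -> float:
    """Approximate false positive probability (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued optimum k = (m/n) ln 2; round to an integer in practice."""
    return (m / n) * math.log(2)


m, n = 8000, 1000  # hypothetical sizes: 8 bits per stored element
k_star = optimal_k(m, n)

# Compare a small k, the rounded optimum, and a large k.
for k in (1, 3, round(k_star), 10, 16):
    ones = 1.0 - math.exp(-k * n / m)  # expected fraction of bits set to 1
    rate = false_positive_rate(m, n, k)
    print(f"k={k:2d}  bits set={ones:.3f}  false positive={rate:.4f}")
```

With these sizes, both very small and very large k give a noticeably higher false positive rate than the rounded optimum, and the fraction of bits set approaches one half near the optimum.
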
The End