The dynamic Bloom filters
Brice Pesci, M2
2009/11/27

About the paper
The Dynamic Bloom Filters, IEEE TKDE, Vol. 22, No. 1, January 2010
Maths... ORZ

Standard Bloom filter
What is it?
  ◦ A space-efficient data structure for representing a set and answering approximate membership queries in constant time
Set of n items: X = {x1, x2, ..., xn}
k independent hash functions: h1, ..., hk
Bloom filter: vector of m bits
  ◦ Initially, all bits are set to 0
Each hash function maps each item of X uniformly at random to a number in {1, ..., m}

Standard Bloom filter
Initialization
  ◦ To insert x, the bits at positions hi(x), for every i, are set to 1
Example
  ◦ Set of 3 elements: kw1, kw2, kw3
  ◦ 2 hash functions: h1 and h2
  ◦ Bloom filter of 10 bits (positions 9 down to 0)
h1(kw1) = 9, h2(kw1) = 7
h1(kw2) = 4, h2(kw2) = 3
h1(kw3) = 7, h2(kw3) = 2
Position: 9 8 7 6 5 4 3 2 1 0
Bit:      1 0 1 0 0 1 1 1 0 0
Bit 7 is set by both kw1 and kw3
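The insertion example can be replayed in a few lines of Python. This is only a sketch: the lookup table HASHES stands in for h1 and h2 and simply hard-codes the hash values given on the slide, and the helper name insert is an illustrative choice.

```python
# Hash values copied from the slide; this table stands in for h1 and h2.
HASHES = {
    "kw1": (9, 7),
    "kw2": (4, 3),
    "kw3": (7, 2),
}

def insert(bits, item):
    """Set to 1 every bit addressed by the item's hash values."""
    for position in HASHES[item]:
        bits[position] = 1

bits = [0] * 10  # Bloom filter of m = 10 bits, all initially 0
for kw in ("kw1", "kw2", "kw3"):
    insert(bits, kw)

# Print from bit 9 down to bit 0, in the same order as the slide.
print(" ".join(str(bits[i]) for i in range(9, -1, -1)))  # 1 0 1 0 0 1 1 1 0 0
```

Six hash values are inserted but only five bits end up set, because h2(kw1) and h1(kw3) both address bit 7.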

Standard Bloom filter
Assumption
  ◦ If all the bits at positions hi(x) are 1, then x is assumed to belong to X
Example
  ◦ 2 hash functions: h1 and h2
  ◦ Bloom filter of 10 bits
  ◦ Queries: kw1? kw2? kw3?
h1(kw1) = 9, h2(kw1) = 3
h1(kw2) = 3, h2(kw2) = 0
h1(kw3) = 6, h2(kw3) = 0
[Figure: the 10-bit vector, positions 9 down to 0, checked against each query]
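A minimal sketch of the membership test, assuming the k hash functions are simulated by salting SHA-256 with the function index (the slides do not prescribe this; the class and method names are illustrative).

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m  # all bits start at 0

    def _positions(self, item):
        # Assumption: h_i(item) is SHA-256 of "i:item", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        # "Probably in X" only if every addressed bit is 1.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(m=1024, k=3)
for kw in ("kw1", "kw2", "kw3"):
    bf.insert(kw)
print(all(bf.query(kw) for kw in ("kw1", "kw2", "kw3")))  # True: no false negatives
```

An inserted item is always reported as present. An item that was never inserted is usually rejected, but it can be accepted by accident if its bits were set by other items.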

Standard Bloom filter
Risk of false positive
  ◦ The filter answers « x is in X » even though it is wrong
  ◦ Caused by hash collisions
In case of perfectly random hash functions
  ◦ For each function: P(a given bit is set to 1) = 1/m, P(not set to 1) = 1 - 1/m
  ◦ For n insertions with k hash functions, probability of a given bit still being 0: p = (1 - 1/m)^(kn) ≈ e^(-kn/m)
  ◦ False positive probability for the (n+1)th item, once n elements have been added: f = (1 - p)^k ≈ (1 - e^(-kn/m))^k
f grows with n, so a standard Bloom filter should not be used for a non-static set
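To see how quickly the risk grows, the standard false positive formula f = (1 - (1 - 1/m)^(kn))^k can be evaluated numerically. The values of m and k below are hypothetical and chosen only for illustration.

```python
import math

def bit_still_zero(m, k, n):
    """Probability that a given bit is still 0 after n insertions."""
    return (1 - 1 / m) ** (k * n)

def false_positive(m, k, n):
    """Probability that all k bits of a non-member are already set."""
    return (1 - bit_still_zero(m, k, n)) ** k

m, k = 1024, 4  # hypothetical filter size and number of hash functions
for n in (50, 100, 200, 400):
    print(n, round(false_positive(m, k, n), 4))
```

With m and k fixed, the printed probability rises steadily with n, which is why a filter sized for a static set degrades when the set keeps growing.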

Algebra on Bloom Filters
Union of SBFs
  ◦ BF(A) and BF(B) use the same m bits
  ◦ BF(A) and BF(B) use the same hash functions
  ◦ Logical “or” between the bit vectors
Example: {x, y} ∪ {z} = {x, y, z}
[Figure: BF({x, y}) OR BF({z}) = BF({x, y, z})]
  ◦ False positive probability FP of the union: [formula not recoverable from the slide]
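A small sketch of the union property, assuming a Python integer is used as the bit vector and salted SHA-256 stands in for the shared hash functions (M, K and the helper names are hypothetical).

```python
import hashlib

M, K = 64, 3  # hypothetical: both filters share the same m and hash functions

def positions(item):
    return [int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
            for i in range(K)]

def bloom(items):
    """Build a Bloom filter as an integer whose bit p is set when position p is 1."""
    bits = 0
    for item in items:
        for p in positions(item):
            bits |= 1 << p
    return bits

bf_a = bloom({"x", "y"})
bf_b = bloom({"z"})
union = bf_a | bf_b  # logical "or" between the bit vectors

print(union == bloom({"x", "y", "z"}))  # True
```

The equality is exact: a bit is set in BF(A ∪ B) precisely when some element of A or of B sets it.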

Algebra on Bloom Filters
Intersection of SBFs
  ◦ BF(A) and BF(B) use the same m bits
  ◦ BF(A) and BF(B) use the same hash functions
  ◦ Logical “and” between the bit vectors
Example: {x, y} ∩ {y, z} = {y}
[Figure: BF({x, y}) AND BF({y, z}) compared with BF({y})]
  ◦ BF(A ∩ B) = BF(A) AND BF(B) holds only with some probability: [formula not recoverable from the slide]
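Unlike the union, the "and" of two filters is not always identical to the filter of the intersection: a bit can be set in both vectors by different elements. The sketch below shows the part that always holds, under the same assumptions as before (integer bit vectors, salted SHA-256, hypothetical M and K).

```python
import hashlib

M, K = 64, 3  # hypothetical shared parameters

def positions(item):
    return [int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
            for i in range(K)]

def bloom(items):
    bits = 0
    for item in items:
        for p in positions(item):
            bits |= 1 << p
    return bits

def query(bits, item):
    return all((bits >> p) & 1 for p in positions(item))

both = bloom({"x", "y"}) & bloom({"y", "z"})  # logical "and"
exact = bloom({"y"})                          # filter built directly from A ∩ B

print(both & exact == exact)   # True: every bit of BF(A ∩ B) survives the "and"
print(query(both, "y"))        # True: the common element is still found
extra_bits = both & ~exact     # bits set by x and z at a shared position, if any
```

When extra_bits is non-zero, the "and" result has more bits set than BF(A ∩ B), so it can only raise the false positive rate, never create a false negative.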

Related works
Split Bloom filters
  ◦ Increase capacity by allocating a fixed s × m bit matrix instead of an m-bit vector
  ◦ A certain number of the s filters are selected when inserting an item
  ◦ The false match probability increases with the set cardinality
  ◦ A new bit matrix is needed when the false match probability exceeds an upper bound
  ◦ Do not support the deletion operation

Related works
Scalable Bloom filter
  ◦ Uses a series of BFs in an incremental manner
  ◦ Allocates m × a^(i-1) bits for its i-th BF
  ◦ Large overhead when calculating the BF addresses of an item in each BF when testing for membership

Vocabulary
SBF / DBF
  ◦ Standard Bloom filter / Dynamic Bloom filter
nr / c
  ◦ Number of items accommodated / capacity
An SBF is active
  ◦ its false match prob < e
  ◦ => nr < c
An SBF is full
  ◦ its false match prob >= e
  ◦ => nr = c

Initialization
Parameters
  ◦ Upper bound on the false match prob for the DBF
  ◦ Largest value of s, the number of SBFs
  ◦ Upper bound on the false match prob for the SBF
  ◦ Filter size m of the SBF
  ◦ Capacity c of the SBF
  ◦ k hash functions of the SBF
These are initialized differently depending on the scenario

DBF: insertion
We represent a set X by invoking the Insert function for each of its elements x
  ◦ Test for an active SBF: if one of the existing SBFs is not full, use it
  ◦ If there is no active SBF: create a new one and update s
  ◦ Set to 1 the bits addressed by the hash functions: O(k)
  ◦ Keep the count of accommodated elements
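The insertion steps can be sketched as follows. This is not the paper's code: the class names, the salted SHA-256 hashing and the parameter values are assumptions made for the illustration.

```python
import hashlib

class SBF:
    def __init__(self, m, k, c):
        self.m, self.k, self.c = m, k, c
        self.bits = [0] * m
        self.nr = 0  # number of items accommodated so far

    def positions(self, x):
        # Assumption: k hash functions simulated by salting SHA-256.
        return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % self.m
                for i in range(self.k)]

    def is_active(self):
        return self.nr < self.c

class DBF:
    def __init__(self, m, k, c):
        self.m, self.k, self.c = m, k, c
        self.filters = [SBF(m, k, c)]  # s = 1 at the start

    def insert(self, x):
        # Use an existing SBF that is not full, if there is one.
        active = next((f for f in self.filters if f.is_active()), None)
        if active is None:
            # No active SBF: create a new one, which increases s by 1.
            active = SBF(self.m, self.k, self.c)
            self.filters.append(active)
        for p in active.positions(x):  # O(k)
            active.bits[p] = 1
        active.nr += 1                 # keep the count of accommodated elements

dbf = DBF(m=256, k=3, c=3)  # hypothetical parameters
for i in range(7):
    dbf.insert(f"item{i}")
print([f.nr for f in dbf.filters])  # [3, 3, 1]
```

Seven insertions with capacity c = 3 fill two SBFs and leave a third one active with a single item.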

DBF: query
For each SBF, check the bits addressed by hash(x)
  ◦ If any of the hash(x) bits is zero, move on to the next SBF
  ◦ If all the hash(x) bits are 1, then x is a member of X
x is a member of X if it is found in any SBF
x is not a member of X if it is found in none of the SBFs
Cost: O(k × s)
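A sketch of the query loop, assuming each SBF is a plain list of bits and that all SBFs share the same hash functions (simulated here with salted SHA-256; M, K and the helper names are hypothetical).

```python
import hashlib

M, K = 256, 3  # hypothetical shared parameters

def positions(x):
    return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % M
            for i in range(K)]

def make_sbf(items):
    bits = [0] * M
    for x in items:
        for p in positions(x):
            bits[p] = 1
    return bits

def dbf_query(filters, x):
    """Return True as soon as one SBF has all the bits of x set: O(k * s)."""
    pos = positions(x)  # the k addresses are the same in every SBF
    for bits in filters:
        if all(bits[p] for p in pos):
            return True
        # At least one bit is 0 in this SBF: move on to the next one.
    return False

filters = [make_sbf(["a", "b"]), make_sbf(["c"])]
print(dbf_query(filters, "c"))  # True, found in the second SBF
```

Because every SBF is checked, a false positive in any single SBF is enough to produce a false positive for the whole DBF.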

DBF: deletion
Since we use counters, we can delete
  ◦ With bits, it would be impossible (setting a bit to 0 might generate errors)
Locate the SBFs containing x: O(k × s)
  ◦ If x is found in only one SBF: the counters at hash(x) are decremented and nr is decremented
  ◦ If x is found in no SBF, x is not in X
  ◦ If x is found in several SBFs, we keep membership (we do not want false negatives)
If two SBFs could fit in one, we merge them
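The deletion rules can be sketched with counters instead of bits. The class names and hashing are illustrative assumptions, and the merge step (element-wise sum of the counters when the two item counts fit within c) is one plausible reading of "if two SBFs could fit in one, we merge".

```python
import hashlib

class CountingSBF:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m
        self.nr = 0

    def positions(self, x):
        return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % self.m
                for i in range(self.k)]

    def contains(self, x):
        return all(self.counters[p] > 0 for p in self.positions(x))

class CountingDBF:
    def __init__(self, m, k, c):
        self.m, self.k, self.c = m, k, c
        self.filters = [CountingSBF(m, k)]

    def insert(self, x):
        active = next((f for f in self.filters if f.nr < self.c), None)
        if active is None:
            active = CountingSBF(self.m, self.k)
            self.filters.append(active)
        for p in active.positions(x):
            active.counters[p] += 1
        active.nr += 1

    def delete(self, x):
        holders = [f for f in self.filters if f.contains(x)]  # O(k * s)
        if len(holders) != 1:
            # In no SBF: x is not in X. In several: keep it, to avoid false negatives.
            return False
        f = holders[0]
        for p in f.positions(x):
            f.counters[p] -= 1
        f.nr -= 1
        self._merge_if_possible()
        return True

    def _merge_if_possible(self):
        # Assumption: two SBFs fit in one when their item counts sum to at most c.
        for i in range(len(self.filters)):
            for j in range(i + 1, len(self.filters)):
                a, b = self.filters[i], self.filters[j]
                if a.nr + b.nr <= self.c:
                    a.counters = [u + v for u, v in zip(a.counters, b.counters)]
                    a.nr += b.nr
                    del self.filters[j]
                    return

dbf = CountingDBF(m=512, k=3, c=10)  # hypothetical parameters
dbf.insert("a")
print(dbf.delete("a"))  # True: "a" was located in exactly one SBF
```

Refusing to delete an item that appears in several SBFs leaves a stale entry behind, which is the price paid for never introducing a false negative.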

DBF & False match prob
A set X with n elements is represented by
  ◦ a DBF with s SBFs
False match probability for a DBF, compared with an SBF: [formula and plot not recoverable from the slide]
  ◦ If n < c, same result
  ◦ If n > c, huge improvements
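The two cases can be checked numerically with a short sketch. It rests on assumptions that are not on the slide: the set was built by insertions only, so s = ceil(n / c) with s - 1 full SBFs and one partly filled; the SBFs are independent; and a DBF query fails only if every SBF rejects the item. The parameter values are hypothetical.

```python
import math

def sbf_fp(m, k, n):
    """False positive probability of one m-bit SBF holding n items."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

def dbf_fp(m, k, c, n):
    """False positive probability of a DBF made of m-bit SBFs of capacity c."""
    s = max(1, math.ceil(n / c))  # number of SBFs after n insertions
    last = n - c * (s - 1)        # items held by the last, possibly active, SBF
    rejected_by_all = (1 - sbf_fp(m, k, c)) ** (s - 1) * (1 - sbf_fp(m, k, last))
    return 1 - rejected_by_all

m, k, c = 1280, 7, 133  # hypothetical parameters
for n in (100, 1000):
    print(n, round(sbf_fp(m, k, n), 4), round(dbf_fp(m, k, c, n), 4))
```

For n below c the two values coincide, since the DBF is a single SBF. For n well above c the single m-bit SBF is almost saturated while the DBF stays far lower, at the cost of using s × m bits instead of m.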

Algebra on DBF
Union of DBFs
  ◦ DBF(A) and DBF(B) use s1 × m and s2 × m bit matrices
  ◦ DBF(A) ∪ DBF(B) is an (s1 + s2) × m bit matrix: “DBF(A) followed by DBF(B)”
Example
[Figure: the rows of DBF(A) followed by the rows of DBF(B)]
  ◦ False positive probability FP of the union: [formula not recoverable from the slide]

More on item deletions
Probability that x appears to be in multiple SBFs: [formulas not recoverable from the slide]
  ◦ If x was represented by one of the first s-1 SBFs
  ◦ If x was represented by the s-th SBF
Experiments
  ◦ Hash functions: SDBM, Mersenne Twister method
  ◦ Independent random integers ∈ {1, ..., m}, for i ∈ {0, ..., k-1}
The probability
  ◦ Increases with the number of remaining items
  ◦ Increases with the size

Optimizations...
Improvement of the item insertion operation
  ◦ Avoid allocating / extending when unnecessary: duplicate items, and items that seem to have been represented already
  ◦ Querying for membership before insertion
  ◦ Estimate the number of items with multiple BF addresses: only worth it if they would incur one unnecessary SBF
Compressed DBFs
  ◦ Save bandwidth at the cost of additional computation
  ◦ Compress a DBF by using compressed BFs
Storing DBFs
  ◦ Bit string / bit slice methods (will not be presented)

Before experimenting...
Definitions
  ◦ a: upper bound for the false match prob of an SBF representing a static set with fixed cardinality
  ◦ n: cardinality of a static set
Given these
  ◦ We can optimize the parameters: [the three formulas are not recoverable from the slide]
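As an illustration of what optimizing the parameters means, the textbook sizing of a standard Bloom filter from n and a is m = -n ln(a) / (ln 2)^2 and k = (m / n) ln 2. These are the usual closed forms and not necessarily the exact expressions that were on the slide; the function name and the rounding choices are assumptions.

```python
import math

def optimal_parameters(n, a):
    """Textbook SBF sizing for n items and a target false match probability a."""
    m = math.ceil(-n * math.log(a) / math.log(2) ** 2)  # filter size in bits
    k = max(1, round(m / n * math.log(2)))              # number of hash functions
    return m, k

m, k = optimal_parameters(n=1000, a=0.01)
print(m, k)  # 9586 7

# Resulting false match probability with these rounded values.
fp = (1 - (1 - 1 / m) ** (k * 1000)) ** k
```

Because k has to be an integer, the resulting probability lands very close to the target a rather than exactly on it.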

Experiments (1/4)
Static set with fixed cardinality
  ◦ Dynamic set = series of static sets over a sequence of discrete times
  ◦ The SBF is reconstructed every time
To achieve the same false match prob
  ◦ If n < c, SBF and DBF consume the same number of bits
  ◦ If n > c, the SBF consumes fewer bits
To achieve the same false match prob, an SBF never uses more bits than a DBF to represent a static version of a dynamic set, provided it is reconstructed

Experiments (2/4)
Dynamic set with an upper bound on set cardinality
  ◦ X has an upper bound N on its cardinality
  ◦ We do not want the FP probability to exceed the threshold
  ◦ Cardinality distributions: variants of the Zipf distribution
Ratio DBF / SBF
  ◦ DBF outperforms SBF
  ◦ The gap widens with the set cardinality
  ◦ The ratio increases with the false match probability

Experiments (3/4)
Dynamic sets without an upper bound on set cardinality
  ◦ We do not know N in advance
  ◦ b < g: left and right upper bounds on the false match prob
  ◦ The SBF and the DBF may be reconstructed
Results
  ◦ Reconstruction is unavoidable if N is not known...
  ◦ The frequency of reconstruction is lower for the DBF than for the SBF, especially if g - b is large
  ◦ DBF: less overhead and more stable
No figure or data given in the paper

Experiments (4/4)
Distributed application scenarios
  ◦ Filters must now be transferred between nodes
  ◦ SBF
    Frequent reconstruction -> huge overhead
    Overestimating the set cardinality -> hurts space efficiency
  ◦ DBF
    Readjustment rate is lower
    More stable as the cardinality increases
No figure or data given in the paper

Conclusion
Enhancement to Bloom filters
  ◦ Can handle dynamic sets
Used for approximate membership queries
Compromise between
  ◦ Space efficiency
  ◦ False positives
Thank you for listening