ICP - Web Caching - Bloom Filters
Internet Cache Protocol with Bloom Filters
S. Sioutas, ETY@CEID, UPATRAS

Bloom Filters
- Lookup question: does item "x" exist in a set or multiset?
- The data set may be very big or expensive to access.
- Filter out lookups with negative results before accessing the data.
- Allow false positive errors, as they only cost us an extra data access.
- Don't allow false negative errors, because they result in wrong answers.

Bloom Filter [B70]
- Encoding an attribute a with values in U.
- Maintain a bit vector V of size m.
- Use k hash functions (h_1..h_k), h_i: U -> [1..m].
- Encoding: for item x, "turn on" bits V[h_1(x)]..V[h_k(x)].
- Lookup: for item i, check bits V[h_1(i)]..V[h_k(i)]. If all equal 1, return "Probably Yes". Else return "Definitely No".

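The encoding and lookup steps above can be sketched as follows. This is a minimal illustration, not the construction from [B70]: the class name `BloomFilter`, the method names, and the use of salted SHA-256 to stand in for the k hash functions are all assumptions, and positions are 0-based rather than [1..m].

```python
import hashlib


class BloomFilter:
    """Bit vector V of size m with k hash functions (0-based positions)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: str) -> list[int]:
        # Assumption: k hash functions are simulated by salting SHA-256 with j.
        positions = []
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item: str) -> None:
        # Encoding: turn on bits V[h_1(x)]..V[h_k(x)].
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Lookup: True means "Probably Yes", False means "Definitely No".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=4)
for url in ("http://a.example/", "http://b.example/"):
    bf.add(url)
```

An inserted item always answers "Probably Yes". An item that was never inserted usually answers "Definitely No", but may collide with bits set by other items.
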
Bloom Filter
[Figure: item x is mapped by h_1(x), h_2(x), h_3(x), ..., h_k(x) to positions in the bit vector V_0..V_{m-1}; those bits are set to 1.]

Bloom Errors
[Figure: items a, b, c, d are encoded in the bit vector V_0..V_{m-1}; the positions h_1(x), h_2(x), h_3(x), ..., h_k(x) are marked.]
- x didn't appear, yet its bits are already set.

Error Estimation
- Assumption: hash functions are perfectly random.
- Probability of a bit being 0 after hashing all n elements: $(1 - 1/m)^{kn} \approx e^{-kn/m}$
- Let $p = e^{-kn/m}$; the probability of a false positive is: $(1 - (1 - 1/m)^{kn})^k \approx (1 - p)^k$
- Assuming we are given m and n, the optimal k is: $k = (m/n) \ln 2$

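As a rough numeric check of the error estimate, the sketch below evaluates the standard approximation (1 - e^{-kn/m})^k for a range of k. The function names and the sizes m = 8000, n = 1000 are illustrative assumptions, not values from the slides.

```python
import math


def false_positive_rate(m: int, n: int, k: float) -> float:
    """Approximate false positive probability (1 - e^{-kn/m})^k."""
    p = math.exp(-k * n / m)  # probability that a given bit is still 0
    return (1.0 - p) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued k that minimises the approximation: (m/n) * ln 2."""
    return (m / n) * math.log(2)


m, n = 8000, 1000         # illustrative sizes: 8 bits per element
k_star = optimal_k(m, n)  # about 5.5; in practice k is rounded to an integer
for k in range(1, 11):
    print(k, round(false_positive_rate(m, n, k), 4))
```

With 8 bits per element, the integer k with the lowest estimated error sits next to k_star, and the error grows again on either side.
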
Bloom Filter Tradeoffs
- Three factors: m, k and n.
- Normally, n and m are given, and we select k.
- Small k
  - Fewer computations.
  - The actual number of bits accessed (nk) is smaller, so the chance of a "step over" is smaller too.
  - However, fewer bits need to be stepped over to generate an error.
- For big k, the exact opposite holds.
- Not surprisingly, when k is optimal, the "hit ratio" (ratio of bits flipped in the array) is exactly 0.5.

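The 0.5 "hit ratio" at the optimal k can be checked empirically. The sketch below is a simulation under the idealised assumption of perfectly random hashes (a seeded pseudo-random generator stands in for them). The function name `fill_ratio` and the sizes are hypothetical.

```python
import math
import random


def fill_ratio(m: int, n: int, k: int, seed: int = 0) -> float:
    """Insert n items with k idealised random hashes; return the fraction of 1 bits."""
    rng = random.Random(seed)
    bits = [0] * m
    for _ in range(n * k):        # nk bit accesses in total
        bits[rng.randrange(m)] = 1
    return sum(bits) / m


m, n = 10_000, 1_000
k_opt = round((m / n) * math.log(2))  # nearest integer to (m/n) ln 2
ratio = fill_ratio(m, n, k_opt)
print(k_opt, ratio)
```

The measured ratio lands close to one half. A much smaller k leaves the array mostly empty, and a much larger k fills most of it.
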
Summary Cache [FCAB00]
- Proxy servers maintain a local cache to minimize expensive internet requests.
- The proxy must maintain an efficient lookup method into the cache.
- The lookup structure must be stored in DRAM for performance.
- The structure must be compact, as DRAM is expensive and is also used for "Hot Items" storage and more.
- Pages are usually replaced in the cache using an LRU algorithm.

ICP - Request Handling
[Figure: a client, three proxy caches, and the Internet.]

Internet Cache Protocol (ICP)
- Allows for scaling out when using proxies.
- A protocol that supports discovery and retrieval of documents from neighboring caches.
- Establishes a hierarchy of proxy caches.
- If a page is not found in the local proxy cache, the proxy searches for it in neighboring proxies.
- If the page is not found anywhere, it is fetched from the internet.

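The request flow above (local cache, then neighbors, then the internet) can be sketched as a toy model. This is not the ICP wire protocol: the `Proxy` class, `icp_query`, `fetch`, and the `origin` callback are hypothetical names, neighbors are queried one after another instead of over the network, and a dict stands in for the cache.

```python
from typing import Callable, Optional


class Proxy:
    """Toy proxy: a dict stands in for the local cache."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: dict[str, str] = {}
        self.neighbors: list["Proxy"] = []
        self.queries_sent = 0

    def icp_query(self, url: str) -> Optional[str]:
        """Answer a neighbor's query from the local cache only."""
        return self.cache.get(url)

    def fetch(self, url: str, origin: Callable[[str], str]) -> str:
        if url in self.cache:            # 1. local hit
            return self.cache[url]
        for peer in self.neighbors:      # 2. ask each neighbor in turn
            self.queries_sent += 1
            page = peer.icp_query(url)
            if page is not None:
                self.cache[url] = page
                return page
        page = origin(url)               # 3. fall back to the internet
        self.cache[url] = page
        return page


a, b, c = Proxy("a"), Proxy("b"), Proxy("c")
a.neighbors = [b, c]
c.cache["http://example.org/x"] = "page-x"

origin_calls: list[str] = []


def origin(url: str) -> str:
    origin_calls.append(url)
    return "from-origin:" + url


hit = a.fetch("http://example.org/x", origin)
miss = a.fetch("http://example.org/y", origin)
```

Every local miss costs one query per neighbor, even when no neighbor holds the page.
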
Summary Cache
- Each proxy maintains a Bloom Filter representing its local cache.
- It also holds Bloom Filters representing the caches of other proxies.
- Updates to Bloom Filters are exchanged periodically, or after a certain percentage of the documents in the cache has been replaced.
- An ICP request is sent only to a proxy that supposedly holds the requested document.

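A minimal sketch of this scheme, under several assumptions: the names `SummaryProxy`, `store`, `publish_to` and `candidates` are hypothetical, salted SHA-256 stands in for the hash functions, all proxies share the same m and k, and a summary update is modelled as copying the whole bit list.

```python
import hashlib


def positions(url: str, m: int, k: int) -> list[int]:
    # Assumption: salted SHA-256 stands in for the k hash functions.
    return [
        int.from_bytes(hashlib.sha256(f"{j}:{url}".encode()).digest()[:8], "big") % m
        for j in range(k)
    ]


class SummaryProxy:
    def __init__(self, name: str, m: int = 4096, k: int = 4) -> None:
        self.name, self.m, self.k = name, m, k
        self.cache: dict[str, str] = {}
        self.summary = [0] * m    # Bloom Filter of the local cache
        self.peer_summaries = {}  # peer -> last received copy of its filter

    def store(self, url: str, page: str) -> None:
        self.cache[url] = page
        for pos in positions(url, self.m, self.k):
            self.summary[pos] = 1

    def publish_to(self, peer: "SummaryProxy") -> None:
        """Send a snapshot of the local summary to a peer (periodic update)."""
        peer.peer_summaries[self] = list(self.summary)

    def candidates(self, url: str) -> list["SummaryProxy"]:
        """Peers whose summary answers 'Probably Yes' for this URL."""
        pos = positions(url, self.m, self.k)
        return [p for p, bits in self.peer_summaries.items()
                if all(bits[i] for i in pos)]


a, b, c = SummaryProxy("a"), SummaryProxy("b"), SummaryProxy("c")
b.store("http://example.org/x", "page-x")
b.publish_to(a)
c.publish_to(a)
first = a.candidates("http://example.org/x")   # only b is worth querying

b.store("http://example.org/y", "page-y")      # a's copy of b's summary is now stale
b.publish_to(a)                                # the next update repairs it
second = a.candidates("http://example.org/y")
```

Between updates, a proxy's copy of a neighbor's summary lags behind that neighbor's cache, which is the price paid for sending far fewer queries.
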
ICP - With Summary Cache
[Figure: a client, two proxy caches, and the Internet.]

Summary Cache - Bloom Filters
- To support deletions and updates, the proxy maintains the Bloom Filter V and also an array of counters C, initially set to 0.
- The Bloom Filter is filled with the contents of the cache.
- Each bit in the Bloom Filter is allotted a 4-bit counter.
- On insert of item i, all counters C[h_j(i)] are increased (to a maximum of 15).
- On deletion of item i, the same counters are decreased.
- When a counter C[l] increases from 0 to 1, bit V[l] is turned on.
- When a counter C[l] decreases from 1 to 0, bit V[l] is turned off.

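The counter rules above can be sketched as a counting Bloom filter. The class and method names are hypothetical, salted SHA-256 is an assumed stand-in for the hash functions, and the handling of a saturated counter (it is simply decremented again on delete) is a simplification, not something the slides specify.

```python
import hashlib

MAX_COUNT = 15  # a 4-bit counter saturates at 15


class CountingBloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.V = [0] * m  # bit vector (the part that would be exchanged)
        self.C = [0] * m  # per-bit counters (kept locally)

    def _positions(self, item: str) -> list[int]:
        # Assumption: salted SHA-256 stands in for h_1..h_k.
        return [
            int.from_bytes(hashlib.sha256(f"{j}:{item}".encode()).digest()[:8], "big") % self.m
            for j in range(self.k)
        ]

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            if self.C[pos] < MAX_COUNT:
                self.C[pos] += 1
            self.V[pos] = 1              # counter is at least 1, so the bit is on

    def delete(self, item: str) -> None:
        # Only call for items that were inserted earlier.
        for pos in self._positions(item):
            if self.C[pos] > 0:
                self.C[pos] -= 1
                if self.C[pos] == 0:
                    self.V[pos] = 0      # 1 -> 0 turns the bit off

    def might_contain(self, item: str) -> bool:
        return all(self.V[pos] for pos in self._positions(item))


cbf = CountingBloomFilter(m=512, k=4)
cbf.insert("http://example.org/a")
cbf.insert("http://example.org/b")
cbf.delete("http://example.org/a")
still_b = cbf.might_contain("http://example.org/b")
cbf.delete("http://example.org/b")
```

Deleting one item leaves every other inserted item answerable, because a shared bit stays on until its counter drops back to 0.
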
Summary Cache - Bloom Filters
- Hashing scheme
  - Generate 128 bits using MD5 on the URL.
  - Divide them into segments of M bits (usually 32).
  - Take each segment modulo m, providing 128/M hash values (4 for 32-bit segments).
  - If 128 bits are not enough, calculate the MD5 of the URL concatenated with itself.
- Bloom Filter exchange
  - The header contains the MD5 properties and the size of the array.
  - If the refresh rate is high, send only deltas.
  - Otherwise, send the entire Bloom Filter.
  - Bit counts are internal and are not exchanged.

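The hashing scheme can be sketched with Python's `hashlib`. The function name `md5_hashes`, the big-endian reading of each segment, and the choice to append the URL once more for every extra round are assumptions. The slides only say that the URL is concatenated with itself.

```python
import hashlib


def md5_hashes(url: str, m: int, k: int, segment_bits: int = 32) -> list[int]:
    """Derive k positions in [0, m) from MD5 digests of the URL."""
    assert segment_bits % 8 == 0 and 128 % segment_bits == 0
    seg_bytes = segment_bits // 8
    raw = url.encode("utf-8")
    data = raw
    values: list[int] = []
    while len(values) < k:
        digest = hashlib.md5(data).digest()          # 128 bits = 16 bytes
        for off in range(0, 16, seg_bytes):          # 128/M segments per digest
            segment = int.from_bytes(digest[off:off + seg_bytes], "big")
            values.append(segment % m)
        data += raw                                  # URL concatenated with itself
    return values[:k]


url = "http://www.squid-cache.org/"
h4 = md5_hashes(url, m=1000, k=4)   # one digest is enough
h6 = md5_hashes(url, m=1000, k=6)   # needs a second digest
```

One MD5 call yields four 32-bit segments, so a second digest is only computed when more than four hash values are requested.
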
Summary Cache - Errors
- False misses
  - The requested document is cached at some remote proxy, but the summary does not reflect that fact.
  - The hit ratio is reduced, and a redundant internet access is performed.
- False hits
  - The document is not at a remote proxy, but the summary suggests that it is.
  - An inter-proxy query message is wasted.
- Remote stale hits
  - The document is cached at a remote proxy, but is stale.
  - Occurs in both ICP and Summary Cache.
  - Might not be a totally wasted effort, as delta compression can be used.

Implementation - Squid
- Squid: publicly available web proxy cache software. http://www.squid-cache.org
- Summary Cache is implemented in Squid v1.1.14.
- A variation called cache digest is implemented in Squid 1.2b20.