How to Approximate a Set Without Knowing Its Size in Advance
Rasmus Pagh (IT University of Copenhagen), Gil Segev (Stanford), Udi Wieder (Microsoft Research)

Set Membership •

Approximate Set Membership • |S|= n

ASM in a picture
• [Figure: a set S ⊆ [100]×[100] with |S| = 188, shown alongside approximations A(S) of sizes 5213, 2699, 1580, and 918]

Applications
• Many; very common in practice.
• Databases, networking, and more.
• Serves as a filter for accessing slow or bandwidth-bounded data.
• [Figure: a request passes through a filter (an approximation of the cache) placed in front of the proxy cache and the external web]
• Requests arrive first at the filter, which determines which requests reside in the proxy’s cache and which should be fetched from the network.
• The cost of a false positive is a cache miss.

Lower Bounds for Static Case: • [CFGMW 78]

Upper Bounds – Bloom Filters
• [Figure: a bit array in which the positions associated with an item X are set to 1]
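The slide's own content did not survive extraction, so the following is only a minimal sketch of a textbook Bloom filter, not the presenters' formulation. The class name `BloomFilter`, the parameters `capacity` and `error_rate`, and the use of SHA-256 with double hashing are all assumptions made for illustration. Note that the capacity must be fixed up front, which is exactly the limitation the talk addresses.

```python
import hashlib
import math


class BloomFilter:
    """Approximate set: no false negatives, occasional false positives."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: m = n * ln(1/eps) / (ln 2)^2 bits, k = (m/n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(capacity * math.log(1 / error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing
        # (an illustrative choice, not taken from the slides).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit the item maps to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Report membership only if every mapped bit is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

A typical use would be `bf = BloomFilter(1000, 0.01)`, followed by `bf.add("key")` and the query `"key" in bf`.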

Dictionary Based Upper Bounds •

Separation of Static and Dynamic •

But in practice…
• The size of the set is not known in advance!
• This leads to over-provisioning of space up front.
• Space is wasted as long as the set is small.
• Typically the data structure lies in prime real estate; the whole idea is saving space.
• The problem has been raised and handled in ‘practical’ papers, typically in a naïve way from a ‘theoretical’ point of view.
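One workaround of the kind used in practice is to chain several Bloom filters of growing capacity (the talk's closing slide mentions "multiple Bloom filters"). The sketch below is a hypothetical illustration of that idea, not the construction presented in this talk. The names `_Bloom` and `GrowingFilter`, the capacity-doubling rule, and the choice of giving the i-th filter error rate ε/2^(i+1) (so the rates sum to at most ε) are assumptions.

```python
import hashlib
import math


class _Bloom:
    """Compact fixed-capacity Bloom filter used as a building block."""

    def __init__(self, capacity, error_rate):
        self.capacity = capacity
        self.count = 0
        self.m = max(8, math.ceil(capacity * math.log(1 / error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class GrowingFilter:
    """Chain of Bloom filters; no bound on the set size is needed up front."""

    def __init__(self, error_rate=0.01, initial_capacity=64):
        self.error_rate = error_rate
        self.filters = [_Bloom(initial_capacity, error_rate / 2)]

    def add(self, item):
        newest = self.filters[-1]
        if newest.count >= newest.capacity:
            # Newest filter is full: open one with double the capacity and a
            # tighter error rate, so the overall rate stays below error_rate.
            i = len(self.filters)
            newest = _Bloom(newest.capacity * 2, self.error_rate / 2 ** (i + 1))
            self.filters.append(newest)
        newest.add(item)

    def __contains__(self, item):
        # An item is reported present if any filter in the chain reports it.
        return any(item in f for f in self.filters)
```

Space is allocated only as the set grows, but later filters must spend more bits per item to keep the total false positive rate bounded, and a query may probe every filter in the chain.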

Main Results (approximate) • Super linear bound!

Lower Bound •

Lower Bound – proof sketch • . . .

Lower Bound: the encoding •

Upper Bound – Construction 1 •

Getting Constant Query Time •

Analysis •

Extensions and standard tricks
• Extra space is required when rebuilding the new dictionary: both dictionaries need to be stored until the rebuild is complete.
• This can be mitigated by bucketing items into many smaller dictionaries and rebuilding the smaller dictionaries one at a time.
• De-amortization of Insert: each time an item is inserted, perform O(1) operations on the next dictionary.
• De-amortization is not compatible with the bucketing technique and requires a small increase in space.

Supporting Deletions
• Necessary assumption: only items that are in the set are ever deleted.
• The removal of a ‘false positive’ item may introduce false negatives.
• The assumption makes sense in many applications when the data structure filters a cache.
• The standard approach of storing multisets is problematic: an item generates many signatures, and we can’t tell which one to remove.
• Upon insertion, if the fingerprint already appears, put it in a secondary structure. Upon removal, check the secondary structure first.
• This requires the assumption that each item is inserted only once.
• It also requires some extra bookkeeping.
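The insertion and removal rule above can be illustrated with a deliberately simplified sketch. It assumes a single fixed-length fingerprint per item, whereas the slide indicates that in the actual construction an item generates many signatures, so this is not the authors' data structure. The class name `DeletableFingerprintSet`, the SHA-256 fingerprint, and the use of a counter as the secondary structure are all hypothetical choices.

```python
import hashlib
from collections import Counter


class DeletableFingerprintSet:
    """Fingerprint set with a secondary structure for colliding insertions.

    Assumes each item is inserted at most once and that only items
    currently in the set are ever deleted.
    """

    def __init__(self, fingerprint_bits=16):
        self.fingerprint_bits = fingerprint_bits
        self.primary = set()        # fingerprints seen at least once
        self.secondary = Counter()  # extra occurrences of a fingerprint

    def _fingerprint(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") >> (64 - self.fingerprint_bits)

    def insert(self, item):
        fp = self._fingerprint(item)
        if fp in self.primary:
            # Fingerprint already present: record the collision separately.
            self.secondary[fp] += 1
        else:
            self.primary.add(fp)

    def delete(self, item):
        fp = self._fingerprint(item)
        if self.secondary[fp] > 0:
            # Check the secondary structure first, so that another item
            # sharing this fingerprint does not become a false negative.
            self.secondary[fp] -= 1
        else:
            self.primary.discard(fp)

    def __contains__(self, item):
        return self._fingerprint(item) in self.primary
```

Under the two stated assumptions, a fingerprint stays in `primary` exactly as long as at least one live item maps to it, so deletions never introduce false negatives in this simplified setting.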

Open Problems
• Bridge the theory-practice gap.
• Practitioners seem content with the solution of multiple Bloom filters.
• But then, practitioners seem content with Bloom filters…
• Get the leading constant in front of log n.

THANK YOU