How to Approximate a Set Without Knowing Its Size in Advance
- Slides: 21
How to Approximate a Set Without Knowing Its Size in Advance? • Rasmus Pagh (IT University of Copenhagen) • Gil Segev (Stanford) • Udi Wieder (Microsoft Research)
Set Membership •
Approximate Set Membership • |S|= n
ASM in a picture • [Figure: a set S ⊆ [100]×[100] with |S| = 188, shown next to four approximations A(S) of sizes 5213, 2699, 1580, and 918]
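The sizes in the picture can be turned into false positive rates. The calculation below is only a sketch: it assumes that each A(S) contains all of S (no false negatives) and that the rate is measured over the points of [100]×[100] that lie outside S. The function name is illustrative and does not come from the slides.

```python
UNIVERSE = 100 * 100   # number of points in [100] x [100]
SET_SIZE = 188         # |S| from the picture


def false_positive_rate(approx_size, set_size=SET_SIZE, universe=UNIVERSE):
    """Fraction of non-members that the approximation wrongly accepts.

    Assumes the approximation is a superset of S.
    """
    return (approx_size - set_size) / (universe - set_size)


for size in (5213, 2699, 1580, 918):
    print(size, round(false_positive_rate(size), 3))
```

Under these assumptions the four approximations correspond to false positive rates of about 0.51, 0.26, 0.14, and 0.07.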
Applications • Many; very common in practice • Databases, networking, and more • Serves as a filter for accessing slow or bandwidth-bounded data • [Figure labels: Request; Filter (approximation of the cache); External; Web Proxy Cache] • Requests arrive first at the filter, which determines which requests reside in the proxy’s cache and which should be fetched from the network. • The cost of a false positive is a cache miss.
Lower Bounds for Static Case: • [CFGMW 78]
Upper Bounds – Bloom Filters • [Figure: a bit array with some positions set to 1 and an item X]
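For readers who have not seen one, a textbook Bloom filter can be sketched as follows. This is a minimal illustration, not the construction from the talk: the class name, the parameters `num_bits` and `num_hashes`, and the use of salted SHA-256 as the hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: k hash positions per item in a fixed bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index
        # (an illustrative choice of hash family).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True for every inserted item; may also be True for a non-member.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=4)
for word in ("alpha", "beta", "gamma"):
    bf.add(word)
print("alpha" in bf)
```

Note that `num_bits` has to be chosen when the filter is created, which is where advance knowledge of the set size enters.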
Dictionary Based Upper Bounds •
Separation of Static and Dynamic •
But in practice… • The size of the set is not known in advance! • This leads to over-provisioning of space up front • Space is wasted as long as the set is small • Typically the data structure lives in prime real estate; the whole idea is to save space. • The problem has been raised and handled in ‘practical’ papers • Typically in a way that is naïve from a ‘theoretical’ point of view
Main Results (approximate) • Super linear bound!
Lower Bound •
Lower Bound – proof sketch • . . .
Lower Bound: the encoding •
Upper Bound – Construction 1 •
Getting Constant Query Time •
Getting Constant Query Time •
Analysis •
Extensions and standard tricks • Extra space is required when rebuilding the new dictionary: both dictionaries need to be stored until the rebuild is complete. • This can be mitigated by bucketing items into many smaller dictionaries and rebuilding the smaller dictionaries one at a time. • De-amortization of Insert: each time an item is inserted, perform O(1) operations on the next dictionary. • Not compatible with the bucketing technique; requires a small increase in space.
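The de-amortization idea can be illustrated with a toy store that copies a constant number of entries into the next dictionary on every insert. This is a generic sketch, not the construction from the talk: it uses plain Python sets in place of the fingerprint dictionaries, and the class name, `capacity`, and `MOVES_PER_INSERT` are hypothetical.

```python
_DONE = object()  # sentinel marking the end of the migration


class IncrementalRebuildStore:
    MOVES_PER_INSERT = 2  # constant migration work per insert (illustrative)

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.current = set()
        self.upcoming = None   # dictionary under construction, or None
        self._cursor = None    # iterator over entries still to be copied

    def insert(self, fp):
        if self.upcoming is None:
            self.current.add(fp)
            if len(self.current) >= self.capacity:
                # Start a rebuild; both dictionaries coexist from now on.
                self.upcoming = set()
                self._cursor = iter(self.current)
            return
        self.upcoming.add(fp)  # new items go straight to the next dictionary
        for _ in range(self.MOVES_PER_INSERT):
            moved = next(self._cursor, _DONE)
            if moved is _DONE:
                # Migration complete: the old dictionary can be dropped.
                self.current, self.upcoming = self.upcoming, None
                self._cursor = None
                self.capacity *= 2
                return
            self.upcoming.add(moved)

    def __contains__(self, fp):
        if fp in self.current:
            return True
        return self.upcoming is not None and fp in self.upcoming


store = IncrementalRebuildStore()
for i in range(100):
    store.insert(i)
print(all(i in store for i in range(100)))
```

While a rebuild is in progress, queries consult both dictionaries, which is the extra space the slide refers to.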
Supporting Deletions • Necessary assumption: only items that are in the set are ever deleted. • The removal of a ‘false positive’ item may introduce false negatives. • The assumption makes sense in many applications when the data structure filters a cache. • The standard approach of storing multisets is problematic: an item generates many signatures, so one cannot tell which one to remove. • Upon insertion, if the fingerprint already appears, put it in a secondary structure. Upon removal, check the secondary structure first. • Requires the assumption that each item is inserted only once • Requires some extra bookkeeping.
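The secondary-structure idea for deletions can be sketched roughly as below. This is a toy illustration under stated assumptions, not the data structure from the talk: each item has a single short fingerprint derived from SHA-256, the primary structure is a Python set, the secondary structure is a `Counter`, and all names are hypothetical. As on the slide, the caller must only remove items that were inserted, and must insert each item only once.

```python
import hashlib
from collections import Counter


def fingerprint(item, bits=8):
    """Short fingerprint of an item (illustrative hash choice)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)


class DeletableFilter:
    def __init__(self, bits=8):
        self.bits = bits
        self.primary = set()        # fingerprints currently present
        self.secondary = Counter()  # extra copies of colliding fingerprints

    def insert(self, item):
        fp = fingerprint(item, self.bits)
        if fp in self.primary:
            self.secondary[fp] += 1   # fingerprint already appears
        else:
            self.primary.add(fp)

    def remove(self, item):
        # Assumes `item` was inserted earlier and has not been removed yet.
        fp = fingerprint(item, self.bits)
        if self.secondary[fp] > 0:
            self.secondary[fp] -= 1   # check the secondary structure first
        else:
            self.primary.discard(fp)

    def __contains__(self, item):
        return fingerprint(item, self.bits) in self.primary


f = DeletableFilter(bits=2)  # tiny fingerprints to force collisions
items = [f"item{i}" for i in range(20)]
for it in items:
    f.insert(it)
for it in items[:10]:
    f.remove(it)
print(all(it in f for it in items[10:]))
```

Because a colliding fingerprint is counted in the secondary structure rather than dropped, removing one of two colliding items leaves the other one visible to queries.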
Open Problems • Bridge the theory-practice gap • Practitioners seem content with the solution of multiple Bloom filters • But then, practitioners seem content with Bloom filters… • Get the leading constant in front of log n • THANK YOU