Augmented Sketch Faster and More Accurate Stream Processing
- Slides: 50
Augmented Sketch: Faster and More Accurate Stream Processing Roy, Pratanu, Arijit Khan, and Gustavo Alonso. SIGMOD’ 16, June 26 -July 01, 2016, San Francisco, CA, USA 2016/07/21 M 1 kosaka S
What is the stream data? S Example S real-time IP traffic, phone calls, sensor measurements, web clicks and crawls. . . S Processing of such streams often requires approximation through succinct synopses created in a single-pass S These synopses summarize the streams to give an idea of the frequency of items, using a small amount of space, while allowing to process the stream fast enough
Introduction S In this paper, the authors study the problem of frequency estimation over data streams that require summarization : S given a data item as a query, they estimate the frequency count of that item in the input stream. S Sketch data structures are the solution of choice to address this problem. S By using multiple hash functions, sketches summarize massive data streams within a limited space.
Introduction S However, the approach can give inaccurate results. 1. it can give an inaccurate count for the most frequent items 2. it can misclassify low- frequency items and report them as high-frequency ones
Introduction S In this paper, the authors propose a way to solve these problems by complementing existing techniques and without requiring a radically different solution. S ASketch (Augmented Sketch) is a stream processing framework, applicable over many sketch-based synopses, and coupled with an adaptive pre-filtering and early-aggregation strategy.
Related Work S The problem of synopsis construction has been studied extensively in the context of a variety of techniques S sampling , wavelets , histograms , sketches , and counter- based methods S Sketch-Based data structures are typically used for frequency estimation S Counter-based data structures are designed for finding the top-k frequent items.
Related Work S Several systems have been proposed either to improve the stream processing throughput of sketches S e. g. Holistic UDAFs , Frequency- Aware Counting , g. Sketch S The authors empirically compare in Table 1 the efficiency and accuracy of ASketch with several existing techniques
Related Work S The idea of separating high-frequency items from skewed data distribution is not new S Modern processors employ caches on top of main memory to capture the locality of access and to improve the overall throughput
Preliminaries S Sketches are a family of data structures for summarizing data streams. S Sketches bear similarities to Bloom filters in that both employ hashing to summarize data; however, they differ in how they update the hash buckets and use these hashed data to derive estimates.
Preliminaries S Among the various sketches available, it has been shown that Count- Min achieves the best update throughput in general, as well as high accuracy on skewed distributions. S The authors shall discuss ASketch on top of Count-Min.
Preliminaries S Count-Min S In Count-Min, hashing is used to approximately maintain the frequency counts of a large number of distinct items in a data stream. S The authors use w pairwise independent hash functions, each of which maps onto uniformly random integers in the range [0. . . h − 1]. The data structure itself consists of a 2 -dimensional array with w · h cells
Preliminaries S The estimated count is at least equal to ct, since they are dealing with non-negative counts only, and there may be an over-estimation because of collisions among hash cells. S For a data stream with N as the sum of the counts of the items received so far, the estimated count is at most ct + ( e/h )*N with probability at least 1 − e−w.
Motivation For Filtering and Aggregating Frequent Items S The authors analytically compare the frequency estimation error, frequency estimation time, and stream processing throughput of both Count-Min and ASketch.
Motivation For Filtering and Aggregating Frequent Items S Count-Min Properties S The time required to insert an item in Count-Min is ts = O(w). S For a data stream with N as the sum of the counts of the items received so far, the estimated count is at most ct + ( e/h )*N with probability at least 1 − e−w.
Motivation For Filtering and Aggregating Frequent Items S How is a Filter Accommodated in an ASketch? S ASketch augments the traditional Count-Min with a filter S Assume that the filter consumes sf space, while the time required to insert an item (as well as query an item) in the filter is tf S sf is very small and tf << ts
Motivation For Filtering and Aggregating Frequent Items S They have : sf + w’h’ = wh S In their implementation, they fix w’ = w , and reduce h’ to ( h − sf/w) , due to two reasons ′ 1. Usually, h > w. → updating h is more flexible to accommodate various sizes of filter in ASketch. 2. Having the same w number of hash functions for both Count. Min and ASketch → keep the error bound probability e−w identical.
Motivation For Filtering and Aggregating Frequent Items S Impact of the Filter on Throughput of Asketch S ASketch update time (as well as query time) can be expressed as: tf + filterselectivity ∗ ts. S In the case of ASketch updates, there is an additional factor which involves exchanging of data items between the filter and sketch. S Here, they ignored it. But they analyze it in more detail in the Appendix and evaluate the impact of this factor empirically in Section 7.
Motivation For Filtering and Aggregating Frequent Items S Clearly, for ASketch, they increase the processing time by tf ; however, in return, they reduce the sketch processing time by a factor of filterselectivity. S A small size of the filter and vectorized execution keeps tf small, in the range of a few assembly instructions. S Therefore, filterselectivity essentially determines the overall effectiveness of ASketch.
Motivation For Filtering and Aggregating Frequent Items S Assume that out of N aggregated stream-counts, N 1 counts are processed by the filter and (N − N 1) = N 2 counts are processed by the sketch. Therefore, filterselectivity = N 2/N. S However, most real world streams exhibit skew
Motivation For Filtering and Aggregating Frequent Items S Increasing the filter size beyond a certain threshold does not reduce the filterselectivity significantly S However, increasing the filter size, depending on the filter implementation , may increase the filter processing time (tf).
Motivation For Filtering and Aggregating Frequent Items S Impact of the Filter on Accuracy of Asketch S ASketch improves the frequency estimation accuracy for the frequent items that are stored in the filter. S For the items that are stored in the sketch, the authors reduce their collisions with the high-frequency items. S This reduces the possibility that a low-frequency item would appear as a high-frequency item, therefore reducing the misclassification error.
Motivation For Filtering and Aggregating Frequent Items
Algorithm for ASketch S ASketch exchanges data items between the filter and the sketch, so that the filter stores the most frequent items of the observed input stream. S Such movement of data items is a challenging problem for several reasons. 1. one needs to ensure that the exchange procedure does not incur too much overhead on the normal stream processing rate. 2. it is difficult to remove the item, along with its frequency count, from the underlying sketch.
Algorithm for ASketch S In ASketch, they want to maintain the one-sided accuracy guarantee of Count-Min, i. e. , the estimated frequency of an item reported by ASketch must always be an overestimation of its true frequency. S They achieve this accuracy guarantee by introducing two different counts in the filter as described below.
Algorithm for ASketch S Algorithm Description S ASketch consists of two data structures: filter (F) and sketch (CMS). S The filter stores a few high-frequency items and two associated counts, namely new_count and old_count for each of the items that it monitors.
Algorithm for ASketch ・Stream processing algorithm for Asketch
Algorithm for ASketch S The key insight here is that they can exactly capture the hits for an item that is currently stored in the filter by using its (new_count- old_count) value. This has three benefits 1. By accumulating this count in the filter, they save the cost of applying multiple hash functions in the sketch, and thus, improve throughput. 2. As high-frequency items in a skewed data stream remain in the filter most of the time, they get more accurate frequency counts for these items. 3. The misclassification rate for the low-frequency items as high-frequency items also decreases, since the aggregate of the (new_count-old_count) values of the high-frequency items are not hashed into the underlying sketch.
Algorithm for ASketch ・Query processing algorithm for Asketch
Algorithm for ASketch S Exchange Policy. S In this section, they discuss the efficiency of the exchange policy. 1. They need to keep track of the smallest frequency count in the filter. The efficiency of this step depends on the underlying data structure and the hardware implementation of the filter, which will be discussed shortly in Section 6. 1.
Algorithm for ASketch 2. They empirically found that they require a relatively small number of exchanges compared to the overall stream size in order to ensure that the high-frequency items are stored and early aggregated in the filter. 3. Can their exchange policy trigger multiple exchanges between the filter and the sketch? Their exchange mechanism may indeed initiate multiple exchanges
Algorithm for ASketch S The following lemma guarantees an upper bound on the number of times an item is inserted into the underlying sketch in the presence of their exchange policy, thereby ensuring the overall efficiency of ASketch.
Algorithm for ASketch S Below, they demonstrate the key steps of their stream processing algorithm (Algorithm 1) with an example.
Asketch on Multi-Core S They now discuss the implementation of ASketch in modern multi-core hardware with an emphasis on (1) SIMD parallelism in the filter implementation (2) pipeline parallelism in the ASketch framework (3) SPMD parallelism by implementing ASketch as a counting kernel in a multi-core machine.
Asketch on Multi-Core
Asketch on Multi-Core
Asketch on Multi-Core
Asketch on Multi-Core
Experimental Results S Frequency Estimation queries S Stream Processing Throughput and Query Processing Throughput
Experimental Results S Avoiding Large Estimation Error
Experimental Results S Error Comparison with Count-Min and Holistic UDAFs
Experimental Results S Error Comparison with Frequency-Aware Counting (FCM)
Experimental Results S Top-k Frequent Items Queries
Experimental Results S Number of Exchanges for Asketch
Experimental Results S Experiments on Real-World Datasets S Stream Processing Throughput Comparison, Frequency Estimation Error Comparison
Experimental Results S Space Saving for Frequency Estimation
Experimental Results S Experiments on Parallel ASketch S Throughput Benefit of Pipeline Parallelism
Experimental Results S ASketch as a Counting Kerne
Experimental Results S ASketch Sensitivity Analysis S Impact of the Filter Implementation
Experimental Results S Impact of the Filter Size
Conclusion S The authors present ASketch — a filtering technique complementing sketches. S It improves the accuracy of sketches by increasing the frequency estimation accuracy for the most frequent items and by reducing the possible misclassification of the low- frequency items. S It also improves the overall throughput.
- More more more i want more more more more we praise you
- More more more i want more more more more we praise you
- Accurate stream
- Differentiate byte stream and character stream
- Finish sketch crime scene
- Because you speak english unnaturally
- Apa itu augmented processing table
- How to sketch a crime scene
- Corpus delicti
- More accurate data
- Flink stateful stream processing
- Alur data memory
- Stream processing engines
- Sketch techniques for approximate query processing
- Neighborhood averaging in image processing
- Secondary processing
- Interactive processing
- Human history becomes more and more a race
- Core augmented and actual product
- Augmented reality: principles and practice
- Building and sustaining relationships in retailing pdf
- Core product augmented product
- Top-down vs bottom-up processing
- Gloria suarez
- Bottom up processing example
- Point processing operations in image processing
- Histogram processing in digital image processing
- Parallel processing vs concurrent processing
- Laplacian filter
- What is point processing in digital image processing
- Morphological
- Top down vs bottom up psychology
- The more you take the more you leave behind
- The more you study the more you learn
- Aspire not to have more but to be more
- Newton's first law
- Knowing more remembering more
- The more i give to thee the more i have
- More choices more chances
- The reason a 20-kg rock falls no faster than
- What is mass and weight
- Superlative form hot
- Faster than fairies
- Faster scale genesis process
- Anchor boxes
- Do heavier things fall faster
- All by himself and gathering brambles
- Do heavier things fall faster
- Gravitational acceleration
- What is tachyon
- Factors affecting the rate of chemical reaction temperature