Augmented Sketch Faster and More Accurate Stream Processing

  • Slides: 50
Download presentation
Augmented Sketch: Faster and More Accurate Stream Processing Roy, Pratanu, Arijit Khan, and Gustavo

Augmented Sketch: Faster and More Accurate Stream Processing Roy, Pratanu, Arijit Khan, and Gustavo Alonso. SIGMOD’ 16, June 26 -July 01, 2016, San Francisco, CA, USA 2016/07/21 M 1 kosaka S

What is the stream data? S Example S real-time IP traffic, phone calls, sensor

What is the stream data? S Example S real-time IP traffic, phone calls, sensor measurements, web clicks and crawls. . . S Processing of such streams often requires approximation through succinct synopses created in a single-pass S These synopses summarize the streams to give an idea of the frequency of items, using a small amount of space, while allowing to process the stream fast enough

Introduction S In this paper, the authors study the problem of frequency estimation over

Introduction S In this paper, the authors study the problem of frequency estimation over data streams that require summarization : S given a data item as a query, they estimate the frequency count of that item in the input stream. S Sketch data structures are the solution of choice to address this problem. S By using multiple hash functions, sketches summarize massive data streams within a limited space.

Introduction S However, the approach can give inaccurate results. 1. it can give an

Introduction S However, the approach can give inaccurate results. 1. it can give an inaccurate count for the most frequent items 2. it can misclassify low- frequency items and report them as high-frequency ones

Introduction S In this paper, the authors propose a way to solve these problems

Introduction S In this paper, the authors propose a way to solve these problems by complementing existing techniques and without requiring a radically different solution. S ASketch (Augmented Sketch) is a stream processing framework, applicable over many sketch-based synopses, and coupled with an adaptive pre-filtering and early-aggregation strategy.

Related Work S The problem of synopsis construction has been studied extensively in the

Related Work S The problem of synopsis construction has been studied extensively in the context of a variety of techniques S sampling , wavelets , histograms , sketches , and counter- based methods S Sketch-Based data structures are typically used for frequency estimation S Counter-based data structures are designed for finding the top-k frequent items.

Related Work S Several systems have been proposed either to improve the stream processing

Related Work S Several systems have been proposed either to improve the stream processing throughput of sketches S e. g. Holistic UDAFs , Frequency- Aware Counting , g. Sketch S The authors empirically compare in Table 1 the efficiency and accuracy of ASketch with several existing techniques

Related Work S The idea of separating high-frequency items from skewed data distribution is

Related Work S The idea of separating high-frequency items from skewed data distribution is not new S Modern processors employ caches on top of main memory to capture the locality of access and to improve the overall throughput

Preliminaries S Sketches are a family of data structures for summarizing data streams. S

Preliminaries S Sketches are a family of data structures for summarizing data streams. S Sketches bear similarities to Bloom filters in that both employ hashing to summarize data; however, they differ in how they update the hash buckets and use these hashed data to derive estimates.

Preliminaries S Among the various sketches available, it has been shown that Count- Min

Preliminaries S Among the various sketches available, it has been shown that Count- Min achieves the best update throughput in general, as well as high accuracy on skewed distributions. S The authors shall discuss ASketch on top of Count-Min.

Preliminaries S Count-Min S In Count-Min, hashing is used to approximately maintain the frequency

Preliminaries S Count-Min S In Count-Min, hashing is used to approximately maintain the frequency counts of a large number of distinct items in a data stream. S The authors use w pairwise independent hash functions, each of which maps onto uniformly random integers in the range [0. . . h − 1]. The data structure itself consists of a 2 -dimensional array with w · h cells

Preliminaries S The estimated count is at least equal to ct, since they are

Preliminaries S The estimated count is at least equal to ct, since they are dealing with non-negative counts only, and there may be an over-estimation because of collisions among hash cells. S For a data stream with N as the sum of the counts of the items received so far, the estimated count is at most ct + ( e/h )*N with probability at least 1 − e−w.

Motivation For Filtering and Aggregating Frequent Items S The authors analytically compare the frequency

Motivation For Filtering and Aggregating Frequent Items S The authors analytically compare the frequency estimation error, frequency estimation time, and stream processing throughput of both Count-Min and ASketch.

Motivation For Filtering and Aggregating Frequent Items S Count-Min Properties S The time required

Motivation For Filtering and Aggregating Frequent Items S Count-Min Properties S The time required to insert an item in Count-Min is ts = O(w). S For a data stream with N as the sum of the counts of the items received so far, the estimated count is at most ct + ( e/h )*N with probability at least 1 − e−w.

Motivation For Filtering and Aggregating Frequent Items S How is a Filter Accommodated in

Motivation For Filtering and Aggregating Frequent Items S How is a Filter Accommodated in an ASketch? S ASketch augments the traditional Count-Min with a filter S Assume that the filter consumes sf space, while the time required to insert an item (as well as query an item) in the filter is tf S sf is very small and tf << ts

Motivation For Filtering and Aggregating Frequent Items S They have : sf + w’h’

Motivation For Filtering and Aggregating Frequent Items S They have : sf + w’h’ = wh S In their implementation, they fix w’ = w , and reduce h’ to ( h − sf/w) , due to two reasons ′ 1. Usually, h > w. → updating h is more flexible to accommodate various sizes of filter in ASketch. 2. Having the same w number of hash functions for both Count. Min and ASketch → keep the error bound probability e−w identical.

Motivation For Filtering and Aggregating Frequent Items S Impact of the Filter on Throughput

Motivation For Filtering and Aggregating Frequent Items S Impact of the Filter on Throughput of Asketch S ASketch update time (as well as query time) can be expressed as: tf + filterselectivity ∗ ts. S In the case of ASketch updates, there is an additional factor which involves exchanging of data items between the filter and sketch. S Here, they ignored it. But they analyze it in more detail in the Appendix and evaluate the impact of this factor empirically in Section 7.

Motivation For Filtering and Aggregating Frequent Items S Clearly, for ASketch, they increase the

Motivation For Filtering and Aggregating Frequent Items S Clearly, for ASketch, they increase the processing time by tf ; however, in return, they reduce the sketch processing time by a factor of filterselectivity. S A small size of the filter and vectorized execution keeps tf small, in the range of a few assembly instructions. S Therefore, filterselectivity essentially determines the overall effectiveness of ASketch.

Motivation For Filtering and Aggregating Frequent Items S Assume that out of N aggregated

Motivation For Filtering and Aggregating Frequent Items S Assume that out of N aggregated stream-counts, N 1 counts are processed by the filter and (N − N 1) = N 2 counts are processed by the sketch. Therefore, filterselectivity = N 2/N. S However, most real world streams exhibit skew

Motivation For Filtering and Aggregating Frequent Items S Increasing the filter size beyond a

Motivation For Filtering and Aggregating Frequent Items S Increasing the filter size beyond a certain threshold does not reduce the filterselectivity significantly S However, increasing the filter size, depending on the filter implementation , may increase the filter processing time (tf).

Motivation For Filtering and Aggregating Frequent Items S Impact of the Filter on Accuracy

Motivation For Filtering and Aggregating Frequent Items S Impact of the Filter on Accuracy of Asketch S ASketch improves the frequency estimation accuracy for the frequent items that are stored in the filter. S For the items that are stored in the sketch, the authors reduce their collisions with the high-frequency items. S This reduces the possibility that a low-frequency item would appear as a high-frequency item, therefore reducing the misclassification error.

Motivation For Filtering and Aggregating Frequent Items

Motivation For Filtering and Aggregating Frequent Items

Algorithm for ASketch S ASketch exchanges data items between the filter and the sketch,

Algorithm for ASketch S ASketch exchanges data items between the filter and the sketch, so that the filter stores the most frequent items of the observed input stream. S Such movement of data items is a challenging problem for several reasons. 1. one needs to ensure that the exchange procedure does not incur too much overhead on the normal stream processing rate. 2. it is difficult to remove the item, along with its frequency count, from the underlying sketch.

Algorithm for ASketch S In ASketch, they want to maintain the one-sided accuracy guarantee

Algorithm for ASketch S In ASketch, they want to maintain the one-sided accuracy guarantee of Count-Min, i. e. , the estimated frequency of an item reported by ASketch must always be an overestimation of its true frequency. S They achieve this accuracy guarantee by introducing two different counts in the filter as described below.

Algorithm for ASketch S Algorithm Description S ASketch consists of two data structures: filter

Algorithm for ASketch S Algorithm Description S ASketch consists of two data structures: filter (F) and sketch (CMS). S The filter stores a few high-frequency items and two associated counts, namely new_count and old_count for each of the items that it monitors.

Algorithm for ASketch ・Stream processing algorithm for Asketch

Algorithm for ASketch ・Stream processing algorithm for Asketch

Algorithm for ASketch S The key insight here is that they can exactly capture

Algorithm for ASketch S The key insight here is that they can exactly capture the hits for an item that is currently stored in the filter by using its (new_count- old_count) value. This has three benefits 1. By accumulating this count in the filter, they save the cost of applying multiple hash functions in the sketch, and thus, improve throughput. 2. As high-frequency items in a skewed data stream remain in the filter most of the time, they get more accurate frequency counts for these items. 3. The misclassification rate for the low-frequency items as high-frequency items also decreases, since the aggregate of the (new_count-old_count) values of the high-frequency items are not hashed into the underlying sketch.

Algorithm for ASketch ・Query processing algorithm for Asketch

Algorithm for ASketch ・Query processing algorithm for Asketch

Algorithm for ASketch S Exchange Policy. S In this section, they discuss the efficiency

Algorithm for ASketch S Exchange Policy. S In this section, they discuss the efficiency of the exchange policy. 1. They need to keep track of the smallest frequency count in the filter. The efficiency of this step depends on the underlying data structure and the hardware implementation of the filter, which will be discussed shortly in Section 6. 1.

Algorithm for ASketch 2. They empirically found that they require a relatively small number

Algorithm for ASketch 2. They empirically found that they require a relatively small number of exchanges compared to the overall stream size in order to ensure that the high-frequency items are stored and early aggregated in the filter. 3. Can their exchange policy trigger multiple exchanges between the filter and the sketch? Their exchange mechanism may indeed initiate multiple exchanges

Algorithm for ASketch S The following lemma guarantees an upper bound on the number

Algorithm for ASketch S The following lemma guarantees an upper bound on the number of times an item is inserted into the underlying sketch in the presence of their exchange policy, thereby ensuring the overall efficiency of ASketch.

Algorithm for ASketch S Below, they demonstrate the key steps of their stream processing

Algorithm for ASketch S Below, they demonstrate the key steps of their stream processing algorithm (Algorithm 1) with an example.

Asketch on Multi-Core S They now discuss the implementation of ASketch in modern multi-core

Asketch on Multi-Core S They now discuss the implementation of ASketch in modern multi-core hardware with an emphasis on (1) SIMD parallelism in the filter implementation (2) pipeline parallelism in the ASketch framework (3) SPMD parallelism by implementing ASketch as a counting kernel in a multi-core machine.

Asketch on Multi-Core

Asketch on Multi-Core

Asketch on Multi-Core

Asketch on Multi-Core

Asketch on Multi-Core

Asketch on Multi-Core

Asketch on Multi-Core

Asketch on Multi-Core

Experimental Results S Frequency Estimation queries S Stream Processing Throughput and Query Processing Throughput

Experimental Results S Frequency Estimation queries S Stream Processing Throughput and Query Processing Throughput

Experimental Results S Avoiding Large Estimation Error

Experimental Results S Avoiding Large Estimation Error

Experimental Results S Error Comparison with Count-Min and Holistic UDAFs

Experimental Results S Error Comparison with Count-Min and Holistic UDAFs

Experimental Results S Error Comparison with Frequency-Aware Counting (FCM)

Experimental Results S Error Comparison with Frequency-Aware Counting (FCM)

Experimental Results S Top-k Frequent Items Queries

Experimental Results S Top-k Frequent Items Queries

Experimental Results S Number of Exchanges for Asketch

Experimental Results S Number of Exchanges for Asketch

Experimental Results S Experiments on Real-World Datasets S Stream Processing Throughput Comparison, Frequency Estimation

Experimental Results S Experiments on Real-World Datasets S Stream Processing Throughput Comparison, Frequency Estimation Error Comparison

Experimental Results S Space Saving for Frequency Estimation

Experimental Results S Space Saving for Frequency Estimation

Experimental Results S Experiments on Parallel ASketch S Throughput Benefit of Pipeline Parallelism

Experimental Results S Experiments on Parallel ASketch S Throughput Benefit of Pipeline Parallelism

Experimental Results S ASketch as a Counting Kerne

Experimental Results S ASketch as a Counting Kerne

Experimental Results S ASketch Sensitivity Analysis S Impact of the Filter Implementation

Experimental Results S ASketch Sensitivity Analysis S Impact of the Filter Implementation

Experimental Results S Impact of the Filter Size

Experimental Results S Impact of the Filter Size

Conclusion S The authors present ASketch — a filtering technique complementing sketches. S It

Conclusion S The authors present ASketch — a filtering technique complementing sketches. S It improves the accuracy of sketches by increasing the frequency estimation accuracy for the most frequent items and by reducing the possible misclassification of the low- frequency items. S It also improves the overall throughput.