Augmented Sketch: Faster and More Accurate Stream Processing

Augmented Sketch: Faster and More Accurate Stream Processing
Pratanu Roy (Systems Group, ETH Zurich)
Arijit Khan (Nanyang Technological University, Singapore)
Gustavo Alonso (Systems Group, ETH Zurich)

Data Stream Processing
[Figure: a data stream of items (e, a, a, c, e, e), annotated with the frequency counts f(e) = 2 and f(e) = 3 at successive points in the stream]
§ IP traffic, phone calls, sensor measurements, web clicks and crawls
§ Applications:
• Load balancing
• Ranking
• Frequent itemsets mining
• Classification
P. Roy, A. Khan, G. Alonso

Challenges in Stream Processing
§ Trade-off among space, accuracy, and efficiency:
-- Increasing space increases accuracy, but reduces throughput
§ Other requirements:
-- Build the summary in one pass over the stream
-- Support incremental updates to the summary

Related Work
§ Sketch
§ Space-saving
§ Wavelets
§ Sampling
[Figure: Count-Min Sketch, a table with w rows (hash functions H1, ..., Hw) and h columns; an update (e, f) adds f to the counter at H1(e), ..., Hw(e) in each row]
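
The Count-Min update and estimate steps pictured on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMin`, the default sizes, and the use of salted BLAKE2b as the hash family are assumptions made for the example.

```python
import hashlib


class CountMin:
    """Minimal Count-Min sketch: w rows (one hash function each), h counters per row."""

    def __init__(self, w=4, h=1024):
        self.w, self.h = w, h
        self.table = [[0] * h for _ in range(w)]

    def _index(self, row, key):
        # Assumption: BLAKE2b salted with the row number stands in for the
        # pairwise-independent hash functions H1..Hw of a real implementation.
        digest = hashlib.blake2b(
            repr(key).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.h

    def update(self, key, f=1):
        # Add f to one counter in every row.
        for row in range(self.w):
            self.table[row][self._index(row, key)] += f

    def estimate(self, key):
        # The minimum over all rows is the least-contaminated counter.
        return min(self.table[row][self._index(row, key)] for row in range(self.w))


cm = CountMin()
for item in "eaacee":
    cm.update(item)
print(cm.estimate("e"))  # at least 3: collisions can only inflate a counter
```

Because every update only adds to counters, the estimate can overcount an item but never undercount it; this is the error that the rest of the talk tries to reduce for frequent items.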

Our Motivation
§ Improve accuracy for frequent items
-- Critical for threshold checking (service-level agreements), ranking, and load balancing
§ Reduce misclassification
-- Frequent-item mining is the first step of frequent-itemset mining
-- Even a small number of misclassified items leads to a large number of false-positive itemsets
§ Improve throughput

Main Takeaway
[Figure: frequency (y-axis) versus items (x-axis), with hot-data and cold-data regions marked]
Let the common case go faster

Main Takeaway: Solution Framework
[Figure: input is routed to an optimized codepath (filter) for hot data and to state-of-the-art sketch algorithms for cold data]
§ Optimized codepath acts like a filter for hot data
§ Improvements: accuracy, throughput

Main Takeaway: Desired Outcome
[Figure: throughput (y-axis) versus skew (x-axis) for the solution framework and the state of the art]

Augmented Sketch (ASketch)
[Figure: input enters the filter; items with lower counts move from the filter to Count-Min, and items with higher counts move from Count-Min to the filter when estimate(k) > minimum count in the filter]
§ Filter: very small (~0.3%), which makes processing very fast (with SIMD)
§ Challenges
-- Removing items from sketch
-- Cascading exchanges

Exchange Mechanism
[Figure: a filter holding items A and B with a new count and an old count each (A: 8 and 2 after its update, B: 10 and 1), next to a Count-Min sketch with rows H1 and H2]
Item A arrives: found in filter, update in filter
Item C arrives: not found in filter, update in Count-Min
estimate(C) > minimum count (filter): initiate an exchange

Exchange Mechanism
[Figure: the filter and the Count-Min sketch before and after the exchange of C and A]
Step 1: Move C to filter (new count 9, old count 9)
Step 2: Move A to Count-Min with (8 - 2) = 6
We do not perform multiple exchanges
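
The update path and the single exchange described on the two slides above might look roughly like the following. This is a hedged sketch, not the authors' code: the names `ASketch`, `TinyCountMin`, `update`, and `query` are illustrative, a Python dict stands in for the small SIMD-scanned filter array, and the step that fills free filter slots before touching the sketch is an assumption about start-up behaviour.

```python
import hashlib


class TinyCountMin:
    """Compact Count-Min stand-in (w rows, h counters per row)."""

    def __init__(self, w=4, h=1024):
        self.w, self.h = w, h
        self.table = [[0] * h for _ in range(w)]

    def _cells(self, key):
        # Assumption: salted BLAKE2b replaces the real hash family.
        for row in range(self.w):
            d = hashlib.blake2b(repr(key).encode(), digest_size=8,
                                salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(d, "little") % self.h

    def update(self, key, f=1):
        for row, col in self._cells(key):
            self.table[row][col] += f

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))


class ASketch:
    def __init__(self, filter_size=32):
        self.filter_size = filter_size
        self.filter = {}              # key -> [new_count, old_count]
        self.sketch = TinyCountMin()

    def update(self, key, f=1):
        slot = self.filter.get(key)
        if slot is not None:          # found in filter: update in filter
            slot[0] += f
            return
        if len(self.filter) < self.filter_size:
            self.filter[key] = [f, 0]  # assumed: free slots are filled first
            return
        self.sketch.update(key, f)    # not found: update in Count-Min
        est = self.sketch.estimate(key)
        victim = min(self.filter, key=lambda k: self.filter[k][0])
        new_count, old_count = self.filter[victim]
        if est > new_count:           # estimate > minimum count in filter
            del self.filter[victim]
            # Step 1: move the item into the filter; old_count records what
            # the sketch already holds for it.
            self.filter[key] = [est, est]
            # Step 2: push back only what the victim gained while filtered.
            if new_count - old_count > 0:
                self.sketch.update(victim, new_count - old_count)
            # Exactly one exchange per update: no cascading.

    def query(self, key):
        slot = self.filter.get(key)
        return slot[0] if slot is not None else self.sketch.estimate(key)


ask = ASketch(filter_size=2)
for item in "aaaaabbbcccc":           # a x5, b x3, then c x4
    ask.update(item)
print(sorted(ask.filter))             # ['a', 'c']: c displaced b on its 4th arrival
```

Tracking the old count is what avoids removing items from the sketch: when an item leaves the filter, only the difference between its new and old counts is added back, so the sketch never double-counts it.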

Other Technical Contributions
§ Theoretical error bounds
§ Four different filter implementations
-- Array, Strict Heap, Relaxed Heap, Stream Summary
§ Hardware-conscious filter (SIMD)
§ Pipeline parallelism
§ SPMD parallelism

Experimental Results: Stream Processing Throughput
[Figure: throughput (million items/sec, y-axis ticks from 4 to 128) versus skew (Zipf parameter z, 0.0 to 3.0) for ASketch, FCM, Count-Min, and Holistic UDAFs]
§ Count-Min [Cormode et al., Algo 05]
§ Frequency Aware Counting (FCM) [Thomas et al., ICDE 09]
§ Holistic UDAFs [Cormode et al., SIGMOD 04]
Synthetic data, 8 M data items, stream size: 32 M, sketch size = 128 KB, filter size = 0.4 KB

Experimental Results: Query Processing Throughput
[Figure: throughput (million items/sec, y-axis ticks from 4 to 128) versus skew (Zipf parameter z, 0.0 to 3.0) for ASketch, FCM, Count-Min, and Holistic UDAFs]
§ Synthetic data, 8 M data items, stream size: 32 M, sketch size = 128 KB, filter size = 0.4 KB
§ Queries are generated by sampling the input distribution

Experimental Results: Accuracy Improvement
[Figure: observed error (%, y-axis from 0 to 4) for Count-Min, ASketch, Holistic UDAFs, FCM, and ASketch-FCM]
• IP-Trace, 13 M data items, stream size = 461 M, sketch size = 128 KB, Zipf 0.9
• Queries are generated by sampling the input distribution

Conclusions
§ ASketch dynamically identifies and aggregates the most frequent items
-- Improves throughput and accuracy of existing sketches
-- Allows efficient utilization of modern hardware features such as SIMD and multi-cores
§ Future work: investigate the use of ASketch in the context of machine learning and data mining applications


Impact of Misclassifications
[Figure: average relative error (y-axis ticks at 100 and 1000000) versus sketch size (16 KB, 24 KB, 32 KB) for Count-Min and ASketch]
8 M data items, stream size = 32 M, filter size = 0.4 KB

Dataflow with Pipeline Parallelism
[Figure: input flows to the optimized codepath on Core 1, which forwards items to Count-Min on Core 2]
Use message passing to communicate across the cores
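
The two-stage dataflow on this slide can be illustrated with a small producer/consumer sketch. Everything here is an assumption made for illustration: the function name `run_pipeline`, the fixed set of hot items (the real filter changes its contents through exchanges, which are omitted), and the `Counter` objects standing in for the filter and the Count-Min sketch. Python threads show the message-passing structure only; they do not reproduce the multi-core speedup.

```python
import queue
import threading
from collections import Counter


def run_pipeline(stream, hot_items):
    filter_counts = Counter()   # stage 1 state (stand-in for the filter)
    sketch_counts = Counter()   # stage 2 state (stand-in for Count-Min)
    channel = queue.Queue(maxsize=1024)   # message channel between stages
    done = object()                       # sentinel marking end of stream

    def stage1():
        # "Core 1": absorb hot items locally, forward the rest.
        for item in stream:
            if item in hot_items:
                filter_counts[item] += 1
            else:
                channel.put(item)
        channel.put(done)

    def stage2():
        # "Core 2": apply forwarded updates to the sketch.
        while True:
            item = channel.get()
            if item is done:
                break
            sketch_counts[item] += 1

    workers = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return filter_counts, sketch_counts


hot, cold = run_pipeline("eaacee", hot_items={"e"})
print(hot["e"], cold["a"], cold["c"])  # 3 2 1
```

Each stage owns its own state and the only shared object is the queue, so the filter and the sketch never need a common lock.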

Parallel-ASketch vs ASketch
[Figure: throughput (million items/sec, y-axis ticks from 4 to 256) versus skew (Zipf parameter z, 0.0 to 3.0) for Parallel-ASketch and ASketch]
Synthetic data, 8 M data items, stream size: 32 M, sketch size = 128 KB, filter size = 0.4 KB