Augmented Sketch: Faster and More Accurate Stream Processing

Pratanu Roy (Systems Group, ETH Zurich)
Arijit Khan (Nanyang Technical University, Singapore)
Gustavo Alonso (Systems Group, ETH Zurich)

Data Stream Processing
[Figure: a data stream of items (e a a c e e), annotated with f(e) = 3 … f(e) = 2]
§ Data streams: IP traffic, phone calls, sensor measurements, web clicks and crawls
§ Applications:
• Load balancing
• Ranking
• Frequent itemsets mining
• Classification
P. Roy, A. Khan, G. Alonso 1/10

Challenges in Stream Processing
§ Trade-off among space, accuracy, and efficiency:
-- Increasing space increases accuracy, but reduces throughput
§ Other requirements:
-- Build the summary in one pass over the stream
-- Incremental updates to the summary

Related Work
§ Sketch
§ Space-saving
§ Wavelets
§ Sampling
[Figure: Count-Min Sketch with w rows (hash functions H1(e) … Hw(e)) and h cells per row; an update (e, f) adds +f to one cell in each row]
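The Count-Min figure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `CountMin`, the parameters `width` (h in the figure) and `depth` (w in the figure), and the use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMin:
    """Illustrative Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting BLAKE2b with the row number
        # (an assumption for this sketch; any pairwise-independent family would do).
        digest = hashlib.blake2b(
            str(key).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key, count=1):
        # Add the count to exactly one cell in every row.
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a cell, so the minimum over the rows
        # is the tightest available estimate and never undercounts.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))


cm = CountMin(width=64, depth=4)
for item in "eaacee":
    cm.update(item)
print(cm.estimate("e"))  # at least 3, the true frequency of e
```

Because every update touches all rows, update cost grows with the number of rows regardless of how frequent the item is.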

Our Motivation
§ Improve accuracy for frequent items
-- Critical for threshold checking (service-level agreements), ranking, and load balancing
§ Reduce misclassification
-- Frequent items mining is the first step of frequent itemsets mining
-- Even a small number of misclassified items leads to a large number of false-positive itemsets
§ Improve throughput

Main Takeaway
[Figure: frequency per item, with the items split into hot data and cold data]
Let the common case go faster

Main Takeaway: Solution Framework
[Figure: input enters the Optimized Codepath (Filter), which holds the hot data; cold data goes to State-of-the-art Sketch Algorithms]
§ Optimized codepath acts like a filter for hot data
§ Improvements: accuracy, throughput

Main Takeaway: Desired Outcome
[Figure: throughput vs. skew; curves: Solution framework, State of the art]

Augmented Sketch (ASketch)
[Figure: input enters the Filter; items with lower count move from the Filter to Count-Min; items with higher count move from Count-Min to the Filter when estimate(k) > minimum count in the filter]
§ Filter: very small (~0.3%), which makes processing very fast (with SIMD)
§ Challenges
-- Removing items from sketch
-- Cascading exchanges

Exchange Mechanism
Item A arrives: found in filter, update in filter
  Filter:     items  A       B
              new    7 → 8   10
              old    2       1
  Count-Min:  H1     4  3  8  7
              H2     6  4  6  9
Item C arrives: not found in filter, update in Count-Min
  Filter:     items  A   B
              new    8   10
              old    2   1
  Count-Min:  H1     4  3  8 → 9  7
              H2     6  4  6  9 → 10
estimate(C) > minimum count (filter): initiate an exchange

Exchange Mechanism
Step 1: Move C to filter
  Filter:     items  A → C   B
              new    8 → 9   10
              old    2 → 9   1
  Count-Min:  H1     4  3  9  7
              H2     6  4  6  10
Step 2: Move A to Count-Min with (8 - 2) = 6
  Filter:     items  C   B
              new    9   10
              old    9   1
  Count-Min:  H1     4  3  9  7 → 13
              H2     6  4 → 10  6  10
We do not perform multiple exchanges
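The update and exchange steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class names, the dictionary-based filter (the slides mention array, heap, and Stream Summary variants), the salted BLAKE2b hashing, and the choice to start a fresh filter entry with an old count of 0 are all hypothetical choices made for this example.

```python
import hashlib


class CountMin:
    """Small Count-Min sketch used as the backing summary (illustrative)."""

    def __init__(self, width, depth):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # Yield the (row, index) pair that the key maps to in every row.
        for r, row in enumerate(self.rows):
            digest = hashlib.blake2b(
                str(key).encode(), digest_size=8, salt=r.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def update(self, key, count):
        for row, i in self._cells(key):
            row[i] += count

    def estimate(self, key):
        return min(row[i] for row, i in self._cells(key))


class ASketch:
    """Filter in front of a sketch; each filter entry is [new_count, old_count]."""

    def __init__(self, filter_size, width, depth):
        self.filter_size = filter_size
        self.filter = {}
        self.sketch = CountMin(width, depth)

    def update(self, key, count=1):
        slot = self.filter.get(key)
        if slot is not None:                      # found in filter: update in filter
            slot[0] += count
            return
        if len(self.filter) < self.filter_size:   # free slot: nothing is in the sketch yet
            self.filter[key] = [count, 0]
            return
        self.sketch.update(key, count)            # not found: update in the sketch
        est = self.sketch.estimate(key)
        victim = min(self.filter, key=lambda k: self.filter[k][0])
        v_new, v_old = self.filter[victim]
        if est > v_new:                           # initiate a single exchange
            del self.filter[victim]
            # old_count records how much of the key's count already sits in the sketch.
            self.filter[key] = [est, est]
            if v_new > v_old:
                # Push only the part of the victim's count the sketch has not seen.
                self.sketch.update(victim, v_new - v_old)

    def estimate(self, key):
        slot = self.filter.get(key)
        return slot[0] if slot is not None else self.sketch.estimate(key)


sketch = ASketch(filter_size=2, width=256, depth=4)
for key in ["A"] * 5 + ["B"] * 7 + ["C"] * 6:
    sketch.update(key)
print(sorted(sketch.filter))  # ['B', 'C']: C displaced A, the smallest filter entry
```

Keeping the old count alongside the new count is what lets an evicted item re-enter the sketch without ever deleting from it: only the difference (8 - 2 = 6 in the slide's example) is added back.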

Other Technical Contributions
§ Theoretical error bounds
§ Four different filter implementations
-- Array, Strict Heap, Relaxed Heap, Stream Summary
§ Hardware-conscious filter (SIMD)
§ Pipeline parallelism
§ SPMD parallelism

Experimental Results: Stream Processing Throughput
[Figure: throughput (million items/sec) vs. skew (Zipf parameter z, 0.0 to 3.0); curves: ASketch, FCM, Count-Min, Holistic UDAFs]
§ Count-Min [Cormode et al., Algo 05]
§ Frequency Aware Counting (FCM) [Thomas et al., ICDE 09]
§ Holistic UDAFs [Cormode et al., SIGMOD 04]
§ Synthetic data, 8 M data items, stream size: 32 M, sketch size = 128 KB, filter size = 0.4 KB

Experimental Results: Query Processing Throughput
[Figure: throughput (million items/sec) vs. skew (Zipf parameter z, 0.0 to 3.0); curves: ASketch, FCM, Count-Min, Holistic UDAFs]
§ Synthetic data, 8 M data items, stream size: 32 M, sketch size = 128 KB, filter size = 0.4 KB
§ Queries are generated by sampling the input distribution

Experimental Results: Accuracy Improvement
[Figure: observed error (%), scale 0 to 4, for Count-Min, ASketch, Holistic UDAFs, FCM, and ASketch-FCM]
• IP-Trace, 13 M data items, stream size = 461 M, sketch size = 128 KB, Zipf 0.9
• Queries are generated by sampling the input distribution

Conclusions
§ ASketch dynamically identifies and aggregates the most frequent items
-- Improves throughput and accuracy of existing sketches
-- Allows efficient utilization of modern hardware features such as SIMD, multi-cores, etc.
§ Future work: investigate the use of ASketch in the context of machine learning and data mining applications

Impact of Misclassifications
[Figure: average relative error vs. sketch size (16 KB, 24 KB, 32 KB) for Count-Min and ASketch]
8 M data items; stream size = 32 M, filter size = 0.4 KB

Dataflow with Pipeline Parallelism
[Figure: input enters the optimized codepath on Core 1, which passes items on to Count-Min on Core 2]
Use message passing to communicate across the cores
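The two-stage dataflow above can be sketched with two threads and a queue. This is a simplified illustration, not the authors' design: Python threads stand in for cores (they do not run in parallel under the GIL), `queue.Queue` stands in for the message-passing channel, the set of hot keys is fixed up front so no exchanges occur, and a `Counter` stands in for Count-Min. All function names are hypothetical.

```python
import queue
import threading
from collections import Counter

SENTINEL = object()  # marks the end of the stream on the channel


def filter_stage(stream, hot_keys, filter_counts, outbox):
    """Stage 1: count hot items locally, forward everything else."""
    for key in stream:
        if key in hot_keys:
            filter_counts[key] += 1
        else:
            outbox.put(key)  # message to the second stage
    outbox.put(SENTINEL)


def sketch_stage(inbox, sketch_counts):
    """Stage 2: consume forwarded items and update the backing summary."""
    while True:
        key = inbox.get()
        if key is SENTINEL:
            break
        sketch_counts[key] += 1  # a Counter stands in for Count-Min here


def run_pipeline(stream, hot_keys):
    filter_counts, sketch_counts = Counter(), Counter()
    channel = queue.Queue(maxsize=1024)  # bounded, so a slow stage 2 throttles stage 1
    consumer = threading.Thread(target=sketch_stage, args=(channel, sketch_counts))
    consumer.start()
    filter_stage(stream, hot_keys, filter_counts, channel)
    consumer.join()
    return filter_counts, sketch_counts


hot, cold = run_pipeline(list("eaacee"), hot_keys={"e"})
print(hot["e"], cold["a"], cold["c"])  # 3 2 1
```

The more skewed the stream, the fewer items cross the channel, so the second stage and the communication cost matter less as skew grows.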

Parallel-ASketch vs ASketch
[Figure: throughput (million items/sec) vs. skew (Zipf parameter z, 0.0 to 3.0); curves: Parallel-ASketch, ASketch]
Synthetic data, 8 M data items, stream size: 32 M, sketch size = 128 KB, filter size = 0.4 KB