StreamApprox: Approximate Stream Analytics in Apache Spark
- Slides: 23
StreamApprox: Approximate Stream Analytics in Apache Spark (https://streamapprox.github.io). Do Le Quoc, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, Volker Hilt, Thorsten Strufe. 10/2017
Modern online services → stream aggregator → stream analytics system → useful information.
Modern online services: there is a tension between low latency and efficient resource utilization. Approximate computing is a way to resolve this tension.
Approximate Computing: for many applications, an approximate output is good enough; the trend of the data is more important than the precise numbers. E.g., Google Trends, "Spark SQL" vs "Spark Streaming" (Sep 2017 to Oct 2017). [Figure: Google Trends interest over time on a 0 to 100 scale, with averages, from Sep 17 to Oct 5.]
Approximate Computing, idea: to achieve low latency, compute over a subset of the data items instead of the entire dataset. Take a sample → compute → approximate output ± error bound.
State-of-the-art systems: BlinkDB [EuroSys'13] uses pre-existing samples; ApproxHadoop [ASPLOS'15] uses multi-stage sampling; Quickr [SIGMOD'16] injects samplers into the query plan. These systems are not designed for stream analytics.
StreamApprox design goals: Transparent (targets existing applications with minor code changes); Practical (supports adaptive execution based on the query budget); Efficient (employs online sampling techniques).
Outline • Motivation • Design • Evaluation
StreamApprox overview: the inputs are a data stream (sub-streams S1, S2, …, Sn combined by a stream aggregator, e.g., Kafka), a streaming query, and a query budget; the output is an approximate output ± error bound. The query budget can be given as: • Latency/throughput guarantees • Desired computing resources for query processing • Desired accuracy
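Key idea: Sampling. Simple random sampling (SRS) draws a single random sample from the whole stream; stratified sampling (STS) splits the stream into strata and applies SRS to each stratum separately.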
Key idea: Sampling. Reservoir sampling (RS) keeps a reservoir of size k; each newly arriving item i is either dropped or replaces an item already in the reservoir.
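The drop-or-replace rule on the slide above can be sketched in a few lines of Python. This is the textbook reservoir sampling algorithm, not code from StreamApprox; the function name `reservoir_sample` and its parameters are assumptions made for illustration.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream
    whose length is not known in advance (hypothetical helper)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k items always enter the reservoir.
            reservoir.append(item)
        else:
            # Item i is kept with probability k/i; otherwise it is dropped.
            j = rng.randrange(i)  # uniform integer in [0, i)
            if j < k:
                reservoir[j] = item  # replace a current member by item i
    return reservoir

sample = reservoir_sample(range(1000), k=4, rng=random.Random(42))
```

The reservoir never grows beyond k items, so memory use is independent of the stream length, which is what makes the technique suitable for online sampling.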
Spark-based Sampling: Spark-based simple random sampling (Spark-based SRS). Step #1: assign each item a random number in [0, 1]. Step #2: sort the items by their assigned value. Step #3: take out the k smallest items (in the slide's example: 0.01, 0.02, 0.06, 0.08, 0.12). Sorting big data is very expensive.
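A minimal single-machine sketch of the three steps above, written in plain Python rather than against the Spark API. The name `srs_by_sorting` is hypothetical; the point is only to show where the sorting cost comes from.

```python
import random

def srs_by_sorting(items, k, rng=None):
    """Simple random sample of size k via tag-sort-take (illustrative only)."""
    rng = rng or random.Random()
    # Step 1: assign each item a random number (random() returns a value in [0, 1)).
    tagged = [(rng.random(), item) for item in items]
    # Step 2: sort by the assigned value; this O(n log n) pass is the costly part.
    tagged.sort(key=lambda pair: pair[0])
    # Step 3: take out the k items with the smallest tags.
    return [item for _, item in tagged[:k]]

picked = srs_by_sorting(list(range(100)), k=5, rng=random.Random(1))
```

Every item has to be tagged and ordered before any item can be selected, so the cost grows with the full batch size rather than with k.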
Spark-based Sampling: Spark-based stratified sampling (Spark-based STS). Steps #1 and #2: create strata using groupByKey(). Step #3: apply SRS to each stratum Si. These steps are very expensive: worker nodes must synchronize to select a sample of size k.
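The batch-style procedure above can be imitated in plain Python as follows. This is a sketch under assumptions: `stratified_sample_batch` is a made-up name, the dictionary grouping stands in for groupByKey(), and `random.sample` stands in for the per-stratum SRS.

```python
import random
from collections import defaultdict

def stratified_sample_batch(records, k_per_stratum, rng=None):
    """records: iterable of (stratum_key, value) pairs (illustrative only)."""
    rng = rng or random.Random()
    # Group all values by stratum key; on a cluster this corresponds to a shuffle.
    strata = defaultdict(list)
    for key, value in records:
        strata[key].append(value)
    # Apply simple random sampling inside each stratum.
    return {
        key: rng.sample(values, min(k_per_stratum, len(values)))
        for key, values in strata.items()
    }

records = (
    [("S1", i) for i in range(8)]
    + [("S2", i) for i in range(100, 106)]
    + [("S3", 200)]
)
result = stratified_sample_batch(records, 4, random.Random(0))
```

The whole batch must be materialized and grouped before sampling can start, which is the cost the slide attributes to this approach.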
StreamApprox core idea: Online Adaptive Stratified Reservoir Sampling (OASRS). Each sub-stream (stratum) S1, S2, S3 is sampled independently with reservoir sampling (RS), using a reservoir of size k (k = 4 in the example). Each stratum gets a weight = #items/k, e.g., 8/4 and 6/4; a stratum with no more than k items has weight = 1. Easy to parallelize: it does not need any synchronization between workers.
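A rough sketch of per-stratum reservoirs with the weights from the slide above. The class name `OASRS` and its methods are assumptions, not the StreamApprox API, and the reservoir size is fixed here, whereas the real system adapts it to the query budget.

```python
import random

class OASRS:
    """One reservoir per stratum, plus a per-stratum item counter (sketch)."""

    def __init__(self, k, rng=None):
        self.k = k
        self.rng = rng or random.Random()
        self.reservoirs = {}  # stratum -> sampled items
        self.counts = {}      # stratum -> number of items seen so far

    def add(self, stratum, item):
        seen = self.counts.get(stratum, 0) + 1
        self.counts[stratum] = seen
        reservoir = self.reservoirs.setdefault(stratum, [])
        if len(reservoir) < self.k:
            reservoir.append(item)
        else:
            j = self.rng.randrange(seen)
            if j < self.k:
                reservoir[j] = item

    def weight(self, stratum):
        # weight = #items / k when the stratum overflowed its reservoir, else 1
        seen = self.counts[stratum]
        return seen / self.k if seen > self.k else 1.0

sampler = OASRS(k=4, rng=random.Random(7))
for stratum, n_items in [("S1", 8), ("S2", 6), ("S3", 3)]:
    for value in range(n_items):
        sampler.add(stratum, value)

weights = {s: sampler.weight(s) for s in ("S1", "S2", "S3")}
```

With 8, 6, and 3 items and k = 4, the weights come out as 2.0, 1.5, and 1.0, matching the slide. Each stratum only touches its own reservoir and counter, which is why no coordination between strata or workers is required.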
StreamApprox core idea (distributed execution): each worker runs OASRS locally with a reservoir of size 4. Worker 1: weights 2, 1.5, 1. Worker 2: weights 1, 2, 1.5.
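One way to combine the per-worker samples into an aggregate is to scale each stratum's sampled values by its weight and add everything up. The sketch below assumes a SUM query; the function `estimate_sum` and the sample values are made up for illustration, and only the weights are taken from the slide above.

```python
def estimate_sum(worker_samples):
    """worker_samples: one list per worker of (weight, sampled_values) pairs."""
    total = 0.0
    for strata in worker_samples:
        for weight, values in strata:
            # Each sampled item stands in for `weight` items of its stratum.
            total += weight * sum(values)
    return total

# Hypothetical sample contents; the weights follow the slide.
worker_1 = [(2.0, [1, 1, 1, 1]), (1.5, [2, 2, 2, 2]), (1.0, [3, 3, 3])]
worker_2 = [(1.0, [5, 5]), (2.0, [1, 1, 1, 1]), (1.5, [2, 2, 2, 2])]

approx_total = estimate_sum([worker_1, worker_2])  # 29.0 + 30.0
```

Because each worker's contribution is an independent weighted partial sum, the final step is a plain addition and needs no exchange of samples between workers.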
Implementation: sub-streams S1, S2, …, Sn → stream aggregator → data stream → StreamApprox.
Implementation: sub-streams S1, S2, …, Sn → stream aggregator → sampling module → batched RDDs generator → Spark computation engine → error estimation module. The error estimation module feeds refined sampling parameters back to the sampling module.
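As an illustration of what an error estimation step could compute, the sketch below uses the standard variance formula for a stratified-sample estimate of a sum, with a finite population correction. Whether StreamApprox uses exactly this formula is an assumption here; `sum_with_error_bound` and the z-value of 1.96 (roughly 95% confidence under a normal approximation) are illustrative choices.

```python
import math
from statistics import variance

def sum_with_error_bound(strata, z=1.96):
    """strata: list of (items_seen, sampled_values) pairs, one per stratum.
    Returns (estimated_sum, error_bound)."""
    estimate = 0.0
    var_total = 0.0
    for seen, sample in strata:
        n = len(sample)
        if n == 0:
            continue
        estimate += (seen / n) * sum(sample)
        # A fully sampled stratum (n == seen) adds no sampling error.
        # A stratum with a single sampled item is skipped here because
        # its sample variance is undefined.
        if 1 < n < seen:
            var_total += seen * (seen - n) * variance(sample) / n
    return estimate, z * math.sqrt(var_total)

est, bound = sum_with_error_bound([
    (8, [1.0, 2.0, 3.0, 4.0]),  # 8 items seen, 4 sampled
    (3, [5.0, 5.0, 5.0]),       # fully sampled stratum
])
```

If the resulting bound exceeds the accuracy target in the query budget, a feedback loop like the one on the slide could enlarge the reservoirs for the next batch.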
Outline • Motivation • Design • Evaluation
Experimental setup. Evaluation questions: • Throughput vs sample size • Throughput vs accuracy • End-to-end latency (see the paper for more results). Testbed: • Cluster: 17 nodes • Datasets: synthetic datasets (Gaussian distribution, Poisson distribution); CAIDA network traffic traces; NYC Taxi ride records
Throughput. [Figure: throughput (M items/s) vs sampling fraction (%), from 10 to 80, comparing Flink-based StreamApprox, Spark-based StreamApprox, and Spark-based STS; higher is better.] With a sampling fraction < 60%: Spark-based StreamApprox achieves ~2X higher throughput than Spark-based STS; Flink-based StreamApprox achieves 1.3X higher throughput than Spark-based StreamApprox.
Throughput vs Accuracy. [Figure: throughput (M items/s) vs accuracy loss (%), at 0.5 and 1, comparing Flink-based StreamApprox, Spark-based StreamApprox, and Spark-based STS; higher is better.] With the same accuracy loss: Spark-based StreamApprox achieves ~1.32X higher throughput than Spark-based STS; Flink-based StreamApprox achieves 1.62X higher throughput than Spark-based StreamApprox.
Latency. [Figure: total processing time (seconds) on the network traffic dataset and the NYC Taxi dataset, comparing Spark-based STS, Spark-based SRS, and StreamApprox; lower is better.] Spark-based StreamApprox is ~1.68X faster than Spark-based STS and ~1.45X faster than Spark-based SRS.
Conclusion. StreamApprox: approximate computing for stream analytics. Transparent: supports applications with minor code changes. Practical: adaptive execution based on the query budget. Efficient: online stratified sampling technique. Thank you! Details: StreamApprox [Middleware'17], https://streamapprox.github.io