Efficient Monitoring of Distributed Streams ARNON LAZERSON ADVISORS: DANIEL KEREN, ASSAF SCHUSTER EFFICIENT MONITORING OF DISTRIBUTED STREAMS 1
Agenda
§ Background
§ Geometric Monitoring
§ Convex Decomposition (VLDB 2015)
§ Lightweight Monitoring (KDD 2016)
§ Approximating Multiple Functions (DEBS 2017)
§ Conclusions
Distributed Stream Processing
§ Stream mining
§ But, streams are often distributed
§ Distributed algorithms (graph algorithms, model fitting, learning, …)
§ But, data is not static
§ Centralizing and re-running is very expensive
§ We must be smarter!
Distributed Stream Processing
§ Space efficient
§ Time efficient
§ Communication efficient
§ Battery lifetime
§ Network congestion
§ Computation bottleneck at central server
[Figure labels: Smartphones, IoT, WSN, Datacenters]
Distributed Stream Processing
Computing over Distributed Data
§ Multiple distributed nodes
§ Task: compute f over the union of the data
§ Solution: distributed computation algorithms
Continuous Distributed Data
§ New data continuously arrives
§ A change in local data changes the global union
§ Recomputing f is expensive
§ Wasteful if the change in f’s value is small
Computing f over distributed streams
§ The easy way: centralize and compute – expensive!
§ The efficient way: avoid communication and/or reduce its size
§ Main approaches: sketching, sampling, local constraints
Sketching
§ Replace data with a compact summary
§ Probabilistic, user-controlled bounds
§ Higher accuracy requires a larger sketch
§ Reduces the size of update messages
§ But not their number
§ Each sketch is only useful for a specific family of functions
§ AMS sketches can be used for self-join and full-join queries
N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In PODS, 1999.
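
The self-join use of AMS sketches mentioned above can be illustrated with a minimal sketch. It is a simplification, not the construction from the cited paper: the helper names (`sign`, `ams_sketch`, `self_join_estimate`, `merge`) are hypothetical, a keyed BLAKE2 digest stands in for a 4-wise independent hash family, and a plain average replaces the usual median-of-means estimator.

```python
import hashlib


def sign(row: int, item: str) -> int:
    """Pseudo-random +1/-1 per (row, item); an assumed stand-in for a 4-wise independent hash."""
    digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


def ams_sketch(stream, rows=64):
    """Keep one counter per row: the sum of sign(row, item) over all stream items."""
    counters = [0] * rows
    for item in stream:
        for r in range(rows):
            counters[r] += sign(r, item)
    return counters


def self_join_estimate(sketch):
    """The average of the squared counters estimates the sum of squared item frequencies."""
    return sum(z * z for z in sketch) / len(sketch)


def merge(a, b):
    """Sketches are linear: the sketch of two concatenated streams is the element-wise sum."""
    return [x + y for x, y in zip(a, b)]
```

Because the sketch is linear, each node can ship a fixed-size counter array instead of raw data, which shrinks each message but does not reduce how many messages are sent.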
Periodic Sampling
§ Collect data every N rounds/M minutes, etc.
§ Widely used in practice
§ Easy to understand and implement
§ Reduces the number of messages
§ No guarantee on error bounds
§ Wasteful if data is stable
§ Can miss spikes
Local constraints
§ Define local constraints that are indicative of the global state
§ As long as they hold, the global state has not changed by much
§ Example: distributed counts
§ Each node is assigned a threshold (local constraint)
§ Nodes count local items and communicate when the threshold is crossed
Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. Communication-Efficient Distributed Monitoring of Thresholded Counts. SIGMOD 2006.
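
The distributed-counts example can be sketched as follows. This is a deliberately simple static scheme with an even split of the global threshold, assumed here for illustration; it is not the adaptive scheme of the cited paper, and the function names are hypothetical.

```python
def local_thresholds(global_threshold: int, num_nodes: int) -> list[int]:
    """Split the global threshold T evenly; the local thresholds sum to at most T."""
    return [global_threshold // num_nodes] * num_nodes


def violating_nodes(counts, thresholds):
    """Indices of nodes whose local count has reached its local threshold."""
    return [i for i, (c, t) in enumerate(zip(counts, thresholds)) if c >= t]


# While no node reports a violation, every count is below its local threshold,
# so the global count is below the sum of the thresholds, which is at most T.
thresholds = local_thresholds(100, 4)
silent = violating_nodes([10, 24, 3, 20], thresholds)
```

No messages are needed while `violating_nodes` is empty; a node only contacts the coordinator once its own count reaches its threshold.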
Local Constraints properties
§ Reduce the number of messages
§ Many algorithms are tailored for specific problems
§ Deriving local constraints for arbitrary functions is hard
Node      Data      Local variance
Node 1    1, 2      0.25
Node 2    51, 52    0.25
Global variance: 625.25 (high), although the local variances are low.
Node      Data      Local variance
Node 1    1, 2      0.25
Node 2    91, 92    0.25
Global variance: 2025.25 (higher), although the local variances are unchanged.
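
The numbers in this example can be reproduced with a few lines of Python. The sketch uses the population variance from the standard library; the helper name `local_and_global_variance` is hypothetical.

```python
from statistics import pvariance


def local_and_global_variance(node_data):
    """Return the per-node population variances and the variance of the union."""
    local = [pvariance(values) for values in node_data]
    union = [x for values in node_data for x in values]
    return local, pvariance(union)


# Moving node 2's data from (51, 52) to (91, 92) leaves every local variance
# unchanged while the global variance grows, so local variance alone says
# nothing about the global value.
before = local_and_global_variance([[1.0, 2.0], [51.0, 52.0]])
after = local_and_global_variance([[1.0, 2.0], [91.0, 92.0]])
```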
Recap
§ Communication is expensive
§ To reduce communication we want local constraints
§ These local constraints must guarantee the global constraint
§ Breaking up a global constraint is not trivial
§ NP-hard in the general case
D. Keren, G. Sagy, A. Abboud, D. Ben-David, A. Schuster, I. Sharfman, and A. Deligiannakis. Geometric monitoring of heterogeneous streams. TKDE 2014.
Our Distributed Model
[Figure labels: Coordinator, Nodes, Streams]
Is The Average Interesting? – Yes!
§ Distributed Histograms
§ Distributed Graphs
§ Scatter Matrices
§ Contingency Tables
§ …
Geometric Monitoring
I. Sharfman, A. Schuster, and D. Keren. A geometric approach to monitoring threshold functions over distributed data streams. In SIGMOD, 2006.
Some Notations
Geometric Monitoring in Action
[Figure labels: Node 1, Node 2, Node 3, Coordinator; "violation!"; "poll local data"]
Threshold Crossing to Value Tracking
§ GM solves the distributed threshold crossing problem
§ Can be naturally extended to tracking: two threshold crossing problems
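
One common reading of "two threshold crossing problems" is an additive error band around the last reported value. The sketch below assumes that reading; `epsilon`, `tracking_thresholds`, and `needs_update` are illustrative names that do not appear in the slides.

```python
def tracking_thresholds(last_value: float, epsilon: float):
    """Lower and upper thresholds around the last value reported by the coordinator."""
    return last_value - epsilon, last_value + epsilon


def needs_update(current_value: float, last_value: float, epsilon: float) -> bool:
    """True when either of the two threshold conditions is crossed."""
    low, high = tracking_thresholds(last_value, epsilon)
    return current_value < low or current_value > high
```

As long as neither threshold is crossed, the last reported value is within `epsilon` of the true one, so tracking reduces to monitoring two threshold conditions at once.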
Safe Zones
§ Safe zones are convex subsets of the admissible region
§ Local vectors are allowed to be in the safe zones
§ There is more than one way to construct safe zones
§ Covering Spheres (CS) was the first method to construct SZs
Sharfman et al. A geometric approach to monitoring threshold functions over distributed data streams. SIGMOD, 2006.
CS Safe Zones
§ Create a sphere such that the reference point and the local vector define its diameter
§ Check if the entire sphere is within the admissible region
[Figure labels: Admissible region, Safe zone]
CS Safe Zones Bounding Lemma
CS Safe Zones
§ Each node locally checks if its covering sphere is within A
§ If all spheres are in A, the convex hull of the local vectors is in A
§ Hence the average vector V is in A!
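
The local covering-sphere test can be sketched for one concrete case. The monitored function f(x) = ||x||² with condition f ≤ T is an assumption chosen because its maximum over a ball has a closed form; the slides do not fix a particular f, and the function names are hypothetical.

```python
import math


def covering_sphere(reference, local_vector):
    """Sphere whose diameter is the segment from the reference point to the local vector."""
    center = [(e + v) / 2 for e, v in zip(reference, local_vector)]
    radius = math.dist(reference, local_vector) / 2
    return center, radius


def sphere_is_admissible(center, radius, threshold):
    """For f(x) = ||x||^2, the maximum of f over the ball is (||center|| + radius)^2."""
    return (math.hypot(*center) + radius) ** 2 <= threshold


# A node stays silent while its sphere lies inside the admissible region.
center, radius = covering_sphere([0.0, 0.0], [2.0, 0.0])
ok = sphere_is_admissible(center, radius, threshold=5.0)
```

For a general f the containment check is an optimization problem over the sphere, which is where much of the runtime cost of CS comes from.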
Convex Decomposition VLDB 2015
CS Safe Zone – Alternative Definition
Redundant Half-Spaces
Convex Decomposition
Example
Another Example
Monitoring the Median
Application: AGMS Sketches
Evaluation
§ CD was evaluated using two real data sets:
§ “Wcup”, access logs to the soccer World Cup 1998 website
§ “Cdad”, SNMP requests of network users
§ Both self-join and full-join queries using sketches
§ Substantial improvement over the state-of-the-art (CS)
Experimental Results
[Figure: two panels, FULL-JOIN COMM. COST RATIO and SELF-JOIN COMM. COST RATIO to state-of-the-art, versus number of nodes (4 to 20), for the wcup and cdad datasets]
Communication cost (bytes sent) relative to state-of-the-art for varying number of nodes over two datasets.
So are we done?
Lightweight Monitoring KDD 2016
Distributed Stream Processing
§ Space efficient
§ Communication efficient
§ Time efficient
§ Power efficient
  § especially in a wireless, battery-operated environment
The Convex Bound Method - CB
§ CB is based on directly bounding the monitored function
  § using suitably chosen convex and concave functions
§ Simpler local conditions than those applied by GM
§ Runtime and power consumption are far lower than GM’s
  § orders of magnitude in some cases
§ Reduced comm. overhead in all application scenarios tested
Convexity – a Brief Reminder
f is convex: the line segment between any two points on its graph lies on or above the graph.
Monitoring Convex Functions
[Figure: threshold T]
Monitoring Non-Convex Functions
[Figure: threshold T]
Mind the Gap
False violation! For x = 4, we get c(4) > T, so we must issue an alert, but f(4) ≤ T (T = 20).
Choosing the right bound
§ To minimize false alarms, the gap should be minimized.
§ It is impossible to choose a single optimal bound c.
§ The selection of the bound depends on the ref-point.
Choosing the right bound
To guarantee correctness, the function c must
§ be convex
§ bound f from above – c(x) ≥ f(x) for all x
To achieve efficiency, c must
§ “Stick” to the monitored function as much as possible
§ Be easy to compute
Choosing the Right Bound
If the monitored function f is:
§ Convex – just choose c = f
§ Concave - choose c = tangent plane at the ref-point
§ Other cases? – stay tuned…
The tangent plane at the ref-point is the best convex bound for a concave function
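
The concave case can be made concrete with a one-dimensional sketch. The function f(x) = sqrt(x), the reference point 4, and the threshold 3 are all assumptions picked for illustration; the names `tangent_bound` and `local_condition_holds` are hypothetical.

```python
import math


def f(x: float) -> float:
    """An assumed concave monitored function."""
    return math.sqrt(x)


def tangent_bound(x0: float):
    """Tangent line of f at the reference point x0; it is convex and lies on or above f."""
    slope = 1 / (2 * math.sqrt(x0))
    return lambda x: math.sqrt(x0) + slope * (x - x0)


def local_condition_holds(c, x: float, threshold: float) -> bool:
    """Each node checks the bound c on its own local value only."""
    return c(x) <= threshold


c = tangent_bound(4.0)
# If every node satisfies c(x_i) <= T, convexity of c gives c(mean) <= T,
# and since c >= f this implies f(mean) <= T without any communication.
all_quiet = all(local_condition_holds(c, x, 3.0) for x in [2.0, 7.0])
# The gap between c and f can trigger a false alarm: c(8.5) > 3 while f(8.5) <= 3.
false_alarm = not local_condition_holds(c, 8.5, 3.0) and f(8.5) <= 3.0
```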
“Convexizing” threshold conditions
Lemma. If f possesses bounded second derivatives in a domain D, it can be expressed as the difference of two convex functions.
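
One standard way to see why the lemma holds is sketched below. It assumes that f is twice differentiable on a convex domain D and that the eigenvalues of its Hessian are bounded below by -M for some constant M ≥ 0; the symbols M, g, and h are introduced here for illustration and are not taken from the slides.

```latex
\begin{align*}
g(x) &= f(x) + \tfrac{M}{2}\lVert x\rVert^2, \qquad h(x) = \tfrac{M}{2}\lVert x\rVert^2,\\
\nabla^2 g(x) &= \nabla^2 f(x) + M I \succeq 0 \quad\Rightarrow\quad g \text{ is convex on } D,\\
f(x) &= g(x) - h(x), \qquad f(x) \le T \iff g(x) \le T + h(x).
\end{align*}
```

Both g and h are convex, so the threshold condition f ≤ T becomes an inequality between two convex functions.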
“Convexizing” Threshold Conditions
[Figure labels: gap, convexization, p0]
Application
§ We applied CB to monitor four important functions:
§ Pearson correlation coefficient (PCC)
§ Inner product
§ Cosine similarity
§ PCA Score
§ These functions have no simple, efficient monitoring solutions
  § they are not linear, convex, concave, or monotonic
Example: Convexizing PCC
A concave lower bound (blue) for PCC (green). The reference point (in red) is x0 = 0.3, y0 = 0.6, and T = 0.4.
Evaluation
§ Four functions
§ 10 nodes
§ Three datasets
§ Reuters Corpus - 804,414 categorized documents, 47,236 features
§ Twitter Crawl - 5M hash-tagged tweets (out of 50M in the data set)
§ KDD cup 1999 – TCP connection data, 41 features
Evaluation
§ Compared runtime and comm. cost of CB and CS
§ The naïve method is a common baseline
  § every message is sent to the coord.
§ CB outperformed CS in both parameters
Runtime results
Functions: PCC, Inner-Prod, Csim, PCA. Datasets: REU, TWIT, KC. Dimensions: 3, 4100, 2500, 40 x 40.
Runtime CB    Runtime CS    Speedup
6.70E-05      0.58          8,657.7
1.35E-04      6.89E-04      5.1
8.90E-05      5.20E-04      5.84
1.73E-04      175           1,000,000
1.41E-04      170           1,200,000
8.60E-03      9.57E+00      1112.7
Communication Reduction
[Figure] Communication ratio to the naive method for CB and GM, as well as the super-optimal lower bound RLV. The Csim function lacks results for GM, as the experiments did not complete in over 24 hours.
[Figure] Communication reduction results for the Inner-prod function on TWIT, using 200 to 1000 nodes.
Power Consumption
[Figure panels: Mini-PC, Edison SoC]
Summary
§ CB works directly on the monitored function
§ Using convex/concave bounds
§ Better runtime and power than CS
§ Reduced communication overhead
Approximating Multiple Functions DEBS 2017
Single Function Approximation
§ Approximate a function over distributed streams
§ Trade off accuracy for efficiency
§ Early work contains special-purpose algorithms
§ Sampling, sketching, local constraints
§ Geometric monitoring
§ General approach - can track any function of the average of streams
Tracking a Single Function
Two threshold crossing problems
Tracking Multiple Functions
§ Centralize and compute
§ Inefficient even for a single function
§ No per-function overhead
§ Independently track each function
§ Each function is tracked efficiently
§ But, each function adds to the communication cost
§ Can we do better?
Multiple Functions with GM
Common Admissible Region
Common Admissible Region
Safe Zones
Common Safe Zone - CSZ
Evaluation Methodology
§ Applied common safe zone (CSZ) and independent tracking
§ Counted messages
§ Cost is reported as ratio to naive
§ Expect CSZ to be better than independent tracking
Tracking PCA Scores
Comm. Cost for PCA Scores
Tracking Regression and Condition Number
* M. Gabel, D. Keren, and A. Schuster. Monitoring least squares models of distributed streams. In SIGKDD, 2015.
Regression and Condition Number Communication Costs
Scalability (TWIT dataset)
One for All and All for One
§ One for all
§ One SZ for all functions
https://www.flickr.com/photos/pasukaru76/5597863793
§ All for one
§ Tracking all functions for (nearly) the price of one
Conclusions
Three novel communication-efficient methods:
§ Convex Decomposition
§ Convex Bounds
§ Simultaneous Approximation of multiple functions
§ Convex constraints on the domain allow:
§ Arbitrary functions (non-linear, non-monotonic, …)
§ Deterministic user-defined error bounds
§ Orders of magnitude in communication reduction
Questions?
Open questions
§ Can we find an optimal safe zone?
§ Probably yes for one-dimensional functions
§ Yes, in some trivial cases
§ We suspect the general problem to be NP-hard
§ How to monitor functions other than the average?
§ Is it possible to devise a tight(er) lower bound on communication?
§ Given a problem, how to select the best method (GM/CD/CB/…)?
BACKUP
Local Constraint example
Example: Spam Filtering
The word “Hello”    Spam    Not Spam
Appeared            0.1     0.2
Did not appear      0.4     0.3
Example: Distributed Spam Filtering
Example: Distributed Spam Filtering
Wrong!
Why is the Naïve Attempt Wrong?
§ No local violations at the nodes:
§ But the global condition is violated:
CD backup
Computing self-join in distributed setting
SZ for linear median
Generalization to Higher Dimensionality
SZ for median of squares
CB local violations and sync
Local Violations
§ If the global threshold is crossed, some local condition must be violated
§ However, a local condition may be violated while the global condition holds
[Figure: threshold T]
The Synchronization Process
[Figure: threshold T]
Convexizing Functions
Pearson correlation coefficient
Pearson correlation coefficient
§ PCC gives a value between -1 and 1
§ 1: total positive correlation. Whenever X appears, Y also appears.
§ 0: no correlation. X and Y are totally independent; the appearance of X says nothing about Y.
§ -1: total negative correlation. Whenever X appears, Y does not appear.
[Figure panels: Positive correlation, Negative correlation, No correlation]
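
The three cases above can be checked on small 0/1 appearance indicators. This is a plain textbook computation of PCC, not the convexized monitoring condition used by CB; the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError when either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Indicators of whether X and Y appear in each of four documents.
together = pearson([0, 1, 0, 1], [0, 1, 0, 1])    # Y appears whenever X does
never = pearson([0, 1, 0, 1], [1, 0, 1, 0])       # Y appears only when X does not
unrelated = pearson([0, 0, 1, 1], [0, 1, 0, 1])   # X says nothing about Y
```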
Convexizing Pearson
[Figure/equation labels: Q1 (convex), Q2 (convex); remaining terms marked convex]
Convexizing Pearson
A concave lower bound (blue) for PCC (green). The reference point (in red) is x0 = 0.3, y0 = 0.6, and T = 0.4.
Inner Product
Convexizing the Inner-Product
[Label: convex]
Cosine Similarity
Convexizing CSIM
[Labels: convex, non convex]
Not clear how to decompose it into an inequality between two convex functions
PCA Score
Convexizing the PCA Score
CB detailed comm. cost
Communication Cost - PCC value over time
§ The documents were categorized; the most frequent category is “CCAT”
§ We monitor the PCC value of (“CCAT”, “bosnia”) and (“CCAT”, “febru”)
§ Threshold levels between -0.09 and -0.05
[Figure panels: Bosnia - communication cost, Febru - communication cost]
Communication Cost – Inner-Prod
We monitored the inner product of feature vectors from two streams (created by splitting the records)
[Figure panels: Reuters, Twitter]
Communication cost - CSIM
§ Due to run-time limitations, it was not possible to apply GM to monitor cosine similarity on real data.
§ Instead we simulated a single stream by adding noise to some pre-defined p0
§ We measured what percent of the vectors caused alerts
[Figure: Violation ratio per noise magnitude]
Violation Recovery
Violation Recovery
§ When a local condition is breached, there may or may not be a global violation
§ The result depends on multiple nodes
§ It is the responsibility of the coordinator to resolve the issue
§ The coordinator must collect data from the nodes to be sure of the outcome
§ This process is called violation recovery or violation resolution
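
The coordinator's side of violation recovery can be sketched as a full poll followed by an exact check. The full poll and the reuse of the average as the next reference point are assumptions made for illustration; the slides only state that the coordinator collects data from the nodes. The name `resolve_violation` is hypothetical.

```python
def resolve_violation(local_vectors, f, threshold):
    """Poll every node, average the local vectors, and test f on the true global vector.

    Returns the average (assumed here to become the new reference point)
    and whether the global condition f(average) <= threshold is violated.
    """
    n = len(local_vectors)
    dim = len(local_vectors[0])
    average = [sum(v[d] for v in local_vectors) / n for d in range(dim)]
    return average, f(average) > threshold


def squared_norm(v):
    """An assumed monitored function for the usage example."""
    return sum(x * x for x in v)


# One node reported a local breach, but the global condition still holds.
reference, violated = resolve_violation([[1.0, 1.0], [3.0, 1.0]], squared_norm, 6.0)
```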
Handling local violations - example
§ All nodes may be OK, but the average is not.
§ The goal is to independently construct sets at the nodes whose union contains the average.
§ These sets need to be as “small” as possible. If all sets are contained in A, so is the average.
§ Actually, we construct sets which cover the entire convex hull of the local data vectors.
CB’s space of solutions includes GM’s
§ Let f ≤ T be any monitoring problem.
§ Let C be the convex subset of A defined by GM.
§ We wish to prove that there exists a convex bound c ≻ f such that, for any point u, u ∈ C iff c(u) ≤ T.
§ We simply define c(u) = T + d_C(u),
§ where d_C is the distance transform of the set C.
§ Clearly c ≻ f on C (for u ∈ C, f(u) ≤ T).
§ c is convex because the distance transform of a convex set is convex.
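
The construction c(u) = T + d_C(u) can be tried out on a simple convex set. The choice of C as a disc, and the names `distance_to_disc`, `convex_bound`, and `in_safe_zone`, are assumptions for illustration only.

```python
import math


def distance_to_disc(u, center, radius):
    """Distance transform of a disc: zero inside, distance to the boundary outside."""
    return max(0.0, math.dist(u, center) - radius)


def convex_bound(u, center, radius, threshold):
    """c(u) = T + d_C(u) for the disc C."""
    return threshold + distance_to_disc(u, center, radius)


def in_safe_zone(u, center, radius, threshold):
    """u is in C exactly when c(u) <= T."""
    return convex_bound(u, center, radius, threshold) <= threshold


inside = in_safe_zone([0.5, 0.0], [0.0, 0.0], 1.0, 10.0)
outside = in_safe_zone([3.0, 0.0], [0.0, 0.0], 1.0, 10.0)
```

Checking c(u) ≤ T reproduces membership in C, which is the sense in which any GM safe zone can be expressed as a CB bound.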