Summarizing Distributed Data
Ke Yi, HKUST
Small summaries for BIG data
• Allow approximate computation with guarantees and small space: save space, time, and communication
• Tradeoff between error and size
Summarization vs. (Lossy) Compression
Summarization:
• No need to decompress
• Aims at particular properties of the data
• (Usually) provides guarantees on query results
Compression:
• Need to decompress before making queries
• Aims at a generic approximation of all data
• Best-effort approach; does not provide guarantees
Summaries
• Summaries allow approximate computations:
– Random sampling
– Frequent items
– Sketches (JL transform, AMS, Count-Min, etc.)
– Quantiles & histograms
– Geometric coresets
– …
Large-scale distributed computation
Programmers have no control over how things are merged
MapReduce
Dremel
Pregel: Combiners
Sensor networks
“A major technical challenge for big data is to push summarization to edge devices.”
Jagadish et al., “Technical Challenges for Big Data”, Communications of the ACM, August 2014
Two models of summary computation
• Mergeability: summarization behaves like a semigroup operator
– Allows arbitrary computation trees (shape and size unknown to the algorithm)
– Quality remains the same
– Any intermediate summary is valid
– Resulting summary can be further merged
– Generalizes the streaming model
• Multi-party communication
– Simultaneous message passing model
– Message passing model
– Blackboard model
Mergeable summaries
• Sketches: easy
• Random samples: easy
• MinHash: easy and cute
• Heavy hitters: easy and cute
• ε-approximations (quantiles, equi-height histograms): easy algorithm, analysis requires work
Agarwal, Cormode, Huang, Phillips, Wei, and Yi, “Mergeable Summaries”, TODS, Nov 2013
Merging random samples
[animation: two random samples being combined, element by element, into one sample of the union]
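The merge step these animation frames walk through can be stated compactly. Below is a minimal Python sketch, assuming both inputs are uniform size-k samples drawn without replacement; the function name and signature are illustrative, not from the deck.

```python
import random

def merge_samples(s1, n1, s2, n2, k):
    """Merge two uniform size-k samples (without replacement).

    s1 is a sample of a dataset of total size n1, s2 of one of size n2.
    Returns a uniform size-k sample of the union (requires n1 + n2 >= k).
    """
    s1, s2 = list(s1), list(s2)  # local copies we can consume
    m1, m2 = n1, n2              # items each side still represents
    out = []
    for _ in range(k):
        # The next element of a uniform draw from the union comes from
        # side 1 with probability proportional to its remaining size.
        if random.random() * (m1 + m2) < m1:
            out.append(s1.pop(random.randrange(len(s1))))
            m1 -= 1
        else:
            out.append(s2.pop(random.randrange(len(s2))))
            m2 -= 1
    return out
```

The invariant doing the work: each pick lands on a side with probability proportional to how many items that side still represents, exactly as a uniform draw from the union would.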
Merging sketches
• Linear sketches (JL, AMS, Count-Min) merge by adding their entries coordinate-wise
• Queries are unaffected; e.g., a Count-Min point query still returns the min of the item’s counters
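To make “merge by adding” concrete, here is a minimal Count-Min sketch. The class name, default parameters, and the tuple-hashing trick are my illustrative choices; a production version would use explicit pairwise-independent hash functions rather than Python’s built-in hash.

```python
import numpy as np

class CountMin:
    """Minimal Count-Min sketch. Two sketches built with the same
    width, depth, and seed merge by entry-wise addition of counters."""

    def __init__(self, width=200, depth=5, seed=42):
        self.width, self.depth, self.seed = width, depth, seed
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _cells(self, x):
        # One counter per row, located by a seeded hash of the item.
        for r in range(self.depth):
            yield r, hash((self.seed, r, x)) % self.width

    def update(self, x, c=1):
        for r, col in self._cells(x):
            self.table[r, col] += c

    def query(self, x):
        # Return the min over the item's counters (an overestimate).
        return min(self.table[r, col] for r, col in self._cells(x))

    def merge(self, other):
        assert (self.width, self.depth, self.seed) == \
               (other.width, other.depth, other.seed)
        # Linearity: sketch(A) + sketch(B) = sketch(A ∪ B).
        self.table += other.table
```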
MinHash
• Keep the k smallest hash values of the set; two MinHash summaries merge by keeping the k smallest values of their union
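A sketch of the bottom-k MinHash variant just described, assuming that is what the (lost) slide figure depicted; function names are illustrative.

```python
import heapq

def minhash(items, k, h=hash):
    """Bottom-k MinHash signature: the k smallest hash values of a set."""
    return heapq.nsmallest(k, {h(x) for x in items})

def merge_minhash(sig1, sig2, k):
    """Merging keeps the k smallest values of the union, which equals
    the signature of the combined set -- nothing is lost by merging."""
    return heapq.nsmallest(k, set(sig1) | set(sig2))
```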
Mergeable summaries
• Random samples: easy
• Sketches: easy
• MinHash: easy and cute
• Heavy hitters: easy and cute
• ε-approximations (quantiles, equi-height histograms): easy algorithm, analysis requires work
Heavy hitters
• Goal: report the items whose frequency exceeds εn
[figure: item frequencies over the domain 1..9]
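The summary analyzed on the next slide is Misra-Gries (MG). The algorithm slide itself did not survive extraction, so here is a minimal sketch of one MG update with at most k counters; the function name is mine.

```python
def mg_update(counters, x, k):
    """One Misra-Gries update on a dict of at most k counters.

    If x is tracked, increment it; else if there is room, start
    tracking it; else decrement every counter (dropping zeros) --
    the step that charges each lost count against k other items.
    """
    if x in counters:
        counters[x] += 1
    elif len(counters) < k:
        counters[x] = 1
    else:
        for y in list(counters):
            counters[y] -= 1
            if counters[y] == 0:
                del counters[y]
    return counters
```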
Streaming MG analysis
• A decrement step removes k+1 counts at once (k counters plus the incoming item), so at most n/(k+1) decrements occur over a stream of n items
• Hence each counter underestimates its item’s true frequency by at most n/(k+1)
Merging two MG summaries
• Add the counters item-wise, then subtract the (k+1)-st largest counter value from every counter and discard the non-positive ones
• Total error = (prior error) + (error from the merge)
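A sketch of that merge step, assuming each summary is a dict of at most k counters, the same representation as the MG sketch above:

```python
def mg_merge(c1, c2, k):
    """Merge two Misra-Gries summaries with at most k counters each.

    Add counters item-wise, then subtract the (k+1)-st largest counter
    value from every counter and drop the non-positive ones, leaving
    at most k counters.
    """
    merged = dict(c1)
    for x, c in c2.items():
        merged[x] = merged.get(x, 0) + c
    if len(merged) <= k:
        return merged
    cut = sorted(merged.values(), reverse=True)[k]  # (k+1)-st largest
    return {x: c - cut for x, c in merged.items() if c - cut > 0}
```

The subtraction plays the role of MG’s decrement step: the count removed from any one item is charged against k+1 large counters at once, which is how the overall error bound is preserved across merges.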
SpaceSaving: another heavy hitter summary
• Instead of decrementing all counters, SpaceSaving evicts the minimum counter and gives its count to the new item; it can be shown to be isomorphic to MG, so it is mergeable as well
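For comparison, a minimal sketch of one SpaceSaving update; the function name is mine.

```python
def spacesaving_update(counters, x, k):
    """One SpaceSaving update on a dict of at most k counters:
    overwrite the minimum counter instead of decrementing everything."""
    if x in counters:
        counters[x] += 1
    elif len(counters) < k:
        counters[x] = 1
    else:
        y = min(counters, key=counters.get)  # evict the smallest
        counters[x] = counters.pop(y) + 1    # new item inherits its count
    return counters
```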
Mergeable summaries
• Random samples: easy
• Sketches: easy
• MinHash: easy and cute
• Heavy hitters: easy and cute
• ε-approximations (quantiles, equi-height histograms): easy algorithm, analysis requires work
ε-approximations: a more “uniform” sample
• Random sample: needs Θ(1/ε²) points to answer all range-count queries within εn
• An ε-approximation S of D guarantees | |R∩S|/|S| − |R∩D|/|D| | ≤ ε for every range R
Quantiles (order statistics)
• An ε-approximate quantile summary returns, for any query rank r, an element whose true rank is r ± εn; this is an ε-approximation for one-dimensional ranges
Quantiles give an equi-height histogram
• Automatically adapts to skewed data distributions
• Equi-width histograms (fixed binning) are trivially mergeable but do not adapt to the data distribution
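A small illustration of turning quantiles into equi-height bucket boundaries. The helper below assumes the summary is just a sorted list of representative values, a simplification of real rank-annotated quantile summaries; the name is mine.

```python
def equi_height_boundaries(sorted_summary, num_buckets):
    """Take every (len/num_buckets)-th element of a sorted quantile
    summary as a bucket boundary: buckets then hold roughly equal
    counts, however skewed the value distribution is."""
    step = max(1, len(sorted_summary) // num_buckets)
    return sorted_summary[step - 1::step]
```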
Previous quantile summaries
Equal-weight merges
• To merge two sorted summaries of equal weight: merge the sorted lists, then keep either the odd- or the even-indexed elements, chosen by a fair coin flip
[figure: example of merging two sorted summaries of equal weight]
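A minimal sketch of that equal-weight merge, assuming each summary is a sorted list of k elements of equal weight:

```python
import heapq
import random

def equal_weight_merge(q1, q2):
    """Merge two sorted, equal-weight quantile summaries: merge the
    lists, then keep the odd- or even-indexed elements (fair coin).
    The output has size k again, with each kept element now standing
    for twice the weight."""
    merged = list(heapq.merge(q1, q2))
    offset = random.randint(0, 1)  # coin flip: even or odd positions
    return merged[offset::2]
```

The random offset is what makes the per-merge error unbiased, which the analysis on the next slides exploits via a Chernoff bound.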
Equal-weight merge analysis: base case (a single merge)
• For any query range, the random odd/even choice makes the merge error zero in expectation and at most 1 in absolute value (in units of one element’s weight)
Equal-weight merge analysis: multiple levels
[figure: merge tree with levels i = 1, 2, 3, 4]
Equal-sized merge analysis: Chernoff bound
• Chernoff-Hoeffding: given independent zero-mean variables Yⱼ with |Yⱼ| ≤ yⱼ:
  Pr[ |∑₁≤ⱼ≤ₜ Yⱼ| > α ] ≤ 2 exp(−2α² / ∑₁≤ⱼ≤ₜ (2yⱼ)²)
• Set α = h·2^m for our variables:
  – 2α² / ∑ᵢ∑ⱼ (2 max Xᵢ,ⱼ)² = 2(h·2^m)² / ∑ᵢ 2^(m−i)·2^(2i) = 2h²·2^(2m) / ∑ᵢ 2^(m+i) = 2h² / ∑ᵢ 2^(i−m) = 2h² / ∑ᵢ 2^(−i) ≥ 2h²
• From the Chernoff bound, the error probability is at most 2 exp(−2h²)
  – Set h = O(log^(1/2)(1/δ)) to obtain success probability 1 − δ
Equal-sized merge analysis: finishing up
• The Chernoff bound ensures absolute error at most α = h·2^m
  – m = number of levels of merges = log(n/k) for summary size k
  – So the error is at most hn/k
• Set the size of each summary to k = O(h/ε) = O((1/ε) log^(1/2)(1/δ))
  – Guarantees error εn with probability 1 − δ for any one range
• There are O(1/ε) different ranges to consider
  – Set δ = Θ(ε) to ensure all ranges are correct with constant probability
  – Summary size: O((1/ε) log^(1/2)(1/ε))
Fully mergeable ε-approximation
• Use equal-size merging in a standard logarithmic trick:
[figure: one summary per weight class: Wt 32, Wt 16, Wt 8, Wt 4, Wt 2, Wt 1]
• Merge two summaries as in binary addition
• Fully mergeable quantiles, in size O((1/ε) log n · log^(1/2)(1/ε))
  – n = number of items summarized, not known a priori
• But can we do better?
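A sketch of the logarithmic trick, reusing equal_weight_merge from the sketch above: keep at most one summary per weight class, and resolve collisions exactly like carries in binary addition. The dict representation is my own choice.

```python
def merge_structures(a, b):
    """Merge two quantile structures, each a dict mapping level i to
    one summary whose elements have weight 2^i. Colliding levels are
    combined with equal_weight_merge, producing a carry into level
    i + 1, just like binary addition of the two weight vectors."""
    out = dict(a)
    for level, summary in b.items():
        carry, i = summary, level
        while i in out:
            # Two summaries of equal weight: halve the size, carry up.
            carry = equal_weight_merge(out.pop(i), carry)
            i += 1
        out[i] = carry
    return out
```

Inserting a single item is just merging with the one-element structure {0: [x]}, so the same code serves both streaming updates and arbitrary merges.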
Hybrid summary
• Classical result: it suffices to build the summary on a random sample of size Θ(1/ε²)
  – Problem: we don’t know n in advance
[figure: levels Wt 32, Wt 16, Wt 8 on top of a buffer]
• Hybrid structure:
  – Keep only the top O(log(1/ε)) levels: summary size O((1/ε) log^1.5(1/ε))
  – Also keep a “buffer” sample of O(1/ε) items
  – When the buffer is “full”, extract its points as a sample of the lowest weight
ε-approximations in higher dimensions
• ε-approximations generalize to range spaces with bounded VC-dimension
  – Generalize the “odd-even” trick to low-discrepancy colorings
  – An ε-approximation for constant VC-dimension d has size Õ(ε^(−2d/(d+1)))
Other mergeable summaries: ε-kernels
• ε-kernels in d-dimensional space approximately preserve the projected extent in every direction
  – An ε-kernel has size O(1/ε^((d−1)/2))
  – A streaming ε-kernel has size O((1/ε^((d−1)/2)) log(1/ε))
  – A mergeable ε-kernel has size O((1/ε^((d−1)/2)) log^d n)
Summary
| Summary type | Static | Streaming | Mergeable |
| Heavy hitters | 1/ε | 1/ε | 1/ε |
| ε-approximation (quantiles), deterministic | 1/ε | (1/ε) log n | (1/ε) log U |
| ε-approximation (quantiles), randomized | – | – | (1/ε) log^1.5(1/ε) |
| ε-kernel | 1/ε^((d−1)/2) | (1/ε^((d−1)/2)) log(1/ε) | (1/ε^((d−1)/2)) log^d n |
Mergeability vs k-party communication
• Mergeability is a property of the summary itself
  – Makes the summary behave like a simple commutative and associative aggregate
  – Is one way to summarize distributed data
  – Total communication cost: O(k · summary size)
• Can we do better in the k-party communication model?
  – Size and/or shape of the merging tree known in advance
  – Only the final summary is valid
  – Resulting summary may not be further merged
Random sample
Some negative results
Heavy hitters
Huang and Yi, “The Communication Complexity of Distributed ε-Approximations”, FOCS’14
Thank you!