Random Sampling on Big Data: Techniques and Applications
Ke Yi
Hong Kong University of Science and Technology
yike@ust.hk
“Big Data” in one slide
The 3 V’s:
- Volume
  - External memory algorithms
  - Distributed data
- Velocity
  - Streaming data
- Variety
  - Integers, real numbers
  - Points in a multi-dimensional space
  - Records in relational database
  - Graph-structured data
Dealing with Big Data
- The first approach: scale up / out the computation
- Many great technical innovations:
  - Distributed/parallel systems: MapReduce, Pregel, Dremel, Spark…
  - New computational models: BSP, MPC, …
  - Dan Suciu’s tutorial tomorrow!
  - My BeyondMR talk on Friday
- This talk is not about this approach!
Downsizing data
- A second approach to computational scalability: scale down the data!
  - Too much redundancy in big data anyway
  - 100% accuracy is often not needed
  - What we finally want is small: human readable analysis / decisions
  - Examples: samples, sketches, histograms, various transforms
    - See tutorial by Graham Cormode for other data summaries
- Complementary to the first approach
  - Can scale out computation and scale down data at the same time
  - Algorithms need to work under new system architectures
    - Good old RAM model no longer applies
Outline of the talk
- Stream sampling
- Importance sampling
- Merge-reduce sampling
- Sampling for Approximate Query Processing
  - Sampling from one table
  - Sampling from multiple tables (joins)
Simple Random Sampling
- Sampling without replacement
  - Randomly draw an element
  - Don’t put it back
  - Repeat s times
- Sampling with replacement
  - Randomly draw an element
  - Put it back
  - Repeat s times
- Trivial in the RAM model
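The two procedures on this slide can be sketched in a few lines when all data sits in memory. This is a minimal illustration, not code from the talk; the function names and the `rng` parameter are assumptions introduced here, and the without-replacement version assumes s is at most the number of elements.

```python
import random

def sample_without_replacement(items, s, rng=random):
    """Draw an element, do not put it back, repeat s times (needs s <= len(items))."""
    pool = list(items)              # work on a copy so the input is untouched
    picked = []
    for _ in range(s):
        i = rng.randrange(len(pool))
        pool[i], pool[-1] = pool[-1], pool[i]   # move the drawn element to the end
        picked.append(pool.pop())               # and remove it from the pool
    return picked

def sample_with_replacement(items, s, rng=random):
    """Draw an element, put it back, repeat s times."""
    pool = list(items)
    return [pool[rng.randrange(len(pool))] for _ in range(s)]
```

Both run in time proportional to the input size plus s, which is why the problem only becomes interesting once the data no longer fits the RAM model.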
Stream Sampling
[Figure: processor P and its memory]
Reservoir Sampling
[Waterman ??; Knuth’s book]
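The slide body did not survive extraction, so the following is a sketch of the textbook form of reservoir sampling rather than the slide's own pseudocode. It keeps the first s items, then lets the i-th item replace a uniformly chosen slot with probability s/i. The names `reservoir_sample` and `rng` are assumptions made for this example.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform sample of size s (without replacement) over a stream."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)      # the first s items fill the reservoir
        else:
            j = rng.randrange(i)        # uniform in 0 .. i-1
            if j < s:                   # happens with probability s / i
                reservoir[j] = item     # evict the item in slot j
    return reservoir
```

Only s items and one counter are kept, so the memory footprint does not depend on the stream length.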
Correctness Proof
Reservoir Sampling Correctness Proof
[Figure: example with s = 2 on the items a, b, c, d, showing the possible reservoir contents]
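The s = 2 example can be checked exhaustively. The sketch below is an illustration added here, not part of the talk: it assumes the textbook replacement rule (item i replaces a uniformly chosen slot with probability s/i), assumes distinct stream items, and tracks the exact probability of every reservoir state with rational arithmetic.

```python
from collections import defaultdict
from fractions import Fraction

def reservoir_distribution(stream, s):
    """Exact probability of each final reservoir content (as a set of items)."""
    dist = {(): Fraction(1)}                 # reservoir slots -> probability
    for i, item in enumerate(stream, start=1):
        nxt = defaultdict(Fraction)
        for slots, p in dist.items():
            if i <= s:
                nxt[slots + (item,)] += p    # the first s items are always kept
            else:
                nxt[slots] += p * Fraction(i - s, i)   # item i is discarded
                for j in range(s):                     # item i replaces slot j
                    replaced = slots[:j] + (item,) + slots[j + 1:]
                    nxt[replaced] += p * Fraction(1, i)
        dist = dict(nxt)
    by_set = defaultdict(Fraction)
    for slots, p in dist.items():
        by_set[frozenset(slots)] += p        # slot order does not matter
    return dict(by_set)

if __name__ == "__main__":
    for subset, p in sorted(reservoir_distribution("abcd", 2).items(),
                            key=lambda kv: sorted(kv[0])):
        print("".join(sorted(subset)), p)
```

For the stream a, b, c, d with s = 2, each of the six possible pairs comes out with probability 1/6, which is what a uniform sample without replacement requires.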
External Memory Stream Sampling
[Figure: processor P with internal memory and external memory]
Clean-up Step
[Gemulla and Lehner 06] [Hu, Qiao, Tao 15]
Sampling from Distributed Streams
[Cormode, Muthukrishnan, Yi, Zhang 09] [Woodruff, Tirthapura 11]
Reduction from Coin Flip Sampling
The Algorithm
Communication Cost of Algorithm
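The slide bodies for this part were lost, so what follows is only a single-process sketch of one way to reduce fixed-size sampling to coin-flip sampling, not the talk's algorithm. It assumes each item gets a random level (the number of consecutive heads), that sites forward an item only if its level reaches the coordinator's current round j (probability 2^-j), and that the coordinator advances the round whenever at least s items would survive it. All class and method names are hypothetical, and the sites read the round directly where a real system would use a broadcast.

```python
import random

def geometric_level(rng):
    """Count consecutive heads; the level is >= j with probability 2**-j."""
    level = 0
    while rng.random() < 0.5:
        level += 1
    return level

class Coordinator:
    def __init__(self, s, rng):
        self.s, self.rng = s, rng
        self.round = 0          # items are currently forwarded w.p. 2**-round
        self.pool = []          # (level, item) pairs with level >= round
        self.messages = 0       # site-to-coordinator messages received

    def receive(self, level, item):
        self.messages += 1
        self.pool.append((level, item))
        # Advance the round while at least s items would survive the next one.
        while sum(1 for lv, _ in self.pool if lv > self.round) >= self.s:
            self.round += 1     # a real system would broadcast this to all sites
            self.pool = [(lv, it) for lv, it in self.pool if lv >= self.round]

    def sample(self):
        """Subsample s items without replacement from the coin-flip sample."""
        items = [it for _, it in self.pool]
        return self.rng.sample(items, min(self.s, len(items)))

class Site:
    def __init__(self, coordinator, rng):
        self.coordinator, self.rng = coordinator, rng

    def observe(self, item):
        level = geometric_level(self.rng)
        if level >= self.coordinator.round:     # coin-flip filter at the site
            self.coordinator.receive(level, item)
```

In this sketch the pool always holds every item whose level reaches the current round, so once s items have arrived it contains at least s of them, and the number of forwarded items grows roughly with s times the number of rounds rather than with the stream length.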
Importance Sampling
Sampling probability depends on how important data is
Frequency Estimation on Distributed Data
[Zhao, Ogihara, Wang, Xu 06] [Huang, Yi, Liu, Chen 11]
Frequency Estimation: Standard Solutions
Importance Sampling
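As a generic illustration of importance sampling for distributed frequency estimation (the slide's exact sampling function was lost), the sketch below lets each site report its local count x with probability min(1, x / tau) and has the coordinator add x divided by that probability. The threshold `tau` and both function names are assumptions made for this example; this is the standard inverse-probability estimator, not necessarily the one in the cited papers.

```python
import random

def send_probability(count, tau):
    """Larger local counts are more important, so they are reported more often."""
    return min(1.0, count / tau)

def estimate_total(local_counts, tau, rng=random):
    """Return (unbiased estimate of sum(local_counts), number of messages sent)."""
    estimate, messages = 0.0, 0
    for count in local_counts:
        if count == 0:
            continue                    # nothing to report from this site
        p = send_probability(count, tau)
        if rng.random() < p:            # the site decides locally whether to send
            messages += 1
            estimate += count / p       # reweighting keeps the estimate unbiased
    return estimate, messages
```

Counts of at least tau are always sent and contribute exactly; smaller counts contribute tau when sent and 0 otherwise, so raising tau lowers communication and raises variance.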
Communication cost
[Figure: communication cost over all possible inputs]
What Happened?
Variance-Communication Duality
[Figure: variance over all possible inputs]
Merge-Reduce Sampling
Better than simple random sampling
Random sample:
Median and Quantiles (order statistics)
Merge-Reduce Sampling
[Figure: two samples merged (+) into the sorted sequence 1 to 10, then reduced to 1, 3, 5, 7, 9]
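The merge and reduce steps in the figure can be sketched as follows, assuming two sorted samples of equal size whose elements carry the same weight. The input lists in the usage note are made up for illustration (the figure's two inputs are not recoverable), and `merge_reduce` and `estimated_rank` are hypothetical names.

```python
import random
from heapq import merge

def merge_reduce(a, b, rng=random):
    """Merge two sorted samples, then keep the odd or the even positions."""
    merged = list(merge(a, b))      # linear-time merge of two sorted lists
    offset = rng.randrange(2)       # 0: 1st, 3rd, ...   1: 2nd, 4th, ...
    return merged[offset::2]        # each survivor now stands for twice as many items

def estimated_rank(summary, weight, x):
    """Estimate how many original items are <= x, given the weight per element."""
    return weight * sum(1 for v in summary if v <= x)
```

For example, `merge_reduce([1, 4, 5, 8, 9], [2, 3, 6, 7, 10])` yields either 1, 3, 5, 7, 9 or 2, 4, 6, 8, 10. Either way the rank of any value changes by at most one input weight, and the error is zero in expectation because the two outcomes are equally likely.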
Application 1: Streaming Computation
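One common way to apply merge-reduce to a stream is a binary-counter hierarchy: fill a block, and whenever two blocks sit at the same level, merge-reduce them into one block at the next level. The sketch below assumes that scheme, since the slide text itself was lost; the class name, `block_size`, and the method names are all hypothetical.

```python
import random
from heapq import merge

class StreamingQuantiles:
    def __init__(self, block_size, rng=random):
        self.block_size = block_size
        self.rng = rng
        self.buffer = []    # newest items, each with weight 1
        self.levels = []    # levels[i]: None or a sorted block, weight 2**i per element

    def insert(self, x):
        self.buffer.append(x)
        if len(self.buffer) == self.block_size:
            self._carry(sorted(self.buffer), 0)
            self.buffer = []

    def _carry(self, block, level):
        # Works like incrementing a binary counter: an occupied level forces
        # a merge-reduce and a carry into the next level.
        while True:
            if level == len(self.levels):
                self.levels.append(None)
            if self.levels[level] is None:
                self.levels[level] = block
                return
            merged = list(merge(self.levels[level], block))
            self.levels[level] = None
            block = merged[self.rng.randrange(2)::2]    # keep odd or even positions
            level += 1

    def rank(self, x):
        """Estimated number of inserted items that are <= x."""
        r = sum(1 for v in self.buffer if v <= x)
        for i, block in enumerate(self.levels):
            if block is not None:
                r += (1 << i) * sum(1 for v in block if v <= x)
        return r
```

After n insertions the structure holds at most one block per level, so about block_size times log(n / block_size) elements in total.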
Error Analysis: Base case
Error Analysis: General Case
[Figure: merge hierarchy with Level 1 through Level 4]
Error Analysis: Azuma-Hoeffding
Application 2: Distributed Data
[Huang, Yi 14]
Generalization to Multi-dimensions
How to Reduce: Low-Discrepancy Coloring
Known Discrepancy Results
Sampling for Approximate Query Processing
Complex Analytical Queries (TPC-H)
    SELECT SUM(l_price)
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_shipdate >= '2017-03-01'
      AND l_shipdate <= '2017-03-22'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
Things to consider:
- What to return?
  - Simple aggregation (COUNT, SUM)
  - A sample (UDFs)
- Pre-computation allowed?
  - Pre-computed samples, indexes
Sampling from One Table
Binary Tree with Pre-computed Samples
[Figure: binary tree over leaves 1 to 16 with sampled values at the internal nodes and the active nodes highlighted; animation steps listed below]
- Report: 5
- Pick 7 or 14 with equal prob. Report: 5 7
- Pick 3, 8, or 14 with prob. 1:1:2
- Pick 3, 8, or 12 with equal prob. Report: 5 7 12
[Wang, Christensen, Li, Yi 16]
[Chaudhuri, Motwani, Narasayya 99]
Sampling Joins: Open Problems
Two Tables: COUNT, No Pre-computation
- Ripple join [Haas, Hellerstein 99]
  - Sample a tuple from each table
  - Join with previously sampled tuples from other tables
  - The joined sampled tuples are not independent, but unbiased
- Works well for full Cartesian product
  - But most joins are sparse …
- Can be extended to multiple tables but efficiency is even lower
- What can be done with pre-computation (indexes)?
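A minimal sketch of the ripple-join idea for COUNT over an equi-join of two tables follows. It assumes each table is given as a list of join-key values and that one new tuple is sampled from each table per step; the function name and parameters are made up for this example and are not from the cited paper.

```python
import random

def ripple_join_count(r_keys, s_keys, steps, rng=random):
    """Estimate the equi-join size of two tables from `steps` sampled tuples each."""
    r_order, s_order = list(r_keys), list(s_keys)
    rng.shuffle(r_order)            # a random scan order is a sample without replacement
    rng.shuffle(s_order)
    seen_r, seen_s = {}, {}         # join key -> number of sampled tuples with that key
    joined = 0                      # joining pairs among the sampled tuples
    n = min(steps, len(r_order), len(s_order))
    for i in range(n):
        r = r_order[i]
        joined += seen_s.get(r, 0)  # new R tuple joins the S tuples sampled earlier
        seen_r[r] = seen_r.get(r, 0) + 1
        s = s_order[i]
        joined += seen_r.get(s, 0)  # new S tuple joins all sampled R tuples so far
        seen_s[s] = seen_s.get(s, 0) + 1
    if n == 0:
        return 0.0
    # Each joining pair is sampled with probability (n/|R|) * (n/|S|).
    return joined * len(r_order) * len(s_order) / (n * n)
```

When the join is sparse, most sampled pairs do not join and `joined` stays near zero for a long time, which is the weakness the slide points out.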
A Running Example
What’s the total revenue of all orders from customers in China?

Nation  CID
US      1
US      2
China   3
UK      4
China   5
US      6
China   7
UK      8
Japan   9
UK      10

BuyerID  OrderID
4        1
3        2
1        3
5        4
5        5
5        6
3        7
?        8
3        9
7        10

OrderID  ItemID  Price
4        301     $2100
2        304     $100
3        201     $300
4        306     $500
3        401     $230
1        101     $800
2        201     $300
5        101     $200
4        301     $100
2        201     $600
Join as a Graph
[Figure: the three tables of the running example (Nation/CID, BuyerID/OrderID, OrderID/ItemID/Price) drawn as a graph with edges between joining tuples]
Sampling by Random Walks
[Figure: a random walk through the join graph of the running example, one joining tuple per table]
Can also deal with selection predicates
[Li, Wu, Yi, Zhao 16]
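A random-walk estimator for the running example's question (total revenue of orders from customers in China) might look like the sketch below. It assumes an index from each customer to their orders and from each order to its items, starts every walk at a uniformly chosen customer that passes the nation predicate, and weights a successful walk by the inverse of its probability. The table literals are transcribed from the slide residue; the orders row for OrderID 8 is not recoverable and is left out here, and all function names are hypothetical.

```python
import random

CUSTOMERS = [("US", 1), ("US", 2), ("China", 3), ("UK", 4), ("China", 5),
             ("US", 6), ("China", 7), ("UK", 8), ("Japan", 9), ("UK", 10)]
# (buyer_id, order_id); the row for OrderID 8 is missing from the source.
ORDERS = [(4, 1), (3, 2), (1, 3), (5, 4), (5, 5), (5, 6), (3, 7), (3, 9), (7, 10)]
# (order_id, item_id, price)
LINEITEMS = [(4, 301, 2100), (2, 304, 100), (3, 201, 300), (4, 306, 500),
             (3, 401, 230), (1, 101, 800), (2, 201, 300), (5, 101, 200),
             (4, 301, 100), (2, 201, 600)]

ORDERS_BY_BUYER, PRICES_BY_ORDER = {}, {}       # the assumed join indexes
for buyer_id, order_id in ORDERS:
    ORDERS_BY_BUYER.setdefault(buyer_id, []).append(order_id)
for order_id, _item_id, price in LINEITEMS:
    PRICES_BY_ORDER.setdefault(order_id, []).append(price)

def exact_revenue(nation):
    """Full join, for comparison with the sampled estimate."""
    return sum(price
               for nat, cid in CUSTOMERS if nat == nation
               for order_id in ORDERS_BY_BUYER.get(cid, [])
               for price in PRICES_BY_ORDER.get(order_id, []))

def one_walk(start_customers, rng):
    """Walk customer -> order -> item; return price / (probability of this path)."""
    cid = rng.choice(start_customers)
    weight = len(start_customers)
    orders = ORDERS_BY_BUYER.get(cid, [])
    if not orders:
        return 0.0                  # dead end: the walk contributes 0
    order_id = rng.choice(orders)
    weight *= len(orders)
    prices = PRICES_BY_ORDER.get(order_id, [])
    if not prices:
        return 0.0
    price = rng.choice(prices)
    weight *= len(prices)
    return float(price * weight)

def estimate_revenue(nation, walks, rng=random):
    start = [cid for nat, cid in CUSTOMERS if nat == nation]
    if not start:
        return 0.0
    return sum(one_walk(start, rng) for _ in range(walks)) / walks
```

Each join result is reached with probability equal to the reciprocal of the walk's weight, so the average over walks is an unbiased estimate of the exact sum, and failed walks simply count as zero.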
Open Problem
Thank you!