Sampling Based Range Partition for Big Data Analytics + Some Extras
Milan Vojnović, Microsoft Research, Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012
Big Data Analytics
• Our goal: innovation in algorithms for large-scale computation, to move the frontier of the computer science of big data
• Some figures of scale
– Petabytes / terabytes of online services data processed daily
– 200 M tweets per day (Twitter)
– 1 B content pieces shared per day (Facebook)
– 8,000 exabytes of global data by 2015 (The Economist)
Research Agenda
• Database queries
• Machine learning
• Optimization
• Distributed computing system
Outline
• Range Partition, with Fei Xu and Jingren Zhou
• Count Tracking, with Zhenming Liu and Bozidar Radunovic
• Graph Partitioning (definition only), with Charalampos Tsourakakis and Bozidar Radunovic
Range Partition
[Figure: data items at sites 1, 2, ..., k are mapped to ranges 1, 2, ..., m, for example 1-100, 101-250, ..., 950-1024]
• Special interest: balanced range partition
Range Partition Requirements
Two Approaches
• Sampling based methods
– Take a sample of data items
– Compute partition boundaries using the sample
• Quantile summary methods
– At each node, compute a local quantile summary
– Merge the summaries at the coordinator node
Related Work
Related Work (cont’d)
Problem
• Range partition the data in a single pass over it, with minimal communication between the coordinator and the sites
Sampling Based Method
[Figure: sites 1, 2, ..., k and the coordinator]
• Pros
– Simplicity, scalability
• Cons
– How many samples to take from each site?
– Data size imbalance: the number of input records may differ from one machine to another
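To make the question of per-site sample sizes concrete, here is a minimal Python sketch of an unweighted baseline. It is an illustration only, not the scheme from the talk: the function names, the toy in-memory "sites", and the choice of 200 samples per site are all assumptions. Every site contributes the same number of samples, the coordinator merges them without weights, and boundaries are taken at equally spaced quantiles of the merged sample.

```python
import random
from bisect import bisect_right

def boundaries_from_sample(sample, m):
    """Pick m-1 boundaries at equally spaced quantiles of the sorted sample."""
    s = sorted(sample)
    return [s[(j * len(s)) // m] for j in range(1, m)]

def range_sizes(data, boundaries):
    """Count how many items fall into each of the len(boundaries)+1 ranges."""
    sizes = [0] * (len(boundaries) + 1)
    for x in data:
        sizes[bisect_right(boundaries, x)] += 1
    return sizes

def naive_partition(sites, m, per_site, rng):
    """Hypothetical baseline: same sample size at every site, unweighted merge."""
    sample = []
    for records in sites:
        sample.extend(rng.sample(records, min(per_site, len(records))))
    return boundaries_from_sample(sample, m)

rng = random.Random(0)
# Toy imbalance (assumed data): a large site with small keys, a small site with large keys.
sites = [list(range(0, 9000)), list(range(9000, 10000))]
bounds = naive_partition(sites, m=4, per_site=200, rng=rng)
sizes = range_sizes([x for records in sites for x in records], bounds)
```

In this toy setup the small site supplies half of the merged sample although it holds a tenth of the records, so two of the three boundaries land inside its key range and the first two ranges together absorb at least 9,000 of the 10,000 records.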
Data Sizes Imbalance
Dataset    Records  Bytes   Sites
DataSet-1  62 M     150 G   262
DataSet-2  37 M     25 G    80
DataSet-3  13 M     0.26 G  1
DataSet-4  7 M      1.2 T   301
DataSet-5  106 M    7 T     5652
Origins of Data Sizes Imbalance
• JOIN
SELECT FROM A INNER JOIN B ON A.KEY==B.KEY ORDER BY COL
• Lookup Table
If the record value of column X is in the lookup table, then return the row
• UNPIVOT
Input:
Col 1  Col 2
1      2, 3
2      3, 9, 8, 13
…
Output: (1, 2), (1, 3), (2, 9), …
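The UNPIVOT case can be sketched in a few lines of Python, assuming that Col 2 holds a list of values and that each value yields one output pair. The helper name `unpivot` is hypothetical. The point is that the number of output records per input row depends on the list length, so machines holding the same number of input rows can emit very different numbers of records.

```python
def unpivot(rows):
    """Expand each (key, values) row into one (key, value) pair per value."""
    return [(key, v) for key, values in rows for v in values]

# The two input rows shown on the slide.
rows = [(1, [2, 3]), (2, [3, 9, 8, 13])]
pairs = unpivot(rows)

# Output records produced by each input row.
out_per_row = [len(values) for _, values in rows]
```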
Weighted Sampling Scheme
SAMPLE [Figure: sites 1, 2, ..., k and the coordinator]
MERGE [Figure: coordinator]
PARTITION [Figure: coordinator; ranges 1 to 5]
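The SAMPLE, MERGE and PARTITION steps might look roughly like the following Python sketch. The details of the scheme are not recoverable from the slides, so this rests on stated assumptions: each site draws a fixed-size sample and reports its total record count, the coordinator gives every sampled key the weight (records at the site) / (samples from the site), and boundaries are placed where the cumulative weight crosses equal fractions of the total. All function names are hypothetical.

```python
import random

def site_sample(records, size, rng):
    """SAMPLE: a site returns (sampled keys, its total record count)."""
    return rng.sample(records, min(size, len(records))), len(records)

def merge(reports):
    """MERGE: tag each sampled key with weight n_i / s_i, then sort by key."""
    weighted = []
    for keys, n in reports:
        if keys:  # skip sites that hold no records
            weighted.extend((key, n / len(keys)) for key in keys)
    weighted.sort()
    return weighted

def partition(weighted, m):
    """PARTITION: cut where cumulative weight first reaches j/m of the total."""
    total = sum(w for _, w in weighted)
    bounds, acc, j = [], 0.0, 1
    for key, w in weighted:
        acc += w
        while j < m and acc >= j * total / m:
            bounds.append(key)
            j += 1
    return bounds

rng = random.Random(0)
# Toy imbalance (assumed data): 9,000 records at one site, 1,000 at the other.
sites = [list(range(0, 9000)), list(range(9000, 10000))]
reports = [site_sample(records, 200, rng) for records in sites]
bounds = partition(merge(reports), m=4)
```

With weights 45 and 5 for the two sites, the cumulative weight reaches three quarters of the total inside the large site's sample, so all three boundaries fall in the large site's key range, where 90% of the records are.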
Sufficient Sample Size
Constant Factor Imbalance
Proof Outline
Performance
• DataSet-1
Performance (cont’d)
Summary for Range Partitioning
• Novel weighted sampling scheme
• Provable performance guarantees
• Simple and practical
– Code transfer to Cosmos
• More info: Sampling Based Range Partition Methods for Big Data Analytics, V., Xu, Zhou, MSR-TR-2012-18, Mar 2012
Count Tracking (with Zhenming Liu and Bozidar Radunovic)
SUM Tracking Problem
[Figure: sites 1, 2, 3, ..., k; SUM]
SUM Tracking
Applications
• Input data
State of the Art
The Challenge
• Q: What are communication-efficient algorithms for the sum tracking problem with random input streams?
– Random permutation
– Random i.i.d.
– Fractional Brownian motion
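As a point of reference for the communication/accuracy trade-off, here is a minimal Python simulation of a simple absolute-error baseline under random i.i.d. ±1 input. It is explicitly not the tracker from the talk. The assumptions are: updates are +1 or -1, each update arrives at a uniformly random site, and a site reports its unreported drift once its magnitude reaches a fixed threshold `delta`. The names `Site` and `simulate` are hypothetical.

```python
import random

class Site:
    def __init__(self, delta):
        self.delta = delta  # report once |unreported drift| reaches delta
        self.drift = 0      # local change not yet sent to the coordinator

    def update(self, x):
        """Apply an update x in {-1, +1}; return a message or None."""
        self.drift += x
        if abs(self.drift) >= self.delta:
            msg, self.drift = self.drift, 0
            return msg
        return None

def simulate(k, n, delta, rng):
    """Run n random +/-1 updates over k sites; return (messages, max error)."""
    sites = [Site(delta) for _ in range(k)]
    true_sum = estimate = messages = max_err = 0
    for _ in range(n):
        x = rng.choice((-1, 1))
        true_sum += x
        msg = sites[rng.randrange(k)].update(x)
        if msg is not None:
            estimate += msg  # coordinator folds the reported drift into its estimate
            messages += 1
        max_err = max(max_err, abs(true_sum - estimate))
    return messages, max_err

messages, max_err = simulate(k=4, n=10000, delta=5, rng=random.Random(1))
```

After every update each site holds at most `delta - 1` unreported units, so the coordinator's absolute error never exceeds `k * (delta - 1)`, and each message accounts for at least `delta` updates. A fixed threshold gives no relative-accuracy guarantee when the sum hovers near zero, which is the regime that nonmonotonic streams make hard.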
Communication Complexity Bounds
Communication Complexity Bounds: Unknown Drift Case
Our Tracker Algorithm
[Figure: sites and the coordinator; S = S_1 + … + S_k; labels (S, S_1), ..., (S, S_k), M_i = 1, X_i]
Two Applications
• Second Frequency Moment
• Bayesian Linear Regression
App 1: Second Frequency Moment
AMS Sketch
• {0, 1}-valued hash
App 1: Second Frequency Moment (cont’d)
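For readers unfamiliar with the AMS sketch, the following Python sketch estimates the second frequency moment F2 (the sum of squared item frequencies) from a stream of insertions and deletions. It uses the standard ±1 formulation; a {0, 1}-valued hash bit b, as named on the slide, maps to a sign via 2b - 1. Explicit random sign tables stand in for the 4-wise independent hash functions a real implementation would use, and the function names are hypothetical.

```python
import random

def make_sign_tables(universe, counters, rng):
    """One random +/-1 sign per (counter, item); stands in for hash functions."""
    return [{item: rng.choice((-1, 1)) for item in universe}
            for _ in range(counters)]

def ams_f2(stream, signs):
    """Estimate F2 = sum_i f_i^2 from a stream of (item, delta) updates."""
    z = [0] * len(signs)
    for item, delta in stream:
        for c, table in enumerate(signs):
            z[c] += table[item] * delta  # each counter is a running signed sum
    # Each z[c]^2 is an unbiased estimate of F2; average them to cut variance.
    return sum(v * v for v in z) / len(z)

rng = random.Random(0)
signs = make_sign_tables(universe=["a", "b", "c"], counters=64, rng=rng)
# Five insertions and two deletions of "a": frequency 3, so F2 = 9 exactly.
stream = [("a", 1)] * 5 + [("a", -1)] * 2
estimate = ams_f2(stream, signs)
```

Each counter is a sum of +1 and -1 contributions that can both rise and fall, which is presumably why the sketch appears here as an application of nonmonotonic sum tracking.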
App 2: Bayesian Linear Regression
App 2: Bayesian Linear Regression (cont’d)
Summary for Sum Tracking
• Studied the sum tracking problem with nonmonotonic distributed streams under random permutation, random i.i.d. and fractional Brownian motion inputs
• Proposed a novel algorithm with nearly optimal communication complexity
• Details: ACM PODS 2012
Graph Partitioning (definition only; with Charalampos Tsourakakis and Bozidar Radunovic)
Problem
• Partition a graph with two objectives
– Sparsely connected components
– Balanced number of vertices per component
• Applications
– Parallel processing
– Community detection
Problem (cont’d)
[Figure: components 1, 2, 3, ..., k]
• Requirements
– Streaming algorithm
– Single pass / incremental
– Efficient computing
• Desired
– Approximation guarantees
– Average-case efficient
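To illustrate what a single-pass streaming partitioner looks like, here is a short Python sketch of one well-known online heuristic, linear deterministic greedy. It is not the algorithm of this talk. Each arriving vertex goes to the part that already holds most of its neighbours, discounted by how full that part is. The function names, the tie-breaking rule, and the toy graph are assumptions.

```python
def ldg_partition(vertex_stream, k, capacity):
    """vertex_stream yields (vertex, neighbours); returns {vertex: part index}."""
    assignment = {}
    sizes = [0] * k
    for v, neighbours in vertex_stream:
        # Count neighbours of v already placed in each part.
        placed = [0] * k
        for u in neighbours:
            if u in assignment:
                placed[assignment[u]] += 1
        # Score = placed neighbours * remaining capacity fraction;
        # ties go to the smaller part, then to the lower index.
        best = max(range(k),
                   key=lambda p: (placed[p] * (1 - sizes[p] / capacity), -sizes[p]))
        assignment[v] = best
        sizes[best] += 1
    return assignment

def cut_edges(edges, assignment):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

# Toy graph (assumed): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: [] for v in range(6)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

assignment = ldg_partition(((v, adj[v]) for v in range(6)), k=2, capacity=3)
cut = cut_edges(edges, assignment)
```

On this toy input the heuristic recovers the two triangles and cuts only the bridging edge, which shows both objectives at once: few cross-component edges and equal component sizes.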
Summary for Graph Partitioning
• Designed a streaming algorithm whose average-case performance appears superior to that of the previously proposed online heuristics
• Provable approximation guarantees
• More details available soon