Random Sampling over Joins Revisited
Zhuoyue Zhao, Robert Christensen, Feifei Li (University of Utah)
Xiao Hu, Ke Yi (Hong Kong University of Science and Technology)
A motivating example
Predicting the return flag of an item shipped to a customer
  – Using features of both the item and another item shipped to the same customer
Label | Features
Flag  | Cust. Id  Region  Total  Discount  Flag 2  Total 2  Discount 2
1     | 10        2       100    0.2       0       20       0.5
0     | 20        1       200    0.0       0       100      0.1
0     | 20        1       500    0.1       0       300      0.2
……
A motivating example
Joining 7 Tables from TPC-H
In order to predict the return_flag of an item ℓ1 shipped to a customer c, we may want to look at another item ℓ2 shipped to the same customer c and include the return_flag of ℓ2 as a feature.
A motivating example
Training a classifier using SVM on the join over 7 tables
  – Full join takes more than 12 hours to compute (TPC-H scale factor 40).
  – Training runs forever without down-sampling.
[Figure: the input tables are joined into one table over attributes A to H, which is used for training and evaluation; accuracy = 80%.]
Problem definition
This work: simple random samples (uniform and independent)
  – Good for complex tasks such as machine learning or statistics
  – Prior works have restrictions
      • 2-table join [Olken, Ph.D. dissertation ’93; Chaudhuri et al., SIGMOD’99]
      • Multi-way foreign key joins [Acharya et al., SIGMOD’99]
  – Challenge: sampling over general joins
Prior works focus on non-uniform or correlated samples over joins
  – Only work on simple aggregates
      • Ripple join (uniform but correlated samples) [Haas et al., SIGMOD’99]
      • Wander join (independent but non-uniform samples) [Li et al., SIGMOD’16]
Example: 2-table join sampling
[Figure: a table over (A, B), a table over (B, C), and their join result over (A, B, C).]
A general sampling framework for multi-way joins
[Figure: tuples of (A, B) and (B, C), written as value pairs such as 1|1 and 2|4, annotated with weights.]
Instantiation of the Join Sampling Framework
[Figure: three tables with tuples written as value pairs. (A, B): 1|1, 2|1, 7|2. (B, C): 1|1, 1|2, 2|1. (C, D): 1|2, 2|4, 2|8, 3|6.]
General Join
  – Acyclic Join
  – Cyclic Join
  – Join w/ Selection Predicate
Learning over Join Example: SVM
Joining 7 Tables from TPC-H
In order to predict the return_flag of an item ℓ1 shipped to a customer c, we want to look at another item ℓ2 shipped to the same customer c and include the return_flag of ℓ2 as a feature.
Learning over Join Example: SVM
[Figure: prediction error of SVM trained on the full join results.]
Summary
A general multi-way join sampling framework
  – Covers prior works as special cases
  – Extends prior works to general joins & new instantiations
Empirical evaluation to show good performance of the framework
Thank you! Q&A
Sampling framework instantiations
Olken’s algorithm for 2-table joins
[Figure: example tables over (A, B) and (B, C).]
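The slide only names the algorithm, so the following is a minimal sketch of the accept/reject idea commonly attributed to Olken, not the authors' code. It assumes an in-memory hash index on the join attribute; the tables `R1` and `R2` and the helper names `build_index` and `olken_sample` are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict

# Hypothetical example tables (illustrative values, not read from the slide).
R1 = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2), (7, 2)]  # rows (A, B)
R2 = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 3)]                  # rows (B, C)

def build_index(table, key_pos):
    """Group the rows of `table` by the value in column `key_pos`."""
    index = defaultdict(list)
    for row in table:
        index[row[key_pos]].append(row)
    return index

def olken_sample(r1, r2_index, max_degree, rng):
    """Return one uniform sample of the join, retrying after each rejection.

    Loops forever if the join is empty; a real implementation would guard that.
    """
    while True:
        a, b = rng.choice(r1)                 # uniform row of the first table
        matches = r2_index.get(b, [])         # rows of the second table with the same B
        # Accept with probability degree(b) / max_degree, so that every join
        # result is produced with the same probability 1 / (|R1| * max_degree).
        if rng.random() < len(matches) / max_degree:
            _, c = rng.choice(matches)
            return (a, b, c)

r2_index = build_index(R2, 0)
max_degree = max(len(rows) for rows in r2_index.values())
rng = random.Random(0)
samples = [olken_sample(R1, r2_index, max_degree, rng) for _ in range(5)]
print(samples)
```

With these assumed tables the join has 13 results and the maximum degree is 3, so one attempt is accepted with probability 13 / (7 * 3).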
Chaudhuri et al.’s algorithm for 2-table joins
[Figure: example tables over (A, B) and (B, C).]
Acceptance rate = 1
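One way to reach an acceptance rate of 1 is to pick the first row with probability proportional to its number of join partners, which is how the frequency-based approach of Chaudhuri et al. is usually described. The sketch below assumes the degree of every join value is known up front; the tables and the names `degree_weights` and `weighted_join_sample` are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical example tables (illustrative values, not read from the slide).
R1 = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2), (7, 2)]  # rows (A, B)
R2 = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 3)]                  # rows (B, C)

def build_index(table):
    """Group the rows of the second table by their join value B."""
    index = defaultdict(list)
    for row in table:
        index[row[0]].append(row)
    return index

def degree_weights(r1, r2_index):
    """Weight of a row of R1 = number of rows of R2 that join with it."""
    return [len(r2_index.get(b, [])) for (_, b) in r1]

def weighted_join_sample(r1, r2_index, weights, rng):
    """Draw one uniform join result without any rejection step.

    random.choices raises ValueError if all weights are zero (empty join).
    """
    a, b = rng.choices(r1, weights=weights, k=1)[0]  # P(row) = degree / join size
    _, c = rng.choice(r2_index[b])                   # P(partner) = 1 / degree
    return (a, b, c)

r2_index = build_index(R2)
weights = degree_weights(R1, r2_index)
rng = random.Random(0)
samples = [weighted_join_sample(R1, r2_index, weights, rng) for _ in range(5)]
print(weights, samples)
```

The two probabilities multiply to 1 / (join size) for every result, which is why no sample has to be rejected; the price is that the degrees must be computed before sampling starts.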
Acharya et al.’s algorithm for multi-way foreign-key joins
[Figure: example tables over (A, B), (B, C), and (C, D).]
A general sampling framework for multi-way joins
[Figure: example tables over (A, B), (B, C), and (C, D).]
Exact Weight
Total weight: 16
(A, B)   weight | (B, C)   weight | (C, D)   weight
1, 1     4      | 1, 1     1      | 1, 2     1
2, 1     4      | 1, 2     2      | 2, 4     1
3, 1     4      | 1, 3     1      | 2, 8     1
4, 2     1      | 2, 1     1      | 3, 6     1
5, 2     1      | 3, 3     1      |
6, 2     1      |                 |
7, 2     1      |                 |
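A sketch of how exact weights could be computed and used for a chain join, under the assumption that the weight of a tuple is the number of join results it leads to in the remaining tables. The table contents are one reading of the flattened slide figure and should be treated as an assumption; `exact_weights` and `sample_chain` are hypothetical names.

```python
import random
from collections import defaultdict

# Assumed reading of the slide's example tables; treat the values as illustrative.
AB = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2), (7, 2)]
BC = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 3)]
CD = [(1, 2), (2, 4), (2, 8), (3, 6)]
CHAIN = [AB, BC, CD]

def exact_weights(chain):
    """weights[i][j] = number of join results of chain[i:] that start at chain[i][j]."""
    weights = [None] * len(chain)
    weights[-1] = [1] * len(chain[-1])            # last table: every row counts once
    for i in range(len(chain) - 2, -1, -1):       # bottom-up dynamic programming
        total_by_key = defaultdict(int)
        for row, w in zip(chain[i + 1], weights[i + 1]):
            total_by_key[row[0]] += w
        weights[i] = [total_by_key.get(row[1], 0) for row in chain[i]]
    return weights

def sample_chain(chain, weights, rng):
    """Walk the chain top-down, choosing each row proportionally to its weight."""
    result = []
    candidates = list(range(len(chain[0])))
    for i, table in enumerate(chain):
        ws = [weights[i][j] for j in candidates]
        j = rng.choices(candidates, weights=ws, k=1)[0]
        result.append(table[j])
        if i + 1 < len(chain):
            key = table[j][1]
            candidates = [k for k, row in enumerate(chain[i + 1]) if row[0] == key]
    return result

WEIGHTS = exact_weights(CHAIN)
rng = random.Random(0)
sample = sample_chain(CHAIN, WEIGHTS, rng)
print(WEIGHTS[0], sum(WEIGHTS[0]), sample)
```

Because the weights of the matching rows at each step sum to the weight of the row chosen before them, the step probabilities telescope to 1 / (total weight), so every join result is equally likely and no attempt is rejected.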
Reverse Sampling (Acharya et al.’s algorithm)
[Figure: example tables over (A, B), (B, C), and (C, D).]
A general sampling framework for multi-way joins (contd.)
[Figure: example tables over (A, B), (B, C), and (C, D), each tuple annotated with a weight; total weight 16.]
Extended Olken’s Algorithm
[Figure: tables over (A, B), (B, C), and (C, D), each tuple annotated with a pair of weights such as "4, 6"; the root is annotated "16, 42".]
Remarks:
  + Low initialization cost
  - High rejection rate when the join degrees are skewed
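A sketch of how Olken-style rejection might extend to a chain of three tables, assuming the only statistics available are the maximum degrees of the join values. The table contents are one reading of the flattened slide figure and are an assumption; `build_indexes` and `try_sample` are hypothetical names.

```python
import random
from collections import defaultdict

# Assumed reading of the slide's example tables; treat the values as illustrative.
AB = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2), (7, 2)]
BC = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 3)]
CD = [(1, 2), (2, 4), (2, 8), (3, 6)]
CHAIN = [AB, BC, CD]

def build_indexes(chain):
    """indexes[i] maps a join value to the matching rows of chain[i + 1]."""
    indexes = []
    for table in chain[1:]:
        index = defaultdict(list)
        for row in table:
            index[row[0]].append(row)
        indexes.append(index)
    return indexes

def try_sample(chain, indexes, max_deg, rng):
    """One attempt: returns a list of joined rows, or None if rejected."""
    row = rng.choice(chain[0])
    result = [row]
    for index, bound in zip(indexes, max_deg):
        matches = index.get(row[1], [])
        # Continue with probability degree / max degree; otherwise reject.
        if rng.random() >= len(matches) / bound:
            return None
        row = rng.choice(matches)
        result.append(row)
    return result

INDEXES = build_indexes(CHAIN)
MAX_DEG = [max(len(rows) for rows in index.values()) for index in INDEXES]
UPPER_BOUND = len(AB) * MAX_DEG[0] * MAX_DEG[1]   # bound on the join size

rng = random.Random(0)
attempts = [try_sample(CHAIN, INDEXES, MAX_DEG, rng) for _ in range(1000)]
accepted = [r for r in attempts if r is not None]
print(UPPER_BOUND, len(accepted))
```

Initialization only needs the maximum degrees, which is cheap, but each attempt succeeds with probability (join size) / (upper bound). For the assumed tables that is 16 / 42, and the ratio shrinks as a few join values become much more frequent than the rest.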
Online Exploration
[Figure: tables over (A, B), (B, C), and (C, D), each tuple annotated with a pair of weights such as "4, 6"; the root is annotated "16, 42".]
Remarks:
  • Moderate initialization cost
  • Converges to Exact Weight
Acyclic Joins
General Join
TPC-H (scale 10)
Q3: a 3-table foreign key chain join
QY: a 7-table cyclic join
Social Graph
QT: a triangle join
QF: a 4-table acyclic join (snowflake)
Scalability (TPC-H QY)
[Figure: time to collect the first sample.]
Selection Predicates
One selection predicate
Two selection predicates (x >= a and y >= b)
Selectivity is fixed at 20%.
A general sampling framework for multi-way joins
[Figure: tuples of the three example tables, written as value pairs such as 1|1, 7|2, 2|8, and 3|6, annotated with weights.]
- Slides: 36