Endbiased Samples for Join Cardinality Estimation Cristian Estan

End-biased Samples for Join Cardinality Estimation Cristian Estan, Jeffrey F. Naughton Computer Sciences Department University of Wisconsin-Madison

Problem description n Estimating join size n Not restricted to key-foreign key joins n Based on summaries of the two tables computed separately n Two main contributions of this paper n Proposing a new type of summaries based on a special type of sampling n Extensive experimental comparison of many types of summaries

We can get more accurate estimates! n [AGMS 99] showed that on certain data sets n All summaries give inaccurate estimates n Estimates based on random sampling are within constant factor of bound n We show that n On other data sets, our estimates significantly more accurate than those with random sampling n No known summaries give estimates more accurate than all others for every data set

Overview n End-biased samples n Theoretical comparison against other sampling -based methods n Experimental comparison against sketches and histograms

Building the end-biased samples n If frequency of every value known for both tables → exact join size n We keep a sample of this data n n Sampling probability proportional to frequency [DLT 01] Sampling decisions correlated by using a shared hash function [F 90], [DG 00], [EKMV 04] Frequency of values of join attribute in table A Frequency of values of join attribute in table B (c, 10) p=1 (d, 1) p=0. 25 (g, 1) p=0. 25 (m, 2) p=0. 5 (m, 1) p=0. 25 (s, 5) p=1 (r, 7) p=1 (t, 1) p=0. 25 (z, 1) p=0. 25 Sampling threshold T=4

Estimating join size n Let av be the frequency of value v in table A, bv in B and pv the probability that v is selected into both samples n Sum contribution of values in both samples (av bv/pv) to estimate join size If av ≥Ta and bv≥Tb , pv =1 n If av ≥Ta and bv<Tb , pv =bv/Tb n If av <Ta and bv≥Tb , pv =av/Ta n If av <Ta and bv<Tb , pv =min(av/Ta, bv/Tb) n

Why correlate the samples? n Example: tables with 1000 values appearing once, 50 values common to both tables We sample with probability 1/10 n Sample size ~ 100 for each table n n Comparison Correlated Uncorrel. n pv 0. 1 0. 01 n Common values sampled ~ 4, 5 or 6 ~ 0 or 1 n Join size estimate 40, 50 or 60 0 or 100

Comparison of sampling methods Type of values dominating the join Frequent in both relations Frequent in one relation Infrequent in both relations Accuracy of estimates of join size Random sampling Counting samples End-biased samples Good Very good Perfect Bad Bad Bad Good

Overview n End-biased samples n Theoretical comparison against other sampling -based methods n Experimental comparison against sketches and histograms

Experimental methodology n Randomly generated tables with ~ 1, 000 tuples n Explored multiple configurations n Varied the “peakedness” of the distribution n Varied memory budget from 204 to 659, 456 words n Varied the amount of correlation between tables n Uncorrelated – tables generated independently n Positively correlated – frequent values likely same in both tables n Negatively correlated – unlikely frequent values same in the two tables n 1, 000 runs for each configuration

Summaries compared n End-biased samples n End-biased equi-depth histograms [PC 84] n Sketches [AGMS 99], [DGGR 02], [GGR 04] n Concise samples [GM 98] n Counting samples [GM 98]

Comparison with histograms

Comparison with sketches

Memory comparison

Qualitative comparison Advantage Sketches Streaming updates End-biased samples X Simple configuration X Selection on join attribute X

Conclusions n End-biased samples and sketches are the best summaries for the join size estimation problem addressed in this paper n End-biased samples are compelling if Selections on the join attribute are required n Summaries must be very concise n The frequencies of join attributes in the two tables are strongly correlated n

Questions? Thank you! Scripts and results for experiments available at http: //www. cs. wisc. edu/~estan/ebs. tar. gz

Estimating the join size

Related work – sampling methods n [GM 98] concise samples, counting samples n [DLT 01] smart sampling n [F 90], [EKMV 04] using a hash function to select values used as summary of data

Related work – join size estimation n Histograms n Multidimensional histograms n [GG 02], [GK 04] Wavelets n [AGMS 99], [DGGR 02], [GGR 04] Sketches

Variance of join size estimate n No slide, point to the paper.