Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches

![Experiments: Foreign Key Detection • Plug estimated INC into [Rostin et al., 2009] and [Chen et al., 2014]](https://slidetodoc.com/presentation_image_h/0440cfa68af7c044f8c643152efd1ccb/image-24.jpg)


- Slides: 26
Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches • Azade Nazi, Bolin Ding, Vivek Narasayya, Surajit Chaudhuri
Inclusion Coefficient (INC) •
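The formula on this slide did not survive extraction. As a minimal sketch, assuming the usual definition of the inclusion coefficient of column X in column Y as |X ∩ Y| / |X| over distinct values, the exact computation could look like this. The function name and the sample columns are hypothetical.

```python
def inclusion_coefficient(x_values, y_values):
    """Exact INC of X in Y, assumed here to be |X ∩ Y| / |X| over distinct values."""
    x_set, y_set = set(x_values), set(y_values)
    if not x_set:
        raise ValueError("INC is undefined for an empty column X")
    return len(x_set & y_set) / len(x_set)


# Hypothetical columns: a candidate foreign key and the key it might reference.
orders_customer_id = [1, 2, 2, 3, 7]
customers_id = [1, 2, 3, 4, 5]

inc = inclusion_coefficient(orders_customer_id, customers_id)
print(inc)  # 0.75: three of the four distinct values of X appear in Y
```

An INC of 1.0 means full containment, which is why the measure reads as a relaxed form of containment.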
Applications • Data integration • Relaxed containment measure • Detecting data quality issues such as missing values • Foreign-key and functional dependency (FD) detection • INC is an important feature in detection algorithms • For example, [Rostin et al., 2009] and [Chen et al., 2014]
Exact vs. Approximate INC Computation • Exact INC computation is expensive • Requires a full scan/join of the data [Lopes et al., 2002] • Too expensive for large database schemas and tables • Even worse when computed for all pairs of columns (foreign-key detection) • Approximate (estimated) INC: sketch-based approaches • Based on Bottom-k sketch [Cohen and Kaplan, 2007]: Jaccard -> INC [Zhang et al., 2010] • This paper: better accuracy with error bounds (based on HyperLogLog sketch [Flajolet et al., 2007])
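The "Jaccard -> INC" step of the Bottom-k baseline can be illustrated as follows. This is a rough sketch, not the cited authors' code: the hash function, the helper names, and the assumption that the distinct counts |X| and |Y| are already known (for example from catalog statistics) are all choices made here for illustration. The conversion uses |X ∩ Y| = J (|X| + |Y|) / (1 + J).

```python
import hashlib


def h(value):
    """Map a value to a pseudo-random 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k(values, k):
    """Bottom-k sketch: the k smallest distinct hash values of a column."""
    return sorted({h(v) for v in values})[:k]


def jaccard_from_sketches(sk_x, sk_y, k):
    """Estimate Jaccard from the k smallest hashes in the union of both sketches."""
    union_bottom = sorted(set(sk_x) | set(sk_y))[:k]
    if not union_bottom:
        raise ValueError("both sketches are empty")
    in_both = set(sk_x) & set(sk_y)
    return sum(1 for v in union_bottom if v in in_both) / len(union_bottom)


def inc_from_jaccard(j, n_x, n_y):
    """Convert Jaccard to INC: |X ∩ Y| = J (|X| + |Y|) / (1 + J), divided by |X|."""
    return j * (n_x + n_y) / ((1.0 + j) * n_x)


# Hypothetical columns with |X| = 1000, |Y| = 1500 and 500 shared values.
x, y = range(0, 1000), range(500, 2000)
k = 256
j_hat = jaccard_from_sketches(bottom_k(x, k), bottom_k(y, k), k)
inc_hat = inc_from_jaccard(j_hat, 1000, 1500)
print(j_hat, inc_hat)  # true values are J = 0.25 and INC = 0.5
```

Because the estimate goes through Jaccard first, any error in J carries over into INC, which is the kind of weakness the slide contrasts with the HyperLogLog-based approach.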
Problem Definition • … …
Problem Definition •
Our Approach: BML (Binomial Mean-Lookup) •
HyperLogLog Sketch Construction (Recap)
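The slide content for this recap was lost, so the following is a minimal sketch of standard HyperLogLog construction rather than the deck's own figures. Assumptions: a 64-bit hash taken from `hashlib.blake2b`, a power-of-two bucket count k = 2**b, and the hypothetical names `hash64` and `hll_sketch`. Each value is hashed, the first b bits select a bucket, and the bucket keeps the maximum position of the leftmost 1-bit seen in the remaining bits.

```python
import hashlib


def hash64(value):
    """64-bit hash of a value (the hash function is an assumption of this sketch)."""
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def hll_sketch(values, b=7):
    """Build a HyperLogLog sketch with k = 2**b buckets."""
    k = 1 << b
    rest_bits = 64 - b
    sketch = [0] * k  # 0 marks an empty bucket
    for v in values:
        x = hash64(v)
        bucket = x >> rest_bits                   # first b bits pick the bucket
        rest = x & ((1 << rest_bits) - 1)         # remaining bits
        rho = rest_bits - rest.bit_length() + 1   # 1-based position of leftmost 1-bit
        sketch[bucket] = max(sketch[bucket], rho)
    return sketch


# Hypothetical columns; duplicates never change a sketch.
sketch_x = hll_sketch(range(0, 1000))
sketch_y = hll_sketch(range(500, 2000))

# The sketch of a union is the bucket-wise maximum of the two sketches.
sketch_union = [max(sx, sy) for sx, sy in zip(sketch_x, sketch_y)]
```

Each bucket stores only a small integer, so a sketch with a few thousand buckets stays tiny regardless of column size.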
Key Ideas: Estimating INC from HLL Sketches • ?
Binomial Mean-Lookup Estimator • Binomial Mean Estimation → Lookup
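One plausible reading of the two steps named on this slide: first estimate, across buckets, how often the bucket value of X is no larger than that of Y (a binomial mean), then look up the INC value whose predicted probability matches that estimate. The sketch below follows that reading. The probability model in `prob_x_le_y` is a simplification introduced here for illustration (a fixed n/k values per bucket, independent hashes) and is not the formula from the slides. All function names are hypothetical, and the distinct counts `n_x` and `n_y` are assumed to be given.

```python
def binomial_mean(sketch_x, sketch_y):
    """Fraction of buckets i with s_X[i] <= s_Y[i]."""
    hits = sum(1 for sx, sy in zip(sketch_x, sketch_y) if sx <= sy)
    return hits / len(sketch_x)


def prob_x_le_y(inc, n_x, n_y, k, max_rank=64):
    """Simplified model of Pr[s_X[i] <= s_Y[i]] for one bucket (an assumption).

    s_X <= s_Y holds exactly when the values of X outside Y do not exceed s_Y,
    and Pr[max rank of n values <= t] = (1 - 2**-t) ** n.
    """
    m_y = n_y / k                  # values of Y per bucket
    m_b = (1.0 - inc) * n_x / k    # values of X \ Y per bucket
    total, prev_cdf_y = 0.0, 0.0
    for t in range(max_rank + 1):
        base = 1.0 - 2.0 ** (-t)
        cdf_y = base ** m_y
        total += (cdf_y - prev_cdf_y) * base ** m_b  # Pr[s_Y = t] * Pr[s_B <= t]
        prev_cdf_y = cdf_y
    return total


def lookup_inc(p_hat, n_x, n_y, k, tol=1e-6):
    """Invert the increasing map inc -> prob_x_le_y(inc, ...) by bisection."""
    lo, hi = 0.0, min(1.0, n_y / n_x)
    if p_hat <= prob_x_le_y(lo, n_x, n_y, k):
        return lo
    if p_hat >= prob_x_le_y(hi, n_x, n_y, k):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_x_le_y(mid, n_x, n_y, k) < p_hat:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Under this model the probability increases with INC, since a larger overlap leaves fewer values of X that could push its bucket above Y's, and that monotonicity is what makes the lookup well defined.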
Error Bound Estimation • Binomial Mean Estimation → Chernoff bound → Lookup
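The slide names a Chernoff bound between the binomial mean and the lookup. As a rough illustration only, the sketch below uses the additive Chernoff-Hoeffding form, Pr[|p_hat - p| >= eps] <= 2 exp(-2 k eps^2), which may differ from the exact bound in the talk. It also assumes the k bucket comparisons are independent. The function name and the default `delta` are hypothetical.

```python
import math


def binomial_mean_interval(p_hat, k, delta=0.05):
    """Interval that contains the true bucket probability with confidence 1 - delta.

    Uses the additive Chernoff-Hoeffding bound, assuming k independent buckets.
    """
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * k))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)


# With k = 1536 buckets the half-width is about 0.035 at 95% confidence.
low, high = binomial_mean_interval(0.80, k=1536)
print(low, high)
```

Passing both endpoints through the lookup step would turn this into an interval on INC whose width depends on where the estimate falls, which is one way to read "data-dependent bound".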
Experiments • Datasets
Experiments • Execution Time • k = 1536 • Average Error (TPCDS-300)
Experiments: Small vs. Large Columns
Experiments: Foreign Key Detection • Plug estimated INC into [Rostin et al., 2009] and [Chen et al., 2014]
Conclusion and Future Work • BML estimator for INC using HyperLogLog sketches • A maximum-likelihood estimation scheme • A novel error estimation method that produces a data-dependent bound • Better accuracy than Bottom-k, and can be used for FK detection • To be extended to other correlation measures
Thank You! Questions? Comments?