Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches
Azade Nazi, Bolin Ding, Vivek Narasayya, Surajit Chaudhuri

Inclusion Coefficient (INC)

Applications
• Data integration
• Relaxed containment measure
• Detecting data quality issues such as missing values
• Foreign-key and functional-dependency (FD) detection
• INC is an important feature in detection algorithms
• For example, [Rostin et al., 2009] and [Chen et al., 2014]

Exact vs. Approximate INC Computation
• Exact INC computation is expensive
• Requires a full scan/join of the data [Lopes et al., 2002]
• Too expensive for large database schemas and tables
• Worse if it has to be computed for all pairs of columns (foreign-key detection)
• Approximate (estimated) INC: sketch-based approaches
• Based on Bottom-k sketch [Cohen and Kaplan, 2007]: Jaccard -> INC [Zhang et al., 2010]
• This paper: better accuracy with error bounds (based on HyperLogLog sketch [Flajolet et al., 2007])
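
To make the cost of the exact approach concrete, here is a minimal sketch of exact INC computation. It assumes the usual definition of the inclusion coefficient of a column X in a column Y, namely the fraction of distinct values of X that also appear in Y, i.e. |X ∩ Y| / |X|. The function name `exact_inc` and the sample columns are illustrative and do not come from the slides.

```python
def exact_inc(x_values, y_values):
    """Exact inclusion coefficient of X in Y: |X ∩ Y| / |X| over distinct values.

    Both columns are fully scanned and materialized as sets, which is
    what makes the exact computation expensive on large tables.
    """
    x = set(x_values)
    y = set(y_values)
    if not x:
        raise ValueError("INC is undefined for an empty column X")
    return len(x & y) / len(x)


# Hypothetical columns: a candidate foreign key and the key it may reference.
orders_customer_id = [1, 2, 2, 3, 7]
customers_id = [1, 2, 3, 4, 5]

# 3 of the 4 distinct values in orders_customer_id appear in customers_id.
print(exact_inc(orders_customer_id, customers_id))  # 0.75
```

Note that the measure is asymmetric: swapping the two arguments generally gives a different value, which is why it works as a relaxed containment check rather than a similarity score.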

Problem Definition • … …

Our Approach: BML (Binomial Mean-Lookup)

Hyperloglog Sketch Construction (Recap)
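
As a recap in code form, the following is a minimal sketch of standard HyperLogLog construction and cardinality estimation as described in [Flajolet et al., 2007]. It is not taken from the slides: the function names, the choice of `hashlib.blake2b` as a 64-bit hash, and the default of 2^10 buckets are all assumptions made for illustration.

```python
import hashlib
import math


def hll_sketch(values, b=10):
    """Build a HyperLogLog sketch with m = 2**b buckets (registers)."""
    if b < 7:
        raise ValueError("this sketch uses the alpha constant valid for m >= 128")
    m = 1 << b
    width = 64 - b  # number of hash bits left after choosing the bucket
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value (hash function chosen for convenience only).
        digest = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        bucket = h >> width                 # first b bits select the bucket
        rest = h & ((1 << width) - 1)       # remaining bits
        # rho: 1-based position of the leftmost 1-bit in `rest`
        # (width + 1 if all remaining bits are zero).
        rho = width - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rho)
    return registers


def hll_estimate(registers):
    """Estimate the number of distinct values from a sketch."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)        # bias correction for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)      # small-range (linear counting) correction
    return raw


column = [f"value-{i}" for i in range(10_000)]
sketch = hll_sketch(column)
print(round(hll_estimate(sketch)))  # an estimate close to 10000
```

Each register keeps only a maximum, so duplicates in the column never change the sketch, and the sketch of a union of two columns is the element-wise maximum of their sketches.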

Key Ideas: Estimating INC from HLL Sketches

Binomial Mean-Lookup Estimator
• Binomial Mean Estimation
• Lookup

Error Bound Estimation
• Binomial Mean Estimation
• Chernoff bound
• Lookup
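
As a rough illustration of how a Chernoff-style bound constrains a binomial mean estimated from k independent trials, the sketch below uses the generic Hoeffding form: with probability at least 1 - δ, the empirical mean is within sqrt(ln(2/δ) / (2k)) of the true mean. This is a textbook bound and not necessarily the data-dependent bound derived in the paper. The function name is hypothetical, and k = 1536 is used only as an example value.

```python
import math


def hoeffding_half_width(k, delta):
    """Half-width t such that |empirical mean - true mean| <= t
    holds with probability at least 1 - delta over k independent
    Bernoulli trials (two-sided Hoeffding bound)."""
    if k <= 0 or not 0 < delta < 1:
        raise ValueError("need k > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2 / delta) / (2 * k))


# Example: 1536 trials at 95% confidence.
print(hoeffding_half_width(1536, 0.05))
```

The half-width shrinks proportionally to 1/sqrt(k), so quadrupling the number of trials halves the interval around the estimated mean.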

Experiments
• Datasets

Experiments: Execution Time and Average Error (TPCDS-300), k = 1536

Experiments: Small vs. Large Columns

Experiments: Foreign Key Detection
• Plug estimated INC into [Rostin et al., 2009] and [Chen et al., 2014]

Conclusion and Future Work
• BML estimator for INC using HyperLogLog sketches
• A maximum-likelihood estimation scheme
• A novel error estimation method that produces a data-dependent bound
• Better accuracy than Bottom-k; can be used for FK detection
• To be extended to other correlation measures

Thank You! Questions? Comments?
