Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches
- Slides: 26
Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches
Azade Nazi, Bolin Ding, Vivek Narasayya, Surajit Chaudhuri
Inclusion Coefficient (INC)
Applications
• Data integration
• Relaxed containment measure
• Detecting data quality issues such as missing values
• Foreign-key and functional-dependency (FD) detection
  • INC is an important feature in detection algorithms
  • For example, [Rostin et al., 2009] and [Chen et al., 2014]
Exact vs. Approximate INC Computation
• Exact INC computation is expensive
  • Requires a full scan/join of the data [Lopes et al., 2002]
  • Too expensive for large database schemas and tables
  • Worse still when computed for all pairs of columns (foreign-key detection)
• Approximate (estimated) INC: sketch-based approaches
  • Based on the Bottom-k sketch [Cohen and Kaplan, 2007]: estimate Jaccard similarity, then convert it to INC [Zhang et al., 2010]
  • This paper: better accuracy with error bounds, based on the HyperLogLog sketch [Flajolet et al., 2007]
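To make the exact baseline and the Jaccard-to-INC conversion concrete, here is a minimal sketch. The definition did not survive on the slide, so the code assumes the usual one, INC(X, Y) = |X ∩ Y| / |X| over distinct values. The function names and the sample columns are hypothetical.

```python
def exact_inc(x_values, y_values):
    """Exact INC(X, Y) = |X ∩ Y| / |X| over distinct values.

    Assumed definition; it needs a full pass over both columns.
    """
    x, y = set(x_values), set(y_values)
    if not x:
        raise ValueError("X has no values")
    return len(x & y) / len(x)


def inc_from_jaccard(jaccard, n_x, n_y):
    """Convert a Jaccard similarity J into INC, given distinct counts |X| and |Y|.

    Since |X ∪ Y| = |X| + |Y| - |X ∩ Y|, we get
    |X ∩ Y| = J * (|X| + |Y|) / (1 + J).
    """
    if n_x <= 0:
        raise ValueError("|X| must be positive")
    intersection = jaccard * (n_x + n_y) / (1.0 + jaccard)
    return intersection / n_x


# Hypothetical columns: a candidate foreign key X and a primary key Y.
orders_customer_id = [1, 2, 2, 3, 5]
customers_id = [1, 2, 3, 4]
inc = exact_inc(orders_customer_id, customers_id)
```

In this toy example three of the four distinct values of X appear in Y. The same value is recovered from the Jaccard similarity 3/5 together with the two distinct counts, which is the route a Bottom-k based estimator would take.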
Problem Definition • … …
Our Approach: BML (Binomial Mean-Lookup)
HyperLogLog Sketch Construction (Recap)
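The slide bodies for this recap were not recovered, so the following is only a minimal sketch of the standard construction from [Flajolet et al., 2007]: hash each value, use the leading bits to pick a bucket, and keep per bucket the largest position of the leftmost 1-bit seen in the remaining bits. The choice of SHA-1 truncated to 64 bits, the power-of-two bucket count, and the function name `hll_sketch` are assumptions for illustration, not details taken from the talk.

```python
import hashlib

HASH_BITS = 64  # assumption: use the first 64 bits of a SHA-1 digest


def hll_sketch(values, num_buckets=1024):
    """Build a HyperLogLog register array (one small integer per bucket)."""
    index_bits = num_buckets.bit_length() - 1
    if num_buckets < 2 or (1 << index_bits) != num_buckets:
        raise ValueError("num_buckets must be a power of two (assumed here)")
    rest_bits = HASH_BITS - index_bits
    registers = [0] * num_buckets
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h >> rest_bits                  # leading bits select the bucket
        rest = h & ((1 << rest_bits) - 1)        # remaining bits
        # 1-based position of the leftmost 1-bit; rest_bits + 1 if rest == 0.
        rank = rest_bits - rest.bit_length() + 1
        if rank > registers[bucket]:
            registers[bucket] = rank
    return registers


def merge(sketch_a, sketch_b):
    """Sketch of the union of two columns: bucket-wise maximum."""
    return [max(a, b) for a, b in zip(sketch_a, sketch_b)]
```

Duplicates and ordering do not affect the registers, and the bucket-wise maximum of two sketches equals the sketch of the union, which is why one sketch per column can be built once and reused for every column pair.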
Key Ideas: Estimating INC from HLL Sketches
Binomial Mean-Lookup Estimator • Binomial mean estimation • Lookup
Error Bound Estimation • Binomial mean estimation • Chernoff bound • Lookup
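The three steps named on these slides can be sketched as follows, under stated assumptions. The sketch assumes the per-bucket event is "X's register does not exceed Y's", that its probability is a non-decreasing function of INC, and that the interval comes from a Hoeffding (Chernoff-type) bound. The slides do not give that probability function, so it is passed in as the hypothetical parameter `prob_of_inc`; all function names here are illustrative.

```python
import math


def binomial_mean(sketch_x, sketch_y):
    """Fraction of buckets where X's register is <= Y's (assumed event)."""
    if len(sketch_x) != len(sketch_y) or not sketch_x:
        raise ValueError("sketches must be non-empty and the same size")
    hits = sum(1 for sx, sy in zip(sketch_x, sketch_y) if sx <= sy)
    return hits / len(sketch_x)


def hoeffding_radius(num_buckets, delta=0.05):
    """Half-width eps with P(|p_hat - p| > eps) <= delta for num_buckets trials."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * num_buckets))


def lookup(prob_of_inc, target, tol=1e-6):
    """Bisection for inc in [0, 1] with prob_of_inc(inc) close to target.

    Assumes prob_of_inc is non-decreasing; out-of-range targets are clamped.
    """
    lo, hi = 0.0, 1.0
    if target <= prob_of_inc(lo):
        return lo
    if target >= prob_of_inc(hi):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_of_inc(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def bml_estimate(sketch_x, sketch_y, prob_of_inc, delta=0.05):
    """Return (estimate, lower, upper) for INC from two equal-size sketches."""
    p_hat = binomial_mean(sketch_x, sketch_y)
    eps = hoeffding_radius(len(sketch_x), delta)
    estimate = lookup(prob_of_inc, p_hat)
    lower = lookup(prob_of_inc, max(0.0, p_hat - eps))
    upper = lookup(prob_of_inc, min(1.0, p_hat + eps))
    return estimate, lower, upper
```

Because the lookup function is monotone, mapping the endpoints of the interval around the binomial mean through the same lookup gives an interval for INC. With few buckets the interval is wide; it shrinks at a rate of roughly one over the square root of the number of buckets.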
Experiments • Datasets
Experiments • Execution Time • k = 1536 • Average Error (TPCDS-300)
Experiments: Small vs. Large Columns
Experiments: Foreign Key Detection • Plug estimated INC into [Rostin et al., 2009] and [Chen et al., 2014]
Conclusion and Future Work
• BML estimator for INC using HyperLogLog sketches
  • A maximum-likelihood estimation scheme
  • A novel error estimation method that produces a data-dependent bound
• Better accuracy than Bottom-k; can be used for FK detection
• To be extended to other correlation measures
Thank You! Questions? Comments?