Preserving Statistical Validity in Adaptive Data Analysis
- Slides: 21
Preserving Statistical Validity in Adaptive Data Analysis
Vitaly Feldman (IBM Research - Almaden), Cynthia Dwork (Microsoft Res.), Moritz Hardt (Google Res.), Toni Pitassi (U. of Toronto), Omer Reingold (Samsung Res.), Aaron Roth (Penn, CS)
[Figure: Analysis -> Findings: parameter estimates, correlations, predictive model, classifier, clustering, etc.]
Data Science 101
Does student nutrition affect academic performance?
[Figure: data table; labels: 50, 100, "Normalized grade"]
Check correlations
[Figure: bar chart "Correlations with grade" for variables 1-50; vertical axis from -0.4 to 0.3]
Pick candidate foods
[Figure: the same "Correlations with grade" chart, variables 1-50; vertical axis from -0.4 to 0.3]
Fit linear function of 3 selected foods
[Figure: scatter plot "True vs Predicted Grade"]

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.4453533
R Square             0.1983396
Adjusted R Square    0.1732877
Standard Error       1.0041891
Observations         100

ANOVA
              df    SS              MS          F           Significance F
Regression    3     23.95086544     7.983622    7.917151    8.98706E-05
Residual      96    96.80600126     1.008396
Total         99    120.7568667

              Coefficients    Standard Error    t Stat      P-value
Intercept     -0.044248       0.100545016       -0.44008    0.660868
Mushroom      -0.296074       0.10193011        -2.90468    0.004563
Pumpkin       0.255769        0.108443069       2.358555    0.020373
Nutella       0.2671363       0.095186165       2.806462    0.006066

FALSE DISCOVERY

Freedman’s Paradox: “Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.” (1983)
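The selection effect behind the false discovery above can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the analysis from the slides: grades and all food variables are independent Gaussian noise, the sizes (100 students, 50 foods, 3 selected) are chosen to mirror the figures on the slides, and the function names `pearson`, `noise_dataset`, and `select_then_retest` are hypothetical. It picks the foods with the largest absolute correlation on one dataset, then re-measures the same foods on fresh data.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def noise_dataset(rng, n_students, n_foods):
    """Grades and food columns drawn independently: no real effect exists."""
    grades = [rng.gauss(0, 1) for _ in range(n_students)]
    foods = [[rng.gauss(0, 1) for _ in range(n_students)]
             for _ in range(n_foods)]
    return grades, foods


def select_then_retest(seed, n_students=100, n_foods=50, k=3):
    """Select k foods adaptively, then re-measure them on fresh data."""
    rng = random.Random(seed)
    grades, foods = noise_dataset(rng, n_students, n_foods)
    corrs = [pearson(col, grades) for col in foods]

    # Adaptive step: choose the k foods that look best on this very dataset.
    chosen = sorted(range(n_foods), key=lambda j: abs(corrs[j]),
                    reverse=True)[:k]
    in_sample = sum(abs(corrs[j]) for j in chosen) / k

    # Fresh, independent data for the same food indices.
    fresh_grades, fresh_foods = noise_dataset(rng, n_students, n_foods)
    fresh = sum(abs(pearson(fresh_foods[j], fresh_grades))
                for j in chosen) / k
    return chosen, in_sample, fresh


if __name__ == "__main__":
    chosen, in_sample, fresh = select_then_retest(seed=0)
    print("selected foods:", chosen)
    print("mean |corr| on the data used for selection:", round(in_sample, 3))
    print("mean |corr| on fresh data:", round(fresh, 3))
```

Because the data is pure noise, any apparent correlation of the selected foods is an artifact of choosing them after looking at the data; on fresh data the same foods typically show a much smaller correlation.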
Statistical inference
[Figure: “Fresh” data -> Procedure (hypothesis tests, regression, learning) -> Result and statistical guarantees (p-values, confidence intervals, prediction intervals)]
Data analysis is adaptive
• Exploratory data analysis
• Variable selection
• Hyper-parameter tuning
• Shared data - findings inform others
[Figure: labels Result, Data, Result]
Is this a real problem?
“Why Most Published Research Findings Are False” [Ioannidis 05]: 1,000+ downloads; 1400+ citations
“Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman, Cockburn, Simcoe 15]
Adaptive data analysis is one of the causes:
“In the course of collecting and analyzing data, researchers have many decisions to make […] It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance”, and to then report only what “worked”.” [Simmons, Nelson, Simonsohn 11]
Evaluating adaptive queries
[Figure: data analyst(s) interacting with a statistical query oracle [Kearns 93]]
Can measure correlations, moments, accuracy/error, parameters and run any SQ-based algorithm!
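A statistical query oracle can be sketched in a few lines. The formulas on this slide did not survive extraction, so the sketch assumes the standard statistical query setup: a query is a function phi mapping a data point to [0, 1], and the oracle returns an estimate of its expectation. Here the estimate is simply the empirical mean over a fixed sample; the class name `EmpiricalSQOracle` and the example data are hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


class EmpiricalSQOracle:
    """Answers each statistical query with its empirical mean on a sample."""

    def __init__(self, sample: Iterable[T]) -> None:
        self._sample: List[T] = list(sample)
        if not self._sample:
            raise ValueError("sample must be non-empty")

    def answer(self, phi: Callable[[T], float]) -> float:
        """Return the average of phi over the sample; phi must map to [0, 1]."""
        values = [phi(x) for x in self._sample]
        if any(v < 0.0 or v > 1.0 for v in values):
            raise ValueError("a statistical query must take values in [0, 1]")
        return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical sample of (prediction, label) pairs.
    oracle = EmpiricalSQOracle([(0, 0), (1, 1), (1, 0), (0, 0)])
    # Accuracy is a statistical query: phi is 1 when prediction equals label.
    print(oracle.answer(lambda pair: 1.0 if pair[0] == pair[1] else 0.0))
```

An adaptive analyst chooses each new phi after seeing earlier answers, which is where answering with the plain empirical mean can stop reflecting the underlying distribution.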
Answering non-adaptive SQs
Answering adaptive SQs
Our results
Tool: differential privacy [… 06]
[Figure: Algorithm A run on dataset S; labels: Cynthia, Frank, Aaron, Chris, Kobbi, Adam; annotation: ratio bounded]
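The slide's formula is lost, so the following assumes the standard definition: a randomized algorithm is epsilon-differentially private if, for any two datasets differing in one element, the probability of any set of outputs changes by a factor of at most e^epsilon. As a minimal sketch of one well-known way to achieve this for a single query (not necessarily the mechanism used in the talk), the Laplace mechanism below answers the mean of n values in [0, 1]. Changing one value moves the mean by at most 1/n, so Laplace noise of scale 1/(n * epsilon) suffices. The function names are hypothetical.

```python
import random


def laplace_noise(rng: random.Random, scale: float) -> float:
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_mean(values, epsilon: float, rng: random.Random) -> float:
    """Epsilon-DP estimate of the mean of values that lie in [0, 1]."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    if any(v < 0.0 or v > 1.0 for v in values):
        raise ValueError("values must lie in [0, 1]")
    true_mean = sum(values) / n
    # Sensitivity of the mean is 1/n, so the noise scale is 1/(n * epsilon).
    return true_mean + laplace_noise(rng, 1.0 / (n * epsilon))


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.random() for _ in range(1000)]
    print("noisy mean:", private_mean(data, epsilon=0.5, rng=rng))
```

The noisy answer can fall slightly outside [0, 1]; clipping it afterwards is post-processing and does not weaken the privacy guarantee.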
Why DP?
• DP composes adaptively [Figure: algorithms A and B run in sequence]
• DP implies generalization
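Adaptive composition can be illustrated with simple budget bookkeeping. The sketch assumes basic composition only: if each step is epsilon_i-differentially private, and later steps may depend on earlier outputs, the whole sequence is (sum of epsilon_i)-differentially private. Tighter composition bounds exist and are not shown. The class name `PrivacyAccountant` is hypothetical.

```python
class PrivacyAccountant:
    """Tracks cumulative epsilon under basic (additive) composition."""

    def __init__(self, total_epsilon: float) -> None:
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.spent

    def spend(self, epsilon: float) -> None:
        """Charge one epsilon-DP step; refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance guards against floating-point rounding in the sum.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


if __name__ == "__main__":
    accountant = PrivacyAccountant(total_epsilon=1.0)
    for step in range(4):
        # Each adaptively chosen query is charged before it is answered.
        accountant.spend(0.25)
        print("after step", step + 1, "remaining budget:", accountant.remaining)
```

Charging the budget before answering means the analyst can choose queries freely, and the overall guarantee still holds for the full interaction.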
Back to queries
Further developments