Preserving Statistical Validity in Adaptive Data Analysis

  • Slides: 21

Vitaly Feldman (IBM Research - Almaden)
Cynthia Dwork (Microsoft Res.)
Moritz Hardt (Google Res.)
Toni Pitassi (U. of Toronto)
Omer Reingold (Samsung Res.)
Aaron Roth (Penn, CS)

[Diagram: Analysis produces Findings]
Findings:
  • Param. estimates
  • Correlations
  • Predictive model
  • Classifier, Clustering
  • etc.

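Data Science 101
Does student nutrition affect academic performance?
[Figure: data table with dimensions labeled 50 and 100 (apparently 50 foods and 100 students), plus a "Normalized grade" column]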

Check correlations
[Figure: bar chart "Correlations with grade"; horizontal axis: foods 1 to 50; vertical axis: correlation, from -0.4 to 0.3]

Pick candidate foods
[Figure: bar chart "Correlations with grade"; horizontal axis: foods 1 to 50; vertical axis: correlation, from -0.4 to 0.3]

Fit linear function of 3 selected foods

[Figure: scatter plot "True vs Predicted Grade"; horizontal axis from -4 to 4, vertical axis from -1.5 to 1.5]

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.4453533
  R Square            0.1983396
  Adjusted R Square   0.1732877
  Standard Error      1.0041891
  Observations        100

ANOVA
              df   SS             MS         F          Significance F
  Regression   3    23.95086544   7.983622   7.917151   8.98706E-05
  Residual    96    96.80600126   1.008396
  Total       99   120.7568667

              Coefficients   Standard Error   t Stat      P-value
  Intercept   -0.044248      0.100545016      -0.44008    0.660868
  Mushroom    -0.296074      0.10193011       -2.90468    0.004563
  Pumpkin      0.255769      0.108443069       2.358555   0.020373
  Nutella      0.2671363     0.095186165       2.806462   0.006066

FALSE DISCOVERY

Freedman’s Paradox: “Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.” (1983)
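The three steps above (check correlations, pick candidate foods, fit a linear function) can be replayed on pure noise to see how large the effect is. The following is a minimal sketch, not the slides' own experiment: it assumes 100 students and 50 foods drawn as independent standard normals, a grade that is independent of every food, and a hypothetical helper named freedman_trial. The 5% critical value 2.70 for F(3, 96) is an approximation.

```python
import numpy as np

def freedman_trial(rng, n_students=100, n_foods=50, n_keep=3):
    """One run: select foods by correlation, then fit on the SAME data."""
    X = rng.standard_normal((n_students, n_foods))  # food intake, pure noise
    y = rng.standard_normal(n_students)             # grade, independent of X

    # Step 1: Pearson correlation of every food with the grade.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    corr = Xc.T @ yc / n_students

    # Step 2: keep the foods with the largest absolute correlation.
    keep = np.argsort(-np.abs(corr))[:n_keep]

    # Step 3: ordinary least squares on the selected foods plus an intercept.
    A = np.column_stack([np.ones(n_students), X[:, keep]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot

    # F statistic with (n_keep, n_students - n_keep - 1) degrees of freedom.
    f_stat = (r2 / n_keep) / ((1.0 - r2) / (n_students - n_keep - 1))
    return keep, r2, f_stat

rng = np.random.default_rng(0)
f_stats = np.array([freedman_trial(rng)[2] for _ in range(200)])

F_CRIT = 2.70  # approximate 5% critical value of F(3, 96)
frac_significant = float((f_stats > F_CRIT).mean())
print(f"pure-noise runs with F > {F_CRIT}: {frac_significant:.2f}")
```

If the test were valid, about 5% of the runs would exceed the critical value. Selecting the foods on the same data that is used for the fit pushes that fraction far higher, even though no food has any real effect.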

Statistical inference
  • Data: “fresh” data
  • Procedure: hypothesis tests, regression, learning
  • Result and statistical guarantees: p-values, confidence intervals, prediction intervals

Data analysis is adaptive
  • Exploratory data analysis
  • Variable selection
  • Hyper-parameter tuning
  • Shared data - findings inform others
[Diagram: Data, Result, Result]

Is this a real problem?
  • “Why Most Published Research Findings Are False” [Ioannidis 05]: 1,000+ downloads; 1400+ citations
  • “Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman, Cockburn, Simcoe 15]
  • Adaptive data analysis is one of the causes:
    In the course of collecting and analyzing data, researchers have many decisions to make […] It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance”, and to then report only what “worked”. [Simmons, Nelson, Simonsohn 11]

Evaluating adaptive queries
[Diagram: data analyst(s) and a statistical query oracle [Kearns 93]]
Can measure correlations, moments, accuracy/error, parameters and run any SQ-based algorithm!
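A minimal sketch of such an oracle follows. It assumes the standard formulation from [Kearns 93], where a statistical query is a function phi mapping a data point to [0, 1] and a valid answer lies within a tolerance tau of the expectation of phi over the unknown distribution. The class name EmpiricalSQOracle and the two toy queries are illustrative and do not come from the slides. This oracle simply returns the sample mean.

```python
import numpy as np

class EmpiricalSQOracle:
    """Answers a statistical query with the empirical mean over a fixed sample."""

    def __init__(self, sample):
        self.sample = np.asarray(sample)

    def answer(self, phi):
        # Evaluate the query on every data point in the sample.
        values = np.array([phi(x) for x in self.sample], dtype=float)
        if values.min() < 0.0 or values.max() > 1.0:
            raise ValueError("a statistical query must map into [0, 1]")
        return float(values.mean())

# Toy data: 1000 points with two features, each uniform on [0, 1].
rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=(1000, 2))
oracle = EmpiricalSQOracle(sample)

# A first moment and a joint event, both expressed as [0, 1]-valued queries.
first_moment = oracle.answer(lambda x: x[0])
both_high = oracle.answer(lambda x: float(x[0] > 0.5 and x[1] > 0.5))
print(first_moment, both_high)
```

Correlations, error rates and parameter estimates can all be phrased as queries of this form, which is why an analyst limited to this interface can still run any SQ-based algorithm.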

Answering non-adaptive SQs
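For queries fixed before the data is seen, empirical means are enough. The sketch below assumes the textbook argument (Hoeffding's inequality for one [0, 1]-valued query, then a union bound over m queries), which gives n >= ln(2m / beta) / (2 tau^2) samples for all m empirical answers to be within tau of the truth with probability at least 1 - beta. The function name and the default beta are hypothetical.

```python
import math

def samples_for_nonadaptive_sqs(m, tau, beta=0.05):
    """Sample size so that m fixed queries are all tau-accurate w.p. >= 1 - beta.

    Hoeffding: P(|empirical mean - expectation| > tau) <= 2 * exp(-2 * n * tau**2)
    for one query; a union bound over m queries multiplies this by m.
    """
    if m < 1 or not (0.0 < tau < 1.0) or not (0.0 < beta < 1.0):
        raise ValueError("need m >= 1, 0 < tau < 1 and 0 < beta < 1")
    return math.ceil(math.log(2 * m / beta) / (2 * tau ** 2))

for m in (1, 1_000, 1_000_000):
    print(m, samples_for_nonadaptive_sqs(m, tau=0.1))
```

The required sample size grows only logarithmically in the number of queries. The union bound relies on the queries being chosen independently of the sample, which is exactly what adaptivity breaks.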

Answering adaptive SQs

Our results

Tool: differential privacy
[Diagram: DATA]

06]
[Diagram: dataset S with rows Cynthia, Frank, Aaron, Chris, Kobbi, Adam; Algorithm; ratio bounded]
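The "ratio bounded" label matches the usual definition of differential privacy: an algorithm M is epsilon-differentially private if, for any two datasets S and S' that differ in one element and any set of outputs Z, Pr[M(S) in Z] <= e^epsilon * Pr[M(S') in Z]. As a minimal sketch of one standard way to answer a statistical query under this definition, the code below uses the Laplace mechanism. It assumes a [0, 1]-valued query, so replacing one of n data points changes the empirical mean by at most 1/n. The function name laplace_sq_answer is hypothetical.

```python
import numpy as np

def laplace_sq_answer(sample, phi, epsilon, rng):
    """Empirical mean of phi plus Laplace noise, giving epsilon-DP for [0, 1] queries."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    values = np.array([phi(x) for x in sample], dtype=float)
    if values.min() < 0.0 or values.max() > 1.0:
        raise ValueError("phi must map into [0, 1]")
    n = len(values)
    sensitivity = 1.0 / n  # max change of the mean when one point is replaced
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(values.mean() + noise)

rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=10_000)
noisy_mean = laplace_sq_answer(sample, lambda x: x, epsilon=1.0, rng=rng)
print(noisy_mean)
```

The noise scale is 1 / (n * epsilon), so for large samples the privacy noise is small compared with the sampling error of the mean itself.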

Why DP?
  • DP composes adaptively
[Diagram: A and B composed]

Why DP?
  • DP composes adaptively
  • DP implies generalization
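The composition property can be made concrete with a small budget calculation. This sketch assumes the basic composition theorem (running k mechanisms, each epsilon_i-differentially private and each possibly chosen after seeing earlier outputs, is differentially private with parameter equal to the sum of the epsilon_i) and Laplace noise on [0, 1]-valued queries over n samples. The function name and its parameters are hypothetical, and tighter advanced composition bounds exist.

```python
def laplace_scale_per_query(n, k, total_epsilon):
    """Split a privacy budget evenly over k adaptive queries.

    Returns (per-query epsilon, Laplace noise scale) for [0, 1]-valued
    queries on n samples, where the empirical mean has sensitivity 1 / n.
    """
    if n < 1 or k < 1 or total_epsilon <= 0.0:
        raise ValueError("need n >= 1, k >= 1 and total_epsilon > 0")
    per_query_epsilon = total_epsilon / k   # basic composition: epsilons add up
    scale = 1.0 / (n * per_query_epsilon)   # Laplace scale = sensitivity / epsilon
    return per_query_epsilon, scale

eps_each, scale_each = laplace_scale_per_query(n=1000, k=10, total_epsilon=1.0)
print(eps_each, scale_each)
```

Because the overall guarantee holds no matter how later queries depend on earlier answers, the whole adaptive interaction can be treated as a single differentially private algorithm.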

Back to queries

Further developments