Understanding Generalization in Adaptive Data Analysis
Vitaly (the West Coast) Feldman

Overview
• Adaptive data analysis
  o Motivation
  o Framework
  o Basic results
  With Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15]
• New results (with Thomas Steinke)
• Open problems

[Diagram labels: Data, Analysis, Results]

Statistical inference
Data
Theory: Concentration/CLT, model complexity, Rademacher complexity, stability, online-to-batch

Data analysis is adaptive
Steps depend on previous analyses of the same dataset:
Data pre-processing
Exploratory data analysis
Feature selection
Model stacking
Hyper-parameter tuning
Shared datasets
…
[Diagram label: Data analyst(s)]

Thou shalt not test hypotheses suggested by data
“Quiet scandal of statistics” [Leo Breiman, 1992]

Reproducibility crisis?
“Why Most Published Research Findings Are False” [Ioannidis 2005]
“Irreproducible preclinical research exceeds 50%, resulting in approximately US$28 B/year loss” [Freedman, Cockburn, Simcoe 2015]

Existing approaches
• Sample splitting
• Selective inference
  o Model selection + parameter estimation
  o Variable selection + regression
• Pre-registration
© Center for Open Science

Adaptive data analysis [DFHPRR 14]
[Diagram labels: Data analyst(s), Algorithm]

Adaptive statistical queries
[Diagram labels: Data analyst(s), Statistical query oracle [Kearns 93]]
Can measure correlations, moments, accuracy/loss
Run any statistical query algorithm
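To make the statistical query interface concrete, here is a minimal sketch of an oracle that answers a query phi: X -> [0, 1] with its empirical mean over the dataset. The function name `empirical_sq_oracle`, the synthetic data, and the example queries are illustrative assumptions and do not come from the slides.

```python
import random
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")


def empirical_sq_oracle(dataset: Sequence[X]) -> Callable[[Callable[[X], float]], float]:
    """Return an oracle that answers a statistical query phi: X -> [0, 1]
    with the empirical mean of phi over the dataset (hypothetical helper)."""

    def answer(phi: Callable[[X], float]) -> float:
        values = [phi(x) for x in dataset]
        if any(v < 0.0 or v > 1.0 for v in values):
            raise ValueError("statistical queries must map into [0, 1]")
        return sum(values) / len(values)

    return answer


# Synthetic (feature, label) pairs, used only for illustration.
rng = random.Random(0)
data = [(rng.random(), rng.randint(0, 1)) for _ in range(1000)]
oracle = empirical_sq_oracle(data)

# A moment: the mean of the feature.
mean_feature = oracle(lambda z: z[0])
# Accuracy of a fixed classifier h(x) = 1[x > 0.5].
accuracy = oracle(lambda z: float((z[0] > 0.5) == (z[1] == 1)))
```

An adaptive analyst would choose each new `phi` after seeing the answers to the previous ones, which is exactly the setting where the plain empirical mean can stop generalizing.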

Answering non-adaptive SQs
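The body of this slide did not survive extraction. As background only, the standard calculation for queries fixed in advance combines Hoeffding's inequality with a union bound: empirical means answer k queries within tolerance tau, except with probability beta, once 2k·exp(-2n·tau²) ≤ beta. The sketch below computes that sample size; the function name and the example parameters are assumptions, and this is not claimed to be the slide's own statement.

```python
import math


def samples_for_nonadaptive_sqs(num_queries: int, tolerance: float, failure_prob: float) -> int:
    """Smallest n with 2 * k * exp(-2 * n * tau^2) <= beta
    (Hoeffding's inequality plus a union bound over k fixed queries)."""
    return math.ceil(math.log(2.0 * num_queries / failure_prob) / (2.0 * tolerance ** 2))


# Illustrative parameters: 1000 fixed queries, tolerance 0.05, failure probability 0.05.
n_needed = samples_for_nonadaptive_sqs(num_queries=1000, tolerance=0.05, failure_prob=0.05)
# The dependence on the number of queries is only logarithmic.
n_million = samples_for_nonadaptive_sqs(num_queries=10**6, tolerance=0.05, failure_prob=0.05)
```

This guarantee relies on the queries being chosen independently of the data; it does not carry over to queries chosen after seeing earlier answers.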

Answering adaptively-chosen SQs

Answering adaptive SQs

Value perturbation
• Laplace/Gaussian
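Value perturbation can be sketched as answering each query with its empirical mean plus independent Laplace or Gaussian noise. The helper names and noise scales below are illustrative assumptions. The one calibration noted in the comments is the standard one: the mean of a [0, 1]-valued query over n points changes by at most 1/n when one point is replaced, so Laplace noise of scale 1/(n·epsilon) gives epsilon-differential privacy for a single answer.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_scale_for_dp(n: int, epsilon: float) -> float:
    # Sensitivity of the empirical mean of a [0, 1]-valued query is 1/n.
    return 1.0 / (n * epsilon)


def answer_with_laplace(dataset, phi, noise_scale, rng):
    """Empirical mean of phi plus Laplace noise of the given scale."""
    mean = sum(phi(x) for x in dataset) / len(dataset)
    return mean + laplace_noise(noise_scale, rng)


def answer_with_gaussian(dataset, phi, sigma, rng):
    """Empirical mean of phi plus Gaussian noise with standard deviation sigma."""
    mean = sum(phi(x) for x in dataset) / len(dataset)
    return mean + rng.gauss(0.0, sigma)


# Illustrative use on synthetic data with true empirical mean 0.5.
rng = random.Random(0)
data = [0.0, 1.0] * 500
scale = laplace_scale_for_dp(len(data), epsilon=1.0)
noisy_laplace = answer_with_laplace(data, lambda x: x, scale, rng)
noisy_gaussian = answer_with_gaussian(data, lambda x: x, 0.001, rng)
```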

Differential privacy [Dwork, McSherry, Nissim, Smith 06]
DP implies generalization
Differential privacy is stability

Differential privacy [DMNS 06]
DP implies generalization
Differential privacy limits information learned about the dataset

Differential privacy [DMNS 06]
DP implies generalization
DP composes adaptively
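Adaptive composition is what lets a privacy guarantee for one answer extend to a whole interactive analysis. The sketch below evaluates two standard bounds for k adaptively chosen (epsilon, delta)-DP steps: basic composition, which gives (k·epsilon, k·delta)-DP, and the advanced composition theorem of Dwork, Rothblum and Vadhan, which gives (epsilon', k·delta + delta')-DP with epsilon' as computed below. The function names and the example numbers are assumptions; the slides' own quantitative statements were not recoverable.

```python
import math


def basic_composition(epsilon: float, k: int) -> float:
    """Total epsilon after k adaptive epsilon-DP steps (basic composition)."""
    return k * epsilon


def advanced_composition(epsilon: float, k: int, delta_prime: float) -> float:
    """Total epsilon' after k adaptive (epsilon, delta)-DP steps, at the cost
    of an extra additive delta_prime in the delta term."""
    return (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * epsilon
            + k * epsilon * (math.exp(epsilon) - 1.0))


# Illustrative numbers: 1000 steps at epsilon = 0.01 each.
eps_basic = basic_composition(0.01, 1000)
eps_advanced = advanced_composition(0.01, 1000, delta_prime=1e-6)
```

For many small-epsilon steps the advanced bound grows roughly like the square root of k rather than linearly in k.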

Value perturbation [DMNS 06]

Beyond low-sensitivity

Stable Median

Median algorithms
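The details of these slides were lost in extraction. As a hedged illustration of the general idea behind a stable median, the sketch below splits the dataset into disjoint chunks, evaluates an arbitrary real-valued estimator on each chunk, and then selects an approximate median of the chunk estimates with the exponential mechanism over a finite candidate grid. The choice of the exponential mechanism, the grid, and all names are assumptions made for illustration, not necessarily the algorithms presented in the talk.

```python
import math
import random


def private_median(values, candidates, epsilon, rng):
    """Exponential mechanism for an approximate median.
    Score of candidate c is -|#{v <= c} - m/2|, which changes by at most 1
    when a single value is replaced, so weights exp(epsilon * score / 2)
    give epsilon-DP with respect to the list of values."""
    m = len(values)
    scores = [-abs(sum(1 for v in values if v <= c) - m / 2.0) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / 2.0) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def stable_median_estimate(dataset, estimator, num_chunks, candidates, epsilon, rng):
    """Run `estimator` on disjoint chunks and return a private approximate median.
    Replacing one data point affects only one chunk, hence only one estimate."""
    size = len(dataset) // num_chunks
    estimates = [estimator(dataset[i * size:(i + 1) * size]) for i in range(num_chunks)]
    return private_median(estimates, candidates, epsilon, rng)


# Illustrative use: estimate the mean of synthetic Gaussian data centered at 3.
rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(5000)]
grid = [i / 10 for i in range(61)]  # candidate outputs 0.0, 0.1, ..., 6.0
estimate = stable_median_estimate(data, lambda c: sum(c) / len(c), 50, grid, 1.0, rng)
```

The estimator is treated as a black box, which is why this style of approach can handle real-valued analyses that are not low-sensitivity.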

Limits

ML practice
[Diagram labels: Data, Data, Validation, Training, Testing, XGBoost, SVRG, Tensorflow]

Reusable holdout [DFHPRR 15]
[Diagram labels: Data, Data, AI guru, Reusable holdout algorithm]

Reusable holdout
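A reusable holdout can be sketched in the spirit of the Thresholdout algorithm from [DFHPRR 15]: a query is answered from the training set unless its training and holdout means disagree by more than a noisy threshold, in which case a noisy holdout value is returned and a budget is decremented. The class name, the parameter defaults, and the synthetic data below are illustrative assumptions, and this is a simplified sketch rather than a faithful reimplementation.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_of(phi, data):
    return sum(phi(x) for x in data) / len(data)


class ThresholdoutSketch:
    """Simplified reusable-holdout sketch (parameter defaults are assumptions)."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.001, budget=10, seed=0):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma, self.budget = threshold, sigma, budget
        self.rng = random.Random(seed)
        self.noisy_threshold = threshold + laplace_noise(2 * sigma, self.rng)

    def query(self, phi):
        if self.budget < 1:
            return None  # budget exhausted: refuse to answer
        train_mean = mean_of(phi, self.train)
        holdout_mean = mean_of(phi, self.holdout)
        eta = laplace_noise(4 * self.sigma, self.rng)
        if abs(holdout_mean - train_mean) > self.noisy_threshold + eta:
            # Overfitting detected: spend budget, refresh the noisy threshold,
            # and return a noisy holdout value.
            self.budget -= 1
            self.noisy_threshold = self.threshold + laplace_noise(2 * self.sigma, self.rng)
            return holdout_mean + laplace_noise(self.sigma, self.rng)
        return train_mean  # training estimate already agrees with the holdout


# Synthetic data: the tag marks which split a point belongs to.
train = [("t", i % 2) for i in range(1000)]
holdout = [("h", i % 2) for i in range(1000)]
th = ThresholdoutSketch(train, holdout, budget=1)
agreeing = th.query(lambda z: float(z[1]))                    # same mean on both splits
overfit = th.query(lambda z: 1.0 if z[0] == "t" else 0.0)     # 1 on train, 0 on holdout
exhausted = th.query(lambda z: float(z[1]))                   # budget is now spent
```

Only queries that overfit the training set consume budget, which is what allows the holdout to be reused across many adaptive queries.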

Conclusions
• Datasets are reused adaptively
• New conceptual framework
• Deep connections to DP
  o Privacy and generalization are aligned
  o Data “freshness” is a limited resource
  ✓ Real-valued analyses (without any assumptions)
• Going beyond adversarial adaptivity
  o Connections to stability and selective inference
• Using these techniques in practice