An overview of methods for treating selectivity in big data sources
ESS Big Data Workshop 2016

Methodological study
  • Purpose:
    • Identification of methods for treating selectivity in big data sources
    • Initially to serve internal needs
  • Methodologists:
    • Maciej Beręsewicz (Poznan University of Economics, Poznan Statistical Office)
    • Risto Lehtonen (University of Helsinki)
  • Study managed by SOGETI

What's big data
  • Addressing big data from a methodological point of view
  • Defining big data:
    • Non-probabilistic nature
    • Vagueness of population
    • Big data as an opt-in panel survey
    • Similarities with internet surveys
  • Big data survey = opt-in internet panel survey
    • Opt-in -> self-selection from non-response research
    • Internet -> multiple frames

Dealing with self-selection error
  • Unit identification problem and measurement error
    • To be addressed before self-selection is corrected
  • Sources of bias
    • Big data source specific selectivity
    • Unit specific selectivity (self-selection)
  • Imputation of latent target and auxiliary variables
    • Mass imputation is a relevant approach
  • Re-sampling
    • For estimation: block bootstrap
    • For model selection: cross-validation
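
The block bootstrap named under re-sampling can be illustrated with a minimal sketch. It assumes a moving-block variant applied to a time-ordered series; the slides do not say which variant the study recommends. The function name, the block length and the data values are hypothetical and chosen only for illustration.

```python
import random
import statistics

def block_bootstrap_means(series, block_len, n_boot, seed=0):
    """Moving-block bootstrap: resample overlapping blocks of consecutive
    observations (to preserve short-range dependence) and return the
    mean of each bootstrap replicate."""
    rng = random.Random(seed)
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    starts = range(n - block_len + 1)   # start index of every overlapping block
    n_blocks = -(-n // block_len)       # ceil(n / block_len)
    means = []
    for _ in range(n_boot):
        resampled = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            resampled.extend(series[s:s + block_len])
        resampled = resampled[:n]       # trim to the original length
        means.append(statistics.fmean(resampled))
    return means

# Hypothetical daily counts from a big data source (illustrative values only).
daily_counts = [12, 15, 14, 18, 21, 19, 17, 16, 20, 23, 22, 25]
boot_means = block_bootstrap_means(daily_counts, block_len=3, n_boot=500)
se = statistics.stdev(boot_means)       # bootstrap standard error of the mean
print(f"bootstrap SE of the mean: {se:.3f}")
```

Resampling whole blocks rather than single observations is what distinguishes this from the ordinary bootstrap: it keeps neighbouring observations together, which matters when the source produces serially dependent data.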

Methods for correcting selectivity
  • Unit level approach
    • At individual level
  • Domain level approach
    • At aggregated level

Unit level methods of selectivity correction (1)
  • Reweighting
    • Methods that account for existing information about auxiliary variables -> if these are correlated with the selectivity mechanism, they will correct it
      • Generalized weight share method
      • Calibration (model-free and model-assisted)
      • Pseudo-empirical likelihood
    • Methods that directly address the selectivity mechanism
      • Propensity weighting
      • Two-stage weighting method
      • Lepkowski method (for under-coverage and self-selection)
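
As a rough illustration of reweighting with auxiliary information, the sketch below uses post-stratification, which can be seen as the simplest special case of calibration. It is not taken from the study. The age groups, population totals and target values are hypothetical, and the example assumes the auxiliary variable is known for every unit and its population totals are available from a register or census.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_totals):
    """Give each unit the weight N_g / n_g, so that the weighted sample
    reproduces the known population total of every group g."""
    counts = Counter(sample_groups)
    missing = set(population_totals) - set(counts)
    if missing:
        raise ValueError(f"no sample units in groups: {sorted(missing)}")
    return [population_totals[g] / counts[g] for g in sample_groups]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical big data sample that over-represents young units.
groups = ["young"] * 6 + ["old"] * 2
y = [10, 10, 10, 10, 10, 10, 20, 20]          # target variable (illustrative)
population_totals = {"young": 500, "old": 500}  # assumed known from a register

weights = poststratification_weights(groups, population_totals)
naive = sum(y) / len(y)
adjusted = weighted_mean(y, weights)
print(f"unweighted mean: {naive:.2f}, post-stratified mean: {adjusted:.2f}")
```

The correction works here only because the auxiliary variable (age group) is related both to the chance of appearing in the source and to the target variable, which is the condition stated in the slide.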

Unit level methods of selectivity correction (2)
  • Modelling approach
    • The basic idea is that if the models include explanatory variables correlated with the selectivity mechanism, then they can correct or mitigate selectivity bias
      • M-Quantile models
      • Hierarchical Bayes models
      • Calibrated Bayes
      • Pattern mixture model
      • Machine learning (non-linear models)
      • Sample matching
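
Of the approaches listed, sample matching is perhaps the easiest to sketch. The version below is a simplified assumption, not the study's procedure: each unit of a probability-based reference sample is matched to its nearest neighbour in the big data source on a single auxiliary variable, and the matched unit's target value is used in its place. All names and values are hypothetical.

```python
def match_sample(reference_x, source_x, source_y):
    """For each reference unit, take the target value of the big data unit
    that is closest on the auxiliary variable (ties go to the first unit)."""
    matched = []
    for x in reference_x:
        j = min(range(len(source_x)), key=lambda k: abs(source_x[k] - x))
        matched.append(source_y[j])
    return matched

# Hypothetical big data source that skews young (auxiliary variable: age).
source_age = [20, 22, 25, 30, 60]
source_y = [5, 5, 6, 6, 1]

# Hypothetical reference sample with a representative age distribution.
reference_age = [20, 40, 60, 62]

matched_y = match_sample(reference_age, source_age, source_y)
source_mean = sum(source_y) / len(source_y)
matched_mean = sum(matched_y) / len(matched_y)
print(f"source mean: {source_mean:.2f}, matched-sample mean: {matched_mean:.2f}")
```

In practice matching would use several auxiliary variables and a distance measure suited to them; a single variable is used here only to keep the sketch short.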

Domain level methods of selectivity correction (1) •

Domain level methods of selectivity correction (2)
  • Modelling approach
    • Estimation of bias in big data source
      • From a sample survey (or registers)
      • Direct estimation + model for sub-domains
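
One possible reading of domain level bias estimation is sketched below. It assumes the bias of each domain is estimated directly as the difference between the big data estimate and a survey benchmark, and then applies a deliberately crude sub-domain model in which bias is constant within a domain. That constant-bias model is an assumption made for this example, not something stated in the slides, and all domain names and figures are hypothetical.

```python
def domain_bias(big_data_est, survey_est):
    """Direct bias estimate per domain: big data estimate minus survey benchmark."""
    return {d: big_data_est[d] - survey_est[d] for d in survey_est}

def correct_subdomains(sub_est, parent_of, bias):
    """Subtract the parent domain's estimated bias from each sub-domain
    estimate (assumes bias is constant within a domain)."""
    return {s: est - bias[parent_of[s]] for s, est in sub_est.items()}

# Hypothetical domain level proportions.
big_data_est = {"north": 0.60, "south": 0.50}
survey_est = {"north": 0.50, "south": 0.45}   # benchmark from a sample survey

# Hypothetical sub-domains for which no survey benchmark exists.
sub_est = {"n1": 0.65, "n2": 0.55, "s1": 0.50}
parent_of = {"n1": "north", "n2": "north", "s1": "south"}

bias = domain_bias(big_data_est, survey_est)
corrected = correct_subdomains(sub_est, parent_of, bias)
for name, value in corrected.items():
    print(f"{name}: {value:.2f}")
```

A richer sub-domain model would let the bias vary with auxiliary variables instead of being carried over unchanged from the parent domain.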

Conclusions
  • The existence of auxiliary information is crucial
  • We need to understand the selectivity mechanism
  • There are methods applicable at individual and aggregated level