
Topic (vi): New and Emerging Methods
Topic organizer: Maria Garcia (USA)
UNECE Work Session on Statistical Data Editing
Oslo, Norway, 24-26 September 2012

Topic (vi): Overview
§ Papers under this topic present new ideas and advances in the development of methods and techniques for solving, improving, and optimizing the editing and imputation of data.
§ Contributions cover:
– Probability editing
– Machine learning methods
– Model-based imputation methods
– Automatic editing of numerical data

Topic (vi): Overview
§ Probability editing: WP.36 (Sweden)
– Select a subset of units to edit using a probability sampling framework.
§ Machine learning methods: WP.37 (EUROSTAT)
– Imputation of categorical data using a neural network classifier and a Bayesian network classifier.

Topic (vi): Overview
§ Model-based imputation: WP.38 (Slovenia)
– Bayesian model + linear regression
– Multiple imputation
§ Automatic editing: WP.39 (Netherlands)
– Ensure hard edits are satisfied while incorporating information from soft edits into the editing/imputation process.

Topic (vi): New and Emerging Methods
Enjoy the presentations!

Topic (vi): New and Emerging Methods
Summary of main developments and points for discussion

Topic (vi): Summary
§ WP.36 – Probability Editing (Sweden)
– Proposes selecting units for editing using a traditional probability sampling framework in which only a fraction of the data is edited.
– Applies to all types of data.
– Addresses the statistical properties of the resulting estimators.
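
The selection step can be sketched in a few lines. This is an illustrative Poisson-sampling sketch, not WP.36's actual procedure: the idea of scoring units by anticipated error, and all names, are assumptions made for the example.

```python
import random

def select_units_for_editing(units, scores, fraction, seed=0):
    """Poisson-sampling sketch of probability editing: units with larger
    anticipated-error scores get larger inclusion probabilities, so only
    a fraction of the data is sent for manual editing. Illustrative only;
    the scoring and the design are assumptions, not WP.36's method."""
    rng = random.Random(seed)
    total = sum(scores)
    n_expected = fraction * len(units)
    selected = []
    for unit, score in zip(units, scores):
        # Inclusion probability proportional to score, capped at 1 so
        # "certainty" units (very suspicious records) are always edited.
        pi = min(1.0, n_expected * score / total)
        if rng.random() < pi:
            selected.append((unit, pi))  # keep pi for design-based weighting
    return selected
```

Keeping the inclusion probabilities alongside the selected units is what lets the estimator account for the fact that only part of the file was edited.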

Topic (vi): Summary
§ WP.37 – Use of machine learning methods to impute categorical data (EUROSTAT)
– New neural network classifier: a supervised learning method extended to deal with mixed numerical and categorical data.
– Bayesian network classifier.
– Compares machine learning results with results obtained from logistic regression and multiple imputation.
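
As a toy illustration of classifier-based imputation of a categorical variable, the sketch below trains a naive Bayes classifier (the simplest Bayesian network classifier) on the complete cases and predicts the missing categories. The data and variable names are invented; WP.37's actual classifiers are more elaborate.

```python
from collections import Counter, defaultdict

def naive_bayes_impute(records, target, predictors):
    """Impute missing values (None) of a categorical `target` with a
    naive Bayes classifier trained on the complete cases. Illustrative
    sketch, not WP.37's implementation; add-one smoothing keeps the
    conditional probabilities finite for unseen predictor values."""
    complete = [r for r in records if r[target] is not None]
    classes = sorted({r[target] for r in complete})
    prior = Counter(r[target] for r in complete)
    # Conditional counts: (predictor, value) -> class -> count
    cond = defaultdict(Counter)
    for r in complete:
        for p in predictors:
            cond[(p, r[p])][r[target]] += 1

    for r in records:
        if r[target] is None:
            def score(c):
                s = prior[c] / len(complete)
                for p in predictors:
                    s *= (cond[(p, r[p])][c] + 1) / (prior[c] + len(classes))
                return s
            r[target] = max(classes, key=score)  # most probable class
    return records
```

Note that always imputing the most probable class distorts the distribution of the variable, which is exactly the concern raised in the points for discussion; drawing from the predicted class probabilities instead would preserve it better.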

Topic (vi): Summary
§ WP.38 – Implementation of the Bayesian approach to imputation at SORS (Slovenia)
– Bayesian model and linear regression combined into a method for imputing annual gross income in a household survey.
– Solves the imputation problem within separate data groupings, with different levels of the auxiliary variables within each group.
– Multiple imputation.
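
WP.38 fits a full Bayesian model via SAS PROC MCMC; the stdlib-only sketch below illustrates only the simpler core idea of regression-based multiple imputation, adding a residual draw so the m imputations differ and carry uncertainty. The variable names and data are illustrative assumptions.

```python
import random
import statistics

def multiply_impute_income(pairs, m=5, seed=0):
    """Generate m imputations for missing incomes (None in `pairs` of
    (auxiliary, income)) via a simple linear regression, with a draw
    from the residual distribution added to each imputed value. A
    stdlib sketch of the idea, not WP.38's Bayesian implementation."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in pairs if y is not None]
    xs, ys = zip(*obs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in obs) / sxx  # OLS slope
    a = ybar - b * xbar                                     # OLS intercept
    resid_sd = statistics.pstdev([y - (a + b * x) for x, y in obs])
    imputations = []
    for _ in range(m):
        filled = [y if y is not None else a + b * x + rng.gauss(0, resid_sd)
                  for x, y in pairs]
        imputations.append(filled)
    return imputations
```

A fully Bayesian version would also draw the regression coefficients from their posterior, not just the residuals; this is what PROC MCMC provides.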

Topic (vi): Summary
§ WP.39 – Automatic editing with hard and soft edits (Netherlands)
– Error localization problem involving both hard (inconsistency) and soft (query) edits.
– Minimizing the number of fields to impute incorporates the cost associated with failed query edits.
– Associated software written in R, using an existing R package.
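
A minimal brute-force sketch of the objective: find the cheapest set of fields to impute, where cost is the number of fields changed plus a penalty for each soft edit the untouched fields still fail. WP.39's solver operates on the edit rules themselves; here the edits are plain predicates and the weights are assumptions made for the example.

```python
from itertools import combinations

def localize_errors(record, hard_edits, soft_edits, soft_penalty=0.5):
    """Brute-force error localization with hard and soft edits: return
    the set of fields to impute minimizing
        (fields changed) + soft_penalty * (failed soft edits).
    Each edit is a predicate on a partial record and must return True
    when any field it references has been dropped (to be imputed).
    Illustrative only; real solvers use integer programming."""
    fields = sorted(record)
    best, best_cost = None, float("inf")
    for k in range(len(fields) + 1):
        for drop in combinations(fields, k):
            kept = {f: v for f, v in record.items() if f not in drop}
            if not all(e(kept) for e in hard_edits):
                continue  # hard edits must hold on the kept fields
            failed_soft = sum(0 if e(kept) else 1 for e in soft_edits)
            cost = k + soft_penalty * failed_soft
            if cost < best_cost:
                best, best_cost = set(drop), cost
        if best is not None and best_cost <= k:
            break  # any larger set costs at least k + 1; cannot improve
    return best, best_cost
```

The exhaustive search makes the computational-feasibility concern in the discussion points concrete: the number of candidate sets grows exponentially in the number of fields, which is why adding soft edits to the optimization is costly.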

Topic (vi): Points for discussion
In probability editing, a sampling design is used to select units for editing. Without the two-step approach, important units may not be in the sample, and possibly large errors may remain in the data file.
Ø What are the implications?
Ø How will this affect the estimator and the variance of the estimator?
Ø How do errors remaining in the data affect analysis, particularly if the data are used for other purposes not envisioned in the original survey design?

Topic (vi): Points for discussion
The paper from EUROSTAT deals with the use of machine learning methods for imputing missing data:
Ø The method uses either a Bayesian network classifier or a neural network classifier. What is the effect of using these methods for imputing missing values on the original distribution of the variables?
Ø Choosing the appropriate method for handling imputation of missing values is a challenging problem. For what kinds of surveys are these methods suitable?

Topic (vi): Points for discussion
Ø What are the experiences at other agencies when incorporating machine learning methods into their imputation menu (e.g., at BOC: B. Winkler 2009, 2010)?

Topic (vi): Points for discussion
The paper from Slovenia implements the imputation model using an available procedure within the SAS language, PROC MCMC:
Ø For some complex imputation problems (i.e., a large number of variables, different types of variables, large data files), the default settings within commercial software procedures may be inappropriate. What types of diagnostics, graphical or analytical, should be examined to ensure the procedure is working properly?
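
One standard analytical diagnostic for MCMC output is the Gelman-Rubin potential scale reduction factor, computed from several independent chains: values near 1 suggest the chains have mixed. The minimal sketch below assumes equal-length chains and is a generic illustration, not tied to PROC MCMC's own diagnostics.

```python
import statistics

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for equal-length MCMC chains: compares the
    between-chain variance B to the average within-chain variance W.
    R-hat near 1 indicates convergence; large values indicate the
    chains disagree. Minimal sketch for illustration only."""
    m = len(chains)                 # number of chains
    n = len(chains[0])              # draws per chain
    means = [statistics.mean(c) for c in chains]
    w = statistics.mean(statistics.variance(c) for c in chains)  # within
    b = n * statistics.variance(means)                           # between
    var_hat = (n - 1) / n * w + b / n   # pooled posterior-variance estimate
    return (var_hat / w) ** 0.5
```

In practice this would be examined together with graphical checks such as trace and autocorrelation plots, which is the combination the discussion point asks about.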

Topic (vi): Points for discussion
Ø How can we address the situation in which a large proportion of the missing fields occur within a particular data group?
Ø Or the situation in which a particular data group is too small to fit the model?

Topic (vi): Points for discussion
Incorporating soft or query edits into the error localization problem increases the number of variables and edits in this computationally intensive optimization problem:
Ø How do we approach the added complexity? Is the new approach computationally feasible?
Ø Does adding the information from the soft (query) edits lead to a reduction of the time and resources spent on analysts' review (a trade-off)?
Ø What are the effects of adding more edits, and thus changing more fields, on data quality?

New and Emerging Methods: Closing Remarks

Closing Remarks
Ø What happened to error localization?
Ø Should we be focusing more on imputation than on editing (again: error localization)?
Ø Do we foresee multiple imputation becoming the "standard" at most NSIs:
– to smooth the variability of single imputations?
– to estimate the variance due to imputation?
– the issue of non-response bias vs. variance estimation?
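
The variance-due-to-imputation point can be made concrete with Rubin's combining rules, the standard way to pool m multiply-imputed estimates: the total variance is the average within-imputation variance W plus the between-imputation variance B inflated by (1 + 1/m).

```python
import statistics

def rubin_combine(estimates, variances):
    """Rubin's rules for m multiply-imputed analyses: pool the point
    estimates by their mean, and combine the within-imputation variance
    W with the between-imputation variance B as T = W + (1 + 1/m) * B.
    The B term is exactly the "variance due to imputation"."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)       # pooled point estimate
    w = statistics.mean(variances)           # average within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance
    t = w + (1 + 1 / m) * b                  # total variance
    return q_bar, t
```

With a single imputation B is unobservable and T is understated, which is the argument for multiple imputation raised in the closing remarks.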