Clinical prediction models
Session 11 Dealing with missing data
Pedro E A A do Brasil
pedro.brasil@ini.fiocruz.br
2018

Objectives
• Briefly review the theoretical background on mechanisms of missingness of predictor values
• Comment on how missingness may affect the modelling process
• Show examples of imputation methods as a solution
• This session is not intended to exhaust the missing data/imputation topic
2015 Session 11

Problems
• Missing data are a common problem
• Standard statistical software for regression analysis deletes subjects with any missing data on any predictor before analysis
• Therefore, the number of subjects may vary per analysis as different predictors are explored
• Complete-case analyses are hence statistically inefficient

Rationale
• One must assume that true predictor values are hidden by the missing values.
• One must understand that imputation is not a “good guess” of the missing data, but rather a good use of the available data.
• Evidence points to greater bias in predictions from complete-case analysis than from analysis of an imputed dataset.

Rationale
Steyerberg. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer; 2009.

Rationale
• Basics
  – Suppose 500 patients and 5 predictors, where each predictor has 10% missing data and each patient has at most 1 missing value
  – Information available is 250 complete cases (250 × 5 = 1,250 values) + 250 incomplete cases (250 × 4 = 1,000 values) = 2,250 of the 2,500 values = 90% of the required data
  – Complete-case analysis will use only 250/500 of the patients in the data
  – 10% missing -> 50% of patients discarded
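
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the 500 patients and 5 predictors are read off the slide's own figures (250/500 and 250 × 5), and the variable names are made up for illustration.

```python
# Figures taken from the slide: 500 patients, 5 predictors,
# 10% missing per predictor, at most one missing value per patient.
n_patients = 500
n_predictors = 5
missing_fraction = 0.10

# No patient misses two values, so the patients with a missing value
# do not overlap across predictors.
missing_per_predictor = round(n_patients * missing_fraction)
incomplete_cases = missing_per_predictor * n_predictors
complete_cases = n_patients - incomplete_cases

# Complete cases contribute all predictor values; incomplete cases
# contribute all but one.
observed_values = (complete_cases * n_predictors
                   + incomplete_cases * (n_predictors - 1))
required_values = n_patients * n_predictors

share_values_observed = observed_values / required_values
share_patients_kept = complete_cases / n_patients

print(f"complete cases: {complete_cases}, incomplete cases: {incomplete_cases}")
print(f"values observed: {observed_values} of {required_values} "
      f"({share_values_observed:.0%})")
print(f"patients kept by complete-case analysis: {share_patients_kept:.0%}")
```

The gap between the two shares (90% of values observed, 50% of patients kept) is the inefficiency that imputation tries to recover.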

Rationale
• Basics
  – For example, one may wish to compare nested models, or to see how an effect changes from the univariable to the adjusted (multivariable) analysis
  – When two models are fitted to data with missing values, it is impossible to infer whether differences in odds ratios, p values or R² arose because of true differences, because of correlation between the predictors, or because of a selection of subjects due to missing values

Missing mechanisms
• Depending on the imputation strategy, the mechanism is not that relevant.
• In health data, missingness is usually not at random.
Steyerberg. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer; 2009.

Examples of bias
• A correlation between missingness of a predictor and the outcome poses a serious problem in predictive modelling.
• If an association between missingness of a predictor X and the outcome Y is noted in a prospective study, the explanation must be through other predictors.
• MAR (missing at random) on Y for only one predictor is sufficient to bias the coefficients of all predictors.
Steyerberg. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer; 2009.

Bias due to missing data
Steyerberg. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer; 2009.

Imputation
• Imputation methods substitute the missing values by plausible values
• As the relation with the outcome is the main source of bias, always include the outcome in the imputation process
• Consider correlated predictors in the imputation process even if one of them is not going to be modelled: e.g. Hct <-> Hg

Imputation
Steyerberg. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer; 2009.

Imputation
• Sample random normal values
  – Only external information is used
• Conditional mean with a single imputation
  – Only predictor data are used
• Single imputation with a random draw from the predictive distribution of an imputation model
  – Predictor data and outcome data are used
• Multiple imputation with a random draw from the predictive distribution of an imputation model
  – Predictor data and outcome data are used
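
The difference between a conditional mean and a random draw from the predictive distribution can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the method used in the workshop: the data are simulated, missingness is completely at random, the imputation model is a simple linear regression on one other predictor (in practice the outcome would be included too), and the draw ignores uncertainty in the fitted parameters.

```python
import random
import statistics

random.seed(11)  # arbitrary seed, only for reproducibility

# Hypothetical data: x2 depends linearly on x1 plus noise.
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + random.gauss(0, 0.6) for a in x1]

# Assumption: about 30% of x2 is missing completely at random.
is_missing = [random.random() < 0.3 for _ in range(n)]
obs = [(a, b) for a, b, m in zip(x1, x2, is_missing) if not m]

# Fit the imputation model x2 ~ x1 on the observed pairs (least squares).
mean_x1 = statistics.fmean([a for a, _ in obs])
mean_x2 = statistics.fmean([b for _, b in obs])
sxx = sum((a - mean_x1) ** 2 for a, _ in obs)
sxy = sum((a - mean_x1) * (b - mean_x2) for a, b in obs)
slope = sxy / sxx
intercept = mean_x2 - slope * mean_x1
resid_sd = statistics.stdev([b - (intercept + slope * a) for a, b in obs])

# Strategy A: conditional mean, i.e. the fitted value itself.
cond_mean = [intercept + slope * a if m else b
             for a, b, m in zip(x1, x2, is_missing)]

# Strategy B: random draw around the fitted value using the residual SD.
random_draw = [random.gauss(intercept + slope * a, resid_sd) if m else b
               for a, b, m in zip(x1, x2, is_missing)]

sd_true = statistics.stdev(x2)
sd_cond_mean = statistics.stdev(cond_mean)
sd_random_draw = statistics.stdev(random_draw)

print(f"SD of x2, full data:        {sd_true:.3f}")
print(f"SD after conditional mean:  {sd_cond_mean:.3f}")
print(f"SD after random draws:      {sd_random_draw:.3f}")
```

Conditional-mean imputation places every imputed value exactly on the regression line, so the completed variable is less spread out than the real one; adding a random residual restores that spread.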

Imputation
• Problems
  – We may want to predict missing values for one predictor using other predictors which also have missing values.
• Workaround
  – Data augmentation methods, which follow an iterative process of an imputation step, which imputes values for the missing data, and a posterior step, which draws new estimates for the model parameters based on the previously imputed values.
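
The alternation between an imputation step and a parameter step can be sketched for two predictors that are each missing in some patients. This is a simplified illustration, not a full data augmentation implementation: the posterior draw of the parameters is approximated by refitting a linear model on a bootstrap resample of the currently completed data, and the function names (`fit_line`, `iterative_impute`) and the simulated data are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns intercept, slope, residual SD."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid_sd = statistics.stdev(
        [y - intercept - slope * x for x, y in zip(xs, ys)])
    return intercept, slope, resid_sd


def iterative_impute(a, b, n_iter=10, rng=None):
    """Impute None entries of two predictors, each predicted from the other."""
    rng = rng or random.Random()
    miss_a = [v is None for v in a]
    miss_b = [v is None for v in b]
    # Starting point: fill with observed means so both variables are complete.
    mean_a = statistics.fmean([v for v in a if v is not None])
    mean_b = statistics.fmean([v for v in b if v is not None])
    a = [mean_a if m else v for v, m in zip(a, miss_a)]
    b = [mean_b if m else v for v, m in zip(b, miss_b)]
    n = len(a)
    for _ in range(n_iter):
        for target, other, miss in ((a, b, miss_a), (b, a, miss_b)):
            # Parameter step (approximation of the posterior draw):
            # refit on a bootstrap resample of the completed data.
            idx = [rng.randrange(n) for _ in range(n)]
            b0, b1, sd = fit_line([other[i] for i in idx],
                                  [target[i] for i in idx])
            # Imputation step: redraw the missing values given the
            # newly drawn parameters and the current values of `other`.
            for i in range(n):
                if miss[i]:
                    target[i] = rng.gauss(b0 + b1 * other[i], sd)
    return a, b


# Hypothetical example data with about 20% missing in each predictor.
rng = random.Random(3)
true_a = [rng.gauss(50, 10) for _ in range(200)]
true_b = [0.5 * x + rng.gauss(0, 5) for x in true_a]
a_obs = [None if rng.random() < 0.2 else v for v in true_a]
b_obs = [None if rng.random() < 0.2 else v for v in true_b]

a_imp, b_imp = iterative_impute(a_obs, b_obs, n_iter=10, rng=rng)
print(sum(v is None for v in a_obs), "values imputed in a;",
      sum(v is None for v in b_obs), "values imputed in b")
```

Observed values are never changed; only the missing entries are redrawn at each iteration, so later iterations use imputations that no longer depend on the crude mean fill used to start the process.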

Imputation
• Choosing the imputation
  – The imputation model aims to approximate the true distributional relationship between the unobserved data and the available information
  – Two modelling choices usually have to be made:
    • the form of the model (e.g. linear, logistic, polytomous)
    • the set of variables that enter the model, including potential transformations of predictors
  – Truncate imputed values so that they remain within a plausible range
  – Always include all predictors and the outcome of the final model; consider auxiliary predictors.
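
Truncation of imputed values can be as simple as clamping them to a pre-specified range. The sketch below uses a hypothetical helper `truncate` and a haematocrit range chosen purely for illustration; the actual limits should come from subject matter knowledge.

```python
def truncate(value, lower, upper):
    """Clamp an imputed value to the plausible range [lower, upper]."""
    return max(lower, min(upper, value))


# Hypothetical plausible range for haematocrit (%), for illustration only.
HCT_RANGE = (10.0, 70.0)

# Hypothetical imputed values, two of which fall outside the range.
imputed_hct = [42.3, 7.9, 55.0, 81.2]
truncated_hct = [truncate(v, *HCT_RANGE) for v in imputed_hct]

print(truncated_hct)
```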

Imputation
• Multiple Imputation
  – In multiple imputation (MI), missing values are imputed m times using m independent draws from an imputation model.
  – This means that, for each variable with missing data, a conditional distribution for the missing data can be specified given the other data
  – m completed data sets are created instead of a single completed data set.

Imputation
• Multiple Imputation
  – The m complete-data analyses are combined to obtain the estimates of regression coefficients and performance estimates
  – As m increases, the within-imputation variance becomes the stronger component of the overall variance.
  – m may be as low as 1, in which case MI becomes single imputation.
  – In prediction research, subjects with missing outcome data are generally discarded.
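
The slides do not name the combination rule; a common choice is Rubin's rules, sketched below for a single regression coefficient. The function name `pool_rubin` and the five estimates and variances are hypothetical, made up only to show the calculation.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m complete-data estimates of one coefficient (Rubin's rules).

    estimates: the coefficient estimated in each of the m completed datasets
    variances: the squared standard error of that coefficient in each dataset
    """
    m = len(estimates)
    if m < 1 or m != len(variances):
        raise ValueError("need m >= 1 estimates with matching variances")
    pooled = statistics.fmean(estimates)    # pooled point estimate
    within = statistics.fmean(variances)    # within-imputation variance
    # Between-imputation variance cannot be estimated when m == 1
    # (single imputation), so it is set to zero in that case.
    between = statistics.variance(estimates) if m > 1 else 0.0
    total = within + (1 + 1 / m) * between  # total variance
    return pooled, within, between, total


# Hypothetical coefficient estimates and squared SEs from m = 5 datasets.
estimates = [0.50, 0.62, 0.55, 0.47, 0.58]
variances = [0.040, 0.042, 0.039, 0.041, 0.043]

pooled, within, between, total = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled:.3f}, pooled SE: {math.sqrt(total):.3f}")
print(f"within: {within:.4f}, between: {between:.5f}, total: {total:.5f}")
```

The factor (1 + 1/m) on the between-imputation variance shrinks towards 1 as m grows, which is why the within-imputation variance takes a larger share of the total with more imputations. With m = 1 the between component is unavailable, so the pooled variance understates the uncertainty due to the missing data.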

Imputation
• Steps in dealing with missing data
  – Explore the missing patterns
  – Explore the relationship between missingness and the outcome
  – Subject matter knowledge should be used to judge plausible mechanisms for the missing values
• Omitting predictors
  – It may be convenient to omit a predictor with 50% or more missingness, even if it is of major interest.
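
The first two exploration steps, and the 50% rule of thumb, can be sketched on a toy dataset. Everything below is hypothetical: the eight records, the predictor names and the helper `outcome_rate_by_missingness` are invented for illustration, with `None` marking a missing value.

```python
from collections import Counter

# Hypothetical records; None marks a missing predictor value.
records = [
    {"age": 54, "hct": 41.0, "sbp": 130, "outcome": 0},
    {"age": 61, "hct": None, "sbp": 145, "outcome": 1},
    {"age": 47, "hct": 38.5, "sbp": None, "outcome": 0},
    {"age": 70, "hct": None, "sbp": 150, "outcome": 1},
    {"age": 58, "hct": 44.2, "sbp": 128, "outcome": 0},
    {"age": 66, "hct": None, "sbp": None, "outcome": 1},
    {"age": 39, "hct": 40.1, "sbp": 118, "outcome": 0},
    {"age": 73, "hct": 36.9, "sbp": 160, "outcome": 1},
]
predictors = ["age", "hct", "sbp"]

# Step 1: missing patterns. Each pattern is a tuple of flags
# (True = missing), one flag per predictor, counted over patients.
patterns = Counter(
    tuple(r[p] is None for p in predictors) for r in records)

# Fraction of patients with a missing value, per predictor.
fraction_missing = {
    p: sum(r[p] is None for r in records) / len(records) for p in predictors}


# Step 2: relationship between missingness and the outcome.
def outcome_rate_by_missingness(records, predictor):
    """Outcome frequency among patients with and without the predictor."""
    groups = {"missing": [], "observed": []}
    for r in records:
        key = "missing" if r[predictor] is None else "observed"
        groups[key].append(r["outcome"])
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


# Rule of thumb from the slide: flag predictors with >= 50% missingness.
candidates_to_omit = [p for p in predictors if fraction_missing[p] >= 0.5]

for pattern, count in patterns.most_common():
    print(dict(zip(predictors, pattern)), count)
print(fraction_missing)
print("hct:", outcome_rate_by_missingness(records, "hct"))
print("candidates to omit:", candidates_to_omit)
```

In this toy dataset the outcome is far more frequent among patients with missing `hct`, which is the kind of association between missingness and outcome that the slides warn about.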

Conclusions
• With software availability and current evidence pointing to the benefits of imputation, it is considered bad practice not to impute data.
• Nevertheless, some analysis steps are not possible with multiply imputed data (e.g. bootstrap optimism estimation) and one must choose a single completed dataset.
• Imputation examples will be shown in the workshop.
• Further reading on multiple imputation is available:
  – https://www.crcpress.com/Flexible-Imputation-of-Missing-Data-Second-Edition/Buuren/p/book/9781138588318

End
Session 11 Dealing with missing data
Pedro E A A do Brasil
pedro.brasil@ini.fiocruz.br
2018