DEALING WITH MISSING DATA DR. RICHARD HODGETT



















PLAN FOR THIS SESSION
• Discuss how data can be missing.
• Discuss how you can classify missing data.
• Talk about the procedure for dealing with missing data.
• Show how to identify missing data.
• Discuss deletion methods and then use these in R.
• Discuss imputation methods and then use these in R.
• Apply what you have learnt to the Titanic passengers dataset.
MISSING DATA
• Missing data is a common problem and challenge for analysts.
• There are many reasons why data could be missing, including:
  • Respondents forgot to answer questions.
  • Respondents refused to answer certain questions.
  • Respondents failed to complete the survey.
  • A sensor failed.
  • Someone purposefully turned off recording equipment.
  • An internet connection was lost.
  • A network went down.
  • There was a power cut.
  • A hard drive became corrupt.
  • The method of data capture was changed.
  • A data transfer was cut short.
MISSING DATA
Missing data can usually be classified into:
• Missing Completely at Random (MCAR):
  • Missingness doesn’t depend on the values of the data set.
  • e.g. a random sample of the patients who had their blood pressure measured also had their weight measured.
• Missing at Random (MAR):
  • Missingness does not depend on the unobserved values of the data set but does depend on the observed values.
  • e.g. only patients with high blood pressure had their weight measured.
• Not Missing at Random (NMAR):
  • Missingness depends on the unobserved values of the data set.
  • e.g. only overweight patients had their weight measured.
MISSING DATA
Another example: survey data on drug use.
• Missing Completely at Random (MCAR):
  • You removed 10% of the respondents’ data randomly.
• Missing at Random (MAR): (most common type)
  • People who come from poorer families might be less inclined to answer questions about drug use, so whether the drug use answer is missing is related to family income (an observed variable).
• Not Missing at Random (NMAR):
  • Students skipped the question on drug use because they feared that admitting drug use would get them expelled from school.
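The three mechanisms can be made concrete with a small simulation. This is a minimal sketch using made-up data: the variable names, the 30% rate, the cut-offs of 130 and 85 and the seed are all illustrative assumptions, not values from the session.

```r
set.seed(42)   # arbitrary seed so the sketch is reproducible
n      <- 1000
bp     <- rnorm(n, mean = 125, sd = 15)   # blood pressure, always observed
weight <- rnorm(n, mean = 80,  sd = 12)   # weight, sometimes missing

# MCAR: every patient has the same 30% chance of a missing weight
weight_mcar <- ifelse(runif(n) < 0.3, NA, weight)

# MAR: weight is recorded only when the (observed) blood pressure is high
weight_mar <- ifelse(bp > 130, weight, NA)

# NMAR: weight is recorded only for heavier patients,
# so missingness depends on the value that is missing
weight_nmar <- ifelse(weight > 85, weight, NA)

# Compare the mean of the observed weights with the true mean
means <- c(true = mean(weight),
           mcar = mean(weight_mcar, na.rm = TRUE),
           mar  = mean(weight_mar,  na.rm = TRUE),
           nmar = mean(weight_nmar, na.rm = TRUE))
print(round(means, 2))
```

In this simulation the NMAR mean is clearly too high, because only the heavier patients remain. The MAR mean stays close to the true mean only because weight was generated independently of blood pressure here; if the two were correlated, the observed mean would be biased unless blood pressure was taken into account.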
MISSING DATA
• Generally, the procedure for dealing with missing data is:
  1. Identify the missing data.
  2. Identify the cause of the missing data.
  3. Either:
     A. Remove the rows containing the missing data.
        • Also called the naïve approach.
        • Make sure the missing data isn’t biased!
     B. Replace missing values with alternative values.
        • Impute the missing values.
        • There are a number of approaches.
• Deciding between A and B depends on which option you think will produce the most reliable and accurate results.
IDENTIFY MISSING DATA
• Missing data is normally identified using summary() in R.
• There are also a number of different ways to visualize missing data in R.
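As a minimal sketch of identifying missing data, the code below builds a small made-up data frame (the column names and values are assumptions for illustration only) and counts its missing values with base R functions.

```r
# Hypothetical data frame with some NA values
df <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  income = c(30000, 42000, NA, 58000, 61000),
  city   = c("Leeds", "York", NA, "Leeds", "Hull")
)

# summary() reports an "NA's" count for each numeric column
print(summary(df))

# Number of missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Overall proportion of missing cells
print(mean(is.na(df)))

# Row numbers that contain at least one missing value
incomplete_rows <- which(!complete.cases(df))
print(incomplete_rows)
```

colSums(is.na(df)) is often the quickest check, because it covers character columns as well as numeric ones.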
REMOVING MISSING DATA ROWS
• The two most common methods for removing missing data are:
  • Listwise deletion (complete case analysis)
    Description: Analyse the data rows where there is complete data for every column.
    Advantages:
      • Simple.
      • Easily compare across analyses.
    Limitations:
      • Could be biased (if the data is not MCAR).
      • Lower n, reduces statistical power.
  • Pairwise deletion
    Description: Analyse the data rows where the variables of interest have data present.
    Advantages:
      • Uses all possible information.
    Limitations:
      • Separate analyses cannot be compared as the data / sample will be different.
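The difference between the two deletion methods can be sketched on a small made-up data frame (the columns x, y and z are illustrative assumptions). Listwise deletion is shown with na.omit(), and pairwise deletion with the use argument of cor().

```r
# Hypothetical data with one NA in each column, on different rows
df <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(2, NA, 6, 8, 10, 12),
  z = c(1, 3, 2, NA, 5, 4)
)

# Listwise deletion: keep only the rows that are complete in every column
listwise <- na.omit(df)
print(nrow(listwise))

# Pairwise deletion: each correlation uses every row that is complete
# for that particular pair of variables
pairwise_cor <- cor(df, use = "pairwise.complete.obs")
print(pairwise_cor)

# Number of rows available for each pair (diagonal = rows observed per column)
pairwise_n <- crossprod(1 - is.na(df))
print(pairwise_n)
```

Here listwise deletion keeps 3 of the 6 rows, while each pairwise correlation is based on 4 rows. The pairwise results use more of the data, but each entry of the correlation matrix comes from a different subset of rows.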
REMOVE MISSING DATA ROWS
• In R, missing or unusual values can be represented by:
  • NA: Not Available (placeholder for a missing value).
  • NULL: Empty value.
  • Inf: Infinity.
• It is possible to use the is.na(), is.null() and is.infinite() functions in R to identify missing, empty and infinite values in datasets.
• The function complete.cases() can be used to identify the data rows in a matrix or data frame that are / aren't complete.
• complete.cases() only regards NA (and NaN) as missing; Inf and -Inf are treated as valid values.
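The behaviour of these functions can be checked on a few made-up values. This is a small sketch; the vector x and the data frame df are assumptions for illustration.

```r
x <- c(1, NA, 3, Inf, NaN)

na_flags  <- is.na(x)        # TRUE for NA and for NaN
inf_flags <- is.infinite(x)  # TRUE only for Inf / -Inf
print(na_flags)
print(inf_flags)

print(is.null(NULL))         # TRUE: NULL is an empty object
print(is.null(x))            # FALSE: x is a vector, even though it contains NA

# complete.cases() flags rows without NA; Inf counts as a valid value
df <- data.frame(a = c(1, NA, 3, Inf),
                 b = c("p", "q", NA, "s"))
complete_flags <- complete.cases(df)
print(complete_flags)

print(df[complete_flags, ])   # rows that are complete
print(df[!complete_flags, ])  # rows with at least one NA
```

If infinite values should also be treated as missing, one option is to recode them first, for example with x[is.infinite(x)] <- NA.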
EXAMPLE MISSING DATASET
• We will be using the following sleep dataset as an example.
• It contains the following data on 62 species of mammals:
  Column     Description
  Dream      Length of dreaming sleep
  NonD       Non-dreaming sleep
  Sleep      Sum of Dream and NonD
  BodyWgt    Body weight (kg)
  BrainWgt   Brain weight (g)
  Span       Life span (yrs)
  Gest       Gestation time in days
  Pred       Degree to which species were preyed upon (1-low to 5-high scale)
  Exp        Degree of their exposure while sleeping (1-low to 5-high scale)
  Danger     Overall danger (1-low to 5-high scale)
• Various values are missing in the dataset (NA values).
SESSION 1
• Time to get stuck in and deal with some missing data!
• Download: http://www.hodgett.co.uk/MissingData.zip
• Follow the code for session 1.
REPLACING MISSING DATA
• The two most common methods for replacing missing data are:
  • Simple Imputation
    Description: Missing values are replaced with the mean, median or mode value.
    Stochastic: No
    Advantages:
      • Simple.
    Limitations:
      • Could be biased (if the data is not MCAR).
      • Underestimates standard errors.
      • Could distort correlations among variables.
  • Multiple Imputation
    Description: Estimates missing data through repeated simulations.
    Stochastic: Yes
    Advantages:
      • Variability more accurate.
    Limitations:
      • Algorithms are more complex.
      • Normally would require complex coding (R library available).
SIMPLE IMPUTATION
• Simply replace the missing values with the mean, median or mode:
  • mean(x)
  • median(x)
  • names(sort(-table(x)))[1]
• For example, using mean(sleep$x, na.rm=TRUE):
  • 1.972 (mean of Dream)
  • 8.672917 (mean of NonD)
  • 19.87759 (mean of Span)
• Replace NA values with sleep$x[is.na(sleep$x)] <- value
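The same steps can be tried on a short made-up vector. This is a minimal sketch (the values in x are an assumption chosen so the results are easy to verify by hand).

```r
x <- c(2, 4, NA, 4, 10, NA)

# Summary statistics of the observed values
mean_x   <- mean(x, na.rm = TRUE)     # (2 + 4 + 4 + 10) / 4
median_x <- median(x, na.rm = TRUE)
# table() ignores NA by default; names() returns text, so convert back to numeric
mode_x   <- as.numeric(names(sort(-table(x)))[1])

# Mean imputation: copy the vector, then overwrite only the NA positions
x_mean <- x
x_mean[is.na(x_mean)] <- mean_x

# Mode imputation works the same way
x_mode <- x
x_mode[is.na(x_mode)] <- mode_x

print(x_mean)
print(x_mode)
```

Working on a copy keeps the original vector available, which makes it easy to compare the imputed and observed values afterwards.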
MULTIPLE IMPUTATION
• The idea of Multiple Imputation is to replace each missing value with multiple acceptable values that represent a distribution of possibilities.
• This results in a number of complete datasets (usually 3-10).
[Figure: workflow with three steps (Impute, Analyse, Pool). Labels: Dataset with missing values; Imputed datasets; Analysis results of datasets; Dataset with multiple imputation applied.]
MULTIPLE IMPUTATION
The general procedure for the chained equation approach to multiple imputation (used in mice()) is:
1. A simple imputation is performed for every missing value.
2. The imputed values of one variable are set back to missing.
3. Regression is performed (linear, logistic, polynomial etc.), with the variable from step 2 as the forecast variable and all other variables in the dataset as the predictor variables.
4. The missing values are replaced with predictions (imputations) from the regression.
5. Repeat steps 2-4 for each variable that has missing data (one cycle).
6. Repeat for a number of cycles, then retain the results as one imputed dataset.
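The steps above can be sketched in base R. This is a simplified illustration, not the mice() implementation: it uses only linear regression, fills the gaps with plain predictions (mice() additionally draws random values so that imputed datasets differ), and runs on simulated data. The variable names, the seed and the choice of 10 cycles are assumptions.

```r
set.seed(1)                      # arbitrary seed for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- 2 * x1 + rnorm(n)
x3 <- x1 - x2 + rnorm(n)
dat <- data.frame(x1, x2, x3)
dat$x2[sample(n, 30)] <- NA      # make 30 values of x2 missing
dat$x3[sample(n, 30)] <- NA      # make 30 values of x3 missing

miss <- is.na(dat)               # remember where the gaps were
vars_with_na <- names(dat)[colSums(miss) > 0]

# Step 1: simple (mean) imputation for every missing value
imp <- dat
for (v in vars_with_na) {
  imp[[v]][miss[, v]] <- mean(dat[[v]], na.rm = TRUE)
}

n_cycles <- 10                   # hypothetical number of cycles
for (cycle in seq_len(n_cycles)) {
  for (v in vars_with_na) {      # step 5: loop over variables with gaps
    # Step 2: treat v as missing again by fitting only where it was observed
    observed <- !miss[, v]
    # Step 3: regress v on all other variables
    fit <- lm(reformulate(setdiff(names(imp), v), response = v),
              data = imp[observed, ])
    # Step 4: replace the gaps in v with predictions from the regression
    imp[[v]][!observed] <- predict(fit, newdata = imp[!observed, ])
  }
}

print(head(imp))                 # step 6: one completed dataset
```

Only the originally missing cells are ever overwritten; the observed values are left untouched in every cycle.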
MULTIPLE IMPUTATION
• We will focus on using the mice package.
• The mice package has many built-in imputation techniques, including 'sample', 'pmm', 'logreg' and 'norm'.
• Example: mice(data, meth=c('sample', 'pmm', 'logreg', 'norm'))
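A typical impute / analyse / pool workflow with mice might look like the sketch below. It assumes the mice package is installed and uses the small nhanes dataset that ships with mice rather than the session data. The model formula, the number of imputations and the seed are illustrative choices.

```r
# Run only if the mice package is available
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Impute: create 5 completed datasets using predictive mean matching
  imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

  # Analyse: fit the same model to each imputed dataset
  fit <- with(imp, lm(bmi ~ age + chl))

  # Pool: combine the 5 sets of estimates into one result
  print(summary(pool(fit)))

  # Extract the first completed dataset for inspection
  completed <- complete(imp, 1)
  print(head(completed))
}
```

Passing a single method name applies it to every incomplete column. A vector with one entry per column, as in the example on the slide, lets each column use a different technique.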
SESSION 2 & 3
• Session 2: Learn to use simple and multiple imputation methods.
• Session 3: Apply what you have learnt to the Titanic passengers dataset.
Please ask me any questions you might have.
My email: r.e.hodgett@leeds.ac.uk