Missing Values: Adapting to Missing Data

Sources of Missing Data
• People refuse to answer a question
• Responses are indistinct or ambiguous
• Numeric data are obviously wrong
• Broken objects cannot be measured
• Equipment failure or malfunction
• Detailed analysis of subsample

Assumptions 1
• Missing Completely at Random (MCAR) – the probability that data are missing on X is unrelated to the value of X or to the values of other variables in the data set
• Missing at Random (MAR) – the probability that data are missing on X is unrelated to the value of X after controlling for other variables in the analysis
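
The difference between the two assumptions can be seen in a small simulation. This is only a sketch on made-up data: the variables x and y, the sample size, and the missingness probabilities are all assumptions chosen for illustration. Under MCAR the mean of the observed y stays close to the full-data mean; under MAR (missingness driven by the fully observed x) the raw mean of y is biased, but the regression of y on x, which controls for x, is not.

```r
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- 0.7 * x + rnorm(n, sd = 0.5)

# MCAR: every y has the same 40% chance of being missing
y_mcar <- y
y_mcar[runif(n) < 0.4] <- NA

# MAR: y is missing with probability 0.8 when x > 0 and never otherwise,
# so missingness depends only on the fully observed x
y_mar <- y
y_mar[x > 0 & runif(n) < 0.8] <- NA

mean_full <- mean(y)
mean_mcar <- mean(y_mcar, na.rm = TRUE)  # close to mean_full
mean_mar  <- mean(y_mar, na.rm = TRUE)   # pulled toward cases with low x

# Slope of y on x estimated from the MAR complete cases (lm drops NAs by default)
slope_mar <- coef(lm(y_mar ~ x))[["x"]]
```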

Assumptions 2
• Ignorable – MAR, plus the parameters governing the missing data process are unrelated to the parameters being estimated
• Nonignorable – if the data are not MAR, the missing data mechanism must be modeled to get good estimates of the parameters

Methods
1. Listwise Deletion
2. Pairwise Deletion
3. Dummy Variable Adjustment
4. Imputation

Listwise Deletion 1
• Delete any samples (cases) with missing data
  – Can be used for any statistical analysis
  – No special computational methods
• If data are MCAR, the complete cases are effectively a random sample of the full data set, so they give unbiased estimates for the full data set

Listwise Deletion 2
• If data are MAR, listwise deletion can produce biased estimates when missingness in the independent variables depends on the dependent variable
• The main issue is the loss of observations and the resulting increase in standard errors (meaning a decrease in the power of the test)

Listwise Deletion 3
• In anthropology listwise deletion often includes removal of variables (columns) as well as cases (rows)
• Finding an optimal complete data set involves removing variables with many missing values and then removing the rows that still have missing values

Pairwise Deletion 1
• Compute means using all available data, and covariances using the cases with observations for the pair being computed
• Uses more of the data
• If MCAR, reasonably unbiased estimates, but if MAR, estimates may be seriously biased

Pairwise Deletion 2
• Covariance/correlation matrix may be singular or not positive definite, because each entry is computed from a different subset of cases
• Less of an issue with distance matrices
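
A deliberately extreme toy example shows how this can happen. The data frame below is made up for illustration: each pair of variables is observed on a different block of three cases, so the three pairwise correlations are mutually inconsistent and the resulting matrix has a negative eigenvalue.

```r
# Each pair of variables is jointly observed on a different block of rows
d <- data.frame(
  x = c(1, 2, 3, 1, 2, 3, NA, NA, NA),
  y = c(1, 2, 3, NA, NA, NA, 1, 2, 3),
  z = c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
)

# Pairwise deletion: every correlation uses only the rows complete for that pair
r_pair <- cor(d, use = "pairwise.complete.obs")

# A valid correlation matrix cannot have negative eigenvalues; this one does
eig <- eigen(r_pair, symmetric = TRUE)$values
```

Here x and y correlate perfectly, x and z correlate perfectly, yet y and z are perfectly negatively correlated, which no single complete data set could produce.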

Dummy Variable
• Create a variable to flag observations missing on a particular variable
• Used in regression analysis but provides biased estimators
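
A minimal sketch of the adjustment on simulated data follows. The variable names (x_missing, x_filled) and the choice of the observed mean as the fill-in constant are assumptions made for the example; the point is only to show the mechanics, not to recommend the method.

```r
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 2 + 0.5 * d$x + rnorm(50)
d$x[sample(50, 10)] <- NA          # make 10 values of x missing

# Flag the missing observations, then fill them with a constant
d$x_missing <- as.integer(is.na(d$x))
d$x_filled  <- ifelse(is.na(d$x), mean(d$x, na.rm = TRUE), d$x)

# All 50 cases are retained; the flag enters the model as an extra predictor
fit <- lm(y ~ x_filled + x_missing, data = d)
```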

Imputation
• Replace missing values with an estimate:
  1. Mean for that variable – biased estimates of variances and covariances
  2. Multiple regression to predict the value – complicated when multiple variables contain missing values, and can still lead to underestimated standard errors
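
The variance bias from mean imputation is easy to demonstrate. In this sketch (simulated data, with 30 of 100 values removed as an assumption for illustration) the imputed values sit exactly at the mean, so the sum of squares is unchanged while the divisor grows from 69 to 99.

```r
set.seed(2)
x <- rnorm(100, mean = 10, sd = 2)
x[sample(100, 30)] <- NA           # 30 values missing, 70 observed

# Replace every missing value with the mean of the observed values
x_imp <- x
x_imp[is.na(x_imp)] <- mean(x, na.rm = TRUE)

var_observed <- var(x, na.rm = TRUE)  # based on the 70 observed values
var_imputed  <- var(x_imp)            # shrunk by the factor 69/99
```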

Maximum Likelihood
• Try to reconstruct the complete data set by selecting values that would maximize the probability of observing the actually observed data
• Categorical and continuous data
• Expectation-maximization algorithm gives estimates of means and covariances

Expectation Maximization
• Iterative steps of expectation and maximization to produce estimates that converge on the ML estimates
• These estimates will generally underestimate the standard errors in regression and other statistical models
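
The iteration can be sketched for the simplest case: a bivariate normal model in which x is fully observed and only y has missing values. The function name em_bvn, the simulated data, and the fixed number of iterations are assumptions for illustration; real implementations handle general missing-data patterns and check convergence.

```r
set.seed(4)
n <- 500
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.6)
y[x > 0.5 & runif(n) < 0.7] <- NA   # MAR: missingness depends only on x

em_bvn <- function(x, y, iters = 200) {
  miss <- is.na(y)
  mu_x <- mean(x)                   # x is complete, so these two are fixed
  s_xx <- mean((x - mu_x)^2)
  mu_y <- mean(y, na.rm = TRUE)     # starting values from the complete cases
  s_yy <- var(y, na.rm = TRUE)
  s_xy <- cov(x, y, use = "complete.obs")
  for (i in seq_len(iters)) {
    # E-step: expected y and y^2 for missing cases, given x and current estimates
    beta <- s_xy / s_xx
    resid_var <- s_yy - beta * s_xy
    ey <- y
    ey[miss] <- mu_y + beta * (x[miss] - mu_x)
    ey2 <- ey^2
    ey2[miss] <- ey2[miss] + resid_var
    # M-step: update the estimates from the expected sufficient statistics
    mu_y <- mean(ey)
    s_xy <- mean(x * ey) - mu_x * mu_y
    s_yy <- mean(ey2) - mu_y^2
  }
  c(mu_x = mu_x, mu_y = mu_y, s_xx = s_xx, s_xy = s_xy, s_yy = s_yy)
}

est <- em_bvn(x, y)
```

The result is a set of point estimates (means, variances, covariance) only; it carries no information about the extra uncertainty caused by the missing values, which is why standard errors computed from it tend to be too small.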

Multiple Imputation 1
• Has the same optimal properties as ML, with several advantages
• Can be used with any kind of data and any kind of statistical model
• But produces multiple estimates, which must be combined
• A random component is used to give unbiased estimates

Multiple Imputation 2
• Multivariate normal model (relatively resistant to deviations from normality)
• Each variable represented as a linear function of the other variables
• Methods
  – Data Augmentation, package norm
  – Sampling Importance/Resampling, package Amelia

Multiple Imputation 3
• Categorical data, multinomial model, package cat
• Categorical and interval/ratio data, package mix
• Also can use multivariate normal models with dummy variables

Multiple Imputation 4
• Predictive mean matching – use regression to predict values for a particular variable. Find complete cases that have predictions similar to the case with a missing value on that variable and randomly select one of their actual values (package Hmisc, function aregImpute)
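
The matching idea can be sketched in base R. This is a simplified illustration, not the algorithm used by aregImpute: the function name pmm_impute, the single predictor x, and the donor pool size k = 5 are all assumptions made for the example.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 20)] <- NA

pmm_impute <- function(y, x, k = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x, subset = obs)                     # regression on complete cases
  pred <- predict(fit, newdata = data.frame(x = x))  # predictions for every case
  y_imp <- y
  for (i in which(!obs)) {
    # The k complete cases whose predictions are closest to this case's prediction
    donors <- which(obs)[order(abs(pred[obs] - pred[i]))[seq_len(k)]]
    # Impute the actual observed value of one randomly chosen donor
    y_imp[i] <- y[donors[sample.int(k, 1)]]
  }
  y_imp
}

y_completed <- pmm_impute(y, x)
```

Because the imputed values are borrowed from real cases, they always fall within the range of the observed data.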

Analysis
• The analysis is run on each imputed data set and the estimates (e.g. regression coefficients) are combined
• Packages such as Zelig provide ways of combining the results across imputed data sets for generalized linear models
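
The usual way to combine the estimates is Rubin's rules: average the point estimates, and add the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The helper pool_rubin and the three example estimates below are hypothetical, written only to show the arithmetic for a single coefficient.

```r
# est: the coefficient estimated from each of the m imputed data sets
# se:  the corresponding standard errors
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar   <- mean(est)        # pooled point estimate
  within  <- mean(se^2)       # average within-imputation variance
  between <- var(est)         # variance of the estimates across imputations
  total   <- within + (1 + 1 / m) * between
  c(estimate = q_bar, se = sqrt(total))
}

# Made-up results from m = 3 imputed data sets
pooled <- pool_rubin(est = c(1, 2, 3), se = c(1, 1, 1))
```

The pooled standard error is larger than any of the individual standard errors because it includes the disagreement between the imputed data sets.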

Missing Data with R 1
• NA is used to identify a missing value
• is.na() is used to test for a missing value: is.na(c(1:4, NA, 6:10))
• na.omit(dataframe) will delete all cases with missing data (Rcmdr: Data | Active data set | Remove cases with missing values)
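
A short example of both functions, using a small made-up data frame (df and its columns are assumptions for illustration):

```r
v <- c(1:4, NA, 6:10)
is.na(v)                  # logical vector, TRUE only in position 5
na_position <- which(is.na(v))

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
df_complete <- na.omit(df)   # keeps only the rows with no missing values
```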

Missing Data with R 2
• Some functions have an na.rm= option. TRUE means remove the missing values before computing; FALSE (usually the default) means do not remove them, so the function returns NA if there are missing values.
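
For example, with mean() on a made-up vector:

```r
x <- c(2, 4, NA, 6)

mean_default <- mean(x)               # NA, because na.rm defaults to FALSE
mean_narm    <- mean(x, na.rm = TRUE) # mean of the three observed values
```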

Missing Data in R 3
• Other functions (e.g. lm, princomp, glm) have an na.action= option that can be set to one of the following: na.fail, na.omit, or na.exclude, to remove cases (omit, exclude) or have the analysis fail (fail)
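
The three settings can be compared with lm() on a small made-up data frame. The practical difference between na.omit and na.exclude shows up in residuals and fitted values: na.exclude pads them with NA so they line up with the original rows.

```r
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(2.1, NA, 6.2, 7.9, NA, 12.1))

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

n_omit <- length(residuals(fit_omit))   # 4: only the complete cases
n_excl <- length(residuals(fit_excl))   # 6: NA in the positions of dropped cases

# na.fail stops with an error as soon as a missing value is found
failed <- inherits(
  try(lm(y ~ x, data = d, na.action = na.fail), silent = TRUE),
  "try-error"
)
```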

Missing Data in R 4
• Other functions (e.g. cor, cov, var) have a use= option:
  – "everything" (NAs propagate)
  – "all.obs" (an NA causes an error)
  – "complete.obs" (delete cases with NAs; error if no complete cases remain)
  – "na.or.complete" (delete cases with NAs; returns NA if no complete cases remain)
  – "pairwise.complete.obs" (use the complete pairs of observations for each pair of variables)
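
The options can be compared on a small made-up data frame (the column names are assumptions for illustration):

```r
d <- data.frame(length = c(1, 2, 3, 4, 5),
                width  = c(2, 1, 4, NA, 5),
                depth  = c(5, 3, NA, 2, 1))

r_all      <- cor(d, use = "everything")             # NA wherever a column has NAs
r_complete <- cor(d, use = "complete.obs")           # uses rows 1, 2 and 5 only
r_pair     <- cor(d, use = "pairwise.complete.obs")  # different rows for each pair

# "all.obs" refuses to run when any value is missing
all_obs_fails <- inherits(try(cor(d, use = "all.obs"), silent = TRUE), "try-error")
```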

Example 1
• The Ernest.Witte data set has missing values among its 242 cases and 38 variables
• Using R to remove all cases with missing values reduces the number of cases to 52!
• If we don’t need all of the variables we can retain more cases

Example 2
• Total NAs in Ernest.Witte (815):
  sum(is.na(Ernest.Witte))
• Check missing values by variable:
  sort(apply(Ernest.Witte, 2, function(x) sum(is.na(x))), decreasing=TRUE)
• Looking has 171, Skull.Pos 126, Depos 112
• Removing these three variables gives 112 complete cases
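
The same workflow can be tried on any data frame. The sketch below uses a tiny made-up data frame, burials, as a stand-in (it is not the Ernest Witte data, and its columns and values are invented); colSums(is.na()) gives the same per-variable counts as the apply() call.

```r
burials <- data.frame(
  age     = c(30, NA, 45, 22, 60, 35),
  looking = c(NA, NA, "E", NA, NA, "W"),
  depth   = c(1.2, 0.8, NA, 1.1, 0.9, 1.0)
)

total_na  <- sum(is.na(burials))                              # all NAs in the data frame
na_by_var <- sort(colSums(is.na(burials)), decreasing = TRUE) # NAs per variable

n_all <- nrow(na.omit(burials))        # complete cases using every variable

# Drop the variable with the most missing values, then count complete cases again
reduced   <- burials[, names(burials) != "looking"]
n_reduced <- nrow(na.omit(reduced))
```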

Multiple Imputation with R
• A wide variety of options:
  – Packages norm, cat, mix
  – Package Amelia
  – Package mi (relatively new, but flexible)