Missing Values: Adapting to missing data

Sources of Missing Data
• People refuse to answer a question
• Responses are indistinct or ambiguous
• Numeric data are obviously wrong
• Broken objects cannot be measured
• Equipment failure or malfunction
• Detailed analysis of subsample

Assumptions 1
• Missing Completely at Random (MCAR) – the probability of data missing on X is unrelated to the value of X or to the values of other variables in the data set
• Missing at Random (MAR) – the probability of data missing on X is unrelated to the value of X after controlling for other variables in the analysis
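
The difference between the two assumptions can be illustrated with a small simulation. This is only a sketch under assumed settings: the variables x and z, the 30% missing rate, and the logistic missingness model are invented for illustration and do not come from the slides.

```r
set.seed(42)
n <- 10000
z <- rnorm(n)        # fully observed covariate (hypothetical)
x <- z + rnorm(n)    # variable that will receive missing values; true mean is 0

# MCAR: every value of x has the same 30% chance of being missing
x_mcar <- ifelse(runif(n) < 0.3, NA, x)

# MAR: the chance that x is missing depends only on the observed z
x_mar <- ifelse(runif(n) < plogis(2 * z), NA, x)

means <- c(full = mean(x),
           mcar = mean(x_mcar, na.rm = TRUE),
           mar  = mean(x_mar, na.rm = TRUE))
print(round(means, 3))
```

Under MCAR the mean of the observed values stays close to the full-data mean. Under MAR the unadjusted mean drifts, because cases with high z are more likely to be missing; an analysis that controls for z would not have this problem.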

Assumptions 2
• Ignorable – MAR, plus the parameters governing the missing data process are unrelated to the parameters being estimated
• Nonignorable – if the data are not MAR, the missing data mechanism must be modeled to get good estimates of the parameters

Methods
1. Listwise Deletion
2. Pairwise Deletion
3. Dummy Variable Adjustment
4. Imputation

Listwise Deletion 1
• Delete any cases with missing data
  – Can be used for any statistical analysis
  – No special computational methods
• If data are MCAR, the complete cases are in effect a random sample of the full data set, so they give unbiased estimates for the full data set

Listwise Deletion 2
• If data are MAR, estimates can be biased when missingness in the independent variables depends on the dependent variable
• The main issue is the loss of observations and the resulting increase in standard errors (that is, a decrease in the power of the test)

Listwise Deletion 3
• In anthropology, listwise deletion often includes removal of variables (columns) as well as cases (rows)
• Finding an optimal complete data set involves removing variables with many missing values and then removing the rows that still have missing values
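
The column-then-row strategy can be sketched in base R. The toy data frame and the 50% cutoff for dropping a variable are assumptions made for illustration; the slides do not prescribe a threshold.

```r
# Toy data: variable b is mostly missing, a and c have one NA each
df <- data.frame(
  a = c(1, 2, NA, 4, 5, 6),
  b = c(NA, NA, NA, NA, 5, 6),
  c = c(1, 2, 3, 4, NA, 6)
)

# Plain listwise deletion keeps only rows with no NA in any column
n_listwise <- nrow(na.omit(df))

# First drop variables whose share of NAs exceeds the (assumed) cutoff ...
na_share <- colMeans(is.na(df))
keep <- na_share <= 0.5
# ... then drop the rows that still contain missing values
reduced <- na.omit(df[, keep, drop = FALSE])

print(n_listwise)     # complete cases using all variables
print(nrow(reduced))  # complete cases after dropping sparse variables
```

Whether losing a variable is worth the extra cases depends on how much that variable matters to the analysis.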

Pairwise Deletion 1
• Compute means using all available data, and covariances using the cases with observations for the pair being computed
• Uses more of the data
• If MCAR, estimates are reasonably unbiased, but if MAR, estimates may be seriously biased

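Pairwise Deletion 2
• Covariance/correlation matrix may be singular (each entry can be based on a different subset of cases, so the matrix need not be positive definite)
• Less of an issue with distance matrices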
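
A short sketch shows how pairwise deletion uses a different set of cases for each entry of the matrix. The matrix X is made-up data; only base R functions are used.

```r
# Toy data with one NA in each column, in different rows
X <- cbind(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(2, NA, 4, 5, 6, 8),
  z = c(1, 3, NA, 4, 5, 7)
)

# Number of cases available for each pair of variables
observed <- ifelse(is.na(X), 0, 1)
n_pairs <- crossprod(observed)
print(n_pairs)

# Number of cases that listwise deletion would keep
n_complete <- sum(complete.cases(X))
print(n_complete)

# Pairwise-deletion correlation matrix
r_pairwise <- cor(X, use = "pairwise.complete.obs")

# A negative eigenvalue would indicate a matrix that is not positive semidefinite
print(eigen(r_pairwise, symmetric = TRUE)$values)
```

Each off-diagonal entry here rests on four cases, while listwise deletion would leave only three.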

Dummy Variable Adjustment
• Create a variable to flag the observations that are missing on a particular variable
• Used in regression analysis, but produces biased estimates
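
A minimal sketch of the dummy variable adjustment follows. The simulated data are hypothetical, and filling the missing values with the observed mean is one common choice of constant rather than something the slides specify.

```r
set.seed(3)
n <- 100
x <- rnorm(n, mean = 50, sd = 10)
y <- 10 + 0.5 * x + rnorm(n, sd = 5)
x[sample(n, 25)] <- NA     # make 25 values of the predictor missing

# Flag variable: 1 where x is missing, 0 otherwise
x_missing <- as.integer(is.na(x))

# Fill the gaps with a constant (here the observed mean) so every case can be used
x_filled <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Regression on the filled predictor plus the flag
fit <- lm(y ~ x_filled + x_missing)
print(coef(fit))
```

All 100 cases enter the regression, but as the slide notes, the resulting coefficients are generally biased.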

Imputation
• Replace missing values with an estimate:
1. Mean for that variable – gives biased estimates of variances and covariances
2. Multiple regression to predict the value – complicated when several variables contain missing values, and can still lead to underestimated standard errors
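
The variance bias from mean imputation is easy to see on a tiny made-up vector:

```r
x <- c(2, 4, NA, 8, NA, 10)

# Replace each NA with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

var_observed <- var(x, na.rm = TRUE)   # variance of the 4 observed values
var_imputed  <- var(x_imputed)         # variance after mean imputation

print(c(observed = var_observed, imputed = var_imputed))
```

The imputed values sit exactly at the mean, so they add nothing to the sum of squares while increasing the count, and the variance shrinks.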

Maximum Likelihood
• Estimate the parameters of the complete data set by selecting the values that would maximize the probability of observing the data actually observed
• Works with categorical and continuous data
• The expectation-maximization algorithm gives estimates of means and covariances

Expectation Maximization
• Iterative steps of expectation and maximization produce estimates that converge on the ML estimates
• Standard errors in regression and other statistical models fitted from these estimates will generally be underestimated
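
The iteration can be sketched for the simplest case: two normally distributed variables, with x fully observed and y partly missing. The simulated data, the starting values, and the fixed count of 200 iterations are assumptions for illustration; packages such as norm implement the general multivariate case.

```r
set.seed(1)
n <- 200
x <- rnorm(n, mean = 10, sd = 2)
y <- 5 + 0.8 * x + rnorm(n)
y[sample(n, 60)] <- NA
miss <- is.na(y)

# x is complete, so its ML mean and variance are available directly
mu_x <- mean(x)
s_xx <- mean((x - mu_x)^2)

# Starting values for the y parameters from the complete cases
mu_y <- mean(y, na.rm = TRUE)
s_yy <- var(y, na.rm = TRUE)
s_xy <- cov(x[!miss], y[!miss])

for (iter in 1:200) {
  # E-step: expected y and y^2 for missing cases given x and current parameters
  beta <- s_xy / s_xx
  resid_var <- s_yy - s_xy^2 / s_xx
  ey  <- ifelse(miss, mu_y + beta * (x - mu_x), y)
  ey2 <- ifelse(miss, ey^2 + resid_var, y^2)

  # M-step: update the parameters from the expected sufficient statistics
  mu_y <- mean(ey)
  s_yy <- mean(ey2) - mu_y^2
  s_xy <- mean(x * ey) - mu_x * mu_y
}

print(c(mu_y = mu_y, s_yy = s_yy, s_xy = s_xy))
```

For this particular missing-data pattern the ML estimate of the mean of y has a closed form (the complete-case regression line evaluated at the mean of all x values), which gives a convenient check on the iteration.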

Multiple Imputation 1
• Has the same optimal properties as ML, plus several advantages
• Can be used with any kind of data and any kind of statistical model
• But produces multiple estimates, which must be combined
• A random component is used to give unbiased estimates

Multiple Imputation 2
• Multivariate normal model (relatively resistant to deviations from normality)
• Each variable is represented as a linear function of the other variables
• Methods
  – Data Augmentation, package norm
  – Sampling Importance/Resampling, package Amelia

Multiple Imputation 3
• Categorical data: multinomial model, package cat
• Categorical and interval/ratio data: package mix
• Can also use multivariate normal models with dummy variables

Multiple Imputation 4
• Predictive mean matching – use regression to predict values for a particular variable. Find complete cases that have predictions similar to the case with a missing value on that variable and randomly select one of their actual values (package Hmisc, function aregImpute)
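
The matching step can be sketched in base R. This is a deliberately simplified, single-imputation version and not how aregImpute works internally; the simulated data and the choice of k = 3 candidate donors are assumptions for illustration.

```r
set.seed(7)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x + rnorm(n)
y[sample(n, 10)] <- NA
miss <- is.na(y)

# Regression fitted on the complete cases (lm drops rows with NA by default)
fit <- lm(y ~ x)
pred <- predict(fit, newdata = data.frame(x = x))

k <- 3                       # number of candidate donors (assumed)
donors <- which(!miss)
y_imputed <- y

for (i in which(miss)) {
  # Donors whose predicted value is closest to the prediction for case i
  distance <- abs(pred[donors] - pred[i])
  nearest <- donors[order(distance)[1:k]]
  # Borrow the actual observed value of one randomly chosen donor
  y_imputed[i] <- y[nearest[sample.int(k, 1)]]
}

print(sum(is.na(y_imputed)))   # no missing values remain
```

Because every imputed value is copied from a real case, the imputations always fall within the range of the observed data. A full multiple imputation would repeat this with added randomness in the regression itself to produce several completed data sets.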

Analysis
• The analysis is run on each imputed data set and the estimates (e.g., regression coefficients) are combined
• Packages such as Zelig provide ways of combining the imputed data sets for generalized linear models
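
The slides do not spell out how the estimates are combined. The usual approach, commonly called Rubin's rules, averages the estimates and adds the between-imputation variance to the average within-imputation variance. The coefficient values and standard errors below are made-up numbers standing in for results from five imputed data sets.

```r
# One regression coefficient and its standard error from each of m = 5 imputations
estimates  <- c(0.52, 0.48, 0.55, 0.50, 0.45)
std_errors <- c(0.10, 0.11, 0.10, 0.12, 0.10)
m <- length(estimates)

pooled_estimate <- mean(estimates)       # combined point estimate
within_var  <- mean(std_errors^2)        # average sampling variance
between_var <- var(estimates)            # variance across imputations
total_var   <- within_var + (1 + 1 / m) * between_var
pooled_se   <- sqrt(total_var)

print(c(estimate = pooled_estimate, se = pooled_se))
```

The pooled standard error is larger than the average of the individual standard errors because it also reflects the uncertainty caused by the missing data.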

Missing Data with R 1
• NA is used to identify a missing value
• is.na() is used to test for a missing value: is.na(c(1:4, NA, 6:10))
• na.omit(dataframe) will delete all cases with missing data (Rcmdr: Data | Active data set | Remove cases with missing values)
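
A short session illustrates these functions. The data frame is a made-up example with hypothetical column names.

```r
v <- c(1:4, NA, 6:10)
print(is.na(v))            # logical vector, TRUE where the value is missing
na_position <- which(is.na(v))
print(na_position)

measurements <- data.frame(
  id     = 1:4,
  length = c(10.5, NA, 12.1, 9.8),
  width  = c(3.2, 3.0, NA, 2.9)
)

# Keep only the rows with no missing values in any column
complete <- na.omit(measurements)
print(complete)
```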

Missing Data with R 2
• Some functions have an na.rm= option. TRUE means remove missing values before computing; FALSE means do not remove them, so the function returns NA if there are missing values.
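
For example, with mean() on a small made-up vector:

```r
x <- c(4, 8, NA, 12)

mean_default <- mean(x)                # na.rm defaults to FALSE for mean()
mean_removed <- mean(x, na.rm = TRUE)  # computed from the 3 observed values

print(mean_default)
print(mean_removed)
```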

Missing Data with R 3
• Other functions (e.g., lm, princomp, glm) have an na.action= option that can be set to one of the following: na.fail, na.omit, or na.exclude. na.omit and na.exclude remove cases with missing values; na.fail makes the analysis fail.
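
The practical difference between na.omit and na.exclude shows up in the residuals and fitted values. A sketch with made-up data:

```r
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1)
)

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# na.omit: residuals only for the cases used in the fit
n_resid_omit <- length(residuals(fit_omit))
# na.exclude: residuals padded with NA so they line up with the original rows
n_resid_excl <- length(residuals(fit_excl))
print(c(omit = n_resid_omit, exclude = n_resid_excl))

# na.fail: the call stops with an error when any NA is present
result_fail <- try(lm(y ~ x, data = d, na.action = na.fail), silent = TRUE)
print(inherits(result_fail, "try-error"))
```

Both fits use the same five cases and give the same coefficients; na.exclude is convenient when residuals need to be added back to the original data frame.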

Missing Data with R 4
• Other functions (e.g., cor, cov, var) have a use= option:
  – "everything" (NAs propagate)
  – "all.obs" (an NA causes an error)
  – "complete.obs" (delete cases with NAs; error if there are no complete cases)
  – "na.or.complete" (delete cases with NAs; returns NA if there are no complete cases)
  – "pairwise.complete.obs" (use complete pairs of observations)
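
The options can be compared on two short made-up vectors:

```r
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2, 1, 4, 3, 6, 5)

r_everything <- cor(x, y, use = "everything")            # NA propagates
r_complete   <- cor(x, y, use = "complete.obs")          # drops case 5
r_pairwise   <- cor(x, y, use = "pairwise.complete.obs")

# "all.obs" stops with an error because x contains an NA
r_all_obs <- try(cor(x, y, use = "all.obs"), silent = TRUE)

print(c(everything = r_everything, complete = r_complete, pairwise = r_pairwise))
print(inherits(r_all_obs, "try-error"))
```

With only two variables, "complete.obs" and "pairwise.complete.obs" use the same cases and agree; they differ once there are three or more variables with NAs in different rows.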

Example 1
• The Ernest.Witte data set has missing values among the 242 cases and 38 variables
• Using R to remove all cases with missing values reduces the number of cases to 52!
• If we don’t need all of the variables we can retain more cases

Example 2
• Total NAs in Ernest.Witte (815):
  sum(is.na(Ernest.Witte))
• Check missing values by variable:
  sort(apply(Ernest.Witte, 2, function(x) sum(is.na(x))), decreasing=TRUE)
• Looking has 171, Skull.Pos 126, Depos 112
• Removing these variables leaves 112 complete cases

Multiple Imputation with R
• A wide variety of options:
  – Packages norm, cat, mix
  – Package Amelia
  – Package mi (relatively new, but flexible)