- Slides: 26
Missing Values: Adapting to missing data
Sources of Missing Data
• People refuse to answer a question
• Responses are indistinct or ambiguous
• Numeric data are obviously wrong
• Broken objects cannot be measured
• Equipment failure or malfunction
• Detailed analysis of subsample
Assumptions 1
• Missing Completely at Random (MCAR) – the probability that data are missing on X is unrelated to the value of X or to the values of other variables in the data set
• Missing at Random (MAR) – the probability that data are missing on X is unrelated to the value of X after controlling for other variables in the analysis
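The difference between the two mechanisms is easier to see on simulated data. The sketch below is illustrative only: the variables age and income, the 20% missing rate, and the logistic missingness rule are all assumptions made for the example, not part of the original slides.

```r
set.seed(1)
n      <- 1000
age    <- rnorm(n, mean = 40, sd = 10)
income <- 20 + 0.5 * age + rnorm(n, sd = 5)

# MCAR: every income value has the same 20% chance of being missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: the chance of a missing income depends only on the observed variable age
p_miss     <- plogis((age - 40) / 5)   # assumed rule: older cases are missing more often
income_mar <- ifelse(runif(n) < p_miss, NA, income)

# Compare the mean of the observed values with the full-data mean
means <- c(full = mean(income),
           mcar = mean(income_mcar, na.rm = TRUE),
           mar  = mean(income_mar,  na.rm = TRUE))
print(round(means, 2))
```

Under MCAR the observed mean should stay close to the full-data mean. Under MAR it drifts, because the observed cases are no longer a random subsample unless age is controlled for.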
Assumptions 2
• Ignorable – MAR, plus the parameters governing the missing data process are unrelated to the parameters being estimated
• Nonignorable – if the data are not MAR, the missing data mechanism must be modeled to get good estimates of the parameters
Methods
1. Listwise Deletion
2. Pairwise Deletion
3. Dummy Variable Adjustment
4. Imputation
Listwise Deletion 1
• Delete any cases with missing data
  – Can be used for any statistical analysis
  – No special computational methods
• If data are MCAR (so that the complete cases are effectively a random sample of the full data set), estimates from the complete cases are unbiased
Listwise Deletion 2
• If data are MAR, estimates can be biased when missingness in the independent variables depends on the dependent variable
• The main issue is the loss of observations and the resulting increase in standard errors (that is, a decrease in the power of the test)
Listwise Deletion 3
• In anthropology, listwise deletion often includes removal of variables (columns) as well as cases (rows)
• Finding an optimal complete data set involves removing variables with many missing values and then removing the rows that still have missing values
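A minimal sketch of that column-then-row procedure, using a made-up data frame (the variable names v1 to v6 and their missing rates are assumptions for illustration). Dropping the two worst variables is an arbitrary choice here; in practice the cutoff depends on which variables the analysis needs.

```r
set.seed(2)
# Hypothetical data: 100 cases, 6 variables with different missing rates
n     <- 100
rates <- c(v1 = 0.00, v2 = 0.02, v3 = 0.05, v4 = 0.05, v5 = 0.40, v6 = 0.60)
dat   <- as.data.frame(lapply(rates, function(p) {
  x <- rnorm(n)
  x[runif(n) < p] <- NA
  x
}))

# Missing values per variable, largest first
na_by_var <- sort(colSums(is.na(dat)), decreasing = TRUE)
print(na_by_var)

# Listwise deletion using all variables
n_all <- sum(complete.cases(dat))

# Drop the two variables with the most missing values, then delete incomplete rows
keep      <- setdiff(names(dat), names(na_by_var)[1:2])
reduced   <- na.omit(dat[, keep])
n_reduced <- nrow(reduced)

cat("Complete cases, all variables:", n_all, "\n")
cat("Complete cases, reduced set:  ", n_reduced, "\n")
```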
Pairwise Deletion 1
• Compute means using all available data, and each covariance using the cases with observations on both variables of the pair
• Uses more of the data
• If MCAR, estimates are reasonably unbiased; if MAR, estimates may be seriously biased
Pairwise Deletion 2
• Covariance/correlation matrix may be singular or not positive definite, because each entry is computed from a different set of cases
• Less of an issue with distance matrices
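The sketch below compares listwise and pairwise correlations on simulated data (the matrix x and its columns a, b, c are invented for the example). It also counts how many cases sit behind each pairwise entry and checks the smallest eigenvalue, which is one way to see whether the pairwise matrix is still a valid correlation matrix.

```r
set.seed(3)
n <- 60
x <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
for (v in colnames(x)) x[sample(n, 15), v] <- NA   # 15 missing values per variable

r_list <- cor(x, use = "complete.obs")             # listwise deletion
r_pair <- cor(x, use = "pairwise.complete.obs")    # pairwise deletion

# Number of cases used for each pair of variables
n_pair <- crossprod(1 - is.na(x))
print(n_pair)
cat("Cases used by listwise deletion:", sum(complete.cases(x)), "\n")

# A negative smallest eigenvalue would mean the matrix is not positive semidefinite
min_eig <- min(eigen(r_pair, symmetric = TRUE, only.values = TRUE)$values)
cat("Smallest eigenvalue of pairwise matrix:", round(min_eig, 3), "\n")
```

With random data like this the pairwise matrix is usually fine; the eigenvalue check matters most when missingness is heavy and the variables are strongly correlated.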
Dummy Variable Adjustment
• Create a dummy variable to flag the observations that are missing on a particular variable
• Used in regression analysis, but provides biased estimators
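A small simulation can show where the bias comes from. Everything here is an assumption made for illustration: two correlated predictors x1 and x2, true coefficients of 1 for both, and half of x2 deleted completely at random. Missing x2 values are filled with the observed mean and flagged with the dummy variable miss.

```r
set.seed(4)
n  <- 2000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + sqrt(1 - 0.7^2) * rnorm(n)   # correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)                  # true coefficient of x1 is 1

x2_obs <- x2
x2_obs[sample(n, n / 2)] <- NA                # MCAR: half of x2 is missing

# Dummy variable adjustment: flag the missing cases and fill in a constant
miss    <- as.integer(is.na(x2_obs))
x2_fill <- ifelse(is.na(x2_obs), mean(x2_obs, na.rm = TRUE), x2_obs)

fit_dummy    <- lm(y ~ x1 + x2_fill + miss)
fit_listwise <- lm(y ~ x1 + x2_obs)           # lm drops incomplete cases by default

b_dummy    <- coef(fit_dummy)[["x1"]]
b_listwise <- coef(fit_listwise)[["x1"]]
cat("x1 coefficient, dummy adjustment: ", round(b_dummy, 2), "\n")
cat("x1 coefficient, listwise deletion:", round(b_listwise, 2), "\n")
```

In the flagged cases x2 is a constant, so x1 absorbs part of the effect of x2 and its coefficient is pulled away from the true value, even though the data are MCAR.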
Imputation
• Replace missing values with an estimate:
  1. Mean for that variable – gives biased estimates of variances and covariances
  2. Multiple regression to predict the value – complicated when several variables contain missing values, and can still lead to underestimated standard errors
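Both single-imputation approaches can be tried in a few lines of base R. The data below are simulated and the variable names are placeholders; the point is only to compare the variance of the filled-in variable under each approach.

```r
set.seed(5)
n <- 500
y <- rnorm(n, mean = 50, sd = 10)
x <- 0.8 * y + rnorm(n, sd = 5)

x_miss <- x
x_miss[sample(n, 150)] <- NA
na <- is.na(x_miss)

# 1. Mean imputation: every missing value gets the observed mean
x_mean <- replace(x_miss, na, mean(x_miss, na.rm = TRUE))

# 2. Regression imputation: predict x from y using the complete cases
fit   <- lm(x_miss ~ y)
x_reg <- replace(x_miss, na, predict(fit, newdata = data.frame(y = y[na])))

print(round(c(observed           = var(x_miss, na.rm = TRUE),
              mean_imputed       = var(x_mean),
              regression_imputed = var(x_reg)), 1))
```

Mean imputation always shrinks the variance, because the filled-in values add nothing to the sum of squares. Regression imputation places every imputed value exactly on the fitted line, so the imputed data look more certain than they are, which is why standard errors come out too small.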
Maximum Likelihood
• Estimate the parameters by selecting the values that would maximize the probability of observing the data actually observed
• Works with categorical and continuous data
• The expectation-maximization algorithm gives estimates of means and covariances
Expectation Maximization
• Iterative steps of expectation and maximization produce estimates that converge on the ML estimates
• Standard errors in regression and other statistical models fitted from these estimates will generally be underestimated
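To make the two steps concrete, here is a hand-rolled sketch of EM for the simplest case: two normally distributed variables where only y has missing values. This is a teaching example on simulated data under those stated assumptions, not the algorithm as implemented in any particular package.

```r
set.seed(6)
n     <- 500
x     <- rnorm(n, mean = 10, sd = 2)
y     <- 5 + 1.5 * x + rnorm(n, sd = 2)
miss  <- runif(n) < plogis(x - 10)     # MAR: missingness depends on x only
y_obs <- ifelse(miss, NA, y)

# Starting values from the complete cases
mu <- c(mean(x), mean(y_obs[!miss]))
S  <- cov(cbind(x, y_obs)[!miss, ])

for (iter in 1:1000) {
  # E-step: expected y and y^2 for the missing cases, given x and current estimates
  beta      <- S[1, 2] / S[1, 1]
  resid_var <- S[2, 2] - S[1, 2]^2 / S[1, 1]
  ey  <- ifelse(miss, mu[2] + beta * (x - mu[1]), y_obs)
  ey2 <- ifelse(miss, ey^2 + resid_var, y_obs^2)

  # M-step: update means and covariances from the expected sufficient statistics
  mu_new <- c(mean(x), mean(ey))
  s_xy   <- mean(x * ey) - mu_new[1] * mu_new[2]
  S_new  <- matrix(c(mean(x^2) - mu_new[1]^2, s_xy,
                     s_xy, mean(ey2) - mu_new[2]^2), 2, 2)

  delta <- max(abs(c(mu_new - mu, S_new - S)))
  mu <- mu_new
  S  <- S_new
  if (delta < 1e-8) break
}

cat("Iterations:", iter, "\n")
cat("Mean of y, complete cases:", round(mean(y_obs, na.rm = TRUE), 2), "\n")
cat("Mean of y, EM estimate:   ", round(mu[2], 2), "\n")
cat("Mean of y, full data:     ", round(mean(y), 2), "\n")
```

The loop returns point estimates only. It gives no standard errors, which is the limitation noted in the slide.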
Multiple Imputation 1
• Has the same optimal properties as ML, plus several advantages
• Can be used with any kind of data and any kind of statistical model
• But produces multiple estimates, which must be combined
• A random component is used to give unbiased estimates
Multiple Imputation 2
• Multivariate normal model (relatively resistant to deviations from normality)
• Each variable is represented as a linear function of the other variables
• Methods
  – Data Augmentation, package norm
  – Sampling Importance/Resampling, package Amelia
Multiple Imputation 3
• Categorical data: multinomial model, package cat
• Categorical and interval/ratio data: package mix
• Can also use multivariate normal models with dummy variables
Multiple Imputation 4
• Predictive mean matching – use regression to predict values for a particular variable. Find complete cases that have predictions similar to the case with a missing value on that variable and randomly select one of their actual values (package Hmisc, function aregImpute())
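The matching idea can be sketched in base R without any package. This is a simplified, single-imputation version on simulated data: the number of donors k = 5 is an assumed setting, and a full implementation such as aregImpute() adds further steps (for example, repeating the process to create several imputed data sets) that are left out here.

```r
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample(n, 40)] <- NA

obs  <- !is.na(y)
fit  <- lm(y ~ x)                                   # fitted on the complete cases
pred <- predict(fit, newdata = data.frame(x = x))   # predictions for every case

k     <- 5        # assumed number of candidate donors per missing value
y_imp <- y
for (i in which(!obs)) {
  # Complete cases whose predictions are closest to the prediction for case i
  d      <- abs(pred[obs] - pred[i])
  donors <- order(d)[1:k]
  # Borrow the actual observed value of one randomly chosen donor
  y_imp[i] <- sample(y[obs][donors], 1)
}

summary(y_imp)
```

Because every imputed value is copied from a real case, the imputations always fall within the range of the observed data.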
Analysis
• The analysis is run on each imputed data set and the estimates (e.g. regression coefficients) are combined
• Packages such as Zelig provide ways of combining the results for generalized linear models
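The slides do not say how the estimates are combined. The usual approach is Rubin's rules, so the sketch below assumes that: the pooled estimate is the average across imputations, and its variance adds the average within-imputation variance to the between-imputation variance. The function name pool_rubin and the five coefficient/standard error pairs are made up for illustration.

```r
# Combine one coefficient across m imputed data sets (Rubin's rules, assumed here)
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)                # pooled estimate
  w    <- mean(se^2)               # within-imputation variance
  b    <- var(est)                 # between-imputation variance
  tot  <- w + (1 + 1 / m) * b      # total variance
  c(estimate = qbar, se = sqrt(tot))
}

# Hypothetical results for one regression coefficient from m = 5 imputations
est <- c(1.92, 2.05, 1.98, 2.10, 1.95)
se  <- c(0.21, 0.20, 0.22, 0.19, 0.21)

pooled <- pool_rubin(est, se)
print(round(pooled, 3))
```

The pooled standard error is larger than the average of the individual standard errors because it also reflects the uncertainty caused by the missing data.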
Missing Data with R 1
• NA is used to identify a missing value
• is.na() is used to test for a missing value: is.na(c(1:4, NA, 6:10))
• na.omit(dataframe) will delete all cases with missing data (Rcmdr: Data | Active data set | Remove cases with missing values)
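A short session shows these functions at work. The data frame df and its columns are invented for the example.

```r
x <- c(1:4, NA, 6:10)
is.na(x)            # TRUE only in the fifth position
sum(is.na(x))       # number of missing values
which(is.na(x))     # position of the missing values

# A small made-up data frame with missing values in two columns
df <- data.frame(id     = 1:5,
                 length = c(10.2, NA, 9.8, 11.1, NA),
                 width  = c(3.1, 2.9, NA, 3.4, 3.0))

complete <- na.omit(df)   # keeps only the rows with no NA in any column
print(complete)
nrow(complete)
```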
Missing Data with R 2
• Some functions have an na.rm= option. TRUE means remove missing values before computing; FALSE means do not remove them, so the function returns NA if there are any missing values.
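For example, with a small made-up vector:

```r
x <- c(2, 4, NA, 8)

mean(x)                  # NA, because na.rm defaults to FALSE
mean(x, na.rm = TRUE)    # mean of the three observed values
sum(x, na.rm = TRUE)     # 14
sd(x, na.rm = TRUE)
```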
Missing Data with R 3
• Other functions (e.g. lm, princomp, glm) have an na.action= option that can be set to na.omit or na.exclude to remove cases with missing values, or to na.fail to make the analysis fail when missing values are present
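The practical difference between na.omit and na.exclude shows up in residuals and fitted values. The data frame below is made up for the example.

```r
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 y = c(1.1, NA, 2.9, 4.2, NA, 6.1))

fit_omit <- lm(y ~ x, data = df, na.action = na.omit)
fit_excl <- lm(y ~ x, data = df, na.action = na.exclude)

# Both fits use the same four complete cases, so the coefficients agree
coef(fit_omit)
coef(fit_excl)

length(residuals(fit_omit))   # 4: deleted cases are dropped
length(residuals(fit_excl))   # 6: deleted cases are padded with NA
residuals(fit_excl)

# na.fail stops with an error when any case has a missing value
failed <- inherits(try(lm(y ~ x, data = df, na.action = na.fail), silent = TRUE),
                   "try-error")
failed
```

na.exclude is convenient when residuals or fitted values need to be added back to the original data frame, because they line up with the original rows.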
Missing Data with R 4
• Other functions (e.g. cor, cov, var) have a use= option:
  – "everything" (NAs propagate)
  – "all.obs" (an NA causes an error)
  – "complete.obs" (delete cases with NAs; error if no complete cases remain)
  – "na.or.complete" (delete cases with NAs; returns NA if no complete cases remain)
  – "pairwise.complete.obs" (use the complete pairs of observations for each pair of variables)
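The options can be compared on a small made-up matrix with one missing value in column a and one in column c:

```r
m <- cbind(a = c(1, 2, 3, 4, 5, 6, NA),
           b = c(2, 1, 4, 3, 6, 5, 7),
           c = c(1, 3, 2, NA, 4, 6, 5))

r_all  <- cor(m)                                 # default use = "everything"
r_comp <- cor(m, use = "complete.obs")           # the 5 complete rows only
r_noc  <- cor(m, use = "na.or.complete")         # same result here
r_pair <- cor(m, use = "pairwise.complete.obs")  # each pair uses its own rows

r_all    # NA wherever a or c is involved
r_comp
r_pair   # the a-b entry uses rows 1 to 6, so it differs from r_comp

# "all.obs" stops with an error because m contains NAs
err <- inherits(try(cor(m, use = "all.obs"), silent = TRUE), "try-error")
err
```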
Example 1
• The Ernest.Witte data set has missing values among its 242 cases and 38 variables
• Using R to remove all cases with missing values reduces the number of cases to 52!
• If we don’t need all of the variables, we can retain more cases
Example 2
• Total NA’s in Ernest.Witte (815):
  – sum(is.na(Ernest.Witte))
• Check missing values by variable:
  – sort(apply(Ernest.Witte, 2, function(x) sum(is.na(x))), decreasing=TRUE)
• Looking has 171, Skull.Pos 126, Depos 112
• Removing these three variables gives 112 complete cases
Multiple Imputation with R
• A wide variety of options:
  – Packages norm, cat, mix
  – Package Amelia
  – Package mi (relatively new, but flexible)