Imputation is very useful for improving the accuracy of your parameter estimates in situations where a significant amount of data would otherwise be deleted. Consider a study with, say, 100 observations and four regressors, each with a 10% missing-observation rate: only 10% of the values are missing, but on average you'll be deleting about 34% of the observations if you drop every observation with one or more missing values, which is what happens if you just run the data through a standard regression package. You'll be deleting roughly 3.4 times as much data as is actually missing. In addition, unless your data is missing completely at random, case deletion can introduce bias into your parameter estimates.
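The 34% figure follows directly from independence of the missingness across the four regressors; a quick check:

```r
# With four regressors, each missing independently with probability 0.1,
# a row survives listwise deletion only if all four values are observed.
p_row_deleted <- 1 - (1 - 0.1)^4
round(p_row_deleted, 3)  # 0.344, i.e. about 34% of observations dropped
```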
It is typically better to use an imputation algorithm that captures at least the covariance structure of the data and generates random values, rather than replacing missing entries with mean or median values. This holds especially if you're going to do estimation using the imputed data, because you'll get more accurate estimates of the covariance matrix of the parameters. Replacing by the mean gives you overly optimistic standard errors, sometimes by quite a bit.
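A tiny demonstration (with made-up data) of why mean imputation is overly optimistic: every filled-in value sits exactly at the mean, so it shrinks the sample variance of the imputed variable, and smaller apparent variability translates into smaller apparent standard errors.

```r
set.seed(123)
x <- rnorm(1000)
x_obs <- x
x_obs[sample(1000, 300)] <- NA                      # 30% missing completely at random
x_mean <- x_obs
x_mean[is.na(x_mean)] <- mean(x_obs, na.rm = TRUE)  # mean imputation
var(x_mean) / var(x)  # roughly 0.7: variance understated by about the missing fraction
```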
I've included an example using the default imputation method from the mice package in R. The example is a regression with 100 observations and four regressors, each with a 10% chance of being missing at every observation. We compare the standard errors of the estimates from four approaches: the complete-data regression (no missing values), case deletion (drop any observation with a missing value), mean imputation (replace each missing value with the mean of its variable), and a good-quality imputation routine that models the covariance structure of the data and generates random values. I've constructed nonlinear relationships between the regressors so that mice won't model them using their true functional forms, just to add a layer of inaccuracy to the whole thing. I've run the entire process 100 times and averaged the standard errors of the four methods for each parameter for comparative purposes.
Here's the code, with a comparison of the standard errors at the bottom:
library(mice)        # randomized imputation
library(data.table)  # summary table at the end

results <- data.frame(se_x1 = rep(0, 400),
                      se_x2 = rep(0, 400),
                      se_x3 = rep(0, 400),
                      se_x4 = rep(0, 400),
                      method = c(rep("Complete data", 100),
                                 rep("Case deletion", 100),
                                 rep("Mean value imputation", 100),
                                 rep("Randomized imputation", 100)))
N <- 100
pct_missing <- 0.1
for (i in 1:100) {
  x1 <- 4 + rnorm(N)
  x2 <- 0.025 * x1^2 + rnorm(N)
  x3 <- 0.2 * x1^1.3 + 0.04 * x2^0.7 + rnorm(N)
  x4 <- 0.4 * x1^0.3 - 0.3 * x2^1.1 + rnorm(N)
  e <- rnorm(N, 0, 1.5)
  y <- x1 + x2 + x3 + e  # The true coefficient of x4 = 0
  # Complete data regression
  mc <- summary(lm(y ~ x1 + x2 + x3 + x4))
  results[i, 1:4] <- mc$coefficients[2:5, 2]
  # Cause data to be missing
  x1[rbinom(N, 1, pct_missing) == 1] <- NA
  x2[rbinom(N, 1, pct_missing) == 1] <- NA
  x3[rbinom(N, 1, pct_missing) == 1] <- NA
  x4[rbinom(N, 1, pct_missing) == 1] <- NA
  # Case deletion (lm drops rows with NAs by default)
  mm <- summary(lm(y ~ x1 + x2 + x3 + x4))
  results[i + 100, 1:4] <- mm$coefficients[2:5, 2]
  # Mean value imputation
  x1m <- x1; x1m[is.na(x1m)] <- mean(x1, na.rm = TRUE)
  x2m <- x2; x2m[is.na(x2m)] <- mean(x2, na.rm = TRUE)
  x3m <- x3; x3m[is.na(x3m)] <- mean(x3, na.rm = TRUE)
  x4m <- x4; x4m[is.na(x4m)] <- mean(x4, na.rm = TRUE)
  mmv <- summary(lm(y ~ x1m + x2m + x3m + x4m))
  results[i + 200, 1:4] <- mmv$coefficients[2:5, 2]
  # Randomized imputation; I'm only using 1 of the 5 multiple imputations.
  # It would be better to use all the multiple imputations, though.
  imp <- mice(cbind(y, x1, x2, x3, x4), printFlag = FALSE)
  x1[is.na(x1)] <- as.numeric(imp$imp$x1[, 1])
  x2[is.na(x2)] <- as.numeric(imp$imp$x2[, 1])
  x3[is.na(x3)] <- as.numeric(imp$imp$x3[, 1])
  x4[is.na(x4)] <- as.numeric(imp$imp$x4[, 1])
  mi <- summary(lm(y ~ x1 + x2 + x3 + x4))
  results[i + 300, 1:4] <- mi$coefficients[2:5, 2]
}
options(digits = 3)
results <- data.table(results)
results[, .(se_x1 = mean(se_x1),
            se_x2 = mean(se_x2),
            se_x3 = mean(se_x3),
            se_x4 = mean(se_x4)), by = method]
And the output:
method se_x1 se_x2 se_x3 se_x4
1: Complete data 0.208 0.278 0.192 0.193
2: Case deletion 0.267 0.359 0.244 0.250
3: Mean value imputation 0.231 0.301 0.212 0.217
4: Randomized imputation 0.213 0.271 0.195 0.198
Note that the complete-data method is as good as you can get with this data. Case deletion results in considerably less accurate parameter estimates, but the randomized imputation of mice gets you almost all the way back to the accuracy you would get with complete data. (These numbers are a little optimistic, as I'm not using the full multiple-imputation approach, but this is just a simple example.) Mean value imputation appears to improve things considerably relative to case deletion, but its standard errors are actually overly optimistic.
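For reference, the full multiple-imputation approach I skipped looks like this: fit the model on each of the m completed datasets and pool the results with Rubin's rules. A self-contained toy sketch (with made-up data, not the simulation above):

```r
library(mice)
set.seed(42)
N <- 100
x1 <- 4 + rnorm(N)
x2 <- 0.5 * x1 + rnorm(N)
y  <- x1 + x2 + rnorm(N)
x1[rbinom(N, 1, 0.1) == 1] <- NA
x2[rbinom(N, 1, 0.1) == 1] <- NA
imp <- mice(data.frame(y, x1, x2), m = 5, printFlag = FALSE, seed = 42)
fit <- with(imp, lm(y ~ x1 + x2))  # fit on each completed dataset
summary(pool(fit))                 # pooled coefficients, Rubin's-rules SEs
```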
So the tl;dr version is: impute, unless case deletion would only cost you a very small fraction of your cases (like 1%). The big caveat is: understand the assumptions that imputation requires first! If your data is not missing at random (I'm using that phrase non-technically here, so look up what imputation requires in this respect), imputation won't help you and may make things worse. But that's a topic for another question. Here are a couple of links which might be helpful: overview of imputation, missing data rates and imputation, different imputation algorithms.
Best Answer
Much depends on the reasons why data are missing. There are three commonly cited missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
MCAR means that missing values occur randomly in that variable without any dependence on any other variable, observed or not.
MAR means that missing values occur randomly in that variable but the probability of being missing depends on values of one or more other observed variables (which might include your outcome variable).
If missingness depends on unobserved variables then the data are missing not at random (MNAR).
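The three mechanisms above can be illustrated with a toy simulation (all variable names here are hypothetical). z is the variable that goes missing; x is a fully observed covariate correlated with z:

```r
set.seed(7)
n <- 1000
x <- rnorm(n)
z <- 0.5 * x + rnorm(n)
z_mcar <- ifelse(rbinom(n, 1, 0.2) == 1, NA, z)  # MCAR: a coin flip
z_mar  <- ifelse(x > quantile(x, 0.8), NA, z)    # MAR: depends on observed x
z_mnar <- ifelse(z > quantile(z, 0.8), NA, z)    # MNAR: depends on z itself
# Under MAR and (especially) MNAR the observed mean of z is biased downward:
round(c(true = mean(z),
        mcar = mean(z_mcar, na.rm = TRUE),
        mar  = mean(z_mar,  na.rm = TRUE),
        mnar = mean(z_mnar, na.rm = TRUE)), 2)
```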
Deleting observations with missing data (known as complete-case analysis or listwise deletion) is a bad idea, at the very least because it discards information, which results in larger standard errors, wider confidence intervals and loss of power. Under MCAR the estimates will be unbiased, but under MAR they may be biased. See:
Little RJA, Rubin DB. Statistical Analysis with Missing Data, Second Edition. John Wiley and Sons, 2002, p. 41. http://dx.doi.org/10.1002/9781119013563
Creating a factor/indicator/dummy variable for missingness is also a biased method, for example see:
White IR, Carlin JB. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med 2010;29:2920-31. http://dx.doi.org/10.1002/sim.3944
Jones MP. Indicator and stratification methods for missing explanatory variables in multiple linear regression. J Am Stat Assoc 1996;91:222-30. http://dx.doi.org/10.1080/01621459.1996.10476680
If the data are plausibly MAR or MCAR then multiple imputation, applied correctly, will yield unbiased estimates, and standard errors will be smaller than with complete-case analysis. MNAR data, as noted above, are much more difficult to handle.
Multiple imputation works by filling in missing values with plausible values drawn from a model. This is done multiple times, and each time the imputed values differ, to allow for uncertainty. The analysis model is then run on each imputed dataset and the results are pooled. The method works because, while it is possible to estimate the most likely values for the missing data, those single most likely values are unlikely to be exactly correct: there is inherent uncertainty. The variability in the imputed values between the completed datasets provides the extra uncertainty needed to reflect the uncertainty created by the missing values.
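To make the between-imputation variability concrete, here is a minimal sketch using nhanes, a small demo dataset shipped with mice: the m completed datasets disagree about the imputed cells, and that disagreement is exactly what the pooling step turns into honest standard errors.

```r
library(mice)
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)
# One column of imputed bmi values per completed dataset; the columns differ,
# reflecting the uncertainty about the missing values.
imp$imp$bmi
```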
mice is an excellent package for R which implements multiple imputation: https://www.jstatsoft.org/index.php/jss/article/view/v045i03/v45i03.pdf
Update: An example of how to handle missing values in R using the mice package: https://uvastatlab.github.io/2019/05/01/getting-started-with-multiple-imputation-in-r/