Imputation using within-subject means is a poor idea because it yields standard errors that are biased downward (too small) and possibly biased parameter estimates: the imputed values add no new information, yet they are treated as real observations.
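To see the variance-shrinking effect concretely, here is a toy sketch (assumed simulated data, single-variable mean imputation rather than within-subject means, but the mechanism is the same):

```r
# Filling missing values with the observed mean leaves n unchanged but
# adds no variability, so the variance (and hence any standard error
# built on it) is underestimated.
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
x_miss <- x
x_miss[sample(100, 30)] <- NA                # 30% missing completely at random
x_imp <- ifelse(is.na(x_miss), mean(x_miss, na.rm = TRUE), x_miss)
var(x_miss, na.rm = TRUE)                    # variance of the observed values
var(x_imp)                                   # smaller after mean imputation
```

The imputed points all sit exactly at the mean, contributing zero squared deviation while inflating the denominator, so the estimated variance necessarily shrinks.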
Assuming that the data are missing at random (MAR), a much better approach is multiple imputation. The mice
package in R can impute continuous variables in a mixed-effects framework with a single random effect (grouping variable): just use the 2l.norm
imputation method and flag the grouping variable in the predictor matrix. For example, suppose our analysis model is
> require(mice)
> require(lme4)
> m0 <- lmer(teachpop~sex+texp+popular + (1|school), data=popmis)
> confint(m0)
2.5 % 97.5 %
.sig01 0.44905533 0.62574295
.sigma 0.54368549 0.59259188
(Intercept) 2.03118933 2.67864796
sex -0.07108881 0.09183821
texp 0.03024598 0.06505065
popular 0.22257646 0.32572600
Because of missingness in the predictor popular,
estimates from this model may be biased. So we will use multiple imputation:
> ini <- mice(popmis, maxit=0)
> (pred <- ini$pred)
pupil school popular sex texp const teachpop
pupil 0 0 0 0 0 0 0
school 0 0 0 0 0 0 0
popular 1 1 0 1 1 0 1
sex 0 0 0 0 0 0 0
texp 0 0 0 0 0 0 0
const 0 0 0 0 0 0 0
teachpop 0 0 0 0 0 0 0
This is the default predictor matrix for the imputation model. Only popular
has missing values, and we are going to impute them using a mixed model in which school
is the grouping factor. To do this, we set the entry for school to -2,
which tells mice
that it is the class (grouping) variable, and set the remaining predictors to 2,
which (for 2l.norm) enters them with both fixed and random effects:
> pred["popular",] <- c(0, -2, 0, 2, 2, 2, 0)
> (pred)
So now we have:
pupil school popular sex texp const teachpop
pupil 0 0 0 0 0 0 0
school 0 0 0 0 0 0 0
popular 0 -2 0 2 2 2 0
sex 0 0 0 0 0 0 0
texp 0 0 0 0 0 0 0
const 0 0 0 0 0 0 0
teachpop 0 0 0 0 0 0 0
We have set up the predictor matrix, so we can now create 10 multiply imputed datasets, using the 2l.norm
method to impute values for popular:
> imp <- mice(popmis, meth = c("","","2l.norm","","","",""), pred = pred, maxit=10, m = 10)
Now we run the mixed model on each of the imputed datasets:
> fit <- with(imp, lmer(teachpop~sex+texp+popular + (1|school)))
...and pool the results:
> summary(pool(fit))
est se t df Pr(>|t|) lo 95 hi 95 nmis
(Intercept) 2.73951576 0.165053863 16.597708 1991.5874 0.000000e+00 2.41581941 3.06321211 NA
sex 0.08620420 0.031042794 2.776947 915.1865 5.599307e-03 0.02528087 0.14712753 0
texp 0.05682495 0.009713717 5.849970 1991.4452 5.733929e-09 0.03777484 0.07587506 0
popular 0.16696926 0.018760706 8.899945 1980.9159 0.000000e+00 0.13017647 0.20376205 848
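The pooled results above are produced by Rubin's rules. As a rough sketch of what pool() does internally for a single coefficient (the fit$analyses slot is how mice stores the per-imputation model fits; the focus on popular is just for illustration):

```r
# Rubin's rules for one coefficient, sketched by hand.
# fit$analyses holds the m fitted lmer models from with(imp, ...).
est <- sapply(fit$analyses, function(f) fixef(f)["popular"])           # point estimates
u   <- sapply(fit$analyses, function(f) vcov(f)["popular", "popular"]) # squared SEs
m    <- length(est)
qbar <- mean(est)                   # pooled estimate: average over imputations
ubar <- mean(u)                     # within-imputation variance
b    <- var(est)                    # between-imputation variance
se   <- sqrt(ubar + (1 + 1/m) * b)  # pooled (total) standard error
c(estimate = qbar, se = se)
```

The pooled standard error adds the between-imputation spread to the average within-imputation variance, which is exactly the uncertainty that single (e.g. mean-value) imputation throws away.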
Best Answer
To my knowledge, yes, it is typical to exclude the instances with missing data; I have not seen standard regression routines deal with missing data by default in any other way, and this "omission" is not unreasonable. Assuming that the missing data are "missing completely at random" (MCAR), deleting the instances with missing data does not lead to biased inference.
The most important thing when faced with missing data is to understand why the data are missing. As mentioned, excluding instances with missing information is safe if the MCAR assumption holds, but there are other "missingness mechanisms", such as "missing at random" (MAR - an unfortunate name, as it is different from MCAR) and "missing not at random" (MNAR), that require special consideration. Gelman and Hill's "Data Analysis Using Regression and Multilevel/Hierarchical Models" has a relevant chapter on missing-data imputation that gives a well-rounded treatment of the subject. This is the real reason standard regression routines do not implement imputation out of the box: the correct imputation approach depends on both the amount of missing data and the reason it is missing.
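A toy sketch of the three mechanisms (all assumed simulated data; the 0.3 rates and logistic missingness probabilities are just for illustration):

```r
# y is made missing three ways: independently of everything (MCAR),
# depending only on the observed x (MAR), and depending on y itself (MNAR).
set.seed(42)
x <- rnorm(1000)
y <- x + rnorm(1000)
y_mcar <- ifelse(runif(1000) < 0.3, NA, y)        # MCAR: random deletion
y_mar  <- ifelse(runif(1000) < plogis(x), NA, y)  # MAR: driven by observed x
y_mnar <- ifelse(runif(1000) < plogis(y), NA, y)  # MNAR: driven by y itself
mean(y_mcar, na.rm = TRUE)  # close to the full-data mean
mean(y_mnar, na.rm = TRUE)  # biased low: large y values go missing more often
```

Under MCAR the complete cases are a random subsample, so their mean is unbiased; under MNAR the observed values are systematically unrepresentative, which is why no method that only uses the observed data can fix it without further assumptions.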
There is a plethora of imputation techniques, each successful in certain scenarios. One can pick from simple median- or mean-value imputation (a popular and easy first step), to advanced multivariate techniques with substantial statistical backing (e.g. MICE or AMELIA), to linear-algebra-motivated approaches that (mostly) ignore the missingness mechanism and focus primarily on low-rank approximations (e.g. matrix completion). And of course there are approaches in between (e.g. imputation through probabilistic PCA or random forests).
As general advice, I would suggest finding out why the data are missing. In addition, after imputing a dataset with one methodology, it is worth rerunning the analysis with a different imputation methodology; if the results vary greatly, something fishy might be happening.
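A hedged sketch of such a sensitivity check, using mice's built-in nhanes data (the methods and the analysis model here are purely illustrative):

```r
# Impute the same data with two different methods and compare the
# pooled coefficients; large discrepancies warrant closer inspection.
library(mice)
imp_pmm  <- mice(nhanes, method = "pmm",  m = 5, seed = 1, printFlag = FALSE)
imp_norm <- mice(nhanes, method = "norm", m = 5, seed = 1, printFlag = FALSE)
fit_pmm  <- with(imp_pmm,  lm(bmi ~ age + chl))
fit_norm <- with(imp_norm, lm(bmi ~ age + chl))
summary(pool(fit_pmm))
summary(pool(fit_norm))   # broadly similar estimates are reassuring
```

If predictive mean matching and the normal model disagree sharply here, that is a signal to revisit the missingness assumptions rather than to trust either set of results.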