Imputation using within-subject means is a poor idea because it yields standard errors that are biased downward (too small) and possibly biased parameter estimates: the imputed values add no new information, yet they are treated as real observations.
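To see the variance-shrinking effect concretely, here is a toy sketch (assumed simulated data, single-variable mean imputation rather than within-subject means, but the mechanism is the same):

```r
# Filling missing values with the observed mean leaves n unchanged but
# adds no variability, so the variance (and hence any standard error
# built on it) is underestimated.
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
x_miss <- x
x_miss[sample(100, 30)] <- NA                # 30% missing completely at random
x_imp <- ifelse(is.na(x_miss), mean(x_miss, na.rm = TRUE), x_miss)
var(x_miss, na.rm = TRUE)                    # variance of the observed values
var(x_imp)                                   # smaller after mean imputation
```

The imputed points all sit exactly at the mean, contributing zero squared deviation while inflating the denominator, so the estimated variance necessarily shrinks.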
Assuming that the data are missing at random (MAR), a much better approach is multiple imputation. The mice
package in R can impute continuous variables in a mixed-effects framework with a single random effect (grouping variable): just use the 2l.norm
imputation method and flag the grouping variable in the predictor matrix. For example, suppose our analysis model is
> require(mice)
> require(lme4)
> m0 <- lmer(teachpop~sex+texp+popular + (1|school), data=popmis)
> confint(m0)
2.5 % 97.5 %
.sig01 0.44905533 0.62574295
.sigma 0.54368549 0.59259188
(Intercept) 2.03118933 2.67864796
sex -0.07108881 0.09183821
texp 0.03024598 0.06505065
popular 0.22257646 0.32572600
Because of missingness in the predictor popular,
estimates from this model may be biased. So we will use multiple imputation:
> ini <- mice(popmis, maxit=0)
> (pred <- ini$pred)
pupil school popular sex texp const teachpop
pupil 0 0 0 0 0 0 0
school 0 0 0 0 0 0 0
popular 1 1 0 1 1 0 1
sex 0 0 0 0 0 0 0
texp 0 0 0 0 0 0 0
const 0 0 0 0 0 0 0
teachpop 0 0 0 0 0 0 0
This is the default predictor matrix for the imputation model. Only popular
has missing values, and we are going to impute them using a mixed model in which school
is the grouping factor. To do this, we set the entry for school to -2,
which tells mice
that it is the class (grouping) variable, and set the remaining predictors to 2,
which (for 2l.norm) enters them with both fixed and random effects:
> pred["popular",] <- c(0, -2, 0, 2, 2, 2, 0)
> (pred)
So now we have:
pupil school popular sex texp const teachpop
pupil 0 0 0 0 0 0 0
school 0 0 0 0 0 0 0
popular 0 -2 0 2 2 2 0
sex 0 0 0 0 0 0 0
texp 0 0 0 0 0 0 0
const 0 0 0 0 0 0 0
teachpop 0 0 0 0 0 0 0
We have set up the predictor matrix, so we can now create 10 multiply imputed datasets, using the 2l.norm
method to impute values for popular:
> imp <- mice(popmis, meth = c("","","2l.norm","","","",""), pred = pred, maxit=10, m = 10)
Now we run the mixed model on each of the imputed datasets:
> fit <- with(imp, lmer(teachpop~sex+texp+popular + (1|school)))
...and pool the results:
> summary(pool(fit))
est se t df Pr(>|t|) lo 95 hi 95 nmis
(Intercept) 2.73951576 0.165053863 16.597708 1991.5874 0.000000e+00 2.41581941 3.06321211 NA
sex 0.08620420 0.031042794 2.776947 915.1865 5.599307e-03 0.02528087 0.14712753 0
texp 0.05682495 0.009713717 5.849970 1991.4452 5.733929e-09 0.03777484 0.07587506 0
popular 0.16696926 0.018760706 8.899945 1980.9159 0.000000e+00 0.13017647 0.20376205 848
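The pooled results above are produced by Rubin's rules. As a rough sketch of what pool() does internally for a single coefficient (the fit$analyses slot is how mice stores the per-imputation model fits; the focus on popular is just for illustration):

```r
# Rubin's rules for one coefficient, sketched by hand.
# fit$analyses holds the m fitted lmer models from with(imp, ...).
est <- sapply(fit$analyses, function(f) fixef(f)["popular"])           # point estimates
u   <- sapply(fit$analyses, function(f) vcov(f)["popular", "popular"]) # squared SEs
m    <- length(est)
qbar <- mean(est)                   # pooled estimate: average over imputations
ubar <- mean(u)                     # within-imputation variance
b    <- var(est)                    # between-imputation variance
se   <- sqrt(ubar + (1 + 1/m) * b)  # pooled (total) standard error
c(estimate = qbar, se = se)
```

The pooled standard error adds the between-imputation spread to the average within-imputation variance, which is exactly the uncertainty that single (e.g. mean-value) imputation throws away.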
Best Answer
To my knowledge, yes, it is typical to exclude the instances with missing data; I have not seen standard regression routines deal with missing data by default in any other way, and this "omission" is not unreasonable. Assuming that the missing data are "missing completely at random" (MCAR), deleting the instances with missing data does not lead to biased inference.
The most important thing when faced with missing data is to understand why the data are missing. As mentioned, excluding instances with missing information is safe if the MCAR assumption holds, but there are other "missingness mechanisms", such as "missing at random" (MAR - an unfortunate name, as it is different from MCAR) and "missing not at random" (MNAR), that require special consideration. Gelman and Hill's "Data Analysis Using Regression and Multilevel/Hierarchical Models" has a relevant chapter on missing-data imputation that gives a well-rounded treatment of the subject. This is the real reason standard regression routines do not implement imputation out of the box: the correct imputation approach depends on both the amount of missing data and the reason it is missing.
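A toy sketch of the three mechanisms (all assumed simulated data; the 0.3 rates and logistic missingness probabilities are just for illustration):

```r
# y is made missing three ways: independently of everything (MCAR),
# depending only on the observed x (MAR), and depending on y itself (MNAR).
set.seed(42)
x <- rnorm(1000)
y <- x + rnorm(1000)
y_mcar <- ifelse(runif(1000) < 0.3, NA, y)        # MCAR: random deletion
y_mar  <- ifelse(runif(1000) < plogis(x), NA, y)  # MAR: driven by observed x
y_mnar <- ifelse(runif(1000) < plogis(y), NA, y)  # MNAR: driven by y itself
mean(y_mcar, na.rm = TRUE)  # close to the full-data mean
mean(y_mnar, na.rm = TRUE)  # biased low: large y values go missing more often
```

Under MCAR the complete cases are a random subsample, so their mean is unbiased; under MNAR the observed values are systematically unrepresentative, which is why no method that only uses the observed data can fix it without further assumptions.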
There is a plethora of imputation techniques, each successful in certain scenarios. One can pick from simple median- or mean-value imputation (a popular and easy first step), to advanced multivariate techniques with substantial statistical backing (e.g. MICE or AMELIA), to linear-algebra-motivated approaches that (mostly) ignore the missingness mechanism and focus primarily on low-rank approximations (e.g. matrix completion). And of course there are approaches in between (e.g. imputation through probabilistic PCA or random forests).
As general advice, I would suggest finding out why the data are missing. In addition, after imputing a dataset with one methodology, it is worth rerunning the analysis with a different imputation methodology; if the results vary greatly, something fishy might be happening.
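A hedged sketch of such a sensitivity check, using mice's built-in nhanes data (the methods and the analysis model here are purely illustrative):

```r
# Impute the same data with two different methods and compare the
# pooled coefficients; large discrepancies warrant closer inspection.
library(mice)
imp_pmm  <- mice(nhanes, method = "pmm",  m = 5, seed = 1, printFlag = FALSE)
imp_norm <- mice(nhanes, method = "norm", m = 5, seed = 1, printFlag = FALSE)
fit_pmm  <- with(imp_pmm,  lm(bmi ~ age + chl))
fit_norm <- with(imp_norm, lm(bmi ~ age + chl))
summary(pool(fit_pmm))
summary(pool(fit_norm))   # broadly similar estimates are reassuring
```

If predictive mean matching and the normal model disagree sharply here, that is a signal to revisit the missingness assumptions rather than to trust either set of results.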