Solved – How to deal with missing data in mixed effects (or multi-level) models

missing-data, mixed-model

I am curious about strategies for dealing with missing data in mixed effects (or multi-level) models. As far as I understand, many software tools use listwise deletion by default, so that cases with missing values on any of the variables are removed from the analysis. This is the case, for example, for lme4 in R. Is this typical of other software as well? And are there strategies other than imputation that make use of all the available data, rather than falling back on casewise deletion?

Best Answer

To my knowledge, yes, it is typical to exclude cases with missing data. I have not seen standard regression routines deal with missing data by default in any other way, and this "omission" is not unreasonable: assuming the missing data are "missing completely at random" (MCAR), deleting the incomplete cases does not lead to biased inference.
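
For concreteness, here is a minimal R sketch (using the sleepstudy data bundled with lme4, with a few values deleted artificially) showing this default behaviour: `lmer()` quietly drops the incomplete rows via R's default `na.action` (`na.omit`).

```r
library(lme4)

d <- sleepstudy
d$Reaction[c(1, 5, 10)] <- NA       # artificially delete a few responses

fit <- lmer(Reaction ~ Days + (Days | Subject), data = d)

nrow(d)      # 180 rows in the data
nobs(fit)    # 177 rows actually used: the incomplete cases were dropped

## To fail loudly instead of silently dropping rows:
## lmer(Reaction ~ Days + (Days | Subject), data = d, na.action = na.fail)
```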

The most important thing when faced with missing data is to understand why they are missing. As mentioned, excluding cases with missing information is safe if the MCAR assumption holds, but there are other "missingness mechanisms", namely "missing at random" (MAR, an unfortunate name, since it is quite different from MCAR) and "missing not at random" (MNAR), that require special consideration. Gelman and Hill's "Data Analysis Using Regression and Multilevel/Hierarchical Models" has a chapter on missing-data imputation that gives a well-rounded treatment of the subject. This is the real reason standard regression routines do not implement imputation out of the box: the correct imputation approach depends on how much data is missing as well as on why it is missing.
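
To make the distinction concrete, here is a small, purely illustrative simulation (all variable names and numbers are made up for the example): under MCAR the complete-case mean of the outcome is roughly unbiased, while under MAR, where missingness depends on an observed covariate, it is not.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)                    # fully observed covariate
y <- 2 + 0.5 * x + rnorm(n)      # outcome of interest

## MCAR: every value of y has the same 30% chance of being missing
y_mcar <- ifelse(runif(n) < 0.3, NA, y)

## MAR: the probability that y is missing depends on the observed x
y_mar <- ifelse(runif(n) < plogis(-1 + 1.5 * x), NA, y)

c(truth = mean(y),
  mcar  = mean(y_mcar, na.rm = TRUE),   # close to the truth
  mar   = mean(y_mar,  na.rm = TRUE))   # noticeably shifted
```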

There is a plethora of imputation techniques, each of which works well in certain scenarios. One can pick from simple mean or median imputation (a popular and easy first step), advanced multivariate techniques with substantial statistical grounding (e.g. MICE or Amelia), and linear-algebra-motivated approaches that (mostly) ignore the missingness mechanism and focus on low-rank approximation (e.g. matrix completion). There are, of course, approaches in between, such as imputation via probabilistic PCA or random forests.
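
As an illustration of the multivariate route, below is a hedged sketch of multiple imputation with the mice package combined with lme4, reusing an artificially incomplete version of the sleepstudy data from the earlier snippet. Note that the default, single-level imputation method shown here ignores the grouping structure; mice also ships multilevel-aware methods (e.g. "2l.pan") that may be preferable for mixed models.

```r
library(lme4)
library(mice)
library(broom.mixed)   # needed by mice::pool() to extract estimates from merMod fits

set.seed(123)
d <- sleepstudy
d$Reaction[sample(nrow(d), 20)] <- NA   # artificial missingness for the demo

imp  <- mice(d, m = 5, method = "pmm", seed = 123, printFlag = FALSE)  # 5 imputed data sets
fits <- with(imp, lmer(Reaction ~ Days + (Days | Subject)))            # refit in each one
summary(pool(fits))                     # combine the results with Rubin's rules
```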

As general advice, I would suggest first finding out why the missing data occur. In addition, after imputing a data set with one imputation methodology, it is worth rerunning the analysis with a different one; if the results vary greatly, something fishy may be happening.
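
Continuing the sketch above, one cheap sanity check is to rerun the imputation with a different method and compare the pooled estimates; the choice of "norm" (Bayesian linear regression imputation) as the alternative here is just an example.

```r
imp_alt  <- mice(d, m = 5, method = "norm", seed = 123, printFlag = FALSE)
fits_alt <- with(imp_alt, lmer(Reaction ~ Days + (Days | Subject)))

summary(pool(fits))       # predictive mean matching (previous block)
summary(pool(fits_alt))   # normal-model imputation; large differences are a red flag
```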
