Is it better to use data imputation for missing data or an analysis that is not affected by missing data (e.g., HLM/mixed-effects modelling)?

missing-data, mixed-model, multiple-imputation

I have treated two groups of 100 people with different treatments. I have pre-treatment and post-treatment data for most participants, as well as a 1-month follow-up. I also have weekly data for some variables, but may or may not include those. About 10% of one group dropped out before post-treatment, and 30% of the other group dropped out at follow-up. I want to compare the efficacy of the treatments in each group (i.e., how group membership is associated with symptoms [continuous] over time). Note that the percentage of drop-outs is not part of the efficacy analysis (this will be considered in another analysis).

  • I have been told to use multiple imputation to help with the missing data problem. This option would involve running a repeated-measures ANOVA on each imputed data set and then attempting to pool the results across the imputed data sets.
  • I have also been told to use hierarchical linear modelling (i.e., mixed-effects modelling), since this analysis is quite robust to missing data.

Which option is better? Should I use a method that simulates missing values (e.g., multiple imputation) or use what data I have with a method that is robust to missing data?

This question fascinates me and also has important implications for my line of work! Any guidance would be appreciated. Let's assume I can do either analysis (mostly) properly.

Edit: Also I should note that another option would be to use BOTH methods (i.e., imputation followed by HLM). I appreciate this is an option. If this is your recommendation, please go ahead and explain why, but please also explain which of the two options is better if they are mutually exclusive.

Best Answer

I would hands down use mixed-effects modeling. First, I am not aware of an easy way to pool multiple-degree-of-freedom effects (as in an ANOVA with a factor that has more than two levels) across imputed data sets. Second, multiple imputation and full-information maximum likelihood estimation (the latter being what mixed-effects models use) make the same assumptions (such as multivariate normality), and so tend to yield similar results (see Baraldi & Enders, 2010).
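For context on the pooling point: combining a single-parameter estimate across imputed data sets is simple with Rubin's rules, and it is the omnibus, multiple-degree-of-freedom tests that lack an equally simple recipe. Below is a minimal sketch of the single-parameter case. The function name `pool_rubin` and the numbers in the usage lines are made up for illustration and do not come from the question's data.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed data sets (Rubin's rules).

    estimates: the point estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    pooled = mean(estimates)            # average of the m point estimates
    within = mean(variances)            # average sampling variance
    between = variance(estimates)       # variance of estimates across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical group-difference estimates from five imputed data sets.
example_estimates = [-1.8, -2.1, -1.6, -2.0, -1.9]
example_variances = [0.25, 0.27, 0.24, 0.26, 0.25]
pooled_estimate, pooled_se = pool_rubin(example_estimates, example_variances)
print(pooled_estimate, pooled_se)
```

The pooled standard error is larger than any single imputation would suggest, because the between-imputation variance carries the uncertainty due to the missing values.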

The choice is therefore mostly one of convenience (as the authors cited above point out), and given how easily full-information maximum likelihood can be applied in this case, mixed-effects modeling is the natural choice. A mixed-effects model also lets you include a random effect for the change over time, meaning that the rate of change is allowed to vary across participants. This is not possible in a repeated-measures ANOVA, which has random intercepts but fixed slopes.
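As an illustration, here is a rough sketch of such a model in Python using statsmodels' `mixedlm`. Everything about the data is an assumption: the column names (`id`, `group`, `time`, `symptoms`), the effect sizes, and the dropout mechanism are simulated to loosely resemble the design in the question, with dropout completely at random for simplicity. The point is only that the data are kept in long format, the visits that were never observed are simply absent, and every participant contributes whatever visits they have.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def simulate_trial(n_per_group=100, seed=0):
    """Simulate long-format data: one row per participant per observed visit."""
    rng = np.random.default_rng(seed)
    rows = []
    for group in (0, 1):
        for i in range(n_per_group):
            pid = group * n_per_group + i
            intercept = 20 + rng.normal(0, 3)            # participant-specific baseline
            slope = -2 - 1.5 * group + rng.normal(0, 1)  # participant-specific change per visit
            # Hypothetical dropout: about 10% of group 0 leave after baseline,
            # about 30% of group 1 miss the follow-up visit.
            if group == 0:
                last_visit = 0 if rng.random() < 0.10 else 2
            else:
                last_visit = 1 if rng.random() < 0.30 else 2
            for time in range(last_visit + 1):
                y = intercept + slope * time + rng.normal(0, 2)
                rows.append({"id": pid, "group": group, "time": time, "symptoms": y})
    return pd.DataFrame(rows)


def fit_growth_model(df):
    """Random intercept and random slope for time, grouped by participant."""
    model = smf.mixedlm(
        "symptoms ~ time * group",  # the time:group term is the treatment difference in change
        data=df,
        groups=df["id"],
        re_formula="~time",         # lets the slope over time vary across participants
    )
    return model.fit(reml=True)


data = simulate_trial()
result = fit_growth_model(data)
print(result.fe_params)
```

The `time:group` fixed effect is the quantity of interest for the efficacy comparison, since it estimates how much the average change per visit differs between the two treatments.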

By the way, one situation that might have made multiple imputation more attractive is if there had been missing data on the predictor variables as well, since full-information maximum likelihood in mixed-effects modeling software typically handles missing data on the dependent variable only.

Reference:

  • Baraldi, A. N., & Enders, C. K. (2010). An introduction to modern missing data analyses. Journal of School Psychology, 48, 5-37.