Solved – Dealing with missing data – glmer in lme4 package

lme4-nlme, missing data

I have an unbalanced data set (a data set with missing values) from 20 submersible acoustic receivers that were range tested on 8 days. Both receiver ID and Day are treated as random effects in my model. My aim is to test the effects of multiple environmental variables on the detection range of these receivers. Unfortunately, due to technical issues, a total of 5 receivers could not be tested every day (2 receivers could not be tested on 2 days, and 3 could not be tested on 1 day).

What would be the right way to proceed? I have 2 choices: 1) reduce the data set (exclude all receivers that were not tested on every day), or 2) work with the full data set, including missing values.

The first choice is easy, but may not be the best one (as far as I've read online). Besides, it means throwing away collected data. The second choice, however, seems more difficult to work with. I read that glmer in the lme4 package can deal with missing values, but the only thing it does is automatically exclude all rows that contain NAs.

So let's say I choose the second option and let the model run, with rows containing missing values automatically deleted. How would this affect hypothesis testing? In other words, is the interpretation of p-values just as straightforward as it would be for a balanced design?

[EDIT: For my master's project I worked out the full data analysis with those receivers excluded, in order to maintain a balanced data set. For a publication, however, I would like to analyse my data using the second choice, as I think it's the better one. There's not much literature on this subject as far as I'm aware, hence my post to this forum.]

Best Answer

Turning my comments into an answer as they seem to have answered your question...

Just exclude the actual missing data. If you format your data with columns ID, Day, environmental variables, response, everything should be fine to just omit the rows where an ID is missing a measurement on a certain day, still keeping the other measurements on those IDs.
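As a rough illustration of that layout, here is a minimal base-R sketch. Everything in it is made up for the example: the covariate name `wind`, the binary response `detected`, the simulated values, and which receiver-days are untested. Only the 20 receivers, the 8 days and the pattern of 7 missing receiver-days come from the question.

```r
# Hypothetical long-format data: one row per receiver (ID) per day.
set.seed(1)
dat <- expand.grid(ID = factor(1:20), Day = factor(1:8))
dat$wind <- rnorm(nrow(dat))                      # made-up environmental covariate
dat$detected <- rbinom(nrow(dat), 1,
                       plogis(0.5 - 0.8 * dat$wind))  # made-up binary response

# Made-up pattern of untested receiver-days:
# receivers 1 and 2 miss two days each, receivers 3, 4 and 5 miss one day each.
untested <- data.frame(
  ID  = factor(c(1, 1, 2, 2, 3, 4, 5), levels = levels(dat$ID)),
  Day = factor(c(2, 5, 3, 7, 1, 4, 6), levels = levels(dat$Day))
)
is_untested <- paste(dat$ID, dat$Day) %in% paste(untested$ID, untested$Day)
dat$detected[is_untested] <- NA

# Drop only the rows with a missing response; every receiver stays in the data.
dat_obs <- dat[!is.na(dat$detected), ]

nrow(dat_obs)        # number of receiver-days that remain
table(dat_obs$ID)    # rows kept per receiver
```

The point of the sketch is that the affected receivers simply contribute fewer rows; none of them has to be removed entirely.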

For inference, you'll get the best accuracy using bootstrapped estimates. Calling confint() on the fitted lme4 model with method = "boot" works well (you'll need the boot package installed for this to work). If you want more info on that, I'd recommend Faraway's Extending the Linear Model with R, section 8.2. The lme4 package has been considerably updated since Faraway's book was printed, so you may also want to look at the accompanying transition guide. The principles, of course, remain the same.
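A minimal sketch of what that could look like, assuming a binary response and a single covariate. The variable names, the simulated data and the model formula are all hypothetical; your own formula and family will differ. `nsim` is kept small here only so the example runs quickly.

```r
library(lme4)

# Simulated stand-in data (hypothetical): 20 receivers x 8 days.
set.seed(42)
dat <- expand.grid(ID = factor(1:20), Day = factor(1:8))
dat$wind <- rnorm(nrow(dat))                        # made-up covariate
id_eff  <- rnorm(20, sd = 0.5)[as.integer(dat$ID)]  # receiver-level random effect
day_eff <- rnorm(8,  sd = 0.5)[as.integer(dat$Day)] # day-level random effect
dat$detected <- rbinom(nrow(dat), 1,
                       plogis(0.5 - 0.8 * dat$wind + id_eff + day_eff))

# Mimic the unbalanced design by dropping 7 receiver-day rows at random.
dat <- dat[-sample(nrow(dat), 7), ]

# Crossed random intercepts for receiver and day.
fit <- glmer(detected ~ wind + (1 | ID) + (1 | Day),
             data = dat, family = binomial)

# Parametric bootstrap confidence intervals for the fixed effects only
# (parm = "beta_"); use a larger nsim for real analyses.
ci <- confint(fit, parm = "beta_", method = "boot", nsim = 100)
ci
```

With small data sets some bootstrap refits may warn about singular fits or convergence; that is worth checking rather than ignoring, but it does not by itself stop the intervals from being computed.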
