Solved – Dealing with missing data – glmer in lme4 package

I have an unbalanced data set (a data set with missing values) consisting of 20 submersible acoustic receivers that have been range tested on 8 days (both receiver ID and day are treated as random effects in my model). My aim is to test the effects of multiple environmental variables on the detection range of these receivers. Unfortunately, due to technical issues, a total of 5 receivers could not be tested every day (2 receivers could not be tested on 2 days, 3 could not be tested on 1 day).

What would be the right way to proceed? I have 2 choices: 1) reduce data set (exclude all receivers that are not tested on every day), or 2) work with full data set, including missing values.

The first choice is easy, but may not be the best choice (as far as I've read online). Besides, collected data would be thrown away. The second choice, however, seems more difficult to work with. I read that glmer in the lme4 package can deal with missing values; however, the only thing it does is automatically exclude all rows that contain NAs.

So let's say I choose the second option, and let the model run with missing values that automatically get deleted. How would this affect hypothesis testing? In other words, is the interpretation of p-values just as straightforward as it would be for a balanced design?

[EDIT: For my master's project I worked out the full data analysis with those receivers excluded, in order to maintain a balanced data set. For a publication, however, I would like to analyse my data using the second choice, as I think it's the better one. There's not much literature on this subject as far as I'm aware, hence my post to this forum.]

Best Answer

Turning my comments into an answer as they seem to have answered your question...

Just exclude the actual missing data. If you format your data with columns ID, Day, environmental variables, response, everything should be fine to just omit the rows where an ID is missing a measurement on a certain day, still keeping the other measurements on those IDs.
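
As a minimal sketch of this advice, the simulated example below fits a glmer model after dropping only the rows with a missing response. Everything in it is an assumption for illustration: the column names (receiver, day, wind, detected), the binomial response, the effect sizes, and the pattern of 2 receivers missing on 2 days. It is not the asker's actual data or model.

```r
library(lme4)

set.seed(1)
# Hypothetical long-format data: 20 receivers x 8 days, 5 pings per combination
d <- expand.grid(receiver = factor(1:20), day = factor(1:8), ping = 1:5)
d$wind <- rnorm(nrow(d))                          # made-up environmental covariate
rec_eff <- rnorm(20, sd = 0.5)[as.integer(d$receiver)]
day_eff <- rnorm(8,  sd = 0.3)[as.integer(d$day)]
d$detected <- rbinom(nrow(d), 1, plogis(0.5 - 0.8 * d$wind + rec_eff + day_eff))

# Receivers 1 and 2 could not be tested on days 1 and 2: response is NA there
d$detected[d$receiver %in% c("1", "2") & d$day %in% c("1", "2")] <- NA

# Drop only the missing rows; the other days for receivers 1 and 2 are kept
d_obs <- d[!is.na(d$detected), ]

fit <- glmer(detected ~ wind + (1 | receiver) + (1 | day),
             data = d_obs, family = binomial)

nobs(fit)               # number of rows actually used in the fit
table(d_obs$receiver)   # receivers 1 and 2 contribute fewer rows than the rest
```

Passing the full data frame d instead of d_obs should give the same fit, since rows with NA are dropped by default before fitting; subsetting explicitly just makes it visible which rows are used.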

For inference, you'll get the best accuracy using bootstrapped estimates (calling confint() on the fitted model with method = "boot" works well; lme4 supplies the confint method for its model objects, and you'll need the boot package installed for this to work). If you want more info on that, I'd recommend Faraway's Extending the Linear Model with R, section 8.2. The lme4 package has been considerably updated since Faraway's book was printed; you can see the accompanying transition guide. The principles, of course, remain the same.
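
A short sketch of the bootstrap call, using the cbpp example data that ships with lme4 rather than the asker's data. The small nsim is an assumption chosen only so the example runs quickly; a real analysis would normally use many more bootstrap replicates.

```r
library(lme4)

set.seed(2)
# cbpp ships with lme4: disease incidence per herd and period
fit <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

# Parametric bootstrap percentile intervals for all parameters
ci_boot <- confint(fit, method = "boot", nsim = 100, boot.type = "perc")

# Wald intervals for the fixed effects only, for comparison
ci_wald <- confint(fit, method = "Wald", parm = "beta_")

ci_boot
ci_wald
```

The bootstrap intervals cover the random-effect standard deviation as well as the fixed effects, whereas the Wald call here is restricted to the fixed effects.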

Related Solutions

Solved – Checking for outliers in a glmer (lme4 package) with 3 random factors

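Try romr.fnc from the LMERConvenienceFunctions package to remove outliers:

library(LMERConvenienceFunctions)
# m is the initial fitted model, df3 the data it was fitted on;
# observations with standardized residuals beyond 2.5 SD are dropped
df3.trimmed = romr.fnc(m, df3, trim = 2.5)
df3.trimmed = df3.trimmed$data

Then refit the initial model on the trimmed data:

mB = update(m, data = df3.trimmed)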

Solved – A data set with missing values in multiple variables

@Tim gave a nice response. To add to that, the best thinking about dealing with missing values (MVs) began with Donald Rubin and Roderick Little in their book Statistical Analysis with Missing Data, now in its third edition. They originated the classifications into MAR, MCAR, etc. To their several books I would add Paul Allison's highly readable Sage book Missing Data, which remains one of the best, most accessible treatments of this topic in the literature.

A number of commonly used, bad heuristics have emerged over the years for dealing with missing data, many of which still see use today since they are easily implemented "solutions." These include ones already mentioned such as discretizing the variable and creating a junk category labelled "Missing" or "NA" (not available, unknown) into which all missing values for that variable are tossed, as well as, for continuous variables, plugging the missing values with a constant -- e.g., the arithmetic mean. Secondarily and for regression models, some recommend using dummy variables (0,1) indicating the presence (absence) of an MV. The dummy is intended to "capture" the overall impact of the MVs on the model while also appropriately adjusting the parameters. These are all bad ideas because, in the first case, a heterogeneous mix of information is lumped into a single category while, in the second case, a potentially large burst or spike containing a single value (the mean) is introduced into an otherwise typically smooth distribution for a predictor.

The least biasing of all of the options for imputation are regression models. In an American Statistician paper (for which I no longer have a reference, sorry), dummy variables for MVs in regression were shown not only to fail to capture the effects of missing values but also to generate biased parameters. The AmStat paper based these conclusions on a comparison of the scenarios for the various MV options with full information data. The author's recommendation was, assuming the volume of missing information wasn't too large, to use the least biasing solution -- full information modeled imputation based on data available after deleting the observations containing MVs. Of course, this response demands an answer to "what is too much?" Here, there are no firm benchmarks, only experiential, subjective heuristics and rules of thumb without any firm theoretical motivation. This means that it's up to the analyst to decide. Just so, @Discipulus' rule of thumb is to work with variables containing 50% or less MVs, certainly a reasonable heuristic. In the OP's case, that would exclude the two variables containing more than 50% MVs, variables that are described as "important" to the analysis. That said, it is safe to assume that 95% MVs qualifies as "too much."

If it's thought that there are not too many MVs, then use some variant of multiple imputation to plug them. Here too, there are many bad methods to choose from including, e.g., "sorted hot deck" multiple imputation where observations are sorted across a string of fully observed variables and the fully observed value that comes closest to the observation with missing information across that sort string is used as a plug. In general, all of these "mechanistic" solutions to plugging MVs are to be rejected in favor of model-based multiple imputation.

In an ASA workshop taught by Rubin, several "best" practices were discussed for dealing with multiple variables containing MVs in a dataset. First, rank the variables by their frequency or percent of missing information and begin the process of imputation, one variable at a time, with those containing the fewest MVs. Then, retain and use these newly plugged variables in the model-building process for each subsequent variable. Use every variable available to you in building imputation models, including the target or dependent variable(s), and excluding only the variables with more MVs that have not yet been imputed.
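
The ordering described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, the variable names (x1, x2, x3, y) are made up, and a single pass of stochastic regression imputation with lm stands in for a full multiple imputation procedure, which would repeat the process to produce several completed data sets.

```r
set.seed(3)
n <- 200
# Hypothetical data: y is the outcome, x1 is complete, x2 and x3 have MVs
dat <- data.frame(x1 = rnorm(n))
dat$x2 <- 0.6 * dat$x1 + rnorm(n)
dat$x3 <- 0.4 * dat$x1 - 0.5 * dat$x2 + rnorm(n)
dat$y  <- 1 + dat$x1 + dat$x2 + dat$x3 + rnorm(n)
dat$x2[sample(n, 20)] <- NA    # 10% missing
dat$x3[sample(n, 50)] <- NA    # 25% missing

# Order the incomplete variables from least to most missing
miss_frac <- colMeans(is.na(dat))
to_impute <- names(sort(miss_frac[miss_frac > 0]))

imputed <- dat
for (v in to_impute) {
  # Predictors: every variable that is currently complete, including y and
  # variables imputed in earlier passes; still-incomplete ones are left out
  complete_vars <- setdiff(names(imputed)[colSums(is.na(imputed)) == 0], v)
  form <- reformulate(complete_vars, response = v)
  obs  <- !is.na(imputed[[v]])
  m    <- lm(form, data = imputed[obs, ])
  # Stochastic regression imputation: prediction plus residual noise
  pred <- predict(m, newdata = imputed[!obs, ])
  imputed[[v]][!obs] <- pred + rnorm(sum(!obs), sd = sigma(m))
}

to_impute                  # imputation order
colSums(is.na(imputed))    # remaining missing values per column
```

Adding residual noise to the predictions, rather than plugging in the fitted values directly, keeps the imputed values from being artificially concentrated around the regression line.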

The key metric in building and evaluating model-based imputation is the comparison of the pre-imputed means and std devs (based on the full information after deleting MVs) with the post-imputation or plugged value means and std devs. If the imputation was successful, then little or no (significant) difference should be observed in these marginals. An important note of caution needs to be introduced at this point: this metric, and multiple imputation in general, are intended to evaluate the preservation of overall or unconditional marginals. This means that the actual values used and assigned to each MV field have a high likelihood of being "wrong" for that observation, if compared with the full but unavailable information. For instance, in a head-to-head comparison of actual vs imputed values based on a sample possessing both self-reported survey information (the actual values) and the imputations made by a leading vendor of geo-demographic information (the imputed values), imputed fields such as head of household age and income were wrong nearly 80% of the time at the level of the individual observation. Even after partitioning these fields into high and low groups based on median splits, the imputations were still wrong more than 50% of the time. However, the marginals were recovered more or less accurately.
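
A small simulated check of this point, in base R. The variables x and z, the 20% missingness rate, and the use of a single stochastic regression imputation are all assumptions made for illustration; because the data are simulated, the true values behind the MVs are known and can be compared with the plugs.

```r
set.seed(4)
n <- 500
x <- rnorm(n)
z_true <- 2 + 0.7 * x + rnorm(n)   # hypothetical variable, fully known here
z <- z_true
miss <- sample(n, 100)             # 20% set to missing completely at random
z[miss] <- NA

# Stochastic regression imputation fitted on the observed cases
m <- lm(z ~ x)
z_imp <- z
z_imp[miss] <- predict(m, newdata = data.frame(x = x[miss])) +
  rnorm(length(miss), sd = sigma(m))

# Marginals before (observed cases only) and after imputation
marginals <- rbind(
  observed = c(mean = mean(z, na.rm = TRUE), sd = sd(z, na.rm = TRUE)),
  imputed  = c(mean = mean(z_imp),           sd = sd(z_imp))
)
marginals

# Individual imputed values can still be far from the true values
summary(abs(z_imp[miss] - z_true[miss]))
```

The two rows of marginals should be close to each other, while the summary of absolute errors shows how far individual plugs can sit from the values they replace.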

One final note: imputation can be appropriate for features, predictors or independent variables but is not recommended for the target or dependent variable.
