Solved – Resolving heteroscedasticity in Poisson GLMM

Tags: glmm, heteroscedasticity, poisson distribution, r

I have long-term collection data, and I'd like to test whether the number of animals collected is influenced by weather effects. My model is shown below:

glmer(SumOfCatch ~ I(pc.act.1^2) +I(pc.act.2^2) + I(pc.may.1^2) + I(pc.may.2^2) + 
                   SampSize + as.factor(samp.prog) + (1|year/month), 
      control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=1e9,npt=5)), 
      family="poisson", data=a2)

Explanation of the used variables:

SumOfCatch: number of animals collected
pc.act.1, pc.act.2: axes of a principal component representing weather conditions during sampling
pc.may.1, pc.may.2: axes of a PC representing weather conditions in May
SampSize: number of pitfall traps, or collecting transects of standard lengths
samp.prog: method of sampling
year: year of sampling (from 1993 to 2002)
month: month of sampling (from Aug to Nov)

The fitted model's residuals show considerable inhomogeneity (heteroscedasticity?) when plotted against fitted values (see Fig.1):

My main question is: does this pattern make my model unreliable? If so, what can I do to resolve it?

So far I have tried the following:

Controlling for overdispersion with an observation-level random effect, i.e. assigning a unique ID to each observation and using this ID variable as a random effect. Although my data do show considerable overdispersion, this did not help: the residuals looked even worse (see Fig. 2).

Fitting models without random effects, using quasi-Poisson glm and glm.nb. These yielded residual vs. fitted plots similar to those of the original model.

As far as I know, there may be ways to estimate heteroscedasticity-consistent standard errors, but I have failed to find any such method for Poisson (or any other kind of) GLMMs in R.

In response to @FlorianHartig: my dataset has N = 554 observations, which I think is a reasonable number for such a model, although more would of course be better. I am posting two figures. The first is the DHARMa scaled residual plot (suggested by Florian) of the main model.

The second figure is from a second model, in which the only difference is that it contains the observation-level random effect (the first does not).

UPDATE

Figure of the relationship between a weather-variable (as predictor, i.e. x-axis) and sampling success (response):

UPDATE II.

Figures showing predictor values vs. residuals:

Best Answer

It is difficult to assess the fit of a Poisson model (or any other integer-valued GLM, for that matter) with Pearson or deviance residuals, because even a perfectly fitting Poisson GLMM will exhibit inhomogeneous deviance residuals.

This is especially so if you do GLMMs with observation-level REs, because the dispersion created by OL-REs is not considered by the Pearson residuals.
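A minimal base R sketch of why this happens, assuming hypothetical simulated data (the names x, mu and y are illustrative and do not come from the question). Even when the true means are known exactly, Pearson residuals of Poisson counts are bounded below by -sqrt(mu) and are strongly right-skewed at small means, so the residual cloud cannot look homogeneous:

```r
set.seed(1)
n  <- 500
x  <- runif(n, -1, 1)
mu <- exp(0.5 + 1.5 * x)      # true Poisson means (assumed, for illustration)
y  <- rpois(n, mu)

# Pearson residuals computed from the TRUE means, i.e. a "perfect" model
pearson <- (y - mu) / sqrt(mu)

# Because y >= 0, every residual is bounded below by -sqrt(mu)
plot(mu, pearson, xlab = "true mean", ylab = "Pearson residual")
curve(-sqrt(x), add = TRUE, col = "red")   # lower envelope, reached when y == 0
abline(h = 0, lty = 2)
```

The red curve traces the zeros in the data; the visible banding and the asymmetric spread are properties of count data, not evidence of a misspecified model.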

To demonstrate the issue, the code at the end of this answer creates overdispersed Poisson data, which are then fitted with a perfect model, i.e. one that matches the data-generating process. The Pearson residuals look very much like your plot; hence, it may be that there is no problem at all.

This problem is solved by the DHARMa R package, which simulates from the fitted model to transform the residuals of any GL(M)M into a standardized space. Once this is done, you can visually assess / test residual problems, such as deviations from the distribution, residual dependency on a predictor, heteroskedasticity or autocorrelation in the normal way. See the package vignette for worked-through examples. You can see in the lower plot that the same model now looks fine, as it should.
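The idea of simulation-based scaled residuals can be sketched by hand in base R. This is only an illustration of the principle on a plain Poisson glm with hypothetical simulated data; it is not DHARMa's actual implementation, and the variable names are assumptions made for the example. Each observation is placed within the distribution of values simulated from the fitted model at that observation, with ties broken at random because the data are integers:

```r
set.seed(42)
n <- 300
x <- runif(n)
y <- rpois(n, exp(1 + x))                 # data generated from a Poisson model
fit <- glm(y ~ x, family = poisson)       # correctly specified model

nsim <- 250
sims <- as.matrix(simulate(fit, nsim = nsim))   # n rows, nsim simulated responses

# Randomised empirical CDF value of each observation among its simulations
below  <- rowSums(sims < y)               # simulations strictly below the observation
ties   <- rowSums(sims == y)              # simulations equal to the observation
scaled <- (below + runif(n) * ties) / nsim

# For a well-specified model these values should look roughly uniform on [0, 1]
hist(scaled, breaks = 20, main = "Hand-rolled scaled residuals")
plot(fitted(fit), scaled, xlab = "fitted value", ylab = "scaled residual")
```

Because the scaled residuals live on [0, 1] regardless of the mean, a pattern against the fitted values or a predictor is much easier to interpret than in a Pearson residual plot.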

If you still see heteroscedasticity after plotting with DHARMa, you will have to model the dispersion as a function of a predictor. That is not a big problem, but it would likely require you to move to JAGS or other Bayesian software.

library(DHARMa)
library(lme4)

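# simulate overdispersed Poisson data with a grouping factor (DHARMa helper function)
testData = createData(sampleSize = 200, overdispersion = 1, randomEffectVariance = 1, family = poisson())

# fit the "perfect" model: a group-level RE plus an observation-level RE (ID)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) + (1|ID), family = "poisson", data = testData, control=glmerControl(optCtrl=list(maxfun=20000) ))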

# standard Pearson residuals
plot(fittedModel, resid(., type = "pearson") ~ fitted(.) , abline = 0)

# DHARMa residuals
plot(simulateResiduals(fittedModel))

Related Solutions

Solved – Why is Poisson regression different with glmer and gamlss

Things may have changed since this question was written, but it looks as though the random effect is not coded correctly for gamlss. You have written it as "random=~1|Trial", but when I try to run that through gamlss, it reports that "|" is not valid for factors. More details on how to code random effects for gamlss are in the manual: http://www.gamlss.org/wp-content/uploads/2013/01/gamlss-manual.pdf

The benefit of gamlss is that you can model different aspects (I use a zero-inflated beta distribution and can model both the distribution of non-zero values and the probability of zeroes). It could be that, as coded, the random effect is not influencing the same part of the model as it is in glmer. When I use a ~ for the random effect in my gamlss models, it doesn't contribute to the distribution of non-zero values anymore; it is transferred to the probability of zero.

Unfortunately I have not figured out how to properly code mixed models for gamlss, but hopefully this at least clears up why you're getting different results.

Solved – Adding an observation level random term messes up residuals vs fitted plot. Why

Thanks for updating your post, Charly. I played with some over-dispersed Poisson data to see the impact of adding an observation level effect in the glmer model on the plot of residual versus fitted values. Here is the R code:

# generate data like here: https://rpubs.com/INBOstats/OLRE

set.seed(324)
n.i <- 10
n.j <- 10
n.k <- 10
beta.0 <- 1
beta.1 <- 0.3
sigma.b <- 0.5
theta <- 5
dataset <- expand.grid(
X = seq_len(n.i),
b = seq_len(n.j),
Replicate = seq_len(n.k)
)
rf.b <- rnorm(n.j, mean = 0, sd = sigma.b)
dataset$eta <- beta.0 + beta.1 * dataset$X + rf.b[dataset$b]
dataset$mu <- exp(dataset$eta)
dataset$Y <- rnbinom(nrow(dataset), mu = dataset$mu, size = theta)
dataset$OLRE <- seq_len(nrow(dataset))


require(lme4)

m.1 <- glmer(Y ~ X + (1 | b), family=poisson(link="log"), data=dataset)

m.2 <- glmer(Y ~ X + (1 | b) + (1 | OLRE), family=poisson(link="log"),  
             data=dataset)

Note that model m.2 includes an observation level random effect to account for over-dispersion.
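To see what an observation-level random effect implies, here is a small base R sketch under assumed, hypothetical parameter values (eta and sigma are chosen for illustration only). Adding independent normal noise to the linear predictor of each observation turns the Poisson into a Poisson-lognormal mixture, whose variance exceeds its mean:

```r
set.seed(10)
n     <- 1e5
eta   <- 1.5     # linear predictor (assumed value)
sigma <- 0.6     # SD of the observation-level random effect (assumed value)

y_pois <- rpois(n, exp(eta))                        # plain Poisson
y_olre <- rpois(n, exp(eta + rnorm(n, 0, sigma)))   # Poisson with per-observation noise

# Plain Poisson: variance is close to the mean
c(mean = mean(y_pois), var = var(y_pois))

# With the observation-level effect: variance is well above the mean
c(mean = mean(y_olre), var = var(y_olre))

# Theoretical marginal moments of the Poisson-lognormal mixture
m <- exp(eta + sigma^2 / 2)
v <- m + m^2 * (exp(sigma^2) - 1)
c(mean = m, var = v)
```

The extra term m^2 * (exp(sigma^2) - 1) is the overdispersion that the random-effect variance absorbs when the model is fitted.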

To diagnose the presence of over-dispersion in model m.1, we can use the command:

# check for over-dispersion:  
# values greater than 1.4 indicate over-dispersion

require(blmeco)

dispersion_glmer(m.1)

The value returned by dispersion_glmer is 2.204209, which is larger than the cut-off of 1.4 where we would start to suspect the presence of over-dispersion.

When applying dispersion_glmer to model m.2, we get a value of 1.023656:

dispersion_glmer(m.2)
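As a cross-check that does not depend on blmeco, a widely used alternative statistic is the sum of squared Pearson residuals divided by the residual degrees of freedom. Note that this is a different quantity from the one returned by dispersion_glmer, so the 1.4 cut-off does not carry over. The sketch below uses a plain Poisson glm on hypothetical simulated data (all names and parameter values are assumptions for illustration):

```r
set.seed(1)
n <- 1000
x <- runif(n, 0, 2)
mu <- exp(1 + 0.5 * x)

y_nb   <- rnbinom(n, mu = mu, size = 2)   # overdispersed counts
y_pois <- rpois(n, mu)                    # equidispersed counts

# Pearson chi-square statistic divided by the residual degrees of freedom
pearson_ratio <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

ratio_nb   <- pearson_ratio(glm(y_nb ~ x, family = poisson))
ratio_pois <- pearson_ratio(glm(y_pois ~ x, family = poisson))

# Values near 1 are consistent with Poisson variation; values well above 1 suggest overdispersion
c(negative_binomial = ratio_nb, poisson = ratio_pois)
```

For a model with an observation-level random effect, the Pearson residuals are conditional on that effect, so this ratio should be read with the same caution as the residual plots discussed in this answer.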

Here is the R code for the plot of residuals (Pearson or deviance) versus fitted values:

par(mfrow=c(1,2))
plot(residuals(m.1, type="pearson") ~ fitted(m.1), col="darkgrey")
abline(h=0, col="red")
plot(residuals(m.2, type="pearson") ~ fitted(m.2), col="darkgrey")
abline(h=0, col="red")

par(mfrow=c(1,2))
plot(residuals(m.1, type="deviance") ~ fitted(m.1), col="darkgrey")
abline(h=0, col="red")
plot(residuals(m.2, type="deviance") ~ fitted(m.2), col="darkgrey")
abline(h=0, col="red")

As you can see, the Pearson residuals plot for the model m.2 (which includes an observation level random effect) looks horrendous compared to the plot for model m.1.

I am not showing the deviance residuals plot for m.2 as it looks about the same (that is, horrendous).
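One possible explanation, offered as a sketch rather than a definitive diagnosis: fitted() and the Pearson residuals from glmer condition on all estimated random effects, including the observation-level one, so the fitted values of such a model track each individual observation closely. The code below regenerates data in the same way as above (the object names d, m, fit_cond and fit_group are introduced here for illustration) and compares the conditional fitted values with predictions that keep only the group-level effect:

```r
library(lme4)

set.seed(324)
d <- expand.grid(X = 1:10, b = 1:10, Replicate = 1:10)
rf.b <- rnorm(10, mean = 0, sd = 0.5)
d$Y <- rnbinom(nrow(d), mu = exp(1 + 0.3 * d$X + rf.b[d$b]), size = 5)
d$OLRE <- seq_len(nrow(d))

m <- glmer(Y ~ X + (1 | b) + (1 | OLRE), family = poisson(link = "log"), data = d)

# Conditional on BOTH random effects: each fitted value is pulled towards its own observation
fit_cond <- fitted(m)

# Conditional on the group effect only: the observation-level effect is set to zero
fit_group <- predict(m, re.form = ~ (1 | b), type = "response")

# The conditional fitted values follow the data much more closely
c(cor_cond = cor(d$Y, fit_cond), cor_group = cor(d$Y, fit_group))

par(mfrow = c(1, 2))
plot(fit_cond,  d$Y - fit_cond,  col = "darkgrey", main = "conditional on OLRE")
plot(fit_group, d$Y - fit_group, col = "darkgrey", main = "OLRE excluded")
```

If this reading is right, the unusual residual vs. fitted plot for the model with the observation-level effect reflects what the residuals are conditioned on rather than a worse fit.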

Here is the plot of fitted values versus observed response values for models m.1 and m.2:

par(mfrow=c(1,2))
plot(fitted(m.1) ~ dataset$Y, col="darkgrey", 
     xlim=c(0, 250), ylim=c(0, 250), 
     xlab="Y (response)", ylab="Fitted Values")
abline(a=0, b=1, col="red")
plot(fitted(m.2) ~ dataset$Y, col="darkgrey", 
     xlim=c(0, 250), ylim=c(0, 250), 
     xlab="Y (response)", ylab="Fitted Values")
abline(a=0, b=1, col="red")

The plot of fitted values versus actual response values seems to look better for model m.2.

We should check the summary corresponding to the two models:

summary(m.1)

summary(m.2)

As argued in https://rpubs.com/INBOstats/OLRE, large discrepancies between the fixed effect coefficients and especially the random effects variance for b would suggest that something may be off. (The extent of overdispersion present in the initial model would drive the extent of these discrepancies.)

Let's look at some diagnostic plots for the two models obtained with the DHARMa package:

require(DHARMa)

fittedModel <- m.1
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)

#----

fittedModel <- m.2
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)

The diagnostic plots for model m.1 (especially the left panel) clearly show that overdispersion is an issue.

The diagnostic plot for model m.2 shows overdispersion is no longer an issue.

See https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html for more details on these types of plots.

Finally, let's do a posterior predictive check for the two models (i.e., plotting the fitted values obtained across simulated data sets constructed from each model over a histogram of the real response values Y), as explained at http://www.flutterbys.com.au/stats/ws/ws12.html:

range(dataset$Y)  # Actual response values Y range from 0 to 247

set.seed(1234567)
glmer.sim1 <- simulate(m.1, nsim = 1000)
glmer.sim2 <- simulate(m.2, nsim = 1000)

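# collect both sets of simulations so they can be indexed without eval(parse())
sims <- list(glmer.sim1, glmer.sim2)
out <- matrix(NA, ncol = 2, nrow = 251)
cnt <- 0:250
for (i in seq_along(cnt)) {
  for (j in 1:2) {
    # mean number of observations equal to cnt[i] across the simulated data sets
    out[i, j] <- mean(sapply(sims[[j]], function(x) sum(x == cnt[i])))
  }
}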


plot(table(dataset$Y), ylab = "Frequency", xlab = "Y", lwd = 2, 
     col="darkgrey")
lines(x = 0:250, y = out[, 1], lwd = 2, lty = 2, col = "red")    
lines(x = 0:250, y = out[, 2], lwd = 2, lty = 2, col = "blue")

The resulting plot shows that both models are doing a good job at approximating the distribution of Y.

Of course, there are other predictive checks one could look at, including the centipede plot, which would show where the model with observation level random effect would fail (e.g., the model would tend to under-predict low values of Y): http://newprairiepress.org/cgi/viewcontent.cgi?article=1005&context=agstatconference.

This particular example shows that it is possible for the addition of an observation level random effect to worsen the appearance of the plot of residuals versus fitted values, while producing other diagnostic plots which look fine. I wonder if other people on this site may be able to add further insights into how one should proceed in this situation, other than to report what happens with each diagnostic plot when the correction for over-dispersion is used.
