Solved – assessing glmmTMB hurdle model fit using DHARMa scaled residual plot

glmmtmb, r, residuals

My model

glmmTMB(y ~ fixed1 + fixed2 + fixed3 + fixed4 + (1 | random), data = df, ziformula = ~.,
    family = truncated_nbinom1(link = "log"))

The response variable (y) is e.g. kilos of wheat seeds planted per month. This involves two decisions: (1) whether or not to plant wheat and (2) the number of kilos. Thus there are many zeros (some farmers chose not to plant wheat).

The random variable would be farms.

fixed4 is month as there were only 13 months in the study. I have tried it as random but there are insufficient cases.

Each case in the data set is a farm-month. Not all farmers participated in the study for all months (but most did).

I think this form of response variable makes a hurdle model likely to be suitable, as does the distribution of the variable (see histogram below).

Running the model with lmer and using DHARMa to understand fit suggests that there are problems with uniformity (qqplot) and zero inflation but not with dispersion. Poisson and binomial models also show problems with uniformity.

The hurdle model suggests there is not a problem with uniformity, and the QQ plot seems suitable. However, there is a problem with underdispersion and also in the residual vs predicted plot (see below right). The residual vs predicted lines do not match: there are red diagonal lines.

I would like to know the extent to which this is a problem for the model? Is this just an illustration of the warning that "glmmTMB doesn't implement an option to create unconditional predictions from the model, which means that predicted values (in res ~ pred) plots include the random effects. With strong random effects, this can sometimes create diagonal patterns from bottom left to top right in the res ~ pred plot"

Also, is it the case that underdispersion is not an issue in a hurdle model? See https://github.com/glmmTMB/glmmTMB/issues/313

Best Answer

I would answer this on two levels: 1) is this the right model from theoretical considerations, and 2) is the residual plot cause for concern?

First of all, is this really a case for a hurdle count data model? Your description sounds like a decision of whether to plant or not, followed by a continuous decision about the weight of the seeds, so why count data? You could possibly model this as a compound process with a Tweedie distribution (I'm not sure if the glmmTMB zi formula works with Tweedie), but actually, given that any farmer who decides to plant will plant > 0 wheat (i.e. zeros always originate from the first process), the entire analysis conveniently separates into a) a binomial model for 0 vs. > 0, and b) an lm for the weight for all > 0 data. Just fit two models with lme4; that should work like a charm. I don't see any reason to make it more complicated than that.

The question of whether you fit the right model aside: I wouldn't be concerned about the qq plot, but the res~pred plot shows a very clear pattern. The problem is that there is still a limitation in glmmTMB, about which I warn when the DHARMa package is loaded. The issue is explained in https://github.com/florianhartig/DHARMa/issues/16 under limitations. This issue can produce this type of bottom-left to top-right pattern in the plot. See the comments in the link about how to check whether this is the case. A possible solution is also to simulate new data, refit, and see if you get the same pattern. But as said above, I wouldn't use this model anyway. lme4 doesn't have the same limitations, so lme4 residuals in DHARMa can be interpreted without this consideration.
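A minimal sketch of the two-part approach from the first point, fitted with lme4. The data are simulated purely for illustration, all column names (farm, month, fixed1, planted) are hypothetical, and modelling the positive weights on the log scale is an assumption of this sketch rather than part of the answer:

```r
library(lme4)

set.seed(1)
# Simulated farm-month data; every name and effect size here is made up
n_farm <- 40
df <- expand.grid(farm = factor(seq_len(n_farm)), month = factor(1:13))
farm_eff <- rnorm(n_farm, sd = 0.5)[as.integer(df$farm)]  # farm-level intercepts
df$fixed1 <- rnorm(nrow(df))
plant <- rbinom(nrow(df), 1, plogis(0.3 + 0.8 * df$fixed1 + farm_eff))
df$y <- plant * exp(rnorm(nrow(df), mean = 3 + 0.4 * df$fixed1 + farm_eff, sd = 0.3))

# Part a) binomial GLMM: did the farm plant at all in that month?
df$planted <- as.integer(df$y > 0)
m_zero <- glmer(planted ~ fixed1 + (1 | farm), data = df, family = binomial)

# Part b) linear mixed model for the amount, using only farm-months with y > 0
# (log scale chosen here to handle right skew; this is an assumption)
m_pos <- lmer(log(y) ~ fixed1 + (1 | farm), data = subset(df, y > 0))

fixef(m_zero)
fixef(m_pos)
```

Each fitted model could then be passed to DHARMa::simulateResiduals() separately, so the two decisions are checked on their own terms.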

EDIT 11/02/21: the limitation concerning the glmmTMB package described in 2. has been solved.

Related Solutions

Solved – Can you use glmmTMB to simultaneously model offsets and zero-inflation

tl;dr as far as I can tell at this point,

 glmmTMB(formula=<...>+offset(log(sampling_effort)),
         ziformula = ~.,
         family=nbinom2,
         data=<...>)

should do what you want. (1) ziformula specifies zero-inflation. (2) The offset term in the conditional model (formula) adjusts for sampling effort; as I understand your context, you shouldn't need any adjustment for effort/sample depth in the zero-inflation part of the model (since it describes structural zeros, which will always be observed as zero regardless of sampling effort). (3) family=nbinom2 takes care of other sources of overdispersion. [You might want to alternately consider nbinom1, which specifies $\textrm{Var} \propto \mu$ rather than $\textrm{Var} = \mu + \mu^2/k$.]
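To make the two variance forms concrete, here is a small base-R simulation (a sketch, not glmmTMB code) for a single mean. It assumes the nbinom1 form can be written as Var = mu * (1 + alpha), where alpha is a hypothetical name for the dispersion parameter, and relies on rnbinom() with size = k having variance mu + mu^2/k:

```r
set.seed(1)
n  <- 1e5
mu <- 10

# nbinom2-style: Var = mu + mu^2/k; with k = 2 this is 10 + 100/2 = 60
k  <- 2
y2 <- rnbinom(n, mu = mu, size = k)

# nbinom1-style: Var = mu * (1 + alpha); obtained from rnbinom() by setting
# size = mu/alpha, because mu + mu^2/(mu/alpha) = mu * (1 + alpha) = 30 here
alpha <- 2
y1 <- rnbinom(n, mu = mu, size = mu / alpha)

c(empirical = var(y2), theory = mu + mu^2 / k)
c(empirical = var(y1), theory = mu * (1 + alpha))
```

The practical difference shows up across observations with different means: under the first form the variance grows quadratically with the mean, under the second only linearly.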

This is an interesting question both from the statistical and the implementation point of view.

Implementation: you can add offsets to zero-inflation terms, you just can't do it via the ~. shortcut. For example, something like

 glmmTMB(y~x,
         family=nbinom2,
         ziformula=~x+offset(log(w)))

should work fine. It's only if you try to use ziformula = ~. to match the conditional formula in a lazy way that the offset gets dropped.

Statistics: I question a couple of your premises.

First of all, it's not immediately obvious that different numbers of counts should lead to biased results; it's important to know what the source of the differences is - i.e. large natural variations in density, or variations in searching/capture effort?
Second, you have to think carefully about the form of the offset. The zero-inflation term in glmmTMB uses a logit link (the only choice at present), so using log(effort) as the offset will mean that the odds of an observation being a structural zero are proportional to the effort, i.e. $\textrm{logit}(p_z) = ... + \log(e) \to p_z/(1-p_z) \propto e$. In general a complementary log-log link (with log(effort) as the offset) is more appropriate for detection probabilities, as that makes the hazard of finding something proportional to effort. However, if you're really trying to model structural zeros I question whether search effort should influence this part of the model at all ...
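A quick base-R check of what a log(effort) offset does under three different links (log, logit, and complementary log-log). The effort values and the linear predictor eta0 are made-up numbers, used only to show which quantity ends up proportional to effort in each case:

```r
effort <- c(1, 2, 4)          # hypothetical relative search efforts
eta0   <- -3                  # hypothetical linear predictor without the offset
eta    <- eta0 + log(effort)  # linear predictor with log(effort) as offset

p_log     <- exp(eta)             # inverse log link
p_logit   <- plogis(eta)          # inverse logit link
p_cloglog <- 1 - exp(-exp(eta))   # inverse complementary log-log link

# Which quantity scales with effort? Each ratio is relative to effort = 1.
ratio_prob   <- p_log / p_log[1]                    # the probability itself
odds         <- p_logit / (1 - p_logit)
ratio_odds   <- odds / odds[1]                      # the odds
cum_hazard   <- -log(1 - p_cloglog)
ratio_hazard <- cum_hazard / cum_hazard[1]          # the cumulative hazard

rbind(ratio_prob, ratio_odds, ratio_hazard)
```

All three rows reproduce the effort ratios; what differs is whether the probability, the odds, or the hazard is the thing being scaled.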

Based on the comments, I think this question may be based on a (reasonable) misunderstanding of ?glmmTMB, which currently reads

ziformula: ... Offset terms will automatically be dropped from the conditional effects formula when using ‘~.’

This warning applies only to the zero-inflation formula: the conditional formula (formula) argument isn't modified at all, it's only the zero-inflation version of the formula (and, as discussed above, you can include an offset in the zero-inflation part of the model if you really want to, by writing out the formula explicitly). Thus if you follow the fairly standard procedure of adding + offset(log(sampling_effort)) to formula then the conditional mean number of counts will be proportional to effort (assuming a [default] log-link model for the counts).
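The proportionality claim for the conditional model can be checked with a small simulation. This sketch uses a plain Poisson glm() from base R as a stand-in for the negative binomial glmmTMB fit (an assumption made to keep it dependency-free); the data and the covariate x are made up:

```r
set.seed(42)
n <- 2000
d <- data.frame(x = rnorm(n), sampling_effort = runif(n, 1, 10))
# Counts whose mean is proportional to effort, as the offset assumes
d$count <- rpois(n, lambda = d$sampling_effort * exp(0.5 + 0.3 * d$x))

# Log-link count model with log(sampling_effort) as an offset
fit <- glm(count ~ x + offset(log(sampling_effort)), family = poisson, data = d)

# Predicted means at the same x but doubling effort each time
nd   <- data.frame(x = 0, sampling_effort = c(1, 2, 4))
pred <- predict(fit, newdata = nd, type = "response")
pred / pred[1]
```

Because the offset enters the linear predictor with a fixed coefficient of 1, the predicted mean doubles whenever effort doubles, whatever the estimated coefficients are.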

In hopes of clarifying this I've tried editing this statement: is the following clearer?

Specifying ~. will set the right-hand side of the zero-inflation formula identical to the right-hand side of the main (conditional effects) formula; terms can also be added or subtracted. Offset terms will automatically be dropped from the conditional effects formula when using ~.

(Hmmm, now that I read this over it doesn't seem much better ...) Feel free to comment (or suggest clearer wording) here, or at https://github.com/glmmTMB/glmmTMB/issues ...

Solved – GLMM hurdle model for continuous data -Truncated negative binomial family in glmmTMB

I'm not sure why you say that glmmTMB can't handle zero-inflated Gamma responses: the glmmTMB news file says (for version 1.0.0, release 2020-02-03):

new ziGamma family (minor modification of stats::Gamma) allows zero-inflation (i.e., Gamma-hurdle models)
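As a hedged sketch of what such a Gamma-hurdle fit might look like, assuming glmmTMB 1.0.0 or later: the data are simulated, and the names d, g, x and m_hurdle are hypothetical. The log link is chosen here for convenience; it is not something the news entry prescribes.

```r
library(glmmTMB)

set.seed(101)
# Simulated data: 30 groups of 20 observations, all values made up
n_group <- 30
d <- data.frame(g = factor(rep(seq_len(n_group), each = 20)),
                x = rnorm(n_group * 20))
b       <- rnorm(n_group, sd = 0.3)[as.integer(d$g)]   # group intercepts
is_zero <- rbinom(nrow(d), 1, plogis(-0.5 - 0.7 * d$x)) # zero / non-zero part
mu      <- exp(1 + 0.5 * d$x + b)                       # mean of positive part
d$y     <- ifelse(is_zero == 1, 0,
                  rgamma(nrow(d), shape = 2, scale = mu / 2))

# Gamma hurdle: ziformula models the zeros, the conditional part the y > 0 values
m_hurdle <- glmmTMB(y ~ x + (1 | g),
                    ziformula = ~ x,
                    family = ziGamma(link = "log"),
                    data = d)
fixef(m_hurdle)
```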

I'd say it's not crazy to use a truncated negative binomial, but I'd be worried as it doesn't make statistical sense (technically, the likelihood of any non-integer value is 0 ...) (If you really had count data, a zero-inflated NB rather than a hurdle would be a reasonable option ...)

Given the distribution/density functions for each distribution, parameterized in terms of the mean $\mu$ and a shape/dispersion parameter ($k$ for NB, $a$ for Gamma):

$$ \textrm{NB}: \qquad \frac{(k/(k+\mu))^k}{\Gamma(k)} \cdot \frac{\Gamma(k+x)}{x!} \cdot (\mu/(k+\mu))^x $$

$$ \textrm{Gamma}: \qquad \frac{1}{(\mu/a)^a \Gamma(a)} \cdot x^{a-1} \cdot e^{-(x/(\mu/a))} $$

I think you may? be able to show that NB converges approximately to Gamma for large $x$ (but someone better/more dedicated than I am will need to do the math ...)

An empirical demonstration (not "proof"!)

# NB draws (mean 100, size 2) against a Gamma density with the same mean and shape
hist(rnbinom(100000,mu=100,size=2),freq=FALSE,ylim=c(0,0.008), breaks=100)
curve(dgamma(x,scale=100/2,shape=2),add=TRUE,col=2,lwd=2)

This definitely doesn't work for small mean (try it with mean=4 rather than 100 ...)
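The same comparison can be put in numbers rather than read off a plot. The helper below is hypothetical (nb_gamma_gap is not a function from any package): it sums the absolute difference between the NB probability mass and the Gamma density at the integers, as a rough measure of disagreement:

```r
# Sum over x = 0, 1, ..., x_max of |NB pmf(x) - Gamma density(x)|,
# with both distributions sharing mean mu and shape/size parameter
nb_gamma_gap <- function(mu, size, x_max = 5000) {
  x <- 0:x_max
  sum(abs(dnbinom(x, mu = mu, size = size) -
          dgamma(x, shape = size, scale = mu / size)))
}

gap_large <- nb_gamma_gap(mu = 100, size = 2)  # large mean
gap_small <- nb_gamma_gap(mu = 4,   size = 2)  # small mean
c(large_mean = gap_large, small_mean = gap_small)
```

The gap for the small mean should come out clearly larger, partly because the NB puts noticeable mass on zero while a Gamma density with shape 2 is zero there.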
