Model residual diagnostics of gamma GLMM with log-link

gamma distributiongeneralized linear modelglmmglmmtmbresiduals

I am trying to model fish length data (N > 115.000) which are highly right-skewed using linear mixed effects models. Actually all data make sense and the the extreme high values are valid measurements. My data look like: Fish length - raw data.

I thought of using a GLMM of the family "Gamma" (with log-link) as length is a censored response (0,∞). Specifically, I am using the function glmmTMB() to fit following model structure:

    mod <- glmmTMB(L ~ gear + (1|species) + (1|location),
            family=Gamma(link = "log"),
            data=fish_df)

The model converged without any errors/warnings and the parameter estimates are reasonable, I think. However, I wanted to check/diagnose the model, especially the residuals. I was following some posts in other threads such as "What are the assumptions of a Gamma GLM or GLMM for hypothesis testing?" and "Coping with underdispersion after DHARMa diagnosis on GLMM with Gamma link=log", as I am observing similar issues of rather extreme values in my response that lead to skewed residuals of my models, I guess. So it seems that there is some heteroscedasticity in my residuals. I mean, given the large sample size it is not too surprising to see also supposed 'outliers'.

    ypred = predict(mod)
    res = residuals(mod, type = 'pearson')
    plot(ypred,res)
    hist(res)

Residuals vs ypred

Histogramm of residuals

I further checked with the DHARMa package for some patterns which are difficult to interpret to me.

    sim_res <- DHARMa::simulateResiduals(modX5, 250)
    plot(sim_res,asFactor = F)
    DHARMa::testDispersion(mod)
    DHARMa::testDispersion(mod, type="PearsonChisq", 
         alternative="greater")

Dharma residual diagnostics

The dispersion tests do not detect any significant overdispersion in the residuals but rather low dispersion parameters (for PearsonChisq = 0.13703, for DHARMa nonparametric dispersion test = 0.28605). In fact the gamma has already a dispersion parameter, so there should be no 'overdispersed' gamma according to this post.

Dharma dispersion test

So I am wondering whether the I could trust the model outcomes and estimates for fixed effects? To have more robust/reliable information on parameter estimates I would use parametric bootstrapped CI e.g. of the fixed effects via bootMer (but this does not solve the model fit per se). So are there any other recommendation I could try to improve the model fit, or is the fit sufficiently okay?

Best Answer

As the developer of DHARMa, I think I can answer at least some of your questions:

  1. First DHARMa plot, left: the qq plot basically tells you that the residual distribution doesn't follow a gamma distribution. To me, it looks as if your residuals have more mass mass in the centre of the distribution than what would be expected in a gamma, which suggests a kind of ucnderdispersion (see also point 3).

  2. First DHARMa plot, right: as you seem to have only two factor levels, you shouldn't force plot(sim_res,asFactor = F), this plot will be hard to interpret. Use the default plot. Probably, there is not much to be seen here though.

  3. Regarding the dispersion: while it is true that the gamma has a dispersion parameter, the value of this parameter is optimised using the MLE. If the distribution you fit is not a gamma, the MLE for the dispersion parameter does not necessarily coincide with a "correct" dispersion value according to the test statistics that are used in the dispersion tests in DHARMa. Thus, it is not surprising to see the values that you find here, and the origin is likely that your residual distribution is not gamma.

  4. Regarding res ~ pred plot that you did by hand: here, in pred, you include the REs. Note that this can lead to misleading residual patterns, see https://github.com/florianhartig/DHARMa/issues/43

Overall, my assessment is that your residual distribution is not really following a gamma. The rest of the patterns is a consequence of this problems. As a remedy, you could try transformations, move to quantile regressions, or just ignore it. The deviation is notable, but given that the dispersion is roughly OK and you have tons of data, it probably won't make a big difference on the inference.

Related Question