How to Handle GLM (Gamma) When Residuals Are Not Normally Distributed

data transformationgamma distributiongeneralized linear modelmodelingresiduals

Until now I have only done very basic/simple simple stats, but now I got stuck in all the literature/tips/forums … It's about the following problem:

I have the following data:

x <- test results (in percentages %)
y <- years (categorical 2011 to 2016)

My data is not normally distributed and skewed

histogram of data

I went for a GLM with Gamma Family.

The problem is that when I check my assumptions for the model it's not right …

standard glm diagnostic plots from R

It wouldn't be appropriate to remove any outliers and transformations don't work. Could someone help me with next step solutions to find a appropriate model or adjustment …? Thank you so much …!

Best Answer

Residuals in glm's such as with the gamma family is not normally distributed, so simply a QQ plot against the normal distribution isn't very helpful. To understand this, note that the usual linear model given by $$ y_i = \beta_0 + \beta_1 x_1 + \dotso +\beta_p x_p + \epsilon $$ has a very special form, the observation can be decomposed as an expected value plus a disturbance (or error) term. This decomposition is not available in most other glm models (see also Logistic Regression - Error Term and its Distribution). For the gamma glm, we have, specifically, that the gamma density of the observations can be written (here I follow McCullagh & Nelder: "Generalized Linear Models" second edition, chapter 8) $$ f(y;\mu,\nu) = \frac1{\Gamma(\nu)}\left( \frac{\nu y}{\mu}\right)^\nu \exp(-\frac{\nu y}{\mu}) d\{\log y\} $$ the index $\nu$ is assumed the same for all the observations, while the mean $\mu$ is modeled via a logarithmic link function as $\mu = e^{\beta^T x}$. So, each observation has its own expectation modeled as a function of its covariates $x$. This model is necessarily heteroskedastic, for example, the variance being proportional to $\mu^2$. Since there is no additive decomposition of the observations as expectation + error, it is not clear how to define residuals. This have led to a plethora of different definitions of residuals, raw, Pearson, deviance, ... and how to choose?

One idea to solve this (and get more interpretable residual plots) is to use simulation. There is an R package helping with this:

      library(DHARMa)

which has a good vignette introducing its use, open it via

       vignette("DHARMa")

I will illustrate its use, and the use of simulation based residuals, with the same example used in How to validate & diagnose a gamma GLM in R? By searching this site for "DHARMa" you will find other examples of its use.

    library(MASS)
    data(leuk)
    leuk.mod  <-  glm(time ~ ag -1 + log10(wbc), 
         family=Gamma(log),  data=leuk)
    summary(leuk.mod)
    
    Call:
    glm(formula = time ~ ag - 1 + log10(wbc), 
        family = Gamma(log), data = leuk)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -2.1746  -1.2663  -0.4251   0.4962   1.9048  
    
    Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
    agabsent     5.8154     1.3487   4.312 0.000161 ***
    agpresent    6.8331     1.3215   5.171 1.44e-05 ***
    log10(wbc)  -0.7009     0.3167  -2.213 0.034617 *  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for Gamma family taken to be 1.087715)
    
        Null deviance: 2445.238  on 33  degrees of freedom
    Residual deviance:   40.319  on 30  degrees of freedom
    AIC: 301.49
    
    Number of Fisher Scoring iterations: 8

then using functions from DHARMa:

    simout  <-  simulateResiduals(leuk.mod,  n=250)
    plotSimulatedResiduals(simout) 

giving the plot:

Plot of simulation-based residuals from R package DHARMa

The QQ plot here is not against the normal distribution, it is against the simulation-based expected distribution of the residuals. That is the idea with the simulation: we do not know, theoretically, what is the null distribution of the residuals, so we approximate it by simulating from the fitted model. This have similarities with bayesian simulation-based inference. A paper giving the theoretical background is https://calhoun.nps.edu/bitstream/handle/10945/24467/NPS-OR-95-009.pdf;sequence=3 A bayesian paper is http://www.stat.columbia.edu/~gelman/research/published/dogs.pdf The other plot, residuals against expected (fitted) value, is augmented with three red lines which should (if the model is correct) be horizontal. They are based on quantile regression of the residuals, for the three quantiles 0.25, 0.5, 0.75. In our example this plot then could indicate some problems, but then it could only be an effect of the very low sample size.

Some other posts on this site about gamma glm's: When to use gamma GLMs? Link function in a Gamma-distribution GLM or search this site for "gamma glm".