Solved – Should I use Poisson or Gaussian family in GAM

generalized-additive-model

I would like to ask if I am right to use a Gaussian family in my GAM analysis. I have some count data of the number of animals captured per 100 traps, which is not normally distributed. Then I did a log transformation and the shape is better, but it's still not normally distributed. I have some zeros in my data. I have continuous values after the transformation, and I get an error if I run the GAM with Poisson family and model selection like so:

raveco1p.gam  <-gam(logRAVtrapsb ~ 1 + region + logarea + Type, data=dd, family = "poisson")

There were 18 warnings (use warnings() to see them)

> summary(raveco1p.gam)

Family: Poisson 
Link function: log 

Formula:
logRAVtrapsb ~ 1 + region + logarea + Type

Parametric coefficients:
                Estimate Std. Error z value Pr(>|z|)   
(Intercept)      -3.9693     1.7281  -2.297   0.0216 * 
region1   1.2937     0.4417   2.929   0.0034 **
logarea           0.4033     0.1688   2.390   0.0169 * 
Type1      1.5471     1.1976   1.292   0.1964   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


R-sq.(adj) =  0.445   Deviance explained = 42.5%
UBRE = -0.11976  Scale est. = 1         n = 44

> options(na.action = "na.fail")
> dredge(raveco1p.gam)
Fixed term is "(Intercept)"
Global model call: gam(formula = logRAVtrapsb ~ 1 + region + logarea + Type, family = "poisson", 
    data = dd)
---
Model selection table 
  (Intrc)  logar regin Type df logLik AICc
1 -0.5112                    0   -Inf  Inf
2 -1.3000 0.2169             2   -Inf  Inf
3 -1.2400            +       1   -Inf  Inf
4 -1.8450 0.1977     +       3   -Inf  Inf
5  0.7358                 +  1   -Inf  Inf
6 -1.7000 0.2547          +  3   -Inf  Inf
7 -0.2002            +    +  3   -Inf  Inf
8 -3.9690 0.4033     +    +  4   -Inf  Inf
Models ranked by AICc(x) 
There were 50 or more warnings (use warnings() to see the first 50)

When I used Gaussian family, there were no problems. Is this the correct approach?

Best Answer

I would start with the raw count, the actual number of animals found. This would be an integer number. The model I would start with would be a Poisson as the count response is strictly non-negative and discrete (can't have 2.5 animals). I would include an offset in the model that was the number of traps. An offset term is a term that receives a fixed coefficient of 1 (not a value estimated from the data). Assuming you use the default/canonical link for the Poisson family (the log link), you would add + offset(log(n_traps)) to the model formula, which would have the effect of turning the response from a count into a count per trap, but this is all happening internal to the model, you still retain the discrete, non-negative nature of the response.

The GLM/GAM avoids the need to log transform the response as that transformation can cause problems; the log of a zero count is -infinity which you can't include in the model. This has lead some to add a continuity correction of say log(count + 1), but this causes problems too. Also, you are most likely interested in the actual counts, not their logarithms. If you log transform the response and then fit a model your model is for the expected value of the log-count, not the expected value of the count. You can back-transform this expectation to the non-log scale of course, but due to Jensen's inequality, the back-transformed expected log-count is not the same as thing as the expected count. So if you log-transform the data you are fitting a model on a response scale that you are likely not interested in and is probably harder to communicate to your intended audience.

In summary:

  1. model the raw counts; don't transform them
  2. use a response distribution that respects the features of the data; the Gaussian distribution is for all real values between -Inf and +Inf, which doesn't match count data (non-negative integer values). The Poisson is a reasonable starting point as it has support for the non-negative integers, but it is often too restricted a distribution to model the features of animal abundance data. Commonly used alternatives are the Negative Binomial (use family = nb() in mgcv::gam() for example) for data with more variance that that assumed by the Poisson.
  3. include the "effort" variable as an offset. This has the effect of standardising the response so that you are modelling the count per unit effort (count per trap in this case), but it does so internal to the model. You can still recover the observed counts by multiplying the expected count for each observation (the fitted value) by the number of traps, for example.

Some useful references on this are, cited in order of reading:

O’Hara, R. B., and D. J. Kotze. 2010. Do not log-transform count data. Methods Ecol. Evol. 1: 118–122. doi:10.1111/j.2041-210X.2010.00021.x

Ives, A. R. 2015. For testing the significance of regression coefficients, go ahead and log-transform count data. Methods Ecol. Evol. doi:10.1111/2041-210X.12386

Warton, D. I., M. Lyons, J. Stoklosa, and A. R. Ives. 2016. Three points to consider when choosing a LM or GLM test for count data. Methods Ecol. Evol. doi:10.1111/2041-210X.12552

Related Question