I would like to ask if I am right to use a Gaussian family in my GAM analysis. I have some count data of the number of animals captured per 100 traps, which is not normally distributed. Then I did a log transformation and the shape is better, but it's still not normally distributed. I have some zeros in my data. I have continuous values after the transformation, and I get an error if I run the GAM with Poisson family and model selection like so:
raveco1p.gam <-gam(logRAVtrapsb ~ 1 + region + logarea + Type, data=dd, family = "poisson")
There were 18 warnings (use warnings() to see them)
> summary(raveco1p.gam)
Family: Poisson
Link function: log
Formula:
logRAVtrapsb ~ 1 + region + logarea + Type
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.9693 1.7281 -2.297 0.0216 *
region1 1.2937 0.4417 2.929 0.0034 **
logarea 0.4033 0.1688 2.390 0.0169 *
Type1 1.5471 1.1976 1.292 0.1964
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.445 Deviance explained = 42.5%
UBRE = -0.11976 Scale est. = 1 n = 44
> options(na.action = "na.fail")
> dredge(raveco1p.gam)
Fixed term is "(Intercept)"
Global model call: gam(formula = logRAVtrapsb ~ 1 + region + logarea + Type, family = "poisson",
data = dd)
---
Model selection table
(Intrc) logar regin Type df logLik AICc
1 -0.5112 0 -Inf Inf
2 -1.3000 0.2169 2 -Inf Inf
3 -1.2400 + 1 -Inf Inf
4 -1.8450 0.1977 + 3 -Inf Inf
5 0.7358 + 1 -Inf Inf
6 -1.7000 0.2547 + 3 -Inf Inf
7 -0.2002 + + 3 -Inf Inf
8 -3.9690 0.4033 + + 4 -Inf Inf
Models ranked by AICc(x)
There were 50 or more warnings (use warnings() to see the first 50)
When I used Gaussian family, there were no problems. Is this the correct approach?
Best Answer
I would start with the raw count, the actual number of animals found. This would be an integer number. The model I would start with would be a Poisson as the count response is strictly non-negative and discrete (can't have 2.5 animals). I would include an offset in the model that was the number of traps. An offset term is a term that receives a fixed coefficient of 1 (not a value estimated from the data). Assuming you use the default/canonical link for the Poisson family (the log link), you would add
+ offset(log(n_traps))
to the model formula, which would have the effect of turning the response from a count into a count per trap, but this is all happening internal to the model, you still retain the discrete, non-negative nature of the response.The GLM/GAM avoids the need to log transform the response as that transformation can cause problems; the log of a zero count is -infinity which you can't include in the model. This has lead some to add a continuity correction of say
log(count + 1)
, but this causes problems too. Also, you are most likely interested in the actual counts, not their logarithms. If you log transform the response and then fit a model your model is for the expected value of the log-count, not the expected value of the count. You can back-transform this expectation to the non-log scale of course, but due to Jensen's inequality, the back-transformed expected log-count is not the same as thing as the expected count. So if you log-transform the data you are fitting a model on a response scale that you are likely not interested in and is probably harder to communicate to your intended audience.In summary:
family = nb()
inmgcv::gam()
for example) for data with more variance that that assumed by the Poisson.Some useful references on this are, cited in order of reading:
O’Hara, R. B., and D. J. Kotze. 2010. Do not log-transform count data. Methods Ecol. Evol. 1: 118–122. doi:10.1111/j.2041-210X.2010.00021.x
Ives, A. R. 2015. For testing the significance of regression coefficients, go ahead and log-transform count data. Methods Ecol. Evol. doi:10.1111/2041-210X.12386
Warton, D. I., M. Lyons, J. Stoklosa, and A. R. Ives. 2016. Three points to consider when choosing a LM or GLM test for count data. Methods Ecol. Evol. doi:10.1111/2041-210X.12552