I came in a new post because I’m quite confused with the use of an offset in gam() function (mgcv package). I’m actually working on predicting marine top predator’s distribution using aerial surveys (predators’ observations) and secondary production model outputs. My data are shaped in a 0.25° resolution grid, over large areas such as south-west Indian Ocean. Surveys have been done in several sectors over that grid, that is, all the grid is not sampled. My secondary production covariates are from SEAPODYM model, developed by Lehodey et al. I have a set of about 12 covariates, and wish to find the best combination of them to predict predators’ distribution.
That said, because my goal is to predict predators’ distribution, I performed a model selection based on GCV score, after having excluded all combinations of correlated covariates. I am modelling observations, therefore I used a quasipoisson distribution (overdispersed data) with sampled area per pixel as an offset. I made my model selection with the model written as follows:
mod1<-gam(Y ~ offset(log(sampled area))+ covariate1 + covariate2 + covariate3 + covariate4, family=quasipoisson)
Until then, everything worked well, but then I started predicting predators’ observations over my full grid. Doing this with the previous model resulted in predicting only over sampled pixel. Following colleagues’ advices, I predicted using a different writing, thinking it would not change anything in the results:
mod2<-gam(Y ~ covariate1 + covariate2 + covariate3 + covariate4, offset(log(sampled area)), family=quasipoisson)
Indeed, it allows me to predict observation over the entire grid, but I am completely confused since I looked at the summaries of the models. The results are in fact quite different, with both GCV scores and Explained Deviances having higher values:
summary(mod1)
Family: quasipoisson
Link function: log
Formula:
species ~ offset(log(sampled area) + s(covariate1,k=4) + s(covariate2,k=4)
+ s(covariate3,k=4) + s(covariate4,k=4)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.12268 0.08025 -1.529 0.127
Approximate significance of smooth terms:
edf Ref.df F p-value
s(bathy_mnk_B) 2.722 2.951 15.89 4.83e-10 ***
s(meso_mnk_B) 2.995 3.000 17.52 3.52e-11 ***
s(bathy_mnk_P) 2.989 3.000 20.37 6.21e-13 ***
s(mmeso_mnk_P) 2.536 2.855 12.71 6.49e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.0838 Deviance explained = 16.4%
GCV score = 22.371 Scale est. = 22.188 n = 1491
summary(mod2)
Family: quasipoisson
Link function: log
Formula:
species ~ s(covariate1,k=4) + s(covariate2,k=4) + s(covariate3,k=4) +
s(covariate4,k=4)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.30093 0.07884 16.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(bathy_mnk_B) 2.802 2.974 17.37 5.18e-11 ***
s(meso_mnk_B) 3.000 3.000 16.35 1.86e-10 ***
s(bathy_mnk_P) 2.988 3.000 27.75 < 2e-16 ***
s(mmeso_mnk_P) 2.460 2.806 12.47 1.12e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.0938 Deviance explained = 17.5%
GCV score = 31.809 Scale est. = 31.547 n = 1491
Given that I based my model selection on GCV scores, the fact that it changed like that puzzled me, and I am wondering whether or not my model selection was correct. I re-performed a model selection with the second kind of model (Y~X1+X2,offset), and resulting selected models are not always the same than with my first procedure.
Seeing this, I am wondering on which writing of the model should I base my modelling and predictions?
Does anyone could explain me why those two models are giving such different results?
Thanks in advance for your help and advices,
Charlotte
Best Answer
First, from the help-page of
gam
(bold font added by me):So for predicting, you should use the formula-specification. Further, if you specify your offset as an argument, rather than in the formula, you should use an equal sign (=):
This should give the exact same result as this specification:
I honestly don't know what R caluclates if you specify the offset inside the brackets, like
offset(sampled area)
.Hope that helps.