Solved – Predicting with GAM, using an offset

generalized-additive-modeloffsetr

I came in a new post because I’m quite confused with the use of an offset in gam() function (mgcv package). I’m actually working on predicting marine top predator’s distribution using aerial surveys (predators’ observations) and secondary production model outputs. My data are shaped in a 0.25° resolution grid, over large areas such as south-west Indian Ocean. Surveys have been done in several sectors over that grid, that is, all the grid is not sampled. My secondary production covariates are from SEAPODYM model, developed by Lehodey et al. I have a set of about 12 covariates, and wish to find the best combination of them to predict predators’ distribution.

That said, because my goal is to predict predators’ distribution, I performed a model selection based on GCV score, after having excluded all combinations of correlated covariates. I am modelling observations, therefore I used a quasipoisson distribution (overdispersed data) with sampled area per pixel as an offset. I made my model selection with the model written as follows:

mod1<-gam(Y ~ offset(log(sampled area))+ covariate1 + covariate2 + covariate3 + covariate4, family=quasipoisson)

Until then, everything worked well, but then I started predicting predators’ observations over my full grid. Doing this with the previous model resulted in predicting only over sampled pixel. Following colleagues’ advices, I predicted using a different writing, thinking it would not change anything in the results:

mod2<-gam(Y ~ covariate1 + covariate2 + covariate3 + covariate4, offset(log(sampled area)), family=quasipoisson)

Indeed, it allows me to predict observation over the entire grid, but I am completely confused since I looked at the summaries of the models. The results are in fact quite different, with both GCV scores and Explained Deviances having higher values:

summary(mod1)
Family: quasipoisson 
Link function: log 

Formula:
species ~ offset(log(sampled area) + s(covariate1,k=4) + s(covariate2,k=4)
     + s(covariate3,k=4) + s(covariate4,k=4)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.12268    0.08025  -1.529    0.127

Approximate significance of smooth terms:
                 edf Ref.df     F  p-value    
s(bathy_mnk_B) 2.722  2.951 15.89 4.83e-10 ***
s(meso_mnk_B)  2.995  3.000 17.52 3.52e-11 ***
s(bathy_mnk_P) 2.989  3.000 20.37 6.21e-13 ***
s(mmeso_mnk_P) 2.536  2.855 12.71 6.49e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

R-sq.(adj) =  0.0838   Deviance explained = 16.4%
GCV score = 22.371  Scale est. = 22.188    n = 1491


summary(mod2)
Family: quasipoisson 
Link function: log 
Formula:
species ~ s(covariate1,k=4) + s(covariate2,k=4) + s(covariate3,k=4) + 
    s(covariate4,k=4)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.30093    0.07884    16.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Approximate significance of smooth terms:
                 edf Ref.df     F  p-value    
s(bathy_mnk_B) 2.802  2.974 17.37 5.18e-11 ***
s(meso_mnk_B)  3.000  3.000 16.35 1.86e-10 ***
s(bathy_mnk_P) 2.988  3.000 27.75  < 2e-16 ***
s(mmeso_mnk_P) 2.460  2.806 12.47 1.12e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

R-sq.(adj) =  0.0938   Deviance explained = 17.5%
GCV score = 31.809  Scale est. = 31.547    n = 1491

Given that I based my model selection on GCV scores, the fact that it changed like that puzzled me, and I am wondering whether or not my model selection was correct. I re-performed a model selection with the second kind of model (Y~X1+X2,offset), and resulting selected models are not always the same than with my first procedure.

Seeing this, I am wondering on which writing of the model should I base my modelling and predictions?

Does anyone could explain me why those two models are giving such different results?

Thanks in advance for your help and advices,

Charlotte

Best Answer

First, from the help-page of gam (bold font added by me):

offset: Can be used to supply a model offset for use in fitting. Note that this offset will always be completely ignored when predicting, unlike an offset included in formula: this conforms to the behaviour of lm and glm.

So for predicting, you should use the formula-specification. Further, if you specify your offset as an argument, rather than in the formula, you should use an equal sign (=):

mod2<-gam(Y ~ covariate1 + covariate2 + covariate3 + covariate4, offset=log(sampled area), family=quasipoisson)

This should give the exact same result as this specification:

mod1<-gam(Y ~ offset(log(sampled area))+ covariate1 + covariate2 + covariate3 + covariate4, family=quasipoisson)

I honestly don't know what R caluclates if you specify the offset inside the brackets, like offset(sampled area).

Hope that helps.