Solved – Predicting with GAM, using an offset

generalized-additive-modeloffsetr

I came in a new post because I’m quite confused with the use of an offset in gam() function (mgcv package). I’m actually working on predicting marine top predator’s distribution using aerial surveys (predators’ observations) and secondary production model outputs. My data are shaped in a 0.25° resolution grid, over large areas such as south-west Indian Ocean. Surveys have been done in several sectors over that grid, that is, all the grid is not sampled. My secondary production covariates are from SEAPODYM model, developed by Lehodey et al. I have a set of about 12 covariates, and wish to find the best combination of them to predict predators’ distribution.

That said, because my goal is to predict predators’ distribution, I performed a model selection based on GCV score, after having excluded all combinations of correlated covariates. I am modelling observations, therefore I used a quasipoisson distribution (overdispersed data) with sampled area per pixel as an offset. I made my model selection with the model written as follows:

mod1<-gam(Y ~ offset(log(sampled area))+ covariate1 + covariate2 + covariate3 + covariate4, family=quasipoisson)

Until then, everything worked well, but then I started predicting predators’ observations over my full grid. Doing this with the previous model resulted in predicting only over sampled pixel. Following colleagues’ advices, I predicted using a different writing, thinking it would not change anything in the results:

mod2<-gam(Y ~ covariate1 + covariate2 + covariate3 + covariate4, offset(log(sampled area)), family=quasipoisson)

Indeed, it allows me to predict observation over the entire grid, but I am completely confused since I looked at the summaries of the models. The results are in fact quite different, with both GCV scores and Explained Deviances having higher values:

summary(mod1)
Family: quasipoisson 
Link function: log 

Formula:
species ~ offset(log(sampled area) + s(covariate1,k=4) + s(covariate2,k=4)
     + s(covariate3,k=4) + s(covariate4,k=4)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.12268    0.08025  -1.529    0.127

Approximate significance of smooth terms:
                 edf Ref.df     F  p-value    
s(bathy_mnk_B) 2.722  2.951 15.89 4.83e-10 ***
s(meso_mnk_B)  2.995  3.000 17.52 3.52e-11 ***
s(bathy_mnk_P) 2.989  3.000 20.37 6.21e-13 ***
s(mmeso_mnk_P) 2.536  2.855 12.71 6.49e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

R-sq.(adj) =  0.0838   Deviance explained = 16.4%
GCV score = 22.371  Scale est. = 22.188    n = 1491


summary(mod2)
Family: quasipoisson 
Link function: log 
Formula:
species ~ s(covariate1,k=4) + s(covariate2,k=4) + s(covariate3,k=4) + 
    s(covariate4,k=4)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.30093    0.07884    16.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Approximate significance of smooth terms:
                 edf Ref.df     F  p-value    
s(bathy_mnk_B) 2.802  2.974 17.37 5.18e-11 ***
s(meso_mnk_B)  3.000  3.000 16.35 1.86e-10 ***
s(bathy_mnk_P) 2.988  3.000 27.75  < 2e-16 ***
s(mmeso_mnk_P) 2.460  2.806 12.47 1.12e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

R-sq.(adj) =  0.0938   Deviance explained = 17.5%
GCV score = 31.809  Scale est. = 31.547    n = 1491

Given that I based my model selection on GCV scores, the fact that it changed like that puzzled me, and I am wondering whether or not my model selection was correct. I re-performed a model selection with the second kind of model (Y~X1+X2,offset), and resulting selected models are not always the same than with my first procedure.

Seeing this, I am wondering on which writing of the model should I base my modelling and predictions?

Does anyone could explain me why those two models are giving such different results?

Thanks in advance for your help and advices,

Charlotte

Best Answer

First, from the help-page of gam (bold font added by me):

offset: Can be used to supply a model offset for use in fitting. Note that this offset will always be completely ignored when predicting, unlike an offset included in formula: this conforms to the behaviour of lm and glm.

So for predicting, you should use the formula-specification. Further, if you specify your offset as an argument, rather than in the formula, you should use an equal sign (=):

mod2<-gam(Y ~ covariate1 + covariate2 + covariate3 + covariate4, offset=log(sampled area), family=quasipoisson)

This should give the exact same result as this specification:

mod1<-gam(Y ~ offset(log(sampled area))+ covariate1 + covariate2 + covariate3 + covariate4, family=quasipoisson)

I honestly don't know what R caluclates if you specify the offset inside the brackets, like offset(sampled area).

Hope that helps.

Related Solutions

Solved – Predict GLM poisson with offset

It is correct you to get cases instead of rates since you are predicting cases. If you want to obtain the rates you should use the predict method on a new data set having all columns equal to data but the population column identically equal to 1, so to have log(populaton)=0. In this case you will get the number of cases of one unit of population, i.e. the rate.

Generalized Additive Models – Predicting with Random Effects in mgcv gam

From version 1.8.8 of mgcv predict.gam has gained an exclude argument which allows for the zeroing out of terms in the model, including random effects, when predicting, without the dummy trick that was suggested previously.

predict.gam and predict.bam now accept an 'exclude' argument allowing terms (e.g. random effects) to be zeroed for prediction. For efficiency, smooth terms not in terms or in exclude are no longer evaluated, and are instead set to zero or not returned. See ?predict.gam.

library("mgcv")
require("nlme")
dum <- rep(1,18)
b1 <- gam(travel ~ s(Rail, bs="re", by=dum), data=Rail, method="REML")
b2 <- gam(travel ~ s(Rail, bs="re"), data=Rail, method="REML")

head(predict(b1, newdata = cbind(Rail, dum = dum)))    # ranefs on
head(predict(b1, newdata = cbind(Rail, dum = 0)))      # ranefs off
head(predict(b2, newdata = Rail, exclude = "s(Rail)")) # ranefs off, no dummy

> head(predict(b1, newdata = cbind(Rail, dum = dum)))    # ranefs on
       1        2        3        4        5        6 
54.10852 54.10852 54.10852 31.96909 31.96909 31.96909  
> head(predict(b1, newdata = cbind(Rail, dum = 0)))      # ranefs off
   1    2    3    4    5    6 
66.5 66.5 66.5 66.5 66.5 66.5
> head(predict(b2, newdata = Rail, exclude = "s(Rail)")) # ranefs off, no dummy
   1    2    3    4    5    6 
66.5 66.5 66.5 66.5 66.5 66.5

Older approach

Simon Wood has used the following simple example to check this is working:

library("mgcv")
require("nlme")
dum <- rep(1,18)
b <- gam(travel ~ s(Rail, bs="re", by=dum), data=Rail, method="REML")
predict(b, newdata=data.frame(Rail="1", dum=0)) ## r.e. "turned off"
predict(b, newdata=data.frame(Rail="1", dum=1)) ## prediction with r.e

Which works for me. Likewise:

dum <- rep(1, NROW(na.omit(Orthodont)))
m <- gam(distance ~ s(age, bs = "re", by = dum) + Sex, data = Orthodont)
predict(m, data.frame(age = 8, Sex = "Female", dum = 1))
predict(m, data.frame(age = 8, Sex = "Female", dum = 0))

also works.

So I would check the data you are supplying in newdata is what you think it is as the problem may not be with VesselID — the error is coming from the function that would have been called by the predict() calls in the examples above, and Rail is a factor in the first example.

Best Answer

Related Solutions

Solved – Predict GLM poisson with offset

Generalized Additive Models – Predicting with Random Effects in mgcv gam

Older approach

Related Question