GAM partial effect line does not overlap raw data

generalized-additive-modelmgcv

I am using a GAM to model the relationship between a response (mean_NN_dist) and a predictor (spRichness), while incorporating the spatial structure of the dataset (x and y).

My goal is to plot the partial effect of the main predictor, while holding the spatial predictor constant, as a way of "accounting for" spatial autocorrelation.

I fit a GAM using the bam() function, as this is a relatively large dataset.

gamFit <- bam(mean_NN_dist ~ s(spRichness) + s(x, y, k = 1000, 
              bs = 'gp', m = 2), data = cellDF, gamma = 1.4, 
              method = 'fREML')

And for comparison, I fit another model that does not explicitly include the spatial structure.

gamFit2 <- bam(mean_NN_dist ~ s(spRichness), data = cellDF, 
               gamma = 1.4, method = 'fREML')

I use the visreg package to plot the partial effect plot. I plot on the response scale so that I can plot the GAM fitted line in the same plot as the raw data.

par(mfrow = c(1,2))

visreg(gamFit, 'spRichness', scale = 'response', ylim = c(0, 0.20))
points(jitter(cellDF$spRichness), cellDF$mean_NN_dist, 
        col = adjustcolor('gray70', alpha.f = 0.5))

visreg(gamFit2, 'spRichness', scale = 'response', ylim = c(0, 0.20))
points(jitter(cellDF$spRichness), cellDF$mean_NN_dist, 
        col = adjustcolor('gray70', alpha.f = 0.5))

enter image description here

What I find confusing is that with the model plotted on the left, which includes the spatial predictor, the fitted line does not overlap the raw data, whereas it does for the other model that does not include the spatial coordinates.

Is this a sign that I am not going about this properly, or is this potentially reasonable, given that this is a partial effects plot?

It might be helpful to know that most of the values in the dataset are for lower values of spRichness, and there are relatively few observations for higher values.

> table(cellDF$spRichness)

   2    3    4    5    6    7    8    9   10   11   12 
 967 1790 1810 1223 1129  528  250  185   44   25    6 

Advice would be greatly appreciated. Thanks!

EDIT:

Following up on a comment below, here is a plot of the raw data against the predicted response under the GAM.

> plot(cellDF$mean_NN_dist, predict(gamFit))
> abline(0, 1, col='red')

enter image description here

This may have to do with visreg R package, and how it is handling this GAM. If I do the following, this issue goes away.

par(mfrow=c(1,2))
plot.gam(gamFit, seWithMean = TRUE, select = 1, 
          shift = coef(gamFit)[1], shade = TRUE, 
          shade.col = 'light blue', ylim = c(0, 0.2))
points(jitter(cellDF$spRichness), cellDF$mean_NN_dist, 
         col = adjustcolor('gray70', alpha.f = 0.5))
plot.gam(gamFit2, seWithMean = TRUE, select = 1, 
         shift = coef(gamFit2)[1], shade = TRUE, 
         shade.col = 'light blue', ylim = c(0, 0.2))
points(jitter(cellDF$spRichness), cellDF$mean_NN_dist, 
         col = adjustcolor('gray70', alpha.f = 0.5))

enter image description here

Best Answer

It's because you forgot about the spatial component; you didn't tell visreg what covariate values to use for the spatial locations so it will have gone with a default which might be the mean of each spatial coordinate. If they happen to be in a place with higher counts, that would shift the plot up. The second model doesn't have a spatial component so there's nothing to condition on in this plot so the fit does go through the data as you would expect.

The reason plot.gam is producing what you want is because this is showing a partial effect which you then shift by adding on the intercept. This is basically ignoring the spatial component in the first model (it's contribution is being set to 0) which is what you also get if you exclude this spatial term from the model.

You can proceed to do what you want but visreg style without conditioning the predictions on particular values of the x and y coordinates with:

newdf <- with(cellDF,
              data.frame(sppRichness = seq(min(sppRichness),
                                           max(sppRichness),
                                           length = 100),
                         # have to provide something for x and y
                         x = 1, y = 1))
p <- predict(gamFit, new data = newdf,
             exclude = "s(x,y)", # exclude the spatial effect from yhat
             type = "link", se.fit = TRUE)
newdf <- cbind(newdf, as.data.frame(p))
newdf <- transform(newdf,
                   upper = fit + (2*se.fit),
                   lower = fit - (2*se.fit))

As this is a Gaussian model you don't need to back transform to the response scale - if it was you'd need to apply the inverse of the link function to each of fit, lower, and upper before plotting.

The exclude argument to predict.gam allows you to generate predictions ignoring the listed term(s). Here we are doing what visreg is doing but we don't need to condition on a spatial coordinate. This might not go as close to the middle of the data are you might like, but that'll be because some of the response magnitude is removed from the smooth of sppRichness and is accounted for by the spatial term.

Related Question