GAM partial effect line does not overlap raw data

generalized-additive-modelmgcv

I am using a GAM to model the relationship between a response (mean_NN_dist) and a predictor (spRichness), while incorporating the spatial structure of the dataset (x and y).

My goal is to plot the partial effect of the main predictor, while holding the spatial predictor constant, as a way of "accounting for" spatial autocorrelation.

I fit a GAM using the bam() function, as this is a relatively large dataset.

gamFit <- bam(mean_NN_dist ~ s(spRichness) + s(x, y, k = 1000, 
              bs = 'gp', m = 2), data = cellDF, gamma = 1.4, 
              method = 'fREML')

And for comparison, I fit another model that does not explicitly include the spatial structure.

gamFit2 <- bam(mean_NN_dist ~ s(spRichness), data = cellDF, 
               gamma = 1.4, method = 'fREML')

I use the visreg package to plot the partial effect plot. I plot on the response scale so that I can plot the GAM fitted line in the same plot as the raw data.

par(mfrow = c(1,2))

visreg(gamFit, 'spRichness', scale = 'response', ylim = c(0, 0.20))
points(jitter(cellDF$spRichness), cellDF$mean_NN_dist, 
        col = adjustcolor('gray70', alpha.f = 0.5))

visreg(gamFit2, 'spRichness', scale = 'response', ylim = c(0, 0.20))
points(jitter(cellDF$spRichness), cellDF$mean_NN_dist, 
        col = adjustcolor('gray70', alpha.f = 0.5))

What I find confusing is that with the model plotted on the left, which includes the spatial predictor, the fitted line does not overlap the raw data, whereas it does for the other model that does not include the spatial coordinates.

Is this a sign that I am not going about this properly, or is this potentially reasonable, given that this is a partial effects plot?

It might be helpful to know that most of the values in the dataset are for lower values of spRichness, and there are relatively few observations for higher values.

> table(cellDF$spRichness)

   2    3    4    5    6    7    8    9   10   11   12 
 967 1790 1810 1223 1129  528  250  185   44   25    6

Advice would be greatly appreciated. Thanks!

EDIT:

Following up on a comment below, here is a plot of the raw data against the predicted response under the GAM.

> plot(cellDF$mean_NN_dist, predict(gamFit))
> abline(0, 1, col='red')

This may have to do with visreg R package, and how it is handling this GAM. If I do the following, this issue goes away.

par(mfrow=c(1,2))
plot.gam(gamFit, seWithMean = TRUE, select = 1, 
          shift = coef(gamFit)[1], shade = TRUE, 
          shade.col = 'light blue', ylim = c(0, 0.2))
points(jitter(cellDF$spRichness), cellDF$mean_NN_dist, 
         col = adjustcolor('gray70', alpha.f = 0.5))
plot.gam(gamFit2, seWithMean = TRUE, select = 1, 
         shift = coef(gamFit2)[1], shade = TRUE, 
         shade.col = 'light blue', ylim = c(0, 0.2))
points(jitter(cellDF$spRichness), cellDF$mean_NN_dist, 
         col = adjustcolor('gray70', alpha.f = 0.5))

Best Answer

It's because you forgot about the spatial component; you didn't tell visreg what covariate values to use for the spatial locations so it will have gone with a default which might be the mean of each spatial coordinate. If they happen to be in a place with higher counts, that would shift the plot up. The second model doesn't have a spatial component so there's nothing to condition on in this plot so the fit does go through the data as you would expect.

The reason plot.gam is producing what you want is because this is showing a partial effect which you then shift by adding on the intercept. This is basically ignoring the spatial component in the first model (it's contribution is being set to 0) which is what you also get if you exclude this spatial term from the model.

You can proceed to do what you want but visreg style without conditioning the predictions on particular values of the x and y coordinates with:

newdf <- with(cellDF,
              data.frame(sppRichness = seq(min(sppRichness),
                                           max(sppRichness),
                                           length = 100),
                         # have to provide something for x and y
                         x = 1, y = 1))
p <- predict(gamFit, new data = newdf,
             exclude = "s(x,y)", # exclude the spatial effect from yhat
             type = "link", se.fit = TRUE)
newdf <- cbind(newdf, as.data.frame(p))
newdf <- transform(newdf,
                   upper = fit + (2*se.fit),
                   lower = fit - (2*se.fit))

As this is a Gaussian model you don't need to back transform to the response scale - if it was you'd need to apply the inverse of the link function to each of fit, lower, and upper before plotting.

The exclude argument to predict.gam allows you to generate predictions ignoring the listed term(s). Here we are doing what visreg is doing but we don't need to condition on a spatial coordinate. This might not go as close to the middle of the data are you might like, but that'll be because some of the response magnitude is removed from the smooth of sppRichness and is accounted for by the spatial term.

Related Solutions

Find difference in means at particular points in a by-factor GAMM

I believe this

... and there is no point at which the CI excludes 0, suggesting the means are the same at every point

is one source of confusion.

On the face of it, the data and smooths suggest that the difference of means is not exactly zero. Put another way, whatever difference in means there is between the two groups is small relative to the uncertainty in the estimates of those mean. You observed a difference but cannot say with any level of confidence that this observed difference is a real difference in the population of subjects you might have sampled.

Why are each of the model terms significant but the difference not significant? This comes from propagating the uncertainty in all of the things you estimated into the computation of the difference of means as a function of age.

Controlling for the smooth effect of Age by sex, you detect a small difference between the two groups in the mean response. The two smooth terms were assessed to be unlikely to have arise if the true functions were flat constant functions.

Computing a difference of means as a function of age requires you to bring together all these estimated effects and propagate their uncertainties into your estimates of the difference in the response between the two groups as a function of age.

You could try a more direct estimation of the thing you are interested in by using the ordered-factor method.

data <- transform(data, oSex = ordered(Sex))
m <- gam(Weight ~ oSex + s(Age) + s(Age, by = oSex) + s(Subject, bs = "re"),
         method = "REML"

where s(Age) is the smooth effect of Age for the reference level of Sex, while s(Age, by = oSex) is the smooth difference between the smooth effect of Age in the reference level and the other level of Sex.

This shouldn't change the outcome much though; I suspect you'll see a small effect of Sex overall once you condition on the smooth effects of Age in both groups, but you won't find large ("significant") differences when you ask the more specific question of "At what ages are the groups different?".

Best Answer

Related Solutions

Find difference in means at particular points in a by-factor GAMM

Related Question