Predicting mean smooth in GAM with multiple continuous and categorical “by” variables

generalized-additive-modelmgcvr

I use a GAM to model interactions between two continuous and one unordered categorical factor with three levels. Depending on significance of the smooth terms I would like to extract and visualize either level-specific smooth predictions if they are significant, or the interaction between the two continuous variables across all levels is, in case this one is signficant.

Using the iris dataset here because it is pretty close to my own data, which contains different species and their traits as predictors and matings as reponse (I am trying to model fitness landscapes). I specified the model below in an attempt to exclude the other terms (e.g., using predict.gam(mod1, newdata, exclude=c( "ti(Sepal.Length,Sepal.Width):Speciessetosa", "ti(Sepal.Length,Sepal.Width):Speciesversicolor", "ti(Sepal.Length,Sepal.Width):Speciesvirginica"), if this makes sense (see below):

## data
data(iris)

## gam
mod1 = gam(Petal.Width ~ Sepal.Length * Sepal.Width * Species + 
         s(Sepal.Length, k=3) +
         s(Sepal.Length, by=Species, k=3) +
         s(Sepal.Width, k=3) +
         s(Sepal.Width, by=Species, k=3) +
         ti(Sepal.Length, Sepal.Width, k=3) +
         ti(Sepal.Length, Sepal.Width, by=Species, k=3),
       method = "REML",
       data=iris
)
> anova(mod1)

Family: gaussian 
Link function: identity 

Formula:
Petal.Width ~ Sepal.Length * Sepal.Width * Species + s(Sepal.Length, 
    k = 3) + s(Sepal.Length, by = Species, k = 3) + s(Sepal.Width, 
    k = 3) + s(Sepal.Width, by = Species, k = 3) + ti(Sepal.Length, 
    Sepal.Width, k = 3) + ti(Sepal.Length, Sepal.Width, by = Species, 
    k = 3)

Parametric Terms:
                                 df       F  p-value
Sepal.Length                      1 4.30623 0.039860
Sepal.Width                       1     NaN      NaN
Species                           0     NaN      NaN
Sepal.Length:Sepal.Width          1 4.10032 0.044835
Sepal.Length:Species              2 0.94772 0.390174
Sepal.Width:Species               0     NaN      NaN
Sepal.Length:Sepal.Width:Species  2 1.02435 0.361785

Approximate significance of smooth terms:
                                                          edf         Ref.df        F   p-value
s(Sepal.Length)                                1.000000528612 1.000001005439 11.22252 0.0010463
s(Sepal.Length):Speciessetosa                  1.000000232442 1.000000464223  4.81234 0.0299554
s(Sepal.Length):Speciesversicolor              0.000001331420 0.000002637874  0.04797 0.9997172
s(Sepal.Length):Speciesvirginica               1.000000996148 1.000001892404  2.09408 0.1501724
s(Sepal.Width)                                 1.000007172805 1.000012865152  3.19529 0.0760765
s(Sepal.Width):Speciessetosa                   1.000000855121 1.000001708770  0.10222 0.7496783
s(Sepal.Width):Speciesversicolor               1.665759568126 1.888265235757  1.03729 0.2710795
s(Sepal.Width):Speciesvirginica                0.641978333598 0.870807147692  2.25871 0.1630554
ti(Sepal.Length,Sepal.Width)                   0.873528400557 0.982977416832  7.70746 0.0067218
ti(Sepal.Length,Sepal.Width):Speciessetosa     0.000001845268 0.000003682513  0.00065 0.5000000
ti(Sepal.Length,Sepal.Width):Speciesversicolor 0.000002545775 0.000005019515  0.03278 0.9996770
ti(Sepal.Length,Sepal.Width):Speciesvirginica  0.000003049801 0.000005392319  0.08124 0.9994726

So, considering these results, I would like to visualize the smooth for s(Sepal.Length):Speciessetosa separately from s(Sepal.Length), because the smooth effect is significant, but also ti(Sepal.Length,Sepal.Width) because the across-species smooth is significant, unlike the by-species terms.

Does this model specification make sense both in the context of i) my research question (do species differ in how their trait interaction-surfaces affect the response var) and ii) to selectively extract smooth surfaces within and across categorical levels?

Best Answer

The model as you have it doesn't make much sense.

why do you have linear terms and their interactions for length and width with species? The bases for the smooths already include those terms.
why are you setting k so low here - you're only allowing two degrees of freedom per smooth here. (Is this just an limitation of this example data set?)

Making statistical decisions in the way you plan using p values isn't advisable; you need to consider the potential effect sizes too. So I would suggest that you visualise all smooths.

I'm not sure what the problem is because plot(mod1) will plot all the smooths allowing you to see the average effect of the smooth for sepal length separately from the species-specific smooth of that same covariate. You can use the select argument to choose only the terms you want to plot (select uses the numeric order of the smooths as shown in the output from summary()). gratia::draw() will also allow the same plots.

The smooth interaction complicates things a bit, as the effect of sepal length is contained in the average smooth and the interaction. The way to proceed here is to prepare a counter factual and predict from the model at evenly space values of sepal length, while holding the sepal width fixed at some representative value. Then you also select which smooths you want in the counterfactual prediction:

# packages
library("mgcv")
library("gratia")
## data
data(iris)

## model
mod1 <- gam(Petal.Width ~ Species + 
         s(Sepal.Length, k=4) +
         s(Sepal.Length, by=Species, k=4) +
         s(Sepal.Width, k=4) +
         s(Sepal.Width, by=Species, k=4) +
         ti(Sepal.Length, Sepal.Width, k=4) +
         ti(Sepal.Length, Sepal.Width, by=Species, k=4),
       method = "REML",
       data = iris)

Let's generate data for the counterfactual (here I'm fixing this at Sepal.Width at 2.5 as the median value of the interaction is ~entirely flat)

ds <- data_slice(mod1, Sepal.Length = evenly(Sepal.Length),
  Sepal.Width = 2.5)

This produces

> ds                                                                          
# A tibble: 100 × 3
   Sepal.Length Sepal.Width Species
          <dbl>       <dbl> <fct>  
 1         4.3          2.5 setosa 
 2         4.34         2.5 setosa 
 3         4.37         2.5 setosa 
 4         4.41         2.5 setosa 
 5         4.45         2.5 setosa 
 6         4.48         2.5 setosa 
 7         4.52         2.5 setosa 
 8         4.55         2.5 setosa 
 9         4.59         2.5 setosa 
10         4.63         2.5 setosa 
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows

and we see another problem; because the model contains the parametric effects for the individual species mean petal widths, the intercept if for the reference level (with the default contrasts). So you have to either accept this and note that the plots of the smooth effect of Sepal.Length is offset by some vertical amount that would differ if you looked at a different species, or show the smooth for all three species, or change the contrasts such that there is a different interpretation of the intercept (whether that makes any sense depends on the specific contrasts you choose). This is important, because this counterfactual will include the estimated average petal width for the reference species but we are not going to show the species-specific smooth effects as per your request.

Given that caveat, we proceed by using predict and here it makes more sense to choose which terms to include than to spell out all the terms you want to exclude, which we do with the terms argument:

sm_want <- c("(Intercept)", "s(Sepal.Width)",
  "ti(Sepal.Length,Sepal.Width)")
fv <- fitted_values(mod1, data = ds, terms = sm_want)

which produces

> fv                                                                          
# A tibble: 100 × 7
   Sepal.Length Sepal.Width Species fitted    se  lower upper
          <dbl>       <dbl> <fct>    <dbl> <dbl>  <dbl> <dbl>
 1         4.3          2.5 setosa  -0.222 0.212 -0.638 0.193
 2         4.34         2.5 setosa  -0.219 0.209 -0.628 0.191
 3         4.37         2.5 setosa  -0.215 0.206 -0.619 0.189
 4         4.41         2.5 setosa  -0.211 0.203 -0.609 0.187
 5         4.45         2.5 setosa  -0.207 0.201 -0.600 0.186
 6         4.48         2.5 setosa  -0.203 0.198 -0.591 0.185
 7         4.52         2.5 setosa  -0.200 0.196 -0.583 0.184
 8         4.55         2.5 setosa  -0.196 0.193 -0.574 0.183
 9         4.59         2.5 setosa  -0.192 0.191 -0.566 0.182
10         4.63         2.5 setosa  -0.188 0.189 -0.558 0.182
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows

(Note that we ignored the Species parametric term here so this counterfactual data set is for setosa we would have to double check that this is the reference level - it is here, but you could add Species = ref_level(Species) to the data_slice() call above to be explicit about which species we are considering in the counterfactual.)

Now we can plot

library("ggplot2")

fv |>
  ggplot(aes(x = Sepal.Length, y = fitted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_line()

This produces

Wherein the figure shows the smooth effect of the sepal length on the response petal width, at a sepal width of 2.5 for Iris setosa.

Hope this answer explains how to plot the smooth you wanted?

As for the model specification I address that at the start (what you have is wrong as you are including the linear components twice (hence [in part] the NaNs).

This model (even corrected) doesn't answer your research question; you would need to fit a model without the species-specific effects (surfaces) and then compare the two models with a generalised likelihood ratio test or AIC for a test of the more complex model over the simpler model. Or you can try to estimate the difference between the estimated surfaces for pairs of species.

Doing that in general is relatively easy (I have a blog post on the general idea [shown only for univariate smooths] here). gratia has a function difference_smooths() for this but something isn't working with your specific model right now that I'll need to look at and fix.

Related Solutions

Solved – Predicting with GAM, using an offset

First, from the help-page of gam (bold font added by me):

offset: Can be used to supply a model offset for use in fitting. Note that this offset will always be completely ignored when predicting, unlike an offset included in formula: this conforms to the behaviour of lm and glm.

So for predicting, you should use the formula-specification. Further, if you specify your offset as an argument, rather than in the formula, you should use an equal sign (=):

mod2<-gam(Y ~ covariate1 + covariate2 + covariate3 + covariate4, offset=log(sampled area), family=quasipoisson)

This should give the exact same result as this specification:

mod1<-gam(Y ~ offset(log(sampled area))+ covariate1 + covariate2 + covariate3 + covariate4, family=quasipoisson)

I honestly don't know what R caluclates if you specify the offset inside the brackets, like offset(sampled area).

Hope that helps.

Solved – Predicting mean smooth in GAM with smooth-by-random-factor interaction

The solution suggested by Simon Wood to the simpler problem of predicting the population level effect from a model with random intercepts represented as a smooth is to use a by variable in the random effect smooth. See this Answer for some detail.

You can't do this dummy trick directly with your model as you have the smooth and random effects all bound up in the 2d spline term. As I understand it, you should be able to decompose your tensor product spline into "main effects" and the "spline interaction". I quote these as the decomposition will be to split out the fixed effects and random effects parts of the model.

Nb: I think I have this right but it would be helpful to have people knowledgeable with mgcv give this a once over.

## load packages
library("mgcv")
library("ggplot2")
set.seed(0)
means <- rnorm(5, mean=0, sd=2)
group <- as.factor(rep(1:5, each=100))

## generate data
df <- data.frame(group = group,
                 x = rep(seq(-3,3, length.out =100), 5),
                 y = as.numeric(dnorm(x, mean=means[group]) > 
                       0.4*runif(10)),
                 dummy = 1) # dummy variable trick

This is what I came up with:

gam_model3 <- gam(y ~ s(x, bs = "ts") + s(group, bs = "re", by = dummy) + 
                  ti(x, group, bs = c("ts","re"), by = dummy),
                  data = df, family = binomial, method = "REML")

Here I've broken out the fixed effects smooth of x, the random intercepts and the random - smooth interaction. Each of the random effect terms includes by = dummy. This allows us to zero out these terms by switching dummy to be a vector of 0s. This works because by terms here multiply the smooth by a numeric value; where dummy == 1 we get the effect of the random effect smooth but when dummy == 0 we are multiplying the effect of each random effect smoother by 0.

To get the population level we need just the effect of s(x, bs = "ts") and zero out the other terms.

newdf <- data.frame(group = as.factor(rep(1, 100)), 
                    x = seq(-3, 3, length = 100),
                    dummy = rep(0, 100)) # zero out ranef terms
ilink <- family(gam_model3)$linkinv      # inverse link function
df2 <- predict(gam_model3, newdf, se.fit = TRUE)
ilink <- family(gam_model3)$linkinv
df2 <- with(df2, data.frame(newdf,
                            response = ilink(fit),
                            lwr = ilink(fit - 2*se.fit),
                            upr = ilink(fit + 2*se.fit)))

(Note that all this was done on the scale of the linear predictor and only backtransformed at the end using ilink())

Here's what the population-level effect looks like

theme_set(theme_bw())
p <- ggplot(df2, aes(x = x, y = response)) +
geom_point(data = df, aes(x = x, y = y, colour = group)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.1) +
geom_line()
p

And here are the group level smooths with the population level one superimposed

df3 <- predict(gam_model3, se.fit = TRUE)
df3 <- with(df3, data.frame(df,
                            response = ilink(fit),
                            lwr = ilink(fit - 2*se.fit),
                            upr = ilink(fit + 2*se.fit)))

and a plot

p2 <- ggplot(df3, aes(x = x, y = response)) +
geom_point(data = df, aes(x = x, y = y, colour = group)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = group), alpha = 0.1) +
geom_line(aes(colour = group)) +
geom_ribbon(data = df2, aes(ymin = lwr, ymax = upr), alpha = 0.1) +
geom_line(data = df2, aes(y = response))
p2

From a cursory inspection this looks qualitatively similar to the result from Ben's answer but it is smoother; you don't get the blips where the next group's data is not all zero.

Best Answer

Related Solutions

Solved – Predicting with GAM, using an offset

Solved – Predicting mean smooth in GAM with smooth-by-random-factor interaction

Related Question