Predicting mean smooth in GAM with multiple continuous and categorical “by” variables

generalized-additive-modelmgcvr

I use a GAM to model interactions between two continuous and one unordered categorical factor with three levels. Depending on significance of the smooth terms I would like to extract and visualize either level-specific smooth predictions if they are significant, or the interaction between the two continuous variables across all levels is, in case this one is signficant.

Using the iris dataset here because it is pretty close to my own data, which contains different species and their traits as predictors and matings as reponse (I am trying to model fitness landscapes). I specified the model below in an attempt to exclude the other terms (e.g., using predict.gam(mod1, newdata, exclude=c( "ti(Sepal.Length,Sepal.Width):Speciessetosa", "ti(Sepal.Length,Sepal.Width):Speciesversicolor", "ti(Sepal.Length,Sepal.Width):Speciesvirginica"), if this makes sense (see below):

## data
data(iris)

## gam
mod1 = gam(Petal.Width ~ Sepal.Length * Sepal.Width * Species + 
         s(Sepal.Length, k=3) +
         s(Sepal.Length, by=Species, k=3) +
         s(Sepal.Width, k=3) +
         s(Sepal.Width, by=Species, k=3) +
         ti(Sepal.Length, Sepal.Width, k=3) +
         ti(Sepal.Length, Sepal.Width, by=Species, k=3),
       method = "REML",
       data=iris
)
> anova(mod1)

Family: gaussian 
Link function: identity 

Formula:
Petal.Width ~ Sepal.Length * Sepal.Width * Species + s(Sepal.Length, 
    k = 3) + s(Sepal.Length, by = Species, k = 3) + s(Sepal.Width, 
    k = 3) + s(Sepal.Width, by = Species, k = 3) + ti(Sepal.Length, 
    Sepal.Width, k = 3) + ti(Sepal.Length, Sepal.Width, by = Species, 
    k = 3)

Parametric Terms:
                                 df       F  p-value
Sepal.Length                      1 4.30623 0.039860
Sepal.Width                       1     NaN      NaN
Species                           0     NaN      NaN
Sepal.Length:Sepal.Width          1 4.10032 0.044835
Sepal.Length:Species              2 0.94772 0.390174
Sepal.Width:Species               0     NaN      NaN
Sepal.Length:Sepal.Width:Species  2 1.02435 0.361785

Approximate significance of smooth terms:
                                                          edf         Ref.df        F   p-value
s(Sepal.Length)                                1.000000528612 1.000001005439 11.22252 0.0010463
s(Sepal.Length):Speciessetosa                  1.000000232442 1.000000464223  4.81234 0.0299554
s(Sepal.Length):Speciesversicolor              0.000001331420 0.000002637874  0.04797 0.9997172
s(Sepal.Length):Speciesvirginica               1.000000996148 1.000001892404  2.09408 0.1501724
s(Sepal.Width)                                 1.000007172805 1.000012865152  3.19529 0.0760765
s(Sepal.Width):Speciessetosa                   1.000000855121 1.000001708770  0.10222 0.7496783
s(Sepal.Width):Speciesversicolor               1.665759568126 1.888265235757  1.03729 0.2710795
s(Sepal.Width):Speciesvirginica                0.641978333598 0.870807147692  2.25871 0.1630554
ti(Sepal.Length,Sepal.Width)                   0.873528400557 0.982977416832  7.70746 0.0067218
ti(Sepal.Length,Sepal.Width):Speciessetosa     0.000001845268 0.000003682513  0.00065 0.5000000
ti(Sepal.Length,Sepal.Width):Speciesversicolor 0.000002545775 0.000005019515  0.03278 0.9996770
ti(Sepal.Length,Sepal.Width):Speciesvirginica  0.000003049801 0.000005392319  0.08124 0.9994726

So, considering these results, I would like to visualize the smooth for s(Sepal.Length):Speciessetosa separately from s(Sepal.Length), because the smooth effect is significant, but also ti(Sepal.Length,Sepal.Width) because the across-species smooth is significant, unlike the by-species terms.

Does this model specification make sense both in the context of i) my research question (do species differ in how their trait interaction-surfaces affect the response var) and ii) to selectively extract smooth surfaces within and across categorical levels?

Related:

Best Answer

The model as you have it doesn't make much sense.

  1. why do you have linear terms and their interactions for length and width with species? The bases for the smooths already include those terms.

  2. why are you setting k so low here - you're only allowing two degrees of freedom per smooth here. (Is this just an limitation of this example data set?)

Making statistical decisions in the way you plan using p values isn't advisable; you need to consider the potential effect sizes too. So I would suggest that you visualise all smooths.

I'm not sure what the problem is because plot(mod1) will plot all the smooths allowing you to see the average effect of the smooth for sepal length separately from the species-specific smooth of that same covariate. You can use the select argument to choose only the terms you want to plot (select uses the numeric order of the smooths as shown in the output from summary()). gratia::draw() will also allow the same plots.

The smooth interaction complicates things a bit, as the effect of sepal length is contained in the average smooth and the interaction. The way to proceed here is to prepare a counter factual and predict from the model at evenly space values of sepal length, while holding the sepal width fixed at some representative value. Then you also select which smooths you want in the counterfactual prediction:

# packages
library("mgcv")
library("gratia")
## data
data(iris)

## model
mod1 <- gam(Petal.Width ~ Species + 
         s(Sepal.Length, k=4) +
         s(Sepal.Length, by=Species, k=4) +
         s(Sepal.Width, k=4) +
         s(Sepal.Width, by=Species, k=4) +
         ti(Sepal.Length, Sepal.Width, k=4) +
         ti(Sepal.Length, Sepal.Width, by=Species, k=4),
       method = "REML",
       data = iris)

Let's generate data for the counterfactual (here I'm fixing this at Sepal.Width at 2.5 as the median value of the interaction is ~entirely flat)

ds <- data_slice(mod1, Sepal.Length = evenly(Sepal.Length),
  Sepal.Width = 2.5)

This produces

> ds                                                                          
# A tibble: 100 × 3
   Sepal.Length Sepal.Width Species
          <dbl>       <dbl> <fct>  
 1         4.3          2.5 setosa 
 2         4.34         2.5 setosa 
 3         4.37         2.5 setosa 
 4         4.41         2.5 setosa 
 5         4.45         2.5 setosa 
 6         4.48         2.5 setosa 
 7         4.52         2.5 setosa 
 8         4.55         2.5 setosa 
 9         4.59         2.5 setosa 
10         4.63         2.5 setosa 
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows

and we see another problem; because the model contains the parametric effects for the individual species mean petal widths, the intercept if for the reference level (with the default contrasts). So you have to either accept this and note that the plots of the smooth effect of Sepal.Length is offset by some vertical amount that would differ if you looked at a different species, or show the smooth for all three species, or change the contrasts such that there is a different interpretation of the intercept (whether that makes any sense depends on the specific contrasts you choose). This is important, because this counterfactual will include the estimated average petal width for the reference species but we are not going to show the species-specific smooth effects as per your request.

Given that caveat, we proceed by using predict and here it makes more sense to choose which terms to include than to spell out all the terms you want to exclude, which we do with the terms argument:

sm_want <- c("(Intercept)", "s(Sepal.Width)",
  "ti(Sepal.Length,Sepal.Width)")
fv <- fitted_values(mod1, data = ds, terms = sm_want)

which produces

> fv                                                                          
# A tibble: 100 × 7
   Sepal.Length Sepal.Width Species fitted    se  lower upper
          <dbl>       <dbl> <fct>    <dbl> <dbl>  <dbl> <dbl>
 1         4.3          2.5 setosa  -0.222 0.212 -0.638 0.193
 2         4.34         2.5 setosa  -0.219 0.209 -0.628 0.191
 3         4.37         2.5 setosa  -0.215 0.206 -0.619 0.189
 4         4.41         2.5 setosa  -0.211 0.203 -0.609 0.187
 5         4.45         2.5 setosa  -0.207 0.201 -0.600 0.186
 6         4.48         2.5 setosa  -0.203 0.198 -0.591 0.185
 7         4.52         2.5 setosa  -0.200 0.196 -0.583 0.184
 8         4.55         2.5 setosa  -0.196 0.193 -0.574 0.183
 9         4.59         2.5 setosa  -0.192 0.191 -0.566 0.182
10         4.63         2.5 setosa  -0.188 0.189 -0.558 0.182
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows

(Note that we ignored the Species parametric term here so this counterfactual data set is for setosa we would have to double check that this is the reference level - it is here, but you could add Species = ref_level(Species) to the data_slice() call above to be explicit about which species we are considering in the counterfactual.)

Now we can plot

library("ggplot2")

fv |>
  ggplot(aes(x = Sepal.Length, y = fitted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_line()

This produces

enter image description here

Wherein the figure shows the smooth effect of the sepal length on the response petal width, at a sepal width of 2.5 for Iris setosa.

Hope this answer explains how to plot the smooth you wanted?

As for the model specification I address that at the start (what you have is wrong as you are including the linear components twice (hence [in part] the NaNs).

This model (even corrected) doesn't answer your research question; you would need to fit a model without the species-specific effects (surfaces) and then compare the two models with a generalised likelihood ratio test or AIC for a test of the more complex model over the simpler model. Or you can try to estimate the difference between the estimated surfaces for pairs of species.

Doing that in general is relatively easy (I have a blog post on the general idea [shown only for univariate smooths] here). gratia has a function difference_smooths() for this but something isn't working with your specific model right now that I'll need to look at and fix.