Model Selection – Handling Interactions and Interpreting Results in Generalized Additive Models

generalized-additive-model, interaction, model-selection

I'm relatively new to GAMs, but I'm handling a complex dataset.
I have a set of related questions, and none of the existing posts on these topics fully answers them.

Let's start with a simple question and example.
We simulate a dataset using example 2 from mgcv's gamSim(), which according to the help file is "A smooth function of 2 variables".
Thus we would expect an interaction between the two variables (x and z).
I'm going to add a spurious continuous variable (s1) and a spurious categorical variable (s2).

library(mgcv)
library(mgcViz)
n <- 300
set.seed(123)
# Simulate example 2 from mgcv: A smooth function of 2 variables.
dat <- gamSim(2, n=n, scale=.15) ## simulate data
dat$data$s1 <- runif(n, min=0, max=1) # spurious continuous variable
dat$data$s2 <- as.factor(sample(c("A", "S"), n, replace = TRUE)) # spurious categorical variable

Let's say we are interested in finding which variables best explain the data,
including whether there are important interactions between x and the other continuous variables.

According to answers such as this one: https://stats.stackexchange.com/a/405292/32339,
which refers to Marra and Wood (2011),
it sounds like shrinkage is the best option for selecting variables.
This can be achieved in two ways; as an example we will use the double penalty, via select=TRUE in mgcv.
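
(The other route is shrinkage smoothers, i.e. bs="ts" or bs="cs" inside s(); purely as a sketch on this simulated data, a main-effects-only version would look like:)

full_shrink <- gam(y ~ s(x, bs="ts") + s(z, bs="ts") + s(s1, bs="ts") + s2,
                   data=dat$data, method="ML")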
Based on answers such as this one https://stats.stackexchange.com/a/447321/32339, it looks like if I am interested in interactions,
I should put the main effects in s() and then use ti(), not te(), for model selection.

full_select_ti <- gam(y ~ s(x) + s(z) + s(s1) + s2 +
                        ti(x,z) + ti(x,s1), data=dat$data, select=TRUE, method="ML")
summary(full_select_ti) 


# Family: gaussian 
# Link function: identity 
#
# Formula:
# y ~ s(x) + s(z) + s(s1) + s2 + ti(x, z) + ti(x, s1)
#
# Parametric coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.30212    0.01228  24.598   <2e-16 ***
# s2S          0.01241    0.01761   0.705    0.481    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Approximate significance of smooth terms:
#                edf Ref.df      F  p-value    
# s(x)     3.1472900      9 13.013  < 2e-16 ***
# s(z)     2.6769672      9  3.317 8.94e-07 ***
# s(s1)    0.0001545      9  0.000   0.5194    
# ti(x,z)  7.8746346     16 16.656  < 2e-16 ***
# ti(x,s1) 1.5363495     16  0.240   0.0597 .  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# R-sq.(adj) =  0.609   Deviance explained =   63%
# -ML = -130.66  Scale est. = 0.021796  n = 300
# 

These results indicate that s2 (the categorical variable) and s1 (the continuous main effect) are not significant; ti(x,s1) is also not significant, though only marginally.

Q1: Generally, would I stop here, conclude that s2, ti(x,s1), and s1 are not important, and present the results of this model? Or do
I do some sort of stepwise selection, removing the least significant term,
while respecting the hierarchy (not removing a main effect while its interaction is present)? Many posts mention using shrinkage together with AIC; is there any issue with using p-values from the shrinkage fit along with AIC?

If I go ahead and do a backward selection based on p-values, here I would remove the categorical term.

full_Xs2_select_ti <- gam(y ~ s(x) + s(z) + s(s1) + 
                        ti(x,z) + ti(x,s1), data=dat$data, select=TRUE, method="ML")
summary(full_Xs2_select_ti)

# Family: gaussian 
# Link function: identity 
# 
# Formula:
# y ~ s(x) + s(z) + s(s1) + ti(x, z) + ti(x, s1)
# 
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 0.308164   0.008719   35.34   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Approximate significance of smooth terms:
#                edf Ref.df      F  p-value    
# s(x)     3.1593997      9 13.025  < 2e-16 ***
# s(z)     2.6860733      9  3.318 9.05e-07 ***
# s(s1)    0.0002157      9  0.000   0.5112    
# ti(x,z)  7.9039063     16 16.704  < 2e-16 ***
# ti(x,s1) 1.4673178     16  0.221   0.0682 .  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# R-sq.(adj) =  0.609   Deviance explained = 62.9%
# -ML = -130.41  Scale est. = 0.02176   n = 300
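
Just to make the AIC part of Q1 concrete, I could compare the two fits like this (both are fitted with method="ML", so the AIC values should be comparable):

AIC(full_select_ti, full_Xs2_select_ti)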

Q2: Now, regardless of whether I do the model selection above: given that main effects are affected by interactions, and it's not clear to me how this shrinkage (e.g., the double-penalty method) takes the hierarchy into account,
would it be better to remove the interaction and see what happens to the main effect?

So here I would do:

full_Xs2_Xtis1_select_ti <- gam(y ~ s(x) + s(z) + s(s1) + 
                            ti(x,z), data=dat$data, select=TRUE, method="ML")
summary(full_Xs2_Xtis1_select_ti)

# Family: gaussian 
# Link function: identity 
# 
# Formula:
# y ~ s(x) + s(z) + s(s1) + ti(x, z)
# 
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 0.306673   0.008733   35.12   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Approximate significance of smooth terms:
#               edf Ref.df     F  p-value    
# s(x)    3.110e+00      9 12.88  < 2e-16 ***
# s(z)    2.719e+00      9  3.48 6.01e-07 ***
# s(s1)   9.363e-05      9  0.00    0.565    
# ti(x,z) 7.938e+00     16 16.36  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# R-sq.(adj) =  0.605   Deviance explained = 62.3%
# -ML = -129.76  Scale est. = 0.022027  n = 300

Regardless of what is done above, according to https://stats.stackexchange.com/a/234817/32339,
if an interaction term is significant, you likely want to interpret the results using te(),
rather than s() + ti(). This is quite in line with how you would interpret the results of a simpler linear analysis,
where you would not interpret the main effects separately, but rather plot the full interaction between the two terms.
So here we would go ahead and do the following
(note that I have also removed s1, which was not significant above):

final_model <- gam(y ~ te(x,z), data=dat$data,  method="ML")
summary(final_model) 

# Family: gaussian 
# Link function: identity 
# 
# Formula:
# y ~ te(x, z)
# 
# Parametric coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 0.297592   0.008576    34.7   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Approximate significance of smooth terms:
#           edf Ref.df     F p-value    
# te(x,z) 13.75  17.17 26.59  <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# R-sq.(adj) =  0.604   Deviance explained = 62.2%
# -ML = -135.96  Scale est. = 0.022067  n = 300

# Plot for interpretation
vizObj_best_m <- getViz(final_model)
plot(vizObj_best_m)
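
As a cross-check, base mgcv can draw the same surface without mgcViz (the color scheme here is an arbitrary choice):

vis.gam(final_model, view=c("x", "z"), plot.type="contour", color="terrain")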

That's great and works well with the simulation: we got back what we simulated, and we can interpret the plot easily.

My real data is more complicated. It has random effects, etc.
Importantly, the interactions between x and multiple continuous variables are significant. So it would be as if both ti(x,z) and ti(x,s1) were significant, and we would end up with:

m_2interaction_with_x <- gam(y ~ te(x,z) + te(x,s1), data=dat$data,  method="ML")
# We ignore that te(x,s1) is not significant here 
# (I can't figure out a quick simulation that does this)
vizObj_m_2interaction_with_x <- getViz(m_2interaction_with_x)
print(plot(vizObj_m_2interaction_with_x, allTerms= TRUE), pages=1)


Q3: One of the main remaining questions (and the most important one) that I have is: how do I interpret these results, given that x appears in both terms?
This answer https://stats.stackexchange.com/a/519601/32339 appears to suggest that if the same variable is in multiple interactions,
I should decompose it as follows:

m_2interaction_with_x2 <- gam(y ~ s(x) + s(z) + s(s1) + ti(x,z) + ti(x,s1), data=dat$data,  method="ML")
vizObj_m_2interaction_with_x2 <- getViz(m_2interaction_with_x2)
print(plot(vizObj_m_2interaction_with_x2, allTerms= TRUE), pages=1)

But then how do I interpret the results?
The relationship with just x or just z would be simple to interpret, but with linear models we generally don't want to interpret the main effects when there is an interaction.
So should I interpret the s() terms at all here?
And how do we interpret the plots showing the interactions of x with z or s1?
Any tips for the best interpretation of the results would be fantastic!

Reference:
Marra, G., and S. N. Wood. 2011. Practical variable selection for generalized additive models. Comput. Stat. Data Anal. 55: 2372–2387. doi:10.1016/j.csda.2011.02.004

Update

Follow-up based on the fantastic response from @GavinSimpson.

I think I understand option 1. It can be coded as follows.

library(gratia)
library(tidyr)
library(ggplot2) # needed for ggplot() below
# Option 1
slice_xz <- data_slice(full_select_ti, x= evenly(x, n=100), z=evenly(z, n=100))
pred_xz <- predict(full_select_ti,slice_xz)
summary(pred_xz)
slice_xz_pred <- cbind(slice_xz, est=pred_xz)


slice_xz_pred %>%
  ggplot(aes(y = z, x=x, z=est)) +
  geom_contour_filled() + 
  labs(title = expression("Partial effect of" ~ s(x) + s(z) + ti(x,z)))

ggplot() +
  geom_contour_filled(data = slice_xz_pred, aes(y = z, x=x, z=est)) + 
  geom_point(data=dat$data, aes(x=x, y=z)) +
  labs(title = expression("Partial effect of" ~ s(x) + s(z) + ti(x,z)))


Q4: My remaining question for this option is: if I plot the SEs, using, say, the output from predict() with se.fit=TRUE, I would get prediction intervals rather than confidence intervals, right?
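
To make this concrete, I mean computing intervals along these lines (the 1.96 multiplier is just the usual normal approximation):

pred_se <- predict(full_select_ti, slice_xz, se.fit=TRUE)
slice_xz_se <- cbind(slice_xz,
                     est   = pred_se$fit,
                     lower = pred_se$fit - 1.96 * pred_se$se.fit,
                     upper = pred_se$fit + 1.96 * pred_se$se.fit)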

I don't fully understand option 2. I think I can code it, though I'm not sure whether I'm supposed to make a different slice for each term, or whether I can use the slice from option 1. Here is the code using the same slice as in option 1:

# Note: the terms argument only takes effect with type = "terms";
# otherwise predict() returns the full linear predictor
pred_xz_terms <- predict(full_select_ti, slice_xz, type = "terms",
                         terms = c("s(x)", "s(z)", "ti(x,z)"))
pred_xz_2 <- rowSums(pred_xz_terms)
slice_xz_pred_2 <- cbind(slice_xz, est = pred_xz_2)

ggplot() +
  geom_contour_filled(data = slice_xz_pred_2, aes(y = z, x=x, z=est)) + 
  geom_point(data=dat$data, aes(x=x, y=z)) +
  labs(title = expression("Partial effect of" ~ s(x) + s(z) + ti(x,z)))


Q5: The results are different, but I don't understand why I would do this, or what the advantages are over the other option.

Best Answer

You don't say why you are fitting the model (whether for prediction or for estimation/illumination), and this can affect your decisions regarding model selection. Let's assume it is not for prediction of new data.

Q1

You don't have to fit interactions using s(x1) + s(x2) + ti(x1,x2), but this form helps if your question is "Is there an interactive effect of x1 and x2 on the response?" You could just estimate the entire effect as te(x1,x2). In your case, however, the ti() construction provides another advantage: as you have multiple interactions involving a single variable x1, it can be computationally advisable to separate out the marginal smooths so that the basis functions for those smooths appear in the model matrix only once. If you used te(x1,x2) + te(x1,x3), basis functions for the marginal smooth of x1 would be included in the model matrix twice, and while {mgcv} will try to remove collinear terms, it isn't always 100% effective, and this collinearity can lead to fitting problems. Instead, you would fit the model with ti() smooths: s(x1) + s(x2) + s(x3) + ti(x1,x2) + ti(x1,x3)
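
For concreteness, a sketch of the two formulations (y, x1, x2, x3, and df are placeholders here, not objects from the question):

# (a) full tensor products: the marginal basis for x1 enters the
#     model matrix twice, risking collinearity between the te() terms
m_te <- gam(y ~ te(x1,x2) + te(x1,x3), data=df, method="ML")

# (b) decomposed form: each marginal smooth appears only once, and
#     ti() captures just the pure interactions
m_ti <- gam(y ~ s(x1) + s(x2) + s(x3) + ti(x1,x2) + ti(x1,x3),
            data=df, method="ML")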

This, of course, leads to the visualisation problem you also mention.

As you have put extra penalties on the null spaces of the smooths in your model, you can just stop there. It would be wrong to drop the terms that have effectively been shrunk out of the model and refit with the remaining terms, especially if you would then use the p-values from the second model.

Also, and especially if you aren't building a model for prediction, you should just leave the model as estimated at this point. The argument I give here is that were you to remove those effectively-removed or non-significant terms from the model, you would be making the very strong assumption that the effects of these terms on the response are all exactly zero. But your model has estimated effects that aren't exactly zero (perhaps with the exception of s(s1)). I would argue that you should leave this model as it is, as the tests for the smooths take into account your uncertainty about which terms should be in the model (note the Ref.df column, which is the DF used in the construction of the p-value).

In sum, just leave this model as it is and don't drop out any non-significant or effectively removed terms.

Q2

This is moot given what I suggest above; but also, it's hard to imagine a situation where a marginal smooth (s(x)) would be effectively shrunk out of the model while the interaction it is part of remains in the model with substantial curvature.

As recommended above, I wouldn't remove the interaction. I would use the model as is, unless I were going for prediction, in which case I would be testing on an independent test set.

You can only go to the te() version if you decide to drop the other marginal smooth and interaction that don't involve terms in the te() you are fitting. As I argued above, if the point is to do the best inference, I would suggest you just leave the model as is.

Q3

You can solve your visualisation problem, if you keep the ti() versions, in one of two ways (sketched in code after the list):

  1. Create adjusted predictions from the model, varying the variables involved in the interaction you wish to visualise while keeping the other variables at representative constant values (the observation closest to the median of each continuous variable, and the modal level for factor variables; you can get these from model$var.summary, replacing model with your fitted GAM). This yields data for a plot of the smooth on the response or link scale that includes the intercept and the effects of any parametric terms, with the predicted values conditional upon the constants chosen for the variables not involved in the interaction you are trying to visualise.
  2. Proceed as above, but use predict(..., type = "terms") and compute the row sums over the columns of the returned object that relate to the smooths of interest (s(x), s(z), and ti(x,z)). This gives you an estimated surface similar to the one you plotted (with the smooth centred about 0), because in this formulation the intercept is not included (nor are any parametric terms, if you don't include their columns in the row sums). You still have to provide a single constant value for each covariate that you aren't varying, otherwise predict() will throw an error.
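
As a rough sketch of both options, using the full_select_ti model from the question (the grid construction via expand.grid() and the grid size are illustrative choices):

# Option 1: adjusted predictions, holding s1 and s2 at representative
# values stored in the model's var.summary component
vs <- full_select_ti$var.summary
grid <- expand.grid(x  = seq(min(dat$data$x), max(dat$data$x), length.out=100),
                    z  = seq(min(dat$data$z), max(dat$data$z), length.out=100),
                    s1 = vs$s1[2],  # observed value closest to the median
                    s2 = vs$s2[1])  # modal factor level
p1 <- predict(full_select_ti, newdata=grid, se.fit=TRUE)
grid$fit <- p1$fit  # includes the intercept and parametric terms

# Option 2: per-term contributions summed over the smooths of interest;
# the intercept and parametric terms are not included
p2 <- predict(full_select_ti, newdata=grid, type="terms",
              terms=c("s(x)", "s(z)", "ti(x,z)"))
grid$fit_terms <- rowSums(p2)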

In both of these you can also use the exclude argument to exclude the effects of named terms from the predictions. This isn't strictly needed for option 2, as you are doing it manually when you only include selected columns in the row sums; leaving columns out of the row sum effectively excludes those terms.
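
For example, reusing the model and grid from the sketch above:

# set the contribution of s(s1) to zero in the predictions
pred_no_s1 <- predict(full_select_ti, newdata=grid, exclude="s(s1)")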

When you get into more complex situations, with covariates appearing in multiple smooths and random effects, you will need to use one of the above two approaches, and likely you'll want the first, which gives you total control over what you are visualising, as you are specifying the values of the covariates you want to predict at.

The data_slice() function in my {gratia} package can help with this (you'll need the development version, which you can install from GitHub or my R-universe, to have something that works well for the case you describe).
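
For reference, installing from R-universe typically looks something like this (the repository URL follows the usual r-universe naming pattern and is an assumption on my part):

install.packages("gratia",
                 repos = c("https://gavinsimpson.r-universe.dev",
                           "https://cloud.r-project.org"))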