Solved – Smooth bivariate interaction decomposition in GAM models

generalized-additive-model, interaction

Background

Consider the following additive model containing a smooth bivariate interaction term:

$y = f(x,z) + \epsilon$

where y is a continuous outcome variable, x and z are continuous predictor variables and $\epsilon$ is a random error term following a Normal distribution with mean 0 and unknown variance $\sigma^2$. Furthermore, $f$ is an unknown smooth bivariate function. In R, such a model would be fitted with the commands:

library(mgcv) 

model.1 <- gam(y ~ te(x,z), family=gaussian, data=data)

The mgcv help file https://www.rdocumentation.org/packages/mgcv/versions/1.8-24/topics/gam.models mentions that sometimes it is interesting to allow $f$ to have a main effects + interaction structure of the form $f(x,z) = f_1(x) + f_2(z) + f_3(x,z)$, meaning that the model would be expressed as:

$y = f_1(x) + f_2(z) + f_3(x,z) + \epsilon$

and fitted with the R command:

model.2 <- gam(y ~ ti(x) + ti(z) + ti(x,z), family=gaussian, data=data)

Questions

Q1: Generally speaking, under what circumstances would it be interesting to consider the main effects + interaction formulation for a bivariate smooth interaction? How would we know for a specific data set whether those circumstances were applicable? In particular, are those circumstances supposed to be driven by the data or by the research questions?

Q2: Assuming the circumstances in question are applicable for a data set, what is the interpretation of the main effects $f_1(x)$ and $f_2(z)$? For example, can we say that $f_1(x)$ represents the overall nonlinear effect of $x$ on the outcome variable $y$, averaged across the values of $z$, while $f_2(z)$ represents the overall nonlinear effect of $z$ on the outcome variable $y$, averaged across the values of $x$? Also, would we need to worry in this context about the main smooth effects not necessarily being interpretable in the presence of a smooth interaction?

Q3: What is the interpretation of the smooth interaction term $f_3(x, z)$? Intuitively, how is this interpretation different from the interpretation of the smooth interaction term $f(x, z)$?

Many thanks for any insights you can share.

Best Answer

Q1

There are a couple of key practical situations where the decomposed model, using separate marginal smooths plus a special tensor-product interaction smooth, ti(), is a useful approach to take.

The first situation is one where testing for the presence of an interaction is important to the problem being addressed with the GAM. If we fit the bivariate smooth model

$$y_i \sim \mathcal{D}(\mu_i, \theta)$$ $$g(\mu_i) = \beta_0 + f(x_i, z_i)$$

representing $f(x_i, z_i)$ via a tensor product smooth using te() in mgcv, then, despite estimating the bivariate smooth interaction model we wanted, we have no model term that explicitly represents the interaction we want to test. One might assume that we can use a generalized likelihood ratio test (GLRT) to compare the model above with the simpler model

$$g(\mu_i) = \beta_0 + f_1(x_i) + f_2(z_i)$$

but one has to take great care to ensure that these two models are strictly nested for that GLRT approach to be valid.

To get the correct nesting we can decompose the full bivariate tensor product smooth into separate marginal smooths plus a tensor product smooth whose basis expansion has had the main effects of the separate marginal smooths removed from it.

Our model then becomes

$$g(\mu_i) = \beta_0 + f_1(x_i) + f_2(z_i) + f_3^{*}(x_i, z_i)$$

where the superscript $*$ is just there to remind us that $f_3^{*}(x_i, z_i)$ uses a different basis expansion to that of $f(x_i, z_i)$ from the original model.

Now we can compare this decomposed model with the simpler model containing just the two univariate smooths, with the proper nesting required for the GLRT, or we can use the Wald-like test of the null hypothesis that $f_3^{*}(x_i, z_i) = 0$.
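As a minimal sketch of both approaches, using the simulated df from the Code section at the end of this answer; method = 'ML' rather than 'REML' is used here because ML is the appropriate likelihood when comparing models that differ in their terms:

m_add <- gam(y ~ s(x) + s(z), data = df, family = gaussian, method = 'ML')
m_int <- gam(y ~ s(x) + s(z) + ti(x, z), data = df, family = gaussian, method = 'ML')

## approximate GLRT between the strictly nested models
anova(m_add, m_int, test = 'F')

## or the Wald-like test of the null hypothesis ti(x,z) = 0 in the summary
summary(m_int)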

A practical example of when this might be needed in my own work is modelling monthly or daily air temperature time series. These data have a seasonal component plus a long-term trend and it is of scientific interest to ask whether the air temperatures are

  1. simply increasing, or,
  2. in addition to increasing over time, changing in their seasonal distribution as temperatures rise with the long-term trend.

The first option would amount to fitting the simpler univariate smooth model and focusing inference on the smooth of year:

$$g(\mu_i) = \beta_0 + f_1(\text{year}_i) + f_2(\text{month}_i),$$

whilst the second option would imply that $f_3^{*}(\text{year}_i,\text{month}_i)$ in

$$g(\mu_i) = \beta_0 + f_1(\text{year}_i) + f_2(\text{month}_i) + f_3^{*}(\text{year}_i,\text{month}_i)$$

is non-zero.
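For concreteness, a sketch of the two competing fits, assuming a hypothetical data frame temps with columns temp, year and month; the cyclic cubic spline (bs = 'cc') for month is an assumption here, so that the seasonal smooth joins up across the December-January boundary:

## trend plus fixed seasonal cycle
m_trend  <- gam(temp ~ s(year) + s(month, bs = 'cc'),
                data = temps, method = 'REML')

## trend, seasonal cycle, plus a change in the cycle over time
m_change <- gam(temp ~ s(year) + s(month, bs = 'cc') +
                  ti(year, month, bs = c('cr', 'cc')),
                data = temps, method = 'REML')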

Another situation where the tensor product interaction smooth is useful is when you wish to include several smooth effects in your model, and only some of them are required to interact or be part of bivariate relationships. Say we have four covariates, which we will denote $x_i$, $z_i$, $v_i$ and $w_i$. Further assume we want to fit a model where $\mathbf{x}$ interacts with $\mathbf{z}$ and $\mathbf{v}$, whilst $\mathbf{z}$ interacts with $\mathbf{v}$ and $\mathbf{w}$. We could fit the following model

$$g(\mu_i) = \beta_0 + f_1(x_i, z_i) + f_2(x_i, v_i) + f_3(z_i, v_i) + f_4(z_i, w_i)$$

but we should note that the main effects of all of the covariates except $\mathbf{w}$ occur in the bases of two or more smooth functions. Despite the identifiability constraints that mgcv imposes, it is likely that problems will occur when fitting the model with these overlapping terms.

We can largely avoid these issues through the use of the tensor product interaction smooths and the decomposed parameterisation of the model:

$$g(\mu_i) = \beta_0 + f_1(x_i) + f_2(z_i) + f_3(v_i) + f_4(w_i) + f_5^{*}(x_i, z_i) + f_6^{*}(x_i, v_i) + f_7^{*}(z_i, v_i) + f_8^{*}(z_i, w_i)$$

where the main effect of each covariate is now separated out from the interaction smooths.
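In mgcv, the decomposed model might be fitted as follows; a sketch, assuming a hypothetical data frame df4 holding y and the four covariates:

m_dec <- gam(y ~ s(x) + s(z) + s(v) + s(w) +
               ti(x, z) + ti(x, v) + ti(z, v) + ti(z, w),
             data = df4, method = 'REML')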

How one approaches the modelling of a specific data set will depend on the specifics of the problem being tackled. Even if you are not explicitly interested in testing the interaction effect, once you start estimating multiple bivariate (or higher) smooth functions in moderately complex settings, the decomposed version of the model can become necessary to avoid computational issues.

Note that the decomposed model is not a free lunch; the te(x,z) smooth would require the selection of two smoothness parameters, whereas the decomposed version s(x) + s(z) + ti(x,z) would require the selection of four smoothness parameters: one each for the two marginal smooths, plus two for the tensor product interaction. Hence, if you don't need the decomposition, you can fit a simpler model by using the full tensor product rather than the decomposed parameterization.
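You can check these counts directly from the fitted objects; a minimal sketch, reusing the simulated df and model m3 from the Code section below:

m_te <- gam(y ~ te(x, z), data = df, method = 'REML')
length(m_te$sp)  ## 2 smoothness parameters for te(x, z)
length(m3$sp)    ## 4: one each for s(x) and s(z), two for ti(x, z)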

Q2

The interpretation doesn't appear to be quite so simple as $f_1(x_i)$ being the smooth effect of $\mathbf{x}$ averaged over the values of $\mathbf{z}$, at least not in the decomposed model.

The figure below shows estimates of $f_1(x_i)$ where the true relationship with $\mathbf{y}$ involves a bivariate effect of $\mathbf{x}$ and $\mathbf{z}$. The models are

m1 <- gam(y ~ s(x), data = df, method = 'REML')
m2 <- gam(y ~ s(x) + s(z), data = df, method = 'REML')
m3 <- gam(y ~ s(x) + s(z) + ti(x,z), data = df, method = 'REML')

(full code is shown below). Here $f_1(x_i)$ is s(x) in all three models. In the first we ignore the effect of z; in the second we include the additive effect of z but not the interaction; in the third we include the main effects and interaction of x and z on y.

[Figure: estimated smooths of x from models m1, m2 and m3, with ±2 SE ribbons]

If $f_1(x_i)$ were simply the smooth effect of x averaged over z then we would expect all three estimated functions to be effectively the same. The smooth functions of x in models 1 and 2 are indeed effectively equivalent, but the smooth of x in model 3 is noticeably smoother than those of the other two models. That said, the differences between the three estimated functions are quite small, and almost indistinguishable when plotted alongside the data.

In hindsight we might have expected $f_1(x_i)$ to be smoother in model 3 than in either of models 1 and 2, because some of the variation in y is now accounted for by the smooth of z and the bivariate interaction smooth of the two variables.

Another way to think about this is to consider what the average function of x, $\bar{f}_1(x_i)$, would be over all possible values of $z_i$. My maths skills aren't good enough to work this out analytically, but some manipulation of the m3 model fitted above and some averaging suggests that the estimated $f_1(x_i)$ is close to, but not the same as, $\bar{f}_1(x_i)$ (estimated from the fits to the data below).

[Figure: 10 slices through the fitted surface of m3 (rainbow colours), their average over z (solid red), and the estimated main-effect smooth of x from m3 (dashed blue)]

What I'm showing here is:

  1. the rainbow-coloured lines show a random sample of 10 (out of 250 in the example) slices through the estimated smooth surface, each of which is a smooth function of x by construction,
  2. The solid red line is the average over z of the fitted values of the smooth for each unique value of x. In other words it is the average of the rainbow-coloured lines (slices through the surface) at each unique point on x (but for all 250 of them, not just the 10 shown), and
  3. The dashed blue line shows the estimated function $\hat{f}_1(x_i)$ from m3, the decomposed model (in other words, the "main effect" smooth for x).

The differences at the ends of the range of x are on the order of 0.01 to 0.03, which, while not large, are not zero and do not seem to converge towards 0 as I increase the number of locations in $z$ that the surface is averaged over.

Q3

How do we interpret the tensor product interaction smooth $f_3(x,z)$?

I think the simplest way to think about the interaction smooth is as the smooth deformations you need to apply to $f_1(x)$ as you vary $z$, or conversely to $f_2(z)$ as you vary $x$.

The plots below attempt to illustrate this. The first panel shows the additive ("main") effects $f_1(x) + f_2(z)$. The second panel shows the deformations you would apply to that surface to yield the final panel, which is the full estimated value of $y$ given the additive effects of all three smooth functions.

[Figure: three panels showing the main-effects surface s(x) + s(z), the interaction surface ti(x,z), and the combined surface s(x) + s(z) + ti(x,z)]

It is worth noting that in this example, if we estimated the bivariate effect using te() we get subtly different model fits, and this is most likely due to the extra smoothness parameters in the decomposed model compared with the simpler te() model.
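A quick way to see this; a sketch reusing m_te from the earlier snippet and df and m3 from the Code section below:

range(fitted(m_te) - fitted(m3))  ## small but non-zero differences in fit
AIC(m_te, m3)                     ## the two parameterisations are not identical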

Code

library('mgcv')
library('dplyr')
library('gratia')
library('ggplot2')
library('tidyr')
library('cowplot')

set.seed(1)
## simulate data containing a smooth bivariate function of x and z (gamSim example 2)
sim <- gamSim(eg = 2, n = 1000)

df <- sim$data

## the three models discussed in Q2
m1 <- gam(y ~ s(x), data = df, method = 'REML')
m2 <- gam(y ~ s(x) + s(z), data = df, method = 'REML')
m3 <- gam(y ~ s(x) + s(z) + ti(x,z), data = df, method = 'REML')

## evaluate s(x) from each model at 100 values of x
effs <- bind_rows(evaluate_smooth(m1, smooth = 's(x)'),
                  evaluate_smooth(m2, smooth = 's(x)'),
                  evaluate_smooth(m3, smooth = 's(x)'))
effs <- mutate(effs, model = rep(paste0('m', 1:3), each = 100),
               upper = est + (2 * se),
               lower = est - (2 * se))

## first figure in Q2
ggplot(effs, aes(x = x, y = est, colour = model)) +
    geom_ribbon(aes(x = x, ymin = lower, ymax = upper, fill = model),
                alpha = 0.1, inherit.aes = FALSE) +
    geom_line(size = 1)

## second figure in Q2
## dense grid over the unit square, with fitted values from m3
newd <- crossing(x = seq(0, 1, length = 250), z = seq(0, 1, length = 250))
newd <- mutate(newd, fit = predict(m3, newd))

## average the fitted surface over z at each unique x
avg_fx <- newd %>%
    group_by(x) %>%
    summarise(avg = mean(fit))

## evaluate s(x) from m3, shifted by the model intercept for comparability
fx <- mutate(evaluate_smooth(m3, 's(x)'), est = est + coef(m3)[1L])

set.seed(1)
ggplot(filter(newd, z %in% sample(unique(z), 10)),
       aes(x = x, y = fit, colour = factor(z))) +
    geom_line() +
    guides(colour = FALSE) +
    geom_line(data = avg_fx,
              mapping = aes(x = x, y = avg),
              colour = "red", size = 1) +
    geom_line(data = fx,
              mapping = aes(x = x, y = est),
              colour = "blue", size = 1, lty = 'dashed')

## Q3 
newd <- crossing(x = seq(0,1, length = 100), z = seq(0,1, length = 100))
newd <- mutate(newd,
               main_eff = predict(m3, newd, exclude = 'ti(x,z)'), ## fit without the ti(x,z) term
               interact = predict(m3, newd, terms = 'ti(x,z)'),   ## other smooths zeroed, leaving ti(x,z)
               both     = predict(m3, newd))                      ## the full fitted surface

p1 <- ggplot(newd, aes(x = x, y = z, fill = main_eff)) +
    geom_raster() +
    geom_contour(aes(z = main_eff)) +
    scale_fill_distiller(palette = "RdBu", type = "div") +
    labs(title = "Main effects", subtitle = "s(x) + s(z)", fill = NULL)

p2 <- ggplot(newd, aes(x = x, y = z, fill = interact)) +
    geom_raster() +
    geom_contour(aes(z = interact)) +
    scale_fill_distiller(palette = "RdBu", type = "div") +
    labs(title = "Interaction effect", subtitle = "ti(x,z)", fill = NULL)

p3 <- ggplot(newd, aes(x = x, y = z, fill = both)) +
    geom_raster() +
    geom_contour(aes(z = both)) +
    scale_fill_distiller(palette = "RdBu", type = "div") +
    labs(title = "Main effects + interaction",
         subtitle = "s(x) + s(z) + ti(x,z)", fill = NULL)

plot_grid(p1, p2, p3, ncol = 3, align = 'hv', axis = 'tb')