Generalized Additive Models – GAMs, Interactions, and Covariates

Tags: generalized-additive-model, mgcv, modeling, r

I've been exploring a number of tools for forecasting, and have found Generalized Additive Models (GAMs) to have the most potential for this purpose. GAMs are great! They allow for complex models to be specified very succinctly. However, that same succinctness is causing me some confusion, specifically in regard to how GAMs conceive of interaction terms and covariates.

Consider an example data set (reproducible code at the end of the post) in which y is a monotonic function perturbed by a couple of Gaussians, plus some noise:

[Figure: the example data, y plotted against x]

The data set has a few predictor variables:

  • x: The index of the data (1-100).
  • w: A secondary feature that marks out the sections of y where a Gaussian is present. w takes the values 1–20 where x is between 11 and 30, and again where x is between 51 and 70. Otherwise, w is 0.
  • w2: w + 1, so that there are no 0 values.

R's mgcv package makes it easy to specify a number of possible models for these data:

[Figure: faceted scatterplots of the nine model fits (red lines) over the data, produced by the code at the end of the post]

Models 1 and 2 are fairly intuitive. Predicting y only from the index x at the default smoothness produces something vaguely correct, but too smooth. Predicting y only from w results in a model of the "average Gaussian" present in y, with no "awareness" of the other data points, all of which have a w value of 0.

Model 3 uses both x and w as 1D smooths, producing a nice fit. Model 4 uses x and w in a 2D smooth, also giving a nice fit. These two models are very similar, though not identical.

Model 5 models x "by" w. Model 6 does the opposite. mgcv's documentation states that "the by argument ensures that the smooth function gets multiplied by [the covariate given in the 'by' argument]". So shouldn't Models 5 & 6 be equivalent?

Models 7 and 8 use one of the predictors as a linear term. These make intuitive sense to me, as they are simply doing what a GLM would do with these predictors, and then adding the effect to the rest of the model.

Lastly, Model 9 is the same as Model 5, except that x is smoothed "by" w2 (which is w + 1). What's strange to me here is that the absence of zeros in w2 produces a remarkably different effect in the "by" interaction.

So, my questions are these:

  • What is the difference between the specifications in Models 3 and 4? Is there some other example that would draw out the difference more clearly?
  • What, exactly, is "by" doing here? Much of what I've read in Wood's book and this website suggests that "by" produces a multiplicative effect, but I'm having trouble grasping the intuition of it.
  • Why would there be such a notable difference between Models 5 and 9?

Reprex follows, written in R.

library(magrittr)
library(tidyverse)
library(mgcv)

set.seed(1222)

# x: index; w: ramps from 1 to 20 under each Gaussian bump, 0 elsewhere; w2: w shifted off zero
# y: two Gaussian bumps plus a quadratic trend plus a little noise
data.ex <- tibble(
  x = 1:100,
  w = c(rep(0, 10), 1:20, rep(0, 20), 1:20, rep(0, 30)),
  w2 = w + 1,
  y = dnorm(x, mean = rep(c(20, 60), each = 50), sd = 3) +
    (seq(0, 1, length.out = 100)^2) / 2 +
    rnorm(100, sd = 0.01)
)

# Fit the nine specifications, storing each fit and its predictions in list columns
models <- tibble(
  model = 1:9,
  formula = c('y ~ s(x)', 'y ~ s(w)', 'y ~ s(x) + s(w)', 'y ~ s(x, w)',
              'y ~ s(x, by = w)', 'y ~ s(w, by = x)', 'y ~ x + s(w)',
              'y ~ w + s(x)', 'y ~ s(x, by = w2)'),
  gam = map(formula, function(f) gam(as.formula(f), data = data.ex)),
  data.to.plot = map(gam, function(m) cbind(data.ex, predicted = predict(m)))
)

# One facet per model, with fitted values drawn in red
plot.models <- unnest(models, data.to.plot) %>%
  mutate(facet = sprintf('%i: %s', model, formula)) %>%
  ggplot(data = ., aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(y = predicted), color = 'red') +
  facet_wrap(facets = ~facet)
print(plot.models)

Best Answer

Q1 What's the difference between models 3 and 4?

Model 3 is a purely additive model

$$y = \alpha + f_1(x) + f_2(w) + \varepsilon$$

so we have a constant $\alpha$ plus the smooth effect of $x$ plus the smooth effect of $w$.

Model 4 is a smooth interaction of two continuous variables

$$y = \alpha + f_1(x, w) + \varepsilon$$

In practical terms, Model 3 says that no matter what the effect of $w$ is, the effect of $x$ on the response stays the same; if we fix $x$ at some known value and vary $w$ over some range, the contribution of $f_1(x)$ to the fitted model remains constant. You can verify this by predict()-ing from model 3 for a fixed value of x and a few different values of w, using the type = 'terms' argument of the predict() method. You'll see a constant contribution to the fitted/predicted values from s(x).
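For example, here is a minimal sketch of that check (refitting model 3 under an illustrative name, m3):

m3 <- gam(y ~ s(x) + s(w), data = data.ex)

# Hold x fixed and vary w: the s(x) column should be identical in every row
newd <- data.frame(x = 50, w = c(0, 5, 10, 20))
predict(m3, newdata = newd, type = 'terms')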

This is not the case for model 4; that model says the smooth effect of $x$ varies smoothly with the value of $w$, and vice versa.

Note that unless $x$ and $w$ are in the same units, or we expect the same wiggliness in both variables, you should be using te() to fit the interaction:

m4a <- gam(y ~ te(x, w), data = data.ex, method = 'REML')

pdata <- mutate(data.ex, Fittedm4a = predict(m4a))
ggplot(pdata, aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(y = Fittedm4a), col = 'red')

[Figure: fitted values from the te(x, w) model (red line) overlaid on the data]

In one sense, model 4 is fitting

$$y = \alpha + f_1(x) + f_2(w) + f_3(x, w) + \varepsilon$$

where $f_3$ is the pure smooth interaction, with the "main" smooth effects of $x$ and $w$ removed from its basis for the sake of identifiability. You can get this model via

m4b <- gam(y ~ ti(x) + ti(w) + ti(x, w), data = data.ex, method = 'REML')

but note this estimates 4 smoothness parameters:

  1. the one associated with the main smooth effect of $x$
  2. the one associated with the main smooth effect of $w$
  3. the one associated with the marginal smooth of $x$ in the interaction tensor product smooth
  4. the one associated with the marginal smooth of $w$ in the interaction tensor product smooth

The te() model contains just two smoothness parameters, one per marginal basis.
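You can confirm these counts directly; a fitted gam object stores its smoothness parameters in the sp component, so, reusing m4a and m4b from above:

# te() version: one smoothness parameter per marginal basis
length(m4a$sp)  # 2

# ti() decomposition: one per main effect, plus one per marginal basis of ti(x, w)
length(m4b$sp)  # 4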

A fundamental problem with all these models is that the effect of $w$ is not strictly smooth; there's a discontinuity where the effect of $w$ falls to 0 (or to 1 for w2). This shows up in your plots (and in the one I show in detail here).

Q2 What, exactly, is "by" doing here?

By-variable smooths can do a number of different things depending on what you pass to the by argument. In your examples the by variable, $w$, is continuous. In that case what you get is a varying-coefficient model: a model in which the linear effect of $w$ varies smoothly with $x$. In equation terms, this is what your model 5, say, is doing:

$$y = \alpha + f_1(x)w + \varepsilon$$

If this isn't immediately clear (it wasn't to me when I first looked at these models): for a given value of $x$, we evaluate the smooth function at that value, and the result becomes the equivalent of $\beta_1 w$; in other words, it is the linear effect (slope) of $w$ at that value of $x$, and those linear effects vary smoothly with $x$. See section 7.5.3 in the second edition of Simon Wood's book for a concrete example where the linear effect of a covariate varies as a smooth function of space (latitude and longitude).
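To make the multiplication concrete, here is a minimal sketch (refitting model 5 under an illustrative name, m5): evaluating the by-smooth with w = 1 recovers $f_1(x)$ on its own, and the fitted term for the observed data is that function multiplied by $w$. In mgcv's term-wise predictions the by-smooth is labelled 's(x):w'.

m5 <- gam(y ~ s(x, by = w), data = data.ex)

# Evaluate the term at w = 1 to recover f1(x) itself
f1 <- predict(m5, newdata = data.frame(x = data.ex$x, w = 1), type = 'terms')[, 's(x):w']

# The fitted term for the observed data is f1(x) * w, up to numerical error
term <- predict(m5, type = 'terms')[, 's(x):w']
all.equal(term, f1 * data.ex$w, check.attributes = FALSE)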

Q3 Why would there be such a notable difference between Models 5 and 9?

The difference between models 5 and 9, I think, is simply due to multiplying by 0 versus multiplying by 1. In the former, the effect of the only term in the model, $f_1(x)w$, is 0 wherever $w = 0$, because $f_1(x) \times 0 = 0$. In model 9 you have $f_1(x) \times 1 = f_1(x)$ in those regions where there is no contribution from the Gaussian sections of $w$. As $f_1(x)$ is an approximately exponential function, you get it superimposed on the overall effect of $w$.

In other words, model 5 contains zero trend everywhere $w$ is 0, whereas model 9 includes an approximately exponential trend everywhere $w$ is 0 (and hence w2 is 1), on top of which the varying-coefficient effect of $w$ is superimposed.
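A quick sketch of that claim (again with illustrative object names): wherever $w = 0$, model 5's predictions collapse to the intercept, while model 9's continue to track the trend in $x$.

m5 <- gam(y ~ s(x, by = w), data = data.ex)
m9 <- gam(y ~ s(x, by = w2), data = data.ex)

flat <- data.ex$w == 0

# Model 5: in the w == 0 regions the predictions are constant (just the intercept)
range(predict(m5)[flat])

# Model 9: in the same regions the predictions still follow the smooth trend in x
range(predict(m9)[flat])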