Generalized Additive Model – Spline Basis with Linear Term in mgcv

Tags: basis-function, generalized-additive-model, mgcv, smoothing, splines

I'm curious about the basis functions generated by a call to the s() function with default parameter values, but even more specifically I'm interested in a smoother for a single variable varname that explicitly contains a linear term for that variable (a term perfectly correlated with that single variable on the entire real line). I'd like that linear term so that I can later apply certain penalized regression techniques that would let me determine whether the variable is best modeled linearly (via that one linear term) or non-linearly (via the entire spline).

I know that, by default, s() delivers a thin plate regression spline. Suppose I use k=4 for the dimension of the spline; it then gives me 3 spline terms ("s(varname).1", "s(varname).2", "s(varname).3"). I don't fully understand how these are calculated, but I noticed that the last term (in this case, "s(varname).3") is perfectly correlated with the original variable varname. This pattern repeats for other values of k (e.g. for k=6 it is "s(varname).5" that is perfectly correlated with varname) and for other variables.

I would sincerely appreciate it if someone could:

  1. Confirm that the last term of a thin plate spline generated by a default s() call for a single numeric variable will always be perfectly correlated with that variable
  2. Shed more light on precisely how these basis functions are calculated in the case of a default s() call for a single numeric variable
  3. Provide examples of other, non-default spline types – among those provided by the s() function – that explicitly include a linear term for the original variable (if any)

Best Answer

Q1

Yes, with the default univariate smooth s(x) there will always be one basis function that is a linear function and hence perfectly correlated with x. That this is the last function is, I think, implementational; nothing changes if you put this linear basis function first or anywhere in the set of basis functions.
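A quick empirical check of this (a minimal sketch; the expectation of a correlation of exactly ±1, and the "s(x).3" column name, follow mgcv's default naming of the TPRS model matrix columns):

library("mgcv")

set.seed(42)
df <- data.frame(x = runif(100))
df$y <- sin(2 * pi * df$x) + rnorm(100, sd = 0.2)

m <- gam(y ~ s(x, k = 4), data = df)
X <- model.matrix(m)
colnames(X)  # "(Intercept)" "s(x).1" "s(x).2" "s(x).3"

# the last basis function is, up to centring, a linear function of x
cor(X[, "s(x).3"], df$x)  # expect 1 or -1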

Note, however, that reducing the default size of the penalty null space will remove this linear basis function: with s(x, m = c(2, 0)), we are requesting a low-rank thin plate regression spline (TPRS) basis with 2nd order derivative penalty and zero penalty null space. As the linear function is in the penalty null space (it is not affected by the wiggliness penalty as it has second derivative of 0), it will be removed from the basis.
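We can watch the linear function disappear by building both bases directly (a sketch, reusing df from the check above; smoothCon() is mgcv's smooth constructor and exposes the model matrix as $X):

library("mgcv")

sm_default <- smoothCon(s(x, k = 4), data = df, absorb.cons = FALSE)[[1]]
sm_no_null <- smoothCon(s(x, k = 4, m = c(2, 0)), data = df,
                        absorb.cons = FALSE)[[1]]

# the default basis has a column perfectly correlated with x (the NA is
# the constant column, whose standard deviation is zero) ...
round(cor(sm_default$X, df$x), 3)
# ... while the zero-null-space basis does not
round(cor(sm_no_null$X, df$x), 3)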

If we have a bivariate low-rank TPRS $f(x_1, x_2)$, then the basis will contain one linear plane that is perfectly correlated with $x_1$ and another linear plane that is perfectly correlated with $x_2$.
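The same kind of check works in the bivariate case (again a sketch; as before, the NA for the constant column is expected):

library("mgcv")

set.seed(42)
df2 <- data.frame(x1 = runif(100), x2 = runif(100))

sm2 <- smoothCon(s(x1, x2, k = 10), data = df2, absorb.cons = FALSE)[[1]]
# one column is perfectly correlated with x1, another with x2
round(cor(sm2$X, df2$x1), 3)
round(cor(sm2$X, df2$x2), 3)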

Q2

I'm not going to repeat Wood (2003) here; if you want the mathematics behind thin plate splines, read that paper, or §5.5 of (the second edition of) Simon Wood's book (2017), for the details.

The raw basis functions for the univariate thin plate spline are given by

$$ \eta_{md}(r) = \frac{\Gamma(d/2 - m)}{2^{2m}\pi^{d/2}(m-1)!} r^{2m-d} $$

for $d$ odd, and here $d = 1$ as we are speaking about a univariate spline. $m$ is the order of the penalty, so by default $m = 2$. $r = \| \mathbf{x}_i - \mathbf{x}_j \|$, i.e. the Euclidean distance between the data $\mathbf{x}_i$ and the control points or knots $\mathbf{x}_j$, the latter being the unique values of the $\mathbf{x}_i$.

These functions look like this

# definition from Wood 2017 and Wood 2003 JRSSB
# this is for a 1 dimensional smooth (d) with 2nd order derivative penalty (m)
eta <- function(x, x0) {
    d <- 1
    m <- 2
    r <- sqrt((x - x0)^2)
    top <- gamma((d/2) - m)
    bot <- (2^(2*m) * pi^(d/2)) * factorial(m-1)
    (top / bot) * r^((2*m) - d)
}

tprs <- function(x, knots = NULL, null_space = TRUE) {
    require("tidyr")
    # want a basis function per unique x if no control points provided
    if (is.null(knots)) {
        knots <- sort(unique(x))
    }
    bf <- outer(x, knots, FUN = eta)
    if (null_space) {
        bf <- cbind(rep(1, length(x)), x, bf)
    }
    
    # convert to a tibble for easy plotting in ggplot
    n_knots <- length(knots)
    n_bf <- ifelse(null_space, n_knots + 2, n_knots)
    colnames(bf) <- paste0("bf", seq_len(n_bf))
    bf <- bf |>
        tidyr::as_tibble() |>
        tibble::add_column(x, .before = 1L) |>
        tidyr::pivot_longer(!all_of("x"), names_to = "bf", names_prefix = "bf") |>
        dplyr::mutate(bf = factor(bf, levels = seq_len(n_bf)))
    bf
}


set.seed(1)
x_ref <- seq(0, 1, length = 100)
x <- runif(20)
knots <- sort(unique(x))

bfuns <- tprs(x_ref, knots = knots)

library("ggplot2")

bfuns |>
    ggplot(aes(y = value, x = x, group = bf, colour = bf)) +
    geom_line() +
    theme(legend.position = "none") +
    facet_wrap(~ bf, scales = "free_y")

[Figure: the raw TPRS basis functions, one panel per basis function; the first two panels are the constant and linear functions.]

Basis functions 1 and 2 are the functions in the penalty null space for this basis (they have 0 second derivative).

For practical usage, the basis needs to have identifiability constraints applied to it; typically this is a sum-to-zero constraint. As a result the knot-based (1 basis function per $\mathbf{x}_j$) thin plate spline basis looks like this:

library("gratia")

x_ref <- seq(0, 1, length = 50)
x <- runif(7)
knots <- sort(unique(x))

bfs <- basis(s(x, k = 7), data = data.frame(x = x), knots = list(x = knots), 
             at = data.frame(x = x_ref), constraints = FALSE)
draw(bfs) & facet_wrap(~bf)

[Figure: the 7 knot-based TPRS basis functions, one panel per basis function.]

Here this is shown for 7 data points, hence 7 basis functions. This is achieved in mgcv by passing in the knots argument and having it be the same length as k. If you want to do this with large n, and hence large k, you will likely need to read ?tprs and note the max.knots setting.
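For example (a sketch; big_df and the particular values of k and max.knots here are just placeholders), max.knots is supplied via the xt argument:

library("mgcv")

# build the TPRS basis from (at most) 2000 knots rather than from all
# unique values of x, capping the cost of the eigendecomposition
m <- gam(y ~ s(x, k = 20, xt = list(max.knots = 2000)), data = big_df)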

By default, however, mgcv doesn't use this knot-based TPRS basis. Instead it uses the low-rank approach of Wood (2003): it applies an eigendecomposition to the full basis and retains the eigenvectors associated with the $k$ largest eigenvalues as a new basis. The point of this is that we can retain much of the original, rich basis in the low-rank one, thus providing a close approximation to the ideal spline basis, without needing $n$ basis functions (one per unique data point), i.e. $n$ covariates. This low-rank approach requires a computationally costly eigendecomposition, but mgcv uses an algorithm that only ever needs to find the eigenvectors for the $k$ largest eigenvalues, not the full set. Even so, for large $n$ this is still computationally costly, and ?tprs suggests what to do in such cases.
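A rough illustration of the rank reduction, reusing the eta() helper defined above (a sketch of the idea only: eigen() below computes the full decomposition, whereas mgcv uses a Lanczos-type algorithm to obtain just the leading eigenvectors):

set.seed(42)
x <- runif(20)
knots <- sort(unique(x))
k <- 4

# full "wiggly" part of the basis, evaluated at the knots
E <- outer(knots, knots, FUN = eta)

# eigendecompose and keep the k eigenvalues largest in magnitude
eg  <- eigen(E, symmetric = TRUE)
idx <- order(abs(eg$values), decreasing = TRUE)[seq_len(k)]

# low-rank wiggly basis, with the null space (constant & linear) appended
X_low <- cbind(1, knots, eg$vectors[, idx] %*% diag(eg$values[idx]))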

These eigendecomposition-based basis functions, for the same 7 data as above, look like this

bfs_eig <- basis(s(x, k = 7), data = data.frame(x = x),
                 at = data.frame(x = x_ref), constraints = FALSE)
draw(bfs_eig) & facet_wrap(~bf)

[Figure: the 7 eigendecomposition-based TPRS basis functions, one panel per basis function.]

Note that in these plots I'm showing the basis with the constant function included. In a typical model, the constant function is removed because it is confounded with the intercept.
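To see the constrained version, flip the constraints argument used above (a sketch; gratia then applies the identifiability constraint before returning the basis):

# with the sum-to-zero constraint absorbed, the constant function is gone
# and one fewer basis function is returned
bfs_con <- basis(s(x, k = 7), data = data.frame(x = x),
                 at = data.frame(x = x_ref), constraints = TRUE)
draw(bfs_con) & facet_wrap(~ bf)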

Q3

The only other basis in mgcv that has this property is the Duchon spline (bs = "ds"). This is not surprising as the thin plate spline is a special case of the more general class of Duchon splines.

This is not to say that other bases cannot represent the linear function in their span; they can, but only through a specific weighting of several basis functions, and as such there isn't a single basis function that is itself linear in the other bases in mgcv.
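A hedged check of both claims (for bs = "ds", m = c(2, 0) means a 2nd-order penalty with s = 0, which ?Duchon.spline notes is equivalent to the default TPRS; the NA for the constant column of the Duchon basis is expected as before):

library("mgcv")

set.seed(42)
df <- data.frame(x = runif(100))

# Duchon spline: one column is perfectly correlated with x
sm_ds <- smoothCon(s(x, bs = "ds", m = c(2, 0), k = 6), data = df,
                   absorb.cons = FALSE)[[1]]
round(cor(sm_ds$X, df$x), 3)

# cubic regression spline: linear functions are in its span, but no single
# basis function is itself linear, so no correlation is exactly 1
sm_cr <- smoothCon(s(x, bs = "cr", k = 6), data = df,
                   absorb.cons = FALSE)[[1]]
round(cor(sm_cr$X, df$x), 3)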

References

Wood, S.N., 2003. Thin plate regression splines. J. R. Stat. Soc. Series B Stat. Methodol. 65, 95–114. https://doi.org/10.1111/1467-9868.00374

Wood, S.N., 2017. Generalized Additive Models: An Introduction with R, Second Edition. CRC Press.