Generalized Additive Model – Spline Basis with Linear Term in mgcv

Tags: basis-function, generalized-additive-model, mgcv, smoothing, splines

I'm curious about the basis functions generated by a call to the s() function with default parameter values, but even more specifically I'm interested in a smoother for a single variable varname that explicitly contains a linear term for that variable (a term perfectly correlated with that single variable on the entire real line). I'd like that linear term so that I can later apply certain penalized regression techniques that would let me determine whether the variable is best modeled linearly (via that one linear term) or non-linearly (via the entire spline).

I know that, by default, s() delivers a thin plate regression spline. Suppose I use k=4 for the dimension of the spline; it then gives me 3 spline terms ("s(varname).1", "s(varname).2", "s(varname).3"). I don't fully understand how these are calculated, but I noticed that the last term (in this case, "s(varname).3") is perfectly correlated with the original variable varname. This pattern repeats for other values of k (e.g. for k=6 it is "s(varname).5" that is perfectly correlated with varname) and for other variables.

I would sincerely appreciate it if someone could:

  1. Confirm that the last term of a thin plate spline generated by a default s() call for a single numeric variable will always be perfectly correlated with that variable
  2. Shed more light on precisely how these basis functions are calculated in the case of a default s() call for a single numeric variable
  3. Provide examples of other, non-default spline types – among those provided by the s() function – that explicitly include a linear term for the original variable (if any)

Best Answer

Q1

Yes, with the default univariate smooth s(x) there will always be one basis function that is a linear function and hence perfectly correlated with x. That this is the last function is, I think, implementational; nothing changes if you put this linear basis function first or anywhere in the set of basis functions.
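A quick empirical check of this (a minimal sketch; the expectation of a correlation of exactly ±1, and the "s(x).3" column name, follow mgcv's default naming of the TPRS model matrix columns):

library("mgcv")

set.seed(42)
df <- data.frame(x = runif(100))
df$y <- sin(2 * pi * df$x) + rnorm(100, sd = 0.2)

m <- gam(y ~ s(x, k = 4), data = df)
X <- model.matrix(m)
colnames(X)  # "(Intercept)" "s(x).1" "s(x).2" "s(x).3"

# the last basis function is, up to centring, a linear function of x
cor(X[, "s(x).3"], df$x)  # expect 1 or -1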

Note, however, that reducing the default size of the penalty null space will remove this linear basis function: with s(x, m = c(2, 0)), we are requesting a low-rank thin plate regression spline (TPRS) basis with 2nd order derivative penalty and zero penalty null space. As the linear function is in the penalty null space (it is not affected by the wiggliness penalty as it has second derivative of 0), it will be removed from the basis.
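We can watch the linear function disappear by building both bases directly (a sketch, reusing df from the check above; smoothCon() is mgcv's smooth constructor and exposes the model matrix as $X):

library("mgcv")

sm_default <- smoothCon(s(x, k = 4), data = df, absorb.cons = FALSE)[[1]]
sm_no_null <- smoothCon(s(x, k = 4, m = c(2, 0)), data = df,
                        absorb.cons = FALSE)[[1]]

# the default basis has a column perfectly correlated with x (the NA is
# the constant column, whose standard deviation is zero) ...
round(cor(sm_default$X, df$x), 3)
# ... while the zero-null-space basis does not
round(cor(sm_no_null$X, df$x), 3)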

If we have a bivariate low-rank TPRS $f(x_1, x_2)$, then the basis will contain one linear plane that is perfectly correlated with $x_1$ and another linear plane that is perfectly correlated with $x_2$.
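The same kind of check works in the bivariate case (again a sketch; as before, the NA for the constant column is expected):

library("mgcv")

set.seed(42)
df2 <- data.frame(x1 = runif(100), x2 = runif(100))

sm2 <- smoothCon(s(x1, x2, k = 10), data = df2, absorb.cons = FALSE)[[1]]
# one column is perfectly correlated with x1, another with x2
round(cor(sm2$X, df2$x1), 3)
round(cor(sm2$X, df2$x2), 3)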

Q2

I'm not going to repeat Wood (2003) here; if you want the mathematics behind thin plate splines, read that paper, or §5.5 of (the second edition of) Simon Wood's book (2017), for the details.

The raw basis functions for the univariate thin plate spline are given by

$$ \eta_{md}(r) = \frac{\Gamma(d/2 - m)}{2^{2m}\pi^{d/2}(m-1)!} r^{2m-d} $$

for $d$ odd, and here $d = 1$ as we are speaking about a univariate spline. $m$ is the order of the penalty, so by default $m = 2$. $r = \| \mathbf{x}_i - \mathbf{x}_j \|$, i.e. the Euclidean distance between the data $\mathbf{x}_i$ and the control points or knots $\mathbf{x}_j$, the latter being the unique values of the $\mathbf{x}_i$.

These functions look like this

# definition from Wood 2017 and Wood 2003 JRSSB
# this is for a 1 dimensional smooth (d) with 2nd order derivative penalty (m)
eta <- function(x, x0) {
    d <- 1
    m <- 2
    r <- sqrt((x - x0)^2)
    top <- gamma((d/2) - m)
    bot <- (2^(2*m) * pi^(d/2)) * factorial(m-1)
    (top / bot) * r^((2*m) - d)
}

tprs <- function(x, knots = NULL, null_space = TRUE) {
    require("tidyr")
    # want a basis function per unique x if no control points provided
    if (is.null(knots)) {
        knots <- sort(unique(x))
    }
    bf <- outer(x, knots, FUN = eta)
    if (null_space) {
        bf <- cbind(rep(1, length(x)), x, bf)
    }
    
    # convert to a tibble for easy plotting in ggplot
    n_knots <- length(knots)
    n_bf <- ifelse(null_space, n_knots + 2, n_knots)
    colnames(bf) <- paste0("bf", seq_len(n_bf))
    bf <- bf |>
        tidyr::as_tibble() |>
        tibble::add_column(x, .before = 1L) |>
        tidyr::pivot_longer(!all_of("x"), names_to = "bf", names_prefix = "bf") |>
        dplyr::mutate(bf = factor(bf, levels = seq_len(n_bf)))
    bf
}


set.seed(1)
x_ref <- seq(0, 1, length = 100)
x <- runif(20)
knots <- sort(unique(x))

bfuns <- tprs(x_ref, knots = knots)

library("ggplot2")

bfuns |>
    ggplot(aes(y = value, x = x, group = bf, colour = bf)) +
    geom_line() +
    theme(legend.position = "none") +
    facet_wrap(~ bf, scales = "free_y")

[Figure: the raw TPRS basis functions, one panel per basis function; the first two panels are the constant and linear functions.]

Basis functions 1 and 2 are the functions in the penalty null space for this basis (they have 0 second derivative).

For practical usage, the basis needs to have identifiability constraints applied to it; typically this is a sum-to-zero constraint. As a result the knot-based (1 basis function per $\mathbf{x}_j$) thin plate spline basis looks like this:

library("gratia")

x_ref <- seq(0, 1, length = 50)
x <- runif(7)
knots <- sort(unique(x))

bfs <- basis(s(x, k = 7), data = data.frame(x = x), knots = list(x = knots), 
             at = data.frame(x = x_ref), constraints = FALSE)
draw(bfs) & facet_wrap(~bf)

[Figure: the 7 knot-based TPRS basis functions, one panel per basis function.]

Here this is shown for 7 data points, hence 7 basis functions. This is achieved in mgcv by passing in the knots argument and having it be the same length as k. If you want to do this with large n, and hence large k, you will likely need to read ?tprs and note the max.knots setting.
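For example (a sketch; big_df and the particular values of k and max.knots here are just placeholders), max.knots is supplied via the xt argument:

library("mgcv")

# build the TPRS basis from (at most) 2000 knots rather than from all
# unique values of x, capping the cost of the eigendecomposition
m <- gam(y ~ s(x, k = 20, xt = list(max.knots = 2000)), data = big_df)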

By default, however, mgcv doesn't use this knot-based TPRS basis. Instead it uses the low-rank approach of Wood (2003): it applies an eigendecomposition to the full basis and retains the eigenvectors associated with the $k$ largest eigenvalues as a new basis. The point of this is that we can retain much of the original, rich basis in the low-rank one, thus providing a close approximation to the ideal spline basis, without needing $n$ basis functions (one per unique data point), i.e. $n$ covariates. This low-rank approach requires a computationally costly eigendecomposition, but mgcv uses an algorithm that only ever needs to find the eigenvectors for the $k$ largest eigenvalues, not the full set. Even so, for large $n$ this is still computationally costly, and ?tprs suggests what to do in such cases.
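A rough illustration of the rank reduction, reusing the eta() helper defined above (a sketch of the idea only: eigen() below computes the full decomposition, whereas mgcv uses a Lanczos-type algorithm to obtain just the leading eigenvectors):

set.seed(42)
x <- runif(20)
knots <- sort(unique(x))
k <- 4

# full "wiggly" part of the basis, evaluated at the knots
E <- outer(knots, knots, FUN = eta)

# eigendecompose and keep the k eigenvalues largest in magnitude
eg  <- eigen(E, symmetric = TRUE)
idx <- order(abs(eg$values), decreasing = TRUE)[seq_len(k)]

# low-rank wiggly basis, with the null space (constant & linear) appended
X_low <- cbind(1, knots, eg$vectors[, idx] %*% diag(eg$values[idx]))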

These eigendecomposition-based basis functions, for the same 7 data as above, look like this

bfs_eig <- basis(s(x, k = 7), data = data.frame(x = x),
                 at = data.frame(x = x_ref), constraints = FALSE)
draw(bfs_eig) & facet_wrap(~bf)

[Figure: the 7 eigendecomposition-based TPRS basis functions, one panel per basis function.]

Note that in these plots I'm showing the basis with the constant function included. In a typical model, the constant function is removed because it is confounded with the intercept.
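To see the constrained version, flip the constraints argument used above (a sketch; gratia then applies the identifiability constraint before returning the basis):

# with the sum-to-zero constraint absorbed, the constant function is gone
# and one fewer basis function is returned
bfs_con <- basis(s(x, k = 7), data = data.frame(x = x),
                 at = data.frame(x = x_ref), constraints = TRUE)
draw(bfs_con) & facet_wrap(~ bf)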

Q3

The only other basis in mgcv that has this property is the Duchon spline (bs = "ds"). This is not surprising as the thin plate spline is a special case of the more general class of Duchon splines.

This is not to say that other bases cannot represent the linear function in their span; they can, but only through a specific weighting of several basis functions, and as such there isn't a single basis function that is itself linear in the other bases in mgcv.
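A hedged check of both claims (for bs = "ds", m = c(2, 0) means a 2nd-order penalty with s = 0, which ?Duchon.spline notes is equivalent to the default TPRS; the NA for the constant column of the Duchon basis is expected as before):

library("mgcv")

set.seed(42)
df <- data.frame(x = runif(100))

# Duchon spline: one column is perfectly correlated with x
sm_ds <- smoothCon(s(x, bs = "ds", m = c(2, 0), k = 6), data = df,
                   absorb.cons = FALSE)[[1]]
round(cor(sm_ds$X, df$x), 3)

# cubic regression spline: linear functions are in its span, but no single
# basis function is itself linear, so no correlation is exactly 1
sm_cr <- smoothCon(s(x, bs = "cr", k = 6), data = df,
                   absorb.cons = FALSE)[[1]]
round(cor(sm_cr$X, df$x), 3)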

References

Wood, S.N., 2003. Thin plate regression splines. J. R. Stat. Soc. Series B Stat. Methodol. 65, 95–114. https://doi.org/10.1111/1467-9868.00374

Wood, S.N., 2017. Generalized Additive Models: An Introduction with R, Second Edition. CRC Press.