FYI, this might be more appropriate for SE.DataScience, but for the time being, I'll answer it here.
It seems to me like you might be in a situation where you will have no choice but to write a script to implement your solutions. Never having worked with splines, my knowledge of them is strictly theoretical, so please bear with me and let me know if there is anything I'm not seeing.
Broadly speaking, it appears that you have a couple of different items that you will have to resolve in order to implement this.
1.) Determining the model parameters in a dynamic fashion. You have previously mentioned that you've used a combination of domain knowledge and univariate measures. That seems like something you should be able to handle heuristically. You will have to agree at the outset on a set of rules for your program to implement. This may or may not be a trivial task, as you will have to think hard about the potential implications of those rules. It may require you to revisit every step of your process and catalogue not just the decisions, but also the reasons behind them.
2.) Actually implementing your program. To make your performance testing properly dynamic and easy to maintain and modify going forward, you will have to think about how to structure it. You will likely want some sort of loop for the main model predictive performance estimation, preferably with a user-definable length to allow for greater flexibility. You will also want to write a separate function for each action your program takes, as this makes it easier to test functionality and to maintain and modify the program later. At a minimum, you will likely need functions for dataset selection (i.e. only time periods that have "gone by" at the moment of backtesting), for cleaning and validation (which you'll really have to think about, as data munging is a critical part of model building), for setting model training parameters, and for model prediction and for collecting and storing performance measures.
Your question about outlier detection and handling also falls under those two concerns, and I would go about implementing it by writing smaller loops within your main program loop that continue to "clean" and refit the model until it has reached a point where you are happy with it (which, again, you'll have to define yourself).
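To make the structure above concrete, here is a minimal rolling-origin backtest sketch in R. Every function name, the toy data, and the stand-in lm() model are hypothetical placeholders for your own process, not a definitive implementation:

```r
## Hypothetical rolling-origin backtest skeleton; replace each stub with your own logic.
set.seed(1)
full_data <- data.frame(t = 1:100, y = cumsum(rnorm(100)))  # toy time series

select_window  <- function(data, origin) data[data$t <= origin, ]  # only "gone by" periods
clean_data     <- function(data) data[!is.na(data$y), ]            # stub cleaning/validation
fit_model      <- function(data) lm(y ~ t, data = data)            # stand-in for your model
score_forecast <- function(model, newdata) {
  pred <- predict(model, newdata = newdata)
  abs(pred - newdata$y)                                            # absolute forecast error
}

origins <- 80:99                          # user-definable backtest length
errors  <- numeric(length(origins))
for (i in seq_along(origins)) {
  train     <- clean_data(select_window(full_data, origins[i]))
  m         <- fit_model(train)
  test_row  <- full_data[full_data$t == origins[i] + 1, ]
  errors[i] <- score_forecast(m, test_row)
}
mean(errors)  # overall backtest performance measure
```

The outlier-handling sub-loop you describe would sit between clean_data() and fit_model(), refitting until your own stopping rule is met.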
If this sounds like a big task, it's because it is; people have written entire software libraries (sometimes very lucratively) in order to perform this sort of task. Beyond that, it's hard to offer any more specific advice without knowing more about your processes, data structure, and the programming language you've done your work in thus far.
If any of this is useful to you and you'd like me to expand on any of it, leave a comment and I'd be more than happy to do so.
library("ggplot2")
theme_set(theme_bw())
library("mgcv")
df <- data.frame(one   = 1 * (1:20),
                 two   = c(c(1,6,2,8,7,4,9,8,5,4), c(1,6,2,8,7,4,3,8,5,4) / 2),
                 three = 3 * (1:20))
m <- gam(three ~ s(one, by = two, k = 8, pc = 10), data = df)
Q1
What your model is fitting is a centred smooth effect of one multiplied by two. The spline itself is quite smooth:
Next note that the estimates of the spline for the first n elements are all negative and quite similar in magnitude, especially the first few, where you see the anticorrelated behaviour:
> head(fitted(m))
[1] 25.936853 4.209230 23.373273 5.188732 12.415398 22.525351
> head(predict(m, type = "terms")[,1])
1 2 3 4 5 6
-5.118848 -26.846471 -7.682427 -25.866969 -18.640302 -8.530349
What you see here is the relatively similar contribution from the smooth being multiplied by the quite different values of two for these first two elements. As the spline effects are negative, the resulting contribution is somewhat negative for the first value and much more strongly negative for the second value, and so on.
This is all due to the model you specified: estimate a spline, evaluate the spline at the ith location, and multiply that effect by the ith value of the numeric by variable.
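You can verify this decomposition directly from the fitted model: the fitted values are just the intercept plus the per-term contributions returned by predict(..., type = "terms"). A sketch, assuming the model from the question:

```r
library("mgcv")

df <- data.frame(one   = 1 * (1:20),
                 two   = c(c(1,6,2,8,7,4,9,8,5,4), c(1,6,2,8,7,4,3,8,5,4) / 2),
                 three = 3 * (1:20))
m <- gam(three ~ s(one, by = two, k = 8, pc = 10), data = df)

# Per-term contributions; these already include multiplication by `two`
term_mat <- predict(m, type = "terms")
recon    <- attr(term_mat, "constant") + rowSums(term_mat)
all.equal(unname(fitted(m)), unname(recon))  # TRUE
```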
Q2
Knots are not inflexions of the resulting fitted values. In fact, there are as many knots as data points, as you used a thin plate spline here. Here k represents the dimension of the basis expansion (in this case the dimension is 7, because you lose 1 due to the centring constraint imposed on the spline), and mgcv employs a neat trick to select 7 basis functions from the n basis functions in a traditional thin plate spline basis. (The basis functions are subject to an eigen decomposition and the eigenvectors associated with the k-1 largest eigenvalues are used as a new basis. This concentrates most of the information in the full thin plate spline basis into a smaller set of orthogonal basis functions for model fitting.)
The estimated coefficients clearly indicate that only 7 basis functions were used in the model:
> coef(m)
(Intercept) s(one):two.1 s(one):two.2 s(one):two.3 s(one):two.4 s(one):two.5
31.05570043 0.70385580 -0.65193244 -0.01680226 -0.21851339 0.14293203
s(one):two.6 s(one):two.7
-0.52163338 5.27573517
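You can confirm the basis dimension from the model object itself (a sketch assuming the gam fit from the question):

```r
library("mgcv")

df <- data.frame(one   = 1 * (1:20),
                 two   = c(c(1,6,2,8,7,4,9,8,5,4), c(1,6,2,8,7,4,3,8,5,4) / 2),
                 three = 3 * (1:20))
m <- gam(three ~ s(one, by = two, k = 8, pc = 10), data = df)

m$smooth[[1]]$bs.dim                      # 8, the requested k
sum(grepl("s\\(one\\)", names(coef(m))))  # 7 smooth coefficients after the constraint
length(coef(m))                           # 8 in total, including the intercept
```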
Q3
As the plot of the spline shows, the spline does pass through x = 10 as you requested.
You are confusing the fitted values of the model with the estimated spline.
Best Answer
I think for a more complete answer I will wait for somebody more experienced with GAMs to correct what I say here, but this is what I know of derivatives in GAMs and how the m argument works with them. For starters, recall that the first derivative is the slope of a smooth function at a specific point, obtained as the limit of the difference quotient as the increment becomes infinitesimally small (Larson & Edwards, 2023, p.103):$$ f'(x) = \lim_{\Delta{x} \to 0}\frac{f(x + \Delta{x})-f(x)}{\Delta{x}} $$
where $x$ is an input and $\Delta{x}$ is the change in distance from that $x$ on the $x$ axis. Basically, at any given point on a smooth function, we can approximate its slope with a linear function. The second derivative, conversely, is the change in slope of a smooth function, and is simply the derivative of the first derivative (Larson & Edwards, 2023, p.128):
$$ f''(x) = \frac{d}{dx}\,f'(x) $$
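Both limit definitions are easy to check numerically: with a small step h, the finite-difference quotients approach the exact derivatives. A quick sketch for f(x) = sin(x), whose derivatives are known in closed form:

```r
f <- function(x) sin(x)
h <- 1e-5
x <- 1

# Forward difference approximates f'(x) = cos(x)
first_fd  <- (f(x + h) - f(x)) / h
# Central second difference approximates f''(x) = -sin(x)
second_fd <- (f(x + h) - 2 * f(x) + f(x - h)) / h^2

first_fd  - cos(x)  # close to 0
second_fd + sin(x)  # close to 0
```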
Note that how these derivatives are defined in the context of GAMs, as well as their penalties, can be a little different, given that a GAM function is typically defined by a number of basis functions rather than a single function (Wood, 2017, p.162):
$$ f(x) = \sum\limits_{j=1}^{k}b_j(x)\beta_j $$
where $j$ indexes the basis functions. The actual penalization of this function by default involves finite differences with respect to the second derivative (see a relevant discussion here). In any case, to summarize the derivatives, we have two things: our slopes at a given point, and our rates of change. These can actually be visualized for a fitted GAM model using the gratia package. Here is an example pulled directly from the GitHub page. If we draw our base model, we get these smooths:
To obtain the derivatives, we use the following function and use draw once more. Now if you compare the first and second crop of plots, you can see the first smooth's slope does not change much from about 0 to about .25. As the curve approaches the middle of the plot, its slope decreases from a large positive value to an ever smaller one. This change is captured by the second plot, which shows the values from 0 to about .25 as flat until the slope starts decreasing, where you see the relationship also decrease in the plot. This decrease in slope continues until you see the flat line at the end, where the slope no longer changes in the original function.
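If gratia is not available, the same first-derivative curve can be approximated by hand with finite differences of the model's predictions. A sketch on simulated data, using only mgcv; gamSim() and the model here are stand-ins for the plots above:

```r
library("mgcv")
set.seed(2)
dat <- gamSim(1, n = 200, verbose = FALSE)  # standard mgcv example data
m   <- gam(y ~ s(x2), data = dat)

eps <- 1e-5
x   <- seq(min(dat$x2), max(dat$x2), length.out = 100)
p0  <- predict(m, newdata = data.frame(x2 = x))
p1  <- predict(m, newdata = data.frame(x2 = x + eps))
deriv1 <- (p1 - p0) / eps                   # finite-difference first derivative

plot(x, deriv1, type = "l",
     main = "Finite-difference first derivative of s(x2)")
```

This is the same finite-difference idea gratia::derivatives() uses internally, just done manually.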
Alternatively, the second derivatives are shown as such:
You can see they look "blocky." This is because of the nature of second derivatives. If for example we simplify things a lot by defining a function as $f(x) = x^3$, we can find that
$$ f'(x) = 3x^2 $$
$$ f''(x) = 6x $$
Notice that for this polynomial the second derivative is a linear function, which contributes to the zig-zaggy behavior of the plot for df2.

What does this mean within the context of GAMs? We know that TPRS splines and some similar splines by default penalize the second derivative, using m=2. If we want to compare our model with one that penalizes the first derivative instead, we can do so directly with another gratia function called compare_smooths, which shows how the choice changes our respective fits. Here I explicitly set m to 1 or 2 to make it clear what I'm doing. We can now directly compare the model fit1, which penalizes the first derivative, and fit2, which penalizes the second derivative. Notice that for some smooths this has simplified them, and for others it has made them more wiggly. Now for the million dollar question, for which I don't have an exact answer: what does this mean for us? Well, in some cases this may make the model fit the data better, and that is enough. Penalizing the right derivative can also lend itself to some other useful properties, as noted in Pedersen et al., 2019:
The best I can say is that you should try to fit your data with the functional relationship best defined by the theory and data driving that function, and try when possible not to overfit the data, regardless of the selection of $m$. From my own experience tinkering with $m$, it can sometimes create functions that wildly overfit the data where data points are sparse, and that is something you would of course want to avoid. Here is an extreme artificial example below, where the sp and k arguments are set to cause a lot of bends in the curve. You can see that because there are not a lot of data points, this can contribute to some very bad misspecification with a penalized first derivative. Predictions from this model would probably be poor if the actual functional relationship is smoother:
It sounds like you already know how your data is supposed to behave, and that should give you the right answer anyway about what is an adequate fit to your data.
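If you want to experiment with this yourself, here is a minimal sketch fitting the same smooth with a first-derivative penalty (m = 1) and the default second-derivative penalty (m = 2); the simulated data and comparison metrics are hypothetical stand-ins for your own checks:

```r
library("mgcv")
set.seed(3)
dat <- gamSim(1, n = 200, verbose = FALSE)

fit1 <- gam(y ~ s(x2, bs = "tp", m = 1), data = dat)  # penalize first derivative
fit2 <- gam(y ~ s(x2, bs = "tp", m = 2), data = dat)  # penalize second derivative (default)

# One rough way to compare the two penalties: effective degrees of freedom and AIC
c(edf_m1 = sum(fit1$edf), edf_m2 = sum(fit2$edf))
AIC(fit1, fit2)
```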
Citations