R – How to Set Knots in Natural Cubic Splines

r, splines

I have data with many correlated features, and I want to start by reducing the features with a smooth basis function, before running an LDA. I'm trying to use natural cubic splines in the splines package with the ns function. How do I go about assigning the knots?

Here's the basic R code:

library(MASS)     # provides lda()
library(splines)  # provides ns()
lda.pred <- lda(y ~ ns(x, knots=5))

But I have no idea how to choose the knots in ns.

Best Answer

How to specify the knots in R

The ns function generates a natural cubic spline basis for an input vector. You can control the knots either through the df (degrees of freedom) argument, which takes an integer, or through the knots argument, which takes a vector of interior knot locations. Note that in the code you've written

library(MASS)     # provides lda()
library(splines)  # provides ns()
lda.pred <- lda(y ~ ns(x, knots=5))

you have not requested five knots, but rather have requested a single (interior) knot at location 5.
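You can check this by inspecting the attributes of the basis that ns returns. The following is a minimal sketch; the vector x is made-up illustration data, not your data.

```r
library(splines)

x <- 0:10                  # toy input, chosen only for illustration
b <- ns(x, knots = 5)

attr(b, "knots")           # the single interior knot, located at x = 5
attr(b, "Boundary.knots")  # boundary knots default to range(x)
ncol(b)                    # one interior knot yields 2 basis columns
```

So knots=5 gives a fairly inflexible two-column basis, not a spline with five knots.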

If you use the df argument, then the interior knots will be selected based on quantiles of the vector x. For example, if you make the call

ns(x, df=5)

then the basis will include two boundary knots and four interior knots, placed at the 20th, 40th, 60th, and 80th percentiles of x. (Without an intercept column, ns uses df - 1 interior knots.) The boundary knots, by default, are placed at the min and max of x.
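If you want to confirm where the df argument put the knots, you can compare the stored knots against the quantiles directly. This is a small sketch on simulated data (the exponential sample is an assumption made purely for illustration):

```r
library(splines)

set.seed(1)
x <- rexp(200)                      # simulated, right-skewed predictor

b <- ns(x, df = 5)
k <- unname(attr(b, "knots"))       # interior knots chosen by ns
q <- quantile(x, probs = c(0.2, 0.4, 0.6, 0.8), names = FALSE)

all.equal(k, q)                     # interior knots match the quantiles
attr(b, "Boundary.knots")           # same values as range(x)
```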

Here is an example that specifies the knot locations explicitly:

x <- 0:100
ns(x, knots=c(20,35,50))

If you were to instead call ns(x, df=4), you would end up with 3 interior knots at 25, 50, and 75 (the quartiles of x).
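To see the two ways of specifying knots side by side on the same x, something like the following should work. Both calls produce a four-column basis, since three interior knots correspond to df = 4 when no intercept column is requested.

```r
library(splines)

x <- 0:100

b_knots <- ns(x, knots = c(20, 35, 50))  # interior knots given explicitly
b_df    <- ns(x, df = 4)                 # interior knots chosen from quantiles

ncol(b_knots)          # 4 basis columns
ncol(b_df)             # 4 basis columns
attr(b_df, "knots")    # 25, 50, 75
```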

You can also specify whether you want an intercept term. Normally this isn't specified since ns is most often used in conjunction with lm, which includes an intercept implicitly (unless forced not to). If you use intercept=TRUE in your call to ns, make sure you know why you're doing so, since if you do this and then call lm naively, the design matrix will end up being rank deficient.
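The rank deficiency is easy to reproduce. Below is a sketch on simulated data (the sine curve and noise level are arbitrary assumptions); it fits the same spline with and without intercept=TRUE and checks for aliased coefficients.

```r
library(splines)

set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)   # simulated response, illustration only

# Default: ns omits the intercept column and lm supplies its own.
fit_default <- lm(y ~ ns(x, df = 4))

# intercept = TRUE in ns plus lm's implicit intercept: redundant columns.
fit_both <- lm(y ~ ns(x, df = 4, intercept = TRUE))

# If you do want ns to carry the intercept, drop lm's with "0 +".
fit_ns_only <- lm(y ~ 0 + ns(x, df = 4, intercept = TRUE))

anyNA(coef(fit_default))   # FALSE
anyNA(coef(fit_both))      # TRUE: lm reports an NA for an aliased coefficient
anyNA(coef(fit_ns_only))   # FALSE
```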

Strategies for placing knots

Knots are most commonly placed at quantiles, as in the default behavior of ns. The intuition is that if you have lots of data clustered close together, then you might want more knots there to model any potential nonlinearities in that region. But that doesn't mean this is either (a) the only choice or (b) the best choice.
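To make the intuition concrete, here is a sketch that contrasts quantile-based knots with equally spaced knots on a skewed predictor. The data are simulated and the equally spaced alternative is just one possible comparison, not a recommendation.

```r
library(splines)

set.seed(1)
x <- rexp(500)   # simulated right-skewed predictor

# Three interior knots at the quartiles versus three equally spaced knots.
k_quant <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
k_equal <- seq(min(x), max(x), length.out = 5)[2:4]

b_quant <- ns(x, knots = k_quant)
b_equal <- ns(x, knots = k_equal)

# Count the observations falling between consecutive knots.
n_quant <- table(cut(x, c(min(x), k_quant, max(x)), include.lowest = TRUE))
n_equal <- table(cut(x, c(min(x), k_equal, max(x)), include.lowest = TRUE))

n_quant   # roughly equal counts per segment
n_equal   # most observations land in the first segment
```

With quantile knots each segment of the spline is supported by about the same amount of data, whereas equally spaced knots on skewed data leave the upper segments with very few observations.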

Other choices can obviously be made and are domain-specific. Looking at histograms and density estimates of your predictors may provide clues as to where knots are needed, unless there is some "canonical" choice given your data.

In terms of interpreting regressions, note that while you can certainly "play around" with knot placement, doing so is a form of model selection. You incur a penalty for it that you should evaluate carefully, and you should adjust any inferences accordingly.