Solved – Number of basis functions in natural cubic spline

basis functionrsplines

According to ESL, natural cubic basic spline with $K$ knots is represented by $K$ basis function.

However, the ns() function in R with knots=K generates a basis matrix with $K+2$ basis function. This representation seems to add just two and not four constraints in both the boundary regions.

Indeed the documentation says that that the resulting matrix is a $N \times df$ matrix, where df = length(knots) + 1 + intercept.

(In addition to this, the resulting plot of lm(y~ns(x,knots=c(k_1,k_2,...,K_n),intercept=T))is not a linear function in the extreme regions)

My question is : is this a different definition of natural cubic spline?

Best Answer

There are no different definitions but unfortunately as S. Wood says: "Note that there are many alternative ways of representing such a cubic spline using basis functions: although all are equivalent, the link to the piecewise cubic characterization is not always transparent." [SW2017]

The definition of the natural cubic spline is as always: "The natural cubic spline, $g(x)$, interpolating (a set of points $\{x_i , y_i: i = 1, \dots, n\}$ where $x_i <x_{i+1}$), is a function made up of sections of cubic polynomial, one for each $[x_i, x_{i+1}]$, which are joined together so that the whole spline is continuous to second derivative, while $g(x_i) = y_i$ and $g′′(x_1) = g′′(x_n) = 0$." (Again from [SW2017])

In addition, and making a specific mention now to the concept of knots: "(Letting) $\xi_1 < \xi_2 < \dots < \xi_k$ be a set of ordered points - called knots - contained in some interval $(a, b)$, a cubic spline is a continuous function $r$ such that: (i) $r$ is a cubic polynomial over ($\xi_1$, $\xi_2$), ($\xi_2$, $\xi_3$), $\dots$. and (ii) $r$ has continuous first and second derivatives at the knots. More generally, an $M$th-order spline is a piecewise $M-1$ degree polynomial with $M-2$ continuous derivatives at the knots. A spline that is linear beyond the boundary knots is called a natural spline." (from [LW2006])

Returning now to ns, simply put the naming of the function ns is confusing. As Phil Karlton, one of the original Netscape project leaders/curmudgeons, said: "There are only two hard things in Computer Science: cache invalidation and naming things.". Here, the naming is probably a bit off because someone thought that the boundary points are not really knots but just points. Therefore, it made sense for knots to be actually only the interior points. This is alluded in the documentation of ns that comments on the association of the argument df with "the number of inner knots as length(knots)". This suggests that actually knots refers to inner knots.

For example, both splines::ns(...) and mgcv::s( bs='cr', ...) use the same knot locations. (where by default they are on relevant quantiles of $x$)

library(mgcv)
library(splines)

set.seed(3); 
N <- 234
x <- rt(N, df = 12)
e <- rnorm(N, 0, 0.4)
yTrue <- sin(x) + 0.2 * x 
yObs <- yTrue + e
numKnots <- 8

crFit <- gam(yObs ~ s(x, bs = 'cr', k = numKnots))
crKnots <- crFit$smooth[[1]]$xp  # get knots locations
nsRepr <- ns(x = x, intercept = TRUE, df = numKnots)
nsKnots <- sort(c( attr(nsRepr, "knots"), attr(nsRepr, "Boundary.knots") ))

all.equal(nsKnots, crKnots, check.attributes = FALSE)
# [1] TRUE
length(crKnots) == numKnots
# [1] TRUE
all.equal(nsKnots, quantile(x, seq(0, 1, length.out = numKnots)), 
          check.attributes = FALSE)
# [1] TRUE

Finally to clarify your side-question: NCS are constraint in such way that the function is linear beyond the boundary knots, not between a boundary point and the adjacent interior knot.

Keeping with the same example as before:

newX <- seq(-7,7, by=0.1)
plot(x= x, y= yObs, pch=15, panel.firs= grid(), xlim= range(newX))
abline(v= crKnots, col= 'red', lty= 2)
lines(x= newX, predict(crFit, newdata= data.frame(x= newX)), col='blue' ) 
legend("bottomleft", col= c("black",'red','blue'), lty= c(0,2,1), lwd= c(0,2,2), 
       legend= c("yObs","Knot locations", "Predictions GAM"), pch= c(15,NA,NA))

enter image description here

In general, unless one needs to use the splines package to define particular knot locations, etc., I would suggest using mgcv for an out-of-the-box analysis that uses splines. It is well-documented and straight-forward to use.

[SW2017]: S. Wood, 2017, Generalized Additive Models An Introduction with R, 2nd Ed. Chapt. 5.

[LW2006]: L. Wasserman, 2006, All of Nonparametric Statistics, Chapt. 5.