Solved – Find good smoothing spline factor

curve fitting, smoothing, splines

I'd like an automatic way to find the "best" smoothing factor s for a spline fit to a given set of data points. Here's a sample visualization of some data and the fit splines for various s values:

[Figure: sample data with the fitted splines for several values of s]

In this case, s=2 (and, to a lesser degree, s=1) is clearly not a good fit. On the other hand, s=0.5 fits the data almost as well as s=0.1 but with fewer than half as many knots, and is thus less susceptible to overfitting. So my question is: what is a robust method to determine the "optimal", or at least a good enough, s to fit the data?

Best Answer

What is a smoothing spline?

The Wikipedia article on smoothing splines explains this well. To recap, given a set of data points $\{(x_i, y_i)\}_{i=1}^n$, a smoothing spline is a solution to the penalized least-squares problem:

$$\underset{f}{\arg\min} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_{x_{(1)}}^{x_{(n)}} f''(x)^2 dx,$$

with $f$ constrained to be piecewise cubic between consecutive $x_i$. The first term measures how well such an $f$ fits the observed data. The second term is a penalty on the wiggliness (non-smoothness) of $f$.

This leaves it to us to find a good trade-off between fit and smoothness by means of $\lambda$: a larger $\lambda$ gives a smoother curve, while a smaller $\lambda$ lets the curve follow the data more closely.
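One common way to automate this trade-off is cross-validation. As a sketch, and using notation that does not appear above, write $\hat{f}_\lambda$ for the spline fitted with a given $\lambda$ and $S_\lambda$ for the $n \times n$ smoother matrix that maps the observations to the fitted values, $\hat{\mathbf{y}} = S_\lambda \mathbf{y}$. Assuming distinct $x_i$ and unit weights, the leave-one-out (LOO) and generalized cross-validation (GCV) scores are

$$\text{LOO}(\lambda) = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{f}_\lambda(x_i)}{1 - (S_\lambda)_{ii}} \right)^2, \qquad \text{GCV}(\lambda) = \frac{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}_\lambda(x_i))^2}{\left( 1 - \operatorname{tr}(S_\lambda)/n \right)^2}.$$

A data-driven $\lambda$ is then a minimizer of one of these scores. GCV differs from LOO only in replacing each leverage $(S_\lambda)_{ii}$ by the average leverage $\operatorname{tr}(S_\lambda)/n$.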


Smoothing splines in R

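Luckily, R does the heavy lifting for us with smooth.spline(). It lives in the stats package, which is attached by default, so no library() call is needed.

mydata <- read.csv(...)

# Fit with a hand-picked penalty. Because lambda is given explicitly,
# cv = TRUE does not change the fit; it only selects leave-one-out CV
# as the criterion reported in myspline$cv.crit.
myspline <- smooth.spline(x = mydata$x, y = mydata$y
                          , lambda = 8e-9 # optim 8.332658e-11
                          , cv = TRUE) 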

# Evaluate the fit at the observed x values plus a regular grid
xgrid <- sort(union(mydata$x
             , seq(from = min(mydata$x), to = max(mydata$x), by = 1))
             , decreasing = FALSE)

yhat_xgrid <- predict(myspline, x = xgrid)$y

# Data as points, fitted spline as a line, x axis on a log scale
plot(x = mydata$x, y = mydata$y, log = "x", ylim = c(0,1)
     , xlab = "x (log-scale)", ylab = "y"
     , col  = "lightblue", pch = 19)
lines(x = xgrid, y = yhat_xgrid, type = "l", col = "darkorange")
grid()
legend(...)

And we obtain this lovely plot.

[Figure: smoothing spline fit with $\lambda = 8 \cdot 10^{-9}$.]

The optimal values for $\lambda$ under leave-one-out cross-validation (LOO) and generalized cross-validation (GCV) are $\hat{\lambda}^*_{\text{LOO}} = 8.33 \cdot 10^{-11}$ and $\hat{\lambda}^*_{\text{GCV}} = 5.81 \cdot 10^{-13}$, respectively. I like the one plotted: $\hat{\lambda}^*_{\text{Jim}} = 8 \cdot 10^{-9}$.
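For readers who want smooth.spline() to pick $\lambda$ itself, here is a minimal sketch. It assumes simulated data, since the original data set is not shown; the names x, y, fit_loo and fit_gcv are hypothetical. When lambda, spar and df are all left unset, the function searches for the smoothing parameter that minimizes the chosen criterion: cv = TRUE selects leave-one-out CV and cv = FALSE (the default) selects GCV.

```r
# Hypothetical simulated data standing in for the original mydata
set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

# Leave lambda, spar and df unset so that smooth.spline searches for
# the smoothing parameter instead of using a fixed one.
fit_loo <- smooth.spline(x, y, cv = TRUE)   # ordinary leave-one-out CV
fit_gcv <- smooth.spline(x, y, cv = FALSE)  # generalized CV (the default)

# Selected penalties and the matching effective degrees of freedom
print(c(loo = fit_loo$lambda, gcv = fit_gcv$lambda))
print(c(loo = fit_loo$df,     gcv = fit_gcv$df))
```

The two criteria need not agree, as the values above show, so it is worth plotting the resulting fit before trusting either one. A hand-picked $\lambda$ remains a legitimate choice when the automatic one looks too wiggly.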
