Solved – Knots in Smoothing Splines

overfitting, smoothing, splines

In An Introduction to Statistical Learning, there's this line in the section describing the smoothing spline's tuning parameter $\lambda$:

In fitting a smoothing spline, we do not need to select the number or
location of knots – there will be a knot at each training observation,
$x_1,…,x_n$.

But, in my opinion wouldn't that be overfitting?

Reason for such a claim:
If more points act as knots in the spline fit, then the interval between adjacent knots is very small. Every change in the data is therefore picked up and imitated by the fit, so the fit follows changes that are random and do not mean anything. Hence overfitting.

Possibilities:
We know that cubic splines and Smoothing splines are different. The basic differences are:

Smoothing splines try to minimize the function $$\sum_{i=1}^n(y_i-g(x_i))^2+\lambda\int g''(t)^2dt$$
Smoothing splines shrink the parameters.

Moreover, Smoothing Splines are basically natural cubic splines, and thus they're smooth too.

So, how can these differences (and properties) save the smoothing spline from overfitting? Is there anything else that I'm missing here?

Best Answer

But, in my opinion wouldn't that be overfitting?

No.

Your equation explains it all. $$\underbrace{\sum_{i=1}^n(y_i-g(x_i))^2}_\text{residual squares}+\underbrace{\lambda\int g''(t)^2dt}_\text{roughness penalty}$$

The second part, $\lambda\int g''(t)^2dt$, is often called a roughness penalty, and $\lambda$ the roughness coefficient. The idea is that the first and second parts compete. If you make your function satisfy $g(x_i)=y_i$, i.e. go through each point exactly, then $\sum_{i=1}^n(y_i-g(x_i))^2=0$, but this usually makes the function very bumpy: it goes up and down trying to pass through each observation, and the observations have noise in them. This increases the contribution of the second part, because $g''(x)^2$ will generally be larger, and depending on $\lambda$ the second part may become very large. Note that $g''(x)$ is an approximation of the curvature of the spline.

So you may find a curve that doesn't go exactly through each point, with $g(x_i)\ne y_i$ and $\sum_{i=1}^n(y_i-g(x_i))^2>0$, but that is less bumpy and smoother, so that $g''(x)^2$ becomes smaller and the increase in the first part is more than compensated by the decrease in the second part. The roughness penalty therefore does what shrinkage does: it cures overfitting.
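The competition between the two parts can be checked numerically. The following is a minimal sketch in R on simulated data (the sine signal, the noise level and the three `lambda` values are all assumptions made for illustration). It keeps a knot at every observation via `all.knots = TRUE` and only varies `lambda`. Note that `smooth.spline` applies `lambda` on its own internal scale, so the values are not directly comparable with the $\lambda$ in the formula; only the direction of the effect is the point here.

```r
set.seed(1)
n <- 50
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # assumed toy data: signal + noise

lambdas <- c(1e-8, 1e-5, 1e-2)              # assumed values, small to large
grid <- seq(0, 1, length.out = 2001)        # fine grid for the integral

summarise_fit <- function(lambda) {
  # A knot at every observation; only the penalty weight changes
  fit <- smooth.spline(x, y, lambda = lambda, all.knots = TRUE)
  rss <- sum((y - predict(fit, x)$y)^2)     # first part: residual squares
  g2 <- predict(fit, grid, deriv = 2)$y     # second derivative g''(t)
  roughness <- sum(g2^2) * (grid[2] - grid[1])  # Riemann sum for int g''(t)^2 dt
  c(lambda = lambda, rss = rss, roughness = roughness, df = fit$df)
}

results <- t(sapply(lambdas, summarise_fit))
print(signif(results, 3))
```

As `lambda` grows, the residual sum of squares should rise while the roughness and the effective degrees of freedom fall, even though the number of knots never changes. That is the sense in which the penalty, not the knot count, controls overfitting.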

Note, that the equation you gave is not the only possible way to build the smoothing spline. It's probably the simplest and most intuitive one. You could replace the second part with something different, e.g. $\lambda\int g'(t)^2dt$ would lead to the Laplacian kernel. It minimizes the length of the smooth curve.

The example actually has a simple physical representation. Let's start with an ordinary spline. Imagine that we nail a ring to the board at each point $(x_i,y_i)$, then pass a flat spline (a thin flexible strip) through each ring. The shape the strip takes is what you get from an ordinary (cubic) spline. Here's how it looks (pic is from Wiki):

Now, instead of the rings, we nail springs at the same points and attach the spline to the springs. Since the springs can extend, the spline will no longer go through each observation! It'll relax a bit. What defines the shape of the new spline? The competition between the potential energy of the springs and the energy of tension in the flat spline. The more you bend the flat spline, the more energy is in its tension, just as with a spring extension.

So, if you recall what the potential energy of a spring is, it's proportional to the square of its extension, which here is the error (residual) $e_i=y_i-g(x_i)$. Summed over all springs, this is the sum of squares in the first part of your smoothing spline equation:

$$\sum_{i=1}^n e_i^2=\sum_{i=1}^n(y_i-g(x_i))^2$$

Now the second part of your equation gives the potential energy of the tension in the spline. In my example $\lambda\int g'(t)^2dt$ represents an approximation of the length of the spline. So, the shape of the spline will be the one that minimizes the total potential energy (in your case) or sum of the potential energy of spring extensions and the length of the spline (in my example).
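The energy picture can be made concrete with a discrete stand-in, sketched below in R. This is an illustration under stated assumptions, not the smoothing spline itself: the curve is replaced by its values `g` at equally spaced points, the derivatives by finite differences, and the data, `lambda`, and the helper names `diff_matrix`, `energy` and `relaxed_shape` are all hypothetical. The shape that minimizes spring energy plus `lambda` times the penalty then solves a linear system.

```r
set.seed(2)
n <- 40
y <- sin(seq(0, 2 * pi, length.out = n)) + rnorm(n, sd = 0.3)  # assumed toy data

# Finite-difference operator standing in for g' (order 1) or g'' (order 2)
diff_matrix <- function(n, order) diff(diag(n), differences = order)

# Total energy: spring extensions squared + lambda * penalty on the curve
energy <- function(g, y, lambda, order) {
  D <- diff_matrix(length(y), order)
  sum((y - g)^2) + lambda * sum((D %*% g)^2)
}

# Minimizer of the energy: solve (I + lambda * D'D) g = y
relaxed_shape <- function(y, lambda, order) {
  D <- diff_matrix(length(y), order)
  as.numeric(solve(diag(length(y)) + lambda * crossprod(D), y))
}

lambda <- 5                                       # assumed penalty weight
g_bend <- relaxed_shape(y, lambda, order = 2)     # second-derivative penalty
g_len  <- relaxed_shape(y, lambda, order = 1)     # first-derivative penalty

cat("energy through every point (order 2):", energy(y, y, lambda, 2), "\n")
cat("energy of relaxed shape    (order 2):", energy(g_bend, y, lambda, 2), "\n")
```

Forcing the curve through every point makes the spring term zero but leaves all the energy in the penalty term; the relaxed shape accepts some spring extension in exchange for a lower total.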

Related Solutions

Solved – Find good smoothing spline factor

What is a smoothing spline?

The Wikipedia article on smoothing splines does a good job of explaining that. To recap, given a set of data points $\{(x_i, y_i)\}_{i=1}^n$, a smoothing spline is a solution to the penalized least-squares problem (not an interpolation problem, since $f$ need not pass through the data):

$$\underset{f}{\arg\min} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_{x_{(1)}}^{x_{(n)}} f''(x)^2 dx,$$

with $f$ constrained to be piecewise cubic between different $x_i$. The first part measures the goodness of fit of such an $f$ to the observed data. The second part is a penalty term for the wiggliness (non-smoothness) of $f$.

This leaves it to us to find a good trade-off between fit and smoothness by means of $\lambda$.

Smoothing splines in R

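Luckily, base R does the heavy lifting for us: smooth.spline lives in the stats package, which is attached by default, so no extra library is needed.

# smooth.spline() is in 'stats' (attached by default), not in 'splines'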

mydata <- read.csv(...)

myspline <- smooth.spline(x = mydata$x, y = mydata$y
                          , lambda = 8e-9 # optim 8.332658e-11
                          , cv = TRUE) 

xgrid <- sort(union(mydata$x
             , seq(from = min(mydata$x), to = max(mydata$x), by = 1))
             , decreasing = FALSE)

yhat_xgrid <- predict(myspline, x = xgrid)$y

plot(x = mydata$x, y = mydata$y, log = "x", ylim = c(0,1)
     , xlab = "x (log-scale)", ylab = "y"
     , col  = "lightblue", pch = 19)
lines(x = xgrid, y = yhat_xgrid, type = "l", col = "darkorange")
grid()
legend(...)

And we obtain this lovely plot.

Smoothing spline with $\lambda = 8 \cdot 10^{-9}$.

The optimal values for $\lambda$ are $\hat{\lambda}^*_{\text{LOO}} = 8.33 \cdot 10^{-11}$ and $\hat{\lambda}^*_{\text{GCV}} = 5.81 \cdot 10^{-13}$. I like the one plotted: $\hat{\lambda}^*_{\text{Jim}} = 8 \cdot 10^{-9}$.
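For readers without the original data, here is a minimal sketch of how the two criteria can be obtained from `smooth.spline` itself, assuming simulated data (the sine signal and noise level are made up, so the selected values will not match the ones quoted above). When neither `lambda`, `spar` nor `df` is supplied, `smooth.spline` chooses the smoothing parameter by generalized cross-validation with `cv = FALSE` (the default) and by leave-one-out cross-validation with `cv = TRUE`.

```r
set.seed(3)
n <- 100
x <- sort(runif(n))                          # assumed toy design, unique x values
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)    # assumed toy response

# No lambda/spar/df given, so the smoothing parameter is chosen automatically
fit_gcv <- smooth.spline(x, y, cv = FALSE)   # generalized cross-validation
fit_loo <- smooth.spline(x, y, cv = TRUE)    # leave-one-out cross-validation

cat("GCV: lambda =", fit_gcv$lambda, " df =", fit_gcv$df, "\n")
cat("LOO: lambda =", fit_loo$lambda, " df =", fit_loo$df, "\n")
```

The two criteria need not agree, as the values above show, and neither is guaranteed to match what looks best by eye, so passing a hand-picked `lambda` remains a legitimate choice.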

Solved – Proper terminology for what happens at knots in a cubic spline function

This answer on math.stackexchange.com suggests one way to proceed. In particular:

The typical mathematical definition of "smooth" says something about how many continuous derivatives the function has. But these sorts of definitions bear little relationship to the intuitive notion of "smoothness" of a curve.

Starting from the limitations of fitting an infinitely mathematically smooth (in the sense of infinite differentiability) high-degree polynomial might help heuristically. I'd suggest something like:

At knots the required level of mathematical smoothness is relaxed, better to match the intuitive notion of smoothness while allowing the curve to pass through the knots. The curve between each pair of adjacent knots can then be a simple, infinitely smooth, 3rd-degree polynomial.

If there is time, an illustration like the following based on Runge's phenomenon might help.

Consider the following 9 points joined with straight lines:

We want to fit a smooth curve to these points, to avoid the sharp changes in the line at the points. We could try to fit a curve that is infinitely smooth mathematically through these points, in the sense that not only is the curve continuous, but the slope of the curve is continuous, as is the slope of the slope, and so on forever (infinite differentiability). Polynomials are infinitely smooth in that sense, but here's what you get if you fit a polynomial through these points:

As @bubba has put it about the high-degree polynomials needed for this type of fitting:

No-one (except a mathematician) would call them "smooth".

If we remove the requirement for infinite mathematical smoothness at the knots, however, we can do much better. Then we can use an infinitely smooth 3rd-degree polynomial between each pair of adjacent knots, and at the knots require just the minimum smoothness needed to make the joins invisible:

where the orange line is a cubic spline fit and the blue line shows the smooth Runge function from which the points were sampled. This approach provides "the least possible amount of wiggling in between" the knots and thus meets an intuitive sense of "smoothness."
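The comparison can be reproduced with a short R sketch. It assumes the 9 points are equally spaced on $[-1, 1]$ and sampled from the classic Runge function $1/(1+25x^2)$, and that the spline is a natural cubic spline; none of these details is stated in the answer, so treat them as illustrative choices.

```r
runge <- function(x) 1 / (1 + 25 * x^2)      # assumed form of the Runge function
knots_x <- seq(-1, 1, length.out = 9)        # assumed: 9 equally spaced points
knots_y <- runge(knots_x)

# Degree-8 polynomial through all 9 points, via the Vandermonde system
V <- outer(knots_x, 0:8, `^`)
coefs <- solve(V, knots_y)
poly_interp <- function(x) as.numeric(outer(x, 0:8, `^`) %*% coefs)

# Piecewise cubic through the same points (natural end conditions assumed)
spline_interp <- splinefun(knots_x, knots_y, method = "natural")

grid <- seq(-1, 1, length.out = 1001)
max_err_poly   <- max(abs(poly_interp(grid) - runge(grid)))
max_err_spline <- max(abs(spline_interp(grid) - runge(grid)))

cat("max error, degree-8 polynomial:", max_err_poly, "\n")
cat("max error, cubic spline:       ", max_err_spline, "\n")
```

Both curves pass through all 9 points, but the polynomial's largest deviation from the underlying function between the points should be far bigger than the spline's, which is the wiggling the pictures illustrate.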
