Solved – How to find a p-value of smooth spline / loess regression

Tags: loess, regression, splines

I have some variables and I am interested in finding non-linear relationships between them, so I decided to fit a spline or loess curve and print nice plots (see the code below). But I also want a statistic that tells me how likely it is that the apparent relationship is just random noise, i.e. an overall p-value such as the one linear regression provides. In other words, I need to know whether the fitted curve means anything at all, since my code will happily fit a curve to any data.

x <- rnorm(1000)
y <- sin(x) + rnorm(1000, 0, 0.5)

cor.test(x,y)
plot(x, y, xlab = "x", ylab = "y")
spl1 <- smooth.spline(x, y, tol = 1e-6, df = 8)
lines(spl1, col = "green", lwd = 2)

spl2 <- loess(y ~ x)
x.pr <- seq(min(x), max(x), length.out = 100)
lines(x.pr, predict(spl2, x.pr), col = "blue", lwd = 2)

Best Answer

The splines library has the functions bs and ns, which create spline basis matrices for use with the lm function. You can then fit a plain linear model and a model that includes the spline terms, and use the anova function to do a full-versus-reduced model test of whether the spline model fits significantly better than the linear model.

Here is some example code:

x <- rnorm(1000)
y <- sin(x) + rnorm(1000, 0, 0.5)

library(splines)

fit0 <- lm(y ~ 1)         # intercept only
fit1 <- lm(y ~ x)         # straight line
fit2 <- lm(y ~ bs(x, 5))  # B-spline basis with 5 df

anova(fit1, fit2)  # does the spline beat the straight line?
anova(fit0, fit2)  # does the spline beat the flat line?

plot(x,y, pch='.')
abline(fit1, col='red')
xx <- seq(min(x),max(x), length.out=250)
yy <- predict(fit2, data.frame(x=xx))
lines(xx,yy, col='blue')

You can also use the poly function to fit a polynomial and test its non-linear terms as a check for curvature.
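A minimal sketch of that poly approach, reusing the simulated data from above (the degree of 3 here is an arbitrary illustrative choice):

```r
set.seed(1)
x <- rnorm(1000)
y <- sin(x) + rnorm(1000, 0, 0.5)

fit.lin  <- lm(y ~ x)
fit.poly <- lm(y ~ poly(x, 3))  # cubic orthogonal polynomial

# joint F test of the quadratic and cubic terms:
anova(fit.lin, fit.poly)

# or inspect the individual non-linear coefficients:
summary(fit.poly)
```

Because poly builds orthogonal polynomials, the t tests on the higher-order coefficients in summary(fit.poly) are directly interpretable as tests of curvature.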

For the loess fit it is a little more complicated. There are estimates of the equivalent degrees of freedom used by the loess smoother, which could be combined with the fits of the linear and loess models to construct an approximate F test. I think methods based on bootstrapping and permutation tests may be more intuitive, though.
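One way to sketch such an approximate F test by hand, using the `enp` (equivalent number of parameters) component that a loess fit reports; note this is an approximation, not an exact test:

```r
set.seed(1)
x <- rnorm(1000)
y <- sin(x) + rnorm(1000, 0, 0.5)
n <- length(x)

fit.lm <- lm(y ~ x)
fit.lo <- loess(y ~ x)

rss.lm <- sum(residuals(fit.lm)^2)
rss.lo <- sum(residuals(fit.lo)^2)

df.lm <- 2            # intercept + slope
df.lo <- fit.lo$enp   # equivalent number of parameters of the smoother

# approximate F statistic for the extra (non-linear) flexibility:
Fstat <- ((rss.lm - rss.lo) / (df.lo - df.lm)) / (rss.lo / (n - df.lo))
pval  <- pf(Fstat, df.lo - df.lm, n - df.lo, lower.tail = FALSE)
pval
```

The degrees of freedom here are only approximate (equivalent df for smoothers are not integers and the F distribution is not exact), which is one reason the permutation approaches below can feel more trustworthy.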

There are techniques to compute and plot a confidence interval for a loess fit (I think ggplot2 draws one by default with geom_smooth). You can plot the confidence band and see whether a straight line would fit within the band; this is not a p-value, but it still gives a yes/no answer.

You could also fit the linear model, take its residuals, and fit a loess to those residuals as the response (with the variable of interest as the predictor). If the true model is linear, this fit should be close to a flat line, and reordering the residuals relative to the predictor should not make any difference. You can use this to create a permutation test: fit the loess and find the predicted value farthest from 0; then repeatedly permute the residuals, refit the loess, and record the farthest predicted value each time. The p-value is the proportion of permuted statistics that are farther from 0 than the original one.
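That permutation test might be sketched like this (199 permutations is an arbitrary choice to keep the runtime down; more is better in practice):

```r
set.seed(1)
x <- rnorm(1000)
y <- sin(x) + rnorm(1000, 0, 0.5)

r <- residuals(lm(y ~ x))  # residuals from the linear fit

# test statistic: largest loess prediction (in absolute value) from 0
max.dev <- function(xx, rr) {
  max(abs(fitted(loess(rr ~ xx))))
}

obs  <- max.dev(x, r)                              # observed statistic
perm <- replicate(199, max.dev(x, sample(r)))      # permuted statistics
p.value <- (sum(perm >= obs) + 1) / (length(perm) + 1)
p.value
```

The +1 in numerator and denominator is the usual correction that counts the observed arrangement among the permutations, so the p-value can never be exactly zero.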

You may also want to look at cross-validation as a method of choosing the loess bandwidth. This does not give a p-value, but an infinite bandwidth corresponds to an ordinary linear fit: if cross-validation favors a very large bandwidth, that suggests a linear model may be reasonable, whereas if the larger bandwidths are clearly inferior to some of the smaller ones, that suggests definite curvature and a linear fit is not sufficient.
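A simple k-fold cross-validation over the loess span parameter could look like the sketch below (the grid of spans and k = 5 are illustrative choices; `surface = "direct"` lets loess predict for held-out points outside the training range):

```r
set.seed(1)
x <- rnorm(500)
y <- sin(x) + rnorm(500, 0, 0.5)
d <- data.frame(x, y)

# mean k-fold CV prediction error for a given span
cv.loess <- function(span, data, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  mse <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit <- loess(y ~ x, data = train, span = span,
                 control = loess.control(surface = "direct"))
    mse[i] <- mean((test$y - predict(fit, test))^2)
  }
  mean(mse)
}

spans  <- c(0.2, 0.35, 0.5, 0.75, 1, 2)
cv.err <- sapply(spans, cv.loess, data = d)
cbind(spans, cv.err)
# if the largest spans do about as well as the best small span,
# a linear fit may suffice; a clear minimum at a small span
# suggests real curvature
```

Plotting cv.err against spans makes the comparison easy to eyeball.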