Solved – the difference between GLM and splines

Tags: empirical-cumulative-distr-fn, generalized linear model, loess, nonlinear regression, splines

Suppose we want to predict $Y$ given the following $X$ observations:

x = c(abs(rnorm(2500, 0.1, 0.25)), abs(rnorm(2500, 0, 0.05)))
y = (x^0.35) + rnorm(length(x), 0, 0.25)
x = c(x, -x)
y = c(y, -y)

Clearly the relationship between the predictor $X$ and $Y$ is nonlinear (a power law by construction, not a straight line). It's more obvious looking at a local regression plot:

regrplot = function(x, y, main.par="", regr.par=TRUE, lowess.par=TRUE, xlab.par=NULL, ylab.par=NULL)
{
  # scatter plot of the raw data
  plot(x, y, cex=0.5, col="red", main=main.par, xlab=xlab.par, ylab=ylab.par, pch=".")
  # least-squares straight line
  if( regr.par )
    abline(lm(y ~ x), col="blue", lwd=1)
  # lowess (locally weighted regression) smooth
  if( lowess.par )
    lines(lowess(x, y), lwd=1, col="darkgreen")
}

regrplot(x, y)


This is a replica of an empirical distribution I have to model and predict.

What is the difference between using splines and using a GLM? Which GLM regression better predicts this kind of distribution?

From what I understand, splines are better suited to studying empirical distributions because of the flexibility of piecewise polynomials joined at a chosen number of knots. GLMs seem to be better suited to studying theoretical distributions. Of course, the drawback of splines is overfitting, especially for prediction.

Best Answer

You are confusing a few issues. A spline function is typically used to relax the linearity assumption for one or more predictors, whether in a GLM or in some other kind of model. Loess is a locally weighted regression smoother, not a spline.
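To make the point concrete, here is a minimal sketch of a spline used inside a GLM, reusing the simulated data from the question. The seed, the Gaussian family, the choice of a natural cubic spline via `splines::ns`, and `df = 5` are all assumptions made for illustration, not recommendations from the answer.

```r
library(splines)  # ships with R; provides ns() for natural cubic splines

set.seed(1)  # assumption: seed added only so the sketch is reproducible

# simulated data from the question
x <- c(abs(rnorm(2500, 0.1, 0.25)), abs(rnorm(2500, 0, 0.05)))
y <- (x^0.35) + rnorm(length(x), 0, 0.25)
x <- c(x, -x)
y <- c(y, -y)

# GLM that is linear in x
fit_lin <- glm(y ~ x, family = gaussian())

# the same GLM, with x expanded into a natural cubic spline basis
# (df = 5 is an arbitrary illustrative choice)
fit_spl <- glm(y ~ ns(x, df = 5), family = gaussian())

# the spline fit nests the straight line, so its deviance cannot be larger
c(linear = deviance(fit_lin), spline = deviance(fit_spl))

# overlay both fits on the data
ord <- order(x)
plot(x, y, pch = ".", col = "red")
lines(x[ord], fitted(fit_lin)[ord], col = "blue")
lines(x[ord], fitted(fit_spl)[ord], col = "darkgreen")
```

The only change between the two fits is the right-hand side of the formula; the family (the assumed distribution of $Y|X$) is the same in both. The spline is a choice about how a predictor enters the model, not a competing model class.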

When discussing empirical distributions, you really need to talk about the conditional distribution of $Y|X$, which may have little to do with the right-hand side of the model discussed above (that is, how the predictors enter). Most models assume a parametric distribution for $Y|X$. The empirical alternative is semiparametric models such as cumulative probability ordinal response models (e.g., proportional odds and proportional hazards models). These do not assume any distributional form for $Y|X$, effectively using the empirical CDF.