Suppose we want to predict $Y$ given the following $X$ observations:
x = c(abs(rnorm(2500, 0.1, 0.25)), abs(rnorm(2500, 0, 0.05)))
y = (x^0.35) + rnorm(length(x), 0, 0.25)
x = c(x, -x)
y = c(y, -y)
Clearly we have a nonlinear (power-law) relationship between the predictor $X$ and $Y$. It's more obvious when looking at a local regression plot:
# Scatter plot with optional least-squares line (blue) and lowess smoother (green)
regrplot = function(x, y, main.par="", regr.par=TRUE, lowess.par=TRUE, xlab.par=NULL, ylab.par=NULL)
{
  plot(x, y, cex=0.5, col="red", main=main.par, xlab=xlab.par, ylab=ylab.par, pch=".")
  if( regr.par )
    abline(lm(y ~ x), col="blue", lwd=1)
  if( lowess.par )
    lines(lowess(x, y), lwd=1, col="darkgreen")
}
regrplot(x, y)
This is a replica of an empirical distribution I have to model and predict.
What is the difference between using splines and using a GLM? Which GLM regression better predicts this kind of distribution?
From what I understand, splines are better suited to studying empirical distributions because of the flexibility of piecewise polynomials joined at a chosen number of knots. GLMs seem to be better suited to studying theoretical distributions. Of course, the drawback of splines is overfitting, especially for prediction.
Best Answer
You are confusing a few issues. A spline function is typically used to relax the linearity assumption of one or more predictors, whether in the context of a GLM or a non-GLM. Loess is a local weighted linear regression, not a spline.
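To make the first point concrete, here is a minimal sketch of a spline used inside a GLM rather than as an alternative to one. It regenerates the simulated data from the question and fits the same Gaussian GLM twice, once linear in $X$ and once with $X$ expanded in a natural cubic spline basis. The seed, the use of `splines::ns`, and the choice of 5 degrees of freedom are assumptions made for illustration, not part of the original question.

```r
library(splines)  # ships with base R

set.seed(1)  # assumed seed, only so the sketch is reproducible

# Simulated data as in the question
x <- c(abs(rnorm(2500, 0.1, 0.25)), abs(rnorm(2500, 0, 0.05)))
y <- x^0.35 + rnorm(length(x), 0, 0.25)
x <- c(x, -x)
y <- c(y, -y)

# Gaussian GLM with x entering linearly
fit_lin <- glm(y ~ x, family = gaussian())

# Same GLM family, but the linearity assumption on x is relaxed
# with a natural cubic spline (df = 5 is an arbitrary choice)
fit_spl <- glm(y ~ ns(x, df = 5), family = gaussian())

# Compare the two fits; a lower AIC indicates the better trade-off
print(AIC(fit_lin, fit_spl))
```

Both fits make the same distributional assumption for $Y|X$; only the way $X$ enters the linear predictor differs.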
When discussing empirical distributions you really need to talk about the conditional distribution of $Y|X$, which may have little to do with the right-hand side of the model (how the predictors enter, as discussed above). Most models assume a parametric distribution for $Y|X$. The empirical alternative is a semi-parametric model such as a cumulative probability ordinal response model (e.g., proportional odds and proportional hazards models). These do not assume any distributional form for $Y|X$, effectively using the empirical CDF.
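One way to see that such models use only the ordering of $Y$ is the sketch below. It fits a proportional hazards model with `survival::coxph`, treating every observation as an uncensored event, to two different strictly increasing transformations of $Y$. Using `coxph` for a continuous response, the seed, and the spline with 5 degrees of freedom are all assumptions chosen for illustration; a proportional odds fit would be the other option mentioned above.

```r
library(survival)  # recommended package bundled with standard R installs
library(splines)

set.seed(1)  # assumed seed, only so the sketch is reproducible

# Simulated data as in the question
x <- c(abs(rnorm(2500, 0.1, 0.25)), abs(rnorm(2500, 0, 0.05)))
y <- x^0.35 + rnorm(length(x), 0, 0.25)
x <- c(x, -x)
y <- c(y, -y)

# Two strictly increasing, positive transformations of y
t1 <- exp(y)
t2 <- rank(y)

# Surv() with a single argument treats every observation as an event
fit1 <- coxph(Surv(t1) ~ ns(x, df = 5))
fit2 <- coxph(Surv(t2) ~ ns(x, df = 5))

# The partial likelihood depends on y only through its ordering,
# so the two sets of coefficients can be compared directly
print(all.equal(unname(coef(fit1)), unname(coef(fit2))))
```

Because no parametric form is imposed on $Y|X$, rescaling or transforming $Y$ monotonically should leave the estimated effect of $X$ unchanged, which is not true of a Gaussian GLM.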