Solved – Does cross-validation on simple or multiple linear regression make sense

cross-validation, linear model, multiple regression, regression

Does it make sense to apply a train-test split or k-fold cross-validation to a simple or multiple linear regression model?
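For concreteness, here is a minimal sketch of the procedure I'm asking about (using scikit-learn on synthetic data; the variable names are just placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: one predictor, linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=100)

# Train-test split: fit on 80%, evaluate on the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))

# 5-fold cross-validation of the same model
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("CV MSE:", -scores.mean())
```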

I'm really confused about this because I saw this question: How to Evaluate Results of Linear Regression, where the upvoted comments and answers suggest no.

Comment by @octern:

I don't think this kind of assessment is generally used with simple
regression models. What would it tell you that you wouldn't find out
from using the entire dataset to generate your regression parameters?
Normally the reason to use an evaluation dataset is to prevent
overfitting, but that's not an issue when you already know that your
model is going to contain just one independent variable.

Top answer by @MattKrause:

I'd agree with @Octern that one rarely sees people using train/test
splits (or even things like cross-validation) for linear models.
Overfitting is (almost) certainly not an issue with a very simple
model like this one.

Best Answer

First, over-fitting may not always be a real concern. No variable selection (or any other way of using the response to decide how to specify the predictors), few estimated parameters, many observations, only weakly correlated predictors, & a low error variance might lead someone to suppose that validating the model-fitting procedure isn't worth the candle. Fair enough; though you might ask why, if they're so sure about that, they didn't specify more parameters to allow for non-linear relationships between predictors & response, or for interactions.

Second, it may be that parameter estimation rather than prediction is the aim of the analysis. If you're using regression to estimate the Young's modulus of a material, then the job's done once you have the point estimate & confidence interval.
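For example, a minimal sketch of that workflow (assuming statsmodels & synthetic stress-strain data, with a made-up true modulus of 200 GPa):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stress-strain data: stress = E * strain + noise,
# where the slope E is the Young's modulus to be estimated
rng = np.random.default_rng(1)
strain = rng.uniform(0.0, 0.01, size=30)
stress = 200e9 * strain + rng.normal(0, 1e7, size=30)  # E ~ 200 GPa (steel-like)

# OLS fit; the analysis ends with the point estimate & confidence interval
X = sm.add_constant(strain)
fit = sm.OLS(stress, X).fit()
print("estimated E (Pa):", fit.params[1])
print("95% CI:", fit.conf_int(alpha=0.05)[1])
```

No held-out data is needed here: the inferential target is the slope itself, not out-of-sample prediction error.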

Third, with ordinary least-squares regression (& no variable selection) you can calculate estimates of predictive performance analytically: the adjusted coefficient of determination & the predicted residual sum of squares (PRESS) statistic (see Does adjusted R-square seek to estimate fixed score or random score population r-squared? & Why not using cross validation for estimating the error of a linear model?).
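For illustration, here's a minimal sketch (numpy only, synthetic data) of computing both quantities analytically; the PRESS computation uses the standard leverage identity, so the leave-one-out errors come out of a single fit:

```python
import numpy as np

# Synthetic data: n observations, two predictors plus an intercept column
rng = np.random.default_rng(2)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(0, 1, size=n)

# Single OLS fit
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Adjusted R^2: penalises R^2 for the p estimated slope parameters
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# PRESS: leave-one-out squared prediction errors recovered analytically
# from the hat-matrix leverages h_ii -- no refitting required
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
press = ((resid / (1 - h)) ** 2).sum()

print("adjusted R^2:", adj_r2)
print("PRESS:", press)
```

That closed-form leave-one-out error is precisely why explicit cross-validation buys little for a fixed OLS specification.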
