Solved – Importance of multiple linear regression assumptions when building predictive regression models

assumptions, multicollinearity, predictive-models

As far as I know, one can distinguish between two main goals of regression analysis:

  1. The goal is understanding causal relations between variables. Here, one has to check several common regression assumptions (the main ones being linearity, normality of the errors, zero-mean residuals, homoscedasticity, and independence of the errors). Collinearity needs to be analysed with care and removed if possible, either by deleting collinear predictors detected via variance inflation factors or by applying principal component regression, etc. (a short VIF sketch follows this list).

  2. The goal of the analysis is the creation of predictive models. In contrast to the previous case, here one does not need to worry about collinearity at all (many online sources discuss this).
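To make the collinearity check in point 1 concrete, here is a minimal sketch using the variance inflation factors from statsmodels. The data are simulated purely for illustration (x2 is constructed to be nearly collinear with x1); it is not data from the question.

```python
# Sketch: detecting multicollinearity with variance inflation factors (VIF).
# Simulated, illustrative data: x2 is almost a copy of x1, x3 is independent.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIFs are computed column by column on the full design matrix (with intercept).
Xc = sm.add_constant(X)
vifs = {col: variance_inflation_factor(Xc.values, i)
        for i, col in enumerate(Xc.columns) if col != "const"}
print(vifs)   # x1 and x2 show very large VIFs; x3 stays close to 1
```

A common rule of thumb flags predictors with VIF above roughly 5 or 10 as problematic for interpreting individual coefficients.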

However, I am not sure what one should do about the other "standard" regression checks, such as testing assumptions and hypotheses (p-values). Can one ignore these as well, just like collinearity, when the goal is building predictive models?

In other words: could some "dummy" forward or backward feature selection be applied to select the features that maximize predictive R-squared (using cross-validation or a test set), without taking care of the regression assumptions or predictor significance? A sketch of what I mean follows.
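The following is a minimal sketch of such a "dummy" forward selection, assuming scikit-learn's `SequentialFeatureSelector` with plain linear regression and cross-validated R-squared as the only criterion; the dataset is synthetic and the choice of 5 features is arbitrary.

```python
# Sketch: forward feature selection driven only by cross-validated R^2,
# with no attention to p-values or classical regression assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 20 candidate predictors, only 5 of which carry signal.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lm = LinearRegression()
sfs = SequentialFeatureSelector(lm, n_features_to_select=5,
                                direction="forward", scoring="r2", cv=5)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())

# Score the chosen subset by cross-validated R^2 (the sole selection criterion).
cv_r2 = cross_val_score(lm, X[:, selected], y, scoring="r2", cv=5).mean()
print("selected columns:", selected)
print("mean CV R^2:", round(cv_r2, 3))
```

Note that re-scoring on the same data used for selection is optimistically biased; an honest estimate of predictive R-squared would nest the selection inside the cross-validation loop or use a separate test set.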

Best Answer

I think for prediction, the only thing that matters is validating properly to avoid overfitting.

By "prediction" I mean that the output of the model is a point estimate of some future response. On the other hand, if the output includes not only the point estimate but also CI, then it's different. Imagine you know the "true" predictors and true regression coefficients for a linear model, but you don't know the distribution of error terms and assume it's normal. Then your point estimate of future response will be fine, but you'll never be able to provide an adequate CI for it if the normality assumption doesn't hold.
