Multiple Regression – Linear Regression Validation Performance Despite Violated Linearity Assumptions

boosting, least squares, linearity, multiple regression, residuals

I have a dataset with about 8000 samples and 18 predictors (16 continuous, 2 categorical). I am trying to fit a linear regression, but despite trying multiple transformations, I can't get it to meet the linearity assumption judging by the predicted vs. actual plot. The best I can do is:

[predicted vs. actual plot]

Also, the residuals look normal to the naked eye, but they don't pass any statistical normality test, so this assumption is not met either.
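
For reference, the check I mean is roughly the following (a minimal sketch assuming NumPy, SciPy, Matplotlib and scikit-learn; the data are simulated placeholders, not my actual dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.linear_model import LinearRegression

# Placeholder data: 8000 samples, continuous predictors, mildly non-normal noise
rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 18))
y = X @ rng.normal(size=18) + rng.standard_t(df=8, size=8000)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Visual check: QQ plot of residuals against a normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Formal check: D'Agostino-Pearson normality test
stat, p_value = stats.normaltest(residuals)
print(f"normality test p-value: {p_value:.4g}")  # tiny at n = 8000
```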

However, when testing the regression on other datasets as validation, it performs just as well as an XGBoost model fit on the same data (LR: R² = 0.47, MAPE = 13.72; XGB: R² = 0.47, MAPE = 13.32).
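
Roughly, the comparison looks like this (a sketch assuming scikit-learn and the xgboost package; the data, preprocessing and hyperparameter tuning are simplified into placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_percentage_error
from xgboost import XGBRegressor

# Placeholder training and external validation data (positive targets so MAPE is meaningful)
rng = np.random.default_rng(1)
X_train, X_val = rng.normal(size=(8000, 18)), rng.normal(size=(2000, 18))
beta = rng.normal(size=18)
y_train = 50 + X_train @ beta + rng.normal(size=8000)
y_val = 50 + X_val @ beta + rng.normal(size=2000)

models = {
    "LR": LinearRegression(),
    "XGB": XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_val)
    print(name,
          "R2 =", round(r2_score(y_val, pred), 2),
          "MAPE =", round(100 * mean_absolute_percentage_error(y_val, pred), 2))
```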

The validation data has approximately the same range of values as the test data, so extrapolation does not seem to be the issue.
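
The range check itself is just a per-feature comparison of minima and maxima, something like this sketch (placeholder arrays; the same comparison applies to any pair of datasets):

```python
import numpy as np

def range_overlap_report(X_ref, X_other, names):
    """Flag features whose values in X_other fall outside the range seen in X_ref."""
    for j, name in enumerate(names):
        lo, hi = X_ref[:, j].min(), X_ref[:, j].max()
        o_lo, o_hi = X_other[:, j].min(), X_other[:, j].max()
        outside = o_lo < lo or o_hi > hi
        print(f"{name}: ref [{lo:.2f}, {hi:.2f}] vs other [{o_lo:.2f}, {o_hi:.2f}]"
              + ("  <-- extends beyond reference range" if outside else ""))

rng = np.random.default_rng(2)
X_ref = rng.normal(size=(8000, 3))
X_other = rng.normal(size=(2000, 3))
range_overlap_report(X_ref, X_other, ["x1", "x2", "x3"])
```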

My question is: if linear regression does that well on MAPE, RMSE and R² on external datasets, can you ignore that it does not meet the assumptions? Does this suggest it could do even better if it did meet them? And how can it perform as well as XGBoost, which should handle non-linear data better, despite not meeting the linearity assumption?

PS: XGBoost's hyperparameters were tuned before the comparison.

Best Answer

The idea that you'd need to "make the data meet" certain model assumptions is wrong, as model assumptions are never perfectly fulfilled anyway. In particular, formal model assumptions almost always require that the data are not pre-processed in a data-dependent manner, which rules out transformations chosen to make the data look "more linear" or "more normal" and the like. So if the data don't satisfy the model assumptions before such pre-processing (which they quite generally don't), no such manipulation can make them satisfy the assumptions.

Model assumptions mean that a statistical method has certain good properties if the model is true. This does not mean that the method cannot perform well when the assumptions are not met. Nothing in principle stops a method such as linear regression from predicting well even when its model assumptions are violated.

The role of any data manipulation that brings the data closer to how they supposedly should look (here: linear with normal residuals) can never be to ultimately meet the assumptions. It can, however, in many situations improve the fit of the model to the data, and thereby the prediction accuracy. But if you don't find a way to improve your fit in this way, so be it; there is no guarantee. It is well known that normality and linearity tests are not a reliable indicator of whether such an improvement exists. In particular, with many data points such tests can reject the assumptions in situations where the violations are mild and no improvement may be available.
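
A small simulation makes the sample-size point concrete (a sketch assuming NumPy, SciPy and scikit-learn; the t-distributed errors are just one example of a mild violation):

```python
import numpy as np
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n, p = 8000, 10

# Mildly non-normal errors: a t-distribution with 8 degrees of freedom
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.standard_t(df=8, size=n)

X_train, X_test = X[:6000], X[6000:]
y_train, y_test = y[:6000], y[6000:]

model = LinearRegression().fit(X_train, y_train)
residuals = y_train - model.predict(X_train)

# With thousands of observations, the normality test flags this mild violation...
print("normality test p-value:", stats.normaltest(residuals).pvalue)

# ...while out-of-sample prediction is essentially unaffected
print("holdout R2:", round(r2_score(y_test, model.predict(X_test)), 3))
```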

Another aspect is that inference, such as tests and confidence intervals, is based on the model assumptions. There are situations in which model assumptions are violated, prediction quality is fine, but tests and confidence intervals are biased. Unfortunately, this cannot formally be repaired by data-dependent pre-processing (because, to be valid, inference would need to take such pre-processing into account, which standard inference doesn't). Still, applying such pre-processing may reduce bias when the model violations are severe and the pre-processing improves the fit substantially. There is some research on this, but unfortunately the message is far from clear: it may help, it may also do harm, and it is hard to diagnose which situation you are in.

The bottom line is that it is by no means mandatory to do something to bring the data closer to the model assumptions. Chances are it is worthwhile if a simple transformation gives you a striking, visible improvement, whereas trying hard to make significant p-values of normality or linearity tests go away may well be worse than useless.
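
To illustrate the inference point, here is a hedged sketch (assuming NumPy and statsmodels) using heteroskedastic errors as one concrete violation, which the question doesn't involve directly but which is easy to simulate: the classical 95% confidence interval for the slope falls short of its nominal coverage, even though the fitted coefficients themselves remain unbiased.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, n_sims, true_slope = 500, 2000, 1.0
covered = 0

for _ in range(n_sims):
    x = rng.uniform(0, 3, size=n)
    # Heteroskedastic noise: spread grows with x, violating the constant-variance assumption
    y = true_slope * x + rng.normal(scale=0.5 + x, size=n)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    lo, hi = fit.conf_int()[1]          # classical 95% CI for the slope
    covered += lo <= true_slope <= hi

print(f"empirical coverage of the nominal 95% CI: {covered / n_sims:.3f}")
```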

As long as you are only interested in prediction performance, though, the inference aspect need not worry you much. It may be enough to keep in mind that there is model uncertainty on top of the uncertainty expressed by model-based inference. Model uncertainty may not be a big deal when model-based inference asymptotically applies to a more general class of models (via the central limit theorem and the like, which covers linear regression in many cases) and the sample is reasonably large, although certain problems (outliers, strong nonlinearity) can make it hit harder.