Multiple Regression – Linear Regression Validation Performance Despite Violated Linearity Assumptions

boosting, least squares, linearity, multiple regression, residuals

I have a dataset with about 8000 samples and 18 predictors (16 continuous, 2 categorical). I am trying to fit a linear regression, but despite trying multiple transformations, I can't get it to meet the linearity assumption judging by the predicted vs. actual plot. The best I can do is:

[predicted vs. actual plot]

Also, the residuals look normal to the naked eye, but they don't pass any statistical normality test, so this assumption is not met either.
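
For reference, the check I mean is roughly the following (a minimal sketch assuming NumPy, SciPy, Matplotlib and scikit-learn; the data are simulated placeholders, not my actual dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.linear_model import LinearRegression

# Placeholder data: 8000 samples, continuous predictors, mildly non-normal noise
rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 18))
y = X @ rng.normal(size=18) + rng.standard_t(df=8, size=8000)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Visual check: QQ plot of residuals against a normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Formal check: D'Agostino-Pearson normality test
stat, p_value = stats.normaltest(residuals)
print(f"normality test p-value: {p_value:.4g}")  # tiny at n = 8000
```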

However, when testing the regression on other datasets as validation, it performs just as well as an XGBoost model fit on the same data (LR: R² = 0.47, MAPE = 13.72; XGB: R² = 0.47, MAPE = 13.32).
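
Roughly, the comparison looks like this (a sketch assuming scikit-learn and the xgboost package; the data, preprocessing and hyperparameter tuning are simplified into placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_percentage_error
from xgboost import XGBRegressor

# Placeholder training and external validation data (positive targets so MAPE is meaningful)
rng = np.random.default_rng(1)
X_train, X_val = rng.normal(size=(8000, 18)), rng.normal(size=(2000, 18))
beta = rng.normal(size=18)
y_train = 50 + X_train @ beta + rng.normal(size=8000)
y_val = 50 + X_val @ beta + rng.normal(size=2000)

models = {
    "LR": LinearRegression(),
    "XGB": XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_val)
    print(name,
          "R2 =", round(r2_score(y_val, pred), 2),
          "MAPE =", round(100 * mean_absolute_percentage_error(y_val, pred), 2))
```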

The validation data has approximately the same range of values as the test data, so extrapolation does not seem to be the issue.
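
The range check itself is just a per-feature comparison of minima and maxima, something like this sketch (placeholder arrays; the same comparison applies to any pair of datasets):

```python
import numpy as np

def range_overlap_report(X_ref, X_other, names):
    """Flag features whose values in X_other fall outside the range seen in X_ref."""
    for j, name in enumerate(names):
        lo, hi = X_ref[:, j].min(), X_ref[:, j].max()
        o_lo, o_hi = X_other[:, j].min(), X_other[:, j].max()
        outside = o_lo < lo or o_hi > hi
        print(f"{name}: ref [{lo:.2f}, {hi:.2f}] vs other [{o_lo:.2f}, {o_hi:.2f}]"
              + ("  <-- extends beyond reference range" if outside else ""))

rng = np.random.default_rng(2)
X_ref = rng.normal(size=(8000, 3))
X_other = rng.normal(size=(2000, 3))
range_overlap_report(X_ref, X_other, ["x1", "x2", "x3"])
```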

My question is: if linear regression does that well on MAPE, RMSE and R² on external datasets, can you ignore that it does not meet the assumptions? Does this suggest it could do even better if it did meet them? And how can it perform as well as XGBoost, which should handle non-linear data better, despite not meeting the linearity assumption?

PS: XGBoost's hyperparameters were tuned before the comparison.

Best Answer

The idea that you'd need to "make the data meet" certain model assumptions is wrong, as model assumptions are never perfectly fulfilled anyway. In particular, formal model assumptions almost always require that the data are not pre-processed in a data-dependent manner, which rules out transformations chosen to make the data look "more linear" or "more normal" and the like. So if the data don't satisfy the model assumptions before such pre-processing (which they quite generally don't), no such manipulation can make them satisfy the assumptions.

Model assumptions mean that a statistical method has certain good properties if the model is true. This does not mean that the method cannot perform well when the assumptions are not met. Nothing in principle stops a method such as linear regression from predicting well even when its model assumptions are violated.

The role of any data manipulation that brings the data closer to how they supposedly should look (here: linear with normal residuals) can never be to ultimately meet the assumptions. It can, however, in many situations improve the fit of the model to the data, and thereby the prediction accuracy. But if you don't find a way to improve your fit in this way, so be it; there is no guarantee. It is well known that normality and linearity tests are not a reliable indicator of whether such an improvement exists. In particular, with many data points such tests can reject the assumptions in situations where the violations are mild and no improvement may be available.
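
A small simulation makes the sample-size point concrete (a sketch assuming NumPy, SciPy and scikit-learn; the t-distributed errors are just one example of a mild violation):

```python
import numpy as np
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n, p = 8000, 10

# Mildly non-normal errors: a t-distribution with 8 degrees of freedom
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.standard_t(df=8, size=n)

X_train, X_test = X[:6000], X[6000:]
y_train, y_test = y[:6000], y[6000:]

model = LinearRegression().fit(X_train, y_train)
residuals = y_train - model.predict(X_train)

# With thousands of observations, the normality test flags this mild violation...
print("normality test p-value:", stats.normaltest(residuals).pvalue)

# ...while out-of-sample prediction is essentially unaffected
print("holdout R2:", round(r2_score(y_test, model.predict(X_test)), 3))
```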

Another aspect is that inference, such as tests and confidence intervals, is based on the model assumptions. There are situations in which model assumptions are violated, prediction quality is fine, but tests and confidence intervals are biased. Unfortunately, this cannot formally be repaired by data-dependent pre-processing (because, to be valid, inference would need to take such pre-processing into account, which standard inference doesn't). Still, applying such pre-processing may reduce bias when the model violations are severe and the pre-processing improves the fit substantially. There is some research on this, but unfortunately the message is far from clear: it may help, it may also do harm, and it is hard to diagnose which situation you are in.

The bottom line is that it is by no means mandatory to do something to bring the data closer to the model assumptions. Chances are it is worthwhile if a simple transformation gives you a striking, visible improvement, whereas trying hard to make significant p-values of normality or linearity tests go away may well be worse than useless.
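
To illustrate the inference point, here is a hedged sketch (assuming NumPy and statsmodels) using heteroskedastic errors as one concrete violation, which the question doesn't involve directly but which is easy to simulate: the classical 95% confidence interval for the slope falls short of its nominal coverage, even though the fitted coefficients themselves remain unbiased.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, n_sims, true_slope = 500, 2000, 1.0
covered = 0

for _ in range(n_sims):
    x = rng.uniform(0, 3, size=n)
    # Heteroskedastic noise: spread grows with x, violating the constant-variance assumption
    y = true_slope * x + rng.normal(scale=0.5 + x, size=n)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    lo, hi = fit.conf_int()[1]          # classical 95% CI for the slope
    covered += lo <= true_slope <= hi

print(f"empirical coverage of the nominal 95% CI: {covered / n_sims:.3f}")
```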

As long as you are only interested in prediction performance, though, the inference aspect need not worry you much. It may be enough to keep in mind that there is model uncertainty on top of the uncertainty expressed by model-based inference. Model uncertainty may not be a big deal when model-based inference asymptotically applies to a more general class of models (via the central limit theorem and the like, which covers linear regression in many cases) and the sample is reasonably large, although certain problems (outliers, strong nonlinearity) can make it hit harder.