Linear Regression Advantages – Comparing Linear Regression and Quantile Regression

multiple regression, quantile regression, regression

The linear regression model makes a number of assumptions that quantile regression does not. If the assumptions of linear regression are met, then my intuition (and some very limited experience) is that median regression would give nearly identical results to linear regression.

So, what advantages does linear regression have? It's certainly more familiar, but other than that?

Best Answer

It is very often stated that minimizing squared residuals is preferred over minimizing absolute residuals because it is computationally simpler. But it may also be better for other reasons. Namely, if the assumptions hold (and this is not so uncommon), then least squares provides a solution that is (on average) more accurate.

Maximum likelihood

Least squares regression and median regression (quantile regression performed by minimizing the absolute residuals) can be seen as maximizing the likelihood function for Gaussian and Laplace distributed errors, respectively, and are in this sense very much related.

  • Gaussian distribution:

    $$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

    with the log-likelihood being maximized when minimizing the sum of squared residuals

    $$\log \mathcal{L}(x) = -\frac{n}{2} \log (2 \pi) - n \log(\sigma) - \frac{1}{2\sigma^2} \underbrace{\sum_{i=1}^n (x_i-\mu)^2}_{\text{sum of squared residuals}} $$

  • Laplace distribution:

    $$f(x) = \frac{1}{2b} e^{-\frac{\vert x-\mu \vert}{b}}$$

    with the log-likelihood being maximized when minimizing the sum of absolute residuals

    $$\log \mathcal{L}(x) = -n \log (2) - n \log(b) - \frac{1}{b} \underbrace{\sum_{i=1}^n |x_i-\mu|}_{\text{sum of absolute residuals}} $$

Note: the Laplace distribution and the sum of absolute residuals relate to the median, but this can be generalized to other quantiles by giving different weights to negative and positive residuals.
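As a minimal sketch of that weighting (the function names `pinball_loss` and `best_constant` are made up for this illustration, and the data are simulated), the asymmetric absolute loss below puts weight $\tau$ on positive residuals and $1-\tau$ on negative ones. Minimizing it over a single constant recovers a sample quantile, and $\tau = 0.5$ gives the median:

```python
import numpy as np

def pinball_loss(residuals, tau):
    """Absolute residuals weighted by tau (positive) and 1 - tau (negative)."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.where(residuals >= 0, tau * residuals, (tau - 1) * residuals))

def best_constant(x, tau):
    """Return the data point that minimizes the pinball loss.

    The loss is piecewise linear in the constant, with kinks at the data
    points, so searching over the data points is enough to find a minimizer.
    """
    x = np.asarray(x, dtype=float)
    losses = [pinball_loss(x - c, tau) for c in x]
    return x[int(np.argmin(losses))]

rng = np.random.default_rng(0)
x = rng.normal(size=201)          # simulated data, odd sample size

median_fit = best_constant(x, 0.5)
q90_fit = best_constant(x, 0.9)

print("tau = 0.5 minimizer:", median_fit, " sample median:", np.median(x))
print("tau = 0.9 minimizer:", q90_fit, " share of data <= it:", np.mean(x <= q90_fit))
```

Quantile regression applies the same loss to the residuals of a linear predictor instead of a single constant.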

Known error distribution

When we know the error distribution (when the assumptions are likely true), it makes sense to choose the associated likelihood function. Maximizing that likelihood gives the more efficient estimate.

Very often the errors are (approximately) normally distributed. In that case least squares is the best way to find the parameter $\mu$ (which is both the mean and the median of the distribution). It is the best way because it has the lowest sampling variance of all unbiased estimators. Or you can say more strongly that it is stochastically dominant (see the illustration in this question comparing the distribution of the sample median and the sample mean).

So, when the errors are normally distributed, the sample mean is a better estimator of the distribution median than the sample median is. Likewise, least squares regression is a more efficient estimator of the quantiles (under normal errors every quantile is the mean plus a fixed multiple of $\sigma$). It is better than using the least sum of absolute residuals.
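A small simulation can make this concrete. The sketch below is only an illustration under assumed settings (sample size 51, 20000 replications, unit scale, and the helper name `variance_ratio` are my own choices). It compares the sampling variance of the median with that of the mean, once for normal errors and once for Laplace errors:

```python
import numpy as np

def variance_ratio(draw, n_samples=20000, n_obs=51):
    """Variance of the sample median divided by variance of the sample mean."""
    samples = draw(size=(n_samples, n_obs))
    medians = np.median(samples, axis=1)
    means = samples.mean(axis=1)
    return medians.var() / means.var()

rng = np.random.default_rng(1)

# Ratio above 1: the mean is the more accurate estimator of the center.
ratio_normal = variance_ratio(lambda size: rng.normal(0.0, 1.0, size=size))

# Ratio below 1: the median is the more accurate estimator of the center.
ratio_laplace = variance_ratio(lambda size: rng.laplace(0.0, 1.0, size=size))

print("normal errors,  var(median) / var(mean):", ratio_normal)
print("Laplace errors, var(median) / var(mean):", ratio_laplace)
```

For large samples the ratio approaches $\pi/2$ under normal errors and $1/2$ under Laplace errors, which matches the idea that each estimator is the efficient one for its own error distribution.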

Because so many problems deal with normally distributed errors, the least squares method is very popular. To work with other types of distributions one can use the generalized linear model. And the method of iteratively reweighted least squares, which can be used to solve GLMs, also works for the Laplace distribution (i.e. for absolute deviations), which is equivalent to finding the median (or, in the generalized version, other quantiles).
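Here is a rough sketch of how iteratively reweighted least squares can approximate the least absolute deviations fit. It is not a production solver: the function name `lad_irls`, the clipping constant `eps`, the iteration count, and the simulated data are all assumptions made for this example. Each step solves a weighted least squares problem with weights $1/|r_i|$, so that the weighted squared residuals mimic absolute residuals:

```python
import numpy as np

def lad_irls(X, y, n_iter=200, eps=1e-8, tol=1e-10):
    """Approximate least absolute deviations regression via reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start from the OLS fit
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), eps)         # clip to avoid division by zero
        Xw = X * w[:, None]                          # rows of X scaled by weights
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.laplace(0.0, 1.0, size=n)   # Laplace errors
y[:10] += 50.0                                       # a few gross outliers

X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_lad = lad_irls(X, y)

def sum_abs(beta):
    """Sum of absolute residuals for a coefficient vector."""
    return np.sum(np.abs(y - X @ beta))

print("OLS coefficients:", beta_ols, " sum |r|:", sum_abs(beta_ols))
print("LAD coefficients:", beta_lad, " sum |r|:", sum_abs(beta_lad))
```

Dedicated quantile regression routines usually rely on linear programming or interior point methods instead; the reweighting loop is shown only because it ties the absolute-deviation fit back to least squares.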

Unknown error distribution

Robustness

The median and other quantiles have the advantage that they are very robust to the type of distribution. The actual values do not matter much; the quantiles only care about the order. So whatever the distribution is, minimizing the absolute residuals (which is equivalent to finding the quantiles) works very well.

The question becomes complex and broad here, and the answer depends on what knowledge we have or do not have about the distribution function. For instance, a distribution may be approximately normal but with some additional outliers. This can be dealt with by removing the outer values. Removing the extreme values even works for estimating the location parameter of the Cauchy distribution, where the truncated mean can be a better estimator than the median. So not only in the ideal situation when the assumptions hold, but also in some less ideal applications (e.g. additional outliers), there might be good robust methods that still use some form of a sum of squared residuals instead of a sum of absolute residuals.
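The Cauchy case can be checked with a quick simulation. The sketch below is only indicative: the sample size, the number of replications, and the choice to trim 38% from each tail (keeping roughly the central quarter of each sample) are assumptions for this example, and the gap between the two robust estimators is small enough that individual runs may vary:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_obs = 20000, 101

# Each row is one sorted sample from a standard Cauchy distribution (location 0).
samples = np.sort(rng.standard_cauchy(size=(n_samples, n_obs)), axis=1)

k = int(0.38 * n_obs)                       # observations dropped from each tail
trimmed_means = samples[:, k:n_obs - k].mean(axis=1)
medians = samples[:, n_obs // 2]            # middle order statistic (n_obs is odd)
plain_means = samples.mean(axis=1)

print("variance of truncated mean:", trimmed_means.var())
print("variance of median:        ", medians.var())
print("variance of plain mean:    ", plain_means.var())
```

The plain mean is useless here because of the heavy tails, while the truncated mean and the median both concentrate around the true location, with the truncated mean typically slightly tighter.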

I imagine that regression with truncated residuals might be computationally much more complex. So it may actually be quantile regression that is chosen because it is computationally simpler (not simpler than ordinary least squares, but simpler than truncated least squares).

Biased/unbiased

Another issue is biased versus unbiased estimators. Above I described the maximum likelihood estimate for the mean, i.e. the least squares solution, as a good or preferable estimator because it has the lowest variance of all unbiased estimators (when the errors are normally distributed). But biased estimators may be better (lower expected sum of squared errors).
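One classic illustration, which I am bringing in here as an example, is the James-Stein shrinkage estimator. The sketch below assumes a made-up setting: a 10-dimensional mean vector with every component equal to 0.5, one observation per component, and unit-variance normal noise. It compares the total squared error of the unbiased estimate (the observation itself) with the shrunken, biased estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n_rep = 10, 20000
theta = np.full(p, 0.5)                       # true means (assumed for the demo)

# Each row is one observation vector x ~ N(theta, I).
x = theta + rng.normal(size=(n_rep, p))

# James-Stein: shrink every observation vector toward the origin.
shrink = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x

mse_unbiased = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))

print("expected squared error, unbiased estimate:", mse_unbiased)
print("expected squared error, James-Stein:      ", mse_js)
```

The shrunken estimate is biased toward zero, yet its total squared error is lower, which is the sense in which a biased estimator can beat the best unbiased one.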

This again makes the question broad and complex. There are many different estimators and many different situations in which to apply them. An adapted sum of squared residuals loss function often works well to reduce the error (e.g. all kinds of regularization methods), but it need not work well in every case. Intuitively, since the sum of squared residuals loss function often works best among unbiased estimators, it is not strange to imagine that the optimal biased estimator is probably based on something close to a sum of squared residuals loss function.
