Error Minimization – Is Minimizing Squared Error Equivalent to Minimizing Absolute Error?

When we conduct linear regression $y = ax + b$ to fit a bunch of data points $(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)$, the classic approach minimizes the squared error. I have long been puzzled by a question: will minimizing the squared error yield the same result as minimizing the absolute error? If not, why is minimizing the squared error better? Is there any reason other than "the objective function is differentiable"?

Squared error is also widely used to evaluate model performance, while absolute error is less popular. Why is squared error more commonly used than absolute error? If taking derivatives is not involved, calculating the absolute error is as easy as calculating the squared error, so why is squared error so prevalent? Is there any unique advantage that explains its prevalence?

Thank you.

Best Answer

Minimizing the squared error (MSE) is definitely not the same as minimizing the absolute deviation (MAD) of the errors. MSE yields the mean response of $y$ conditional on $x$, while MAD yields the median response of $y$ conditional on $x$.
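As a minimal numerical check (assuming NumPy; the sample, including the deliberate outlier 100.0, is invented for illustration), fit a single constant $c$ under each objective and watch them pick different summaries:

```python
import numpy as np

# Toy sample; 100.0 is a deliberate outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Grid search over candidate constants c.
c = np.linspace(0, 110, 100001)
sq = ((y[:, None] - c[None, :]) ** 2).sum(axis=0)  # squared-error objective
ab = np.abs(y[:, None] - c[None, :]).sum(axis=0)   # absolute-error objective

print(c[sq.argmin()], y.mean())       # 22.0: squared error picks the mean
print(c[ab.argmin()], np.median(y))   # ~3.0: absolute error picks the median
```

The outlier drags the squared-error fit all the way to 22, while the absolute-error fit stays at the median 3, which is exactly the mean-vs-median distinction above.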

Historically, Laplace originally considered the maximum observed error as a measure of the correctness of a model. He soon moved to considering MAD instead. Because he was unable to solve either problem exactly, he turned to the differentiable MSE. He and Gauss (seemingly concurrently) derived the normal equations, a closed-form solution for this problem. Nowadays, minimizing MAD is relatively easy by means of linear programming. Linear programming, however, has no closed-form solution.
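To make both routes concrete, here is a sketch (assuming NumPy and SciPy; the data and constants are made up for illustration). The normal equations $\hat\beta = (X^\top X)^{-1} X^\top y$ solve the squared-error problem in closed form, while the absolute-error problem becomes a linear program by introducing one slack variable $t_i \ge |y_i - a x_i - b|$ per observation:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)

# --- MSE: closed-form normal equations ---
X = np.column_stack([x, np.ones_like(x)])          # design matrix, rows (x_i, 1)
a_ls, b_ls = np.linalg.solve(X.T @ X, X.T @ y)

# --- MAD: linear program over variables (a, b, t_1, ..., t_n) ---
# minimize sum(t) subject to -t_i <= y_i - (a x_i + b) <= t_i
n = len(x)
c = np.concatenate([[0.0, 0.0], np.ones(n)])       # objective: sum of t_i
A_ub = np.block([[ X, -np.eye(n)],                 #  a x_i + b - t_i <= y_i
                 [-X, -np.eye(n)]])                # -a x_i - b - t_i <= -y_i
b_ub = np.concatenate([y, -y])
bounds = [(None, None), (None, None)] + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
a_lad, b_lad = res.x[:2]

print(f"least squares:             a={a_ls:.3f}, b={b_ls:.3f}")
print(f"least absolute deviations: a={a_lad:.3f}, b={b_lad:.3f}")
```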

From an optimization perspective, both correspond to convex functions. However, MSE is differentiable everywhere, allowing gradient-based methods that are much more efficient than their non-differentiable counterparts; the absolute value $|r|$ underlying MAD is not differentiable at $r = 0$.
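A sketch of the practical difference (illustrative step sizes and iteration counts; both functions can be called on the `x, y` from the snippet above): the squared error admits an exact gradient step, while the absolute error only admits a subgradient, which typically needs a diminishing step size to converge.

```python
import numpy as np

def fit_mse_gd(x, y, lr=5e-3, steps=5000):
    """Gradient descent on the squared-error objective (differentiable everywhere)."""
    a = b = 0.0
    for _ in range(steps):
        r = y - (a * x + b)             # residuals
        a += lr * 2 * np.mean(x * r)    # exact gradient of the mean squared error
        b += lr * 2 * np.mean(r)
    return a, b

def fit_mad_subgrad(x, y, lr=0.05, steps=20000):
    """Subgradient descent on the absolute-error objective: |r| has no derivative
    at r = 0, so sign(r) is used as a subgradient, with a diminishing step size."""
    a = b = 0.0
    for t in range(1, steps + 1):
        r = y - (a * x + b)
        step = lr / np.sqrt(t)
        a += step * np.mean(x * np.sign(r))
        b += step * np.mean(np.sign(r))
    return a, b
```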

A further theoretical reason is that, in a Bayesian setting with uniform priors on the model parameters, the least-squares fit is the posterior mode under normally distributed errors, which has been taken as a proof of correctness of the method. Theorists like the normal distribution because they believe it is an empirical fact, while experimentalists like it because they believe it is a theoretical result.
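The likelihood step behind this is standard (a sketch; $\sigma$ and $s$ denote the assumed noise scales):

$$-\log p(y \mid a, b) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a x_i - b)^2 + \text{const} \quad \text{(Gaussian errors)},$$

$$-\log p(y \mid a, b) = \frac{1}{s} \sum_{i=1}^{n} \lvert y_i - a x_i - b \rvert + \text{const} \quad \text{(Laplace errors)},$$

so with a flat prior, maximizing the posterior means minimizing the squared error under Gaussian noise, and minimizing the absolute error under Laplace noise.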

A final reason why MSE may have gained the wide acceptance it enjoys is that it is based on the Euclidean distance (in fact, the least-squares fit solves a projection problem in a Euclidean Hilbert space), which is extremely intuitive given our geometric reality.
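In matrix form (writing $X$ for the design matrix with rows $(x_i, 1)$, as above), the least-squares fit $\hat\beta$ is characterized by the residual being orthogonal to the column space of $X$:

$$X^\top (y - X\hat\beta) = 0 \quad\Longleftrightarrow\quad \hat\beta = (X^\top X)^{-1} X^\top y,$$

which are exactly the normal equations mentioned earlier: $X\hat\beta$ is the orthogonal projection of $y$ onto the span of the columns of $X$.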
