Error Minimization – Is Minimizing Squared Error Equivalent to Minimizing Absolute Error?

When we conduct linear regression $y = ax + b$ to fit a bunch of data points $(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)$, the classic approach minimizes the squared error. I have long been puzzled by a question: will minimizing the squared error yield the same result as minimizing the absolute error? If not, why is minimizing the squared error better? Is there any reason other than "the objective function is differentiable"?

Squared error is also widely used to evaluate model performance, while absolute error is less popular. Why is squared error more commonly used than absolute error? If taking derivatives is not involved, calculating the absolute error is as easy as calculating the squared error, so why is squared error so prevalent? Is there any unique advantage that explains its prevalence?

Thank you.

Best Answer

Minimizing the squared error (MSE) is definitely not the same as minimizing the absolute deviation (MAD) of the errors. MSE yields the mean response of $y$ conditional on $x$, while MAD yields the median response of $y$ conditional on $x$.
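As a minimal numerical check (assuming NumPy; the sample, including the deliberate outlier 100.0, is invented for illustration), fit a single constant $c$ under each objective and watch them pick different summaries:

```python
import numpy as np

# Toy sample; 100.0 is a deliberate outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Grid search over candidate constants c.
c = np.linspace(0, 110, 100001)
sq = ((y[:, None] - c[None, :]) ** 2).sum(axis=0)  # squared-error objective
ab = np.abs(y[:, None] - c[None, :]).sum(axis=0)   # absolute-error objective

print(c[sq.argmin()], y.mean())       # 22.0: squared error picks the mean
print(c[ab.argmin()], np.median(y))   # ~3.0: absolute error picks the median
```

The outlier drags the squared-error fit all the way to 22, while the absolute-error fit stays at the median 3, which is exactly the mean-vs-median distinction above.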

Historically, Laplace originally considered the maximum observed error as a measure of the correctness of a model. He soon moved to considering MAD instead. Because he was unable to solve either problem exactly, he turned to the differentiable MSE. He and Gauss (seemingly concurrently) derived the normal equations, a closed-form solution for this problem. Nowadays, minimizing MAD is relatively easy by means of linear programming. Linear programming, however, has no closed-form solution.
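To make both routes concrete, here is a sketch (assuming NumPy and SciPy; the data and constants are made up for illustration). The normal equations $\hat\beta = (X^\top X)^{-1} X^\top y$ solve the squared-error problem in closed form, while the absolute-error problem becomes a linear program by introducing one slack variable $t_i \ge |y_i - a x_i - b|$ per observation:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)

# --- MSE: closed-form normal equations ---
X = np.column_stack([x, np.ones_like(x)])          # design matrix, rows (x_i, 1)
a_ls, b_ls = np.linalg.solve(X.T @ X, X.T @ y)

# --- MAD: linear program over variables (a, b, t_1, ..., t_n) ---
# minimize sum(t) subject to -t_i <= y_i - (a x_i + b) <= t_i
n = len(x)
c = np.concatenate([[0.0, 0.0], np.ones(n)])       # objective: sum of t_i
A_ub = np.block([[ X, -np.eye(n)],                 #  a x_i + b - t_i <= y_i
                 [-X, -np.eye(n)]])                # -a x_i - b - t_i <= -y_i
b_ub = np.concatenate([y, -y])
bounds = [(None, None), (None, None)] + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
a_lad, b_lad = res.x[:2]

print(f"least squares:             a={a_ls:.3f}, b={b_ls:.3f}")
print(f"least absolute deviations: a={a_lad:.3f}, b={b_lad:.3f}")
```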

From an optimization perspective, both correspond to convex functions. However, MSE is differentiable everywhere, allowing gradient-based methods that are much more efficient than their non-differentiable counterparts; the absolute value $|r|$ underlying MAD is not differentiable at $r = 0$.
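A sketch of the practical difference (illustrative step sizes and iteration counts; both functions can be called on the `x, y` from the snippet above): the squared error admits an exact gradient step, while the absolute error only admits a subgradient, which typically needs a diminishing step size to converge.

```python
import numpy as np

def fit_mse_gd(x, y, lr=5e-3, steps=5000):
    """Gradient descent on the squared-error objective (differentiable everywhere)."""
    a = b = 0.0
    for _ in range(steps):
        r = y - (a * x + b)             # residuals
        a += lr * 2 * np.mean(x * r)    # exact gradient of the mean squared error
        b += lr * 2 * np.mean(r)
    return a, b

def fit_mad_subgrad(x, y, lr=0.05, steps=20000):
    """Subgradient descent on the absolute-error objective: |r| has no derivative
    at r = 0, so sign(r) is used as a subgradient, with a diminishing step size."""
    a = b = 0.0
    for t in range(1, steps + 1):
        r = y - (a * x + b)
        step = lr / np.sqrt(t)
        a += step * np.mean(x * np.sign(r))
        b += step * np.mean(np.sign(r))
    return a, b
```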

A further theoretical reason is that, in a Bayesian setting with uniform priors on the model parameters, the least-squares fit is the posterior mode under normally distributed errors, which has been taken as a proof of correctness of the method. Theorists like the normal distribution because they believe it is an empirical fact, while experimentalists like it because they believe it is a theoretical result.
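The likelihood step behind this is standard (a sketch; $\sigma$ and $s$ denote the assumed noise scales):

$$-\log p(y \mid a, b) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a x_i - b)^2 + \text{const} \quad \text{(Gaussian errors)},$$

$$-\log p(y \mid a, b) = \frac{1}{s} \sum_{i=1}^{n} \lvert y_i - a x_i - b \rvert + \text{const} \quad \text{(Laplace errors)},$$

so with a flat prior, maximizing the posterior means minimizing the squared error under Gaussian noise, and minimizing the absolute error under Laplace noise.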

A final reason why MSE may have gained the wide acceptance it enjoys is that it is based on the Euclidean distance (in fact, the least-squares fit solves a projection problem in a Euclidean Hilbert space), which is extremely intuitive given our geometric reality.
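In matrix form (writing $X$ for the design matrix with rows $(x_i, 1)$, as above), the least-squares fit $\hat\beta$ is characterized by the residual being orthogonal to the column space of $X$:

$$X^\top (y - X\hat\beta) = 0 \quad\Longleftrightarrow\quad \hat\beta = (X^\top X)^{-1} X^\top y,$$

which are exactly the normal equations mentioned earlier: $X\hat\beta$ is the orthogonal projection of $y$ onto the span of the columns of $X$.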
