Is the solution to Mean Squared Error also the one to Mean Absolute Error?

linear-programming · optimization · statistics

When I was dealing with an optimization problem and deciding whether to minimize MSE or MAE, I had the following question: Is the solution to Mean Squared Error also the one to Mean Absolute Error? If yes, why would we choose one model over the other?
Related Solutions
Let $T$ be an estimator of $\theta$ and denote $E(T) = \mu.$ Then $$\begin{align}\text{MSE} = E[(T-\theta)^2] &= E[((T - \mu) + (\mu -\theta))^2]\\ &= E[(T-\mu)^2] + 2(\mu - \theta)E(T - \mu) + (\mu - \theta)^2\\ &= Var(T) + 0 + b_T^2(\theta),\end{align}$$ where $b_T(\theta) = E(T) - \theta$ is the bias of $T$.
Here is a simulation in R statistical software for $W = \max(X_1, \dots, X_n)$ as an estimator of $\theta,$ where $X_i \stackrel{indep}{\sim} \text{Unif}(0,\theta),\ n=5,$ and $\theta=10.$ With a million iterations, quantities in the original units should be accurate to about three decimal places, and quantities in squared units to about two.
m = 10^6; n = 5; th = 10            # iterations, sample size, true theta
x = runif(m*n, 0, th)               # m*n draws from Unif(0, theta)
DTA = matrix(x, nrow=m)             # each row is one sample of size n
w = apply(DTA, 1, max)              # sample maximum of each row
mean(w); var(w); mean((w-th)^2)     # simulated E(W), Var(W), and MSE
## 8.3332 # aprx E(W) = 8.333
## 1.986969 # aprx Var(W) = 1.984
## 4.765187 # aprx MSE = 4.762
mu = n*th/(n+1); mu
## 8.333333 # exact E(W)
var.w = n*th^2/((n+2)*(n+1)^2); var.w
## 1.984127 # exact Var(W)
bias.sq = (mu - th)^2; bias.sq
## 2.777778 # exact squared bias
var.w + bias.sq
## 4.761905 # exact MSE
Except for your incorrect sign (mentioned in the comments), your computations are consistent with the simulation results.
Note: An unbiased estimator of $\theta$ is $2\bar X$, but it has a much larger variance, $4\theta^2/(12n) = \theta^2/(3n)$, and hence a much larger MSE than the maximum.
doub.avg = 2*rowMeans(DTA)          # twice the sample mean of each row
mean(doub.avg)
## 10.00068 # unbiased
var(doub.avg)
## 6.687577 # relatively large var = MSE
The histograms below compare the properties of these two estimators.
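The figure is not reproduced in this extract; the following sketch regenerates comparable histograms from the simulation objects w and doub.avg defined above (the plotting choices are my own, not necessarily the original's).
par(mfrow = c(1, 2))                 # two panels side by side
hist(w, prob=TRUE, col="skyblue2", main="Maximum", xlab="w")
abline(v = th, col="red", lwd=2)     # true theta = 10
hist(doub.avg, prob=TRUE, col="skyblue2", main="2 x Sample Mean", xlab="doub.avg")
abline(v = th, col="red", lwd=2)     # true theta = 10
par(mfrow = c(1, 1))                 # restore single-panel layout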
Note that if $\hat \theta(X)$ is an estimator (depending on random data $X$) for the parameter $\theta\in \mathbb{R}^n,$ the MSE is a scalar quantity defined as
$$\begin{align}MSE(\hat\theta,\theta)&\equiv E[\|\hat\theta(X)-\theta\|^2]\\ &=E[(\hat\theta(X)-\theta)'(\hat\theta(X)-\theta)].\end{align}$$
With some matrix algebra, one can easily prove the identity
$$\begin{align}MSE(\hat\theta,\theta)&=\|Bias(\hat\theta,\theta)\|^2+tr(Var(\hat\theta(X))),\\ Bias(\hat\theta,\theta)&\equiv E[\hat\theta(X)]-\theta. \end{align}$$
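For completeness, the algebra is the vector analogue of the scalar decomposition above: write $\hat\theta(X)-\theta = (\hat\theta(X)-E[\hat\theta(X)]) + (E[\hat\theta(X)]-\theta)$, expand, and note that the cross term vanishes:
$$\begin{align} E\|\hat\theta(X)-\theta\|^2 &= E\|\hat\theta(X)-E[\hat\theta(X)]\|^2 + 2\,Bias(\hat\theta,\theta)'E[\hat\theta(X)-E[\hat\theta(X)]] + \|Bias(\hat\theta,\theta)\|^2\\ &= tr(Var(\hat\theta(X))) + 0 + \|Bias(\hat\theta,\theta)\|^2, \end{align}$$
using $E\|\hat\theta(X)-E[\hat\theta(X)]\|^2 = E[tr((\hat\theta(X)-E[\hat\theta(X)])(\hat\theta(X)-E[\hat\theta(X)])')] = tr(Var(\hat\theta(X)))$.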
So rather than looking at a vector of componentwise MSEs, we typically use the scalar quantity above as the generalization of MSE.
However, the MSE is only one metric to judge an estimator by. One may also be interested in looking at the variance-covariance matrix $Var(\hat\theta(X))$, in which case your question still stands, namely how do we decide which of $V_1\equiv Var(\hat\theta_1(X))$, $V_2\equiv Var(\hat\theta_2(X))$ is "greater" given two estimators $\hat\theta_1(X),\hat\theta_2(X)$?
A common partial order used in this respect, defined on the set of symmetric positive semidefinite matrices, is the Loewner order: $$V_1\geq V_2\iff V_1-V_2 \text{ is positive semidefinite (p.s.d.)}.$$
Being a partial order, this relation cannot be used to compare any two variance-covariance matrices summoned from the ether, but it is still meaningful. For instance, because p.s.d. matrices have nonnegative diagonal entries, one immediate implication of $V_1\geq V_2$ is that the variance of each component of $\hat\theta_1(X)$ is at least as great as the variance of the corresponding component of $\hat\theta_2(X).$
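To make the definition concrete, here is a small numerical check in R; loewner_geq is a hypothetical helper written for this illustration, not a function from any package.
# Check whether V1 >= V2 in the Loewner order, i.e., whether
# V1 - V2 is positive semidefinite (all eigenvalues >= 0, up to tolerance).
loewner_geq = function(V1, V2, tol=1e-10) {
  all(eigen(V1 - V2, symmetric=TRUE, only.values=TRUE)$values >= -tol)
}
V1 = matrix(c(2, 1, 1, 2), 2, 2)    # covariance matrix of estimator 1
V2 = diag(2)                        # covariance matrix of estimator 2
loewner_geq(V1, V2)                 # TRUE: V1 - V2 is p.s.d.
loewner_geq(V2, V1)                 # FALSE: the order is only partial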
Best Answer
Consider $x_1,\ldots,x_n\in \mathbb R$.
The solution of the MSE problem is $\text{argmin}_x \sum_{k=1}^n (x_k-x)^2 = \{\frac 1n \sum_{k=1}^n x_k\}$, which you might know as the sample mean.
The solution of the MAE problem is $\text{argmin}_x \sum_{k=1}^n |x_k-x| = \{\text{medians of }(x_1,\ldots,x_n) \}$: moving $x$ toward the side that contains more of the points strictly decreases the sum of absolute deviations, so the minimizers are exactly the medians.
In general, the sample mean is not a median of $(x_1,\ldots,x_n)$, so the two problems have different solutions.
Applying the mean–median inequality $|\text{mean} - \text{median}| \leq \text{standard deviation}$ to the distribution that puts mass $\frac 1n$ on each sample point, the difference between the two solutions is at most $\displaystyle \sqrt{\frac 1n\sum_{k=1}^n (x_k-\overline x)^2}$, where $\overline x = \frac 1n \sum_{k=1}^n x_k$.
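As a quick numerical sanity check, here is a short sketch using base R's optimize (the sample and tolerance choices are arbitrary, for illustration only):
# Verify numerically: the MSE minimizer is the sample mean, the MAE
# minimizer is a median, and their gap respects the sd bound above.
set.seed(1)
x = rexp(11)                          # a skewed sample, so mean != median
mse = function(a) mean((x - a)^2)
mae = function(a) mean(abs(x - a))
optimize(mse, range(x))$minimum; mean(x)      # both near the sample mean
optimize(mae, range(x))$minimum; median(x)    # both near the sample median
abs(mean(x) - median(x)) <= sqrt(mean((x - mean(x))^2))   # TRUE: bound holds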