Solved – Is the mean squared error used to assess the relative superiority of one estimator over another?


Suppose we have two estimators $\alpha_1$ and $\alpha_2$ for some parameter $x$. To determine which estimator is "better", do we look at the MSE (mean squared error)? In other words, do we look at $$\mathrm{MSE} = \beta^2 + \sigma^2,$$ where $\beta$ is the bias of the estimator and $\sigma^2$ is its variance? Is whichever estimator has the greater MSE the worse one?
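For intuition, here is a minimal Monte Carlo sketch (not part of the original question; the sample-mean estimator and all numbers are assumed purely for illustration) checking numerically that the MSE of an estimator equals its squared bias plus its variance:

```python
# Sketch only: verify MSE = bias^2 + variance for a toy estimator
# (the sample mean of n iid N(true_mean, 1) observations).
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0          # parameter being estimated (chosen arbitrarily)
n, reps = 20, 100_000    # sample size and number of simulated datasets

# One estimate per simulated dataset.
estimates = rng.normal(true_mean, 1.0, size=(reps, n)).mean(axis=1)

mse      = np.mean((estimates - true_mean) ** 2)
bias_sq  = (np.mean(estimates) - true_mean) ** 2
variance = np.var(estimates)

# The two printed numbers agree: the decomposition is an identity
# for the empirical distribution of the estimates.
print(mse, bias_sq + variance)
```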

Best Answer

If you have two competing estimators $\hat \theta_1$ and $\hat \theta_2$, whether or not $$ {\rm MSE}(\hat \theta_1) < {\rm MSE}(\hat \theta_2) $$ tells you that $\hat \theta_1$ is the better estimator depends entirely on your definition of "better". For example, if you are comparing unbiased estimators and by "better" you mean "has lower variance", then, yes, this would imply that $\hat \theta_1$ is better. $\rm MSE$ is a popular criterion because of its connection with least squares and the Gaussian log-likelihood but, like many statistical criteria, it should not be used blindly as a measure of estimator quality without paying attention to the application.

There are certain situations where choosing an estimator to minimize ${\rm MSE}$ may not be a particularly sensible thing to do. Two scenarios come to mind:

  • If there are very large outliers in a data set, they can affect the MSE drastically, so the estimator that minimizes the MSE can be unduly influenced by such outliers. In such situations, the fact that an estimator minimizes the MSE doesn't really tell you much, since removing the outlier(s) can give a wildly different estimate. In that sense, the MSE is not "robust" to outliers. In the context of regression, this fact is what motivated the Huber M-estimator (which I discuss in this answer), which minimizes a different criterion function (a mixture between squared error and absolute error) when there are long-tailed errors; see the sketch after this list.

  • If you are estimating a bounded parameter, comparing $\rm MSE$s may not be appropriate, since MSE penalizes overestimation and underestimation differently in that case. For example, suppose you're estimating a variance, $\sigma^2$. If you consciously underestimate the quantity, your $\rm MSE$ can be at most $\sigma^4$ (since the estimate is bounded below by $0$), while overestimation can produce an $\rm MSE$ that far exceeds $\sigma^4$, perhaps by an unbounded amount.
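The sketch below (mine, with made-up numbers) illustrates the first bullet: the value that minimizes squared error (the sample mean) is dragged far away by a single gross outlier, while the value that minimizes absolute error (the sample median) barely moves.

```python
# Sketch only: sensitivity of the squared-error-minimizing summary (mean)
# versus a robust one (median) to a single outlier.
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(0.0, 1.0, size=50)     # well-behaved data
with_outlier = np.append(clean, 100.0)    # add one gross outlier

print(np.mean(clean), np.mean(with_outlier))      # mean shifts by roughly 2
print(np.median(clean), np.median(with_outlier))  # median is nearly unchanged
```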

To make these drawbacks clearer, I'll give a concrete example of a situation in which, because of these issues, the $\rm MSE$ may not be an appropriate measure of estimator quality.

Suppose you have a sample $X_1, ..., X_n$ from a $t$ distribution with $\nu>2$ degrees of freedom and you are trying to estimate its variance, which is $\nu/(\nu-2)$. Consider two competing estimators: $$\hat \theta_{1}: {\rm the \ unbiased \ sample \ variance} $$and $$\hat \theta_{2} = 0,{\rm \ regardless \ of \ the \ data.}$$ Clearly ${\rm MSE}(\hat \theta_{2}) = \frac{\nu^2}{(\nu-2)^2}$ and it is a fact that $$ {\rm MSE}(\hat \theta_{1}) = \begin{cases} \infty &\mbox{if } \nu \leq 4 \\ \frac{\nu^2}{(\nu-2)^2} \left( \frac{2}{n-1}+\frac{6}{n(\nu-4)} \right) & \mbox{if } \nu>4, \end{cases} $$ which can be derived using the fact discussed in this thread and the properties of the $t$ distribution. Thus the naive estimator outperforms in terms of $\rm MSE$ regardless of the sample size whenever $\nu \leq 4$, which is rather disconcerting. It also outperforms when $\left( \frac{2}{n-1}+\frac{6}{n(\nu-4)} \right) > 1$, but this is only relevant for very small sample sizes. This happens because of the long-tailed nature of the $t$ distribution with small degrees of freedom, which makes $\hat \theta_{1}$ prone to very large values; the $\rm MSE$ penalizes this overestimation heavily, while $\hat \theta_{2}$ does not have this problem.
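A quick simulation sketch of this example (my own; the choices $\nu = 3$, $n = 10$ and the number of replications are assumed for illustration) shows the effect. The empirical MSE of $\hat \theta_1$ is unstable and typically far above $9$, consistent with its theoretically infinite MSE, while $\hat \theta_2 = 0$ has MSE exactly $(\nu/(\nu-2))^2 = 9$.

```python
# Sketch only: compare the unbiased sample variance with the silly
# estimator theta_hat_2 = 0 for heavy-tailed t data (nu = 3).
import numpy as np

rng = np.random.default_rng(2)
nu, n, reps = 3, 10, 50_000
true_var = nu / (nu - 2)                  # = 3 for nu = 3

samples = rng.standard_t(nu, size=(reps, n))
theta1 = samples.var(axis=1, ddof=1)      # unbiased sample variance per dataset
mse1 = np.mean((theta1 - true_var) ** 2)  # empirical MSE; varies wildly across runs
mse2 = (0.0 - true_var) ** 2              # MSE of theta_hat_2 = 0, exactly 9

print(mse1, mse2)
```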

The bottom line here is that $\rm MSE$ is not an appropriate measure of estimator performance in this scenario. This is clear because the estimator that dominates in terms of $\rm MSE$ is a ridiculous one (particularly since there is no chance that it is correct whenever there is any variability in the observed data). Perhaps a more appropriate approach (as pointed out by Casella and Berger) would be to choose the variance estimator $\hat \theta$ that minimizes Stein's loss:

$$ S(\hat \theta) = \frac{ \hat \theta}{\nu/(\nu-2)} - 1 - \log \left( \frac{ \hat \theta}{\nu/(\nu-2)} \right) $$

which, unlike squared error, penalizes gross underestimation just as heavily as gross overestimation. It also brings us back to sanity, since $S(\hat \theta_2)=\infty$ :)
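A small sketch of Stein's loss (mine; the evaluation points are made up, and the true variance is taken to be $\nu/(\nu-2) = 3$ as in the example above):

```python
# Sketch only: Stein's loss for a variance estimate.
import numpy as np

def stein_loss(theta_hat, true_var):
    """Stein's loss: theta_hat/true_var - 1 - log(theta_hat/true_var)."""
    r = theta_hat / true_var
    return r - 1.0 - np.log(r)

true_var = 3.0                     # nu/(nu-2) with nu = 3

print(stein_loss(3.5, true_var))   # small penalty for a mild overestimate
print(stein_loss(30.0, true_var))  # overestimation is penalized only roughly linearly
print(stein_loss(1e-12, true_var)) # loss blows up as theta_hat -> 0
print(stein_loss(0.0, true_var))   # inf (NumPy warns about log(0)), ruling out theta_hat_2 = 0
```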
