Estimation Techniques – When Method of Moments Beats Maximum Likelihood in Small Samples

Tags: efficiency, estimation, maximum likelihood, method of moments, MSE

Maximum likelihood estimators (MLEs) are asymptotically efficient; the practical upshot is that they often do better than method of moments (MoM) estimators (when the two differ), even at small sample sizes.

Here 'better' means typically having smaller variance when both are unbiased, and typically smaller mean squared error (MSE) more generally.

The question occurs, however:

Are there cases where the MoM can beat the MLE – on MSE, say – in small samples?

(where this isn't some odd/degenerate situation – i.e. given that the conditions for the MLE to exist and be asymptotically efficient hold)

A follow-up question would then be 'how big can small be?' – that is, if there are examples, are there some which still hold at relatively large sample sizes, perhaps even all finite sample sizes?

[I can find an example of a biased estimator that can beat ML in finite samples, but it isn't MoM.]


Note added retrospectively: my focus here is primarily on the univariate case (which is actually where my underlying curiosity is coming from). I don't want to rule out multivariate cases, but I also don't particularly want to stray into extended discussions of James-Stein estimation.

Best Answer

This may be considered... cheating, but the OLS estimator is a MoM estimator. Consider a standard linear regression specification with normal errors (and $K$ stochastic regressors, so magnitudes are conditional on the regressor matrix), and a sample of size $n$. Denote by $s^2$ the OLS estimator of the variance $\sigma^2$ of the error term. It is unbiased, so

$$ MSE(s^2) = \operatorname {Var}(s^2) = \frac {2\sigma^4}{n-K} $$
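(For concreteness, here is a minimal Monte Carlo sketch of this formula in Python with numpy – not part of the original answer; the values of $n$, $K$, $\sigma$ and the true coefficients are arbitrary illustrative choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, sigma, reps = 20, 3, 1.5, 50_000   # arbitrary illustrative values

X = rng.normal(size=(n, K))              # condition on one fixed draw of the regressors
beta = np.ones(K)                        # arbitrary true coefficients

s2_draws = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2_draws[r] = resid @ resid / (n - K)        # unbiased OLS estimator s^2

print("simulated   Var(s^2):        ", s2_draws.var())
print("theoretical 2*sigma^4/(n-K): ", 2 * sigma**4 / (n - K))
```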

Consider now the MLE of $\sigma^2$. It is

$$\hat \sigma^2_{ML} = \frac {n-K}{n}s^2$$ It is biased. Its MSE is

$$MSE (\hat \sigma^2_{ML}) = \operatorname {Var}(\hat \sigma^2_{ML}) + \Big[E(\hat \sigma^2_{ML})-\sigma^2\Big]^2$$ Expressing the MLE in terms of the OLS estimator and using the expression for its variance, we obtain

$$MSE (\hat \sigma^2_{ML}) = \left(\frac {n-K}{n}\right)^2\frac {2\sigma^4}{n-K} + \left(\frac {K}{n}\right)^2\sigma^4$$ $$\Rightarrow MSE (\hat \sigma^2_{ML}) = \frac {2(n-K)+K^2}{n^2}\sigma^4$$
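(As a quick sanity check, the two MSE expressions can be coded directly; the helper names and the $(n, K)$ pairs below are just illustrative choices, not part of the original answer.)

```python
# MSEs in units of sigma^4, straight from the two expressions above
def mse_s2(n, K):
    return 2 / (n - K)

def mse_mle(n, K):
    return (2 * (n - K) + K**2) / n**2

# two arbitrary (n, K) pairs: the MLE wins the first, loses the second
for n, K in [(20, 3), (50, 10)]:
    print(n, K, round(mse_s2(n, K), 4), round(mse_mle(n, K), 4))
```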

We want the conditions (if they exist) under which

$$MSE (\hat \sigma^2_{ML}) > MSE (s^2) \Rightarrow \frac {2(n-K)+K^2}{n^2} > \frac {2}{n-K}$$

$$\Rightarrow 2(n-K)^2+K^2(n-K)> 2n^2$$ $$ 2n^2 -4nK + 2K^2 +nK^2 - K^3 > 2n^2 $$

Simplifying (and dividing through by $K>0$) we obtain

$$ -4n + 2K +nK - K^2 > 0 \Rightarrow K^2 - (n+2)K + 4n < 0 $$

Is it feasible for this quadratic in $K$ to take negative values? We need its discriminant to be positive. We have

$$\Delta_K = (n+2)^2 -16n = n^2 + 4n + 4 - 16n = n^2 -12n + 4$$

which is another quadratic, in $n$ this time. Its discriminant is

$$\Delta_n = 12^2 - 4\cdot 4 = 128$$

so

$$n_1,n_2 = \frac {12\pm \sqrt{128}}{2} = 6 \pm 4\sqrt2 \approx \{0.34,\; 11.66\}$$

For integer $n$ inside this interval (i.e. $1\le n\le 11$) we have $\Delta_K <0$, so the quadratic in $K$ always takes positive values and we cannot obtain the required inequality. So: we need a sample size of at least $12$.
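(A brute-force sketch of the same condition, in Python: it scans integer $K$ for each $n$ and reports those for which the quadratic is negative, confirming that nothing qualifies before $n=12$. The function name and the ranges scanned are arbitrary.)

```python
# For each n, list the integer K (1 <= K < n) with K^2 - (n+2)K + 4n < 0,
# i.e. the K for which the MLE has the larger MSE
def losing_K(n):
    return [K for K in range(1, n) if K**2 - (n + 2) * K + 4 * n < 0]

for n in range(2, 16):
    print(n, losing_K(n))
# every list up to n = 11 is empty; the first hit is n = 12 with K = 7
```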

Given this, the roots of the $K$-quadratic are

$$K_1, K_2 = \frac {(n+2)\pm \sqrt{n^2 -12n + 4}}{2} = \frac n2 +1 \pm \sqrt{\left(\frac n2\right)^2 +1 -3n}$$

Overall: for sample size $n\ge 12$ and any integer number of regressors $K$ strictly between the roots, $K_1 < K < K_2$, we have $$MSE (\hat \sigma^2_{ML}) > MSE (s^2)$$ For example, if $n=50$ the roots are approximately $4.18$ and $47.82$, so the inequality holds for $5\le K\le 47$ regressors. It is interesting that for small numbers of regressors the MLE is better in the MSE sense.
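(A small sketch of the $n=50$ example, using the root formula above; the function name is just illustrative.)

```python
import math

def K_roots(n):
    # roots of K^2 - (n+2)K + 4n, per the formula above
    d = math.sqrt(n**2 - 12 * n + 4)
    return (n + 2 - d) / 2, (n + 2 + d) / 2

K1, K2 = K_roots(50)
print(K1, K2)                                     # roughly 4.18 and 47.82
print([K for K in range(1, 50) if K1 < K < K2])   # the integers 5 through 47
```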

ADDENDUM
The equation for the roots of the $K$-quadratic can be written

$$K_1, K_2 = \left(\frac n2 +1\right) \pm \sqrt{\left(\frac n2 +1\right)^2 -4n}$$ which shows that the lower root decreases towards $4$ as $n$ grows: it equals $6$ at $n=12$, $5$ at $n=15$, and stays strictly above $4$ for every finite $n$ (plugging $K=4$ into the quadratic gives $8>0$ for every $n$, while $K=5$ gives $15-n$, which is negative once $n\ge 16$). So the MLE is MSE-efficient for up to $4$ regressors at any (finite) sample size, while with $K=5$ regressors $s^2$ already wins for $n\ge 16$.
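(A quick tabulation of the lower root for a few sample sizes – the particular values of $n$ are arbitrary – illustrates this behaviour.)

```python
import math

# lower root (n/2 + 1) - sqrt((n/2 + 1)^2 - 4n) for a few arbitrary sample sizes:
# it starts at 6 for n = 12, hits 5 at n = 15, and creeps down towards 4
for n in [12, 13, 14, 15, 16, 20, 50, 100, 1000, 10**6]:
    K1 = (n / 2 + 1) - math.sqrt((n / 2 + 1) ** 2 - 4 * n)
    print(n, round(K1, 4))
```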