There's a slight confusion of two things here: methods for deriving estimators, and criteria for evaluating estimators. Maximum likelihood (ML) and method of moments (MoM) are ways of deriving estimators; uniformly minimum-variance unbiased (UMVU) estimation and decision theory are criteria for evaluating different estimators once you have them, but they won't tell you how to derive them.
Of the methods for deriving estimators, ML usually produces estimators that are more efficient (i.e. lower variance) than MoM if you know the model from which your data were generated (the 'data-generating process' (DGP) in the jargon). But MoM makes fewer assumptions about the model; as its name implies, it only uses one or more moments, usually just the mean, or the mean and variance, so it's sometimes more robust if you're not sure about the DGP. There can be more than one MoM estimator for the same problem, whereas if you know the DGP there is only one ML estimator.
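To make the efficiency point concrete, here is a small simulation sketch of my own (a toy example not tied to anything above): for a $U(0,\theta)$ sample the MoM estimator is $2\bar x$, the MLE is the sample maximum, and the MLE wins decisively in MSE despite being biased.

```python
import numpy as np

# Toy comparison of ML vs method-of-moments for Uniform(0, theta).
# MoM matches the first moment: E[X] = theta/2  =>  theta_hat = 2 * xbar.
# ML: the likelihood is maximized at the sample maximum.
rng = np.random.default_rng(0)
theta, n, reps = 5.0, 50, 20_000

x = rng.uniform(0, theta, size=(reps, n))
mom = 2 * x.mean(axis=1)   # method-of-moments estimator
mle = x.max(axis=1)        # maximum likelihood estimator

for name, est in [("MoM", mom), ("MLE", mle)]:
    print(f"{name}: bias = {est.mean() - theta:+.4f}, MSE = {np.mean((est - theta)**2):.5f}")
# The MLE is biased downward but has far smaller MSE than the MoM estimator.
```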
Of the criteria for evaluating estimators, the decision-theoretic approach depends on having a loss function with which to judge your estimator, although the results can be fairly robust across a range of 'reasonable' loss functions. UMVU estimators often don't even exist: in many cases there is no unbiased estimator that always has minimum variance. The criterion of unbiasedness is itself of questionable usefulness, as it is not invariant to transformations. For example, would you prefer an unbiased estimator of the odds ratio or of the log odds ratio? The two will be different.
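The non-invariance is visible with an even simpler transformation than the odds/log-odds map; the toy sketch below (my own, using squaring purely for simplicity) shows that $\bar x$ is unbiased for $\mu$ while $\bar x^2$ is biased for $\mu^2$ by exactly $\operatorname{Var}(\bar x)$.

```python
import numpy as np

# Unbiasedness is not invariant to transformations: xbar is unbiased for mu,
# but xbar**2 overestimates mu**2 by Var(xbar) = sigma**2 / n on average.
rng = np.random.default_rng(1)
mu, sigma, n, reps = 2.0, 3.0, 25, 100_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print("E[xbar]   ~", round(xbar.mean(), 4), "   target mu   =", mu)
print("E[xbar^2] ~", round((xbar**2).mean(), 4),
      " target mu^2 =", mu**2, " theoretical bias =", sigma**2 / n)
```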
This may be considered... cheating, but the OLS estimator is a MoM estimator. Consider a standard linear regression specification with normal errors (so the ML estimator is well defined) and $K$ stochastic regressors, so that all magnitudes below are conditional on the regressor matrix, and a sample of size $n$. Denote by $s^2$ the OLS estimator of the variance $\sigma^2$ of the error term. It is unbiased, so
$$ MSE(s^2) = \operatorname {Var}(s^2) = \frac {2\sigma^4}{n-K} $$
Consider now the MLE of $\sigma^2$. It is
$$\hat \sigma^2_{ML} = \frac {n-K}{n}s^2$$
It is biased. Its MSE is
$$MSE (\hat \sigma^2_{ML}) = \operatorname {Var}(\hat \sigma^2_{ML}) + \Big[E(\hat \sigma^2_{ML})-\sigma^2\Big]^2$$
Expressing the MLE in terms of the OLS estimator $s^2$ and using the expression for the variance of $s^2$, we obtain
$$MSE (\hat \sigma^2_{ML}) = \left(\frac {n-K}{n}\right)^2\frac {2\sigma^4}{n-K} + \left(\frac {K}{n}\right)^2\sigma^4$$
$$\Rightarrow MSE (\hat \sigma^2_{ML}) = \frac {2(n-K)+K^2}{n^2}\sigma^4$$
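A quick Monte Carlo check of both MSE formulas (a sketch of my own, assuming normal errors and conditioning on one fixed draw of the regressor matrix; the values of $n$, $K$ and $\sigma^2$ are arbitrary illustrative choices):

```python
import numpy as np

# Monte Carlo check of MSE(s^2) = 2*sigma^4/(n-K) and
# MSE(sigma^2_ML) = (2*(n-K) + K^2)/n^2 * sigma^4, under normal errors.
rng = np.random.default_rng(2)
n, K, sigma2, reps = 50, 10, 4.0, 20_000

X = rng.normal(size=(n, K))              # one fixed design (conditioning on X)
beta = np.ones(K)
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix

s2 = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    resid = y - H @ y
    s2[r] = resid @ resid / (n - K)      # unbiased OLS estimator of sigma^2
ml = (n - K) / n * s2                    # ML estimator

print("MSE(s^2): sim", np.mean((s2 - sigma2)**2), " formula", 2 * sigma2**2 / (n - K))
print("MSE(ML):  sim", np.mean((ml - sigma2)**2),
      " formula", (2 * (n - K) + K**2) * sigma2**2 / n**2)
```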
We want the conditions (if they exist) under which
$$MSE (\hat \sigma^2_{ML}) > MSE (s^2) \Rightarrow \frac {2(n-K)+K^2}{n^2} > \frac {2}{n-K}$$
$$\Rightarrow 2(n-K)^2+K^2(n-K)> 2n^2$$
$$ 2n^2 -4nK + 2K^2 +nK^2 - K^3 > 2n^2 $$
Cancelling the $2n^2$ terms and dividing through by $K>0$, we obtain
$$ -4n + 2K +nK - K^2 > 0 \Rightarrow K^2 - (n+2)K + 4n < 0 $$
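For anyone who wants to double-check the algebra symbolically, here is a small sympy sketch (the variable names are mine):

```python
import sympy as sp

# Check that MSE(ML) > MSE(s^2) reduces to K**2 - (n+2)*K + 4*n < 0 after
# dividing by the positive factors sigma^4, n**2, (n - K) and K.
n, K = sp.symbols('n K', positive=True)
diff = (2*(n - K) + K**2) / n**2 - 2 / (n - K)      # MSE(ML)/sigma^4 - MSE(s^2)/sigma^4
scaled = sp.expand(diff * n**2 * (n - K) / K)
print(scaled)                                          # n*K - K**2 + 2*K - 4*n (up to ordering)
print(sp.simplify(scaled + (K**2 - (n + 2)*K + 4*n)))  # 0: exact negatives, as claimed
```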
Is it feasible for this quadratic in $K$ to take negative values? For that, its discriminant must be positive. We have
$$\Delta_K = (n+2)^2 -16n = n^2 + 4n + 4 - 16n = n^2 -12n + 4$$
which is another quadratic, in $n$ this time. This discriminant is
$$\Delta_n = 12^2 - 4\cdot 4 = 128$$
so
$$n_1,n_2 = \frac {12\pm \sqrt{128}}{2} = 6 \pm 4\sqrt2 \approx \{0.34,\; 11.66\}$$
Since $n$ is an integer, $\Delta_K <0$ for all $1\le n \le 11$; in that case the quadratic in $K$ takes only positive values and the required inequality cannot hold. So: we need a sample size of at least $12$.
Given this, the roots of the $K$-quadratic are
$$K_1, K_2 = \frac {(n+2)\pm \sqrt{n^2 -12n + 4}}{2} = \frac n2 +1 \pm \sqrt{\left(\frac n2\right)^2 +1 -3n}$$
Overall: for sample size $n\ge 12$ and number of regressors $K$ such that $K_1 < K < K_2$
we have
$$MSE (\hat \sigma^2_{ML}) > MSE (s^2)$$
For example, if $n=50$ the roots are $K_1\approx 4.2$ and $K_2\approx 47.8$, so the inequality holds for any number of regressors from $5$ to $47$. It is interesting that for small numbers of regressors the MLE is better in the MSE sense.
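A short numerical check of the $n=50$ example (my own sketch):

```python
import numpy as np

# Roots of K**2 - (n+2)*K + 4*n for n = 50, and the integer K for which it is negative.
n = 50
disc = (n + 2)**2 - 16 * n                      # = n**2 - 12*n + 4
K1, K2 = (n + 2 - np.sqrt(disc)) / 2, (n + 2 + np.sqrt(disc)) / 2
print(f"K1 = {K1:.2f}, K2 = {K2:.2f}")           # about 4.18 and 47.82
valid = [K for K in range(1, n) if K**2 - (n + 2)*K + 4*n < 0]
print(min(valid), "...", max(valid))             # 5 ... 47
```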
ADDENDUM
The equation for the roots of the $K$-quadratic can be written
$$K_1, K_2 = \left(\frac n2 +1\right) \pm \sqrt{\left(\frac n2 +1\right)^2 -4n}$$
From this one can check that the lower root always exceeds $4$: the quadratic evaluated at $K=4$ equals $16-4(n+2)+4n = 8>0$ for every $n$, and $K_1 \to 4$ as $n\to\infty$. So the MLE will be MSE-efficient whenever the number of regressors is at most $4$, for any (finite) sample size; at $K=5$ the quadratic equals $15-n$, so there the MLE remains at least as good only when $n\le 15$.
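A quick numerical look at how the lower root behaves (again just a sketch of mine):

```python
import numpy as np

# The K-quadratic at K = 4 equals 8 for every n, so the lower root always exceeds 4.
for n in [12, 15, 20, 50, 100, 1000, 10**6]:
    K1 = (n + 2 - np.sqrt(n**2 - 12*n + 4)) / 2
    print(f"n = {n:>7}: quadratic at K=4 -> {16 - 4*(n + 2) + 4*n},  K1 = {K1:.3f}")
# K1 decreases toward 4 from above as n grows, so K <= 4 always favours the MLE.
```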
Best Answer
In An Introduction to the Bootstrap, Efron and Tibshirani find it useful to characterize the empirical cumulative distribution function (ecdf) as the nonparametric maximum likelihood estimate of the "underlying population" $F$.
Given data $x_1, x_2, \ldots, x_n$, the likelihood function (by definition) is the product of the probabilities
$$L(F) = \prod_{i=1}^n {\Pr}_F(x_i).$$
E&T claim this is maximized by the ecdf. Since they leave it as an exercise, let's work out the solution here. It's not completely trivial, because we have to account for the possibility of duplicates among the data. Let's take care with the notation, then. Let $x_1, \ldots, x_m$ be the distinct data values, with $x_i$ appearing $k_i \ge 1$ times in the dataset. (Thus, $x_{m+1}, \ldots, x_n$ are all duplicates of the first $m$ values.) The ecdf is the discrete distribution that assigns probability $k_i/n$ to $x_i$ for $1 \le i \le m$.
For any distribution $F$, the likelihood $L(F)$ has $k_i$ terms equal to $p_i = {\Pr}_F(x_i)$ for each $i$. It therefore is completely determined by the vector $p=(p_1, p_2, \ldots, p_m)$ and can be computed as
$$L(F) = L(p) = \prod_{i=1}^m p_i^{k_i}.$$
Since the likelihood for the ecdf is nonzero, the maximum likelihood will be nonzero. Therefore, for any distribution $\hat F$ that maximizes the likelihood, $p_i = {\Pr}_{\hat F}(x_i)$ must be nonzero for all the data. The Axiom of Total Probability asserts the sum of the $p_i$ is at most $1$. This reduces the problem to a constrained optimization:
$$\text{Maximize } L(p) = \prod_{i=1}^m p_i^{k_i}$$
subject to
$$p_i \gt 0, i=1, 2, \ldots m;\quad \sum_{i=1}^m p_i \le 1.$$
This can be solved in many ways. Perhaps the most direct is to use a Lagrange multiplier $\lambda$ to optimize $\log L$, which produces the critical equations
$$\left(\frac{p_1}{k_1}, \frac{p_2}{k_2}, \ldots, \frac{p_m}{k_m}\right) = \lambda\left(1, 1, \ldots, 1\right)$$
with unique solution $$\hat p_i = \frac{k_i}{k_1+\cdots+k_m} = \frac{k_i}{n},$$
precisely the ecdf, QED.
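For anyone who prefers to see this numerically, here is a small sketch (my own, using scipy's constrained optimizer on a made-up dataset) that maximizes $\sum_i k_i \log p_i$ over the probability simplex and recovers the ecdf weights $k_i/n$:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize sum(k_i * log p_i) subject to p_i > 0 and sum(p_i) = 1,
# for a small dataset with duplicates; compare with the ecdf weights k_i / n.
data = [3.1, 3.1, 4.7, 5.0, 5.0, 5.0, 6.2]
values, k = np.unique(data, return_counts=True)
n = k.sum()

neg_loglik = lambda p: -np.dot(k, np.log(p))
res = minimize(neg_loglik,
               x0=np.full(len(k), 1.0 / len(k)),
               bounds=[(1e-9, 1.0)] * len(k),
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
print("numerical optimum:", np.round(res.x, 4))
print("ecdf weights k/n: ", np.round(k / n, 4))   # the two agree
```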
Why is this point of view important? Here are E&T:
[Quoted passage from Section 21.7, p. 310 of Efron and Tibshirani.]
Some words of explanation: "as a result" follows from the (easily proven) fact that the MLE (maximum likelihood estimate) of any function of a parameter is that function of the MLE of the parameter. A "functional statistic" (or "plug-in" statistic) is one that depends only on the distribution function. As an example of this distinction, E&T point out that the usual unbiased variance estimator $s^2 = \sum (x_i-\bar x)^2/(n-1)$ is not a functional statistic: if you were to duplicate every observation (doubling the dataset), the ecdf would not change, but $s^2$ would be multiplied by $2(n-1)/(2n-1)$, so it does change (albeit only slightly). Functional statistics are crucial to understanding and analyzing the Bootstrap.
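A quick sanity check of that last statement (my own snippet, with an arbitrary simulated dataset): duplicating every observation leaves the plug-in variance unchanged but multiplies $s^2$ by exactly $2(n-1)/(2n-1)$.

```python
import numpy as np

# Duplicating every observation leaves the ecdf (hence the plug-in variance)
# unchanged, but multiplies the unbiased s^2 by 2*(n-1)/(2*n-1).
rng = np.random.default_rng(3)
x = rng.normal(size=20)
doubled = np.concatenate([x, x])
n = len(x)

print("plug-in variance:", np.var(x), "->", np.var(doubled))             # unchanged
print("unbiased s^2:    ", np.var(x, ddof=1), "->", np.var(doubled, ddof=1))
print("predicted ratio 2(n-1)/(2n-1):", 2 * (n - 1) / (2 * n - 1),
      "  observed:", np.var(doubled, ddof=1) / np.var(x, ddof=1))
```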
Reference
Bradley Efron and Robert J. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall, 1993.