This may be considered... cheating, but the OLS estimator is a MoM estimator. Consider a standard linear regression specification (with $K$ stochastic regressors, so magnitudes are conditional on the regressor matrix), and a sample of size $n$. Denote by $s^2$ the OLS estimator of the variance $\sigma^2$ of the error term. It is unbiased, so
$$ MSE(s^2) = \operatorname {Var}(s^2) = \frac {2\sigma^4}{n-K} $$
Consider now the MLE of $\sigma^2$. It is
$$\hat \sigma^2_{ML} = \frac {n-K}{n}s^2$$
It is biased. Its MSE is
$$MSE (\hat \sigma^2_{ML}) = \operatorname {Var}(\hat \sigma^2_{ML}) + \Big[E(\hat \sigma^2_{ML})-\sigma^2\Big]^2$$
Expressing the MLE in terms of the OLS estimator and using the expression for the variance of the OLS estimator, we obtain
$$MSE (\hat \sigma^2_{ML}) = \left(\frac {n-K}{n}\right)^2\frac {2\sigma^4}{n-K} + \left(\frac {K}{n}\right)^2\sigma^4$$
$$\Rightarrow MSE (\hat \sigma^2_{ML}) = \frac {2(n-K)+K^2}{n^2}\sigma^4$$
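As a quick sanity check on these two formulas, here is a minimal simulation sketch (the sample size, number of regressors, error variance, and regressor draws below are arbitrary illustrative choices; normal errors are assumed, as in the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, sigma2 = 30, 4, 2.0                  # arbitrary illustrative choices
X = rng.normal(size=(n, K))                # regressors, held fixed (we condition on X)
H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
n_rep = 100_000

# OLS residuals depend only on the errors, since (I - H) X beta = 0.
E = rng.normal(scale=np.sqrt(sigma2), size=(n_rep, n))
R = E - E @ H                              # each row is (I - H) e   (H is symmetric)
rss = np.einsum("ij,ij->i", R, R)          # residual sum of squares per replication

s2 = rss / (n - K)                         # unbiased OLS estimator of sigma^2
ml = rss / n                               # ML estimator of sigma^2

print("MSE(s^2): simulated", np.mean((s2 - sigma2) ** 2),
      "formula", 2 * sigma2**2 / (n - K))
print("MSE(ML):  simulated", np.mean((ml - sigma2) ** 2),
      "formula", (2 * (n - K) + K**2) / n**2 * sigma2**2)
```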
We want the conditions (if they exist) under which
$$MSE (\hat \sigma^2_{ML}) > MSE (s^2) \Rightarrow \frac {2(n-K)+K^2}{n^2} > \frac {2}{n-K}$$
$$\Rightarrow 2(n-K)^2+K^2(n-K)> 2n^2$$
$$ 2n^2 -4nK + 2K^2 +nK^2 - K^3 > 2n^2 $$
Simplifying and dividing through by $K>0$, we obtain
$$ -4n + 2K +nK - K^2 > 0 \Rightarrow K^2 - (n+2)K + 4n < 0 $$
Can this quadratic in $K$ take negative values? For that, its discriminant must be positive. We have
$$\Delta_K = (n+2)^2 -16n = n^2 + 4n + 4 - 16n = n^2 -12n + 4$$
which is another quadratic, in $n$ this time. Its discriminant is
$$\Delta_n = 12^2 - 4\cdot 4 = 128$$
so
$$n_1,n_2 = \frac {12\pm \sqrt{128}}{2} = 6 \pm 4\sqrt2 \approx \{0.34,\; 11.66\}$$
Since $n$ is an integer, $\Delta_K <0$ for $1\le n\le 11$: in that range the quadratic in $K$ takes only positive values, so we cannot obtain the required inequality. So: we need a sample size of at least $12$.
Given this, the roots of the $K$-quadratic are
$$K_1, K_2 = \frac {(n+2)\pm \sqrt{n^2 -12n + 4}}{2} = \frac n2 +1 \pm \sqrt{\left(\frac n2\right)^2 +1 -3n}$$
Overall: for sample size $n\ge 12$ and a number of regressors $K$ strictly between the two roots, $K_1 < K < K_2$ (for integer $K$, this means $\lceil K_1\rceil \le K\le \lfloor K_2\rfloor$ when the roots are not themselves integers),
we have
$$MSE (\hat \sigma^2_{ML}) > MSE (s^2)$$
For example, if $n=50$ the roots are $K_1 \approx 4.2$ and $K_2 \approx 47.8$, so the inequality holds when the number of regressors satisfies $5\le K\le 47$. It is interesting that for small numbers of regressors the MLE is better in the MSE sense.
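A small numerical check of the $n=50$ example, using nothing beyond the quadratic derived above (the helper function is mine, purely for illustration):

```python
import numpy as np

def K_roots(n):
    """Roots of K^2 - (n+2)K + 4n; between them, MSE(ML) > MSE(s^2)."""
    d = n**2 - 12 * n + 4
    return ((n + 2) - np.sqrt(d)) / 2, ((n + 2) + np.sqrt(d)) / 2

n = 50
k1, k2 = K_roots(n)
worse = [K for K in range(1, n) if K**2 - (n + 2) * K + 4 * n < 0]
print(round(k1, 2), round(k2, 2))   # 4.18 47.82
print(min(worse), max(worse))       # 5 47
```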
ADDENDUM
The equation for the roots of the $K$-quadratic can be written
$$K_1, K_2 = \left(\frac n2 +1\right) \pm \sqrt{\left(\frac n2 +1\right)^2 -4n}$$
Evaluating the quadratic at $K=4$ gives $16 - 4(n+2) + 4n = 8 > 0$ for every $n$, so the lower root always exceeds $4$ (and tends to $4$ as $n$ grows), while at $K=5$ it equals $15-n$, which turns negative once $n\ge 16$ -so the MLE will be MSE-efficient when the regressors are up to $4$, for any (finite) sample size.
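A quick brute-force scan (again just a sketch) confirms this:

```python
# Value of the K-quadratic K^2 - (n+2)K + 4n at K = 4 and K = 5.
# A positive value means MSE(ML) <= MSE(s^2), i.e. the MLE does not lose.
for n in range(6, 200):
    assert 4**2 - (n + 2) * 4 + 4 * n == 8                  # K = 4: always 8 > 0
    assert (5**2 - (n + 2) * 5 + 4 * n < 0) == (n >= 16)    # K = 5: negative once n >= 16
print("checked n = 6, ..., 199")
```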
A distinction needs to be made between predictor multicollinearity and the VIF.
The variance-inflation factor (VIF) represents relationships among the coefficients of the independent variables, as captured in the coefficient variance-covariance matrix. The equation given above for unweighted ordinary least squares (OLS) takes advantage of the independence of the entries of that matrix from the dependent-variable values, outside of the estimate of overall variance. There is a direct relationship between the predictor multicollinearity calculated from that equation and the covariances among the estimated coefficients.
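To make that concrete, here is a minimal numpy sketch (the simulated predictors and variable names are mine): the classical VIF for predictor $j$, $1/(1-R_j^2)$, equals the $j$-th diagonal entry of the inverse correlation matrix of the predictors, which is why it involves no dependent-variable values at all.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)      # deliberately correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif_by_regression(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    r2 = 1 - np.sum((X[:, j] - fitted) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

# Equivalent: diagonal of the inverse correlation matrix of the predictors.
vif_from_corr = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

print([round(vif_by_regression(X, j), 3) for j in range(X.shape[1])])
print(np.round(vif_from_corr, 3))
```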
For models fit by maximum likelihood (ML), the relationships between the predictors and the dependent variable can influence the variance-covariance matrix, unlike with OLS. In an ML model the OLS formula for VIF will similarly show multicollinearity among predictors, but the relationship between the predictor multicollinearity and the coefficient covariances won't necessarily be so direct.
You might want to consider a different approach, rather than VIF, for evaluating predictor multicollinearity per se. The generalized VIF helps if your independent variables include categorical factors, and for ML models it will provide an estimate based directly on the variance-covariance matrix. See this page.
Best Answer
I first encountered the ANOVA when I was a Master's student at Oxford in 1978. Modern approaches, by teaching continuous and categorical variables together in the multiple regression model, make it difficult for younger statisticians to understand what is going on. So it can be helpful to go back to simpler times.
In its original form, the ANOVA is an exercise in arithmetic whereby you break up the total sum of squares into pieces associated with treatments, blocks, interactions, whatever. In a balanced setting, sums of squares with an intuitive meaning (like SSB and SST) add up to the adjusted total sum of squares. All of this works thanks to Cochran's Theorem. Using Cochran, you can work out the expected values of these terms under the usual null hypotheses, and the F statistics flow from there.
As a bonus, once you start thinking about Cochran and sums of squares, it makes sense to go on slicing and dicing your treatment sums of squares using orthogonal contrasts. Every entry in the ANOVA table should have an interpretation of interest to the statistician and yield a testable hypothesis.
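As a tiny illustration of that arithmetic (a sketch with made-up balanced data, not any particular design): the treatment and error sums of squares add up to the total sum of squares about the grand mean, and a full set of orthogonal contrasts slices the treatment sum of squares further.

```python
import numpy as np

rng = np.random.default_rng(2)
g, m = 4, 10                                      # 4 treatments, 10 replicates each (balanced)
y = np.array([0.0, 0.5, 1.0, 1.5])[:, None] + rng.normal(size=(g, m))

grand = y.mean()
means = y.mean(axis=1)
ss_total = ((y - grand) ** 2).sum()
ss_treat = m * ((means - grand) ** 2).sum()       # between-treatment SS
ss_error = ((y - means[:, None]) ** 2).sum()      # residual SS
print(np.isclose(ss_total, ss_treat + ss_error))  # True

# Slice the treatment SS with a set of mutually orthogonal contrasts.
contrasts = np.array([[1, -1,  0,  0],
                      [1,  1, -2,  0],
                      [1,  1,  1, -3]], dtype=float)
ss_contrasts = [m * (c @ means) ** 2 / (c @ c) for c in contrasts]
print(np.isclose(sum(ss_contrasts), ss_treat))    # True
```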
I recently wrote an answer where the difference between MoM and ML methods arose. The question turned on estimating random effects models. At this point, the traditional ANOVA approach totally parts company with maximum likelihood estimation, and the estimates of the effects are no longer the same. When the design is unbalanced, you don't get the same F statistics either.
Back in the day, when statisticians wanted to compute random effects from split-plot or repeated measures designs, the random effects variance was computed from the mean squares of the ANOVA table. So if you have a plot with variance $\sigma^2_p$ and the residual variance is $\sigma^2$, you might have that the expected value of the mean square ("expected mean square", EMS) for plots is $\sigma^2 + n\sigma_p^2$, with $n$ the number of splits in the plot. You set each mean square equal to its expectation and solve, giving $\hat{\sigma}_p^2 = (\text{MS}_{\text{plots}} - \text{MS}_{\text{error}})/n$. The ANOVA yields a method of moments estimator for the random effect variance. Now, we tend to solve such problems with mixed effects models, and the variance components are obtained through maximum likelihood estimation or REML.
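Here is a minimal sketch of that method-of-moments calculation for a balanced one-way random-effects layout (the variance components and dimensions below are made up; $n$ is the number of observations per plot):

```python
import numpy as np

rng = np.random.default_rng(3)
g, n = 30, 8                                    # 30 plots, 8 observations per plot
sigma2_p, sigma2 = 1.5, 1.0                     # true variance components

plot_effects = rng.normal(scale=np.sqrt(sigma2_p), size=g)
y = plot_effects[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(g, n))

means = y.mean(axis=1)
ms_plots = n * ((means - y.mean()) ** 2).sum() / (g - 1)       # E[MS_plots] = sigma^2 + n sigma_p^2
ms_error = ((y - means[:, None]) ** 2).sum() / (g * (n - 1))   # E[MS_error] = sigma^2

# Set each mean square equal to its expectation and solve (method of moments):
sigma2_p_hat = (ms_plots - ms_error) / n
print("sigma^2 estimate:  ", ms_error)
print("sigma_p^2 estimate:", sigma2_p_hat)
```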
The ANOVA as such is not a method of moments procedure. It turns on splitting the sum of squares (or more generally, a quadratic form of the response) into components that yield meaningful hypotheses. It depends strongly on normality since we want the sums of squares to have chi-squared distributions for the F tests to work.
The maximum likelihood framework is more general and covers situations, like generalized linear models, where sums of squares do not apply. Some software (R, for example) invites confusion by attaching "anova" methods to likelihood ratio tests with asymptotic chi-squared distributions. One can justify use of the term "anova", but strictly speaking, the theory behind it is different.
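For instance, the "anova" one gets for a generalized linear model amounts to a likelihood ratio test. A sketch with simulated Poisson data and statsmodels (the coefficients and sample size are arbitrary):

```python
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.5 * x1))                  # x2 has no true effect

X_full = sm.add_constant(np.column_stack([x1, x2]))
X_red = sm.add_constant(x1)

fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()
fit_red = sm.GLM(y, X_red, family=sm.families.Poisson()).fit()

# "anova" here: twice the log-likelihood difference, referred to a
# chi-squared distribution with df = difference in number of parameters.
lr = 2 * (fit_full.llf - fit_red.llf)
df = X_full.shape[1] - X_red.shape[1]
print("LR statistic:", lr, "p-value:", st.chi2.sf(lr, df))
```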