First of all, it is worth specifying that you are talking about the normal distribution. Otherwise, $S^2$ is not (necessarily) the MLE of $\text{var}(X)$.
"if the MLE is supposed to reflect the best attempt..."
There is no universally best method for deriving estimators. Maximum-likelihood estimation is only one possible, widely accepted method. However, its justification is based mainly on the asymptotic ($n\to \infty$) properties of the estimators rather than on small-sample features such as unbiasedness. On slightly more theoretical grounds, what would you expect from a "good" estimator?
1) Consistency: $\hat{\tau}_n \xrightarrow{p} \tau$.
1.1) Asymptotic unbiasedness: $\lim_{n\to\infty} \mathbb{E}\,\hat{\tau}_n=\tau$.
2) Use of all the available sample information, in the sense of Fisher information, i.e., $\mathcal{I}_{\hat{\tau}_n}(\tau)=\mathcal{I}_{X_1,\ldots,X_n}(\tau)$.
ML estimators satisfy these three conditions. Furthermore, under some regularity conditions (finite variance, and the support of $X_1,\ldots,X_n$ not depending on $\tau$), the MLE converges in distribution to a normal random variable with the minimal possible variance, the Cramér–Rao lower bound $\mathcal{I}^{-1}_{X_1,\ldots,X_n}(\tau)$.
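To make that last point concrete (a standard calculation, added here for illustration and not part of the original question): in the $\mathcal{N}(\mu,\sigma^2)$ model the Fisher information for $\sigma^2$ is

$$\mathcal{I}_{X_1,\ldots,X_n}(\sigma^2)=\frac{n}{2\sigma^4},\qquad \sqrt{n}\,\bigl(\hat\sigma^2_{ML}-\sigma^2\bigr)\xrightarrow{d}\mathcal{N}\bigl(0,\,2\sigma^4\bigr),$$

so the asymptotic variance $2\sigma^4/n$ attains the Cramér–Rao lower bound $\mathcal{I}^{-1}_{X_1,\ldots,X_n}(\sigma^2)$.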
So, if it is so good, why do the aforementioned "discrepancies" occur? As you can see, some of the desired properties may hold only as $n\to \infty$. Hence, if for some reason you are dealing with a small $n$ and value unbiasedness, the ML estimator won't necessarily be your best choice. Another possible reason to reject the method is intractability of the estimator. Deriving the MLE for $\mathcal{N}(\mu, \sigma^2)$ is mathematically easy, but once your parameter space is of higher dimension and/or the likelihood function is not so smooth and "nice", the maximization problem may become quite troublesome.
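As a concrete example of that small-$n$ bias (a standard fact, easy to verify): for $X_1,\ldots,X_n\sim\mathcal{N}(\mu,\sigma^2)$ the MLE of the variance is

$$\hat\sigma^2_{ML}=\frac1n\sum_{i=1}^n\bigl(X_i-\bar X\bigr)^2,\qquad \mathbb{E}\,\hat\sigma^2_{ML}=\frac{n-1}{n}\,\sigma^2,$$

so the bias $-\sigma^2/n$ vanishes as $n\to\infty$ but is noticeable in small samples.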
Returning to the estimators of $\text{var}(X)$ under $\mathcal{N}(\mu, \sigma^2)$: all the presented estimators are asymptotically equivalent in terms of bias and efficiency, since $n\pm 1 \approx n$ for large enough $n$. Thus, for very large samples it doesn't matter which one you choose. For small samples you may care about bias and efficiency (in terms of MSE), so it is reasonable to choose one of the modified estimators.
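If you want to see this numerically, here is a small simulation sketch (my own illustration, not part of the answer; the sample size and constants are arbitrary) comparing the bias and MSE of the divide-by-$n$, divide-by-$(n-1)$ and divide-by-$(n+1)$ estimators for a small normal sample:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000  # arbitrary true values and sample size

# Simulate many small normal samples and the sum of squared deviations for each.
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for label, divisor in [("MLE, 1/n", n), ("unbiased, 1/(n-1)", n - 1), ("min-MSE, 1/(n+1)", n + 1)]:
    est = ss / divisor
    print(f"{label:>18}: bias = {est.mean() - sigma2:+.3f}, MSE = {((est - sigma2) ** 2).mean():.3f}")
```

With $n=10$ the divide-by-$(n-1)$ estimator should show (near) zero bias, while the divide-by-$(n+1)$ estimator should show the smallest MSE, which is exactly the small-sample trade-off described above.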
Best Answer
If $Y_1,\ldots,Y_n\sim \text{i.i.d.} \operatorname N(\mu,\sigma^2)$ then the sample mean $(Y_1+\cdots+Y_n)/n$ is both the least-squares estimator of $\mu$ and the maximum-likelihood estimator of $\mu.$
It is also the best linear unbiased estimator (BLUE) of $\mu,$ i.e., among all estimators of the form $c_1 Y_1+\cdots+c_n Y_n$ that are unbiased for $\mu,$ it has the smallest variance.
The same thing applies to more elaborate sorts of linear models. For example, suppose we have $$ \text{independent } Y_i \sim \operatorname N(a+bx_i, \sigma^2) \text{ for } i=1,\ldots,n. $$ Then the least-squares estimators of $a$ and $b$ are likewise BLUE.
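For reference (standard closed forms, not spelled out in the answer), those least-squares estimators are

$$\hat b=\frac{\sum_{i=1}^n (x_i-\bar x)\,(Y_i-\bar Y)}{\sum_{i=1}^n (x_i-\bar x)^2},\qquad \hat a=\bar Y-\hat b\,\bar x,$$

both of which are linear in $Y_1,\ldots,Y_n$ and unbiased.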
In the situations above, least-squares estimation of $\mu$ or $(a,b)$ coincides with maximum-likelihood estimation.
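One way to see the coincidence (a one-line argument, added here for completeness): the log-likelihood of the model above is

$$\ell(a,b,\sigma^2)=-\frac n2\log\bigl(2\pi\sigma^2\bigr)-\frac{1}{2\sigma^2}\sum_{i=1}^n\bigl(Y_i-a-bx_i\bigr)^2,$$

so for any fixed $\sigma^2,$ maximizing over $(a,b)$ is exactly minimizing the sum of squared residuals; hence the maximum-likelihood and least-squares estimates of $(a,b)$ are the same.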
The proofs of the assertions above, except for the one about maximum likelihood, can be done with far less information than that the $Y\text{s}$ have the distributions above. It is enough to assume that

- the expected values $\operatorname E(Y_i)$ are as specified above ($\mu,$ or $a+bx_i$),
- the variances $\operatorname{var}(Y_i)$ are all equal and finite, and
- the $Y_i$ are uncorrelated.
The Gauss–Markov theorem says that these three assumptions are enough to guarantee that least-squares is BLUE.
But with these weaker Gauss–Markov assumptions, it makes no sense to speak of maximum likelihood, since we don't have a parametrized family of probability distributions.
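To illustrate the Gauss–Markov point without normality, here is a small simulation sketch (my own addition; the centered-exponential errors and the "two endpoints" comparison estimator are arbitrary choices) showing that the least-squares slope is still unbiased and has smaller variance than another linear unbiased estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n, reps = 1.0, 2.0, 20, 100_000      # arbitrary true intercept, slope, sample size
x = np.linspace(0.0, 1.0, n)

# Centered exponential errors: mean 0, common variance, uncorrelated, not normal.
eps = rng.exponential(1.0, size=(reps, n)) - 1.0
y = a + b * x + eps

xc = x - x.mean()
b_ls = (y * xc).sum(axis=1) / (xc ** 2).sum()   # least-squares slope
b_alt = (y[:, -1] - y[:, 0]) / (x[-1] - x[0])   # "two endpoints" slope: also linear and unbiased

print("LS  slope: mean %.3f, variance %.4f" % (b_ls.mean(), b_ls.var()))
print("alt slope: mean %.3f, variance %.4f" % (b_alt.mean(), b_alt.var()))
```

Both estimators should average about $b=2,$ but the least-squares slope should show a markedly smaller variance, which is all the Gauss–Markov theorem promises; nothing in the simulation required a normal (or even parametric) error distribution.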