First of all, it is worth specifying that you are talking about the normal distribution. Otherwise, $S^2$ is not (necessarily) the MLE of $\text{var}(X)$.
"if the MLE is supposed to reflect the best attempt..."
There is no universally best method for deriving estimators. Maximum likelihood is only one possible and widely accepted method. However, its justification is mainly based on the asymptotic ($n\to \infty$) properties of the estimators rather than on small-sample features like vanishing bias. On slightly more theoretical grounds, what would you expect from a "good" estimator?
1) Consistency: $\hat{\tau}_n \xrightarrow{p} \tau$.
1.1) Asymptotic unbiasedness: $\lim_{n\to\infty} \mathbb{E}\hat{\tau}_n=\tau$.
2) Use of all the information available in the sample, in the sense of Fisher information, i.e., $\mathcal{I}_{\hat{\tau}_n}(\tau)=\mathcal{I}_{X_1,\dots,X_n}(\tau)$.
ML estimators satisfy these three conditions; furthermore, under some regularity conditions (finite variance, and the support of $X_1,\dots,X_n$ not depending on $\tau$), the MLE converges in distribution to a normal r.v. with the smallest possible variance, the Cramér-Rao lower bound $\mathcal{I}^{-1}_{X_1,\dots,X_n}(\tau)$.
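As a quick illustration of that asymptotic efficiency, here is a simulation sketch (assuming, for concreteness, normal data where the Fisher information for $\sigma^2$ gives the bound $2\sigma^4/n$; all names are my own):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2, n, reps = 4.0, 1000, 20_000

# Sampling distribution of the MLE of sigma^2 (divides by n) over many replications
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
mle = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print("empirical variance of the MLE: %.5f" % mle.var())
print("Cramer-Rao bound 2*sigma^4/n:  %.5f" % (2 * sigma2**2 / n))
# For large n the two agree closely, and a histogram of `mle` looks normal.
```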
So, if it is so good, why do the aforementioned "discrepancies" occur? As you can see, some of the desired properties may hold only as $n\to \infty$. Thus, if for some reason you are dealing with small $n$ and value unbiasedness, the ML estimator won't necessarily be your best choice. Another possible reason to reject the method is intractability of the estimator. Deriving the MLE for $\mathcal{N}(\mu, \sigma^2)$ is mathematically easy, but once your parameter space is of higher dimension and/or the likelihood function is not so smooth and "nice", the task of maximization may become quite troublesome.
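In such cases the maximization is typically done numerically. A minimal sketch of the mechanics (deliberately using the normal model, where the closed form is known, so the numerical answer can be checked):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)

def neg_log_lik(params):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / sigma**2)

res = minimize(neg_log_lik, x0=[0.0, 0.0], method="BFGS")
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1]) ** 2
# These should reproduce the closed-form MLEs: x.mean() and ((x - x.mean())**2).mean()
```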
Speaking strictly of the estimator of $\text{var}(X)$ in $\mathcal{N}(\mu, \sigma^2)$: all the presented estimators are asymptotically equivalent in terms of bias and efficiency, since $n\pm 1 \approx n$ for large enough $n$. Thus, for very large samples it doesn't matter which one you choose. For small samples you may care about bias and efficiency (in terms of MSE), so it is reasonable to choose one of the modified estimators.
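For instance, a quick Monte Carlo sketch (assuming normal data with $\sigma^2 = 1$ and $n = 5$) comparing the divisors $n-1$ (unbiased), $n$ (MLE), and $n+1$ (which minimizes MSE under normality):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 1.0, 5, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)  # sum of squared deviations

for d, label in [(n - 1, "n-1 (unbiased S^2)"), (n, "n   (MLE)"), (n + 1, "n+1")]:
    est = ss / d
    print(f"{label:20s} bias = {est.mean() - sigma2:+.4f}   MSE = {((est - sigma2) ** 2).mean():.4f}")
```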
The point of writing $$d_2-\theta=\color{blue}{d_2-(\theta+2)}+2$$
is to compare $d_2$ to its expected value. Note that the blue expression has zero expectation, and the expectation of its square is just the variance of $d_2$ (by definition of variance).
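Written out, squaring and taking expectations (the cross term vanishes because the blue expression has zero expectation):
$$\operatorname{E}[(d_2-\theta)^2]=\operatorname{E}\big[(\color{blue}{d_2-(\theta+2)})^2\big]+4\operatorname{E}[\color{blue}{d_2-(\theta+2)}]+4=\operatorname{Var}[d_2]+0+4=2+4=6.$$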
You can also expand $E[(d_2-\theta)^2]$ without rewriting the expression in this way; you get $$E[d_2^2]-2\theta E[d_2]+\theta^2$$
But the first term, $E[d_2^2]$, is just the variance plus $E[d_2]^2 = (\theta+2)^2$, and the second term is $-2\theta(\theta+2)$. Thus we have
$$2+(\theta+2)^2-2\theta^2-4\theta+\theta^2$$
Simplifying (expand $(\theta+2)^2=\theta^2+4\theta+4$; all the $\theta$ terms cancel) gives the same MSE we obtained the other way, $6$. The first solution is algebraically cleverer, but by no means necessary.
The main difference is whether you are considering the expected squared deviation of the estimator from the true parameter (this is the mean squared error) or from the estimator's own expected value (this is the variance). Consequently, when the bias of the estimator is zero, the variance and mean squared error are equal.
Mathematically, if $\hat \theta$ is an estimator for $\theta$, then $$\operatorname{MSE}[\hat \theta] = \operatorname{E}[(\hat\theta - \theta)^2],$$ whereas $$\operatorname{Var}[\hat\theta] = \operatorname{E}[(\hat\theta - \operatorname{E}[\hat\theta])^2].$$ And regarding the previous remark, $$\operatorname{Bias}[\hat\theta] = \operatorname{E}[\hat\theta - \theta],$$ so when the bias is zero, $\operatorname{E}[\hat\theta] = \theta$ and now we easily see how the MSE and variance become equivalent.
Note, however, we can also write: $$\operatorname{Var}[\hat\theta] = \operatorname{E}[\hat \theta^2 - 2\hat\theta \operatorname{E}[\hat \theta] + \operatorname{E}[\hat\theta]^2] = \operatorname{E}[\hat\theta^2] - \operatorname{E}[\hat\theta]^2,$$ so that $$\begin{align*} \operatorname{Var}[\hat\theta] + \operatorname{Bias}^2[\hat\theta] &= \operatorname{E}[\hat\theta^2] - \operatorname{E}[\hat\theta]^2 + (\operatorname{E}[\hat \theta] - \theta)^2 \\ &= \operatorname{E}[\hat\theta^2] - 2\theta \operatorname{E}[\hat\theta] + \theta^2 \\ &= \operatorname{E}[(\hat \theta - \theta)^2] \\ &= \operatorname{MSE}[\hat\theta]. \end{align*}$$
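To see the decomposition in action, here is a small Monte Carlo sketch (assuming, for concreteness, the biased divide-by-$n$ variance estimator applied to normal data) checking numerically that $\operatorname{Var} + \operatorname{Bias}^2 = \operatorname{MSE}$:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 1.0, 10, 100_000        # true variance, sample size, replications

x = rng.normal(0.0, np.sqrt(theta), size=(reps, n))
theta_hat = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # divide-by-n estimator

mse  = ((theta_hat - theta) ** 2).mean()
var  = theta_hat.var()
bias = theta_hat.mean() - theta

print(f"MSE          = {mse:.5f}")
print(f"Var + Bias^2 = {var + bias**2:.5f}")   # matches MSE up to Monte Carlo error
```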