The probability density function for the data is
\begin{align}
f_\theta\left(\{X_i\},\{Y_j\}\right)&=\prod_i\frac1{\sqrt{2\pi}\,\sigma}\exp\left(-\frac12\left(\frac{X_i-\mu_1}\sigma\right)^2\right)\prod_j\frac1{\sqrt{2\pi}\,\sigma}\exp\left(-\frac12\left(\frac{Y_j-\mu_2}\sigma\right)^2\right)\\
&=\left(2\pi\sigma^2\right)^{-\frac{m+n}2}\exp\left(-\frac1{2\sigma^2}\left(\sum_iX_i^2+\sum_jY_j^2-2\mu_1\sum_iX_i-2\mu_2\sum_jY_j+m\mu_1^2+n\mu_2^2\right)\right)\;,
\end{align}
where $m$ and $n$ are the numbers of $X_i$ and $Y_j$, so $\left(\sum_iX_i^2+\sum_jY_j^2,\sum_iX_i,\sum_jY_j\right)$ is a sufficient statistic by the factorization theorem. Since
$$
\frac{f_\theta\left(\{X_i\},\{Y_j\}\right)}{f_\theta\left(\{X'_i\},\{Y'_j\}\right)}
$$
is independent of $\theta$ if and only if this statistic is the same for the two sets of data, this is also a minimal sufficient statistic. (There's no such thing as "the" minimal sufficient statistic, since you can apply any bijective function to a minimal sufficient statistic to obtain another one.)
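To make the factorization concrete, here is a small numerical check in Python (my own addition, with arbitrary sample sizes and parameter values): the log-likelihood computed from the raw data agrees with the log-likelihood computed from the three components of the sufficient statistic alone.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 7                       # arbitrary sample sizes for illustration
X = rng.normal(1.0, 2.0, size=m)
Y = rng.normal(-0.5, 2.0, size=n)

# Sufficient statistic T = (sum of squares, sum of X, sum of Y)
T = (np.sum(X**2) + np.sum(Y**2), np.sum(X), np.sum(Y))

def loglik_full(mu1, mu2, sigma):
    """Log-likelihood computed from the raw observations."""
    return (-0.5 * np.sum(((X - mu1) / sigma) ** 2)
            - 0.5 * np.sum(((Y - mu2) / sigma) ** 2)
            - (m + n) * np.log(sigma * np.sqrt(2 * np.pi)))

def loglik_from_T(mu1, mu2, sigma):
    """The same log-likelihood, reconstructed from T only."""
    ss, sx, sy = T
    quad = ss - 2 * mu1 * sx - 2 * mu2 * sy + m * mu1**2 + n * mu2**2
    return -quad / (2 * sigma**2) - (m + n) * np.log(sigma * np.sqrt(2 * np.pi))

for theta in [(0.0, 0.0, 1.0), (1.0, -0.5, 2.0), (3.0, 2.0, 0.7)]:
    assert np.isclose(loglik_full(*theta), loglik_from_T(*theta))
```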
First of all, it is worth specifying that you are talking about the normal distribution. Otherwise, $S^2$ is not (necessarily) the MLE of $\text{var}(X)$.
"if the MLE is supposed to reflect the best attempt..."
There is no universally best method for deriving estimators. Likelihood maximization is only one possible and widely accepted method. However, its justification is based mainly on the asymptotic ($n\to \infty$) properties of the estimators rather than on small-sample features like vanishing bias. On slightly more theoretical grounds, what would you expect from a "good" estimator?
1) Consistency: $\hat{\tau}_n \xrightarrow{p} \tau$.
1.1) Asymptotic unbiasedness: $\lim_{n\to\infty} \mathbb{E}\,\hat{\tau}_n=\tau$.
2) Use of all the information available in the sample, in the sense of Fisher information, i.e., $\mathcal{I}_{\hat{\tau}_n}(\tau)=\mathcal{I}_{X_1,\dots,X_n}(\tau)$.
ML estimators satisfy these three conditions. Furthermore, under some regularity conditions (finite variance, and a support of $X_1,\dots,X_n$ that does not depend on $\tau$), the MLE converges in distribution to a normal random variable with the minimal possible variance, the Cramér-Rao lower bound $\mathcal{I}^{-1}_{X_1,\dots,X_n}(\tau)$.
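As a rough illustration of this asymptotic claim (my own sketch, with arbitrarily chosen $n$, $\sigma$, and number of replications), one can simulate the MLE of $\sigma^2$ and compare its sampling variance with the Cramér-Rao bound $2\sigma^4/n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 2000, 1.5, 20000          # arbitrary simulation settings

samples = rng.normal(0.0, sigma, size=(reps, n))
sigma2_mle = samples.var(axis=1)           # divisor n, i.e. the MLE of sigma^2

print("empirical variance of the MLE:", sigma2_mle.var())
print("Cramér-Rao bound 2*sigma^4/n:  ", 2 * sigma**4 / n)
```

For large $n$ the two numbers should be close, which is the sense in which the MLE is asymptotically efficient.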
So, if it is so good, why do the aforementioned "discrepancies" occur? As you can see, some of the desired properties may hold only as $n\to \infty$. As such, if for some reason you are dealing with small $n$ and you value unbiasedness, the ML estimator won't necessarily be your best choice. Another possible reason to reject the method is intractability of the estimator. Deriving the MLE for $\mathcal{N}(\mu, \sigma^2)$ is mathematically easy, but once your parameter space is of higher dimension and/or the likelihood function is not so smooth and "nice", the task of maximization may become pretty troublesome.
Speaking strictly of the estimators of $\text{var}(X)$ in $\mathcal{N}(\mu, \sigma^2)$: all the presented estimators are asymptotically equivalent in terms of bias and efficiency, since $n\pm 1 \approx n$ for large enough $n$. Thus, for very large samples it doesn't matter which one you choose. For small samples you may care about bias and efficiency (in terms of MSE), so it is reasonable to choose one of the other, modified estimators.
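A quick Monte Carlo sketch along these lines (my addition; the true variance, sample sizes, and replication count are arbitrary) shows the downward bias of the $1/n$ estimator at small $n$ and its practical equivalence to $S^2$ at large $n$:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, reps = 4.0, 100_000                 # true variance and replication count

for n in (5, 500):
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    mle = x.var(axis=1, ddof=0)             # MLE: divide by n
    s2 = x.var(axis=1, ddof=1)              # unbiased S^2: divide by n-1
    print(f"n={n}: mean MLE={mle.mean():.3f}, mean S^2={s2.mean():.3f}, "
          f"true sigma^2={sigma2}")
```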
To be strict, the parameters can be taken as $\alpha,\sigma_1, \mu_2,\sigma_2$, so that $\mu_1=\alpha+\mu_2$. We can write the negative log-likelihood, up to terms not involving $\alpha$ and $\mu_2$, as:
$\sum_1^{n_1}{\frac{(X_i-\mu_2-\alpha)^2}{2\sigma_1^2}}+\sum_1^{n_2}{\frac{(Y_i-\mu_2)^2}{2\sigma_2^2}}$
Taking the first-order conditions with respect to $\alpha$ and $\mu_2$, we have:
$-\sum_1^{n_1}{\frac{X_i-\mu_2-\alpha}{\sigma_1^2}}=0 \rightarrow \sum_1^{n_1}({X_i-\mu_2-\alpha})=0$
$-\sum_1^{n_1}{\frac{X_i-\mu_2-\alpha}{\sigma_1^2}}-\sum_1^{n_2}{\frac{Y_i-\mu_2}{\sigma_2^2}}=0 \rightarrow \sum_1^{n_2}({Y_i-\mu_2})=0$, where the implication uses the first equation.
So $\hat\mu_2=\bar{Y}$ and $\hat\alpha=\bar{X}-\bar{Y}$.
$\operatorname{var}(\hat\alpha)=\operatorname{var}(\bar{X}-\bar{Y})=\operatorname{var}(\bar{X})+\operatorname{var}(\bar{Y})=\sigma_1^2/n_1+\sigma_2^2/n_2$. If $\sigma_1, \sigma_2$ are known, it would be easy to minimize. Otherwise use the t-distribution in my comments.
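As a quick illustration (my own sketch; the data are simulated and the sample sizes made up), here is one way to compute $\hat\alpha$, its estimated variance with the sample variances plugged in, and a Welch-style $t$ interval, which is one common way to use the $t$-distribution when $\sigma_1,\sigma_2$ are unknown (this may or may not be exactly what the comments referred to):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
X = rng.normal(5.0, 2.0, size=12)    # n1 = 12, simulated for illustration
Y = rng.normal(3.0, 1.0, size=20)    # n2 = 20

alpha_hat = X.mean() - Y.mean()                               # MLE of alpha
var_hat = X.var(ddof=1) / len(X) + Y.var(ddof=1) / len(Y)     # est. var(alpha_hat)

# Welch-Satterthwaite degrees of freedom for the t approximation
den = ((X.var(ddof=1) / len(X)) ** 2 / (len(X) - 1)
       + (Y.var(ddof=1) / len(Y)) ** 2 / (len(Y) - 1))
df = var_hat**2 / den

ci = alpha_hat + np.array([-1, 1]) * stats.t.ppf(0.975, df) * np.sqrt(var_hat)
print("alpha_hat:", alpha_hat, "approximate 95% CI:", ci)
```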