Minimize ${\rm SSE} = \sum_i (z_i-a)^2$ with respect to $a$ and you get the average of the two sample means, so it looks like your MLE is also your minimum-MSE estimator, whether you have one $(x,y)$ pair or several. This is not surprising, since the expression for the SSE and the log-likelihood function are almost identical.
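A quick numerical sanity check of this (with a made-up sample `z`; a sketch, not tied to any particular data) confirms that the value of $a$ minimizing the SSE is the sample mean:

```python
import numpy as np

# Hypothetical pooled sample; we check that the value of `a` minimizing
# SSE = sum_i (z_i - a)^2 coincides with the arithmetic mean.
z = np.array([1.0, 2.0, 4.0, 7.0])

# Evaluate the SSE on a fine grid of candidate values for `a`.
grid = np.linspace(z.min(), z.max(), 100_001)
sse = ((z[:, None] - grid[None, :]) ** 2).sum(axis=0)

a_hat = grid[np.argmin(sse)]
print(a_hat, z.mean())  # the grid minimizer sits at the sample mean
```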
I will try to give an intuitive example of why the arithmetic mean
\begin{equation} \overline x_1 = \sum_{i=1}^{n} \frac{x_i}{n}
\end{equation}
is not as good as the midrange
\begin{equation} \overline x_2 = \frac{a + b}{2}
\end{equation}
(where $a$ and $b$ are the smallest and largest observations in the sample) in the case where $X \sim \mathrm{unif}(\alpha,\beta)$.
Imagine that you have 10 observations from $\mathrm{unif}(1,11)$ (think of a candy factory that puts candies of 1 cm, 2 cm, ..., 11 cm, 1 cm, 2 cm, ..., 11 cm, ... into separate bags). We know from the above information that the true mean is 6 (in the factory example, the factory would have spent the same amount of sugar if every candy were 6 cm).
Now, if you don't know any of the above and take a random sample to estimate the mean, then $\bar x_2$ only requires the smallest and the largest possible values to appear in your sample, and that's it! Whenever they do, it gives exactly the correct answer, with zero error!
$\bar x_1$, on the other hand, is sensitive to every single value you draw, so it "fluctuates" around the true value. Moreover, even when the highest (or, equivalently, the lowest) value doesn't appear in your sample, $\bar x_2$ will still almost always be closer to the true mean than $\bar x_1$. $\bar x_1$ wins only if your sample happens to be centered around 6, which is less likely than the other possible scenarios.
In candy-factory terms: if you try to predict the "average candy" in each bag, it is better to take the average of the smallest and the largest candy you have seen so far than to average all the candies in every bag you open and revise your prediction (and hence your error) after every bag.
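The intuition above can be checked with a quick Monte Carlo simulation (a sketch, assuming a continuous $\mathrm{unif}(1,11)$ and samples of size $n=10$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare the two estimators of the mean of unif(1, 11):
# x1_bar = sample mean, x2_bar = midrange (average of sample min and max).
n, reps, true_mean = 10, 100_000, 6.0
samples = rng.uniform(1, 11, size=(reps, n))

x1_bar = samples.mean(axis=1)
x2_bar = (samples.min(axis=1) + samples.max(axis=1)) / 2

mse1 = np.mean((x1_bar - true_mean) ** 2)
mse2 = np.mean((x2_bar - true_mean) ** 2)
print(f"MSE of sample mean: {mse1:.4f}")  # theory: (10^2/12)/n       ~ 0.833
print(f"MSE of midrange:    {mse2:.4f}")  # theory: 10^2/(2(n+1)(n+2)) ~ 0.379
```

For a uniform distribution the midrange has roughly half the MSE of the sample mean at $n=10$, matching the intuition that the extremes carry most of the information about the endpoints.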
Best Answer
If you have two competing estimators $\hat \theta_1$ and $\hat \theta_2$, whether or not $$ {\rm MSE}(\hat \theta_1) < {\rm MSE}(\hat \theta_2) $$ tells you that $\hat \theta_1$ is the better estimator depends entirely on your definition of "best". For example, if you are comparing unbiased estimators and by "better" you mean has lower variance then, yes, this would imply that $\hat \theta_1$ is better. $\rm MSE$ is a popular criterion because of its connection with Least Squares and the Gaussian log-likelihood but, like many statistical criteria, one should be cautious about using $\rm MSE$ blindly as a measure of estimator quality without paying attention to the application.
There are certain situations where choosing an estimator to minimize ${\rm MSE}$ may not be a particularly sensible thing to do. Two scenarios come to mind:
If there are very large outliers in a data set then they can affect MSE drastically, and thus the estimator that minimizes the MSE can be unduly influenced by such outliers. In such situations, the fact that an estimator minimizes the MSE doesn't really tell you much since, if you removed the outlier(s), you could get a wildly different estimate. In that sense, the MSE is not "robust" to outliers. In the context of regression, this fact is what motivated the Huber M-Estimator (that I discuss in this answer), which minimizes a different criterion function (a mixture between squared error and absolute error) when there are long-tailed errors.
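To see the robustness point concretely, here is a small sketch with hypothetical data (five points near 2 plus one gross outlier at 50); the Huber criterion is minimized by brute-force grid search just for illustration, where in practice one would use IRLS or a numerical optimizer:

```python
import numpy as np

# Hypothetical data: five points near 2 plus one gross outlier at 50.
data = np.array([1.8, 2.1, 2.0, 1.9, 2.2, 50.0])
delta = 1.345  # a conventional Huber tuning constant

# Least-squares location estimate (minimizes SSE) is just the mean.
mean_est = data.mean()  # dragged all the way to 10.0 by the outlier

# Huber criterion: quadratic for small residuals, linear for large ones,
# so the outlier's influence is capped.
grid = np.linspace(0, 60, 60_001)
r = np.abs(data[:, None] - grid[None, :])
losses = np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)).sum(axis=0)
huber_est = grid[np.argmin(losses)]

print(f"mean (least squares): {mean_est:.2f}")   # 10.00
print(f"Huber M-estimate:     {huber_est:.2f}")  # ~2.27, barely moved
```

The least-squares estimate is pulled far from the bulk of the data, while the Huber estimate stays near 2 because the outlier's residual only contributes linearly to the criterion.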
If you are estimating a bounded parameter, comparing $\rm MSE$s may not be appropriate since it penalizes overestimation and underestimation differently in that case. For example, suppose you're estimating a variance, $\sigma^2$. Then, if you consciously underestimate the quantity, your $\rm MSE$ can be at most $\sigma^4$, while overestimation can produce an $\rm MSE$ that far exceeds $\sigma^4$, perhaps even by an unbounded amount.
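A tiny numerical illustration of this asymmetry (taking $\sigma^2 = 1$ as a hypothetical value):

```python
# With sigma^2 = 1, any underestimate lies in [0, 1], so its squared
# error is bounded by sigma^4 = 1; overestimates are unbounded.
sigma2 = 1.0
worst_underestimate = 0.0   # the most extreme underestimate possible
overestimate = 10.0         # one of infinitely many possible overestimates

err_under = (worst_underestimate - sigma2) ** 2  # capped at sigma^4 = 1.0
err_over = (overestimate - sigma2) ** 2          # 81.0, far beyond sigma^4
print(err_under, err_over)
```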
To make these drawbacks more clear, I'll give a concrete example of when, because of these issues, the $\rm MSE$ may not be an appropriate measure of estimator quality.
Suppose you have a sample $X_1, ..., X_n$ from a $t$ distribution with $\nu>2$ degrees of freedom and you are trying to estimate the variance, which is $\nu/(\nu-2)$. Consider two competing estimators: $$\hat \theta_{1}: {\rm the \ unbiased \ sample \ variance} $$and $$\hat \theta_{2} = 0,{\rm \ regardless \ of \ the \ data.}$$ Clearly ${\rm MSE}(\hat \theta_{2}) = \frac{\nu^2}{(\nu-2)^2}$, and it is a fact that $$ {\rm MSE}(\hat \theta_{1}) = \begin{cases} \infty &\mbox{if } \nu \leq 4 \\ \frac{\nu^2}{(\nu-2)^2} \left( \frac{2}{n-1}+\frac{6}{n(\nu-4)} \right) & \mbox{if } \nu>4 , \end{cases} $$ which can be derived using the fact discussed in this thread and the properties of the $t$-distribution. Thus the naive estimator outperforms in terms of $\rm MSE$ regardless of the sample size whenever $\nu \leq 4$, which is rather disconcerting. It also outperforms when $\frac{2}{n-1}+\frac{6}{n(\nu-4)} > 1$, but this is only relevant for very small sample sizes. This happens because of the long-tailed nature of the $t$ distribution with small degrees of freedom, which makes $\hat \theta_{1}$ prone to very large values, and the $\rm MSE$ penalizes that overestimation heavily, while $\hat \theta_2$ does not have this problem.
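A Monte Carlo sketch makes the problem visible (assuming $\nu=3$ and $n=20$; since $\nu \leq 4$ here, the true MSE of the sample variance is infinite, so the empirical MSE is large and unstable but reliably dwarfs the naive estimator's):

```python
import numpy as np

rng = np.random.default_rng(42)

# For nu = 3 the true variance is nu/(nu-2) = 3, the constant estimator 0
# has MSE (nu/(nu-2))^2 = 9, and the sample variance has infinite MSE.
nu, n, reps = 3, 20, 100_000
samples = rng.standard_t(nu, size=(reps, n))

true_var = nu / (nu - 2)                  # = 3
theta1 = samples.var(axis=1, ddof=1)      # unbiased sample variance
theta2 = np.zeros(reps)                   # estimator that ignores the data

mse1 = np.mean((theta1 - true_var) ** 2)  # huge/unstable (true value: infinity)
mse2 = np.mean((theta2 - true_var) ** 2)  # exactly 9
print(mse1, mse2)
```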
The bottom line here is that $\rm MSE$ is not an appropriate measure of estimator performance in this scenario. This is clear because the estimator that dominates in terms of $\rm MSE$ is a ridiculous one (particularly since there is no chance that it is correct if there is any variability in the observed data). Perhaps a more appropriate approach (as pointed out by Casella and Berger) would be to choose the variance estimator, $\hat \theta$, that minimizes Stein's Loss:
$$ S(\hat \theta) = \frac{ \hat \theta}{\nu/(\nu-2)} - 1 - \log \left( \frac{ \hat \theta}{\nu/(\nu-2)} \right) $$
which penalizes underestimation equally to overestimation. It also brings us back to sanity, since $S(\hat \theta_2)=\infty$ (the $\log$ term blows up at $\hat \theta_2 = 0$) :)
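A small sketch of Stein's loss for the two estimators (again assuming $\nu=3$, $n=20$, and a hypothetical simulated sample):

```python
import numpy as np

rng = np.random.default_rng(1)
nu, n = 3, 20
true_var = nu / (nu - 2)  # = 3

def stein_loss(theta_hat):
    """Stein's loss relative to the true variance nu/(nu-2)."""
    r = theta_hat / true_var
    return r - 1 - np.log(r)

theta1 = rng.standard_t(nu, size=n).var(ddof=1)  # sample variance
loss1 = stein_loss(theta1)                        # finite, nonnegative
with np.errstate(divide="ignore"):
    loss2 = stein_loss(0.0)                       # log(0) -> infinite loss
print(loss1, loss2)
```

Under Stein's loss the all-zeros estimator is (rightly) judged infinitely bad, while the sample variance incurs a finite penalty.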