Christoph Hanck has not posted the details of his proposed example. I take it he means the uniform distribution on the interval $[0,\theta],$ based on an i.i.d. sample $X_1,\ldots,X_n$ of size $n>1.$

The mean is $\theta/2$.

The MLE of the mean is $\max\{X_1,\ldots,X_n\}/2.$

That is biased since $\Pr(\max < \theta) = 1,$ so $\operatorname{E}({\max}/2)<\theta/2.$
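A quick simulation makes the bias visible. This is only a sketch: the values $\theta=2$ and $n=5$, the number of replications, and the seed are arbitrary choices for illustration, and the function name is made up. Since $\operatorname{E}(\max)=\frac{n}{n+1}\theta,$ the average of ${\max}/2$ should settle near $\frac{n}{n+1}\cdot\frac\theta 2,$ below the true mean.

```python
import random

def simulate_mle_of_mean(theta=2.0, n=5, reps=20000, seed=0):
    """Average of max(X_1..X_n)/2 over many simulated Uniform(0, theta) samples.

    theta, n, reps and seed are illustrative assumptions, not values from the post.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        total += max(sample) / 2.0  # MLE of the mean theta/2
    return total / reps

theta, n = 2.0, 5
avg_mle = simulate_mle_of_mean(theta, n)
true_mean = theta / 2.0
expected_mle = n / (n + 1) * theta / 2.0  # theoretical expectation of max/2

print(f"true mean theta/2      : {true_mean:.4f}")
print(f"theoretical E(max/2)   : {expected_mle:.4f}")
print(f"simulated avg of max/2 : {avg_mle:.4f}")
```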

**PS:** Perhaps we should note that the best unbiased estimator of the mean $\theta/2$ is *not* the sample mean, but rather is $$\frac{n+1} {2n} \cdot \max\{X_1,\ldots,X_n\}.$$ The sample mean is a lousy estimator of $\theta/2$ because for some samples, the sample mean is less than $\dfrac 1 2 \max\{X_1,\ldots,X_n\},$ and it is clearly impossible for $\theta/2$ to be less than ${\max}/2.$

**end of PS**
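To see how the two estimators of $\theta/2$ compare in practice, here is a small simulation sketch. The settings ($\theta=2,$ $n=5,$ the seed, the replication count) and the function name are my own assumptions for illustration. It estimates the mean squared error of the sample mean and of $\frac{n+1}{2n}\max,$ and counts how often the sample mean falls below ${\max}/2,$ a value that $\theta/2$ cannot take.

```python
import random

def compare_estimators(theta=2.0, n=5, reps=20000, seed=1):
    """Return (MSE of sample mean, MSE of scaled max, fraction of 'impossible' sample means).

    All default arguments are illustrative assumptions.
    """
    rng = random.Random(seed)
    target = theta / 2.0
    sq_err_mean = 0.0
    sq_err_max = 0.0
    impossible = 0
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        xbar = sum(sample) / n
        biggest = max(sample)
        scaled_max = (n + 1) / (2 * n) * biggest  # unbiased estimator of theta/2
        sq_err_mean += (xbar - target) ** 2
        sq_err_max += (scaled_max - target) ** 2
        if xbar < biggest / 2.0:
            # theta/2 >= max/2 always holds, so this sample mean is an impossible value
            impossible += 1
    return sq_err_mean / reps, sq_err_max / reps, impossible / reps

mse_mean, mse_max, frac_impossible = compare_estimators()
print(f"MSE of sample mean        : {mse_mean:.4f}")
print(f"MSE of (n+1)/(2n) * max   : {mse_max:.4f}")
print(f"share with mean < max / 2 : {frac_impossible:.3f}")
```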

I suspect the Pareto distribution is another such case. Here's the probability measure:
$$
\alpha\left( \frac \kappa x \right)^\alpha\ \frac{dx} x \text{ for } x >\kappa.
$$
The expected value is $\dfrac \alpha {\alpha -1 } \kappa.$ The MLE of the expected value is
$$
\frac n {n - \sum_{i=1}^n \big((\log X_i) - \log(\min)\big)} \cdot \min
$$
where $\min = \min\{X_1,\ldots,X_n\}.$

I haven't worked out the expected value of the MLE for the mean, so I don't know what its bias is.
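For anyone who wants to experiment with it numerically, here is a minimal sketch of the estimator above. The function name and the sample values are made up for illustration. Note one assumption of mine: the formula only makes sense when the sum of log-ratios is less than $n$ (equivalently, when the estimated $\alpha$ exceeds $1$), so the sketch returns infinity otherwise.

```python
import math

def pareto_mean_mle(sample):
    """Plug-in MLE of the Pareto mean alpha/(alpha-1)*kappa, per the formula above."""
    n = len(sample)
    smallest = min(sample)  # MLE of kappa
    log_sum = sum(math.log(x) - math.log(smallest) for x in sample)
    if log_sum >= n:
        # Estimated alpha = n / log_sum is <= 1: the fitted distribution has
        # no finite mean. Returning inf here is an assumption of this sketch.
        return math.inf
    return n / (n - log_sum) * smallest

example = [1.2, 1.5, 2.0, 3.1, 1.1]  # made-up data, not from any real source
estimate = pareto_mean_mle(example)
print(f"MLE of the Pareto mean: {estimate:.4f}")
```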

## Best Answer

Given the assumptions, the ML estimator is the value of the parameter that has the best chance of producing the data set.

Bias is about expectations of sampling distributions. "Most likely to produce the data" isn't about expectations of sampling distributions. Why would they be expected to go together?

What is the basis on which it is surprising they don't necessarily correspond?

I'd suggest you consider some simple cases of MLE and ponder how the difference arises in those particular cases.

As an example, consider observations from a uniform distribution on $(0,\theta)$. The largest observation is (necessarily) no bigger than the parameter, so the parameter can only take values at least as large as the largest observation.

When you consider the likelihood for $\theta$, it is (obviously) larger the closer $\theta$ is to the largest observation. So it's maximized at the largest observation; that's clearly the estimate for $\theta$ that maximizes the chance of obtaining the sample you got.

But on the other hand it must be biased, since the largest observation is obviously (with probability 1) smaller than the true value of $\theta$; any other estimate of $\theta$ not already ruled out by the sample itself must be larger than it, and must (quite plainly in this case) be less likely to produce the sample.

The expectation of the largest observation from a $U(0,\theta)$ is $\frac{n}{n+1}\theta$, so the usual way to unbias it is to take as the estimator of $\theta$: $\hat\theta=\frac{n+1}{n}X_{(n)}$, where $X_{(n)}$ is the largest observation.
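In case the expectation isn't familiar, it follows in a couple of lines from the distribution of the maximum of $n$ independent $U(0,\theta)$ observations. For $0\le x\le\theta$:
$$
\Pr(X_{(n)}\le x)=\left(\frac x\theta\right)^n,
\qquad\text{so the density of } X_{(n)} \text{ is }\ \frac{n\,x^{n-1}}{\theta^n},
$$
and therefore
$$
\operatorname{E}\big(X_{(n)}\big)=\int_0^\theta x\cdot\frac{n\,x^{n-1}}{\theta^n}\,dx
=\frac{n}{\theta^n}\cdot\frac{\theta^{n+1}}{n+1}
=\frac{n}{n+1}\,\theta .
$$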

This lies to the right of the MLE, and so has lower likelihood.
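A small numerical sketch of that last point, with a made-up sample and a hypothetical helper name: for a $U(0,\theta)$ sample the likelihood is $\theta^{-n}$ when $\theta$ is at least the largest observation and zero otherwise, so moving from the MLE to the bias-corrected estimate shrinks the likelihood by a factor of $\left(\frac{n}{n+1}\right)^n$.

```python
def uniform_likelihood(theta, sample):
    """Likelihood of theta for an i.i.d. Uniform(0, theta) sample."""
    if theta < max(sample):
        return 0.0  # theta is ruled out by the data
    return theta ** (-len(sample))

sample = [0.3, 1.1, 0.7, 1.6, 0.9]  # made-up observations for illustration
n = len(sample)

mle = max(sample)              # maximum likelihood estimate of theta
unbiased = (n + 1) / n * mle   # bias-corrected estimate

lik_mle = uniform_likelihood(mle, sample)
lik_unbiased = uniform_likelihood(unbiased, sample)
ratio = lik_unbiased / lik_mle  # equals (n / (n + 1)) ** n up to rounding

print(f"likelihood at MLE               : {lik_mle:.5f}")
print(f"likelihood at unbiased estimate : {lik_unbiased:.5f}")
print(f"ratio                           : {ratio:.5f}")
```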