Consider *any* location-scale family determined by a "standard" distribution $F$,

$$\Omega_F = \left\{F_{(\mu, \sigma)}: x \to F\left(\frac{x-\mu}{\sigma}\right) \mid \sigma \gt 0\right\}.$$

Assuming $F$ differentiable we readily find that the PDFs are $\frac{1}{\sigma}f\left((x-\mu)/\sigma\right)dx$.

Truncating these distributions to restrict their support between $a$ and $b$, $a \lt b$, means that the PDFs are replaced by

$$f_{(\mu, \sigma; a,b)}(x) = \frac{f\left(\frac{x-\mu}{\sigma}\right)dx}{\sigma C(\mu, \sigma, a, b)}, a \le x \le b$$

(and are zero for all other values of $x$) where $C(\mu, \sigma, a, b) = F_{(\mu,\sigma)}(b) - F_{(\mu,\sigma)}(a)$ is the normalizing factor needed to ensure that $f_{(\mu, \sigma; a, b)}$ integrates to unity. (Note that $C$ is identically $1$ in the absence of truncation.) The log likelihood for iid data $x_i$ therefore is

$$\Lambda(\mu, \sigma) = \sum_i \left[\log{f\left(\frac{x_i-\mu}{\sigma}\right)} - \log{\sigma}-\log{C(\mu, \sigma, a, b)}\right].$$

Critical points (including any global minima) are found where either $\sigma=0$ (a special case I will ignore here) or the gradient vanishes. Using subscripts to denote derivatives, we may formally compute the gradient and write the likelihood equations as

$$\eqalign{
0 &= \frac{\partial\Lambda}{\partial\mu} &= \sum_i \left[\frac{-f_\mu\left(\frac{x_i-\mu}{\sigma}\right)}{f\left(\frac{x_i-\mu}{\sigma}\right)} -\frac{C_\mu(\mu,\sigma,a,b)}{C(\mu,\sigma,a,b)}\right] \\
0 &= \frac{\partial\Lambda}{\partial\sigma} &= \sum_i \left[\frac{-f_\sigma\left(\frac{x_i-\mu}{\sigma}\right)}{\sigma^2f\left(\frac{x_i-\mu}{\sigma}\right)} -\frac{1}{\sigma}-\frac{C_\sigma(\mu,\sigma,a,b)}{C(\mu,\sigma,a,b)}\right]
}$$

Because $a$ and $b$ are fixed, drop them from the notation and write $nC_\mu(\mu, \sigma, a, b)/C(\mu, \sigma,a,b)$ as $A(\mu,\sigma)$ and $nC_\sigma(\mu, \sigma, a, b)/C(\mu, \sigma,a,b)$ as $B(\mu, \sigma)$. (With no truncation, both functions would be identically zero.) Separating the terms involving the data from the rest gives

$$\eqalign{
-A(\mu,\sigma) &= \sum_i \frac{f_\mu\left(\frac{x_i-\mu}{\sigma}\right)}{f\left(\frac{x_i-\mu}{\sigma}\right)} \\
-\sigma^2 B(\mu,\sigma) - n\sigma &= \sum_i \frac{f_\sigma\left(\frac{x_i-\mu}{\sigma}\right)}{f\left(\frac{x_i-\mu}{\sigma}\right)}
}$$

By comparing these to the no-truncation situation it is evident that

Any sufficient statistics for the original problem are sufficient for the truncated problem (because the right hand sides have not changed).

Our ability to find closed-form solutions relies on the tractability of $A$ and $B$. If these do not involve $\mu$ and $\sigma$ in simple ways, we cannot hope to obtain closed-form solutions in general.

For the case of a normal family, $C(\mu,\sigma,a,b)$ of course is given by the cumulative normal PDF, which is a difference of error functions: there is no chance that a closed-form solution can be obtained in general. However, there are only two sufficient statistics (the sample mean and variance will do) and the CDF is as smooth as can be, so numerical solutions will be relatively easy to obtain.

You can use a sum-of-squares argument to see this.

$$\sum_i (x_i-\theta)^2 = \sum_i (x_i - \bar{x}+\bar{x}-\theta)^2 = \sum_i (x_i-\bar{x})^2+n(\bar{x}-\theta)^2+\\\color{red}{2(\bar{x}-\theta)\sum_i(x_i-\bar{x})}$$

Now, $\bar{x}$ is defined so that the cross term becomes zero since $\sum_i x_i = \sum_i \bar{x}$.

We are therefore left with:

$$\sum_i (x_i - \theta)^2 = \sum_i (x_i-\bar{x})^2+n(\bar{x}-\theta)^2$$

This is the variance/bias decomposition of the squared-error associated with estimator $\theta$. Since both terms on the RHS are positive and only the bias term depends on $\theta$, we have:

$$\sum_i (x_i - \theta)^2 > \sum_i (x_i - \bar{x})^2$$

## Best Answer

An alternate proof for $\widehat{\Sigma}$ that takes the derivative with respect to $\Sigma$ directly:

Picking up with the log-likelihood as above: \begin{eqnarray} \ell(\mu, \Sigma) &=& C - \frac{m}{2}\log|\Sigma|-\frac{1}{2} \sum_{i=1}^m \text{tr}\left[(\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1} (\mathbf{x}^{(i)}-\mu)\right]\\ &=&C - \frac{1}{2}\left(m\log|\Sigma| + \sum_{i=1}^m\text{tr} \left[(\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T\Sigma^{-1} \right]\right)\\ &=&C - \frac{1}{2}\left(m\log|\Sigma| +\text{tr}\left[ S_\mu \Sigma^{-1} \right] \right) \end{eqnarray} where $S_\mu = \sum_{i=1}^m (\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T$ and we have used the cyclic and linear properties of $\text{tr}$. To compute $\partial \ell /\partial \Sigma$ we first observe that $$ \frac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-T}=\Sigma^{-1} $$ by the fourth property above. To take the derivative of the second term we will need the property that $$ \frac{\partial}{\partial X}\text{tr}\left( A X^{-1} B\right) = -(X^{-1}BAX^{-1})^T. $$ (from The Matrix Cookbook, equation 63). Applying this with $B=I$ we obtain that $$ \frac{\partial}{\partial \Sigma}\text{tr}\left[S_\mu \Sigma^{-1}\right] = -\left( \Sigma^{-1} S_\mu \Sigma^{-1}\right)^T = -\Sigma^{-1} S_\mu \Sigma^{-1} $$ because both $\Sigma$ and $S_\mu$ are symmetric. Then $$ \frac{\partial}{\partial \Sigma}\ell(\mu, \Sigma) \propto m \Sigma^{-1} - \Sigma^{-1} S_\mu \Sigma^{-1}. $$ Setting this to 0 and rearranging gives $$ \widehat{\Sigma} = \frac{1}{m}S_\mu. $$

This approach is more work than the standard one using derivatives with respect to $\Lambda = \Sigma^{-1}$, and requires a more complicated trace identity. I only found it useful because I currently need to take derivatives of a modified likelihood function for which it seems much harder to use $\partial/{\partial \Sigma^{-1}}$ than $\partial/\partial \Sigma$.