If you wanted to show only that the sample mean has a smaller variance than every other weighted average of the observations whose weights sum to $1$ (i.e. every other unbiased linear estimator), then this would be an exercise in Lagrange multipliers. But if you want to include all unbiased estimators of $\mu$ based on $X_1,\ldots,X_n$ (for example, the sample median is one such estimator, and it is not a weighted average of the observations), then the argument comes down to the one-to-one (injective) nature of the two-sided Laplace transform.
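To see concretely why the wider class matters, here is a minimal simulation sketch comparing the variance of the sample mean with that of the sample median for normal data; the seed, sample size, and replication count are arbitrary choices of mine, not part of the original argument.

```python
import numpy as np

# Minimal sketch: for N(mu, 1) data with odd n, the sample median is also an
# unbiased estimator of mu (by symmetry), but it is not a weighted average of
# the observations, and it has a larger variance than the sample mean.
rng = np.random.default_rng(0)
mu, n, reps = 3.0, 25, 200_000
samples = rng.normal(mu, 1.0, size=(reps, n))

print("variance of sample mean  :", samples.mean(axis=1).var())        # about 1/n = 0.04
print("variance of sample median:", np.median(samples, axis=1).var())  # larger: roughly pi/(2n)
```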
Observe that the conditional distribution of $(X_1,\ldots,X_n)$ given $\bar X = (X_1+\cdots+X_n)/n$ does not depend on $\mu$. (I could add the details of how to find the conditional distribution if necessary.) In other words, the sample mean $\bar X$ is a sufficient statistic for $\mu$. Therefore, the Rao–Blackwell theorem tells us that a minimum-variance unbiased estimator need only be sought among functions of $\bar X$.
Therefore it is enough to show that the only function $g(\bar X)$ of $\bar X$ that is an unbiased estimator of $\mu$ is $\bar X$ itself. (Of course, the function $g$ is not allowed to depend on $\mu$; i.e. $g(\bar X)$ must actually be a statistic.)
The density function of $\bar X$ is
$$
x\mapsto \text{constant}\cdot \exp\left(\frac{-1}{2}\cdot\left(\frac{x-\mu}{1/\sqrt{n}}\right)^2\right).
$$
In order that the function $g(\bar X)$ be an unbiased estimator of $\mu$, we need $g(\bar X)-\bar X$ to be an unbiased estimator of $0$. Let $h(x) = g(x)-x$; then we must have
$$
\int_{-\infty}^\infty (\text{same constant})\cdot h(x) \exp\left(\frac{-1}{2}\cdot\left(\frac{x-\mu}{1/\sqrt{n}}\right)^2\right) \, dx = 0
$$
for all values of $\mu$. Hence
$$
\text{same constant}\cdot \exp\left(\frac{-n\mu^2}{2}\right) \cdot \int_{-\infty}^\infty \left(h(x) \exp\left(\frac{-n}{2} x^2\right)\right) \exp\left(nx\mu\right) \, dx = 0
$$
regardless of the value of $\mu$. Since the prefactor $\text{constant}\cdot\exp(-n\mu^2/2)$ is never zero, this says that the two-sided Laplace transform of the function
$$
x\mapsto h(x)\exp\left( \frac{-nx^2}{2} \right)
$$
is $0$ at every point (its argument is $-n\mu$, which ranges over all of $\mathbb R$ as $\mu$ does).
Since the two-sided Laplace transform is one-to-one, the only function it can map to the identically zero transform is the zero function. Hence $h(x)\exp(-nx^2/2) = 0$ for (almost) every $x$, so $h = 0$, i.e. $g(\bar X) = \bar X$, as claimed.
Question 2 is answered in the Wikipedia article about Bessel's correction.
The meaning of the standard deviation is largely the same as the meaning of the mean absolute deviation:
- Both are translation invariant, i.e. if you add the same number to every data point, you don't change the measure of dispersion, whether it is the standard deviation or the mean absolute deviation; and
- Both are equivariant under multiplication by non-negative numbers, i.e. if you multiply every data point by the same non-negative number, then the measure of dispersion is multiplied by that number. (If the multiplier is allowed to be negative, then the measure of dispersion is multiplied by its absolute value.) Both properties are checked numerically in the sketch below.
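Here is a minimal numerical check of both properties; the sample values are arbitrary, and the mean absolute deviation is hand-rolled since numpy has no built-in version about the mean.

```python
import numpy as np

def mad(v):
    # mean absolute deviation about the sample mean
    return np.mean(np.abs(v - v.mean()))

x = np.array([2.0, 5.0, 7.0, 11.0])

print(np.std(x + 100), np.std(x))      # equal: translation invariance
print(mad(x + 100), mad(x))            # equal
print(np.std(3 * x), 3 * np.std(x))    # equal: scaling by a non-negative factor
print(mad(-3 * x), 3 * mad(x))         # equal: absolute value of a negative factor
```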
However, the standard deviation enjoys one great advantage over the mean absolute deviation: the variance (the square of the standard deviation) of the sum of independent random variables is the sum of their variances. For example, suppose you toss a coin $1800$ times. What is the variance of the probability distribution of the number of heads? You can find it easily, whereas you can't do the same with the mean absolute deviation. That makes it possible to find the probability that the number of heads is between 895 and 912, by using the central limit theorem.
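A minimal sketch of that computation; the continuity correction and the use of the normal approximation via `math.erf` are my own choices for the illustration.

```python
import math

# Worked version of the coin-toss example: 1800 fair tosses.
n, p = 1800, 0.5
var_heads = n * p * (1 - p)          # variances of independent tosses add: 450
sd_heads = math.sqrt(var_heads)      # about 21.2

# Normal (CLT) approximation, with a continuity correction, of P(895 <= heads <= 912).
def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mean_heads = n * p                   # 900
prob = Phi((912.5 - mean_heads) / sd_heads) - Phi((894.5 - mean_heads) / sd_heads)
print(var_heads, sd_heads, round(prob, 3))
```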
A subtler advantage is also enjoyed by the standard deviation. Suppose the population standard deviation is $\sigma$, and you can't observe it but must estimate it based on a sample. You can multiply the sample standard deviation by a particular constant (I don't remember its value offhand) to get an unbiased estimate of $\sigma$, and you can multiply the mean absolute deviation by a different constant to get another unbiased estimate of $\sigma$. Which one is more accurate, in the sense of having a smaller mean squared error? It is the one based on the sample standard deviation if the population is normally distributed. However, one way in which this advantage is subtler is that it may be lost if there is a slight deviation from normality. I seem to recall that with a mixture of normal distributions with different variances, with mixing weights $0.99$ and $0.01$, that advantage may be lost.
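Here is a minimal simulation sketch of that comparison; the contamination fraction, the factor of $3$ in the contaminated scale, the sample size, and the $\sqrt{\pi/2}$ rescaling are all my own illustrative choices, so this is a sketch of the idea rather than a reproduction of the exact result alluded to above.

```python
import numpy as np

# Sketch: compare the mean squared error of two estimators of the population
# standard deviation sigma -- one based on the sample standard deviation, one on
# the mean absolute deviation rescaled by sqrt(pi/2) (roughly unbiased under
# normality) -- for a clean normal population and for a 1% contaminated one.
rng = np.random.default_rng(1)

def compare(contaminate, n=20, reps=100_000):
    x = rng.normal(0.0, 1.0, size=(reps, n))
    sigma_true = 1.0
    if contaminate:
        # 1% of observations come from a normal with 3 times the standard deviation.
        mask = rng.random((reps, n)) < 0.01
        x = np.where(mask, 3.0 * x, x)
        sigma_true = np.sqrt(0.99 * 1.0 + 0.01 * 9.0)   # sd of the mixture
    sd_est = x.std(axis=1, ddof=1)
    mad = np.mean(np.abs(x - x.mean(axis=1, keepdims=True)), axis=1)
    mad_est = np.sqrt(np.pi / 2) * mad
    return np.mean((sd_est - sigma_true) ** 2), np.mean((mad_est - sigma_true) ** 2)

print("normal population,      MSE (sd-based, mad-based):", compare(False))
print("1% contaminated normal, MSE (sd-based, mad-based):", compare(True))
```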
Best Answer
Clearly not for $n=1$, where you always get $0$, and (less dramatically) not for larger $n$ either.
$\frac{\sum_{i=1}^n\left|X_i-\bar{X}\right|}{n}$ faces the same issue as $\frac{\sum_{i=1}^n(X_i-\bar{X})^2}{n}$ in that $\bar X$ tends to be closer to the $X_i$ than $\mu$ is.
For a normal distribution (but not others) $\mathbb E\left[\frac{\sum_{i=1}^n|X_i-\mu|}{n}\right] =\sqrt{\frac{2}{\pi}} \sigma$, while it seems empirically $\mathbb E\left[\frac{\sum_{i=1}^n\left|X_i-\bar{X}\right|}{n}\right] $ $= \sqrt{\frac{n-1}{n}} \sqrt{\frac{2}{\pi}} \sigma$ or close to that.
As an illustration, with a standard normal and sample size $n=4$, the expected absolute distance to the sample average seems to be closer to $\sqrt{\frac3{2\pi}} \approx 0.691$ than the expected absolute distance to the mean of $\sqrt{\frac2{\pi}} \approx 0.798$:
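A minimal simulation sketch of that comparison (the seed and the number of replications are arbitrary):

```python
import numpy as np

# n = 4 draws from a standard normal, repeated many times; the population mean is 0.
rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=(1_000_000, 4))

print(np.mean(np.abs(x - 0.0)))                            # ~ sqrt(2/pi)    ~ 0.798
print(np.mean(np.abs(x - x.mean(axis=1, keepdims=True))))  # ~ sqrt(3/(2pi)) ~ 0.691
```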
For comparison, for a uniform distribution on $[a,b]$, you have $\mathbb E\left[\frac{\sum_{i=1}^n|X_i-\mu|}{n}\right] =\frac{b-a}{4}$, while it seems empirically $\mathbb E\left[\frac{\sum_{i=1}^n\left|X_i-\bar{X}\right|}{n}\right] $ $= \left(1-\frac{2}{3n}\right)\frac{b-a}{4}$ or close to that at least with $n\ge 2$. Another simulation of $U(0,1)$, again with $n=4$, shows the expected absolute distance to the sample average seems to be closer to $\frac5{24} \approx 0.208$ than the expected absolute distance to the mean of $\frac14=0.25$:
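The corresponding sketch for $U(0,1)$, again with arbitrary seed and replication count:

```python
import numpy as np

# n = 4 draws from U(0, 1), repeated many times; the population mean is 1/2.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=(1_000_000, 4))

print(np.mean(np.abs(x - 0.5)))                            # ~ (b - a)/4 = 0.25
print(np.mean(np.abs(x - x.mean(axis=1, keepdims=True))))  # ~ 5/24 ~ 0.208
```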