Solved – Is the sample mean the “best” estimate of the distribution mean in some sense

estimation, expected value, law-of-large-numbers

By the (weak/strong) law of large numbers, given iid sample points $\{x_i \in \mathbb{R}^n, i=1,\ldots,N\}$ from a distribution, their sample mean $f^*(\{x_i, i=1,\ldots,N\}):=\frac{1}{N} \sum_{i=1}^N x_i$ converges to the distribution mean both in probability and a.s. as the sample size $N$ goes to infinity.

When the sample size $N$ is fixed, I wonder whether the LLN estimator $f^*$ is best in some sense.
For example,

  1. its expectation is the distribution mean, so it is an unbiased estimator. Its variance is $\frac{\sigma^2}{N}$ where $\sigma^2$ is the distribution variance. But is it UMVU?
  2. is there some function $l_0: \mathbb{R}^n \times \mathbb{R}^n \rightarrow [0,\infty)$ such that $f^*(\{x_i, i=1,\ldots,N\})$ solves the minimization problem
     $$ f^*(\{x_i, i=1,\ldots,N\}) = \operatorname{argmin}_{u \in \mathbb{R}^n} \quad \sum_{i=1}^N l_0(x_i, u)\,? $$

    In other words, $f^*$ is best with respect to some contrast function $l_0$ in the minimum contrast framework (cf. Section 2.1, "Basic Heuristics of Estimation", in *Mathematical Statistics: Basic Ideas and Selected Topics, Volume 1* by Bickel and Doksum).

    For example, if the distribution is known/restricted to be from the family of Gaussian distributions, then the sample mean is the MLE of the distribution mean; the MLE belongs to the minimum contrast framework, with contrast function $l_0$ equal to the negative log-likelihood (a short sketch of this reduction is given right after this list).

  3. is there some function $l: \mathbb{R}^n \times F \rightarrow [0,\infty)$ such that $f^*$ solves the minimization problem
     $$ f^* = \operatorname{argmin}_{f} \quad \operatorname{E}_{\text{iid }\{x_i, i=1,\ldots,N\} \text{ each with distribution } P}\; l(f(\{x_i, i=1,\ldots,N\}), P) $$
     for every distribution $P$ of the $x_i$ within some family $F$ of distributions?

    In other words, $f^*$ is best with respect to some loss function $l$ and some family $F$ of distributions in the decision theoretic framework (cf. Section 1.3, "The Decision Theoretic Framework", in *Mathematical Statistics: Basic Ideas and Selected Topics, Volume 1* by Bickel and Doksum).

Note that these are the three interpretations of a "best" estimator that I know of so far. If you know of other interpretations that may apply to the LLN estimator, please don't hesitate to mention them as well.

Best Answer

The answer to your second question is yes: the sample mean is a minimum contrast estimator with contrast function $l_0(x,u) = (x-u)^2$ when $x$ and $u$ are real numbers, or $l_0(x,u) = (x-u)'(x-u)$ when they are column vectors. This follows from least-squares theory or differential calculus.
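To spell out the calculus step (a one-line sketch, written for the vector case): setting the gradient of the contrast sum to zero gives

$$ \nabla_u \sum_{i=1}^N (x_i - u)'(x_i - u) \;=\; -2 \sum_{i=1}^N (x_i - u) \;=\; 0 \quad\Longleftrightarrow\quad u = \frac{1}{N}\sum_{i=1}^N x_i, $$

and since the objective is strictly convex in $u$, this stationary point is the unique minimizer, namely the sample mean $f^*$.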

A minimum contrast estimator is, under certain technical conditions, both consistent and asymptotically normal. For the sample mean, this already follows from the LLN and the central limit theorem. I don't know that minimum contrast estimators are "optimal" in any way. What's nice about minimum contrast estimators is that many robust estimators (e.g. the median, Huber estimators, sample quantiles) fall into this family, and we can conclude that they are consistent and asymptotically normal just by applying the general theorem for minimum contrast estimators, so long as we check some technical conditions (though often this is much more difficult than it sounds).

One notion of optimality that you don't mention in your question is efficiency, which, roughly speaking, is about how large a sample you need to get an estimate of a certain quality. See http://en.wikipedia.org/wiki/Efficiency_(statistics)#Asymptotic_efficiency for a comparison of the efficiency of the mean and the median (the mean is more efficient, but the median is more robust to outliers).
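As a quick illustration of that efficiency comparison (a small Monte Carlo sketch I am adding here, not part of the linked article): for normal data the sample median has asymptotic variance about $(\pi/2)\,\sigma^2/N$ versus $\sigma^2/N$ for the sample mean, i.e. a relative efficiency of roughly $2/\pi \approx 0.64$, and a short simulation reproduces this.

```python
import numpy as np

rng = np.random.default_rng(0)
N, reps = 100, 20_000  # sample size and number of Monte Carlo replications

# reps independent samples of size N from a standard normal (true mean = 0)
samples = rng.standard_normal((reps, N))

mean_est = samples.mean(axis=1)          # sample mean of each replication
median_est = np.median(samples, axis=1)  # sample median of each replication

print("var(sample mean)    ~", mean_est.var())    # about 1/N = 0.01
print("var(sample median)  ~", median_est.var())  # about (pi/2)/N ~ 0.0157
print("efficiency of median ~", mean_est.var() / median_est.var())  # about 2/pi ~ 0.64
```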

For the third question, without some restriction on the set of functions $f$ over which you are taking the argmin, I don't think the sample mean will be optimal. For any fixed distribution $P$, you can take $f$ to be a constant that ignores the $x_i$'s and minimizes the loss for that particular $P$; the sample mean can't beat that.
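To make that concrete (a small worked example; squared-error loss is my assumption here, since no particular $l$ is fixed in the question): take $l(a, P) = \|a - \mu_P\|^2$, where $\mu_P$ and $\Sigma_P$ are the mean and covariance of $P$. Then for a single fixed $P$,

$$ \operatorname{E}_P\!\left[\,l\big(f(\{x_i\}), P\big)\right] \;=\; \begin{cases} 0, & f \equiv \mu_P \text{ (the constant estimator)},\\[4pt] \operatorname{tr}(\Sigma_P)/N, & f = f^* \text{ (the sample mean)}, \end{cases} $$

so the constant estimator wins for that particular $P$, even though it is useless for every other distribution in $F$.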

Minimax optimality is a weaker condition than the one you give: instead of asking that $f^*$ be the best function for every $P$ in a class, you can ask that $f^*$ have the best worst-case performance. That is, between the argmin and the expectation, put in a $\max_{P\in F}$. Bayesian optimality is another approach: put a prior distribution on $P\in F$, and take the expectation over $P$ as well as over the sample from $P$.
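As one concrete instance of the Bayesian route (a sketch under assumptions I am adding: real-valued data, squared-error loss, $x_i \mid \theta \sim N(\theta, \sigma^2)$ with $\sigma^2$ known, and a conjugate prior $\theta \sim N(\mu_0, \tau^2)$): the Bayes-optimal estimator is the posterior mean,

$$ \operatorname{E}[\theta \mid x_1,\ldots,x_N] \;=\; \frac{\tau^2}{\tau^2 + \sigma^2/N}\,\bar{x} \;+\; \frac{\sigma^2/N}{\tau^2 + \sigma^2/N}\,\mu_0, $$

which shrinks the sample mean toward the prior mean $\mu_0$ and coincides with $f^*$ only in the flat-prior limit $\tau^2 \to \infty$.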
