Solved – Why is the definition of a consistent estimator the way it is? What about alternative definitions of consistency

consistency, machine learning, mathematical-statistics

Quote from wikipedia:

In statistics, a consistent estimator or asymptotically consistent
estimator is an estimator—a rule for computing estimates of a
parameter $\theta^*$—having the property that as the number of data points
used increases indefinitely, the resulting sequence of estimates
converges in probability to $\theta^*$.

To make this statement precise, let $\theta^*$ be the value of the true parameter you want to estimate and let $\hat\theta(S_n)$ be the rule for estimating this parameter as a function of the data $S_n$. Then the definition of consistency of an estimator can be expressed in the following way:

$$\lim_{n \to \infty} \Pr\big[\,|\hat{\theta}(S_{n}) - \theta^*|\geq \epsilon \,\big]=0$$
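For concreteness, here is a minimal simulation sketch (a toy example of my own, not part of the quoted definition, assuming the estimator is the sample mean $\bar X_n$ of i.i.d. $N(\theta^*, 1)$ data) of what this convergence in probability looks like: the deviation probability is estimated by Monte Carlo and shrinks toward $0$ as $n$ grows.

```python
# Toy illustration (my own, assuming the estimator is the sample mean of
# i.i.d. N(theta_star, 1) draws). For each sample size n we estimate
# Pr[|theta_hat(S_n) - theta_star| >= eps] by Monte Carlo and watch it
# shrink toward 0 as n grows.
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0   # true parameter
eps = 0.1          # tolerance epsilon in the consistency definition
n_reps = 2_000     # Monte Carlo repetitions per sample size

for n in [10, 100, 1_000, 10_000]:
    samples = rng.normal(theta_star, 1.0, size=(n_reps, n))
    theta_hat = samples.mean(axis=1)                        # estimator on each S_n
    p_dev = np.mean(np.abs(theta_hat - theta_star) >= eps)  # estimated deviation probability
    print(f"n = {n:>6}: estimated Pr[|theta_hat - theta*| >= {eps}] = {p_dev:.4f}")
```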

My question may seem superficial at first sight, but here it is: why was the word "consistency/consistent" chosen to describe this behaviour of an estimator?

The reason I care about this is that, to me, the word consistent intuitively means something different (or at least it seems different; maybe the two notions can be shown to be equivalent). Let me explain what I mean by way of an example. Say "you" are consistently "good" (for some definition of good). Then consistent means that every time you have a chance to prove/show me that you are good, you indeed do so, every single time (or at least most of the time).

Let's apply my intuition to define consistency of an estimator. Let "you" be the function computing $\hat{\theta}$ and let "good" mean how far you are from the true parameter $\theta^*$ (say in the $l_1$ norm sense, why not). Then a better definition of consistency would be:

$$\forall n,\ \forall S_n:\quad \Pr\big[\,|\hat{\theta}(S_{n}) - \theta^*|\geq \epsilon \,\big] < \delta$$

Even though it might be a less useful definition of consistency, it makes more sense to me given how I would define consistency, because for any training/sample set you throw at my estimator $\hat\theta$, I will be able to do a good job, i.e. I will consistently do well. I am aware that it's a little unrealistic to require this for all $n$ (probably impossible), but we can fix this definition by saying:

$$\exists n_0,\ \forall n \geq n_0,\ \forall S_n:\quad \Pr\big[\,|\hat{\theta}(S_{n}) - \theta^*|\geq \epsilon \,\big] < \delta$$

i.e. for sufficiently large $n$, our estimator will not do worse than $\epsilon$ (i.e. will not be more than $\epsilon$ away from the "truth" $\theta^*$). The $n_0$ is trying to capture the intuition that you need at least some number of examples to learn/estimate anything, and once you have reached that number, your estimator will do well most of the time if it's consistent in the way we are trying to define it.
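As a concrete illustration of where such an $n_0$ could come from (my own aside, assuming the estimator is a sample mean $\bar X_n$ of i.i.d. data with mean $\theta^*$ and finite variance $\sigma^2$), Chebyshev's inequality gives

$$\Pr\big[\,|\bar X_n - \theta^*|\geq \epsilon\,\big] \;\leq\; \frac{\sigma^2}{n\,\epsilon^2} \;<\; \delta \quad\text{whenever}\quad n \;>\; \frac{\sigma^2}{\epsilon^2\,\delta},$$

so taking $n_0 = \lfloor \sigma^2/(\epsilon^2\delta)\rfloor + 1$ works for this estimator. Note that here the probability is over the draw of $S_n$, rather than being required for each fixed $S_n$.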

However, the previous definition is too strong; maybe we could allow a low probability of being far from $\theta^*$ for most of the training sets of size $n \geq n_0$ (i.e. not require this for all $S_n$, but over the distribution of $S_n$, or something like that). So we would have a high error only very rarely, for most of the sample/training sets we could get.

Anyway, my question is: are these proposed definitions of "consistency" actually the same as the "official" definition of consistency, with the equivalence just being hard to prove? If you know the proof, please share it! Or is my intuition completely off, and is there a deeper reason for choosing the definition of consistency in the way it is usually defined? Why is ("official") consistency defined the way it is?

One of my thoughts towards a candidate proof of some sort of equivalence, or at least similarity, between my notion of consistency and the accepted notion would be to unravel the limit in the official definition of consistency using the $(\epsilon, \delta)$-definition of a limit. However, I was not 100% sure how to do that, and even when I tried, the official definition of consistency does not seem to talk explicitly about all potential training/sample sets. Since I believe they are equivalent, is the official definition I provided incomplete (i.e. why does it not talk about all the different data sets that could generate our samples)?

One of my last thoughts is that any definition we supply should also be precise about which probability distribution we are talking about: is it $P_x$ or is it $P_{S_n}$? I think a candidate definition should also be precise about whether whatever it guarantees holds with respect to some fixed distribution or with respect to all possible distributions over the training sets… right?

Best Answer

Consider the second tentative statement by the OP, slightly modified,

$$\forall\, \theta\in \Theta,\ \epsilon>0,\ \delta>0,\ S_n,\ \exists\, n_0(\theta, \epsilon, \delta): \forall\, n \geq n_0,\\ P_n\big[\,|\hat \theta(S_{n}) - \theta^*|\geq \epsilon \,\big] < \delta \tag{1}$$

We are examining the sequence of real numbers, bounded in $[0,1]$, $$\big\{ P_n\big[\,|\hat\theta(S_{n}) - \theta^*|\geq \epsilon \,\big]\big\}$$

indexed by $n$. If this sequence has a limit as $n\rightarrow \infty$, call it simply $p$, we will have that

$$\forall\, \theta\in \Theta,\ \epsilon>0,\ \delta>0,\ S_n,\ \exists\, n_0(\theta, \epsilon, \delta): \forall\, n \geq n_0,\\ \Big| P_n\big[\,|\hat{\theta}(S_{n}) - \theta^*|\geq \epsilon \,\big] - p\Big|< \delta \tag{2}$$

So if we assume (or require) $(1)$, we essentially assume (or require) that the limit as $n\rightarrow \infty$ exists and is equal to zero, $p=0$.

So $(1)$ reads "the limit of $P_n\big[\,|\hat{\theta}(S_{n}) - \theta^*|\geq \epsilon\,\big]$ as $n\rightarrow \infty$ is $0$", which is exactly the current definition of consistency (and yes, it covers "all possible samples").
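To spell the unravelling out (with the shorthand $a_n$ introduced here purely for readability, not used above), fix $\theta$ and $\epsilon$ and write $a_n := P_n\big[\,|\hat\theta(S_n) - \theta^*|\geq \epsilon\,\big]$. Since $a_n \geq 0$, the $(\epsilon,\delta)$-definition of a limit gives

$$\lim_{n\to\infty} a_n = 0 \;\iff\; \forall\, \delta>0,\ \exists\, n_0,\ \forall\, n\geq n_0:\ a_n < \delta,$$

and the right-hand side is precisely statement $(1)$.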

So it appears that the OP essentially proposed an alternative expression for the exact same property, and not a different property, of the estimator.

ADDENDUM (forgot the history part)

In his "Foundations of the Theory of Probability" (1933), Kolmogorov mentions in a footnote that (the concept of convergence in probability)

"...is due to Bernoulli;its completely general treatment was introduced by E.E.Slutsky".

(in 1925). The work of Slutsky is in German, so there may even be an issue of how the German word was translated into English (or of the term used by Bernoulli). But don't try to read too much into a word.
