For OLS parameter estimates to be consistent, it must be the case that E(u|x) = 0. Is this true?
E(u|x) = 0 is a required condition for unbiasedness. But as far as I understand, unbiasedness does not necessarily imply consistency. Therefore I am really confused.
Solved – Is E(u|x) = 0 a required condition for estimator consistency?
consistency, expected value, self-study
Related Solutions
Glad to see that my (incorrect) answer generated two more, and turned a dead question into a lively Q&A thread. So it's time to try to offer something worthwhile, I guess.
Consider a serially correlated, covariance-stationary stochastic process $\{y_t\},\;\; t=1,...,n$, with mean $\mu$ and autocovariances $\{\gamma_j\},\;\; \gamma_j\equiv \operatorname{Cov}(y_t,y_{t-j})$. Assume that $\lim_{j\rightarrow \infty}\gamma_j= 0$ (this bounds the "strength" of the autocorrelation as two realizations of the process lie farther and farther apart in time). Then we have that
$$\bar y_n = \frac 1n\sum_{t=1}^ny_t\rightarrow_{m.s} \mu,\;\; \text{as}\; n\rightarrow \infty$$
i.e. the sample mean converges in mean square to the true mean of the process, and therefore (by Chebyshev's inequality) it also converges in probability: so it is a consistent estimator of $\mu$.
The variance of $\bar y_n$ can be found to be
$$\operatorname{Var}(\bar y_n) = \frac 1n \gamma_0+\frac 2n \sum_{j=1}^{n-1}\left(1-\frac {j}{n}\right)\gamma_j$$
which is easily shown to go to zero as $n$ goes to infinity.
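To make this concrete, here is a minimal simulation sketch. The AR(1) specification, the parameter values, and the seed are my own illustrative choices, not part of the argument; an AR(1) with $|\phi|<1$ is just one convenient covariance-stationary process with $\gamma_j\rightarrow 0$.

```python
# Minimal sketch: the sample mean of a covariance-stationary AR(1)
# process is consistent for mu. AR(1) is one concrete example with
# gamma_j -> 0; phi, mu, sigma and the seed are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, phi, sigma = 5.0, 0.7, 1.0   # mean, AR coefficient, innovation s.d.

def sample_mean_var(n, reps=1000):
    """Variance of the sample mean across `reps` simulated paths of length n."""
    y = np.empty((reps, n))
    y[:, 0] = mu + rng.normal(0.0, sigma / np.sqrt(1 - phi**2), reps)  # stationary start
    for t in range(1, n):
        y[:, t] = mu + phi * (y[:, t - 1] - mu) + rng.normal(0.0, sigma, reps)
    return y.mean(axis=1).var()

for n in (50, 200, 800):
    print(n, sample_mean_var(n))   # shrinks roughly like 1/n, as the formula predicts
```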
Now, making use of Cardinal's comment let's randomize further our estimator of the mean, by considering the estimator
$$\tilde \mu_n = \bar y_n + z_n$$
where $\{z_t\}$ is a stochastic process of independent random variables, also independent of the $y_t$'s, taking the value $at$ (with the parameter $a>0$ to be specified by us) with probability $1/t^2$, the value $-at$ with probability $1/t^2$, and the value zero otherwise. So $z_t$ has expected value and variance
$$E(z_t) = at\frac 1{t^2} -at\frac 1{t^2} + 0\cdot \left (1-\frac 2{t^2}\right)= 0,\;\;\operatorname{Var}(z_t) = E(z_t^2) = a^2t^2\frac 1{t^2}+a^2t^2\frac 1{t^2} = 2a^2$$
By the independence of $\bar y_n$ and $z_n$, the expected value and the variance of the estimator are therefore
$$E(\tilde \mu_n) = \mu,\;\;\operatorname{Var}(\tilde \mu_n) = \operatorname{Var}(\bar y_n) + 2a^2$$
Consider the distribution of $|z_n|$: it takes the value $0$ with probability $1-2/n^2$ and the value $an$ with probability $2/n^2$. So for any $\epsilon>0$ (and $n$ large enough that $an>\epsilon$),
$$P\left(|z_n| <\epsilon\right) \ge 1-\frac 2{n^2} \;\Rightarrow\; \lim_{n\rightarrow \infty}P\left(|z_n| < \epsilon\right) = 1$$
which means that $z_n$ converges in probability to $0$ (while its variance remains finite). Therefore
$$\operatorname{plim}\tilde \mu_n = \operatorname{plim}\bar y_n+\operatorname{plim} z_n = \mu$$
so this randomized estimator of the mean value of the $y$-stochastic process remains consistent. But its variance does not go to zero as $n$ goes to infinity, nor does it go to infinity.
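A short simulation sketch of the randomizing term $z_n$ shows both facts at once: its variance stays at $2a^2$ while $P(z_n\neq 0)=2/n^2$ vanishes. The value of $a$, the sample sizes, and the seed are my own choices.

```python
# Sketch of the randomizing term z_n defined above: +a*n or -a*n each
# with probability 1/n^2, zero otherwise. Its variance is 2*a**2 for
# every n, yet z_n -> 0 in probability. a and the seed are my choices.
import numpy as np

rng = np.random.default_rng(1)
a = 2.0

def z_draw(n, reps):
    """Draw `reps` independent copies of z_n."""
    u = rng.random(reps)
    z = np.zeros(reps)
    z[u < 1 / n**2] = a * n
    z[(u >= 1 / n**2) & (u < 2 / n**2)] = -a * n
    return z

for n in (5, 20, 100):
    z = z_draw(n, 500_000)
    # empirical variance hovers around 2*a**2 = 8 for every n, while
    # the relative frequency of a nonzero draw, about 2/n^2, vanishes
    print(n, z.var(), (z != 0).mean())
```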
Closing: why all the apparently useless elaboration with an autocorrelated stochastic process? Because Cardinal qualified his example by calling it "absurd", something "just to show that mathematically, we can have a consistent estimator with non-zero and finite variance".
I wanted to give a hint that it isn't necessarily a curiosity, at least in spirit. There are times in real life when new, man-made processes begin, processes that have to do with how we organize our lives and activities. While we usually have designed them, and can say a lot about them, they may still be so complex that they are reasonably treated as stochastic (the illusion of complete control over such processes, or of complete a priori knowledge of their evolution, is just that, an illusion; think of processes that represent new ways to trade or produce, or to arrange the rights-and-obligations structure between humans). Being new, we also do not have enough accumulated realizations of them to do reliable statistical inference on how they will evolve. Ad hoc and perhaps "suboptimal" corrections are then an actual phenomenon: for example, we may have a process whose present we strongly believe depends on its past (hence the autocorrelated stochastic process), but we really don't know how as yet (hence the ad hoc randomization, while we wait for data to accumulate in order to estimate the covariances). And maybe a statistician would find a better way to deal with this kind of severe uncertainty, but many entities have to function in an uncertain environment without the benefit of such scientific services.
What follows is the initial (wrong) answer (see especially Cardinal's comment)
Estimators that converge in probability to a random variable do exist: the case of "spurious regression" comes to mind, where if we attempt to regress two independent random walks (i.e. non-stationary stochastic processes) on each other by using ordinary least squares estimation, the OLS estimator will converge to a random variable.
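A quick simulation sketch of this (sample sizes, replication count, and seed are my own choices) shows the cross-replication spread of the OLS slope refusing to shrink as $n$ grows, i.e. the estimator settling on a random variable rather than a constant:

```python
# Spurious regression sketch: regress one pure random walk on another,
# independent one. The OLS slope does not concentrate as n grows.
# Sample sizes, replication count and seed are illustrative choices.
import numpy as np

rng = np.random.default_rng(2)

def spurious_slope(n):
    """OLS slope (with intercept) from regressing random walk y on random walk x."""
    x = np.cumsum(rng.normal(size=n))
    y = np.cumsum(rng.normal(size=n))
    return np.cov(x, y, bias=True)[0, 1] / x.var()

for n in (100, 1_000, 10_000):
    slopes = np.array([spurious_slope(n) for _ in range(500)])
    print(n, slopes.std())   # the spread stays of order one instead of vanishing
```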
But a consistent estimator with non-zero variance does not exist, because consistency is defined as the convergence in probability of an estimator to a constant, which, by conception, has zero variance.
Consider the second tentative statement by the OP, slightly modified,
$$\forall \theta\in \Theta, \epsilon>0, \delta>0, S_n, \exists n_0(\theta, \epsilon, \delta): \forall n \geq n_0,\;\\P_n\big[|\hat \theta(S_{n}) - \theta^*|\geq \epsilon \big] < \delta \tag{1}$$
We are examining the bounded in $[0,1]$ sequence of real numbers $$\big\{ P_n\big[|\hat\theta(S_{n}) - \theta^*|\geq \epsilon \big]\big\}$$
indexed by $n$. If this sequence has a limit as $n\rightarrow \infty$, call it simply $p$, then we will have that
$$\forall \theta\in \Theta, \epsilon>0, \delta>0, S_n,\,\exists n_0(\theta, \epsilon, \delta): \forall n \geq n_0,\;\\\Big| P_n\big[|\hat \theta(S_{n}) - \theta^*|\geq \epsilon \big] -p\Big|< \delta \tag{2}$$
So if we assume (or require) $(1)$, we essentially assume (or require) that the limit as $n\rightarrow \infty$ exists and is equal to zero, $p=0$.
So $(1)$ reads "the limit of $P_n\big[|\hat \theta(S_{n}) - \theta^*|\geq \epsilon\big]$ as $n\rightarrow \infty$ is $0$", which is exactly the current definition of consistency (and yes, it covers "all possible samples").
So it appears that the OP essentially proposed an alternative expression for the exact same property, and not a different property, of the estimator.
ADDENDUM (forgot the history part)
In his "Foundations of the Theory of Probability" (1933), Kolmogorov mentions in a footnote that (the concept of convergence in probability)
"...is due to Bernoulli;its completely general treatment was introduced by E.E.Slutsky".
(in 1925). The work of Slutsky is in German -there may be even an issue of how the German word was translated in English (or the term used by Bernoulli). But don't try to read too much into a word.
Best Answer
OK. The model is, in matrix notation and with conformable dimensions, $$\mathbf y = \mathbf X\beta + \mathbf u $$
The $OLS$ estimator is
$$\hat \beta = (\mathbf X'\mathbf X)^{-1}\mathbf X' \mathbf y = (\mathbf X'\mathbf X)^{-1}\mathbf X' (\mathbf X\beta + \mathbf u) $$
$$= (\mathbf X'\mathbf X)^{-1}\mathbf X' \mathbf X\beta + (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u = \beta + (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u$$
For consistency we examine
$$\operatorname{plim}\hat \beta = \operatorname{plim}\beta + \operatorname{plim}\left[(\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u\right] = \beta + \operatorname{plim}\left[\left(\frac 1n\mathbf X'\mathbf X\right)^{-1}\left(\frac 1n\mathbf X'\mathbf u\right)\right] $$
And here is the crucial point that lets us use a weaker assumption for consistency than for unbiasedness: for unbiasedness we would face $E\left[(\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u\right]$, and in order to "insert" the expected value into the expression we have to condition on $\mathbf X$, which leads to the expression $E(\mathbf u\mid \mathbf X)$ and the need to assume that it equals zero, i.e. to assume "mean-independence" between the error term and the regressors.
But $\operatorname{plim}$ is a more "flexible" operator than $E$: under $\operatorname{plim}$, products can be decomposed into products of plims (something that under the expected value requires independence), and $\operatorname{plim}$ can also "go inside" a function (while $E$ cannot, except for affine functions) as long as the function is a continuous transformation (and it very rarely isn't) - so
$$\operatorname{plim}\left[\left(\frac 1n\mathbf X'\mathbf X\right)^{-1}\left(\frac 1n\mathbf X'\mathbf u\right)\right] = \operatorname{plim}\left(\frac 1n\mathbf X'\mathbf X\right)^{-1}\operatorname{plim}\left(\frac 1n\mathbf X'\mathbf u\right)$$
For consistency we need to assume that the first $\operatorname{plim}$ is finite - but this is an assumption on the properties of the regressor matrix, unrelated to the error term. So we are left with the second $\operatorname{plim}$, which, written out in sums for clarity, is $$\operatorname{plim}\left(\frac 1n\mathbf X'\mathbf u\right) = \left[\begin{matrix} \operatorname{plim}\frac 1n\sum_{i=1}^nx_{1i}u_i \\ \vdots\\ \operatorname{plim}\frac 1n\sum_{i=1}^nx_{ki}u_i \\ \end{matrix}\right] \rightarrow\left[\begin{matrix} \frac 1n\sum_{i=1}^nE(x_{1i}u_i) \\ \vdots\\ \frac 1n\sum_{i=1}^nE(x_{ki}u_i) \\ \end{matrix}\right] $$ ...the last transformation due to the usual assumptions that permit the application of the law of large numbers.
Exactly because we have been able to "separate" $(\mathbf X'\mathbf X)^{-1}$ from $\mathbf X'\mathbf u$ (due to the fact that we are examining the $\operatorname{plim}$ and not $E$), we ended up looking only at the contemporaneous relation between each regressor and the error term. So what we need to assume for consistency of the $OLS$ estimator is only $E(x_{ki}u_i) =0,\; \forall k, \; \forall i$ (contemporaneous uncorrelatedness), which is much weaker than $E(\mathbf u\mid \mathbf X)=\mathbf 0$: the latter requires mean-independence, and moreover not only contemporaneously but across time too (since we condition the whole error vector on the whole regressor matrix).
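To see the distinction in action, take the textbook case where mean-independence fails but contemporaneous uncorrelatedness holds: the AR(1) regression $y_t = \beta y_{t-1} + u_t$. The regressor $y_{t-1}$ depends on past errors, so $E(\mathbf u\mid \mathbf X)\neq \mathbf 0$ and OLS is biased in finite samples, yet $E(y_{t-1}u_t)=0$, so OLS is consistent. A minimal simulation sketch (the value of $\beta$, the sample sizes, and the seed are my own choices):

```python
# AR(1) regression y_t = beta*y_{t-1} + u_t: E(u|X) = 0 fails, but
# E(x_t u_t) = 0 holds, so OLS is finite-sample biased yet consistent.
# beta, the sample sizes, and the seed are illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
beta = 0.5

def ols_ar1(n):
    """OLS estimate of beta (no intercept) from one simulated path."""
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        y[t] = beta * y[t - 1] + rng.normal()
    x, resp = y[:-1], y[1:]
    return (x @ resp) / (x @ x)

for n in (20, 200, 2000):
    est = np.array([ols_ar1(n) for _ in range(2000)])
    # small n: mean visibly below 0.5 (finite-sample bias);
    # large n: estimates concentrate tightly around 0.5 (consistency)
    print(n, est.mean(), est.std())
```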