# Regression – Why $$E(\epsilon_i x_i)=0$$ Gives the Best Linear Predictor and $$E(\epsilon_i|x_i)=0$$ the Best Predictor

Tags: econometrics, estimation, least squares, predictor, regression

Why does $$E(\epsilon_ix_i)=0$$ give the best linear predictor while $$E(\epsilon_i|x_i)=0$$ gives the best predictor?

Each $$x_i$$ is a $$1\times K$$ vector. All the information I could gather is that $$E(\epsilon_i|x_i)=0$$ is stronger than $$E(\epsilon_ix_i)=0$$. Under $$E(\epsilon_i|x_i)=0$$, we have the consequences $$E(\epsilon_i)=0$$ and $$E(\epsilon_ix_i)=0$$, which imply $$Cov(\epsilon_i,x_i)=E(\epsilon_ix_i)-E(\epsilon_i)E(x_i)=0$$. From this, I don't see why one condition is said to give the best linear predictor and the other the best predictor.

Equation 2.1 is $$Y=X\beta+\epsilon,\\y_i=x^T_{i}\beta+\epsilon_i,\quad i\in\mathbb{N}$$

Let us begin by defining what we mean by "best predictor" and "best linear predictor". A best predictor of $$Y$$ given $$X$$ is a function $$m$$ of $$X$$ such that the mean squared error $$E[(Y-m(X))^2]$$ is minimized among all possible choices of $$m$$. Formally, the best predictor is a function in $$\mathrm{argmin}_{m}\, E[(Y-m(X))^2].$$

On the other hand, a best linear predictor is a linear function that minimizes the mean squared error only within the class of linear predictors. In other words, for the best linear predictor, we do not need to compare ourselves to non-linear functions; we just need to do the best we can subject to the linearity constraint. Formally, the best linear predictor is defined to be the quantity $$X'\beta^*$$, where $$\beta^* \in \mathrm{argmin}_\beta\, E[(Y-X'\beta)^2].$$

Based on these definitions, let us begin with an important first observation. If a best predictor is linear, then it is also a best linear predictor. Why? Because in the definition of best predictor, we had to optimize our predictions over a much richer set of candidate functions, and in particular, this set of candidate functions must at least include the linear ones.

Now, to understand your question, we begin by solving explicitly for the best predictor and the best linear predictor. Let us first show that the best predictor is always given by the conditional expectation function, $$m(X) = E[Y|X]$$. To see why, first note that $$E[Y-m(X)|X] = 0$$ for almost all $$X$$. Now, consider any other candidate function $$g(X)$$. Then we have \begin{aligned}E[(Y-g(X))^2] &= E[(\{Y - m(X)\} + \{m(X) - g(X)\})^2] \\&= E[(Y-m(X))^2] + 2 E[(Y-m(X))(m(X)-g(X))] + E[(m(X) - g(X))^2]\end{aligned}

Notice that the first term on the RHS does not depend on the choice of $$g$$, and the last term is minimized when $$g(X) = m(X)$$ for almost all $$X$$, in which case it takes on a value of $$0$$. To show that choosing $$g = m$$ is indeed a minimizer, it thus suffices to show that no matter the choice of $$g$$, the middle term is 0. But this follows immediately from the law of iterated expectations: \begin{aligned}E[(Y-m(X))(m(X)-g(X))] &= E[E[(Y-m(X))(m(X)-g(X))|X]] \\&= E[(m(X)-g(X))\underbrace{E[Y-m(X)|X]}_{=0}] = 0\end{aligned}
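The optimality of the conditional expectation can be checked numerically. Here is a minimal simulation sketch (the data-generating process $$Y = X^2 + \text{noise}$$ is my own illustrative choice, not from the text): since $$E[Y|X] = X^2$$ here, predicting with $$X^2$$ should attain a smaller mean squared error than any rival predictor, such as the constant $$E[Y]$$.

```python
import numpy as np

# Assumed illustrative DGP: Y = X^2 + noise, so the CEF is E[Y|X] = X^2.
rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=n)
Y = X**2 + rng.normal(size=n)

# MSE of the best predictor m(X) = E[Y|X] = X^2 ...
mse_cef = np.mean((Y - X**2) ** 2)
# ... versus a rival predictor g(X) = 1, which equals E[Y] here.
mse_rival = np.mean((Y - 1.0) ** 2)

print(mse_cef, mse_rival)  # the CEF attains the smaller MSE
```

With this many draws, `mse_cef` is close to the noise variance (1), while the rival's MSE is close to $$Var(Y)$$; any other choice of $$g$$ would also do no better than the CEF.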

Next, let us show that the best linear predictor is always the population OLS prediction. Showing this is a matter of simple algebra: $$E[(Y-X'\beta)^2] = E[Y^2] - 2 \beta' E[XY] + \beta' E[XX']\beta$$ This is a quadratic form in $$\beta$$ with a positive definite quadratic term, so the objective is minimized at the first order condition, i.e. when $$-2 E[XY] + 2 E[XX']\beta = 0 \iff \beta = E[XX']^{-1}E[XY]$$ This is simply the population OLS slope, so the above discussion shows that the OLS slope always yields the best linear predictor. Moreover, looking back at the first order condition, we could have rewritten it as $$-2 E[X(Y - X'\beta)] = 0 \iff E[X\underbrace{(Y-X'\beta)}_{\equiv \epsilon}] = 0$$ This recovers the claim in the text that when $$E[X\epsilon]=0$$, we have that $$X'\beta$$ is the best linear predictor. In fact, the condition $$E[X\epsilon]=0$$ is exactly the first order condition defining the optimality of $$\beta$$.
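The moment formula $$\beta = E[XX']^{-1}E[XY]$$ and its first order condition can be sketched with sample analogues (the linear DGP with coefficients $$(2,3)$$ is an assumption for illustration):

```python
import numpy as np

# Assumed DGP for illustration: Y = 2 + 3x + noise.
rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # regressors, including an intercept
Y = 2 + 3 * x + rng.normal(size=n)

# Sample analogue of beta = E[XX']^{-1} E[XY].
beta = np.linalg.solve(X.T @ X / n, X.T @ Y / n)
eps = Y - X @ beta

print(beta)           # close to (2, 3)
print(X.T @ eps / n)  # sample analogue of E[X eps]: zero by construction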

Finally, consider what it means for $$E[\epsilon|X] = 0$$. Since $$E[Y|X] = E[X'\beta|X] + E[\epsilon|X] = X'\beta + E[\epsilon|X]$$, this condition implies that $$E[Y|X] = X'\beta$$, i.e. that $$X'\beta$$ is in fact the conditional expectation function for $$Y$$ and hence the best overall predictor, per the discussion above. (This is why it is considered a strong assumption: it implicitly imposes that the conditional expectation function is linear.)

To recap, the condition that $$E[X\epsilon] = 0$$ can always be made true by defining $$\beta$$ in the linear model to be the "best" choice of $$\beta$$, i.e. the slope of the best linear predictor. On the other hand, the condition that $$E[\epsilon|X]=0$$ requires the much stronger condition that the conditional expectation function is in fact linear.
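The gap between the two conditions can be made concrete with a sketch in which the conditional expectation function is deliberately nonlinear (the setup $$Y = X^2 + \text{noise}$$ with $$X$$ standard normal is my own illustrative assumption). The best linear predictor of $$Y$$ on $$(1, X)$$ is then just the constant $$E[Y]$$, so $$E[X\epsilon]=0$$ holds by construction, yet $$E[\epsilon|X] = X^2 - 1 \neq 0$$:

```python
import numpy as np

# Assumed nonlinear CEF: Y = X^2 + noise, X ~ N(0, 1).
# BLP of Y on (1, X) is approximately the constant E[Y] = 1,
# since Cov(X, Y) = E[X^3] = 0 for a standard normal X.
rng = np.random.default_rng(2)
n = 200_000
X = rng.normal(size=n)
Y = X**2 + rng.normal(size=n)

D = np.column_stack([np.ones(n), X])
beta = np.linalg.solve(D.T @ D / n, D.T @ Y / n)  # close to (1, 0)
eps = Y - D @ beta

print(np.mean(X * eps))           # ~0: unconditional moment E[X eps] holds
print(eps[np.abs(X) > 1].mean())  # positive: residual mean rises with |X|,
print(eps[np.abs(X) < 1].mean())  # negative: so E[eps | X] is not zero
```

So $$E[X\epsilon]=0$$ is satisfied tautologically by the BLP residual, while the conditional mean of that residual still varies with $$X$$: only a correctly specified (here, quadratic) model would deliver $$E[\epsilon|X]=0$$.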