It's false. As you observe, if you read Stock and Watson closely, they don't actually endorse the claim that OLS is unbiased for $\beta$ under conditional mean independence. They endorse the much weaker claim that OLS is unbiased for $\beta$ if $E(u|x,z)=z\gamma$. Then, they say something vague about non-linear least squares.
Your equation (4) contains what you need to see that the claim is false. Estimating equation (4) by OLS while omitting the variable $E(u|z)$ leads to omitted variable bias. As you may recall, when the omitted variable has a coefficient of 1, the bias is given by the coefficients of the following auxiliary regression:
\begin{align}
E(u|z) = x\alpha_1 + z\alpha_2 + \eta
\end{align}
The bias in the original regression's estimate of $\beta$ is $\alpha_1$, and the bias in the estimate of $\gamma$ is $\alpha_2$. If $x$ is correlated with $E(u|z)$ after controlling linearly for $z$, then $\alpha_1$ will be non-zero and the OLS coefficient on $x$ will be biased.
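To see why, substitute the auxiliary regression into equation (4) and collect terms:
\begin{align}
y &= x\beta + z\gamma + E(u|z) + v\\
&= x(\beta + \alpha_1) + z(\gamma + \alpha_2) + (\eta + v)
\end{align}
The composite error $\eta + v$ is uncorrelated with $x$ and $z$ by construction, so regressing $y$ on $x$ and $z$ estimates $\beta + \alpha_1$ and $\gamma + \alpha_2$, not $\beta$ and $\gamma$.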
Here is an example to prove the point:
\begin{align}
\xi &\sim F(), \; \zeta \sim G(), \; \nu \sim H()\quad \text{all independent}\\
z &=\xi\\
x &= z^2 + \zeta\\
u &= z+z^2-E(z+z^2)+\nu
\end{align}
Looking at the formula for $u$, it is clear that $E(u|x,z)=E(u|z)=z+z^2-E(z+z^2)$. Looking at the auxiliary regression, it is clear that (absent some fortuitous choice of $F,G,H$) $\alpha_1$ will not be zero.
Here is a very simple example in R
which demonstrates the point:
set.seed(12344321)
n <- 100000
z <- runif(n, min = 0, max = 10)
x <- z^2 + runif(n, min = 0, max = 20)  # x is correlated with z^2
u <- z + z^2 - mean(z + z^2) + rnorm(n, mean = 0, sd = 20)  # E(u|z) = z + z^2 - E(z + z^2)
y <- x + z + u  # true beta = 1, gamma = 1
summary(lm(y ~ x + z))  # coefficient on x is biased upward
# auxiliary regression of the omitted variable on x and z
summary(lm(I(z + z^2) ~ x + z))
Notice that the first regression gives you a coefficient on $x$ which is biased upward by about 0.63, reflecting the fact that $x$ "has some $z^2$ in it," as does $E(u|z)$. Notice also that the auxiliary regression gives you an estimate of $\alpha_1$ of about 0.63, matching the bias.
So, what are Stock and Watson (and your lecturer) talking about? Let's go back to your equation (4):
\begin{align}
y = x\beta + z\gamma + E(u|z) + v
\end{align}
It's an important fact that the omitted variable is a function of $z$ only. This suggests that if we could control for $z$ flexibly enough, that would be enough to purge the bias from the regression, even though $x$ is correlated with $u$.
Suppose we estimated the equation below, using either a non-parametric method for the function $f()$ or the correct functional form $f(z)=z\gamma+E(u|z)$. With the correct functional form, we would estimate it by non-linear least squares, which explains the cryptic comment about NLS:
\begin{align}
y = x\beta + f(z) + v
\end{align}
That would give us a consistent estimator for $\beta$ because there is no longer an omitted variable problem.
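In the simulated example above, $E(u|z)$ happens to be quadratic in $z$, so adding $z^2$ as a regressor implements the correct functional form. A quick check, continuing the R snippet from before:
# Continuing the simulation: f(z) is quadratic here, so controlling
# for z and z^2 removes the omitted variable problem.
summary(lm(y ~ x + z + I(z^2)))
# The coefficient on x is now close to its true value of 1.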
Alternatively, if we had enough data, we could go ``all the way'' in controlling for $z$. We could look at a subset of the data where $z=1$, and just run the regression:
\begin{align}
y = x\beta + v
\end{align}
This would give an unbiased, consistent estimator of $\beta$; the intercept, of course, would be polluted by $f(1)$. Obviously, you could also get a (different) consistent, unbiased estimator by running that regression only on data points for which $z=2$. And another one for the points where $z=3$. Etc. Then you'd have a bunch of good estimators, from which you could make a great estimator by, say, averaging them all together somehow.
This latter thought is the inspiration for matching estimators. Since we don't usually have enough data to literally run the regression only for $z=1$ or even for pairs of points where $z$ is identical, we instead run the regression for points where $z$ is ``close enough'' to being identical.
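Continuing the simulation, here is a crude sketch of that idea (the bin around $z=5$ and its width are arbitrary choices for illustration): restrict to a thin slice where $z$ is nearly constant, so $f(z)$ is absorbed into the intercept.
# Keep only observations with z near 5; within the slice f(z) is
# (almost) constant and is absorbed by the intercept.
slice <- abs(z - 5) < 0.1
summary(lm(y[slice] ~ x[slice]))
# The slope on x comes out close to 1.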
In the statistical sense of this (regression) model, there is no difference between the treatment $D_i$ and the covariate $X_i$. Aside from the type of variable (continuous/categorical), they are both predictors/independent variables (this would also apply if the treatment $D_i$ were continuous, or the covariate $X_i$ categorical). Moreover, statistically speaking, everything you can infer from $\tau$ applies to $\beta$ as well.
Now comes the less statistical, more methodological part: according to the theory or hypothesis you are studying, these variables are not equal. One may be of particular interest. Especially when trying to make causal inferences, you want to obtain an 'as pure as possible' estimate of its effect on the outcome of interest and (if a frequentist) its significance. That is why you correct/adjust for the effects of other variables (often called confounders; correcting for confounding bias). The model therefore needs to be built around correcting for the variables that can confound the association of interest. If done correctly, you might get a good estimate of the (approximately) unbiased 'true' effect of the treatment $D_i$ on the outcome. However, you have only selected confounders of $D_i$'s effect on the outcome. You might have omitted some confounders for the covariate $X_i$ from the model, because you did not expect them to influence the association between $D_i$ and the outcome.
Because of this, causal inferences based on the covariate $X_i$'s $\beta$ are not fully corrected for confounding (i.e., $\beta$ may still be biased).
If in your example the training program is confounded only by age (because we have some theory about this), causal inference for $D_i$'s effect on wages becomes possible. For age, however, treatment $D_i$ might not be the only confounder; ergo*, the effect estimate $\beta$ might not be 'pure' and would not be an unbiased estimate of the effect of the covariate $X_i$/age on wages. A simulated sketch of this asymmetry follows below.
*(always wanted to use that word)
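Here is a minimal R sketch of that asymmetry. Everything in it is assumed for illustration (the confounder $q$, all coefficients, and the selection rule are made up, not taken from the example above):
set.seed(1)
n <- 100000
q <- rnorm(n)                                 # assumed omitted confounder of age -> wages
age <- 30 + 10 * q + rnorm(n)                 # age depends on q
D <- rbinom(n, 1, plogis(0.05 * (age - 30)))  # selection into treatment depends on age only
wages <- 2 * D + 0.5 * age + 3 * q + rnorm(n) # true effects: D = 2, age = 0.5
summary(lm(wages ~ D + age))
# tau on D comes out near 2, since age blocks D's only back-door path,
# but the coefficient on age is biased away from 0.5 because q is omitted.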
I'll start with your second question as it will inform the answer to the first.
Note the distinction between regression coefficients and structural causal model coefficients. The former are what you get when you run a regression - always. Only under specific circumstances do the regression coefficients have a causal interpretation; in other words, only under specific circumstances do the regression coefficients coincide with the coefficients of the structural causal model. What are these specific circumstances? A necessary condition is the zero conditional mean assumption (pertaining to the structural errors), discussed by Wooldridge and Greene, or the ignorability assumption discussed by Gelman and Hill. The latter is once again a necessary assumption for the regression coefficients to have a causal interpretation; it is just described in a different context - that of potential outcomes. The zero conditional mean assumption and the ignorability assumption - also called selection on observables, and also called the CIA [Conditional Independence Assumption] in Mostly Harmless Econometrics - are two sides of the same coin.

Chen & Pearl, with reference to Greene's book, said: "In summary, while Greene provides the most detailed account of potential outcomes and counterfactuals of all the authors surveyed, his failure to acknowledge the oneness of the potential outcomes and structural equation frameworks is likely to cause more confusion than clarity, especially in view of the current debate between two antagonistic and narrowly focused schools of econometric research (see Pearl 2009, pp. 379-380)."
So, to answer your question: if the zero conditional mean assumption (with regard to the structural errors) is violated, then the regression coefficients will not coincide with those of the structural model; in other words, the regression coefficients will not have a causal interpretation.
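As a hedged illustration of what a violation looks like (a made-up structural model, not one from the cited texts), here the structural coefficient on $x$ is 1, but $E(u|x) \neq 0$, so the regression slope comes out as 2:
set.seed(1)
n <- 100000
w <- rnorm(n)
x <- w + rnorm(n)
u <- 2 * w + rnorm(n)  # structural error correlated with x, so E(u|x) != 0
y <- x + u             # structural coefficient on x is 1
coef(lm(y ~ x))        # regression slope is about 2, not 1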
As for your first question: because they chose to describe the conditions necessary for the coefficients to have a causal interpretation in the context of potential outcomes. It is just the other side of the same coin.
For more detail on the difference between regression and a structural causal model, see Carlos Cinelli's answers here and here.