It's false. As you observe, if you read Stock and Watson closely, they don't actually endorse the claim that OLS is unbiased for $\beta$ under conditional mean independence. They endorse the much weaker claim that OLS is unbiased for $\beta$ if $E(u|x,z)=z\gamma$. Then, they say something vague about non-linear least squares.
Your equation (4) contains what you need to see that the claim is false. Estimating equation (4) by OLS while omitting the variable $E(u|z)$ leads to omitted variables bias. As you probably recall, the bias from an omitted variable (when that variable enters with a coefficient of 1) is given by the coefficients of the following auxiliary regression:
\begin{align}
E(u|z) = x\alpha_1 + z\alpha_2 + \nu
\end{align}
The bias in the OLS estimate of $\beta$ is $\alpha_1$ from this regression, and the bias in the estimate of $\gamma$ is $\alpha_2$. If $x$ is correlated with $E(u|z)$ after controlling linearly for $z$, then $\alpha_1$ will be non-zero and the OLS coefficient on $x$ will be biased.
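To see this explicitly, substitute the auxiliary regression into equation (4):
\begin{align}
y &= x\beta + z\gamma + (x\alpha_1 + z\alpha_2 + \nu) + v\\
&= x(\beta+\alpha_1) + z(\gamma+\alpha_2) + (\nu + v)
\end{align}
By construction, $\nu$ is uncorrelated with $x$ and $z$, and $v$ has conditional mean zero given $x$ and $z$, so regressing $y$ on $x$ and $z$ estimates $\beta+\alpha_1$ and $\gamma+\alpha_2$ rather than $\beta$ and $\gamma$.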
Here is an example to prove the point:
\begin{align}
\xi &\sim F(), \; \zeta \sim G(), \; \nu \sim H()\quad \text{all independent}\\
z &=\xi\\
x &= z^2 + \zeta\\
u &= z+z^2-E(z+z^2)+\nu
\end{align}
Looking at the formula for $u$, it is clear that $E(u|x,z)=E(u|z)=z+z^2-E(z+z^2)$. Looking at the auxiliary regression, it is clear that (absent some fortuitous choice of $F,G,H$) $\alpha_1$ will not be zero.
Here is a very simple example in R
which demonstrates the point:
set.seed(12344321)
# simulate the DGP above: the true model is y = x + z + u, so beta = gamma = 1
z <- runif(n=100000,min=0,max=10)
x <- z^2 + runif(n=100000,min=0,max=20)                    # x "has some z^2 in it"
u <- z + z^2 - mean(z+z^2) + rnorm(n=100000,mean=0,sd=20)  # E(u|x,z) = E(u|z) depends on z^2
y <- x + z + u
# main regression: the coefficient on x is biased
summary(lm(y~x+z))
# auxiliary regression: the coefficient on x estimates alpha_1, the bias
summary(lm(z+z^2~x+z))
Notice that the first regression gives you a coefficient on $x$ which is biased up by about 0.63, reflecting the fact that $x$ "has some $z^2$ in it," as does $E(u|z)$. Notice also that the auxiliary regression gives you a bias estimate of about 0.63.
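If you want to see the two numbers line up directly, you can pull them out of the fitted models (continuing the same R session; recall the true coefficient on $x$ is 1):
coef(lm(y~x+z))["x"] - 1       # bias of the OLS estimate of beta
coef(lm(z+z^2~x+z))["x"]       # alpha_1 from the auxiliary regression
Both should print roughly the same number, about 0.63 here.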
So, what are Stock and Watson (and your lecturer) talking about? Let's go back to your equation (4):
\begin{align}
y = x\beta + z\gamma + E(u|z) + v
\end{align}
The important fact here is that the omitted variable is a function of $z$ only. This suggests that if we could control for $z$ flexibly enough, that would be enough to purge the bias from the regression, even though $x$ may be correlated with $u$.
Suppose we estimated the equation below using either a non-parametric method to estimate the function $f()$ or using the correct functional form $f(z)=z\gamma+E(u|z)$. If we were using the correct functional form, we would be estimating it by non-linear least squares (explaining the cryptic comment about NLS):
\begin{align}
y = x\beta + f(z) + v
\end{align}
That would give us a consistent estimator for $\beta$ because there is no longer an omitted variable problem.
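In the simulated example above, $f(z)$ happens to be a quadratic in $z$ that is linear in its parameters, so we don't even need NLS; adding a $z^2$ regressor is enough (continuing the R session above):
summary(lm(y ~ x + z + I(z^2)))   # the coefficient on x should now be close to 1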
Alternatively, if we had enough data, we could go "all the way" in controlling for $z$. We could look at a subset of the data where $z=1$, and just run the regression:
\begin{align}
y = x\beta + v
\end{align}
This would give an unbiased, consistent estimator of $\beta$; the intercept, of course, would be polluted by $f(1)$. Obviously, you could also get a (different) consistent, unbiased estimator by running that regression only on the data points for which $z=2$. And another one for the points where $z=3$. Etc. Then you'd have a bunch of good estimators from which you could make a great estimator by, say, averaging them all together somehow.
This latter thought is the inspiration for matching estimators. Since we don't usually have enough data to literally run the regression only for $z=1$ or even for pairs of points where $z$ is identical, we instead run the regression for points where $z$ is "close enough" to being identical.
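In the simulation above you can mimic this with a very crude version of matching: keep only the observations whose $z$ falls inside a narrow window (the window below is an arbitrary choice) and run the short regression on that subset:
keep <- abs(z - 5) < 0.1          # observations with nearly identical z
summary(lm(y[keep] ~ x[keep]))    # the coefficient on x should be roughly 1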
As you suspect, the second version is indeed a special case of the more general first result. We obtain it when $X_2=x_2$ and $X_1=(\iota\;\;x_1)$ with $\iota$ a vector of ones for the constant.
What (maybe) confuses you is that Wooldridge's statement only focuses on the coefficient on $x_1$ and does not bother to discuss $\tilde{b}_0$, the coefficient on the constant, as it is often of secondary interest.
When we have a constant, $x_1$ and $x_2$, we get a $(2\times1)$ vector in the short regression $\tilde{b}=(\tilde{b}_0,\tilde{b}_1)'$. Likewise, the regression of $x_2$ on an intercept and $x_1$ then yields a coefficient vector, call it $\Delta$, that contains $\Delta=(\delta_0,\delta)'$.
In Goldberger's general result, $\Delta$ corresponds to $(X_1'X_1)^{-1} X_1'X_2$, the OLSEs of a regression of $X_2$ on $X_1$. (When $X_2$ contains $k_2>1$ variables, we would actually obtain a $(k_1\times k_2)$ matrix of estimated coefficients here, with $k_1$ the number of variables in $X_1$.)
Finally, let $\hat{b}_{[0,1]}=(\hat{b}_0,\hat{b}_1)'$ collect the coefficients on the constant and $x_1$ from the long regression.
So all in all, we may write
$$
\tilde{b}=\hat{b}_{[0,1]}+\Delta\cdot\hat{b}_2,
$$
which is now, I hope, clearly a special case of Goldberger's formulation. Wooldridge just picks the second element of that vector.
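If it helps to see it with numbers, the identity is easy to verify in R; the data-generating process below is made up purely for illustration (the relation holds exactly in any sample):
set.seed(1)
x1 <- rnorm(1000)
x2 <- 0.5*x1 + rnorm(1000)
y  <- 1 + 2*x1 + 3*x2 + rnorm(1000)
b_long  <- coef(lm(y ~ x1 + x2))   # (b0_hat, b1_hat, b2_hat)
b_short <- coef(lm(y ~ x1))        # (b0_tilde, b1_tilde)
Delta   <- coef(lm(x2 ~ x1))       # (delta_0, delta)
b_short
b_long[1:2] + Delta * b_long[3]    # reproduces b_short exactly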
In English, it means that, conditional on the regressors, the expected value of the error term is zero.
How might this be violated?
Example: omitted variable correlated with $x$
Imagine the true model is: $$ y_i = \alpha + \beta x_i + \gamma z_i + u_i$$
But instead imagine we're running the regression: $$ y_i = \alpha + \beta x_i + \underbrace{\epsilon_i}_{\gamma z_i + u_i}$$
Then: $$\begin{align*} E[\epsilon_i \mid x_i ] &= E[\gamma z_i + u_i \mid x_i] \\ &=\gamma E[ z_i\mid x_i] \quad \text{assuming } E[u_i \mid x_i]=0 \end{align*}$$
If $E[z_i \mid x_i] \neq 0$ and $\gamma \neq 0$, then $E[\epsilon_i \mid x_i] \neq 0$ and strict exogeneity is violated.
For example, imagine $y$ is wages, $x$ is an indicator for a college degree, and $z$ is some measure of ability. If wages are a function of both education and ability (the true data generating process is the first equation), and college graduates are expected to have higher ability ($E[z_i \mid x_i] \neq 0$) because college tends to attract and admit higher-ability students, then running a simple regression of wages on education violates the strict exogeneity assumption. We have a classic confounding variable: ability causes education, and ability affects wages, hence our expectation of the error in the second equation given education isn't zero.
What would happen if we did run the regression? You would pick up both the education effect and the ability effect in the education coefficient. In this simple linear example, the estimated coefficient $b$ would pick up the effect of $x$ on $y$ plus the association of $x$ and $z$ times the effect of $z$ on $y$.
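In this bivariate case, that statement is the familiar omitted-variable-bias formula, $\operatorname{plim} b = \beta + \gamma \operatorname{Cov}(x_i, z_i)/\operatorname{Var}(x_i)$. A quick R simulation of the wages/education/ability story (all numbers and functional forms below are invented for illustration) shows the confounding at work:
set.seed(42)
n <- 100000
ability <- rnorm(n)                                   # z: unobserved ability
college <- rbinom(n, 1, plogis(ability))              # x: higher ability -> more likely to attend
wage    <- 5 + 1*college + 2*ability + rnorm(n)       # true return to college is 1
coef(lm(wage ~ college))["college"]                   # biased upward: picks up the ability effect
coef(lm(wage ~ college + ability))["college"]         # close to 1 once ability is controlled for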