It's false. As you observe, if you read Stock and Watson closely, they don't actually endorse the claim that OLS is unbiased for $\beta$ under conditional mean independence. They endorse the much weaker claim that OLS is unbiased for $\beta$ if $E(u|x,z)=z\gamma$. Then, they say something vague about non-linear least squares.
Your equation (4) contains what you need to see that the claim is false. Estimating equation (4) by OLS while omitting the variable $E(u|x,z)$ leads to omitted variables bias. As you probably recall, the bias term from omitted variables (when the omitted variable has a coefficient of 1) is controlled by the coefficients from the following auxiliary regression:
\begin{align}
E(u|z) = x\alpha_1 + z\alpha_2 + \nu
\end{align}
The bias in the original regression for $\beta$ is $\alpha_1$ from this regression, and the bias on $\gamma$ is $\alpha_2$. If $x$ is correlated
with $E(u|z)$, after controlling linearly for $z$, then $\alpha_1$ will be non-zero and the OLS coefficient will be biased.
Here is an example to prove the point:
\begin{align}
\xi &\sim F(), \; \zeta \sim G(), \; \nu \sim H()\quad \text{all independent}\\
z &=\xi\\
x &= z^2 + \zeta\\
u &= z+z^2-E(z+z^2)+\nu
\end{align}
Looking at the formula for $u$, it is clear that $E(u|x,z)=E(u|z)=z+z^2-E(z+z^2)$ Looking at the auxiliary regression, it is clear that (absent some fortuitous choice of $F,G,H$) $\alpha_1$ will not be zero.
Here is a very simple example in R
which demonstrates the point:
set.seed(12344321)
z <- runif(n=100000,min=0,max=10)
x <- z^2 + runif(n=100000,min=0,max=20)
u <- z + z^2 - mean(z+z^2) + rnorm(n=100000,mean=0,sd=20)
y <- x + z + u
summary(lm(y~x+z))
# auxiliary regression
summary(lm(z+z^2~x+z))
Notice that the first regression gives you a coefficient on $x$ which is biased up by 0.63, reflecting the fact that $x$ "has some $z^2$ in it" as does $E(u|z)$. Notice also that the auxiliary regression gives you a bias estimate of about 0.63.
So, what are Stock and Watson (and your lecturer) talking about? Let's go back to your equation (4):
\begin{align}
y = x\beta + z\gamma + E(u|z) + v
\end{align}
It's an important fact that the omitted variable is only a function of $z$. It seems like if we could control for $z$ really well, that would be enough to purge the bias from the regression, even though $x$ may be correlated with $u$.
Suppose we estimated the equation below using either a non-parametric method to estimate the function $f()$ or using the correct functional form $f(z)=z\gamma+E(u|z)$. If we were using the correct functional form, we would be estimating it by non-linear least squares (explaining the cryptic comment about NLS):
\begin{align}
y = x\beta + f(z) + v
\end{align}
That would give us a consistent estimator for $\beta$ because there is no longer an omitted variable problem.
Alternatively, if we had enough data, we could go ``all the way'' in controlling for $z$. We could look at a subset of the data where $z=1$, and just run the regression:
\begin{align}
y = x\beta + v
\end{align}
This would give unbiased, consistent estimators for the $\beta$ except for the intercept, of course, which would be polluted by $f(1)$. Obviously, you could also get a (different) consistent, unbiased estimator by running that regression only on data points for which $z=2$. And another one for the points where $z=3$. Etc. Then you'd have a bunch of good estimators from which you could make a great estimator by, say, averaging them all together somehow.
This latter thought is the inspiration for matching estimators. Since we don't usually have enough data to literally run the regression only for $z=1$ or even for pairs of points where $z$ is identical, we instead run the regression for points where $z$ is ``close enough'' to being identical.
That is not correct if there is no intercept. Without an intercept the OLS estimate would be:
$$ \beta = \frac{\operatorname{E}[XY]}{\operatorname{E}{[X^2]}}$$
In the case of a finite sample, your estimate would be:
$$ \hat{\beta} = \frac{\sum_{i=1}^n x_i y_i }{\sum_{i=1}^n x_i^2} $$
The general case (skip over this if you don't know matrix algebra yet)
If you know matrix algebra, all these are special cases. Minimizing the sum of squares can be written as:
$$ \text{minimize (over $\mathbf{b}$) } \quad\left( \mathbf{y} - X \mathbf{b}\right)'\left( \mathbf{y} - X \mathbf{b}\right) $$
Which has the solution:
$$ \hat{\mathbf{b}} = (X'X)^{-1} X'\mathbf{y}$$
The algebra behind that can be found (among numerous places) on my answer here.
Best Answer
The estimator for the variance commonly used in regression does not come from the least squares principle, which only produces an estimate for $\boldsymbol{\beta}$. It is just a bias-corrected version (by the factor $\frac{n}{n-K})$ of the empirical variance
$$\widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left(y_i - \mathbf{x}_i^{T} \widehat{\boldsymbol{\beta}} \right)^2$$ which in turn is the maximum likelihood estimator for $\sigma^2$ under the assumption of a normal distribution. It's confusing that many people claim that that is the OLS estimator of the variance.