For both unbiasedness and variance-related issues (such as efficiency or heteroskedasticity), I would first recommend plugging your equation for $Y$ into the formulas for $\beta_{1}$ and $\beta_{2}$. Thus (for $Y_{i} = \beta X_{i} + u_{i}$) we get:
$$
\beta_{1}= \beta + \frac{\sum_{i=1}^{n}u_{i}}{\sum_{i=1}^{n}X_{i}} \\
\beta_{2}= \beta + \frac{\sum_{i=1}^{n}X_{i}u_{i}}{\sum_{i=1}^{n}X_{i}^{2}}
$$
Since $E(u_{i}|X_{i}) = 0$, you can see clearly (hint: $E(u_{i})=E(E(u_{i}|X_{i}))$) why both estimators are unbiased, and why there is a problem if we plug $Y_{i} = \beta X_{i} + \gamma R_{i} + u_{i}$ (instead of $Y_{i} = \beta X_{i} + u_{i}$) into $\beta_{1}$ and $\beta_{2}$.
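If you want to check this numerically, here is a small simulation in R (a sketch only; the sample size, the true $\beta = 2$ and the distributions are arbitrary illustrative choices):
# Simulate Y_i = beta*X_i + u_i with E(u_i|X_i) = 0 and compare the two estimators
# beta_1 = sum(Y)/sum(X) and beta_2 = sum(X*Y)/sum(X^2) over many replications.
set.seed(1)
beta <- 2
est <- t(replicate(5000, {
  X <- runif(50, min = 1, max = 5)
  u <- rnorm(50)                                   # E(u|X) = 0
  Y <- beta * X + u
  c(b1 = sum(Y)/sum(X), b2 = sum(X*Y)/sum(X^2))
}))
colMeans(est)        # both averages close to beta = 2, i.e. both estimators are unbiased
apply(est, 2, var)   # b2 has the smaller variance, as the Gauss-Markov argument below predicts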
Moreover, computing the OLS estimator ($\beta_{OLS}=(X^{'}X)^{-1}X^{'}y$) for the given model, we obtain $\beta_{2} = \beta_{OLS}$. The Gauss-Markov theorem then tells you that $\beta_{2}$ has the lower variance (both estimators are linear in $Y$ and $\beta_{2}$ is BLUE). Alternatively, you could calculate the variance of both estimators directly from the definition:
$$
Var(\beta_{2})=Var(\beta + \frac{\sum_{i=1}^{n}X_{i}u_{i}}{\sum_{i=1}^{n}X_{i}^{2}})=...
$$
remembering that $Var(u_{i}) = E(Var(u_{i}|X_{i})) + Var(E(u_{i}|X_{i}))$. This approach might be particularly useful in c), since you can no longer use the Gauss-Markov theorem (we have heteroskedasticity!). I hope you can manage to do this; to find out which variance is lower you can use the Cauchy-Schwarz inequality (https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality).
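As a sketch of how that calculation can finish in the homoskedastic case (assuming independent errors with $Var(u_{i}|X_{i})=\sigma^{2}$ and conditioning on the $X_{i}$):
$$
Var(\beta_{1}|X)=\frac{n\sigma^{2}}{\left(\sum_{i=1}^{n}X_{i}\right)^{2}}, \qquad Var(\beta_{2}|X)=\frac{\sigma^{2}}{\sum_{i=1}^{n}X_{i}^{2}},
$$
and the Cauchy-Schwarz inequality gives $\left(\sum_{i=1}^{n}X_{i}\right)^{2}\le n\sum_{i=1}^{n}X_{i}^{2}$, hence $Var(\beta_{2}|X)\le Var(\beta_{1}|X)$. Under the heteroskedasticity in c) the same steps go through, but with $Var(u_{i}|X_{i})$ no longer constant.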
Last but not least, heteroskedasticity has nothing to do with the bias or unbiasedness of your estimators (the only thing you need is $E(u_{i}|X_{i})=0$). Back to b): your approach of using the Breusch-Pagan test seems correct to me.
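For completeness, a minimal sketch of that test in R (bptest from the lmtest package; y and x below stand in for your own data):
# Breusch-Pagan test: regresses the squared OLS residuals on the regressors;
# a small p-value indicates heteroskedasticity. `y` and `x` are placeholders.
library(lmtest)
fit <- lm(y ~ x)
bptest(fit)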
It's false. As you observe, if you read Stock and Watson closely, they don't actually endorse the claim that OLS is unbiased for $\beta$ under conditional mean independence. They endorse the much weaker claim that OLS is unbiased for $\beta$ if $E(u|x,z)=z\gamma$. Then, they say something vague about non-linear least squares.
Your equation (4) contains what you need to see that the claim is false. Estimating equation (4) by OLS while omitting the variable $E(u|z)$ leads to omitted variable bias. As you probably recall, the bias from an omitted variable (when the omitted variable has a coefficient of 1) is determined by the coefficients from the following auxiliary regression:
\begin{align}
E(u|z) = x\alpha_1 + z\alpha_2 + \nu
\end{align}
The bias in the original regression for $\beta$ is $\alpha_1$ from this regression, and the bias on $\gamma$ is $\alpha_2$. If $x$ is correlated
with $E(u|z)$, after controlling linearly for $z$, then $\alpha_1$ will be non-zero and the OLS coefficient will be biased.
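To spell out that step: substituting the auxiliary regression into equation (4) gives
\begin{align}
y = x(\beta + \alpha_1) + z(\gamma + \alpha_2) + (v + \nu)
\end{align}
and, since $\nu$ is orthogonal to $x$ and $z$ by construction, OLS of $y$ on $x$ and $z$ recovers $\beta + \alpha_1$ and $\gamma + \alpha_2$ rather than $\beta$ and $\gamma$.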
Here is an example to prove the point:
\begin{align}
\xi &\sim F(), \; \zeta \sim G(), \; \nu \sim H()\quad \text{all independent}\\
z &=\xi\\
x &= z^2 + \zeta\\
u &= z+z^2-E(z+z^2)+\nu
\end{align}
Looking at the formula for $u$, it is clear that $E(u|x,z)=E(u|z)=z+z^2-E(z+z^2)$. Looking at the auxiliary regression, it is clear that (absent some fortuitous choice of $F$, $G$, $H$) $\alpha_1$ will not be zero.
Here is a very simple example in R
which demonstrates the point:
set.seed(12344321)
z <- runif(n=100000, min=0, max=10)
x <- z^2 + runif(n=100000, min=0, max=20)                    # x "has some z^2 in it"
u <- z + z^2 - mean(z+z^2) + rnorm(n=100000, mean=0, sd=20)  # E(u|x,z) = E(u|z)
y <- x + z + u                                               # true coefficients on x and z are both 1
summary(lm(y~x+z))                                           # coefficient on x is biased upward
# auxiliary regression
summary(lm(z+z^2~x+z))                                       # alpha_1 estimates the bias on x
Notice that the first regression gives you a coefficient on $x$ which is biased up by 0.63, reflecting the fact that $x$ "has some $z^2$ in it" as does $E(u|z)$. Notice also that the auxiliary regression gives you a bias estimate of about 0.63.
So, what are Stock and Watson (and your lecturer) talking about? Let's go back to your equation (4):
\begin{align}
y = x\beta + z\gamma + E(u|z) + v
\end{align}
It's an important fact that the omitted variable is only a function of $z$. It seems like if we could control for $z$ really well, that would be enough to purge the bias from the regression, even though $x$ may be correlated with $u$.
Suppose we estimated the equation below, using either a non-parametric method to estimate the function $f()$ or the correct functional form $f(z)=z\gamma+E(u|z)$. In the latter case, we would be estimating it by non-linear least squares (which explains the cryptic comment about NLS):
\begin{align}
y = x\beta + f(z) + v
\end{align}
That would give us a consistent estimator for $\beta$ because there is no longer an omitted variable problem.
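In the simulated example above, the correct functional form happens to be linear in $z$ and $z^2$ (a feature of that particular DGP, not a general fact), so simply adding $z^2$ as a regressor already plays the role of $f(z)$:
# Using the correct functional form in the simulation above: f(z) is linear in z and z^2,
# so the coefficient on x should now be close to its true value of 1.
summary(lm(y ~ x + z + I(z^2)))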
Alternatively, if we had enough data, we could go "all the way" in controlling for $z$. We could look at a subset of the data where $z=1$, and just run the regression:
\begin{align}
y = x\beta + v
\end{align}
This would give unbiased, consistent estimators of the coefficients, except for the intercept, of course, which would be polluted by $f(1)$. Obviously, you could also get a (different) consistent, unbiased estimator by running that regression only on data points for which $z=2$, and another one for the points where $z=3$, and so on. Then you'd have a bunch of good estimators from which you could make a great estimator by, say, averaging them all together somehow.
This latter thought is the inspiration for matching estimators. Since we don't usually have enough data to literally run the regression only for $z=1$, or even for pairs of points where $z$ is identical, we instead run the regression for points where $z$ is "close enough" to being identical.
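As a crude illustration of that idea on the simulated data above (a sketch, not a real matching estimator; the number of bins is an arbitrary choice):
# Stratify on z: within a narrow bin of z, E(u|z) is roughly constant, so the
# within-bin regression of y on x gives a roughly unbiased slope; then average.
bins <- cut(z, breaks = 50)
slopes <- sapply(split(data.frame(y, x), bins),
                 function(d) coef(lm(y ~ x, data = d))["x"])
mean(slopes)   # close to the true coefficient of 1, unlike the pooled OLS above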
Best Answer
This is essentially a "groupwise heteroskedastic" model, and it doesn't matter that the three $Y_i$'s and the three $X_i$'s may contain different random variables (and not just different realizations of the same random variables). All that matters is that
a) The coefficient vector is assumed identical,
b) The regressor matrices have the same column dimension, and
c) The $\mathbf U_1, \mathbf U_2, \mathbf U_3$ are assumed independent.
Then it is business as usual.
To solve it, we stack our data set per equation, i.e. all observations pertaining to the $Y_1$ equation first, etc., obtaining a $T\times K$ regressor matrix, $T= T_1+T_2+T_3$. We have the model
$$\mathbf y = \mathbf X\beta + \mathbf u \tag{1}$$
Mind you, we don't really care whether each equation has different regressors. The model is heteroskedastic, with $T \times T$ conditional covariance matrix
$$E(\mathbf u \mathbf u' \mid \mathbf X)=\mathbf V = \text{diag}(\sigma^2_1,...,\sigma^2_1,\,\sigma^2_2,...,\sigma^2_2,\,\sigma^2_3,...,\sigma^2_3) \tag{2}$$ with the appropriate sub-dimensions, i.e. each $\sigma^2_j$ repeated $T_j$ times. This matrix is symmetric and positive definite, so, as a general matrix-algebra result, there exists a decomposition
$$\mathbf V^{-1} = \mathbf C'\mathbf C = \text{diag}(1/\sigma^2_1,...,1/\sigma^2_1,\,1/\sigma^2_2,...,1/\sigma^2_2,\,1/\sigma^2_3,...,1/\sigma^2_3) \tag{3}$$
which also gives
$$\mathbf V = \mathbf C^{-1} (\mathbf C')^{-1} \tag{4}$$
So in our case $$\mathbf C = \text{diag}(1/\sigma_1,...,1/\sigma_1,\,1/\sigma_2,...,1/\sigma_2,\,1/\sigma_3,...,1/\sigma_3) \tag{5}$$
where $\mathbf C$ is a non-singular $T \times T$ matrix. Consider the linear transformation of the model obtained by premultiplying by $\mathbf C$:
$$\mathbf {\tilde y} = \mathbf C \mathbf y, \;\; \mathbf {\tilde X}= \mathbf C \mathbf X,\;\; \mathbf {\tilde u}= \mathbf C \mathbf u$$
This transformation makes the error term conditionally homoskedastic, since
$$E[\mathbf {\tilde u}\mathbf {\tilde u}'\mid \mathbf X] = \mathbf C E(\mathbf u\mathbf u'\mid \mathbf X) \mathbf C' = \mathbf C \mathbf V \mathbf C' = \mathbf C \mathbf C^{-1} (\mathbf C')^{-1} \mathbf C' = \mathbf I_T$$
and one can verify that all other properties of the linear regression model are satisfied. So running OLS on the transformed model is optimal (in the usual sense). We will obtain
$$\hat \beta^{(tr)}_{OLS} = \left(\mathbf {\tilde X}'\mathbf {\tilde X}\right)^{-1}\mathbf {\tilde X}'\mathbf {\tilde y} = \left(\mathbf X'\mathbf C' \mathbf C\mathbf X\right)^{-1}\mathbf X' \mathbf C' \mathbf C\mathbf y = \left(\mathbf X'\mathbf V^{-1}\mathbf X\right)^{-1}\mathbf X'\mathbf V^{-1}\mathbf y $$
But this is equivalent to running GLS on the untransformed model,
$$\hat \beta_{GLS} = \left(\mathbf X'\mathbf V^{-1}\mathbf X\right)^{-1}\mathbf X'\mathbf V^{-1}\mathbf y $$
and its conditional variance is easy to derive as
$$\text{Var}\left[\hat \beta_{GLS}\mid \mathbf X\right] = \left(\mathbf X'\mathbf V^{-1}\mathbf X\right)^{-1}$$
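As a sketch of how this looks in practice in R (feasible GLS on simulated groups; the group sizes, variances, the single common regressor and the per-group residual variances used to estimate the $\sigma^2_j$ are all illustrative assumptions):
# Groupwise heteroskedasticity: estimate sigma_j^2 from pooled OLS residuals,
# then run weighted least squares, which is GLS with the diagonal V above.
set.seed(1)
Tj <- c(40, 60, 80)                           # group sizes T_1, T_2, T_3
g  <- rep(1:3, times = Tj)                    # group indicator
x  <- rnorm(sum(Tj))
u  <- rnorm(sum(Tj), sd = c(1, 3, 5)[g])      # sigma_1, sigma_2, sigma_3
y  <- 1 + 2*x + u                             # identical coefficient vector across groups
ols  <- lm(y ~ x)
s2   <- tapply(resid(ols)^2, g, mean)         # estimated sigma_j^2 per group
fgls <- lm(y ~ x, weights = 1/s2[g])          # WLS = GLS with diagonal V
summary(fgls)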