Let's start with the simpler case where there is no correlation structure for the residuals:
library(nlme)   # gnls() is in the nlme package
fit <- gnls(model=model,data=data,start=start)
logLik(fit)
The log likelihood can then be easily computed by hand with:
N <- fit$dims$N   # number of observations
p <- fit$dims$p   # number of model parameters
sigma <- fit$sigma * sqrt((N-p)/N)   # convert to the ML estimate of sigma
sum(dnorm(y, mean=fitted(fit), sd=sigma, log=TRUE))
Since the residuals are independent, we can just use dnorm(..., log=TRUE)
to get the individual log likelihood terms (and then sum them up). Alternatively, we could use:
sum(dnorm(resid(fit), mean=0, sd=sigma, log=TRUE))
Note that fit$sigma is the "less biased estimate of $\sigma$" (computed with $N-p$ rather than $N$ in the denominator), not the ML estimate that logLik() is based on -- so we need to make the correction manually first.
Now for the more complicated case where the residuals are correlated:
fit <- gnls(model=model,data=data,start=start,correlation=correlation)
logLik(fit)
Here, we need to use the multivariate normal distribution. I am sure there is a function for this somewhere, but let's just do this by hand:
N <- fit$dims$N
p <- fit$dims$p
yhat <- cbind(fitted(fit))   # fitted values as a column vector
R <- vcv(tree, cor=TRUE)     # correlation matrix implied by the tree (vcv() is in the ape package)
sigma <- fit$sigma * sqrt((N-p)/N)   # again convert to the ML estimate of sigma
S <- diag(sigma, nrow=nrow(R)) %*% R %*% diag(sigma, nrow=nrow(R))   # marginal variance-covariance matrix
-1/2 * log(det(S)) - 1/2 * t(y - yhat) %*% solve(S) %*% (y - yhat) - N/2 * log(2*pi)   # multivariate normal log density
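For reference, the same value can be obtained with a ready-made multivariate normal density function, for example dmvnorm() from the mvtnorm package (assuming that package is available):
library(mvtnorm)
dmvnorm(c(y), mean=c(fitted(fit)), sigma=S, log=TRUE)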
It's false. As you observe, if you read Stock and Watson closely, they don't actually endorse the claim that OLS is unbiased for $\beta$ under conditional mean independence. They endorse the much weaker claim that OLS is unbiased for $\beta$ if $E(u|x,z)=z\gamma$. Then, they say something vague about non-linear least squares.
Your equation (4) contains what you need to see that the claim is false. Estimating equation (4) by OLS while omitting the variable $E(u|x,z)$ leads to omitted variables bias. As you probably recall, the bias term from omitted variables (when the omitted variable has a coefficient of 1) is controlled by the coefficients from the following auxiliary regression:
\begin{align}
E(u|z) = x\alpha_1 + z\alpha_2 + \nu
\end{align}
The bias in the original regression for $\beta$ is $\alpha_1$ from this regression, and the bias on $\gamma$ is $\alpha_2$. If $x$ is correlated
with $E(u|z)$, after controlling linearly for $z$, then $\alpha_1$ will be non-zero and the OLS coefficient will be biased.
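To spell out the intermediate step: substituting the auxiliary regression into equation (4) (with $\nu$ uncorrelated with $x$ and $z$ by construction of the projection) gives
\begin{align}
y &= x\beta + z\gamma + (x\alpha_1 + z\alpha_2 + \nu) + v \\
&= x(\beta + \alpha_1) + z(\gamma + \alpha_2) + (\nu + v),
\end{align}
so the OLS regression of $y$ on $x$ and $z$ estimates $\beta + \alpha_1$ rather than $\beta$.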
Here is an example to prove the point:
\begin{align}
\xi &\sim F(), \; \zeta \sim G(), \; \nu \sim H()\quad \text{all independent}\\
z &=\xi\\
x &= z^2 + \zeta\\
u &= z+z^2-E(z+z^2)+\nu
\end{align}
Looking at the formula for $u$, it is clear that $E(u|x,z)=E(u|z)=z+z^2-E(z+z^2)$. Looking at the auxiliary regression, it is clear that (absent some fortuitous choice of $F,G,H$) $\alpha_1$ will not be zero.
Here is a very simple example in R
which demonstrates the point:
set.seed(12344321)
z <- runif(n=100000,min=0,max=10)
x <- z^2 + runif(n=100000,min=0,max=20)                    # x is correlated with z^2
u <- z + z^2 - mean(z+z^2) + rnorm(n=100000,mean=0,sd=20)  # E(u|x,z) = E(u|z)
y <- x + z + u                                             # true coefficients: beta = 1, gamma = 1
# main regression: the coefficient on x is biased
summary(lm(y~x+z))
# auxiliary regression: the coefficient on x estimates the bias alpha_1
summary(lm(z+z^2~x+z))
Notice that the first regression gives you a coefficient on $x$ which is biased up by 0.63, reflecting the fact that $x$ "has some $z^2$ in it" as does $E(u|z)$. Notice also that the auxiliary regression gives you a bias estimate of about 0.63.
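For what it's worth, the size of this bias can be worked out from the simulation design itself. Since $x = z^2 + \zeta$, the coefficient $\alpha_1$ is the share of the variation in $x$ (after partialling out $z$) that comes from $z^2$:
\begin{align}
\alpha_1 = \frac{\operatorname{Var}(\widetilde{z^2})}{\operatorname{Var}(\widetilde{z^2}) + \operatorname{Var}(\zeta)} = \frac{500/9}{500/9 + 100/3} = \frac{5}{8} = 0.625,
\end{align}
where $\widetilde{z^2}$ is the residual from a linear regression of $z^2$ on $z$ (for $z\sim U(0,10)$, $\operatorname{Var}(\widetilde{z^2})=500/9$; for $\zeta\sim U(0,20)$, $\operatorname{Var}(\zeta)=100/3$). This matches the roughly 0.63 seen in both regressions.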
So, what are Stock and Watson (and your lecturer) talking about? Let's go back to your equation (4):
\begin{align}
y = x\beta + z\gamma + E(u|z) + v
\end{align}
It's an important fact that the omitted variable is only a function of $z$. It seems like if we could control for $z$ really well, that would be enough to purge the bias from the regression, even though $x$ may be correlated with $u$.
Suppose we estimated the equation below using either a non-parametric method to estimate the function $f()$ or using the correct functional form $f(z)=z\gamma+E(u|z)$. If we were using the correct functional form, we would be estimating it by non-linear least squares (explaining the cryptic comment about NLS):
\begin{align}
y = x\beta + f(z) + v
\end{align}
That would give us a consistent estimator for $\beta$ because there is no longer an omitted variable problem.
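Continuing the simulated example from above, where $E(u|z)$ happens to be quadratic in $z$, controlling for $z$ and $z^2$ is exactly this kind of fix, and the coefficient on $x$ should come back close to the true value of 1:
# controlling flexibly for z: no omitted variable left that is correlated with x
summary(lm(y ~ x + z + I(z^2)))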
Alternatively, if we had enough data, we could go ``all the way'' in controlling for $z$. We could look at a subset of the data where $z=1$, and just run the regression:
\begin{align}
y = x\beta + v
\end{align}
This would give unbiased, consistent estimators for the coefficients (except for the intercept, of course, which would be polluted by $f(1)$). Obviously, you could also get a (different) consistent, unbiased estimator by running that regression only on data points for which $z=2$. And another one for the points where $z=3$. Etc. Then you'd have a bunch of good estimators from which you could make a great estimator by, say, averaging them all together somehow.
This latter thought is the inspiration for matching estimators. Since we don't usually have enough data to literally run the regression only for $z=1$ or even for pairs of points where $z$ is identical, we instead run the regression for points where $z$ is ``close enough'' to being identical.
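Here is a rough sketch of that idea using the simulated data from above (the bin width of 0.25 is an arbitrary choice): regress $y$ on $x$ separately within narrow bins of $z$ and average the resulting coefficients.
# "matching" on z: regress y on x within narrow bins of z, then average
bins <- cut(z, breaks=seq(0, 10, by=0.25))
beta_by_bin <- sapply(split(data.frame(y, x), bins),
                      function(d) coef(lm(y ~ x, data=d))["x"])
mean(beta_by_bin)   # should be much closer to 1 than the pooled OLS estimate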
After scouring the web for more information as to how a system of nonlinear equations can be solved efficiently in R, I stumbled upon the nlsur command available in Stata. I know not everyone is able to use Stata, but for those who have the opportunity, it's helpful to know that the implementation in Stata is very easy. I have reused all my data-preparation work and exported only the transformed variables as a .csv.
The nlsur call solves the following three equations by two-step FGLS (if an iterated FGLS is wanted, the ifgnls option can be added after the initial values):
$$ \log S_E = \frac{\sigma - 1}{\sigma}\left(\log{\psi} + \alpha_E \tilde{t}\right) + \frac{1-\sigma}{\sigma}\log{\tilde{Y}} + \frac{\sigma - 1}{\sigma}\log{\tilde{E}} + \log{\gamma_E} $$
$$ \log{S_L} = \frac{\sigma - 1}{\sigma}\left(\log{\psi} + \alpha_L \beta \tilde{t} + \beta \log{\tilde{L}} + (1-\beta)\log{\tilde{K}}\right) + \frac{1-\sigma}{\sigma}\log{\tilde{Y}} + \log{\gamma_V \beta} $$
$$ \log\tilde{Y} = \log\psi + \frac{\sigma}{\sigma - 1} \log\left( \gamma_{V}\left(e^{\alpha_{L}\tilde{t}\beta}\left(\tilde{L}\right)^{\beta} \left(\tilde{K}\right)^{1-\beta}\right)^{\frac{\sigma-1}{\sigma}} + \gamma_{E}\left(e^{\alpha_{E}\tilde{t}} \tilde{E}\right)^{\frac{\sigma-1}{\sigma}}\right) $$
The cross-equation restrictions on my parameters are recognized automatically and the convergence behavior seems good. You have to use curly braces to explicitly declare the parameters. I personally am not a good enough programmer, but I hope that a package analogous to Stata's nlsur will become available in R.
Edit: One big drawback is that this method does NOT support Newey-West standard errors, which is a big issue for my estimation. When I wrote to Stata support, they answered that development for this is currently not planned. This is a shame because HAC-robust estimation is available for ordinary nonlinear estimation; I guess I will have to find something more suitable after all.
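As a possible (untested) starting point in R: the systemfit package has an nlsystemfit() function for nonlinear SUR, which imposes cross-equation restrictions through shared parameter names. The sketch below is only an illustration under several assumptions of mine: the file name and column names (lnSE, lnSL, lnY for the logged shares and output, E, L, K, ttil for the normalized levels and time trend) and the starting values are made up, I have not checked convergence for this system, and Newey-West / HAC standard errors would still have to be handled separately.
library(systemfit)
dat <- read.csv("transformed_variables.csv")   # hypothetical file with the normalized variables
# the three equations; the parameters (sigma, beta, psi, gammaE, gammaV, alphaE, alphaL)
# are shared across equations, which imposes the cross-equation restrictions
eqSE <- lnSE ~ (sigma - 1)/sigma * (log(psi) + alphaE * ttil) +
        (1 - sigma)/sigma * lnY + (sigma - 1)/sigma * log(E) + log(gammaE)
eqSL <- lnSL ~ (sigma - 1)/sigma * (log(psi) + alphaL * beta * ttil +
        beta * log(L) + (1 - beta) * log(K)) +
        (1 - sigma)/sigma * lnY + log(gammaV * beta)
eqY  <- lnY ~ log(psi) + sigma/(sigma - 1) *
        log(gammaV * (exp(alphaL * ttil * beta) * L^beta * K^(1 - beta))^((sigma - 1)/sigma) +
            gammaE * (exp(alphaE * ttil) * E)^((sigma - 1)/sigma))
startvals <- c(sigma = 0.5, beta = 0.5, psi = 1,
               gammaE = 0.5, gammaV = 0.5, alphaE = 0.01, alphaL = 0.01)
fit <- nlsystemfit(method = "SUR", eqns = list(eqSE, eqSL, eqY),
                   startvals = startvals, data = dat)
summary(fit)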