Question 3)
In matrix-vector notation, assume that the correct specification is
$$y = X\beta + \gamma y_{-1}+ e$$
(where $X$ contains the constant and the $X_1$ variable and $e$ is white noise, and $E(e\mid X) =0$), but you specify and estimate instead
$$y = X\beta + u$$
i.e. without including the lagged dependent variable (LAD), so that in reality $u =\gamma y_{-1}+ e$.
Then OLS estimation will give
$$\hat \beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \gamma y_{-1}+ e) $$
$$= \beta + (X'X)^{-1}X'y_{-1}\gamma +(X'X)^{-1}X'e$$
The expected value of the estimator is
$$E(\hat \beta) = \beta + E\Big[(X'X)^{-1}X'y_{-1}\gamma\Big] +E\Big[(X'X)^{-1}X'e\Big]$$
and using the law of iterated expectations
$$E(\hat \beta) = \beta + E\Big(E\Big[(X'X)^{-1}X'y_{-1}\gamma\mid X\Big]\Big) +E\Big(E\Big[(X'X)^{-1}X'e\mid X\Big]\Big)$$
$$= \beta + E\Big((X'X)^{-1}X'E\Big[y_{-1}\gamma\mid X\Big]\Big) +E\Big((X'X)^{-1}X'E\Big[e\mid X\Big]\Big)$$
$$=\beta + E\Big((X'X)^{-1}X'E\Big[y_{-1}\gamma\mid X\Big]\Big) + 0 $$
the last term being zero by our assumptions. But $E\Big[y_{-1}\gamma\mid X\Big] \neq 0$, because $X$ contains the regressors from all time periods, and these are correlated with the LAD vector. Therefore $E(\hat \beta) \neq \beta$. In other words, omitting the lagged dependent variable makes the estimator biased, as long as $\gamma \neq 0$, i.e. as long as the LAD does belong to the regression.
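The bias can be illustrated with a small simulation (a sketch of my own, not part of the argument above, with parameter values chosen for illustration): the regressor is made serially correlated so that it is correlated with the omitted LAD, and the misspecified OLS slope then lands well away from the true $\beta = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, gamma = 50_000, 1.0, 0.5

# Serially correlated regressor, so that x_t is correlated with y_{t-1}
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.standard_normal()

# True data-generating process includes the lagged dependent variable
e = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = beta * x[t] + gamma * y[t - 1] + e[t]

# Misspecified regression: y on a constant and x only, LAD omitted
X = np.column_stack([np.ones(n - 1), x[1:]])
beta_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0][1]
print(beta_hat)  # well above the true beta = 1
```

If instead `x` is drawn iid, it is uncorrelated with $y_{-1}$ and the slope estimate sits near 1, which is exactly the correlation channel the derivation identifies.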
Question 1)
Assume now that you specify correctly, and denote $Z$ the matrix containing also the LAD.
Here (using the same steps as before)
$$\hat \beta = \beta + (Z'Z)^{-1}Z'e$$
and
$$E(\hat \beta) = \beta + E\Big((Z'Z)^{-1}Z'E\Big[e\mid Z\Big]\Big)$$
But is $e$ (the vector) independent of $Z$? No, because $Z$ contains the LAD from all time periods bar the most recent, while $e$ contains the errors from all time periods bar the first. So even if $e$ is not serially correlated, it is correlated with the vector $y_{-1}$.
So indeed, the last term is not zero and $$E(\hat \beta) \neq \beta$$ the OLS estimator is biased.
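The finite-sample bias is easiest to see in the simplest case, a pure AR(1) with no other regressors (an illustrative sketch with parameter values of my own choosing): averaging the OLS slope of $y_t$ on $y_{t-1}$ over many short samples gives a mean estimate noticeably below the true $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n_small, reps = 0.5, 20, 5_000

# OLS slope of y_t on y_{t-1} in many short AR(1) samples
estimates = []
for _ in range(reps):
    y = np.zeros(n_small)
    for t in range(1, n_small):
        y[t] = gamma * y[t - 1] + rng.standard_normal()
    z = y[:-1]
    estimates.append(z @ y[1:] / (z @ z))

mean_estimate = np.mean(estimates)
print(mean_estimate)  # below the true gamma = 0.5: downward finite-sample bias
```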
But the OLS estimator will be consistent if indeed the inclusion of the LAD eliminates serial correlation, because (using the properties of the plim operator)
$$\operatorname{plim}\hat \beta = \beta + \operatorname{plim}\left(\frac 1{n-1} Z'Z\right)^{-1}\cdot \operatorname{plim}\left(\frac 1{n-1}Z'e\right)$$
Part of the standard assumptions (and rather "easily" satisfied) is that the first plim exists and is a finite, invertible matrix. The second plim written explicitly is (using the stationarity assumption to invoke the LLN)
$$\operatorname{plim}\left(\frac 1{n-1}Z'\mathbf e\right) = \left[\begin{matrix}
\operatorname{plim}\frac 1{n-1}\sum_{i=2}^ne_i \\
\operatorname{plim}\frac 1{n-1}\sum_{i=2}^nx_{i}e_i \\
\operatorname{plim}\frac 1{n-1}\sum_{i=2}^ny_{i-1}e_i \\
\end{matrix}\right] \rightarrow\left[\begin{matrix}
E(e_i) \\
E(x_{i}e_i) \\
E(y_{i-1}e_i) \\
\end{matrix}\right]\; \forall i$$
$E(e\mid X) = 0 \Rightarrow E(e_i) = 0$, and it also gives $E(x_{i}e_i)=0$ for all $i$.
Finally, IF serial correlation has been removed, then $E(y_{i-1}e_i) =0$ also. So this plim goes to zero and therefore
$$\operatorname{plim}\hat \beta = \beta$$
i.e. the OLS estimator is indeed consistent in this case. So the "summary" is correct.
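Consistency can be checked numerically in the same simple AR(1) setting as before (again only a sketch under my own parameter choices): as the sample grows, the OLS estimate settles down near the true $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.5

def ar1_ols(n):
    """OLS slope of y_t on y_{t-1} for one simulated AR(1) path of length n."""
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = gamma * y[t - 1] + rng.standard_normal()
    z = y[:-1]
    return z @ y[1:] / (z @ z)

for n in (50, 500, 50_000):
    print(n, ar1_ols(n))  # estimates concentrate around gamma = 0.5 as n grows
```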
Question 2)
The full sentence from Wooldridge is
"It is also valid to use the SC-robust standard errors in models with lagged dependent variables assuming, of course, that there is good reason for allowing serial correlation in such models".
meaning, when we have good reason to believe that the inclusion of lagged dependent variables does not fully remove autocorrelation. And it seems we have got ourselves a Catch-22: if serial correlation (SC) has been removed, why use SC-robust standard errors? And if serial correlation has not been removed, our OLS estimator will be inconsistent, so in such a case is it meaningful/useful/appropriate to use asymptotic inference? Well, it appears that if we do suspect that SC still exists, it is better to try to do something about it, regardless. But your comment has merit, and I would suggest contacting Wooldridge directly on the matter, in order to get an authoritative answer.
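For concreteness, here is a hand-rolled sketch of what SC-robust (Newey-West / Bartlett-kernel HAC) standard errors do, applied to a made-up regression with AR(1) errors; in practice one would rely on a library implementation (e.g. statsmodels' `fit(cov_type='HAC')`) rather than this illustration.

```python
import numpy as np

def newey_west_se(X, resid, maxlags):
    """HAC (Newey-West) standard errors: sandwich estimator with Bartlett weights."""
    XtX_inv = np.linalg.inv(X.T @ X)
    Xu = X * resid[:, None]
    S = Xu.T @ Xu  # lag-0 term: sum of u_t^2 * x_t x_t'
    for lag in range(1, maxlags + 1):
        w = 1 - lag / (maxlags + 1)  # Bartlett kernel weight
        G = Xu[lag:].T @ Xu[:-lag]   # sum of u_t u_{t-lag} * x_t x_{t-lag}'
        S += w * (G + G.T)
    V = XtX_inv @ S @ XtX_inv
    return np.sqrt(np.diag(V))

# Illustrative data: serially correlated errors, exogenous regressor
rng = np.random.default_rng(3)
n = 2_000
x = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.standard_normal()
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + u

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
sigma2 = resid @ resid / (n - 2)
se_ols = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * sigma2)
se_hac = newey_west_se(X, resid, maxlags=8)
print(se_ols, se_hac)  # HAC intercept s.e. is clearly larger than the naive one
```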
To see how an omitted regressor makes OLS inconsistent, start from the probability limit of the OLS estimator. Let $X$ denote the full matrix of regressors to be used, $[1,X_1,X_2]$, and let $e \equiv u + b_3 X_3$. Also, let $b$ be the parameters we are trying to estimate, i.e. $b = (b_0,b_1,b_2)$.
\begin{align*}
p\lim \hat{\beta} &= p\lim \left[ (X'X)^{-1}X'Y \right]
\\ &= p\lim \left[ (X'X)^{-1}X'(Xb + e) \right]
\\ &= p\lim \left[ (X'X)^{-1}X'Xb \right] + p\lim \left[ (X'X)^{-1}X'e \right]
\\ &= p\lim \left[ (X'X)^{-1}X'X \right] b + p\lim \left[ (X'X)^{-1}X'(b_3 X_3 + u) \right]
\\ &= b + b_3 p\lim \left[ (X'X)^{-1}X' X_3 \right] + p\lim \left[ (X'X)^{-1}X'u \right]
\\ &= b + b_3 p\lim \left[ (X'X)^{-1}X' X_3 \right]
\\ &= b + b_3 \left[\mathbb{E}(X'X)\right]^{-1} \mathbb{E}(X' X_3)
\end{align*}
Above, a key step is of course that $p\lim \left[ (X'X)^{-1}X'u \right] =0$, which happens because
$$ p\lim \left[ (X'X)^{-1}X'u \right] = (p\lim X'X)^{-1} p\lim (X'u) = [\mathbb{E}(X'X)]^{-1} \mathbb{E}(X'u) = 0, $$ since $\mathbb{E}(X'u)=0$, which holds because the original assumption is that each of the regressors is uncorrelated with $u$ (but not necessarily with $e$). (Strictly speaking, the moment matrices should be scaled by $1/n$ for the plims to be finite, but this does not affect the conclusion.)
Now we see that $p\lim \hat{\beta} \ne b$ whenever $\mathbb{E}(X'X_3) \ne 0$, that is whenever there is correlation between $X_1$ and $X_3$ or between $X_2$ and $X_3$.
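The limit formula can be verified by simulation (a sketch with illustrative coefficient values of my own choosing): with a large sample, the OLS estimate that omits $X_3$ sits right on $b + b_3\,[\mathbb{E}(X'X)]^{-1}\mathbb{E}(X'X_3)$, estimated here by sample moments.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
b0, b1, b2, b3 = 1.0, 2.0, -1.0, 0.5

# X3 is correlated with X1, so omitting it contaminates the estimate of b1
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = 0.6 * x1 + rng.standard_normal(n)
u = rng.standard_normal(n)
y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + u

# OLS omitting x3
X = np.column_stack([np.ones(n), x1, x2])
b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The probability limit b + b3 * E(X'X)^{-1} E(X'X3), via sample moments
b_limit = np.array([b0, b1, b2]) + b3 * np.linalg.inv(X.T @ X) @ (X.T @ x3)
print(b_hat, b_limit)  # the two vectors agree closely
```

Here only $X_1$ is correlated with $X_3$, so only the second coefficient is pulled away from its true value (from $2.0$ toward $2.0 + 0.5 \times 0.6 = 2.3$), matching the final remark above.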
I am answering under the supervision of CV's peers. Be critical.
Assume one has the following model specification
$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{u}$
where $\boldsymbol{y}$ is an $n \times 1$ vector, $\boldsymbol{X}$ an $n \times k$ matrix, $\boldsymbol{\beta}$ a $k \times 1$ parameter vector and $\boldsymbol{u}$ an $n \times 1$ vector of homoscedastic but autocorrelated residuals. At this stage we still do not know how they are autocorrelated.
Assume that one omitted another variable, say an $n \times 1$ vector $\boldsymbol{z}$, whose endogeneity consists in its autocorrelation, such that
$\boldsymbol{z} = f(\boldsymbol{z})$
where $f$ is assumed to be a bijective/invertible vector function which specifies the correlation structure between the $n$ components $z_{i=1,...,n}$ of $\boldsymbol{z}$.
This means that $\boldsymbol{u}$ is implicitly generated as follows (where $\gamma \neq 0$ is a scalar parameter and $\boldsymbol{v}$ is an $n \times 1$ vector of errors assumed to be iid normal):
$\boldsymbol{u} = \gamma\boldsymbol{z} + \boldsymbol{v} \iff \frac{1}{\gamma}(\boldsymbol{u}-\boldsymbol{v}) = \boldsymbol{z} \iff f(\frac{1}{\gamma}(\boldsymbol{u}-\boldsymbol{v})) = f(\boldsymbol{z})$
But since one has $\boldsymbol{z} = f(\boldsymbol{z})$, the last equivalence above can be turned into an equality, which leads to
$\frac{\boldsymbol{u}-\boldsymbol{v}}{\gamma} = f\left(\frac{\boldsymbol{u}-\boldsymbol{v}}{\gamma}\right) \iff \boldsymbol{u} = \gamma\, f\left(\frac{\boldsymbol{u}-\boldsymbol{v}}{\gamma}\right)+\boldsymbol{v}$
This shows that even if the autocorrelation of $\boldsymbol{u}$ is not the same as the one between the components of $\boldsymbol{z}$, it does exist.
Thus yes, serial correlation does have something to do with endogeneity when, e.g., this endogeneity consists of an omitted autocorrelated variable whose autocorrelation structure is invertible.
But in practice it is very unlikely that $f$ is invertible: if the autocorrelation works through time, $f$ is built from the backshift operator, which is not invertible.
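The last remark can be made concrete: on a finite sample, the backshift operator is the matrix that shifts a vector down by one position (padding the first slot with zero), and that matrix is singular.

```python
import numpy as np

# Backshift operator on a length-n sample: (L y)_t = y_{t-1}, with 0 in the first slot
n = 5
L = np.eye(n, k=-1)  # ones on the first subdiagonal

y = np.arange(1.0, n + 1)
print(L @ y)  # [0. 1. 2. 3. 4.]: the lagged vector
print(np.linalg.matrix_rank(L))  # n - 1, so L is not invertible
```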