Why do you put all the exogenous variables into the first and second stage of 2SLS?

2sls, least squares, regression

For a general multiple linear regression (MLR), let's say we have k endogenous X's, r exogenous W's, and m instruments Z.

The first stage of 2SLS regresses each of the endogenous X's on all the Z's and W's. In the second stage, we take the predicted values of X and plug them into our original model, which also includes the W's.

So why do we include the exogenous variables W in both stages? Since they are exogenous, isn't including them in just one stage sufficient? Could someone give a rigorous explanation?

Best Answer

Technically, you are regressing $[X\;,\; W]$ on $[Z\;,\; W]$, so the fitted values used as second-stage regressors are $[\hat X\;,\; \hat W]=[\hat X \;,\;W]$.

$\hat W = W$ since the best linear prediction of $W$ from the columns of $[Z\;,\; W]$ is $W$ itself.
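To spell this out, here is a short sketch of the projection argument. The symbols $V$, $P_V$ and $S$ are notation introduced here (not in the question), and the sketch assumes $[Z\;,\;W]$ has full column rank so that $V'V$ is invertible.

$$
V = [Z\;,\;W], \qquad P_V = V(V'V)^{-1}V', \qquad W = VS \quad\text{with}\quad S = \begin{bmatrix} 0_{m\times r} \\ I_r \end{bmatrix}
$$

$$
\hat W = P_V W = V(V'V)^{-1}V'VS = VS = W
$$

So the first stage reproduces $W$ exactly, and only the columns of $X$ are actually changed by the projection.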

But trivialities aside, $W$ is included among the first-stage regressors because it is exogenous, and excluding it would cost the 2SLS estimator efficiency or consistency (most likely both). In other words, the purpose of the first stage is to "divide out" the endogenous part of the $X$'s: $\hat X$ is the part of $X$ that can be attributed solely to exogenous movements (i.e. changes in $Z$ and $W$). If $X$ and $W$ are correlated at all, leaving $W$ out here would lose information, since the resulting fitted values would not reflect all of the exogenous movement in $X$. The part of $X$ left unexplained by the first stage would then also be correlated with $W$, so it would no longer be orthogonal to the second-stage regressors.
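A small simulation can make the first-stage point concrete. The sketch below assumes a made-up data-generating process (one endogenous $X$, one instrument $Z$, one exogenous $W$ correlated with $Z$); the coefficients and the helpers `simulate` and `ols` are illustrative choices, not something from the question. It compares 2SLS with $W$ in both stages against a version that drops $W$ from the first stage only.

```python
import numpy as np

def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def simulate(n=200_000, beta=1.0, gamma=2.0, rho=0.5, seed=0):
    """Hypothetical DGP: x is endogenous because u depends on v."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)                                   # exogenous control
    z = rho * w + np.sqrt(1 - rho**2) * rng.normal(size=n)   # instrument, corr(z, w) = rho
    v = rng.normal(size=n)                                   # first-stage error
    u = 0.8 * v + rng.normal(size=n)                         # structural error, correlated with v
    x = z + w + v
    y = beta * x + gamma * w + u
    return y, x, z, w

y, x, z, w = simulate()
ones = np.ones_like(x)

# Standard 2SLS: W appears in both stages.
F = np.column_stack([ones, z, w])
xhat = F @ ols(x, F)
b_full = ols(y, np.column_stack([ones, xhat, w]))[1]

# Variant: W dropped from the first stage, kept in the second.
F0 = np.column_stack([ones, z])
xhat0 = F0 @ ols(x, F0)
b_no_w_first = ols(y, np.column_stack([ones, xhat0, w]))[1]

print(f"true beta = 1.0, W in both stages: {b_full:.3f}")
print(f"true beta = 1.0, W missing from first stage: {b_no_w_first:.3f}")
```

Under this particular design, the first estimate should land near the true value of 1, while the second drifts toward $\beta/(1+\rho) \approx 0.67$, so here leaving $W$ out of the first stage costs consistency and not just efficiency.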

$W$ is included in the second stage to avoid omitted variable bias in the 2SLS coefficient estimates. At this point $\hat X$ is almost surely correlated with $W$ (it is a linear combination of $Z$ and $W$), so if $W$ has any effect on $Y$, leaving it out of the regression will result in biased coefficient estimates.
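A quick simulation illustrates the omitted-variable point. This is a minimal sketch under an assumed data-generating process (the parameter values and variable names are hypothetical, chosen only for illustration): $W$ is used in the first stage in both cases, and the only difference is whether it also appears in the second stage.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, gamma, rho = 200_000, 1.0, 2.0, 0.5   # assumed values, for illustration only

w = rng.normal(size=n)                                   # exogenous control
z = rho * w + np.sqrt(1 - rho**2) * rng.normal(size=n)   # instrument, correlated with w
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)                         # makes x endogenous
x = z + w + v
y = beta * x + gamma * w + u                             # w has a direct effect on y

def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# First stage uses both Z and W.
F = np.column_stack([ones, z, w])
xhat = F @ ols(x, F)

# Second stage with W (standard 2SLS) and without W (omitted variable).
b_with_w = ols(y, np.column_stack([ones, xhat, w]))[1]
b_without_w = ols(y, np.column_stack([ones, xhat]))[1]

print(f"true beta = {beta}, second stage with W: {b_with_w:.3f}")
print(f"true beta = {beta}, second stage without W: {b_without_w:.3f}")
```

With these made-up parameters, the coefficient on $\hat X$ without $W$ should settle near $\beta + \gamma/2 = 2$ instead of the true value 1, because $\hat X$ absorbs part of the direct effect of $W$ on $Y$.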
