Econometrics – Standard Errors of Two Stage Least Squares in Stata

econometrics, instrumental-variables, standard error, stata

I use Stata. I am trying to replicate the ivreg output by hand: I run the first stage, predict the fitted values of the endogenous regressor, and then run the second stage regression with those fitted values in place of the endogenous regressor in the structural model.
Naturally, the standard errors of my second stage regression do not take into account the fact that I am using an estimated regressor, so they differ from those in the output of the ivreg command.
My question is: how can I obtain reliable inference without using the ivreg command? Is there an option I should add to the second stage regression to get reliable standard errors? If not, how can I obtain reliable standard errors starting from the manual second stage regression?

Best Answer

The relevant formula is $$\mathbb{Var}(\beta_{IV})=\sigma^2 \cdot (X'P_{Z}X)^{-1},$$ where $$\sigma^2 = (y-X\beta_{IV})'(y-X\beta_{IV})/(n-k_{SS}),$$

and $$P_Z = Z \, (Z'Z)^{-1} Z',$$ and $k_{SS}$ is the number of regressors in the second stage. Some people will just use $n$ or $n-k_{FS}$ since the choice does not matter asymptotically.
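
To see why the by-hand second stage gets the coefficients right but the scale of the variance wrong, it may help to write the first-stage fitted values as $\hat{X} = P_Z X$. The notation $\tilde{\sigma}^2$ for the residual variance reported by the manual second stage is introduced here for illustration and is not part of the formula above. Since $P_Z$ is symmetric and idempotent, $$\begin{align} \hat{X}'\hat{X} &= X'P_Z'P_Z X = X'P_Z X, \\ \mathbb{Var}_{\text{by hand}} &= \tilde{\sigma}^2 \cdot (X'P_Z X)^{-1}, \qquad \tilde{\sigma}^2 = (y-\hat{X}\beta_{IV})'(y-\hat{X}\beta_{IV})/(n-k_{SS}), \\ \mathbb{Var}(\beta_{IV}) &= \frac{\sigma^2}{\tilde{\sigma}^2} \cdot \mathbb{Var}_{\text{by hand}}. \end{align}$$ So the only thing that needs fixing is the residual variance: the residuals must be computed with the original $X$ rather than with $\hat{X}$.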

Kit Baum has code in this thread on Old Statalist. I've tweaked it slightly to use ivregress rather than ivreg2:

// how to fix 2SLS estimates done 'by hand'
sysuse auto, clear
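// small applies the n-k degrees-of-freedom adjustment, so the VCE is comparable to the by-hand OLS below
ivregress 2sls price headroom (weight = turn foreign), small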
estat vce
di e(rmse)
mat v2sls = e(V)
  
// First stage reg
qui reg weight turn foreign headroom
predict double what, xb

// Second stage reg
qui reg price what headroom
scalar rmsebyhand = e(rmse)

// the 'wrong' VCE, based on residuals computed with the fitted values (what)
mat vbyhand = e(V)
scalar dfk = e(df_r)

// the correct resids: orig regressors * second stage coeffs 
gen double eps2 = (price - _b[what]*weight - _b[headroom]*headroom - _b[_cons])^2
qui su eps2

// corrected RMSE, based on the correct resids
scalar rmsecorr = sqrt(r(sum) / dfk)

// corrected VCE, using the right s^2
mat vcorr = (rmsecorr / rmsebyhand)^2 * vbyhand
mat li vcorr

// check to see that it equals the real 2SLS VCE
mat diff = v2sls - vcorr
mat li diff
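
The same variance can also be computed directly from the matrix formula, without the rescaling step. The Mata sketch below is an illustration and not part of Kit Baum's code; it assumes the auto data and the variable roles used above, and the names PZX, s2 and V are made up for this example. The results should agree with ivregress 2sls when the small option is used.

```stata
sysuse auto, clear
mata:
// outcome, second-stage regressors (with constant), and full instrument set
y = st_data(., "price")
X = st_data(., ("weight", "headroom")), J(rows(y), 1, 1)
Z = st_data(., ("turn", "foreign", "headroom")), J(rows(y), 1, 1)

// P_Z X, i.e. the first-stage fitted values of every column of X
PZX = Z * invsym(cross(Z, Z)) * cross(Z, X)

// beta_IV = (X'P_Z X)^-1 X'P_Z y
b = invsym(cross(PZX, X)) * cross(PZX, y)

// residuals use the ORIGINAL regressors, not the fitted values
e  = y - X * b
s2 = cross(e, e) / (rows(X) - cols(X))

// Var(beta_IV) = s2 * (X'P_Z X)^-1
V = s2 * invsym(cross(PZX, X))
b'
sqrt(diagonal(V))'
end
```

The last two lines print the coefficients and their standard errors in the order weight, headroom, constant.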

Related Solutions

Solved – Basic 2SLS IV Questions in Stata

You need to include all your exogenous variables in both the first and the second stage as otherwise you might end up with biased estimates. For a discussion of why having some exogenous variables in the first but not in the second stage is problematic see here. Given your setup the correct syntax for Stata would be
ivregress 2sls Y exog1 exog2 exog3 exog4 (X = inst1 inst2)

As a side note: instead of ivregress you might want to use ivreg2, a user-written command that provides many more diagnostic statistics for your 2SLS model.

For the interaction of the endogenous variable and exog3 you would also need to generate an interaction between the instruments and exog3. In a model like $$Y_i = \alpha + \beta_1 \text{exog1}_i + \beta_2 \text{exog2}_i + \beta_3 \text{exog3}_i + \beta_4 \text{exog4}_i + \gamma X_i + \epsilon_i$$ you said that you can instrument $X$ by running the first stage $$X_i = a + \rho_1 \text{exog1}_i + \rho_2 \text{exog2}_i + \rho_3 \text{exog3}_i + \rho_4 \text{exog4}_i + \phi_1 \text{inst1}_i + \phi_2 \text{inst2}_i + e_i $$ and then use the fitted values of this in the second stage. In the same spirit, if inst1 and inst2 are valid instruments for X, then inst1*exog3 and inst2*exog3 will be valid instruments for X*exog3, i.e. for a model $$Y_i = \alpha + \beta_1 \text{exog1}_i + \beta_2 \text{exog2}_i + \beta_3 \text{exog3}_i + \beta_4 \text{exog4}_i + \gamma (X_i \cdot \text{exog3}_i) + \eta_i$$ the first stage would be $$\begin{align} (X_i \cdot \text{exog3}_i) &= c + \delta_1 \text{exog1}_i + \delta_2 \text{exog2}_i + \delta_3 \text{exog3}_i + \delta_4 \text{exog4}_i + \psi_1 (\text{inst1}_i \cdot \text{exog3}_i) \\ &+ \psi_2 (\text{inst2}_i \cdot \text{exog3}_i) + u_i \end{align}$$

In Stata the least complicated way would be to generate the interactions by hand

gen Xexog3 = X*exog3
gen inst1exog3 = inst1*exog3
gen inst2exog3 = inst2*exog3
ivregress 2sls Y exog1 exog2 exog3 exog4 (X Xexog3 = inst1 inst2 inst1exog3 inst2exog3)
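
For a version that can be run as-is, the same pattern can be tried on the auto data that ships with Stata. This is only a sketch: treating weight as endogenous, headroom as the exogenous variable it is interacted with, and turn and foreign as instruments is an assumption made purely for illustration, and the generated variable names are made up here.

```stata
sysuse auto, clear

// interaction of the (assumed) endogenous variable with the exogenous one
gen double weight_hr  = weight*headroom

// matching interactions of each instrument with the same exogenous variable
gen double turn_hr    = turn*headroom
gen double foreign_hr = foreign*headroom

// two endogenous regressors, four excluded instruments
ivregress 2sls price headroom (weight weight_hr = turn foreign turn_hr foreign_hr)

// first-stage diagnostics for both endogenous regressors
estat firststage
```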

This type of question has been asked before on the Statalist, so if you are interested in further discussion of the problem have a look here.

2SLS – How to Use 2SLS with Second Stage Probit?

Your case is less problematic than the other way round. The expectation and linear projection operators pass through a linear first stage (e.g. OLS) but not through non-linear ones like probit or logit. Therefore it's not a problem if you first regress your continuous endogenous variable $X$ on your instrument(s) $Z$, $$X_i = a + Z'_i\pi + \eta_i$$ and then use the fitted values in a probit second stage to estimate $$\text{Pr}(Y_i=1|\widehat{X}_i) = \text{Pr}(\beta\widehat{X}_i + \epsilon_i > 0)$$

The standard errors won't be right because $\widehat{X}_i$ is an estimated quantity, yet the second stage treats it as if it were observed data and ignores its sampling variability. You can correct this by bootstrapping both first and second stage together. In Stata this would be something like

// use a toy data set as example
webuse nlswork

// set up the program including 1st and 2nd stage
program my2sls
    reg grade age race tenure
    predict grade_hat, xb

    probit union grade_hat age race
    drop grade_hat
end

// obtain bootstrapped standard errors
bootstrap, reps(100): my2sls

In this example we want to estimate the effect of years of education on the probability of being in a labor union. Given that years of education are likely to be endogenous, we instrument them with years of tenure in the first stage. Of course, this doesn't make any sense from the point of view of interpretation, but it illustrates the code.

Just make sure that you use the same exogenous control variables in both the first and the second stage. In the above example those are age and race, whereas the (nonsensical) instrument tenure appears only in the first stage.
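
If only the coefficient on the fitted value is of interest, one variant is to have the program return it explicitly and to set a seed so that the replications are reproducible. This is a minimal sketch under the same toy setup; the program name my2sls_r and the returned scalar b_grade are hypothetical names introduced here.

```stata
webuse nlswork, clear

// allow the script to be re-run without a "program already defined" error
capture program drop my2sls_r
program my2sls_r, rclass
    // guard against a leftover variable from an interrupted run
    capture drop grade_hat

    // first stage: endogenous variable on controls and instrument
    regress grade age race tenure
    predict double grade_hat, xb

    // second stage: probit with the fitted values and the same controls
    probit union grade_hat age race
    return scalar b_grade = _b[grade_hat]
    drop grade_hat
end

// bootstrap both stages together and collect the returned coefficient
bootstrap b_grade = r(b_grade), reps(100) seed(12345): my2sls_r
```

Because the whole program is re-run on every resample, the first-stage estimation error feeds into the reported standard error.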
