2SLS Consistency – How Consistent is 2SLS with a Binary Endogenous Variable?

endogeneity, instrumental-variables, probit

I have read that the 2SLS estimator is still consistent even with a binary endogenous variable (http://www.stata.com/statalist/archive/2004-07/msg00699.html). In the first stage, a probit treatment model is run instead of a linear model.

Is there a formal proof that 2SLS is still consistent when the first stage is a probit or logit model?

Also, what if the outcome is binary as well? I understand that if both the outcome and the endogenous variable are binary (so the first and second stages are both probit/logit models), mimicking the 2SLS method will produce an inconsistent estimate. Is there a formal proof of this? Wooldridge's econometrics book has some discussion, but I think it gives no rigorous proof of the inconsistency.

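data sim;
     do i=1 to 500000;
        /* instrument and two exogenous covariates, all standard normal */
        iv=rand("normal",0,1);
        x2=rand("normal",0,1);
        x3=rand("normal",0,1);
        /* binary treatment T drawn from a logit model in iv, x2 and x3 */
        lp=0.5+0.8*iv+0.5*x2-0.2*x3;
        T=rand("bernoulli",exp(lp)/(1+exp(lp)));
        /* continuous outcome; the true coefficient on T is 1.2 */
        Y=-0.8+1.2*T-1.3*x2-0.8*x3+rand("normal",0,1);
        output;
     end;
     run;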

****1st stage: logit model ****;
****get predicted values   ****;         
proc logistic data=sim descending;
     model T=IV;
     output out=pred1 pred=p;
     run;

****2nd stage: ols model with predicted values****;
proc reg data=pred1;
     model y=p;
     run;

The coefficient on p is 1.19984, close to the true value of 1.2. I ran only one simulation, but with a large sample size.

Best Answer

There has been a similar question regarding a probit first stage and an OLS second stage. In my answer there I linked to notes that contain a formal proof of the inconsistency of this regression, which is known as the "forbidden regression", a term coined by Jerry Hausman. The main reason the probit first stage/OLS second stage approach is inconsistent is that neither the expectations operator nor the linear projection operator passes through a non-linear first stage. The fitted values from a first-stage probit are therefore uncorrelated with the second-stage error term only under very restrictive assumptions that almost never hold in practice. Be aware, though, that the formal proof of the inconsistency of the forbidden regression is quite elaborate, if I remember correctly.
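To give a rough idea of where the problem comes from (a simplified sketch of my own, not the proof from the linked notes), take an outcome equation $Y_i = \alpha + \beta X_i + \epsilon_i$ and assume $E[\epsilon_i \mid Z_i] = 0$. Let $m(Z_i)$ denote the probability limit of the first-stage fitted values, whatever model produced them. Regressing $Y_i$ on those fitted values by OLS then gives a slope that converges to

$$\operatorname{plim}\,\hat{\beta} = \frac{\operatorname{Cov}\big(m(Z_i), Y_i\big)}{\operatorname{Var}\big(m(Z_i)\big)} = \beta\left[1 + \frac{\operatorname{Cov}\big(m(Z_i),\, X_i - m(Z_i)\big)}{\operatorname{Var}\big(m(Z_i)\big)}\right].$$

The second term in brackets is zero if $m(Z_i) = E[X_i \mid Z_i]$, i.e. if the non-linear first stage is correctly specified, because the prediction error is then mean-independent of $Z_i$. If the probit or logit is misspecified, nothing forces the first-stage residual to be uncorrelated with the fitted probability, so the term is generally non-zero. This would be in line with the simulation in the question, where the fitted logit in `iv` is probably a close approximation to $E[T \mid iv]$ and the second-stage coefficient comes out near 1.2.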

If you have a model $$Y_i = \alpha + \beta X_i + \epsilon_i$$ where $Y_i$ is a continuous outcome and $X_i$ is a binary endogenous variable, you can run the first stage $$X_i = a + Z'_i\pi + \eta_i$$ via OLS and use the fitted values $\widehat{X}_i$ in place of $X_i$ in the second stage. This is the linear probability model you were referring to. Because expectations and linear projections pass through this linear first stage without any problem, your 2SLS estimates will be consistent, albeit less efficient than they could be if we took the non-linear nature of $X_i$ into account.
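A short sketch of why this works, under the assumptions (stated here for illustration) that the instruments are exogenous, $\operatorname{Cov}(Z_i, \epsilon_i) = 0$, and relevant, $\pi \neq 0$. OLS residuals are orthogonal to OLS fitted values by construction, so in the sample $\widehat{\operatorname{Cov}}(\widehat{X}_i, X_i) = \widehat{\operatorname{Var}}(\widehat{X}_i)$. The second-stage slope is therefore

$$\hat{\beta}_{2SLS} = \frac{\widehat{\operatorname{Cov}}(\widehat{X}_i, Y_i)}{\widehat{\operatorname{Var}}(\widehat{X}_i)} = \beta + \frac{\widehat{\operatorname{Cov}}(\widehat{X}_i, \epsilon_i)}{\widehat{\operatorname{Var}}(\widehat{X}_i)} \;\overset{p}{\longrightarrow}\; \beta,$$

since $\widehat{X}_i = \hat{a} + Z'_i\hat{\pi}$ is linear in $Z_i$ and hence asymptotically uncorrelated with $\epsilon_i$. Nowhere does the argument require $E[X_i \mid Z_i]$ to be linear in $Z_i$; the first stage only has to be a linear projection, which is why a binary $X_i$ causes no trouble here.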

Consistency of this approach stems from the fact that, whilst a non-linear model may fit the conditional expectation function more closely for limited dependent variables, this does not matter much if you are interested in the marginal effect. In the linear probability model the coefficients themselves are marginal effects evaluated at the mean, so if the marginal effect at the mean is what you are after (and usually people are), then this is what you want, given that the linear model gives the best linear approximation to a non-linear conditional expectation function.
The same holds true if $Y_i$ is binary, too.

For a more detailed discussion of this, have a look at Kit Baum's excellent lecture notes on this topic. From slide 7 onwards he discusses the use of the linear probability model in the 2SLS context.

Finally, if you really want to use probit because you want more efficient estimates, there is another way, which is also mentioned in Wooldridge (2010), "Econometric Analysis of Cross Section and Panel Data". The answer linked above includes it; I repeat it here for completeness. As an applied example, see Adams et al. (2009), who use a three-step procedure that goes as follows:

  1. use probit to regress the endogenous variable on the instrument(s) and the exogenous variables
  2. use the predicted values from step 1, together with the exogenous variables (but without the instruments), as regressors in an OLS first stage for the endogenous variable
  3. do the second stage as usual

This procedure does not run into the forbidden regression problem but potentially delivers more efficient estimates of your parameter of interest.
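As a rough sketch of how the three steps could look in SAS, using the simulated data set `sim` from the question: the data set name `step1` and the variable name `phat` are placeholders of mine, and I have not run this code. Steps 2 and 3 are left to `proc syslin`, which does the OLS first stage and the second stage in one go and, unlike a manual two-step regression, reports standard errors that account for the first stage.

```sas
/* Step 1: probit of T on the instrument and the exogenous covariates. */
/* phat holds the fitted probability P(T=1 | iv, x2, x3).              */
proc logistic data=sim descending;
     model T = iv x2 x3 / link=probit;
     output out=step1 pred=phat;
     run;

/* Steps 2 and 3: 2SLS with phat as the instrument for T.              */
/* x2 and x3 enter as exogenous regressors; iv itself is not reused.   */
proc syslin data=step1 2sls;
     endogenous T;
     instruments phat x2 x3;
     model Y = T x2 x3;
     run;
```

Note that the data in the question were generated from a logit, so the probit in step 1 is deliberately not the true model. As I understand Wooldridge's discussion, the procedure does not rely on the step 1 model being correctly specified, because the fitted probability is only used as an instrument.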