You need to include all your exogenous variables in both the first and the second stage, as otherwise you might end up with biased estimates. For a discussion of why having some exogenous variables in the first but not in the second stage is problematic, see here. Given your setup, the correct syntax for Stata would be:
ivregress 2sls Y exog1 exog2 exog3 exog4 (X = inst1 inst2)
As a side note: instead of `ivregress` you might want to use `ivreg2`, which is a user-written command that provides many more diagnostic statistics for your 2SLS model.
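As a rough sketch of what those diagnostics look like in practice, the commands below run the built-in postestimation tests after `ivregress` and then the equivalent `ivreg2` call. This assumes the variables from the question are already in memory and that you are able to install packages from SSC; the `vce(robust)`/`robust` options are my own addition, not something the question asked for.

```stata
* Assumes Y, X, exog1-exog4, inst1 and inst2 are already in memory
ivregress 2sls Y exog1 exog2 exog3 exog4 (X = inst1 inst2), vce(robust)
estat firststage      // first-stage statistics: are the instruments relevant?
estat endogenous      // test of whether X can be treated as exogenous
estat overid          // overidentification test (2 instruments, 1 endogenous regressor)

* ivreg2 is user-written; it and its dependency ranktest are installed from SSC
ssc install ranktest, replace
ssc install ivreg2, replace
ivreg2 Y exog1 exog2 exog3 exog4 (X = inst1 inst2), robust first
```

The `first` option asks `ivreg2` to also display the first-stage regression along with its weak-identification statistics.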
For the interaction of the endogenous variable and `exog3` you would also need to generate an interaction between the instruments and `exog3`. In a model like
$$Y_i = \alpha + \beta_1 \text{exog1}_i + \beta_2 \text{exog2}_i + \beta_3 \text{exog3}_i + \beta_4 \text{exog4}_i + \gamma X_i + \epsilon_i$$
you said that you can instrument $X$ by running the first stage
$$X_i = a + \rho_1 \text{exog1}_i + \rho_2 \text{exog2}_i + \rho_3 \text{exog3}_i + \rho_4 \text{exog4}_i + \phi_1 \text{inst1}_i + \phi_2 \text{inst2}_i + e_i $$
and then use the fitted values of this in the second stage. In the same spirit, if `inst1` and `inst2` are valid instruments for `X`, then `inst1*exog3` and `inst2*exog3` will be valid instruments for `X*exog3`, i.e. for a model
$$Y_i = \alpha + \beta_1 \text{exog1}_i + \beta_2 \text{exog2}_i + \beta_3 \text{exog3}_i + \beta_4 \text{exog4}_i + \gamma \text{(X$_i$ $\cdot$ exog3$_i$)} + \eta_i$$
the first stage would be
$$
\begin{aligned}
\text{(X$_i$ $\cdot$ exog3$_i$)} &= c + \delta_1 \text{exog1}_i + \delta_2 \text{exog2}_i + \delta_3 \text{exog3}_i + \delta_4 \text{exog4}_i + \psi_1 \text{(inst1 $\cdot$ exog3)}_i \\
&\quad + \psi_2 \text{(inst2 $\cdot$ exog3)}_i + u_i
\end{aligned}
$$
In Stata the least complicated way would be to generate the interactions by hand
gen Xexog3 = X*exog3
gen inst1exog3 = inst1*exog3
gen inst2exog3 = inst2*exog3
ivregress 2sls Y exog1 exog2 exog3 exog4 (X Xexog3 = inst1 inst2 inst1exog3 inst2exog3)
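If you want to convince yourself that this works before applying it to your own data, a small simulation is one way to do it. The sketch below is purely illustrative: the data-generating process, the coefficient values (2 for `X`, 1.5 for the interaction), the seed and the unobserved confounder `u` are all assumptions of mine, not part of your setup.

```stata
clear
set seed 12345
set obs 5000

* Hypothetical exogenous controls and instruments, all standard normal
foreach v in exog1 exog2 exog3 exog4 inst1 inst2 {
    generate `v' = rnormal()
}

generate u = rnormal()    // unobserved confounder that makes X endogenous
generate X = 0.5*inst1 + 0.5*inst2 + 0.3*exog3 + u + rnormal()
generate Y = 1 + exog1 + exog2 + exog3 + exog4 + 2*X + 1.5*X*exog3 + u + rnormal()

* Interactions generated by hand, as above
generate Xexog3     = X*exog3
generate inst1exog3 = inst1*exog3
generate inst2exog3 = inst2*exog3

* 2SLS with two endogenous regressors and four instruments
ivregress 2sls Y exog1 exog2 exog3 exog4 (X Xexog3 = inst1 inst2 inst1exog3 inst2exog3)
estat firststage

* OLS for comparison; it ignores the confounder u
regress Y exog1 exog2 exog3 exog4 X Xexog3
```

In this simulated setup the 2SLS coefficients on `X` and `Xexog3` should land close to the assumed values, whereas the OLS coefficient on `X` should be pushed away from 2 because `u` enters both equations.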
This type of question has been asked before on the Statalist, so if you are interested in further discussion of the problem have a look here.
There has been a similar question regarding a probit first stage and an OLS second stage. In the answer I have provided a link to notes that contain a formal proof of the inconsistency of this regression, which is known as the "forbidden regression", a term coined by Jerry Hausman. The main reason for the inconsistency of the probit first stage/OLS second stage approach is that neither the expectations operator nor the linear projection operator passes through a non-linear first stage. Therefore the fitted values from a first-stage probit are uncorrelated with the second-stage error term only under very restrictive assumptions that almost never hold in practice.
Be aware though that the formal proof of the inconsistency of the forbidden regression is quite elaborate, if I remember correctly.
If you have a model
$$Y_i = \alpha + \beta X_i + \epsilon_i$$
where $Y_i$ is a continuous outcome and $X_i$ is a binary endogenous variable, you can run the first stage
$$X_i = a + Z'_i\pi + \eta_i$$
via OLS and use the fitted values $\widehat{X}_i$ instead of $X_i$ in the second stage. This is the linear probability model you were referring to. Given that there is no problem for expectations or linear projections for this linear first stage, your 2SLS estimates will be consistent albeit less efficient than they could be if we were to take into account the non-linear nature of $X_i$.
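The intuition (not the formal proof) can be sketched in one substitution. Writing $\widehat{X}_i$ for whichever fitted value is plugged into the second stage, the outcome equation becomes

$$
\begin{aligned}
Y_i &= \alpha + \beta X_i + \epsilon_i \\
    &= \alpha + \beta \widehat{X}_i + \big[\beta (X_i - \widehat{X}_i) + \epsilon_i\big]
\end{aligned}
$$

With the OLS first stage, the residual $X_i - \widehat{X}_i$ is uncorrelated with $\widehat{X}_i$ in the sample by construction, so the only remaining requirement is that $Z_i$ be uncorrelated with $\epsilon_i$. If $\widehat{X}_i$ were instead a probit fitted probability $\Phi(Z'_i\hat{\pi})$ (my notation, used here only for illustration), nothing guarantees that orthogonality mechanically: it holds only when the probit is the correct model for $E[X_i \mid Z_i]$, so any misspecification of the first stage ends up in the second-stage error.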
Consistency of this approach stems from the fact that, whilst a non-linear model may fit the conditional expectation function more closely for limited dependent variables, this does not matter much if you are interested in the marginal effect. In the linear probability model the coefficients themselves are marginal effects evaluated at the mean, so if the marginal effect at the mean is what you are after (and usually people are), then this is what you want, given that the linear model gives the best linear approximation to a non-linear conditional expectation function.
The same holds true if $Y_i$ is binary, too.
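As an illustration of the linear first stage with a binary endogenous regressor, here is a minimal simulated example. Everything in it is hypothetical: the variable names `Z1`, `Z2` and `v`, the seed, and the true effect of 2 are assumptions chosen for the sketch. Robust standard errors are requested because the linear probability first stage is heteroskedastic by construction.

```stata
clear
set seed 2024
set obs 5000

generate Z1 = rnormal()
generate Z2 = rnormal()
generate v  = rnormal()                        // unobservable driving both X and Y
generate byte X = (0.8*Z1 + 0.8*Z2 + v > 0)    // binary endogenous regressor
generate Y  = 1 + 2*X + 0.7*v + rnormal()      // assumed true effect of X is 2

* 2SLS with a linear (OLS) first stage; "first" also displays that first stage
ivregress 2sls Y (X = Z1 Z2), vce(robust) first

* Naive OLS for comparison
regress Y X, vce(robust)
```

In this setup the 2SLS coefficient on `X` should be close to 2, while the OLS coefficient should be biased upwards because `v` raises both `X` and `Y`.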
For a more detailed discussion of this have a look at Kit Baum's excellent lecture notes on this topic. From slide 7 he discusses the use of the linear probability model in the 2SLS context.
Finally, if you really want to use probit because you want more efficient estimates, then there is another way, which is also mentioned in Wooldridge (2010) "Econometric Analysis of Cross Section and Panel Data". The answer linked above includes it; I repeat it here for completeness. As an applied example see Adams et al. (2009), who use a three-step procedure that goes as follows:
- use probit to regress the endogenous variable on the instrument(s) and exogenous variables
- use the predicted values from the previous step in an OLS first stage together with the exogenous (but without the instrumental) variables
- do the second stage as usual
This procedure does not run into the forbidden regression problem but potentially delivers more efficient estimates of your parameter of interest.
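In Stata the three steps above could be sketched roughly as follows. The simulated data, the variable names (`Z1`, `Z2` as instruments, `W1` as an exogenous control, `Xhat` as the probit prediction) and the true effect of 2 are assumptions made only so that the example runs on its own. Using `Xhat` as the single excluded instrument in `ivregress` performs steps two and three in one call.

```stata
clear
set seed 2024
set obs 5000

generate Z1 = rnormal()
generate Z2 = rnormal()
generate W1 = rnormal()
generate v  = rnormal()                                 // unobservable driving both X and Y
generate byte X = (0.8*Z1 + 0.8*Z2 + 0.5*W1 + v > 0)    // binary endogenous regressor
generate Y  = 1 + 2*X + W1 + 0.7*v + rnormal()          // assumed true effect of X is 2

* Step 1: probit of the endogenous variable on instruments and exogenous variables
probit X Z1 Z2 W1
predict double Xhat, pr                                 // fitted probabilities

* Steps 2 and 3: OLS first stage of X on Xhat and W1 (Z1, Z2 left out), then the second stage
ivregress 2sls Y W1 (X = Xhat), vce(robust) first
```

Because `Xhat` enters only as an instrument and never replaces `X` directly in the outcome equation, the second stage does not depend on the probit being correctly specified.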
The reference for this should be Newey (1987) "Efficient estimation of limited dependent variable models with endogenous explanatory variables", Journal of Econometrics, Vol. 36(3), pp. 231–250. This is the estimator that is implemented with the `ivprobit` command in Stata, for instance, where you can have an OLS first stage and probit second stage.