Solved – 2SLS probit vs LPM

2sls, econometrics, instrumental-variables, probit, regression

I am using 2SLS to estimate the effect of education on the probability that one works. In the first stage I regress education on my instrument and the other exogenous control variables. The same exogenous control variables are then included in the second stage.

The LPM version is obtained in Stata by the following command:

ivregress 2sls emp i.country i.cohort (education=instrument)

However, I cannot decide whether to use probit instead. From the literature I have mostly found support for probit. I hence wonder when LPM is consistent and/or preferred to probit?

Best Answer

To cite from Angrist and Pischke's (2009) Mostly Harmless Econometrics,

"...while a nonlinear model may fit the CEF (conditional expectation function) for LDVs (limited dependent variable models) more closely than a linear model, when it comes to marginal effects, this probably matters little. This optimistic conclusion is not a theorem, but as in the empirical example here, it seems to be fairly robustly true." (p. 107)

So if you are interested in the average causal effect (which, from the set-up of your question, seems to be the case), then using either LPM or IV probit should be fine. Both have their advantages and disadvantages, though.

For instance, if you are interested in prediction then LPM will be no good, as its predicted probabilities are not restricted to lie between zero and one. If your errors are clustered (in your case, people in the same region are likely to be subject to similar shocks to their employment status), the standard errors are more easily adjusted in LPM. IV probit, on the other hand, is much more computationally expensive, and you also need to calculate the marginal effects in order to get interpretable coefficients; in Stata you can do this with the margins command.
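The two points above (similar marginal effects, but LPM predictions that can leave the unit interval) can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the question: there is no endogeneity, the regressor is standard normal, and the "probit" marginal effect is computed from the true coefficients (a = 0, b = 1, both hypothetical) rather than from a fitted probit. A normal regressor is a favourable case, because the population LPM slope then coincides with the probit average marginal effect (Stein's lemma).

```python
import math
import random

random.seed(0)

def normal_cdf(v):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def normal_pdf(v):
    """Standard normal density."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

# Assumed data-generating process (illustrative only): P(y = 1 | x) = Phi(a + b*x)
a, b, n = 0.0, 1.0, 20000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 if random.random() < normal_cdf(a + b * xi) else 0.0 for xi in x]

# LPM: simple OLS of the binary outcome on x
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
lpm_slope = sxy / sxx
lpm_intercept = y_bar - lpm_slope * x_bar

# Average marginal effect implied by the true probit: mean of b * phi(a + b*x)
ame = sum(b * normal_pdf(a + b * xi) for xi in x) / n

# Share of LPM fitted "probabilities" that fall outside [0, 1]
fitted = [lpm_intercept + lpm_slope * xi for xi in x]
share_outside = sum(1 for p in fitted if p < 0.0 or p > 1.0) / n

print(f"LPM slope:                  {lpm_slope:.4f}")
print(f"Probit avg. marginal effect: {ame:.4f}")
print(f"Share of fitted values outside [0, 1]: {share_outside:.3f}")
```

In this setting the slope and the average marginal effect should be close, while a non-trivial share of fitted values lies outside the unit interval. With a skewed or bounded regressor the agreement need not be as tight.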

For further discussion of LPM and IV probit have a look at these notes from page 34 onwards. The argument that LPM is fine in this case is also made in Wooldridge (2010) Econometric Analysis of Cross Section and Panel Data.

Even though this is the current general opinion on LPM vs. IV probit/logit, there is some recent work that seeks to show that LPM is not that good after all. The main reference for this should be Lewbel et al. (2012). However, their example against LPM is rather constructed, as it applies only to fairly extreme data cases. It might still be worth having a look at it, because they also compare different methods.

Related Solutions

Solved – 2SLS but second stage Probit

Your case is less problematic than the other way round. The expectation and linear projection operators pass through a linear first stage (e.g. OLS) but not through non-linear ones like probit or logit. Therefore it's not a problem if you first regress your continuous endogenous variable $X$ on your instrument(s) $Z$, $$X_i = a + Z'_i\pi + \eta_i$$ and then use the fitted values in a probit second stage to estimate $$\text{Pr}(Y_i=1|\widehat{X}_i) = \text{Pr}(\beta\widehat{X}_i + \epsilon_i > 0)$$
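To see why the linear first stage is harmless, it may help to substitute it into the latent index. The sketch below assumes a latent variable $Y_i^* = \beta X_i + \epsilon_i$ with $Y_i = 1$ when $Y_i^* > 0$, and that $(\eta_i, \epsilon_i)$ are jointly normal and independent of $Z_i$ with $\text{Var}(\epsilon_i) = 1$; these assumptions are added here for illustration and are not stated in the answer itself:

$$Y_i^* = \beta(a + Z'_i\pi) + (\beta\eta_i + \epsilon_i), \qquad \text{Pr}(Y_i=1|Z_i) = \Phi\left(\frac{\beta(a + Z'_i\pi)}{\sigma}\right), \qquad \sigma^2 = \beta^2\sigma_\eta^2 + 2\beta\sigma_{\eta\epsilon} + 1$$

Under these assumptions the composite error $\beta\eta_i + \epsilon_i$ is still normal, so a probit on the fitted values is correctly specified. Note, however, that it recovers $\beta/\sigma$ rather than $\beta$: the sign is preserved, but the coefficient is rescaled, which is one more reason to report marginal effects rather than raw coefficients.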

The standard errors won't be right because $\widehat{X}_i$ is not a random variable but an estimated quantity. You can correct this by bootstrapping both first and second stage together. In Stata this would be something like

// use a toy data set as example
webuse nlswork

// set up the program including 1st and 2nd stage
// (drop any earlier definition so the script can be re-run)
capture program drop my2sls
program my2sls
    reg grade age race tenure
    predict grade_hat, xb

    probit union grade_hat age race
    drop grade_hat
end

// obtain bootstrapped standard errors
bootstrap, reps(100): my2sls

In this example we want to estimate the effect of years of education on the probability of being in a labor union. Given that years of education are likely to be endogenous, we instrument it with years of tenure in the first stage. Of course, this doesn't make any sense from the point of interpretation but it illustrates the code.

Just make sure that you use the same exogenous control variables in both the first and second stage. In the above example those are age and race, whereas the (nonsensical) instrument tenure appears only in the first stage.

Solved – 2SLS with endogenous interaction term

At a high level, if you have a valid instrument for endog, then the interaction between your instrument for endog and link will be a valid instrument for the interaction term. So no problems there.
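A one-line sketch of the exogeneity part of this claim, writing $z$ for the instrument and $u$ for the structural error (both symbols are introduced here for illustration), and assuming that link is itself exogenous and that $u$ is mean-independent of $z$ and link, which is stronger than zero correlation:

$$E[z \cdot \text{link} \cdot u] = E\big[z \cdot \text{link} \cdot E[u \,|\, z, \text{link}]\big] = 0$$

Relevance is a separate requirement: $z \cdot \text{link}$ must be correlated with $\text{endog} \cdot \text{link}$, which would typically hold when $z$ is a strong instrument for endog and link has enough variation.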

I lack the intuition for dealing with the second part of the problem. So why not simulate? In R:

# Data generating process
n <- 100000
# Create instruments
z1 <- rnorm(n); z2 <- rnorm(n)

# Create unobserved variables
naughty1 <- rnorm(n); naughty2 <- rnorm(n)

# Create endogenous regressors
x1 <- z1 + naughty1 + rnorm(n)
x2 <- z2 + naughty2 + rnorm(n)

# Create outcome variable 
y <- 1 + 0.5*x1 + 0.2*x2 + 0.7*naughty1 -0.5*naughty2 + 2*x1*x2  + rnorm(n)

# Verify that we get biased estimates
mod <- lm(y ~ x1 + x2 + x1:x2)
summary(mod)
# We get fairly strong bias for the coefficients on x1 and x2

# Now we use your method
x1hat <- predict(lm(x1 ~ z1))
x2hat <- predict(lm(x2 ~ z2))

mod2 <- lm(y ~ x1hat + x2hat + x1hat:x2hat)
summary(mod2)
# Hey presto - it works; no more bias.

Naturally you'd need to bootstrap the confidence intervals to correct for the two stages. But what you've proposed should work. Note that it's not particularly efficient. Even in this case, I had to dial the number of observations up pretty high before it started zeroing in on the true parameters.
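The bootstrap correction mentioned here can be sketched as a pairs bootstrap that resamples whole observations and reruns both stages on each resample. The example below is written in plain Python and is deliberately simplified: it uses a single endogenous regressor with no interaction term, and the data-generating process, sample size and number of replications are all hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(1)

def ols(xs, ys):
    """Return (intercept, slope) from a simple OLS of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((v - mx) ** 2 for v in xs)
    sxy = sum((v - mx) * (w - my) for v, w in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def two_stage(rows):
    """Run both stages on rows of (z, x, y); return the second-stage slope."""
    z = [r[0] for r in rows]
    x = [r[1] for r in rows]
    y = [r[2] for r in rows]
    a0, a1 = ols(z, x)                  # first stage: x on the instrument z
    x_hat = [a0 + a1 * v for v in z]    # fitted values from the first stage
    _, beta = ols(x_hat, y)             # second stage: y on the fitted values
    return beta

# Hypothetical DGP: u is an unobserved confounder, the true effect of x is 0.5
n, true_beta = 2000, 0.5
rows = []
for _ in range(n):
    z, u = random.gauss(0, 1), random.gauss(0, 1)
    x = z + u + random.gauss(0, 1)
    y = 1 + true_beta * x + 0.7 * u + random.gauss(0, 1)
    rows.append((z, x, y))

beta_hat = two_stage(rows)

# Pairs bootstrap: resample rows with replacement and rerun BOTH stages
reps = 200
draws = sorted(two_stage(random.choices(rows, k=n)) for _ in range(reps))
boot_se = statistics.stdev(draws)
# Approximate 95% percentile interval from the sorted bootstrap draws
ci_low, ci_high = draws[int(0.025 * reps)], draws[int(0.975 * reps) - 1]

print(f"two-stage estimate: {beta_hat:.3f}")
print(f"bootstrap SE:       {boot_se:.3f}")
print(f"95% percentile CI:  [{ci_low:.3f}, {ci_high:.3f}]")
```

The key design point is that the first stage is re-estimated inside every replication, so the sampling variability of the fitted values feeds into the reported standard error. Resampling only the second stage would understate it.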
