Solved – Using predicted probabilities as regressors

instrumental-variables, interpretation, marginal-effect, probit

I am working on a project where I investigate growth in wages due to migration. I correct for the endogeneity in the decision to migrate (only those that are most likely to gain from migration will migrate) by first using a probit model to predict the probabilities of migration based on various characteristics. I then use the predicted probabilities in a second step as a proxy for migration (this in effect is an instrumental variables regression).

My problem is that I get unreasonably high estimates: wages are predicted to increase by up to 200%. My predicted probabilities are very low (3% on average, 25% at the 99th percentile), which is reasonable given that only about 5% of the sample migrate. My concern is that the coefficient I estimate corresponds to raising the probability of migrating from 0 to 1, which is a very extreme change relative to the range of predicted probabilities in my sample. Could this be causing the huge estimates? Am I interpreting this correctly? Or should I rather look at the strength of my instruments, etc.?

Best Answer

If you are interested in an approximation of the average partial effect, you could simply use a linear probability model in the first stage, i.e. run your instrumental variables estimation via 2SLS in the usual way. Because of the non-linearities involved this is not the most efficient approach, but it can give a good initial idea of the effect under study. For a more in-depth treatment of this argument, see Wooldridge (2010), "Econometric Analysis of Cross Section and Panel Data", section 15.7.3, from page 594 onward. On pages 265-268 he explains the forbidden regression and its problems.
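As a rough illustration of 2SLS with a linear probability model in the first stage, here is a minimal Python sketch using only NumPy. Everything in it is an assumption made for the example rather than something taken from the question: the simulated data, the variable names (`y` for the outcome, `d` for the binary migration indicator, `x` for an exogenous control, `z` for the instrument), and the helper functions. The true effect of `d` is set to 1.0 in the simulation so the estimates can be compared against it.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(y, d, X_exog, Z_instr):
    """2SLS point estimates with a linear (LPM) first stage.

    Returns coefficients ordered as [d, columns of X_exog].
    """
    # First stage: regress d on the exogenous variables and the instrument(s).
    W = np.column_stack([X_exog, Z_instr])
    d_hat = W @ ols(W, d)
    # Second stage: replace d by its linear first-stage fitted values.
    return ols(np.column_stack([d_hat, X_exog]), y)


def simulate(n=20000, seed=0):
    """Hypothetical data: d is endogenous because v enters both equations."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    v = rng.normal(size=n)
    d = (-0.5 + 0.5 * x + 1.0 * z + v > 0).astype(float)
    y = 1.0 + 1.0 * d + 0.5 * x + 0.8 * v + 0.6 * rng.normal(size=n)
    return y, d, x, z


if __name__ == "__main__":
    y, d, x, z = simulate()
    X = np.column_stack([np.ones_like(x), x])  # constant + exogenous control
    print("naive OLS effect of d:", ols(np.column_stack([d, X]), y)[0])
    print("2SLS (LPM) effect of d:", two_stage_least_squares(y, d, X, z)[0])
```

The sketch returns point estimates only. Standard errors taken from a hand-run second stage are not valid, so in practice a dedicated IV routine is the safer choice.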

Another procedure that you might be interested in was used by Adams et al. (2009). They use a three-step procedure with a probit "first stage" and an OLS second stage that avoids the forbidden regression problem. Their general approach is:

1. Use probit to regress the endogenous variable on the instrument(s) and exogenous variables.
2. Use the predicted values from the previous step in an OLS first stage together with the exogenous (but without the instrumental) variables.
3. Do the second stage as usual.

This procedure will yield consistent estimates and is generally more efficient than doing 2SLS with a linear probability model in the first stage.
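The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used by Adams et al.: the data are simulated, the names `y`, `d`, `x`, `z` and all helper functions are hypothetical, and the probit is fitted with a small hand-written Fisher scoring loop so that the example depends only on NumPy and the standard library. The true effect of `d` is set to 1.0 in the simulation.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def norm_cdf(t):
    """Standard normal CDF, elementwise."""
    return 0.5 * (1.0 + _erf(t / math.sqrt(2.0)))


def norm_pdf(t):
    """Standard normal density, elementwise."""
    return np.exp(-0.5 * t**2) / math.sqrt(2.0 * math.pi)


def fit_probit(W, d, n_iter=50, tol=1e-10):
    """Probit maximum likelihood by Fisher scoring; returns coefficients."""
    beta = np.zeros(W.shape[1])
    for _ in range(n_iter):
        idx = W @ beta
        P = np.clip(norm_cdf(idx), 1e-10, 1 - 1e-10)  # avoid division by zero
        f = norm_pdf(idx)
        score = W.T @ ((d - P) * f / (P * (1 - P)))
        info = (W * (f**2 / (P * (1 - P)))[:, None]).T @ W
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def probit_then_2sls(y, d, X_exog, Z_instr):
    """Three-step point estimates, ordered as [d, columns of X_exog]."""
    # Step 1: probit of d on exogenous variables and instrument(s).
    W = np.column_stack([X_exog, Z_instr])
    p_hat = norm_cdf(W @ fit_probit(W, d))
    # Step 2: OLS first stage of d on exogenous variables and p_hat
    # (the original instruments are left out here).
    F = np.column_stack([X_exog, p_hat])
    d_hat = F @ ols(F, d)
    # Step 3: usual second stage, with d replaced by d_hat (not by p_hat).
    return ols(np.column_stack([d_hat, X_exog]), y)


def simulate(n=20000, seed=0):
    """Hypothetical data: d is endogenous because v enters both equations."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    v = rng.normal(size=n)
    d = (-0.5 + 0.5 * x + 1.0 * z + v > 0).astype(float)
    y = 1.0 + 1.0 * d + 0.5 * x + 0.8 * v + 0.6 * rng.normal(size=n)
    return y, d, x, z


if __name__ == "__main__":
    y, d, x, z = simulate()
    X = np.column_stack([np.ones_like(x), x])  # constant + exogenous control
    print("three-step effect of d:", probit_then_2sls(y, d, X, z)[0])
```

The key design point is that the probit prediction `p_hat` serves only as an instrument for `d`; it is never plugged directly into the outcome equation, which is what the forbidden regression would do. As with any hand-run second stage, the sketch gives point estimates only and its residuals should not be used for standard errors.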

Related Solutions

Solved – Heckman sample selection

1. The answer is yes, you do not need to use the parameters of the inverse Mills ratios. But you must include them in the regression nevertheless, or your other parameters will be biased.
2. According to the article, yes. If different variables are statistically significant in different regressions, that is not a problem. Just assume that the coefficients of the non-significant regressors are zero.
3. Splitting is perfectly reasonable. Since you are fitting two models, one for the decision whether to go to college or not and another for log-earnings, it is perfectly reasonable to assume that different variables will be important. I would investigate this further, though: high multicollinearity when using the same variables in the probit and OLS regressions is not a standard feature of the Heckman model as far as I know.

Solved – Logit – comparison of predicted probabilities

Stack the data from the two time periods, as you have done, but don't run them separately for the time periods. Use a dummy for time, and interaction terms as appropriate. Try this:

svy: logit y x##c.age x##female x##period

This will tell you if period is significant, and if it moderates x's effect on y. You can then run your margins statement appropriately. You also have to be careful in interpreting interaction terms in logit models, because of their nonlinearity. See this reference for a detailed explanation:

Norton, E. C., Wang, H., & Ai, C. (2004). Computing interaction effects and standard errors in logit and probit models. Stata Journal, 4, 154-167.

This is kind of a contentious area and a bit has been written since 2004, though, so you should do more digging. I do believe the current implementation of margins in Stata takes care of this for you, but it would be good to be aware of the issues.

One other comment: for nonlinear models it can be dangerous to compare coefficients across separate samples. Logit models are sensitive to differences in the dispersion of the underlying latent variable, so if the dispersion or variance differs across the datasets, you may not get valid comparisons of coefficients. This isn't typically a concern with linear regression, but it is in a logit model. See this paper if you have access to Sage journals; if not, reading the abstract may be sufficient to understand that it's a problem: Karlson, K. B., Holm, A., & Breen, R. (2012). Comparing Regression Coefficients Between Same-sample Nested Models Using Logit and Probit: A New Method. Sociological Methodology, 42(1), 286-313.
