Solved – Endogeneity test instrumental variables

endogeneity, instrumental-variables, intuition, linear model

I'm reading a paper that uses the following endogeneity test:

1. First, we have the initial linear model: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e$$ where $x_3$ is the endogenous regressor and $z$ is the instrument.
2. We regress the endogenous regressor on the instrument and the exogenous regressors: $$x_3 = b_0 + b_1 x_1 + b_2 x_2 + b_3 z + u$$
3. We recover the residual $u$ from the regression in step 2. Then we estimate the following linear model: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \rho u + e$$

The paper says that this is an endogeneity test: if the estimated coefficients in step 3 are very similar to those in step 1, then the regressor $x_3$ was not endogenous.

Could anyone explain to me the intuition behind this test?

Best Answer

What you are looking at is formally known as the control function approach. When you run your first stage $$x_3 = b_0 + b_1x_1 +b_2x_2 + b_3z + u$$ you basically split the variation in $x_3$ into two parts: the exogenous variation, which comes from the exogenous regressors and the instrument, and the remainder $u$, which holds the "bad" variation that is correlated with $e$ in your original regression.

You know that when you regress $$y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3x_3 + e$$ some part of your endogenous variable is correlated with $e$, i.e. it is contained in the error term. This part is captured by $u$ in the first stage. So you can imagine that $e$ is a sort of composite error $e = \epsilon + u$ (formally this isn't the right way of making the point, but it is intuitive). Therefore, if you regress $$y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3x_3 + \rho u + e$$ there is no endogeneity problem anymore: the part of $x_3$ that is correlated with $e$ is no longer in the error term, because it is included in the regression as $u$.
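A slightly more formal version of this argument can be sketched as follows. It rests on an added assumption that is not stated in the question: that the structural error $e$ can be written as a linear projection on the first-stage error $u$, with $\epsilon$ denoting the projection error.

$$\begin{aligned}
e &= \rho u + \epsilon, \qquad \mathrm{Cov}(u, \epsilon) = 0, \\
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \rho u + \epsilon, \\
\mathrm{Cov}(x_3, e) &= \mathrm{Cov}(u, e) = \rho \, \mathrm{Var}(u).
\end{aligned}$$

If the instrument is valid, $x_1$, $x_2$ and $z$ are uncorrelated with both $e$ and $u$, so $\epsilon$ is uncorrelated with every regressor in the second line and OLS on that equation is consistent. The third line shows that $x_3$ is exogenous exactly when $\rho = 0$. In practice $u$ is unobserved and is replaced by the first-stage residual.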

If you run 2SLS instead, you will notice that the estimate of $\beta_3$ is exactly the same as the one from the control function approach (see this related question and its answer). In essence, your authors are restating the Hausman test. You know that the control function approach and 2SLS both give you consistent estimates. Therefore, if these estimates are not significantly different from the OLS estimates, the bias in OLS cannot be large (under the assumption that the instrument is valid and strong).
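The equivalence with 2SLS can be checked numerically. The following is a minimal simulation sketch, not the procedure from the paper: the data-generating values (true $\beta_3 = 1.5$, $\rho = 0.8$), the sample size, and the helper `ols` are all assumptions made for illustration. It uses only the Python standard library, so the regression is solved through the normal equations by hand.

```python
import random


def ols(columns, y):
    """OLS with an intercept, solved via the normal equations.

    `columns` is a list of regressor columns. Returns the coefficients
    (intercept first), the fitted values and the residuals.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return beta, fitted, resid


# Simulated data (all numbers below are illustrative assumptions).
random.seed(0)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]      # instrument
u = [random.gauss(0, 1) for _ in range(n)]      # first-stage error
eps = [random.gauss(0, 1) for _ in range(n)]
e = [0.8 * ui + ei for ui, ei in zip(u, eps)]   # e depends on u, so x3 is endogenous
x3 = [0.5 * a - 0.3 * b + 1.0 * c + d for a, b, c, d in zip(x1, x2, z, u)]
y = [1.0 + 2.0 * a - 1.0 * b + 1.5 * c + d for a, b, c, d in zip(x1, x2, x3, e)]

# Step 1: plain OLS of y on x1, x2, x3 (inconsistent here).
b_ols, _, _ = ols([x1, x2, x3], y)

# Step 2: first stage, x3 on the exogenous regressors and the instrument.
_, x3_hat, u_hat = ols([x1, x2, z], x3)

# Step 3: control function regression, adding the first-stage residual.
b_cf, _, _ = ols([x1, x2, x3, u_hat], y)

# 2SLS point estimates: replace x3 by its first-stage fitted values.
b_2sls, _, _ = ols([x1, x2, x3_hat], y)

print("OLS beta3:             ", b_ols[3])
print("control function beta3:", b_cf[3])
print("2SLS beta3:            ", b_2sls[3])
print("rho estimate:          ", b_cf[4])
```

In this setup the control function and 2SLS estimates of $\beta_3$ should agree up to floating-point error, while the OLS estimate should be visibly larger. The sketch reports point estimates only. A regression-based Hausman-type test would additionally check whether the coefficient on the first-stage residual is significantly different from zero, which requires standard errors that are not computed here.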

Related Solutions

Solved – Instrumental variables and mixed/multilevel models

The paper by Ebbes et al. (2005) proposes a latent IV estimation approach, for which you do not need external IVs.

Ebbes, Peter; Wedel, Michel; Böckenholt, Ulf; Steerneman, Ton; (2005). "Solving and Testing for Regressor-Error (in)Dependence When no Instrumental Variables are Available: With New Evidence for the Effect of Education on Income." Quantitative Marketing and Economics 3(4): 365-392. http://hdl.handle.net/2027.42/47579

The paper by Kim and Frees (2007) also proposes a GMM estimation approach that helps you address endogeneity problems in multilevel models (MLM).

Kim, Jee-Seon; Frees, Edward W.; (2007). "Multilevel Modelling with Correlated Effects." Psychometrika 72(4): 505-533.

However, I have not seen any R code for either of the two approaches :(.

Solved – Check for endogeneity

In general, endogeneity is a theoretical property and not something that can be tested from the data at hand alone. To test for it you need an instrument, as you say.

The second question sounds more like you are wondering what functional form will be best. There will certainly be a difference in the parameter values, but it may be that the predictions from the two are the same. You can run both, predict and inspect visually:

You could for example estimate model 1 first and compute $\widehat{\log y_1}$ as the predicted values from the first model and $\widehat{\log y_2}$ as the predicted values from the second. Then you can plot them against each other.

Stata code could be

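    * Model 1: log(y) on x1 in levels
    regress logy x1 x2
    predict yhat1, xb
    * Model 2: log(y) on log(x1)
    generate logx1 = log(x1)
    regress logy logx1 x2
    predict yhat2, xb
    * Plot the data and both sets of fitted values against x1
    twoway (scatter logy x1) (scatter yhat1 x1) (scatter yhat2 x1), legend(order(1 "data" 2 "linear" 3 "logarithmic"))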
