The issue you need to worry about is called endogeneity. More specifically, whether it is a problem depends on whether $x_3$ is correlated in the population with $x_1$ or $x_2$. If it is, then the associated $b_j$s will be biased. That is because OLS regression forces the residuals, $u_i$, to be uncorrelated with your covariates, the $x_j$s. However, your residuals are composed of some irreducible randomness, $\varepsilon_i$, and the unobserved (but relevant) variable, $x_3$, which by stipulation is correlated with $x_1$ and/or $x_2$. On the other hand, if both $x_1$ and $x_2$ are uncorrelated with $x_3$ in the population, then their $b$s won't be biased by this omission (they may well be biased by something else, of course). One way econometricians try to deal with this issue is by using instrumental variables.
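To see the direction and size of the problem in the simplest case, suppose the included regressors are uncorrelated with each other; then the omitted-variable-bias formula (a scalar version of the matrix result derived further below) gives
$$ p\lim\, b_2 \;=\; \beta_2 + \beta_3\,\frac{\operatorname{Cov}(x_2, x_3)}{\operatorname{Var}(x_2)}, $$
so $b_2$ is pulled away from $\beta_2$ exactly when $\operatorname{Cov}(x_2, x_3) \ne 0$, and the direction of the bias is the sign of $\beta_3 \operatorname{Cov}(x_2, x_3)$.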
For the sake of greater clarity, I've written a quick simulation in R that demonstrates that the sampling distribution of $b_2$ is unbiased / centered on the true value of $\beta_2$ when $x_2$ is uncorrelated with $x_3$. In the second run, however, note that $x_3$ is uncorrelated with $x_1$, but not with $x_2$. Not coincidentally, $b_1$ is unbiased, but $b_2$ is biased.
library(MASS) # you'll need this package below
N = 100 # this is how much data we'll use
beta0 = -71 # these are the true values of the
beta1 = .84 # parameters
beta2 = .64
beta3 = .34
############## uncorrelated version
b0VectU = vector(length=10000) # these will store the parameter
b1VectU = vector(length=10000) # estimates
b2VectU = vector(length=10000)
set.seed(7508) # this makes the simulation reproducible
for(i in 1:10000){            # we'll do this 10k times
  x1 = rnorm(N)
  x2 = rnorm(N)               # these variables are uncorrelated
  x3 = rnorm(N)
  y  = beta0 + beta1*x1 + beta2*x2 + beta3*x3 + rnorm(N)
  mod = lm(y~x1+x2)           # note all 3 variables are relevant,
                              #  but the model omits x3
  b0VectU[i] = coef(mod)[1]   # here I'm storing the estimates
  b1VectU[i] = coef(mod)[2]
  b2VectU[i] = coef(mod)[3]
}
mean(b0VectU) # [1] -71.00005 # all 3 of these are centered on
mean(b1VectU) # [1] 0.8399306 #  the true values / are unbiased
mean(b2VectU) # [1] 0.6398391 #  e.g., .64 = .64
############## correlated version
r23 = .7 # this will be the correlation in the
b0VectC = vector(length=10000) # population between x2 & x3
b1VectC = vector(length=10000)
b2VectC = vector(length=10000)
set.seed(2734)
for(i in 1:10000){
  x1 = rnorm(N)
  X  = mvrnorm(N, mu=c(0,0), Sigma=rbind(c(  1, r23),
                                         c(r23,   1)))
  x2 = X[,1]
  x3 = X[,2]                  # x3 is correlated w/ x2, but not x1
  y  = beta0 + beta1*x1 + beta2*x2 + beta3*x3 + rnorm(N)
                              # once again, all 3 variables are relevant,
  mod = lm(y~x1+x2)           #  but the model omits x3
  b0VectC[i] = coef(mod)[1]
  b1VectC[i] = coef(mod)[2]   # we store the estimates again
  b2VectC[i] = coef(mod)[3]
}
mean(b0VectC) # [1] -70.99916 # the 1st 2 are unbiased,
mean(b1VectC) # [1] 0.8409656 #  but the sampling dist of b2 is biased:
mean(b2VectC) # [1] 0.8784184 #  .88 is not equal to .64
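For what it's worth, the size of the bias in the second run lines up with the omitted-variable-bias formula: with $\operatorname{Var}(x_2)=1$ and $\operatorname{Cov}(x_2, x_3)=.7$ in the population, the bias in $b_2$ is $\beta_3 \times .7 = .34 \times .7 = .238$, and $.64 + .238 = .878$, which is essentially the $.8784$ that comes out of the simulation.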
To prove this, start from the probability limit of the OLS estimator. Let $X$ denote the matrix of regressors actually included in the estimated model, $[1, X_1, X_2]$, and let $e \equiv b_3 X_3 + u$, so that the true model $Y = Xb + b_3 X_3 + u$ can be written as $Y = Xb + e$. Here $b = (b_0, b_1, b_2)'$ is the vector of parameters we are trying to estimate.
\begin{align*}
p\lim \hat{\beta} &= p\lim \left[ (X'X)^{-1}X'Y \right]
\\ &= p\lim \left[ (X'X)^{-1}X'(Xb + e) \right]
\\ &= p\lim \left[ (X'X)^{-1}X'Xb \right] + p\lim \left[ (X'X)^{-1}X'e \right]
\\ &= p\lim \left[ (X'X)^{-1}X'X \right] b + p\lim \left[ (X'X)^{-1}X'(b_3 X_3 + u) \right]
\\ &= b + b_3\, p\lim \left[ (X'X)^{-1}X' X_3 \right] + p\lim \left[ (X'X)^{-1}X'u \right]
\\ &= b + b_3\, p\lim \left[ (X'X)^{-1}X' X_3 \right]
\\ &= b + b_3\, [\mathbb{E}(X'X)]^{-1} \mathbb{E}(X' X_3)
\end{align*}
Above, a key step is of course that $p\lim \left[ (X'X)^{-1}X'u \right] = 0$, which follows because
$$ p\lim \left[ (X'X)^{-1}X'u \right] = \left(p\lim X'X\right)^{-1} p\lim (X'u) = [\mathbb{E}(X'X)]^{-1}\, \mathbb{E}(X'u) = 0, $$
since $\mathbb{E}(X'u)=0$: the original assumption is that each of the included regressors is uncorrelated with $u$ (but not necessarily with $e$).
Now we see that $p\lim \hat{\beta} \ne b$ whenever $\mathbb{E}(X'X_3) \ne 0$, that is, whenever there is correlation between $X_1$ and $X_3$ or between $X_2$ and $X_3$.
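As a quick sanity check on that last line, the plim can be approximated by sample moments from one very large draw of the same data-generating process used in the simulation above (a sketch; the sample size and seed here are arbitrary):
library(MASS)
set.seed(42)
n   = 1e6                     # large n so sample moments approximate the plims
x1  = rnorm(n)
X23 = mvrnorm(n, mu=c(0,0), Sigma=rbind(c(1,.7), c(.7,1)))
X   = cbind(1, x1, X23[,1])   # regressors actually used: [1, x1, x2]
x3  = X23[,2]                 # the omitted variable
.34 * solve(crossprod(X)) %*% crossprod(X, x3)   # b3 * (X'X)^{-1} X'X3
# approximately (0, 0, .238): only the coefficient on x2 is pushed off,
# and .64 + .238 = .878, matching the simulated mean of b2VectC above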
Best Answer
In general, endogeneity is a theoretical property, not something that can be tested from the data at hand. To deal with it you then need something like an instrument, as you say.
The second question sounds more like you are wondering which functional form will fit best. There will certainly be a difference in the parameter values, but it may be that the predictions from the two models are essentially the same. You can run both, predict, and inspect the results visually. You could, for example, estimate model 1 first and compute $\widehat{\log y_1}$ as the predicted values from the first model and $\widehat{\log y_2}$ as the predicted values from the second. Then you can plot them against each other.
Stata code could be something along these lines (a sketch; lny, x1, x2, and x3 stand in for whatever your actual variables and specifications are):
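* sketch: lny is a placeholder for your logged outcome, x1-x3 for your regressors
regress lny x1 x2            // model 1
predict lny_hat1, xb         // predicted values from model 1
regress lny x1 x2 x3         // model 2 (the alternative specification)
predict lny_hat2, xb         // predicted values from model 2
scatter lny_hat1 lny_hat2    // plot the two sets of predictions against each other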