Regression – Understanding OLS Estimation, Bias, and Causality in Statistical Models

bias, causality, endogeneity, least squares, regression

I wish to ask about the bias of an OLS estimator. In what follows I assume that the regression we are dealing with is an approximation to a linear conditional expectation function (CEF). That is, we have:

$ E[Y_{i}|X_{i}] = \beta_{0} + \beta_{1}X_{i} $

Hence,

$ Y_{i} = \beta_{0} + \beta_{1}X_{i}+\varepsilon_{i} $

In this case, as with all CEFs, $ \varepsilon_{i} $ is defined such that $E[\varepsilon_{i}|X_{i}]=0$. This is true by definition and can be verified using the Law of Iterated Expectations, if necessary.

However, I will also note that in the background, we have a different model which is causal. It will be defined as follows:

$ Y_{i} = \delta_{0} + \delta_{1}X_{i}+u_{i} $

Notice that if $E[u_{i}|X_{i}]=0 $, then the causal model and the CEF coincide, and we can estimate the parameters of the causal model without any problems! However, let's suppose that $E[u_{i}|X_{i}] \neq 0 $. This means that we have endogeneity in the causal model, which implies that the CEF is NOT the same as the causal model. More explicitly, $ \beta_{1} \neq \delta_{1} $.

My question is as follows. Suppose that I took a sample of N observations of $ (Y_{i}, X_{i}) $ and decided to run my OLS estimator. What will happen?

We know that, formally, $ E[\hat{\beta}] = \beta + E[(X'X)^{-1}X'\varepsilon] $.

Notice, however, that by definition $ E[X'\varepsilon]=0 $. There is no such thing as endogeneity in the land of CEFs. Recall that OLS, as Angrist (2008) nicely puts it, "inherits its legitimacy from the CEF", which means that our estimator uses the definition of the error term found in the CEF. Hence, endogeneity cannot cause OLS to be biased?!

As we put forward above, endogeneity is only present in the causal model. My question is: how does endogeneity bias actually work? When does it affect the coefficient estimates produced by OLS?

Is it that the "bias" represents the difference between the coefficients found in the CEF and in the causal model?!! That is, assuming that $E[u_{i}|X_{i}] \neq 0 $, do we have $ E[\hat{\beta}] = \delta + E[(X'X)^{-1}X'u] $? In other words, does OLS give the causal model parameters plus some bias term on the end? Or, to put it less confusingly, $ E[\hat{\beta}] = \beta = \delta + E[(X'X)^{-1}X'u] $?

Lastly, when faced with this problem, do instrumental variables try to transform the regression model in such a way that the CEF results and the causal model parameters coincide upon estimation?

Any clarifications would be much appreciated!

Best Answer

You are not far from the correct interpretation. Presentations of this material are often ambiguous and/or contradictory on this point, but the problems resolve themselves if we keep the CEF and the causal model clearly separated, as you do.

In what follows I assume that the regression we are dealing with is an approximation to a linear conditional expectation function. That is, we have:

$ E[Y_{i}|X_{i}] = \beta_{0} + \beta_{1}X_{i} $

Hence,

$ Y_{i} = \beta_{0} + \beta_{1}X_{i}+\varepsilon_{i} $

In this case, as with all CEFs, $ \varepsilon_{i} $ is defined such that $E[\varepsilon_{i}|X_{i}]=0$. This is true by definition and can be verified using the Law of Iterated Expectations, if necessary.

More precisely, mean independence holds exactly only if the true CEF is itself linear; under the approximation argument, the error of the linear projection satisfies only orthogonality, so even this condition holds only approximately. For simplicity, let me assume that the true CEF is linear.
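For reference, the Law of Iterated Expectations check mentioned in the question is one line:

$ E[\varepsilon_{i}] = E\big[E[\varepsilon_{i}|X_{i}]\big] = E[0] = 0, \qquad E[X_{i}\varepsilon_{i}] = E\big[X_{i}\,E[\varepsilon_{i}|X_{i}]\big] = E[X_{i}\cdot 0] = 0 $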

However, I will also note that in the background, we have a different model which is causal. It will be defined as follows:

$ Y_{i} = \delta_{0} + \delta_{1}X_{i}+u_{i} $

Notice that if $E[u_{i}|X_{i}]=0 $, then the causal model and the CEF coincide, and we can estimate the parameters of the causal model without any problems!

This part is OK, but a clarification about the word "coincide" may help. Even if the two equations coincide algebraically, we must keep in mind that they are two very different concepts; they must not be mixed up.

However, let's suppose that $E[u_{i}|X_{i}] \neq 0 $. This means that we have endogeneity in the causal model.

Exactly.

This will imply that the CEF is NOT the same as the causal model. More explicitly, $ \beta_{1} \neq \delta_{1} $.

Yes. In this situation the CEF above does not permit us to identify the causal parameters.

My question is as follows. Suppose that I took a sample of N observations of $ (Y_{i}, X_{i}) $ and decided to run my OLS estimator. What will happen?

We know that, formally, $ E[\hat{\beta}] = \beta + E[(X'X)^{-1}X'\varepsilon] $.

Notice, however, that by definition $ E[X'\varepsilon]=0 $. There is no such thing as endogeneity in the land of CEFs. Recall that OLS, as Angrist (2008) nicely puts it, "inherits its legitimacy from the CEF", which means that our estimator uses the definition of the error term found in the CEF. Hence, endogeneity cannot cause OLS to be biased?! As we put forward above, endogeneity is only present in the causal model.

Indeed, you have that $ E[\hat{\beta}] = \beta$ regardless of any endogeneity problem.
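To see this concretely, here is a small simulation sketch (the data-generating process and all numbers are illustrative assumptions, not from the question): an omitted confounder makes $X$ endogenous in the causal model, yet the OLS slope still matches the population projection (CEF) slope $\beta_{1} = \delta_{1} + \mathrm{Cov}(X,u)/\mathrm{Var}(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical causal model: Y = delta0 + delta1*X + u, with X endogenous
delta0, delta1, gamma = 1.0, 2.0, 1.5
z = rng.normal(size=n)              # unobserved confounder
x = z + rng.normal(size=n)          # X correlated with u through z
u = gamma * z + rng.normal(size=n)  # E[u|X] != 0 here
y = delta0 + delta1 * x + u

# OLS slope from the sample covariance matrix
C = np.cov(x, y)
beta1_hat = C[0, 1] / C[0, 0]

# Population CEF slope: beta1 = delta1 + Cov(X,u)/Var(X) = 2 + 1.5/2
beta1 = delta1 + gamma * 1.0 / 2.0

print(beta1_hat, beta1)  # both approximately 2.75
```

OLS is estimating the CEF slope $\beta_{1}$ without bias; the endogeneity lives entirely in the gap between $\beta_{1}$ and $\delta_{1}$.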

My question is, how does endogeneity bias actually work? When does it affect the coefficient estimates produced by OLS?

Endogeneity implies that $ \hat{\beta} $ is biased, but biased for what? It is biased for $\delta$.

Indeed

Is it that the "bias" represents the difference between the coefficients found in the CEF and in the causal model?!! That is, assuming that $E[u_{i}|X_{i}] \neq 0 $, do we have $ E[\hat{\beta}] = \delta + E[(X'X)^{-1}X'u] $? In other words, does OLS give the causal model parameters plus some bias term on the end?

Exactly.
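A quick Monte Carlo sketch of this last point (the setup is again a hypothetical illustration): across repeated samples, the OLS slope averages to $\delta_{1} + \mathrm{Cov}(X,u)/\mathrm{Var}(X) = \beta_{1}$, not to $\delta_{1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
delta1, gamma = 2.0, 1.5   # illustrative causal slope and confounding strength
n, reps = 500, 2000

slopes = np.empty(reps)
for r in range(reps):
    z = rng.normal(size=n)              # unobserved confounder
    x = z + rng.normal(size=n)
    u = gamma * z + rng.normal(size=n)  # endogenous error of the causal model
    y = 1.0 + delta1 * x + u
    slopes[r] = np.polyfit(x, y, 1)[0]  # OLS slope for this sample

bias = gamma / 2.0                      # Cov(X,u)/Var(X) = 1.5/2
print(slopes.mean())                    # approx delta1 + bias = 2.75, not 2.0
```

The average of $\hat{\beta}_{1}$ sits on the CEF slope, so the "endogeneity bias" is exactly the wedge between the two models.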

Lastly, when faced with this problem, do instrumental variables try to transform the regression model in such a way such that the CEF results and the causal model parameters coincide upon estimation?

The IV estimator can help obtain a consistent estimate of $\delta$. However, OLS can still be used as well: what you need are good controls.
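As a sketch of how a valid instrument recovers $\delta_{1}$ (the instrument $W$ and all numbers are hypothetical assumptions): a variable that shifts $X$ but is unrelated to $u$ gives the simple IV estimate $\mathrm{Cov}(W,Y)/\mathrm{Cov}(W,X)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
delta0, delta1, gamma = 1.0, 2.0, 1.5

z = rng.normal(size=n)              # unobserved confounder
w = rng.normal(size=n)              # instrument: moves X, unrelated to u
x = z + w + rng.normal(size=n)
u = gamma * z + rng.normal(size=n)
y = delta0 + delta1 * x + u

C = np.cov([x, y, w])               # 3x3 covariance matrix of (X, Y, W)
ols = C[0, 1] / C[0, 0]             # CEF slope: delta1 + Cov(X,u)/Var(X) = 2.5
iv = C[2, 1] / C[2, 0]              # Cov(W,Y)/Cov(W,X): recovers delta1 = 2.0

print(ols, iv)
```

OLS converges to the CEF slope as before, while the IV ratio converges to the causal slope, because $\mathrm{Cov}(W,u)=0$ knocks the confounding term out of the numerator.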

Finally, this topic is quite vast; this explanation of mine may help: Under which assumptions can a regression be interpreted causally?
