Multiple Regression – Omitted Variable Bias: Which Predictors to Include and Why

bias, causality, multiple regression

For the last couple of weeks I've been thinking about omitted variable bias (OVB) in the context of regression, and about how to avoid it. I am acquainted with Shalizi's lectures (2.2), but he only describes the problem mathematically.

This week somebody told me that it's quite easy: the solution to OVB is to include all those predictors that control for the effect of confounding covariates, rather than all predictors of the dependent variable Y.

I am not sure whether this is true, and I do feel that I lack deeper knowledge here.

Best Answer

This is not necessarily wrong, but not always feasible and also not a free lunch.

An omitted variable may cause bias (see, e.g., the comments below for additional thoughts on the matter) if it is both (a) related to the outcome $Y$ and (b) correlated with the predictor $X$ whose effect on $Y$ you are primarily interested in.
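Conditions (a) and (b) can be made concrete with the textbook two-regressor case. As a sketch only, assume a linear model in which $Z$ is the omitted variable and the error $\varepsilon$ is uncorrelated with both $X$ and $Z$ (the symbols $Z$, $\beta_0$, $\beta_1$, $\beta_2$ and $\varepsilon$ are introduced here for illustration):

$$Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon .$$

If you regress $Y$ on $X$ alone, the estimated slope converges to

$$\frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(X, Z)}{\operatorname{Var}(X)} .$$

The second term is the bias. It vanishes if $\beta_2 = 0$ (condition (a) fails) or if $\operatorname{Cov}(X, Z) = 0$ (condition (b) fails), which is why both conditions are needed.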

Consider an example: you want to learn about the causal effect of additional schooling on later earnings. Another variable that almost certainly satisfies conditions (a) and (b) is "motivation": more motivated people will both be more successful in their jobs (whether they are highly schooled or not) and generally choose to receive more education, as they are likely to enjoy learning and not find it too painful to study for exams.

So, when comparing the earnings of highly schooled and less schooled employees without controlling for motivation, you would likely not be comparing two groups that differ only in their schooling (whose effect you are interested in); they would also differ in their motivation. The observed difference in earnings should therefore not be ascribed to differences in schooling alone.

Now, it would indeed be a solution to control for motivation by including it in the regression. The likely problem, of course, is: are you going to have data on motivation? Even if you were to conduct a survey yourself (rather than use, say, administrative data, which will most likely not have entries on motivation), how would you even measure it?
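To see the schooling example in numbers, here is a minimal simulation sketch. Everything in it is made up for illustration: the coefficients (1.0 for schooling, 2.0 for motivation, 0.5 for the effect of motivation on schooling), the normal distributions, and the function name `simulate_ovb` are assumptions, not estimates from real data.

```python
import numpy as np

def simulate_ovb(n=50_000, beta_school=1.0, beta_motiv=2.0, seed=0):
    """Return (short, long) OLS estimates of the schooling coefficient."""
    rng = np.random.default_rng(seed)
    motivation = rng.normal(size=n)
    # Schooling depends on motivation, so the two are correlated (condition b).
    schooling = 0.5 * motivation + rng.normal(size=n)
    # Earnings depend on schooling and on motivation (condition a).
    earnings = (beta_school * schooling
                + beta_motiv * motivation
                + rng.normal(size=n))

    ones = np.ones(n)
    # "Short" regression: motivation is omitted.
    X_short = np.column_stack([ones, schooling])
    # "Long" regression: motivation is controlled for.
    X_long = np.column_stack([ones, schooling, motivation])

    b_short = np.linalg.lstsq(X_short, earnings, rcond=None)[0]
    b_long = np.linalg.lstsq(X_long, earnings, rcond=None)[0]
    return b_short[1], b_long[1]

short, long_ = simulate_ovb()
print(f"schooling coefficient, motivation omitted:    {short:.2f}")
print(f"schooling coefficient, motivation controlled: {long_:.2f}")
```

Under these assumed numbers, Cov(schooling, motivation) is 0.5 and Var(schooling) is 1.25, so the short regression should overstate the schooling effect by about 2.0 × 0.5 / 1.25 = 0.8, while the long regression should recover roughly the true value of 1.0.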

As to why including everything is not a free lunch: if you have a small sample, including all available covariates may quickly lead to overfitting when prediction is your goal. See for example this very nice discussion.
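The overfitting point can be illustrated with a small sketch as well. The setup is hypothetical: 30 training observations, one predictor that truly matters, and 20 covariates that are pure noise. The sample sizes, the number of noise covariates, and the name `out_of_sample_mse` are assumptions chosen for illustration.

```python
import numpy as np

def out_of_sample_mse(n_train=30, n_test=10_000, n_noise=20, seed=0):
    """Compare test MSE of a one-predictor model and a 'kitchen sink' model."""
    rng = np.random.default_rng(seed)

    def draw(n):
        x = rng.normal(size=n)
        noise = rng.normal(size=(n, n_noise))  # covariates unrelated to y
        y = 1.0 * x + rng.normal(size=n)       # only x affects y
        return x, noise, y

    def design(x, noise, use_all):
        cols = [np.ones(len(x)), x]
        if use_all:
            cols.extend(noise.T)  # add every noise covariate as a column
        return np.column_stack(cols)

    x_tr, z_tr, y_tr = draw(n_train)
    x_te, z_te, y_te = draw(n_test)

    results = {}
    for label, use_all in [("small", False), ("kitchen_sink", True)]:
        # Fit by least squares on the small training sample ...
        beta = np.linalg.lstsq(design(x_tr, z_tr, use_all), y_tr, rcond=None)[0]
        # ... and evaluate on a large fresh sample.
        pred = design(x_te, z_te, use_all) @ beta
        results[label] = float(np.mean((y_te - pred) ** 2))
    return results

mse = out_of_sample_mse()
print(mse)
```

With 22 coefficients estimated from 30 observations, the kitchen-sink model fits the training noise and should predict noticeably worse on fresh data than the model with the single relevant predictor, even though both contain that predictor.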