Solved – Does increasing sample size have any effect on omitted variable bias

Tags: bias, multiple regression, regression, sample-size

Say I have a multiple linear regression model, where two of the variables are positively correlated, and I omit one of these variables from the model.

First question – if I increase the sample size, the standard errors of the parameter estimates would decrease, wouldn't they?

Second question – would increasing the sample size have any effect on the bias of the coefficient? I am thinking that it would have no effect, but I am not sure? (Also, am I right in saying that the bias would have the same sign as the coefficient of the omitted variable?)

Best Answer

You are correct on both counts, but omitting the variable is still a very bad idea. Increasing your sample size is not going to 'fix' omitted variable bias. Consider the semi-classic example of drowning deaths and temperature (people go to swimming pools when it's warm but not when it's cold).

We estimate one model:

$$ drowning.deaths = \alpha + \beta_1 temperature + \epsilon $$

We estimate a second model:

$$ drowning.deaths = \alpha + \beta_1 temperature + \beta_2 pool.in.area + \epsilon $$

If we increase our sample size for the first model, yes, we will reduce our standard errors, but the model will still suffer from omitted variable bias: the estimate of $\beta_1$ just becomes a more precise estimate of the wrong quantity. Theory first! Remember that models like this are for hypothesis testing, and the 'best fit' model isn't always the correct model. No matter how small your standard errors are, and no matter how big your sample size is, if you've modeled it wrong, your results are not going to give you the 'right story.'
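A small simulation can make this concrete. The sketch below is illustrative only: the variable names `x1` (standing in for temperature) and `x2` (standing in for the omitted variable), the true coefficients, and the strength of the correlation are all made-up assumptions, not values from the drowning example. It fits the short model (without `x2`) at increasing sample sizes and prints the estimated slope and its standard error.

```python
import numpy as np

def fit_short_model(n, rng, beta1=2.0, beta2=3.0, delta=0.5):
    """Simulate y = 1 + beta1*x1 + beta2*x2 + e, then regress y on x1 only.

    All parameter values are hypothetical, chosen only for illustration.
    Returns the estimated slope on x1 and its conventional OLS standard error.
    """
    x1 = rng.normal(size=n)
    x2 = delta * x1 + rng.normal(size=n)    # omitted variable, correlated with x1
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1])   # design matrix that leaves out x2
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - 2)        # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # OLS covariance of the coefficients
    return coef[1], np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
results = {}
for n in (100, 10_000, 1_000_000):
    slope, se = fit_short_model(n, rng)
    results[n] = (slope, se)
    print(f"n={n:>9,}  slope={slope:.3f}  se={se:.4f}  bias={slope - 2.0:+.3f}")
```

Under these assumptions the true coefficient on `x1` is 2.0, but the short-model slope should settle near 2.0 + 3.0 × 0.5 = 3.5 as `n` grows, while the standard error keeps shrinking. The estimate gets more precise without getting any closer to the truth.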

In the case of swimming pools, say you have 50 communities with swimming pools and 50 communities without. Our first model is going to underestimate the relationship between drowning deaths and temperature in communities with a pool, and overestimate it in communities without one. It therefore misses a really important piece of the puzzle (including an interaction term would be a plus in this specific hypothetical).

If you really wanted to demonstrate a strong relationship between $drowning.deaths$ and $temperature$, you could drop $pool.in.area$, and, depending on the number of communities that had a pool, $\beta_1$ might be larger. But that would be very bad science if you also knew there was a relationship between $drowning.deaths$ and $pool.in.area$.

In general, if a variable is plausibly related to your outcome, include it in your model, with maybe a few exceptions: some instrumental variable considerations (which are beyond the scope of this question) or severe multicollinearity between that variable and the other predictors.

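On the sign of the omitted variable bias: that depends on both the relationship between the omitted variable and the outcome, and the relationship between the omitted variable and the other covariates. In general, it could go either way. In the simple case you describe, with one included and one omitted variable that are positively correlated, the bias does take the sign of the omitted variable's coefficient.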
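For the two-variable case, the standard textbook result can be written out as a sketch. The symbols $x_1$, $x_2$, $\delta$ and $\tilde{\beta}_1$ are notation introduced here for illustration: $x_1$ is the included variable, $x_2$ the omitted one, and $\tilde{\beta}_1$ the slope from the regression that leaves $x_2$ out. Suppose the true model is

$$ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$

Then, as the sample size grows, the slope from regressing $y$ on $x_1$ alone converges to

$$ \tilde{\beta}_1 \to \beta_1 + \beta_2 \delta, \qquad \delta = \frac{Cov(x_1, x_2)}{Var(x_1)} $$

The bias term $\beta_2 \delta$ does not involve the sample size, which is why more data does not remove it. Its sign is the product of the signs of $\beta_2$ and $\delta$, so with positively correlated variables ($\delta > 0$) the bias has the same sign as $\beta_2$, and with negatively correlated variables it has the opposite sign.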

To conclude, in a summary of a summary: increasing the sample size will not solve this problem, and you should include the variable unless there are other very compelling reasons not to.