Solved – Overspecification bias: including too many variables in a regression model

Tags: bias, regression

This seems to be the general view in the statistics community:

"If the regression model is overspecified (outcome 4), then the
regression equation contains one or more redundant predictor
variables. That is, part of the model is correct, but we have gone
overboard by adding predictors that are redundant. Redundant
predictors lead to problems such as inflated standard errors for the
regression coefficients."

Regression models that are overspecified yield unbiased regression
coefficients, unbiased predictions of the response, and an unbiased
MSE. Such a regression model can be used, with caution, for prediction
of the response, but should not be used to ascribe the effect of a
predictor on the response. Also, as with including extraneous
variables, we've made our model more complicated and harder to
understand than necessary.

I was wondering whether anyone has a proof of this property. I can certainly show that this quote is not correct unless additional assumptions are made. Suppose the true population model is:

$$y_i = \beta_1 x_{1i} + e_i$$

Now estimate the model:

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + e_i$$

Suppose that $x_2$ is in fact caused by $x_1$, or that, just by fluke, they happen to have the following relationship:

$$x_{2i} = -x_{1i} + u_i$$

where $u_i$ is some centered random error. Say that the true $\beta_1$ is 1. The bias will be such that the estimated coefficient will on average be 0.5 or even -0.5. Perhaps the quote only concerns variables that are not correlated with the other independent variables? Given this result, isn't adding variables that do not belong in the model just as bad (bias-wise) as leaving out variables that do?
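A minimal MATLAB sketch of this setup, for checking the claim numerically (the sample size and the 0.1 noise scale are assumptions, not given above):

% Simulate the setup above: true beta_1 = 1, x_2 = -x_1 + u with centered noise u
n  = 10000;                 % sample size (assumption)
x1 = randn(n, 1);
u  = 0.1 * randn(n, 1);     % centered random error; the 0.1 scale is an assumption
x2 = -x1 + u;               % x2 determined by x1 plus noise
y  = x1 + randn(n, 1);      % true model: y_i = x_{1i} + e_i
b  = [x1 x2] \ y;           % OLS estimates of [beta_1; beta_2] in the overspecified model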

Best Answer

Quick comments:

  1. I don't know where you're pulling 0.5 from.
  2. If the variation of $u_i$ is small, you basically have a multicollinearity problem: $x_1$ and $x_2$ are for practical purposes almost the same variable.
  3. With $x_1$ and $x_2$ almost the same, what tends to happen when you regress $y$ on $x_1$ and $x_2$ is that the sum of the estimates of $\beta_1$ and $\beta_2$ will be close to the true $\beta_1$ of $1$, but individually, the estimates may be crazy! You might have $\beta_1 = 2.25$ and $\beta_2 = -1.24$. The sum is close to the true value of 1, but individually they're way off the true $\beta_1 = 1$, $\beta_2 = 0$. Furthermore, they will be highly sensitive to small changes in your data. You can simulate this to see. E.g., a small simulation I did:

MATLAB code to generate data:

n = 10000;                      % sample size
x1 = randn(n, 1);               % first regressor
x2 = x1 + randn(n, 1) * .01;    % x2 is x1 plus tiny noise: near-perfect collinearity
y = x1 + randn(n, 1);           % true model: y depends only on x1, with beta = 1
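The estimation step isn't shown; presumably it is an ordinary least squares fit of $y$ on both regressors with no intercept, something like the following (this reconstruction is an assumption):

% OLS fit of y on [x1 x2], no intercept (assumed reconstruction of the estimation step)
X = [x1 x2];
b = X \ y;      % b(1) is the estimate on x1, b(2) the estimate on x2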

Estimation results:

run 1: b1 = 1.34  b2 = -0.35
run 2: b1 = 2.14  b2 = -1.14
run 3: b1 = 0.04  b2 = 0.94

Observe that the sum is always about 1, but the individual estimates are massively imprecise and vary wildly between runs. If the noise $u_i$ is sufficiently small, you can't distinguish the explanatory effect of $x_1$ from that of $x_2$.

Furthermore, the individual coefficients aren't statistically significant! On the other hand, if you drop the $x_2$ variable, the t-stat on $x_1$ shoots up to around 100.
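As a rough check of that t-stat figure, here is a sketch of the single-regressor fit with no intercept; with noise standard deviation 1 and $n = 10000$, the t-statistic is roughly $\sqrt{n} = 100$ (this code is an illustration, not from the original answer):

% Drop x2 and compute the t-stat on x1 in the single-regressor fit
bhat  = x1 \ y;                                       % OLS estimate of beta_1
res   = y - x1 * bhat;                                % residuals
se    = sqrt((res' * res) / (n - 1) / (x1' * x1));    % standard error of bhat
tstat = bhat / se                                     % roughly 100 when n = 10000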

Now let's increase $n$ to 10 million.

run 1: b1 = 0.98  b2 = 0.018
run 2: b1 = 0.97  b2 = 0.023
run 3: b1 = 1.02  b2 = -0.018

Eventually you can make $n$ large enough to distinguish $x_1$ from $x_2$ in this setup, but $n$ needs to be obscenely large.