This seems to be the general view in the statistics community:

"If the regression model is overspecified (outcome 4), then the regression equation contains one or more redundant predictor variables. That is, part of the model is correct, but we have gone overboard by adding predictors that are redundant. Redundant predictors lead to problems such as inflated standard errors for the regression coefficients."

"Regression models that are overspecified yield unbiased regression coefficients, unbiased predictions of the response, and an unbiased MSE. Such a regression model can be used, with caution, for prediction of the response, but should not be used to ascribe the effect of a predictor on the response. Also, as with including extraneous variables, we've also made our model more complicated and hard to understand than necessary."
I was wondering whether anyone has a proof of this property. I can certainly prove that this quote is not correct without making additional assumptions. Suppose the true population model is:
$$y_i = \beta_1 x_{1i} + e_i$$
Now estimate the model:
$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + e_i$$
Suppose that $x_2$ is in fact caused by $x_1$, or that just by fluke the two happen to have the following relationship:
$$x_{2i} = -x_{1i} + u_i$$
where $u_i$ is some centered random error. Say that the true $\beta_1$ is 1. The bias will be such that the estimated coefficient will on average be 0.5, or even $-0.5$. Perhaps the quote only concerns variables that are uncorrelated with the other independent variables? Given this result, isn't it just as bad, bias-wise, to add variables that do not belong in the model as it is to leave out variables that do?
Best Answer
Quick comments:
MATLAB code to generate data:
Estimation results:
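A minimal sketch of the experiment, here in Python/NumPy rather than MATLAB (the sample size, noise scale, and seed are illustrative assumptions, not the original's):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000          # sample size (illustrative assumption)
sigma_u = 0.01     # noise in x2 = -x1 + u; small => near-collinearity

x1 = rng.normal(size=n)
x2 = -x1 + sigma_u * rng.normal(size=n)   # the relationship from the question
y = 1.0 * x1 + rng.normal(size=n)         # true model: beta_1 = 1, x2 redundant

# Overspecified OLS: regress y on an intercept, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Classical standard errors and t-statistics.
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
print("b1, b2:   ", b[1], b[2])
print("b1 - b2:  ", b[1] - b[2])          # pinned down near 1, since x2 ~ -x1
print("t-stats:  ", b[1] / se[1], b[2] / se[2])  # individually small

# Correctly specified OLS: drop x2.
Xs = np.column_stack([np.ones(n), x1])
bs = np.linalg.lstsq(Xs, y, rcond=None)[0]
rs = y - Xs @ bs
ses = np.sqrt((rs @ rs / (n - 2)) * np.diag(np.linalg.inv(Xs.T @ Xs)))
print("restricted b1, t:", bs[1], bs[1] / ses[1])  # t-stat now huge
```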
Observe that the difference $\hat{\beta}_1 - \hat{\beta}_2$ is always about 1, but the individual estimates are massively imprecise and vary wildly between runs. If the noise $u_i$ is sufficiently small, you can't distinguish the explanatory effect of $x_1$ from that of $x_2$.
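Substituting $x_2 = -x_1 + u_i$ into the estimated model shows exactly what the data can and cannot identify:
$$y_i = \beta_1 x_{1i} + \beta_2(-x_{1i} + u_i) + e_i = (\beta_1 - \beta_2)x_{1i} + \beta_2 u_i + e_i$$
When $u_i$ is small, only the combination $\beta_1 - \beta_2$ is tightly pinned down; any split of it between the two individual coefficients fits almost equally well.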
Furthermore, the individual estimates aren't statistically significant! On the other hand, if you drop $x_2$, the t-statistic on $x_1$ shoots up to around 100.
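The run-to-run variability can be checked directly by repeating the experiment; a Monte Carlo sketch (replication count and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma_u, reps = 1_000, 0.01, 1_000   # illustrative assumptions

b1s, b2s = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = -x1 + sigma_u * rng.normal(size=n)
    y = x1 + rng.normal(size=n)          # true model: beta_1 = 1
    X = np.column_stack([x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    b1s.append(b[0])
    b2s.append(b[1])

b1s, b2s = np.array(b1s), np.array(b2s)
print("mean of b1, b2:", b1s.mean(), b2s.mean())   # near 1 and 0: unbiased
print("sd of b1, b2:  ", b1s.std(), b2s.std())     # large: imprecise
print("sd of b1 - b2: ", (b1s - b2s).std())        # tiny: difference identified
```

This is consistent with the quote's claim: the coefficient estimates are unbiased, just so imprecise that neither predictor's effect can be ascribed individually.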
Now let's increase $n$ to 10 million.
Eventually you can get n large enough to distinguish $x_1$ from $x_2$ in this setup, but $n$ needs to be obscenely large.