Solved – Sign flip, with much larger magnitude, when adding one more variable to a regression

Tags: multicollinearity, regression

Basic setup:

regression model: $y = \text{constant} +\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\alpha C+\epsilon$
where C is the vector of control variables.

I'm interested in the $\beta$'s and expect $\beta_1$ and $\beta_2$ to be negative. However, there is a multicollinearity problem in the model; the correlation coefficients are
$\operatorname{corr}(x_1,x_2)=0.9345$, $\operatorname{corr}(x_1,x_3)=0.1765$, $\operatorname{corr}(x_2,x_3)=0.3019$.

So $x_1$ and $x_2$ are highly correlated and should provide virtually the same information. I run three regressions:

  1. exclude $x_1$;
  2. exclude $x_2$;
  3. the original model with both $x_1$ and $x_2$.

Results:
Regressions 1 and 2 give the expected sign for $\beta_2$ and $\beta_1$ respectively, with similar magnitudes, and each is significant at the 10% level in its own model after I apply a HAC correction to the standard errors. $\beta_3$ is positive but not significant in either model.

But in regression 3, $\beta_1$ has the expected sign, while $\beta_2$ is positive, with a magnitude about twice that of $\beta_1$ in absolute value. Both $\beta_1$ and $\beta_2$ are insignificant. Moreover, the magnitude of $\beta_3$ drops by almost half compared with regressions 1 and 2.

My question is:

Why, in regression 3, does the sign of $\beta_2$ become positive, with a magnitude so much larger than that of $\beta_1$? Is there a statistical reason that $\beta_2$ can flip sign and take on such a large magnitude? Or is it because models 1 and 2 suffer from an omitted-variable problem that inflated $\beta_3$, given that $x_2$ has a positive effect on $y$? But then, in regressions 1 and 2, both $\beta_2$ and $\beta_1$ should be positive rather than negative, since the total effect of $x_1$ and $x_2$ in regression 3 is positive.
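
(For concreteness, the omitted-variable-bias algebra I have in mind is the textbook expression below, where $\delta_1$ is my own notation for the slope from an auxiliary regression of $x_1$ on $x_2$ and the controls:

$$\hat\beta_2^{\,(\text{reg. 1})} \;\xrightarrow{p}\; \beta_2 + \beta_1\delta_1, \qquad \text{where } x_1 = \delta_0 + \delta_1 x_2 + \gamma' C + u .$$

Since $\operatorname{corr}(x_1,x_2)\approx 0.93$, $\delta_1$ should be positive, so the short-regression coefficient on $x_2$ absorbs $\beta_1\delta_1$ on top of $\beta_2$.)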

Best Answer

Think of this example:

Collect a dataset based on the coins in people's pockets: the response $y$ is the total value of the coins, $x_1$ is the total number of coins, and $x_2$ is the number of coins that are not quarters (or whatever the largest common coin denomination is in your locale).

It is easy to see that a regression on either $x_1$ or $x_2$ alone would give a positive slope. But when both are included in the model, the slope on $x_2$ goes negative: increasing the number of smaller coins while holding the total number of coins fixed means replacing large coins with smaller ones, which reduces the overall value $y$.
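
A quick way to see this is to simulate the coin example. The sketch below (using numpy and statsmodels; the specific denominations and counts are my own illustrative assumptions, not from the post) reproduces the sign flip:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

quarters = rng.poisson(3, n)        # number of 25-cent coins per pocket
nickels  = rng.poisson(5, n)        # number of 5-cent ("smaller") coins per pocket

y  = 25 * quarters + 5 * nickels    # total value in cents
x1 = quarters + nickels             # total number of coins
x2 = nickels                        # number of non-quarter coins

designs = [("x1 only",   sm.add_constant(x1)),
           ("x2 only",   sm.add_constant(x2)),
           ("x1 and x2", sm.add_constant(np.column_stack([x1, x2])))]

for name, X in designs:
    fit = sm.OLS(y, X).fit()
    print(name, fit.params.round(2))

# x1 alone and x2 alone both get positive slopes, but with both included the
# coefficient on x2 flips to about -20 (since y = 25*x1 - 20*x2 exactly here).
```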

The same thing can happen whenever you have correlated $x$ variables: the sign of a term can easily be opposite when it appears by itself versus in the presence of the others.
