Predictor Flipping Sign in Regression with No Multicollinearity

Tags: multicollinearity, multiple-regression, regression

I'm running a multiple regression model with 4 predictors. The problem is that when I put predictor A in together with the others, its sign becomes negative (whereas in simple regression its sign is positive). I also identified the other predictor (B) that causes the sign change. Importantly, A alone has a positive but non-significant coefficient; with B in the model, its coefficient becomes negative and the R-squared of the model improves significantly.

I checked the VIF and found no sign of multicollinearity (maximum VIF is 1.60). Also, the correlation between A and B is not incredibly high (only 0.6).

I have the following questions:

  1. Could you explain why the sign changes when the two predictors are combined, even though they are not multicollinear?
  2. Is it OK to leave them both in the model, or should I choose between the two? Having both makes A significant and improves the R-squared and adjusted R-squared.
  3. How do I interpret this result in simple words?

I checked these other questions (1 and 2) and found no clear answer for my case.

Best Answer

As @jbowman notes, you don't need an "incredibly high" correlation to cause the sign to flip. How far a coefficient can move is a function of the correlation, and whether the sign 'flips' depends only on whether the coefficient moved towards (and past) 0, which in turn depends on how far from 0 it was beforehand.

Multicollinearity is a pretty strict criterion for correlation. By the conventional rule of thumb, the VIF should be $\ge 10$ before you claim there is multicollinearity. If we restrict ourselves to the pairwise correlation and to two models, one with a single variable and the second with both, it's easier to see how this plays out. The VIF is $1/\text{tolerance}$, and tolerance is $1-R^2$, so a VIF of $10$ corresponds to a pairwise correlation of $r \approx .95$. In your case, working forwards: $r = .6$, $r^2 = .36$, tolerance $= .64$, $\text{VIF} = 1/.64 \approx 1.6$. What the $1.6$ means is that the variance of the sampling distribution of the coefficient is about $1.6$ times as large as it would have been had the variables been perfectly uncorrelated, so the standard error is $\sqrt{1/.64} = 1.25$ times as large. That is, the collinearity has very little effect on the power of the test of this coefficient. As a result, people quite reasonably say you don't have to worry about collinearity in cases like yours.
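To make that arithmetic concrete, here is a minimal Python sketch (the helper name `vif_from_r` is just for illustration) that works through the same numbers for the two-predictor case:

```python
import math

def vif_from_r(r):
    """Variance inflation factor implied by a pairwise correlation r
    (two-predictor case, where the R^2 from regressing one predictor
    on the other is simply r^2)."""
    tolerance = 1 - r**2           # tolerance = 1 - R^2
    vif = 1 / tolerance            # VIF = 1 / tolerance
    se_inflation = math.sqrt(vif)  # standard errors scale with sqrt(VIF)
    return vif, se_inflation

# The conventional "worry" threshold: VIF = 10 corresponds to r ~ .95
print(vif_from_r(0.95))  # ~ (10.26, 3.20)

# The case in the question: r = .6
print(vif_from_r(0.6))   # (1.5625, 1.25)
```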

But that isn't the same thing as saying that you can't see an effect of omitted-variable bias; $r = .6$ is still a reasonably strong correlation. To get a clearer understanding of how the sign could flip, it might help you to read my answer here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression? The issue just isn't restricted to multicollinearity; it can occur with any amount of correlation, if the conditions are right.
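To see that a flip needs nothing exotic, here is a small simulation sketch (the variable names and the specific coefficients, $-0.5$ and $1.5$, are made up for illustration): A and B are correlated at about $.6$, Y depends negatively on A and strongly positively on B, and the marginal slope of A comes out positive while the partial slope is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two predictors correlated at roughly r = .6 (no "multicollinearity")
b = rng.normal(size=n)
a = 0.6 * b + np.sqrt(1 - 0.6**2) * rng.normal(size=n)

# True model: Y depends negatively on A and strongly positively on B
y = -0.5 * a + 1.5 * b + rng.normal(size=n)

# Simple (marginal) regression of Y on A alone: the slope is positive,
# because A "carries" part of B's effect through their correlation
slope_marginal = np.polyfit(a, y, 1)[0]

# Multiple regression of Y on A and B: A's slope is negative
X = np.column_stack([np.ones(n), a, b])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"marginal slope of A: {slope_marginal:+.2f}")  # ~ +0.40
print(f"partial  slope of A: {coefs[1]:+.2f}")        # ~ -0.50
```

The exact numbers don't matter; the point is only that a correlation of $.6$ plus a sufficiently strong effect of B is enough to move A's coefficient across zero.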

To address your other questions: it's fine to have both in the model. You would just say that A is correlated with B in such a way that the marginal relationship between A and Y is positive, but the relationship becomes negative after controlling for B.