If two simple OLS coefficients are positive, can they flip signs during multiple OLS?

linear-model, regression, self-study

I was asked this during an interview, and I'm curious if my thinking is correct.

Fit a simple linear regression separately on each of two features, $x_1$ and $x_2$. You get two coefficients $\beta_1$ and $\beta_2$, both greater than $1$. Now fit a regression on both features at the same time. Can either coefficient be negative?

My intuition is that yes, a coefficient's sign can flip if $x_1$ and $x_2$ are nearly collinear. OLS parameter estimates are unstable in that case because the normal equations require inverting the Gram matrix $\mathbf{X}^{\top} \mathbf{X}$, which becomes ill-conditioned (nearly singular) when the columns of $\mathbf{X}$ are nearly linearly dependent. (1) Am I correct, and (2) if so, is my analysis thorough? I'm not sure if there's anything else I should consider here, or a better way to explain why the coefficients can flip signs.

Best Answer

Yes, they can flip sign if the predictors are correlated. This can be argued mathematically, but a simulation is enough to demonstrate that it happens.

```
set.seed(0)
# Generate correlated covariates (correlation 0.99)
X = MASS::mvrnorm(100, c(0, 0), matrix(c(1, 0.99, 0.99, 1), nrow = 2))
# Use them to generate observations. Only the first column affects y
y = X %*% c(2, 0) + rnorm(100, 0, 0.4)

# Estimate 3 models: two with a single variable each, one with both
m1 = lm(y ~ X[, 1])
coef(m1)
## (Intercept)      X[, 1] 
##  0.02606534  2.03186570 

m2 = lm(y ~ X[, 2])
coef(m2)
## (Intercept)      X[, 2] 
##  0.04038971  1.96816682 

m = lm(y ~ X)
coef(m)
## (Intercept)          X1          X2 
##     0.02581     2.07047    -0.03831 
```
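The mathematical argument alluded to above can be made explicit. A sketch, assuming $y$, $x_1$, $x_2$ are standardized (so each simple OLS slope equals the corresponding correlation $r_{y1}$ or $r_{y2}$, and $r_{12}$ is the correlation between the predictors): the multiple-regression coefficients have the closed form

$$
\hat\beta_1 = \frac{r_{y1} - r_{12}\, r_{y2}}{1 - r_{12}^2},
\qquad
\hat\beta_2 = \frac{r_{y2} - r_{12}\, r_{y1}}{1 - r_{12}^2}.
$$

So $\hat\beta_2 < 0$ exactly when $r_{12}\, r_{y1} > r_{y2}$, which is perfectly compatible with both marginal correlations being positive. For example, $r_{y1} = 0.9$, $r_{y2} = 0.5$, $r_{12} = 0.8$ (a valid correlation matrix) gives

$$
\hat\beta_2 = \frac{0.5 - 0.8 \times 0.9}{1 - 0.8^2} = \frac{-0.22}{0.36} \approx -0.61,
$$

a negative coefficient even though both simple-regression slopes are positive.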