Regression – Why Do Coefficients Change Signs in Regression Analysis?

multicollinearity, regression

I have a dataset with a high degree of multicollinearity: all of the variables correlate positively with each other and with the dependent variable. However, in some of the models I run I get a couple of significant negative coefficients. In particular, there are two coefficients whose signs I can flip depending on which variables I include in the model.

My understanding is that if the variance-covariance matrix contains only positive values, then all coefficients should also be positive. Is this correct?

Best Answer

Because the question appears to ask about data whereas the comments talk about random variables, a data-based answer seems worth presenting.

Let's generate a small dataset. (Later, you can change this to a huge dataset if you wish, just to confirm that the phenomena shown below do not depend on the size of the dataset.) To get going, let one independent variable $x_1$ be a simple sequence $1,2,\ldots,n$. To obtain another independent variable $x_2$ with strong positive correlation, just perturb the values of $x_1$ up and down a little. Here, I alternately subtract and add $1$. It helps to rescale $x_2$, so let's just halve it. Finally, let's see what happens when we create a dependent variable $y$ that is a perfect linear combination of $x_1$ and $x_2$ (without error) but with one positive and one negative sign.

The following commands in R create examples like this using $n$ data points:

n <- 6                  # (Later, try (say) n=10000 to see what happens.)
x1 <- 1:n               # E.g., 1   2 3   4 5   6
x2 <- (x1 + c(-1,1))/2  # E.g., 0 3/2 1 5/2 2 7/2
y <- x1 - x2            # E.g., 1 1/2 2 3/2 3 5/2
data <- cbind(x1,x2,y)

Here's a picture:

[Figure: scatterplot matrix of x1, x2, and y, with linear fits]

First notice the strong, consistent positive correlations among the variables: in each panel, the points trend from lower left to upper right.

Correlations, however, are not regression coefficients. A good way to understand the multiple regression of $y$ on $x_1$ and $x_2$ is first to regress both $y$ and $x_2$ (separately) on $x_1$ (to remove the effects of $x_1$ from both $y$ and $x_2$) and then to regress the $y$ residuals on the $x_2$ residuals: the slope in that univariate regression is the $x_2$ coefficient in the multiple regression of $y$ on $x_1$ and $x_2$.
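As a quick check of this two-step description, here is a minimal sketch reusing the x1, x2, and y created above:

r_y  <- residuals(lm(y ~ x1))    # y with the effect of x1 removed
r_x2 <- residuals(lm(x2 ~ x1))   # x2 with the effect of x1 removed
coef(lm(r_y ~ r_x2))["r_x2"]     # slope of the residual-on-residual regression: -1
coef(lm(y ~ x1 + x2))["x2"]      # x2 coefficient in the multiple regression: also -1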

The lower triangle of this scatterplot matrix has been decorated with linear fits (the diagonal lines) and their residuals (the vertical line segments). Take a close look at the left column of plots, depicting the residuals of regressions against $x_1$. Scanning from left to right, notice how each time the upper panel ($x_2$ vs $x_1$) shows a negative residual, the lower panel ($y$ vs $x_1$) shows a positive residual: these residuals are negatively correlated.

That's the key insight: multiple regression peels away relationships that may otherwise be hidden by mutual associations among the independent variables.
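To put a number on that hidden relationship (again using only the variables defined above):

cor(residuals(lm(y ~ x1)), residuals(lm(x2 ~ x1)))   # -1 here, because y was built with no error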

For the doubtful, we can confirm the graphical analysis with calculations. First, the covariance matrix (scaled to simplify the presentation):

> cov(data) * 40
    x1 x2  y
x1 140 82 58
x2  82 59 23
y   58 23 35

The positive entries confirm the impression of positive correlation in the scatterplot matrix. Now, the multiple regression:

> summary(lm(y ~ x1+x2))
...
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -7.252e-16  2.571e-16 -2.821e+00   0.0667 .  
x1           1.000e+00  1.476e-16  6.776e+15   <2e-16 ***
x2          -1.000e+00  2.273e-16 -4.399e+15   <2e-16 ***

The slope for $x_1$ is $+1$ and the slope for $x_2$ is $-1$; both are highly significant.

(Of course the slopes are significant: $y$ is a linear function of $x_1$ and $x_2$ with no error. For a more realistic example, just add a little bit of random error to $y$. Provided the error is small, it can change neither the signs of the covariances nor the signs of the regression coefficients, nor can it make them "insignificant.")
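For example (a sketch only; the seed and noise level below are arbitrary choices, not part of the original construction):

set.seed(17)                            # any seed works
y_noisy <- x1 - x2 + rnorm(n, sd = 0.1) # add a little random error to y
summary(lm(y_noisy ~ x1 + x2))          # slopes stay near +1 and -1; with error this small they remain significant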
