Perhaps a very basic question, but one that has me confused. Say in a correlation table the relationship between $A$ and the DV ($B$) is .351, but it is -.150 in the OLS model (where you have added the variables $C$, $D$ and $E$). What does this mean? In other words: if the variables $C$ to $E$ not only change the coefficient of $A$ but even flip its sign from positive to negative, does that indicate an undesirable interaction effect between the variables used in the OLS? I have been checking the VIF scores for this, but the VIFs are low, so I have no reason to fear multicollinearity. What (if anything) is wrong?
I'm trying to wrap my head around this by constructing a simple example for myself. Say $A$ is a person's height and $B$ is the distance this person jumps. There is probably a positive correlation (taller means longer legs, which means jumping a longer distance). What variables $C$ to $E$ could offset this person's height, even to the extent that height works against him when jumping (making the coefficient between $A$ and the DV $B$ negative in the OLS)?
Best Answer
No, this doesn't imply 'the model is wrong' in the least. It's telling you that you should be wary of interpreting raw correlations when other important variables exist.
Here's a set of data I just generated (in R). The sample correlation between y and x1 is negative:
Yet the coefficient in the regression is positive:
... [snip]
Is the 'model' wrong? No, I fitted the same model I used to create the data, one that satisfies all the regression assumptions,
$y_i = 9 + 0.2\, x_{1i} - 5\, x_{2i} + e_i$, where $e_i \sim N(0, 4^2)$,
or in R:
y <- 9 + 0.2*x1 - 5*x2 + rnorm(length(x2), 0, 4)
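The whole pattern is easy to reproduce from that model. Here is a minimal self-contained sketch (my own simulation, not the answerer's exact data; the step making `x1` rise with `x2` is my assumption about how the predictors were related):

```r
# Sketch of the phenomenon (my own simulated data, not the answer's).
# Key ingredient: x1 and x2 are positively related, and x2 has a large
# negative effect on y, so the raw cor(y, x1) picks up x2's effect.
set.seed(1)
n  <- 10000
x2 <- runif(n, 0, 5)
x1 <- x2 + rnorm(n)                           # x1 rises with x2 (confounding)
y  <- 9 + 0.2*x1 - 5*x2 + rnorm(n, 0, 4)      # the model from the answer

cor(y, x1)                    # raw correlation: clearly negative
coef(lm(y ~ x1 + x2))["x1"]   # regression coefficient: positive, near 0.2
```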
So how does this happen?
Look at two things. First, look at the plot of $y$ vs $x_1$:
And we see a (very slight in this case) negative correlation.
Now look at the same plot, but with the values at a particular value of $x_2$ ($x_2=4$) marked in red:
... at a given value of $x_2$, the relationship with $x_1$ is increasing, not decreasing. The same happens at the other values of $x_2$. For each value of $x_2$, the relationship between $y$ and $x_1$ is positive. So why is the correlation negative? Because $x_1$ and $x_2$ are related.
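That "positive within each slice of $x_2$, negative overall" pattern can be checked directly. A sketch, again with my own simulated data, where $x_2$ takes a few discrete values so that "at a given value of $x_2$" is a literal subset:

```r
# Fit y ~ x1 separately at each fixed value of x2 (simulated data; the
# discrete x2 values are my choice, so each slice is a literal subset).
set.seed(2)
x2 <- rep(1:5, each = 10000)
x1 <- x2 + rnorm(length(x2))
y  <- 9 + 0.2*x1 - 5*x2 + rnorm(length(x2), 0, 4)

# Slope of y on x1 within each x2 slice: every one comes out positive,
# even though the overall correlation of y with x1 is negative.
within_slopes <- sapply(split(data.frame(x1, y), x2),
                        function(d) coef(lm(y ~ x1, data = d))["x1"])
within_slopes
cor(y, x1)
```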
If we want to look at a correlation that corresponds to the regression, the partial correlation rather than the raw correlation is the relevant quantity; here's the table of partial correlations (using the package ppcor). We see that the partial correlation between $y$ and $x_1$, controlling for $x_2$, is positive.
It wasn't the regression results that one had to beware of, it was the misleading impression from looking at the raw correlation.
Incidentally, it's also quite possible to make it so both the correlation and regression coefficient are significantly different from zero and of opposite sign ... and there's still nothing wrong with the model.