Would it be possible for two variables to be negatively correlated with one another, yet be positively correlated with a third variable? Are there any concrete examples?
Correlation – Can Two Random Variables Be Negatively Correlated but Both Be Positively Correlated with a Third?
Related Solutions
1) Residuals do correlate positively with observed values in many, many cases. Think of it this way: a very large positive error ("error" being the "true residual", to misuse the language) means that the corresponding observation is, all other things equal, likely to be very large in the positive direction, and a very large negative error means the observation is likely to be very large in the negative direction. If the $R^2$ of the regression is not large, the variability of the errors dominates the variability of the target variable, and you will see this effect in your plots and correlations.
For example, consider the data-generating process $y_i = a + x_i + e_i$, which we fit with the model $y_i = a + bx_i + e_i$ (correct for $b = 1$). Here's the result of a regression with 100 observations:
e <- rnorm(100)                 # errors
x <- rnorm(100)                 # predictor
y <- 1 + x + e                  # true model with b = 1
foo <- lm(y ~ x)
plot(residuals(foo) ~ y, xlab="y", ylab="Residuals")
> summary(foo)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.3292 -0.8280 -0.0448 0.8213 2.9450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8498 0.1288 6.600 2.12e-09 ***
x 0.8929 0.1316 6.787 8.81e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.286 on 98 degrees of freedom
Multiple R-squared: 0.3197, Adjusted R-squared: 0.3128
F-statistic: 46.06 on 1 and 98 DF, p-value: 8.813e-10
Note that we achieved a fairly respectable (in some fields) $R^2$ of 0.32.
We can obscure this effect with a different model:
y <- 1 + 5*x + e                # stronger signal, same noise
foo <- lm(y ~ x)
plot(residuals(foo) ~ y, xlab="y", ylab="Residuals")
which has an $R^2$ of 0.93. Here the correlation between $y$ and the residuals is about 0.25, but the relationship is far less obvious in the residual plot (not reproduced here).
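That correlation can be checked directly; in fact, for a least-squares fit with an intercept, the sample correlation between $y$ and the residuals is exactly $\sqrt{1-R^2}$, so $R^2 = 0.93$ gives about $0.26$. A minimal sketch (the seed is arbitrary and not from the original):

```r
set.seed(42)                      # arbitrary seed, not from the original
e <- rnorm(100)
x <- rnorm(100)
y <- 1 + 5*x + e
foo <- lm(y ~ x)

cor(y, residuals(foo))            # equals sqrt(1 - R^2) for this fit
sqrt(1 - summary(foo)$r.squared)  # same value
```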
2) Residuals have correlation zero with fitted values in a linear regression, by construction. Is your statement "... weakly correlated with fitted Y negatively" based solely upon looking at the plot, or did you actually calculate the correlation? If the former, appearances can be deceiving... if the latter, something is wrong; possibly you aren't looking at what you think you're looking at.
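That zero correlation is easy to verify numerically (a sketch with simulated data; any seed will do):

```r
set.seed(1)                       # arbitrary seed
x <- rnorm(100)
y <- 1 + x + rnorm(100)
fit <- lm(y ~ x)

cor(fitted(fit), residuals(fit))  # zero up to floating-point error
```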
The comment made by @user32164 still stands as I write: "highly correlated with a poor $R^2$" is contradictory. Regardless of what you consider highly correlated, a high correlation means a high $R^2$.
I am assuming that you measured color somehow so that it may fairly be used as a quantitative predictor in a regression model. Whether that's so is an issue that people in your field might debate, but I'll take it as read.
We know what you mean, but language such as "very significant p-value" is a little loose. A low P-value indicates that an effect, difference, relationship, whatever is significant, but the P-value itself is an indicator of significance, not something that is itself significant.
Those small points aside, we need to distinguish different kinds of question here.
Statistical and causal inference Focusing on your example: whether fish color causes the depth at which fish are seen, or vice versa, or both, is a biological question on which statistical people have little to say. They might help you design an experiment to test the underlying hypotheses, but from the example as given, the extent to which regression can be used to infer causation (the existence and/or direction of causal relationships) is very limited. There is an enormous literature on this, but I think there is consensus that predictive ability as shown by regression is not sufficient to infer causation.
Significance and strength of relationship You appear to be confusing significance of relationship and strength of relationship at a basic level. With moderate and especially large sample size, it is perfectly possible to get significant results (at conventional levels) that are only weakly predictive. Usually, a significant result underlines that some quantity of interest is not zero, but that itself doesn't make it major or substantial scientifically or practically.
Separate effects of predictors You can't infer that predictors have separate effects just from the evidence you cite. If you have some $x_1$ as predictor and then add $x_2$ as another, whether the coefficient of $x_1$ changes is one thing to look at. You should benefit from testing the interaction. You always benefit from thinking about what the underlying science indicates about possible relations between $x_1$ and $x_2$.
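As a sketch of that comparison, with simulated stand-ins for the fish example (the variable names `depth`, `color`, and `x2`, and the coefficients, are illustrative assumptions, not from the original):

```r
set.seed(7)                          # simulated stand-in data
color <- rnorm(200)
x2    <- rnorm(200)
depth <- 1 + 0.5*color + 0.3*x2 + rnorm(200)

m1 <- lm(depth ~ color)              # color alone
m2 <- lm(depth ~ color + x2)         # does color's coefficient change?
m3 <- lm(depth ~ color * x2)         # adds the interaction color:x2
anova(m2, m3)                        # F-test of the interaction term
```

Comparing the coefficient of `color` across `m1` and `m2`, and testing `m3` against `m2`, addresses the two checks described above.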
Best Answer
Certainly. Consider multivariate normally distributed data with a covariance matrix of the form
$$\begin{pmatrix} 1 & - & + \\ - & 1 & + \\ + & + & 1 \end{pmatrix}. $$
As an example, we can generate 1000 such observations with covariance matrix
$$\begin{pmatrix} 1 & -0.5 & 0.5 \\ -0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} $$
in R as follows:
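The generating code is missing here; a minimal sketch using `mvrnorm` from the MASS package (the original may have used a different generator, and the seed is arbitrary):

```r
library(MASS)                        # for mvrnorm

# Covariance matrix from above
Sigma <- matrix(c( 1.0, -0.5,  0.5,
                  -0.5,  1.0,  0.5,
                   0.5,  0.5,  1.0), nrow = 3)

set.seed(1)                          # arbitrary seed
obs <- mvrnorm(n = 1000, mu = c(0, 0, 0), Sigma = Sigma)
round(cor(obs), 2)                   # close to Sigma for n = 1000
```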
The first two columns are negatively correlated ($\rho=-0.5$), the first and the third and the second and the third are positively correlated ($\rho=0.5$).