Solved – Explaining the difference between Pearson correlation and distance correlation

correlation, distance, distance-covariance, pearson-r

This question and its answer might highlight my naivete regarding Brownian/distance correlation.

I'm using the difference between a matrix of distance correlations, as calculated by energy::dcor(), and absolute value Pearson correlations, as calculated by cor(), to highlight potential nonlinear dependencies introduced with a certain estimation technique.

In my resulting difference matrix, I have a handful of negative values indicating that the Pearson correlation is larger in magnitude than the distance correlation (range from -.07 to -.01).

First, is my approach adequate? If so, how do I explain why the Pearson correlations might be larger in magnitude than distance correlation?

Best Answer

Distance correlation or distance covariance is a measure of dependence between two paired random vectors of arbitrary, not necessarily equal dimension.

The population distance correlation coefficient is zero if and only if the random vectors are independent. Thus, distance correlation measures both linear and nonlinear association between two random variables or random vectors.

This is in contrast to Pearson's correlation, which can only detect linear association between two random variables.
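As a rough illustration of this contrast, the sketch below computes the sample distance correlation by hand in base R and compares it with the absolute Pearson correlation in two cases: a purely quadratic relationship and a linear Gaussian one. The helper name `dcor_manual` is made up for this example and follows the usual V-statistic definition; in practice `energy::dcor()` is the natural choice. In the Gaussian case the distance correlation typically comes out somewhat below |r|, which would be consistent with the small negative differences described in the question (for bivariate normal data the population distance correlation is known not to exceed |ρ|).

```r
# Sample distance correlation (V-statistic form) written out in base R.
# dcor_manual is a hypothetical helper name used only for this sketch.
dcor_manual <- function(x, y) {
  a <- as.matrix(dist(x))  # pairwise Euclidean distances within x
  b <- as.matrix(dist(y))  # pairwise Euclidean distances within y
  # Double-centre a distance matrix: subtract row and column means, add grand mean
  center <- function(d) {
    sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  }
  A <- center(a)
  B <- center(b)
  dcov2 <- max(mean(A * B), 0)  # squared distance covariance, guarded against rounding
  sqrt(dcov2 / sqrt(mean(A * A) * mean(B * B)))
}

# Case 1: symmetric quadratic dependence on a fixed grid (no linear component)
x_quad  <- seq(-1, 1, length.out = 201)
y_quad  <- x_quad^2
r_quad  <- abs(cor(x_quad, y_quad))
dc_quad <- dcor_manual(x_quad, y_quad)

# Case 2: linear Gaussian dependence with true correlation 0.5 (seed is arbitrary)
set.seed(1)
x_lin  <- rnorm(500)
y_lin  <- 0.5 * x_lin + sqrt(1 - 0.5^2) * rnorm(500)
r_lin  <- abs(cor(x_lin, y_lin))
dc_lin <- dcor_manual(x_lin, y_lin)

# Difference as in the question: distance correlation minus |Pearson r|
round(c(quadratic = dc_quad - r_quad, linear = dc_lin - r_lin), 3)
```

A clearly positive difference in the quadratic case points to dependence that Pearson's r misses, while a small negative difference in the linear case does not by itself indicate a problem with either estimate.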

Related Solutions

Solved – Can we compare correlations between groups by comparing regression slopes

Everything that you have written is correct. You can always test out things like that with a toy example. Here is an example with R:

library(MASS)  ### provides mvrnorm()

set.seed(1234)  ### arbitrary seed so the simulated results are reproducible

rho <- .5  ### the true correlation in both groups

S1 <- matrix(c( 1,   rho,   rho, 1), nrow=2)
S2 <- matrix(c(16, 4*rho, 4*rho, 1), nrow=2)

cov2cor(S1)
cov2cor(S2)

xy1 <- mvrnorm(1000, mu=c(0,0), Sigma=S1)
xy2 <- mvrnorm(1000, mu=c(0,0), Sigma=S2)

x <- c(xy1[,1], xy2[,1])
y <- c(xy1[,2], xy2[,2])
group <- c(rep(0, 1000), rep(1, 1000))

summary(lm(y ~ x + group + x:group))

What you will find is that the interaction is highly significant, even though the true correlation is the same in both groups. Why does that happen? Because the raw regression coefficients in the two groups reflect not only the strength of the correlation, but also the scaling of X (and Y) in the two groups. Since those scalings differ, the interaction is significant. This is an important point, since it is often believed that to test the difference in the correlations, you just need to test the interaction in the model above. Let's continue:

summary(lm(xy2[,2] ~ xy2[,1]))$coef[2] - summary(lm(xy1[,2] ~ xy1[,1]))$coef[2]

This will show you that the difference in the regression coefficients for the model fitted separately in the two groups will give you exactly the same value as the interaction term.
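To see where the differing slopes come from, recall that in simple regression the slope is the correlation rescaled by the ratio of the standard deviations. Using the covariance matrices S1 and S2 from the example (with $\rho = 0.5$), the population slopes work out as follows:

$$
\beta = \rho\,\frac{\sigma_Y}{\sigma_X}, \qquad
\beta_1 = 0.5 \cdot \frac{1}{1} = 0.5, \qquad
\beta_2 = 0.5 \cdot \frac{1}{4} = 0.125, \qquad
\beta_2 - \beta_1 = -0.375 .
$$

So the interaction estimate should land near $-0.375$ even though both groups share the same correlation.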

What we are really interested in though is the difference in the correlations:

cor(xy1)[1,2]
cor(xy2)[1,2]
cor(xy2)[1,2] - cor(xy1)[1,2]

You will find that this difference is essentially zero. Let's standardize X and Y within the two groups and refit the full model:

x <- c(scale(xy1[,1]), scale(xy2[,1]))
y <- c(scale(xy1[,2]), scale(xy2[,2]))
summary(lm(y ~ x + x:group - 1))

Note that I am not including the intercept or the group main effect here, because they are zero by definition. You will find that the coefficient for x is equal to the correlation for group 1 and the coefficient for the interaction is equal to the difference in the correlations for the two groups.

Now, to your question of whether it would be better to use this approach rather than the test that makes use of Fisher's r-to-z transformation:

EDIT

The standard errors of the regression coefficients that are calculated when you standardize the X and Y values within the groups do not take this standardization into consideration. Therefore, they are not correct. Accordingly, the t-test for the interaction does not control the Type I error rate adequately. I conducted a simulation study to examine this. When $\rho_1 = \rho_2 = 0$, then the Type I error is controlled. However, when $\rho_1 = \rho_2 \ne 0$, then the t-test tends to be overly conservative (i.e., it does not reject often enough for a given $\alpha$ value). On the other hand, the test that makes use of Fisher's r-to-z transformation does perform adequately, regardless of the size of the true correlations in both groups (except when the group sizes get very small and the true correlations in the two groups get very close to $\pm1$).

Conclusion: If you want to test for a difference in correlations, use Fisher's r-to-z transformation and test the difference between those values.
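A minimal sketch of that test for two independent groups is given below. The function name `fisher_z_test` and the example values (r = 0.3 and r = 0.6, each with n = 100) are made up for illustration; the calculation itself is the standard one, with each correlation transformed via `atanh()` and the difference divided by $\sqrt{1/(n_1-3) + 1/(n_2-3)}$.

```r
# Two-sided test of H0: rho1 == rho2 for two independent samples,
# based on Fisher's r-to-z transformation.
# fisher_z_test is a hypothetical helper name used only for this sketch.
fisher_z_test <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                          # Fisher transform of r1
  z2 <- atanh(r2)                          # Fisher transform of r2
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z2 - z1
  z  <- (z2 - z1) / se                     # approximately N(0, 1) under H0
  p  <- 2 * pnorm(-abs(z))                 # two-sided p-value
  c(z = z, p.value = p)
}

# Illustrative values only, not taken from the simulation above
res <- fisher_z_test(r1 = 0.3, n1 = 100, r2 = 0.6, n2 = 100)
res
```

With the simulated data from the example, the call would be `fisher_z_test(cor(xy1)[1,2], 1000, cor(xy2)[1,2], 1000)`.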

Solved – Difference between Euclidean, Pearson, Geodesic and Mahalanobis distance metrics

Euclidean:

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space.

Pearson:

Pearson Correlation measures the similarity in shape between two profiles.

Geodesic:

In the mathematical field of graph theory, the distance between two vertices in a graph is the number of edges in a shortest path (also called a graph geodesic) connecting them. This is also known as the geodesic distance.

Wikipedia for Geodesic distance

Mahalanobis:

The Mahalanobis distance is a measure of the distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero if P is at the mean of D, and grows as P moves away from the mean along each principal component axis.

Wikipedia for Mahalanobis
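A small sketch may make the Euclidean, Mahalanobis and Pearson definitions more concrete (graph geodesic distance is left out here). The points, covariance matrix and profiles are invented for illustration, and taking 1 - r as the "Pearson distance" is one common convention rather than the only one.

```r
# Euclidean vs Mahalanobis: two points equally far from the mean in the
# Euclidean sense, but not relative to the spread of the distribution.
mu    <- c(0, 0)
Sigma <- diag(c(4, 1))  # sd 2 along the first axis, sd 1 along the second
p1    <- c(2, 0)
p2    <- c(0, 2)

euclid <- function(a, b) sqrt(sum((a - b)^2))
e1 <- euclid(p1, mu)
e2 <- euclid(p2, mu)

# stats::mahalanobis() returns the squared distance, hence the sqrt()
m1 <- sqrt(mahalanobis(p1, center = mu, cov = Sigma))  # one sd from the mean
m2 <- sqrt(mahalanobis(p2, center = mu, cov = Sigma))  # two sds from the mean

# Pearson: two profiles with the same shape but very different scale
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)
pearson_dist <- 1 - cor(a, b)  # 0, because the shapes match exactly
euclid_ab    <- euclid(a, b)   # large, because the magnitudes differ

c(e1 = e1, e2 = e2, m1 = m1, m2 = m2,
  pearson_dist = pearson_dist, euclid_ab = euclid_ab)
```

The two points are tied on Euclidean distance but differ on Mahalanobis distance, and the two profiles are identical under the Pearson measure while being far apart in Euclidean terms.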