Correlation – Understanding Correlation Between Random Variables

I recently received the following claim from a work colleague with regard to a loose definition of the correlation between two random variables:

“A 60% correlation between two variables approximately implies that in 60% of cases the movement away from the variables' mean position is aligned.”

I have not seen correlation defined in such a way, but I was unable to say (a) whether this is correct, or (b) what the standard definition is instead.

Best Answer

There's some merit to the idea, but it's not quantitatively correct.

The standard (Pearson) definition uses the expected product of the standardized values of $X$ and $Y$ to measure the correlation of $(X,Y)$. Specifically, recenter both $X$ and $Y$ (so that we're always discussing how much they differ from their respective means) and let their standard deviations be $\sigma$ and $\tau$, respectively. Then when $F$ is the bivariate distribution of $(X,Y)$,

$$\rho_F = \mathbb{E}_F\left(\frac{X}{\sigma}\frac{Y}{\tau}\right) = \mathbb{E}_F\left(r\left(\frac{X}{\sigma},\frac{Y}{\tau}\right)\right)$$

for the function $r(u,v)=uv.$ This function $r$ is positive when $u$ and $v$ have the same sign (are "aligned") and is negative otherwise. It grows in direct proportion to the values of $u$ and $v$. Thus its average value (the expectation) reflects a value-weighted balance of aligned and unaligned variations from the central point.
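As a rough illustration of this definition, the sketch below computes a sample version of the expected product of standardized values and compares it with R's built-in `cor()`. The simulated data, the seed, and the variable names are assumptions made for the example only.

```r
# Sketch: Pearson correlation as the average product of standardized values.
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 1000
x <- rnorm(n)
y <- 0.6 * x + 0.8 * rnorm(n)     # constructed so the population correlation is 0.6

u <- (x - mean(x)) / sd(x)        # recentered and rescaled X
v <- (y - mean(y)) / sd(y)        # recentered and rescaled Y

# sd() divides by n - 1, so the matching "average" also divides by n - 1
rho_hat <- sum(u * v) / (n - 1)

print(c(by_hand = rho_hat, builtin = cor(x, y)))
```

The two printed numbers should agree up to rounding error, because `cor()` uses the same standardization.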

We could make $r$ less sensitive to the values of $u$ and $v$. A fairly extreme way to do this would be to give it a unit value when $u$ and $v$ are aligned and negate it when they are not; that is,

$$r^\prime(u,v) = \text{sgn}(u)\text{sgn}(v).$$

One could use this in place of $r$ in the correlation definition to create a robust measure of correlation of empirical distributions. In so doing, we would have the interpretation suggested in the question: the expectation of $r^\prime(X/\sigma,Y/\tau)$ is the chance that $X$ and $Y$ are "aligned" minus the chance they are not aligned. (In practice, the variables would be recentered using robust estimates such as their medians rather than their means and their standard deviations would be replaced by robust estimates of variation such as interquartile ranges or MADs.)
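A minimal sketch of this sign-based measure on simulated data follows, assuming median recentering as suggested in the parenthetical remark. Rescaling by a measure of spread is skipped because it cannot change a sign. The data and names are illustrative assumptions.

```r
# Sketch: sign-based correlation, i.e. the average of sgn(u) * sgn(v).
set.seed(2)                        # arbitrary seed
n <- 1000                          # even n, so no observation sits exactly on the median
x <- rnorm(n)
y <- 0.6 * x + 0.8 * rnorm(n)

sx <- sign(x - median(x))          # +1 above the median, -1 below
sy <- sign(y - median(y))

rho_sign <- mean(sx * sy)          # sample version of E[r'(X, Y)]

p_aligned   <- mean(sx * sy > 0)   # share of points on the same side of both medians
p_unaligned <- mean(sx * sy < 0)   # share of points on opposite sides

print(c(rho_sign = rho_sign, difference = p_aligned - p_unaligned))
```

The two printed values coincide, which is the "chance aligned minus chance not aligned" reading described above.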

Yes, we could even choose $r^{\prime\prime}$ to be the indicator that $X$ and $Y$ are aligned, so that $\rho_F^{\prime\prime}$ really would be the chance of alignment. In so doing, however, we could only achieve values between $0$ and $1$: that can scarcely be interpreted like the Pearson coefficient, which ranges from $1$ down to $-1$. Since

$$r^{\prime\prime}(u,v) = \frac{1}{2} + \frac{r^\prime(u,v)}{2},$$

there is a clear, simple relationship between the characterization in the question (which appears to refer to $r^{\prime\prime}$) and the more familiar-looking correlation values attained by $r^\prime$: $\rho^{\prime\prime}_F = \frac{1}{2} + \frac{1}{2}\rho^\prime_F$. Because the question is phrased as a chance of alignment, the formulas below are written for $\rho^{\prime\prime}$; the corresponding values of $\rho^\prime$ are $2\rho^{\prime\prime}-1$.

To appreciate the distinction between $\rho$ and $\rho^{\prime\prime}$, consider the archetypical application of correlation: the bivariate normal distribution. When this distribution has Pearson correlation $\rho_F$, the chance of alignment is

$$\rho^{\prime\prime}_F = \frac{1}{2} + \frac{1}{\pi} \arctan\left(\frac{\rho_F}{\sqrt{1-\rho_F^2}}\right) = \frac{1}{2} + \frac{\arcsin(\rho_F)}{\pi}.$$

For values of $\rho_F$ within the range $(-0.6,0.6)$ or so, this function is approximately linear:

$$\rho^{\prime\prime}_F \approx \frac{1}{2} + \frac{\rho_F}{\pi}.$$

For instance, with $\rho_F=0.6,$ $\rho^{\prime\prime}_F \approx 0.705$ (the linear approximation gives about $0.691$). This clearly differs (a lot) from the claim quoted in the question, which asserts $\rho^{\prime\prime}_F \approx \rho_F$. Under the linear approximation the two agree only near $\rho_F = \frac{\pi}{2(\pi-1)}\approx 0.733$; the exact curve crosses the line $\rho^{\prime\prime}_F = \rho_F$ a little higher, near $0.79$, and again at $\rho_F = 1$.

Figure: the solid line graphs $\rho^{\prime\prime}$ in terms of $\rho$ for bivariate Normal distributions. The dashed line is the linear approximation to $\rho^{\prime\prime}$ around $\rho=0$. It has slope $\frac{1}{\pi}\approx 0.318$.
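The closed-form expression above can be checked by simulation. The sketch below draws from a standard bivariate normal distribution with Pearson correlation 0.6 and counts how often both coordinates fall on the same side of their means. The sample size, the seed, and the names are assumptions for illustration.

```r
# Sketch: fraction of "aligned" draws under a bivariate normal model.
set.seed(3)                                   # arbitrary seed
rho <- 0.6
n <- 1e6
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)     # standard bivariate normal, correlation rho

simulated     <- mean(sign(x) * sign(y) > 0)  # true means are 0, so no recentering is needed
closed_form   <- 0.5 + atan(rho / sqrt(1 - rho^2)) / pi
linear_approx <- 0.5 + rho / pi

print(round(c(simulated = simulated,
              closed_form = closed_form,
              linear_approx = linear_approx), 3))
```

The simulated share and the closed-form value should both land close to 0.705, well above the 0.6 that the quoted claim would suggest.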

This one-to-one correspondence between $\rho$ and $\rho^\prime$ (equivalently $\rho^{\prime\prime}$) for the bivariate Normal distribution shows we could use either definition equally well for describing correlations where a Normal model applies, although I suspect the sampling distribution of $\rho^\prime$ might be a little more difficult to derive. There is not necessarily such a one-to-one correspondence between $\rho$ and $\rho^\prime$ among all bivariate distributions, though: for a given value of $\rho^\prime$ we could vary $\rho$ quite a bit by moving a tiny bit of probability mass around in the $(X,Y)$ plane at large distances from the origin, keeping it within its quadrant, without changing $\rho^\prime$ at all.
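The last point can be illustrated with a small sample. In the sketch below a single observation that already lies above both medians is pushed far out within the same quadrant. The helper `sign_corr`, the choice of the point (50, 50), and the simulated data are assumptions for the example.

```r
# Sketch: moving one point far from the origin changes Pearson's rho
# but leaves the sign-based measure untouched.
set.seed(4)                                   # arbitrary seed
n <- 1000
x <- rnorm(n)
y <- 0.3 * x + sqrt(1 - 0.3^2) * rnorm(n)     # population correlation 0.3

# Hypothetical helper: median-centered sign correlation
sign_corr <- function(a, b) mean(sign(a - median(a)) * sign(b - median(b)))

# Pick one observation above both medians and push it far out in that quadrant.
i <- which(x > median(x) & y > median(y))[1]
x2 <- x
y2 <- y
x2[i] <- 50
y2[i] <- 50

print(c(pearson_before = cor(x, y), pearson_after = cor(x2, y2)))
print(c(sign_before = sign_corr(x, y), sign_after = sign_corr(x2, y2)))
```

Every observation keeps its sign relative to the medians, so the sign-based value is identical before and after, while the Pearson coefficient moves substantially.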

Related Solutions

Correlation – Clustering Variables Based on Correlation Matrix

Here's a simple example in R using the bfi dataset: bfi is a dataset of 25 personality test items organised around 5 factors.

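library(psych)    # recent releases keep the bfi data in the psychTools package instead
data(bfi)
x <- bfi[, 1:25]  # the 25 personality items; newer versions of bfi append demographic columns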

A hierarchical cluster analysis using the Euclidean distance between variables, based on the absolute correlations between variables, can be obtained like so:

plot(hclust(dist(abs(cor(na.omit(x))))))

The dendrogram shows how items generally cluster with other items according to theorised groupings (e.g., N (Neuroticism) items group together). It also shows how some items within clusters are more similar (e.g., C5 and C1 might be more similar than C5 with C3). It also suggests that the N cluster is less similar to other clusters.
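A common variation, different from the approach above, turns the correlations directly into a dissimilarity of 1 minus the absolute correlation, instead of taking Euclidean distances between rows of the correlation matrix. The sketch below assumes the built-in `mtcars` data so that it runs without extra packages. The linkage method and the number of groups are arbitrary choices for illustration.

```r
# Sketch: cluster variables using 1 - |r| as the dissimilarity.
r  <- cor(mtcars)                    # correlation matrix of the variables (columns)
d  <- as.dist(1 - abs(r))            # 0 when |r| = 1, 1 when r = 0
hc <- hclust(d, method = "average")  # average linkage, an arbitrary choice here

groups <- cutree(hc, k = 3)          # k = 3 chosen only for illustration
print(split(names(groups), groups))  # variable names listed by cluster
```

Calling `plot(hc)` would draw the corresponding dendrogram.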

Alternatively you could do a standard factor analysis like so:

factanal(na.omit(x), 5, rotation = "Promax")


Uniquenesses:
   A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3    E4    E5    N1 
0.848 0.630 0.642 0.829 0.442 0.566 0.635 0.572 0.504 0.603 0.541 0.457 0.541 0.420 0.549 0.272 
   N2    N3    N4    N5    O1    O2    O3    O4    O5 
0.321 0.526 0.514 0.675 0.625 0.804 0.544 0.630 0.814 

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5
A1  0.242  -0.154          -0.253  -0.164 
A2                          0.570         
A3         -0.100           0.522   0.114 
A4                  0.137   0.351  -0.158 
A5         -0.145           0.691         
C1                  0.630           0.184 
C2  0.131   0.120   0.603                 
C3  0.154           0.638                 
C4  0.167          -0.656                 
C5  0.149          -0.571           0.125 
E1          0.618   0.125  -0.210  -0.120 
E2          0.665          -0.204         
E3         -0.404           0.332   0.289 
E4         -0.506           0.555  -0.155 
E5  0.175  -0.525   0.234           0.228 
N1  0.879  -0.150                         
N2  0.875  -0.152                         
N3  0.658                                 
N4  0.406   0.342  -0.148           0.196 
N5  0.471   0.253           0.140  -0.101 
O1         -0.108                   0.595 
O2 -0.145   0.421   0.125   0.199         
O3         -0.204                   0.605 
O4          0.244                   0.548 
O5  0.139                   0.177  -0.441 

               Factor1 Factor2 Factor3 Factor4 Factor5
SS loadings      2.610   2.138   2.075   1.899   1.570
Proportion Var   0.104   0.086   0.083   0.076   0.063
Cumulative Var   0.104   0.190   0.273   0.349   0.412

Test of the hypothesis that 5 factors are sufficient.
The chi square statistic is 767.57 on 185 degrees of freedom.
The p-value is 5.93e-72

Correlation – Sanity Check for Correlation Between Log-Normal Random Variables

The moments of the lognormal distribution are usually derived from the moment generating function of the normal distribution. If $X\sim N(\mu,\sigma^2)$, then it has the mgf

$$ M_X(t) = \mathbb{E} \exp(tX) = \exp( \mu t + \sigma^2 t^2/2 ). $$

Then if $Y = \exp(X)$ is the lognormal variable of interest, we can find, for instance,

$$ \mathbb{E}[Y] = \mathbb{E} [\exp(X)] = M_X(1) = \exp( \mu + \sigma^2/2 ) $$

and

$$ \mathbb{E}[Y^2] = \mathbb{E} [\exp(2X)] = M_X(2) = \exp( 2\mu + 2\sigma^2 ), $$

from which

$$ \mathbb{V}[Y] = \mathbb{E}[Y^2] - \mathbb{E}^2[Y] = \exp( 2\mu + 2\sigma^2 ) - \exp( 2\mu + \sigma^2 ) = \exp( 2\mu + \sigma^2 ) [ \exp(\sigma^2) - 1 ] = [ \exp(\sigma^2) - 1 ]\mathbb{E}^2[Y]. $$

You may have to do something like that with a four-variate normal distribution and its multivariate mgf.
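For the simpler two-variable case, the same mgf argument gives the correlation directly. What follows is a sketch under assumed notation that does not appear in the original: $(X_1,X_2)$ is bivariate normal with means $\mu_1,\mu_2$, standard deviations $\sigma_1,\sigma_2$ and correlation $\rho$, and $Y_i = \exp(X_i)$.

$$ M(t_1,t_2) = \mathbb{E}\exp(t_1X_1+t_2X_2) = \exp\left( \mu_1t_1 + \mu_2t_2 + \tfrac{1}{2}\left(\sigma_1^2t_1^2 + 2\rho\sigma_1\sigma_2t_1t_2 + \sigma_2^2t_2^2\right) \right) $$

$$ \mathbb{E}[Y_1Y_2] = M(1,1) = \mathbb{E}[Y_1]\,\mathbb{E}[Y_2]\,\exp(\rho\sigma_1\sigma_2) $$

$$ \text{Cov}(Y_1,Y_2) = \mathbb{E}[Y_1Y_2] - \mathbb{E}[Y_1]\,\mathbb{E}[Y_2] = \mathbb{E}[Y_1]\,\mathbb{E}[Y_2]\,[\exp(\rho\sigma_1\sigma_2) - 1] $$

$$ \text{Corr}(Y_1,Y_2) = \frac{\exp(\rho\sigma_1\sigma_2) - 1}{\sqrt{[\exp(\sigma_1^2)-1]\,[\exp(\sigma_2^2)-1]}} $$

The last step divides by the two standard deviations, using $\mathbb{V}[Y_i] = [\exp(\sigma_i^2)-1]\,\mathbb{E}^2[Y_i]$ from the univariate result. Two quick sanity checks: $\rho=0$ gives correlation $0$, and $\rho=1$ with $\sigma_1=\sigma_2$ gives correlation $1$.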

In my experience working with lognormal distributions, they are only practical when the log-scale standard deviation $\sigma$ (equivalently, the log-variance $\sigma^2$) is less than 1. Beyond that, pretty much any reasonable summary of the distribution hinges critically on whether you got the tail behavior right. In most situations the shape of the right tail will differ from the model's (this is true of the original normal distribution as well, but there it is not exacerbated by exponentiation), and you can easily see a twofold difference due to a 0.1 change in $\sigma$, which you don't want to happen.
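To get a feel for this sensitivity, the sketch below evaluates the closed-form lognormal variance at a few values of $\sigma$ and at $\sigma + 0.1$, and reports the ratio. The variance is only one possible summary. The function name `lnorm_var`, the grid of values, and fixing $\mu = 0$ are assumptions for the example.

```r
# Sketch: effect of a 0.1 change in sigma on the lognormal variance.
# Uses V[Y] = exp(2*mu + sigma^2) * (exp(sigma^2) - 1) from the mgf argument.
lnorm_var <- function(sigma, mu = 0) {
  exp(2 * mu + sigma^2) * (exp(sigma^2) - 1)
}

sigma <- c(1, 1.5, 2, 2.5)                          # illustrative grid of log-scale sds
ratio <- lnorm_var(sigma + 0.1) / lnorm_var(sigma)  # variance ratio for a 0.1 increase

print(data.frame(sigma = sigma, variance_ratio = round(ratio, 2)))
```

The ratio does not depend on $\mu$, since the factor $\exp(2\mu)$ cancels.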
