Correlation – Understanding Correlation Between Random Variables

I recently received the following claim from a work colleague with regard to a loose definition of the correlation between two random variables:

“A 60% correlation between two variables approximately implies that in 60% of cases the movement away from the variables' mean position is aligned.”

I have not seen correlation defined this way, and I was unable to say (a) whether this is correct or (b) what the official definition is instead.

Best Answer

There's some merit to the idea, but it's not quantitatively correct.

The standard (Pearson) definition uses the expected product of the standardized values of $X$ and $Y$ to measure the correlation of $(X,Y)$. Specifically, recenter both $X$ and $Y$ (so that we're always discussing how much they differ from their respective means) and let their standard deviations be $\sigma$ and $\tau$, respectively. Then when $F$ is the bivariate distribution of $(X,Y)$,

$$\rho_F = \mathbb{E}_F\left(\frac{X}{\sigma}\frac{Y}{\tau}\right) = \mathbb{E}_F\left(r\left(\frac{X}{\sigma},\frac{Y}{\tau}\right)\right)$$

for the function $r(u,v)=uv.$ This function $r$ is positive when $u$ and $v$ have the same sign (are "aligned") and is negative otherwise. Its magnitude grows in direct proportion to the magnitudes of $u$ and $v$. Thus its average value (the expectation) reflects a value-weighted balance of aligned and unaligned variations from the central point.
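To make the definition concrete, here is a minimal sketch that computes the Pearson correlation of a sample as the average of $r(u,v)=uv$ over standardized pairs. The function name `pearson` is hypothetical, and dividing by $n$ (the population convention) is an assumption made to mirror the expectation above.

```python
import math


def pearson(xs, ys):
    """Average of r(u, v) = u * v over standardized pairs (divides by n)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Population standard deviations, playing the roles of sigma and tau.
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    # Recenter, rescale, multiply, and average.
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / n
```

Pairs far from the center contribute more to this average than pairs near it, which is the value weighting described above.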

We could make $r$ less sensitive to the values of $u$ and $v$. A fairly extreme way to do this would be to give it a unit value when $u$ and $v$ are aligned and negate it when they are not; that is,

$$r^\prime(u,v) = \text{sgn}(u)\text{sgn}(v).$$

One could use this in place of $r$ in the correlation definition to create a robust measure of correlation of empirical distributions. In so doing, we would have the interpretation suggested in the question: the expectation of $r^\prime(X/\sigma,Y/\tau)$ is the chance that $X$ and $Y$ are "aligned" minus the chance they are not aligned. (In practice, the variables would be recentered using robust estimates such as their medians rather than their means and their standard deviations would be replaced by robust estimates of variation such as interquartile ranges or MADs.)

Yes, we could even choose $r^{\prime\prime}$ to be the indicator that $X$ and $Y$ are aligned, so that $\rho_F^{\prime\prime}$ really would be the chance of alignment. In so doing, however, we could only achieve values between $0$ and $1$: that can scarcely be interpreted like the Pearson coefficient, which ranges from $1$ down to $-1$. Since

$$r^{\prime\prime}(u,v) = \frac{1}{2} + \frac{r^\prime(u,v)}{2},$$

there is a clear, simple relationship between the characterization in the question (which appears to refer to $r^{\prime\prime}$) and the more familiar-looking correlation values attained by $r^\prime$. I will therefore continue to discuss $r^\prime$.
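As a quick worked illustration of this conversion (assuming that landing exactly on a center has probability zero), taking expectations of the identity gives

$$\Pr(\text{aligned}) = \mathbb{E}_F\left(r^{\prime\prime}\right) = \frac{1}{2} + \frac{\mathbb{E}_F\left(r^\prime\right)}{2}.$$

A sign-based correlation of $0.6$ therefore corresponds to alignment in $\frac{1}{2} + \frac{0.6}{2} = 80\%$ of cases, while alignment in $60\%$ of cases corresponds to a sign-based correlation of only $2(0.6) - 1 = 0.2$.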

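To appreciate the distinction between $\rho$ and $\rho^\prime$, consider the archetypical application of correlation: the bivariate normal distribution. When this particular distribution has Pearson correlation $\rho_F$, the alternative measure of correlation is

$$\rho^\prime_F = \frac{2}{\pi} \arctan\left(\frac{\rho_F}{\sqrt{1-\rho_F^2}}\right) = \frac{2}{\pi} \arcsin\left(\rho_F\right),$$

and the corresponding chance of alignment is

$$\rho^{\prime\prime}_F = \frac{1}{2} + \frac{\rho^\prime_F}{2} = \frac{1}{2} + \frac{1}{\pi} \arctan\left(\frac{\rho_F}{\sqrt{1-\rho_F^2}}\right).$$

For values of $\rho_F$ within the range $(-0.6,0.6)$ or so, this function is approximately linear:

$$\rho^{\prime\prime}_F \approx \frac{1}{2} + \frac{\rho_F}{\pi}.$$

For instance, with $\rho_F=0.6,$ $\rho^{\prime\prime}_F \approx 0.705$ (equivalently, $\rho^\prime_F \approx 0.410$). This differs considerably from the claim quoted in the question, which asserts $\rho^{\prime\prime}_F \approx \rho_F$. Under the linear approximation the two agree only for values of $\rho_F$ close to $\frac{\pi}{2(\pi-1)}\approx 0.733$; the exact curve crosses the line $\rho^{\prime\prime}_F = \rho_F$ slightly higher, near $\rho_F \approx 0.79$, and again (trivially) at $\rho_F = 1$.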
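The chance of alignment under a bivariate normal model can be checked numerically. The sketch below compares the closed form $\frac{1}{2} + \frac{1}{\pi}\arcsin(\rho)$ (the same quantity as the arctangent expression, since $\arctan\left(\rho/\sqrt{1-\rho^2}\right) = \arcsin(\rho)$) with a Monte Carlo estimate. The function names, sample size, and seed are illustrative assumptions.

```python
import math
import random


def alignment_probability(rho):
    """Closed-form chance that X and Y share a sign (standard bivariate normal)."""
    return 0.5 + math.asin(rho) / math.pi


def simulate_alignment(rho, n=200_000, seed=0):
    """Monte Carlo estimate of the same chance, from n simulated pairs."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    aligned = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # y has unit variance and correlation rho with x.
        y = rho * x + s * rng.gauss(0.0, 1.0)
        aligned += (x > 0) == (y > 0)
    return aligned / n
```

For $\rho = 0.6$ the closed form is about $0.705$, and the simulated fraction should land close to it: noticeably more often aligned than the 60% suggested by the claim in the question.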

Figure

The solid line graphs the chance of alignment, $\rho^{\prime\prime} = \frac{1}{2} + \frac{1}{\pi}\arcsin(\rho)$, in terms of $\rho$ for bivariate Normal distributions. The dashed line is the linear approximation to it around $\rho=0$, which has slope $\frac{1}{\pi}\approx 0.318$.

This one-to-one correspondence between $\rho$ and $\rho^\prime$ for the bivariate Normal distribution shows we could use either definition equally well for describing correlations where a Normal model applies, although I suspect the sampling distribution of $\rho^\prime$ might be a little more difficult to derive. There is not necessarily such a one-to-one correspondence between $\rho$ and $\rho^\prime$ among all bivariate distributions, though. For a given value of $\rho^\prime$, we could vary $\rho$ quite a bit by moving a tiny bit of probability mass around in the $(X,Y)$ plane at large distances from the origin (keeping it within its quadrant), without changing $\rho^\prime$ at all.
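A small discrete example illustrates this last point. In the sketch below, which uses a made-up six-point distribution and hypothetical function names, 2% of the probability mass sits in the unaligned quadrants at distance `far` from the origin. Moving that mass outward leaves the sign-based measure at $0.6$ while pulling the Pearson correlation from $0.6$ down to a negative value.

```python
import math


def weighted_pearson(points):
    """Pearson correlation of a discrete distribution given as (x, y, p) triples."""
    mx = sum(p * x for x, y, p in points)
    my = sum(p * y for x, y, p in points)
    vx = sum(p * (x - mx) ** 2 for x, y, p in points)
    vy = sum(p * (y - my) ** 2 for x, y, p in points)
    cov = sum(p * (x - mx) * (y - my) for x, y, p in points)
    return cov / math.sqrt(vx * vy)


def weighted_sign_correlation(points):
    """Chance of alignment minus chance of non-alignment, centered at the means."""
    mx = sum(p * x for x, y, p in points)
    my = sum(p * y for x, y, p in points)

    def sgn(t):
        return (t > 0) - (t < 0)

    return sum(p * sgn(x - mx) * sgn(y - my) for x, y, p in points)


def example(far):
    """Aligned mass 0.8; unaligned mass 0.18 near the origin and 0.02 at distance `far`."""
    return [
        (1, 1, 0.4), (-1, -1, 0.4),
        (1, -1, 0.09), (-1, 1, 0.09),
        (far, -far, 0.01), (-far, far, 0.01),
    ]
```

With `far = 1` both measures equal $0.6$. With `far = 10` the sign-based measure is still $0.6$, but the Pearson correlation becomes $-1.38/2.98 \approx -0.46$. The construction is symmetric so that the means stay at the origin and no signs flip.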