Solved – Numerically stable correlation coefficient calculation

correlation, covariance, numerics

I have been trying to calculate the correlation coefficient $(\rho)$ of two variables, and noticed that when either $var(X)$ or $var(Y)$ is very small, the calculation gives incorrect results because the covariance comes out too small.

In fact, in the limit where, say, $x_i = \mu_x$ for all $i$ ($\mu_x$ is the true mean), both the numerator and the denominator of the correlation coefficient become zero.

As an example, take $X,Y$ points scattered along a straight line with thickness $\epsilon$. Clearly, as $\epsilon\to0$, $|\rho|\to1$, meaning that one would expect:

$$\left|\text{Cov}(X,Y)\right|\approx \sqrt{\text{var}(X)\text{var}(Y)}$$
In practice, I'm getting a covariance that is much smaller. As an example, here is an image of pixels for which I want to calculate $\rho$, with the shade of gray giving the weight.

[Image: the pixel cloud, with the shade of gray giving the weight]

My calculations give:

$$\sigma_x = 0.50, \quad \sigma_y = 52.25$$
$$\text{Cov}(X,Y) = 4.67 \ll \sqrt{\text{var}(X)\text{var}(Y)} = 26.11$$
This gives a $\rho$ of $0.18$, whereas I'd expect $\rho\approx 1$. The problem doesn't occur when the line is not aligned with one of the axes, i.e. when the variances are comparable.

Are there any methods for calculating the covariance that overcome this problem?

Best Answer

The correlation coefficient $\rho$ is $$\rho = \dfrac{Cov(X,Y)}{\sqrt{Var(X) Var(Y)}} $$
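For weighted data such as the gray-shaded pixels in the question, the formula can be evaluated as in the following sketch. It is only an illustration: the function name `weighted_correlation` and the list-based inputs are assumptions, and the two-pass centering (subtract the weighted means first, then accumulate) is a standard technique swapped in here to avoid cancellation, not something taken from the question.

```python
import math

def weighted_correlation(xs, ys, ws):
    """Weighted Pearson correlation using a two-pass, mean-centered computation."""
    total = sum(ws)
    # First pass: weighted means.
    mean_x = sum(w * x for x, w in zip(xs, ws)) / total
    mean_y = sum(w * y for y, w in zip(ys, ws)) / total
    # Second pass: central moments, accumulated from the centered values.
    var_x = sum(w * (x - mean_x) ** 2 for x, w in zip(xs, ws)) / total
    var_y = sum(w * (y - mean_y) ** 2 for y, w in zip(ys, ws)) / total
    cov_xy = sum(w * (x - mean_x) * (y - mean_y)
                 for x, y, w in zip(xs, ys, ws)) / total
    if var_x == 0.0 or var_y == 0.0:
        return math.nan  # 0/0: the coefficient is undefined here
    # Take the square roots separately so the product cannot under- or overflow.
    return cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

# Points on an exact line with a large offset; the weights are arbitrary.
xs = [1e9 + i for i in range(5)]
ys = [2 * x + 1 for x in xs]
ws = [1, 2, 3, 2, 1]
rho = weighted_correlation(xs, ys, ws)
```

Centering before squaring keeps the accumulated terms small even when the coordinates carry a large offset, which is where a one-pass $E[XY] - E[X]E[Y]$ computation tends to lose digits.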

For numerical stability, you can instead find $\exp(\log(\rho))$.

$$\rho = \exp(\log(\rho)) = \exp\left(\log(Cov(X,Y)) - \dfrac{1}{2}\log(Var(X)) - \dfrac{1}{2} \log(Var(Y)) \right). $$

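Thus, when the values are really small, their logarithms are negative numbers of moderate size, which makes them easier to work with. Note that this form requires $Cov(X,Y) > 0$; for a negative covariance, apply it to $|Cov(X,Y)|$ and restore the sign afterwards.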
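A minimal Python sketch of the log-domain formula, assuming the variances and the covariance have already been computed. The name `log_domain_correlation` is hypothetical, and the sign handling and the NaN return for non-positive variances are additions needed because the logarithm is only defined for positive arguments.

```python
import math

def log_domain_correlation(var_x, var_y, cov_xy):
    """Evaluate rho = cov / sqrt(var_x * var_y) through logarithms."""
    if var_x <= 0.0 or var_y <= 0.0:
        return math.nan  # log is undefined; rho itself is 0/0 here
    if cov_xy == 0.0:
        return 0.0
    log_abs_rho = (math.log(abs(cov_xy))
                   - 0.5 * math.log(var_x)
                   - 0.5 * math.log(var_y))
    # Restore the sign that was dropped by taking log(|cov|).
    return math.copysign(math.exp(log_abs_rho), cov_xy)

# Tiny moments: the product var_x * var_y underflows to 0.0 in double
# precision, so the direct formula would divide by zero.
tiny_var = 1e-200
direct_denominator = math.sqrt(tiny_var * tiny_var)
rho_pos = log_domain_correlation(tiny_var, tiny_var, 5e-201)
rho_neg = log_domain_correlation(tiny_var, tiny_var, -5e-201)
```

This protects against underflow or overflow in the product of the variances. It does not change the value of $\rho$ when the direct formula can already be evaluated.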

When the variance of a random variable is exactly 0, its correlation with any other variable is strictly undefined ($0/0$), and it is often taken to be 0 by convention. In your example, the variance along the $Y$ variable is large, but the variance along the $X$ variable is very small. When $\epsilon \to 0$, the correlation does not go to 1, because $Var(X) \to 0$ as well.

Intuitively, you can understand it this way. Correlation tells you how much a change in $X$ goes along with a linear change in $Y$. When there is no variability in $X$, there is no change in $X$, so it cannot go along with any linear change in $Y$!

When your pixel cloud is not aligned with one of the coordinate axes, there is positive variation in both $X$ and $Y$, and so you can expect the correlation to be close to 1 in magnitude.
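The effect of orientation can be checked with a small synthetic example. The sketch below is an illustration under assumed values: a thin band of half-width `EPS = 0.5` along the $Y$ axis, built on a regular grid, and the same band rotated by 45 degrees. The helper names `pearson` and `rotate` are hypothetical.

```python
import math

def pearson(points):
    """Unweighted Pearson correlation of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    if var_x == 0.0 or var_y == 0.0:
        return math.nan  # undefined when one variable does not vary
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))

def rotate(points, angle):
    """Rotate every point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

EPS = 0.5  # half-width of the band (assumed value)
# Thin vertical band: x takes three values in [-EPS, EPS], y runs over -50..50.
band = [(EPS * u, float(t)) for t in range(-50, 51) for u in (-1, 0, 1)]

rho_aligned = pearson(band)                       # band along the Y axis
rho_rotated = pearson(rotate(band, math.pi / 4))  # same band, tilted 45 degrees
print(rho_aligned, rho_rotated)
```

In the axis-aligned band, $X$ and $Y$ vary independently of each other, so the coefficient is essentially zero no matter how thin the band is. After the rotation both coordinates share the variation along the band, and the magnitude of the coefficient is close to 1; its sign depends on the direction of the tilt.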
