Instability of one-pass algorithm for correlation coefficient

correlation, pearson-r

What should I know about the instability of the Pearson product-moment correlation coefficient? When might I experience problems using this calculation?

I will quote the following Wikipedia article for some background information:
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Mathematical_properties

The Pearson correlation can be expressed in terms of uncentered moments. Since $\mu_X = E(X)$, $$\sigma_X^2 = E[(X - E(X))^2] = E(X^2) - (E(X))^2$$ and likewise for $Y$, and since

$$E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y)$$
the correlation can also be written as

$$\rho_{X,Y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-(E(X))^2}~\sqrt{E(Y^2)- (E(Y))^2}}$$
Alternative formulae for the sample Pearson correlation coefficient are also available:

$$r_{xy}=\frac{\sum x_iy_i-n \bar{x} \bar{y}}{(n-1) s_x s_y}=\frac{n\sum x_iy_i-\sum x_i\sum y_i}
{\sqrt{n\sum x_i^2-(\sum x_i)^2}~\sqrt{n\sum y_i^2-(\sum y_i)^2}}$$
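The second formula needs no further correction for a sample: since $n\sum x_i^2-(\sum x_i)^2 = n(n-1)s_x^2$ (and likewise for $y$), and the numerator equals $n\left(\sum x_iy_i-n \bar{x} \bar{y}\right)$, the factors of $n$ and $n-1$ cancel and the two expressions agree exactly. Replacing $n$ with $n-1$ under the square roots would break this equality.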
The above formula suggests a convenient single-pass algorithm for calculating sample correlations, but, depending on the numbers involved, it can sometimes be numerically unstable.

Best Answer

You can run into problems whenever a term such as $n\sum y_i^2$ or $n\sum x_iy_i$ is very large and yet nearly equal to the term subtracted from it, $(\sum y_i)^2$ or $\sum x_i\sum y_i$. Almost all of the leading significant digits then cancel, and the difference retains only a few accurate digits, if any. In the case of the variance, this happens when the standard deviation is small compared to the mean.
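To make the cancellation concrete, here is a small sketch in Python. The data set, the function names `naive_variance` and `two_pass_variance`, and the use of IEEE 754 double precision are assumptions chosen for illustration. The four values have a mean near $10^9$ and an exact sample variance of 30, so the standard deviation is tiny compared to the mean.

```python
def naive_variance(values):
    """One-pass sample variance via (n*sum(x^2) - (sum x)^2) / (n*(n-1))."""
    n = 0
    sum_x = 0.0
    sum_xx = 0.0
    for v in values:
        n += 1
        sum_x += v
        sum_xx += v * v  # each square is near 1e18, beyond exact double range
    return (n * sum_xx - sum_x * sum_x) / (n * (n - 1))


def two_pass_variance(values):
    """Reference: compute the mean first, then sum squared deviations."""
    n = 0
    total = 0.0
    for v in values:
        n += 1
        total += v
    mean = total / n
    ss = 0.0
    for v in values:
        d = v - mean
        ss += d * d  # deviations are small, so nothing cancels here
    return ss / (n - 1)


# Illustrative data: large mean, small spread (exact sample variance is 30).
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

naive = naive_variance(data)
two_pass = two_pass_variance(data)
print("naive one-pass:", naive)
print("two-pass      :", two_pass)
```

With double-precision floats the naive result should land far from 30, and it can even come out negative, while the two-pass reference recovers the exact value. Subtracting a rough estimate of the mean from every value before accumulating would also avoid most of the loss, since the variance is unchanged by a shift.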

It's possible to construct one-pass forms for all the terms under the $\sqrt{}$ signs that don't suffer this sort of problem.

There's an example calculation for a variance given here. Similar calculations for covariance can be done.
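As a sketch of what such a one-pass form can look like, the following Python function uses Welford-style running updates for the means, the two sums of squared deviations, and the co-moment. This is one common approach and not necessarily the calculation linked above; the function name `online_pearson` and the choice to return NaN for degenerate input are assumptions made for this example.

```python
import math


def online_pearson(pairs):
    """One-pass Pearson r using running means and centered co-moments."""
    n = 0
    mean_x = 0.0
    mean_y = 0.0
    m2_x = 0.0  # running sum of (x - mean_x)^2
    m2_y = 0.0  # running sum of (y - mean_y)^2
    c_xy = 0.0  # running sum of (x - mean_x) * (y - mean_y)
    for x, y in pairs:
        n += 1
        dx = x - mean_x  # deviation from the OLD mean of x
        dy = y - mean_y  # deviation from the OLD mean of y
        mean_x += dx / n
        mean_y += dy / n
        # Each update multiplies an old-mean deviation by a new-mean deviation.
        m2_x += dx * (x - mean_x)
        m2_y += dy * (y - mean_y)
        c_xy += dx * (y - mean_y)
    if n < 2 or m2_x == 0.0 or m2_y == 0.0:
        return float("nan")  # correlation is undefined in these cases
    return c_xy / math.sqrt(m2_x * m2_y)


# Illustrative data with a large mean and small spread; y is exactly 2x + 1.
xs = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
ys = [2 * v + 1 for v in xs]
r_linear = online_pearson(zip(xs, ys))
print("r for exactly linear data:", r_linear)

# A small hand-checkable case: r = 4 / sqrt(5 * 5) = 0.8.
r_small = online_pearson(zip([1, 2, 3, 4], [1, 3, 2, 4]))
print("r for small example:", r_small)
```

Because every accumulated quantity is a product of deviations from the running means, the sums stay on the scale of the spread of the data instead of the scale of the squared mean, so no large, nearly equal terms are ever subtracted.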
