Correlation – Validity of Pearson Correlation of Ranks and the Normality Assumption

correlation, normality-assumption, ranks, spearman-rho

I am currently reading up on assumptions for Pearson correlations. An important assumption for the ensuing t-test seems to be that both variables come from normal distributions; if they don't, then the use of alternative measures such as the Spearman rho is advocated. The Spearman correlation is computed like the Pearson correlation, only using the ranks of X and Y instead of X and Y themselves, correct?

My question is: If the input variables into a Pearson correlation need to be normally distributed, why is the calculation of a Spearman correlation valid even though the input variables are ranks? My ranks certainly don't come from normal distributions…

The only explanation I have come up with so far is that rho's significance might be tested differently from that of the Pearson correlation t-test (in a way that does not require normality), but so far I have found no formula. However, when I ran a few examples, the p-values for rho and for the t-test of the Pearson correlation of ranks always matched, save for the last few digits. To me this does not look like a groundbreakingly different procedure.
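A check of this kind can be sketched as follows. This is only an illustration, assuming Python with NumPy and SciPy; the simulated exponential data, the seed, and the variable names are made up for the example and are not the examples referred to above. It computes rho directly, computes the Pearson correlation of the ranks, and runs the usual t-test with n - 2 degrees of freedom by hand.

```python
import numpy as np
from scipy import stats

# Hypothetical, clearly non-normal data (assumption: any continuous data would do).
rng = np.random.default_rng(42)
n = 30
x = rng.exponential(size=n)
y = x + rng.exponential(size=n)

# Spearman's rho computed directly, and Pearson's r computed on the ranks.
rho, p_rho = stats.spearmanr(x, y)
rx, ry = stats.rankdata(x), stats.rankdata(y)
r_ranks, p_ranks = stats.pearsonr(rx, ry)

# The usual t-test for a correlation coefficient, done by hand on the rank correlation.
t_stat = r_ranks * np.sqrt((n - 2) / (1 - r_ranks**2))
p_t = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"rho = {rho:.6f}, Pearson r of ranks = {r_ranks:.6f}")
print(f"p (spearmanr) = {p_rho:.6g}")
print(f"p (pearsonr on ranks) = {p_ranks:.6g}")
print(f"p (t-test by hand) = {p_t:.6g}")
```

If the p-values agree, that suggests the software is applying the same t-based calculation to the rank correlation, which matches the observation above; it does not by itself show that this calculation is the only way rho is tested.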

Any explanations and ideas you might have would be appreciated!

Best Answer

Normality is not required to calculate a Pearson correlation; it's just that some forms of inference about the corresponding population quantity (CIs and hypothesis tests) are based on a normality assumption.

If you don't have normality, the implied properties of that particular form of inference won't hold.

In the case of the Spearman correlation, you don't have normality, but that's fine because the inference calculations for the Spearman correlation (such as the hypothesis test) are not based on a normality assumption.

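They're derived by treating the data as a set of paired ranks from a continuous bivariate distribution. Under the null hypothesis of independence, every pairing of the two sets of ranks is equally likely, so the hypothesis test uses the permutation distribution of the test statistic computed from the ranks.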
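A permutation test of this kind can be sketched in a few lines. This is a minimal Monte Carlo version, assuming Python with NumPy and SciPy; the function name `spearman_permutation_test`, the number of permutations, and the small skewed data set are all made up for illustration. An exact test would enumerate every permutation instead of sampling them.

```python
import numpy as np
from scipy.stats import rankdata

def spearman_permutation_test(x, y, n_perm=9999, seed=0):
    """Two-sided Monte Carlo permutation test for Spearman's rho (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    rx, ry = rankdata(x), rankdata(y)
    # Spearman's rho is the Pearson correlation of the ranks.
    rho_obs = np.corrcoef(rx, ry)[0, 1]
    count = 0
    for _ in range(n_perm):
        # Under independence, every re-pairing of the ranks is equally likely.
        rho_perm = np.corrcoef(rx, rng.permutation(ry))[0, 1]
        if abs(rho_perm) >= abs(rho_obs) - 1e-12:
            count += 1
    # Add 1 to numerator and denominator so the observed pairing counts as one permutation.
    return rho_obs, (count + 1) / (n_perm + 1)

# Hypothetical, heavily skewed data: nowhere near normal, but nearly monotone.
x = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
y = [3.0, 1.0, 4.0, 9.0, 7.0, 20.0, 55.0, 30.0, 400.0, 90.0]

rho_obs, p_perm = spearman_permutation_test(x, y)
print(f"rho = {rho_obs:.4f}, permutation p-value = {p_perm:.4f}")
```

Nothing in this calculation refers to the distribution of the original values; only the ranks enter, which is why no normality assumption is needed.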

When the usual assumptions for inference with the Pearson correlation hold (bivariate normality), the Spearman correlation is usually very close to the Pearson correlation (though on average a little closer to 0).
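A quick simulation illustrates how close the two are. This is a sketch assuming Python with NumPy and SciPy; the true correlation of 0.7, the sample size, and the seed are arbitrary choices for the example. For comparison it also prints the standard population relation for bivariate normal data, rho_s = (6/pi) * arcsin(rho/2), which is not part of the answer above but is a well-known result.

```python
import numpy as np
from scipy import stats

# Assumed settings for illustration only.
rng = np.random.default_rng(0)
rho_true = 0.7
cov = [[1.0, rho_true], [rho_true, 1.0]]

# Large bivariate normal sample, so sampling noise is small.
xy = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
x, y = xy[:, 0], xy[:, 1]

r_pearson = stats.pearsonr(x, y)[0]
r_spearman = stats.spearmanr(x, y)[0]

# Population Spearman correlation implied by bivariate normality.
rho_s_theory = 6 / np.pi * np.arcsin(rho_true / 2)

print(f"Pearson r    = {r_pearson:.4f}")
print(f"Spearman rho = {r_spearman:.4f}")
print(f"(6/pi) * arcsin(rho/2) = {rho_s_theory:.4f}")
```

With a true correlation of 0.7 the formula gives roughly 0.68, so the Spearman value sits slightly closer to 0 than the Pearson value.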

(So when you could use the Pearson, the Spearman often does quite well. If you had nearly bivariate normal data apart from some contamination by another process that produced outliers, the Spearman would be a more robust way to estimate the correlation in the uncontaminated distribution.)
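The robustness point can be illustrated with a small simulation. This is a sketch assuming Python with NumPy and SciPy; the clean correlation of 0.8, the 500 clean points, and the 10 outliers placed near (10, -10) are arbitrary choices made up for the example.

```python
import numpy as np
from scipy import stats

# Assumed settings for illustration only.
rng = np.random.default_rng(1)
rho_clean = 0.8
cov = [[1.0, rho_clean], [rho_clean, 1.0]]

# 500 points from the uncontaminated bivariate normal distribution.
clean = rng.multivariate_normal([0.0, 0.0], cov, size=500)

# 10 points from a different process, far away and against the main trend.
outliers = np.column_stack([
    10.0 + rng.normal(scale=0.5, size=10),
    -10.0 + rng.normal(scale=0.5, size=10),
])

data = np.vstack([clean, outliers])
x, y = data[:, 0], data[:, 1]

r_pearson = stats.pearsonr(x, y)[0]
r_spearman = stats.spearmanr(x, y)[0]

print(f"correlation of the clean process: {rho_clean}")
print(f"Pearson r on contaminated data:    {r_pearson:.3f}")
print(f"Spearman rho on contaminated data: {r_spearman:.3f}")
```

Because each outlier can move a rank by at most n - 1 positions, its influence on the Spearman correlation is bounded, whereas its influence on the Pearson correlation grows with its distance from the bulk of the data.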