How should I interpret a Spearman's rank correlation significance of zero?

p-value, spearman-rho

This is a follow-up to a Stack Overflow question.

I'm calculating the Spearman rank correlation coefficient between two vectors using corr in Matlab.

The two vectors represent the frequency of terms in different types of document. For example, one type might be a webpage and the other might be a newspaper article. So each vector is 1 by n, where n is the number of terms in my vocabulary. In this case, n = 1500. The reason I am calculating rank correlation is that I want to be able to say whether the vocabulary distribution differs between the different types of document. I normalize each vector, so that the rank corresponds to the percentage of documents each vocabulary term appears in.

From this calculation, I get rho = 0.8879 and p = 0.

As I understand it, when p is small, the correlation is significant, but this is so extremely small that I am slightly concerned.

This is what my data distribution looks like on a log-log plot.

Data plot

From the Matlab documentation for corr, the p-value for Spearman is computed using permutation distributions.

Here is my understanding of how this calculation works, building on the Wikipedia article about permutation testing. Initially, the correlation coefficient is calculated as the "observed value of the test statistic, T(obs)". Then both input sets are mixed together, and the correlation coefficient is computed for every possible resampling of the mixed data points. The one-sided p-value of the test is calculated as the proportion of sampled permutations where the correlation is greater than or equal to T(obs). The two-sided p-value of the test is the proportion where it was less than or equal to T(obs).

Therefore, to get a p-value of zero, I would need to get all of the correlation coefficients for the sampled permutations to either be greater than or all be less than T(obs). That seems extremely unlikely since my datapoints don't lie exactly on a line.
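
For comparison, a permutation test for a correlation is usually run by shuffling one vector against the other, not by pooling the two sets. Below is a minimal Python sketch of that idea. It is not Matlab's corr, the function names are made up for illustration, and the ten-point data set is hypothetical. It draws random permutations and does not enumerate all of them, and it uses the common "+1" correction, so the reported p-value can never be exactly zero.

```python
import random

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_permutation_test(x, y, n_perm=2000, seed=0):
    """Spearman's rho plus a two-sided Monte Carlo permutation p-value."""
    rng = random.Random(seed)
    rx, ry = ranks(x), ranks(y)
    rho_obs = pearson(rx, ry)  # Spearman's rho = Pearson on the ranks
    shuffled = ry[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson(rx, shuffled)) >= abs(rho_obs) - 1e-12:
            hits += 1
    # "+1" correction: the observed pairing counts as one permutation
    p_value = (hits + 1) / (n_perm + 1)
    return rho_obs, p_value

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
rho, p = spearman_permutation_test(x, y)
print(rho, p)
```

With n_perm random shuffles, the smallest p-value this can report is 1/(n_perm + 1), which is a resolution limit of the procedure and not a true zero.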

Am I correctly understanding the calculation? Is there an easier way to understand what the p-value is describing here?

Here is a link to the data on Dropbox, if you want to see if you get the same results.

Best Answer

The P value isn't zero. It is just very small and the program you use rounds it off to zero, rather than (correctly) saying it is less than some threshold.
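
To make this concrete, here is a rough Python sketch that works out the logarithm of the p-value without ever forming the p-value itself. It assumes the large-sample Student-t approximation, t = rho * sqrt((n - 2) / (1 - rho^2)) with n - 2 degrees of freedom, which is a common choice for n this large; I have not checked that Matlab uses exactly this formula. The function name is made up for illustration. It relies on the identity that the two-sided t-test p-value equals the regularized incomplete beta function I_x(df/2, 1/2) with x = df / (df + t^2), which here simplifies to x = 1 - rho^2.

```python
import math

def log_p_spearman_t_approx(rho, n):
    """Natural log of the two-sided p-value for Spearman's rho under the
    Student-t approximation with df = n - 2 (an assumption, see lead-in)."""
    if n < 3 or not (0.0 < abs(rho) < 1.0):
        raise ValueError("need n >= 3 and 0 < |rho| < 1")
    a = (n - 2) / 2.0      # df / 2
    b = 0.5
    x = 1.0 - rho * rho    # equals df / (df + t^2)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    # I_x(a, b) = x^a (1-x)^b / (a B(a, b)) * 2F1(a+b, 1; a+1; x);
    # sum the hypergeometric series term by term (it converges for x < 1).
    term = 1.0
    total = 1.0
    k = 0
    while term > 1e-17 * total and k < 100000:
        term *= x * (a + b + k) / (a + 1.0 + k)
        total += term
        k += 1
    return (a * math.log(x) + b * math.log1p(-x)
            - math.log(a) - log_beta + math.log(total))

# Values reported in the question
log_p = log_p_spearman_t_approx(0.8879, 1500)
log10_p = log_p / math.log(10)
p_double = math.exp(log_p)  # what a double-precision p-value would hold
print(log10_p, p_double)
```

Under these assumptions, log10_p comes out hundreds of orders of magnitude below zero, far under the smallest positive double (about 5e-324), so converting back to an ordinary floating-point number gives exactly 0.0. The reported zero is therefore an artifact of number representation, not a statement that the probability is zero.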