Spearman’s Rank Correlation – Understanding the p-value

p-valuespearman-rho

I have some pairs of datasets (n=200 or thereabouts), of samples which are non-negative and not normally distributed. I think these pairs of variables are related, probably linearly.

Calculating Spearman's rank correlation on these datasets gives some strange results. The correlation coefficients show that the pairs of variables are weakly, positively correlated (e.g. rho of around 0.4), but the p-values are very low (e.g. 4.1e-10).

My vague understanding of this is that the variables are weakly, positively correlated but the probability of unrelated variables producing the same correlation is very low. Does this mean we can be reasonably certain that a positive correlation exists or have I misunderstood?

Best Answer

Your understanding of the p-value is correct (well technically it is the probability of seeing the observed correlation or stronger) if no correlation exists.

What is a strong or weak correlation is depends on the context, it is often good to plot your data, or generate random data with a given correlation and plot that to get a feel for the strength of the correlation.

The p-value is determined by the observed correlation and the sample size, so with a large enough sample size a very weak correlation can be significant, meaning that what you saw is likely real and not due to chance, it just may not be very interesting. On the other hand with small sample sizes you can get a very strong correlation that is not statistically significant meaning that chance and no relationship is a plausible explanation (think about 2 points, the correlation will almost always be 1 or -1 even when there is no relationship, so that size can easily be attributed to chance).

Matlab code

n = 36;
p = 1000;

X = randn(n,p);
C = corr(X);
offDiagElements = C(logical(triu(C,1)));

figure
step = 0.01;
x = -1:step:1;
h = histc(offDiagElements, x);
stairs(x,h/sum(h)/step)
hold on

r = -1:0.01:1;
plot(r, 1/beta(1/2,(n-2)/2)*(1-r.^2).^((n-4)/2), 'r')

sigma2 = var(offDiagElements);
plot(r, 1/sqrt(sigma2*2*pi)*exp(-r.^2/(2*sigma2)), 'k--')

Spearman's correlation coefficient

I am not aware of theoretical results about the distribution of sample Spearman's correlations. But in the simulation above it is very easy to replace the Pearson's correlations with Spearman's ones:

C = corr(X, 'type', 'Spearman');

and this does not seem to change the distribution at all.

Update: @Glen_b pointed out in chat that "the distribution can't be the same because the distribution for the Spearman is discrete while that for the Pearson is continuous". This is true and can be clearly seen with my code for smaller values of $n$. Curiously, if one uses a large enough histogram bin so that the discreteness disappears, the histogram starts overlapping perfectly with the Pearson's one. I am not sure how to formulate this relationship mathematically precisely.

Solved – Equivalent to Spearman correlation for non-monotonic data

Is there a way to test if my data is monotonic prior to Spearman's rho / Kendall's tau correlation calculations?

You could plot the data and look for a non-monotone shape.

Also, you could fit a generalized additive model (GAM) which estimates nonparametric functions of the predictor variables. This can be done in the mgcv package in R. For example:

require(mgcv)
set.seed(123)
n <- 100

x <- runif(n,-5,5)

y <- x^2 + rnorm(n,0,4) 
plot(x,y, col="red")

which produces:

Note that

> cor.test(x, y, method = "kendall")

sample estimates:
        tau 
-0.01454545 

> cor.test(x, y, method = "spearman")

sample estimates:
         rho 
-0.005664566

So, both Spearman's rho and Kendall's tau are not helpful.

Now, if we run a GAM, we get

> summary(m0 <- gam(y~s(x)))

.
.
.
Approximate significance of smooth terms:
       edf Ref.df     F p-value    
s(x) 8.277  8.861 46.72  <2e-16 ***
.
.
.

With edf>1 there is evidence of non-linearity in the data, which doesn't prove that the association is non-monotonic, but nevertheless suggests that it might be.

Is it possible to decompose my dataset into monotonic sections, to analyse them separately?

Yes ! Sticking with the same dataset, we can do:

x1 <- x[x<0]
y1 <- y[x<0]

x2 <- x[x>=0]
y2 <- y[x>=0]

cor.test(x1, y1, method = "kendall")
cor.test(x1, y1, method = "spearman")

which gives:

sample estimates:
       tau 
-0.5878084 

sample estimates:
       rho 
-0.7905983

and this handles the first segment of the data, then:

cor.test(x2, y2, method = "kendall")
cor.test(x2, y2, method = "spearman")

which gives:

sample estimates:
      tau 
0.7446809 

sample estimates:
      rho 
0.9155874

So here we can see a strong negative association in the first segment and a strong positive association in the second.

Is there any equivalent to Spearman's rho test (or Kendall's tau) that accounts for multiple monotonic components?

Not that I am aware of.

Best Answer

Related Solutions

r – How to Understand the Distribution of Sample Correlation Coefficients Between Two Uncorrelated Normal Variables

Matlab code

Spearman's correlation coefficient

Solved – Equivalent to Spearman correlation for non-monotonic data

Related Question