Solved – Permutation Test for Spearman Correlation Coefficient

correlation, permutation-test, r, spearman-rho

I have this bivariate data:

x=(7.1,7.1,7.2,8.3,9.4,10.5,11.4)

y=(2.8,2.9,2.8,2.6,3.5,4.6,5.0)

I want to examine the relationship between x and y using Spearman's correlation coefficient (R computes r = 0.7), and I want to test the correlation coefficient for significance. The null hypothesis is rho = 0 (two-sided test), i.e. no relationship. The following R code computes an approximate p-value of 0.07992:

cor.test(x,y,method="spearman")

But R gives me the following warning: "Cannot compute exact p-value with ties".
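For reference, the complete, runnable version is just the data from above plus the same call:

x <- c(7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4)
y <- c(2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0)
cor.test(x, y, method = "spearman")   # reports rho = 0.7 and the approximate p-value 0.07992, with the ties warning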

Now I want to run a permutation test to get the exact p-value (alpha = 0.05, two-sided). I looked it up on the Internet, and I think it should be possible with the package "coin", but I have no idea how to do this.

I have already found the following solution:

library(coin)

spearman_test(y~x,distribution=approximate(B=9999)) 

R computes a p-value of 0.08641, but I am not sure whether this is correct. I want an exact p-value, not an approximate one.
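For reproducibility I think one can also set a seed before the call and pull the p-value out with coin's pvalue() (a sketch; the seed value is arbitrary):

library(coin)
set.seed(1)                           # arbitrary seed, only so the resampled p-value can be reproduced
st <- spearman_test(y ~ x, distribution = approximate(B = 9999))
pvalue(st)                            # the resampled (Monte Carlo) p-value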

I would be really grateful if anybody could help me.

Best Answer

Seven observations is small enough to list out the entire permutation distribution ($7! = 5040$, which is fewer than the 9999 resamples you drew), and that's easy enough to do in stats packages that let you write code.

[You can do this fairly easily with various packages in R (coin can probably do it with the right options set). Here's an example using a function that generates all the permutations:

library(combinat)                                            # for permn(), which generates all permutations
spcor <- sapply(permn(x), cor, y = y, method = "spearman")   # Spearman rho for every permutation of x against the fixed y
mean(abs(spcor) >= 0.7)                                      # two-sided p-value: proportion of |rho| at least as extreme as the observed 0.7
[1] 0.08849206

Note that permn returns a list of permutations, and sapply applies the cor function to each one.]

However, in larger samples it's not feasible to list out the entire distribution, but sampling the permutation distribution (a randomization test, as you did in your question) is fine. You can even give a standard error (or, if you prefer, a confidence interval) for the true p-value, since the number of resampled statistics at least as extreme as the observed one is binomial, so the sampled p-value is just a scaled binomial count.

So for your results you had a sampled $p$ ($\hat{p}$, say) of 0.08641, giving $\text{se}(\hat{p}) = \sqrt{0.08641 \times (1 - 0.08641)/9999} = 0.00281$. This ability to give a standard error is useful for figuring out how many resamples to take to reach some desired margin of error. (Note that your resampled estimate of $p$ was less than one standard error away from the exact p-value of 0.0885 computed above.)
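As a quick sketch of that calculation in R ($\hat{p}$ and B are just the numbers from your run above):

phat <- 0.08641                        # the resampled p-value you reported
B <- 9999                              # number of resamples used
se_phat <- sqrt(phat * (1 - phat) / B) # binomial standard error, about 0.00281
phat + c(-2, 2) * se_phat              # a rough 95% interval for the true permutation p-value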

E.g. if you want about 2 significant figures of accuracy, you'd probably want the standard error to be a good bit less than 0.0005, so you'd want something a fair bit over 320000 resamples (at a minimum) for that modest level of accuracy when p is about 0.086. On the other hand, it's not clear why a very accurate p-value would be necessary if it's not very close to your significance level. (Does it matter whether it's 0.08 or 0.09 or something in between?)
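As a back-of-the-envelope sketch of that (the 0.086 and 0.0005 are just the values discussed above):

p_guess <- 0.086                          # rough size of the p-value being estimated
target_se <- 0.0005                       # desired standard error
p_guess * (1 - p_guess) / target_se^2     # resamples needed: B = p(1-p)/se^2, about 314000 here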

[Note that 9! = 362880 is also a bit over 320 thousand. Your original number of observations would need to be at least 10 before the total number of permutations is substantially larger than the number of resamples required for roughly 2-decimal-place accuracy; so by n = 10 I'd definitely suggest you consider the randomization test, unless you're using specialized algorithms for enumerating the tail of the permutation distribution.]
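For reference, here is how quickly full enumeration outgrows that resample count:

factorial(7:10)                           # 5040, 40320, 362880 and 3628800 permutations for n = 7, 8, 9, 10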