R – Identifying Issues with Spearman Correlation in Presence of Many Ties

correlation, monte-carlo, r, spearman-rho

I'm testing the hypothesis that there's a monotonic relationship between two variables. I think I should use a Spearman rank correlation test, since my data don't necessarily meet normality assumptions & have many outliers. However, there are many ties in the independent variable. How can I tell whether the ties are causing me a problem?

The data look something like this (R code):

set.seed(0)
x <- rep(1:10, 10)
y <- x + rnorm(length(x), sd=rep(x, 10))


One approach I can think of is to add a small random number to each x value many times, and look at the mean/median p-value, like so:

nReps <- 100
pVec <- rep(NA, nReps)
for(i in 1:nReps) {
  # Draw one jitter value per observation (not one per repetition)
  xDodge <- x + rnorm(n=length(x), mean=0, sd=0.0001)
  pVec[i] <- cor.test(xDodge, y, method="spearman")$p.value
}
mean(pVec)
sd(pVec)

Does that method seem reasonable? Is there a previously-described method to assess the effect of ties on Spearman's rho, or a similar correlation method that does better with large numbers of ties?

Best Answer

Use a permutation test. You only need to permute one of the variables independently of the other; here, the response is permuted. Because the relationship in the example is strong, only a small number of permutations are needed (1000 in the example below).

As always, the actual statistic is compared to the distribution of permuted statistics. The p-value is the estimate of the tail probability of the permutation distribution relative to the actual statistic. In some cases the test statistic has a discrete distribution, so it's wise to check the frequencies with which (a) the permutation statistics strictly exceed the actual statistic and (b) the permutation statistics equal or exceed the actual statistic. The code illustrates this by splitting the difference.

test <- function(y) suppressWarnings(cor.test(x, y, method="spearman")$estimate)
rho <- test(y)                                     # Test statistic
p <- replicate(10^3, test(sample(y, length(y))))   # Simulated permutation distribution

p.out <- sum(abs(p) > abs(rho))    # Count of strict (absolute) exceedances
p.at <- sum(abs(p) == abs(rho))    # Count of equalities, if any
(p.out + p.at / 2) / length(p)     # Proportion of exceedances: the p-value.

suppressWarnings quiets any complaints from cor.test that it cannot compute a p-value due to ties.
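For reuse, the same idea can be wrapped in a function. This is only a sketch under stated assumptions: the name perm_spearman and its arguments n_perm and tol are hypothetical, the statistic is computed as the Pearson correlation of midranks (which is what Spearman's rho amounts to when there are ties), and equality is checked within a small tolerance instead of exactly to guard against floating-point noise.

```r
# Hypothetical helper (not from the answer above): two-sided permutation
# test for Spearman's rho, splitting the difference on exact ties.
perm_spearman <- function(x, y, n_perm = 1000, tol = 1e-12) {
  rx <- rank(x)                    # midranks, so tied values share a rank
  ry <- rank(y)
  rho <- cor(rx, ry)               # Spearman's rho = Pearson correlation of ranks
  # Permute the response ranks independently of x
  perm <- replicate(n_perm, cor(rx, sample(ry)))
  n_out <- sum(abs(perm) > abs(rho) + tol)           # strict exceedances
  n_at  <- sum(abs(abs(perm) - abs(rho)) <= tol)     # equalities within tol
  list(rho = rho, p.value = (n_out + n_at / 2) / n_perm)
}

# Usage with the data from the question
set.seed(0)
x <- rep(1:10, 10)
y <- x + rnorm(length(x), sd = rep(x, 10))
res <- perm_spearman(x, y)
res$rho
res$p.value
```

Working on the ranks directly avoids the warnings from cor.test, so suppressWarnings is not needed in this variant.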

Related Solutions

Spearman Correlation – Dealing with Problems in Spearman Correlation Due to Many Ties

While ranking the data for use in Spearman correlation is possible with Excel formulas (like almost everything), it is not that easy.

I would suggest a somewhat easier solution that, at the moment, works only in 32-bit Excel: use RExcel.

First, download and install R 2.15.2 for Windows.
Then open the R prompt and copy and paste the following code, which installs, among other things, the components that allow fairly seamless communication with Excel (all of the software comes with uninstallers in case you decide to remove it later):

makeSureInstalled <- function(package)
{
   # Install the package only if it is not already present, then load it
   if(length(grep(paste("^",package,"$",sep=""),noquote(installed.packages())[,1]))==0)
      install.packages(package)
   library(package=package,character.only=TRUE)
}
makeSureInstalled("rcom")
installstatconnDCOM()
comRegisterServer()
comRegisterRegistry()
makeSureInstalled("RExcelInstaller")
installRExcel(ForegroundServer=TRUE)

Then open Excel and paste your data (I'll assume it comes in two columns).
On the new ribbon, click "Start R":
Enter these formulas:
Cell H8 will then contain the p-value.
If you want the Spearman correlation coefficient $\rho$, type in this formula:

=REval("cor.test(var1,var2,method='spearman')$estimate")

Correlation Coefficients – Explanation for Pearson’s Larger Than Spearman’s Rank Correlation Coefficient

This is a simple dataset, where the points come alternating from two linear functions:

The Pearson correlation detects a general upward trend in the combined data (red and black together): r = .453. The Spearman correlation sees only the ranks, which are distributed like this:

High and low ranks alternate, so Spearman sees no clear trend: Spearman r = .079. The Pearson coefficient is 5.7 times as high, and you can easily increase that value by extending the row. You can even get a negative Spearman with a positive Pearson just by leaving out the last value. So nothing stands in the way of a combination of a large Pearson and a small Spearman r, and the picture above is even somewhat similar to yours.

You can easily see how I constructed the data by looking at them:

1, -.01, 2, -.02, 3, -.03, 4, -.04, 5, -.05, 6, -.06, 7, -.07, 8, -.08, 9, -.09, 10
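A minimal R sketch for checking these numbers, assuming the values listed above are correlated against their position in the sequence (1 to 19); that pairing is an assumption, since the answer does not state the x variable explicitly. The names vals and idx are illustrative only.

```r
# Rebuild the alternating series: 1, -.01, 2, -.02, ..., 9, -.09, 10
k <- 1:9
vals <- c(rbind(k, -k / 100), 10)   # rbind + c() interleaves the two sequences
idx <- seq_along(vals)              # assumed x variable: position 1..19

r_pearson  <- cor(idx, vals)                       # Pearson on the raw values
r_spearman <- cor(idx, vals, method = "spearman")  # Spearman on the ranks

# Same comparison after leaving out the last value
n <- length(vals)
r_pearson_short  <- cor(idx[-n], vals[-n])
r_spearman_short <- cor(idx[-n], vals[-n], method = "spearman")

round(c(pearson = r_pearson, spearman = r_spearman), 3)
round(c(pearson = r_pearson_short, spearman = r_spearman_short), 3)
```

Under that assumption, the first pair should come out close to the coefficients quoted above, and the second pair shows the sign of Spearman's coefficient once the final point is dropped.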

Hope that helps, Bernhard
