Correct ranking for Spearman Correlation

correlation statistics

To calculate the Spearman correlation coefficient, the data must be ranked. However, people do this in different ways. Some rank in increasing order (i.e., the smallest number gets rank 1 and the greatest gets rank $n$); others do the opposite, giving the highest rank to the smallest number and rank 1 to the greatest. Which is the most appropriate way to do this?

Best Answer

Fake data simulated in R for purposes of demonstration.

set.seed(2020)
x = rnorm(15, 100, 15)
round(x,2)
 [1] 105.65 104.52  83.53  83.04  58.05 110.81 114.09  96.56
 [9] 126.39 101.76  87.20 113.64 117.95  94.43  98.15
y = .001*x^4 + rnorm(15, 0, 4)
round(y)
 [1] 124617 119365  48669  47550  11357 150771 169415  86933
 [9] 255158 107233  57828 166771 193524  79500  92807

A scatterplot shows positive, but not entirely linear, association.

plot(x,y, pch=20)


Notice that the Pearson and Spearman correlations differ. Roughly speaking, Pearson correlation measures the linear component of the association. The Pearson correlation $r = 0.948$ shows a substantial, but not perfect, linear association.

By contrast, Spearman correlation measures monotone association. Here each increase in $x$ is accompanied by an increase in $y,$ which leads to a Spearman correlation $r_S = 1.$

cor(x,y, method="pearson")
[1] 0.9481193
cor(x,y, method="spearman")
[1] 1
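As a quick check of the claim that $y$ rises with every increase in $x,$ here is a minimal sketch that sorts the points by $x$ and looks at the successive differences in $y.$ It uses the rounded values printed above rather than the simulated ones, and the names `y_by_x` and `is_increasing` are only my own labels.

```r
# Rounded values copied from the printed output above
x <- c(105.65, 104.52, 83.53, 83.04, 58.05, 110.81, 114.09, 96.56,
       126.39, 101.76, 87.20, 113.64, 117.95, 94.43, 98.15)
y <- c(124617, 119365, 48669, 47550, 11357, 150771, 169415, 86933,
       255158, 107233, 57828, 166771, 193524, 79500, 92807)

# Reorder y so that it follows x from smallest to largest
y_by_x <- y[order(x)]

# TRUE when every step up in x comes with a step up in y
is_increasing <- all(diff(y_by_x) > 0)
print(is_increasing)   # TRUE
```

A strictly increasing relationship like this one is exactly the situation in which the Spearman correlation equals 1, even though the points do not lie on a straight line.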

As you say, the Spearman correlation is based on ranks. Notice that the $x$'s and the $y$'s have ranks that match exactly. This is another way of saying that each increase in $x$ is accompanied by an increase in $y.$

Notice that rank 1 for the $x$'s corresponds to the minimum $x$-value 58.05, and rank 1 for the $y$'s corresponds to the minimum $y$-value 11,357. Similarly, rank 15 corresponds to the maximum of each variable.

rank(x)
 [1] 10  9  3  2  1 11 13  6 15  8  4 12 14  5  7
rank(y)
 [1] 10  9  3  2  1 11 13  6 15  8  4 12 14  5  7
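The two conventions mentioned in the question can both be produced with `rank`. The sketch below, again using the rounded $x$-values printed above, assumes there are no ties; `rank_asc` and `rank_desc` are names I made up for illustration.

```r
# Rounded x-values copied from the printed output above
x <- c(105.65, 104.52, 83.53, 83.04, 58.05, 110.81, 114.09, 96.56,
       126.39, 101.76, 87.20, 113.64, 117.95, 94.43, 98.15)
n <- length(x)

rank_asc  <- rank(x)    # rank 1 goes to the smallest value
rank_desc <- rank(-x)   # rank 1 goes to the largest value

# Without ties, one convention is the mirror image of the other
all(rank_desc == n + 1 - rank_asc)   # TRUE

x[rank_asc == 1]    # 58.05, the minimum
x[rank_desc == 1]   # 126.39, the maximum
```

So the descending ranks carry the same information as the ascending ones, just counted from the other end.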

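The Spearman correlation can be found by taking the Pearson correlation of the ranks. Reversing the ranking direction of both variables leaves this value unchanged, so either convention works as long as it is applied to both variables; reversing only one of them flips the sign.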

cor(rank(x), rank(y), method = "pearson")
[1] 1
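To see how the ranking direction affects the result, here is a small sketch comparing three choices: both variables ranked ascending, both ranked descending, and one of each. It uses the rounded values printed above; the variable names are my own.

```r
# Rounded values copied from the printed output above
x <- c(105.65, 104.52, 83.53, 83.04, 58.05, 110.81, 114.09, 96.56,
       126.39, 101.76, 87.20, 113.64, 117.95, 94.43, 98.15)
y <- c(124617, 119365, 48669, 47550, 11357, 150771, 169415, 86933,
       255158, 107233, 57828, 166771, 193524, 79500, 92807)

same_asc  <- cor(rank(x),  rank(y))    # both ascending
same_desc <- cor(rank(-x), rank(-y))   # both descending
mixed     <- cor(rank(x),  rank(-y))   # one ascending, one descending

# Expect 1, 1 and -1, up to floating-point error
round(c(same_asc, same_desc, mixed), 6)
```

Either direction gives the same coefficient when it is used for both variables. Mixing the two directions changes the sign, which would misreport a positive association as a negative one.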

The Wikipedia article on Spearman's rank correlation coefficient has some nice examples.
