Solved – the explanation for having a Pearson’s correlation coefficient significantly larger than the Spearman’s rank correlation coefficient

Tags: correlation, intuition, pearson-r, spearman-rho

What is the explanation for obtaining a Pearson's correlation coefficient value that is significantly larger (a factor of ~2) than the Spearman's rank correlation coefficient value (on the same data)?

Doesn't this go against the idea that Spearman's rank correlation coefficient, being the Pearson's correlation coefficient of the ranked data, can be seen as a generalization of Pearson's evaluation to monotonic dependences instead of linear ones? How can the correlation coefficient value for a monotonic dependence be smaller than that for a linear dependence only?

I was surprised to see that this was possible in a dataset with $N \sim 100$ elements. I should add that the p-value associated with the Pearson's correlation coefficient is 0.0, while that of Spearman's rank correlation coefficient is ~0.10.

Possible explanation:

This behaviour might be driven by the extreme values of the dataset. I compare the values of Pearson's c.c. ($\rho$) and Spearman's rank c.c. ($\rho_r$) after removal of these. I present the 2-sided p-values.

  • Full dataset: $\rho$ = 0.381 (p-value: 0.000), $\rho_r$ = 0.151 (p-value: 0.131)

  • One outlier removed: $\rho$ = 0.336 (p-value: 0.001), $\rho_r$ = 0.125 (p-value: 0.213)

  • Three outliers removed: $\rho$ = 0.167 (p-value: 0.100), $\rho_r$ = 0.076 (p-value: 0.459)

The remaining distribution (plotted) does not seem to be affected by outliers, and yet it still exhibits the same behaviour. The full data is available here; note that the outliers correspond to the first three rows.

Distribution after the three outliers are removed (the first three rows in the attached file)

Best Answer

Here is a simple dataset in which the points alternate between two linear functions (figure: the raw data).

The Pearson correlation detects that there is a general upward trend in the combined data (red and black together) and is r = .453. The Spearman correlation sees only the ranks, which are distributed like this (figure: the ranks of the above data).

High and low ranks alternate, so there is no clear trend for Spearman to pick up: Spearman r = .079. The Pearson value is 5.7 times as high, and you can easily increase that factor by extending the sequence. You can even get a negative Spearman together with a positive Pearson simply by leaving out the last value. So nothing stands in the way of a combination of a large Pearson and a small Spearman r, and the picture above is even a bit similar to yours.

You can easily see how I constructed the data by looking at them:

1, -.01, 2, -.02, 3, -.03, 4, -.04, 5, -.05, 6, -.06, 7, -.07, 8, -.08, 9, -.09, 10
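If you want to check these numbers yourself, here is a minimal sketch in plain Python. It assumes the values are paired with their position in the sequence (x = 1, 2, ..., 19), which the list above does not state explicitly. The helper names `pearson`, `ranks`, `spearman` and `alternating` are made up for this illustration and are not from any library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def alternating(n_points):
    """1, -.01, 2, -.02, ... : odd positions rise, even positions fall slowly."""
    ys = []
    for i in range(1, n_points + 1):
        k = (i + 1) // 2
        ys.append(float(k) if i % 2 == 1 else -0.01 * k)
    return ys

# Assumption: x is simply the position in the sequence.
x = list(range(1, 20))
y = alternating(19)

print(round(pearson(x, y), 3))   # about 0.45
print(round(spearman(x, y), 3))  # about 0.08

# Leaving out the last value flips the sign of Spearman but not of Pearson.
print(pearson(x[:-1], y[:-1]) > 0, spearman(x[:-1], y[:-1]) < 0)
```

Because `spearman` is just `pearson` applied to the ranks, the gap between the two values comes entirely from what the rank transformation does to the alternating pattern.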

Hope that helps, Bernhard