Solved – Is Spearman’s correlation coefficient usable to compare distributions

distributions, paired-data, spearman-rho

I have distributions from two different data sets and I would like to
measure how similar the distributions (in terms of their bin
frequencies) are. In other words, I am not interested in the correlation of
the data point sequences but rather in the similarity of their distributional properties. Currently I can only judge the similarity by eyeballing, which is not enough. I don't want to assume causality and I don't want to make predictions at this point, so I assume that correlation is the way to go.

Spearman's correlation coefficient is used to compare non-normal data, and since I don't know anything about the real underlying distribution of my data, I think it would be a safe bet. I wonder if this measure can also be used to
compare distributional data (the bin frequencies) rather than the data points that are
summarized in a distribution. Here is example code in R that shows
what I would like to check:

# draw three samples: two standard normal, one uniform
aNorm <- rnorm(1000000)
bNorm <- rnorm(1000000)
cUni <- runif(1000000)

# bin each sample with hist() and inspect the bin counts
ha <- hist(aNorm)
hb <- hist(bNorm)
hc <- hist(cUni)
print(ha$counts)
print(hb$counts)
print(hc$counts)

# relatively similar
n <- min(c(NROW(ha$counts), NROW(hb$counts)))
cor.test(ha$counts[1:n], hb$counts[1:n], method = "spearman")

# quite different
n <- min(c(NROW(ha$counts), NROW(hc$counts)))
cor.test(ha$counts[1:n], hc$counts[1:n], method = "spearman")

Does this make sense or am I violating some assumptions of the coefficient?

Thanks,
R.

Best Answer

For comparing the bin frequencies of two distributions, a pretty good choice is the chi-square test; that is exactly what it is designed for. It is also nonparametric, so the distributions don't have to be normal or symmetric. It is much better than the Kolmogorov-Smirnov test, which is known to be weak in the tails of the distribution, where fitting or diagnosing is often most important.
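
A minimal sketch of that chi-square comparison in R, reusing the samples from the question. The common set of breaks (and the dropping of empty bins) is my own addition so that the counts line up bin by bin:

set.seed(1)
aNorm <- rnorm(1000000)
bNorm <- rnorm(1000000)
cUni <- runif(1000000)

# bin all three samples on the same breaks so counts are comparable
brks <- pretty(c(aNorm, bNorm, cUni), n = 20)
ha <- hist(aNorm, breaks = brks, plot = FALSE)
hb <- hist(bNorm, breaks = brks, plot = FALSE)
hc <- hist(cUni, breaks = brks, plot = FALSE)

# chi-square test of homogeneity: rows = samples, columns = bins;
# bins empty in both samples are dropped to avoid zero expected counts
keep <- ha$counts + hb$counts > 0
chisq.test(rbind(ha$counts[keep], hb$counts[keep]))   # similar -> large p-value

keep <- ha$counts + hc$counts > 0
chisq.test(rbind(ha$counts[keep], hc$counts[keep]))   # different -> tiny p-value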

Spearman's correlation won't be as precise at capturing the similarity of your actual bin frequencies; it will only tell you that the overall rankings of the bins in the two distributions are similar. In contrast, when you calculate the chi-square test by hand (so to speak), you can readily see which bin-frequency differences are most responsible for driving down the overall p-value of the test.
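
Roughly, that "by hand" inspection could look like the following, using the ha and hc histograms from the sketch above (my own illustration, not part of the original answer): the squared Pearson residuals give each bin's contribution to the chi-square statistic.

keep <- ha$counts + hc$counts > 0
tab <- rbind(ha$counts[keep], hc$counts[keep])
res <- chisq.test(tab)
res$statistic                          # overall chi-square statistic
contrib <- colSums(res$residuals^2)    # per-bin contribution: (obs - exp)^2 / exp, summed over both samples
round(contrib, 1)                      # the largest values mark the bins that differ most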

Another pretty good test is the Anderson-Darling test; it is one of the best tests for diagnosing the fit between two distributions. However, in terms of information about specific bin frequencies, I suspect the chi-square test tells you more.
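
For completeness, a hedged sketch of those alternatives: base R ships the Kolmogorov-Smirnov test as ks.test; a two-sample Anderson-Darling test is not in base R, and the kSamples package used below is an assumption on my part.

# KS test on the raw samples (not the bin counts)
ks.test(aNorm, cUni)

# two-sample Anderson-Darling test, assuming the kSamples add-on package is installed
# install.packages("kSamples")
# kSamples::ad.test(aNorm, cUni)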
