Solved – Comparing huge data sets that are non-normal and different lengths in R

nonparametric, r, statistical significance

I'm comparing two data sets. Each set is very large, with about 25,000 numeric observations.
I want to find out whether these sets are significantly different, but the problem is that they are not normally distributed and are of different lengths.
I've tried a variety of tests in R: Mood's median test, the Wilcoxon test, Kruskal-Wallis, but these seem to require either data sets of the same length or a normal distribution. I have not been able to find a test that works for the data I want to compare.
Do you know of any statistical tests that can be used to compare my non-normally distributed, different-length data sets?

Best Answer

You may be looking for the two-sample Kolmogorov-Smirnov test, which is based on the maximum distance between the two samples' empirical cumulative distribution functions. As such, it can be used for samples of different sizes. In R, look at ?ks.test.
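As a minimal sketch of what that call looks like, the snippet below runs ks.test on two samples of unequal length. The vectors x and y are simulated stand-ins (an assumption, since the original data are not shown); replace them with your own numeric vectors.

```r
# Simulated stand-ins for the two data sets (hypothetical, not the asker's data).
set.seed(1)
x <- rexp(25000, rate = 1.00)   # skewed, clearly non-normal
y <- rexp(24000, rate = 1.05)   # different length, slightly different rate

# Two-sample Kolmogorov-Smirnov test; the samples need not have equal length.
res <- ks.test(x, y)

res$statistic   # D: largest vertical gap between the two empirical CDFs
res$p.value     # p-value for the null hypothesis of identical distributions
```

The statistic D is on the probability scale (between 0 and 1), so it is worth reporting alongside the p-value.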

However, with data sets this large, even small deviations between the CDFs will be detected as statistically significant. Whether the differences are practically (e.g., clinically) significant cannot be assessed by a statistical test; look at quantiles, density plots, histograms and so forth for this.
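One way to look at the size of the difference rather than its p-value is sketched below. It compares a handful of quantiles side by side and overlays density estimates; the data are again simulated placeholders, and the choice of quantile levels in probs is arbitrary.

```r
# Simulated stand-ins for the two data sets (hypothetical).
set.seed(2)
x <- rexp(25000, rate = 1.00)
y <- rexp(24000, rate = 1.05)

# Side-by-side quantiles, and their differences in the data's own units.
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
q <- rbind(x = quantile(x, probs), y = quantile(y, probs))
q
q["x", ] - q["y", ]

# Overlaid kernel density estimates.
plot(density(x), main = "Density estimates", xlab = "value")
lines(density(y), lty = 2)
legend("topright", legend = c("x", "y"), lty = c(1, 2))

# Quantile-quantile plot of one sample against the other;
# points near the line y = x indicate similar distributions.
qqplot(x, y)
abline(0, 1)
```

Whether a gap of a given size in these quantiles matters is a subject-matter judgment, not something the plots decide for you.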

Also, if you have a more specific question, such as whether the means differ, or the variances (with or without assuming equal means), more specialized tests are likely available, or you could run a nonparametric test such as a permutation test.
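For the "do the means differ" case, a permutation test can be sketched in a few lines of base R. The function name perm_test_mean and the number of permutations are illustrative choices, not part of any package, and the data are simulated placeholders.

```r
# Simulated stand-ins for the two data sets (hypothetical).
set.seed(3)
x <- rexp(25000, rate = 1.00)
y <- rexp(24000, rate = 1.05)

# Two-sided permutation test for a difference in means.
# perm_test_mean is a hypothetical helper written for this illustration.
perm_test_mean <- function(x, y, n_perm = 999) {
  observed <- mean(x) - mean(y)
  pooled <- c(x, y)
  n_x <- length(x)
  perm_diffs <- replicate(n_perm, {
    # Randomly reassign group labels, keeping the original group sizes.
    idx <- sample.int(length(pooled), n_x)
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # The +1 in numerator and denominator counts the observed labelling itself,
  # so the p-value can never be exactly zero.
  p_value <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
  list(observed = observed, p_value = p_value)
}

out <- perm_test_mean(x, y)
out$observed   # observed difference in means
out$p_value    # permutation p-value
```

The same skeleton works for other statistics (medians, variance ratios) by swapping out the two calls to mean, and unequal group sizes pose no problem because the relabelling preserves them.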