Probability – How to Statistically Compare Similarity Between Two Sets

computational-statistics, correlation, distributions, probability

This is a very open-ended question. Suppose I have two sets of data samples of the same form, say [item, rating]. Rating is a value on the interval [0,100] and item is a unique identifier given to a particular item. I would like to compare these two sets of data samples and determine whether the null hypothesis holds.

One caveat, though: I can't look at the rating distribution. This is because I have literally thousands of groups that I would like to compare, and it would be too time-consuming to determine the rating distribution (normal, bimodal, etc.) of each group. The groups I compare may therefore have different distributions.

The naive approach would be to assume that each distribution is normal and to use something like Student's t-test to compare groups. This is what I have been doing, but I would like something more robust. How might one determine how similar or different two groups are when they may have different, non-normal distributions (and possibly different numbers of elements)?

edit:
The item really doesn't matter. What matters is the ratings for each group.

Best Answer

null hypothesis in this case is that [...] they are different

-- That's not how null hypotheses work. You need something you can calculate the distribution of a test statistic under; generally that's no effect/no difference (whence, "null").

similarity ... whether or not the two groups were sampled from the same population

Your definition of 'similarity' ("from the same population") is a suitable null, fortunately.

So if the null is that the population distributions are identical and the alternative is that they differ in some way, you're after a general test for distributional differences -- something that would pick up a difference in location, or spread, or shape.

This would be something like a two-sample Kolmogorov-Smirnov test. There are other possibilities, but that's the most commonly used one. If there are particular kinds of alternatives you especially want power against, there may be a more suitable choice.
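To make the suggestion concrete, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic in plain Python. The function names, the simulated ratings, and the sample sizes are all made up for illustration. The p-value uses the large-sample (asymptotic) Kolmogorov approximation, so it is rough for small groups, and ratings with many ties will tend to make the test conservative. In practice a library routine such as SciPy's `scipy.stats.ks_2samp` is probably the better choice.

```python
import math
import random
from bisect import bisect_right


def ks_2samp_statistic(x, y):
    """Largest vertical gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # The gap can only change at observed values, so check each of them.
    for v in xs + ys:
        fx = bisect_right(xs, v) / n  # fraction of x that is <= v
        fy = bisect_right(ys, v) / m  # fraction of y that is <= v
        d = max(d, abs(fx - fy))
    return d


def ks_2samp_pvalue(d, n, m, terms=100):
    """Asymptotic p-value (large-sample approximation, not exact)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series converges slowly here and the true value is ~1.
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


# Hypothetical ratings: same mean (about 50), very different shape.
random.seed(0)
group_a = [random.gauss(50, 10) for _ in range(300)]
group_b = [random.gauss(random.choice([30, 70]), 5) for _ in range(300)]

d = ks_2samp_statistic(group_a, group_b)
p = ks_2samp_pvalue(d, len(group_a), len(group_b))
print(f"D = {d:.3f}, asymptotic p = {p:.3g}")
```

The two simulated groups have roughly equal means, so a t-test would likely see little difference, while the KS statistic responds to the difference in shape. With thousands of pairwise comparisons, some correction for multiple testing would also be worth considering.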