Determining whether two samples are from the same distribution

data analysis, hypothesis testing, probability, statistics

The context for my question is that we have two sets of data that come from the same family of distributions: both normal, both exponential, and so on.
If they were both normally distributed, a difference-of-means test would let us decide, at a chosen confidence level, whether their means differ, and so give some indication of whether the two samples actually come from the same distribution.

I am interested in the extent to which we can do this for other (continuous) random variables. For instance, if both samples appear exponentially distributed, what tests or analyses can be used to determine whether they are drawn from the same underlying distribution?

The best approach I've come across is to use maximum likelihood estimation (MLE). That is, if we know the MLE for our family of distributions, we compute it for each sample. If the estimated parameters for the two samples are $P_1$ and $P_2$, then $P_1 \neq P_2$ gives some indication that the samples come from different populations. This seems deficient, of course: estimates from two finite samples will almost never be exactly equal, even when the populations are identical. In the normal case, for example, we would compute the mean and standard deviation of each sample, but we can't conclude just by eyeballing unequal values that the underlying populations (probably) differ. That takes a more careful test, such as a difference-of-means test. So this approach seems insufficient on its own.
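To make the concern about comparing MLEs concrete for the exponential case, here is a minimal Python sketch. It assumes both samples really are exponential, where the MLE of the rate is one over the sample mean. The function names (`exp_rate_mle`, `lr_test_equal_rates`) and the simulated data are hypothetical, and the likelihood ratio test shown is just one standard way to ask whether two MLEs differ by more than sampling noise; its p-value relies on the large-sample chi-squared approximation with one degree of freedom.

```python
import math
import random


def exp_rate_mle(sample):
    """MLE of the exponential rate parameter: 1 / sample mean."""
    return len(sample) / sum(sample)


def lr_test_equal_rates(a, b):
    """Likelihood ratio test of H0: both samples share one exponential rate.

    Returns (statistic, approximate p-value). The p-value uses the
    asymptotic chi-squared(1) distribution of the statistic under H0.
    """
    n1, n2 = len(a), len(b)
    r1, r2 = exp_rate_mle(a), exp_rate_mle(b)
    r_pooled = (n1 + n2) / (sum(a) + sum(b))
    # The maximized exponential log-likelihood is n*log(rate_hat) - n,
    # so the "- n" terms cancel in the difference below.
    stat = 2.0 * (n1 * math.log(r1) + n2 * math.log(r2)
                  - (n1 + n2) * math.log(r_pooled))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # For chi-squared with 1 degree of freedom: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical data: two samples of 300 drawn from the SAME population.
random.seed(0)
sample_a = [random.expovariate(1.0) for _ in range(300)]
sample_b = [random.expovariate(1.0) for _ in range(300)]

# The two estimates differ even though the population is identical.
print(exp_rate_mle(sample_a), exp_rate_mle(sample_b))
print(lr_test_equal_rates(sample_a, sample_b))
```

The two printed rate estimates will not be equal even though both samples share one population, which is why $P_1 \neq P_2$ by itself says little; the test statistic measures whether the gap is larger than chance would explain.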

As an addendum, I'll mention that the problem I'm trying to solve involves data on males' and females' performance on a test, about 300 of each. Their scores are definitely not normally distributed; they look exponential or otherwise skewed. I'm interested in determining whether males' and females' performance is genuinely different, i.e. whether the male scores and the female scores appear to be samples from different populations. Phrased another way: if I were a betting man and wanted to guess how well a person did on the test, would your telling me their gender improve my odds? Right now I can say that the mean female score is higher than the mean male score, but that is not enough. I'd like something more rigorous, like the tests from basic statistics, to say e.g. that at the 95% confidence level they have different means.

Best Answer

One possibility is to conduct a two-sample Kolmogorov-Smirnov test. The advantage of this test is that it is fully nonparametric, i.e. it doesn't require you to specify a parametric model for your data. The test statistic is essentially the largest (supremal) gap between the empirical CDFs of your two univariate datasets.
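As a rough illustration of that statistic, $D = \sup_x |F_1(x) - F_2(x)|$ with $F_1, F_2$ the two empirical CDFs, here is a minimal pure-Python sketch. The function names and the simulated scores are hypothetical, and the p-value uses the large-sample Kolmogorov approximation, which should be reasonable at around 300 observations per group but is not exact. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import bisect
import math
import random


def ks_statistic(xs, ys):
    """Largest absolute gap between the empirical CDFs of xs and ys."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both ECDFs are step functions that only jump at observed values,
    # so it is enough to evaluate the gap at every data point.
    for v in xs + ys:
        f1 = bisect.bisect_right(xs, v) / n  # fraction of xs <= v
        f2 = bisect.bisect_right(ys, v) / m  # fraction of ys <= v
        d = max(d, abs(f1 - f2))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Approximate two-sided p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series converges slowly near zero, where p is effectively 1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical skewed scores for two groups of 300 (simulated, not real data).
random.seed(1)
group_a = [random.expovariate(1.0) for _ in range(300)]
group_b = [random.expovariate(0.5) for _ in range(300)]

d = ks_statistic(group_a, group_b)
print(d, ks_pvalue_asymptotic(d, len(group_a), len(group_b)))
```

A small p-value suggests the two groups are not drawn from the same distribution. Note that the test is sensitive to any difference in distribution (location, spread, or shape), not only to a difference in means.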
