Solved – ks test for discrete distributions

kolmogorov-smirnov test, p-value, python

I have two samples of data of very different lengths; think of the train and test sets for a machine learning model. The data is binary, and I'd like to know whether the two samples have comparable distributions of values. The data is also very imbalanced (~5% of 1s).

As requested in a comment, here is a summary of the data:

  • sample 1: 950 0s and 50 1s
  • sample 2: 90 0s and 10 1s

So the question the resampling / bootstrapping is trying to answer is: if I resample from sample 1, how probable is a distribution like the one observed in sample 2?

As the KS test can only be applied to continuous distributions, I figured I could bootstrap the data and take their mean (share of 1s) and then compare the distribution of these means with a KS test.

Is this a reasonable approach, and is there any literature on how to tackle this? How can I decide how many samples to use and how often to resample?

The background is that I'd like to implement this in Python and automate the test. My current approach:

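import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

def bootstrap_ks(x1, x2, col):
    # 1000 bootstrap resamples of size 100 per sample; each column mean is a share of 1s
    xv1 = pd.DataFrame(np.random.choice(x1, size=[100, 1000])).mean()
    xv2 = pd.DataFrame(np.random.choice(x2, size=[100, 1000])).mean()
    # Overlay histograms of the two bootstrap distributions of the mean
    xv1.plot.hist(alpha=0.2, bins=50)
    xv2.plot.hist(alpha=0.2, bins=50)
    plt.title(col)
    plt.show()
    return stats.ks_2samp(xv1, xv2)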

Best Answer

Distributional tests on two samples of 0s and 1s should not be done with a KS test; they are easily done with Fisher's exact test or the chi-squared test of independence. I am not familiar enough with Python, but basically, if you sample 80:20 from one group and 90:10 from the other, the counts can be displayed in a contingency table:

         `0` `1`
Group 1  80  20
Group 2  90  10

Such tables are usually tested with one of the tests mentioned above.
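In Python, both tests are available in SciPy. The following is a minimal sketch rather than a definitive recipe: it assumes the counts from the question's summary (950/50 and 90/10) instead of the 80:20 and 90:10 illustration, and the variable names are chosen here for illustration only.

```python
import numpy as np
from scipy import stats

# Contingency table built from the question's summary (an assumption of this sketch):
# rows are the two samples, columns are the counts of 0s and 1s.
table = np.array([[950, 50],
                  [90, 10]])

# Fisher's exact test (two-sided by default); also returns the sample odds ratio.
odds_ratio, p_fisher = stats.fisher_exact(table)

# Chi-squared test of independence; SciPy applies Yates' continuity
# correction by default when the table is 2x2.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"Fisher exact p-value: {p_fisher:.4f}")
print(f"chi-squared p-value:  {p_chi2:.4f} (dof={dof})")
print("expected counts under independence:")
print(expected)
```

With expected counts this small in one cell, Fisher's exact test is usually the more cautious choice of the two.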

Note that these tests answer whether the difference is significant, i.e. whether there is "enough evidence" of a difference. What counts as "enough evidence" depends heavily on the sample sizes you take.

You have not explained why you want to investigate this, but you might want to consider plotting the distribution of odds ratios or something else more meaningful than $p$-values.
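One way to obtain such a distribution of odds ratios is to bootstrap both samples and recompute the odds ratio each time. The sketch below is only an illustration under stated assumptions: the samples are rebuilt from the question's counts, the helper name `bootstrap_odds_ratios` is hypothetical, and 0.5 is added to every cell (the Haldane-Anscombe correction) so that a resample without any 1s does not cause a division by zero.

```python
import numpy as np

# Rebuild the binary samples from the question's summary (assumption of this sketch).
sample1 = np.repeat([0, 1], [950, 50])
sample2 = np.repeat([0, 1], [90, 10])

def bootstrap_odds_ratios(x1, x2, n_resamples=2000, seed=0):
    """Odds of a 1 in x2 relative to x1, recomputed on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n1, n2 = len(x1), len(x2)
    # Resample each sample with replacement at its original size and count the 1s.
    k1 = rng.choice(x1, size=(n_resamples, n1)).sum(axis=1)
    k2 = rng.choice(x2, size=(n_resamples, n2)).sum(axis=1)
    # Adding 0.5 to every cell avoids zero cells.
    odds1 = (k1 + 0.5) / (n1 - k1 + 0.5)
    odds2 = (k2 + 0.5) / (n2 - k2 + 0.5)
    return odds2 / odds1

ors = bootstrap_odds_ratios(sample1, sample2)
lo, med, hi = np.percentile(ors, [2.5, 50, 97.5])
print(f"bootstrap odds ratio: median {med:.2f}, 95% percentile interval [{lo:.2f}, {hi:.2f}]")
```

The array `ors` can be passed to a histogram (for example `matplotlib.pyplot.hist`) to plot the distribution; an interval that comfortably includes 1 suggests the two samples are compatible.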

How can I decide how many samples to use and how often to resample?

The number $n$ of resamples is high enough if repeated runs with $n$ resamples each deliver return values that are close enough to each other. Choose a number of resamples $n$, e.g. $n = 500$, repeat the whole procedure 10 times with $n$ resamples each, and judge from the range of the return values whether that is precise enough for you. If not, or if in doubt, increase $n$ substantially and start over.
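This stability check can be sketched as follows. The "return value" used here is an assumption made for illustration: the share of resamples of size 100 drawn from sample 1 that contain at least as many 1s as sample 2 (10), which is the quantity the question asks about. The function names are hypothetical.

```python
import numpy as np

# Sample 1 rebuilt from the question's summary; sample 2 has 100 values with 10 ones.
sample1 = np.repeat([0, 1], [950, 50])
m, k_obs = 100, 10

def tail_share(x1, m, k_obs, n_resamples, rng):
    """Share of size-m resamples from x1 with at least k_obs ones."""
    draws = rng.choice(x1, size=(n_resamples, m))
    return np.mean(draws.sum(axis=1) >= k_obs)

def repeat_runs(n_resamples, n_runs=10, seed=0):
    """Run the bootstrap n_runs times to see how much the result varies."""
    rng = np.random.default_rng(seed)
    return np.array([tail_share(sample1, m, k_obs, n_resamples, rng)
                     for _ in range(n_runs)])

small = repeat_runs(500)      # n = 500 resamples per run
large = repeat_runs(20_000)   # n increased substantially

print(f"n=500:   min {small.min():.4f}, max {small.max():.4f}")
print(f"n=20000: min {large.min():.4f}, max {large.max():.4f}")
```

If the range between the minimum and maximum is too wide for the decision at hand, increase the number of resamples and repeat.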