Solved – using the Kolmogorov–Smirnov test on the data

Tags: continuous data, distributions, kolmogorov-smirnov test

I am not good at statistics, so I desperately need your help.

So I have this dataset of distributions, and I want to know whether I can use the KS test on it. The idea is to show that a feature's distribution 1 in case 1 is different from its distribution 2 in case 2.
For example, let's take the feature "size" in two cases: case 1 and case 2.

The distribution 1 (in case 1) looks like this:

[0,0,0,0,0,0,132,33,1200,0,0,98,208,56,0,0,0,....]

The distribution 2 (in case 2) looks like this:

[52215,2132,933,11200,0,0,13245,4208,309,0,34000,0,....]

and so on,

Each number represents the total size in one second. The null hypothesis is that distribution 1 and distribution 2 are identical, and the point is to reject it by obtaining a p-value of less than 1% (that's what I understood; please correct me if I am wrong).

I read that the KS test is applied to continuous distributions. Is the one I have continuous? How do you know if your distribution is continuous?

If I can't apply the KS test, what else can I apply? (I work with Python.)

Best Answer

If "size", or any other feature, is not constrained to a countable set of values (i.e., a set whose elements you can enumerate like a list, although the list might be infinitely long), then you have a continuous random variable.

You can certainly apply the K-S test here for generalized differences. It's easy for two samples, but if you are trying to test for a general difference among 3+ distributions, you should use a computational approach:

  1. Your null hypothesis is $H_0: F_1=F_2=\dots=F_N$ for distributions $F_i$ pertaining to your particular feature for each of the $i=1,\dots,N$ cases, each of size $n$.
  2. Under the null, we can assume the data from each case came from the same population or distribution; therefore, you will combine your data into a single "bucket" for purposes of testing your hypothesis. Thus you will have one "null population" of size $N\times n$.
  3. Here's the computational part:
     (a) Randomly sample with replacement from the "null population" to create $N$ groups of size $n$. Let's call this new set of data a replication, designated $R_1$.
     (b) Find the largest vertical difference between the groups' empirical CDFs (similar to what you do with the 2-sample KS test, but now, for each value of your feature, you determine the maximum and the minimum value over the group of ECDFs, then take the maximum of the difference of these two values across all values of "size"). Let's call this value $K_1$.
     (c) Repeat this 1000 or so times using a computer (more if possible) to get 1000 $K$ values.
     (d) Now calculate the $K$ value of your actual results (i.e., the actual $N$ groups you observed); call this $K_a$.
  4. Determine the number of $K$ values such that $K\geq K_a$ and divide by the total number of $K$ values. This is the $p$-value of your test. Treat it as any other $p$-value.
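The procedure above can be sketched in Python with numpy. This is a minimal illustration, not a polished implementation: the function names are my own, and I've chosen to evaluate the ECDFs at the pooled observed values (any sufficiently fine grid would do):

```python
import numpy as np

def k_statistic(groups, grid):
    """Largest vertical gap between the groups' empirical CDFs,
    evaluated over a common grid of feature values (step b)."""
    # ECDF of each group at every grid point: P(X <= x) estimated by counts
    ecdfs = np.array([np.searchsorted(np.sort(g), grid, side="right") / len(g)
                      for g in groups])
    # For each grid value, the spread between the highest and lowest ECDF;
    # the statistic is the maximum spread over the whole grid.
    return np.max(ecdfs.max(axis=0) - ecdfs.min(axis=0))

def bootstrap_ks_test(groups, n_reps=1000, seed=0):
    """Resampling test of H0: all groups come from one distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate(groups)        # the "null population" (step 2)
    sizes = [len(g) for g in groups]
    grid = np.unique(pooled)               # evaluate ECDFs at observed values
    k_obs = k_statistic(groups, grid)      # K_a from the actual data (step d)
    k_null = np.empty(n_reps)
    for i in range(n_reps):                # steps (a)-(c): build replications
        resampled = [rng.choice(pooled, size=n, replace=True) for n in sizes]
        k_null[i] = k_statistic(resampled, grid)
    # p-value: fraction of replications at least as extreme as K_a (step 4)
    return np.mean(k_null >= k_obs)
```

With only two groups this should behave much like the standard two-sample KS test, and it extends unchanged to any number of groups, even of unequal sizes.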

As you can see, this is a relatively straightforward test (formally called a resampling or bootstrap hypothesis test) that substitutes intensive computation for complex (often intractable) mathematics. It is also flexible, since you can easily adapt it to varying numbers of samples, varying sample sizes, etc.
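For the simple two-sample case, `scipy.stats.ks_2samp` does the work for you. A minimal sketch using the (truncated) numbers from the question purely as illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

# Per-second "size" samples like the ones in the question (illustrative only)
case1 = np.array([0, 0, 0, 0, 0, 0, 132, 33, 1200, 0, 0, 98, 208, 56, 0, 0, 0])
case2 = np.array([52215, 2132, 933, 11200, 0, 0, 13245, 4208, 309, 0, 34000, 0])

# D is the largest vertical gap between the two empirical CDFs
stat, p_value = ks_2samp(case1, case2)
print(f"D = {stat:.3f}, p = {p_value:.4f}")
# Compare p_value to the 1% threshold from the question before rejecting H0.
```

Note that `ks_2samp` treats ties (all those zeros) as a discrete feature of the data; with many ties the test becomes conservative, which is one more reason the resampling approach above is attractive.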