I am not good at statistics, so I desperately need your help.
So I have this dataset of distributions, and I want to know if I can use the KS-test on it. The idea is to show that a feature's distribution in case 1 (distribution 1) is different from its distribution in case 2 (distribution 2).
For example, let's take the feature "size" in two cases: case 1 and case 2.
The distribution 1 (in case 1) looks like this:
[0,0,0,0,0,0,132,33,1200,0,0,98,208,56,0,0,0,....]
The distribution 2 (in case 2) looks like this:
[52215,2132,933,11200,0,0,13245,4208,309,0,34000,0,....]
and so on,
Each number represents the total size in one second. The null hypothesis is that distribution 1 and distribution 2 are identical, so the point is to reject it by getting a p-value below 1% (that's what I understood; please correct me if I am wrong).
I read that the KS-test applies to continuous distributions. Is the one I have continuous? How do I know whether a distribution is continuous?
If I can't apply the KS-test, what else can I apply? Note that I work with Python.
Best Answer
A random variable is continuous if it is not constrained to a countable set of values, i.e., a set whose members you could enumerate like a list (although the list might be infinitely long). If "size", or any other feature, is not constrained in that way, then you have a continuous rv.
You can certainly apply the K-S test here to test for general differences. It's easy for two samples, but if you are trying to test for a general difference among 3+ distributions, you should use a computational approach.
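One way such a computational approach might look is sketched below. This is a minimal illustration, not a definitive implementation: it uses a permutation-style resampling scheme (shuffling the pooled data without replacement), and the test statistic, the largest pairwise K-S distance, is an assumption chosen for the example. The function names `ks_distance`, `max_pairwise_ks`, and `permutation_test` are hypothetical, and the sample data are made up.

```python
import random
from bisect import bisect_right
from itertools import combinations


def ks_distance(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    # The gap can only change at observed values, so checking those suffices.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def max_pairwise_ks(samples):
    """Assumed multi-sample statistic: largest K-S distance over all pairs."""
    return max(ks_distance(a, b) for a, b in combinations(samples, 2))


def permutation_test(samples, n_resamples=1000, seed=0):
    """Estimate a p-value for H0: all samples come from one distribution."""
    rng = random.Random(seed)
    observed = max_pairwise_ks(samples)
    pooled = [x for s in samples for x in s]
    sizes = [len(s) for s in samples]
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        # Re-split the shuffled pool into groups of the original sizes.
        groups, start = [], 0
        for n in sizes:
            groups.append(pooled[start:start + n])
            start += n
        if max_pairwise_ks(groups) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


# Made-up per-second sizes for three cases (illustration only).
data_rng = random.Random(1)
case1 = [data_rng.choice([0, 0, 0, 50, 200]) for _ in range(60)]
case2 = [data_rng.choice([0, 500, 5000, 20000]) for _ in range(60)]
case3 = [data_rng.choice([0, 0, 0, 50, 200]) for _ in range(60)]

stat, p_value = permutation_test([case1, case2, case3])
print(stat, p_value)
```

Because the null distribution is built from the data themselves, the same function works unchanged for any number of samples and for unequal sample sizes.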
This is a relatively straightforward test (formally called a resampling or bootstrap hypothesis test) that substitutes intensive calculation for complex (often intractable) mathematics. It is also flexible, since you can easily adapt it to varying numbers of samples, varying sample sizes, etc.
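For the simpler two-sample case, a short sketch in Python could look like the following, assuming SciPy is installed. It uses `scipy.stats.ks_2samp` on only the visible part of the two arrays from the question (the elided remainder is not filled in), and `alpha = 0.01` reflects the 1% threshold the asker mentions. Note that the classical K-S p-value assumes a continuous distribution without ties, so with many repeated zeros it is best treated as approximate.

```python
from scipy.stats import ks_2samp

# Visible prefixes of the arrays from the question; the rest is elided there.
case1 = [0, 0, 0, 0, 0, 0, 132, 33, 1200, 0, 0, 98, 208, 56, 0, 0, 0]
case2 = [52215, 2132, 933, 11200, 0, 0, 13245, 4208, 309, 0, 34000, 0]

# H0: both samples are drawn from the same distribution.
result = ks_2samp(case1, case2)

alpha = 0.01  # significance level taken from the question
reject_null = result.pvalue < alpha
print(result.statistic, result.pvalue, reject_null)
```

Rejecting the null here only says the two distributions differ somewhere; it does not say whether the difference is in location, spread, or shape.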