Solved – Similarity between two sets of random values

Tags: distributions, kolmogorov-smirnov test, p-value

I have two sets of random floats in $[0, 1]$, each with a different number of elements:

$$A = \{0.3637852, 0.2330702, 0.1683102, 0.2127219, 0.0152532, \dots\}$$
$$B = \{0.4541056, 0.7521812, 0.0266602, 0.5099002, 0.3468181, \dots\}$$

where $A$ has $N_A$ elements, $B$ has $N_B$ elements, and $N_B > N_A$.

I need to assess the similarity between these two sets while disregarding the difference in the number of elements (i.e., the fact that $N_B > N_A$ should not play a role in the similarity assessment), since I'm only interested in how the values are scattered over $[0, 1]$, not how many of them there are in each set.

So far I've applied the 1D kde.test available in Duong's R package 'ks' (see page 27), which returns a p-value, and I've also applied the Python function scipy.stats.ks_2samp, which computes the 1D two-sample Kolmogorov-Smirnov statistic and returns a "KS statistic" and a "two-tailed p-value".
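
For reference, here is a minimal sketch of the R side of this, assuming the returned object exposes its p-value as $pvalue (as I believe it does); the runif samples are placeholders to make it runnable and should be replaced by your own vectors:

library(ks)                         # Duong's 'ks' package
A <- runif(50)                      # placeholder for your N_A values
B <- runif(80)                      # placeholder for your N_B values
result <- kde.test(x1 = A, x2 = B)  # kernel-density two-sample test
result$pvalue                       # p-value under H0: same density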

My questions are:

1- Is one of these statistics (KDE's p-value, KS statistic or the two-tailed p-value) recommended for my needs? If so, why?

2- What is the difference between the "KS statistic" and a "two-tailed p-value"?

3- Will the difference in the number of elements in each set affect the outcome of these statistics? If so, how can I avoid that? Is it even possible?

Best Answer

1- Is one of these statistics (KDE's p-value, KS statistic or the two-tailed p-value) recommended for my needs? If so, why?

As expressed, your needs aren't defined clearly enough to differentiate between the two. Both test for a difference in distribution.

2- What is the difference between the "KS statistic" and a "two-tailed p-value"?

The two sample Kolmogorov-Smirnov statistic is the largest difference in ECDFs for the two samples:

[Figure: empirical CDFs of samples A (red) and B (blue); the largest vertical gap between the two curves is the KS statistic.]

(The data here are the same data I generated for your other question; the A sample is red and the B sample is blue.)

The difference in ECDF heights at $x = 35$ is $1/6 \approx 0.1667$ (indeed, anywhere in $[34.50717, 35.32252)$), the same as the value produced by calculating the statistic:

> ks.test(A,B)

        Two-sample Kolmogorov-Smirnov test

data:  A and B 
D = 0.1667, p-value = 0.6228
alternative hypothesis: two-sided 
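
To make that definition concrete, here's a minimal sketch of recovering D by hand (assuming A and B are the vectors printed at the end of this answer): evaluate both ECDFs at every observed point and take the largest absolute difference.

xs <- sort(c(A, B))              # all observed points
FA <- ecdf(A)                    # empirical CDF of sample A
FB <- ecdf(B)                    # empirical CDF of sample B
max(abs(FA(xs) - FB(xs)))        # largest vertical gap; matches D = 0.1667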

The meaning of the p-value is as for any hypothesis test: the probability of obtaining a statistic at least as unusual (in this case, at least as large) as the one observed, if the null hypothesis were true.
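
One way to see that concretely is a permutation sketch (my own illustration, not part of ks.test): under the null the pooled values are exchangeable, so shuffling the labels and recomputing D approximates the null distribution of the statistic.

nA <- length(A)
perm_D <- replicate(10000, {
  pooled <- sample(c(A, B))                        # shuffle under H0
  ks.test(pooled[1:nA], pooled[-(1:nA)])$statistic # recompute D
})
mean(perm_D >= 0.1667)   # fraction at least as large; near the p-value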

3- Will the difference in the number of elements in each set affect the outcome of these statistics?

No; the KS test and (to my understanding) the KDE-based test both handle different sample sizes, as illustrated below.
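
As a quick illustration with hypothetical samples (nothing beyond base R assumed), ks.test happily accepts vectors of unequal length:

set.seed(1)          # any seed; for reproducibility only
A2 <- runif(50)      # 50 values in [0, 1]
B2 <- runif(30)      # 30 values in [0, 1]
ks.test(A2, B2)      # runs fine despite 50 vs. 30 elements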

Here are the approximate data values, in case anyone needs them:

> print(A,d=3)
 [1] 41.34 25.92 55.30 50.06 75.67  3.03 61.81 34.51 34.33  9.62 94.95 24.73
[13] 30.41 11.77 25.13 90.75 12.62 36.14 56.91 29.76 15.34 62.58 33.03 36.44
[25] 47.90 66.01 42.49 18.21 31.58 58.30 17.63 70.81 73.86 46.63 10.24 12.02
[37] 47.14 15.56 80.27 12.76 33.61 52.08 41.64 13.19 32.96 64.21 81.15 32.37
[49] 33.79 40.43
> print(B,d=3)
 [1] 39.43 57.93 72.91 12.81  3.76 39.02 56.02 40.28 30.25 75.31  2.46 81.44
[13] 11.74  9.32 60.85 75.39 44.58 62.05 53.33 63.63 29.90 31.41 59.82 50.37
[25] 41.17 49.49 20.34 35.32 33.82 35.47
>

All that said, I recommend you consider whuber's words most carefully. There's a lot of good advice packed into very few words.
