I'm having trouble understanding how to compare two sets of data by their distributions.
For example, how can I tell whether column X100 has the same distribution as column Y1?
Also, is there a way to compare the distribution of every column against every other column?
I'm a machine learning developer using Python, and this is part of a classification problem I'm working on.
I would appreciate any help. Thanks 🙂
Best Answer
You can compare the distributions of two columns using the two-sample Kolmogorov-Smirnov test, which is included in scipy.stats: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
From the Stack Overflow topic:
Under the null hypothesis the two distributions are identical. If the K-S statistic is small or the p-value is high (greater than the significance level, say 5%), then we cannot reject the hypothesis that the distributions of the two samples are the same. Conversely, we can reject the null hypothesis if the p-value is low.
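As a rough illustration of the above, here is a minimal sketch that runs scipy.stats.ks_2samp on one pair of columns and then on every pair of columns. The DataFrame, its random data, and the column Z are made up for the example (only the names X100 and Y1 come from the question), and it assumes all columns are numeric with no missing values.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical data: X100 and Y1 share a distribution, Z is shifted.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X100": rng.normal(0.0, 1.0, 500),
    "Y1": rng.normal(0.0, 1.0, 500),
    "Z": rng.normal(2.0, 1.0, 500),
})

# Two-sample K-S test for a single pair of columns.
result = ks_2samp(df["X100"], df["Y1"])
print(f"X100 vs Y1: statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}")

# All-vs-all comparison: a symmetric matrix of p-values.
# The diagonal stays at 1.0 (a column compared with itself).
n_cols = df.shape[1]
pvalues = pd.DataFrame(np.ones((n_cols, n_cols)), index=df.columns, columns=df.columns)
for a, b in combinations(df.columns, 2):
    p = ks_2samp(df[a], df[b]).pvalue
    pvalues.loc[a, b] = p
    pvalues.loc[b, a] = p

print(pvalues.round(3))
```

A low p-value in the matrix suggests the two columns come from different distributions. With many columns this performs many tests at once, so some low p-values will appear by chance; a multiple-testing correction may be worth considering before drawing conclusions.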