I'm having trouble understanding how to **compare** two sets of data by their **distribution**.

For example, how can I tell whether column X100 has the same distribution as column Y1?

Also, is there a way to express the **distribution comparison** of all columns to all columns?

I'm a machine learning developer using **Python**, and this is part of a **classification problem** I'm working on.

I would appreciate any help. Thanks 🙂

## Best Answer

You can compare the distributions of two columns using the two-sample Kolmogorov-Smirnov test, which is included in `scipy.stats` as `ks_2samp`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

From the Stack Overflow topic:

Under the null hypothesis the two distributions are identical. If the K-S statistic is small or the p-value is high (greater than the significance level, say 5%), then we cannot reject the hypothesis that the distributions of the two samples are the same. Conversely, we can reject the null hypothesis if the p-value is low.
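To make this concrete, here is a minimal sketch of both the single-pair test and the all-columns-to-all-columns comparison asked about in the question. It assumes the data sits in a pandas DataFrame. The synthetic columns (named `X100`, `Y1`, and `Y2` only to mirror the question) and the helper `ks_pvalue_matrix` are hypothetical illustrations, not part of SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_pvalue_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: two-sample K-S p-value for every pair of columns."""
    cols = list(df.columns)
    # Start from all ones: a column compared with itself has p-value 1.
    pvals = pd.DataFrame(np.ones((len(cols), len(cols))), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            # Drop missing values so NaNs do not distort the test.
            result = ks_2samp(df[a].dropna(), df[b].dropna())
            # The test is symmetric, so fill both triangles.
            pvals.loc[a, b] = result.pvalue
            pvals.loc[b, a] = result.pvalue
    return pvals


# Synthetic stand-in data (an assumption; replace with your own DataFrame).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X100": rng.normal(0.0, 1.0, 500),
    "Y1": rng.normal(0.0, 1.0, 500),   # drawn from the same distribution as X100
    "Y2": rng.normal(3.0, 1.0, 500),   # shifted mean, so a different distribution
})

# Single pair: compare X100 with Y1.
result = ks_2samp(df["X100"], df["Y1"])
print(f"K-S statistic = {result.statistic:.3f}, p-value = {result.pvalue:.3f}")

# All columns against all columns.
pvals = ks_pvalue_matrix(df)
print(pvals.round(3))
```

Each cell of the resulting matrix is the p-value for one pair of columns, so low values flag pairs whose distributions likely differ. With many columns this runs many tests at once, so some low p-values will appear by chance. It is safer to treat the matrix as a screening tool, or to apply a multiple-testing correction, than to read each cell as a separate verdict.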