How to compare the data distributions of two datasets

Tags: dataset, distributions, machine learning, python, scipy

I'm having trouble understanding how to compare two sets of data by their distributions.

For example, how can I tell whether column X100 has the same distribution as column Y1?

Also, is there a way to compare the distribution of every column against every other column?

I'm a machine learning developer using Python, and this is part of a classification problem I'm working on.

Would appreciate any help, thanks 🙂

Best Answer

You can compare the distributions of the two columns using the two-sample Kolmogorov-Smirnov test, which is available in scipy.stats as ks_2samp: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

An example from a related Stack Overflow thread:

from scipy.stats import ks_2samp
import numpy as np

np.random.seed(123456)
# x and y are drawn from the same distribution, z from a different one
x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)
z = np.random.normal(1.1, 0.9, 1000)

>>> ks_2samp(x, y)
Ks_2sampResult(statistic=0.022999999999999909, pvalue=0.95189016804849647)
>>> ks_2samp(x, z)
Ks_2sampResult(statistic=0.41800000000000004, pvalue=3.7081494119242173e-77)

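The null hypothesis is that the two samples come from the same distribution. If the p-value is high (greater than the chosen significance level, say 5%), which goes together with a small K-S statistic, we cannot reject that hypothesis. This does not prove the distributions are identical, only that the test found no evidence of a difference. Conversely, if the p-value is low, we can reject the null hypothesis and conclude that the distributions differ. In the example above, x and y cannot be told apart (p ≈ 0.95), while x and z clearly differ (p ≈ 3.7e-77).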
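For the all-columns-to-all-columns part of the question, one option is to run the same test on every pair and collect the results in a matrix. The following is only a sketch under assumptions: the helper name pairwise_ks, the column names X1, X2, Y1, Y2 and the synthetic data are made up for illustration, and the 5% threshold is the same arbitrary level mentioned above. The helper takes dicts of arrays; a pandas DataFrame should work as well, since iterating it yields column names and indexing it by name yields a column.

```python
import numpy as np
from scipy.stats import ks_2samp


def pairwise_ks(cols_a, cols_b):
    """Run ks_2samp for every (column of a, column of b) pair.

    cols_a, cols_b: mappings from column name to 1-D array of values.
    Returns the two name lists plus matrices of K-S statistics and p-values,
    with rows following cols_a and columns following cols_b.
    """
    names_a, names_b = list(cols_a), list(cols_b)
    stats = np.zeros((len(names_a), len(names_b)))
    pvals = np.zeros_like(stats)
    for i, a in enumerate(names_a):
        for j, b in enumerate(names_b):
            result = ks_2samp(cols_a[a], cols_b[b])
            stats[i, j] = result.statistic
            pvals[i, j] = result.pvalue
    return names_a, names_b, stats, pvals


# Hypothetical data: X1/Y1 share one distribution, X2/Y2 share another.
rng = np.random.default_rng(0)
data_x = {"X1": rng.normal(0, 1, 500), "X2": rng.normal(3, 1, 500)}
data_y = {"Y1": rng.normal(0, 1, 500), "Y2": rng.normal(3, 1, 500)}

names_x, names_y, stats, pvals = pairwise_ks(data_x, data_y)

alpha = 0.05  # assumed significance level
# True where the test does NOT reject "same distribution"
not_rejected = pvals > alpha

for i, a in enumerate(names_x):
    for j, b in enumerate(names_y):
        print(f"{a} vs {b}: D={stats[i, j]:.3f}, p={pvals[i, j]:.3g}")
```

Keep in mind that with many columns this runs many tests, so some pairs will fall below the threshold by chance alone; a multiple-comparison correction, or ranking pairs by the K-S statistic instead of thresholding p-values, may be more informative.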
