Solved – How to compare two data sets to know if they are not similar

artificial intelligence, data visualization, distributions, machine learning, similarities

I want to write a program that, given two data sets, can tell whether they differ significantly or are roughly similar.

Details on the two data sets:

  • Both data sets will have the same number of values.
  • A value in a data set may not be related to other values in the same data set.

The aim is to get a final score that describes the 'similarity' of the two data sets.

I am completely new to statistics. I am trying to learn AI and feel that this would be a good basic start towards data science.
I would also like to visualize the result.

Example:

Data Set 1: A = 2, B = 120,000, C = 70

Data Set 2: A = 5, B = 240,000, C = 80

Best Answer

I don't think this is possible in the general case. "No free lunch" and all that. I think you first need to form a null hypothesis, and then find evidence for or against it. And what you are actually trying to compare is probably not the datasets themselves, but the populations from which the datasets were sampled. I think it's safe to say that the datasets themselves are different, since if they were exactly identical you wouldn't be asking the question :-)

But, for example, the distribution from which the first dataset was sampled might have some mean $\mu_1$, and the distribution from which the second dataset was sampled might have some mean $\mu_2$. Can you find evidence, using the sampled datasets, that the means are or aren't the same?

So, an example null hypothesis could be $\mu_1 = \mu_2$, which you can then bring mathematical/statistical methods to bear on to find evidence for or against.
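To make that concrete, here is a minimal sketch of one way to look for evidence against $\mu_1 = \mu_2$: a permutation test on the difference in sample means. This is my choice of technique for illustration, not the only option. The function name `permutation_test_means`, its parameters, and the sample values are all made up. Note that shuffling the pooled values actually simulates a slightly stronger null (that the group labels are interchangeable), and the test only makes sense when each dataset is a collection of repeated measurements of the same quantity.

```python
import random
from statistics import mean


def permutation_test_means(sample_1, sample_2, n_permutations=10_000, seed=0):
    """Two-sided permutation test using the difference in sample means.

    Returns (observed difference in means, approximate p-value).
    Hypothetical helper written for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = mean(sample_1) - mean(sample_2)
    pooled = list(sample_1) + list(sample_2)
    n_1 = len(sample_1)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null, group labels carry no information, so reshuffle them.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_1]) - mean(pooled[n_1:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up measurements, purely to show the call.
data_set_1 = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1]
data_set_2 = [2.6, 2.4, 2.9, 2.5, 2.7, 2.8, 2.3, 2.6]

difference, p_value = permutation_test_means(data_set_1, data_set_2)
print(f"difference in means: {difference:.3f}, p-value: {p_value:.4f}")
```

A small p-value is evidence against the null hypothesis. A large one only means the data did not contradict it, not that the two populations are the same.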

In the general case, I don't reckon there is any definitive method to say whether two datasets are or aren't drawn from the same distribution. One valid model is a single black box that gives the first dataset at time $t_1$ and the second dataset at time $t_2$. Another is two black boxes, one that always gives the values of the first dataset and one that always gives the values of the second, so that the two boxes represent two different underlying distributions. There's no way to tell these two cases apart just by looking at the two sampled datasets. (I think this is a bit hand-wavy, and there's probably a formal way of expressing it, but I reckon the underlying concepts are approximately correct.)