Solved – How to compare two data sets to know if they are not similar

artificial intelligence, data visualization, distributions, machine learning, similarities

I want to write a program that, given two data sets, can tell whether they differ significantly or are roughly similar.

Details on the two data sets:

Both data sets will have the same number of values.
A value in a data set may not be related to other values in the same data set.

The aim is to get a final score that describes the 'similarity' of the two data sets.

I am completely new to statistics. I am trying to learn AI and feel that this will be a good basic start towards Data Science.
Also, I want to visualize this.

Example:

Data Set 1: A = 2 B = 120,000 C = 70

Data Set 2: A = 5 B = 240,000 C = 80

Best Answer

I don't think this is possible in the general case. "No free lunch" and all that. I think you first need to form a null hypothesis, and then find evidence for or against it. And what you are actually trying to compare is probably not the data sets themselves, but the populations from which they were sampled. I think it's safe to say that the data sets themselves are different, since if they were exactly identical you wouldn't be asking the question :-)

But, for example, the distribution from which the first dataset was sampled might have some mean $\mu_1$, and the distribution from which the second dataset was sampled might have some mean $\mu_2$. Can you find evidence, using the sampled datasets, that the means are or aren't the same?

So, an example null hypothesis could be $\mu_1 = \mu_2$, which you can then bring mathematical and statistical methods to bear on, to find evidence for or against it.
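
As a minimal sketch of what testing $\mu_1 = \mu_2$ could look like in R: the two samples below are simulated purely for illustration (their sizes, means and standard deviations are assumptions, not taken from the question), and Welch's two-sample t-test is just one common choice of test, which presumes roughly Gaussian data.

```r
# Hypothetical samples, simulated for illustration only
set.seed(42)
sample_1 <- rnorm(30, mean = 10, sd = 2)
sample_2 <- rnorm(30, mean = 12, sd = 2)

# Welch two-sample t-test (the default of t.test):
# null hypothesis mu_1 = mu_2, no equal-variance assumption
result <- t.test(sample_1, sample_2)

result$p.value   # a small value is evidence against mu_1 = mu_2
result$conf.int  # confidence interval for mu_1 - mu_2
```

A small p-value counts as evidence against the null hypothesis; a large one only means the data do not contradict it, not that the two means are equal.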

In the general case, I don't reckon there is any definitive method to say that two datasets are or aren't drawn from the same distribution. One valid underlying model is a single black box that gives the first dataset at time $t_1$ and the second dataset at time $t_2$. Another is a pair of black boxes, one that always gives the values of the first dataset and one that always gives the values of the second, so that the two boxes represent two different underlying distributions. There's no way to differentiate between these two cases by looking only at the two sampled datasets. (I think this is a bit hand-waving, and there's probably a formal way of expressing it, but I reckon the underlying concepts are approximately correct.)

Related Solutions

Solved – How similar are the 2 data sets

We need either an example or more details on the datasets:

Is there more than one variable?
How many observations are there per dataset?
Is a Gaussian assumption sound for your problem?

The t-test will answer the question: is the mean the same in the two groups?

To test whether the two data sets come from the same distribution, you could for example apply a Kolmogorov-Smirnov test (ks.test in R). There are also multivariate versions of the Kolmogorov-Smirnov test if you have two or more variables [Lopes et al., 2007].

With the example dataset:

 x <- c(1, 1, 2, 3, 1, 2, 1, 3, 4, 1)
 y <- c(2, 2, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2)
 # Shared factor levels so both bar plots use the same x-axis
 maxi <- max(c(x, y))
 xfac <- factor(x, levels = 1:maxi)
 yfac <- factor(y, levels = 1:maxi)
 # Plot the two frequency tables one above the other
 layout(1:2)
 barplot(table(xfac))
 barplot(table(yfac))

Figure: bar plots of the two samples.

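# Two-sample Wilcoxon rank-sum test (location shift, often read as a test on the median)
wilcox.test(x, y) # Similar medians; with ties, R warns that it cannot compute an exact p-value
# Two-sample Kolmogorov-Smirnov test
ks.test(x, y) # Do not trust the p-value because the data is discrete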
# Alternative?

Given the plot and the results of the tests, you might want to increase the number of observations!
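
One possible answer to the "Alternative?" comment, offered as an assumption rather than as the method the answer had in mind, is a permutation test: keep the Kolmogorov-Smirnov statistic, but obtain the p-value by reshuffling the group labels, which does not rely on the data being continuous. The helper `ks_stat`, the seed and the 5000 permutations are choices made for this sketch.

```r
x <- c(1, 1, 2, 3, 1, 2, 1, 3, 4, 1)
y <- c(2, 2, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2)

# KS statistic: largest vertical gap between the two empirical CDFs
ks_stat <- function(a, b) {
  grid <- sort(unique(c(a, b)))
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

set.seed(1)
observed <- ks_stat(x, y)
pooled <- c(x, y)
n_x <- length(x)

# Reassign the pooled values to two groups at random, many times
perm <- replicate(5000, {
  idx <- sample(length(pooled), n_x)
  ks_stat(pooled[idx], pooled[-idx])
})

# Share of permutations at least as extreme as the observed statistic
# (small tolerance guards against floating-point ties)
p_value <- (sum(perm >= observed - 1e-12) + 1) / (length(perm) + 1)
p_value
```

With samples this small the test has little power, so a large p-value here says more about the sample size than about the two distributions.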

Solved – Quantifying similarity between two data sets

The area between the two curves gives a measure of their difference. sum(nr-nf) (the sum of all differences) approximates that area as long as one curve stays above the other; if the curves cross, positive and negative differences cancel, so sum(abs(nr-nf)) is the safer choice. To make it relative, divide by sum(nf), for example sum(nr-nf)/sum(nf). Either way you get a single value per graph indicating how similar the two curves are.

Edit: The sum of differences is also useful when these are separate points or observations rather than connected lines or curves. In that case the mean of the differences can also serve as an indicator, and may be better since it takes the number of observations into account.
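
A small sketch of these quantities in R, assuming nr and nf are two equal-length numeric vectors measured at the same points. The values below are made up for illustration, and the variable names other than nr and nf are hypothetical.

```r
# Hypothetical values; nr and nf stand for the two series being compared
nr <- c(10, 12, 15, 11)
nf <- c(8, 13, 14, 9)

signed_sum <- sum(nr - nf)       # signed differences: crossings cancel out
abs_area   <- sum(abs(nr - nf))  # approximate area between the two curves
relative   <- abs_area / sum(nf) # area relative to the total of nf
mean_abs   <- mean(abs(nr - nf)) # average gap per observation

c(signed_sum = signed_sum, abs_area = abs_area,
  relative = relative, mean_abs = mean_abs)
```

Here the signed sum (4) is smaller than the absolute area (6) because the series cross at the second point, which is why the absolute version is often preferred as a distance.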
