Solved – How similar are the 2 data sets

Tags: p-value, similarities, statistical-significance, t-test, unbalanced-classes

I am kind of stuck with an easy question:

I have two data sets with experimental data. The data sets do not have the same size. I would like to show that these data sets are possibly coming from the same experiment.

I tried a two-sample $t$-test; it shows that the data are significantly different. Is there a way to generate something like a $p$-value for similarity instead of difference?

Update:
Here is an example:
Data set 1 (vector): 1 1 2 3 1 2 1 3 4 1 Mean: 1.9
Data set 2 (vector): 2 2 1 2 2 1 1 2 2 3 2 2 Mean: 1.83

How would you now show that these data sets come from one experiment?

Best Answer

We need either an example or more details on the datasets:

  • is there more than one variable?
  • how many individuals per dataset?
  • is the Gaussian hypothesis sound for your problem?

The t-test only answers one question: is the mean the same in the two groups? It says nothing about other differences between the two distributions, such as spread or shape.
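To make this concrete, here is a minimal sketch that runs a two-sample t-test on the example vectors from the question's update. The vectors are retyped by hand, and the name `res` is only illustrative. By default, `t.test` performs Welch's version, which does not assume equal variances.

```r
# Example data copied from the question's update
x <- c(1, 1, 2, 3, 1, 2, 1, 3, 4, 1)
y <- c(2, 2, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2)

# Welch's two-sample t-test (var.equal = FALSE is the default)
res <- t.test(x, y)

res$estimate  # the two sample means
res$p.value   # p-value for the null hypothesis of equal means
```

Keep in mind that a large p-value here only means the test failed to detect a difference in means; it is not, by itself, evidence that the two samples come from the same experiment.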

To test whether the two data sets come from the same distribution, you could, for example, apply a two-sample Kolmogorov-Smirnov test (ks.test in R). There are also multivariate versions of the Kolmogorov-Smirnov test if you have two or more variables [Lopes et al., 2007].

With the example dataset:

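 # Read the two example samples as numeric vectors
 x <- unlist(read.table(text="1 1 2 3 1 2 1 3 4 1",sep=" "))
 y <- unlist(read.table(text="2 2 1 2 2 1 1 2 2 3 2 2",sep=" "))
 # Use a common set of levels so both bar plots share the same x axis
 maxi <- max(c(x,y))
 xfac <- factor(x,levels=1:maxi)
 yfac <- factor(y,levels=1:maxi)
 # Plot the counts of each value, one panel per sample
 layout(1:2)
 barplot(table(xfac))
 barplot(table(yfac))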

[Figure: bar plots of the two samples]

# Two-sample Wilcoxon rank-sum test (tests for a shift in location)
wilcox.test(x, y) # No significant shift; expect a warning about ties
# Two-sample Kolmogorov-Smirnov test
ks.test(x, y) # Do not trust the p-value because the data is discrete (ties)
# Alternative?
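
One possible alternative for discrete data, offered here as a sketch rather than as part of the original answer, is a permutation test built on the Kolmogorov-Smirnov statistic. Instead of relying on the continuous-data reference distribution, it reshuffles the group labels and recomputes the statistic each time. The helper `ks_stat`, the seed, and the number of permutations `n_perm` are all illustrative choices.

```r
# Example data copied from the question's update
x <- c(1, 1, 2, 3, 1, 2, 1, 3, 4, 1)
y <- c(2, 2, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2)

# KS statistic: largest absolute gap between the two empirical CDFs
# (hypothetical helper, not a base R function)
ks_stat <- function(a, b) {
  grid <- sort(unique(c(a, b)))
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

set.seed(1)     # arbitrary seed, only for reproducibility
n_perm <- 5000  # number of random relabelings (assumed value)
pooled <- c(x, y)
observed <- ks_stat(x, y)

# Randomly reassign observations to two groups of the original sizes
perm_stats <- replicate(n_perm, {
  idx <- sample(length(pooled), length(x))
  ks_stat(pooled[idx], pooled[-idx])
})

# Share of relabelings at least as extreme as the observed statistic;
# the small tolerance guards against floating-point ties, and the
# add-one correction keeps the estimate strictly positive
p_perm <- (sum(perm_stats >= observed - 1e-12) + 1) / (n_perm + 1)

observed
p_perm
```

As with the other tests, a large `p_perm` only indicates that no difference was detected, and with samples this small the power to detect one is low.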

Given the plot and the test results, the samples are too small to conclude much either way, so you might want to increase the number of observations!