I am using scipy in Python, and the following calls return a nan value for whatever reason:
>>> stats.ttest_ind([1, 1], [1, 1])
Ttest_indResult(statistic=nan, pvalue=nan)
>>> stats.ttest_ind([1, 1], [1, 1, 1])
Ttest_indResult(statistic=nan, pvalue=nan)
But whenever I use samples that have different summary statistics, I actually get a reasonable value:
>>> stats.ttest_ind([1, 1], [1, 1, 1, 2])
Ttest_indResult(statistic=-0.66666666666666663, pvalue=0.54146973927558495)
Is it reasonable to interpret a p-value of nan as 0 instead? Is there any reason from statistics that it doesn't make sense to run a 2-sample t-test on samples with the same summary statistics?
Best Answer
The problem with trying to compare two constant samples with a t-test is that the calculation of t involves an estimate of within-group SD in its denominator. From Wikipedia:
$$t = \frac{\bar {X}_1 - \bar{X}_2}{s_{X_1X_2} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
When both samples are constant, $s_{X_1X_2} = 0$, so the denominator is 0. The numerator $\bar{X}_1 - \bar{X}_2$ is also 0 in your examples, making the statistic the indeterminate form 0/0, which scipy reports as nan. So a nan p-value should not be read as 0: with no within-group variability, the test statistic is simply undefined.
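To see where the nan comes from, here is a minimal sketch of the equal-variance t statistic following the formula above (the helper name `pooled_t` is mine, not scipy's); when both samples are constant, the pooled SD is 0 and the result is undefined:

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic with pooled SD, per the formula above."""
    n1, n2 = len(x1), len(x2)
    # Pooled within-group standard deviation s_{X1 X2}
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2))
                   / (n1 + n2 - 2))
    denom = sp * math.sqrt(1 / n1 + 1 / n2)
    if denom == 0:
        # Both samples constant: sp = 0, so t is the 0/0 indeterminate
        # form that scipy reports as nan.
        return float('nan')
    return (mean(x1) - mean(x2)) / denom
```

With the examples from the question, `pooled_t([1, 1], [1, 1])` is nan, while `pooled_t([1, 1], [1, 1, 1, 2])` gives the same -2/3 that `stats.ttest_ind` returned.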