The documentation for SciPy's stats module describes the following method for finding the p-value from a two-sided (one-sample) t-test:
We can use the t-test to test whether the mean of our sample differs in a statistically significant way from the theoretical expectation.
    >>> print 't-statistic = %6.3f pvalue = %6.4f' % stats.ttest_1samp(x, m)
    t-statistic = 0.391 pvalue = 0.6955
The pvalue is 0.7, this means that with an alpha error of, for example, 10%, we cannot reject the hypothesis that the sample mean is equal to zero, the expectation of the standard t-distribution.
As an exercise, we can calculate our ttest also directly without using the provided function, which should give us the same answer, and so it does:
    >>> tt = (sm-m)/np.sqrt(sv/float(n))  # t-statistic for mean
    >>> pval = stats.t.sf(np.abs(tt), n-1)*2  # two-sided pvalue = Prob(abs(t)>tt)
    >>> print 't-statistic = %6.3f pvalue = %6.4f' % (tt, pval)
    t-statistic = 0.391 pvalue = 0.6955
If you look carefully, you'll notice the authors use stats.t.sf(), the survival function (1-CDF), to calculate the p-value.
Why not just use the regular CDF? Is there a particular reason the survival function was used? It seems needlessly roundabout.
Best Answer
I am assuming you're not asking "why calculate this quantity" (which is a question about the definition of a p-value) but rather "why calculate the quantity as $2S(|t|)$ when you could calculate it as $2(1-F(|t|))$?".
The answer is mostly a matter of numerical accuracy. For large $x$, $S(x)$ can be computed to high accuracy and then doubled, while the corresponding $F(x)$ is very close to $1$, leading to catastrophic cancellation when computing $1-F(x)$: in double precision, any tail probability much smaller than about $10^{-16}$ is lost in the subtraction.
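To see the effect, here is a minimal sketch comparing the two routes. The degrees of freedom and the two t-values are arbitrary numbers picked for illustration (the small one just echoes the statistic in the question), and the exact digits printed may vary with the SciPy version.

```python
import numpy as np
from scipy import stats

df = 30  # degrees of freedom (illustrative assumption, not from the question)

# Moderate statistic: both routes agree to many digits.
t_small = 0.391
p_sf_small = 2 * stats.t.sf(abs(t_small), df)
p_cdf_small = 2 * (1 - stats.t.cdf(abs(t_small), df))

# Extreme statistic: the true two-sided p-value is far below 1e-16.
t_large = 40.0
p_sf_large = 2 * stats.t.sf(abs(t_large), df)          # tiny but positive
p_cdf_large = 2 * (1 - stats.t.cdf(abs(t_large), df))  # cdf rounds to ~1, so this
                                                       # is 0 or rounding noise

print("moderate t:", p_sf_small, p_cdf_small)
print("extreme  t:", p_sf_large, p_cdf_large)
```

For the moderate statistic the two numbers match. For the extreme one, only the survival-function route returns a meaningful tail probability, because the CDF route has already rounded away everything below machine precision before the subtraction happens.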