Solved – Why use survival function to find p-value for t-test

computational-statistics, scipy, t-test

The documentation for SciPy's stats module describes the following method for finding the p-value from a two-sided (one-sample) t-test:

We can use the t-test to test whether the mean of our sample differs in a statistically significant way from the theoretical expectation.

    >>> print('t-statistic = %6.3f pvalue = %6.4f' % stats.ttest_1samp(x, m))
    t-statistic =  0.391 pvalue = 0.6955

The p-value is 0.7, which means that at an alpha level of, say, 10%, we cannot reject the hypothesis that the sample mean is equal to zero, the expectation of the standard t-distribution.

As an exercise, we can also compute the t-test directly, without using the provided function; this should give the same answer, and indeed it does:

    >>> import numpy as np
    >>> from scipy import stats
    >>> # sm, sv = sample mean and variance of x; n = len(x); m = hypothesized mean
    >>> tt = (sm - m) / np.sqrt(sv / float(n))    # t-statistic for the mean
    >>> pval = stats.t.sf(np.abs(tt), n - 1) * 2  # two-sided p-value = Prob(|t| > |tt|)
    >>> print('t-statistic = %6.3f pvalue = %6.4f' % (tt, pval))
    t-statistic =  0.391 pvalue = 0.6955

If you look carefully, you'll notice that the authors use stats.t.sf(), the survival function (1 − CDF), to calculate the p-value.

Why not just use the regular CDF? Is there a particular reason the survival function was used? This seems obtuse.
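
For reference, here is a side-by-side comparison of the two forms (a minimal sketch; the t-statistic is taken from the example above, and df = 9 is a made-up degrees-of-freedom value, since n isn't shown in the docs):

    from scipy import stats

    tt = 0.391  # t-statistic from the example above
    df = 9      # hypothetical degrees of freedom (n - 1); the docs don't show n

    p_sf = 2 * stats.t.sf(abs(tt), df)          # survival-function form
    p_cdf = 2 * (1 - stats.t.cdf(abs(tt), df))  # complement-of-CDF form

    print(p_sf, p_cdf)  # the two agree at moderate t-values

At this scale the two are numerically identical, which is what prompts the question.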

Best Answer

I am assuming you're not asking "why calculate this quantity" (which is a question about the definition of a p-value) but rather "why calculate the quantity as $2S(|t|)$ when you could calculate it as $2(1 - F(|t|))$?".

The answer is mostly a matter of numerical accuracy. For large $x$, the tail probability $S(x)$ is small and can be computed directly to high relative accuracy, then doubled. The corresponding $F(x)$, by contrast, is so close to $1$ that it agrees with $1$ in most (or all) of the roughly 16 significant digits a double-precision float carries, so computing $1 - F(x)$ suffers catastrophic cancellation: few or none of those digits survive the subtraction.
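
A quick numerical sketch of the effect (using an arbitrary 10 degrees of freedom; exact digits will vary by SciPy version):

    from scipy import stats

    df = 10  # arbitrary degrees of freedom for illustration
    for x in (2.0, 40.0, 200.0):
        p_sf = 2 * stats.t.sf(x, df)          # tail probability computed directly
        p_cdf = 2 * (1 - stats.t.cdf(x, df))  # 1 - F cancels once F(x) is near 1
        print(x, p_sf, p_cdf)

Far enough into the tail, the $1 - F(x)$ version first loses significant digits and eventually rounds to exactly $0$, while the survival function still returns a meaningful (tiny) p-value.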
