Solved – Tracking down the assumptions made by SciPy’s ttest_ind() function

python, statistical-significance, t-test

I'm trying to write my own Python code to compute t-statistics and p-values for one- and two-tailed independent t-tests. I can use the normal approximation, but for the moment I'm trying to use just the t-distribution. I've been unsuccessful at matching the results of SciPy's stats library on my test data, and could use a fresh pair of eyes to check whether I'm just making a dumb mistake somewhere.

Note: this isn't so much a coding question as a "why isn't this computation yielding the right t-stat?" question. I give the code for completeness, but I'm not looking for software advice, just help understanding why the computation isn't right.

My code:

import numpy as np
import scipy.stats as st

def compute_t_stat(pop1, pop2):

    num1 = pop1.shape[0]
    num2 = pop2.shape[0]

    # Unbiased sample variances (ddof=1). Note that np.var defaults to the
    # biased population variance (ddof=0), but Welch's test uses the
    # unbiased sample variance.
    var1 = np.var(pop1, ddof=1)
    var2 = np.var(pop2, ddof=1)

    # The formula for the t-stat when the population variances differ.
    t_stat = (np.mean(pop1) - np.mean(pop2)) / np.sqrt(var1/num1 + var2/num2)

    # ADDED: The Welch-Satterthwaite degrees of freedom.
    df = (var1/num1 + var2/num2)**2 / ((var1/num1)**2/(num1 - 1) + (var2/num2)**2/(num2 - 1))

    # Am I computing this wrong?
    # It should just come from the CDF like this, right?
    # The second argument is the degrees of freedom.
    one_tailed_p_value = 1.0 - st.t.cdf(t_stat, df)
    two_tailed_p_value = 1.0 - (st.t.cdf(np.abs(t_stat), df) - st.t.cdf(-np.abs(t_stat), df))

    # Computing with SciPy's built-in; my results don't match theirs.
    t_ind, p_ind = st.ttest_ind(pop1, pop2)

    return t_stat, one_tailed_p_value, two_tailed_p_value, t_ind, p_ind
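
A quick way to see the mismatch (simulated samples with unequal variances; the seed, means, and sample sizes below are arbitrary, just to exercise the function):

rng = np.random.RandomState(0)
pop1 = rng.normal(loc=0.0, scale=1.0, size=35)
pop2 = rng.normal(loc=0.5, scale=2.0, size=50)

t_stat, p_one, p_two, t_ind, p_ind = compute_t_stat(pop1, pop2)
print("mine : t = %.6f, two-sided p = %.6f" % (t_stat, p_two))
print("scipy: t = %.6f, two-sided p = %.6f" % (t_ind, p_ind))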

Update:

After reading a bit more about Welch's t-test, I saw that I should be using the Welch-Satterthwaite formula to calculate the degrees of freedom. I have updated the code above to reflect this.
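
For reference, writing $s_i^2$ for the sample variance and $n_i$ for the size of sample $i$, the Welch-Satterthwaite approximation is

$$\nu \approx \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}},$$

which is what the df line in the code computes.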

With the new degrees of freedom, I get a closer result: my two-sided p-value is off from SciPy's by about 0.008. But that is still much too big an error, so I must still be doing something incorrect (either that, or SciPy's distribution functions are very bad, and it's hard to believe they are accurate to only two decimal places).

Second update:

While continuing to try things, I wondered whether SciPy's version automatically switches to the normal approximation of the t-distribution when the degrees of freedom are high enough (roughly > 30). So I re-ran my code using the normal distribution instead, and the computed results came out even further from SciPy's than when I use the t-distribution.
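
Concretely, that variant just swaps the t CDF for the standard normal CDF, with t_stat computed as above (a sketch):

# Normal approximation: replace the t CDF with the standard normal CDF
# (no degrees-of-freedom parameter needed).
one_tailed_p_value = 1.0 - st.norm.cdf(t_stat)
two_tailed_p_value = 2.0 * (1.0 - st.norm.cdf(np.abs(t_stat)))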

Best Answer

By using SciPy's built-in source() function, I could see a printout of the source code for ttest_ind(). Based on that source, the SciPy built-in performs the t-test assuming that the variances of the two samples are equal; it does not use the Welch-Satterthwaite degrees of freedom.
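
If source() isn't available in your version, the standard library's inspect module prints the same listing:

import inspect
import scipy.stats as st

# Print the implementation of ttest_ind() to check its assumptions.
print(inspect.getsource(st.ttest_ind))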

I just want to point out that, crucially, this is why you should not blindly trust library functions. In my case, I really do need the t-test for populations with unequal variances, and the degrees-of-freedom adjustment may matter for some of the smaller data sets I will run this on. SciPy assumes equal variances but nowhere states this assumption.

As I mentioned in some comments, the discrepancy between my code and SciPy's is about 0.008 for sample sizes between 30 and 400, and then slowly goes to zero for larger sample sizes. This is an effect of the extra (1/n1 + 1/n2) term in the equal-variances t-statistic denominator. Accuracy-wise, this is pretty important, especially for small sample sizes. It definitely confirms to me that I need to write my own function. (Possibly there are other, better Python libraries, but this at least should be known. Frankly, it's surprising this isn't anywhere up front and center in the SciPy documentation for ttest_ind()).