Solved – Why would I use ANOVA instead of a Rank-Sum test

anova, nonparametric, t-test

A colleague of mine with little statistics experience is trying to perform an experimental evaluation of a computer program. He created a between-subjects design and solicited test subjects.

11 people were given his "new and improved" computer program to use. 10 others got the "old and boring" computer program to use, for the same task.

He asked me and several other people around the lab how to analyze his data.

I told him he should examine the data for normality. If it was normally distributed, he should use a t-test. If it was not, he should use a Wilcoxon rank sum test.
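In R, the rule I suggested amounts to roughly the following sketch (the score vectors here are made-up placeholders, not his actual data):

#hypothetical scores for the two groups, just to illustrate the decision rule
new_scores <- c(17, 15, 19, 18, 16, 20, 14, 18, 17, 19, 16)  #11 subjects
old_scores <- c(12, 14, 11, 15, 13, 10, 14, 12, 13, 11)      #10 subjects

#check each group for approximate normality, then pick the test
if (shapiro.test(new_scores)$p.value > 0.05 &&
    shapiro.test(old_scores)$p.value > 0.05) {
  t.test(new_scores, old_scores)        #looks roughly normal: t-test
} else {
  wilcox.test(new_scores, old_scores)   #otherwise: Wilcoxon rank-sum test
}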

One of my colleagues told him he should use ANOVA, even though he only has two groups. Apparently, running ANOVA on non-normal data in R produces some new degrees-of-freedom measure that can be plugged into a t-test.

I've never heard of such a thing. Is this true? Is it statistically valid? Why would anyone use it instead of just doing a rank-sum test?

Best Answer

Using ANOVA in R does not produce anything different from using ANOVA in another program, and with two groups the results will be equivalent to an equal-variance t-test. The t-test is known to be robust to deviations from normality, though with unequal variances Welch's t-test is probably preferable.
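To see that equivalence concretely, here is a tiny illustration with made-up scores (nothing to do with the actual data): the F statistic from a one-way ANOVA on two groups is the square of the pooled-variance t statistic, and the p-values agree.

#two groups of made-up scores, 10 and 11 subjects
set.seed(1)
y <- c(rnorm(10, mean=14, sd=3), rnorm(11, mean=17, sd=3))
grp <- factor(rep(1:2, c(10, 11)))

tt <- t.test(y ~ grp, var.equal=TRUE)   #equal-variance t-test
av <- anova(lm(y ~ grp))                #one-way ANOVA

tt$statistic^2                          #t^2 ...
av[1, "F value"]                        #... equals the ANOVA F statistic
c(tt$p.value, av[1, "Pr(>F)"])          #identical p-values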

In the special case of a score based on the number of correct answers on a multiple-choice test, the distribution of the score is probably an overdispersed binomial. In that case the "correct" analysis might be a GLM with a quasi-binomial family. Of course, the results might be quite similar to those from the unequal-variance t-test.

Here is a simple simulation-based example with 20 questions and unequal variances. Welch's t-test gives a result much closer to that of the overdispersed binomial regression.

set.seed(3413)
#first sample: per-subject success probabilities vary via a normal random
#effect on the logit scale (this creates the overdispersion), then scores out of 20
p1 <- 1/(1 + exp(-1 + rnorm(10, sd=1)))
x1 <- rbinom(10, size=20, prob=p1)
#second sample: same construction, higher on the logit scale
p2 <- 1/(1 + exp(-3 + rnorm(10, sd=1)))
x2 <- rbinom(10, size=20, prob=p2)
#combine the two samples and create the grouping factor
x <- c(x1, x2)
g <- gl(2, 10)

#summaries:
tapply(x, g, mean)
   1    2 
12.6 19.2 
tapply(x, g, sd)
       1        2 
3.921451 1.032796 

#t-test:
t.test(x ~ g, var.equal=TRUE)

        Two Sample t-test

data:  x by g 
t = -5.1468, df = 18, p-value = 6.765e-05
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -9.294136 -3.905864 
sample estimates:
mean in group 1 mean in group 2 
           12.6            19.2 


#without equal variances:
t.test(x ~ g, var.equal=FALSE)

        Welch Two Sample t-test

data:  x by g 
t = -5.1468, df = 10.243, p-value = 0.0004016
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -9.448128 -3.751872 
sample estimates:
mean in group 1 mean in group 2 
           12.6            19.2 


#overdispersed binomial regression:
summary(glm(cbind(x, 20-x) ~ g, family="quasibinomial") )

Call:
glm(formula = cbind(x, 20 - x) ~ g, family = "quasibinomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5340  -0.8386  -0.2199   1.2778   2.7581  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5322     0.2242   2.374 0.028946 *  
g2            2.6458     0.5962   4.438 0.000318 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for quasibinomial family taken to be 2.343713)

    Null deviance: 120.242  on 19  degrees of freedom
Residual deviance:  45.197  on 18  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5