Solved – Understanding the results of Bartlett’s test of homoscedasticity in ANOVA

Tags: anova, hypothesis testing, r, variance

I want to conduct one-way ANOVA for this data:

# three factor levels
I <- c(19, 22, 20, 18, 25, 21, 24, 17)
II <- c(20, 21, 33, 27, 29, 30, 22, 23)
III <- c(16, 15, 18, 26, 17, 23, 20, 19)

# making a dataframe from data
response <- c(I, II, III)
factor <- c(rep("I", length(I)), rep("II", length(II)), rep("III", length(III)))
(data1 <- data.frame(response, factor))

So firstly, I check the boxplot for every factor level:

# making a side-by-side boxplots
plot(response ~ factor, data1)

and see that the variance for level II is much higher than for I and III, so I suspect that Bartlett's test will reject the null hypothesis of equal variances.

I also check the exact values of these variances and see that the second one (22.84) is significantly different from the others:

tapply(data1$response, data1$factor, var)
#      I        II       III 
#  7.928571 22.839286 13.642857

Then I check the normality of response, it's ok:

# testing for normality
qqnorm(data1$response)
qqline(data1$response)

if (shapiro.test(data1$response)$p.value >= 0.01) {
   cat("No reason to reject null hypothesis")
} else {
   cat("This distribution isn't normal")
}
# No reason to reject null hypothesis

So I finally go to Bartlett's test:

# testing for homoscedasticity
bartlett.test(response ~ factor, data1)

# Bartlett test of homogeneity of variances

# data:  response by factor
# Bartlett's K-squared = 1.7932, df = 2, p-value = 0.408

And I see that there's no reason to reject the null hypothesis. I know, of course, that this statement isn't the same as "the null hypothesis is true", but there is a significant difference in the variances here and the test still passes. Why? And should I assume that there is homogeneity of variances and go on with ANOVA?
Thanks for taking your time 🙂

Best Answer

Statistical tests allow you to say whether differences are significant. Deciding that the variances are significantly different just by computing their values runs counter to the point of statistical testing.

In your case, each variance is estimated from only eight points. The uncertainty about the true variance in each group is therefore high, which is why Bartlett's test does not reject the null hypothesis.

If you had 800 points in each group, the result would probably be different for the same variance values you computed in each group.
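To see how much the conclusion depends on sample size, Bartlett's statistic can be recomputed from the group variances alone. The sketch here applies the textbook formula by hand; the helper name `bartlett_from_summary` is made up for illustration and equal group sizes are assumed. It is not meant to mirror how `bartlett.test` is implemented, only to hold the variances fixed while n changes.

```r
# Bartlett's K-squared from group variances, assuming equal group sizes n.
# `bartlett_from_summary` is a hypothetical helper, not part of base R.
bartlett_from_summary <- function(vars, n) {
  k  <- length(vars)            # number of groups
  df <- n - 1                   # degrees of freedom per group
  pooled <- mean(vars)          # pooled variance (equal n, so a plain mean)
  num  <- k * df * log(pooled) - df * sum(log(vars))
  corr <- 1 + (k / df - 1 / (k * df)) / (3 * (k - 1))
  stat <- num / corr
  c(K2 = stat, p = pchisq(stat, df = k - 1, lower.tail = FALSE))
}

# Sample variances reported in the question.
vars <- c(I = 7.928571, II = 22.839286, III = 13.642857)

res8   <- bartlett_from_summary(vars, n = 8)    # sample size in the question
res800 <- bartlett_from_summary(vars, n = 800)  # same variances, 100x the data
print(round(res8, 4))
print(res800)
```

With n = 8 this should reproduce the K-squared = 1.7932 and p-value = 0.408 from the question, while with n = 800 the same three variances give a very small p-value.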

Related Solutions

Solved – How to choose between t-test or non-parametric test e.g. Wilcoxon in small samples

I am going to change the order of questions about.

I've found textbooks and lecture notes frequently disagree, and would like a system to work through the choice that can safely be recommended as best practice, and especially a textbook or paper this can be cited to.

Unfortunately, some discussions of this issue in books and so on rely on received wisdom. Sometimes that received wisdom is reasonable, sometimes it is less so (at the least in the sense that it tends to focus on a smaller issue when a larger problem is ignored); we should examine the justifications offered for the advice (if any justification is offered at all) with care.

Most guides to choosing a t-test or non-parametric test focus on the normality issue.

That’s true, but it’s somewhat misguided for several reasons that I address in this answer.

If performing an "unrelated samples" or "unpaired" t-test, whether to use a Welch correction?

This (to use it unless you have reason to think variances should be equal) is the advice of numerous references. I point to some in this answer.

Some people use a hypothesis test for equality of variances, but here it would have low power. Generally I just eyeball whether the sample SDs are "reasonably" close or not (which is somewhat subjective, so there must be a more principled way of doing it) but again, with low n it may well be that the population SDs are rather further apart than the sample ones.

Is it safer simply to always use the Welch correction for small samples, unless there is some good reason to believe population variances are equal? That’s what the advice is. The properties of the tests are affected by the choice based on the assumption test.
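As a small illustration of the choice being discussed, R's `t.test` applies the Welch correction by default and only pools the variances when `var.equal = TRUE` is requested. The data here are made-up numbers chosen to have unequal spreads and unequal sizes; nothing about them comes from the original discussion.

```r
# Hypothetical small samples with visibly different spreads and sizes.
x <- c(19, 22, 20, 18, 25, 21, 24, 17)
y <- c(20, 33, 27, 29, 41, 12)

welch  <- t.test(x, y)                    # default: var.equal = FALSE (Welch)
pooled <- t.test(x, y, var.equal = TRUE)  # classical equal-variance t-test

# Welch's approximate df is never larger than the pooled df, n_x + n_y - 2.
print(c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter)))
print(c(welch_p = welch$p.value, pooled_p = pooled$p.value))
```

Using the default means no preliminary variance test is needed to decide which version to run.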

Some references on this can be seen here and here, though there are more that say similar things.

The equal-variances issue has many similar characteristics to the normality issue – people want to test it, advice suggests conditioning choice of tests on the results of tests can adversely affect the results of both kinds of subsequent test – it’s better simply not to assume what you can’t adequately justify (by reasoning about the data, using information from other studies relating to the same variables and so on).

However, there are differences. One is that – at least in terms of the distribution of the test statistic under the null hypothesis (and hence, its level-robustness) - non-normality is less important in large samples (at least in respect of significance level, though power might still be an issue if you need to find small effects), while the effect of unequal variances under the equal variance assumption doesn’t really go away with large sample size.

What principled method can be recommended for choosing which is the most appropriate test when the sample size is "small"?

With hypothesis tests, what matters (under some set of conditions) is primarily two things:

What is the actual type I error rate?
What is the power behaviour like?

We also need to keep in mind that if we're comparing two procedures, changing the first will change the second (that is, if they’re not conducted at the same actual significance level, you would expect that higher $\alpha$ is associated with higher power).

(Of course we're usually not so confident we know what distributions we're dealing with, so the sensitivity of those behaviors to changes in circumstances also matters.)

With these small-sample issues in mind, is there a good - hopefully citable - checklist to work through when deciding between t and non-parametric tests?

I will consider a number of situations in which I’ll make some recommendations, considering both the possibility of non-normality and unequal variances. In every case, take mention of the t-test to imply the Welch-test:

n medium-large

Non-normal (or unknown), likely to have near-equal variance:

If the distribution is heavy-tailed, you will generally be better with a Mann-Whitney, though if it’s only slightly heavy, the t-test should do okay. With light-tails the t-test may (often) be preferred. Permutation tests are a good option (you can even do a permutation test using a t-statistic if you're so inclined). Bootstrap tests are also suitable.
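A permutation test built on a t-statistic can be sketched in a few lines. This is one possible Monte Carlo version, with the helper names `t_stat` and `perm_t_test` invented for the example and the two samples used purely as illustrative numbers; the choice of the Welch statistic and of 2000 random relabellings are assumptions, not recommendations from the answer.

```r
set.seed(1)  # only so the sketch is reproducible

# Illustrative data: two independent samples.
x <- c(19, 22, 20, 18, 25, 21, 24, 17)
y <- c(20, 21, 33, 27, 29, 30, 22, 23)

# Welch t-statistic used as the test statistic.
t_stat <- function(a, b) unname(t.test(a, b)$statistic)

perm_t_test <- function(x, y, B = 2000) {
  obs    <- t_stat(x, y)
  pooled <- c(x, y)
  nx     <- length(x)
  perm <- replicate(B, {
    idx <- sample(length(pooled), nx)   # random relabelling of the groups
    t_stat(pooled[idx], pooled[-idx])
  })
  # Two-sided Monte Carlo p-value; the +1 counts the observed arrangement.
  (1 + sum(abs(perm) >= abs(obs))) / (B + 1)
}

p_perm <- perm_t_test(x, y)
print(p_perm)
```

The reference distribution comes from the relabellings rather than from the t-distribution, so the normality assumption is replaced by exchangeability under the null.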

Non-normal (or unknown), unequal variance (or variance relationship unknown):

If the distribution is heavy-tailed, you will generally be better with a Mann-Whitney if inequality of variance is only related to inequality of mean - i.e. if H0 is true the difference in spread should also be absent. GLMs are often a good option, especially if there’s skewness and spread is related to the mean. A permutation test is another option, with a similar caveat as for the rank-based tests. Bootstrap tests are a good possibility here.

Zimmerman and Zumbo (1993)$^{[1]}$ suggest a Welch-t-test on the ranks which they say performs better than the Wilcoxon-Mann-Whitney in cases where the variances are unequal.
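One reading of that suggestion is to rank the combined sample and run the Welch test on the ranks. The sketch that follows does this on made-up data and prints the ordinary Wilcoxon-Mann-Whitney p-value alongside for comparison; it is my interpretation of the procedure, not code taken from the paper.

```r
# Hypothetical samples with unequal spread.
x <- c(19, 22, 20, 18, 25, 21, 24, 17)
y <- c(20, 21, 33, 27, 29, 30, 22, 23)

r  <- rank(c(x, y))                 # joint ranks; ties receive mid-ranks
rx <- r[seq_along(x)]
ry <- r[-seq_along(x)]

welch_on_ranks <- t.test(rx, ry)    # Welch t-test applied to the ranks
# The data contain ties, so the normal approximation is requested explicitly.
wmw <- wilcox.test(x, y, exact = FALSE)

print(c(welch_ranks_p = welch_on_ranks$p.value, wmw_p = wmw$p.value))
```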

n moderately small

Rank tests are reasonable defaults here if you expect non-normality (again with the above caveat). If you have external information about shape or variance, you might consider GLMs. If you expect things not to be too far from normal, t-tests may be fine.

n very small

Because of the problem with getting suitable significance levels, neither permutation tests nor rank tests may be suitable, and at the smallest sizes, a t-test may be the best option (there’s some possibility of slightly robustifying it). However, there’s a good argument for using higher type I error rates with small samples (otherwise you’re letting type II error rates inflate while holding type I error rates constant). Also see de Winter (2013)$^{[2]}$.

The advice must be modified somewhat when the distributions are both strongly skewed and very discrete, such as Likert scale items where most of the observations are in one of the end categories. Then the Wilcoxon-Mann-Whitney isn’t necessarily a better choice than the t-test.

Simulation can help guide choices further when you have some information about likely circumstances.

I appreciate this is something of a perennial topic, but most questions concern the questioner's particular data set, sometimes a more general discussion of power, and occasionally what to do if two tests disagree, but I would like a procedure to pick the correct test in the first place!

The main problem is how hard it is to check the normality assumption in a small data set:

It is difficult to check normality in a small data set, and to some extent that's an important issue, but I think there's another issue of importance that we need to consider. A basic problem is that trying to assess normality as the basis of choosing between tests adversely impacts the properties of the tests you're choosing between.

Any formal test for normality would have low power so violations may well not be detected. (Personally I wouldn't test for this purpose, and I'm clearly not alone, but I've found this of little use when clients demand a normality test be performed because that's what their textbook or old lecture notes or some website they found once declare should be done. This is one point where a weightier looking citation would be welcome.)

Here’s an example of a reference (there are others) which is unequivocal (Fay and Proschan, 2010$^{[3]}$):

The choice between t- and WMW DRs should not be based on a test of normality.

They are similarly unequivocal about not testing for equality of variance.

To make matters worse, it is unsafe to use the Central Limit Theorem as a safety net: for small n we can't rely on the convenient asymptotic normality of the test statistic and t distribution.

Nor even in large samples -- asymptotic normality of the numerator doesn’t imply that the t-statistic will have a t-distribution. However, that may not matter so much, since you should still have asymptotic normality (e.g. CLT for the numerator, and Slutsky’s theorem suggest that eventually the t-statistic should begin to look normal, if the conditions for both hold.)

One principled response to this is "safety first": as there's no way to reliably verify the normality assumption on a small sample, run an equivalent non-parametric test instead.

That’s actually the advice that the references I mention (or link to mentions of) give.

Another approach I've seen but feel less comfortable with, is to perform a visual check and proceed with a t-test if nothing untoward is observed ("no reason to reject normality", ignoring the low power of this check). My personal inclination is to consider whether there are any grounds for assuming normality, theoretical (e.g. variable is sum of several random components and CLT applies) or empirical (e.g. previous studies with larger n suggest variable is normal).

Both those are good arguments, especially when backed up with the fact that the t-test is reasonably robust against moderate deviations from normality. (One should keep in mind, however, that "moderate deviations" is a tricky phrase; certain kinds of deviations from normality may impact the power performance of the t-test quite a bit even though those deviations are visually very small - the t-test is less robust to some deviations than others. We should keep this in mind whenever we're discussing small deviations from normality.)

Beware, however, the phrasing "suggest the variable is normal". Being reasonably consistent with normality is not the same thing as normality. We can often reject actual normality with no need even to see the data – for example, if the data cannot be negative, the distribution cannot be normal. Fortunately, what matters is closer to what we might actually have from previous studies or reasoning about how the data are composed, which is that the deviations from normality should be small.

If so, I would use a t-test if data passed visual inspection, and otherwise stick to non-parametrics. But any theoretical or empirical grounds usually only justify assuming approximate normality, and on low degrees of freedom it's hard to judge how near normal it needs to be to avoid invalidating a t-test.

Well, that’s something we can assess the impact of fairly readily (such as via simulations, as I mentioned earlier). From what I've seen, skewness seems to matter more than heavy tails (but on the other hand I have seen some claims of the opposite - though I don't know what that's based on).
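A simulation of this kind might look like the following sketch, which estimates how often the Welch t-test rejects at a nominal 5% when the null is true, for a normal, a skewed and a heavy-tailed parent distribution. The helper `type1_rate`, the sample size of 5 per group and the three distributions are all choices made for the illustration.

```r
set.seed(42)  # reproducibility of the sketch only

# Estimated rejection rate of the Welch t-test when H0 is true,
# for a given sampler `rdist` and per-group sample size n.
type1_rate <- function(rdist, n, B = 2000, alpha = 0.05) {
  p <- replicate(B, t.test(rdist(n), rdist(n))$p.value)
  mean(p < alpha)
}

rates <- c(
  normal      = type1_rate(rnorm, n = 5),
  exponential = type1_rate(rexp,  n = 5),                       # skewed
  t3          = type1_rate(function(n) rt(n, df = 3), n = 5)    # heavy-tailed
)
print(rates)
```

Swapping in other samplers or sample sizes shows how sensitive the actual level is to the particular departure from normality; repeating the exercise with a shifted second sample would give the corresponding power comparison.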

For people who see the choice of methods as a trade-off between power and robustness, claims about the asymptotic efficiency of the non-parametric methods are unhelpful. For instance, the rule of thumb that "Wilcoxon tests have about 95% of the power of a t-test if the data really are normal, and are often far more powerful if the data is not, so just use a Wilcoxon" is sometimes heard, but if the 95% only applies to large n, this is flawed reasoning for smaller samples.

But we can check small-sample power quite easily! It’s easy enough to simulate to obtain power curves as here.
(Again, also see de Winter (2013)$^{[2]}$).
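As a rough idea of what such a simulation involves, the sketch here estimates power for the Welch t-test and the Wilcoxon-Mann-Whitney test over a few location shifts at n = 8 per group with normal data. The function name `power_pair`, the shift values and the number of replications are assumptions made for the example.

```r
set.seed(123)  # reproducibility of the sketch only

# Simulated power of the Welch t-test and the Wilcoxon-Mann-Whitney test
# for a location shift `delta`, with observations drawn from `rdist`.
power_pair <- function(delta, n, rdist = rnorm, B = 1000, alpha = 0.05) {
  rej <- replicate(B, {
    x <- rdist(n)
    y <- rdist(n) + delta
    c(t   = t.test(x, y)$p.value < alpha,
      wmw = wilcox.test(x, y)$p.value < alpha)
  })
  rowMeans(rej)   # proportion of rejections for each test
}

deltas <- c(0, 0.5, 1, 1.5, 2)
curves <- sapply(deltas, power_pair, n = 8)
colnames(curves) <- deltas
print(round(curves, 3))
```

Both tests are run at a nominal 5% here, so their actual levels differ slightly; the column for a shift of 0 gives an estimate of each test's actual level. Passing a different `rdist` (for example a heavy-tailed or skewed sampler) gives the corresponding curves away from the normal.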

Having done such simulations under a variety of circumstances, both for the two-sample and one-sample/paired-difference cases, the small sample efficiency at the normal in both cases seems to be a little lower than the asymptotic efficiency, but the efficiency of the signed rank and Wilcoxon-Mann-Whitney tests is still very high even at very small sample sizes.

At least that's if the tests are done at the same actual significance level; you can't do a 5% test with very small samples (at least not without randomized tests for example), but if you're prepared to perhaps do (say) a 5.5% or a 3.2% test instead, then the rank tests hold up very well indeed compared with a t-test at that significance level.
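The attainable levels can be listed directly from the exact null distribution of the rank-sum statistic. This sketch uses base R's `pwilcox` and assumes no ties and a two-sided test; `attainable_levels` is a name invented for the example.

```r
# Attainable two-sided significance levels of the exact Wilcoxon-Mann-Whitney
# test with no ties: twice the lower-tail probability at each cutoff u.
attainable_levels <- function(m, n, max_level = 0.15) {
  u      <- 0:floor(m * n / 2)
  levels <- 2 * pwilcox(u, m, n)
  levels[levels <= max_level]
}

print(round(attainable_levels(4, 4), 4))
print(round(attainable_levels(5, 5), 4))
```

For two samples of 5, the levels on either side of 5% come out at roughly 3.2% and 5.6%, which is the kind of choice described in the text.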

Small samples may make it very difficult, or impossible, to assess whether a transformation is appropriate for the data since it's hard to tell whether the transformed data belong to a (sufficiently) normal distribution. So if a QQ plot reveals very positively skewed data, which look more reasonable after taking logs, is it safe to use a t-test on the logged data? On larger samples this would be very tempting, but with small n I'd probably hold off unless there had been grounds to expect a log-normal distribution in the first place.

There’s another alternative: make a different parametric assumption. For example, if there’s skewed data, one might, for example, in some situations reasonably consider a gamma distribution, or some other skewed family as a better approximation - in moderately large samples, we might just use a GLM, but in very small samples it may be necessary to look to a small-sample test - in many cases simulation can be useful.
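For the moderately-large-sample case, a gamma GLM is straightforward to fit in R. The example here uses simulated data and a log link; both the data and the choice of link are assumptions for the sake of illustration.

```r
set.seed(7)  # reproducibility of the sketch only

# Hypothetical right-skewed positive data in two groups.
d <- data.frame(
  group = factor(rep(c("A", "B"), each = 20)),
  y     = c(rgamma(20, shape = 2, rate = 2 / 10),   # population mean 10
            rgamma(20, shape = 2, rate = 2 / 15))   # population mean 15
)

# Gamma GLM with log link: compares group means on the log scale
# under a gamma model instead of a normal, constant-variance one.
fit <- glm(y ~ group, family = Gamma(link = "log"), data = d)
print(summary(fit)$coefficients)
print(exp(coef(fit)))  # fitted mean of group A and the B/A ratio of means
```

The inference reported by `summary` relies on large-sample approximations, which is why the text suggests looking to simulation or a dedicated small-sample test when n is very small.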

Alternative 2: robustify the t-test (but taking care about the choice of robust procedure so as not to heavily discretize the resulting distribution of the test statistic) - this has some advantages over a very-small-sample nonparametric procedure such as the ability to consider tests with low type I error rate.

Here I'm thinking along the lines of using say M-estimators of location (and related estimators of scale) in the t-statistic to smoothly robustify against deviations from normality. Something akin to the Welch, like:

$$\frac{\stackrel{\sim}{x}-\stackrel{\sim}{y}}{\stackrel{\sim}{S}_p}$$

where $\stackrel{\sim}{S}_p^2=\frac{\stackrel{\sim}{s}_x^2}{n_x}+\frac{\stackrel{\sim}{s}_y^2}{n_y}$ and $\stackrel{\sim}{x}$, $\stackrel{\sim}{s}_x$ etc being robust estimates of location and scale respectively.

I'd aim to reduce any tendency of the statistic to discreteness - so I'd avoid things like trimming and Winsorizing, since if the original data were discrete, trimming etc will exacerbate this; by using M-estimation type approaches with a smooth $\psi$-function you achieve similar effects without contributing to the discreteness. Keep in mind we're trying to deal with the situation where $n$ is very small indeed (around 3-5, in each sample, say), so even M-estimation potentially has its issues.

You could, for example, use simulation at the normal to get p-values (if sample sizes are very small, I'd suggest that over bootstrapping - if sample sizes aren't so small, a carefully-implemented bootstrap may do quite well, but then we might as well go back to Wilcoxon-Mann-Whitney). There'd be a scaling factor as well as a d.f. adjustment to get to what I'd imagine would then be a reasonable t-approximation. This means we should get the kind of properties we seek very close to the normal, and should have reasonable robustness in the broad vicinity of the normal. There are a number of issues that come up that would be outside the scope of the present question, but I think in very small samples the benefits should outweigh the costs and the extra effort required.
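A very rough sketch of this idea follows. Several things in it are my assumptions rather than anything specified in the answer: Huber's $\psi$ (which is continuous but not smooth, so a smoother $\psi$ may be closer to what is intended), the MAD as the scale estimate, equal variances when simulating the null, and all function names. It simulates the null distribution directly instead of working out the scaling factor and d.f. adjustment.

```r
set.seed(2024)  # reproducibility of the sketch only

# Huber M-estimate of location with the scale held fixed at the MAD.
# Assumption: Huber's psi and the MAD are stand-ins chosen for brevity.
huber_loc <- function(x, k = 1.5, tol = 1e-8, max_iter = 100) {
  s  <- mad(x)
  mu <- median(x)
  if (s == 0) return(c(mu = mu, s = s))
  for (i in seq_len(max_iter)) {
    mu_new <- mean(pmin(pmax(x, mu - k * s), mu + k * s))
    if (abs(mu_new - mu) < tol * s) { mu <- mu_new; break }
    mu <- mu_new
  }
  c(mu = mu, s = s)
}

# Welch-like statistic built from the robust location and scale estimates.
robust_t <- function(x, y) {
  ex <- huber_loc(x); ey <- huber_loc(y)
  se <- sqrt(ex[["s"]]^2 / length(x) + ey[["s"]]^2 / length(y))
  (ex[["mu"]] - ey[["mu"]]) / se
}

# p-value by simulating the statistic at the normal. The statistic is
# invariant to a common shift and rescaling, so N(0, 1) for both samples
# covers the equal-variance normal null.
robust_t_pvalue <- function(x, y, B = 2000) {
  obs <- robust_t(x, y)
  sim <- replicate(B, robust_t(rnorm(length(x)), rnorm(length(y))))
  (1 + sum(abs(sim) >= abs(obs))) / (B + 1)
}

x <- c(4.1, 5.3, 3.8, 4.9, 5.0)    # hypothetical tiny samples
y <- c(6.2, 7.9, 6.8, 7.1, 12.5)   # includes one outlying value
t_obs <- robust_t(x, y)
p_obs <- robust_t_pvalue(x, y)
print(c(statistic = t_obs, p_value = p_obs))
```

Because the reference distribution is simulated for the actual sample sizes, no t-approximation is needed, at the cost of the p-value being tied to the equal-variance normal model used in the simulation.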

[I haven't read the literature on this stuff for a very long time, so I don't have suitable references to offer on that score.]

Of course if you didn't expect the distribution to be somewhat normal-like, but rather similar to some other distribution, you could undertake a suitable robustification of a different parametric test.

What if you want to check assumptions for the non-parametrics? Some sources recommend verifying a symmetric distribution before applying a Wilcoxon test, which brings up similar problems to checking normality.

Indeed. I assume you mean the signed rank test*. In the case of using it on paired data, if you are prepared to assume that the two distributions are the same shape apart from location shift you are safe, since the differences should then be symmetric. Actually, we don't even need that much; for the test to work you need symmetry under the null; it's not required under the alternative (e.g. consider a paired situation with identically-shaped right skewed continuous distributions on the positive half-line, where the scales differ under the alternative but not under the null; the signed rank test should work essentially as expected in that case). The interpretation of the test is easier if the alternative is a location shift though.

*(Wilcoxon’s name is associated with both the one and two sample rank tests – signed rank and rank sum; with their U test, Mann and Whitney generalized the situation studied by Wilcoxon, and introduced important new ideas for evaluating the null distribution, but the priority between the two sets of authors on the Wilcoxon-Mann-Whitney is clearly Wilcoxon’s -- so at least if we only consider Wilcoxon vs Mann&Whitney, Wilcoxon goes first in my book. However, it seems Stigler's Law beats me yet again, and Wilcoxon should perhaps share some of that priority with a number of earlier contributors, and (besides Mann and Whitney) should share credit with several discoverers of an equivalent test.$^{[4][5]}$)

References

[1]: Zimmerman DW and Zumbo BN, (1993),
Rank transformations and the power of the Student t-test and Welch t′-test for non-normal populations,
Canadian Journal of Experimental Psychology, 47: 523–39.

[2]: J.C.F. de Winter (2013),
"Using the Student’s t-test with extremely small sample sizes,"
Practical Assessment, Research and Evaluation, 18:10, August, ISSN 1531-7714
http://pareonline.net/getvn.asp?v=18&n=10

[3]: Michael P. Fay and Michael A. Proschan (2010),
"Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules,"
Stat Surv; 4: 1–39.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2857732/

[4]: Berry, K.J., Mielke, P.W. and Johnston, J.E. (2012),
"The Two-sample Rank-sum Test: Early Development,"
Electronic Journal for History of Probability and Statistics, Vol.8, December

[5]: Kruskal, W. H. (1957),
"Historical notes on the Wilcoxon unpaired two-sample test,"
Journal of the American Statistical Association, 52, 356–360.

Solved – What to do with non-normality and heterogeneous variances in two-way ANOVA when transformations do not work

Thanks for posting the data. Plotting them shows that the box plots concealed, although not intentionally, the sample sizes and some important detail too. Whenever I see skewness on a positive response, my first instinct is to reach for logarithms, as they so often work well. Here, however, logarithms drastically over-transform, and plotting everything shows up a small surprise, namely that the two lowest values need care and attention.

The graph here is a quantile-box plot in which the original data points are plotted in order on scales consistent with the box idea (i.e. about half the points are inside the box and about half outside, the "about" being a side-effect of sample sizes like 11).

A more cautious square root transformation seems about right.

Personally I regard preliminary tests for normality and so forth as over-rated stuff left over from the 1960s. I feel far too queasy about forking paths of the form: pass the test OK, fail the test do something quite different, particularly with small sample sizes. Once you have a scale on which you have approximate symmetry and approximate equality of variances, linear models will work well.

Similarly, skewness and kurtosis from small samples can hardly be trusted. (Actually, skewness and kurtosis from large samples can hardly be trusted.) For some of the reasons see e.g. this paper

Indeed, some fits with generalised linear models with cohort and gender as indicator predictor variables show that results seem consistent over identity, root and log links, even despite the evidence of the first graph. If this were my problem I would push forward with a square root link function. In other words, although transformations are informative about the best scale to work on, you let the link function of a generalised linear model do the work.

Campaign slogan: Conventional box plots with a few groups leave out detail that could easily be interesting or useful and don't make full use of the space available. Use graphs that show more!

EDIT:

Here is token output: predicted values using generalised linear model, root link, normal family, interaction between cohort and females:

  +--------------------------------------+
  | cohort   females   predicted   Freq. |
  |--------------------------------------|
  |      1     males       2.056      12 |
  |      1   females       5.024      12 |
  |      2     males      12.712      11 |
  |      2   females      15.348      11 |
  +--------------------------------------+
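For readers working in R, a fit of this general kind could be sketched as follows. The original data are not reproduced in this thread, so the values here are simulated with the same cell sizes; the variable names and the use of `quasi()` are my assumptions (R's `gaussian()` family does not offer a square-root link, and `quasi()` with constant variance gives the same point estimates).

```r
set.seed(11)  # reproducibility of the sketch only

# Hypothetical data: simulated values with cell sizes 12, 12, 11, 11.
cell_n <- c(12, 12, 11, 11)
d <- data.frame(
  cohort = factor(rep(c(1, 1, 2, 2), times = cell_n)),
  sex    = factor(rep(c("males", "females", "males", "females"),
                      times = cell_n),
                  levels = c("males", "females"))
)
root_mean <- rep(c(1.4, 2.2, 3.6, 3.9), times = cell_n)
d$y <- (root_mean + rnorm(nrow(d), sd = 0.3))^2   # positive, right-skewed

# Constant-variance (normal-type) errors with a square-root link.
fit <- glm(y ~ cohort * sex,
           family = quasi(link = "sqrt", variance = "constant"), data = d)

# Predicted values on the response scale for each cohort-by-sex cell.
cells <- unique(d[c("cohort", "sex")])
cells$predicted <- predict(fit, newdata = cells, type = "response")
print(cells)
```

Since the model includes the interaction, it is saturated for the four cells and the predictions are the cell means; the link function only starts to matter once terms are dropped or standard errors are compared across links.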
