I'm trying to figure out how actual working analysts handle data that doesn't quite meet the assumptions.
It depends on my needs: which assumptions are violated, in what way, and how badly; how much that affects the inference; and sometimes on the sample size.
I'm running analysis on grouped data from trees in four groups. I've got data for about 35 attributes for each tree and I'm going through each attribute to determine if the groups differ significantly on that attribute. However, in a couple of cases, the ANOVA assumptions are slightly violated because the variances aren't equal (according to a Levene's test, using alpha=.05).
1) If sample sizes are equal, you don't have much of a problem. ANOVA is quite (level-)robust to different variances if the n's are equal.
2) Testing equality of variance before deciding whether to assume it is recommended against by a number of studies. If you're in any real doubt that they'll be close to equal, it's better to simply assume they're unequal (see the sketch after the references below).
Some references:
Zimmerman, D.W. (2004), "A note on preliminary tests of equality of variances," British Journal of Mathematical and Statistical Psychology, 57(Pt 1), 173-181. http://www.ncbi.nlm.nih.gov/pubmed/15171807
Henrik gives three further references in a related answer.
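To see why, here's a minimal simulation sketch (my own illustration, not drawn from the references above): under the null hypothesis, with unequal variances and unequal n, compare the two-stage "pool unless Levene rejects" procedure against always using Welch. All settings (group sizes, sd's, number of replications) are arbitrary choices for illustration.

```python
# Sketch: type I error of "pooled t unless Levene rejects" vs. always-Welch,
# under H0 (equal means) with unequal variances and unequal n.
# All settings here are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2, sd1, sd2 = 10, 40, 2.0, 1.0   # smaller group has the larger spread
nsim, alpha = 10_000, 0.05

rej_twostage = rej_welch = 0
for _ in range(nsim):
    x = rng.normal(0, sd1, n1)
    y = rng.normal(0, sd2, n2)
    # two-stage: pooled t unless Levene rejects equality of variances
    pool = stats.levene(x, y).pvalue > alpha
    rej_twostage += stats.ttest_ind(x, y, equal_var=pool).pvalue < alpha
    # one-stage: always Welch
    rej_welch += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha

print(f"two-stage rejection rate:    {rej_twostage / nsim:.3f}")  # typically well above 0.05
print(f"always-Welch rejection rate: {rej_welch / nsim:.3f}")     # typically close to 0.05
```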
3) It's the effect size that matters, not whether your sample is large enough to tell you the variances are significantly different. In large samples a small difference in variance will show as highly significant by Levene's test, yet be of essentially no consequence in its impact. If the samples are large and the effect size (the ratio of the variances, or the difference in the variances) is close to its no-effect value (a ratio near 1, a difference near 0), then the p-value is of no consequence. (On the other hand, in small samples a nice big p-value is of little comfort. Either way, the test doesn't answer the right question.)
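As a quick sketch of that point (sample sizes and sd's here are arbitrary illustrative choices):

```python
# Sketch: at large n, Levene's test flags a variance ratio that is of
# essentially no practical consequence for the ANOVA.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1.00, 5000)
y = rng.normal(0, 1.05, 5000)   # true variance ratio about 1.10: practically trivial

print(stats.levene(x, y))                    # typically "significant" at this n
print(np.var(x, ddof=1), np.var(y, ddof=1))  # the effect size: nearly equal variances
```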
Note that there's a Welch-Satterthwaite type adjustment to the estimate of residual standard error and d.f. in ANOVA, just as there is in two-sample t-tests.
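For a one-way layout, the Welch statistic is simple enough to compute directly. Here's a sketch implementing the usual Welch formulas (the function name and toy data are mine):

```python
# Sketch: Welch's one-way ANOVA (robust to unequal variances).
import numpy as np
from scipy import stats

def welch_anova(*groups):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                           # precision weights
    mw = np.sum(w * m) / np.sum(w)      # variance-weighted grand mean
    num = np.sum(w * (m - mw) ** 2) / (k - 1)
    h = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    den = 1 + 2 * (k - 2) / (k ** 2 - 1) * h
    f = num / den
    df1, df2 = k - 1, (k ** 2 - 1) / (3 * h)   # Welch-Satterthwaite-style d.f.
    return f, df1, df2, stats.f.sf(f, df1, df2)

rng = np.random.default_rng(2)
groups = [rng.normal(mu, sd, 30) for mu, sd in [(0, 1), (0.2, 2), (0.5, 1), (0.1, 3)]]
print(welch_anova(*groups))   # F, df1, df2, p
```

I believe statsmodels' anova_oneway (in statsmodels.stats.oneway) with use_var="unequal" implements the same test, which would give a convenient cross-check.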
- Use a non-parametric test like a Wilcoxon (if so, which one?).
If you're interested in location-shift alternatives, you're still assuming constant spread. If you're interested in much more general alternatives then you might perhaps consider it; the k-sample equivalent to a Wilcoxon test is a Kruskal-Wallis test.
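For concreteness, a minimal sketch with scipy (toy data mine):

```python
# Sketch: Kruskal-Wallis as the k-sample analogue of the Wilcoxon rank-sum test.
from scipy import stats

g1 = [3.1, 2.8, 3.4, 2.9]
g2 = [3.6, 3.8, 3.2, 4.0]
g3 = [2.5, 2.7, 2.2, 3.0]
g4 = [3.3, 3.9, 3.7, 3.5]

print(stats.kruskal(g1, g2, g3, g4))   # H statistic and asymptotic p-value
```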
Do some kind of correction to the ANOVA result
See my suggestion above of considering Welch-Satterthwaite; that's a 'kind of correction'.
(Alternatively, you might cast your ANOVA as a set of pairwise Welch-type t-tests, in which case you'd likely want to apply a Bonferroni correction or something similar, as sketched below.)
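A sketch of the pairwise-Welch-plus-Bonferroni idea (group labels and toy data are mine; Holm's method would be a uniformly better choice than plain Bonferroni):

```python
# Sketch: all pairwise Welch t-tests with a Bonferroni adjustment.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = {name: rng.normal(mu, sd, 30)
          for name, mu, sd in [("a", 0, 1), ("b", 0.3, 2), ("c", 0.6, 1), ("d", 0.1, 3)]}

pairs = list(combinations(groups, 2))
m = len(pairs)   # number of comparisons for the Bonferroni correction
for a, b in pairs:
    p = stats.ttest_ind(groups[a], groups[b], equal_var=False).pvalue
    print(f"{a} vs {b}: raw p = {p:.4f}, Bonferroni-adjusted p = {min(1.0, p * m):.4f}")
```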
I've also read some things that suggest that heteroscedasticity isn't really that big of a problem for ANOVA unless the means and variances are correlated (i.e. they both increase together)
You'd have to cite something like that. Having looked at a number of situations with t-tests, I don't think it's clearly true, so I'd like to see why they think so; perhaps the situation is restricted in some way. It would be nice if it were the case though because pretty often generalized linear models can help with that situation.
Finally, I should add that I'm doing this analysis for publication in a peer-reviewed journal, so whatever approach I settle on has to pass muster with reviewers.
It's very hard to predict what might satisfy your reviewers. Most of us don't work with trees.
When you have heteroskedasticity, it doesn't make sense to try to check normality of the entire set of residuals, though you could still check groups individually (with corresponding loss of power of course).
On the other hand, it doesn't really make sense to formally test either normality or heteroskedasticity when checking assumptions, since the hypothesis tests answer the wrong question.
This is because your data aren't actually normal (and it's also very unlikely that your populations have identical variances), so you already know the answer to the question the hypothesis test asks. With a large sample like yours, the chance that a powerful test like the Shapiro-Wilk fails to pick that up is small, so you'll reject as non-normal data from distributions whose non-normality has little impact on the significance level or the power. That is, you'll tend to reject normality, even at quite small significance levels, precisely when it matters least (i.e. when you have a big sample).
What you actually want to know is the answer to a different question from the one the test addresses: something like "How much does this affect my inference?". The hypothesis test doesn't consider the impact of the non-normality, only your ability to detect it.
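A quick sketch of that point (the distribution and sample size are arbitrary illustrative picks): at a largish n, Shapiro-Wilk will typically reject a t distribution with 10 d.f., whose non-normality is of almost no consequence for the level of a t-test or ANOVA.

```python
# Sketch: Shapiro-Wilk reliably detects mild, practically harmless non-normality
# once n is large.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = stats.t.rvs(df=10, size=2000, random_state=rng)   # mildly heavy-tailed
print(stats.shapiro(x))   # typically a very small p-value at this n
```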
Further, when you have a scheme like "do this equal-variance normal-theory analysis if I don't reject either test, this analysis if I reject one, this other analysis if I reject the other, and something else if I reject both", you must consider the properties of the whole scheme. Such programs of testing usually do worse (in terms of test bias, accuracy of significance level, and very often power) than simply proceeding as if both tests had rejected.
So you recommend just sticking with the ANOVA, and considering no assumptions violated?
Not quite. In fact, if anything my last sentence above suggests that you assume from the start that both equal variance and normality are violated. A violation of either one alone is relatively easy to deal with; both together is a little trickier (but still possible). However, in your case I think you're probably okay, since you needn't worry much about one of the two:
Normality may not be such a problem; the considerations are what kind of non-normality you might have, how strong it is, and how large the sample is.
Your sample size seems reasonably large and the distribution mildly left-skewed and light-tailed, though that assessment may be confounded with the heteroskedasticity. However, if you have a good understanding of the properties of what is being measured (which you may well have), or information from similar studies, you might be able to make an a priori assessment on that basis and so be better placed to choose an appropriate procedure (though I'd still suggest diagnostic checking).
Since your data are probability estimates, they'll be bounded. In fact, the left skewness may simply be caused by some probability assessments getting relatively close to 100%. If that's the case, you should also tend to doubt your assumption of linearity, and that will be a likely cause of heteroskedasticity as well. If my guess about getting close to the upper bound is right, you'll tend to see lower spread among the groups with higher means.
You might consider an analysis suited to continuous proportions, perhaps a beta-regression - at least if you have no data exactly on either boundary. (An alternative might be a transformation, but models that deal with the data you have tend to be both easier to defend and more interpretable.)
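If your software has it, this is straightforward. Here's a sketch using statsmodels' BetaModel; I'm assuming a reasonably recent statsmodels (where it lives in statsmodels.othermod.betareg) and responses strictly inside (0, 1), and the toy data are mine:

```python
# Sketch: beta regression for continuous proportions strictly in (0, 1).
import numpy as np
import pandas as pd
from statsmodels.othermod.betareg import BetaModel  # recent statsmodels versions

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c", "d"], 30),
    "y": np.clip(rng.beta(5, 2, 120), 1e-3, 1 - 1e-3),  # toy proportions, kept off the boundaries
})

res = BetaModel.from_formula("y ~ C(group)", df).fit()
print(res.summary())   # mean model on the logit scale, plus a precision parameter
```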
With your decent-sized sample, you are probably safe enough on non-normality, but heteroskedasticity might be more of an issue - in particular, heteroskedasticity issues don't decrease with sample size.
On the other hand, if your sample sizes are equal (or at least very nearly so), heteroskedasticity is of little consequence; your tests will be little affected.
If your sample sizes are not equal, I suggest you:
i) don't assume heteroskedasticity will simply be okay
ii) don't formally test it, for the same reasons outlined above (testing answers the wrong question)
Instead, I suggest you start from the assumption that the variances differ. You could use something like the Welch approach (I can't say I know off the top of my head how that works for a 2x2 design, but it should be quite possible to make it work there, since it only affects the calculation of the residual variance and its d.f.). Alternatively, you could implement your ANOVA as a regression and move to something like heteroskedasticity-consistent standard errors (used more widely in areas like econometrics), as sketched below.
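A sketch of the regression-with-robust-errors route in statsmodels (HC3 is one common choice among the HC variants; the toy data and group labels are mine):

```python
# Sketch: one-way ANOVA as a regression with heteroskedasticity-consistent
# (HC3) standard errors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
spec = [("a", 0.0, 1, 20), ("b", 0.4, 2, 35), ("c", 0.2, 1, 25), ("d", 0.6, 3, 40)]
df = pd.DataFrame({
    "group": np.repeat([g for g, *_ in spec], [n for *_, n in spec]),   # unequal n
    "y": np.concatenate([rng.normal(mu, sd, n) for _, mu, sd, n in spec]),
})

fit = smf.ols("y ~ C(group)", data=df).fit(cov_type="HC3")
# joint (robust) test that all group effects are zero
print(fit.f_test("C(group)[T.b] = 0, C(group)[T.c] = 0, C(group)[T.d] = 0"))
```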
Best Answer
Regardless of the good points made in the comments above about whether you should condition your testing procedure on the results of preliminary checks (e.g. choosing Welch vs. standard t-tests based on the outcome of Levene's test), I suspect the reason for this difference between ANOVA/t-tests (i.e., linear models where all of the predictors are categorical) and other linear models such as regression and ANCOVA (i.e., linear models with at least one continuous predictor) is that questions of heteroscedasticity apply to the conditional distribution of the data, i.e. the distribution of $\epsilon$ in $y=\beta_0 + \beta_1 x + \ldots + \epsilon$.