When you have heteroskedasticity, it doesn't make sense to try to check normality of the entire set of residuals, though you could still check groups individually (with corresponding loss of power of course).
On the other hand, it doesn't really make sense to formally test either normality or heteroskedasticity when checking assumptions, since the hypothesis tests answer the wrong question.
This is because your data aren't actually normal (and it's also very unlikely that your populations have identical variances) - so you already know the answer to the question the hypothesis test checks. With a nice large sample like you have, the chance that a powerful test like the Shapiro-Wilk doesn't pick it up is small - so you'll reject as non-normal data from distributions that will have little impact on the significance level or the power. That is, you'll tend to reject normality - even at quite small significance levels - when it really doesn't matter. The test is most likely to reject precisely when it matters least (i.e. when you have a big sample).
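As a quick illustration of that last point (my own sketch; the distribution and sample sizes here are arbitrary choices), take a population that is only mildly non-normal and watch the Shapiro-Wilk rejection rate climb with n:

set.seed(1)
# mildly non-normal population: t with 10 df, quite close to normal in practice
rejrate <- function(n, nsim = 1000)
  mean(replicate(nsim, shapiro.test(rt(n, df = 10))$p.value < 0.05))
rejrate(50)    # rejects only a little more often than the nominal 5%
rejrate(500)   # same mild deviation, rejected far more often at large n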
What you actually want to know is the answer to a different question than the test answers - something like "How much does this affect my inference?". The hypothesis test doesn't address that question - it doesn't consider the impact of the non-normality, only your ability to detect it.
Further, when you have a sequence "do this equal-variance normal-theory analysis if I don't reject all these tests, otherwise do this analysis if I reject that one, this other analysis if I reject the other one, and something else if I reject both", you must consider the properties of the whole scheme. Such programs of testing usually do worse (in terms of test bias, accuracy of significance level and, very often, power) than just assuming you'd rejected both tests.
So you recommend to just stick with the ANOVA and consider no assumptions violated?
Not quite. In fact, if anything my last sentence above suggests that you assume heteroskedasticity and normality are violated from the start. Either one alone being violated is relatively easy to deal with, both together is a little trickier (but still possible). However, in your case I think you're probably okay, since I think you needn't worry about one of the two:
Normality may not be such a problem - the considerations would be what kind of non-normality you might have, how strongly non-normal it is, and how large a sample you have.
Your sample size seems reasonably large and the distribution pretty mildly left-skew and light-tailed, though that assessment may be confounded with the heteroskedasticity. However, if you had a good understanding of the properties of what was being measured - which you may well have - or information from similar studies, you might be able to make an a priori assessment on that basis and so be better able to choose an appropriate procedure (though I'd still suggest diagnostic checking).
Since your data are probability estimates, they'll be bounded. In fact the left skewness may simply be caused by some probability assessments getting relatively close to 100%. If that's the case, you should also tend to doubt your assumption of linearity, and that will be a likely cause of heteroskedasticity as well. If my guess about getting close to the upper bound is right, you'll tend to see lower spread among the groups with higher means.
You might consider an analysis suited to continuous proportions, perhaps a beta-regression - at least if you have no data exactly on either boundary. (An alternative might be a transformation, but models that deal with the data you have tend to be both easier to defend and more interpretable.)
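In R this might look like the following (a minimal sketch using the betareg package; d, prob and group are hypothetical names, and the response must lie strictly inside (0, 1), so rescale percentages first):

library(betareg)
# prob strictly in (0,1); e.g. prob = percent/100 if recorded as percentages
fit <- betareg(prob ~ group, data = d)   # d, prob, group are placeholders
summary(fit)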
With your decent-sized sample, you are probably safe enough on non-normality, but heteroskedasticity might be more of an issue - in particular, heteroskedasticity issues don't decrease with sample size.
On the other hand, if your sample sizes are equal (or at least very nearly so), heteroskedasticity is of little consequence - your tests will be little impacted.
If equal-sample-sizes are not the case, I suggest you:
i) don't assume heteroskedasticity will simply be okay
ii) don't formally test it, for the same reasons outlined above (testing answers the wrong question)
Instead I suggest you start with the assumption that the variances differ - whether that's to use something like the Welch approach (I can't say I know how that works for 2x2 off the top of my head, but it should be quite possible to make it work in that case, since it only affects the calculation of residual variance and its df), or to implement your ANOVA in a regression and move to something like heteroskedasticity-consistent standard errors (which are used more widely in areas like econometrics).
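Both routes are easy to try in R. A sketch, with a hypothetical data frame d, response y, and two factors f1 and f2 (none of these names are from the original):

# Welch-type one-way ANOVA; the unequal-variance form is the default
oneway.test(y ~ interaction(f1, f2), data = d)   # var.equal = FALSE by default

# or: fit the 2x2 ANOVA as a regression and use heteroskedasticity-
# consistent (sandwich) standard errors
library(sandwich); library(lmtest)
fit <- lm(y ~ f1 * f2, data = d)
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))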
It is important to check the normality of residuals rather than the normality of the collection of all responses.
A mixture of normal observations need not be normal. I will give an illustration with $g = 4$ groups and $n = 10$ replications
in each group. Data are simulated as normal with several different means
and equal variances, but a Shapiro-Wilk test rejects normality for the
$gn = 40$ observations taken together.
set.seed(1234)
g = 4; n = 10                       # g groups, n observations per group
# simulate each group as normal with a different mean but a common SD of 5
x1 = rnorm(n, 20, 5); x2 = rnorm(n, 25, 5)
x3 = rnorm(n, 35, 5); x4 = rnorm(n, 50, 5)
x = c(x1, x2, x3, x4)               # pool all gn = 40 observations
shapiro.test(x)                     # test the pooled data for normality
Shapiro-Wilk normality test
data: x
W = 0.93777, p-value = 0.0291
Taken together, the 40 observations have a normal mixture distribution, which need not be normal. See, for example, the Wikipedia page on mixture distributions, especially the figure near the top of the page.
Looking at residuals. For this simple model, the residuals are found by subtracting the mean of each group from each observation in the group. These 40 residuals do pass the Shapiro-Wilk test.
# residuals: subtract each group's mean from its own observations
r1 = x1 - mean(x1); r2 = x2 - mean(x2)
r3 = x3 - mean(x3); r4 = x4 - mean(x4)
r = c(r1, r2, r3, r4)
shapiro.test(r)                     # the residuals pass the normality test
Shapiro-Wilk normality test
data: r
W = 0.98231, p-value = 0.7743
ANOVA Significant. Because the group population means are quite different,
a one-way ANOVA on my fake data shows a highly significant effect.
gp = as.factor(rep(1:g, each=n))    # group labels: 1,...,g repeated n times
lm.out = lm(x ~ gp); anova(lm.out)  # one-way ANOVA via a linear model
Analysis of Variance Table
Response: x
Df Sum Sq Mean Sq F value Pr(>F)
gp 3 5655.9 1885.31 62.167 2.596e-14 ***
Residuals 36 1091.8 30.33
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
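As a quick check, the residuals can be pulled straight from the fitted model - for this one-way layout they are exactly the group-mean-centered values used above:

shapiro.test(residuals(lm.out))     # same W and p-value as before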
It is precisely in the cases where there is a significant effect that
the aggregated data from all groups are likely to
fail the Shapiro-Wilk normality test.
Best Answer
The assumptions matter insofar as they affect the properties of the hypothesis tests (and intervals) you might use, whose distributional properties under the null are calculated relying on those assumptions.
In particular, for hypothesis tests, the things we might care about are how far the true significance level might be from what we want it to be, and whether power against alternatives of interest is good.
In relation to the assumptions you ask about:
1. Equality of variance
This can certainly impact the significance level, at least when sample sizes are unequal.
(Edit:) An ANOVA F-statistic is the ratio of two estimates of variance (the partitioning and comparison of variances is why it's called analysis of variance). The denominator is an estimate of the supposedly-common-to-all-cells error variance (calculated from residuals), while the numerator, based on variation in the group means, has two components: one from variation in the population means and one due to the error variance. If the null is true, the two variances being estimated are the same (two estimates of the common error variance); this common but unknown value cancels out (because we took a ratio), leaving an F-statistic that depends only on the distribution of the errors (which, under the assumptions, we can show has an F distribution). (Similar comments apply to the t-test I used for illustration.)
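In symbols, for a one-way layout with $g$ groups, $n_i$ observations in group $i$ and $N = \sum_i n_i$ in total (notation added here for concreteness):
$$F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{\sum_{i=1}^{g} n_i\,(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 / (g-1)}{\sum_{i=1}^{g}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})^2 / (N-g)}$$
Under the null and the equal-variance assumption, numerator and denominator both estimate the common error variance $\sigma^2$, which is why it cancels.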
[There's a little bit more detail on some of that information in my answer here]
However, here the two population variances differ across the two differently-sized samples. Consider the denominator (of the F-statistic in ANOVA and of the t-statistic in a t-test) - it is composed of two different variance estimates, not one, so it will not have the "right" distribution (a scaled chi-square for the F and its square root in the case of a t - both the shape and the scale are issues).
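For the two-sample case, the pooled variance makes this concrete:
$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}, \qquad E[s_p^2] = \frac{(n_1-1)\sigma_1^2 + (n_2-1)\sigma_2^2}{n_1+n_2-2},$$
so the estimate is weighted toward the variance of the larger sample, while the variance of the difference in means, $\sigma_1^2/n_1 + \sigma_2^2/n_2$, is driven relatively more by the smaller sample.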
As a result, the F-statistic or the t-statistic will no longer have the F- or t-distribution, but the manner in which it is affected is different depending on whether the large or the smaller sample was drawn from the population with the larger variance. This in turn affects the distribution of p-values.
Under the null (i.e. when the population means are equal), the p-values should be uniformly distributed. However, if the variances and the sample sizes are unequal but the means are equal (so we don't want to reject the null), the p-values are not uniformly distributed. I did a small simulation to show you what happens. In this case I used only 2 groups, so the ANOVA is equivalent to a two-sample t-test with the equal-variance assumption. So I simulated samples from two normal distributions, one with a standard deviation ten times as large as the other's, but with equal means.
For the left-side plot, the larger (population) standard deviation was for n=5 and the smaller standard deviation was for n=30. For the right-side plot the larger standard deviation went with n=30 and the smaller with n=5. I simulated each one 10000 times and found the p-value each time. In each case you want the histogram to be completely flat (rectangular), since this means all tests conducted at some significance level $\alpha$ will actually get that type I error rate. In particular it's most important that the leftmost parts of the histogram stay close to the grey line:
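Here is a minimal sketch of the left-side case (the means, seed and plotting details are my own choices; the text above specifies only the sd ratio, the sample sizes and the 10000 replications):

set.seed(1)
nsim <- 10000
# larger SD goes with the smaller sample: n=5, sd=10 versus n=30, sd=1
p_left <- replicate(nsim, {
  x <- rnorm(5, mean = 0, sd = 10)
  y <- rnorm(30, mean = 0, sd = 1)
  t.test(x, y, var.equal = TRUE)$p.value  # pooled (equal-variance) t-test
})
hist(p_left, breaks = 20, freq = FALSE)   # want this flat at height 1
abline(h = 1, col = "grey")
mean(p_left < 0.05)                       # true rejection rate at the 5% level

Swapping which sample size gets the large standard deviation gives the right-side case.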
As we see, in the left-side plot (larger variance in the smaller sample) the p-values tend to be very small -- we would reject the null hypothesis very often (nearly half the time in this example) even though the null is true. That is, our significance levels are much larger than we asked for. In the right-side plot we see the p-values are mostly large (and so our significance level is much smaller than we asked for) -- in fact not once in ten thousand simulations did we reject at the 5% level (the smallest p-value here was 0.055). [This may not sound like such a bad thing, until we remember that we will also have very low power to go with our very low significance level.]
That's quite a consequence. This is why it's a good idea to use a Welch-Satterthwaite type t-test or ANOVA when we don't have a good reason to assume that the variances will be close to equal -- by comparison it's barely affected in these situations (I simulated this case as well; the two distributions of simulated p-values - which I have not shown here - came out quite close to flat).
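In R the Welch form is in fact the default; re-running the sketch above without var.equal = TRUE shows the repair:

set.seed(1)
nsim <- 10000
p_welch <- replicate(nsim, {
  x <- rnorm(5, mean = 0, sd = 10)
  y <- rnorm(30, mean = 0, sd = 1)
  t.test(x, y)$p.value               # var.equal = FALSE (Welch) by default
})
mean(p_welch < 0.05)                 # should come out close to 0.05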
2. Conditional distribution of the response (DV)
This is somewhat less directly critical - for moderate deviations from normality, the significance level is not much affected in larger samples (though the power can be!).
Here's one example, where the values are exponentially distributed (with identical distributions and sample sizes); we can see the significance-level issue being substantial at small $n$ but reducing at large $n$.
We see that at n=5 there are substantially too few small p-values (the significance level for a 5% test would be about half what it should be), but at n=50 the problem is reduced -- for a 5% test in this case the true significance level is about 4.5%.
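A sketch of that check (assuming the same pooled two-sample t-test as in part 1; the text gives only the distribution and the sample sizes):

set.seed(1)
nsim <- 10000
for (n in c(5, 50)) {
  p <- replicate(nsim, {
    x <- rexp(n); y <- rexp(n)           # identical exponential populations
    t.test(x, y, var.equal = TRUE)$p.value
  })
  cat("n =", n, " rejection rate at the 5% level:", mean(p < 0.05), "\n")
}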
So we might be tempted to say "well, that's fine, if n is big enough to get the significance level to be pretty close", but we may also be throwing away a good deal of power. In particular, it's known that the asymptotic relative efficiency of the t-test relative to widely used alternatives can go to 0. This means that better test choices can get the same power with a vanishingly small fraction of the sample size required to get it with the t-test. You don't need anything out of the ordinary to be going on to need more than, say, twice as much data to have the same power with the t as you would need with an alternative test - moderately heavier-than-normal tails in the population distribution and moderately large samples can be enough to do it.
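To illustrate the power point (my own sketch, not from the original: a moderately heavy-tailed t distribution with 3 df and a small location shift), compare the t-test with the Wilcoxon rank-sum test:

set.seed(1)
nsim <- 2000; n <- 100; shift <- 0.5
power_t <- mean(replicate(nsim, {
  x <- rt(n, df = 3); y <- rt(n, df = 3) + shift
  t.test(x, y)$p.value < 0.05
}))
power_w <- mean(replicate(nsim, {
  x <- rt(n, df = 3); y <- rt(n, df = 3) + shift
  wilcox.test(x, y)$p.value < 0.05
}))
c(t = power_t, wilcoxon = power_w)   # the Wilcoxon typically wins here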
(Other choices of distribution may make the significance level higher than it should be, or substantially lower than we saw here.)