The null hypothesis of ANOVA is that the within-group variances are equal and all group means are equal. Under this hypothesis, your data provide enough information for testing, provided you can justify assuming the sampling distributions of the group means are close to Normal.
Here's a brief analysis. To establish notation, let there be $d\ge 2$ groups of sizes $n_1, n_2, \ldots, n_d$ comprising $N=n_1+\cdots + n_d$ independent observations. Let the common variance be $\sigma^2$ and the common mean be $\mu.$ If $\mathcal{L}$ is the likelihood, write $\Lambda = -2\log(\mathcal L),$ which is minimized when the likelihood is maximized.
These assumptions imply the group means $x_i$ have Normal$(\mu, \sigma^2/n_i)$ distributions. Therefore, up to an additive constant,
$$\Lambda = \sum_{i=1}^d \left[\log\left(\frac{\sigma^2}{n_i}\right) + \frac{(x_i-\mu)^2}{\sigma^2/n_i}\right].$$
Critical values of the gradient of $\Lambda$ produce the familiar estimates
$$\hat \mu = \frac{1}{N}\sum_{i=1}^d n_i x_i$$
and
$$\hat\sigma^2 = \frac{1}{d}\sum_{i=1}^d n_i(x_i-\hat\mu)^2.$$
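Explicitly, setting the gradient of $\Lambda$ to zero yields the two score equations
$$\frac{\partial \Lambda}{\partial \mu} = -\frac{2}{\sigma^2}\sum_{i=1}^d n_i(x_i-\mu) = 0, \qquad \frac{\partial \Lambda}{\partial \sigma^2} = \frac{d}{\sigma^2} - \frac{1}{\sigma^4}\sum_{i=1}^d n_i(x_i-\mu)^2 = 0,$$
whose unique solution is this pair of estimates.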
At this point it is natural to re-express the data in terms of Z scores as
$$z_i = \frac{x_i - \hat\mu}{\sqrt{\hat\sigma^2 / n_i}}$$
because each of these is approximately standard Normal and they are approximately independent. Consequently, you could inspect this set of Z scores for deviations from a standard Normal distribution. You might use a Normal-theory outlier test, for instance; or you could construct a statistic such as the Kolmogorov-Smirnov statistic or Anderson-Darling statistic. For formal testing, bootstrapping the distribution of this statistic from the estimates will work (and helps us avoid extensive further analysis!).
As an example, the estimates for the data $X=(6.58, 7.4, 3.2)$ with group sizes $n=(20, 2, 15)$ are
$$(\hat \mu, \hat\sigma^2) = (5.25, 35.89)$$
with corresponding Z scores
$$z = (0.99, 0.51, -1.33).$$
These aren't unusual, so we haven't found significant evidence of a difference in group means. Indeed, it's scarcely possible to do so with just three groups: one of the means would have to be far from the other two.
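As an aside, the Anderson-Darling statistic mentioned earlier can also be computed directly from such Z scores. Here is a minimal R sketch of its definitional formula (`ad.stat` is a name I am introducing for illustration; it is not used by the code later in this answer):

```r
# Anderson-Darling statistic for agreement of scores z with the
# standard Normal distribution, straight from its defining formula.
ad.stat <- function(z) {
  z <- sort(z)
  n <- length(z)
  Fz <- pnorm(z)  # standard Normal CDF at the ordered scores
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log(Fz) + log(1 - rev(Fz))))
}
ad.stat(c(0.99, 0.51, -1.33))  # the Z scores of the example
```

Its null distribution could be bootstrapped in the same parametric fashion as the KS statistic below.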
Suppose, though, that you have more data from more studies and therefore you have means of more than three groups. Then this approach could yield significant results. As an example, consider ten groups with means $x=(1,2,\ldots, 9, 30).$ That last value of $30$ is extreme. The estimated parameters are now $\hat\mu=7.5$ and $\hat\sigma^2=124.5$ with Z scores
$$z = (-0.82, -0.70, \ldots, 0.06, 0.19, 2.85).$$
That last one ($2.85$) is a little unusual and indeed, the parametric bootstrap of the KS statistic gives a p-value of $0.011,$ indicating a significant difference.
The following R code gives an implementation. The p-value for the data in the question is $0.605.$
#
# Data.
#
x <- c(6.58, 7.4, 3.2)
n <- c(20, 2, 15)
# x <- c(1:9, 30)
# n <- rep(2, length(x))
#
# Parameter estimates.
#
theta.hat <- function(x, n) {
  N <- sum(n)
  m <- sum(n*x) / N
  s2 <- mean(n*(x-m)^2)
  c(x.hat=m, s2.hat=s2)
}
(theta <- theta.hat(x, n))
#
# Bootstrap statistic.
#
KS.stat <- function(x, theta, n) {
  p <- sort(pnorm(x, theta["x.hat"], sqrt(theta["s2.hat"]/n)))
  max(abs((1:length(x)) / (length(x)+1) - p))
}
stat <- KS.stat(x, theta, n)
#
# Display Z scores.
#
print(signif((x - theta["x.hat"]) / sqrt(theta["s2.hat"] / n), 2))
#
# Bootstrap the statistic.
#
sim <- replicate(1e4, {
  y <- rnorm(length(n), theta["x.hat"], sqrt(theta["s2.hat"]/n))
  p <- theta.hat(y, n)
  KS.stat(y, p, n)
})
print(c(`p-value`=mean(c(stat, sim) >= stat)))
Best Answer
My take on this is different. Posing this as a question about means, or about medians, or about any other summary, is an example of what Whitehead called misplaced concreteness. The question of interest to me is more diffuse, at least in the first instance: How do the distributions compare?
Incidentally, the OP's edit shows that we have samples of 500 hexameters. So, populations are clearly definable, the complete works being sampled. Whether there is a wider population behind and beyond, e.g. as defining the style of particular authors or groups of authors, I happily leave open.
Like the OP, I want it both ways.
The variables are discrete and necessarily positive. So, neither variable can be taken completely seriously as a candidate for being normal or Gaussian. But neither can many examples often presented as good fits to normal distributions, such as people's heights: necessarily positive, continuous in principle, but conventionally measured in cm or inches.
At least for descriptive or exploratory graphical purposes, a normal quantile plot (normal probability plot, normal scores plot, probit plot, fractile diagram) is a good conventional starting point. The normal distribution is a reference case in the same way that circular shapes or sea level are reference cases for shapes that we don't expect all or even ever to be exactly circular, or land-surface altitudes that we don't expect all or even ever to be at sea level.
The large number of ties is clear but showing them graphically is not especially helpful.
I used what in my reading are most often called ridits, which are, for each distinct value (here number of words, $4(1)12$, i.e. $4$ to $12$),
$$ (\text{fraction} \lt \text{value})\ + \ (1/2)\ (\text{fraction} = \text{value}),$$
or equivalently
$$ (\text{fraction} \le \text{value})\ - \ (1/2)\ (\text{fraction} = \text{value}),$$
but I show them on a normal quantile scale. (Apologies to those who would prefer that I introduce notation or use more formal language.)
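Although my own code was in Stata, the first definition above is a one-liner in most software. Here is a minimal R sketch on a small made-up sample (the helper name `ridit` and the data are mine, for illustration only; the real data are 500 word counts per hexameter):

```r
# Ridit (cumulative proportion to midpoint) of each observation
# within its own sample, then mapped to the normal quantile scale.
ridit <- function(x) {
  n <- length(x)
  sapply(x, function(v) (sum(x < v) + 0.5 * sum(x == v)) / n)
}
x <- c(4, 5, 5, 6, 6, 6, 7, 9)  # made-up word counts
r <- ridit(x)
qnorm(r)  # plotting positions on a standard normal quantile scale
```

Because ridits always lie strictly inside $(0,1)$, the normal quantile transform is finite even at the sample extremes, unlike the raw empirical CDF.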
Ridits have many other names, such as average cumulative proportions; the mid-distribution function; the grade function; cumulative proportions to midpoint. (The last term appeared in a much used textbook by Dixon and Massey, some years before Bross introduced and named ridits, and the idea is likely to be even older. Bross admitted late in life that he was just choosing a punchy term like logit or probit or rankit, but honouring his wife Rida.)
The merit of this plot could only be that it helps anyone to see coarse and fine structure comparing the distributions. Use of the normal as a reference is less important than that it gives us a way to compare the distributions with each other. Although your view may be different, I think it is sufficiently clearer than a histogram to be worth looking at. I see a systematic shift, typically use of more words in the Iliad, and hints of extra structure that should be interpreted with extreme caution.
A formal test of normality such as Shapiro-Wilk seems to me fairly pointless.
I used Stata, but coding is, or should be, trivial in your favourite software if different. It may help that I started with this version of the data.
So the cumulative probabilities to midpoint for the Iliad run from 6.5/500 to 499/500 and the corresponding standard normal deviates to 7 d.p. run from $-$2.2262118 to 2.8781617.
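Those endpoints are easy to verify in R (a check added here, not part of the original Stata workflow):

```r
# Normal deviates of the smallest and largest cumulative
# probabilities to midpoint quoted above.
qnorm(c(6.5/500, 499/500))  # approximately -2.2262118 and 2.8781617
```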