Compare means for discrete data

analysis-of-means, statistical significance

I have two sets of observations: the number of words per verse in dactylic hexameters by two different Greek authors. The histogram has a nice bell shape, but the data are not normal: a verse can only contain a whole number of words. They are not Poisson either: the variance is smaller than the mean (empirically, a verse has no fewer than 4 words and no more than 12).

[Figure: histogram of the number of words per verse]

What are the ways to compare two means to say if they are significantly different?

I was thinking of chi-square for comparison of 2 distributions, but this way, it seems that we are treating 5-word lines and, say, 6-word lines as categorical data (which is not totally devoid of sense, but awkward).

Another option would be to compute means for every n lines (say, 5), and then apply a z-test or a t-test to those means (relying on the Central Limit Theorem).

Both solutions, however, seem far-fetched. Is there a more elegant solution?

upd. Following the suggestions from comments below, I add the results of the Shapiro tests and a qq-plot.

a) The Shapiro-Wilk normality test returns W = 0.94682, p-value = 2.056e-12 for the Argonautica sample and W = 0.95116, p-value = 8.582e-12 for the Iliad sample.

b) QQ-plots look similar for both samples:

[Figure: QQ-plots for the two samples]

upd2: adding the tables of counts for each sample (first line is the number of words in a verse)

iliad_sample

words:   4   5   6   7   8   9  10  11  12
count:  13  33 115 128 124  69  13   3   2

argo_sample

words:   4   5   6   7   8   9  10  11
count:  22  67 136 149  85  32   7   2
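As a quick check on the claim that the variance is smaller than the mean, the summary statistics can be computed directly from the tables of counts. This is only a minimal Python sketch: the dictionary layout and the helper name `mean_and_variance` are illustrative choices, not part of the original analysis.

```python
# Counts of verses by number of words, copied from the tables above.
iliad = {4: 13, 5: 33, 6: 115, 7: 128, 8: 124, 9: 69, 10: 13, 11: 3, 12: 2}
argo = {4: 22, 5: 67, 6: 136, 7: 149, 8: 85, 9: 32, 10: 7, 11: 2}


def mean_and_variance(counts):
    """Return (n, mean, sample variance) for a {value: frequency} table."""
    n = sum(counts.values())
    mean = sum(value * freq for value, freq in counts.items()) / n
    # Sum of squared deviations, weighted by frequency; n - 1 denominator.
    ss = sum(freq * (value - mean) ** 2 for value, freq in counts.items())
    return n, mean, ss / (n - 1)


for name, counts in (("Iliad", iliad), ("Argonautica", argo)):
    n, mean, var = mean_and_variance(counts)
    print(f"{name}: n = {n}, mean = {mean:.3f}, variance = {var:.3f}")
```

Each sample has 500 verses, and in both the variance comes out well below the mean, which is what rules out a Poisson description.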

Best Answer

My take on this is different. Posing this as a question about means, or about medians, or about any other summary, is an example of what Whitehead called misplaced concreteness. The question of interest to me is more diffuse, at least in the first instance: How do the distributions compare?

Incidentally, the OP's edit shows that we have samples of 500 hexameters. So, populations are clearly definable, the complete works being sampled. Whether there is a wider population behind and beyond, e.g. as defining the style of particular authors or groups of authors, I happily leave open.

Like the OP, I want it both ways.

  1. The variables are discrete and necessarily positive. So, neither variable can be taken completely seriously as a candidate for being normal or Gaussian. But the same is true of many examples often presented as good fits to normal distributions, such as people's heights, which are necessarily positive and continuous in principle, but conventionally measured in whole cm or inches.

  2. At least for descriptive or exploratory graphical purposes, a normal quantile plot (normal probability plot, normal scores plot, probit plot, fractile diagram) is a good conventional starting point. The normal distribution is a reference case in the same way that circular shapes or sea level are reference cases: for shapes that we don't expect always, or even ever, to be exactly circular, or for land surface altitudes that we don't expect always, or even ever, to be at sea level.

The large number of ties is clear but showing them graphically is not especially helpful.

I used what in my reading are most often called ridits, which are, for each distinct value of number of words $4(1)12$ (that is, 4 to 12 in steps of 1),

$$ (\text{fraction} \lt \text{value})\ + \ (1/2)\ (\text{fraction} = \text{value}),$$

or, equivalently,

$$ (\text{fraction} \le \text{value})\ - \ (1/2)\ (\text{fraction} = \text{value}),$$

but I show them on a normal quantile scale. (Apologies to those who would prefer that I introduce notation or use more formal language.)
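In code, either form amounts to a running cumulative count. Here is a minimal Python sketch, assuming the Iliad counts from the question. The function names `ridits` and `ridits_alt` are made up for illustration, and exact fractions are used only so that the equality check between the two forms is exact.

```python
from fractions import Fraction

# Iliad counts by number of words per verse (from the question's table).
iliad = {4: 13, 5: 33, 6: 115, 7: 128, 8: 124, 9: 69, 10: 13, 11: 3, 12: 2}


def ridits(counts):
    """Return {value: ridit} as (fraction < value) + (1/2)(fraction = value)."""
    n = sum(counts.values())
    below = 0  # running count of observations strictly below the current value
    result = {}
    for value in sorted(counts):
        result[value] = Fraction(2 * below + counts[value], 2 * n)
        below += counts[value]
    return result


def ridits_alt(counts):
    """Same quantity as (fraction <= value) - (1/2)(fraction = value)."""
    n = sum(counts.values())
    at_or_below = 0  # running count of observations at or below the current value
    result = {}
    for value in sorted(counts):
        at_or_below += counts[value]
        result[value] = Fraction(2 * at_or_below - counts[value], 2 * n)
    return result


iliad_ridits = ridits(iliad)
assert iliad_ridits == ridits_alt(iliad)  # the two definitions agree

for value, r in iliad_ridits.items():
    print(value, float(r))
```

Unlike plain cumulative proportions, ridits treat the two tails symmetrically, and they stay strictly between 0 and 1 for any value that actually occurs, so they can be put on a normal quantile scale.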

Ridits have many other names, such as average cumulative proportions; the mid-distribution function; the grade function; cumulative proportions to midpoint. (The last term appeared in a much used textbook by Dixon and Massey, some years before Bross introduced and named ridits, and the idea is likely to be even older. Bross admitted late in life that he was just choosing a punchy term like logit or probit or rankit, but honouring his wife Rida.)

The merit of this plot could only be that it helps anyone to see coarse and fine structure comparing the distributions. Use of the normal as a reference is less important than that it gives us a way to compare the distributions with each other. Although your view may be different, I think it is sufficiently clearer than a histogram to be worth looking at. I see a systematic shift, typically use of more words in the Iliad, and hints of extra structure that should be interpreted with extreme caution.

A formal test of normality such as Shapiro-Wilk seems to me fairly pointless.

[Figure: normal quantile plot of ridits for the Iliad and Argonautica samples]

I used Stata, but coding is, or should be, trivial in your favourite software if different. It may help that I started with this version of the data.

clear
input float(n_words iliad argo)
 4  13  22
 5  33  67
 6 115 136
 7 128 149
 8 124  85
 9  69  32
10  13   7
11   3   2
12   2   0
end

So the cumulative probabilities to midpoint for the Iliad run from 6.5/500 to 499/500 and the corresponding standard normal deviates to 7 d.p. run from $-$2.2262118 to 2.8781617.
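For readers not using Stata, here is a minimal Python sketch of the same calculation using only the standard library. The function name `normal_deviates` and the decision to skip values with zero count are assumptions made for illustration; they are not taken from the original Stata code.

```python
from statistics import NormalDist

# Same layout as the Stata input above.
n_words = [4, 5, 6, 7, 8, 9, 10, 11, 12]
iliad = [13, 33, 115, 128, 124, 69, 13, 3, 2]
argo = [22, 67, 136, 149, 85, 32, 7, 2, 0]


def normal_deviates(counts):
    """Standard normal quantiles of the cumulative proportions to midpoint.

    Values with zero count are skipped: they contribute no point to the
    plot, and a trailing zero count would give a proportion of exactly 1,
    which has no finite normal quantile.
    """
    n = sum(counts)
    below = 0  # observations strictly below the current value
    out = {}
    for value, freq in zip(n_words, counts):
        if freq > 0:
            midpoint = (below + freq / 2) / n
            out[value] = NormalDist().inv_cdf(midpoint)
        below += freq
    return out


z_iliad = normal_deviates(iliad)
z_argo = normal_deviates(argo)

for value in n_words:
    zi = z_iliad.get(value)
    za = z_argo.get(value)
    print(value,
          None if zi is None else round(zi, 7),
          None if za is None else round(za, 7))
```

The first and last Iliad values should reproduce the deviates quoted above. At each number of words present in both samples, the Iliad deviate should be lower than the Argonautica one, which is the shift towards more words per verse seen in the plot.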
