The null hypothesis of ANOVA is that the within-group variances are equal and all group means are equal. Under this hypothesis, your data provide enough information for testing, provided you can justify assuming the sampling distributions of the group means are close to Normal.
Here's a brief analysis. To establish notation, let there be $d\ge 2$ groups of sizes $n_1, n_2, \ldots, n_d$ comprising $N=n_1+\cdots + n_d$ independent observations. Let the common variance be $\sigma^2$ and the common mean be $\mu.$ If $\mathcal{L}$ is the likelihood, write $\Lambda = -2\log(\mathcal L),$ which is minimized when the likelihood is maximized.
These assumptions imply the group means $x_i$ have Normal$(\mu, \sigma^2/n_i)$ distributions. Therefore, up to an additive constant,
$$\Lambda = \sum_{i=1}^d \left[\log\left(\frac{\sigma^2}{n_i}\right) + \frac{(x_i-\mu)^2}{\sigma^2/n_i}\right].$$
Critical values of the gradient of $\Lambda$ produce the familiar estimates
$$\hat \mu = \frac{1}{N}\sum_{i=1}^d n_i x_i$$
and
$$\hat\sigma^2 = \frac{1}{d}\sum_{i=1}^d n_i(x_i-\hat\mu)^2.$$
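Explicitly, setting the gradient of $\Lambda$ to zero yields the two score equations
$$\frac{\partial \Lambda}{\partial \mu} = -\frac{2}{\sigma^2}\sum_{i=1}^d n_i(x_i-\mu) = 0, \qquad \frac{\partial \Lambda}{\partial \sigma^2} = \frac{d}{\sigma^2} - \frac{1}{\sigma^4}\sum_{i=1}^d n_i(x_i-\mu)^2 = 0,$$
whose unique solution is this pair of estimates.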
At this point it is natural to re-express the data in terms of Z scores as
$$z_i = \frac{x_i - \hat\mu}{\sqrt{\hat\sigma^2 / n_i}}$$
because each of these is approximately standard Normal and they are approximately independent. Consequently, you could inspect this set of Z scores for deviations from a standard Normal distribution. You might use a Normal-theory outlier test, for instance; or you could construct a statistic such as the Kolmogorov-Smirnov statistic or Anderson-Darling statistic. For formal testing, bootstrapping the distribution of this statistic from the estimates will work (and helps us avoid extensive further analysis!).
As an example, the estimates for the data $X=(6.58, 7.4, 3.2)$ with group sizes $n=(20, 2, 15)$ are
$$(\hat \mu, \hat\sigma^2) = (5.25, 35.89)$$
with corresponding Z scores
$$z = (0.99, 0.51, -1.33).$$
These aren't unusual, so we haven't found significant evidence of a difference in group means. Indeed, it's scarcely possible to do so with just three groups: one of the means would have to be far from the other two.
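As an aside, the Anderson-Darling statistic mentioned earlier can also be computed directly from such Z scores. Here is a minimal R sketch of its definitional formula (`ad.stat` is a name I am introducing for illustration; it is not used by the code later in this answer):

```r
# Anderson-Darling statistic for agreement of scores z with the
# standard Normal distribution, straight from its defining formula.
ad.stat <- function(z) {
  z <- sort(z)
  n <- length(z)
  Fz <- pnorm(z)  # standard Normal CDF at the ordered scores
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log(Fz) + log(1 - rev(Fz))))
}
ad.stat(c(0.99, 0.51, -1.33))  # the Z scores of the example
```

Its null distribution could be bootstrapped in the same parametric fashion as the KS statistic below.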
Suppose, though, that you have more data from more studies and therefore you have means of more than three groups. Then this approach could yield significant results. As an example, consider ten groups with means $x=(1,2,\ldots, 9, 30).$ That last value of $30$ is extreme. The estimated parameters are now $\hat\mu=7.5$ and $\hat\sigma^2=124.5$ with Z scores
$$z = (-0.82, -0.70, \ldots, 0.06, 0.19, 2.85).$$
That last one ($2.85$) is a little unusual and indeed, the parametric bootstrap of the KS statistic gives a p-value of $0.011,$ indicating a significant difference.
The following R code gives an implementation. The p-value for the data in the question is $0.605.$
#
# Data.
#
x <- c(6.58, 7.4, 3.2)
n <- c(20, 2, 15)
# x <- c(1:9, 30)
# n <- rep(2, length(x))
#
# Parameter estimates.
#
theta.hat <- function(x, n) {
  N <- sum(n)
  m <- sum(n*x) / N
  s2 <- mean(n*(x-m)^2)
  c(x.hat=m, s2.hat=s2)
}
(theta <- theta.hat(x, n))
#
# Bootstrap statistic.
#
KS.stat <- function(x, theta, n) {
  p <- sort(pnorm(x, theta["x.hat"], sqrt(theta["s2.hat"]/n)))
  max(abs((1:length(x)) / (length(x)+1) - p))
}
stat <- KS.stat(x, theta, n)
#
# Display Z scores.
#
print(signif((x - theta["x.hat"]) / sqrt(theta["s2.hat"] / n), 2))
#
# Bootstrap the statistic.
#
sim <- replicate(1e4, {
  y <- rnorm(length(n), theta["x.hat"], sqrt(theta["s2.hat"]/n))
  p <- theta.hat(y, n)
  KS.stat(y, p, n)
})
print(c(`p-value`=mean(c(stat, sim) >= stat)))
Best Answer
My take on this is different. Posing this as a question about means, or about medians, or about any other summary, is an example of what Whitehead called misplaced concreteness. The question of interest to me is more diffuse, at least in the first instance: How do the distributions compare?
Incidentally, the OP's edit shows that we have samples of 500 hexameters. So, populations are clearly definable, the complete works being sampled. Whether there is a wider population behind and beyond, e.g. as defining the style of particular authors or groups of authors, I happily leave open.
Like the OP, I want it both ways.
The variables are discrete and necessarily positive. So, neither variable can be taken completely seriously as a candidate for being normal or Gaussian. But neither can many examples often presented as good fits to normal distributions, such as people's heights: necessarily positive, continuous in principle, but conventionally measured in cm or inches.
At least for descriptive or exploratory graphical purposes, a normal quantile plot (normal probability plot, normal scores plot, probit plot, fractile diagram) is a good conventional starting point. The normal distribution is a reference case in the same way that circular shapes or sea level are reference cases for shapes that we don't expect all or even ever to be exactly circular, or land-surface altitudes that we don't expect all or even ever to be at sea level.
The large number of ties is clear but showing them graphically is not especially helpful.
I used what in my reading are most often called ridits, which are, for each distinct value (here number of words, $4(1)12$, i.e. $4$ to $12$),
$$ (\text{fraction} \lt \text{value})\ + \ (1/2)\ (\text{fraction} = \text{value}),$$
or equivalently
$$ (\text{fraction} \le \text{value})\ - \ (1/2)\ (\text{fraction} = \text{value}),$$
but I show them on a normal quantile scale. (Apologies to those who would prefer that I introduce notation or use more formal language.)
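Although my own code was in Stata, the first definition above is a one-liner in most software. Here is a minimal R sketch on a small made-up sample (the helper name `ridit` and the data are mine, for illustration only; the real data are 500 word counts per hexameter):

```r
# Ridit (cumulative proportion to midpoint) of each observation
# within its own sample, then mapped to the normal quantile scale.
ridit <- function(x) {
  n <- length(x)
  sapply(x, function(v) (sum(x < v) + 0.5 * sum(x == v)) / n)
}
x <- c(4, 5, 5, 6, 6, 6, 7, 9)  # made-up word counts
r <- ridit(x)
qnorm(r)  # plotting positions on a standard normal quantile scale
```

Because ridits always lie strictly inside $(0,1)$, the normal quantile transform is finite even at the sample extremes, unlike the raw empirical CDF.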
Ridits have many other names, such as average cumulative proportions; the mid-distribution function; the grade function; cumulative proportions to midpoint. (The last term appeared in a much used textbook by Dixon and Massey, some years before Bross introduced and named ridits, and the idea is likely to be even older. Bross admitted late in life that he was just choosing a punchy term like logit or probit or rankit, but honouring his wife Rida.)
The merit of this plot could only be that it helps anyone to see coarse and fine structure comparing the distributions. Use of the normal as a reference is less important than that it gives us a way to compare the distributions with each other. Although your view may be different, I think it is sufficiently clearer than a histogram to be worth looking at. I see a systematic shift, typically use of more words in the Iliad, and hints of extra structure that should be interpreted with extreme caution.
A formal test of normality such as Shapiro-Wilk seems to me fairly pointless.
I used Stata, but coding is, or should be, trivial in your favourite software if different. It may help that I started with this version of the data.
So the cumulative probabilities to midpoint for the Iliad run from 6.5/500 to 499/500 and the corresponding standard normal deviates to 7 d.p. run from $-$2.2262118 to 2.8781617.
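Those endpoints are easy to verify in R (a check added here, not part of the original Stata workflow):

```r
# Normal deviates of the smallest and largest cumulative
# probabilities to midpoint quoted above.
qnorm(c(6.5/500, 499/500))  # approximately -2.2262118 and 2.8781617
```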