[Math] Two-sample test for ordinal data.

correlation, hypothesis testing, statistics

I have a question in a survey X that can be rated between 1 and 10 (ordinal). The answers can be split into group A and group B.

I want to know whether the mean of group A's answers differs significantly from the mean of group B's answers. Which test is best for this, and how can I run it in SPSS?

Thank you very much for your help!

Best Answer

This seems to be a two-sample test with Groups 1 (of size $n_1$) and 2 (of size $n_2$). Your data are scores from 1 to 10 on the question.

Welch t test. If $n_1$ and $n_2$ are large enough (perhaps both above 20), you might be able to get a reliable answer using a Welch 2-sample t-test.

Wilcoxon test. You are almost sure to have lots of ties (repeated scores) even if both sample sizes are relatively small. Thus you will get warning messages about ties when trying to do a Wilcoxon rank-sum test, along with an approximate P-value or a statement that a P-value is not available (depending on the software you use).

Permutation test. Perhaps it is best to do a permutation test. Under the null hypothesis that the two groups tend to give the same responses to the question, the argument is that the scores could be permuted between Groups A and B without effect. So if we choose some measure of difference such as the difference $D = \bar X_1 - \bar X_2$ between the two sample means, we can use either combinatorics or simulation to get the null permutation distribution of $D$, and judge whether your observed value of $D$ is consistent with the null distribution.
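To make the combinatorial route concrete, here is a minimal sketch on made-up toy scores. The vectors a and b are hypothetical and are chosen only so that every split can be listed; nothing here comes from the question's data. It enumerates all ways of assigning the pooled scores to the two groups and takes the exact two-sided P-value as the share of splits with $|D|$ at least as large as observed.

```r
# Hypothetical toy scores (not from the question), small enough to enumerate.
a <- c(3, 5, 6)
b <- c(7, 9, 10, 8)
pooled <- c(a, b)
n_a <- length(a)

d_obs <- mean(a) - mean(b)

# Every way of choosing which n_a of the pooled scores form the first group.
splits <- combn(length(pooled), n_a)   # one column per split
d_all <- apply(splits, 2, function(i) mean(pooled[i]) - mean(pooled[-i]))

# Exact two-sided P-value: share of splits at least as extreme as observed.
# The small tolerance guards against floating-point near-ties.
p_exact <- mean(abs(d_all) >= abs(d_obs) - 1e-12)
p_exact

# Number of splits for two groups of 25: far too many to list one by one.
choose(50, 25)
```

With these toy scores only the observed split is that extreme, so the exact P-value is 1/35. For two groups of 25 the number of splits exceeds $10^{14}$, which is why simulation is the practical choice.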

Example. I will illustrate each kind of test using fake data with 25 subjects in each group (although none of the tests require sample sizes to be equal).

Here are listings and summaries of some fake data to use for testing.

x1; summary(x1)
##  9  6  4 10  5  5  8  8  8  8  8  9  8  6  4  7  8  9  8  6  8  8  5  8  9
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  4.00    6.00    8.00    7.28    8.00   10.00 

x2; summary(x2)
## 10  9 10  7  7  8  8 10  8  5  7  7  7  5  8 10 10 10  9  7  9 10 10 10  9
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.0     7.0     9.0     8.4    10.0    10.0 

A quick look shows means to be greater in Group 2 than in Group 1. Is this difference statistically significant?

t test: A Welch 2-sample t test in R statistical software finds a significant difference. (P-value $\approx$ 2%.) The only doubt is whether data are sufficiently nearly normally distributed for the t test to give accurate results. (Data for both groups spectacularly fail a Shapiro-Wilk test with P-values < .01. But sample sizes may be large enough for the t test to be useful anyhow.)

t.test(x1, x2)

##        Welch Two Sample t-test

## data:  x1 and x2 
## t = -2.434, df = 47.853, p-value = 0.01872
## alternative hypothesis: true difference in means is not equal to 0 
## sample estimates:
## mean of x mean of y 
##      7.28      8.40 
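As a check on what t.test is doing, the Welch statistic and its approximate degrees of freedom can be computed by hand. This is only a sketch of the standard Welch-Satterthwaite formulas applied to the fake data listed above; the variable names are mine. The last two lines rerun the Shapiro-Wilk normality check mentioned earlier.

```r
x1 <- c(9, 6, 4, 10, 5, 5, 8, 8, 8, 8, 8, 9, 8, 6, 4, 7, 8, 9, 8, 6, 8, 8, 5, 8, 9)
x2 <- c(10, 9, 10, 7, 7, 8, 8, 10, 8, 5, 7, 7, 7, 5, 8, 10, 10, 10, 9, 7, 9, 10, 10, 10, 9)

n1 <- length(x1); n2 <- length(x2)
v1 <- var(x1) / n1   # squared standard error of the first sample mean
v2 <- var(x2) / n2   # squared standard error of the second sample mean

# Welch t statistic: difference in means over its estimated standard error
t_stat <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided P-value from the t distribution with df_welch degrees of freedom
p_two_sided <- 2 * pt(-abs(t_stat), df_welch)

c(t = t_stat, df = df_welch, p = p_two_sided)

# Shapiro-Wilk normality checks for each group
shapiro.test(x1)$p.value
shapiro.test(x2)$p.value
```

The three values in the first printout should agree with the t, df, and p-value lines of the t.test output above.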

Wilcoxon test: The Wilcoxon rank-sum test for a difference in medians gives a (tentative) P-value of about 2.7%, but warns that it may not be accurate. There are only seven distinct values among the 50 subjects, so the number of ties is 'massive'. Because the Wilcoxon test is based on a comparison of ranks, that many ties can be problematic, and I would not want to trust its result.

wilcox.test(x1, x2)

##     Wilcoxon rank sum test with continuity correction

## data:  x1 and x2 
## W = 200.5, p-value = 0.02702
## alternative hypothesis: true location shift is not equal to 0 

## Warning message:
## In wilcox.test.default(x1, x2) : cannot compute exact p-value with ties
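To see how heavy the ties are, one can tabulate the pooled scores. The sketch below also asks wilcox.test for the normal approximation explicitly (exact = FALSE), which avoids the warning because no exact P-value is attempted. It does not make the approximation any more trustworthy; it only makes the choice visible.

```r
x1 <- c(9, 6, 4, 10, 5, 5, 8, 8, 8, 8, 8, 9, 8, 6, 4, 7, 8, 9, 8, 6, 8, 8, 5, 8, 9)
x2 <- c(10, 9, 10, 7, 7, 8, 8, 10, 8, 5, 7, 7, 7, 5, 8, 10, 10, 10, 9, 7, 9, 10, 10, 10, 9)

scores <- c(x1, x2)
table(scores)                           # how often each score occurs in the pooled data
n_distinct <- length(unique(scores))    # number of distinct scores among all 50
n_distinct

# Normal approximation with continuity correction, requested explicitly
res <- wilcox.test(x1, x2, exact = FALSE, correct = TRUE)
res$statistic
res$p.value
```

The statistic and P-value should match the output above, since R falls back to the same approximation when ties are present.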

Permutation test. It would be tedious to derive the exact permutation distribution of $D$ for this example. The usual cure is to simulate a large number of permutations and to approximate the P-value from simulation results. Here is a brief program in R to find the approximate P-value (2.1%) of the permutation test. (You may get a slightly different P-value at each run of the program, but not by enough to matter in the interpretation. For this program, subsequent runs all gave values rounding to 2%.)

m = 10^4;  d.perm = numeric(m)
all = c(x1, x2);  d.obs = mean(x1) - mean(x2)
n1 = n2 = 25
for (i in 1:m) {
  perm = sample(all, n1+n2)
  d.perm[i] = mean(perm[1:n1]) - mean(perm[(n1+1):(n1+n2)])
  }
mean(abs(d.perm) >= abs(d.obs))
## 0.0215
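If you want to reuse the simulation with your own data, the loop above can be wrapped in a small function. This is a sketch only: the name perm_test and the seed are my own choices, and it works for unequal group sizes because it takes the first group's size from the data.

```r
x1 <- c(9, 6, 4, 10, 5, 5, 8, 8, 8, 8, 8, 9, 8, 6, 4, 7, 8, 9, 8, 6, 8, 8, 5, 8, 9)
x2 <- c(10, 9, 10, 7, 7, 8, 8, 10, 8, 5, 7, 7, 7, 5, 8, 10, 10, 10, 9, 7, 9, 10, 10, 10, 9)

# perm_test is a hypothetical helper name, not a function from any package.
perm_test <- function(x, y, m = 10^4) {
  pooled <- c(x, y)
  n_x <- length(x)
  d_obs <- mean(x) - mean(y)
  d_perm <- replicate(m, {
    idx <- sample(length(pooled), n_x)        # random relabelling of the scores
    mean(pooled[idx]) - mean(pooled[-idx])    # difference in means for this relabelling
  })
  list(d_obs = d_obs,
       d_perm = d_perm,
       p_value = mean(abs(d_perm) >= abs(d_obs)))   # two-sided P-value
}

set.seed(2024)   # arbitrary seed, only to make the run repeatable
res <- perm_test(x1, x2)
res$p_value
```

The returned d_perm vector can be passed to hist to draw the permutation distribution.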

Here is a histogram of the approximate permutation distribution. The solid red line at the left is the observed value of $D$ for the data above. The dotted red line at the right is just as extreme (far from 0) as the observed value of $D.$ The P-value of this 2-sided permutation test is the percentage of values in the permutation distribution outside these red lines, in this case, 2.1%.

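[Figure: histogram of the simulated permutation distribution of $D$, with red vertical lines at the observed value and at its mirror image]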

Conclusion: The two groups differ significantly. The t test is probably OK, because, for samples this large, the Central Limit Theorem tends to make the sample means very nearly normal even if the data are not normal. For groups as small as ten, I would certainly insist on seeing permutation test results before drawing a conclusion.

You can read more about permutation tests in this paper by Eudey. The two-sample test above is discussed, with additional examples, in Section 4.

Almost certainly, your data will look different than my fake data. Please let me know if you have trouble relating my answer to your specific data.

Note: The fake data above were generated using the R code below, by rescaling draws from beta distributions with respective means 3/5 and 5/6 to scores from 1 to 10. (Because the populations differ, it is appropriate that the tests found a significant difference.) By using the same seed I used, you should get exactly the same data.

set.seed(1234)
x1 = ceiling(10*rbeta(25, 3, 2))
x2 = ceiling(10*rbeta(25, 5, 1))
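For the record, the population means behind this code can be worked out directly. The sketch below uses the beta mean $a/(a+b)$ and, for the rescaled score $X = \lceil 10U \rceil$, the identity $E(X) = \sum_{k=0}^{9} P(X > k)$ with $P(X > k) = P(U > k/10)$. The helper name score_mean is mine.

```r
# Means of the underlying Beta(a, b) distributions: a / (a + b)
beta_means <- c(3 / (3 + 2), 5 / (5 + 1))
beta_means

# Mean of the score X = ceiling(10 * U) for U ~ Beta(a, b):
# E[X] = sum over k = 0..9 of P(X > k), and P(X > k) = 1 - pbeta(k / 10, a, b).
score_mean <- function(a, b) sum(1 - pbeta((0:9) / 10, a, b))

score_means <- c(score_mean(3, 2), score_mean(5, 1))
score_means
```

On the 1 to 10 scale the two population means come out at about 6.5 and 8.8, so the populations differ by a bit more than two points.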

Addendum (Your Data from Comment). Your result in the Comment seems OK. It is significant at the 10% level (P-value 9.3% < 10%), which is sometimes optimistically called "suggestive" of significance.

If you honestly expected (before seeing data) Group 2 scores to be higher, then maybe this should be a left-sided test of $H_0: \mu_1 \ge \mu_2$ vs. $H_a: \mu_1 < \mu_2.$ If so, the P-value would be 3.8% < 5%, giving significance at the 5% level.
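As a cross-check on the one-sided idea, the same left-sided alternative can be requested from the Welch t test. This is a sketch using the data from the Comment; it is a different test from the permutation test, so its P-value need not match the permutation result. When the t statistic is negative, the left-sided P-value is half the two-sided one.

```r
x1 <- c(0, 7, 10, 0, 9, 5, 10, 6, 8, 7, 8, 2, 2, 8, 10, 7, 10)
x2 <- c(7, 4, 10, 10, 9, 10, 10, 9, 10, 7, 5, 10, 10, 10, 10, 5, 10, 2)

# t.test uses the Welch version by default (var.equal = FALSE)
two_sided <- t.test(x1, x2)
left_sided <- t.test(x1, x2, alternative = "less")   # H_a: mu_1 < mu_2

c(two_sided = two_sided$p.value, left_sided = left_sided$p.value)
```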

x1 <- c(0,7,10,0,9,5,10,6,8,7,8,2,2,8,10,7,10) 
x2 <- c(7,4,10,10,9,10,10,9,10,7,5,10,10,10,10,5,10,2)  
all = c(x1, x2);  gp = rep(0:1, times = c(17,18))
stripchart(all~gp, meth="stack", pch=19, col=c("blue", "green3"))

[Figure: stacked stripchart of the scores in the two groups]

A Welch t test gives P-value 0.09024. Below, the permutation test is repeated with m = 10^6 iterations to reduce the possibility of simulation error.

x1 <- c(0,7,10,0,9,5,10,6,8,7,8,2,2,8,10,7,10) 
x2 <- c(7,4,10,10,9,10,10,9,10,7,5,10,10,10,10,5,10,2) 
m = 10^6;  d.perm = numeric(m) 
all = c(x1, x2);  d.obs = mean(x1) - mean(x2) 
n1 = length(x1);  n2 = length(x2) 
for (i in 1:m) { perm = sample(all, n1+n2) 
  d.perm[i] = mean(perm[1:n1]) - mean(perm[(n1+1):(n1+n2)]) } 
mean(abs(d.perm) >= abs(d.obs)) 
## 0.093149
## 0.093183  # 2nd run with m=10^6

mean(d.perm < d.obs)
## 0.038349  # P-value of LEFT SIDED test


length(unique(d.perm))
## 75        # uniquely different sim. values of D (enough)

hist(d.perm, prob=T, col="skyblue2", main="Simulated Permutation Distribution")
abline(v=d.obs, col="red", lwd=2)
abline(v=-d.obs, col="red", lwd=2, lty="dashed")

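[Figure: histogram titled "Simulated Permutation Distribution", with red vertical lines at the observed $D$ and at its negative]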

Note: If this is for a reviewed paper, you might get criticism (as noted by @Nameless) that the permutation test involves taking sample means of ordinal data. Possible nonparametric, ordinal-oriented alternatives:

(a) Use median instead of mean in the permutation test when finding d.obs and (within the loop) when finding d.perm, but not at the end when finding the P-value. (In R, the mean of a logical vector is the proportion of its TRUEs.) Trouble is, I got only about 20 uniquely different values of d.perm that way; not quite enough for my taste. One-sided P-value 0.047.

(b) Do a Welch t test on rank-transformed data. (Ranks are appropriate for ordinal data; their means are likely not far from normal with sample sizes above 15.) From t.test(rank(all) ~ gp, alte="less"), I get (Welch, one-sided) P-value 0.03457.
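Both alternatives can be sketched in a few lines with the data from the Comment. The variable names and the seed are my own, and m is kept at 10^4 so the sketch runs quickly; the simulated P-value in part (a) will vary a little from run to run and with m.

```r
x1 <- c(0, 7, 10, 0, 9, 5, 10, 6, 8, 7, 8, 2, 2, 8, 10, 7, 10)
x2 <- c(7, 4, 10, 10, 9, 10, 10, 9, 10, 7, 5, 10, 10, 10, 10, 5, 10, 2)
pooled <- c(x1, x2)
n1 <- length(x1)
gp <- rep(0:1, times = c(length(x1), length(x2)))   # group labels: 0 for x1, 1 for x2

# (a) Permutation test with the difference in medians as the measure
set.seed(2024)   # arbitrary seed, only to make the run repeatable
m <- 10^4
d_obs_med <- median(x1) - median(x2)
d_med <- replicate(m, {
  idx <- sample(length(pooled), n1)            # random relabelling of the scores
  median(pooled[idx]) - median(pooled[-idx])
})
n_unique_med <- length(unique(d_med))   # distinct simulated differences
n_unique_med
p_med_left <- mean(d_med < d_obs_med)   # strict inequality, mirroring mean(d.perm < d.obs) above
p_med_left

# (b) Left-sided Welch t test on the ranks of the pooled scores
rank_test <- t.test(rank(pooled) ~ gp, alternative = "less")
rank_test$p.value
```

Because medians of these scores can take only a few values, the permutation distribution in (a) is much coarser than the one for means.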
