Solved – Wilcoxon-Mann-Whitney test giving surprising results

hypothesis-testing · r · wilcoxon-mann-whitney-test

I'm comparing two datasets and want to test the null hypothesis that they come from the same population. Both datasets contain a lot of meaningful zeros (around 50% of the values are 0). One of them, "test", has more zeros than the other, but its non-zero values are higher. Here are two simplified datasets that illustrate the data I'm using.

control <- c(0, 0, 15, 0, 25, 0, 17, 0, 0, 12, 0)

test <- c(0, 0, 0, 0, 30, 0, 19, 0, 0, 20, 0)

I've had trouble figuring out the most appropriate test for comparing the two datasets, but I've tentatively settled on Mann-Whitney because of the distribution and the high number of zeros (I found some articles saying this test was appropriate for this kind of experiment).

The strange thing I'm seeing is that occasionally I'll get a significant result, but it appears to be telling me that the group with the lower mean value (zeros are included when calculating the mean) is actually the group with the smaller rank sum. I even tried checking the rank sums for the two datasets manually and found that the one with the higher mean was lower, as expected. The numbers roughly represent a kind of "revenue per user", so I'm definitely looking at average overall revenue as my outcome of interest.

Can anyone help me make sense of the results? I've also included the R code below that I ran to discover this (note that the above dataset won't work with the code…that was just a sample).

# This test gives me a p value of .05 when the avg value of test 
# is higher and sumrank is lower

wilcox.test(test, control, paired = FALSE)

# This test gives me a p value of ~.95 with the same dataset
wilcox.test(test, control, paired = FALSE, alternative = 'greater')

I confirmed that the 'greater' alternative works as I'd expect it by running the following code.

# This test gives me a p value of ~0
wilcox.test(test*100, control, paired=FALSE, alternative = 'greater')

Would love any advice that people are willing to give!

Best Answer

This seems rather confused. It isn't clear what you are trying to do, or what the trouble is. Part of the problem is that your example is messed up and doesn't yield the results you report:

control <- c(0, 0, 15, 0, 25, 0, 17, 0, 0, 12, 0)
test    <- c(0, 0,  0, 0, 30, 0, 19, 0, 0, 20, 0)
summary(control)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  0.000   0.000   0.000   6.273  13.500  25.000 
summary(test)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  0.000   0.000   0.000   6.273   9.500  30.000 

aggregate(rank(values) ~ ind, stack(list(test=test, control=control)), sum)
#       ind rank(values)
# 1 control          128
# 2    test          125

From the above, we can see that your example data have the same mean and median in both groups, and that the rank sum is lower for the test group. If I run Mann-Whitney U-tests on these data with your code, I don't get your results:

wilcox.test(test, control, paired=FALSE)
#         Wilcoxon rank sum test with continuity correction
# 
# data:  test and control
# W = 59, p-value = 0.9367
# alternative hypothesis: true location shift is not equal to 0
wilcox.test(test, control, paired=FALSE, alternative='greater')
#         Wilcoxon rank sum test with continuity correction
# 
# data:  test and control
# W = 59, p-value = 0.5631
# alternative hypothesis: true location shift is greater than 0

  1. Moving away from your example, we can ask: What does the Mann-Whitney U-test test?

    The Mann-Whitney tests if one sample is stochastically greater than the other. Imagine your numbers were written on individual raffle tickets and placed in two fishbowls. You will draw one ticket at random from each group. You are testing if it is more likely that one ticket will have a higher number than the other ticket. Note that there are 121 possible pairs of tickets you could draw. Here are the numbers of possible pairings that are greater than, less than, or equal:

    sum(outer(test, control, FUN=">"))   # [1]  31
    sum(outer(test, control, FUN="<"))   # [1]  34
    sum(outer(test, control, FUN="=="))  # [1]  56
    length(test)*length(control)         # [1] 121
    

    These are pretty balanced (as you would expect from the output at the top) so we shouldn't expect a significant result, but there are fewer possibilities where test is greater than control.
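    As a sanity check (my own addition, using the standard definition of the U statistic as the number of pairwise "wins" plus half the ties), these pair counts reproduce the W value that `wilcox.test` reports:

    ```r
    control <- c(0, 0, 15, 0, 25, 0, 17, 0, 0, 12, 0)
    test    <- c(0, 0,  0, 0, 30, 0, 19, 0, 0, 20, 0)

    # U = (# pairs where test > control) + 0.5 * (# tied pairs)
    U <- sum(outer(test, control, FUN=">")) +
         0.5 * sum(outer(test, control, FUN="=="))
    U                                     # 31 + 56/2 = 59
    wilcox.test(test, control)$statistic  # W = 59, the same count
    ```

    (The call will warn about ties, which is expected with this many zeros.)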

  2. Should you use a Mann-Whitney U-test since your data include lots of $0$s?

    The naive, stats-101 scheme is that you should use a t-test to compare two groups unless the data are not normal. If you had enough data, you could probably still use a t-test despite the non-normality, but what constitutes 'enough' is generally ambiguous. (In your case, it's clear that your data are too non-normal, and you have nowhere near 'enough' data to ignore that.) People believe that the Mann-Whitney U-test is a test for differences of means (or maybe medians) with non-normal data, but that isn't true.

    The t-test and the Mann-Whitney U-test are testing different null hypotheses. The t-test tests if the samples are drawn from populations with the same mean. The Mann-Whitney U-test checks if one is stochastically greater (as just discussed). You need to use a test that assesses what you are trying to figure out.

  3. What should your null hypothesis be?

    That's really up to you. That is a theoretical decision that precedes data analysis (and hopefully study design).

    If your data represent profits from customers, one thing to consider is how many customers you are likely to have. If you will only get one, you might want to determine if one method will likely yield a more profitable customer. On the other hand, it can easily be that the group that is stochastically greater has a lower mean, and if so, with enough customers you would eventually yield greater profits by serving the group with the higher mean (provided you can stay in business long enough):

    [Figure: simulated distributions in which the stochastically greater group has the lower mean.]
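    To make this concrete, here is a toy illustration (made-up numbers, not from the question): one group almost always "wins" a pairwise comparison, yet the other group has a far higher mean:

    ```r
    # Group A always pays $1; group B usually pays $0 but rarely pays $100
    A <- rep(1, 10)
    B <- c(rep(0, 9), 100)

    mean(A)                     # 1
    mean(B)                     # 10
    mean(outer(A, B, FUN=">"))  # 0.9: A beats B in 90% of pairings
    ```

    With a single customer you would prefer A, but over many customers B yields ten times the revenue.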

  4. How would you test if the means differ?

    You can empirically estimate the sampling distribution by bootstrapping when you cannot assume normality:

    set.seed(997)  # this makes the example exactly reproducible
    B      = 10000
    t.stat = vector(length=B)
    for(i in 1:B){
      bt = sample(test,    size=length(test),    replace=TRUE)
      bc = sample(control, size=length(control), replace=TRUE)
      t.stat[i] = t.test(bt, bc)$statistic
    }
    mean(t.stat>0)    # [1] 0.490047
    2*mean(t.stat>0)  # [1] 0.980094
    

Your p-value is $0.98$, which isn't surprising since your means are identical.
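Beyond the bootstrap above, a permutation test of the difference in means is another standard option (my addition, not part of the original answer): it tests the sharp null that the group labels are exchangeable, by reshuffling the labels and recomputing the mean difference:

```r
set.seed(997)
control <- c(0, 0, 15, 0, 25, 0, 17, 0, 0, 12, 0)
test    <- c(0, 0,  0, 0, 30, 0, 19, 0, 0, 20, 0)

obs    <- mean(test) - mean(control)  # 0: the two group means are identical
pooled <- c(test, control)
n.test <- length(test)

perm <- replicate(10000, {
  idx <- sample(length(pooled), n.test)   # reassign the group labels at random
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))  # two-sided p-value; exactly 1 here since obs = 0
```

With these example data the observed difference is 0, so every permutation is at least as extreme and the p-value is 1, consistent with the bootstrap result above.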