Solved – Wilcoxon-Mann-Whitney test giving surprising results

hypothesis-testing · r · wilcoxon-mann-whitney-test

I'm comparing two datasets and want to test the null hypothesis that they come from the same population. Both datasets contain a lot of meaningful zeros (around 50% of the values are 0). One of them, "test", has more zeros than the other, but its non-zero values are higher. Here are two simplified datasets that illustrate the data I'm using.

control <- c(0, 0, 15, 0, 25, 0, 17, 0, 0, 12, 0)

test <- c(0, 0, 0, 0, 30, 0, 19, 0, 0, 20, 0)

I've had trouble figuring out the most appropriate test for comparing the two datasets, but I've tentatively settled on Mann-Whitney because of the distribution and the high number of zeros (I found some articles saying this test was appropriate for this kind of experiment).

The strange thing I'm seeing is that occasionally I'll get a significant result, but it appears to be telling me that the group with the lower mean value (zeros are included when calculating the mean) is actually the group with the smaller rank sum. I even tried checking the rank sums for the two datasets manually and found that the one with the higher mean was lower, as expected. The numbers roughly represent a kind of "revenue per user", so I'm definitely looking at average overall revenue as my outcome of interest.

Can anyone help me make sense of the results? I've also included the R code below that I ran to discover this (note that the above dataset won't work with the code…that was just a sample).

# This test gives me a p value of .05 when the avg value of test 
# is higher and sumrank is lower

wilcox.test(test, control, paired = FALSE)

# This test gives me a p value of ~.95 with the same dataset
wilcox.test(test, control, paired = FALSE, alternative = 'greater')

I confirmed that the 'greater' alternative works as I'd expect it by running the following code.

# This test gives me a p value of ~0
wilcox.test(test*100, control, paired=FALSE, alternative = 'greater')

Would love any advice that people are willing to give!

Best Answer

This seems rather confused. It isn't clear what you are trying to do, or what the trouble is. Part of the problem is that your example is messed up and doesn't yield the results you report:

control <- c(0, 0, 15, 0, 25, 0, 17, 0, 0, 12, 0)
test    <- c(0, 0,  0, 0, 30, 0, 19, 0, 0, 20, 0)
summary(control)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  0.000   0.000   0.000   6.273  13.500  25.000 
summary(test)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  0.000   0.000   0.000   6.273   9.500  30.000 

aggregate(rank(values) ~ ind, stack(list(test=test, control=control)), sum)
#       ind rank(values)
# 1 control          128
# 2    test          125

From the above, we can see that your example data have the same mean and median in both groups, and that the rank sum is lower for the test group. If I run Mann-Whitney U-tests on these data with your code, I don't get your results:

wilcox.test(test, control, paired=FALSE)
#         Wilcoxon rank sum test with continuity correction
# 
# data:  test and control
# W = 59, p-value = 0.9367
# alternative hypothesis: true location shift is not equal to 0
wilcox.test(test, control, paired=FALSE, alternative='greater')
#         Wilcoxon rank sum test with continuity correction
# 
# data:  test and control
# W = 59, p-value = 0.5631
# alternative hypothesis: true location shift is greater than 0

  1. Moving away from your example, we can ask: What does the Mann-Whitney U-test test?

    The Mann-Whitney tests if one sample is stochastically greater than the other. Imagine your numbers were written on individual raffle tickets and placed in two fishbowls. You will draw one ticket at random from each group. You are testing if it is more likely that one ticket will have a higher number than the other ticket. Note that there are 121 possible pairs of tickets you could draw. Here are the numbers of possible pairings that are greater than, less than, or equal:

    sum(outer(test, control, FUN=">"))   # [1]  31
    sum(outer(test, control, FUN="<"))   # [1]  34
    sum(outer(test, control, FUN="=="))  # [1]  56
    length(test)*length(control)         # [1] 121
    

    These are pretty balanced (as you would expect from the output at the top) so we shouldn't expect a significant result, but there are fewer possibilities where test is greater than control.
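    As a sanity check (my own addition, using the standard definition of the U statistic as the number of pairwise "wins" plus half the ties), these pair counts reproduce the W value that `wilcox.test` reports:

    ```r
    control <- c(0, 0, 15, 0, 25, 0, 17, 0, 0, 12, 0)
    test    <- c(0, 0,  0, 0, 30, 0, 19, 0, 0, 20, 0)

    # U = (# pairs where test > control) + 0.5 * (# tied pairs)
    U <- sum(outer(test, control, FUN=">")) +
         0.5 * sum(outer(test, control, FUN="=="))
    U                                     # 31 + 56/2 = 59
    wilcox.test(test, control)$statistic  # W = 59, the same count
    ```

    (The call will warn about ties, which is expected with this many zeros.)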

  2. Should you use a Mann-Whitney U-test since your data include lots of $0$s?

    The naive, stats-101 scheme is that you should use a t-test to compare two groups unless the data are not normal. If you had enough data, you could probably still use a t-test despite the non-normality, but what constitutes 'enough' is generally ambiguous. (In your case, it's clear that your data are too non-normal, and you have nowhere near 'enough' data to ignore that.) People believe that the Mann-Whitney U-test is a test for differences of means (or maybe medians) with non-normal data, but that isn't true.

    The t-test and the Mann-Whitney U-test are testing different null hypotheses. The t-test tests if the samples are drawn from populations with the same mean. The Mann-Whitney U-test checks if one is stochastically greater (as just discussed). You need to use a test that assesses what you are trying to figure out.

  3. What should your null hypothesis be?

    That's really up to you. That is a theoretical decision that precedes data analysis (and hopefully study design).

    If your data represent profits from customers, one thing to consider is how many customers you are likely to have. If you will only get one, you might want to determine if one method will likely yield a more profitable customer. On the other hand, it can easily be that the group that is stochastically greater has a lower mean, and if so, with enough customers you would eventually yield greater profits by serving the group with the higher mean (provided you can stay in business long enough):

    [Figure: simulated distributions in which the stochastically greater group has the lower mean.]
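    To make this concrete, here is a toy illustration (made-up numbers, not from the question): one group almost always "wins" a pairwise comparison, yet the other group has a far higher mean:

    ```r
    # Group A always pays $1; group B usually pays $0 but rarely pays $100
    A <- rep(1, 10)
    B <- c(rep(0, 9), 100)

    mean(A)                     # 1
    mean(B)                     # 10
    mean(outer(A, B, FUN=">"))  # 0.9: A beats B in 90% of pairings
    ```

    With a single customer you would prefer A, but over many customers B yields ten times the revenue.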

  4. How would you test if the means differ?

    You can empirically estimate the sampling distribution by bootstrapping when you cannot assume normality:

    set.seed(997)  # this makes the example exactly reproducible
    B      = 10000
    t.stat = vector(length=B)
    for(i in 1:B){
      bt = sample(test,    size=length(test),    replace=TRUE)
      bc = sample(control, size=length(control), replace=TRUE)
      t.stat[i] = t.test(bt, bc)$statistic
    }
    mean(t.stat>0)    # [1] 0.490047
    2*mean(t.stat>0)  # [1] 0.980094
    

Your p-value is $0.98$, which isn't surprising since your means are identical.
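Beyond the bootstrap above, a permutation test of the difference in means is another standard option (my addition, not part of the original answer): it tests the sharp null that the group labels are exchangeable, by reshuffling the labels and recomputing the mean difference:

```r
set.seed(997)
control <- c(0, 0, 15, 0, 25, 0, 17, 0, 0, 12, 0)
test    <- c(0, 0,  0, 0, 30, 0, 19, 0, 0, 20, 0)

obs    <- mean(test) - mean(control)  # 0: the two group means are identical
pooled <- c(test, control)
n.test <- length(test)

perm <- replicate(10000, {
  idx <- sample(length(pooled), n.test)   # reassign the group labels at random
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))  # two-sided p-value; exactly 1 here since obs = 0
```

With these example data the observed difference is 0, so every permutation is at least as extreme and the p-value is 1, consistent with the bootstrap result above.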