Solved – the appropriate use of the continuity correction when performing a Mann-Whitney U test with small (n = 4) & discrete samples

Tags: r, wilcoxon-mann-whitney-test

I am running Mann-Whitney U tests in R whilst attempting to verify STATISTICA outputs.

The data do not conform to a normal distribution, are discrete, and are small in sample size (n = 4). Example (the complete dataset also contains data for T1, T2, T5, T6, T7 & T8):

T3  T4
75  0
66  2
25  3
21  33

Data are UVC (underwater visual census) fish counts. Each set of 4 samples constitutes a single 'transect', and the analysis tests for significant differences between the transects (n = 8) in the full dataset. The project base does not have access to R (rural area) and runs the statistical package STATISTICA. I, however, do have access to R and am attempting to understand the output of STATISTICA in order to advise on the correct p-value to report (support for STATISTICA is minimal). Example:

[screenshot of STATISTICA Mann-Whitney U test output]

For efficiency, and to save running 64 independent MWU tests to create the matrix, I run the following code:

d <- my.data
groups <- unique(d$group)
MWU <- outer(
  groups, groups,
  Vectorize(function(g1, g2) {
    wilcox.test(
      d$value[d$group == g1],
      d$value[d$group == g2],
      paired = FALSE,
      exact = FALSE,
      correct = TRUE
    )$p.value
  })
)
MWU

my.data =

group   value
group 1 10
group 1 9
group 1 11
group 1 10
group 2 16
group 2 75
group 2 0
group 2 0
group 3 75
group 3 66
group 3 25
group 3 21
group 4 0
group 4 2
group 4 3
group 4 33
group 5 16
group 5 19
group 5 31
group 5 20
group 6 15
group 6 7
group 6 26
group 6 33
group 7 20
group 7 30
group 7 29
group 7 133
group 8 14
group 8 8
group 8 20
group 8 15
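For reproducibility, the listing above can be entered as a data frame along these lines (a sketch; any equivalent construction will do):

```r
# Build my.data to match the listing above: 8 groups of 4 UVC fish counts each
my.data <- data.frame(
  group = rep(paste("group", 1:8), each = 4),
  value = c(10,  9, 11, 10,     # group 1
            16, 75,  0,  0,     # group 2
            75, 66, 25, 21,     # group 3
             0,  2,  3, 33,     # group 4
            16, 19, 31, 20,     # group 5
            15,  7, 26, 33,     # group 6
            20, 30, 29, 133,    # group 7
            14,  8, 20, 15)     # group 8
)
```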

Output:

          [,1]      [,2]       [,3]      [,4]       [,5]      [,6]       [,7]       [,8]
[1,] 1.00000000 1.0000000 0.02940105 0.3094241 0.02940105 0.3094241 0.02940105 0.30942406
[2,] 1.00000000 1.0000000 0.24252556 1.0000000 0.38074574 0.6631172 0.19126699 0.88454944
[3,] 0.02940105 0.2425256 1.00000000 0.1123512 0.11235120 0.3123214 1.00000000 0.03038282
[4,] 0.30942406 1.0000000 0.11235120 1.0000000 0.31232142 0.2453835 0.19393085 0.31232142
[5,] 0.02940105 0.3807457 0.11235120 0.3123214 1.00000000 0.8852339 0.24538346 0.14648918
[6,] 0.30942406 0.6631172 0.31232142 0.2453835 0.88523391 1.0000000 0.31232142 0.56136321
[7,] 0.02940105 0.1912670 1.00000000 0.1939309 0.24538346 0.3123214 1.00000000 0.04206641
[8,] 0.30942406 0.8845494 0.03038282 0.3123214 0.14648918 0.5613632 0.04206641 1.00000000

These values differ from those reported by STATISTICA (though they are similar to the exact values in the last column). When the continuity correction is turned off (correct=FALSE), the output of the above code matches the adjusted-column p-values given by STATISTICA:

> d<-my.data
> groups <- unique( d$group )
> MWU<-outer( 
+   groups, groups, 
+   Vectorize( function(g1,g2) {
+     wilcox.test( 
+       d$value[ d$group == g1 ], 
+       d$value[ d$group == g2 ],
+       paired=FALSE,
+       exact=FALSE,
+       correct=FALSE
+     )$p.value
+   } )
+ )
> 
> MWU
           [,1]      [,2]       [,3]       [,4]       [,5]      [,6]       [,7]       [,8]
[1,] 1.00000000 1.0000000 0.02016457 0.24538346 0.02016457 0.2453835 0.02016457 0.24538346
[2,] 1.00000000 1.0000000 0.18858231 1.00000000 0.30649217 0.5613632 0.14648918 0.77150341
[3,] 0.02016457 0.1885823 1.00000000 0.08326452 0.08326452 0.2482131 1.00000000 0.02092134
[4,] 0.24538346 1.0000000 0.08326452 1.00000000 0.24821308 0.1912670 0.14891467 0.24821308
[5,] 0.02016457 0.3064922 0.08326452 0.24821308 1.00000000 0.7728300 0.19126699 0.11021018
[6,] 0.24538346 0.5613632 0.24821308 0.19126699 0.77282999 1.0000000 0.24821308 0.46782508
[7,] 0.02016457 0.1464892 1.00000000 0.14891467 0.19126699 0.2482131 1.00000000 0.02940105
[8,] 0.24538346 0.7715034 0.02092134 0.24821308 0.11021018 0.4678251 0.02940105 1.00000000

However, as I understand it, due to the small sample size, ties are inevitable with discrete data, and ties are indeed present in the original data. Therefore, to account for these ties, correct=TRUE seems correct. When correct=TRUE the output, as above, is similar to the '2*1 sided, exact p' column reported by STATISTICA, but not identical.
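The effect of the continuity correction can be isolated on a single pair, e.g. T1 vs T3 (a small check I added; the values correspond to the entries in the matrices above):

```r
t1 <- c(10, 9, 11, 10); t3 <- c(75, 66, 25, 21)
# Normal approximation WITH continuity correction
# (matches the 0.02940105 entries in the first matrix)
wilcox.test(t1, t3, exact = FALSE, correct = TRUE)$p.value
# Normal approximation WITHOUT continuity correction
# (matches the 0.02016457 entries in the second matrix)
wilcox.test(t1, t3, exact = FALSE, correct = FALSE)$p.value
```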

Guidance by STATISTICA is given here: http://documentation.statsoft.com/STATISTICAHelp.aspx?path=Nonparametrics/NonparametricAnalysis/Dialogs/StartupPanel/NonparametricsStatisticsStartupPanelMannWhitneyUtest

It seems, though, that whilst a small sample size would warrant reporting the '2*1 sided, exact p' values, that method does not assume ties. So I do not feel the data comply with it.

Based on the above example, am I correct to assume that the continuity correction is the correct approach due to ties, and that the '2*1 sided, exact p' values in STATISTICA are those to be reported?

Please also comment on the R code – is it appropriate to use correct=TRUE when the two samples being compared do not contain ties?

Best Answer

Minimum sample sizes for a traditional Mann-Whitney-Wilcoxon rank sum test (with no tied data) to show significance at the 5% level are $n_1 = n_2 = 4.$ The minimal P-value $2/{8 \choose 4} = 0.0286$ can be achieved only when there is complete separation between groups (all observations in one group smaller than all observations in the other). So throughout this experiment you are working right at the edge of being able to see significant results at all.
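That minimal P-value is easy to check directly in R:

```r
# Under H0 all choose(8, 4) = 70 assignments of the pooled observations into
# two groups of 4 are equally likely; complete separation accounts for 2 of
# them (one in each direction), so the smallest two-sided P-value is 2/70.
choose(8, 4)      # 70
2 / choose(8, 4)  # 0.02857143
```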

Between T3 & T8 you have complete separation, so the exact P-value from the R function wilcox.test is 0.0286. As follows:

t3 = c(75, 66, 25, 21);  t8 = c(14, 8, 20, 15)
wilcox.test(t3, t8)

        Wilcoxon rank sum test

data:  t3 and t8
W = 16, p-value = 0.02857
alternative hypothesis: true location shift is not equal to 0

Between T1 & T3, T1 & T5, and T1 & T7 you also have complete separation, but the tied observations in T1 force wilcox.test to fall back on the normal approximation, giving the P-value 0.0294 (close to, but not equal to, the exact value 0.0286).

t1 = c(10, 9, 11, 10)
wilcox.test(t1, t3)

        Wilcoxon rank sum test with continuity correction

data:  t1 and t3
W = 0, p-value = 0.0294
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(t1, t3) : cannot compute exact p-value with ties

I suppose STATISTICA uses one of several statistical algorithms that purport to give an 'exact' P-value where the traditional MWW test does not (owing to ties); those values, which you show in red, are slightly different depending on the nature of the ties. With or without 'exact' corrections, the pairs with complete separation are the only ones with P-values below 5%. Various software packages have somewhat different ways of dealing with small sample sizes and with ties; I would not endorse one over the other. Perhaps others on this site have strong preferences, which they are no doubt willing to express. (My favorite method of handling data that do not satisfy the conditions of a traditional MWW rank sum test is to use a permutation test instead.)
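For what it's worth, a ties-respecting 'exact' P-value can be sketched in base R by enumerating all ${8 \choose 4} = 70$ assignments and using mid-ranks; this shows the general idea only and is not necessarily STATISTICA's algorithm:

```r
# Exact permutation P-value for the rank-sum statistic, with ties handled by
# mid-ranks; illustrated on T1 vs T3 (complete separation, tie within T1).
t1 <- c(10, 9, 11, 10); t3 <- c(75, 66, 25, 21)
r <- rank(c(t1, t3))                  # mid-ranks of the pooled sample
obs <- sum(r[1:4])                    # observed rank sum of the T1 group
splits <- combn(8, 4)                 # every possible 'group 1' of size 4
stats <- apply(splits, 2, function(idx) sum(r[idx]))
# two-sided: proportion of assignments at least as far from the mean rank sum
p.exact <- mean(abs(stats - mean(stats)) >= abs(obs - mean(stats)))
p.exact                               # 2/70 = 0.02857143, despite the ties
```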

Now let's look at a permutation test for one of the 'borderline' results in your table, specifically, T3 & T4, for which your approximated (red) P-value is about 0.08.

We use the pooled two-sample t statistic as a 'metric' for the permutation test. We are not assuming this statistic has the distribution $\mathsf{T}(df = 6)$ because we are not willing to assume the data are normal. But we are assuming this test statistic is a valid way to measure the 'distance' between T3 and T4.

Then we simulate the permutation distribution by looking at 100,000 instances in which values in T3 and T4 are randomly swapped between two groups. Then we judge whether the observed value 2.338 of the t statistic is extreme when compared to the simulated permutation distribution.

t3 = c(75,66,25,21);  t4 = c(0, 2, 3, 33)
all = c(t3, t4);  gp = c(1,1,1,1, 2,2,2,2)
t.obs = t.test(all~gp, var.eq=T)$stat
set.seed(813);  m = 10^5
t.prm = replicate( m, t.test(all ~ sample(gp))$stat )
mean(abs(t.prm)> abs(t.obs))
[1] 0.05803   # P-value of simulated permutation test

The observed value of the pooled t statistic is $T_{obs} = 2.338;$ the (unreliable) P-value of the corresponding pooled t test is 0.058. The P-value of the permutation test is also 0.058 (surprisingly close to that of the actual t test), which is smaller than the approximated P-value of the MWW rank sum test.
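The pooled t figures quoted above can be checked directly (same T3/T4 data):

```r
# Pooled (equal-variance) two-sample t test on the T3 vs T4 data
t3 <- c(75, 66, 25, 21); t4 <- c(0, 2, 3, 33)
tt <- t.test(t3, t4, var.equal = TRUE)
tt$statistic   # about 2.338
tt$p.value     # about 0.058 -- the (unreliable) pooled t P-value
```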

The figure below compares the permutation distribution with $\mathsf{T}(df=6).$ The distributions are clearly not the same, but their tails happen to have similar probabilities--for the particular data in T3 and T4. The vertical red bars are at $\pm T_{obs}.$

[Figure: histogram of the simulated permutation distribution of the t statistic, with the density of $\mathsf{T}(df=6)$ overlaid and vertical red bars at $\pm T_{obs}$]

Notes: (a) There were 66 uniquely different values of the t statistic encountered in the permutation test shown here. That should be enough to get an accurate P-value. (b) In the permutation test, we could have used the Welch (separate variances) t test; the t statistics would be exactly the same (because sample sizes are equal), but we could not have shown a hypothetical t distribution because the degrees of freedom would be different for each permutation of the data. (c) The difference in sample means (numerator of the t statistic) could have been used as the metric with a very similar P-value, but then a direct comparison with the t test would not be possible. (d) For a basic treatment of permutation tests using R, see Eudey et al. (2010). Sect. 3 discusses two-sample tests.