Solved – the appropriate use of the continuity correction when performing a Mann-Whitney U test with small (n = 4) & discrete samples

Tags: r, wilcoxon-mann-whitney-test

I am running Mann-Whitney U tests in R whilst attempting to verify STATISTICA outputs.

The data do not conform to a normal distribution, are discrete, and are small in sample size (n = 4). Example (the complete dataset also contains data for T1, T2, T5, T6, T7 & T8):

T3  T4
75  0
66  2
25  3
21  33

Data are UVC (underwater visual census) fish counts. Each set of 4 samples constitutes a single 'transect', and the analysis tests for significant differences between the transects (n = 8) in the full dataset. The project base does not have access to R (rural area) and runs the statistical package STATISTICA. I, however, do have access to R and am attempting to understand the output of STATISTICA in order to advise on the correct p-value to report (support for STATISTICA is minimal). Example:

[screenshot of STATISTICA Mann-Whitney U test output]

For efficiency, and to save running 64 independent MWU tests to create the matrix, I run the following code:

d <- my.data
groups <- unique(d$group)
MWU <- outer(
  groups, groups,
  Vectorize(function(g1, g2) {
    wilcox.test(
      d$value[d$group == g1],
      d$value[d$group == g2],
      paired = FALSE,
      exact = FALSE,
      correct = TRUE
    )$p.value
  })
)
MWU

my.data =

group   value
group 1 10
group 1 9
group 1 11
group 1 10
group 2 16
group 2 75
group 2 0
group 2 0
group 3 75
group 3 66
group 3 25
group 3 21
group 4 0
group 4 2
group 4 3
group 4 33
group 5 16
group 5 19
group 5 31
group 5 20
group 6 15
group 6 7
group 6 26
group 6 33
group 7 20
group 7 30
group 7 29
group 7 133
group 8 14
group 8 8
group 8 20
group 8 15
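For reproducibility, the listing above can be entered as a data frame along these lines (a sketch; any equivalent construction will do):

```r
# Build my.data to match the listing above: 8 groups of 4 UVC fish counts each
my.data <- data.frame(
  group = rep(paste("group", 1:8), each = 4),
  value = c(10,  9, 11, 10,     # group 1
            16, 75,  0,  0,     # group 2
            75, 66, 25, 21,     # group 3
             0,  2,  3, 33,     # group 4
            16, 19, 31, 20,     # group 5
            15,  7, 26, 33,     # group 6
            20, 30, 29, 133,    # group 7
            14,  8, 20, 15)     # group 8
)
```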

Output:

          [,1]      [,2]       [,3]      [,4]       [,5]      [,6]       [,7]       [,8]
[1,] 1.00000000 1.0000000 0.02940105 0.3094241 0.02940105 0.3094241 0.02940105 0.30942406
[2,] 1.00000000 1.0000000 0.24252556 1.0000000 0.38074574 0.6631172 0.19126699 0.88454944
[3,] 0.02940105 0.2425256 1.00000000 0.1123512 0.11235120 0.3123214 1.00000000 0.03038282
[4,] 0.30942406 1.0000000 0.11235120 1.0000000 0.31232142 0.2453835 0.19393085 0.31232142
[5,] 0.02940105 0.3807457 0.11235120 0.3123214 1.00000000 0.8852339 0.24538346 0.14648918
[6,] 0.30942406 0.6631172 0.31232142 0.2453835 0.88523391 1.0000000 0.31232142 0.56136321
[7,] 0.02940105 0.1912670 1.00000000 0.1939309 0.24538346 0.3123214 1.00000000 0.04206641
[8,] 0.30942406 0.8845494 0.03038282 0.3123214 0.14648918 0.5613632 0.04206641 1.00000000

These values differ from those reported by STATISTICA (though they are similar to the exact values in the last column). When the continuity correction is turned off (correct=FALSE), the output of the above code matches the adjusted-column p-values given by STATISTICA:

> d<-my.data
> groups <- unique( d$group )
> MWU<-outer( 
+   groups, groups, 
+   Vectorize( function(g1,g2) {
+     wilcox.test( 
+       d$value[ d$group == g1 ], 
+       d$value[ d$group == g2 ],
+       paired=FALSE,
+       exact=FALSE,
+       correct=FALSE
+     )$p.value
+   } )
+ )
> 
> MWU
           [,1]      [,2]       [,3]       [,4]       [,5]      [,6]       [,7]       [,8]
[1,] 1.00000000 1.0000000 0.02016457 0.24538346 0.02016457 0.2453835 0.02016457 0.24538346
[2,] 1.00000000 1.0000000 0.18858231 1.00000000 0.30649217 0.5613632 0.14648918 0.77150341
[3,] 0.02016457 0.1885823 1.00000000 0.08326452 0.08326452 0.2482131 1.00000000 0.02092134
[4,] 0.24538346 1.0000000 0.08326452 1.00000000 0.24821308 0.1912670 0.14891467 0.24821308
[5,] 0.02016457 0.3064922 0.08326452 0.24821308 1.00000000 0.7728300 0.19126699 0.11021018
[6,] 0.24538346 0.5613632 0.24821308 0.19126699 0.77282999 1.0000000 0.24821308 0.46782508
[7,] 0.02016457 0.1464892 1.00000000 0.14891467 0.19126699 0.2482131 1.00000000 0.02940105
[8,] 0.24538346 0.7715034 0.02092134 0.24821308 0.11021018 0.4678251 0.02940105 1.00000000

However, as I understand it, due to the small sample size, ties are inevitable with discrete data, and ties are indeed present in the original data. Therefore, to account for these ties, correct=TRUE seems correct. When correct=TRUE the output, as above, is similar to the '2*1 sided, exact p' column reported by STATISTICA, but not identical.
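The effect of the continuity correction can be isolated on a single pair, e.g. T1 vs T3 (a small check I added; the values correspond to the entries in the matrices above):

```r
t1 <- c(10, 9, 11, 10); t3 <- c(75, 66, 25, 21)
# Normal approximation WITH continuity correction
# (matches the 0.02940105 entries in the first matrix)
wilcox.test(t1, t3, exact = FALSE, correct = TRUE)$p.value
# Normal approximation WITHOUT continuity correction
# (matches the 0.02016457 entries in the second matrix)
wilcox.test(t1, t3, exact = FALSE, correct = FALSE)$p.value
```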

Guidance by STATISTICA is given here: http://documentation.statsoft.com/STATISTICAHelp.aspx?path=Nonparametrics/NonparametricAnalysis/Dialogs/StartupPanel/NonparametricsStatisticsStartupPanelMannWhitneyUtest

It seems, though, that whilst a small sample size would warrant reporting the '2*1 sided, exact p' values, that method does not assume ties. So I do not feel the data comply with it.

Based on the above example, am I correct to assume that the continuity correction is the correct approach due to ties, and that the '2*1 sided, exact p' values in STATISTICA are those to be reported?

Please also comment on the R code – is it appropriate to use correct=TRUE when the two samples being compared do not contain ties?

Best Answer

Minimum sample sizes for a traditional Mann-Whitney-Wilcoxon rank sum test (with no tied data) to show significance at the 5% level are $n_1 = n_2 = 4.$ The minimal P-value $2/{8 \choose 4} = 0.0286$ can be achieved only when there is complete separation between groups (all observations in one group smaller than all observations in the other). So throughout this experiment you are working right at the edge of being able to see significant results at all.
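That minimal P-value is easy to check directly in R:

```r
# Under H0 all choose(8, 4) = 70 assignments of the pooled observations into
# two groups of 4 are equally likely; complete separation accounts for 2 of
# them (one in each direction), so the smallest two-sided P-value is 2/70.
choose(8, 4)      # 70
2 / choose(8, 4)  # 0.02857143
```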

Between T3 & T8 you have complete separation, so the exact P-value from the R function wilcox.test is 0.0286. As follows:

t3 = c(75, 66, 25, 21);  t8 = c(14, 8, 20, 15)
wilcox.test(t3, t8)

        Wilcoxon rank sum test

data:  t3 and t8
W = 16, p-value = 0.02857
alternative hypothesis: true location shift is not equal to 0

Between T1 & T3, T1 & T5, and T1 & T7 you also have complete separation, but the tied observations in T1 force wilcox.test to fall back on the normal approximation, giving the P-value 0.0294 (close to, but not equal to, the exact value 0.0286).

t1 = c(10, 9, 11, 10)
wilcox.test(t1, t3)

        Wilcoxon rank sum test with continuity correction

data:  t1 and t3
W = 0, p-value = 0.0294
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(t1, t3) : cannot compute exact p-value with ties

I suppose STATISTICA uses one of several statistical algorithms that purport to give an 'exact' P-value where the traditional MWW test does not (owing to ties); those values, which you show in red, are slightly different depending on the nature of the ties. With or without 'exact' corrections, the pairs with complete separation are the only ones with P-values below 5%. Various software packages have somewhat different ways of dealing with small sample sizes and with ties; I would not endorse one over the other. Perhaps others on this site have strong preferences, which they are no doubt willing to express. (My favorite method of handling data that do not satisfy the conditions of a traditional MWW rank sum test is to use a permutation test instead.)
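For what it's worth, a ties-respecting 'exact' P-value can be sketched in base R by enumerating all ${8 \choose 4} = 70$ assignments and using mid-ranks; this shows the general idea only and is not necessarily STATISTICA's algorithm:

```r
# Exact permutation P-value for the rank-sum statistic, with ties handled by
# mid-ranks; illustrated on T1 vs T3 (complete separation, tie within T1).
t1 <- c(10, 9, 11, 10); t3 <- c(75, 66, 25, 21)
r <- rank(c(t1, t3))                  # mid-ranks of the pooled sample
obs <- sum(r[1:4])                    # observed rank sum of the T1 group
splits <- combn(8, 4)                 # every possible 'group 1' of size 4
stats <- apply(splits, 2, function(idx) sum(r[idx]))
# two-sided: proportion of assignments at least as far from the mean rank sum
p.exact <- mean(abs(stats - mean(stats)) >= abs(obs - mean(stats)))
p.exact                               # 2/70 = 0.02857143, despite the ties
```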

Now let's look at a permutation test for one of the 'borderline' results in your table, specifically, T3 & T4, for which your approximated (red) P-value is about 0.08.

We use the pooled two-sample t statistic as a 'metric' for the permutation test. We are not assuming this statistic has the distribution $\mathsf{T}(df = 6)$ because we are not willing to assume the data are normal. But we are assuming this test statistic is a valid way to measure the 'distance' between T3 and T4.

Then we simulate the permutation distribution by looking at 100,000 instances in which values in T3 and T4 are randomly swapped between two groups. Then we judge whether the observed value 2.338 of the t statistic is extreme when compared to the simulated permutation distribution.

t3 = c(75,66,25,21);  t4 = c(0, 2, 3, 33)
all = c(t3, t4);  gp = c(1,1,1,1, 2,2,2,2)
t.obs = t.test(all~gp, var.eq=T)$stat
set.seed(813);  m = 10^5
t.prm = replicate( m, t.test(all ~ sample(gp))$stat )
mean(abs(t.prm)> abs(t.obs))
[1] 0.05803   # P-value of simulated permutation test

The observed value of the pooled t statistic is $T_{obs} = 2.338;$ the (unreliable) P-value of the corresponding pooled t test is 0.058. The P-value of the permutation test is also 0.058 (surprisingly close to that of the actual t test), which is smaller than the approximated P-value of the MWW rank sum test.
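The pooled t figures quoted above can be checked directly (same T3/T4 data):

```r
# Pooled (equal-variance) two-sample t test on the T3 vs T4 data
t3 <- c(75, 66, 25, 21); t4 <- c(0, 2, 3, 33)
tt <- t.test(t3, t4, var.equal = TRUE)
tt$statistic   # about 2.338
tt$p.value     # about 0.058 -- the (unreliable) pooled t P-value
```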

The figure below compares the permutation distribution with $\mathsf{T}(df=6).$ The distributions are clearly not the same, but their tails happen to have similar probabilities--for the particular data in T3 and T4. The vertical red bars are at $\pm T_{obs}.$

[Figure: histogram of the simulated permutation distribution of the t statistic, with the density of $\mathsf{T}(df=6)$ overlaid and vertical red bars at $\pm T_{obs}$]

Notes: (a) There were 66 uniquely different values of the t statistic encountered in the permutation test shown here. That should be enough to get an accurate P-value. (b) In the permutation test, we could have used the Welch (separate variances) t test; the t statistics would be exactly the same (because sample sizes are equal), but we could not have shown a hypothetical t distribution because the degrees of freedom would be different for each permutation of the data. (c) The difference in sample means (numerator of the t statistic) could have been used as the metric with a very similar P-value, but then a direct comparison with the t test would not be possible. (d) For a basic treatment of permutation tests using R, see Eudey et al. (2010). Sect. 3 discusses two-sample tests.