Confidence Interval – Calculating Effect Size Confidence Interval for Mann-Whitney U-Test

Tags: confidence-interval, effect-size, spearman-rho, wilcoxon-mann-whitney-test

According to Fritz, Morris, and Richler (2011; see below), $r$ can be calculated as an effect size for the Mann-Whitney U-test using the formula
$$
r = \frac{z}{\sqrt N}
$$
This is convenient for me, as I also report $r$ on other occasions. I'd like to report the confidence interval for $r$ in addition to the effect size measure.
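As a sketch of how this works in practice: the $z$ value can be obtained from the normal approximation to the $U$ statistic (as Fritz et al. describe), and then converted to $r$. The function below is my own illustration, not from the article; it uses average ranks for ties but omits the tie correction to the variance of $U$.

```python
import math

def average_ranks(values):
    """Assign 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_r(x, y):
    """Effect size r = z / sqrt(N) for the Mann-Whitney U test,
    with z from the normal approximation (no tie correction)."""
    n1, n2, n = len(x), len(y), len(x) + len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2        # U for the first sample
    mu = n1 * n2 / 2                                 # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n + 1) / 12)        # SD of U under H0
    z = (u1 - mu) / sigma
    return z / math.sqrt(n)                          # r = z / sqrt(N)
```

For two completely separated groups such as `[1, 2, 3]` vs. `[4, 5, 6]`, this gives $r \approx -0.80$, a large effect in Cohen's terms; the sign only reflects which group is listed first.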

Here are my questions:

  • Can I calculate confidence intervals for $r$ as for Pearson's $r$, even though it is used as an effect size measure for a nonparametric test?
  • What confidence intervals have to be reported for one-tailed vs. two-tailed testing?

Edit concerning the second question: "What confidence intervals have to be reported for one-tailed vs. two-tailed testing?"

I found some more information that IMHO may answer this question. "Whereas two-sided confidence limits form a confidence interval, their one-sided counterparts are referred to as lower or upper confidence bounds." (http://en.wikipedia.org/wiki/Confidence_interval). From this information I conclude that it is not the main issue whether the significance testing (e.g., $t$-test) was one- or two-tailed, but what information one is interested in with respect to the CI for the effect size. My conclusion (please correct me if you disagree):

  • two-sided CI $\rightarrow$ interested in upper and lower bounds (as a consequence, a two-sided CI may include 0 even though the one-tailed significance test gave p < .05, especially when the p value was close to .05.)
  • one-sided "CI" $\rightarrow$ only interested in upper or lower bound (due to theoretical reasoning); however, this is not necessarily the main question of interest after testing a directed hypothesis. A two-sided CI is perfectly appropriate if the focus is on the possible range of an effect size. Right?
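For reference, this is how a two-sided CI for a correlation is usually computed with the Fisher $z$-transform. Whether this Pearson-style machinery is appropriate for the nonparametric $r$ is exactly the open question above, so treat this as a sketch of the standard procedure, not an endorsement:

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, conf=0.95):
    """Two-sided CI for a correlation via the Fisher z-transform.
    Assumes the Pearson-r sampling distribution applies."""
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # e.g. 1.96 for 95%
    z_r = math.atanh(r)                                # Fisher transform
    se = 1 / math.sqrt(n - 3)                          # approximate SE
    lo, hi = z_r - z_crit * se, z_r + z_crit * se
    return math.tanh(lo), math.tanh(hi)                # back-transform
```

For example, `fisher_ci(0.30, 50)` returns roughly (0.02, 0.53): a "medium" point estimate whose interval nearly reaches 0, which illustrates the point about two-sided CIs and borderline one-tailed tests.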

See below for the text passage from Fritz, Morris, & Richler (2011) on effect size estimates for the Mann-Whitney test, from the article I refer to above.

"Most of the effect size estimates we have described here assume that the data have a normal distribution. However, some data do not meet the requirements of parametric tests, for example, data on an ordinal but not interval scale. For such data, researchers usually turn to nonparametric statistical tests, such as the Mann–Whitney and the Wilcoxon tests. The significance of these tests is usually evaluated through the approximation of the distributions of the test statistics to the $z$ distribution when sample sizes are not too small, and statistical packages, such as SPSS, that run these tests report the appropriate $z$ value in addition to the values for $U$ or $T$; $z$ can also be calculated by hand (e.g., Siegel & Castellan, 1988). The $z$ value can be used to calculate an effect size, such as the $r$ proposed by Cohen (1988); Cohen’s guidelines for r are that a large effect is .5, a medium effect is .3, and a small effect is .1 (Coolican, 2009, p. 395). It is easy to calculate $r$, $r^2$, or $\eta^2$ from these $z$ values because
$$
r = \frac{z}{\sqrt N}
$$
and
$$
r^2\quad{\rm or}\quad \eta^2 = \frac{z^2}{N}
$$
These effect size estimates remain independent of sample size despite the presence of N in the formulas. This is because z is sensitive to sample size; dividing by a function of N removes the effect of sample size from the resultant effect size estimate." (p. 12)
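A minimal numeric illustration of the two quoted formulas (the $z$ and $N$ values here are made up for the example):

```python
import math

z, N = 2.5, 100            # hypothetical z value and total sample size
r = z / math.sqrt(N)       # r = z / sqrt(N)  ->  0.25 (small-to-medium)
r_sq = z ** 2 / N          # r^2 (or eta^2) = z^2 / N  ->  0.0625
```

Note that `r_sq` is exactly `r ** 2`, so the two formulas are consistent with each other.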

Best Answer

One choice of effect size for the Mann-Whitney U test is the common language effect size. For the Mann-Whitney U, this is the proportion of sample pairs that supports a stated hypothesis.

A second choice is the rank correlation; because the rank correlation ranges from -1 to +1, it has properties that are similar to the Pearson r. In addition, by the simple difference formula, the rank correlation is the difference between the common language effect size and its complement, a fact that promotes interpretation. For example, if there are 100 sample pairs, and if 70 sample pairs support the hypothesis, then the common language effect size is 70%, and the rank correlation is r = .70 − .30 = .40. A clear discussion of the common language effect size and of four formulas to compute the rank correlation is given by Kerby (2014) in the journal Innovative Teaching.
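The two quantities above can be sketched directly from the data. The function below (my own naming) enumerates all cross-group pairs, counts the proportion that supports the hypothesis (ties counted as half), and applies the simple difference formula:

```python
from itertools import product

def cl_and_rank_r(x, y):
    """Common language effect size (proportion of (x, y) pairs with
    x > y, ties counted as 0.5) and the rank correlation via the
    simple difference formula: r = favorable - unfavorable."""
    pairs = list(product(x, y))
    favorable = sum(1.0 if a > b else 0.5 if a == b else 0.0
                    for a, b in pairs)
    cl = favorable / len(pairs)
    return cl, cl - (1 - cl)        # r = CL - its complement
```

Reproducing the numbers in the example: ten pairs of which seven are favorable give a common language effect size of .70 and a rank correlation of .70 − .30 = .40.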

By the way, though the paper does not mention it, I am fairly certain that Somers' d and the rank correlation for the Mann-Whitney are equivalent.