Solved – Difference of ‘centers’ of 2 non-normal samples with Mann-Whitney test

nonparametric, structural-equation-modeling, wilcoxon-mann-whitney-test

I have 2 non-normally distributed samples of different sizes (N1 ≠ N2).

To evaluate whether there is a significant difference between these samples, I used the Mann Whitney U test (ranksum in MATLAB).

Now I want to evaluate by how much the populations differ. With normally distributed data, I would just use the difference of the means with a SEM confidence interval.

However, for nonparametric data, my Wikipedia research suggested using the Hodges–Lehmann estimator. My questions are:

  1. Is the Hodges–Lehmann estimator indeed correct?
  2. How do I calculate confidence intervals for the HL estimator (analogous to the SEM-based interval)?
  3. How can I implement this in MATLAB?

Thanks

Best Answer

Yes: if your alternative is a shift alternative, the corresponding estimate is the Hodges–Lehmann estimator.

The estimate is easy enough.

The Hodges–Lehmann estimate is the sample median of the cross-sample pairwise differences.

For really large samples, there are efficient methods (for example, ones that sort the two samples and then use the information in that to speed up the search for the median).

I'm going to assume your sample is not so large that it's worth the effort of doing that, in which case you simply want to find all pairwise differences and take the median of the result.

For example, if your x-sample is:

16.388  6.775 18.270 17.034 18.825 16.197 16.709 18.188  9.999

and your y-sample is:

 9.305  9.799 11.103

then the matrix of differences, x-y is

   7.083  6.589  5.285
  -2.530 -3.024 -4.328
   8.965  8.471  7.167
   7.729  7.235  5.931
   9.520  9.026  7.722
   6.892  6.398  5.094
   7.404  6.910  5.606
   8.883  8.389  7.085
   0.694  0.200 -1.104

(here columns correspond to particular y's and rows to particular x's). You then take the median of the whole set of differences.

(In R this is just a matter of:

> median(outer(x,y,"-"))
[1] 6.91

It's doubtless just about as easy in MATLAB; something like [X,Y] = meshgrid(x,y); Z = X - Y; median(Z(:)) should generate the matrix of differences and take its median.)
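For anyone working in Python instead, here's a minimal sketch of the same computation using NumPy, with the x- and y-values from the example above (a translation of the one-liner, not the original author's code):

```python
import numpy as np

# the x- and y-samples from the example above
x = np.array([16.388, 6.775, 18.270, 17.034, 18.825,
              16.197, 16.709, 18.188, 9.999])
y = np.array([9.305, 9.799, 11.103])

# Hodges-Lehmann estimate: median of all cross-sample pairwise differences
diffs = np.subtract.outer(x, y)   # 9x3 matrix of differences x_i - y_j
hl = np.median(diffs)
print(hl)   # 6.91
```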

If you have samples so large that this isn't feasible, you should indicate this.


I don't know why it didn't click before, but the basic concept of finding an interval for something like the Hodges-Lehmann shift parameter in the Wilcoxon-Mann-Whitney (WMW) test is pretty obvious; the same idea is widely used for other permutation tests.

The underlying idea is this: you shift the second sample left and right until you hit the $\alpha/2$ and $1-\alpha/2$ quantiles of the WMW statistic. (Being discrete, these will cut off less than $\alpha/2$ in each tail; you could legitimately let one tail be a bit larger than $\alpha/2$ as long as the sum of the tail areas stays below $\alpha$, but most people will prefer symmetry, even at the cost of a more-conservative-than-necessary interval.)

This approach can be easily implemented in software via a canned root-finding algorithm that doesn't require derivatives. Bounds are obvious enough: if you haven't hit the required quantile by the time you get complete separation of the samples, you never will, so check those 'maximum shift' values first and then use them as bounds; similarly, the HL estimate itself bounds the other end of each search.

There are other approaches to doing this that will probably be more efficient, but this should work well enough with a good root-finder (perhaps involving only a dozen or so evaluations of statistics at different shift values).


In response to the comments about the issue being a change of scale, not location:

One possibility: you could estimate the location shift in the logs, which corresponds to the log of the scale change (the ratio of scales) in the untransformed values. That is, if $X \sim F(x)$ and $Y \sim F(y/a)$, then $X^* = \ln X \sim G(x)$ and $Y^* = \ln Y \sim G(y - \ln a)$, where $G(x) = F(\exp(x))$.

Intervals, at least, should carry over nicely.

Here's an example of that in use:

The data are as follows: the x sample is 60 observations from a distribution with scale parameter $\mu=18$, while the y sample is 50 observations with scale parameter $\mu=25$. The two distributions are otherwise identical; they differ only in scale.

The increase in scale is a factor of $25/18 \approx 1.3889$.

Here's a boxplot of the data on the original scale and on the log-scale:

[Figure: boxplots of the x and y samples, on the original scale and on the log scale]

Here we do a Wilcoxon-Mann-Whitney on the log-data:

> lx=log(x)
> ly=log(y)
> wilcox.test(ly,lx,conf.int=TRUE)

    Wilcoxon rank sum test with continuity correction

data:  ly and lx
W = 1743, p-value = 0.1455
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -0.09044406  0.68813695
sample estimates:
difference in location 
             0.2917924 

If we exponentiate the interval, it should contain the true ratio of scales 95% of the time. It does contain the population ratio on this occasion:

> exp(wilcox.test(ly,lx,conf.int=TRUE)$conf.int)
[1] 0.9135254 1.9900046
attr(,"conf.level")
[1] 0.95

If we exponentiate the estimated shift in the logs, it will be a (somewhat) downward-biased estimate of the scale-ratio, but it's fairly good:

> exp(wilcox.test(ly,lx,conf.int=TRUE)$estimate)
difference in location 
              1.338825 

[The data used in the example were generated as follows:

> set.seed(94394381)
> x=rexp(60,1/18)
> y=rexp(50,1/25)

... just in case anyone wants to try it]
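For anyone working in Python rather than R: scipy's mannwhitneyu doesn't return a point estimate or confidence interval the way wilcox.test does, but the HL estimate on the logs is easy to compute directly and exponentiate. A sketch on freshly simulated data (so the numbers won't match the R output above):

```python
import numpy as np

rng = np.random.default_rng(1)   # simulated data for illustration
x = rng.exponential(18.0, size=60)
y = rng.exponential(25.0, size=50)

# HL estimate of the location shift of log(y) relative to log(x):
# the median of all pairwise log-differences, i.e. of log(y_i / x_j)
log_shift = np.median(np.subtract.outer(np.log(y), np.log(x)))

# exponentiating gives a (somewhat downward-biased) estimate of the
# scale ratio; the true value here is 25/18, about 1.39
ratio = np.exp(log_shift)
print(ratio)
```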


There are other ways to estimate the scale change. One could, for example, divide the y-sample by a constant, $c$, and try different values of $c$ until the WMW statistic equals its expected value under the null hypothesis; one could similarly find the values of $c$ at which the statistic equals the $\alpha/2$ and $1-\alpha/2$ quantiles of its null distribution. As above, these give a point estimate of, and an interval for, the ratio of scales.

(That would have the advantage of not introducing a small bias into the central estimate.)
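That alternative can be sketched the same way in Python: use a root-finder to locate the divisor $c$ at which the WMW statistic hits its null expectation $n_1 n_2 / 2$ (again on simulated data; scipy's U statistic counts the pairs with $x > y/c$):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)          # simulated data for illustration
x = rng.exponential(18.0, size=60)
y = rng.exponential(25.0, size=50)       # true scale ratio 25/18, about 1.39
n1, n2 = len(x), len(y)

# WMW U statistic for x vs y/c, minus its null expectation; the root is
# the divisor c at which the rescaled samples look exchangeable
def excess(c):
    u = mannwhitneyu(x, y / c, alternative='two-sided').statistic
    return u - n1 * n2 / 2.0

# bracket: at c = y.min()/x.max() every y/c is >= every x (U near 0,
# excess negative); at c = y.max()/x.min() every y/c is <= every x
# (U near n1*n2, excess positive), so a sign change is guaranteed
c_hat = brentq(excess, y.min() / x.max(), y.max() / x.min())
print(c_hat)
```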