How to Build a Confidence Interval for the Wilcoxon Test in R

confidence interval, nonparametric, wilcoxon-signed-rank

I want to calculate the confidence interval around the median obtained from this data set:

```
dat <- c(2.10, 2.35, 2.35, 3.10, 3.10, 3.15, 3.90, 3.90, 4.00, 4.80,
         5.00, 5.00, 5.15, 5.35, 5.50, 6.00, 6.00, 6.25, 6.45)
```

The descriptive statistics:

```
summary(dat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.100   3.125   4.800   4.392   5.425   6.450
```

I cannot find how the confidence interval that is presented together with the Wilcoxon test results is calculated:

```
wilcox.test(dat, conf.int = T, correct = T, exact = F, conf.level = .99)
    Wilcoxon signed rank test with continuity correction

data:  dat
V = 190, p-value = 0.0001419
alternative hypothesis: true location is not equal to 0
99 percent confidence interval:
 3.450018 5.499933
sample estimates:
(pseudo)median 
      4.400028
```

I just want to estimate the median of the population with a confidence interval using a non-parametric method. How is the confidence interval shown above related to the Wilcoxon signed rank test?

Best Answer

> I just want to estimate the median of the population with a confidence interval using a non-parametric method.

Note that the interval generated for the signed rank test is for the population version of the one-sample Hodges-Lehmann statistic (the pseudomedian), not the median.

Under the assumption of symmetry, the two population quantities coincide. (Symmetry is necessary under the null for the signed rank test, but not necessarily required under the alternative, which is the setting you're calculating a confidence interval under.) You may be happy to make that somewhat stronger assumption, but keep in mind that it's quite possible for the sample median to fall outside the CI this generates.
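
To see how far apart the two population quantities can be without symmetry, here is a small sketch. The Exponential(1) and Normal(3, 1) distributions are illustrative choices that do not come from the question, and the variable names are made up for this example.

```r
# Illustrative distributions only (assumptions, not from the question).

# Population median of an Exponential(rate = 1) distribution: log(2).
pop_median <- qexp(0.5, rate = 1)

# The pseudomedian is the median of (X1 + X2) / 2 for independent draws.
# For Exponential(1), X1 + X2 ~ Gamma(shape = 2, rate = 1), so the
# average (X1 + X2) / 2 ~ Gamma(shape = 2, rate = 2).
pop_pseudomedian <- qgamma(0.5, shape = 2, rate = 2)

# Skewed case: the two quantities differ (about 0.69 versus about 0.84).
c(median = pop_median, pseudomedian = pop_pseudomedian)

# Symmetric case, Normal(3, 1): (X1 + X2) / 2 ~ Normal(3, sd = 1 / sqrt(2)),
# so the median and the pseudomedian are both 3.
sym_median <- qnorm(0.5, mean = 3, sd = 1)
sym_pseudomedian <- qnorm(0.5, mean = 3, sd = 1 / sqrt(2))
c(median = sym_median, pseudomedian = sym_pseudomedian)
```

For a skewed population like this one, an interval for the pseudomedian is therefore not an interval for the median.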

> How is the confidence interval shown above related to the Wilcoxon signed rank test?

It's the set of values for the pseudomedian that would not be rejected by a signed rank test. You can actually find the limits that way; inverting a test like this is a pretty general way to arrive at confidence intervals for statistics that have no simpler construction.

There's a specific way to find the limits for the signed rank test that doesn't need you to do that, but you can use search methods to get there quite quickly with this general approach.
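
As a rough sketch of that general approach, the following scans a grid of candidate locations for the data in the question and keeps those that the test does not reject at the 1% level (matching `conf.level = .99`). The grid step of 0.005 and the names `p_at`, `grid`, `p_values` and `ci_search` are assumptions made for illustration; `wilcox.test()` itself locates the endpoints differently, and a root finder or finer grid would pin them down more precisely.

```r
dat <- c(2.10, 2.35, 2.35, 3.10, 3.10, 3.15, 3.90, 3.90, 4.00, 4.80,
         5.00, 5.00, 5.15, 5.35, 5.50, 6.00, 6.00, 6.25, 6.45)

alpha <- 0.01  # 99% confidence level, as in the question

# p-value of the signed rank test for H0: location = mu0,
# using the same options as the call in the question.
p_at <- function(mu0) {
  wilcox.test(dat, mu = mu0, exact = FALSE, correct = TRUE)$p.value
}

# Candidate locations (the 0.005 step is an arbitrary illustrative choice).
grid <- seq(min(dat), max(dat), by = 0.005)
p_values <- vapply(grid, p_at, numeric(1))

# The confidence set is every candidate that is NOT rejected at level alpha.
ci_search <- range(grid[p_values > alpha])
ci_search
```

Up to the grid resolution, the endpoints should land close to the interval reported in the question.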

The more specific approach for the signed rank test is based on a symmetric pair of order statistics of the Walsh averages: the averages $\frac{1}{2}(X_i+X_j)$ of each $(i,j)$ pair with $i \leq j$ (i.e. including each point averaged with itself). The signed rank statistic is the number of positive Walsh averages.

If we label those averages $W_k$, $k=1, 2, ..., m$, where $m=n(n+1)/2$, the corresponding interval is the symmetric pair of order statistics $(W_{(k)},W_{(m+1-k)})$, with $k$ chosen as small as possible while still giving endpoints in the non-rejection region of the test.

(This pdf outlines that in some detail.)
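
A minimal sketch of the Walsh-average construction for the data in the question follows. It takes $k$ from the exact null distribution via `qsignrank()`, which ignores the ties in these data, so the endpoints may differ slightly from the normal-approximation output shown in the question. The names `walsh`, `hl` and `ci_walsh` are illustrative.

```r
dat <- c(2.10, 2.35, 2.35, 3.10, 3.10, 3.15, 3.90, 3.90, 4.00, 4.80,
         5.00, 5.00, 5.15, 5.35, 5.50, 6.00, 6.00, 6.25, 6.45)
n <- length(dat)

# Walsh averages: (X_i + X_j) / 2 for all i <= j, sorted.
pair_means <- outer(dat, dat, "+") / 2
walsh <- sort(pair_means[upper.tri(pair_means, diag = TRUE)])
m <- length(walsh)  # n * (n + 1) / 2

# Hodges-Lehmann estimate (sample pseudomedian): median of the Walsh averages.
hl <- median(walsh)

# Signed rank statistic for H0: location = 0 is the number of
# positive Walsh averages.
V <- sum(walsh > 0)

# Order-statistic index k from the exact null distribution
# (assumption: ties in the data are ignored here).
alpha <- 0.01
k <- qsignrank(alpha / 2, n)
if (k == 0) k <- 1

# Symmetric pair of order statistics (W_(k), W_(m + 1 - k)).
ci_walsh <- c(walsh[k], walsh[m + 1 - k])

list(V = V, pseudomedian = hl, conf.int = ci_walsh)
```

Here `V` reproduces the statistic reported by `wilcox.test()`, and `hl` is the value that the reported (pseudo)median approximates.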

Related Solutions

Solved – Deriving confidence interval from standard error of the mean when the data are non-normal

This is somewhat tricky. There are several approaches:

1. Assume the distribution isn't 'too far' from the normal (in a particular sense), and that the t-interval will give close to the desired coverage. The t is at least reasonably robust to mild deviations from the assumptions, so if the population distribution isn't particularly skewed or especially heavy tailed, that should at least work reasonably well.

2. Assume the distribution is symmetric* and construct an interval for the pseudomedian (Hodges-Lehmann estimate, median of pairwise averages) via a Wilcoxon signed-rank-type procedure. If the t-distribution would have been right, on average you lose very little by doing this. This can be done in many packages.

[With a symmetric distribution whose mean exists, the mean, the pseudomedian, the ordinary median (and many other location measures) coincide. An interval that contains one with a particular probability will also contain the others.]

*(or at least 'sufficiently' close to it)

Here's an example of this done in R:
```
y <- rlogis(8,50,1)
wilcox.test(y,conf.int=TRUE)

    Wilcoxon signed rank test

data:  y
V = 36, p-value = 0.007813
alternative hypothesis: true location is not equal to 0
95 percent confidence interval:
 47.49677 52.22811
sample estimates:
(pseudo)median
      49.55069
```
So the interval given there is (47.50, 52.23).

In the plot that accompanied this example, the purple vertical line segment is the sample mean and the centre blue one is the sample pseudomedian. The outer blue segments mark the ends of the confidence interval. You see that in this example the interval includes the true population mean of 50.

3. Assume symmetry and construct a CI from the values for the mean that would not be rejected by a permutation test (this can be done from a single permutation test distribution, and 8 observations is few enough to get the whole permutation distribution rather than sample it).

4. Use bootstrapping to construct a CI for the mean. The bootstrap is justified by an asymptotic argument (so it may not work very well for small samples), but you can make various distributional assumptions and check its coverage properties for plausible distributions via simulation. This paper (pdf is downloadable at that link) suggests that the bootstrap-t intervals often get better coverage properties than the usual t-intervals, but may have poor coverage when samples are small and the distributions are skew.

5. If you have some additional information that would help guide a choice of distribution, you can get somewhere with other distributional assumptions. For example, if you know that the distribution is skew and continuous, you might try using a Gamma or lognormal model (say) to construct a CI for the mean. Or if you have count data you might use a Poisson, binomial or negative binomial model to try to construct an interval.
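
The permutation-test interval mentioned in the list above can be sketched in base R as follows. This is only one way to do it, under stated assumptions: it enumerates all sign flips and scans a grid of candidate centres rather than using the single-distribution shortcut, and the seed, the grid size and the names `signs`, `perm_p` and `ci_perm` are made up for illustration.

```r
set.seed(42)            # arbitrary seed, only for reproducibility
y <- rlogis(8, 50, 1)   # same kind of sample as in the R example above
n <- length(y)

# All 2^n sign patterns (256 rows for n = 8): the full permutation
# distribution under symmetry, no sampling needed.
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))

# Two-sided sign-flip p-value for H0: centre of symmetry = mu0.
perm_p <- function(mu0) {
  d <- y - mu0
  perm_sums <- as.vector(signs %*% abs(d))  # sums under every sign pattern
  # Small tolerance guards against floating-point ties.
  mean(abs(perm_sums) >= abs(sum(d)) - 1e-12)
}

# Keep every candidate centre that is not rejected at the 5% level.
grid <- seq(min(y), max(y), length.out = 1001)
p_values <- vapply(grid, perm_p, numeric(1))
ci_perm <- range(grid[p_values > 0.05])
ci_perm
```

With only 8 observations the smallest attainable two-sided p-value is 2/256, so a 95% interval is achievable but a 99.5% one would not be.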

Confidence Interval – Calculating CI for Monte Carlo-Simulated Non-Normal Data

It is not really clear what you want to achieve and, as already noted by Björn, there is no reason to expect anything other than narrow confidence intervals with such a large sample.

Since you are doing simulation, you can explicitly simulate the distribution of means (or medians) and verify that the bootstrap confidence interval quite closely matches the 95% quantile interval of the distribution of empirical means (see code example below).

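```
library(truncdist)  # provides rtrunc()

# One simulated data set: the overall mean of three truncated samples
simfun <- function() {
  S1 <- rtrunc(10000, spec = "lnorm", a = 0, b = 1600, meanlog = 4.4166, sdlog = 1.1334)
  S2 <- rtrunc(10000, spec = "logis", a = 0, location = 97.056, scale = 50.86)
  S3 <- rtrunc(10000, spec = "norm", a = 0, mean = 11.3, sd = 4.45)

  SampleS <- matrix(c(S1, S2, S3), nrow = 10000, ncol = 3)
  finalSmeans <- rowMeans(SampleS)
  mean(finalSmeans)
}

# bootmean is the bootstrap object built in the question (not shown here);
# CI.bca() is assumed to come from the resample package
CI.bca(bootmean)
##          2.5%    97.5%
## mean 92.92275 95.62323
quantile(replicate(500, simfun()), c(0.025, 0.975))
##     2.5%    97.5% 
## 91.39106 94.13525
```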
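
The same comparison can be sketched without extra packages. This is a simplified stand-in, not the setup from the question: it uses a single untruncated lognormal sample, a smaller sample size, and a percentile bootstrap instead of BCa. The seed, `n`, and all variable names are assumptions chosen for illustration.

```r
set.seed(123)  # arbitrary seed, only for reproducibility
n <- 2000      # smaller than in the question, to keep the run short

# Stand-in data generator (assumption: plain lognormal, no truncation).
draw_sample <- function() rlnorm(n, meanlog = 4.4166, sdlog = 1.1334)

# Percentile bootstrap interval for the mean, from ONE observed sample.
x <- draw_sample()
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# 95% quantile interval of the mean across many freshly simulated samples.
sim_means <- replicate(500, mean(draw_sample()))
sim_interval <- quantile(sim_means, c(0.025, 0.975))

rbind(bootstrap = boot_ci, simulated = sim_interval)

# Compare the widths: both should be narrow and of similar size.
c(bootstrap = unname(diff(boot_ci)), simulated = unname(diff(sim_interval)))
```

The bootstrap interval is centred on the mean of the one observed sample, so the two intervals can be shifted relative to each other even when their widths agree.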
