Solved – Box plot notches vs. Tukey-Kramer interval

boxplotdata visualizationmediantukey-hsd-test

The "notch" help document (or original text) from boxplot in 'R' gives the following:

If the notches of two plots do not overlap this is ‘strong evidence’
that the two medians differ (Chambers et al, 1983, p. 62). See
boxplot.stats for the calculations used.

and the 'boxplot.stats' gives the following:

The notches (if requested) extend to +/-1.58 IQR/sqrt(n). This seems
to be based on the same calculations as the formula with 1.57 in
Chambers et al (1983, p. 62), given in McGill et al (1978, p. 16).
They are based on asymptotic normality of the median and roughly equal
sample sizes for the two medians being compared, and are said to be
rather insensitive to the underlying distributions of the samples. The
idea appears to be to give roughly a 95% confidence interval for the
difference in two medians.

Now I am more familiar with using the JMP version of the Tukey-Kramer test to compare the means of columns. Documentation for JMP gives this:

Shows a test that is sized for all differences among the means. This
is the Tukey or Tukey-Kramer HSD (honestly significant difference)
test. (Tukey 1953, Kramer 1956). This test is an exact alpha-level
test if the sample sizes are the same, and conservative if the sample
sizes are different (Hayter 1984).

Question:
What is the nature of the connection between the two approaches? Is there a way to transform one into the other?

It looks like one is looking for an approximate 95% CI for the median, and determining whether there is overlap; and the other is an "exact alpha test" (my samples are the same size) for determining whether the medians of two sets of samples are within a reasonable range of each other.

I reference packages, but I'm interested in the math behind the logic.

Best Answer

As far as the notched boxplot goes, the McGill et al [1] reference mentioned in your question contains pretty complete details (not everything I say here is explicitly mentioned there, but nevertheless it's sufficiently detailed to figure it out).

The interval is a robustified but Gaussian-based one

The paper quotes the following interval for notches (where $M$ is the sample median and $R$ is the sample interquartile range):

$$M\pm 1.7 \times 1.25R/(1.35\sqrt{N})$$

where:

$1.35$ is an asymptotic conversion factor to turn IQRs into estimates of $\sigma$ -- specifically, it's approximately the difference between the 0.75 quantile and the 0.25 quantile of a standard normal; the population quartiles are about 1.35 $\sigma$ apart, so a value of around $R/1.35$ should be a consistent (asymptotically unbiased) estimate of $\sigma$ (more accurately, about 1.349).
$1.25$ comes in because we're dealing with the asymptotic standard error of the median rather than the mean. Specifically, the asymptotic variance of the sample median is $\frac{1}{4nf_0^2}$ where $f_0$ is the density-height at the median. For a normal distribution, $f_0$ is $\frac{1}{\sqrt{2\pi}\sigma}\approx \frac{0.3989}{\sigma}$, so the asymptotic standard error of the sample median is $\frac{1}{2\sqrt{N}f_0}= \sqrt{\pi/2}\sigma/\sqrt{N}\approx 1.253\sigma/\sqrt{N}$.

As StasK mentions here, the smaller $N$ is, the the more dubious this would be (replacing his third reason with one about the reasonableness of using the normal distribution in the first place.

Combining the above two, we obtain an asymptotic estimate of the standard error of the median of about $1.25R/(1.35\sqrt{N})$. McGill et al credit this to Kendall and Stuart (I don't recall whether the particular formula occurs there or not, but the components will be).
So all that's left to discuss is the factor of 1.7.

Note that if we were comparing one sample to a fixed value (say a hypothesized median) we'd use 1.96 for a 5% test; consequently, if we had two very different standard errors (one relatively large, one very small), that would be about the factor to use (since if the null were true, the difference would be almost entirely due to variation in the one with larger standard error, and the small one could - approximately - be treated as effectively fixed).

On the other hand, if the two standard errors were the same, 1.96 would be much too large a factor, since both sets of notches come into it -- for the two sets of notches to fail to overlap we are adding one of each. This would make the right factor $1.96/\sqrt{2}\approx 1.386$ asymptotically.

Somewhere in between , we have 1.7 as a rough compromise factor. McGill et al describe it as "empirically selected". It does come quite close to assuming a particular ratio of variances, so my guess (and it's nothing more than that) is that the empirical selection (presumably based on some simulation) was between a set of round-value ratios for the variances (like 1:1, 2:1,3:1,... ), of which the "best compromise" $r$ from the $r:1$ ratio was then plugged into $1.96/\sqrt{1+1/r}$ rounded to two figures. At least it's a plausible way to end up very close to 1.7.

Putting them all (1.35,1.25 and 1.7) together gives about 1.57. Some sources get 1.58 by computing the 1.35 or the 1.25 (or both) more accurately but as a compromise between 1.386 and 1.96, that 1.7 is not even accurate to two significant figures (it's just a ballpark compromise value), so the additional precision is pointless (they might as well have just rounded the whole thing to 1.6 and be done with it).

Note that there's no adjustment for multiple comparisons anywhere here.

There's some distinct analogies in the confidence limits for a difference in the Tukey-Kramer HSD:

$$\bar{y}_{i\bullet}-\bar{y}_{j\bullet} \pm \frac{q_{\alpha;k;N-k}}{\sqrt{2}}\widehat{\sigma}_\varepsilon \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$$

But note that

this is a combined interval, not two separate contributions to a difference (so we have a term in $c.\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$ rather than the two contributing separately $k.\sqrt{\frac{1}{n_{i}}}$ and $k.\sqrt{\frac{1}{n_j}}$ and we assume constant variance (so we're not dealing with the compromise with the $1.96$ - when we might have very different variances - rather than the asymptotic $1.96/\sqrt{2}$ case)
it's based on means, not medians (so no 1.35)
it's based on $q$, which is based in turn on the largest difference in means (so there's not even any 1.96 part in this one, even one divided by $\sqrt{2}$). By contrast in comparing multiple box plots, there's no consideration of basing the notches on the largest difference in medians, it's all purely pairwise.

So while several of the ideas behind the form of components are somewhat analogous, they're actually quite different in what they're doing.

[1] McGill, R., Tukey, J. W. and Larsen, W. A. (1978) Variations of box plots. The American Statistician 32, 12–16.

Related Solutions

Solved – Tukey-Kramer test

It is called the LSD Threshold Matrix, according to the JMP Statistics and Graphics Guide (it is headed as Abs(Dif)-LSD), and those numbers reflect absolute difference between group means minus Fisher's least significant difference (LSD). They are displayed as off diagonal entries with the greatest difference in the upper-right or lower-left corner. The diagonal entries represent the comparison of each group mean with itself and are just the opposite of the LSD value. (Note that the matrix is symmetric so you just need to look at the upper or lower diagonal entries.)

Hence, as stated in the documentation, a high positive value would indicate a large departure from the LSD, whereas negative values would mean that the observed difference of means is less than it. See also Ramirez and Ramirez, Analyzing and Interpreting Continuous Data Using JMP: A Step-by-Step Guide, SAS Publishing 2009 (pp. 314-316), for more information on how JMP handles the computation of LSD and HSD. Gerard E. Dallal also offers a good overview of Multiple Comparison Procedures.

Solved – Test for median difference

You could consider a permutation test.

median.test <- function(x,y, NREPS=1e4) {
  z <- c(x,y)
  i <- rep.int(0:1, c(length(x), length(y)))
  v <- diff(tapply(z,i,median))
  v.rep <- replicate(NREPS, {
    diff(tapply(z,sample(i),median))
  })
  v.rep <- c(v, v.rep)
  pmin(mean(v < v.rep), mean(v>v.rep))*2
}

set.seed(123)
n1 <- 100
n2 <- 200
## the two samples
x <- rnorm(n1, mean=1)
y <- rexp(n2, rate=1)
median.test(x,y)

Gives a 2 sided p-value of 0.1112 which is a testament to how inefficient a median test can be when we don't appeal to any distributional tendency.

If we used MLE, the 95% CI for the median for the normal can just be taken from the mean since the mean is the median in a normal distribution, so that's 1.00 to 1.18. The 95% CI for the median for the exponential can be framed as $\log(2)/\bar{X}$, which by the delta method is 0.63 to 0.80. Therefore the Wald test is statistically significant at the 0.05 level but the median test is not.

Best Answer

Related Solutions

Solved – Tukey-Kramer test

Solved – Test for median difference

Related Question