The answer to this question is actually straightforward. The power advantage of creating composite samples is gained by reducing the variability of sample estimates and improving the small-sample approximation to the normal. For $X_1, X_2, \ldots, X_n$ having $\sqrt{n} \left( \bar{X} - \mu_x \right) \rightarrow_d \mathcal{N} \left( 0, \sigma^2 \right)$, the vector of $m = n/d$ composites of size $d$ is defined by $U_i = \sum_{j=(i-1)d+1}^{id} X_j / d$, and likewise for the second sample, $\vec{Y}$, with composites $\vec{V}$. We then have that the composites are independent samples with $\sqrt{m} \left( \bar{U} - \mu_x \right) \rightarrow_d \mathcal{N} \left( 0, \sigma^2/d \right)$, and likewise for $\vec{V}$.
So you can just calculate the pooled variance and the reduced effective degrees of freedom based on the smaller number of composite observations, and that's all there is to it.
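The compositing step can be sketched as follows (a minimal sketch with hypothetical sample sizes; the block-averaging helper and the choice of a pooled two-sample t-test on the composites follow the description above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def composite(x, d):
    """Average non-overlapping blocks of d observations into composites."""
    x = np.asarray(x)
    m = len(x) // d
    return x[:m * d].reshape(m, d).mean(axis=1)

# Hypothetical example: two samples of n = 60, composited in groups of d = 4,
# giving m = 15 composites each, with composite variance about sigma^2 / d.
x = rng.normal(0.0, 2.0, size=60)
y = rng.normal(0.5, 2.0, size=60)
u, v = composite(x, 4), composite(y, 4)

# Pooled-variance t-test on the composites; the effective degrees of
# freedom drop to 2(m - 1) because only the composites are observed.
t, p = stats.ttest_ind(u, v)
```

Note that when $n$ is divisible by $d$, the mean of the composites equals the mean of the raw sample, so nothing is lost in the location estimate; only the count of independent observations changes.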
As far as the notched boxplot goes, the McGill et al [1] reference mentioned in your question contains pretty complete details (not everything I say here is explicitly mentioned there, but nevertheless it's sufficiently detailed to figure it out).
The interval is a robustified but Gaussian-based one.
The paper quotes the following interval for notches (where $M$ is the sample median and $R$ is the sample interquartile range):
$$M\pm 1.7 \times 1.25R/(1.35\sqrt{N})$$
where:
$1.35$ is an asymptotic conversion factor to turn IQRs into estimates of $\sigma$ -- specifically, it's approximately the difference between the 0.75 quantile and the 0.25 quantile of a standard normal; the population quartiles are about 1.35 $\sigma$ apart, so a value of around $R/1.35$ should be a consistent (asymptotically unbiased) estimate of $\sigma$ (more accurately, about 1.349).
$1.25$ comes in because we're dealing with the asymptotic standard error of the median rather than the mean. Specifically, the asymptotic variance of the sample median is $\frac{1}{4nf_0^2}$ where $f_0$ is the density-height at the median. For a normal distribution, $f_0$ is $\frac{1}{\sqrt{2\pi}\sigma}\approx \frac{0.3989}{\sigma}$, so the asymptotic standard error of the sample median is $\frac{1}{2\sqrt{N}f_0}= \sqrt{\pi/2}\sigma/\sqrt{N}\approx 1.253\sigma/\sqrt{N}$.
As StasK mentions here, the smaller $N$ is, the more dubious this becomes (replacing his third reason with one about the reasonableness of assuming a normal distribution in the first place).
Combining the above two, we obtain an asymptotic estimate of the standard error of the median of about $1.25R/(1.35\sqrt{N})$. McGill et al credit this to Kendall and Stuart (I don't recall whether the particular formula occurs there or not, but the components will be).
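As a quick numeric check of the two constants and the resulting standard error (a sketch, not something from the paper; the simulation settings are my own choice):

```python
import numpy as np
from scipy import stats

# 1.35: distance between the 0.25 and 0.75 quantiles of a standard normal
c_iqr = stats.norm.ppf(0.75) - stats.norm.ppf(0.25)   # about 1.349

# 1.25: sqrt(pi/2), the asymptotic SE inflation of the median vs the mean
c_med = np.sqrt(np.pi / 2)                            # about 1.253

# Simulation check: SD of sample medians for N(0,1) samples of size N
# should be close to the asymptotic value 1.253 / sqrt(N)
rng = np.random.default_rng(1)
N = 100
medians = np.median(rng.normal(size=(20000, N)), axis=1)
approx_se = c_med / np.sqrt(N)   # about 0.1253
```

The simulated standard deviation of the 20,000 sample medians lands within a few percent of the asymptotic value, which is the sense in which $1.25R/(1.35\sqrt{N})$ estimates the standard error of the median.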
So all that's left to discuss is the factor of 1.7.
Note that if we were comparing one sample to a fixed value (say a hypothesized median) we'd use 1.96 for a 5% test; consequently, if we had two very different standard errors (one relatively large, one very small), that would be about the factor to use (since if the null were true, the difference would be almost entirely due to variation in the one with larger standard error, and the small one could - approximately - be treated as effectively fixed).
On the other hand, if the two standard errors were the same, 1.96 would be much too large a factor, since both sets of notches come into it -- for the two sets of notches to fail to overlap we are adding one of each. This would make the right factor $1.96/\sqrt{2}\approx 1.386$ asymptotically.
Somewhere in between, we have 1.7 as a rough compromise factor. McGill et al describe it as "empirically selected". It does come quite close to assuming a particular ratio of variances, so my guess (and it's nothing more than that) is that the empirical selection (presumably based on some simulation) was among a set of round-value ratios for the variances (like 1:1, 2:1, 3:1, ...), with the "best compromise" $r$ from the $r:1$ ratio then plugged into $1.96/\sqrt{1+1/r}$ and rounded to two figures. At least it's a plausible way to end up very close to 1.7.
Putting them all (1.35, 1.25 and 1.7) together gives about 1.57. Some sources get 1.58 by computing the 1.35 or the 1.25 (or both) more accurately, but as a compromise between 1.386 and 1.96, that 1.7 is not even accurate to two significant figures (it's just a ballpark compromise value), so the additional precision is pointless (they might as well have just rounded the whole thing to 1.6 and been done with it).
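The arithmetic above is easy to verify directly (the $1.96/\sqrt{1+1/r}$ form is the guessed one from the previous paragraph, not something McGill et al state):

```python
import math

def notch_factor(r):
    """Compromise factor 1.96/sqrt(1 + 1/r) for a variance ratio r:1
    (the guessed form discussed above, not a formula from the paper)."""
    return 1.96 / math.sqrt(1 + 1 / r)

equal_case = notch_factor(1)   # equal SEs: 1.96/sqrt(2), about 1.386
r3_case = notch_factor(3)      # a 3:1 ratio gives about 1.697 -> "1.7"

# Combining the three constants:
combined = 1.7 * 1.25 / 1.35                              # about 1.574
combined_precise = 1.7 * math.sqrt(math.pi / 2) / 1.349   # about 1.58
```

So a 3:1 variance ratio is one round-value choice that reproduces 1.7 almost exactly, consistent with the guess above.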
Note that there's no adjustment for multiple comparisons anywhere here.
There are some distinct analogies with the confidence limits for a difference in the Tukey-Kramer HSD:
$$\bar{y}_{i\bullet}-\bar{y}_{j\bullet} \pm \frac{q_{\alpha;k;N-k}}{\sqrt{2}}\widehat{\sigma}_\varepsilon \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$$
But note that:

1. this is a combined interval, not two separate contributions to a difference (so we have a single term in $c\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$ rather than two separate contributions $k\sqrt{\frac{1}{n_i}}$ and $k\sqrt{\frac{1}{n_j}}$), and we assume constant variance (so we're not dealing with the compromise between the $1.96$ case, where the variances might be very different, and the asymptotic $1.96/\sqrt{2}$ case);

2. it's based on means, not medians (so no 1.35);

3. it's based on $q$, which is in turn based on the largest difference in means (so there isn't even a 1.96 part in this one, even one divided by $\sqrt{2}$). By contrast, in comparing multiple box plots there's no consideration of basing the notches on the largest difference in medians; it's all purely pairwise.
So while several of the ideas behind the form of components are somewhat analogous, they're actually quite different in what they're doing.
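For concreteness, the Tukey-Kramer interval above can be sketched numerically. This assumes `scipy.stats.studentized_range` (available in SciPy since 1.7); the group sizes and error SD here are hypothetical:

```python
import numpy as np
from scipy.stats import studentized_range

def tukey_kramer_halfwidth(sigma_hat, n_i, n_j, k, df, alpha=0.05):
    """Half-width of the Tukey-Kramer interval for one pairwise
    difference: (q / sqrt(2)) * sigma_hat * sqrt(1/n_i + 1/n_j)."""
    q = studentized_range.ppf(1 - alpha, k, df)
    return (q / np.sqrt(2)) * sigma_hat * np.sqrt(1 / n_i + 1 / n_j)

# Hypothetical setting: k = 3 groups of 10, pooled error SD 2.0,
# error df = N - k = 27
hw = tukey_kramer_halfwidth(2.0, 10, 10, k=3, df=27)
```

Since $q$ here is the $0.95$ quantile of the studentized range of $k$ means, the factor $q/\sqrt{2}$ plays the role that the $1.7$ plays for notches, but it is calibrated to the largest of all pairwise differences rather than to a single pair.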
[1] McGill, R., Tukey, J. W. and Larsen, W. A. (1978) Variations of box plots. The American Statistician 32, 12–16.
Since you have the sample means and your hypothesis relates to population means, I've assumed you'll definitely want to use the sample means in what follows.
With some distributional assumptions, you can certainly get somewhere.
e.g. if you assume normality, the population interquartile range is about $1.35\sigma$, so if the sample is large enough that the population IQR is estimated with little error, you can estimate $\sigma$ and have an effective z-test.
In this case, if you don't assume equal variances, you get $\tilde{\sigma}_i=\text{IQR}_i/1.35$, then calculate $\tilde{\sigma}_D^2 = \tilde{\sigma}_1^2/n_1+\tilde{\sigma}_2^2/n_2$, take $z^* = \frac{\bar{x}_1-\bar{x}_2}{\tilde{\sigma}_D}$, and look it up in z-tables.
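That calculation needs only the summary statistics; a minimal sketch (the summary values here are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def iqr_z_test(xbar1, iqr1, n1, xbar2, iqr2, n2):
    """Two-sample z-test using IQR-based scale estimates,
    assuming normality and not assuming equal variances."""
    s1, s2 = iqr1 / 1.35, iqr2 / 1.35       # sigma estimates from IQRs
    se = np.sqrt(s1**2 / n1 + s2**2 / n2)   # SE of the mean difference
    z = (xbar1 - xbar2) / se
    p = 2 * norm.sf(abs(z))                 # two-sided p-value
    return z, p

# Hypothetical summary statistics for the two samples
z, p = iqr_z_test(10.3, 4.1, 50, 8.9, 3.8, 45)
```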
[By way of a check, I just did a simulation in which I generated normal samples of size 30 (with equal variance, though I didn't assume it in the calculation), and the test is anticonservative (i.e. the type I error rate is higher than nominal): when you attempt a 5% test, you actually get somewhere in the region of 6.8% (and the approximation will likely be a bit worse if the variances differ). If you can tolerate that, then that's probably fine. Of course you could lower the significance level to compensate for the anticonservatism, but I'd be inclined to bite the bullet and try option 2. Once sample sizes hit 200 or so, though, this works pretty well.]
If either sample size is not large, you can still do something, but the distribution of the statistic will depend on the exact method by which the quartiles were computed as well as the particular sample sizes.
In particular, you could either
a. assume equal variances and use a test statistic akin to an equal-variance t-statistic but with an estimate of $\sigma^2$ based on a weighted average of the squares of the two IQRs; or
b. not make an assumption of equal variance and use a test statistic more akin to a Welch-Satterthwaite type statistic.
In the first case the distribution of the test statistic could be obtained fairly simply by simulation from the assumed distribution. (In the second case things are a bit more complicated because the distribution will depend on the way the spreads differ -- but something could still be done.)
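A sketch of that simulation approach for option (a), assuming normality (the $n_i$-proportional weighting of the squared IQRs and the sample sizes are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def iqr_t_stat(x, y):
    """Equal-variance t-analog with sigma^2 estimated from a
    weighted average of the squared IQRs (weights n_i are a choice)."""
    n1, n2 = len(x), len(y)
    q1 = np.subtract(*np.percentile(x, [75, 25]))
    q2 = np.subtract(*np.percentile(y, [75, 25]))
    s2 = (n1 * (q1 / 1.35)**2 + n2 * (q2 / 1.35)**2) / (n1 + n2)
    return (x.mean() - y.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))

# Null distribution of the statistic by simulation from the
# assumed (here standard normal) distribution
n1, n2, reps = 15, 15, 20000
stats_null = np.array([iqr_t_stat(rng.normal(size=n1), rng.normal(size=n2))
                       for _ in range(reps)])
crit = np.quantile(np.abs(stats_null), 0.95)  # two-sided 5% critical value
```

The simulated critical value comes out above 1.96, consistent with the anticonservatism noted earlier: the IQR-based scale estimate is noisier than the usual sample standard deviation, so the null distribution has heavier tails than the normal. Note also that the result depends on how `np.percentile` interpolates quartiles, as the text warns.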
If you're not prepared to make some distributional assumption, you can still bound the sample standard deviation and so get upper and lower bounds on the t-statistic; however, the bounds may not be very narrow.
If you hadn't had the sample means, you could use the medians in an analog of the t-test. If you're assuming normality (or even just symmetry and existence of means) then the medians will estimate the respective means; however, since we only need to deal with the difference in means, substantially weaker assumptions will suffice for this to work as a test.
In this case you can get critical values (or indeed, p-values) via simulation quite easily, but the null distribution under a normal assumption is pretty close to t-distributed; a quite decent approximation to the p-value can be obtained from t-tables, but suitable degrees of freedom are substantially lower than you'd have from a t-test (close to half!) -- and the test statistic should be scaled as well, since the variances don't exactly correspond.
This won't have especially good power at the normal, but it will have good robustness to deviations from normality.
As an example, for a statistic of this form:
$t^* = \frac{\tilde{x}_1-\tilde{x}_2}{\sqrt{q_1^2/n+q_2^2/n}}$
where $\tilde{x}_i$ is the median of sample $i$ and $q_i$ is the interquartile range of sample $i$ (analogous to a particular form of two-sample t-test with equal variance and equal $n$), I simulated 40,000 pairs of samples of size 30 and 30.
A Q-Q plot of absolute values of $t^*$ vs absolute values of quantiles of $c\cdot t_{40}$ (for $c=1.064$) is plotted below (grey), and the 45 degree line is drawn in green. The second plot shows detail in the region of typical significance levels (including, but not limited to values between 1% and 10%). The approximation is accurate to about 3 figures over most of that range.
[Similar plots are obtained for a variety of other degrees of freedom in the vicinity (with suitably chosen $c$) for each. Simulations at a variety of sample sizes suggest that t-distribution approximations work well across a wide range of $n$ for the equal-variance equal-sample-size case. I expect approximation via t-distributions will be adequate for the equal-variance unequal-sample-size case, but the simulations and analysis required would take a more substantial amount of time.]
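A sketch of the statistic and the scaled-t approximation described above (the constant $c=1.064$ and the 40 degrees of freedom are taken from the text for the $n_1=n_2=30$ case; the data here are hypothetical):

```python
import numpy as np
from scipy import stats

def median_iqr_t(x, y):
    """t* = (difference of medians) / sqrt(q1^2/n + q2^2/n), equal n."""
    n = len(x)  # assumes len(y) == n
    q1 = np.subtract(*np.percentile(x, [75, 25]))
    q2 = np.subtract(*np.percentile(y, [75, 25]))
    return (np.median(x) - np.median(y)) / np.sqrt(q1**2 / n + q2**2 / n)

def approx_p(tstar, c=1.064, df=40):
    """Approximate two-sided p-value via the scaled t approximation:
    t*/c is referred to a t distribution with the reduced df."""
    return 2 * stats.t.sf(abs(tstar) / c, df)

# Hypothetical data: two normal samples of size 30 with a location shift
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 30)
y = rng.normal(1.0, 1.0, 30)
tstar = median_iqr_t(x, y)
p = approx_p(tstar)
```

Both the scaling constant and the appropriate degrees of freedom would need to be re-derived (by simulation, as described) for other sample sizes.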