Solved – Find Q1 and Q3 from median and IQR

descriptive statistics, interquartile, median, quantiles, range

A study gives the following:

$n = 67$

mean = 73

sd = 68

median = 55

IQR = 66

Is it possible from this information to get the actual Q1 and Q3 values? I used the $n$, mean & sd to get 95% CI. Should that be roughly similar?

Best Answer

As @Dave noted in a comment, you would have to make some assumptions about the distribution. Given the mean and the median being so different, it's likely that there is substantial skew - and you confirm this in a comment.

Various assumptions might be reasonable.

With median = 55 and IQR = 66 (and no other information or assumptions), a symmetric distribution would put the quartiles at 22 and 88. Without symmetry, though, you could have anything from about -10 and 56 to 54 and 120, because Q1 cannot exceed the median and Q3 cannot fall below it. You do have additional information: the mean and sd will limit the possibilities. You can probably also work out some things from the nature of the variable (e.g. is it always positive?) and try various distributions.
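To make the arithmetic concrete, here is a minimal Python sketch. The first function uses only the constraints Q1 <= median <= Q3 and Q3 - Q1 = IQR, so its extreme cases let a quartile sit exactly on the median. The second function is purely an illustration of "trying a distribution": it assumes a lognormal shape matched to the reported mean and sd by moments. That choice is an assumption of this sketch, not something the study or the answer states, and the function names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def quartile_extremes(median, iqr):
    """Quartile pairs consistent with a given median and IQR alone.

    Uses only Q1 <= median <= Q3 and Q3 - Q1 = IQR.
    """
    symmetric = (median - iqr / 2, median + iqr / 2)
    lowest = (median - iqr, median)    # boundary case: Q3 sits on the median
    highest = (median, median + iqr)   # boundary case: Q1 sits on the median
    return symmetric, lowest, highest


def lognormal_quartiles(mean, sd):
    """Q1, median, Q3 of a lognormal matched to mean and sd by moments.

    The lognormal shape is an ASSUMPTION made for illustration only.
    """
    sigma2 = log(1 + (sd / mean) ** 2)   # variance on the log scale
    mu = log(mean) - sigma2 / 2          # mean on the log scale
    sigma = sqrt(sigma2)
    z = NormalDist().inv_cdf(0.75)       # standard normal upper quartile
    return exp(mu - z * sigma), exp(mu), exp(mu + z * sigma)


symmetric, lowest, highest = quartile_extremes(median=55, iqr=66)
q1, med, q3 = lognormal_quartiles(mean=73, sd=68)
print("symmetric:", symmetric, "extremes:", lowest, highest)
print("lognormal Q1, median, Q3:", round(q1, 1), round(med, 1), round(q3, 1))
```

With the study's numbers, the moment-matched lognormal gives a median in the low 50s and an IQR of about 60, not far from the reported 55 and 66, so it may be a reasonable starting point if the variable is strictly positive.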

Related Solutions

Solved – Estimate median from mean, std dev, and/or range

It's certainly possible to place some bounds on the median, but without further assumptions they might potentially be pretty weak bounds. The problem is that the only gauge you have on how skew it might be (particularly, in the sense of the second Pearson skewness) is the relative positions of the extrema to the mean, and they're typically a very weak indicator of that. Adding in the fact that the variable is nonnegative gives a second very weak indicator of skewness (the relative size of the standard deviation and mean).

But the second Pearson skewness does give us a bound: for a distribution, the median cannot be more than one standard deviation from the mean. (For a sample, because of the effect of the usual Bessel correction on standard deviation, it must lie somewhat inside those limits.)

If the standard deviation is small, that may be adequate for some purposes.

If we denote the median as $\stackrel{\sim}{x}$, the mean as $\bar{x}$, the usual sample standard deviation as $s_{n-1}$ (and let $s_n=\sqrt{\frac{n-1}{n}}s_{n-1}$ be the uncorrected s.d.), the minimum as $x_{(1)}$ and the maximum as $x_{(n)}$ then naively, we can immediately say that

$$\max(x_{(1)},\bar{x}-s_n)\leq\,\,\stackrel{\sim}{x}\,\,\leq \min(x_{(n)},\bar{x}+s_n)\,.$$
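One way to see where the one-standard-deviation bound comes from is the following sketch, written for a random variable $X$ with mean $\mu$, median $m$ and standard deviation $\sigma$ (this population notation is introduced here for illustration):

$$|\mu - m| = |E(X - m)| \leq E|X - m| \leq E|X - \mu| \leq \sqrt{E(X-\mu)^2} = \sigma\,.$$

The second inequality holds because the median minimizes the expected absolute deviation, and the third is Jensen's inequality. Applying the same chain to the empirical distribution of a sample gives $|\bar{x} - \stackrel{\sim}{x}| \leq s_n$, which is why the uncorrected s.d. appears in the bound.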

By more careful consideration of all the information, knowing the minimum and maximum might bound the result still further, but my guess is not necessarily by very much (it may help more in some cases than others). Knowing the sample size, $n$, may also add some important information, particularly if $n$ is small.

The fact that the variable is non-negative might help. Markov's inequality suggests that the median cannot be more than twice the mean, perhaps that may sometimes improve the bound from the mean plus a standard deviation (though, if the s.d. were greater than the mean, you'd usually expect the median to be lower than the mean; again it may be possible to get better bounds still).
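A short sketch of that Markov step, for a nonnegative variable $X$ with mean $\mu > 0$ (population notation introduced here for illustration):

$$P(X \geq 2\mu) \leq \frac{E[X]}{2\mu} = \frac{1}{2}\,,$$

so at most half of the probability mass lies at or above $2\mu$, and a median larger than $2\mu$ would contradict this. Applying the same argument to the empirical distribution of a sample gives $\stackrel{\sim}{x} \leq 2\bar{x}$.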

Anyway, adding that bound to our previous naive bounds, we have:

$$\max(x_{(1)},\bar{x}-s_n)\leq\,\,\stackrel{\sim}{x}\,\,\leq \min(x_{(n)},\bar{x}+s_n,2\bar{x})\,.$$
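The bound above is easy to evaluate from summary statistics. The following is a minimal Python sketch; the function name and its optional arguments are hypothetical, and the example call borrows the figures from the question at the top of this page (n = 67, mean = 73, sd = 68), where the minimum and maximum are not reported.

```python
from math import sqrt


def median_bounds(mean, sd, n, minimum=None, maximum=None, nonnegative=False):
    """Naive bounds on the sample median from summary statistics.

    sd is the usual (n - 1) sample standard deviation.
    """
    s_n = sqrt((n - 1) / n) * sd      # uncorrected (divide-by-n) s.d.
    lower = mean - s_n
    upper = mean + s_n
    if minimum is not None:
        lower = max(lower, minimum)
    if maximum is not None:
        upper = min(upper, maximum)
    if nonnegative:
        lower = max(lower, 0.0)       # only matters if the minimum is unknown
        upper = min(upper, 2 * mean)  # Markov bound for nonnegative data
    return lower, upper


lo, hi = median_bounds(mean=73, sd=68, n=67, nonnegative=True)
print(round(lo, 2), round(hi, 2))
```

For those figures the interval is wide (roughly 5.5 to 140.5), and the reported median of 55 falls well inside it, which illustrates how weak the bound can be when the s.d. is large relative to the mean.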

(In that situation we also know that the median is above $0$, but given we know $x_{(1)}$, that knowledge doesn't ever improve the lower bound.)

Edit: I simulated a few data sets from different distributions (partly to see how the bounds behaved and partly as a double check that I hadn't made any egregious errors). One of the examples did have the property that $2\bar x$ was a bit less than $\bar x +s_n$ (thus reducing the upper bound on the median, so adding that third component does sometimes help), but as I expected might often be the case, the actual median was less than the mean (so it didn't make the upper bound very close).

Still, the intervals did actually enclose the median for every example I did.
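A check of this kind can be reproduced along the following lines. This is not the simulation described above, just a minimal sketch: the distributions, sample sizes, seed, and number of replications are all arbitrary choices made for illustration.

```python
import random
from statistics import mean, median, pstdev


def bounds_hold(data):
    """Return True if the naive bounds enclose the sample median."""
    x_bar = mean(data)
    s_n = pstdev(data)                     # uncorrected (divide-by-n) s.d.
    lower = max(min(data), x_bar - s_n)
    upper = min(max(data), x_bar + s_n)
    if min(data) >= 0:                     # Markov bound for nonnegative data
        upper = min(upper, 2 * x_bar)
    return lower <= median(data) <= upper


rng = random.Random(0)                     # fixed seed for reproducibility
generators = {
    "exponential": lambda: rng.expovariate(1.0),
    "lognormal": lambda: rng.lognormvariate(0.0, 1.0),
    "uniform": lambda: rng.uniform(0.0, 1.0),
    "normal": lambda: rng.gauss(0.0, 1.0),
}

results = {}
for name, draw in generators.items():
    # 200 replications at each of three sample sizes
    results[name] = all(
        bounds_hold([draw() for _ in range(n)])
        for n in (5, 20, 100)
        for _ in range(200)
    )

print(results)
```

Since the bounds follow from inequalities that always hold, every entry should be True; the simulation is a sanity check on the algebra, not evidence about how tight the bounds are.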

If you assumed some distributional form (like, say, normality), then of course you can get much better estimates (/intervals).

Statistics – Why 95% CI for Median is ±1.57*IQR/√N?

That's easy. If we check the original paper where notched box-and-whisker plots were introduced (Robert McGill, John W. Tukey and Wayne A. Larsen. Variations of Box Plots, The American Statistician, Vol. 32, No. 1 (Feb., 1978), pp. 12-16; fortunately, it's on JSTOR), we find that section 7 justifies this formula in the following way:

Should one desire a notch indicating a 95 percent confidence interval about each median, C=1.96 would be used. [Here C is a different constant, related to ours, but the exact relation is of no importance, as will become clear later. - I.S.] However, since a form of "gap gauge" which would indicate significant differences at the 95 percent level was desired, this was not done. It can be shown that C = 1.96 would only be appropriate if the standard deviations of the two groups were vastly different. If they were nearly equal, C = 1.386 would be the appropriate value, with 1.96 resulting in far too stringent a test (far beyond 99 percent). A value between these limits, C = 1.7, was empirically selected as preferable. Thus the notches used were computed as $M \pm 1.7(1.25R/1.35 \sqrt{N})$.

Emphasis is mine. Note that $1.7\times 1.25/1.35=1.57$, which is your magic number.
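The arithmetic, and the notch it produces, can be checked with a few lines of Python. This is a minimal sketch: `notch_interval` is a hypothetical helper written for this illustration, and the example values (median 55, IQR 66, n = 67) are borrowed from the question at the top of this page.

```python
from math import sqrt


def notch_interval(median, iqr, n, c=1.7):
    """Notch limits M +/- c * (1.25 * R / (1.35 * sqrt(N))) from the quoted formula."""
    half_width = c * 1.25 * iqr / (1.35 * sqrt(n))
    return median - half_width, median + half_width


factor = 1.7 * 1.25 / 1.35
print(round(factor, 2))  # 1.57

low, high = notch_interval(median=55, iqr=66, n=67)
print(round(low, 1), round(high, 1))
```

For those example values the notch extends roughly 12.7 either side of the median. As the quoted passage makes clear, this is a device for comparing two medians visually, so the result should not be read as a general-purpose 95% confidence interval.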

So, the short answer is: it is not a general formula for median CI but a particular tool for visualization and the constant was empirically selected to achieve a particular goal.

There's no magic.

Sorry.
