Solved – Confidence intervals for median

confidence intervalmedianstandard error

I have a distribution of samples with a small number of values in each one (less than $10$). I have calculated the median for each sample, which I want to compare with a model and obtain the difference between the model and the median of each sample. To have a consistent result, I need an error on this difference.

It results that finding the standard deviation in such a case can be quite hard, at least for a non-pro like me (see for example here).

I have found this website which says how to calculate confidence intervals for the median, even if there is no official reference quoted.

It seems reasonable to me, but I can't really judge, so I would like to know:

  1. are those formulas correct?
  2. There is a reference for that?
  3. What about if I want to find CI different from $95\%$?

Thanks in advance

EDIT: I have also found this example of bootstrapping for non-Gaussian data. Now, I do not know much about bootstrapping, but it would be good to have an address on its validity.

Best Answer

Summary

When you can assume little or nothing about the true probability law, and can infer little about it--which is the case for small samples of $n$ observations--then a suitably chosen pair of order statistics will constitute a confidence interval for the median. Which order statistics to choose can easily be found with a quick analysis of the Binomial$(n, 1/2)$ distribution. There are some choices to be made in practice: these are discussed and illustrated at the end of this post.

Incidentally, the same analysis can be used to construct confidence intervals for any quantile $q$ (of which the median, corresponding to $q=50\%$, is one example). The Binomial$(n, q)$ distribution governs the solution in this case.

Introduction

Recall what a confidence interval (CI) means. The setting is an independent random sample $X = (X_1, X_2, \ldots, X_n)$ with each $X_i$ governed by the same distribution $F$. It is assumed only that $F$ is one element of a set $\Omega$ of possible distributions. Each of them has a median $F_{1/2}$. For any fixed $\alpha$ between $0$ and $1$, a CI of level $\alpha$ is a pair of functions (aka "statistics"), $L$ and $U$, such that

$${\Pr}_F(L(X)\le F_{1/2} \le U(X)) \ge 1 - \alpha.$$

The right hand side is the coverage of the CI for the distribution $F$.

Aside: for this to be useful, we also prefer that (1) the infimum of the coverages over $F\in\Omega$ be as small as possible and (2) the expected length of the interval, $\mathbb{E}_F(U(X)-L(X))$, should tend to be short for all or "most" $F\in\Omega$.

Analysis

Suppose we assume nothing about $\Omega$. In this situation we can still exploit the order statistics. These are the specific values in the sorted sample. To simplify the notation, let's sort the sample once and for all so that

$$X_1 \le X_2 \le \cdots \le X_n.$$

The value $X_i$ is the $i^\text{th}$ order statistic of the sample. Since we're assuming nothing about $\Omega$, we know nothing about $F$ at first, so we can't infer much about the likely intervals between each $X_i$ and its neighbor $X_{i+1}$. However, we can still reason quantitatively about the individual values: what is the chance that $X_i$ does not exceed the median of $F$? To figure this out, let $Y$ be a random variable governed by $F$, and let

$$\pi_F = {\Pr}_F(Y \le F_{1/2})$$

be the chance that $Y$ does not exceed the median of $F$. Then when $X_i \le F_{1/2}$ we know (since $X_1\le \cdots \le X_i \le F_{1/2}$) that our original unordered sample of $n$ values must have contained at least $i$ values not exceeding $F_{1/2}$.

This is a Binomial problem. Formally, if we define the random variable $Z$ to equal $1$ when $Y \le F_{1/2}$ and $0$ otherwise, the foregoing shows that $Z$ has a Bernoulli distribution with parameter $\pi_F$. A "success" consists in observing a value at or below the median. Therefore $\Pr(X_i \gt F_{1/2})$ is given by the Binomial probability associated with fewer than $i$ successes:

$$\Pr(X_i \gt F_{1/2}) = \sum_{j=0}^{i-1} \binom{n}{j} \pi_F^j(1-\pi_F)^{n-j}.$$

You probably noticed that $\pi_F \ge 1/2$. In fact, for many distributions the two values are equal: they differ only when $F$ assigns positive probability to the median $F_{1/2}$. To analyze the difference, write $\pi_F = 1/2 + \varepsilon$ for $\varepsilon \ge 0$. For $2(j-1) \le n$ this implies

$$\eqalign{ \pi_F^j(1-\pi_F)^{n-j} &= (1/2+\varepsilon)^j(1/2-\varepsilon)^{n-j} = (1/2+\varepsilon)^j[(1/2-\varepsilon)^j(1/2-\varepsilon)^{n-2j}]\\ &=(1/4-\varepsilon^2)^j(1/2-\varepsilon)^{n-2j} \le (1/4)^j(1/2)^{n-2j}=2^{-n}. }$$

Consequently, when $2(i-1) \le n$, we may get rid of the dependence of the sum on $F$, at the cost of replacing the equality by an inequality:

$$\Pr(X_i \gt F_{1/2}) \le 2^{-n}\sum_{j=0}^{i-1} \binom{n}{j}.$$

Exactly the same argument (applied by reversing the order statistics) shows that when $2(i+1) \ge n$,

$$\Pr(X_i \lt F_{1/2}) \le 2^{-n}\sum_{j=i+1}^n \binom{n}{j}.$$

The right hand sides reduce to zero whenever $i \le 0$ (in the first case) or $i \ge n$ (in the second). Therefore, it is always possible to find indexes $l \le u$ for which

$$\eqalign{ \Pr(X_l \gt F_{1/2} \text{ or } X_u \lt F_{1/2}) &= \Pr(X_l \gt F_{1/2}) + \Pr( X_u \lt F_{1/2}) \\ &\le 2^{-n}\left(\sum_{j=0}^{l-1} \binom{n}{j} + \sum_{j=u+1}^n \binom{n}{j}\right). }$$

Solution

This is the complement of the defining condition for a confidence interval, and therefore equivalent to it:

$$\Pr(X_l \le F_{1/2}\le X_u ) \ge 2^{-n}\sum_{j=l}^u \binom{n}{j}.$$

By selecting $l \le u$ to make the right hand side at least $1-\alpha$, we will have found a confidence interval procedure whose level is at least $1-\alpha$.

In other words, upon choosing such indexes $l$ and $u$, by setting $L(X) = X_l$ and $U(X) = X_u$, the interval $[L(X), U(X)]$ will be a CI for the median $F_{1/2}$ having coverage at least $1-\alpha$. You can compute its actual coverage in terms of Binomial probabilities. This coverage will be attained for any distribution $F$ which assigns zero probability to $F_{1/2}$ (which includes all continuous distributions). It will be exceeded by any $F$ which assigns nonzero probability to $F_{1/2}$.

Discussion

At this point we have some choices. The commonest is to make the limits symmetric by setting $u$ reasonably close to $n+1-l$. In fact, by stipulating $u=n+1-l$, the confidence limits can be found for any $n$ with a quick search or by applying the Binomial quantile function.

For example, let $n=10$ and $\alpha=10\%$ (to illustrate a $1-\alpha=90\%$ CI procedure). Let's tally the lower part of the cumulative Binomial distribution with parameters $10$ and $1/2$:

> i <- 0:5; names(i) <- i; print(pbinom(i, 10, 1/2), digits=1)
    0     1     2     3     4     5   
0.001 0.011 0.055 0.172 0.377 0.623 

(This is an R command and its response.) Because the value at $2$, equal to $5.5\%$, is close to $\alpha/2$, it is tempting to take $l=3$ and $u=10+1-3=8$, for then the coverage will be $1 - 0.055 - 0.055 = 0.89$ which is close to the target of $90\%$. If you must achieve the desired coverage, then you need to take $l=2$ and $u=8$ or $l=3$ and $u=9$, both with coverage $1 - 0.011 - .055 = 0.935$.

As a check, let's simulate a lot of datasets from any distribution whatsoever, compute these CIs for the datasets, and tally the proportion of CIs that do cover the true median. This R example uses a Normal distribution:

n <- 10
n.sim <- 1e4
x <- apply(matrix(rnorm(n*n.sim), nrow=n), 2, sort)
covers <- function(x, l, u) mean(x[l, ] <= 0 & x[u, ] >= 0)
c(l3.u8=covers(x,3,8), l2.u8=covers(x,2,8), l3.u9=covers(x,3,9))

The output is

 l3.u8  l2.u8  l3.u9 
 0.8904 0.9357 0.9319 

The coverages agree closely with the theoretical values.

As another example, let's draw samples from a discrete distribution, such as a Poisson:

lambda <- 2
x <- apply(matrix(rpois(n*n.sim, 2), nrow=n), 2, sort)
med <- round(lambda + 1/3 - 0.02/lambda)
c(l3.u8=covers(x,3,8), l2.u8=covers(x,2,8), l3.u9=covers(x,3,9))

 l3.u8  l2.u8  l3.u9 
0.9830 0.9845 0.9964 

This time the coverages are much higher than anticipated. The reason is that there is a $27\%$ chance that a random value equals the median. This greatly increases the chance that the CI covers the median. This is not a problem or a paradox. By definition, the coverage has to be at least $1-\alpha$ no matter what the distribution $F$ is--but it's possible (as in this case) that the coverage for particular distributions is substantially greater than $1-\alpha$.

Therein lies the tradeoff: when you assume nothing about $F$, the CI based on order statistics is the only one you can construct. Its coverage for your true (but unknown) $F$ might be quite a bit higher than you expect. That means your CI will be wider than if you had made some stronger assumptions about $\Omega$ by limiting the possibilities for $F$.

Related Question