How to Obtain a Confidence Interval for a Percentile

Tags: confidence-interval, quantiles, tolerance-interval

I have a bunch of raw data values that are dollar amounts and I want to find a confidence interval for a percentile of that data. Is there a formula for such a confidence interval?

Best Answer

This question, which covers a common situation, deserves a simple, non-approximate answer. Fortunately, there is one.

Suppose $X_1, \ldots, X_n$ are independent values from an unknown distribution $F$ whose $q^\text{th}$ quantile I will write $F^{-1}(q)$. This means each $X_i$ has a chance of (at least) $q$ of being less than or equal to $F^{-1}(q)$. Consequently the number of $X_i$ less than or equal to $F^{-1}(q)$ has a Binomial$(n,q)$ distribution.

Motivated by this simple consideration, Gerald Hahn and William Meeker in their handbook Statistical Intervals (Wiley 1991) write

A two-sided distribution-free conservative $100(1-\alpha)\%$ confidence interval for $F^{-1}(q)$ is obtained ... as $[X_{(l)}, X_{(u)}]$

where $X_{(1)}\le X_{(2)}\le \cdots \le X_{(n)}$ are the order statistics of the sample. They proceed to say

One can choose integers $0 \le l \le u \le n$ symmetrically (or nearly symmetrically) around $q(n+1)$ and as close together as possible subject to the requirements that $$B(u-1;n,q) - B(l-1;n,q) \ge 1-\alpha.\tag{1}$$

The expression at the left is the chance that a Binomial$(n,q)$ variable has one of the values $\{l, l+1, \ldots, u-1\}$. Evidently, this is the chance that the number of data values $X_i$ falling within the lower $100q\%$ of the distribution is neither too small (less than $l$) nor too large ($u$ or greater).
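
To spell that step out, write $N$ (a symbol introduced here only for illustration) for the number of $X_i$ that are less than or equal to $F^{-1}(q)$, and assume $F$ is continuous so that $N$ is exactly Binomial$(n,q)$ and ties with the endpoints have probability zero. Under those assumptions the coverage event translates into a statement about $N$:

$$\begin{aligned}
X_{(l)} \le F^{-1}(q) &\iff N \ge l,\\
F^{-1}(q) < X_{(u)} &\iff N \le u-1,\\
\Pr\left(X_{(l)} \le F^{-1}(q) < X_{(u)}\right) &= \Pr(l \le N \le u-1) = B(u-1;n,q) - B(l-1;n,q).
\end{aligned}$$

The first line says that the $l^\text{th}$ smallest value is at or below the quantile exactly when at least $l$ of the values are; the second says that the $u^\text{th}$ smallest value exceeds the quantile exactly when fewer than $u$ of the values are at or below it.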

Hahn and Meeker follow with some useful remarks, which I will quote.

The preceding interval is conservative because the actual confidence level, given by the left-hand side of Equation $(1)$, is greater than the specified value $1-\alpha$. ...

It is sometimes impossible to construct a distribution-free statistical interval that has at least the desired confidence level. This problem is particularly acute when estimating percentiles in the tail of a distribution from a small sample. ... In some cases, the analyst can cope with this problem by choosing $l$ and $u$ nonsymmetrically. Another alternative may be to use a reduced confidence level.
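
To see how acute the small-sample problem in that remark can be, here is a minimal sketch in R. The values $n=10$ and $q=0.99$ are hypothetical and chosen only for illustration. It computes the coverage of the widest interval that can be built from two order statistics, $[X_{(1)}, X_{(n)}]$.

```r
n <- 10       # hypothetical small sample size
q <- 0.99     # hypothetical extreme percentile

# Widest interval built from two order statistics: [X_(1), X_(n)].
# Its coverage is the chance that a Binomial(n, q) count lies in {1, ..., n-1}.
max.coverage <- pbinom(n - 1, n, q) - pbinom(0, n, q)

# Closed form for the same quantity: 1 - q^n - (1-q)^n
closed.form <- 1 - q^n - (1 - q)^n

print(c(max.coverage, closed.form))   # both roughly 0.096, far short of 0.95
```

No choice of $l$ and $u$ can do better than this with ten observations, which is why the authors suggest asymmetric choices or a reduced confidence level in such cases.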


Let's work through an example (also provided by Hahn & Meeker). They supply an ordered set of $n=100$ "measurements of a compound from a chemical process" and ask for a $100(1-\alpha)=95\%$ confidence interval for the $q=0.90$ quantile (the $90^\text{th}$ percentile). They state that $l=85$ and $u=97$ will work.

Figure showing Binomial(100, 0.90) distribution

The total probability of this interval, as shown by the blue bars in the figure, is $95.2\%$: that's as close as one can get to $95\%$, yet still be above it, by choosing two cutoffs and eliminating all chances in the left tail and the right tail that are beyond those cutoffs.
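
The coverage for $l=85$, $u=97$ can be checked directly from Equation $(1)$. This is a small sketch using R's built-in Binomial functions; the variable names are illustrative.

```r
n <- 100; q <- 0.90
l <- 85;  u <- 97

# Left-hand side of Equation (1): chance that a Binomial(n, q) count
# lands in {l, ..., u-1}
coverage <- pbinom(u - 1, n, q) - pbinom(l - 1, n, q)

# The same number, summed bar by bar over the individual probabilities
coverage.sum <- sum(dbinom(l:(u - 1), n, q))

print(signif(coverage, 4))   # 0.9523
```

Both computations agree, since the cumulative difference is just the sum of the individual Binomial probabilities from $l$ through $u-1$.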

Here are the data, shown in order, leaving out $81$ of the values from the middle:

$$\matrix{ 1.49&1.66&2.05&\ldots&\mathbf {24.33}&24.72&25.46&25.67&25.77&26.64\\ 28.28&28.28&29.07&29.16&31.14&31.83&\mathbf{33.24}&37.32&53.43&58.11}$$

The $85^\text{th}$ smallest value is $24.33$ and the $97^\text{th}$ smallest is $33.24$. The interval therefore is $[24.33, 33.24]$.
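
Once $l$ and $u$ are known, reading the interval off raw data is only a matter of sorting. Here is a minimal sketch. The helper name `order.stat.interval` is hypothetical, and the vector `x` is a simulated stand-in, because the full Hahn & Meeker data set is not reproduced here.

```r
# Hypothetical helper: return the l-th and u-th order statistics of x.
order.stat.interval <- function(x, l, u) {
  stopifnot(1 <= l, l <= u, u <= length(x))
  x.sorted <- sort(x)            # ascending: x.sorted[k] is the k-th smallest
  c(lower = x.sorted[l], upper = x.sorted[u])
}

set.seed(1)
x <- rlnorm(100, meanlog = 2.5, sdlog = 0.7)   # simulated stand-in data (an assumption)
ci <- order.stat.interval(x, l = 85, u = 97)
print(ci)
```

With your own dollar amounts in `x`, the same call returns the endpoints of the interval for the percentile of interest, provided `l` and `u` were chosen for your `n` and `q`.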

Let's re-interpret that. This procedure was supposed to have at least a $95\%$ chance of covering the $90^\text{th}$ percentile. If that percentile actually exceeds $33.24$, that means we will have observed $97$ or more out of $100$ values in our sample that are below the $90^\text{th}$ percentile. That's too many. If that percentile is less than $24.33$, that means we will have observed $84$ or fewer values in our sample that are below the $90^\text{th}$ percentile. That's too few. In either case, exactly as indicated by the red bars in the figure, it would be evidence against the $90^\text{th}$ percentile lying within this interval.
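
The two "too many" and "too few" events can be given numbers. This is a short sketch with R's Binomial functions; the variable names are illustrative.

```r
n <- 100; q <- 0.90

# Percentile above X_(97): 97 or more sample values fall at or below it
p.too.many <- 1 - pbinom(96, n, q)

# Percentile below X_(85): 84 or fewer sample values fall at or below it
p.too.few <- pbinom(84, n, q)

# The two tails together are everything the interval fails to cover
print(c(too.many = p.too.many, too.few = p.too.few,
        total = p.too.many + p.too.few))
```

The two tails are not equal here. The procedure only requires that their sum not exceed $\alpha = 0.05$.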


One way to find good choices of $l$ and $u$ is to search according to your needs. Here is a method that starts with a symmetric approximate interval and then searches by varying both $l$ and $u$ by up to $2$ in order to find an interval with good coverage (if possible). It is illustrated with R code. It is set up to check the coverage in the preceding example for a Normal distribution. Its output is

Simulation mean coverage was 0.9503; expected coverage is 0.9523

The agreement between simulation and expectation is excellent.

```r
#
# Near-symmetric distribution-free confidence interval for a quantile `q`.
# Returns indexes into the order statistics.
#
quantile.CI <- function(n, q, alpha=0.05) {
  #
  # Search over a small range of upper and lower order statistics for the 
  # closest coverage to 1-alpha (but not less than it, if possible).
  #
  u <- qbinom(1-alpha/2, n, q) + (-2:2) + 1
  l <- qbinom(alpha/2, n, q) + (-2:2)
  u[u > n] <- Inf    # No order statistic beyond X_(n): upper limit is +Inf
  l[l < 1] <- -Inf   # No order statistic before X_(1): lower limit is -Inf
  coverage <- outer(l, u, function(a,b) pbinom(b-1,n,q) - pbinom(a-1,n,q))
  if (max(coverage) < 1-alpha) i <- which(coverage==max(coverage)) else
    i <- which(coverage == min(coverage[coverage >= 1-alpha]))
  i <- i[1]
  #
  # Return the indexes of the order statistics and the actual coverage.
  # (`coverage` is stored column by column: rows follow `l`, columns follow `u`.)
  #
  u <- rep(u, each=5)[i]
  l <- rep(l, 5)[i]
  return(list(Interval=c(l,u), Coverage=coverage[i]))
}
#
# Example: test coverage via simulation.
#
n <- 100      # Sample size
q <- 0.90     # Percentile
#
# You only have to compute the order statistics once for any given (n,q).
#
lu <- quantile.CI(n, q)$Interval
#
# Generate many random samples from a known distribution and compute 
# CIs from those samples.
#
set.seed(17)
n.sim <- 1e4
index <- function(x, i) ifelse(i==Inf, Inf, ifelse(i==-Inf, -Inf, x[i]))
sim <- replicate(n.sim, index(sort(rnorm(n)), lu))
#
# Compute the proportion of those intervals that cover the percentile.
#
F.q <- qnorm(q)
covers <- sim[1, ] <= F.q & F.q <= sim[2, ]
#
# Report the result.
#
message("Simulation mean coverage was ", signif(mean(covers), 4), 
        "; expected coverage is ", signif(quantile.CI(n,q)$Coverage, 4))
```