I have a bunch of raw data values that are dollar amounts and I want to find a confidence interval for a percentile of that data. Is there a formula for such a confidence interval?
Confidence Interval – How to Obtain for Percentiles
confidence-interval, quantiles, tolerance-interval
Best Answer
This question, which covers a common situation, deserves a simple, non-approximate answer. Fortunately, there is one.
Suppose $X_1, \ldots, X_n$ are independent values from an unknown distribution $F$ whose $q^\text{th}$ quantile I will write $F^{-1}(q)$. This means each $X_i$ has a chance of (at least) $q$ of being less than or equal to $F^{-1}(q)$. Consequently the number of $X_i$ less than or equal to $F^{-1}(q)$ has a Binomial$(n,q)$ distribution.
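The Binomial claim can be checked numerically. The following is only an illustrative sketch: it assumes a Uniform$(0,1)$ distribution (so that $F^{-1}(q)=q$), and the function names are hypothetical. It compares the simulated frequencies of the count with the Binomial$(n,q)$ probabilities.

```python
import random
from math import comb


def binom_pmf(k, n, q):
    """P(Y = k) for Y ~ Binomial(n, q)."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def count_below_quantile(n, q, rng):
    """Draw n Uniform(0,1) values and count how many are <= the q-th quantile.

    Assumption for illustration: for Uniform(0,1) the q-th quantile is q itself.
    """
    return sum(rng.random() <= q for _ in range(n))


def simulate_counts(n=20, q=0.9, trials=5000, seed=1):
    """Return one simulated count per trial."""
    rng = random.Random(seed)
    return [count_below_quantile(n, q, rng) for _ in range(trials)]


if __name__ == "__main__":
    n, q = 20, 0.9
    counts = simulate_counts(n, q)
    # Compare simulated frequencies with the Binomial(n, q) probabilities.
    for k in range(15, n + 1):
        simulated = counts.count(k) / len(counts)
        print(f"k={k:2d}  simulated={simulated:.3f}  binomial={binom_pmf(k, n, q):.3f}")
```

The two columns should agree up to simulation noise; any continuous distribution could be substituted for the uniform one, provided its true quantile is known.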
Motivated by this simple consideration, Gerald Hahn and William Meeker in their handbook Statistical Intervals (Wiley 1991) write
where $X_{(1)}\le X_{(2)}\le \cdots \le X_{(n)}$ are the order statistics of the sample. They proceed to say
The expression at the left is the chance that a Binomial$(n,q)$ variable has one of the values $\{l, l+1, \ldots, u-1\}$. Evidently, this is the chance that the number of data values $X_i$ falling within the lower $100q\%$ of the distribution is neither too small (less than $l$) nor too large ($u$ or greater).
Hahn and Meeker follow with some useful remarks, which I will quote.
Let's work through an example (also provided by Hahn & Meeker). They supply an ordered set of $n=100$ "measurements of a compound from a chemical process" and ask for a $100(1-\alpha)=95\%$ confidence interval for the $q=0.90$ percentile. They claim $l=85$ and $u=97$ will work.
The total probability of this interval, as shown by the blue bars in the figure, is $95.3\%$: that's as close as one can get to $95\%$, yet still be above it, by choosing two cutoffs and eliminating all chances in the left tail and the right tail that are beyond those cutoffs.
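The probability in question is the chance that a Binomial$(n,q)$ variable lies in $\{l, \ldots, u-1\}$, so it can be recomputed directly. Here is a minimal Python sketch using only the standard library; `binom_cdf` and `coverage` are names chosen for this illustration and do not come from Hahn & Meeker.

```python
from math import comb


def binom_cdf(k, n, q):
    """P(Y <= k) for Y ~ Binomial(n, q); returns 0 when k < 0."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k + 1))


def coverage(l, u, n, q):
    """P(l <= Y <= u - 1): the chance that [X_(l), X_(u)] covers the q-th quantile."""
    return binom_cdf(u - 1, n, q) - binom_cdf(l - 1, n, q)


if __name__ == "__main__":
    n, q = 100, 0.90
    print("l=85, u=97:", coverage(85, 97, n, q))
    # Tightening either end by one drops the coverage below 95%.
    print("l=85, u=96:", coverage(85, 96, n, q))
    print("l=86, u=97:", coverage(86, 97, n, q))
```

Running it shows that $l=85$, $u=97$ clears the $95\%$ requirement while the two tighter neighbours do not.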
Here are the data, shown in order, leaving out $81$ of the values from the middle:
$$\matrix{ 1.49&1.66&2.05&\ldots&\mathbf {24.33}&24.72&25.46&25.67&25.77&26.64\\ 28.28&28.28&29.07&29.16&31.14&31.83&\mathbf{33.24}&37.32&53.43&58.11}$$
The $85^\text{th}$ smallest value is $24.33$ and the $97^\text{th}$ smallest is $33.24$ (these are the order statistics $X_{(85)}$ and $X_{(97)}$). The interval therefore is $[24.33, 33.24]$.
Let's re-interpret that. This procedure was supposed to have at least a $95\%$ chance of covering the $90^\text{th}$ percentile. If that percentile actually exceeds $33.24$, that means we will have observed $97$ or more out of $100$ values in our sample that are below the $90^\text{th}$ percentile. That's too many. If that percentile is less than $24.33$, that means we will have observed $84$ or fewer values in our sample that are below the $90^\text{th}$ percentile. That's too few. In either case--exactly as indicated by the red bars in the figure--it would be evidence against the $90^\text{th}$ percentile lying within this interval.
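This interpretation can be tested by simulation. The sketch below assumes a standard Normal distribution (any continuous distribution with a known quantile would do) and uses hypothetical names. It repeatedly draws $n=100$ values, forms $[X_{(85)}, X_{(97)}]$, and records how often the true $90^\text{th}$ percentile falls inside.

```python
import random
from statistics import NormalDist


def simulate_coverage(n=100, q=0.90, l=85, u=97, trials=4000, seed=2024):
    """Estimate how often [X_(l), X_(u)] contains the true q-th quantile.

    Samples are standard Normal, an assumption made only for this check;
    the procedure itself does not depend on the distribution.
    """
    rng = random.Random(seed)
    true_quantile = NormalDist().inv_cdf(q)
    hits = 0
    for _ in range(trials):
        x = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        # Order statistics are 1-based; Python lists are 0-based.
        lower, upper = x[l - 1], x[u - 1]
        if lower <= true_quantile <= upper:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("simulated coverage:", simulate_coverage())
```

The estimate should land near the Binomial probability of the interval, up to simulation error of roughly half a percentage point at this number of trials.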
One way to find good choices of $l$ and $u$ is to search according to your needs. One such method starts with a symmetric approximate interval and then searches by varying both $l$ and $u$ by up to $2$ in order to find an interval with good coverage (if possible). It is illustrated with R code, set up to check the coverage in the preceding example for a Normal distribution. In its output, the agreement between simulation and expectation is excellent.
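The following is a Python sketch of this kind of search, not the R code referred to above. The function names, the restriction of $l$ and $u$ to $1,\ldots,n$, and the tie-breaking rule (take the smallest coverage that still reaches $1-\alpha$) are assumptions made for this illustration.

```python
from math import comb


def binom_cdf(k, n, q):
    """P(Y <= k) for Y ~ Binomial(n, q); returns 0 when k < 0."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k + 1))


def binom_quantile(p, n, q):
    """Smallest k with P(Y <= k) >= p."""
    for k in range(n + 1):
        if binom_cdf(k, n, q) >= p:
            return k
    return n


def quantile_ci_indices(n, q, alpha=0.05, width=2):
    """Search for order-statistic indices (l, u) near a symmetric starting guess.

    Returns (l, u, coverage). If some candidate reaches 1 - alpha, the one with
    the smallest such coverage is returned; otherwise the candidate with the
    largest coverage is returned, and the caller should check it.
    """
    target = 1 - alpha
    l0 = binom_quantile(alpha / 2, n, q)          # symmetric starting guess
    u0 = binom_quantile(1 - alpha / 2, n, q) + 1
    best_ok, best_any = None, None
    for l in range(max(l0 - width, 1), min(l0 + width, n) + 1):
        for u in range(max(u0 - width, 1), min(u0 + width, n) + 1):
            if l >= u:
                continue
            cov = binom_cdf(u - 1, n, q) - binom_cdf(l - 1, n, q)
            if best_any is None or cov > best_any[2]:
                best_any = (l, u, cov)
            if cov >= target and (best_ok is None or cov < best_ok[2]):
                best_ok = (l, u, cov)
    return best_ok if best_ok is not None else best_any


if __name__ == "__main__":
    l, u, cov = quantile_ci_indices(100, 0.90, alpha=0.05)
    print(f"l={l}, u={u}, coverage={cov:.4f}")
```

Given data `x`, the interval would then be read off as `sorted(x)[l - 1]` and `sorted(x)[u - 1]`. For very small $n$ no pair of order statistics may reach the target, which is why the returned coverage should always be compared with $1-\alpha$.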