[Math] Computing reverse percentile

interpolation, percentile, statistics

I am actually looking for a way to indicate a certain point on an interpolated curve in a plot (see this question), but judging by the number of answers there, that's just not possible.

So is there a way to compute the missing coordinate of that point?

More formally:

Given the 1st, 25th, 50th, 75th, 99th and 100th percentiles of some data and a value y, how do I compute x such that y is the x-th percentile (i.e. such that (x, y) lies on the curve interpolated from the aforementioned percentiles)?

If not, can I compute it if I have access to the data?

example percentiles:

0.01      1.4
0.25      1.4
0.50      1.5
0.75      1.5
0.99      8.9
1.00  18907.4

mean:   8.0722091348
stdev: 220.0677459302

Best Answer

(a) If you just have a few scattered percentiles, then you could do crude interpolation. (b) If you have sufficient data, you can do a lot better. (c) If you know the CDF of the population, you can get an exact answer for the population.
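
For case (a), here is a minimal sketch of the crude interpolation, using the six percentiles from the question. The helper name `reverse_percentile` is made up for illustration (it is not a base R function), and joining the known points with straight lines is an assumption about the curve, not something the data guarantee.

```r
# Percentile levels and values copied from the question.
p <- c(0.01, 0.25, 0.50, 0.75, 0.99, 1.00)
q <- c(1.4, 1.4, 1.5, 1.5, 8.9, 18907.4)

# Hypothetical helper (not a base R function): find the level x such that
# the piecewise-linear curve through the points (p, q) passes through (x, y).
# Assumes q is sorted in non-decreasing order and y is a single number.
reverse_percentile <- function(y, p, q) {
  n <- length(q)
  if (y < q[1] || y > q[n]) return(NA_real_)  # outside the known range
  i <- findInterval(y, q)                     # number of q values <= y
  if (i == n) return(p[n])                    # y equals the largest value
  # Here q[i] <= y < q[i + 1], so the denominator is strictly positive.
  p[i] + (y - q[i]) / (q[i + 1] - q[i]) * (p[i + 1] - p[i])
}

reverse_percentile(1.45, p, q)  # 0.375: halfway between the 25th and 50th
reverse_percentile(1.4, p, q)   # 0.25: upper end of the flat stretch at 1.4
reverse_percentile(5.2, p, q)   # 0.87: halfway between the 75th and 99th
```

Where the curve is flat (1.4 spans the 1st through 25th percentiles), the answer is not unique and this sketch reports the upper end of the range. Between the 99th and 100th percentiles the values jump from 8.9 to 18907.4, so a straight line there is especially rough.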

Suppose you know the distribution is $X \sim Norm(\mu = 100, \sigma=15)$. Then the probability that a random observation $X$ lies at or below 107 can be found with software, or by converting the 'raw score' 107 to the 'standard score' $Z = (107 - 100)/15 = 7/15 \approx 0.4667$ and consulting printed CDF tables of the standard normal distribution. In R, $P(X \le 107) = P(Z \le 7/15) \approx 0.6796.$

 pnorm(107, 100, 15)  # Norm(100, 15)
 ## 0.6796308
 pnorm(7/15)          # Norm(0, 1), default parameters assumed
 ## 0.6796308

If this distribution describes scores on the XYZ College Admissions Test, and State University is willing to accept students scoring within the top 10%, then what cutoff point on the XYZ test will they use? The quantile function is the inverse of the CDF, so you want $c$ such that $P(X \le c) = 0.90.$ You can find it by using printed normal tables in reverse or with software. The exact answer is $c \approx 119.22$; since a score of 119 falls just short of the 90th percentile, they will probably insist on a score of 120 or better.

qnorm(.9, 100, 15)
## 119.2233
pnorm(119, 100, 15)
## 0.8973627
pnorm(120, 100, 15)
## 0.9087888
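
As a quick check that the quantile function really undoes the CDF, one can feed the output of `qnorm` back into `pnorm`. This is only a sketch; the variable name `c90` is introduced here for illustration.

```r
# Cutoff below which 90% of a Norm(100, 15) population lies.
c90 <- qnorm(0.9, mean = 100, sd = 15)

# Round trip: the CDF evaluated at the 0.9 quantile gives back 0.9.
pnorm(c90, mean = 100, sd = 15)

# Same cutoff expressed through the upper tail (top 10%).
qnorm(0.1, mean = 100, sd = 15, lower.tail = FALSE)

# Smallest whole-number score that is inside the top 10%.
ceiling(c90)  # 120
```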

Now suppose you have data. In particular, I have simulated 1000 XYZ exam scores (rounded to integers) and put them into the vector x. I checked, and the sample mean is $\bar X = 100.5$ and the sample standard deviation is $S = 14.25,$ so this is a fairly typical sample.

What fraction of these 1000 scores lies at or below 107? The answer is 69% (not far from the theoretical 0.6796 above). One could sort the 1000 observations from smallest to largest and count the number at or below 107. (The expression x <= 107 is a logical vector of TRUEs and FALSEs. The mean of a logical vector is its proportion of TRUEs.)

x = round(rnorm(1000, 100, 15))
mean(x)
## 100.462
sd(x)
## 14.25208
mean(x <= 107)
## 0.69
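
If many such 'reverse percentile' lookups are needed, base R's `ecdf` builds the empirical CDF as a function. The sketch below regenerates data with an arbitrary seed (the seed used for the sample above is not known), so its numbers will differ slightly from the 0.69 shown.

```r
set.seed(1)  # arbitrary seed, chosen here only for reproducibility
x <- round(rnorm(1000, 100, 15))

# Empirical CDF: Fn(y) is the fraction of observations at or below y.
Fn <- ecdf(x)

Fn(107)                   # same value as mean(x <= 107)
Fn(c(90, 100, 107, 120))  # several lookups at once
```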

What is a number $c$ below which about 90% of the data lie? Again, we could get this from a sorted list of the scores. The answer is 119. That is, observation number 900 in the sorted list is 119.

quantile(x, .9)
##   90% 
##   119
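
The sorted-list reading can be sketched directly. Note that `quantile` defaults to an interpolating rule (type 7), whereas `type = 1` inverts the empirical CDF without interpolation and so picks out an actual observation. The data here come from an arbitrary seed, so the values need not match those above.

```r
set.seed(1)  # arbitrary seed, not the one behind the numbers above
x <- round(rnorm(1000, 100, 15))

s <- sort(x)
s[900]                      # observation number 900 in the sorted list

quantile(x, 0.9, type = 1)  # inverse empirical CDF: equals s[900]
quantile(x, 0.9)            # default type 7: interpolates s[900] and s[901]

mean(x <= s[900])           # at least 0.9; can exceed it when scores tie
```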

A histogram of the 1000 simulated test scores is shown below. The vertical lines show values discussed above. The superimposed curve is the density function of $Norm(100, 15).$ (The fit is about as good as one should expect for a sample of size 1000.)

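[Figure: histogram of the 1000 simulated test scores, with vertical lines at the values discussed above and the Norm(100, 15) density curve superimposed]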
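
A plot along these lines could be drawn with base R graphics roughly as follows. This is a sketch rather than the code behind the original figure: the seed, the variable name `scores`, and the styling choices are all assumptions.

```r
set.seed(2024)  # arbitrary seed, not the one behind the original figure
scores <- round(rnorm(1000, 100, 15))

# Density-scaled histogram so the normal curve is on the same scale.
h <- hist(scores, freq = FALSE, col = "grey90",
          main = "Simulated XYZ scores", xlab = "Score")

# Superimpose the Norm(100, 15) density.
curve(dnorm(x, mean = 100, sd = 15), add = TRUE, lwd = 2)

# Vertical lines at the two values discussed: 107 and the sample 90% point.
abline(v = 107, lty = 2)
abline(v = quantile(scores, 0.9), lty = 3)
```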
