Solved – Derive percentiles from binned data

average-precisionbinningquantiles

The question below was asked on a sister site (Stack Overflow) back in 2010 by a user still active there (to me it seems more suitable here, for example quite similar to 21422):

I have a bunch of data in Excel that I need to get certain percentile  
information from. The problem is that instead of having the data set made up of  
each value, I instead have info on the number of or "bucket" data.
For example, imagine that my actual data set looks like this:  
1,1,2,2,2,2,3,3,4,4,4  

The data set that I have is this:
Value    No. of occurrences
  1              2
  2              4
  3              2
  4              3
Is there an easy way for me to calculate percentile information (as well as the  
median) without having to explode the summary data out to full data set? (Once I  
did that, I know that I could just use the Percentile(A1:A5, p) function)  

This is important because my data set is very large. If I exploded the data out,  
I would have hundreds of thousands of rows and I would have to do it for a  
couple of hundred data sets.

I doubt after all this time the OP is still concerned but if not off topic here I would appreciate your views (but won’t be upset if this is closed!)

I am interested because (from a very long time ago!) I thought I had learned that, once binned, key information for accurate calculation of percentiles was irretrievably lost. But this is much more more my curiosity than a pressing need.

Best Answer

It is quite easy actually.

Let's say the sum of the counts is N, and you that you want the 0.3 (30%) bottom percentile. This means the threshold value will occur after 0.3*N counts.

Now you look at the cumulative distribution, and when it reaches 0.3*N, you have the value. It is very easy to implement.

For example, you had this:

Value    No. of occurrences
  1              2
  2              4
  3              2
  4              3

So you convert it into the cumulative:

Value    No. of occurrences
 <=1              2
 <=2              6
 <=3              8
 <=4              11

Here N=11 so 0.3*11=3.3. This happens in the second bin, so the 0.3 percentile is 2.

Related Question