The question below was asked on a sister site (Stack Overflow) back in 2010 by a user still active there (to me it seems more suitable here, for example quite similar to 21422):
I have a bunch of data in Excel that I need to get certain percentile information from. The problem is that instead of having the data set made up of each value, I instead have info on the number of or "bucket" data. For example, imagine that my actual data set looks like this: 1,1,2,2,2,2,3,3,4,4,4 The data set that I have is this: Value No. of occurrences 1 2 2 4 3 2 4 3 Is there an easy way for me to calculate percentile information (as well as the median) without having to explode the summary data out to full data set? (Once I did that, I know that I could just use the Percentile(A1:A5, p) function) This is important because my data set is very large. If I exploded the data out, I would have hundreds of thousands of rows and I would have to do it for a couple of hundred data sets.
I doubt after all this time the OP is still concerned but if not off topic here I would appreciate your views (but won’t be upset if this is closed!)
I am interested because (from a very long time ago!) I thought I had learned that, once binned, key information for accurate calculation of percentiles was irretrievably lost. But this is much more more my curiosity than a pressing need.
Best Answer
It is quite easy actually.
Let's say the sum of the counts is
N
, and you that you want the 0.3 (30%) bottom percentile. This means the threshold value will occur after0.3*N
counts.Now you look at the cumulative distribution, and when it reaches
0.3*N
, you have the value. It is very easy to implement.For example, you had this:
So you convert it into the cumulative:
Here
N=11
so0.3*11=3.3
. This happens in the second bin, so the 0.3 percentile is 2.