[Math] Most accurate ways to calculate the percentile of a huge dataset

statistics

I am wondering what the best way is to calculate a percentile of a huge dataset (billions of data points) as accurately as possible.

I used to assume the data followed a normal distribution and applied a logarithmic transformation to reduce skewness.

I have also used a frequency table to approximate the 99th, 95th, and 90th percentiles, as in the sketch below.
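For reference, here is a minimal R sketch of that frequency-table approach: bin the data, accumulate the counts, and report the edge of the first bin whose cumulative count reaches $p \cdot n$. The bin count of 1000 is a hypothetical choice (R may adjust a suggested count), and the result is only accurate to one bin width.

```r
# Approximate the p-th percentile from a frequency table (histogram).
# Accuracy is limited to one bin width.
approx_percentile <- function(x, p, bins = 1000) {
  h <- hist(x, breaks = bins, plot = FALSE)  # frequency table
  cum <- cumsum(h$counts)                    # cumulative counts
  i <- which(cum >= p * length(x))[1]        # first bin reaching p * n
  h$breaks[i + 1]                            # upper edge of that bin
}

set.seed(1)
x <- rlnorm(1e6)              # skewed data, as in the question
approx_percentile(x, 0.90)    # near the true value exp(qnorm(0.9)) ~ 3.60
```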

Is there another way to calculate these percentiles as accurately as possible?

Note that performance in R is not that important; whether the computation takes one hour or two doesn't matter much. I just need it to run fast enough to finish within 10 hours.

Best Answer

Once you have an approximation to the $90$th percentile, for example, you can make one pass through the data and find the exact value. Set a range that you are sure the actual $90$th percentile lies in. Read the data, counting the values that fall below your range, retaining the values that fall inside it, and counting the values above it. The better your approximation, the narrower the range can be and the fewer values you need to save. Then, using the counts, you can locate the exact $90$th percentile among the retained values: if $b$ values fell below the range, the percentile is the $(k - b)$-th smallest retained value, where $k = \lceil 0.9\,n \rceil$ and $n$ is the total count.
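To make the bookkeeping concrete, here is a minimal R sketch of this one-pass refinement, under a few assumptions: the chunked loop stands in for streaming data too large to hold in memory, `lo` and `hi` are hypothetical bounds assumed to bracket the true percentile, and the rank uses the order-statistic definition $k = \lceil p\,n \rceil$, which can differ slightly from R's default interpolating quantile.

```r
# One pass: count values below the range, retain values inside it,
# then pick the exact order statistic from the small retained set.
exact_percentile <- function(chunks, p, lo, hi) {
  n_below  <- 0            # values strictly below lo
  n_total  <- 0            # all values seen
  in_range <- numeric(0)   # values inside [lo, hi]
  for (x in chunks) {      # single pass over the data
    n_below  <- n_below + sum(x < lo)
    n_total  <- n_total + length(x)
    in_range <- c(in_range, x[x >= lo & x <= hi])
  }
  k <- ceiling(p * n_total)  # rank of the p-th percentile overall
  if (k <= n_below || k > n_below + length(in_range))
    stop("true percentile is outside [lo, hi]; widen the range")
  sort(in_range)[k - n_below]  # (k - b)-th smallest retained value
}

# Usage sketch: split a vector into chunks to mimic streaming.
set.seed(1)
data   <- rlnorm(1e6)
chunks <- split(data, ceiling(seq_along(data) / 1e5))
exact_percentile(chunks, p = 0.90, lo = 3, hi = 4)
```

If the check fails, widen the range and repeat; with a decent initial approximation (for example, from the frequency table above) the retained set is tiny and the extra pass is cheap.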
