[Math] Approximate 99th Percentile of dataset

statistics

So I have a huge set of randomly distributed data, and I want to find its 99th percentile.

The data changes every day, and so far I have been sorting it from scratch each time, which takes a long time because the dataset is huge. For example, if my data were the numbers 1 to 100, the 99th percentile would be 99 after sorting.

So if I have 1,000 data points and each day I replace 20 of them with new values, I have to recalculate the 99th percentile, even though 980 of the points are unchanged. I am wondering: is there a way to approximate the 99th percentile from the median, the standard deviation, etc.?

Best Answer

Neither of the offered answers improves on the main bottleneck, which is the $O(n\log n)$ sorting time (for an array of size $n$).

One can instead use a selection algorithm, which only partially sorts the data, to avoid the full sort and end up with a linear-time computation.

Most of the time this is used to compute the median, i.e. the 50th percentile, but it works for any order statistic. The standard method makes heavy use of the partitioning routine behind QuickSort and is hence called QuickSelect; the Wikipedia article on selection algorithms gives a full discussion of the field.
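To make the idea concrete, here is a minimal QuickSelect sketch in Python. It is not taken from the answer: the function name, the random-pivot choice, and the nearest-rank percentile convention are my own illustrative assumptions.

```python
import math
import random


def quickselect(data, k):
    """Return the k-th smallest element of data (0-indexed), in expected O(n) time.

    Repeatedly partitions around a random pivot (the routine behind QuickSort),
    but only keeps working on the side that contains index k.
    """
    a = list(data)                # work on a copy; the caller's data is untouched
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot_index = random.randint(lo, hi)   # random pivot -> expected linear time
        a[pivot_index], a[hi] = a[hi], a[pivot_index]
        pivot = a[hi]
        store = lo
        for i in range(lo, hi):                # Lomuto partition around the pivot
            if a[i] < pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]      # pivot lands at its final sorted position
        if k == store:
            return a[store]
        elif k < store:
            hi = store - 1                     # k-th smallest is in the left part
        else:
            lo = store + 1                     # k-th smallest is in the right part
    return a[lo]


# Example, using the nearest-rank convention from the question's 1..100 example:
values = list(range(1, 1001))
random.shuffle(values)
k = math.ceil(0.99 * len(values)) - 1   # 0-based rank of the 99th percentile
print(quickselect(values, k))           # prints 990
```

In practice a library routine such as NumPy's `numpy.partition` performs the same kind of partial sort for you, so you rarely need to hand-roll the partitioning.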
