Solved – the closest approximation for computing the $X$th percentile (say the 95th) from bins of data

quantiles

I have a few bins of data. From each bin (which has a few thousand data points), I can compute the total number of data points, the mean of the data points, and so on. I can probably compute any or most summary statistics (mean, median, etc.) of each bin; that part is computationally easy. However, I want to find the 95th percentile of all the data points. The total number of buckets is not that large (~90), so again, I have some room for computation on the individual summary statistics of each bin.

To find the 95th percentile, I would normally sort all the data points and pick out the one at the 95th percentile. That won't work in my case (too many data points; it is computationally intensive). Is there any close approximation to the 95th percentile that I can compute?

There are two phases to my problem: one in which I can compute summary statistics for each bin, and a second in which I do something with those per-bin statistics. Phase 1 and phase 2 happen in different locations, and I can't pass much data between the two.

This is turning out to be a mix of a statistics problem and a computer-systems design problem. Please let me know if there is a better place for such a question. :/

Best Answer

How are the data dispatched between the bins? If this is done "at random", you can get an unbiased estimate $q_{i,0.95}$ of the 95th percentile in each bin, with variance proportional to ${1 \over n_i}$, where $n_i$ is the size of bin number $i$.
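For the phase-1 side, here is a minimal R sketch of what each location could emit per bin, assuming the raw points of a bin are available as a numeric vector (summarise_bin and bin_data are illustrative names, not from the question or the answer):

# Phase 1: reduce a bin to two numbers -- its size n_i and its 95th-percentile
# estimate q_i -- so that very little data has to be passed on to phase 2.
summarise_bin <- function(bin_data, p = 0.95) {
  c(n = length(bin_data),
    q = quantile(bin_data, probs = p, type = 1, names = FALSE))
}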

Then you can aggregate these estimates by a weighted mean: $$q_{0.95} = { \sum_i n_i q_{i,0.95} \over \sum_i n_i },$$ which is the linear combination with minimum variance.
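In code, assuming phase 2 receives one $(n_i, q_{i,0.95})$ pair per bin (the vectors n and q below are illustrative names), the aggregation is a one-liner:

# Phase 2: size-weighted mean of the per-bin 95th-percentile estimates.
# n holds the bin sizes n_i, q the per-bin estimates q_{i,0.95}.
aggregate_quantile <- function(n, q) sum(n * q) / sum(n)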

Edit: more details on the variance

Let’s say we estimate $q_\alpha$, the quantile of level $\alpha$, by the $r$-th order statistic $X_{(r)}$ in a sample of size $n$, where $r = \lceil n\alpha \rceil$. Then, according to Davison, Statistical Models (eq. 2.27), this estimator is asymptotically distributed as $$\mathcal N\left(q_\alpha, {c_\alpha \over n} \right),$$ where $c_\alpha$ is a constant depending on the distribution of the data (precisely, $c_\alpha = {\alpha (1-\alpha) \over f(q_\alpha)^2}$, where $f$ is the density of the data), so that the asymptotic variance of the estimate is ${\alpha (1-\alpha) \over n f(q_\alpha)^2}$.
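As a worked special case (my numbers, not from Davison): for data uniform on $[0,1]$ we have $f(q_\alpha) = 1$, so $$c_\alpha = \alpha(1-\alpha) \quad\text{and}\quad \mathrm{Var}\left(X_{(r)}\right) \approx {\alpha(1-\alpha) \over n}.$$ For $\alpha = 0.95$ and a bin of $n_i = 1000$ points this gives a variance of about $0.95 \times 0.05 / 1000 = 4.75 \times 10^{-5}$, i.e. a standard error of roughly $0.007$ per bin.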

With the above solution, writing $n=\sum_i n_i$ and assuming the estimates from different bins are independent, you asymptotically get a variance ${1\over n^2} \sum_i n_i^2 {c_\alpha \over n_i} = {c_\alpha \over n}$, i.e. the same variance you would achieve by computing the quantile on the whole sample at once...

So you don’t actually lose anything with this method. However, if your bins are too small, I think the variance will be higher than its asymptotic value and you may lose some precision. With bins containing "a few thousand data points", I think you’re safe.

Here is an illustration in R: estimating the 25% quantile of $10^6$ samples from a uniform distribution, either by bins of 1000 points or on the whole data set:

> x <- runif(1e6)
> # 25% quantile (type = 1, i.e. an order statistic) of each block of 1000 consecutive points
> sapply(seq(1, length(x), by = 1000), function(a) quantile(x[a:(a + 999)], 0.25, type = 1)) -> q
> mean(q)  # unweighted mean, since all bins have the same size
[1] 0.2502185
> quantile(x, 0.25, type = 1)  # for comparison: quantile of the whole sample
  25% 
0.2505781
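
For unequal bin sizes (closer to the ~90 bins in the question), the same check with the size-weighted mean could look like the sketch below; the 90-way random split is made up purely for illustration:

set.seed(1)
x <- runif(1e6)
bin_id <- sample(1:90, length(x), replace = TRUE)   # ~90 bins of unequal size
bins   <- split(x, bin_id)

n_i <- sapply(bins, length)                                          # bin sizes
q_i <- sapply(bins, quantile, probs = 0.95, type = 1, names = FALSE) # per-bin estimates

sum(n_i * q_i) / sum(n_i)      # size-weighted combination of the bin estimates
quantile(x, 0.95, type = 1)    # whole-sample quantile, for comparison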