Solved – How should the sampling ratio for estimating quantiles change with population size

approximation, estimation, quantiles, sample-size, sampling

I want to cut my data of size $N$ into $k$ equal-sized bins, but I am happy with roughly equal-sized bins, up to some error $\varepsilon$. Since exact quantiles of the data are computationally costly (sorting time grows as $O(N \log N)$), I am happy to estimate them. Taking the quantiles of a random subsample of size $n$ is the obvious way forward. But what is the recommendation / theory / formula for how large a sample to take? At what rate should that sample size, or the sampling ratio $\frac{n}{N}$, grow to keep the same precision (proportional deviations of the bin shares)?
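A minimal sketch of the subsampling idea, to make the setup concrete (the data-generating distribution and the values of $N$, $k$, $n$ below are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N data points, k = 20 bins, subsample of size n.
N, k, n = 1_000_000, 20, 20_000
data = rng.lognormal(size=N)          # stand-in for the real data

# Estimate the k-1 cut points from a random subsample instead of a full sort.
probs = np.arange(1, k) / k           # the 19 vingtile levels
sub = rng.choice(data, size=n, replace=False)
cuts = np.quantile(sub, probs)

# Check how equal the resulting bin shares actually are on the full data.
shares = np.bincount(np.searchsorted(cuts, data), minlength=k) / N
max_dev = np.abs(shares - 1 / k).max()
```

The question is then how `n` must grow (with `N`, or on its own) so that `max_dev` stays below a target $\varepsilon$ with high probability.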

There are algorithms for estimating population quantiles from small samples (like Harrell–Davis) and for approximating quantiles of data streams. I am not sure either is relevant to the problem at hand, namely having access to the entire population and just looking for a sensible way to ease the computation of the quantiles at the cost of some precision.

Page 3 of this survey says that with simple random sampling,

in order to estimate the quantiles with precision $\varepsilon n$, with probability at least $1 - \delta$, a sample of size $\Theta ( \frac{1}{\varepsilon^2} \log \frac{1}{\delta} )$ is required, where $0 < \delta < 1$.

This suggests a sample around 20,000 for $\varepsilon = 0.01$ and $\delta = 0.1$? What is $\Theta$?
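Plugging numbers into the quoted bound can be sketched as follows. Note that $\Theta$ hides an unspecified constant factor; it is taken as `c = 1` here purely for illustration:

```python
import math

def theta_sample_size(eps: float, delta: float, c: float = 1.0) -> int:
    """Evaluate c / eps^2 * log(1/delta); Theta() hides the constant c."""
    return math.ceil(c / eps**2 * math.log(1 / delta))
```

With `c = 1`, `theta_sample_size(0.01, 0.1)` is about 23,000, while `theta_sample_size(0.1, 0.1)` is only about 230, so the choice of $\varepsilon$ dominates the result.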

As 19 vingtiles cut the data into 20 bins, the probability that at least one of them is off must be higher than for any single one. On the other hand, the errors are correlated: if the subsample happens to over-represent, say, the region below the 3rd population percentile, all the vingtiles will come out too high. That said, a uniformly biased series of quantiles (6%, 11%, etc. instead of 5%, 10%, etc.) would still let me grasp the distribution quite well.

Best Answer

For the order of the sample size, there is a direct reference here (the $\Theta$ is big-Theta notation: the required size grows at exactly this rate, up to a constant factor):

in order to estimate the quantiles with precision $\varepsilon n$, with probability at least $1 - \delta$, a sample of size $\Theta ( \frac{1}{\varepsilon^2} \log \frac{1}{\delta} )$ is required, where $0 < \delta < 1$.

But I think this might be an easier problem than it looks, at least with an asymptotic approximation. For the $p$-th quantile $q = F^{-1}(p)$ of the true/population/full-sample distribution, the sample quantile $\hat{q}$ from a sample of size $n$ has the limiting distribution $$ \sqrt{n}(\hat{q}-q) = \sqrt{n}\,\Delta q \sim N \left(0,\frac{p(1-p)}{f(q)^2} \right), $$ where $f$ is the density;
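This limiting distribution is easy to check by simulation. A sketch for the median of a standard normal, where $q = 0$ and $f(q) = 1/\sqrt{2\pi}$, so the asymptotic variance of $\sqrt{n}(\hat q - q)$ is $\pi/2$ (the values of `n` and `reps` are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulate the sample median of n standard normals, many times over.
p, n, reps = 0.5, 2_000, 4_000
qhats = np.quantile(rng.standard_normal((reps, n)), p, axis=1)

# Compare the empirical variance of sqrt(n)*(qhat - q) with p(1-p)/f(q)^2.
emp_var = n * qhats.var()
asy_var = p * (1 - p) / norm.pdf(0.0) ** 2   # = pi/2, about 1.571
```

The two variances should agree to within simulation noise.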

but if we care about (say) 1-percentage-point deviations ($\varepsilon = 0.01$) in the sense $F(q + \Delta q) \in (p-0.01,\, p+0.01)$, we can approximate the probability mass in the $\Delta q$ neighborhood by $f(q)\, \Delta q$ and try to bound that. By the limiting distribution above, $f(q)\, \Delta q \sim N\!\left(0, \frac{p(1-p)}{n}\right)$ asymptotically. Requiring $|f(q)\, \Delta q| < 0.01$ with 99% probability ($1-\delta$ above, with $\delta = 0.01$) then becomes the question of which normal distribution has its 0.995 quantile at 0.01, because its variance is then the binding value of $\frac{p(1-p)}{n}$. Solving in the worst case $p = 0.5$, this gives the critical sample size $n = 0.25 \, (2.576/0.01)^2 \approx 16{,}588$, as long as the approximations hold.
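The calculation above can be packaged as a small function. This is a sketch of the normal-approximation argument, not the $\Theta$-bound from the survey; the two-sided requirement puts $\delta/2$ in each tail, hence the $1-\delta/2$ quantile:

```python
import math
from scipy.stats import norm

def quantile_sample_size(eps: float = 0.01, delta: float = 0.01,
                         p: float = 0.5) -> int:
    """Smallest n such that |F(qhat) - p| < eps with probability 1 - delta,
    using the normal approximation f(q)*dq ~ N(0, p(1-p)/n)."""
    z = norm.ppf(1 - delta / 2)              # e.g. the 0.995 quantile, ~2.576
    return math.ceil(p * (1 - p) * (z / eps) ** 2)
```

Note that the result does not depend on $N$ at all: for a fixed absolute precision $\varepsilon$ in probability units, the required subsample size is constant, so the sampling ratio $n/N$ can shrink as $1/N$.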
