Solved – How should the sampling ratio for estimating quantiles change with population size

approximation, estimation, quantiles, sample-size, sampling

I want to cut my data of size $N$ into $k$ equal-sized bins, but I am happy with roughly equal-sized bins, up to some error $\varepsilon$. Since exact quantiles of the data are computationally costly (sorting time grows as $O(N \log N)$), I am happy to estimate them. Taking the quantiles of a random subsample of size $n$ is the obvious way forward. But what is the recommendation / theory / formula for how large a sample to take? At what rate should that sample size, or the sampling ratio $\frac{n}{N}$, grow to keep the same precision (proportional deviations of the bin shares)?
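A minimal sketch of the subsampling idea, to make the setup concrete (the data-generating distribution and the values of $N$, $k$, $n$ below are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N data points, k = 20 bins, subsample of size n.
N, k, n = 1_000_000, 20, 20_000
data = rng.lognormal(size=N)          # stand-in for the real data

# Estimate the k-1 cut points from a random subsample instead of a full sort.
probs = np.arange(1, k) / k           # the 19 vingtile levels
sub = rng.choice(data, size=n, replace=False)
cuts = np.quantile(sub, probs)

# Check how equal the resulting bin shares actually are on the full data.
shares = np.bincount(np.searchsorted(cuts, data), minlength=k) / N
max_dev = np.abs(shares - 1 / k).max()
```

The question is then how `n` must grow (with `N`, or on its own) so that `max_dev` stays below a target $\varepsilon$ with high probability.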

There are algorithms for estimating population quantiles from small samples (like Harrell–Davis) and for approximating quantiles of data streams. I am not sure either is relevant to the problem at hand, namely having access to the entire population and just looking for a sensible way to ease the computation of the quantiles at the cost of some precision.

Page 3 of this survey says that with simple random sampling,

in order to estimate the quantiles with precision $\varepsilon n$, with probability at least $1 - \delta$, a sample of size $\Theta ( \frac{1}{\varepsilon^2} \log \frac{1}{\delta} )$ is required, where $0 < \delta < 1$.

This suggests a sample around 20,000 for $\varepsilon = 0.01$ and $\delta = 0.1$? What is $\Theta$?
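Plugging numbers into the quoted bound can be sketched as follows. Note that $\Theta$ hides an unspecified constant factor; it is taken as `c = 1` here purely for illustration:

```python
import math

def theta_sample_size(eps: float, delta: float, c: float = 1.0) -> int:
    """Evaluate c / eps^2 * log(1/delta); Theta() hides the constant c."""
    return math.ceil(c / eps**2 * math.log(1 / delta))
```

With `c = 1`, `theta_sample_size(0.01, 0.1)` is about 23,000, while `theta_sample_size(0.1, 0.1)` is only about 230, so the choice of $\varepsilon$ dominates the result.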

As 19 vingtiles cut the data into 20 bins, the probability that at least one of them is off must be higher than for any single one. On the other hand, the errors are correlated: if the subsample happens to over-represent, say, the region below the 3rd population percentile, all the vingtiles will come out too high. That said, a uniformly biased series of quantiles (6%, 11%, etc. instead of 5%, 10%, etc.) would still let me grasp the distribution quite well.

Best Answer

For the order of the sample size, there is a direct reference here (the $\Theta$ is big-Theta notation: the required size grows at exactly this rate, up to a constant factor):

in order to estimate the quantiles with precision $\varepsilon n$, with probability at least $1 - \delta$, a sample of size $\Theta ( \frac{1}{\varepsilon^2} \log \frac{1}{\delta} )$ is required, where $0 < \delta < 1$.

But I think this might be an easier problem than it looks, at least with an asymptotic approximation. For the $p$-th quantile $q = F^{-1}(p)$ of the true/population/full-sample distribution, the sample quantile $\hat{q}$ from a sample of size $n$ has the limiting distribution $$ \sqrt{n}(\hat{q}-q) = \sqrt{n}\,\Delta q \sim N \left(0,\frac{p(1-p)}{f(q)^2} \right), $$ where $f$ is the density;
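This limiting distribution is easy to check by simulation. A sketch for the median of a standard normal, where $q = 0$ and $f(q) = 1/\sqrt{2\pi}$, so the asymptotic variance of $\sqrt{n}(\hat q - q)$ is $\pi/2$ (the values of `n` and `reps` are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulate the sample median of n standard normals, many times over.
p, n, reps = 0.5, 2_000, 4_000
qhats = np.quantile(rng.standard_normal((reps, n)), p, axis=1)

# Compare the empirical variance of sqrt(n)*(qhat - q) with p(1-p)/f(q)^2.
emp_var = n * qhats.var()
asy_var = p * (1 - p) / norm.pdf(0.0) ** 2   # = pi/2, about 1.571
```

The two variances should agree to within simulation noise.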

but if we care about (say) 1-percentage-point deviations ($\varepsilon = 0.01$) in the sense $F(q + \Delta q) \in (p-0.01,\, p+0.01)$, we can approximate the probability mass in the $\Delta q$ neighborhood by $f(q)\, \Delta q$ and try to bound that. By the limiting distribution above, $f(q)\, \Delta q \sim N\!\left(0, \frac{p(1-p)}{n}\right)$ asymptotically. Requiring $|f(q)\, \Delta q| < 0.01$ with 99% probability ($1-\delta$ above, with $\delta = 0.01$) then becomes the question of which normal distribution has its 0.995 quantile at 0.01, because its variance is then the binding value of $\frac{p(1-p)}{n}$. Solving in the worst case $p = 0.5$, this gives the critical sample size $n = 0.25 \, (2.576/0.01)^2 \approx 16{,}588$, as long as the approximations hold.
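The calculation above can be packaged as a small function. This is a sketch of the normal-approximation argument, not the $\Theta$-bound from the survey; the two-sided requirement puts $\delta/2$ in each tail, hence the $1-\delta/2$ quantile:

```python
import math
from scipy.stats import norm

def quantile_sample_size(eps: float = 0.01, delta: float = 0.01,
                         p: float = 0.5) -> int:
    """Smallest n such that |F(qhat) - p| < eps with probability 1 - delta,
    using the normal approximation f(q)*dq ~ N(0, p(1-p)/n)."""
    z = norm.ppf(1 - delta / 2)              # e.g. the 0.995 quantile, ~2.576
    return math.ceil(p * (1 - p) * (z / eps) ** 2)
```

Note that the result does not depend on $N$ at all: for a fixed absolute precision $\varepsilon$ in probability units, the required subsample size is constant, so the sampling ratio $n/N$ can shrink as $1/N$.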
