Solved – Estimation of quantile given quantiles of subset

large-data, parallel-computing, quantiles

Let's say we have sets $X=\{x_1, x_2, \ldots, x_m\}, Y=\{y_1, y_2, \ldots, y_n\}$ and we have some estimated (or exact) quantile information about them at some level $a$. How could we approximate the quantile at that same level $a$ for the union of the sets $X\cup Y$, without assuming that they come from the same distribution?

If that is impossible, what if we relax the problem and assume they do come from the same distribution? In other words, how could we calculate quantiles for an extremely large dataset with a sequence of parallel computations on subsets?

edit: I found this: Estimate population quantiles from subpopulations' quantiles, which was helpful – any other thoughts? In particular, how would one parallelize the computation of quantiles?

edit: To be more precise, I'm interested in computing an estimate and confidence interval for the $p$-th quantile of a large dataset by computing the quantiles on subsamples of that dataset in parallel. Let's say that it would be acceptable to assume that the subsamples are randomly sampled from the larger dataset, if that makes the problem more tractable.

Best Answer

I suppose you are dealing with a data vector of length $n$, where $n$ is so large that the vector cannot fit inside the memory of any single machine and the computations must be spread over many machines. This squares neatly with the definition of big data as defined here.

Suppose that the dataset $\{X_i\}_{i=1}^n$ is composed of independent draws from a distribution $F$ and denote $q(\alpha)$ the $\alpha$ quantile of $F$. Throughout, I will assume that $F$ is differentiable at $q(\alpha)$ and that $F^\prime(q(\alpha))>0$.

The sample estimator of $q(\alpha)$ is $\hat{q}_n(\alpha)$, defined as the minimizer over $t\in\mathbb{R}$ of:

$$(0)\quad h_n(t)=\sum_{i=1}^n\rho_\alpha(X_i-t).$$

where $\rho_{\alpha}(u)=|u(\alpha-\mathbb{I}_{(u<0)})|$ and $\mathbb{I}$ is the indicator function [0].
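As a quick numerical sanity check (my own addition, not part of the original answer), the following NumPy sketch minimizes $h_n$ from (0) over a fine grid and compares the minimizer with the usual empirical quantile; the helper name `h_n` is mine.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
alpha = 0.9

def h_n(t, x, alpha):
    """Check-function objective (0): sum_i rho_alpha(x_i - t)."""
    u = x - t
    return np.sum(u * (alpha - (u < 0)))

# h_n is convex and piecewise linear in t, so a fine grid search suffices here;
# its minimizer agrees with the empirical quantile up to the interpolation convention.
grid = np.linspace(x.min(), x.max(), 5_001)
t_hat = grid[np.argmin([h_n(t, x, alpha) for t in grid])]
print(t_hat, np.quantile(x, alpha))
```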

Assume that the dataset $\{X_i\}_{i=1}^n$ is partitioned into $k$ non-overlapping sub-samples of lengths $\{m_j\}_{j=1}^k$. I will denote by $\{\hat{q}_n^{j}(\alpha)\}_{j=1}^k$ the estimators of $q(\alpha)$ obtained by solving (0) on the respective sub-samples. Then, if:

$$\min_{j=1}^k\lim_{n\to\infty} \frac{m_j}{n}>0$$ the estimator:

$$\bar{q}_n(\alpha)=\sum_{j=1}^k\frac{m_j}{n}\hat{q}_n^{j}(\alpha)$$

satisfies:

$$\sqrt{n}(\bar{q}_n(\alpha)-\hat{q}_n(\alpha))=o_p(1).$$

(You can find more details of these computations, including the asymptotic variances of $\bar{q}_n(\alpha)$, in [1].)
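To make the combination rule concrete, here is a small NumPy sketch (my own illustration, not code from [1]): it forms $\bar{q}_n(\alpha)$ as the $m_j/n$-weighted average of sub-sample quantiles and compares it to the full-sample estimate. The quick interval at the end, built from the spread of the $k$ sub-sample estimates, is only a rough stand-in for the asymptotic variance formulas in [1], but it addresses the confidence-interval part of the question.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, alpha = 1_000_000, 100, 0.95
x = rng.exponential(size=n)                  # stand-in for the large dataset

# Non-overlapping sub-samples of lengths m_j, combined with weights m_j / n.
subsamples = np.array_split(x, k)
m = np.array([len(s) for s in subsamples])
q_sub = np.array([np.quantile(s, alpha) for s in subsamples])
q_bar = np.sum((m / n) * q_sub)

q_hat = np.quantile(x, alpha)                # full-sample estimator, for reference
print(q_bar, q_hat)

# With equal m_j, a rough 95% interval treats the k sub-sample quantiles as
# independent estimates of q(alpha) and uses their spread.
se = q_sub.std(ddof=1) / np.sqrt(k)
print(q_bar - 1.96 * se, q_bar + 1.96 * se)
```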

Therefore, letting $m_j=m\;\forall j$, you can use non-overlapping sub-samples of your original data set to get a computationally convenient estimate of $\hat{q}_n(\alpha)$. Given $p$ computing units, this divide-and-conquer strategy will have cost $O(km/p)$ and storage $O(km)$ (versus $O(n)$ and $O(n)$ for computing (0) directly), at an asymptotically negligible cost in terms of precision. The finite-sample cost will depend on how small $F^\prime(q(\alpha))$ is, but will typically be acceptable as soon as $m$ is on the order of $2^6$ to $2^7$.
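Here is a minimal sketch of the parallel version using only Python's standard library (again my own illustration, not code from the answer): each worker returns its pair $(m_j, \hat{q}^{j}(\alpha))$ and the driver applies the weighted combination. In a real deployment each worker would read its own shard on its own machine, but the combination step is the same; swapping `ProcessPoolExecutor` for a cluster framework does not change it.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_quantile(args):
    """Worker task: return (m_j, alpha-quantile) for one sub-sample.
    In a real deployment this would run against a separate data shard."""
    chunk, alpha = args
    return len(chunk), float(np.quantile(chunk, alpha))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, k, alpha = 2_000_000, 64, 0.5
    x = rng.standard_normal(n)                # stand-in for data stored as k shards
    chunks = np.array_split(x, k)

    with ProcessPoolExecutor() as pool:       # p workers = local cores here
        results = list(pool.map(chunk_quantile, ((c, alpha) for c in chunks)))

    q_bar = sum(m * q for m, q in results) / n
    print(q_bar, np.quantile(x, alpha))
```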

Compared to running the parallel version of $\texttt{nth_element}$ I linked to in my comment above, the implementation based on sub-samples will be simpler (and will also scale much better once $p$ exceeds the number of cores on a single computing machine). A further advantage of the sub-sample approach is that its $O(km)$ space requirement can be split across the memory of multiple machines, even when $O(n)$ cannot fit inside the memory of any single one of them.

On the minus side, the solution based on $\bar{q}_n(\alpha)$ will not be permutation-invariant (changing the order of the observations will give a different result).

  • [0]. Koenker, R. Quantile Regression.
  • [1]. Knight, K. and Bassett, G.W. Second order improvements of sample quantiles using subsamples.