Solved – Estimation of quantile given quantiles of subset

large-data, parallel-computing, quantiles

Let's say we have sets $X=\{x_1, x_2, \ldots, x_m\}, Y=\{y_1, y_2, \ldots, y_n\}$ and we have some estimated (or exact) quantile information about them at some level $a$. How could we approximate the quantile at that same level $a$ for the union of the sets $X\cup Y$, without assuming that they come from the same distribution?

If that is impossible, what if we relax the problem and assume they do come from the same distribution? In other words, how could we calculate quantiles for an extremely large dataset with a sequence of parallel computations on subsets?

edit: I found this: Estimate population quantiles from subpopulations' quantiles, which was helpful – any other thoughts? In particular, how would one parallelize the computation of quantiles?

edit: To be more precise, I'm interested in computing an estimate and confidence interval for the $p$-th quantile of a large dataset by computing the quantiles on subsamples of that dataset in parallel. Let's say that it would be acceptable to assume that the subsamples are randomly sampled from the larger dataset, if that makes the problem more tractable.

Best Answer

I suppose you are dealing with a data vector of length $n$, where $n$ is so large that the vector cannot fit inside the memory of any single machine and the computations must be spread over many machines. This squares neatly with the definition of big data as defined here.

Suppose that the dataset $\{X_i\}_{i=1}^n$ is composed of independent draws from a distribution $F$ and denote $q(\alpha)$ the $\alpha$ quantile of $F$. Throughout, I will assume that $F$ is differentiable at $q(\alpha)$ and that $F^\prime(q(\alpha))>0$.

The sample estimator of $q(\alpha)$ is $\hat{q}_n(\alpha)$, defined as the minimizer over $t\in\mathbb{R}$ of:

$$(0)\quad h_n(t)=\sum_{i=1}^n\rho_\alpha(X_i-t).$$

where $\rho_{\alpha}(u)=|u(\alpha-\mathbb{I}_{(u<0)})|$ and $\mathbb{I}$ is the indicator function [0].
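As a quick numerical sanity check (my own addition, not part of the original answer), the following NumPy sketch minimizes $h_n$ from (0) over a fine grid and compares the minimizer with the usual empirical quantile; the helper name `h_n` is mine.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
alpha = 0.9

def h_n(t, x, alpha):
    """Check-function objective (0): sum_i rho_alpha(x_i - t)."""
    u = x - t
    return np.sum(u * (alpha - (u < 0)))

# h_n is convex and piecewise linear in t, so a fine grid search suffices here;
# its minimizer agrees with the empirical quantile up to the interpolation convention.
grid = np.linspace(x.min(), x.max(), 5_001)
t_hat = grid[np.argmin([h_n(t, x, alpha) for t in grid])]
print(t_hat, np.quantile(x, alpha))
```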

Assume that the dataset $\{X_i\}_{i=1}^n$ is partitioned into $k$ non-overlapping sub-samples of lengths $\{m_j\}_{j=1}^k$. I will denote by $\{\hat{q}_n^{j}(\alpha)\}_{j=1}^k$ the estimators of $q(\alpha)$ obtained by solving (0) on the respective sub-samples. Then, if:

$$\min_{j=1}^k\lim_{n\to\infty} \frac{m_j}{n}>0$$ the estimator:

$$\bar{q}_n(\alpha)=\sum_{j=1}^k\frac{m_j}{n}\hat{q}_n^{j}(\alpha)$$

satisfies:

$$\sqrt{n}(\bar{q}_n(\alpha)-\hat{q}_n(\alpha))=o_p(1).$$

(You can find more details of these computations, including the asymptotic variances of $\bar{q}_n(\alpha)$, in [1].)
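To make the combination rule concrete, here is a small NumPy sketch (my own illustration, not code from [1]): it forms $\bar{q}_n(\alpha)$ as the $m_j/n$-weighted average of sub-sample quantiles and compares it to the full-sample estimate. The quick interval at the end, built from the spread of the $k$ sub-sample estimates, is only a rough stand-in for the asymptotic variance formulas in [1], but it addresses the confidence-interval part of the question.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, alpha = 1_000_000, 100, 0.95
x = rng.exponential(size=n)                  # stand-in for the large dataset

# Non-overlapping sub-samples of lengths m_j, combined with weights m_j / n.
subsamples = np.array_split(x, k)
m = np.array([len(s) for s in subsamples])
q_sub = np.array([np.quantile(s, alpha) for s in subsamples])
q_bar = np.sum((m / n) * q_sub)

q_hat = np.quantile(x, alpha)                # full-sample estimator, for reference
print(q_bar, q_hat)

# With equal m_j, a rough 95% interval treats the k sub-sample quantiles as
# independent estimates of q(alpha) and uses their spread.
se = q_sub.std(ddof=1) / np.sqrt(k)
print(q_bar - 1.96 * se, q_bar + 1.96 * se)
```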

Therefore, letting $m_j=m\;\forall j$, you can use non-overlapping sub-samples of your original data set to get a computationally convenient estimate of $\hat{q}_n(\alpha)$. Given $p$ computing units, this divide-and-conquer strategy will have cost $O(km/p)$ and storage $O(km)$ (versus $O(n)$ and $O(n)$ for computing (0) directly), at an asymptotically negligible cost in terms of precision. The finite-sample cost will depend on how small $F^\prime(q(\alpha))$ is, but will typically be acceptable as soon as $m$ is on the order of $2^6$ to $2^7$.
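Here is a minimal sketch of the parallel version using only Python's standard library (again my own illustration, not code from the answer): each worker returns its pair $(m_j, \hat{q}^{j}(\alpha))$ and the driver applies the weighted combination. In a real deployment each worker would read its own shard on its own machine, but the combination step is the same; swapping `ProcessPoolExecutor` for a cluster framework does not change it.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_quantile(args):
    """Worker task: return (m_j, alpha-quantile) for one sub-sample.
    In a real deployment this would run against a separate data shard."""
    chunk, alpha = args
    return len(chunk), float(np.quantile(chunk, alpha))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, k, alpha = 2_000_000, 64, 0.5
    x = rng.standard_normal(n)                # stand-in for data stored as k shards
    chunks = np.array_split(x, k)

    with ProcessPoolExecutor() as pool:       # p workers = local cores here
        results = list(pool.map(chunk_quantile, ((c, alpha) for c in chunks)))

    q_bar = sum(m * q for m, q in results) / n
    print(q_bar, np.quantile(x, alpha))
```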

Compared to running the parallel version of $\texttt{nth_element}$ I linked to in my comment above, the implementation based on sub-samples will be simpler (and will also scale much better once $p$ exceeds the number of cores on a single computing machine). A further advantage of the sub-sample approach is that its $O(km)$ space requirement can be split across the memory of multiple machines, even when $O(n)$ cannot fit inside the memory of any single one of them.

On the minus side, the solution based on $\bar{q}_n(\alpha)$ will not be permutation-invariant (changing the order of the observations will give a different result).

  • [0]. Koenker, R. Quantile Regression.
  • [1]. Knight, K. and Bassett, G.W. Second order improvements of sample quantiles using subsamples.