I assume you are dealing with a data vector of length $n$, where $n$ is so large that the computations have to be spread over many machines and the vector cannot fit in the memory of any single one of them. This squares neatly with the definition of big data given here.
Suppose that the dataset $\{X_i\}_{i=1}^n$ is composed of independent draws from a distribution $F$ and denote $q(\alpha)$ the $\alpha$ quantile of $F$. Throughout, I will assume that $F$ is differentiable at $q(\alpha)$ and that $F^\prime(q(\alpha))>0$.
The sample estimator of $q(\alpha)$ is $\hat{q}_n(\alpha)$ and is defined as the minimizer over $t\in\mathbb{R}$ of:
$$(0)\quad h_n(t)=\sum_{i=1}^n\rho_\alpha(X_i-t).$$
where $\rho_{\alpha}(u)=|u(\alpha-\mathbb{I}_{(u<0)})|$ is the check function and $\mathbb{I}$ is the indicator function [0].
Assume that the dataset $\{X_i\}_{i=1}^n$ is partitioned into $k$ non-overlapping sub-samples of lengths $\{m_j\}_{j=1}^k$. I will denote by $\{\hat{q}_n^{j}(\alpha)\}_{j=1}^k$ the sub-sample counterparts of $\hat{q}_n(\alpha)$, obtained by solving (0) on the respective sub-samples. Then, if:
$$\min_{1\le j\le k}\lim_{n\to\infty} \frac{m_j}{n}>0,$$ the estimator:
$$\bar{q}_n(\alpha)=\sum_{j=1}^k\frac{m_j}{n}\hat{q}_n^{j}(\alpha)$$
satisfies:
$$\sqrt{n}(\bar{q}_n(\alpha)-\hat{q}_n(\alpha))=o_p(1).$$
(You can find more details on these computations, including the asymptotic variance of $\bar{q}_n(\alpha)$, in [1].)
Therefore, letting $m_j=m\;\forall j$, you can use non-overlapping sub-samples of your original data set to get a computationally convenient estimate of $\hat{q}_n(\alpha)$. Given $p$ computing units, this divide-and-conquer strategy has cost $O(km/p)$ and storage $O(km)$ (versus $O(n)$ and $O(n)$ for computing (0) directly), at an asymptotically negligible cost in terms of precision. The finite-sample costs will depend on how small $F^\prime(q(\alpha))$ is, but will typically be acceptable as soon as $m$ is of the order of $2^6$ to $2^7$.
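For illustration, here is a minimal single-machine sketch of this strategy in R (the function name `dc_quantile` and the even split into $k$ consecutive blocks are my own choices for the example; in practice each sub-sample would sit on, and be processed by, a different machine):
dc_quantile <- function(x, alpha, k) {
  n <- length(x)
  # k non-overlapping sub-samples of (roughly) equal length m = n/k
  idx <- split(seq_len(n), cut(seq_len(n), k, labels = FALSE))
  # solve (0) on each sub-sample, then combine with weights m_j / n
  sum(sapply(idx, function(j) (length(j) / n) * quantile(x[j], alpha, type = 1)))
}
x <- rnorm(1e6)
dc_quantile(x, 0.9, k = 1000)   # close to quantile(x, 0.9, type = 1)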
Compared to running the parallel version of `nth_element` I linked to in my comment above, the implementation based on sub-samples will be simpler (and will also scale much better once $p$ becomes larger than the number of cores of a single machine). Another advantage of the sub-sample approach over the one based on the parallel version of `nth_element` is that its $O(km)$ space requirement can be split across the memory of multiple machines, even if $O(n)$ cannot fit inside the memory of any single one of them.
On the minus side, the solution based on $\bar{q}_n(\alpha)$ will not be permutation invariant (changing the order of the observations will give a different result).
- [0] Koenker, R., *Quantile Regression*.
- [1] Knight, K. and Bassett, G.W., *Second order improvements of sample quantiles using subsamples*.
- The first case was answered in detail in this question.
- One example of the second case is shown here, where the authors apply a normal approximation to the binomial distribution used in calculations of the first case.
The third case is given by Hahn and Meeker in their handbook *Statistical Intervals* (2nd ed., Wiley, 2017):
A two-sided $100(1-\alpha)\%$ confidence interval for $x_q$, the $q$ quantile of the normal distribution, is
$$
\left[\bar{x}-t_{(1-\alpha/2;\,n-1,\,\delta)}\frac{s}{\sqrt{n}},\;\bar{x}-t_{(\alpha/2;\,n-1,\,\delta)}\frac{s}{\sqrt{n}}\right]
$$
where $t_{(\gamma;\,n-1,\,\delta)}$ is the $\gamma$ quantile of a noncentral $t$-distribution with $n-1$ degrees of freedom and noncentrality parameter $\delta = -\sqrt{n}z_{(q)}=\sqrt{n}z_{(1-q)}$.
Here, $z_{(q)}$ denotes $\Phi^{-1}(q)$, the $q$ quantile of the standard normal distribution.
For example, let's assume we drew $n=20$ samples from a normal distribution with unknown mean and standard deviation. The sample mean was $\bar{x}=10.5$ and the sample standard deviation was $s=3.19$. Then, the two-sided $95\%$ confidence interval for the $q=0.25$ quantile $x_{0.25}$ would be given by $(6.42; 9.76)$.
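As a quick check of this worked example, the interval can also be computed in R directly from the summary statistics above (the last digits may differ slightly depending on how $\bar{x}$ and $s$ were rounded):
n <- 20; xbar <- 10.5; s <- 3.19; q <- 0.25; conf_level <- 0.95
ncp <- -sqrt(n) * qnorm(q)   # noncentrality parameter delta
tval <- qt(c((1 + conf_level)/2, (1 - conf_level)/2), n - 1, ncp)
xbar - tval * s/sqrt(n)      # approximately 6.42 and 9.76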
Here is some R code and a small simulation to check the coverage. By changing the parameters, you can run your own simulations:
normquantCI <- function(x, conf_level = 0.95, q = 0.5) {
  # Two-sided CI for the q quantile of a normal distribution,
  # based on the noncentral t-distribution (Hahn & Meeker)
  x <- na.omit(x)
  n <- length(x)
  xbar <- mean(x)
  s <- sd(x)
  ncp <- -sqrt(n)*qnorm(q)   # noncentrality parameter delta
  tval <- qt(c((1 + conf_level)/2, (1 - conf_level)/2), n - 1, ncp)
  se <- s/sqrt(n)
  xbar - tval*se             # lower and upper confidence limits
}
# Simulate the coverage
set.seed(142857)
q <- 0.25 # Quantile to calculate the CI for
conf_level <- 0.95 # Confidence level
true_mean <- 100 # The true mean of the normal distribution
true_sd <- 15 # True sd of the normal distribution
sampsi <- 20 # The sample size
trueq <- qnorm(q, true_mean, true_sd) # The true quantile
res <- replicate(1e5, {
  citmp <- normquantCI(rnorm(sampsi, true_mean, true_sd), conf_level = conf_level, q = q)
  ifelse(citmp[1] < trueq & citmp[2] > trueq, 1, 0)
})
sum(res)/length(res)
[1] 0.95043
Best Answer
$q_Z$ could be anything.
To understand this situation, let us make a preliminary simplification. By working with $Y_i = X_i - q_i$, we obtain the more uniform characterization
$$\alpha = \Pr(X_i \le q_i) = \Pr(Y_i \le 0).$$
That is, each $Y_i$ has the same probability of being negative. Because
$$W = \sum_i Y_i = \sum_i X_i - \sum_i q_i = Z - \sum_i q_i,$$
the defining equation for $q_Z$ is equivalent to
$$\alpha = \Pr(Z \le q_Z) = \Pr(Z - \sum_i q_i \le q_Z - \sum_i q_i) = \Pr(W \le q_W)$$
with $q_Z = q_W + \sum_i q_i$.
What are the possible values of $q_W$? Consider the case where the $Y_i$ all have the same distribution with all probability on two values, one of them negative ($y_{-}$) and the other one positive ($y_{+}$). The possible values of the sum $W$ are limited to $ky_{-} + (n-k)y_{+}$ for $k=0, 1, \ldots, n$. Each of these occurs with probability
$${\Pr}_W(ky_{-} + (n-k)y_{+}) = \binom{n}{k}\alpha^k(1-\alpha)^{n-k}.$$
The extremes can be found as follows (a small simulation after this list illustrates both constructions):
- Choose $y_{-}$ and $y_{+}$ so that $y_{-} + (n-1)y_{+} \lt 0$; $y_{-}=-n$ and $y_{+}=1$ will accomplish this. This guarantees that $W$ will be negative except when all the $Y_i$ are positive. That chance equals $1 - (1-\alpha)^n$. It exceeds $\alpha$ when $n\gt 1$, implying the $\alpha$ quantile of $W$ must be strictly negative.
- Choose $y_{-}$ and $y_{+}$ so that $(n-1) y_{-} + y_{+} \gt 0$; $y_{-}=-1$ and $y_{+}=n$ will accomplish this. This guarantees that $W$ will be negative only when all the $Y_i$ are negative. That chance equals $\alpha^n$. It is less than $\alpha$ when $n\gt 1$, implying the $\alpha$ quantile of $W$ must be strictly positive.
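Here is a short simulation sketch of these two constructions in R (the values $\alpha = 0.1$ and $n = 5$ are arbitrary choices for the illustration):
alpha <- 0.1; n <- 5; N <- 1e5
sum_of_two_point <- function(y_neg, y_pos) {
  # N draws of W = Y_1 + ... + Y_n, where P(Y_i = y_neg) = alpha
  replicate(N, sum(ifelse(runif(n) < alpha, y_neg, y_pos)))
}
quantile(sum_of_two_point(-n, 1), alpha)   # the alpha quantile of W is negative
quantile(sum_of_two_point(-1, n), alpha)   # the alpha quantile of W is positive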
This shows that the $\alpha$ quantile of $W$ could be either negative or positive, but is not zero. What could its size be? It has to equal some integral linear combination of $y_{-}$ and $y_{+}$. Making both these values integers assures all the possible values of $W$ are integral. Upon scaling $y_{\pm}$ by an arbitrary positive number $s$, we can guarantee that all integral linear combinations of $y_{-}$ and $y_{+}$ are integral multiples of $s$. Since $q_W \ne 0$, it must be at least $s$ in size. Consequently, the possible values of $q_W$ (and whence of $q_Z$) are unlimited, no matter what $n\gt 1$ may equal.
The only way to derive any information about $q_Z$ would be to place specific and strong constraints on the distributions of the $X_i$, in order to prevent or limit the kind of unbalanced distributions used to derive this negative result.