I suppose you are dealing with a data vector of length $n$, where $n$ is so large that the computations must be spread over many machines and the vector cannot fit inside the memory of any single one of them. This squares neatly with the definition of big data as defined here.
Suppose that the dataset $\{X_i\}_{i=1}^n$ is composed of independent draws from a distribution $F$ and denote $q(\alpha)$ the $\alpha$ quantile of $F$. Throughout, I will assume that $F$ is differentiable at $q(\alpha)$ and that $F^\prime(q(\alpha))>0$.
The sample estimator of $q(\alpha)$ is $\hat{q}_n(\alpha)$ and is defined as the minimizer over $t\in\mathbb{R}$ of:
$$(0)\quad h_n(t)=\sum_{i=1}^n\rho_\alpha(X_i-t).$$
where $\rho_{\alpha}(u)=|u(\alpha-\mathbb{I}_{(u<0)})|$ and
$\mathbb{I}$ is the indicator function [0].
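As a quick sanity check (a minimal sketch, with simulated data standing in for the sample and an arbitrary $\alpha$), minimizing $h_n$ numerically recovers the usual sample quantile:

    # check function and objective (0)
    rho <- function(u, alpha) abs(u * (alpha - (u < 0)))
    h_n <- function(t, x, alpha) sum(rho(x - t, alpha))
    set.seed(1)
    x <- rnorm(1e4)                        # stand-in for the data vector
    alpha <- 0.9
    optimize(h_n, interval = range(x), x = x, alpha = alpha)$minimum
    quantile(x, probs = alpha, type = 1)   # essentially the same value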
Assume that the dataset $\{X_i\}_{i=1}^n$ is partitioned into $k$ non-overlapping sub-samples of lengths $\{m_j\}_{j=1}^k$. I will denote $\{\hat{q}_n^{j}(\alpha)\}_{j=1}^k$ the estimators of
$\hat{q}_n(\alpha)$ obtained by solving (0) on the respective sub-samples. Then, if:
$$\min_{j=1}^k\lim_{n\to\infty} \frac{m_j}{n}>0$$ the estimator:
$$\bar{q}_n(\alpha)=\sum_{j=1}^k\frac{m_j}{n}\hat{q}_n^{j}(\alpha)$$
satisfies:
$$\sqrt{n}(\bar{q}_n(\alpha)-\hat{q}_n(\alpha))=o_p(1).$$
(You can find more details of these computations including the asymptotic variances of $\bar{q}_n(\alpha)$ in [1]).
Therefore, letting $m_j=m\;\forall j$, you can use non-overlapping sub-samples of your original data set to get a computationally convenient estimate of $\hat{q}_n(\alpha)$. Given $p$ computing units, this divide-and-conquer strategy has cost $O(km/p)$ and storage $O(km)$ (versus $O(n)$ and $O(n)$ for computing (0) directly) at an asymptotically negligible cost in terms of precision. The finite-sample cost depends on how small $F^\prime(q(\alpha))$ is, but will typically be acceptable as soon as $m$ is of the order of $2^6$ to $2^7$.
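As an in-memory illustration (a sketch only; in a genuinely distributed setting each block's quantile would be computed on its own machine), the weighted average $\bar{q}_n(\alpha)$ can be written as:

    # divide-and-conquer estimate: weighted average of sub-sample quantiles
    set.seed(2)
    x <- rnorm(1e6)                        # stand-in for the full data vector
    alpha <- 0.9
    k <- 2^10                              # number of sub-samples
    blocks <- split(x, rep(seq_len(k), length.out = length(x)))
    m <- lengths(blocks)
    q_j <- sapply(blocks, quantile, probs = alpha, type = 1)
    q_bar <- sum(m / length(x) * q_j)      # bar{q}_n(alpha)
    q_bar - quantile(x, probs = alpha, type = 1)   # discrepancy with the full-sample estimate

The last line is only a check against the full-sample estimate; with data that genuinely do not fit in memory, only the $k$ sub-sample quantiles and block sizes need to be collected.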
Compared to running the parallel version of $\texttt{nth\_element}$ I linked to in my comment above, the implementation based on sub-samples will be simpler (and will also scale much better once $p$ becomes larger than the number of cores of a single computing machine). Another advantage of the sub-sample approach over the one based on a parallel $\texttt{nth\_element}$ is that its $O(km)$ space requirement can be split across the memory of multiple machines, even if the full $O(n)$ cannot fit inside the memory of any single one of them.
On the minus side, the solution based on $\bar{q}_n(\alpha)$ will not be permutation invariant (changing the order of the observations will give a different result).
- [0] Koenker, R. Quantile Regression.
- [1] Knight, K. and Bassett, G.W. Second order improvements of sample quantiles using subsamples.
The remark is rather ill-considered, as it confuses the theoretical quantile, which solves $F(q_\alpha)=\alpha$, with the empirical quantile, which solves $\hat{F}_n(\hat{q}_\alpha)=\alpha$. Assuming you have an iid sample $X_1,\ldots,X_n$ from $F$, it is always possible to compute $\hat{F}_n$ with the $\texttt{ecdf}$ function in R:
    # assuming x is the notation for the sample
    Fn <- ecdf(x)
    plot(Fn)
    # taking a particular value of alpha
    alpha <- 0.1017
    abline(h = alpha)
This picture tells you roughly where the empirical quantile should be, without giving you the solution to the equation $\hat{F}_n(\hat{q}_\alpha)=\alpha$, which you can solve by dyadic divide-and-conquer strategies or by calling the $\texttt{quantile}$ function.
If the probability $\alpha$ is arbitrary, there will be no exact solution to the equation$$\hat{F}_n(\hat{q}_\alpha)=\alpha$$since $\hat{F}_n$ only takes $n+1$ possible values. (This is also visible from the above graph.) In that case, the "solution" is found as the smallest observation for which $\hat{F}_n(x)$ is at least $\alpha$, mimicking the resolution in the theoretical case, where $$F^{-1}(\alpha)=\inf\{x\,|\, F(x)\geq \alpha \}.$$
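Continuing the snippet above (with x, Fn and alpha as before, so this is only a sketch), this generalized inverse is directly available:

    # smallest observation at which Fn reaches alpha
    q_hat <- min(x[Fn(x) >= alpha])
    abline(v = q_hat)
    # equivalently, the type-1 (inverse-ecdf) sample quantile
    quantile(x, probs = alpha, type = 1)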
Best Answer
If you first estimate $\theta$ by $\hat\theta$, a direct estimate of the $\alpha$-quantile is $F_{\hat\theta}^{-1}(\alpha)$. This is a convergent but biased estimator, whose asymptotic variance can be derived by the delta method.
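For instance (a sketch only, taking a normal model as an arbitrary example, with $\theta=(\mu,\sigma)$ estimated from the sample):

    # plug-in estimate F_{theta_hat}^{-1}(alpha) under a normal model
    set.seed(3)
    x <- rnorm(200, mean = 1, sd = 2)      # placeholder for the observed sample
    alpha <- 0.8
    mu_hat <- mean(x); sigma_hat <- sd(x)
    q_plugin <- qnorm(alpha, mean = mu_hat, sd = sigma_hat)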
Your first solution is validated by the Glivenko–Cantelli theorem, namely the fact that the empirical cdf converges to the true cdf:$$\hat{F}_n(x)=\frac{1}{n}\sum_{i=1}^n \mathbb{I}_{X_i\le x} \stackrel{n\to\infty}{\longrightarrow}F_{\hat\theta}(x)$$Once again, $\hat{F}_n^{-1}(\alpha)$ is a convergent but biased estimator, whose variance can be estimated by bootstrap.
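Continuing the sketch above, the first solution then amounts to simulating from $F_{\hat\theta}$, taking the empirical quantile, and bootstrapping its variance:

    # simulate from F_{theta_hat} and take the empirical quantile
    sims <- rnorm(1e4, mean = mu_hat, sd = sigma_hat)
    q_emp <- quantile(sims, probs = alpha, type = 1)
    # bootstrap estimate of its variance
    boot <- replicate(500, quantile(sample(sims, replace = TRUE), probs = alpha, type = 1))
    var(boot)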
Your second method uses an average of estimators validated by your first method, hence it is equally valid and equally biased. However, for a given computing budget, i.e., a pre-determined total number of simulations, you have to run an experiment to compare both methods.
For instance, running a toy experiment aiming at estimating the normal 80% quantile (equal to 0.8416212) shows the difference between both your approaches: with the total number of simulations held fixed, taking a single empirical quantile from one large simulated sample and averaging empirical quantiles over many smaller simulated samples behave differently, the averaging approach leading to a smaller variance but to a larger bias. (This experiment does not account for the variability in replacing $\theta$ by $\hat\theta$.)
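A minimal sketch of such a toy experiment (not the original code; the sample sizes, number of sub-samples, and number of replications are arbitrary choices, and $\theta$ is taken as known to isolate the Monte Carlo error):

    # fixed budget of N draws: one empirical quantile vs. an average of k quantiles from N/k draws
    set.seed(4)
    alpha <- 0.8; N <- 1e4; k <- 100; R <- 1e3
    direct   <- replicate(R, quantile(rnorm(N), probs = alpha, type = 1))
    averaged <- replicate(R, mean(replicate(k, quantile(rnorm(N / k), probs = alpha, type = 1))))
    c(bias = mean(direct) - qnorm(alpha), sd = sd(direct))
    c(bias = mean(averaged) - qnorm(alpha), sd = sd(averaged))

Both estimators use the same total of $N$ draws per replication, so the comparison matches the fixed-budget setting described above.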