Solved – Estimate population quantiles from subpopulations’ quantiles

estimationinterval-arithmeticnonparametricquantiles

Suppose there is a population partitioned arbitrarily into a set of subpopulations that completely cover the original population. Assume that for some variable, we know each subpopulation's quintiles only (or some other fixed set of quantiles), but not its distribution. However, the entire population's percentiles (and of course distribution) are unknown.

Is there a way to estimate the quantiles of the whole population?

This may sound like an odd situation, but it happens for many published economic statistics where access to the full data is not possible, only summaries of it.

EDIT (CLARIFICATION): You may not assume that each subpopulation is drawn randomly. In fact, each is likely to have its own bias.

EDIT2: Additional details… (1) You may assume the variable is nonnegative, and further that we have a decent estimate for its minimum value across the population. (2) Rough or "unrefined" estimates are OK.

FINAL UPDATE: Expanding on @whuber's suggestions and code snippets, I thought I'd add a nice example with some real-world data (hourly wages, if you're curious) that shows the final population upper/lower bounds, along with the known population quantiles, which happened to be available for this case. I've also made some conservative assumptions about the min and max values of each subpopulation, which helps tighten up the intervals a bit.

interval example

Best Answer

Because the subpopulations are arbitrary and not random, the best you can do is to use interval arithmetic.

Let the quantiles of a subpopulation be $x_{[1]} \le x_{[2]} \le \cdots \le x_{[m]}$ corresponding to percentiles $100 q_1, 100 q_2, \ldots, 100 q_m,$ respectively. This information means that

  1. $100 q_i \%$ of the values are less than or equal to $x_{[i]}$ and

  2. $100 q_{i-1} \%$ of the values are less than or equal to $x_{[i-1]},$ whence

  3. $100(q_i - q_{i-1}) \%$ of the values lie in the interval $(x_{[i-1]}, x_{[i]}].$

In the case $i=1$, take $q_0 = 0$ and $x_{[0]}=-\infty.$ Similarly take $q_{m+1}=1$ and $x_{[m+1]} = \infty.$

Consider the set of all possible distributions consistent with this information. Let $F$ be the CDF of one of them and suppose $x\in (x_{[i-1]}, x_{[i]}]$ for some $i \in \{0, 1, \ldots, m\}.$ From the preceding information we know

$$q_{i-1} \le F(x) \le q_{i}.$$

The set of all possible CDFs therefore forms a "p-box" filling up these intervals. For example, let the quartiles be $\{-1, 0, 1\}$. The corresponding p-box lies between the upper (red) and lower (blue) curves. A possible distribution $F$ consistent with this p-box is shown in black.

Figure 1

The horizontal gray line shows how quantiles can be read off this plot: the 60th percentile, shown, must lie between $0$ and $1$ given that the 50th percentile is at $0$ and the 75th percentile is at $1$. The solid part of the gray line depicts the interval of possible values of the 60th percentile.

When presented with information of this sort for separate populations of sizes $n_1, n_2, \ldots, n_k$, having associated distributions $F_i, i=1, 2, \ldots, k,$ the distribution for the total population will be the weighted average of the $F_i$:

$$F(x) = \frac{n_1 F_1(x) + n_2 F_2(x) + \cdots + n_k F_k(x)}{n_1 + n_2 + \cdots + n_k}.$$

Because we do not know $F$, we replace it by the p-boxes obtained from the available information and use interval arithmetic to perform the computation. Interval arithmetic in this case is simple: when a value $u$ is known to be in an interval $[u_{-}, u_{+}]$ and $v$ is known to lie in $[v_{-}, v_{+}],$ then certainly $u+v$ is in $[u_{-}+v_{-}, u_{+} + v_{+}]$ and a constant positive multiple $\alpha u$ is in $[\alpha u_{-}, \alpha u_{+}].$ And that's all we can say.


For example, suppose we have the following quantile information for subpopulations of sizes $n_i = 5, 4, 7$:

  1. Subpopulation 1 has quartiles at $-2, 0, 1$,

  2. Subpopulation 2 has quintiles at $-1, 0, 1/2, 3/2,$ and

  3. Subpopulation 3 has tertiles at $-4/3, -1/3.$

The resulting p-box computed using interval arithmetic is shown here:

Figure 2

Its 60th percentile (shown by the dashed gray line) must lie between $-4/3$ and $1$, but that is all we know for certain. The distribution of the collective population of $5+4+7=15$ individuals will have a CDF lying somewhere between the upper and lower bounds.


R code to compute and manipulate p-boxes is relatively straightforward to write because R supports step functions (the piecewise constant functions that form the envelopes of empirical p-boxes). The hard work is performed by the functions f (which converts quantile specifications into p-boxes) and mix (which forms positive linear combinations of p-boxes).

#
# Create a pair of functions giving the p-box of a set of quantiles.
#
f <- function(quantiles, quants=seq(0, 1, length.out=length(quantiles)+2)) {
  n <- length(quants)
  return (list(lower=stepfun(quantiles, quants[-n]), 
               upper=stepfun(quantiles, quants[-1])))
}
#
# Figure 1: show the p-box for a single population.
#
g <- f(quantiles <- qnorm(c(1,2,3)/4) / qnorm(3/4))
curve(g$upper(x), from=-3, to=2, ylim=c(0,1), n=1001, col="Red", lwd=2,
      ylab="Probability", main="Quartiles {-1, 0, 1}")
curve(g$lower(x), add=TRUE, n=1001, col="Blue", lwd=2)
curve(pnorm(x * qnorm(3/4)), add=TRUE)
lines(c(-3, quantiles[2]), c(0.6, 0.6), col="Gray", lty=2)
lines(c(quantiles[2],quantiles[3]), c(0.6, 0.6), col="Gray")
#
# Figure 2: show how to combine p-boxes using interval arithmetic.
#
quantiles <- list(c(-2, 0, 1), c(-1, 0, 1/2, 3/2), c(-4/3, -1/3))
weights <- c(5, 4, 7); weights <- weights / sum(weights)
mix <- function(x, components, weights) {
    matrix(unlist(lapply(components, function(u) u(x))), ncol=length(weights)) %*% weights
}
g.upper <- lapply(quantiles, function(q) f(q)$upper)
g.lower <- lapply(quantiles, function(q) f(q)$lower)
curve(mix(x, g.upper, weights), from=-5/2, to=2, ylim=c(0,1), 
      ylab="Probability", main="P-box for Three Subpopulations", n=1001,  col="Red")
curve(mix(x, g.lower, weights), add=TRUE, n=1001, col="Blue")
abline(h=0.6, lty=2, col="Gray")
Related Question