Solved – Defining quantiles over a weighted sample

algorithmsquantilesweighted-sampling

I have a weighted sample, for which I wish to calculate quantiles.1

Ideally, where the weights are equal (whether = 1 or otherwise), the results would be consistent with those of scipy.stats.scoreatpercentile() and R's quantile(...,type=7).

One simple approach would be to "multiply out" the sample using the weights given. That effectively gives a locally "flat" ecdf in the areas of weight > 1, which intuitively seems like the wrong approach when the sample is actually a sub-sampling. In particular, it means that a sample with weights all equal to 1 has different quantiles than one with weights all equal to 2, or 3. (Note, however, that the paper referenced in [1] does appear to use this approach.)

http://en.wikipedia.org/wiki/Percentile#Weighted_percentile gives an alternative formulation for weighted percentile. Its not clear in this formulation whether adjacent samples with identical values should first be combined and their weights summed, and in any case its results do not appear to be consistent with R's default type 7 quantile() in the unweighted/equally weighted case. The wikipedia page on quantiles doesn't mention the weighted case at all.

Is there a weighted generalisation of R's "type 7" quantile function?

[using Python, but just looking for an algorithm, really, so any language will do]

M

[1] Weights are integers; the weights are those of the buffers which are combined in the "collapse" and "output" operations as described in http://infolab.stanford.edu/~manku/papers/98sigmod-quantiles.pdf. Essentially the weighted sample is a sub-sampling of the full unweighted sample, with each element x(i) in the sub-sample representing weight(i) elements in the full sample.

Best Answer

This is one possible approach:

Let's suppose you have a ordered sample $X_1 \le X_2 \le \cdots \le X_n$ with respective weights $W_1, W_2, \ldots, W_n$.

Define $$S_k = (k-1) W_k+ (N-1) \sum_{i=1}^{k-1} W_i$$ so $S_1=0$ and $S_n = (N-1) \sum_{i=1}^{N} W_i$.

For an interpolation of quantile $p$, find $k$ such that $\frac{S_k}{S_n} \le p \le \frac{S_{k+1}}{S_n}$. Your estimate could then be

$$X_k + (X_{k+1}-X_k)\frac{pS_n-S_k}{S_{k+1}-S_k}.$$

I think you will find that if the $W_i$ are all equal then this reproduces R-7. There are other approaches which do too, but I suspect they do not treat all the ordered weights as being equally important.