Solved – How to aggregate percentiles from previously aggregated percentiles

aggregation, quantiles

I have data that were previously aggregated into 5-minute intervals. For each interval I have the count of sensors at that location and the percentiles of a value of interest in 5% steps: 5th, 10th, …, 90th, 95th. A sensor can have multiple observations in a 5-minute period, or it can have none.

I want to aggregate these data up further and calculate the same percentiles. How can I do so?

My thought was to treat the percentiles as observations and assign them a weight of 21 / c_t, where 21 is the number of percentiles at time t and c_t is the count at t. I would then calculate a weighted percentile from there.
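A minimal sketch of the weighted-percentile step described above, in Python with NumPy. The function name `weighted_percentiles` and the example numbers are hypothetical. The weights are left as an argument, so 21 / c_t or any other choice can be passed in; the example call uses weights proportional to the counts, which is an assumption and not something the data dictate.

```python
import numpy as np

def weighted_percentiles(values, weights, probs):
    """Weighted percentiles via the inverse of the weighted empirical CDF.

    For each p in `probs`, returns the smallest value whose cumulative
    share of the total weight reaches p. No interpolation is done.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort the values and carry their weights along.
    order = np.argsort(values)
    values, weights = values[order], weights[order]

    # Cumulative weight share; the last entry is exactly 1.0.
    cum = np.cumsum(weights)
    cdf = cum / cum[-1]

    idx = np.searchsorted(cdf, np.asarray(probs, dtype=float), side="left")
    idx = np.minimum(idx, len(values) - 1)  # guard against p slightly above 1
    return values[idx]

# Made-up example: two intervals, three stored percentiles each.
# Interval A had 10 sensors, interval B had 30 (assumed weights: the counts).
stored = [1, 2, 3, 4, 5, 6]            # A's percentiles, then B's
weights = [10, 10, 10, 30, 30, 30]
combined = weighted_percentiles(stored, weights, [0.2, 0.6, 0.9])
```

Whatever weights are used, the result is only an approximation of the percentiles of the pooled raw data, because the stored percentiles are a lossy summary of each interval.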

Best Answer

From what I understand, you run a counting experiment over fixed time intervals, and you want to derive data for larger time intervals from the original data.

In physics this would be called data combination. And since you're also interested in the statistical errors, you need error propagation.

The first and most important question is what the underlying probability density function is. This directly determines the uncertainty, which is used in the error propagation.

My guess would be that your values follow a Poisson distribution. But you actually have the error bands, which you want to include in the calculation of the new value.

In full glory, the general error propagation works like this: $$V^\prime(\vec{y}) = \mathbf{B}\cdot V(\vec{x})\cdot \mathbf{B}^T$$ where $\vec{y}$ is the transformed value, $\vec{x}$ is the input value, $V$ is the covariance matrix before the transformation and $V^\prime$ is the covariance matrix after it. The transformation matrix is $\mathbf{B}$.
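To make the formula concrete, here is a small NumPy sketch. The helper name `propagate_covariance` and the variances are made up for illustration; the transformation is the simple sum of two inputs, so $\mathbf{B}$ is a single row of ones.

```python
import numpy as np

def propagate_covariance(B, V):
    """Return V' = B V B^T for a linear transformation y = B x."""
    B = np.atleast_2d(np.asarray(B, dtype=float))
    V = np.asarray(V, dtype=float)
    return B @ V @ B.T

# y = x0 + x1, so B has one row of ones.
B = np.array([[1.0, 1.0]])

# Uncorrelated inputs with (made-up) variances 4 and 9.
V_uncorrelated = np.diag([4.0, 9.0])
var_sum_uncorrelated = propagate_covariance(B, V_uncorrelated)[0, 0]

# Same variances, but with a covariance of 3 between the inputs.
V_correlated = np.array([[4.0, 3.0],
                         [3.0, 9.0]])
var_sum_correlated = propagate_covariance(B, V_correlated)[0, 0]
```

Without correlation the variances just add (4 + 9 = 13); with a covariance of 3 the result picks up twice that covariance (4 + 9 + 2·3 = 19).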

You have a one-dimensional problem, with only a single output dimension. If there is no correlation between the values, the whole thing becomes trivial. If there are correlations, the covariance terms have to be included in the sum.

There is no general formula for the covariance matrix; it depends on the dimensions involved and on whether the transformation is linear or not.

So for your problem, this would mean something like the following. Let's say you're interested in the 10-minute values and their corresponding quantiles.

You have data points $c_0$, $c_1$, $c_2$, $c_3$, $c_4$, $\dots$, each with its quantiles, e.g. $cq_0 = [q_{00}, q_{01}, q_{02}, \dots]$.

Your new variable would be $d_i = c_j + c_k$, i.e. just the counts added up, which is a linear transformation from one variable to another.
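A sketch of this count aggregation, assuming that $c_j$ and $c_k$ are adjacent 5-minute intervals (so $d_i = c_{2i} + c_{2i+1}$) and that an incomplete trailing group is dropped. Both are assumptions, as is the function name `combine_counts`. The error estimate relies on the Poisson guess from above, under which the variance of a count equals the count.

```python
import numpy as np

def combine_counts(counts, factor=2):
    """Sum consecutive groups of `factor` counts (e.g. 5-min -> 10-min)."""
    counts = np.asarray(counts)
    n_full = (len(counts) // factor) * factor  # drop an incomplete last group
    return counts[:n_full].reshape(-1, factor).sum(axis=1)

# Made-up 5-minute counts; the fifth value has no partner and is dropped.
c = [3, 5, 2, 4, 7]
d = combine_counts(c)

# Under the Poisson assumption, and with uncorrelated intervals,
# the standard deviation of each combined count is sqrt(d).
sigma_d = np.sqrt(d)
```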

The more interesting part is then the new quantile. In case of no correlation, one uncertainty does not depend on the other. As a result, the quantile values would also just add up.

But this is most certainly not the case, since the data are a mix of information. A formal discussion of this is the whole point of the answer mentioned in the comment (https://stats.stackexchange.com/a/95841/120447).
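A quick numerical check, with made-up raw values, of why stored quantiles cannot simply be summed or averaged. It compares the median of the pooled raw data with two naive combinations of the per-interval medians.

```python
import numpy as np

# Made-up raw observations for two 5-minute intervals.
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
b = np.array([100.0])

median_a = np.median(a)
median_b = np.median(b)

# What we actually want: the median of all observations together.
pooled_median = np.median(np.concatenate([a, b]))

# Two naive combinations of the stored medians.
summed = median_a + median_b
count_weighted = (len(a) * median_a + len(b) * median_b) / (len(a) + len(b))
```

Here the pooled median is 5.5, while the sum of the medians is 105 and their count-weighted mean is 14.5. Neither shortcut reproduces the quantile of the combined data.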
