[Math] Aggregating standard deviation to a summary point

average, standard deviation, statistics

I have a range of data (server performance statistics) formatted as follows for each server:

Time            | Average |  Min  |  Max  | StdDev  | SampleCount |
-------------------------------------------------------------------
Monday 1st      |    125  |   15  |  220  | 12.56   |     5       |
Tuesday 2nd     |    118  |   11  |  221  | 13.21   |     4       |
Wednesday 3rd   |    118  |   11  |  221  | 13.21   |     3       |
....            |    ...  |   ..  |  ...  | .....   |     .       |
and so on...

These data points are calculated from data that has a finer resolution (e.g. hourly data).

I need to aggregate this data into a single summary point so the end result is a list of servers and an aggregate average, min, max, standard deviation.

For the average, I take the average of all the averages.
For the min, I take the minimum of the mins.
For the max, I take the maximum of the maxes.

However, I'm not sure which method I should use to aggregate the standard deviation. I've seen various answers involving square roots and variance, but I really need a concrete answer on this. Can anyone help?

Best Answer

General Solution

To compute mean, variance, and standard deviation you only need to keep track of three sums $s_0, s_1, s_2$ defined as follows for a set of values $X$:

$$(s_0, s_1, s_2) = \sum_{x \in X} (1, x, x^2)$$

In English, $s_0$ is the number of values, $s_1$ is the sum of the values, and $s_2$ is the sum of the squared values. Given these sums, we can derive the mean (average) $\mu$, the population variance $\sigma^2$, and the population standard deviation $\sigma$:

$$\mu = \frac{s_1}{s_0} \qquad \sigma^2 = \frac{s_2}{s_0} - \left(\frac{s_1}{s_0}\right)^2 \qquad \sigma = \sqrt{\frac{s_2}{s_0} - \left(\frac{s_1}{s_0}\right)^2}$$

In English, the variance is the average of the squared values minus the square of the average value.
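As a minimal sketch of the formulas above, the following Python accumulates the three sums and derives the statistics from them. The function names `power_sums` and `stats_from_sums` are made up for illustration. The clamp to zero is an added safeguard, not part of the formulas: subtracting two nearly equal floating-point numbers can leave a tiny negative variance.

```python
import math

def power_sums(values):
    """Return (s0, s1, s2): the count, the sum, and the sum of squares."""
    s0 = s1 = s2 = 0
    for x in values:
        s0 += 1
        s1 += x
        s2 += x * x
    return s0, s1, s2

def stats_from_sums(s0, s1, s2):
    """Return (mean, population variance, population standard deviation)."""
    mean = s1 / s0
    # Average of the squares minus the square of the average.
    variance = s2 / s0 - mean ** 2
    # Guard against a slightly negative result caused by rounding.
    variance = max(variance, 0.0)
    return mean, variance, math.sqrt(variance)

mean, variance, std_dev = stats_from_sums(*power_sums([2, 4, 4, 4, 5, 5, 7, 9]))
print(mean, variance, std_dev)
```

For this small data set the sums are $s_0 = 8$, $s_1 = 40$, $s_2 = 232$, which gives a mean of 5, a variance of 4, and a standard deviation of 2.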

Your particular case

For each row you have $s_0$ (SampleCount), $\mu$ (Average), and $\sigma$ (StdDev), so you need to compute $s_1$ and $s_2$ by solving the above equations for those variables:

$$s_1 = s_0\mu \qquad s_2 = s_0\left(\mu^2 + \sigma^2\right)$$
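
To make the inversion explicit: the first identity is the definition of the mean rearranged, and the second comes from rearranging the variance formula with $s_1/s_0$ replaced by $\mu$:

$$\sigma^2 = \frac{s_2}{s_0} - \mu^2 \quad\Longrightarrow\quad s_2 = s_0\left(\mu^2 + \sigma^2\right)$$

This assumes the reported StdDev is a population standard deviation. If it were instead a sample standard deviation $\sigma_s$ (divisor $s_0 - 1$), which the question does not specify, the second identity would become:

$$s_2 = (s_0 - 1)\,\sigma_s^2 + s_0\mu^2$$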

Once you have $s_0, s_1, s_2$ for each data set, aggregation is just a matter of adding the corresponding sums together and deriving the desired aggregate values from those sums.
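
The aggregation described here could be sketched in Python roughly as follows. The function name `aggregate_summaries` and the `(mean, std_dev, count)` row layout are assumptions made for this example, and the StdDev column is assumed to hold population standard deviations. The rows are the three shown in the question.

```python
import math

def aggregate_summaries(rows):
    """Combine (mean, std_dev, count) rows into one (mean, std_dev, count).

    Assumes each std_dev is a population standard deviation.
    """
    s0 = s1 = s2 = 0.0
    for mean, std_dev, count in rows:
        # Recover the per-row sums, then add them to the running totals.
        s0 += count
        s1 += count * mean
        s2 += count * (mean ** 2 + std_dev ** 2)
    if s0 == 0:
        raise ValueError("no samples to aggregate")
    mean = s1 / s0
    # Clamp to zero in case rounding produces a tiny negative variance.
    variance = max(s2 / s0 - mean ** 2, 0.0)
    return mean, math.sqrt(variance), int(s0)

# Rows from the question: (Average, StdDev, SampleCount).
rows = [(125, 12.56, 5), (118, 13.21, 4), (118, 13.21, 3)]
mean, std_dev, count = aggregate_summaries(rows)
print(f"mean={mean:.2f} std_dev={std_dev:.2f} count={count}")
```

For these three rows the aggregate mean comes out at roughly 120.92 and the standard deviation at roughly 13.40 over 12 samples. Note that the mean obtained this way is weighted by sample count, so it differs from the plain average of the averages (about 120.33 here) whenever the sample counts are unequal.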

Variance Equation Derivation

We start with the standard equation for variance (population) and go from there:

$$\sigma^2 = \frac{1}{n}\sum_{x \in X} \left(x - \mu\right)^2 = \frac{1}{s_0}\sum_{x \in X} \left(x - \frac{s_1}{s_0}\right)^2$$

$$= \frac{1}{s_0}\sum_{x \in X} \left(x^2 - 2x\frac{s_1}{s_0} + \left(\frac{s_1}{s_0}\right)^2\right) = \frac{1}{s_0}\sum_{x \in X} x^2 - 2\frac{s_1}{s_0^2}\sum_{x \in X} x + \frac{s_1^2}{s_0^3}\sum_{x \in X} 1 $$

$$= \frac{1}{s_0}(s_2) - 2\frac{s_1}{s_0^2}(s_1) + \frac{s_1^2}{s_0^3}(s_0) = \frac{s_2}{s_0} - 2\frac{s_1^2}{s_0^2} + \frac{s_1^2}{s_0^2} = \frac{s_2}{s_0} - \left(\frac{s_1}{s_0}\right)^2 $$