Say I have a mean and standard deviation for a dataset of 5 elements.
I now add a sixth element. Is there a way to calculate the new mean and standard deviation using the information we had prior (i.e. not just recalculating the whole thing from scratch)?
For the mean, I see that I can just multiply the old one by $5$, add the new element, and divide by $6$.
I'm not sure if there's something I can do with the standard deviation, however.
$$\sigma_{old} = \sqrt{\frac{1}{N}\sum_i (X_i - \mu_{old})^2}$$
$$\sigma_{new} = \sqrt{\frac{1}{N+1}\left(\sum_i (X_i - \mu_{new})^2 + (X_{new} - \mu_{new})^2\right)}$$
$$\mu_{new} = \frac{\mu_{old}*N + X_{new}}{N+1}$$
$$(N+1)\,\sigma^2_{new} = N\,\sigma^2_{old} + \sum_i \left( (X_i - \mu_{new})^2 - (X_i - \mu_{old})^2 \right) + (X_{new} - \mu_{new})^2$$
After putting it in terms of the old stats, this becomes (I think)
$$(N+1)\,\sigma^2_{new} = N\,\sigma^2_{old} + \sum_i \left(-2 X_i + \frac{(2N+1) \mu_{old} + X_{new}}{N+1} \right) \left(\frac{X_{new} - \mu_{old}}{N+1}\right) + \left(X_{new} - \frac{\mu_{old}*N + X_{new}}{N+1}\right)^2$$
Is there anything better than this monstrosity?
Best Answer
Let's say you started with $n$ points and have added an $(n+1)^{st}$. To handle the variance, write $\mu_{new} = \mu_{old} + \delta$. We see that we need to compute $$\sum_{i=1}^{n} (x_i - \mu_{new})^2$$ where the sum is taken over the old $x_i$'s only (the contribution from the $(n+1)^{st}$ sample is easily incorporated afterwards). But $$(x_i - \mu_{new})^2 = (x_i - \mu_{old} - \delta)^2$$ so our sum becomes $$\sum_{i=1}^{n} (x_i - \mu_{new})^2 = \sum_{i=1}^{n} (x_i - \mu_{old})^2 - 2 \delta \sum_{i=1}^{n} (x_i - \mu_{old}) + n \delta^2 = \sum_{i=1}^{n} (x_i - \mu_{old})^2 + n \delta^2$$ where the middle sum vanishes because the old $x_i$'s average to $\mu_{old}$, so their deviations from it sum to zero.
Combining all this (and trusting that no algebraic error has been made!) we see that $$Var_{new} = \frac{(x_{n+1}-\mu_{new})^2}{n+1}+ \frac{n}{n+1}Var_{old} + \frac{n}{n+1}\delta^2$$
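A possible further simplification, assuming $Var$ here means the population variance (dividing by the number of points): since $\delta = \frac{x_{n+1} - \mu_{old}}{n+1}$, we have $x_{n+1} - \mu_{new} = (n+1)\delta - \delta = n\delta$, so the first and last terms combine: $$Var_{new} = \frac{n^2\delta^2}{n+1} + \frac{n}{n+1}Var_{old} + \frac{n\,\delta^2}{n+1} = \frac{n}{n+1}Var_{old} + n\delta^2 = \frac{n}{n+1}Var_{old} + \frac{n\,(x_{n+1} - \mu_{old})^2}{(n+1)^2}$$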
Not too terrible!
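If you want to sanity-check the update numerically, here is a minimal Python sketch. It assumes the variance is the population variance (divide by $n$, not $n-1$); the function name `add_point` and the sample data are made up for illustration.

```python
import math
import statistics


def add_point(n, mean_old, var_old, x_new):
    """Update (count, mean, population variance) after adding one point.

    Assumption: var_old is the population variance (divide by n)
    of the first n points, matching the formula in the answer.
    """
    mean_new = (n * mean_old + x_new) / (n + 1)
    delta = mean_new - mean_old
    # Var_new = (x_new - mean_new)^2/(n+1) + n/(n+1)*Var_old + n/(n+1)*delta^2
    var_new = ((x_new - mean_new) ** 2 + n * var_old + n * delta ** 2) / (n + 1)
    return n + 1, mean_new, var_new


# Illustrative data: five old points plus one new point.
data = [2.0, 4.0, 4.0, 5.0, 7.0]
x_new = 8.0

n, mean, var = add_point(
    len(data), statistics.mean(data), statistics.pvariance(data), x_new
)
std = math.sqrt(var)  # new standard deviation

# Compare against recomputing everything from scratch.
full = data + [x_new]
print(math.isclose(mean, statistics.mean(full)))
print(math.isclose(var, statistics.pvariance(full)))
```

Both comparisons should agree up to floating-point rounding. For the sample variance (dividing by $n-1$) the factors would need adjusting, so treat this as a check of the formula above rather than a general-purpose routine.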