Solved – Computation of new standard deviation using old standard deviation after change in dataset

online-algorithmsstandard deviation

I have an array of $n$ real values, which has mean $\mu_{old}$ and standard deviation $\sigma_{old}$. If an element of the array $x_i$ is replaced by another element $x_j$, then new mean will be

$\mu_{new}=\mu_{old}+\frac{x_j-x_i}{n}$

Advantage of this approach is it requires constant computation regardless of value of $n$. Is there any approach to calculate $\sigma_{new}$ using $\sigma_{old}$ like the computation of $\mu_{new}$ using $\mu_{old}$?

Best Answer

A section in the Wikipedia article on "Algorithms for calculating variance" shows how to compute the variance if elements are added to your observations. (Recall that the standard deviation is the square root of the variance.) Assume that you append $x_{n+1}$ to your array, then

$$\sigma_{new}^2 = \sigma_{old}^2 + (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$$

EDIT: Above formula seems to be wrong, see comment.

Now, replacing an element means adding an observation and removing another one; both can be computed with the formula above. However, keep in mind that problems of numerical stability may ensue; the quoted article also proposes numerically stable variants.

To derive the formula by yourself, compute $(n-1)(\sigma_{new}^2 - \sigma_{old}^2)$ using the definition of sample variance and substitute $\mu_{new}$ by the formula you gave when appropriate. This gives you $\sigma_{new}^2 - \sigma_{old}^2$ in the end, and thus a formula for $\sigma_{new}$ given $\sigma_{old}$ and $\mu_{old}$. In my notation, I assume you replace the element $x_n$ by $x_n'$:

$$ \begin{eqnarray*} \sigma^2 &=& (n-1)^{-1} \sum_k (x_k - \mu)^2 \\ (n-1)(\sigma_{new}^2 - \sigma_{old}^2) &=& \sum_{k=1}^{n-1} ((x_k - \mu_{new})^2 - (x_k - \mu_{old})^2) \\ &&+\ ((x_n' - \mu_{new})^2 - (x_n - \mu_{old})^2) \\ &=& \sum_{k=1}^{n-1} ((x_k - \mu_{old} - n^{-1}(x_n'-x_n))^2 - (x_k - \mu_{old})^2) \\ &&+\ ((x_n' - \mu_{old} - n^{-1}(x_n'-x_n))^2 - (x_n - \mu_{old})^2) \\ \end{eqnarray*}\\ $$

The $x_k$ in the sum transform into something dependent of $\mu_{old}$, but you'll have to work the equation a little bit more to derive a neat result. This should give you the general idea.

Related Question