Statistics – How to Calculate Variance from a Stream of Sample Values

algorithms, statistics

I'd like to calculate a standard deviation for a very large (but known) number of sample values, with the highest accuracy possible. The number of samples is larger than can be efficiently stored in memory.

The basic variance formula is:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

… but this formulation depends on knowing the value of $\mu$ already.

$\mu$ can be calculated cumulatively — that is, you can calculate the mean without storing every sample value. You only have to keep the running sum and the count of samples seen so far.
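For instance, a minimal sketch of that idea (Python; the function name is just illustrative):

```python
def running_mean(samples):
    """Yield the mean after each sample, storing only a sum and a count."""
    total = 0.0
    for count, x in enumerate(samples, start=1):
        total += x          # only the accumulated sum is kept in memory
        yield total / count # mean of the first `count` samples
```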

But to calculate the variance, is it necessary to store every sample value? Given a stream of samples, can I accumulate a calculation of the variance, without a need for memory of each sample? Put another way, is there a formulation of the variance which doesn't depend on foreknowledge of the exact value of $\mu$ before the whole sample set has been seen?

Best Answer

I'm a little late to the party, but it appears that this method is numerically unstable. Fortunately, there is a way to compute the variance in a streaming fashion without sacrificing numerical stability.

Cook describes a method from Knuth, the punchline of which is to initialize $m_1 = x_1$ and $v_1 = 0$, where $m_k$ is the mean of the first $k$ values and $v_k$ is their accumulated sum of squared deviations from the running mean. From there,

$$ \begin{align*} m_k & = m_{k-1} + \frac{x_k - m_{k-1}}k \\ v_k & = v_{k-1} + (x_k - m_{k-1})(x_k - m_k) \end{align*} $$

The mean at this point is simply extracted as $m_k$, and the sample variance is $s^2 = \frac{v_k}{k-1}$ (divide by $k$ instead to get the population variance from the question's formula). It's easy to verify that it works for the mean, but I'm still working on grokking the variance.
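Here is a minimal sketch of that recurrence (Python; the function name is mine, not from Cook or Knuth, and it returns the sample variance $\frac{v_k}{k-1}$):

```python
def streaming_mean_variance(samples):
    """Welford-style running mean and variance over a stream of samples.

    Stores only k, m_k and v_k -- never the samples themselves.
    """
    k = 0
    m = 0.0  # running mean m_k
    v = 0.0  # running sum of squared deviations v_k
    for x in samples:
        k += 1
        m_prev = m
        m = m_prev + (x - m_prev) / k   # m_k = m_{k-1} + (x_k - m_{k-1}) / k
        v = v + (x - m_prev) * (x - m)  # v_k = v_{k-1} + (x_k - m_{k-1})(x_k - m_k)
    sample_variance = v / (k - 1) if k > 1 else 0.0
    return m, sample_variance

mean, var = streaming_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Note that starting from $m = 0$, $v = 0$ the first pass through the loop gives $m_1 = x_1$ and $v_1 = 0$, matching the initialization above.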
