[Math] incrementally calculate gaussian distribution

normal distribution

Is there a way to incrementally calculate the parameters for a Gaussian distribution? The only parameters that are of interest are the mean and standard deviation, not the height (ie. normal distribution)

The problem is, I visit about 100 billion data points, of which I need to quickly calculate/update the distribution parameters, but I can not store all points in memory or visit all points for every estimation.

I'm looking for a way to calculate the distribution with only keeping a persistent state of size N(1) and the latest visited data point.

I can calculate the 'running mean' by this (assuming no numerical inaccuracies):

mean[0]=0;
mean[n+1] = mean[n] + (value - mean[n]) / (n+1)
#or alternatively: (more accurate in my experiments, but also slower)
mean[n+1] = (n * mean[n] + value) / (n+1)

but I don't know how to calculate the 'running deviation' (or variance for that matter)

I could build a histogram of values, but the problem is that the initial spread of the values is unknown and could span in the hundreds but could just as well span millions (depending on the specific dynamic system generating values). Furthermore, as the simulation goes on, the deviation typically increases monotonically, starting at 0.

Best Answer

Recall that the standard deviation $\sigma$ for some sample $X$ with results $x_{1},x_{2},\dots,x_{i}$ can be calculated by:

$$\sigma=\sqrt{\frac{\sum_{i}x_{i}^{2}}{n}-\left(\frac{\sum_{i}x_{i}}{n}\right)^{2}}=\sqrt{E(X^{2})-E^{2}(X)}$$

Therefore, if we add a new result $x_{\text{new}}$ to the sample, we can calculate the new mean $\bar{x}_{\text{new}}$ using:

$$\bar{x}_{\text{new}}=\frac{n\bar{x}+x_{\text{new}}}{n+1}$$

And then, rearranging $\sigma^{2}=E(X^{2})-E^{2}(X)=\frac{\sum_{i}x_{i}^{2}}{n}-\bar{x}^{2}$ gives $\sum_{i}x_{i}^{2}=n(\sigma^{2}+\bar{x}^{2})$, therefore, we find that:

$$\sigma^{2}_{\text{new}}=\frac{n(\sigma^{2}+\bar{x}^{2})+x_{\text{new}}^{2}}{n+1}-\bar{x}_{\text{new}}^{2}$$

Or, in pseudocode:

mean[n+1] = ((n * mean[n]) + value)/(n+1)
var[n+1] = (n * (var[n] + mean[n]^2) + value^2)/(n+1) - mean[n+1]^2