Mean Square Error – Implementing Online Algorithms for MSE Calculation

Tags: mse, online-algorithms

Given a dataset $\{(x_1, y_1), (x_2, y_2), \dots\}$, we can compute the linear regression for those points incrementally (or "online").

In other words, given a new point $(x_i, y_i)$, we can recompute the coefficients $\alpha$ and $\beta$ of the best-fit line $y = \alpha x + \beta$ in $O(1)$ time. To do so, simply use the online algorithms for the variance and the covariance.
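For concreteness, here is a minimal sketch of that kind of $O(1)$-per-point update, using Welford-style running statistics; the class name and structure are just for illustration:

```python
class OnlineSimpleRegression:
    """Illustrative sketch: running statistics for fitting y = alpha * x + beta."""

    def __init__(self):
        self.n = 0          # number of points seen so far
        self.mean_x = 0.0   # running mean of x
        self.mean_y = 0.0   # running mean of y
        self.m2_x = 0.0     # sum of squared deviations of x (n * Var(x))
        self.c_xy = 0.0     # sum of co-deviations of x and y (n * Cov(x, y))

    def add(self, x, y):
        """Incorporate one new point in O(1) time."""
        self.n += 1
        dx = x - self.mean_x                     # deviation from the old mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.m2_x += dx * (x - self.mean_x)      # Welford update for Var(x)
        self.c_xy += dx * (y - self.mean_y)      # online update for Cov(x, y)

    def coefficients(self):
        """Return (alpha, beta); requires at least two distinct x values."""
        alpha = self.c_xy / self.m2_x
        beta = self.mean_y - alpha * self.mean_x
        return alpha, beta
```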

However, I don't know how to compute the mean square error incrementally in time $O(1)$ per new point:

$\text{MSE} = \frac{1}{N}\sum_i (y_i - \alpha x_i - \beta)^2$

Is there any known algorithm for that?

As a bonus, it would also be great if it was possible to remove any point from the dataset and recompute the MSE in constant time (i.e., doing the reverse operation).

Best Answer

Recall that simple linear regression can be described as

$$Y = m(Y) + \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}\,(X - m(X))$$

so to calculate it you need the means $m(X)$ and $m(Y)$, the variance of $X$, and the covariance between $X$ and $Y$. The mean and variance can be calculated with a single-pass online algorithm, and there is also an online algorithm for the covariance, as described on Wikipedia. So the only thing you need to do is substitute the regular estimators with their online counterparts. See also the thread "Are there algorithms for computing 'running' linear or logistic regression parameters?".

Apart from this, what people usually do in such cases is use stochastic gradient descent, as it is easy to implement (it is already available in many packages, e.g. Vowpal Wabbit, TensorFlow, scikit-learn) and easy to adapt to more complicated cases. Also notice that while it is possible to estimate the parameters in a single pass, using more than one pass generally gives more accurate results.
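For illustration, a minimal SGD sketch for this model (the function name, learning rate, and number of passes are arbitrary choices, not taken from any particular package):

```python
def sgd_fit(points, lr=0.01, epochs=1):
    """Plain SGD sketch for y = alpha * x + beta; each update costs O(1)."""
    alpha, beta = 0.0, 0.0
    for _ in range(epochs):                 # more than one pass usually improves accuracy
        for x, y in points:
            error = alpha * x + beta - y    # residual for the current point
            alpha -= lr * 2 * error * x     # gradient of the squared error w.r.t. alpha
            beta -= lr * 2 * error          # gradient of the squared error w.r.t. beta
    return alpha, beta
```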

As for calculating the mean squared error, this is very simple: you only need to store the sum of squared errors and, at each iteration, increase it by the squared error of the current observation, then divide by the number of iterations made so far. Notice that the MSE changes as your model changes, so how to calculate it also depends on what exactly it is supposed to measure. If you want to capture how good your model currently is while accounting for past performance (previous iterations), then the solution above works. If you want to measure only the current performance, you would need to recalculate the MSE at each iteration on the whole dataset. On the other hand, if you want to measure out-of-sample performance, then probably the best you can do is calculate it at each step on a hold-out sample, or on the $t+1,\dots,t+k$ observations ahead, as described by Hyndman and Athanasopoulos.
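A minimal sketch of that running average of squared errors (the class and method names are illustrative only):

```python
class RunningMSE:
    """Accumulates squared errors and reports their running average."""

    def __init__(self):
        self.sse = 0.0   # sum of squared errors so far
        self.n = 0       # number of scored observations

    def update(self, y_true, y_pred):
        """Add the squared error of the current observation; O(1) per call."""
        self.sse += (y_true - y_pred) ** 2
        self.n += 1

    def value(self):
        return self.sse / self.n
```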
