Linear Regression – Efficient Methods for Online Linear Regression

algorithms, online-algorithms, real time, regression, time series

I'm analysing some data where I would like to perform ordinary linear regression; however, this is not possible because I am dealing with an online setting with a continuous stream of input data (which will quickly become too large for memory), and I need to update the parameter estimates as the data are consumed. That is, I cannot simply load everything into memory and run a regression on the entire data set.

I'm assuming a simple linear multivariate regression model, i.e.

$$\mathbf y = \mathbf A\mathbf x + \mathbf b + \mathbf e$$

What's the best algorithm for creating a continuously updating estimate of the linear regression parameters $\mathbf A$ and $\mathbf b$?

Ideally:

  • I'd like an algorithm with at most $\mathcal O(N\cdot M)$ space and time complexity per update, where $N$ is the dimensionality of the independent variable ($\mathbf x$) and $M$ is the dimensionality of the dependent variable ($\mathbf y$).
  • I'd like to be able to specify some parameter that determines how much the parameters are updated by each new sample, e.g. 0.000001 would mean that the next sample provides one millionth of the parameter estimate. This would give some kind of exponential decay for the effect of samples in the distant past.

Best Answer

Maindonald describes a sequential method based on Givens rotations. (A Givens rotation is an orthogonal transformation of two vectors that zeros out a given entry in one of the vectors.) At the previous step you have decomposed the design matrix $\mathbf{X}$ into a triangular matrix $\mathbf{T}$ via an orthogonal transformation $\mathbf{Q}$, so that $\mathbf{Q}\mathbf{X} = (\mathbf{T}, \mathbf{0})'$. (It's fast and easy to get the regression results from a triangular matrix.) Upon adjoining a new row $v$ below $\mathbf{X}$, you effectively extend $(\mathbf{T}, \mathbf{0})'$ by a nonzero row, too, say $t$. The task is to zero out this row while keeping $\mathbf{T}$ triangular. A sequence of Givens rotations does this: the rotation with the first row of $\mathbf{T}$ zeros the first element of $t$; then the rotation with the second row of $\mathbf{T}$ zeros the second element, and so on. The effect is to premultiply $\mathbf{Q}$ by a series of rotations, which does not change its orthogonality.
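To make the bookkeeping concrete, here is a minimal numpy sketch of that row-update step (my own illustration, not Maindonald's code). It keeps only the triangular factor $\mathbf{T}$ (called `R` below) and the rotated response $\mathbf{Q}\mathbf{y}$ (called `qty`), and folds in one new observation with at most $p+1$ Givens rotations.

```python
import numpy as np

def add_row(R, qty, x_new, y_new):
    """Fold one observation into the upper-triangular factor R and the
    rotated response qty using Givens rotations.

    R      : (p+1, p+1) upper triangular (start from all zeros)
    qty    : (p+1,) rotated response vector (start from all zeros)
    x_new  : new design row, including the constant term
    y_new  : new response value
    """
    R, qty = R.copy(), qty.copy()
    t = np.asarray(x_new, dtype=float).copy()  # new row to be zeroed out
    z = float(y_new)                           # its response entry
    for i in range(R.shape[0]):
        # Rotate rows (R[i], t) in the plane of columns i.. so t[i] -> 0.
        r = np.hypot(R[i, i], t[i])
        if r == 0.0:
            continue
        c, s = R[i, i] / r, t[i] / r
        Ri, ti = R[i, i:].copy(), t[i:].copy()
        R[i, i:] = c * Ri + s * ti
        t[i:] = -s * Ri + c * ti
        qty[i], z = c * qty[i] + s * z, -s * qty[i] + c * z
    # The leftover z contributes z**2 to the residual sum of squares.
    return R, qty

# Usage: once R is nonsingular, the coefficients solve the triangular
# system R @ beta = qty, e.g. beta = np.linalg.solve(R, qty).
```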

When the design matrix has $p+1$ columns (which is the case when regressing on $p$ variables plus a constant), the number of rotations needed does not exceed $p+1$ and each rotation changes two $p+1$-vectors. The storage needed for $\mathbf{T}$ is $O((p+1)^2)$. Thus this algorithm has a computational cost of $O((p+1)^2)$ in both time and space.

A similar approach lets you determine the effect on the regression of deleting a row. Maindonald gives formulas; so do Belsley, Kuh, & Welsch. Thus, if you are looking for a moving window for regression, you can retain the data for the window within a circular buffer, adjoining the new datum and dropping the old one with each update. This doubles the update time and requires additional $O(k(p+1))$ storage for a window of width $k$. It appears that $1/k$ would be the analog of the influence parameter.
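If you can live without the numerical robustness of the triangular factorization, the same moving-window idea can be sketched with rank-one updates and downdates of $(\mathbf X'\mathbf X)^{-1}$ via the Sherman–Morrison formula instead of the factor-based formulas cited above; the point here is only the circular-buffer bookkeeping. The class and parameter names are my own, the response is taken as scalar (run $M$ copies for an $M$-dimensional $\mathbf y$), and the small `ridge` term is a crude way to start from an empty window.

```python
import numpy as np
from collections import deque

class SlidingWindowOLS:
    """Moving-window least squares via Sherman-Morrison rank-one
    updates/downdates of (X'X)^{-1}. Less stable than the Givens
    approach, but shows the add-new / drop-old bookkeeping."""

    def __init__(self, p, window, ridge=1e-6):
        self.window = window
        self.buffer = deque()                  # stores (x, y) pairs
        self.P = np.eye(p + 1) / ridge         # approx (X'X)^{-1}
        self.Xty = np.zeros(p + 1)

    def _rank_one(self, x, sign):
        # (X'X +/- x x')^{-1} from the current P by Sherman-Morrison.
        Px = self.P @ x
        self.P -= sign * np.outer(Px, Px) / (1.0 + sign * (x @ Px))

    def update(self, x, y):
        x = np.append(np.asarray(x, dtype=float), 1.0)  # constant term
        self.buffer.append((x, y))
        self._rank_one(x, +1.0)
        self.Xty += y * x
        if len(self.buffer) > self.window:     # drop the oldest observation
            x_old, y_old = self.buffer.popleft()
            self._rank_one(x_old, -1.0)
            self.Xty -= y_old * x_old
        return self.P @ self.Xty               # current coefficient estimate
```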

For exponential decay, I think (speculatively) that you could adapt this approach to weighted least squares, giving each new value a weight greater than 1. There shouldn't be any need to maintain a buffer of previous values or delete any old data.
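A standard way to get exactly this behaviour is recursive least squares with a forgetting factor $\lambda \in (0, 1]$, which discounts every past observation by $\lambda$ at each step; $1-\lambda$ then plays the role of the influence parameter asked for in the question (e.g. $\lambda = 0.999999$ gives roughly the one-millionth weighting). This is my own sketch of that well-known algorithm, not something taken from Maindonald.

```python
import numpy as np

class ExponentiallyWeightedRLS:
    """Recursive least squares with forgetting factor lam in (0, 1].
    Old data decay geometrically, so (1 - lam) is roughly the
    per-sample influence of a new observation."""

    def __init__(self, p, lam=0.999, delta=1e3):
        self.lam = lam
        self.beta = np.zeros(p + 1)          # coefficients, constant last
        self.P = delta * np.eye(p + 1)       # large P = vague prior on beta

    def update(self, x, y):
        x = np.append(np.asarray(x, dtype=float), 1.0)  # constant term
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)      # Kalman-style gain vector
        self.beta += gain * (y - x @ self.beta)
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return self.beta
```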

References

J. H. Maindonald, Statistical Computation. J. Wiley & Sons, 1984. Chapter 4.

D. A. Belsley, E. Kuh, R. E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. J. Wiley & Sons, 1980.
