Solved – Online algorithm for mean absolute deviation and large data set

algorithms, large data, online-algorithms, quantiles

I have a little problem that is freaking me out.
I have to write a procedure for an online acquisition process of a multivariate time series.
At every time interval (for example 1 second), I get a new sample, which is basically a floating point vector of size N.
The operation I need to do is a little bit tricky:

  1. For each new sample, I compute the percentuals for that sample (by normalizing the vector so that the elements will sum to 1).

  2. I calculate the average percentuals vector in the same way, but using the past values.

  3. For each past value, I compute the absolute deviation of the percentuals vector associated with that sample from the global average percentuals vector computed at step 2. This way, the absolute deviation is always a number between 0 (when the vector is equal to the average vector) and 2 (when it is totally different).

  4. Using the average of the deviations for all the previous samples, I compute the mean absolute deviation, which is again a number between 0 and 2.

  5. I use the mean absolute deviation to detect whether a new sample is compatible with the other samples (by comparing its absolute deviation with the mean absolute deviation of the whole set computed at step 4); a naive batch sketch of these steps is shown after this list.
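For concreteness, here is a minimal and deliberately naive sketch of steps 1-5 in Python/NumPy. It rescans all past samples on every call, which is exactly the cost the question wants to avoid; the function names and the `tolerance` factor in step 5 are my own illustration, not part of the original procedure.

```python
import numpy as np

def percentuals(x):
    """Step 1: normalize a non-negative vector so its elements sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def is_compatible(new_sample, past_samples, tolerance=1.0):
    """Naive batch version of steps 2-5: rescans every past sample each time."""
    P = np.array([percentuals(s) for s in past_samples])        # past percentual vectors
    mean_p = P.mean(axis=0)                                      # step 2: average percentuals
    deviations = np.abs(P - mean_p).sum(axis=1)                  # step 3: L1 deviations, each in [0, 2]
    mad = deviations.mean()                                      # step 4: mean absolute deviation
    new_dev = np.abs(percentuals(new_sample) - mean_p).sum()     # step 5: deviation of the new sample
    return new_dev <= tolerance * mad, new_dev, mad              # compatibility rule is a placeholder
```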

Since every time a new sample is collected the global average changes (and so the mean absolute deviation changes as well), is there a way to compute this value without scanning the whole data set multiple times (once to compute the global average percentuals, and once to collect the absolute deviations)?
Ok, I know it's easy to calculate the global averages without scanning the whole set, since I just have to keep a temporary vector storing the sum of each dimension (see the sketch below), but what about the mean absolute deviation? Its calculation involves the abs() operator, so I need access to all the past data!
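For reference, the "easy" part, keeping the global average with a single running-sum vector, might look like this (a minimal sketch; the class name is mine):

```python
import numpy as np

class RunningMean:
    """Incremental mean of percentual vectors: only a sum vector and a count are stored."""
    def __init__(self, n):
        self.sum = np.zeros(n)
        self.count = 0

    def update(self, p):
        """Add one percentual vector (elements sum to 1)."""
        self.sum += p
        self.count += 1

    def mean(self):
        return self.sum / self.count
```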

Thanks for your help.

Best Answer

If you can accept some inaccuracy, this problem can be solved easily by binning counts. That is, pick some largeish number $M$ (say $M = 1000$), then initialize some integer bins $B_{i,j}$ to zero, for $i = 1\ldots M$ and $j = 1\ldots N$, where $N$ is the vector size. Then when you see the $k$th observation of a percentual vector, increment $B_{i,j}$ if the $j$th element of this vector is between $(i-1)/M$ and $i/M$, looping over the $N$ elements of the vector. (I am assuming your input vectors are non-negative, so that when you compute your 'percentuals', the vectors are in the range $[0,1]$.)

At any point in time, you can estimate the mean vector from the bins, as well as the mean absolute deviation. After observing $K$ such vectors, the $j$th element of the mean is estimated by $$\bar{X}_j = \frac{1}{K} \sum_i \frac{i - 1/2}{M} B_{i,j},$$ and the $j$th element of the mean absolute deviation is estimated by $$\frac{1}{K} \sum_i \left| \bar{X}_j - \frac{i - 1/2}{M} \right| B_{i,j}.$$
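Here is a minimal NumPy sketch of this binning scheme (the class and method names are my own, and I assume the percentual values already lie in $[0,1]$, as stated above):

```python
import numpy as np

class BinnedMAD:
    """Approximate per-element mean and mean absolute deviation from M bins per dimension.
    B[i, j] counts how many percentual values of dimension j fell in ((i-1)/M, i/M]."""
    def __init__(self, n, m=1000):
        self.M = m
        self.B = np.zeros((m, n), dtype=np.int64)   # the B_{i,j} counters
        self.K = 0                                   # number of vectors observed

    def update(self, p):
        """p: one percentual vector with entries in [0, 1]."""
        p = np.asarray(p, dtype=float)
        # bin index 0..M-1 for value v; v = 0 is assigned to the first bin
        idx = np.clip(np.ceil(p * self.M).astype(int) - 1, 0, self.M - 1)
        self.B[idx, np.arange(len(p))] += 1
        self.K += 1

    def estimates(self):
        """Return (mean, mad) per dimension using the bin-center formulas above."""
        centers = (np.arange(1, self.M + 1) - 0.5) / self.M             # (i - 1/2) / M
        mean = (centers[:, None] * self.B).sum(axis=0) / self.K          # \bar{X}_j
        mad = (np.abs(mean[None, :] - centers[:, None]) * self.B).sum(axis=0) / self.K
        return mean, mad
```

The memory cost is $M \times N$ counters regardless of how many samples arrive; each update costs $O(N)$ and each estimate $O(MN)$, so nothing ever rescans the raw data.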

Edit: this is a specific case of a more general approach in which you build an empirical density estimate. This could be done with polynomials, splines, etc., but the binning approach is the easiest to describe and implement.