Solved – Algorithm for real-time normalization of time-series data

normalization, real time, time series

I'm working on an algorithm that takes in a vector of the most recent data points from a number of sensor streams and compares its Euclidean distance to previous vectors. The problem is that the different data streams come from completely different sensors, so taking a simple Euclidean distance will dramatically overemphasize some values. Clearly, I need some way to normalize the data. However, since the algorithm is designed to run in real time, I can't use any information about a data stream as a whole in the normalization. So far I've just been keeping track of the largest value seen for each sensor during a start-up phase (the first 500 data vectors) and then dividing all future data from that sensor by that value. This is working surprisingly well, but feels very inelegant.
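In case it helps, here is roughly what my start-up normalization looks like (a minimal sketch; the class name and the `WARMUP_SIZE` constant are just illustrative):

```python
import numpy as np

# Illustrative sketch of the warm-up max normalization described above.
WARMUP_SIZE = 500

class MaxNormalizer:
    def __init__(self, n_sensors):
        self.max_seen = np.zeros(n_sensors)   # largest |value| seen per sensor
        self.count = 0                        # vectors processed so far

    def update(self, x):
        """Feed one data vector; return the normalized vector."""
        x = np.asarray(x, dtype=float)
        if self.count < WARMUP_SIZE:
            self.max_seen = np.maximum(self.max_seen, np.abs(x))
            self.count += 1
        # avoid division by zero for sensors that stayed flat during warm-up
        scale = np.where(self.max_seen > 0, self.max_seen, 1.0)
        return x / scale
```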

I haven't had much luck finding a pre-existing algorithm for this, but perhaps I'm just not looking in the right places. Does anyone know of one? Or have any ideas? I saw one suggestion to use a running mean (probably calculated by Welford's algorithm), but if I did that, then multiple readings of the same value wouldn't show up as being the same, which seems like a pretty big problem, unless I'm missing something. Any thoughts are appreciated! Thanks!
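For reference, the running mean/variance I was considering via Welford's algorithm would look something like this (a sketch; the class name is illustrative):

```python
import numpy as np

# Welford's online algorithm: per-sensor running mean and variance,
# updated one data vector at a time.
class WelfordStats:
    def __init__(self, n_sensors):
        self.n = 0
        self.mean = np.zeros(n_sensors)
        self.m2 = np.zeros(n_sensors)   # running sum of squared deviations

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else np.zeros_like(self.m2)
```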

Best Answer

From your question, I understand that you are looking to:

  1. Find a way that normalizes the data contribution from each sensor.
  2. See if the new data point is very different from previous points.

Here is where I would start:

1. For your first question: removing the mean and whitening is what you are looking for. A whitening transform ensures that your features are all in the same dynamic range.

I will be making some simplifying assumptions which may not be perfectly relevant to your data, but which are well suited as a starting point to build upon.

Assuming that your data is unimodal, that is, it has a single pronounced mean, I would begin by subtracting the mean of the data and performing a whitening transform (probably PCA, maybe ZCA, depending on your data).

If you want to do this in real time, I would keep a moving window of samples and perform the whitening on that window. Make sure that you have enough samples for your whitening to be accurate (whitening needs the covariance matrix to be invertible, and for that you need more temporal samples than you have sensors).
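As a rough sketch of what that could look like (the window size, the epsilon regularizer, and the class name below are my own assumptions, not anything prescribed):

```python
import numpy as np
from collections import deque

WINDOW_SIZE = 500   # illustrative; must exceed the number of sensors
EPS = 1e-6          # keeps tiny eigenvalues from blowing up the transform

# Mean removal + PCA whitening computed on a sliding window of recent samples.
class MovingWhitener:
    def __init__(self, n_sensors):
        self.window = deque(maxlen=WINDOW_SIZE)
        self.n_sensors = n_sensors

    def update(self, x):
        """Add one data vector; return its whitened version (None during warm-up)."""
        x = np.asarray(x, dtype=float)
        self.window.append(x)
        if len(self.window) <= self.n_sensors:
            return None  # covariance not yet invertible
        data = np.vstack(self.window)            # window_size x n_sensors
        mean = data.mean(axis=0)
        cov = np.cov(data, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)   # covariance is symmetric
        # PCA whitening: project onto the eigenbasis, scale to unit variance
        w = eigvecs / np.sqrt(eigvals + EPS)
        return (x - mean) @ w
```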

Now, if your data is not unimodal, I would probably cluster the data to see where the modes reside. At the most basic level, for each new point arriving, I would associate it with the proper cluster and proceed from there.
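As a rough illustration, assigning an incoming point to the nearest of some previously computed cluster centroids could be as simple as this (the centroids would come from whatever clustering you run offline, e.g. k-means; the function name is illustrative):

```python
import numpy as np

def assign_cluster(x, centroids):
    """Return the index of the centroid closest to x (Euclidean distance)."""
    x = np.asarray(x, dtype=float)
    distances = np.linalg.norm(np.asarray(centroids) - x, axis=1)
    return int(np.argmin(distances))
```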

2. To measure distance from past points effectively, I would use the Mahalanobis distance. In actuality, the Mahalanobis distance is essentially the Euclidean distance in the whitened space.
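For example, a direct computation against the statistics of your past data might look like this (a sketch; the function name is illustrative, and `scipy.spatial.distance.mahalanobis` offers the same calculation if you already have the inverse covariance):

```python
import numpy as np

def mahalanobis(x, past_data):
    """Mahalanobis distance from x to the mean of past_data."""
    x = np.asarray(x, dtype=float)
    past = np.asarray(past_data, dtype=float)       # n_samples x n_sensors
    mean = past.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(past, rowvar=False))
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))
```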

In summary, please read about whitening and the Mahalanobis distance; I think these will point you in the direction you seek.
