I am building a real-time system for processing and aggregating somewhat sparse, irregular survey measurements (values range from 0 to 100; typically on the order of 20-100 measurements). I am looking for a way to remove noise and smooth the data so our end users can see the overall trend. The main constraint is that the smoothing must happen online: past measurements cannot change, since users should be able to reference historical data. Are there any good algorithms for this? I was considering a simple moving average, but it tends to be quite volatile whenever an outlier enters or exits the averaging window. Thanks so much!
Time-Series – Online Smoothing Algorithm for Sparse Data
algorithms, normalization, smoothing, time-series
Related Solutions
First up, the requirements for compression and analysis/presentation are not necessarily the same -- indeed, for analysis you might want to keep all the raw data and have the ability to slice and dice it in various ways. And what works best for you will depend very much on what you want to get out of it. But there are a number of standard tricks that you could try:
- Use differences rather than raw data
- Use thresholding to remove low-level noise. (Combine with differencing to ignore small changes.)
- Use variance over some time window rather than average, to capture activity level rather than movement
- Change the time base from fixed intervals to variable-length runs: accumulate each sequence of changes satisfying some criterion (eg, differences in the same direction, up to some threshold) into a single data point
- Transform data from real values to ordinal values (eg, low, medium, high); you could also do this over time bins rather than individual samples -- eg, activity level for each 5-minute stretch
- Use an appropriate convolution kernel* to smooth more subtly than your moving average or pick out features of interest such as sharp changes.
- Use an FFT library to calculate a power spectrum
The last may be a bit expensive for your purposes, but would probably give you some very useful presentation options, in terms of "sleep rhythms" and such. (I know next to nothing about Android, but it's conceivable that some/many/all handsets might have built in DSP hardware that you can take advantage of.)
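To make the convolution-kernel suggestion concrete, here is a minimal NumPy sketch (the sample signal and the kernel choices are mine, purely for illustration):

```python
import numpy as np

def smooth_with_kernel(x, kernel):
    """Convolve a signal with a normalized smoothing kernel."""
    kernel = np.asarray(kernel, dtype=float)
    kernel = kernel / kernel.sum()          # normalize so output stays on the same scale
    return np.convolve(x, kernel, mode="same")

# A triangular [1, 2, 1] kernel weights the center point more heavily than a
# boxcar (plain moving average) of the same width, so it smooths noise while
# blurring genuine features somewhat less.
signal = np.array([50, 52, 51, 95, 53, 52, 50, 49, 51, 50], dtype=float)
boxcar = smooth_with_kernel(signal, np.ones(3))
triangular = smooth_with_kernel(signal, [1, 2, 1])
```

Note that the same helper also covers the moving average mentioned in the question: a boxcar kernel of all ones is exactly that.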
* Given how central convolution is to digital signal processing, it's surprisingly difficult to find an accessible intro online. Or at least in 3 minutes of googling. Suggestions welcome!
The simplest algorithm is the median filter. You can find a C++ implementation in the R package robfilter. That implementation also includes an 'online' version that uses only past data and implements some algorithmic shortcuts.
Of course you will still have to set the "width" argument yourself, but that is the trade-off of asking for a simple algorithm (this package also contains more sophisticated smoothing algorithms).
The median filter is essentially a rolling-window median, so it inherits the good behaviour of the median: insensitivity to outliers and non-parametric interpretability.
Applied to the dataset you posted, the median filter yields the smoothed series produced by the following code:
library("robfilter")                         # provides med.filter()
a1 <- read.table("sodat.txt", header=TRUE)
# width sets the window size; online=TRUE restricts each median to past data only
d1 <- med.filter(a1[,2], width=10, online=TRUE)
plot(d1)
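For readers who don't use R, the causal behaviour of med.filter(..., online=TRUE) can be sketched in plain Python (the function name and window handling here are mine, not robfilter's, which also uses algorithmic shortcuts this naive version skips):

```python
import statistics

def online_median_filter(data, width=10):
    """Causal rolling median: each output uses only the current and earlier
    points, so already-published values never change as new data arrive."""
    out = []
    for i in range(len(data)):
        window = data[max(0, i - width + 1): i + 1]  # the last `width` points, past only
        out.append(statistics.median(window))
    return out
```

With width=3, a single outlier is removed entirely: online_median_filter([1, 1, 1, 100, 1, 1], width=3) returns [1, 1, 1, 1, 1, 1].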
Best Answer
Try a median filter: for each data point, take a window about that data point of a size you will need to experiment with, and let the new value for that data point be the median of all the points in the window. This is very robust to outliers, since medians are robust to outliers. It also has the virtue that, if your windows always include an odd number of data points, the result consists entirely of actual data points. They may shift about slightly from left to right or vice versa, but never farther than your window size. I think you will find that this filter crunches noise down very effectively.
One note: on the edges, you will need to have lop-sided windows, most likely. That shouldn't be too much of a problem - I'd still recommend having some algorithm ensure that every window has an odd number of data points in it.
One implementation detail is important: when you start replacing data points with medians of data points, don't allow the new median values to influence subsequent windows. Keep calculating medians based on the original data only.
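Putting those points together (odd windows, lopsided edges, medians taken over the original data only), a sketch might look like this; the exact trimming rule for edge windows is my own choice, one of several reasonable options:

```python
import statistics

def median_smooth(data, half_width=2):
    """Centered median filter computed over the ORIGINAL data only (smoothed
    values never feed back into later windows). Edge windows are lopsided but
    trimmed so every window holds an odd number of points, which guarantees
    each output value is an actual data point."""
    n = len(data)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        window = data[lo:hi]
        if len(window) % 2 == 0:
            # drop the point farthest from the center to keep the count odd
            if i - lo > hi - 1 - i:
                window = window[1:]
            else:
                window = window[:-1]
        out.append(statistics.median(window))
    return out
```

Because every window has odd length, the median is always one of the original measurements, matching the "result consists entirely of actual data points" property described above.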