Solved – How to reduce the number of data points in a series

data visualization

I haven't studied statistics for over 10 years (and even then it was just a basic course), so maybe my question is a bit hard to understand.

Anyway, what I want to do is reduce the number of data points in a series. The x-axis is the number of milliseconds since the start of the measurement and the y-axis is the reading at that point.

Often there are thousands of data points, but I might only need a few hundred. So my question is: how do I accurately reduce the number of data points?

What is the process called? (So I can google it.)
Are there any preferred algorithms? (I will implement it in C#.)

I hope you can give me some pointers. Sorry for my lack of proper terminology.


Edit: Here are some more details:

The raw data I have is heart rate data, in the form of the number of milliseconds since the last beat. Before plotting the data I calculate the number of milliseconds since the first sample, and the bpm (beats per minute) at each data point (60000/timesincelastbeat).
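In code, that conversion looks roughly like this (a minimal C# sketch; the class, method, and tuple names are just placeholders, not from the original data):

```csharp
using System.Collections.Generic;

public static class HeartRateSeries
{
    // Turn raw inter-beat intervals (ms since the last beat) into
    // (ms since the first sample, bpm) pairs, with bpm = 60000 / interval.
    public static List<(double TimeMs, double Bpm)> ToBpmSeries(IEnumerable<double> intervalsMs)
    {
        var series = new List<(double TimeMs, double Bpm)>();
        double elapsedMs = 0;
        foreach (var intervalMs in intervalsMs)
        {
            elapsedMs += intervalMs;                       // ms since the first sample
            series.Add((elapsedMs, 60000.0 / intervalMs)); // 60000 ms per minute
        }
        return series;
    }
}
```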

I want to visualize the data, i.e. plot it in a line graph, and reduce the number of points in the graph from thousands to a few hundred.

One option would be to calculate the average bpm for every second in the series, or maybe every 5 seconds or so. That would be quite easy if I knew I had at least one sample in each of those periods (seconds or 5-second intervals).
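Something like the following sketch is what I have in mind, assuming intervals without any sample are simply skipped rather than interpolated (names are placeholders):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Downsampling
{
    // Average the bpm values inside fixed time buckets (e.g. 1000 or 5000 ms wide).
    // Buckets that happen to contain no sample are skipped, so intervals without
    // a beat do not produce a point at all.
    public static List<(double TimeMs, double Bpm)> AverageByBucket(
        IEnumerable<(double TimeMs, double Bpm)> series, double bucketMs)
    {
        return series
            .GroupBy(s => (long)(s.TimeMs / bucketMs))    // index of the bucket
            .OrderBy(g => g.Key)
            .Select(g => ((g.Key + 0.5) * bucketMs,       // bucket midpoint as x
                          g.Average(s => s.Bpm)))         // mean bpm as y
            .ToList();
    }
}
```

Calling `AverageByBucket(series, 5000)` would then give one averaged point per 5-second interval.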

Best Answer

You have two problems: too many points and how to smooth over the remaining points.

Thinning your sample

If you have too many observations arriving in real time, you could always use simple random sampling to thin your sample. Note that for this to work well, the number of points needs to be very large.

Suppose you have N points and you only want n of them. Then generate n random numbers from a discrete uniform U(0, N-1) distribution; these are the indices of the points you keep.
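A minimal C# sketch of that step (I draw the n indices without replacement here so no point is picked twice; that detail, and the names, are my own):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RandomThinning
{
    // Keep n of the N points by drawing n distinct indices uniformly from 0..N-1.
    public static List<T> Thin<T>(IReadOnlyList<T> points, int n, Random rng)
    {
        var indices = new HashSet<int>();
        while (indices.Count < Math.Min(n, points.Count))
            indices.Add(rng.Next(points.Count));   // discrete uniform U(0, N-1)
        return indices.OrderBy(i => i)             // keep the points in time order
                      .Select(i => points[i])
                      .ToList();
    }
}
```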

If you want to do this sequentially, i.e. decide for each arriving point whether to keep it or not, then just accept a point with probability p. So if you set p = 0.01, you would accept (on average) 1 point in a hundred.
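In code this is just a Bernoulli draw per point, something like (names are placeholders):

```csharp
using System;

public static class SequentialThinning
{
    // Keep each incoming point independently with probability p.
    // With p = 0.01, roughly 1 point in 100 is kept.
    public static bool Keep(Random rng, double p) => rng.NextDouble() < p;
}
```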

If your data is unevenly spread and you only want to thin dense regions of points, then just make your thinning function a bit more sophisticated. For example, instead of p, what about:

$$1-p \exp(-\lambda t)$$

where $\lambda$ is a positive number and $t$ is the time since the last observation. If the time between two points is large, i.e. large $t$, the probability of accepting a point will be close to one. Conversely, if two points are close together, the probability of accepting a point will be close to $1-p$.

You will need to experiment with values of $\lambda$ and $p$.
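A sketch of that rule in C#, taking $t$ to be the time since the last point that was kept (one reasonable reading of "time since the last observation"); the names are placeholders:

```csharp
using System;
using System.Collections.Generic;

public static class AdaptiveThinning
{
    // Accept each point with probability 1 - p * exp(-lambda * t), where t is
    // the time (ms) since the last point that was kept. Large gaps give an
    // acceptance probability near 1; dense runs give one near 1 - p.
    public static List<(double TimeMs, double Bpm)> Thin(
        IEnumerable<(double TimeMs, double Bpm)> series, double p, double lambda, Random rng)
    {
        var kept = new List<(double TimeMs, double Bpm)>();
        double lastKeptMs = double.NegativeInfinity;   // first point is always accepted
        foreach (var point in series)
        {
            double t = point.TimeMs - lastKeptMs;
            double acceptProb = 1.0 - p * Math.Exp(-lambda * t);
            if (rng.NextDouble() < acceptProb)
            {
                kept.Add(point);
                lastKeptMs = point.TimeMs;
            }
        }
        return kept;
    }
}
```

Since $t$ is measured in milliseconds in this sketch, useful values of $\lambda$ will be correspondingly small (roughly 1/1000 per ms if typical gaps are on the scale of seconds).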

Smoothing

Possibly something like a simple moving-average scheme, or something more advanced like a kernel smoother (as others have suggested). You will need to be careful not to smooth too much, since I assume a sudden drop should be picked up very quickly in your scenario.
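For example, a minimal sketch of a centred simple moving average (the window parameter, names, and boundary handling are my own choices):

```csharp
using System;
using System.Collections.Generic;

public static class Smoothing
{
    // Centred simple moving average over the bpm values; halfWindow is the number
    // of neighbours taken on each side. Keep it small so that sudden drops in
    // heart rate are not smoothed away.
    public static List<(double TimeMs, double Bpm)> MovingAverage(
        IReadOnlyList<(double TimeMs, double Bpm)> series, int halfWindow)
    {
        var smoothed = new List<(double TimeMs, double Bpm)>(series.Count);
        for (int i = 0; i < series.Count; i++)
        {
            int lo = Math.Max(0, i - halfWindow);
            int hi = Math.Min(series.Count - 1, i + halfWindow);
            double sum = 0;
            for (int j = lo; j <= hi; j++)
                sum += series[j].Bpm;
            smoothed.Add((series[i].TimeMs, sum / (hi - lo + 1)));
        }
        return smoothed;
    }
}
```

With `halfWindow = 2`, for instance, each point becomes the average of at most five consecutive readings.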

There should be C# libraries available for this sort of stuff.

Conclusion

Thin if necessary, then smooth.
