Solved – How to reduce the number of data points in a series

data visualization

I haven't studied statistics for over 10 years (and even then it was just a basic course), so maybe my question is a bit hard to understand.

Anyway, what I want to do is reduce the number of data points in a series. The x-axis is the number of milliseconds since the start of the measurement and the y-axis is the reading at that point.

Often there are thousands of data points, but I might only need a few hundred. So my question is: how do I accurately reduce the number of data points?

What is the process called? (So I can google it.)
Are there any preferred algorithms? (I will implement it in C#.)

I hope you can give me some pointers. Sorry for my lack of proper terminology.


Edit: Here are some more details:

The raw data I have is heart rate data, in the form of the number of milliseconds since the last beat. Before plotting the data I calculate the number of milliseconds since the first sample, and the bpm (beats per minute) at each data point (60000/timesincelastbeat).
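In code, that conversion looks roughly like this (a minimal C# sketch; the class, method, and tuple names are just placeholders, not from the original data):

```csharp
using System.Collections.Generic;

public static class HeartRateSeries
{
    // Turn raw inter-beat intervals (ms since the last beat) into
    // (ms since the first sample, bpm) pairs, with bpm = 60000 / interval.
    public static List<(double TimeMs, double Bpm)> ToBpmSeries(IEnumerable<double> intervalsMs)
    {
        var series = new List<(double TimeMs, double Bpm)>();
        double elapsedMs = 0;
        foreach (var intervalMs in intervalsMs)
        {
            elapsedMs += intervalMs;                       // ms since the first sample
            series.Add((elapsedMs, 60000.0 / intervalMs)); // 60000 ms per minute
        }
        return series;
    }
}
```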

I want to visualize the data, i.e. plot it in a line graph, and reduce the number of points in the graph from thousands to a few hundred.

One option would be to calculate the average bpm for every second in the series, or maybe every 5 seconds or so. That would be quite easy if I knew I had at least one sample in each of those periods (seconds or 5-second intervals).
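Something like the following sketch is what I have in mind, assuming intervals without any sample are simply skipped rather than interpolated (names are placeholders):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Downsampling
{
    // Average the bpm values inside fixed time buckets (e.g. 1000 or 5000 ms wide).
    // Buckets that happen to contain no sample are skipped, so intervals without
    // a beat do not produce a point at all.
    public static List<(double TimeMs, double Bpm)> AverageByBucket(
        IEnumerable<(double TimeMs, double Bpm)> series, double bucketMs)
    {
        return series
            .GroupBy(s => (long)(s.TimeMs / bucketMs))    // index of the bucket
            .OrderBy(g => g.Key)
            .Select(g => ((g.Key + 0.5) * bucketMs,       // bucket midpoint as x
                          g.Average(s => s.Bpm)))         // mean bpm as y
            .ToList();
    }
}
```

Calling `AverageByBucket(series, 5000)` would then give one averaged point per 5-second interval.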

Best Answer

You have two problems: too many points and how to smooth over the remaining points.

Thinning your sample

If you have too many observations arriving in real time, you could always use simple random sampling to thin your sample. Note that for this to work well, the number of points needs to be very large.

Suppose you have N points and you only want n of them. Then generate n random numbers from a discrete uniform U(0, N-1) distribution; these are the indices of the points you keep.
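A minimal C# sketch of that step (I draw the n indices without replacement here so no point is picked twice; that detail, and the names, are my own):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RandomThinning
{
    // Keep n of the N points by drawing n distinct indices uniformly from 0..N-1.
    public static List<T> Thin<T>(IReadOnlyList<T> points, int n, Random rng)
    {
        var indices = new HashSet<int>();
        while (indices.Count < Math.Min(n, points.Count))
            indices.Add(rng.Next(points.Count));   // discrete uniform U(0, N-1)
        return indices.OrderBy(i => i)             // keep the points in time order
                      .Select(i => points[i])
                      .ToList();
    }
}
```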

If you want to do this sequentially, i.e. decide for each arriving point whether to keep it or not, then just accept a point with probability p. So if you set p = 0.01, you would accept (on average) 1 point in a hundred.
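In code this is just a Bernoulli draw per point, something like (names are placeholders):

```csharp
using System;

public static class SequentialThinning
{
    // Keep each incoming point independently with probability p.
    // With p = 0.01, roughly 1 point in 100 is kept.
    public static bool Keep(Random rng, double p) => rng.NextDouble() < p;
}
```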

If your data is unevenly spread and you only want to thin dense regions of points, then just make your thinning function a bit more sophisticated. For example, instead of p, what about:

$$1-p \exp(-\lambda t)$$

where $\lambda$ is a positive number and $t$ is the time since the last observation. If the time between two points is large, i.e. large $t$, the probability of accepting a point will be close to one. Conversely, if two points are close together, the probability of accepting a point will be close to $1-p$.

You will need to experiment with values of $\lambda$ and $p$.
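A sketch of that rule in C#, taking $t$ to be the time since the last point that was kept (one reasonable reading of "time since the last observation"); the names are placeholders:

```csharp
using System;
using System.Collections.Generic;

public static class AdaptiveThinning
{
    // Accept each point with probability 1 - p * exp(-lambda * t), where t is
    // the time (ms) since the last point that was kept. Large gaps give an
    // acceptance probability near 1; dense runs give one near 1 - p.
    public static List<(double TimeMs, double Bpm)> Thin(
        IEnumerable<(double TimeMs, double Bpm)> series, double p, double lambda, Random rng)
    {
        var kept = new List<(double TimeMs, double Bpm)>();
        double lastKeptMs = double.NegativeInfinity;   // first point is always accepted
        foreach (var point in series)
        {
            double t = point.TimeMs - lastKeptMs;
            double acceptProb = 1.0 - p * Math.Exp(-lambda * t);
            if (rng.NextDouble() < acceptProb)
            {
                kept.Add(point);
                lastKeptMs = point.TimeMs;
            }
        }
        return kept;
    }
}
```

Since $t$ is measured in milliseconds in this sketch, useful values of $\lambda$ will be correspondingly small (roughly 1/1000 per ms if typical gaps are on the scale of seconds).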

Smoothing

Possibly something like a simple moving-average scheme, or something more advanced like a kernel smoother (as others have suggested). You will need to be careful not to smooth too much, since I assume a sudden drop should be picked up very quickly in your scenario.
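For example, a minimal sketch of a centred simple moving average (the window parameter, names, and boundary handling are my own choices):

```csharp
using System;
using System.Collections.Generic;

public static class Smoothing
{
    // Centred simple moving average over the bpm values; halfWindow is the number
    // of neighbours taken on each side. Keep it small so that sudden drops in
    // heart rate are not smoothed away.
    public static List<(double TimeMs, double Bpm)> MovingAverage(
        IReadOnlyList<(double TimeMs, double Bpm)> series, int halfWindow)
    {
        var smoothed = new List<(double TimeMs, double Bpm)>(series.Count);
        for (int i = 0; i < series.Count; i++)
        {
            int lo = Math.Max(0, i - halfWindow);
            int hi = Math.Min(series.Count - 1, i + halfWindow);
            double sum = 0;
            for (int j = lo; j <= hi; j++)
                sum += series[j].Bpm;
            smoothed.Add((series[i].TimeMs, sum / (hi - lo + 1)));
        }
        return smoothed;
    }
}
```

With `halfWindow = 2`, for instance, each point becomes the average of at most five consecutive readings.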

There should be C# libraries available for this sort of stuff.

Conclusion

Thin if necessary, then smooth.
