Solved – RNNs for Sparse Time Series Data

I have time series data that looks something like this:

    Series 1                         Series 2
    ╔══════╦════════╦════════╗       ╔══════╦════════╦════════╦════════╗
    ║ Time ║ Value1 ║ Value2 ║       ║ Time ║ Value1 ║ Value2 ║ Value3 ║
    ╠══════╬════════╬════════╣       ╠══════╬════════╬════════╬════════╣
    ║ 3:30 ║ 10     ║ 100    ║       ║ 3:32 ║ 12     ║ 56     ║ 34     ║
    ║ 3:31 ║ 11     ║ 50     ║       ║ 3:33 ║ 15     ║ 200    ║ 89     ║
    ║ 3:36 ║ 12     ║ 80     ║       ║ 3:35 ║ 15     ║ 90     ║ 66     ║
    ║ 3:50 ║ 11     ║ 80     ║       ║ 3:38 ║ 13     ║ 85     ║ 45     ║
    ║ 3:50 ║ 12     ║ 60     ║       ║ 3:45 ║ 14     ║ 65     ║ 121    ║
    ╚══════╩════════╩════════╝       ╚══════╩════════╩════════╩════════╝

The important features are:

Data doesn't exist for every minute.
There can be multiple observations per minute.

I want to use an RNN (in Keras specifically) to predict the next observation of Series 1 based on the observations of Series 1 and Series 2 up until that moment. The problem is that in all the RNNs I've found, the time interval between datapoints is constant. For example, in an RNN that predicts the next character in a sentence, data is fed in 20 characters at a time, where the "time interval" between characters is constant.

So how can I adapt this to the data I have? My options seem to be:

Fill in the data for every minute, and average the observations if several of them appear in one minute. Then I could just use the data like the input to a character prediction RNN. The above data would become

Series 1
╔══════╦════════╦════════╗
║ Time ║ Value1 ║ Value2 ║
╠══════╬════════╬════════╣
║ 3:30 ║ 10     ║ 100    ║
║ 3:31 ║ 11     ║ 50     ║
║ 3:32 ║ 11     ║ 50     ║
║ 3:33 ║ 11     ║ 50     ║
║ 3:34 ║ 11     ║ 50     ║
║ 3:35 ║ 11     ║ 50     ║
║ 3:36 ║ 12     ║ 80     ║
║ 3:37 ║ 12     ║ 80     ║
║ ...  ║        ║        ║
║ 3:50 ║ 11.5   ║ 70     ║
╚══════╩════════╩════════╝

This approach isn't ideal, because a fair amount of information is contained in the fact that multiple observations can occur in one minute. It would also substantially increase the data size and training time.
Input the "Time" variable into the RNN as well, together with a "time since the last observation" variable. I'd also have to combine the two series into one, together with a flag (1 or 2, as a one-hot encoding) indicating which of the two series it came from. Then I can just feed the data in 20 (or whatever number) observations at a time, instead of 20 minutes at a time. I'm not sure that this would work very well, and I can't find anything like this on the internet. Is this a valid approach?
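For what it's worth, the second option can be sketched in plain Python as below. This is only a sketch under assumptions that are not in the question: the helper names `merge_events` and `windows` are made up for illustration, Series 1's missing Value3 is padded with 0.0, and the window length of 4 is chosen only because the sample data has ten rows (the question suggests 20).

```python
from datetime import datetime

def parse(t):
    # Times in the question are given as H:MM; the date part is irrelevant here.
    return datetime.strptime(t, "%H:%M")

# Observations copied from the question: (time, value1, value2[, value3]).
series1 = [("3:30", 10, 100), ("3:31", 11, 50), ("3:36", 12, 80),
           ("3:50", 11, 80), ("3:50", 12, 60)]
series2 = [("3:32", 12, 56, 34), ("3:33", 15, 200, 89), ("3:35", 15, 90, 66),
           ("3:38", 13, 85, 45), ("3:45", 14, 65, 121)]

def merge_events(s1, s2, n_values=3):
    """Interleave both series by time and build one feature row per observation.

    Row layout: [minutes since previous event, is_series1, is_series2,
                 value1, value2, value3]. Missing values are padded with 0.0
    (an assumption; a separate mask feature may work better).
    """
    events = [(parse(t), 0, vals) for t, *vals in s1]
    events += [(parse(t), 1, vals) for t, *vals in s2]
    events.sort(key=lambda e: e[0])  # stable sort keeps same-minute order

    rows, prev = [], None
    for when, source, vals in events:
        delta = 0.0 if prev is None else (when - prev).total_seconds() / 60.0
        flag = [1.0, 0.0] if source == 0 else [0.0, 1.0]
        padded = [float(v) for v in vals] + [0.0] * (n_values - len(vals))
        rows.append([delta] + flag + padded)
        prev = when
    return rows

def windows(rows, length):
    """Sliding windows of `length` consecutive observations."""
    return [rows[i:i + length] for i in range(len(rows) - length + 1)]

rows = merge_events(series1, series2)
batches = windows(rows, 4)
```

Here each window covers a fixed number of observations rather than a fixed number of minutes, and the elapsed time travels along as an ordinary feature. The nested list `batches` has the (samples, timesteps, features) layout that Keras recurrent layers expect once it is converted to an array, for example with `numpy.array(batches)`.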

Best Answer

This depends a bit on whether interpolating between points to infill missing data makes sense, i.e. whether neighbouring points are highly correlated or whether each reading is essentially random. If they are correlated, then filling them in at even time intervals is probably the way to go. If the values are random and independent of each other, then this would not work.
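As a rough illustration of the even-interval fill, the sketch below averages observations that share a minute and carries the last value forward into empty minutes, which is what the filled-in table in the question shows. The function name `regular_grid` and the choice of forward-fill (rather than, say, linear interpolation) are assumptions made for this example.

```python
from collections import defaultdict

# Series 1 from the question: (time, value1, value2).
series1 = [("3:30", 10, 100), ("3:31", 11, 50), ("3:36", 12, 80),
           ("3:50", 11, 80), ("3:50", 12, 60)]

def to_minutes(t):
    # "3:36" -> 216 minutes since midnight.
    h, m = t.split(":")
    return int(h) * 60 + int(m)

def regular_grid(obs, step=1):
    """Average observations per `step`-minute bin, then forward-fill empty bins."""
    bins = defaultdict(list)
    for t, *vals in obs:
        bins[to_minutes(t) // step * step].append(vals)

    grid, last = [], None
    for minute in range(min(bins), max(bins) + 1, step):
        if minute in bins:
            group = bins[minute]
            # Column-wise mean over all observations that fell into this bin.
            last = [sum(col) / len(group) for col in zip(*group)]
        grid.append((f"{minute // 60}:{minute % 60:02d}", last))
    return grid

grid = regular_grid(series1)
```

With pandas available, `DataFrame.resample("1min").mean().ffill()` on a datetime-indexed frame does much the same job.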

The second method would (probably) not work. One issue is that the time itself does not really impact the values directly, so you would have an input without predictive power in your RNN, which tends to produce poor results. The other large issue is getting your RNN to produce time points in the future; ANNs typically don't produce values outside of the observed range of input data.

If the interpolation method doesn't work for your time series data, I might recommend resampling at a coarser frequency, for example every 5 or 10 minutes, and grabbing the closest point you can at each of those intervals (or averaging if needed). It would lose some information but would avoid potentially interpolating where it doesn't make sense.
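A minimal sketch of the "closest point" idea follows. It assumes a 5-minute grid, since the sample data is timestamped to the minute, and the name `nearest_on_grid` is hypothetical. Ties go to the earlier observation.

```python
# Series 2 from the question: (time, value1, value2, value3).
series2 = [("3:32", 12, 56, 34), ("3:33", 15, 200, 89), ("3:35", 15, 90, 66),
           ("3:38", 13, 85, 45), ("3:45", 14, 65, 121)]

def to_minutes(t):
    # "3:32" -> 212 minutes since midnight.
    h, m = t.split(":")
    return int(h) * 60 + int(m)

def nearest_on_grid(obs, step=5):
    """For each grid time, keep the observation that is closest in time."""
    times = [to_minutes(t) for t, *_ in obs]
    start = min(times) // step * step  # align the grid to a multiple of `step`
    out = []
    for g in range(start, max(times) + 1, step):
        # Index of the observation with the smallest time distance to g.
        i = min(range(len(obs)), key=lambda k: abs(times[k] - g))
        out.append((f"{g // 60}:{g % 60:02d}", list(obs[i][1:])))
    return out

coarse = nearest_on_grid(series2)
```

Note that some observations are dropped (3:33 here), which is the loss of information mentioned above, and that a grid point can borrow a value from slightly in the future, so for forecasting you may prefer to restrict the search to observations at or before each grid time.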

In short, this is a non-trivial question, and as far as I know there is no standard or universal way of handling variable time increments, but hopefully these are some helpful thoughts.
