Solved – Time-series logistic regression

binary-data, logistic, time-series

I'm trying to train a model with data that looks like this:

[Image: binary time-series data]

There are theoretical reasons to believe that the Y row can be predicted using the X rows, with a window of about 2 seconds. Note that this plot is just one example. There are around 100 time series like that in my dataset for training/validation.

I have a basic knowledge in ML, but never done anything with time series before, so I'm looking for some advice.
I know that I can rearrange the data so that each sample will be a long vector containing window_size * features values, and pass it through a logistic regression, but I'm not sure that this is the best approach for a few reasons:

  • The large feature vector (say ~200 values) might cause overfitting, wouldn't it?
  • I've seen examples that use LSTMs and CRFs for time-series prediction, so there are probably good reasons to use specialized algorithms for these tasks, but I don't know them well enough to judge. Would you suggest investing my time in these, or sticking with a simpler logistic regression model for this type of data?
  • Using logistic regression won't really take time into account. For example, generating predictions for a time series with a different sampling period would be impossible without resampling. Do models built specifically for time series take this into account automatically?
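To be concrete, here is roughly the rearrangement I mean; the array shapes and names below are made up for illustration:

```python
import numpy as np

def make_windows(X, y, window_size):
    """Flatten a sliding window of the multivariate series X into one
    feature vector per sample; y[t] is predicted from X[t-window_size+1 .. t]."""
    n_steps, n_features = X.shape
    samples, targets = [], []
    for t in range(window_size - 1, n_steps):
        samples.append(X[t - window_size + 1 : t + 1].ravel())
        targets.append(y[t])
    return np.array(samples), np.array(targets)

# toy example: 100 time steps, 4 X rows, a 2-second window at 25 Hz = 50 steps
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)
Xw, yw = make_windows(X, y, window_size=50)
print(Xw.shape)  # (51, 200) -- ~200 values per sample, hence my worry below
```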

Also, this looks like a general problem to me, one that probably has a general solution. So, pointers to Python libraries built to process such time series directly (without my having to rearrange the data first) would be great!

Best Answer

Keras is a high-level API that allows implementation of recurrent neural networks (RNNs). RNNs like LSTMs might be good models for your data, depending on how much there is. Here is a good introductory tutorial.
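As a sketch only (the input shape and layer sizes below are placeholders, not tuned recommendations), a minimal Keras LSTM for binary prediction could look like this:

```python
# Minimal Keras LSTM sketch -- assumes samples shaped (window_steps, n_features);
# the sizes here are placeholders, not recommendations.
import numpy as np
from tensorflow import keras

window_steps, n_features = 50, 4
model = keras.Sequential([
    keras.layers.Input(shape=(window_steps, n_features)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1, activation="sigmoid"),  # probability of the binary Y
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# toy batch just to show the expected shapes (8 windows)
X = np.random.normal(size=(8, window_steps, n_features)).astype("float32")
preds = model.predict(X, verbose=0)
print(preds.shape)  # one probability per window
```

You would train it with `model.fit(...)` on windows built from your ~100 series.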

What you suggest with the time window and logistic regression may perform better, or it may perform much worse. Sometimes aggregating past time steps is very effective, sometimes not so much. In my experience, it's more effective when there is less data and the function from previous time steps to the response variable is too complicated to learn from that small amount of data; aggregating lets you guide the learning with some domain expertise. (For me, this once outperformed a self-written vanilla RNN, though it may not have outperformed an LSTM or GRU.) With a large amount of data, RNNs should beat this method in predictive performance. Another option is a hidden Markov model. One huge benefit of the time-window method, however, is that the aggregated features are interpretable. I would try both methods and see which is best.

In terms of overfitting, you can use shrinkage with logistic regression, or dropout, shrinkage, or fewer hidden units with an RNN. If you make sure to check error on a validation set, you can be vigilant about preventing overfitting.
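For instance, in scikit-learn the shrinkage strength of logistic regression is controlled by the `C` parameter (smaller `C` means a stronger L2 penalty). The data below is synthetic, just to stand in for your flattened ~200-value window vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for flattened window vectors (~200 features)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 200))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# L2 shrinkage: smaller C = stronger penalty; watch validation accuracy
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    print(C, round(clf.score(X_val, y_val), 3))
```

Picking `C` by validation-set (or cross-validated) accuracy is exactly the vigilance mentioned above.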

You would have to resample for the RNN to take time into account: RNNs naturally process sequences, but it is up to you to make sure the events in the sequence are evenly spaced. Pandas has resampling functionality that may come in handy, although it's also possible that processing raw sequences rather than resampled ones gives better results; again, it wouldn't hurt to try both ways.
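A minimal sketch of that pandas resampling, with made-up timestamps and values:

```python
import pandas as pd

# irregularly sampled series (timestamps and values are made up)
ts = pd.Series(
    [0.0, 1.0, 0.5, 2.0],
    index=pd.to_datetime(
        ["2024-01-01 00:00:00.0", "2024-01-01 00:00:00.7",
         "2024-01-01 00:00:01.9", "2024-01-01 00:00:03.2"]
    ),
)

# resample onto an even 1-second grid, averaging within each bin,
# then interpolate bins that received no observations
even = ts.resample("1s").mean().interpolate()
print(even)
```

The same idea applies per-column to a DataFrame holding all of your X rows.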

A good general introduction to RNNs (and more) is Andrej Karpathy's blog post. Also, see Chapter 10 of Goodfellow et al.'s Deep Learning textbook.