Time Series – How to Sample Time Series Data for Binary Classification Problem

classification, predictive-models, sampling, time-series

I have tried reviewing questions similar to this problem (ex1, ex2), but most of them do not seem to address how to sample the data (if at all). I will try to outline my problem below:

  1. I have plants that generally die (an imbalanced-data problem), and I want to predict the likelihood that they will survive.
  2. Each day the plants are measured (height, color, petals, etc.); these measurements make up the features of the data, and each day adds a new timestep with its own set of features.
  3. Measurements stop at day 60, and the model makes a prediction at that point. Some features are engineered here, such as the growth rate from day 40 to day 60 (see the sketch after this list).
  4. Although measurements stop there, the plant's state is observed a year later to determine whether or not it survived. The idea is that if the plant was healthy in the first 60 days it is likely to survive, and if not it will likely die.
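
A minimal sketch of the kind of engineered feature mentioned in point 3, assuming a hypothetical long-format table with one row per plant per day and columns plant_id, day, and height:

```python
import pandas as pd

# Hypothetical per-plant daily measurements: one row per plant per day.
daily = pd.DataFrame({
    "plant_id": [1, 1, 2, 2],
    "day":      [40, 60, 40, 60],
    "height":   [3.0, 6.0, 2.5, 3.1],
})

# Pivot so each plant has one column per day, then derive the
# day-60-over-day-40 growth rate used at the prediction point.
heights = daily.pivot(index="plant_id", columns="day", values="height")
features = pd.DataFrame(index=heights.index)
features["growth_rate_60_40"] = heights[60] / heights[40]
print(features)
```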

This has worked well with the logistic regression and GBM models I've built for it, and the sampling was easy because I could take all of the data at the day-60 point in time. But now I'd like to build a new model that can make a prediction for every plant at any point in the first 60 days, and this is where I'm not sure how to sample.

The obvious starting point is that the training and test sets will contain mutually exclusive sets of plants to prevent data leakage. But I'm not sure how, or whether, to sample each plant's data points. No sampling strategy would mean including all 60 timesteps of each training plant in the training dataset. A simple strategy might be randomly sampling a single timestep from each plant.
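
For concreteness, here is a small sketch of the two strategies (assuming a hypothetical long-format table with plant_id, day, height, and a per-plant survived label repeated on every row):

```python
import pandas as pd

# Hypothetical long-format data: one row per plant per day, with the
# plant's survival label repeated on every row.
daily = pd.DataFrame({
    "plant_id": [1, 1, 1, 2, 2, 2],
    "day":      [1, 2, 3, 1, 2, 3],
    "height":   [1.0, 1.5, 2.2, 0.8, 0.9, 1.0],
    "survived": [1, 1, 1, 0, 0, 0],
})

# Strategy A: no sampling -- keep every timestep of every training plant.
all_timesteps = daily

# Strategy B: keep one randomly chosen day per plant.
one_per_plant = daily.groupby("plant_id").sample(n=1, random_state=0)

print(len(all_timesteps), len(one_per_plant))  # 6 rows vs. 2 rows
```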

Here's an example of why I'm considering a sampling strategy at all: some plants start out very healthy in the first 20 days and will be very strong survivors. These I'd definitely want sampled in, because they almost always survive. But many plants will be slow growers, and slow growers can be a mix of survivors and non-survivors. The slow-growing survivors will have positive labels but much weaker data in the first 20 days, and I'm worried that their early timesteps will wash out the signal from the stronger plants.

So I'm wondering what others think about applying a sampling strategy. While writing this out I've convinced myself that one is probably not needed for a tree-based model, which may be able to find appropriate splits in the data. For example, at timestep 20 a plant height of 6" might give a survival probability of 95%, well within threshold, whereas at timestep 20 a plant height of 3" might give a survival probability of 20%; those plants would effectively be ignored until a later timestep when more data is available.
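
Here is a toy illustration of that intuition with made-up numbers: a shallow tree can split on height at a given timestep and assign confident probabilities only where the signal is strong.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up (day, height) observations at timestep 20 with survival labels:
# tall plants almost always survive, short plants are a mixed bag.
X = np.array([
    [20, 6.0], [20, 6.2], [20, 5.8], [20, 6.1],
    [20, 3.0], [20, 2.9], [20, 3.2], [20, 3.1],
])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0])

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# A single split on height separates confident survivors from the rest.
print(tree.predict_proba([[20, 6.0], [20, 3.0]])[:, 1])  # e.g. [1.0, 0.25]
```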

If you think I should use a different model for this, I'm open to suggestions, but I was hoping not to change my architecture too much or make things overly complicated.

Best Answer

As I understand it, you have data for each day of the plant's growth process. My initial idea would be to use a classical regression model and simply introduce a new feature that represents the current day of the sample, along with all the other data from the corresponding day and plant. This would have multiple benefits:

  • you can get 60 data points from each plant, each one representing the plant's state on one day of its development, therefore using all the provided data
  • the information you have is fully included in each sample
  • you can train the model to predict the survivability of a plant based on its characteristics AND based on the day, so you don't need to rely on the 60-day mark

To ensure no data leakage can happen, the plants used in the train and test sets cannot be the same, meaning you need to choose some plants solely for the train set and others solely for the test set.
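
A minimal end-to-end sketch of this idea, assuming a hypothetical long-format table and using scikit-learn's GroupShuffleSplit to keep the train and test plants disjoint:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical long-format data: one row per plant per day, with "day"
# included as an ordinary feature.
daily = pd.DataFrame({
    "plant_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "day":      [1, 2, 1, 2, 1, 2, 1, 2],
    "height":   [1.0, 1.6, 0.7, 0.8, 1.2, 2.0, 0.5, 0.6],
    "survived": [1, 1, 0, 0, 1, 1, 0, 0],
})

X = daily[["day", "height"]]
y = daily["survived"]

# Split by plant so that no plant appears in both train and test sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=daily["plant_id"]))

model = GradientBoostingClassifier(random_state=0)
model.fit(X.iloc[train_idx], y.iloc[train_idx])

# Predicted survival probability for any plant at any observed day.
print(model.predict_proba(X.iloc[test_idx])[:, 1])
```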
