Time Series Classification – How to Format Data for Machine Learning Models

classification, data transformation, machine learning, neural networks, time series

I am working on a project where a physical test over time is conducted to decide whether an object is diagnosed as class $A$ or class $B$. Typically these tests take around 2.5–3 hours, and measurements are recorded at each timestep $t$. A timestep is usually about one second long, so each row holds the feature values at a particular second of the test. Once the test is completed, you decide whether the object is of type $A$ or $B$. Humans usually look at the time-series plot of the test to make this classification – but I am tasked with automating it.

The issue is that each CSV file for a test has approximately 9000 to 11000 rows (2.5 to 3 hours multiplied by 3600 seconds per hour), since the test duration varies. The number of features/columns is fixed. I have $N$ CSV files, each representing the time series for one test done on one sample object (note: each sample object is tested exactly once). So, my question: is there a way to aggregate each sample so that I can have a single dataframe to train my classifier on? Or is there another approach?

To add more clarity: I will have to make predictions on CSV files whose row counts are inconsistent, due to the variance in testing time.

E.g.:

  • csv1.shape = (8751, 1257) –> Prediction: Class $A$
  • csv2.shape = (10321, 1257) –> Prediction: Class $A$
  • csv3.shape = (9978, 1257) –> Prediction: Class $B$
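One aggregation I have considered is collapsing each variable-length recording into a single fixed-width row of per-column summary statistics, then stacking those rows into one dataframe. A minimal sketch (the column names, data, and choice of statistics here are all hypothetical):

```python
import numpy as np
import pandas as pd

def summarize(ts: pd.DataFrame) -> pd.Series:
    """Collapse a (rows, features) test recording into one fixed-length row
    of per-feature summary statistics, regardless of test duration."""
    stats = {}
    for name, agg in [("mean", ts.mean()), ("std", ts.std()),
                      ("min", ts.min()), ("max", ts.max())]:
        for col in ts.columns:
            stats[f"{col}_{name}"] = agg[col]
    return pd.Series(stats)

# Hypothetical example: two recordings of different lengths but the same
# feature columns collapse to rows of identical width.
rng = np.random.default_rng(0)
ts_a = pd.DataFrame(rng.normal(size=(8751, 4)), columns=list("wxyz"))
ts_b = pd.DataFrame(rng.normal(size=(10321, 4)), columns=list("wxyz"))

X = pd.DataFrame([summarize(ts_a), summarize(ts_b)])
print(X.shape)  # (2, 16): one row per test, 4 stats x 4 features
```

I am unsure, though, whether such summaries would throw away the temporal shape of the curve that the human inspectors rely on.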

Best Answer

You could try treating each object's time series as a function, and use tools from functional time series analysis (or functional data analysis) to analyze it.

Here is one paper, Clustering and Forecasting Multiple Functional Time Series, which attempts to do this.

There is also an R package for functional time series analysis here. You could also try running principal component analysis and then classifying the time series based on the resulting coefficients.
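A rough sketch of that last idea, in Python for convenience: interpolate each variable-length series onto a common grid, compress the curves with PCA, and classify in the low-dimensional score space. The toy data and all parameter choices below (grid size, number of components, the classifier) are illustrative assumptions, not a recommendation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def to_fixed_grid(series: np.ndarray, n_points: int = 200) -> np.ndarray:
    """Linearly interpolate a variable-length 1-D series onto a common
    grid, so every test contributes a vector of the same length."""
    old = np.linspace(0.0, 1.0, num=len(series))
    new = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(new, old, series)

# Hypothetical toy data: class A drifts upward, class B downward, with
# row counts varying the way the real tests do (9000-11000 steps).
rng = np.random.default_rng(1)
X, y = [], []
for label, slope in [(0, 1.0), (1, -1.0)]:
    for _ in range(30):
        n = int(rng.integers(9000, 11000))
        t = np.linspace(0.0, 1.0, num=n)
        X.append(to_fixed_grid(slope * t + 0.1 * rng.normal(size=n)))
        y.append(label)
X, y = np.array(X), np.array(y)

# PCA reduces each curve to a handful of coefficients; the classifier
# then separates the classes in that score space.
clf = make_pipeline(PCA(n_components=5), LogisticRegression())
clf.fit(X, y)
print(clf.score(X, y))
```

For a multivariate series like yours (1257 columns), you would repeat the interpolation per column (or per selected column) before stacking, and of course evaluate on held-out tests rather than training accuracy.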
