Solved – How to classify M multivariate time series into K known categories

Tags: classification, machine-learning, time-series

I have $M$ multivariate time series ($N$ variables sampled over $T$ time points), and each of the $M$ multivariate time series is associated with one of $K$ known classes. My goals are:

(1) to predict to which of the $K$ classes a given multivariate time series belongs;

(2) to determine which subset of the $N$ variables contributes the most to each of the $K$ classes across the $M$ multivariate time series.

EXAMPLE

I have $M$ different trials, each storing the recorded activity of $N$ neurons ($N > 1000$) while the animal moves either one finger (class $K_1$) or another (class $K_2$). As such, each of the $M$ trials is associated with one class or the other. My goals are:

(1) to predict which of the two fingers the animal will move, given the recorded activity of the $N$ neurons;

(2) to identify which subsets of neurons are responsible for the movement of one finger or the other.

Best Answer

First of all, let's make sure I understood your question. You have $M$ time series (say, $M=100$, just to fix ideas), possibly of different lengths. You don't say how long these time series are, so let's assume they typically contain $N_T\approx 1000$ samples. Each time series has $N>1000$ features (the neurons).

Suppose all your time series are stored as a tidy data frame, i.e., one feature per column and one observation of each time series per row. You then have a data frame bigger than $10^5\times 1000$. This is hefty. Assumption: you have enough memory to handle this baby, or you can afford to pay for some AWS/Google Cloud ML/whatever cloud Machine Learning instances. If you don't, my answer probably won't work for you, and you should look into Vowpal Wabbit, though I have no idea whether it supports time series classification.

I would try two different approaches, and use $k$-fold cross-validation (with $k=5$ or $10$) to choose between them. The problem seems complicated to me, so I would also reserve a few of your $M$ time series as a test set; strictly speaking you don't have to, and with $k$-fold CV alone you could use all of them to choose your favorite classification procedure.
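A minimal sketch of this model-selection protocol, assuming each trial has already been reduced to a fixed-length feature vector (how to get there is the subject of Approach 1 below); all data here are random placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))        # placeholder features, M = 100 trials
y = rng.integers(0, 2, size=100)      # placeholder labels, K = 2 classes

# Optional held-out test set; 5-fold CV on the rest picks the procedure.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                  ("1-NN", KNeighborsClassifier(n_neighbors=1))]:
    print(name, cross_val_score(clf, X_tr, y_tr, cv=cv).mean())
```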


Approach 1: extract time series-related features and use a Random Forest or GBM classifier

In this approach, you extract features that are "typical" of time series from each of your series, using dedicated packages such as tsfresh in Python.

Step 0: normalization

This is not strictly necessary, but it's definitely good practice. Concretely, consider the activations $X_i(t_0),\dots,X_i(t_{N_T})$ of neuron $i$. You average the activation values over all time instants and across all time series instances, and you get a sample mean activation $\bar{X}_i$. Similarly, you compute a sample standard deviation $S_i$. You then standardize the neuron activation as

$$X'_i(t_j) =\frac{X_i(t_j)-\bar{X}_i}{S_i}$$

Repeat for all neurons.
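A minimal sketch of this standardization, under my own assumption (not stated in the question) that all $M$ trials have the same length $T$ and are stacked in a single array:

```python
import numpy as np

# Assumed layout: `data` holds all M trials as an array of shape (M, T, N)
# -- M trials, T time points, N neurons (random placeholder values here).
M, T, N = 100, 1000, 1200
data = np.random.default_rng(0).normal(size=(M, T, N))

# Sample mean and standard deviation of each neuron, pooled over all
# trials and all time points (axes 0 and 1).
mean_i = data.mean(axis=(0, 1))          # shape (N,): one mean per neuron
std_i = data.std(axis=(0, 1), ddof=1)    # shape (N,): one S_i per neuron

# Standardize X'_i(t_j) = (X_i(t_j) - mean_i) / S_i, broadcast over trials/time.
data_std = (data - mean_i) / std_i
```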

Step 1: feature extraction & filtering

"Typical" features may be the absolute energy, the spectral density, the maximum, minimum, median, mean, number of peaks, etc.


Note that in this approach each neuron will count as a separate time series. You can (you must!) include the fact that the $N$ neurons belong to the same instance (to the same multivariate time series) by using an ID (grouping) variable. For example, if you look here, the id column would contain the values $1,\dots,M=100$, while the $N$ neurons would correspond to the 6 columns $F_x, \dots, T_z$ in the example.
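As a sketch of what that layout could look like in code (the column names and the reshaping are my own hypothetical choices, starting from the `data_std` array of the Step 0 sketch):

```python
import numpy as np
import pandas as pd

# Flatten the standardized (M, T, N) array into the long format tsfresh
# expects: one row per (trial, time point), one column per neuron, plus an
# `id` column marking the trial and a `time` column for sorting.
M, T, N = data_std.shape
df = pd.DataFrame(
    data_std.reshape(M * T, N),
    columns=[f"neuron_{i}" for i in range(N)],   # hypothetical column names
)
df.insert(0, "id", np.repeat(np.arange(M), T))   # trial ID: 0, ..., M-1
df.insert(1, "time", np.tile(np.arange(T), M))   # time index within each trial
```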

You now have an even bigger list of features: you may need to filter some of them out to keep the training phase of your classifier manageable. The FRESH algorithm (FeatuRe Extraction based on Scalable Hypothesis tests) should do that. It is described here:

Christ, M., Kempa-Liehr, A.W. and Feindt, M. (2016). Distributed and parallel time series feature extraction for industrial big data applications. arXiv e-prints: 1610.07717. URL: http://adsabs.harvard.edu/abs/2016arXiv161007717C

tsfresh (as the name suggests) implements the algorithm.
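A minimal sketch of extraction plus FRESH filtering with tsfresh, using the long-format `df` built above; the label series `y` is a hypothetical stand-in for your actual class labels, one per trial:

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

# One class label per trial id (placeholder labels here).
y = pd.Series(np.random.default_rng(0).integers(0, 2, size=M), index=np.arange(M))

# Extract one row of features per trial (grouped by `id`, sorted by `time`)...
features = extract_features(df, column_id="id", column_sort="time")
impute(features)                      # replace NaN/inf left by some extractors

# ...then keep only the features the FRESH hypothesis tests deem relevant.
features_filtered = select_features(features, y)
```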

Step 2: classification

Suppose your set of "relevant" features contains $f$ features $v_1,\dots,v_{f}$. You then have a training set containing $M$ samples, where each sample is $P_i=(v_{1i},\dots,v_{fi},y_i)$, with $y_i\in\{1,\dots,K\}$ being the class label of sample $P_i$. You can now train a random forest classifier or a GBM (Gradient Boosted Machine) on your training set.
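A minimal random forest sketch on the filtered feature matrix from the previous step (hyperparameters are placeholders). As a bonus, since tsfresh prefixes each feature name with the originating column, aggregating the forest's feature importances per neuron gives a rough handle on your goal (2), i.e., which neurons matter most:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(clf, features_filtered, y, cv=5).mean())

clf.fit(features_filtered, y)

# tsfresh names features like "neuron_0__mean": group importances by the
# part before the double underscore to rank neurons, not single features.
imp = pd.Series(clf.feature_importances_, index=features_filtered.columns)
per_neuron = imp.groupby(imp.index.str.split("__").str[0]).sum()
print(per_neuron.sort_values(ascending=False).head(10))
```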

Note: once trained, each time you use your classifier for a prediction, you'll need to extract the features $v_1,\dots,v_f$ first. This means that to classify a new, unseen time series (one which doesn't belong to the training set, but of course has to come from the same probability distribution), you first compute the features $v_1,\dots,v_f$ from the time series of your $N$ neurons.


Approach 2: Dynamic Time Warping + 1-NN

Another approach, loosely related to the first one, is to define a metric on the space of time series which quantifies "how close" two time series are. Note, however, that with this approach, which is commonly used for time series classification, you won't be able to say which neurons contribute the most to the classification, which conflicts with your second requirement.

Step 0: normalization

Same as before.

Step 1: Dynamic Time Warping

Dynamic Time Warping finds an optimal match between two sequences by allowing a non-linear mapping of one sequence to another, and minimizing the distance between two sequences. The sequences are "warped" non-linearly to determine their similarity independent of any nonlinear variations in the time dimension.

Suppose $$L_k=\{\mathbf{X}_k(t_{1}),\dots,\mathbf{X}_k(t_{m})\}, \quad \mathbf{X}_k(t_{i})=(X_{k1}(t_{i}),\dots,X_{kN}(t_{i}))$$ is the $N$-variate time series $k$ from your training set of $M$ time series. Each time sample is a vector of $N$ scalars, the activations of the $N$ neurons at time $t_i$. Let's denote $\mathbf{X}_k(t_{i})=\mathbf{a}_i$ for simplicity. Now, let

$$L_h=\{\mathbf{X}_h(t_{1}),\dots,\mathbf{X}_h(t_{n})\}$$

denote another time series, and let's denote its elements $\mathbf{X}_h(t_{j})=\mathbf{b}_j$. We thus have two sequences of vectors

$$\mathbf{a}_1,\dots,\mathbf{a}_m \quad \text{and} \quad \mathbf{b}_1,\dots,\mathbf{b}_n $$

having possibly different lengths. For DTW, we build an $m\times n$ matrix $D$ (called the path matrix; I write $D$ rather than $M$ to avoid a clash with the number of time series) where $d_{ij}=\lVert\mathbf{a}_i-\mathbf{b}_j\rVert$. We want to find a path $W=w_1,\dots,w_l$ through $D$ which minimizes the sum of the distances $d_{ij}$ along it, and which is subject to various constraints such as endpoint matching, continuity and monotonicity (see here for details).

Computing the DTW distance with the most naive implementation requires $O(nm)$ operations, which is considerable in your case. There are numerous tricks you can employ to speed up the computation: a locality constraint (windowing) approximates the exact DTW computation, while the Keogh lower bound is a rigorous lower bound on DTW whose computation has only linear complexity. This is discussed in detail here, where you can also find a Python implementation of DTW.
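For reference, a minimal (and deliberately naive) $O(nm)$ multivariate DTW sketch with the optional locality constraint mentioned above (a Sakoe-Chiba window of half-width `w`); the function name is my own:

```python
import numpy as np

def dtw_distance(a, b, w=None):
    """DTW distance between two multivariate series a (m, N) and b (n, N)."""
    m, n = len(a), len(b)
    if w is None:
        w = max(m, n)                    # no locality constraint
    w = max(w, abs(m - n))               # window must at least cover the diagonal
    D = np.full((m + 1, n + 1), np.inf)  # cumulative-cost matrix
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # d_ij = ||a_i - b_j||
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[m, n]
```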

Step 2: classify based on 1-NN

At this point, classification is pretty simple. As mentioned in the linked references, DTW + 1-NN seems to do pretty well for time series classification. In practice this means that each time you get a new time series $L$, you compute the DTW distance between $L$ and each of $L_1,\dots,L_M$ (your training set), you find the time series $L_i$ whose DTW distance from $L$ is minimum, and you assign $L$ the same class as $L_i$. Note that using the Keogh lower bound, you usually don't need to compute all $M$ DTW distances. Suppose the DTW distance between $L_1$ and $L$ is $d$: if you then compute the Keogh lower bound for $L_2$ and get a value larger than $d$, you can safely skip the exact computation of the distance between $L$ and $L_2$, and move on to $L_3$.
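A minimal 1-NN sketch on top of the `dtw_distance` function above; for brevity it omits the Keogh lower-bound pruning just described and simply computes all $M$ distances:

```python
import numpy as np

def classify_1nn(query, train, labels, w=None):
    """Assign `query` the class of its DTW-nearest training series.

    train: list of (m_k, N) arrays; labels: sequence of class labels.
    """
    dists = [dtw_distance(query, series, w=w) for series in train]
    return labels[int(np.argmin(dists))]
```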
