Solved – Many-to-many or many-to-one LSTM when predicting a value derived from a sequence of features

deep-learning, lstm, neural-networks, recurrent-neural-network, time-series

Let's say I have a time series data set consisting of features that may correlate with whether the price of a stock will go up or down. Say these data points are at 5 minute intervals. I build an LSTM that takes in two hours of these sequential data points (24 time steps) and then attempts to predict whether the price will have increased or decreased an hour after the last data point fed into the network.

As the training data is historic data, I have labels for every data point – whether or not the price increased/decreased an hour after that data point. In practice, I would be inputting a sequence of 24 data points, but will only base my prediction on the final output. As the LSTM produces an output at each time step, should I be calculating the loss from all 24 of these outputs, or just the last one?
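For concreteness, in Keras the two options would look something like this (the layer sizes and feature count are just placeholders, not my actual setup):

```python
# A minimal sketch in Keras; layer sizes and feature count are illustrative assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

n_steps, n_features = 24, 8  # 2 hours of 5-minute bars, 8 assumed features

# Option A: many-to-one -- loss computed only on the final output.
many_to_one = Sequential([
    LSTM(32, input_shape=(n_steps, n_features)),  # returns only the last hidden state
    Dense(1, activation="sigmoid"),               # P(price up one hour later)
])
many_to_one.compile(optimizer="adam", loss="binary_crossentropy")

# Option B: many-to-many -- loss computed on all 24 outputs,
# which requires a label for every time step.
many_to_many = Sequential([
    LSTM(32, input_shape=(n_steps, n_features), return_sequences=True),
    TimeDistributed(Dense(1, activation="sigmoid")),  # one prediction per step
])
many_to_many.compile(optimizer="adam", loss="binary_crossentropy")
```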

The temporal nature of these data points is important for determining this up/down trend, yet the output at the first time step is derived from a single data point with no preceding sequence. I suspect that including it would have a negative effect on the loss.

Or do you think that, because of the nature of an LSTM, and because it has learned from previous training sequences, it will actually make an accurate prediction from this single, first time step alone?

So can I include it, and every other time step's output, in the loss calculation?

Best Answer

The output of an LSTM block is not immediately interpretable as the probability of an event, since its activation function is tanh. To mitigate this, it seems more sensible to me to put a Dense layer on top of the concatenated outputs from the LSTM, like this:

$$ \sigma(W x + b), $$ where $x = \begin{pmatrix} h_0 \\ h_1 \\ \vdots \\ h_{t-1}\end{pmatrix}$ is the vertically concatenated vector of LSTM outputs, $W \in \mathbb{R}^{1 \times t \cdot \dim(h_i)}$, and $b \in \mathbb{R}$. With the sigmoid output you can use binary crossentropy as your loss function (or, equivalently, a two-unit softmax with categorical crossentropy).
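A minimal Keras sketch of this idea (the hidden size and feature count are assumptions): return the full sequence of hidden states, flatten them into the concatenated vector $x$, and apply a single dense sigmoid layer on top:

```python
# Sketch of the dense-on-concatenated-outputs idea in Keras; sizes are assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Flatten, Dense

n_steps, n_features, hidden = 24, 8, 32  # assumed dimensions

model = Sequential([
    # return_sequences=True yields the full sequence h_0, ..., h_{t-1}
    LSTM(hidden, input_shape=(n_steps, n_features), return_sequences=True),
    Flatten(),                       # vertical concatenation into x, length t * dim(h_i)
    Dense(1, activation="sigmoid"),  # sigma(W x + b)
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```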

As for your first-data-point problem, can you shift the starting point so that the first prediction is made at the 25th data point? That way every window you train on has a full 24-step history behind it.
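For example, a hypothetical windowing helper (assuming your features form a `(n_samples, n_features)` array with one aligned label per row) could look like:

```python
import numpy as np

def make_windows(series, labels, n_steps=24):
    """Slice a (n_samples, n_features) array into overlapping n_steps-long windows.

    The first usable window ends at index n_steps - 1, so every window
    (and its label) is backed by a full two hours of history.
    """
    X = np.stack([series[i - n_steps + 1 : i + 1]
                  for i in range(n_steps - 1, len(series))])
    y = labels[n_steps - 1:]  # label of each window's final time step
    return X, y
```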