I would suggest framing this as a classification problem and outputting two softmaxes, each of size 300. This usually works better than the continuous-output (regression) approach you have taken here.
You might expect this approach to work better because, for the LSTM to successfully execute the original regression approach, it would have to detect the onset and then somehow carry that information forward over several hundred time steps. In addition, there would probably have to be a counter-like mechanism embedded in the LSTM weights to figure out exactly where the detected onset was. This is all super difficult for an LSTM to learn to do.
Also for that reason, I don't recommend just taking the hidden vector from the last time step of the LSTM and computing the output from that -- instead, try doing something with the full sequence of hidden states (flatten them, for example), as in the sketch below.
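Here is a minimal sketch of what that could look like, assuming a Keras-style setup; the sequence length, feature count, and the two 300-way output heads (I'm guessing onset/offset positions) are my assumptions, not details from your post:

```python
import tensorflow as tf

# Hypothetical shapes -- adjust to your data.
seq_len, n_features, n_classes = 300, 1, 300

inputs = tf.keras.Input(shape=(seq_len, n_features))
# Keep the full sequence of hidden states rather than only the last one.
h = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
h = tf.keras.layers.Flatten()(h)

# Two independent softmax heads, each classifying one of 300 positions.
out_a = tf.keras.layers.Dense(n_classes, activation="softmax", name="onset")(h)
out_b = tf.keras.layers.Dense(n_classes, activation="softmax", name="offset")(h)

model = tf.keras.Model(inputs, [out_a, out_b])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```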
A very short answer:
LSTM decouples the cell state (typically denoted by $c$) from the hidden layer/output (typically denoted by $h$), and only does additive updates to $c$, which makes memories in $c$ more stable. Thus the gradient that flows through $c$ is preserved and hard to vanish (and therefore the overall gradient is hard to vanish). However, other paths may cause gradient explosion.
A more detailed answer with mathematical explanation:
Let's review the CEC (Constant Error Carousel) mechanism first. CEC says that from time step $t$ to $t+1$, if the forget gate is 1 (there is no forget gate in the original LSTM paper, so this is always the case), the gradient $\delta c^{t} = \partial l / \partial c^{t}$ can flow without change.
Following the BPTT formulae in the paper LSTM: A Search Space Odyssey, Appendix A.2 ($y$ in the paper is $h$ in other literature), the CEC flow corresponds to the equation $\delta c^t = \dots + \delta c^{t+1} \odot f^{t+1}$. When $f^{t+1}$ is close to 1, $\delta c^{t+1}$ accumulates into $\delta c^t$ losslessly.
However, LSTM is more than CEC. Apart from the CEC path from $c^{t}$ to $c^{t+1}$, other paths do exist between two adjacent time steps, for example $y^t \rightarrow o^{t+1} \rightarrow y^{t+1}$. Walking through the backpropagation process over two steps along this path, we have $\delta y^t \leftarrow R^T_o \delta o^{t+1} \leftarrow \delta y^{t+1} \leftarrow R^T_o \delta o^{t+2}$, so $R^T_o$ is multiplied in twice, just as in vanilla RNNs, which may cause gradient explosion. Similarly, the paths through the input and forget gates can also cause gradient explosion due to repeated multiplication by the matrices $R^T_i, R^T_f, R^T_z$.
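A tiny numerical illustration (my own sketch, with scalar states standing in for the vectors/matrices above) of why the CEC path preserves the gradient while a vanilla-RNN-style path can explode:

```python
import numpy as np

T = 200        # number of time steps to back-propagate through
grad = 1.0     # delta arriving at the last time step

# CEC path: delta c^t = delta c^{t+1} * f^{t+1}, with f close to 1.
f = 0.999
cec_grad = grad * f ** T

# Vanilla-RNN-like path: repeated multiplication by a recurrent weight
# (a scalar standing in for R_o^T); values > 1 explode, values < 1 vanish.
w = 1.1
rnn_grad = grad * w ** T

print(f"CEC-path gradient after {T} steps:       {cec_grad:.4f}")   # ~0.82, well preserved
print(f"Recurrent-path gradient after {T} steps: {rnn_grad:.2e}")   # ~1.9e+08, exploded
```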
Reference:
K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A Search Space Odyssey. CoRR, abs/1503.04069, 2015.
Best Answer
The terminology is unfortunately inconsistent.
num_units in TensorFlow is the size of the hidden state, i.e. the dimension of $h_t$ in the equations you gave. See also https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard9/tf.nn.rnn_cell.RNNCell.md .
"LSTM layer" is probably more explicit, example: