Solved – LSTM: shape of tensors

lstm

I'm trying to understand LSTMs, using for instance http://colah.github.io/posts/2015-08-Understanding-LSTMs/

I get the overall idea, I guess. But I'm not quite sure I get the maths.

I'll set up a very simple problem: I have a sequence of numbers and want to predict the next number.

So x_t is of shape (1), and as I understand it, h_t will be the prediction, so it should also be of shape (1). (I'm ignoring batch size here.)

Now, the equation producing h_t, namely h_t = o_t * tanh(C_t), uses the elementwise * operation, so its two operands must have the same shape as the result; that is, C_t and o_t should also be of shape (1).

Following the same idea, the equation producing C_t, namely C_t = f_t * C_{t-1} + i_t * ~C_t, also forces f_t, i_t, and ~C_t to have shape (1).
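To make my reading concrete, here is how I picture one step in code (a sketch of my own, using the standard gate equations from the post, with input size 1 and hidden size 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step with input size 1 and hidden size 1: every weight is a
# single number, so every gate and state is a single number too.
def lstm_step(x_t, h_prev, C_prev, W, U, b):
    # W, U, b each hold one scalar weight per gate: 'f', 'i', 'o', 'c'
    f_t = sigmoid(W['f'] * x_t + U['f'] * h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] * x_t + U['i'] * h_prev + b['i'])      # input gate
    o_t = sigmoid(W['o'] * x_t + U['o'] * h_prev + b['o'])      # output gate
    C_tilde = np.tanh(W['c'] * x_t + U['c'] * h_prev + b['c'])  # candidate ~C_t
    C_t = f_t * C_prev + i_t * C_tilde  # new cell state
    h_t = o_t * np.tanh(C_t)            # new hidden state = prediction
    return h_t, C_t
```

Everything in there really is a scalar, which is exactly what worries me.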

So… everything reduces to scalar real numbers in that case?

What am I getting wrong? Because this would just not be able to learn much, would it?

Best Answer

The input shape for an RNN is typically 3-dimensional:

  1. Number of samples
  2. Number of timesteps
  3. Input dimensions (features)

So, as you say, you start with a sequence of numbers; that's basically the timesteps. To successfully train a NN you need several of those sequences; that's the number of samples. The input dimension is the number of inputs at each timestep. For example, if you try to categorize the expected weather as 'Good' or 'Bad' based on the temperature and the wind of the last 10 hours, using hourly measurements, then your input shape is (None, 10, 2), where None means you can feed as many data series as you have, but each data series consists of 10 timesteps, each holding a pair of temperature and wind.
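As a concrete illustration of that layout (a sketch of my own, not part of the question; the sample count of 500 is hypothetical), the data would be arranged like this:

```python
import numpy as np

num_samples = 500   # hypothetical: how many 10-hour windows were collected
timesteps = 10      # the last 10 hourly measurements
features = 2        # temperature and wind

# X[sample, hour, feature] -- the leading axis is the "None" dimension,
# because the network accepts any number of samples.
X = np.zeros((num_samples, timesteps, features))
y = np.zeros(num_samples)  # one 'Good'/'Bad' label (1/0) per window

print(X.shape)  # (500, 10, 2)
```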
Having this input shape, the context will inherit the same shape, so C_t, f_t, i_t, and h_t will all be pairs rather than single numbers.
Perhaps the missing point is that you can use multiple units; then the same computation happens multiple times in parallel. More units can learn more patterns, or more complex patterns.
The output of an LSTM is either the last hidden state h_t (shape (2) in this example) or the entire sequence of hidden states, with shape (10, 2). You use the latter when you stack LSTMs. Either way, after the LSTM layer you typically add a dense layer to interpret the outcome of the units and combine them into the desired output shape. For Good/Bad classification the output shape can be (1), so a dense layer with 1 neuron can be used. For this example, a sigmoid activation on the dense layer squashes the result into [0, 1], matching the 0/1 labels that represent Good/Bad in the training data.
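Putting the pieces together, here is a minimal Keras sketch (my own illustration; the choice of 16 units is arbitrary) of an LSTM layer followed by a dense layer that combines the units into the (1)-shaped Good/Bad output:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    # 16 units: each unit runs its own copy of the gate equations,
    # so the last hidden state h_t has shape (16) per sample.
    # return_sequences=True would instead output all 10 hidden states,
    # shape (10, 16), which is what you would feed into a stacked LSTM.
    LSTM(16, input_shape=(10, 2)),
    # The dense layer interprets the 16 unit outputs and combines them
    # into the desired (1)-shaped output; sigmoid maps it into [0, 1].
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()
```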
