Solved – When computing parameters, why is the dimension of an LSTM cell's hidden output state assumed to equal the number of LSTM cells?

lstm, neural networks, recurrent neural network, word embeddings

I was trying to figure out how to estimate the number of parameters in an LSTM layer. How does the number of parameters relate to the number of LSTM cells, the input dimension, and the hidden output-state dimension of the layer?

If the LSTM input is 512-d (the word embedding dimension), the output hidden dimension is 256, and there are 256 LSTM units in each direction of the bidirectional LSTM layer, how many parameters are there per cell and in the layer in total?

I came across this link https://stackoverflow.com/questions/38080035/how-to-calculate-the-number-of-parameters-of-an-lstm-network, and it seems to suggest that the hidden output state dimension equals the number of LSTM cells in the layer. Why is that?

Best Answer

I came across this link https://stackoverflow.com/questions/38080035/how-to-calculate-the-number-of-parameters-of-an-lstm-network, and it seems to suggest that the hidden output state dimension equals the number of LSTM cells in the layer. Why is that?

Each cell's hidden state is a single float. For example, the reason you'd have an output dimension of 256 is that you have 256 units: each unit produces one dimension of the output.

For example, see this documentation page for PyTorch: https://www.pytorch.org/docs/stable/nn.html. If we look at the output entry for an LSTM, the hidden state has shape (num_layers * num_directions, batch, hidden_size). So for a model with 1 layer, 1 direction (i.e. not bidirectional), and batch size 1, we have hidden_size floats in total.
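Here is a minimal sketch (assuming a working PyTorch install; the shapes, not the particular random values, are the point) showing that the hidden state returned by nn.LSTM really does contain hidden_size floats per layer, direction, and batch element:

```python
import torch
import torch.nn as nn

# 1 layer, 1 direction, batch size 1, as in the text above
lstm = nn.LSTM(input_size=512, hidden_size=256, num_layers=1, batch_first=True)

x = torch.randn(1, 10, 512)            # (batch=1, seq_len=10, input_size=512)
output, (h_n, c_n) = lstm(x)

print(h_n.shape)     # torch.Size([1, 1, 256]) = (num_layers * num_directions, batch, hidden_size)
print(output.shape)  # torch.Size([1, 10, 256]): one hidden_size vector per timestep
```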

You can also see this if you keep track of the dimensions used in the LSTM computation. At each timestep (each element of the input sequence), an LSTM layer carries out these operations, which are just compositions of matrix-vector products and activation functions:
$$ \begin{aligned} i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\ f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\ g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\ o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\ c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ h_t &= o_t \odot \tanh(c_t) \\ \end{aligned} $$
We're focused on the hidden state, $h_t$, so look at the operations involving $h_{t-1}$, the hidden state at the previous timestep. The hidden-to-hidden weights must have size hidden_size by hidden_size, because they are matrices that must be conformable in a matrix-vector product where the vector has size hidden_size. Likewise, the input-to-hidden weights must have size hidden_size by input_size, because they appear in a matrix-vector product where the vector has size input_size.
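If it helps to check this in code, here is a short sketch (again assuming PyTorch, which stacks the four gate matrices into a single tensor) confirming the shapes derived above: the input-to-hidden weights are 4 * hidden_size by input_size and the hidden-to-hidden weights are 4 * hidden_size by hidden_size.

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=256)

print(lstm.weight_ih_l0.shape)  # torch.Size([1024, 512]) -> 4 * 256 rows, input_size columns
print(lstm.weight_hh_l0.shape)  # torch.Size([1024, 256]) -> 4 * 256 rows, hidden_size columns
print(lstm.bias_ih_l0.shape)    # torch.Size([1024])
print(lstm.bias_hh_l0.shape)    # torch.Size([1024])
```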

Importantly, your distinction between hidden size and number of units never makes an appearance. If the hidden size and the number of units were different, then this matrix-vector arithmetic would, somewhere, fail to be conformable because the dimensions wouldn't be compatible.

As for counting the number of parameters in an LSTM model, see "How can calculate number of weights in LSTM".
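For the specific numbers in the question (512-d input, hidden size 256, one bidirectional layer), a hedged worked example: per direction the count is 4 * (hidden_size * (input_size + hidden_size) + hidden_size) with a single bias vector per gate; PyTorch keeps two bias vectors per gate (the b_i* and b_h* terms in the equations above), which adds another 4 * hidden_size, and a bidirectional layer doubles the total.

```python
import torch.nn as nn

input_size, hidden_size = 512, 256

# PyTorch convention: two bias vectors per gate, hence the 2 * hidden_size term
per_direction = 4 * (hidden_size * (input_size + hidden_size) + 2 * hidden_size)
print(per_direction)      # 788480 per direction
print(2 * per_direction)  # 1576960 for one bidirectional layer

# Cross-check against the actual module
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, bidirectional=True)
print(sum(p.numel() for p in lstm.parameters()))  # 1576960, matches
```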

I believe the confusion arises because the OP has conflated the hidden output state, which is an output of the model, with the weights that act on the hidden state. I think this is the case because you insist that the hidden state has shape (n, n). It does not, but the hidden weight matrices are square. LSTM cells have memory, which is returned as part of the output and is used together with the model's weights and biases to produce the prediction for the next timestep. The difference between the hidden state output and the hidden weights is that the weights are the same for all timesteps, while the hidden state can vary from step to step. This "memory" component is where LSTMs get their name.
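To make that distinction concrete, here is a small sketch (assuming PyTorch) in which the same weights are reused at every timestep while the hidden state, the "memory", changes as the sequence is consumed:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
x = torch.randn(1, 5, 512)

h = torch.zeros(1, 1, 256)
c = torch.zeros(1, 1, 256)
for t in range(x.size(1)):
    # the weights inside lstm are identical at every step...
    _, (h, c) = lstm(x[:, t:t+1, :], (h, c))
    # ...but the hidden state evolves as the input is processed
    print(t, h.norm().item())
```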
