Solved – What are the nodes in an RNN/LSTM?

lstm · recurrent neural network · terminology

In the blog post "The Unreasonable Effectiveness of Recurrent Neural Networks", the author says that he is training "a 2-layer LSTM with 512 hidden nodes" for character prediction.
So it will look somewhat like this:

y1 = rnn1.step(x)
y = rnn2.step(y1)

What I don't get: what are these 512 hidden nodes in the context of an LSTM?
My first guess was that it might be the dimension of matrices like self.W_hh in the following code:

import numpy as np

class RNN:
  # ...
  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

So self.W_hh $\in \mathbb{R}^{n \times n}$, where $n$ is the number of nodes (I assume it has to be a square matrix, since the hidden state gets fed back in at the next timestep; is that right?). But since the input vectors are one-hot encoded characters and the output vectors are scores for character probabilities, the matrices would have to have dimension $\mathbb{R}^{26 \times 26}$, since the number of characters is 26.

So that leaves me with the question: what are these nodes?

Best Answer

For fully-connected layers, the number of 'nodes' is the output dimension of the weight matrix. In other words, if we have:

  • input layer, dimension $d_i$
  • hidden layer 1, dimension $d_h$

... then the weight matrix for hidden layer 1 will be $d_i \times d_h$. Here $d_h$ is the number of 'nodes' of hidden layer 1: it is the output dimension.
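As a minimal numpy sketch (the sizes here are made up for illustration), such a hidden layer is just its weight matrix, and the node count is the output dimension:

import numpy as np

d_i, d_h = 100, 512                      # assumed sizes: input dim, number of nodes
W1 = np.random.randn(d_i, d_h) * 0.01    # weight matrix of hidden layer 1
x = np.random.randn(d_i)                 # one input vector
h = np.tanh(x @ W1)                      # hidden activations
print(h.shape)                           # (512,): one activation per 'node'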

In RNNs and LSTMs, these concepts are unchanged.
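To make that concrete with the RNN class from the question (the sizes and the random initialization below are my own assumptions, just enough to make step actually run):

import numpy as np

n, d_i, d_o = 512, 26, 26                     # nodes, input dim, output dim (assumed)
rnn = RNN()
rnn.h    = np.zeros(n)                        # hidden state: one entry per node
rnn.W_hh = np.random.randn(n, n)   * 0.01     # hidden -> hidden, 512 x 512
rnn.W_xh = np.random.randn(n, d_i) * 0.01     # input  -> hidden, 512 x 26
rnn.W_hy = np.random.randn(d_o, n) * 0.01     # hidden -> output, 26 x 512

x = np.zeros(d_i); x[0] = 1.0                 # a one-hot character
y = rnn.step(x)
print(rnn.h.shape, y.shape)                   # (512,) (26,)

So only W_xh and W_hy involve the number of characters; W_hh is square in the number of nodes, which is exactly the size of the hidden state self.h.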

One nuance, however: for RNNs there is generally an embedding layer at the input and at the output. So the layers are like this (a concrete sketch follows the list):

  • input layer, dimension $d_i$ (corresponds to the one-hot dimension)
  • embedding layer, dimension $d_h$ (embeds the one-hot vector into embedding dimension $d_h$; generally this is just a matrix multiplication)
  • LSTM, dimension $d_h$ (contains various $d_h \times d_h$ matrices)
  • output embedding, dimension $d_o$ (in char-rnn, $d_o = d_i$; it decodes the output embedding vectors back into a probability distribution over possible characters)
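Putting it together, a minimal numpy sketch of that stack (the sizes, the initialization, and the plain-tanh update are my own assumptions; a real LSTM adds gates, biases, and training code):

import numpy as np

d_i = 26     # one-hot dimension (number of characters, per the question)
d_h = 512    # hidden dimension: the '512 hidden nodes'
d_o = d_i    # output dimension: scores over characters

rng = np.random.default_rng(0)
W_embed  = rng.normal(0, 0.01, (d_h, d_i))   # embedding: one-hot -> d_h
W_hh     = rng.normal(0, 0.01, (d_h, d_h))   # recurrent matrix, d_h x d_h
W_decode = rng.normal(0, 0.01, (d_o, d_h))   # decoder: d_h -> character scores

h = np.zeros(d_h)                  # hidden state: one value per node
x = np.zeros(d_i); x[3] = 1.0      # one-hot character

e = W_embed @ x                    # embedded input, shape (512,)
h = np.tanh(W_hh @ h + e)          # simple recurrent update (an LSTM gates this)
scores = W_decode @ h              # shape (26,)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over characters

Note that only the embedding and decoding matrices touch $d_i = 26$; everything recurrent lives in $d_h = 512$, which is where the '512 hidden nodes' come from.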