As described by Andrej Karpathy, the basic recurrent neural network cell is something like
$$ h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t) $$
so it takes the previous hidden state $h_{t-1}$ and the current input $x_t$ to produce the new hidden state $h_t$. Notice that $W_{hh}$ and $W_{xh}$ are not indexed by time $t$: the same weights are used at every timestep. In simplified Python code, the forward pass is basically a for-loop:
for t in range(timesteps):
    h[t] = np.tanh(np.dot(Wxh, x[t]) + np.dot(Whh, h[t-1]))
So the number of timesteps does not change the model at all; it only changes how many times the loop runs. People often use a fixed number of timesteps anyway, to simplify the code and work with simpler data structures.
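To make the loop above self-contained, here is a minimal runnable sketch (the function name `rnn_forward` and the weight initialization are mine, and biases are omitted as in the formula). The same weights handle sequences of any length:

```python
import numpy as np

def rnn_forward(x, Wxh, Whh, h0):
    """Apply the same RNN cell to each row of x (shape: timesteps x input_dim)."""
    h = h0
    states = []
    for x_t in x:                         # one iteration per timestep
        h = np.tanh(Wxh @ x_t + Whh @ h)  # same Wxh, Whh every time
        states.append(h)
    return states

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
Wxh = rng.standard_normal((hidden_dim, input_dim))
Whh = rng.standard_normal((hidden_dim, hidden_dim))
h0 = np.zeros(hidden_dim)

# One set of weights, two different sequence lengths:
short = rnn_forward(rng.standard_normal((2, input_dim)), Wxh, Whh, h0)
long = rnn_forward(rng.standard_normal((7, input_dim)), Wxh, Whh, h0)
print(len(short), len(long))  # 2 7
```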
In Keras, the recurrent layers take input tensors of shape (batch_size, timesteps, input_dim), but you can set the first two dimensions to None if you want to allow varying sizes. For example, with an input shape of (None, None, input_dim), the layer accepts batches of any size and sequences of any length, as long as each timestep has input_dim features (this dimension must be fixed). This is possible precisely because the forward pass is a for-loop applying the same function at every timestep. It would be more complicated in architectures where the parameter shapes themselves depend on the input size, as in a densely-connected layer, whose weight matrix is tied to the input dimension.
Best Answer
A GRU/LSTM Cell has no return_sequences option: it is just one cell of an unfolded GRU/LSTM unit, so it computes and returns the output for only a single timestep. The GRU/LSTM layer, on the other hand, does take a return_sequences argument; with return_sequences=True it returns the output states of all timesteps, and otherwise only the last one.
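The cell-versus-layer distinction can be sketched in NumPy (illustrative names, not the Keras API, and using the simple tanh cell from above in place of a GRU/LSTM cell for brevity): the cell computes one timestep, the layer loops the cell over the whole sequence, and return_sequences decides whether every state or only the last one is returned.

```python
import numpy as np

def cell(x_t, h_prev, Wxh, Whh):
    # one timestep -- roughly what a *Cell class computes
    return np.tanh(Wxh @ x_t + Whh @ h_prev)

def layer(x, Wxh, Whh, h0, return_sequences=False):
    # full recurrent layer: loops the cell over all timesteps
    h, outputs = h0, []
    for x_t in x:
        h = cell(x_t, h, Wxh, Whh)
        outputs.append(h)
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(2)
input_dim, hidden_dim, timesteps = 3, 4, 6
Wxh = rng.standard_normal((hidden_dim, input_dim))
Whh = rng.standard_normal((hidden_dim, hidden_dim))
x = rng.standard_normal((timesteps, input_dim))
h0 = np.zeros(hidden_dim)

all_states = layer(x, Wxh, Whh, h0, return_sequences=True)
last_state = layer(x, Wxh, Whh, h0, return_sequences=False)
print(all_states.shape, last_state.shape)  # (6, 4) (4,)
```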
In Figure 1, the unit shown in the loop is a GRU/LSTM. In Figure 2, the cells shown are GRU/LSTM Cells, i.e. the individual steps of an unfolded GRU/LSTM unit.