GRU Neural Networks: Parameters in a Gated Recurrent Unit Layer

Tags: gru, neural networks, recurrent neural network

The title says it all: how many trainable parameters are there in a GRU layer? This kind of question comes up a lot when comparing models built from different RNN layer types, such as long short-term memory (LSTM) units versus GRUs, in terms of per-parameter performance. Since a larger number of trainable parameters generally increases the capacity of the network to learn, comparing alternative models on a per-parameter basis gives an apples-to-apples comparison of the relative effectiveness of GRUs and LSTMs.

Best Answer

The original GRU paper, "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation" by Kyunghyun Cho et al., does not include bias parameters in its equations. Instead, the authors write

To make the equations uncluttered, we omit biases.

which does not help a reader understand how the authors envisioned using bias neurons; nor does it allow readers to easily count the number of bias neurons.

So we have to look elsewhere. According to Rahul Dey and Fathi M. Salem, "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks":

... the total number of parameters in the GRU RNN equals $3 \times (n^2 + nm + n)$.

where $m$ is the input dimension and $n$ is the output (hidden) dimension. This is because there are three sets of operations (the update gate, the reset gate, and the candidate state), each requiring an $n \times m$ weight matrix, an $n \times n$ weight matrix, and a length-$n$ bias vector.
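As a quick sanity check, here is a small Python helper (my own illustration, not from the paper) that evaluates this formula for example dimensions $m$ and $n$:

```python
# Dey & Salem's count: three gates/candidates, each with an (n x m) W,
# an (n x n) U, and a length-n bias b.
def gru_param_count(m, n):
    return 3 * (n * n + n * m + n)

print(gru_param_count(m=10, n=20))  # 3 * (400 + 200 + 20) = 1860
```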

Dey and Salem outline the GRU in this manner:

The GRU RNN reduce the gating signals to two from the LSTM RNN model. The two gates are called an update gate $z_t$ and a reset gate $r_t$. The GRU RNN model is presented in the form: $$\begin{align} h_t &= (1 - z_t)\odot h_{t-1} + z_t \odot \tilde{h}_t \\ \tilde{h}_t &= g(W_h x_t + U_h(r_t \odot h_{t-1}) + b_h) \end{align}$$ with the two gates presented as: $$\begin{align} z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\ r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \end{align}$$

and in the beginning of the paper, they lay out the notation used as

$W$ is an $n \times m$ matrix, $U$ is an $n \times n$ matrix and $b$ is an $n \times 1$ matrix (or vector) for a hidden state of size $n$ and an input of size $m$.
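To make those shapes concrete, here is a minimal NumPy sketch of a single GRU step following the equations quoted above (again my own illustration, not the authors' code; I take the activation $g$ to be $\tanh$, as is standard):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One GRU step. W_* are (n, m), U_* are (n, n), b_* are (n,);
    x_t is (m,) and h_prev is (n,)."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)  # candidate state, g = tanh
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # new hidden state

# Example usage with random parameters of the stated shapes.
m, n = 3, 4
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(n, m), (n, n), (n,)] * 3]
h_t = gru_step(rng.normal(size=m), np.zeros(n), *params)
```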

These parameter counts might differ from what you find in software. It seems that some software (e.g. PyTorch, Keras) has made the decision to over-parameterize the model by including additional bias units. In these software implementations, the total parameter count is given as

$$ 3 (n^2 + nm + 2n). $$
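A quick check with PyTorch's `torch.nn.GRU` module appears to reproduce this count:

```python
import torch

m, n = 10, 20  # arbitrary example dimensions
gru = torch.nn.GRU(input_size=m, hidden_size=n)

print(sum(p.numel() for p in gru.parameters()))  # 1920
print(3 * (n * n + n * m + 2 * n))               # 3 * (400 + 200 + 40) = 1920
```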

This over-parameterization appears to change the three GRU equations to:

$$\begin{align} \tilde{h}_t &= g(W_h x_t + b_{hW} + U_h(r_t \odot h_{t-1}) + b_{hU}) \\ z_t &= \sigma(W_z x_t + b_{zW} + U_z h_{t-1} + b_{zU}) \\ r_t &= \sigma(W_r x_t + b_{rW} + U_r h_{t-1} + b_{rU}) \end{align}$$

which we can see is algebraically the same, using the substitution $b_{iW} + b_{iU} = b_{i}$. I'm not sure why software would do this. Perhaps the intention is to create the GRU using compositions of existing linear layer classes, and biases are included in both linear layers. Perhaps this parameterization works better with CUDA devices for some reason.
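A tiny numeric check (again, just my own illustration) confirms that the split-bias and combined-bias forms produce identical pre-activations whenever $b_i = b_{iW} + b_{iU}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W, U = rng.normal(size=(n, m)), rng.normal(size=(n, n))
b_W, b_U = rng.normal(size=n), rng.normal(size=n)
x, h = rng.normal(size=m), rng.normal(size=n)

split = W @ x + b_W + U @ h + b_U       # software's split-bias form
combined = W @ x + U @ h + (b_W + b_U)  # paper's single-bias form with b = b_W + b_U
print(np.allclose(split, combined))     # True
```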