GRU Neural Networks – How They Solve the Vanishing Gradient Problem

Tags: backpropagation, gradient, gru, neural networks, recurrent neural network

I am learning the GRU model in deep learning and reading this article, where the details of BPTT are explained. Towards the end the author discusses the value of the partial derivative $\frac{\partial h_i}{\partial h_{i-1}}$ in three cases: (1) $z = 1$, (2) $z = 0, r = 0$, and (3) $z = 0, r = 1$.

Here $z$ and $r$ are the update and reset gates, respectively.

To clarify the notation and provide some details: the gradient of the loss at step $t$ with respect to $U_z$, the input-to-hidden weight of the update gate $z_t$, has the form
\begin{align}
\frac{\partial L_t}{\partial U_z} &= \frac{\partial L_t}{\partial h_t} \sum_{k=1}^t \{ \left(\prod_{i=k+1}^t \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial^+ h_k}{\partial U_z} \}
\end{align}
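
To make the structure of this sum concrete, here is a minimal NumPy sketch (my own illustration, not from the article) of how the gradient accumulates as a sum of Jacobian products times immediate partials. The per-step Jacobians and the immediate partials $\frac{\partial^+ h_k}{\partial U_z}$ are arbitrary placeholder arrays, the immediate partial is flattened to a vector purely to keep the shapes simple, and the leading factor $\frac{\partial L_t}{\partial h_t}$ is omitted:

```python
import numpy as np

def bptt_gradient(jacobians, immediate_partials):
    """Accumulate sum_k ( prod_{i=k+1}^t dh_i/dh_{i-1} ) * d+h_k/dU_z.

    jacobians[i]          : dh_{i+1}/dh_i, shape (hidden, hidden)
    immediate_partials[k] : d+h_k/dU_z, flattened to a hidden-sized vector
                            purely for illustration (the true object is a
                            higher-order tensor).
    """
    t = len(immediate_partials)
    hidden = immediate_partials[0].shape[0]
    grad = np.zeros(hidden)
    for k in range(t):
        chain = np.eye(hidden)        # empty product = identity
        for i in range(k, t - 1):     # Jacobians from step k+1 up to t
            chain = jacobians[i] @ chain
        grad += chain @ immediate_partials[k]
    return grad

# toy example: 5 steps, hidden size 3, random placeholder values
rng = np.random.default_rng(0)
J = [rng.normal(scale=0.1, size=(3, 3)) for _ in range(4)]
P = [rng.normal(size=3) for _ in range(5)]
print(bptt_gradient(J, P))
```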

To keep the notation simple I will drop the matrix transposes and Hadamard products in the partial derivative below:
\begin{align}
\frac{\partial h_i}{\partial h_{i-1}} &= \frac{\partial h_i}{\partial \tilde{h}_i} \frac{\partial \tilde{h}_i}{\partial h_{i-1}} + \frac{\partial h_i}{\partial z_i} \frac{\partial z_i}{\partial h_{i-1}} + \frac{\partial^{+} h_i}{\partial h_{i-1}} \\
&= (1-z_i) \left\{ W_r W_h (1- \tilde{h}_i^2)\, h_{i-1}\, r_i (1- r_i) + W_h (1- \tilde{h}_i^2)\, r_i \right\} \\
&\quad + W_z (h_{i-1}- \tilde{h}_i)\, z_i (1-z_i) + z_i
\end{align}

where $h_i$ is the hidden state at step $i$ and $\tilde{h}_i$ is the candidate state (the current memory of the network) at step $i$.
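
As a sanity check on this expression, here is a small scalar (one-unit) sketch, again my own and not from the article, that plugs the three regimes $z \approx 1$; $z \approx 0, r \approx 0$; and $z \approx 0, r \approx 1$ into the Jacobian above. The weights $W_z, W_r, W_h$ and the states are arbitrary placeholder numbers:

```python
def dh_dhprev(z, r, h_prev, h_tilde, W_z, W_r, W_h):
    """Scalar version of the Jacobian dh_i/dh_{i-1} written above
    (transposes and Hadamard products collapse to ordinary products in 1-D)."""
    inner = (W_r * W_h * (1 - h_tilde**2) * h_prev * r * (1 - r)
             + W_h * (1 - h_tilde**2) * r)
    return (1 - z) * inner + W_z * (h_prev - h_tilde) * z * (1 - z) + z

W_z, W_r, W_h = 0.5, -0.8, 1.2   # arbitrary placeholder weights
h_prev, h_tilde = 0.3, 0.7       # arbitrary placeholder states

for label, z, r in [("z ~ 1        ", 0.999, 0.5),    # derivative ~ 1
                    ("z ~ 0, r ~ 0 ", 0.001, 0.001),  # derivative ~ 0
                    ("z ~ 0, r ~ 1 ", 0.001, 0.999)]: # derivative ~ W_h*(1 - h_tilde^2)
    print(label, dh_dhprev(z, r, h_prev, h_tilde, W_z, W_r, W_h))
```

The printed values reproduce the three cases from the slides: roughly $1$, roughly $0$, and roughly $W_h(1-\tilde h_i^2)$.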

The three cases mentioned earlier are explained in more detail here. On slide 4 the author states that units with short-term dependencies will have active reset gates $r_t$, and units with long-term dependencies will have active update gates $z_t$.

I would like some help with:

(1) Understanding the status of the gates in relation to short/long-term dependencies.

(2) In the second case, when $z = r = 0$, $\frac{\partial h_i}{\partial h_{i-1}}$ is $0$. Why does $\frac{\partial L_t}{\partial U_z}$ still avoid shrinking to zero?


For reference, related expressions appear as Eqs. (8) and (9) of Rehmer & Kroll (2020); the equation images are not reproduced here.

Best Answer

> (1) Understanding the status of the gates in relation to short/long-term dependencies.

An active update gate $z_t$ (i.e., $z_t$ close to $1$) means that the information inside $h_{t-1}$ is propagated to $h_t$, because the update equation is $h_t = z_t h_{t-1} + (1-z_t)\tilde h_t$. So the information travels across time steps, which corresponds to a long-term dependency.

An active reset gate $r_t$ means that $\tilde h_t$ has information from the previous time step integrated into it, via the equation $\tilde h_t = \tanh(W x_t + r_t U h_{t-1})$. At the same time, if $z_t$ is not very active, this information is conveyed to the next steps but dies out quickly. That is a short-term dependency.
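
To see the two regimes numerically, here is a minimal one-unit sketch of the update equations above (my own toy, with arbitrary placeholder weights and the gates held fixed): when $z_t$ stays close to $1$ the old state is carried almost unchanged, and when $z_t$ is near $0$ the state is overwritten by $\tilde h_t$ and the old value fades within a few steps.

```python
import numpy as np

def gru_step(h_prev, x, z, r, W, U):
    """One GRU step with the gates z, r held fixed for illustration
    (in a real GRU, z and r are sigmoid functions of x and h_prev)."""
    h_tilde = np.tanh(W * x + r * U * h_prev)   # candidate state
    return z * h_prev + (1 - z) * h_tilde       # convex combination

W, U = 0.7, 0.9           # arbitrary placeholder weights
x = 0.1                   # constant input for simplicity

# long-term regime: update gate ~ 1, the old state is carried through
h_long = 1.0
for t in range(10):
    h_long = gru_step(h_long, x, z=0.99, r=0.5, W=W, U=U)
print("z ~ 1        :", h_long)   # stays close to the initial value 1.0

# short-term regime: update gate ~ 0, reset gate active
h_short = 1.0
for t in range(10):
    h_short = gru_step(h_short, x, z=0.01, r=0.99, W=W, U=U)
print("z ~ 0, r ~ 1 :", h_short)  # the initial value is quickly forgotten
```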

> (2) In the second case, when $z = r = 0$, $\frac{\partial h_i}{\partial h_{i-1}}$ is $0$. Why does $\frac{\partial L_t}{\partial U_z}$ still avoid shrinking to zero?

The derivative is not zero. Even in your formula it involves a $W_h$ term, though I'm not sure of the exact form. More importantly, $z_t$ and $r_t$ are sigmoid outputs that themselves depend on $h_{t-1}$; they never reach exactly $0$, so the derivative does not vanish, at least not in a simple way.
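
A quick numerical check of this point (my own sketch, with arbitrary placeholder weights): in a scalar GRU cell where $z$ and $r$ are genuine sigmoid functions of $h_{t-1}$, pushing the gate biases strongly negative makes $z, r \approx 0$, yet a finite-difference estimate of $\partial h_t/\partial h_{t-1}$ comes out small but not exactly zero.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(h_prev, x, p):
    """Scalar GRU cell in which both gates depend on h_prev."""
    z = sigmoid(p["Wz"] * x + p["Uz"] * h_prev + p["bz"])
    r = sigmoid(p["Wr"] * x + p["Ur"] * h_prev + p["br"])
    h_tilde = np.tanh(p["Wh"] * x + r * p["Uh"] * h_prev)
    return z * h_prev + (1 - z) * h_tilde

# arbitrary placeholder weights; large negative biases push z and r toward 0
params = {"Wz": 0.4, "Uz": 0.3, "bz": -8.0,
          "Wr": 0.2, "Ur": 0.5, "br": -8.0,
          "Wh": 0.7, "Uh": 0.9}

h_prev, x, eps = 0.6, 0.1, 1e-6
d = (gru_cell(h_prev + eps, x, params)
     - gru_cell(h_prev - eps, x, params)) / (2 * eps)
print("finite-difference dh_t/dh_{t-1}:", d)   # small, but not exactly zero
```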