Recurrent neural network stability

dynamical systems, neural networks

I'm reading some papers on the stability of neural networks, mainly from a dynamical systems point of view.

An RNN can be thought of as $h_t=f(h_{t-1},x_t,\theta)$, where $\theta$ represents parameters that are adjusted while training the model, $x_t$ is the input to the RNN at step $t$, and $h_t$ is the hidden state of the network.
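For concreteness, one common instance of such an $f$ is the vanilla (Elman) cell $h_t=\tanh(W_h h_{t-1} + W_x x_t + b)$. A minimal NumPy sketch (the shapes, names, and random weights here are illustrative, not from any particular paper):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One step of a vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W_h = rng.normal(scale=0.5, size=(d_h, d_h))  # recurrent weights (part of theta)
W_x = rng.normal(scale=0.5, size=(d_h, d_x))  # input weights (part of theta)
b = np.zeros(d_h)

h = np.zeros(d_h)                      # common choice: h_0 = 0
for x_t in rng.normal(size=(5, d_x)):  # a length-5 input sequence
    h = rnn_step(h, x_t, W_h, W_x, b)
```

Because of the $\tanh$, the hidden state always stays inside $(-1,1)^{d_h}$, regardless of the input.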

I have found some papers that simplify the RNN by ignoring the input $x_t$, turning it into $h_t=f(h_{t-1},\theta)$. Then, fixing $\theta$, they turn this into the differential equation $\dot{h}(t)=f(h(t),\theta)$ and apply stability theorems to this new system.
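Concretely, the link between the discrete recurrence and the differential equation is a forward-Euler discretization: iterating $h_{t+1} = h_t + \tau f(h_t,\theta)$ approximates the continuous-time flow as $\tau \to 0$. A minimal sketch (the field $f$ and the step size are illustrative):

```python
import numpy as np

def f(h):
    """Derivative field of the autonomous system h'(t) = -h(t)."""
    return -h

def euler_trajectory(h0, t_end, tau):
    """Iterate the discrete recurrence h <- h + tau * f(h), a forward-Euler
    approximation of the continuous-time solution at time t_end."""
    h = np.asarray(h0, dtype=float)
    for _ in range(int(t_end / tau)):
        h = h + tau * f(h)
    return h

h0 = np.array([1.0])
approx = euler_trajectory(h0, t_end=1.0, tau=1e-3)
exact = h0 * np.exp(-1.0)  # closed-form solution: h(t) = h(0) exp(-t)
```

For this linear field the discretization error at $t=1$ is of order $\tau$, so `approx` and `exact` agree to a few decimal places.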

Later they try to define an $f$ that fulfills the stability conditions of the theorems, both on the pure differential-equation side and on the numerical side.

Those theorems are stated in the following form (which can be found in many differential equations books).

Given the differential equation $\dot{h}(t)=f(h(t),\theta)$, if the real parts of the eigenvalues of $Jf$, the Jacobian of $f$, are negative, then the solutions are stable.
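As an illustration of this criterion (a hedged sketch; the matrices are made up), take a linear system $f(h)=Ah$, whose Jacobian is $A$ everywhere, and test the signs of the real parts of its eigenvalues:

```python
import numpy as np

A_stable = np.array([[-1.0, 2.0],
                     [0.0, -0.5]])    # triangular: eigenvalues -1 and -0.5
A_unstable = np.array([[0.3, 0.0],
                       [1.0, -2.0]])  # triangular: eigenvalues 0.3 and -2

def is_stable(J):
    """Eigenvalue criterion: all eigenvalues of the Jacobian have Re < 0."""
    return bool(np.all(np.linalg.eigvals(J).real < 0))

print(is_stable(A_stable))    # True
print(is_stable(A_unstable))  # False: one eigenvalue has positive real part
```

For a linear system this criterion is exact; for a nonlinear $f$ it only tells you about behavior near the point where the Jacobian is evaluated.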

The definition of stability is, I think, the standard one: a solution $h(t)$ is stable if, given $h(0)$ and for any $\epsilon > 0$, there exists $\delta > 0$ such that if $y(t)$ is another solution with $\|y(0)-h(0)\| \leq \delta$, then $\|y(t)-h(t)\|\leq \epsilon$ for all $t\geq 0$.

As you can see, this notion of stability mainly references the hidden state and has nothing to do with the input to the RNN. In applications, for example NLP, you usually do not have a value for $h_0$ and set it to $0$. So if your RNN is for sentiment analysis, for instance, and it outputs a sentiment class (happy, bad, etc.), the output will depend purely on the input and not on the initial hidden state. That being said, I don't fully understand why stability is then studied from the point of view of $h$, and why this is expected to help.

Here are two of the papers I'm looking at: ODE-RU: A dynamical system view on recurrent neural networks, and AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks.

Does anyone have some understanding on this that could help me clarify it?

Best Answer

I think there is some confusion in your question. Once we have that settled, everything will presumably be clearer.

  1. (Minor) Be careful not to confuse continuous-time dynamical systems with discrete-time dynamical systems. Their functional forms are similar, but the meaning of the terms is quite different. When you have $\dot{h} = f(h)$, that $f$ represents the derivative of the system. The actual trajectory will have a different expression (the integral of that function) that we can write as $h(t) = F(h(0), t)$. This is valid for any $t$, so we can write it as $h(t+\tau) = F(h(t), \tau)$. A discrete-time dynamical system is actually an approximation of this (where you basically fix a $\tau$ as your unit of time).
  2. Addressing your question now, we start by considering where that definition using the Jacobian comes from. From Wikipedia:

Many parts of the qualitative theory of differential equations and dynamical systems deal with asymptotic properties of solutions and the trajectories—what happens with the system after a long period of time. The simplest kind of behavior is exhibited by equilibrium points, or fixed points, and by periodic orbits. If a particular orbit is well understood, it is natural to ask next whether a small change in the initial condition will lead to similar behavior. Stability theory addresses the following questions: Will a nearby orbit indefinitely stay close to a given orbit? Will it converge to the given orbit? In the former case, the orbit is called stable; in the latter case, it is called asymptotically stable and the given orbit is said to be attracting.

An equilibrium solution $f_e$ to an autonomous system of first-order ordinary differential equations is called stable if for every (small) $\epsilon > 0$ there exists a $\delta > 0$ such that every solution $f(t)$ having initial conditions within distance $\delta$ of the equilibrium, i.e. $\|f(t_0) - f_e\| < \delta$, remains within distance $\epsilon$, i.e. $\|f(t) - f_e\| < \epsilon$, for all $t \ge t_0$.

(Note that $f$ here refers to what I've called $F$ above, i.e., the solution of the system.) So people have wondered how to determine whether an orbit is stable, and among many criteria, one of the most used is the linearization of the system around points of the orbit. Again from Wikipedia:

One of the key ideas in stability theory is that the qualitative behavior of an orbit under perturbations can be analyzed using the linearization of the system near the orbit. In particular, at each equilibrium of a smooth dynamical system with an n-dimensional phase space, there is a certain n×n matrix A whose eigenvalues characterize the behavior of the nearby points (Hartman–Grobman theorem).

This is why the Jacobian helps us determine the stability: we are basically asking, "what happens if I'm not exactly on that point but really close to it?".

  3. You have probably noticed that this definition of stability is fairly similar to yours, with the difference that in your definition the two trajectories are generic (not an orbit). But we can show that the notion of stability you are using is again determined by the Jacobian. Let us define

$$ \Delta(t) := y(t)-h(t) $$ as the (small) difference between the two trajectories. This means that $y = \Delta + h$. We have

$$ \dot\Delta = f(y) - f(h) $$

And since $\Delta$ is small, we can linearize the function around $h$

$$ f(y)=f(h+\Delta) \approx f(h) + Jf(h) \cdot \Delta $$

Which implies

$$ \dot\Delta = f(h) + Jf(h) \cdot \Delta - f(h) = Jf(h) \cdot \Delta $$

This means that (locally in time and space) a neighbor trajectory of $h(t)$ will approach it or not based on the eigenvalues of the Jacobian matrix.
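A quick numerical sanity check of this argument (illustrative, not from the papers you linked): pick an $f$ whose Jacobian has eigenvalues with negative real parts, integrate two nearby trajectories with forward Euler, and watch $\|\Delta(t)\|$ shrink.

```python
import numpy as np

A = np.array([[-1.0, 0.5],
              [-0.5, -1.0]])  # eigenvalues -1 +/- 0.5i: negative real parts

def f(h):
    return A @ h  # linear field, so the Jacobian of f is A everywhere

def integrate(h0, steps=2000, tau=1e-2):
    """Forward-Euler integration of h' = f(h) up to time steps * tau."""
    h = np.array(h0, dtype=float)
    for _ in range(steps):
        h = h + tau * f(h)
    return h

h = integrate([1.0, 0.0])
y = integrate([1.0 + 1e-3, 0.0])  # perturbed initial condition

delta_0 = 1e-3                     # ||Delta(0)||
delta_T = np.linalg.norm(y - h)    # ||Delta(T)||, much smaller
```

Because the real parts are negative, the gap between the two trajectories decays (here roughly like $e^{-t}$), so `delta_T` ends up far below `delta_0`.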

  4. Unfortunately, all of this only applies to autonomous dynamical systems. When we have input-driven systems (or, in general, non-autonomous systems), things can get really complicated, and, in fact, simple things, like defining what an attracting point is, tend to become difficult. See, for example, this paper.
  5. The two papers you linked (along with many others) are interested in this kind of stability because it is linked to the stability of backpropagation. The AntisymmetricRNN paper explains this really nicely.
  6. When you say:

In applications for example NLP you usually do not have a value for $h_0$ and set it to $0$ so if your RNN is for sentiment analysis for instance and your RNN output a sentiment class (happy, bad, etc) it will purely depend on the input and not on the initial hidden state.

This is correct, and to understand why the notion of stability is still important in your case, imagine that you have an input $u(t)$ and a neural network $\dot{h} = f(h(t),u(t); \theta)$, where $\theta$ is the set of parameters you can tune. What happens when we change the parameters (e.g., go through one training epoch) and get a new set $\theta'$? This gives new dynamics, namely $\dot{h}' = f(h'(t),u(t); \theta')$, and we may ask ourselves: will the dynamics of this new model stay close to the old one when receiving the same input (in NLP, when reading the same sentence)? This is a really important question, as the state the network is in at the end of the sentence (let's call it $h(T)$) will determine the label assigned to that sentence!
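A hedged sketch of this experiment (NumPy; all names, sizes, and weight scales are made up): run the same input sequence through the model before and after a small parameter update, both started from $h_0 = 0$, and measure how far the final state $h(T)$ moved.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_x, T = 8, 4, 20
W_x = rng.normal(scale=0.3, size=(d_h, d_x))               # shared input weights
theta = rng.normal(scale=0.1, size=(d_h, d_h))             # recurrent weights before the update
theta_prime = theta + 1e-3 * rng.normal(size=(d_h, d_h))   # ... after a small "training step"
inputs = rng.normal(size=(T, d_x))                         # the same "sentence" for both models

def run(W_h):
    h = np.zeros(d_h)  # h_0 = 0, as in the question
    for x_t in inputs:
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

h_T = run(theta)
h_T_prime = run(theta_prime)
drift = np.linalg.norm(h_T_prime - h_T)  # how far h(T) moved under the parameter change
```

With small recurrent weights the dynamics are contractive, so the drift of $h(T)$ stays of the same order as the parameter perturbation; with larger weights it could be amplified at every step, which is exactly the stability concern.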

  7. Practically, to improve learning, what you want is to change the behavior of the network for the signals (sentences) that are misclassified, but keep the same behavior for the signals that were classified correctly. This is, of course, really hard to approach with theory, but I hope that you now get the idea.

This is already a long answer, but I feel that I've skipped a lot of the saucy details. I've also been really informal in the math, as I only wanted to deliver the main ideas: things are much more complicated than that!
