I'm reading some papers on the stability of neural networks, mainly from a dynamical systems point of view.
An RNN can be thought of as $h_t=f(h_{t-1},x_t,\theta)$, where $\theta$ represents the parameters that are adjusted while training the model, $x_t$ is the input to the RNN, and $h_t$ is the hidden state of the network.
I have found some papers that simplify the RNN by ignoring the input $x_t$, turning it into $h_t=f(h_{t-1},\theta)$. Then, fixing $\theta$, they turn this into a differential equation $\dot{h}(t)=f(h(t),\theta)$ and apply stability theorems to this new system.
Later they try to define an $f$ that fulfills the stability conditions of the theorems, both on the pure differential equation side and on the numerical (discretized) side.
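For concreteness, here is a minimal sketch (my own NumPy toy with made-up sizes, following, as I understand it, the antisymmetric construction the AntisymmetricRNN paper proposes) of such an $f$: the recurrent matrix is forced to be antisymmetric, so its eigenvalues are purely imaginary, a small shift $-\gamma I$ pushes them into the left half-plane, and a forward-Euler step turns the ODE back into an RNN update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes and constants, just for illustration.
n_hidden, n_input = 8, 4
gamma, eps = 0.1, 0.05          # diffusion strength and Euler step size

W = rng.normal(size=(n_hidden, n_hidden))
V = rng.normal(size=(n_hidden, n_input))
b = np.zeros(n_hidden)

# Antisymmetric recurrent matrix, shifted slightly into the left half-plane.
A = (W - W.T) - gamma * np.eye(n_hidden)

def f(h, x):
    """Right-hand side of the ODE  h'(t) = f(h(t), x(t))."""
    return np.tanh(A @ h + V @ x + b)

def rnn_step(h, x):
    """Forward-Euler discretization: the RNN update h_t = h_{t-1} + eps * f(h_{t-1}, x_t)."""
    return h + eps * f(h, x)

# All eigenvalues of A have real part exactly -gamma < 0.
print(np.linalg.eigvals(A).real.max())   # ~ -0.1
```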
Those theorems, which can be found in differential equations books, are stated in a form like this: given the differential equation $\dot{h}(t)=f(h(t),\theta)$, if the real parts of the eigenvalues of $Jf$, the Jacobian of $f$, are negative, then the solutions are stable.
The definition of stability is, I think, the standard one: a solution $h(t)$ is stable if, for any $\epsilon > 0$, there exists $\delta > 0$ such that if $y(t)$ is another solution with $\|y(0)-h(0)\| \leq \delta$, then $\|y(t)-h(t)\| \leq \epsilon$ for all $t \geq 0$.
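To see what the theorem and the definition are saying, here is a small numerical check (a toy example of mine, not from the papers): take a linear $f(h)=Ah$ whose eigenvalues have negative real parts, integrate two solutions with nearby initial conditions, and watch the gap $\|y(t)-h(t)\|$ stay bounded and shrink:

```python
import numpy as np

# Toy linear system h' = A h with eigenvalues -1 +/- 2i (negative real parts).
A = np.array([[-1.0, 2.0],
              [-2.0, -1.0]])
print(np.linalg.eigvals(A))          # [-1.+2.j, -1.-2.j]

def euler(h0, dt=0.01, steps=1000):
    """Integrate h' = A h by forward Euler from the initial condition h0."""
    h = np.array(h0, dtype=float)
    traj = [h.copy()]
    for _ in range(steps):
        h = h + dt * (A @ h)
        traj.append(h.copy())
    return np.array(traj)

h = euler([1.0, 0.0])
y = euler([1.0 + 1e-3, 0.0])         # perturbed initial condition, delta = 1e-3

gap = np.linalg.norm(y - h, axis=1)
print(gap[0], gap.max(), gap[-1])    # the gap never grows and decays towards 0
```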
As you can see from the definition, the stability is stated purely in terms of the hidden state and has nothing to do with the input to the RNN. In applications such as NLP you usually do not have a value for $h_0$ and just set it to $0$, so if your RNN does sentiment analysis, for instance, and outputs a sentiment class (happy, sad, etc.), the output depends purely on the input and not on the initial hidden state. That being said, I don't fully understand why stability is then studied from the point of view of $h$, and why this is expected to help.
Here are two of the papers I'm looking at: "ODE-RU: A dynamical system view on recurrent neural networks" and "AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks".
Does anyone have some understanding of this that could help me clarify it?
Best Answer
I think that there is some confusion in your question. Once we have that settled, everything will presumably be clearer.
People have been wondering how to understand whether an orbit (a solution $h(t)$ of the system) is stable, and among many criteria, one of the most used is the linearization of the system around points of the orbit: if the eigenvalues of the Jacobian of $f$ at those points all have negative real parts, nearby trajectories are pulled back towards the orbit.
This is why the Jacobian helps us determine the stability: we are basically asking, "what happens if I'm not exactly on that point but really close to it?".
Define $$ \Delta(t) := y(t)-h(t) $$ as the (small) difference between the two trajectories. This means that $y= \Delta + h$. We have
$$ \dot\Delta = f(y) - f(h) $$
And since $\Delta$ is small, we can linearize the function around $h$
$$ f(y)=f(h+\Delta) \approx f(h) + Jf(h) \cdot \Delta $$
Which implies
$$ \dot\Delta = f(h) + Jf(h) \cdot \Delta - f(h) = Jf(h) \cdot \Delta $$
This means that (locally in time and space) a neighboring trajectory of $h(t)$ will approach it or not depending on the real parts of the eigenvalues of the Jacobian matrix.
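Here is a quick sketch of that linearization argument on a toy nonlinear $f$ of my own choosing: integrate $h$, $y$, and the linearized perturbation $\Delta$ side by side, and check that $y-h \approx \Delta$ while the difference is small:

```python
import numpy as np

A = np.array([[-1.0, 2.0],
              [-2.0, -1.0]])

def f(h):
    return np.tanh(A @ h)

def Jf(h):
    """Jacobian of f at h: diag(1 - tanh(Ah)^2) @ A."""
    s = 1.0 - np.tanh(A @ h) ** 2
    return s[:, None] * A

dt, steps = 0.01, 500
h = np.array([1.0, 0.0])
y = h + np.array([1e-3, 0.0])        # a nearby trajectory
d = y - h                            # the linearized perturbation Delta

for _ in range(steps):
    d = d + dt * (Jf(h) @ d)         # Delta' = Jf(h(t)) Delta, evaluated along h(t)
    h = h + dt * f(h)
    y = y + dt * f(y)

print(np.linalg.norm(y - h))         # the true gap between the trajectories
print(np.linalg.norm((y - h) - d))   # linearization error: much smaller than the gap
```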
Now, your observation that this notion of stability refers to the hidden state and not to the input is clearly correct. To understand why it is still important in your case, imagine that you have an input $u(t)$ and a neural network $\dot{h} = f(h(t),u(t); \theta)$, where $\theta$ is the set of parameters you can tune. What happens when we change the parameters (e.g., go through one training epoch) and we have a new set $\theta'$? This gives a new dynamics, namely $\dot{h}' = f(h'(t),u(t); \theta')$, and we may ask ourselves: will the dynamics of this new model stay close to the old one when receiving the same input (in NLP, when reading the same sentence)? This is a really important question, as the state the network is in at the end of the sentence (let's call it $h(T)$) will affect the label assigned to that sentence!
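A tiny sketch of that last question (a hypothetical setup, reusing the antisymmetric cell from above): drive two copies of the network, whose parameters differ by a small update, with the same input sequence, and compare the final states $h(T)$ that the classifier would see:

```python
import numpy as np

rng = np.random.default_rng(1)
n_hidden, n_input, T = 8, 4, 50
gamma, eps = 0.1, 0.05

def make_cell(W, V):
    """Euler-discretized cell with an antisymmetric recurrent matrix, as above."""
    A = (W - W.T) - gamma * np.eye(n_hidden)
    return lambda h, x: h + eps * np.tanh(A @ h + V @ x)

W = rng.normal(size=(n_hidden, n_hidden))
V = rng.normal(size=(n_hidden, n_input))
xs = rng.normal(size=(T, n_input))   # "the same sentence" fed to both models

# theta vs theta': a small parameter change, e.g. after one training step.
cell_old = make_cell(W, V)
cell_new = make_cell(W + 1e-3 * rng.normal(size=W.shape), V)

h_old = h_new = np.zeros(n_hidden)   # the usual h_0 = 0 initialization
for x in xs:
    h_old, h_new = cell_old(h_old, x), cell_new(h_new, x)

# How far apart are the states h(T) and h'(T) that the classifier would see?
print(np.linalg.norm(h_old - h_new))
```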
This is already a long answer, but I feel that I've skipped a lot of the juicy details. I've also been really informal with the math, as I only wanted to convey the main ideas: things are much more complicated than that!