I think you need target values. So for the sequence $(x_1, x_2, x_3)$, you'd need corresponding targets $(t_1, t_2, t_3)$. Since you seem to want to predict the next term of the original input sequence, you'd need:
$$t_1 = x_2,\ t_2 = x_3,\ t_3 = x_4$$
You'd need to define $x_4$, so if you had an input sequence of length $N$ to train the RNN with, you'd only be able to use the first $N-1$ terms as input values and the last $N-1$ terms as target values.
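For example, a minimal sketch in plain Python (the sequence values are made up, just to show how the inputs and targets line up):

```python
# Hypothetical sequence of N = 5 observations.
x = [0.1, 0.4, 0.3, 0.8, 0.6]

# Inputs are the first N-1 terms, targets are the last N-1 terms,
# so that targets[i] is simply the next value after inputs[i].
inputs = x[:-1]   # x_1 ... x_{N-1}
targets = x[1:]   # x_2 ... x_N

for x_t, t_t in zip(inputs, targets):
    print(f"input {x_t} -> target {t_t}")
```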
> If we use a sum of square error term for the objective function, then how is it defined?
As far as I'm aware, you're right - the error is the sum across the whole sequence. This is because the weights $u$, $v$ and $w$ are the same across the unfolded RNN.
So,
$$E = \sum\limits_t E^t = \sum\limits_t (t^t - p^t)^2$$
where $p^t$ is the network's prediction at time $t$ and $t^t$ is the corresponding target.
> Are weights updated only once the entire sequence was looked at (in this case, the 3-point sequence)?
Yes, I believe so, if you're using back propagation through time.
As for the differentials, you won't want to expand the whole expression out for $E$ and differentiate it when it comes to larger RNNs. So, some notation can make it neater:
- Let $z^t$ denote the input to the hidden neuron at time $t$ (i.e. $z^1 = ws + vx^1$)
- Let $y^t$ denote the output of the hidden neuron at time $t$ (i.e. $y^1 = \sigma(ws + vx^1)$)
- Let $y^0 = s$
- Let $\delta^t = \frac{\partial E}{\partial z^t}$
Then, the derivatives are:
$$\begin{align}\frac{\partial E}{\partial u} &= \sum\limits_t\frac{\partial E^t}{\partial p^t}y^t \\\\
\frac{\partial E}{\partial v} &= \sum\limits_t\delta^tx^t \\\\
\frac{\partial E}{\partial w} &= \sum\limits_t\delta^ty^{t-1}
\end{align}$$
Where $t \in [1,\ T]$ for a sequence of length $T$, the output is linear, $p^t = uy^t$, $\delta^{T+1} = 0$, and:
$$\begin{equation}
\delta^t = \sigma'(z^t)\left(\frac{\partial E^t}{\partial p^t}u + \delta^{t+1}w\right)
\end{equation}$$
This recurrence comes from realising that the $t^{th}$ hidden activity not only affects the error at the $t^{th}$ output, $E^t$, but also affects the rest of the error further down the RNN, $E - E^t$:
$$\begin{align}
\frac{\partial E}{\partial z^t} &= \frac{\partial E^t}{\partial y^t}\frac{\partial y^t}{\partial z^t} + \frac{\partial (E - E^t)}{\partial z^{t+1}}\frac{\partial z^{t+1}}{\partial y^t}\frac{\partial y^t}{\partial z^t} \\\\
\frac{\partial E}{\partial z^t} &= \frac{\partial y^t}{\partial z^t}\left(\frac{\partial E^t}{\partial y^t} + \frac{\partial (E - E^t)}{\partial z^{t+1}}\frac{\partial z^{t+1}}{\partial y^t}\right) \\\\
\frac{\partial E}{\partial z^t} &= \sigma'(z^t)\left(\frac{\partial E^t}{\partial p^t}u + \frac{\partial (E - E^t)}{\partial z^{t+1}}w\right) \\\\
\delta^t = \frac{\partial E}{\partial z^t} &= \sigma'(z^t)\left(\frac{\partial E^t}{\partial p^t}u + \delta^{t+1}w\right)
\end{align}$$
using $\frac{\partial E^t}{\partial y^t} = \frac{\partial E^t}{\partial p^t}u$, $\frac{\partial z^{t+1}}{\partial y^t} = w$, $\frac{\partial y^t}{\partial z^t} = \sigma'(z^t)$ and $\frac{\partial (E - E^t)}{\partial z^{t+1}} = \frac{\partial E}{\partial z^{t+1}} = \delta^{t+1}$ (since $E^t$ does not depend on $z^{t+1}$).
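To make these formulas concrete, here is a minimal numpy sketch of the forward pass and the BPTT gradients. It assumes the single-hidden-neuron setup above with a logistic activation and linear output $p^t = uy^t$; the function name and the toy numbers are my own, purely for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bptt_gradients(x, t, u, v, w, s):
    """Forward pass and BPTT gradients for the scalar RNN described above."""
    T = len(x)
    z = np.zeros(T)
    y = np.zeros(T + 1)
    y[0] = s                                  # y^0 = s
    for i in range(T):                        # forward: z^t = w*y^{t-1} + v*x^t
        z[i] = w * y[i] + v * x[i]
        y[i + 1] = sigmoid(z[i])              # y^t = sigma(z^t)
    p = u * y[1:]                             # linear output p^t = u*y^t
    E = np.sum((t - p) ** 2)                  # E = sum_t (t^t - p^t)^2

    dE_dp = 2.0 * (p - t)                     # dE^t/dp^t for the squared error
    delta = np.zeros(T + 1)                   # delta[T] plays the role of delta^{T+1} = 0
    for i in reversed(range(T)):              # delta^t = sigma'(z^t)(dE^t/dp^t * u + w*delta^{t+1})
        sig_prime = y[i + 1] * (1.0 - y[i + 1])
        delta[i] = sig_prime * (dE_dp[i] * u + w * delta[i + 1])

    dE_du = np.sum(dE_dp * y[1:])             # sum_t dE^t/dp^t * y^t
    dE_dv = np.sum(delta[:T] * x)             # sum_t delta^t * x^t
    dE_dw = np.sum(delta[:T] * y[:T])         # sum_t delta^t * y^{t-1}
    return E, dE_du, dE_dv, dE_dw

# Toy usage with made-up numbers: a length-3 input and its next-step targets.
x = np.array([0.1, 0.4, 0.3])
t = np.array([0.4, 0.3, 0.8])
print(bptt_gradients(x, t, u=0.5, v=-0.3, w=0.8, s=0.0))
```

A quick finite-difference check on $u$, $v$ and $w$ is an easy way to confirm the recursion has been implemented correctly.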
> Besides doing it that way, this doesn't look like vanilla back-propagation to me, because the same parameters appear in different layers of the network. How do we adjust for that?
This method is called back propagation through time (BPTT), and it is similar to standard back propagation in the sense that it uses repeated application of the chain rule; the shared parameters are handled by summing each weight's gradient contributions over all the time steps at which it appears, as in the expressions above.
A more detailed but complicated worked example for an RNN can be found in Section 3.2 of 'Supervised Sequence Labelling with Recurrent Neural Networks' by Alex Graves - a really interesting read!
It depends entirely on the nature of your data and its internal correlations; there is no rule of thumb. However, given a large amount of data, a 2-layer LSTM can model a large body of time-series problems and benchmarks.
Furthermore, you usually don't backpropagate through time over the whole series, but only over the last 200-300 steps (truncated BPTT). To find the optimal truncation length you can cross-validate using grid search or Bayesian optimisation. You can also have a look at the parameters here: https://github.com/wojzaremba/lstm/blob/master/main.lua.
So the sequence length doesn't really change the model itself; a longer series is more like having more training examples, except that you keep the previous hidden state from one window to the next instead of resetting it.
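A rough sketch of that loop in plain Python (the series, the `bptt_steps` value and the `train_on_window` helper are hypothetical placeholders, not any particular library's API), just to show how the series is chunked and the hidden state carried across windows:

```python
import numpy as np

# Stand-in for a long time series; in practice this is your own data.
series = np.sin(0.01 * np.arange(10_000))
bptt_steps = 200                                 # truncation length (e.g. 200-300)

def train_on_window(window, state):
    """Placeholder for one truncated-BPTT update; returns the new hidden state.

    A real implementation would unroll the RNN over `window` only, backpropagate
    through those bptt_steps steps, update the weights, and return the final
    hidden state so the next window starts where this one ended.
    """
    return state  # no-op stand-in

state = 0.0                                      # initial hidden state
for start in range(0, len(series) - 1, bptt_steps):
    window = series[start:start + bptt_steps + 1]  # inputs plus next-step targets
    state = train_on_window(window, state)       # gradients stop at the window edge
```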
What the book mentions and what the author of the post meant are two different things.
As the book mentions, 'unfolding' depends on the length of the input sequence. To understand this, suppose you want to lay down the exact computations that happen in an RNN; in that case, you have to 'unfold' the network, and the size of the 'unfolded' graph depends on the length of the input sequence. For more information refer to this page. It says that "By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word."
In the case of the post, what the author meant is that during training you need 'unrolling' because you have to store the activations/hidden states of every time step for backpropagation. During testing you don't back-propagate, so you only need to carry the latest hidden state forward rather than keep the whole history, and no 'unrolling' is required.
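A small sketch of that distinction, reusing the scalar-RNN notation from the earlier answer (the weights are toy values of my own, purely for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy scalar RNN weights: u (hidden-to-output), v (input), w (recurrent).
u, v, w = 0.5, -0.3, 0.8

def forward_train(x, s):
    """Training-time forward pass: keep every z^t and y^t for backpropagation."""
    zs, ys = [], [s]
    for x_t in x:
        z_t = w * ys[-1] + v * x_t
        zs.append(z_t)
        ys.append(sigmoid(z_t))
    return zs, ys                      # the full unrolled history is stored

def predict_step(x_t, y_prev):
    """Test-time step: only the previous hidden state is needed, nothing is stored."""
    y_t = sigmoid(w * y_prev + v * x_t)
    return u * y_t, y_t                # prediction and new state

# At test time you just roll the state forward, one step at a time.
state = 0.0
for x_t in [0.1, 0.4, 0.3]:
    prediction, state = predict_step(x_t, state)
```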