Solved – Why does the neural net fail to learn higher frequency sine waves

gradient descent, machine learning, neural networks

I am testing my neural network implementation. I have an input layer with a single unit, one hidden layer consisting of 65 tanh units, and an output layer consisting of a single linear unit.

My data set consists of 100000 points $x_1, x_2, \ldots, x_{100000}$ sampled uniformly from $[-1,1]$, with corresponding targets $\cos(16x_i)$ for $i = 1, \ldots, 100000$.

I initialize the hidden layer's input weights to uniform random values in $[-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}]$ and the output layer's weights to uniform random values in $[-\frac{1}{\sqrt{66}}, \frac{1}{\sqrt{66}}]$.

I'm using a fixed learning rate of $\nu = 0.05$.

After training the network for $100 \times 100000$ steps, I don't seem to be getting good results. I test the network on $100000$ values in $[-1,1]$, and the resulting function looks nothing like $\cos(16x)$.
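For reference, here is a simplified NumPy sketch of the setup (per-sample stochastic gradient descent on squared error; the zero bias initialization, the sampling order, and the single-sample updates are simplifications for illustration rather than an exact copy of my code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: 100000 points in [-1, 1] with targets cos(16x)
N = 100_000
X = rng.uniform(-1.0, 1.0, size=N)
Y = np.cos(16.0 * X)

# 1 -> 65 -> 1 network: tanh hidden layer, linear output
# NOTE: zero bias initialization is an assumption for this sketch.
H = 65
W1 = rng.uniform(-1/np.sqrt(2), 1/np.sqrt(2), size=H)    # input-to-hidden weights
b1 = np.zeros(H)
W2 = rng.uniform(-1/np.sqrt(66), 1/np.sqrt(66), size=H)  # hidden-to-output weights
b2 = 0.0

lr = 0.05  # fixed learning rate

def forward(x):
    h = np.tanh(W1 * x + b1)   # hidden activations
    return b2 + W2 @ h, h      # linear output and hidden activations

# One pass of per-sample SGD on squared error (repeat ~100 times, as in the question)
for x, y in zip(X, Y):
    yhat, h = forward(x)
    err = yhat - y                   # d(0.5*(yhat - y)^2) / d(yhat)
    gW2 = err * h                    # gradient w.r.t. output weights
    gb2 = err
    gh = err * W2 * (1.0 - h**2)     # backprop through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    gW1 = gh * x
    gb1 = gh
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```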

I'm tracking the progress of the training by calculating the average error over each block of 5000 steps; here's what it looks like:

[Plot: average training error per 5000-step block]

When I use a simpler function such as $\cos(4x)$ to train and then test, I get much better results, which leads me to believe that my implementation is OK for the most part.

Any suggestions on what might be going wrong? Do I need more hidden units and more training data? More layers? Should I wait longer for gradient descent to converge, or use a different learning rate? The picture above suggests that gradient descent has more or less converged, but those average errors still look rather large.

Best Answer

At first glance it looks like your neural net might be doing about as well as it can, given the information it has. For a network with a single input unit and one hidden layer of 65 units, there is only so much you can achieve in terms of representing a complicated output function. The output of your network is given by

$f(x) = b^{(2)} + w_1^{(2)}g(w_1^{(1)} x + b_1^{(1)}) + w_2^{(2)}g(w_2^{(1)} x + b_2^{(1)}) + \cdots + w_{65}^{(2)}g(w_{65}^{(1)} x + b_{65}^{(1)})$

where $g(\cdot)$ is the $\tanh$ function in this case, the $b$'s are biases, $w_n^{(1)}$ is the weight mapping the input $x$ into hidden unit $n$, and $w_n^{(2)}$ is that unit's output weight. In other words, the output of your network will (optimally) be the best approximation of your desired function achievable with an expansion of 65 $\tanh$ functions. All your algorithm can do is find the weights that fit the function best; there is no way for it to produce a more complicated model.
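To make that concrete, here is everything the architecture can compute, written as a short function (the name `network_output` and the vectorized evaluation over an array of inputs are illustrative choices; the array shapes are the ones from your description):

```python
import numpy as np

def network_output(x, W1, b1, W2, b2):
    """Evaluate f(x) = b2 + sum_n W2[n] * tanh(W1[n] * x + b1[n]) for an array x.

    W1, b1, W2 are length-65 arrays and b2 is a scalar: everything this
    architecture can represent is a weighted sum of 65 shifted, scaled tanh curves.
    """
    return b2 + np.tanh(np.outer(x, W1) + b1) @ W2
```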

Let's compare with the Taylor series $\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots$ just for a sense of the limitations here. If the argument of the cosine is small, you get a good approximation by keeping just a few terms. As the argument gets larger, you need more and more terms for a reasonable approximation. So maybe 65 terms is enough to almost perfectly approximate $\cos(x)$ on the interval $[-1, 1]$, does a pretty good job for $\cos(4x)$ (where the argument of the cosine ranges over $[-4, 4]$), but fails completely for $\cos(16x)$ (where the argument ranges over $[-16, 16]$).
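As a rough numerical check of this intuition (the $10^{-2}$ error threshold and the 1001-point grid below are arbitrary choices, just to show how the term count grows with frequency), you can count how many Taylor terms are needed on $[-1, 1]$ for each argument scale:

```python
import numpy as np
from math import factorial

def taylor_cos_max_error(k, n_terms, xs=np.linspace(-1.0, 1.0, 1001)):
    """Max error on [-1, 1] of the first n_terms terms of the Taylor series of cos(k*x)."""
    approx = sum((-1.0)**m * (k * xs)**(2 * m) / factorial(2 * m)
                 for m in range(n_terms))
    return np.max(np.abs(approx - np.cos(k * xs)))

for k in (1, 4, 16):
    # Smallest number of terms that brings the max error under 1e-2
    n = next(n for n in range(1, 60) if taylor_cos_max_error(k, n) < 1e-2)
    print(f"cos({k}x) on [-1, 1]: {n} Taylor terms for max error < 1e-2")
```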

Since the above is just an educated guess at your problem, here are some suggestions to check:

  1. Restrict the interval to something like $[-0.1, 0.1]$ to see if you get a better approximation of $\cos(16x)$.

  2. If this is indeed the problem your network is having, you can verify that it is a high-bias model by plotting your training error and your testing error: for a high-bias model they should converge to about the same value.

  3. If you find you do indeed have a high-bias problem, your own suggestion to add another hidden layer or to increase the number of neurons in your hidden layer is a good one.

  4. You can also think about adding additional features to your input layer if possible. One thought that comes to mind is adding a feature that is the argument of the cosine reduced modulo $2\pi$ (i.e. $16x \bmod 2\pi$ in this case), as sketched below.
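Here is a minimal sketch of that last suggestion (the name `make_features` and the choice to keep the raw $x$ as a second column are illustrative, not prescriptions); note that the input layer would then have two units instead of one:

```python
import numpy as np

def make_features(x):
    """Augment the raw input with the cosine's argument reduced mod 2*pi.

    x is an array of inputs in [-1, 1]; the second column wraps 16*x into
    [0, 2*pi), which is all cos(16*x) actually depends on.
    """
    wrapped = np.mod(16.0 * x, 2.0 * np.pi)
    return np.column_stack([x, wrapped])

# Usage: feed both columns to a two-unit input layer
X = np.random.uniform(-1.0, 1.0, size=100_000)
features = make_features(X)        # shape (100000, 2)
targets = np.cos(16.0 * X)
```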
