Solved – Aren’t the iterations needed to train NN for XOR with MSE < 0.001 too high

machine learning, neural networks

I use a neural network which consists of:

  • input layer (2 neurons)
  • hidden layer (2 neurons, 2 biases)
  • output layer (1 neuron, 1 bias)

The weights and biases are randomly initialized from the range [-1, 1].

I use a learning rate of 1 (with 0.01, 0.1, 0.2, 0.5, 0.7, or 2 the NN needs more iterations to converge), sigmoid as the activation function, and stochastic gradient descent as the learning algorithm.
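For concreteness, this is roughly what my training loop looks like, sketched here in NumPy (my actual code differs, but the architecture, initialization, and update rule are as described above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng()

# 2-2-1 architecture, weights and biases drawn uniformly from [-1, 1]
W1, b1 = rng.uniform(-1, 1, (2, 2)), rng.uniform(-1, 1, 2)
W2, b2 = rng.uniform(-1, 1, (2, 1)), rng.uniform(-1, 1, 1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

lr = 1.0
for iteration in range(1_000_000):
    i = rng.integers(4)                      # stochastic GD: one pattern per update
    x, t = X[i], T[i]
    h = sigmoid(x @ W1 + b1)                 # hidden activations
    y = sigmoid(h @ W2 + b2)                 # output
    dy = (y - t) * y * (1 - y)               # backprop of squared error through sigmoid
    dh = (dy @ W2.T) * h * (1 - h)
    W2 -= lr * np.outer(h, dy); b2 -= lr * dy
    W1 -= lr * np.outer(x, dh); b1 -= lr * dh
    mse = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - T) ** 2)
    if mse < 0.001:
        break
```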

If the MSE is less than 0.001, the outputs for XOR look like:

  • [0, 0] -> 0.031
  • [0, 1] -> 0.971
  • [1, 0] -> 0.971
  • [1, 1] -> 0.030

And if the MSE is less than 0.0001, the output is:

  • [0, 0] -> 0.009
  • [0, 1] -> 0.991
  • [1, 0] -> 0.991
  • [1, 1] -> 0.008

So when I train the NN until MSE < 0.001, it usually takes ~10,000 iterations. Less often, maybe 1 time in 10, it takes ~40,000 iterations, sometimes even ~100,000 or ~1,000,000, and sometimes it can't reach this error at all (I give up when it hasn't gotten there within 1 billion iterations).

When I train it until MSE < 0.0001, the usual number of iterations is ~67,000. Less often, maybe 1 time in 20, it takes hundreds of thousands or millions of iterations, and sometimes it can't reach this error at all.

Thus, my questions are:

  • Is MSE < 0.001 good enough (not only for XOR, but also for other problems, like handwritten digit recognition)? Maybe 0.1 would be enough?
  • Isn't the number of iterations too high? I mean, what's the average number of iterations it should take?
  • Is it normal that it sometimes can't reach these small errors at all, or that reaching e.g. MSE < 0.001 takes hundreds of thousands or millions of iterations? Should I restart the NN when it doesn't converge?

Thanks in advance.

Best Answer

TensorFlow Playground is an interactive interface for experimenting with neural networks on toy problems. Since the authors have already done the QA on their code, it's very easy to compare your results to a "gold standard" for this toy problem.

With a network of 2 inputs, 2 hidden neurons, 1 output neuron, and sigmoid activations, training is slow. The decision boundary is not always the right shape, sometimes isolating just one quadrant or forming a diagonal -- there are lots of ways to orient a shape that is "mostly" right. After 4500 iterations it looks roughly right. Drawing a blue band may or may not be what you had in mind, but the extreme points farthest from the origin are all in the correct class; this is consistent with the 4 points given in the OP's toy data. In this sense, the results are consistent.

[Playground decision boundary of the 2-2-1 sigmoid network after ~4500 iterations]

You can let it run for a while to see whether your extremely long training times are reproduced in TensorFlow Playground.

Keep in mind that the fact that a network with this configuration (sigmoid units, 2-2-1 architecture) can represent XOR doesn't mean that it's easy to train. If we use a 2-4-2-1 architecture and $\tanh$ units, the problem is much easier. This is the network after 200 training iterations; it's basically perfect.

[Playground decision boundary of the 2-4-2-1 tanh network after 200 iterations]
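If you want to try that second configuration outside the Playground, here's a rough Keras sketch of the same 2-4-2-1 $\tanh$ architecture (the optimizer, learning rate, and epoch count are just plausible choices, not tuned values):

```python
import numpy as np
import tensorflow as tf

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# 2-4-2-1 network with tanh hidden units and a sigmoid output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation="tanh"),
    tf.keras.layers.Dense(2, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),
              loss="binary_crossentropy")
model.fit(X, y, epochs=2000, verbose=0)
print(model.predict(X))   # should be close to [0, 1, 1, 0]
```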

As an aside, I think part of the reason you're having trouble is that you're using MSE as the loss function (though there could also be bugs or other misspecification causing problems). MSE has shallow gradients; XOR can instead be framed as a classification task, and cross-entropy loss has steeper gradients.
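To see why, compare the gradient at the output unit. With a sigmoid output $\hat{y} = \sigma(z)$ and target $t$:

$$\frac{\partial}{\partial z}\,\tfrac{1}{2}(\hat{y}-t)^2 = (\hat{y}-t)\,\hat{y}(1-\hat{y}), \qquad \frac{\partial}{\partial z}\Big[-t\log\hat{y}-(1-t)\log(1-\hat{y})\Big] = \hat{y}-t.$$

The extra factor $\hat{y}(1-\hat{y})$ in the MSE case goes to zero whenever the output saturates, even if the prediction is badly wrong, which is exactly what makes those gradients shallow.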

Neural networks require more experimentation than other approaches, so if what you're trying doesn't suit your needs, try something else!

Here's a checklist of things that I would look at to try to get this to work, roughly in order (a small sketch of the initialization and scheduling items follows the list):

  • write unit tests & check for bugs
  • poor initialization of weights
  • learning rate too high/too low
  • learning rate scheduling
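As a rough illustration of the initialization and scheduling bullets (the constants here are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Glorot/Xavier-style initialization: scale the uniform range by the layer's
# fan-in and fan-out instead of always drawing from [-1, 1].
def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, (fan_in, fan_out))

W1 = glorot_uniform(2, 2)
W2 = glorot_uniform(2, 1)

# Simple step-decay schedule: start with a large learning rate and halve it
# every few thousand updates so late training stops overshooting the minimum.
def learning_rate(iteration, base_lr=1.0, decay=0.5, step=5000):
    return base_lr * decay ** (iteration // step)
```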