Solved – Neural Network – Success after changing weight initialization strategy. What’s the explanation?

backpropagation, convergence, gradient descent, neural networks, weights

I'm implementing a neural network in JavaScript to recognize handwritten digits while studying “Neural Networks and Deep Learning” by Michael Nielsen and following the feedforward and backpropagation algorithms detailed in “A Step by Step Backpropagation Example” by Matt Mazur.

I've been struggling for the past few days to make it converge on the MNIST training samples, to no avail. The network's total error always converged to about 0.5, or some value around that like 0.47. No matter what I tried, I could not lower the error, and with an error that high the outputs were completely useless.

I was initializing weights and biases with Math.random() (i.e. a random real number between 0 and 0.999…). While reading some questions on this site, I noticed that some people initialize their weights and biases to a random number between -0.5 and 0.5. I tried that and presto! My network converged to zero error in just a few training iterations. I've run it several times, so this is consistently happening now.
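
Concretely, the two strategies look roughly like this (a simplified sketch, not my actual code; the function names are just for illustration):

```javascript
// Original approach: every weight/bias drawn from [0, 1) -- all positive.
function initPositive(count) {
  return Array.from({ length: count }, () => Math.random());
}

// New approach: every weight/bias drawn from [-0.5, 0.5) -- centered on zero.
function initCentered(count) {
  return Array.from({ length: count }, () => Math.random() - 0.5);
}
```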

I had also tried my algorithm on XOR and other logical functions before moving on to MNIST, so I tend to think my implementation is correct (although surely it can be optimized, etc.).

So, how is it possible that this change in the initial random values of the weights made ALL the difference? Is it because I included a range of both negative and positive numbers (-0.5 to 0.5)? Should that matter? If negative weights are what the network needed, shouldn't backpropagation take care of that? (I mean, maybe taking longer to converge, but not getting stuck at 0.5 no matter how many training epochs I run.)

Other info about my network, just in case: it has 784 inputs, 15 hidden neurons and 10 outputs. Hidden and output activations are passed through the sigmoid function, and I'm using a learning rate of 0.5.
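
In case it helps, the forward pass is structured more or less like this (again a simplified sketch, not my actual code; `net.w1`, `net.b1`, etc. are just illustrative names):

```javascript
// Simplified sketch of the 784-15-10 forward pass with sigmoid activations.
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// One layer: output[j] = sigmoid(sum_i(weights[j][i] * input[i]) + biases[j])
function forwardLayer(weights, biases, input) {
  return weights.map((row, j) =>
    sigmoid(row.reduce((sum, w, i) => sum + w * input[i], biases[j]))
  );
}

// net.w1 is 15 x 784 and net.w2 is 10 x 15, with bias vectors b1 and b2.
function feedforward(net, pixels) {
  const hidden = forwardLayer(net.w1, net.b1, pixels); // 15 hidden activations
  return forwardLayer(net.w2, net.b2, hidden);         // 10 output activations
}
```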

Best Answer

The short answer is that initialization matters because, in deep learning, we're looking for the minimum of a non-convex function, which can have multiple local minima. You start somewhere and repeatedly move in the direction of the negative gradient, which can send you towards whichever local minimum happens to be nearby.
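
You can see this on a toy scale with gradient descent on a one-dimensional non-convex function that has two local minima (an assumed example, nothing to do with your network): where you end up depends entirely on where you start.

```javascript
// f(w) = w^4 - 3w^2 + w has two local minima, near w ≈ -1.30 and w ≈ 1.13.
const grad = (w) => 4 * w ** 3 - 6 * w + 1; // derivative f'(w)

function descend(w, learningRate = 0.01, steps = 1000) {
  for (let i = 0; i < steps; i++) w -= learningRate * grad(w); // step against the gradient
  return w;
}

console.log(descend(2));  // converges to the shallower minimum near w ≈ 1.13
console.log(descend(-2)); // converges to the deeper minimum near w ≈ -1.30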

Section 8.4 of the book Deep Learning by Goodfellow et al. talks about this extensively. They give heuristics for choosing a good initialization and explain the reasoning behind them. The book is conveniently available online for free here. Here are some highlights relevant to your question:

  • With a bad initialization, the algorithm might not converge at all due to numerical difficulties; it might take a long time to converge; or it might converge to a not-very-good local minimum.
  • The scale of the random initialization matters. On the one hand, you want the initial weights to be large enough to propagate information through the network. On the other hand, you can think of the initialization as a prior: you expect the final parameters to end up close to the initial ones, so large initial values signify a strong preference that those particular units interact, and it can take a long time for gradient descent to “correct” this. (One common scale heuristic is sketched just after this list.)
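
One widely used heuristic in this family is Glorot/Xavier uniform initialization, which keeps the weights zero-centered and shrinks their range as the layers get wider. A minimal sketch (the function name and the application to your 784-15-10 layer sizes are just for illustration):

```javascript
// Glorot/Xavier uniform: draw each weight from [-r, r] with r = sqrt(6 / (fanIn + fanOut)).
function glorotUniform(fanIn, fanOut) {
  const r = Math.sqrt(6 / (fanIn + fanOut));
  return Array.from({ length: fanOut }, () =>
    Array.from({ length: fanIn }, () => (Math.random() * 2 - 1) * r)
  );
}

// For a 784-15-10 network: small, zero-centered weights whose scale depends on layer widths.
const w1 = glorotUniform(784, 15); // 15 x 784 matrix, r ≈ 0.087
const w2 = glorotUniform(15, 10);  // 10 x 15 matrix, r ≈ 0.49
```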

They list other trade-offs relevant to the scale of the initialization. However, note that choosing only positive initial weights is itself another strong prior assumption. And since your inputs (pixel intensities) are all non-negative, the weighted sums can blow up as you propagate through the network, saturating the sigmoid units and leaving almost no gradient for backpropagation to work with.
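
You can see that last effect numerically (an assumed toy setup, not your code): with 784 non-negative inputs and all-positive weights, the pre-activation sum of a hidden unit is huge, the sigmoid saturates near 1, and its gradient factor is essentially zero.

```javascript
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// 784 pixel-like inputs in [0, 1).
const inputs = Array.from({ length: 784 }, () => Math.random());

const positiveW = Array.from({ length: 784 }, () => Math.random());       // [0, 1)
const centeredW = Array.from({ length: 784 }, () => Math.random() - 0.5); // [-0.5, 0.5)

const zPositive = inputs.reduce((s, x, i) => s + x * positiveW[i], 0); // ≈ 784 * 0.25 ≈ 196
const zCentered = inputs.reduce((s, x, i) => s + x * centeredW[i], 0); // much smaller in magnitude

// All-positive weights: activation ≈ 1, gradient factor sigmoid(z) * (1 - sigmoid(z)) ≈ 0.
console.log(sigmoid(zPositive), sigmoid(zPositive) * (1 - sigmoid(zPositive)));
// Zero-centered weights: z is typically far smaller, so the unit is much less likely to saturate.
console.log(sigmoid(zCentered), sigmoid(zCentered) * (1 - sigmoid(zCentered)));
```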