Solved – How should I normalise the inputs to a neural network

Tags: neural-networks, normalization

My neural network can take inputs from all sorts of datasets. For example, with digit recognition on the MNIST dataset, there are 784 inputs (one per pixel of the 28×28 image) and each value is a grayscale intensity between 0 and 255. However, feeding these in raw produces math range errors with the sigmoid function, because very large negative values show up in later layers. So all values are divided by 255 to get decimals between 0 and 1. Is that correct?
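For concreteness, here is a minimal sketch of the divide-by-255 scaling I mean (the random array is just a stand-in for the actual MNIST pixel data):

```python
import numpy as np

# Stand-in for a batch of MNIST images: integers in [0, 255],
# flattened to 784 features per sample (28 x 28 pixels).
pixels = np.random.randint(0, 256, size=(64, 784)).astype(np.float32)

# Divide by the maximum possible pixel value so every input lies in [0, 1].
scaled = pixels / 255.0

print(scaled.min(), scaled.max())  # both within [0.0, 1.0]
```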

Does this mean that with other datasets that have large values, you can just divide them all by 10 or 100 to make them smaller? Does the choice of divisor matter? What is this process known as?

Best Answer

This process is known as "normalization" or "transformation" and is part of your feature engineering. Your problem applies to all machine learning algorithms, not just neural networks.

We usually prefer values in [0, 1] because they are easier to work with. There is no fixed rule on what to normalize or how, but you should ask yourself:

  • Are my variables on a comparable scale?
  • Does my machine learning require normalization?
  • Is my variable discrete, and should I transform it to a continuous one?

You certainly shouldn't divide your variables by arbitrary numbers just to make them smaller. In your image classification example, dividing by 255 works well because it maps the whole range into [0, 1]: no pixel can be less than 0 or greater than 255, so no scaled value falls outside that interval.
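When the range of a feature isn't known in advance, a common choice is min-max normalization, which rescales each feature to [0, 1] using its observed minimum and maximum. A minimal sketch (the function name and toy data below are only illustrative):

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Rescale each column (feature) of x to the range [0, 1]."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # Guard against constant columns, where max == min would divide by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span

# Toy dataset with two features on very different scales.
data = np.array([[1000.0, 0.2],
                 [5000.0, 0.8],
                 [3000.0, 0.5]])
print(min_max_normalize(data))
```

Note that the minimum and maximum should be computed on the training data only and then reused for validation and test data, so that all splits are scaled consistently.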