Neural-Networks – Tackling Poor Performance with ReLU Activation Function on MNIST Dataset

deep-learning, keras, neural-networks, python

I'm quite new to neural networks, and I'm currently trying to train a non-convolutional neural network on the MNIST dataset. I'm observing some behaviour I don't quite understand.

This is the code, written with Keras:

import keras
from keras.datasets import fashion_mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.callbacks import EarlyStopping

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# one-hot encode the 10 class labels
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# stop training once the validation loss hasn't improved for 3 epochs
early_stopper = EarlyStopping(patience=3)

# fully connected network: flatten the 28x28 images, two ReLU hidden layers, softmax output
model = Sequential()
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(128, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(x_train, y_train, epochs=20, validation_data=(x_test, y_test), callbacks=[early_stopper])

This gives me a validation accuracy of around 20%. The funny thing is that when I change the activation functions to "sigmoid", the loss function to "mean_squared_error", and the optimizer to "sgd", performance improves to around 85% after 50 epochs.
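Concretely, the better-performing variant is (roughly) the same model with those three settings swapped in. This is only a sketch; I'm not sure the choice of output activation matters, so it just uses sigmoid throughout:

# roughly the same model, but with sigmoid activations, MSE loss and plain SGD
model = Sequential()
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(128, activation="sigmoid"))
model.add(Dense(128, activation="sigmoid"))
model.add(Dense(10, activation="sigmoid"))
model.compile(loss="mean_squared_error", optimizer="sgd", metrics=["accuracy"])

model.fit(x_train, y_train, epochs=50, validation_data=(x_test, y_test))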

Having read http://neuralnetworksanddeeplearning.com/, I wonder what the reason is for the bad performance of the network presented in the code above. ReLU, cross-entropy, and an adaptive optimizer like Adam all seem to improve on a very vanilla neural network with stochastic gradient descent, mean squared error as the loss, and sigmoid activation functions. Yet I get really bad performance, and if I increase the number of nodes in the hidden layers I often get a network that doesn't learn at all.

EDIT:
I figured out that it has something to do with my not normalizing the input to values between 0 and 1 … but why is this a problem?
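To be concrete, by "normalizing" I mean something along these lines, applied before model.fit:

# rescale pixel intensities from [0, 255] to [0, 1] before training
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0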

Best Answer

We can find a reasonable explanation for this behavior in the Neural Network FAQ. TL;DR - rescaling is really important for NNs because, in combination with the choice of initialization, it can avoid saturation.

But standardizing input variables can have far more important effects on initialization of the weights than simply avoiding saturation. Assume we have an MLP with one hidden layer applied to a classification problem and are therefore interested in the hyperplanes defined by each hidden unit. Each hyperplane is the locus of points where the net-input to the hidden unit is zero and is thus the classification boundary generated by that hidden unit considered in isolation. The connection weights from the inputs to a hidden unit determine the orientation of the hyperplane. The bias determines the distance of the hyperplane from the origin. If the bias terms are all small random numbers, then all the hyperplanes will pass close to the origin. Hence, if the data are not centered at the origin, the hyperplane may fail to pass through the data cloud. If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely. With such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to $[-1,1]$ will work better than $[0,1]$, although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers.
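In practical terms for the code above, that suggests centering the pixel values rather than only squashing them into $[0,1]$. A minimal sketch, assuming the raw uint8 pixel range of 0–255:

# rescale raw pixel values from [0, 255] to [-1, 1], as suggested in the FAQ
x_train = x_train.astype("float32") / 127.5 - 1.0
x_test = x_test.astype("float32") / 127.5 - 1.0

Any zero-centered scaling (for example, subtracting the training-set mean and dividing by its standard deviation) serves the same purpose: the initial hyperplanes, which all pass near the origin, then actually cut through the data cloud.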