Solved – Why is it hard for a neural network to learn the identity function

keras, machine learning, neural networks

I wanted to see whether a neural network could learn the identity function using the MNIST handwritten digit dataset.

Here is the full code:

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

batch_size = 128
epochs = 20
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# flatten each 28x28 image to a 784-element vector (pixel values stay in 0-255)
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)

# four 784-unit ReLU layers, trained to reproduce their own input
model = Sequential()
model.add(Dense(784, activation='relu', input_shape=(784,)))
model.add(Dense(784, activation='relu'))
model.add(Dense(784, activation='relu'))
model.add(Dense(784, activation='relu'))
model.summary()
model.compile(loss='mean_squared_error',
              optimizer=RMSprop(),
              metrics=['mean_absolute_percentage_error'])

history = model.fit(x_train, x_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, x_test))
score = model.evaluate(x_test, x_test, verbose=0)
print('Test loss:', score[0])
print('Test MAPE:', score[1])

and the output:

**4 dense layers**
Epoch 20/20
60000/60000 [==============================] - 50s 840us/step - loss: 456.7581 - mean_absolute_percentage_error: 351097677.7045 - val_loss: 523.7151 - val_mean_absolute_percentage_error: 504905991.0656
Test loss: 523.7150838867187
Test MAPE: 504905988.5056

What I can't quite get my head around is why training cannot find the perfect solution to this problem, and why it takes so long even to get close. Even with one dense layer the exact solution cannot be found:

**1 dense layer**
Epoch 20/20
60000/60000 [==============================] - 16s 268us/step - loss: 180.6187 - mean_absolute_percentage_error: 209296481.2373 - val_loss: 167.9543 - val_mean_absolute_percentage_error: 192590419.9936
Test loss: 167.954341796875
Test MAPE: 192590420.1984

Conceptually I can see that there is a whole space of solutions (not just the exact identity function), since there are likely some pixels that have the same value in every image (the 0's around the edge, for example) and could be swapped in the training set at no loss. Knowing that training has landed in a local minimum, can I learn anything from this to guide me out, rather than playing with hyperparameters until I find something better?

Best Answer

For a single example, the network takes a 784-element vector as its input. So rephrasing the problem in the OP's post, they wish to learn the function

$$ f(x) = Ix $$

where $I$ is the $784\times 784$ identity matrix.

Perfect fit is impossible with this model

The 1-layer network probably has an easier time because instead of attempting to "line up" four weight matrices through four nonlinearities, it only has to line up one, i.e. it is easier to find an approximation in $W_1, b_1$ for

$$ Ix = g(W_1 x+b_1). $$

But even the simple expression $Ix = g(W_1 x+b_1)$ should be an obvious warning that attempting to find a perfect fit is a fool's errand, because it is trying to approximate a linear function with a nonlinear function. In particular, because of how ReLUs are defined, any $x<0$ is set to 0, so this model will never achieve 0 error when any elements of $x$ are negative.
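To make that concrete, here is a minimal sketch (plain numpy, assuming the single layer's weights are the identity and its biases are zero, which is the best a single dense layer could hope for):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Best case for one layer: W = I, b = 0, so g(Wx + b) reduces to relu(x).
x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(x))       # [0. 0. 0. 1. 3.] -- the negative entries are clipped away
print(relu(x) - x)   # nonzero error exactly where x < 0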

The universal approximation theorem (UAT) is an approximation theorem

Indeed, for any choice of nonlinear activation $g$, I can find an $x$ for which the error is positive. So then the interesting question becomes "Can we fit a model so that the error is at most $\epsilon$ for $x$ in some interval $\mathcal{I}$?" And this statement of the problem is more-or-less compatible with the caveats of the UAT. It also points us in a more profitable direction: instead of seeking 0 error, we wish to find minimal error when the inputs are in some interval.

In other words, theorems about neural networks don't guarantee that you can achieve 0 error, they guarantee that you can bound error for inputs in some interval (subject to some terms and conditions).
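Here is a small sketch of that restated problem in the same Keras style as the question. The interval $[-1, 1]$, the layer width and the training settings are arbitrary choices of mine, and unlike the question's model the output layer here is linear, which is the usual UAT setup:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Fit f(x) = x for scalar x drawn from the interval [-1, 1].
x = np.random.uniform(-1.0, 1.0, size=(10000, 1))

toy_model = Sequential()
toy_model.add(Dense(32, activation='relu', input_shape=(1,)))
toy_model.add(Dense(1))   # linear output layer, as in the usual UAT statement
toy_model.compile(loss='mean_squared_error', optimizer='adam')
toy_model.fit(x, x, epochs=20, batch_size=128, verbose=0)

# The error is typically small on the interval we trained over, but gradient
# descent will not land on an exactly-zero-error solution.
grid = np.linspace(-1.0, 1.0, 201).reshape(-1, 1)
print('max |error| on [-1, 1]:', np.abs(toy_model.predict(grid) - grid).max())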

The UAT doesn't comment on whether it's easy to train any particular network.

Actually finding the weights and biases which achieve the minimum error is a very challenging problem. In particular, we don't have much reason to believe that the choice of initialization, optimizer, learning rate and number of epochs, etc. in this code snippet are best for this task.
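For example, nothing in the question's snippet was tuned. The sketch below (my own arbitrary choices, not a recommendation) shows the kind of knobs that are available: an explicit weight initializer and a different optimizer.

from keras.models import Sequential
from keras.layers import Dense

# Same architecture as the question, but with an explicit initializer and a
# different optimizer; none of these choices is known to be "right" here.
model = Sequential()
model.add(Dense(784, activation='relu', kernel_initializer='he_normal',
                input_shape=(784,)))
model.add(Dense(784, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(784, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(784, activation='relu', kernel_initializer='he_normal'))
model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=['mean_absolute_percentage_error'])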

This optimization problem is hard

A four-layer network with ReLU activations $g(x)=\max\{0, x\}$ is given by

$$ h(x)=g(W_4g(W_3g(W_2g(W_1x+b_1)+b_2)+b_3)+b_4). $$

So what you seek in your question is solutions $W_i, b_i$ such that $$ Ix = g(W_4g(W_3g(W_2g(W_1x+b_1)+b_2)+b_3)+b_4) $$ for all $x$, where the $W_i, b_i$ have appropriate shapes.
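Written out as code, that composition is just the following (a numpy sketch; the weights and biases here are arbitrary placeholders, not a solution):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def h(x, weights, biases):
    # g(W4 g(W3 g(W2 g(W1 x + b1) + b2) + b3) + b4)
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)
    return a

# Placeholder parameters: four 784x784 weight matrices and four bias vectors.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.01, size=(784, 784)) for _ in range(4)]
biases = [np.zeros(784) for _ in range(4)]
x = rng.random(784)
print(h(x, weights, biases).shape)   # (784,)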

This doesn't look particularly friendly to try and solve. Indeed, in light of my remarks about the UAT, we will have to restate this to bound the error and focus on an interval of inputs.

Even if we restate the problem in this way, it is still challenging from the perspective of gradient descent because of the dying ReLU phenomenon, the weaknesses of gradient descent, and the poor conditioning of the optimization task due to the scale of the inputs.
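As one concrete example, the question's code feeds the raw pixel values, which lie in $[0, 255]$, straight into the network. A sketch of the kind of change that usually improves the conditioning (not something the original script does) is to cast and rescale the inputs before fitting:

# Rescale the raw 0-255 pixel values to [0, 1] before training; the rest of
# the question's script (model, compile, fit) can stay the same.
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0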

Tuning a neural network is the greater part of using neural networks.

If you don't want to spend a lot of time changing hyperparameters, then you should use a different model.
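For instance, if the only goal is to reproduce the input, a single linear layer (a sketch below, reusing the question's data) can represent the identity exactly ($W = I$, $b = 0$) and has no ReLU standing in the way; how quickly it trains still depends on scaling and the optimizer, but at least the exact solution is in its hypothesis space.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

# Same task, but a purely linear model: y = Wx + b.
linear_model = Sequential()
linear_model.add(Dense(784, activation='linear', input_shape=(784,)))
linear_model.compile(loss='mean_squared_error', optimizer=RMSprop())
linear_model.fit(x_train, x_train, batch_size=128, epochs=20, verbose=1,
                 validation_data=(x_test, x_test))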