Solved – Shallower network outperformed a deeper one in accuracy

conv-neural-network, deep-learning, machine-learning, neural-networks

While training different Convolutional Neural Network (CNN) architectures for a binary classification task (part of my undergraduate research; statistical methods give only 56% accuracy, so I gave CNNs a try), I ran into something very strange: a relatively shallow network outperformed a deeper one!

The "shallow" network that I am talking about is the following:

32conv3x3 - mp2x2 - 32conv3x3 - mp2x2 - 64conv5x5 - mp2x2 - fc32 - fc32 - o

where conv denotes a convolutional layer, mp a max-pooling layer, fc a fully-connected layer, and o a one-neuron output (0 is one class, 1 is the other).
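The layer string above can be sketched in Keras roughly as follows. This is a hypothetical reconstruction, not the asker's actual code: the input size (64x64 grayscale), padding, and Dropout placement are assumptions, while the layer sizes, ReLU, RMSProp, and Dropout come from the post.

```python
# Sketch of the "shallow" network described in the question.
# Assumed: 64x64 grayscale input, "same" padding, Dropout after the
# first fc layer; conv/fc sizes follow the post's layer string.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),  # 32conv3x3
    layers.MaxPooling2D(2),                                   # mp2x2
    layers.Conv2D(32, 3, activation="relu", padding="same"),  # 32conv3x3
    layers.MaxPooling2D(2),                                   # mp2x2
    layers.Conv2D(64, 5, activation="relu", padding="same"),  # 64conv5x5
    layers.MaxPooling2D(2),                                   # mp2x2
    layers.Flatten(),
    layers.Dense(32, activation="relu"),                      # fc32
    layers.Dropout(0.5),                                      # regularization
    layers.Dense(32, activation="relu"),                      # fc32
    layers.Dense(1, activation="sigmoid"),                    # o: one neuron
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

A sigmoid output with binary cross-entropy is the standard choice for a single-neuron 0/1 classifier like the one described.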

The network above reached 66% accuracy on my task, while the "deeper" ones like VGG-11 and VGG-13 (with the number of filters/neurons per layer reduced; sorry, I can't go deeper or wider on my GPU-less Core i7 laptop) seem to get stuck at 55% accuracy after as few as 50 epochs (the same accuracy achieved by the statistical methods). I trained them for 300 epochs a few times.

Of course, I made sure that the training samples from the two classes are balanced (so the network is not biased/overfitted towards one class). I am using ReLU as the activation function and RMSProp as the optimizer, with Dropout for regularization.

My guesses are:

  • The learning rate for the deeper networks is so small that they get stuck in a local minimum (the loss surface of a "deeper" architecture is smoother and has very vast "valleys" of local minima). But isn't Dropout intended to help with such problems?

  • 300 epochs is too few for such a deep network, and it needs to be trained longer to converge. But again, the objects I am classifying are images of cell nuclei, so I don't think they need super deep networks with super long training times (like those used to classify 1000+ different object categories such as cats, dogs, donkeys, etc.).

  • Maybe I need to take a closer look at the data for any corruption. But again, I have checked it for any obvious biases, and normalized the nuclei images (after filtering out the background using computer vision methods and setting background pixels to 0) to the range [0, 255].
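For reference, the masking-and-normalization step in the last bullet could look like the sketch below. The function name, mask input, and the choice to rescale to [0, 1] rather than [0, 255] are my assumptions (many networks train more stably on [0, 1] inputs); the background-to-zero masking follows the post.

```python
import numpy as np

def preprocess(img, mask):
    """Hypothetical preprocessing sketch.

    img:  uint8 nucleus image in [0, 255]
    mask: boolean array, True for foreground (nucleus) pixels
    """
    out = img.astype(np.float32)
    out[~mask] = 0.0        # zero out background, as in the question
    out /= 255.0            # assumption: rescale to [0, 1] for training
    return out
```

Whether inputs are scaled to [0, 1] or left in [0, 255] can noticeably change how sensitive training is to the learning rate, which ties into the first bullet.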

Can anyone with more expertise in this field give me a hint about what is really going on? Any guess (not necessarily a sure answer) is appreciated! Thank you in advance!

Best Answer

This is a well-known phenomenon. A good discussion can be found in the paper Deep Residual Learning for Image Recognition, especially Figure 1. The short summary is that when a neural network is too deep for a given problem, the extra layers tend to try to recreate the identity. The first portion of the network has already found an effective set of weights that optimize the objective, so the latter portion is essentially adding noise; it therefore attempts to learn an identity function, which is hard because you are building an identity out of a nonlinear set of activations. As an analogy, it's like approximating a line with polynomials of degree > 1: you get a wavy mess. The paper above proposes ResNet, a deep neural network with skip connections over activations, which significantly improves the quality of deeper networks.
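The skip-connection idea can be sketched in a few lines with the Keras functional API. This is a minimal illustration, not the actual ResNet implementation: the input shape and filter count are assumptions, and it covers only the simple case where channel counts match so the shortcut can be added directly (ResNet uses a 1x1 convolution on the shortcut otherwise).

```python
# Minimal residual block: output = relu(F(x) + x). The shortcut lets
# a layer pair learn F(x) = 0 (easy) instead of the identity (hard).
from tensorflow.keras import Input, Model, layers

def residual_block(x, filters=32):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])   # the skip connection
    return layers.Activation("relu")(y)

inp = Input(shape=(64, 64, 32))       # assumed shape; channels == filters
out = residual_block(inp)
model = Model(inp, out)
```

Because the block only has to learn the residual F(x) rather than the full mapping, stacking many such blocks degrades far less than stacking plain convolutional layers.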
