Solved – Lack of Batch Normalization Before Last Fully Connected Layer

batch-normalization, conv-neural-network, machine-learning, neural-networks

In most neural networks that I've seen, especially CNNs, a commonality has been the absence of batch normalization just before the last fully connected layer. So usually there's a final pooling layer, which connects directly to a fully connected layer, and then to an output layer of categories or regression values. I can't find it now, but I remember seeing a vague reference concluding that batch normalization before the last FC layer didn't make much of a difference. If this is true, why is that?
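For concreteness, here is a minimal PyTorch sketch (my own, with arbitrary layer sizes and names, purely illustrative) of the kind of classifier head I mean: a final pooling layer feeding a fully connected layer and then the output layer, with no batch norm anywhere in between.

```python
import torch
import torch.nn as nn

# Illustrative CNN head: global pooling -> FC -> output layer,
# with no batch normalization in the head (sizes are arbitrary).
class PlainHead(nn.Module):
    def __init__(self, in_channels=512, hidden=256, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # final pooling layer
        self.fc = nn.Linear(in_channels, hidden)  # fully connected layer
        self.out = nn.Linear(hidden, num_classes) # output layer

    def forward(self, x):
        x = self.pool(x).flatten(1)   # (N, C, H, W) -> (N, C)
        x = torch.relu(self.fc(x))
        return self.out(x)

# Usage example with a dummy feature map:
# head = PlainHead(); logits = head(torch.randn(8, 512, 7, 7))
```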

In practice, it seems like the last FC layer tends to have around 10% of its neurons dead for any given input (although I haven't measured neuron contiguity). This proportion tends to grow considerably when you increase the size of the FC layer, especially when starting from pre-trained models.
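For what it's worth, the rough way I'd estimate that "dead" fraction (a hypothetical measurement of my own, not taken from any paper) is to count the units whose post-ReLU activation is exactly zero for a given input, averaged over a batch:

```python
import torch

def dead_fraction_per_input(activations: torch.Tensor) -> float:
    # activations: (batch, num_units) post-ReLU outputs of the FC layer.
    # Fraction of units that are exactly zero for each input,
    # averaged over the batch.
    return (activations == 0).float().mean().item()

# Random stand-in for real post-ReLU activations (roughly half will be zero);
# in practice you would pass the recorded activations of the last FC layer.
acts = torch.relu(torch.randn(64, 256))
print(f"zero units per input: {dead_fraction_per_input(acts):.1%}")
```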

Best Answer

I am pretty sure that batch norm before the last FC layer not only fails to help, but actually hurts performance pretty severely.

My intuition is that the network has to learn a representation that is mostly invariant to the stochasticity inherent in batch norm: the batch statistics depend on which other examples happen to be in the mini-batch, so the normalized activations for a given input vary from batch to batch. At the same time, by the time it reaches the last layer, it has to convert that representation back into a fairly precise prediction. It's likely that a single FC layer is not powerful enough to perform that conversion.

Another way to say it is that batch norm (like dropout) adds stochasticity to the network, and the network learns to be robust to this stochasticity. However, the network has no way to cope with stochasticity injected right before the output, because there are no further layers left to absorb it.
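To make the variant under discussion concrete, here is a sketch (again my own, with arbitrary sizes) of the same kind of head with a BatchNorm1d inserted right before the output layer; the claim above is that training this tends to do worse than the plain head from the question, because the batch-dependent noise enters with no layers left to absorb it.

```python
import torch
import torch.nn as nn

# The variant being discussed: batch norm placed just before the final
# (output) fully connected layer. Sizes are arbitrary, for illustration only.
class BNBeforeOutputHead(nn.Module):
    def __init__(self, in_channels=512, hidden=256, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, hidden)
        self.bn = nn.BatchNorm1d(hidden)           # batch norm right before the output layer
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = self.pool(x).flatten(1)
        x = torch.relu(self.fc(x))
        x = self.bn(x)   # in training mode, batch statistics inject noise here
        return self.out(x)

# Usage example (training mode requires batch size > 1 for BatchNorm1d):
# head = BNBeforeOutputHead(); logits = head(torch.randn(8, 512, 7, 7))
```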