Deep Learning – Why Normalize Images by Subtracting Dataset’s Image Mean Instead of Current Image Mean?

deep learning, image processing

There are some variations on how to normalize the images but most seem to use these two methods:

  1. Subtract the mean per channel calculated over all images (e.g. VGG_ILSVRC_16_layers)
  2. Subtract the mean per pixel/channel calculated over all images (e.g. CNN_S, also see Caffe's reference network)
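
A minimal NumPy sketch of the two dataset-level schemes above; the array shapes and variable names are illustrative assumptions, not taken from the VGG or Caffe code:

```python
import numpy as np

# Hypothetical training set: N images, H x W pixels, 3 channels.
images = np.random.rand(100, 224, 224, 3).astype(np.float32)

# 1. Mean per channel over all images and pixel positions -> shape (3,)
channel_mean = images.mean(axis=(0, 1, 2))
normalized_per_channel = images - channel_mean   # broadcasts over N, H, W

# 2. Mean per pixel/channel over all images -> shape (224, 224, 3)
pixel_mean = images.mean(axis=0)
normalized_per_pixel = images - pixel_mean       # broadcasts over N
```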

The natural approach, in my mind, would be to normalize each image. An image taken in broad daylight will cause more neurons to fire than a night-time image, and while that may tell us something about the time of day, we usually care about the more interesting features present in the edges, etc.

Pierre Sermanet refers in section 3.3.3 to local contrast normalization, which would be per-image based, but I haven't come across this in any of the examples/tutorials that I've seen. I've also seen an interesting Quora question and Xiu-Shen Wei's post, but they don't seem to support the two approaches above.
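
For reference, a rough per-image sketch in the spirit of that local contrast normalization scheme (a subtractive step followed by a divisive step over a Gaussian window); the sigma value and the epsilon are illustrative choices, not taken from the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(img, sigma=4.0, eps=1e-8):
    """img: 2-D grayscale array. Returns a locally normalized copy."""
    img = img.astype(np.float32)
    # Subtractive step: remove the Gaussian-weighted local mean.
    centered = img - gaussian_filter(img, sigma)
    # Divisive step: divide by the Gaussian-weighted local standard deviation,
    # floored at its image-wide mean so flat regions are not amplified.
    local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
    return centered / np.maximum(local_std, local_std.mean() + eps)
```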

What exactly am I missing? Is this a color normalization issue, or is there a paper that actually explains why so many use this approach?

Best Answer

Subtracting the dataset mean serves to "center" the data. Additionally, you would ideally also divide by the standard deviation of that feature or pixel if you want to normalize each feature value to a z-score.
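
A minimal sketch of that centering plus scaling to z-scores, computed per channel over the whole training set (the array names and shapes are assumptions for illustration):

```python
import numpy as np

train_images = np.random.rand(100, 224, 224, 3).astype(np.float32)

mean = train_images.mean(axis=(0, 1, 2))        # per-channel mean
std = train_images.std(axis=(0, 1, 2)) + 1e-8   # per-channel stddev

train_normalized = (train_images - mean) / std

# The same training-set statistics would be reused at test time.
test_image = np.random.rand(224, 224, 3).astype(np.float32)
test_normalized = (test_image - mean) / std
```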

The reason we do both of those things is that in the process of training our network, we're going to be multiplying (weights) and adding to (biases) these initial inputs in order to cause activations that we then backpropagate with the gradients to train the model.

In this process we'd like each feature to have a similar range so that our gradients don't go out of control (and so that we only need one global learning-rate multiplier).

Another way to think about it is that deep networks traditionally share many parameters. If you didn't scale your inputs in a way that resulted in similarly ranged feature values (i.e., over the whole dataset, by subtracting the mean), sharing wouldn't happen as easily, because to one part of the image a weight w would be large while to another it would be too small.

You will see in some CNN models that per-image whitening is used, which is more along the lines of your thinking.
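
A per-image standardization sketch along those lines (this is what per-image whitening usually amounts to in practice, e.g. TensorFlow's per-image standardization op); the lower bound on the standard deviation is an illustrative safeguard:

```python
import numpy as np

def per_image_standardize(img):
    """Standardize a single image to zero mean and unit variance."""
    img = img.astype(np.float32)
    # Floor the stddev so a nearly constant image doesn't blow up.
    adjusted_std = max(img.std(), 1.0 / np.sqrt(img.size))
    return (img - img.mean()) / adjusted_std
```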
