Solved – How should I standardize input when fine-tuning a CNN

conv-neural-network, neural-networks, transfer-learning

I am working on a model for binary classification of skin samples from https://www.isic-archive.com as either benign or malignant.

I want to use the VGG16 model pre-trained on ImageNet and fine-tune some layers to my dataset. The VGG16 paper explains the authors' preprocessing steps, which I understand are important to replicate if one wants to fine-tune their network:

"The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel."

  1. Why didn't they also divide by the standard deviation? I thought this kind of standardization (i.e. zero center and unit variance) was good practice.

More importantly, since I am fine-tuning the network to my own dataset, I wonder if I should

  2. Standardize the input relative to ImageNet and my dataset, only to ImageNet, or only to my dataset?

Best Answer

First of all, VGG-16 may not be the right architecture for you. It has been superseded by various architectures, of which the most commonly used in applications is ResNet. However, if you have very little data, transfer learning on VGG-16 may work better than on ResNet. Bottom line: try both and compare them on a validation set.

Coming to your point about standardization: it's true that Simonyan & Zisserman didn't standardize the RGB intensities, but it's not true that they applied no other preprocessing: they used substantial data augmentation. See section 3.1 of their paper. You would need to apply the same data augmentation described there to your training set.
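As a rough sketch (not the authors' code), the core of VGG-style training augmentation, random 224×224 crops from a rescaled image plus random horizontal flips, can be written with NumPy alone; the RGB colour shift from the paper is omitted here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def vgg_style_augment(image, crop=224):
    """Random crop + random horizontal flip, as in VGG training (sec. 3.1).

    `image` is an H x W x 3 uint8 array, already rescaled so that
    min(H, W) >= crop (the paper also samples the rescale size S).
    """
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:          # random horizontal flip
        patch = patch[:, ::-1]
    return patch

# Toy usage: a fake 256x256 "image"
img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
out = vgg_style_augment(img)
print(out.shape)  # (224, 224, 3)
```

In practice you would use your framework's built-in augmentation pipeline, but the logic is exactly this: the crops and flips are drawn fresh at every epoch, so the network rarely sees the same patch twice.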

If you choose to use a ResNet, you will want to read this paper:

Han et al., Classification of the Clinical Images for Benign and Malignant Cutaneous Tumors Using a Deep Learning Algorithm, 2018

The model they use is ResNet-152; you might try a smaller ResNet if fine-tuning this one proves to be too much of a challenge. ResNets are so ubiquitous that you can find implementations of this architecture in basically all frameworks, e.g.

https://github.com/tensorflow/models/tree/master/official/resnet

https://github.com/pytorch/vision/tree/master/torchvision

https://github.com/keras-team/keras-applications


Standardization for a ResNet model

The paper above is behind a paywall (though I'm sure the authors will send you a copy if you email them), so I cannot say for sure whether they used standardization in this specific application. Note, by the way, that they didn't just classify the skin lesion as benign or malignant: to the best of my understanding, they classified it into one of 12 different classes. In general, best practices for training ResNets do suggest performing standardization. For example, among the practices suggested in https://arxiv.org/pdf/1812.01187.pdf, we find:

  • Scale hue, saturation, and brightness with coefficients uniformly drawn from [0.6, 1.4]
  • Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375, respectively; these are the sample mean and sample standard deviation of each channel, computed on the training set of the ILSVRC2012 dataset (a subset of ImageNet with 1.3 million training images and 1000 classes).
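As an illustration, the per-channel normalization with those constants takes only a few lines of NumPy; note that these statistics are on the 0–255 pixel scale (some frameworks use the equivalent 0–1 scale, i.e. the same values divided by 255):

```python
import numpy as np

# Per-channel statistics of the ILSVRC2012 training set (0-255 scale)
IMAGENET_MEAN = np.array([123.68, 116.779, 103.939])
IMAGENET_STD = np.array([58.393, 57.12, 57.375])

def normalize(image):
    """Standardize an H x W x 3 RGB image with ImageNet statistics."""
    return (image.astype(np.float64) - IMAGENET_MEAN) / IMAGENET_STD

# Toy usage: a uniform mid-grey image
img = np.full((224, 224, 3), 128, dtype=np.uint8)
z = normalize(img)
print(z[0, 0])  # per-channel standardized values of a mid-grey pixel
```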

Of course, if you plan to compare the results from the VGG-16 and the ResNet-152 (or ResNet-50: another commonly used ResNet model, which is less data-hungry than ResNet-152), you need to use the same standardization for both.

Concerning your second question (standardize the input relative to ImageNet and your dataset, only to ImageNet, or only to your dataset): option 3 is a bad idea, because when you feed new data to your NN, those data must be standardized too (since you standardized the training set, the weights of the NN after training are the "right ones" for standardized inputs). To avoid test-set leakage, the usual practice is to standardize new data using the sample mean (and sample standard deviation, if you're using ResNet-style normalization) computed on the training set (ILSVRC2012, in this case, because you did most of the training on it). Now, suppose you get a new skin sample image: you have to normalize it before feeding it to your NN. If you normalize it using the sample mean and standard deviation of your new dataset, you'll be doing something completely different from what you did during training, so I wouldn't expect the NN to work very well. When would option 3 make sense? In two cases: either when you train your NN from scratch on your new dataset, or when you unfreeze a lot of layers and retrain them. However, transfer learning is usually performed by unfreezing only the top layer of the NN (the softmax layer).
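To make the point concrete, here is a toy NumPy comparison (illustrative numbers, not real data): normalizing a new image with its own statistics erases exactly the intensity shift that the network, fitted on training-set statistics, would otherwise see:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the training-set statistics (single channel, for simplicity)
train_mean, train_std = 120.0, 60.0

# A "new" image whose intensities differ from the training distribution
new_img = rng.normal(180.0, 20.0, size=(64, 64))

z_train_stats = (new_img - train_mean) / train_std        # what training expects
z_own_stats = (new_img - new_img.mean()) / new_img.std()  # option 3

print(z_train_stats.mean())  # ~1.0: the intensity shift stays visible
print(z_own_stats.mean())    # ~0.0: the shift has been erased
```

With option 3, two datasets with very different intensity distributions are mapped onto the same standardized range, so the network can no longer tell them apart the way it could during training.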

Choosing between options 1 and 2 depends on how large your dataset is with respect to ImageNet (or, to be precise, to the ILSVRC2012 dataset, the subset of ImageNet used to train the ResNet), and how "extreme" in terms of RGB values your images are compared with those of ILSVRC2012. I suggest you compute the sample mean and sample standard deviation for ILSVRC2012 alone, and for ILSVRC2012 plus your training set. If, as I expect, the difference is small, then just use the statistics computed on ILSVRC2012.
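Computing the per-channel statistics of your own training images, to compare against the published ILSVRC2012 values, is straightforward. A minimal sketch, assuming the images are already loaded as uint8 RGB arrays (the accumulation avoids stacking, so image sizes may differ):

```python
import numpy as np

def channel_stats(images):
    """Per-channel mean and std over an iterable of H x W x 3 RGB arrays."""
    n = 0
    s = np.zeros(3)
    sq = np.zeros(3)
    for img in images:
        x = img.reshape(-1, 3).astype(np.float64)
        n += x.shape[0]
        s += x.sum(axis=0)
        sq += (x ** 2).sum(axis=0)
    mean = s / n
    std = np.sqrt(sq / n - mean ** 2)
    return mean, std

# Toy usage with random "images"; with real data, load your training set here
rng = np.random.default_rng(2)
imgs = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(4)]
mean, std = channel_stats(imgs)
print(mean, std)  # compare with 123.68/116.779/103.939 and 58.393/57.12/57.375
```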


Standardization is not what you should really worry about

Finally, since you're doing transfer learning, performing proper data augmentation will matter much more than getting the normalization exactly right. For obvious reasons, skin sample images will be scarce, and the dataset will be unbalanced (i.e., you will likely have many more images of benign lesions than of malignant ones). Since your question didn't ask about data augmentation, I won't go into the topic, but I suggest you also read:

https://www.nature.com/articles/nature21056

https://academic.oup.com/annonc/article/29/8/1836/5004443

https://www.jmir.org/2018/10/e11936

https://arxiv.org/pdf/1812.02316.pdf
