Check that each batch contains different samples; it looks like you are always feeding the same samples to the network...
Another thing to understand is that you are using a pretrained model, which means that lots of patterns are already learnt. If your data fit those patterns, it is possible that your problem is already solved by the model.
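A quick way to sanity-check the first point is to pull a few batches and compare them. A minimal sketch (the toy generators below are stand-ins for your real data loader):

```python
import numpy as np

def all_batches_identical(batch_iter, n_batches=5):
    """Return True if the first n_batches yield byte-identical inputs,
    which would indicate the loader keeps feeding the same samples."""
    first = None
    for _, (x, _y) in zip(range(n_batches), batch_iter):
        x = np.asarray(x)
        if first is None:
            first = x.copy()
        elif x.shape != first.shape or not np.array_equal(x, first):
            return False
    return True

# Toy loaders standing in for your real pipeline (hypothetical):
stuck = ((np.ones((4, 8)), np.zeros(4)) for _ in range(10))
varied = ((np.random.rand(4, 8), np.zeros(4)) for _ in range(10))
print(all_batches_identical(stuck))   # → True: every batch is the same
print(all_batches_identical(varied))  # → False: batches differ
```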
Transfer learning is when a model developed for one task is reused to work on a second task. Fine-tuning is one approach to transfer learning, where you change the model's output to fit the new task and train only that output part.
In Transfer Learning or Domain Adaptation, we train the model with one dataset. Then we train the same model with another dataset that has a different distribution of classes, or even with classes different from those in the first training dataset.
In Fine-tuning, one approach to Transfer Learning, we have a dataset and use, let's say, 90% of it for training. Then we train the same model with the remaining 10%. Usually, we change the learning rate to a smaller one, so it does not have a significant impact on the already-adjusted weights. You can also take a base model that works on a similar task and freeze some of its layers to keep the old knowledge when performing the new training session with the new data. The output layer can also be different, and part of it can be frozen during training.
In my experience, learning from scratch leads to better results, but it is much more costly than the other approaches, especially in terms of time and resource consumption.
Using Transfer Learning, you should freeze some layers, mainly the pre-trained ones, train only the added ones, and decrease the learning rate so the weights are adjusted without losing their meaning for the network. If you increase the learning rate, you will normally face poor results due to the big steps in the gradient-descent optimisation. This can lead to a state where the neural network cannot find the global minimum, only a local one.
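In PyTorch, this recipe (freeze the pre-trained layers, train only the added ones, small learning rate) can be sketched like so; the tiny network and layer sizes below are illustrative stand-ins for a real pretrained backbone:

```python
import torch
import torch.nn as nn

# Tiny stand-in for a pretrained backbone; in practice you would keep
# e.g. the convolutional part of a pretrained VGG-16 here (the sizes
# below are illustrative assumptions).
base = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Linear(16, 2)   # new output layer for the new task

# Freeze the pre-trained layers so their weights keep their "meaning".
for p in base.parameters():
    p.requires_grad = False

model = nn.Sequential(base, head)

# Train only the unfrozen head, with a small learning rate to take
# small steps in the gradient-descent optimisation.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)   # → 34 (only the head: 16*2 weights + 2 biases)
```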
Using a pre-trained model from a similar task usually gives great results when we use Fine-tuning. However, if you do not have enough data in the new dataset, or if your hyperparameters are not well chosen, you can get unsatisfactory results. Machine learning always depends on its dataset and the network's parameters. In that case, you should use only the "standard" Transfer Learning.
So we need to evaluate the trade-off between resource and time consumption and the accuracy we desire, in order to choose the best approach.
Best Answer
First of all, VGG-16 may not be the right architecture for you. It has been superseded by various architectures, of which the most commonly used in applications is ResNet. However, if you have very little data, transfer learning on VGG-16 may be more efficient than on ResNet. Bottom line: use both and compare them on the validation set.
Coming to your point about standardization: it's true that Simonyan & Zisserman didn't standardize the RGB intensities, but it's false that they didn't apply any other preprocessing: they applied significant data augmentation. See section 3.1 of their paper. You would need to apply the same data augmentation described there to your training set.
If you choose to use a ResNet, you'll want to read this paper:
Han et al., Classification of the Clinical Images for Benign and Malignant Cutaneous Tumors Using a Deep Learning Algorithm, 2018
The model they use is ResNet-152: you might try a smaller ResNet if transfer learning on this one proves to be too much of a challenge. ResNets are so ubiquitous that you can find implementations of this architecture in basically all frameworks, e.g.
https://github.com/tensorflow/models/tree/master/official/resnet
https://github.com/pytorch/vision/tree/master/torchvision
https://github.com/keras-team/keras-applications
Standardization for a ResNet model
The paper above is behind a paywall (but I'm sure the authors will send you a copy if you send them an email), so I cannot say for sure whether they used standardization in this specific application. By the way, note that they didn't just classify the skin lesion as benign or malignant: to the best of my understanding, they classified it into one of 12 different classes. In general, best practices for training ResNets do suggest performing standardization. For example, among the BPs suggested in https://arxiv.org/pdf/1812.01187.pdf, we find:
Of course, if you plan to compare the results from the VGG-16 and the ResNet-152 (or ResNet-50: another commonly used ResNet model, which is less data-hungry than ResNet-152), you need to use the same standardization for both.
Concerning your second question (standardize the input relative to ImageNet and your dataset, only to ImageNet, or only to your dataset), option 3 is crazy, because when you feed new data to your NN, these data must be standardized too (since you standardized the training set, the weights of the NN after training are the "right ones" for standardized inputs). Now, to avoid test-set leakage, the usual practice is to standardize new data using the sample mean (and sample standard deviation, if you're using ResNet-style normalization) computed on the training set (ILSVRC2012, in this case, because you did most of the training on it). Now, suppose you get a new skin sample image: you have to normalize it before feeding it to your NN. If you normalize it using the sample mean & standard deviation of your new dataset, you'll be doing something completely different from what you did during training, so I wouldn't expect the NN to work very well. When would option 3 make sense? In two cases: either when you train your NN from scratch on your new dataset, or when you unfreeze a lot of layers and retrain them. However, transfer learning is usually performed by unfreezing only the top layer of the NN (the softmax layer).
Choosing between options 1 and 2 depends on how large your dataset is with respect to ImageNet (or, to be precise, the ILSVRC2012 dataset, the subset of ImageNet which has been used to train the ResNet), and how "extreme" in terms of RGB values your images are with respect to those of ILSVRC2012. I suggest you compute the sample mean and sample standard deviation for ILSVRC2012, and for ILSVRC2012 + your training set. If, as I suppose, the difference is small, then just use the statistics computed on ILSVRC2012.
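As an illustration, option 2 (standardize with the ILSVRC2012 statistics) could look like this; the per-channel values below are the ones shipped with torchvision for RGB images scaled to [0, 1], so treat them as an assumption if your pipeline differs:

```python
import numpy as np

# Per-channel ILSVRC2012 statistics commonly used for ResNet-style
# preprocessing (RGB images scaled to [0, 1]); these are the values
# shipped with torchvision, quoted here as an assumption.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def standardize(img):
    """Standardize an HxWx3 image in [0, 1] using the training-set
    (ILSVRC2012) statistics; the same transform must be applied to
    every new image fed to the network."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD

z = standardize(np.random.rand(224, 224, 3))
print(z.shape)   # → (224, 224, 3)
```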
Standardization is not what you should really worry about
Finally, since you're doing transfer learning, it will be much more important to perform proper data augmentation than to concentrate on the proper normalization. For obvious reasons, skin sample images will be scarce, and the dataset will be unbalanced (i.e., you will likely have many more images of benign lesions than of malignant lesions). Since your question didn't ask about data augmentation, I won't touch the topic, but I suggest you read also:
https://www.nature.com/articles/nature21056
https://academic.oup.com/annonc/article/29/8/1836/5004443
https://www.jmir.org/2018/10/e11936
https://arxiv.org/pdf/1812.02316.pdf