Machine Learning – How to Subtract Mean on Train/Valid/Test Set

cross-validation, data preprocessing, machine learning

I'm doing data preprocessing and am going to build a ConvNet on my data afterwards.

My question is:
Say I have a data set of 100 images in total. I computed the mean of each of the 100 images and subtracted it from that image, then split the result into training and validation sets, and I apply the same steps to a given test set. However, according to this link, that does not seem to be the correct way to do it: http://cs231n.github.io/neural-networks-2/#datapre

"Common pitfall. An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test)."

My guess is that the author is saying: do not compute the mean and subtract it within each image; instead, compute the mean over the whole image set (i.e. (image1 + … + image100)/100) and subtract that mean from each image.

I don't quite understand this. Can anyone explain it, and possibly also explain why what I was doing is wrong (if it is indeed wrong)?

Best Answer

Let's assume you have 100 images in total; 90 are training data and 10 are test data.

The author correctly asserts that using the whole 100-image sample to compute the sample mean $\hat{\mu}$ is wrong, because in that case you would have information leakage: information from your "out-of-sample" elements would be moved into your training set. In particular, for the estimation of $\hat{\mu}$, if you use 100 instead of 90 images you allow your training set to have a more informed mean than it should have. As a result, your training error could be lower than it should be, and the error measured on the held-out images is no longer an honest out-of-sample estimate.
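To make the leakage concrete, here is a small worked identity. The indexing is an assumption for illustration only: $x_1, \dots, x_{90}$ denote the training images and $x_{91}, \dots, x_{100}$ the test images.

$$\hat{\mu}_{\text{train}} = \frac{1}{90}\sum_{i=1}^{90} x_i, \qquad \hat{\mu}_{\text{test}} = \frac{1}{10}\sum_{i=91}^{100} x_i, \qquad \hat{\mu}_{\text{all}} = \frac{1}{100}\sum_{i=1}^{100} x_i = \frac{9}{10}\,\hat{\mu}_{\text{train}} + \frac{1}{10}\,\hat{\mu}_{\text{test}}$$

The last equality shows that a mean computed over all 100 images always carries a $\frac{1}{10}\,\hat{\mu}_{\text{test}}$ contribution from the test images, which is exactly the information that should stay out of the training procedure.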

The estimated $\hat{\mu}$ is common throughout the training/validation/testing procedure: the same $\hat{\mu}$ is to be used to centre all your data. (I mention this last because I have the slight impression that you use the mean of each separate image to centre that image.)
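A minimal NumPy sketch of this procedure, not a definitive implementation. The random array stands in for real images, and the 32x32 RGB shape, the variable names, and the choice of a per-pixel mean image (rather than, say, a per-channel mean) are all assumptions made for illustration; a validation split would be centred with the same `mu_hat` in the same way.

```python
import numpy as np

# Hypothetical stand-in data: 100 RGB images of 32x32 pixels.
rng = np.random.default_rng(0)
images = rng.random((100, 32, 32, 3)).astype(np.float32)

# Split FIRST, so the test images never influence the statistics.
train, test = images[:90], images[90:]

# Estimate the mean image from the training images only.
mu_hat = train.mean(axis=0)  # shape (32, 32, 3)

# Subtract the SAME training mean from every split.
train_centred = train - mu_hat
test_centred = test - mu_hat
```

With this setup the centred training set has (numerically) zero mean per pixel, while the centred test set generally does not. That is expected: the test images are shifted by the training statistic rather than by their own.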
