Data Augmentation – Should it be Applied on Training Set Only?

data augmentation, deep learning, machine learning, regularization

Is it common practice to apply data augmentation to training set only, or to both training and test sets?

Best Answer

In terms of the concept of augmentation, i.e. making the data set bigger for some reason, we'd tend to augment only the training set. We'd then evaluate the effect of different augmentation approaches on a validation set.

However, as @Łukasz Grad points out, we might need to apply a similar procedure to the test set as was applied to the training set. This is typically so that the input data from the test set resembles that of the training set as closely as possible. For example, @Łukasz Grad gives the example of image cropping, where we'd need to crop the test images too, so that they are the same size as the training images. In the case of the training images, though, we might use each training image multiple times, with crops at different locations/offsets. At test time we'd likely either do a single centred crop, or do several random crops and average the resulting predictions.
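To make the cropping example concrete, here is a minimal NumPy sketch. The function names (`random_crop`, `center_crop`, `predict_with_random_crops`), the 32x32 image, the 24-pixel crop size and `toy_model` are all made up for illustration and are not taken from any particular library.

```python
import numpy as np


def random_crop(image, size, rng):
    """Training-time crop: a size x size window at a random offset."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)   # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]


def center_crop(image, size):
    """Test-time crop: a single, deterministic centred window."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]


def predict_with_random_crops(model, image, size, n_crops, rng):
    """Alternative test-time option: average predictions over random crops."""
    preds = [model(random_crop(image, size, rng)) for _ in range(n_crops)]
    return np.mean(preds, axis=0)


def toy_model(x):
    # Hypothetical stand-in for a trained network: per-channel mean.
    return x.mean(axis=(0, 1))


rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)

train_view = random_crop(image, 24, rng)   # differs from call to call
test_view = center_crop(image, 24)         # always the same window
averaged = predict_with_random_crops(toy_model, image, 24, 5, rng)
```

Both paths produce inputs of the same shape, which is the whole point of cropping at test time. Only the training path adds randomness in order to create extra examples.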

Running the augmentation procedure against test data is not meant to make the test set bigger or more accurate, but just to make the input data from the test set resemble the input data from the training set, so that we can feed it into the same net (e.g. same dimensions). We'd never consider the test set to be 'better' in some way because an augmentation procedure was applied to it. At least, that's not something I've ever seen.

On the other hand, for the training set, the point of the augmentation is to reduce overfitting during training. We then evaluate the quality of the augmentation by running the trained model against our more-or-less fixed test/validation set.
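As a rough sketch of that workflow: split first, then augment only the training portion and leave the validation portion untouched. The helper `augment_with_flips`, the random data and the 80/20 split are assumptions for illustration. Horizontal flipping is just one possible augmentation, and it is only appropriate when a flip does not change the label.

```python
import numpy as np


def augment_with_flips(images, labels):
    """Return the originals plus horizontally flipped copies.

    Expects images of shape (N, H, W, C); the labels are duplicated
    because a flip is assumed not to change the class.
    """
    flipped = images[:, :, ::-1]  # reverse the width axis
    return np.concatenate([images, flipped]), np.concatenate([labels, labels])


rng = np.random.default_rng(0)
images = rng.random((100, 8, 8, 3))      # made-up data
labels = rng.integers(0, 2, size=100)

# Split BEFORE augmenting, so no flipped copy of a validation image
# can end up in the training set.
train_x, val_x = images[:80], images[80:]
train_y, val_y = labels[:80], labels[80:]

# Only the training split grows; the validation split stays fixed.
aug_train_x, aug_train_y = augment_with_flips(train_x, train_y)
```

Different augmentation schemes can then be compared by training on each augmented training set and scoring every resulting model on the same `val_x`, `val_y`.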
