How to Perform Data Augmentation and Train-Validate Split

classification, cross-validation, data augmentation, dataset, machine learning

I am doing image classification using machine learning.

Suppose I have some training data (images) that I will split into training and validation sets. I also want to augment the data (produce new images from the original ones) by random rotations and noise injection. The augmentation is done offline.

Which is the correct way to do data augmentation?

  1. First split the data into training and validation sets, then do data augmentation on both training and validation sets.

  2. First split the data into training and validation sets, then do data augmentation only on the training set.

  3. First do data augmentation on the data, then split the data into training and validation sets.

Best Answer

Option 2: first split the data into training and validation sets, then do data augmentation only on the training set.

You use your validation set to estimate how your method performs on real-world data, so it should contain only real-world data. Adding augmented data will not make the validation estimate more accurate. At best it tells you how well your method responds to the augmentation itself; at worst it distorts the validation results and makes them harder to interpret.
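As a rough illustration of the split-then-augment order, here is a minimal sketch using only the Python standard library. It assumes images are small nested lists of floats, and the names `split_then_augment` and `rotate90_with_noise` are hypothetical helpers invented for this example. For simplicity it rotates by random multiples of 90 degrees rather than arbitrary angles; a real pipeline would more likely use an array or image library.

```python
import random


def split_then_augment(images, labels, augment, val_fraction=0.2, copies=2, seed=0):
    """Split first, then augment only the training portion."""
    rng = random.Random(seed)
    indices = list(range(len(images)))
    rng.shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    # Validation set: original images only, never augmented.
    val = [(images[i], labels[i]) for i in val_idx]

    # Training set: each original plus augmented copies derived from it.
    train = []
    for i in train_idx:
        train.append((images[i], labels[i]))
        for _ in range(copies):
            train.append((augment(images[i], rng), labels[i]))
    return train, val


def rotate90_with_noise(image, rng, noise_std=0.05):
    """Illustrative augmentation: random 90-degree rotation plus Gaussian noise."""
    for _ in range(rng.randrange(4)):
        # Reverse the rows, then transpose: a 90-degree clockwise rotation.
        image = [list(row) for row in zip(*image[::-1])]
    return [[pixel + rng.gauss(0.0, noise_std) for pixel in row] for row in image]


# Toy data: ten 4x4 "images" with alternating labels.
images = [[[float(i)] * 4 for _ in range(4)] for i in range(10)]
labels = [i % 2 for i in range(10)]
train, val = split_then_augment(images, labels, rotate90_with_noise)
```

Because the split happens before any augmentation, no augmented copy of a validation image can end up in the training set. Option 3 (augment, then split) does not give that guarantee: variants of the same original image could land on both sides of the split.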
