Data Augmentation Techniques for General Datasets

data-augmentation · dataset · independence · machine-learning · predictive-models

In many machine learning applications, so-called data augmentation methods have made it possible to build better models. For example, assume a training set of $100$ images of cats and dogs. By rotating, mirroring, adjusting contrast, and so on, it is possible to generate additional images from the original ones.
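As a minimal sketch of the image case, simple geometric operations on an array-valued sample each produce a new, label-preserving training example (the image here is a hypothetical random array standing in for a real photo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "image": a small 2-D array standing in for one training example.
image = rng.random((4, 4))

# Simple geometric augmentations: each yields a new sample with the same label.
augmented = [
    np.fliplr(image),   # mirror horizontally
    np.flipud(image),   # mirror vertically
    np.rot90(image),    # rotate 90 degrees
]

print(len(augmented))  # three extra samples generated from one original
```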

In the case of images, data augmentation is relatively straightforward. However, suppose (for example) that one has a training set of $100$ samples and a few hundred continuous variables that represent different things. Data augmentation no longer seems so intuitive. What could be done in such a case?

Best Answer

I understand this question as involving both feature construction and dealing with the wealth of features you already have and will construct, relative to your number of observations (N << P).

Feature Construction

Expanding upon @yasin.yazici's comment, some possible ways to augment the data would be:

  • PCA
  • Auto-encoding
  • Transforms such as log, powers, etc.
  • Binning continuous variables into discrete categories (e.g., flagging whether a continuous variable is more than 1 SD above or below its mean)
  • Composite variables (for example, see here)

I'm sure there are many more I'm missing.
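A minimal sketch of the list above on a hypothetical tabular dataset (the sizes and the lognormal data are illustrative, not from the question): log transforms, 1-SD binning, and PCA scores are constructed from the original columns and stacked alongside them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 100 samples, 5 positive continuous features.
X = rng.lognormal(size=(100, 5))

# Log transform (safe here because the features are positive).
X_log = np.log(X)

# Binning: 0 = more than 1 SD below the mean, 1 = within 1 SD, 2 = above.
z = (X - X.mean(axis=0)) / X.std(axis=0)
X_bins = np.digitize(z, bins=[-1.0, 1.0])

# PCA via SVD on the centred data; keep the first two component scores.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T

# Stack the constructed features next to the originals: 5 + 5 + 5 + 2 columns.
X_aug = np.hstack([X, X_log, X_bins, X_pca])
print(X_aug.shape)  # (100, 17)
```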

Feature Selection / Dimensionality reduction

You may reduce dimensionality with techniques such as PCA (although perhaps not after augmenting your data with PCA variables). Alternatively, you may use algorithms that perform feature selection for you, such as lasso, random forest, etc.
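For the selection step, a hedged sketch of the lasso route using scikit-learn (the data are synthetic and the sizes illustrative): with an N << P design, the L1 penalty drives most coefficients exactly to zero, leaving a small selected subset.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Hypothetical N << P setting: 100 samples, 200 features, only 3 informative.
X = rng.standard_normal((100, 200))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.standard_normal(100)

# LassoCV chooses the regularisation strength by cross-validation;
# nonzero coefficients mark the selected features.
model = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(len(selected))  # a small subset of the 200 features survives
```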
