Solved – What are some techniques to augment tabular data?

data augmentation, data transformation, dataset, sampling

As we know, we can perform data augmentation on image datasets by applying random rotations, shifts, shears and flips to the images.

Are there techniques to augment a small tabular dataset?
I know there are sampling methods (oversampling, undersampling) such as SMOTE. But oversampling generates synthetic data, which reduces the authenticity of the actual data. In image augmentation, by contrast, we generate new data simply by processing the original images, which does not produce synthetic data.

So, is there any technique or idea that can be used to augment small tabular datasets without generating synthetic data?

Best Answer

SMOTE has many variants. SMOTE is best treated as a rough density estimate of the data, built on the conservative assumption that the line segments between close neighbors of a class belong to that same class. Sampling from this rough, conservative density estimate makes sense, but it does not necessarily work; that depends on the distribution of the data.
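To make the line-segment assumption concrete, here is a minimal sketch of the interpolation step in plain Python. It is not the code of any particular package: the function name `smote_like_samples`, its parameters and the toy data are made up for illustration, and it uses only the standard library.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Generate n_new points on segments between minority-class neighbors.

    Hypothetical helper for illustration; not the API of any library.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base (Euclidean distance), excluding base
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        t = rng.random()  # position along the segment, in [0, 1)
        # New point lies on the segment from base to neighbor
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class with two features (made-up values)
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [5.0, 5.0]]
new_points = smote_like_samples(minority, n_new=5, k=2)
```

Every generated point lies between two existing minority samples, so whether it is a plausible member of the class depends on whether the region between those neighbors really belongs to the class.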

There are more advanced variants of SMOTE that carry out more careful density estimation. Let me recommend my own package, smote-variants, which implements 85 variants of SMOTE for binary oversampling (61 of which can also be used for multiclass oversampling) along with model selection functionality: https://github.com/gykovacs/smote_variants

You can also access a recent comparative study from the GitHub page, which clearly shows the benefits of oversampling in classification scenarios (Table 3): https://www.researchgate.net/publication/334732374_An_empirical_comparison_and_evaluation_of_minority_oversampling_techniques_on_a_large_number_of_imbalanced_datasets
