As we know, we can perform data augmentation on image datasets: we can apply random rotations, shifts, shears, and flips to the images.
Are there techniques to augment a small tabular dataset?
I know that there are sampling (oversampling, undersampling) methods like SMOTE. But oversampling generates synthetic data, which reduces the authenticity of the actual data, whereas in image augmentation we generate new data by simply transforming the original images, which does not produce synthetic data.
So, is there any technique or idea that can be used to augment small tabular datasets without generating synthetic data?
Best Answer
SMOTE has many variants. SMOTE should be treated as a conservative density estimate of the data: it assumes that the line segments between close neighbors of some class belong to the same class. Sampling from this rough, conservative density estimate absolutely makes sense, but it does not necessarily work; that depends on the distribution of the data.
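To make the line-segment assumption concrete, here is a minimal sketch of the interpolation idea in plain Python. It is not the reference SMOTE implementation and not the code of any particular package; the function name `smote_like_samples`, its parameters, and the toy minority points are hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Draw n_new points on segments between minority points and
    their k nearest minority-class neighbors (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the base of the segment.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbors among the other minority points
        # (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Interpolate at a random position t in [0, 1) along the segment.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.9]]
new_points = smote_like_samples(minority, n_new=10, k=3, seed=42)
```

Every generated point is a convex combination of two existing minority points, which is exactly why the approach can fail when the true class region is not well approximated by those segments.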
There are more advanced variants of SMOTE that carry out more rigorous density estimation. Let me recommend my own package, smote-variants, which implements 85 variants of SMOTE for binary oversampling (61 of which can also be used for multiclass oversampling), along with model selection functionality: https://github.com/gykovacs/smote_variants
You can also access a recent comparative study from the GitHub page, which clearly shows the benefits of oversampling in classification scenarios (Table 3): https://www.researchgate.net/publication/334732374_An_empirical_comparison_and_evaluation_of_minority_oversampling_techniques_on_a_large_number_of_imbalanced_datasets