Solved – Data Augmentation Techniques for Cat/Binary/Continuous Numerical Dataset

categorical data, classification, data augmentation

I am using the bank marketing dataset from the UCI ML repo to build an example of a big data storage system along with ETL workflows and Machine Learning models. I would like to create more data so I can feed it to the storage solution pretending it is new "fresh" data from different time periods.

I know there are techniques for creating more data by adding noise while maintaining the same underlying structure. Could anybody suggest some that would apply to this dataset, and explain the motivation behind them?

I am dealing with just numeric features of type categorical, continuous and binary (no image or text data). I don't think this matters but in case it does, this is a binary classification problem.

Thanks for all your inputs!

Best Answer

When the variables are all continuous, I've seen this done by adding to each data point a vector of noise drawn from a normal distribution with mean 0. The difficulty comes with the categorical data, where you have to change some of the categorical values intelligently enough that the new data point is different but still realistic. One way to do this is to switch each categorical value to another category with a small probability. If that's not reasonable for your data, you could also just add random noise to the continuous variables and leave the rest alone.
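To make the idea concrete, here is a rough sketch of how it could look in plain Python. It is only an illustration under some assumptions of mine, not a reference implementation: the function name `augment_rows`, the parameters `noise_scale` and `switch_prob`, and the sample rows are all made up for the example. I also chose to scale the noise by each column's standard deviation and to treat binary columns as categorical columns with two levels, so "switching" a binary value just flips it.

```python
import random


def augment_rows(rows, continuous_cols, categorical_cols,
                 noise_scale=0.05, switch_prob=0.05, seed=None):
    """Return noisy copies of `rows` (a list of dicts); the input is not modified."""
    rng = random.Random(seed)

    # Per-column standard deviation, so the noise is proportional to each
    # feature's spread (an assumption; a fixed sigma would also work).
    stds = {}
    for col in continuous_cols:
        values = [row[col] for row in rows]
        mean = sum(values) / len(values)
        stds[col] = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

    # Categories observed in the data; replacements are drawn only from these.
    levels = {col: sorted({row[col] for row in rows}) for col in categorical_cols}

    augmented = []
    for row in rows:
        new_row = dict(row)
        for col in continuous_cols:
            # Zero-mean Gaussian noise added to the continuous value.
            new_row[col] = row[col] + rng.gauss(0.0, noise_scale * stds[col])
        for col in categorical_cols:
            # With small probability, switch to a different observed category.
            others = [v for v in levels[col] if v != row[col]]
            if others and rng.random() < switch_prob:
                new_row[col] = rng.choice(others)
        augmented.append(new_row)
    return augmented


# Hypothetical sample rows, invented for illustration only.
rows = [
    {"age": 30, "balance": 1200.0, "marital": "married", "housing": 1},
    {"age": 45, "balance": -50.0, "marital": "single", "housing": 0},
    {"age": 52, "balance": 300.0, "marital": "divorced", "housing": 1},
    {"age": 38, "balance": 2500.0, "marital": "married", "housing": 0},
]
new_rows = augment_rows(rows, ["age", "balance"], ["marital", "housing"], seed=0)
```

Setting `switch_prob=0` gives the fallback of perturbing only the continuous variables. Note that integer-valued columns such as age come back as floats, so you may want to round them afterwards, and that switching categories independently of the other columns can produce combinations that never occur in the real data.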