Solved – SMOTE in unbalanced dataset with binary features

Tags: oversampling, r, sampling, smote, unbalanced-classes

After reading several posts about unbalanced datasets, I'm still not clear about my specific problem, so I'm posting a new question.

In my case, I have a dataset with around 20K rows and 40 features. I'm trying to do binary classification, but the minority class makes up only 7% of the instances. I read about using different sampling methods to deal with this problem. Among those, I tried SMOTE using the "unbalanced" R package, but I have doubts about whether it handles my data properly. Of those 40 features, only one is numeric (age); all the others are binary (yes/no for given diseases). As far as I know, SMOTE is designed for continuous data, since it computes Euclidean distances between neighbors.

Does anyone know whether it is correct to apply this technique to my dataset with binary features? And if it isn't, how could I handle this problem?

Thank you so much in advance.

Best Answer

Unless the age feature is very important, SMOTE will not amount to much more than random oversampling with replacement in this case, assuming you are forcing the binary attributes to be exactly 0 or 1.

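This is because SMOTE builds each synthetic example by interpolating between a minority example and one of its neighbors. Once the interpolated values are rounded to 0 or 1, the binary part of the synthetic example will necessarily be equal to that of one of the two original examples used in its creation (whichever one the random interpolation weight is closer to).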
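To make this concrete, here is a minimal Python sketch of the interpolation step, not the code of the "unbalanced" package. It assumes a single random weight shared by all features of a synthetic example and rounding of the binary columns afterwards; the function name, the example rows, and the column layout (age first, then four disease flags) are made up for illustration.

```python
import random


def smote_like_sample(x, neighbor, binary_idx, rng):
    """Interpolate between x and neighbor, then round the binary features.

    Assumption: one interpolation weight ("gap") is shared by all features.
    """
    gap = rng.random()  # random weight in [0, 1)
    synthetic = [a + gap * (b - a) for a, b in zip(x, neighbor)]
    for i in binary_idx:
        # Force binary attributes back to exactly 0 or 1.
        synthetic[i] = float(round(synthetic[i]))
    return synthetic


# Hypothetical minority examples: [age, disease_1, ..., disease_4]
x = [63.0, 1.0, 0.0, 1.0, 0.0]
neighbor = [58.0, 1.0, 1.0, 0.0, 0.0]
binary_idx = [1, 2, 3, 4]

rng = random.Random(0)
samples = [smote_like_sample(x, neighbor, binary_idx, rng) for _ in range(1000)]


def binary_part(row):
    """Return only the binary columns of a row."""
    return [row[i] for i in binary_idx]


# Check whether every synthetic example copies the binary pattern of a parent.
parents = [binary_part(x), binary_part(neighbor)]
all_copies = all(binary_part(s) in parents for s in samples)
print(all_copies)
```

Only the age column takes genuinely new values (somewhere between the two parents' ages), which is why the result is close to plain oversampling with replacement unless age matters a lot.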

The proper solution to your problem depends on what the problem really is.

If your problem is relative class imbalance, i.e. you are worried that the classifier will give too much weight to either false positives or false negatives because of the relative weight of the classes in your dataset, then you can look into cost-sensitive learning (ideal if you can determine the costs of different types of mistakes) or random sampling methods. I'm sure there's a synthetic oversampling method out there designed for binary data as well, but I wouldn't count on it making a huge difference.
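As a rough illustration of those two options, here is a small Python sketch using only the standard library. The helper names are hypothetical, the toy data simply mimics a 7% minority class, and the threshold formula assumes that correct predictions cost nothing and that your classifier outputs reasonably calibrated probabilities.

```python
import random
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def cost_threshold(cost_fp, cost_fn):
    """Predict the positive class when P(positive) exceeds this value.

    Assumes correct predictions have zero cost.
    """
    return cost_fp / (cost_fp + cost_fn)


# Toy data: 7 minority rows (label 1) and 93 majority rows (label 0).
labels = [1] * 7 + [0] * 93
rows = [[i] for i in range(len(labels))]

rng = random.Random(42)
new_rows, new_labels = random_oversample(rows, labels, rng)
weights = balanced_class_weights(labels)
threshold = cost_threshold(cost_fp=1.0, cost_fn=9.0)

print(Counter(new_labels))
print(weights)
print(threshold)
```

Class weights and a cost-based threshold leave the data untouched, so they are usually the easier thing to try first; oversampling is the fallback when the learner accepts neither.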

However, if what you are worried about is the dearth of minority class data, i.e. you believe that you don't have a representative sample of that class (for example, you might be having trouble classifying very rare cases when they only occur once in your dataset), then finding more data of that class is really the only option that works. See http://tjo018.inha.ac.kr/Achievements/Research/Journals/Journal2004_02.pdf for more details on this particular problem.