Balancing Methods – How to Handle Imbalanced Data Sets in Classification

classification, data mining, machine learning, r, unbalanced-classes

I'm trying to solve a classification problem from the UCI Machine Learning Repository. Unfortunately (or fortunately), I've noticed that my dataset is imbalanced. I've structured the data as 5 classes, according to the final mark each student reached, like so (a small R sketch of this binning follows the list):

  • If a student gets a mark from 0 to 7 => class 1 [FAIL(E)]
  • If a student gets a mark from 8 to 9 => class 2 [SUFFICIENT(D)]
  • If a student gets a mark from 10 to 11 => class 3 [GOOD(C)]
  • If a student gets a mark from 12 to 15 => class 4 [NOTABLE(B)]
  • If a student gets a mark from 16 to 19 => class 5 [OUTSTANDING(A)]
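
For reference, this is roughly how that binning could be done in R with cut(); the vector name marks is purely illustrative, not a column from the actual UCI data:

```r
# Illustrative binning of a 0-19 mark into the five grade classes.
marks <- c(3, 8, 10, 14, 17, 6, 12)   # made-up example marks

grade <- cut(marks,
             breaks = c(-Inf, 7, 9, 11, 15, 19),
             labels = c("FAIL(E)", "SUFFICIENT(D)", "GOOD(C)",
                        "NOTABLE(B)", "OUTSTANDING(A)"))

table(grade)   # quick look at the resulting class distribution
```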

My problem is that, as I said, the data are imbalanced, so I want to balance them.

I've thought about applying some kind of undersampling method, but my dataset has only 649 instances, so I think removing some of them is not the best idea. Then I thought about oversampling, replicating some of the minority-class examples so that the classes become balanced, but I'm still unsure whether that would work.
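
For what it's worth, the naive oversampling I have in mind would look roughly like this; df and the grade column are assumed names, not the real ones from the dataset:

```r
# Sketch of naive random oversampling: replicate rows of the smaller classes
# (sampling with replacement) until every class matches the largest one.
oversample <- function(df, class_col) {
  counts <- table(df[[class_col]])
  target <- max(counts)
  resampled <- lapply(names(counts), function(cl) {
    rows <- df[df[[class_col]] == cl, , drop = FALSE]
    rows[sample(nrow(rows), target, replace = TRUE), , drop = FALSE]
  })
  do.call(rbind, resampled)
}

# balanced_df <- oversample(student_df, "grade")
```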

I would be very grateful if you could give me a hand with this.
It's the first time I've faced a real problem with imbalanced data.

Best Answer

Since you're using R, you could make use of some more elaborate methods like ROSE and SMOTE. But I'm not entirely certain that re-balancing your dataset is the right solution in your case.
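
As a rough illustration of what using ROSE looks like: ROSE (and the classic SMOTE implementations) are geared to two-class problems, so this sketch assumes the five grades have been collapsed to a binary pass/fail factor; student_df, grade, and passed are illustrative names, not your actual columns.

```r
library(ROSE)

# Assumed binary recoding: FAIL(E) vs. everything else.
bin_df <- student_df[, setdiff(names(student_df), "grade")]
bin_df$passed <- factor(ifelse(student_df$grade == "FAIL(E)", "no", "yes"))

set.seed(1)
balanced <- ovun.sample(passed ~ ., data = bin_df,
                        method = "both",        # mix over- and under-sampling
                        N = nrow(bin_df),       # keep the original sample size
                        p = 0.5)$data           # aim for roughly 50/50 classes

table(balanced$passed)
```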

An alternative could be a cost-sensitive algorithm such as C5.0, which doesn't need balanced data. You could also think about applying Markov chains to your problem.
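
A hypothetical sketch of the cost-sensitive route with the C50 package follows; student_df and grade are again assumed names, and the cost values are arbitrary, only meant to show the mechanism (check ?C5.0 for the exact row/column orientation of the cost matrix).

```r
library(C50)

lev <- levels(student_df$grade)

# Unit cost for every kind of error, zero on the diagonal (correct predictions).
costs <- matrix(1, nrow = length(lev), ncol = length(lev),
                dimnames = list(lev, lev))
costs[, "OUTSTANDING(A)"] <- 5   # arbitrary: make errors involving a rare class costlier
diag(costs) <- 0

fit <- C5.0(grade ~ ., data = student_df, costs = costs)
summary(fit)
```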