Balancing Methods – How to Handle Imbalanced Data Sets in Classification

classification, data mining, machine learning, r, unbalanced-classes

I'm trying to solve a classification problem from the UCI Machine Learning Repository. Unfortunately (or fortunately), I've noticed that my dataset is imbalanced. I've structured the data as 5 classes, according to the final mark each student reached, like so (a small R sketch of this binning follows the list):

  • If a student gets a mark from 0 to 7 => class 1 [FAIL(E)]
  • If a student gets a mark from 8 to 9 => class 2 [SUFFICIENT(D)]
  • If a student gets a mark from 10 to 11 => class 3 [GOOD(C)]
  • If a student gets a mark from 12 to 15 => class 4 [NOTABLE(B)]
  • If a student gets a mark from 16 to 19 => class 5 [OUTSTANDING(A)]
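
For reference, this is roughly how that binning could be done in R with cut(); the vector name marks is purely illustrative, not a column from the actual UCI data:

```r
# Illustrative binning of a 0-19 mark into the five grade classes.
marks <- c(3, 8, 10, 14, 17, 6, 12)   # made-up example marks

grade <- cut(marks,
             breaks = c(-Inf, 7, 9, 11, 15, 19),
             labels = c("FAIL(E)", "SUFFICIENT(D)", "GOOD(C)",
                        "NOTABLE(B)", "OUTSTANDING(A)"))

table(grade)   # quick look at the resulting class distribution
```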

My problem is that, as I said, the data are imbalanced, so I want to balance them.

I've thought about applying some kind of undersampling method, but my dataset has only 649 instances, so I think removing some of them is not the best idea. Then I thought about oversampling, replicating some of the minority-class examples so that the classes become balanced, but I'm still unsure whether that would work.
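
For what it's worth, the naive oversampling I have in mind would look roughly like this; df and the grade column are assumed names, not the real ones from the dataset:

```r
# Sketch of naive random oversampling: replicate rows of the smaller classes
# (sampling with replacement) until every class matches the largest one.
oversample <- function(df, class_col) {
  counts <- table(df[[class_col]])
  target <- max(counts)
  resampled <- lapply(names(counts), function(cl) {
    rows <- df[df[[class_col]] == cl, , drop = FALSE]
    rows[sample(nrow(rows), target, replace = TRUE), , drop = FALSE]
  })
  do.call(rbind, resampled)
}

# balanced_df <- oversample(student_df, "grade")
```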

I would be very grateful if you could give me a hand with this.
It's the first time I've faced a real problem with imbalanced data.

Best Answer

Since you're using R, you could make use of some more elaborate methods like ROSE and SMOTE. But I'm not entirely certain that re-balancing your dataset is the right solution in your case.
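
As a rough illustration of what using ROSE looks like: ROSE (and the classic SMOTE implementations) are geared to two-class problems, so this sketch assumes the five grades have been collapsed to a binary pass/fail factor; student_df, grade, and passed are illustrative names, not your actual columns.

```r
library(ROSE)

# Assumed binary recoding: FAIL(E) vs. everything else.
bin_df <- student_df[, setdiff(names(student_df), "grade")]
bin_df$passed <- factor(ifelse(student_df$grade == "FAIL(E)", "no", "yes"))

set.seed(1)
balanced <- ovun.sample(passed ~ ., data = bin_df,
                        method = "both",        # mix over- and under-sampling
                        N = nrow(bin_df),       # keep the original sample size
                        p = 0.5)$data           # aim for roughly 50/50 classes

table(balanced$passed)
```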

An alternative could be a cost-sensitive algorithm such as C5.0, which doesn't need balanced data. You could also think about applying Markov chains to your problem.
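
A hypothetical sketch of the cost-sensitive route with the C50 package follows; student_df and grade are again assumed names, and the cost values are arbitrary, only meant to show the mechanism (check ?C5.0 for the exact row/column orientation of the cost matrix).

```r
library(C50)

lev <- levels(student_df$grade)

# Unit cost for every kind of error, zero on the diagonal (correct predictions).
costs <- matrix(1, nrow = length(lev), ncol = length(lev),
                dimnames = list(lev, lev))
costs[, "OUTSTANDING(A)"] <- 5   # arbitrary: make errors involving a rare class costlier
diag(costs) <- 0

fit <- C5.0(grade ~ ., data = student_df, costs = costs)
summary(fit)
```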