MATLAB: Learning classification with most training samples in one category

boosting, classifier, learning algorithm, machine learning

My question is not MATLAB specific but more theoretical.
I'm currently using boosting to build a two-class classifier, and my weak classifiers are trees. While I have a fairly large number of training examples in both classes, most of them belong to a single class. My intuition is that this imbalance in the number of examples per class would skew the resulting classifier away from a "fair" one, towards one that favors the class with more examples.
Am I right? What are the accepted ways to cope with this issue?
Thanks in advance!

Best Answer

The answer depends on how you define a "fair" classifier. If the ultimate goal of your analysis is to minimize the overall classification error, and if the class proportions in the training set are representative of the real world, you already get an optimal classifier from your imbalanced data. If the class proportions in the training set are not what you would normally expect, or if you want to assign different misclassification costs to the majority and minority classes, you need to adjust your learning method accordingly.
In general, there are 4 ways of dealing with skewed data:
1. Adjusting class prior probabilities to reflect realistic proportions.
2. Adjusting misclassification costs to represent realistic penalties.
3. Oversampling the minority class.
4. Undersampling the majority class.
For binary classification, strategies 1 and 2 are equivalent.
If you use fitensemble or TreeBagger, the easiest thing would be to set 'prior' to 'uniform' for an equal class mix, or to whatever proportions you like.
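For example, a minimal sketch with fitensemble, assuming X is an N-by-P feature matrix and Y a two-class label vector (variable names and the number of learners are illustrative, not from the question):

    % Boosted trees with equal class priors so the majority class is not favored.
    ens = fitensemble(X, Y, 'AdaBoostM1', 200, 'Tree', ...
                      'Prior', 'uniform');    % strategy 1: equal priors

    % Equivalent idea via misclassification costs (strategy 2). The 10x penalty
    % on minority-class errors is an assumed value, not a recommendation; rows
    % of the cost matrix follow the ensemble's class order.
    % ens = fitensemble(X, Y, 'AdaBoostM1', 200, 'Tree', 'Cost', [0 1; 10 0]);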
If you would rather oversample or undersample, nothing is available out of the box in official MATLAB. It wouldn't be too hard to code, though.
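As a rough illustration, here is a minimal sketch of random oversampling and undersampling for a two-class problem (X, Y and all variable names are assumptions, not part of the original question):

    grp  = grp2idx(Y);                        % class indices 1 and 2
    idxA = find(grp == 1);
    idxB = find(grp == 2);
    if numel(idxA) < numel(idxB)
        minIdx = idxA;  majIdx = idxB;        % minority / majority indices
    else
        minIdx = idxB;  majIdx = idxA;
    end

    % Oversampling: duplicate random minority examples (with replacement)
    % until both classes are the same size.
    extra  = minIdx(randi(numel(minIdx), numel(majIdx) - numel(minIdx), 1));
    Xover  = X([majIdx; minIdx; extra], :);
    Yover  = Y([majIdx; minIdx; extra]);

    % Undersampling: keep a random subset of the majority class.
    keep   = majIdx(randperm(numel(majIdx), numel(minIdx)));
    Xunder = X([minIdx; keep], :);
    Yunder = Y([minIdx; keep]);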
For undersampling the majority class, I have personally had good experience with RUSBoost:
Seiffert, C., Khoshgoftaar, T., Van Hulse, J., and Napolitano, A. (2008) RUSBoost: Improving classification performance when training data is skewed, in International Conference on Pattern Recognition, pp. 1–4.
For oversampling the minority class, a popular method is SMOTE. You might want to look into its boosting extension.
Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, W. (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. (2003) SMOTEBoost: Improving prediction of the minority class in boosting, in 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), Lecture Notes in Computer Science, vol. 2838, Springer-Verlag, pp. 107–119.
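For reference, here is a bare-bones sketch of the SMOTE idea (an assumption-laden illustration, not the authors' implementation; Xmin, nSynth and k are assumed inputs): each synthetic sample is a random point on the segment between a minority example and one of its k nearest minority-class neighbors.

    function Xsyn = smoteSketch(Xmin, nSynth, k)
    % Xmin   - minority-class feature matrix (rows = examples)
    % nSynth - number of synthetic examples to create
    % k      - number of nearest neighbors to interpolate towards
    nbrs = knnsearch(Xmin, Xmin, 'K', k + 1); % first neighbor is the point itself
    nbrs = nbrs(:, 2:end);
    Xsyn = zeros(nSynth, size(Xmin, 2));
    for i = 1:nSynth
        p         = randi(size(Xmin, 1));     % pick a random minority example
        q         = nbrs(p, randi(k));        % pick one of its k nearest neighbors
        lam       = rand;                     % interpolation weight in [0, 1]
        Xsyn(i,:) = Xmin(p,:) + lam*(Xmin(q,:) - Xmin(p,:));
    end
    end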