Solved – Imbalanced dataset binary classification

binary data, classification, machine learning, unbalanced-classes

I am new to ML and data science, and as an assignment I have a binary classification dataset with a 9:1 class imbalance. Could you please guide me on how to approach it? Also, which classifier is best for imbalanced binary classification?

Regards.

Best Answer

You got off on the wrong foot by conceptualizing this as a classification problem. The fact that $Y$ is binary has nothing to do with trying to make classifications. And when the balance of $Y$ is far from 1:1 you need to think about modeling tendencies for $Y$, not modeling $Y$. In other words, the appropriate task is to estimate $P(Y=1 | X)$ using a model such as the binary logistic regression model. The logistic model is a direct probability estimator. Details may be found here and here.
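As a rough illustration of what estimating $P(Y=1 | X)$ looks like, the sketch below fits a one-predictor logistic model by gradient ascent using only the Python standard library. The synthetic data, the true coefficients (-2.5 and 1.0), and all function names are assumptions made up for this example; for real work you would use an established fitting routine rather than this loop.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_data(n=500, seed=0):
    # Hypothetical imbalanced data: positives are roughly 1 in 10.
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = sigmoid(-2.5 + 1.0 * x)  # assumed "true" model
        xs.append(x)
        ys.append(1 if rng.random() < p else 0)
    return xs, ys


def fit_logistic(xs, ys, lr=0.5, n_iter=1500):
    # Batch gradient ascent on the mean log-likelihood of
    # P(Y=1 | x) = sigmoid(b0 + b1 * x).
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            resid = y - sigmoid(b0 + b1 * x)
            g0 += resid
            g1 += resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


xs, ys = make_data()
b0, b1 = fit_logistic(xs, ys)
probs = [sigmoid(b0 + b1 * x) for x in xs]
print("observed event rate:", sum(ys) / len(ys))
print("mean predicted probability:", sum(probs) / len(probs))
```

Note that no rebalancing is done: the model is fitted to the data as they are, and the output is a probability for each observation rather than a class label.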

Once you have a validated probability model and a utility/cost/loss function you can generate optimum decisions. The probabilities help to trade off the consequences of wrong decisions.
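A minimal sketch of how a probability and a cost function combine into a decision is given below. It assumes that correct decisions cost nothing and that `cost_fp` and `cost_fn` (hypothetical names) are the costs of a false positive and a false negative. Under those assumptions, acting has the lower expected cost whenever the predicted probability exceeds `cost_fp / (cost_fp + cost_fn)`.

```python
def expected_costs(p, cost_fp, cost_fn):
    # p is the predicted probability that Y = 1.
    # Acting (predicting 1) is wrong with probability 1 - p.
    # Not acting (predicting 0) is wrong with probability p.
    return {"act": (1.0 - p) * cost_fp, "do_not_act": p * cost_fn}


def optimal_threshold(cost_fp, cost_fn):
    # Solve (1 - p) * cost_fp = p * cost_fn for p.
    return cost_fp / (cost_fp + cost_fn)


def decide(p, cost_fp, cost_fn):
    # Pick the action with the smaller expected cost.
    costs = expected_costs(p, cost_fp, cost_fn)
    return "act" if costs["act"] < costs["do_not_act"] else "do_not_act"


# Example: a missed positive is assumed to be 9 times as costly as a false alarm.
print(optimal_threshold(cost_fp=1.0, cost_fn=9.0))
print(decide(0.20, cost_fp=1.0, cost_fn=9.0))
print(decide(0.05, cost_fp=1.0, cost_fn=9.0))
```

The threshold comes from the costs, not from the class balance, which is why the probability model and the decision rule can be kept separate.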

Related Solutions

Solved – Imbalanced data classification using boosting algorithms

If you have R2012b or later, use the RUSBoost algorithm. It is recommended for imbalanced datasets.

If you go with GentleBoost, you need to optimize the tree complexity and the number of trees in the ensemble. (You could also play with the learning rate.) Both parameters are likely far off their optimal values in your code.

First, fitensemble for GentleBoost by default produces decision stumps (trees with two leaves). Since the minority class is only 8% of the data, stumps are not sensitive to observations of the minority class. I often set the minimal leaf size to one half of the size of the minority class. The optimal setting for the leaf size may not be exactly that but should be in that ballpark. Do:

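% Tree template with a minimum leaf size (some_number is a placeholder you choose)
tmp = ClassificationTree.template('minleaf',some_number);
% Boost Ntrees trees with uniform class priors
ens = fitensemble(Xtrain,Ytrain,'GentleBoost',Ntrees,tmp,'prior','uniform');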

Second, 10 trees are usually not enough. Inspect the ensemble accuracy by cross-validation or on an independent test set to decide how many trees are needed. Typically, a few hundred should be used for boosting.

Also, after you train the ensemble, don't just look at the classification error. Use the perfcurve function to compute a performance curve and find the optimal threshold on the classification score. It is up to you to define what "optimal" means. You can assign, for instance, different misclassification costs to the two classes and find the threshold minimizing the expected cost. .....
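The idea of choosing a threshold that minimizes cost can be sketched independently of MATLAB. The Python below is not perfcurve; it is a small stand-in that scans every observed score as a candidate threshold and returns the one with the lowest total misclassification cost. The scores, labels, and cost values are made up for illustration.

```python
def best_threshold(scores, labels, cost_fp=1.0, cost_fn=1.0):
    # Predict 1 when score >= threshold. Try each observed score,
    # plus +inf (which predicts 0 for everything), and keep the
    # first threshold that attains the lowest total cost.
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Hypothetical held-out classification scores and true labels.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

print(best_threshold(scores, labels, cost_fp=1.0, cost_fn=1.0))  # (0.7, 1.0)
print(best_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0))  # (0.3, 2.0)
```

Raising the cost of a missed positive moves the chosen threshold down, so more observations are flagged. The scores used here should come from cross-validation or a test set, not from the training data.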

Solved – Binary classification in imbalanced data

A few general strategies:

First and foremost, in imbalanced classification problems you want to do stratified cross-validation. This keeps the class distribution in every fold the same as in the full sample.
Second, you should probably use Cohen's Kappa metric when tuning your models. It is better in imbalanced scenarios because it corrects for the agreement expected by random chance. A more detailed description was provided in the answer to this question.
If you are adventurous, you can look into cost-sensitive learning (CSL). In these methods you essentially tell the algorithm that some errors are worse than others. For example, missing a person who has cancer is much worse than raising a false alarm for a healthy person. There are many methods, including sampling (over, under, SMOTE, SMOTEBoost and EasyEnsemble as referenced in this prior question regarding imbalanced datasets and CSL), weighting, thresholding, and ensemble methods. These are mostly algorithm-agnostic methods; there are also algorithms with CSL built in, but I think this is enough to get you started.
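The first two strategies above can be sketched in plain Python. The function names are hypothetical, and the kappa function returns 0.0 when chance agreement is exactly 1, which is a convention chosen for this example. In practice a library implementation would normally be used.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    # Group indices by class, shuffle within each class, then deal
    # them round-robin so each fold gets a near-equal share of each class.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


def cohen_kappa(y_true, y_pred):
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    n = len(y_true)
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pe = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if pe == 1.0:
        return 0.0  # assumed convention: no agreement beyond chance is possible
    return (po - pe) / (1.0 - pe)


# A 9:1 dataset: every fold keeps the 9:1 ratio.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)
print([sum(labels[i] for i in fold) for fold in folds])  # positives per fold

# Always predicting the majority class is 90% accurate but has kappa 0.
print(cohen_kappa(labels, [0] * 100))
```

The last line shows why accuracy alone misleads here: a model that never predicts the minority class scores 0.9 on accuracy and 0 on kappa.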
