Solved – Classification problem on an imbalanced dataset

machine-learning, unbalanced-classes

I am working on a pattern identification/classification problem on an imbalanced dataset, with a target to non-target proportion in the population of approximately 1%:99%. There are around 0.5 million records in my dataset.

I am restricted to using SAS E-Miner for this analysis. Currently I am using the following approach (a rough sketch of the pipeline follows the list):

  1. Give appropriate decision weights (profit matrix).
  2. Undersample the majority class ("good" records).
  3. Run a decision tree on the sample.
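For concreteness, a minimal scikit-learn equivalent of this pipeline (the 1:99 class weights standing in for my profit matrix, and the subset sizing, are illustrative assumptions, not what E-Miner does internally):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def undersample_and_fit(X, y, majority_per_minority=1.0, random_state=0):
    """Undersample the majority class, then fit a decision tree whose
    class weights stand in for the profit matrix (1:99 is illustrative)."""
    rng = np.random.default_rng(random_state)
    pos = np.flatnonzero(y == 1)   # targets (~1%)
    neg = np.flatnonzero(y == 0)   # non-targets (~99%)
    # Keep only a random subset of the majority class
    keep = rng.choice(neg, size=int(len(pos) * majority_per_minority),
                      replace=False)
    idx = np.concatenate([pos, keep])
    return DecisionTreeClassifier(class_weight={0: 1, 1: 99}).fit(X[idx], y[idx])
```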

My questions are:

  1. How can I undo the effect of undersampling the majority class?
  2. Does giving appropriate decision weights actually help remove the bias introduced by undersampling, or are these really two independent things?
  3. Even if these decision weights are applied, how do we determine the optimal decision threshold to base our decisions on?

I have tried a boosting algorithm (without adjusting prior probabilities and without using decision weights), but the number of rules/patterns that get thrown up is around 20+, which is a mild concern to me.

Would appreciate any inputs from CV community folks.

Best Answer

Removing samples from the majority class may cause the classifier to miss important concepts/features pertaining to the majority class.

One strategy, called informed undersampling, has demonstrated good results: an unsupervised procedure draws several independent random samples from the majority class, and multiple classifiers are trained, each on the combination of one majority-class subset with the full minority-class data.
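A minimal sketch of that ensemble idea, assuming scikit-learn trees and majority subsets sized to match the minority class (both choices are my assumptions, not part of the original method):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def informed_undersampling_ensemble(X, y, n_subsets=10, random_state=0):
    """One classifier per independent random majority subset + all minority data."""
    rng = np.random.default_rng(random_state)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_subsets):
        # Independent random sample of the majority class,
        # here sized to match the minority class
        subset = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, subset])
        models.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))
    return models

def ensemble_score(models, X):
    # Average each model's minority-class probability
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```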

Another example of informed undersampling uses the K-nearest-neighbor (KNN) classifier to achieve undersampling. Of the four KNN-based methods, the most straightforward, NearMiss-3, selects a given number of the closest majority samples for each minority sample, guaranteeing that every minority sample is surrounded by some majority samples. However, another method, NearMiss-2, which selects the majority samples whose average distance to the three farthest minority samples is smallest, has proved the most competitive approach in imbalanced learning.
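If tooling outside E-Miner is an option, both variants are implemented in the imbalanced-learn package; a small sketch on synthetic data (the 1%/99% data is generated just to illustrate the API):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

# Synthetic data mimicking the 1%:99% imbalance in the question
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           n_features=10, random_state=0)
print("before:", Counter(y))

# version=2: keep the majority samples whose average distance to the
# farthest minority samples is smallest
nm2 = NearMiss(version=2, n_neighbors=3)
X_res, y_res = nm2.fit_resample(X, y)
print("after NearMiss-2:", Counter(y_res))
```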

The profit (cost) matrix can be considered a numerical representation of the penalty of classifying samples from one class as another. In decision trees:

(1) cost-sensitive adjustments can be applied to the decision threshold;

An ROC curve is used to plot the range of performance values as the decision threshold is moved from the point where total misclassifications of the majority class are maximally costly to the point where total misclassifications of the minority class are maximally costly. The most dominant point on the ROC curve corresponds to the final decision threshold. Read this paper for more details.
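One concrete way to pick that threshold is to sweep the ROC operating points and minimize expected cost; a sketch, where the 99:1 false-negative/false-positive costs are my assumed stand-in for the profit matrix:

```python
import numpy as np
from sklearn.metrics import roc_curve

def min_cost_threshold(y_true, y_score, cost_fp=1.0, cost_fn=99.0):
    """Pick the score threshold that minimizes expected misclassification cost."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # At each candidate threshold:
    # false positives = fpr * n_neg, false negatives = (1 - tpr) * n_pos
    cost = cost_fp * fpr * n_neg + cost_fn * (1 - tpr) * n_pos
    return thresholds[np.argmin(cost)]
```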

(2) cost-sensitive considerations can be given to the split criteria at each node;

This is achieved by weighting the impurity function with the misclassification costs and selecting, at each node, the split that most reduces the weighted impurity. This tutorial generalizes the effects of decision tree growth for any choice of split criteria.
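In scikit-learn terms (my analogy, not an E-Miner feature), this corresponds to the class_weight parameter, which reweights samples inside the Gini/entropy computation:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, weights=[0.99, 0.01], random_state=0)

# Weighting the rare class by 99 mirrors the 1%:99% prevalence in the
# question: splits that isolate minority samples now reduce the weighted
# Gini impurity enough to be selected over majority-pleasing splits.
tree = DecisionTreeClassifier(criterion="gini",
                              class_weight={0: 1, 1: 99},  # or "balanced"
                              max_depth=5).fit(X, y)
```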

(3) cost-sensitive pruning schemes can be applied to the tree.

Pruning improves generalization by removing leaves whose class probability estimates fall below a specified threshold. With imbalanced data, however, pruning tends to remove exactly the leaves that describe the minority class; the same tutorial describes applying Laplace smoothing to the leaf probability estimates, which reduces the chance that pruning removes minority-class leaves.
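The Laplace correction itself is small; a sketch of the leaf estimate (the function name is mine):

```python
def laplace_leaf_estimate(n_class, n_total, n_classes=2):
    """Laplace-smoothed class probability estimate at a leaf.

    The plain estimate n_class / n_total is extreme for the tiny leaves
    that capture the minority class; adding 1 per class pulls it toward
    1/n_classes, making those leaves less likely to fall under the
    pruning threshold.
    """
    return (n_class + 1) / (n_total + n_classes)

# A leaf holding 2 of 2 minority samples: plain estimate 1.0, smoothed 0.75
print(laplace_leaf_estimate(2, 2))
```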